need to break on allocating fab requests when using inject #1338

Closed
hppritcha opened this issue Apr 25, 2017 · 5 comments
@hppritcha
Member

We need to have a heuristic for throttling allocation of fab requests.

The following simple OpenSHMEM code will show the problem:

#include <shmem.h>

long long val = 0;

int main(int argc, char **argv) {
    int i;

    shmem_init();

    const int pe = shmem_my_pe();
    const int npes = shmem_n_pes();

    while (1) {
        shmem_longlong_add(&val, 1, (pe + 1) % npes);
    }

    shmem_finalize();
    return 0;
}

Run this and the process will get killed by OOM. The problem is artificial with the while(1) loop, but with any loop that has a sufficiently large iteration count you'll eventually get zapped by OOM as well. The test uses the inject path through the provider. For the inject path, we should definitely try to brake on the number of requests allocated, since the app is never going to turn around and read CQEs to recover them.

@bcernohous you may want to check this with your OpenSHMEM implementation. We observed this using Sandia OpenSHMEM (SOS).

hppritcha added the bug label Apr 25, 2017
@hppritcha
Member Author

The OpenSHMEM developer says the problem can be reproduced with 1 PE and takes about 2-3 minutes to hit OOM, depending on how slow your current OFI libfabric provider is and on your process memory limits.

@bcernohous

I assume a -EAGAIN when you run out of resources. I then complete all pending requests and retry/continue.

I also have an environment variable that lets me set an NBI block size, and I quiet after that many pending operations.

I brought this up in issues #1285 and #1199.
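
The caller-side approach described in this comment can be sketched with a toy model. This is not libfabric or OpenSHMEM code: `toy_provider`, `toy_inject`, `toy_quiet`, `issue_ops`, and `POOL_SIZE` are all hypothetical names, and the "provider" is just a counter that returns -EAGAIN once its fixed pool of requests is used up. The sketch only shows the control flow: drain everything on -EAGAIN and retry, and optionally quiet every `block` operations.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a provider with a fixed pool of requests. */
#define POOL_SIZE 8

struct toy_provider {
    int in_flight;   /* requests allocated and not yet reclaimed */
    long completed;  /* operations whose completions have been reaped */
};

/* Models an inject-style call: fails with -EAGAIN when the pool is empty. */
static int toy_inject(struct toy_provider *p)
{
    assert(p->in_flight <= POOL_SIZE);
    if (p->in_flight == POOL_SIZE)
        return -EAGAIN;
    p->in_flight++;
    return 0;
}

/* Models "complete all pending requests": reap every outstanding one. */
static void toy_quiet(struct toy_provider *p)
{
    p->completed += p->in_flight;
    p->in_flight = 0;
}

/*
 * Issue n operations. On -EAGAIN, drain and retry the same operation.
 * If block > 0, also force a quiet after every `block` operations.
 * Returns the number of quiets performed.
 */
static int issue_ops(struct toy_provider *p, long n, int block)
{
    int quiets = 0;
    int since_quiet = 0;

    for (long i = 0; i < n; i++) {
        while (toy_inject(p) == -EAGAIN) {
            toy_quiet(p);
            quiets++;
            since_quiet = 0;
        }
        since_quiet++;
        if (block > 0 && since_quiet == block) {
            toy_quiet(p);
            quiets++;
            since_quiet = 0;
        }
    }
    return quiets;
}
```

With `block` set to 0 the loop only drains when the pool is exhausted; with a block size smaller than the pool it never sees -EAGAIN at all, at the cost of more frequent quiets. Either way the number of in-flight requests stays bounded by `POOL_SIZE`, which is the property the reproducer above lacks.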

@hppritcha
Member Author

I think for this type of scenario we can just have the GNI provider internally step on the brake and harvest GNI TX CQEs and free up requests.
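
A minimal sketch of what "stepping on the brake" inside a provider could look like, again as a toy model rather than actual GNI provider code. `toy_ep`, `toy_req`, `harvest_tx_cq`, `toy_inject`, and `BRAKE_THRESHOLD` are hypothetical, and the model assumes every operation completes immediately, which a real network does not guarantee. The point is only the heuristic: before allocating a new request, check how many are outstanding and harvest the TX completion queue if the count has reached a high-water mark.

```c
#include <assert.h>
#include <stdlib.h>

#define BRAKE_THRESHOLD 64  /* hypothetical high-water mark */

struct toy_req {
    struct toy_req *next;
};

struct toy_ep {
    struct toy_req *tx_cq;  /* completed requests waiting to be harvested */
    long outstanding;       /* allocated requests not yet freed */
    long peak;              /* highest value of outstanding seen so far */
};

/* Free every request on the TX completion list; returns how many. */
static long harvest_tx_cq(struct toy_ep *ep)
{
    long freed = 0;

    while (ep->tx_cq) {
        struct toy_req *r = ep->tx_cq;
        ep->tx_cq = r->next;
        free(r);
        ep->outstanding--;
        freed++;
    }
    assert(ep->outstanding >= 0);
    return freed;
}

/*
 * Inject one operation. The brake: if too many requests are outstanding,
 * harvest completions before allocating another one. In this toy model
 * the operation completes at once and lands on the TX completion list.
 */
static int toy_inject(struct toy_ep *ep)
{
    if (ep->outstanding >= BRAKE_THRESHOLD)
        harvest_tx_cq(ep);

    struct toy_req *r = malloc(sizeof(*r));
    if (!r)
        return -1;

    ep->outstanding++;
    if (ep->outstanding > ep->peak)
        ep->peak = ep->outstanding;

    r->next = ep->tx_cq;
    ep->tx_cq = r;
    return 0;
}
```

Because the check happens inside the inject call, the application never has to poll a CQ itself, and memory use stays bounded by the threshold even in an unbounded loop of injects.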

@jswaro
Member

jswaro commented Aug 1, 2017

@hppritcha Can you verify that we can close this?

@hppritcha
Member Author

Yes, this can be closed.
