Running libfabric/GNI and MPICH/GNI in parallel #1406
You have the correct forum. libugni's restrictions and requirements are the responsibility of the provider and application. If the application and libfabric agree on a threading model, it is up to the provider to ensure the application can use said threading model based on the agreements defined in the threading model definition. I'd be interested to know what threading model was specified in the fi_getinfo call.
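For reference, a minimal sketch of where the threading model is declared in the hints passed to fi_getinfo; the provider name, capability bits, and FI_THREAD_SAFE value below are illustrative assumptions, not the reporter's actual settings:

```c
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_errno.h>

/* Query the GNI provider, declaring the threading model the application
 * intends to use.  All attribute values below are example choices. */
int query_gni(struct fi_info **info_out)
{
    struct fi_info *hints = fi_allocinfo();
    if (!hints)
        return -1;

    hints->fabric_attr->prov_name = strdup("gni");
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_RMA;
    hints->domain_attr->threading = FI_THREAD_SAFE;  /* the agreed threading model */

    int ret = fi_getinfo(FI_VERSION(1, 4), NULL, NULL, 0, hints, info_out);
    fi_freeinfo(hints);
    if (ret)
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
    return ret;
}
```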
Based on your output though, I'd have to say that there are some aspects of these stack traces that don't seem valid to me. The main thread should not be calling libugni directly. Edit: I misunderstood your original description. libugni is not written to be thread safe. You cannot use libfabric with the GNI provider side-by-side with libugni from a different context. There is a caveat to this, but it doesn't apply to your use case.
Does the explanation above make sense?
James, thanks for the clarification. Though it is rather unexpected.
I'm going to follow up with a colleague of mine. I know there are instances where libraries or applications use libugni in the manner that you are suggesting, but I suspect they use a different approach. What you are trying to accomplish might be possible.
So, I'll retract my statement. There are perfectly valid cases for using libugni from multiple contexts, and multiple threads, but it is predicated on some assumptions. Libugni is thread-safe with respect to the communication domain (a libugni construct). Would you mind explaining to me how it is that you are using libugni in the main thread, and how you are using libfabric? Specifically, how are you initializing each of the different communication contexts (libugni vs libfabric)?
That is very encouraging. Thanks for looking deeper into the issue.
I can't for the main thread, its
[ Error checking removed for clarity. ] After that we create a completion queue, an address vector, and an endpoint. Do you need to see this too? Thanks.
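For anyone following along, a generic sketch of that resource-creation sequence (completion queue, address vector, endpoint); this is not the reporter's code, and the attribute values are assumptions:

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* info is assumed to come from fi_getinfo() for the GNI provider.
 * Error checking removed for brevity, as in the original report. */
void setup_resources(struct fi_info *info)
{
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_cq *cq;
    struct fid_av *av;
    struct fid_ep *ep;

    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);

    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
    fi_cq_open(domain, &cq_attr, &cq, NULL);          /* completion queue */

    struct fi_av_attr av_attr = { .type = FI_AV_MAP };
    fi_av_open(domain, &av_attr, &av, NULL);          /* address vector */

    fi_endpoint(domain, info, &ep, NULL);             /* endpoint */
    fi_ep_bind(ep, &cq->fid, FI_SEND | FI_RECV);
    fi_ep_bind(ep, &av->fid, 0);
    fi_enable(ep);
}
```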
Not really. I just wanted to sanity check some things. How does this application function? You said it runs some operations with libfabric and some with MPI. What is libfabric being used for? Could you attach a core dump?
Here is the full back-trace of the error:
We're developing a tool (frames 8,9) that wraps MPI to record function calls of an MPI application (frames 10 - 14). Libfabric provides the communication infrastructure for transferring the recorded data to other tool processes where the processing is done. I hope this gives you a vague impression of our application.
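As a rough illustration of the interposition technique being described (not the actual tool), an MPI wrapper typically records the call and then forwards it through the PMPI interface; tool_record_event is a hypothetical hook:

```c
#include <mpi.h>

/* Hypothetical recording hook provided by the tool. */
extern void tool_record_event(const char *name);

/* The wrapper is linked ahead of the MPI library, records the call,
 * then forwards to the real implementation via PMPI. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    tool_record_event("MPI_Send");
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```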
It is certainly interesting. I have a vague idea where your application is crashing. Would you mind adding this to your aprun/srun? UGNI_USE_LOGFILE=output.$ALPS_APP_PE UGNI_DEBUG=9
does the
It doesn't seem like it needs it, no.
@JoZie will try it tonight. We are also trying to work around it with a mutex. I.e., manual
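A sketch of what such a mutex workaround might look like (names and structure are assumptions, not the reporter's code): a single process-wide lock is taken around every MPI call in the main thread and every libfabric call in the helper thread, so the two never drive libugni concurrently.

```c
#include <pthread.h>
#include <mpi.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static pthread_mutex_t ugni_serialize = PTHREAD_MUTEX_INITIALIZER;

/* Main thread: every MPI call goes through a wrapper like this. */
void locked_mpi_send(const void *buf, int count, MPI_Datatype dt,
                     int dest, int tag, MPI_Comm comm)
{
    pthread_mutex_lock(&ugni_serialize);
    MPI_Send(buf, count, dt, dest, tag, comm);
    pthread_mutex_unlock(&ugni_serialize);
}

/* Helper thread: libfabric progress is driven under the same lock. */
void locked_cq_poll(struct fid_cq *cq)
{
    struct fi_cq_entry entry;
    pthread_mutex_lock(&ugni_serialize);
    fi_cq_read(cq, &entry, 1);
    pthread_mutex_unlock(&ugni_serialize);
}
```

Note that holding the lock across a blocking MPI call stalls the libfabric thread for the duration of that call, so this serializes progress as well as protecting libugni.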
I finally got a run with the uGNI logs active, but only with UGNI_DEBUG=4. I hope this is still helpful. The problem with higher debug levels is the amount of data that is generated (up to 25GB per process). Here are the output files of the erroneous process:
I suspect you're getting a cdm id collision problem in uGNI's datagram path, although I thought the algorithm we're using in the GNI provider was different enough from that used in craypich that this problem would not happen.
Howard: Is there a simple way for us to change the CDM ID generation to test this idea?
Hmm... actually I'm not sure about the cdm id collision thing. That would have resulted in an error return from … This approach should work, however. I suggest two things:
We may try this. Though I have a question: do you think that the problem arises because we are interfacing with uGNI through different threads, or because we are interfacing with uGNI through two different higher-level interfaces (i.e., MPI and libfabric)? CDM ID collisions sound to me like they fall into the latter case. Though we are currently using a mutex to multiplex the interaction with uGNI between our two threads, and this seems to avoid the problem. But I can't imagine that the CDM ID collision can be avoided just by using a mutex.
It may possibly be the former - accessing uGNI through different threads - although, as @jswaro pointed out, since you're using separate uGNI objects (cdm, ep's, cq's) for craypich and the OFI GNI provider, you should be okay. If you were hitting a problem owing to using uGNI through two different high-level interfaces, and using the same GNI RDMA credentials, and using a similar scheme for generating CDM IDs, you'd hit the ID collision problem. But as I said above, if you were hitting that, you'd be getting a different error very near initialization. I think we need someone with access to the relevant uGNI source code to look and see where abort is being called in the uGNI calls showing up in the traceback. Hmm... actually, since the datagram path is almost entirely in the kernel, you may also get better info by using strace, and also by running dmesg on the nodes where the job was run. If we're lucky
I just came back from a conference. Sorry about the lack of response. The abort is coming from GNI_PostDataProbeById, specifically from the ioctl where it attempts to post the data probe to the device through the kgni ioctl system. Given the error code reported by the fatal, it seems like it can't find the device based on what was provided. The device comes from the data embedded in the nic_handle, so perhaps the NIC handle is bad?
Interesting. You're probably right, @jswaro: the NIC handle craypich is using has somehow gotten corrupted. Was craypich initialized with MPI_THREAD_MULTIPLE support?
Out of curiosity, how is the helper thread created? Is it done via a call to pthread_create?
The problem seems to be present with and without requested
Yes.
I'll take this and try to reproduce this problem with a simple test case, with and without using pthreads.
Quick question though: at the point you see the abort in the uGNI library, has the app already done
Definitely yes. We start the libfabric thread inside the
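For context, a sketch of the startup order being described, under the assumption that MPI is initialized first and the libfabric helper thread is then created with pthread_create; fabric_thread_main is a hypothetical thread body:

```c
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

extern void *fabric_thread_main(void *arg);   /* hypothetical: runs all libfabric calls */

int main(int argc, char **argv)
{
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI only provides thread level %d\n", provided);

    pthread_t comm_thread;
    pthread_create(&comm_thread, NULL, fabric_thread_main, NULL);

    /* ... main thread makes MPI calls only; helper thread makes libfabric calls ... */

    pthread_join(comm_thread, NULL);
    MPI_Finalize();
    return 0;
}
```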
@bertwesarg which FI version are you feeding to
@bertwesarg if you're working off of the ofi-cray/libfabric-cray source, could you rerun your test?
@bertwesarg first, a heads-up. You'll need to make sure when you configure libfabric with
otherwise libfabric will fail in the call to fi_domain, especially if you're using a CLE 6 system. That being said, I tested mixing Cray MPI with an app which uses the libfabric API and GNI provider directly, and could not reproduce your problem. I'd suggest retesting using the head of master for libfabric and seeing if #1411 has helped with the problem you're seeing.
@hppritcha unfortunately we weren't able to test #1411 because, for the moment, we reverted back to an MPI-only solution. We thought this workaround worked, however after some time the bug reappeared. So it's unlikely that the effect is the result of an interaction between MPICH and libfabric on the uGNI level. We're suspecting that libunwind somehow interferes at the uGNI level. But we are still working on a "minimal" example to reproduce this error.
@JoZie What version of the gcc module are you using? If you are using a 7.x version, I've seen issues with it. You could try using gcc/6.1 or gcc/6.3. Specifically, the problems that I have observed have been with libunwind.
@jswaro Thanks for the advice! Last week I could continue exploring the bug. I came to the conclusion that our problems with libfabric and libunwind are separate bugs. However, I did some digging in the open issues to see if there are related bugs and came across #1312. I also tried disabling the memory registration cache, which seems to fix the bug. But with this setting all applications are awfully slow. Setting it to
Now my goal is to use FI_LOCAL_MR and do the memory management myself. But my implementation doesn't work yet (some GNI_PostRdma failed: GNI_RC_INVALID_PARAM error). Is there a reference implementation for this somewhere, other than the
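For what it's worth, a minimal sketch of the manual-registration path with local MRs (generic libfabric, not GNI-specific; error handling and completion processing are omitted, and the access flags are assumptions):

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Register the buffer yourself and hand the resulting descriptor to the
 * data-transfer call, as required when FI_LOCAL_MR / FI_MR_LOCAL is in effect. */
void send_with_manual_mr(struct fid_domain *domain, struct fid_ep *ep,
                         void *buf, size_t len, fi_addr_t dest)
{
    struct fid_mr *mr;

    /* Register the local buffer; access flags must cover the operation. */
    fi_mr_reg(domain, buf, len, FI_SEND | FI_RECV, 0, 0, 0, &mr, NULL);

    /* The descriptor from the registration is what the send call expects. */
    fi_send(ep, buf, len, fi_mr_desc(mr), dest, NULL);

    /* ... wait for the send completion before deregistering ... */
    fi_close(&mr->fid);
}
```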
Do you have the capability to compile libfabric with kdreg support? If not, then the internal memory registration cache could fault and cause all sorts of issues. Keep in mind that FI_LOCAL_MR is a deprecated flag for libfabric version 1.5 -- however, FI_MR_LOCAL is not. If you use FI_MR_LOCAL or FI_LOCAL_MR (pre-1.5), turn off the GNI provider caching code with the fabric ops. That should eliminate any code paths the provider might take to optimize registration. Given you have an invalid param, I suspect the same code that was tripping you up without FI_LOCAL_MR is still present until you disable the caching mechanisms.
Thanks for the info! I cannot build with kdreg support since the header is missing, so I contacted my administrators. It took me a while to get 1.5 with FI_MR_LOCAL to run. It doesn't work with MR_BASIC; you have to use all values of the BASIC_MAP. But in the end the error is the same, although the mr_cache is disabled. And the UGNI_DEBUG=9 flag doesn't provide any useful output either.
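To spell out the mr_mode change referred to above: with libfabric 1.5, instead of requesting FI_MR_BASIC, the hints carry the individual bits that basic registration maps to, plus FI_MR_LOCAL. A small sketch follows; the exact bit combination is my reading of what is described here, not an authoritative recipe.

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

struct fi_info *make_hints(void)
{
    struct fi_info *hints = fi_allocinfo();
    if (!hints)
        return NULL;

    /* 1.5-style mr_mode bits: the FI_MR_BASIC mapping plus FI_MR_LOCAL. */
    hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_ALLOCATED |
                                  FI_MR_PROV_KEY | FI_MR_VIRT_ADDR;
    return hints;
}
```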
Dear all,
I'm not sure if this is the right forum, but anyway: we would like to use libfabric with the GNI provider from inside an MPI application which uses MPICH/GNI on a Cray XC40 platform. But we have the impression that this does not play well together regarding threads. We start our own thread for doing just libfabric calls but no MPI calls; the reverse holds for the main thread. But we get abort()s from inside the libugni library when the main thread does MPI calls. Here are two examples:
Is libugni prepared for this kind of usage at all? Thanks.