fi_read not returning error #1285
I have a variation on this. I'm doing non-blocking gets (reads) and using too many resources, but I'm getting an abort when I was hoping for -FI_EAGAIN.
From the fi_rma man page (Return Value) and fi_msg:
-FI_EAGAIN : Indicates that the underlying provider currently lacks the resources needed to initiate the requested operation. The reasons for a provider returning FI_EAGAIN are varied. However, common reasons include insufficient internal buffering or full processing queues.
What I get:
libfabric:gni:ep_data:__gnix_rma_fill_pd_indirect_get():608 [23368:1] RAN OUT OF INT_TX_BUFS
Application 3025958 is crashing. ATP analysis proceeding...
ATP Stack walkback for Rank 0 starting:
|
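For reference, the behavior hoped for above (the post call reports -FI_EAGAIN, the caller makes progress and retries) might look roughly like the sketch below. It is deliberately not written against libfabric: `struct mock_ep`, `mock_post_read`, and `mock_progress` are hypothetical stand-ins for an endpoint with a bounded buffer pool, for fi_read, and for driving progress, and the pool size is an assumed value. As far as I know FI_EAGAIN has the same value as the system EAGAIN, so the sketch uses EAGAIN from errno.h.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an endpoint with a bounded pool of internal
 * TX buffers. This is not a libfabric type. */
struct mock_ep {
    size_t tx_bufs_free; /* buffers available for new operations */
    size_t in_flight;    /* operations posted but not yet completed */
    size_t completed;    /* operations that have finished */
};

/* Stand-in for fi_read(): returns 0 on success, or -EAGAIN when the
 * pool is exhausted instead of aborting. */
static int mock_post_read(struct mock_ep *ep)
{
    if (ep->tx_bufs_free == 0)
        return -EAGAIN;
    ep->tx_bufs_free--;
    ep->in_flight++;
    return 0;
}

/* Stand-in for driving progress (for example by reading a CQ or a
 * counter): retires one in-flight operation and frees its buffer. */
static void mock_progress(struct mock_ep *ep)
{
    if (ep->in_flight > 0) {
        ep->in_flight--;
        ep->completed++;
        ep->tx_bufs_free++;
    }
}

/* Post `count` reads. On -EAGAIN, make progress and retry the same
 * operation; any other error is returned to the caller. */
static int post_reads_with_retry(struct mock_ep *ep, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        int ret;
        while ((ret = mock_post_read(ep)) == -EAGAIN)
            mock_progress(ep);
        if (ret != 0)
            return ret;
    }
    return 0;
}

/* Wait for everything still outstanding. */
static void drain(struct mock_ep *ep)
{
    while (ep->in_flight > 0)
        mock_progress(ep);
}
```

With a pool of 128 buffers, posting 1000 reads this way never has more than 128 in flight, and the caller does not need to know the pool size in advance.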
And I wonder what the limit is? It seems to be around 256, but I don't see any info field that matches. I'm trying to figure out how often to block. Blocking after every 128 operations works. |
I think this one is the same as #1199. |
Bob, do you have a test case that I could run to reproduce this? |
Yes, I should. I put in a workaround so I’ll need to undo that first.
But it's not a simple testcase. This is the SHMEM library and the OSU get non-blocking testcase.
|
It's dynamically linked, so you can use your own libfabric, but you need my SHMEM @
$ LD_LIBRARY_PATH=/cray/css/users/bcernohous/repos/mpt_base/smax/libfab/install/lib/:/cray/css/users/bcernohous/repos/mpt_base/smax/opt-1285/lib/:$LD_LIBRARY_PATH SHMEM_COLL_OPT_OFF=1 aprun -q -n2 -N1 -d6 /cray/css/users/bcernohous/osu-micro-benchmarks-5.3.2/openshmem/osu_oshm_get_nbi.x.1285 heap
With FI_LOG_LEVEL=debug, the last bit of the trace is:
libfabric:gni:ep_ctrl:_gnix_vc_ep_get_vc():2130 [29161:1]
|
@bcernohous, can I run this on tiger? I seem to be missing libsma2.so |
@bcernohous, the issue you reported here is actually #1199 which is now fixed. |
The fix for #1199 appears to fix this. But now that you grow the pool, is there still a limit that I should consider for my non-blocking data transfers? I haven't hit any limit yet in my testing. |
@bcernohous |
I used to hit the limit of 128 easily, but for some reason I'm not hitting the new limit. I start 100K+ non-blocking gets/reads and complete them all at once with fi_cntr_read.
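If a cap on outstanding reads does turn out to be needed, one option is to bound the window with the completion counter instead of blocking at fixed intervals. The sketch below is an illustration only and does not call libfabric: `struct mock_cntr`, `mock_issue_read`, and `mock_cntr_read` are hypothetical stand-ins for a counter, for posting a non-blocking read bound to it, and for fi_cntr_read, and the window of 128 in the usage note is an assumed value, not a documented provider limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a completion counter. Not a libfabric type. */
struct mock_cntr {
    uint64_t value;   /* completions counted so far */
    uint64_t pending; /* operations posted but not yet counted */
};

/* Stand-in for posting one non-blocking read bound to the counter. */
static void mock_issue_read(struct mock_cntr *c)
{
    c->pending++;
}

/* Stand-in for fi_cntr_read(). In this mock each poll also retires one
 * pending operation, which models the provider making progress. */
static uint64_t mock_cntr_read(struct mock_cntr *c)
{
    if (c->pending > 0) {
        c->pending--;
        c->value++;
    }
    return c->value;
}

/* Post `total` reads while keeping at most `window` outstanding, then
 * wait for all of them. Returns the peak number outstanding. */
static uint64_t post_windowed(struct mock_cntr *c, uint64_t total,
                              uint64_t window)
{
    uint64_t base = c->value; /* counter value before this batch */
    uint64_t done = base;     /* last counter value observed */
    uint64_t posted = 0;
    uint64_t peak = 0;

    if (window == 0)
        window = 1; /* a zero window could never make progress */

    while (posted < total) {
        /* Poll the counter until the window has room. */
        while (posted - (done - base) >= window)
            done = mock_cntr_read(c);
        mock_issue_read(c);
        posted++;
        if (posted - (done - base) > peak)
            peak = posted - (done - base);
    }

    /* Complete everything still outstanding. */
    while (done - base < posted)
        done = mock_cntr_read(c);
    return peak;
}
```

Calling `post_windowed(&c, 100000, 128)` on a zeroed counter posts all 100000 reads while never exceeding 128 outstanding in this mock.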
|
fi_read can only return an error if there is already a connection established. Is this what we are asking for here? |
These cases should return a CQ error. |
Handing off to @hppritcha for the retry part of this. |
As per ofiwg#2807, fi_read is not returning an error when it probably should be. The following warning is issued: