
fi_read not returning error #1285

Open · sungeunchoi opened this issue Mar 13, 2017 · 14 comments
@sungeunchoi

As per ofiwg#2807, fi_read is not returning an error when it probably should be. The following warning is issued:

libfabric:gni:ep_data:_gnix_rma_post_req():1278<warn> [28831:12] GNI_Post*() failed: GNI_RC_INVALID_PARAM
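For reference, a minimal sketch of the return-value check this report implies should fire, per the fi_rma(3) contract quoted further down. The endpoint, buffer, and key arguments are hypothetical placeholders, not taken from this issue:

#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>

/* Hypothetical wrapper: ep, buf, desc, peer, raddr, rkey, and ctx stand in
 * for a fully initialized endpoint and registered memory regions. */
static int checked_read(struct fid_ep *ep, void *buf, size_t len, void *desc,
                        fi_addr_t peer, uint64_t raddr, uint64_t rkey, void *ctx)
{
    ssize_t rc = fi_read(ep, buf, len, desc, peer, raddr, rkey, ctx);
    if (rc < 0)
        /* Per fi_rma(3), a failed post should surface here as a negative
         * fabric errno, not only as a provider-internal warning. */
        fprintf(stderr, "fi_read: %s\n", fi_strerror((int)-rc));
    return (int)rc;
}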
@bcernohous

I have a variation on this. I'm doing a non-blocking get (read) and using too many resources, but I'm getting an abort when I was hoping for -FI_EAGAIN.

From the fi_rma man page:

Return Value
Returns 0 on success. On error, a negative value corresponding to fabric errno is returned. Fabric errno values are defined in rdma/fi_errno.h.
Errors
-FI_EAGAIN : See fi_msg(3) for a detailed description of handling FI_EAGAIN.

and fi_msg says:

-FI_EAGAIN : Indicates that the underlying provider currently lacks the resources needed to initiate the requested operation. The reasons for a provider returning FI_EAGAIN are varied. However, common reasons include insufficient internal buffering or full processing queues.

What I get:

libfabric:gni:ep_data:__gnix_rma_fill_pd_indirect_get():608 [23368:1] RAN OUT OF INT_TX_BUFS

Application 3025958 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 0 starting:
<...>
_smati_ofi_get_nbi@smat_ofi_get.c:117
fi_readmsg@fi_rma.h:115
gnix_ep_readmsg@0x486e45
_gnix_rma@0x4b7a13
_gnix_vc_queue_tx_req@0x45ea15
_gnix_rma_post_req@0x4b5f6a
__gnix_rma_fill_pd_indirect_get@0x4b00cc
abort@abort.c:78
raise@pt-raise.c:37
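A minimal sketch of the -FI_EAGAIN handling being hoped for here, following the fi_msg(3) guidance quoted above: on -FI_EAGAIN, drive progress by reaping the transmit CQ, then repost. All identifiers are placeholders, and a real caller would also handle other error codes:

#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>

static ssize_t read_with_retry(struct fid_ep *ep, struct fid_cq *tx_cq,
                               void *buf, size_t len, void *desc,
                               fi_addr_t peer, uint64_t raddr, uint64_t rkey,
                               void *ctx)
{
    ssize_t rc;

    do {
        rc = fi_read(ep, buf, len, desc, peer, raddr, rkey, ctx);
        if (rc == -FI_EAGAIN) {
            /* Reap a completion so the provider can reclaim internal
             * resources (e.g. its tx buffer pool), then retry. */
            struct fi_cq_entry comp;
            (void)fi_cq_read(tx_cq, &comp, 1);
        }
    } while (rc == -FI_EAGAIN);

    return rc;
}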

@bcernohous

And I wonder what the limit is. It seems to be around 256, but I don't see any info field that matches. I'm trying to figure out how often to block; blocking every 128 operations works (a throttling sketch follows below).
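One way to "block every 128", sketched under two assumptions not confirmed in this thread: a completion counter is bound to the endpoint's reads, and reading it drives provider progress. TX_WINDOW is the empirical value from the comment above:

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>

#define TX_WINDOW 128  /* empirical; no matching fi_info field was found */

/* Hypothetical throttling loop: cntr must have been bound to ep with
 * fi_ep_bind(ep, &cntr->fid, FI_READ) before any transfers are posted. */
static void throttled_gets(struct fid_ep *ep, struct fid_cntr *cntr,
                           void *bufs[], size_t len, void *desc,
                           fi_addr_t peer, uint64_t raddrs[], uint64_t rkey,
                           size_t n)
{
    uint64_t posted = 0;

    for (size_t i = 0; i < n; i++) {
        /* outstanding = posted - completed; stall while the window is full */
        while (posted - fi_cntr_read(cntr) >= TX_WINDOW)
            ;  /* assumes fi_cntr_read makes provider progress */
        if (fi_read(ep, bufs[i], len, desc, peer, raddrs[i], rkey, NULL) != 0)
            break;  /* real code would handle -FI_EAGAIN and other errors */
        posted++;
    }
}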

@sungeunchoi
Author

I think this one is the same as #1199

@chuckfossen chuckfossen self-assigned this Apr 17, 2017
@chuckfossen

Bob, do you have a test case that I could run to reproduce this?

@bcernohous

bcernohous commented Apr 17, 2017 via email

@bcernohous

bcernohous commented Apr 17, 2017

It's dynamically linked, so you can use your own libfabric, but you need my SHMEM at
/cray/css/users/bcernohous/repos/mpt_base/smax/opt-1285/lib/

$ LD_LIBRARY_PATH=/cray/css/users/bcernohous/repos/mpt_base/smax/libfab/install/lib/:/cray/css/users/bcernohous/repos/mpt_base/smax/opt-1285/lib/:$LD_LIBRARY_PATH SHMEM_COLL_OPT_OFF=1 aprun -q -n2 -N1 -d6 /cray/css/users/bcernohous/osu-micro-benchmarks-5.3.2/openshmem/osu_oshm_get_nbi.x.1285 heap
OSU OpenSHMEM Non-blocking Get Test
Size Latency (us)
_pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s3n0] [Mon Apr 17 17:37:29 2017] PE RANK 0 exit signal Aborted

With FI_LOG_LEVEL=debug, the last bit of the trace is:

libfabric:gni:ep_ctrl:_gnix_vc_ep_get_vc():2130 [29161:1]
libfabric:gni:ep_ctrl:__gnix_vc_get_vc_by_fi_addr():218 [29161:1]
libfabric:gni:ep_data:_gnix_rma():1387 [29161:1] Using tmp buf for unaligned GET, req: 0x947a00
libfabric:gni:ep_ctrl:_gnix_vc_ep_get_vc():2130 [29161:1]
libfabric:gni:ep_ctrl:__gnix_vc_get_vc_by_fi_addr():218 [29161:1]
libfabric:gni:ep_data:__gnix_rma_fill_pd_indirect_get():608 [29161:1] RAN OUT OF INT_TX_BUFS_pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s3n0] [Mon Apr 17 17:38:29 2017] PE RANK 0 exit signal Aborted

@chuckfossen

@bcernohous, can I run this on tiger? I seem to be missing libsma2.so

@chuckfossen

@bcernohous, the issue you reported here is actually #1199 which is now fixed.

@bcernohous

The fix for #1199 appears to fix this.

But now that you grow the pool, is there still a limit that I should consider for my non-blocking data transfers? I haven't hit any limit yet in my testing.

@chuckfossen

@bcernohous
The original buffer pool was 128 tx buffers.
The current limit is 256 x 128 (32,768 buffers). This will be further enhanced when we add the capability to return an EAGAIN when this limit is hit.

@bcernohous

bcernohous commented Apr 26, 2017

I used to hit the 128-buffer limit easily.

But I'm not hitting the new limit, for some reason. I start 100K+ non-blocking get/reads and complete them all at once with fi_cntr_read.

 PE 0 - _smati_tx_complete (0x85aec0)SMA_TX_COMPLETE(0xad19f0) pending 1, timeout -1
 PE 0 - _smati_tx_complete fi_cntr_read returned "(null)" 1 expected 1
 <... reads ...>
 PE 0 - _smati_tx_complete (0x85aec0)SMA_TX_COMPLETE(0xad19f0) pending 101001, timeout -1
 PE 0 - _smati_tx_complete fi_cntr_read returned "(null)" 101001 expected 101001
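For reference, a sketch of this complete-all-at-once pattern, assuming the counter was opened with a wait object so fi_cntr_wait is usable; otherwise one would spin on fi_cntr_read as in the throttling sketch above:

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Block until `expected` read completions have been counted on `cntr`.
 * A timeout of -1 means wait indefinitely. */
static int wait_all(struct fid_cntr *cntr, uint64_t expected)
{
    return fi_cntr_wait(cntr, expected, -1);
}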

@chuckfossen

fi_read can only return an error if there is already a connection established. Is this what we are asking for here?

@sungeunchoi
Author

These cases should return a CQ error.
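A sketch of how such a failure would be consumed as a CQ error: when fi_cq_read returns -FI_EAVAIL, the error entry can be fetched with fi_cq_readerr. The tx_cq variable is a placeholder for the transmit CQ:

#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

static void report_cq_error(struct fid_cq *tx_cq)
{
    struct fi_cq_err_entry err = { 0 };

    /* Retrieve the pending error entry; returns > 0 on success. */
    if (fi_cq_readerr(tx_cq, &err, 0) > 0)
        fprintf(stderr, "cq error: %s (prov_errno %d)\n",
                fi_cq_strerror(tx_cq, err.prov_errno, err.err_data, NULL, 0),
                err.prov_errno);
}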

@sungeunchoi
Author

Handing off to @hppritcha for the retry part of this.

@sungeunchoi sungeunchoi assigned hppritcha and unassigned chuckfossen Jul 17, 2017