
fi_read not returning error #1285

Open · sungeunchoi opened this issue Mar 13, 2017 · 14 comments
@sungeunchoi

As per ofiwg#2807, fi_read is not returning an error when it probably should be. The following warning is issued:

libfabric:gni:ep_data:_gnix_rma_post_req():1278<warn> [28831:12] GNI_Post*() failed: GNI_RC_INVALID_PARAM
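For reference, a minimal sketch of the return-value check this report implies should fire, per the fi_rma(3) contract quoted further down. The endpoint, buffer, and key arguments are hypothetical placeholders, not taken from this issue:

#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>

/* Hypothetical wrapper: ep, buf, desc, peer, raddr, rkey, and ctx stand in
 * for a fully initialized endpoint and registered memory regions. */
static int checked_read(struct fid_ep *ep, void *buf, size_t len, void *desc,
                        fi_addr_t peer, uint64_t raddr, uint64_t rkey, void *ctx)
{
    ssize_t rc = fi_read(ep, buf, len, desc, peer, raddr, rkey, ctx);
    if (rc < 0)
        /* Per fi_rma(3), a failed post should surface here as a negative
         * fabric errno, not only as a provider-internal warning. */
        fprintf(stderr, "fi_read: %s\n", fi_strerror((int)-rc));
    return (int)rc;
}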
@bcernohous

I have a variation on this. I'm doing a non-blocking get (read) and using too many resources, but I'm getting an abort when I was hoping for -FI_EAGAIN.

From the fi_rma man page:

Return Value
Returns 0 on success. On error, a negative value corresponding to fabric errno is returned. Fabric errno values are defined in rdma/fi_errno.h.
Errors
-FI_EAGAIN : See fi_msg(3) for a detailed description of handling FI_EAGAIN.

and fi_msg says:

-FI_EAGAIN : Indicates that the underlying provider currently lacks the resources needed to initiate the requested operation. The reasons for a provider returning FI_EAGAIN are varied. However, common reasons include insufficient internal buffering or full processing queues.

What I get:

libfabric:gni:ep_data:__gnix_rma_fill_pd_indirect_get():608 [23368:1] RAN OUT OF INT_TX_BUFS

Application 3025958 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 0 starting:
<...>
_smati_ofi_get_nbi@smat_ofi_get.c:117
fi_readmsg@fi_rma.h:115
gnix_ep_readmsg@0x486e45
_gnix_rma@0x4b7a13
_gnix_vc_queue_tx_req@0x45ea15
_gnix_rma_post_req@0x4b5f6a
__gnix_rma_fill_pd_indirect_get@0x4b00cc
abort@abort.c:78
raise@pt-raise.c:37
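A minimal sketch of the -FI_EAGAIN handling being hoped for here, following the fi_msg(3) guidance quoted above: on -FI_EAGAIN, drive progress by reaping the transmit CQ, then repost. All identifiers are placeholders, and a real caller would also handle other error codes:

#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>

static ssize_t read_with_retry(struct fid_ep *ep, struct fid_cq *tx_cq,
                               void *buf, size_t len, void *desc,
                               fi_addr_t peer, uint64_t raddr, uint64_t rkey,
                               void *ctx)
{
    ssize_t rc;

    do {
        rc = fi_read(ep, buf, len, desc, peer, raddr, rkey, ctx);
        if (rc == -FI_EAGAIN) {
            /* Reap a completion so the provider can reclaim internal
             * resources (e.g. its tx buffer pool), then retry. */
            struct fi_cq_entry comp;
            (void)fi_cq_read(tx_cq, &comp, 1);
        }
    } while (rc == -FI_EAGAIN);

    return rc;
}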

@bcernohous

And I wonder what the limit is. It seems to be around 256, but I don't see any info field that matches. I'm trying to figure out how often to block; blocking every 128 operations works (a throttling sketch follows below).
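One way to "block every 128", sketched under two assumptions not confirmed in this thread: a completion counter is bound to the endpoint's reads, and reading it drives provider progress. TX_WINDOW is the empirical value from the comment above:

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>

#define TX_WINDOW 128  /* empirical; no matching fi_info field was found */

/* Hypothetical throttling loop: cntr must have been bound to ep with
 * fi_ep_bind(ep, &cntr->fid, FI_READ) before any transfers are posted. */
static void throttled_gets(struct fid_ep *ep, struct fid_cntr *cntr,
                           void *bufs[], size_t len, void *desc,
                           fi_addr_t peer, uint64_t raddrs[], uint64_t rkey,
                           size_t n)
{
    uint64_t posted = 0;

    for (size_t i = 0; i < n; i++) {
        /* outstanding = posted - completed; stall while the window is full */
        while (posted - fi_cntr_read(cntr) >= TX_WINDOW)
            ;  /* assumes fi_cntr_read makes provider progress */
        if (fi_read(ep, bufs[i], len, desc, peer, raddrs[i], rkey, NULL) != 0)
            break;  /* real code would handle -FI_EAGAIN and other errors */
        posted++;
    }
}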

@sungeunchoi
Author

I think this one is the same as #1199

@chuckfossen chuckfossen self-assigned this Apr 17, 2017
@chuckfossen

Bob, do you have a test case that I could run to reproduce this?

@bcernohous

bcernohous commented Apr 17, 2017 via email

@bcernohous

bcernohous commented Apr 17, 2017

It's dynamically linked, so you can use your own libfabric, but you need my SHMEM at
/cray/css/users/bcernohous/repos/mpt_base/smax/opt-1285/lib/

$ LD_LIBRARY_PATH=/cray/css/users/bcernohous/repos/mpt_base/smax/libfab/install/lib/:/cray/css/users/bcernohous/repos/mpt_base/smax/opt-1285/lib/:$LD_LIBRARY_PATH SHMEM_COLL_OPT_OFF=1 aprun -q -n2 -N1 -d6 /cray/css/users/bcernohous/osu-micro-benchmarks-5.3.2/openshmem/osu_oshm_get_nbi.x.1285 heap
OSU OpenSHMEM Non-blocking Get Test
Size Latency (us)
_pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s3n0] [Mon Apr 17 17:37:29 2017] PE RANK 0 exit signal Aborted

With FI_LOG_LEVEL=debug, the last bit of the trace is:

libfabric:gni:ep_ctrl:_gnix_vc_ep_get_vc():2130 [29161:1]
libfabric:gni:ep_ctrl:__gnix_vc_get_vc_by_fi_addr():218 [29161:1]
libfabric:gni:ep_data:_gnix_rma():1387 [29161:1] Using tmp buf for unaligned GET, req: 0x947a00
libfabric:gni:ep_ctrl:_gnix_vc_ep_get_vc():2130 [29161:1]
libfabric:gni:ep_ctrl:__gnix_vc_get_vc_by_fi_addr():218 [29161:1]
libfabric:gni:ep_data:__gnix_rma_fill_pd_indirect_get():608 [29161:1] RAN OUT OF INT_TX_BUFS_pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s3n0] [Mon Apr 17 17:38:29 2017] PE RANK 0 exit signal Aborted

@chuckfossen

@bcernohous, can I run this on tiger? I seem to be missing libsma2.so

@chuckfossen

@bcernohous, the issue you reported here is actually #1199 which is now fixed.

@bcernohous

The fix for #1199 appears to fix this.

But now that you grow the pool, is there still a limit that I should consider for my non-blocking data transfers? I haven't hit any limit yet in my testing.

@chuckfossen

@bcernohous
The original buffer pool was 128 tx buffers.
The current limit is 256 x 128 (32,768 buffers). This will be further enhanced when we add the capability to return an EAGAIN when this limit is hit.

@bcernohous

bcernohous commented Apr 26, 2017

I used to hit the 128-buffer limit easily.

But I'm not hitting the new limit, for some reason. I start 100K+ non-blocking get/reads and complete them all at once with fi_cntr_read.

 PE 0 - _smati_tx_complete (0x85aec0)SMA_TX_COMPLETE(0xad19f0) pending 1, timeout -1
 PE 0 - _smati_tx_complete fi_cntr_read returned "(null)" 1 expected 1
 <... reads ...>
 PE 0 - _smati_tx_complete (0x85aec0)SMA_TX_COMPLETE(0xad19f0) pending 101001, timeout -1
 PE 0 - _smati_tx_complete fi_cntr_read returned "(null)" 101001 expected 101001
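For reference, a sketch of this complete-all-at-once pattern, assuming the counter was opened with a wait object so fi_cntr_wait is usable; otherwise one would spin on fi_cntr_read as in the throttling sketch above:

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Block until `expected` read completions have been counted on `cntr`.
 * A timeout of -1 means wait indefinitely. */
static int wait_all(struct fid_cntr *cntr, uint64_t expected)
{
    return fi_cntr_wait(cntr, expected, -1);
}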

@chuckfossen

fi_read can only return an error if there is already a connection established. Is this what we are asking for here?

@sungeunchoi
Author

These cases should return a CQ error.
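A sketch of how such a failure would be consumed as a CQ error: when fi_cq_read returns -FI_EAVAIL, the error entry can be fetched with fi_cq_readerr. The tx_cq variable is a placeholder for the transmit CQ:

#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

static void report_cq_error(struct fid_cq *tx_cq)
{
    struct fi_cq_err_entry err = { 0 };

    /* Retrieve the pending error entry; returns > 0 on success. */
    if (fi_cq_readerr(tx_cq, &err, 0) > 0)
        fprintf(stderr, "cq error: %s (prov_errno %d)\n",
                fi_cq_strerror(tx_cq, err.prov_errno, err.err_data, NULL, 0),
                err.prov_errno);
}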

@sungeunchoi
Author

Handing off to @hppritcha for the retry part of this.

@sungeunchoi sungeunchoi assigned hppritcha and unassigned chuckfossen Jul 17, 2017