Assert hdr->length <= (iface->am_buf_size - sizeof(uct_tcp_am_hdr_t)) #4406

Closed
akesandgren opened this issue Nov 7, 2019 · 5 comments

@akesandgren

Just got hit by the above when running IOR on a single node. Quite unexpected, since I've run it a fair number of times without seeing it.

Any thoughts on what might be triggering this?

Any other info you want?

@dmitrygx self-assigned this Nov 7, 2019
@dmitrygx
Member

dmitrygx commented Nov 7, 2019

@akesandgren
Neither the master nor the v1.7.x branch has this assertion. Could you try master or v1.7.x, please?

@akesandgren
Author

You mean that the assert itself is no longer there?

I can try but it will take a while before I have time and I don't have an easy reproducer.

@dmitrygx
Member

dmitrygx commented Nov 7, 2019

> You mean that the assert itself is no longer there?

yep

> I can try but it will take a while before I have time and I don't have an easy reproducer.

thank you!

@akesandgren
Author

akesandgren commented Dec 2, 2019

Finally got a case that somewhat reliably shows the above problem using UCX 1.6.1.
Running with 1.7.0-rc1, I instead got this, in one run so far.

Running on two nodes here.

[b-cn0131:542585:0:542585]      tcp_ep.c:739  Assertion `hdr->length <= (iface->config.rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed
==== backtrace (tid: 542585) ====
 0 0x000000000001f130 ucs_fatal_error_message()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/debug/assert.c:33
 1 0x000000000001f2ce ucs_fatal_error_format()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/debug/assert.c:49
 2 0x0000000000019388 uct_tcp_ep_progress_rx()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/tcp/tcp_ep.c:738
 3 0x000000000001b2d9 uct_tcp_iface_handle_events()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/tcp/tcp_iface.c:180
 4 0x0000000000027575 ucs_event_set_wait()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/sys/event_set.c:213
 5 0x000000000001b39f uct_tcp_iface_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/tcp/tcp_iface.c:197
 6 0x0000000000022392 uct_iface_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/api/uct.h:2984
 7 0x0000000000022664 ucp_worker_iface_check_events_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucp/core/ucp_worker.c:746
 8 0x000000000001a16a ucs_callbackq_slow_proxy()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/datastruct/callbackq.c:397
 9 0x0000000000024d0a ucs_callbackq_dispatch()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/datastruct/callbackq.h:211
10 0x0000000000024d0a uct_worker_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/api/uct.h:2203
11 0x0000000000024d0a ucp_worker_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucp/core/ucp_worker.c:1897
12 0x0000000000003aa7 mca_pml_ucx_progress()  ???:0
13 0x000000000002c61c opal_progress()  ???:0
14 0x0000000000003c7d mca_pml_ucx_recv()  ???:0
15 0x00000000000a398d ompi_coll_base_allreduce_intra_recursivedoubling()  ???:0
16 0x000000000006275f MPI_Allreduce()  ???:0
17 0x0000000001cf5706 lj_vm_ffi_call()  crtstuff.c:0
18 0x0000000001d15baa lj_ccall_func()  crtstuff.c:0
19 0x0000000001cf1663 lj_cf_ffi_meta___call()  lib_ffi.c:0
20 0x0000000001cf3767 lj_BC_FUNCC()  crtstuff.c:0
21 0x0000000001ce26ac lua_pcall()  ???:0
22 0x000000000045477f main()  ???:0
23 0x0000000000020830 __libc_start_main()  /build/glibc-LK5gWL/glibc-2.23/csu/../csu/libc-start.c:291
24 0x0000000000454d09 _start()  ???:0
=================================

@dmitrygx
Member

Closing this issue as it is a duplicate of #4525.
