Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.7.0-rc1 showing tcp_ep.c:739 Assertion hdr->length <= (iface->config.rx_seg_size - sizeof(uct_tcp_am_hdr_t)) #4525

Closed
akesandgren opened this issue Dec 2, 2019 · 31 comments · Fixed by #4612
Assignees
Labels

Comments

@akesandgren
Copy link

Describe the bug

Got a sporadic
tcp_ep.c:739 Assertion `hdr->length <= (iface->config.rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed
when chasing issue #4406

Could this be caused by outside connection attempts by port sniffers?

[b-cn0131:542585:0:542585]      tcp_ep.c:739  Assertion `hdr->length <= (iface->config.
rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed
==== backtrace (tid: 542585) ====
 0 0x000000000001f130 ucs_fatal_error_message()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-
7.3.0/ucx-1.7.0-rc1/src/ucs/debug/assert.c:33
 1 0x000000000001f2ce ucs_fatal_error_format()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/debug/assert.c:49
 2 0x0000000000019388 uct_tcp_ep_progress_rx()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/tcp/tcp_ep.c:738
 3 0x000000000001b2d9 uct_tcp_iface_handle_events()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/tcp/tcp_iface.c:180
 4 0x0000000000027575 ucs_event_set_wait()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/sys/event_set.c:213
 5 0x000000000001b39f uct_tcp_iface_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/tcp/tcp_iface.c:197
 6 0x0000000000022392 uct_iface_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/api/uct.h:2984
 7 0x0000000000022664 ucp_worker_iface_check_events_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucp/core/ucp_worker.c:746
 8 0x000000000001a16a ucs_callbackq_slow_proxy()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/datastruct/callbackq.c:397
 9 0x0000000000024d0a ucs_callbackq_dispatch()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/datastruct/callbackq.h:211
10 0x0000000000024d0a uct_worker_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/api/uct.h:2203
11 0x0000000000024d0a ucp_worker_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucp/core/ucp_worker.c:1897
12 0x0000000000003aa7 mca_pml_ucx_progress()  ???:0
13 0x000000000002c61c opal_progress()  ???:0
14 0x0000000000003c7d mca_pml_ucx_recv()  ???:0
15 0x00000000000a398d ompi_coll_base_allreduce_intra_recursivedoubling()  ???:0
16 0x000000000006275f MPI_Allreduce()  ???:0
17 0x0000000001cf5706 lj_vm_ffi_call()  crtstuff.c:0
18 0x0000000001d15baa lj_ccall_func()  crtstuff.c:0
19 0x0000000001cf1663 lj_cf_ffi_meta___call()  lib_ffi.c:0
20 0x0000000001cf3767 lj_BC_FUNCC()  crtstuff.c:0
21 0x0000000001ce26ac lua_pcall()  ???:0
22 0x000000000045477f main()  ???:0
23 0x0000000000020830 __libc_start_main()  /build/glibc-LK5gWL/glibc-2.23/csu/../csu/libc-start.c:291
24 0x0000000000454d09 _start()  ???:0
=================================

Steps to Reproduce

  • Command line
  • UCX version used 1.7.0-rc1
  • Any UCX environment variables used

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    Ubuntu 16.04.6, HWE kernel 4.15.0-70-generic

Additional information (depending on the issue)

  • OpenMPI 3.1.1
  • MOFED 4.4
@akesandgren
Copy link
Author

Managed to trigger it again.

@dmitrygx
Copy link
Member

@akesandgren is it possible to reproduce this with UCX master branch? could you check it pls?

@akesandgren
Copy link
Author

I can try. It takes anywhere from .5h to 2d before it happens. And it'll take a while before I have time to look at it but not too long.

@dmitrygx
Copy link
Member

I can try. It takes anywhere from .5h to 2d before it happens. And it'll take a while before I have time to look at it but not too long.

thank you! good to know.
btw, it would be great to try to reproduce the bug with #4559 (PR on top of the master) that could check my assumption about the possible cause of the problem. Can you checkout changes from my PR and try with it? Thank you in advance!

@akesandgren
Copy link
Author

Master (4f90824) still fails with:
Assertion `hdr->length <= (iface->config.rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed

Will try with 4559 on top of that, but it takes a long time...

@akesandgren
Copy link
Author

akesandgren commented Dec 13, 2019

Hmmm, that one, dmitrygx@80a2251, doesn't even build:

In file included from /scratch/eb-ake/UCX/20191210-4f90824302-pr4559/GCCcore-7.3
.0/ucx/src/ucs/sys/compiler.h:14:0,
                 from /scratch/eb-ake/UCX/20191210-4f90824302-pr4559/GCCcore-7.3
.0/ucx/src/uct/base/uct_iface.h:20,
                 from tcp/tcp.h:10,
                 from tcp/tcp_md.c:10:
tcp/tcp_md.c:77:51: error: initialization from incompatible pointer type [-Werro
r=incompatible-pointer-types]
     .cm_open            = UCS_CLASS_NEW_FUNC_NAME(uct_tcp_sockcm_t),
                                                   ^
/scratch/eb-ake/UCX/20191210-4f90824302-pr4559/GCCcore-7.3.0/ucx/src/ucs/sys/pre
processor.h:14:41: note: in definition of macro __UCS_TOKENPASTE_HELPER
 #define __UCS_TOKENPASTE_HELPER(x, y)   x ## y
                                         ^
/scratch/eb-ake/UCX/20191210-4f90824302-pr4559/GCCcore-7.3.0/ucx/src/ucs/type/cl
ass.h:257:5: note: in expansion of macro UCS_PP_TOKENPASTE
     UCS_PP_TOKENPASTE(_type, _new)
     ^~~~~~~~~~~~~~~~~
tcp/tcp_md.c:77:27: note: in expansion of macro UCS_CLASS_NEW_FUNC_NAME
     .cm_open            = UCS_CLASS_NEW_FUNC_NAME(uct_tcp_sockcm_t),
                           ^~~~~~~~~~~~~~~~~~~~~~~
tcp/tcp_md.c:77:51: note: (near initialization for uct_tcp_component.cm_open)
     .cm_open            = UCS_CLASS_NEW_FUNC_NAME(uct_tcp_sockcm_t),
                                                   ^
/scratch/eb-ake/UCX/20191210-4f90824302-pr4559/GCCcore-7.3.0/ucx/src/ucs/sys/pre
processor.h:14:41: note: in definition of macro __UCS_TOKENPASTE_HELPER
 #define __UCS_TOKENPASTE_HELPER(x, y)   x ## y
                                         ^
/scratch/eb-ake/UCX/20191210-4f90824302-pr4559/GCCcore-7.3.0/ucx/src/ucs/type/cl
ass.h:257:5: note: in expansion of macro UCS_PP_TOKENPASTE
     UCS_PP_TOKENPASTE(_type, _new)
     ^~~~~~~~~~~~~~~~~
tcp/tcp_md.c:77:27: note: in expansion of macro UCS_CLASS_NEW_FUNC_NAME
     .cm_open            = UCS_CLASS_NEW_FUNC_NAME(uct_tcp_sockcm_t),
                           ^~~~~~~~~~~~~~~~~~~~~~~

@akesandgren
Copy link
Author

(I was a little lazy and just took tree at that commit instead of applying 4559 on top of the master tree i had tested with)

@dmitrygx
Copy link
Member

(I was a little lazy and just took tree at that commit instead of applying 4559 on top of the master tree i had tested with)

yes, picking up the commit has to work.
sorry for the inconvenience, the tree of this PR contains the compilation error (due to race in PR merging) that was fixed in the master recently

@akesandgren
Copy link
Author

So your PR needs a merge with master.... I picked the changes manually instead and it's now running...
Don't expect any results until after the weekend. (Or saturday after lunch local time)

@dmitrygx
Copy link
Member

So your PR needs a merge with master.... I picked the changes manually instead and it's now running...

@akesandgren thank you!

Don't expect any results until after the weekend. (Or saturday after lunch local time)

great! hope this gives us more information to investigate the problem

@akesandgren
Copy link
Author

With 4559 on top of master commit 4f90824 I still get:
tcp_ep.c:891 Assertion `hdr->length <= (iface->config.rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed

@akesandgren
Copy link
Author

Both my test runs got the same asssert as above.

@dmitrygx
Copy link
Member

Both my test runs got the same asssert as above.

@akesandgren thank you for testing this!
Unfortunately, our internal tests and UCX CI don't show that problem.
Could you explain a little about the problem?

  • Is it some publically available benchmark or test?
  • Do you run into this issue every time when running this application?
  • How many MPI ranks are launched?
  • Do you set any UCX environment variables?

Thank you in advance for answering!

@akesandgren
Copy link
Author

It's not a benchmark but rather user code. And I'm fairly sure it's not a problem with the user code as such. We've seen the original problem, #4406, on multiple codes, including IOR as mention in that issue.
It has, so far, happened every time I run the code, but it varies from a few hours to >3 days before it happens.
The user code I've been using has been running with 56 tasks spread over two 28 core nodes.
There are no UCX specific env vars set. I do have, SLURM_MPI_TYPE=pmix_v2 and SLURM_PMIX_DIRECT_CONN_UCX=false.
This is with OpenMPI 3.1.1, PMIx 2.2.3, and the UCX version mentioned above.

My favorite suspicion is external port sniffers, we do not have any firewalls in place.
(But even those should not cause an assert, if they do, the UCX code is, in my view, broken.)
We do get a ton of:

[btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(49) failed: Connection reset by peer (104)

which is fairly annoying to users. I think that type of events, when coming from hosts not in the Comm, should be handled without printing a message.

@dmitrygx
Copy link
Member

[btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(49) failed: Connection reset by peer (104)

The messages come from OMPI/BTL/TCP module and it's not related to the UCX
Do you see these messages before or after the assert() happens?

@akesandgren
Copy link
Author

Yeah, I know, I was just venting frustration a bit :-)

We get a lot of these before the assert. And usually not related in time, though it's a bit hard to tell with a 3 day job...

@dmitrygx
Copy link
Member

We get a lot of these before the assert. And usually not related in time, though it's a bit hard to tell with a 3 day job...

yes, I see. Thank you!
I'm going to prepare the patch with traces that should help me to investigate the issue.

@akesandgren
Copy link
Author

Let me know when it's ready and I'll kick off a new build/run... Going on holiday trip tomorrow though, so latency will be somewhat higher...

@dmitrygx
Copy link
Member

Let me know when it's ready and I'll kick off a new build/run... Going on holiday trip tomorrow though, so latency will be somewhat higher...

@akesandgren thank you! let you know when it's ready

@dmitrygx
Copy link
Member

Let me know when it's ready and I'll kick off a new build/run... Going on holiday trip tomorrow though, so latency will be somewhat higher...

@akesandgren thank you! let you know when it's ready

Could you check out the following changes from my GitHub UCX fork:
https://github.com/dmitrygx/ucx/tree/topic/uct/tcp_am_debug
The branch was prepared on top of the UCX master
It has to give us more information for further investigations.

Thank you in advance for the help in the debugging of the issue!

@akesandgren
Copy link
Author

Tests running, I'll return with results when they show up...

@akesandgren
Copy link
Author

akesandgren commented Dec 20, 2019

First crash:

EP (130.239.242.224:59593<-><invalid address family>, conn state - ACCEPTING): 7
07723264 vs 65536
[b-cn1436:741689:0:741689]      tcp_ep.c:908  Assertion `hdr->length <= (iface->
config.rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed

130.239.242.224 is one of the two nodes in that job. And the job is node exclusive.

@dmitrygx
Copy link
Member

First crash:

EP (130.239.242.224:59593<-><invalid address family>, conn state - ACCEPTING): 7
07723264 vs 65536
[b-cn1436:741689:0:741689]      tcp_ep.c:908  Assertion `hdr->length <= (iface->
config.rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed

130.239.242.224 is one of the two nodes in that job. And the job is node exclusive.

@akesandgren great! thank you so much for collecting this trace!
It seems that the following flow happens:

  • A user creates EP (and socket - fd1) to node1 from node2
  • A user creates EP (and socket - fd1) to node2 from node1
  • TCP iface listener accept()ed on node1 a connection from node2 - creates fd0; it closes its client socket fd1 to the same peer, because its addr < then the remote one.
  • TCP iface listener accept()ed on node2 a connection from node1 - creates fd0; waiting for CONN REQ message from a peer on fd0 - it fails to do send CONN ACK as the connection was closed; we don't set *ep_p to NULL, but this EP was destroyed, so RX buffer is corrupted and we catch signal when trying to do smth with this EP inside the progress loop.

Could you try my preliminary possible fix here (the same branch, pls do git fetch <my fork> to update it and check it out again)?
Thank you!

@akesandgren
Copy link
Author

Ok, test running. If it works it will take several days before it finishes.

@akesandgren
Copy link
Author

FYI, second crash (running without the fix) was the same. So at least nothing else has popped up yet.

The runs with the fix are still running.

@dmitrygx
Copy link
Member

Ok, test running. If it works it will take several days before it finishes.

FYI, second crash (running without the fix) was the same. So at least nothing else has popped up yet.

The runs with the fix are still running.

@akesandgren thank you! waiting for the test result

@dmitrygx
Copy link
Member

Ok, test running. If it works it will take several days before it finishes.

FYI, second crash (running without the fix) was the same. So at least nothing else has popped up yet.
The runs with the fix are still running.

@akesandgren thank you! waiting for the test result

@akesandgren if the crash still persists with the provided fix, let's try to use v1.5.x UCX branch where there is no ring buffer that used to avoid excessive memmove() that we had in v1.5.x. If v1.5.x works for you, I should deeper investigate our ring buffer implementation (currently, it looks good to me)

@akesandgren
Copy link
Author

akesandgren commented Dec 22, 2019

One test run with the fix has crashed with:

EP (130.239.240.135:53455<-><invalid address family>, conn state - ACCEPTING): 790647877 vs 65536
[b-cn0135:892799:0:892799]      tcp_ep.c:908  Assertion `hdr->length <= (iface->config.rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed

Regarding 1.5.x, you meant like 1.5.2. Any risk of problems using that instead of 1.6.1, which this OpenMPI version is built with? I.e. do I need to rebuild OpenMPI to use 1.5.2 or can I , like I did for 1.7.0 and your branches, just load an older module of UCX?

Just wanting to make sure since I'm going backwards this time...

@dmitrygx
Copy link
Member

@akesandgren thank you for updating me!
It seems that an unexpected connection request from non UCX process is received during your test. So, we should prevent UCX from crashing when receiving data from an unknown sender.
I’ve reproduced something similar simulating this issue in UCX gtest. Looking for a good solution for this problem.

@akesandgren
Copy link
Author

The next crash (with the fix in place) shows

[1577137400.572410] [b-cn1433:926326:0]           sock.c:242  UCX  ERROR recv(fd=49) failed: Connection reset by peer
EP (130.239.242.221:40029<-><invalid address family>, conn state - ACCEPTING): 790647877 vs 65536
[b-cn1433:926326:0:926326]      tcp_ep.c:908  Assertion `hdr->length <= (iface->config.rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed

130.239.242.221 == b-cn1433

@FaDee1
Copy link

FaDee1 commented Jan 5, 2020

อธิบายข้อผิดพลาด

เตรียมพร้อมเป็นระยะ
tcp_ep.c: 739 การยืนยัน `hdr-> ความยาว <= (iface-> config.rx_seg_size - ขนาดของ (uct_tcp_am_hdr_t)) 'ล้มเหลว
เมื่อไล่ปัญหา# 4406

สิ่งนี้อาจเกิดจากการพยายามเชื่อมต่อจากภายนอกโดยผู้ดมกลิ่นพอร์ต?

[b-cn0131:542585:0:542585]      tcp_ep.c:739  Assertion `hdr->length <= (iface->config.
rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed
==== backtrace (tid: 542585) ====
 0 0x000000000001f130 ucs_fatal_error_message()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-
7.3.0/ucx-1.7.0-rc1/src/ucs/debug/assert.c:33
 1 0x000000000001f2ce ucs_fatal_error_format()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/debug/assert.c:49
 2 0x0000000000019388 uct_tcp_ep_progress_rx()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/tcp/tcp_ep.c:738
 3 0x000000000001b2d9 uct_tcp_iface_handle_events()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/tcp/tcp_iface.c:180
 4 0x0000000000027575 ucs_event_set_wait()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/sys/event_set.c:213
 5 0x000000000001b39f uct_tcp_iface_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/tcp/tcp_iface.c:197
 6 0x0000000000022392 uct_iface_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/api/uct.h:2984
 7 0x0000000000022664 ucp_worker_iface_check_events_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucp/core/ucp_worker.c:746
 8 0x000000000001a16a ucs_callbackq_slow_proxy()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/datastruct/callbackq.c:397
 9 0x0000000000024d0a ucs_callbackq_dispatch()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucs/datastruct/callbackq.h:211
10 0x0000000000024d0a uct_worker_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/uct/api/uct.h:2203
11 0x0000000000024d0a ucp_worker_progress()  /scratch/eb-ake/UCX/1.7.0-rc1/GCCcore-7.3.0/ucx-1.7.0-rc1/src/ucp/core/ucp_worker.c:1897
12 0x0000000000003aa7 mca_pml_ucx_progress()  ???:0
13 0x000000000002c61c opal_progress()  ???:0
14 0x0000000000003c7d mca_pml_ucx_recv()  ???:0
15 0x00000000000a398d ompi_coll_base_allreduce_intra_recursivedoubling()  ???:0
16 0x000000000006275f MPI_Allreduce()  ???:0
17 0x0000000001cf5706 lj_vm_ffi_call()  crtstuff.c:0
18 0x0000000001d15baa lj_ccall_func()  crtstuff.c:0
19 0x0000000001cf1663 lj_cf_ffi_meta___call()  lib_ffi.c:0
20 0x0000000001cf3767 lj_BC_FUNCC()  crtstuff.c:0
21 0x0000000001ce26ac lua_pcall()  ???:0
22 0x000000000045477f main()  ???:0
23 0x0000000000020830 __libc_start_main()  /build/glibc-LK5gWL/glibc-2.23/csu/../csu/libc-start.c:291
24 0x0000000000454d09 _start()  ???:0
=================================

ขั้นตอนในการทำซ้ำ

  • บรรทัดคำสั่ง
  • ใช้รุ่น UCX 1.7.0-rc1
  • ตัวแปรสภาพแวดล้อม UCX ใด ๆ ที่ใช้

การตั้งค่าและรุ่น

  • เวอร์ชั่นของระบบปฏิบัติการ (เช่น Linux distro) + สถาปัตยกรรม CPU (x86_64 / aarch64 / ppc64le / ... )
    Ubuntu 16.04.6, เคอร์เนล HWE 4.15.0-70-generic

ข้อมูลเพิ่มเติม (ขึ้นอยู่กับปัญหา)

  • OpenMPI 3.1.1
  • ย้ายแล้ว 4.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants