1.7.0-rc1 showing tcp_ep.c:739 Assertion hdr->length <= (iface->config.rx_seg_size - sizeof(uct_tcp_am_hdr_t)) #4525
Comments
Managed to trigger it again.
@akesandgren is it possible to reproduce this with the UCX master branch? Could you check it, pls?
I can try. It takes anywhere from 0.5h to 2d before it happens. And it'll take a while before I have time to look at it, but not too long.
thank you! good to know.
Master (4f90824) still fails with: Will try with 4559 on top of that, but it takes a long time...
Hmmm, that one, dmitrygx@80a2251, doesn't even build:
(I was a little lazy and just took the tree at that commit instead of applying 4559 on top of the master tree I had tested with)
yes, picking up the commit has to work.
So your PR needs a merge with master... I picked the changes manually instead and it's now running...
@akesandgren thank you!
great! hope this gives us more information to investigate the problem
With 4559 on top of master commit 4f90824 I still get:
Both my test runs got the same assert as above.
@akesandgren thank you for testing this!
Thank you in advance for answering!
It's not a benchmark but rather user code. And I'm fairly sure it's not a problem with the user code as such. We've seen the original problem, #4406, on multiple codes, including IOR as mentioned in that issue. My favorite suspicion is external port sniffers; we do not have any firewalls in place.
which is fairly annoying to users. I think that type of event, when coming from hosts not in the Comm, should be handled without printing a message.
The messages come from the OMPI/BTL/TCP module and are not related to UCX
Yeah, I know, I was just venting frustration a bit :-) We get a lot of these before the assert. And usually not related in time, though it's a bit hard to tell with a 3 day job...
yes, I see. Thank you!
Let me know when it's ready and I'll kick off a new build/run... Going on a holiday trip tomorrow though, so latency will be somewhat higher...
@akesandgren thank you! will let you know when it's ready
Could you check out the following changes from my GitHub UCX fork: Thank you in advance for the help in debugging the issue!
Tests running, I'll return with results when they show up...
First crash:
130.239.242.224 is one of the two nodes in that job. And the job is node exclusive.
@akesandgren great! thank you so much for collecting this trace!
Could you try my preliminary possible fix here (the same branch, pls do
Ok, test running. If it works it will take several days before it finishes.
FYI, second crash (running without the fix) was the same. So at least nothing else has popped up yet. The runs with the fix are still running.
@akesandgren thank you! waiting for the test result
@akesandgren if the crash still persists with the provided fix, let's try to use v1.5.x UCX branch where there is no ring buffer that is used to avoid excessive
One test run with the fix has crashed with:
Regarding 1.5.x, you mean something like 1.5.2? Any risk of problems using that instead of 1.6.1, which this OpenMPI version is built with? I.e. do I need to rebuild OpenMPI to use 1.5.2, or can I, like I did for 1.7.0 and your branches, just load an older module of UCX? Just wanting to make sure since I'm going backwards this time...
@akesandgren thank you for updating me!
The next crash (with the fix in place) shows
130.239.242.221 == b-cn1433
Describe the bug
Got a sporadic
tcp_ep.c:739 Assertion `hdr->length <= (iface->config.rx_seg_size - sizeof(uct_tcp_am_hdr_t))' failed
when chasing issue #4406
Could this be caused by outside connection attempts by port sniffers?
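To make the port-sniffer question concrete, here is a minimal sketch of how bytes from a non-UCX peer could trip a length check shaped like the asserted condition. Everything in it is an assumption for illustration: the 5-byte header layout (1-byte id followed by a 4-byte length), the `demo_*` names, and the 8192-byte segment size are hypothetical and are not taken from the UCX sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire header size: 1-byte id + 4-byte payload length.
 * This is an assumed layout, not the real uct_tcp_am_hdr_t. */
#define DEMO_HDR_SIZE 5u

/* Read the 4-byte length field that follows the id byte.
 * memcpy avoids an unaligned load; host byte order is assumed. */
static uint32_t demo_hdr_length(const unsigned char *rx_buf)
{
    uint32_t length;

    memcpy(&length, rx_buf + 1, sizeof(length));
    return length;
}

/* Return 1 if the claimed payload fits in one receive segment after the
 * header, 0 otherwise. Same shape as the failing assertion, but reported
 * as a result instead of aborting. */
static int demo_hdr_length_ok(const unsigned char *rx_buf, size_t rx_seg_size)
{
    assert(rx_seg_size >= DEMO_HDR_SIZE);
    return demo_hdr_length(rx_buf) <= rx_seg_size - DEMO_HDR_SIZE;
}
```

With an assumed 8192-byte segment, a buffer starting with `GET / HTTP/1.1` would be read as a header whose length field holds the bytes `ET /`, a value in the hundreds of millions on either byte order, so `demo_hdr_length_ok` returns 0. Whether this is what actually happens here is exactly the open question.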
Steps to Reproduce
Setup and versions
Ubuntu 16.04.6, HWE kernel 4.15.0-70-generic
Additional information (depending on the issue)