Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(test): flaky t0182-circuit-relay.sh #8967

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft

Conversation

lidel
Copy link
Member

@lidel lidel commented May 12, 2022

This PR removes fixup introduced in #8868 (comment) so we can dig into the underlying cause.

About the flaky test

It started failing every 3-10 run after we switched to go-libp2p 0.19.0 in #8868 – tests in go-ipfs/test/sharness/t0182-circuit-relay.sh became flaky.

ipfsi 2 swarm connect /p2p/$PEERID_1/p2p-circuit/p2p/$PEERID_0 sometimes fails with:

NO_RESERVATION (204)
Error: connect 12D3KooWLuY1Q13o6Q91NoJn4Qv7RSq5rYS6uC4YCzKCJ3S8GnwQ failure: failed to dial 12D3KooWLuY1Q13o6Q91NoJn4Qv7RSq5rYS6uC4YCzKCJ3S8GnwQ:
  * [/p2p/12D3KooWNfs8uFpQf6NnseZKLyS3c8EFppGtUGejHpViBeg8f7Vt/p2p-circuit] error opening relay circuit: NO_RESERVATION (204)`
  • it happens randomly, bumping it to 10 seconds does not help much, it still fails sometimes.
  • shutting down nodes and starting them again fixes the problem (what is what removed code did)
    • ...but it feels like covering up some underlying racy bug: once NO_RESERVATION error is returned, the relay will never work, no matter how long we wait and retry again (only way to fix it is to reboot of the relay node / testbed).

Repro (keep running t0182-circuit-relay.sh until the error):

$ killall ipfs ; i=0; while ./t0182-circuit-relay.sh -v; do echo -n " -----> $i <----\n\n\n"; sleep 1; ((i=i+1)) ; done

cc @aschmahmann @marten-seemann @schomatis – any ideas how to debug this?

@marten-seemann
Copy link
Member

Would be interesting to see why we don't get a reservation: GOLOG_LOG_LEVEL="autorelay=debug".

@lidel lidel self-assigned this May 12, 2022
@BigLep
Copy link
Contributor

BigLep commented May 12, 2022

2022-05-12 conversation: this may be relevant for the release if it means a new version of go-libp2p is needed. It may just be a problem with the tests though.

@schomatis
Copy link
Contributor

Who's maintaining https://github.com/ipfs/iptb? cc @BigLep

@lidel
Copy link
Member Author

lidel commented May 16, 2022

@schomatis afaik nobody – it was not touched since I remember (2018 or later)

@lidel lidel removed their assignment May 16, 2022
@BigLep
Copy link
Contributor

BigLep commented May 26, 2022

2022-05-26 conversation: we're not viewing this as release blocking for 0.13 because even if there are some issues with circuit relay v2, we at least have mostly working hole punching. 0.13 is better than before.

The ideal quick thing to do is to see if this is a new regression or long standing. We could use an older version and run Lidel's loop command to see if the flakiness manifests.

@schomatis
Copy link
Contributor

The ideal quick thing to do is to see if this is a new regression or long standing. We could use an older version and run Lidel's loop command to see if the flakiness manifests.

@lidel seems to have already pinpoint this to #8868 per OP.

I can get the detailed logs for @marten-seemann early next week but I'm going to push for either us really owning IPTB or replacing it altogether (with testground or whatever we use these days).

@schomatis schomatis self-assigned this May 26, 2022
@lidel
Copy link
Member Author

lidel commented May 31, 2022

Sounds upstream tests are flaky as well libp2p/go-libp2p#1576 (comment)

@schomatis schomatis force-pushed the fix/circuit-relay-test branch 2 times, most recently from e68bd2f to 43d520b Compare June 1, 2022 00:34
@schomatis
Copy link
Contributor

This one-liner hack was the easiest I've found to get the logs of IPTB nodes.

After the failed sharness test the logs are in

NID=0
cat "./trash directory.t0182-circuit-relay.sh/.iptb/testbeds/default/$NID/daemon.stderr"

Note that a successful sharness test run removes this directory (so it's left only after the failed one, copy the log before re-running) and restarting the node (iptb start) within the sharness test also truncates the log file.

I'm seeing an error in node 0, but you'd likely need to extend the log subsystem to get more details:

DEBUG	autorelay	autorelay/relay_finder.go:224	node 12D3KooWFMQMKKrDVsQ3Qwks615pZHiSsDkg57Pafsav53BxxX7T not accepted as a candidate: error connecting to relay 12D3KooWFMQMKKrDVsQ3Qwks615pZHiSsDkg57Pafsav53BxxX7T: failed to dial 12D3KooWFMQMKKrDVsQ3Qwks615pZHiSsDkg57Pafsav53BxxX7T:
  * [/ip4/127.0.0.1/tcp/38569] dial tcp4 127.0.0.1:38569: connect: connection refused

@schomatis
Copy link
Contributor

cc @lidel @marten-seemann

@schomatis schomatis removed their request for review June 1, 2022 00:43
@schomatis schomatis removed their assignment Jun 1, 2022
@BigLep
Copy link
Contributor

BigLep commented Jun 14, 2022

2022-06-14 conversation: we're going to wait until the next go-libp2p release before investigating this further since there are changes coming.

@BigLep BigLep requested review from aschmahmann and removed request for aschmahmann September 1, 2022 01:36
@lidel
Copy link
Member Author

lidel commented Sep 1, 2022

go-libp2p v0.21 landed in #9074, rebasing this PR to see if the problem is still present

lidel and others added 2 commits September 1, 2022 15:36
Removes fixup introduced in #8868 (comment)
so we can dig into the underlying cause
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: 🥞 Todo
Development

Successfully merging this pull request may close these issues.

4 participants