
Lotus sync issue: libp2p 0.31.1 to 0.33.2 regression #2764

Closed
Stebalien opened this issue Apr 11, 2024 · 25 comments

Comments

@Stebalien
Member

We've seen reports of a chain-sync regression between lotus 1.25 and 1.26. Notably:

  1. We updated go-libp2p from v0.31.1 to v0.33.2.
  2. I've seen reports of peers failing to resume sync after transient network issues.
  3. Users are reporting "low" peer counts.

We're not entirely sure what's going on, but I'm starting an issue here so we can track things.

@Stebalien
Member Author

Stebalien commented Apr 11, 2024

My first guess, given (2), is libp2p/specs#573 (comment). This is unconfirmed, but high on my list.

  • Test: does disabling TCP reuseport fix this? (See the sketch just below.)
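
For the record, a minimal sketch of what that test could look like, assuming the TCP transport options in the current go-libp2p tree (`tcp.DisableReuseport()`; the `LIBP2P_TCP_REUSEPORT` environment variable should behave similarly) rather than the actual Lotus wiring:

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	tcp "github.com/libp2p/go-libp2p/p2p/transport/tcp"
)

func main() {
	// Build a host whose TCP transport has reuseport turned off.
	// Assumption: setting LIBP2P_TCP_REUSEPORT=false in the environment
	// gives a similar result without a code change.
	h, err := libp2p.New(
		libp2p.Transport(tcp.NewTCPTransport, tcp.DisableReuseport()),
	)
	if err != nil {
		panic(err)
	}
	defer h.Close()
	fmt.Println("host running without TCP reuseport:", h.ID())
}
```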

@Stebalien
Member Author

My second guess is #2650. This wouldn't be the fault of libp2p, but TLS may be more impacted by the GFW? That seems unlikely...

@Stebalien
Member Author

My third guess is something related to QUIC changes.

@MarcoPolo
Collaborator

Have you been able to repro 2 or 3 locally?

  • For the GFW theory, we could try connecting to peers over both TLS and Noise and see if there's a difference (a host-config sketch follows after this list).
  • Can you run lotus 1.26 on the older version of go-libp2p and see if you still see any errors?
  • Is the transient network issue something that would affect my connectivity to everyone or only a subset of peers? e.g., is my internet down, or is my connection to a subset down?
  • For a typical well-behaved node, what's the breakdown in connection types (TCP+TLS, QUIC, TCP+Noise)? For a node seeing this regression, what is its breakdown?
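
On the first bullet, a rough sketch of how the TLS-vs-Noise comparison could be wired up; the helper names are hypothetical and the package paths and protocol IDs are assumed from the current go-libp2p layout:

```go
package tlsvsnoise

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	noise "github.com/libp2p/go-libp2p/p2p/security/noise"
	libp2ptls "github.com/libp2p/go-libp2p/p2p/security/tls"
)

// NewTLSOnlyHost offers only TLS for the security handshake.
func NewTLSOnlyHost() (host.Host, error) {
	return libp2p.New(libp2p.Security(libp2ptls.ID, libp2ptls.New))
}

// NewNoiseOnlyHost offers only Noise for the security handshake.
func NewNoiseOnlyHost() (host.Host, error) {
	return libp2p.New(libp2p.Security(noise.ID, noise.New))
}
```

Dialing the same peer set from each host and comparing failure rates would show whether TLS connections are disproportionately affected.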

@Stebalien
Member Author

I can't repro this at the moment, unfortunately (not at home, node down). But I'll do some more digging later this week.

@Stebalien
Member Author

Ok, I got one confirmation that disabling reuseport seems to fix the issue and one report that it makes no difference.

@Stebalien
Member Author

Ok, that confirmation appeared to be a fluke. This doesn't appear to have been the issue.

@sukunrt
Member

sukunrt commented Apr 25, 2024

From eyeballing the commits, I can see that the major changes apart from WebRTC are:

  • Upgraded QUIC
  • Implemented Happy Eyeballs for TCP
  • Removed multistream simultaneous connect

Can we test this with a QUIC-only node and a TCP-only node to see if it's a problem with QUIC or TCP?
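
A rough sketch of how the two test nodes could be configured; the helper names are hypothetical and the transport constructors and listen addresses are assumed from the current go-libp2p layout:

```go
package transporttest

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	quic "github.com/libp2p/go-libp2p/p2p/transport/quic"
	tcp "github.com/libp2p/go-libp2p/p2p/transport/tcp"
)

// NewQUICOnlyHost registers only the QUIC transport and a QUIC listen address.
func NewQUICOnlyHost() (host.Host, error) {
	return libp2p.New(
		libp2p.Transport(quic.NewTransport),
		libp2p.ListenAddrStrings("/ip4/0.0.0.0/udp/0/quic-v1"),
	)
}

// NewTCPOnlyHost registers only the TCP transport and a TCP listen address.
func NewTCPOnlyHost() (host.Host, error) {
	return libp2p.New(
		libp2p.Transport(tcp.NewTCPTransport),
		libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/0"),
	)
}
```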

@Stebalien
Member Author

I'll try. Unfortunately, the issue is hard to reproduce and tends to happen in production (it's hard to get people to run random patches). Right now we're waiting on goroutine dumps, hoping to get a bit of an idea about what might be stuck (it may not even be libp2p).
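
For anyone collecting the same data: if the node exposes Go's standard net/http/pprof handlers (the Lotus-specific endpoint and port may differ; localhost:6060 below is just a placeholder), a full goroutine dump can be pulled like this:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running, fetch a full goroutine dump with:
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutines.txt
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```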

@vyzo
Contributor

vyzo commented Apr 25, 2024

It might be the silently broken PX (peer exchange) -- see libp2p/go-libp2p-pubsub#555

@vyzo
Contributor

vyzo commented Apr 25, 2024

I am almost certain this is the culprit as the bootstrap really relies on it.

@Stebalien
Member Author

Ah, that would definitely explain it.

@MarcoPolo
Collaborator

I thought that could be it as well, but I was thrown off by the premise that this wasn't an issue in v0.31.1.

PX broke after this change: #2325, which was included in the v0.28.0 release. So v0.31.1 should have the same PX issue.

@vyzo
Contributor

vyzo commented Apr 25, 2024

I can't imagine what else it could be.
Was there a recent "mandatory release" where everyone upgraded to the more recent libp2p?

@MarcoPolo
Collaborator

Users are reporting "low" peer counts.

Are these low peer counts the number of peers in your gossipsub mesh, or the number of peers you are actually connected to?

@sukunrt
Member

sukunrt commented Apr 25, 2024

Do we know if these nodes are running both QUIC and TCP? If so, it's unlikely that the problem is with either transport; it's probably at a layer above the go-libp2p transports.

@rjan90

rjan90 commented May 3, 2024

Are these low peer counts the number of peers in your gossipsub mesh, or the number of peers you are actually connected to?

Just chiming in here from the Lotus side: it's the number of peers we are connected to. After upgrading to 0.33.2, the count is around:

lotus info
Network: mainnet
Peers to: [publish messages 105] [publish blocks 106]

On the previous version (0.33.1), it was stable around the 200 range.

@MarcoPolo
Collaborator

Just chiming in here from the Lotus side: it's the number of peers we are connected to. After upgrading to 0.33.2, the count is around:

lotus info
Network: mainnet
Peers to: [publish messages 105] [publish blocks 106]

On the previous version (0.33.1), it was stable around the 200 range.

I think these are the number of peers in your gossipsub topic mesh, which is a subset of the peers you are actually connected to. Could you find the number of peers you are connected to, and compare that between versions?
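
A sketch of one way to compare the two numbers from inside a node, assuming a go-libp2p host and a go-libp2p-pubsub instance are already wired up (`Report` and the topic name are placeholders, not existing Lotus code):

```go
package peercount

import (
	"fmt"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/host"
)

// Report prints total swarm connections versus peers known to be subscribed
// to the given gossipsub topic (a superset of the mesh, but still only a
// subset of all connected peers).
func Report(h host.Host, ps *pubsub.PubSub, topic string) {
	connected := h.Network().Peers() // every peer we currently hold a connection to
	inTopic := ps.ListPeers(topic)   // peers we know are subscribed to this topic
	fmt.Printf("connected peers: %d, peers in topic %q: %d\n",
		len(connected), topic, len(inTopic))
}
```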

@sukunrt
Member

sukunrt commented Jun 20, 2024

Did the situation improve after gossipsub v0.11 and go-libp2p v0.34?

@Stebalien
Member Author

We'll likely need to wait for the network to upgrade (~August) to see results.

@hsanjuan
Contributor

hsanjuan commented Jun 25, 2024

I have a user with a large ipfs-cluster (>1000 peers) complaining of issues that are consistent with pubsub propagation failures. The issue happens both with go-libp2p v0.33.2 + go-libp2p-pubsub v0.10.0 and with go-libp2p v0.35.1 + go-libp2p-pubsub v0.11.0. I can't say with 100% certainty that it is the same issue as Lotus, but "low peer counts" is a symptom, and it is apparently still happening.

How confident are we that it was fixed?

@Stebalien
Member Author

We're confident that we fixed an issue, but there may be others. My initial thought was #2764 (comment), but if that cluster uses QUIC it shouldn't be affected by that.

@rjan90

rjan90 commented Jun 30, 2024

Did the situation improve after gossipsub v0.11 and go-libp2p v0.34?

So it has improved since upgrading to these versions, and the number of peers is now hovering more stably around 300 with the same machine:

lotus info
Network: mainnet
StartTime: 452h37m58s (started at 2024-06-11 15:28:26 +0200 CEST)
Chain: [sync ok] [basefee 100 aFIL] [epoch 4047852]
Peers to: [publish messages 308] [publish blocks 318]

As Steven notes, the real test will be the network upgrade in August, as that is when most of these issues surface, with people upgrading and reconnecting to the network.

@hsanjuan
Contributor

hsanjuan commented Jul 1, 2024

We're confident that we fixed an issue, but there may be others. My initial thought was #2764 (comment), but if that cluster uses QUIC it shouldn't be affected by that.

Good news: it seems that the issue I described was a user configuration error in the end (very low limits in the connection manager).

@rjan90

rjan90 commented Aug 6, 2024

Now that the Filecoin mainnet has upgraded to NV23, and with that a very large percentage of nodes have probably updated to the go-libp2p v0.35.4 release, I'm seeing a significantly larger number of peers that I'm connected to. It is 5x higher than the number of peers I was connected to with the same machine in May:

lotus info
Network: mainnet
StartTime: 222h51m57s (started at 2024-07-28 10:30:12 +0200 CEST)
Chain: [sync ok] [basefee 100 aFIL] [epoch 4155044]
Peers to: [publish messages 473] [publish blocks 490]

I think we can close this issue now, and open more narrowly scoped issues if we encounter other problems.
