Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AutoNAT requests time out #3986

Open
Tracked by #4524
b-zee opened this issue May 25, 2023 · 8 comments
Open
Tracked by #4524

AutoNAT requests time out #3986

b-zee opened this issue May 25, 2023 · 8 comments
Labels

Comments

@b-zee
Copy link
Contributor

b-zee commented May 25, 2023

Summary

About 20 publicy reachable peers are added as AutoNAT servers and are probed. From behind a NAT. After the timeout (set to 15s), AutoNAT gives an error OutboundRequest(Timeout) and probes another peer again. This continues, but sometimes a Response(DialError) is reported and that results in an AutoNAT status change.

It's unclear to me why the OutboundRequest(Timeout) occurs. This stems from the request_response protocol. I assume an AutoNAT server can't dial a NAT peer, and takes too long to answer the dial-back probe request.

Expected behaviour

Either the timeouts not to happen, or the timeouts to cause the AutoNAT status to change to Private eventually.

Actual behaviour

Sometimes it takes more than a few minutes (up to 20) for a node to determine it's Private. This only happens after a DialError, no the Timeout.

Debug Output

[2023-05-22T12:28:33.436488Z] Request { probe_id: ProbeId(0), peer: PeerId("12D3KooWAbnBXNeWyy3HAqpV7aFfos9Wv4sPapetvz2Z9oYJ9U81") }
[2023-05-22T12:28:48.437264Z] Error { probe_id: ProbeId(0), peer: Some(PeerId("12D3KooWAbnBXNeWyy3HAqpV7aFfos9Wv4sPapetvz2Z9oYJ9U81")), error: OutboundRequest(Timeout) }
[2023-05-22T12:28:48.437329Z] Request { probe_id: ProbeId(1), peer: PeerId("12D3KooWDzmAGQ9ykuZSb891Df6E8FxkUXJgr9XV38VgPPzgQzx3") }
[2023-05-22T12:29:03.437614Z] Error { probe_id: ProbeId(1), peer: Some(PeerId("12D3KooWDzmAGQ9ykuZSb891Df6E8FxkUXJgr9XV38VgPPzgQzx3")), error: OutboundRequest(Timeout) }
[2023-05-22T12:29:03.437687Z] Request { probe_id: ProbeId(2), peer: PeerId("12D3KooWDzmAGQ9ykuZSb891Df6E8FxkUXJgr9XV38VgPPzgQzx3") }
[2023-05-22T12:29:18.437944Z] Error { probe_id: ProbeId(2), peer: Some(PeerId("12D3KooWDzmAGQ9ykuZSb891Df6E8FxkUXJgr9XV38VgPPzgQzx3")), error: OutboundRequest(Timeout) }
[2023-05-22T12:29:18.438020Z] Request { probe_id: ProbeId(3), peer: PeerId("12D3KooWDzmAGQ9ykuZSb891Df6E8FxkUXJgr9XV38VgPPzgQzx3") }
[2023-05-22T12:29:33.438292Z] Error { probe_id: ProbeId(3), peer: Some(PeerId("12D3KooWDzmAGQ9ykuZSb891Df6E8FxkUXJgr9XV38VgPPzgQzx3")), error: OutboundRequest(Timeout) }
[2023-05-22T12:29:33.438356Z] Request { probe_id: ProbeId(4), peer: PeerId("12D3KooWSf3AciihWbr5sjKdNn55pWQbYqU53taT4RHU378mvDLa") }
[2023-05-22T12:29:48.438727Z] Error { probe_id: ProbeId(4), peer: Some(PeerId("12D3KooWSf3AciihWbr5sjKdNn55pWQbYqU53taT4RHU378mvDLa")), error: OutboundRequest(Timeout) }
[2023-05-22T12:29:48.438803Z] Request { probe_id: ProbeId(5), peer: PeerId("12D3KooWQ6goXTX4QawAzvvFgVymJ28MNXenh3r46bLKCkvKySzP") }
[2023-05-22T12:30:03.439084Z] Error { probe_id: ProbeId(5), peer: Some(PeerId("12D3KooWQ6goXTX4QawAzvvFgVymJ28MNXenh3r46bLKCkvKySzP")), error: OutboundRequest(Timeout) }
[2023-05-22T12:30:03.439166Z] Request { probe_id: ProbeId(6), peer: PeerId("12D3KooWLrZhGdBx8HKcvm2QcLCaZwWHPpj8ScwKYGsTFiV75vqq") }
[2023-05-22T12:30:19.179148Z] Error { probe_id: ProbeId(6), peer: Some(PeerId("12D3KooWLrZhGdBx8HKcvm2QcLCaZwWHPpj8ScwKYGsTFiV75vqq")), error: OutboundRequest(Timeout) }
[2023-05-22T12:30:19.179216Z] Request { probe_id: ProbeId(7), peer: PeerId("12D3KooWLrZhGdBx8HKcvm2QcLCaZwWHPpj8ScwKYGsTFiV75vqq") }
[2023-05-22T12:30:34.179434Z] Error { probe_id: ProbeId(7), peer: Some(PeerId("12D3KooWLrZhGdBx8HKcvm2QcLCaZwWHPpj8ScwKYGsTFiV75vqq")), error: OutboundRequest(Timeout) }
[2023-05-22T12:30:34.179507Z] Request { probe_id: ProbeId(9), peer: PeerId("12D3KooWQ6goXTX4QawAzvvFgVymJ28MNXenh3r46bLKCkvKySzP") }
[2023-05-22T12:30:49.179737Z] Error { probe_id: ProbeId(9), peer: Some(PeerId("12D3KooWQ6goXTX4QawAzvvFgVymJ28MNXenh3r46bLKCkvKySzP")), error: OutboundRequest(Timeout) }
[2023-05-22T12:30:49.179795Z] Request { probe_id: ProbeId(10), peer: PeerId("12D3KooWAHseCqSDRgJbGkhSEiZu6psQ97wVpFDjN6uUHNu5Hbeo") }
[2023-05-22T12:31:04.857627Z] Error { probe_id: ProbeId(10), peer: Some(PeerId("12D3KooWAHseCqSDRgJbGkhSEiZu6psQ97wVpFDjN6uUHNu5Hbeo")), error: OutboundRequest(Timeout) }
[2023-05-22T12:31:04.857685Z] Request { probe_id: ProbeId(11), peer: PeerId("12D3KooWC7cAfXzsrn2VFsR4kzUNZB2FovAtLosbRW43wsBhkesZ") }
[2023-05-22T12:31:19.857992Z] Error { probe_id: ProbeId(11), peer: Some(PeerId("12D3KooWC7cAfXzsrn2VFsR4kzUNZB2FovAtLosbRW43wsBhkesZ")), error: OutboundRequest(Timeout) }
[2023-05-22T12:31:19.858092Z] Request { probe_id: ProbeId(12), peer: PeerId("12D3KooWF9EwYGR7aon6SGzxHv5o2uFrJKEMuGmDjbhm4v9FH7pR") }
[2023-05-22T12:31:34.858414Z] Error { probe_id: ProbeId(12), peer: Some(PeerId("12D3KooWF9EwYGR7aon6SGzxHv5o2uFrJKEMuGmDjbhm4v9FH7pR")), error: OutboundRequest(Timeout) }
[2023-05-22T12:31:34.858492Z] Request { probe_id: ProbeId(13), peer: PeerId("12D3KooWDzmAGQ9ykuZSb891Df6E8FxkUXJgr9XV38VgPPzgQzx3") }
[2023-05-22T12:31:50.232443Z] Error { probe_id: ProbeId(13), peer: Some(PeerId("12D3KooWDzmAGQ9ykuZSb891Df6E8FxkUXJgr9XV38VgPPzgQzx3")), error: OutboundRequest(Timeout) }
[2023-05-22T12:31:50.232501Z] Request { probe_id: ProbeId(14), peer: PeerId("12D3KooWSgSKZ9LqMyW9shqbzqWDEpY5Hdkyi1frinqyTPRfbtL5") }
[2023-05-22T12:32:05.232871Z] Error { probe_id: ProbeId(14), peer: Some(PeerId("12D3KooWSgSKZ9LqMyW9shqbzqWDEpY5Hdkyi1frinqyTPRfbtL5")), error: OutboundRequest(Timeout) }
[2023-05-22T12:32:05.232933Z] Request { probe_id: ProbeId(15), peer: PeerId("12D3KooWAbnBXNeWyy3HAqpV7aFfos9Wv4sPapetvz2Z9oYJ9U81") }
[2023-05-22T12:32:20.233317Z] Error { probe_id: ProbeId(15), peer: Some(PeerId("12D3KooWAbnBXNeWyy3HAqpV7aFfos9Wv4sPapetvz2Z9oYJ9U81")), error: OutboundRequest(Timeout) }
[2023-05-22T12:32:20.233391Z] Request { probe_id: ProbeId(16), peer: PeerId("12D3KooWMnXSasKf1MWKLrwhBinYELzESoeHnpYjospjkMTV8ZUV") }
[2023-05-22T12:32:35.883220Z] Error { probe_id: ProbeId(16), peer: Some(PeerId("12D3KooWMnXSasKf1MWKLrwhBinYELzESoeHnpYjospjkMTV8ZUV")), error: OutboundRequest(Timeout) }
[2023-05-22T12:32:35.883400Z] Request { probe_id: ProbeId(17), peer: PeerId("12D3KooWQ6goXTX4QawAzvvFgVymJ28MNXenh3r46bLKCkvKySzP") }
[2023-05-22T12:32:45.434329Z] Error { probe_id: ProbeId(17), peer: Some(PeerId("12D3KooWQ6goXTX4QawAzvvFgVymJ28MNXenh3r46bLKCkvKySzP")), error: Response(DialError) }
[2023-05-22T12:32:45.434439Z] AutoNAT status changed: Unknown -> Private

Version

  • libp2p version (version number, commit, or branch): 0.51.3

Would you like to work on fixing this bug?

Maybe.

@b-zee
Copy link
Contributor Author

b-zee commented May 25, 2023

I do wonder how long it takes for a dial to timeout. Is this configurable? Does this depend on the OS or transport?

It seems increasing the AutoNAT timeout to 5 minutes works. A TCP dial fails after about 125s in my setup. I assume that means the AutoNAT timeout should currently at least be above the 125s. In that case, AutoNAT will report back a DialError and the AutoNAT client processes that and concludes it's private.

Looks like this started to arise for me when multiple AutoNAT servers are added. With only a single server that is constantly probed, the server will eventually get a DialError and by chance can communicate this back in a current request/response flow (that still has the 'window' opened, i.e. not timed out). With multiple servers that are probed, those servers will get dial errors, but it seems the AutoNAT probes are already timed out, so it's not communicated back.

@thomaseizinger
Copy link
Contributor

Two things:

  • We have AutoNAT v2 in the works, see add autonat v2 spec specs#538.
  • What exactly is timing out in your case? The connection to the AutoNAT server? We have tests that complete in < 1 sec so detecting whether you are reachable on an address shouldn't take as long as your are describing permitting that you are connected to the AutoNAT server.

@b-zee
Copy link
Contributor Author

b-zee commented May 25, 2023

What exactly is timing out in your case?

Two separate issues:

  • Client side: The probe request from a client (with request_response) does not get a response within the timeout configured for autonat (defaults to 30s). The probe request is received fine by the server.
  • Server side: The probe/dial that a server does to a (unreachable) client. On my system this is a TCP OutgoingConnectionError that occurs after 129s after dialing. (Which causes the first timeout, as the server doesn't send a response within the default 30s.)

I am not fully sure of the details of the test, but there could be a few things at play:

  • Sometimes a connection straight out refuses. This will probably happen locally. With a simple TcpStream::connect call to 127.0.0.1:<port>, where port is a random port we're not listening on, an error is immediately received: Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }
  • A router might refuse the connection too, also immediately causing the TCP connection to fail.
  • In other scenarios, TCP will just time out after some OS specified time.

b-zee added a commit to maidsafe/safe_network that referenced this issue May 25, 2023
github-merge-queue bot pushed a commit to maidsafe/safe_network that referenced this issue May 26, 2023
@thomaseizinger
Copy link
Contributor

Interesting, thank you for sharing!

I think we can draw two conclusions from this:

  1. We should recommend to people to lower their TCP dial timeout if they run an AutoNAT server.
  2. We should either:
    a. treat outbound stream timeouts as failed dials and thus assume the address is not reachable
    b. find a way for an AutoNAT server to determine its TCP dial timeout and send a message back to the client just before the stream timeout hits. (This would assume we can agree on such a timeout across all implementations)

@b-zee
Copy link
Contributor Author

b-zee commented May 26, 2023

We should recommend to people to lower their TCP dial timeout if they run an AutoNAT server.

Yesterday I briefly looked at the Linux settings for this, but I was unable to find exactly what to change (lots of settings that relate to SYN/ACK retries and timings). However, I think it's more appropriate to come up with a user land solution. E.g, if the socket connect call doesn't happen within a configured amount of time: timeout. There is a Rust std::net::TcpStream::connect_timeout (docs) method that does exactly this. In case of UDP/QUIC, which is stateless, this might have to be done some other way.

@mxinden
Copy link
Member

mxinden commented May 28, 2023

Using reasonable timeouts within rust-libp2p, instead of requiring operators to change their TCP stack, sounds reasonable to me. You could even do so within the libp2p-autonat NetworkBehaviour implementation, considering a ToSwarm::Dial as failed after a certain timeout with no NetworkBehaviour::on_connection_established.

@thomaseizinger
Copy link
Contributor

You could even do so within the libp2p-autonat NetworkBehaviour implementation, considering a ToSwarm::Dial as failed after a certain timeout with no NetworkBehaviour::on_connection_established.

You mean within the autonat server implementation?

@mxinden
Copy link
Member

mxinden commented May 31, 2023

You could even do so within the libp2p-autonat NetworkBehaviour implementation, considering a ToSwarm::Dial as failed after a certain timeout with no NetworkBehaviour::on_connection_established.

You mean within the autonat server implementation?

Yes, the libp2p-autonat server implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants