DNS resolver failure mode #39

Open
reynir opened this issue Mar 11, 2024 · 7 comments

Comments

@reynir
Contributor

reynir commented Mar 11, 2024

On my home network the router listens on TCP port 53, but when querying the DNS resolver over TCP the resolver does not respond. This is an (annoying) failure mode that we currently don't handle.

$ nc -v 192.168.1.1 53
Connection to 192.168.1.1 53 port [tcp/domain] succeeded!
^C
$ dig +tcp @192.168.1.1 reyn.ir
;; communications error to 192.168.1.1#53: timed out
;; communications error to 192.168.1.1#53: timed out
;; communications error to 192.168.1.1#53: timed out

; <<>> DiG 9.18.24-1-Debian <<>> +tcp @192.168.1.1 reyn.ir
; (1 server found)
;; global options: +cmd
;; no servers could be reached
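
The two commands above can be folded into one probe that tells "the TCP handshake succeeded" apart from "the resolver answered over TCP". This is only a minimal sketch in Python, not code from dns-client; the function names (`build_query`, `frame_for_tcp`, `probe_tcp`) and the three result strings are made up for illustration.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Encode a minimal DNS query: one question, recursion desired."""
    # Header: id, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Root label terminates the name, then QTYPE (default A) and QCLASS IN.
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def frame_for_tcp(message: bytes) -> bytes:
    """DNS over TCP prefixes every message with a two-byte length."""
    return struct.pack("!H", len(message)) + message


def probe_tcp(server: str, name: str, port: int = 53, timeout: float = 3.0) -> str:
    """Return 'unreachable', 'connected-but-silent', or 'answered'."""
    try:
        sock = socket.create_connection((server, port), timeout=timeout)
    except OSError:
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(frame_for_tcp(build_query(name)))
            data = sock.recv(2)  # start of the length prefix of a reply
        except OSError:  # includes the timeout raised by recv
            return "connected-but-silent"
    # An empty read means the peer closed the connection without replying.
    return "answered" if data else "connected-but-silent"
```

On a network like the one described here, `probe_tcp("192.168.1.1", "reyn.ir")` would be expected to report `connected-but-silent`, which is the case the client currently treats as a usable resolver.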
@reynir
Contributor Author

reynir commented Mar 11, 2024

I observe this in http-lwt-client as a DNS timeout, even though I have a more responsive name server as the second entry in resolv.conf.

@hannesm
Contributor

hannesm commented Mar 11, 2024

Hmm, so there are two ways forward I guess:

  • (a) make dns-client-lwt work with UDP (the mirage part is already able to use UDP)
  • (b1) apply the happy-eyeballs getaddrinfo injection changes and (b2) have happy-eyeballs-lwt use Unix.getaddrinfo / Lwt_unix.getaddrinfo (i.e. not our DNS stack)

I'm a fan of (b1). And still undecided about (b2) or (a) -- while (b2) has the advantage that we won't have to mess around with it anymore, the disadvantage is that the libc resolver is used (i.e. potential security issues; also, using dns-client less eventually leads to more bugs in it). The disadvantage of (a) is that it is rather complicated (when to use TCP / when to use UDP, and esp. in scenarios like the one described above, what the default should be and what the error behaviour should be).

I remember that the dns-resolver code has some parts about retransmitting queries and using TCP if truncated etc. -- would be nice to leverage (maybe first test it more and debug issues) that code for reuse between dns-client and dns-resolver eventually.

So, which path to take? Should we at least take a look at (b1)? From that point it'd be easier to move, I suspect. (And both Unix.getaddrinfo and dns-client-lwt could be options, the only question is what to use as default -- and currently I lean towards getaddrinfo). WDYT?
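
To make the "when to use tcp / when to use udp" question in (a) concrete, here is a rough sketch of the common policy of asking over UDP first and retrying over TCP only when the reply has the truncation (TC) bit set. It is written in Python purely for illustration and is not the dns-resolver or dns-client code; every name in it (`is_truncated`, `query_udp`, `query_tcp`, `resolve`) is hypothetical, the UDP path assumes an IPv4 server, and there is no retransmission logic.

```python
import socket
import struct

TC_BIT = 0x0200  # "truncated" flag in the 16-bit DNS header flags field


def is_truncated(reply: bytes) -> bool:
    """Check the TC bit of a raw DNS reply."""
    if len(reply) < 12:
        raise ValueError("DNS message shorter than its 12-byte header")
    (flags,) = struct.unpack("!H", reply[2:4])
    return bool(flags & TC_BIT)


def recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly `count` bytes from a stream socket."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("peer closed before the full reply arrived")
        data += chunk
    return data


def query_udp(server: str, query: bytes, port: int = 53, timeout: float = 2.0) -> bytes:
    """Send one raw DNS query over UDP (IPv4 only in this sketch)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, port))
        reply, _ = sock.recvfrom(4096)
        return reply


def query_tcp(server: str, query: bytes, port: int = 53, timeout: float = 2.0) -> bytes:
    """Send one raw DNS query over TCP, using the two-byte length framing."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(query)) + query)
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        return recv_exact(sock, length)


def resolve(server: str, query: bytes, udp=query_udp, tcp=query_tcp) -> bytes:
    """Ask over UDP first; retry over TCP only if the reply was truncated."""
    reply = udp(server, query)
    if is_truncated(reply):
        reply = tcp(server, query)
    return reply
```

With a policy like this, a router that accepts TCP connections but never answers would only matter for truncated replies. What the error behaviour should be in that case is still the open question.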

@gsportrix

I don't know if this is helpful at all, but I found that I can only connect to some servers at home using http-lwt-client:
www.ocaml.org shows a DNS request timeout, while
www.google.com works.

I've tried some more; it feels like two out of 10 work...

I don't see any differences in the output of dig.

@gsportrix

One thing...
Yesterday evening it worked at home... oh, I am connected to my company's VPN.
Turning off VPN ... DNS request timeout - Dig query times in the 10 to 25 ms range
Turning on VPN everything is fine again... Dig query times in the 5 to 15 ms range

@reynir
Contributor Author

reynir commented Mar 13, 2024

I think it is worthwhile to implement udp in dns-client-lwt either way.

My mental model of this is that the DNS happy-eyeballs observes a successful TCP handshake and considers it a done deal. Then my resolver doesn't reply and the request times out. It seems no other nameservers are attempted since doing the TCP handshake is considered a success?! So I don't know if we should try to communicate back to happy-eyeballs "don't try this nameserver+port next".
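
To illustrate the alternative: a nameserver would only count as successful once a DNS reply has actually arrived, and a timeout after a successful handshake would move on to the next entry. The Python below is just a sketch of that idea, not how happy-eyeballs or dns-client are structured; `resolve_with_fallback`, `AllServersFailed` and the `ask` callback are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple

Server = Tuple[str, int]  # (address, port)


class AllServersFailed(Exception):
    """Raised when no nameserver produced a reply; keeps the per-server errors."""

    def __init__(self, errors: List[Tuple[Server, Exception]]):
        details = "; ".join(f"{host}:{port}: {err}" for (host, port), err in errors)
        super().__init__(details or "no nameservers given")
        self.errors = errors


def resolve_with_fallback(
    servers: Iterable[Server],
    query: bytes,
    ask: Callable[[Server, bytes], bytes],
) -> Tuple[Server, bytes]:
    """Try each nameserver in order and return the first actual reply.

    `ask` performs the whole exchange (connect, send, wait for the reply)
    and raises OSError, which includes timeouts, on any failure. A server
    that completes the TCP handshake but never answers therefore fails
    here just like one that cannot be reached at all.
    """
    errors: List[Tuple[Server, Exception]] = []
    for server in servers:
        try:
            return server, ask(server, query)
        except OSError as err:
            errors.append((server, err))
    raise AllServersFailed(errors)
```

Whether this should live in the DNS client or be reported back to happy-eyeballs as "don't try this nameserver+port next" is exactly the design question; the sketch only shows the intended outcome.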

@gsportrix Can you test with dig +tcp and see if that fails?

@gsportrix

gsportrix commented Mar 13, 2024

@reynir
I have tried dig +tcp; it works in both cases.
Only the DNS server itself changed.

So I tried different DNS servers:
8.8.8.8 works with http-lwt-client
1.1.1.1 works with http-lwt-client

84.200.69.80 works sometimes with http-lwt-client

sometimes not:

  • DNS request timeout
  • error connection to 84.200.69.80 failed: timeout connecting to resolver 84.200.69.80:853, 84.200.69.80:53

Dig outputs...

➜  dig +tcp @84.200.69.80 www.ocaml.org

; <<>> DiG 9.10.6 <<>> +tcp @84.200.69.80 www.ocaml.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25497
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.ocaml.org.			IN	A

;; ANSWER SECTION:
www.ocaml.org.		155	IN	A	51.159.83.169

;; Query time: 758 msec
;; SERVER: 84.200.69.80#53(84.200.69.80)
;; WHEN: Wed Mar 13 22:02:28 CET 2024
;; MSG SIZE  rcvd: 58

➜  dig +tcp @84.200.69.80 www.ocaml.org
;; Connection to 84.200.69.80#53(84.200.69.80) for www.ocaml.org failed: timed out.
;; Connection to 84.200.69.80#53(84.200.69.80) for www.ocaml.org failed: timed out.

; <<>> DiG 9.10.6 <<>> +tcp @84.200.69.80 www.ocaml.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45860
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.ocaml.org.			IN	A

;; ANSWER SECTION:
www.ocaml.org.		134	IN	A	51.159.83.169

;; Query time: 14 msec
;; SERVER: 84.200.69.80#53(84.200.69.80)
;; WHEN: Wed Mar 13 22:02:50 CET 2024
;; MSG SIZE  rcvd: 58

So for now it's just my silly homebox that does not work, without any hints in the dig output...

@hannesm
Contributor

hannesm commented May 29, 2024

While this issue has not been addressed, since happy-eyeballs 1.0.0, http-lwt-client will use the standard getaddrinfo() interface instead of the DNS stack developed in OCaml. So, if you upgrade to happy-eyeballs >= 1.0.0, you shouldn't encounter this issue anymore with http-lwt-client.

I will leave this issue open, since it seems we still should improve the failure mode of the DNS client.
