If getaddrinfo fails, we back off in an exponential curve. We should limit the back-off to a maximum period of 5 minutes #216

Closed
csampsonza opened this issue Jul 15, 2019 · 5 comments

Comments

@csampsonza
Contributor

TR4248

If getaddrinfo fails, we back off in an exponential curve. We should be able to configure the maximum back-off time.

The following is extracted from the attached AE log.

2019/07/13	03:21:46	918		-	100037ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:23:58	230		-	101143ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:26:17	333		-	108828ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:28:56	428		-	128849ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:31:38	650		-	131927ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:34:36	673		-	147823ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:37:27	678		-	140726ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:40:47	837		-	169979ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:44:37	161		-	199142ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:48:39	471		-	120206ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:52:57	562		-	227898ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	03:57:39	934		-	252196ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	04:03:18	80		-	307926ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	04:09:20	484		-	332228ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	04:16:39	795		-	409076ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	04:24:41	238		-	451264ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	04:34:19	627		-	548201ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	04:45:20	184		-	630358ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	04:58:20	818		-	750194ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	05:13:52	398		-	901351ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	05:32:33	309		-	1090680ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	05:55:04	614		-	1321094ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	06:22:44	995		-	1629985ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	06:57:06	665		-	2031418ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	07:40:48	814		-	2591651ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	08:36:33	498		-	1766475ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	09:48:35	173		-	1569633ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	11:23:10	390		-	5644812ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	13:29:06	8		-	7525088ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	16:19:04	813		-	10168398ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/13	20:11:56	160		-	13940936ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/14	01:36:02	30		-	19415491ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/14	09:14:04	648		-	27452194ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
2019/07/14	20:10:57	175		-	39382102ms	(fieldpop) { Error: getaddrinfo EAI_AGAIN www.fieldpop.io www.fieldpop.io:443
@csampsonza
Contributor Author

ae-log-bad-gateway.txt

@JohanvdWest JohanvdWest transferred this issue from happner/happn Jul 30, 2019
@southbite
Contributor

southbite commented Jan 7, 2020

Hi guys,

EAI_AGAIN errors indicate a temporary failure to reach the DNS servers themselves (see https://stackoverflow.com/questions/40182121/error-getaddrinfo-eai-again), normally caused by an intermittent connection.

I have now tested this in some detail: the back-off is limited to 3 minutes by default, which I can verify. The back-off factor is 2, so the back-off times should be doubling, which they are not here. It looks like an intermittent DNS lookup issue, because the interval between these failures contracts and expands arbitrarily instead of doubling.

Are we absolutely sure this is not a red herring?
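
For reference, here is a minimal sketch of what a capped exponential back-off with a factor of 2 and a 3-minute ceiling would look like. This is not the primus or happn implementation; the function name `backoffDelay` and the 500 ms starting delay are assumptions made purely for illustration.

```typescript
// Hypothetical helper, NOT taken from primus or happn.
// minMs (500) is an assumed starting delay; factor 2 and the
// 3-minute cap are the values quoted in the comment above.
function backoffDelay(
  attempt: number,
  minMs: number = 500,
  factor: number = 2,
  maxMs: number = 3 * 60 * 1000,
): number {
  // Delay grows as minMs * factor^attempt, but never exceeds maxMs.
  return Math.min(maxMs, minMs * Math.pow(factor, attempt));
}

// Print the delay schedule for the first 12 reconnect attempts.
for (let attempt = 0; attempt < 12; attempt++) {
  console.log(`attempt ${attempt}: wait ${backoffDelay(attempt)} ms`);
}
```

Under these assumptions the delay doubles on each attempt until it reaches the cap and then stays there, so no interval between retries would exceed 180000 ms. The later intervals in the log above are far longer than that.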

@csampsonza
Contributor Author

csampsonza commented Jan 8, 2020

The log captured in the issue is filtered to show only how often we see EAI_AGAIN. Please ignore the delta time and look at the timestamps of the log entries. On the 14th, we only try to connect 3 times all day, so the back-off is not working as expected.

@southbite
Contributor

southbite commented Jan 8, 2020

I still don't think this is primus's back-off, which I can verify is limited to 3 minutes. Rather, it is an emergent issue that results from the event loop being blocked. I found this: nodejs/node#8436, so Johan was on the right track in working around it by doing a DNS resolve first. I am looking at doing the same on the happn client and have made some headway testing for the issue. What is nice is that this may also be linked to happner/happner-2#229, so solving it may kill two birds with one stone.
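
A rough sketch of the "DNS resolve first" idea follows. It is not Johan's actual workaround or the happn client code; `connectAfterDnsCheck`, the `connect` callback, and the injectable `resolve` parameter are hypothetical names introduced here. The default resolver uses `lookup` from Node's `dns/promises` module, which is an assumption about which DNS call the workaround relies on.

```typescript
import { lookup } from "node:dns/promises";

// A resolver is any function that rejects when the hostname cannot be resolved.
type Resolver = (hostname: string) => Promise<unknown>;

// Hypothetical helper: check that the hostname resolves before attempting
// the real connection, so a DNS failure (for example EAI_AGAIN) surfaces
// here instead of inside the reconnect logic.
async function connectAfterDnsCheck<T>(
  hostname: string,
  connect: () => Promise<T>,
  resolve: Resolver = (h) => lookup(h),
): Promise<T> {
  await resolve(hostname); // rejects if DNS is unavailable
  return connect(); // only reached when the hostname resolved
}
```

The caller can catch the rejection and schedule its own retry on a fixed or capped interval. Making the resolver injectable keeps the helper testable without network access.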

@JohanvdWest
Contributor

Hi @southbite, I've discussed this with Craig. The problem occurs when we try to connect at startup and there is no internet connection. Once we are connected and the connection breaks, we don't have this problem, which means primus and happn are behaving as expected. I think this ticket can be closed.
