Prevent exit on ZK detection timeout #306
Conversation
errch reports errors for components other than zk as well, and if those fail (DNS/HTTP) then the server should probably die. If you want to treat zk separately, then use a separate error channel.
👍 updated
@@ -40,16 +40,17 @@ func main() {
 	// initialize resolver
 	config := records.SetConfig(*cjson)
 	res := resolver.New(Version, config)
-	errch := make(chan error)
+	genErrch := make(chan error)
errch
doesn't this confuse things a bit? what about generrs?
Thanks! Can you squash the commits into one?
Yup. Though would it be easier to close and reopen? Not 100% sure, but it would require rewriting the history. Does the build system cope with a force update?
you can force-push
Wait a minute. Thinking more about this, I don't think this approach is what we want. We'd like ZK to not time out at all, I think, so that it can connect at any point in the process run-time.
I don't understand this PR...
@karlkfi: Agreed, see my comment above.
Not timing out is debatable. Perhaps the timeout can be optional? Do we have a readiness endpoint on mesos-dns that fails until ZK is available? It might be nice to distinguish between alive & ready; it's a pattern Kubernetes adopted. NotAlive means it needs restarting. Alive but NotReady means it's still bootstrapping or that its dependencies are NotReady.
Mesos-DNS offers multiple services which can function without Zookeeper; it can even be configured without Zookeeper altogether. The timeout is not strictly needed as long as the code which connects keeps retrying.
Would be easy to remove the timeout. It's just a Ticker anyways. I don't see a reason to keep it.
Technically it's a Timer, not a Ticker. So it only happens once, as is. A Ticker would invoke repeatedly.
@sepiroth887: Yes, but in order to satisfy #284 you should make sure that the ZK detector keeps retrying to connect.
gah, yea, it is a Timer. sorry. Regardless, though, it's not really needed: the ZK detector already keeps retrying indefinitely. Sending yet another commit shortly.
Since
rabbit hole :D on it.
 	logging.VeryVerbose.Printf("new masters detected: %v", masters)
 	res.SetMasters(masters)
 	res.Reload()
 case err := <-errch:
-	logging.Error.Fatal(err)
+	log.Fatal(err)
Keep using the Error logger please.
this was unchanged. I'll update it though.
How was it unchanged if git lists it as changed?
ugh, yes. I was keeping it in line with lines 59 and 61. Want this changed as well to use the Error logger?
I simply want this line to not change at all.
@sepiroth887: Please squash all your commits into one and then force-push to your branch.
- Dedicated error channel for ZK timeouts
- Updated error channel naming
- ZK Timeout timer removed
- Removed references to ZKTimeout
- Use logging for Fatal errors in switch case
Force-pushed from 85249b9 to 963ce92
done.
For posterity: we added this initial ZK timeout feature because the old mesos-go zk detector implementation was buggy and would sometimes never connect. The problem with this is that even if the masters are initially configured statically in the config, the mesos-dns records that point to masters would get out of date if the master VMs were recycled. This happened in production.

The newer mesos-go zk detector should be much more stable. However, there's no guarantee that it's completely bug-free. I'm not thrilled that this PR has merged. I'd rather have seen a much higher timeout specified, perhaps something along the lines of O(hours)? That way an admin would at least see the mesos-dns service flapping and could investigate why.
/cc @cmaloney
@jdef: Couldn't that be solved simply by logging when new masters are detected? An admin would look and see that it's getting data from ZK. If those log entries were absent, we'd know something is off.
My suspicion is that admins are watching a million things and that a
What about a healthy/unhealthy endpoint? What scared me most about the previous version was that the chance of Mesos-DNS just exiting can have a major impact on a production environment. Monitoring the health of the DNS could at least provide ops with the means to alert and take action.
We experienced this outage, btw, due to that issue. So naturally I'd like to mitigate that.
I like the idea of a readiness endpoint that fails if not connected to mesos. Yes, mesos-dns functions without mesos, but I don't think that's actually a legitimate use case. If it can't talk to mesos the operator is gonna want to know.
+1 for a readiness endpoint. Should also integrate well with DCOS. I'm not
@karlkfi sorry for blatantly selling your idea as mine :D One gotcha would be that this doesn't help if someone disables HTTP on mesos-dns, right? Unless you always bind the readiness endpoint to another port by default? But this also feels weird. :( Is there some form of DNS entry that could be hijacked to achieve the same and support both DNS- and HTTP-based health?
some k8s services have a dedicated endpoint/port that respond to "/healthz"
We do not have any infrastructure to monitor "readiness" / "liveness" of a base-system service. We can monitor when a process is up and act on that with systemd. Practically, it would be good if Mesos-DNS just always continued operating, serving old / cached results and causing an alert, but that requires a bunch of infrastructure we don't currently have.
@cmaloney: So, in the context of DCOS, would you say it's preferable to have Mesos-DNS hard-crash if it can't connect to Zookeeper at any point in time?
The two options I see:
We always configure it to listen to a localhost zookeeper though, so the only time it can't connect to zookeeper is if zookeeper is overloaded or not currently up. In the case where it is not up, we'd rather use another machine, since using that particular

If the "watch zk" stuff was a separate process / daemon from the "serve records" then we could split this much better / make it monitorable as two different services, which would be easier to detect / notify system administrators about, while still keeping the "old" state available so the cluster is less likely to fall over.

There would also need to be work around not serving leader.mesos / master.mesos when there is no connection to zk (which will cause certain things to fail properly), and work in generating /etc/resolv.conf so that we detect when mesos-dns is in this "degraded" state and avoid it if there are other healthy nodes.
OK, until we have a decent solution to properly monitor base services in DCOS, I will revert this change set and instead make the timeout optional so that we cover both use cases.
I'm not a big fan of either of those options, honestly. The best multi-component systems to maintain are the ones that:
You don't want to have to trace failures down a cascade of broken services and then have to restart all of them in order. You want to be able to look at a dashboard and see that N systems are Alive but NotReady and 1 system is NotAlive. You go fix that one piece or config that's borked and all the others auto-recover.

Not only that, but every component should be watched for NotAlive and auto-restarted, either by process or by container. If the kube-dns crashes every time zk is unreachable, it will thrash every time there's a network outage. If that happens at the container level, you won't even have a running container to shell into to debug.
@karlkfi: Agree.
I think it's important to note here that the reason why we're serving DNS is that we're talking to dumb clients. From what I know of internals / reliance on DNS inside of DCOS (Universe, DCOS UI, Frameworks), Mesos-DNS soft-failing is going to result in a lot worse error cases than hard-failing.

If Mesos-DNS hard-fails, from a client's perspective it looks just like you lost network access to it. Clients which want to be resilient against network partitions from Mesos-DNS need to handle this failure scenario already. If we add soft failure of Mesos-DNS, we break some liveness guarantees of its responses from the PoV of different "dumb" clients. These needs / requirements are often unstated, and occur both inside DCOS components and likely in things others write on top of Mesos-DNS.

Two I'm going to focus on are leader.mesos and master.mesos. There is a lot of software in DCOS (every Universe service) which assumes master.mesos resolves to all the masters, for use as a zk URL. There is some software in DCOS (the UI, the dcos history service, slaves' dependence on there being a leading Mesos master) which assumes leader.mesos points to the current leading Mesos master.

If Mesos-DNS keeps serving even though it doesn't have a current / up-to-date leader.mesos + master.mesos (what it gets from zookeeper), then clients which query those are going to get responses that are incorrect from their point of view. For some clients this is acceptable as a transient error. Others (the UI, the dcos history service, slaves' dependence on there being a leading Mesos master) need to change behavior when they know they are using an old, cached value. If Mesos-DNS keeps serving the old value, they are unable to make this choice: from their perspective as consumers of DNS, they are getting a new authoritative value and have no reason to suspect the value is possibly outdated. This will lead to that particular service not realizing it is in a degraded state. Services which depend on it, depending on that dependence, may also be in a degraded state but not know it.

If Mesos-DNS hard-fails, the clients see that immediately and can opt to use the old value if they are smart clients; if they are dumb clients, they are guaranteed that their assumptions around the liveness of the reported values won't be invalidated. By introducing "soft" failure as keeping the existing service port listening / responding in Mesos-DNS, there are more failure scenarios a client must be aware of in order to operate properly with certain guarantees around service discovery. This breaks the simple client implementation, and that implementation is why we are doing DNS in the first place.

If we implement "soft" failure in Mesos-DNS as serving on a different fallback port (ex: port 153 is "always available", port 53 serving is "only when we're all good"), the guarantees aren't broken, as dumb clients will get the failure they need for correct operation. To use the fallback, clients need to opt in rather than opt out.

One failure line which would happen today with soft failure:
Addresses: #284
This one-line fix will do it. There are other instances where mesos-dns could fail fatally due to dependent libraries, but that's outside the scope of this fix.