Prevent exit on ZK detection timeout #306
Conversation
errch reports errors for components other than zk as well, and if those fail (DNS/HTTP) then the server should probably die. If you want to treat zk separately, then use a separate error channel.
👍 updated
@@ -40,16 +40,17 @@ func main() {
 	// initialize resolver
 	config := records.SetConfig(*cjson)
 	res := resolver.New(Version, config)
-	errch := make(chan error)
+	genErrch := make(chan error)
errch
doesn't this confuse things a bit? what about generrs?
Thanks! Can you squash the commits into one?
Yup. Though would it be easier to close and reopen? Not 100% sure, but it would require rewriting the history. Does the build system cope with a force update?
you can force-push
Wait a minute. Thinking more about this, I don't think this approach is what we want. We'd like ZK to not time out at all, I think, so that it can connect at any point in the process run-time.
I don't understand this PR...
@karlkfi: Agreed, see my comment above.
Not timing out is debatable. Perhaps the timeout can be optional? Do we have a readiness endpoint on mesos-dns that fails until ZK is available? It might be nice to distinguish between alive & ready; it's a pattern Kubernetes adopted. NotAlive means it needs restarting. Alive but NotReady means it's still bootstrapping or that its dependencies are NotReady.
Mesos-DNS offers multiple services which can function without Zookeeper; it can even be configured without Zookeeper altogether. The timeout is not strictly needed as long as the code which connects keeps retrying.
Would be easy to remove the timeout. It's just a Ticker anyways. I don't see a reason to keep it.
Technically it's a Timer, not a Ticker. So it only happens once, as is. A Ticker would invoke repeatedly.
@sepiroth887: Yes, but in order to satisfy #284 you should make sure that the ZK detector keeps retrying to connect.
gah, yea, it is a Timer. sorry. Regardless, though, it's not really needed: the ZK detector already keeps retrying indefinitely. Sending yet another commit shortly.
Since
rabbit hole :D on it.
 	logging.VeryVerbose.Printf("new masters detected: %v", masters)
 	res.SetMasters(masters)
 	res.Reload()
 case err := <-errch:
-	logging.Error.Fatal(err)
+	log.Fatal(err)
Keep using the Error logger please.
this was unchanged. I'll update it though.
How was it unchanged if git lists it as changed?
ugh, yes. I was keeping it in line with lines 59 and 61. Want this changed as well to use the Error logger?
I simply want this line to not change at all.
@sepiroth887: Please squash all your commits into one and then force-push to your branch.
- Dedicated error channel for ZK timeouts
- Updated error channel naming
- ZK Timeout timer removed
- Removed references to ZKTimeout
- Use logging for Fatal errors in switch case
Force-pushed from 85249b9 to 963ce92
done.
For posterity: we added this initial ZK timeout feature because the old mesos-go zk detector implementation was buggy and would sometimes never connect. The problem with this is that even if the masters are initially configured statically in the config, the mesos-dns records that point to masters would get out of date if the master VMs were recycled. This happened in production.

The newer mesos-go zk detector should be much more stable. However, there's no guarantee that it's completely bug-free. I'm not thrilled that this PR has merged. I'd rather have seen a much higher timeout specified, perhaps something along the lines of O(hours)? That way an admin would at least see the mesos-dns service flapping and could investigate why.
/cc @cmaloney
@jdef: Couldn't that be solved simply by logging when new masters are detected? An admin would look and see that it's getting data from ZK. If those log entries were absent, we'd know something is off.
My suspicion is that admins are watching a million things and that a
What about a healthy/unhealthy endpoint? What scared me most about the previous version was that the chance of Mesos-DNS just exiting can have a major impact on a production environment. Monitoring the health of the DNS could at least provide ops with the means to alert and take action.
We experienced this outage, btw, due to that issue. So naturally I'd like to mitigate that.
I like the idea of a readiness endpoint that fails if not connected to mesos. Yes, mesos-dns functions without mesos, but I don't think that's actually a legitimate use case. If it can't talk to mesos the operator is gonna want to know.
+1 for a readiness endpoint. Should also integrate well with DCOS. I'm not
@karlkfi sorry for blatantly selling your idea as mine :D One gotcha would be that this doesn't help if someone disables HTTP on mesos-dns, right? Unless you always bind the readiness endpoint to another port by default? But this also feels weird. :( Is there some form of DNS entry that could be hijacked to achieve the same and support both DNS- and HTTP-based health?
some k8s services have a dedicated endpoint/port that respond to "/healthz"
We do not have any infrastructure to monitor "readiness" / "liveness" of a base-system service. We can monitor when a process is up and act on that with systemd. Practically, it would be good if Mesos-DNS just always continued operating, serving old / cached results and causing an alert, but that requires a bunch of infrastructure we don't currently have.
@cmaloney: So, in the context of DCOS, would you say it's preferable to have Mesos-DNS hard-crash if it can't connect to Zookeeper at any point in time?
The two options I see:
We always configure it to listen to a localhost zookeeper though, so the only time it can't connect to zookeeper is if zookeeper is overloaded or not currently up. In the case where it is not up, we'd rather use another machine, since using that particular

If the "watch zk" stuff was a separate process / daemon from the "serve records" then we could split this much better / make it monitorable as two different services, which would be easier to detect / notify system administrators about, while still keeping the "old" state available so the cluster is less likely to fall over.

There would also need to be work around not serving leader.mesos / master.mesos when there is no connection to zk (which will cause certain things to fail properly), and work in generating /etc/resolv.conf so that we detect when mesos-dns is in this "degraded" state and avoid it if there are other healthy nodes.
OK, until we have a decent solution to properly monitor base services in DCOS, I will revert this change set and instead make the timeout optional so that we cover both use cases.
I'm not a big fan of either of those options, honestly. The best multi-component systems to maintain are the ones that:
You don't want to have to trace failures down a cascade of broken services and then have to restart all of them in order. You want to be able to look at a dashboard and see that N systems are Alive but NotReady and 1 system is NotAlive. You go fix that one piece or config that's borked and all the others auto-recover.

Not only that, but every component should be watched for NotAlive and auto-restarted, either by process or by container. If the kube-dns crashes every time zk is unreachable, it will thrash every time there's a network outage. If that happens at the container level, you won't even have a running container to shell into to debug.
@karlkfi: Agree.
I think it's important to note here that the reason why we're serving DNS is that we're talking to dumb clients. From what I know of internals / reliance on DNS inside of DCOS (Universe, DCOS UI, Frameworks), Mesos-DNS soft-failing is going to result in a lot worse error cases than hard-failing.

If Mesos-DNS hard-fails, from a client's perspective it looks just like you lost network access to it. Clients which want to be resilient against network partitions from Mesos-DNS need to handle this failure scenario already. If we add soft failure of Mesos-DNS, we break some liveness guarantees of its responses from the PoV of different "dumb" clients. These needs / requirements are often unstated, and occur both inside DCOS components and likely in things others write on top of Mesos-DNS.

Two I'm going to focus on are leader.mesos and master.mesos. There is a lot of software in DCOS (every Universe service) which assumes master.mesos resolves to all the masters, for use as a zk URL. There is some software in DCOS (the UI, the dcos history service, slaves' dependence on there being a leading Mesos master) which assumes leader.mesos points to the current leading Mesos master.

If Mesos-DNS keeps serving even though it doesn't have a current / up-to-date leader.mesos + master.mesos (what it gets from zookeeper), then clients which query those are going to get responses that are incorrect from their point of view. For some clients this is acceptable as a transient error. Others (the UI, the dcos history service, slaves' dependence on there being a leading Mesos master) need to change behavior when they know they are using an old, cached value. If Mesos-DNS keeps serving the old value, they are unable to make this choice: from their perspective as consumers of DNS, they are getting a new authoritative value and have no reason to suspect the value is possibly outdated. This will lead to that particular service not realizing it is in a degraded state. Services which depend on it, depending on that dependence, may also be in a degraded state but not know it.

If Mesos-DNS hard-fails, the clients see that immediately and can opt to use the old value if they are smart clients; if they are dumb clients, they are guaranteed that their assumptions around the liveness of the reported values won't be invalidated. By introducing "soft" failure as keeping the existing service port listening / responding in Mesos-DNS, there are more failure scenarios a client must be aware of in order to operate properly with certain guarantees around service discovery. This breaks the simple client implementation, and that implementation is why we are doing DNS in the first place.

If we implement "soft" failure in Mesos-DNS as serving on a different fallback port (ex: port 153 is "always available", port 53 serving is "only when we're all good"), the guarantees aren't broken, as dumb clients will get the failure they need for correct operation. To use the fallback, clients need to opt in rather than opt out.

One failure line which would happen today with soft failure:
Addresses: #284
This one-line fix will do it. There are other instances where mesos-dns could fail fatally due to dependent libraries, but that's outside the scope of this fix.