-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure that the iptables masquerade rule is always installed #808
Ensure that the iptables masquerade rule is always installed #808
Conversation
main.go
Outdated
if err != nil { | ||
// If we fail to set up the IP Masquerade, don't keep retrying | ||
log.Errorf("Failed to set up IP Masquerade: %v", err) | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not totally sure that this return
is the right thing, it may be better to continue retrying instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that continuing to retry or returning an error in such a way that flanneld exits would be better.
Hi @julia-stripe, thanks for another well put together PR! I see one key problem with this PR and then a couple of minor concerns (below) - the main concern is that if a single rule (of the four that flannel installs) is removed, then flannel just restores the one missing rule, which means it might end up being out of order! Since the ordering of the rules matters, I think this defeats the purpose of this PR a bit. e.g. if I remove the first flannel rules I'd be interested to hear if you have any thoughts on automated testing in this area? I know there aren't any existing tests but it would be great to have some. Otherwise, the code looks good to me, though I'd like to leave it up for at least a few days before merging to give others a chance to comment. The slight areas of concern I'd have about this are
@fasaxc I'd love to hear your thoughts on this. |
On (1), looks like the code checks for the existence of the rule first, so it's 4 reads per second. That could be moderately expensive if there are a lot of rules (since most iptables operations do a full table load from the kernel to userspace, even to check one rule). On (2), as long as the other process isn't deleting these rules actively then we should be OK (and if it was actively trying to delete those rules then there's not much we can do. Longer term, you may want to migrate to having only a single rule inserted into the root chain and then jumping fro there to a flannel-owned chain. That's the approach that we have in Calico; it reduces cloberring since there's only one rule and it avoids reordering issues. (And, since we own the calico chain, we can just flush and rewrite the whole chain at will if anyone ever tampers with it.) |
59ab87a
to
2a5059a
Compare
this is a great point. Changed the PR so that it instead:
I also added a mock IPTables struct & some tests to give us a little more confidence that this code actually works as intended. This should be reviewed commit-by-commit -- the first commit just adds tests, and the second actually changes functionality. |
82aec61
to
056c4eb
Compare
@fasaxc thanks for the comments. I agree that longer term, a single jump to a flannel owned chain would be better. @julia-stripe That's looking much better thanks. This code doesn't actually delete the iptables rules when flannel stops, but after a bit of testing I see that the old code didn't either (ooops). I also think it's desirable that the rules aren't cleaned up (so the dataplane keeps working during an upgrade). It's up to you if you want to clean up the code or just leave it as is. I think in future it would be good to have a "--cleanup" command which which removes all traces of flannel from a system (removes the subnet.env file, any iptables rules, any flannel routes and flannel created devices). Could you squash the two commits then I can merge it? |
056c4eb
to
fc5fbd7
Compare
Squashed.
This makes sense to me. |
We ran into an issue where sometimes the iptables masquerade rules that flannel creates get deleted. Obviously ideally this wouldn't happen, but if it does happen it's really bad -- containers can't send packets to any hosts outside the cluster.
This PR makes flannel automatically recover from this situation by recreating the iptables rules, so that the cluster will remain healthy:
AppendUnique
which returns whether or not the rule was appended. This allows flannel to only log when a rule was actually successfully added, instead of always logging (which, now that we're retrying every second, would add too much noise to the logs)We're running this in our production cluster now