Need liveness checks: Antrea ports on EC2: modifying security groups to accommodate the agent requires agent restarts (agent doesn't recover) #802
Comments
@jayunit100 does VXLAN traffic (UDP port 4789) need to be enabled explicitly in the VPC?
I created a cluster on EC2 using kops:
I am able to run the whole netpol test suite with no errors:
I think you will need to provide us with access to your cluster (kubeconfig & ssh access), or share with us steps so we can reproduce the issue.
Yeah, I'm assuming the same thing: it must be a firewall thing.
Confirmed, there's a firewall issue. I guess we should document this somewhere for Antrea?
Agreed, the required ports should be listed at https://github.com/vmware-tanzu/antrea/blob/master/docs/getting-started.md#ensuring-requirements-are-satisfied, including the VXLAN port (or the ports for other tunnel types, if different tunnels are used) and the default API port.
Thanks folks. OK, so if I issued a patch to
Question: would it be possible to configure OVS to run on a different port, just as an interim hack?
The OVS tunnel port is not configurable via the antrea config file at the moment. There are other tunnel types, but they would require opening other ports. If the hack that manually updates the OVS configuration works for you, I can find some commands, but I guess that's not what you are looking for.
I can't think of a way to probe a remote UDP port, as the protocol has no acknowledgement regardless of whether the firewall drops the packet or not. If you know a proper way, I think it would be good to have.
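To illustrate the point about UDP being unprobeable: a minimal sketch (the `probe_udp` helper and the use of port 4789 are illustrative, not Antrea code). `sendto()` succeeds locally whether or not the datagram ever arrives, so the only signal available is whether something echoes back, and silence is ambiguous.

```python
import socket

# Unlike TCP, where connect() fails fast on a closed or filtered port,
# a UDP sendto() succeeds locally no matter what happens to the datagram
# in flight: there is no handshake and no transport-layer acknowledgement.
def probe_udp(host: str, port: int, timeout: float = 1.0) -> str:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(b"probe", (host, port))  # succeeds even if dropped in flight
        # Only a reply distinguishes "reachable"; silence is ambiguous:
        # the packet may have been dropped by a security group, or the
        # remote side may simply not answer unsolicited datagrams.
        s.recvfrom(1024)
        return "responding"
    except socket.timeout:
        return "unknown (dropped, filtered, or simply not answered)"
    finally:
        s.close()

print(probe_udp("127.0.0.1", 4789))
```

This is exactly why a blocked VXLAN port produces no error anywhere: the sender sees nothing abnormal, and the receiver sees nothing at all.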
If an Antrea Node "A" is sitting there and absolutely no rules from other Nodes are received on the OVS channel, is that an obvious indicator that something might be wrong? If so, we could log a warning, i.e. a periodic printout of how many total OVS rules have been created or received. That might be a good indicator, but I don't really know; just an idea, because as of now there is no signal in the logs that something is wrong.
It is really orthogonal to OVS rules. As we add more Prometheus metrics, some of them may help in troubleshooting this (e.g. we could show the amount of cross-Node traffic). It's hard to test for Node connectivity without creating Pods. We could send an ICMP echo request to the gateway interfaces of other Nodes, but the Node configuration may be such that no reply is generated, so that could trigger a false positive. For this reason, I am reluctant to include a test like this in the Antrea agent code. However, it could be part of a sanity check that we build using the traceflow feature / antctl. Note that for someone running Antrea in an EC2 VPC using the default security group, VXLAN tunneling will work fine. I think this case only arises because, in your case, you block all traffic and then selectively enable what you need? That requires knowledge of the components you use in your cluster. However, as Quan pointed out, the fact that we did not document this does not help...
This is an interesting conversation, and I've seen this type of troubleshooting process more times than I care to count in the wild. An interesting feature may be an optional canary Pod, deployed as a DaemonSet to all Nodes, that would allow a check of the overlay and help identify possible problems with host and IaaS firewall rules. The DaemonSet would only need to be deployed following environment configuration changes and could be destroyed afterward. BTW, this wouldn't necessarily have to be an Antrea feature; it could belong to other K8s toolsets. Integrating with Antrea, however, would allow hints to be given about possible closed ports, etc. preventing traffic, since the current overlay configuration could be reflected.
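The canary idea above could be sketched as a pair of tiny UDP programs: each Node runs an echo responder on the tunnel port, and a checker sends a token and waits for the echo. A timeout suggests the port is blocked somewhere between the Nodes (host firewall, security group, etc.). This is a hypothetical illustration, not existing Antrea tooling; the function names are invented, and the demo runs both halves on localhost with an ephemeral port standing in for 4789.

```python
import socket
import threading

def run_responder(sock: socket.socket) -> None:
    """Echo one datagram back to its sender (the 'canary' on each Node)."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)

def check_peer(host: str, port: int, timeout: float = 1.0) -> bool:
    """Send a token to a peer's canary and report whether it echoed back."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(b"canary", (host, port))
        data, _ = s.recvfrom(1024)
        return data == b"canary"
    except socket.timeout:
        return False  # blocked, dropped, or no canary listening
    finally:
        s.close()

# Local demo: responder and checker on the same host.
resp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
resp.bind(("127.0.0.1", 0))   # any free port stands in for UDP 4789
port = resp.getsockname()[1]
threading.Thread(target=run_responder, args=(resp,), daemon=True).start()

print("peer reachable:", check_peer("127.0.0.1", port))
```

Unlike a bare UDP probe, this round-trip gives a definite positive signal when the path is open; only the failure case remains ambiguous, which is acceptable for a diagnostic canary.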
+1 to the optional canary DaemonSet for diagnosis; Sonobuoy with a good e2e filter can do this as well. But if there's no other purpose to the canary, it might be perceived as a heavyweight solution.
From live debugging with tnqn, we're thinking a liveness check will help a lot with this situation...
Some pods can stay in
Describe the bug
So, originally I filed this bug as a general catastrophic failure, but it looks like there is a pattern: likely, pods on different nodes can't talk to one another. I say this because, looking at the netpol tests, if 'x/a' can talk to 'z/y', then 'z/y' can also talk to 'x/a' and any other pods that 'x/a' can talk to, and so on, meaning the communication groups form 'equivalence classes'. The obvious conclusion this hints at is that pods on the same node can talk to one another, while pods on different nodes cannot.
I guess this makes some sense: maybe OVS doesn't work the same on EC2 without some setup of firewall rules or something?
So,
... and so on.
Antrea tests in vSphere seem to be pretty stable for conformance testing. However, in EC2, I saw many failures, including even basic tests for network connectivity between pods.
Also almost all NetworkPolicy tests were failing.
I guess my cluster somehow got into a failed state.
To Reproduce
I'm not sure, but I put my installation material and some useful logs here:
https://github.com/jayunit100/kubernetes/tree/netpol-impl2/NETPOL_DATA_LOGS
The above folder has:
kubectl get nodes -o yaml
which might have some hints.
Actual behavior
Almost all pods can't do basic networking.
Versions:
Kubernetes 1.18
Antrea 0.7.4
EC2