
need liveness checks... Antrea ports on EC2 : Modifying Security groups to accommodate agent -> requires agent restarts (agent doesn't recover) #802

Closed
jayunit100 opened this issue Jun 5, 2020 · 14 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@jayunit100
Contributor

jayunit100 commented Jun 5, 2020

Describe the bug

So, originally I filed this bug as a general catastrophic failure, but it looks like there is a pattern: likely pods on different nodes can't talk to one another. I say this because, looking at the netpol tests, if 'x/a' can talk to 'z/y', then 'z/y' can also talk to 'x/a' and to any other pod that 'x/a' can talk to, and so on. In other words, the communication groups form equivalence classes, which strongly suggests that only pods on the same node can talk to one another.

I guess this makes some sense: maybe OVS doesn't work the same on EC2 without some additional setup of firewall rules or something?

expected:

-       x/a     y/a     z/a     x/b     y/b     z/b     x/c     y/c     z/c
x/a     .       .       .       .       .       .       .       .       .
y/a     .       .       .       .       .       .       .       .       .
z/a     .       .       .       .       .       .       .       .       .
x/b     .       .       .       .       .       .       .       .       .
y/b     .       .       .       .       .       .       .       .       .
z/b     .       .       .       .       .       .       .       .       .
x/c     .       .       .       .       .       .       .       .       .
y/c     .       .       .       .       .       .       .       .       .
z/c     .       .       .       .       .       .       .       .       .


observed:

-       x/a     y/a     z/a     x/b     y/b     z/b     x/c     y/c     z/c
x/a     .       .       X       .       X       X       X       .       .
y/a     .       .       X       .       X       X       X       .       .
z/a     X       X       .       X       .       .       .       X       X
x/b     .       .       X       .       X       X       X       .       .
y/b     X       X       .       X       .       .       .       X       X
z/b     X       X       .       X       .       .       .       X       X
x/c     X       X       .       X       .       .       .       X       X
y/c     .       .       X       .       X       X       X       .       .
z/c     .       .       X       .       X       X       X       .       .


So,

  • x/a, y/a, x/b, y/c, z/c are one group
  • z/a, y/b, z/b, x/c are the other group

Antrea tests on vSphere seem to be pretty stable for conformance testing. However, on EC2, I saw many failures, including even basic tests for network connectivity between pods.

Also almost all NetworkPolicy tests were failing.

I guess my cluster somehow got into a failed state.

To Reproduce

I'm not sure, but I put my installation material and some useful logs here:

https://github.com/jayunit100/kubernetes/tree/netpol-impl2/NETPOL_DATA_LOGS

The above folder has:

  • logs from the antrea controller
  • logs from one of the antrea agents
  • the output of kubectl get nodes -o yaml, which might have some hints

Actual behavior

Almost all pods can't do basic networking.

Versions:
Kubernetes 1.18
Antrea 0.7.4
EC2

@jayunit100 jayunit100 added the bug label Jun 5, 2020
@jayunit100 jayunit100 changed the title Lots of failures in EC2 after conformance tests Manny pods failing to talk to other groups of pods (possibly isolated to same node traffic)? Jun 5, 2020
@antoninbas
Contributor

@jayunit100 does VXLAN traffic (UDP port 4789) need to be enabled explicitly in the VPC?
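
For reference, a minimal sketch of what that security group rule could look like with the AWS CLI; the group ID below is only a placeholder for whatever group is attached to the cluster nodes:

aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol udp \
    --port 4789 \
    --source-group sg-0123456789abcdef0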

@antoninbas antoninbas changed the title Manny pods failing to talk to other groups of pods (possibly isolated to same node traffic)? Many pods failing to talk to other groups of pods (possibly isolated to same node traffic)? Jun 5, 2020
@antoninbas
Contributor

I created a cluster on EC2 using kops:

abas-a01:~ abas$ ./kops validate cluster
Using cluster from kubectl context: cluster1.abas.link

Validating cluster cluster1.abas.link

INSTANCE GROUPS
NAME			ROLE	MACHINETYPE	MIN	MAX	SUBNETS
master-us-west-2a	Master	t3.medium	1	1	us-west-2a
nodes			Node	t2.small	2	2	us-west-2a

NODE STATUS
NAME						ROLE	READY
ip-172-20-33-86.us-west-2.compute.internal	node	True
ip-172-20-51-159.us-west-2.compute.internal	node	True
ip-172-20-63-247.us-west-2.compute.internal	master	True

Your cluster cluster1.abas.link is ready

I am able to run the whole netpol testsuite with no errors:

=== TEST FAILURES: 0/14 ===

I think you will need to provide us with access to your cluster (kubeconfig & ssh access), or share with us steps so we can reproduce the issue.

@antoninbas antoninbas added the triage/needs-information Indicates an issue needs more information in order to work on it. label Jun 5, 2020
@jayunit100
Contributor Author

Yeah, I'm assuming the same: it must be a firewall thing.

@jayunit100
Contributor Author

Confirmed, there's a firewall issue. I guess we should document this somewhere for Antrea?

@tnqn
Member

tnqn commented Jun 5, 2020

Confirmed, there's a firewall issue. I guess we should document this somewhere for Antrea?

Agreed, the required ports should be listed at https://github.com/vmware-tanzu/antrea/blob/master/docs/getting-started.md#ensuring-requirements-are-satisfied, including the VXLAN port (or other tunnel ports if different tunnel types are used) and the default API port.
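
As a rough sketch (not an official Antrea list), the doc entry could cover something like:

# Standard tunnel ports (IANA assignments), depending on the tunnel type configured:
#   UDP 4789       - VXLAN traffic between Nodes
#   UDP 6081       - Geneve traffic between Nodes
#   TCP 7471       - STT traffic between Nodes
#   IP protocol 47 - GRE traffic between Nodes
# plus whatever port the antrea-controller apiserver listens on, so agents can reach it.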

@jayunit100
Contributor Author

Thanks folks. OK, so: if I issued a patch to antrea-agent which logged a warning or failed after checking that the UDP port is open, would that be accepted? This seems like a critical failure, so docs are really only the first step; the UX would be much better if the microservice detected this and actively failed.

@jayunit100
Contributor Author

Question: would it be possible to configure OVS everywhere to run on a different port, just as an interim hack?

@tnqn
Member

tnqn commented Jun 5, 2020

Question: would it be possible to configure OVS everywhere to run on a different port, just as an interim hack?

The OVS tunnel port is not configurable via the antrea config file at the moment. There are other tunnel types, but they would require opening other ports. If a hack that manually updates the OVS configuration works for you, I can find some commands, but I guess that's not what you are looking for.
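
For completeness, the manual hack would look roughly like this; the interface name antrea-tun0 is an assumption (check ovs-vsctl show in the antrea-ovs container first), every Node needs the same port, and the agents would need a restart afterwards:

# Run inside the antrea-ovs container on each Node (interface name may differ):
ovs-vsctl set interface antrea-tun0 options:dst_port=4790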

If I issued a patch to antrea-agent which logged a warning or failed after checking that the UDP port is open, would that be accepted? This seems like a critical failure, so docs are really only the first step; the UX would be much better if the microservice detected this and actively failed.

I can't think of a way to probe a remote UDP port, since there is no acknowledgement whether or not the firewall drops the packet. If you know a proper way, I think it would be good to have.
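
A quick illustration of why this is hard, assuming netcat is available on a Node (behaviour also varies between netcat variants):

# Sends a single UDP probe; with no handshake and no ICMP error coming back,
# this typically reports success whether the packet reached the peer or was
# silently dropped by a security group.
nc -vzu 172.20.51.159 4789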

@jayunit100
Contributor Author

If an Antrea node "A" is sitting there and absolutely no rules from other nodes are received on the OVS channel, is that an obvious indicator that something might be wrong? If so, we could log a warning, i.e. a periodic printout of how many total OVS rules have been created or received; that might be a good indicator. But I don't really know, it's just an idea, because as of now there is no signal in the logs that something is wrong.
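
A quick manual version of that check could look like this, assuming the default bridge name br-int and the antrea-ovs container in the agent Pod:

# Rough signal only: count the flows currently installed on the OVS bridge.
kubectl -n kube-system exec antrea-agent-4qjzk -c antrea-ovs -- ovs-ofctl dump-flows br-int | wc -l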

@antoninbas
Contributor

It is really orthogonal to OVS rules.

As we add more Prometheus metrics, some of them may help in troubleshooting this (e.g. we could show the amount of cross Node traffic).

It's hard to test for Node connectivity without creating Pods. We could send an ICMP echo request to the gateway interfaces of other Nodes, but the Node configuration may be such that no reply will be generated, so that could trigger a false positive. For this reason, I am reluctant to include a test like this one in the Antrea agent code. However, that could be part of a sanity check that we build using the traceflow feature / antctl.

Note that for someone running Antrea in an EC2 VPC using the default security group, VXLAN tunneling will work fine. I think this case only arises because in your setup you block all traffic and then selectively enable what you need? That requires knowledge of the components you use in your cluster. However, as Quan pointed out, the fact that we did not document this does not help...

@antoninbas antoninbas removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Jun 5, 2020
@McCodeman
Contributor

This is an interesting conversation, and I've seen this type of troubleshooting process more times than I care to count in the wild. An interesting feature might be an optional canary pod deployed as a DaemonSet to all nodes, which would allow a checkout of the overlay and help identify possible problems with host and IaaS firewall rules. The daemonset would only need to be deployed following environment configuration changes and could then be destroyed afterward. BTW, this wouldn't necessarily have to be an Antrea feature; it could belong to other K8s toolsets. Integrating with Antrea, however, would allow hints to be given about possible closed ports, etc. preventing traffic, since the current overlay configuration could be reflected.
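
A minimal sketch of how such a canary could be exercised, assuming a hypothetical DaemonSet named overlay-canary whose Pods ship with ping:

# Exec into each canary Pod and ping every other canary Pod's IP, covering all node pairs.
PODS=$(kubectl get pods -l app=overlay-canary -o jsonpath='{range .items[*]}{.metadata.name} {.status.podIP}{"\n"}{end}')
echo "$PODS" | while read src src_ip; do
  echo "$PODS" | while read dst dst_ip; do
    [ "$src" = "$dst" ] && continue
    kubectl exec "$src" -- ping -c 1 -W 2 "$dst_ip" >/dev/null \
      && echo "OK   $src -> $dst" || echo "FAIL $src -> $dst"
  done
done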

@jayunit100
Contributor Author

jayunit100 commented Jun 7, 2020

+1 to the optional canary daemonset to be used for diagnosis; Sonobuoy with a good e2e filter can do this as well. But if there's no other purpose to the canary, it might be perceived as a heavyweight solution.
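
For example, a Sonobuoy run scoped to the NetworkPolicy e2e tests might look something like this (flag names per recent Sonobuoy releases; adjust to your version):

sonobuoy run --e2e-focus '\[sig-network\].*NetworkPolicy' --wait
results=$(sonobuoy retrieve)
sonobuoy results "$results"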

@jayunit100
Contributor Author

From live debugging with tnqn, we're thinking a liveness check will help a lot with this situation...

NAMESPACE     NAME                                                               READY   STATUS    RESTARTS   AGE
kube-system   antrea-agent-4qjzk                                                 2/2     Running   0          3d14h
kube-system   antrea-agent-r6qk9                                                 1/2     Running   0          3d14h
kube-system   antrea-agent-wnpjv                                                 1/2     Running   0          3s
kube-system   antrea-controller-7bc4496b57-lfz58                                 1/1     Running   0          3d14h

Some pods can stay in a 1/2 Ready state for a long time, getting "context cancelled" events when talking to the API server, while we're toggling security groups on and off.
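
Until there is a liveness probe, the practical recovery after changing security groups seems to be restarting the agents, roughly:

# Check which agents are stuck (label selector may differ by Antrea version):
kubectl -n kube-system get pods -l app=antrea -o wide

# Restart the agent DaemonSet so stuck agents reconnect:
kubectl -n kube-system rollout restart daemonset/antrea-agent
kubectl -n kube-system rollout status daemonset/antrea-agent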

@jayunit100 jayunit100 changed the title Many pods failing to talk to other groups of pods (possibly isolated to same node traffic)? Antrea ports on EC2 : Modifying Security groups to accommodate agent -> requires agent restarts (agent doesn't recover) Jun 8, 2020
@jayunit100 jayunit100 changed the title Antrea ports on EC2 : Modifying Security groups to accommodate agent -> requires agent restarts (agent doesn't recover) need liveness checks... Antrea ports on EC2 : Modifying Security groups to accommodate agent -> requires agent restarts (agent doesn't recover) Jun 8, 2020
@github-actions
Contributor

github-actions bot commented Dec 6, 2020

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 6, 2020