
need liveness checks... Antrea ports on EC2 : Modifying Security groups to accommodate agent -> requires agent restarts (agent doesn't recover) #802

Closed
jayunit100 opened this issue Jun 5, 2020 · 14 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@jayunit100
Contributor

jayunit100 commented Jun 5, 2020

Describe the bug

So, originally I filed this bug as a general catastrophic failure, but it looks like there is a pattern: likely pods on different nodes can't talk to one another. I say this because, looking at the netpol tests, if 'x/a' can talk to 'z/y', then 'z/y' can also talk to 'x/a' and to any other pod that 'x/a' can talk to, and so on. In other words, the communication groups form equivalence classes, which strongly suggests that only pods on the same node can talk to one another.

I guess this makes some sense: maybe OVS doesn't work the same on EC2 without some additional setup of firewall rules or something?

expected:

-       x/a     y/a     z/a     x/b     y/b     z/b     x/c     y/c     z/c
x/a     .       .       .       .       .       .       .       .       .
y/a     .       .       .       .       .       .       .       .       .
z/a     .       .       .       .       .       .       .       .       .
x/b     .       .       .       .       .       .       .       .       .
y/b     .       .       .       .       .       .       .       .       .
z/b     .       .       .       .       .       .       .       .       .
x/c     .       .       .       .       .       .       .       .       .
y/c     .       .       .       .       .       .       .       .       .
z/c     .       .       .       .       .       .       .       .       .


observed:

-       x/a     y/a     z/a     x/b     y/b     z/b     x/c     y/c     z/c
x/a     .       .       X       .       X       X       X       .       .
y/a     .       .       X       .       X       X       X       .       .
z/a     X       X       .       X       .       .       .       X       X
x/b     .       .       X       .       X       X       X       .       .
y/b     X       X       .       X       .       .       .       X       X
z/b     X       X       .       X       .       .       .       X       X
x/c     X       X       .       X       .       .       .       X       X
y/c     .       .       X       .       X       X       X       .       .
z/c     .       .       X       .       X       X       X       .       .


So,

  • x/a, y/a, x/b, y/c, z/c are one group
  • z/a, y/b, z/b, x/c are the other group

Antrea tests on vSphere seem to be pretty stable for conformance testing. However, on EC2, I saw many failures, including even basic tests for network connectivity between pods.

Also almost all NetworkPolicy tests were failing.

I guess my cluster somehow got into a failed state.

To Reproduce

I'm not sure, but I put my installation material and some useful logs here:

https://github.com/jayunit100/kubernetes/tree/netpol-impl2/NETPOL_DATA_LOGS

The above folder has:

  • logs from the antrea controller
  • logs from one of the antrea agents
  • the output of kubectl get nodes -o yaml, which might have some hints

Actual behavior

Almost all pods can't do basic networking.

Versions:
Kubernetes 1.18
Antrea 0.7.4
EC2

@jayunit100 jayunit100 added the bug label Jun 5, 2020
@jayunit100 jayunit100 changed the title Lots of failures in EC2 after conformance tests Manny pods failing to talk to other groups of pods (possibly isolated to same node traffic)? Jun 5, 2020
@antoninbas
Contributor

@jayunit100 does VXLAN traffic (UDP port 4789) need to be enabled explicitly in the VPC?
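
For reference, a minimal sketch of what that security group rule could look like with the AWS CLI; the group ID below is only a placeholder for whatever group is attached to the cluster nodes:

aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol udp \
    --port 4789 \
    --source-group sg-0123456789abcdef0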

@antoninbas antoninbas changed the title Manny pods failing to talk to other groups of pods (possibly isolated to same node traffic)? Many pods failing to talk to other groups of pods (possibly isolated to same node traffic)? Jun 5, 2020
@antoninbas
Contributor

I created a cluster on EC2 using kops:

abas-a01:~ abas$ ./kops validate cluster
Using cluster from kubectl context: cluster1.abas.link

Validating cluster cluster1.abas.link

INSTANCE GROUPS
NAME			ROLE	MACHINETYPE	MIN	MAX	SUBNETS
master-us-west-2a	Master	t3.medium	1	1	us-west-2a
nodes			Node	t2.small	2	2	us-west-2a

NODE STATUS
NAME						ROLE	READY
ip-172-20-33-86.us-west-2.compute.internal	node	True
ip-172-20-51-159.us-west-2.compute.internal	node	True
ip-172-20-63-247.us-west-2.compute.internal	master	True

Your cluster cluster1.abas.link is ready

I am able to run the whole netpol testsuite with no errors:

=== TEST FAILURES: 0/14 ===

I think you will need to provide us with access to your cluster (kubeconfig & ssh access), or share with us steps so we can reproduce the issue.

@antoninbas antoninbas added the triage/needs-information Indicates an issue needs more information in order to work on it. label Jun 5, 2020
@jayunit100
Contributor Author

Yeah, I'm assuming the same: it must be a firewall thing.

@jayunit100
Contributor Author

Confirmed, there's a firewall issue. I guess we should document this somewhere for Antrea?

@tnqn
Member

tnqn commented Jun 5, 2020

Confirmed, there's a firewall issue. I guess we should document this somewhere for Antrea?

Agreed, the required ports should be listed at https://github.com/vmware-tanzu/antrea/blob/master/docs/getting-started.md#ensuring-requirements-are-satisfied, including the VXLAN port (or other tunnel ports if different tunnel types are used) and the default API port.
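
As a rough sketch (not an official Antrea list), the doc entry could cover something like:

# Standard tunnel ports (IANA assignments), depending on the tunnel type configured:
#   UDP 4789       - VXLAN traffic between Nodes
#   UDP 6081       - Geneve traffic between Nodes
#   TCP 7471       - STT traffic between Nodes
#   IP protocol 47 - GRE traffic between Nodes
# plus whatever port the antrea-controller apiserver listens on, so agents can reach it.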

@jayunit100
Contributor Author

Thanks folks. OK, so: if I issued a patch to antrea-agent which logged a warning or failed after checking that the UDP port is open, would that be accepted? This seems like a critical failure, so docs are really only the first step; the UX would be much better if the microservice detected this and actively failed.

@jayunit100
Contributor Author

Question: would it be possible to configure OVS everywhere to run on a different port, just as an interim hack?

@tnqn
Member

tnqn commented Jun 5, 2020

Question: would it be possible to configure OVS everywhere to run on a different port, just as an interim hack?

The OVS tunnel port is not configurable via the antrea config file at the moment. There are other tunnel types, but they would require opening other ports. If a hack that manually updates the OVS configuration works for you, I can find some commands, but I guess that's not what you are looking for.
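
For completeness, the manual hack would look roughly like this; the interface name antrea-tun0 is an assumption (check ovs-vsctl show in the antrea-ovs container first), every Node needs the same port, and the agents would need a restart afterwards:

# Run inside the antrea-ovs container on each Node (interface name may differ):
ovs-vsctl set interface antrea-tun0 options:dst_port=4790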

If I issued a patch to antrea-agent which logged a warning or failed after checking that the UDP port is open, would that be accepted? This seems like a critical failure, so docs are really only the first step; the UX would be much better if the microservice detected this and actively failed.

I can't think of a way to probe a remote UDP port, since there is no acknowledgement whether or not the firewall drops the packet. If you know a proper way, I think it would be good to have.
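
A quick illustration of why this is hard, assuming netcat is available on a Node (behaviour also varies between netcat variants):

# Sends a single UDP probe; with no handshake and no ICMP error coming back,
# this typically reports success whether the packet reached the peer or was
# silently dropped by a security group.
nc -vzu 172.20.51.159 4789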

@jayunit100
Contributor Author

If an Antrea node "A" is sitting there and absolutely no rules from other nodes are received on the OVS channel, is that an obvious indicator that something might be wrong? If so, we could log a warning, i.e. a periodic printout of how many total OVS rules have been created or received; that might be a good indicator. But I don't really know, it's just an idea, because as of now there is no signal in the logs that something is wrong.
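
A quick manual version of that check could look like this, assuming the default bridge name br-int and the antrea-ovs container in the agent Pod:

# Rough signal only: count the flows currently installed on the OVS bridge.
kubectl -n kube-system exec antrea-agent-4qjzk -c antrea-ovs -- ovs-ofctl dump-flows br-int | wc -l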

@antoninbas
Contributor

It is really orthogonal to OVS rules.

As we add more Prometheus metrics, some of them may help in troubleshooting this (e.g. we could show the amount of cross Node traffic).

It's hard to test for Node connectivity without creating Pods. We could send an ICMP echo request to the gateway interfaces of other Nodes, but the Node configuration may be such that no reply will be generated, so that could trigger a false positive. For this reason, I am reluctant to include a test like this one in the Antrea agent code. However, that could be part of a sanity check that we build using the traceflow feature / antctl.

Note that for someone running Antrea in an EC2 VPC using the default security group, VXLAN tunneling will work fine. I think this case only arises because in your setup you block all traffic and then selectively enable what you need? That requires knowledge of the components you use in your cluster. However, as Quan pointed out, the fact that we did not document this does not help...

@antoninbas antoninbas removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Jun 5, 2020
@McCodeman
Contributor

This is an interesting conversation, and I've seen this type of troubleshooting process more times than I care to count in the wild. An interesting feature might be an optional canary pod deployed as a DaemonSet to all nodes, which would allow a checkout of the overlay and help identify possible problems with host and IaaS firewall rules. The daemonset would only need to be deployed following environment configuration changes and could then be destroyed afterward. BTW, this wouldn't necessarily have to be an Antrea feature; it could belong to other K8s toolsets. Integrating with Antrea, however, would allow hints to be given about possible closed ports, etc. preventing traffic, since the current overlay configuration could be reflected.
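
A minimal sketch of how such a canary could be exercised, assuming a hypothetical DaemonSet named overlay-canary whose Pods ship with ping:

# Exec into each canary Pod and ping every other canary Pod's IP, covering all node pairs.
PODS=$(kubectl get pods -l app=overlay-canary -o jsonpath='{range .items[*]}{.metadata.name} {.status.podIP}{"\n"}{end}')
echo "$PODS" | while read src src_ip; do
  echo "$PODS" | while read dst dst_ip; do
    [ "$src" = "$dst" ] && continue
    kubectl exec "$src" -- ping -c 1 -W 2 "$dst_ip" >/dev/null \
      && echo "OK   $src -> $dst" || echo "FAIL $src -> $dst"
  done
done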

@jayunit100
Contributor Author

jayunit100 commented Jun 7, 2020

+1 to the optional canary daemonset to be used for diagnosis; Sonobuoy with a good e2e filter can do this as well. But if there's no other purpose to the canary, it might be perceived as a heavyweight solution.
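
For example, a Sonobuoy run scoped to the NetworkPolicy e2e tests might look something like this (flag names per recent Sonobuoy releases; adjust to your version):

sonobuoy run --e2e-focus '\[sig-network\].*NetworkPolicy' --wait
results=$(sonobuoy retrieve)
sonobuoy results "$results"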

@jayunit100
Contributor Author

From live debugging with tnqn, we're thinking a liveness check will help a lot with this situation...

NAMESPACE     NAME                                                               READY   STATUS    RESTARTS   AGE
kube-system   antrea-agent-4qjzk                                                 2/2     Running   0          3d14h
kube-system   antrea-agent-r6qk9                                                 1/2     Running   0          3d14h
kube-system   antrea-agent-wnpjv                                                 1/2     Running   0          3s
kube-system   antrea-controller-7bc4496b57-lfz58                                 1/1     Running   0          3d14h

Some pods can stay in a 1/2 Ready state for a long time, getting "context cancelled" events when talking to the API server, while we're toggling security groups on and off.
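
Until there is a liveness probe, the practical recovery after changing security groups seems to be restarting the agents, roughly:

# Check which agents are stuck (label selector may differ by Antrea version):
kubectl -n kube-system get pods -l app=antrea -o wide

# Restart the agent DaemonSet so stuck agents reconnect:
kubectl -n kube-system rollout restart daemonset/antrea-agent
kubectl -n kube-system rollout status daemonset/antrea-agent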

@jayunit100 jayunit100 changed the title Many pods failing to talk to other groups of pods (possibly isolated to same node traffic)? Antrea ports on EC2 : Modifying Security groups to accommodate agent -> requires agent restarts (agent doesn't recover) Jun 8, 2020
@jayunit100 jayunit100 changed the title Antrea ports on EC2 : Modifying Security groups to accommodate agent -> requires agent restarts (agent doesn't recover) need liveness checks... Antrea ports on EC2 : Modifying Security groups to accommodate agent -> requires agent restarts (agent doesn't recover) Jun 8, 2020
@github-actions
Contributor

github-actions bot commented Dec 6, 2020

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 6, 2020