
Nodes unable to connect to services whose pods are scheduled on other nodes #1266

Closed
heyjared opened this issue Jan 3, 2020 · 7 comments

@heyjared

heyjared commented Jan 3, 2020

Version:
k3s version v0.9.0 (65d8764)
to
k3s version v1.17.0+k3s.1 (0f64465)

Describe the bug
Since v0.9.0, nodes and pods with hostNetwork: true have been unable to connect to services whose selected pods are scheduled on other nodes. v0.8.1 is unaffected.

To Reproduce

  1. Form a cluster of two or more nodes, e.g. host1 and host2.
  2. Using kube-dns as an example, the coredns pod is scheduled on host1.
  3. From host2, or from a pod on host2 with hostNetwork: true, attempt to connect to a service IP (e.g. kube-dns); a minimal test pod is sketched below.
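
A minimal sketch of step 3 using a hostNetwork test pod (the pod name, image, and node name are illustrative; 10.43.0.10 is the default k3s kube-dns ClusterIP):

# hostNetwork test pod pinned to host2 (name, image, and node name are placeholders)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hostnet-test
spec:
  hostNetwork: true
  nodeName: host2.example.com
  containers:
  - name: test
    image: busybox:1.31
    command: ["sleep", "3600"]
EOF

# From inside that pod, query the kube-dns service IP
kubectl exec hostnet-test -- nslookup google.com 10.43.0.10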

Expected behavior

[root@host2 ~]# nslookup google.com 10.43.0.10 
Server:		10.43.0.10
Address:	10.43.0.10#53

Non-authoritative answer:
Name:	google.com
Address: 74.125.24.100
[truncated]

Actual behavior

[root@host2 ~]# nslookup google.com 10.43.0.10 
;; connection timed out; no servers could be reached

Additional context

kubectl get nodes

NAME                STATUS   ROLES    AGE     VERSION
host1.example.com   Ready    master   4m59s   v1.15.4-k3s.1
host2.example.com   Ready    worker   4m41s   v1.15.4-k3s.1

kubectl get svc -A

NAMESPACE     NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-system   kube-dns     ClusterIP   10.43.0.10   <none>        53/UDP,53/TCP,9153/TCP   5m26s
default       kubernetes   ClusterIP   10.43.0.1    <none>        443/TCP                  5m24s

kubectl get pods -A -o wide

NAMESPACE     NAME                      READY   STATUS    RESTARTS   AGE     IP          NODE                NOMINATED NODE   READINESS GATES
kube-system   coredns-66f496764-rkwdt   1/1     Running   0          6m56s   10.42.0.2   host1.example.com   <none>           <none>
@heyjared
Author

heyjared commented Jun 8, 2020

This seems to be related to a VXLAN issue in flannel:
flannel-io/flannel#1243
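
A commonly reported workaround for the linked checksum-offload bug is to disable TX checksum offloading on the flannel VXLAN interface (a sketch, assuming the default interface name flannel.1; the setting does not survive a reboot):

# Run on each affected node
sudo ethtool -K flannel.1 tx-checksum-ip-generic off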

@malikbenkirane

malikbenkirane commented Jun 25, 2020

Friendly bump. Has anyone figured out a workaround yet?

@brandond
Member

Don't use vxlan until the upstream issues are resolved?
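
For example, k3s can be installed with a non-vxlan flannel backend (a sketch using the k3s install script; host-gw assumes all nodes share an L2 segment, and the available backends depend on the k3s version):

# Server with the host-gw backend instead of the default vxlan
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --flannel-backend=host-gw" sh -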

@malikbenkirane

Definitely, I switched to --flannel-backend=none with calico :-)

@transhapHigsn
Contributor

@malikbenkirane by any chance, could you share the manifests/steps you used for setting up Calico?
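
For reference, a rough sketch of that kind of setup (not necessarily the exact steps used above; the manifest URL and pool CIDR depend on the Calico version):

# Install k3s without flannel and without the bundled network policy controller
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --flannel-backend=none --disable-network-policy" sh -

# Apply the upstream Calico manifest; the Calico IP pool may need to match
# the k3s cluster CIDR (10.42.0.0/16 by default)
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml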

@mogoman

mogoman commented Jul 10, 2020

After recent patching I found my cluster to be very unstable (pods not seeing DNS, etc.). Hopefully my 2+ days of troubleshooting can help someone here; if anyone has an idea for diagnostics I could run, please let me know:

I wrote a script that runs nslookup (against google.com) and curl against another pod and a service (both by IP), using kubectl exec against the pods of a DaemonSet; a sketch of this kind of check appears at the end of this comment. Here are my findings:

Running v1.17.7+k3s1 on AWS (Amazon Linux 2 AMI), flannel CNI, no firewalld.

Initial script run (all fine, DNS resolves, service IP pingable, pod IP pingable):

workerc1552-eucentral1c DNS 2 SERVICE 1 POD 1
master803b2-eucentral1c DNS 2 SERVICE 1 POD 1
worker8245d-eucentral1c DNS 2 SERVICE 1 POD 1

Rebooted worker8245d and immediately ran my check script:

prod-aws-ems-workerc1552-eucentral1c DNS 2 SERVICE 0 POD 0
prod-aws-ems-master803b2-eucentral1c DNS 2 SERVICE 0 POD 0
prod-aws-ems-worker8245d-eucentral1c DNS 0 SERVICE 0 POD 0

yep, whole cluster not happy.

after a few mins:

prod-aws-ems-workerc1552-eucentral1c DNS 2 SERVICE 1 POD 1
prod-aws-ems-worker8245d-eucentral1c DNS 2 SERVICE 0 POD 1
prod-aws-ems-master803b2-eucentral1c DNS 2 SERVICE 1 POD 1

No matter how long I waited, worker8245d did not recover until....

Restarted k3s on worker8245d, waited a few seconds and ran the script again; back to normal:

prod-aws-ems-workerc1552-eucentral1c DNS 2 SERVICE 1 POD 1
prod-aws-ems-master803b2-eucentral1c DNS 2 SERVICE 1 POD 1
prod-aws-ems-worker8245d-eucentral1c DNS 2 SERVICE 1 POD 1

Edit: This seems to be a problem on AWS. I tried Red Hat 8 and also the combination with v1.18.4+k3s1 and got the same behaviour: if a node reboots, I have to log in after the reboot and restart k3s-agent. I don't see the same on Hetzner Cloud using CentOS 8.
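
For reference, a minimal sketch of the kind of per-node check described above (the original script was not shared; the DaemonSet label, namespace, and target IPs are placeholders, and the service/pod targets are assumed to serve HTTP):

#!/bin/sh
# Assumes a check DaemonSet labelled app=net-check in kube-system whose image has nslookup and curl
SERVICE_IP=10.43.0.100   # placeholder: ClusterIP of an HTTP service
POD_IP=10.42.1.5         # placeholder: IP of a pod on another node

kubectl get pods -n kube-system -l app=net-check \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}' |
while read -r pod node; do
  dns=0; svc=0; podok=0
  kubectl exec -n kube-system "$pod" -- nslookup google.com >/dev/null 2>&1 && dns=1
  kubectl exec -n kube-system "$pod" -- curl -s -m 5 -o /dev/null "http://$SERVICE_IP" && svc=1
  kubectl exec -n kube-system "$pod" -- curl -s -m 5 -o /dev/null "http://$POD_IP" && podok=1
  echo "$node DNS $dns SERVICE $svc POD $podok"
done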

@stale

stale bot commented Jul 31, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Jul 31, 2021
@stale stale bot closed this as completed Aug 14, 2021