
Nodes unable to connect to services whose pods are scheduled on other nodes #1266

Closed
heyjared opened this issue Jan 3, 2020 · 7 comments

@heyjared

heyjared commented Jan 3, 2020

Version:
k3s version v0.9.0 (65d8764)
to
k3s version v1.17.0+k3s.1 (0f64465)

Describe the bug
Since v0.9.0, nodes and pods with hostNetwork: true have been unable to connect to services whose selected pods are scheduled on other nodes. v0.8.1 is unaffected.

To Reproduce

  1. Form a cluster of two or more nodes, e.g. host1 and host2.
  2. Using kube-dns as an example, the coredns pod is scheduled on host1.
  3. From host2, or from a pod on host2 with hostNetwork: true, attempt to connect to a service IP (e.g. kube-dns); a minimal test pod is sketched below.
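
A minimal sketch of step 3 using a hostNetwork test pod (the pod name, image, and node name are illustrative; 10.43.0.10 is the default k3s kube-dns ClusterIP):

# hostNetwork test pod pinned to host2 (name, image, and node name are placeholders)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hostnet-test
spec:
  hostNetwork: true
  nodeName: host2.example.com
  containers:
  - name: test
    image: busybox:1.31
    command: ["sleep", "3600"]
EOF

# From inside that pod, query the kube-dns service IP
kubectl exec hostnet-test -- nslookup google.com 10.43.0.10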

Expected behavior

[root@host2 ~]# nslookup google.com 10.43.0.10 
Server:		10.43.0.10
Address:	10.43.0.10#53

Non-authoritative answer:
Name:	google.com
Address: 74.125.24.100
[truncated]

Actual behavior

[root@host2 ~]# nslookup google.com 10.43.0.10 
;; connection timed out; no servers could be reached

Additional context

kubectl get nodes

NAME                STATUS   ROLES    AGE     VERSION
host1.example.com   Ready    master   4m59s   v1.15.4-k3s.1
host2.example.com   Ready    worker   4m41s   v1.15.4-k3s.1

kubectl get svc -A

NAMESPACE     NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-system   kube-dns     ClusterIP   10.43.0.10   <none>        53/UDP,53/TCP,9153/TCP   5m26s
default       kubernetes   ClusterIP   10.43.0.1    <none>        443/TCP                  5m24s

kubectl get pods -A -o wide

NAMESPACE     NAME                      READY   STATUS    RESTARTS   AGE     IP          NODE                NOMINATED NODE   READINESS GATES
kube-system   coredns-66f496764-rkwdt   1/1     Running   0          6m56s   10.42.0.2   host1.example.com   <none>           <none>
@heyjared
Author

heyjared commented Jun 8, 2020

This seems to be related to a VXLAN issue in flannel:
flannel-io/flannel#1243
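
A commonly reported workaround for the linked checksum-offload bug is to disable TX checksum offloading on the flannel VXLAN interface (a sketch, assuming the default interface name flannel.1; the setting does not survive a reboot):

# Run on each affected node
sudo ethtool -K flannel.1 tx-checksum-ip-generic off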

@malikbenkirane

malikbenkirane commented Jun 25, 2020

Friendly bump. Has anyone figured out a workaround yet?

@brandond
Member

Don't use vxlan until the upstream issues are resolved?
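
For example, k3s can be installed with a non-vxlan flannel backend (a sketch using the k3s install script; host-gw assumes all nodes share an L2 segment, and the available backends depend on the k3s version):

# Server with the host-gw backend instead of the default vxlan
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --flannel-backend=host-gw" sh -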

@malikbenkirane

Definitely, I switched to --flannel-backend=none with calico :-)

@transhapHigsn
Contributor

@malikbenkirane by any chance, could you share the manifests/steps you used for setting up Calico?
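
For reference, a rough sketch of that kind of setup (not necessarily the exact steps used above; the manifest URL and pool CIDR depend on the Calico version):

# Install k3s without flannel and without the bundled network policy controller
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --flannel-backend=none --disable-network-policy" sh -

# Apply the upstream Calico manifest; the Calico IP pool may need to match
# the k3s cluster CIDR (10.42.0.0/16 by default)
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml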

@mogoman

mogoman commented Jul 10, 2020

After recent patching I found my cluster to be very unstable (pods not seeing DNS, etc.). Hopefully my 2+ days of troubleshooting can help someone here; if anyone has an idea for diagnostics I could run, please let me know:

I wrote a script that runs nslookup (against google.com) and curl against another pod and a service (both by IP), using kubectl exec against the pods of a DaemonSet; a sketch of this kind of check appears at the end of this comment. Here are my findings:

Running v1.17.7+k3s1 on AWS (Amazon Linux 2 AMI), flannel CNI, no firewalld.

Initial script run (all fine, DNS resolves, service IP pingable, pod IP pingable):

workerc1552-eucentral1c DNS 2 SERVICE 1 POD 1
master803b2-eucentral1c DNS 2 SERVICE 1 POD 1
worker8245d-eucentral1c DNS 2 SERVICE 1 POD 1

Rebooted worker8245d and immediately ran my check script:

prod-aws-ems-workerc1552-eucentral1c DNS 2 SERVICE 0 POD 0
prod-aws-ems-master803b2-eucentral1c DNS 2 SERVICE 0 POD 0
prod-aws-ems-worker8245d-eucentral1c DNS 0 SERVICE 0 POD 0

yep, whole cluster not happy.

after a few mins:

prod-aws-ems-workerc1552-eucentral1c DNS 2 SERVICE 1 POD 1
prod-aws-ems-worker8245d-eucentral1c DNS 2 SERVICE 0 POD 1
prod-aws-ems-master803b2-eucentral1c DNS 2 SERVICE 1 POD 1

No matter how long I waited, worker8245d did not recover until....

Restarted k3s on worker8245d, waited a few seconds and ran the script again; back to normal:

prod-aws-ems-workerc1552-eucentral1c DNS 2 SERVICE 1 POD 1
prod-aws-ems-master803b2-eucentral1c DNS 2 SERVICE 1 POD 1
prod-aws-ems-worker8245d-eucentral1c DNS 2 SERVICE 1 POD 1

Edit: This seems to be a problem on AWS. I tried Red Hat 8 and also the combination with v1.18.4+k3s1 and got the same behaviour: if a node reboots, I have to log in after the reboot and restart k3s-agent. I don't see the same on Hetzner Cloud using CentOS 8.
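
For reference, a minimal sketch of the kind of per-node check described above (the original script was not shared; the DaemonSet label, namespace, and target IPs are placeholders, and the service/pod targets are assumed to serve HTTP):

#!/bin/sh
# Assumes a check DaemonSet labelled app=net-check in kube-system whose image has nslookup and curl
SERVICE_IP=10.43.0.100   # placeholder: ClusterIP of an HTTP service
POD_IP=10.42.1.5         # placeholder: IP of a pod on another node

kubectl get pods -n kube-system -l app=net-check \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}' |
while read -r pod node; do
  dns=0; svc=0; podok=0
  kubectl exec -n kube-system "$pod" -- nslookup google.com >/dev/null 2>&1 && dns=1
  kubectl exec -n kube-system "$pod" -- curl -s -m 5 -o /dev/null "http://$SERVICE_IP" && svc=1
  kubectl exec -n kube-system "$pod" -- curl -s -m 5 -o /dev/null "http://$POD_IP" && podok=1
  echo "$node DNS $dns SERVICE $svc POD $podok"
done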

@stale

stale bot commented Jul 31, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Jul 31, 2021
@stale stale bot closed this as completed Aug 14, 2021