
1.17 alpha versions causing regression for kiam? #8562

Closed
jhohertz opened this issue Feb 14, 2020 · 22 comments

@jhohertz (Contributor)

1. What kops version are you running? The command kops version, will display
this information.

Any of the 1.17 alphas so far.

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Seen in 1.17.0-rc.2 through 1.17.3. Works without issue on clusters built with kops/k8s 1.15 and 1.16. The ONLY change is the bump to 1.17.x.

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Try to install kiam via its included helm chart onto a cluster built with kops 1.17.x.

5. What happened after the commands executed?

The kiam-agent DaemonSet pods crashloop.

6. What did you expect to happen?

No crashloop.

7. Please provide your cluster manifest.

Will follow up with this if asked. The main thing applicable here is that we are using CoreDNS.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

From the agent logs w/ gRPC debugging enabled:

kubectl -n kube-system logs kiam-agent-5z8tt
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-02-14T18:13:41Z"}
INFO: 2020/02/14 18:13:41 parsed scheme: "dns"
INFO: 2020/02/14 18:13:46 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.kiam-server on 100.64.0.10:53: dial udp 100.64.0.10:53: operation was canceled.
WARNING: 2020/02/14 18:13:46 grpc: failed dns A record lookup due to lookup kiam-server on 100.64.0.10:53: dial udp 100.64.0.10:53: operation was canceled.
INFO: 2020/02/14 18:13:46 ccResolverWrapper: got new service config: 
INFO: 2020/02/14 18:13:46 ccResolverWrapper: sending new addresses to cc: []
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-02-14T18:13:46Z"}

9. Anything else we need to know?

Bug also posted with kiam folks here: uswitch/kiam#378

@jhohertz (Contributor Author)

The Kubernetes issue just linked is likely at the root of all this.

@jhohertz (Contributor Author)

Update: This seems to be specific to using the flannel/canal CNI with the vxlan backend by some accounts, and further testing seems to support that.

@jhohertz (Contributor Author)

So the problem clearly isn't with kops itself. However, it might be worthwhile to warn users in the documentation, or even to treat the combination of the flannel/canal CNI, the vxlan backend, and Kubernetes 1.17 as an invalid configuration, as it's going to result in more odd reports like this one. :)

@johngmyers (Member)

What, specifically, are the invalid configurations?

@jhohertz (Contributor Author)

  1. Using the Canal CNI (as it is fixed to use vxlan in kops; not sure it works with other backends).

  2. Using the Flannel CNI in its default "vxlan" configuration. Superficial testing shows the problem doesn't seem to exist with the "udp" backend, but most people using Flannel and working around the issue seem to be suggesting the "host-gw" backend, which is not currently usable via kops. (See the spec sketch below.)

See the flannel issue for more info here: flannel-io/flannel#1243
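
For reference, the configurations in question look roughly like this in the kops cluster spec (a sketch of the relevant fields only, not a full manifest):

    # Canal (pinned to the vxlan backend in kops)
    spec:
      networking:
        canal: {}

    # Flannel with an explicit backend choice
    spec:
      networking:
        flannel:
          backend: vxlan   # affected; "udp" did not show the problem in superficial testing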

@johngmyers (Member)

It seems there's not enough information to identify a particular bad configuration. It looks like the issue is still being triaged and is likely a bug in Flannel and/or Canal. There's time before kops 1.17 is released for the bug(s) to be fixed. If it later turns out to be a more permanent situation, we could add an API validation check then.

@jhohertz (Contributor Author)

See comment above for what constitutes a non-working configuration, which I've detailed as requested.

The bug is in Flannel (which Canal uses), and I've linked the issue involved. Yes, it's possible that there will be a fix made available, but I'm not holding my breath as the project seems to be trending towards dormancy.

@johngmyers (Member)

johngmyers commented Feb 22, 2020

So you're proposing kops should disallow a CNI of Canal or Flannel with Backend of vxlan for Kubernetes versions equal to or greater than 1.17?

@justinsb (Member)

Thanks for reporting @jhohertz. The current theory is that it's related to the kernel version: some kernels have bugs in the computation of these checksums, which can be worked around by turning off offload of that computation.

Which image (AMI) are you using (or are you using the default kops image)?
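
For anyone who wants to test that theory by hand, the workaround amounts to something like this on an affected node (assuming the default flannel.1 vxlan device; use your own VNI-suffixed name if you run a custom port/VNI):

    # disable TX checksum offload on the flannel vxlan interface
    sudo ethtool -K flannel.1 tx-checksum-ip-generic off
    # confirm the setting took
    sudo ethtool -k flannel.1 | grep tx-checksum-ip-generic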

@jhohertz (Contributor Author)

We're currently using the latest Flatcar stable release.

I am currently looking at patching in the ethtool workaround for testing.

@jhohertz (Contributor Author)

jhohertz commented Apr 27, 2020

I may have found hints as to what's different between 1.16 and 1.17.

A netlink library dependency was bumped, and within that bump there are specific changes to vxlan and the handling of checksums. It looks like it should really only have added IPv6 UDP checksum support, but after searching around for what's different between 1.16 and 1.17, this kind of stands out.

Comment on flannel issue: flannel-io/flannel#1243 (comment)

Perhaps this will help folks find out what's going on? (Or possibly prove to be a red herring...)

That update also includes new ethtool-related code.

@johngmyers (Member)

Is someone able to write up a release note for Kops 1.17? I would prefer we not hold up 1.17 indefinitely for a new version of Flannel.

@johngmyers (Member)

Can this be closed now that #9074 has been merged and cherrypicked to 1.17?

@jhohertz (Contributor Author)

Probably? Any way you could cut another beta with this in place for wider testing?

@johngmyers (Member)

/close

@k8s-ci-robot (Contributor)

@johngmyers: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hakman (Member)

hakman commented May 11, 2020

@jhohertz I think the next release will be more of an RC or final. Not sure anything else can be done to improve things with Flannel until a new release comes.

@paalkr

paalkr commented May 13, 2020

Is there a kops 1.17.0 build available with this fix included? We have encountered kiam issues when testing kops 1.17.0-beta.2 with flannel networking, which we need for our Windows worker nodes to join.

@olemarkus (Member)

No release yet. It will go into the next one.

@jhohertz (Contributor Author)

Just a note to warn: this nightmare may also have just landed in 1.16, as of k8s 1.16.10. Still investigating, but it's behaving the exact same way.

@paalkr

paalkr commented May 27, 2020

We run flannel on a non-standard port, so for us the suggested fix won't help. But it's easy to address this flannel issue today, using a custom hook in the cluster manifest.

Replace 4096 with 1 if you run the standard flannel setup.

spec:
  hooks:
  - name: flannel-4096-tx-checksum-offload-disable.service
    # Temporary fix until https://github.com/kubernetes/kops/pull/9074 is released
    roles:
    - Node
    - Master
    useRawManifest: true
    manifest: |
      [Unit]
      Description=Disable TX checksum offload on flannel.4096
      After=sys-devices-virtual-net-flannel.4096.device
      After=sys-subsystem-net-devices-flannel.4096.device
      After=docker.service
      [Service]
      Type=oneshot
      ExecStart=/sbin/ethtool -K flannel.4096 tx-checksum-ip-generic off
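
Once a node has booted with this hook, its effect can be verified with something like:

    systemctl status flannel-4096-tx-checksum-offload-disable.service
    ethtool -k flannel.4096 | grep tx-checksum-ip-generic   # should now report "off"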

@jhohertz (Contributor Author)

I guess that was a bit dramatic of me. 😄 It just bothered me that I couldn't explain why, though looking at the .10 patch, an iptables version bump (which also showed up between 1.16.0 and 1.17.0) may be the only networking-related change in it.

I'm aware of that workaround, but thank you for mentioning it anyway.
