
[k3s-upgrade] k3s service failed to start after upgrade #5345

Closed · ac5tin opened this issue Mar 28, 2022 · 15 comments
@ac5tin commented Mar 28, 2022

Environmental Info:
K3s Version:

k3s version v1.23.4+k3s1 (43b1cb48)
go version go1.17.5

Node(s) CPU architecture, OS, and Version:

5.4.0-1056-raspi #63-Ubuntu
aarch64 aarch64 aarch64 GNU/Linux

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

Describe the bug:
I tried to upgrade the k3s version of my cluster (master node and worker nodes) by following the k3s-upgrade documentation.

Steps To Reproduce:

kubectl apply -f https://github.com/rancher/system-upgrade-controller/master/manifests/system-upgrade-controller.yaml

# master nodes
kubectl label node <node-name> k3s-master-upgrade=true
# worker nodes
kubectl label node <node-name> k3s-worker-upgrade=true

# apply upgrade plan
kubectl apply -f agent.yml
kubectl apply -f server.yml

my plans:
server.yml

# Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: k3s-master-upgrade
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.23.4+k3s1

agent.yml

# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: k3s-worker-upgrade
      operator: In
      values:
      - "true"
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/k3s-upgrade
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.23.4+k3s1

Expected behavior:
All nodes to upgrade successfully to k3s version 1.23.4+k3s1

Actual behavior:
The master node's k3s binary was updated, but the k3s service failed to start.

Additional context / logs:

Mar 28 09:25:54 huey sh[3502]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Mar 28 09:25:54 huey sh[3508]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Mar 28 09:25:55 huey k3s[799]: time="2022-03-28T09:25:55Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Starting k3s v1.23.4+k3s1 (43b1cb48)"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Database tables and indexes are up to date"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=info msg="Kine available at unix://kine.sock"
Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=fatal msg="starting kubernetes: preparing server: failed to normalize token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or <PASS>
Mar 28 09:25:56 huey systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
@brandond (Member)

Mar 28 09:25:56 huey k3s[3517]: time="2022-03-28T09:25:56Z" level=fatal msg="starting kubernetes: preparing server: failed to normalize token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or

It looks like the --token value in your config file or systemd unit is in an invalid format. How have you specified it?
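For reference, the secure format named in the error can be sanity-checked with a quick shell snippet. The regex below is an approximation based on the error message, not the exact validation k3s performs, and the token value is a made-up placeholder:

```shell
# Rough format check for a k3s "secure" token: K10<CA-HASH>::<USERNAME>:<PASSWORD>.
# The regex approximates the format named in the error message; it is not the
# exact validation k3s performs.
token='K10abcdef0123456789::server:examplepassword'  # hypothetical placeholder value
if printf '%s' "$token" | grep -Eq '^K10[0-9a-f]+::[^:]+:[^:]+$'; then
  echo "looks like a secure-format token"
else
  echo "short-format or malformed token"
fi
```

On a real node you would substitute the value from your config file, systemd unit, or token file in place of the placeholder.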

@ac5tin (Author) commented Apr 7, 2022

I haven't changed the config file; I'm not sure if it got modified by the upgrade process.
I had to completely uninstall k3s and reinstall from scratch.

@vvanouytsel commented Aug 17, 2022

@brandond
Is there any way to figure out what the token should be in case it got removed in the k3s/server/token file?

@brandond (Member)

No, if you were not manually configuring the token, and all nodes with a copy of the token file have been lost, there is no way to recover the value with only a copy of the datastore.

@vvanouytsel

No, if you were not manually configuring the token, and all nodes with a copy of the token file have been lost, there is no way to recover the value with only a copy of the datastore.

Is it also stored in etcd (or sqlite by default on k3s)?

@brandond (Member)

The bootstrap data (cluster CA certificates and such) are stored in the datastore, encrypted with the token as the key generation passphrase. The token value cannot be extracted from the datastore; that would render the encryption meaningless.

@vvanouytsel

I deleted the k3s/server/token file from the filesystem and restarted the k3s systemd service. In my case k3s was able to restore the contents of that file.

@brandond (Member)

If you delete that file but the token is not specified elsewhere (in the config or on the CLI), then a new one will be generated on startup. This is most likely fine on single-server clusters, but it will cause problems when using etcd or an external SQL datastore.
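One way to guard against losing the token across upgrades is to back the file up beforehand. A minimal sketch, assuming the default data dir `/var/lib/rancher/k3s`; a temp directory and a placeholder token stand in here so the snippet runs anywhere:

```shell
# Simulate backing up the server token before an upgrade. On a real node the
# file lives at /var/lib/rancher/k3s/server/token; a temp dir stands in here.
datadir="$(mktemp -d)"
mkdir -p "$datadir/server"
printf 'K10deadbeef::server:secret\n' > "$datadir/server/token"   # placeholder token

# Back up, then verify the copy matches the original.
cp "$datadir/server/token" "$datadir/server/token.bak"
cmp -s "$datadir/server/token" "$datadir/server/token.bak" && echo "backup verified"
```

Keeping the backup somewhere off the node (or in your secrets manager) also covers the case where the whole server is lost.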

@vvanouytsel

I am indeed running a single-server cluster. Thanks for your explanation!

@bramnet commented Aug 23, 2022

What about multi-node clusters? I ran into this issue while trying to upgrade an agent node from 1.22.6+k3s1 to the latest. Can I just grab the token from another node and force inject it during the upgrade? The weirdest part is that it's communicating with the cluster just fine.

@brandond (Member) commented Aug 23, 2022

@bramnet this issue has wandered a bit; I may need to lock it so that folks can open their own issues describing their individual problems. What is the exact message you're getting?

@bramnet commented Aug 23, 2022

I was just trying to reproduce it again, and suddenly it's saying the node is up to date… not sure what happened here.
All I remember is that it was very similar to the second-to-last line in ac5tin's logs: level=fatal msg="starting kubernetes: preparing server: failed to normalize token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or
What's also weird is that Rancher isn't reflecting that they're up to date… I'll have to look into that.

@RaphaelKimmig

I'm having the same issue on a single node cluster. I noticed that /var/lib/rancher/k3s/server/token has recently been written and is now empty.

@ryan4yin commented Nov 10, 2022

Same here, using single-master mode, version v1.25.3+k3s1. I resolved this by deleting the empty file /var/lib/rancher/k3s/server/token.

@brandond (Member)

I'm not aware of any paths in the k3s code that would cause it to write an empty token file. If anyone else runs into this, and can confirm that they are not using any automation or scripting to manage the content of that file, please open a new issue with steps that can help us reproduce this.
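If you suspect you've hit this, a quick pre-restart check for a missing or empty token file is easy to script. A minimal sketch; the real path is /var/lib/rancher/k3s/server/token, but a temp file stands in below so the snippet runs anywhere:

```shell
# Warn if the server token file is missing or zero-length before restarting k3s.
# /var/lib/rancher/k3s/server/token is the default path; a temp file is used
# here to simulate the empty-file condition reported above.
tokenfile="$(mktemp)"
: > "$tokenfile"   # simulate an empty token file

if [ ! -s "$tokenfile" ]; then
  echo "WARNING: token file is missing or empty; restore it before restarting k3s"
fi
```

Running a check like this before `systemctl restart k3s` avoids silently regenerating a new token on a cluster whose other members still hold the old one.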

@k3s-io k3s-io locked and limited conversation to collaborators Jun 13, 2023
6 participants