
All retries failed, unable to complete the uncordon after reboot workflow error #685

Closed
sushantsoni5392 opened this issue Sep 8, 2022 · 11 comments
Labels: stale (Issues / PRs with no activity)

@sushantsoni5392

sushantsoni5392 commented Sep 8, 2022

Describe the bug
Hi,

In the logs, right after NTH starts, we frequently see errors like the ones below:

2022/09/08 08:18:46 ERR Error when trying to list Nodes w/ label, falling back to direct Get lookup of node error="Get \"https://172.20.0.1:443/api/v1/nodes?labelSelector=kubernetes.io%2Fhostname%3D%3Dip-10-45-5-107.eu-central-1.compute.internal\": dial tcp 172.20.0.1:443: i/o timeout"
2022/09/08 08:18:46 WRN All retries failed, unable to complete the uncordon after reboot workflow error="timed out waiting for the condition"

I wanted to understand if this error affects anything.

Steps to reproduce

Expected outcome
No errors

Application Logs
The log output when experiencing the issue.

2022/09/08 08:18:14 INF aws-node-termination-handler arguments:
	dry-run: false,
	node-name: ip-10-45-5-137.eu-central-1.compute.internal,
	pod-name: aws-node-termination-handler-866sr,
	metadata-url: http://169.254.169.254,
	kubernetes-service-host: 172.20.0.1,
	kubernetes-service-port: 443,
	delete-local-data: true,
	ignore-daemon-sets: true,
	pod-termination-grace-period: -1,
	node-termination-grace-period: 120,
	enable-scheduled-event-draining: true,
	enable-spot-interruption-draining: true,
	enable-sqs-termination-draining: false,
	enable-rebalance-monitoring: true,
	enable-rebalance-draining: false,
	metadata-tries: 3,
	cordon-only: false,
	taint-node: true,
	taint-effect: NoSchedule,
	exclude-from-load-balancers: false,
	json-logging: false,
	log-level: info,
	webhook-proxy: ,
	webhook-headers: <not-displayed>,
	webhook-url: ,
	webhook-template: <not-displayed>,
	uptime-from-file: /proc/uptime,
	enable-prometheus-server: false,
	prometheus-server-port: 9092,
	emit-kubernetes-events: false,
	kubernetes-events-extra-annotations: ,
	aws-region: eu-central-1,
	queue-url: ,
	check-asg-tag-before-draining: true,
	managed-asg-tag: aws-node-termination-handler/managed,
	assume-asg-tag-propagation: false,
	aws-endpoint: ,

2022/09/08 08:18:44 ERR Error when trying to list Nodes w/ label, falling back to direct Get lookup of node error="Get \"https://172.20.0.1:443/api/v1/nodes?labelSelector=kubernetes.io%2Fhostname%3D%3Dip-10-45-5-137.eu-central-1.compute.internal\": dial tcp 172.20.0.1:443: i/o timeout"
2022/09/08 08:18:44 WRN All retries failed, unable to complete the uncordon after reboot workflow error="timed out waiting for the condition"
2022/09/08 08:18:44 INF Started watching for interruption events
2022/09/08 08:18:44 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/09/08 08:18:44 INF Started watching for event cancellations
2022/09/08 08:18:44 INF Started monitoring for events event_type=SCHEDULED_EVENT
2022/09/08 08:18:44 INF Started monitoring for events event_type=SPOT_ITN
2022/09/08 08:18:44 INF Started monitoring for events event_type=REBALANCE_RECOMMENDATION
2022/09/08 08:48:44 INF event store statistics drainable-events=0 size=0

Environment

  • NTH App Version: 1.16.0
  • NTH Mode (IMDS/Queue processor): IMDS
  • OS/Arch: Linux
  • Kubernetes version: 1.21
  • Installation method: helm
@github-actions

github-actions bot commented Oct 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Oct 8, 2022
@github-actions

This issue was closed because it has become stale with no activity.

@sidewinder12s

Also seeing a similar error with v1.18.2.

@snay2
Contributor

snay2 commented Dec 16, 2022

@sidewinder12s Do you run in IMDS mode or QP mode? We have a PR open to fix this for QP mode (#743), but if it's still happening in IMDS mode we'll need some more investigation.

@brydoncheyney

brydoncheyney commented Dec 22, 2022

I can confirm that we see this in v1.18.2. Node termination events are received and processed as expected, and the error does not appear to affect anything, but, to echo the original question, I wanted to understand whether this error affects anything.

Default install - no overrides.

2022/12/21 19:29:22 INF aws-node-termination-handler arguments:
	dry-run: false,
	node-name: ip-172-30-109-126.ec2.internal,
	pod-name: aws-node-termination-handler-mmpcp,
	metadata-url: http://169.254.169.254,
	kubernetes-service-host: 10.100.0.1,
	kubernetes-service-port: 443,
	delete-local-data: true,
	ignore-daemon-sets: true,
	pod-termination-grace-period: -1,
	node-termination-grace-period: 120,
	enable-scheduled-event-draining: true,
	enable-spot-interruption-draining: true,
	enable-sqs-termination-draining: false,
	enable-rebalance-monitoring: false,
	enable-rebalance-draining: false,
	metadata-tries: 3,
	cordon-only: false,
	taint-node: false,
	taint-effect: NoSchedule,
	exclude-from-load-balancers: false,
	json-logging: false,
	log-level: info,
	webhook-proxy: ,
	webhook-headers: <not-displayed>,
	webhook-url: ,
	webhook-template: <not-displayed>,
	uptime-from-file: /proc/uptime,
	enable-prometheus-server: false,
	prometheus-server-port: 9092,
	emit-kubernetes-events: false,
	kubernetes-events-extra-annotations: ,
	aws-region: us-east-1,
	queue-url: ,
	check-tag-before-draining: true,
	managed-tag: aws-node-termination-handler/managed,
	use-provider-id: false,
	aws-endpoint: ,

2022/12/21 19:29:22 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2022/12/21 19:29:24 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2022/12/21 19:29:26 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2022/12/21 19:29:28 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2022/12/21 19:29:30 WRN All retries failed, unable to complete the uncordon after reboot workflow error="timed out waiting for the condition"
2022/12/21 19:29:30 INF Started watching for interruption events
2022/12/21 19:29:30 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/12/21 19:29:30 INF Started watching for event cancellations
2022/12/21 19:29:30 INF Started monitoring for events event_type=SCHEDULED_EVENT_MONITOR
2022/12/21 19:29:30 INF Started monitoring for events event_type=SPOT_ITN_MONITOR

Environment

  • NTH App Version: 1.18.2
  • NTH Mode (IMDS/Queue processor): IMDS
  • OS/Arch: Linux
  • Kubernetes version: 1.24 (EKS)
  • Installation method: helm template | krane

@brydoncheyney

In our case, we would also see this due to a mismatch between the hostname (as defined by the VPC DHCP option sets) and the kubernetes.io/hostname label.

log.Err(err).Msgf("Error when trying to list Nodes w/ label, falling back to direct Get lookup of node")

It would seem sensible to log this message as INFO/WARNING rather than ERROR, as this failure does not necessarily indicate an unrecoverable error. If no nodes match the label, it instead falls back to a Get by nodeName and passes any error up the stack -

return n.drainHelper.Client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
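The list-then-fallback flow being discussed can be sketched without client-go. This is a minimal stand-in, not NTH's actual code: the `Node` struct, `fetchNode`, and the sample data are all illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a minimal stand-in for a Kubernetes node object.
type Node struct {
	Name   string
	Labels map[string]string
}

// fetchNode mirrors the lookup flow: list nodes matching the
// kubernetes.io/hostname label first; if nothing matches (as happens when
// DHCP option sets change the hostname), fall back to a direct Get by name.
func fetchNode(nodes []Node, hostname, nodeName string) (*Node, error) {
	// Step 1: label-selector list (kubernetes.io/hostname==<hostname>).
	for i := range nodes {
		if nodes[i].Labels["kubernetes.io/hostname"] == hostname {
			return &nodes[i], nil
		}
	}
	// Step 2: the label lookup found nothing; this is the point where NTH
	// logs "falling back to direct Get lookup of node".
	for i := range nodes {
		if nodes[i].Name == nodeName {
			return &nodes[i], nil
		}
	}
	return nil, errors.New("node " + nodeName + " not found")
}

func main() {
	// The label disagrees with the node name, so only the fallback succeeds.
	nodes := []Node{{
		Name:   "ip-10-45-5-137.eu-central-1.compute.internal",
		Labels: map[string]string{"kubernetes.io/hostname": "custom-name"},
	}}
	n, err := fetchNode(nodes,
		"ip-10-45-5-137.eu-central-1.compute.internal",
		"ip-10-45-5-137.eu-central-1.compute.internal")
	fmt.Println(n.Name, err)
}
```

With a mismatched kubernetes.io/hostname label, the label-selector list returns nothing and the direct Get by node name is what actually succeeds, which is why the message is arguably a warning rather than an error.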

@snay2
Contributor

snay2 commented Dec 22, 2022

@brydoncheyney Thanks for the explanation! I agree that this would be better as a WARNING log, since it does not by itself prevent the handler from functioning and there is a fallback mechanism. Feel free to open a PR to that effect if you have time.

@snay2 snay2 reopened this Dec 22, 2022
@snay2 snay2 added Status: Help Wanted and removed stale Issues / PRs with no activity labels Dec 22, 2022
@sidewinder12s

Ya I was in IMDS mode and do have a mismatch between the VPC option set and actual hostnames.

@lordz-md

lordz-md commented Jan 2, 2023

@sidewinder12s, I have a similar error; my hostname is the instance-id. Should the fix be applied to the DHCP options, setting the hostname to the instance-id?

@LikithaVemulapalli LikithaVemulapalli self-assigned this Jan 18, 2023
@cjerad cjerad added Pending-Release Pending an NTH or eks-charts release and removed Status: Help Wanted labels Jan 26, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Feb 25, 2023
@github-actions

github-actions bot commented Mar 2, 2023

This issue was closed because it has become stale with no activity.

@github-actions github-actions bot closed this as completed Mar 2, 2023
@cjerad cjerad removed the Pending-Release Pending an NTH or eks-charts release label Mar 8, 2023