
All retries failed, unable to complete the uncordon after reboot workflow error #685

Closed
sushantsoni5392 opened this issue Sep 8, 2022 · 11 comments
Labels: stale (Issues / PRs with no activity)

@sushantsoni5392

sushantsoni5392 commented Sep 8, 2022

Describe the bug
Hi,

In the logs, right after NTH starts, we frequently see errors like the ones below:

2022/09/08 08:18:46 ERR Error when trying to list Nodes w/ label, falling back to direct Get lookup of node error="Get \"https://172.20.0.1:443/api/v1/nodes?labelSelector=kubernetes.io%2Fhostname%3D%3Dip-10-45-5-107.eu-central-1.compute.internal\": dial tcp 172.20.0.1:443: i/o timeout"
2022/09/08 08:18:46 WRN All retries failed, unable to complete the uncordon after reboot workflow error="timed out waiting for the condition"

I wanted to understand if this error affects anything.

Steps to reproduce

Expected outcome
No errors

Application Logs
The log output when experiencing the issue.

2022/09/08 08:18:14 INF aws-node-termination-handler arguments:
	dry-run: false,
	node-name: ip-10-45-5-137.eu-central-1.compute.internal,
	pod-name: aws-node-termination-handler-866sr,
	metadata-url: http://169.254.169.254,
	kubernetes-service-host: 172.20.0.1,
	kubernetes-service-port: 443,
	delete-local-data: true,
	ignore-daemon-sets: true,
	pod-termination-grace-period: -1,
	node-termination-grace-period: 120,
	enable-scheduled-event-draining: true,
	enable-spot-interruption-draining: true,
	enable-sqs-termination-draining: false,
	enable-rebalance-monitoring: true,
	enable-rebalance-draining: false,
	metadata-tries: 3,
	cordon-only: false,
	taint-node: true,
	taint-effect: NoSchedule,
	exclude-from-load-balancers: false,
	json-logging: false,
	log-level: info,
	webhook-proxy: ,
	webhook-headers: <not-displayed>,
	webhook-url: ,
	webhook-template: <not-displayed>,
	uptime-from-file: /proc/uptime,
	enable-prometheus-server: false,
	prometheus-server-port: 9092,
	emit-kubernetes-events: false,
	kubernetes-events-extra-annotations: ,
	aws-region: eu-central-1,
	queue-url: ,
	check-asg-tag-before-draining: true,
	managed-asg-tag: aws-node-termination-handler/managed,
	assume-asg-tag-propagation: false,
	aws-endpoint: ,

2022/09/08 08:18:44 ERR Error when trying to list Nodes w/ label, falling back to direct Get lookup of node error="Get \"https://172.20.0.1:443/api/v1/nodes?labelSelector=kubernetes.io%2Fhostname%3D%3Dip-10-45-5-137.eu-central-1.compute.internal\": dial tcp 172.20.0.1:443: i/o timeout"
2022/09/08 08:18:44 WRN All retries failed, unable to complete the uncordon after reboot workflow error="timed out waiting for the condition"
2022/09/08 08:18:44 INF Started watching for interruption events
2022/09/08 08:18:44 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/09/08 08:18:44 INF Started watching for event cancellations
2022/09/08 08:18:44 INF Started monitoring for events event_type=SCHEDULED_EVENT
2022/09/08 08:18:44 INF Started monitoring for events event_type=SPOT_ITN
2022/09/08 08:18:44 INF Started monitoring for events event_type=REBALANCE_RECOMMENDATION
2022/09/08 08:48:44 INF event store statistics drainable-events=0 size=0

Environment

  • NTH App Version: 1.16.0
  • NTH Mode (IMDS/Queue processor): IMDS
  • OS/Arch: Linux
  • Kubernetes version: 1.21
  • Installation method: helm
@github-actions

github-actions bot commented Oct 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Oct 8, 2022
@github-actions

This issue was closed because it has become stale with no activity.

@sidewinder12s

Also seeing a similar error with v1.18.2.

@snay2
Contributor

snay2 commented Dec 16, 2022

@sidewinder12s Do you run in IMDS mode or QP mode? We have a PR open to fix this for QP mode (#743), but if it's still happening in IMDS mode we'll need some more investigation.

@brydoncheyney

brydoncheyney commented Dec 22, 2022

I can confirm that we see this in v1.18.2. Node termination events are received and processed as expected, and the error does not appear to affect anything, but, to echo the original question, I wanted to understand whether this error affects anything.

Default install - no overrides.

2022/12/21 19:29:22 INF aws-node-termination-handler arguments:
	dry-run: false,
	node-name: ip-172-30-109-126.ec2.internal,
	pod-name: aws-node-termination-handler-mmpcp,
	metadata-url: http://169.254.169.254,
	kubernetes-service-host: 10.100.0.1,
	kubernetes-service-port: 443,
	delete-local-data: true,
	ignore-daemon-sets: true,
	pod-termination-grace-period: -1,
	node-termination-grace-period: 120,
	enable-scheduled-event-draining: true,
	enable-spot-interruption-draining: true,
	enable-sqs-termination-draining: false,
	enable-rebalance-monitoring: false,
	enable-rebalance-draining: false,
	metadata-tries: 3,
	cordon-only: false,
	taint-node: false,
	taint-effect: NoSchedule,
	exclude-from-load-balancers: false,
	json-logging: false,
	log-level: info,
	webhook-proxy: ,
	webhook-headers: <not-displayed>,
	webhook-url: ,
	webhook-template: <not-displayed>,
	uptime-from-file: /proc/uptime,
	enable-prometheus-server: false,
	prometheus-server-port: 9092,
	emit-kubernetes-events: false,
	kubernetes-events-extra-annotations: ,
	aws-region: us-east-1,
	queue-url: ,
	check-tag-before-draining: true,
	managed-tag: aws-node-termination-handler/managed,
	use-provider-id: false,
	aws-endpoint: ,

2022/12/21 19:29:22 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2022/12/21 19:29:24 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2022/12/21 19:29:26 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2022/12/21 19:29:28 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2022/12/21 19:29:30 WRN All retries failed, unable to complete the uncordon after reboot workflow error="timed out waiting for the condition"
2022/12/21 19:29:30 INF Started watching for interruption events
2022/12/21 19:29:30 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/12/21 19:29:30 INF Started watching for event cancellations
2022/12/21 19:29:30 INF Started monitoring for events event_type=SCHEDULED_EVENT_MONITOR
2022/12/21 19:29:30 INF Started monitoring for events event_type=SPOT_ITN_MONITOR

Environment

  • NTH App Version: 1.18.2
  • NTH Mode (IMDS/Queue processor): IMDS
  • OS/Arch: Linux
  • Kubernetes version: 1.24 (EKS)
  • Installation method: helm template | krane

@brydoncheyney

In our case, we would also see this due to a mismatch between the hostname (as defined by the VPC DHCP option sets) and the kubernetes.io/hostname label.

log.Err(err).Msgf("Error when trying to list Nodes w/ label, falling back to direct Get lookup of node")

It would seem sensible to log this message as INFO/WARNING rather than ERROR, as this failure does not necessarily indicate an unrecoverable error. If no nodes match the label, it instead falls back to a Get by nodeName and passes any error up the stack -

return n.drainHelper.Client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
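The list-then-fallback flow being discussed can be sketched without client-go. This is a minimal stand-in, not NTH's actual code: the `Node` struct, `fetchNode`, and the sample data are all illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a minimal stand-in for a Kubernetes node object.
type Node struct {
	Name   string
	Labels map[string]string
}

// fetchNode mirrors the lookup flow: list nodes matching the
// kubernetes.io/hostname label first; if nothing matches (as happens when
// DHCP option sets change the hostname), fall back to a direct Get by name.
func fetchNode(nodes []Node, hostname, nodeName string) (*Node, error) {
	// Step 1: label-selector list (kubernetes.io/hostname==<hostname>).
	for i := range nodes {
		if nodes[i].Labels["kubernetes.io/hostname"] == hostname {
			return &nodes[i], nil
		}
	}
	// Step 2: the label lookup found nothing; this is the point where NTH
	// logs "falling back to direct Get lookup of node".
	for i := range nodes {
		if nodes[i].Name == nodeName {
			return &nodes[i], nil
		}
	}
	return nil, errors.New("node " + nodeName + " not found")
}

func main() {
	// The label disagrees with the node name, so only the fallback succeeds.
	nodes := []Node{{
		Name:   "ip-10-45-5-137.eu-central-1.compute.internal",
		Labels: map[string]string{"kubernetes.io/hostname": "custom-name"},
	}}
	n, err := fetchNode(nodes,
		"ip-10-45-5-137.eu-central-1.compute.internal",
		"ip-10-45-5-137.eu-central-1.compute.internal")
	fmt.Println(n.Name, err)
}
```

With a mismatched kubernetes.io/hostname label, the label-selector list returns nothing and the direct Get by node name is what actually succeeds, which is why the message is arguably a warning rather than an error.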

@snay2
Contributor

snay2 commented Dec 22, 2022

@brydoncheyney Thanks for the explanation! I agree that this would be better as a WARNING log, since it does not by itself prevent the handler from functioning and there is a fallback mechanism. Feel free to open a PR to that effect if you have time.

@snay2 snay2 reopened this Dec 22, 2022
@snay2 snay2 added Status: Help Wanted and removed stale Issues / PRs with no activity labels Dec 22, 2022
@sidewinder12s

Ya I was in IMDS mode and do have a mismatch between the VPC option set and actual hostnames.

@lordz-md

lordz-md commented Jan 2, 2023

@sidewinder12s, I have a similar error; my hostname is the instance-id. Should the fix be applied to the DHCP options, setting the hostname to the instance-id?

@LikithaVemulapalli LikithaVemulapalli self-assigned this Jan 18, 2023
@cjerad cjerad added Pending-Release Pending an NTH or eks-charts release and removed Status: Help Wanted labels Jan 26, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Feb 25, 2023
@github-actions

github-actions bot commented Mar 2, 2023

This issue was closed because it has become stale with no activity.

@github-actions github-actions bot closed this as completed Mar 2, 2023
@cjerad cjerad removed the Pending-Release Pending an NTH or eks-charts release label Mar 8, 2023