Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add liveness probe to Celery workers #25561

Merged
merged 3 commits into from
Aug 26, 2022

Conversation

jedcunningham
Copy link
Member

This adds a liveness probe to our workers, to help guard against the worker being "up" but not communicating with Celery.

Might help with #24731, though it'll be a pretty blunt solution.

@jedcunningham jedcunningham added this to the Airflow Helm Chart 1.7.0 milestone Aug 5, 2022
@boring-cyborg boring-cyborg bot added the area:helm-chart Airflow Helm Chart label Aug 5, 2022
Copy link
Contributor

@pingzh pingzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this prob can only work when worker_enable_remote_control is true.

@jedcunningham
Copy link
Member Author

@pingzh, very good call. Do you know of a better probe to use when it's disabled?

I'm tempted to just add an enabled flag around this feature so it can just be turned off. What do you think about that?

@pingzh
Copy link
Contributor

pingzh commented Aug 10, 2022

@pingzh, very good call. Do you know of a better probe to use when it's disabled?

I'm tempted to just add an enabled flag around this feature so it can just be turned off. What do you think about that?

I am not aware of other better probe methods. For us, we turn off worker_enable_remote_control, it is due to that we use SQS as the message broker, which worker_enable_remote_control it creates lots of pidbox queues. It should be ok for other cases.

I like the idea of adding an enabled flag`.

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth adding a note somewhere about not enabling this with SQS?

@potiuk
Copy link
Member

potiuk commented Aug 26, 2022

Is it worth adding a note somewhere about not enabling this with SQS?

SQS is not officially supported by Airflow. We disucssed it, but Amazon team experience is that it has many more quirks and the level of support in Celery is definitely not on par with Redis/RabbitMQ so we should refrain from even stating that SQS can be used in Airflow #24019.

@anu251989
Copy link

@jedcunningham, I have enabled health checks for workers as workers not processing any messages when redis and workers communication broken.
After enabling the liveness checks ended up with High memory utilization for worker pods. I have disabled the liveness checks and memory utilization fine. Could you please help on this issue.

The liveness checks are causing memory leak.

@potiuk
Copy link
Member

potiuk commented Jan 17, 2023

@jedcunningham, I have enabled health checks for workers as workers not processing any messages when redis and workers communication broken. After enabling the liveness checks ended up with High memory utilization for worker pods. I have disabled the liveness checks and memory utilization fine. Could you please help on this issue.

The liveness checks are causing memory leak.

I believe this is the issue with K8S livenessprobe kubernetes-sigs/vsphere-csi-driver#778 - you can update K8S to latest version and check that the CSI livenessprobe is of the right version kubernetes-csi/livenessprobe#94

Generally upgrading whatever K8S you are usiung to latest version is highly recommended.

Please double-check that @anu251989 and in case you observe the same issue with latest version of K8S, report it please as a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:helm-chart Airflow Helm Chart
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants