Skip to content

[Logs] Fix a race condition in CloudWatch Agent startup that could cause nodes bootstrap failures #2993

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

gmarciani
Copy link
Contributor

@gmarciani gmarciani commented Jul 15, 2025

Description of changes

Fix a race condition in CloudWatch Agent startup that could cause nodes bootstrap failures.
To do that we simply need to move the staging configuration created by pcluster from /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json to /etc/parallelcluster/amazon-cloudwatch-agent/amazon-cloudwatch-agent.json.

Why moving that file solves the race condition?
We start the CloudWatch Agent by using amazon-cloudwatch-agent-ctl, providing as input the configuration file we author; see code. In particular we execute two times that command in two different modes, as safety net. However, that command assumes to have full control over directories in /opt/aws/amazon-cloudwatch-agent/etc/. In particular, there is a path in the code that may trigger the removal of files in that directory, thus removing the configuration file. So when the first command fails and the second one is execute, the config file provided as input may have been deleted by the first execution.

Moving the config file out of /opt/aws/amazon-cloudwatch-agent/etc/ is the mitigation suggested by the CloudWatch team itself.

Tests

  • [SUCCESS] Cluster creation succeeded on Ubuntu24 (the OS where, for unknown reasons, this race condition occurred the most) and verified that logs are pushed to CloudWatch.
  • [PENDING] test_cloudwatch_logging

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

… `/etc/parallelcluster/amazon-cloudwatch-agent/amazon-cloudwatch-agent.json`.

This is to prevent a race condition in the way we start the CW agent, that may lead to undesired deletion of the config file and eventually the node bootstrap failure caused by CW agent failing to start.
@gmarciani gmarciani force-pushed the wip/mgiacomo/3140/cwa-race-condition-0715-1 branch from 9546b9d to 81c9d4e Compare July 15, 2025 15:52
@gmarciani gmarciani changed the title [Logs] Move the CW Agent configuration authored by ParallelCluster to… [Logs] Fix a race condition in CloudWatch Agent startup that could cause nodes bootstrap failures Jul 15, 2025
@gmarciani gmarciani added the bug label Jul 15, 2025
@gmarciani gmarciani marked this pull request as ready for review July 15, 2025 16:55
@gmarciani gmarciani requested review from a team as code owners July 15, 2025 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant