[Logs] Fix a race condition in CloudWatch Agent startup that could cause node bootstrap failures #2993
+10
−4
Description of changes
Fix a race condition in CloudWatch Agent startup that could cause node bootstrap failures.
To do that, we simply need to move the staging configuration created by pcluster from /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json to /etc/parallelcluster/amazon-cloudwatch-agent/amazon-cloudwatch-agent.json.
Why does moving that file solve the race condition?
We start the CloudWatch Agent with amazon-cloudwatch-agent-ctl, providing as input the configuration file we author; see code. In particular, we execute that command twice, in two different modes, as a safety net. However, that command assumes it has full control over the directories under /opt/aws/amazon-cloudwatch-agent/etc/. In particular, there is a path in its code that may trigger the removal of files in that directory, including our configuration file. So when the first command fails and the second one is executed, the config file provided as input may already have been deleted by the first execution.
Moving the config file out of /opt/aws/amazon-cloudwatch-agent/etc/ is the mitigation suggested by the CloudWatch team itself.
Tests
test_cloudwatch_logging
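For reviewers who want to see the failure mode in isolation, here is a minimal Python sketch of the race described in this PR. It does not call the real amazon-cloudwatch-agent-ctl. The function simulated_ctl is a hypothetical stand-in that, on its failure path, removes the files in its own etc directory, which is an assumption modelled on the behaviour described above. The directory names and the helpers start_with_fallback and run_scenario are likewise illustrative only.

```python
import tempfile
from pathlib import Path


def simulated_ctl(agent_etc: Path, config: Path, fail: bool) -> bool:
    """Hypothetical stand-in for the agent control command.

    Assumption: on failure it cleans up every file in its own etc
    directory, because it believes it owns that directory.
    """
    if fail:
        for entry in agent_etc.iterdir():
            if entry.is_file():
                entry.unlink()
        return False
    # A successful run only needs the input config to still exist.
    return config.is_file()


def start_with_fallback(agent_etc: Path, config: Path) -> bool:
    """Run the command twice: a first attempt that fails, then a fallback."""
    if simulated_ctl(agent_etc, config, fail=True):
        return True
    return simulated_ctl(agent_etc, config, fail=False)


def run_scenario(stage_inside: bool) -> bool:
    """Return True if the fallback attempt still finds the staged config."""
    with tempfile.TemporaryDirectory() as root:
        agent_etc = Path(root) / "agent" / "etc"   # directory owned by the agent
        staging = Path(root) / "parallelcluster"   # directory owned by us
        agent_etc.mkdir(parents=True)
        staging.mkdir()
        target_dir = agent_etc if stage_inside else staging
        config = target_dir / "amazon-cloudwatch-agent.json"
        config.write_text("{}")
        return start_with_fallback(agent_etc, config)


if __name__ == "__main__":
    print("staged inside agent etc:", run_scenario(stage_inside=True))
    print("staged outside agent etc:", run_scenario(stage_inside=False))
```

In this simulation the fallback attempt only succeeds when the config is staged outside the directory the command manages, which is the reasoning behind the move in this PR.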
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.