
Delete cluster name state file whenever slurm accounting is configured or updated #2994


Open · wants to merge 1 commit into develop

Conversation

@hgreebe (Contributor) commented Jul 15, 2025

Description of changes

  • Addresses a known Slurm error where the cluster ID saved in /var/spool/slurm.state/clustername does not match the cluster ID that slurmdbd has
  • This fix needs to be in both clear_slurm_accounting and config_slurm_accounting:
    • If a cluster is created without Slurm accounting and then updated to enable it, config_slurm_accounting runs and hits the cluster ID mismatch
    • If a cluster is created with Slurm accounting and then updated, clear_slurm_accounting runs and hits the cluster ID mismatch
  • Example error message:
[2025-07-10T00:54:46.362] fatal: CLUSTER ID MISMATCH.
slurmctld has been started with "ClusterID=4018"  from the state files in StateSaveLocation, but the DBD thinks it should be "3073".
Running multiple clusters from a shared StateSaveLocation WILL CAUSE CORRUPTION.
Remove /var/spool/slurm.state/clustername to override this safety check if this is intentional.
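The change described above amounts to deleting the state file from a recipe before slurmctld starts. A minimal sketch of such a resource, assuming the Chef DSL used by this cookbook (the exact resource body in the PR may differ):

```ruby
# Sketch: remove the stale cluster ID state file so that, on the next
# start, slurmctld accepts the cluster ID registered in slurmdbd.
bash "Remove existing cluster name state file" do
  user 'root'
  code <<-CLUSTERSTATE
    rm /var/spool/slurm.state/clustername
  CLUSTERSTATE
  only_if { ::File.exist?('/var/spool/slurm.state/clustername') }
end
```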

Tests

  • Created a new AMI with the cookbook changes and ran the test_slurm_accounting and test_slurm integ tests.
  • Verified that with one job running and one job pending, stopping slurmctld, deleting /var/spool/slurm.state/clustername, and then restarting slurmctld leaves the Slurm state unchanged.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hgreebe hgreebe requested review from a team as code owners July 15, 2025 15:03

codecov bot commented Jul 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.50%. Comparing base (6127e18) to head (e0cc721).
Report is 11 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #2994   +/-   ##
========================================
  Coverage    75.50%   75.50%           
========================================
  Files           23       23           
  Lines         2356     2356           
========================================
  Hits          1779     1779           
  Misses         577      577           
Flag Coverage Δ
unittests 75.50% <ø> (ø)

Flags with carried forward coverage won't be shown.

@gmarciani (Contributor) commented:
> if I have one job running and one job pending and then stop slurmctld and delete /var/spool/slurm.state/clustername, once I restart slurmctld, the slurm state remains the same.

Do we have an integ test capturing the same scenario when we execute a cluster update?
If not, it would be nice to add one.

@@ -88,6 +88,15 @@
retry_delay 10
end unless kitchen_test? || (node['cluster']['node_type'] == "ExternalSlurmDbd")

bash "Remove existing cluster name state file" do
Could you please cover this new logic with a spec test?

@@ -23,3 +23,12 @@
supports restart: false
action %i(disable stop)
end

bash "Remove existing cluster name state file" do

Could you please cover this new logic with a spec test?

code <<-CLUSTERSTATE
rm /var/spool/slurm.state/clustername
CLUSTERSTATE
only_if { ::File.exist?('/var/spool/slurm.state/clustername') }
Why using an only_if rather than having a rm-f?
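For context on the question: `rm -f` suppresses the missing-file error, so it would make the `only_if` guard unnecessary. A small standalone illustration of those semantics using Ruby's `FileUtils.rm_f` on a temporary path (a stand-in for the real state file, not the PR's actual code):

```ruby
require 'fileutils'
require 'tmpdir'

Dir.mktmpdir do |dir|
  # Stand-in for /var/spool/slurm.state/clustername
  state_file = File.join(dir, 'clustername')
  File.write(state_file, '4018')

  FileUtils.rm_f(state_file)   # removes the file
  puts File.exist?(state_file) # => false

  FileUtils.rm_f(state_file)   # no-op on a missing file, raises no error
  puts File.exist?(state_file) # => false
end
```

With `rm -f` in the resource's `code` block, the bash resource converges cleanly whether or not the file exists.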

@gmarciani (Contributor) commented:
With this change we are fixing a bug. Can we surface it in the changelog?

@@ -23,3 +23,12 @@
supports restart: false
action %i(disable stop)
end

bash "Remove existing cluster name state file" do
This logic is duplicated. What about reducing code duplication by defining this logic into a function and call that function?
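One way to apply this suggestion (a hypothetical sketch; the module and method names are assumptions, not part of the PR) is to move the removal into a shared library helper that both clear_slurm_accounting and config_slurm_accounting call:

```ruby
# Hypothetical cookbook library helper (e.g. placed under libraries/),
# callable from both clear_slurm_accounting and config_slurm_accounting.
require 'fileutils'

module SlurmStateHelper
  CLUSTERNAME_STATE_FILE = '/var/spool/slurm.state/clustername'.freeze

  # Remove the cluster name state file so slurmctld re-reads its
  # cluster ID from slurmdbd on the next start. rm_f is a no-op
  # when the file is already absent, so no existence guard is needed.
  def remove_clustername_state_file(path = CLUSTERNAME_STATE_FILE)
    FileUtils.rm_f(path)
  end

  module_function :remove_clustername_state_file
end
```

Each recipe would then invoke `SlurmStateHelper.remove_clustername_state_file` instead of duplicating the bash resource.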
