Slurm collector crashes when component field cannot be parsed #510

Closed
QuantumDancer opened this issue Oct 25, 2023 · 0 comments · Fixed by #516
QuantumDancer commented Oct 25, 2023

Issue

The Slurm collector crashes when a component field cannot be parsed. For example, the MaxRSS field can be empty for very short jobs, since memory usage is only sampled at a fixed interval.

[root@slurm ~]# sacct -a --format Partition,NCPUS,SystemCPU,UserCPU,TotalCPU,MaxRSS,ReqMem,NNodes,JobID,Start,End,Group,User,State --noconvert -j 6776529
 Partition      NCPUS  SystemCPU    UserCPU   TotalCPU     MaxRSS     ReqMem   NNodes JobID                      Start                 End     Group      User      State 
---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- ------------ ------------------- ------------------- --------- --------- ---------- 
queue               1  00:02.077  00:03.436  00:05.514                 2000M        1 6776529      2023-10-24T13:26:59 2023-10-24T13:27:27     group      user     FAILED 
                    1  00:02.077  00:03.436  00:05.514                              1 6776529.bat+ 2023-10-24T13:26:59 2023-10-24T13:27:27                         FAILED 

When parsing this output, the Auditor collector logs an error but continues with the next line:

[2023-10-25T07:32:36.834Z] ERROR: AUDITOR-slurm-collector/16770 on slurm.bfg.uni-freiburg.de: [CALLING SACCT AND PARSING OUTPUT - EVENT] Something went wrong during parsing (map1, id: 6782841, key: MaxRSS, value: None) (file=collectors/slurm/src/sacctcaller.rs,line=141,target=auditor_slurm_collector::sacctcaller)

However, when constructing the components for this record, the collector crashes. This is because in this case the MaxRSS key is missing from the job information that is passed to construct_components():

[2023-10-25T07:32:36.840Z] DEBUG: AUDITOR-slurm-collector/16770 on slurm.bfg.uni-freiburg.de: [CONSTRUCT COMPONENTS FROM JOB INFO AND CONFIGURATION - END] (elapsed_milliseconds=1,file=collectors/slurm/src/sacctcaller.rs,line=317,target=auditor_slurm_collector::sacctcaller)
    job: {"NNodes": Integer(1), "SystemCPU": Integer(1995), "JobID": String("6776554"), "ReqMem": Integer(2000), "End": DateTime(2023-10-24T11:27:26Z), "User": String("user"), "Partition": String("queue"), "Start": DateTime(2023-10-24T11:26:59Z), "Group": String("group"), "NCPUS": Integer(1), "TotalCPU": Integer(5349), "State": String("FAILED"), "UserCPU": Integer(3354)}

This is the error message:

thread 'tokio-runtime-worker' panicked at 'no entry found for key', collectors/slurm/src/sacctcaller.rs:339:17
stack backtrace:
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

This corresponds to this line:

job[&c.key].extract_i64().unwrap_or_else(|_| {

We try to index the job map with a key that does not exist. Indexing a HashMap panics when the key is missing, which produces the "no entry found for key" panic shown above.
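
For illustration, here is a minimal, standalone reproduction of that panic (the map contents and key name are invented for this example and are not taken from the collector's code):

use std::collections::HashMap;

fn main() {
    let mut job: HashMap<&str, i64> = HashMap::new();
    job.insert("NCPUS", 1);

    // Non-panicking access returns an Option:
    assert!(job.get("MaxRSS").is_none());

    // Indexing panics when the key is missing, with the same message
    // seen in the backtrace above:
    // thread 'main' panicked at 'no entry found for key'
    let _max_rss = job["MaxRSS"];
}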

Solution

  1. We should check that all keys from the components configuration exist in the job information. If a key is missing, we should abort constructing that record and emit an error instead, but continue with the next record.
  2. We should add an option for specifying a default value to use when parsing fails, in case users want to keep records with missing data so that they can filter them out later (see the sketch below).
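
A minimal sketch of how both points could look, assuming the job information is a HashMap keyed by field name. ComponentConfig, default_value, and this construct_components signature are simplified placeholders for illustration, not the actual types in sacctcaller.rs:

use std::collections::HashMap;

// Simplified stand-in for one entry of the components configuration.
// `default_value` is the proposed optional fallback (point 2).
struct ComponentConfig {
    key: String,
    default_value: Option<i64>,
}

// Simplified stand-in for construct_components(): instead of indexing
// job[&c.key] (which panics on a missing key), look the key up with
// `get` and decide what to do when it is absent (point 1).
fn construct_components(
    job: &HashMap<String, i64>,
    config: &[ComponentConfig],
) -> Result<Vec<(String, i64)>, String> {
    let mut components = Vec::with_capacity(config.len());
    for c in config {
        let value = match (job.get(&c.key), c.default_value) {
            (Some(v), _) => *v,
            // Key missing, but a default is configured: keep the record.
            (None, Some(default)) => default,
            // Key missing and no default: abort this record with an error
            // so the caller can log it and continue with the next record.
            (None, None) => {
                return Err(format!("missing key '{}' in job info", c.key));
            }
        };
        components.push((c.key.clone(), value));
    }
    Ok(components)
}

fn main() {
    let mut job = HashMap::new();
    job.insert("NCPUS".to_string(), 1);
    let config = vec![ComponentConfig {
        key: "MaxRSS".to_string(),
        default_value: None,
    }];
    // With no default configured, the missing MaxRSS key now yields an
    // error instead of a panic:
    assert!(construct_components(&job, &config).is_err());
}

With something along these lines, a missing key either falls back to the configured default or turns into an error that the caller can log before moving on to the next record.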