The Slurm collector crashes when a component field cannot be parsed. For example, the MaxRSS field can be empty for very short jobs, because memory usage is only sampled periodically and the job may finish before the first sample is taken.
[root@slurm ~]# sacct -a --format Partition,NCPUS,SystemCPU,UserCPU,TotalCPU,MaxRSS,ReqMem,NNodes,JobID,Start,End,Group,User,State --noconvert -j 6776529
Partition  NCPUS      SystemCPU  UserCPU    TotalCPU   MaxRSS     ReqMem     NNodes   JobID        Start               End                 Group     User      State
---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- ------------ ------------------- ------------------- --------- --------- ----------
queue      1          00:02.077  00:03.436  00:05.514             2000M      1        6776529      2023-10-24T13:26:59 2023-10-24T13:27:27 group     user      FAILED
           1          00:02.077  00:03.436  00:05.514                        1        6776529.bat+ 2023-10-24T13:26:59 2023-10-24T13:27:27                     FAILED
When parsing this output, Auditor returns an error, but continues with the next line.
[2023-10-25T07:32:36.834Z] ERROR: AUDITOR-slurm-collector/16770 on slurm.bfg.uni-freiburg.de: [CALLING SACCT AND PARSING OUTPUT - EVENT] Something went wrong during parsing (map1, id: 6782841, key: MaxRSS, value: None) (file=collectors/slurm/src/sacctcaller.rs,line=141,target=auditor_slurm_collector::sacctcaller)
However, when constructing the components for this record, the collector crashes. This is because in this case the MaxRSS key is missing from the job information that is passed to construct_components():
thread 'tokio-runtime-worker' panicked at 'no entry found for key', collectors/slurm/src/sacctcaller.rs:339:17
stack backtrace:
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
We try to access a key in the job dictionary that does not exist.
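The failure mode can be reproduced in isolation. The following is a minimal sketch, assuming the job information behaves like a `HashMap` keyed by field name. The `Job` alias and both function names are hypothetical and are not taken from the collector's source. Indexing a `HashMap` with a missing key panics with exactly the message shown above, while `get` returns an `Option` that the caller can handle.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the parsed job information: field name -> value.
type Job = HashMap<String, i64>;

/// Indexing with `job[key]` panics with "no entry found for key"
/// when the key is absent from the map.
fn lookup_indexing(job: &Job, key: &str) -> i64 {
    job[key]
}

/// `get` returns `None` for a missing key instead of panicking,
/// so the caller decides how to react.
fn lookup_checked(job: &Job, key: &str) -> Option<i64> {
    job.get(key).copied()
}
```

With a job that contains only `NCPUS`, `lookup_checked(&job, "MaxRSS")` yields `None`, whereas `lookup_indexing(&job, "MaxRSS")` panics.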
Solution
We should check that all keys of the components config exist in the job information. If any key is missing, we should abort constructing that record, report an error, and continue with the next record.
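One possible shape for this check is sketched below. It assumes the job information is a `HashMap` and the components config boils down to a list of key names. `Job`, `MissingKey`, `build_components`, and `build_all` are hypothetical names for illustration, not the collector's actual API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the parsed job information: field name -> value.
type Job = HashMap<String, i64>;

/// Hypothetical error type naming the key that was not found.
#[derive(Debug, PartialEq)]
struct MissingKey(String);

/// Returns (name, value) for every configured component,
/// or an error for the first key that is missing from the job.
fn build_components(job: &Job, component_keys: &[&str]) -> Result<Vec<(String, i64)>, MissingKey> {
    component_keys
        .iter()
        .map(|key| {
            job.get(*key)
                .map(|value| ((*key).to_string(), *value))
                .ok_or_else(|| MissingKey((*key).to_string()))
        })
        .collect()
}

/// Builds components for each job. A job with a missing key is
/// reported on stderr and skipped; processing continues with the next job.
fn build_all(jobs: &[Job], component_keys: &[&str]) -> Vec<Vec<(String, i64)>> {
    let mut records = Vec::new();
    for job in jobs {
        match build_components(job, component_keys) {
            Ok(components) => records.push(components),
            Err(MissingKey(key)) => eprintln!("skipping record: missing key {key}"),
        }
    }
    records
}
```

Collecting into a `Result` stops at the first missing key, so an incomplete record never reaches the code that builds components.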
We should also add an option to specify a default value that is used when parsing fails, for users who want to keep records with missing data and filter them later.
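A default-value option could look roughly like the sketch below. The `ComponentConfig` struct, its `default_value` field, and `component_value` are assumptions made for illustration; the real configuration format may differ.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the parsed job information: field name -> value.
type Job = HashMap<String, i64>;

/// Hypothetical per-component config: the sacct key to read and
/// an optional fallback used when the key is missing from the job.
struct ComponentConfig {
    key: String,
    default_value: Option<i64>,
}

/// Returns the parsed value if present, otherwise the configured default.
/// `None` means the key is missing and no default was configured.
fn component_value(job: &Job, config: &ComponentConfig) -> Option<i64> {
    job.get(&config.key).copied().or(config.default_value)
}
```

A `None` result can then be handled like any other missing key, so records are only kept with placeholder data when the user opted in.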
The panic corresponds to this line: AUDITOR/collectors/slurm/src/sacctcaller.rs, line 339 in commit 4115613.