Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust Queue Logging #202

Merged
merged 9 commits into from
Jan 8, 2024
Merged

Conversation

patricktnast
Copy link
Contributor

@patricktnast patricktnast commented Jan 4, 2024

Adjust Queue Logging

Description

Changes and notes

From the ticket:

Overview: The psimulate logger previously has buckets for pending,  running, and workers that come from when we submitted all jobs to slurm at once. Now, the meaning is somewhat different, or at least ambiguous, because "pending" means all queued jobs, not necessarily the set of jobs (or rather, in this case, workers) which have been submitted to slurm but are waiting to receive resources.

As I imply above, thinking about the "jobs" submitted to the cluster is a sort of category error--it's the workers that are really getting submitted. Ultimately, i thought the most sensible thing would be to leave "pending" as-is, under the understanding that this now means just "idle jobs awating workers". Instead, at the "Queue All" level, I added a listing for "inactive workers", that is, the number of "missing workers" from the number that were intially submitted, that are either a) waiting for cluster resources or b) have completed all jobs in the queue and quit themselves.

I don't think there's a lazy way to disambiguate (a) from (b), you'd need to pull information from outside of the registry--and I figured the two situations can be resolved by context (one happens mostly at the beginning of the sim, and one happens mostly at the end).

Also, "running" and "workers" should effectively be the same, unless rq.Worker.all also counts workers in the queue that are not performing a job (but in that case they should soon quit?)

I too out "workers", assuming the same information is given by "running"

Also consider renaming "Finished" to "Successful" or "Completed"

I renamed this job status (but not the underlying FinishedQueue) to Successful for the purposes of logging.

I also changed some log info statements to debug which IMO deserve it

Testing

Tested against nutrition optimization, but I am not actually getting allocated any workers atm

Copy link
Contributor

@rmudambi rmudambi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include a screenshot of what the logging output looks like when you are able to get scheduled? I'll approve after I see it look as expected.

N.B. - you can use psimulate test sleep rather than running an actual simulation to reduce our cluster impact.

@stevebachmeier
Copy link
Contributor

Like Rajan, I'd also like to actually see a screenshot of the logging in action.

finished_jobs = self._get_finished_jobs()
start = time.time()
results = []
for job_id in finished_jobs:
result = self._get_result(job_id)
if result is not None:
results.append(result)
self._logger.info(
self._logger.debug(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change

)
self._status["workers"] = q_workers
self._status["done"] = 100 * self._status["finished"] / self._status["total"]
self._status["done"] = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so "done" for this context is basically finished + successful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think that the "done" percentage is including failed jobs

@patricktnast
Copy link
Contributor Author

psim-log1
psim-log-with-queues
@rmudambi @stevebachmeier here is a screenshot of the logs (with and without individual queues)

@stevebachmeier stevebachmeier self-requested a review January 5, 2024 18:51
Copy link
Contributor

@rmudambi rmudambi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. To be clear, we expect these two constraints to hold?

  • max_workers == inactive_workers + running
  • total_jobs == pending + running + failed + successful

@patricktnast patricktnast merged commit d228061 into main Jan 8, 2024
6 checks passed
@patricktnast patricktnast deleted the bugfix/pnast/MIC-4650-queue-buckets branch January 8, 2024 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants