Add playbook section on manually killing recalcitrant batch jobs #160

Merged 1 commit on Sep 3, 2024
19 changes: 19 additions & 0 deletions playbooks/hail_batch.md
@@ -17,6 +17,25 @@ Linked discussions:
- [https://centrepopgen.slack.com/archives/C030X7WGFCL/p1712182771574679](https://centrepopgen.slack.com/archives/C030X7WGFCL/p1712182771574679)


### Jobs that simply cannot be cancelled

Hail has several interfaces for cancelling batches, including the `Cancel` button in the batch web UI and `hailctl batch cancel BATCHID` on the command line.
But sometimes individual jobs cannot be cancelled.
In particular, jobs marked `always_run` cannot be cancelled through the user interface.

A job can also become so wedged that Hail's usual cancellation mechanisms are ineffective, whether or not it is an `always_run` job.
In those cases, an operator may need to terminate the job's processes by hand on the worker, in a very manual Unix way:

1. Check the job web UI page to see what instance (i.e., worker) the job is running on, which will be something like _batch-worker-default-abcde_.
1. Go to the [Google Cloud console](https://console.cloud.google.com/), switch to the appropriate Hail project, and open the VM instances page (the worker can usually be found just by searching for _abcde_).
1. From there, obtain the appropriate `gcloud compute ssh …` command to SSH into the worker.
1. Once logged on to the worker, attach to Hail's Docker container by identifying it via `docker ps` and attaching with `docker exec -it CONTAINERHASH bash`. (This step may or may not be strictly necessary.)
1. Identify the Unix processes that belong to the wedged job and are clearly wedged themselves, and **carefully** terminate them with `kill -9`.
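
The steps above can be sketched as a pair of small shell helpers. This is only an illustration under stated assumptions: the function names `worker_ssh_cmd` and `kill_wedged` are hypothetical, and the project and zone values are placeholders that must come from the console. The first helper prints the SSH command rather than running it, so it can be reviewed first. The second checks that a PID still exists and shows what it is before sending `SIGKILL`.

```shell
# Sketch only: helper names are hypothetical; project and zone are placeholders.

# Print (do not run) the SSH command for a worker so it can be checked first.
worker_ssh_cmd() {
    local worker="$1" project="$2" zone="$3"
    printf 'gcloud compute ssh %s --project %s --zone %s\n' "$worker" "$project" "$zone"
}

# Kill one process with SIGKILL, but only after confirming it can be signalled.
kill_wedged() {
    local pid="$1"
    # kill -0 sends no signal; it only tests that the PID exists and is ours to signal.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no such process (or not permitted): $pid" >&2
        return 1
    fi
    # Show what is about to be killed; a missing or limited ps is not fatal.
    ps -o pid,ppid,etime,args -p "$pid" || true
    kill -9 "$pid"
}
```

For example, `worker_ssh_cmd batch-worker-default-abcde PROJECT ZONE` prints a command to paste into a terminal, and `kill_wedged PID` would then be run on the worker (or inside the container) once the wedged process has been identified.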

Linked discussions:

- [`always_run` job is stalling and cannot be cancelled](https://centrepopgen.slack.com/archives/C030X7WGFCL/p1718590807296239)


## Python jobs
