Add playbook section on manually killing recalcitrant batch jobs (#160)
jmarshall authored Sep 3, 2024
1 parent 67af0b4 commit ef34582
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions playbooks/hail_batch.md
@@ -17,6 +17,25 @@ Linked discussions:
- [https://centrepopgen.slack.com/archives/C030X7WGFCL/p1712182771574679](https://centrepopgen.slack.com/archives/C030X7WGFCL/p1712182771574679)


### Jobs that simply cannot be cancelled

Hail has several interfaces for cancelling batches: there is a `Cancel` button in the batch web UI, and `hailctl batch cancel BATCHID` on the command line.
However, some individual jobs cannot be cancelled this way.
In particular, jobs marked `always_run` cannot be cancelled through the user interface.

Occasionally a job becomes so wedged that Hail's usual cancellation mechanisms are ineffective, whether or not it is an `always_run` job.
In those cases, an operator may need to terminate the job's processes manually with standard Unix tools:

1. Check the job's page in the web UI to see which instance (i.e., worker) the job is running on; the name will be something like _batch-worker-default-abcde_.
1. Go to the [Google console](https://console.cloud.google.com/), switch to the appropriate Hail project, and go to the VM Instance page (which can usually be found just by searching for _abcde_).
1. From there, obtain the appropriate `gcloud compute ssh …` command to SSH into the worker.
1. Once logged on to the worker, identify Hail's Docker container via `docker ps` and open a shell inside it with `docker exec -it CONTAINERHASH bash`. (This step may or may not be strictly necessary.)
1. Now identify the Unix processes that belong to the wedged job and are themselves visibly stuck, and **carefully** terminate them with `kill -9`.
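
The final step can be sketched as a small shell helper. This is an illustration only, not part of Hail or of the procedure above: the function name `kill_wedged_pid` and the grace period are hypothetical. Sending `SIGTERM` before `kill -9` is an added precaution, not something the playbook prescribes. The helper refuses anything that is not a plain PID greater than 1, prints the target's command line so the operator can double-check it, and only then sends signals.

```shell
#!/bin/sh
# Hypothetical helper (not a Hail tool): terminate one wedged process by PID.
# Usage: kill_wedged_pid PID [GRACE_SECONDS]
kill_wedged_pid() {
    pid=$1
    grace=${2:-5}

    # Accept only a plain integer PID; anything else is refused.
    case $pid in
        ''|*[!0-9]*) echo "not a PID: $pid" >&2; return 2 ;;
    esac
    # Refuse 0 (which would signal the whole process group) and 1 (init).
    if [ "$pid" -le 1 ]; then
        echo "refusing to signal PID $pid" >&2
        return 2
    fi

    # Nothing to do if the process is already gone.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no such process: $pid" >&2
        return 1
    fi

    # Show what is about to be killed so the operator can confirm the target.
    ps -o pid=,etime=,args= -p "$pid" >&2 || true

    # Assumption: ask politely with SIGTERM first, then wait up to $grace seconds.
    kill -TERM "$pid" 2>/dev/null || true
    waited=0
    while [ "$waited" -lt "$grace" ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        waited=$((waited + 1))
    done

    # Still present: fall back to SIGKILL, i.e. the playbook's `kill -9`.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null || true
    fi
    return 0
}
```

After pasting the function into the shell on the worker (or inside the container), something like `kill_wedged_pid 12345 10` would give the hypothetical PID 12345 ten seconds to exit before it is killed outright. Check the PID against the `ps` output first, since killing the wrong process on a shared worker can take down other jobs.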

Linked discussions:

- [`always_run` job is stalling and cannot be cancelled](https://centrepopgen.slack.com/archives/C030X7WGFCL/p1718590807296239)


## Python jobs
