From ef345821909a93d53e3ae06faec86055f7ad913f Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Wed, 4 Sep 2024 08:30:21 +1200
Subject: [PATCH] Add playbook section on manually killing recalcitrant batch jobs (#160)

---
 playbooks/hail_batch.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/playbooks/hail_batch.md b/playbooks/hail_batch.md
index 8324b50..348ec3d 100644
--- a/playbooks/hail_batch.md
+++ b/playbooks/hail_batch.md
@@ -17,6 +17,25 @@ Linked discussions:

 - [https://centrepopgen.slack.com/archives/C030X7WGFCL/p1712182771574679](https://centrepopgen.slack.com/archives/C030X7WGFCL/p1712182771574679)

+### Jobs that simply cannot be cancelled
+
+Hail has several interfaces for cancelling batches: there is a `Cancel` button in the batch web UI, and `hailctl batch cancel BATCHID` on the command line.
+But sometimes individual jobs cannot be cancelled.
+In particular, jobs marked `always_run` cannot be cancelled through the user interface.
+
+And sometimes a job gets into such a wedged state that Hail's usual cancellation mechanisms are ineffective, whether it is an `always_run` job or not.
+In those cases, operator assistance may be required to terminate the job's processes in a very manual Unix way:
+
+1. Check the job web UI page to see what instance (i.e., worker) the job is running on, which will be something like _batch-worker-default-abcde_.
+1. Go to the [Google console](https://console.cloud.google.com/), switch to the appropriate Hail project, and go to the VM Instance page (which can usually be found just by searching for _abcde_).
+1. From there, obtain the appropriate `gcloud compute ssh …` command to SSH into the worker.
+1. Once logged on to the worker, attach to Hail's Docker container by identifying it via `docker ps` and attaching with `docker exec -it CONTAINERHASH bash`. (This step may or may not be strictly necessary.)
+1. Now identify the Unix processes that correspond to the wedged job and are identifiably wedged themselves, and **carefully** terminate these processes in the usual `kill -9` way.
+
+Linked discussions:
+
+- [`always_run` job is stalling and cannot be cancelled](https://centrepopgen.slack.com/archives/C030X7WGFCL/p1718590807296239)
+
 ## Python jobs