From 574d0277817528bddc11e2fea1380f59506dde15 Mon Sep 17 00:00:00 2001
From: Julien Vignoud <33122365+JulienVig@users.noreply.github.com>
Date: Mon, 28 Apr 2025 14:06:51 +0200
Subject: [PATCH 1/4] Update multinode.md

* Make example command rely on bash, the job fails as it is currently
* Change image from lauzhack to ic-registry.epfl.ch/mlo/mlo:v1
---
 docs/multinode.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/multinode.md b/docs/multinode.md
index 55dbc4d..cd36dae 100644
--- a/docs/multinode.md
+++ b/docs/multinode.md
@@ -13,10 +13,10 @@ As an example, the following command launches 3 pods, each with 4 GPUs. Note tha
 ```bash
 runai submit-dist pytorch \
   --name distributed-job-readme \
-  --workers=2 -g 4 -i ic-registry.epfl.ch/mlo/lauzhack:v1 \
+  --workers=2 -g 4 -i ic-registry.epfl.ch/mlo/mlo:v1 \
   --annotation k8s.v1.cni.cncf.io/networks=kube-system/roce \
   --extended-resource rdma/rdma=1 \
-  -- "sleep infinity"
+  -- bash -c "sleep infinity"
 ```

 Note that it is not possbile to control how these pods are scheduled so these two pods can be either on the same node or on different nodes. For best performance, local GPUs should be maximized, which would mean asking for pods of 8 GPUs each (taking a full node).

From a73eb5f05264964f3fa63d3effed111c72bd4f32 Mon Sep 17 00:00:00 2001
From: Julien Vignoud <33122365+JulienVig@users.noreply.github.com>
Date: Wed, 30 Apr 2025 15:27:17 +0200
Subject: [PATCH 2/4] Update runai doc link

---
 docs/multinode.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/multinode.md b/docs/multinode.md
index cd36dae..f052280 100644
--- a/docs/multinode.md
+++ b/docs/multinode.md
@@ -6,7 +6,7 @@

 > [!CAUTION]
 > This doc explains an advanced usage of RunAI.

-Jobs can be submitted either through RunAI as documented in RunAI's website (https://docs.run.ai/v2.13/Researcher/cli-reference/runai-submit-dist-pytorch/).
+Jobs can be submitted either through RunAI as documented in RunAI's website (https://docs.run.ai/latest/Researcher/cli-reference/runai-submit-dist-pytorch/).

 As an example, the following command launches 3 pods, each with 4 GPUs. Note that the number of pods is one more than the number of workers as the master node is not counted as a worker.

@@ -65,7 +65,7 @@ However, the communication backend requires additional configuration to use RDMA
    grep 'RoCE v2' /sys/class/infiniband/mlx5_bond_0/ports/1/gid_attrs/types/* 2>/dev/null
    ```

-   2.3. The port that appears in both of the above commands is the one we want. For the pods I was running this was always port 9. The following one-liner performs all the above operations:
+   2.3. The port that appears in both of the above commands is the one we want. For the pods I was running this was always port 7 or 9. The following one-liner performs all the above operations:

    ```bash
    grep 'RoCE v2' $(grep '0000:0000:0000:0000:0000:ffff' /sys/class/infiniband/mlx5_bond_0/ports/1/gids/* | cut -d ':' -f 1 | sed 's/gids/gid_attrs\/types/') | sed -e 's/.*\/\([0-9]*\):.*/\1/'

From 1d6da5563d0f10ba250edde38c821d0d713a620b Mon Sep 17 00:00:00 2001
From: Julien Vignoud <33122365+JulienVig@users.noreply.github.com>
Date: Wed, 30 Apr 2025 16:06:20 +0200
Subject: [PATCH 3/4] Emphasize torchrun needs to run on each node

---
 docs/multinode.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/multinode.md b/docs/multinode.md
index f052280..d4409ba 100644
--- a/docs/multinode.md
+++ b/docs/multinode.md
@@ -28,7 +28,10 @@ RunAI handles scheduling the pods and also creates the necessary communication (

 * `MASTER_PORT`: Port on which master node is listening

-For running a training job, torchrun accepts the above variables as arguments and automatically schedules the job. For example the following command can be used to schedule a training job on the 3 pods we launched before. Note that the command needs to be run on each of the pods separately.
+For running a training job, torchrun accepts the above variables as arguments and automatically schedules the job. For example the following command can be used to schedule a training job on the 3 pods we launched before.
+
+> [!NOTE]
+> The command needs to be run on each of the pods separately.

 ```bash
 torchrun \

From ca382fd6fc395e31327223ffa4ff9c9fafa69016 Mon Sep 17 00:00:00 2001
From: Julien Vignoud <33122365+JulienVig@users.noreply.github.com>
Date: Wed, 30 Apr 2025 17:32:02 +0200
Subject: [PATCH 4/4] More details, example script and commands

---
 docs/multinode.md | 79 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 69 insertions(+), 10 deletions(-)

diff --git a/docs/multinode.md b/docs/multinode.md
index d4409ba..de11d57 100644
--- a/docs/multinode.md
+++ b/docs/multinode.md
@@ -1,33 +1,34 @@
 # Multi-Node Training with RunAI

 > [!NOTE]
 > Multi-Node scheduling needs to be enabled on the cluster and you should be using a RunAI CLI which
-> supports multi-node jobs.
+> supports multi-node jobs (>2.13). As of April 30th 2025, this setup works on the RCP cluster with runai 2.18.94.

 > [!CAUTION]
 > This doc explains an advanced usage of RunAI.

-Jobs can be submitted either through RunAI as documented in RunAI's website (https://docs.run.ai/latest/Researcher/cli-reference/runai-submit-dist-pytorch/).
+Jobs can be submitted either through the RunAI CLI, as documented on RunAI's website (https://docs.run.ai/latest/Researcher/cli-reference/runai-submit-dist-pytorch/), or via a Kubernetes YAML file (as `csub.py` does), documented [here](https://docs.run.ai/v2.18/developer/cluster-api/submit-yaml/).

 As an example, the following command launches 3 pods, each with 4 GPUs. Note that the number of pods is one more than the number of workers as the master node is not counted as a worker.

 ```bash
 runai submit-dist pytorch \
-  --name distributed-job-readme \
+  --name distributed-job \
   --workers=2 -g 4 -i ic-registry.epfl.ch/mlo/mlo:v1 \
   --annotation k8s.v1.cni.cncf.io/networks=kube-system/roce \
   --extended-resource rdma/rdma=1 \
   -- bash -c "sleep infinity"
 ```

-Note that it is not possbile to control how these pods are scheduled so these two pods can be either on the same node or on different nodes. For best performance, local GPUs should be maximized, which would mean asking for pods of 8 GPUs each (taking a full node).
+Note that it is not possible to control how these pods are scheduled, so they can end up either on the same node or on different nodes. For best performance, local GPUs should be maximized, which would mean asking for pods of 8 GPUs each (taking a full node). You can open a bash shell on a specific pod with `runai bash job_name --pod pod_name` and list the pods with `kubectl get pods`.

 RunAI handles scheduling the pods and also creates the necessary communication (rendezvous) backend (most likely c10d) between them. The following environment variables are set:

 * `WORLD_SIZE`: Number of pods (number of GPUs in each pod does not matter.)
 * `RANK`: Rank of the pod (number of GPUs in each pod does not matter.)
+* `LOCAL_RANK`: Rank of the GPU within a pod
 * `MASTER_ADDR`: IP Address of the master node.
 * `MASTER_PORT`: Port on which master node is listening
-
+You can find the exhaustive list in [torchrun's documentation](https://pytorch.org/docs/stable/elastic/run.html#environment-variables).
+
 For running a training job, torchrun accepts the above variables as arguments and automatically schedules the job. For example the following command can be used to schedule a training job on the 3 pods we launched before.

 > [!NOTE]
@@ -35,19 +36,19 @@ For running a training job, torchrun accepts the above variables as arguments an

 ```bash
+# All these parameters can be omitted and automatically inferred
 torchrun \
-    --nproc-per-node 4 \
+    --nproc-per-node gpu \
     --nnodes ${WORLD_SIZE} \
     --node_rank ${RANK} \
     --master_addr ${MASTER_ADDR} \
     --master_port ${MASTER_PORT} \
-    main.py
+    main.py # you can use the example script below
 ```

-torchrun automatically launches a separate process for each GPU and assigns the correct global rank. As such, for basic usage (e.g. FSDP), no changes to python code is necessary.
+torchrun automatically launches a separate process for each GPU and assigns the correct global and local ranks. As such, for basic usage (e.g. FSDP), no changes to Python code are necessary.

 ## Using RDMA for efficient inter-node communication

-While the above should get a job running, additional setup is necessary for efficient communication, in particular, using RDMA. We have already specified the following flags when running our pods to ensure RDMA support:
+While the previous section should get a job running, additional setup is necessary for efficient communication, in particular using RDMA. We have already specified the following flags when running our pods to ensure RDMA support:
 ```--annotation k8s.v1.cni.cncf.io/networks=kube-system/roce --extended-resource rdma/rdma=1```.

 However, the communication backend requires additional configuration to use RDMA. In particular, the following steps are needed when using NCCL. The necessary steps may vary for different OS distributions or versions as well as when alternative drivers for Inifiniband/RDMA are installed.
@@ -78,10 +79,68 @@ However, the communication backend requires additional configuration to use RDMA
 3. Once we know the device name as well as the correct GID index, we can configure NCCL by settings the environment variable `NCCL_IB_GID_INDEX` to the desired GID index. Furthermore, we should set `NCCL_IB_HCA` to a prefix to ensure NCCL uses the right device. For example, either `mlx5_bond_0` or a prefix such as `mlx5` should work.

    ```bash
+   sudo apt update # on RCP your password is your username
+   sudo apt install libibverbs-dev librdmacm-dev libmlx5-1 # Install RDMA libraries required by NCCL
    export NCCL_IB_GID_INDEX=$(grep 'RoCE v2' $(grep '0000:0000:0000:0000:0000:ffff' /sys/class/infiniband/mlx5_bond_0/ports/1/gids/* | cut -d ':' -f 1 | sed 's/gids/gid_attrs\/types/') | sed -e 's/.*\/\([0-9]*\):.*/\1/')
    export NCCL_IB_HCA=mlx5
    export NCCL_SOCKET_NTHREADS=4
    export NCCL_NSOCKS_PERTHREAD=8
    ```

-4. You should run torchrun with the above environment variables set. This should usually be enough to get NCCL to correctly use RDMA. To verify this, you can use tools such as ifstats. These tools monitor network traffic that goes through CPU. When using RDMA, no such traffic should be visible (assuming you are not using the network interface for other things).
+4. You should run torchrun with the above environment variables set. This should usually be enough to get NCCL to correctly use RDMA. To verify this, you can use tools such as `ifstat`. These tools monitor network traffic that goes through the CPU. When using RDMA, no such traffic should be visible (assuming you are not using the network interface for other things). With `NCCL_DEBUG=INFO` exported, running the torchrun command should show `NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE [RO]` (RoCE is indeed being used) and `NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/IB/0(1)/GDRDMA` with possibly different ranks for `0[0] -> 2[0]`. GDRDMA stands for GPUDirect RDMA, i.e. direct transfers between GPUs over RDMA without staging through host memory. If RDMA is **not** being used, you may see `NCCL INFO NET/Socket : Using [0]net1:172.18.128.3<0> [1]**eth0**:172.16.46.52<0>` and `NCCL INFO Using network Socket`.
+
+## Example PyTorch script
+
+```python
+import os
+import torch
+import torch.distributed as dist
+import torch.nn as nn
+import torch.optim as optim
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torchvision import models
+from time import sleep
+def setup():
+    dist.init_process_group("nccl", rank=int(os.environ['RANK']), world_size=int(os.environ['WORLD_SIZE']))
+    print(f"Process group initialized for rank {os.environ['RANK']}")
+
+def cleanup():
+    dist.destroy_process_group()
+
+def main():
+    setup()
+    local_rank = int(os.environ['LOCAL_RANK'])
+
+    try:
+        # Create a simple model
+        model = models.resnet18().cuda(local_rank)
+        model = DDP(model, device_ids=[local_rank])
+
+        # Create a random input tensor and move it to the current device
+        for _ in range(10):
+            input_tensor = torch.randn(20, 3, 224, 224).cuda(local_rank)
+
+            # Define a simple loss function and optimizer
+            loss_fn = nn.CrossEntropyLoss()
+            optimizer = optim.Adam(model.parameters(), lr=0.001)
+
+            # Forward pass
+            output = model(input_tensor)
+            target = torch.randint(0, 1000, (20,)).cuda(local_rank)
+
+            # Compute loss
+            loss = loss_fn(output, target)
+
+            # Backward pass and optimization step
+            optimizer.zero_grad()
+            loss.backward()
+            optimizer.step()
+
+            print(f"Local rank {local_rank}, Loss: {loss.item()}")
+            sleep(1)
+    finally:
+        cleanup()
+
+if __name__ == "__main__":
+    main()
+```
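
Before launching anything, it can help to confirm from inside a pod that the pieces described in the patched doc are actually in place. The following is a minimal sketch and not part of the patch itself: it assumes the job was submitted with the name `distributed-job` as in the example above, and `pod_name` is a placeholder for one of the names reported by `kubectl get pods`.

```bash
# List the pods of the job and open a shell in one of them
kubectl get pods
runai bash distributed-job --pod pod_name

# Inside the pod: check that RunAI injected the rendezvous variables described above
env | grep -E 'WORLD_SIZE|RANK|MASTER_ADDR|MASTER_PORT'

# Check that the RDMA device is visible inside the pod (expected: mlx5_bond_0)
ls /sys/class/infiniband/
```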
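
Putting the pieces together, a typical launch then looks like the sketch below. This is an illustrative recipe rather than part of the patch: it assumes the example script above has been saved as `main.py` inside each pod, and it simply combines the NCCL settings from step 3 with the torchrun command shown earlier; `NCCL_DEBUG=INFO` is added so the log lines quoted in step 4 can be checked. Remember that it has to be run in every pod of the job.

```bash
# NCCL / RDMA configuration from step 3 above
export NCCL_IB_GID_INDEX=$(grep 'RoCE v2' $(grep '0000:0000:0000:0000:0000:ffff' /sys/class/infiniband/mlx5_bond_0/ports/1/gids/* | cut -d ':' -f 1 | sed 's/gids/gid_attrs\/types/') | sed -e 's/.*\/\([0-9]*\):.*/\1/')
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=8
export NCCL_DEBUG=INFO  # print the NET/IB lines mentioned in step 4

# Launch one process per GPU on this pod; RANK, WORLD_SIZE, MASTER_ADDR and
# MASTER_PORT are set by RunAI as described above
torchrun \
    --nproc-per-node gpu \
    --nnodes ${WORLD_SIZE} \
    --node_rank ${RANK} \
    --master_addr ${MASTER_ADDR} \
    --master_port ${MASTER_PORT} \
    main.py
```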