
Commit d9af371

support defining environment variables from configmap keys (#122)

1 parent b272ab3

File tree

5 files changed: +194 −2 lines changed


tools/pytorchjob-generator/chart/README.md

Lines changed: 1 addition & 1 deletion

@@ -39,7 +39,7 @@ customize the Jobs generated by the tool.
 | Key | Type | Default | Description |
 |-----|------|---------|-------------|
-| environmentVariables | array | `nil` | List of variables/values to be defined for all the ranks. Values can be literals or references to Kubernetes secrets. See [values.yaml](values.yaml) for examples of supported syntaxes. NOTE: The following standard [PyTorch Distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization) are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT. |
+| environmentVariables | array | `nil` | List of variables/values to be defined for all the ranks. Values can be literals or references to Kubernetes secrets or configmaps. See [values.yaml](values.yaml) for examples of supported syntaxes. NOTE: The following standard [PyTorch Distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization) are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT. |
 | sshGitCloneConfig | object | `nil` | Private GitHub clone support. See [values.yaml](values.yaml) for additional instructions. |
 | setupCommands | array | no custom commands are executed | List of custom commands to be run at the beginning of the execution. Use `setupCommands` to clone code, download data, and change directories. |
 | mainProgram | string | `nil` | Name of the PyTorch program to be executed by `torchrun`. Please provide your program name here and NOT in "setupCommands", as this helm template provides the necessary "torchrun" arguments for the parallel execution. WARNING: this program is relative to the current path set by change-of-directory commands in "setupCommands". If no value is provided, then only `setupCommands` are executed and torchrun is elided. |
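As a sketch of the user-facing side of this change, an `environmentVariables` entry that pulls its value from a ConfigMap key would be written in the user's values file as below. The variable, ConfigMap name, and key are the illustrative names used by the chart's tests, not fixed identifiers:

```yaml
# Hypothetical values-file fragment: EXAMPLE_VAR4 is resolved from a
# ConfigMap key in the job's namespace when the pods are created.
environmentVariables:
  - name: EXAMPLE_VAR4
    configmap:
      name: my-configmap-name   # ConfigMap to read from
      key: my-configmap-key     # key whose value becomes the env var
```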

tools/pytorchjob-generator/chart/templates/_helpers.tpl

Lines changed: 5 additions & 0 deletions

@@ -110,6 +110,11 @@ env:
     secretKeyRef:
       name: {{ required "Missing 'name' in 'environmentVariables.secret' list element" $variable.secret.name }}
       key: {{ required "Missing 'key' in 'environmentVariables.secret' list element" $variable.secret.key | quote }}
+{{- else if $variable.configmap }}
+  valueFrom:
+    configMapKeyRef:
+      name: {{ required "Missing 'name' in 'environmentVariables.configmap' list element" $variable.configmap.name }}
+      key: {{ required "Missing 'key' in 'environmentVariables.configmap' list element" $variable.configmap.key | quote }}
 {{- else if ( kindIs "float64" $variable.value ) }}
   value: "0"
 {{- else }}
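The generated `configMapKeyRef` only resolves if a matching ConfigMap exists in the job's namespace. A minimal sketch of such a ConfigMap, using the example names from the chart's tests (illustrative, not part of this commit):

```yaml
# Sketch of the ConfigMap that the rendered configMapKeyRef would read.
# metadata.name and the data key must match the entries given under
# environmentVariables.configmap in the values file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-configmap-name
  namespace: my-namespace
data:
  my-configmap-key: some-value   # value injected into the container env
```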

tools/pytorchjob-generator/chart/tests/__snapshot__/helloworld_test.yaml.snap

Lines changed: 164 additions & 0 deletions

@@ -1512,3 +1512,167 @@ scheduler can be set:
                     - emptyDir:
                         medium: Memory
                       name: dshm
+user-defined environment variables:
+  1: |
+    apiVersion: workload.codeflare.dev/v1beta2
+    kind: AppWrapper
+    metadata:
+      annotations:
+        workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.6
+      labels:
+        kueue.x-k8s.io/queue-name: default-queue
+      name: my-job
+      namespace: my-namespace
+    spec:
+      components:
+      - template:
+          apiVersion: kubeflow.org/v1
+          kind: PyTorchJob
+          metadata:
+            name: my-job
+          spec:
+            pytorchReplicaSpecs:
+              Master:
+                replicas: 1
+                restartPolicy: Never
+                template:
+                  spec:
+                    affinity:
+                      nodeAffinity:
+                        requiredDuringSchedulingIgnoredDuringExecution:
+                          nodeSelectorTerms:
+                          - matchExpressions:
+                            - key: autopilot.ibm.com/gpuhealth
+                              operator: NotIn
+                              values:
+                              - ERR
+                              - TESTING
+                              - EVICT
+                    containers:
+                    - command:
+                      - sh
+                      - -c
+                      - |
+                        echo "Environment variables set by the kubeflow training operator:"
+                        echo ${MASTER_ADDR}:${MASTER_PORT}
+                        echo "PYTHONUNBUFFERED:"${PYTHONUNBUFFERED}
+                        echo My global rank is ${RANK} / ${WORLD_SIZE}
+                        echo "Other injected environment variables:"
+                        echo "NVME_MOUNT_PATH: "${NVME_MOUNT_PATH}
+                        #
+                        # User commands
+                        #
+                        git clone https://github.com/dbarnett/python-helloworld
+                        cd python-helloworld
+                        echo executing: torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
+                        torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
+                      env:
+                      - name: EXAMPLE_VAR1
+                        value: "6"
+                      - name: EXAMPLE_VAR2
+                        value: example2string
+                      - name: EXAMPLE_VAR3
+                        valueFrom:
+                          secretKeyRef:
+                            key: my-secret-key
+                            name: my-secret-name
+                      - name: EXAMPLE_VAR4
+                        valueFrom:
+                          configMapKeyRef:
+                            key: my-configmap-key
+                            name: my-configmap-name
+                      image: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126
+                      imagePullPolicy: IfNotPresent
+                      name: pytorch
+                      resources:
+                        limits:
+                          cpu: 500m
+                          memory: 1Gi
+                          nvidia.com/gpu: 8
+                          nvidia.com/roce_gdr: 0
+                        requests:
+                          cpu: 500m
+                          memory: 1Gi
+                          nvidia.com/gpu: 8
+                          nvidia.com/roce_gdr: 0
+                      volumeMounts:
+                      - mountPath: /dev/shm
+                        name: dshm
+                    imagePullSecrets: []
+                    priorityClassName: default-priority
+                    volumes:
+                    - emptyDir:
+                        medium: Memory
+                      name: dshm
+              Worker:
+                replicas: 3
+                restartPolicy: Never
+                template:
+                  spec:
+                    affinity:
+                      nodeAffinity:
+                        requiredDuringSchedulingIgnoredDuringExecution:
+                          nodeSelectorTerms:
+                          - matchExpressions:
+                            - key: autopilot.ibm.com/gpuhealth
+                              operator: NotIn
+                              values:
+                              - ERR
+                              - TESTING
+                              - EVICT
+                    containers:
+                    - command:
+                      - sh
+                      - -c
+                      - |
+                        echo "Environment variables set by the kubeflow training operator:"
+                        echo ${MASTER_ADDR}:${MASTER_PORT}
+                        echo "PYTHONUNBUFFERED:"${PYTHONUNBUFFERED}
+                        echo My global rank is ${RANK} / ${WORLD_SIZE}
+                        echo "Other injected environment variables:"
+                        echo "NVME_MOUNT_PATH: "${NVME_MOUNT_PATH}
+                        #
+                        # User commands
+                        #
+                        git clone https://github.com/dbarnett/python-helloworld
+                        cd python-helloworld
+                        echo executing: torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
+                        torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
+                      env:
+                      - name: EXAMPLE_VAR1
+                        value: "6"
+                      - name: EXAMPLE_VAR2
+                        value: example2string
+                      - name: EXAMPLE_VAR3
+                        valueFrom:
+                          secretKeyRef:
+                            key: my-secret-key
+                            name: my-secret-name
+                      - name: EXAMPLE_VAR4
+                        valueFrom:
+                          configMapKeyRef:
+                            key: my-configmap-key
+                            name: my-configmap-name
+                      image: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126
+                      imagePullPolicy: IfNotPresent
+                      name: pytorch
+                      resources:
+                        limits:
+                          cpu: 500m
+                          memory: 1Gi
+                          nvidia.com/gpu: 8
+                          nvidia.com/roce_gdr: 0
+                        requests:
+                          cpu: 500m
+                          memory: 1Gi
+                          nvidia.com/gpu: 8
+                          nvidia.com/roce_gdr: 0
+                      volumeMounts:
+                      - mountPath: /dev/shm
+                        name: dshm
+                    imagePullSecrets: []
+                    priorityClassName: default-priority
+                    volumes:
+                    - emptyDir:
+                        medium: Memory
+                      name: dshm

tools/pytorchjob-generator/chart/tests/helloworld_test.yaml

Lines changed: 19 additions & 0 deletions

@@ -101,6 +101,25 @@ tests:
       - matchSnapshot:
           path: spec.components[0].template
 
+  - it: user-defined environment variables
+    set:
+      environmentVariables:
+        - name: EXAMPLE_VAR1
+          value: 6
+        - name: EXAMPLE_VAR2
+          value: "example2string"
+        - name: EXAMPLE_VAR3
+          secret:
+            name: my-secret-name
+            key: my-secret-key
+        - name: EXAMPLE_VAR4
+          configmap:
+            name: my-configmap-name
+            key: my-configmap-key
+    asserts:
+      - matchSnapshot:
+          path: spec.components[0].template
+
   - it: Enabling RoCE GDR
     set:
       roceGdrResName: nvidia.com/roce_gdr

tools/pytorchjob-generator/chart/values.yaml

Lines changed: 5 additions & 1 deletion

@@ -81,7 +81,7 @@ limitMemoryPerPod: # <optional, default=totalMemoryPerPod> Limit of total memory
 
 
 # -- (array) List of variables/values to be defined for all the ranks. Values can be literals or
-# references to Kubernetes secrets. See [values.yaml](values.yaml) for examples of supported syntaxes.
+# references to Kubernetes secrets or configmaps. See [values.yaml](values.yaml) for examples of supported syntaxes.
 #
 # NOTE: The following standard [PyTorch Distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization)
 # are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT.
@@ -95,6 +95,10 @@ environmentVariables:
 #    secret:
 #      name: secret-name
 #      key: secret-key
+#  - name: EXAMPLE_VAR4
+#    configmap:
+#      name: configmap-name
+#      key: configmap-key
 
 # Private GitHub clone support.
 #

0 commit comments
