This repository uses GitOps to deploy NVIDIA GPU support into your cluster. It automates:
- 👉 Node Feature Discovery (NFD) to detect GPU nodes
- 👉 NVIDIA GPU Operator to install drivers, device plugin, runtime, monitoring, etc.
- 👉 NVIDIA DCGM Dashboard to visualize GPU telemetry that the NVIDIA DCGM Exporter publishes to Prometheus.
Prerequisites:
- A Kubernetes or OpenShift cluster with at least one GPU-capable node.
- `oc` (or `kubectl`) logged into the cluster as a cluster-admin.
- `kustomize` CLI (optional if you use `oc apply -k`).
You can bootstrap these components declaratively without cloning this repo with:
```bash
# 1. Log in
oc login <your-cluster-url> --token=<token> --server=<api-url>

# 2. Bootstrap GPU GitOps
oc apply -f https://github.com/na-launch/gpu-gitops/main/gitops/manifests/cluster/bootstrap/base/bigbang-app.yaml

# 3. Monitor progress
oc get pods -n nvidia-gpu-operator -w
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o name

# 4. Verify GPU functionality
oc -n nvidia-gpu-operator get pods -l openshift.driver-toolkit=true -o name | while read pod; do
  echo "$pod"
  oc exec -n nvidia-gpu-operator -it "$pod" -- nvidia-smi
  echo
done
```
In the OpenShift Container Platform web console:
- Go to Observe → Dashboards and select NVIDIA DCGM Exporter Dashboard from the list.
The NVIDIA DCGM Exporter Dashboard displays GPU-related graphs such as utilization, temperature, memory usage, and more.
Once the GPU infrastructure is installed:
- Create GPU-capable workloads by requesting `nvidia.com/gpu: 1` under `resources.limits`.
- Integrate with an ML platform (e.g. OpenDataHub or OpenShift AI).
- Deploy monitoring tools like Prometheus to collect DCGM metrics from GPUs.
- Scale your GPU node pools and GPU workloads using standard GitOps.
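As a sketch of the first step above, a minimal Pod that requests one GPU might look like the following. The pod name and container image are illustrative choices, not part of this repo; `nvidia.com/gpu` is the extended resource the GPU Operator's device plugin exposes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test          # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # example CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1      # schedules the pod onto a GPU node
```

Apply it with `oc apply -f`, then check the pod logs for `nvidia-smi` output to confirm the workload can see the GPU.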
- NVIDIA GPU Operator + NFD setup with GitOps workflows
- GPU detection using NFD labels (`pci-10de`)
- Typical GPU Operator installation patterns and validation via `nvidia-smi`