NVIDIA GPU Operator
Add repository:
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
"nvidia" has been added to your repositories
Install:
$ helm pull nvidia/gpu-operator --version=v25.3.2
$ helm install gpu-operator -n gpu-operator --create-namespace gpu-operator-v25.3.2.tgz --set driver.enabled=false
W0828 11:26:49.734907 75972 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0828 11:26:49.734838 75972 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0828 11:26:49.755071 75972 warnings.go:70] unknown field "spec.dcgmExporter.service"
NAME: gpu-operator
LAST DEPLOYED: Thu Aug 28 11:26:49 2025
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
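After the install returns, the operator still needs some time to roll out its pods. As a minimal sketch, the helper below lists pods that are neither Running nor Completed. The function name `unhealthy_operator_pods` is hypothetical, and the namespace default simply mirrors the install command above.

```shell
# List GPU Operator pods whose STATUS is neither Running nor Completed.
# Assumption: the operator lives in the "gpu-operator" namespace (as installed above).
unhealthy_operator_pods() {
  local ns="${1:-gpu-operator}"
  # Columns of `kubectl get pods --no-headers`: NAME READY STATUS RESTARTS AGE
  kubectl -n "$ns" get pods --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Calling `unhealthy_operator_pods` with no arguments prints nothing once every pod has settled.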
Note
The NVIDIA driver is already installed on the GPU node, which is why driver.enabled=false is set.
If some gpu-operator pods report the error "failed to get sandbox runtime: no runtime for "nvidia" is configured", containerd may need to be configured manually (the operator normally does this for you):
$ sudo nvidia-ctk runtime configure --runtime=containerd
INFO[0000] Using config version 2
INFO[0000] Using CRI runtime plugin name "io.containerd.grpc.v1.cri"
INFO[0000] Wrote updated config to /etc/containerd/config.toml
INFO[0000] It is recommended that containerd daemon be restarted.
$ sudo systemctl restart containerd
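To confirm that the runtime entry actually landed in the config, one option is to grep for the runtime table. This is a sketch that assumes the runtime is registered under a TOML table whose name ends in `runtimes.nvidia`; the helper name `has_nvidia_runtime` is hypothetical.

```shell
# Succeed if the containerd config declares an "nvidia" runtime table.
# Assumption: the table header looks like [plugins."...".containerd.runtimes.nvidia].
has_nvidia_runtime() {
  local config="${1:-/etc/containerd/config.toml}"
  grep -Eq '^[[:space:]]*\[plugins\..*\.runtimes\.nvidia\]' "$config"
}
```

For example, `has_nvidia_runtime && echo configured` checks the default path reported by nvidia-ctk.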
Now inspect the GPU node:
$ kubectl describe no las3
⋮
Capacity:
  cpu:                8
  ephemeral-storage:  203056560Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8125876Ki
  nvidia.com/gpu:     1
  pods:               110
⋮
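For scripting, the same number can be read without scanning the describe output. The sketch below uses a kubectl JSONPath query with the dots in the resource name escaped; the function name `node_gpu_count` is hypothetical.

```shell
# Print the number of allocatable nvidia.com/gpu resources on a node.
# Dots inside the resource name must be escaped in the JSONPath expression.
node_gpu_count() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

Usage: `node_gpu_count las3`. The output is empty if the device plugin has not advertised any GPUs on that node.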
Create a pod config file gpu_po.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  restartPolicy: OnFailure
  containers:
  - image: ubuntu:22.04
    imagePullPolicy: IfNotPresent
    name: gpu-ubuntu
    command: ["bash", "-c", "nvidia-smi"]
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
      limits:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
Apply to the cluster:
$ kubectl apply -f gpu_po.yaml
pod/gpu created
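Because the pod runs nvidia-smi once and exits, a script may want to block until it finishes before reading the logs. A minimal sketch, assuming a kubectl version that supports `--for=jsonpath` in `kubectl wait`; the function name and the default timeout are hypothetical choices.

```shell
# Block until the named pod reaches phase Succeeded, or the timeout expires.
# Assumption: kubectl is recent enough to support --for=jsonpath.
wait_pod_succeeded() {
  local pod="$1" timeout="${2:-120s}"
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded "pod/$pod" --timeout="$timeout"
}
```

Usage: `wait_pod_succeeded gpu && kubectl logs gpu`.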
Check that the pod was scheduled on the GPU node:
$ kubectl get po gpu -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu 0/1 Completed 0 104s 192.168.182.14 las3 <none> <none>
View the pod's output:
$ kubectl logs gpu
Thu Aug 28 03:31:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06 Driver Version: 580.65.06 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla P4 Off | 00000000:00:05.0 Off | 0 |
| N/A 36C P8 6W / 75W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
The GPU Operator starts two processes on the GPU host that hold the GPU devices open:
$ sudo lsof /dev/nvidia-uvm
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dcgm-expo 2985974 root 16u CHR 508,0 0t0 794 /dev/nvidia-uvm
nvidia-de 2986790 root 16u CHR 508,0 0t0 794 /dev/nvidia-uvm
MIG
Get MIG configs:
$ kubectl -n gpu-operator describe configmap/default-mig-parted-config
Name: default-mig-parted-config
Namespace: gpu-operator
Labels: <none>
Annotations: <none>
Data
====
config.yaml:
----
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false
  # A100-40GB, A800-40GB
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
Enable MIG on a node by setting the nvidia.com/mig.config label to one of the config names from the ConfigMap:
$ kubectl label no xxxx nvidia.com/mig.config=all-1g.10gb --overwrite
node/xxxx labeled
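Applying the label only requests the change; the reconfiguration itself happens asynchronously on the node. The sketch below polls for the result, assuming the operator reports progress in a node label named `nvidia.com/mig.config.state` with the values `success` and `failed`. The function name and the `MIG_POLL_INTERVAL` variable are hypothetical.

```shell
# Poll a node until its MIG reconfiguration succeeds or fails.
# Returns 0 on success, 1 on failure, 2 if still undecided after all tries.
# Assumption: progress is published in the nvidia.com/mig.config.state label.
wait_mig_config() {
  local node="$1" tries="${2:-30}" state
  while [ "$tries" -gt 0 ]; do
    state="$(kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')"
    case "$state" in
      success) return 0 ;;
      failed)  return 1 ;;
    esac
    tries=$((tries - 1))
    sleep "${MIG_POLL_INTERVAL:-10}"
  done
  return 2
}
```

Usage: `wait_mig_config xxxx && kubectl describe no xxxx`.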
The default MIG strategy is single, which means each MIG instance is exposed as an nvidia.com/gpu resource. It can be changed to mixed at install time with --set mig.strategy=mixed.
In mixed mode, MIG instances are exposed under their own resource names (such as nvidia.com/mig-1g.10gb) and coexist with full GPUs.
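With the mixed strategy, a pod asks for a MIG slice by its profile-specific resource name rather than nvidia.com/gpu. As an illustrative sketch, the helper below prints such a manifest for a given profile. The function name and the pod name `mig-test` are hypothetical, and the manifest follows the gpu_po.yaml example from earlier.

```shell
# Print a pod manifest that requests one MIG slice under the mixed strategy.
# $1 is the MIG profile, for example 1g.10gb. "mig-test" is a placeholder name.
mig_pod_manifest() {
  local profile="${1:?usage: mig_pod_manifest <profile>}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: mig-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: mig-test
    image: ubuntu:22.04
    command: ["bash", "-c", "nvidia-smi -L"]
    resources:
      limits:
        nvidia.com/mig-${profile}: "1"
EOF
}
```

Usage: `mig_pod_manifest 1g.10gb | kubectl apply -f -`.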