NVIDIA GPU Operator

Add the Helm repository:

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
"nvidia" has been added to your repositories

Pull the chart and install it from the local archive:

$ helm pull nvidia/gpu-operator --version=v25.3.2
$ helm install gpu-operator -n gpu-operator --create-namespace gpu-operator-v25.3.2.tgz --set driver.enabled=false
W0828 11:26:49.734907   75972 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0828 11:26:49.734838   75972 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0828 11:26:49.755071   75972 warnings.go:70] unknown field "spec.dcgmExporter.service"
NAME: gpu-operator
LAST DEPLOYED: Thu Aug 28 11:26:49 2025
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Note

The NVIDIA driver is already installed on the GPU node, which is why driver.enabled=false is set.

If some gpu-operator pods report the error “failed to get sandbox runtime: no runtime for "nvidia" is configured”, you may need to configure containerd manually (the operator normally does this for you):

$ sudo nvidia-ctk runtime configure --runtime=containerd
INFO[0000] Using config version 2                       
INFO[0000] Using CRI runtime plugin name "io.containerd.grpc.v1.cri" 
INFO[0000] Wrote updated config to /etc/containerd/config.toml 
INFO[0000] It is recommended that containerd daemon be restarted.
$ sudo systemctl restart containerd
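
To confirm that the runtime entry was written, the config file can be searched for the nvidia runtime table. The helper below is a minimal sketch: the function name has_nvidia_runtime is hypothetical, and it assumes the table name written by nvidia-ctk ends in "runtimes.nvidia]".

```shell
# Hypothetical helper: succeeds if the containerd config declares an "nvidia" runtime.
# The default path is the one reported by nvidia-ctk in the output above.
has_nvidia_runtime() {
  local config="${1:-/etc/containerd/config.toml}"
  # Assumption: the runtime table name ends in "runtimes.nvidia]".
  grep -q 'runtimes\.nvidia\]' "$config"
}

# Usage:
#   has_nvidia_runtime && echo "nvidia runtime configured"
```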

Now inspect the GPU node; it should list nvidia.com/gpu under Capacity:

$ kubectl describe no las3

Capacity:
  cpu:                8
  ephemeral-storage:  203056560Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8125876Ki
  nvidia.com/gpu:     1
  pods:               110
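
For scripting, the GPU count can be read directly instead of scanning the describe output. This is a sketch only: the function name node_gpu_count is hypothetical, and it reads the allocatable value, which may be lower than the capacity shown above.

```shell
# Hypothetical helper: print the number of allocatable GPUs on a node.
node_gpu_count() {
  # Dots inside the resource name must be escaped in the JSONPath expression.
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Usage:
#   node_gpu_count las3
```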

Create a pod config file gpu_po.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  restartPolicy: OnFailure
  containers:
    - image: ubuntu:22.04
      imagePullPolicy: IfNotPresent
      name: gpu-ubuntu
      command: ["bash", "-c", "nvidia-smi"]
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "1"
          memory: 1Gi
          nvidia.com/gpu: "1"

Apply to the cluster:

$ kubectl apply -f gpu_po.yaml
pod/gpu created

Check that the pod was scheduled on the GPU node:

$ kubectl get po gpu -owide 
NAME   READY   STATUS      RESTARTS   AGE    IP               NODE    NOMINATED NODE   READINESS GATES
gpu    0/1     Completed   0          104s   192.168.182.14   las3    <none>           <none>
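
The same check can be scripted by reading the node name and phase from the pod object. A minimal sketch, with a hypothetical function name; a pod shown as Completed reports the phase Succeeded.

```shell
# Hypothetical helper: print "<node> <phase>" for a pod in the current namespace.
pod_node_and_phase() {
  kubectl get pod "$1" -o jsonpath='{.spec.nodeName} {.status.phase}'
}

# Usage:
#   pod_node_and_phase gpu
```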

View the pod's output:

$ kubectl logs gpu
Thu Aug 28 03:31:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P4                       Off |   00000000:00:05.0 Off |                    0 |
| N/A   36C    P8              6W /   75W |       0MiB /   7680MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The GPU Operator starts two processes on the GPU host that hold the GPU device open (dcgm-exporter and nvidia-device-plugin; lsof truncates the command names):

$ sudo lsof /dev/nvidia-uvm
COMMAND       PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
dcgm-expo 2985974 root   16u   CHR  508,0      0t0  794 /dev/nvidia-uvm
nvidia-de 2986790 root   16u   CHR  508,0      0t0  794 /dev/nvidia-uvm

MIG

List the available MIG configurations (output abridged):

$ kubectl -n gpu-operator describe configmap/default-mig-parted-config
Name:         default-mig-parted-config
Namespace:    gpu-operator
Labels:       <none>
Annotations:  <none>

Data
====
config.yaml:
----
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false

  # A100-40GB, A800-40GB
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7

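Enable MIG on a node by labeling it with the name of one of the mig-configs entries from the ConfigMap (all-1g.10gb in this example):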

$ kubectl label no xxxx nvidia.com/mig.config=all-1g.10gb --overwrite
node/xxxx labeled
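
Reconfiguring the GPU takes some time after the label is applied. The sketch below reads the state label that the operator's MIG manager is expected to set on the node; the function name is hypothetical, and the label name nvidia.com/mig.config.state with values such as pending, success, or failed is an assumption about the deployed operator version.

```shell
# Hypothetical helper: print the MIG reconfiguration state recorded on a node.
mig_config_state() {
  # Assumption: the state is published in the nvidia.com/mig.config.state label.
  kubectl get node "$1" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
}

# Usage:
#   mig_config_state xxxx
```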

The default MIG strategy is single, which means each MIG instance is exposed as an nvidia.com/gpu resource. To switch to mixed, pass --set mig.strategy=mixed at install time.

With the mixed strategy, MIG instances are exposed under their own resource names (for example nvidia.com/mig-1g.10gb) and coexist with full GPUs exposed as nvidia.com/gpu.
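
Under the mixed strategy a pod requests a MIG instance by its profile-specific resource name rather than nvidia.com/gpu. The generator below is a sketch modeled on gpu_po.yaml above; the function name mig_pod_manifest and the pod name mig-demo are hypothetical, and the profile must match one that is actually configured on the node.

```shell
# Hypothetical helper: print a pod manifest that requests one MIG instance.
# $1 = pod name, $2 = MIG profile (for example 1g.10gb)
mig_pod_manifest() {
  local name="$1" profile="$2"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: OnFailure
  containers:
    - name: ${name}
      image: ubuntu:22.04
      command: ["bash", "-c", "nvidia-smi -L"]
      resources:
        limits:
          nvidia.com/mig-${profile}: "1"
EOF
}

# Usage:
#   mig_pod_manifest mig-demo 1g.10gb | kubectl apply -f -
```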