Nvidia GPU DRA Driver
Note
这个驱动将来可能被整合进 GPU Operator.
安装
首先必须在主机上安装 Nvidia driver, 并且版本要求最低为 570.158.01. 这里安装的是 cuda_13.0.0_580.65.06_linux.
然后需要安装 GPU Operator. 因为主机已经安装了驱动,所以要指定 --set driver.enabled=false. 这里安装的是 v25.10.0.
用 helm 安装 nvidia-dra-driver-gpu:
$ helm pull nvidia/nvidia-dra-driver-gpu --create-namespace
$ helm install nvidia-dra-driver-gpu nvidia-dra-driver-gpu-25.8.0.tgz --create-namespace --namespace nvidia-dra-driver-gpu --set resources.gpus.enabled=true --set gpuResourcesEnabledOverride=true
NAME: nvidia-dra-driver-gpu
LAST DEPLOYED: Tue Nov 18 17:02:44 2025
NAMESPACE: nvidia-dra-driver-gpu
STATUS: deployed
REVISION: 1
TEST SUITE: None
GPU 资源默认是关闭的,如果一定要打开,需要同时设置 --set gpuResourcesEnabledOverride=true.
查看安装后的 Workloads:
$ kubectl get all -owide -n nvidia-dra-driver-gpu
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/nvidia-dra-driver-gpu-controller-b94fd47b6-tn562 1/1 Running 0 27s 192.168.100.169 las0 <none> <none>
pod/nvidia-dra-driver-gpu-kubelet-plugin-7wnc5 2/2 Running 0 27s 192.168.185.45 las3 <none> <none>
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
daemonset.apps/nvidia-dra-driver-gpu-kubelet-plugin 1 1 1 1 1 <none> 27s compute-domains,gpus nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.0,nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.0 nvidia-dra-driver-gpu-component=kubelet-plugin
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
deployment.apps/nvidia-dra-driver-gpu-controller 1/1 1 1 27s compute-domain nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.0 nvidia-dra-driver-gpu-component=controller
NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
replicaset.apps/nvidia-dra-driver-gpu-controller-b94fd47b6 1 1 1 27s compute-domain nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.0 nvidia-dra-driver-gpu-component=controller,pod-template-hash=b94fd47b6
注意 kubelet plugin 只在 GPU 节点上运行,controller 在控制平面节点上运行。
查看安装的 DeviceClasses:
$ kubectl get deviceclass
NAME AGE
compute-domain-daemon.nvidia.com 82s
compute-domain-default-channel.nvidia.com 82s
gpu.nvidia.com 82s
mig.nvidia.com 82s
查看生成的 ResourceSlices:
$ kubectl get resourceslice
NAME NODE DRIVER POOL AGE
las3-compute-domain.nvidia.com-ws7tv las3 compute-domain.nvidia.com las3 100s
las3-gpu.nvidia.com-trjfv las3 gpu.nvidia.com las3 100s
查看 GPU resource 的详情:
$ kubectl describe resourceslice las3-gpu.nvidia.com-trjfv
Name: las3-gpu.nvidia.com-trjfv
Namespace:
Labels: <none>
Annotations: <none>
API Version: resource.k8s.io/v1
Kind: ResourceSlice
⋮
Spec:
Devices:
Attributes:
Architecture:
String: Pascal
Brand:
String: Tesla
Cuda Compute Capability:
Version: 6.1.0
Cuda Driver Version:
Version: 13.0.0
Driver Version:
Version: 580.65.6
Pcie Bus ID:
String: 0000:00:05.0
Product Name:
String: Tesla P4
resource.kubernetes.io/pcieRoot:
String: pci0000:00
Type:
String: gpu
Uuid:
String: GPU-1183b79f-301d-9b3f-a0b7-09ee1b54be60
Capacity:
Memory:
Value: 7680Mi
Name: gpu-0
Driver: gpu.nvidia.com
Node Name: las3
Pool:
Generation: 1
Name: las3
Resource Slice Count: 1
Events: <none>
可见 Device 的属性描述了 GPU 的各种参数。另外 ResourceSlice 是全局的,不分 Namespace.
测试
使用 ResourceClaimTemplate
创建一个 ResourceClaimTemplate:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
name: gpu
spec:
spec:
devices:
requests:
- name: gpu
exactly:
deviceClassName: gpu.nvidia.com
count: 1
创建一个 Pod, 使用刚才的 ResourceClaimTemplate:
apiVersion: v1
kind: Pod
metadata:
name: gpu
spec:
restartPolicy: OnFailure
containers:
- image: ubuntu:22.04
imagePullPolicy: IfNotPresent
name: gpu-ubuntu
command: ["sh", "-c", "trap exit INT TERM; sleep 60s & wait"]
resources:
requests:
cpu: "1"
memory: 1Gi
limits:
cpu: "1"
memory: 1Gi
claims:
- name: gpu
resourceClaims:
- name: gpu
resourceClaimTemplateName: gpu
查看自动生成的 ResourceClaim:
$ kubectl get resourceclaim
NAME STATE AGE
gpu-gpu-c7f6x allocated,reserved 2m34s
这个 ResourceClaim 会在 Pod 运行结束后(不是删除后)自动删除。
检查 Pod 中的 GPU:
$ kubectl exec gpu -- nvidia-smi
Tue Nov 18 09:22:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06 Driver Version: 580.65.06 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla P4 Off | 00000000:00:05.0 Off | 0 |
| N/A 36C P8 6W / 75W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
使用 ResourceClaim
创建一个 ResourceClaim:
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
name: gpu
spec:
devices:
requests:
- name: gpu
exactly:
deviceClassName: gpu.nvidia.com
count: 1
创建一个 Job, 同时生成两个 Pod 使用刚才的 ResourceClaim:
apiVersion: batch/v1
kind: Job
metadata:
name: gpu
spec:
completions: 2
completionMode: Indexed
parallelism: 2
template:
spec:
restartPolicy: OnFailure
containers:
- image: ubuntu:22.04
imagePullPolicy: IfNotPresent
name: gpu-ubuntu
command: ["sh", "-c", "trap exit INT TERM; sleep 60s & wait"]
resources:
requests:
cpu: "1"
memory: 1Gi
limits:
cpu: "1"
memory: 1Gi
claims:
- name: gpu
resourceClaims:
- name: gpu
resourceClaimName: gpu
容易验证两个 Pod 都能使用同一个 GPU.
另外如果监视 ResourceClaim 的状态,可以得到:
$ kubectl get resourceclaim -w
NAME STATE AGE
gpu pending 0s
gpu pending 6s
gpu allocated,reserved 6s
gpu allocated,reserved 6s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu allocated,reserved 69s
gpu pending 69s
gpu pending 69s