KAI Scheduler
https://github.com/NVIDIA/KAI-Scheduler
Install
Using helm. Download the chart:
$ helm pull oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler --version v0.7.12
Pulled: ghcr.io/nvidia/kai-scheduler/kai-scheduler:v0.7.12
Digest: sha256:1287e121e6c0b01c6e874fd104f8d36487ce0b773880bdc80d95ec5ab7f2c94a
Install:
$ helm upgrade -i kai-scheduler -n kai-scheduler --create-namespace kai-scheduler-v0.7.12.tgz
Release "kai-scheduler" does not exist. Installing it now.
NAME: kai-scheduler
LAST DEPLOYED: Mon Aug 11 14:42:24 2025
NAMESPACE: kai-scheduler
STATUS: deployed
REVISION: 1
TEST SUITE: None
Show the resources:
$ kubectl get all -n kai-scheduler
NAME READY STATUS RESTARTS AGE
pod/binder-55549f8c66-4qltk 1/1 Running 0 7m55s
pod/podgroup-controller-7f57b65444-42d8m 1/1 Running 0 7m55s
pod/podgrouper-d4b78b659-zcjnf 1/1 Running 0 7m55s
pod/queuecontroller-89dcbd6d4-ss5tp 1/1 Running 0 7m55s
pod/scheduler-5f8dd8d9c9-9448w 1/1 Running 0 7m55s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/binder ClusterIP 10.103.78.230 <none> 443/TCP,8080/TCP 7m55s
service/queuecontroller ClusterIP 10.109.35.28 <none> 443/TCP,8080/TCP 7m55s
service/scheduler ClusterIP 10.106.100.152 <none> 8080/TCP 7m55s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/binder 1/1 1 1 7m55s
deployment.apps/podgroup-controller 1/1 1 1 7m55s
deployment.apps/podgrouper 1/1 1 1 7m55s
deployment.apps/queuecontroller 1/1 1 1 7m55s
deployment.apps/scheduler 1/1 1 1 7m55s
NAME DESIRED CURRENT READY AGE
replicaset.apps/binder-55549f8c66 1 1 1 7m55s
replicaset.apps/podgroup-controller-7f57b65444 1 1 1 7m55s
replicaset.apps/podgrouper-d4b78b659 1 1 1 7m55s
replicaset.apps/queuecontroller-89dcbd6d4 1 1 1 7m55s
replicaset.apps/scheduler-5f8dd8d9c9 1 1 1 7m55s
Usage
Create queues
Create file default_test_queue.yaml for queues:
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
name: default
spec:
resources:
cpu:
quota: -1
limit: -1
overQuotaWeight: 1
gpu:
quota: -1
limit: -1
overQuotaWeight: 1
memory:
quota: -1
limit: -1
overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
name: test
spec:
parentQueue: default
resources:
cpu:
quota: -1
limit: -1
overQuotaWeight: 1
gpu:
quota: -1
limit: -1
overQuotaWeight: 1
memory:
quota: -1
limit: -1
overQuotaWeight: 1
Apply it:
$ kubectl apply -f default_test_queue.yaml
queue.scheduling.run.ai/default created
queue.scheduling.run.ai/test created
Show the created queues:
$ kubectl get queue
NAME PRIORITY PARENT CHILDREN DISPLAYNAME
default ["test"]
test default
Create pods
Create file sleep_po_kai.yaml, based on the pod config in Pod, with these modifications:
kind: Pod
metadata:
name: sleep
+ labels:
+ kai.scheduler/queue: test
spec:
restartPolicy: OnFailure
+ schedulerName: kai-scheduler
containers:
- image: busybox:1.37.0-glibc
imagePullPolicy: IfNotPresent
Apply it:
$ kubectl apply -f sleep_po_kai.yaml
pod/sleep created