Experiments with Preemptive Scheduling of Volcano Jobs

This article applies to Volcano 1.12.1.

Preparation

Volcano jobs use a PriorityClass to define their priority. First, define a PriorityClass:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 10000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Priority 10000"

Its name is high-priority and its value is 10000, higher than the default priority of 0. Save the text above as high_pc.yaml and apply it to the k8s cluster:

$ kubectl apply -f high_pc.yaml 
priorityclass.scheduling.k8s.io/high-priority created
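
To confirm that the object carries the intended value, one option is to read the value back from the cluster. The helper below is a minimal sketch; the function name pc_value is hypothetical and is not part of kubectl or Volcano.

```shell
# Hypothetical helper: print the numeric value of a PriorityClass.
pc_value() {
  kubectl get priorityclass "$1" -o jsonpath='{.value}'
}

# Usage (requires access to the cluster):
#   pc_value high-priority
```

For the PriorityClass defined above, the call should print 10000.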

With the Default Configuration

The behavior of the Volcano scheduler is controlled by its configuration. To view the configuration:

$ kubectl get cm volcano-scheduler-configmap -n volcano-system -oyaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap

The above is the default configuration after installing 1.12.1.

Experiment 1: Jobs in a queue are ordered by priority

Define a Volcano queue for testing:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  weight: 1
  reclaimable: false
  capability:
    cpu: 1
    memory: 64Gi

Its name is test and its CPU capability is 1. Save the text above as test_q_1c.yaml and apply it to the k8s cluster:

$ kubectl apply -f test_q_1c.yaml
queue.scheduling.volcano.sh/test created

Define a Volcano job:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sleep-normal
spec:
  minAvailable: 3
  schedulerName: volcano
  queue: test
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 3
      name: sleep-task
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          restartPolicy: Never
          containers:
            - image: busybox
              imagePullPolicy: IfNotPresent
              name: busybox-sleep
              command: ["sh", "-c", "trap exit INT TERM; sleep 1m & wait"]
              resources:
                requests:
                  cpu: 1
                limits:
                  cpu: 1

Its name is sleep-normal. The job contains 3 task instances (replicas: 3 of sleep-task), each requesting 1 CPU. Each instance waits for 1 minute and then exits.

Save the text above as sleep_vj_normal.yaml. Also save a copy as sleep_vj_high.yaml and make the following changes:

 apiVersion: batch.volcano.sh/v1alpha1
 kind: Job
 metadata:
-  name: sleep-normal
+  name: sleep-high
 spec:
   minAvailable: 3
   schedulerName: volcano
   queue: test
+  priorityClassName: high-priority
   policies:
     - event: PodEvicted
       action: RestartJob
       template:
         spec:
           restartPolicy: Never
+          priorityClassName: high-priority
           containers:
             - image: busybox
               imagePullPolicy: IfNotPresent

The name is changed to sleep-high, and the job now uses the high priority class.

Note

The setting priorityClassName: high-priority must appear at both the Job level and the Pod level. This is essential for the runtime preemption shown later.
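
The same edit can also be scripted. The sketch below assumes the manifest is formatted exactly like sleep_vj_normal.yaml above (same indentation, no trailing whitespace); make_high is a hypothetical helper name chosen for illustration.

```shell
# Hypothetical helper: derive the high-priority manifest from the normal one.
# Prints the result to stdout; the input file is not modified.
make_high() {
  awk '
    # Rename the job, then print every line.
    { sub(/name: sleep-normal$/, "name: sleep-high"); print }
    # Job level: add priorityClassName right after the queue field.
    /^  queue: / { print "  priorityClassName: high-priority" }
    # Pod level: add priorityClassName right after restartPolicy.
    /^          restartPolicy: / { print "          priorityClassName: high-priority" }
  ' "$1"
}

# Usage:
#   make_high sleep_vj_normal.yaml > sleep_vj_high.yaml
```

Because the patterns depend on exact indentation, it is worth reviewing the generated file before submitting it.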

Submit the jobs sleep-normal and sleep-high, in that order:

$ kubectl create -f sleep_vj_normal.yaml
job.batch.volcano.sh/sleep-normal created
$ kubectl create -f sleep_vj_high.yaml 
job.batch.volcano.sh/sleep-high created

Because queue test does not have enough CPUs, both jobs stay in the Pending state:

$ kubectl get vj -owide
NAME           STATUS    MINAVAILABLE   RUNNINGS   AGE   QUEUE
sleep-high     Pending   3                         8s    test
sleep-normal   Pending   3                         19s   test
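
The arithmetic behind this can be sketched as follows. It is a deliberately simplified model, not Volcano's actual scheduling logic: it only compares minAvailable times the per-pod CPU request against the queue's CPU capability. The function name gang_fits is hypothetical.

```shell
# gang_fits MIN_AVAILABLE CPU_PER_POD QUEUE_CPU
# Succeeds when all MIN_AVAILABLE pods fit into the queue's CPU capability.
gang_fits() {
  [ $(( $1 * $2 )) -le "$3" ]
}

# One job (3 pods x 1 CPU) against a queue with cpu: 1
if gang_fits 3 1 1; then echo "fits"; else echo "does not fit"; fi
```

Under this model a queue with 4 CPUs has room for one such job (3 CPUs) but not for two at once (6 CPUs).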

If the vcctl tool is installed, more detailed information is available:

$ vcctl job list
Name           Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
sleep-high     2025-07-14     Pending     Batch       3           3     0         0         0           0         0           0
sleep-normal   2025-07-14     Pending     Batch       3           3     0         0         0           0         0           0
$ vcctl queue get --name test
Name                     Weight  State   Inqueue Pending Running Unknown Completed
test                     1       Open    0       2       0       0       0

Modify the definition of queue test:

$ kubectl edit q test
queue.scheduling.volcano.sh/test edited

The change increases the number of CPUs, as follows:

   weight: 1
   reclaimable: false
   capability:
-    cpu: 1
+    cpu: 4
     memory: 64Gi

With more resources available, one of the jobs starts running:

$ kubectl get vj -owide
NAME           STATUS    MINAVAILABLE   RUNNINGS   AGE     QUEUE
sleep-high     Running   3              3          2m23s   test
sleep-normal   Pending   3                         2m34s   test
$ vcctl job list
Name           Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
sleep-high     2025-07-14     Running     Batch       3           3     0         3         0           0         0           0
sleep-normal   2025-07-14     Pending     Batch       3           3     0         0         0           0         0           0
$ vcctl queue get --name test
Name                     Weight  State   Inqueue Pending Running Unknown Completed
test                     1       Open    0       1       1       0       0

As shown, the higher-priority job runs first, even though it was submitted after sleep-normal.
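
For scripted checks, the phase of a single job can be read directly instead of parsing the table. This is a minimal sketch, assuming the job phase is exposed at .status.state.phase of the Volcano Job object; job_phase is a hypothetical helper name.

```shell
# Hypothetical helper: print the phase (Pending, Running, ...) of a Volcano job.
job_phase() {
  kubectl get vj "$1" -o jsonpath='{.status.state.phase}'
}

# Usage (requires access to the cluster):
#   job_phase sleep-high
#   job_phase sleep-normal
```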

Experiment 2: A running job cannot be preempted

As set up above, the queue's CPU capability is 4. First submit the default-priority job sleep-normal; once it is running, submit the high-priority job sleep-high:

$ kubectl create -f sleep_vj_normal.yaml 
job.batch.volcano.sh/sleep-normal created
$ kubectl get vj sleep-normal
NAME           STATUS    MINAVAILABLE   RUNNINGS   AGE
sleep-normal   Running   3              3          10s
$ kubectl create -f sleep_vj_high.yaml 
job.batch.volcano.sh/sleep-high created

Check the job status:

$ kubectl get vj
NAME           STATUS    MINAVAILABLE   RUNNINGS   AGE
sleep-high     Pending   3                         3s
sleep-normal   Running   3              3          22s
$ vcctl job list
Name           Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
sleep-high     2025-07-14     Pending     Batch       3           3     0         0         0           0         0           0
sleep-normal   2025-07-14     Running     Batch       3           3     0         3         0           0         0           0
$ vcctl queue get --name test
Name                     Weight  State   Inqueue Pending Running Unknown Completed
test                     1       Open    0       1       1       0       0

The low-priority job is still in the Running state, while the high-priority job remains Pending.

Enabling Runtime Preemption

Modify the configuration of the Volcano scheduler:

$ kubectl edit cm volcano-scheduler-configmap -n volcano-system
configmap/volcano-scheduler-configmap edited

The changes are as follows:

 apiVersion: v1
 data:
   volcano-scheduler.conf: |
-    actions: "enqueue, allocate, backfill"
+    actions: "enqueue, allocate, backfill, preempt"
     tiers:
     - plugins:
       - name: priority
+      - name: overcommit
       - name: gang
         enablePreemptable: false
       - name: conformance
     - plugins:
-      - name: overcommit
       - name: drf
         enablePreemptable: false
       - name: predicates

Note

Pay special attention to moving the overcommit plugin ahead of gang.
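
For reference, applying the diff above to the default configuration yields the scheduler configuration below. The snippet only writes it to a local file for review (the file name is arbitrary); it does not touch the cluster.

```shell
# Write the resulting scheduler configuration to a local file for review.
cat > volcano-scheduler.conf <<'EOF'
actions: "enqueue, allocate, backfill, preempt"
tiers:
- plugins:
  - name: priority
  - name: overcommit
  - name: gang
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: drf
    enablePreemptable: false
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
EOF
```

Comparing this file with the volcano-scheduler.conf key of the edited ConfigMap is a quick way to catch a misplaced plugin entry.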

Experiment 3: A high-priority job preempts a running low-priority job

To observe how the job status changes, watch the jobs in another terminal:

$ kubectl get vj -owide -w

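The queue and job settings are the same as in Experiment 2. First submit the default-priority job sleep-normal; once it is running, submit the high-priority job sleep-high: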

$ kubectl create -f sleep_vj_normal.yaml 
job.batch.volcano.sh/sleep-normal created
$ kubectl create -f sleep_vj_high.yaml 
job.batch.volcano.sh/sleep-high created

Output in the other terminal:

NAME           STATUS   MINAVAILABLE   RUNNINGS   AGE   QUEUE
sleep-normal                                      0s    test
sleep-normal   Pending   3                         0s    test
sleep-normal   Pending   3                         1s    test
sleep-normal   Pending   3              1          3s    test
sleep-normal   Pending   3              2          4s    test
sleep-normal   Running   3              3          4s    test
sleep-high                                         0s    test
sleep-high     Pending   3                         0s    test
sleep-high     Pending   3                         1s    test
sleep-normal   Restarting   3                         19s   test
sleep-normal   Pending      3                         19s   test
sleep-normal   Pending      3                         20s   test
sleep-normal   Pending      3                         20s   test
sleep-high     Pending      3              1          5s    test
sleep-high     Pending      3              2          5s    test
sleep-high     Running      3              3          5s    test
sleep-high     Running      3              2          65s   test
sleep-high     Running      3              1          66s   test
sleep-high     Completing   3                         66s   test
sleep-high     Completed    3                         66s   test
sleep-high     Completed    3                         66s   test
sleep-normal   Pending      3              1          85s   test
sleep-normal   Pending      3              2          85s   test
sleep-normal   Running      3              3          85s   test
sleep-normal   Running      3              2          2m26s   test
sleep-normal   Running      3              1          2m27s   test
sleep-normal   Completing   3                         2m27s   test
sleep-normal   Completed    3                         2m27s   test
sleep-normal   Completed    3                         2m27s   test

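As shown, after the high-priority job is submitted, the low-priority job is restarted and goes back to the Pending state, and the high-priority job gets to run. Once sleep-high completes, sleep-normal is scheduled again and runs to completion. Also note that the 3 task instances of a job are always scheduled together, which is characteristic of gang scheduling.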
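
The watch output is long because a line is printed on every update. To see only the phase transitions, it can be filtered as sketched below. The filter assumes the column layout shown above (job name in the first column, STATUS in the second); status_changes is a hypothetical helper name.

```shell
# Hypothetical helper: print "<job> <status>" only when a job's status changes.
# Lines whose second column is not a capitalized word (the header, or rows
# that have no status yet) are skipped.
status_changes() {
  awk '$2 ~ /^[A-Z][a-z]+$/ && $2 != last[$1] { print $1, $2; last[$1] = $2 }' "$@"
}

# Usage (requires access to the cluster):
#   kubectl get vj -owide -w | status_changes
```

Applied to the output above, the filter would reduce sleep-normal to the sequence Pending, Running, Restarting, Pending, Running, Completing, Completed.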