Using Volcano Scheduler Queues, Priorities, and Backfill with Kubeflow Jobs

Experiments

First, create a PriorityClass in Kubernetes:

$ kubectl apply -f high_pc.yaml
priorityclass.scheduling.k8s.io/high-priority created

It is defined as follows:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 10000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Priority 10000"

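Pods without a PriorityClass get priority 0 (assuming no PriorityClass in the cluster sets globalDefault: true), so the class above ranks higher than the default.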

Next, create a Volcano queue:

$ kubectl apply -f test_q_8g.yaml 
queue.scheduling.volcano.sh/test created

It is defined as follows:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  reclaimable: false
  capability:
    cpu: 8
    memory: 64Gi
    nvidia.com/gpu: 8

The experiments use the following PyTorchJob definition, saved as pytorchjob_q_test.yaml:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  generateName: job-
  namespace: default
  labels:
    for: test
spec:
  runPolicy:
    backoffLimit: 0
    schedulingPolicy:
      queue: test
      priorityClass: high-priority
  pytorchReplicaSpecs:
    Master:
      replicas: 1 # must be 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - image: busybox:1.37.0-glibc
              imagePullPolicy: IfNotPresent
              name: pytorch # must be `pytorch`
              command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
              resources:
                limits:
                  cpu: "1"
                  memory: "100Mi"
                  nvidia.com/gpu: "1"
    Worker:
      replicas: 5
      restartPolicy: Never
      template:
        spec:
          containers:
            - image: busybox:1.37.0-glibc
              imagePullPolicy: IfNotPresent
              name: pytorch # must be `pytorch`
              command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
              resources:
                limits:
                  cpu: "1"
                  memory: "100Mi"
                  nvidia.com/gpu: "1"

All settings related to the Volcano queue and priority go under .spec.runPolicy.schedulingPolicy.
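
To see why a single job nearly saturates the queue, the job's total demand can be compared against the queue's capability. The following is a back-of-the-envelope sketch in Python, not something Volcano runs; the function names are hypothetical. It assumes that a container with only limits set gets requests equal to those limits (the Kubernetes default) and that the queue's capability is the only constraint.

```python
# Queue capability from the Queue definition above (memory expressed in MiB).
QUEUE_CAPABILITY = {"cpu": 8, "memory_mi": 64 * 1024, "nvidia.com/gpu": 8}

# Per-pod limits from the PyTorchJob definition above.
POD_LIMITS = {"cpu": 1, "memory_mi": 100, "nvidia.com/gpu": 1}


def job_demand(pods):
    """Total resources requested by a job with `pods` identical pods."""
    return {name: amount * pods for name, amount in POD_LIMITS.items()}


def remaining(capacity, demand):
    """Capacity left after subtracting a job's demand."""
    return {name: capacity[name] - demand[name] for name in capacity}


def fits(capacity, demand):
    """True if every requested resource is within the given capacity."""
    return all(demand[name] <= capacity[name] for name in demand)


full_job = job_demand(6)   # 1 Master + 5 Workers
small_job = job_demand(1)  # Master only

left = remaining(QUEUE_CAPABILITY, full_job)
print(left)                   # {'cpu': 2, 'memory_mi': 64936, 'nvidia.com/gpu': 2}
print(fits(left, full_job))   # False
print(fits(left, small_job))  # True
```

With one 6-pod job running, only 2 CPUs and 2 GPUs remain, so a second 6-pod job has to wait while a Master-only job still fits.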

Priority scheduling experiment

First, submit a job:

$ kubectl create -f pytorchjob_q_test.yaml
pytorchjob.kubeflow.org/job-6wgnj created

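Once it is scheduled, it holds 6 of the queue's 8 CPUs and GPUs, which leaves too little capacity for another 6-pod job. Now create a job with the default priority: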

$ yq 'del(.spec.runPolicy.schedulingPolicy.priorityClass)' pytorchjob_q_test.yaml | kubectl create -f -
pytorchjob.kubeflow.org/job-2jmz2 created

Then create another high-priority job:

$ kubectl create -f pytorchjob_q_test.yaml
pytorchjob.kubeflow.org/job-58cjb created

At this point, both of the later jobs are waiting in the queue.

Watch the PyTorchJob status:

$ kubectl get pytorchjob -w
NAME        STATE   AGE
job-6wgnj           0s
job-6wgnj   Created   0s
job-6wgnj   Created   1s
job-6wgnj   Running   3s
job-6wgnj   Running   6s
job-6wgnj   Running   6s
job-6wgnj   Running   6s
job-2jmz2             0s
job-2jmz2   Created   0s
job-2jmz2   Created   1s
job-58cjb             0s
job-58cjb   Created   0s
job-58cjb   Created   1s
job-58cjb   Created   2s
job-6wgnj   Succeeded   34s
job-6wgnj   Succeeded   35s
job-58cjb   Running     20s
job-58cjb   Running     24s
job-58cjb   Running     24s
job-58cjb   Running     24s
job-58cjb   Running     24s
job-58cjb   Succeeded   52s
job-58cjb   Succeeded   52s
job-2jmz2   Running     56s
job-2jmz2   Running     59s
job-2jmz2   Running     59s
job-2jmz2   Running     59s
job-2jmz2   Running     59s
job-2jmz2   Succeeded   87s
job-2jmz2   Succeeded   87s

Watch the PodGroup status:

$ kubectl get podgroup -owide -w
NAME        STATUS   MINMEMBER   RUNNINGS   AGE   QUEUE
job-6wgnj            6                      0s    test
job-6wgnj   Inqueue   6                      0s    test
job-6wgnj   Running   6                      2s    test
job-6wgnj   Running   6           1          4s    test
job-6wgnj   Running   6           6          7s    test
job-2jmz2             6                      0s    test
job-2jmz2   Inqueue   6                      1s    test
job-2jmz2   Inqueue   6                      2s    test
job-58cjb             6                      0s    test
job-58cjb   Inqueue   6                      1s    test
job-58cjb   Inqueue   6                      2s    test
job-6wgnj   Running   6           6          35s   test
job-58cjb   Running   6                      19s   test
job-58cjb   Running   6           1          21s   test
job-58cjb   Running   6           4          24s   test
job-58cjb   Running   6           6          25s   test
job-58cjb   Running   6           6          52s   test
job-2jmz2   Running   6                      54s   test
job-2jmz2   Running   6           1          56s   test
job-2jmz2   Running   6           5          59s   test
job-2jmz2   Running   6           6          60s   test
job-2jmz2   Running   6           6          87s   test

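The output shows that the high-priority job (job-58cjb) was scheduled before the default-priority job (job-2jmz2), even though it was created later.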
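
The ordering observed above can be mimicked with a small Python sketch. This is a simplified model of the observed behaviour, not Volcano's implementation: the PendingJob class and scheduling_order function are hypothetical, and breaking ties by submission order is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    name: str
    priority: int  # value of the job's PriorityClass, 0 if none is set
    created: int   # submission order, smaller means earlier


def scheduling_order(jobs):
    """Sort by descending priority, then by submission order."""
    return sorted(jobs, key=lambda job: (-job.priority, job.created))


# The two jobs that were waiting while job-6wgnj was running.
pending = [
    PendingJob("job-2jmz2", priority=0, created=1),
    PendingJob("job-58cjb", priority=10000, created=2),
]
order = [job.name for job in scheduling_order(pending)]
print(order)  # ['job-58cjb', 'job-2jmz2']
```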

Backfill experiment

First, submit a job:

$ kubectl create -f pytorchjob_q_test.yaml
pytorchjob.kubeflow.org/job-gtps2 created

Once it is scheduled, the queue no longer has enough capacity for another job of the same size. Now create a second job:

$ kubectl create -f pytorchjob_q_test.yaml
pytorchjob.kubeflow.org/job-xjhtm created

This job should remain queued. Then create a job that needs fewer resources:

$ yq 'del(.spec.pytorchReplicaSpecs.Worker)' pytorchjob_q_test.yaml | kubectl create -f -
pytorchjob.kubeflow.org/job-vr5xc created

Note: the Worker replica count must be at least 1, but the entire Worker field can be omitted.

Watch the PyTorchJob status:

$ kubectl get pytorchjob -w
NAME        STATE   AGE
job-gtps2           0s
job-gtps2   Created   0s
job-gtps2   Created   0s
job-gtps2   Running   2s
job-gtps2   Running   4s
job-gtps2   Running   4s
job-gtps2   Running   4s
job-gtps2   Running   6s
job-xjhtm             0s
job-xjhtm   Created   0s
job-xjhtm   Created   1s
job-vr5xc             1s
job-vr5xc   Created   1s
job-vr5xc   Created   1s
job-vr5xc   Running   3s
job-gtps2   Succeeded   33s
job-gtps2   Succeeded   33s
job-xjhtm   Running     27s
job-xjhtm   Running     28s
job-xjhtm   Running     28s
job-xjhtm   Running     28s
job-xjhtm   Running     29s
job-xjhtm   Running     30s
job-vr5xc   Succeeded   34s
job-xjhtm   Succeeded   58s
job-xjhtm   Succeeded   58s

Watch the PodGroup status:

$ kubectl get podgroup -owide -w
NAME        STATUS   MINMEMBER   RUNNINGS   AGE   QUEUE
job-gtps2            6                      0s    test
job-gtps2   Inqueue   6                      0s    test
job-gtps2   Running   6                      1s    test
job-gtps2   Running   6           1          2s    test
job-gtps2   Running   6           3          4s    test
job-gtps2   Running   6           5          5s    test
job-gtps2   Running   6           6          6s    test
job-xjhtm             6                      0s    test
job-xjhtm   Inqueue   6                      1s    test
job-xjhtm   Inqueue   6                      2s    test
job-vr5xc             1                      0s    test
job-vr5xc   Inqueue   1                      0s    test
job-vr5xc   Running   1                      1s    test
job-vr5xc   Running   1           1          2s    test
job-xjhtm   Inqueue   6                      10s   test
job-gtps2   Running   6           6          33s   test
job-xjhtm   Running   6                      26s   test
job-xjhtm   Running   6           1          27s   test
job-xjhtm   Running   6           5          29s   test
job-xjhtm   Running   6           6          31s   test
job-vr5xc   Running   1           1          33s   test
job-xjhtm   Running   6           6          58s   test

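The output shows that the job further back in the queue (job-vr5xc), which fits in the remaining resources, was scheduled ahead of the larger job in front of it (job-xjhtm).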
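
The behaviour observed above can be illustrated with a toy Python model in which a job that does not fit is skipped instead of blocking the jobs behind it. This is only a sketch of the observed outcome, not Volcano's scheduling code: schedule_pass is a hypothetical function, and capacity is reduced to a single count of CPU/GPU slots (one per pod).

```python
def schedule_pass(free, pending):
    """One pass over pending jobs in queue order.

    `free` is the number of free slots; `pending` is a list of
    (name, slots_needed) tuples. A job that does not fit is skipped
    rather than blocking the jobs behind it.
    """
    started = []
    still_pending = []
    for name, needed in pending:
        if needed <= free:
            free -= needed
            started.append(name)
        else:
            still_pending.append((name, needed))
    return started, still_pending, free


# job-gtps2 holds 6 of the queue's 8 slots.
started, waiting, free = schedule_pass(
    free=8 - 6,
    pending=[("job-xjhtm", 6), ("job-vr5xc", 1)],
)
print(started)  # ['job-vr5xc']
print(waiting)  # [('job-xjhtm', 6)]
print(free)     # 1

# After job-gtps2 finishes, its 6 slots return and job-xjhtm can start.
started, waiting, free = schedule_pass(free=free + 6, pending=waiting)
print(started)  # ['job-xjhtm']
```

The second pass matches the logs above, where job-xjhtm starts running as soon as job-gtps2 succeeds, while job-vr5xc is still running.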

Conclusions

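Kubeflow jobs support specifying a Volcano queue and priority.
Within a queue, higher-priority jobs run before lower-priority jobs.
Backfill works: when the queue lacks the resources to schedule the next job, Volcano looks further back in the queue for a job that needs fewer resources and can be scheduled, and schedules it first.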