Using Volcano Scheduler Queues, Priorities, and Backfill in Kubeflow Jobs
Experiment
First, create a PriorityClass in Kubernetes:
$ kubectl apply -f high_pc.yaml
priorityclass.scheduling.k8s.io/high-priority created
It is defined as follows:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 10000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Priority 10000"
The default priority value is 0, so this class ranks above the default.
Create a Volcano queue:
$ kubectl apply -f test_q_8g.yaml
queue.scheduling.volcano.sh/test created
It is defined as follows:
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  reclaimable: false
  capability:
    cpu: 8
    memory: 64Gi
    nvidia.com/gpu: 8
The experiments use the following PyTorchJob definition:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  generateName: job-
  namespace: default
  labels:
    for: test
spec:
  runPolicy:
    backoffLimit: 0
    schedulingPolicy:
      queue: test
      priorityClass: high-priority
  pytorchReplicaSpecs:
    Master:
      replicas: 1 # must be 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - image: busybox:1.37.0-glibc
              imagePullPolicy: IfNotPresent
              name: pytorch # must be `pytorch`
              command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
              resources:
                limits:
                  cpu: "1"
                  memory: "100Mi"
                  nvidia.com/gpu: "1"
    Worker:
      replicas: 5
      restartPolicy: Never
      template:
        spec:
          containers:
            - image: busybox:1.37.0-glibc
              imagePullPolicy: IfNotPresent
              name: pytorch # must be `pytorch`
              command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
              resources:
                limits:
                  cpu: "1"
                  memory: "100Mi"
                  nvidia.com/gpu: "1"
All settings related to the Volcano queue and priority go under .spec.runPolicy.schedulingPolicy.
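To see why two copies of this job cannot run side by side in the `test` queue, it helps to tally the numbers from the manifests above. The following is a minimal Python sketch of that arithmetic only. The helper names `job_demand` and `fits` are hypothetical, and the model assumes each pod counts for exactly its limits; it does not reproduce how Volcano itself accounts for resources.

```python
# Queue capability and per-pod limits copied from the manifests above.
# Memory is expressed in Mi so that 64Gi and 100Mi are comparable.
QUEUE_CAPABILITY = {"cpu": 8, "memory_mi": 64 * 1024, "nvidia.com/gpu": 8}
POD_LIMITS = {"cpu": 1, "memory_mi": 100, "nvidia.com/gpu": 1}


def job_demand(replicas: int) -> dict:
    """Total resources of a job made of `replicas` identical pods."""
    return {name: amount * replicas for name, amount in POD_LIMITS.items()}


def fits(demand: dict, used: dict) -> bool:
    """True if `demand` fits into what is left of the queue capability."""
    return all(
        used[name] + demand[name] <= QUEUE_CAPABILITY[name]
        for name in QUEUE_CAPABILITY
    )


empty = {name: 0 for name in QUEUE_CAPABILITY}
full_job = job_demand(6)     # 1 Master + 5 Workers
master_only = job_demand(1)  # same job with the Worker section removed

print(fits(full_job, empty))        # True: the first job fits an empty queue
print(fits(full_job, full_job))     # False: 12 CPUs/GPUs would exceed 8
print(fits(master_only, full_job))  # True: 7 CPUs/GPUs still fit within 8
```

In this simplified model, CPU and GPU are the binding resources: one full job takes 6 of 8, leaving room for at most two more single-pod jobs.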
Priority scheduling experiment
First, submit a job:
$ kubectl create -f pytorchjob_q_test.yaml
pytorchjob.kubeflow.org/job-6wgnj created
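Once it has been scheduled, the queue no longer has enough capacity for another job of the same size. Now create a job with the default priority: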
$ yq 'del(.spec.runPolicy.schedulingPolicy.priorityClass)' pytorchjob_q_test.yaml | kubectl create -f -
pytorchjob.kubeflow.org/job-2jmz2 created
Then create another high-priority job:
$ kubectl create -f pytorchjob_q_test.yaml
pytorchjob.kubeflow.org/job-58cjb created
At this point, both of the later jobs are waiting in the queue.
Watch the PyTorchJob status:
$ kubectl get pytorchjob -w
NAME STATE AGE
job-6wgnj 0s
job-6wgnj Created 0s
job-6wgnj Created 1s
job-6wgnj Running 3s
job-6wgnj Running 6s
job-6wgnj Running 6s
job-6wgnj Running 6s
job-2jmz2 0s
job-2jmz2 Created 0s
job-2jmz2 Created 1s
job-58cjb 0s
job-58cjb Created 0s
job-58cjb Created 1s
job-58cjb Created 2s
job-6wgnj Succeeded 34s
job-6wgnj Succeeded 35s
job-58cjb Running 20s
job-58cjb Running 24s
job-58cjb Running 24s
job-58cjb Running 24s
job-58cjb Running 24s
job-58cjb Succeeded 52s
job-58cjb Succeeded 52s
job-2jmz2 Running 56s
job-2jmz2 Running 59s
job-2jmz2 Running 59s
job-2jmz2 Running 59s
job-2jmz2 Running 59s
job-2jmz2 Succeeded 87s
job-2jmz2 Succeeded 87s
Watch the PodGroup status:
$ kubectl get podgroup -owide -w
NAME STATUS MINMEMBER RUNNINGS AGE QUEUE
job-6wgnj 6 0s test
job-6wgnj Inqueue 6 0s test
job-6wgnj Running 6 2s test
job-6wgnj Running 6 1 4s test
job-6wgnj Running 6 6 7s test
job-2jmz2 6 0s test
job-2jmz2 Inqueue 6 1s test
job-2jmz2 Inqueue 6 2s test
job-58cjb 6 0s test
job-58cjb Inqueue 6 1s test
job-58cjb Inqueue 6 2s test
job-6wgnj Running 6 6 35s test
job-58cjb Running 6 19s test
job-58cjb Running 6 1 21s test
job-58cjb Running 6 4 24s test
job-58cjb Running 6 6 25s test
job-58cjb Running 6 6 52s test
job-2jmz2 Running 6 54s test
job-2jmz2 Running 6 1 56s test
job-2jmz2 Running 6 5 59s test
job-2jmz2 Running 6 6 60s test
job-2jmz2 Running 6 6 87s test
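The output shows that the higher-priority job in the queue was scheduled first: job-58cjb ran before job-2jmz2 even though it was submitted later.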
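The ordering observed in this experiment can be reproduced with a toy model: sort the waiting jobs by priority (highest first) and break ties by submission order. This is only an illustrative Python sketch. `PendingJob` and `dispatch_order` are hypothetical names, and the code makes no claim about how Volcano's priority handling is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingJob:
    name: str
    priority: int   # PriorityClass value; 0 when no class is set
    submitted: int  # submission order, lower means earlier


def dispatch_order(pending):
    """Order waiting jobs by priority (high first), then by submission order."""
    ranked = sorted(pending, key=lambda job: (-job.priority, job.submitted))
    return [job.name for job in ranked]


# The two jobs that were waiting while job-6wgnj occupied the queue.
pending = [
    PendingJob("job-2jmz2", priority=0, submitted=1),
    PendingJob("job-58cjb", priority=10000, submitted=2),
]

print(dispatch_order(pending))  # ['job-58cjb', 'job-2jmz2']
```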
Backfill experiment
First, submit a job:
$ kubectl create -f pytorchjob_q_test.yaml
pytorchjob.kubeflow.org/job-gtps2 created
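Once it has been scheduled, the queue no longer has enough capacity for another job of the same size. Now create another job: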
$ kubectl create -f pytorchjob_q_test.yaml
pytorchjob.kubeflow.org/job-xjhtm created
This job should remain queued. Then create a job that needs fewer resources:
$ yq 'del(.spec.pytorchReplicaSpecs.Worker)' pytorchjob_q_test.yaml | kubectl create -f -
pytorchjob.kubeflow.org/job-vr5xc created
Note: the number of Workers cannot be less than 1, but the entire Worker field can be removed.
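The effect of dropping the Worker section on the job's size can be sketched on a plain dictionary. This Python snippet only mirrors what the `yq 'del(...)'` command does to the replica specs; `without_worker` and `total_pods` are hypothetical helpers, and the dictionary holds just the replica counts from the manifest. The totals it prints (6 and 1) correspond to the MINMEMBER values shown in the PodGroup output of this experiment.

```python
import copy

# Replica counts taken from the PyTorchJob manifest.
replica_specs = {
    "Master": {"replicas": 1},
    "Worker": {"replicas": 5},
}


def without_worker(specs):
    """Return a copy of the replica specs with the Worker entry removed."""
    trimmed = copy.deepcopy(specs)
    trimmed.pop("Worker", None)
    return trimmed


def total_pods(specs):
    """Number of pods the job would create across all replica types."""
    return sum(spec["replicas"] for spec in specs.values())


print(total_pods(replica_specs))                  # 6
print(total_pods(without_worker(replica_specs)))  # 1
```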
Watch the PyTorchJob status:
$ kubectl get pytorchjob -w
NAME STATE AGE
job-gtps2 0s
job-gtps2 Created 0s
job-gtps2 Created 0s
job-gtps2 Running 2s
job-gtps2 Running 4s
job-gtps2 Running 4s
job-gtps2 Running 4s
job-gtps2 Running 6s
job-xjhtm 0s
job-xjhtm Created 0s
job-xjhtm Created 1s
job-vr5xc 1s
job-vr5xc Created 1s
job-vr5xc Created 1s
job-vr5xc Running 3s
job-gtps2 Succeeded 33s
job-gtps2 Succeeded 33s
job-xjhtm Running 27s
job-xjhtm Running 28s
job-xjhtm Running 28s
job-xjhtm Running 28s
job-xjhtm Running 29s
job-xjhtm Running 30s
job-vr5xc Succeeded 34s
job-xjhtm Succeeded 58s
job-xjhtm Succeeded 58s
Watch the PodGroup status:
$ kubectl get podgroup -owide -w
NAME STATUS MINMEMBER RUNNINGS AGE QUEUE
job-gtps2 6 0s test
job-gtps2 Inqueue 6 0s test
job-gtps2 Running 6 1s test
job-gtps2 Running 6 1 2s test
job-gtps2 Running 6 3 4s test
job-gtps2 Running 6 5 5s test
job-gtps2 Running 6 6 6s test
job-xjhtm 6 0s test
job-xjhtm Inqueue 6 1s test
job-xjhtm Inqueue 6 2s test
job-vr5xc 1 0s test
job-vr5xc Inqueue 1 0s test
job-vr5xc Running 1 1s test
job-vr5xc Running 1 1 2s test
job-xjhtm Inqueue 6 10s test
job-gtps2 Running 6 6 33s test
job-xjhtm Running 6 26s test
job-xjhtm Running 6 1 27s test
job-xjhtm Running 6 5 29s test
job-xjhtm Running 6 6 31s test
job-vr5xc Running 1 1 33s test
job-xjhtm Running 6 6 58s test
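The output shows that a job further back in the queue, whose resource needs could still be met, was scheduled first: job-vr5xc started while job-xjhtm was still waiting.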
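The behaviour seen here can be illustrated with a toy scheduling pass in which a job that does not fit is skipped instead of blocking the jobs behind it. This is a simplified Python sketch under stated assumptions: each pod is treated as one "slot" (one CPU and one GPU), `schedule_pass` is a hypothetical function, and the code models only the observed outcome, not Volcano's internal backfill logic.

```python
def schedule_pass(pending, free_slots):
    """Walk the waiting jobs in order and start every job that fits.

    `pending` is a list of (name, pods) tuples. A job that needs more slots
    than are free is left waiting, and the walk continues with the next job.
    """
    started, still_pending = [], []
    for name, pods in pending:
        if pods <= free_slots:
            free_slots -= pods
            started.append(name)
        else:
            still_pending.append((name, pods))
    return started, still_pending, free_slots


# job-gtps2 already holds 6 of the 8 slots, so 2 remain free.
started, waiting, free = schedule_pass(
    [("job-xjhtm", 6), ("job-vr5xc", 1)], free_slots=2
)

print(started)  # ['job-vr5xc']
print(waiting)  # [('job-xjhtm', 6)]
print(free)     # 1
```

Once the first job finishes and its 6 slots are released, a second pass with 7 free slots would start job-xjhtm, which matches the order seen in the watch output.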
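Conclusions
Kubeflow jobs can specify a Volcano queue and priority.
Within a queue, higher-priority jobs run before lower-priority ones.
Backfill works: when the queue lacks the resources to schedule the next job, Volcano looks further back in the queue for a job that consumes fewer resources and can be scheduled, and schedules it ahead of its turn.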