Kueue 作业实验
准备工作
首先建立一个 LocalQueue 用于作业排队:
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
name: default
spec:
nodeLabels:
node-group: default
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
name: test
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources:
- cpu
- memory
- pods
flavors:
- name: default
resources:
- name: cpu
nominalQuota: 4
- name: memory
nominalQuota: 64Gi
- name: pods
nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
namespace: default
name: test
spec:
clusterQueue: test
Kueue 的作业队列结构比较复杂,包含以下资源:
ResourceFlavor: 通过节点标签可以限制可调度的节点范围,通过
toleration设置可以容忍污点ClusterQueue: 全局作业队列,定义了队列的容量和 ResourceFlavor
LocalQueue: 命名空间局域的队列,映射到一个全局作业队列,是作业提交时实际使用的队列
将以上文本保存为文件 test_lq.yaml, 应用到集群:
$ kubectl apply -f test_lq.yaml
resourceflavor.kueue.x-k8s.io/default created
clusterqueue.kueue.x-k8s.io/test created
localqueue.kueue.x-k8s.io/test created
$ kubectl get rf,cq,lq -owide
NAME AGE
resourceflavor.kueue.x-k8s.io/default 24s
NAME COHORT STRATEGY PENDING WORKLOADS ADMITTED WORKLOADS
clusterqueue.kueue.x-k8s.io/test BestEffortFIFO 0 0
NAME CLUSTERQUEUE PENDING WORKLOADS ADMITTED WORKLOADS
localqueue.kueue.x-k8s.io/test test 0 0
实验一、无队列的作业调度
定义一个 batch/v1 版本的作业:
apiVersion: batch/v1
kind: Job
metadata:
name: sleep-normal
spec:
completions: 3
completionMode: Indexed
parallelism: 3
template:
spec:
restartPolicy: OnFailure
containers:
- image: busybox
imagePullPolicy: IfNotPresent
name: busybox-sleep
command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
resources:
requests:
cpu: "1"
limits:
cpu: "1"
其中规定了并行度为 3, 并且需要 3 个 Pod 完成后 Job 才算完成,而每个 Pod 需要 1 个 CPU. 假设系统中仅有一个 CPU, 那么提交上述作业后:
$ kubectl get job,po
NAME STATUS COMPLETIONS DURATION AGE
job.batch/sleep-normal Running 0/3 15s 15s
NAME READY STATUS RESTARTS AGE
pod/sleep-normal-0-67psd 1/1 Running 0 15s
pod/sleep-normal-1-jq8rr 0/1 Pending 0 15s
pod/sleep-normal-2-pmm58 0/1 Pending 0 15s
可见系统并不保证 3 个 Pod 同时运行,这在很多并行计算的情况是不可接受的。
实验二、使用 Kueue 队列的作业调度
修改实验一中的作业定义:
kind: Job
metadata:
name: sleep-normal
+ labels:
+ kueue.x-k8s.io/queue-name: test
spec:
completions: 3
completionMode: Indexed
增加的标签可以使 Kueue 的调度生效,使用的队列为 test. 提交上述作业后:
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/job-sleep-normal-b1ce9 test test True 21s
NAME STATUS COMPLETIONS DURATION AGE
job.batch/sleep-normal Running 0/3 21s 21s
NAME READY STATUS RESTARTS AGE
pod/sleep-normal-0-tzkmt 0/1 Pending 0 21s
pod/sleep-normal-1-bfjt2 0/1 Pending 0 21s
pod/sleep-normal-2-2k8d8 0/1 Pending 0 21s
NAME CLUSTERQUEUE PENDING WORKLOADS ADMITTED WORKLOADS
localqueue.kueue.x-k8s.io/test test 0 1
可以发现 Kueue 将作业映射为一个 Workload 资源进行管理。因为队列的资源足够,所以作业开始运行,创建了相应的 Pod 并提交调度,但实际并没有调度起来。这是因为 ResourceFlavor 中的节点标签被写在了 Pod 的 nodeSelector 中:
$ kubectl get po -o custom-columns='NAME:.metadata.name,NODE-GROUP:.spec.nodeSelector.node-group'
NAME NODE-GROUP
sleep-normal-0-tzkmt default
sleep-normal-1-bfjt2 default
sleep-normal-2-2k8d8 default
Tip
Kueue 0.16.2 中 ResourceFlavor 定义已支持不设 nodeLabels.
查看其中一个 Pod 的事件:
$ kubectl get event --field-selector involvedObject.name=sleep-normal-0-tzkmt
LAST SEEN TYPE REASON OBJECT MESSAGE
18m Warning FailedScheduling pod/sleep-normal-0-tzkmt 0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) didn't match Pod's node affinity/selector. no new claims to deallocate, preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
可见确实是因为节点选择器不满足。对节点打标签:
$ kubectl label node --all node-group=default
node/las0 labeled
node/las1 labeled
node/las2 labeled
node/las3 labeled
然后再查看状态:
$ kubectl get wl,job,po
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/job-sleep-normal-b1ce9 test test True 16h
NAME STATUS COMPLETIONS DURATION AGE
job.batch/sleep-normal Running 0/3 16h 16h
NAME READY STATUS RESTARTS AGE
pod/sleep-normal-0-tzkmt 1/1 Running 0 16h
pod/sleep-normal-1-bfjt2 1/1 Running 0 16h
pod/sleep-normal-2-2k8d8 1/1 Running 0 16h
可见 Pod 都开始运行了。
实验二、排队优先级
因为 Job 的 spec 并不支持 priorityClassName 属性,所以 Kueue 推出一个 WorkloadPriorityClass 用来设置 Job 的优先级:
apiVersion: kueue.x-k8s.io/v1beta2
kind: WorkloadPriorityClass
metadata:
name: high-priority
value: 10000
description: "Priority 10000"
将上述文本保存为文件 high_workloadpriorityclass.yaml 并应用:
$ kubectl apply -f high_workloadpriorityclass.yaml
workloadpriorityclass.kueue.x-k8s.io/high-priority created
修改实验二中的作业定义:
apiVersion: batch/v1
kind: Job
metadata:
- name: sleep-normal
+ name: sleep-high
labels:
kueue.x-k8s.io/queue-name: test
+ kueue.x-k8s.io/priority-class: high-priority
spec:
completions: 3
completionMode: Indexed
为了阻止作业运行,将 ClusterQueue 的 cpu 配额降低:
$ kubectl edit cq test
修改内容如下:
- name: default
resources:
- name: cpu
- nominalQuota: "4"
+ nominalQuota: "2"
- name: memory
nominalQuota: 64Gi
- name: pods
然后先提交作业 sleep-normal, 再提交作业 sleep-high. 查看状态:
$ kubectl get wl,job,po
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/job-sleep-high-8de71 test 60s
workload.kueue.x-k8s.io/job-sleep-normal-1b4d8 test 69s
NAME STATUS COMPLETIONS DURATION AGE
job.batch/sleep-high Suspended 0/3 60s
job.batch/sleep-normal Suspended 0/3 69s
可见由于队列容量不够,作业都处于 Suspended 状态,没有创建 Pod. 事实上 Workload 是作为整体准入的,有 GANG 调度的特点,但 Kueue 不是调度器。
查看队列的状态也可以发现有作业在排队:
$ kubectl get lq,cq
NAME CLUSTERQUEUE PENDING WORKLOADS ADMITTED WORKLOADS
localqueue.kueue.x-k8s.io/test test 2 0
NAME COHORT PENDING WORKLOADS
clusterqueue.kueue.x-k8s.io/test 2
更进一步,通过 kubectl 的 kueue 插件可以列出作业在队列中的位置:
$ kubectl kueue list wl
NAME JOB TYPE JOB NAME LOCALQUEUE CLUSTERQUEUE STATUS POSITION IN QUEUE EXEC TIME AGE
job-sleep-high-8de71 job sleep-high test test PENDING 0 5h10m
job-sleep-normal-1b4d8 job sleep-normal test test PENDING 1 5h10m
可以发现优先级高的作业排在了前面。现在修改队列的 cpu 为 4, 让其中一个作业运行起来。然后查看状态:
$ kubectl get wl,job,po
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/job-sleep-high-8de71 test test True 5h12m
workload.kueue.x-k8s.io/job-sleep-normal-1b4d8 test 5h12m
NAME STATUS COMPLETIONS DURATION AGE
job.batch/sleep-high Running 0/3 8s 5h12m
job.batch/sleep-normal Suspended 0/3 5h12m
NAME READY STATUS RESTARTS AGE
pod/sleep-high-0-spsfk 1/1 Running 0 8s
pod/sleep-high-1-wjfx9 1/1 Running 0 8s
pod/sleep-high-2-vcxfk 1/1 Running 0 8s
$ kubectl get lq,cq
NAME CLUSTERQUEUE PENDING WORKLOADS ADMITTED WORKLOADS
localqueue.kueue.x-k8s.io/test test 1 1
NAME COHORT PENDING WORKLOADS
clusterqueue.kueue.x-k8s.io/test 1
可以看到排在前面的高优先级作业先运行了。
实验三、不开启抢占
保持配置不变,清空所有作业,然后先提交作业 sleep-normal, 再提交作业 sleep-high, 查看状态:
$ kubectl get job,wl,po
NAME STATUS COMPLETIONS DURATION AGE
job.batch/sleep-high Suspended 0/3 16s
job.batch/sleep-normal Running 0/3 24s 24s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/job-sleep-high-f9455 test 16s
workload.kueue.x-k8s.io/job-sleep-normal-dc237 test test True 24s
NAME READY STATUS RESTARTS AGE
pod/sleep-normal-0-vwdb2 1/1 Running 0 24s
pod/sleep-normal-1-lkf72 1/1 Running 0 24s
pod/sleep-normal-2-dnlkl 1/1 Running 0 24s
$ kubectl kueue list wl
NAME JOB TYPE JOB NAME LOCALQUEUE CLUSTERQUEUE STATUS POSITION IN QUEUE EXEC TIME AGE
job-sleep-high-f9455 job sleep-high test test PENDING 0 4s
job-sleep-normal-dc237 job sleep-normal test test ADMITTED 15s 12s
可见当作业运行以后,即使队列中的作业优先级更高也不能再运行。
实验四、开启抢占
Kueue 的作业抢占情形较为复杂,此处仅研究 ClusterQueue 队列内的抢占,需要正确设置其 spec.preemption 属性:
borrowWithinCohort:
policy: Never
reclaimWithinCohort: Never
- withinClusterQueue: Never
+ withinClusterQueue: LowerPriority
queueingStrategy: BestEffortFIFO
resourceGroups:
- coveredResources:
步骤同实验三。先提交作业 sleep-normal:
$ kubectl apply -f sleep_job_kueue.yaml
job.batch/sleep-normal created
$ kubectl get job
NAME STATUS COMPLETIONS DURATION AGE
sleep-normal Running 0/3 3s 3s
观察状态确保作业已运行。然后提交作业 sleep-high:
$ kubectl apply -f sleep_job_kueue_high.yaml
job.batch/sleep-high created
$ kubectl get job,wl,po
NAME STATUS COMPLETIONS DURATION AGE
job.batch/sleep-high Running 0/3 6s 6s
job.batch/sleep-normal Suspended 0/3 18s
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
workload.kueue.x-k8s.io/job-sleep-high-8fef6 test test True 6s
workload.kueue.x-k8s.io/job-sleep-normal-7c41c test False 18s
NAME READY STATUS RESTARTS AGE
pod/sleep-high-0-dvtjd 1/1 Running 0 6s
pod/sleep-high-1-wp7kg 1/1 Running 0 6s
pod/sleep-high-2-2d4td 1/1 Running 0 6s
$ kubectl get lq,cq
NAME CLUSTERQUEUE PENDING WORKLOADS ADMITTED WORKLOADS
localqueue.kueue.x-k8s.io/test test 1 1
NAME COHORT PENDING WORKLOADS
clusterqueue.kueue.x-k8s.io/test 1
$ kubectl kueue list wl
NAME JOB TYPE JOB NAME LOCALQUEUE CLUSTERQUEUE STATUS POSITION IN QUEUE EXEC TIME AGE
job-sleep-high-8fef6 job sleep-high test test ADMITTED 26s 23s
job-sleep-normal-7c41c job sleep-normal test test PENDING 0 35s
可以看到已经运行的默认优先级作业被挂起,其所有 Pod 被杀死,高优先级作业抢先运行。
实验五、Backfill
ClusterQueue 的默认调度策略为 BestEffortFIFO, 如果排在前面的作业不能调度,允许后面满足资源额度限制的作业先调度,即 backfill.
将以上作业的并行度修改为 5:
apiVersion: batch/v1
kind: Job
metadata:
- name: sleep-normal
+ name: sleep-big
labels:
kueue.x-k8s.io/queue-name: test
spec:
- completions: 3
+ completions: 5
completionMode: Indexed
- parallelism: 3
+ parallelism: 5
template:
spec:
restartPolicy: OnFailure
然后提交。因为队列只有 4 个 CPU, 所以作业被挂起:
$ kubectl get job
NAME STATUS COMPLETIONS DURATION AGE
sleep-big Suspended 0/5 6s
$ kubectl get cq -owide
NAME COHORT STRATEGY PENDING WORKLOADS ADMITTED WORKLOADS
test BestEffortFIFO 1
现在提交原来并行度为 3 的作业。查看状态:
NAME STATUS COMPLETIONS DURATION AGE
$ kubectl get job
sleep-big Suspended 0/5 9m20s
sleep-normal Running 0/3 10s 10s
$ kubectl get cq -owide
NAME COHORT STRATEGY PENDING WORKLOADS ADMITTED WORKLOADS
test BestEffortFIFO 1 1
可见后提交的作业确实先提交运行了。