Borrowing and Reclaiming Resources Between Kueue Queues in a Cohort

Preparation

Create a ResourceFlavor, a Cohort, and two queues (each a ClusterQueue with a matching LocalQueue):

apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: default
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: Cohort
metadata:
  name: "test"
spec:
  resourceGroups:
    - coveredResources: ["cpu"]
      flavors:
        - name: default
          resources:
            - name: cpu
              nominalQuota: 2
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: test1
spec:
  namespaceSelector: {} # match all.
  cohortName: test
  preemption:
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
    withinClusterQueue: Never
  resourceGroups:
    - coveredResources: ["cpu"]
      flavors:
        - name: default
          resources:
            - name: cpu
              nominalQuota: 1
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: test2
spec:
  namespaceSelector: {} # match all.
  cohortName: test
  preemption:
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
    withinClusterQueue: Never
  resourceGroups:
    - coveredResources: ["cpu"]
      flavors:
        - name: default
          resources:
            - name: cpu
              nominalQuota: 1
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: default
  name: test1
spec:
  clusterQueue: test1
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: default
  name: test2
spec:
  clusterQueue: test2

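The quota defined on the Cohort itself is in addition to the quotas of its member queues, so the quota within the Cohort is allocated as shown in the following chart: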

pie showData title Cohort quota allocation
    "Queue test1": 1
    "Queue test2": 1
    "Extra": 2
    
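
The arithmetic behind the chart can be checked with a short sketch. It only adds up the `nominalQuota` values from the manifests above and assumes that no `borrowingLimit` or `lendingLimit` is set (none appears in the manifests). The names `cohort_own_quota`, `queue_quotas`, and `max_borrowable` are made up for illustration and are not part of Kueue.

```python
# Nominal CPU quotas copied from the manifests above.
cohort_own_quota = 2                      # nominalQuota defined on the Cohort itself
queue_quotas = {"test1": 1, "test2": 1}   # nominalQuota of each ClusterQueue

# Total CPU capacity inside the Cohort: the Cohort's own quota plus every member queue.
cohort_total = cohort_own_quota + sum(queue_quotas.values())


def max_borrowable(queue: str) -> int:
    """CPUs a queue could use beyond its own nominal quota while all other queues are idle.

    Illustrative helper; assumes no borrowingLimit/lendingLimit is configured.
    """
    return cohort_total - queue_quotas[queue]


print(cohort_total)             # 4
print(max_borrowable("test1"))  # 3
```

Under these assumptions, a single queue can run up to 4 CPUs of work while the other queue is idle, which is what the 4-way parallel Job in Experiment 1 relies on.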

Experiment 1: Borrowing and reclaiming quota

Create Job test1:

apiVersion: batch/v1
kind: Job
metadata:
  name: test1
  labels:
    kueue.x-k8s.io/queue-name: test1
spec:
  completions: 4
  completionMode: Indexed
  parallelism: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - image: busybox
          imagePullPolicy: IfNotPresent
          name: busybox-sleep
          command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
          resources:
            requests:
              cpu: "1"
            limits:
              cpu: "1"

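Job test1 requests 4 CPUs in total (4 parallel pods with 1 CPU each), so it uses the quota of the entire Cohort, including the quota of queue test2.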

Then create Job test2:

apiVersion: batch/v1
kind: Job
metadata:
  name: test2
  labels:
    kueue.x-k8s.io/queue-name: test2
spec:
  completions: 1
  completionMode: Indexed
  parallelism: 1
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - image: busybox
          imagePullPolicy: IfNotPresent
          name: busybox-sleep
          command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
          resources:
            requests:
              cpu: "1"
            limits:
              cpu: "1"

Watch the Job status changes:

$ kubectl get job -w
NAME    STATUS    COMPLETIONS   DURATION   AGE
test1   Running   0/4                      0s
test1   Suspended   0/4                      0s
test1   Suspended   0/4                      0s
test1   Running     0/4           0s         0s
test1   Running     0/4           2s         2s
test2   Running     0/1                      0s
test2   Suspended   0/1                      0s
test1   Running     0/4           9s         9s
test1   Running     0/4                      9s
test1   Running     0/4                      9s
test1   Suspended   0/4                      9s
test2   Suspended   0/1                      0s
test2   Running     0/1           0s         0s
test1   Suspended   0/4                      10s
test2   Running     0/1           2s         2s
test2   Running     0/1           33s        33s
test2   SuccessCriteriaMet   1/1           34s        34s
test2   Complete             1/1           34s        34s
test1   Suspended            0/4                      43s
test1   Running              0/4           0s         43s
test1   Running              0/4           2s         45s
test1   Running              0/4           32s        75s
test1   SuccessCriteriaMet   4/4           33s        76s
test1   Complete             4/4           33s        76s

Watch the Workload status changes:

$ kubectl get wl -o wide -w
NAME              QUEUE   RESERVED IN   ADMITTED   FINISHED   AGE
job-test1-f14e1   test1                                       0s
job-test1-f14e1   test1                                       0s
job-test1-f14e1   test1   test1         True                  0s
job-test1-f14e1   test1   test1         True                  2s
job-test2-f2017   test2                                       0s
job-test1-f14e1   test1   test1         True                  9s
job-test2-f2017   test2                                       0s
job-test2-f2017   test2                                       0s
job-test1-f14e1   test1   test1         True                  9s
job-test1-f14e1   test1                 False                 9s
job-test1-f14e1   test1                 False                 9s
job-test2-f2017   test2   test2         True                  0s
job-test1-f14e1   test1                 False                 9s
job-test1-f14e1   test1                 False                 9s
job-test2-f2017   test2   test2         True                  2s
job-test2-f2017   test2   test2         True                  33s
job-test2-f2017   test2   test2         True                  34s
job-test2-f2017   test2                 False      True       34s
job-test2-f2017   test2                 False      True       34s
job-test1-f14e1   test1   test1         True                  43s
job-test1-f14e1   test1   test1         True                  45s
job-test1-f14e1   test1   test1         True                  75s
job-test1-f14e1   test1   test1         True                  76s
job-test1-f14e1   test1                 False      True       76s
job-test1-f14e1   test1                 False      True       76s

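As the output shows, once Job test2 is submitted, Job test1 is preempted: it is suspended so that queue test2 can reclaim its quota, and it is admitted again only after test2 completes.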
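
The state right before the preemption can be modeled with a few lines of Python. This is a simplified sketch of the condition described for `reclaimWithinCohort: Any` (a pending Workload that fits within its queue's nominal quota may preempt Workloads that are borrowing), not Kueue's actual scheduling code. The `Queue` class and `fits_nominal` function are hypothetical names used only here.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    """Toy model of a ClusterQueue; not a Kueue type."""
    nominal: int    # nominalQuota in CPUs
    usage: int = 0  # CPUs held by admitted workloads

    @property
    def borrowed(self) -> int:
        # CPUs in use beyond the queue's own nominal quota.
        return max(0, self.usage - self.nominal)


def fits_nominal(queue: Queue, request: int) -> bool:
    """True when the pending request fits within the queue's own nominal quota."""
    return queue.usage + request <= queue.nominal


# State in Experiment 1 right before Job test2 is submitted.
test1 = Queue(nominal=1, usage=4)  # Job test1 runs 4 pods with 1 CPU each
test2 = Queue(nominal=1, usage=0)

print(test1.borrowed)          # 3
print(fits_nominal(test2, 1))  # True
```

Queue test1 is borrowing 3 CPUs, and the 1-CPU request of Job test2 fits within the nominal quota of queue test2, so the reclaim is allowed. The logs above show that the Workload of test1 is evicted as a whole rather than shrinking by one pod.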

Experiment 2: Only the queue's own quota can be reclaimed

Now change the parallelism of Job test2 to 2, which exceeds the nominal quota of queue test2:

   labels:
     kueue.x-k8s.io/queue-name: test2
 spec:
-  completions: 1
+  completions: 2
   completionMode: Indexed
-  parallelism: 1
+  parallelism: 2
   template:
     spec:
       restartPolicy: OnFailure

The steps are the same as in Experiment 1. Watch the Job status changes:

$ kubectl get job -w
NAME    STATUS    COMPLETIONS   DURATION   AGE
test1   Running   0/4                      0s
test1   Suspended   0/4                      0s
test1   Suspended   0/4                      0s
test1   Running     0/4           0s         0s
test1   Running     0/4           2s         2s
test2   Running     0/2                      0s
test2   Suspended   0/2                      0s
test1   Running     0/4           32s        32s
test1   SuccessCriteriaMet   4/4           33s        33s
test1   Complete             4/4           33s        33s
test2   Suspended            0/2                      26s
test2   Running              0/2           0s         26s
test2   Running              0/2           2s         28s
test2   Running              0/2           32s        58s
test2   SuccessCriteriaMet   2/2           33s        59s
test2   Complete             2/2           33s        59s

Watch the Workload status changes:

$ kubectl get wl -o wide -w
NAME              QUEUE   RESERVED IN   ADMITTED   FINISHED   AGE
job-test1-647d2   test1                                       0s
job-test1-647d2   test1                                       0s
job-test1-647d2   test1   test1         True                  0s
job-test1-647d2   test1   test1         True                  2s
job-test2-05e1b   test2                                       0s
job-test2-05e1b   test2                                       0s
job-test2-05e1b   test2                                       0s
job-test1-647d2   test1   test1         True                  32s
job-test1-647d2   test1   test1         True                  33s
job-test2-05e1b   test2                                       26s
job-test1-647d2   test1   test1         True                  33s
job-test2-05e1b   test2   test2         True                  26s
job-test1-647d2   test1                 False      True       33s
job-test1-647d2   test1                 False      True       33s
job-test2-05e1b   test2   test2         True                  28s
job-test2-05e1b   test2   test2         True                  58s
job-test2-05e1b   test2   test2         True                  59s
job-test2-05e1b   test2                 False      True       59s
job-test2-05e1b   test2                 False      True       59s

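As the output shows, Job test2 stays suspended because the quota of its own queue is insufficient. Once Job test1 completes, test2 can borrow resources within the Cohort and run.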

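This outcome is a result of borrowWithinCohort being disabled (policy: Never). If it is enabled, higher-priority Jobs can preempt lower-priority Jobs in other queues.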
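
The difference between the two experiments can be summarized as a small decision function. This is a rough sketch of the documented policy semantics, assuming `reclaimWithinCohort: Any` as in the manifests and treating `LowerPriority` as the value that enables `borrowWithinCohort`. It ignores details such as `maxPriorityThreshold`, and the function name and parameters are hypothetical, not Kueue APIs.

```python
def may_preempt_in_cohort(
    request: int,
    nominal: int,
    usage: int,
    borrow_policy: str = "Never",
    pending_priority: int = 0,
    victim_priority: int = 0,
) -> bool:
    """Toy decision: may a pending workload preempt a borrowing workload in another queue?

    Assumes reclaimWithinCohort: Any. Not Kueue's real algorithm.
    """
    needs_borrowing = usage + request > nominal
    if not needs_borrowing:
        # The request fits within nominal quota, so this is a plain reclaim.
        return True
    if borrow_policy == "Never":
        # A workload that itself needs to borrow never preempts across queues.
        return False
    if borrow_policy == "LowerPriority":
        # Preemption while borrowing is limited to lower-priority victims.
        return victim_priority < pending_priority
    raise ValueError(f"unknown borrowWithinCohort policy: {borrow_policy}")


print(may_preempt_in_cohort(request=1, nominal=1, usage=0))  # Experiment 1: True
print(may_preempt_in_cohort(request=2, nominal=1, usage=0))  # Experiment 2: False
print(may_preempt_in_cohort(2, 1, 0, "LowerPriority", pending_priority=10))  # True
```

In this model, the two Jobs from the experiments have equal priority, so switching the policy to `LowerPriority` alone would still return `False` for Experiment 2; the pending Job would also need a higher priority than the one it displaces.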