Job with Kueue

Install Kueue first; see https://kueue.sigs.k8s.io/. With Helm:

$ helm pull oci://registry.k8s.io/kueue/charts/kueue --version=0.11.4
Pulled: registry.k8s.io/kueue/charts/kueue:0.11.4
Digest: sha256:0c58ed6e88716c90da94dce0351694b8788552421c63f0c30739ed5bc8bb659c
$ helm install kueue kueue-0.11.4.tgz --namespace kueue-system --create-namespace
NAME: kueue
LAST DEPLOYED: Sun Apr 27 03:49:12 2025
NAMESPACE: kueue-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
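
Before creating any Kueue objects, it can help to wait until the controller is available, since applying them while its webhooks are still starting can fail. A minimal sketch, assuming the chart created a Deployment named kueue-controller-manager (verify with kubectl -n kueue-system get deploy if the release was named differently); wait_for_kueue is a hypothetical helper name:

```shell
# Hypothetical helper: block until the Kueue controller Deployment is Available.
# Assumes the Deployment is named kueue-controller-manager in kueue-system.
wait_for_kueue() {
  # Optional first argument: timeout passed to kubectl wait (defaults to 5m).
  kubectl wait deploy/kueue-controller-manager \
    --namespace kueue-system \
    --for=condition=Available \
    --timeout="${1:-5m}"
}
```

Call wait_for_kueue (or, for example, wait_for_kueue 2m) once helm install has returned.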

Create the file topology_rf_cq_lq.yaml with the Kueue objects (Topology, ResourceFlavor, ClusterQueue, and LocalQueue):

apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: default
spec:
  levels:
    - nodeLabel: topology-block
    - nodeLabel: topology-rack
    - nodeLabel: kubernetes.io/hostname
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default
spec:
  nodeLabels:
    node-group: default
  topologyName: default
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: test
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
    - coveredResources:
        - cpu
        - memory
        - pods
      flavors:
        - name: default
          resources:
            - name: cpu
              nominalQuota: 8
            - name: memory
              nominalQuota: 64Gi
            - name: pods
              nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: test
spec:
  clusterQueue: test

Note

For a ResourceFlavor, at least one entry in nodeLabels is required if topologyName is set. The nodes must therefore carry that label before any job runs, so that pods are scheduled onto the expected nodes:

$ kubectl label node --all node-group=default
node/las1 labeled
node/las2 labeled
node/las0 labeled
node/las3 labeled
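
The Topology above also references the node labels topology-block and topology-rack, which this walkthrough does not set explicitly. To see which of these labels the nodes actually carry, they can be shown as extra columns. A small sketch; show_topology_labels is a hypothetical helper name:

```shell
# Hypothetical helper: list nodes with the labels used by the ResourceFlavor
# (node-group) and the Topology (topology-block, topology-rack) as columns.
show_topology_labels() {
  kubectl get nodes \
    --label-columns node-group,topology-block,topology-rack
}
```

An empty column for a node means that node does not have the label.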

Apply to the cluster:

$ kubectl apply -f topology_rf_cq_lq.yaml
topology.kueue.x-k8s.io/default created
resourceflavor.kueue.x-k8s.io/default created
clusterqueue.kueue.x-k8s.io/test created
localqueue.kueue.x-k8s.io/test created

See what was created:

$ kubectl get topology,rf,cq,lq -owide
NAME                              AGE
topology.kueue.x-k8s.io/default   2m20s

NAME                                    AGE
resourceflavor.kueue.x-k8s.io/default   2m20s

NAME                               COHORT   STRATEGY         PENDING WORKLOADS   ADMITTED WORKLOADS
clusterqueue.kueue.x-k8s.io/test            BestEffortFIFO   0                   0

NAME                             CLUSTERQUEUE   PENDING WORKLOADS   ADMITTED WORKLOADS
localqueue.kueue.x-k8s.io/test   test           0                   0

Now create the file sleep_job_kueue.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: sleep-
  labels:
    kueue.x-k8s.io/queue-name: test
spec:
  completions: 3
  completionMode: Indexed
  parallelism: 3
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-required-topology: kubernetes.io/hostname
    spec:
      restartPolicy: OnFailure
      containers:
        - image: busybox:1.37.0-glibc
          imagePullPolicy: IfNotPresent
          name: sleep-busybox
          command: ["sh", "-c", "trap exit INT TERM; sleep 1m & wait"]
          resources:
            requests:
              cpu: "1"
              memory: 100Mi
            limits:
              cpu: "1"
              memory: 100Mi

Create the job twice. Because the manifest uses generateName, use kubectl create rather than kubectl apply:

$ kubectl create -f sleep_job_kueue.yaml
job.batch/sleep-85sck created
$ kubectl create -f sleep_job_kueue.yaml
job.batch/sleep-mrbcc created
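
Kueue holds a job back through the Job's spec.suspend field and tracks each job as a Workload object. A small sketch for checking both; job_suspended and list_workloads are hypothetical helper names, and the namespace defaults to default, matching the LocalQueue above:

```shell
# Hypothetical helpers for inspecting Kueue admission state.

# Print the Job's spec.suspend value ("true" while the job is held back).
# Usage: job_suspended JOB_NAME [NAMESPACE]
job_suspended() {
  kubectl get job "$1" --namespace "${2:-default}" \
    --output jsonpath='{.spec.suspend}'
}

# List the Workload objects Kueue created for the jobs in a namespace.
# Usage: list_workloads [NAMESPACE]
list_workloads() {
  kubectl get workloads.kueue.x-k8s.io --namespace "${1:-default}"
}
```

For example, job_suspended sleep-mrbcc is expected to print true for as long as that job is waiting for quota.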

Watch the jobs; the second one stays Suspended until Kueue admits it:

$ kubectl get job -owide -w
NAME          STATUS      COMPLETIONS   DURATION   AGE   CONTAINERS      IMAGES                 SELECTOR
sleep-85sck   Running     0/3           0s         0s    sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=2eb95f91-11df-43a0-9285-b180904f7030
sleep-mrbcc   Suspended   0/3                      0s    sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=49f34454-601a-4a40-a9bb-dcfd457b0bb0
sleep-85sck   Running     0/3           1s         1s    sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=2eb95f91-11df-43a0-9285-b180904f7030
sleep-85sck   Running     0/3           2s         2s    sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=2eb95f91-11df-43a0-9285-b180904f7030
sleep-85sck   Running     0/3           62s        62s   sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=2eb95f91-11df-43a0-9285-b180904f7030
sleep-85sck   Running     2/3           63s        63s   sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=2eb95f91-11df-43a0-9285-b180904f7030
sleep-mrbcc   Suspended   0/3                      63s   sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=49f34454-601a-4a40-a9bb-dcfd457b0bb0
sleep-mrbcc   Running     0/3           0s         63s   sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=49f34454-601a-4a40-a9bb-dcfd457b0bb0
sleep-85sck   Running     3/3           64s        64s   sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=2eb95f91-11df-43a0-9285-b180904f7030
sleep-85sck   Complete    3/3           64s        64s   sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=2eb95f91-11df-43a0-9285-b180904f7030
sleep-mrbcc   Running     0/3           1s         64s   sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=49f34454-601a-4a40-a9bb-dcfd457b0bb0
sleep-mrbcc   Running     0/3           2s         65s   sleep-busybox   busybox:1.37.0-glibc   batch.kubernetes.io/controller-uid=49f34454-601a-4a40-a9bb-dcfd457b0bb0
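
The watch output above is consistent with the pods quota being the limiting resource: each job runs 3 pods and the ClusterQueue allows 4, while CPU (3 of 8 per job) and memory stay well below their quotas. The arithmetic can be sketched with the numbers from the manifests above. That Kueue releases quota for pods that have already completed is an assumption, inferred from the second job starting once the first reached 2/3:

```shell
pods_quota=4     # nominalQuota for pods in ClusterQueue "test"
pods_per_job=3   # parallelism of each sleep job

# Both jobs running in full would need more pods than the quota allows.
both=$((pods_per_job * 2))
echo "two full jobs need $both pods, quota is $pods_quota"

# Once 2 of the first job's 3 pods have completed, 1 pod is still in use,
# so the second job's 3 pods fit within the quota.
remaining=$((pods_per_job - 2))
total=$((remaining + pods_per_job))
echo "rest of first job plus second job need $total pods, quota is $pods_quota"
```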