Kubeflow Training Operator V1

Install

$ kubectl apply --server-side -k "github.com/kubeflow/trainer.git/manifests/overlays/standalone?ref=v1.9.3"
namespace/kubeflow serverside-applied
customresourcedefinition.apiextensions.k8s.io/jaxjobs.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/paddlejobs.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/pytorchjobs.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.kubeflow.org serverside-applied
serviceaccount/training-operator serverside-applied
clusterrole.rbac.authorization.k8s.io/training-operator serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/training-operator serverside-applied
secret/training-operator-webhook-cert serverside-applied
service/training-operator serverside-applied
deployment.apps/training-operator serverside-applied
validatingwebhookconfiguration.admissionregistration.k8s.io/validator.training-operator.kubeflow.org serverside-applied

Show deployed workloads:

$ kubectl get all -n kubeflow
NAME                                     READY   STATUS    RESTARTS   AGE
pod/training-operator-6577bb88bf-tgqz9   1/1     Running   0          73s

NAME                        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)            AGE
service/training-operator   ClusterIP   10.100.99.45   <none>        8080/TCP,443/TCP   73s

NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/training-operator   1/1     1            1           73s

NAME                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/training-operator-6577bb88bf   1         1         1       73s

Show installed CRDs:

$ kubectl get crd | grep kubeflow
jaxjobs.kubeflow.org                                  2025-10-13T08:27:15Z
mpijobs.kubeflow.org                                  2025-10-13T08:27:15Z
paddlejobs.kubeflow.org                               2025-10-13T08:27:16Z
pytorchjobs.kubeflow.org                              2025-10-13T08:27:16Z
tfjobs.kubeflow.org                                   2025-10-13T08:27:16Z
xgboostjobs.kubeflow.org                              2025-10-13T08:27:16Z
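
Listing the CRDs shows that they exist, but a script may also want to block until the API server reports each one as usable. The following is a minimal sketch, assuming the six CRD names listed above; the helper name `wait_for_crds` and the default timeout are illustrative and not part of the operator.

```shell
# The CRD name prefixes shown by `kubectl get crd | grep kubeflow` above.
CRDS="jaxjobs mpijobs paddlejobs pytorchjobs tfjobs xgboostjobs"

# wait_for_crds [TIMEOUT]
# Waits for every Kubeflow CRD to reach the Established condition.
# Stops at the first CRD that is not ready in time (hypothetical helper).
wait_for_crds() {
  for name in $CRDS; do
    kubectl wait --for=condition=Established \
      "crd/${name}.kubeflow.org" --timeout="${1:-60s}" || return 1
  done
}
```

Call it as `wait_for_crds` or, with a longer timeout, `wait_for_crds 120s`.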

Testing

Create the file pytorchjob.yaml with the following content:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytest
  namespace: default
spec:
  runPolicy:
    backoffLimit: 3
  pytorchReplicaSpecs:
    Master:
      replicas: 1 # must be 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: busybox:1.37.0-glibc
              imagePullPolicy: IfNotPresent
              name: pytorch # must be `pytorch`
              command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
              resources:
                limits:
                  cpu: "1"
                  memory: "100Mi"
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: busybox:1.37.0-glibc
              imagePullPolicy: IfNotPresent
              name: pytorch # must be `pytorch`
              command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
              resources:
                limits:
                  cpu: "1"
                  memory: "100Mi"

Apply to the cluster:

$ kubectl apply -f pytorchjob.yaml
pytorchjob.kubeflow.org/pytest created

Watch the state of the job and its pods:

$ kubectl get pytorchjob -owide -w
NAME     STATE   AGE
pytest           0s
pytest   Created   0s
pytest   Running   1s
pytest   Running   5s
pytest   Running   5s
pytest   Running   5s
pytest   Succeeded   33s
pytest   Succeeded   33s
$ kubectl get po -owide -w
NAME              READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
pytest-master-0   0/1     Pending   0          0s    <none>   <none>   <none>           <none>
pytest-master-0   0/1     Pending   0          0s    <none>   las3     <none>           <none>
pytest-worker-0   0/1     Pending   0          0s    <none>   <none>   <none>           <none>
pytest-master-0   0/1     ContainerCreating   0          0s    <none>   las3     <none>           <none>
pytest-worker-0   0/1     Pending             0          0s    <none>   las2     <none>           <none>
pytest-worker-1   0/1     Pending             0          0s    <none>   <none>   <none>           <none>
pytest-worker-1   0/1     Pending             0          0s    <none>   las1     <none>           <none>
pytest-worker-2   0/1     Pending             0          0s    <none>   <none>   <none>           <none>
pytest-worker-2   0/1     Pending             0          0s    <none>   las3     <none>           <none>
pytest-worker-0   0/1     Init:0/1            0          0s    <none>   las2     <none>           <none>
pytest-worker-1   0/1     Init:0/1            0          0s    <none>   las1     <none>           <none>
pytest-worker-2   0/1     Init:0/1            0          0s    <none>   las3     <none>           <none>
pytest-master-0   0/1     ContainerCreating   0          1s    <none>   las3     <none>           <none>
pytest-worker-2   0/1     Init:0/1            0          1s    <none>   las3     <none>           <none>
pytest-worker-0   0/1     Init:0/1            0          1s    <none>   las2     <none>           <none>
pytest-worker-1   0/1     Init:0/1            0          1s    <none>   las1     <none>           <none>
pytest-worker-1   0/1     Init:0/1            0          1s    192.168.221.177   las1     <none>           <none>
pytest-worker-2   0/1     Init:0/1            0          1s    192.168.185.45    las3     <none>           <none>
pytest-master-0   1/1     Running             0          1s    192.168.185.20    las3     <none>           <none>
pytest-worker-0   0/1     Init:0/1            0          2s    192.168.67.160    las2     <none>           <none>
pytest-worker-0   0/1     PodInitializing     0          4s    192.168.67.160    las2     <none>           <none>
pytest-worker-1   0/1     PodInitializing     0          4s    192.168.221.177   las1     <none>           <none>
pytest-worker-2   0/1     PodInitializing     0          4s    192.168.185.45    las3     <none>           <none>
pytest-worker-0   1/1     Running             0          5s    192.168.67.160    las2     <none>           <none>
pytest-worker-1   1/1     Running             0          5s    192.168.221.177   las1     <none>           <none>
pytest-worker-2   1/1     Running             0          5s    192.168.185.45    las3     <none>           <none>
pytest-master-0   0/1     Completed           0          31s   192.168.185.20    las3     <none>           <none>
pytest-master-0   0/1     Completed           0          33s   192.168.185.20    las3     <none>           <none>
pytest-master-0   0/1     Completed           0          33s   192.168.185.20    las3     <none>           <none>
pytest-worker-0   0/1     Completed           0          35s   192.168.67.160    las2     <none>           <none>
pytest-worker-1   0/1     Completed           0          35s   192.168.221.177   las1     <none>           <none>
pytest-worker-2   0/1     Completed           0          35s   192.168.185.45    las3     <none>           <none>
pytest-worker-0   0/1     Completed           0          36s   192.168.67.160    las2     <none>           <none>
pytest-worker-0   0/1     Completed           0          36s   192.168.67.160    las2     <none>           <none>
pytest-worker-1   0/1     Completed           0          36s   192.168.221.177   las1     <none>           <none>
pytest-worker-2   0/1     Completed           0          37s   192.168.185.45    las3     <none>           <none>
pytest-worker-1   0/1     Completed           0          37s   192.168.221.177   las1     <none>           <none>
pytest-worker-2   0/1     Completed           0          37s   192.168.185.45    las3     <none>           <none>
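
Watching the output by eye works for a manual test, but a script needs a blocking check. The following is a minimal sketch, assuming the PyTorchJob exposes a `Succeeded` status condition (consistent with the STATE column above) and that the job is `pytest` in the `default` namespace as in pytorchjob.yaml. The helper name `wait_for_job` and the default timeout are illustrative.

```shell
# wait_for_job NAME [NAMESPACE] [TIMEOUT]
# Blocks until the PyTorchJob reports the Succeeded condition,
# or returns non-zero when the timeout expires (hypothetical helper).
wait_for_job() {
  kubectl wait --for=condition=Succeeded "pytorchjob/$1" \
    -n "${2:-default}" --timeout="${3:-120s}"
}
```

For the job above, `wait_for_job pytest` should return after roughly 30 seconds, since every container only sleeps for 30s.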

Using the Volcano scheduler

With Volcano already installed in the cluster, add an argument to the training operator's command:

$ kubectl edit deployment.apps/training-operator -n kubeflow
deployment.apps/training-operator edited

The line added:

       containers:
       - command:
         - /manager
+        - --gang-scheduler-name=volcano
 ...
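
The same change can be made without an interactive editor. The following is a minimal sketch using a JSON Patch, assuming the operator is the first container in the pod template and already has a `command` array, as the diff above suggests. The variable and function names are illustrative.

```shell
# JSON Patch that appends the flag to the end of the first container's
# command list. Assumption: the operator is container 0.
GANG_PATCH='[{"op":"add","path":"/spec/template/spec/containers/0/command/-","value":"--gang-scheduler-name=volcano"}]'

# Applies the patch to the training-operator deployment (hypothetical helper).
enable_volcano_gang_scheduling() {
  kubectl patch deployment training-operator -n kubeflow \
    --type=json -p "$GANG_PATCH"
}
```

Run `enable_volcano_gang_scheduling` once. Running it a second time would append the flag again, so check the deployment first if unsure.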

After the change, the operator pod is restarted automatically. Then delete the previous job, submit it again, and watch the state of the job, the PodGroup, and the pods:

$ kubectl get pytorchjob -owide -w
NAME     STATE   AGE
pytest           0s
pytest   Created   0s
pytest   Created   1s
pytest   Running   3s
pytest   Running   6s
pytest   Running   7s
pytest   Running   7s
pytest   Succeeded   35s
pytest   Succeeded   35s
$ kubectl get podgroup -owide -w
NAME     STATUS   MINMEMBER   RUNNINGS   AGE   QUEUE
pytest            4                      0s    default
pytest   Inqueue   4                      0s    default
pytest   Running   4                      1s    default
pytest   Running   4           1          3s    default
pytest   Running   4           2          6s    default
pytest   Running   4           4          7s    default
pytest   Running   4           4          35s   default
$ kubectl get po -owide -w
NAME              READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
pytest-master-0   0/1     Pending   0          0s    <none>   <none>   <none>           <none>
pytest-worker-0   0/1     Pending   0          0s    <none>   <none>   <none>           <none>
pytest-worker-1   0/1     Pending   0          0s    <none>   <none>   <none>           <none>
pytest-worker-2   0/1     Pending   0          0s    <none>   <none>   <none>           <none>
pytest-worker-0   0/1     Pending   0          1s    <none>   las1     <none>           <none>
pytest-worker-2   0/1     Pending   0          1s    <none>   las1     <none>           <none>
pytest-master-0   0/1     Pending   0          1s    <none>   las1     <none>           <none>
pytest-worker-1   0/1     Pending   0          1s    <none>   las1     <none>           <none>
pytest-worker-0   0/1     Init:0/1   0          1s    <none>   las1     <none>           <none>
pytest-master-0   0/1     ContainerCreating   0          1s    <none>   las1     <none>           <none>
pytest-worker-1   0/1     Init:0/1            0          1s    <none>   las1     <none>           <none>
pytest-worker-2   0/1     Init:0/1            0          1s    <none>   las1     <none>           <none>
pytest-worker-0   0/1     Init:0/1            0          2s    <none>   las1     <none>           <none>
pytest-worker-1   0/1     Init:0/1            0          2s    <none>   las1     <none>           <none>
pytest-worker-2   0/1     Init:0/1            0          2s    <none>   las1     <none>           <none>
pytest-master-0   0/1     ContainerCreating   0          2s    <none>   las1     <none>           <none>
pytest-worker-0   0/1     Init:0/1            0          2s    192.168.221.145   las1     <none>           <none>
pytest-worker-1   0/1     Init:0/1            0          2s    192.168.221.150   las1     <none>           <none>
pytest-worker-2   0/1     Init:0/1            0          3s    192.168.221.144   las1     <none>           <none>
pytest-master-0   1/1     Running             0          3s    192.168.221.188   las1     <none>           <none>
pytest-worker-2   0/1     PodInitializing     0          5s    192.168.221.144   las1     <none>           <none>
pytest-worker-0   0/1     PodInitializing     0          5s    192.168.221.145   las1     <none>           <none>
pytest-worker-1   0/1     PodInitializing     0          5s    192.168.221.150   las1     <none>           <none>
pytest-worker-0   1/1     Running             0          6s    192.168.221.145   las1     <none>           <none>
pytest-worker-2   1/1     Running             0          6s    192.168.221.144   las1     <none>           <none>
pytest-worker-1   1/1     Running             0          6s    192.168.221.150   las1     <none>           <none>
pytest-master-0   0/1     Completed           0          34s   192.168.221.188   las1     <none>           <none>
pytest-master-0   0/1     Completed           0          35s   192.168.221.188   las1     <none>           <none>
pytest-master-0   0/1     Completed           0          35s   192.168.221.188   las1     <none>           <none>
pytest-worker-0   0/1     Completed           0          37s   192.168.221.145   las1     <none>           <none>
pytest-worker-1   0/1     Completed           0          37s   192.168.221.150   las1     <none>           <none>
pytest-worker-2   0/1     Completed           0          37s   192.168.221.144   las1     <none>           <none>
pytest-worker-1   0/1     Completed           0          38s   192.168.221.150   las1     <none>           <none>
pytest-worker-0   0/1     Completed           0          38s   192.168.221.145   las1     <none>           <none>
pytest-worker-2   0/1     Completed           0          38s   192.168.221.144   las1     <none>           <none>
pytest-worker-1   0/1     Completed           0          38s   192.168.221.150   las1     <none>           <none>
pytest-worker-0   0/1     Completed           0          38s   192.168.221.145   las1     <none>           <none>
pytest-worker-2   0/1     Completed           0          38s   192.168.221.144   las1     <none>           <none>

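We can see that a PodGroup with the same name as the job is created with minMember set to 4 (1 master + 3 workers). All pods were scheduled onto a single node (las1) because of the binpack policy, whereas in the earlier run with the default scheduler they were spread across las1, las2 and las3.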
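
The minMember value can be checked against the manifest by hand. The following is a minimal sketch, assuming the operator defaults minMember to the total replica count when the job sets no scheduling policy of its own. This is an assumption: the transcript only shows the resulting value. The function name `min_member` is illustrative.

```shell
# min_member MASTER_REPLICAS WORKER_REPLICAS
# Prints the number of pods that must be schedulable together,
# assuming minMember equals the sum of all replicas.
min_member() {
  echo $(( $1 + $2 ))
}

min_member 1 3   # prints 4, matching the MINMEMBER column above
```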