Kubeflow Trainer
Install
$ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer
Pulled: ghcr.io/kubeflow/charts/kubeflow-trainer:2.1.0
Digest: sha256:2659823a63034cdd091ef70621c2d631a2ab64de5ec07ce2d5848f74baa5ee60
$ helm install kubeflow-trainer kubeflow-trainer-2.1.0.tgz --namespace kubeflow-system --create-namespace
NAME: kubeflow-trainer
LAST DEPLOYED: Fri Jan 23 15:56:02 2026
NAMESPACE: kubeflow-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Show deployed workloads:
$ kubectl get all -n kubeflow-system
NAME READY STATUS RESTARTS AGE
pod/jobset-controller-996545cf5-qhpvv 0/1 Running 0 12s
pod/kubeflow-trainer-controller-manager-589f5f5945-gpgdg 0/1 Running 0 12s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/jobset-metrics-service ClusterIP 10.103.19.211 <none> 8443/TCP 12s
service/jobset-webhook-service ClusterIP 10.105.141.64 <none> 443/TCP 12s
service/kubeflow-trainer-controller-manager ClusterIP 10.104.36.240 <none> 8080/TCP,443/TCP 12s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/jobset-controller 0/1 1 0 12s
deployment.apps/kubeflow-trainer-controller-manager 0/1 1 0 12s
NAME DESIRED CURRENT READY AGE
replicaset.apps/jobset-controller-996545cf5 1 1 0 12s
replicaset.apps/kubeflow-trainer-controller-manager-589f5f5945 1 1 0 12s
Show installed API resources:
$ kubectl api-resources --api-group=trainer.kubeflow.org
NAME SHORTNAMES APIVERSION NAMESPACED KIND
clustertrainingruntimes trainer.kubeflow.org/v1alpha1 false ClusterTrainingRuntime
trainingruntimes trainer.kubeflow.org/v1alpha1 true TrainingRuntime
trainjobs trainer.kubeflow.org/v1alpha1 true TrainJob
Install runtimes
$ kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=master"
clustertrainingruntime.trainer.kubeflow.org/deepspeed-distributed serverside-applied
clustertrainingruntime.trainer.kubeflow.org/mlx-distributed serverside-applied
clustertrainingruntime.trainer.kubeflow.org/torch-distributed serverside-applied
clustertrainingruntime.trainer.kubeflow.org/torchtune-llama3.2-1b serverside-applied
clustertrainingruntime.trainer.kubeflow.org/torchtune-llama3.2-3b serverside-applied
clustertrainingruntime.trainer.kubeflow.org/torchtune-qwen2.5-1.5b serverside-applied
These ClusterTrainingRuntimes are installed:
$ kubectl get clustertrainingruntimes
NAME AGE
deepspeed-distributed 78s
mlx-distributed 78s
torch-distributed 78s
torchtune-llama3.2-1b 78s
torchtune-llama3.2-3b 78s
torchtune-qwen2.5-1.5b 78s
The definition of torch-distributed:
$ kubectl get clustertrainingruntimes torch-distributed -oyaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
creationTimestamp: "2026-01-23T08:28:34Z"
generation: 1
labels:
trainer.kubeflow.org/framework: torch
name: torch-distributed
resourceVersion: "82949790"
uid: 189c4800-c269-4f7d-84e8-10ec27c8dbbf
spec:
mlPolicy:
numNodes: 1
torch:
numProcPerNode: auto
template:
spec:
replicatedJobs:
- groupName: default
name: node
replicas: 1
template:
metadata:
labels:
trainer.kubeflow.org/trainjob-ancestor-step: trainer
spec:
template:
spec:
containers:
- image: pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime
name: node
Run a TrainJob
Submit a TrainJob:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
name: test
spec:
runtimeRef:
name: torch-distributed
trainer:
numNodes: 3
image: busybox:1.37.0-glibc
command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
numProcPerNode: 1
resourcesPerNode:
requests:
cpu: "1"
memory: "128Mi"
limits:
cpu: "1"
memory: "128Mi"
Show running workloads:
$ kubectl get trainjob,jobset,job,pod -owide
NAME STATE AGE
trainjob.trainer.kubeflow.org/test 20s
NAME TERMINALSTATE RESTARTS COMPLETED SUSPENDED AGE
jobset.jobset.x-k8s.io/test 0 false 20s
NAME STATUS COMPLETIONS DURATION AGE CONTAINERS IMAGES SELECTOR
job.batch/test-node-0 Running 0/3 20s 20s node busybox:1.37.0-glibc batch.kubernetes.io/controller-uid=6fa15312-46fd-43f4-bb85-b73bd74fe079
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/test-node-0-0-bxjmh 1/1 Running 0 20s 192.168.185.5 las3 <none> <none>
pod/test-node-0-1-xjszt 1/1 Running 0 20s 192.168.67.135 las2 <none> <none>
pod/test-node-0-2-bgvx4 1/1 Running 0 20s 192.168.221.136 las1 <none> <none>
Integrate with Kueue
Kueue support TrainJob by default, just add the label:
kind: TrainJob
metadata:
name: test
+ labels:
+ kueue.x-k8s.io/queue-name: test
spec:
runtimeRef:
name: torch-distributed