An Experiment with Suspending Kubeflow Jobs
Kubeflow jobs have a Suspended state. Setting .spec.runPolicy.suspend to true puts a job into that state.
$ yq '.spec.runPolicy.suspend = true' pytorchjob.yaml | kubectl apply -f -
pytorchjob.kubeflow.org/pytest created
Changing .spec.runPolicy.suspend on an existing job moves it into or out of the Suspended state:
$ kubectl patch pytorchjob pytest --type=merge -p '{"spec": {"runPolicy": {"suspend": false}}}'
pytorchjob.kubeflow.org/pytest patched
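Suspending and resuming use the same patch and differ only in the boolean value. A minimal sketch of a wrapper follows. The function names suspend_patch and set_suspend and the KUBECTL override variable are hypothetical, introduced here only for illustration; the patch body is the one used above.

```shell
# Build the merge-patch body for a given suspend value ("true" or "false").
suspend_patch() {
  case "$1" in
    true|false) printf '{"spec": {"runPolicy": {"suspend": %s}}}\n' "$1" ;;
    *) echo "usage: suspend_patch true|false" >&2; return 1 ;;
  esac
}

# Hypothetical wrapper: set_suspend <job-name> <true|false>.
# Set KUBECTL=echo to preview the command instead of running it.
set_suspend() {
  patch=$(suspend_patch "$2") || return 1
  "${KUBECTL:-kubectl}" patch pytorchjob "$1" --type=merge -p "$patch"
}
```

For example, `set_suspend pytest true` would suspend the job and `set_suspend pytest false` would resume it.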
In the experiment below, the job is created in the Suspended state, resumed at 55s, and suspended again at 67s.
Watch the PyTorchJob status:
$ kubectl get pytorchjob -owide -w
NAME     STATE       AGE
pytest               0s
pytest   Suspended   0s
pytest   Suspended   52s
pytest   Suspended   52s
pytest   Suspended   53s
pytest   Running     55s
pytest   Running     59s
pytest   Running     59s
pytest   Running     60s
pytest   Running     67s
pytest   Suspended   67s
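The watch prints a row for every update, so most rows repeat the previous state. To see only the transitions, the output can be filtered. This is a rough sketch: the function name state_transitions is hypothetical, and it assumes the three-column NAME, STATE, AGE layout shown above.

```shell
# Print a row only when STATE differs from the last STATE seen for that job.
# Rows with an empty STATE (two fields) and the header row are skipped.
state_transitions() {
  awk '
    $1 == "NAME" { next }
    NF == 3 && $2 != last[$1] { print $1, $2, $3; last[$1] = $2 }
  '
}
```

Piping the watch output above through `state_transitions` leaves three rows: Suspended at 0s, Running at 55s, and Suspended at 67s.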
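Watch the Pod status. AGE here counts from each Pod's creation, so it does not line up with the job's AGE above: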
NAME              READY   STATUS              RESTARTS   AGE   IP                NODE     NOMINATED NODE   READINESS GATES
pytest-master-0   0/1     Pending             0          0s    <none>            <none>   <none>           <none>
pytest-worker-0   0/1     Pending             0          0s    <none>            <none>   <none>           <none>
pytest-worker-1   0/1     Pending             0          0s    <none>            <none>   <none>           <none>
pytest-worker-2   0/1     Pending             0          0s    <none>            <none>   <none>           <none>
pytest-master-0   0/1     Pending             0          1s    <none>            las1     <none>           <none>
pytest-worker-0   0/1     Pending             0          1s    <none>            las1     <none>           <none>
pytest-worker-2   0/1     Pending             0          1s    <none>            las1     <none>           <none>
pytest-worker-1   0/1     Pending             0          1s    <none>            las1     <none>           <none>
pytest-master-0   0/1     ContainerCreating   0          1s    <none>            las1     <none>           <none>
pytest-worker-1   0/1     Init:0/1            0          1s    <none>            las1     <none>           <none>
pytest-worker-0   0/1     Init:0/1            0          1s    <none>            las1     <none>           <none>
pytest-worker-2   0/1     Init:0/1            0          1s    <none>            las1     <none>           <none>
pytest-master-0   0/1     ContainerCreating   0          2s    <none>            las1     <none>           <none>
pytest-worker-0   0/1     Init:0/1            0          2s    <none>            las1     <none>           <none>
pytest-worker-2   0/1     Init:0/1            0          2s    <none>            las1     <none>           <none>
pytest-worker-1   0/1     Init:0/1            0          2s    <none>            las1     <none>           <none>
pytest-worker-0   0/1     Init:0/1            0          2s    192.168.221.183   las1     <none>           <none>
pytest-master-0   1/1     Running             0          2s    192.168.221.184   las1     <none>           <none>
pytest-worker-2   0/1     Init:0/1            0          3s    192.168.221.137   las1     <none>           <none>
pytest-worker-1   0/1     Init:0/1            0          3s    192.168.221.180   las1     <none>           <none>
pytest-worker-2   0/1     PodInitializing     0          5s    192.168.221.137   las1     <none>           <none>
pytest-worker-1   0/1     PodInitializing     0          5s    192.168.221.180   las1     <none>           <none>
pytest-worker-0   0/1     PodInitializing     0          5s    192.168.221.183   las1     <none>           <none>
pytest-worker-1   1/1     Running             0          6s    192.168.221.180   las1     <none>           <none>
pytest-worker-0   1/1     Running             0          6s    192.168.221.183   las1     <none>           <none>
pytest-worker-2   1/1     Running             0          6s    192.168.221.137   las1     <none>           <none>
pytest-master-0   1/1     Terminating         0          14s   192.168.221.184   las1     <none>           <none>
pytest-worker-0   1/1     Terminating         0          14s   192.168.221.183   las1     <none>           <none>
pytest-worker-1   1/1     Terminating         0          14s   192.168.221.180   las1     <none>           <none>
pytest-worker-2   1/1     Terminating         0          14s   192.168.221.137   las1     <none>           <none>
pytest-master-0   1/1     Terminating         0          14s   192.168.221.184   las1     <none>           <none>
pytest-worker-0   1/1     Terminating         0          14s   192.168.221.183   las1     <none>           <none>
pytest-master-0   0/1     Error               0          15s   192.168.221.184   las1     <none>           <none>
pytest-worker-1   1/1     Terminating         0          15s   192.168.221.180   las1     <none>           <none>
pytest-worker-2   1/1     Terminating         0          15s   192.168.221.137   las1     <none>           <none>
pytest-worker-0   0/1     Error               0          15s   192.168.221.183   las1     <none>           <none>
pytest-worker-1   0/1     Error               0          15s   192.168.221.180   las1     <none>           <none>
pytest-worker-2   0/1     Error               0          15s   192.168.221.137   las1     <none>           <none>
pytest-worker-0   0/1     Error               0          15s   192.168.221.183   las1     <none>           <none>
pytest-worker-0   0/1     Error               0          15s   192.168.221.183   las1     <none>           <none>
pytest-worker-2   0/1     Error               0          15s   192.168.221.137   las1     <none>           <none>
pytest-worker-2   0/1     Error               0          16s   192.168.221.137   las1     <none>           <none>
pytest-master-0   0/1     Error               0          16s   192.168.221.184   las1     <none>           <none>
pytest-master-0   0/1     Error               0          16s   192.168.221.184   las1     <none>           <none>
pytest-worker-1   0/1     Error               0          16s   192.168.221.180   las1     <none>           <none>
pytest-worker-1   0/1     Error               0          16s   192.168.221.180   las1     <none>           <none>
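As the output shows, putting a job into the Suspended state is equivalent to deleting all of its running Pods: every Pod goes to Terminating as soon as the job is suspended.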
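One way to double-check this conclusion is to count the Pods that still belong to the job after suspending it. The sketch below rests on assumptions: the function name job_pod_count and the KUBECTL override variable are hypothetical, and it assumes the training operator labels each Pod with training.kubeflow.org/job-name=<job>, which is worth confirming with `kubectl get pods --show-labels` on your cluster.

```shell
# Count the Pods that carry the job-name label of the given job.
# Assumption: Pods are labelled training.kubeflow.org/job-name=<job>.
# Note: a failing kubectl call also yields 0, so check stderr for errors.
job_pod_count() {
  "${KUBECTL:-kubectl}" get pods -l "training.kubeflow.org/job-name=$1" --no-headers \
    | wc -l | tr -d ' '
}
```

If suspension really removes the Pods, `job_pod_count pytest` should print 0 once the Pods have finished terminating, and 4 (one master plus three workers) while the job in this experiment is running.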