An experiment on activeDeadlineSeconds
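A Kubeflow Job can set .spec.runPolicy.activeDeadlineSeconds. When the Job has been alive longer than the specified time and has neither completed nor failed, the controller actively terminates the Job and sets its status to Failed. This logic lives at https://github.com/kubeflow/trainer/blob/release-1.9/pkg/controller.v1/common/job.go#L216. However, because that code sits inside the ReconcileJob function, it has no effect unless something triggers ReconcileJob.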
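The consequence of checking the deadline only during a reconcile pass can be sketched with a small model. This is a simplified illustration, not the controller's code: the function names and the reconcile times below are hypothetical, and the comparison (elapsed time greater than or equal to the deadline) is an assumption about how the check behaves.

```python
from typing import Iterable, Optional


def past_active_deadline(start: float, now: float, deadline: Optional[float]) -> bool:
    """Return True if the job has been active for at least `deadline` seconds."""
    if deadline is None:  # no deadline configured
        return False
    return now - start >= deadline


def first_failure_time(reconcile_times: Iterable[float], start: float,
                       deadline: Optional[float]) -> Optional[float]:
    """Return the first reconcile time at which the deadline is noticed.

    The check only runs when a reconcile happens, so the result can be
    later than start + deadline, or None if no reconcile ever follows.
    """
    for now in sorted(reconcile_times):
        if past_active_deadline(start, now, deadline):
            return now
    return None


# Hypothetical reconcile times in seconds, chosen for illustration only.
busy = [0, 3, 6]           # state keeps changing around the deadline
quiet = [0, 3, 6, 7, 33]   # nothing happens between 7s and 33s

print(first_failure_time(busy, start=0, deadline=5))    # 6
print(first_failure_time(quiet, start=0, deadline=15))  # 33
```

In the second call the deadline expires at 15s, but the model only notices it at 33s because no reconcile runs in between.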
The experiment uses a PyTorchJob as an example. PyTorchJob definition:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytest
  namespace: default
spec:
  runPolicy:
    backoffLimit: 3
  pytorchReplicaSpecs:
    Master:
      replicas: 1 # must be 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: busybox:1.37.0-glibc
              imagePullPolicy: IfNotPresent
              name: pytorch # must be `pytorch`
              command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
              resources:
                limits:
                  cpu: "1"
                  memory: "100Mi"
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: busybox:1.37.0-glibc
              imagePullPolicy: IfNotPresent
              name: pytorch # must be `pytorch`
              command: ["sh", "-c", "trap exit INT TERM; sleep 30s & wait"]
              resources:
                limits:
                  cpu: "1"
                  memory: "100Mi"
Case where the deadline takes effect
Set activeDeadlineSeconds to less than the job's run time (30s) and submit:
$ yq '.spec.runPolicy.activeDeadlineSeconds = 5' pytorchjob.yaml | kubectl apply -f -
Watch the PyTorchJob status changes:
$ kubectl get pytorchjob -owide -w
NAME     STATE     AGE
pytest             0s
pytest   Created   0s
pytest   Created   0s
pytest   Running   3s
pytest   Failed    6s
Watch the Pod status changes:
$ kubectl get po -owide -w
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pytest-master-0 0/1 Pending 0 0s <none> <none> <none> <none>
pytest-worker-0 0/1 Pending 0 0s <none> <none> <none> <none>
pytest-worker-1 0/1 Pending 0 0s <none> <none> <none> <none>
pytest-worker-2 0/1 Pending 0 0s <none> <none> <none> <none>
pytest-worker-2 0/1 Pending 0 1s <none> las1 <none> <none>
pytest-worker-1 0/1 Pending 0 1s <none> las1 <none> <none>
pytest-worker-0 0/1 Pending 0 1s <none> las1 <none> <none>
pytest-master-0 0/1 Pending 0 1s <none> las1 <none> <none>
pytest-worker-2 0/1 Init:0/1 0 1s <none> las1 <none> <none>
pytest-master-0 0/1 ContainerCreating 0 1s <none> las1 <none> <none>
pytest-worker-0 0/1 Init:0/1 0 1s <none> las1 <none> <none>
pytest-worker-1 0/1 Init:0/1 0 1s <none> las1 <none> <none>
pytest-worker-2 0/1 Init:0/1 0 2s <none> las1 <none> <none>
pytest-worker-0 0/1 Init:0/1 0 2s <none> las1 <none> <none>
pytest-master-0 0/1 ContainerCreating 0 2s <none> las1 <none> <none>
pytest-worker-1 0/1 Init:0/1 0 2s <none> las1 <none> <none>
pytest-worker-1 0/1 Init:0/1 0 2s 192.168.221.145 las1 <none> <none>
pytest-worker-2 0/1 Init:0/1 0 3s 192.168.221.144 las1 <none> <none>
pytest-master-0 1/1 Running 0 3s 192.168.221.188 las1 <none> <none>
pytest-worker-0 0/1 Init:0/1 0 3s 192.168.221.150 las1 <none> <none>
pytest-worker-0 0/1 PodInitializing 0 5s 192.168.221.150 las1 <none> <none>
pytest-master-0 1/1 Terminating 0 6s 192.168.221.188 las1 <none> <none>
pytest-worker-1 0/1 PodInitializing 0 6s 192.168.221.145 las1 <none> <none>
pytest-worker-2 0/1 PodInitializing 0 6s 192.168.221.144 las1 <none> <none>
pytest-worker-0 0/1 Terminating 0 6s 192.168.221.150 las1 <none> <none>
pytest-worker-1 0/1 Terminating 0 6s 192.168.221.145 las1 <none> <none>
pytest-worker-2 0/1 Terminating 0 6s 192.168.221.144 las1 <none> <none>
pytest-master-0 1/1 Terminating 0 6s 192.168.221.188 las1 <none> <none>
pytest-master-0 0/1 Error 0 6s 192.168.221.188 las1 <none> <none>
pytest-worker-1 1/1 Terminating 0 7s 192.168.221.145 las1 <none> <none>
pytest-worker-0 1/1 Terminating 0 7s 192.168.221.150 las1 <none> <none>
pytest-worker-2 1/1 Terminating 0 7s 192.168.221.144 las1 <none> <none>
pytest-master-0 0/1 Error 0 7s 192.168.221.188 las1 <none> <none>
pytest-master-0 0/1 Error 0 7s 192.168.221.188 las1 <none> <none>
pytest-worker-1 1/1 Terminating 0 7s 192.168.221.145 las1 <none> <none>
pytest-worker-2 1/1 Terminating 0 7s 192.168.221.144 las1 <none> <none>
pytest-worker-0 1/1 Terminating 0 7s 192.168.221.150 las1 <none> <none>
pytest-worker-1 0/1 Error 0 7s 192.168.221.145 las1 <none> <none>
pytest-worker-2 0/1 Error 0 7s 192.168.221.144 las1 <none> <none>
pytest-worker-0 0/1 Error 0 7s 192.168.221.150 las1 <none> <none>
pytest-worker-1 0/1 Error 0 8s 192.168.221.145 las1 <none> <none>
pytest-worker-1 0/1 Error 0 8s 192.168.221.145 las1 <none> <none>
pytest-worker-2 0/1 Error 0 8s 192.168.221.144 las1 <none> <none>
pytest-worker-2 0/1 Error 0 8s 192.168.221.144 las1 <none> <none>
pytest-worker-0 0/1 Error 0 8s 192.168.221.150 las1 <none> <none>
pytest-worker-0 0/1 Error 0 8s 192.168.221.150 las1 <none> <none>
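The deadline code took effect and stopped the Job promptly. This is because the Pods were still changing state, and those changes triggered ReconcileJob.
Case where the deadline does not take effect
Set activeDeadlineSeconds long enough that all Pods reach a stable Running state: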
$ yq '.spec.runPolicy.activeDeadlineSeconds = 15' pytorchjob.yaml | kubectl apply -f -
Watch the PyTorchJob status changes:
$ kubectl get pytorchjob -owide -w
NAME     STATE     AGE
pytest             0s
pytest   Created   0s
pytest   Created   0s
pytest   Running   3s
pytest   Running   6s
pytest   Running   6s
pytest   Running   7s
pytest   Failed    33s
Watch the Pod status changes:
$ kubectl get po -owide -w
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pytest-worker-0 0/1 Pending 0 0s <none> <none> <none> <none>
pytest-worker-1 0/1 Pending 0 0s <none> <none> <none> <none>
pytest-worker-2 0/1 Pending 0 0s <none> <none> <none> <none>
pytest-master-0 0/1 Pending 0 0s <none> <none> <none> <none>
pytest-worker-2 0/1 Pending 0 1s <none> las1 <none> <none>
pytest-worker-1 0/1 Pending 0 1s <none> las1 <none> <none>
pytest-master-0 0/1 Pending 0 1s <none> las1 <none> <none>
pytest-worker-0 0/1 Pending 0 1s <none> las1 <none> <none>
pytest-worker-2 0/1 Init:0/1 0 1s <none> las1 <none> <none>
pytest-worker-1 0/1 Init:0/1 0 1s <none> las1 <none> <none>
pytest-worker-0 0/1 Init:0/1 0 1s <none> las1 <none> <none>
pytest-master-0 0/1 ContainerCreating 0 1s <none> las1 <none> <none>
pytest-worker-2 0/1 Init:0/1 0 2s <none> las1 <none> <none>
pytest-worker-1 0/1 Init:0/1 0 2s <none> las1 <none> <none>
pytest-master-0 0/1 ContainerCreating 0 2s <none> las1 <none> <none>
pytest-worker-0 0/1 Init:0/1 0 2s <none> las1 <none> <none>
pytest-master-0 1/1 Running 0 3s 192.168.221.159 las1 <none> <none>
pytest-worker-1 0/1 Init:0/1 0 3s 192.168.221.161 las1 <none> <none>
pytest-worker-0 0/1 Init:0/1 0 3s 192.168.221.129 las1 <none> <none>
pytest-worker-2 0/1 Init:0/1 0 3s 192.168.221.148 las1 <none> <none>
pytest-worker-2 0/1 PodInitializing 0 5s 192.168.221.148 las1 <none> <none>
pytest-worker-1 0/1 PodInitializing 0 5s 192.168.221.161 las1 <none> <none>
pytest-worker-0 0/1 PodInitializing 0 6s 192.168.221.129 las1 <none> <none>
pytest-worker-1 1/1 Running 0 6s 192.168.221.161 las1 <none> <none>
pytest-worker-2 1/1 Running 0 6s 192.168.221.148 las1 <none> <none>
pytest-worker-0 1/1 Running 0 7s 192.168.221.129 las1 <none> <none>
pytest-master-0 0/1 Completed 0 33s 192.168.221.159 las1 <none> <none>
pytest-worker-2 1/1 Terminating 0 33s 192.168.221.148 las1 <none> <none>
pytest-master-0 0/1 Terminating 0 33s 192.168.221.159 las1 <none> <none>
pytest-worker-0 1/1 Terminating 0 33s 192.168.221.129 las1 <none> <none>
pytest-worker-1 1/1 Terminating 0 33s 192.168.221.161 las1 <none> <none>
pytest-worker-0 1/1 Terminating 0 33s 192.168.221.129 las1 <none> <none>
pytest-worker-2 1/1 Terminating 0 33s 192.168.221.148 las1 <none> <none>
pytest-worker-1 1/1 Terminating 0 33s 192.168.221.161 las1 <none> <none>
pytest-worker-2 0/1 Error 0 33s 192.168.221.148 las1 <none> <none>
pytest-worker-0 0/1 Error 0 33s 192.168.221.129 las1 <none> <none>
pytest-worker-1 0/1 Error 0 33s 192.168.221.161 las1 <none> <none>
pytest-worker-1 0/1 Error 0 34s 192.168.221.161 las1 <none> <none>
pytest-worker-1 0/1 Error 0 34s 192.168.221.161 las1 <none> <none>
pytest-worker-0 0/1 Error 0 34s 192.168.221.129 las1 <none> <none>
pytest-worker-0 0/1 Error 0 34s 192.168.221.129 las1 <none> <none>
pytest-worker-2 0/1 Error 0 34s 192.168.221.148 las1 <none> <none>
pytest-worker-2 0/1 Error 0 34s 192.168.221.148 las1 <none> <none>
pytest-master-0 0/1 Terminating 0 34s 192.168.221.159 las1 <none> <none>
pytest-master-0 0/1 Completed 0 34s 192.168.221.159 las1 <none> <none>
pytest-master-0 0/1 Completed 0 35s 192.168.221.159 las1 <none> <none>
pytest-master-0 0/1 Completed 0 35s 192.168.221.159 las1 <none> <none>
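Because no state changed, the Job was not cancelled when the deadline (15s) was reached. The timeout was only detected at 33s, when a Pod's state next changed.
Clock skew between nodes has been ruled out for all of the results above.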
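One possible way to close this gap is to schedule an extra reconcile for the moment the deadline expires. The Kubernetes work queue offers AddAfter and controller-runtime offers Result.RequeueAfter for this kind of delayed trigger. The sketch below only models the idea and makes no claim about what the Kubeflow controller actually does: the function names are hypothetical, and the event times are the AGE values from the PyTorchJob watch above, used as a rough proxy for when reconciles ran.

```python
from typing import List, Optional


def reconcile_schedule(event_times: List[float], start: float,
                       deadline: Optional[float], requeue: bool) -> List[float]:
    """Return the sorted times at which a reconcile would run.

    With requeue=True, one extra reconcile is scheduled at start + deadline,
    modelling a delayed requeue (an assumption, not observed behaviour).
    """
    times = set(event_times)
    if requeue and deadline is not None:
        times.add(start + deadline)
    return sorted(times)


def deadline_noticed_at(times: List[float], start: float,
                        deadline: Optional[float]) -> Optional[float]:
    """Return the first reconcile time at which the deadline has passed."""
    for now in times:
        if deadline is not None and now - start >= deadline:
            return now
    return None


events = [0, 3, 6, 7, 33]  # AGE values from the second experiment

without = reconcile_schedule(events, start=0, deadline=15, requeue=False)
with_requeue = reconcile_schedule(events, start=0, deadline=15, requeue=True)

print(deadline_noticed_at(without, start=0, deadline=15))       # 33
print(deadline_noticed_at(with_requeue, start=0, deadline=15))  # 15
```

Under this model the delayed reconcile lets the Job fail at the 15s deadline instead of waiting for the next Pod event at 33s.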