-
Notifications
You must be signed in to change notification settings - Fork 75
Description
I recently found that the job tracking routine has changed. It may be very rare, but I think there may be exceptional situations in which multiple jobs or duplicate jobs are submitted to the cluster due to user mistakes or errors in the associated system. Because it is difficult to predict or cope with all the various side effects from that, in order to prevent such a situation, I have previously implemented to identify and track the job by the ID obtained from the submitted job.
As for blocking mode, job tracking routine seems to be changed as it is now because the job ID cannot be obtained immediately, but I am a little concerned about taking these issues to support blocking mode.
flink-on-k8s-operator/controllers/flinkcluster_observer.go
Lines 301 to 304 in 35d5413
| flinkJobStatus.flinkJob = &flinkJobList.Jobs[0] | |
| for _, job := range flinkJobList.Jobs[1:] { | |
| flinkJobStatus.flinkJobsUnexpected = append(flinkJobStatus.flinkJobsUnexpected, job.Id) | |
| } |
In addition to selecting and tracking an arbitrary job when there are multiple jobs as above, when handling unexpected jobs, they are currently arbitrarily cleared collectively as follows. Rather, what do you think about to make users to check and handle unexpected jobs themselves like PR #107?
flink-on-k8s-operator/controllers/flinkcluster_reconciler.go
Lines 497 to 502 in 35d5413
| // Cancel unexpected jobs | |
| if len(observed.flinkJobStatus.flinkJobsUnexpected) > 0 { | |
| log.Info("Cancelling unexpected running job(s)") | |
| err = reconciler.cancelUnexpectedJobs(false /* takeSavepoint */) | |
| return requeueResult, err | |
| } |
<PR #107>
https://github.com/elanv/flink-on-k8s-operator/blob/2ad3293cf85ea92a9b6d020b707a878a0ec59ed2/controllers/flinkcluster_reconciler.go#L509-L520