Skip to content

Job tracking routine changes for the blocking mode  #110

@elanv

Description

@elanv

I recently found that the job tracking routine has changed. It may be very rare, but I think there may be exceptional situations in which multiple jobs or duplicate jobs are submitted to the cluster due to user mistakes or errors in the associated system. Because it is difficult to predict or cope with all the various side effects from that, in order to prevent such a situation, I have previously implemented to identify and track the job by the ID obtained from the submitted job.

As for blocking mode, job tracking routine seems to be changed as it is now because the job ID cannot be obtained immediately, but I am a little concerned about taking these issues to support blocking mode.

flinkJobStatus.flinkJob = &flinkJobList.Jobs[0]
for _, job := range flinkJobList.Jobs[1:] {
flinkJobStatus.flinkJobsUnexpected = append(flinkJobStatus.flinkJobsUnexpected, job.Id)
}

In addition to selecting and tracking an arbitrary job when there are multiple jobs as above, when handling unexpected jobs, they are currently arbitrarily cleared collectively as follows. Rather, what do you think about to make users to check and handle unexpected jobs themselves like PR #107?

// Cancel unexpected jobs
if len(observed.flinkJobStatus.flinkJobsUnexpected) > 0 {
log.Info("Cancelling unexpected running job(s)")
err = reconciler.cancelUnexpectedJobs(false /* takeSavepoint */)
return requeueResult, err
}

<PR #107>
https://github.com/elanv/flink-on-k8s-operator/blob/2ad3293cf85ea92a9b6d020b707a878a0ec59ed2/controllers/flinkcluster_reconciler.go#L509-L520

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions