This repository was archived by the owner on May 8, 2025. It is now read-only.
Improve submitting/tracking job and fix job recovery bug #372
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR improves submitting/tracking job and fixes job recovery bugs. I made this PR for now because there are some changes and contents to discuss. (It still works even though there are some minor works.)
Changes
Submit a job
Submit via REST API instead of flink CLI. Submit a job with the job ID generated by the operator, and the operator tracks the job with that ID.
Pros: When using the CLI, the client side generates and sends a job graph to the jobmanager, but when using the REST API, the jobmanager side generates a job graph, reducing the resources used by the job submitter. Because curl is used, the job submitter image can also be lightened.
Cons: The REST API does not yet support pyFlink.
Job tracking
Previously, the job submitter was responsible for monitoring the job, therefore it was not terminated after submitting the job. But this PR changed the operator to track it instead of the job submitter. The advantage is that it saves the resources allocated to it by terminating the job submitter.
Fix job recovery bug
Fix bug that job recovery does not work if more than one job history remains in the job manager
Discussion
About pyFlink support: REST API does not support python job submission at this time. There is a related PR in upstream, but it seems no further progress (apache/flink#8532 (comment)). If we need to support pyFlink now, we need to change the job submitter to use flink CLI, which requires a change in implementation because Flink CLI does not allow specifying the ID of the job to submit. (In this case, we could make the job submitter with two containers like submitter containers and result containers - result container writes the job ID to output log and the operator reads it.)