This repository was archived by the owner on May 8, 2025. It is now read-only.


@elanv
Contributor

@elanv elanv commented Nov 20, 2020

This PR improves job submission and tracking and fixes a job recovery bug. I am opening it now because some of the changes need discussion. (It already works, although some minor work remains.)

Changes

  • Submit a job
    Submit via the REST API instead of the Flink CLI. The operator generates the job ID, submits the job with it, and then tracks the job by that ID.

    Pros: With the CLI, the client generates the job graph and sends it to the jobmanager. With the REST API, the jobmanager generates the job graph, which reduces the resources used by the job submitter. Because only curl is needed, the job submitter image can also be made lighter.

    Cons: The REST API does not yet support pyFlink.

  • Job tracking
    Previously, the job submitter was responsible for monitoring the job, so it did not terminate after submitting the job. With this PR, the operator tracks the job instead. The advantage is that the job submitter can terminate after submission, which frees the resources allocated to it.

  • Fix job recovery bug
    Fix a bug where job recovery does not work if more than one job history entry remains in the jobmanager.
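
To make the submit-and-track flow above concrete, here is a minimal sketch, not the operator's actual code. It assumes the Flink REST API accepts a `jobId` field in the body of `POST /jars/:jarid/run` and that `GET /jobs/overview` returns a `jobs` list with `jid` and `state` fields. The function names are hypothetical, and no HTTP request is actually sent.

```python
import json
import uuid


def new_job_id() -> str:
    """Return a 32-character lowercase hex string, the textual form of a Flink job ID."""
    return uuid.uuid4().hex


def build_run_request(jar_id: str, job_id: str, parallelism: int = 1):
    """Build the path and JSON body for a jar run request (assumed API shape)."""
    path = f"/jars/{jar_id}/run"
    body = json.dumps({"jobId": job_id, "parallelism": parallelism})
    return path, body


def find_tracked_job(jobs_overview: dict, job_id: str):
    """Pick the job with the operator-generated ID out of a jobs overview.

    Matching on the ID, instead of assuming the overview holds exactly one
    job, keeps tracking correct when older job history entries remain.
    """
    for job in jobs_overview.get("jobs", []):
        if job.get("jid") == job_id:
            return job
    return None
```

Because the operator picks the ID up front, it never has to parse the submitter's output to learn which job to track.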

Discussion

About pyFlink support: the REST API does not support Python job submission at this time. There is a related PR upstream, but it seems to have made no further progress (apache/flink#8532 (comment)). If we need to support pyFlink now, the job submitter has to use the Flink CLI, which requires a change in implementation because the Flink CLI does not allow specifying the ID of the job to submit. (In that case, we could build the job submitter from two containers, a submitter container and a result container: the result container writes the job ID to its output log and the operator reads it.)

@elanv
Contributor Author

elanv commented Nov 21, 2020

I thought about it more, and I think it would be better to implement this with the CLI, getting the job ID via the pod log, and to apply the REST API method later. What do you think? @functicons

@functicons functicons self-requested a review November 21, 2020 04:31
@functicons
Contributor

Thanks for the PR! I will need a few days before getting back to you.

@functicons
Contributor

it would be nice to implement it by using CLI and getting the ID via the pod log. It is better to apply the REST API method later.

@elanv what's the reason behind it? I feel that getting the ID from the pod log is unreliable.

For PyFlink, we don't have to support it now, but it would be nice if the upstream Flink CLI could be improved to accept a client-specified ID. I guess the implementation should be easy. That way the Flink CLI could be used for both. Or we could wait for apache/flink#8532 (comment) to be resolved before adding PyFlink support.

@elanv
Contributor Author

elanv commented Nov 24, 2020

It's not needed right now, but I looked into it because I will need to support pyFlink later and it was mentioned in another issue. (Anyway, I found it is better to use the container termination message recorded in the Pod resource than to request logs from a running Pod.)
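
As a rough illustration of the termination message idea, the sketch below reads the job ID from a Pod object that has already been fetched as a dict. It assumes the standard Pod status layout (`status.containerStatuses[].state.terminated.message`). The container name and the assumption that the submitter writes a 32-character hex job ID somewhere in its termination message are hypothetical.

```python
import re

# A Flink job ID in text form: 32 lowercase hex characters.
JOB_ID_PATTERN = re.compile(r"\b[0-9a-f]{32}\b")


def job_id_from_pod(pod: dict, container_name: str):
    """Return the job ID found in the named container's termination message.

    Returns None if the container is missing, has not terminated yet,
    or its termination message contains no job ID.
    """
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for status in statuses:
        if status.get("name") != container_name:
            continue
        terminated = status.get("state", {}).get("terminated")
        if not terminated:
            return None  # container is still running or waiting
        match = JOB_ID_PATTERN.search(terminated.get("message") or "")
        return match.group(0) if match else None
    return None
```

Since the message is stored in the Pod resource itself, the operator can read it from the object it already watches, with no separate log request to a running Pod.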

If we can wait for support from upstream, I would like to finish the remaining work of this PR and ask you to review again.

@functicons
Contributor

It seems the question for now is whether PyFlink support is a priority. I don't have a preference, but if you think it is important, then it is okay to first use the CLI and get the ID via the pod log.

@elanv elanv closed this Dec 3, 2020
@elanv
Contributor Author

elanv commented Dec 3, 2020

Replaced with #379
