-
Notifications
You must be signed in to change notification settings - Fork 262
Improve submitting/tracking job and fix #379
Improve submitting/tracking job and fix #379
Conversation
|
I will add and fix some tests soon. |
functicons
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I left a few comments.
functicons
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments, thanks!
|
/gcbrun |
|
Restored a CRD that was accidentally changed. Removed JobManager check start delay from submit script. This is because the time difference between the submitter and JM initialization is small and sometimes the submitter can take longer, such as downloading large file. |
|
/gcbrun |
|
I just did a test, the sample job finished successfully, but the status in the CR was not quite right, it was still in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend we change the name suffix of the Job resource from job to job-submitter; otherwise, people might be confused when seeing it is in Completed status.
We also need to update the "Flink job" section of the user guide on how to monitor the job. Previously monitoring the job submitter makes sense, but with this change, it doesn't make much sense. I think we can briefly mention how the job is submitted.
|
Thanks for your review of something I missed. Will proceed remaining works of renaming and docs. |
0d16303 to
34e37d7
Compare
|
Unit test is failing, could you fix it? |
|
I noticed another problem, in the current job status, it shows |
That's right. I think it would be better to remove it too. |
|
/gcbrun |
| function check_existing_jobs() { | ||
| echo "Checking existing jobs..." | ||
| list_jobs | ||
| if list_jobs | grep -e "(SCHEDULED)" -e "(CREATED)" -e "(SUSPENDED)" -e "(FINISHED)" -e "(FAILED)" -e "(CANCELED)" \ | ||
| -e "(RUNNING)" -e "(RESTARTING)" -e "(CANCELLING)" -e "(FAILING)" -e "(RECONCILING)"; then | ||
| echo "Found an existing job, skip resubmitting..." | ||
| return 0 | ||
| fi | ||
| return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@elanv, could you please explain why you removed this functionality?
We use HA mode. So that, when the cluster is recreated (removed and and deployed again), the newly created jobmanager is able to restore cluster's state and the job. And in the same time the job submitter submitted the same job again. We ended up with two jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abroskin The job is now being tracked inside the operator (link) rather than the submitter script. And since the previous job tracking script had an issue like #294 with the flink operator's auto job recovery feature, this PR changed to track the job by ID unlike before. And is the spec.job.restartPolicy set to Never? If so, the flink operator should not automatically restart the job. In addition, it would be nice if you could explain more about the HA configuration and deployment.
Update Helm chart CRD for PR GoogleCloudPlatform#379 (GoogleCloudPlatform#386)
bains00
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review plzz
Changes
Job submitting and tracking
Fix
Fix bug that job recovery does not work if more than one job history remains in the job manager
fromSavepointwhen update.When updating with provided
fromSavepoint, change latest savepoint (status.savepointLocation) to it also.Resolves #294