
Conversation

@elanv
Contributor

@elanv elanv commented Dec 3, 2020

Changes

  • Job submission and tracking

    • The job submitter pod finishes after submitting the job.
    • The Flink job is identified by extracting its job ID from the termination-log of the job submitter pod (see the sketch below).
    • The Flink job is tracked by the operator itself, not by the job submitter.
    • The job cluster's bootstrap time is shortened by initializing the cluster and submitting the job in parallel.
  • Fixes

    • Job recovery: fixed a bug where job recovery does not work if more than one job history entry remains in the JobManager.
    • fromSavepoint on update: when updating with a provided fromSavepoint, the latest savepoint (status.savepointLocation) is also changed to it.

Resolves #294
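
A minimal sketch of the termination-log idea in Go (the helper name and regex are illustrative, not this PR's actual code): Kubernetes copies a container's termination message, by default the contents of /dev/termination-log, into the pod status, so the operator can read the job ID from there once the submitter finishes.

package main

import (
	"fmt"
	"regexp"

	corev1 "k8s.io/api/core/v1"
)

// jobIDPattern matches a Flink job ID: 32 lowercase hex characters.
var jobIDPattern = regexp.MustCompile(`[0-9a-f]{32}`)

// extractJobID is a hypothetical helper: it scans the terminated
// container statuses of the finished submitter pod for a Flink job ID.
func extractJobID(pod *corev1.Pod) (string, error) {
	for _, cs := range pod.Status.ContainerStatuses {
		// State.Terminated.Message holds the termination-log contents.
		if t := cs.State.Terminated; t != nil {
			if id := jobIDPattern.FindString(t.Message); id != "" {
				return id, nil
			}
		}
	}
	return "", fmt.Errorf("no Flink job ID found in pod %s", pod.Name)
}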

@elanv
Contributor Author

elanv commented Dec 4, 2020

I will add and fix some tests soon.

Contributor

@functicons functicons left a comment


Thanks for the PR! I left a few comments.

@elanv elanv requested a review from functicons December 4, 2020 16:13
Contributor

@functicons functicons left a comment


A few more comments, thanks!

@elanv elanv requested a review from functicons December 5, 2020 01:36
@functicons
Contributor

/gcbrun

@elanv
Contributor Author

elanv commented Dec 5, 2020

Restored a CRD that was accidentally changed. Removed the JobManager check start delay from the submit script: the time difference between submitter and JobManager initialization is small, and the submitter can sometimes take longer, for example when downloading a large file.

@functicons
Contributor

/gcbrun

@functicons
Contributor

I just ran a test: the sample job finished successfully, but the status in the CR was not quite right; it was still Pending:

Status:
  Components:
    Job:
      Id:     af4f6808f9c78597623a626e300c73f5
      Name:   flinkjobcluster-sample-job
      State:  Pending
  ...
  Current Revision:  flinkjobcluster-sample-5d96cb58dd-1
  Last Update Time:  2020-12-06T04:55:33Z
  Next Revision:     flinkjobcluster-sample-5d96cb58dd-1
  State:             Stopped

Contributor

@functicons functicons left a comment


I recommend we change the name suffix of the Job resource from job to job-submitter; otherwise, people might be confused when they see it in the Completed status.

We also need to update the "Flink job" section of the user guide on how to monitor the job. Previously, monitoring the job submitter made sense, but with this change it no longer does. I think we can briefly mention how the job is submitted.
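
For that section, monitoring could point at the CR status the operator now maintains, for example (the resource name is the sample from this thread; the jsonpath is illustrative):

# Watch the Flink job state the operator tracks in the CR status.
kubectl get flinkcluster flinkjobcluster-sample \
  -o jsonpath='{.status.components.job.state}'

# Or inspect the full status, including the Flink job ID.
kubectl describe flinkcluster flinkjobcluster-sample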

@elanv
Contributor Author

elanv commented Dec 6, 2020

Thanks for your review; it caught things I missed.
Fixed the issues related to finished jobs.

I will proceed with the remaining work of renaming and the docs.

@elanv elanv force-pushed the feature/job-submitting-tracking-recovery-cli branch from 0d16303 to 34e37d7 Compare December 7, 2020 01:29
@elanv elanv requested a review from functicons December 7, 2020 01:29
@functicons
Contributor

A unit test is failing; could you fix it?

E1207 05:00:42.876848    3516 factory.go:35] Failed initializing volcano batch scheduler: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
--- FAIL: TestGetDesiredClusterState (0.01s)
    flinkcluster_converter_test.go:889: assertion failed: 
        --- desiredState.Job
        +++ expectedDesiredJob
          v1.Job{
                TypeMeta: v1.TypeMeta{},
                ObjectMeta: v1.ObjectMeta{
        -               Name:         "flinkjobcluster-sample-job-submitter",
        +               Name:         "flinkjobcluster-sample-job",
                        GenerateName: "",
                        Namespace:    "default",
                        ... // 13 identical fields
                },

@elanv elanv requested a review from functicons December 7, 2020 05:12
@functicons
Contributor

I noticed another problem: in the current job status, it shows x-job-submitter in the Succeeded status. It's unclear whether the status is for the job submitter or the job. Maybe remove the Name field to make it less confusing?

Status:
    Components:
        Job:
            Id:     b2a57bdfdc5127ddbab05da9ec438168
            Name:   flinkjobcluster-sample-job-submitter
            State:  Succeeded

@elanv
Contributor Author

elanv commented Dec 7, 2020

I noticed another problem: in the current job status, it shows x-job-submitter in the Succeeded status. It's unclear whether the status is for the job submitter or the job. Maybe remove the Name field to make it less confusing?

That's right; I think it would be better to remove it too.
Alternatively, it would be nice to keep the field but make it optional for later use, and simply not set the value for now (see the sketch below).
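
A minimal sketch of that alternative (the type and field set are illustrative, modeled on the status dump above, not the operator's actual API):

package v1beta1 // illustrative placement in the operator's API package

// JobStatus sketches a job status with the Name field kept but optional,
// so it can stay unset for now and be populated later if needed.
type JobStatus struct {
	// ID is the Flink job ID extracted from the submitter's termination-log.
	ID string `json:"id"`
	// Name is optional and left unset to avoid confusion between the
	// Flink job and the job submitter.
	// +optional
	Name string `json:"name,omitempty"`
	// State is the current state of the Flink job.
	State string `json:"state"`
}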

@functicons
Contributor

/gcbrun

@functicons functicons merged commit 2d0509e into GoogleCloudPlatform:master Dec 7, 2020
Comment on lines -49 to -57
function check_existing_jobs() {
  echo "Checking existing jobs..."
  list_jobs
  if list_jobs | grep -e "(SCHEDULED)" -e "(CREATED)" -e "(SUSPENDED)" -e "(FINISHED)" -e "(FAILED)" -e "(CANCELED)" \
      -e "(RUNNING)" -e "(RESTARTING)" -e "(CANCELLING)" -e "(FAILING)" -e "(RECONCILING)"; then
    echo "Found an existing job, skip resubmitting..."
    return 0
  fi
  return 1
}


@elanv, could you please explain why you removed this functionality?
We use HA mode, so when the cluster is recreated (removed and deployed again), the newly created JobManager is able to restore the cluster's state and the job. At the same time, the job submitter submitted the same job again, and we ended up with two jobs.

Contributor Author

@elanv elanv Dec 15, 2020


@abroskin The job is now tracked inside the operator (link) rather than by the submitter script. Since the previous job tracking script had issues like #294 with the Flink operator's automatic job recovery feature, this PR changes it to track the job by its ID, unlike before. Also, is spec.job.restartPolicy set to Never? If so, the Flink operator should not automatically restart the job; an illustrative spec fragment follows below. In addition, it would be nice if you could explain more about your HA configuration and deployment.
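
For reference, an illustrative FlinkCluster fragment with that restart policy (the apiVersion and jarFile value are assumptions, not taken from this thread):

apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: flinkjobcluster-sample
spec:
  job:
    jarFile: ./examples/streaming/WordCount.jar  # placeholder
    restartPolicy: Never  # the operator will not resubmit the job by itself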

elanv added a commit to elanv/gcp-flink-on-k8s-operator that referenced this pull request Dec 17, 2020
elanv added a commit to elanv/gcp-flink-on-k8s-operator that referenced this pull request Dec 17, 2020
functicons pushed a commit that referenced this pull request Dec 18, 2020
shashken added a commit to shashken/flink-on-k8s-operator that referenced this pull request Jan 3, 2021

@bains00 bains00 left a comment


Review, please.



Development

Successfully merging this pull request may close these issues.

Operator does not resubmit failed jobs.
