vijtrip2
Contributor

@vijtrip2 vijtrip2 commented Aug 25, 2021

Description of changes:

  • Introduces a new DeepCopy method in the AWSResource interface.
  • After the fixes made in release v0.12.0, the status of the latest object was being modified during the PatchMetadataAndSpec call.
  • The problem surfaced in one of the SageMaker tests that relies on waiting for a Status.Condition.
  • Since the condition was being reset by the PatchMetadataAndSpec call, the test never passed.
  • I tried multiple things, for example:
    (a) Passing an empty status in the base parameter of the kc.Patch() call, but kc.Patch still resets the status by reading it from etcd.
  • To understand this problem further, I will have to deep dive into the kc.Patch source code.
  • To unblock the SageMaker team from developing their controller, I propose the change in this PR, which is safe.
  • Copying the status again is somewhat redundant work, but once we understand the kc.Patch operation better, we can improve on it.
  • I have tested with the SageMaker e2e tests that this solution works.
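
For readers following along, the workaround described above can be sketched roughly as follows. This is a minimal illustration, not the runtime's code: `Resource`, `Status`, and `fakePatch` are hypothetical stand-ins, and `fakePatch` only imitates the observed symptom (the patched object comes back carrying the stored status); it makes no claim about how kc.Patch works internally.

```go
package main

import "fmt"

// Status and Resource are simplified, hypothetical stand-ins for an ACK
// resource; they are not the real AWSResource interface.
type Status struct {
	Conditions map[string]string
}

type Resource struct {
	ResourceVersion string
	Spec            string
	Status          Status
}

// DeepCopy returns a copy that shares no memory with r.
func (r *Resource) DeepCopy() *Resource {
	out := *r
	out.Status.Conditions = make(map[string]string, len(r.Status.Conditions))
	for k, v := range r.Status.Conditions {
		out.Status.Conditions[k] = v
	}
	return &out
}

// SetStatus overwrites r's status with a copy of other's status.
func (r *Resource) SetStatus(other *Resource) {
	r.Status = other.DeepCopy().Status
}

// fakePatch imitates the symptom seen in this PR: after the call, obj holds
// the stored object's status and a new resource version.
// ASSUMPTION: this models the observed behaviour only, not kc.Patch itself.
func fakePatch(obj, stored *Resource) {
	stored.Spec = obj.Spec
	stored.ResourceVersion = "2"
	*obj = *stored.DeepCopy()
}

// patchMetadataAndSpec applies the workaround: snapshot latest, patch, then
// restore the in-memory status while keeping the refreshed metadata.
func patchMetadataAndSpec(latest, stored *Resource) {
	latestCopy := latest.DeepCopy()
	fakePatch(latest, stored)
	latest.SetStatus(latestCopy)
}

func main() {
	stored := &Resource{
		ResourceVersion: "1",
		Spec:            "old",
		Status:          Status{Conditions: map[string]string{"ACK.ResourceSynced": "True"}},
	}
	latest := &Resource{
		ResourceVersion: "1",
		Spec:            "new",
		Status:          Status{Conditions: map[string]string{"ACK.ResourceSynced": "False"}},
	}
	patchMetadataAndSpec(latest, stored)
	// Metadata is refreshed, but the in-memory status survives the patch.
	fmt.Println(latest.ResourceVersion, latest.Status.Conditions["ACK.ResourceSynced"]) // prints: 2 False
}
```

The extra copy is the redundant work mentioned in the description: the status is duplicated only so that it can be written back after the patch.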

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Member

@surajkota surajkota left a comment

not clean as mentioned in the description but ok for now

@vijtrip2
Contributor Author

Some log artifacts showing that latest.Status is overwritten with the etcd copy after the kc.Patch call.

  • desired is the object read from etcd.
  • latest is the modified object.
  • clearedDesired is a deep copy of desired with an empty status.

The call looks like this:

err = r.kc.Patch(
	ctx,
	latest.RuntimeObject(),
	client.MergeFrom(clearedDesired.RuntimeObject()),
)
**********************************************

2021-08-25T07:17:36.748Z	INFO	ackrt	Desired Status before kc.Patch {"ackResourceMetadata":{"arn":"arn:aws:sagemaker:us-west-2:309117047740:endpoint/xgboost-endpoint-tdugtffkpge6hw2","ownerAccountID":"309117047740"},"conditions":[{"type":"ACK.ResourceSynced","status":"True"},{"type":"ACK.Recoverable","status":"False"},{"type":"ACK.Terminal","status":"True","message":"unable to update endpoint. check FailureReason"}],"creationTime":"2021-08-25T07:05:50Z","endpointStatus":"InService","failureReason":" Failed to download model data for container \"container_1\" from URL: \"s3://ack-data-bucket-us-west-2-309117047740-hmlyu3uon3xicxn0ebs1qhxn/sagemaker/model/delete/xgboost-mnist-model.tar.gz\". Please ensure that there is an object located at the URL and that the role passed to CreateModel has permissions to download the object.","lastEndpointConfigNameForUpdate":"xgboost-endpoint-tdugtffkpge6hw2-faulty-config","lastModifiedTime":"2021-08-25T07:16:37Z","latestEndpointConfigName":"xgboost-endpoint-tdugtffkpge6hw2-single-variant-config","productionVariants":[{"currentInstanceCount":2,"currentWeight":1,"deployedImages":[{"resolutionTime":"2021-08-25T07:05:52Z","resolvedImage":"433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost@sha256:54004f910467ebf7cfa71b5523b81695d103abf21a37d38dc84d63ab8d510c35","specifiedImage":"433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest"}],"desiredInstanceCount":2,"desiredWeight":1,"variantName":"variant-1"}]}	{"kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3, "account": "309117047740", "role": "", "region": "us-west-2", "is_adopted": false, "kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3}


2021-08-25T07:17:36.748Z	INFO	ackrt	Cleared Desired Status before kc.Patch {"ackResourceMetadata":null,"conditions":null}	{"kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3, "account": "309117047740", "role": "", "region": "us-west-2", "is_adopted": false, "kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3}



2021-08-25T07:17:36.748Z	INFO	ackrt	Latest Status before kc.Patch {"ackResourceMetadata":{"arn":"arn:aws:sagemaker:us-west-2:309117047740:endpoint/xgboost-endpoint-tdugtffkpge6hw2","ownerAccountID":"309117047740"},"conditions":[{"type":"ACK.ResourceSynced","status":"False"},{"type":"ACK.Recoverable","status":"False"},{"type":"ACK.Terminal","status":"False"}],"creationTime":"2021-08-25T07:05:50Z","endpointStatus":"InService","failureReason":" Failed to download model data for container \"container_1\" from URL: \"s3://ack-data-bucket-us-west-2-309117047740-hmlyu3uon3xicxn0ebs1qhxn/sagemaker/model/delete/xgboost-mnist-model.tar.gz\". Please ensure that there is an object located at the URL and that the role passed to CreateModel has permissions to download the object.","lastEndpointConfigNameForUpdate":"xgboost-endpoint-tdugtffkpge6hw2-multi-variant-config","lastModifiedTime":"2021-08-25T07:16:37Z","latestEndpointConfigName":"xgboost-endpoint-tdugtffkpge6hw2-single-variant-config","productionVariants":[{"currentInstanceCount":2,"currentWeight":1,"deployedImages":[{"resolutionTime":"2021-08-25T07:05:52Z","resolvedImage":"433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost@sha256:54004f910467ebf7cfa71b5523b81695d103abf21a37d38dc84d63ab8d510c35","specifiedImage":"433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest"}],"desiredInstanceCount":2,"desiredWeight":1,"variantName":"variant-1"}]}	{"kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3, "account": "309117047740", "role": "", "region": "us-west-2", "is_adopted": false, "kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3}



2021-08-25T07:17:36.762Z	INFO	ackrt	Desired Status After kc.Patch {"ackResourceMetadata":{"arn":"arn:aws:sagemaker:us-west-2:309117047740:endpoint/xgboost-endpoint-tdugtffkpge6hw2","ownerAccountID":"309117047740"},"conditions":[{"type":"ACK.ResourceSynced","status":"True"},{"type":"ACK.Recoverable","status":"False"},{"type":"ACK.Terminal","status":"True","message":"unable to update endpoint. check FailureReason"}],"creationTime":"2021-08-25T07:05:50Z","endpointStatus":"InService","failureReason":" Failed to download model data for container \"container_1\" from URL: \"s3://ack-data-bucket-us-west-2-309117047740-hmlyu3uon3xicxn0ebs1qhxn/sagemaker/model/delete/xgboost-mnist-model.tar.gz\". Please ensure that there is an object located at the URL and that the role passed to CreateModel has permissions to download the object.","lastEndpointConfigNameForUpdate":"xgboost-endpoint-tdugtffkpge6hw2-faulty-config","lastModifiedTime":"2021-08-25T07:16:37Z","latestEndpointConfigName":"xgboost-endpoint-tdugtffkpge6hw2-single-variant-config","productionVariants":[{"currentInstanceCount":2,"currentWeight":1,"deployedImages":[{"resolutionTime":"2021-08-25T07:05:52Z","resolvedImage":"433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost@sha256:54004f910467ebf7cfa71b5523b81695d103abf21a37d38dc84d63ab8d510c35","specifiedImage":"433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest"}],"desiredInstanceCount":2,"desiredWeight":1,"variantName":"variant-1"}]}	{"kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3, "account": "309117047740", "role": "", "region": "us-west-2", "is_adopted": false, "kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3}



2021-08-25T07:17:36.762Z	INFO	ackrt	Cleared Desired Status After kc.Patch {"ackResourceMetadata":null,"conditions":null}	{"kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3, "account": "309117047740", "role": "", "region": "us-west-2", "is_adopted": false, "kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3}



2021-08-25T07:17:36.762Z	INFO	ackrt	Latest Status After kc.Patch {"ackResourceMetadata":{"arn":"arn:aws:sagemaker:us-west-2:309117047740:endpoint/xgboost-endpoint-tdugtffkpge6hw2","ownerAccountID":"309117047740"},"conditions":[{"type":"ACK.ResourceSynced","status":"True"},{"type":"ACK.Recoverable","status":"False"},{"type":"ACK.Terminal","status":"True","message":"unable to update endpoint. check FailureReason"}],"creationTime":"2021-08-25T07:05:50Z","endpointStatus":"InService","failureReason":" Failed to download model data for container \"container_1\" from URL: \"s3://ack-data-bucket-us-west-2-309117047740-hmlyu3uon3xicxn0ebs1qhxn/sagemaker/model/delete/xgboost-mnist-model.tar.gz\". Please ensure that there is an object located at the URL and that the role passed to CreateModel has permissions to download the object.","lastEndpointConfigNameForUpdate":"xgboost-endpoint-tdugtffkpge6hw2-faulty-config","lastModifiedTime":"2021-08-25T07:16:37Z","latestEndpointConfigName":"xgboost-endpoint-tdugtffkpge6hw2-single-variant-config","productionVariants":[{"currentInstanceCount":2,"currentWeight":1,"deployedImages":[{"resolutionTime":"2021-08-25T07:05:52Z","resolvedImage":"433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost@sha256:54004f910467ebf7cfa71b5523b81695d103abf21a37d38dc84d63ab8d510c35","specifiedImage":"433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest"}],"desiredInstanceCount":2,"desiredWeight":1,"variantName":"variant-1"}]}	{"kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3, "account": "309117047740", "role": "", "region": "us-west-2", "is_adopted": false, "kind": "Endpoint", "namespace": "default", "name": "xgboost-endpoint-tdugtffkpge6hw2", "generation": 3}


**********************************************	

I have verified by diffing the objects that:

  • desired.Status is unchanged after the kc.Patch call.
  • desired.Status is not equal to latest.Status before the kc.Patch call.
  • desired.Status is equal to latest.Status after the kc.Patch call.
  • clearedDesired.Status stays empty.
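
A check along these lines could be written as the sketch below. It is only an illustration of the four comparisons: the `Status` and `Condition` types are simplified stand-ins, `verify` is a hypothetical helper, and the `patch` callback in `main` fakes the observed overwrite rather than calling a real client.

```go
package main

import (
	"fmt"
	"reflect"
)

// Condition and Status are simplified, hypothetical stand-ins for the
// resource status shown in the logs.
type Condition struct {
	Type   string
	Status string
}

type Status struct {
	Conditions []Condition
}

// copyStatus returns a status that does not share its slice with s.
func copyStatus(s Status) Status {
	return Status{Conditions: append([]Condition(nil), s.Conditions...)}
}

// checks records the four observations listed above.
type checks struct {
	DesiredUnchanged bool
	DifferedBefore   bool
	EqualAfter       bool
	ClearedEmpty     bool
}

// verify snapshots desired, runs patch, and compares the statuses before
// and after the call.
func verify(desired, latest, cleared *Status, patch func()) checks {
	desiredBefore := copyStatus(*desired)
	differedBefore := !reflect.DeepEqual(*desired, *latest)
	patch()
	return checks{
		DesiredUnchanged: reflect.DeepEqual(desiredBefore, *desired),
		DifferedBefore:   differedBefore,
		EqualAfter:       reflect.DeepEqual(*desired, *latest),
		ClearedEmpty:     len(cleared.Conditions) == 0,
	}
}

func main() {
	desired := &Status{Conditions: []Condition{{Type: "ACK.ResourceSynced", Status: "True"}}}
	latest := &Status{Conditions: []Condition{{Type: "ACK.ResourceSynced", Status: "False"}}}
	cleared := &Status{}

	// ASSUMPTION: the callback fakes the overwrite seen in the logs.
	result := verify(desired, latest, cleared, func() {
		*latest = copyStatus(*desired)
	})
	fmt.Printf("%+v\n", result)
}
```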

latestCopy := latest.DeepCopy()
err = r.kc.Patch(
ctx,
latest.RuntimeObject(),
Collaborator

could we accomplish the same thing as this PR's changes by just doing this here?

         latest.RuntimeObject().DeepCopy(),

Contributor Author

The SetStatus() method in the AWSResource interface takes an AWSResource as its parameter. If we do the above, we will need to create another method that accepts a runtime.Object and updates the status from it.

Collaborator

But what I'm saying is that you wouldn't need to call latest.SetStatus() below if latest.RuntimeObject().DeepCopy() is used here because the latest variable would not be mutated, and you are calling latest.SetStatus(latestCopy) below to "undo" that mutation.

Contributor Author

@vijtrip2 vijtrip2 Aug 26, 2021

And if we do latest.RuntimeObject().DeepCopy() inside kc.Patch, then we will run into the resource version and optimistic lock issues we saw earlier, right? @RedbackThomson

the latest variable would not be mutated

I believe we want the metadata of latest to be mutated after the kc.Patch call.
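
To make the concern concrete, here is a rough sketch of why patching a throwaway copy could lead to a stale resource version. Everything in it is hypothetical: `server` is a toy store, and `write` assumes that a write is rejected when the caller's resourceVersion is out of date and that a successful write refreshes the object passed in. Whether a given real write enforces this depends on how it is issued.

```go
package main

import (
	"errors"
	"fmt"
)

// Object is a simplified, hypothetical stand-in for a Kubernetes object.
type Object struct {
	ResourceVersion int
	Spec            string
}

var errConflict = errors.New("conflict: stale resourceVersion")

// server is a toy store that enforces optimistic locking.
type server struct {
	stored Object
}

// write persists obj's spec if obj carries the current resourceVersion,
// then refreshes obj from the stored state (including the new version).
func (s *server) write(obj *Object) error {
	if obj.ResourceVersion != s.stored.ResourceVersion {
		return errConflict
	}
	s.stored.Spec = obj.Spec
	s.stored.ResourceVersion++
	*obj = s.stored
	return nil
}

// twoWrites performs two consecutive writes. When patchCopy is true, the
// first write goes through a copy, so latest never sees the new version.
func twoWrites(patchCopy bool) error {
	s := &server{stored: Object{ResourceVersion: 1, Spec: "a"}}
	latest := &Object{ResourceVersion: 1, Spec: "b"}

	target := latest
	if patchCopy {
		c := *latest
		target = &c
	}
	if err := s.write(target); err != nil {
		return err
	}

	latest.Spec = "c"
	return s.write(latest)
}

func main() {
	fmt.Println("patching latest:", twoWrites(false)) // prints: patching latest: <nil>
	fmt.Println("patching a copy:", twoWrites(true))  // prints: patching a copy: conflict: stale resourceVersion
}
```

Under these assumptions, the approach in this PR keeps the refreshed metadata on latest and restores only the status, which avoids the conflict on the second write.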

@vijtrip2 vijtrip2 requested a review from surajkota August 26, 2021 15:36
@RedbackThomson
Contributor

/lgtm

@ack-bot ack-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 26, 2021
@ack-bot
Collaborator

ack-bot commented Aug 26, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: A-Hilaly, RedbackThomson, vijtrip2

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [A-Hilaly,RedbackThomson,vijtrip2]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-bot ack-bot merged commit 8665453 into aws-controllers-k8s:main Aug 26, 2021
@RedbackThomson RedbackThomson added the kind/bug Categorizes issue or PR as related to a bug. label Aug 26, 2021
ack-bot pushed a commit to aws-controllers-k8s/code-generator that referenced this pull request Aug 26, 2021
Description of changes:
Commit#1
- Adds the implementation of the AWSResource.DeepCopy() method. aws-controllers-k8s/runtime#48

Commit#2
- Fixes the patching bug in the lateInitialize code to unblock the SageMaker team.
- BUG: the rm.LateInitialize() method was adding lateInitialized fields to the same object that was passed in as the parameter and returning that object as its output.
```go
	lateInitializedLatest, err := rm.LateInitialize(ctx, latest)
	rlog.Exit("rm.LateInitialize", err)
	// Always patch after late initialize because some fields may have been initialized while
	// others require a retry after some delay.
	// This patching does not hurt because if there is no diff then 'patchResourceMetadataAndSpec'
	// acts as a no-op.
	if ackcompare.IsNotNil(lateInitializedLatest) {
		patchErr := r.patchResourceMetadataAndSpec(ctx, latest, lateInitializedLatest)
		// Throw the patching error if reconciler is unable to patch the resource with late initializations
		if patchErr != nil {
			err = patchErr
		}
	}
```
- Since lateInitializedLatest and latest were the same object, the above `patchResourceMetadataAndSpec` call saw no diff and did not patch the lateInitialized fields into etcd.
- This PR fixes the bug by adding the lateInitialized fields to a copy of latest and returning that copy (without modifying latest).
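
The aliasing problem can be reproduced in isolation with the sketch below. The types and the helpers `lateInitializeInPlace`, `lateInitializeCopy`, and `needsPatch` are hypothetical and heavily simplified; they only illustrate why comparing an object with itself can never produce a diff.

```go
package main

import "fmt"

// Spec and Resource are simplified, hypothetical stand-ins for a resource
// with one late-initialized field.
type Spec struct {
	InstanceCount *int
}

type Resource struct {
	Spec Spec
}

// DeepCopy returns a copy that shares no pointers with r.
func (r *Resource) DeepCopy() *Resource {
	out := &Resource{}
	if r.Spec.InstanceCount != nil {
		v := *r.Spec.InstanceCount
		out.Spec.InstanceCount = &v
	}
	return out
}

// lateInitializeInPlace mirrors the bug: it mutates its argument and
// returns the very same pointer.
func lateInitializeInPlace(latest *Resource) *Resource {
	if latest.Spec.InstanceCount == nil {
		d := 1
		latest.Spec.InstanceCount = &d
	}
	return latest
}

// lateInitializeCopy mirrors the fix: it fills in defaults on a copy and
// leaves latest untouched.
func lateInitializeCopy(latest *Resource) *Resource {
	out := latest.DeepCopy()
	if out.Spec.InstanceCount == nil {
		d := 1
		out.Spec.InstanceCount = &d
	}
	return out
}

// needsPatch reports whether updated differs from base.
func needsPatch(base, updated *Resource) bool {
	a, b := base.Spec.InstanceCount, updated.Spec.InstanceCount
	if a == nil || b == nil {
		return a != b
	}
	return *a != *b
}

func main() {
	buggy := &Resource{}
	fmt.Println(needsPatch(buggy, lateInitializeInPlace(buggy))) // prints: false

	fixed := &Resource{}
	fmt.Println(needsPatch(fixed, lateInitializeCopy(fixed))) // prints: true
}
```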

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
