Skip to content

Conversation

kumargauravsharma
Copy link
Contributor

@kumargauravsharma kumargauravsharma commented Jun 16, 2021

Related issue: aws-controllers-k8s/community#836

This PR contains code changes to patch custom resource with latest details during long running delete.
It does so if resource manager's Delete function return requeue related errors.
The expectation is that resource manager's ReadOne method would set appropriate conditions (example: ACK.ResourceSynced to false with message) and provide latest details about the resource which is being deleted.
Conditions can also be set at some common place where common conditions logic should reside.

Testing:
make test passed for runtime.

Alternate approach that was considered, but was not taken up as it involved non-intrusive changes to resource manager's APIs. Following are the details:
Currently, resource manager's Delete API returns only one data field of type error, it does not return resource.
Also, the sdk.go::sdkDelete() logic does not parse the return result output of service delete API.

It can be updated such that sdk.go::sdkDelete() logic parses the output response, and allows a sdk_delete_post_set_output hook for service controllers to have custom logic (as needed). The resultant resource reference would then be returned from this method.
In case, requeue is required then the return values from this method will have both resource, error (requeue type) as not nil.
And the reconciler:cleanup() logic here, would patch the current resource with the returned resource from rm.Delete() call and return the error (if any).

Though this change is similar to the approach taken for resource manager ReadOne here, this is more intrusive as it changes the Delete API's return type.
However, it does have advantage that there is no need for another ReadOne call from reconciler, and also provides flexibility to service controller's to implement custom hook specific to delete scenario.

[EDIT]: the alternate approach that is mentioned above, is the final approach taken per the discussion/comments in this PR.
Related PR in code-generator: aws-controllers-k8s/code-generator#114

@ack-bot ack-bot requested review from jaypipes and vijtrip2 June 16, 2021 02:34
Comment on lines 330 to 331
if errors.As(err, &requeueNeededAfter) ||
errors.As(err, &requeueNeeded) {
Copy link
Member

@a-hilaly a-hilaly Jun 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not in favor of using errors.As here, first because it's using the reflect library and second because it's a code that can panic if misused https://golang.org/src/errors/wrap.go?s=2494:2537#L67

Another small remark: &requeueNeededAfter is a "pointer to a pointer to a requeue.RequeueNeededAfter"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use Go type assertions/switches for this use case. Something similar to:

_, ok := err.(*requeue.RequeueNeededAfter)
if ok {
    // ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated code per the comment.

Comment on lines 334 to 336
observed, _ := rm.ReadOne(ctx, current)
if observed != nil {
_ = r.patchResource(ctx, current, observed)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ignore the returned errors here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block of code executes when Delete api returned error and requeue is requested. ReadOne error is ignored as during the next reconciler loop run (due to requeue) the correct course of action would be taken.
Error from path resources are ignored throughout in the reconciler at other places as well, thus followed the same.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block of code executes when Delete api returned error and requeue is requested. ReadOne error is ignored as during the next reconciler loop run (due to requeue) the correct course of action would be taken.
Error from path resources are ignored throughout in the reconciler at other places as well, thus followed the same.

@kumargauravsharma I think @a-hilaly was referring to ignoring the returned error from rm.ReadOne(), not patchResource... (I think?) :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the comment was for both.
For error during read one:
There are two errors available at that time a) error returned from Delete() method b) error returned from Delete call.
The code returns the error received from Delete call as that is the primary operation being done in this method.
ReadOne error is ignored as during the next reconciler loop run (due to requeue) the correct course of action would be taken.

Copy link
Member

@a-hilaly a-hilaly Jun 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should at least log these errors. It would be hard to track them if the metrics server is reporting few 4XX/5XX but logs are not showing any

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should at least log these errors. It would be hard to track them if the metrics server is reporting few 4XX/5XX but logs are not showing any

nvm this was added in the recent logging PR :)

@jaypipes
Copy link
Collaborator

@kumargauravsharma if a call to DeleteReplicationGroup returns success, can the ReplicationGroup ever end up in any state other than deleted?

@kumargauravsharma
Copy link
Contributor Author

@jaypipes no error from DeleteReplicationGroup API call set the replication group state as 'deleting'. It may take some time for it to get deleted, meanwhile the resource describe api does return the resource details as delete progresses.

@jaypipes
Copy link
Collaborator

@jaypipes no error from DeleteReplicationGroup API call set the replication group state as 'deleting'. It may take some time for it to get deleted, meanwhile the resource describe api does return the resource details as delete progresses.

(apologies for the late reply, I'm on PTO this week...)

@kumargauravsharma ok, thanks for the info. I just want to make sure I'm understanding your use cases clearly...you want this functionality in order to match the Kubernetes API experience to the Elasticache API experience: that the replication group resource continues to appear in a DescribeReplicationGroup API call -- and the ReplicationGroup CR will continue to appear in calls to kubectl describe replicationgroups/{name} with a Status.Status == deleting -- even after a DeleteReplicationGroup API call returns no error and the replication group will only ever transition to a deleted state (and not appear in a DescribeReplicationGroup API call). This is about allowing the ReplicationGroup CR to exist for that period of time while it is in progress of being deleted, yes?

Copy link
Collaborator

@jaypipes jaypipes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kumargauravsharma please see inline for a suggested alternative approach that I think would align our Delete method more with the other ReadOne/Create/Update resource manager methods...

Comment on lines 334 to 336
observed, _ := rm.ReadOne(ctx, current)
if observed != nil {
_ = r.patchResource(ctx, current, observed)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block of code executes when Delete api returned error and requeue is requested. ReadOne error is ignored as during the next reconciler loop run (due to requeue) the correct course of action would be taken.
Error from path resources are ignored throughout in the reconciler at other places as well, thus followed the same.

@kumargauravsharma I think @a-hilaly was referring to ignoring the returned error from rm.ReadOne(), not patchResource... (I think?) :)

}
return err
}
if err = rm.Delete(ctx, observed); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding in the logic below to call rm.ReadOne(), how about changing the rm.Delete() method signature to return (acktypes.AWSResource, error) similar to the acktypes.AWSResourceManager:ReadOne(), Create() and Update() methods?

That way, rm.Delete() implementations for asynchronously-completing delete operations (like Elasticache's RGs) could simply return a modified AWSResource containing the fields that should be saved via patchResource()? Then the below code could be simplified to this:

latest, err = rm.Delete(ctx, observed)
if latest != nil {
    // The Delete operation is likely asynchronous and has likely set a Status
    // field on the returned CR to something like `deleting`. Here, we patchResource()
    // in order to save these Status field modifications.
    _ = r.patchResource(ctx, latest)
}
if err != nil {
    // NOTE: Delete() implementations that have asynchronously-completing
    // deletions should return a RequeueNeededAfter.
    return err
}

thoughts?

Copy link
Contributor Author

@kumargauravsharma kumargauravsharma Jun 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this approach a thought while creating this PR and mentioned its pros, cons in the PR description at that time, please refer section with text in PR description:

Alternate approach that was considered ...

Downside of this approach was the scope of the changes that it brings: changes to the API signature of resource manager delete, sdkDelete method, and controllers would need to be regenerated with unit tests for delete and code generator templates.

However, I prefer this approach of changing API return type and this PR can be updated for it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this approach a thought while creating this PR and mentioned its pros, cons in the PR description at that time, please refer section with text in PR description:

Alternate approach that was considered ...

Indeed you did :) Sorry I missed that.

However, I prefer this approach of changing API return type and this PR can be updated for it.

OK, cool. Let's cut a release of runtime (and probably code-generator) with the enhancements made over the last couple days and then modify this PR to change the Delete() method signature.

Work for you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thanks Jay!

Copy link
Member

@surajkota surajkota Jun 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for this approach for two reasons. SageMaker has resources where:

  1. Delete = Stop so readOne will never be NotFound (InProgress -(call delete)-> Stopping -> Stopped)
  2. Delete resource returns success, but can end up DeleteFailed state so we can add custom code to retry and keep returning requeueerror until the resource is actually cleaned up

Copy link
Member

@surajkota surajkota Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kumargauravsharma any plan to update this PR soon?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this PR with the approach that we concluded.
Corresponding code-generator changes are available in PR: aws-controllers-k8s/code-generator#114

@kumargauravsharma
Copy link
Contributor Author

@jaypipes no error from DeleteReplicationGroup API call set the replication group state as 'deleting'. It may take some time for it to get deleted, meanwhile the resource describe api does return the resource details as delete progresses.

(apologies for the late reply, I'm on PTO this week...)

@kumargauravsharma ok, thanks for the info. I just want to make sure I'm understanding your use cases clearly...you want this functionality in order to match the Kubernetes API experience to the Elasticache API experience: that the replication group resource continues to appear in a DescribeReplicationGroup API call -- and the ReplicationGroup CR will continue to appear in calls to kubectl describe replicationgroups/{name} with a Status.Status == deleting -- even after a DeleteReplicationGroup API call returns no error and the replication group will only ever transition to a deleted state (and not appear in a DescribeReplicationGroup API call). This is about allowing the ReplicationGroup CR to exist for that period of time while it is in progress of being deleted, yes?

The PR: aws-controllers-k8s/elasticache-controller#34 takes care of ensuring that the ReplicationGroup CR exists for the period of time while delete is in progress.
However, the ReplicationGroup CR becomes out of sync from its latest state on service side. Thus, this PR #21 was to fix that behavior of reconciler.

nit: once the ReplicationGroup gets deleted on service side, the sdkFind returns not found exception, as there is no ' deleted' status.

@kumargauravsharma kumargauravsharma changed the title Provide custom resource delete progress for long running delete Returning the latest resource from the Delete() call Jul 12, 2021
Copy link
Collaborator

@jaypipes jaypipes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your patience @kumargauravsharma.

@jaypipes
Copy link
Collaborator

/lgtm

@ack-bot ack-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2021
@ack-bot
Copy link
Collaborator

ack-bot commented Jul 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jaypipes, kumargauravsharma, surajkota

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-bot ack-bot merged commit 308bc45 into aws-controllers-k8s:main Jul 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants