Which component are you using?:
/area cluster-autoscaler
/area core-autoscaler
/wg device-management
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
PredicateSnapshot methods like AddNodeInfo() or SchedulePod() can fail because of DRA-related issues, but they don't always clean up the partial DRA snapshot modifications made before the error. This shouldn't be a problem for the MVP implementation, because these errors mean aborting the whole loop anyway (see #7784), and the snapshot is recreated from scratch in the next loop. It will become a problem if we want to continue the loop after seeing these errors, so it should probably be tackled together with #7530.
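
A minimal sketch of the failure mode, using simplified stand-in types rather than the actual cluster-autoscaler code (the names `addNodeInfo`, `draSnapshot`, etc. are illustrative only):

```go
package main

import "fmt"

// draSnapshot is a simplified stand-in for dynamicresources.Snapshot.
type draSnapshot struct {
	nodeResourceSlices map[string][]string // node name -> ResourceSlice names
	claims             map[string]string   // claim name -> state
}

// predicateSnapshot is a simplified stand-in for PredicateSnapshot.
type predicateSnapshot struct {
	dra *draSnapshot
}

func (s *predicateSnapshot) addNodeInfo(nodeName string, sliceNames []string, podClaims []string) error {
	// Step 1: the DRA snapshot is modified first...
	s.dra.nodeResourceSlices[nodeName] = sliceNames

	// Step 2: ...but a later DRA-related step can fail and return
	// without undoing the modification made in step 1.
	for _, claim := range podClaims {
		if _, ok := s.dra.claims[claim]; !ok {
			return fmt.Errorf("adding node %s: claim %s is not tracked in the DRA snapshot", nodeName, claim)
		}
		s.dra.claims[claim] = "reserved"
	}
	return nil
}

func main() {
	snap := &predicateSnapshot{dra: &draSnapshot{
		nodeResourceSlices: map[string][]string{},
		claims:             map[string]string{},
	}}
	err := snap.addNodeInfo("node-1", []string{"slice-a"}, []string{"claim-x"})
	fmt.Println("error:", err)
	// The ResourceSlices for node-1 are still in the DRA snapshot even though
	// addNodeInfo failed, so the snapshot is now inconsistent.
	fmt.Println("leftover slices:", snap.dra.nodeResourceSlices["node-1"])
}
```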
Describe the solution you'd like.:
The most obvious solution would probably be to add clean-up defers to PredicateSnapshot methods, and use dynamicresources.Snapshot methods to reverse already performed actions. One caveat here is that ResourceClaim allocations are made by the DRA scheduler plugin code, so they aren't easily reversible from within PredicateSnapshot (we don't know what to reverse). We could solve that by having snapshotClaimTracker.SignalClaimPendingAllocation() save the modification to some intermediate place that can be rolled back, instead of just directly modifying the claim in the DRA snapshot. Then PredicateSnapshot could just call something like dynamicresources.Snapshot.RollBackLastClaimAllocations().
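
A rough sketch of the proposed shape, assuming hypothetical names (`recordClaimAllocation`, `rollBackLastClaimAllocations`, `commitClaimAllocations`) on a simplified DRA snapshot; the real dynamicresources.Snapshot and snapshotClaimTracker APIs may end up looking different:

```go
package main

import (
	"errors"
	"fmt"
)

type claimAllocation struct {
	claimName string
	// The allocation result signalled by the DRA scheduler plugin would go here.
}

// draSnapshot is a simplified stand-in for dynamicresources.Snapshot.
type draSnapshot struct {
	claims map[string]bool // claim name -> allocated?
	// Pending allocations are staged here instead of being written directly,
	// so a failed PredicateSnapshot operation can undo them.
	pendingAllocations []claimAllocation
}

// recordClaimAllocation is what snapshotClaimTracker.SignalClaimPendingAllocation()
// could call instead of directly modifying the claim in the DRA snapshot.
func (s *draSnapshot) recordClaimAllocation(claimName string) {
	s.claims[claimName] = true
	s.pendingAllocations = append(s.pendingAllocations, claimAllocation{claimName: claimName})
}

// rollBackLastClaimAllocations undoes every allocation staged since the last commit,
// playing the role of the proposed dynamicresources.Snapshot.RollBackLastClaimAllocations().
func (s *draSnapshot) rollBackLastClaimAllocations() {
	for _, a := range s.pendingAllocations {
		s.claims[a.claimName] = false
	}
	s.pendingAllocations = nil
}

// commitClaimAllocations finalizes the staged allocations on success.
func (s *draSnapshot) commitClaimAllocations() {
	s.pendingAllocations = nil
}

// predicateSnapshot is a simplified stand-in for PredicateSnapshot.
type predicateSnapshot struct {
	dra *draSnapshot
}

func (s *predicateSnapshot) schedulePod(podName, nodeName string, claims []string) (err error) {
	// Clean-up defer: if any step below fails, revert the staged DRA modifications.
	defer func() {
		if err != nil {
			s.dra.rollBackLastClaimAllocations()
		} else {
			s.dra.commitClaimAllocations()
		}
	}()

	for _, c := range claims {
		s.dra.recordClaimAllocation(c)
	}
	// Simulate a later failure (e.g. a scheduler predicate rejecting the pod).
	return errors.New("predicate failed for pod " + podName + " on node " + nodeName)
}

func main() {
	snap := &predicateSnapshot{dra: &draSnapshot{claims: map[string]bool{}}}
	_ = snap.schedulePod("pod-1", "node-1", []string{"claim-x"})
	// Thanks to the rollback, the failed schedulePod left no claim allocated.
	fmt.Println("claim-x allocated:", snap.dra.claims["claim-x"])
}
```

The design choice here is that SignalClaimPendingAllocation() only stages modifications, so PredicateSnapshot doesn't need to know what the DRA scheduler plugin allocated in order to reverse it.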
Additional context.:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.