Which component are you using?:
/area cluster-autoscaler
/area core-autoscaler
/wg device-management
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
During autoscaling simulations, Cluster Autoscaler has to predict what a new, empty Node from a given NodeGroup would look like if CA were to scale the NodeGroup up. This is called a template NodeInfo, and the logic for computing it is roughly:
- If the NodeGroup has at least 1 healthy Node, CA takes that Node as a base for the template and sanitizes it - changes the parts that are Node-specific (like Name or UID), and removes Pods that are not DaemonSet/static (because they won't be present on a new Node).
- If the NodeGroup doesn't have any healthy Nodes, CA delegates computing the template to CloudProvider.TemplateNodeInfo(). Most CloudProvider.TemplateNodeInfo() implementations create the template in-memory from some information tracked on the CloudProvider side for the NodeGroup (e.g. a VM instance template).
The first method is pretty reliable, but it requires keeping at least 1 Node in the NodeGroup at all times, which can be cost-prohibitive for expensive hardware. The reliability of the second method varies between CloudProvider implementations.
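For illustration, here is a minimal Go sketch of that two-step logic. The NodeGroup interface, the NodeInfo struct and the helper names below are simplified stand-ins made up for this sketch, not the actual cluster-autoscaler types:

```go
// Illustrative sketch of template NodeInfo computation; the NodeGroup and
// NodeInfo types here are simplified stand-ins, not CA's real interfaces.
package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/uuid"
)

// NodeInfo is a simplified stand-in for CA's template NodeInfo:
// a Node plus the Pods expected to run on it.
type NodeInfo struct {
	Node *corev1.Node
	Pods []*corev1.Pod
}

// NodeGroup is a simplified stand-in for the CloudProvider NodeGroup interface.
type NodeGroup interface {
	// HealthyNodes returns the current, healthy Nodes in the group (may be empty).
	HealthyNodes() ([]*corev1.Node, error)
	// PodsOn returns the Pods currently scheduled on the given Node.
	PodsOn(node *corev1.Node) ([]*corev1.Pod, error)
	// TemplateNodeInfo builds a template in-memory, e.g. from a VM instance template.
	TemplateNodeInfo() (*NodeInfo, error)
}

// templateNodeInfo mirrors the two-step logic: prefer sanitizing a real Node,
// fall back to the cloud-provider-built template for scale-from-0.
func templateNodeInfo(ng NodeGroup) (*NodeInfo, error) {
	nodes, err := ng.HealthyNodes()
	if err != nil {
		return nil, err
	}
	if len(nodes) > 0 {
		pods, err := ng.PodsOn(nodes[0])
		if err != nil {
			return nil, err
		}
		return sanitize(nodes[0], pods), nil
	}
	// Scale-from-0: no real Node to copy, delegate to the cloud provider.
	return ng.TemplateNodeInfo()
}

// sanitize copies a real Node, rewrites Node-specific fields, and keeps only
// the Pods that would also appear on a brand-new Node (DaemonSet or static Pods).
func sanitize(node *corev1.Node, pods []*corev1.Pod) *NodeInfo {
	tmpl := node.DeepCopy()
	tmpl.Name = fmt.Sprintf("template-%s-%s", node.Name, uuid.NewUUID())
	tmpl.UID = uuid.NewUUID()

	var kept []*corev1.Pod
	for _, p := range pods {
		if isDaemonSetPod(p) || isStaticPod(p) {
			kept = append(kept, p.DeepCopy())
		}
	}
	return &NodeInfo{Node: tmpl, Pods: kept}
}

func isDaemonSetPod(p *corev1.Pod) bool {
	for _, ref := range p.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

func isStaticPod(p *corev1.Pod) bool {
	// Static Pods are mirrored into the API server with this annotation.
	_, ok := p.Annotations["kubernetes.io/config.mirror"]
	return ok
}
```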
To support DRA, CloudProvider.TemplateNodeInfo() has to predict ResourceSlices and potentially ResourceClaims in addition to the Node and its Pods.
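As a rough illustration, this is the kind of ResourceSlice a scale-from-0 template would have to include for a NodeGroup whose Nodes expose DRA-managed devices. The types are from resource.k8s.io/v1beta1 (field names may differ in other API versions), and the driver, pool and attribute names are made up:

```go
// Illustrative only: the kind of ResourceSlice a scale-from-0 template would
// need to predict for a node group whose nodes expose DRA-managed GPUs.
// Driver/pool/attribute names are made up for this sketch.
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// templateResourceSlice builds a ResourceSlice describing the devices that a
// new Node from the group is expected to publish once its DRA driver starts.
func templateResourceSlice(templateNodeName string) *resourceapi.ResourceSlice {
	return &resourceapi.ResourceSlice{
		ObjectMeta: metav1.ObjectMeta{
			Name: templateNodeName + "-gpu.example.com",
		},
		Spec: resourceapi.ResourceSliceSpec{
			Driver:   "gpu.example.com", // hypothetical DRA driver
			NodeName: templateNodeName,  // ties the slice to the template Node
			Pool: resourceapi.ResourcePool{
				Name:               templateNodeName,
				ResourceSliceCount: 1,
			},
			Devices: []resourceapi.Device{
				{
					Name: "gpu-0",
					Basic: &resourceapi.BasicDevice{
						Attributes: map[resourceapi.QualifiedName]resourceapi.DeviceAttribute{
							"model": {StringValue: ptr.To("example-gpu")},
						},
					},
				},
			},
		},
	}
}
```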
We have the following problems with the current setup:
- Template NodeInfos have poor visibility and debuggability. CA doesn't log much about them (because of the volume), so in case of tricky issues a debugging snapshot has to be taken on-demand and analyzed. This can only be done by the cluster admin; a regular cluster user doesn't really have any visibility.
- There is no standard way for a regular cluster user to influence the CloudProvider.TemplateNodeInfo() templates (e.g. if the user has a DS pod that exposes an extended resource). Some CloudProvider implementations give the cluster user some control (e.g. AWS, via ASG tags), but even though they allow configuring the same things (e.g. extended resources), they do so in provider-specific ways (e.g. ASG tags on AWS vs KUBE_ENV variable in MIG instance templates on GCE).
- There are more template objects to track with DRA, and the new objects can be quite complex. Creating them in-memory from scratch might become non-trivial, and could in some cases be better delegated to another component where the logic fits better (e.g. a cloud provider control plane creating the NodeGroup).
Describe the solution you'd like.:
IMO we should integrate the template NodeInfo concept with the K8s API.
We could introduce a NodeTemplate/NodeGroupTemplate CRD (a rough sketch of possible API types follows the list below):
- The Spec would contain the scale-from-0 template NodeInfo (i.e. today's CloudProvider.TemplateNodeInfo()) set by the cluster admin.
- The Spec would allow the cluster user to modify/override the scale-from-0 template.
- The Status would contain the actual template used by Cluster Autoscaler. This could be the scale-from-0 template from the Spec, but it could be obtained differently (e.g. by sanitizing a real Node).
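To make the idea more concrete, here is a rough sketch of what the CRD's Go types could look like. Everything below - the names, the fields, the split between Spec and Status - is hypothetical and only meant to anchor the discussion:

```go
// Hypothetical API types for the proposed CRD; every name and field below is a
// sketch for discussion, not an agreed-upon API.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeTemplate describes what a new Node created by scaling up a given
// NodeGroup is expected to look like.
type NodeTemplate struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NodeTemplateSpec   `json:"spec,omitempty"`
	Status NodeTemplateStatus `json:"status,omitempty"`
}

type NodeTemplateSpec struct {
	// NodeGroupRef identifies the NodeGroup this template applies to
	// (e.g. an ASG name on AWS or a MIG name on GCE).
	NodeGroupRef string `json:"nodeGroupRef"`

	// ScaleFromZero is the template used when the NodeGroup has no healthy
	// Nodes, i.e. what CloudProvider.TemplateNodeInfo() returns today. It can
	// be set by the cluster admin or by a cloud provider control plane.
	ScaleFromZero *NodeInfoTemplate `json:"scaleFromZero,omitempty"`

	// Overrides lets a regular cluster user adjust the scale-from-0 template,
	// e.g. add an extended resource exposed by a DaemonSet pod.
	Overrides *NodeInfoOverrides `json:"overrides,omitempty"`
}

type NodeTemplateStatus struct {
	// Template is the NodeInfo actually used by Cluster Autoscaler in
	// simulations. It may come from the Spec, or be derived differently
	// (e.g. by sanitizing a real Node from the NodeGroup).
	Template *NodeInfoTemplate `json:"template,omitempty"`

	// Source records how the template in Status was obtained,
	// e.g. "SanitizedRealNode" or "ScaleFromZeroSpec".
	Source string `json:"source,omitempty"`
}

// NodeInfoTemplate mirrors CA's template NodeInfo: a Node, the DaemonSet/static
// Pods expected on it, and (with DRA) the ResourceSlices its drivers publish.
type NodeInfoTemplate struct {
	Node           corev1.Node                 `json:"node"`
	Pods           []corev1.Pod                `json:"pods,omitempty"`
	ResourceSlices []resourceapi.ResourceSlice `json:"resourceSlices,omitempty"`
}

// NodeInfoOverrides is a deliberately small set of user-facing knobs.
type NodeInfoOverrides struct {
	Labels            map[string]string   `json:"labels,omitempty"`
	ExtendedResources corev1.ResourceList `json:"extendedResources,omitempty"`
	Taints            []corev1.Taint      `json:"taints,omitempty"`
}
```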
Such a CRD would help us with the problems above:
- We'd have visibility into template NodeInfos used by CA through the Status.
- We'd have a standard, CloudProvider-agnostic way for a regular cluster user to modify the templates - by changing the Spec.
- Computing the scale-from-0 template could be delegated to a component other than CA - the component would just change the Spec.
There are a lot of details to be figured out, in particular how this relates to the Karpenter NodePool model. If it makes sense, we should generalize the concept to be useful for both Node Autoscalers. In any case, this seems like it would require writing a KEP.
Additional context.:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.