cluster-autoscaler/proposals/csinode.md (82 additions, 0 deletions)
@@ -0,0 +1,82 @@
## Using CSINode allocatable limits when scaling nodes for pods that use volumes

## Introduction
Currently, cluster-autoscaler doesn't take into account the volume-attach limit that a node may have when scaling nodes to support unschedulable pods.
Contributor: Hm, I thought I reported this in #4517 and fixed it in #4539. 🤔

Am I missing something?

Member Author (@gnufied, May 7, 2025): What you are talking about is a feature of the scheduler, and yes, the scheduler is aware of CSI volume limits.

But cluster-autoscaler itself doesn't take into account how many nodes it should spin up to satisfy pending pods that need X number of volumes. For example: in a nodegroup X there are 0 nodes and, assuming all other nodegroups are at max capacity, the autoscaler must scale nodegroup X to schedule the pending pods. Now say there are 5 pending pods that each use 20 volumes, so 100 volumes in total. Cluster Autoscaler currently can't figure out that, on AWS for example, each of those nodes can support 20 volumes and hence it must spin up 5 new nodes. So what will happen is: cluster-autoscaler will spin up exactly 1 node and, assuming CPU/memory requirements are satisfied, all 5 pods will get scheduled to the same node, and except for 1 pod, the rest will be stuck in ContainerCreating forever until someone manually intervenes.

Part of the reason is that the scheduler allows scheduling pods to a node that doesn't have a CSI driver installed. So an upcoming node N will have unlimited capacity until the CSI driver is installed on the node and it starts to report a capacity. But while the CSI driver is getting installed on the node, cluster-autoscaler will send *all* pending pods to the same node, assuming CPU/memory and other constraints are met.

Comment: How will this proposal work with the existing CSI volume limits? Can you help clarify the flow?


This leads to a bunch of problems:
- If there are unschedulable pods that require more volumes than the newly created nodes support, some pods will remain unschedulable.

- Since a node typically does not come up with the CSI driver already installed, too many pods usually get scheduled onto it, often more than the node can actually support in the first place. This leaves a bunch of pods simply stuck.

## Implementation

Unfortunately, this can’t be fixed in cluster-autoscaler alone. We are proposing changes in both kubernetes/kubernetes and cluster-autoscaler.

## Kubernetes Scheduler change

The first change we propose in the Kubernetes scheduler is to not schedule pods that require CSI volumes onto a node that is not reporting any installed CSI drivers.

The proposed change is small and a draft PR is available here - https://github.com/kubernetes/kubernetes/pull/130702

This will stop too many pods from crowding onto a newly spun-up node that is not yet reporting volume limits.
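
As a rough illustration (this is a sketch of the idea, not the code in the draft PR, and the helper names here are invented for illustration), the scheduler-side check boils down to refusing a node whose CSINode object lists no drivers when the pod needs CSI volumes:

```
import (
    v1 "k8s.io/api/core/v1"
    storagev1 "k8s.io/api/storage/v1"
)

// podUsesCSIVolumes is an illustrative helper: true if any of the pod's volumes
// is an inline CSI volume or a PVC (which may be backed by a CSI driver).
func podUsesCSIVolumes(pod *v1.Pod) bool {
    for _, vol := range pod.Spec.Volumes {
        if vol.CSI != nil || vol.PersistentVolumeClaim != nil {
            return true
        }
    }
    return false
}

// shouldSkipNode returns true when the pod requires CSI volumes but the node's
// CSINode object does not report any installed CSI drivers yet.
func shouldSkipNode(pod *v1.Pod, csiNode *storagev1.CSINode) bool {
    if !podUsesCSIVolumes(pod) {
        return false
    }
    return csiNode == nil || len(csiNode.Spec.Drivers) == 0
}
```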

But this alone is not enough to fix the underlying problem. In part 2, cluster-autoscaler must be made aware of the attach limits of a node via its CSINode object.

## Cluster Autoscaler changes

We can split the implementation in cluster-autoscaler in two parts:
- Scaling a node-group that already has one or more nodes.
- Scaling a node-group that doesn't have any nodes yet (scaling from zero).

### Scaling a node-group that already has one or more nodes.

1. We propose a label, similar to GPULabel, added to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.

Comment: What happens if there are multiple drivers installed?

Member Author: We are changing this mechanism to not use GPU-like labels anymore. We are following what was done in #8109.

Please see kubernetes/enhancements#5031 for a more up-to-date version.

Comment: In cases where CSI drivers are only installed on a subset of nodes, how does the autoscaler get this info? (e.g. the ebs-csi-node DaemonSet ignores instance type x, or is disabled on Windows nodes)

Does the autoscaler have to parse the CSI node DaemonSet's node affinity? Does the cluster admin have to configure whether nodes from X node-group/node-pool should have a CSI driver?

Otherwise the CSI driver will never be installed and the node will be marked NotReady. (I guess if no driver is ever scheduled onto the node, then we can ignore this requirement?)

Contributor: I think it would be nice to see how the label will be used.

Member Author: I have updated the proposal with readiness details. Please see kubernetes/enhancements#5031 for more details.


However, we also propose that a node will be considered ready as soon as the corresponding CSI driver is reported as installed via the corresponding CSINode object.

A node which is otherwise ready but does not have the CSI driver installed within a certain time limit will be considered NotReady and removed from the cluster.
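
For illustration, a readiness check along these lines could back that behavior; the helper name and how the expected driver is determined for a node group are assumptions, not part of the proposal:

```
import storagev1 "k8s.io/api/storage/v1"

// csiReady is an illustrative helper: the node counts as CSI-ready only once
// its CSINode object lists the driver we expect this node group to run.
func csiReady(csiNode *storagev1.CSINode, expectedDriver string) bool {
    if csiNode == nil {
        return false
    }
    for _, d := range csiNode.Spec.Drivers {
        if d.Name == expectedDriver {
            return true
        }
    }
    return false
}
```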


2. We propose that we add volume limits and installed CSI driver information to framework.NodeInfo objects. So:

```
type NodeInfo struct {
    ....
    ....
    csiDrivers map[string]*DriverAllocatable
    ..
}

type DriverAllocatable struct {
    Allocatable int32
}
```

Comment (on `csiDrivers`): Are these mutable?

Comment: +1, the allocatable count can change for some storage providers (e.g. some EC2 instance types share network interfaces and EBS volume attachment limits, so if an extra ENI is added later, new stateful workloads may get stuck).

Storage providers might have to deploy a controller that watches for additional ENIs and edits this allocatable, or the result of NodeGetInfo may need to be percolated up to this DriverAllocatable struct due to kubernetes/enhancements#4876.

Member Author: This is an in-memory representation of the limits read from the actual CSINode objects. They will be resynced via the snapshot mechanism in CAS.

For all intents and purposes, it doesn't matter whether these are mutable or immutable, because they will be rebuilt whenever the informer reports changes in CSINode objects.

Comment (@sunnylovestiramisu, May 7, 2025, on `Allocatable`): Is this extensible to multiple different attach limits in the future? Like what was discussed in the K8s Advanced Volume Attach Limits Strawman.

Member Author: Not yet. It is doable, but we aren't yet implementing what was not implemented for the in-tree scheduler limits.
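
A possible way to populate this map from a CSINode object is sketched below; the helper name and the handling of drivers that do not report an allocatable count are assumptions for illustration:

```
import storagev1 "k8s.io/api/storage/v1"

// buildCSIDrivers is an illustrative helper that turns a CSINode object into
// the in-memory csiDrivers map described above. Drivers that report no
// allocatable count are stored with Allocatable = 0 here; how such drivers
// should really be treated is an open detail of the proposal.
func buildCSIDrivers(csiNode *storagev1.CSINode) map[string]*DriverAllocatable {
    drivers := make(map[string]*DriverAllocatable)
    if csiNode == nil {
        return drivers
    }
    for _, d := range csiNode.Spec.Drivers {
        alloc := &DriverAllocatable{}
        if d.Allocatable != nil && d.Allocatable.Count != nil {
            alloc.Allocatable = *d.Allocatable.Count
        }
        drivers[d.Name] = alloc
    }
    return drivers
}
```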

3. We propose that, when saving `ClusterState`, we capture and add `csiDrivers` information for all existing nodes.

4. We propose that, when getting nodeInfosForGroups, the returned nodeInfo map also contains CSI driver information, which can be used later for scheduling decisions.

```
nodeInfosForGroups, autoscalerError := a.processors.TemplateNodeInfoProvider.Process(autoscalingContext, readyNodes, daemonsets, a.taintConfig, currentTime)
```

Please note that we will have to handle the case of scaling from 0 separately from
scaling from 1, because in the former case no CSI volume limit information will be available
if no node exists in a NodeGroup.

5. We propose that, when deciding which pods should be considered for scaling nodes in the podListProcessor.Process function, we update the hinting_simulator to consider CSI volume limits of existing nodes, as sketched below. This will allow cluster-autoscaler to know exactly whether all unschedulable pods will fit on recently spun-up or currently running nodes.
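
As a sketch of what such a fit check might look like (the function and the per-driver volume counting are illustrative assumptions, not the actual hinting_simulator change):

```
// fitsVolumeLimits is an illustrative check: does the node have enough
// remaining attach capacity for every CSI driver the pod needs volumes from?
// podVolumesPerDriver counts the pod's volumes per driver; usedPerDriver counts
// attachments already consumed by pods placed on the node during simulation.
func fitsVolumeLimits(podVolumesPerDriver, usedPerDriver map[string]int32, csiDrivers map[string]*DriverAllocatable) bool {
    for driver, needed := range podVolumesPerDriver {
        alloc, ok := csiDrivers[driver]
        if !ok {
            // The node does not report this driver at all, so the pod cannot fit.
            return false
        }
        if usedPerDriver[driver]+needed > alloc.Allocatable {
            return false
        }
    }
    return true
}
```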

Comment: How to pick the existing nodes if they have very different attach limits?


Making the aforementioned changes should allow us to handle scaling of nodes from 1.

### Scaling from zero

Scaling from zero should work similarly to scaling from 1, but the main problem is that we do not have a NodeInfo object which can tell us what the CSI attach limit would be on a node being spun up in the NodeGroup.

We propose that we introduce an annotation, similar to the CPU and memory resource annotations in cluster-api, to convey the attach limits available on a node.

We will have to introduce a similar mechanism in the various cloud providers that return Template objects, so that they incorporate volume limits. This will allow us to handle the case of scaling from zero.
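
For example, a hypothetical annotation key (the name below is purely illustrative, not a decided one) could be parsed from the scalable resource's annotations, in the same spirit as the existing cluster-api capacity annotations for CPU and memory used for scaling from zero:

```
import (
    "strconv"
    "strings"
)

// Hypothetical annotation prefix; the real key would be decided in the enhancement.
const attachLimitAnnotationPrefix = "capacity.cluster-autoscaler.kubernetes.io/csi-attach-limit-"

// attachLimitsFromAnnotations is an illustrative parser. It reads per-driver
// attach limits from annotations such as
//   capacity.cluster-autoscaler.kubernetes.io/csi-attach-limit-ebs.csi.aws.com: "25"
// and returns them keyed by CSI driver name.
func attachLimitsFromAnnotations(annotations map[string]string) map[string]int32 {
    limits := make(map[string]int32)
    for key, value := range annotations {
        if !strings.HasPrefix(key, attachLimitAnnotationPrefix) {
            continue
        }
        driver := strings.TrimPrefix(key, attachLimitAnnotationPrefix)
        n, err := strconv.ParseInt(value, 10, 32)
        if err != nil {
            continue // ignore malformed values in this sketch
        }
        limits[driver] = int32(n)
    }
    return limits
}
```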

## Alternatives Considered

Certain Kubernetes vendors taint a node when it is created, and the CSI driver has logic to remove the taint when the CSI driver starts on the node, for example:
- https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/2309
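
A minimal sketch of that pattern, assuming a hypothetical taint key (real drivers use their own keys; see the linked PR for an actual implementation):

```
import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// Illustrative taint key; each driver defines its own.
const startupTaintKey = "example.csi.driver/agent-not-ready"

// removeStartupTaint sketches the pattern: once the CSI node plugin has
// registered on the node, remove the startup taint so the scheduler can place
// volume-using pods there.
func removeStartupTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
    node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    kept := node.Spec.Taints[:0]
    for _, t := range node.Spec.Taints {
        if t.Key != startupTaintKey {
            kept = append(kept, t)
        }
    }
    node.Spec.Taints = kept
    _, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
    return err
}
```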