# Add proposal for CSI node awareness (#8083)
## Using CSINode allocatable limits for scaling nodes that are required for pods that use volumes
## Introduction

Currently, cluster-autoscaler does not take into account the volume-attach limit that a node may have when scaling nodes to support unschedulable pods.

This leads to several problems:
- If unschedulable pods require more volumes than a newly created node can support, some pods will remain unschedulable.
- Since a node typically does not come up with a CSI driver already installed, too many pods usually get scheduled onto a node that may not be able to support them in the first place. This leaves a bunch of pods simply stuck.
## Implementation

Unfortunately, this cannot be fixed in cluster-autoscaler alone. We are proposing changes in both kubernetes/kubernetes and cluster-autoscaler.
## Kubernetes Scheduler change
The first change we propose in the Kubernetes scheduler is to not schedule pods that require CSI volumes onto a node that is not reporting any installed CSI drivers.

The proposed change is small and a draft PR is available here: https://github.com/kubernetes/kubernetes/pull/130702
This will stop too many pods from crowding a node when a new node is spun up and is not yet reporting volume limits.
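To make the idea concrete, here is a minimal sketch of the kind of predicate the scheduler change implies, written against the real `storagev1.CSINode` type. The function name and how the driver name is obtained are assumptions for illustration, not the code from the draft PR.

```
package sketch

import (
	storagev1 "k8s.io/api/storage/v1"
)

// nodeReportsDriver returns true only if the node's CSINode object lists the
// given CSI driver as installed. A pod needing a volume from driverName would
// be filtered out on nodes where this returns false.
func nodeReportsDriver(csiNode *storagev1.CSINode, driverName string) bool {
	if csiNode == nil {
		// No CSINode object yet: the node is not reporting any drivers.
		return false
	}
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == driverName {
			return true
		}
	}
	return false
}
```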
But this alone is not enough to fix the underlying problem. In part two, cluster-autoscaler must be made aware of a node's attach limits via the CSINode object.
## Cluster Autoscaler changes

We can split the implementation in cluster-autoscaler into two parts:
- Scaling a node-group that already has one or more nodes.
- Scaling a node-group that doesn't have any nodes yet (scaling from zero).
### Scaling a node-group that already has one or more nodes
1. We propose a label, similar to GPULabel, added to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.
> **Reviewer:** What happens if there are multiple drivers installed?
>
> **Author:** We are changing this mechanism to not use GPU-like labels anymore. We are following what was done in #8109. Please see kubernetes/enhancements#5031 for a more updated version.
>
> **Reviewer:** In cases where CSI drivers are only installed on a subset of nodes, how does the autoscaler get this info? (E.g. the ebs-csi-node DaemonSet ignores instance-type x, or is disabled on Windows nodes.) Does the autoscaler have to parse the CSI node DaemonSet's node-affinity? Does the cluster admin have to configure whether nodes from node-group/node-pool X should have a CSI driver? Otherwise the CSI driver will never be installed and the node will be marked NotReady. (I guess if no driver is ever scheduled onto the node, then we can ignore this requirement?)
>
> **Contributor:** I think it would be nice to see how the label will be used.
>
> **Author:** I have updated the proposal with readiness detail. Please see kubernetes/enhancements#5031 for more details.
However, we also propose that a node will be considered ready as soon as the corresponding CSI driver is reported as installed via the corresponding CSINode object.

A node which is ready but does not have the CSI driver installed within a certain time limit will be considered NotReady and removed from the cluster.
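A minimal sketch of the readiness rule described above, assuming a hypothetical label key (the actual mechanism is being reworked, per the discussion above and kubernetes/enhancements#5031):

```
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
)

// Hypothetical label key marking nodes expected to run a CSI driver;
// not an agreed-upon name.
const expectedCsiDriverLabel = "example.com/expected-csi-driver"

// csiReady treats a node that declares an expected CSI driver as ready only
// once its CSINode object reports that driver as installed.
func csiReady(node *corev1.Node, csiNode *storagev1.CSINode) bool {
	expected, ok := node.Labels[expectedCsiDriverLabel]
	if !ok {
		return true // node expects no CSI driver; normal readiness applies
	}
	if csiNode == nil {
		return false // driver not reported yet; keep the node "not ready"
	}
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == expected {
			return true
		}
	}
	return false
}
```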
2. We propose that we add volume limits and installed CSI driver information to framework.NodeInfo objects:
```
type NodeInfo struct {
    // ... existing fields ...
    csiDrivers map[string]*DriverAllocatable
    // ...
}

type DriverAllocatable struct {
    Allocatable int32
}
```

> **Reviewer:** Are these mutable?
>
> **Reviewer:** +1. The allocatable count can change for some storage providers. (E.g. some EC2 instance types share network interfaces and EBS volume attachment limits, so if an extra ENI is added later, new stateful workloads may get stuck. Storage providers might have to deploy a controller that watches for additional ENIs and edits this allocatable, or the result of NodeGetInfo may need to be percolated up to this DriverAllocatable struct due to kubernetes/enhancements#4876.)
>
> **Author:** This is an in-memory representation of the limits read from the actual CSINode object. For all intents and purposes, it doesn't matter whether these are mutable or immutable, because they will be rebuilt whenever the informer reports changes in CSINode objects.
>
> **Reviewer:** Is this extensible to multiple different attach limits in the future? Like what was discussed in the K8s Advanced Volume Attach Limits Strawman.
>
> **Author:** Not yet. It is doable, but we aren't yet implementing what was not implemented for in-tree scheduler limits.
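As a companion to the author's reply above, a minimal sketch of rebuilding the csiDrivers map from a CSINode object on each informer event; buildCsiDrivers is a hypothetical helper name, but the CSINode fields it reads are the real storage/v1 API.

```
package sketch

import (
	storagev1 "k8s.io/api/storage/v1"
)

type DriverAllocatable struct {
	Allocatable int32
}

// buildCsiDrivers converts a CSINode object into the in-memory csiDrivers map
// proposed above. Re-running it on every informer update makes mutability a
// non-issue: staleness is bounded by informer lag.
func buildCsiDrivers(csiNode *storagev1.CSINode) map[string]*DriverAllocatable {
	drivers := make(map[string]*DriverAllocatable, len(csiNode.Spec.Drivers))
	for _, d := range csiNode.Spec.Drivers {
		var count int32
		if d.Allocatable != nil && d.Allocatable.Count != nil {
			count = *d.Allocatable.Count
		}
		drivers[d.Name] = &DriverAllocatable{Allocatable: count}
	}
	return drivers
}
```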
3. We propose that, when saving `ClusterState`, we capture and add `csiDrivers` information to all existing nodes.
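A minimal sketch of what capturing this state could look like; the snapshot type and function are hypothetical, since the proposal does not spell out the `ClusterState` plumbing.

```
package sketch

// NodeSnapshot is a hypothetical per-node record saved with ClusterState.
type NodeSnapshot struct {
	Name       string
	CsiDrivers map[string]*DriverAllocatable
}

// snapshotCsiDrivers copies a node's csiDrivers map so that later simulations
// see the attach limits as they were at snapshot time.
func snapshotCsiDrivers(name string, drivers map[string]*DriverAllocatable) NodeSnapshot {
	copied := make(map[string]*DriverAllocatable, len(drivers))
	for k, v := range drivers {
		d := *v // copy the value so later mutations don't leak into the snapshot
		copied[k] = &d
	}
	return NodeSnapshot{Name: name, CsiDrivers: copied}
}
```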
4. We propose that, when getting nodeInfosForGroups, the returned nodeInfo map also contains CSI driver information, which can be used later on for scheduling decisions.
```
nodeInfosForGroups, autoscalerError := a.processors.TemplateNodeInfoProvider.Process(autoscalingContext, readyNodes, daemonsets, a.taintConfig, currentTime)
```
Please note that we will have to handle the case of scaling from zero separately from scaling from one, because in the former case no CSI volume limit information will be available if no node exists in a NodeGroup.
5. We propose that, when deciding which pods should be considered for scaling nodes in the podListProcessor.Process function, we update the hinting_simulator to consider the CSI volume limits of existing nodes. This will let cluster-autoscaler know exactly whether all unschedulable pods will fit on recently spun-up or currently running nodes. A sketch of such a fit check follows below.

> **Reviewer:** How do we pick the existing nodes if they have very different attach limits?
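A minimal sketch of the volume-fit check the simulator would need, re-declaring the DriverAllocatable struct from above so it is self-contained; the function name and the pre-aggregated per-driver counts are assumptions for illustration.

```
package sketch

type DriverAllocatable struct {
	Allocatable int32
}

// podFitsCsiLimits checks that, for every CSI driver a pod needs volumes from,
// the node still has enough attach slots left after what is already attached.
func podFitsCsiLimits(
	neededByDriver map[string]int32, // volumes the pod needs, per CSI driver
	nodeDrivers map[string]*DriverAllocatable, // from the node's CSINode object
	attachedByDriver map[string]int32, // volumes already attached, per driver
) bool {
	for driver, needed := range neededByDriver {
		alloc, ok := nodeDrivers[driver]
		if !ok {
			return false // node does not report this CSI driver at all
		}
		if attachedByDriver[driver]+needed > alloc.Allocatable {
			return false // not enough remaining attach capacity
		}
	}
	return true
}
```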
Making the aforementioned changes should allow us to handle scaling of nodes from one.
### Scaling from zero
Scaling from zero should work similarly to scaling from one, but the main problem is that we do not have a NodeInfo which can tell us what the CSI attach limit would be on a node being spun up in the NodeGroup.
We propose that we introduce an annotation, similar to the existing CPU and memory capacity annotations in cluster-api, to convey the attach limits available on a node.
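A minimal sketch of reading such an annotation from a scale-from-zero node template; the annotation key is a hypothetical name modelled on the existing capacity.cluster-autoscaler.kubernetes.io/cpu and /memory annotations, not an agreed-upon one.

```
package sketch

import "strconv"

// Hypothetical annotation key, modelled on the existing
// capacity.cluster-autoscaler.kubernetes.io/{cpu,memory} annotations.
const attachLimitAnnotation = "capacity.cluster-autoscaler.kubernetes.io/csi-attach-limit"

// attachLimitFromTemplate reads the per-node attach limit from a node
// template's annotations, returning 0 (unknown) when absent or malformed.
func attachLimitFromTemplate(annotations map[string]string) int32 {
	raw, ok := annotations[attachLimitAnnotation]
	if !ok {
		return 0
	}
	v, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		return 0
	}
	return int32(v)
}
```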
We will have to introduce a similar mechanism in the various cloud providers which return Template objects, so that they incorporate volume limits. This will allow us to handle the case of scaling from zero.
## Alternatives Considered
Certain Kubernetes vendors taint the node when a new node is created, and the CSI driver has logic to remove the taint when the CSI driver starts on the node:
- https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/2309
> **Reviewer:** Hm, I thought I reported this in #4517 and fixed it in #4539. 🤔 Am I missing something?
>
> **Author:** What you are talking about is a feature of the scheduler, and yes, the scheduler is aware of CSI volume limits. But cluster-autoscaler itself doesn't actually take into account how many nodes it should spin up to satisfy pending pods that need X number of volumes. For example: in a nodegroup X there are 0 nodes and, assuming all other nodegroups are at max capacity, the autoscaler must scale nodegroup X to schedule the pending pods. Now say there are 5 pending pods that each use 20 volumes, so 100 volumes in total. Cluster-autoscaler currently can't figure out that, on AWS for example, each of those nodes can support 20 volumes and hence it must spin up 5 new nodes. So what will happen is: cluster-autoscaler will spin up exactly 1 node and, assuming CPU/memory requirements are satisfied, all 5 of those pods will get scheduled onto the same node; all but 1 pod will be stuck in `ContainerCreating` forever until someone manually intervenes. Part of the reason is that the scheduler allows scheduling pods onto a node that doesn't have a CSI driver installed. So an upcoming node N will have unlimited capacity until the CSI driver is installed on the node and it starts to report a capacity. But while the CSI driver is getting installed on the node, cluster-autoscaler will send all pending pods to the same node, assuming CPU/memory and other constraints are met.
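To make the arithmetic in the reply above concrete, a toy calculation under its stated assumptions (20-volume attach limit per node):

```
package main

import "fmt"

func main() {
	// The example from the reply above: 5 pending pods, 20 volumes each,
	// against an assumed per-node attach limit of 20.
	pendingPods, volumesPerPod, perNodeLimit := 5, 20, 20
	totalVolumes := pendingPods * volumesPerPod // 100
	// Integer ceiling division: nodes needed so every volume gets a slot.
	nodesNeeded := (totalVolumes + perNodeLimit - 1) / perNodeLimit // 5
	fmt.Printf("%d volumes at %d per node => %d nodes\n", totalVolumes, perNodeLimit, nodesNeeded)
}
```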
> **Reviewer:** How will this proposal work with the existing CSI volume limits? Can you help clarify the flow?