Improvements: Cluster Autoscaling with GPU-sharing Pods & Support for Scheduling Gates #125
EkinKarabulut announced in Announcements
Hi everyone,
We are excited to share some significant improvements to KAI Scheduler:
Support for Cluster Autoscaling with GPU-Sharing Pods via node-scale-adjuster (see docs)
Introducing node-scale-adjuster, which enables cluster autoscalers like Karpenter to work with pods that use GPU sharing.
Problem
Cluster autoscalers rely on pending pods with resource requests to trigger node provisioning. In KAI, GPU-sharing pods define their GPU needs in annotations, not resources.requests, making them invisible to the autoscaler. As a result, no scale-up was triggered (see issue #111).
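For context, here is a minimal sketch of a GPU-sharing pod. The annotation key, scheduler name, and image are illustrative rather than authoritative; see the GPU-sharing docs for the exact fields.

```yaml
# Sketch of a GPU-sharing pod: the GPU need lives in an annotation,
# not in resources.requests, so the cluster autoscaler cannot see it.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing-workload          # name is illustrative
  annotations:
    gpu-fraction: "0.5"               # assumed annotation key; check the KAI GPU-sharing docs
spec:
  schedulerName: kai-scheduler        # assumes KAI is the pod's scheduler
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
        # Note: no nvidia.com/gpu request here, which is why the autoscaler
        # does not trigger a scale-up on its own.
```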
Solution
node-scale-adjuster watches for unschedulable GPU-sharing pods and creates temporary utility pods that request full GPUs via standard resources.requests. These utility pods allow the autoscaler to react as expected.
Behavior
For more details, refer to the documentation.
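As a rough illustration of the mechanism, a utility pod could look something like the sketch below. The name, image, and exact spec are hypothetical; the point is the full-GPU request in resources.requests, which is the signal autoscalers such as Karpenter understand.

```yaml
# Hypothetical shape of a temporary utility pod created by node-scale-adjuster.
# Its full-GPU request makes the pending capacity visible to the autoscaler,
# so a GPU node gets provisioned; the utility pod is later cleaned up.
apiVersion: v1
kind: Pod
metadata:
  name: scale-adjust-utility-pod      # name is illustrative
spec:
  containers:
    - name: placeholder
      image: registry.k8s.io/pause:3.9   # minimal placeholder image
      resources:
        requests:
          nvidia.com/gpu: "1"         # full-GPU request visible to the autoscaler
        limits:
          nvidia.com/gpu: "1"
```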
Kubernetes Scheduling Gates Support
KAI now respects Kubernetes pod scheduling gates, which allow pending pods to delay scheduling until certain conditions are met.
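A gated pod declares its gates under spec.schedulingGates and stays in Pending until every gate is removed; the gate name below is illustrative.

```yaml
# A pod with a scheduling gate remains Pending (SchedulingGated) until the
# gate entry is removed from spec.schedulingGates by a controller or user.
apiVersion: v1
kind: Pod
metadata:
  name: gated-workload
spec:
  schedulerName: kai-scheduler        # assumes KAI is the pod's scheduler
  schedulingGates:
    - name: example.com/wait-for-quota   # gate name is illustrative
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```

Once the gate entry is removed from spec.schedulingGates, the pod becomes eligible for scheduling as usual.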
Behavior with PodGroups
Gated pods are excluded from minMember checks in a PodGroup. For example, a PodGroup with minMember = 4 and 4 pending pods, where 1 is gated, will not be scheduled. Once the gate is removed and the pending pods satisfy minMember, those pods are scheduled.
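To make the minMember interaction concrete, here is a minimal PodGroup sketch. The apiVersion shown is an assumption about the PodGroup CRD; check the documentation for the exact schema.

```yaml
# Sketch: a PodGroup that requires 4 members before any of its pods are scheduled.
# If 4 pods reference this group but 1 of them still carries a scheduling gate,
# only 3 count toward minMember, so nothing is scheduled until the gate is removed.
apiVersion: scheduling.run.ai/v2alpha2   # assumed group/version of the PodGroup CRD
kind: PodGroup
metadata:
  name: training-job                     # name is illustrative
spec:
  minMember: 4                           # scheduling starts only once 4 ungated pods are pending
```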
We'd love your feedback on these updates! 🚀 Feel free to share your thoughts and questions here.