
Conversation

@macsko
Member

@macsko macsko commented Feb 6, 2025

  • One-line PR description: Add KEP-5142
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 6, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 6, 2025
@macsko macsko force-pushed the pop-Pod_from_backoffq_when-activeq_is_empty branch from 7fa0d54 to 34fc985 Compare February 6, 2025 16:32
@macsko
Member Author

macsko commented Feb 6, 2025

/cc @dom4ha @sanposhiho @Huang-Wei @alculquicondor

I haven't finished all the points yet, but the general idea is there.

@macsko macsko force-pushed the pop-Pod_from_backoffq_when-activeq_is_empty branch from 34fc985 to 58bc648 Compare February 6, 2025 16:38
Member

@sanposhiho sanposhiho left a comment


Though there are some sections with TODOs, it's looking good overall and aligned with the discussion we had in the issue.

@@ -0,0 +1,3 @@
kep-number: 5142
alpha:
approver: "@wojtek-t"
Member


/cc @wojtek-t

Member


ack - waiting for sig buy-in (the PRR is looking good enough, I'm just not sure I understand the change itself - see my comment below).

@k8s-ci-robot k8s-ci-robot requested a review from wojtek-t February 7, 2025 00:07
Member

@sanposhiho sanposhiho left a comment


Looks mostly good to me; I left only nits.

/cc @alculquicondor @macsko @Huang-Wei

approvers:
-

stage: alpha
Member


Can we start it from beta per the discussion in the issue? I.e., enable it from day 1, but keep a feature gate as a safeguard. WDYT?

Member Author


I think we could do this.

@k8s-ci-robot
Contributor

@sanposhiho: GitHub didn't allow me to request PR reviews from the following users: macsko.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Looks mostly good to me; I left only nits.

/cc @alculquicondor @macsko @Huang-Wei

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sanposhiho
Member

I meant @dom4ha

However, in the real world, if the scheduling latency is short enough, there won't be a visible downgrade in throughput.
This will only happen if there are no pods in activeQ, so this can be mitigated by an appropriate rate of pod creation.

#### Backoff won't be working as natural rate limiter in case of errors
Member


I think we have to distinguish unschedulable from error cases in the first version; otherwise we risk overloading the system with the scheduler sending a storm of retries.

Member Author


Right, especially if we decide to start from beta directly, it will be implemented in the first version.

and the backoff time is calculated based on the number of scheduling failures that the pod has experienced.
If one pod has a smaller attempt counter than others,
could the scheduler keep popping this pod ahead of other pods because the pod's backoff expires faster than others?
Actually, that wouldn't happen because the scheduler would increment the attempt counter of pods from backoffQ as well,
Member


I assumed we could keep the logic of incrementing scheduling attempts only after the backoff time expires, to avoid a rapid increase of the backoff time for pods triggered by a large number of events.

It's probably debatable, because the side effect is an increased chance of pod starvation as described here (a pod would come back with its former backoff timeout, unless it expired).

The approach proposed here would make the backoff more natural, but would cause reporting much higher (by an order of magnitude) values for scheduling attempts.
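For reference, the backoff duration grows roughly exponentially with the attempt counter. The sketch below illustrates the idea with assumed initial and max backoff values; it is not the upstream implementation:

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration sketches how a pod's backoff grows with its attempt counter:
// it doubles an initial backoff per failed attempt and caps it at a maximum.
// initialBackoff and maxBackoff stand in for the scheduler's pod initial/max
// backoff settings; this is an illustration only.
func backoffDuration(attempts int, initialBackoff, maxBackoff time.Duration) time.Duration {
	d := initialBackoff
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

func main() {
	// With 1s initial and 10s max backoff: 1s, 2s, 4s, 8s, 10s, 10s, ...
	for attempts := 1; attempts <= 6; attempts++ {
		fmt.Printf("attempt %d -> backoff %v\n", attempts, backoffDuration(attempts, time.Second, 10*time.Second))
	}
}
```

This is why a higher attempt counter (from counting every pop) translates directly into longer waits once the pod does have to back off.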

To achieve the goal, activeQ's `pop()` method needs to be changed:
1. If activeQ is empty, then instead of waiting for a pod to arrive at activeQ, popping from backoffQ is tried.
2. If backoffQ is empty, then `pop()` is waiting for pod as previously.
3. If backoffQ is not empty, then the pod is processed like the pod would be taken from activeQ, including increasing attempts number.
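A minimal sketch of the flow above, using hypothetical stand-in types rather than the actual scheduler structures:

```go
package queue

import "sync"

// queuedPodInfo, podFIFO and schedulingQueue are simplified stand-ins for the
// real scheduler types; only the control flow of the proposed pop() matters here.
type queuedPodInfo struct {
	Name     string
	Attempts int
}

type podFIFO struct{ items []*queuedPodInfo }

func (f *podFIFO) pop() *queuedPodInfo {
	if len(f.items) == 0 {
		return nil
	}
	p := f.items[0]
	f.items = f.items[1:]
	return p
}

type schedulingQueue struct {
	lock     sync.Mutex
	cond     *sync.Cond // initialized with sync.NewCond(&q.lock)
	activeQ  podFIFO
	backoffQ podFIFO
}

// popPod prefers activeQ, falls back to backoffQ when activeQ is empty (point 1),
// and only blocks when both queues are empty (point 2). A pod taken from backoffQ
// is treated like a regular pop from activeQ, including incrementing its attempt
// counter (point 3).
func (q *schedulingQueue) popPod() *queuedPodInfo {
	q.lock.Lock()
	defer q.lock.Unlock()
	for {
		if p := q.activeQ.pop(); p != nil {
			p.Attempts++
			return p
		}
		if p := q.backoffQ.pop(); p != nil {
			p.Attempts++
			return p
		}
		q.cond.Wait() // woken up when a pod is added to either queue
	}
}
```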
Member


increasing attempts number.

It's debatable. It can cause a rapid increase in the backoff time for some pods (when many events trigger retries), so I was thinking about computing the backoff based on the original approach to avoid changing the logic too much. In other words, not increasing scheduling attempts if all of them happened within the last backoff time window. This way the mechanism would be just a best-effort improvement rather than a new approach.

Member

@sanposhiho sanposhiho Feb 11, 2025


It can cause a rapid increase in the backoff time for some pods (when many events trigger retries)

I believe we should increment the attempt number.
It is exactly the intended behavior to move the pods to backoffQ immediately and make a lot of scheduling retries (with or without this feature) if many events are coming and each of them triggers the pod's requeueing.
As long as QHints determine that an event might make the pod schedulable and then move this pod to backoffQ, this pod should be able to get retried at any time, regardless of the backoff time.
If all these retries don't make the pod schedulable after all, and make the pod's backoff time much longer, that's the fault of QHint accuracy, not of this feature.

Member

@sanposhiho sanposhiho Feb 11, 2025


The only concern is the backoff time for pods getting an error (not unschedulable) status, because their scheduling failure isn't solved by events, but by some other factor (e.g., a networking issue).
In this case, it could happen that the pod gets an error status -> is enqueued directly to backoffQ -> the scheduler immediately pops this pod directly from backoffQ (this feature) -> ... -> the attempt counter is incremented every time and the backoff time gets too long.
But, as the KEP mentions, we'll address this point along with kubernetes/kubernetes#128748.

Member


We had a follow-up discussion with @macsko, and this is what we came up with:

We can count errors and unschedulable attempts separately, but we need two queues - for errors and for unschedulable pods, since the feature applies only to unschedulable pods (scheduling errors need the original backoff logic for rate limiting).

We agreed that it may be fine to increment attempts on every pop (we need to be prepared to see much higher values of the attempts counter than before), although there are a few important things:

  1. the interesting case is the worst case (when the active queue is permanently non-empty), so we think the existing mechanism should still apply and move expired items to the active queue - to sort them by priority.

  2. since the original mechanism was triggered every 1s, the pods within this second got sorted in the active queue by priority. If we pop just based on expiration time, we would lose this, so we could sort by priority within every full second to reflect the current behavior.

Unfortunately I saw a case where a max backoff of 10s was constantly moving a big number of high-priority pods to the active queue, which took longer than these 10s to process, causing starvation of lower-priority pods. So this feature does not help in this scenario directly, but it does indirectly, because we should be able to safely increase the default value to something like 120s without risking that any pod would need to wait that long in practice.
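A rough sketch of the two-queue separation described above, with hypothetical stand-in types (the separate attempt counters mirror the idea of counting errors and unschedulable attempts independently):

```go
package queue

// queuedPodInfo and backoffHeap are hypothetical stand-ins; the sketch only
// illustrates the "two queues" idea: error backoffs keep acting as a rate
// limiter, while unschedulable pods are eligible for the new direct pop.
type queuedPodInfo struct {
	Name                  string
	UnschedulableAttempts int
	ErrorAttempts         int
}

type backoffHeap struct{ items []*queuedPodInfo }

func (h *backoffHeap) popIfAny() *queuedPodInfo {
	if len(h.items) == 0 {
		return nil
	}
	p := h.items[0]
	h.items = h.items[1:]
	return p
}

type splitBackoffQ struct {
	unschedulable backoffHeap // rejected as Unschedulable, requeued by cluster events
	errored       backoffHeap // rejected with an Error status
}

// popForIdleScheduler is called only when activeQ is empty; it never touches the
// errored heap, whose pods still wait out their full (exponential) backoff.
func (q *splitBackoffQ) popForIdleScheduler() *queuedPodInfo {
	return q.unschedulable.popIfAny()
}
```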

Member

@sanposhiho sanposhiho Feb 12, 2025


since the original mechanism was triggered every 1s, the pods within this second got sorted in the active queue by priority.

Yeah, this is the point I missed. So,

so we think the existing mechanism should still apply and move expired items to the active queue - to sort them by priority.

Do you mean that, instead of just directly popping pods out of backoffQ when activeQ is empty, we'd first move pods from backoffQ to activeQ, let activeQ sort them, and restart popping from activeQ? I.e., when activeQ is empty, trigger flushBackoffQCompleted forcibly?

Member Author


Do you mean that, instead of just directly popping pods out of backoffQ when activeQ is empty, we'd first move pods from backoffQ to activeQ, let activeQ sort them, and restart popping from activeQ? I.e., when activeQ is empty, trigger flushBackoffQCompleted forcibly?

He just emphasized that we'll still keep the old flushing from backoffQ to activeQ, apart from the new popping.

Sorting by priority for popping when activeQ is empty can be done by changing the sorting key of backoffQ's heap - we could ignore the milliseconds in the backoff time expiration and add priority to the comparison. It would probably also require aligning those windows with the flushing windows, i.e., triggering the flush every full second.
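A hedged sketch of what such a comparison could look like (hypothetical field names, not the real backoffQ item):

```go
package queue

import "time"

// backoffEntry is a hypothetical backoffQ item; the real items carry more state.
type backoffEntry struct {
	backoffExpiration time.Time
	priority          int32
}

// lessBackoff sketches the proposed heap ordering: compare backoff expiration
// truncated to full seconds first (so pods expiring within the same second are
// grouped, matching the 1s flush window), then break ties by priority, higher first.
func lessBackoff(a, b backoffEntry) bool {
	ta := a.backoffExpiration.Truncate(time.Second)
	tb := b.backoffExpiration.Truncate(time.Second)
	if !ta.Equal(tb) {
		return ta.Before(tb)
	}
	return a.priority > b.priority
}
```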

Member


He just emphasized that we'll still keep the old flushing from backoffQ to activeQ, apart from the new popping.

Yes agree, we definitely should keep that.

by changing the sorting key of backoffQ's heap - we could ignore the milliseconds in the backoff time expiration and add priority to the comparison.

Aha, that looks awesome. So, like sorting by the expiration time first (but ignoring milliseconds), and sorting by priority next, right?

Member Author


So, like sorting by the expiration time first (but ignoring milliseconds), and sorting by priority next, right?

Exactly, I added it to the KEP, so please take a look.

It is popping from a heap data structure, so it should be fast enough not to cause any performance problems.

To support monitoring, when popping from backoffQ,
the `scheduler_queue_incoming_pods_total` metric with an `activeQ` queue and a new `PopFromBackoffQ` event label will be incremented.
Member


metric with an activeQ queue

Should it be backoffQ?

Member Author


This metric shows the pod movement between queues, where the queue label means the destination. As the pod is already in backoffQ, its destination has to be activeQ (we just skip putting the pod there, but treat it as if it were in activeQ).
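As an illustration only (plain client_golang shapes, not the scheduler's actual metrics package), the increment could look like this:

```go
package queue

import "github.com/prometheus/client_golang/prometheus"

// queueIncomingPods mirrors the shape of scheduler_queue_incoming_pods_total
// using plain client_golang; the scheduler's own metrics package differs in
// detail, so treat this purely as an illustration.
var queueIncomingPods = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "scheduler_queue_incoming_pods_total",
		Help: "Number of pods added to scheduling queues, by event and queue.",
	},
	[]string{"queue", "event"},
)

// recordPopFromBackoffQ records the new event label: the queue label stays
// "activeQ" because it denotes the destination, even though the pod is popped
// straight from backoffQ and only treated as if it had reached activeQ.
func recordPopFromBackoffQ() {
	queueIncomingPods.WithLabelValues("activeQ", "PopFromBackoffQ").Inc()
}
```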


To achieve the goal, activeQ's `pop()` method needs to be changed:
1. If activeQ is empty, then instead of waiting for a pod to arrive at activeQ, popping from backoffQ is tried.
2. If backoffQ is empty, then `pop()` is waiting for pod as previously.
Member


Regarding point 2, it means waiting for the pods in ActiveQ, right?

Member Author


It will be waiting for pods from both activeQ and backoffQ, whichever comes first.

This proposal will take those pods earlier, leading to losing this mechanism.

After merging [kubernetes#128748](github.com/kubernetes/kubernetes/pull/128748),
it will be possible to distinguish pods backing off because of errors from those backing off because of an unschedulable attempt.
Member


Forgive my ignorance, but if pods were unschedulable, wouldn't they be put into the unschedulableQ instead of backoffQ?
Also - if a pod was unschedulable, then we need some cluster state change to even hope it could become schedulable.

Member

@sanposhiho sanposhiho Feb 12, 2025


You're right for pods that were rejected with an Unschedulable status.
But, for example, when a network error happens and the pod is rejected, the pod gets an Error status and is put directly into backoffQ, skipping unschedQ. This is because we cannot tell from cluster events when to requeue this pod (i.e., when the network failure is solved). So, we just put such a pod into backoffQ directly to keep it in the scheduling queue for a while, and retry it afterwards with the hope that the network failure is gone by then.

What this is saying is that it's better not to skip the backoff of such pods that got an Error status, because if we retry those pods immediately, the network instability is likely not yet solved. Or, for another example, if an external component that is used during scheduling is overwhelmed and pods are getting Errors in the scheduling or binding cycles, then continuing to retry those pods with an Error status might make the situation worse (it would put more load on the component unnecessarily).
We want to skip the backoff time of pods that were in the queue due to an Unschedulable status.

Member


What this is saying is that it's better not to skip the backoff of such pods that got an Error status, because if we retry those pods immediately, the network instability is likely not yet solved.

Agree with that.

We want to skip the backoff time of pods that were in the queue due to an Unschedulable status.

This is what I don't understand. Aren't those pods supposed to live in unschedulableQ instead of backoffQ? If they are unschedulable, I thought they were added to unschedulableQ instead.

Member

@sanposhiho sanposhiho Feb 12, 2025


Those pods are put in unschedQ first, and then events trigger requeueing to backoffQ.
E.g., if a pod is rejected by the resource fit plugin, it is first put in unschedQ. Then, let's say a new node is created and the scheduler notices it. The queue would move this pod from unschedQ to backoffQ.
So, backoffQ could have those two types of pods: pods that got rejected with an Error status, and pods that got rejected with an Unschedulable status and were moved from unschedQ by some event.
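A simplified sketch of those two paths into backoffQ (hypothetical helper and statuses, not the real scheduler types):

```go
package queue

// rejectionStatus and destinationQueue only illustrate the routing described
// above; the real scheduler uses richer status types and queue logic.
type rejectionStatus int

const (
	statusUnschedulable rejectionStatus = iota
	statusError
)

// destinationQueue returns where a rejected pod goes: Error pods go straight
// to backoffQ (no cluster event tells us when e.g. a network failure is gone),
// while Unschedulable pods wait in unschedulableQ until an event, pre-filtered
// by QHints, suggests they might now be schedulable and moves them to backoffQ.
func destinationQueue(s rejectionStatus, movedByClusterEvent bool) string {
	if s == statusError {
		return "backoffQ"
	}
	if movedByClusterEvent {
		return "backoffQ"
	}
	return "unschedulableQ"
}
```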

Member


Then, let's say a new node is created and the scheduler notices it. The queue would move this pod from unschedQ to backoffQ.

Ok - that was the bit I was missing. This makes sense now - thanks!

Member

@dom4ha dom4ha Feb 12, 2025


The reason why there is the transition from unschedulableQ -> backoffQ (instead of going to activeQ directly) is that the mechanism triggering this transition is, for performance reasons, not super precise. It triggers whenever there is a chance that the received cluster event may make an unschedulable pod schedulable, but the scheduler may very often assess the pod as unschedulable again, so we have a so-called wasted cycle.

So backoffQ serves the purpose of a buffer, so that scheduling reattempts won't consume all scheduler cycles (and starve out other schedulable pods). However, it introduces a scheduling delay if an event indeed made a pod schedulable. This is why we propose to pop from the backoff queue whenever activeQ is empty (the scheduler is idle anyway).

Member


That makes sense - thanks!

@wojtek-t wojtek-t self-assigned this Feb 12, 2025
Member

@sanposhiho sanposhiho left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
@dom4ha
Member

dom4ha commented Feb 12, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
@macsko
Member Author

macsko commented Feb 12, 2025

/assign @alculquicondor
for approval

@dom4ha
Member

dom4ha commented Feb 12, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
Member

@alculquicondor alculquicondor left a comment


/approve

@wojtek-t
Member

/lgtm
/approve PRR

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, macsko, sanposhiho, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2025
@k8s-ci-robot k8s-ci-robot merged commit dcbea32 into kubernetes:master Feb 12, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 12, 2025
@dom4ha
Member

dom4ha commented Feb 12, 2025

Nice work! Thanks Maciek!
