
Conversation

@macsko
Member

@macsko macsko commented Feb 6, 2025

  • One-line PR description: Add KEP-5142
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 6, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 6, 2025
@macsko macsko force-pushed the pop-Pod_from_backoffq_when-activeq_is_empty branch from 7fa0d54 to 34fc985 Compare February 6, 2025 16:32
@macsko
Member Author

macsko commented Feb 6, 2025

/cc @dom4ha @sanposhiho @Huang-Wei @alculquicondor

I haven't finished all the points yet, but the general idea is there.

@macsko macsko force-pushed the pop-Pod_from_backoffq_when-activeq_is_empty branch from 34fc985 to 58bc648 Compare February 6, 2025 16:38
Member

@sanposhiho sanposhiho left a comment


Though there are some sections with TODOs, it's looking good overall and aligned with the discussion we had in the issue.

@@ -0,0 +1,3 @@
kep-number: 5142
alpha:
approver: "@wojtek-t"
Member


/cc @wojtek-t

Member


ack - waiting for sig buy-in (the PRR is looking good enough, I'm just not sure I understand the change itself - see my comment below).

@k8s-ci-robot k8s-ci-robot requested a review from wojtek-t February 7, 2025 00:07
Member

@sanposhiho sanposhiho left a comment


Looks mostly good to me; I left only nits.

/cc @alculquicondor @macsko @Huang-Wei

approvers:
-

stage: alpha
Member


Can we start it from beta per the discussion in the issue? I.e., enable it from day 1, but keep a feature gate as a safeguard. WDYT?

Member Author


I think we could do this.

@k8s-ci-robot
Contributor

@sanposhiho: GitHub didn't allow me to request PR reviews from the following users: macsko.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Looks mostly good to me; I left only nits.

/cc @alculquicondor @macsko @Huang-Wei

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sanposhiho
Member

I meant @dom4ha

However, in the real world, if the scheduling latency is short enough, there won't be a visible downgrade in throughput.
This will only happen if there are no pods in activeQ, so this can be mitigated by an appropriate rate of pod creation.

#### Backoff won't be working as natural rate limiter in case of errors
Member


I think we have to distinguish unschedulable from error cases in the first version; otherwise we risk overloading the system with the scheduler sending a storm of retries.

Member Author


Right, especially if we decide to start from beta directly, it will be implemented in the first version.

and the backoff time is calculated based on the number of scheduling failures that the pod has experienced.
If one pod has a smaller attempt counter than others,
could the scheduler keep popping this pod ahead of other pods because the pod's backoff expires faster than others?
Actually, that wouldn't happen because the scheduler would increment the attempt counter of pods from backoffQ as well,
Member


I assumed we could keep the logic of incrementing scheduling attempts only after the backoff time expires, to avoid a rapid increase of the backoff time for pods triggered by a large number of events.

It's probably debatable, because the side effect is an increased chance of pod starvation as described here (a pod would come back with its former backoff timeout, unless it expired).

The approach proposed here would make the backoff more natural, but would cause reporting much higher (by an order of magnitude) values for scheduling attempts.
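For reference, the backoff duration grows roughly exponentially with the attempt counter. The sketch below illustrates the idea with assumed initial and max backoff values; it is not the upstream implementation:

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration sketches how a pod's backoff grows with its attempt counter:
// it doubles an initial backoff per failed attempt and caps it at a maximum.
// initialBackoff and maxBackoff stand in for the scheduler's pod initial/max
// backoff settings; this is an illustration only.
func backoffDuration(attempts int, initialBackoff, maxBackoff time.Duration) time.Duration {
	d := initialBackoff
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

func main() {
	// With 1s initial and 10s max backoff: 1s, 2s, 4s, 8s, 10s, 10s, ...
	for attempts := 1; attempts <= 6; attempts++ {
		fmt.Printf("attempt %d -> backoff %v\n", attempts, backoffDuration(attempts, time.Second, 10*time.Second))
	}
}
```

This is why a higher attempt counter (from counting every pop) translates directly into longer waits once the pod does have to back off.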

To achieve the goal, activeQ's `pop()` method needs to be changed:
1. If activeQ is empty, then instead of waiting for a pod to arrive at activeQ, popping from backoffQ is tried.
2. If backoffQ is empty, then `pop()` is waiting for pod as previously.
3. If backoffQ is not empty, then the pod is processed like the pod would be taken from activeQ, including increasing attempts number.
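A minimal sketch of the flow above, using hypothetical stand-in types rather than the actual scheduler structures:

```go
package queue

import "sync"

// queuedPodInfo, podFIFO and schedulingQueue are simplified stand-ins for the
// real scheduler types; only the control flow of the proposed pop() matters here.
type queuedPodInfo struct {
	Name     string
	Attempts int
}

type podFIFO struct{ items []*queuedPodInfo }

func (f *podFIFO) pop() *queuedPodInfo {
	if len(f.items) == 0 {
		return nil
	}
	p := f.items[0]
	f.items = f.items[1:]
	return p
}

type schedulingQueue struct {
	lock     sync.Mutex
	cond     *sync.Cond // initialized with sync.NewCond(&q.lock)
	activeQ  podFIFO
	backoffQ podFIFO
}

// popPod prefers activeQ, falls back to backoffQ when activeQ is empty (point 1),
// and only blocks when both queues are empty (point 2). A pod taken from backoffQ
// is treated like a regular pop from activeQ, including incrementing its attempt
// counter (point 3).
func (q *schedulingQueue) popPod() *queuedPodInfo {
	q.lock.Lock()
	defer q.lock.Unlock()
	for {
		if p := q.activeQ.pop(); p != nil {
			p.Attempts++
			return p
		}
		if p := q.backoffQ.pop(); p != nil {
			p.Attempts++
			return p
		}
		q.cond.Wait() // woken up when a pod is added to either queue
	}
}
```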
Member


increasing attempts number.

It's debatable. It can cause a rapid increase in the backoff time for some pods (when many events trigger retries), so I was thinking about computing the backoff based on the original approach to avoid changing the logic too much. In other words, not increasing scheduling attempts if all of them happened within the last backoff time window. This way the mechanism would be just a best-effort improvement rather than a new approach.

Member

@sanposhiho sanposhiho Feb 11, 2025


It can cause a rapid increase in the backoff time for some pods (when many events trigger retries)

I believe we should increment the attempt number.
It is exactly the intended behavior to move the pods to backoffQ immediately and make a lot of scheduling retries (with or without this feature) if many events are coming and each of them triggers the pod's requeueing.
As long as QHints determine that an event might make the pod schedulable and then move this pod to backoffQ, this pod should be able to get retried at any time, regardless of the backoff time.
If all these retries don't make the pod schedulable after all, and make the pod's backoff time much longer, that's the fault of QHint accuracy, not of this feature.

Member

@sanposhiho sanposhiho Feb 11, 2025


The only concern is the backoff time for pods getting an error (not unschedulable) status, because their scheduling failure isn't solved by events, but by some other factor (e.g., a networking issue).
In this case, it could happen that the pod gets an error status -> is enqueued directly to backoffQ -> the scheduler immediately pops this pod directly from backoffQ (this feature) -> ... -> the attempt counter is incremented every time and the backoff time gets too long.
But, as the KEP mentions, we'll address this point along with kubernetes/kubernetes#128748.

Member


We had a follow-up discussion with @macsko, and this is what we came up with:

We can count errors and unschedulable attempts separately, but we need two queues - for errors and for unschedulable pods, since the feature applies only to unschedulable pods (scheduling errors need the original backoff logic for rate limiting).

We agreed that it may be fine to increment attempts on every pop (we need to be prepared to see much higher values of the attempts counter than before), although there are a few important things:

  1. the interesting case is the worst case (when the active queue is permanently non-empty), so we think the existing mechanism should still apply and move expired items to the active queue - to sort them by priority.

  2. since the original mechanism was triggered every 1s, the pods within this second got sorted in the active queue by priority. If we pop just based on expiration time, we would lose this, so we could sort by priority within every full second to reflect the current behavior.

Unfortunately I saw a case where a max backoff of 10s was constantly moving a big number of high-priority pods to the active queue, which took longer than these 10s to process, causing starvation of lower-priority pods. So this feature does not help in this scenario directly, but it does indirectly, because we should be able to safely increase the default value to something like 120s without risking that any pod would need to wait that long in practice.
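A rough sketch of the two-queue separation described above, with hypothetical stand-in types (the separate attempt counters mirror the idea of counting errors and unschedulable attempts independently):

```go
package queue

// queuedPodInfo and backoffHeap are hypothetical stand-ins; the sketch only
// illustrates the "two queues" idea: error backoffs keep acting as a rate
// limiter, while unschedulable pods are eligible for the new direct pop.
type queuedPodInfo struct {
	Name                  string
	UnschedulableAttempts int
	ErrorAttempts         int
}

type backoffHeap struct{ items []*queuedPodInfo }

func (h *backoffHeap) popIfAny() *queuedPodInfo {
	if len(h.items) == 0 {
		return nil
	}
	p := h.items[0]
	h.items = h.items[1:]
	return p
}

type splitBackoffQ struct {
	unschedulable backoffHeap // rejected as Unschedulable, requeued by cluster events
	errored       backoffHeap // rejected with an Error status
}

// popForIdleScheduler is called only when activeQ is empty; it never touches the
// errored heap, whose pods still wait out their full (exponential) backoff.
func (q *splitBackoffQ) popForIdleScheduler() *queuedPodInfo {
	return q.unschedulable.popIfAny()
}
```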

Member

@sanposhiho sanposhiho Feb 12, 2025


since the original mechanism was triggered every 1s, the pods within this second got sorted in the active queue by priority.

Yeah, this is the point I missed. So,

so we think the existing mechanism should still apply and move expired items to the active queue - to sort them by priority.

Do you mean that, instead of just directly popping pods out of backoffQ when activeQ is empty, we'd first move pods from backoffQ to activeQ, let activeQ sort them, and restart popping from activeQ? I.e., when activeQ is empty, trigger flushBackoffQCompleted forcibly?

Member Author


Do you mean that, instead of just directly popping pods out of backoffQ when activeQ is empty, we'd first move pods from backoffQ to activeQ, let activeQ sort them, and restart popping from activeQ? I.e., when activeQ is empty, trigger flushBackoffQCompleted forcibly?

He just emphasized that we'll still keep the old flushing from backoffQ to activeQ, apart from the new popping.

Sorting by priority for popping when activeQ is empty can be done by changing the sorting key of backoffQ's heap - we could ignore the milliseconds in the backoff time expiration and add priority to the comparison. It would probably also require aligning those windows with the flushing windows, i.e., triggering the flush every full second.
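A hedged sketch of what such a comparison could look like (hypothetical field names, not the real backoffQ item):

```go
package queue

import "time"

// backoffEntry is a hypothetical backoffQ item; the real items carry more state.
type backoffEntry struct {
	backoffExpiration time.Time
	priority          int32
}

// lessBackoff sketches the proposed heap ordering: compare backoff expiration
// truncated to full seconds first (so pods expiring within the same second are
// grouped, matching the 1s flush window), then break ties by priority, higher first.
func lessBackoff(a, b backoffEntry) bool {
	ta := a.backoffExpiration.Truncate(time.Second)
	tb := b.backoffExpiration.Truncate(time.Second)
	if !ta.Equal(tb) {
		return ta.Before(tb)
	}
	return a.priority > b.priority
}
```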

Member


He just emphasized that we'll still keep the old flushing from backoffQ to activeQ, apart from the new popping.

Yes agree, we definitely should keep that.

by changing the sorting key of backoffQ's heap - we could ignore the milliseconds in the backoff time expiration and add priority to the comparison.

Aha, that looks awesome. So, like sorting by the expiration time first (but ignoring milliseconds), and sorting by priority next, right?

Member Author


So, like sorting by the expiration time first (but ignoring milliseconds), and sorting by priority next, right?

Exactly, I added it to the KEP, so please take a look.

It is popping from a heap data structure, so it should be fast enough not to cause any performance problems.

To support monitoring, when popping from backoffQ,
the `scheduler_queue_incoming_pods_total` metric with an `activeQ` queue and a new `PopFromBackoffQ` event label will be incremented.
Member


metric with an activeQ queue

Should it be backoffQ?

Member Author


This metric shows the pod movement between queues, where the queue label means the destination. As the pod is already in backoffQ, its destination has to be activeQ (we just skip putting the pod there, but treat it as if it were in activeQ).
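As an illustration only (plain client_golang shapes, not the scheduler's actual metrics package), the increment could look like this:

```go
package queue

import "github.com/prometheus/client_golang/prometheus"

// queueIncomingPods mirrors the shape of scheduler_queue_incoming_pods_total
// using plain client_golang; the scheduler's own metrics package differs in
// detail, so treat this purely as an illustration.
var queueIncomingPods = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "scheduler_queue_incoming_pods_total",
		Help: "Number of pods added to scheduling queues, by event and queue.",
	},
	[]string{"queue", "event"},
)

// recordPopFromBackoffQ records the new event label: the queue label stays
// "activeQ" because it denotes the destination, even though the pod is popped
// straight from backoffQ and only treated as if it had reached activeQ.
func recordPopFromBackoffQ() {
	queueIncomingPods.WithLabelValues("activeQ", "PopFromBackoffQ").Inc()
}
```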


To achieve the goal, activeQ's `pop()` method needs to be changed:
1. If activeQ is empty, then instead of waiting for a pod to arrive at activeQ, popping from backoffQ is tried.
2. If backoffQ is empty, then `pop()` is waiting for pod as previously.
Member


Regarding point 2, it means waiting for the pods in ActiveQ, right?

Member Author


It will be waiting for pods from both activeQ and backoffQ, whichever comes first.

This proposal will take those pods earlier, leading to losing this mechanism.

After merging [kubernetes#128748](github.com/kubernetes/kubernetes/pull/128748),
it will be possible to distinguish pods backing off because of errors from those backing off because of an unschedulable attempt.
Member


Forgive my ignorance, but if pods were unschedulable, wouldn't they be put into the unschedulableQ instead of backoffQ?
Also - if a pod was unschedulable, then we need some cluster state change to even hope it could become schedulable.

Member

@sanposhiho sanposhiho Feb 12, 2025


You're right for pods that were rejected with an Unschedulable status.
But, for example, when a network error happens and the pod is rejected, the pod gets an Error status and is put directly into backoffQ, skipping unschedQ. This is because we cannot tell from cluster events when to requeue this pod (i.e., when the network failure is solved). So, we just put such a pod into backoffQ directly to keep it in the scheduling queue for a while, and retry it afterwards with the hope that the network failure is gone by then.

What this is saying is that it's better not to skip the backoff of such pods that got an Error status, because if we retry those pods immediately, the network instability is likely not yet solved. Or, for another example, if an external component that is used during scheduling is overwhelmed and pods are getting Errors in the scheduling or binding cycles, then continuing to retry those pods with an Error status might make the situation worse (it would put more load on the component unnecessarily).
We want to skip the backoff time of pods that were in the queue due to an Unschedulable status.

Member


What this is saying is that it's better not to skip the backoff of such pods that got an Error status, because if we retry those pods immediately, the network instability is likely not yet solved.

Agree with that.

We want to skip the backoff time of pods that were in the queue due to an Unschedulable status.

This is what I don't understand. Aren't those pods supposed to live in unschedulableQ instead of backoffQ? If they are unschedulable, I thought they were added to unschedulableQ instead.

Member

@sanposhiho sanposhiho Feb 12, 2025


Those pods are put in unschedQ first, and then events trigger requeueing to backoffQ.
E.g., if a pod is rejected by the resource fit plugin, it is first put in unschedQ. Then, let's say a new node is created and the scheduler notices it. The queue would move this pod from unschedQ to backoffQ.
So, backoffQ could have those two types of pods: pods that got rejected with an Error status, and pods that got rejected with an Unschedulable status and were moved from unschedQ by some event.
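A simplified sketch of those two paths into backoffQ (hypothetical helper and statuses, not the real scheduler types):

```go
package queue

// rejectionStatus and destinationQueue only illustrate the routing described
// above; the real scheduler uses richer status types and queue logic.
type rejectionStatus int

const (
	statusUnschedulable rejectionStatus = iota
	statusError
)

// destinationQueue returns where a rejected pod goes: Error pods go straight
// to backoffQ (no cluster event tells us when e.g. a network failure is gone),
// while Unschedulable pods wait in unschedulableQ until an event, pre-filtered
// by QHints, suggests they might now be schedulable and moves them to backoffQ.
func destinationQueue(s rejectionStatus, movedByClusterEvent bool) string {
	if s == statusError {
		return "backoffQ"
	}
	if movedByClusterEvent {
		return "backoffQ"
	}
	return "unschedulableQ"
}
```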

Member


Then, let's say a new node is created and the scheduler notices it. The queue would move this pod from unschedQ to backoffQ.

Ok - that was the bit I was missing. This makes sense now - thanks!

Member

@dom4ha dom4ha Feb 12, 2025


The reason why there is the transition from unschedulableQ -> backoffQ (instead of going to activeQ directly) is that the mechanism triggering this transition is, for performance reasons, not super precise. It triggers whenever there is a chance that the received cluster event may make an unschedulable pod schedulable, but the scheduler may very often assess the pod as unschedulable again, so we have a so-called wasted cycle.

So backoffQ serves the purpose of a buffer, so that scheduling reattempts won't consume all scheduler cycles (and starve out other schedulable pods). However, it introduces a scheduling delay if an event indeed made a pod schedulable. This is why we propose to pop from the backoff queue whenever activeQ is empty (the scheduler is idle anyway).

Member


That makes sense - thanks!

@wojtek-t wojtek-t self-assigned this Feb 12, 2025
Member

@sanposhiho sanposhiho left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
@dom4ha
Member

dom4ha commented Feb 12, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
@macsko
Member Author

macsko commented Feb 12, 2025

/assign @alculquicondor
for approval

@dom4ha
Member

dom4ha commented Feb 12, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
Member

@alculquicondor alculquicondor left a comment


/approve

@wojtek-t
Member

/lgtm
/approve PRR

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, macsko, sanposhiho, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2025
@k8s-ci-robot k8s-ci-robot merged commit dcbea32 into kubernetes:master Feb 12, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 12, 2025
@dom4ha
Member

dom4ha commented Feb 12, 2025

Nice work! Thanks Maciek!
