sats-23
Contributor

sats-23 commented Oct 14, 2025

Problem:
Currently, except for the unit tests and master integration tests, which run hourly, all other jobs appear to run at 3h, 6h, or 8h intervals. These jobs are often triggered at the same time.

These periodics, combined with developer-triggered presubmits, can often lead to a resource crunch on the cluster.

Solution:
- Stagger the job start times using cron syntax.
- Keep the interval of each job unchanged.
- Take the approximate finish times of the jobs into account, so that lighter workloads can be grouped into closer time slots.
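
For illustration, staggered schedules like the ones described above could be generated rather than written by hand. The following is a minimal Python sketch, not the change made in this PR: the job names, the `staggered_cron` helper, and the fixed 20-minute slot are all hypothetical, and unlike the PR it does not weigh job finish times. It assumes every interval divides 24 hours evenly, so the spacing between runs stays constant across midnight.

```python
def staggered_cron(jobs, slot_minutes=20):
    """Map each (name, interval_hours) job to a cron expression with its own start offset.

    Hypothetical helper: job i starts i * slot_minutes after 00:00 and then
    repeats at its original interval.
    """
    schedules = {}
    for index, (name, interval_hours) in enumerate(jobs):
        if 24 % interval_hours != 0:
            raise ValueError(f"{name}: interval must divide 24 hours")
        offset = index * slot_minutes                 # minutes after midnight of the first run
        minute = offset % 60
        first_hour = (offset // 60) % interval_hours  # keep the first run inside one interval
        if interval_hours == 1:
            hour_field = "*"
        else:
            # A range with a step, e.g. 1-23/8 -> hours 1, 9, 17
            hour_field = f"{first_hour}-23/{interval_hours}"
        schedules[name] = f"{minute} {hour_field} * * *"
    return schedules


if __name__ == "__main__":
    # Hypothetical job names; the intervals mirror the 3h/6h/8h periodics mentioned above.
    example_jobs = [("periodic-a", 3), ("periodic-b", 3), ("periodic-c", 6), ("periodic-d", 8)]
    for job, cron in staggered_cron(example_jobs).items():
        print(f"{job}: {cron}")
```

Each job keeps its original interval and only the start offset changes, so no two of these four hypothetical jobs begin in the same minute.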

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Oct 14, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sats-23
Once this PR has been reviewed and has the lgtm label, please assign kishen-v for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/S (denotes a PR that changes 10-29 lines, ignoring generated files) and needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) labels on Oct 14, 2025
@k8s-ci-robot
Contributor

Hi @sats-23. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added labels on Oct 14, 2025: area/config (issues or PRs related to code in /config), area/jobs, sig/cloud-provider (categorizes an issue or PR as relevant to SIG Cloud Provider), sig/testing (categorizes an issue or PR as relevant to SIG Testing)
@kishen-v
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Oct 14, 2025
@mkumatag
Member

-Stagger the job timings using cron syntax.

One potential downside of this approach is that you'll end up maintaining these entries statically, and I'm not aware of a better automated solution for handling this.

@sats-23
Contributor Author

sats-23 commented Oct 14, 2025

-Stagger the job timings using cron syntax.

One potential downside of this approach is that you'll end up maintaining these entries statically, and I'm not aware of a better automated solution for handling this.

True. While we will need to set the cron schedule by hand for each newly introduced job, this approach gives us the best chance of keeping cluster resources well utilized and significantly reducing the risk of scheduling timeouts.

@upodroid
Member

upodroid commented Oct 14, 2025

A better approach would be to force Kubernetes to allocate a dedicated node for every Prow pod. This should be very easy to do with anti-affinity rules, or by allocating all the resources on the node to the job, as we do with the GKE build cluster.
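
As a rough sketch of the anti-affinity option, the fragment below builds the `affinity` stanza of a pod spec as a Python dictionary. It assumes Prow job pods carry the `created-by-prow: "true"` label and that nodes expose the standard `kubernetes.io/hostname` topology key. The helper name `one_prow_pod_per_node` is hypothetical, and this is not taken from the configuration of any particular build cluster.

```python
import json


def one_prow_pod_per_node():
    """Return a pod-spec `affinity` stanza that keeps Prow pods on separate nodes.

    Assumption: every Prow job pod is labelled created-by-prow=true.
    """
    return {
        "podAntiAffinity": {
            # Hard requirement: do not schedule onto a node that already
            # runs a pod matching the selector.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"created-by-prow": "true"}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


if __name__ == "__main__":
    print(json.dumps({"affinity": one_prow_pod_per_node()}, indent=2))
```

With a hard requirement like this, a pod that cannot find a free node stays pending instead of sharing one, so the number of nodes caps the number of concurrent jobs.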

@sats-23
Contributor Author

sats-23 commented Oct 15, 2025

A better approach would be to force Kubernetes to allocate a dedicated node for every Prow pod. This should be very easy to do with anti-affinity rules, or by allocating all the resources on the node to the job, as we do with the GKE build cluster.

Thanks for the suggestion @upodroid, that's a valid point. In this case, all worker nodes have the same configuration and resource capacity, so using node-level anti-affinity or dedicating entire nodes to individual jobs wouldn't give us much additional isolation.
The main goal of staggering the cron schedules is to smooth out cluster-wide concurrency by preventing resource spikes when multiple periodic jobs start at once. This helps ensure that more presubmit jobs can run successfully even when many are triggered concurrently, without hitting scheduling timeouts or leaving the cluster underutilized between spikes.
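
To make the concurrency argument concrete, here is a small Python sketch that counts how many periodic jobs run at the same time over one day, first with aligned start times and then with staggered ones. The offsets, intervals, and durations are made-up illustrative values, not measurements from this cluster, and `peak_concurrency` is a hypothetical helper.

```python
MINUTES_PER_DAY = 24 * 60


def peak_concurrency(jobs):
    """Return the maximum number of jobs running in any single minute of a day.

    Each job is (offset_minutes, interval_hours, duration_minutes).
    """
    running = [0] * MINUTES_PER_DAY
    for offset, interval_hours, duration in jobs:
        for start in range(offset, MINUTES_PER_DAY, interval_hours * 60):
            for minute in range(start, start + duration):
                running[minute % MINUTES_PER_DAY] += 1  # wrap runs that cross midnight
    return max(running)


# Illustrative values only: two 3h jobs, one 6h job, one 8h job.
aligned = [(0, 3, 40), (0, 3, 40), (0, 6, 50), (0, 8, 20)]
staggered = [(0, 3, 40), (60, 3, 40), (120, 6, 50), (100, 8, 20)]

if __name__ == "__main__":
    print("aligned peak:", peak_concurrency(aligned))
    print("staggered peak:", peak_concurrency(staggered))
```

With these assumed numbers the aligned schedule peaks at four concurrent periodics at midnight, while the staggered one never runs more than one at a time, which leaves headroom for presubmits.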

