Conversation

@AlexeyPerevalov
Contributor

Initially it was proposed as built-in plugin kubernetes/enhancements#1858

Co-authored-by: Swati Sehgal [email protected]
Signed-off-by: Swati Sehgal [email protected]
Signed-off-by: Alexey Perevalov [email protected]

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 30, 2020
@k8s-ci-robot
Contributor

Welcome @AlexeyPerevalov!

It looks like this is your first PR to kubernetes-sigs/scheduler-plugins 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/scheduler-plugins has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 30, 2020
@k8s-ci-robot
Contributor

Hi @AlexeyPerevalov. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 30, 2020
Co-authored-by: Swati Sehgal <[email protected]>
Signed-off-by: Swati Sehgal <[email protected]>
Signed-off-by: Alexey Perevalov <[email protected]>
@AlexeyPerevalov force-pushed the KEPTopologyAwareSchedulerPerNUMA branch from ad23c8b to dbacf97 on November 30, 2020 14:10

## Feature enablement and rollback
* **How can this feature be enabled / disabled in a live cluster?**
- This plugin doesn't require a special feature gate, but it expects the TopologyManager and CPUManager feature gates to be enabled on the worker node
Contributor Author

If I'm not wrong, when a special KubeSchedulerConfiguration is not defined, the plugin is not loaded.
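
For illustration, here is a minimal sketch of how an out-of-tree scheduler binary would register the plugin. The import path and the `Name`/`New` symbols are assumptions following the usual scheduler-plugins conventions, not taken from this PR; even when registered, the plugin only runs if a KubeSchedulerConfiguration profile enables it.

```go
package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	// Assumed import path for the plugin package; the real location may differ.
	"sigs.k8s.io/scheduler-plugins/pkg/noderesourcetopology"
)

func main() {
	// Register the out-of-tree plugin with the scheduler framework.
	// It stays inactive unless a KubeSchedulerConfiguration profile
	// lists it among the enabled filter/score plugins.
	command := app.NewSchedulerCommand(
		app.WithPlugin(noderesourcetopology.Name, noderesourcetopology.New),
	)
	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}
```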

@Huang-Wei
Contributor

@AlexeyPerevalov We haven't come to a conclusion where to hold the API, have we?

If we get a consensus on kubernetes/kubernetes#96275, then we can implement it in in-tree scheduler.

@AlexeyPerevalov
Contributor Author

@AlexeyPerevalov We haven't come to a conclusion where to hold the API, have we?

If we get a consensus on kubernetes/kubernetes#96275, then we can implement it in in-tree scheduler.

We discussed the ability to include it as a built-in API in the sig-architecture meeting, where opinions were divided. Staging also raised concerns. So we will collect usage feedback for the NodeResourceTopology plugin and for the API; the API could be placed in scheduler-plugins or somewhere in https://github.com/k8stopologyawareschedwg.

@swatisehgal
Contributor

swatisehgal commented Dec 1, 2020

If we get a consensus on kubernetes/kubernetes#96275, then we can implement it in in-tree scheduler.

Discussion upstream is still going on and we are pursuing acceptance of an in-tree scheduler plugin (and will continue to do so). While we work on showcasing the advantages of topology-aware scheduling in a more measurable way, in order to make a strong case for in-tree acceptance of this feature, it was suggested that we consider an out-of-tree approach for better feature velocity. Hence the enhancement proposal here.

@Huang-Wei
Contributor

@AlexeyPerevalov @swatisehgal Thanks for the clarification. This proposal looks good to me (we have actually already reviewed this KEP upstream). It's just that I'd like to hold the merge in this repo until we've reached a conclusion on where to place the APIs. Let me know your thoughts.

@swatisehgal
Contributor

swatisehgal commented Dec 2, 2020

@Huang-Wei Thanks for reviewing the proposal. In this KEP, in the CRD API section here, we are proposing that the API be placed in kubernetes-sigs/scheduler-plugins in the pkg/apis/noderesourcetopology-api directory. What do you think about that? Getting this scheduler plugin merged as an out-of-tree plugin would be a great starting point for us, as it would allow relevant stakeholders to play around with it, letting us gather feedback and mature the plugin and the API. This would also help drive conversations upstream (with sig-node, sig-scheduling and sig-architecture) to figure out how we as a community want to shape and introduce this feature natively in Kubernetes. Let us know your thoughts.
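
For readers following along, a rough sketch of the kind of CRD types under discussion is shown below. Everything beyond `ResourceInfo` (which mirrors the excerpt reviewed later in this thread) is an assumption for illustration only; the authoritative definition is the one referenced in the KEP.

```go
package v1alpha1 // illustrative package name

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// NodeResourceTopology sketches the CRD: one object per worker node,
// exposing per-NUMA-zone resource availability to the scheduler.
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// TopologyPolicies reports the node's topology manager policy (assumed field).
	TopologyPolicies []string `json:"topologyPolicies"`
	// Zones lists resources per NUMA zone (assumed field).
	Zones []Zone `json:"zones"`
}

// Zone is an assumed per-NUMA-node entry.
type Zone struct {
	Name      string         `json:"name"`
	Type      string         `json:"type"`
	Resources []ResourceInfo `json:"resources,omitempty"`
}

// ResourceInfo mirrors the excerpt reviewed later in this conversation.
type ResourceInfo struct {
	Name        string             `json:"name"`
	Allocatable intstr.IntOrString `json:"allocatable"`
}
```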

@Huang-Wei
Contributor

@swatisehgal does that mean the work of pushing the API to staging and the scheduler plugin to in-tree is being postponed? Or are you still pursuing that?

The other aspect is the API review. Similar to the Topology Manager feature, sig-node should be responsible for owning and reviewing the topology API; sig-scheduling is actually out of its expertise in this area. So if we store the CRD API in scheduler-plugins, can we involve sig-node to review it to ensure the API design is appropriate?

@swatisehgal
Contributor

swatisehgal commented Dec 4, 2020

@swatisehgal does that mean the work of pushing the API to staging and the scheduler plugin to in-tree is being postponed? Or are you still pursuing that?

@Huang-Wei We need to do some work to showcase the value proposition of topology-aware scheduling in order to make a strong case for the API and scheduler plugin in-tree. We think this will take some time, but when we are ready we will resume the discussion with SIG-Arch and SIG-Node. So, to answer your question, we are pursuing pushing in-tree in the background but postponing the discussion for now.

The other aspect is the API review. Similar to the Topology Manager feature, sig-node should be responsible for owning and reviewing the topology API; sig-scheduling is actually out of its expertise in this area. So if we store the CRD API in scheduler-plugins, can we involve sig-node to review it to ensure the API design is appropriate?

We will take care of this and bring this to SIG-Node. Thanks for letting us know the expectations.

swatisehgal and others added 2 commits January 22, 2021 15:02
- Capturing API updates and RBAC rules (for accessing NodeResourceTopology CRD)
- Capturing the import path of the CRD API as
  https://github.com/k8stopologyawareschedwg/noderesourcetopology-api
  rather than placing the API definition in the scheduler-plugin repo itself.

Signed-off-by: Swati Sehgal <[email protected]>
Updates to the KEP based on the latest changes
@AlexeyPerevalov
Contributor Author

/ok-to-test

@k8s-ci-robot
Contributor

@AlexeyPerevalov: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@denkensk
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 29, 2021

type ResourceInfo struct {
	Name        string             `json:"name"`
	Allocatable intstr.IntOrString `json:"allocatable"`
Contributor

How can we maintain consistency between NodeResourceTopology.Allocatable and the reserved CPUs? Suppose we schedule a pod on the node:

  • In the Filter and Score extensions, we use NodeResourceTopology.Allocatable to filter and select the host.
  • In the Reserve extension, we reserve several CPUs for the current pod (kubelet doesn't know about the reserved CPUs).
  • In the PostBind extension, we should release the CPUs, as kubelet will update the CRD to reflect the CPU usage, e.g. reduce the allocatable field.

Since kubelet and the scheduler do not talk to each other, is it possible that there is a small time window between releasing the CPUs in the PostBind extension and kubelet updating the CRD? And could a wrong scheduling decision be made in that gap?

Contributor Author

Yes, this KEP doesn't cover the issue you mentioned, but it should. We can't follow the approach of the noderesources/fit.go built-in plugin, where the plugin relies on NodeInfo.Allocatable, which is updated once a pod is nominated and so reduces the gap. We can't follow it because we can't predict which NUMA node will be chosen on the worker node.
I think in the future we could add some "hint", e.g. an annotation at the PostBind stage, to inform kubelet which NUMA node was selected, and use it to reduce the amount of available resources on the worker node, as is done for the Fit plugin.
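
A minimal sketch of what such a PostBind "hint" could look like is below; the annotation key, the CycleState key, and the plugin wiring are all hypothetical and not part of this KEP.

```go
package noderesourcetopology

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Hypothetical keys; the KEP does not define them.
const (
	numaHintAnnotation = "topology.node.k8s.io/selected-numa-node"
	numaStateKey       = framework.StateKey("noderesourcetopology/selected-numa-node")
)

// numaState would be written by a Filter/Score phase (not shown) to record
// which NUMA node the scheduler assumed when it admitted the pod to this node.
type numaState struct {
	numaID string
}

func (s *numaState) Clone() framework.StateData { return s }

// numaHintPostBind is a hypothetical plugin (or an extension of the existing
// one) implementing only the PostBind hook discussed above.
type numaHintPostBind struct {
	handle framework.Handle
}

var _ framework.PostBindPlugin = &numaHintPostBind{}

func (p *numaHintPostBind) Name() string { return "NUMAHintPostBind" }

// PostBind attaches the assumed NUMA node as an annotation on the bound pod so
// kubelet (or a node agent) could take the scheduler's evaluation into account.
func (p *numaHintPostBind) PostBind(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) {
	data, err := state.Read(numaStateKey)
	if err != nil {
		return // no hint recorded for this pod; nothing to do
	}
	hint, ok := data.(*numaState)
	if !ok {
		return
	}
	podCopy := pod.DeepCopy()
	if podCopy.Annotations == nil {
		podCopy.Annotations = map[string]string{}
	}
	podCopy.Annotations[numaHintAnnotation] = hint.numaID
	// Best effort: the pod is already bound, so a failed update must not block anything.
	_, _ = p.handle.ClientSet().CoreV1().Pods(podCopy.Namespace).Update(ctx, podCopy, metav1.UpdateOptions{})
}
```

Whether kubelet should honor such a hint, rather than the Topology Manager making its own placement decision, is exactly the open question in this thread.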

Contributor

@draveness draveness Mar 11, 2021

Is it possible that we select a socket (or sockets) during scheduling and the kubelet allocates the pod on that? I think after we introduce NUMA into scheduling, the socket may become the minimum scheduling unit.

Contributor Author

Socket is another term; several NUMA nodes can be on one socket. Here we're talking about NUMA nodes. Yes, as I said, if we manage to add some annotation, it would be possible for kubelet to make its decision based on the scheduler's evaluation.

Contributor

@draveness draveness Mar 31, 2021

Annotations would work, but they sound like a patch to me. Can we introduce a common approach to control how kubelet sets cgroups, including cpuset, cfs, etc.?

Contributor Author

I think an external resource manager like https://github.com/nokia/CPU-pooler or https://github.com/intel/cri-resource-manager/ could address the cgroup management issue.

@seanmalloy
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 31, 2021
@Huang-Wei
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 13, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlexeyPerevalov, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2021
@Huang-Wei
Contributor

/retest

2 similar comments
@Huang-Wei
Contributor

/retest

@Huang-Wei
Contributor

/retest
