Topology aware scheduler plugin #119
Conversation
Welcome @AlexeyPerevalov!
Hi @AlexeyPerevalov. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Co-authored-by: Swati Sehgal <[email protected]>
Signed-off-by: Swati Sehgal <[email protected]>
Signed-off-by: Alexey Perevalov <[email protected]>

Force-pushed ad23c8b to dbacf97
> ## Feature enablement and rollback
> * **How can this feature be enabled / disabled in a live cluster?**
>   - This plugin doesn't require a special feature gate, but it expects the TopologyManager and CPUManager feature gates to be enabled on the worker node.
If I'm not wrong, the plugin is not loaded unless a dedicated KubeSchedulerConfiguration is defined.
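For illustration, enabling an out-of-tree plugin like this is typically done through a KubeSchedulerConfiguration profile. The sketch below shows roughly what that could look like; the plugin name `NodeResourceTopologyMatch`, the profile name, and the kubeconfig path are assumptions for the example, not something this KEP fixes.

```yaml
# Sketch only: plugin name, profile name and paths are illustrative assumptions.
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
  - schedulerName: topology-aware-scheduler
    plugins:
      filter:
        enabled:
          - name: NodeResourceTopologyMatch
      score:
        enabled:
          - name: NodeResourceTopologyMatch
```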
@AlexeyPerevalov We haven't come to a conclusion on where to hold the API, have we? If we get a consensus on kubernetes/kubernetes#96275, then we can implement it in the in-tree scheduler.
We discussed the possibility of including it as a built-in API in the sig-architecture meeting, where opinions were divided. The staging approach also raised concerns. So we will collect usage feedback for the NodeResourceTopology plugin and for the API.
Discussion upstream is still ongoing and we are pursuing acceptance of an in-tree scheduler plugin (and will continue to do so). While we work on showcasing the advantages of topology-aware scheduling in a more measurable way, in order to make a strong case for in-tree acceptance of this feature, it was suggested that we consider an out-of-tree approach for better feature velocity. Hence the enhancement proposal here.
@AlexeyPerevalov @swatisehgal Thanks for the clarification. This proposal looks good to me (actually we have already reviewed this KEP upstream). It's just that I'd like to hold the merge in this repo until we reach a conclusion on where to place the APIs. Let me know your thoughts.
@Huang-Wei Thanks for reviewing the proposal. In the CRD API section of this KEP we are proposing that the API be placed in kubernetes-sigs/scheduler-plugins under the pkg/apis/noderesourcetopology-api directory. What do you think about that? Getting this scheduler plugin merged as an out-of-tree scheduler plugin would be a great starting point for us, as it would allow relevant stakeholders to play around with it, letting us gather feedback and mature both the plugin and the API. It would also help drive conversations upstream (with sig-node, sig-scheduling and sig-architecture) to figure out how we as a community want to shape and introduce this feature natively in Kubernetes. Let us know your thoughts.
@swatisehgal Does that mean the work of pushing the API to staging and the scheduler plugin in-tree is being postponed, or are you still pursuing it? The other aspect is the API review. Similar to the Topology Manager feature, sig-node should own and review the topology API; sig-scheduling doesn't have the expertise in this area. So if we store the CRD API in scheduler-plugins, can we involve sig-node in the review to ensure the API design is appropriate?
@Huang-Wei We need to do some work to showcase the value proposition of topology-aware scheduling in order to make a strong case for the API and the scheduler plugin in-tree. We think this will take some time, but when we are ready we will resume the discussion with SIG-Architecture and SIG-Node. So, to answer your question, we are still pursuing the in-tree path in the background but postponing that discussion for now.
We will take care of this and bring it to SIG-Node. Thanks for letting us know the expectations.
- Capturing API updates and RBAC rules (for accessing the NodeResourceTopology CRD)
- Capturing the import path of the CRD API as https://github.com/k8stopologyawareschedwg/noderesourcetopology-api rather than placing the API definition in the scheduler-plugins repo itself.

Signed-off-by: Swati Sehgal <[email protected]>

Updates to the KEP based on the latest changes
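As an illustration of that commit, consuming the CRD types from the external module rather than from a local pkg/apis directory could look roughly like the Go snippet below; the package path inside the module and the helper function are assumptions, not code from this PR.

```go
package noderesourcetopology

import (
    // Module path taken from the commit message above; the package layout
    // inside the module (pkg/apis/topology/v1alpha1) is an assumption.
    topologyv1alpha1 "github.com/k8stopologyawareschedwg/noderesourcetopology-api/pkg/apis/topology/v1alpha1"
)

// hasTopologyInfo is a placeholder helper showing the externally defined
// NodeResourceTopology type being consumed by plugin code.
func hasTopologyInfo(nrt *topologyv1alpha1.NodeResourceTopology) bool {
    // ObjectMeta is embedded in every CRD object, so Name is always available.
    return nrt != nil && nrt.Name != ""
}
```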
/ok-to-test
@AlexeyPerevalov: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
```go
type ResourceInfo struct {
    Name        string             `json:"name"`
    Allocatable intstr.IntOrString `json:"allocatable"`
```
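For context, a minimal sketch of how the quoted ResourceInfo type could be embedded in the CRD is shown below. Only the three quoted lines come from the PR; the Capacity field and the Zone/NodeResourceTopology wrappers are assumptions added for illustration.

```go
package v1alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// ResourceInfo reports one resource as seen from a single NUMA zone.
// Name and Allocatable come from the quoted snippet; Capacity is an assumed
// companion field.
type ResourceInfo struct {
    Name        string             `json:"name"`
    Allocatable intstr.IntOrString `json:"allocatable"`
    Capacity    intstr.IntOrString `json:"capacity"`
}

// Zone is an assumed per-NUMA-node grouping of ResourceInfo entries.
type Zone struct {
    Name      string         `json:"name"`
    Type      string         `json:"type"`
    Resources []ResourceInfo `json:"resources,omitempty"`
}

// NodeResourceTopology is an assumed sketch of the CRD object that a node
// agent would publish and the scheduler plugin would read.
type NodeResourceTopology struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    TopologyPolicies []string `json:"topologyPolicies,omitempty"`
    Zones            []Zone   `json:"zones,omitempty"`
}
```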
How can we maintain consistency between `NodeResourceTopology.Allocatable` and the reserved CPUs? Suppose we schedule a pod onto a node:
- In the Filter and Score extensions, we use `NodeResourceTopology.Allocatable` to filter and select the host.
- In the Reserve extension, we reserve several CPUs for the current pod (kubelet doesn't know about the reserved CPUs).
- In the PostBind extension, we should release the CPUs, since kubelet will update the CRD to reflect the CPU usage, e.g. reduce the allocatable field.

Since kubelet and the scheduler do not talk to each other, is it possible that there is a small time window between releasing the CPUs in the PostBind extension and kubelet updating the CRD? Could a wrong scheduling decision be made in that gap?
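To make the first bullet above concrete, here is a framework-agnostic sketch of the per-NUMA filtering check; the helper names and the single-zone fit criterion are illustrative, not the plugin's actual implementation.

```go
package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// zoneAllocatable is an illustrative stand-in for the per-NUMA-zone
// allocatable resources published via NodeResourceTopology.
type zoneAllocatable map[v1.ResourceName]resource.Quantity

// fitsSingleZone returns true if every requested resource fits inside at
// least one NUMA zone. The Filter step can only rely on the CRD's Allocatable
// values; it cannot see CPUs the kubelet has not yet accounted for.
func fitsSingleZone(requests v1.ResourceList, zones []zoneAllocatable) bool {
    for _, zone := range zones {
        fits := true
        for name, req := range requests {
            avail, ok := zone[name]
            if !ok || avail.Cmp(req) < 0 {
                fits = false
                break
            }
        }
        if fits {
            return true
        }
    }
    return false
}

func main() {
    requests := v1.ResourceList{
        v1.ResourceCPU:    resource.MustParse("4"),
        v1.ResourceMemory: resource.MustParse("8Gi"),
    }
    zones := []zoneAllocatable{
        {v1.ResourceCPU: resource.MustParse("2"), v1.ResourceMemory: resource.MustParse("16Gi")},
        {v1.ResourceCPU: resource.MustParse("6"), v1.ResourceMemory: resource.MustParse("16Gi")},
    }
    fmt.Println("fits:", fitsSingleZone(requests, zones)) // true: the second zone has room
}
```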
Yes, this KEP didn't cover the issue you mentioned, but it should. We can't follow the approach of the built-in noderesources/fit.go plugin, where the plugin relies on NodeInfo.Allocatable, which is updated once a pod is nominated, and that reduces the gap. We can't follow it because we can't predict which NUMA node will be chosen on the worker node.
I think in the future we could add some "hint", e.g. an annotation at the PostBind stage, to inform kubelet which NUMA node was selected, and use it to reduce the amount of available resources on the worker node, as is done for the Fit plugin.
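A minimal sketch of what such a PostBind hint could look like, assuming the plugin has a client-go clientset; the annotation key and the idea of a kubelet-side consumer are hypothetical and not defined anywhere in this KEP.

```go
package noderesourcetopology

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// numaHintAnnotation is a hypothetical annotation key used only to
// illustrate the "hint at PostBind" idea discussed above.
const numaHintAnnotation = "topology.node.k8s.io/selected-numa"

// annotateSelectedNUMA patches the bound pod with the NUMA node the scheduler
// assumed during filtering, so a node-side agent could reconcile the CRD
// against decisions that are still in flight.
func annotateSelectedNUMA(ctx context.Context, cs kubernetes.Interface, ns, podName string, numaID int) error {
    patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":"%d"}}}`, numaHintAnnotation, numaID))
    _, err := cs.CoreV1().Pods(ns).Patch(ctx, podName, types.MergePatchType, patch, metav1.PatchOptions{})
    return err
}
```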
Is it possible that we select a socket or sockets during scheduling and the kubelet allocates the pod within that? I think after we introduce NUMA into scheduling, the socket may become the minimum scheduling unit.
Socket is a different term; several NUMA nodes can be on one socket. Here we're talking about NUMA nodes. And yes, as I said, if we manage to add such an annotation, it's possible for kubelet to make its decision based on the scheduler's evaluation.
Annotations would work, but they sound like a patch to me. Can we introduce a common approach to control how kubelet sets cgroups, including cpuset, cfs, etc.?
I think external resource management, like https://github.com/nokia/CPU-pooler or https://github.com/intel/cri-resource-manager/, could address the cgroup management issue.
/kind feature
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: AlexeyPerevalov, Huang-Wei. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/retest

2 similar comments

/retest

/retest
Initially it was proposed as a built-in plugin: kubernetes/enhancements#1858
Co-authored-by: Swati Sehgal [email protected]
Signed-off-by: Swati Sehgal [email protected]
Signed-off-by: Alexey Perevalov [email protected]