Topology aware scheduler per numa #143
Conversation
Hi @AlexeyPerevalov. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
0046f48 to ec10aa2 (force-push)
Associated Enhancement Proposal: Here.
@Huang-Wei Could you please take a look at this PR and the associated KEP here. We are now focusing on out-of-tree enablement of Topology-aware Scheduling and the CRD API is http://github.com/k8stopologyawareschedwg/noderesourcetopology-api. Thanks in advance!
Huang-Wei
left a comment
First round of review. (will review core Filter logic and tests after most of the comments get resolved)
pkg/apis/config/types.go
Outdated
    Most ModeType = "Most"

    // to preserve consistency keep it in pkg/apis/core/types.go"
    SingleNUMANodeTopologyManagerPolicy TopologyManagerPolicy = "SingleNUMANode"
Given that the type is called "TopologyManagerPolicy", IMO SingleNUMANode is a better name?
These constants (all of type TopologyManagerPolicy) represent the behavior of the resource management component on the node (currently the TopologyManager, but it could be some external resource manager). This patch series adds support only for SingleNUMANode (the single-numa-node policy of the TopologyManager) and PodTopologyScope (pod-level resource counting in the TopologyManager). In the CRD it's an array of strings, so here it could be a string too, or we can keep the TopologyManagerPolicy type in noderesourcetopology-api.
Regarding the name of the type: since it represents the behavior/policy of the node's resource management based on hardware topology, I think names like TopologyPolicy/TopologyManagerPolicy/ResourceTopologyPolicy convey the purpose.
It sounds to me like we should name them SingleNUMANodeGeneral and SingleNUMANodePodTopology?
I agree, it's worth changing them now, since the current names have some historical baggage :)
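For illustration, a minimal sketch of what the renamed constants could look like after this discussion; both the identifiers (taken from the suggestion above) and the string values are assumptions, not the merged API:

    package config

    // TopologyManagerPolicy describes how the node's resource manager (currently the
    // kubelet TopologyManager, possibly an external manager later) aligns resources
    // to NUMA nodes.
    type TopologyManagerPolicy string

    const (
        // SingleNUMANodeGeneral: single-numa-node policy with container-level resource counting.
        SingleNUMANodeGeneral TopologyManagerPolicy = "SingleNUMANodeGeneral"
        // SingleNUMANodePodTopology: single-numa-node policy with pod-level resource counting.
        SingleNUMANodePodTopology TopologyManagerPolicy = "SingleNUMANodePodTopology"
    )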
pkg/apis/config/types.go
Outdated
    // to preserve consistency keep it in pkg/apis/core/types.go"
    SingleNUMANodeTopologyManagerPolicy TopologyManagerPolicy = "SingleNUMANode"
Add a comment explaining it.
pkg/apis/config/types.go
Outdated
    // Most is the string "Most".
    Most ModeType = "Most"

    // to preserve consistency keep it in pkg/apis/core/types.go"
Add a comment explaining its semantics (instead of mentioning the relationship with types.go).
pkg/noderesourcetopology/match.go
Outdated
    type nodeTopologyMap map[string]topologyv1alpha1.NodeResourceTopology

    type PolicyHandler interface {
I think we over-designed it a bit. IIUC we inject two identical implementations into the plugin instance, while using the plugin instance itself as the interface implementation. This is confusing: isn't the plugin instance already enough to handle all the logic?
Now we have two implementations, and they are not the same: one works over all resources of all containers and makes a decision for the whole pod, the other makes a decision per container. Quite different implementations may exist in the future, even ones not related to NUMA, since we may also support https://github.com/intel/cri-resource-manager (by using the same noderesourcetopology-api; it's now flexible enough to represent any kind of hardware).
Since topologyPolicy is an attribute of the worker node, the idea behind this is to choose the appropriate handler for the node's policy, or skip the node if no policy is found for it (e.g. none was provided).
It could be done more simply, e.g. with an if clause, if we keep one Filter plugin for all the policies we will have. Or with one Filter plugin per policy, but in that case each Filter plugin implementation would do unnecessary work, like skipping nodes configured for another policy.
Commented in #143 (comment)
pkg/noderesourcetopology/match.go
Outdated
    nodeName := nodeInfo.Node().Name

    topologyPolicies := getTopologyPolicies(tm.nodeTopologies, nodeName)
If there is no particular reason (e.g., indexing), it's unnecessary to maintain a local copy of nodeTopologies. Instead, inject a xyzInformerFactory().xyzInformer.Lister() into the plugin instance, and then use xyzLister's methods to fetch the xyz objects when needed. Then, all the onXYZAdd/Update/Delete methods can also be removed.
nodeTopologies is populated in the Add/Update callbacks; it's not a poll model but a push model, where the plugin gets informed about node topology state modifications (e.g. after a new pod gets launched).
I think it's better to update nodeTopologies as soon as the node state changes, rather than while we're in the Filter plugin, because I don't know a proper way of updating the xyz objects otherwise, except maybe on a time-interval basis.
Oh, you may have misunderstood xyzLister, which is provided by the Kubernetes client-go SDK. It actually maintains a local cache for you, and xyzLister's methods fetch the xyz object(s) from the client cache in an O(1) manner. It's not polling objects from the API server.
In other words, no matter how the state of your CR changes, the cache (xyzStore) that backs xyzLister is (almost) up-to-date.
IIUC, we can just rely on the xyzLister, because we just look at the up-to-date state of particular xyz objects upon a scheduling cycle (the Filter() hook), rather than needing to be notified and proactively trigger the scheduling of a Pod.
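For illustration, a minimal sketch of the lister-based approach, assuming code-generated clientset/informers/listers for the NodeResourceTopology CRD; the generated package paths and group method names below are assumptions, only the client-go lister pattern is the point:

    package noderesourcetopology

    import (
        // Assumed code-generated packages; exact paths depend on the CRD's code generation.
        topoclientset "github.com/k8stopologyawareschedwg/noderesourcetopology-api/pkg/generated/clientset/versioned"
        topoinformers "github.com/k8stopologyawareschedwg/noderesourcetopology-api/pkg/generated/informers/externalversions"
        topolisters "github.com/k8stopologyawareschedwg/noderesourcetopology-api/pkg/generated/listers/topology/v1alpha1"
    )

    // TopologyMatch keeps only a lister; no local nodeTopologyMap and no
    // onAdd/onUpdate/onDelete handlers are needed.
    type TopologyMatch struct {
        lister topolisters.NodeResourceTopologyLister
    }

    func newTopologyMatch(client topoclientset.Interface, stopCh <-chan struct{}) *TopologyMatch {
        factory := topoinformers.NewSharedInformerFactory(client, 0)
        informer := factory.Topology().V1alpha1().NodeResourceTopologies()
        tm := &TopologyMatch{lister: informer.Lister()}
        factory.Start(stopCh)
        factory.WaitForCacheSync(stopCh)
        return tm
    }

    // Inside Filter(), read the (almost) up-to-date object straight from the informer cache:
    //   nrt, err := tm.lister.NodeResourceTopologies(namespace).Get(nodeInfo.Node().Name)
    // (with a cluster-scoped CRD this would simply be tm.lister.Get(name)).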
pkg/noderesourcetopology/match.go
Outdated
    topologyPolicies := getTopologyPolicies(tm.nodeTopologies, nodeName)
    for _, policyName := range topologyPolicies {
        if handler, ok := tm.topologyPolicyHandlers[policyName]; ok {
Given that the two objects are actually identical, and there are no struct fields associated with either, I'd suggest just creating 2 stateless PolicyFilter() methods (e.g., PodLevelFilter() and SingleNUMANodeFilter()) to handle them individually.
Do you suggest doing it with a map[string]interface{}, or just comparing policyName and calling the appropriate handler?
We can define a function type:

    type PolicyFilter func(*v1.Pod, topologyv1alpha1.ZoneList) *framework.Status

And then register each function implementation in the map: map[apiconfig.TopologyManagerPolicy]PolicyFilter.
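A compact sketch of that registration-and-dispatch idea; the handler names are placeholders, the constant names follow the rename discussed earlier, and the import paths are assumptions rather than the PR's actual code:

    package noderesourcetopology

    import (
        topologyv1alpha1 "github.com/k8stopologyawareschedwg/noderesourcetopology-api/pkg/apis/topology/v1alpha1"
        v1 "k8s.io/api/core/v1"
        framework "k8s.io/kubernetes/pkg/scheduler/framework" // exact path depends on the vendored Kubernetes version
        apiconfig "sigs.k8s.io/scheduler-plugins/pkg/apis/config"
    )

    // PolicyFilter is a stateless, per-policy filter function.
    type PolicyFilter func(pod *v1.Pod, zones topologyv1alpha1.ZoneList) *framework.Status

    // Placeholder implementations: one counts requests per container, the other per pod.
    func singleNUMANodeFilter(pod *v1.Pod, zones topologyv1alpha1.ZoneList) *framework.Status { return nil }
    func podLevelFilter(pod *v1.Pod, zones topologyv1alpha1.ZoneList) *framework.Status       { return nil }

    // policyFilters maps each node policy to its handler.
    var policyFilters = map[apiconfig.TopologyManagerPolicy]PolicyFilter{
        apiconfig.SingleNUMANodeGeneral:     singleNUMANodeFilter,
        apiconfig.SingleNUMANodePodTopology: podLevelFilter,
    }

    // filterForPolicy picks the handler registered for the node's policy and
    // leaves the node alone when no handler matches.
    func filterForPolicy(policy apiconfig.TopologyManagerPolicy, pod *v1.Pod, zones topologyv1alpha1.ZoneList) *framework.Status {
        if f, ok := policyFilters[policy]; ok {
            return f(pod, zones)
        }
        return nil
    }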
d48189a to 8ca3805 (force-push)
78cc192 to 07264a7 (force-push)
| "k8s.io/client-go/tools/clientcmd" | ||
| "k8s.io/klog/v2" | ||
| v1qos "k8s.io/kubernetes/pkg/apis/core/v1/helper/qos" | ||
| bm "k8s.io/kubernetes/pkg/kubelet/cm/topologymanager/bitmask" |
I know about https://docs.google.com/document/d/1WO-ixERpqkCSEXEq30YtEH_z_G-BoLKeCbkRJcKq3xA/edit, but in this context using kubernetes/pkg/kubelet is not ideal from a maintainability point of view.
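To illustrate the concern: the bitmask usage here is small enough that a local helper could avoid the kubelet-internal import. A minimal sketch, assuming only intersection and emptiness checks are needed (this is a stand-in, not the kubelet bitmask API):

    package noderesourcetopology

    // NUMANodeMask is a tiny stand-in for bm.BitMask: each set bit marks a NUMA
    // node (zone) that can satisfy the container's resource requests.
    type NUMANodeMask uint64

    // NewNUMANodeMask returns a mask with the given NUMA node IDs set.
    func NewNUMANodeMask(nodes ...int) NUMANodeMask {
        var m NUMANodeMask
        for _, n := range nodes {
            m |= 1 << uint(n)
        }
        return m
    }

    // And intersects two masks, mirroring how per-resource masks are combined.
    func (m NUMANodeMask) And(other NUMANodeMask) NUMANodeMask { return m & other }

    // IsEmpty reports whether no NUMA node can satisfy all requests.
    func (m NUMANodeMask) IsEmpty() bool { return m == 0 }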
pkg/noderesourcetopology/filter.go
Outdated
    )

    ctx := context.Background()
    go nodeTopologyInformer.Informer().Run(ctx.Done())
I think L342 actually does the same thing, so this line can be removed.
Thank you, I didn't know about the cache in the Lister; now I see it's a ThreadSafeMap.
My understanding is that NodeResourceTopology describes a per-node resource rather than a namespaced one. So does it make more sense to make it a cluster-scoped CRD?
07264a7 to 52b951f (force-push)
The namespace feature is necessary for product purposes, e.g. to isolate tenants, so it's better to keep it.
    deniedPGExpirationTimeSeconds: 3
    kubeConfigPath: "REPLACE_ME_WITH_KUBE_CONFIG_PATH"
    - kubeMaster: "REPLACE_ME_WIHT_KUBE_MASTER"
    + kubeMaster: "REPLACE_ME_WITH_KUBE_MASTER"
what's the change here?
WIHT -> WITH.
It's from commit 33dca04 of this PR, but probably it should be in a separate PR; @swatisehgal what do you think?
Yep, it was a typo I had found and just addressed here. If you both think it is better addressed in a separate PR, I can do that. No problem.
pkg/noderesourcetopology/filter.go
Outdated
    for _, container := range containers {
        for resource, quantity := range container.Resources.Requests {
            if quan, ok := resources[resource]; ok {
nit: quan -> q is fine.
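For reference, a self-contained sketch of what that accumulation loop computes, with the suggested rename to q applied; the helper name accumulateRequests is illustrative, not the PR's actual function:

    package noderesourcetopology

    import v1 "k8s.io/api/core/v1"

    // accumulateRequests sums the resource requests of all containers into a single
    // ResourceList, which a pod-scope handler can then check against a NUMA zone.
    func accumulateRequests(containers []v1.Container) v1.ResourceList {
        resources := v1.ResourceList{}
        for _, container := range containers {
            for resource, quantity := range container.Resources.Requests {
                if q, ok := resources[resource]; ok {
                    quantity.Add(q) // add what has been accumulated so far
                }
                resources[resource] = quantity
            }
        }
        return resources
    }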
@AlexeyPerevalov I noticed some comments haven't been resolved, so let me know when it's ready for another round of review.
7fe21db to a6e0a30 (force-push)
/hold
107c1ef to bac8db5 (force-push)
Huang-Wei
left a comment
In addition to the comments below, some previous comments have been folded, like https://github.com/kubernetes-sigs/scheduler-plugins/pull/143/files#r604520697, please unfold them and resolve accordingly.
BTW: the unit test can be refactored using fake utilities: https://gist.github.com/Huang-Wei/9852f8a53d47fc683295a0097f5dfd51
16decad to afb89f9 (force-push)
/hold
/unhold
Huang-Wei
left a comment
Some final comments. Disabling all other plugins still doesn't look good.
    Plugins: &schedapi.Plugins{
        PreBind: &schedapi.PluginSet{
            Disabled: []schedapi.Plugin{
                {Name: "*"},
I don't think it's related to PreBind of VolumeBinding. The conflict seems to hide in some Filter plugin.
    defer testutils.CleanupTest(t, testCtx)

    // Create a Node.
    nodeName1 := "fake-node-1"
Bump.
Huang-Wei
left a comment
Some final comments to resolve the integration test issue.
    Plugins: &schedapi.Plugins{
        PreBind: &schedapi.PluginSet{
            Disabled: []schedapi.Plugin{
                {Name: "*"},
OK, it's because the memory was built with a plain "100"; you should use "100Gi".
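For clarity, a tiny sketch of the difference, assuming the fake node's capacity is built with resource.MustParse (the package name below is an assumed test package name):

    package integration

    import "k8s.io/apimachinery/pkg/api/resource"

    var (
        // "100" is a bare quantity: used as memory capacity it means 100 bytes,
        // so no pod request fits on the fake node.
        tooSmall = resource.MustParse("100")
        // "100Gi" is the capacity the integration test actually needs.
        memory = resource.MustParse("100Gi")
    )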
Huang-Wei
left a comment
One final nit. LGTM otherwise.
You can address that along with squashing the commits.
Two or three commits are fine - one carries autogen files/docs, the other carries the code implementation.
    for _, node := range []struct {
        name string
A nit: as "name" is the only field, we can simply do:
for _, nodeName := []string{"fake-node-1", "fake-node-2"} { This patch partly implements modified ideas proposed in: https://docs.google.com/document/d/1XGTx6F8qgdq_zPvd87LAfhW1ObQ1McTDAoIWyokY88E and https://docs.google.com/document/d/1gPknVIOiu-c_fpLm53-jUAm-AGQQ8XC0hQYE7hq0L4c The exact idea it implements simplified version of TopologyManager in the kube-scheduler. CRD is described in this document: https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit It also adds Add github.com/k8stopologyawareschedwg/noderesourcetopology-api as dependency This commit includes integration and unit tests. Co-authored-by: Swati Sehgal <[email protected]> Co-authored-by: Wei Huang <[email protected]> Signed-off-by: Swati Sehgal <[email protected]> Signed-off-by: Alexey Perevalov <[email protected]>
f42b39e to c8094c1 (force-push)
/retest
/lgtm
Thanks @AlexeyPerevalov and @swatisehgal!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlexeyPerevalov, Huang-Wei

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
This pull request implements KEP #119.
It implements ideas collected in this document: https://docs.google.com/presentation/d/1SR6XSIFsHkiWTws66LABpTiZaRwYk8IoRts4HQWE8Bc/edit#slide=id.g752682a7d2_68_129
It mostly focuses on an approach where a simplified version of the TopologyManager filters pods at scheduling time.
It was initially proposed as a built-in plugin (implementation: kubernetes/kubernetes#90708, design described in KEP kubernetes/enhancements#1858).
/kind feature