
Conversation

@AlexeyPerevalov
Contributor

This pull request implements KEP #119.

It implements the ideas collected in this document: https://docs.google.com/presentation/d/1SR6XSIFsHkiWTws66LABpTiZaRwYk8IoRts4HQWE8Bc/edit#slide=id.g752682a7d2_68_129

It mostly focuses on an approach where a simplified version of the TopologyManager filters pods at scheduling time.
It was initially proposed as a built-in plugin (implementation: kubernetes/kubernetes#90708, design described in KEP kubernetes/enhancements#1858).
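
For readers less familiar with the scheduler framework, here is a minimal sketch of the Filter-plugin shape this work builds on; the plugin name, struct fields, and import paths are illustrative assumptions, not this PR's exact code:

package noderesourcetopology

import (
	"context"

	v1 "k8s.io/api/core/v1"
	// Depending on the Kubernetes version vendored, the framework package may
	// live at k8s.io/kubernetes/pkg/scheduler/framework or .../framework/v1alpha1.
	framework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// TopologyMatch filters out nodes whose reported NUMA topology cannot satisfy
// the pod's resource requests under the node's topology policy.
type TopologyMatch struct {
	// per-node NodeResourceTopology data, populated from the CRD
	// (via informer callbacks in this PR, or a lister as suggested in review).
}

var _ framework.FilterPlugin = &TopologyMatch{}

// Name returns the plugin name used in the scheduler configuration.
func (tm *TopologyMatch) Name() string { return "NodeResourceTopologyMatch" }

// Filter rejects a node when the pod cannot be aligned to a single NUMA node
// under that node's topology policy; returning nil lets the node pass.
func (tm *TopologyMatch) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// Look up the node's NodeResourceTopology, pick the handler matching its
	// policy, and return framework.NewStatus(framework.Unschedulable, ...) on a
	// mismatch.
	return nil
}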

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 29, 2021
@k8s-ci-robot
Contributor

Hi @AlexeyPerevalov. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 29, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 29, 2021
@denkensk
Member

denkensk commented Feb 1, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 1, 2021
@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch from 0046f48 to ec10aa2 Compare February 1, 2021 09:11
@swatisehgal
Contributor

Associated Enhancement Proposal: Here.

@swatisehgal
Contributor

swatisehgal commented Feb 1, 2021

@Huang-Wei Could you please take a look at this PR and the associated KEP here. We are now focusing on out-of-tree enablement of Topology-aware Scheduling and the CRD API is http://github.com/k8stopologyawareschedwg/noderesourcetopology-api. Thanks in advance!

Contributor

@Huang-Wei Huang-Wei left a comment


First round of review. (I will review the core Filter logic and tests after most of the comments are resolved.)

Most ModeType = "Most"

// to preserve consistency keep it in pkg/apis/core/types.go"
SingleNUMANodeTopologyManagerPolicy TopologyManagerPolicy = "SingleNUMANode"
Contributor

Given that the type is called "TopologyManagerPolicy", IMO SingleNUMANode is a better name?

Contributor Author

These constants (all of type TopologyManagerPolicy) represent the behavior of the resource-management component on the node (currently the TopologyManager, but it could be some external resource manager). This patch series adds support only for SingleNUMANode (the single-numa-node policy of the TopologyManager) and PodTopologyScope (pod-level resource counting in the TopologyManager). In the CRD it is an array of strings, so here it could be a string too, or we can keep the TopologyManagerPolicy type in noderesourcetopology-api.

Regarding the name of the type: since it represents the behavior/policy of the node's resource management based on hardware topology, I think names like TopologyPolicy/TopologyManagerPolicy/ResourceTopologyPolicy show the purpose.
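
For illustration, a minimal sketch of how these constants could carry self-describing comments; the PodTopologyScope constant name is a guess based on this discussion, not the PR's exact code:

// TopologyManagerPolicy describes how the node's resource manager (currently
// the kubelet TopologyManager, potentially an external manager in the future)
// aligns resources with the hardware topology.
type TopologyManagerPolicy string

const (
	// SingleNUMANode: the resources requested by each container must be
	// satisfiable from a single NUMA node (single-numa-node policy, container scope).
	SingleNUMANodeTopologyManagerPolicy TopologyManagerPolicy = "SingleNUMANode"

	// PodTopologyScope: resource counting is done for the pod as a whole, i.e.
	// the aggregate of all container requests must fit a single NUMA node.
	PodTopologyScopeTopologyManagerPolicy TopologyManagerPolicy = "PodTopologyScope"
)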

Contributor

It sounds to me like we should name them SingleNUMANodeGeneral and SingleNUMANodePodTopology?

Contributor Author

I agree, it's worth changing them now, since the current names carry some historical baggage.


// to preserve consistency keep it in pkg/apis/core/types.go"
SingleNUMANodeTopologyManagerPolicy TopologyManagerPolicy = "SingleNUMANode"

Contributor

Add a comment explaining it.

// Most is the string "Most".
Most ModeType = "Most"

// to preserve consistency keep it in pkg/apis/core/types.go"
Contributor

Add a comment explaining its semantics (instead of mentioning relationship with types.go).


type nodeTopologyMap map[string]topologyv1alpha1.NodeResourceTopology

type PolicyHandler interface {
Contributor

I think we over-designed it a bit. IIUC we inject two identical implementations into the plugin instance, while using the plugin instance itself as the interface implementation. This is confusing: isn't the plugin instance already enough to handle all the logic?

Contributor Author

Now we have two implementations, and they are not identical: one works on the combined resources of all containers and makes a decision for the whole pod, while the other makes a decision per container. Quite different implementations may exist in the future, even ones not related to NUMA, since we may also support https://github.com/intel/cri-resource-manager (using the same noderesourcetopology-api, which is now flexible enough to represent any kind of hardware).

Since topologyPolicy is an attribute of the worker node, the idea is to choose the appropriate handler for the node's policy, or to skip the node if no policy is found for it (e.g. none was provided).
It could be done more simply, e.g. with an if clause, if we keep a single Filter plugin for all policies. Alternatively, we could have one Filter plugin per policy, but then each Filter plugin implementation would do unnecessary work, such as skipping nodes configured for a different policy.

Contributor

Commented in #143 (comment)


nodeName := nodeInfo.Node().Name

topologyPolicies := getTopologyPolicies(tm.nodeTopologies, nodeName)
Contributor

If there is no particular reason (e.g., indexing), it's unnecessary to maintain a local copy of nodeTopologies. Instead, inject an xyzInformerFactory().xyzInformer.Lister() into the plugin instance, and then use the xyzLister's methods to fetch the xyz objects when needed. Then all the onXYZAdd/Update/Delete methods can also be removed.

Contributor Author

@AlexeyPerevalov AlexeyPerevalov Feb 2, 2021

nodeTopologies is populated in the Add/Update callbacks; it's not a poll model but a push model, where the plugin is informed about node topology state modifications (e.g. after a new pod is launched).
I think it's better to update nodeTopologies as soon as the node state changes, rather than while we're in the Filter plugin, because I don't know a proper way to refresh the xyz objects otherwise, except maybe on a time-interval basis.

Contributor

Oh, you may be misunderstanding the xyzLister provided by the Kubernetes client-go SDK. It actually maintains a local cache for you, and the xyzLister's methods fetch the xyz object(s) from that client-side cache in an O(1) manner. It is not polling objects from the API server.

In other words, no matter how the state of your CR changes, the cache (xyzStore) that backs the xyzLister is (almost) up to date.

IIUC, we can just rely on the xyzLister, because we only look at the up-to-date state of particular xyz objects during a scheduling cycle (the Filter() hook) - rather than needing to be notified and proactively triggering the scheduling of a Pod.
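
A rough sketch of the suggested lister-based lookup; the lister/namespace fields assumed on the plugin and the generated method names are guesses based on standard client-gen output for this CRD:

// Assumed wiring in the plugin's New() function:
//   tm.lister = topologyInformerFactory.Topology().V1alpha1().NodeResourceTopologies().Lister()
// The lister reads from the informer's local cache (a thread-safe store kept
// current by watch events), so this lookup never hits the API server.
func (tm *TopologyMatch) topologyForNode(nodeName string) (*topologyv1alpha1.NodeResourceTopology, error) {
	// For a namespaced CRD the lookup is namespace-scoped.
	return tm.lister.NodeResourceTopologies(tm.namespace).Get(nodeName)
}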


topologyPolicies := getTopologyPolicies(tm.nodeTopologies, nodeName)
for _, policyName := range topologyPolicies {
if handler, ok := tm.topologyPolicyHandlers[policyName]; ok {
Contributor

Given that the two objects are actually identical, and there are no struct fields associated with either, I'd suggest just creating two stateless PolicyFilter() methods (e.g., PodLevelFilter() and SingleNUMANodeFilter()) to handle them individually.

Contributor Author

Do you suggest doing it with map[string]interface{}, or just comparing policyName and calling the appropriate handler?

Contributor

We can define a function type:

type PolicyFilter func(*v1.Pod, topologyv1alpha1.ZoneList) *framework.Status

And then register each function implementation in the map: map[apiconfig.TopologyManagerPolicy]PolicyFilter.
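
A small sketch of that suggestion, as a fragment inside the plugin package (reusing its existing imports); the concrete filter-function names are placeholders:

// PolicyFilter decides whether a pod fits the zones a node reports under one
// specific topology policy.
type PolicyFilter func(pod *v1.Pod, zones topologyv1alpha1.ZoneList) *framework.Status

// One stateless function per supported policy; this map replaces the
// PolicyHandler interface and the two injected handler objects.
var policyFilters = map[apiconfig.TopologyManagerPolicy]PolicyFilter{
	apiconfig.SingleNUMANodeTopologyManagerPolicy:  singleNUMANodeContainerLevelFilter, // per-container check
	apiconfig.PodTopologyScopeTopologyManagerPolicy: singleNUMANodePodLevelFilter,      // whole-pod check
}

// Dispatch inside Filter(); nodes with an unknown policy are simply not filtered here.
func filterByPolicies(pod *v1.Pod, policies []apiconfig.TopologyManagerPolicy, zones topologyv1alpha1.ZoneList) *framework.Status {
	for _, policy := range policies {
		if filter, ok := policyFilters[policy]; ok {
			if status := filter(pod, zones); status != nil {
				return status
			}
		}
	}
	return nil
}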

@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch 2 times, most recently from d48189a to 8ca3805 Compare February 3, 2021 10:50
@swatisehgal swatisehgal force-pushed the TopologyAwareSchedulerPerNUMA branch from 78cc192 to 07264a7 Compare February 3, 2021 13:35
"k8s.io/client-go/tools/clientcmd"
"k8s.io/klog/v2"
v1qos "k8s.io/kubernetes/pkg/apis/core/v1/helper/qos"
bm "k8s.io/kubernetes/pkg/kubelet/cm/topologymanager/bitmask"
Contributor Author

I know about https://docs.google.com/document/d/1WO-ixERpqkCSEXEq30YtEH_z_G-BoLKeCbkRJcKq3xA/edit;
in this context, using kubernetes/pkg/kubelet is not ideal from a maintainability point of view.

)

ctx := context.Background()
go nodeTopologyInformer.Informer().Run(ctx.Done())
Contributor

I think L342 actually does the same thing, so this line can be removed.


@AlexeyPerevalov
Contributor Author

Thank you, I didn't know about the cache in the Lister; now I see it's a ThreadSafeMap.
The List method returns []*NodeResourceTopology, so here we'd need to iterate to find the appropriate one.
It's also possible to do it another way:
lister.NodeResourceTopology(Namespace).Get(hostName)
where the map is keyed by "namespace/hostName", so we will provide the namespace in the config file (AllNamespace doesn't work, since the constant is "", which concatenates to "/hostName", and the store doesn't know about wildcards).

@Huang-Wei
Contributor

Also it's possible to do it by another way:
lister.NodeResourceTopology(Namespace).Get(hostName)
and map keeps with key as "namespace/hostName"

My understanding is that NodeResourceTopology is similar to Node, and hence cluster-scoped. However, it's defined as namespaced:

https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/blob/fd566f16210b3eb0e66e1c4f110ad8660b4027eb/manifests/crd.yaml#L62

So does it make more sense to make it a cluster-scoped CRD?

@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch from 07264a7 to 52b951f Compare February 9, 2021 08:55
@AlexeyPerevalov
Contributor Author

Also it's possible to do it by another way:
lister.NodeResourceTopology(Namespace).Get(hostName)
and map keeps with key as "namespace/hostName"

My understanding is that NodeResourceTopology is similar to Node, and hence cluster-scoped. However, it's defined as namespaced:

https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/blob/fd566f16210b3eb0e66e1c4f110ad8660b4027eb/manifests/crd.yaml#L62

So does it make more sense to make it a cluster-scoped CRD?

The namespace feature is necessary for product purposes, e.g. to isolate tenants; it's better to keep it.

deniedPGExpirationTimeSeconds: 3
kubeConfigPath: "REPLACE_ME_WITH_KUBE_CONFIG_PATH"
kubeMaster: "REPLACE_ME_WIHT_KUBE_MASTER"
kubeMaster: "REPLACE_ME_WITH_KUBE_MASTER"
Contributor

what's the change here?

Contributor Author

WIHT -> WITH
It's from commit 33dca04 of this PR, but it should probably go in a separate PR. @swatisehgal, what do you think?

Contributor

Yep, it was a typo I had found and just addressed here. If you both think it's better to address it in a separate PR, I can do that. No problem.


for _, container := range containers {
for resource, quantity := range container.Resources.Requests {
if quan, ok := resources[resource]; ok {
Contributor

nit: quan -> q is fine.
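
For reference, a minimal sketch of the pod-level aggregation this loop performs (the helper name and signature are illustrative, not the PR's actual code):

// sumPodRequests adds up the resource requests of all containers; this total is
// what the pod-scope check has to fit onto a single NUMA node.
func sumPodRequests(containers []v1.Container) v1.ResourceList {
	total := v1.ResourceList{}
	for _, container := range containers {
		for resource, quantity := range container.Resources.Requests {
			if q, ok := total[resource]; ok {
				q.Add(quantity) // Quantity.Add mutates the copy, so write it back
				total[resource] = q
			} else {
				total[resource] = quantity.DeepCopy()
			}
		}
	}
	return total
}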

@Huang-Wei
Contributor

@AlexeyPerevalov I noticed some comments haven't been resolved, so let me know when it's ready for another round of review.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2021
@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch from 7fe21db to a6e0a30 Compare March 4, 2021 09:12
@AlexeyPerevalov
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2021
@swatisehgal swatisehgal force-pushed the TopologyAwareSchedulerPerNUMA branch 3 times, most recently from 107c1ef to bac8db5 Compare March 30, 2021 23:47
Contributor

@Huang-Wei Huang-Wei left a comment


In addition to the comments below, some previous comments have been folded, like https://github.com/kubernetes-sigs/scheduler-plugins/pull/143/files#r604520697, please unfold them and resolve accordingly.

BTW: the unit test can be refactored using fake utilities: https://gist.github.com/Huang-Wei/9852f8a53d47fc683295a0097f5dfd51

@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch from 16decad to afb89f9 Compare April 1, 2021 10:37
@AlexeyPerevalov
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2021
@swatisehgal
Contributor

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2021
Contributor

@Huang-Wei Huang-Wei left a comment


Some final comments. Disabling all other plugins still doesn't look good.

Plugins: &schedapi.Plugins{
PreBind: &schedapi.PluginSet{
Disabled: []schedapi.Plugin{
{Name: "*"},
Contributor

I don't think it's related to the PreBind of VolumeBinding. The conflict seems to be hiding in some Filter plugin.

defer testutils.CleanupTest(t, testCtx)

// Create a Node.
nodeName1 := "fake-node-1"
Contributor

Bump.

Contributor

@Huang-Wei Huang-Wei left a comment


Some final comments to resolve the integration test issue.

Plugins: &schedapi.Plugins{
PreBind: &schedapi.PluginSet{
Disabled: []schedapi.Plugin{
{Name: "*"},
Contributor

OK, it's because the memory was built with a plain "100"; you should use "100Gi".
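
For clarity, a standalone sketch (not code from this PR) showing the difference with k8s.io/apimachinery's resource package:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	plain := resource.MustParse("100")    // interpreted as 100 bytes when used as a memory quantity
	binary := resource.MustParse("100Gi") // 100 * 2^30 bytes

	fmt.Println(plain.Value())  // 100
	fmt.Println(binary.Value()) // 107374182400
}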

Contributor

@Huang-Wei Huang-Wei left a comment


One final nit. LGTM otherwise.

You can address that along with squashing the commits.
Two or three commits are fine - one carries the autogen files/docs, the other carries the code implementation.

Comment on lines 166 to 167
for _, node := range []struct {
name string
Contributor

A nit: as "name" is the only field, we can simply do:

for _, nodeName := range []string{"fake-node-1", "fake-node-2"} {

AlexeyPerevalov and others added 4 commits April 6, 2021 12:25
This patch partly implements modified ideas proposed in:
https://docs.google.com/document/d/1XGTx6F8qgdq_zPvd87LAfhW1ObQ1McTDAoIWyokY88E
and
https://docs.google.com/document/d/1gPknVIOiu-c_fpLm53-jUAm-AGQQ8XC0hQYE7hq0L4c

The exact idea: it implements a simplified version of the TopologyManager in
the kube-scheduler.

CRD is described in this document:
https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit

It also adds github.com/k8stopologyawareschedwg/noderesourcetopology-api as a dependency.

This commit includes integration and unit tests.

Co-authored-by: Swati Sehgal <[email protected]>
Co-authored-by: Wei Huang <[email protected]>
Signed-off-by: Swati Sehgal <[email protected]>
Signed-off-by: Alexey Perevalov <[email protected]>
@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch from f42b39e to c8094c1 Compare April 6, 2021 09:26
@Huang-Wei
Contributor

/retest

@Huang-Wei
Contributor

/lgtm
/approve

Thanks @AlexeyPerevalov and @swatisehgal !

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlexeyPerevalov, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2021
@k8s-ci-robot k8s-ci-robot merged commit 16b3ba7 into kubernetes-sigs:master Apr 6, 2021
k8s-ci-robot added a commit that referenced this pull request Apr 6, 2021
…143-#170-upstream-release-1.19

Automated cherry pick of #156: added LoadVariationRiskBalancing plugin (apis + docs) #143: NUMA aware scheduling #170: Add default value to pg status
