
Conversation

@AlexeyPerevalov
Contributor

This pull request implements KEP #119.

It implements the ideas collected in this document: https://docs.google.com/presentation/d/1SR6XSIFsHkiWTws66LABpTiZaRwYk8IoRts4HQWE8Bc/edit#slide=id.g752682a7d2_68_129

It mostly focuses on an approach where a simplified version of the TopologyManager filters pods at scheduling time.
It was initially proposed as a built-in plugin (implementation: kubernetes/kubernetes#90708, design described in KEP kubernetes/enhancements#1858).
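
For readers less familiar with the scheduler framework, here is a minimal sketch of the Filter-plugin shape this work builds on; the plugin name, struct fields, and import paths are illustrative assumptions, not this PR's exact code:

package noderesourcetopology

import (
	"context"

	v1 "k8s.io/api/core/v1"
	// Depending on the Kubernetes version vendored, the framework package may
	// live at k8s.io/kubernetes/pkg/scheduler/framework or .../framework/v1alpha1.
	framework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// TopologyMatch filters out nodes whose reported NUMA topology cannot satisfy
// the pod's resource requests under the node's topology policy.
type TopologyMatch struct {
	// per-node NodeResourceTopology data, populated from the CRD
	// (via informer callbacks in this PR, or a lister as suggested in review).
}

var _ framework.FilterPlugin = &TopologyMatch{}

// Name returns the plugin name used in the scheduler configuration.
func (tm *TopologyMatch) Name() string { return "NodeResourceTopologyMatch" }

// Filter rejects a node when the pod cannot be aligned to a single NUMA node
// under that node's topology policy; returning nil lets the node pass.
func (tm *TopologyMatch) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// Look up the node's NodeResourceTopology, pick the handler matching its
	// policy, and return framework.NewStatus(framework.Unschedulable, ...) on a
	// mismatch.
	return nil
}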

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 29, 2021
@k8s-ci-robot
Contributor

Hi @AlexeyPerevalov. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 29, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 29, 2021
@denkensk
Member

denkensk commented Feb 1, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 1, 2021
@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch from 0046f48 to ec10aa2 Compare February 1, 2021 09:11
@swatisehgal
Contributor

Associated Enhancement Proposal: Here.

@swatisehgal
Contributor

swatisehgal commented Feb 1, 2021

@Huang-Wei Could you please take a look at this PR and the associated KEP here. We are now focusing on out-of-tree enablement of Topology-aware Scheduling and the CRD API is http://github.com/k8stopologyawareschedwg/noderesourcetopology-api. Thanks in advance!

Contributor

@Huang-Wei Huang-Wei left a comment


First round of review. (I will review the core Filter logic and tests after most of the comments are resolved.)

Most ModeType = "Most"

// to preserve consistency keep it in pkg/apis/core/types.go"
SingleNUMANodeTopologyManagerPolicy TopologyManagerPolicy = "SingleNUMANode"
Contributor

Given that the type is called "TopologyManagerPolicy", IMO SingleNUMANode is a better name?

Contributor Author

These constants (all of type TopologyManagerPolicy) represent the behavior of the resource-management component on the node (currently the TopologyManager, but it could be some external resource manager). This patch series adds support only for SingleNUMANode (the single-numa-node policy of the TopologyManager) and PodTopologyScope (pod-level resource counting in the TopologyManager). In the CRD it is an array of strings, so here it could be a string too, or we can keep the TopologyManagerPolicy type in noderesourcetopology-api.

Regarding the name of the type: since it represents the behavior/policy of the node's resource management based on hardware topology, I think names like TopologyPolicy/TopologyManagerPolicy/ResourceTopologyPolicy show the purpose.
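
For illustration, a minimal sketch of how these constants could carry self-describing comments; the PodTopologyScope constant name is a guess based on this discussion, not the PR's exact code:

// TopologyManagerPolicy describes how the node's resource manager (currently
// the kubelet TopologyManager, potentially an external manager in the future)
// aligns resources with the hardware topology.
type TopologyManagerPolicy string

const (
	// SingleNUMANode: the resources requested by each container must be
	// satisfiable from a single NUMA node (single-numa-node policy, container scope).
	SingleNUMANodeTopologyManagerPolicy TopologyManagerPolicy = "SingleNUMANode"

	// PodTopologyScope: resource counting is done for the pod as a whole, i.e.
	// the aggregate of all container requests must fit a single NUMA node.
	PodTopologyScopeTopologyManagerPolicy TopologyManagerPolicy = "PodTopologyScope"
)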

Contributor

It sounds to me like we should name them SingleNUMANodeGeneral and SingleNUMANodePodTopology?

Contributor Author

I agree, it's worth changing them now, since the current names carry some historical baggage.


// to preserve consistency keep it in pkg/apis/core/types.go"
SingleNUMANodeTopologyManagerPolicy TopologyManagerPolicy = "SingleNUMANode"

Contributor

Add a comment explaining it.

// Most is the string "Most".
Most ModeType = "Most"

// to preserve consistency keep it in pkg/apis/core/types.go"
Contributor

Add a comment explaining its semantics (instead of mentioning relationship with types.go).


type nodeTopologyMap map[string]topologyv1alpha1.NodeResourceTopology

type PolicyHandler interface {
Contributor

I think we over-designed it a bit. IIUC we inject two identical implementations into the plugin instance, while using the plugin instance itself as the interface implementation. This is confusing: isn't the plugin instance already enough to handle all the logic?

Contributor Author

Now we have two implementations, and they are not identical: one works on the combined resources of all containers and makes a decision for the whole pod, while the other makes a decision per container. Quite different implementations may exist in the future, even ones not related to NUMA, since we may also support https://github.com/intel/cri-resource-manager (using the same noderesourcetopology-api, which is now flexible enough to represent any kind of hardware).

Since topologyPolicy is an attribute of the worker node, the idea is to choose the appropriate handler for the node's policy, or to skip the node if no policy is found for it (e.g. none was provided).
It could be done more simply, e.g. with an if clause, if we keep a single Filter plugin for all policies. Alternatively, we could have one Filter plugin per policy, but then each Filter plugin implementation would do unnecessary work, such as skipping nodes configured for a different policy.

Contributor

Commented in #143 (comment)


nodeName := nodeInfo.Node().Name

topologyPolicies := getTopologyPolicies(tm.nodeTopologies, nodeName)
Contributor

If there is no particular reason (e.g., indexing), it's unnecessary to maintain a local copy of nodeTopologies. Instead, inject an xyzInformerFactory().xyzInformer.Lister() into the plugin instance, and then use the xyzLister's methods to fetch the xyz objects when needed. Then all the onXYZAdd/Update/Delete methods can also be removed.

Contributor Author

@AlexeyPerevalov AlexeyPerevalov Feb 2, 2021

nodeTopologies is populated in the Add/Update callbacks; it's not a poll model but a push model, where the plugin is informed about node topology state modifications (e.g. after a new pod is launched).
I think it's better to update nodeTopologies as soon as the node state changes, rather than while we're in the Filter plugin, because I don't know a proper way to refresh the xyz objects otherwise, except maybe on a time-interval basis.

Contributor

Oh, you may be misunderstanding the xyzLister provided by the Kubernetes client-go SDK. It actually maintains a local cache for you, and the xyzLister's methods fetch the xyz object(s) from that client-side cache in an O(1) manner. It is not polling objects from the API server.

In other words, no matter how the state of your CR changes, the cache (xyzStore) that backs the xyzLister is (almost) up to date.

IIUC, we can just rely on the xyzLister, because we only look at the up-to-date state of particular xyz objects during a scheduling cycle (the Filter() hook) - rather than needing to be notified and proactively triggering the scheduling of a Pod.
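
A rough sketch of the suggested lister-based lookup; the lister/namespace fields assumed on the plugin and the generated method names are guesses based on standard client-gen output for this CRD:

// Assumed wiring in the plugin's New() function:
//   tm.lister = topologyInformerFactory.Topology().V1alpha1().NodeResourceTopologies().Lister()
// The lister reads from the informer's local cache (a thread-safe store kept
// current by watch events), so this lookup never hits the API server.
func (tm *TopologyMatch) topologyForNode(nodeName string) (*topologyv1alpha1.NodeResourceTopology, error) {
	// For a namespaced CRD the lookup is namespace-scoped.
	return tm.lister.NodeResourceTopologies(tm.namespace).Get(nodeName)
}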


topologyPolicies := getTopologyPolicies(tm.nodeTopologies, nodeName)
for _, policyName := range topologyPolicies {
if handler, ok := tm.topologyPolicyHandlers[policyName]; ok {
Contributor

Given that the two objects are actually identical, and there are no struct fields associated with either, I'd suggest just creating two stateless PolicyFilter() methods (e.g., PodLevelFilter() and SingleNUMANodeFilter()) to handle them individually.

Contributor Author

Do you suggest doing it with map[string]interface{}, or just comparing policyName and calling the appropriate handler?

Contributor

We can define a function type:

type PolicyFilter func(*v1.Pod, topologyv1alpha1.ZoneList) *framework.Status

And then register each function implementation in the map: map[apiconfig.TopologyManagerPolicy]PolicyFilter.
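
A small sketch of that suggestion, as a fragment inside the plugin package (reusing its existing imports); the concrete filter-function names are placeholders:

// PolicyFilter decides whether a pod fits the zones a node reports under one
// specific topology policy.
type PolicyFilter func(pod *v1.Pod, zones topologyv1alpha1.ZoneList) *framework.Status

// One stateless function per supported policy; this map replaces the
// PolicyHandler interface and the two injected handler objects.
var policyFilters = map[apiconfig.TopologyManagerPolicy]PolicyFilter{
	apiconfig.SingleNUMANodeTopologyManagerPolicy:  singleNUMANodeContainerLevelFilter, // per-container check
	apiconfig.PodTopologyScopeTopologyManagerPolicy: singleNUMANodePodLevelFilter,      // whole-pod check
}

// Dispatch inside Filter(); nodes with an unknown policy are simply not filtered here.
func filterByPolicies(pod *v1.Pod, policies []apiconfig.TopologyManagerPolicy, zones topologyv1alpha1.ZoneList) *framework.Status {
	for _, policy := range policies {
		if filter, ok := policyFilters[policy]; ok {
			if status := filter(pod, zones); status != nil {
				return status
			}
		}
	}
	return nil
}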

@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch 2 times, most recently from d48189a to 8ca3805 Compare February 3, 2021 10:50
@swatisehgal swatisehgal force-pushed the TopologyAwareSchedulerPerNUMA branch from 78cc192 to 07264a7 Compare February 3, 2021 13:35
"k8s.io/client-go/tools/clientcmd"
"k8s.io/klog/v2"
v1qos "k8s.io/kubernetes/pkg/apis/core/v1/helper/qos"
bm "k8s.io/kubernetes/pkg/kubelet/cm/topologymanager/bitmask"
Contributor Author

I know about https://docs.google.com/document/d/1WO-ixERpqkCSEXEq30YtEH_z_G-BoLKeCbkRJcKq3xA/edit;
in this context, using kubernetes/pkg/kubelet is not ideal from a maintainability point of view.

)

ctx := context.Background()
go nodeTopologyInformer.Informer().Run(ctx.Done())
Contributor

I think L342 actually does the same thing, so this line can be removed.


@AlexeyPerevalov
Contributor Author

Thank you, I didn't know about the cache in the Lister; now I see it's a ThreadSafeMap.
The List method returns []*NodeResourceTopology, so here we'd need to iterate to find the appropriate one.
It's also possible to do it another way:
lister.NodeResourceTopology(Namespace).Get(hostName)
where the map is keyed by "namespace/hostName", so we will provide the namespace in the config file (AllNamespace doesn't work, since the constant is "", which concatenates to "/hostName", and the store doesn't know about wildcards).

@Huang-Wei
Contributor

Also it's possible to do it by another way:
lister.NodeResourceTopology(Namespace).Get(hostName)
and map keeps with key as "namespace/hostName"

My understanding is that NodeResourceTopology is similar to Node, and hence cluster-scoped. However, it's defined as namespaced:

https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/blob/fd566f16210b3eb0e66e1c4f110ad8660b4027eb/manifests/crd.yaml#L62

So does it make more sense to make it a cluster-scoped CRD?

@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch from 07264a7 to 52b951f Compare February 9, 2021 08:55
@AlexeyPerevalov
Contributor Author

Also it's possible to do it by another way:
lister.NodeResourceTopology(Namespace).Get(hostName)
and map keeps with key as "namespace/hostName"

My understanding is that NodeResourceTopology is similar to Node, and hence cluster-scoped. However, it's defined as namespaced:

https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/blob/fd566f16210b3eb0e66e1c4f110ad8660b4027eb/manifests/crd.yaml#L62

So does it make more sense to make it a cluster-scoped CRD?

The namespace feature is necessary for product purposes, e.g. to isolate tenants; it's better to keep it.

deniedPGExpirationTimeSeconds: 3
kubeConfigPath: "REPLACE_ME_WITH_KUBE_CONFIG_PATH"
kubeMaster: "REPLACE_ME_WIHT_KUBE_MASTER"
kubeMaster: "REPLACE_ME_WITH_KUBE_MASTER"
Contributor

what's the change here?

Contributor Author

WIHT -> WITH
It's from commit 33dca04 of this PR, but it should probably go in a separate PR. @swatisehgal, what do you think?

Contributor

Yep, it was a typo I had found and just addressed here. If you both think it's better to address it in a separate PR, I can do that. No problem.


for _, container := range containers {
for resource, quantity := range container.Resources.Requests {
if quan, ok := resources[resource]; ok {
Contributor

nit: quan -> q is fine.
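
For reference, a minimal sketch of the pod-level aggregation this loop performs (the helper name and signature are illustrative, not the PR's actual code):

// sumPodRequests adds up the resource requests of all containers; this total is
// what the pod-scope check has to fit onto a single NUMA node.
func sumPodRequests(containers []v1.Container) v1.ResourceList {
	total := v1.ResourceList{}
	for _, container := range containers {
		for resource, quantity := range container.Resources.Requests {
			if q, ok := total[resource]; ok {
				q.Add(quantity) // Quantity.Add mutates the copy, so write it back
				total[resource] = q
			} else {
				total[resource] = quantity.DeepCopy()
			}
		}
	}
	return total
}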

@Huang-Wei
Contributor

@AlexeyPerevalov I noticed some comments haven't been resolved, so let me know when it's ready for another round of review.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2021
@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch from 7fe21db to a6e0a30 Compare March 4, 2021 09:12
@AlexeyPerevalov
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2021
@swatisehgal swatisehgal force-pushed the TopologyAwareSchedulerPerNUMA branch 3 times, most recently from 107c1ef to bac8db5 Compare March 30, 2021 23:47
Contributor

@Huang-Wei Huang-Wei left a comment


In addition to the comments below, some previous comments have been folded, like https://github.com/kubernetes-sigs/scheduler-plugins/pull/143/files#r604520697, please unfold them and resolve accordingly.

BTW: the unit test can be refactored using fake utilities: https://gist.github.com/Huang-Wei/9852f8a53d47fc683295a0097f5dfd51

@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch from 16decad to afb89f9 Compare April 1, 2021 10:37
@AlexeyPerevalov
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2021
@swatisehgal
Contributor

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2021
Contributor

@Huang-Wei Huang-Wei left a comment


Some final comments. Disabling all other plugins still doesn't look good.

Plugins: &schedapi.Plugins{
PreBind: &schedapi.PluginSet{
Disabled: []schedapi.Plugin{
{Name: "*"},
Contributor

I don't think it's related to the PreBind of VolumeBinding. The conflict seems to be hiding in some Filter plugin.

defer testutils.CleanupTest(t, testCtx)

// Create a Node.
nodeName1 := "fake-node-1"
Contributor

Bump.

Contributor

@Huang-Wei Huang-Wei left a comment


Some final comments to resolve the integration test issue.

Plugins: &schedapi.Plugins{
PreBind: &schedapi.PluginSet{
Disabled: []schedapi.Plugin{
{Name: "*"},
Contributor

OK, it's because the memory was built with a plain "100"; you should use "100Gi".
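
For clarity, a standalone sketch (not code from this PR) showing the difference with k8s.io/apimachinery's resource package:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	plain := resource.MustParse("100")    // interpreted as 100 bytes when used as a memory quantity
	binary := resource.MustParse("100Gi") // 100 * 2^30 bytes

	fmt.Println(plain.Value())  // 100
	fmt.Println(binary.Value()) // 107374182400
}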

Contributor

@Huang-Wei Huang-Wei left a comment


One final nit. LGTM otherwise.

You can address that along with squashing the commits.
Two or three commits are fine - one carries the autogen files/docs, the other carries the code implementation.

Comment on lines 166 to 167
for _, node := range []struct {
name string
Contributor

A nit: as "name" is the only field, we can simply do:

for _, nodeName := range []string{"fake-node-1", "fake-node-2"} {

AlexeyPerevalov and others added 4 commits April 6, 2021 12:25
This patch partly implements modified ideas proposed in:
https://docs.google.com/document/d/1XGTx6F8qgdq_zPvd87LAfhW1ObQ1McTDAoIWyokY88E
and
https://docs.google.com/document/d/1gPknVIOiu-c_fpLm53-jUAm-AGQQ8XC0hQYE7hq0L4c

The exact idea: it implements a simplified version of the TopologyManager in
the kube-scheduler.

CRD is described in this document:
https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit

It also adds github.com/k8stopologyawareschedwg/noderesourcetopology-api as a dependency.

This commit includes integration and unit tests.

Co-authored-by: Swati Sehgal <[email protected]>
Co-authored-by: Wei Huang <[email protected]>
Signed-off-by: Swati Sehgal <[email protected]>
Signed-off-by: Alexey Perevalov <[email protected]>
@AlexeyPerevalov AlexeyPerevalov force-pushed the TopologyAwareSchedulerPerNUMA branch from f42b39e to c8094c1 Compare April 6, 2021 09:26
@Huang-Wei
Contributor

/retest

@Huang-Wei
Contributor

/lgtm
/approve

Thanks @AlexeyPerevalov and @swatisehgal !

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlexeyPerevalov, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2021
@k8s-ci-robot k8s-ci-robot merged commit 16b3ba7 into kubernetes-sigs:master Apr 6, 2021
k8s-ci-robot added a commit that referenced this pull request Apr 6, 2021
…143-#170-upstream-release-1.19

Automated cherry pick of #156: added LoadVariationRiskBalancing plugin (apis + docs) #143: NUMA aware scheduling #170: Add default value to pg status
