---
title: Simplified version of TopologyManager in kube-scheduler
authors:
  - "@AlexeyPerevalov"
owning-sig: sig-scheduling
participating-sigs:
reviewers:
  - "@huang-wei"
approvers:
  - "@huang-wei"
editor: TBD
creation-date: 2020-05-28
last-updated: 2020-05-28
status: implementable
see-also:
superseded-by:
---
# Topology Aware Scheduling

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Use cases](#use-cases)
- [Test plans](#test-plans)
- [Graduation criteria](#graduation-criteria)
- [Implementation history](#implementation-history)
<!-- /toc -->

# Summary

This document describes the behaviour of the Kubernetes scheduler that takes
worker node NUMA topology into account.

# Motivation

After the Topology Manager was introduced, the problem of launching pods in a
cluster whose worker nodes have different NUMA topologies, and different
amounts of resources within those topologies, became relevant. A pod could be
scheduled onto a node where the total amount of resources is sufficient, but
whose resource distribution cannot satisfy the configured topology policy. In
this case the pod fails to start. A much better behaviour for the scheduler
would be to select a node on which the kubelet admit handlers will pass.

## Goals

- Make the scheduling process more precise for worker nodes with NUMA topology.

## Non-Goals

- Changing the Topology Manager behaviour so that it can work with a policy
specified in the PodSpec.
- This proposal requires exposing NUMA topology information. This KEP doesn't
describe how to expose all the necessary information; it only declares what
kind of information is necessary.

# Proposal

A built-in kube-scheduler plugin will be added to the main tree. The plugin
implements a simplified version of the Topology Manager, so it differs from the
original Topology Manager algorithm. The plugin checks whether a pod can run on
a node only for the single-numa-node policy: since it is the most strict
policy, launching the pod on a node with any of the other existing policies
will succeed if the single-numa-node condition passes for that worker node.
The proposed plugin will use a node label to identify which topology policy is
enabled on the node.
To work, this plugin requires the topology of the available resources on the
worker nodes.

## Node labels

The node label contains the name of the topology policy currently configured in
the kubelet.

The proposed node label may look like this:
`beta.kubernetes.io/topology=none|best-effort|restricted|single-numa-node`

It is based on [this proposal](https://github.com/kubernetes/enhancements/pull/1340).

To use these labels both in kube-scheduler and in the kubelet, the string
constants for these labels should be moved from pkg/kubelet/cm/topologymanager/
and pkg/kubelet/apis/config/types.go to pkg/apis/core/types.go.

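As an illustration, the plugin could use this label in its Filter extension
point to skip nodes that do not run the single-numa-node policy. The sketch
below is only illustrative; the constant and helper names are not part of this
KEP.

```go
package topologymatch

import (
	v1 "k8s.io/api/core/v1"
)

// topologyPolicyLabel is the node label proposed above; the constant name
// itself is illustrative.
const topologyPolicyLabel = "beta.kubernetes.io/topology"

// hasSingleNUMANodePolicy reports whether the kubelet on this node runs the
// single-numa-node Topology Manager policy, which is the only policy the
// plugin evaluates.
func hasSingleNUMANodePolicy(node *v1.Node) bool {
	return node.Labels[topologyPolicyLabel] == "single-numa-node"
}
```
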
## Topology format

The available resources, together with the topology of the node, should be
stored in a CRD. The format of the topology is described
[in this document](https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit).

A daemon running outside of the kubelet will collect all the necessary
information about running pods. Based on the allocatable resources of the node
and the resources consumed by pods, it will publish the available resources in
the CRD, where one CRD instance represents one worker node. The name of the CRD
instance is the name of the worker node.

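The core of that computation is a per-NUMA-node subtraction of consumed
resources from allocatable ones. A minimal sketch, under the assumption that
the daemon already knows per-NUMA consumption (for example via the kubelet pod
resources API), could look like this; the helper name is hypothetical.

```go
package exporter

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// availableResources computes allocatable minus consumed for a single NUMA
// node, clamping negative results to zero. Both arguments are ResourceLists
// restricted to that NUMA node.
func availableResources(allocatable, consumed v1.ResourceList) v1.ResourceList {
	available := v1.ResourceList{}
	for name, alloc := range allocatable {
		remaining := alloc.DeepCopy()
		if used, ok := consumed[name]; ok {
			remaining.Sub(used)
		}
		if remaining.Sign() < 0 {
			remaining = *resource.NewQuantity(0, remaining.Format)
		}
		available[name] = remaining
	}
	return available
}
```
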
## Plugin implementation details

Since the topology of the node is stored in a CRD, kube-scheduler should
subscribe to updates of the appropriate CRD type. Kube-scheduler will use
informers generated for this CRD.

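While the KEP proposes generated (typed) informers, the sketch below uses a
dynamic informer from client-go to keep the example self-contained; the
group/version/resource of the CRD is an assumption, not something this KEP
defines.

```go
package topologymatch

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// startTopologyInformer watches the per-node topology CRD and keeps a local
// cache that the Filter extension point can read without extra API calls.
func startTopologyInformer(cfg *rest.Config, stopCh <-chan struct{}) (cache.SharedIndexInformer, error) {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	// The GVR below is hypothetical; the real group and version depend on how
	// the CRD from the previous section is published.
	gvr := schema.GroupVersionResource{
		Group:    "topology.node.k8s.io",
		Version:  "v1alpha1",
		Resource: "noderesourcetopologies",
	}
	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 30*time.Second)
	informer := factory.ForResource(gvr).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* cache the node's NUMA view */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* refresh the cached entry */ },
		DeleteFunc: func(obj interface{}) { /* drop the cached entry */ },
	})
	factory.Start(stopCh)
	return informer, nil
}
```
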
### Topology information in NodeInfo

Since not every cluster has servers with NUMA topology, and not every node with
NUMA topology enables the TopologyManager feature gate, the topology stored in
NodeInfo does not replace the existing representation of resources (to keep
compatibility with the existing scheduler plugins); instead, it extends NodeInfo
with the following field:

```go
// New NodeInfo field: per-NUMA-node view of the available resources.
Nodes []NUMANodeResource

// NUMANodeResource describes the resources available on a single NUMA node.
type NUMANodeResource struct {
	NUMAID    int
	Resources v1.ResourceList
}
```

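The Filter algorithm in the next section looks resources up by NUMA id through
a `numaMap`. A helper like the one below, which builds on the types defined
above and is illustrative rather than part of the proposed API, could construct
that map:

```go
// buildNUMAMap indexes the per-NUMA resources by NUMA id so the Filter
// algorithm can query them directly. Illustrative only.
func buildNUMAMap(nodes []NUMANodeResource) map[int]v1.ResourceList {
	numaMap := make(map[int]v1.ResourceList, len(nodes))
	for _, n := range nodes {
		numaMap[n.NUMAID] = n.Resources
	}
	return numaMap
}
```
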

### Description of the Algorithm

The algorithm which implements the single-numa-node policy is the following
(`bm` is assumed to be an alias for the
k8s.io/kubernetes/pkg/kubelet/cm/topologymanager/bitmask package):

```go
for _, container := range containers {
	// Start with a full mask; each guaranteed resource narrows down the set
	// of NUMA nodes that can satisfy the whole container.
	bitmask := bm.NewEmptyBitMask()
	bitmask.Fill()
	for resource, quantity := range container.Resources.Requests {
		resourceBitmask := bm.NewEmptyBitMask()
		if !guaranteedQoS(&container.Resources.Limits, resource, quantity) {
			// Resources outside the guaranteed QoS do not constrain alignment.
			continue
		}
		for numaIndex, numaNodeResources := range numaMap {
			nodeQuantity, ok := numaNodeResources[resource]
			// Skip NUMA nodes which cannot provide the requested quantity.
			if !ok || nodeQuantity.Cmp(quantity) < 0 {
				continue
			}
			resourceBitmask.Add(numaIndex)
		}
		bitmask.And(resourceBitmask)
	}
	if bitmask.IsEmpty() {
		// We can't align the container, so we can't align the pod.
		return framework.NewStatus(framework.Error, fmt.Sprintf("Can't align container: %s", container.Name))
	}
}
```


# Use cases

A number of Kubernetes worker nodes run on bare metal with NUMA topology and
have the TopologyManager feature gate enabled. In this configuration the
operator does not want a pod to be rejected by the node and re-scheduled when
the node's topology cannot satisfy it, but wants scheduling to succeed on the
first attempt.


# Test plans

The components which should be developed or modified for this feature can be
easily tested.

* Unit Tests

Unit tests for the scheduler plugin
(pkg/scheduler/framework/plugins/noderesources/topology_match.go) will live in
pkg/scheduler/framework/plugins/noderesources/topology_match_test.go.

Separate tests for the CRD informer should also be implemented.

* Integration Tests and End-to-end tests

Implementing these tests is not difficult, but running them requires
appropriate hardware.

# Graduation criteria

* Alpha (v1.20)

These are the required changes:
- [ ] CRD informer used in kubernetes as a staging project
- [ ] New kube-scheduler plugin `TopologyMatch`:
  - [ ] Implementation of the Filter extension point
  - [ ] Implementation of the Score extension point
- [ ] Tests from [Test plans](#test-plans).

# Implementation history

- 2020-06-12: Initial KEP sent out for review, including Summary, Motivation, Proposal, Test plans and Graduation criteria.