
Commit 1c2dce0

Simplified version of topology manager in kube-scheduler
Signed-off-by: Alexey Perevalov <[email protected]>
1 parent f2af06b commit 1c2dce0

1 file changed, 177 additions, 0 deletions

---
title: Simplified version of TopologyManager in kube-scheduler
authors:
  - "@AlexeyPerevalov"
owning-sig: sig-scheduling
participating-sigs:
reviewers:
  - "@ahg-g"
  - "@huang-wei"
  - "@derekwaynecarr"
approvers:
  - "@ahg-g"
  - "@huang-wei"
editor: TBD
creation-date: 2020-05-28
last-updated: 2020-05-28
status: implementable
see-also:
superseded-by:
---

# Topology Aware Scheduling

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Topology format](#topology-format)
  - [Plugin implementation details](#plugin-implementation-details)
    - [Topology information in NodeInfo](#topology-information-in-nodeinfo)
    - [Description of the Algorithm](#description-of-the-algorithm)
- [Use cases](#use-cases)
- [Test plans](#test-plans)
- [Graduation criteria](#graduation-criteria)
- [Implementation history](#implementation-history)
<!-- /toc -->

# Summary

This document describes the behavior of the Kubernetes scheduler when it takes
the worker node NUMA topology into account.

# Motivation

Since Topology Manager was introduced, the problem of launching pods in a
cluster whose worker nodes have different NUMA topologies, and different
amounts of resources within those topologies, has become relevant. A pod can be
scheduled onto a node where the total amount of resources is sufficient, but
whose resource distribution does not satisfy the configured topology policy; in
that case the pod fails to start. A much better behaviour for the scheduler
would be to select a node on which the kubelet admit handlers will pass.

## Goals

- Make the scheduling process more precise for worker nodes with a NUMA
  topology.

## Non-Goals

- Change the PodSpec to allow requesting a specific node topology manager
  policy.
- This proposal requires exposing NUMA topology information, but this KEP does
  not describe how to expose all of the necessary information; it only declares
  what kind of information is necessary.

# Proposal

A built-in kube-scheduler plugin will be added to the main tree. The plugin
implements a simplified version of TopologyManager, so it differs from the
original TopologyManager algorithm. The plugin checks whether a pod can run on
a node only for the single-numa-node policy: since this is the strictest
policy, a pod that passes the check would also be admitted on nodes running any
of the other policies. The proposed plugin will use a [CRD][1] to identify
which topology policy is enabled on a node. To work, the plugin requires the
topology of the available resources on the worker nodes.

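For illustration only, a minimal sketch of how the Filter extension point of such a plugin could be wired up. The plugin structure, its nodeTopologies cache, and the canAlignPod helper are assumptions made for this sketch, not code from this proposal; NodeTopology refers to the type introduced in the NodeInfo section below, and the framework import path corresponds to recent scheduler releases.

```go
package noderesources

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const singleNUMANodePolicy = "single-numa-node"

// TopologyMatch is a sketch of the proposed plugin.
type TopologyMatch struct {
	// nodeTopologies caches topology data from the CRD informer, keyed by node name.
	nodeTopologies map[string]*NodeTopology
}

var _ framework.FilterPlugin = &TopologyMatch{}

func (tm *TopologyMatch) Name() string { return "TopologyMatch" }

// Filter rejects a node when the kubelet admit handlers would fail the pod.
// Only the single-numa-node policy is evaluated: if the pod fits under the
// strictest policy, it also fits under the weaker policies.
func (tm *TopologyMatch) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	topology, ok := tm.nodeTopologies[nodeInfo.Node().Name]
	if !ok || topology.TopologyPolicy != singleNUMANodePolicy {
		// No topology data or a weaker policy on the node: nothing to check.
		return nil
	}
	if !canAlignPod(pod, topology) {
		return framework.NewStatus(framework.Unschedulable,
			fmt.Sprintf("Can't align pod: %s", pod.Name))
	}
	return nil
}

// canAlignPod is a placeholder for the single-numa-node check described in
// "Description of the Algorithm" below.
func canAlignPod(pod *v1.Pod, topology *NodeTopology) bool { return true }
```
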
## Topology format

The available resources of a node, together with their topology, should be
stored in a CRD. The format of the topology is described [in this document][1].

A daemon running outside of the kubelet will collect all the necessary
information about running pods. Based on the allocatable resources of the node
and the resources consumed by pods, it will publish the available resources in
the CRD, where one CRD instance represents one worker node. The name of the CRD
instance is the name of the worker node.

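To make the arithmetic concrete, a minimal sketch of the computation such a daemon could perform: per NUMA node, subtract what running pods already consume from the node's allocatable amount. The function and its inputs (maps keyed by NUMA node id) are assumptions of this sketch, not the daemon's actual code.

```go
package main

import v1 "k8s.io/api/core/v1"

// availableResources derives the per-NUMA-node available resources published
// in the CRD: allocatable minus what running pods already consume.
func availableResources(numaAllocatable, numaConsumed map[int]v1.ResourceList) map[int]v1.ResourceList {
	available := make(map[int]v1.ResourceList, len(numaAllocatable))
	for numaID, alloc := range numaAllocatable {
		res := v1.ResourceList{}
		for name, quantity := range alloc {
			q := quantity.DeepCopy()
			if consumed, ok := numaConsumed[numaID][name]; ok {
				q.Sub(consumed)
			}
			res[name] = q
		}
		available[numaID] = res
	}
	return available
}
```
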
## Plugin implementation details

Since the topology of a node is stored in a CRD, the kube-scheduler has to
subscribe to updates of the corresponding CRD type. The kube-scheduler will use
informers, which will be generated for this CRD.

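The proposal relies on generated informers; purely as an illustration of the subscription itself, here is a hedged sketch that uses a dynamic informer instead. The GroupVersionResource is a placeholder, since the CRD's API group and resource name are defined in the document referenced above, and the watchNodeTopologies helper is an assumption of this sketch.

```go
package noderesources

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// watchNodeTopologies subscribes to add/update/delete events of the node
// topology CRD and invokes onChange for every event.
func watchNodeTopologies(ctx context.Context, cfg *rest.Config, onChange func(obj interface{})) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	gvr := schema.GroupVersionResource{
		Group:    "topology.example.io", // placeholder group
		Version:  "v1alpha1",
		Resource: "nodetopologies", // placeholder resource name
	}
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(client, 30*time.Second, metav1.NamespaceAll, nil)
	informer := factory.ForResource(gvr).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    onChange,
		UpdateFunc: func(_, newObj interface{}) { onChange(newObj) },
		DeleteFunc: onChange,
	})
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())
	return nil
}
```
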
### Topology information in NodeInfo

Since not every cluster has servers with a NUMA topology, and not every node of
the cluster with a NUMA topology enables the TopologyManager feature gate, the
topology data in NodeInfo does not replace the existing representation of
resources, which keeps compatibility with the existing scheduler plugins;
instead, it extends NodeInfo with the following field:

```go
// New NodeInfo field:
topologyInfo *NodeTopology

// NodeTopology describes the NUMA topology of a worker node as exposed by the CRD.
type NodeTopology struct {
	// TopologyPolicy is the TopologyManager policy configured on the node.
	TopologyPolicy string
	// Nodes holds the available resources of each NUMA node.
	Nodes []NUMANodeResource
}

// NUMANodeResource is the set of available resources of a single NUMA node.
type NUMANodeResource struct {
	NUMAID    int
	Resources v1.ResourceList
}
```

TopologyPolicy may have one of the following values: none, best-effort,
restricted, single-numa-node.

To use these policy names both in the kube-scheduler and in the kubelet, their
string constants should be moved from pkg/kubelet/cm/topologymanager/ and
pkg/kubelet/apis/config/types.go to pkg/apis/core/types.go, so that they live
in a single place.

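For illustration, the consolidated constants in pkg/apis/core/types.go could look like the sketch below; the constant names mirror today's kubelet configuration constants and are assumptions, not final API names.

```go
// Possible TopologyManager policy names, shared by kubelet and kube-scheduler.
const (
	NoneTopologyManagerPolicy           = "none"
	BestEffortTopologyManagerPolicy     = "best-effort"
	RestrictedTopologyManagerPolicy     = "restricted"
	SingleNumaNodeTopologyManagerPolicy = "single-numa-node"
)
```
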
NUMAID is an auxiliary field, since the scheduler's version of TopologyManager
does not make a real NUMA assignment.

### Description of the Algorithm

The algorithm that implements the single-numa-node policy is as follows:

```go
// numaMap holds the available resources of every NUMA node of the candidate
// worker node (taken from the CRD), keyed by NUMA id.
for _, container := range containers {
	// Start from a full mask and narrow it down with the set of NUMA nodes
	// that can satisfy each NUMA-relevant resource request of the container.
	bitmask := bm.NewEmptyBitMask()
	bitmask.Fill()
	for resource, quantity := range container.Resources.Requests {
		resourceBitmask := bm.NewEmptyBitMask()
		// Only guaranteed requests (limit set and equal to the request) are
		// aligned by TopologyManager.
		if guaranteedQoS(&container.Resources.Limits, resource, quantity) {
			for numaIndex, numaNodeResources := range numaMap {
				nodeQuantity, ok := numaNodeResources[resource]
				if !ok || nodeQuantity.Cmp(quantity) < 0 {
					// This NUMA node cannot satisfy the request on its own.
					continue
				}
				resourceBitmask.Add(numaIndex)
			}
		}
		if resourceBitmask.IsEmpty() {
			// Nothing to intersect for this resource (not a guaranteed
			// request or not tracked per NUMA node).
			continue
		}
		bitmask.And(resourceBitmask)
	}
	if bitmask.IsEmpty() {
		// we can't align the container, so we can't align the pod
		return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("Can't align container: %s", container.Name))
	}
}
```

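The guaranteedQoS helper used above is not defined in this KEP; a minimal sketch of the check it is assumed to perform (a resource is only NUMA-aligned when its limit is set and equal to its request) could look like this:

```go
package noderesources

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// guaranteedQoS reports whether the given resource request is a guaranteed
// one, i.e. the corresponding limit is set and equal to the request.
func guaranteedQoS(limits *v1.ResourceList, name v1.ResourceName, request resource.Quantity) bool {
	limit, ok := (*limits)[name]
	return ok && limit.Cmp(request) == 0
}
```
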
# Use cases

A number of Kubernetes worker nodes run on bare metal and have a NUMA topology,
and the TopologyManager feature gate is enabled on those nodes. In this
configuration, the operator does not want a pod that lands on a node with an
unsatisfiable host topology to be re-scheduled for launch; the operator wants
scheduling to succeed the first time.

# Test plans

The components that have to be developed or modified for this feature can be
easily tested.

* Unit Tests

  Unit tests for the scheduler plugin
  (pkg/scheduler/framework/plugins/noderesources/topology_match.go) will be
  placed in pkg/scheduler/framework/plugins/noderesources/topology_match_test.go
  and test the plugin.

  Separate tests for the CRD informer should also be implemented.

* Integration Tests and End-to-end tests

  Implementing these tests is not difficult, but running them requires nodes
  with the appropriate hardware.

# Graduation criteria

* Alpha (v1.20)

These are the required changes:
- [ ] CRD informer used in Kubernetes as a staging project.
- [ ] New kube-scheduler plugin `TopologyMatch`.
  - [ ] Implementation of Filter.
  - [ ] Implementation of Score.
- [ ] Tests from [Test plans](#test-plans).

# Implementation history

- 2020-06-12: Initial KEP sent out for review, including Summary, Motivation,
  Proposal, Test plans, and Graduation criteria.

[1]: https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit
