Commit 8c263bd

Deducted version of topology manager in kube-scheduler
Signed-off-by: Alexey Perevalov <[email protected]>
1 parent f2af06b commit 8c263bd

1 file changed: 177 additions & 0 deletions

---
title: Deducted version of TopologyManager in kube-scheduler
authors:
  - "@AlexeyPerevalov"
owning-sig: sig-scheduling
participating-sigs:
reviewers:
  - "@huang-wei"
approvers:
  - "@huang-wei"
editor: TBD
creation-date: 2020-05-28
last-updated: 2020-05-28
status: implementable
see-also:
superseded-by:
---

# Topology Aware Scheduling

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Use cases](#use-cases)
- [Test plans](#test-plans)
- [Graduation criteria](#graduation-criteria)
- [Implementation history](#implementation-history)
<!-- /toc -->

# Summary

This document describes the behavior of the Kubernetes scheduler when it takes
the NUMA topology of worker nodes into account.

# Motivation

After Topology Manager was introduced, launching a pod in a cluster whose
worker nodes have different NUMA topologies, and different amounts of
resources within those topologies, became a real problem. A pod can be
scheduled on a node where the total amount of resources is sufficient, but
where the resource distribution cannot satisfy the configured Topology Manager
policy. In that case the pod fails to start. A much better behaviour for the
scheduler would be to select a node where the kubelet admit handlers are
expected to pass.

## Goals

- Make the scheduling process more precise when worker nodes have a NUMA
  topology.

## Non-Goals

- Changing the Topology Manager behaviour so that it can work with a policy
  specified in the PodSpec.
- Describing how the NUMA topology information is exposed. This proposal
  requires exposing NUMA topology information, but this KEP only declares what
  kind of information is necessary.

# Proposal

A built-in kube-scheduler plugin will be added to the main tree. This plugin
implements a deducted version of TopologyManager; its algorithm differs from
the original Topology Manager algorithm. The plugin checks the ability to run
a pod only for the single-numa-node policy on the node: since this is the most
strict policy, a pod that satisfies the single-numa-node condition on a worker
node will also launch successfully under any of the other existing policies.

The proposed plugin will use a node label to identify which topology policy is
enabled on the node. To work, this plugin requires topology information about
the available resources on the worker nodes.

## Node labels

The node label contains the name of the topology policy currently configured
in the kubelet.

The proposed node label may look like this:
`beta.kubernetes.io/topology=none|best-effort|restricted|single-numa-node`

It is based on [this proposal](https://github.com/kubernetes/enhancements/pull/1340).

To use these labels both in kube-scheduler and in kubelet, the string
constants for these labels should be moved from
pkg/kubelet/cm/topologymanager/ and pkg/kubelet/apis/config/types.go to
pkg/apis/core/types.go.

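Purely as an illustration of that relocation, the shared constants could look
like the sketch below. The constant names are assumptions; only the label key
and the policy values come from this proposal.

```go
// Hypothetical constants in pkg/apis/core/types.go, shared by kubelet and
// kube-scheduler. Names are illustrative; the label key and policy values are
// the ones proposed above.
const (
	// TopologyManagerPolicyLabel is the node label carrying the kubelet's
	// topology manager policy.
	TopologyManagerPolicyLabel = "beta.kubernetes.io/topology"

	NoneTopologyManagerPolicy           = "none"
	BestEffortTopologyManagerPolicy     = "best-effort"
	RestrictedTopologyManagerPolicy     = "restricted"
	SingleNUMANodeTopologyManagerPolicy = "single-numa-node"
)
```
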
## Topology format

The available resources of the node, together with their topology, should be
stored in a CRD. The format of the topology is described
[in this document](https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit).

A daemon running outside of the kubelet will collect all necessary information
on running pods. Based on the allocatable resources of the node and the
resources consumed by pods, it will publish the available resources in the
CRD, where one CRD instance represents one worker node. The name of the CRD
instance is the name of the worker node.

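The CRD format itself is defined in the linked document. Purely for
illustration, a hypothetical Go shape consistent with the `NUMANodeResource`
field proposed for NodeInfo below could look like this; every type and field
name here is an assumption, not the actual CRD definition.

```go
package v1alpha1 // hypothetical API group for the topology CRD

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeResourceTopology holds the available resources of one worker node,
// broken down per NUMA node. The object name equals the worker node name.
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// TopologyPolicy is the topology manager policy configured on the node.
	TopologyPolicy string `json:"topologyPolicy"`
	// Nodes lists the resources still available on each NUMA node.
	Nodes []NUMANodeResource `json:"nodes"`
}

// NUMANodeResource is the per-NUMA-node list of available resources.
type NUMANodeResource struct {
	NUMAID    int             `json:"numaID"`
	Resources v1.ResourceList `json:"resources"`
}
```
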
## Plugin implementation details

Since the topology of the node is stored in the CRD, kube-scheduler should be
subscribed to updates of the appropriate CRD type. Kube-scheduler will use
generated informers for this.

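The exact informer wiring is not specified in this KEP. As a minimal sketch,
the subscription could also be done with a dynamic informer from client-go;
the group/version/resource and the handler bodies below are assumptions.

```go
package topologymatch

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// watchNodeTopologies subscribes to the hypothetical node topology CRD and
// keeps a per-node cache up to date for the scheduler plugin.
func watchNodeTopologies(cfg *rest.Config, stopCh <-chan struct{}) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Hypothetical GroupVersionResource of the CRD.
	gvr := schema.GroupVersionResource{
		Group:    "topology.node.k8s.io",
		Version:  "v1alpha1",
		Resource: "noderesourcetopologies",
	}

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 30*time.Second)
	informer := factory.ForResource(gvr).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* cache the per-node topology */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* refresh the cached topology */ },
		DeleteFunc: func(obj interface{}) { /* drop the cached topology */ },
	})

	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return nil
}
```
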
### Topology information in NodeInfo

Not every cluster has servers with a NUMA topology, and not every node with a
NUMA topology enables the TopologyManager feature gate. To keep compatibility
with existing scheduler plugins, the topology in NodeInfo does not replace the
existing representation of resources; instead, NodeInfo is extended with the
following field:

```go
Nodes []NUMANodeResource

type NUMANodeResource struct {
	NUMAID    int
	Resources v1.ResourceList
}
```

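For illustration only, this field can be indexed into the `numaMap` lookup
used by the algorithm in the next section; a minimal sketch, assuming `v1` is
`k8s.io/api/core/v1` and `NUMANodeResource` is the type defined above:

```go
// buildNUMAMap is an illustrative helper (not part of the KEP) that indexes
// the per-NUMA resource lists by NUMA ID.
func buildNUMAMap(nodes []NUMANodeResource) map[int]v1.ResourceList {
	numaMap := make(map[int]v1.ResourceList, len(nodes))
	for _, n := range nodes {
		numaMap[n.NUMAID] = n.Resources
	}
	return numaMap
}
```
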
### Description of the Algorithm

The algorithm which implements the single-numa-node policy is as follows:

```go
for _, container := range containers {
	// Start with all NUMA nodes allowed and narrow the mask per resource, so
	// that only NUMA nodes able to satisfy every aligned resource remain.
	bitmask := bm.NewEmptyBitMask()
	bitmask.Fill()
	for resource, quantity := range container.Resources.Requests {
		// Only resources requested with guaranteed QoS semantics
		// (limits equal to requests) are subject to NUMA alignment.
		if !guarantedQoS(&container.Resources.Limits, resource, quantity) {
			continue
		}
		// Collect the NUMA nodes that can provide this resource in full.
		resourceBitmask := bm.NewEmptyBitMask()
		for numaIndex, numaNodeResources := range numaMap {
			nodeQuantity, ok := numaNodeResources[resource]
			if !ok || nodeQuantity.Cmp(quantity) < 0 {
				continue
			}
			resourceBitmask.Add(numaIndex)
		}
		bitmask.And(resourceBitmask)
	}
	if bitmask.IsEmpty() {
		// We can't align the container, so we can't align the pod.
		return framework.NewStatus(framework.Error, fmt.Sprintf("Can't align container: %s", container.Name))
	}
}
```

Returning a non-success status from the plugin's Filter extension point causes
the node to be rejected for this pod.

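Outside the scope of this KEP, a minimal skeleton of how the check could be
wired into a `TopologyMatch` Filter plugin is sketched below. The import path
reflects the scheduler framework layout around v1.20, and everything except
the label key and policy value is an assumption.

```go
package noderesources

import (
	"context"

	v1 "k8s.io/api/core/v1"
	framework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// Name is the plugin name used in the scheduler configuration.
const Name = "TopologyMatch"

// TopologyMatch filters out nodes whose per-NUMA available resources cannot
// satisfy the pod under the single-numa-node policy.
type TopologyMatch struct {
	// per-node NUMA topology cache, kept up to date by the CRD informer.
}

var _ framework.FilterPlugin = &TopologyMatch{}

func (tm *TopologyMatch) Name() string { return Name }

func (tm *TopologyMatch) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	// Only nodes advertising the single-numa-node policy are checked.
	if node.Labels["beta.kubernetes.io/topology"] != "single-numa-node" {
		return nil
	}
	// Run the per-container alignment algorithm from the previous section
	// against the cached NUMA map for this node (omitted in this sketch).
	return nil
}
```
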
# Use cases

A number of Kubernetes worker nodes on bare metal have a NUMA topology, and
the TopologyManager feature gate is enabled on these nodes. In this
configuration, the operator does not want a pod to be rejected by the node's
admit handlers and re-scheduled when the host topology cannot satisfy it; the
operator wants scheduling to succeed on the first attempt.

# Test plans

The components which have to be developed or modified for this feature can be
tested easily.

* Unit Tests

Unit tests for the scheduler plugin
(pkg/scheduler/framework/plugins/noderesources/topology_match.go) will live in
pkg/scheduler/framework/plugins/noderesources/topology_match_test.go.

Separate tests for the CRD informer should also be implemented.

* Integration Tests and End-to-end tests

Implementing these tests is not difficult, but running them requires
appropriate hardware.

# Graduation criteria

* Alpha (v1.20)

These are the required changes:
- [ ] CRD informer used in Kubernetes as a staging project
- [ ] New kube-scheduler plugin `TopologyMatch`
  - [ ] Implementation of Filter
  - [ ] Implementation of Score
- [ ] Tests from [Test plans](#test-plans)

# Implementation history

- 2020-06-12: Initial KEP sent out for review, including Summary, Motivation,
  Proposal, Test plans and Graduation criteria.
