Commit e867203

Eric Ernst authored and committed

pod-overhead: update documentation for beta
Signed-off-by: Eric Ernst <[email protected]>
1 parent 93a5942 commit e867203

File tree

2 files changed (+157, -25 lines)


content/en/docs/concepts/configuration/pod-overhead.md

Lines changed: 155 additions & 20 deletions
@@ -15,7 +15,7 @@ weight: 20
When you run a Pod on a Node, the Pod itself takes an amount of system resources. These
resources are additional to the resources needed to run the container(s) inside the Pod.
-_Pod Overhead_ is a feature for accounting for the resources consumed by the pod infrastructure
+_Pod Overhead_ is a feature for accounting for the resources consumed by the Pod infrastructure
on top of the container requests & limits.
@@ -24,33 +24,168 @@ on top of the container requests & limits.
{{% capture body %}}

-## Pod Overhead
-
-In Kubernetes, the pod's overhead is set at
+In Kubernetes, the Pod's overhead is set at
[admission](/docs/reference/access-authn-authz/extensible-admission-controllers/#what-are-admission-webhooks)
-time according to the overhead associated with the pod's
+time according to the overhead associated with the Pod's
[RuntimeClass](/docs/concepts/containers/runtime-class/).

When Pod Overhead is enabled, the overhead is considered in addition to the sum of container
-resource requests when scheduling a pod. Similarly, Kubelet will include the pod overhead when sizing
-the pod cgroup, and when carrying out pod eviction ranking.
+resource requests when scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing
+the Pod cgroup, and when carrying out Pod eviction ranking.

-### Set Up
+## Enabling Pod Overhead {#set-up}

You need to make sure that the `PodOverhead`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled (it is on by default as of 1.18)
-across your cluster. This means:
-
-- in {{< glossary_tooltip text="kube-scheduler" term_id="kube-scheduler" >}}
-- in {{< glossary_tooltip text="kube-apiserver" term_id="kube-apiserver" >}}
-- in the {{< glossary_tooltip text="kubelet" term_id="kubelet" >}} on each Node
-- in any custom API servers that use feature gates
-
-{{< note >}}
-Users who can write to RuntimeClass resources are able to have cluster-wide impact on
-workload performance. You can limit access to this ability using Kubernetes access controls.
-See [Authorization Overview](/docs/reference/access-authn-authz/authorization/) for more details.
-{{< /note >}}
+across your cluster, and that a `RuntimeClass` defining the `overhead` field is in use.
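
If you manage component flags yourself, the gate can also be set explicitly. The following is a minimal sketch; how the flags are actually supplied depends on your deployment tooling (for example kubeadm configuration, systemd unit files, or static Pod manifests):

```bash
# Illustrative only: pass the feature gate to the control plane components and to each kubelet.
kube-apiserver --feature-gates=PodOverhead=true ...
kube-scheduler --feature-gates=PodOverhead=true ...
kubelet --feature-gates=PodOverhead=true ...
```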

+## Usage example
+
+To use the PodOverhead feature, you need a RuntimeClass that defines the `overhead` field. As
+an example, you could use the following RuntimeClass definition with a virtualizing container runtime
+that uses around 120MiB per Pod for the virtual machine and the guest OS:
+
+```yaml
+---
+kind: RuntimeClass
+apiVersion: node.k8s.io/v1beta1
+metadata:
+  name: kata-fc
+handler: kata-fc
+overhead:
+  podFixed:
+    memory: "120Mi"
+    cpu: "250m"
+```
+
+Workloads that are created and that specify the `kata-fc` RuntimeClass handler will have the memory and
+CPU overheads taken into account for resource quota calculations, node scheduling, and Pod cgroup sizing.
+
+Consider running the given example workload, test-pod:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod
+spec:
+  runtimeClassName: kata-fc
+  containers:
+  - name: busybox-ctr
+    image: busybox
+    stdin: true
+    tty: true
+    resources:
+      limits:
+        cpu: 500m
+        memory: 100Mi
+  - name: nginx-ctr
+    image: nginx
+    resources:
+      limits:
+        cpu: 1500m
+        memory: 100Mi
+```
+
+At admission time the RuntimeClass [admission controller](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
+updates the workload's PodSpec to include the `overhead` as described in the RuntimeClass. If the PodSpec already has this field defined,
+the Pod will be rejected. In the given example, since only the RuntimeClass name is specified, the admission controller mutates the Pod
+to include an `overhead`.
+
+After the RuntimeClass admission controller has run, you can check the updated PodSpec:
+
+```bash
+kubectl get pod test-pod -o jsonpath='{.spec.overhead}'
+```
+
+The output is:
+```
+map[cpu:250m memory:120Mi]
+```
+
+If a ResourceQuota is defined, the sum of container requests as well as the
+`overhead` field are counted.
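
As an illustration of that interaction, consider a minimal ResourceQuota sketch (the object name and the `hard` values below are assumptions for this example, not something defined elsewhere on this page). For `test-pod`, 2250m CPU and 320Mi of memory would be charged against it, not just the 2000m/200Mi container totals:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: overhead-demo-quota   # hypothetical name
  namespace: default
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 4Gi
```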

+When the kube-scheduler is deciding which node should run a new Pod, the scheduler considers that Pod's
+`overhead` as well as the sum of container requests for that Pod. For this example, the scheduler adds the
+requests and the overhead, then looks for a node that has 2.25 CPU and 320 MiB of memory available.
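
Spelling out that arithmetic for test-pod:

```
CPU:    500m  + 1500m (containers) + 250m  (overhead) = 2250m (2.25 CPU)
memory: 100Mi + 100Mi (containers) + 120Mi (overhead) = 320Mi
```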

+Once a Pod is scheduled to a node, the kubelet on that node creates a new {{< glossary_tooltip text="cgroup" term_id="cgroup" >}}
+for the Pod. It is within this cgroup that the underlying container runtime will create containers.
+
+If the resource has a limit defined for each container (Guaranteed QoS or Burstable QoS with limits defined),
+the kubelet will set an upper limit for the pod cgroup associated with that resource (`cpu.cfs_quota_us` for CPU
+and `memory.limit_in_bytes` for memory). This upper limit is based on the sum of the container limits plus the `overhead`
+defined in the PodSpec.
+
+For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set `cpu.shares` based on the sum of container
+requests plus the `overhead` defined in the PodSpec.
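
As a rough sketch of what that translates to for this example (assuming the usual cgroup v1 conversions of 1024 CPU shares per core and the default 100ms CFS period; actual values can differ):

```
memory.limit_in_bytes: 320Mi -> 320 * 1024 * 1024  = 335544320
cpu.cfs_quota_us:      2250m -> 2.25 * 100000      = 225000
cpu.shares:            2250m -> 2250 * 1024 / 1000 ~= 2304
```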

+Looking at our example, verify the container requests for the workload:
+```bash
+kubectl get pod test-pod -o jsonpath='{.spec.containers[*].resources.limits}'
+```
+
+The total container requests are 2000m CPU and 200MiB of memory (only limits are set, so the requests default to the same values):
+```
+map[cpu:500m memory:100Mi] map[cpu:1500m memory:100Mi]
+```
+
+Check this against what is observed by the node:
+```bash
+kubectl describe node | grep test-pod -B2
+```
+
+The output shows 2250m CPU and 320MiB of memory are requested, which includes PodOverhead:
+```
+Namespace    Name       CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
+---------    ----       ------------  ----------   ---------------  -------------  ---
+default      test-pod   2250m (56%)   2250m (56%)  320Mi (1%)       320Mi (1%)     36m
+```
+
+Check the Pod's memory cgroups on the node where the workload is running. First, on the particular node, determine
+the Pod identifier:
+
+```bash
+# Run this on the node where the Pod is scheduled
+sudo crictl pods | grep test-pod
+```
+
+The Pod identifier observed is `7ccf55aee35dd`:
+```
+7ccf55aee35dd   58 minutes ago   Ready   test-pod   default   0
+```
+
+From this, you can determine the cgroup path for the Pod:
+```bash
+# Run this on the node where the Pod is scheduled
+sudo crictl inspectp -o=json 7ccf55aee35dd | grep cgroupsPath
+```
+
+The resulting cgroup path includes the Pod's `pause` container. The Pod level cgroup is one directory above.
+```
+"cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd16aca4189c952d83487297f3cd760f1bbf09620e206e7d0c27a"
+```
+
+In this specific case, the pod cgroup path is `kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2`. Verify the Pod level cgroup setting for memory:
+```bash
+# Run this on the node where the Pod is scheduled.
+# Also, change the name of the cgroup to match the cgroup allocated for your pod.
+cat /sys/fs/cgroup/memory/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/memory.limit_in_bytes
+```
+
+This is 320 MiB, as expected:
+```
+335544320
+```
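
You can mirror the memory check for CPU if you want to confirm the sketch given earlier (same caveats: run this on the node where the Pod is scheduled, and substitute the cgroup allocated for your Pod):

```bash
# Run this on the node where the Pod is scheduled.
# Change the cgroup name to match the cgroup allocated for your pod.
cat /sys/fs/cgroup/cpu/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/cpu.shares
cat /sys/fs/cgroup/cpu/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/cpu.cfs_quota_us
```

With the 2250m figure above, values around 2304 shares and a 225000 microsecond quota would be in line with that sketch.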

+### Observability
+
+A `kube_pod_overhead` metric is available in [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
+to help identify when PodOverhead is being utilized and to help observe stability of workloads
+running with a defined Overhead. This functionality is not available in the 1.9 release of
+kube-state-metrics, but is expected in a following release. Users will need to build kube-state-metrics
+from source in the meantime.
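
Once you have a kube-state-metrics build that exports it, a query along these lines should surface Pods that carry an overhead (the label names here are an assumption and may differ in the released metric):

```
kube_pod_overhead{namespace="default", pod="test-pod"}
```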

{{% /capture %}}
content/en/docs/concepts/containers/runtime-class.md

Lines changed: 2 additions & 5 deletions
@@ -174,15 +174,12 @@ Nodes](/docs/concepts/configuration/assign-pod-node/).

### Pod Overhead

-{{< feature-state for_k8s_version="v1.16" state="alpha" >}}
+{{< feature-state for_k8s_version="v1.18" state="beta" >}}

As of Kubernetes v1.16, RuntimeClass includes support for specifying overhead associated with
running a pod, as part of the [`PodOverhead`](/docs/concepts/configuration/pod-overhead/) feature.
-To use `PodOverhead`, you must have the PodOverhead [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
-enabled (it is off by default).

-Pod overhead is defined in RuntimeClass through the `Overhead` fields. Through the use of these fields,
+Pod overhead is defined in RuntimeClass through the `overhead` field. Through the use of this field,
you can specify the overhead of running pods utilizing this RuntimeClass and ensure these overheads
are accounted for in Kubernetes.
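
For reference, the relevant stanza inside a RuntimeClass looks like the following (values taken from the kata-fc example earlier in this commit):

```yaml
overhead:
  podFixed:
    memory: "120Mi"
    cpu: "250m"
```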
