
When you run a Pod on a Node, the Pod itself takes an amount of system resources. These
resources are additional to the resources needed to run the container(s) inside the Pod.
_Pod Overhead_ is a feature for accounting for the resources consumed by the Pod infrastructure
on top of the container requests & limits.


{{% capture body %}}

In Kubernetes, the Pod's overhead is set at
[admission](/docs/reference/access-authn-authz/extensible-admission-controllers/#what-are-admission-webhooks)
time according to the overhead associated with the Pod's
[RuntimeClass](/docs/concepts/containers/runtime-class/).

When Pod Overhead is enabled, the overhead is considered in addition to the sum of container
resource requests when scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing
the Pod cgroup, and when carrying out Pod eviction ranking.

## Enabling Pod Overhead {#set-up}

You need to make sure that the `PodOverhead`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled (it is on by default as of 1.18)
across your cluster, and that a `RuntimeClass` which defines the `overhead` field is in use.
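
If you need to set the gate explicitly (for example, on a cluster created before it defaulted to on),
it has to be enabled on every component that evaluates it: kube-apiserver, kube-scheduler, and the
kubelet on each Node. A minimal sketch, assuming you can pass flags directly to these components (how
you do this depends on how your cluster is deployed):

```bash
# Illustrative flags only; managed clusters expose this setting differently.
kube-apiserver --feature-gates=PodOverhead=true
kube-scheduler --feature-gates=PodOverhead=true
kubelet        --feature-gates=PodOverhead=true
```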

## Usage example

To use the PodOverhead feature, you need a RuntimeClass that defines the `overhead` field. As
an example, you could use the following RuntimeClass definition with a virtualizing container runtime
that uses around 120MiB per Pod for the virtual machine and the guest OS:

```yaml
---
kind: RuntimeClass
apiVersion: node.k8s.io/v1beta1
metadata:
  name: kata-fc
handler: kata-fc
overhead:
  podFixed:
    memory: "120Mi"
    cpu: "250m"
```

Workloads that specify the `kata-fc` RuntimeClass handler will take the memory and CPU overheads
into account for resource quota calculations, node scheduling, and Pod cgroup sizing.
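
As a quick sanity check, you can create the RuntimeClass above and read back the overhead it declares
(a sketch; the file name is whatever you saved the manifest as):

```bash
# Create the RuntimeClass shown above (the file name is illustrative)
kubectl apply -f kata-fc-runtimeclass.yaml

# Read back the per-Pod overhead it declares
kubectl get runtimeclass kata-fc -o jsonpath='{.overhead.podFixed}'
```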

Consider running the given example workload, test-pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  runtimeClassName: kata-fc
  containers:
  - name: busybox-ctr
    image: busybox
    stdin: true
    tty: true
    resources:
      limits:
        cpu: 500m
        memory: 100Mi
  - name: nginx-ctr
    image: nginx
    resources:
      limits:
        cpu: 1500m
        memory: 100Mi
```

At admission time the RuntimeClass [admission controller](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
updates the workload's PodSpec to include the `overhead` as described in the RuntimeClass. If the PodSpec already has this field defined,
the Pod will be rejected. In the given example, since only the RuntimeClass name is specified, the admission controller mutates the Pod
to include an `overhead`.

After the RuntimeClass admission controller, you can check the updated PodSpec:

```bash
kubectl get pod test-pod -o jsonpath='{.spec.overhead}'
```

The output is:
```
map[cpu:250m memory:120Mi]
```

If a ResourceQuota is defined, the sum of container requests as well as the
`overhead` field are counted.
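
For example, against a namespace quota the Pod above should consume 2250m of CPU and 320Mi of memory,
not just the 2000m / 200Mi its containers ask for. A sketch, using a hypothetical quota in the
`default` namespace:

```bash
# Hypothetical quota; the name and limits are only for illustration
kubectl create quota pod-overhead-demo --hard=requests.cpu=3,requests.memory=1Gi -n default

# After test-pod is created, the Used column should include the overhead
kubectl describe quota pod-overhead-demo -n default
```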

When the kube-scheduler is deciding which node should run a new Pod, the scheduler considers that Pod's
`overhead` as well as the sum of container requests for that Pod. For this example, the scheduler adds the
requests and the overhead, then looks for a node that has 2.25 CPU and 320 MiB of memory available.
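
Spelled out for this example (the node name is a placeholder):

```bash
# CPU:    500m + 1500m (containers) + 250m (overhead)   = 2250m
# Memory: 100Mi + 100Mi (containers) + 120Mi (overhead) = 320Mi
# Compare against what a candidate node has allocatable:
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
```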

Once a Pod is scheduled to a node, the kubelet on that node creates a new {{< glossary_tooltip text="cgroup" term_id="cgroup" >}}
for the Pod. It is within this cgroup that the underlying container runtime will create containers.

If the resource has a limit defined for each container (Guaranteed QoS or Burstable QoS with limits defined),
the kubelet will set an upper limit for the pod cgroup associated with that resource (`cpu.cfs_quota_us` for CPU
and `memory.limit_in_bytes` for memory). This upper limit is based on the sum of the container limits plus the `overhead`
defined in the PodSpec.
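
For the example Pod, that works out to the following values (a sketch, assuming the kubelet's default
100ms CFS period):

```bash
echo $(( (100 + 100 + 120) * 1024 * 1024 ))    # memory.limit_in_bytes: 335544320 bytes (320Mi)
echo $(( (500 + 1500 + 250) * 100000 / 1000 )) # cpu.cfs_quota_us: 225000 (2.25 CPU per 100ms period)
```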

For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set `cpu.shares` based on the sum of container
requests plus the `overhead` defined in the PodSpec.
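
As a rough check of the numbers, the kubelet converts milliCPU to shares as `milliCPU * 1024 / 1000`,
so for this Pod:

```bash
echo $(( (500 + 1500 + 250) * 1024 / 1000 ))   # expected Pod-level cpu.shares: 2304
```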

Looking at our example, verify the container requests for the workload:
```bash
kubectl get pod test-pod -o jsonpath='{.spec.containers[*].resources.limits}'
```

The total container requests are 2000m CPU and 200MiB of memory:
```
map[cpu:500m memory:100Mi] map[cpu:1500m memory:100Mi]
```

Check this against what is observed by the node:
```bash
kubectl describe node | grep test-pod -B2
```

The output shows 2250m CPU and 320MiB of memory are requested, which includes PodOverhead:
```
  Namespace    Name       CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------    ----       ------------  ----------   ---------------  -------------  ---
  default      test-pod   2250m (56%)   2250m (56%)  320Mi (1%)       320Mi (1%)     36m
```

Check the Pod's memory cgroups on the node where the workload is running. First, on the particular node, determine
the Pod identifier:

```bash
# Run this on the node where the Pod is scheduled
sudo crictl pods | grep test-pod
```

The value observed is `7ccf55aee35dd`:
```
7ccf55aee35dd       58 minutes ago      Ready       test-pod        default     0
```

From this, you can determine the cgroup path for the Pod:
```bash
# Run this on the node where the Pod is scheduled
sudo crictl inspectp -o=json 7ccf55aee35dd | grep cgroupsPath
```

The resulting cgroup path includes the Pod's `pause` container. The Pod level cgroup is one directory above.
```
        "cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd16aca4189c952d83487297f3cd760f1bbf09620e206e7d0c27a"
```

In this specific case, the pod cgroup path is `kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2`. Verify the Pod level cgroup setting for memory:
```bash
# Run this on the node where the Pod is scheduled.
# Also, change the name of the cgroup to match the cgroup allocated for your Pod.
cat /sys/fs/cgroup/memory/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/memory.limit_in_bytes
```

This is 320 MiB, as expected:
```
335544320
```
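
You can make the same check for CPU. A sketch, assuming a cgroup v1 node where the CPU controller is
mounted at `cpu,cpuacct` (adjust the mount point and the Pod cgroup name for your node); with the
default 100ms CFS period this should read 225000:

```bash
# Run this on the node where the Pod is scheduled.
cat /sys/fs/cgroup/cpu,cpuacct/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/cpu.cfs_quota_us
```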

### Observability

A `kube_pod_overhead` metric is available in [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
to help identify when PodOverhead is being utilized and to help observe stability of workloads
running with a defined Overhead. This functionality is not available in the 1.9 release of
kube-state-metrics, but is expected in a following release. Users will need to build kube-state-metrics
from source in the meantime.
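
Once you are running a build that includes the metric, you could scrape it directly. A sketch, assuming
kube-state-metrics is exposed as a Service named `kube-state-metrics` on port 8080 in the `kube-system`
namespace (the name, namespace, and port are assumptions):

```bash
# Forward the kube-state-metrics service locally (service name, namespace, and port assumed)
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &

# Look for the per-Pod overhead series
curl -s http://localhost:8080/metrics | grep kube_pod_overhead
```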

{{% /capture %}}