diff --git a/content/en/docs/concepts/configuration/pod-overhead.md b/content/en/docs/concepts/configuration/pod-overhead.md
index 8309fce51e88c..0e796df9ffc49 100644
--- a/content/en/docs/concepts/configuration/pod-overhead.md
+++ b/content/en/docs/concepts/configuration/pod-overhead.md
@@ -10,12 +10,12 @@ weight: 20
 
 {{% capture overview %}}
 
-{{< feature-state for_k8s_version="v1.16" state="alpha" >}}
+{{< feature-state for_k8s_version="v1.18" state="beta" >}}
 
 When you run a Pod on a Node, the Pod itself takes an amount of system resources. These
 resources are additional to the resources needed to run the container(s) inside the Pod.
-_Pod Overhead_ is a feature for accounting for the resources consumed by the pod infrastructure
+_Pod Overhead_ is a feature for accounting for the resources consumed by the Pod infrastructure
 on top of the container requests & limits.
 
 
@@ -24,33 +24,169 @@ on top of the container requests & limits.
 
 {{% capture body %}}
 
-## Pod Overhead
-
-In Kubernetes, the pod's overhead is set at
+In Kubernetes, the Pod's overhead is set at
 [admission](/docs/reference/access-authn-authz/extensible-admission-controllers/#what-are-admission-webhooks)
-time according to the overhead associated with the pod's
+time according to the overhead associated with the Pod's
 [RuntimeClass](/docs/concepts/containers/runtime-class/).
 
 When Pod Overhead is enabled, the overhead is considered in addition to the sum of container
-resource requests when scheduling a pod. Similarly, Kubelet will include the pod overhead when sizing
-the pod cgroup, and when carrying out pod eviction ranking.
+resource requests when scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing
+the Pod cgroup, and when carrying out Pod eviction ranking.
 
-### Set Up
+## Enabling Pod Overhead {#set-up}
 
 You need to make sure that the `PodOverhead`
-[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled (it is off by default)
-across your cluster. This means:
-
-- in {{< glossary_tooltip text="kube-scheduler" term_id="kube-scheduler" >}}
-- in {{< glossary_tooltip text="kube-apiserver" term_id="kube-apiserver" >}}
-- in the {{< glossary_tooltip text="kubelet" term_id="kubelet" >}} on each Node
-- in any custom API servers that use feature gates
-
-{{< note >}}
-Users who can write to RuntimeClass resources are able to have cluster-wide impact on
-workload performance. You can limit access to this ability using Kubernetes access controls.
-See [Authorization Overview](/docs/reference/access-authn-authz/authorization/) for more details.
-{{< /note >}}
+[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled (it is on by default as of Kubernetes 1.18)
+across your cluster, and that you use a `RuntimeClass` which defines the `overhead` field.
+
+## Usage example
+
+To use the PodOverhead feature, you need a RuntimeClass that defines the `overhead` field. As
+an example, you could use the following RuntimeClass definition with a virtualizing container runtime
+that uses around 120MiB per Pod for the virtual machine and the guest OS:
+
+```yaml
+---
+kind: RuntimeClass
+apiVersion: node.k8s.io/v1beta1
+metadata:
+  name: kata-fc
+handler: kata-fc
+overhead:
+  podFixed:
+    memory: "120Mi"
+    cpu: "250m"
+```
+
+Workloads that specify the `kata-fc` RuntimeClass handler will take the memory and
+CPU overheads into account for resource quota calculations, node scheduling, and Pod cgroup sizing.
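+
+If you want to try this out, you could save the RuntimeClass manifest above to a file and create it with
+`kubectl`. This is only a sketch: the file name below is arbitrary, and the `kata-fc` handler assumes that
+a matching runtime configuration already exists on your nodes.
+
+```bash
+# Assumes the RuntimeClass manifest shown above was saved as kata-fc.yaml
+kubectl apply -f kata-fc.yaml
+
+# Confirm that the fixed per-Pod overhead was recorded on the RuntimeClass
+kubectl get runtimeclass kata-fc -o jsonpath='{.overhead.podFixed}'
+```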
+
+Consider running the following example workload, test-pod:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod
+spec:
+  runtimeClassName: kata-fc
+  containers:
+  - name: busybox-ctr
+    image: busybox
+    stdin: true
+    tty: true
+    resources:
+      limits:
+        cpu: 500m
+        memory: 100Mi
+  - name: nginx-ctr
+    image: nginx
+    resources:
+      limits:
+        cpu: 1500m
+        memory: 100Mi
+```
+
+At admission time the RuntimeClass [admission controller](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
+updates the workload's PodSpec to include the `overhead` as described in the RuntimeClass. If the PodSpec already has this field defined,
+the Pod will be rejected. In this example, since only the RuntimeClass name is specified, the admission controller mutates the Pod
+to include an `overhead`.
+
+After the RuntimeClass admission controller has run, you can check the updated PodSpec:
+
+```bash
+kubectl get pod test-pod -o jsonpath='{.spec.overhead}'
+```
+
+The output is:
+```
+map[cpu:250m memory:120Mi]
+```
+
+If a ResourceQuota is defined, both the sum of container requests and the
+`overhead` field are counted.
+
+When the kube-scheduler is deciding which node should run a new Pod, the scheduler considers that Pod's
+`overhead` as well as the sum of container requests for that Pod. For this example, the scheduler adds the
+requests and the overhead, then looks for a node that has 2.25 CPU and 320 MiB of memory available.
+
+Once a Pod is scheduled to a node, the kubelet on that node creates a new {{< glossary_tooltip text="cgroup" term_id="cgroup" >}}
+for the Pod. It is within this cgroup that the underlying container runtime will create containers.
+
+If a resource has a limit defined for each container (Guaranteed QoS, or Burstable QoS with limits defined),
+the kubelet will set an upper limit for the Pod cgroup associated with that resource (`cpu.cfs_quota_us` for CPU
+and `memory.limit_in_bytes` for memory). This upper limit is based on the sum of the container limits plus the `overhead`
+defined in the PodSpec.
+
+For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set `cpu.shares` based on the sum of container
+requests plus the `overhead` defined in the PodSpec.
+
+Looking at our example, verify the container requests for the workload (because only limits are set here,
+the requests default to the same values):
+```bash
+kubectl get pod test-pod -o jsonpath='{.spec.containers[*].resources.limits}'
+```
+
+The total container requests are 2000m CPU and 200MiB of memory:
+```
+map[cpu:500m memory:100Mi] map[cpu:1500m memory:100Mi]
+```
+
+Check this against what is observed by the node:
+```bash
+kubectl describe node | grep test-pod -B2
+```
+
+The output shows 2250m CPU and 320MiB of memory are requested, which includes PodOverhead:
+```
+  Namespace  Name      CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
+  ---------  ----      ------------  ----------   ---------------  -------------  ---
+  default    test-pod  2250m (56%)   2250m (56%)  320Mi (1%)       320Mi (1%)     36m
+```
+
+## Verify Pod cgroup limits
+
+Check the Pod's memory cgroups on the node where the workload is running. In the following example, [`crictl`](https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md)
+is used on the node, which provides a CLI for CRI-compatible container runtimes. This is an
+advanced example to show PodOverhead behavior, and it is not expected that users should need to check
+cgroups directly on the node.
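+
+Before logging in to the node, you may find it convenient to check which node the Pod was scheduled to.
+This step is not required by the feature; it only reads fields that are already part of the Pod:
+
+```bash
+# Print the node running test-pod, followed by the Pod's QoS class
+kubectl get pod test-pod -o jsonpath='{.spec.nodeName} {.status.qosClass}{"\n"}'
+```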
+
+First, on the particular node, determine the Pod identifier:
+
+```bash
+# Run this on the node where the Pod is scheduled
+POD_ID="$(sudo crictl pods --name test-pod -q)"
+```
+
+From this, you can determine the cgroup path for the Pod:
+```bash
+# Run this on the node where the Pod is scheduled
+sudo crictl inspectp -o=json $POD_ID | grep cgroupsPath
+```
+
+The resulting cgroup path includes the Pod's `pause` container. The Pod-level cgroup is one directory above.
+```
+  "cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd16aca4189c952d83487297f3cd760f1bbf09620e206e7d0c27a"
+```
+
+In this specific case, the Pod cgroup path is `kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2`. Verify the Pod-level cgroup setting for memory:
+```bash
+# Run this on the node where the Pod is scheduled.
+# Also, change the name of the cgroup to match the cgroup allocated for your Pod.
+cat /sys/fs/cgroup/memory/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/memory.limit_in_bytes
+```
+
+This is 320 MiB, as expected:
+```
+335544320
+```
+
+## Observability
+
+A `kube_pod_overhead` metric is available in [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
+to help identify when PodOverhead is being utilized and to help observe stability of workloads
+running with a defined Overhead. This functionality is not available in the 1.9 release of
+kube-state-metrics, but is expected in a following release. Users will need to build kube-state-metrics
+from source in the meantime.
 
 {{% /capture %}}
diff --git a/content/en/docs/concepts/containers/runtime-class.md b/content/en/docs/concepts/containers/runtime-class.md
index 00bd9fae34a4f..f0008750a809d 100644
--- a/content/en/docs/concepts/containers/runtime-class.md
+++ b/content/en/docs/concepts/containers/runtime-class.md
@@ -174,15 +174,14 @@ Nodes](/docs/concepts/configuration/assign-pod-node/).
 
 ### Pod Overhead
 
-{{< feature-state for_k8s_version="v1.16" state="alpha" >}}
+{{< feature-state for_k8s_version="v1.18" state="beta" >}}
 
-As of Kubernetes v1.16, RuntimeClass includes support for specifying overhead associated with
-running a pod, as part of the [`PodOverhead`](/docs/concepts/configuration/pod-overhead/) feature.
-To use `PodOverhead`, you must have the PodOverhead [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
-enabled (it is off by default).
+You can specify _overhead_ resources that are associated with running a Pod. Declaring overhead allows
+the cluster (including the scheduler) to account for it when making decisions about Pods and resources.
+To use Pod overhead, you must have the PodOverhead [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+enabled (it is on by default).
 
-
-Pod overhead is defined in RuntimeClass through the `Overhead` fields. Through the use of these fields,
+Pod overhead is defined in RuntimeClass through the `overhead` field. Through the use of this field,
 you can specify the overhead of running pods utilizing this RuntimeClass and ensure these overheads
 are accounted for in Kubernetes.
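+
+If you want to see whether any RuntimeClass in a cluster declares an overhead, one way (purely illustrative;
+the column names below are arbitrary) is to list RuntimeClasses with custom columns:
+
+```bash
+# List RuntimeClasses together with their handler and any fixed per-Pod overhead
+kubectl get runtimeclass -o custom-columns='NAME:.metadata.name,HANDLER:.handler,POD_FIXED_OVERHEAD:.overhead.podFixed'
+```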