**content/en/docs/concepts/configuration/pod-overhead.md** (158 additions, 22 deletions)

{{% capture overview %}}

{{< feature-state for_k8s_version="v1.16" state="alpha" >}}
{{< feature-state for_k8s_version="v1.18" state="beta" >}}


When you run a Pod on a Node, the Pod itself takes an amount of system resources. These
resources are additional to the resources needed to run the container(s) inside the Pod.
_Pod Overhead_ is a feature for accounting for the resources consumed by the Pod infrastructure
on top of the container requests & limits.



{{% capture body %}}

## Pod Overhead

In Kubernetes, the Pod's overhead is set at
[admission](/docs/reference/access-authn-authz/extensible-admission-controllers/#what-are-admission-webhooks)
time according to the overhead associated with the Pod's
[RuntimeClass](/docs/concepts/containers/runtime-class/).

When Pod Overhead is enabled, the overhead is considered in addition to the sum of container
resource requests when scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing
the Pod cgroup, and when carrying out Pod eviction ranking.

## Enabling Pod Overhead {#set-up}

You need to make sure that the `PodOverhead`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled (it is on by default as of 1.18)
across your cluster, and that you use a `RuntimeClass` which defines the `overhead` field.
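
If you are on a cluster version where the gate is not already on, you can enable it explicitly. A sketch, not a complete invocation (how the flag is passed depends on how your control plane components and kubelets are deployed):

```bash
# Sketch: enable the PodOverhead feature gate on each relevant component.
# The trailing "..." stands in for each component's other flags.
kube-apiserver --feature-gates=PodOverhead=true ...
kube-scheduler --feature-gates=PodOverhead=true ...
kubelet        --feature-gates=PodOverhead=true ...
```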

## Usage example

To use the PodOverhead feature, you need a RuntimeClass that defines the `overhead` field. As
an example, you could use the following RuntimeClass definition with a virtualizing container runtime
that uses around 120MiB per Pod for the virtual machine and the guest OS:

```yaml
---
kind: RuntimeClass
apiVersion: node.k8s.io/v1beta1
metadata:
  name: kata-fc
handler: kata-fc
overhead:
  podFixed:
    memory: "120Mi"
    cpu: "250m"
```
> **Review comment (Contributor), on lines +48 to +59:** Consider making this a downloadable example; see https://kubernetes.io/docs/contribute/style/write-new-topic/#embedding-code-in-your-topic for details (use {{< codenew >}}).


Workloads that specify the `kata-fc` RuntimeClass handler will take the memory and
CPU overheads into account for resource quota calculations, node scheduling, and Pod cgroup sizing.

Consider running the given example workload, test-pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  runtimeClassName: kata-fc
  containers:
  - name: busybox-ctr
    image: busybox
    stdin: true
    tty: true
    resources:
      limits:
        cpu: 500m
        memory: 100Mi
  - name: nginx-ctr
    image: nginx
    resources:
      limits:
        cpu: 1500m
        memory: 100Mi
```
> **Review comment (Contributor), on lines +64 to +88:** This might work better as a new Task page, “Define a RuntimeClass with a PodOverhead”
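
To try the example out, you could save the two manifests above and create both objects with `kubectl` (the file names here are illustrative):

```bash
# File names are illustrative; use whatever you saved the manifests as.
kubectl apply -f kata-fc-runtimeclass.yaml
kubectl apply -f test-pod.yaml
```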


At admission time the RuntimeClass [admission controller](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
updates the workload's PodSpec to include the `overhead` as described in the RuntimeClass. If the PodSpec already has this field defined,
the Pod will be rejected. In the given example, since only the RuntimeClass name is specified, the admission controller mutates the Pod
to include an `overhead`.
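
Conceptually, the mutated Pod spec then carries the overhead copied from the RuntimeClass. An illustrative excerpt of the relevant portion (the values come from `overhead.podFixed` above):

```yaml
# Illustrative excerpt of test-pod's spec after admission:
spec:
  runtimeClassName: kata-fc
  overhead:            # populated by the RuntimeClass admission controller
    cpu: 250m
    memory: 120Mi
```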

Once the RuntimeClass admission controller has run, you can check the updated PodSpec:

```bash
kubectl get pod test-pod -o jsonpath='{.spec.overhead}'
```

> **Review comment (Member):** Is pod overhead included in the output of `kubectl describe`? If so, that might be a more user-friendly way of checking.
>
> **Reply (Author):** It is not part of `kubectl describe`.

The output is:
```
map[cpu:250m memory:120Mi]
```

If a ResourceQuota is defined, the sum of the container requests plus the
`overhead` field is counted against it.
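
For instance, with a quota such as the following (a hypothetical example; names and amounts are arbitrary), test-pod would count as 2250m CPU and 320Mi of memory against `requests.cpu` and `requests.memory`, rather than the bare container sums of 2000m and 200Mi:

```yaml
# Hypothetical ResourceQuota for illustration only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: example-quota
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 1Gi
```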

When the kube-scheduler is deciding which node should run a new Pod, the scheduler considers that Pod's
`overhead` as well as the sum of container requests for that Pod. For this example, the scheduler adds the
requests and the overhead, then looks for a node that has 2.25 CPU and 320 MiB of memory available.
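
Broken out, that is:

```
CPU:    500m  + 1500m (containers) + 250m (overhead) = 2250m (2.25 CPU)
memory: 100Mi + 100Mi (containers) + 120Mi (overhead) = 320Mi
```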

Once a Pod is scheduled to a node, the kubelet on that node creates a new {{< glossary_tooltip text="cgroup" term_id="cgroup" >}}
for the Pod. It is within this cgroup that the underlying container runtime will create containers.

If the resource has a limit defined for each container (Guaranteed QoS or Burstable QoS with limits defined),
the kubelet will set an upper limit for the Pod cgroup associated with that resource (`cpu.cfs_quota_us` for CPU
and `memory.limit_in_bytes` for memory). This upper limit is based on the sum of the container limits plus the `overhead`
defined in the PodSpec.

For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set `cpu.shares` based on the sum of container
requests plus the `overhead` defined in the PodSpec.
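
For this example, assuming the kubelet's usual millicore-to-shares conversion (shares = milliCPU × 1024 / 1000), that works out to:

```
cpu.shares: (500m + 1500m + 250m) × 1024 / 1000 = 2304
```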

Looking at our example, verify the container requests for the workload:
```bash
kubectl get pod test-pod -o jsonpath='{.spec.containers[*].resources.limits}'
```

The total container requests are 2000m CPU and 200MiB of memory:
```
map[cpu:500m memory:100Mi] map[cpu:1500m memory:100Mi]
```

Check this against what is observed by the node:
```bash
kubectl describe node | grep test-pod -B2
```

The output shows 2250m CPU and 320MiB of memory are requested, which includes PodOverhead:
```
Namespace    Name        CPU Requests   CPU Limits    Memory Requests   Memory Limits   AGE
---------    ----        ------------   ----------    ---------------   -------------   ---
default      test-pod    2250m (56%)    2250m (56%)   320Mi (1%)        320Mi (1%)      36m
```

## Verify Pod cgroup limits

Check the Pod's memory cgroups on the node where the workload is running. In the following example, [`crictl`](https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md)
is used on the node, which provides a CLI for CRI-compatible container runtimes. This is an
advanced example to show PodOverhead behavior, and it is not expected that users should need to check
cgroups directly on the node.

First, on the particular node, determine the Pod identifier:

```bash
# Run this on the node where the Pod is scheduled
POD_ID="$(sudo crictl pods --name test-pod -q)"
```

From this, you can determine the cgroup path for the Pod:
```bash
# Run this on the node where the Pod is scheduled
sudo crictl inspectp -o=json $POD_ID | grep cgroupsPath
```

The resulting cgroup path includes the Pod's `pause` container. The Pod level cgroup is one directory above.
```
"cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd16aca4189c952d83487297f3cd760f1bbf09620e206e7d0c27a"
```
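
If you want to derive the Pod-level path mechanically, a sketch (assuming the JSON field appears exactly as shown above):

```bash
# Run this on the node where the Pod is scheduled.
# Extract cgroupsPath from the sandbox JSON, then strip the trailing
# container directory to get the Pod-level cgroup path.
CGROUP_PATH="$(sudo crictl inspectp -o=json $POD_ID \
  | grep -o '"cgroupsPath": "[^"]*"' | cut -d'"' -f4)"
dirname "$CGROUP_PATH"
```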

In this specific case, the Pod cgroup path is `kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2`. Verify the Pod level cgroup setting for memory:
```bash
# Run this on the node where the Pod is scheduled.
# Also, change the name of the cgroup to match the cgroup allocated for your pod.
cat /sys/fs/cgroup/memory/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/memory.limit_in_bytes
```

This is 320 MiB, as expected:
```
335544320
```
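
The arithmetic: 320 MiB expressed in bytes is

```
320 × 1024 × 1024 = 335544320
```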

### Observability

A `kube_pod_overhead` metric is available in [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
to help identify when PodOverhead is being utilized and to help observe stability of workloads
running with a defined Overhead. This functionality is not available in the 1.9 release of
kube-state-metrics, but is expected in a following release. Users will need to build kube-state-metrics
from source in the meantime.

{{% /capture %}}

**content/en/docs/concepts/containers/runtime-class.md** (6 additions, 7 deletions)

### Pod Overhead

{{< feature-state for_k8s_version="v1.16" state="alpha" >}}
{{< feature-state for_k8s_version="v1.18" state="beta" >}}

You can specify _overhead_ resources that are associated with running a Pod. Declaring overhead allows
the cluster (including the scheduler) to account for it when making decisions about Pods and resources.
To use Pod overhead, you must have the PodOverhead [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled (it is on by default).


Pod overhead is defined in RuntimeClass through the `overhead` field. Through the use of this field,
you can specify the overhead of running Pods utilizing this RuntimeClass and ensure these overheads
are accounted for in Kubernetes.
