
When you run a Pod on a Node, the Pod itself takes an amount of system resources. These
resources are additional to the resources needed to run the container(s) inside the Pod.
_Pod Overhead_ is a feature for accounting for the resources consumed by the Pod infrastructure
on top of the container requests & limits.


{{% capture body %}}

In Kubernetes, the Pod's overhead is set at
[admission](/docs/reference/access-authn-authz/extensible-admission-controllers/#what-are-admission-webhooks)
time according to the overhead associated with the Pod's
[RuntimeClass](/docs/concepts/containers/runtime-class/).

When Pod Overhead is enabled, the overhead is considered in addition to the sum of container
resource requests when scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing
the Pod cgroup, and when carrying out Pod eviction ranking.

## Enabling Pod Overhead {#set-up}

You need to make sure that the `PodOverhead`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled (it is on by default as of 1.18)
across your cluster, and that a `RuntimeClass` which defines the `overhead` field is in use.
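
If you need to set the gate explicitly (for example, on a cluster created before it defaulted to on),
it has to be enabled on every component that evaluates it: kube-apiserver, kube-scheduler, and the
kubelet on each Node. A minimal sketch, assuming you can pass flags directly to these components (how
you do this depends on how your cluster is deployed):

```bash
# Illustrative flags only; managed clusters expose this setting differently.
kube-apiserver --feature-gates=PodOverhead=true
kube-scheduler --feature-gates=PodOverhead=true
kubelet        --feature-gates=PodOverhead=true
```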

## Usage example

To use the PodOverhead feature, you need a RuntimeClass that defines the `overhead` field. As
an example, you could use the following RuntimeClass definition with a virtualizing container runtime
that uses around 120MiB per Pod for the virtual machine and the guest OS:

```yaml
---
kind: RuntimeClass
apiVersion: node.k8s.io/v1beta1
metadata:
  name: kata-fc
handler: kata-fc
overhead:
  podFixed:
    memory: "120Mi"
    cpu: "250m"
```

Workloads that specify the `kata-fc` RuntimeClass handler will take the memory and CPU overheads
into account for resource quota calculations, node scheduling, and Pod cgroup sizing.
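
As a quick sanity check, you can create the RuntimeClass above and read back the overhead it declares
(a sketch; the file name is whatever you saved the manifest as):

```bash
# Create the RuntimeClass shown above (the file name is illustrative)
kubectl apply -f kata-fc-runtimeclass.yaml

# Read back the per-Pod overhead it declares
kubectl get runtimeclass kata-fc -o jsonpath='{.overhead.podFixed}'
```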

Consider running the given example workload, test-pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  runtimeClassName: kata-fc
  containers:
  - name: busybox-ctr
    image: busybox
    stdin: true
    tty: true
    resources:
      limits:
        cpu: 500m
        memory: 100Mi
  - name: nginx-ctr
    image: nginx
    resources:
      limits:
        cpu: 1500m
        memory: 100Mi
```

At admission time the RuntimeClass [admission controller](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
updates the workload's PodSpec to include the `overhead` as described in the RuntimeClass. If the PodSpec already has this field defined,
the Pod will be rejected. In the given example, since only the RuntimeClass name is specified, the admission controller mutates the Pod
to include an `overhead`.

After the RuntimeClass admission controller, you can check the updated PodSpec:

```bash
kubectl get pod test-pod -o jsonpath='{.spec.overhead}'
```

The output is:
```
map[cpu:250m memory:120Mi]
```

If a ResourceQuota is defined, the sum of container requests as well as the
`overhead` field are counted.
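
For example, against a namespace quota the Pod above should consume 2250m of CPU and 320Mi of memory,
not just the 2000m / 200Mi its containers ask for. A sketch, using a hypothetical quota in the
`default` namespace:

```bash
# Hypothetical quota; the name and limits are only for illustration
kubectl create quota pod-overhead-demo --hard=requests.cpu=3,requests.memory=1Gi -n default

# After test-pod is created, the Used column should include the overhead
kubectl describe quota pod-overhead-demo -n default
```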

When the kube-scheduler is deciding which node should run a new Pod, the scheduler considers that Pod's
`overhead` as well as the sum of container requests for that Pod. For this example, the scheduler adds the
requests and the overhead, then looks for a node that has 2.25 CPU and 320 MiB of memory available.
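
Spelled out for this example (the node name is a placeholder):

```bash
# CPU:    500m + 1500m (containers) + 250m (overhead)   = 2250m
# Memory: 100Mi + 100Mi (containers) + 120Mi (overhead) = 320Mi
# Compare against what a candidate node has allocatable:
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
```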

Once a Pod is scheduled to a node, the kubelet on that node creates a new {{< glossary_tooltip text="cgroup" term_id="cgroup" >}}
for the Pod. It is within this cgroup that the underlying container runtime will create containers.

If the resource has a limit defined for each container (Guaranteed QoS or Burstable QoS with limits defined),
the kubelet will set an upper limit for the pod cgroup associated with that resource (`cpu.cfs_quota_us` for CPU
and `memory.limit_in_bytes` for memory). This upper limit is based on the sum of the container limits plus the `overhead`
defined in the PodSpec.
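
For the example Pod, that works out to the following values (a sketch, assuming the kubelet's default
100ms CFS period):

```bash
echo $(( (100 + 100 + 120) * 1024 * 1024 ))    # memory.limit_in_bytes: 335544320 bytes (320Mi)
echo $(( (500 + 1500 + 250) * 100000 / 1000 )) # cpu.cfs_quota_us: 225000 (2.25 CPU per 100ms period)
```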

For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set `cpu.shares` based on the sum of container
requests plus the `overhead` defined in the PodSpec.
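
As a rough check of the numbers, the kubelet converts milliCPU to shares as `milliCPU * 1024 / 1000`,
so for this Pod:

```bash
echo $(( (500 + 1500 + 250) * 1024 / 1000 ))   # expected Pod-level cpu.shares: 2304
```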

Looking at our example, verify the container requests for the workload:
```bash
kubectl get pod test-pod -o jsonpath='{.spec.containers[*].resources.limits}'
```

The total container requests are 2000m CPU and 200MiB of memory:
```
map[cpu:500m memory:100Mi] map[cpu:1500m memory:100Mi]
```

Check this against what is observed by the node:
```bash
kubectl describe node | grep test-pod -B2
```

The output shows 2250m CPU and 320MiB of memory are requested, which includes PodOverhead:
```
  Namespace    Name       CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------    ----       ------------  ----------   ---------------  -------------  ---
  default      test-pod   2250m (56%)   2250m (56%)  320Mi (1%)       320Mi (1%)     36m
```

Check the Pod's memory cgroups on the node where the workload is running. First, on the particular node, determine
the Pod identifier:

```bash
# Run this on the node where the Pod is scheduled
sudo crictl pods | grep test-pod
```

The value observed is `7ccf55aee35dd`:
```
7ccf55aee35dd       58 minutes ago      Ready       test-pod        default     0
```

From this, you can determine the cgroup path for the Pod:
```bash
# Run this on the node where the Pod is scheduled
sudo crictl inspectp -o=json 7ccf55aee35dd | grep cgroupsPath
```

The resulting cgroup path includes the Pod's `pause` container. The Pod level cgroup is one directory above.
```
        "cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd16aca4189c952d83487297f3cd760f1bbf09620e206e7d0c27a"
```

In this specific case, the pod cgroup path is `kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2`. Verify the Pod level cgroup setting for memory:
```bash
# Run this on the node where the Pod is scheduled.
# Also, change the name of the cgroup to match the cgroup allocated for your Pod.
cat /sys/fs/cgroup/memory/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/memory.limit_in_bytes
```

This is 320 MiB, as expected:
```
335544320
```
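
You can make the same check for CPU. A sketch, assuming a cgroup v1 node where the CPU controller is
mounted at `cpu,cpuacct` (adjust the mount point and the Pod cgroup name for your node); with the
default 100ms CFS period this should read 225000:

```bash
# Run this on the node where the Pod is scheduled.
cat /sys/fs/cgroup/cpu,cpuacct/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/cpu.cfs_quota_us
```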

### Observability

A `kube_pod_overhead` metric is available in [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
to help identify when PodOverhead is being utilized and to help observe stability of workloads
running with a defined Overhead. This functionality is not available in the 1.9 release of
kube-state-metrics, but is expected in a following release. Users will need to build kube-state-metrics
from source in the meantime.
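
Once you are running a build that includes the metric, you could scrape it directly. A sketch, assuming
kube-state-metrics is exposed as a Service named `kube-state-metrics` on port 8080 in the `kube-system`
namespace (the name, namespace, and port are assumptions):

```bash
# Forward the kube-state-metrics service locally (service name, namespace, and port assumed)
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &

# Look for the per-Pod overhead series
curl -s http://localhost:8080/metrics | grep kube_pod_overhead
```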

{{% /capture %}}