
Commit 3db5e6f

KEP-2837: PodLevelResources changes for 1.33
1 parent 1cee5af commit 3db5e6f

2 files changed (+207, -82 lines)


keps/sig-node/2837-pod-level-resource-spec/README.md

Lines changed: 205 additions & 80 deletions
@@ -383,7 +383,7 @@ consumption of the pod.

#### PodSpec API changes

-New field in `PodSpec`
+New field in `PodSpec`:

```
type PodSpec struct {
@@ -396,6 +396,37 @@ type PodSpec struct {
}
```

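The hunk above shows only context lines from `PodSpec`; the new pod-level field itself is defined earlier in this KEP and is not visible in this diff. For orientation, it looks roughly like the sketch below (paraphrased, not copied from the commit):

```
type PodSpec struct {
	...
	// Resources holds the pod-level compute resource requests and limits
	// (CPU and memory in alpha). When set, these take precedence over
	// aggregated container resources for scheduling and for sizing the pod
	// cgroup.
	// +featureGate=PodLevelResources
	// +optional
	Resources *ResourceRequirements
	...
}
```
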
#### PodStatus API changes

Extend `PodStatus` to include pod-level analogs of the container status resource
fields. Pod-level resource information in `PodStatus` is essential for pod-level
[In-Place Pod Update](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/1287-in-place-update-pod-resources/README.md#api-changes)
as it provides a way to track, report and use the actual resource allocation for the
pod, both before and after a resize operation.

```
type PodStatus struct {
	...
	// Resources represents the compute resource requests and limits that have been
	// applied at the pod level. If pod-level resources are not explicitly specified,
	// then these will be the aggregate resources computed from containers. If limits are
	// not defined for all containers (and not defined at the pod level) then no aggregate
	// pod-level limits will be applied.
	// +featureGate=InPlacePodVerticalScaling
	// +optional
	Resources *ResourceRequirements

	// AllocatedResources is the total requests allocated for this pod by the node.
	// Kubelet sets this to the accepted requests when a pod (or resize) is admitted.
	// If pod-level requests are not set, this will be the total requests aggregated
	// across containers in the pod.
	// +featureGate=InPlacePodVerticalScaling
	// +optional
	AllocatedResources ResourceList
}
```

#### PodSpec Validation Rules

##### Proposed Validation & Defaulting Rules

@@ -1172,6 +1203,179 @@ back to aggregating container requests.
size of the pod's cgroup. This means the pod cgroup's resource limits will be
set to accommodate both pod-level requests and pod overhead.

#### Hugepages

With the proposed changes, hugepages-2Mi and hugepages-1Gi will be added to the
pod-level resources section, alongside CPU and memory. The hugetlb cgroup for the
pod will then directly reflect the pod-level hugepage limits, rather than using an
aggregated value from container limits. When scheduling, the scheduler will
consider hugepage requests at the pod level to find nodes with enough available
resources.

Containers will still need to mount an emptyDir volume to access the huge page
filesystem (typically /dev/hugepages). This is the standard way for containers to
interact with huge pages, and this will not change.

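As an illustration of the behavior described above, the limit programmed into the pod's hugetlb cgroup for a given page size could be chosen as sketched below. This is a simplified, hypothetical helper (not code from this KEP); note that, unlike memory, an unset hugetlb limit behaves like 0 rather than unlimited, so the fallback is simply the sum of container limits.

```
package main

import "fmt"

// podHugetlbLimitBytes chooses the hugetlb limit (in bytes) to program on the
// pod cgroup for one page size: a pod-level limit wins; otherwise fall back to
// the aggregate of container limits (an unset hugetlb limit behaves like 0).
func podHugetlbLimitBytes(podLimit *int64, containerLimits []int64) int64 {
	if podLimit != nil {
		return *podLimit
	}
	var total int64
	for _, l := range containerLimits {
		total += l
	}
	return total
}

func main() {
	twoMi := int64(2 << 20)
	containerLimits := []int64{4 * twoMi, 2 * twoMi} // containers limited to 8Mi and 4Mi of hugepages-2Mi

	fmt.Println(podHugetlbLimitBytes(nil, containerLimits)) // no pod-level limit: aggregate 12Mi

	podLimit := 16 * twoMi
	fmt.Println(podHugetlbLimitBytes(&podLimit, containerLimits)) // pod-level limit wins: 16Mi
}
```
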
#### Memory Manager

With the introduction of pod-level resource specifications, the Kubernetes Memory
Manager will evolve to track and enforce resource limits at both the pod and
container levels. It will need to aggregate memory usage across all containers
within a pod to calculate the pod's total memory consumption. The Memory Manager
will then enforce the pod-level limit as the hard cap for the entire pod's memory
usage, preventing it from exceeding the allocated amount. While still
maintaining container-level limit enforcement, the Memory Manager will need to
coordinate with the Kubelet and eviction manager to make decisions about pod
eviction or individual container termination when the pod-level limit is
breached.

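The aggregation-and-enforcement idea above can be illustrated with a small sketch; the types and helper below are hypothetical and do not reflect the Memory Manager's actual interfaces:

```
package main

import "fmt"

type containerMemUsage struct {
	name       string
	usageBytes int64
}

// podExceedsMemoryLimit aggregates container memory usage and reports whether
// the pod-level limit has been breached, which is the signal for the kubelet
// and eviction manager to decide between pod eviction and container termination.
func podExceedsMemoryLimit(containers []containerMemUsage, podLimitBytes int64) (int64, bool) {
	var total int64
	for _, c := range containers {
		total += c.usageBytes
	}
	return total, total > podLimitBytes
}

func main() {
	usage := []containerMemUsage{{"c1", 120 << 20}, {"c2", 100 << 20}}
	total, exceeded := podExceedsMemoryLimit(usage, 200<<20)
	fmt.Printf("total=%dMi exceeded=%v\n", total>>20, exceeded)
}
```
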
#### CPU Manager

With the introduction of pod-level resource specifications, the CPU manager in
Kubernetes will adapt to manage CPU requests and limits at the pod level rather
than solely at the container level. This change means that the CPU manager will
allocate and enforce CPU resources based on the total requirements of the entire
pod, allowing for more flexible and efficient CPU utilization across all
containers within a pod. The CPU manager will need to ensure that the aggregate
CPU usage of all containers in a pod does not exceed the pod-level limits.

The CPU Manager policies are container-level configurations that control the
fine-grained allocation of CPU resources to containers. While CPU manager
policies will operate within the constraints of pod-level resource limits, they
do not directly apply at the pod level.

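One way to picture the pod-level enforcement described above: the pod cgroup's CFS quota would be derived from the pod-level CPU limit rather than from summed container limits. The conversion below mirrors the usual millicores-to-quota arithmetic and is only an illustrative sketch (constants and helper name are assumptions, not taken from this KEP):

```
package main

import "fmt"

const (
	cfsPeriodUs = 100000 // default CFS period: 100ms
	minQuotaUs  = 1000   // minimum quota accepted by the kernel
)

// milliCPUToQuota converts a CPU limit in millicores into a CFS quota (in
// microseconds per period) suitable for the pod-level cgroup.
func milliCPUToQuota(milliCPU int64) int64 {
	if milliCPU == 0 {
		return -1 // no limit: unlimited quota
	}
	quota := milliCPU * cfsPeriodUs / 1000
	if quota < minQuotaUs {
		quota = minQuotaUs
	}
	return quota
}

func main() {
	// A pod-level limit of cpu: 200m maps to a 20ms quota per 100ms period on
	// the pod cgroup, capping the aggregate CPU of all containers in the pod.
	fmt.Println(milliCPUToQuota(200))
}
```
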
#### In-Place Pod Resize

##### API changes

In-place pod resize (IPPR) for pod-level resources requires extending `PodStatus`
to include pod-level resource fields, as detailed in the
[PodStatus API changes](#podstatus-api-changes) section.

##### Resize Restart Policy

Pod-level resize policy is not supported in the alpha stage of the pod-level resources
feature. While a pod-level resize policy might be beneficial for VM-based runtimes
like Kata Containers (potentially allowing the hypervisor to restart the entire VM
on resize), this is a topic for future consideration. We plan to engage with the
Kata community to discuss this further and will re-evaluate the need for a pod-level
policy in subsequent development stages.

The absence of a pod-level resize policy means that container restarts are
exclusively managed by their individual `resizePolicy` configurations. The example below of
a pod with pod-level resources demonstrates several key aspects of this behavior,
showing how containers without explicit limits (which inherit pod-level limits) interact
with resize policy, and how containers with specified resources remain unaffected by
pod-level resizes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-resources
spec:
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 200m
      memory: 200Mi
  containers:
  - name: c1
    image: registry.k8s.io/pause:latest
    resizePolicy:
    - resourceName: "cpu"
      restartPolicy: "NotRequired"
    - resourceName: "memory"
      restartPolicy: "RestartRequired"
  - name: c2
    image: registry.k8s.io/pause:latest
    resources:
      requests:
        cpu: 50m
        memory: 50Mi
      limits:
        cpu: 100m
        memory: 100Mi
    resizePolicy:
    - resourceName: "cpu"
      restartPolicy: "NotRequired"
    - resourceName: "memory"
      restartPolicy: "RestartRequired"
```

In this example:
* CPU resizes: Neither container requires a restart for CPU resizes, so CPU resizes
  at either the container or the pod level will not trigger any restarts.
* Container c1 (inherited memory limit): c1 does not define any container-level
  resources, so the effective memory limit of the container is determined by the
  pod-level limit. When the pod's limit is resized, c1's effective memory limit
  changes. Because c1's memory resizePolicy is RestartRequired, a resize of the
  pod-level memory limit will trigger a restart of container c1.
* Container c2 (specified memory limit): c2 does define container-level resources,
  so the effective memory limit of c2 is the container-level limit. Therefore, a
  resize of the pod-level memory limit doesn't change the effective container limit,
  and c2 is not restarted when the pod-level memory limit is resized.

##### Implementation Details

###### Allocating Pod-level Resources

Allocation of pod-level resources will work the same way as for container-level
resources. The allocated-resources checkpoint will be extended to include pod-level
resources, and the pod object will be updated with the allocated resources in the
pod sync loop.

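To make the checkpoint extension concrete, here is a hypothetical sketch of what an allocation checkpoint entry extended with pod-level resources might look like; the type and field names are illustrative assumptions, not the kubelet's actual checkpoint format:

```
package main

import "fmt"

// ResourceList is a simplified stand-in for v1.ResourceList.
type ResourceList map[string]string

// PodAllocationEntry sketches one checkpointed pod: the existing per-container
// allocations plus the new pod-level requests accepted at (re)admission.
type PodAllocationEntry struct {
	Containers   map[string]ResourceList `json:"containers"`
	PodResources ResourceList            `json:"podResources,omitempty"`
}

func main() {
	entry := PodAllocationEntry{
		Containers:   map[string]ResourceList{"c1": {"cpu": "50m", "memory": "50Mi"}},
		PodResources: ResourceList{"cpu": "100m", "memory": "100Mi"},
	}
	fmt.Printf("%+v\n", entry)
}
```
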
###### Actuating Pod-level Resource Resize

A dirty bit for pod-level resources will be added to the allocation checkpoint to
signal the need for resize. This change, along with the existing tracking and
triggering mechanisms, is sufficient for pod-level resize actuation. The same
ordering rules for pod and container resource resizing will be applied for each
resource as needed:
1. Increase pod-level cgroup (if needed)
2. Decrease container resources
3. Decrease pod-level cgroup (if needed)
4. Increase container resources

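The ordering rules above can be sketched as follows; the helper functions are hypothetical stand-ins for the kubelet's pod-cgroup and container update paths:

```
package main

// resize captures current and desired values for one resource at the pod and
// container level (container values shown as an aggregate for brevity).
type resize struct {
	currentPodLimit, desiredPodLimit         int64
	currentContainerSum, desiredContainerSum int64
}

func setPodCgroupLimit(limit int64) { /* write the pod-level cgroup limit */ }
func setContainerLimits(sum int64)  { /* resize the container cgroups */ }

// applyResize grows limits top-down and shrinks them bottom-up, so the pod
// cgroup is never smaller than what its containers are allowed to use.
func applyResize(r resize) {
	if r.desiredPodLimit > r.currentPodLimit { // 1. increase pod-level cgroup (if needed)
		setPodCgroupLimit(r.desiredPodLimit)
	}
	if r.desiredContainerSum < r.currentContainerSum { // 2. decrease container resources
		setContainerLimits(r.desiredContainerSum)
	}
	if r.desiredPodLimit < r.currentPodLimit { // 3. decrease pod-level cgroup (if needed)
		setPodCgroupLimit(r.desiredPodLimit)
	}
	if r.desiredContainerSum > r.currentContainerSum { // 4. increase container resources
		setContainerLimits(r.desiredContainerSum)
	}
}

func main() {
	// Example: the pod-level limit grows from 200Mi to 300Mi while the
	// aggregate container limit shrinks from 180Mi to 150Mi.
	applyResize(resize{
		currentPodLimit: 200 << 20, desiredPodLimit: 300 << 20,
		currentContainerSum: 180 << 20, desiredContainerSum: 150 << 20,
	})
}
```
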
###### Tracking Actual Pod-level Resources

To accurately track actual pod-level resources during in-place pod resizing, several
changes are required:

1. Configuration reading: Pod-level resource configuration is currently read only as
   part of the resize flow, before the resize is applied. It will also need to be read
   during pod creation and, critically, read again after the resize operation so that
   the updated resource values are captured.

2. Pod Status Update: Because the pod status is updated before the resize takes
   effect, the status will not immediately reflect the new resource values. If a
   container within the pod is also being resized, the container resize operation
   will trigger a pod synchronization (pod-sync), which will refresh the pod's
   status. However, if only pod-level resources are being resized, a pod-sync must
   be explicitly triggered to update the pod status with the new resource
   allocation.

3. [Scoped for Beta] Caching: Actual pod resource data may be cached to minimize API
   server load. This cache, if implemented, must be invalidated after each successful
   pod resize to ensure that subsequent reads retrieve the latest information. The
   need for and implementation of this caching mechanism will be evaluated in the
   beta phase. Performance benchmarking will be conducted to determine if caching is
   required and, if so, what caching strategy is most appropriate.

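As a concrete illustration of the configuration-reading step, the kubelet would re-read the pod-level cgroup configuration after a resize to capture the actual values. A simplified cgroup v2 example is sketched below; the path and helper name are assumptions, not kubelet code:

```
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readPodMemoryLimit returns the memory limit currently programmed on the pod
// cgroup, or -1 if the limit is "max" (unlimited).
func readPodMemoryLimit(podCgroupDir string) (int64, error) {
	raw, err := os.ReadFile(filepath.Join(podCgroupDir, "memory.max"))
	if err != nil {
		return 0, err
	}
	val := strings.TrimSpace(string(raw))
	if val == "max" {
		return -1, nil
	}
	return strconv.ParseInt(val, 10, 64)
}

func main() {
	// Hypothetical pod cgroup path for the systemd cgroup driver.
	limit, err := readPodMemoryLimit("/sys/fs/cgroup/kubepods.slice/kubepods-pod1234.slice")
	fmt.Println(limit, err)
}
```
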
**Note for future enhancements for Ephemeral containers with pod-level resources and
IPPR**

Currently, assigning `resources` to ephemeral containers is disallowed because pod
resource allocations were immutable before the In-Place Pod Resizing feature. With
the introduction of in-place pod resizing, users will gain more flexibility:

* Adjust pod-level resources to accommodate the needs of ephemeral containers. This
allows for a more dynamic allocation of resources within the pod.
* Specify resource requests and limits directly for ephemeral containers. Kubernetes will
then automatically resize the pod to ensure sufficient resources are available
for both regular and ephemeral containers.

#### [Scoped for Beta] Topology Manager

Note: This section includes only a high-level overview; design details will be added in the Beta stage.

* The pod-level scope for topology alignment will consider pod-level requests and limits instead of container-level aggregates.
* The hint providers will consider pod-level requests and limits instead of
container-level aggregates.

#### [Scoped for Beta] User Experience Survey

Before promoting the feature to Beta, we plan to conduct a UX survey to
@@ -1291,85 +1495,6 @@ KEPs. The first change doesn’t present any user visible change, and if
implemented, will in a small way reduce the effort for both of those KEPs by
providing a single place to update the pod resource calculation.

-#### [Scoped for Beta] HugeTLB cgroup

-Note: This section includes only high level overview; Design details will be added in Beta stage.

-To support pod-level resource specifications for hugepages, Kubernetes will need to adjust how it handles hugetlb cgroups. Unlike memory, where an unset limit
-means unlimited, an unset hugetlb limit is the same as setting it to 0.

-With the proposed changes, hugepages-2Mi and hugepages-1Gi will be added to the pod-level resources section, alongside CPU and memory. The hugetlb cgroup for the
-pod will then directly reflect the pod-level hugepage limits, rather than using an aggregated value from container limits. When scheduling, the scheduler will
-consider hugepage requests at the pod level to find nodes with enough available resources.

-#### [Scoped for Beta] Topology Manager

-Note: This section includes only high level overview; Design details will be added in Beta stage.

-* (Tentative) Only pod level scope for topology alignment will be supported if pod level requests and limits are specified without container-level requests and limits.
-* The pod level scope for topology alignment will consider pod level requests and limits instead of container level aggregates.
-* The hint providers will consider pod level requests and limits instead of container level aggregates.

-#### [Scoped for Beta] Memory Manager

-Note: This section includes only high level overview; Design details will be
-added in Beta stage.

-With the introduction of pod-level resource specifications, the Kubernetes Memory
-Manager will evolve to track and enforce resource limits at both the pod and
-container levels. It will need to aggregate memory usage across all containers
-within a pod to calculate the pod's total memory consumption. The Memory Manager
-will then enforce the pod-level limit as the hard cap for the entire pod's memory
-usage, preventing it from exceeding the allocated amount. While still
-maintaining container-level limit enforcement, the Memory Manager will need to
-coordinate with the Kubelet and eviction manager to make decisions about pod
-eviction or individual container termination when the pod-level limit is
-breached.

-#### [Scoped for Beta] CPU Manager

-Note: This section includes only high level overview; Design details will be
-added in Beta stage.

-With the introduction of pod-level resource specifications, the CPU manager in
-Kubernetes will adapt to manage CPU requests and limits at the pod level rather
-than solely at the container level. This change means that the CPU manager will
-allocate and enforce CPU resources based on the total requirements of the entire
-pod, allowing for more flexible and efficient CPU utilization across all
-containers within a pod. The CPU manager will need to ensure that the aggregate
-CPU usage of all containers in a pod does not exceed the pod-level limits.

-#### [Scoped for Beta] In-Place Pod Resize

-In-Place Pod resizing of resources is not supported in alpha stage of Pod-level
-resources feature. **Users should avoid using in-place pod resizing if they are
-utilizing pod-level resources.**

-In version 1.33, the In-Place Pod resize functionality will be controlled by a
-separate feature gate and introduced as an independent alpha feature. This is
-necessary as it involves new fields in the PodStatus at the pod level.

-Note for design & implementation: Previously, assigning resources to ephemeral
-containers wasn't allowed because pod resource allocations were immutable. With
-the introduction of in-place pod resizing, users will gain more flexibility:

-* Adjust pod-level resources to accommodate the needs of ephemeral containers. This
-allows for a more dynamic allocation of resources within the pod.
-* Specify resource requests and limits directly for ephemeral containers. Kubernetes will
-then automatically resize the pod to ensure sufficient resources are available
-for both regular and ephemeral containers.

-Currently, setting `resources` for ephemeral containers is disallowed as pod
-resource allocations were immutable before In-Place Pod Resizing feature. With
-in-place pod resize for pod-level resource allocation, users should be able to
-either modify the pod-level resources to accommodate ephemeral containers or
-supply resources at container-level for ephemeral containers and kubernetes will
-resize the pod to accommodate the ephemeral containers.

#### [Scoped for Beta] VPA
TBD. Do not review for the alpha stage.

keps/sig-node/2837-pod-level-resource-spec/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -26,11 +26,11 @@ stage: alpha
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
-latest-milestone: "v1.32"
+latest-milestone: "v1.33"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
-  alpha: "v1.32"
+  alpha: "v1.33"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled

0 commit comments
