
Commit 3db5e6f

KEP-2837: PodLevelResources changes for 1.33
1 parent 1cee5af commit 3db5e6f

2 files changed (+207, -82 lines)


keps/sig-node/2837-pod-level-resource-spec/README.md

Lines changed: 205 additions & 80 deletions
@@ -383,7 +383,7 @@ consumption of the pod.

#### PodSpec API changes

-New field in `PodSpec`
+New field in `PodSpec`:

```
type PodSpec struct {
@@ -396,6 +396,37 @@ type PodSpec struct {
}
```

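The hunk above shows only context lines from `PodSpec`; the new pod-level field itself is defined earlier in this KEP and is not visible in this diff. For orientation, it looks roughly like the sketch below (paraphrased, not copied from the commit):

```
type PodSpec struct {
	...
	// Resources holds the pod-level compute resource requests and limits
	// (CPU and memory in alpha). When set, these take precedence over
	// aggregated container resources for scheduling and for sizing the pod
	// cgroup.
	// +featureGate=PodLevelResources
	// +optional
	Resources *ResourceRequirements
	...
}
```
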
#### PodStatus API changes

Extend `PodStatus` to include pod-level analogs of the container status resource
fields. Pod-level resource information in `PodStatus` is essential for pod-level
[In-Place Pod Update](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/1287-in-place-update-pod-resources/README.md#api-changes)
as it provides a way to track, report and use the actual resource allocation for the
pod, both before and after a resize operation.

```
type PodStatus struct {
	...
	// Resources represents the compute resource requests and limits that have been
	// applied at the pod level. If pod-level resources are not explicitly specified,
	// then these will be the aggregate resources computed from containers. If limits are
	// not defined for all containers (and not defined at the pod level) then no aggregate
	// pod-level limits will be applied.
	// +featureGate=InPlacePodVerticalScaling
	// +optional
	Resources *ResourceRequirements

	// AllocatedResources is the total requests allocated for this pod by the node.
	// Kubelet sets this to the accepted requests when a pod (or resize) is admitted.
	// If pod-level requests are not set, this will be the total requests aggregated
	// across containers in the pod.
	// +featureGate=InPlacePodVerticalScaling
	// +optional
	AllocatedResources ResourceList
}
```

#### PodSpec Validation Rules

##### Proposed Validation & Defaulting Rules

@@ -1172,6 +1203,179 @@ back to aggregating container requests.
size of the pod's cgroup. This means the pod cgroup's resource limits will be
set to accommodate both pod-level requests and pod overhead.

#### Hugepages

With the proposed changes, hugepages-2Mi and hugepages-1Gi will be added to the
pod-level resources section, alongside CPU and memory. The hugetlb cgroup for the
pod will then directly reflect the pod-level hugepage limits, rather than using an
aggregated value from container limits. When scheduling, the scheduler will
consider hugepage requests at the pod level to find nodes with enough available
resources.

Containers will still need to mount an emptyDir volume to access the huge page
filesystem (typically /dev/hugepages). This is the standard way for containers to
interact with huge pages, and this will not change.

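As an illustration of the behavior described above, the limit programmed into the pod's hugetlb cgroup for a given page size could be chosen as sketched below. This is a simplified, hypothetical helper (not code from this KEP); note that, unlike memory, an unset hugetlb limit behaves like 0 rather than unlimited, so the fallback is simply the sum of container limits.

```
package main

import "fmt"

// podHugetlbLimitBytes chooses the hugetlb limit (in bytes) to program on the
// pod cgroup for one page size: a pod-level limit wins; otherwise fall back to
// the aggregate of container limits (an unset hugetlb limit behaves like 0).
func podHugetlbLimitBytes(podLimit *int64, containerLimits []int64) int64 {
	if podLimit != nil {
		return *podLimit
	}
	var total int64
	for _, l := range containerLimits {
		total += l
	}
	return total
}

func main() {
	twoMi := int64(2 << 20)
	containerLimits := []int64{4 * twoMi, 2 * twoMi} // containers limited to 8Mi and 4Mi of hugepages-2Mi

	fmt.Println(podHugetlbLimitBytes(nil, containerLimits)) // no pod-level limit: aggregate 12Mi

	podLimit := 16 * twoMi
	fmt.Println(podHugetlbLimitBytes(&podLimit, containerLimits)) // pod-level limit wins: 16Mi
}
```
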
#### Memory Manager

With the introduction of pod-level resource specifications, the Kubernetes Memory
Manager will evolve to track and enforce resource limits at both the pod and
container levels. It will need to aggregate memory usage across all containers
within a pod to calculate the pod's total memory consumption. The Memory Manager
will then enforce the pod-level limit as the hard cap for the entire pod's memory
usage, preventing it from exceeding the allocated amount. While still
maintaining container-level limit enforcement, the Memory Manager will need to
coordinate with the Kubelet and eviction manager to make decisions about pod
eviction or individual container termination when the pod-level limit is
breached.

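The aggregation-and-enforcement idea above can be illustrated with a small sketch; the types and helper below are hypothetical and do not reflect the Memory Manager's actual interfaces:

```
package main

import "fmt"

type containerMemUsage struct {
	name       string
	usageBytes int64
}

// podExceedsMemoryLimit aggregates container memory usage and reports whether
// the pod-level limit has been breached, which is the signal for the kubelet
// and eviction manager to decide between pod eviction and container termination.
func podExceedsMemoryLimit(containers []containerMemUsage, podLimitBytes int64) (int64, bool) {
	var total int64
	for _, c := range containers {
		total += c.usageBytes
	}
	return total, total > podLimitBytes
}

func main() {
	usage := []containerMemUsage{{"c1", 120 << 20}, {"c2", 100 << 20}}
	total, exceeded := podExceedsMemoryLimit(usage, 200<<20)
	fmt.Printf("total=%dMi exceeded=%v\n", total>>20, exceeded)
}
```
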
#### CPU Manager

With the introduction of pod-level resource specifications, the CPU manager in
Kubernetes will adapt to manage CPU requests and limits at the pod level rather
than solely at the container level. This change means that the CPU manager will
allocate and enforce CPU resources based on the total requirements of the entire
pod, allowing for more flexible and efficient CPU utilization across all
containers within a pod. The CPU manager will need to ensure that the aggregate
CPU usage of all containers in a pod does not exceed the pod-level limits.

The CPU Manager policies are container-level configurations that control the
fine-grained allocation of CPU resources to containers. While CPU manager
policies will operate within the constraints of pod-level resource limits, they
do not directly apply at the pod level.

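One way to picture the pod-level enforcement described above: the pod cgroup's CFS quota would be derived from the pod-level CPU limit rather than from summed container limits. The conversion below mirrors the usual millicores-to-quota arithmetic and is only an illustrative sketch (constants and helper name are assumptions, not taken from this KEP):

```
package main

import "fmt"

const (
	cfsPeriodUs = 100000 // default CFS period: 100ms
	minQuotaUs  = 1000   // minimum quota accepted by the kernel
)

// milliCPUToQuota converts a CPU limit in millicores into a CFS quota (in
// microseconds per period) suitable for the pod-level cgroup.
func milliCPUToQuota(milliCPU int64) int64 {
	if milliCPU == 0 {
		return -1 // no limit: unlimited quota
	}
	quota := milliCPU * cfsPeriodUs / 1000
	if quota < minQuotaUs {
		quota = minQuotaUs
	}
	return quota
}

func main() {
	// A pod-level limit of cpu: 200m maps to a 20ms quota per 100ms period on
	// the pod cgroup, capping the aggregate CPU of all containers in the pod.
	fmt.Println(milliCPUToQuota(200))
}
```
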
#### In-Place Pod Resize

##### API changes

In-place pod resize (IPPR) for pod-level resources requires extending `PodStatus`
to include pod-level resource fields, as detailed in the
[PodStatus API changes](#podstatus-api-changes) section.

##### Resize Restart Policy

Pod-level resize policy is not supported in the alpha stage of the pod-level resources
feature. While a pod-level resize policy might be beneficial for VM-based runtimes
like Kata Containers (potentially allowing the hypervisor to restart the entire VM
on resize), this is a topic for future consideration. We plan to engage with the
Kata community to discuss this further and will re-evaluate the need for a pod-level
policy in subsequent development stages.

The absence of a pod-level resize policy means that container restarts are
exclusively managed by their individual `resizePolicy` configurations. The example below of
a pod with pod-level resources demonstrates several key aspects of this behavior,
showing how containers without explicit limits (which inherit pod-level limits) interact
with resize policy, and how containers with specified resources remain unaffected by
pod-level resizes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-resources
spec:
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 200m
      memory: 200Mi
  containers:
  - name: c1
    image: registry.k8s.io/pause:latest
    resizePolicy:
    - resourceName: "cpu"
      restartPolicy: "NotRequired"
    - resourceName: "memory"
      restartPolicy: "RestartRequired"
  - name: c2
    image: registry.k8s.io/pause:latest
    resources:
      requests:
        cpu: 50m
        memory: 50Mi
      limits:
        cpu: 100m
        memory: 100Mi
    resizePolicy:
    - resourceName: "cpu"
      restartPolicy: "NotRequired"
    - resourceName: "memory"
      restartPolicy: "RestartRequired"
```

In this example:
* CPU resizes: Neither container requires a restart for CPU resizes, so CPU resizes
  at either the container or the pod level will not trigger any restarts.
* Container c1 (inherited memory limit): c1 does not define any container-level
  resources, so the effective memory limit of the container is determined by the
  pod-level limit. When the pod's limit is resized, c1's effective memory limit
  changes. Because c1's memory resizePolicy is RestartRequired, a resize of the
  pod-level memory limit will trigger a restart of container c1.
* Container c2 (specified memory limit): c2 does define container-level resources,
  so the effective memory limit of c2 is the container-level limit. Therefore, a
  resize of the pod-level memory limit doesn't change the effective container limit,
  and c2 is not restarted when the pod-level memory limit is resized.

##### Implementation Details

###### Allocating Pod-level Resources

Allocation of pod-level resources will work the same way as for container-level
resources. The allocated-resources checkpoint will be extended to include pod-level
resources, and the pod object will be updated with the allocated resources in the
pod sync loop.

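To make the checkpoint extension concrete, here is a hypothetical sketch of what an allocation checkpoint entry extended with pod-level resources might look like; the type and field names are illustrative assumptions, not the kubelet's actual checkpoint format:

```
package main

import "fmt"

// ResourceList is a simplified stand-in for v1.ResourceList.
type ResourceList map[string]string

// PodAllocationEntry sketches one checkpointed pod: the existing per-container
// allocations plus the new pod-level requests accepted at (re)admission.
type PodAllocationEntry struct {
	Containers   map[string]ResourceList `json:"containers"`
	PodResources ResourceList            `json:"podResources,omitempty"`
}

func main() {
	entry := PodAllocationEntry{
		Containers:   map[string]ResourceList{"c1": {"cpu": "50m", "memory": "50Mi"}},
		PodResources: ResourceList{"cpu": "100m", "memory": "100Mi"},
	}
	fmt.Printf("%+v\n", entry)
}
```
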
###### Actuating Pod-level Resource Resize

A dirty bit for pod-level resources will be added to the allocation checkpoint to
signal the need for resize. This change, along with the existing tracking and
triggering mechanisms, is sufficient for pod-level resize actuation. The same
ordering rules for pod and container resource resizing will be applied for each
resource as needed:
1. Increase pod-level cgroup (if needed)
2. Decrease container resources
3. Decrease pod-level cgroup (if needed)
4. Increase container resources

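The ordering rules above can be sketched as follows; the helper functions are hypothetical stand-ins for the kubelet's pod-cgroup and container update paths:

```
package main

// resize captures current and desired values for one resource at the pod and
// container level (container values shown as an aggregate for brevity).
type resize struct {
	currentPodLimit, desiredPodLimit         int64
	currentContainerSum, desiredContainerSum int64
}

func setPodCgroupLimit(limit int64) { /* write the pod-level cgroup limit */ }
func setContainerLimits(sum int64)  { /* resize the container cgroups */ }

// applyResize grows limits top-down and shrinks them bottom-up, so the pod
// cgroup is never smaller than what its containers are allowed to use.
func applyResize(r resize) {
	if r.desiredPodLimit > r.currentPodLimit { // 1. increase pod-level cgroup (if needed)
		setPodCgroupLimit(r.desiredPodLimit)
	}
	if r.desiredContainerSum < r.currentContainerSum { // 2. decrease container resources
		setContainerLimits(r.desiredContainerSum)
	}
	if r.desiredPodLimit < r.currentPodLimit { // 3. decrease pod-level cgroup (if needed)
		setPodCgroupLimit(r.desiredPodLimit)
	}
	if r.desiredContainerSum > r.currentContainerSum { // 4. increase container resources
		setContainerLimits(r.desiredContainerSum)
	}
}

func main() {
	// Example: the pod-level limit grows from 200Mi to 300Mi while the
	// aggregate container limit shrinks from 180Mi to 150Mi.
	applyResize(resize{
		currentPodLimit: 200 << 20, desiredPodLimit: 300 << 20,
		currentContainerSum: 180 << 20, desiredContainerSum: 150 << 20,
	})
}
```
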
###### Tracking Actual Pod-level Resources

To accurately track actual pod-level resources during in-place pod resizing, several
changes are required:

1. Configuration reading: Pod-level resource configuration is currently read only as
   part of the resize flow, before the resize is applied. It will also need to be read
   during pod creation and, critically, read again after the resize operation so that
   the updated resource values are captured.

2. Pod Status Update: Because the pod status is updated before the resize takes
   effect, the status will not immediately reflect the new resource values. If a
   container within the pod is also being resized, the container resize operation
   will trigger a pod synchronization (pod-sync), which will refresh the pod's
   status. However, if only pod-level resources are being resized, a pod-sync must
   be explicitly triggered to update the pod status with the new resource
   allocation.

3. [Scoped for Beta] Caching: Actual pod resource data may be cached to minimize API
   server load. This cache, if implemented, must be invalidated after each successful
   pod resize to ensure that subsequent reads retrieve the latest information. The
   need for and implementation of this caching mechanism will be evaluated in the
   beta phase. Performance benchmarking will be conducted to determine if caching is
   required and, if so, what caching strategy is most appropriate.

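As a concrete illustration of the configuration-reading step, the kubelet would re-read the pod-level cgroup configuration after a resize to capture the actual values. A simplified cgroup v2 example is sketched below; the path and helper name are assumptions, not kubelet code:

```
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readPodMemoryLimit returns the memory limit currently programmed on the pod
// cgroup, or -1 if the limit is "max" (unlimited).
func readPodMemoryLimit(podCgroupDir string) (int64, error) {
	raw, err := os.ReadFile(filepath.Join(podCgroupDir, "memory.max"))
	if err != nil {
		return 0, err
	}
	val := strings.TrimSpace(string(raw))
	if val == "max" {
		return -1, nil
	}
	return strconv.ParseInt(val, 10, 64)
}

func main() {
	// Hypothetical pod cgroup path for the systemd cgroup driver.
	limit, err := readPodMemoryLimit("/sys/fs/cgroup/kubepods.slice/kubepods-pod1234.slice")
	fmt.Println(limit, err)
}
```
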
**Note for future enhancements for Ephemeral containers with pod-level resources and
IPPR**

Currently, assigning `resources` to ephemeral containers is disallowed because pod
resource allocations were immutable before the In-Place Pod Resizing feature. With
the introduction of in-place pod resizing, users will gain more flexibility:

* Adjust pod-level resources to accommodate the needs of ephemeral containers. This
allows for a more dynamic allocation of resources within the pod.
* Specify resource requests and limits directly for ephemeral containers. Kubernetes will
then automatically resize the pod to ensure sufficient resources are available
for both regular and ephemeral containers.

#### [Scoped for Beta] Topology Manager

Note: This section includes only a high-level overview; design details will be added in the Beta stage.

* The pod-level scope for topology alignment will consider pod-level requests and limits instead of container-level aggregates.
* The hint providers will consider pod-level requests and limits instead of
container-level aggregates.

#### [Scoped for Beta] User Experience Survey

Before promoting the feature to Beta, we plan to conduct a UX survey to
@@ -1291,85 +1495,6 @@ KEPs. The first change doesn’t present any user visible change, and if
implemented, will in a small way reduce the effort for both of those KEPs by
providing a single place to update the pod resource calculation.

-#### [Scoped for Beta] HugeTLB cgroup

-Note: This section includes only high level overview; Design details will be added in Beta stage.

-To support pod-level resource specifications for hugepages, Kubernetes will need to adjust how it handles hugetlb cgroups. Unlike memory, where an unset limit
-means unlimited, an unset hugetlb limit is the same as setting it to 0.

-With the proposed changes, hugepages-2Mi and hugepages-1Gi will be added to the pod-level resources section, alongside CPU and memory. The hugetlb cgroup for the
-pod will then directly reflect the pod-level hugepage limits, rather than using an aggregated value from container limits. When scheduling, the scheduler will
-consider hugepage requests at the pod level to find nodes with enough available resources.

-#### [Scoped for Beta] Topology Manager

-Note: This section includes only high level overview; Design details will be added in Beta stage.

-* (Tentative) Only pod level scope for topology alignment will be supported if pod level requests and limits are specified without container-level requests and limits.
-* The pod level scope for topology alignment will consider pod level requests and limits instead of container level aggregates.
-* The hint providers will consider pod level requests and limits instead of container level aggregates.

-#### [Scoped for Beta] Memory Manager

-Note: This section includes only high level overview; Design details will be
-added in Beta stage.

-With the introduction of pod-level resource specifications, the Kubernetes Memory
-Manager will evolve to track and enforce resource limits at both the pod and
-container levels. It will need to aggregate memory usage across all containers
-within a pod to calculate the pod's total memory consumption. The Memory Manager
-will then enforce the pod-level limit as the hard cap for the entire pod's memory
-usage, preventing it from exceeding the allocated amount. While still
-maintaining container-level limit enforcement, the Memory Manager will need to
-coordinate with the Kubelet and eviction manager to make decisions about pod
-eviction or individual container termination when the pod-level limit is
-breached.

-#### [Scoped for Beta] CPU Manager

-Note: This section includes only high level overview; Design details will be
-added in Beta stage.

-With the introduction of pod-level resource specifications, the CPU manager in
-Kubernetes will adapt to manage CPU requests and limits at the pod level rather
-than solely at the container level. This change means that the CPU manager will
-allocate and enforce CPU resources based on the total requirements of the entire
-pod, allowing for more flexible and efficient CPU utilization across all
-containers within a pod. The CPU manager will need to ensure that the aggregate
-CPU usage of all containers in a pod does not exceed the pod-level limits.

-#### [Scoped for Beta] In-Place Pod Resize

-In-Place Pod resizing of resources is not supported in alpha stage of Pod-level
-resources feature. **Users should avoid using in-place pod resizing if they are
-utilizing pod-level resources.**

-In version 1.33, the In-Place Pod resize functionality will be controlled by a
-separate feature gate and introduced as an independent alpha feature. This is
-necessary as it involves new fields in the PodStatus at the pod level.

-Note for design & implementation: Previously, assigning resources to ephemeral
-containers wasn't allowed because pod resource allocations were immutable. With
-the introduction of in-place pod resizing, users will gain more flexibility:

-* Adjust pod-level resources to accommodate the needs of ephemeral containers. This
-allows for a more dynamic allocation of resources within the pod.
-* Specify resource requests and limits directly for ephemeral containers. Kubernetes will
-then automatically resize the pod to ensure sufficient resources are available
-for both regular and ephemeral containers.

-Currently, setting `resources` for ephemeral containers is disallowed as pod
-resource allocations were immutable before In-Place Pod Resizing feature. With
-in-place pod resize for pod-level resource allocation, users should be able to
-either modify the pod-level resources to accommodate ephemeral containers or
-supply resources at container-level for ephemeral containers and kubernetes will
-resize the pod to accommodate the ephemeral containers.

#### [Scoped for Beta] VPA
TBD. Do not review for the alpha stage.

keps/sig-node/2837-pod-level-resource-spec/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -26,11 +26,11 @@ stage: alpha
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
-latest-milestone: "v1.32"
+latest-milestone: "v1.33"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
-  alpha: "v1.32"
+  alpha: "v1.33"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled

0 commit comments
