308 changes: 222 additions & 86 deletions keps/sig-node/2837-pod-level-resource-spec/README.md
- [Components/Features changes](#componentsfeatures-changes)
- [Cgroup Structure Remains unchanged](#cgroup-structure-remains-unchanged)
- [PodSpec API changes](#podspec-api-changes)
- [PodStatus API changes](#podstatus-api-changes)
- [PodSpec Validation Rules](#podspec-validation-rules)
- [Proposed Validation & Defaulting Rules](#proposed-validation--defaulting-rules)
- [Comprehensive Tabular View](#comprehensive-tabular-view)
- [Admission Controller](#admission-controller)
- [Eviction Manager](#eviction-manager)
- [Pod Overhead](#pod-overhead)
- [Hugepages](#hugepages)
- [Memory Manager](#memory-manager)
- [In-Place Pod Resize](#in-place-pod-resize)
- [API changes](#api-changes)
- [Resize Restart Policy](#resize-restart-policy)
- [Implementation Details](#implementation-details)
- [[Scoped for Beta] CPU Manager](#scoped-for-beta-cpu-manager)
- [[Scoped for Beta] Topology Manager](#scoped-for-beta-topology-manager)
- [[Scoped for Beta] User Experience Survey](#scoped-for-beta-user-experience-survey)
- [[Scoped for Beta] Surfacing Pod Resource Requirements](#scoped-for-beta-surfacing-pod-resource-requirements)
- [The Challenge of Determining Effective Pod Resource Requirements](#the-challenge-of-determining-effective-pod-resource-requirements)
- [Goals of surfacing Pod Resource Requirements](#goals-of-surfacing-pod-resource-requirements)
- [Implementation Details](#implementation-details)
- [Implementation Details](#implementation-details-1)
- [Notes for implementation](#notes-for-implementation)
- [[Scoped for Beta] HugeTLB cgroup](#scoped-for-beta-hugetlb-cgroup)
- [[Scoped for Beta] Topology Manager](#scoped-for-beta-topology-manager)
- [[Scoped for Beta] Memory Manager](#scoped-for-beta-memory-manager)
- [[Scoped for Beta] CPU Manager](#scoped-for-beta-cpu-manager)
- [[Scoped for Beta] In-Place Pod Resize](#scoped-for-beta-in-place-pod-resize)
- [[Scoped for Beta] VPA](#scoped-for-beta-vpa)
- [[Scoped for Beta] Cluster Autoscaler](#scoped-for-beta-cluster-autoscaler)
- [[Scoped for Beta] Support for Windows](#scoped-for-beta-support-for-windows)
@@ -383,7 +387,7 @@ consumption of the pod.

#### PodSpec API changes

New field in `PodSpec`
New field in `PodSpec`:

```
type PodSpec struct {
  ...
  // Resources holds the pod-level resource requests and limits (the field added by
  // this KEP; it uses the same ResourceRequirements type as container resources).
  Resources *ResourceRequirements
  ...
}
```
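
For illustration, a minimal pod manifest using the new pod-level `resources` field could
look like the following (this is a sketch of the proposed API, not output from an
existing cluster; the image and values are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-resources-example
spec:
  resources:               # pod-level requests/limits (PodLevelResources feature gate)
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    # No container-level resources: this container shares, and is bounded by,
    # the pod-level requests and limits above.
```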

#### PodStatus API changes

Extend `PodStatus` to include pod-level analogs of the container-level status resource
fields. Pod-level resource information in `PodStatus` is essential for pod-level
[In-Place Pod Update](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/1287-in-place-update-pod-resources/README.md#api-changes),
as it provides a way to track, report and use the actual resource allocation for the
pod, both before and after a resize operation.

```
type PodStatus struct {
  ...
  // Resources represents the compute resource requests and limits that have been
  // applied at the pod level. If pod-level resources are not explicitly specified,
  // then these will be the aggregate resources computed from containers. If limits are
  // not defined for all containers (and pod-level limits are also not set), those
  // containers remain unrestricted, and no aggregate pod-level limits will be applied.
  // Pod-level limit aggregation is only performed, and is meaningful only, when all
  // containers have defined limits.
  // +featureGate=InPlacePodVerticalScaling
  // +featureGate=PodLevelResources
  // +optional
  Resources *ResourceRequirements

  // AllocatedResources is the total requests allocated for this pod by the node.
  // Kubelet sets this to the accepted requests when a pod (or resize) is admitted.
  // If pod-level requests are not set, this will be the total requests aggregated
  // across containers in the pod.
  // +featureGate=InPlacePodVerticalScaling
  // +featureGate=PodLevelResources
  // +optional
  AllocatedResources ResourceList
}
```

Review discussion on these fields:

> **Reviewer (Member):** Are only the resources supported in pod-level `spec.resources` (cpu, memory, and now hugepages) aggregated here, or are other custom resources specified in the containers aggregated here? (As an aside, I think the pod-level resource validation errors if container-level resources are specified which are not included in pod-level resources at all: https://github.com/kubernetes/kubernetes/blob/ee22760391bae28954a69dff499d1cead9a9fcf0/pkg/apis/core/validation/validation.go#L4340-L4356.) What happens if pod-level `spec.resources` sets a pod-level cpu limit, but not a memory limit, and individual containers all set memory limits? Does this include the pod-level cpu limit and the aggregated container memory limits?
>
> **Reviewer (Member):** Good question. For resources that get configured on the pod-level cgroup, this should report the actual values applied there. For everything else, I'm not sure. Do pod-level extended resources make sense today?
>
> **Reviewer (Member):** DRA: I think pod-level GPUs could make sense, and pod-level network interfaces are the ONLY real way to do network.
>
> **Reviewer (Member):** For those extended resources, this is still an open question. Luckily we can address this in later releases. Not a blocker for 1.33.
>
> **Reviewer (Contributor):** Ack. I think this should be stated in the non-goals if it is not already there.
>
> **KEP author:** It is stated in the non-goals section that only CPU, memory and hugepages are supported for now: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md#non-goals. Also, as Jordan pointed out, there's a bug in the validation logic: if container-level resources are set for an unsupported resource type, validation will error out because the aggregated container requests will be greater than the pod requests (as pod-level resources won't be set for unsupported resources): https://github.com/kubernetes/kubernetes/blob/ee22760391bae28954a69dff499d1cead9a9fcf0/pkg/apis/core/validation/validation.go#L4340-L4356. I will fix the bug. Thanks @liggitt for finding the bug.
>
> **Reviewer (Member):** Using `ResourceRequirements` for this mirrors the container-level field, but I wish we had gone with a custom type for that. We don't yet use ResourceClaims. Should we mirror the container type, or create a new type without resource claims?
>
> **KEP author:** We used the same type, i.e. `ResourceRequirements`, for `Resources` in `PodSpec` as well.
>
> **Reviewer (Member):** Other than duplication, what would be the disadvantage of de-duplicating types? I really dislike when we have fields in the API but they can't be used.
>
> **KEP author:** Do we want to add a new type, `ResourceConstraints` or `ResourceRequestsLimits`?
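
As a non-normative sketch of the proposed status fields, a pod whose pod-level requests
and limits were admitted unchanged might report something like the following (exact
serialization and the aggregation rules for unspecified values follow the description
above; all values here are illustrative):

```yaml
status:
  # Proposed field: actual pod-level resources applied to the pod cgroup.
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
  # Proposed field: total requests accepted by the kubelet when the pod (or a resize)
  # was admitted.
  allocatedResources:
    cpu: 200m
    memory: 256Mi
```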
#### PodSpec Validation Rules

##### Proposed Validation & Defaulting Rules
@@ -1172,6 +1210,183 @@ back to aggregating container requests.
size of the pod's cgroup. This means the pod cgroup's resource limits will be
set to accommodate both pod-level requests and pod overhead.

#### Hugepages

With the proposed changes, support for Linux hugepages (resources with the prefix
`hugepages-`) will be extended to the pod-level resources specification, alongside CPU
and memory. The hugetlb cgroup for the pod will then directly reflect the pod-level
hugepage limits, if specified, rather than using an aggregated value from container
limits. When scheduling, the scheduler will consider hugepage requests at the pod level
to find nodes with enough available resources.

Containers will still need to mount an `emptyDir` volume with `medium: HugePages` (or
`medium: HugePages-<size>`) to access huge pages. Despite the `emptyDir` name, such a
volume is backed by the hugetlb filesystem (hugetlbfs) rather than an ordinary empty
directory, and is typically mounted inside the container at a path such as
`/dev/hugepages`. This is the standard way for containers to interact with huge pages,
and it does not change.
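
A sketch of what this could look like once pod-level hugepage limits are supported as
proposed (resource names follow the existing `hugepages-<size>` convention; all values
are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  resources:                       # pod-level resources (proposed hugepages support)
    limits:
      cpu: "1"
      memory: 512Mi
      hugepages-2Mi: 256Mi
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    volumeMounts:
    - name: hugepage-2mi
      mountPath: /dev/hugepages    # hugetlbfs mount visible inside the container
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi        # backed by hugetlbfs, not an ordinary empty directory
```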


#### Memory Manager

With the introduction of pod-level resource specifications, the Kubernetes Memory
Manager will evolve to track and enforce resource limits at both the pod and
container levels. It will need to aggregate memory usage across all containers
within a pod to calculate the pod's total memory consumption. The Memory Manager
will then enforce the pod-level limit as the hard cap for the entire pod's memory
usage, preventing it from exceeding the allocated amount. While still
maintaining container-level limit enforcement, the Memory Manager will need to
coordinate with the Kubelet and eviction manager to make decisions about pod
eviction or individual container termination when the pod-level limit is
breached.

#### In-Place Pod Resize

##### API changes

IPPR for pod-level resources requires extending `PodStatus` to include pod-level
resource fields, as detailed in the [PodStatus API changes](#podstatus-api-changes)
section.

##### Resize Restart Policy

Pod-level resize policy is not supported in the alpha stage of the pod-level resources
feature. While a pod-level resize policy might be beneficial for VM-based runtimes
like Kata Containers (potentially allowing the hypervisor to restart the entire VM
on resize), this is a topic for future consideration. We plan to engage with the
Kata community to discuss this further and will re-evaluate the need for a pod-level
policy in subsequent development stages.

The absence of a pod-level resize policy means that container restarts are
exclusively managed by their individual `resizePolicy` configs. The example below of
a pod with pod-level resources demonstrates several key aspects of this behavior,
showing how containers without explicit limits (which inherit pod-level limits) interact
with resize policy, and how containers with specified resources remain unaffected by
pod-level resizes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-resources
spec:
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 200m
      memory: 200Mi
  containers:
  - name: c1
    image: registry.k8s.io/pause:3.9
    resizePolicy:
    - resourceName: "cpu"
      restartPolicy: "NotRequired"
    - resourceName: "memory"
      restartPolicy: "RestartRequired"
  - name: c2
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: 50m
        memory: 50Mi
      limits:
        cpu: 100m
        memory: 100Mi
    resizePolicy:
    - resourceName: "cpu"
      restartPolicy: "NotRequired"
    - resourceName: "memory"
      restartPolicy: "RestartRequired"
```

In this example:
* CPU resizes: Neither container requires a restart for CPU resizes, and therefore CPU resizes at neither the container nor pod level will trigger any restarts.
* Container c1 (inherited memory limit): c1 does not define any container level
resources, so the effective memory limit of the container is determined by the
pod-level limit. When the pod's limit is resized, c1's effective memory limit
changes. Because c1's memory resizePolicy is RestartRequired, a resize of the
pod-level memory limit will trigger a restart of container c1.
* Container c2 (specified memory limit): c2 does define container-level resources,
so the effective memory limit of c2 is the container level limit. Therefore, a
resize of the pod-level memory limit doesn't change the effective container limit,
so c2 is not restarted when the pod-level memory limit is resized.
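
For example (illustrative only, and assuming the pod `resize` subresource from the
in-place resize feature supports pod-level resources as proposed), resizing just the
pod-level memory limit of the pod above could be expressed as
`kubectl patch pod pod-level-resources --subresource resize --patch-file resize.yaml`
with a patch body such as:

```yaml
# resize.yaml -- illustrative patch body; only the pod-level memory limit changes.
spec:
  resources:
    limits:
      memory: 300Mi   # was 200Mi; restarts c1 (memory policy RestartRequired) because
                      # its effective limit is inherited from the pod; c2 is unaffected
```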

##### Implementation Details

###### Allocating Pod-level Resources
Allocation of pod-level resources will work the same as container-level resources. The allocated resources checkpoint will be extended to include pod-level resources, and the pod object will be updated with the allocated resources in the pod sync loop.

###### Actuating Pod-level Resource Resize
> **Reviewer note (Member):** For the record, I think we should probably be periodically asserting the "correct" size for pod resources, just as I think we should for container resources. No action needed here, but when we solve one, solve both.

The mechanism for actuating pod-level resize remains largely unchanged from the
existing container-level resize process. When pod-level resource configurations are
applied, the system handles the resize in a similar manner as it does for
container-level resources. This includes extending the existing logic to incorporate
directly configured pod-level resource settings.

The same ordering rules for pod and container resource resizing will be applied for each
resource as needed:
1. Increase pod-level cgroup (if needed)
2. Decrease container resources
3. Decrease pod-level cgroup (if needed)
4. Increase container resources
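
A concrete illustration of why this ordering matters (all values invented):

```
pod-level memory limit:    200Mi -> 300Mi
container c1 memory limit: 100Mi -> 250Mi

Both changes are increases, so only steps 1 and 4 apply, in this order:
  1. pod cgroup   memory.max: 200Mi -> 300Mi   (grow the pod first so the container fits)
  4. container c1 memory.max: 100Mi -> 250Mi

For decreases, containers are shrunk (step 2) before the pod cgroup (step 3), so the pod
cgroup limit never drops below what its containers are still allowed to use.
```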

###### Tracking Actual Pod-level Resources
> **Reviewer (Member):** @tallclair Given the discussion on how NRI plugins or systemd can mutate the resources (e.g. rounding), what happens when a user specifies a pod with 2 containers, each using 5m cpu, and the pod is 10m, and an NRI plugin mutates the containers and rounds them up to 10m each? Are we smart enough to increase the pod?
>
> **@tallclair (Member):** Hmm, we don't today, but we could. We are reading the actual values from the runtime, so we could compute the pod-level cgroups based on the sum of those instead of the allocated resources (or whichever is larger). We could even compute the diff with what we asked for, and add that to the pod.
>
> **Reviewer (Contributor):** Hmm... but if NRI plugins change the value to be completely different, that'd just conflict with how the kubelet manages the cgroups. We can simply grab the values from the runtime and assume those are the ones we want. @samuelkarp
>
> **Reviewer (Member):** Agreed with Yuju above. I expect that, with the new resource-management efforts we are doing and planning, users will eventually largely limit their NRI usage.
>
> **Reviewer (Member):** We can continue discussing this, but this isn't a blocker here.

To accurately track actual pod-level resources during in-place pod resizing, several
changes are required that are analogous to the changes made for container-level
in-place resizing:

1. Configuration reading: Pod-level resource config is currently read as part of the
resize flow, but will also need to be read during pod creation. Critically, the
configuration must be read again after the resize operation to capture the
updated resource values. Currently, the configuration is only read before a
resize.

2. Pod Status Update: Because the pod status is updated before the resize takes
effect, the status will not immediately reflect the new resource values. If a
container within the pod is also being resized, the container resize operation
will trigger a pod synchronization (pod-sync), which will refresh the pod's
status. However, if only pod-level resources are being resized, a pod-sync must
be explicitly triggered to update the pod status with the new resource
allocation.

> **Reviewer aside (Member):** We should probably re-evaluate this, outside the context of this KEP. Now that there are things reflected in the status that don't also trigger resync, we're going to need to resync the pod just to write another field to the status. I'm not sure off hand what the consequences of moving the status update to the end of PodSync would be.

3. [Scoped for Beta] Caching: Actual pod resource data may be cached to minimize API server load. This cache, if implemented, must be invalidated after each successful pod resize to ensure that subsequent reads retrieve the latest information. The need for and implementation of this caching mechanism will be evaluated in the beta phase. Performance benchmarking will be conducted to determine if caching is required and, if so, what caching strategy is most appropriate.

**Note on future enhancements for ephemeral containers with pod-level resources and IPPR**
Previously, assigning resources to ephemeral
containers wasn't allowed because pod resource allocations were immutable. With
the introduction of in-place pod resizing, users could gain more flexibility:

* Adjust pod-level resources to accommodate the needs of ephemeral containers. This
allows for a more dynamic allocation of resources within the pod.
* Specify resource requests and limits directly for ephemeral containers. Kubernetes will
then automatically resize the pod to ensure sufficient resources are available
for both regular and ephemeral containers.

Prior to this KEP, setting `resources` for ephemeral containers was disallowed, as pod
resource allocations were immutable before the In-Place Pod Resizing feature. With
in-place resize of pod-level resource allocations, users should be able to either
modify the pod-level resources to accommodate ephemeral containers, or supply
resources at the container level for ephemeral containers and have Kubernetes resize
the pod to accommodate them.

#### [Scoped for Beta] CPU Manager

With the introduction of pod-level resource specifications, the CPU manager in
Kubernetes will adapt to manage CPU requests and limits at the pod level rather
than solely at the container level. This change means that the CPU manager will
allocate and enforce CPU resources based on the total requirements of the entire
pod, allowing for more flexible and efficient CPU utilization across all
containers within a pod. The CPU manager will need to ensure that the aggregate
CPU usage of all containers in a pod does not exceed the pod-level limits.

The CPU Manager policies are container-level configurations that control the
fine-grained allocation of CPU resources to containers. While CPU manager
policies will operate within the constraints of pod-level resource limits, they
do not directly apply at the pod level.

#### [Scoped for Beta] Topology Manager

Note: This section includes only high level overview; Design details will be added in Beta stage.

* The pod level scope for topology alignment will consider pod level requests and limits instead of container level aggregates.

* The hint providers will consider pod level requests and limits instead of
container level aggregates.

#### [Scoped for Beta] User Experience Survey

Before promoting the feature to Beta, we plan to conduct a UX survey to
@@ -1291,85 +1506,6 @@ KEPs. The first change doesn’t present any user visible change, and if
implemented, will in a small way reduce the effort for both of those KEPs by
providing a single place to update the pod resource calculation.

#### [Scoped for Beta] HugeTLB cgroup

Note: This section includes only high level overview; Design details will be added in Beta stage.

To support pod-level resource specifications for hugepages, Kubernetes will need to adjust how it handles hugetlb cgroups. Unlike memory, where an unset limit
means unlimited, an unset hugetlb limit is the same as setting it to 0.

With the proposed changes, hugepages-2Mi and hugepages-1Gi will be added to the pod-level resources section, alongside CPU and memory. The hugetlb cgroup for the
pod will then directly reflect the pod-level hugepage limits, rather than using an aggregated value from container limits. When scheduling, the scheduler will
consider hugepage requests at the pod level to find nodes with enough available resources.


#### [Scoped for Beta] Topology Manager

Note: This section includes only high level overview; Design details will be added in Beta stage.


* (Tentative) Only pod level scope for topology alignment will be supported if pod level requests and limits are specified without container-level requests and limits.
* The pod level scope for topology alignment will consider pod level requests and limits instead of container level aggregates.
* The hint providers will consider pod level requests and limits instead of container level aggregates.


#### [Scoped for Beta] Memory Manager

Note: This section includes only high level overview; Design details will be
added in Beta stage.

With the introduction of pod-level resource specifications, the Kubernetes Memory
Manager will evolve to track and enforce resource limits at both the pod and
container levels. It will need to aggregate memory usage across all containers
within a pod to calculate the pod's total memory consumption. The Memory Manager
will then enforce the pod-level limit as the hard cap for the entire pod's memory
usage, preventing it from exceeding the allocated amount. While still
maintaining container-level limit enforcement, the Memory Manager will need to
coordinate with the Kubelet and eviction manager to make decisions about pod
eviction or individual container termination when the pod-level limit is
breached.


#### [Scoped for Beta] CPU Manager

Note: This section includes only high level overview; Design details will be
added in Beta stage.

With the introduction of pod-level resource specifications, the CPU manager in
Kubernetes will adapt to manage CPU requests and limits at the pod level rather
than solely at the container level. This change means that the CPU manager will
allocate and enforce CPU resources based on the total requirements of the entire
pod, allowing for more flexible and efficient CPU utilization across all
containers within a pod. The CPU manager will need to ensure that the aggregate
CPU usage of all containers in a pod does not exceed the pod-level limits.

#### [Scoped for Beta] In-Place Pod Resize

In-Place Pod resizing of resources is not supported in alpha stage of Pod-level
resources feature. **Users should avoid using in-place pod resizing if they are
utilizing pod-level resources.**

In version 1.33, the In-Place Pod resize functionality will be controlled by a
separate feature gate and introduced as an independent alpha feature. This is
necessary as it involves new fields in the PodStatus at the pod level.

Note for design & implementation: Previously, assigning resources to ephemeral
containers wasn't allowed because pod resource allocations were immutable. With
the introduction of in-place pod resizing, users will gain more flexibility:

* Adjust pod-level resources to accommodate the needs of ephemeral containers. This
allows for a more dynamic allocation of resources within the pod.
* Specify resource requests and limits directly for ephemeral containers. Kubernetes will
then automatically resize the pod to ensure sufficient resources are available
for both regular and ephemeral containers.

Currently, setting `resources` for ephemeral containers is disallowed as pod
resource allocations were immutable before In-Place Pod Resizing feature. With
in-place pod resize for pod-level resource allocation, users should be able to
either modify the pod-level resources to accommodate ephemeral containers or
supply resources at container-level for ephemeral containers and kubernetes will
resize the pod to accommodate the ephemeral containers.

#### [Scoped for Beta] VPA

TBD. Do not review for the alpha stage.
4 changes: 2 additions & 2 deletions keps/sig-node/2837-pod-level-resource-spec/kep.yaml
@@ -26,11 +26,11 @@ stage: alpha
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.32"
latest-milestone: "v1.33"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.32"
alpha: "v1.33"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled