KEP-2837: PodLevelResources changes for 1.33 alpha stage #5145
Conversation
/assign @yujuhong @tallclair @thockin @liggitt

/cc
/assign @dchen1107
liggitt left a comment:
did a quick sweep and mostly asked questions about bits I couldn't figure out from the descriptions... I don't have a lot of context on the feature so will mostly defer to @thockin who did the API review on the alpha, I think
// Resources represents the compute resource requests and limits that have been
// applied at the pod level. If pod-level resources are not explicitly specified,
// then these will be the aggregate resources computed from containers. If limits are
// not defined for all containers (and not defined at the pod level) then no aggregate
is this saying
1. limits will not be populated in this field if they're not specified in spec.resources and we can't aggregate them from containers because not all containers specify limits, or
2. limits will be populated here but not applied/enforced at the pod level?
Should be 1, IMO. Aggregating pod-level limits doesn't make sense if not all containers are limited.
Updated the language for clarity
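For readers skimming the thread, here is a minimal Go sketch of interpretation 1 (my illustration with assumed helper names, not the kubelet's actual aggregation code): a pod-level limit is aggregated from containers only when every container declares a limit for that resource.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// aggregateLimits sums container limits per resource, then drops any resource
// that at least one container leaves unlimited, so nothing is reported (or
// enforced) at the pod level for it.
func aggregateLimits(containers []corev1.Container) corev1.ResourceList {
	sums := corev1.ResourceList{}
	counts := map[corev1.ResourceName]int{}
	for _, c := range containers {
		for name, qty := range c.Resources.Limits {
			total := sums[name]
			total.Add(qty)
			sums[name] = total
			counts[name]++
		}
	}
	for name := range sums {
		if counts[name] != len(containers) {
			delete(sums, name) // some container has no limit: no pod-level aggregate
		}
	}
	return sums
}

func main() {
	containers := []corev1.Container{
		{Name: "c1", Resources: corev1.ResourceRequirements{
			Limits: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("500m")},
		}},
		{Name: "c2"}, // declares no limits
	}
	fmt.Println(aggregateLimits(containers)) // map[]: no pod-level limits populated
}
```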
#### Hugepages

With the proposed changes, hugepages-2Mi and hugepages-1Gi will be added to the pod-level resources section, alongside CPU and memory. The hugetlb cgroup for the
pod will then directly reflect the pod-level hugepage limits, rather than using an aggregated value from container limits. When scheduling, the scheduler will
I'm not that familiar with today's behavior... are individual containers not constrained to the hugepages limits they specify, but can steal allocated hugepages resources from a shared space sized to fit the other containers in the pod's hugepages requests?
With the current behavior, the containers are constrained to the hugepages limits they specify. With pod-level resources, we want to enable the containers to use huge pages from a shared huge pages pool in a pod.
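A minimal sketch of what that could look like in a pod spec, assuming the pod-level `spec.resources` field from the existing alpha plus the hugepage support proposed here (field values, image tag, and sizes are illustrative only):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// PodLevelHugepages returns a PodSpec whose 2Mi hugepages are declared once
// at the pod level; the containers would draw from that shared pool rather
// than carrying per-container hugepage limits.
func PodLevelHugepages() corev1.PodSpec {
	return corev1.PodSpec{
		Resources: &corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:                   resource.MustParse("500m"),
				corev1.ResourceMemory:                resource.MustParse("256Mi"),
				corev1.ResourceName("hugepages-2Mi"): resource.MustParse("128Mi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceName("hugepages-2Mi"): resource.MustParse("128Mi"),
			},
		},
		// Neither container declares hugepages; under the proposal they share
		// the pod-level hugetlb allowance.
		Containers: []corev1.Container{
			{Name: "c1", Image: "registry.k8s.io/pause:3.10"},
			{Name: "c2", Image: "registry.k8s.io/pause:3.10"},
		},
	}
}
```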
type PodStatus struct {
  ...
  // Resources represents the compute resource requests and limits that have been
  // applied at the pod level. If pod-level resources are not explicitly specified,
Are only the resources supported in pod-level spec.resources (cpu, memory, and now hugepages...) aggregated here, or are other custom resources specified in the containers aggregated here? (as an aside, I think the pod-level resource validation errors if container-level resources are specified which are not included in pod-level resources at all... https://github.com/kubernetes/kubernetes/blob/ee22760391bae28954a69dff499d1cead9a9fcf0/pkg/apis/core/validation/validation.go#L4340-L4356).
What happens if pod-level spec.resources sets a pod-level cpu limit, but not a memory limit, and individual containers all set memory limits? Does this include the pod-level cpu limit and the aggregated container memory limits?
Good question. For resources that get configured on the pod level cgroup, this should report the actual values applied there. For everything else, I'm not sure. Do pod-level extended resources make sense today?
DRA: I think pod-level GPUs could make sense and pod-level network interfaces are the ONLY real way to do network.
For those extended resources, this is still an open question. Luckily we can address this in later releases. Not a blocker for 1.33.
> For those extended resources, this is still an open question. Luckily we can address this in later releases.
Ack. I think this should be stated in the non-goals if they are not already there.
It is stated in the non-goals section that only CPU, memory and hugepages are supported for now: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md#non-goals
Also, as Jordan pointed out, there's a bug in the validation logic: if container-level resources are set for an unsupported resource type, validation will error out because the aggregated container requests will be greater than the pod requests (since pod-level resources won't be set for unsupported resources): https://github.com/kubernetes/kubernetes/blob/ee22760391bae28954a69dff499d1cead9a9fcf0/pkg/apis/core/validation/validation.go#L4340-L4356
I will fix the bug. Thanks @liggitt for finding it.
kubernetes/kubernetes#130131 is the PR @liggitt
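To make the failure mode concrete, here is a small Go sketch of the problematic check shape (my paraphrase with made-up names, not the actual validation.go code): comparing aggregated container requests against pod-level requests for every resource name penalizes resources that can only appear at the container level, such as extended resources.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// naiveConsistencyCheck mirrors the problematic shape: require aggregated
// container requests <= pod-level requests for every resource name.
func naiveConsistencyCheck(podRequests, aggregatedContainerRequests corev1.ResourceList) []string {
	var errs []string
	for name, containerQty := range aggregatedContainerRequests {
		podQty, ok := podRequests[name]
		if !ok || containerQty.Cmp(podQty) > 0 {
			errs = append(errs, fmt.Sprintf("%s: aggregated container requests exceed pod-level requests", name))
		}
	}
	return errs
}

func main() {
	podRequests := corev1.ResourceList{
		corev1.ResourceCPU: resource.MustParse("1"),
	}
	// A container requests an extended resource; it can never be set at the
	// pod level today, so the naive check always reports an error for it.
	aggregated := corev1.ResourceList{
		corev1.ResourceCPU:                     resource.MustParse("500m"),
		corev1.ResourceName("example.com/gpu"): resource.MustParse("1"),
	}
	fmt.Println(naiveConsistencyCheck(podRequests, aggregated))
}
```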
ffromani left a comment:
Initial review; there are some topics I'd like to see elaborated further, but they are scoped for beta, and in 1.33 we want to have another alpha IIRC, so we're good.
// pod-level limits will be applied.
// +featureGate=InPlacePodVerticalScaling
// +optional
Resources *ResourceRequirements
Using ResourceRequirements for this mirrors the container level field, but I wish we had gone with a custom type for that. We don't yet use ResourceClaims. Should we mirror the container type, or create a new type without resource claims?
We used the same type, i.e. ResourceRequirements, for Resources in PodSpec as well.
Other than duplication, what would be the disadvantage of de-duplicating types? I really dislike when we have fields in the API but they can't be used.
Do we want to add a new type ResourceConstraints or ResourceRequestsLimits?
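For concreteness, here is a hypothetical sketch of such a type (ResourceConstraints and ResourceRequestsLimits are only the candidate names floated above; nothing here is decided):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// ResourceRequestsLimits is a hypothetical pod-level resources type that
// carries only requests and limits, omitting the Claims field that
// ResourceRequirements exposes for containers but that pods cannot use yet.
type ResourceRequestsLimits struct {
	// Limits describes the maximum amount of compute resources allowed at the pod level.
	// +optional
	Limits corev1.ResourceList `json:"limits,omitempty"`
	// Requests describes the minimum amount of compute resources required at the pod level.
	// +optional
	Requests corev1.ResourceList `json:"requests,omitempty"`
}
```

The trade-off discussed above is duplicating an almost-identical type versus carrying a Claims field that cannot be set at the pod level.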
updated resource values. Currently, the configuration is only read before a
resize.

2. Pod Status Update: Because the pod status is updated before the resize takes
Aside: We should probably re-evaluate this, outside the context of this KEP. Now that there are things reflected in the status that don't also trigger resync, we're going to need to resync the pod just to write another field to the status. I'm not sure off hand what the consequences of moving the status update to the end of PodSync would be.
thockin left a comment:
API-wise this seems consistent with in-place resize, so all my concerns there apply here.
API:
/lgtm
###### Allocating Pod-level Resources
Allocation of pod-level resources will work the same as container-level resources. The allocated resources checkpoint will be extended to include pod-level resources, and the pod object will be updated with the allocated resources in the pod sync loop.

###### Actuating Pod-level Resource Resize
For the record, I think we should probably be periodically asserting the "correct" size for pod resources, just as I think we should for container resources. No action needed here, but when we solve one, solve both.
4. Increase container resources

###### Tracking Actual Pod-level Resources
To accurately track actual pod-level resources during in-place pod resizing, several
@tallclair Given the discussion on how NRI plugins or systemd can mutate the resources (e.g. rounding), what happens when:
- User specifies a pod with 2 containers, each using 5m cpu, and the pod is 10m
- An NRI plugin mutates the containers and rounds them up to 10m each

Are we smart enough to increase the pod?
Hmm, we don't today, but we could. We are reading the actual values from the runtime, so we could compute the pod-level cgroups based on the sum of those instead of the allocated resources (or whichever is larger). We could even compute the diff with what we asked for, and add that to the pod.
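A tiny sketch of that idea (assumed function name; not current kubelet behavior): size the pod-level value from whichever is larger, the allocated pod resources or the sum of the actual container values read back from the runtime.

```go
package main

import "fmt"

// desiredPodCPUQuota picks max(allocated, sum of actual container values),
// so runtime-side mutations (e.g. an NRI plugin rounding containers up)
// do not leave the pod cgroup smaller than its children.
func desiredPodCPUQuota(allocatedMilliCPU int64, actualContainerMilliCPU []int64) int64 {
	var sum int64
	for _, v := range actualContainerMilliCPU {
		sum += v
	}
	if sum > allocatedMilliCPU {
		return sum
	}
	return allocatedMilliCPU
}

func main() {
	// The example from the question: two 5m containers rounded up to 10m each,
	// with the pod originally allocated 10m.
	fmt.Println(desiredPodCPUQuota(10, []int64{10, 10})) // 20
}
```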
Hmm... but if NRI plugins change the value to be completely different, that'd just conflict with how kubelet manages the cgroups. We can simply grab the values from the and assume those are the ones we want.
Agreed with Yuju above.
I expect that, with the new resource management efforts we are doing and planning, users will eventually largely limit their NRI usage.
We can continue discussing this, but this isn't a blocker here.
#### Hugepages

With the proposed changes, support for hugepages(with prefix hugepages-*) will be extended to the pod-level resources specifications, alongside CPU and memory. The hugetlb cgroup for the
Suggested change:
- With the proposed changes, support for hugepages(with prefix hugepages-*) will be extended to the pod-level resources specifications, alongside CPU and memory. The hugetlb cgroup for the
+ With the proposed changes, support for Linux hugepages (resources with prefix `hugepages-`) will be extended to the pod-level resources specifications, alongside CPU and memory. The hugetlb cgroup for the

?
consider hugepage requests at the pod level to find nodes with enough available
resources.

Containers will still need to mount an emptyDir volume to access the huge page filesystem (typically /dev/hugepages). This is the standard way for containers to interact with huge pages, and this will not change.
This sounds confusing; /dev/hugepages isn't normally present within an empty directory (per emptyDir). Maybe clarify what we mean here.
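For context, the pattern the KEP text refers to is an emptyDir volume backed by the HugePages medium; a minimal sketch of that existing pattern (the /dev/hugepages mount path is conventional rather than required, and the image and sizes are placeholders):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// HugepagesVolumePod shows how a container reaches the huge page filesystem
// today: an emptyDir volume with the HugePages medium, mounted into the
// container, alongside a hugepages-2Mi limit.
func HugepagesVolumePod() corev1.PodSpec {
	sizeLimit := resource.MustParse("128Mi")
	return corev1.PodSpec{
		Volumes: []corev1.Volume{{
			Name: "hugepages",
			VolumeSource: corev1.VolumeSource{
				EmptyDir: &corev1.EmptyDirVolumeSource{
					Medium:    corev1.StorageMediumHugePages,
					SizeLimit: &sizeLimit,
				},
			},
		}},
		Containers: []corev1.Container{{
			Name:  "c1",
			Image: "registry.k8s.io/pause:3.10",
			VolumeMounts: []corev1.VolumeMount{{
				Name:      "hugepages",
				MountPath: "/dev/hugepages", // hugetlbfs exposed to the container here
			}},
			Resources: corev1.ResourceRequirements{
				Limits: corev1.ResourceList{
					corev1.ResourceName("hugepages-2Mi"): resource.MustParse("128Mi"),
				},
			},
		}},
	}
}
```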
##### API changes

IPPR for pod-level resources requires extending `PodStatus` to include pod-level
resource fields as detailed in [PodStatus API changes](#### PodStatus API changes)
TODO
Suggested change:
- resource fields as detailed in [PodStatus API changes](#### PodStatus API changes)
+ resource fields as detailed in [PodStatus API changes](#podstatus-api-changes)
    memory: 200Mi
containers:
- name: c1
  image: registry.k8s.io/pause:latest
nit: can we avoid implying that latest is a good choice of version pin?
then automatically resize the pod to ensure sufficient resources are available
for both regular and ephemeral containers.

Currently, setting `resources` for ephemeral containers is disallowed as pod
Suggested change:
- Currently, setting `resources` for ephemeral containers is disallowed as pod
+ Prior to this KEP, setting `resources` for ephemeral containers was disallowed as pod

(to clarify)
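If the proposal lands, an ephemeral container carrying its own resources might look roughly like this (hypothetical sketch: today's validation still rejects it, and the image and values are placeholders):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// DebugContainerWithResources sketches an ephemeral container that declares
// container-level resources; under the proposal, Kubernetes would resize the
// pod to accommodate it instead of rejecting the request.
func DebugContainerWithResources() corev1.EphemeralContainer {
	return corev1.EphemeralContainer{
		EphemeralContainerCommon: corev1.EphemeralContainerCommon{
			Name:  "debugger",
			Image: "busybox:1.36",
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("100m"),
					corev1.ResourceMemory: resource.MustParse("64Mi"),
				},
			},
		},
	}
}
```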
supply resources at container-level for ephemeral containers and kubernetes will
resize the pod to accommodate the ephemeral containers.

#### [Scopred for Beta] CPU Manager
Suggested change:
- #### [Scopred for Beta] CPU Manager
+ #### [Scoped for Beta] CPU Manager
Note: This section includes only high level overview; Design details will be added in Beta stage.

* The pod level scope for topology aligntment will consider pod level requests and limits instead of container level aggregates.
Suggested change:
- * The pod level scope for topology aligntment will consider pod level requests and limits instead of container level aggregates.
+ * The pod level scope for topology alignment will consider pod level requests and limits instead of container level aggregates.
/lgtm
Open comments shouldn't block this, and can be addressed in a follow-up PR if needed.

/lgtm
@ndixita There are several small comments, especially with the newly added hugepage section. Please address them in a follow-up PR. There are also several open questions, but none of them are blockers for this feature. We should address them separately or in later release(s).
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, ndixita, tallclair

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.