
Conversation


@ndixita ndixita commented Feb 6, 2025

  • One-line PR description: Details about IPPR, memory manager, and huge pages for another alpha in 1.33

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 6, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 6, 2025

ndixita commented Feb 6, 2025

/assign @yujuhong @tallclair @thockin @liggitt


ffromani commented Feb 6, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from ffromani February 6, 2025 17:02
@ndixita ndixita force-pushed the pod-level-resources-alpha2 branch from 3db5e6f to 11353e0 Compare February 6, 2025 17:10

ndixita commented Feb 6, 2025

/assign @dchen1107

@liggitt liggitt (Member) left a comment

did a quick sweep and mostly asked questions about bits I couldn't figure out from the descriptions... I don't have a lot of context on the feature so will mostly defer to @thockin who did the API review on the alpha, I think

// Resources represents the compute resource requests and limits that have been
// applied at the pod level. If pod-level resources are not explicitly specified,
// then these will be the aggregate resources computed from containers. If limits are
// not defined for all containers (and not defined at the pod level) then no aggregate
Member

is this saying

  1. limits will not be populated in this field if they're not specified in spec.resources and we can't aggregate them from containers because not all containers specify limits
  2. limits will be populated here but not applied/enforced at the pod level?

Member

Should be 1, IMO. Aggregating pod-level limits doesn't make sense if not all containers are limited.

@ndixita ndixita Feb 11, 2025

Updated the language for clarity

#### Hugepages

With the proposed changes, hugepages-2Mi and hugepages-1Gi will be added to the pod-level resources section, alongside CPU and memory. The hugetlb cgroup for the
pod will then directly reflect the pod-level hugepage limits, rather than using an aggregated value from container limits. When scheduling, the scheduler will
Member

I'm not that familiar with today's behavior... are individual containers not constrained to the hugepages limits they specify, but able to steal allocated hugepages resources from a shared space sized to fit the hugepages requests of the other containers in the pod?

Contributor Author

With the current behavior, the containers are constrained to the hugepages limits they specify. With pod-level resources, we want to enable the containers to use huge pages from a shared hugepages pool in the pod.
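The shared-pool model described here could look roughly like the following hypothetical manifest (assumes the pod-level `resources` field from this KEP; names, values, and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-shared-pool
spec:
  resources:                 # pod-level pool shared by both containers
    limits:
      hugepages-2Mi: 100Mi
      memory: 200Mi
  containers:
  - name: c1
    image: registry.k8s.io/pause:3.9
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages
  - name: c2
    image: registry.k8s.io/pause:3.9
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```

Both containers draw from the single pod-level `hugepages-2Mi` limit, and each mounts the `medium: HugePages` emptyDir volume to access the hugepage filesystem.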

type PodStatus struct {
...
// Resources represents the compute resource requests and limits that have been
// applied at the pod level. If pod-level resources are not explicitly specified,
Member

Are only the resources supported in pod-level spec.resources (cpu, memory, and now hugepages...) aggregated here, or are other custom resources specified in the containers aggregated here? (as an aside, I think the pod-level resource validation errors if container-level resources are specified which are not included in pod-level resources at all... https://github.com/kubernetes/kubernetes/blob/ee22760391bae28954a69dff499d1cead9a9fcf0/pkg/apis/core/validation/validation.go#L4340-L4356).

What happens if pod-level spec.resources sets a pod-level cpu limit, but not a memory limit, and individual containers all set a memory limit? Does this include the pod-level cpu limit and the aggregated container memory limits?

Member

Good question. For resources that get configured on the pod level cgroup, this should report the actual values applied there. For everything else, I'm not sure. Do pod-level extended resources make sense today?

Member

DRA: I think pod-level GPUs could make sense and pod-level network interfaces are the ONLY real way to do network.

Member

For those extended resources, this is still an open question. Luckily we can address this in later releases. Not a blocker for 1.33.

Contributor

For those extended resources, this is still an open question. Luckily we can address this in later releases.

Ack. I think this should be stated in the non-goals if they are not already there.

Contributor Author

It is stated in the non-goals section that only CPU, memory and hugepages are supported for now https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md#non-goals

Also, as Jordan pointed out, there's a bug in the validation logic: if container-level resources are set for an unsupported resource type, the validation errors out, because the aggregated container requests are then greater than the pod requests (pod-level resources won't be set for unsupported resources): https://github.com/kubernetes/kubernetes/blob/ee22760391bae28954a69dff499d1cead9a9fcf0/pkg/apis/core/validation/validation.go#L4340-L4356

I will fix the bug. Thanks @liggitt for finding the bug.
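The bug can be reproduced in miniature with a simplified sketch (hypothetical helper names; plain int64 quantities standing in for resource.Quantity; not the actual validation.go code):

```go
package main

import "fmt"

// ResourceList maps a resource name to a quantity.
type ResourceList map[string]int64

// Pod-level resources support only these names per the KEP's non-goals.
var supportedPodLevelResources = map[string]bool{
	"cpu": true, "memory": true, "hugepages-2Mi": true, "hugepages-1Gi": true,
}

// aggregate sums per-resource requests across containers.
func aggregate(containers []ResourceList) ResourceList {
	total := ResourceList{}
	for _, c := range containers {
		for name, qty := range c {
			total[name] += qty
		}
	}
	return total
}

// naiveValidate mirrors the buggy check: it compares aggregated container
// requests against pod-level requests for every resource, so an unsupported
// resource (never present in pod-level requests) always "exceeds" them.
func naiveValidate(pod ResourceList, containers []ResourceList) (errs []string) {
	for name, qty := range aggregate(containers) {
		if qty > pod[name] {
			errs = append(errs, fmt.Sprintf("aggregated %s (%d) > pod-level (%d)", name, qty, pod[name]))
		}
	}
	return errs
}

// fixedValidate skips resources that pod-level resources cannot carry.
func fixedValidate(pod ResourceList, containers []ResourceList) (errs []string) {
	for name, qty := range aggregate(containers) {
		if !supportedPodLevelResources[name] {
			continue // extended/custom resources stay container-level only
		}
		if qty > pod[name] {
			errs = append(errs, fmt.Sprintf("aggregated %s (%d) > pod-level (%d)", name, qty, pod[name]))
		}
	}
	return errs
}

func main() {
	pod := ResourceList{"cpu": 100, "memory": 200}
	containers := []ResourceList{
		{"cpu": 50, "memory": 100, "example.com/gpu": 1}, // extended resource
		{"cpu": 50, "memory": 100},
	}
	fmt.Println(len(naiveValidate(pod, containers)), len(fixedValidate(pod, containers))) // prints "1 0"
}
```

The naive path rejects a pod whose containers are otherwise within the pod-level budget, solely because of the extended resource.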

@pacoxu pacoxu mentioned this pull request Feb 8, 2025
@ffromani ffromani (Contributor) left a comment

initial review; there are some topics I'd like to see elaborated further, but they are scoped for beta, and in 1.33 we want to have another alpha IIRC, so we're good.

// pod-level limits will be applied.
// +featureGate=InPlacePodVerticalScaling
// +optional
Resources *ResourceRequirements
Member

Using ResourceRequirements for this mirrors the container level field, but I wish we had gone with a custom type for that. We don't yet use ResourceClaims. Should we mirror the container type, or create a new type without resource claims?

Contributor Author

We used the same type i.e. ResourceRequirements for Resources in PodSpec as well.

Member

Other than duplication, what would be the disadvantage of de-duplicating types? I really dislike when we have fields in the API but they can't be used.

Contributor Author

Do we want to add a new type ResourceConstraints or ResourceRequestsLimits?
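For illustration only (neither name is settled, and the real type would use v1.ResourceList with resource.Quantity values), such a type could simply mirror ResourceRequirements minus the Claims field:

```go
package main

import "fmt"

// ResourceList stands in for v1.ResourceList (resource name -> quantity string).
type ResourceList map[string]string

// ResourceRequestsLimits is a hypothetical pod-level alternative to
// ResourceRequirements: the same Requests/Limits shape, but without the
// Claims field, which cannot be used at pod level today.
type ResourceRequestsLimits struct {
	// Limits describes the maximum amount of compute resources allowed.
	Limits ResourceList
	// Requests describes the minimum amount of compute resources required.
	Requests ResourceList
}

func main() {
	r := ResourceRequestsLimits{
		Requests: ResourceList{"cpu": "100m", "memory": "200Mi"},
		Limits:   ResourceList{"cpu": "200m", "memory": "400Mi"},
	}
	fmt.Println(r.Requests["cpu"], r.Limits["memory"]) // prints "100m 400Mi"
}
```

The trade-off discussed above: this avoids an unusable Claims field in the API, at the cost of a second, nearly identical type to maintain.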

updated resource values. Currently, the configuration is only read before a
resize.

2. Pod Status Update: Because the pod status is updated before the resize takes
Member

Aside: We should probably re-evaluate this, outside the context of this KEP. Now that there are things reflected in the status that don't also trigger resync, we're going to need to resync the pod just to write another field to the status. I'm not sure off hand what the consequences of moving the status update to the end of PodSync would be.

@ndixita ndixita force-pushed the pod-level-resources-alpha2 branch 3 times, most recently from 2dd5e72 to 137e064 Compare February 11, 2025 19:23
@ndixita ndixita force-pushed the pod-level-resources-alpha2 branch from 137e064 to 61128de Compare February 11, 2025 19:25
@thockin thockin (Member) left a comment

API-wise this seems consistent with in-place resize, so all my concerns there apply here.

API:
/lgtm

###### Allocating Pod-level Resources
Allocation of pod-level resources will work the same as container-level resources. The allocated resources checkpoint will be extended to include pod-level resources, and the pod object will be updated with the allocated resources in the pod sync loop.

###### Actuating Pod-level Resource Resize
Member

For the record, I think we should probably be periodically asserting the "correct" size for pod resources, just as I think we should for container resources. No action needed here, but when we solve one, solve both.

4. Increase container resources

###### Tracking Actual Pod-level Resources
To accurately track actual pod-level resources during in-place pod resizing, several
Member

@tallclair Given the discussion on how NRI plugins or systemd can mutate the resources (e.g. rounding), what happens when:

  • User specifies a pod with 2 containers, each using 5m cpu and the pod is 10m
  • An NRI plugin mutates the containers and rounds them up to 10m each

Are we smart enough to increase the pod?

@tallclair tallclair Feb 12, 2025

Hmm, we don't today, but we could. We are reading the actual values from the runtime, so we could compute the pod-level cgroups based on the sum of those instead of the allocated resources (or whichever is larger). We could even compute the diff with what we asked for, and add that to the pod.
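A minimal sketch of that "whichever is larger" idea (hypothetical helper; CPU in millicores as plain int64, not the kubelet's actual code):

```go
package main

import "fmt"

// podCgroupCPUMilli picks the pod-level CPU value to actuate: the larger of
// the allocated pod-level value and the sum of the actual per-container
// values reported by the runtime (which an NRI plugin may have rounded up).
func podCgroupCPUMilli(allocatedPodMilli int64, actualContainerMilli []int64) int64 {
	var sum int64
	for _, m := range actualContainerMilli {
		sum += m
	}
	if sum > allocatedPodMilli {
		return sum
	}
	return allocatedPodMilli
}

func main() {
	// User asked for two 5m containers (pod allocated 10m); an NRI plugin
	// rounded each container up to 10m, so the pod cgroup must grow to 20m.
	fmt.Println(podCgroupCPUMilli(10, []int64{10, 10})) // prints 20
}
```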

Contributor

Hmm... but if NRI plugins change the value to be completely different, that'd just conflict with how the kubelet manages the cgroups. We can simply grab the values from the runtime and assume those are the ones we want.

@samuelkarp (Member)

Agreed with Yuju above.

I expect that with the resource management efforts we are doing and plan to do, users will eventually limit their NRI usage significantly.

Member

We can continue discussing this, but this isn't a blocker here.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025

#### Hugepages

With the proposed changes, support for hugepages(with prefix hugepages-*) will be extended to the pod-level resources specifications, alongside CPU and memory. The hugetlb cgroup for the

Suggested change
With the proposed changes, support for hugepages(with prefix hugepages-*) will be extended to the pod-level resources specifications, alongside CPU and memory. The hugetlb cgroup for the
With the proposed changes, support for Linux hugepages (resources with prefix `hugepages-`) will be extended to the pod-level resources specifications, alongside CPU and memory. The hugetlb cgroup for the

?

consider hugepage requests at the pod level to find nodes with enough available
resources.

Containers will still need to mount an emptyDir volume to access the huge page filesystem (typically /dev/hugepages). This is the standard way for containers to interact with huge pages, and this will not change.

This sounds confusing; /dev/hugepages isn't normally present within an empty directory (per emptyDir). Maybe clarify what we mean here.

##### API changes

IPPR for pod-level resources requires extending `PodStatus` to include pod-level
resource fields as detailed in [PodStatus API changes](#### PodStatus API changes)

TODO

Member

Suggested change
resource fields as detailed in [PodStatus API changes](#### PodStatus API changes)
resource fields as detailed in [PodStatus API changes](#podstatus-api-changes)

memory: 200Mi
containers:
- name: c1
image: registry.k8s.io/pause:latest

nit: can we avoid implying that latest is a good choice of version pin?

then automatically resize the pod to ensure sufficient resources are available
for both regular and ephemeral containers.

Currently, setting `resources` for ephemeral containers is disallowed as pod

Suggested change
Currently, setting `resources` for ephemeral containers is disallowed as pod
Prior to this KEP, setting `resources` for ephemeral containers was disallowed as pod

(to clarify)

supply resources at container-level for ephemeral containers and kubernetes will
resize the pod to accommodate the ephemeral containers.

#### [Scopred for Beta] CPU Manager

Suggested change
#### [Scopred for Beta] CPU Manager
#### [Scoped for Beta] CPU Manager


Note: This section includes only a high-level overview; design details will be added in the Beta stage.

* The pod level scope for topology aligntment will consider pod level requests and limits instead of container level aggregates.

Suggested change
* The pod level scope for topology aligntment will consider pod level requests and limits instead of container level aggregates.
* The pod level scope for topology alignment will consider pod level requests and limits instead of container level aggregates.

@tallclair (Member)

/lgtm
/approve

Open comments shouldn't block this, and can be addressed in a follow-up PR if needed.

@dchen1107 (Member)

/lgtm
/approve

@ndixita There are several small comments, especially in the newly added hugepages section. Please address them in a follow-up PR. There are also several open questions, but none of them block this feature; we should address them separately or in later release(s).

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, ndixita, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit ea998b6 into kubernetes:master Feb 13, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 13, 2025
@liggitt liggitt moved this to API review completed, 1.33 in API Reviews Feb 27, 2025

