KEP-2837: PodLevelResources changes for 1.33 alpha stage #5145
Conversation
/assign @yujuhong @tallclair @thockin @liggitt

/cc
/assign @dchen1107
liggitt left a comment:
did a quick sweep and mostly asked questions about bits I couldn't figure out from the descriptions... I don't have a lot of context on the feature so will mostly defer to @thockin who did the API review on the alpha, I think
// Resources represents the compute resource requests and limits that have been
// applied at the pod level. If pod-level resources are not explicitly specified,
// then these will be the aggregate resources computed from containers. If limits are
// not defined for all containers (and not defined at the pod level) then no aggregate
is this saying
1. limits will not be populated in this field if they're not specified in spec.resources and we can't aggregate them from containers because not all containers specify limits, or
2. limits will be populated here but not applied/enforced at the pod level?
Should be 1, IMO. Aggregating pod-level limits doesn't make sense if not all containers are limited.
Updated the language for clarity
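For readers skimming the thread, here is a minimal Go sketch of interpretation 1 (my illustration with assumed helper names, not the kubelet's actual aggregation code): a pod-level limit is aggregated from containers only when every container declares a limit for that resource.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// aggregateLimits sums container limits per resource, then drops any resource
// that at least one container leaves unlimited, so nothing is reported (or
// enforced) at the pod level for it.
func aggregateLimits(containers []corev1.Container) corev1.ResourceList {
	sums := corev1.ResourceList{}
	counts := map[corev1.ResourceName]int{}
	for _, c := range containers {
		for name, qty := range c.Resources.Limits {
			total := sums[name]
			total.Add(qty)
			sums[name] = total
			counts[name]++
		}
	}
	for name := range sums {
		if counts[name] != len(containers) {
			delete(sums, name) // some container has no limit: no pod-level aggregate
		}
	}
	return sums
}

func main() {
	containers := []corev1.Container{
		{Name: "c1", Resources: corev1.ResourceRequirements{
			Limits: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("500m")},
		}},
		{Name: "c2"}, // declares no limits
	}
	fmt.Println(aggregateLimits(containers)) // map[]: no pod-level limits populated
}
```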
#### Hugepages

With the proposed changes, hugepages-2Mi and hugepages-1Gi will be added to the pod-level resources section, alongside CPU and memory. The hugetlb cgroup for the
pod will then directly reflect the pod-level hugepage limits, rather than using an aggregated value from container limits. When scheduling, the scheduler will
I'm not that familiar with today's behavior... are individual containers not constrained to the hugepages limits they specify, but can steal allocated hugepages resources from a shared space sized to fit the other containers in the pod's hugepages requests?
With the current behavior, the containers are constrained to the hugepages limits they specify. With pod-level resources, we want to enable the containers to use huge pages from a shared huge pages pool in a pod.
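A minimal sketch of what that could look like in a pod spec, assuming the pod-level `spec.resources` field from the existing alpha plus the hugepage support proposed here (field values, image tag, and sizes are illustrative only):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// PodLevelHugepages returns a PodSpec whose 2Mi hugepages are declared once
// at the pod level; the containers would draw from that shared pool rather
// than carrying per-container hugepage limits.
func PodLevelHugepages() corev1.PodSpec {
	return corev1.PodSpec{
		Resources: &corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:                   resource.MustParse("500m"),
				corev1.ResourceMemory:                resource.MustParse("256Mi"),
				corev1.ResourceName("hugepages-2Mi"): resource.MustParse("128Mi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceName("hugepages-2Mi"): resource.MustParse("128Mi"),
			},
		},
		// Neither container declares hugepages; under the proposal they share
		// the pod-level hugetlb allowance.
		Containers: []corev1.Container{
			{Name: "c1", Image: "registry.k8s.io/pause:3.10"},
			{Name: "c2", Image: "registry.k8s.io/pause:3.10"},
		},
	}
}
```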
type PodStatus struct {
  ...
  // Resources represents the compute resource requests and limits that have been
  // applied at the pod level. If pod-level resources are not explicitly specified,
Are only the resources supported in pod-level spec.resources (cpu, memory, and now hugepages...) aggregated here, or are other custom resources specified in the containers aggregated here? (as an aside, I think the pod-level resource validation errors if container-level resources are specified which are not included in pod-level resources at all... https://github.com/kubernetes/kubernetes/blob/ee22760391bae28954a69dff499d1cead9a9fcf0/pkg/apis/core/validation/validation.go#L4340-L4356).
What happens if pod-level spec.resources sets a pod-level cpu limit, but not a memory limit, and individual containers all set memory limits? Does this include the pod-level cpu limit and the aggregated container memory limits?
Good question. For resources that get configured on the pod level cgroup, this should report the actual values applied there. For everything else, I'm not sure. Do pod-level extended resources make sense today?
DRA: I think pod-level GPUs could make sense and pod-level network interfaces are the ONLY real way to do network.
For those extended resources, this is still an open question. Luckily we can address this in later releases. Not a blocker for 1.33.
> For those extended resources, this is still an open question. Luckily we can address this in later releases.
Ack. I think this should be stated in the non-goals if they are not already there.
It is stated in the non-goals section that only CPU, memory and hugepages are supported for now: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md#non-goals
Also, as Jordan pointed out, there's a bug in the validation logic: if container-level resources are set for an unsupported resource type, validation will error out because the aggregated container requests will be greater than the pod requests (since pod-level resources won't be set for unsupported resources): https://github.com/kubernetes/kubernetes/blob/ee22760391bae28954a69dff499d1cead9a9fcf0/pkg/apis/core/validation/validation.go#L4340-L4356
I will fix the bug. Thanks @liggitt for finding it.
kubernetes/kubernetes#130131 is the PR @liggitt
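To make the failure mode concrete, here is a small Go sketch of the problematic check shape (my paraphrase with made-up names, not the actual validation.go code): comparing aggregated container requests against pod-level requests for every resource name penalizes resources that can only appear at the container level, such as extended resources.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// naiveConsistencyCheck mirrors the problematic shape: require aggregated
// container requests <= pod-level requests for every resource name.
func naiveConsistencyCheck(podRequests, aggregatedContainerRequests corev1.ResourceList) []string {
	var errs []string
	for name, containerQty := range aggregatedContainerRequests {
		podQty, ok := podRequests[name]
		if !ok || containerQty.Cmp(podQty) > 0 {
			errs = append(errs, fmt.Sprintf("%s: aggregated container requests exceed pod-level requests", name))
		}
	}
	return errs
}

func main() {
	podRequests := corev1.ResourceList{
		corev1.ResourceCPU: resource.MustParse("1"),
	}
	// A container requests an extended resource; it can never be set at the
	// pod level today, so the naive check always reports an error for it.
	aggregated := corev1.ResourceList{
		corev1.ResourceCPU:                     resource.MustParse("500m"),
		corev1.ResourceName("example.com/gpu"): resource.MustParse("1"),
	}
	fmt.Println(naiveConsistencyCheck(podRequests, aggregated))
}
```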
ffromani left a comment:
Initial review; there are some topics I'd like to see elaborated further, but they are scoped for beta, and in 1.33 we want to have another alpha IIRC, so we're good.
// pod-level limits will be applied.
// +featureGate=InPlacePodVerticalScaling
// +optional
Resources *ResourceRequirements
Using ResourceRequirements for this mirrors the container level field, but I wish we had gone with a custom type for that. We don't yet use ResourceClaims. Should we mirror the container type, or create a new type without resource claims?
We used the same type, i.e. ResourceRequirements, for Resources in PodSpec as well.
Other than duplication, what would be the disadvantage of de-duplicating types? I really dislike when we have fields in the API but they can't be used.
Do we want to add a new type ResourceConstraints or ResourceRequestsLimits?
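For concreteness, here is a hypothetical sketch of such a type (ResourceConstraints and ResourceRequestsLimits are only the candidate names floated above; nothing here is decided):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// ResourceRequestsLimits is a hypothetical pod-level resources type that
// carries only requests and limits, omitting the Claims field that
// ResourceRequirements exposes for containers but that pods cannot use yet.
type ResourceRequestsLimits struct {
	// Limits describes the maximum amount of compute resources allowed at the pod level.
	// +optional
	Limits corev1.ResourceList `json:"limits,omitempty"`
	// Requests describes the minimum amount of compute resources required at the pod level.
	// +optional
	Requests corev1.ResourceList `json:"requests,omitempty"`
}
```

The trade-off discussed above is duplicating an almost-identical type versus carrying a Claims field that cannot be set at the pod level.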
updated resource values. Currently, the configuration is only read before a
resize.

2. Pod Status Update: Because the pod status is updated before the resize takes
Aside: We should probably re-evaluate this, outside the context of this KEP. Now that there are things reflected in the status that don't also trigger resync, we're going to need to resync the pod just to write another field to the status. I'm not sure off hand what the consequences of moving the status update to the end of PodSync would be.
thockin left a comment:
API-wise this seems consistent with in-place resize, so all my concerns there apply here.
API:
/lgtm
###### Allocating Pod-level Resources
Allocation of pod-level resources will work the same as container-level resources. The allocated resources checkpoint will be extended to include pod-level resources, and the pod object will be updated with the allocated resources in the pod sync loop.

###### Actuating Pod-level Resource Resize
For the record, I think we should probably be periodically asserting the "correct" size for pod resources, just as I think we should for container resources. No action needed here, but when we solve one, solve both.
4. Increase container resources

###### Tracking Actual Pod-level Resources
To accurately track actual pod-level resources during in-place pod resizing, several
@tallclair Given the discussion on how NRI plugins or systemd can mutate the resources (e.g. rounding), what happens when:
- User specifies a pod with 2 containers, each using 5m cpu, and the pod is 10m
- An NRI plugin mutates the containers and rounds them up to 10m each

Are we smart enough to increase the pod?
Hmm, we don't today, but we could. We are reading the actual values from the runtime, so we could compute the pod-level cgroups based on the sum of those instead of the allocated resources (or whichever is larger). We could even compute the diff with what we asked for, and add that to the pod.
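A tiny sketch of that idea (assumed function name; not current kubelet behavior): size the pod-level value from whichever is larger, the allocated pod resources or the sum of the actual container values read back from the runtime.

```go
package main

import "fmt"

// desiredPodCPUQuota picks max(allocated, sum of actual container values),
// so runtime-side mutations (e.g. an NRI plugin rounding containers up)
// do not leave the pod cgroup smaller than its children.
func desiredPodCPUQuota(allocatedMilliCPU int64, actualContainerMilliCPU []int64) int64 {
	var sum int64
	for _, v := range actualContainerMilliCPU {
		sum += v
	}
	if sum > allocatedMilliCPU {
		return sum
	}
	return allocatedMilliCPU
}

func main() {
	// The example from the question: two 5m containers rounded up to 10m each,
	// with the pod originally allocated 10m.
	fmt.Println(desiredPodCPUQuota(10, []int64{10, 10})) // 20
}
```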
Hmm... but if NRI plugins change the value to be completely different, that'd just conflict with how kubelet manages the cgroups. We can simply grab the values from the and assume those are the ones we want.
Agreed with Yuju above.
I expect that, with the new resource management efforts we are doing and planning, users will eventually largely limit their NRI usage.
We can continue discussing this, but this isn't a blocker here.
#### Hugepages

With the proposed changes, support for hugepages(with prefix hugepages-*) will be extended to the pod-level resources specifications, alongside CPU and memory. The hugetlb cgroup for the
Suggested change:
- With the proposed changes, support for hugepages(with prefix hugepages-*) will be extended to the pod-level resources specifications, alongside CPU and memory. The hugetlb cgroup for the
+ With the proposed changes, support for Linux hugepages (resources with prefix `hugepages-`) will be extended to the pod-level resources specifications, alongside CPU and memory. The hugetlb cgroup for the

?
consider hugepage requests at the pod level to find nodes with enough available
resources.

Containers will still need to mount an emptyDir volume to access the huge page filesystem (typically /dev/hugepages). This is the standard way for containers to interact with huge pages, and this will not change.
This sounds confusing; /dev/hugepages isn't normally present within an empty directory (per emptyDir). Maybe clarify what we mean here.
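For context, the pattern the KEP text refers to is an emptyDir volume backed by the HugePages medium; a minimal sketch of that existing pattern (the /dev/hugepages mount path is conventional rather than required, and the image and sizes are placeholders):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// HugepagesVolumePod shows how a container reaches the huge page filesystem
// today: an emptyDir volume with the HugePages medium, mounted into the
// container, alongside a hugepages-2Mi limit.
func HugepagesVolumePod() corev1.PodSpec {
	sizeLimit := resource.MustParse("128Mi")
	return corev1.PodSpec{
		Volumes: []corev1.Volume{{
			Name: "hugepages",
			VolumeSource: corev1.VolumeSource{
				EmptyDir: &corev1.EmptyDirVolumeSource{
					Medium:    corev1.StorageMediumHugePages,
					SizeLimit: &sizeLimit,
				},
			},
		}},
		Containers: []corev1.Container{{
			Name:  "c1",
			Image: "registry.k8s.io/pause:3.10",
			VolumeMounts: []corev1.VolumeMount{{
				Name:      "hugepages",
				MountPath: "/dev/hugepages", // hugetlbfs exposed to the container here
			}},
			Resources: corev1.ResourceRequirements{
				Limits: corev1.ResourceList{
					corev1.ResourceName("hugepages-2Mi"): resource.MustParse("128Mi"),
				},
			},
		}},
	}
}
```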
##### API changes

IPPR for pod-level resources requires extending `PodStatus` to include pod-level
resource fields as detailed in [PodStatus API changes](#### PodStatus API changes)
TODO
Suggested change:
- resource fields as detailed in [PodStatus API changes](#### PodStatus API changes)
+ resource fields as detailed in [PodStatus API changes](#podstatus-api-changes)
    memory: 200Mi
containers:
- name: c1
  image: registry.k8s.io/pause:latest
nit: can we avoid implying that latest is a good choice of version pin?
then automatically resize the pod to ensure sufficient resources are available
for both regular and ephemeral containers.

Currently, setting `resources` for ephemeral containers is disallowed as pod
Suggested change:
- Currently, setting `resources` for ephemeral containers is disallowed as pod
+ Prior to this KEP, setting `resources` for ephemeral containers was disallowed as pod

(to clarify)
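If the proposal lands, an ephemeral container carrying its own resources might look roughly like this (hypothetical sketch: today's validation still rejects it, and the image and values are placeholders):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// DebugContainerWithResources sketches an ephemeral container that declares
// container-level resources; under the proposal, Kubernetes would resize the
// pod to accommodate it instead of rejecting the request.
func DebugContainerWithResources() corev1.EphemeralContainer {
	return corev1.EphemeralContainer{
		EphemeralContainerCommon: corev1.EphemeralContainerCommon{
			Name:  "debugger",
			Image: "busybox:1.36",
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("100m"),
					corev1.ResourceMemory: resource.MustParse("64Mi"),
				},
			},
		},
	}
}
```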
supply resources at container-level for ephemeral containers and kubernetes will
resize the pod to accommodate the ephemeral containers.

#### [Scopred for Beta] CPU Manager
Suggested change:
- #### [Scopred for Beta] CPU Manager
+ #### [Scoped for Beta] CPU Manager
Note: This section includes only high level overview; Design details will be added in Beta stage.

* The pod level scope for topology aligntment will consider pod level requests and limits instead of container level aggregates.
Suggested change:
- * The pod level scope for topology aligntment will consider pod level requests and limits instead of container level aggregates.
+ * The pod level scope for topology alignment will consider pod level requests and limits instead of container level aggregates.
/lgtm
Open comments shouldn't block this, and can be addressed in a follow-up PR if needed.

/lgtm
@ndixita There are several small comments, especially with the newly added hugepage section. Please address them in a follow-up PR. There are also several open questions, but none of them are blockers for this feature. We should address them separately or in later release(s).
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, ndixita, tallclair

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.