
Commit 168fd21

KEP-2485: ReadWriteOncePod beta production readiness
1 parent 1287006 commit 168fd21

3 files changed: +209 -33 lines changed
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 2485
 alpha:
   approver: "@ehashman"
+beta:
+  approver: "@deads2k"

keps/sig-storage/2485-read-write-once-pod-pv-access-mode/README.md

Lines changed: 203 additions & 30 deletions
@@ -94,17 +94,23 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Design Details](#design-details)
   - [Kubernetes Changes, Access Mode](#kubernetes-changes-access-mode)
     - [Scheduler Enforcement](#scheduler-enforcement)
+      - [Alpha](#alpha)
+      - [Beta](#beta)
     - [Mount Enforcement](#mount-enforcement)
   - [CSI Specification Changes, Volume Capabilities](#csi-specification-changes-volume-capabilities)
   - [Test Plan](#test-plan)
+      - [Prerequisite testing updates](#prerequisite-testing-updates)
+      - [Unit tests](#unit-tests)
+      - [Integration tests](#integration-tests)
+      - [e2e tests](#e2e-tests)
     - [Validation of PersistentVolumeSpec Object](#validation-of-persistentvolumespec-object)
     - [Mounting and Mapping with ReadWriteOncePod](#mounting-and-mapping-with-readwriteoncepod)
     - [Mounting and Mapping with ReadWriteOnce](#mounting-and-mapping-with-readwriteonce)
     - [Mapping Kubernetes Access Modes to CSI Volume Capability Access Modes](#mapping-kubernetes-access-modes-to-csi-volume-capability-access-modes)
     - [End to End Tests](#end-to-end-tests)
   - [Graduation Criteria](#graduation-criteria)
-    - [Alpha](#alpha)
-    - [Beta](#beta)
+    - [Alpha](#alpha-1)
+    - [Beta](#beta-1)
     - [GA](#ga)
   - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
   - [Version Skew Strategy](#version-skew-strategy)
@@ -389,6 +395,8 @@ This access mode will be enforced in two places:
 
 #### Scheduler Enforcement
 
+##### Alpha
+
 First is at the time a pod is scheduled. When scheduling a pod, if another pod
 is found using the same PVC and the PVC uses ReadWriteOncePod, then scheduling
 will fail and the pod will be considered UnschedulableAndUnresolvable.
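For illustration (a sketch, not part of this commit), the objects involved in the check above could look like the following, built with the `k8s.io/api` types: a PVC requesting the `ReadWriteOncePod` access mode and a single pod consuming it. A second pod referencing `data-pvc` would be rejected as described. Object names, the image, and the omitted storage request are illustrative assumptions only.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "data-pvc", Namespace: "default"},
		Spec: corev1.PersistentVolumeClaimSpec{
			// The access mode this KEP introduces: only one pod at a time may
			// use a PVC with this mode.
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOncePod},
			// Storage class and resource requests omitted for brevity; a real
			// PVC also needs spec.resources.requests.storage.
		},
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "writer", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:         "app",
				Image:        "registry.k8s.io/pause:3.9", // illustrative image
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: pvc.Name,
					},
				},
			}},
		},
	}

	fmt.Printf("pvc %q access modes: %v; consumed by pod %q\n",
		pvc.Name, pvc.Spec.AccessModes, pod.Name)
}
```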
@@ -407,6 +415,24 @@ marked UnschedulableAndUnresolvable.
 [volume restrictions plugin]: https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/scheduler/framework/plugins/volumerestrictions/volume_restrictions.go#L29
 [node info cache]: https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/scheduler/framework/types.go#L357
 
+##### Beta
+
+Support for pod preemption is added in beta.
+
+When a pod (A) is scheduled, if another pod (B) is found using the same PVC, the
+PVC uses ReadWriteOncePod, and pod (A) has higher priority than pod (B), then
+return Unschedulable (which will cause pod (B) to be preempted). If pod (A) has
+lower or equal priority compared with pod (B), return
+UnschedulableAndUnresolvable.
+
+In the PreFilter phase of the volume restrictions scheduler plugin, we will
+build a cache of the existing pods and nodes using the ReadWriteOncePod PVCs on
+the pod to be scheduled. This cache will be saved as part of the scheduler's
+cycleState and forwarded to the following steps. During AddPod and RemovePod we
+will add or remove references to the target ReadWriteOncePod PVCs to simulate
+preemption. During the Filter phase we will check the cache for remaining
+references to the PVCs and compare pod priorities where applicable.
+
 #### Mount Enforcement
 
 As an additional precaution this will also be enforced at the time a volume is
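A simplified, self-contained sketch of the preemption-aware Filter decision described in the beta hunk above. It deliberately avoids the real scheduler framework types: `preFilterState`, `filter`, and the status strings are stand-ins for the plugin's cycleState data and `framework.Status` codes, and the actual `volumerestrictions` implementation may differ.

```go
package main

import "fmt"

type status string

const (
	success                      status = "Success"
	unschedulable                status = "Unschedulable"                // preemption may help
	unschedulableAndUnresolvable status = "UnschedulableAndUnresolvable" // preemption cannot help
)

// preFilterState stands in for the cache built during PreFilter: for each
// ReadWriteOncePod PVC requested by the incoming pod, the priority of the
// already-scheduled pod (if any) that still references it after the
// AddPod/RemovePod simulation.
type preFilterState struct {
	conflicts map[string]int32 // PVC name -> priority of the pod currently holding it
}

// filter mimics the Filter-phase decision: a remaining conflict on a
// ReadWriteOncePod PVC is only resolvable by preempting a lower-priority pod.
func filter(podPriority int32, rwopPVCs []string, s *preFilterState) status {
	for _, pvc := range rwopPVCs {
		holderPriority, inUse := s.conflicts[pvc]
		if !inUse {
			continue // no remaining conflict for this PVC
		}
		if podPriority > holderPriority {
			return unschedulable // preempting the holder may free the PVC
		}
		return unschedulableAndUnresolvable // equal or lower priority: cannot preempt
	}
	return success
}

func main() {
	s := &preFilterState{conflicts: map[string]int32{"data-pvc": 100}}
	fmt.Println(filter(1000, []string{"data-pvc"}, s)) // Unschedulable
	fmt.Println(filter(10, []string{"data-pvc"}, s))   // UnschedulableAndUnresolvable
	fmt.Println(filter(10, []string{"other-pvc"}, s))  // Success
}
```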
@@ -487,14 +513,7 @@ external-attacher, and external-resizer.
 
 <!--
 **Note:** *Not required until targeted at a release.*
-
-Consider the following in developing a test plan for this enhancement:
-- Will there be e2e and integration tests, in addition to unit tests?
-- How will it be tested in isolation vs with other components?
-
-No need to outline all of the test cases, just the general strategy. Anything
-that would count as tricky in the implementation, and anything particularly
-challenging to test, should be called out.
+The goal is to ensure that we don't accept enhancements with inadequate testing.
 
 All code is expected to have adequate tests (eventually with coverage
 expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
@@ -503,6 +522,120 @@ when drafting this test plan.
 [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
 -->
 
+[X] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes
+necessary to implement this enhancement.
+
+##### Prerequisite testing updates
+
+<!--
+Based on reviewers feedback describe what additional tests need to be added prior
+implementing this enhancement to ensure the enhancements have also solid foundations.
+-->
+
+None. New tests will be added for the transition to beta to support scheduler
+changes.
+
+##### Unit tests
+
+<!--
+In principle every added code should have complete unit test coverage, so providing
+the exact set of tests will not bring additional value.
+However, if complete unit test coverage is not possible, explain the reason of it
+together with explanation why this is acceptable.
+-->
+
+<!--
+Additionally, for Alpha try to enumerate the core package you will be touching
+to implement this enhancement and provide the current unit coverage for those
+in the form of:
+- <package>: <date> - <current test coverage>
+The data can be easily read from:
+https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
+
+This can inform certain test coverage improvements that we want to do before
+extending the production code to implement this enhancement.
+-->
+
+In alpha, the following unit tests were updated. See
+https://github.com/kubernetes/kubernetes/pull/102028 and
+https://github.com/kubernetes/kubernetes/pull/103082 for more context.
+
+- `k8s.io/kubernetes/pkg/apis/core/helper`: `09-22-2022` - `26.2`
+- `k8s.io/kubernetes/pkg/apis/core/v1/helper`: `09-22-2022` - `56.9`
+- `k8s.io/kubernetes/pkg/apis/core/validation`: `09-22-2022` - `82.3`
+- `k8s.io/kubernetes/pkg/controller/volume/persistentvolume`: `09-22-2022` - `79.4`
+- `k8s.io/kubernetes/pkg/kubelet/volumemanager/cache`: `09-22-2022` - `66.3`
+- `k8s.io/kubernetes/pkg/volume/csi/csi_client.go`: `09-22-2022` - `76.2`
+- `k8s.io/kubernetes/pkg/scheduler/apis/config/v1beta2`: `09-22-2022` - `76.8`
+- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumerestrictions`: `09-22-2022` - `85`
+- `k8s.io/kubernetes/pkg/scheduler/framework`: `09-22-2022` - `77.1`
+
+In beta, there will be additional unit test coverage for
+`k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumerestrictions` to cover
+preemption logic.
+
+##### Integration tests
+
+<!--
+This question should be filled when targeting a release.
+For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
+
+For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
+https://storage.googleapis.com/k8s-triage/index.html
+-->
+
+Integration tests for scheduler plugin behavior are available here:
+
+- [test/integration/scheduler/filters/filters_test.go]: [testgrid]
+
+[test/integration/scheduler/filters/filters_test.go]: https://github.com/kubernetes/kubernetes/blob/v1.25.2/test/integration/scheduler/filters/filters_test.go
+[testgrid]: https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-integration&include-filter-by-regex=k8s.io/kubernetes/test/integration/scheduler/filters.TestUnschedulablePodBecomesSchedulable/scheduled_pod_uses_read-write-once-pod_pvc
+
+##### e2e tests
+
+<!--
+This question should be filled when targeting a release.
+For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
+
+For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
+https://storage.googleapis.com/k8s-triage/index.html
+
+We expect no non-infra related flakes in the last month as a GA graduation criteria.
+-->
+
+For alpha, to test this feature end to end, we will need to check the following cases:
+
+- A ReadWriteOncePod volume will succeed mounting when consumed by a single pod
+  on a node
+- A ReadWriteOncePod volume will fail to mount when consumed by a second pod on
+  the same node
+- A ReadWriteOncePod volume will fail to attach when consumed by a second pod on
+  a different node
+
+For testing the mapping for ReadWriteOnce, we should update the mock CSI driver
+to support the new volume capability access modes and cut a release. The
+existing Kubernetes end to end tests will be updated to use this version, which
+will test the mapping behavior because most storage end to end tests rely on the
+ReadWriteOnce access mode.
+
+E2E tests for alpha behavior can be found here:
+
+- [test/e2e/storage/testsuites/readwriteoncepod.go]: [k8s-triage]
+
+For beta, we will want to cover the additional cases for preemption:
+
+- A high-priority pod requesting a ReadWriteOncePod volume that's already in use
+  will result in the preemption of the pod previously using the volume
+- A high-priority pod requesting multiple ReadWriteOncePod volumes that are
+  already in use will result in the preemption of multiple pods previously using
+  the volumes
+- A low-priority (or no priority) pod requesting a ReadWriteOncePod volume
+  that's already in use will result in it being marked UnschedulableAndUnresolvable
+
+[test/e2e/storage/testsuites/readwriteoncepod.go]: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/testsuites/readwriteoncepod.go
+[k8s-triage]: https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=read-write-once-pod
+
 #### Validation of PersistentVolumeSpec Object
 
 To test the validation logic of the PersistentVolumeSpec, we need to check the
@@ -538,20 +671,6 @@ well as in CSI sidecars.
 
 #### End to End Tests
 
-To test this feature end to end, we will need to check the following cases:
-
-- A ReadWriteOncePod volume will succeed mounting when consumed by a single pod
-  on a node
-- A ReadWriteOncePod volume will fail to mount when consumed by a second pod on
-  the same node
-- A ReadWriteOncePod volume will fail to attach when consumed by a second pod on
-  a different node
-
-For testing the mapping for ReadWriteOnce, we should update the mock CSI driver
-to support the new volume capability access modes and cut a release. The
-existing Kubernetes end to end tests will be updated to use this version which
-will test the mapping behavior because most storage end to end tests rely on the
-ReadWriteOnce access mode.
 
 ### Graduation Criteria
 
@@ -623,8 +742,6 @@ in back-to-back releases.
 - Scheduler enforces ReadWriteOncePod access mode by marking pods as
   Unschedulable, preemption logic added
 - ReadWriteOncePod access mode has end to end test coverage
-- Mock CSI driver supports `SINGLE_NODE_*_WRITER` access modes, relevant end to
-  end tests updated to use this driver
 - Hostpath CSI driver supports `SINGLE_NODE_*_WRITER` access modes, relevant end
   to end tests updated to use this driver
 
@@ -832,13 +949,31 @@ Try to be as paranoid as possible - e.g., what if some components will restart
 mid-rollout?
 -->
 
+Rolling out this feature involves enabling the ReadWriteOncePod feature gate
+across kube-apiserver, kube-scheduler, and kubelet, and updating CSI driver and
+sidecar versions.
+
+The only way this rollout can fail is if a user does not update all components,
+in which case the feature will not work. See the above section on version skew
+for behavior in this scenario.
+
+Rolling out this feature does not impact any running workloads.
+
 ###### What specific metrics should inform a rollback?
 
 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
 
+If pods using ReadWriteOncePod PVCs fail to schedule, you may see an increase in
+`scheduler_unschedulable_pods{plugin="VolumeRestrictions"}`.
+
+For enforcement in kubelet, if there are issues you may see changes in metrics
+for "volume_mount" operations. For example, an increase in
+`storage_operation_duration_seconds_bucket{operation_name="volume_mount"}` for
+larger buckets may indicate issues with mount.
+
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
 <!--
@@ -847,12 +982,24 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
 
+For alpha, manual tests were performed to:
+
+- Unsuccessfully create workloads using ReadWriteOncePod PVCs prior to upgrade
+- Perform the upgrade (enabling feature flags and updating CSI drivers)
+- Successfully create workloads using ReadWriteOncePod PVCs
+- Perform the downgrade (disabling feature flags and downgrading CSI drivers)
+- Successfully delete ReadWriteOncePod PVCs
+
+For beta, similar manual tests will need to be performed once implemented.
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
 
+No.
+
 ### Monitoring Requirements
 
 <!--
@@ -867,18 +1014,21 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
+An operator can query for PersistentVolumeClaims and PersistentVolumes in the
+cluster with the ReadWriteOncePod access mode. If any exist then the feature is
+in use.
+
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 <!--
 Pick one more of these and delete the rest.
 -->
 
-- [ ] Metrics
-  - Metric name:
+- [X] Metrics
+  - Metric name: `scheduler_unschedulable_pods{plugin="VolumeRestrictions"}`
   - [Optional] Aggregation method:
   - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+    - kube-scheduler
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
 
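As a rough illustration of the usage check described in the hunk above (a sketch, not part of this commit), the following client-go program lists PersistentVolumeClaims and reports any that request the ReadWriteOncePod access mode; PersistentVolumes can be inspected the same way. It assumes in-cluster credentials; use `clientcmd` for an out-of-cluster kubeconfig.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs inside the cluster.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pvcs, err := client.CoreV1().PersistentVolumeClaims(metav1.NamespaceAll).
		List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	inUse := 0
	for _, pvc := range pvcs.Items {
		for _, mode := range pvc.Spec.AccessModes {
			if mode == corev1.ReadWriteOncePod {
				inUse++
				fmt.Printf("ReadWriteOncePod PVC: %s/%s\n", pvc.Namespace, pvc.Name)
			}
		}
	}
	// PersistentVolumes can be checked the same way via client.CoreV1().PersistentVolumes().
	fmt.Printf("feature in use: %v\n", inUse > 0)
}
```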
@@ -892,13 +1042,17 @@ high level (needs more precise definitions) those may be things like:
 - 99,9% of /health requests per day finish with 200 code
 -->
 
+Per-day percentage of CSI driver API calls finishing with 5XX errors <= 1%.
+
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
 <!--
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
 
+No.
+
 ### Dependencies
 
 <!--
@@ -922,6 +1076,18 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 - Impact of its degraded performance or high-error rates on the feature:
 -->
 
+This feature depends on the cluster having CSI drivers and sidecars that use CSI
+spec v1.5.0 at minimum.
+
+- [CSI drivers and sidecars]
+  - Usage description:
+    - Impact of its outage on the feature: Inability to perform CSI storage
+      operations on ReadWriteOncePod PVCs and PVs (for example, provisioning
+      volumes)
+    - Impact of its degraded performance or high-error rates on the feature:
+      Increase in latency performing CSI storage operations (due to repeated
+      retries)
+
 ### Scalability
 
 <!--
@@ -1026,6 +1192,9 @@ details). For now, we leave it here.
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
+Existing ReadWriteOncePod volumes will continue working; however, users will not
+be able to make any changes to them.
+
 ###### What are other known failure modes?
 
 <!--
@@ -1041,8 +1210,12 @@ For each of them, fill in the following information by copying the below templat
 - Testing: Are there any tests for failure mode? If not, describe why.
 -->
 
+None.
+
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
+Roll back the feature by disabling the ReadWriteOncePod feature gate.
+
 ## Implementation History
 
 <!--
