keps/sig-autoscaling/4951-configurable-hpa-tolerance/README.md
Lines changed: 49 additions & 10 deletions
@@ -106,16 +106,16 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
106
106
- [x] (R) KEP approvers have approved the KEP status as `implementable`
107
107
- [x] (R) Design details are appropriately documented
108
108
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
109
-
- [ ] e2e Tests for all Beta API Operations (endpoints)
109
+
- [x] e2e Tests for all Beta API Operations (endpoints)
110
110
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
111
111
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
112
112
- [ ] (R) Graduation criteria is in place
113
113
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
114
-
- [ ] (R) Production readiness review completed
115
-
- [ ] (R) Production readiness review approved
114
+
- [x] (R) Production readiness review completed
115
+
- [x] (R) Production readiness review approved
116
116
- [ ] "Implementation History" section is up-to-date for milestone
117
-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
118
-
- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
117
+
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
118
+
- [x] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
We will add a unit test verifying that HPAs with and without the new fields are
553
+
[Unit tests have been added](https://github.com/kubernetes/kubernetes/pull/130797/commits/a41284d9fa3a3d5a5e8760db6e9fd4f7e5e6fca6#diff-98f8520444a477d01c5cc2e56f92939d5fb07893a234b8fee5b67c7c147a20e0) to verify that HPAs with and without the new fields are
547
554
properly validated, whether or not the feature gate is enabled.
548
555
549
556
### Rollout, Upgrade and Rollback Planning
@@ -564,13 +571,20 @@ rollout. Similarly, consider large clusters and how enablement/disablement
564
571
will rollout across nodes.
565
572
-->
566
573
574
+
This feature does not introduce new failure modes: during rollout/rollback, some
575
+
API servers will allow or disallow setting the new `tolerance` field. The new
576
+
field may be ignored until the controller manager is fully updated.
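
The fallback described above can be sketched as follows. This is a minimal illustration, not the controller's actual code: `effectiveTolerance`, its parameters, and the example values are hypothetical names chosen for this sketch. It assumes that a controller which does not honor the per-HPA field (feature gate off, or field unset) keeps using the cluster-wide tolerance.

```go
package main

import "fmt"

// effectiveTolerance is a hypothetical helper written for illustration only.
// It returns the per-HPA tolerance when the feature gate is on and the field
// is set; otherwise it falls back to the cluster-wide tolerance.
func effectiveTolerance(gateEnabled bool, configured *float64, clusterWide float64) float64 {
	if gateEnabled && configured != nil {
		return *configured
	}
	return clusterWide
}

func main() {
	perHPA := 0.05          // example per-HPA value (assumption)
	const clusterWide = 0.1 // example cluster-wide value (assumption)

	// Controller manager not yet updated: the stored field is ignored.
	fmt.Println(effectiveTolerance(false, &perHPA, clusterWide))
	// Controller manager updated: the per-HPA field takes effect.
	fmt.Println(effectiveTolerance(true, &perHPA, clusterWide))
	// Field unset: the cluster-wide value still applies.
	fmt.Println(effectiveTolerance(true, nil, clusterWide))
}
```

Under these assumptions, a mixed-version rollout changes only which tolerance value is applied, not whether autoscaling works.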
577
+
567
578
###### What specific metrics should inform a rollback?
568
579
569
580
<!--
570
581
What signals should users be paying attention to when the feature is young
571
582
that might indicate a serious problem?
572
583
-->
573
584
585
+
A high `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`
586
+
value can indicate a problem related to this feature.
587
+
574
588
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
575
589
576
590
<!--
@@ -579,12 +593,18 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
579
593
are missing a bunch of machinery and tooling and can't do that now.
580
594
-->
581
595
596
+
A cluster upgrade was tested manually, and this feature has been in alpha without
597
+
(to the best of our knowledge) any user reporting an issue. GKE has automated
598
+
upgrade/downgrade tests that did not report any issue.
599
+
582
600
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
583
601
584
602
<!--
585
603
Even if applying deprecation policies, they may still surprise some users.
586
604
-->
587
605
606
+
No.
607
+
588
608
### Monitoring Requirements
589
609
590
610
<!--
@@ -625,9 +645,9 @@ values. Users can get both values using
625
645
and use them to verify that scaling events are triggered when their ratio is out
626
646
of tolerance.
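
The check a user would perform by hand can be sketched as below. This is a hedged illustration that assumes separate scale-up and scale-down tolerances and a ratio defined as current metric value over desired metric value; the `tolerances` type and `isWithin` method are names invented for this sketch, not a documented API.

```go
package main

import "fmt"

// tolerances holds hypothetical scale-up and scale-down tolerances,
// expressed as fractions (0.05 means 5%).
type tolerances struct {
	scaleUp   float64
	scaleDown float64
}

// isWithin reports whether the ratio of current to desired metric value
// falls inside the tolerance band, meaning no scaling event is expected.
func (t tolerances) isWithin(ratio float64) bool {
	return ratio >= 1.0-t.scaleDown && ratio <= 1.0+t.scaleUp
}

func main() {
	t := tolerances{scaleUp: 0.05, scaleDown: 0.2}

	samples := []struct{ current, desired float64 }{
		{110, 100}, // 10% above target
		{85, 100},  // 15% below target
		{75, 100},  // 25% below target
	}
	for _, s := range samples {
		ratio := s.current / s.desired
		fmt.Printf("ratio=%.2f withinTolerance=%t\n", ratio, t.isWithin(ratio))
	}
}
```

With these example values, only the second sample stays inside the band; the other two would be expected to trigger a scaling event.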
627
647
628
-
We will update the controller-manager logs to help users understand the behavior
629
-
of the autoscaler. The data added to the logs will include the tolerance used
630
-
for each scaling decision.
648
+
The [controller-manager logs have been updated](https://github.com/kubernetes/kubernetes/pull/130797/commits/2dd9eda47ffd5556ff90446e91d22ddbecc05d2c#diff-f1c5a31aa8fb8e3fd64b6aa13d3358b504e6e25030f249f1652e244c105eafc7R846)
649
+
to help users understand the behavior of the autoscaler. The data added to the
650
+
logs includes the tolerance used for each scaling decision.
631
651
632
652
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
633
653
@@ -698,6 +718,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
698
718
- Impact of its degraded performance or high-error rates on the feature:
699
719
-->
700
720
721
+
No, this feature does not depend on any specific service.
722
+
701
723
### Scalability
702
724
703
725
<!--
@@ -817,6 +839,8 @@ details). For now, we leave it here.
817
839
818
840
###### How does this feature react if the API server and/or etcd is unavailable?
819
841
842
+
API server or etcd issues do not impact this feature.
843
+
820
844
###### What are other known failure modes?
821
845
822
846
<!--
@@ -832,8 +856,14 @@ For each of them, fill in the following information by copying the below templat
832
856
- Testing: Are there any tests for failure mode? If not, describe why.
833
857
-->
834
858
859
+
We do not expect any new failure mode. (While setting inappropriate `tolerance`
860
+
values may cause HPAs to react too slowly or too quickly, the feature is working as
861
+
intended.)
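
To illustrate how the chosen value changes how quickly an HPA reacts, here is a simplified sketch. It assumes a single symmetric tolerance and the basic proportional rule (desired replicas = ceil(current replicas × ratio)); `desiredReplicas` is a hypothetical function for this example and omits everything else the real controller considers.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas is a simplified, hypothetical model of a scaling decision:
// if the metric ratio is within tolerance of 1.0 the replica count is kept,
// otherwise it is scaled proportionally and rounded up.
func desiredReplicas(current int32, ratio, tolerance float64) int32 {
	if math.Abs(ratio-1.0) <= tolerance {
		return current
	}
	return int32(math.Ceil(float64(current) * ratio))
}

func main() {
	// A large tolerance ignores a 25% overload: the HPA reacts slowly.
	fmt.Println(desiredReplicas(8, 1.25, 0.5))
	// A moderate tolerance reacts to the same overload.
	fmt.Println(desiredReplicas(8, 1.25, 0.1))
	// A zero tolerance reacts to even a small deviation.
	fmt.Println(desiredReplicas(16, 1.0625, 0))
}
```

In this model, both extremes behave exactly as configured, which is why they are not treated as failure modes.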
862
+
835
863
###### What steps should be taken if SLOs are not being met to determine the problem?
836
864
865
+
N/A.
866
+
837
867
## Implementation History
838
868
839
869
<!--
@@ -848,13 +878,17 @@ Major milestones might include: