OCPBUGS-62703: Relax duplicate events detection for Prometheus #30372
base: main
Conversation
Overrides the duplicate readiness error events' limit for Prometheus during upgrades. Since Prometheus needs some time to wind down (see [1]), Kubelet was emitting readiness error events for the span it took the pod to terminate. This ignores those events up to a limit of 100.

[1]: https://github.com/prometheus-operator/prometheus-operator/blob/d0ae00fdedc656a5a1a290d9839b84d860f15428/pkg/prometheus/common.go#L56-L59

Signed-off-by: Pranshu Srivastava <[email protected]>
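For readers outside the origin test suite, here is a minimal, self-contained sketch of the mechanism the PR relies on, using hypothetical types (`event`, `exception`) rather than the real pathological-event API: an exception matches the Prometheus readiness error events and raises the tolerated repeat count from the default of 20 (the point at which an event is flagged, per the discussion below) to 100.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// event is a simplified, hypothetical stand-in for a monitored duplicate event.
type event struct {
	Namespace    string
	Pod          string
	Reason       string
	HumanMessage string
	Count        int // how many times the event repeated
}

// exception describes one class of duplicate events and how many repeats to tolerate.
type exception struct {
	namespaceRe             *regexp.Regexp
	podPrefix               string
	reason                  string
	messageSubstring        string
	repeatThresholdOverride int // replaces the default limit of 20 when > 0
}

// matches reports whether the event is the kind this exception covers.
func (x *exception) matches(e event) bool {
	return x.namespaceRe.MatchString(e.Namespace) &&
		strings.HasPrefix(e.Pod, x.podPrefix) &&
		e.Reason == x.reason &&
		strings.Contains(e.HumanMessage, x.messageSubstring)
}

// tolerated reports whether a matching event stays under the (possibly overridden) limit.
func (x *exception) tolerated(e event) bool {
	limit := 20
	if x.repeatThresholdOverride > 0 {
		limit = x.repeatThresholdOverride
	}
	return e.Count <= limit
}

func main() {
	prometheusWindDown := &exception{
		namespaceRe:             regexp.MustCompile(`^openshift-monitoring$`),
		podPrefix:               "prometheus-k8s-",
		reason:                  "Unhealthy",
		messageSubstring:        "Readiness probe errored",
		repeatThresholdOverride: 100,
	}

	e := event{
		Namespace:    "openshift-monitoring",
		Pod:          "prometheus-k8s-0",
		Reason:       "Unhealthy",
		HumanMessage: "Readiness probe errored: ... container is stopping",
		Count:        55, // would trip the default limit of 20
	}

	fmt.Println(prometheusWindDown.matches(e) && prometheusWindDown.tolerated(e)) // true
}
```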
@rexagod: This pull request references Jira Issue OCPBUGS-62703, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
```go
name: "PrometheusReadinessProbeErrorsDuringUpgrades",
locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
	monitorapi.LocatorNamespaceKey:   regexp.MustCompile(`^` + statefulSetNamespace + `$`),
	monitorapi.LocatorStatefulSetKey: regexp.MustCompile(`^` + statefulSetName + `$`),
```
I'm using a StatefulSet locator key here since that's how we deploy Prometheus; however, the errors are reported pod-wise (see below), so I'm wondering if I should change this, as well as the testIntervals conditional below, to work with Pod keys instead of StatefulSet ones?
```
event happened 25 times, something is wrong: namespace/openshift-monitoring node/worker-0 pod/prometheus-k8s-0 hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (12:24:14Z) result=reject }
```
I'll trigger a small number of upgrade jobs to see if this works as is.
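A small illustration of the trade-off, assuming the pod locator value carries the pod name (e.g. `prometheus-k8s-0`, as in the event above; the real locator keys live in origin's monitorapi package): a pod-keyed regex has to allow for the ordinal suffix, whereas a StatefulSet-keyed regex can match the name exactly.

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	const statefulSetName = "prometheus-k8s"

	// StatefulSet-keyed locator: the event would have to carry the StatefulSet name itself.
	statefulSetRe := regexp.MustCompile(`^` + statefulSetName + `$`)

	// Pod-keyed locator: events are reported per pod, so allow the ordinal suffix.
	podRe := regexp.MustCompile(`^` + statefulSetName + `-[0-9]+$`)

	fmt.Println(statefulSetRe.MatchString("prometheus-k8s-0")) // false: the ordinal breaks the exact match
	fmt.Println(podRe.MatchString("prometheus-k8s-0"))         // true
}
```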
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance 3
@rexagod: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7fbee5c0-a7bd-11f0-983c-93024592481b-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
@rexagod: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e0d1c020-a809-11f0-8a17-a2fb9ee66422-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
I'll amend the approach to use …
EDIT: Not 100% sure the locators were the problem, since the substring I specified to look for here was …
@rexagod: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0fbe0ad0-b433-11f0-94a6-8dc5b4538807-0
@rexagod: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4331e300-b433-11f0-8cf4-836000ef77e7-0
Events fall under …
```go
})

if len(testIntervals) > 0 {
	// Readiness probe errors are expected during upgrades, allow a higher threshold.
```
Maybe make it explicit that readiness probes run during the entire lifecycle of the container (including termination), and that Prometheus may take "some time" to stop (hence the default termination grace period of 600s), which explains the probe errors (e.g. the web service is stopped but the process is still running).
```go
if len(testIntervals) > 0 {
	// Readiness probe errors are expected during upgrades, allow a higher threshold.
	// Set the threshold to 100 to allow for a high number of readiness probe errors
```
600s (termination grace period) / 5s (probe interval) = 120, but I agree that 100 is a good enough value.
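As a quick sanity check on that arithmetic, a tiny sketch under the same assumptions (600s termination grace period, 5s probe period, both taken from the discussion above):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	terminationGracePeriod := 600 * time.Second // default grace period discussed above
	probePeriod := 5 * time.Second              // assumed readiness probe interval

	// Worst case: every probe during the wind-down window errors.
	worstCase := int(terminationGracePeriod / probePeriod)
	fmt.Println(worstCase) // 120; the PR's threshold of 100 covers the typical case
}
```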
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 785a37c
```go
twoNodeEtcdEndpointsMatcher := newTwoNodeEtcdEndpointsConfigMissingEventMatcher(finalIntervals)
registry.AddPathologicalEventMatcherOrDie(twoNodeEtcdEndpointsMatcher)

prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)
```
The function name indicates this is intended to relax only during upgrades, but it's added to the universal set here, not the upgrade-specific set below. This should probably be moved down into that function below.
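A rough sketch of the suggested move, using hypothetical registry and constructor names rather than origin's real ones: keep the universal set free of this exception and register the matcher only when building the upgrade-specific set.

```go
package main

import "fmt"

// eventMatcher is a hypothetical stand-in for origin's pathological-event matcher type.
type eventMatcher struct{ name string }

// registry collects matchers whose duplicate events are tolerated.
type registry struct{ matchers []*eventMatcher }

func (r *registry) add(m *eventMatcher) { r.matchers = append(r.matchers, m) }

// newUniversalRegistry holds exceptions that apply to every job type.
func newUniversalRegistry() *registry {
	return &registry{}
}

// newUpgradeRegistry layers upgrade-only exceptions on top of the universal set;
// under this suggestion, the Prometheus readiness matcher would be registered here.
func newUpgradeRegistry() *registry {
	r := newUniversalRegistry()
	r.add(&eventMatcher{name: "PrometheusReadinessProbeErrorsDuringUpgrades"})
	return r
}

func main() {
	fmt.Println(len(newUniversalRegistry().matchers)) // 0: non-upgrade jobs keep the strict threshold
	fmt.Println(len(newUpgradeRegistry().matchers))   // 1: upgrade jobs tolerate the wind-down errors
}
```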
I saw this fail for (what look to be) non-upgrade jobs, such as periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance. I should probably rename this to drop the DuringUpgrades part so it isn't misleading.
PLMK if you still think this should be moved to upgrade jobs exclusively.
Is there an explanation for why Prometheus would be getting killed in a non-upgrade job? As far as I understand this, that would be quite unexpected and should still get flagged. Let me know if you have that job run handy, I'm curious.
From [openshift-origin-30372-nightly-4.21-e2e-agent-ha-dualstack-conformance], it seems to be a liveness probe failure:

| Time | Namespace | Source | Pod | Reason | Message |
| --- | --- | --- | --- | --- | --- |
| 22:57:35 (x6) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Liveness probe failed: command timed out |
| 22:57:35 | openshift-monitoring | kubelet | prometheus-k8s-0 | Killing | Container prometheus failed liveness probe, will be restarted |
The Prometheus container in the affected pod seems fine, but other containers in the pod seem to be facing connection issues (possibly due to https://issues.redhat.com/browse/OCPBUGS-32021)?
I doubt that https://issues.redhat.com/browse/OCPBUGS-32021 is involved here: the error logs come from the router, which opens a TCP connection and then drops it after a successful connect.
The link to the job above seemingly didn't work, but it looks like it's probably a different symptom and would not hit your matcher as defined. I think it's best to make this an upgrade-specific exception, as this is the only area where we're expecting this 10-minute delay.
I see; unlike https://issues.redhat.com/browse/OCPBUGS-5916, we are indeed able to establish a successful connection after retries (though IIUC the connection heals in the linked ticket as well, as there's no functionality disruption reported?).
Ah, my bad. I've appended more context to the snippet (PTAL below), and here's the event-filter link. As you can see, the readiness error will register a PathologicalNew error if not explicitly ignored (>20 pings), and AFAICT this will be caught by the matcher.
| Time | Namespace | Source | Pod | Reason | Message |
| --- | --- | --- | --- | --- | --- |
| 22:57:35 (x6) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Liveness probe failed: command timed out |
| 22:57:35 | openshift-monitoring | kubelet | prometheus-k8s-0 | Killing | Container prometheus failed liveness probe, will be restarted |
| 22:57:39 (x7) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Readiness probe failed: command timed out |
| 23:02:09 (x55) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 |
(this particular human message hits four cases covered in the tests, PTAL at https://github.com/openshift/origin/pull/30372/files#diff-d6cc316666a1dc1843f699fc90418fe34425e15b994f2c00d2390ba426cf33b3R719-R750)
Let's keep this upgrade-specific unless there's a very clear explanation of why this is expected and OK in a non-upgrade job, which AFAICT there is not. Looking at the intervals from your job run, you can see there is mass disruption at the time we lose the readiness probe. Problems occurring during that time are not the kind of thing we want to hide.
This dualstack job has a known bug related to that network disruption.
```go
strings.HasPrefix(eventInterval.Locator.Keys[monitorapi.LocatorPodKey], podNamePrefix) &&
	eventInterval.Message.Reason == messageReason &&
	strings.Contains(eventInterval.Message.HumanMessage, messageHumanizedSubstring)
})
```
Unless I'm missing something, looping through all intervals is not required. It effectively duplicates the matcher logic above, and I think all you really need here is to return that matcher with matcher.repeatThresholdOverride = 100 set; the framework should handle the rest as far as I can tell. Checking whether the matcher will match anything before setting its threshold looks unnecessary to me.
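A sketch of the simplification being suggested, with hypothetical types standing in for origin's real matcher and interval types: the constructor ignores the intervals entirely and just returns the matcher with its repeat-threshold override set to 100, leaving the framework to apply it only where the matcher matches.

```go
package main

import "fmt"

// matcher is a hypothetical stand-in for origin's pathological-event matcher type.
type matcher struct {
	name                    string
	messageSubstring        string
	repeatThresholdOverride int
}

// interval is a placeholder for a monitored interval; it is deliberately unused,
// since per the review suggestion there is no need to pre-scan the intervals.
type interval struct{}

func newPrometheusReadinessProbeErrorsMatcher(_ []interval) *matcher {
	return &matcher{
		name:                    "PrometheusReadinessProbeErrors",
		messageSubstring:        "Readiness probe errored",
		repeatThresholdOverride: 100,
	}
}

func main() {
	fmt.Printf("%+v\n", newPrometheusReadinessProbeErrorsMatcher(nil))
}
```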
/lgtm
Thank you for addressing this flake!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dgoodwin, rexagod. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@rexagod: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
