OCPBUGS-62703: Relax duplicate events detection for Prometheus #30372
```diff
@@ -502,6 +502,9 @@ func NewUniversalPathologicalEventMatchers(kubeConfig *rest.Config, finalInterva
 	twoNodeEtcdEndpointsMatcher := newTwoNodeEtcdEndpointsConfigMissingEventMatcher(finalIntervals)
 	registry.AddPathologicalEventMatcherOrDie(twoNodeEtcdEndpointsMatcher)
 
+	prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)
+	registry.AddPathologicalEventMatcherOrDie(prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher)
+
 	return registry
 }
```

```diff
@@ -1171,3 +1174,66 @@ func newCrioReloadedTooOftenEventMatcher(finalInternals monitorapi.Intervals) Ev
 		allowIfWithinIntervals: crioReloadedIntervals,
 	}
 }
+
+func newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals monitorapi.Intervals) EventMatcher {
+	podNamePrefix := "prometheus-k8s"
+	podNamespace := "openshift-monitoring"
+	messageHumanizedSubstring := "Readiness probe errored"
+	messageReason := monitorapi.UnhealthyReason
+	matcher := &SimplePathologicalEventMatcher{
+		name: "PrometheusReadinessProbeErrorsDuringUpgrades",
+		locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
+			monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^` + podNamespace + `$`),
+			monitorapi.LocatorPodKey:       regexp.MustCompile(`^` + podNamePrefix + `-[0,1]$`),
+		},
+		messageReasonRegex: regexp.MustCompile(`^` + string(messageReason) + `$`),
+		messageHumanRegex:  regexp.MustCompile(messageHumanizedSubstring),
+		jira:               "https://issues.redhat.com/browse/OCPBUGS-62703",
+	}
+
+	// Sanity check in case no `finalIntervals` are provided.
+	if finalIntervals == nil || len(finalIntervals) == 0 {
+		matcher.neverAllow = true
+		return matcher
+	}
+
+	/*
+		Example events this matcher targets:
+
+		05:50:32 openshift-monitoring kubelet prometheus-k8s-1
+		Unhealthy
+		Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 58577e7deb7b8ae87b8029b9988fa268613748d0743ce989748f27e52b199ef5 is running failed: container process not found
+
+		05:53:52 (x25) openshift-monitoring kubelet prometheus-k8s-0
+		Unhealthy
+		Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1
+
+		11:44:16 (x56) openshift-monitoring kubelet prometheus-k8s-0
+		Unhealthy
+		Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1
+	*/
+	testIntervals := finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool {
+		return eventInterval.Locator.Type == monitorapi.LocatorTypePod &&
+			eventInterval.Locator.Keys[monitorapi.LocatorNamespaceKey] == podNamespace &&
+			strings.HasPrefix(eventInterval.Locator.Keys[monitorapi.LocatorPodKey], podNamePrefix) &&
+			eventInterval.Message.Reason == messageReason &&
+			strings.Contains(eventInterval.Message.HumanMessage, messageHumanizedSubstring)
+	})
+
+	if len(testIntervals) > 0 {
+		/*
+			Readiness probes run during the lifecycle of the container, including termination.
+			Prometheus pods may take some time to stop, and thus result in more kubelet pings than permitted by default (20).
+			With a termination grace period of 600s, these pods may lead to probe errors (e.g. the web service is stopped but the process is still running), which is expected during upgrades.
+
+			To address this, set the threshold to 100 (approximately 600s termination period / 5s probe interval), allowing a high number of readiness probe errors during the upgrade but not so high that we would miss a real problem.
+			The job below hit ~60 readiness errors during the upgrade, which makes sense to ignore:
+			https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade/1977094149035266048
+			However, the job below hit readiness errors 774 times during the upgrade, which should be caught:
+			https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-metal-ovn-single-node-rt-upgrade-test/1975691393640697856
+		*/
+		matcher.repeatThresholdOverride = 100
+	} else {
+		matcher.neverAllow = true
+	}
+
+	return matcher
+}
```
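For reviewers who want to poke at the matcher's regexes in isolation, here is a minimal, self-contained sketch (not part of the diff) that runs the same namespace/pod/message patterns against inputs modeled on the sample events quoted above; the sample table and the standalone `main` are illustrative assumptions only.

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Same patterns as in the matcher above (values copied from the diff).
	namespaceRe := regexp.MustCompile(`^openshift-monitoring$`)
	podRe := regexp.MustCompile(`^prometheus-k8s-[0,1]$`)
	humanRe := regexp.MustCompile(`Readiness probe errored`)

	// Inputs loosely modeled on the kubelet events quoted in the code comment,
	// plus two counterexamples; these are illustrative, not real intervals.
	samples := []struct {
		namespace, pod, message string
	}{
		{"openshift-monitoring", "prometheus-k8s-1", "Readiness probe errored: rpc error: code = NotFound desc = container is not created or running"},
		{"openshift-monitoring", "prometheus-k8s-0", "Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error"},
		{"openshift-monitoring", "prometheus-k8s-0", "Liveness probe failed: HTTP probe failed"},
		{"openshift-ingress", "router-default-1", "Readiness probe errored: connection refused"},
	}

	for _, s := range samples {
		matched := namespaceRe.MatchString(s.namespace) &&
			podRe.MatchString(s.pod) &&
			humanRe.MatchString(s.message)
		fmt.Printf("%-22s %-18s matched=%v\n", s.namespace, s.pod, matched)
	}
}
```

The last two rows fall through because of the message substring and namespace checks; note the real matcher additionally requires the `Unhealthy` reason, which this sketch omits.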
The function name indicates this is intended to relax detection only during upgrades, but it's added to the universal set here, not the upgrade-specific set below. This should probably be moved down into that function.
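If the registration were moved as suggested, the change might look roughly like the sketch below; the name, signature, and return type of the upgrade-specific constructor (`NewUpgradePathologicalEventMatchers`) are assumed from the reviewer's reference to the upgrade-specific set and may not match the actual file.

```diff
 func NewUniversalPathologicalEventMatchers(kubeConfig *rest.Config, finalIntervals monitorapi.Intervals) *AllowedPathologicalEventRegistry {
 	// ...
-	prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)
-	registry.AddPathologicalEventMatcherOrDie(prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher)
-
 	return registry
 }

 // Upgrade-specific registry; builds on the universal set (name assumed).
 func NewUpgradePathologicalEventMatchers(kubeConfig *rest.Config, finalIntervals monitorapi.Intervals) *AllowedPathologicalEventRegistry {
 	registry := NewUniversalPathologicalEventMatchers(kubeConfig, finalIntervals)
 	// ...
+	prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)
+	registry.AddPathologicalEventMatcherOrDie(prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher)
 	return registry
 }
```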
I saw this fail for (what looks to be) non-upgrade jobs, such as periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance. I should probably rename this to drop the `DuringUpgrades` part so it isn't misleading. Please let me know if you still think this should be moved to upgrade jobs exclusively.
Is there an explanation for why Prometheus would be getting killed in a non-upgrade job? As far as I understand it, that would be quite unexpected and should still get flagged. Let me know if you have that job run handy; I'm curious.
From [openshift-origin-30372-nightly-4.21-e2e-agent-ha-dualstack-conformance], it seems to be a liveness probe failure: the Prometheus container in the affected pod looks fine, but other containers in the pod seem to be facing connection issues (possibly due to https://issues.redhat.com/browse/OCPBUGS-32021)?
I doubt that https://issues.redhat.com/browse/OCPBUGS-32021 is involved here: the error logs come from the router, which opens a TCP connection and then drops it after a successful connect.
The link to the job above didn't seem to work, but it looks like it's probably a different symptom and would not hit your matcher as defined. I think it's best to make this an upgrade-specific exception, as upgrades are the only place we're expecting this 10-minute delay.
I see; unlike https://issues.redhat.com/browse/OCPBUGS-5916, we are indeed able to establish a successful connection after retries (though IIUC the connection heals in the linked ticket as well, as there's no functionality disruption reported?).
Ah, my bad. I've appended more context to the snippet (PTAL below), and here's the event-filter link. As you can see, the readiness error will register a `PathologicalNew` error if not explicitly ignored (>20 pings), and AFAICT this will be caught by the matcher.
(this particular human message hits four cases covered in the tests, PTAL at https://github.com/openshift/origin/pull/30372/files#diff-d6cc316666a1dc1843f699fc90418fe34425e15b994f2c00d2390ba426cf33b3R719-R750)
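To make the numbers in this thread concrete, here is a small standalone sketch (assuming, as the code comment describes, that an event is flagged as pathological once its repeat count exceeds the threshold) comparing the default threshold of 20 with the proposed override of 100 against the repeat counts mentioned in this PR:

```go
package main

import "fmt"

func main() {
	const defaultThreshold = 20   // default repeat limit mentioned in the code comment
	const overrideThreshold = 100 // repeatThresholdOverride set by this PR

	// Repeat counts seen in this PR: x25 and x56 from the sample events,
	// ~60 from the AWS upgrade job, 774 from the single-node RT upgrade job.
	for _, repeats := range []int{25, 56, 60, 774} {
		fmt.Printf("repeats=%3d  flagged@%d=%-5v  flagged@%d=%v\n",
			repeats,
			defaultThreshold, repeats > defaultThreshold,
			overrideThreshold, repeats > overrideThreshold)
	}
}
```

Under the default all four counts would be flagged; with the override only the 774-repeat single-node RT job is, which matches the intent described in the code comment.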
Let's keep this upgrade-specific unless there's a very clear explanation for why this is expected and OK in a non-upgrade job, which AFAICT there is not. Looking at the intervals from your job run, you can see there is mass disruption at the time we lose the readiness probe. Problems occurring during that time are not the kind of thing we want to hide.
This dualstack job has a known bug related to that network disruption.