OCPBUGS-62703: Relax duplicate events detection for Prometheus #30372
base: main
Conversation
Overrides the duplicate readiness error events' limit for Prometheus during upgrades. Since Prometheus needs some time to wind down (see [1]), Kubelet was emitting readiness error events for the span it took the pod to terminate. This ignores those events up to a limit of 100.

[1]: https://github.com/prometheus-operator/prometheus-operator/blob/d0ae00fdedc656a5a1a290d9839b84d860f15428/pkg/prometheus/common.go#L56-L59

Signed-off-by: Pranshu Srivastava <[email protected]>
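For readers outside the origin test suite, here is a minimal, self-contained sketch of the mechanism the PR relies on, using hypothetical types (`event`, `exception`) rather than the real pathological-event API: an exception matches the Prometheus readiness error events and raises the tolerated repeat count from the default of 20 (the point at which an event is flagged, per the discussion below) to 100.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// event is a simplified, hypothetical stand-in for a monitored duplicate event.
type event struct {
	Namespace    string
	Pod          string
	Reason       string
	HumanMessage string
	Count        int // how many times the event repeated
}

// exception describes one class of duplicate events and how many repeats to tolerate.
type exception struct {
	namespaceRe             *regexp.Regexp
	podPrefix               string
	reason                  string
	messageSubstring        string
	repeatThresholdOverride int // replaces the default limit of 20 when > 0
}

// matches reports whether the event is the kind this exception covers.
func (x *exception) matches(e event) bool {
	return x.namespaceRe.MatchString(e.Namespace) &&
		strings.HasPrefix(e.Pod, x.podPrefix) &&
		e.Reason == x.reason &&
		strings.Contains(e.HumanMessage, x.messageSubstring)
}

// tolerated reports whether a matching event stays under the (possibly overridden) limit.
func (x *exception) tolerated(e event) bool {
	limit := 20
	if x.repeatThresholdOverride > 0 {
		limit = x.repeatThresholdOverride
	}
	return e.Count <= limit
}

func main() {
	prometheusWindDown := &exception{
		namespaceRe:             regexp.MustCompile(`^openshift-monitoring$`),
		podPrefix:               "prometheus-k8s-",
		reason:                  "Unhealthy",
		messageSubstring:        "Readiness probe errored",
		repeatThresholdOverride: 100,
	}

	e := event{
		Namespace:    "openshift-monitoring",
		Pod:          "prometheus-k8s-0",
		Reason:       "Unhealthy",
		HumanMessage: "Readiness probe errored: ... container is stopping",
		Count:        55, // would trip the default limit of 20
	}

	fmt.Println(prometheusWindDown.matches(e) && prometheusWindDown.tolerated(e)) // true
}
```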
@rexagod: This pull request references Jira Issue OCPBUGS-62703, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
```go
name: "PrometheusReadinessProbeErrorsDuringUpgrades",
locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
	monitorapi.LocatorNamespaceKey:   regexp.MustCompile(`^` + statefulSetNamespace + `$`),
	monitorapi.LocatorStatefulSetKey: regexp.MustCompile(`^` + statefulSetName + `$`),
```
I'm using a StatefulSet locator key here since that's how we deploy Prometheus; however, the errors are reported pod-wise (see below), so I'm wondering if I should change this, as well as the testIntervals conditional below, to work with Pod keys instead of StatefulSet ones?
```
event happened 25 times, something is wrong: namespace/openshift-monitoring node/worker-0 pod/prometheus-k8s-0 hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (12:24:14Z) result=reject }
```
I'll trigger a small number of upgrade jobs to see if this works as is.
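A small illustration of the trade-off, assuming the pod locator value carries the pod name (e.g. `prometheus-k8s-0`, as in the event above; the real locator keys live in origin's monitorapi package): a pod-keyed regex has to allow for the ordinal suffix, whereas a StatefulSet-keyed regex can match the name exactly.

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	const statefulSetName = "prometheus-k8s"

	// StatefulSet-keyed locator: the event would have to carry the StatefulSet name itself.
	statefulSetRe := regexp.MustCompile(`^` + statefulSetName + `$`)

	// Pod-keyed locator: events are reported per pod, so allow the ordinal suffix.
	podRe := regexp.MustCompile(`^` + statefulSetName + `-[0-9]+$`)

	fmt.Println(statefulSetRe.MatchString("prometheus-k8s-0")) // false: the ordinal breaks the exact match
	fmt.Println(podRe.MatchString("prometheus-k8s-0"))         // true
}
```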
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance 3
@rexagod: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7fbee5c0-a7bd-11f0-983c-93024592481b-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
@rexagod: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e0d1c020-a809-11f0-8a17-a2fb9ee66422-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
I'll amend the approach to use …
EDIT: Not 100% sure the locators were the problem, since the substring I specified to look for here was …
@rexagod: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0fbe0ad0-b433-11f0-94a6-8dc5b4538807-0
@rexagod: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4331e300-b433-11f0-8cf4-836000ef77e7-0
Events fall under …
```go
})

if len(testIntervals) > 0 {
	// Readiness probe errors are expected during upgrades, allow a higher threshold.
```
Maybe make it explicit that readiness probes run during the entire lifecycle of the container (including termination), and that Prometheus may take "some time" to stop (hence the default termination grace period of 600s), which explains the probe errors (e.g. the web service is stopped but the process is still running).
```go
if len(testIntervals) > 0 {
	// Readiness probe errors are expected during upgrades, allow a higher threshold.
	// Set the threshold to 100 to allow for a high number of readiness probe errors
```
600s (termination grace period) / 5s (probe interval) = 120, but I agree that 100 is a good enough value.
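As a quick sanity check on that arithmetic, a tiny sketch under the same assumptions (600s termination grace period, 5s probe period, both taken from the discussion above):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	terminationGracePeriod := 600 * time.Second // default grace period discussed above
	probePeriod := 5 * time.Second              // assumed readiness probe interval

	// Worst case: every probe during the wind-down window errors.
	worstCase := int(terminationGracePeriod / probePeriod)
	fmt.Println(worstCase) // 120; the PR's threshold of 100 covers the typical case
}
```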
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 785a37c
```go
twoNodeEtcdEndpointsMatcher := newTwoNodeEtcdEndpointsConfigMissingEventMatcher(finalIntervals)
registry.AddPathologicalEventMatcherOrDie(twoNodeEtcdEndpointsMatcher)

prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)
```
The function name indicates this is intended to relax only during upgrades, but it's added to the universal set here, not the upgrade-specific set below. This should probably be moved down into that function below.
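A rough sketch of the suggested move, using hypothetical registry and constructor names rather than origin's real ones: keep the universal set free of this exception and register the matcher only when building the upgrade-specific set.

```go
package main

import "fmt"

// eventMatcher is a hypothetical stand-in for origin's pathological-event matcher type.
type eventMatcher struct{ name string }

// registry collects matchers whose duplicate events are tolerated.
type registry struct{ matchers []*eventMatcher }

func (r *registry) add(m *eventMatcher) { r.matchers = append(r.matchers, m) }

// newUniversalRegistry holds exceptions that apply to every job type.
func newUniversalRegistry() *registry {
	return &registry{}
}

// newUpgradeRegistry layers upgrade-only exceptions on top of the universal set;
// under this suggestion, the Prometheus readiness matcher would be registered here.
func newUpgradeRegistry() *registry {
	r := newUniversalRegistry()
	r.add(&eventMatcher{name: "PrometheusReadinessProbeErrorsDuringUpgrades"})
	return r
}

func main() {
	fmt.Println(len(newUniversalRegistry().matchers)) // 0: non-upgrade jobs keep the strict threshold
	fmt.Println(len(newUpgradeRegistry().matchers))   // 1: upgrade jobs tolerate the wind-down errors
}
```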
I saw this fail for (what look to be) non-upgrade jobs, such as periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance. I should probably rename this to drop the DuringUpgrades part so it isn't misleading.
PLMK if you still think this should be moved to upgrade jobs exclusively.
Is there an explanation for why Prometheus would be getting killed in a non-upgrade job? As far as I understand this, that would be quite unexpected and should still get flagged. Let me know if you have that job run handy, I'm curious.
From [openshift-origin-30372-nightly-4.21-e2e-agent-ha-dualstack-conformance], it seems to be a liveness probe failure:

| Time | Namespace | Source | Pod | Reason | Message |
| --- | --- | --- | --- | --- | --- |
| 22:57:35 (x6) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Liveness probe failed: command timed out |
| 22:57:35 | openshift-monitoring | kubelet | prometheus-k8s-0 | Killing | Container prometheus failed liveness probe, will be restarted |
The Prometheus container in the affected pod seems fine, but other containers in the pod seem to be facing connection issues (possibly due to https://issues.redhat.com/browse/OCPBUGS-32021)?
I doubt that https://issues.redhat.com/browse/OCPBUGS-32021 is involved here: the error logs come from the router, which opens a TCP connection and then drops it after a successful connect.
The link to the job above seemingly didn't work, but it looks like it's probably a different symptom and would not hit your matcher as defined. I think it's best to make this an upgrade-specific exception, as this is the only area where we're expecting this 10-minute delay.
I see; unlike https://issues.redhat.com/browse/OCPBUGS-5916, we are indeed able to establish a successful connection after retries (though IIUC the connection heals in the linked ticket as well, as there's no functionality disruption reported?).
Ah, my bad. I've appended more context to the snippet (PTAL below), and here's the event-filter link. As you can see, the readiness error will register a PathologicalNew error if not explicitly ignored (>20 pings), and AFAICT this will be caught by the matcher.
| Time | Namespace | Source | Pod | Reason | Message |
| --- | --- | --- | --- | --- | --- |
| 22:57:35 (x6) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Liveness probe failed: command timed out |
| 22:57:35 | openshift-monitoring | kubelet | prometheus-k8s-0 | Killing | Container prometheus failed liveness probe, will be restarted |
| 22:57:39 (x7) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Readiness probe failed: command timed out |
| 23:02:09 (x55) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 |
(this particular human message hits four cases covered in the tests, PTAL at https://github.com/openshift/origin/pull/30372/files#diff-d6cc316666a1dc1843f699fc90418fe34425e15b994f2c00d2390ba426cf33b3R719-R750)
Let's keep this upgrade-specific unless there's a very clear explanation of why this is expected and OK in a non-upgrade job, which AFAICT there is not. Looking at the intervals from your job run, you can see there is mass disruption at the time we lose the readiness probe. Problems occurring during that time are not the kind of thing we want to hide.
This dualstack job has a known bug related to that network disruption.
```go
strings.HasPrefix(eventInterval.Locator.Keys[monitorapi.LocatorPodKey], podNamePrefix) &&
	eventInterval.Message.Reason == messageReason &&
	strings.Contains(eventInterval.Message.HumanMessage, messageHumanizedSubstring)
})
```
Unless I'm missing something, looping through all intervals is not required. It effectively duplicates the matcher logic above, and I think all you really need here is to return that matcher with matcher.repeatThresholdOverride = 100 set; the framework should handle the rest as far as I can tell. Checking whether the matcher will match anything before setting its threshold looks unnecessary to me.
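A sketch of the simplification being suggested, with hypothetical types standing in for origin's real matcher and interval types: the constructor ignores the intervals entirely and just returns the matcher with its repeat-threshold override set to 100, leaving the framework to apply it only where the matcher matches.

```go
package main

import "fmt"

// matcher is a hypothetical stand-in for origin's pathological-event matcher type.
type matcher struct {
	name                    string
	messageSubstring        string
	repeatThresholdOverride int
}

// interval is a placeholder for a monitored interval; it is deliberately unused,
// since per the review suggestion there is no need to pre-scan the intervals.
type interval struct{}

func newPrometheusReadinessProbeErrorsMatcher(_ []interval) *matcher {
	return &matcher{
		name:                    "PrometheusReadinessProbeErrors",
		messageSubstring:        "Readiness probe errored",
		repeatThresholdOverride: 100,
	}
}

func main() {
	fmt.Printf("%+v\n", newPrometheusReadinessProbeErrorsMatcher(nil))
}
```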
/lgtm
Thank you for addressing this flake!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dgoodwin, rexagod. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@rexagod: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
