
Conversation

@rexagod
Member

@rexagod rexagod commented Oct 12, 2025

Overrides the duplicate readiness probe error events' limit for Prometheus
during upgrades. Since Prometheus needs some time to wind down (see
[1]), the kubelet emits readiness probe error events for as long as the
pod takes to terminate. This ignores those events, up to a limit of 100.

[1]: https://github.com/prometheus-operator/prometheus-operator/blob/d0ae00fdedc656a5a1a290d9839b84d860f15428/pkg/prometheus/common.go#L56-L59

Signed-off-by: Pranshu Srivastava <[email protected]>
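
In code, the intended exception looks roughly like the sketch below. The struct name and the messageHumanRegex field are assumptions based on the pathological-event-matcher library's conventions; name, locatorKeyRegexes, and repeatThresholdOverride are taken from the diffs discussed later in this thread.

```go
// Sketch only: allow up to 100 repeats of the kubelet's readiness probe error
// event for the prometheus-k8s StatefulSet while it winds down during upgrades.
&SimplePathologicalEventMatcher{
	name: "PrometheusReadinessProbeErrorsDuringUpgrades",
	locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
		monitorapi.LocatorNamespaceKey:   regexp.MustCompile(`^` + statefulSetNamespace + `$`),
		monitorapi.LocatorStatefulSetKey: regexp.MustCompile(`^` + statefulSetName + `$`),
	},
	messageHumanRegex:       regexp.MustCompile(`Readiness probe errored`),
	repeatThresholdOverride: 100,
}
```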
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Oct 12, 2025
@openshift-ci-robot

@rexagod: This pull request references Jira Issue OCPBUGS-62703, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Overrides the duplicate readiness error events' limit for Prometheus during upgrades. Since Prometheus needs some time to wind down (see 1), it was causing Kubelet to exhibit readiness error events during the time span it took to terminate. This ignores those pings to a limit (100).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from p0lyn0mial and sjenning October 12, 2025 20:58
name: "PrometheusReadinessProbeErrorsDuringUpgrades",
locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^` + statefulSetNamespace + `$`),
monitorapi.LocatorStatefulSetKey: regexp.MustCompile(`^` + statefulSetName + `$`),
Member Author

I'm using a StatefulSet locator key here since that's how we deploy Prometheus; however, the errors are reported per pod (see below). Should I change this, as well as the testIntervals conditional below, to work with Pod keys instead of StatefulSet ones?

event happened 25 times, something is wrong: namespace/openshift-monitoring node/worker-0 pod/prometheus-k8s-0 hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (12:24:14Z) result=reject }

I'll trigger a small number of upgrade jobs to see if this works as is.
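
For reference, a pod-keyed locator would look roughly like this (a sketch; the `-\d+` ordinal suffix for the StatefulSet's pod names is an assumption):

```go
// Sketch only: key the matcher on the pod reported by the kubelet
// (pod/prometheus-k8s-<ordinal>) instead of the owning StatefulSet.
locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
	monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^` + statefulSetNamespace + `$`),
	monitorapi.LocatorPodKey:       regexp.MustCompile(`^` + statefulSetName + `-\d+$`),
},
```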

@rexagod
Member Author

rexagod commented Oct 12, 2025

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance 3

periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance has the highest count in these results over the last couple of days, so I'm going with that here.

@openshift-ci
Contributor

openshift-ci bot commented Oct 12, 2025

@rexagod: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7fbee5c0-a7bd-11f0-983c-93024592481b-0

@rexagod
Member Author

rexagod commented Oct 13, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
/payload-job periodic-ci-openshift-multiarch-master-nightly-4.20-upgrade-from-stable-4.19-ocp-e2e-upgrade-aws-ovn-multi-a-a
/payload-job periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade
/payload-job periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Oct 13, 2025

@rexagod: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
  • periodic-ci-openshift-multiarch-master-nightly-4.20-upgrade-from-stable-4.19-ocp-e2e-upgrade-aws-ovn-multi-a-a
  • periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e0d1c020-a809-11f0-8a17-a2fb9ee66422-0

@rexagod
Member Author

rexagod commented Oct 28, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
/payload-job periodic-ci-openshift-multiarch-master-nightly-4.21-upgrade-from-stable-4.20-ocp-e2e-upgrade-azure-ovn-multi-a-a
/payload-job periodic-ci-openshift-machine-config-operator-release-4.19-periodics-e2e-azure-mco-disruptive-techpreview

periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance surfaced the same error:

[Monitor:legacy-test-framework-invariants-pathological][sig-arch] events should not repeat pathologically for ns/openshift-monitoring (0s)

{  1 events happened too frequently  event happened 56 times, something is wrong: namespace/openshift-monitoring node/worker-0 pod/prometheus-k8s-0 hmsg/00456c88f7 - reason/Unhealthy Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (11:44:16Z) result=reject }

I'll amend the approach to use podLocator instead of the statefulSetLocator.

EDIT: Not 100% sure the locators were the problem, since the substring I specified to look for here was messageHumanizedSubstring := "Readiness probe errored: rpc error", but we saw "Readiness probe errored and resulted in unknown state: rpc error:" in the linked CI job above.
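
If the substring is the culprit, a single pattern covering both variants should work; a sketch (assuming the matcher can take a regex instead of a plain substring):

```go
// Sketch only: match both "Readiness probe errored: rpc error" and
// "Readiness probe errored and resulted in unknown state: rpc error".
messageHumanRegex := regexp.MustCompile(`Readiness probe errored( and resulted in unknown state)?: rpc error`)
```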

@openshift-ci
Contributor

openshift-ci bot commented Oct 28, 2025

@rexagod: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
  • periodic-ci-openshift-multiarch-master-nightly-4.21-upgrade-from-stable-4.20-ocp-e2e-upgrade-azure-ovn-multi-a-a
  • periodic-ci-openshift-machine-config-operator-release-4.19-periodics-e2e-azure-mco-disruptive-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0fbe0ad0-b433-11f0-94a6-8dc5b4538807-0

@openshift-ci
Contributor

openshift-ci bot commented Oct 28, 2025

@rexagod: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
  • periodic-ci-openshift-multiarch-master-nightly-4.21-upgrade-from-stable-4.20-ocp-e2e-upgrade-azure-ovn-multi-a-a
  • periodic-ci-openshift-machine-config-operator-release-4.19-periodics-e2e-azure-mco-disruptive-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4331e300-b433-11f0-8cf4-836000ef77e7-0

@rexagod
Member Author

rexagod commented Oct 29, 2025

Events fall under PathologicalKnown now, not PathologicalNew (CIPI).

[Screenshot attached: "Screenshot 2025-10-29 at 6 09 23 PM"]

})

if len(testIntervals) > 0 {
	// Readiness probe errors are expected during upgrades, allow a higher threshold.
Contributor

Maybe make it explicit that readiness probes run during the entire lifecycle of the container (including termination), and that Prometheus may take "some time" to stop (hence the default termination grace period of 600s), which explains the probe errors (e.g. the web service is stopped but the process is still running).

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes
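
Something along these lines, for example (a sketch of the suggested wording only):

```go
// Readiness probes keep running for the container's entire lifecycle, including
// termination. Prometheus can take a while to stop (hence its default
// terminationGracePeriodSeconds of 600s): its web service goes away early while
// the process is still winding down, so every readiness probe in that window
// errors out. Allow a higher repeat threshold for those events.
```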


if len(testIntervals) > 0 {
	// Readiness probe errors are expected during upgrades, allow a higher threshold.
	// Set the threshold to 100 to allow for a high number of readiness probe errors
Contributor

600s (termination grace period) / 5s (probe interval) = 120, but I agree that 100 is a good enough value.
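
Spelled out as constants (the 5s probe interval is the value assumed in the comment above):

```go
// Sketch of the back-of-the-envelope bound behind the threshold.
const (
	terminationGracePeriodSeconds   = 600 // Prometheus' default termination grace period
	readinessProbePeriodSeconds     = 5   // probe interval assumed above
	maxExpectedReadinessProbeErrors = terminationGracePeriodSeconds / readinessProbePeriodSeconds // 120
)
// 100 stays under that worst case but is far above the default repeat threshold of 20.
```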

@openshift-trt

openshift-trt bot commented Oct 30, 2025

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New tests seen in this PR at sha: 785a37c

  • "[sig-storage][OCPFeature:StorageNetworkPolicy] Storage Network Policy should ensure required NetworkPolicies exist with correct labels [Suite:openshift/conformance/parallel]" [Total: 5, Pass: 5, Fail: 0, Flake: 0]
  • "[sig-storage][OCPFeature:StorageNetworkPolicy] Storage Network Policy should verify required labels for CSI related Operators [Suite:openshift/conformance/parallel]" [Total: 5, Pass: 5, Fail: 0, Flake: 0]
  • "[sig-storage][OCPFeature:StorageNetworkPolicy] Storage Network Policy should verify required labels for CSO related Operators [Suite:openshift/conformance/parallel]" [Total: 5, Pass: 5, Fail: 0, Flake: 0]

twoNodeEtcdEndpointsMatcher := newTwoNodeEtcdEndpointsConfigMissingEventMatcher(finalIntervals)
registry.AddPathologicalEventMatcherOrDie(twoNodeEtcdEndpointsMatcher)

prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)
Contributor

The function name indicates this is intended to relax the threshold only during upgrades, but it's added to the universal set here, not the upgrade-specific set below. This should probably be moved down into that function.

Member Author

I saw this fail for (what look to be) non-upgrade jobs, such as periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance. I should probably rename this to drop the DuringUpgrades part so it's not misleading.

PLMK if you still think this should be moved to upgrade jobs exclusively.

Contributor

Is there an explanation for why prom would be getting killed in a non-upgrade job? As far as I understand this, that would be quite unexpected and should still get flagged. Let me know if you have that job run handy; I'm curious.

Member Author

From [openshift-origin-30372-nightly-4.21-e2e-agent-ha-dualstack-conformance], it seems to be a liveness probe failure:

22:57:35 (x6) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Liveness probe failed: command timed out
22:57:35 | openshift-monitoring | kubelet | prometheus-k8s-0 | Killing | Container prometheus failed liveness probe, will be restarted

The Prometheus container in the affected pod seems fine, but other containers in the pod seem to be facing connection issues (possibly due to https://issues.redhat.com/browse/OCPBUGS-32021)?

Contributor

I doubt that https://issues.redhat.com/browse/OCPBUGS-32021 is involved here: the error logs come from the router, which opens a TCP connection and then drops it after a successful connect.

Contributor

The link to the job above didn't seem to work, but it looks like it's probably a different symptom and would not hit your matcher as defined. I think it's best to make this an upgrade-specific exception, as that is the only place we're expecting this 10-minute delay.

Member Author

@rexagod rexagod Nov 4, 2025

I see; unlike https://issues.redhat.com/browse/OCPBUGS-5916, we are indeed able to establish a successful connection after retries (though IIUC the connection heals in the linked ticket as well, as there's no functional disruption reported?).

Ah, my bad. I've appended more context to the snippet (PTAL below), and here's the event-filter link. As you can see, the readiness error will register as a PathologicalNew error if not explicitly ignored (>20 pings), and AFAICT this will be caught by the matcher.

22:57:35 (x6) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Liveness probe failed: command timed out
22:57:35 | openshift-monitoring | kubelet | prometheus-k8s-0 | Killing | Container prometheus failed liveness probe, will be restarted
22:57:39 (x7) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Readiness probe failed: command timed out
23:02:09 (x55) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1

Contributor

@dgoodwin dgoodwin Nov 4, 2025

Let's keep this upgrade-specific unless there's a very clear explanation of why this is expected and OK in a non-upgrade job, which AFAICT there is not. Looking at the intervals from your job run, you can see there is mass disruption at the time we lose the readiness probe. Problems occurring during that time are not the kind of thing we want to hide.

This dualstack job has a known bug related to that network disruption.
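
Concretely, that would mean registering the matcher where the upgrade-specific set is assembled rather than in the universal one; roughly (the surrounding function name and registry type below are assumptions, while the constructor and AddPathologicalEventMatcherOrDie come from this PR's diff):

```go
// Sketch only: keep the relaxed threshold out of the universal set and register
// it with the upgrade-specific matchers instead.
func addUpgradeOnlyPathologicalEventMatchers(registry *AllowedPathologicalEventRegistry, finalIntervals monitorapi.Intervals) {
	m := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)
	registry.AddPathologicalEventMatcherOrDie(m)
}
```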

strings.HasPrefix(eventInterval.Locator.Keys[monitorapi.LocatorPodKey], podNamePrefix) &&
eventInterval.Message.Reason == messageReason &&
strings.Contains(eventInterval.Message.HumanMessage, messageHumanizedSubstring)
})
Contributor

Unless I'm missing something, looping through all intervals here is not required. It effectively duplicates the matcher logic above, and I think all you really need is to return that matcher with matcher.repeatThresholdOverride = 100 set; the framework should handle the rest as far as I can tell. Checking whether the matcher will match anything before setting its threshold looks unnecessary to me.
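
Concretely, something like this (a sketch; the struct name and messageHumanRegex are assumptions, while repeatThresholdOverride and the locator fields come from this PR's diffs):

```go
// Sketch only: skip the pre-scan of finalIntervals entirely and hand the
// framework a matcher with a raised repeat threshold; it only kicks in if the
// events actually occur.
func newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(_ monitorapi.Intervals) *SimplePathologicalEventMatcher {
	matcher := &SimplePathologicalEventMatcher{
		name: "PrometheusReadinessProbeErrorsDuringUpgrades",
		locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
			monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^` + statefulSetNamespace + `$`),
			monitorapi.LocatorPodKey:       regexp.MustCompile(`^` + statefulSetName + `-\d+$`),
		},
		messageHumanRegex: regexp.MustCompile(`Readiness probe errored`),
	}
	matcher.repeatThresholdOverride = 100
	return matcher
}
```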

@dgoodwin
Contributor

dgoodwin commented Nov 4, 2025

/lgtm

Thank you for addressing this flake!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 4, 2025

@rexagod: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
-- | -- | -- | -- | --
ci/prow/e2e-vsphere-ovn-upi | 85008d1 | link | true | /test e2e-vsphere-ovn-upi
ci/prow/unit | 85008d1 | link | true | /test unit
ci/prow/e2e-gcp-ovn-upgrade | 85008d1 | link | true | /test e2e-gcp-ovn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
