@wozniakjan wozniakjan commented Oct 15, 2025

The otel tracing e2e test is flaky: it frequently fails when fetching traces from Zipkin.

2025-10-14T17:24:11.0470498Z     helper.go:382: Waiting for deployment replicas to hit target. Deployment - interceptor-otel-tracing-test-deployment, Current  - 1, Target - 1
2025-10-14T17:24:11.0471417Z     interceptor_otel_tracing_test.go:266:
2025-10-14T17:24:11.0472867Z            Error Trace:    /root/runner/keda-arm64-http-add-on-2/_work/http-add-on/http-add-on/tests/checks/interceptor_otel_tracing/interceptor_otel_tracing_test.go:266
2025-10-14T17:24:11.0473999Z            Error:          "0" is not greater than or equal to "1"
2025-10-14T17:24:11.0474526Z            Test:           TestTraceGeneration
2025-10-14T17:24:11.0474985Z     interceptor_otel_tracing_test.go:269:
2025-10-14T17:24:11.0476428Z            Error Trace:    /root/runner/keda-arm64-http-add-on-2/_work/http-add-on/http-add-on/tests/checks/interceptor_otel_tracing/interceptor_otel_tracing_test.go:269
2025-10-14T17:24:11.0477423Z            Error:          Not equal:
2025-10-14T17:24:11.0477936Z                            expected: "200"
2025-10-14T17:24:11.0478442Z                            actual  : ""
2025-10-14T17:24:11.0478833Z
2025-10-14T17:24:11.0479227Z                            Diff:
2025-10-14T17:24:11.0479685Z                            --- Expected
2025-10-14T17:24:11.0480166Z                            +++ Actual
2025-10-14T17:24:11.0480633Z                            @@ -1 +1 @@
2025-10-14T17:24:11.0481051Z                            -200
2025-10-14T17:24:11.0481427Z                            +
2025-10-14T17:24:11.0481839Z            Test:           TestTraceGeneration

I think there are two causes: the test relies on hardcoded sleeps, and the otel collector is configured to push to Zipkin during the test suite setup, while Zipkin itself is only deployed much later, when this test runs.

The otel collector logs frequently show Zipkin being deployed well after the collector. Until then, exports fail with "dial tcp: lookup zipkin.zipkin on 10.43.0.10:53: no such host", and the collector ends up with a retry interval ("interval": "31.878599503s") much longer than the hardcoded sleeps, which makes the test flaky:

2025-10-15T09:03:39.291Z        info    internal/retry_sender.go:133    Exporting failed. Will retry the request after interval.        {"resource": {"service.instance.id": "a5e92fe7-6fbb-410b-942e-f94c5e18e354", "service.name": "otelcol-contrib", "service.version": "0.136.0"}, "otelcol.component.id": "zipkin", "otelcol.component.kind": "exporter", "otelcol.signal": "traces", "error": "failed to push trace data via Zipkin exporter: Post \"http://zipkin.zipkin:9411/api/v2/spans\": dial tcp: lookup zipkin.zipkin on 10.43.0.10:53: no such host", "interval": "31.878599503s"}
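
To illustrate why a retry interval of about 32s is plausible, here is a rough sketch of exponential backoff with jitter. The settings are assumptions, not values taken from this PR: initial interval 5s, multiplier 1.5, cap 30s, jitter factor 0.5. The collector configuration used by the test suite may differ, and the function names are made up for this example.

```go
package main

import "time"

// Assumed retry settings; they are NOT read from this PR's collector config.
const (
	initialInterval = 5 * time.Second
	maxInterval     = 30 * time.Second
	multiplier      = 1.5
	jitter          = 0.5
)

// observedInterval is the "interval" value from the collector log line above.
const observedInterval = 31878599503 * time.Nanosecond

// baseInterval returns the un-jittered wait before retry number n (0-based).
// It grows by the multiplier on each retry and is capped at maxInterval.
func baseInterval(n int) time.Duration {
	d := float64(initialInterval)
	for i := 0; i < n; i++ {
		d *= multiplier
		if d >= float64(maxInterval) {
			return maxInterval
		}
	}
	return time.Duration(d)
}

// jitterBounds returns the shortest and longest wait after jitter is applied
// to base, i.e. base*(1-jitter) and base*(1+jitter).
func jitterBounds(base time.Duration) (lo, hi time.Duration) {
	delta := time.Duration(jitter * float64(base))
	return base - delta, base + delta
}
```

Under these assumptions the cap is reached by the sixth retry, and a capped wait can land anywhere between 15s and 45s. The logged 31.878599503s falls inside that range, so any fixed sleep shorter than that can miss the next export attempt.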

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO)

@wozniakjan wozniakjan requested a review from Copilot October 15, 2025 09:51

Copilot AI left a comment

Pull Request Overview

This PR addresses flakiness in the OTEL tracing e2e test by reorganizing the Zipkin deployment to occur during the test suite setup phase rather than within individual tests. This ensures Zipkin is available before the OTEL collector starts pushing traces, preventing connection failures and test instability.

  • Moved Zipkin deployment from individual test to shared setup phase
  • Updated test to use gomega's Eventually assertion instead of hardcoded sleeps
  • Improved error visibility by capturing both stdout and stderr in test execution
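
The idea behind replacing fixed sleeps with an Eventually-style assertion can be sketched with the standard library alone. This is not the code from the PR, which uses gomega. pollUntil and fakeTraceStore are hypothetical names, and the fake store stands in for a real Zipkin query.

```go
package main

import (
	"errors"
	"time"
)

// pollUntil calls check immediately and then once per interval until it
// returns true. It gives up when the next attempt would start after timeout.
func pollUntil(timeout, interval time.Duration, check func() bool) error {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errors.New("condition not met before timeout")
		}
		time.Sleep(interval)
	}
}

// fakeTraceStore is a hypothetical stand-in for querying Zipkin. It reports
// zero traces for the first readyAfter queries and one trace afterwards.
type fakeTraceStore struct {
	queries    int
	readyAfter int
}

// traceCount simulates one query against the trace backend.
func (s *fakeTraceStore) traceCount() int {
	s.queries++
	if s.queries > s.readyAfter {
		return 1
	}
	return 0
}
```

A test would call something like pollUntil(2*time.Minute, time.Second, func() bool { return store.traceCount() >= 1 }). It returns as soon as a trace shows up and only fails after the full timeout, so a slow collector retry costs time rather than a failed assertion.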

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/utils/setup_test.go | Added Zipkin deployment template and setup function, updated Envoy Gateway version |
| tests/run-all.go | Changed to capture combined output (stdout+stderr) for better debugging |
| tests/checks/interceptor_otel_tracing/interceptor_otel_tracing_test.go | Removed Zipkin deployment from test, replaced sleeps with gomega Eventually assertion |
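
As a minimal sketch of what capturing combined output means, the standard library's exec.Cmd offers both Output (stdout only) and CombinedOutput (stdout and stderr together). The helper names below are hypothetical and are not the functions in tests/run-all.go. The example also assumes a POSIX sh is available on the PATH.

```go
package main

import "os/exec"

// runCombined runs script with sh and returns stdout and stderr together,
// in the order the command wrote them.
func runCombined(script string) (string, error) {
	out, err := exec.Command("sh", "-c", script).CombinedOutput()
	return string(out), err
}

// runStdoutOnly runs script with sh and returns only stdout. Anything the
// command writes to stderr is missing from the returned string.
func runStdoutOnly(script string) (string, error) {
	out, err := exec.Command("sh", "-c", script).Output()
	return string(out), err
}
```

For a failing test run, the stderr half is often where the useful message is, which is why the combined variant makes failures easier to debug.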

@wozniakjan wozniakjan merged commit b38b95d into kedacore:main Oct 15, 2025
18 checks passed