@skumawat2025
Contributor

Description

Fix flaky RemotePrimaryLocalRecoveryIT by limiting rolling restarts to data nodes

Problem:
RemotePrimaryLocalRecoveryIT failed intermittently with ClusterManagerNotDiscoveredException when the rolling restart included master nodes. The failure reproduced within 100 iterations and was caused by cluster manager discovery issues. The test only needs to verify local recovery during remote migration after data node restarts, so restarting master nodes is unnecessary.

Solution:
Modified the rolling restart logic to restart only data nodes, excluding master nodes from the restart sequence. With this change, the test was stable across 800+ iterations.
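
For illustration only, the selection step could look like the sketch below. It is not the code in this PR and does not use the OpenSearch test framework. `TestNode`, `DataNodeRestartPlan`, and the role strings are hypothetical stand-ins, and it assumes that "data node" means a node that holds the data role and is not cluster-manager eligible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DataNodeRestartPlan {

    /** Hypothetical stand-in for a test-cluster node: a name plus its roles. */
    public static final class TestNode {
        final String name;
        final Set<String> roles;

        public TestNode(String name, Set<String> roles) {
            this.name = name;
            this.roles = roles;
        }
    }

    /**
     * Returns the names of the nodes to restart, in cluster order.
     * Assumption: only nodes that hold the "data" role and are not
     * cluster-manager eligible are restarted.
     */
    public static List<String> dataNodesToRestart(List<TestNode> nodes) {
        List<String> plan = new ArrayList<>();
        for (TestNode node : nodes) {
            boolean isData = node.roles.contains("data");
            boolean isClusterManager = node.roles.contains("cluster_manager");
            if (isData && !isClusterManager) {
                plan.add(node.name);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        List<TestNode> cluster = List.of(
            new TestNode("cm-0", Set.of("cluster_manager")),
            new TestNode("data-0", Set.of("data")),
            new TestNode("data-1", Set.of("data"))
        );
        // A real test would restart each node here and wait for the
        // cluster to recover before moving on to the next one.
        for (String name : dataNodesToRestart(cluster)) {
            System.out.println("restart " + name);
        }
    }
}
```

Because the cluster manager node never appears in the restart plan, the cluster keeps its elected cluster manager for the whole restart sequence.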

Related Issues

Resolves #14314

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@skumawat2025 skumawat2025 requested a review from a team as a code owner May 7, 2025 08:32
@github-actions github-actions bot added the >test-failure (Test failure from CI, local build, etc.), autocut, flaky-test (Random test failure that succeeds on second run), and Storage:Remote labels May 7, 2025
@github-actions
Contributor

github-actions bot commented May 7, 2025

❌ Gradle check result for a19b834: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-project-automation github-project-automation bot moved this to 👀 In review in Storage Project Board May 7, 2025
@skumawat2025 skumawat2025 force-pushed the flaky-test-fix-remote-primary-local-recovery branch from a19b834 to 2491e5d Compare May 7, 2025 10:17
@github-actions
Contributor

github-actions bot commented May 7, 2025

❌ Gradle check result for 2491e5d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@skumawat2025
Contributor Author

The GitHub check is failing with the following exception:

java.nio.file.NoSuchFileException: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotemigration.RemotePrimaryLocalRecoveryIT_8D86DF36D6500609-001/tempDir-002/repos/kIzUqYTnok/L10010110011010/HYQiXMdzRMCL2E84YcvskA/0/segments/data

@andrross
Member

@skumawat2025 Any update here?

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled (Issues that have stalled) label Jun 22, 2025

Labels

autocut, flaky-test (Random test failure that succeeds on second run), skip-changelog, stalled (Issues that have stalled), Storage:Remote, >test-failure (Test failure from CI, local build, etc.)

Projects

Status: 👀 In review

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for RemotePrimaryLocalRecoveryIT

3 participants