Fix flaky test RemotePrimaryLocalRecoveryIT #18230

skumawat2025 · 2025-05-07T08:32:02Z

Description

Fix flaky RemotePrimaryLocalRecoveryIT by limiting rolling restarts to data nodes

Problem:
RemotePrimaryLocalRecoveryIT was failing intermittently with ClusterManagerNotDiscoveredException during rolling restarts that included master nodes. The failure reproduced within 100 test iterations. The test only needs to verify remote migration local recovery after data node restarts, so restarting master nodes is unnecessary.

Solution:
The rolling restart logic now restarts only data nodes and excludes master nodes from the restart sequence. This change has been stable across 800+ test iterations.
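
A minimal sketch of the idea, not the actual test change: the diff itself is not shown here, so the Node class, restartTargets, and rollingRestart below are hypothetical names rather than OpenSearch test-framework APIs. It assumes each node exposes its roles and that any master-eligible node is skipped, even if it also holds data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class DataNodeRollingRestart {

    // Hypothetical stand-in for a test-cluster node; not an OpenSearch class.
    static final class Node {
        final String name;
        final boolean data;            // node holds shard data
        final boolean masterEligible;  // node can be elected cluster manager

        Node(String name, boolean data, boolean masterEligible) {
            this.name = name;
            this.data = data;
            this.masterEligible = masterEligible;
        }
    }

    // Select the nodes to restart: data nodes that are not master-eligible
    // (assumption: mixed-role nodes are skipped as well).
    static List<String> restartTargets(List<Node> nodes) {
        List<String> targets = new ArrayList<>();
        for (Node node : nodes) {
            if (node.data && !node.masterEligible) {
                targets.add(node.name);
            }
        }
        return targets;
    }

    // Restart the selected nodes one at a time, in cluster order, and
    // return the names in the order they were restarted.
    static List<String> rollingRestart(List<Node> nodes, Consumer<String> restartOne) {
        List<String> restarted = new ArrayList<>();
        for (String name : restartTargets(nodes)) {
            restartOne.accept(name); // placeholder for the real restart call
            restarted.add(name);
        }
        return restarted;
    }

    public static void main(String[] args) {
        List<Node> cluster = Arrays.asList(
            new Node("master-0", false, true),
            new Node("data-0", true, false),
            new Node("data-1", true, false)
        );
        // The master node is never passed to the restart callback.
        List<String> order = rollingRestart(cluster, name -> System.out.println("restarting " + name));
        System.out.println(order);
    }
}
```

Because the master node is never restarted in this sketch, the cluster keeps an elected cluster manager for the whole sequence, which is the condition the flaky runs were violating.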

Related Issues

Resolves #14314

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

...rc/internalClusterTest/java/org/opensearch/remotemigration/RemotePrimaryLocalRecoveryIT.java

github-actions · 2025-05-07T09:39:24Z

❌ Gradle check result for a19b834: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Sandeep Kumawat <[email protected]>

github-actions · 2025-05-07T11:24:15Z

❌ Gradle check result for 2491e5d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

skumawat2025 · 2025-05-07T11:32:41Z

The GitHub check is failing with:

java.nio.file.NoSuchFileException: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotemigration.RemotePrimaryLocalRecoveryIT_8D86DF36D6500609-001/tempDir-002/repos/kIzUqYTnok/L10010110011010/HYQiXMdzRMCL2E84YcvskA/0/segments/data

andrross · 2025-05-22T18:29:31Z

@skumawat2025 Any update here?

opensearch-trigger-bot · 2025-06-22T15:22:18Z

This PR is stalled because it has been open for 30 days with no activity.

skumawat2025 requested a review from a team as a code owner May 7, 2025 08:32

github-actions bot added the >test-failure, autocut, flaky-test, and Storage:Remote labels May 7, 2025

github-project-automation bot added this to Storage Project Board May 7, 2025

skumawat2025 added the skip-changelog label May 7, 2025

sachinpkale reviewed May 7, 2025

sachinpkale approved these changes May 7, 2025

github-project-automation bot moved this to 👀 In review in Storage Project Board May 7, 2025

Fix flaky test RemotePrimaryLocalRecoveryIT

2491e5d

Signed-off-by: Sandeep Kumawat <[email protected]>

skumawat2025 force-pushed the flaky-test-fix-remote-primary-local-recovery branch from a19b834 to 2491e5d May 7, 2025 10:17

This was referenced May 7, 2025

[AUTOCUT] Gradle Check Flaky Test Report for S3BlobStoreRepositoryTests #14299

Open

[AUTOCUT] Gradle Check Flaky Test Report for SearchIT #18129

Closed

[AUTOCUT] Gradle Check Flaky Test Report for RemotePrimaryLocalRecoveryIT #14314

Open

opensearch-trigger-bot bot added the stalled label Jun 22, 2025


3 participants