
@ashking94
Member

@ashking94 ashking94 commented Mar 6, 2025

Description

This PR addresses flakiness in remote store download stats correctness tests by improving test reliability and consistency.

Problem

The existing remote store download stats tests (testDownloadStatsCorrectnessSinglePrimarySingleReplica and testDownloadStatsCorrectnessSinglePrimaryMultipleReplicaShards) were experiencing intermittent failures due to:

  • Race conditions in stats retrieval
  • Inconsistent replication state
  • Lack of robust waiting mechanisms

Solution

  • Added waitForReplication() method to ensure consistent replication state
  • Improved assertion logic to handle potential timing variations
  • Enhanced logging and error tracking
  • Refactored common test logic to reduce code duplication
  • Improved stats validation methods
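The PR description does not show the body of waitForReplication(). As a hedged illustration of the general technique only (poll a condition until the replica has caught up, and only then read stats), here is a minimal, self-contained Java sketch. The class WaitForReplicationSketch, the waitUntil helper, and the checkpoint counters are hypothetical stand-ins, not OpenSearch APIs; inside OpenSearch integration tests the same idea is typically expressed with the test framework's assertBusy(...).

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: NOT the waitForReplication() added in this PR.
final class WaitForReplicationSketch {

    /**
     * Polls {@code condition} every {@code pollInterval} until it returns true
     * or {@code timeout} elapses. Returns whether the condition was met.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition holding
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-ins for the primary's and replica's checkpoints.
        AtomicLong primaryCheckpoint = new AtomicLong(10);
        AtomicLong replicaCheckpoint = new AtomicLong(0);

        // Simulated asynchronous replication: the replica catches up step by step.
        Thread replicator = new Thread(() -> {
            while (replicaCheckpoint.get() < primaryCheckpoint.get()) {
                replicaCheckpoint.incrementAndGet();
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        replicator.start();

        // Reading download stats right here would race with the replicator;
        // waiting for the checkpoints to match removes that race.
        boolean caughtUp = waitUntil(
            () -> replicaCheckpoint.get() == primaryCheckpoint.get(),
            Duration.ofSeconds(5),
            Duration.ofMillis(10));
        replicator.join();

        System.out.println("caughtUp=" + caughtUp + " replica=" + replicaCheckpoint.get());
    }
}
```

The design choice is to wait on an observable replication condition with a bounded timeout rather than sleeping for a fixed interval, so the test is both faster on fast machines and tolerant of slow ones.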

Related Issues

Resolves #14310

Check List

  • Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Mar 6, 2025

❌ Gradle check result for 12a112b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run Storage:Remote labels Mar 6, 2025
@ashking94 ashking94 moved this to Ready To Be Picked in Storage Project Board Mar 6, 2025
@ashking94 ashking94 moved this from Ready To Be Picked to 👀 In review in Storage Project Board Mar 6, 2025
@github-actions
Contributor

github-actions bot commented Mar 6, 2025

✅ Gradle check result for 12a112b: SUCCESS

@codecov

codecov bot commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 72.43%. Comparing base (342c645) to head (12a112b).
Report is 4 commits behind head on main.

Files with missing lines | Patch % | Lines
...rch/index/remote/RemoteSegmentTransferTracker.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17535      +/-   ##
============================================
+ Coverage     72.40%   72.43%   +0.03%     
- Complexity    65683    65708      +25     
============================================
  Files          5311     5311              
  Lines        304890   304891       +1     
  Branches      44213    44213              
============================================
+ Hits         220743   220850     +107     
+ Misses        66045    65997      -48     
+ Partials      18102    18044      -58     
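The figures in the Coverage Diff table are internally consistent, and the reported percentages can be reproduced from them. A small Java check, assuming Codecov computes coverage as hits divided by total tracked lines and truncates to two decimals (its documented defaults `precision: 2` and `round: down`); the class name CoverageMath is illustrative only:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustrative arithmetic check; CoverageMath is a hypothetical helper name.
final class CoverageMath {

    /** Coverage % = hits * 100 / lines, truncated (rounded down) to two decimals. */
    static BigDecimal coverage(long hits, long lines) {
        return BigDecimal.valueOf(hits)
            .multiply(BigDecimal.valueOf(100))
            .divide(BigDecimal.valueOf(lines), 2, RoundingMode.DOWN);
    }

    public static void main(String[] args) {
        // Figures copied from the Coverage Diff table (base = main, head = 12a112b).
        long baseHits = 220743, baseMisses = 66045, basePartials = 18102, baseLines = 304890;
        long headHits = 220850, headMisses = 65997, headPartials = 18044, headLines = 304891;

        // Every tracked line is exactly one of: hit, miss, or partial.
        System.out.println(baseHits + baseMisses + basePartials == baseLines); // true
        System.out.println(headHits + headMisses + headPartials == headLines); // true

        System.out.println(coverage(baseHits, baseLines)); // 72.40
        System.out.println(coverage(headHits, headLines)); // 72.43
    }
}
```

Note that the head value is 72.4357...%, so ordinary half-up rounding would give 72.44%; the reported 72.43% is consistent with round-down truncation.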


@shourya035
Member

Thanks for the fix @ashking94 🙌

@ashking94
Member Author

Before this change, the tests testDownloadStatsCorrectnessSinglePrimarySingleReplica and testDownloadStatsCorrectnessSinglePrimaryMultipleReplicaShards used to fail within 200 iterations. After this change, I have run both tests for around 3K iterations and neither has failed yet. They are still running; I will update here if they fail.
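One way to read these iteration counts: with zero failures in n independent runs, the exact one-sided 95% upper bound on the per-run failure probability p solves (1 - p)^n = 0.05, which for n = 3000 is about 0.1% (the "rule of three", 3/n, gives the same). That is well below the rough pre-fix rate suggested by "fails within 200 iterations" (on the order of 1/200 = 0.5%). The Java sketch below assumes runs are independent and treats "around 3K" as exactly 3000; ZeroFailureBound is a hypothetical name.

```java
// Hypothetical helper for interpreting a zero-failure stress run.
final class ZeroFailureBound {

    /**
     * Exact one-sided upper confidence bound on the per-run failure probability
     * after observing zero failures in n independent runs: solves (1 - p)^n = alpha.
     */
    static double upperBound(int n, double alpha) {
        return 1.0 - Math.pow(alpha, 1.0 / n);
    }

    public static void main(String[] args) {
        int runs = 3000;                        // assumption: "around 3K" taken as 3000
        double exact = upperBound(runs, 0.05);  // 95% confidence
        double ruleOfThree = 3.0 / runs;        // common approximation of the same bound
        double preFixRough = 1.0 / 200;         // rough scale implied by "fails within 200 iterations"

        System.out.printf("exact=%.5f ruleOfThree=%.5f preFixRough=%.5f%n",
            exact, ruleOfThree, preFixRough);
    }
}
```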

@gbbafna gbbafna merged commit c48efd0 into opensearch-project:main Mar 7, 2025
61 of 64 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Storage Project Board Mar 7, 2025
@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label Mar 7, 2025
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 7, 2025
Signed-off-by: Ashish Singh <[email protected]>
(cherry picked from commit c48efd0)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
ashking94 added a commit to ashking94/OpenSearch that referenced this pull request Mar 7, 2025
ashking94 added a commit that referenced this pull request Mar 7, 2025
ashking94 added a commit that referenced this pull request Mar 7, 2025
* Fix flaky tests in RemoteStoreStatsIT (#17535)

Signed-off-by: Ashish Singh <[email protected]>

* Fix compilation issue for PR #17535 during backport (#17546)

Signed-off-by: Ashish Singh <[email protected]>

---------

Signed-off-by: Ashish Singh <[email protected]>
vinaykpud pushed a commit to vinaykpud/OpenSearch that referenced this pull request Mar 18, 2025
Signed-off-by: Ashish Singh <[email protected]>
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

Labels

autocut backport 2.x Backport to 2.x branch flaky-test Random test failure that succeeds on second run skip-changelog Storage:Remote >test-failure Test failure from CI, local build, etc.

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for RemoteStoreStatsIT

3 participants