Skip to content

Conversation

@guojialiang92
Copy link
Contributor

@guojialiang92 guojialiang92 commented Apr 16, 2025

Description

This PR fixes the flaky test org.opensearch.indices.replication.SegmentReplicationTargetServiceTests#testStartReplicationListenerSuccess.

Reproduce method command reference gradle check.

./gradlew ':server:test' --tests "org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testStartReplicationListenerSuccess" -Dtests.seed=67540127BBCFAD6A -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=fr-CA -Dtests.timezone=Africa/Casablanca -Druntime.java=21

This test cannot be reproduced stably, and there may be 1 error every 1000 runs.

The error message is as follows:

févr. 18, 2025 6:47:01 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
AVERTISSEMENT: Uncaught exception in thread: Thread[#1897,opensearch[org.opensearch.indices.replication.SegmentReplicationTargetServiceTests][generic][T#2],5,TGRP-SegmentReplicationTargetServiceTests]
org.apache.lucene.store.AlreadyClosedException: engine is closed
	at __randomizedtesting.SeedInfo.seed([67540127BBCFAD6A]:0)
	at org.opensearch.index.shard.IndexShard.getEngine(IndexShard.java:3762)
	at org.opensearch.index.shard.IndexShard.getReplicationEngine(IndexShard.java:1697)
	at org.opensearch.index.shard.IndexShard.isSegmentReplicationAllowed(IndexShard.java:1828)
	at org.opensearch.index.shard.IndexShard.shouldProcessCheckpoint(IndexShard.java:1847)
	at org.opensearch.indices.replication.SegmentReplicationTargetService.onNewCheckpoint(SegmentReplicationTargetService.java:324)
	at org.opensearch.indices.replication.SegmentReplicationTargetService.lambda$processLatestReceivedCheckpoint$12(SegmentReplicationTargetService.java:484)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:955)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

Related Issues

Resolves #[15829]

Solution

In process segment replication, obtain NRTReplicationEngine through IndexShard#getReplicationEngine
If the Engine has been closed, return an exception AlreadyClosedException

Although IndexShardState.STARTED is checked before calling IndexShard#getReplicationEngine, shard close may be inserted between them, and atomicity cannot be guaranteed.

We can improve method IndexShard#getReplicationEngine by returning Optional.empty() when getEngine() throws a
AlreadyClosedException.

The logic related to segment replication will not continue to execute after judging that the return value is empty (IndexShard#finalizeReplication / IndexShard#isSegmentReplicationAllowed / SegmentReplicationTargetService#forceReplication), and there will be no side effects.

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

❕ Gradle check result for 33d99d6: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov
Copy link

codecov bot commented Apr 16, 2025

Codecov Report

Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 72.49%. Comparing base (5799fe7) to head (f0c47e7).
Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
...in/java/org/opensearch/index/shard/IndexShard.java 87.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17969      +/-   ##
============================================
- Coverage     72.57%   72.49%   -0.09%     
+ Complexity    67066    67052      -14     
============================================
  Files          5470     5475       +5     
  Lines        309697   309925     +228     
  Branches      45045    45067      +22     
============================================
- Hits         224776   224676     -100     
- Misses        66602    66962     +360     
+ Partials      18319    18287      -32     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: guojialiang <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for 19a3440: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/fix_SegmentReplicationTargetServiceTests_testStartReplicationListenerSuccess branch from 1302c08 to 9b38fbf Compare April 17, 2025 01:51
@github-actions
Copy link
Contributor

✅ Gradle check result for 9b38fbf: SUCCESS

@github-actions
Copy link
Contributor

❌ Gradle check result for 3c14bc9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/fix_SegmentReplicationTargetServiceTests_testStartReplicationListenerSuccess branch from 3c14bc9 to f0c47e7 Compare April 17, 2025 03:24
@github-actions
Copy link
Contributor

❕ Gradle check result for f0c47e7: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@guojialiang92
Copy link
Contributor Author

Hi, @mch2
You may be familiar with this part of the logic, can you help me take a look at this fix when you have time?

Copy link
Member

@mch2 mch2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good thanks for fixing this @guojialiang92!

@mch2 mch2 merged commit f7da8c3 into opensearch-project:main Apr 17, 2025
34 of 35 checks passed
x-INFiN1TY-x pushed a commit to x-INFiN1TY-x/OpenSearch_Local that referenced this pull request Apr 24, 2025
…ionListenerSuccess (opensearch-project#17969)

* fix flaky test SegmentReplicationTargetServiceTests.testStartReplicationListenerSuccess

Signed-off-by: guojialiang <[email protected]>

* add UT

Signed-off-by: guojialiang <[email protected]>

* add comment

Signed-off-by: guojialiang <[email protected]>

* add tests

Signed-off-by: guojialiang <[email protected]>

---------

Signed-off-by: guojialiang <[email protected]>
Signed-off-by: Tanishq Ranjan <[email protected]>
Harsh-87 pushed a commit to Harsh-87/OpenSearch that referenced this pull request May 7, 2025
…ionListenerSuccess (opensearch-project#17969)

* fix flaky test SegmentReplicationTargetServiceTests.testStartReplicationListenerSuccess

Signed-off-by: guojialiang <[email protected]>

* add UT

Signed-off-by: guojialiang <[email protected]>

* add comment

Signed-off-by: guojialiang <[email protected]>

* add tests

Signed-off-by: guojialiang <[email protected]>

---------

Signed-off-by: guojialiang <[email protected]>
Signed-off-by: Harsh Kothari <[email protected]>
Harsh-87 pushed a commit to Harsh-87/OpenSearch that referenced this pull request May 7, 2025
…ionListenerSuccess (opensearch-project#17969)

* fix flaky test SegmentReplicationTargetServiceTests.testStartReplicationListenerSuccess

Signed-off-by: guojialiang <[email protected]>

* add UT

Signed-off-by: guojialiang <[email protected]>

* add comment

Signed-off-by: guojialiang <[email protected]>

* add tests

Signed-off-by: guojialiang <[email protected]>

---------

Signed-off-by: guojialiang <[email protected]>
Signed-off-by: Harsh Kothari <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants