
Conversation

@fdesu
Contributor

@fdesu fdesu commented Oct 1, 2025

Fix flaky IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex

Description

TL;DR: This PR attempts to fix the flaky IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex, which experiences a race condition triggered by two async tasks:

  1. a task that turns off translog retention and trims the translog (see org.opensearch.index.shard.IndexShard#turnOffTranslogRetention -> AbstractRunnable#doRun), AND
  2. the async trim_translog task, which is scheduled to run every 200ms for test purposes

A bit longer version:
It seems there are two threads at play here that hit TranslogWriter.create at the same time, even though the TranslogWriter can be created only once. The full stack trace from one of the CI runs follows:

[2025-10-01T05:17:20,732][WARN ][o.o.i.e.Engine           ] [node_s_0] [test][0] failed engine [translog trimming failed]
org.opensearch.index.translog.TranslogException: failed to create new translog file
	at org.opensearch.index.translog.Translog.createWriter(Translog.java:594) ~[main/:?]
	at org.opensearch.index.translog.LocalTranslog.<init>(LocalTranslog.java:95) ~[main/:?]
	at org.opensearch.index.translog.InternalTranslogFactory.newTranslog(InternalTranslogFactory.java:61) ~[main/:?]
	at org.opensearch.index.engine.NoOpEngine$2.trimUnreferencedTranslogFiles(NoOpEngine.java:188) [main/:?]
	at org.opensearch.index.shard.IndexShard.trimTranslog(IndexShard.java:1641) [main/:?]
	at org.opensearch.index.shard.IndexShard$4.doRun(IndexShard.java:3278) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:975) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1447) [?:?]
Caused by: java.nio.file.FileAlreadyExistsException: /var/jenkins/workspace/gradle-check/search/server/build/testrun/test/temp/org.opensearch.index.IndexServiceTests_5511701C154C04C0-001/tempDir-002/data/nodes/0/indices/iaUQDlrKT7eEbuZ412jCTg/0/translog/translog-9.tlog
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:94) ~[?:?]
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
	at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:213) ~[?:?]
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:206) ~[lucene-test-framework-10.3.0.jar:10.3.0 e2871287e4a378739f0b74081d124e3668347875 - 2025-09-05 20:56:14]
	at org.apache.lucene.tests.mockfile.DisableFsyncFS.newFileChannel(DisableFsyncFS.java:44) ~[lucene-test-framework-10.3.0.jar:10.3.0 e2871287e4a378739f0b74081d124e3668347875 - 2025-09-05 20:56:14]
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:206) ~[lucene-test-framework-10.3.0.jar:10.3.0 e2871287e4a378739f0b74081d124e3668347875 - 2025-09-05 20:56:14]
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:301) ~[?:?]
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:353) ~[?:?]
	at org.opensearch.index.translog.ChannelFactory.open(ChannelFactory.java:51) ~[main/:?]
	at org.opensearch.index.translog.TranslogWriter.create(TranslogWriter.java:228) ~[main/:?]
	at org.opensearch.index.translog.Translog.createWriter(Translog.java:575) ~[main/:?]
	... 10 more

A bit deeper investigation shows that the two async jobs mentioned above can run with unfortunate timing, both triggering IndexShard.trimTranslog, which in turn calls NoOpEngine.trimUnreferencedTranslogFiles.

The critical section in NoOpEngine's trimUnreferencedTranslogFiles, however, is guarded only by an instance of the engine's ReadLock, which allows both threads to enter the critical section and try to create an instance of TranslogWriter for the same translog generation.

Creating a TranslogWriter with the same generation succeeds the first time but fails on subsequent attempts, as the underlying file translog-{generationId}.tlog already exists.
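The failure mode can be reproduced in miniature outside OpenSearch. Below is a minimal sketch with hypothetical names (plain JDK file APIs standing in for ChannelFactory.open and TranslogWriter.create): the second attempt to create a file for the same generation throws FileAlreadyExistsException, just like the stack trace above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch (not OpenSearch code): two callers creating the translog
// file for the same generation. The second create fails because the file is
// opened with create-new semantics, mirroring the stack trace above.
public class GenerationRace {

    // Hypothetical stand-in for TranslogWriter.create: throws if the
    // translog-{generation}.tlog file already exists.
    static Path createGenerationFile(Path dir, long generation) throws IOException {
        return Files.createFile(dir.resolve("translog-" + generation + ".tlog"));
    }

    // Returns true if a second create for the same generation fails.
    static boolean secondCreateFails() {
        try {
            Path dir = Files.createTempDirectory("translog-race");
            createGenerationFile(dir, 9); // first caller succeeds
            createGenerationFile(dir, 9); // same generation: FileAlreadyExistsException
            return false;
        } catch (FileAlreadyExistsException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second create fails: " + secondCreateFails());
    }
}
```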

Solution: switching the lock used in the critical section from the engine-scoped ReadLock to the WriteLock seems to solve the issue.
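For intuition on why the lock change helps, here is a small standalone sketch (not the engine code) showing that a ReentrantReadWriteLock's read lock admits both threads into the guarded section at once, while the write lock serializes them:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch (not the engine code): measure how many threads can sit inside a
// lock-guarded section at once under each half of a ReentrantReadWriteLock.
public class LockDemo {

    // Runs two threads through the same guarded section and returns the peak
    // number of threads observed inside it concurrently.
    static int peakConcurrency(Lock lock) {
        AtomicInteger inside = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(2);
        Runnable task = () -> {
            lock.lock();
            try {
                peak.accumulateAndGet(inside.incrementAndGet(), Math::max);
                try {
                    Thread.sleep(150); // hold the lock long enough for overlap
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
                inside.decrementAndGet();
            } finally {
                lock.unlock();
                done.countDown();
            }
        };
        new Thread(task).start();
        new Thread(task).start();
        try {
            done.await();
        } catch (InterruptedException ignored) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        // Both readers enter together: no mutual exclusion, hence the race.
        System.out.println("readLock peak: " + peakConcurrency(lock.readLock()));
        // The write lock is exclusive: one thread at a time.
        System.out.println("writeLock peak: " + peakConcurrency(lock.writeLock()));
    }
}
```

Under the read lock both trim callers reach the TranslogWriter-creation step together; under the write lock the second caller waits until the first finishes.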

What are your thoughts, does it make sense? Does it penalize translog trimming too much?

Related Issues

Resolves #14407

Check List

  • [x] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run Indexing Indexing, Bulk Indexing and anything related to indexing labels Oct 1, 2025
@github-actions
Contributor

github-actions bot commented Oct 1, 2025

❌ Gradle check result for 458b514: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@fdesu
Contributor Author

fdesu commented Oct 1, 2025

> Task :server:internalClusterTest

Tests with failures:
 - org.opensearch.index.fielddata.FieldDataLoadingIT.testIndicesFieldDataCacheSizeSetting

5524 tests completed, 1 failed, 56 skipped

> Task :server:internalClusterTest FAILED
build complete, generating: /var/jenkins/workspace/gradle-check/search/build/64866.tar.bz2

It is known to be flaky: #19431

@fdesu fdesu changed the title Fix race condition when calling TranslogManager#trimUnreferencedTranslogFiles Issue #14407 Fix race condition when calling TranslogManager#trimUnreferencedTranslogFiles Oct 1, 2025
@fdesu
Contributor Author

fdesu commented Oct 2, 2025

Btw, a better idea might be: instead of swapping the read lock for a write lock, simply track whether a translog trim is currently underway (with an AtomicBoolean or similar) and skip trimming if one is already in progress.

This way we leave the read lock in place and avoid unnecessary locking in the Engine. What do you think?
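That alternative could look roughly like this sketch (hypothetical TrimGuard/trimIfIdle names, not the merged fix): an AtomicBoolean compareAndSet lets one caller perform the trim while a concurrent caller skips instead of racing.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the alternative (hypothetical names, not the merged fix): keep
// the read lock and instead skip a trim when another one is in flight.
public class TrimGuard {
    private final AtomicBoolean trimming = new AtomicBoolean(false);

    // Returns true if this caller ran the trim, false if it was skipped
    // because another trim was already underway.
    public boolean trimIfIdle(Runnable trim) {
        if (!trimming.compareAndSet(false, true)) {
            return false; // another thread is trimming; skip instead of racing
        }
        try {
            trim.run();
            return true;
        } finally {
            trimming.set(false);
        }
    }

    public static void main(String[] args) {
        TrimGuard guard = new TrimGuard();
        boolean[] nested = new boolean[1];
        // A trim attempted while one is already in flight is skipped.
        boolean outer = guard.trimIfIdle(() -> nested[0] = guard.trimIfIdle(() -> {}));
        System.out.println("outer ran: " + outer + ", nested ran: " + nested[0]);
    }
}
```

One trade-off of this shape: a skipped caller returns immediately rather than waiting, so it must be acceptable for that trim to simply not happen on that cycle.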

fdesu added 2 commits October 6, 2025 11:00
…logFiles

This fixes the IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex which experiences race condition triggered by multiple threads: 1) task that turns off translog retention and trims the translog AND 2) the async trim_translog task.

Signed-off-by: Sergei Ustimenko <[email protected]>
Signed-off-by: Sergei Ustimenko <[email protected]>
@fdesu fdesu force-pushed the fix/issue-14407-flaky-IndexServiceTests branch from 458b514 to f825d0a Compare October 6, 2025 09:01
@github-actions
Contributor

github-actions bot commented Oct 6, 2025

✅ Gradle check result for f825d0a: SUCCESS

@codecov

codecov bot commented Oct 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.02%. Comparing base (cb65261) to head (f825d0a).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19492      +/-   ##
============================================
+ Coverage     73.00%   73.02%   +0.02%     
- Complexity    70483    70510      +27     
============================================
  Files          5717     5719       +2     
  Lines        323021   323203     +182     
  Branches      46790    46811      +21     
============================================
+ Hits         235826   236030     +204     
+ Misses        68207    68133      -74     
- Partials      18988    19040      +52     


@andrross
Member

andrross commented Oct 6, 2025

@ashking94 Can you take a look at this?

@andrross
Member

andrross commented Oct 6, 2025

Btw, a better idea might be: instead of swapping the read lock for a write lock, simply track whether a translog trim is currently underway (with an AtomicBoolean or similar) and skip trimming if one is already in progress.

This way we leave the read lock in place and avoid unnecessary locking in the Engine. What do you think?

@fdesu The bug here is specific to NoOpEngine, which means it can only happen on a closed index. Is that right? If so, then I think the simple approach of using a write lock is probably fine as this is likely not a hot-path operation and contention on the lock should not be a big problem. What do you think?

@fdesu
Contributor Author

fdesu commented Oct 7, 2025

@andrross thanks for taking a look! Yeah, it definitely seems so; the issue pops up only when two unfortunately-timed async jobs try to trim the translog, which shouldn't happen too frequently.

Leaving the write lock in place looks like the simplest option to me, so it totally makes sense.

@mch2
Member

mch2 commented Oct 7, 2025

Thanks for the fix @fdesu, this one has been a frequent flake for quite some time. After your analysis I'm able to repro this by lowering the task interval to ~50ms and can see both FileNotFound and FileAlreadyExists errors, which have both been popping up in gradle-check failures.

Also +1 to using the writeLock given it's NoOpEngine.

@mch2 mch2 merged commit 8dc072f into opensearch-project:main Oct 7, 2025
33 checks passed
@fdesu fdesu deleted the fix/issue-14407-flaky-IndexServiceTests branch October 7, 2025 18:51
@fdesu
Contributor Author

fdesu commented Oct 7, 2025

@andrross @mch2 thanks for reviews and the merge!

rgsriram pushed a commit to rgsriram/OpenSearch that referenced this pull request Oct 11, 2025
…ogManager#trimUnreferencedTranslogFiles (opensearch-project#19492)

* Fix race condition when calling TranslogManager#trimUnreferencedTranslogFiles

This fixes the IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex which experiences race condition triggered by multiple threads: 1) task that turns off translog retention and trims the translog AND 2) the async trim_translog task.

Signed-off-by: Sergei Ustimenko <[email protected]>

* Add CHANGELOG entry

Signed-off-by: Sergei Ustimenko <[email protected]>

---------

Signed-off-by: Sergei Ustimenko <[email protected]>
Gagan6164 pushed a commit to Gagan6164/OpenSearch that referenced this pull request Oct 13, 2025
…ogManager#trimUnreferencedTranslogFiles (opensearch-project#19492)

* Fix race condition when calling TranslogManager#trimUnreferencedTranslogFiles

This fixes the IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex which experiences race condition triggered by multiple threads: 1) task that turns off translog retention and trims the translog AND 2) the async trim_translog task.

Signed-off-by: Sergei Ustimenko <[email protected]>

* Add CHANGELOG entry

Signed-off-by: Sergei Ustimenko <[email protected]>

---------

Signed-off-by: Sergei Ustimenko <[email protected]>
Signed-off-by: Gagan Singh Saini <[email protected]>
peteralfonsi pushed a commit to peteralfonsi/OpenSearch that referenced this pull request Oct 15, 2025
…ogManager#trimUnreferencedTranslogFiles (opensearch-project#19492)

* Fix race condition when calling TranslogManager#trimUnreferencedTranslogFiles

This fixes the IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex which experiences race condition triggered by multiple threads: 1) task that turns off translog retention and trims the translog AND 2) the async trim_translog task.

Signed-off-by: Sergei Ustimenko <[email protected]>

* Add CHANGELOG entry

Signed-off-by: Sergei Ustimenko <[email protected]>

---------

Signed-off-by: Sergei Ustimenko <[email protected]>

Successfully merging this pull request may close these issues: [AUTOCUT] Gradle Check Flaky Test Report for IndexServiceTests