-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Issue #14407 Fix race condition when calling TranslogManager#trimUnreferencedTranslogFiles #19492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue #14407 Fix race condition when calling TranslogManager#trimUnreferencedTranslogFiles #19492
Conversation
|
❌ Gradle check result for 458b514: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
is known to be flaky #19431 |
|
Btw, a better idea might be instead of downgrading the read lock to write lock, simply track whether a translog trimming is currently underway (with atomic boolean or similar) and skip trimming if it is currently being trimmed. This way we leave the read lock in place and avoid unnecessary locking in the Engine. What do you think? |
…logFiles This fixes the IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex which experiences race condition triggered by multiple threads: 1) task that turns off translog retention and trims the translog AND 2) the async trim_translog task. Signed-off-by: Sergei Ustimenko <[email protected]>
Signed-off-by: Sergei Ustimenko <[email protected]>
458b514 to
f825d0a
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #19492 +/- ##
============================================
+ Coverage 73.00% 73.02% +0.02%
- Complexity 70483 70510 +27
============================================
Files 5717 5719 +2
Lines 323021 323203 +182
Branches 46790 46811 +21
============================================
+ Hits 235826 236030 +204
+ Misses 68207 68133 -74
- Partials 18988 19040 +52 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@ashking94 Can you take a look at this? |
@fdesu The bug here is specific to NoOpEngine, which means it can only happen on a closed index. Is that right? If so, then I think the simple approach of using a write lock is probably fine as this is likely not a hot-path operation and contention on the lock should not be a big problem. What do you think? |
|
@andrross thanks for taking a look! Yeah, it definitely seems so and the issue pops up only when 2 unfortunate async jobs are trying to trim the log which shouldn't happen too frequent. Leaving the write lock in place looks like the simplest option to me, so it totally makes sense. |
|
thanks for the fix @fdesu, this one has been a frequent flaky for quite some time. After your analysis I'm able to repro this by lowering the task interval to ~50ms and can see both FileNotFound and FileAlreadyExist errors that have both been popping up in gc failures. Also +1 to using the writeLock given its noOpEngine |
…ogManager#trimUnreferencedTranslogFiles (opensearch-project#19492) * Fix race condition when calling TranslogManager#trimUnreferencedTranslogFiles This fixes the IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex which experiences race condition triggered by multiple threads: 1) task that turns off translog retention and trims the translog AND 2) the async trim_translog task. Signed-off-by: Sergei Ustimenko <[email protected]> * Add CHANGELOG entry Signed-off-by: Sergei Ustimenko <[email protected]> --------- Signed-off-by: Sergei Ustimenko <[email protected]>
…ogManager#trimUnreferencedTranslogFiles (opensearch-project#19492) * Fix race condition when calling TranslogManager#trimUnreferencedTranslogFiles This fixes the IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex which experiences race condition triggered by multiple threads: 1) task that turns off translog retention and trims the translog AND 2) the async trim_translog task. Signed-off-by: Sergei Ustimenko <[email protected]> * Add CHANGELOG entry Signed-off-by: Sergei Ustimenko <[email protected]> --------- Signed-off-by: Sergei Ustimenko <[email protected]> Signed-off-by: Gagan Singh Saini <[email protected]>
…ogManager#trimUnreferencedTranslogFiles (opensearch-project#19492) * Fix race condition when calling TranslogManager#trimUnreferencedTranslogFiles This fixes the IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex which experiences race condition triggered by multiple threads: 1) task that turns off translog retention and trims the translog AND 2) the async trim_translog task. Signed-off-by: Sergei Ustimenko <[email protected]> * Add CHANGELOG entry Signed-off-by: Sergei Ustimenko <[email protected]> --------- Signed-off-by: Sergei Ustimenko <[email protected]>
Fix flaky
IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndexDescription
TL;DR: This PR attempts to fix flaky
IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndexwhich experiences race condition triggered by async tasks:org.opensearch.index.shard.IndexShard#turnOffTranslogRetention->AbstractRunnable#doRun) ANDtrim_translogtask, which is scheduled to run every 200ms for test purposesA bit longer version:
It seems there are 2 threads at play here, that hit the
TranslogWriter.createat the same time. TheTranslogWritercould be created only once though. The full stacktrace from one of CI runs follows:a bit deeper investigation shows that the 2 async jobs mentioned above can run with an unfortunate timing, both triggering
IndexShard.trimTranslogwhich in turn callsNoOpEngine.trimUnreferencedTranslogFiles.The critical section in
trimUnreferencedTranslogFilesfromNoOpEnginethough seems to be protected by an instance of the ReadLock, which essentially allows both threads to enter the critical section and try to create an instance ofTranslogWriterusing the same Translog generation.Creating TranslogWriter with the same generation will succeed the first time but fails subsequently as the underlying file
translog-{generationId}.tlogalready exists.Solution: it seems that downgrading the lock being used in the critical section from engine-scoped ReadLock to an instance of WriteLock solves the issue.
What are your thoughts, does it make sense? Is it too much to penalize translog trimming that much?
Related Issues
Resolves #14407
Check List
[ ] API changes companion pull request created, if applicable.[ ] Public documentation issue/PR created, if applicable.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.