@abhita abhita commented May 19, 2025

Description

testCreateSplitIndex
The test creates split indices and tests whether:

  • Documents can be correctly split from a source index into indices with more shards (e.g., 2->4->8 shards)
  • All documents remain searchable and retrievable after splits
  • New writes to the split indices work properly
  • The split operation maintains data consistency across source and target indices while using remote storage

Based on the failure logs from recent PR builds (for example, https://build.ci.opensearch.org/job/gradle-check/54304/testReport/junit/org.opensearch.action.admin.indices.create/RemoteSplitIndexIT/testCreateSplitIndex/), the test used to fail occasionally with:
java.lang.AssertionError: expected:<0> but was:<103>
at __randomizedtesting.SeedInfo.seed([93C08AFB8B2D718E:EB805C4187ADBB8C]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.lambda$cleanUp$0(RemoteSplitIndexIT.java:117)
at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1182)
at org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.cleanUp(RemoteSplitIndexIT.java:115)

The failure occurs in the cleanup step: some translog files are not deleted within the wait window, so the translog file count does not match the expected value of 0.

        if (RemoteStoreSettings.isPinnedTimestampsEnabled() == false) {
            assertBusy(() -> {
                try {
                    assertEquals(0, getFileCount(translogRepoPath));
                } catch (IOException e) {
                    fail();
                }
            }, 30, TimeUnit.SECONDS);
        }
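For readers unfamiliar with the helper, the polling pattern behind `assertBusy` can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual `OpenSearchTestCase.assertBusy` implementation; the class name `BusyAssertSketch`, the `Check` interface, and the backoff values are assumptions made for this example.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a busy-wait assertion helper (not the OpenSearch implementation).
class BusyAssertSketch {

    /** Hypothetical stand-in for a check that signals failure by throwing AssertionError. */
    @FunctionalInterface
    public interface Check {
        void run() throws Exception;
    }

    /** Re-runs the check with a growing sleep until it passes or the deadline is reached. */
    public static void assertBusy(Check check, long maxWait, TimeUnit unit) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1; // assumed initial backoff
        while (true) {
            try {
                check.run();
                return; // condition satisfied
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // out of time: surface the last failure
                }
            }
            long remainingMillis = Math.max(1, TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime()));
            Thread.sleep(Math.min(sleepMillis, remainingMillis));
            sleepMillis = Math.min(sleepMillis * 2, 500); // assumed backoff cap
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated file count: each poll pretends background cleanup removed one more file.
        AtomicInteger remainingFiles = new AtomicInteger(3);
        assertBusy(() -> {
            int count = Math.max(0, remainingFiles.getAndDecrement());
            if (count != 0) {
                throw new AssertionError("expected:<0> but was:<" + count + ">");
            }
        }, 5, TimeUnit.SECONDS);
        System.out.println("cleanup observed within the wait window");
    }
}
```

With a helper of this shape, the wait bound only matters when the condition becomes true late: a check that would pass after 40 seconds fails under a 30 second bound and passes under a 60 second one.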

Further investigation of the logs shows that deletion of some translog files fails with access denied errors:

[2025-04-23T21:58:13,920][INFO ][o.o.i.t.t.TranslogTransferManager] [node_t0] [source][0] Deleted all remote translog data at path=[-uWqzJi6T8SwYgRNl38U6w][0][translog][metadata]
[2025-04-23T21:58:13,933][INFO ][o.o.i.t.t.TranslogTransferManager] [node_t0] [source][0] Deleted all remote translog data at path=[-uWqzJi6T8SwYgRNl38U6w][0][translog][data]
[2025-04-23T21:58:13,948][INFO ][o.o.i.t.t.TranslogTransferManager] [node_t0] [target][0] Deleted all remote translog data at path=[16XyNNc7Tmm-oNEOWjzTUQ][0][translog][metadata]
[2025-04-23T21:58:13,957][INFO ][o.o.i.t.t.TranslogTransferManager] [node_t0] [target][0] Deleted all remote translog data at path=[16XyNNc7Tmm-oNEOWjzTUQ][0][translog][data]
[2025-04-23T21:58:13,970][ERROR][o.o.i.t.t.TranslogTransferManager] [node_t0] [target][1] Exception occurred while cleaning translog at path=[16XyNNc7Tmm-oNEOWjzTUQ][1][translog][data]
java.io.IOException: access denied: /workplace/abhital/git/OpenSearch/server/build/testrun/internalClusterTest/temp/org.opensearch.action.admin.indices.create.RemoteSplitIndexIT_93C08AFB8B2D718E-001/tempDir-004/repos/AVPZMcbucX/16XyNNc7Tmm-oNEOWjzTUQ/1/translog/data/2/translog-35.tlog
	at org.apache.lucene.tests.mockfile.WindowsFS.checkDeleteAccess(WindowsFS.java:117) ~[lucene-test-framework-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]
	at org.apache.lucene.tests.mockfile.WindowsFS.delete(WindowsFS.java:126) ~[lucene-test-framework-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]
	at java.base/java.nio.file.Files.delete(Files.java:1153) ~[?:?]
	at org.opensearch.common.blobstore.fs.FsBlobContainer$1.visitFile(FsBlobContainer.java:147) ~[main/:?]
	at org.opensearch.common.blobstore.fs.FsBlobContainer$1.visitFile(FsBlobContainer.java:137) ~[main/:?]
	at java.base/java.nio.file.Files.walkFileTree(Files.java:2799) ~[?:?]
	at java.base/java.nio.file.Files.walkFileTree(Files.java:2870) ~[?:?]
	at org.opensearch.common.blobstore.fs.FsBlobContainer.delete(FsBlobContainer.java:137) ~[main/:?]
	at org.opensearch.index.translog.transfer.BlobStoreTransferService.delete(BlobStoreTransferService.java:287) ~[main/:?]
	at org.opensearch.index.translog.transfer.BlobStoreTransferService.lambda$deleteAsync$10(BlobStoreTransferService.java:294) ~[main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[main/:?]
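The `WindowsFS` frames in the stack trace above come from a Lucene test-framework mock file system that emulates the Windows rule that a file cannot be deleted while it is still open. The following is a minimal sketch of that rule, assuming a hypothetical in-memory `OpenHandleTracker` class and a made-up file name; it is not Lucene's implementation, only an illustration of why a delete can fail until a lingering handle is closed.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of "no delete while open" semantics (not Lucene's WindowsFS).
class OpenHandleTracker {

    // Number of open handles per path; a path is absent when nothing holds it open.
    private final Map<String, Integer> openHandles = new HashMap<>();

    public synchronized void open(String path) {
        openHandles.merge(path, 1, Integer::sum);
    }

    public synchronized void close(String path) {
        // Returning null from the remapping function removes the entry.
        openHandles.computeIfPresent(path, (p, n) -> n > 1 ? n - 1 : null);
    }

    public synchronized void delete(String path) throws IOException {
        if (openHandles.containsKey(path)) {
            throw new IOException("access denied: " + path);
        }
        // A real file system would remove the file here.
    }

    public static void main(String[] args) throws IOException {
        OpenHandleTracker fs = new OpenHandleTracker();
        String file = "translog-example.tlog"; // hypothetical file name

        fs.open(file);
        try {
            fs.delete(file);
        } catch (IOException e) {
            System.out.println("delete rejected: " + e.getMessage());
        }

        fs.close(file);
        fs.delete(file); // succeeds once the handle has been released
        System.out.println("delete succeeded after close");
    }
}
```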

Our initial assumption was that some internal file locks were not being released in time. To check this, we lowered the await time, and the test started failing more frequently.
As a fix, we have increased the timeout to 60 seconds.

            assertBusy(() -> {
                try {
                    assertEquals(0, getFileCount(translogRepoPath));
                } catch (IOException e) {
                    fail();
                }
            }, 60, TimeUnit.SECONDS);

Tested the new timeout with 2K+ consecutive IT runs without any failures:

BUILD SUCCESSFUL in 27s
60 actionable tasks: 1 executed, 59 up-to-date
[2025-05-05 23:44:43] : ===================================================================
[2025-05-05 23:44:43] : counter=2719
[2025-05-05 23:44:43] : ===================================================================

Related Issues

Resolves #14296

Check List

  • [x] Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@abhita abhita requested a review from a team as a code owner May 19, 2025 03:55
@github-actions

✅ Gradle check result for 096f106: SUCCESS

codecov bot commented May 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.41%. Comparing base (557ea3a) to head (096f106).
Report is 4 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18329      +/-   ##
============================================
- Coverage     72.43%   72.41%   -0.03%     
+ Complexity    67306    67281      -25     
============================================
  Files          5488     5488              
  Lines        311069   311069              
  Branches      45217    45217              
============================================
- Hits         225319   225251      -68     
- Misses        67393    67397       +4     
- Partials      18357    18421      +64     

@gbbafna gbbafna merged commit 93d5356 into opensearch-project:main May 19, 2025
34 of 35 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 19, 2025
Signed-off-by: Abhita Lakkabathini <[email protected]>
(cherry picked from commit 93d5356)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request May 19, 2025
(cherry picked from commit 93d5356)

Signed-off-by: Abhita Lakkabathini <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Jun 1, 2025
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025