@x-INFiN1TY-x x-INFiN1TY-x commented Jun 15, 2025

Related Issues/RFCs:


Problem Statement

OpenSearch clusters using remote-backed storage are susceptible to data inconsistencies when more than one node acts as primary for the same shard and uploads segment metadata concurrently, particularly during network partitions or primary failovers. A stale (previously active) primary might overwrite metadata written by the newly promoted one. This risk undermines cluster safety and complicates automation in recovery flows.

Current multi-writer detection mechanisms are not robust enough to handle this reliably.


Solution Overview

This PR introduces ETag-based conditional writes to the remote segment metadata upload process. ETags (version identifiers from cloud storage systems like S3, GCS, or Azure) allow OpenSearch to safely coordinate access to shared resources. This mechanism ensures only the correct primary shard can write metadata, while stale primaries self-detect and fence themselves.

Key enhancements:

  1. ETag-Based Conditional Writes:
    Primary shards attach the known ETag to each metadata upload using the If-Match condition. If the ETag doesn't match the current version in the remote store, the write is rejected (HTTP 412 Precondition Failed).

  2. Fixed Metadata Filename:
    To enable ETag-based coordination, segment metadata is now always written to a fixed filename (e.g., "segment_metadata") instead of legacy dynamic filenames.

  3. Stale Primary Self-Fencing:
    • On promotion, the new primary performs a non-conditional (forced) metadata upload after clearing its local ETag knowledge. This updates the remote file and its ETag.
    • If the old primary tries to write using a stale ETag, the write fails. This triggers a controlled failShard() operation, fencing off the stale node.

  4. ETag Lifecycle Managed at Shard Level:
    IndexShard now caches the ETag for its segment metadata file and updates it based on the success/failure of remote operations.

This design shifts writer validation from OpenSearch into the remote store’s atomic operations, which improves correctness and simplifies state coordination.
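The If-Match behavior described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: FakeRemoteStore, its PreconditionFailedException, and the "v1", "v2" ETag format are hypothetical names invented for illustration, not classes from this PR or from any storage SDK.

```java
import java.util.Objects;

/** Hypothetical in-memory stand-in for a remote store that honors If-Match. */
final class FakeRemoteStore {

    /** Thrown where a real store would answer HTTP 412 Precondition Failed. */
    static final class PreconditionFailedException extends RuntimeException {
        PreconditionFailedException(String message) {
            super(message);
        }
    }

    private byte[] content = new byte[0];
    private long version = 0;   // bumped on every successful write
    private String etag = null; // null until the first write

    /**
     * Writes the blob. A null expectedETag means an unconditional write;
     * otherwise the write succeeds only if expectedETag equals the current ETag.
     * Returns the ETag of the newly written version.
     */
    synchronized String write(byte[] data, String expectedETag) {
        if (expectedETag != null && !Objects.equals(expectedETag, etag)) {
            throw new PreconditionFailedException("expected " + expectedETag + " but found " + etag);
        }
        content = data.clone();
        version++;
        etag = "v" + version;
        return etag;
    }

    /** Returns a copy of the current blob content. */
    synchronized byte[] read() {
        return content.clone();
    }
}
```

In this model, a newly promoted primary calls write(data, null) to claim the file, which changes the ETag. Any later write from the old primary that still carries the previous ETag is rejected, and the stored content stays as the new primary wrote it.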


Key Implementation Details

IndexShard

  • Introduces a MetadataETagCache per shard to hold the latest known ETag.

  • Provides methods:
    • getMetadataETag()
    • updateMetadataETag()
    • clearMetadataETag()

  • On primary promotion, invokes initiateNonConditionalRemoteMetadataUpload():
    • Clears cached ETag to trigger an unconditional upload.
    • Performs an overwrite that establishes a new ETag and “claims” primary ownership.
    • Handles transient errors gracefully, relying on future refreshes to retry.
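A rough sketch of the shard-level ETag handling listed above follows. The method names come from the description, but everything else is an assumption: ShardETagSketch and the MetadataUploader hook are hypothetical, the boolean return value is added only to make the outcome observable, and the choice to leave the cache empty after a transient failure is a guess at how a later refresh would retry.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of the per-shard ETag cache and the promotion-time claim. */
final class ShardETagSketch {

    /** Hypothetical upload hook: takes the expected ETag (null = unconditional), returns the new ETag. */
    interface MetadataUploader {
        String upload(String expectedETag) throws IOException;
    }

    // Latest ETag this shard knows for its metadata file; null means "unknown".
    private final AtomicReference<String> metadataETag = new AtomicReference<>();

    String getMetadataETag() {
        return metadataETag.get();
    }

    void updateMetadataETag(String newETag) {
        metadataETag.set(newETag);
    }

    void clearMetadataETag() {
        metadataETag.set(null);
    }

    /**
     * Clears the cached ETag, performs an unconditional upload, and caches the
     * ETag it returns. Returns true if the claim succeeded.
     */
    boolean initiateNonConditionalRemoteMetadataUpload(MetadataUploader uploader) {
        clearMetadataETag();
        try {
            String newETag = uploader.upload(null); // null = no If-Match condition
            updateMetadataETag(newETag);
            return true;
        } catch (IOException e) {
            // Assumed behavior: swallow the transient failure and leave the cache
            // empty so that a later refresh can retry the upload.
            return false;
        }
    }
}
```

An AtomicReference is used here because the cached value may be read by a refresh thread while promotion updates it; the actual synchronization in the PR may differ.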

RemoteStoreRefreshListener

  • During each metadata upload:
    • Retrieves the current ETag from the shard.
    • Invokes uploadMetadata(...) with the ETag and a structured ActionListener.

  • On success: Updates shard’s cached ETag.

  • On Precondition Failed: Treats this as a stale primary detection, clears ETag, and calls failShard() for fencing.

  • Logs other failures without failing the shard.
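The three outcomes above can be sketched as a single method. This is a simplified, synchronous illustration rather than the listener code in the PR, which is described as using an ActionListener. RefreshUploadSketch, ShardState, Uploader, and PreconditionFailedException are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, synchronous sketch of the per-refresh upload decision. */
final class RefreshUploadSketch {

    /** Stands in for the store's HTTP 412 Precondition Failed response. */
    static final class PreconditionFailedException extends Exception {
        PreconditionFailedException(String message) {
            super(message);
        }
    }

    /** Hypothetical upload hook: takes the expected ETag, returns the new ETag. */
    interface Uploader {
        String upload(String expectedETag) throws Exception;
    }

    /** Minimal stand-in for the shard state the listener touches. */
    static final class ShardState {
        String cachedETag;
        boolean failed;
        final List<String> log = new ArrayList<>();

        void failShard(String reason) {
            failed = true;
            log.add("failShard: " + reason);
        }
    }

    static void uploadMetadataOnRefresh(ShardState shard, Uploader uploader) {
        String knownETag = shard.cachedETag;
        try {
            // Success: remember the ETag of the version just written.
            shard.cachedETag = uploader.upload(knownETag);
        } catch (PreconditionFailedException e) {
            // Another writer changed the file: treat this shard as a stale primary.
            shard.cachedETag = null;
            shard.failShard("stale primary: " + e.getMessage());
        } catch (Exception e) {
            // Any other failure is logged; the shard keeps running.
            shard.log.add("metadata upload failed: " + e.getMessage());
        }
    }
}
```

The order of the catch clauses matters: the precondition failure is handled first so that only that specific outcome leads to fencing, while unrelated errors leave the cached ETag untouched.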

RemoteSegmentStoreDirectory

  • Accepts a versionIdentifier (ETag) and enhanced ActionListener.

  • Constructs ConditionalWriteOptions based on the ETag:
    • If ETag is present → ifMatch
    • If ETag is null → unconditional upload

  • Always uses "segment_metadata" as the remote filename.
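The mapping from a nullable ETag to write options might look like the sketch below. WriteOptionsSketch and its methods are hypothetical and do not reproduce the API of the ConditionalWriteOptions class introduced in this PR; only the "segment_metadata" filename and the ifMatch/unconditional split are taken from the description.

```java
import java.util.Objects;
import java.util.Optional;

/** Hypothetical value object mirroring the ifMatch / unconditional choice. */
final class WriteOptionsSketch {

    /** Fixed remote filename named in the PR description. */
    static final String METADATA_FILENAME = "segment_metadata";

    private final String ifMatchETag; // null means unconditional

    private WriteOptionsSketch(String ifMatchETag) {
        this.ifMatchETag = ifMatchETag;
    }

    /** Conditional write: succeed only if the remote ETag equals the given one. */
    static WriteOptionsSketch ifMatch(String etag) {
        return new WriteOptionsSketch(Objects.requireNonNull(etag, "etag"));
    }

    /** Unconditional write: overwrite whatever is there. */
    static WriteOptionsSketch unconditional() {
        return new WriteOptionsSketch(null);
    }

    /** Chooses the mode from a nullable version identifier, as the list above describes. */
    static WriteOptionsSketch fromVersionIdentifier(String versionIdentifier) {
        return versionIdentifier == null ? unconditional() : ifMatch(versionIdentifier);
    }

    boolean isConditional() {
        return ifMatchETag != null;
    }

    Optional<String> ifMatchETag() {
        return Optional.ofNullable(ifMatchETag);
    }
}
```

Keeping the null check in one factory method means callers pass whatever the shard has cached and never decide the write mode themselves.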

RemoteDirectory & BlobStore

  • copyFrom() method now takes ConditionalWriteOptions.
  • Passes them through to the underlying blobContainer.writeBlobConditionally(...) for storage-provider-specific handling.

Testing

Unit tests in RemoteSegmentStoreDirectoryTests have been expanded to verify:

  • ETag propagation and conditional write correctness.
  • Proper fencing behavior on ETag mismatches.
  • Correct switching between conditional and unconditional uploads.

Related Issues

Check List

  • New functionality has been documented.
  • Public documentation issue/PR created
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


Visualizing the Changes

  1. Sequence Diagram: Stale Primary Self-Fencing Mechanism
    [Image: Mermaid diagram, 2025-06-15-173440]
    Demonstrates how an outdated ETag causes a 412 failure, triggering the stale primary’s self-initiated failShard().

  2. Architecture: Core Components for Conditional Metadata Upload
    [Image: Mermaid diagram, 2025-06-15-174623]
    Shows interaction flow between IndexShard, RemoteStoreRefreshListener, RemoteSegmentStoreDirectory, RemoteDirectory, and BlobStore, including ETag usage.

  3. Flowchart: New Primary’s Metadata Ownership Claim
    [Image: Mermaid diagram, 2025-06-15-173732]
    Shows how a new primary clears its ETag cache, performs a non-conditional upload, and updates its local ETag before assuming control.

tqranjan and others added 30 commits April 25, 2025 03:22
Signed-off-by: Tanishq Ranjan <[email protected]>
Introduces ConditionalWriteOptions and ConditionalWriteResponse classes to support conditional write operations in BlobContainer implementations.
…tainer & AsyncMultiStreamEncryptedBlobContainer
	plugins/repository-s3/src/main/java/org/opensearch/repositories/s3/S3BlobContainer.java
	plugins/repository-s3/src/test/java/org/opensearch/repositories/s3/S3BlobStoreContainerTests.java
2. Etag Cache Implementation
tqranjan added 9 commits June 16, 2025 18:22
# Conflicts:
#	plugins/repository-s3/src/main/java/org/opensearch/repositories/s3/S3BlobContainer.java
#	plugins/repository-s3/src/test/java/org/opensearch/repositories/s3/S3BlobStoreContainerTests.java
# Conflicts:
#	server/src/main/java/org/opensearch/common/blobstore/AsyncMultiStreamEncryptedBlobContainer.java
#	server/src/main/java/org/opensearch/common/blobstore/ConditionalWrite.java
@github-actions

❌ Gradle check result for 4ef0bcd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@x-INFiN1TY-x x-INFiN1TY-x marked this pull request as draft June 27, 2025 07:40
@opensearch-trigger-bot

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Jul 27, 2025