
Master failover during snapshotting could leave the snapshot incomplete #25281

@abeyad


When a snapshot is finalized, two things happen:

  1. The snap-{uuid}.dat blob is written
  2. The index-N blob is written to include the new snapshot

If the master node fails while the snapshot is being finalized, the repository can be in one of 3 states:

  1. Master has written neither snap-{uuid}.dat nor index-N before failing.
  2. Master has written snap-{uuid}.dat but has not written index-N.
  3. Master has written both snap-{uuid}.dat and index-N but has not removed the snapshot from the cluster state.
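
To make the three states concrete, here is a minimal sketch that classifies a repository directory by which blobs are present. It is illustrative only: it assumes a plain filesystem repository, and `FinalizationStateSketch`, `classify`, and the `indexGen` parameter (the generation N that finalization would write) are hypothetical names, not Elasticsearch APIs.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Elasticsearch: maps blob presence to the
// three repository states described above.
final class FinalizationStateSketch {

    enum State { NOTHING_WRITTEN, SNAP_ONLY, SNAP_AND_INDEX }

    static State classify(Path repoRoot, String snapshotUuid, long indexGen) {
        boolean snapExists = Files.exists(repoRoot.resolve("snap-" + snapshotUuid + ".dat"));
        boolean indexExists = Files.exists(repoRoot.resolve("index-" + indexGen));
        if (!snapExists && indexExists) {
            // snap-{uuid}.dat is written before index-N, so this should not occur.
            throw new IllegalStateException("index-" + indexGen + " exists without snap blob");
        }
        if (!snapExists) {
            return State.NOTHING_WRITTEN;   // state 1
        }
        return indexExists
            ? State.SNAP_AND_INDEX          // state 3
            : State.SNAP_ONLY;              // state 2, the problematic one
    }
}
```

Note that the mere presence of index-N does not prove it lists the new snapshot; the sketch assumes `indexGen` is the generation that finalization was about to write.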

Currently, we handle the first and third situations just fine. However, we do not handle the second situation properly: when the new master is elected and tries to take the snapshot to completion, it sees that snap-{uuid}.dat already exists and throws a FileAlreadyExistsException, causing snapshot finalization to fail.

This issue is to improve the handling of the second situation.
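
One possible direction, sketched here with plain `java.nio.file` calls rather than the actual repository code, is to make the snap blob write tolerant of a blob left behind by the failed master. `IdempotentSnapWriteSketch` and `writeSnapBlob` are hypothetical names, and whether overwriting a mismatched blob is acceptable is an assumption that would need to be confirmed against the real blob store semantics.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical sketch, not the Elasticsearch implementation.
final class IdempotentSnapWriteSketch {

    /**
     * Writes the snap blob, tolerating a copy left by a previous master.
     * Returns true if this call wrote bytes, false if an identical blob
     * was already present.
     */
    static boolean writeSnapBlob(Path blob, byte[] content) throws IOException {
        try {
            // CREATE_NEW fails with FileAlreadyExistsException if the blob exists.
            Files.write(blob, content, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;
        } catch (FileAlreadyExistsException e) {
            if (Arrays.equals(Files.readAllBytes(blob), content)) {
                return false; // the old master already wrote the same blob
            }
            // Assumption: a differing (e.g. partially written) blob may be replaced.
            Files.write(blob, content, StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE);
            return true;
        }
    }
}
```

With a write like this, the new master could proceed to writing index-N in the second situation instead of failing finalization.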

This issue was discovered while debugging the test failure in DedicatedClusterSnapshotRestoreIT#testMasterShutdown (#25062). This test failed as a result of two issues:

  1. The index file was not written before the node was shut down, leaving the snapshot incomplete.
  2. The MockRepository was blocked on the index-N write, waiting to be awoken, but its thread was interrupted when the node closed, so subsequent I/O operations (including finalizing the snapshot by writing the index-N blob) threw a ClosedByInterruptException.
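
The second failure mode can be reproduced outside the test with standard NIO: a write on an interruptible channel from a thread whose interrupt flag is set closes the channel and throws ClosedByInterruptException. The sketch below is a standalone illustration; `InterruptedWriteSketch` is a hypothetical name and does not model MockRepository itself.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical standalone illustration, not the MockRepository code.
final class InterruptedWriteSketch {

    /** Returns true if writing on an interrupted thread threw ClosedByInterruptException. */
    static boolean writeWhileInterrupted(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(
                file, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // Simulate the node shutdown interrupting the thread before the write.
            Thread.currentThread().interrupt();
            channel.write(ByteBuffer.wrap(new byte[] {1}));
            return false;
        } catch (ClosedByInterruptException e) {
            return true;
        } finally {
            // Clear the interrupt flag so the caller's thread is left usable.
            Thread.interrupted();
        }
    }
}
```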

The test has been disabled with an AwaitsFix until this issue is resolved.
