Description
When a snapshot is finalized, two things happen:
- The snap-{uuid}.dat blob is written
- The index-N blob is written to include the new snapshot
If the master node fails while the snapshot is being finalized, the repository can be in one of 3 states:
- Master has written neither snap-{uuid}.dat nor index-N before failing.
- Master has written snap-{uuid}.dat but has not written index-N
- Master has written both snap-{uuid}.dat and index-N but has not removed the snapshot from the cluster state
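As a rough illustration of how a newly elected master could tell these three states apart, here is a minimal sketch that uses a local directory as a stand-in for the repository. The class `FinalizationState`, the `classify` helper, and the idea of probing for blobs with `Files.exists` are assumptions made for this example and are not Elasticsearch's repository API. It also assumes that `generation` is the N of the index-N blob this finalization would write, and that index-N is only ever written after snap-{uuid}.dat.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only: a local directory stands in for the blob store.
class FinalizationState {
    enum State { NOTHING_WRITTEN, SNAP_ONLY, BOTH_WRITTEN }

    // Classifies how far finalization got, based on which blobs exist.
    // Assumes index-N is written strictly after snap-{uuid}.dat.
    static State classify(Path repo, String uuid, long generation) {
        boolean snapExists = Files.exists(repo.resolve("snap-" + uuid + ".dat"));
        boolean indexExists = Files.exists(repo.resolve("index-" + generation));
        if (indexExists) {
            // Third state: both blobs written, only cluster state cleanup is left.
            return State.BOTH_WRITTEN;
        }
        // Second state if the snap blob is present, otherwise the first state.
        return snapExists ? State.SNAP_ONLY : State.NOTHING_WRITTEN;
    }
}
```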
Currently, we handle the first and third situations just fine. However, we do not handle the second situation properly: when the new master is elected and tries to take the snapshot to completion, it finds that snap-{uuid}.dat already exists and throws a FileAlreadyExistsException, which causes the snapshot finalization process to fail.
This issue is to improve the handling of the second situation.
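One possible direction, sketched here with plain `java.nio.file` calls rather than the actual repository code, is to make the snap blob write tolerate a blob left behind by the previous master. The class `IdempotentSnapWrite` and the method `writeSnapBlob` are hypothetical names. The sketch also assumes the new master would produce byte-identical content, which may not hold in practice (in that case, overwriting or deleting the stale blob would be an alternative).

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical illustration only: not the Elasticsearch blob container API.
class IdempotentSnapWrite {
    // Returns true if this call created the blob, false if an identical
    // blob was already present (e.g. written by a master that then failed).
    static boolean writeSnapBlob(Path blob, byte[] content) throws IOException {
        try {
            // CREATE_NEW fails with FileAlreadyExistsException if the blob exists.
            Files.write(blob, content, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;
        } catch (FileAlreadyExistsException e) {
            // Assumption: identical bytes mean the earlier write completed.
            if (Arrays.equals(Files.readAllBytes(blob), content)) {
                return false;
            }
            // Different content is unexpected, so surface the original failure.
            throw e;
        }
    }
}
```

With a write like this, finalization could proceed to the index-N write in the second situation instead of failing on the existing snap blob.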
This issue was discovered while debugging the test failure in DedicatedClusterSnapshotRestoreIT#testMasterShutdown (#25062). This test failed as a result of two issues:
- The index file was not written before the node was shut down (leaving the snapshot incomplete)
- The MockRepository thread was blocked on the index-N write, waiting to be woken up, but it was interrupted when the node was closed, so subsequent I/O operations (including finalizing the snapshot by writing the index-N blob) threw a ClosedByInterruptException.
The test has been disabled with an AwaitsFix until this issue is resolved.