SqlSnapshotStore with auto-initialization stops if the DB is temporarily inaccessible #3870

@balcko

Hi,
we are using Akka.NET version 1.3.12 in production with clustering and SQL Server persistence.
The application is hosted as a service on Windows Server and targets net461.

Even after the fix for akkadotnet/Akka.Persistence.SqlServer#104, we still hit a problem: persistent actors could not start after a planned DB maintenance window had ended. The only thing that worked for us was restarting the actor systems in the whole cluster via pbm.
In the logs we found the following errors related to the SnapshotStore:

  • Circuit Breaker is open; calls are failing fast
  • Error during snapshot store initialization

I have walked through the SqlServerSnapshotStore code by inspection and found two issues that probably caused the actor to stop:

  1. _breaker.WithCircuitBreaker(() => DeleteAsync(saveSnapshotFailure.Metadata));

If the DB becomes unavailable and saving a snapshot fails often enough for the circuit breaker to open, this line throws immediately, before there is any task to await. The exception causes the actor to restart, after which issue 2 occurs.

  2. If the auto-initialize setting is on (our case), the parent SqlSnapshotStore starts initialization after the actor restarts. Since the DB is still unavailable, initialization fails and as a result the actor is permanently stopped on the line

At this point the whole actor system is broken, since no new persistent actor can start.
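
The synchronous throw described in issue 1 can be illustrated with a small self-contained sketch. It does not use Akka.NET: `OpenBreakerStub` and `BreakerCalls.Guarded` are hypothetical names that only mimic a breaker failing fast in the open state, and the wrapper shows one way the exception could be turned into a faulted task instead of escaping at the call site.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a circuit breaker in the open state.
// This is NOT Akka.NET's CircuitBreaker; it only mimics the fail-fast behaviour.
public sealed class OpenBreakerStub
{
    // Not an async method: the exception is thrown at the call site,
    // before any Task is returned to the caller.
    public Task WithCircuitBreaker(Func<Task> body)
    {
        throw new InvalidOperationException(
            "Circuit Breaker is open; calls are failing fast");
    }
}

public static class BreakerCalls
{
    // Hypothetical wrapper: converts a synchronous throw into a faulted Task,
    // so the caller can observe, log, or ignore the failure without crashing.
    public static Task Guarded(OpenBreakerStub breaker, Func<Task> body)
    {
        try
        {
            return breaker.WithCircuitBreaker(body);
        }
        catch (Exception ex)
        {
            return Task.FromException(ex);
        }
    }
}
```

Calling `BreakerCalls.Guarded(new OpenBreakerStub(), () => Task.CompletedTask)` returns a task whose `IsFaulted` is true, while calling `WithCircuitBreaker` directly throws before anything is returned.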
A possible fix:

  1. The exception should definitely not restart the actor (maybe just log it and swallow it). Another possible approach would be to skip the whole "rollback" delete after a failed save, since the delete will probably fail anyway. A missed snapshot save should not matter: snapshots should only be used to optimize replays, so if the event was successfully persisted beforehand, losing the snapshot does not affect event sourcing.

  2. The actor should not stop if auto-initialization fails. The error should be logged and auto-initialization retried after some delay.
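
The second suggestion (log and retry after a delay) could look roughly like the sketch below. It is plain C# rather than actor code, `InitRetry.RunUntilSuccessAsync` is a hypothetical helper, and inside a real actor the delay would more likely be a scheduled message than `Task.Delay`.

```csharp
using System;
using System.Threading.Tasks;

public static class InitRetry
{
    // Hypothetical helper: keeps calling `initialize` until it succeeds,
    // logging each failure and waiting `delay` before the next attempt.
    // Returns the number of attempts that were needed.
    public static async Task<int> RunUntilSuccessAsync(
        Func<Task> initialize, TimeSpan delay, Action<Exception> log)
    {
        var attempts = 0;
        while (true)
        {
            attempts++;
            try
            {
                // Both synchronous throws and faulted tasks end up in the catch block.
                await initialize().ConfigureAwait(false);
                return attempts;
            }
            catch (Exception ex)
            {
                log(ex); // report the failure instead of stopping
                await Task.Delay(delay).ConfigureAwait(false);
            }
        }
    }
}
```

The loop retries indefinitely on purpose, matching the idea that the store should recover once the DB is back; a capped number of attempts or a growing delay would be an equally reasonable choice.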
