Akka.Remote: strengthen serialization exception handling #7922

@Aaronontheweb

Version Information
Version of Akka.NET? Current production versions (recent 1.4.x or 1.5.x)
Which Akka.NET Modules? Akka.Remote, Akka.Cluster, Akka.Serialization

Describe the bug
When Newtonsoft.Json serialization fails while serializing a wrapped payload (e.g., when using Phobos tracing which wraps messages in SpanEnvelope), the resulting exception causes an Akka.Remote.EndpointException that terminates the remote association and can down cluster nodes. This occurs because serialization exceptions thrown by Newtonsoft.Json are not being properly caught and handled by Akka.Remote's WriteSend protection logic.

The issue manifests when:

  1. A message is wrapped in an outer serialization layer (e.g., Phobos tracing)
  2. The inner payload contains data that triggers a Newtonsoft.Json serialization exception
  3. The exception propagates through EndpointWriter.WriteSend without being handled
  4. This results in an association failure that can down the cluster node

To Reproduce
Steps to reproduce the behavior:

  1. Configure a clustered Akka.NET application with message wrapping enabled (e.g., Phobos tracing)
  2. Send a message containing a type with a property that throws during Newtonsoft.Json serialization (e.g., a DateTimeOffset property constructed with invalid offset values)
  3. Observe the Akka.Remote.EndpointException that causes association failure

Example problematic code pattern:

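using System;

public class ProblematicType
{
    // The getter throws ArgumentOutOfRangeException whenever the serializer reads it.
    // Illustrative values: DateTime.MinValue with a positive offset lands before year 1 in UTC.
    public DateTimeOffset Value => new DateTimeOffset(DateTime.MinValue, TimeSpan.FromHours(1));
}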
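The serializer-level failure can be checked in isolation, without a cluster. The following is a minimal sketch, assuming the Newtonsoft.Json NuGet package is referenced. `ReproType` and `Repro.TrySerialize` are hypothetical names used only for illustration, and the DateTime and offset values are one illustrative way to make the getter throw.

```csharp
using System;
using Newtonsoft.Json;

public class ReproType
{
    // Illustrative values: DateTime.MinValue with a positive offset
    // puts the UTC instant before year 1, so the constructor throws.
    public DateTimeOffset Value =>
        new DateTimeOffset(DateTime.MinValue, TimeSpan.FromHours(1));
}

public static class Repro
{
    // Returns the exception raised by Newtonsoft.Json, or null if
    // serialization succeeded.
    public static Exception TrySerialize(object payload)
    {
        try
        {
            JsonConvert.SerializeObject(payload);
            return null;
        }
        catch (JsonSerializationException ex)
        {
            return ex;
        }
    }
}
```

Calling `Repro.TrySerialize(new ReproType())` should return a JsonSerializationException whose InnerException is the ArgumentOutOfRangeException seen in the stack trace under Actual behavior. This only reproduces the serializer failure; the association failure itself still requires a remoting setup.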

Expected behavior
Serialization failures should be handled gracefully without causing cluster association failures. The framework should:

  • Detect serialization exceptions during WriteSend
  • Handle them in a way that doesn't terminate remote associations or down cluster nodes
  • Provide clear error logging about the serialization failure
  • Isolate the problematic message (e.g., send to dead letters) rather than cascading the failure

Actual behavior
The serialization exception propagates unhandled through the endpoint writer, causing:

Akka.Remote.EndpointException: Failed to write message [Phobos.Tracing.SpanEnvelope] to the transport
 ---> Newtonsoft.Json.JsonSerializationException: Error getting value from 'Value' on '<Type>'.
 ---> System.ArgumentOutOfRangeException: The UTC time represented when the offset is applied must be between year 0 and 10,000. (Parameter 'offset')
   at System.DateTimeOffset..ctor(DateTime dateTime, TimeSpan offset)
   at <UserCode>.get_Value()
   at lambda_method178(Closure, Object)
   at Newtonsoft.Json.Serialization.ExpressionValueProvider.GetValue(Object target)
   [... stack trace continues through serialization pipeline ...]
   at Akka.Serialization.NewtonSoftJsonSerializer.ToBinary(Object obj)
   at Akka.Remote.MessageSerializer.Serialize(ExtendedActorSystem system, Information transportInformation, Object message)
   at Akka.Remote.EndpointWriter.WriteSend(Send send)
   --- End of inner exception stack trace ---
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level, Boolean needToThrow)
   at Akka.Remote.EndpointWriter.WriteSend(Send send)

This results in:

  • Remote association termination
  • Cluster node downing
  • Potential cascading failures across the cluster

Environment

  • .NET 8.0, .NET 6.0, and .NET Framework 4.8 (all versions using Newtonsoft.Json serialization)
  • Observed in clustered environments with message wrapping/tracing enabled

Additional context

Root Cause
The problem is in Akka.Remote.EndpointWriter.WriteSend protection logic. The current exception handling doesn't properly catch or handle specific exception types thrown by Newtonsoft.Json during serialization failures. When serialization fails in a wrapped payload (common with tracing, monitoring, or other message-wrapping features), the exception propagates unhandled and terminates the association.

Impact

  • Cluster stability: Individual message serialization failures cascade into cluster-level failures
  • Production reliability: Unexpected data validation issues can trigger node downing
  • Debugging difficulty: The failure mode is not obvious from error messages
  • Cascading failures: Association termination triggers additional cluster management overhead

While applications can work around this by ensuring all serialized data is valid, the framework should be resilient to individual message serialization failures and not allow them to cascade into cluster-level failures.

Suggested Fix

  • Wrap Newtonsoft.Json serialization exceptions in a custom exception type that can be detected in WriteSend
  • Update the WriteSend exception handling to gracefully handle serialization failures
  • Consider isolating the problematic message (dead letter) rather than failing the entire association
  • Ensure comprehensive error logging to help diagnose serialization issues
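The shape of the suggested fix can be sketched without any Akka.NET types. This is a rough illustration under stated assumptions, not the actual EndpointWriter code: `GuardedSerializer` and `WriterSketch` are hypothetical stand-ins, the serializer and transport are modeled as plain delegates, and `System.Runtime.Serialization.SerializationException` is used here as the single detectable wrapper type where the issue proposes a custom one.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.Serialization;

// Hypothetical wrapper around a serializer; not Akka.NET's API.
public static class GuardedSerializer
{
    public static byte[] ToBinary(object message, Func<object, byte[]> inner)
    {
        try
        {
            return inner(message);
        }
        catch (Exception ex) when (!(ex is SerializationException))
        {
            // Normalize serializer-specific failures into one detectable type,
            // keeping the original exception as the inner exception.
            throw new SerializationException(
                $"Failed to serialize message of type [{message.GetType()}]", ex);
        }
    }
}

// Hypothetical stand-in for the write path; models only exception handling.
public sealed class WriterSketch
{
    public List<object> DeadLetters { get; } = new List<object>();
    public List<string> Log { get; } = new List<string>();

    // Returns true when the message was either written or dropped, so the
    // caller can keep the association up. Transport errors still propagate.
    public bool WriteSend(object message,
                          Func<object, byte[]> serializer,
                          Action<byte[]> transport)
    {
        byte[] payload;
        try
        {
            payload = GuardedSerializer.ToBinary(message, serializer);
        }
        catch (SerializationException ex)
        {
            // Log the failure and isolate the offending message.
            Log.Add($"Dropping message: {ex.Message} " +
                    $"(cause: {ex.InnerException?.GetType().Name})");
            DeadLetters.Add(message);
            return true;
        }

        transport(payload);
        return true;
    }
}
```

Serialization is kept in its own try block so that only serializer failures are swallowed; a failure in the transport delegate still surfaces to the caller, which is where association-level handling belongs. The catch-all filter in `GuardedSerializer` is deliberately broad for the sketch, and a real implementation would probably exclude fatal exceptions.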
