Troubleshooting
java.lang.InternalError(a fault occurred in a recent unsafe memory access operation in compiled Java code)
This is actually an out-of-disk-space issue, and typically indicates that not enough memory has been allocated to /dev/shm.
To resolve this, check that you have enough disk space. In the samples, on Linux, the directory to check will probably be either /dev/shm/aeron or /tmp/aeron (depending on your settings).
See this thread for a similar problem.
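As a quick way to check the available space from Java, the following minimal sketch queries the file store behind a directory using only the JDK. The class name ShmSpaceCheck and the 256 MB threshold are illustrative assumptions, not part of Aeron; pick a threshold that matches your own buffer sizes.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Aeron: reports usable space for a directory.
final class ShmSpaceCheck {
    /** Returns the usable bytes on the file store that holds the given (existing) directory. */
    static long usableBytes(final Path dir) throws IOException {
        final FileStore store = Files.getFileStore(dir);
        return store.getUsableSpace();
    }

    /** True if the usable space covers the required amount. */
    static boolean hasRoom(final long usableBytes, final long requiredBytes) {
        return usableBytes >= requiredBytes;
    }

    public static void main(final String[] args) throws IOException {
        // Defaults to /dev/shm; pass another directory (for example /tmp) as the first argument.
        final Path dir = Paths.get(args.length > 0 ? args[0] : "/dev/shm");
        final long required = 256L * 1024 * 1024; // assumed threshold, for illustration only
        final long usable = usableBytes(dir);
        System.out.printf("%s: %d bytes usable, %d required, ok=%b%n",
            dir, usable, required, hasRoom(usable, required));
    }
}
```

Running this inside a Docker container is a simple way to confirm how much shared memory the container actually has.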
Note: if you are trying to run this inside a Linux Docker container, be aware that, by default, Docker only allocates 64 MB to the shared memory space at /dev/shm. The samples will quickly outgrow this. You can work around this issue by using the --shm-size argument for docker run or shm_size in docker-compose.yaml.
Archive terminating clients with terminated: failed to send response for more than connectTimeoutMs=5000
As of Aeron 1.47.0, the Archive will send periodic heartbeats to Archive clients. If a client does not consume these heartbeats, the Archive will detect backpressure on the publication and terminate the control session for that client, which will cause the client to disconnect. This will happen if you establish a connection to an Archive but do not poll the client. Prior to Aeron 1.47.0, not polling an Archive client could cause backpressure on other clients.
When the Archive terminates a client, it will record this in the Archive's distinct error log. You can see this by running io.aeron.archive.ArchiveTool with the errors command. It will appear as an event similar to this:
1 observations from 2025-07-15 12:01:49.784+0100 to 2025-07-15 12:01:49.784+0100 for:
io.aeron.archive.client.ArchiveEvent: WARN - controlSessionId=1659949395 (controlResponseStreamId=999 controlResponseChannel=aeron:udp?endpoint=localhost:58617|mtu=1408|term-length=65536|session-id=535777809|sparse=true) terminated: failed to send response for more than connectTimeoutMs=5000
From the perspective of the client, the AeronArchive instance will show it is in a DISCONNECTED state, and attempting operations on the client will result in an error like the following:
io.aeron.exceptions.AeronException: ERROR - client is closed
at io.aeron.Aeron.nextCorrelationId(Aeron.java:497)
at io.aeron.archive.client.AeronArchive.startRecording(AeronArchive.java:676)
To avoid this issue, you should ensure you regularly poll any Archive clients you have connected. You can do this with any of the following methods:
AeronArchive#pollForErrorResponse
AeronArchive#checkForErrorResponse
AeronArchive#pollForRecordingSignals
AeronArchive#controlResponsePoller().poll()
An alternative option would be to close any Archive clients that no longer need to be connected. Once you have instructed the Archive to perform an operation, it will continue to do so even if the client that started the operation disconnects. For example, you can connect to an Archive, instruct it to start a recording, then disconnect, and the Archive will continue recording.
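One way to make regular polling hard to forget is to drive it from a loop your application already runs. The sketch below is a minimal, self-contained illustration: ArchivePollTimer is a hypothetical helper, not an Aeron class, and it takes the poll operation as a Supplier<String> so that it compiles without Aeron on the classpath. With a real client you would typically pass archive::pollForErrorResponse, which returns an error message or null. The interval is an assumption; choose one comfortably below the connect timeout shown in the error (5000 ms in the example above).

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Aeron: rate-limits calls to an Archive client poll method.
final class ArchivePollTimer {
    private final Supplier<String> errorPoller; // e.g. archive::pollForErrorResponse
    private final long intervalNs;
    private long deadlineNs;

    ArchivePollTimer(final Supplier<String> errorPoller, final long intervalNs, final long nowNs) {
        this.errorPoller = errorPoller;
        this.intervalNs = intervalNs;
        this.deadlineNs = nowNs; // the first call to poll() will poll immediately
    }

    /**
     * Call this on every iteration of your own duty cycle.
     * Returns the error message reported by the poller, or null if it was not
     * yet time to poll or the poller reported no error.
     */
    String poll(final long nowNs) {
        if (nowNs - deadlineNs < 0) {
            return null; // not due yet
        }
        deadlineNs = nowNs + intervalNs;
        return errorPoller.get();
    }
}
```

A caller would invoke poll(System.nanoTime()) from the same thread that owns the Archive client and log or handle any non-null result.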
This error message, with its accompanying stack trace, can appear on versions of Aeron prior to 1.45.0 and gives the impression that something is not working correctly in the driver. However, this message is actually benign and does not indicate a problem with the driver itself, although it is a symptom of heavy loss in the system.
java.lang.IllegalStateException: maximum number of active RetransmitActions reached
at io.aeron.driver.RetransmitHandler.scanForAvailableRetransmit(RetransmitHandler.java:199)
at io.aeron.driver.RetransmitHandler.onNak(RetransmitHandler.java:88)
at io.aeron.driver.NetworkPublication.onNak(NetworkPublication.java:412)
Aeron limits the number of active retransmissions that can be in progress for a single publication. At times of heavy loss, many subscribers may send NAKs to a publisher, and this limit may be reached. If that happens, subsequent NAKs will not be serviced, and the receiver will re-send the NAK. By the time this happens, some of the retransmits may have completed, and the NAK can be serviced as normal. This issue is most likely to occur on MDC streams where group is not set to true, as there would be many recipients that may all observe the same loss, and each one will send a NAK, resulting in many retransmissions at the same time.
Prior to Aeron 1.45.0, breaching this limit would result in an IllegalStateException being thrown, and the error message above being logged. Given that this is not actually an error condition, it now increments the Retransmit Pool Overflow count counter, rather than throwing an exception. Another change in Aeron 1.45.0 to better infer group semantics on MDC publications should make this kind of issue less likely. It is also possible to increase the limit by setting aeron.max.resend.
If you see this error, it indicates that your system is experiencing heavy loss and that the driver is still running on an old version of Aeron.
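If you decide to raise the limit, the aeron.max.resend property mentioned above can be supplied as a JVM system property in the process that runs the media driver. The following is a minimal sketch under that assumption: DriverResendConfig is a hypothetical helper, the value 32 in the usage note is purely illustrative, and Aeron may enforce its own upper bound on this setting.

```java
// Hypothetical helper, not part of Aeron: sets the resend limit as a system property.
final class DriverResendConfig {
    // Property name as given in the text above.
    static final String MAX_RESEND_PROP = "aeron.max.resend";

    /**
     * Sets the property for the current JVM. This only has an effect if it runs
     * in the media driver's process before the driver reads its configuration.
     */
    static void applyMaxResend(final int maxResend) {
        if (maxResend <= 0) {
            throw new IllegalArgumentException("maxResend must be positive: " + maxResend);
        }
        System.setProperty(MAX_RESEND_PROP, Integer.toString(maxResend));
    }
}
```

For example, calling DriverResendConfig.applyMaxResend(32) before launching an embedded driver is equivalent to passing -Daeron.max.resend=32 on the driver's command line.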
This error suggests an access to a counter after the underlying resources have been released. This can happen if you create a counter, close the Aeron instance through which it was created, and then attempt to use the counter. Depending on whether the access is a get or a set operation, you may see an error like one of the following:
# SIGSEGV (0xb) at pc=0x00007fa0f1d29ad9, pid=209799, tid=209801
#
# JRE version: OpenJDK Runtime Environment Zulu17.38+21-CA (17.0.5+8) (build 17.0.5+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Zulu17.38+21-CA (17.0.5+8-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# J 2054 c1 org.agrona.concurrent.status.AtomicCounter.get()J (12 bytes) @ 0x00007fa0f1d29ad9 [0x00007fa0f1d29a40+0x0000000000000099]
# SIGSEGV (0xb) at pc=0x00007f2700cb271f, pid=218628, tid=218629
#
# JRE version: OpenJDK Runtime Environment Zulu17.38+21-CA (17.0.5+8) (build 17.0.5+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Zulu17.38+21-CA (17.0.5+8-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xeb271f] Unsafe_PutLongVolatile+0x12f
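One defensive pattern is to route counter reads through a small wrapper that you close before closing the Aeron instance, so that a late access fails with an ordinary exception instead of crashing the JVM. The sketch below is illustrative only: GuardedCounterReader is a hypothetical class, it takes the read operation as a LongSupplier (with a real counter you might pass counter::get), and it assumes the wrapper is used and closed from a single thread.

```java
import java.util.function.LongSupplier;

// Hypothetical wrapper, not part of Aeron or Agrona: blocks reads after close().
final class GuardedCounterReader implements AutoCloseable {
    private final LongSupplier reader; // e.g. counter::get on a real counter
    private boolean closed;

    GuardedCounterReader(final LongSupplier reader) {
        this.reader = reader;
    }

    /** Reads the counter value, or throws if this wrapper has already been closed. */
    long get() {
        if (closed) {
            throw new IllegalStateException("counter read after close");
        }
        return reader.getAsLong();
    }

    /** Call this before closing the Aeron instance that created the counter. */
    @Override
    public void close() {
        closed = true;
    }
}
```

This does not make the underlying counter safe by itself; it only helps if every access goes through the wrapper and the wrapper is closed first.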
Systems running systemd will clear /dev/shm when a user's session ends. This can result in the media driver directory being removed, even while the driver is still running. To prevent this from happening, you should run the driver as a system user or disable the option to remove IPC.
A system user is typically one with a user ID of less than 1000, although this range can be customised by setting SYS_UID_MIN and SYS_UID_MAX in /etc/login.defs.
To disable the removal of IPC, set RemoveIPC=no in /etc/systemd/logind.conf.
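To verify the setting at startup, a process could inspect the file itself. The sketch below is a rough check, not a full systemd configuration parser: LogindRemoveIpcCheck is a hypothetical class, it ignores section headers and any drop-in override files, and it treats a missing setting as "not disabled", which is an assumption about your distribution's default.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: checks whether the given logind.conf lines set RemoveIPC to a false value.
final class LogindRemoveIpcCheck {
    /** Returns true if the last uncommented RemoveIPC assignment is a false value. */
    static boolean removeIpcDisabled(final List<String> lines) {
        boolean disabled = false; // assumption: absent setting means removal is enabled
        for (final String raw : lines) {
            final String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith(";")) {
                continue; // blank line or comment
            }
            final int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // section header or other non-assignment line
            }
            final String key = line.substring(0, eq).trim();
            final String value = line.substring(eq + 1).trim().toLowerCase();
            if (key.equals("RemoveIPC")) {
                disabled = value.equals("no") || value.equals("false")
                    || value.equals("0") || value.equals("off");
            }
        }
        return disabled;
    }

    public static void main(final String[] args) throws IOException {
        final List<String> lines = Files.readAllLines(Paths.get("/etc/systemd/logind.conf"));
        System.out.println("RemoveIPC disabled: " + removeIpcDisabled(lines));
    }
}
```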