Skip to content

Conversation

@mdh1418
Copy link
Member

@mdh1418 mdh1418 commented Aug 19, 2025

On rare occasions, EventPipeBuffers may be corrupted during their lifetimes (the reason is currently unknown, but the main suspects are corruption by an external process or some uncaught use-after-free scenario). Once an EventPipeEventInstance is written into an EventPipeBuffer, there are no validation mechanisms guaranteeing that the data is uncorrupted by the time the event instance is read from the buffer. As such, there has been instances where the in-process event listener has been observed to hit an Access Violation due to EventPipeInternal_GetNextEvent returning a non-NULL pointer to corrupted bytes.

In order to better diagnose whether EventPipeBuffers in these scenarios are being corrupted internally or externally, this PR aims two opt-in EventPipeBuffer guarding mechanisms:

  1. Header and Footer Guard signatures - To detect corruption and overrun
  2. Memory Virtual Protection - To limit instances where some internals corrupt the EventPipeBuffer, helping distinguish between an internal and external corruption.

Behind an EventPipeBufferGuardLevel config switch (DOTNET_EventPipeBufferGuardLevel envvar), this PR adds two levels of increasing EventPipeBuffer protection.

DOTNET_EventPipeBufferGuardLevel Behavior
0 No Protection, default
1 EventPipeBuffer Header and Footer guard signatures are active, and RaiseFailFastException triggers upon signature corruption detection. Memory Protection set to ReadOnly once the EventPipeBuffer is converted to read-only
2 In addition to level 1 behavior, EventPipeBuffer memory is set to ReadWrite during event writes, and ReadOnly at all other times after allocation

Header and Footer Guard Details

In the buffer header, we inject a magic + data relevant to the EventPipeBuffer's creation (timestamp, writing thread, and event sequence number), so in the event that the header/footer is partially corrupted, the remaining bytes can provide context for that buffer.

In the buffer footer, we inject a magic, it's inverse for a quick integrity check, a checksum computed from the header's identifiable bytes with a salt, and finally padding bytes to help detect buffer overrun + quick visual marker.

Note: To maintain EventPipeEventInstance 8-byte alignments, 32-bytes for each of the header and footer was determined to be a small enough overhead that provided a good starting point to diagnose buffer corruption.

Testing

Performed manual testing with a debugger, corrupting an EventPipeBuffer's guards before buffer_manager_move_next_event_any_thread during a call to ep_buffer_manager_get_next_event for an in-proc EventListener.

Previously, EventPipeBuffer's were not truly read-only when converted,
as metadata IDs would be generated on the fly and written into the
EventPipeEventInstance during event block writing. Instead, pass in the
computed metadata ID separately, allowing for buffers to be protected
with ClrVirtualProtect.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds diagnostic mechanisms to EventPipeBuffer to detect and diagnose memory corruption issues. The implementation introduces configurable buffer guard levels with header/footer signatures and memory protection to help distinguish between internal and external corruption sources.

Key changes include:

  • Addition of configurable EventPipeBufferGuardLevel with three protection levels (0-2)
  • Implementation of header and footer guard structures with magic values and checksums
  • Integration of memory protection using virtual memory APIs when guards are enabled

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
ep-types-forward.h Adds forward declarations for guard structures and enums for protection levels
ep-rt.h Adds runtime interface declarations for memory protection and fatal error handling
ep-file.c Refactors to pass metadata_id parameter directly instead of storing in event instance
ep-event-instance.h/.c Removes metadata_id field from EventPipeEventInstance structure
ep-buffer.h/.c Core implementation of buffer guards with header/footer structures and validation
ep-buffer-manager.h/.c Integrates guard level configuration and passes it to buffer allocation
ep-block.h/.c Updates to accept metadata_id as parameter instead of extracting from event instance
ep-rt-*.h Platform-specific implementations of memory protection and fatal error functions
clrconfigvalues.h Adds configuration value for EventPipeBufferGuardLevel

// [ uint64 magic2 ][ uint64 magic2_inv ][ uint64 checksum ] and remaining bytes (if any) left as 0xEB.
uint8_t *footer_base_bytes = instance->limit - EP_BUFFER_FOOTER_GUARD_SIZE;
EventPipeBufferFooterGuard *footer = (EventPipeBufferFooterGuard *)footer_base_bytes;
memset (footer, 0xEB, EP_BUFFER_FOOTER_GUARD_SIZE);
Copy link

Copilot AI Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic number 0xEB should be defined as a named constant (e.g., EP_BUFFER_FOOTER_PADDING_BYTE) to improve code maintainability and make its purpose clear.

Suggested change
memset (footer, 0xEB, EP_BUFFER_FOOTER_GUARD_SIZE);
memset (footer, EP_BUFFER_FOOTER_PADDING_BYTE, EP_BUFFER_FOOTER_GUARD_SIZE);

Copilot uses AI. Check for mistakes.

// Verify the remaining bytes in the footer guard are the fill value 0xEB.
for (size_t i = 0; i < sizeof(footer->padding); i++) {
if (footer->padding[i] != 0xEB)
Copy link

Copilot AI Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded 0xEB value should match the named constant suggested for line 56 to ensure consistency and maintainability.

Suggested change
if (footer->padding[i] != 0xEB)
if (footer->padding[i] != EP_BUFFER_FOOTER_PADDING_FILL)

Copilot uses AI. Check for mistakes.
void
ep_rt_fatal_error_with_message (const ep_char8_t *message)
{
if (message != nullptr) {
Copy link

Copilot AI Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use 'is not null' instead of '!= nullptr' according to the coding guidelines for null checks.

Copilot uses AI. Check for mistakes.

// Header trailing padding bytes should still be 0x00
for (size_t i = 0; i < sizeof(header->padding); i++) {
if (header->padding[i] != 0x00)
Copy link

Copilot AI Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded 0x00 value should be defined as a named constant (e.g., EP_BUFFER_HEADER_PADDING_BYTE) for consistency with the footer padding approach.

Suggested change
if (header->padding[i] != 0x00)
// Header trailing padding bytes should still be EP_BUFFER_HEADER_PADDING_BYTE
for (size_t i = 0; i < sizeof(header->padding); i++) {
if (header->padding[i] != EP_BUFFER_HEADER_PADDING_BYTE)

Copilot uses AI. Check for mistakes.
@mdh1418 mdh1418 added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Aug 19, 2025
@mdh1418 mdh1418 removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Oct 7, 2025
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. A few nits inline and then then main item is we should make sure this works for NativeAOT too. I'm hoping we do that by shifting some of the runtime specific callouts into new minipal APIs.

size_t length,
EventPipePageProtection protection)
{
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas - is there some guiding principle on what we add to minipal and what should stay out?

Today EventPipe code relies on runtime-specific callouts for various OS functionality but I'm hoping we can shift that trajectory to depend more on minipal directly. This spot seems like an appealing example where we'd like to invoke an OS API but the existing pattern takes us through runtime-specific wrappers first. Adding VirtualProtect to minipal and using it feels like a good approach to me but I want to make sure I'm not abusing the intent of minipal.

Copy link
Member

@jkotas jkotas Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good candidates for minipal are unambiguous trivial methods used from number of different places. "get current time stamp" is a perfect example.

minipal is not meant to wrap everything with platform specific implementation.

I am not sure whether mprotect is a good candidate for minimap. Most places that call mprotect tend to come with their own unique requirements.

This spot seems like an appealing example where we'd like to invoke an OS API but the existing pattern takes us through runtime-specific wrappers first.

This was setup years ago to work around build system limitations. It does not make sense. It would make a lot more sense for eventpipe to have OS-specific PALs and avoid trying to fit into runtime-specific wrappers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jan! We'll create our own EventPipe PAL as it sounds like these aren't appropriate for minipal.

Copy link
Member

@lateralusX lateralusX Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EventPipe runtime shim was originally put into place to make majority of EventPipe code runtime agnostic, but still reuse runtime unique implementation of low-level artifacts like IO, lock, threading, atomics, mainly keeping the EventPipe<->runtime interaction stable for CoreCLR while porting code from C++ to C and integrated with Mono, originally it even reused each runtime container implementation, but since then, I broken out that part into native/containers and if we don't think there is much value continue using runtime specific implementations of these low level artifacts shimmed by EventPipe runtime layer, then we should probably move towards one EventPipe OS PAL source file shared by Mono/CoreCLR/NAOT.

There will still be some runtime specific things that will stay in the runtime shim layer, but it will be smaller, and it will be simpler to run EventPipe standalone, making it simpler to get our low-level native runtime tests running outside of runtime. I believe some of these things would potentially end up in minipal, we already have some artifacts that could be used by EventPipe in minipal like, mutex, hig-res timers, utf8-ucs2 conversions and I had some volatile/atomics functions in another PR that would suite minipal and EventPipe as well.

I had an ambition for a long time to get our low-level native EventPipe and container tests currently under Mono, https://github.com/dotnet/runtime/tree/main/src/mono/mono/eventpipe/test, running as a runtime test (as a shared native library or separate binary executed from test) running on all runtimes, part of that was to make EventPipe less dependent on runtime specific artifacts, moving towards one EventPipe OS PAL shared by all runtimes. If we believe that is a good direction, then doing that work could start moving us towards an EventPipe OS PAL.

void
ep_rt_fatal_error_with_message (const ep_char8_t *message)
{
/* Not implemented, no-op */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure NativeAOT has an implementation of this too. Ideally similar to above we could have a shared minipal_raise_fatal_error() function that no longer needs runtime-specific callouts. Doing that depends on the fatal error handling staying simple, runtime agnostic, and directly aligning with underlying OS APIs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fatal error handling staying simple, runtime agnostic, and directly aligning with underlying OS APIs.

Fatal error handling includes runtime-specific crash dump and watson logic currently...

instance->remaining_sequence_point_alloc_budget = sequence_point_allocation_budget;
}

buffer_guard_level = EP_MIN (ep_rt_config_value_get_buffer_guard_level (), EP_BUFFER_GUARD_LEVEL_PROTECT_OUTSIDE_WRITES);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is trying to clamp the range into valid enum values. I'd suggest define EP_BUFFER_GUARD_LEVEL_MAX = EP_BUFFER_GUARD_LEVEL_PROTECT_OUTSIDE_WRITES in the enum definition, then use that MAX value here. It better conveys the intent and its more likely to get updated correctly if we ever change the levels in the future.

buffer->current_read_event = NULL;

if (buffer->buffer_guard_level != EP_BUFFER_GUARD_LEVEL_NONE)
ep_buffer_ensure_guard_consistency (buffer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move the consistency check after the vprotect call to guarantee there is no time window where the memory can get corrupted after we check it and before it is protected.

if (buffer->buffer_guard_level != EP_BUFFER_GUARD_LEVEL_NONE)
ep_buffer_ensure_guard_consistency (buffer);

if ((buffer->buffer_guard_level >= EP_BUFFER_GUARD_LEVEL_PROTECT_OUTSIDE_WRITES) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if can be nested under the one at line 128 to make it clear that this code isn't on the hot path. The compiler may not be allowed to optimize that on its own if it has to conservatively assume that buffer->buffer_guard_level could change asynchronously.

if(level != NONE)
{
   ...
   if(level >= PROTECT_OUTSIDE_WRITES) ...
}

if (buffer->buffer_guard_level != EP_BUFFER_GUARD_LEVEL_NONE)
ep_buffer_ensure_guard_consistency (buffer);

if ((buffer->buffer_guard_level >= EP_BUFFER_GUARD_LEVEL_PROTECT_ON_READONLY) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing, this if can be nested under the one at line 259 to ensure the compiler can keep it off the hot path.

uint32_t buffer_guard_level = 0;
gchar *value = g_getenv ("DOTNET_EventPipeBufferGuardLevel");
if (!value)
value = g_getenv ("COMPlus_EventPipeBufferGuardLevel");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not need to support for COMPlus_ for any new variables.

(We are keeping COMPlus_ path for now for existing variables. Ideally, we will delete it at some point.)

@jkotas
Copy link
Member

jkotas commented Oct 15, 2025

On rare occasions, EventPipeBuffers may be corrupted during their lifetimes (the reason is currently unknown, but the main suspects are corruption by an external process or some uncaught use-after-free scenario).

I would bet on a race condition that leads to use-after-free.

We should have an issue that describes the observed problem and what we have found about the nature of the corruption so far. Once we trace it down, we may consider deleting this instrumentation or at least simplifying it - depending on what we find.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants