@hds hds commented Oct 30, 2025

Motivation

There were two separate race conditions related to the registration of
callsites. In both cases, `event` or `new_span` could be called before
`register_callsite` had finished executing for all subscribers.

The first case could be triggered when multiple (thread-local) subscribers
were registering the same callsite, and could cause some subscribers to
not receive a call to `register_callsite` at all. This case was fixed
in #2938.

The second case can be triggered when multiple threads reach the same
event or span for the first time, and can occur in the presence of only a
single global default subscriber. The subscriber may receive calls to
`event` or `new_span` before the call to `register_callsite` has
finished executing. This can occur even with a relatively fast
`register_callsite` implementation, although it is less likely; a slow
implementation is more likely to trigger the error.

Solution

This change fixes the race condition by forcing any call to
`DefaultCallsite::register` that runs while another thread is
registering the same callsite to wait until registration has completed.

This is achieved with a loop around the check on the atomic representing
the registration state for that callsite. The waiting thread spins
(busy-waits) in that loop until the registration is complete.
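
As a rough illustration of the wait-on-state idea described above, here is a minimal, standalone sketch. It is not the code from this PR or from `tracing-core`: the `Callsite` type, the state constants, and `run_demo` are hypothetical names, and the sleep merely stands in for a slow `register_callsite` implementation.

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical state values, chosen for illustration only.
const UNREGISTERED: u8 = 0;
const REGISTERING: u8 = 1;
const REGISTERED: u8 = 2;

/// A stand-in for a callsite that tracks its own registration state.
pub struct Callsite {
    state: AtomicU8,
    /// Counts how many times the registration body actually ran.
    registrations: AtomicUsize,
}

impl Callsite {
    pub const fn new() -> Self {
        Self {
            state: AtomicU8::new(UNREGISTERED),
            registrations: AtomicUsize::new(0),
        }
    }

    /// Returns only once registration has completed, whichever thread did it.
    pub fn register(&self) {
        loop {
            match self.state.compare_exchange(
                UNREGISTERED,
                REGISTERING,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => {
                    // This thread won the race. The sleep simulates a slow
                    // `register_callsite` call on the subscriber.
                    thread::sleep(Duration::from_millis(20));
                    self.registrations.fetch_add(1, Ordering::Relaxed);
                    self.state.store(REGISTERED, Ordering::Release);
                    return;
                }
                // Registration already finished: nothing left to do.
                Err(REGISTERED) => return,
                // Another thread is mid-registration: check the state again
                // instead of returning early.
                Err(_) => continue,
            }
        }
    }

    pub fn is_registered(&self) -> bool {
        self.state.load(Ordering::Acquire) == REGISTERED
    }
}

/// Spawns `threads` threads that all call `register`. Returns how many of them
/// saw an unfinished registration after `register` returned, and how many
/// times the registration body ran.
pub fn run_demo(threads: usize) -> (usize, usize) {
    let callsite = Arc::new(Callsite::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let callsite = Arc::clone(&callsite);
            thread::spawn(move || {
                callsite.register();
                callsite.is_registered()
            })
        })
        .collect();
    let early = handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .filter(|registered| !registered)
        .count();
    (early, callsite.registrations.load(Ordering::Relaxed))
}
```

In this sketch, `run_demo(4)` should report zero early returns and exactly one run of the registration body. If the `Err(_)` arm returned instead of looping, a thread could leave `register` while the state is still `REGISTERING`, which is the shape of the race described in the Motivation section.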

Tests have been added to both `tracing-core` and `tracing` which trigger
this error case and always fail when run against the previous code.

Fixes: #2743

@hds hds requested review from a team and hawkw as code owners October 30, 2025 17:53

// The callsite is being registered. We have to wait until
// registration is finished, otherwise the register_callsite
// call could be missed completely.
continue;
Contributor

If we must spin, then at least use `std::hint::spin_loop()`.
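
For readers unfamiliar with the hint, a minimal sketch of a spin-wait that calls `std::hint::spin_loop()` might look like the following. The `wait_until_set` and `demo` functions are hypothetical and only illustrate the call. They are not part of this PR, and a `no_std` build would presumably reach for `core::hint::spin_loop()` instead.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Busy-waits until `flag` is set, telling the CPU on every iteration that
/// this thread is in a spin loop.
pub fn wait_until_set(flag: &AtomicBool) {
    while !flag.load(Ordering::Acquire) {
        hint::spin_loop();
    }
}

/// Sets a flag from a second thread after a short delay and spins on it here.
pub fn demo() -> bool {
    let done = Arc::new(AtomicBool::new(false));
    let setter = {
        let done = Arc::clone(&done);
        thread::spawn(move || {
            // Stand-in for work that the spinning thread has to wait for.
            thread::sleep(Duration::from_millis(10));
            done.store(true, Ordering::Release);
        })
    };
    wait_until_set(&done);
    setter.join().unwrap();
    done.load(Ordering::Acquire)
}
```

The hint does not yield to the OS scheduler. It only signals the processor that the thread is busy-waiting, so the loop still spins until the flag changes.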

Contributor Author

Thanks for the tip!

This currently causes a deadlock in the case of reentrant traces from inside a subscriber, so I'm trying to come up with a solution that avoids that issue.

Development

Successfully merging this pull request may close these issues.

Slow register_callsite() can cause other subscribers to not receive register_callsite() at all