core: fix race condition in `DefaultCallsite::register` #3401

hds · 2025-10-30T17:53:04Z

Motivation

There were two separate race conditions related to the registration of
callsites. In both cases, `event` or `new_span` could be called before
`register_callsite` had finished executing for all subscribers.

The first case could be triggered when multiple (thread-local) subscribers
were registering the same callsite, and could cause some subscribers to
not receive a call to `register_callsite` at all. This case was fixed
in #2938.

The second case can be triggered when multiple threads reach the same
event or span for the first time, and can occur even with only a
single global default subscriber. The subscriber may receive calls to
`event` or `new_span` before the call to `register_callsite` has
finished executing. This can happen even with a relatively fast
`register_callsite` implementation, although a slow implementation is
more likely to trigger the error.
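
To make the second case concrete, here is a minimal sketch of a registration check that does not wait. It is not the `tracing-core` code: `RacyCallsite`, `register_without_waiting`, and the three state constants are hypothetical names, and the `register_callsite` closure stands in for the subscriber callback. It assumes a caller that finds registration already in progress simply carries on.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical registration states, for illustration only.
const UNREGISTERED: u8 = 0;
const REGISTERING: u8 = 1;
const REGISTERED: u8 = 2;

/// Hypothetical stand-in for a callsite's registration state.
pub struct RacyCallsite {
    state: AtomicU8,
}

impl RacyCallsite {
    pub const fn new() -> Self {
        Self { state: AtomicU8::new(UNREGISTERED) }
    }

    /// Returns `true` only if registration is known to be complete on return.
    pub fn register_without_waiting(&self, register_callsite: impl Fn()) -> bool {
        match self.state.compare_exchange(
            UNREGISTERED,
            REGISTERING,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            // This thread won the race and performs the registration itself.
            Ok(_) => {
                register_callsite();
                self.state.store(REGISTERED, Ordering::Release);
                true
            }
            // Registration finished earlier; nothing left to do.
            Err(REGISTERED) => true,
            // Another thread is still inside `register_callsite`. Returning here
            // lets the caller go on to record an event or span too early.
            Err(_) => false,
        }
    }
}
```

A second thread that calls `register_without_waiting` while the first is still inside the callback gets `false` back and proceeds, which is the window in which `event` or `new_span` can arrive before registration has finished.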

Solution

This change fixes the race condition by forcing any call to
`DefaultCallsite::register` that runs while another thread is
registering the same callsite to wait until registration has completed.

This is achieved with a loop around the check on the atomic representing
the registration state for that callsite. A waiting thread spins in a hot
loop until the registration is complete.
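
The loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `WaitingCallsite`, its `register` method, and the state constants are hypothetical names, and the `register_callsite` closure stands in for the work done for the subscribers.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical registration states, for illustration only.
const UNREGISTERED: u8 = 0;
const REGISTERING: u8 = 1;
const REGISTERED: u8 = 2;

/// Hypothetical stand-in for a callsite's registration state.
pub struct WaitingCallsite {
    state: AtomicU8,
}

impl WaitingCallsite {
    pub const fn new() -> Self {
        Self { state: AtomicU8::new(UNREGISTERED) }
    }

    /// Returns only once the callsite is fully registered, whichever thread
    /// ended up doing the registration.
    pub fn register(&self, register_callsite: impl Fn()) {
        loop {
            match self.state.compare_exchange(
                UNREGISTERED,
                REGISTERING,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                // This thread won the race: register, then publish completion.
                Ok(_) => {
                    register_callsite();
                    self.state.store(REGISTERED, Ordering::Release);
                    return;
                }
                // Registration already completed.
                Err(REGISTERED) => return,
                // Another thread is registering: check the atomic again.
                Err(_) => continue,
            }
        }
    }
}
```

The `Release` store paired with the `Acquire` failure ordering means a thread that observes `REGISTERED` also observes the effects of the registration callback.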

Tests have been added to both `tracing-core` and `tracing` which trigger
this error case and always fail against the previous code.

Fixes: #2743


jstarks · 2025-11-08T03:30:37Z

tracing-core/src/callsite.rs

+                    // The callsite is being registered. We have to wait until
+                    // registration is finished, otherwise the register_callsite
+                    // call could be missed completely.
+                    continue;


If we must spin, then at least use `std::hint::spin_loop()`.

hds · 2025-11-08

Thanks for the tip!

This currently causes a deadlock in the case of reentrant traces from inside a subscriber, so I'm trying to come up with a solution that avoids that issue.
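
For reference, a wait loop using the suggested hint might look like the sketch below. The helper names and state constants are hypothetical and not taken from the PR; only `std::hint::spin_loop` and the atomic operations are standard library calls.

```rust
use std::hint;
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical registration states, for illustration only.
const REGISTERING: u8 = 1;
const REGISTERED: u8 = 2;

/// Marks registration as complete (hypothetical helper).
pub fn finish_registration(state: &AtomicU8) {
    state.store(REGISTERED, Ordering::Release);
}

/// Spins until `state` is no longer `REGISTERING` and returns the value seen.
pub fn wait_while_registering(state: &AtomicU8) -> u8 {
    loop {
        let current = state.load(Ordering::Acquire);
        if current != REGISTERING {
            return current;
        }
        // Tell the CPU this is a busy-wait so it can reduce power use or
        // yield resources to a sibling hardware thread.
        hint::spin_loop();
    }
}
```

The hint only makes the spin cheaper. If the thread that set the state to `REGISTERING` calls `wait_while_registering` from inside its own registration callback, the loop never exits, which is the reentrant deadlock described above.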

hds requested review from a team and hawkw as code owners October 30, 2025 17:53

Merge branch 'main' into hds/fix-defaultcallsite-register-race

6ddcedc

hds mentioned this pull request Oct 31, 2025

Slow register_callsite() can cause other subscribers to not receive register_callsite() at all #2743

Open

jstarks reviewed Nov 8, 2025
