core: fix race condition in DefaultCallsite::register
#3401
+255
−25
Motivation
There were two separate race conditions related to registration of
callsites. In both cases, it was possible that `event` or `new_span`
could be called before `register_callsite` had finished executing for
all subscribers.
The first case could be triggered when multiple (thread-local) subscribers
were registering the same callsite, and could cause some subscribers to
not receive a call to `register_callsite` at all. This case was fixed
in #2938.
The second case could be triggered when multiple threads reached the same
event or span for the first time, and could occur with only a single
global default subscriber present. The subscriber might receive calls to
`event` or `new_span` before the call to `register_callsite` had
finished executing. This could happen even with a relatively fast
`register_callsite` implementation, although a slow implementation is
more likely to trigger the error.
Solution
This change fixes the race condition by forcing any call to
`DefaultCallsite::register` that runs while another thread is
registering the same callsite to wait until registration has completed.
This is achieved with a loop around the check on the atomic representing
the registration state for that callsite. The waiting thread busy-waits
(hot loops) until the registration is complete.
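The waiting behaviour described above can be illustrated with a minimal, standalone sketch. This is not the code from this PR: the `Callsite` type, the state constants, the `registrations` counter, and the `register_from_threads` helper are all hypothetical stand-ins chosen for illustration, and the real `DefaultCallsite::register` does more work (such as notifying subscribers) than is shown here.

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical registration states, for illustration only.
const UNREGISTERED: u8 = 0;
const REGISTERING: u8 = 1;
const REGISTERED: u8 = 2;

/// Hypothetical stand-in for a callsite; not the real `DefaultCallsite`.
pub struct Callsite {
    state: AtomicU8,
    /// Counts how many times the simulated registration work ran.
    registrations: AtomicUsize,
}

impl Callsite {
    pub const fn new() -> Self {
        Self {
            state: AtomicU8::new(UNREGISTERED),
            registrations: AtomicUsize::new(0),
        }
    }

    /// Returns only once the callsite is fully registered, whichever
    /// thread ends up doing the registration work.
    pub fn register(&self) {
        loop {
            match self.state.compare_exchange(
                UNREGISTERED,
                REGISTERING,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                // This thread won the race and performs the registration.
                Ok(_) => {
                    // Placeholder for the real work of registering the
                    // callsite with every subscriber.
                    self.registrations.fetch_add(1, Ordering::Relaxed);
                    self.state.store(REGISTERED, Ordering::Release);
                    return;
                }
                // Registration already finished: nothing left to do.
                Err(REGISTERED) => return,
                // Another thread is mid-registration: spin and re-check
                // instead of returning early.
                Err(_) => std::hint::spin_loop(),
            }
        }
    }

    pub fn is_registered(&self) -> bool {
        self.state.load(Ordering::Acquire) == REGISTERED
    }

    pub fn registration_count(&self) -> usize {
        self.registrations.load(Ordering::Relaxed)
    }
}

/// Spawns `n` threads that all register the same callsite and returns
/// how many times the registration work actually ran.
pub fn register_from_threads(n: usize) -> usize {
    let callsite = Arc::new(Callsite::new());
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let callsite = Arc::clone(&callsite);
            thread::spawn(move || {
                callsite.register();
                // After `register` returns, registration must be complete.
                assert!(callsite.is_registered());
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    callsite.registration_count()
}
```

In this sketch, the important branch is the last one: a thread that observes the in-progress state keeps looping rather than returning, so no caller can proceed to the equivalent of `event` or `new_span` until the registering thread has stored the final state.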
Tests have been added to both `tracing-core` and `tracing` which trigger
this error case and always fail when run against the previous code.
Fixes: #2743