-
Notifications
You must be signed in to change notification settings - Fork 842
Description
Bug Report
There is code below that reproduces this issue consistently (increase the sleep time to make it 100%).
I've uploaded a branch containing the test that reproduces the issue and my first attempt at a fix (which doesn't appear to fix anything). There's a draft PR open against v0.1.x to show the different #2744.
Version
tracing-core: 0.1.30tracing: 0.1.37 (not necessary, can also be reproduced from justtracing-core)
Also confirmed present on the v0.1.x branch.
This error is not reproducible on master.
Platform
Linux and macOS confirmed. Have not tested on Windows.
Linux <name> 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/LinuxDarwin <name> 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:31 PST 2024; root:xnu-10063.101.15~2/RELEASE_X86_64 x86_64
Crates
tracing-core
Description
If a Subscriber's register_callsite() function takes too long to complete, it can cause other subscribers to not have register_callsite called at all. Even if the second subscriber is the one which is then recording the tracing event at that callsite.
- We start two threads, each thread:
a. Registers a localSubscriber(viatracing::subscriber::set_default)
b. Records a single event.
When the first subscriber takes too long inside register_callsite(), the second subscriber doesn't receive a call to register_callsite() at all, instead it receives a call to event() directly.
What is too long?
By ensuring the order in which the threads start executing, a thread::sleep of 50 microseconds will
I can reliably reproduce this issue with the code below with a thread::sleep of 50 microseconds, although it will sometimes succeed. I have seen it reproduce with a sleep of only 20 microseconds, although that was less frequent (3 failures out of 10 tries just now). The machine is a 6 core 3.xGHz Dell desktop that's about 6 years old - so it's not top of the line, but also not underpowered. Keep in mind that the test is calling println!, so that may be adding additional delay.
Here is a full reproduction of the issue as an integration test is below. Note that the assert that would actually make this test fail has been commented out so that it can be observed that subscriber-2 doesn't receive a call to register_callsite() at all (it is not that it receives it after the call to event() which I had originally suspected).
use std::{
ptr,
sync::atomic::{AtomicPtr, Ordering},
thread::{self, JoinHandle},
time::Duration,
};
use tracing_core::{
callsite::{Callsite as _, DefaultCallsite},
dispatcher::set_default,
field::{FieldSet, Value},
span, Dispatch, Event, Kind, Level, Metadata, Subscriber,
};
struct TestSubscriber {
sleep: Duration,
callsite: AtomicPtr<Metadata<'static>>,
}
impl TestSubscriber {
fn new(sleep_micros: u64) -> Self {
Self {
sleep: Duration::from_micros(sleep_micros),
callsite: AtomicPtr::new(ptr::null_mut()),
}
}
}
impl Subscriber for TestSubscriber {
fn register_callsite(&self, metadata: &'static Metadata<'static>) -> tracing_core::Interest {
if !self.sleep.is_zero() {
thread::sleep(self.sleep);
}
self.callsite
.store(metadata as *const _ as *mut _, Ordering::SeqCst);
tracing_core::Interest::always()
}
fn event(&self, event: &tracing_core::Event<'_>) {
let stored_callsite = self.callsite.load(Ordering::SeqCst);
let event_callsite: *mut Metadata<'static> = event.metadata() as *const _ as *mut _;
// This assert is the actual test.
assert_eq!(
stored_callsite, event_callsite,
"stored callsite: {stored_callsite:#?} does not match event \
callsite: {event_callsite:#?}. Was `event` called before \
`register_callsite`?"
);
}
fn enabled(&self, _metadata: &Metadata<'_>) -> bool {
true
}
fn new_span(&self, _span: &span::Attributes<'_>) -> span::Id {
span::Id::from_u64(0)
}
fn record(&self, _span: &span::Id, _values: &span::Record<'_>) {}
fn record_follows_from(&self, _span: &span::Id, _follows: &span::Id) {}
fn enter(&self, _span: &tracing_core::span::Id) {}
fn exit(&self, _span: &tracing_core::span::Id) {}
}
fn subscriber_thread(idx: usize, register_sleep_micros: u64) -> JoinHandle<()> {
thread::Builder::new()
.name(format!("subscriber-{idx}"))
.spawn(move || {
// We use a sleep to ensure the starting order of the 2 threads.
let subscriber = TestSubscriber::new(register_sleep_micros);
let _dispatch_guard = set_default(&Dispatch::new(subscriber));
static CALLSITE: DefaultCallsite = {
// The values of the metadata are unimportant
static META: Metadata<'static> = Metadata::new(
"event ",
"module::path",
Level::INFO,
None,
None,
None,
FieldSet::new(&["message"], tracing_core::callsite::Identifier(&CALLSITE)),
Kind::EVENT,
);
DefaultCallsite::new(&META)
};
let _interest = CALLSITE.interest();
let meta = CALLSITE.metadata();
let field = meta.fields().field("message").unwrap();
let message = format!("event-from-{idx}", idx = idx);
let values = [(&field, Some(&message as &dyn Value))];
let value_set = CALLSITE.metadata().fields().value_set(&values);
Event::dispatch(meta, &value_set);
// Wait a bit for everything to end (we don't want to remove the subscriber
// immediately because that will influence the test).
thread::sleep(Duration::from_millis(10));
})
.expect("failed to spawn thread")
}
/// Regression test for missing register_callsite call (#2743)
///
/// This test provokes the race condition which causes the second subscriber to not receive a
/// call to `register_callsite` before it receives a call to `event`.
///
/// Because the test depends on the interaction of multiple dispatchers in different threads,
/// it needs to be in a test file by itself.
#[test]
fn event_before_register() {
let subscriber_1_register_sleep_micros = 30;
let subscriber_2_register_sleep_micros = 0;
let jh1 = subscriber_thread(1, subscriber_1_register_sleep_micros);
// This delay ensures that the event callsite has interest() called first.
thread::sleep(Duration::from_micros(10));
let jh2 = subscriber_thread(2, subscriber_2_register_sleep_micros);
jh1.join().expect("failed to join thread");
jh2.join().expect("failed to join thread");
}If we add println! messages to the register_callsite and event methods, we get the following output:
subscriber-1: register_callsite: 0x000055555564bb88
subscriber-2: event (with callsite): 0x000055555564bb88 (stored callsite: 0x0000000000000000)
subscriber-1: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)
When the sleep in the first subscriber is set to 0, the expected behavior is observed:
subscriber-1: register_callsite: 0x000055555564bb88
subscriber-1: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)
subscriber-1: register_callsite: 0x000055555564bb88
subscriber-2: register_callsite: 0x000055555564bb88
subscriber-2: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)
My best guess is that the reregistration of the callsite (because there's a new subscriber subscriber-2 in the new thread) causes subscriber-2 to skip registering the callsite of the event when it executes it.
A description of the fix can be found in #2938.