Account for time spent tracing, use RDTSC for faster time #4524
Conversation
Thank you for contributing to Miri!
(force-pushed from f57d6e1 to 27cd341)
FYI we have a benchmark suite that tries to tickle some pathological cases in the interpreter. What you've shared clearly demonstrates the value of this PR, but the benchmark suite programs may also be educational.
```rust
pub(super) fn tsc_to_microseconds() -> f64 {
    const BUSY_WAIT: std::time::Duration = std::time::Duration::from_millis(10);
    let tsc_start = rdtsc();
    let instant_start = std::time::Instant::now();
    while instant_start.elapsed() < BUSY_WAIT {
        // `thread::sleep()` is not very precise at waking up the program at the right time,
        // so use a busy loop instead.
        core::hint::spin_loop();
    }
    let tsc_end = rdtsc();
    (BUSY_WAIT.as_nanos() as f64) / 1000.0 / ((tsc_end - tsc_start) as f64)
}

/// Checks whether the TSC counter is available and runs at a constant rate independently
/// of CPU frequency.
pub(super) fn is_tsc_available() -> Option<bool> {
    use std::io::{BufRead, BufReader};

    let cpuinfo = std::fs::File::open("/proc/cpuinfo").ok()?;
    let mut cpuinfo = BufReader::new(cpuinfo);

    let mut buf = String::with_capacity(1024);
    while cpuinfo.read_line(&mut buf).ok()? > 0 {
        if buf.starts_with("flags") {
            return Some(buf.contains("constant_tsc"));
        }
        buf.clear();
    }
    None // EOF
}
```
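The `rdtsc()` helper called above is not shown in this excerpt; a minimal sketch of what it could look like (assuming x86_64 and ignoring the fencing discussed further down) is:

```rust
#[cfg(target_arch = "x86_64")]
pub(super) fn rdtsc() -> u64 {
    // SAFETY: `_rdtsc` has no preconditions; it just reads the CPU's time stamp counter.
    // Whether the counter runs at a constant rate is checked separately.
    unsafe { std::arch::x86_64::_rdtsc() }
}
```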
Is that really the recommended way to deal with TSC? Measure the frequency?!? And parse some text file in /proc to check whether it is present...?
I couldn't find a "recommended" way to deal with TSC, as documentation is quite scarce, but the various sources I used as a reference all obtain the frequency by measuring it:
- https://github.com/fast/fastant/blob/27c9ec5ec90b5b67113a748a4defee0d2519518c/src/tsc_now.rs#L234
- https://gist.github.com/pmttavara/6f06fc5c7679c07375483b06bb77430c
- https://github.com/sheroz/tick_counter/blob/main/src/lib.rs#L184
I also found this article explaining how to install a kernel driver that publishes the TSC frequency from the kernel, so I guess there is no way to obtain that frequency without help from the kernel.
I don't like this approach either, but it should calculate the right frequency quite reliably.
> And parse some text file in /proc to check whether it is present...?

That's also done here.
Actually I found a better way: running the `__cpuid` instruction with the correct parameters and reading the correct field. See how it's done in the raw_cpuid crate (here). I implemented that now and it works on my machine:

```rust
const LEAF: u32 = 0x80000007; // this is the leaf for "advanced power management info"
let cpuid = unsafe { __cpuid(LEAF) };
(cpuid.edx & (1 << 8)) != 0 // EDX bit 8 indicates invariant TSC
```
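For reference, a self-contained version of that check might look like the following (a sketch; the function name is made up, and a fully robust version would first verify that extended leaf 0x8000_0007 is supported):

```rust
#[cfg(target_arch = "x86_64")]
fn has_invariant_tsc() -> bool {
    use std::arch::x86_64::__cpuid;
    const LEAF: u32 = 0x8000_0007; // "advanced power management info" leaf
    // SAFETY: `__cpuid` only executes the CPUID instruction, which is available
    // on every x86_64 CPU.
    let cpuid = unsafe { __cpuid(LEAF) };
    (cpuid.edx & (1 << 8)) != 0 // EDX bit 8: invariant TSC
}
```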
Thank you for the review, I have addressed your comments. Let me know if the fencing approach with […]. I also kept the […]
```rust
unsafe {
    // Fence the execution with `_mm_lfence` and `asm!` to ensure that neither the compiler
    // nor the CPU reorders instructions, as that would make `_rdtsc()` run at the wrong time.
    // ...
```
Is the lfence really needed? And why is it the right CPU fence here?
I did see `_mm_lfence()` used somewhere and took it from there, but I don't know why the load+store `mfence` was not being used instead. I tried looking up more information about this online and found many implementations that don't fence at all, one that also fences with `lfence` (but only optionally), and one that mixes `lfence` and `mfence` (but only when starting a timer, and nevertheless it doesn't make much sense).
None of the implementations I found online fence against the compiler, so that's probably not needed after all. And given that we are not trying to measure the time taken for single (or very few) assembly instructions, it might not matter in any case whether we fence or not, since CPU reordering doesn't reorder instructions that are too far apart.
I tried removing the fencing completely and Miri+tracing got ~15% faster, which is something we might want to keep. The two following images show the usual table obtained with the query below on two trace files: the first one obtained with fences (`lfence` + `asm!("", options(nostack, preserves_flags))`) and the other with no fencing at all. The percentages are very close (if not identical), indicating that removing fences does not change much what we actually measure.

```sql
select "TOTAL PROGRAM DURATION" as name, count(*), max(ts + dur) as "sum(dur)", 100.0 as "%",
       null as "min(dur)", null as "max(dur)", null as "avg(dur)", null as "stddev(dur)"
from slices
union
select "TOTAL OVER ALL SPANS (excluding events)" as name, count(*), sum(dur),
       cast(cast(sum(dur) as float) / (select max(ts + dur) from slices) * 1000 as int) / 10.0 as "%",
       min(dur), max(dur), cast(avg(dur) as int) as "avg(dur)",
       cast(sqrt(avg(dur*dur) - avg(dur)*avg(dur)) as int) as "stddev(dur)"
from slices
where parent_id is null and name != "frame" and name != "step" and dur > 0
union
select name, count(*), sum(dur),
       cast(cast(sum(dur) as float) / (select max(ts + dur) from slices) * 1000 as int) / 10.0 as "%",
       min(dur), max(dur), cast(avg(dur) as int) as "avg(dur)",
       cast(sqrt(avg(dur*dur) - avg(dur)*avg(dur)) as int) as "stddev(dur)"
from slices
where parent_id is null and name != "frame" and name != "step"
group by name
order by sum(dur) desc, count(*) desc
```

So given the above observations, I'd remove fencing completely. I pushed a commit to do so.
AFAIK the fences (or `rdtscp`) are typically used to ensure measurement accuracy by preventing the CPU from reordering the `rdtsc` instruction and thus "blurring" the measurement. It's mostly useful when very accurate timing is necessary, and overkill otherwise.
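For illustration, the fenced pattern under discussion (a sketch of the approach that was ultimately dropped, not the code this PR ships) looks roughly like:

```rust
#[cfg(target_arch = "x86_64")]
fn rdtsc_fenced() -> u64 {
    use std::arch::x86_64::{_mm_lfence, _rdtsc};
    unsafe {
        // An `lfence` before and after keeps the CPU from moving the counter read
        // across neighbouring instructions, which only matters when timing a
        // handful of instructions very precisely.
        _mm_lfence();
        let tsc = _rdtsc();
        _mm_lfence();
        tsc
    }
}
```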
@rustbot ready
This looks great, thanks! Please squash the commits. @rustbot author
Reminder, once the PR becomes ready for a review, use `@rustbot ready`.
(force-pushed from 468fa2c to ada3090)
@rustbot ready
A few years ago I went down the rabbit hole of constructing the lowest-overhead clock measurement possible. The heart of it was relatively simple: read the TSC, then convert the reading to a timestamp with an affine formula calibrated against the OS clock.
The latter is actually surprisingly hard to get right, and it took a couple of tries and measurements to figure out exactly how to adjust the affine formula to get a monotonic timestamp and avoid "swerves" on readjustments. And the whole exercise proved near pointless in practice: it turns out that this whole thing is exactly how `clock_gettime` (and therefore `std::time::Instant`) is already implemented in the Linux vDSO when the TSC is usable. Is there a reason not to use `Instant`?
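A minimal sketch of such an affine conversion (hypothetical names; a real implementation must also recalibrate without ever letting the clock jump backwards):

```rust
struct TscClock {
    tsc_base: u64,    // TSC value captured at calibration time
    ns_base: u64,     // monotonic time (in ns) at calibration time
    ns_per_tick: f64, // measured length of one TSC tick
}

impl TscClock {
    /// Convert a raw TSC reading into nanoseconds using the calibration above.
    fn now_ns(&self, tsc_now: u64) -> u64 {
        let ticks = tsc_now.wrapping_sub(self.tsc_base);
        self.ns_base + (ticks as f64 * self.ns_per_tick) as u64
    }
}
```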
Thanks for the insights! On my PC `rdtsc` is not considered usable as a source for time measurements by the kernel, which thus relies on slower sources. My CPU is quite recent (AMD Ryzen 7 5800H), so I am not sure why the kernel can't use `rdtsc` with it. I guess this PR could be improved by checking which time source the kernel is using, and if the fast vDSO path is already in use, we can just use `Instant` normally.
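One way to check which clock source the kernel selected (a hypothetical helper, not part of this PR) is to read the sysfs file Linux exposes for it:

```rust
// Returns e.g. "tsc", "hpet" or "acpi_pm" on Linux; None if the file is missing.
fn current_clocksource() -> Option<String> {
    std::fs::read_to_string("/sys/devices/system/clocksource/clocksource0/current_clocksource")
        .ok()
        .map(|s| s.trim().to_owned())
}
```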
The kernel disables the TSC when it detects that it's not stable. It should say something in dmesg during boot about the time stamp source and the measured frequency. Note that new hardware does not guarantee that your hardware TSC works; the clocks are notorious for being buggy on some platforms. It's still possible to get an improvement from directly sampling with `rdtsc`, but if it's huge, it's a system configuration problem.
For the sake of completeness, I benchmarked this today on two machines: my desktop computer and an AWS server. I don't have the exact CPU versions off-hand (I should have taken notes), though both support x86-64-v4 (i.e. AVX-512) and are thus relatively recent. On my desktop computer I got timings of about 12ns, while on the AWS server I got timings of about 16-20ns. I expect the difference is mostly due to clocking: those giant AWS instances run at about 3GHz, while my own work computer would be pushing 4GHz. Needless to say, it's a far cry from 1.2µs, and is more in line with roughly the expected performance of […]
Okay, now I also have to try this. ^^ What's a good quick way to benchmark this?
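One quick-and-dirty option (a rough sketch, not a rigorous benchmark) is to time a tight loop of `Instant::now()` calls and divide by the iteration count:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const N: u32 = 1_000_000;
    let start = Instant::now();
    for _ in 0..N {
        // `black_box` keeps the optimizer from dropping the call entirely.
        black_box(Instant::now());
    }
    println!("~{} ns per call", start.elapsed().as_nanos() / u128::from(N));
}
```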
Sorry, I wasn't on my actual PC before. But to elaborate a bit, you're looking for these lines in your dmesg output. There's tsc-early too, which maybe says something interesting. On x86 you can also have other, dramatically slower, clock sources: either virtualized timestamps in a VM, or things like HPET. Linux will try to default to the TSC, but if it fails to calibrate because your time stamp counter isn't behaving, it will fall back to slower methods. Generally speaking, if you aren't seeing tsc it's a big performance issue, and you might want to investigate more deeply, especially if you like to play video games on the machine. :) One thing to check is that you have the latest BIOS for your motherboard, since sometimes they fix things like this in patches.
It seems like there is a BIOS bug on my PC which has never been fixed, and the TSC is therefore always unstable. I was able to force the kernel to use the TSC anyway with […]
Yeah, maybe we should remove the RDTSC code again? We can have some note in the docs for how people can tell whether they get the fast clock on Linux, and what they can do if they get a slow clock.
This PR does two things to drastically improve the tracing experience by removing some biases in timestamps caused by overheads in the tracing machinery:
1. It measures the time spent inside the `tracing_chrome` machinery and subtracts it from all subsequent timestamps saved to the trace file (a rough sketch of this idea follows below). This is not very elegant, but does the job quite well. The `tracing_chrome` machinery is inherently slow as it has to format traced arguments (e.g. call their `Debug` implementation), and this obviously can't be offloaded to a background thread, since it would involve passing references to data to another thread, which would break memory safety.
2. It uses `rdtsc` to measure time on Linux x86/x86_64, which only takes a few nanoseconds and returns nanosecond-precise timestamps. `std::time::Instant` on the other hand takes ~1.3µs, which completely biases the length of spans shorter than a few microseconds. See at the bottom for the options considered for time measurements.

Sorry for not making two commits, the two changes are too intertwined to clearly separate them out. For more detailed information about the implementation, please read the comments and rustdocs in the code.
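A rough sketch of idea 1. (hypothetical types, not the PR's actual code): keep a running total of the time spent inside the tracing machinery and subtract it from every timestamp that gets written to the trace file.

```rust
use std::time::{Duration, Instant};

struct TraceClock {
    start: Instant,
    tracing_overhead: Duration, // accumulated time spent inside the tracing machinery
}

impl TraceClock {
    /// Timestamp for the next trace entry, with the accumulated tracing overhead removed.
    fn get_ts(&self) -> Duration {
        self.start.elapsed() - self.tracing_overhead
    }

    /// Called after each tracing call with the time that call itself took.
    fn account_overhead(&mut self, spent: Duration) {
        self.tracing_overhead += spent;
    }
}
```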
The following table was obtained by interpreting the Rust script at the bottom in Miri. "Execution time (cpu)" is obtained with Linux's `time` (and is a sum of the time spent on the Miri thread and on the tracing thread), and "Max time in trace file" is obtained by looking at the `ts` value of the last entry of the produced trace file. The sensitivity of the values in the table above seems to be ~10%. We can see that 1. (which measures time twice instead of once per trace event) adds no overhead if 2. is also employed. Using both 1. and 2. reduces the real runtime from 1.7s to 1.4s and especially reduces the time observed in trace files from 1.7s to 0.3s (though the tracing overhead still accounts for $1 - 91\,\mathrm{ms}/273\,\mathrm{ms} \approx 67\%$ of the time seen in the traces).

Rust script used for benchmarking with `n=100`
Options considered in a chat with @RalfJung
I was spending some time trying to figure out in which other places time is spent during the execution of Miri, and by using `perf` I noticed that a tangible amount of time is spent in code under the `concurrency/` subfolder. The functions there are called very frequently and are all quite fast (they take just a few hundred nanoseconds to execute), and when I added tracing calls to them the execution time of Miri nearly doubled. It turns out that tracing calls on Linux take a few microseconds each, and they spend most of that time just measuring time. Here are the options we have:

- `std::time::Instant`, which internally uses `clock_gettime(CLOCK_MONOTONIC)` at least on Linux (see here and here). This system call has a ~1.3µs latency on my PC, and has nanosecond precision. This is not ideal for very short spans, but is ok-ish for spans of at least a few microseconds.
- `clock_gettime(CLOCK_MONOTONIC_COARSE)`. The system call takes as little as ~5ns on my PC, but the clock's precision is a few milliseconds. Using this clock makes traces useless, because events happen much more often than every few milliseconds, as shown here in the section "Coarse time".
- The `rdtsc` instruction reads the TSC counter in the CPU, which was born to continuously count the clock cycles, and was later adapted to just count at a constant rate (due to CPU frequency scaling). This counter only exists on x86/x86_64 platforms, and suffers from stability issues: every core has its own TSC and they may be out of sync, so time might appear to go backwards if the scheduler decides to move Miri's thread from one core to another. To avoid this, Miri's thread can be forced to execute on just one core with `sched_setaffinity` (see the sketch after this list). The implementation gets slightly more complicated if Miri uses more than one thread (e.g. with `-Zmiri-many-seeds`), but is still doable (and I guess it is not so useful to trace Miri with `-Zmiri-many-seeds` anyway). Another issue is that some older CPUs have TSC counters that count at a varying frequency (due to frequency scaling) and are thus not usable to measure time; this can be detected by reading a system file (read more here under "TSC"). `rdtsc` has nanosecond latency and nanosecond precision, which would be perfect for tracing.
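As mentioned in the `rdtsc` bullet above, pinning Miri's thread to a single core avoids cross-core TSC skew; a sketch using the `libc` crate (error handling omitted):

```rust
// Pin the calling thread to the given core so that all `rdtsc` reads come from
// the same TSC. Requires the `libc` crate.
fn pin_to_core(core: usize) {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(core, &mut set);
        // A pid of 0 means "the calling thread".
        libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set);
    }
}
```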
mach_absolute_time(andstd::time::Instantalready uses it), so the latency there is 40ns. Linux on ARM also seems to have a pretty fast time implementation backingstd::time::Instant, because at least on my Android phonestd::time::Instanthas a latency of~250ns. The Windows system call isQueryPerformanceCounter, which should internally use TSC and hence be fast (see here, recommendation 1.). Also see this for the system calls used bystd::time::Instanton the various platforms. Therefore the above problem only applies to Linux on x86/x86_64.What would you suggest doing here? I would continue using
std::time::Instanteverywhere except for x86/x86_64 Linux devices that support constant-rate TSC, and instead use TSC there despite all of its caveats. This change can be made insidetracing_chrome.rs'sget_tsfunction.