
Conversation

@Stypox (Contributor) commented Aug 12, 2025

This PR does two things to drastically improve the tracing experience, by removing some timestamp biases caused by overhead in the tracing machinery:

  1. It measures how much time is spent inside the `tracing_chrome` machinery and subtracts it from all subsequent timestamps saved to the trace file. This is not very elegant, but does the job quite well. The `tracing_chrome` machinery is inherently slow, as it has to format traced arguments (e.g. call their Debug implementation), and this obviously can't be offloaded to a background thread, since that would involve passing references to data to another thread, which would break memory safety.
  2. It uses rdtsc to measure time on Linux x86/x86_64, which takes only a few nanoseconds and returns nanosecond-precise timestamps. std::time::Instant, on the other hand, takes ~1.3µs, which completely biases the length of spans shorter than a few microseconds. See the bottom of this description for the options considered for time measurement.

Sorry for not making two commits: the two changes are too intertwined to cleanly separate them. For more detailed information about the implementation, please read the comments and rustdocs in the code.
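To illustrate idea 1, here is a minimal sketch (hypothetical names, not the actual code in this PR): keep a running total of the time spent inside the tracing machinery, and subtract it from every timestamp written to the trace file.

```rust
use std::time::Instant;

/// Hypothetical sketch of idea 1; names are illustrative.
struct TraceClock {
    start: Instant,
    overhead_ns: u128, // total time spent inside the tracing machinery so far
}

impl TraceClock {
    fn new() -> Self {
        Self { start: Instant::now(), overhead_ns: 0 }
    }

    /// Timestamp to write to the trace file, with the tracer's own overhead removed.
    fn ts_ns(&self) -> u128 {
        self.start.elapsed().as_nanos() - self.overhead_ns
    }

    /// Runs the slow tracing work (e.g. formatting `Debug` arguments) and
    /// counts its duration as overhead to subtract from later timestamps.
    fn with_overhead_tracked<R>(&mut self, f: impl FnOnce() -> R) -> R {
        let begin = Instant::now();
        let result = f();
        self.overhead_ns += begin.elapsed().as_nanos();
        result
    }
}
```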

The following table was obtained by interpreting the Rust script at the bottom in Miri. "Execution time (cpu)" is obtained with Linux's `time` (and is the sum of the time spent on the Miri thread and on the tracing thread), and "Max time in trace file" is obtained by looking at the `ts` value of the last entry of the produced trace file. The run-to-run variability of the values in the table below seems to be ~10%. We can see that 1. (which measures time twice instead of once per trace event) adds no overhead if 2. is also employed. Using both 1. and 2. reduces the real runtime from 1.7s to 1.4s and especially reduces the time observed in trace files from 1.7s to 0.3s (though the tracing overhead still accounts for $1-91ms/273ms \approx 67\%$ of the time seen in the traces).

|                  | Execution time (cpu) | Max time in trace file |
|------------------|----------------------|------------------------|
| Tracing disabled | 91ms                 | -                      |
| Before this PR   | 1678ms               | 1650ms                 |
| With just 1.     | 3599ms               | 1511ms                 |
| With just 2.     | 1424ms               | 915ms                  |
| With 1. and 2.   | 1383ms               | 273ms                  |
**Rust script used for benchmarking, with `n=100`:**

```rust
fn main() {
    let n: usize = std::env::args().nth(1).unwrap().parse().unwrap();
    let mut v = (0..n).into_iter().collect::<Vec<_>>();
    for i in &mut v {
        *i += 1;
    }
}
```
**Options considered in a chat with @RalfJung**

I was spending some time trying to figure out where else time is spent during the execution of Miri, and by using perf I noticed that a tangible amount of time is spent in code under the concurrency/ subfolder. The functions there are called very frequently and are all quite fast (they take just a few hundred nanoseconds to execute), yet when I added tracing calls to them the execution time of Miri nearly doubled. It turns out that tracing calls on Linux take a few microseconds each, and most of that time is spent just measuring time. Here are the options we have:

  1. The current implementation uses std::time::Instant, which internally uses clock_gettime(CLOCK_MONOTONIC), at least on Linux (see here and here). This call has ~1.3µs latency on my PC and nanosecond precision. This is not ideal for very short spans, but is ok-ish for spans of at least a few microseconds.
  2. I considered clock_gettime(CLOCK_MONOTONIC_COARSE). The call takes as little as ~5ns on my PC, but the clock's precision is a few milliseconds. Using this clock makes traces useless, because events happen much more often than once every few milliseconds, as shown here in the section "Coarse time".
  3. A third option exists that is both fast and precise, but has portability and stability issues. x86's rdtsc instruction reads the CPU's TSC counter, which originally counted actual clock cycles and was later changed to tick at a constant rate (because CPU frequency scaling would otherwise make it useless for timekeeping). This counter only exists on x86/x86_64 platforms and suffers from stability issues: every core has its own TSC and they may be out of sync, so time might appear to go backwards if the scheduler moves Miri's thread from one core to another. To avoid this, Miri's thread can be pinned to a single core with sched_setaffinity. The implementation gets slightly more complicated if Miri uses more than one thread (e.g. with -Zmiri-many-seeds) but is still doable (and I guess it is not so useful to trace Miri with -Zmiri-many-seeds anyway). Another issue is that some older CPUs have TSC counters that tick at a varying frequency (due to frequency scaling) and are thus unusable for measuring time; this can be detected by reading a system file (read more here under "TSC"). rdtsc has nanosecond latency and nanosecond precision, which would be perfect for tracing.

Note that all Apple devices have a clock running at 24MHz that can be read with mach_absolute_time (and std::time::Instant already uses it), so the latency there is ~40ns. Linux on ARM also seems to have a pretty fast time implementation backing std::time::Instant: at least on my Android phone std::time::Instant has a latency of ~250ns. The Windows system call is QueryPerformanceCounter, which should internally use the TSC and hence be fast (see here, recommendation 1.). Also see this for the system calls used by std::time::Instant on the various platforms. Therefore the problem above only applies to Linux on x86/x86_64.

What would you suggest doing here? I would continue using std::time::Instant everywhere except on x86/x86_64 Linux devices that support constant-rate TSC, and use the TSC there despite all of its caveats. This change can be made inside tracing_chrome.rs's get_ts function.
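As a rough sketch, the platform switch could look like this (illustrative only, limited to x86_64 for brevity; `tsc_to_us` would be the microseconds-per-tick factor obtained by calibrating the TSC against `Instant`):

```rust
use std::time::Instant;

// Fast path: a single rdtsc (a few ns) instead of clock_gettime (~1.3µs here).
#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
fn get_ts(_start: Instant, tsc_start: u64, tsc_to_us: f64) -> f64 {
    let now = unsafe { core::arch::x86_64::_rdtsc() };
    (now - tsc_start) as f64 * tsc_to_us
}

// Portable path: `Instant` is already fast enough on macOS, Windows and ARM.
#[cfg(not(all(target_os = "linux", target_arch = "x86_64")))]
fn get_ts(start: Instant, _tsc_start: u64, _tsc_to_us: f64) -> f64 {
    start.elapsed().as_nanos() as f64 / 1000.0
}
```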

@rustbot (Collaborator) commented Aug 12, 2025

Thank you for contributing to Miri!
Please remember to not force-push to the PR branch except when you need to rebase due to a conflict or when the reviewer asks you for it.

@rustbot added the S-waiting-on-review (Status: Waiting for a review to complete) label on Aug 12, 2025
@Stypox force-pushed the tracing-chrome-overhead-rdtsc branch from f57d6e1 to 27cd341 on August 12, 2025 21:28
@saethlin (Member) commented:
> The following table was obtained by interpreting the Rust script at the bottom in Miri.

FYI we have a benchmark suite that tries to tickle some pathological cases in the interpreter. What you've shared clearly demonstrates the value of this PR, but the benchmark suite programs may also be educational.

Comment on lines 111 to 156
```rust
/// Calibrates the TSC: returns how many microseconds one TSC tick corresponds to,
/// by busy-waiting for a fixed duration and counting how many ticks elapsed.
pub(super) fn tsc_to_microseconds() -> f64 {
    const BUSY_WAIT: std::time::Duration = std::time::Duration::from_millis(10);
    let tsc_start = rdtsc();
    let instant_start = std::time::Instant::now();
    while instant_start.elapsed() < BUSY_WAIT {
        // `thread::sleep()` is not very precise at waking up the program at the right time,
        // so use a busy loop instead.
        core::hint::spin_loop();
    }
    let tsc_end = rdtsc();
    (BUSY_WAIT.as_nanos() as f64) / 1000.0 / ((tsc_end - tsc_start) as f64)
}

/// Checks whether the TSC counter is available and runs at a constant rate independently
/// of CPU frequency.
pub(super) fn is_tsc_available() -> Option<bool> {
    use std::io::{BufRead, BufReader};

    let cpuinfo = std::fs::File::open("/proc/cpuinfo").ok()?;
    let mut cpuinfo = BufReader::new(cpuinfo);

    let mut buf = String::with_capacity(1024);
    while cpuinfo.read_line(&mut buf).ok()? > 0 {
        if buf.starts_with("flags") {
            return Some(buf.contains("constant_tsc"));
        }
        buf.clear();
    }
    None // EOF
}
```
A Member commented:
Is that really the recommended way to deal with TSC? Measure the frequency?!? And parse some text file in /proc to check whether it is present...?

@Stypox (Contributor, Author) replied on Aug 18, 2025:

I couldn't find a "recommended" way to deal with TSC, as documentation is quite scarce, but the various sources I used as a reference all obtained the frequency by measuring it.

I also found this article explaining how to install a kernel driver that exposes the TSC frequency from the kernel, so I guess there is no way to obtain that frequency without kernel access.

I don't like this approach at all either, but it should calculate the right frequency quite reliably.

@Stypox (Contributor, Author) replied:

> And parse some text file in /proc to check whether it is present...?

That's also done here.

@Stypox (Contributor, Author) replied on Aug 18, 2025:

Actually I found a better way: executing the __cpuid instruction with the right leaf and reading the right field. See how it's done in the raw_cpuid crate (here). I implemented that now and it works on my machine:

```rust
use std::arch::x86_64::__cpuid;

const LEAF: u32 = 0x8000_0007; // the leaf for "advanced power management info"
let cpuid = unsafe { __cpuid(LEAF) };
(cpuid.edx & (1 << 8)) != 0 // EDX bit 8 indicates invariant TSC
```

@Stypox (Contributor, Author) commented Aug 18, 2025

Thank you for the review; I have addressed your comments.

Let me know if the fencing approach with asm! and the detection approach with cpuid look good to you.

In with_elapsed_micros_subtracting_tracing I added duplicate comments between the two match branches (because they do the same thing, just with different types); let me know if that's fine too.

I also kept the _ignored as-is, since that's how it was before.


```rust
unsafe {
    // Fence the execution with `_mm_lfence` and `asm!` to ensure that neither the compiler
    // nor the CPU reorder instructions, as that would make _rdtsc() run at the wrong time.
```
A Member commented:

Is the lfence really needed? And why is it the right CPU fence here?

@Stypox (Contributor, Author) replied on Aug 22, 2025:

I did see _mm_lfence() used somewhere and took it from there, but I don't know why the load+store mfence was not used instead. I tried looking up more information about this online and found many implementations that don't fence at all, one that also fences with lfence (but only optionally), and one that mixes lfence and mfence (but only when starting a timer, and it doesn't make much sense anyway).

None of the implementations I found online fence against the compiler, so that's probably not needed after all. And given that we are not trying to measure the time taken by single (or very few) assembly instructions, it might not matter whether we fence at all, since the CPU doesn't reorder instructions that are far apart.

I tried removing the fencing completely and Miri+tracing got ~15% faster, which is something we might want to keep. The two screenshots below show the usual table obtained with the query below on two trace files: the first obtained with fences (lfence + asm!("", options(nostack, preserves_flags))) and the other with no fencing at all. The percentages are very close (if not identical), indicating that removing fences does not change much what we actually measure.

[two screenshots: per-span duration tables, with and without fences; the percentages are nearly identical]
select "TOTAL PROGRAM DURATION" as name, count(*), max(ts + dur) as "sum(dur)", 100.0 as "%", null as "min(dur)", null as "max(dur)", null as "avg(dur)", null as "stddev(dur)" from slices union select "TOTAL OVER ALL SPANS (excluding events)" as name, count(*), sum(dur), cast(cast(sum(dur) as float) / (select max(ts + dur) from slices) * 1000 as int) / 10.0 as "%", min(dur), max(dur), cast(avg(dur) as int) as "avg(dur)", cast(sqrt(avg(dur*dur)-avg(dur)*avg(dur)) as int) as "stddev(dur)" from slices where parent_id is null and name != "frame" and name != "step" and dur > 0 union select name, count(*), sum(dur), cast(cast(sum(dur) as float) / (select max(ts + dur) from slices) * 1000 as int) / 10.0 as "%", min(dur), max(dur), cast(avg(dur) as int) as "avg(dur)", cast(sqrt(avg(dur*dur)-avg(dur)*avg(dur)) as int) as "stddev(dur)" from slices where parent_id is null and name != "frame" and name != "step" group by name order by sum(dur) desc, count(*) desc

So given the above observations, I'd remove fencing completely. I pushed a commit to do so.

A reviewer replied:

AFAIK the fences (or rdtscp) are typically used to ensure measurement accuracy by preventing the CPU from reordering the rdtsc instruction and thus "blur" the measurement. It's mostly useful when very accurate timing is necessary, and overkill otherwise.
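For reference, the two common serialization patterns look roughly like this (a sketch of general practice, not this PR's code):

```rust
use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};

/// `lfence` keeps earlier instructions from drifting past the timestamp read.
fn lfence_rdtsc() -> u64 {
    unsafe {
        _mm_lfence();
        _rdtsc()
    }
}

/// `rdtscp` waits for all prior instructions to retire before reading the TSC.
fn rdtscp_read() -> u64 {
    let mut aux = 0u32;
    unsafe { __rdtscp(&mut aux) }
}
```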

@Stypox (Contributor, Author) commented Aug 22, 2025

@rustbot ready

@RalfJung (Member) commented:

This looks great, thanks! Please squash the commits using ./miri squash. Then write @rustbot ready after you force-pushed the squashed PR.

@rustbot author

@rustbot added S-waiting-on-author (Status: Waiting for the PR author to address review comments) and removed S-waiting-on-review labels on Aug 23, 2025
@rustbot (Collaborator) commented Aug 23, 2025

Reminder, once the PR becomes ready for a review, use @rustbot ready.

@Stypox force-pushed the tracing-chrome-overhead-rdtsc branch from 468fa2c to ada3090 on August 23, 2025 10:51
@Stypox (Contributor, Author) commented Aug 23, 2025

@rustbot ready

@rustbot added S-waiting-on-review and removed S-waiting-on-author labels on Aug 23, 2025
@RalfJung enabled auto-merge on August 23, 2025 10:56
@RalfJung added this pull request to the merge queue on Aug 23, 2025
Merged via the queue into rust-lang:master with commit 972ee78 Aug 23, 2025
15 checks passed
@rustbot removed the S-waiting-on-review label on Aug 23, 2025
@matthieu-m commented:

A few years ago I went down the rabbit hole of constructing the lowest-overhead clock measurement possible.

The heart of it was relatively simple:

  1. Use rdtsc to obtain the number of cycles.
  2. Use an affine formula to scale that to the number of nanoseconds since the start of the Unix epoch.
  3. Regularly re-evaluate the result of the affine formula against the real clock, to catch any drift.

The latter is actually surprisingly hard to get right, and it took a couple of tries and measurements to figure out exactly how to adjust the affine formula to get monotonic timestamps and avoid "swerves" on readjustment.
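In code, the affine idea is roughly this (a hypothetical sketch of the scheme described above, not my actual implementation):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Sketch: ns_since_epoch ≈ a * rdtsc() + b (x86_64 only).
struct AffineClock {
    a: f64, // nanoseconds per TSC tick
    b: f64, // offset: nanoseconds since the Unix epoch at TSC == 0
}

impl AffineClock {
    fn now_ns(&self) -> u64 {
        let tsc = unsafe { std::arch::x86_64::_rdtsc() };
        (self.a * tsc as f64 + self.b) as u64
    }

    /// Re-fit against the real clock to catch drift. This naive version jumps
    /// `b` directly; as noted above, real code must adjust more carefully
    /// (e.g. by slewing) to stay monotonic and avoid "swerves".
    fn recalibrate(&mut self) {
        let real = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_nanos() as f64;
        let tsc = unsafe { std::arch::x86_64::_rdtsc() } as f64;
        self.b = real - self.a * tsc;
    }
}
```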

And the whole exercise proved near pointless, in practice:

  • This neat little clock took about 14ns to compute the real time.
  • A call to gettimeofday, which is vDSO-accelerated, takes 16ns.

It turns out that this is essentially how gettimeofday is implemented under the hood, and while there's possibly a slight overhead (+2ns) in calling it, it saves a lot of hassle.

Is there a reason not to use gettimeofday, at least when available and vDSO-accelerated, instead of trying to implement a custom clock?

@RalfJung (Member) commented:

clock_gettime is also on the vDSO, AFAIK. And that's what libstd uses. But somehow @Stypox got a lot of overhead with that, right?

@Stypox (Contributor, Author) commented Aug 28, 2025

Thanks for the insights! On my PC, rdtsc is not considered usable as a time source by the kernel, which therefore falls back to slower sources. My CPU is quite recent (AMD Ryzen 7 5800H), so I am not sure why the kernel can't use rdtsc with it.

I guess this PR could be improved by checking which clock source the kernel is using, and if it's already the TSC (served through the vDSO), we can use Instant normally.
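A sketch of such a check (the sysfs path is real on Linux; the surrounding logic is hypothetical):

```rust
/// Returns true if the kernel's active clock source is the TSC, in which case
/// clock_gettime should be served quickly from the vDSO and `Instant` is cheap.
fn kernel_clocksource_is_tsc() -> bool {
    std::fs::read_to_string("/sys/devices/system/clocksource/clocksource0/current_clocksource")
        .map(|s| s.trim() == "tsc")
        .unwrap_or(false)
}
```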

@jsimmons commented:

> Thanks for the insights! On my PC, rdtsc is not considered usable as a time source by the kernel, which therefore falls back to slower sources. My CPU is quite recent (AMD Ryzen 7 5800H), so I am not sure why the kernel can't use rdtsc with it.
>
> I guess this PR could be improved by checking which clock source the kernel is using, and if it's already the TSC (served through the vDSO), we can use Instant normally.

The kernel disables the TSC when it detects that it's not stable. It should say something in dmesg during boot about the timestamp source and measured frequency. Note that new hardware does not guarantee that your TSC works; the clocks are notorious for being buggy on some platforms.

It's still possible to get an improvement from directly sampling with rdtsc, but if it's huge, it's a system configuration problem.

@matthieu-m commented:

For the sake of completeness, today I benchmarked std::time::Instant::now() on two distinct x64 machines:

  1. A "gaming" work desktop computer, running Ubuntu 18 via WSL2.
  2. One of the largest AWS instances I have access to (96 cores), running Ubuntu 22.

I don't have the exact CPU versions off-hand -- I should have taken notes -- though both support x86-64-v4 (i.e., AVX-512) and are thus relatively recent.

On my desktop computer I got timings of about 12ns, while on the AWS server I got timings of about 16-20ns. I expect the difference is mostly due to clock speed -- those giant AWS instances run at about 3GHz, while my own work computer would be pushing 4GHz.

Needless to say, it's a far cry from 1.2µs, and is more in line with roughly the expected performance of rdtsc (~6ns) followed by clock frequency adjustment (specifically, multiplying the number of cycles by the inverse of the clock frequency).

@RalfJung (Member) commented:

Okay now I also have to try this.^^ What's a good quick way to benchmark Instant::now?

@RalfJung (Member) commented:

I ran the example I found here and I'm seeing numbers around 16-22ns.

So... yeah that manual rdtsc doesn't usually seem necessary. Odd that Instant::now is so slow on your system, @Stypox.
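The benchmark is roughly along these lines (a sketch of the general approach, not necessarily the linked code):

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const N: u32 = 10_000_000;
    let start = Instant::now();
    for _ in 0..N {
        // black_box keeps the call from being optimized away.
        black_box(Instant::now());
    }
    println!("~{}ns per Instant::now()", start.elapsed().as_nanos() / N as u128);
}
```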

@jsimmons commented Aug 29, 2025

Sorry, I wasn't on my actual PC before. But to elaborate a bit: you're looking for lines like these in your dmesg output. There's tsc-early too, which may say something interesting.

```
Aug 23 12:55:34 archlinux kernel: tsc: Fast TSC calibration using PIT
Aug 23 12:55:34 archlinux kernel: tsc: Detected 3393.477 MHz processor
Aug 23 12:55:34 archlinux kernel: tsc: Refined TSC clocksource calibration: 3393.623 MHz
Aug 23 12:55:34 archlinux kernel: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x30eac649395, max_idle_ns: 440795252668 ns
Aug 23 12:55:34 archlinux kernel: clocksource: Switched to clocksource tsc
```

On x86 you can also have other, dramatically slower, clock sources: either virtualized timestamps in a VM, or things like HPET. Linux will try to default to the TSC, but if calibration fails because your timestamp counter isn't behaving, it falls back to slower methods.

Generally speaking, if you aren't seeing tsc, it's a big performance issue and you might want to investigate more deeply, especially if you like to play video games on the machine. :) One thing to check is that you have the latest BIOS for your motherboard, since sometimes these things get fixed in patches.

@Stypox (Contributor, Author) commented Aug 30, 2025

It seems there is a BIOS bug on my PC that has never been fixed, and the TSC is therefore always considered unstable. I was able to force the kernel to use the TSC anyway with `tsc=reliable clocksource=tsc hpet=disable`, which is risky and problematic, but it does make Instant much faster (~30ns), almost on par with reading the raw TSC (~13ns). But anyway, it seems like it was just my PC's BIOS having problems with an unstable TSC, not a widespread issue, so I guess parts of this PR could be reverted.

@RalfJung (Member) commented:

Yeah, maybe we should remove the RDTSC code again? We can have some note in the docs for how people can tell whether they get the fast clock on Linux, and what they can do if they get a slow clock.
