Account for time spent tracing, use RDTSC for faster time #4524
Conversation
Thank you for contributing to Miri!
(force-pushed from f57d6e1 to 27cd341)
FYI we have a benchmark suite that tries to tickle some pathological cases in the interpreter. What you've shared clearly demonstrates the value of this PR, but the benchmark suite programs may also be educational.
```rust
pub(super) fn tsc_to_microseconds() -> f64 {
    const BUSY_WAIT: std::time::Duration = std::time::Duration::from_millis(10);
    let tsc_start = rdtsc();
    let instant_start = std::time::Instant::now();
    while instant_start.elapsed() < BUSY_WAIT {
        // `thread::sleep()` is not very precise at waking up the program at the right time,
        // so use a busy loop instead.
        core::hint::spin_loop();
    }
    let tsc_end = rdtsc();
    (BUSY_WAIT.as_nanos() as f64) / 1000.0 / ((tsc_end - tsc_start) as f64)
}

/// Checks whether the TSC counter is available and runs at a constant rate independently
/// of CPU frequency.
pub(super) fn is_tsc_available() -> Option<bool> {
    use std::io::{BufRead, BufReader};

    let cpuinfo = std::fs::File::open("/proc/cpuinfo").ok()?;
    let mut cpuinfo = BufReader::new(cpuinfo);

    let mut buf = String::with_capacity(1024);
    while cpuinfo.read_line(&mut buf).ok()? > 0 {
        if buf.starts_with("flags") {
            return Some(buf.contains("constant_tsc"));
        }
        buf.clear();
    }
    None // EOF
}
```
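The `rdtsc()` helper called above is not shown in this excerpt; a minimal sketch of what it could look like (assuming x86_64 and ignoring the fencing discussed further down) is:

```rust
#[cfg(target_arch = "x86_64")]
pub(super) fn rdtsc() -> u64 {
    // SAFETY: `_rdtsc` has no preconditions; it just reads the CPU's time stamp counter.
    // Whether the counter runs at a constant rate is checked separately.
    unsafe { std::arch::x86_64::_rdtsc() }
}
```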
Is that really the recommended way to deal with TSC? Measure the frequency?!? And parse some text file in /proc to check whether it is present...?
I couldn't find a "recommended" way to deal with TSC, as documentation is quite scarce, but the various sources I used as a reference all obtain the frequency by measuring it:
- https://github.com/fast/fastant/blob/27c9ec5ec90b5b67113a748a4defee0d2519518c/src/tsc_now.rs#L234
- https://gist.github.com/pmttavara/6f06fc5c7679c07375483b06bb77430c
- https://github.com/sheroz/tick_counter/blob/main/src/lib.rs#L184
I also found this article explaining how to install a kernel driver that publishes the TSC frequency from the kernel, so I guess there is no way to obtain that frequency without help from the kernel.
I don't like this approach either, but it should calculate the right frequency quite reliably.
> And parse some text file in /proc to check whether it is present...?

That's also done here.
Actually I found a better way: running the `__cpuid` instruction with the correct parameters and reading the correct field. See how it's done in the raw_cpuid crate (here). I implemented that now and it works on my machine:

```rust
const LEAF: u32 = 0x80000007; // this is the leaf for "advanced power management info"
let cpuid = unsafe { __cpuid(LEAF) };
(cpuid.edx & (1 << 8)) != 0 // EDX bit 8 indicates invariant TSC
```
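For reference, a self-contained version of that check might look like the following (a sketch; the function name is made up, and a fully robust version would first verify that extended leaf 0x8000_0007 is supported):

```rust
#[cfg(target_arch = "x86_64")]
fn has_invariant_tsc() -> bool {
    use std::arch::x86_64::__cpuid;
    const LEAF: u32 = 0x8000_0007; // "advanced power management info" leaf
    // SAFETY: `__cpuid` only executes the CPUID instruction, which is available
    // on every x86_64 CPU.
    let cpuid = unsafe { __cpuid(LEAF) };
    (cpuid.edx & (1 << 8)) != 0 // EDX bit 8: invariant TSC
}
```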
Thank you for the review, I have addressed your comments. Let me know if the fencing approach with […]. I also kept the […]
```rust
unsafe {
    // Fence the execution with `_mm_lfence` and `asm!` to ensure that neither the compiler
    // nor the CPU reorders instructions, as that would make `_rdtsc()` run at the wrong time.
    // ...
```
Is the lfence really needed? And why is it the right CPU fence here?
I did see `_mm_lfence()` used somewhere and took it from there, but I don't know why the load+store `mfence` was not being used instead. I tried looking up more information about this online and found many implementations that don't fence at all, one that also fences with `lfence` (but only optionally), and one that mixes `lfence` and `mfence` (but only when starting a timer, and nevertheless it doesn't make much sense).
None of the implementations I found online fence against the compiler, so that's probably not needed after all. And given that we are not trying to measure the time taken for single (or very few) assembly instructions, it might not matter in any case whether we fence or not, since CPU reordering doesn't reorder instructions that are too far apart.
I tried removing the fencing completely and Miri+tracing got ~15% faster, which is something we might want to keep. The two following images show the usual table obtained with the query below on two trace files: the first one obtained with fences (`lfence` + `asm!("", options(nostack, preserves_flags))`) and the other with no fencing at all. The percentages are very close (if not identical), indicating that removing fences does not change much what we actually measure.

```sql
select "TOTAL PROGRAM DURATION" as name, count(*), max(ts + dur) as "sum(dur)", 100.0 as "%",
       null as "min(dur)", null as "max(dur)", null as "avg(dur)", null as "stddev(dur)"
from slices
union
select "TOTAL OVER ALL SPANS (excluding events)" as name, count(*), sum(dur),
       cast(cast(sum(dur) as float) / (select max(ts + dur) from slices) * 1000 as int) / 10.0 as "%",
       min(dur), max(dur), cast(avg(dur) as int) as "avg(dur)",
       cast(sqrt(avg(dur*dur) - avg(dur)*avg(dur)) as int) as "stddev(dur)"
from slices
where parent_id is null and name != "frame" and name != "step" and dur > 0
union
select name, count(*), sum(dur),
       cast(cast(sum(dur) as float) / (select max(ts + dur) from slices) * 1000 as int) / 10.0 as "%",
       min(dur), max(dur), cast(avg(dur) as int) as "avg(dur)",
       cast(sqrt(avg(dur*dur) - avg(dur)*avg(dur)) as int) as "stddev(dur)"
from slices
where parent_id is null and name != "frame" and name != "step"
group by name
order by sum(dur) desc, count(*) desc
```

So given the above observations, I'd remove fencing completely. I pushed a commit to do so.
AFAIK the fences (or `rdtscp`) are typically used to ensure measurement accuracy by preventing the CPU from reordering the `rdtsc` instruction and thus "blurring" the measurement. It's mostly useful when very accurate timing is necessary, and overkill otherwise.
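For illustration, the fenced pattern under discussion (a sketch of the approach that was ultimately dropped, not the code this PR ships) looks roughly like:

```rust
#[cfg(target_arch = "x86_64")]
fn rdtsc_fenced() -> u64 {
    use std::arch::x86_64::{_mm_lfence, _rdtsc};
    unsafe {
        // An `lfence` before and after keeps the CPU from moving the counter read
        // across neighbouring instructions, which only matters when timing a
        // handful of instructions very precisely.
        _mm_lfence();
        let tsc = _rdtsc();
        _mm_lfence();
        tsc
    }
}
```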
@rustbot ready
This looks great, thanks! Please squash the commits. @rustbot author
Reminder, once the PR becomes ready for a review, use `@rustbot ready`.
(force-pushed from 468fa2c to ada3090)
@rustbot ready
A few years ago I went down the rabbit hole of constructing the lowest-overhead clock measurement possible. The heart of it was relatively simple: read the TSC, then convert the reading to a timestamp with an affine formula calibrated against the OS clock.
The latter is actually surprisingly hard to get right, and it took a couple of tries and measurements to figure out exactly how to adjust the affine formula to get a monotonic timestamp and avoid "swerves" on readjustments. And the whole exercise proved near pointless in practice: it turns out that this whole thing is exactly how `clock_gettime` (and therefore `std::time::Instant`) is already implemented in the Linux vDSO when the TSC is usable. Is there a reason not to use `Instant`?
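A minimal sketch of such an affine conversion (hypothetical names; a real implementation must also recalibrate without ever letting the clock jump backwards):

```rust
struct TscClock {
    tsc_base: u64,    // TSC value captured at calibration time
    ns_base: u64,     // monotonic time (in ns) at calibration time
    ns_per_tick: f64, // measured length of one TSC tick
}

impl TscClock {
    /// Convert a raw TSC reading into nanoseconds using the calibration above.
    fn now_ns(&self, tsc_now: u64) -> u64 {
        let ticks = tsc_now.wrapping_sub(self.tsc_base);
        self.ns_base + (ticks as f64 * self.ns_per_tick) as u64
    }
}
```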
Thanks for the insights! On my PC `rdtsc` is not considered usable as a source for time measurements by the kernel, which thus relies on slower sources. My CPU is quite recent (AMD Ryzen 7 5800H), so I am not sure why the kernel can't use `rdtsc` with it. I guess this PR could be improved by checking which time source the kernel is using, and if the fast vDSO path is already in use, we can just use `Instant` normally.
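One way to check which clock source the kernel selected (a hypothetical helper, not part of this PR) is to read the sysfs file Linux exposes for it:

```rust
// Returns e.g. "tsc", "hpet" or "acpi_pm" on Linux; None if the file is missing.
fn current_clocksource() -> Option<String> {
    std::fs::read_to_string("/sys/devices/system/clocksource/clocksource0/current_clocksource")
        .ok()
        .map(|s| s.trim().to_owned())
}
```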
The kernel disables the TSC when it detects that it's not stable. It should say something in dmesg during boot about the time stamp source and the measured frequency. Note that new hardware does not guarantee that your hardware TSC works; the clocks are notorious for being buggy on some platforms. It's still possible to get an improvement from directly sampling with `rdtsc`, but if it's huge, it's a system configuration problem.
For the sake of completeness, I benchmarked this today on two machines: my desktop computer and an AWS server. I don't have the exact CPU versions off-hand (I should have taken notes), though both support x86-64-v4 (i.e. AVX-512) and are thus relatively recent. On my desktop computer I got timings of about 12ns, while on the AWS server I got timings of about 16-20ns. I expect the difference is mostly due to clocking: those giant AWS instances run at about 3GHz, while my own work computer would be pushing 4GHz. Needless to say, it's a far cry from 1.2µs, and is more in line with roughly the expected performance of […]
Okay, now I also have to try this. ^^ What's a good quick way to benchmark this?
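One quick-and-dirty option (a rough sketch, not a rigorous benchmark) is to time a tight loop of `Instant::now()` calls and divide by the iteration count:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const N: u32 = 1_000_000;
    let start = Instant::now();
    for _ in 0..N {
        // `black_box` keeps the optimizer from dropping the call entirely.
        black_box(Instant::now());
    }
    println!("~{} ns per call", start.elapsed().as_nanos() / u128::from(N));
}
```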
Sorry, I wasn't on my actual PC before. But to elaborate a bit, you're looking for these lines in your dmesg output. There's tsc-early too, which maybe says something interesting. On x86 you can also have other, dramatically slower, clock sources: either virtualized timestamps in a VM, or things like HPET. Linux will try to default to the TSC, but if it fails to calibrate because your time stamp counter isn't behaving, it will fall back to slower methods. Generally speaking, if you aren't seeing tsc it's a big performance issue, and you might want to investigate more deeply, especially if you like to play video games on the machine. :) One thing to check is that you have the latest BIOS for your motherboard, since sometimes they fix things like this in patches.
It seems like there is a BIOS bug on my PC which has never been fixed, and the TSC is therefore always unstable. I was able to force the kernel to use the TSC anyway with […]
Yeah, maybe we should remove the RDTSC code again? We can have some note in the docs for how people can tell whether they get the fast clock on Linux, and what they can do if they get a slow clock.
This PR does two things to drastically improve the tracing experience by removing some biases in timestamps caused by overheads in the tracing machinery:
1. It measures the time spent inside the `tracing_chrome` machinery and subtracts it from all subsequent timestamps saved to the trace file (a rough sketch of this idea follows below). This is not very elegant, but does the job quite well. The `tracing_chrome` machinery is inherently slow as it has to format traced arguments (e.g. call their `Debug` implementation), and this obviously can't be offloaded to a background thread, since it would involve passing references to data to another thread, which would break memory safety.
2. It uses `rdtsc` to measure time on Linux x86/x86_64, which only takes a few nanoseconds and returns nanosecond-precise timestamps. `std::time::Instant` on the other hand takes ~1.3µs, which completely biases the length of spans shorter than a few microseconds. See at the bottom for the options considered for time measurements.

Sorry for not making two commits, the two changes are too intertwined to clearly separate them out. For more detailed information about the implementation, please read the comments and rustdocs in the code.
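A rough sketch of idea 1. (hypothetical types, not the PR's actual code): keep a running total of the time spent inside the tracing machinery and subtract it from every timestamp that gets written to the trace file.

```rust
use std::time::{Duration, Instant};

struct TraceClock {
    start: Instant,
    tracing_overhead: Duration, // accumulated time spent inside the tracing machinery
}

impl TraceClock {
    /// Timestamp for the next trace entry, with the accumulated tracing overhead removed.
    fn get_ts(&self) -> Duration {
        self.start.elapsed() - self.tracing_overhead
    }

    /// Called after each tracing call with the time that call itself took.
    fn account_overhead(&mut self, spent: Duration) {
        self.tracing_overhead += spent;
    }
}
```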
The following table was obtained by interpreting the Rust script at the bottom in Miri. "Execution time (cpu)" is obtained with Linux's `time` (and is a sum of the time spent on the Miri thread and on the tracing thread), and "Max time in trace file" is obtained by looking at the `ts` value of the last entry of the produced trace file. The sensitivity of the values in the table above seems to be ~10%. We can see that 1. (which measures time twice instead of once per trace event) adds no overhead if 2. is also employed. Using both 1. and 2. reduces the real runtime from 1.7s to 1.4s and especially reduces the time observed in trace files from 1.7s to 0.3s (though the tracing overhead still accounts for $1 - 91\,\mathrm{ms}/273\,\mathrm{ms} \approx 67\%$ of the time seen in the traces).

Rust script used for benchmarking with `n=100`
Options considered in a chat with @RalfJung
I was spending some time trying to figure out in which other places time is spent during the execution of Miri, and by using `perf` I noticed that a tangible amount of time is spent in code under the `concurrency/` subfolder. The functions there are called very frequently and are all quite fast (they take just a few hundred nanoseconds to execute), and when I added tracing calls to them the execution time of Miri nearly doubled. It turns out that tracing calls on Linux take a few microseconds each, and they spend most of that time just measuring time. Here are the options we have:

- `std::time::Instant`, which internally uses `clock_gettime(CLOCK_MONOTONIC)` at least on Linux (see here and here). This system call has a ~1.3µs latency on my PC, and has nanosecond precision. This is not ideal for very short spans, but is ok-ish for spans of at least a few microseconds.
- `clock_gettime(CLOCK_MONOTONIC_COARSE)`. The system call takes as little as ~5ns on my PC, but the clock's precision is a few milliseconds. Using this clock makes traces useless, because events happen much more often than every few milliseconds, as shown here in the section "Coarse time".
- The `rdtsc` instruction reads the TSC counter in the CPU, which was born to continuously count the clock cycles, and was later adapted to just count at a constant rate (due to CPU frequency scaling). This counter only exists on x86/x86_64 platforms, and suffers from stability issues: every core has its own TSC and they may be out of sync, so time might appear to go backwards if the scheduler decides to move Miri's thread from one core to another. To avoid this, Miri's thread can be forced to execute on just one core with `sched_setaffinity` (see the sketch after this list). The implementation gets slightly more complicated if Miri uses more than one thread (e.g. with `-Zmiri-many-seeds`), but is still doable (and I guess it is not so useful to trace Miri with `-Zmiri-many-seeds` anyway). Another issue is that some older CPUs have TSC counters that count at a varying frequency (due to frequency scaling) and are thus not usable to measure time; this can be detected by reading a system file (read more here under "TSC"). `rdtsc` has nanosecond latency and nanosecond precision, which would be perfect for tracing.
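As mentioned in the `rdtsc` bullet above, pinning Miri's thread to a single core avoids cross-core TSC skew; a sketch using the `libc` crate (error handling omitted):

```rust
// Pin the calling thread to the given core so that all `rdtsc` reads come from
// the same TSC. Requires the `libc` crate.
fn pin_to_core(core: usize) {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(core, &mut set);
        // A pid of 0 means "the calling thread".
        libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set);
    }
}
```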
mach_absolute_time(andstd::time::Instantalready uses it), so the latency there is 40ns. Linux on ARM also seems to have a pretty fast time implementation backingstd::time::Instant, because at least on my Android phonestd::time::Instanthas a latency of~250ns. The Windows system call isQueryPerformanceCounter, which should internally use TSC and hence be fast (see here, recommendation 1.). Also see this for the system calls used bystd::time::Instanton the various platforms. Therefore the above problem only applies to Linux on x86/x86_64.What would you suggest doing here? I would continue using
std::time::Instanteverywhere except for x86/x86_64 Linux devices that support constant-rate TSC, and instead use TSC there despite all of its caveats. This change can be made insidetracing_chrome.rs'sget_tsfunction.