Add benchmarks, derived from rand #39
base: master
Conversation
Ah, using So
Thanks for doing this. I'll run these on my HW and see how the results compare to the benchmarks I made. I see that your benchmarks do some explicit setup of some Criterion parameters, while I just used the defaults. It would be surprising if that made a huge difference, though. I suspect most of it is down to running on different platforms.

The readme does make it clear that the benchmarks were run on an Apple MacBook Air M1. Is it possible that rand has processor-specific optimizations that are used on the AMD 5800X, but that there are no such (or fewer) processor-specific optimizations for the M1? If so, then I have no problem mentioning that in the readme, to avoid letting people believe that the relative performance seen on the M1 can be expected on Intel/AMD platforms as well.

I have no plans to actually compete with rand on raw performance. The goal was just to provide good enough performance for most people, without ever introducing processor-specific instructions or unsafe code. I was in fact very surprised to find myself in the same neighborhood at all with regards to performance. The fact that performance seemed on par with rand on the M1 is something I suspected was due to rand not being optimized specifically for that platform.
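For reference, the kind of explicit Criterion setup mentioned here might look roughly like the following. The parameter values and the toy generator are purely illustrative, not taken from either benchmark suite:

```rust
// Hypothetical Criterion setup with explicit parameters, of the kind
// referred to above. Requires `criterion` as a dev-dependency and a
// [[bench]] target with `harness = false`.
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion, Throughput};

fn bench_u64(c: &mut Criterion) {
    let mut group = c.benchmark_group("gen_u64");
    // Report results as bytes/second: one u64 is 8 bytes per iteration.
    group.throughput(Throughput::Bytes(8));
    group.bench_function("toy_lcg", |b| {
        // Toy LCG stand-in for a real RNG, purely for illustration.
        let mut state = 1u64;
        b.iter(|| {
            state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
            state
        });
    });
    group.finish();
}

criterion_group! {
    name = benches;
    // Explicit sample size and timing instead of Criterion's defaults.
    config = Criterion::default()
        .sample_size(1000)
        .warm_up_time(Duration::from_secs(1))
        .measurement_time(Duration::from_secs(5));
    targets = bench_u64
}
criterion_main!(benches);
```

Whether such settings matter much compared to the defaults would itself need measuring, as noted above.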
I see that the throughput you get when generating bytes from ChaCha12 using rand is more than 50% of the throughput you get when using Xoshiro256++. That is impressive. Well done. I must admit that I was ever so slightly pleased to see that smallrand is a smidgen ahead of rand when generating bytes with Xoshiro256++, but it's so marginal that it makes no practical difference.

I did see a surprisingly large performance advantage in smallrand's favor on the M1 when generating sub-ranges of u64 values, though. I wonder how that compares on your AMD 5800X. What Rust version did you use for these tests, by the way?
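For context, the Xoshiro256++ step being compared here is tiny, which is part of why byte-slice handling can dominate the fill benchmarks. A minimal sketch of the generator, transcribed from Blackman and Vigna's public-domain reference; the type name and the fill_bytes chunking are my own illustration, not code from either crate:

```rust
// Minimal Xoshiro256++ core, per the public-domain reference algorithm.
pub struct Xoshiro256pp {
    s: [u64; 4],
}

impl Xoshiro256pp {
    // The state must not be all zeros; real crates typically expand a
    // small seed through SplitMix64 first (not shown here).
    pub fn new(seed: [u64; 4]) -> Self {
        assert!(seed.iter().any(|&w| w != 0), "seed must not be all zero");
        Self { s: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        let result = self.s[0]
            .wrapping_add(self.s[3])
            .rotate_left(23)
            .wrapping_add(self.s[0]);
        let t = self.s[1] << 17;
        self.s[2] ^= self.s[0];
        self.s[3] ^= self.s[1];
        self.s[1] ^= self.s[2];
        self.s[0] ^= self.s[3];
        self.s[2] ^= t;
        self.s[3] = self.s[3].rotate_left(45);
        result
    }

    // Byte-slice throughput differences between crates tend to come
    // from this path: chunking, copying and tail handling.
    pub fn fill_bytes(&mut self, dest: &mut [u8]) {
        for chunk in dest.chunks_mut(8) {
            let bytes = self.next_u64().to_le_bytes();
            chunk.copy_from_slice(&bytes[..chunk.len()]);
        }
    }
}

fn main() {
    let mut a = Xoshiro256pp::new([1, 2, 3, 4]);
    let mut b = Xoshiro256pp::new([1, 2, 3, 4]);
    // Same seed, same stream: the generator is deterministic.
    assert!((0..16).all(|_| a.next_u64() == b.next_u64()));
    let mut buf = [0u8; 20];
    a.fill_bytes(&mut buf);
    assert_eq!(buf.len(), 20);
}
```

With the core this small, how a crate batches the output into a byte slice can plausibly matter as much as the generator itself.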
To run these? |
Yes, I think
I will probably create a public repo of the benchmarks I used to get the performance measurements in smallrand's readme.md file, instead of adding benchmarks to this repo.

It might interest you to know that the new 0.10.0 release candidate for rand has greatly improved ChaCha performance on ARM (the Apple M1 in my case). In my benchmark of rand's generation of u64 values using ChaCha12, the time is down by 50% compared to the 0.9 version. rand is now faster than smallrand on the ChaCha benchmarks, as expected. My readme will obviously need a full overhaul.

The only one of my benchmarks where smallrand clearly outperforms rand is generating integers in a range. I think we're using the same algorithm, so there must be something in the rand design that prevents optimization. It might be tempting to take a look one day if I have some spare time on my hands.
I have created a repo for the benchmarks that I made: https://github.com/hpenne/randbenches

These are the numbers I get on my machine: So I'll definitely need to update my readme when you release. But notice how slow
I presume this is using the

I'm a bit confused about which Xoshiro256++ benches you ran.

Regarding ranges in rand, there are rather too many implementations, first because there are pre-baked vs single-sample implementations (though the latter should do well when the range is

If you dig through the PR history, you'll see that there have been multiple attempts to optimise these, though most benchmarks were run on only one machine (AMD Zen 3).
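For readers following along, the pre-baked vs single-sample distinction is easiest to see in a sketch of the widening-multiply rejection method (Lemire's algorithm) that such integer-range samplers are typically built on. This is my own illustration under that assumption, not code from either crate:

```rust
// Sketch of Lemire's widening-multiply rejection method for sampling
// uniformly in [0, n). All names here are hypothetical.
pub fn uniform_below(next_u64: &mut impl FnMut() -> u64, n: u64) -> u64 {
    assert!(n > 0);
    let mut m = u128::from(next_u64()) * u128::from(n);
    // The low 64 bits tell us whether this draw landed in the small
    // biased zone that must be rejected.
    if (m as u64) < n {
        // A "pre-baked" sampler computes this threshold once per
        // range; a single-sample path pays the modulo on each call.
        let threshold = n.wrapping_neg() % n;
        while (m as u64) < threshold {
            m = u128::from(next_u64()) * u128::from(n);
        }
    }
    // The high 64 bits are now uniform in [0, n).
    (m >> 64) as u64
}

fn main() {
    // Toy LCG, just to drive the example deterministically.
    let mut state = 0x9E37_79B9_7F4A_7C15u64;
    let mut rng = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        state
    };
    for _ in 0..1000 {
        let roll = uniform_below(&mut rng, 6) + 1; // a die roll, 1..=6
        assert!((1..=6).contains(&roll));
    }
}
```

The division-free fast path is what makes this family quick on 64-bit hardware: when rejection is rare, most calls cost one widening multiply and one compare, so small differences in code shape can swing the benchmark.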
Yes. RC0, I think.
These are from the benches found under https://github.com/hpenne/randbenches I'm sure the structuring could be improved using Criterion groups as you did, but I haven't got around to that yet.

The benchmark for uniform distribution in a range uses Xoshiro256++, not ChaCha12. Using the faster algorithm seemed a better choice when you want to test the performance of the uniform distribution and not the random algorithm.

There are no comparative tests there yet for uniform distribution of floats, although they should be easy to add. There might be some difference in performance there, as I put some extra work into the algorithm to use the full dynamic range of the floats even for values close to zero, which costs a few extra cycles but is slightly better for use cases like simulation etc. I noticed that many other random crates don't (most seemed to just convert to float and do scale+offset), although I cannot recall what
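For comparison, the common "scale" conversion mentioned here looks like the following. This is a minimal sketch of the baseline approach that smallrand reportedly improves on for values near zero, not smallrand's actual code:

```rust
// Common baseline conversion: take the top 53 bits of a u64 and scale
// by 2^-53. Every output is then a multiple of 2^-53, so values near
// zero carry no extra mantissa bits; the "full dynamic range" approach
// described above spends a few extra cycles to recover them.
pub fn u64_to_f64_unit(x: u64) -> f64 {
    const SCALE: f64 = 1.0 / (1u64 << 53) as f64; // 2^-53
    (x >> 11) as f64 * SCALE
}

fn main() {
    // The output always lies in the half-open interval [0, 1).
    assert_eq!(u64_to_f64_unit(0), 0.0);
    assert!(u64_to_f64_unit(u64::MAX) < 1.0);
    // The smallest nonzero output is exactly 2^-53, which is why this
    // baseline cannot represent tiny values with full precision.
    assert_eq!(u64_to_f64_unit(1 << 11), 1.0 / (1u64 << 53) as f64);
}
```

The extra cycles mentioned above would come from whatever additional work is done below that 2^-53 granularity, which this baseline simply discards.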
Some of your results (e.g. Xoshiro fill) seem to be missing above.
Really, we're testing the combination; it does seem relevant to me to test the uniform-range algorithm with multiple RNGs. Look for example here.

I ran your benches on my 5800X and got very different results: So, it's entirely possible that the choices I made in rust-random/rand#1287 and rust-random/rand#1289 are heavily biased by the CPU I'm using. Re-evaluating on a different machine could prove interesting. It's unfortunate that I didn't get more help running benchmarks at the time.

I'm not convinced it's worth going anywhere from here (the linked PRs and benchmarks were a lot of work), but feel free to investigate further if you like.
Yes, I did not see them as all that relevant to the discussion about ChaCha. I can run this again and attach the complete output if you like. It should be easy to add a benchmark for uniform range based on ChaCha12 as well.
That is really interesting. I tweaked my code until the uniform range code performed really well on my hardware, and you did the same on yours. We both ended up optimizing the code for the compiler optimizations specific to our own architectures. Thus my implementation for uniform ranges of integers beats yours on my HW architecture, and your implementation beats mine on your HW architecture. That makes a lot of sense.

Oh, and I was incorrect regarding benchmarks for uniform floats. I did write one. Unfortunately the measurement for
This is the closest equivalent of rand's generator benchmarks.
smallrand results (AMD 5800X):
rand results:
So my take is that your Xoshiro256++ implementation is fast but the byte-slice implementation is not, while your ChaCha12 implementation is a long way behind rand_chacha. The u64 performance does roughly correlate to your README (excepting rand_chacha), but the fill-bytes performance does not, and I don't think that's only down to the different CPUs used.