Conversation


@dhardy dhardy commented Aug 13, 2025

This is the closest equivalent of rand's generator benchmarks.

smallrand results (AMD 5800X):

random_bytes/xoshiro256++
                        time:   [738.65 ns 740.90 ns 743.77 ns]
                        thrpt:  [1.2822 GiB/s 1.2872 GiB/s 1.2911 GiB/s]
random_bytes/small      time:   [735.70 ns 736.28 ns 736.85 ns]
                        thrpt:  [1.2943 GiB/s 1.2953 GiB/s 1.2963 GiB/s]
random_bytes/chacha12   time:   [3.8578 µs 3.8639 µs 3.8700 µs]
                        thrpt:  [252.34 MiB/s 252.74 MiB/s 253.14 MiB/s]
random_bytes/std        time:   [3.8743 µs 3.8784 µs 3.8825 µs]
                        thrpt:  [251.53 MiB/s 251.80 MiB/s 252.06 MiB/s]
random_u32/xoshiro256++ time:   [650.41 ps 650.94 ps 651.70 ps]
                        thrpt:  [5.7163 GiB/s 5.7229 GiB/s 5.7276 GiB/s]
random_u32/small        time:   [651.86 ps 653.00 ps 654.31 ps]
                        thrpt:  [5.6935 GiB/s 5.7049 GiB/s 5.7149 GiB/s]
random_u32/chacha12     time:   [3.3913 ns 3.3988 ns 3.4087 ns]
                        thrpt:  [1.0929 GiB/s 1.0961 GiB/s 1.0985 GiB/s]
random_u32/std          time:   [3.3888 ns 3.3901 ns 3.3923 ns]
                        thrpt:  [1.0982 GiB/s 1.0989 GiB/s 1.0993 GiB/s]
random_u64/xoshiro256++ time:   [653.56 ps 654.00 ps 654.58 ps]
                        thrpt:  [11.382 GiB/s 11.392 GiB/s 11.400 GiB/s]
random_u64/small        time:   [657.01 ps 659.91 ps 663.17 ps]
                        thrpt:  [11.235 GiB/s 11.290 GiB/s 11.340 GiB/s]
random_u64/chacha12     time:   [6.2850 ns 6.2916 ns 6.2989 ns]
                        thrpt:  [1.1828 GiB/s 1.1842 GiB/s 1.1855 GiB/s]
random_u64/std          time:   [6.3087 ns 6.3105 ns 6.3123 ns]
                        thrpt:  [1.1803 GiB/s 1.1807 GiB/s 1.1810 GiB/s]

rand results:

random_bytes/chacha12   time:   [312.97 ns 313.68 ns 314.41 ns]
                        thrpt:  [3.0332 GiB/s 3.0403 GiB/s 3.0472 GiB/s]
random_bytes/std        time:   [329.09 ns 329.20 ns 329.30 ns]
                        thrpt:  [2.8961 GiB/s 2.8970 GiB/s 2.8979 GiB/s]
random_bytes/small      time:   [173.87 ns 175.26 ns 176.88 ns]
                        thrpt:  [5.3918 GiB/s 5.4415 GiB/s 5.4851 GiB/s]
random_u32/chacha12     time:   [1.3055 ns 1.3111 ns 1.3173 ns]
                        thrpt:  [2.8281 GiB/s 2.8412 GiB/s 2.8535 GiB/s]
random_u32/std          time:   [1.2764 ns 1.2800 ns 1.2838 ns]
                        thrpt:  [2.9018 GiB/s 2.9105 GiB/s 2.9186 GiB/s]
random_u32/small        time:   [664.89 ps 669.13 ps 673.75 ps]
                        thrpt:  [5.5292 GiB/s 5.5673 GiB/s 5.6029 GiB/s]
random_u64/chacha12     time:   [1.9956 ns 2.0027 ns 2.0105 ns]
                        thrpt:  [3.7059 GiB/s 3.7202 GiB/s 3.7335 GiB/s]
random_u64/std          time:   [2.0429 ns 2.0516 ns 2.0611 ns]
                        thrpt:  [3.6149 GiB/s 3.6316 GiB/s 3.6470 GiB/s]
random_u64/small        time:   [668.26 ps 668.54 ps 668.83 ps]
                        thrpt:  [11.140 GiB/s 11.145 GiB/s 11.149 GiB/s]

So my take is that your Xoshiro256++ implementation is fast but the byte-slice implementation is not, while your ChaCha12 implementation is a long way behind rand_chacha.

The u64 performance does roughly correlate with your README (excepting rand_chacha), but the fill-bytes performance does not, and I don't think that's only down to the different CPUs used.
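For context, the xoshiro256++ core being compared here is only a handful of operations per output. Below is a minimal sketch of the public-domain algorithm (Blackman and Vigna's reference design; my own transcription, not the code of either crate), assuming the caller seeds the state with anything nonzero:

```rust
/// Minimal xoshiro256++ core, after Blackman & Vigna's reference C code.
/// The state must be seeded to something other than all zeros.
struct Xoshiro256pp {
    s: [u64; 4],
}

impl Xoshiro256pp {
    fn new(seed: [u64; 4]) -> Self {
        assert!(seed.iter().any(|&w| w != 0), "state must be nonzero");
        Self { s: seed }
    }

    fn next_u64(&mut self) -> u64 {
        // Output function: rotl(s0 + s3, 23) + s0.
        let result = self.s[0]
            .wrapping_add(self.s[3])
            .rotate_left(23)
            .wrapping_add(self.s[0]);

        // State transition: xor/shift/rotate over the four words.
        let t = self.s[1] << 17;
        self.s[2] ^= self.s[0];
        self.s[3] ^= self.s[1];
        self.s[1] ^= self.s[2];
        self.s[0] ^= self.s[3];
        self.s[2] ^= t;
        self.s[3] = self.s[3].rotate_left(45);

        result
    }
}

fn main() {
    let mut rng = Xoshiro256pp::new([1, 2, 3, 4]);
    let a = rng.next_u64();
    let b = rng.next_u64();
    println!("{a} {b}");
}
```

Any real implementation adds seeding and trait plumbing on top, but the hot loop is just this, which is why the u64 numbers for both crates land so close together.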


dhardy commented Aug 13, 2025

Ah, using fill_u8 is much faster than fill:

random_bytes/xoshiro256++
                        time:   [170.92 ns 171.51 ns 172.22 ns]
                        thrpt:  [5.5376 GiB/s 5.5605 GiB/s 5.5795 GiB/s]
                 change:
                        time:   [-77.007% -76.896% -76.797%] (p = 0.00 < 0.05)
                        thrpt:  [+330.98% +332.83% +334.92%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe
random_bytes/small      time:   [170.95 ns 171.54 ns 172.35 ns]
                        thrpt:  [5.5334 GiB/s 5.5593 GiB/s 5.5787 GiB/s]
                 change:
                        time:   [-76.735% -76.548% -76.327%] (p = 0.00 < 0.05)
                        thrpt:  [+322.42% +326.40% +329.82%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe
random_bytes/chacha12   time:   [891.73 ns 892.09 ns 892.49 ns]
                        thrpt:  [1.0686 GiB/s 1.0690 GiB/s 1.0695 GiB/s]
                 change:
                        time:   [-77.343% -77.276% -77.220%] (p = 0.00 < 0.05)
                        thrpt:  [+338.99% +340.07% +341.36%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
random_bytes/std        time:   [913.05 ns 913.61 ns 914.45 ns]
                        thrpt:  [1.0429 GiB/s 1.0439 GiB/s 1.0445 GiB/s]
                 change:
                        time:   [-76.803% -76.659% -76.534%] (p = 0.00 < 0.05)
                        thrpt:  [+326.14% +328.44% +331.09%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

So rand_chacha is ~3× faster than your implementation according to my benchmarks, which is still very different from what the README claims.
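A plausible explanation for that kind of fill vs fill_u8 gap (sketched with illustrative free functions, not smallrand's actual internals): a path that draws one generator output per element throws away most of each u64, while a bulk path copies all eight bytes of every output.

```rust
/// Naive fill: one full u64 draw per byte, keeping only the low 8 bits.
/// An illustrative worst case; real per-element paths may be smarter.
fn fill_per_byte(next_u64: &mut impl FnMut() -> u64, buf: &mut [u8]) {
    for b in buf.iter_mut() {
        *b = next_u64() as u8;
    }
}

/// Bulk fill: copy whole u64 outputs eight bytes at a time, then the tail.
fn fill_bulk(next_u64: &mut impl FnMut() -> u64, buf: &mut [u8]) {
    let mut chunks = buf.chunks_exact_mut(8);
    for chunk in &mut chunks {
        chunk.copy_from_slice(&next_u64().to_le_bytes());
    }
    let tail = chunks.into_remainder();
    if !tail.is_empty() {
        let n = tail.len();
        tail.copy_from_slice(&next_u64().to_le_bytes()[..n]);
    }
}

fn main() {
    // A counter stands in for the RNG so the output is easy to inspect.
    let mut counter = 0u64;
    let mut gen = || {
        counter += 1;
        counter
    };
    let mut buf = [0u8; 13];
    fill_bulk(&mut gen, &mut buf); // two draws cover all 13 bytes
    println!("{buf:?}");

    let mut c2 = 0u64;
    let mut gen2 = || {
        c2 += 1;
        c2
    };
    let mut buf2 = [0u8; 4];
    fill_per_byte(&mut gen2, &mut buf2); // four draws for four bytes
    println!("{buf2:?}");
}
```

The bulk path consumes one draw per 8 bytes instead of one per byte, an 8× reduction in generator work before the compiler even starts vectorising the copies.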


hpenne commented Aug 13, 2025

Thanks for doing this. I'll run these on my HW and see how the results compare to the benchmarks I made. I see that your benchmarks do some explicit setup of some Criterion parameters, while I just used the defaults. It would be surprising if that makes a huge difference, though. I suspect most of it is down to running on different platforms.

The readme does make it clear that the benchmarks were run on an Apple MacBook Air M1. Is it possible that rand has processor-specific optimizations that are used on the AMD 5800X, but fewer (or no) such optimizations for the M1? If so, then I have no problem mentioning that in the readme, to avoid letting people believe that the relative performance seen on the M1 can be expected on Intel/AMD platforms as well.

I have no plans to actually compete with rand on raw performance. The goal was just to provide good enough performance for most people, without ever resorting to processor-specific instructions or unsafe code. I was in fact very surprised to be in the same neighborhood at all with regard to performance. I suspected that the apparent parity with rand on the M1 was due to rand not being optimized specifically for that platform.


hpenne commented Aug 13, 2025

I see that the throughput you get when generating bytes from ChaCha12 using rand is more than 50% of the throughput you get when using Xoshiro256++. That is impressive. Well done.

I must admit that I was ever so slightly pleased to see that smallrand is a smidgen ahead of rand when generating bytes with Xoshiro256++, but it's so marginal that it makes no practical difference.

I did see a surprisingly large performance advantage in smallrand's favor on the M1 when generating sub-ranges of u64 values, though. I wonder how that compares on your AMD 5800X.

What rust version did you use for these tests, by the way?


dhardy commented Aug 13, 2025

What rust version did you use for these tests, by the way?

To run these?

$ rustc -V
rustc 1.90.0-nightly (b56aaec52 2025-07-24)


hpenne commented Sep 14, 2025

I will probably create a public repo of the benchmarks I used to get the performance measurements in smallrand's readme.md file, instead of adding benchmarks to this repo.

It might interest you to know that the new 0.10.0 release candidate for rand has greatly improved ChaCha performance on ARM (the Apple M1 in my case). In my benchmark of rand's generation of u64 values using ChaCha12, the time is down by 50% compared to the 0.9 version. rand is now faster than smallrand on the ChaCha benchmarks, as expected. My readme will obviously need a full overhaul.

The only one of my benchmarks where smallrand clearly outperforms rand is when generating integers in a range. I think we're using the same algorithm, so there must be something in the rand design that prevents optimization. It might be tempting to take a look one day if I have some spare time on my hands.
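For reference, the usual algorithm for unbiased integers in a range is Lemire's widening-multiply method; I'm assuming that's the shared algorithm meant here. A minimal sketch (my own, not smallrand's or rand's actual code):

```rust
/// Unbiased integer in [0, range) via Lemire's widening-multiply method.
/// `range` must be nonzero; `next_u64` is any uniform 64-bit source.
fn sample_range(next_u64: &mut impl FnMut() -> u64, range: u64) -> u64 {
    let mut m = (next_u64() as u128) * (range as u128);
    let mut lo = m as u64;
    if lo < range {
        // Rare slow path: one modulo computes the rejection threshold
        // (2^64 mod range); resample until the low word clears it.
        let threshold = range.wrapping_neg() % range;
        while lo < threshold {
            m = (next_u64() as u128) * (range as u128);
            lo = m as u64;
        }
    }
    (m >> 64) as u64
}

fn main() {
    // Deterministic stand-in generator, just to exercise the function.
    let mut state = 0x9E37_79B9_7F4A_7C15u64;
    let mut gen = || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        state
    };
    for _ in 0..1000 {
        assert!(sample_range(&mut gen, 10) < 10);
    }
    println!("all samples in range");
}
```

The common path is a single widening multiply plus a compare, so differences between crates come down to how well each implementation inlines and how the branch is laid out, which is exactly where compiler/architecture interactions can bite.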


hpenne commented Sep 17, 2025

I have created a repo for the benchmarks that I made: https://github.com/hpenne/randbenches

These are the numbers I get on my machine:

With rand 0.10.0-rc.0:

rand ChaCha12 fill_bytes
                        time:   [125.50 ns 125.65 ns 125.82 ns]
rand ChaCha12 u64       time:   [4.2458 ns 4.2766 ns 4.3247 ns]
rand Xoshiro256++ fill_bytes
rand Xoshiro256++ range time:   [3.5709 ns 3.5740 ns 3.5779 ns]
rand Xoshiro256++ range f64
rand Xoshiro256++ u64   time:   [1.1449 ns 1.1493 ns 1.1548 ns]


With smallrand 1.0.1

smallrand ChaCha12 fill time:   [231.95 ns 232.08 ns 232.26 ns]
smallrand ChaCha12 u64  time:   [7.3359 ns 7.3394 ns 7.3436 ns]
smallrand Xoshiro256++ fill
                        time:   [36.011 ns 36.031 ns 36.054 ns]
smallrand Xoshiro256++ range
smallrand Xoshiro256++ range f64
                        time:   [1.2358 ns 1.2372 ns 1.2386 ns]
smallrand Xoshiro256++ u64
                        time:   [1.1407 ns 1.1412 ns 1.1419 ns]

So I'll definitely need to update my readme when you release. But notice how slow rand seems for integers in a range: smallrand seems almost 3× faster.


dhardy commented Sep 17, 2025

I presume this is using the chacha20 crate (rand v0.10.x)?

I'm a bit confused about which Xoshiro256++ benches you ran.

Regarding ranges in rand, there are rather too many implementations, for three reasons:

1. there are pre-baked vs single-sample implementations (though the latter should do well when the range is const);
2. there are separate implementations for float and int types (sometimes with multiple int implementations);
3. inclusive vs exclusive ranges are handled separately (less significant).

If you dig through the PR history, you'll see that there have been multiple attempts to optimise these, though most benchmarks were run on only one machine (AMD Zen 3).
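The pre-baked vs single-sample distinction can be sketched in isolation (illustrative types, not rand's actual ones): a pre-baked distribution pays the modulo for the rejection threshold once at construction, so repeated draws from the same range avoid it entirely.

```rust
/// Pre-baked uniform distribution over [0, range), illustrative only.
/// The expensive `%` happens once here, not on every sample.
struct UniformU64 {
    range: u64,
    threshold: u64, // 2^64 mod range: low words below this are rejected
}

impl UniformU64 {
    fn new(range: u64) -> Self {
        assert!(range > 0);
        Self {
            range,
            threshold: range.wrapping_neg() % range,
        }
    }

    /// Draw one unbiased value: a widening multiply per accepted draw,
    /// with no division anywhere on the sampling path.
    fn sample(&self, next_u64: &mut impl FnMut() -> u64) -> u64 {
        loop {
            let m = (next_u64() as u128) * (self.range as u128);
            if (m as u64) >= self.threshold {
                return (m >> 64) as u64;
            }
        }
    }
}

fn main() {
    let dist = UniformU64::new(6);
    let mut state = 42u64;
    let mut gen = || {
        state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
        state
    };
    for _ in 0..1000 {
        assert!(dist.sample(&mut gen) < 6);
    }
    println!("ok");
}
```

A single-sample call with a const range can often match this, because the compiler folds the threshold computation away; with a runtime range it cannot, which is one reason the two paths exist side by side.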


hpenne commented Sep 18, 2025

I presume this is using the chacha20 crate (rand v0.10.x)?

Yes. RC0, I think.

I'm a bit confused about which Xoshiro256++ benches you ran.

These are from the benches found under https://github.com/hpenne/randbenches
They are convenient here, as I can run the same benchmarks on rand, fastrand and smallrand in one go, using one common setup. It seemed more controlled than running benches in two separate projects.

I'm sure the structuring could be improved using Criterion groups as you did, but I haven't got around to that yet.

The benchmark for uniform distribution in a range uses Xoshiro256++, not ChaCha12. Using the faster algorithm seemed a better choice when you want to test the performance of the uniform distribution and not the random algorithm.

There are no comparative tests there yet for uniform distribution of floats, although they should be easy to add. There might be some performance difference there, as I put some extra work into the algorithm to use the full dynamic range of the floats even for values close to zero. That costs a few extra cycles but is slightly better for use cases like simulation. I noticed that many other random crates don't do this (most seem to just convert to float and apply scale+offset), although I cannot recall what rand does.
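For contrast, the common scale+offset conversion described here is just the top 53 bits of a u64 scaled into [0, 1): every output is a multiple of 2^-53, so values near zero don't gain the extra mantissa precision an f64 could represent there. A sketch of that simple form (not smallrand's higher-precision algorithm):

```rust
/// Common u64 -> f64 conversion: top 53 bits scaled into [0, 1).
/// Outputs are uniformly spaced multiples of 2^-53, so small values
/// do not use the extra dynamic range of the f64 mantissa near zero.
fn u64_to_f64(x: u64) -> f64 {
    // An f64 mantissa holds 53 bits; 9007199254740992 = 2^53.
    (x >> 11) as f64 / 9007199254740992.0
}

fn main() {
    println!("{}", u64_to_f64(0));        // exactly 0
    println!("{}", u64_to_f64(u64::MAX)); // just below 1
}
```

A higher-precision variant spends extra generator bits to fill in the mantissa when the leading bits are zero; that is the "few extra cycles" trade-off mentioned above.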


dhardy commented Sep 18, 2025

Some of your results (e.g. Xoshiro fill) seem to be missing above.

Using the faster algorithm seemed a better choice when you want to test the performance of the uniform distribution and not the random algorithm.

Really, we're testing the combination; it does seem relevant to me to test the uniform-range algorithm with multiple RNGs. Look for example here.

I ran your benches on my 5800X and got very different results:

fastrand range          time:   [728.54 ps 734.28 ps 740.44 ps]
rand ChaCha12 fill_bytes
                        time:   [57.279 ns 57.386 ns 57.523 ns]
rand ChaCha12 u64       time:   [1.8336 ns 1.8360 ns 1.8393 ns]
rand Xoshiro256++ fill_bytes
                        time:   [21.237 ns 21.317 ns 21.426 ns]
rand Xoshiro256++ range time:   [837.09 ps 838.19 ps 839.34 ps]
rand Xoshiro256++ range f64
                        time:   [840.98 ps 841.70 ps 842.57 ps]
rand Xoshiro256++ u64   time:   [642.88 ps 643.95 ps 645.08 ps]
smallrand ChaCha12 fill time:   [187.67 ns 187.79 ns 187.93 ns]
smallrand ChaCha12 u64  time:   [6.1360 ns 6.1436 ns 6.1539 ns]
smallrand Xoshiro256++ fill
                        time:   [21.343 ns 21.367 ns 21.394 ns]
smallrand Xoshiro256++ range
                        time:   [930.34 ps 933.28 ps 936.91 ps]
smallrand Xoshiro256++ range f64
                        time:   [1.1231 ns 1.1277 ns 1.1326 ns]
smallrand Xoshiro256++ u64
                        time:   [640.22 ps 641.17 ps 642.31 ps]

So, it's entirely possible that the choices I made in rust-random/rand#1287 and rust-random/rand#1289 are heavily biased by the CPU I'm using. Re-evaluating on a different machine could prove interesting. Unfortunate that I didn't get more help running benchmarks at the time.

I'm not convinced it's worth going anywhere from here (the linked PRs and benchmarks were a lot of work), but feel free to investigate further if you like.


hpenne commented Sep 18, 2025

Some of your results (e.g. Xoshiro fill) seem to be missing above.

Yes, I did not see them as all that relevant to the discussion about ChaCha. I can run this again and attach the complete output if you like. It should be easy to add a benchmark for uniform range based on ChaCha12 as well.

I ran your benches on my 5800X and got very different results:
So, it's entirely possible that the choices I made in rust-random/rand#1287 and rust-random/rand#1289 are heavily biased by the CPU I'm using.

That is really interesting. I tweaked my code until the uniform range code performed really well on my hardware, and you did the same on yours. We both ended up tuning the code for the compiler optimizations specific to our own architecture. Thus my implementation for uniform ranges of integers beats yours on my HW, and yours beats mine on your HW. That makes a lot of sense.

Oh, and I was incorrect regarding benchmarks for uniform floats: I did write one. Unfortunately the measurement for rand got cut in editing above. On the 5800X you are faster. The difference was smaller on the M1 when I made the readme, but rand was marginally faster on that platform as well. I noted in the readme that I think my algorithm is slightly better (but slower), although most users will probably never care about the extra effort here.
