std.crypto.onetimeauth.ghash: faster GHASH on modern CPUs #13566
Carryless multiplication was slow on older Intel CPUs, which justified using Karatsuba multiplication. This is no longer the case: using 4 multiplications to multiply two 128-bit numbers is actually faster than 3 multiplications plus shifts and additions. This is also true on aarch64. Keep using Karatsuba only when targeting x86 (granted, this is a bit of a brutal shortcut; we should really list all the CPU models that had a slow clmul instruction). Also remove the useless agg_2 threshold and restore the ability to precompute only H and H^2 in ReleaseSmall. Finally, avoid using u256; using 128-bit registers is actually faster.
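To make the trade-off concrete, here is a minimal sketch (not the code from this PR) of both ways to multiply two 128-bit polynomials over GF(2). The names `clmul64`, `Product`, `mulSchoolbook` and `mulKaratsuba` are hypothetical, and `clmul64` is a slow bit-by-bit stand-in for the hardware carryless multiply instruction. Operands are passed as 64-bit halves to keep the sketch free of casts.

```zig
const std = @import("std");

/// 256-bit carryless product, split into two 128-bit halves.
const Product = struct { hi: u128, lo: u128 };

/// Portable bit-by-bit stand-in for a hardware 64x64 carryless multiply.
/// Hypothetical helper, only here to keep the sketch self-contained.
fn clmul64(x: u64, y: u64) u128 {
    var acc: u128 = 0;
    var a: u128 = x;
    var b: u64 = y;
    while (b != 0) {
        if ((b & 1) == 1) acc ^= a;
        a <<= 1;
        b >>= 1;
    }
    return acc;
}

/// Schoolbook: four multiplications, then the middle term is folded in.
fn mulSchoolbook(x_hi: u64, x_lo: u64, y_hi: u64, y_lo: u64) Product {
    const hi = clmul64(x_hi, y_hi);
    const lo = clmul64(x_lo, y_lo);
    const mid = clmul64(x_hi, y_lo) ^ clmul64(x_lo, y_hi);
    return Product{ .hi = hi ^ (mid >> 64), .lo = lo ^ (mid << 64) };
}

/// Karatsuba: three multiplications, with extra XORs to recover the middle term.
fn mulKaratsuba(x_hi: u64, x_lo: u64, y_hi: u64, y_lo: u64) Product {
    const hi = clmul64(x_hi, y_hi);
    const lo = clmul64(x_lo, y_lo);
    const mid = clmul64(x_hi ^ x_lo, y_hi ^ y_lo) ^ hi ^ lo;
    return Product{ .hi = hi ^ (mid >> 64), .lo = lo ^ (mid << 64) };
}

test "schoolbook and Karatsuba produce the same product" {
    const a = mulSchoolbook(0x0123456789abcdef, 0xfedcba9876543210, 0xdeadbeefcafebabe, 0x0f1e2d3c4b5a6978);
    const b = mulKaratsuba(0x0123456789abcdef, 0xfedcba9876543210, 0xdeadbeefcafebabe, 0x0f1e2d3c4b5a6978);
    try std.testing.expect(a.hi == b.hi and a.lo == b.lo);
}
```

Both functions compute the same value; the only difference is whether the middle term costs a fourth multiplication or a handful of extra XORs around a third one. Which is faster depends on the latency and throughput of the target's carryless multiply.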
const I256 = struct {
    hi: u128,
    lo: u128,
    mid: u128,
I don't think I understand what mid is
Schoolbook multiplication:
ab * cd =
      bd
+   ad
+   bc
+ ac
ac is hi, bd is lo, and in the middle we have ad+bc. Eventually, a shifted addition folds it into lo and hi (n being the width of one half):
mid = ad + bc
lo = bd + (mid << n)
hi = ac + (mid >> n)
We are doing several multiplications in a row. So, instead of adding the middle term after every multiplication, we can accumulate the lo, hi and mid values, and only add mid at the end. This is what we do here.
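As a rough illustration of that deferred folding (a sketch, not the PR's actual code), the example below accumulates several products into separate hi, lo and mid fields and folds mid only once. The names `Acc`, `clmul64`, `mulAcc` and `fold` are hypothetical, n is assumed to be 64, and addition is XOR because the multiplication is carryless.

```zig
const std = @import("std");

/// Unreduced accumulator with the same three fields as the struct in the diff.
const Acc = struct {
    hi: u128 = 0,
    lo: u128 = 0,
    mid: u128 = 0,
};

/// Portable bit-by-bit stand-in for a hardware 64x64 carryless multiply.
/// Hypothetical helper, only here to keep the sketch self-contained.
fn clmul64(x: u64, y: u64) u128 {
    var acc: u128 = 0;
    var a: u128 = x;
    var b: u64 = y;
    while (b != 0) {
        if ((b & 1) == 1) acc ^= a;
        a <<= 1;
        b >>= 1;
    }
    return acc;
}

/// XOR one more 128x128 product into the accumulator, leaving mid unfolded.
fn mulAcc(acc: *Acc, x_hi: u64, x_lo: u64, y_hi: u64, y_lo: u64) void {
    acc.hi ^= clmul64(x_hi, y_hi);
    acc.lo ^= clmul64(x_lo, y_lo);
    acc.mid ^= clmul64(x_hi, y_lo) ^ clmul64(x_lo, y_hi);
}

/// Fold the accumulated middle term into hi and lo.
fn fold(acc: Acc) Acc {
    return Acc{
        .hi = acc.hi ^ (acc.mid >> 64),
        .lo = acc.lo ^ (acc.mid << 64),
        .mid = 0,
    };
}

test "folding mid once matches folding after every product" {
    const inputs = [_][4]u64{
        [4]u64{ 0x0123456789abcdef, 0xfedcba9876543210, 0xdeadbeefcafebabe, 0x0f1e2d3c4b5a6978 },
        [4]u64{ 0xffffffffffffffff, 0x8000000000000001, 0x1, 0xaaaaaaaaaaaaaaaa },
        [4]u64{ 0x1122334455667788, 0x99aabbccddeeff00, 0x13579bdf02468ace, 0x5555555555555555 },
    };
    var deferred = Acc{};
    var eager = Acc{};
    for (inputs) |v| {
        // Deferred: only accumulate, fold later.
        mulAcc(&deferred, v[0], v[1], v[2], v[3]);
        // Eager: fold each product before adding it to the running total.
        var one = Acc{};
        mulAcc(&one, v[0], v[1], v[2], v[3]);
        const folded = fold(one);
        eager.hi ^= folded.hi;
        eager.lo ^= folded.lo;
    }
    const result = fold(deferred);
    try std.testing.expect(result.hi == eager.hi and result.lo == eager.lo);
}
```

This works because shifts distribute over XOR, so folding the sum of the middle terms gives the same result as summing the folded terms, while saving two shifts and two XORs per product.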
Reverted in 72d3f4b, failing std lib tests on wasm32-wasi:
…iglang#13566)" This reapplies commit 72d3f4b.