Ensure various scalar cross platform helper APIs are handled directly as intrinsic #80789
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details

Much like with APIs directly exposed on Vector64/128/256/512, several of the APIs exposed on BitOperations are "cross platform helper APIs" and are used in various perf-critical code paths.

However, unlike the vector APIs, the bit operation APIs were not directly handled as intrinsic and only ever executed a software fallback that included manual dispatch to the relevant underlying hardware intrinsics. This works well most of the time, but it has the side effects of reducing JIT throughput, being subject to inlining heuristics, and not being able to participate in constant folding.

This PR updates those APIs to be directly imported as the relevant hardware intrinsic, when supported, and to more generally support constant folding.
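For context, the "manual dispatch" the description refers to is a managed method body that checks IsSupported and picks an instruction by hand, roughly in the shape of the sketch below (a simplified illustration, not the exact BCL source; the method name is made up):

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

Console.WriteLine(LeadingZeroCountSketch(0x00F0_0000u)); // prints 8

static int LeadingZeroCountSketch(uint value)
{
    if (Lzcnt.IsSupported)
    {
        // LZCNT defines 0 -> 32, so no extra zero check is needed.
        return (int)Lzcnt.LeadingZeroCount(value);
    }

    if (ArmBase.IsSupported)
    {
        return ArmBase.LeadingZeroCount(value);
    }

    // Pure software fallback: Log2 gives the index of the highest set bit.
    if (value == 0)
    {
        return 32;
    }

    return 31 ^ BitOperations.Log2(value);
}
```

Because the JIT previously saw this as an ordinary call, reaching the single lzcnt instruction depended on the inliner, and a constant argument never folded away; importing the call directly as the hardware intrinsic sidesteps both.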
src/coreclr/jit/importercalls.cpp
We could handle this specially with a check for negative values, but it is a more complex change and so I decided to push it out to a later PR.
src/coreclr/jit/importercalls.cpp
Since there isn't an "always available" instruction for x86/x64, we should probably import this as GenTreeIntrinsic, much like what happens for various Math APIs such as Sin and Cos.
Doing so would allow us to still perform post-import constant folding and then transform this back into a GT_CALL during rationalization on older hardware.
However, given it is a more complex change, I opted to push it out to a later PR.
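As a small, hypothetical illustration of what post-import constant folding buys here (the mask value and the ExtractField helper are made up for the example, not from this PR):

```csharp
using System;
using System.Numerics;

const uint FieldMask = 0x00F0_0000;

Console.WriteLine(ExtractField(0x0050_1234)); // prints 5

// Because FieldMask is a compile-time constant, once TrailingZeroCount is
// imported as a foldable intrinsic node the JIT can reduce the shift amount
// to 20, and the helper collapses to a plain mask-and-shift with no call.
static int ExtractField(uint bits)
    => (int)((bits & FieldMask) >> BitOperations.TrailingZeroCount(FieldMask));
```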
For Arm64, we could do the same, or we could add basic SIMD constant folding support for PopCount, AddAcross, and ToScalar. CreateScalarNode will already generate a GT_CNS_VEC where applicable, including post-import.
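For reference, this is roughly the node sequence involved on Arm64, shown here as a simplified C# sketch of the managed fallback shape (the IsSupported guard is what the real code relies on; the method name is illustrative):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

if (AdvSimd.Arm64.IsSupported)
{
    Console.WriteLine(PopCountArm64Sketch(0xFF00_FF00_FF00_FF00ul)); // prints 32
}

static int PopCountArm64Sketch(ulong value)
{
    // CreateScalar moves the scalar into a vector; on a constant input the
    // corresponding CreateScalarNode already yields a GT_CNS_VEC.
    Vector64<ulong> input = Vector64.CreateScalar(value);

    // Per-byte population count, sum across lanes, then move back to scalar.
    // Folding these three nodes over constant vectors would let the whole
    // chain fold to a constant as well.
    Vector64<byte> perBytePopCount = AdvSimd.PopCount(input.AsByte());
    return AdvSimd.Arm64.AddAcross(perBytePopCount).ToScalar();
}
```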
Force-pushed: 46a98b0 → 20e1693, 392fc87 → cd5959a, 37d0475 → 76b84c5, 76b84c5 → a527d5a, e3f2d19 → 6483f2e
Not a perfect diff due to 40 missed contexts, but still good overall and showing some TP improvements + positive diffs. There is notably a very small
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop
Azure Pipelines successfully started running 3 pipeline(s).
cc @dotnet/jit-contrib, this is ready for review.