-
-
Notifications
You must be signed in to change notification settings - Fork 887
Low-hanging fruit codegen optimizations #2401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Low-hanging fruit codegen optimizations #2401
Conversation
src/ImageSharp/Formats/Jpeg/Components/Decoder/ArithmeticScanDecoder.cs
Outdated
Show resolved
Hide resolved
JimBobSquarePants
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this... I've gone a fair way through but want to comment now.
I think we might need to take a step back and try to find a good balance between creating the best codegen and readability.
I see two issues.
- How we're handling loops is inconsistent, sometimes we cast in diffferent ways in the loop declaration
nint i = ... (nint)(uint)vsnint i = ... (uint)and sometimes we we cast inside the loop. - We're doing casting during field assignment outside of hotpaths. I don't want to introduce changes that will only have very marginal effect sacrificing readabilty in the process.
| colorWhite.FromRgba32(Color.White); | ||
| ref byte dataRef = ref MemoryMarshal.GetReference(data); | ||
| for (nint y = top; y < top + height; y++) | ||
| for (nint y = top; y < (nint)(uint)(top + height); y++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some places we haven't dont the second explicit cast from uint
e.g. the loops are written as
for (nint y = top; y < (uint)(top + height); y++)I think we should strive to be super consistent here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having another thought on this, I think using nuint is the best option.
- the cast to
uintis enough, so no need for(nint)(uint) - the compiler errors for
for (nuint i = 0; i < span.Length; ++i)whilst there's no error ifnint i = 0is used -- so it's impossible to miss one optimization - for 32-bit optimized code is emitted too
Hence I'll update the usages to nuint.
| public static void AddGreenToBlueAndRed(Span<uint> pixelData) | ||
| { | ||
| if (Avx2.IsSupported) | ||
| if (Avx2.IsSupported && pixelData.Length >= 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did the length check early and made a do-while-loop.
This prevents the creation* of the vector addGreenToBlueAndRedMaskAvx2 when not needed.
The same on other loops in this file.
* actually it's a memory load from the data-segment in recent .NET versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen similar in the recent CRC PR to the runtime.
| nint u = n - m; | ||
|
|
||
| for (int i = 0; i < u; i += 4) | ||
| for (nint i = 0; i < u; i += 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of the n - m the loop with nint is kept. With nuint one has to take care of underflow due the subtraction.
Alternatives with using nuint would be casting to a signed type or having a if + do-while looping construct. I think the nint as used here is simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that makes sense. Thanks for the additional explanation!
| out uint p3, | ||
| out uint p2, | ||
| out uint p1, | ||
| out uint p0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The uint here saves lots of casts elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Very good!
| source.Save( | ||
| path, | ||
| encoder ?? source.GetConfiguration().ImageFormatsManager.FindEncoder(<#= fmt #>Format.Instance)); | ||
| encoder ?? source.GetConfiguration().ImageFormatsManager.GetEncoder(<#= fmt #>Format.Instance)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a left-over from #2317?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! I thought I'd redone these. Thanks!!
|
@gfoidl I'm still reviewing this. There's a LOT to read so it's gonna take me a few nights. Thanks for your patience. |
|
No rush (and ultiimately it's up to you when to review 😉). The pain point with such a PR is lots of touched files. But on the pro side the code-base gets in a consistent and good shape (regarding the micro-optimizations). |
JimBobSquarePants
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good! Just a few comments and questions.
| ref byte dBase = ref MemoryMarshal.GetReference(dest); | ||
|
|
||
| Shuffle.InverseMMShuffle(this.Control, out int p3, out int p2, out int p1, out int p0); | ||
| Shuffle.InverseMMShuffle(this.Control, out uint p3, out uint p2, out uint p1, out uint p0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of those. "Why didn't I think of this!" moments. 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, then it would be boring for me 😉.
src/ImageSharp/Common/Helpers/SimdUtils.FallbackIntrinsics128.cs
Outdated
Show resolved
Hide resolved
| nint u = n - m; | ||
|
|
||
| for (int i = 0; i < u; i += 4) | ||
| for (nint i = 0; i < u; i += 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that makes sense. Thanks for the additional explanation!
| out uint p3, | ||
| out uint p2, | ||
| out uint p1, | ||
| out uint p0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Very good!
| // We can assign the Bgr24 value directly to last three bytes of this instance. | ||
| ref byte thisRef = ref Unsafe.As<Abgr32, byte>(ref this); | ||
| ref byte thisRefFromB = ref Unsafe.AddByteOffset(ref thisRef, new IntPtr(1)); | ||
| ref byte thisRefFromB = ref Unsafe.AddByteOffset(ref thisRef, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this be better as (nuint)1u?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AddByteOffset has overloads for nuint and IntPtr, thus C#-compiler picks the nuint overload and treats the literal 1 as nuint-constant. No need for the cast here.
src/ImageSharp/Processing/Processors/Binarization/BinaryThresholdProcessor{TPixel}.cs
Show resolved
Hide resolved
src/ImageSharp/Processing/Processors/Convolution/ConvolutionProcessor{TPixel}.cs
Outdated
Show resolved
Hide resolved
src/ImageSharp/Processing/Processors/Convolution/EdgeDetectorCompassProcessor{TPixel}.cs
Outdated
Show resolved
Hide resolved
| for (int i = 0; i < this.kernelSize; i++) | ||
| { | ||
| int currentYIndex = Unsafe.Add(ref sampleRowBase, i); | ||
| int currentYIndex = Unsafe.Add(ref sampleRowBase, (uint)i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cast in the outer loop for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just below i is needed for slicing as int.
It would be more consistent, but also two casts instead of one in the loop body.
I have no strong opinion what's better here. Consistency or less casts?
gfoidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feedback is addressed / commented.
Once you're happy with the changes I need to push another commit to fix cases like
| nuint count = (uint)(source.Length / Vector<float>.Count); |
Super happy with everything so far! Fire ahead 😄 |
Here we go: 2bbf1cb (only changed that division by vector length (which I unfortunately optimized, then later de-optimized, and with that commit optimized again -- now it's enough of that ping-poing 😉)) |
JimBobSquarePants
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing stuff. Thank you!
Prerequisites
Description
Scanned through the code-base (Regex based search) and did some low-hanging fruit optimizations.
Unsafe.Add-usage tweaked the code / loops to don't emit sign extending movs (movsxd->movormovzx)Sorry for touching so many files...
Cf. #2397 (comment)