Skip to content

Conversation

@gfoidl
Copy link
Contributor

@gfoidl gfoidl commented Mar 13, 2023

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Scanned through the code-base (Regex based search) and did some low-hanging fruit optimizations.

  • optimized division by constants
  • for Unsafe.Add-usage tweaked the code / loops to don't emit sign extending movs (movsxd -> mov or movzx)

Sorry for touching so many files...
Cf. #2397 (comment)

Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this... I've gone a fair way through but want to comment now.

I think we might need to take a step back and try to find a good balance between creating the best codegen and readability.

I see two issues.

  1. How we're handling loops is inconsistent, sometimes we cast in diffferent ways in the loop declaration nint i = ... (nint)(uint) vs nint i = ... (uint) and sometimes we we cast inside the loop.
  2. We're doing casting during field assignment outside of hotpaths. I don't want to introduce changes that will only have very marginal effect sacrificing readabilty in the process.

colorWhite.FromRgba32(Color.White);
ref byte dataRef = ref MemoryMarshal.GetReference(data);
for (nint y = top; y < top + height; y++)
for (nint y = top; y < (nint)(uint)(top + height); y++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some places we haven't dont the second explicit cast from uint
e.g. the loops are written as

for (nint y = top; y < (uint)(top + height); y++)

I think we should strive to be super consistent here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having another thought on this, I think using nuint is the best option.

  • the cast to uint is enough, so no need for (nint)(uint)
  • the compiler errors for for (nuint i = 0; i < span.Length; ++i) whilst there's no error if nint i = 0 is used -- so it's impossible to miss one optimization
  • for 32-bit optimized code is emitted too

Hence I'll update the usages to nuint.

public static void AddGreenToBlueAndRed(Span<uint> pixelData)
{
if (Avx2.IsSupported)
if (Avx2.IsSupported && pixelData.Length >= 8)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did the length check early and made a do-while-loop.
This prevents the creation* of the vector addGreenToBlueAndRedMaskAvx2 when not needed.
The same on other loops in this file.

* actually it's a memory load from the data-segment in recent .NET versions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen similar in the recent CRC PR to the runtime.

Comment on lines +347 to +349
nint u = n - m;

for (int i = 0; i < u; i += 4)
for (nint i = 0; i < u; i += 4)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the n - m the loop with nint is kept. With nuint one has to take care of underflow due the subtraction.

Alternatives with using nuint would be casting to a signed type or having a if + do-while looping construct. I think the nint as used here is simpler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that makes sense. Thanks for the additional explanation!

Comment on lines +514 to +517
out uint p3,
out uint p2,
out uint p1,
out uint p0)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uint here saves lots of casts elsewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Very good!

source.Save(
path,
encoder ?? source.GetConfiguration().ImageFormatsManager.FindEncoder(<#= fmt #>Format.Instance));
encoder ?? source.GetConfiguration().ImageFormatsManager.GetEncoder(<#= fmt #>Format.Instance));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a left-over from #2317?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! I thought I'd redone these. Thanks!!

@JimBobSquarePants
Copy link
Member

@gfoidl I'm still reviewing this. There's a LOT to read so it's gonna take me a few nights. Thanks for your patience.

@gfoidl
Copy link
Contributor Author

gfoidl commented Mar 21, 2023

No rush (and ultiimately it's up to you when to review 😉).

The pain point with such a PR is lots of touched files. But on the pro side the code-base gets in a consistent and good shape (regarding the micro-optimizations).

Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good! Just a few comments and questions.

ref byte dBase = ref MemoryMarshal.GetReference(dest);

Shuffle.InverseMMShuffle(this.Control, out int p3, out int p2, out int p1, out int p0);
Shuffle.InverseMMShuffle(this.Control, out uint p3, out uint p2, out uint p1, out uint p0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of those. "Why didn't I think of this!" moments. 😁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, then it would be boring for me 😉.

Comment on lines +347 to +349
nint u = n - m;

for (int i = 0; i < u; i += 4)
for (nint i = 0; i < u; i += 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that makes sense. Thanks for the additional explanation!

Comment on lines +514 to +517
out uint p3,
out uint p2,
out uint p1,
out uint p0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Very good!

// We can assign the Bgr24 value directly to last three bytes of this instance.
ref byte thisRef = ref Unsafe.As<Abgr32, byte>(ref this);
ref byte thisRefFromB = ref Unsafe.AddByteOffset(ref thisRef, new IntPtr(1));
ref byte thisRefFromB = ref Unsafe.AddByteOffset(ref thisRef, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this be better as (nuint)1u?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AddByteOffset has overloads for nuint and IntPtr, thus C#-compiler picks the nuint overload and treats the literal 1 as nuint-constant. No need for the cast here.

for (int i = 0; i < this.kernelSize; i++)
{
int currentYIndex = Unsafe.Add(ref sampleRowBase, i);
int currentYIndex = Unsafe.Add(ref sampleRowBase, (uint)i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cast in the outer loop for consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just below i is needed for slicing as int.
It would be more consistent, but also two casts instead of one in the loop body.

I have no strong opinion what's better here. Consistency or less casts?

Copy link
Contributor Author

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feedback is addressed / commented.


Once you're happy with the changes I need to push another commit to fix cases like

nuint count = (uint)(source.Length / Vector<float>.Count);
which I de-optimized in a previous commit of this PR (got the wrong casts here in order to emit fastest code).

@JimBobSquarePants
Copy link
Member

Feedback is addressed / commented.

Once you're happy with the changes I need to push another commit to fix cases like

nuint count = (uint)(source.Length / Vector<float>.Count);

which I de-optimized in a previous commit of this PR (got the wrong casts here in order to emit fastest code).

Super happy with everything so far! Fire ahead 😄

@gfoidl
Copy link
Contributor Author

gfoidl commented Mar 24, 2023

Fire ahead 😄

Here we go: 2bbf1cb (only changed that division by vector length (which I unfortunately optimized, then later de-optimized, and with that commit optimized again -- now it's enough of that ping-poing 😉))

Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing stuff. Thank you!

@JimBobSquarePants JimBobSquarePants merged commit 0a1f05b into SixLabors:main Mar 24, 2023
@gfoidl gfoidl deleted the codegen-optimizations branch March 24, 2023 12:24
@gfoidl gfoidl mentioned this pull request Mar 24, 2023
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants