@SwapnilGaikwad
Contributor

This patch adds a SIMD implementation of Span.Reverse() for Arm64. It improves performance on Arm64 (speedups of ~8x for Byte, ~4.5x for Char, and ~2x for Int32 at size 512). No noticeable performance difference was observed on x86.

Arm64 (Altra):

|  Method        |                                                                                               Toolchain | Size |      Mean |    Error |   StdDev |    Median |       Min |       Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|----------------|-------------------------------------------------------------------------------------------------------- |----- |----------:|---------:|---------:|----------:|----------:|----------:|------:|---------------- |----------:|------------:|
| Reverse  (Byte)| /unchecked_intrinsic/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |  21.79 ns | 0.022 ns | 0.021 ns |  21.80 ns |  21.74 ns |  21.81 ns |  0.12 |          Faster |         - |          NA |
| Reverse  (Byte)|      /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 178.59 ns | 0.291 ns | 0.272 ns | 178.65 ns | 178.01 ns | 179.16 ns |  1.00 |            Base |         - |          NA |
| Reverse  (Char)| /unchecked_intrinsic/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |  38.95 ns | 0.141 ns | 0.117 ns |  38.93 ns |  38.76 ns |  39.22 ns |  0.22 |          Faster |         - |          NA |
| Reverse  (Char)|      /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 179.92 ns | 0.769 ns | 0.642 ns | 179.90 ns | 178.71 ns | 181.18 ns |  1.00 |            Base |         - |          NA |
| Reverse (Int32)| /unchecked_intrinsic/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |  87.81 ns | 0.011 ns | 0.010 ns |  87.81 ns |  87.80 ns |  87.83 ns |  0.49 |          Faster |         - |          NA |
| Reverse (Int32)|      /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 178.40 ns | 0.106 ns | 0.088 ns | 178.41 ns | 178.26 ns | 178.53 ns |  1.00 |            Base |         - |          NA |

x86 (Xeon Gold 5120T):

|  Method         |                                                                                               Toolchain | Size |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|---------------- |-------------------------------------------------------------------------------------------------------- |----- |---------:|---------:|---------:|---------:|---------:|---------:|------:|---------------- |----------:|------------:|
| Reverse  (Byte) |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 21.22 ns | 0.014 ns | 0.013 ns | 21.22 ns | 21.20 ns | 21.24 ns |  1.00 |            Base |         - |          NA |
| Reverse  (Byte) | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 21.24 ns | 0.006 ns | 0.005 ns | 21.24 ns | 21.24 ns | 21.25 ns |  1.00 |            Same |         - |          NA |
| Reverse  (Char) |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 35.89 ns | 0.285 ns | 0.267 ns | 35.68 ns | 35.68 ns | 36.38 ns |  1.00 |            Base |         - |          NA |
| Reverse  (Char) | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 35.81 ns | 0.195 ns | 0.182 ns | 35.68 ns | 35.68 ns | 36.13 ns |  1.00 |            Same |         - |          NA |
| Reverse (Int32) |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 69.57 ns | 0.004 ns | 0.004 ns | 69.57 ns | 69.56 ns | 69.58 ns |  1.00 |            Base |         - |          NA |
| Reverse (Int32) | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 69.13 ns | 0.012 ns | 0.010 ns | 69.12 ns | 69.12 ns | 69.15 ns |  0.99 |            Same |         - |          NA |
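
For readers following along, the general idea behind a vectorized reverse can be sketched as below. This is only an illustration under stated assumptions, not the code from the patch: the class name `ReverseSketch` is hypothetical, it handles bytes only, and it uses the span-based `Vector128.Create` and `CopyTo` helpers for clarity rather than the unsafe loads and stores a runtime implementation would likely prefer.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only.
public static class ReverseSketch
{
    // Reverses a byte span in place by swapping 16-byte blocks taken from
    // both ends, then finishing the middle with a scalar loop.
    public static void Reverse(Span<byte> span)
    {
        int lo = 0;
        int hi = span.Length - Vector128<byte>.Count;

        if (Vector128.IsHardwareAccelerated)
        {
            // Continue while the front and back blocks do not overlap.
            while (lo + Vector128<byte>.Count <= hi)
            {
                Vector128<byte> first = Vector128.Create<byte>(span.Slice(lo, Vector128<byte>.Count));
                Vector128<byte> last = Vector128.Create<byte>(span.Slice(hi, Vector128<byte>.Count));

                // Reverse the 16 bytes inside each block.
                first = Vector128.Shuffle(first, Vector128.Create(
                    (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));
                last = Vector128.Shuffle(last, Vector128.Create(
                    (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));

                // Store the reversed blocks at swapped positions.
                last.CopyTo(span.Slice(lo));
                first.CopyTo(span.Slice(hi));

                lo += Vector128<byte>.Count;
                hi -= Vector128<byte>.Count;
            }
        }

        // Scalar reversal of the remaining middle range [lo, hi + 16).
        int left = lo;
        int right = hi + Vector128<byte>.Count - 1;
        while (left < right)
        {
            (span[left], span[right]) = (span[right], span[left]);
            left++;
            right--;
        }
    }
}
```

Inputs shorter than one vector never enter the vector loop and fall through to the scalar loop, so the sketch stays correct for any length.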

@ghost

ghost commented Jul 25, 2022

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the community-contribution label Jul 25, 2022
@SwapnilGaikwad force-pushed the github-span-reverse-byte-intrinsic branch from c48606f to 3d4cc3f Jul 26, 2022 14:00
@SwapnilGaikwad force-pushed the github-span-reverse-byte-intrinsic branch from 3d4cc3f to aee62fe Jul 29, 2022 10:06
@SwapnilGaikwad
Contributor Author

The new version of the patch drops the changes that made Base64Encoder/Base64Decoder use Vector128.Shuffle() and focuses on Span.Reverse(). I'll add a separate patch to refactor the encoder/decoder.
It also refactors the AVX2 implementations to use Vector256.Shuffle().

@SwapnilGaikwad force-pushed the github-span-reverse-byte-intrinsic branch from aee62fe to c673a31 Aug 1, 2022 11:13
@SwapnilGaikwad
Contributor Author

Debugging test failures. Unfortunately, the failures are not reproducing locally.

@kunalspathak
Contributor

@dotnet/jit-contrib

- tempLast = Avx2.Shuffle(tempLast, reverseMask);
- tempLast = Avx2.Permute2x128(tempLast, tempLast, 0b00_01);
+ tempFirst = Vector256.Shuffle(tempFirst, Vector256.Create(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));
+ tempLast = Vector256.Shuffle(tempLast, Vector256.Create(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));
Member

@SwapnilGaikwad I was able to reproduce the test failures you observed; they're fixed if I change Vector256.Shuffle to Avx2.Shuffle.

cc @tannergooding

Member

Related issue: #72793

Contributor Author

Thanks @EgorBo, I will roll back the changes to AVX2 and update the patch.

Member

This is likely due to Vector256.Shuffle being a 1x256 op rather than 2x128 ops (Avx2.Shuffle is the latter).

You generally need to offset the indices of the upper elements by Vector128<T>.Count to ensure the operation works as expected.

Contributor

Just to confirm, is there any reason to not continue using reverseMask that is created once outside the loop instead of using Vector256.Create()? It should get hoisted outside the loop, but can you double check?

Member

@tannergooding Aug 1, 2022

Because for .NET 7 the Shuffle check for "is this a constant" happens in import only, so constant propagation and the other phases won't have run yet and the import as an intrinsic will fail.

This is something I want to fix early for .NET 8.

Contributor

This is something I want to fix early for .NET 8.

So until that happens, we should hoist those creations manually outside the loop?

Contributor

I am specifically referring to Vector256.Create(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0) part.

Member

No, the point is we should not hoist them because that will break intrinsic recognition for Vector256.Shuffle. We explicitly want the intrinsic recognition to happen and then the JIT will CSE the constant and hoist it itself.
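
To make the two shapes under discussion concrete, here is a small sketch. The class and method names are hypothetical, and it uses 16-bit elements to match the 16-index constant quoted above. Both methods return the same value; the difference described in this thread concerns only whether the .NET 7 JIT can recognize the indices as a constant at import time, which this sketch does not measure.

```csharp
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only.
public static class ShuffleConstantSketch
{
    // Indices written inline at the call site: the shape recommended above,
    // so that the importer sees a constant argument to Vector256.Shuffle.
    public static Vector256<ushort> ReverseInline(Vector256<ushort> v) =>
        Vector256.Shuffle(v, Vector256.Create(
            (ushort)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));

    // Indices passed in through a variable, as a manually hoisted mask would be.
    // Functionally identical; per the discussion, the generated code may be
    // worse on .NET 7 (an assumption taken from the thread, not verified here).
    public static Vector256<ushort> ReverseHoisted(Vector256<ushort> v, Vector256<ushort> indices) =>
        Vector256.Shuffle(v, indices);
}
```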

Contributor

Ah I see what you are saying.

@kunalspathak
Contributor

kunalspathak commented Aug 1, 2022

Unfortunately, the failures are not reproducing locally.

You can try replicating the failures using the exact binaries that were run in CI. For that, you need the runfo tool to download the payload. Here are the instructions to replicate, e.g., these failures, which I was able to reproduce on my modern windows-x64 machine.

dotnet tool install -g runfo

runfo get-helix-payload --jobid=1c2b1b1c-b400-4071-8dd9-68568aad1590 --output=some\folder --workitems=System.Memory.Tests --no-dumps

<extract the largest zip folder in correlation-payload> in e.g. some\folder\correlation>

<extract zip folder in workitems> in e.g. some\folder\workitems

cd some\folder\workitems

RunTests.cmd --runtime-path some\folder\correlation

Let us know if you still have trouble reproing the failures.

{
ref byte bufByte = ref Unsafe.As<char, byte>(ref buf);
nuint byteLength = length * sizeof(char);
Vector256<byte> reverseMask = Vector256.Create(
Member

The issue when you tried to replace Avx2.Shuffle with Vector256.Shuffle is that you didn't adjust the reverseMask.

You should change this:

Vector256.Create(
    (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,   // first 128-bit lane
          15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);  // second 128-bit lane

To:

Vector256.Create(
    (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,   // first 128-bit lane
          31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16); // second 128-bit lane

The Vector256 APIs operate as if it is 1x256-bit vector rather than as 2x128-bit vector lanes. This is consistent with how AVX-512, Arm64, WASM, Vector64, Vector128, and other types all operate.
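
A small sketch of what the whole-vector indexing means in practice (class and method names are hypothetical; `SwapLanes` is a portable stand-in for `Avx2.Permute2x128(v, v, 0b00_01)`, used here so the example does not require AVX2 hardware):

```csharp
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only.
public static class LaneMaskSketch
{
    // Per-lane reversal expressed with whole-vector indices: the upper-lane
    // indices are offset by Vector128<byte>.Count (16).
    public static Vector256<byte> ReverseWithinLanes(Vector256<byte> v) =>
        Vector256.Shuffle(v, Vector256.Create(
            (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
                  31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16));

    // The unadjusted Avx2.Shuffle-style mask. With whole-vector indexing,
    // both halves of the result read from elements 0..15 of the input.
    public static Vector256<byte> UnadjustedMask(Vector256<byte> v) =>
        Vector256.Shuffle(v, Vector256.Create(
            (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
                  15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));

    // Swaps the lower and upper 128-bit halves.
    public static Vector256<byte> SwapLanes(Vector256<byte> v) =>
        Vector256.Create(v.GetUpper(), v.GetLower());
}
```

For an input holding the bytes 0 through 31 in order, `ReverseWithinLanes` followed by `SwapLanes` yields 31 down to 0, whereas `UnadjustedMask` yields 15 down to 0 in both halves, so the upper 16 input bytes are lost.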

Member

If you do this, then Vector256.Shuffle(tempFirst, Vector256.Create(...)) will work as expected and still be performant on AVX2 hardware where you don't want to cross lanes.

Contributor Author

@SwapnilGaikwad Aug 1, 2022

The Vector256 APIs operate as if it is 1x256-bit vector

In this case, shouldn't we adjust the mask to

Vector256.Create(
    (byte)31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
          15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

I noticed the issue with the reverse mask but couldn't reproduce the failure yet. Does the runfo tool expect to be run on Windows only? The steps using runfo to reproduce the pipeline failure seem to create a batch file to run on Windows.

Contributor

runfo

It can be used on Linux as well. You just need to pass the right job ID and work item, which you can find in AzDo.

The RunTests.cmd/sh is present in the zip folder you download.

Contributor Author

ooh, right. Thanks Kunal.

Contributor

Can we leave the AVX2 changes out of this patch? The patch is now self-contained, limited to changes in the 128-bit variants.

I'm happy for the 256-bit part to be changed if it's obvious, but clearly it's going to require some extra debugging to get right without breaking performance (given #72793). Let's avoid feature creep in this PR.

Member

@EgorBo Aug 2, 2022

I think it's fine to leave it untouched. The cross-platform Vector256 APIs are mostly there for consistency; they're not really cross-platform and are unlikely ever to be.

@a74nh
Contributor

a74nh commented Aug 10, 2022

I don't think there are any outstanding review comments on this patch? (The CI failures just look like timeouts.)

Member

@adamsitnik left a comment

It looks great to me! Big thanks for your contribution, @SwapnilGaikwad!

I have provided four minor suggestions. I am going to apply them now so we can merge the PR today. I hope you don't mind.

  Array arrayClone2 = (Array)array.Clone();
  Array.Reverse(arrayClone2, index, length);
- Assert.Equal(expected, expected);
+ Assert.Equal(expected, arrayClone2);
Member

great catch! 👍

{
// SByte
yield return new object[] { new sbyte[] { 1, 2, 3, 4, 5 }, 0, 5, new sbyte[] { 5, 4, 3, 2, 1 } };
yield return new object[] { new sbyte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65 }, 0, 65, new sbyte[] { 65, 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 } };
Member

thank you for adding a lot of new test cases!

@adamsitnik added the tenet-performance label Aug 10, 2022
@adamsitnik modified the milestones: 8.0.0, 7.0.0 Aug 10, 2022
@ghost

ghost commented Aug 10, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: SwapnilGaikwad
Assignees: SwapnilGaikwad, kunalspathak
Labels:

area-System.Memory, tenet-performance, community-contribution

Milestone: 7.0.0

@SwapnilGaikwad
Contributor Author

Thanks a lot @adamsitnik for pushing this PR further 👍

Contributor

@kunalspathak left a comment

LGTM. Thanks for your contribution.

@adamsitnik
Member

The failure is unrelated (#73668), merging!

@adamsitnik merged commit f244adb into dotnet:main Aug 10, 2022
@SwapnilGaikwad deleted the github-span-reverse-byte-intrinsic branch Aug 11, 2022 10:36
@ghost locked as resolved and limited conversation to collaborators Sep 10, 2022
@kunalspathak
Contributor

Improvements dotnet/perf-autofiling-issues#7374

Labels

arch-arm64, area-System.Memory, community-contribution, tenet-performance

8 participants