3 <==> 4 Channel Shuffling with Hardware Intrinsics #1409

JimBobSquarePants · 2020-10-30T23:20:56Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practices as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

Adds three methods to SimdUtils:

Pad3Shuffle4: Pads a buffer representing 3 channel pixels to 4 channels and shuffles the result.
Shuffle4Slice3: Shuffles a buffer representing 4 channel pixels and slices it to 3 channels.
Shuffle3: Shuffles a buffer representing 3 channel pixels.
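
As a rough illustration of what the three operations do to a byte buffer, here is a minimal scalar sketch in Python. It is not the SimdUtils code: the function names, the `order` tuples, and the `0xFF` padding value are assumptions chosen for the example.

```python
from typing import Sequence


def pad3_shuffle4(src: bytes, order: Sequence[int] = (0, 1, 2, 3), pad: int = 0xFF) -> bytes:
    """Pad each 3-byte pixel to 4 bytes (pad value is an assumption), then reorder the channels."""
    if len(src) % 3:
        raise ValueError("source length must be a multiple of 3")
    out = bytearray()
    for i in range(0, len(src), 3):
        padded = src[i:i + 3] + bytes([pad])
        out.extend(padded[j] for j in order)
    return bytes(out)


def shuffle4_slice3(src: bytes, order: Sequence[int] = (0, 1, 2, 3)) -> bytes:
    """Reorder each 4-byte pixel, then keep only the first 3 channels of the result."""
    if len(src) % 4:
        raise ValueError("source length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(src), 4):
        pixel = src[i:i + 4]
        out.extend(pixel[j] for j in order[:3])
    return bytes(out)


def shuffle3(src: bytes, order: Sequence[int] = (0, 1, 2)) -> bytes:
    """Reorder the channels of each 3-byte pixel."""
    if len(src) % 3:
        raise ValueError("source length must be a multiple of 3")
    out = bytearray()
    for i in range(0, len(src), 3):
        pixel = src[i:i + 3]
        out.extend(pixel[j] for j in order)
    return bytes(out)
```

For example, `pad3_shuffle4(rgb, order=(2, 1, 0, 3))` would turn packed RGB bytes into BGRA-ordered bytes with the padded channel last, under the assumptions above.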

All fallbacks are better than or equal to the current implementations.

~~I've completed a basic implementation for now that results in a ~2.5-3x speedup. I looked into more loop unrolling based on this example but kept messing up my offsets so I left it. @saucecontrol if you fancy helping me here please do.~~ Fixed, thanks @saucecontrol!

Once the initial implementation is reviewed I will make the methods generic and add optimized fallback versions of XYZW shuffling, as per @antonfirsov's previous suggestions for 4 channel shuffling, using custom structs similar to Rgb24 and Rgba32 so that there is no loss in current performance on older platforms. Done!
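
For readers unfamiliar with XYZW shuffling: a 4 channel shuffle is commonly described by a single control byte holding one 2-bit source index per output channel, the same packing as the x86 `_MM_SHUFFLE` macro. The Python sketch below shows that encoding only. The helper names are hypothetical, and whether this PR's generic shuffle structs use exactly this packing is an assumption.

```python
from typing import Tuple


def mm_shuffle(p3: int, p2: int, p1: int, p0: int) -> int:
    """Pack four 2-bit source indices into one control byte (same layout as _MM_SHUFFLE)."""
    return (p3 << 6) | (p2 << 4) | (p1 << 2) | p0


def decode_control(control: int) -> Tuple[int, ...]:
    """Return the source index for output channels 0..3."""
    return tuple((control >> (2 * i)) & 0b11 for i in range(4))


def shuffle4(src: bytes, control: int) -> bytes:
    """Scalar reference shuffle of 4-byte pixels driven by a control byte."""
    if len(src) % 4:
        raise ValueError("source length must be a multiple of 4")
    order = decode_control(control)
    out = bytearray()
    for i in range(0, len(src), 4):
        pixel = src[i:i + 4]
        out.extend(pixel[j] for j in order)
    return bytes(out)
```

With this packing, `mm_shuffle(3, 0, 1, 2)` swaps the first and third channels and leaves the second and fourth in place.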

Current Benchmarks

Pad3Shuffle4

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.403
[Host]             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
1. No HwIntrinsics : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
2. AVX             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
3. SSE             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1

Method	Job	EnvironmentVariables	Count	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
Pad3Shuffle4	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	96	120.64 ns	7.190 ns	21.200 ns	114.26 ns	1.00	0.00	-	-	-	-
Pad3Shuffle4	2. AVX	Empty	96	23.63 ns	0.175 ns	0.155 ns	23.65 ns	0.15	0.01	-	-	-	-
Pad3Shuffle4	3. SSE	COMPlus_EnableAVX=0	96	25.25 ns	0.356 ns	0.298 ns	25.27 ns	0.17	0.01	-	-	-	-

Pad3Shuffle4FastFallback	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	96	14.80 ns	0.358 ns	1.032 ns	14.64 ns	1.00	0.00	-	-	-	-
Pad3Shuffle4FastFallback	2. AVX	Empty	96	24.84 ns	0.376 ns	0.333 ns	24.74 ns	1.57	0.06	-	-	-	-
Pad3Shuffle4FastFallback	3. SSE	COMPlus_EnableAVX=0	96	24.58 ns	0.471 ns	0.704 ns	24.38 ns	1.60	0.09	-	-	-	-

Pad3Shuffle4	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	384	258.92 ns	4.873 ns	4.069 ns	257.95 ns	1.00	0.00	-	-	-	-
Pad3Shuffle4	2. AVX	Empty	384	41.41 ns	0.859 ns	1.204 ns	41.33 ns	0.16	0.00	-	-	-	-
Pad3Shuffle4	3. SSE	COMPlus_EnableAVX=0	384	40.74 ns	0.848 ns	0.793 ns	40.48 ns	0.16	0.00	-	-	-	-

Pad3Shuffle4FastFallback	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	384	74.50 ns	0.490 ns	0.383 ns	74.49 ns	1.00	0.00	-	-	-	-
Pad3Shuffle4FastFallback	2. AVX	Empty	384	40.74 ns	0.624 ns	0.584 ns	40.72 ns	0.55	0.01	-	-	-	-
Pad3Shuffle4FastFallback	3. SSE	COMPlus_EnableAVX=0	384	38.28 ns	0.534 ns	0.417 ns	38.22 ns	0.51	0.01	-	-	-	-

Pad3Shuffle4	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	768	503.91 ns	6.466 ns	6.048 ns	501.58 ns	1.00	0.00	-	-	-	-
Pad3Shuffle4	2. AVX	Empty	768	62.86 ns	0.332 ns	0.277 ns	62.80 ns	0.12	0.00	-	-	-	-
Pad3Shuffle4	3. SSE	COMPlus_EnableAVX=0	768	64.59 ns	0.469 ns	0.415 ns	64.62 ns	0.13	0.00	-	-	-	-

Pad3Shuffle4FastFallback	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	768	110.51 ns	0.592 ns	0.554 ns	110.33 ns	1.00	0.00	-	-	-	-
Pad3Shuffle4FastFallback	2. AVX	Empty	768	64.72 ns	1.306 ns	1.090 ns	64.51 ns	0.59	0.01	-	-	-	-
Pad3Shuffle4FastFallback	3. SSE	COMPlus_EnableAVX=0	768	62.11 ns	0.816 ns	0.682 ns	61.98 ns	0.56	0.01	-	-	-	-

Pad3Shuffle4	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	1536	1,005.84 ns	13.176 ns	12.325 ns	1,004.70 ns	1.00	0.00	-	-	-	-
Pad3Shuffle4	2. AVX	Empty	1536	110.05 ns	0.256 ns	0.214 ns	110.04 ns	0.11	0.00	-	-	-	-
Pad3Shuffle4	3. SSE	COMPlus_EnableAVX=0	1536	110.23 ns	0.545 ns	0.483 ns	110.09 ns	0.11	0.00	-	-	-	-

Pad3Shuffle4FastFallback	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	1536	220.37 ns	1.601 ns	1.419 ns	220.13 ns	1.00	0.00	-	-	-	-
Pad3Shuffle4FastFallback	2. AVX	Empty	1536	111.54 ns	2.173 ns	2.901 ns	111.27 ns	0.51	0.01	-	-	-	-
Pad3Shuffle4FastFallback	3. SSE	COMPlus_EnableAVX=0	1536	110.23 ns	0.456 ns	0.427 ns	110.25 ns	0.50	0.00	-	-	-	-

Shuffle4Slice3

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.403
[Host]             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
1. No HwIntrinsics : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
2. AVX             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
3. SSE             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1

Method	Job	EnvironmentVariables	Count	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
Shuffle4Slice3	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	128	56.44 ns	2.843 ns	8.382 ns	56.70 ns	1.00	0.00	-	-	-	-
Shuffle4Slice3	2. AVX	Empty	128	27.15 ns	0.556 ns	0.762 ns	27.34 ns	0.41	0.03	-	-	-	-
Shuffle4Slice3	3. SSE	COMPlus_EnableAVX=0	128	26.36 ns	0.321 ns	0.268 ns	26.26 ns	0.38	0.02	-	-	-	-

Shuffle4Slice3FastFallback	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	128	25.85 ns	0.494 ns	0.462 ns	25.84 ns	1.00	0.00	-	-	-	-
Shuffle4Slice3FastFallback	2. AVX	Empty	128	26.15 ns	0.113 ns	0.106 ns	26.16 ns	1.01	0.02	-	-	-	-
Shuffle4Slice3FastFallback	3. SSE	COMPlus_EnableAVX=0	128	25.57 ns	0.078 ns	0.061 ns	25.56 ns	0.99	0.02	-	-	-	-

Shuffle4Slice3	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	256	97.47 ns	0.327 ns	0.289 ns	97.35 ns	1.00	0.00	-	-	-	-
Shuffle4Slice3	2. AVX	Empty	256	32.61 ns	0.107 ns	0.095 ns	32.62 ns	0.33	0.00	-	-	-	-
Shuffle4Slice3	3. SSE	COMPlus_EnableAVX=0	256	33.21 ns	0.169 ns	0.150 ns	33.15 ns	0.34	0.00	-	-	-	-

Shuffle4Slice3FastFallback	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	256	52.34 ns	0.779 ns	0.729 ns	51.94 ns	1.00	0.00	-	-	-	-
Shuffle4Slice3FastFallback	2. AVX	Empty	256	32.16 ns	0.111 ns	0.104 ns	32.16 ns	0.61	0.01	-	-	-	-
Shuffle4Slice3FastFallback	3. SSE	COMPlus_EnableAVX=0	256	33.61 ns	0.342 ns	0.319 ns	33.62 ns	0.64	0.01	-	-	-	-

Shuffle4Slice3	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	512	210.74 ns	3.825 ns	5.956 ns	207.70 ns	1.00	0.00	-	-	-	-
Shuffle4Slice3	2. AVX	Empty	512	51.03 ns	0.535 ns	0.501 ns	51.18 ns	0.24	0.01	-	-	-	-
Shuffle4Slice3	3. SSE	COMPlus_EnableAVX=0	512	66.60 ns	1.313 ns	1.613 ns	65.93 ns	0.31	0.01	-	-	-	-

Shuffle4Slice3FastFallback	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	512	119.12 ns	1.905 ns	1.689 ns	118.52 ns	1.00	0.00	-	-	-	-
Shuffle4Slice3FastFallback	2. AVX	Empty	512	50.33 ns	0.382 ns	0.339 ns	50.41 ns	0.42	0.01	-	-	-	-
Shuffle4Slice3FastFallback	3. SSE	COMPlus_EnableAVX=0	512	49.25 ns	0.555 ns	0.492 ns	49.26 ns	0.41	0.01	-	-	-	-

Shuffle4Slice3	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	1024	423.55 ns	4.891 ns	4.336 ns	423.27 ns	1.00	0.00	-	-	-	-
Shuffle4Slice3	2. AVX	Empty	1024	77.13 ns	1.355 ns	2.264 ns	76.19 ns	0.19	0.01	-	-	-	-
Shuffle4Slice3	3. SSE	COMPlus_EnableAVX=0	1024	79.39 ns	0.103 ns	0.086 ns	79.37 ns	0.19	0.00	-	-	-	-

Shuffle4Slice3FastFallback	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	1024	226.57 ns	2.930 ns	2.598 ns	226.10 ns	1.00	0.00	-	-	-	-
Shuffle4Slice3FastFallback	2. AVX	Empty	1024	80.25 ns	1.647 ns	2.082 ns	80.98 ns	0.35	0.01	-	-	-	-
Shuffle4Slice3FastFallback	3. SSE	COMPlus_EnableAVX=0	1024	84.99 ns	1.234 ns	1.155 ns	85.60 ns	0.38	0.01	-	-	-	-

Shuffle4Slice3	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	2048	794.96 ns	1.735 ns	1.538 ns	795.15 ns	1.00	0.00	-	-	-	-
Shuffle4Slice3	2. AVX	Empty	2048	128.41 ns	0.417 ns	0.390 ns	128.24 ns	0.16	0.00	-	-	-	-
Shuffle4Slice3	3. SSE	COMPlus_EnableAVX=0	2048	127.24 ns	0.294 ns	0.229 ns	127.23 ns	0.16	0.00	-	-	-	-

Shuffle4Slice3FastFallback	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	2048	382.97 ns	1.064 ns	0.831 ns	382.87 ns	1.00	0.00	-	-	-	-
Shuffle4Slice3FastFallback	2. AVX	Empty	2048	126.93 ns	0.382 ns	0.339 ns	126.94 ns	0.33	0.00	-	-	-	-
Shuffle4Slice3FastFallback	3. SSE	COMPlus_EnableAVX=0	2048	149.36 ns	1.875 ns	1.754 ns	149.33 ns	0.39	0.00	-	-	-	-

Shuffle3

Note: No fast fallback implementation here. I experimented with casting to uint and using shuffling but it was a shade slower than the default.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.403
[Host]             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
1. No HwIntrinsics : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
2. AVX             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
3. SSE             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1

Method	Job	EnvironmentVariables	Count	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
Shuffle3	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	96	48.46 ns	1.034 ns	2.438 ns	47.46 ns	1.00	0.00	-	-	-	-
Shuffle3	2. AVX	Empty	96	32.42 ns	0.537 ns	0.476 ns	32.34 ns	0.66	0.04	-	-	-	-
Shuffle3	3. SSE	COMPlus_EnableAVX=0	96	32.51 ns	0.373 ns	0.349 ns	32.56 ns	0.66	0.03	-	-	-	-

Shuffle3	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	384	199.04 ns	1.512 ns	1.180 ns	199.17 ns	1.00	0.00	-	-	-	-
Shuffle3	2. AVX	Empty	384	71.20 ns	2.654 ns	7.784 ns	69.60 ns	0.41	0.02	-	-	-	-
Shuffle3	3. SSE	COMPlus_EnableAVX=0	384	63.23 ns	0.569 ns	0.505 ns	63.21 ns	0.32	0.00	-	-	-	-

Shuffle3	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	768	391.28 ns	5.087 ns	3.972 ns	391.22 ns	1.00	0.00	-	-	-	-
Shuffle3	2. AVX	Empty	768	109.12 ns	2.149 ns	2.010 ns	108.66 ns	0.28	0.01	-	-	-	-
Shuffle3	3. SSE	COMPlus_EnableAVX=0	768	106.51 ns	0.734 ns	0.613 ns	106.56 ns	0.27	0.00	-	-	-	-

Shuffle3	1. No HwIntrinsics	COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0	1536	773.70 ns	5.516 ns	4.890 ns	772.96 ns	1.00	0.00	-	-	-	-
Shuffle3	2. AVX	Empty	1536	190.41 ns	1.090 ns	0.851 ns	190.38 ns	0.25	0.00	-	-	-	-
Shuffle3	3. SSE	COMPlus_EnableAVX=0	1536	190.94 ns	0.985 ns	0.769 ns	190.85 ns	0.25	0.00	-	-	-	-

codecov · 2020-10-30T23:54:08Z

Codecov Report

Merging #1409 (3cda066) into master (cf9cc6b) will increase coverage by 0.14%.
The diff coverage is 99.33%.

@@            Coverage Diff             @@
##           master    #1409      +/-   ##
==========================================
+ Coverage   82.99%   83.14%   +0.14%     
==========================================
  Files         692      695       +3     
  Lines       31189    31484     +295     
  Branches     3578     3586       +8     
==========================================
+ Hits        25884    26176     +292     
- Misses       4582     4585       +3     
  Partials      723      723

Flag	Coverage Δ
unittests	`83.14% <99.33%> (+0.14%)`	⬆️

Impacted Files	Coverage Δ
src/ImageSharp/Common/Helpers/SimdUtils.Shuffle.cs	`92.30% <83.33%> (-4.99%)`	⬇️
...eSharp/Common/Helpers/Shuffle/IComponentShuffle.cs	`100.00% <100.00%> (ø)`
...ImageSharp/Common/Helpers/Shuffle/IPad3Shuffle4.cs	`100.00% <100.00%> (ø)`
src/ImageSharp/Common/Helpers/Shuffle/IShuffle3.cs	`100.00% <100.00%> (ø)`
...ageSharp/Common/Helpers/Shuffle/IShuffle4Slice3.cs	`100.00% <100.00%> (ø)`
...mageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs	`97.80% <100.00%> (+1.28%)`	⬆️
...ions/Generated/Argb32.PixelOperations.Generated.cs	`100.00% <100.00%> (ø)`
...tions/Generated/Bgr24.PixelOperations.Generated.cs	`100.00% <100.00%> (ø)`
...ions/Generated/Bgra32.PixelOperations.Generated.cs	`100.00% <100.00%> (ø)`
...tions/Generated/Rgb24.PixelOperations.Generated.cs	`100.00% <100.00%> (ø)`
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cf9cc6b...3cda066.

antonfirsov · 2020-11-02T18:00:00Z

@JimBobSquarePants I want to do a proper review here, but not sure about the timing. Lucky case: by Wednesday evening, worst case: by Sunday. Is this acceptable for you?

JimBobSquarePants · 2020-11-02T18:18:21Z

@antonfirsov No worries. Hopefully someone else from the team can also have a look in the interim.

I added the Rgb24/Vector4 benchmarks to #1354 (comment)

We're looking at a very healthy speedup on all target frameworks.

antonfirsov · 2020-11-02T18:43:35Z

@JimBobSquarePants I tried to distill some derived numbers from #1354 (comment) to draw conclusions:

ToVector4_Rgb24

Count	Runtime	Master	Branch	Speedup (`Master/Branch`)
64	.NET 4.7.2	310.2	355.5	0.87257384
64	.NET Core 2.1	230.2	228.5	1.007439825
64	.NET Core 3.1	236.7	217	1.09078341
256	.NET 4.7.2	622.5	448.9	1.386723101
256	.NET Core 2.1	498.1	309.2	1.610931436
256	.NET Core 3.1	436.8	212.3	2.05746585
2048	.NET 4.7.2	3,460.60	1,974.10	1.753001368
2048	.NET Core 2.1	3,421.30	1,985.50	1.723142785
2048	.NET Core 3.1	2,972.20	1,165.00	2.551244635

FromVector4_Rgb24

Count	Runtime	Master	Branch	Speedup (`Master/Branch`)
64	.NET 4.7.2	316.4	320.8	0.986284289
64	.NET Core 2.1	238.9	246	0.971138211
64	.NET Core 3.1	250.3	243.4	1.028348398
256	.NET 4.7.2	1,051.30	967	1.087176836
256	.NET Core 2.1	846.8	1,003.30	0.844014751
256	.NET Core 3.1	640.2	437	1.464988558
2048	.NET 4.7.2	4,551.20	4,391.60	1.036342108
2048	.NET Core 2.1	4,390.40	4,225.60	1.039000379
2048	.NET Core 3.1	2,979.40	1,822.70	1.634607999

Things I don't understand:

Why do the benchmarks show a speedup for 2.1 and .NET Framework for Count >= 256? There is no SIMD, and it doesn't look like we changed anything there.
Is there a regression in FromVector4_Rgb24 for 2.1 with Count == 256? (Or is it noise?)
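
The Speedup column in the tables above is simply Master divided by Branch, so values below 1 mean the branch was slower. A small Python sketch of that arithmetic, using three rows copied from the ToVector4_Rgb24 table (the helper names are made up for the example):

```python
from typing import List, Tuple

# (Count, Runtime, Master, Branch) rows copied from the ToVector4_Rgb24 table.
Row = Tuple[int, str, float, float]
ROWS: List[Row] = [
    (64, ".NET 4.7.2", 310.2, 355.5),
    (256, ".NET Core 3.1", 436.8, 212.3),
    (2048, ".NET Core 3.1", 2972.2, 1165.0),
]


def speedup(master: float, branch: float) -> float:
    """Speedup as used in the tables: Master / Branch."""
    return master / branch


def slower_rows(rows: List[Row]) -> List[Row]:
    """Rows where the branch is slower than master (speedup below 1)."""
    return [row for row in rows if speedup(row[2], row[3]) < 1.0]


for count, runtime, master, branch in ROWS:
    print(f"{count:>5} {runtime:<14} {speedup(master, branch):.3f}")
```

A ratio just under 1 on a noisy run is not necessarily a real regression, so the Error column of the underlying benchmarks should be checked before drawing conclusions.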

JimBobSquarePants · 2020-11-02T19:09:23Z

@antonfirsov the speedup is due to this change. Our original per pixel implementation turned out to be quite slow.

https://github.com/SixLabors/ImageSharp/pull/1409/files#diff-92b35c7d3a5d6901602e57d9dfd618b0d6ceb8cf3442fcfa044111c0e1f509dfR73

The regression is just noise. There’s massive error on that run for some reason

peter-dolkens · 2020-11-03T23:49:13Z

The "Fastfallback" is a regression in performance - presumably because it is no longer falling back, but doing it properly, quickly.

Is this only ever triggered algorithmically?

Could people be running this in fallback mode deliberately as it's faster and close enough for their needs?

Should you be retaining the "Fastfallback" path, or are you happy to discontinue that support?

For simplicity, I'd certainly lean towards making it obsolete, but you know your library better than anyone.

Not sure if it's the kind of thing that might deliberately get used in low-quality bulk tasks like thumbnail generation.

JimBobSquarePants · 2020-11-04T00:19:03Z

@peter-dolkens if you check the linked benchmarks the error for that regression run is super high. Likely due to my laptop throttling. I’ll post updated ones in the morning

Those fast fallbacks are tuned algorithms for particular shuffle operations. They only kick in on target frameworks without intrinsics and when there are remaining items we cannot process in bulk as they don’t fit within a vector.
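
The dispatch described here can be sketched roughly as follows. This is a simplified Python model, not the ImageSharp code: the 48-byte block size, the `has_intrinsics` flag, and all function names are assumptions, and the "bulk" path is only a stand-in for a vectorized implementation.

```python
from typing import Sequence, Tuple

# Hypothetical block size: 48 bytes = 16 whole 3-byte pixels = 3 whole 16-byte vectors.
BLOCK_BYTES = 48


def split_bulk(length: int, has_intrinsics: bool) -> Tuple[int, int]:
    """Return (bulk_bytes, remainder_bytes) for a buffer of `length` bytes."""
    if not has_intrinsics:
        return 0, length  # everything goes through the fallback
    bulk = length - length % BLOCK_BYTES
    return bulk, length - bulk


def shuffle3_fallback(src: bytes, order: Sequence[int]) -> bytes:
    """Scalar path; expects a whole number of 3-byte pixels."""
    out = bytearray()
    for i in range(0, len(src), 3):
        pixel = src[i:i + 3]
        out.extend(pixel[j] for j in order)
    return bytes(out)


def shuffle3_bulk(src: bytes, order: Sequence[int]) -> bytes:
    """Stand-in for the vectorized path; only accepts whole blocks."""
    assert len(src) % BLOCK_BYTES == 0
    return shuffle3_fallback(src, order)


def shuffle3(src: bytes, order: Sequence[int], has_intrinsics: bool) -> bytes:
    """Process whole blocks in bulk and hand the leftover pixels to the fallback."""
    bulk, _ = split_bulk(len(src), has_intrinsics)
    return shuffle3_bulk(src[:bulk], order) + shuffle3_fallback(src[bulk:], order)
```

Both routes produce the same bytes; only the amount of work handed to the fallback changes.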

JimBobSquarePants · 2020-11-06T20:36:08Z

Oh wow! I really broke this!

JimBobSquarePants added 2 commits October 30, 2020 20:38

Initial 3padshuffle4

49e9364

Add Shuffle4Slice3

1d21dc9

JimBobSquarePants added the area:performance label Oct 30, 2020

JimBobSquarePants added this to the 1.1.0 milestone Oct 30, 2020

JimBobSquarePants requested a review from a team October 30, 2020 23:20

JimBobSquarePants changed the title ~~[Draft] 3 <==> 4 Channel Shuffling with Harware Intrinsics~~ [Draft] 3 <==> 4 Channel Shuffling with Hardware Intrinsics Oct 30, 2020

JimBobSquarePants added 2 commits October 30, 2020 23:22

Cleanup

9f38d40

Merge branch 'master' into js/Shuffle3Channel

2421a56

fix spans directly

1b85483

saucecontrol reviewed Oct 31, 2020

saucecontrol reviewed Oct 31, 2020

JimBobSquarePants added 7 commits October 31, 2020 18:58

Faster Pad3Shuffle4

21611e1

Faster Shuffle4Slice3

f462bfe

Update benchmark

2d1f2cc

Fast fallbacks

d5b2577

Don't cast full spans

893bfdd

Shuffle3 + Tests

76d5277

Cleanup and fix tests

49062c4

antonfirsov mentioned this pull request Nov 2, 2020

Non-generic Image.Load should decode Jpeg into Image<Rgb24> #1410

Closed

Fix Shuffle4Slice3, wire up shuffles.

8c32469

JimBobSquarePants changed the title ~~[Draft] 3 <==> 4 Channel Shuffling with Hardware Intrinsics~~ 3 <==> 4 Channel Shuffling with Hardware Intrinsics Nov 2, 2020

JimBobSquarePants marked this pull request as ready for review November 2, 2020 16:50

Add Rgb24 <==> Vector4 benchmarks

1f73b21

JimBobSquarePants and others added 4 commits November 6, 2020 19:50

Update src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

e1168ad

Co-authored-by: Clinton Ingram <[email protected]>

Merge branch 'js/Shuffle3Channel' of https://github.com/SixLabors/ImageSharp into js/Shuffle3Channel

a46fb9b

Use ROS trick all round and optimize Shuffle3

74dd8cd

Merge branch 'master' into js/Shuffle3Channel

56cfd96

Fix shuffle

3cda066

JimBobSquarePants merged commit 9f51a92 into master Nov 6, 2020

JimBobSquarePants deleted the js/Shuffle3Channel branch November 6, 2020 21:32

JimBobSquarePants added a commit that referenced this pull request Mar 13, 2021

Merge pull request #1409 from SixLabors/js/Shuffle3Channel

522a91e

3 <==> 4 Channel Shuffling with Hardware Intrinsics

dependabot bot mentioned this pull request Sep 23, 2025

Bump the nuget group with 5 updates norschel/enterJSWebSecurity2025-Demo1#8

Open

dependabot bot mentioned this pull request Oct 19, 2025

Bump the nuget group with 1 update ewdlop/bepuphysics2Fork#1

Open
