-
-
Notifications
You must be signed in to change notification settings - Fork 888
3 <==> 4 Channel Shuffling with Hardware Intrinsics #1409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1409 +/- ##
==========================================
+ Coverage 82.99% 83.14% +0.14%
==========================================
Files 692 695 +3
Lines 31189 31484 +295
Branches 3578 3586 +8
==========================================
+ Hits 25884 26176 +292
- Misses 4582 4585 +3
Partials 723 723
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
@JimBobSquarePants I want to do a proper review here, but not sure about the timing. Lucky case: by Wednesday evening, worst case: by Sunday. Is this acceptable for you? |
|
@antonfirsov No worries. Hopefully someone else from the team can also have a look in the interim. I added the Rgb24/Vector4 benchmarks to #1354 (comment) We're looking at a very healthy speedup on all target frameworks. |
|
@JimBobSquarePants I tried to destile some derived numbers to make conclusions from #1354 (comment): ToVector4_Rgb24
FromVector4_Rgb24
Things I don't understand:
|
|
@antonfirsov the speedup is due to this change. Our original per pixel implementation turned out to be quite slow. The regression is just noise. There’s massive error on that run for some reason |
|
The "Fastfallback" is a regression in performance - presumably because it is no longer falling back, but doing it properly, quickly. Is this only ever triggered algorithmically? Could people be running this in fallback mode deliberately as it's faster and close enough for their needs? Should you be retaining the "Fastfallback" path, or happy to discontinue that support. For simplicity, I'd certainly lean towards making it obsolete, but you know your library better than anyone. Not sure if it's the kind of thing that might deliberately get used in low-quality bulk tasks like thumbnail generation. |
|
@peter-dolkens if you check the linked benchmarks the error for that regression run is super high. Likely due to my laptop throttling. I’ll post updated ones in the morning Those fast fallbacks are tuned algorithms for particular shuffle operations. They only kick in on target frameworks without intrinsics and when there are remaining items we cannot process in bulk as they don’t fit within a vector. |
Co-authored-by: Clinton Ingram <[email protected]>
…geSharp into js/Shuffle3Channel
|
Oh wow! I really broke this! |
3 <==> 4 Channel Shuffling with Hardware Intrinsics
Prerequisites
Description
Adds three methods to
SimdUtils:Pad3Shuffle4Pads a buffer representing 3 channel pixels to 4 channels and shuffles the result.Shuffle4Slice3Shuffles a buffer representing 4 channel pixels and slices it to 3 channels.Shuffle3Shuffles a buffer representing 3 channel pixels.All fallbacks are better than or equal two current implementations.
~I've completed a basic implementation for now with results in
2.5-3x speedup. I looked into more loop unrolling based on this example but kept messing up my offsets so I left it. @saucecontrol if you fancy helping me here please do.Fixed, thanks @saucecontrol!Once the initial implementation is reviewed I will make the methods generic and add optimized fallback versions ofDone!XYZWshuffling as per @antonfirsov previous suggestions in 4 channel shuffling which I can use custom structs similar toRgb24andRgba32to provide no loss in current performance on older platforms.Current Benchmarks
Pad3Shuffle4
Shuffle4Slice3
Shuffle3
Note: No fast fallback implementation here. I experimented with casting to uint and using shuffling but it was a shade slower than the default.