Description
Problem
The current float- and Vector4-based pixel blender API does not leave us much room for the rasterization perf improvements needed for SixLabors/ImageSharp.Drawing#102. With our current bulk API, the most we can do is process 2 pixels in one AVX batch, since only 8 floats fit into one AVX register. This means the expected speedup for blending is around or below 2x, which would keep us lagging significantly behind Skia and GDI.
Idea
We should explore APIs and implementations that work with UInt16-based fixed-point arithmetic. This is technically very similar to the approach taken by the libjpeg decoder SIMD pipelines, which we eventually also want to adapt. A 256-bit register holds 16 UInt16 values, i.e. 4 RGBA pixels instead of 2, and integer operations avoid the byte-to-float conversions, so in theory UInt16-based bulk processing should reduce the time spent in pixel blenders by ~4x (or more) when AVX2 is present.
This will require API additions similar to the following:
    public abstract class PixelBlender<TPixel>
    {
        public void Blend<TPixelSrc>(
            Configuration configuration,
            Span<TPixel> destination,
            ReadOnlySpan<TPixel> background,
            ReadOnlySpan<TPixelSrc> source,
            // 'amount' is scaled to 0-255. Could be byte, but with UInt16 we will avoid some unnecessary conversions
            ReadOnlySpan<UInt16> amount);

        // Works on Rgba32 data; the result is written to 'destination'.
        protected virtual void BlendFunction(
            Configuration configuration,
            Span<Rgba32> destination,
            ReadOnlySpan<Rgba32> background,
            ReadOnlySpan<Rgba32> source,
            ReadOnlySpan<UInt16> amount);
    }

    public static class PorterDuffFunctions
    {
        // The result is written to 'destination', so nothing is returned.
        /*public*/ static void NormalSrcOver(
            Span<Rgba32> destination,
            ReadOnlySpan<Rgba32> background,
            ReadOnlySpan<Rgba32> source,
            ReadOnlySpan<UInt16> opacity);
    }

Update:
In the first API variant there was a type ScaledUInt16Vector4, but after thinking it through, I realized it is unnecessary. We should work with Rgba32 to maximize perf.
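
To make the fixed-point idea more concrete, here is a minimal scalar sketch of what a normal source-over blend could look like with integer math only. It is not the proposed implementation: the `Rgba` struct is a hypothetical stand-in for Rgba32 so the snippet compiles on its own, `FixedPointBlend` and `MulDiv255` are names made up for illustration, and `amount` is assumed to stay within 0-255. A real version would process the same arithmetic in 16-bit SIMD lanes.

```csharp
using System;

// Hypothetical stand-in for Rgba32, so the sketch has no ImageSharp dependency.
public struct Rgba
{
    public byte R, G, B, A;

    public Rgba(byte r, byte g, byte b, byte a)
    {
        R = r; G = g; B = b; A = a;
    }
}

public static class FixedPointBlend
{
    // Rounded (x * y) / 255 for x, y in 0..255, using shifts instead of a division.
    private static int MulDiv255(int x, int y)
    {
        int t = (x * y) + 128;
        return (t + (t >> 8)) >> 8;
    }

    // Straight-alpha (non-premultiplied) source-over. 'amount' is assumed to be 0..255.
    public static void NormalSrcOver(
        Span<Rgba> destination,
        ReadOnlySpan<Rgba> background,
        ReadOnlySpan<Rgba> source,
        ReadOnlySpan<ushort> amount)
    {
        for (int i = 0; i < destination.Length; i++)
        {
            Rgba b = background[i];
            Rgba s = source[i];

            int sa = MulDiv255(s.A, amount[i]);   // effective source alpha
            int ba = MulDiv255(b.A, 255 - sa);    // remaining background weight
            int outA = sa + ba;                   // never exceeds 255

            if (outA == 0)
            {
                destination[i] = default;         // fully transparent result
                continue;
            }

            int half = outA >> 1;                 // rounding term for the division
            destination[i] = new Rgba(
                (byte)(((s.R * sa) + (b.R * ba) + half) / outA),
                (byte)(((s.G * sa) + (b.G * ba) + half) / outA),
                (byte)(((s.B * sa) + (b.B * ba) + half) / outA),
                (byte)outA);
        }
    }
}
```

All intermediate values here fit into 16 bits except the final color sum before the division, which is one of the things a SIMD version would have to handle (for example by keeping the background opaque or by working with premultiplied data).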