Optimize LZMA range decoder #910
Merged
I noticed this duplicated bit of logic inside the range decoder. Turns out, there's a helper function called `Normalize2` that does the exact same thing and was never even called. This quirk came all the way from the original LZMA C# SDK, it seems. I can see why, as with .NET Framework 4.8, attempting to use the helper function results in performance degradation, so someone manually inlined it. But that's not even necessary, as `[MethodImpl(MethodImplOptions.AggressiveInlining)]` can be used to achieve the same performance.

But the more interesting part, hence this PR, is that newer JITs (tested with .NET 8.0) don't like the manually inlined version of the code as much; when calling `Normalize2` instead, it seems to get inlined either way (even without the attribute that we put as a hint for .NET 4.8), and the performance is better.

Core i7-6700k (3.2% reduction):
Apple M3 (11.6% reduction):
That's the reduction in overall time to extract the whole archive. I tested with a smaller archive in BenchmarkDotNet and was honestly in disbelief at first on the M3, but it's real. I also manually benchmarked using the same setup/archive as my first PR (a 412MB Qt 7z taking 26+ seconds), and it indeed shaved an entire 3 seconds off the extraction time.
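For readers who haven't looked at the decoder, here is a minimal sketch of the two shapes being compared. It is not the code from this PR: only the name `Normalize2` and the `AggressiveInlining` attribute come from the description above. The class `SketchRangeDecoder`, the `TopValue` threshold, the byte-array input, and `NextByte` are assumptions made up for illustration, loosely modeled on how the LZMA SDK's range decoder refills its state.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for the range decoder; not the project's actual class.
public sealed class SketchRangeDecoder
{
    // Assumed normalization threshold (the LZMA SDK uses 1 << 24).
    public const uint TopValue = 1u << 24;

    public uint Range;
    public uint Code;

    private readonly byte[] _input; // assumed input source (the real decoder reads a stream)
    private int _pos;

    public SketchRangeDecoder(byte[] input, uint range, uint code)
    {
        _input = input;
        Range = range;
        Code = code;
    }

    // Returns the next input byte, or 0 once the input is exhausted.
    private byte NextByte() => _pos < _input.Length ? _input[_pos++] : (byte)0;

    // Helper version: one normalization step, with an inlining hint
    // intended for older JITs such as .NET Framework 4.8.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Normalize2()
    {
        if (Range < TopValue)
        {
            Code = (Code << 8) | NextByte();
            Range <<= 8;
        }
    }

    // Manually inlined version: the same body written out where the helper
    // would be called, with the usual bit-decoding work omitted.
    public void StepWithManualInline()
    {
        // ... bit-decoding work would go here ...
        if (Range < TopValue)
        {
            Code = (Code << 8) | NextByte();
            Range <<= 8;
        }
    }

    // Same step, delegating to the helper instead.
    public void StepWithHelper()
    {
        // ... bit-decoding work would go here ...
        Normalize2();
    }
}
```

Both step methods leave the decoder in the same state; the only difference is whether the JIT or the programmer does the inlining, which is the thing the benchmarks above measure.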