Skip to content
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions src/libraries/System.Private.CoreLib/src/System/Double.cs
Original file line number Diff line number Diff line change
Expand Up @@ -188,12 +188,17 @@ public int CompareTo(object? value)
throw new ArgumentException(SR.Arg_MustBeDouble);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public int CompareTo(double value)
{
if (m_value < value) return -1;
if (m_value > value) return 1;
if (m_value == value) return 0;
return CompareToWithAtLeastOneNaN(value);
}

private int CompareToWithAtLeastOneNaN(double value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the manual split inlining actually make the code significantly smaller in real world situations vs. just aggressive inlining the whole thing? The call introduced by the split inlining is going to have secondary effect like extra register spilling, so it is not obvious to me that it is actually profitable.

Copy link
Member

@tannergooding tannergooding Aug 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire inlined assembly is fairly large: #56501 (comment)

NaN checks in particular aren't done very efficiently today and we should try to improve it in .NET 7. So with the partial inlining, the call will remove 2-4 extra branches from the inlined code-path (in favor of a call which may spill on a probably unlikely path).

In the link, I gave an example of what we could generate for this entire comparison if we did something "more optimally".

Copy link
Contributor Author

@rickbrew rickbrew Aug 17, 2021

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a particular preference on which we go with or if we hold off and wait for the JIT to make this more intrinsic or even simply improve the general handling for these kinds of checks and branches.

Copy link
Member

@jkotas jkotas Aug 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNYgGYmpBkIYBvGgzNNdxFAwCyACgCUx0+YC+Nd9S27cGKBzAMBgAVVQwaE2pzJnIkBmAICEkGAElcWWgMGAATAH0RcWxYYIgAMQ5JSQBPFK5JXi4cuzYYbGyAeTqq2TFsLgAebIgOYGkAPgZsBxczSOjogDNoBjsGoN4GAF4GOjUGDf7JlgAZGC4AcwwACwY4BnI93gBqJ+mo+Y/eBZWC3uKyirVWr1RrZOzYADavAAumhJlCGE97tCnIc6E5iAB2BgLbCSXAwDTvD5Yhh+DiEmYMVzRKlUiE2GDXCDZFLiSR2RnM1ns9piPgQLi4FgAQXO51guFwvAUMGBDQa5wc0KpOhicTWBlEfxgJXKlRqdQaTQaDAAIsNRjAGPhcgo8RS4RaRtIGPbJBS3tE5h99t87Lb3RSGIcgzAMdi4A8qdEvitAw7rRMwxH7kTfX743bE1ttimmNjdjHzAB6EsMEVBaR4IKC60Qb7Xa1h3D7VsAOWw7ZYxbMcbsQxdgjSnfbAezHvDXoz5lJg6tLBHXbs+YA/DsGCBbtHifMYPiYL35qSd9FPLTd8w4gkkql0pkcvltUVdRAevUMPLQc1Wh0uj0+kGS1xkmadZiPJYoBWTUNm2XZ9hDI5Tgua5bjTfYXjAmd+1+F8SnfXhPyNb9IRhOFSMRZFUR2VNcQPdNfVJclKV3GlzE0XcGSZK4WTZMQOS5HieX4vkBSFUVxUlaVZS/RVlVVSxYn2HgtUKf4CKIkETS4c1gOtBNJzhedXRTKkfU+f0DODUNE1TKMGIsrMwwYZNbILNMj37KzrU2PM3NJItd2iUlcP+AB1QirkrU4a06GBR3HMM4RTBzqQ46IxCgGVsCydVlKCULXwi65otaXw4oS01nStG0J0dBhjObWyzKPMsKyrMra0aBgGzJK4msnVteA7LseyC8x+0axdcAS7yHCwjM5z06aErXDct3so99wJI9guxU83A8IA=

shows the behavior of the two options in a simple real world case. Notice that the IsSorted_CompareToFullyInlined (code size 0x59) produces smaller code than IsSorted_CompareToSplitInlined (code size 0x6f). The obvious problem is that the slow path is a good inlining candidate and the JIT decides to inline it. Once you try to fix it by marking the slow path with NoInlining:

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNYgGYmpBkIYBvGgzNNdxFAwCyACgCUx0+YC+Nd9S27cGKBzAMBgAVVQwaE2pzJnIkBmAICEkGAElcWWgMGAATAH0RcWxYYIgAMQ5JSQBPFK5JXi4cuzYYbGyAeTqq2TFsLgAebIgOYGkAPgZsBxczSOjogDNoBjsGoN4GAF4GOjUGDf7JlgAZGC4AcwwACwY4BnI93gBqJ+mo+Y/eBZWC3uKyirVWr1RrZOzYADavAAumhJlCGE97tCnIc6E5iAB2BgLbCSXAwDTvD5Yhh+DiEmYMVzRKlUiE2GDXCDZFLiSR2RnM1ns9piPgQLi4FgAQXO51guFwvAUMGBDQa5wc0KpOhicTWBlEfxgJXKlRqdQaTQaDAAIsNRjAGPhcgo8RS4RaRtIGPbJBS3tE5h99t87Lb3RSGIcgzAMdi4A8qdEvitAw7rRMwxH7kTfX743bE1ttimmNjdjHzAB6EsMEVBaR4IKC60Qb7Xa1h3D7VsAOWw7ZYxbMcbsQxdgjSnfbAezHvDXoz5lJg6tLBHXbs+YA/DsGCBbtHifMYPiYL35qSd9FPLTd8w4gkkql0pkcvltUVdRAevUMPLQc1Wh0uj0+kGS1xkmadZiPJYoBWTUNm2XZ9hDI5Tgua5bjTfYXjAmd+1+F8SnfXhPyNb9IRhOFSMRZFUR2VNcQPdNfVJclKV3GlzE0XcGSZK4WTZMQOS5HieX4vkBSFUVxUlaVZS/RVlVVSxYn2HgtUKf4CKIkETS4c1gOtBNJzhedXRTKkfU+f0DODUNE1TKMGIsrMwwYZNbILNMj37KzrU2PM3NJItd2iUlcP+AB1QirkrU4a06GBR3HMM4RTBzqQ46IuO5PiBO43jeX5XhBWFdsIFki55N3MQoBlbAsnVZSglC18IuuaLWl8OKEtNZ0rRtCdHQYYzm1ssyjzLCsq3a2tGgYBsySuYbJ1bXgOy7HsgvMfshsXXAEu8hwsIzOc9J2hK1w3Ld7KPfcCSPYLsVPNwPCAA

it gets a bit better, but the code size of the non-split inlined version is still smaller.

I don't have a particular preference on which we go with or if we hold off and wait for the JIT to make this more intrinsic

I agree that it would be nice to teach the JIT to deal with the efficiently. If we want to tweak the performance by manual inlining, simple AggressiveInlining should be better than the manual split inlinining as my examples demonstrated.

Copy link
Member

@tannergooding tannergooding Aug 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to tweak the performance by manual inlining, simple AggressiveInlining should be better than the manual split inlinining as my examples demonstrated.

I'm fine with this. I would like to disagree on the examples adequately demonstrating the difference however as smaller code isn't necessarily better.

There are many factors that can impact perf here including the likelihood that a given path will be hit (NaNs are likely rare) and considerations such as the number of branches in a "small window" (16-32 aligned bytes of assembly). This was why I was initially hesitant about the change as it could regress certain "hot loops" due to the additional branches (7) that would now be present directly in the loop. Of course, the partial inlining could also impact this in interesting ways with its overall larger codegen but it does reduce the branch count by 3 compared to full inlining.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that you would be likely able to come up with cases where the split inlining is better due for micro-architecture reasons, if you tried hard enough. The data we have so far in this thread is that a simple AggressiveInlining produces faster and smaller code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To satisfy my own curiosity, I tried it in double.cs in corelib. Here main is what's currently checked in, pr1 is CompareTo aggressively inlined, and pr2 is it split.

private double[] _doubles = Enumerable.Range(0, 1_000_000).Select(i => i * 1.23).ToArray();
private double[] _scratch = new double[1_000_000];

[Benchmark]
public bool IsSorted()
{
    ReadOnlySpan<double> a = _doubles;
    for (int i = 0; i < a.Length - 1; i++)
        if (a[i].CompareTo(a[i + 1]) > 0)
            return false;
    return true;
}

[Benchmark]
public void CopyAndSort()
{
    _doubles.CopyTo(_scratch, 0);
    Array.Sort(_scratch);
}

[Benchmark]
public int Search() => Array.BinarySearch(_doubles, 2_000_000);

[Benchmark]
public int CompareSequence() => _doubles.AsSpan().SequenceCompareTo(_doubles);
Method Toolchain Mean Ratio Code Size
IsSorted main\corerun.exe 1,921,366.71 ns 1.00 155 B
IsSorted pr1\corerun.exe 739,726.37 ns 0.39 97 B
IsSorted pr2\corerun.exe 970,601.44 ns 0.50 117 B
CopyAndSort main\corerun.exe 11,761,377.29 ns 1.00 1,060 B
CopyAndSort pr1\corerun.exe 11,810,071.67 ns 1.00 1,060 B
CopyAndSort pr2\corerun.exe 11,837,290.42 ns 1.01 1,060 B
Search main\corerun.exe 58.53 ns 1.00 207 B
Search pr1\corerun.exe 35.51 ns 0.61 207 B
Search pr2\corerun.exe 38.65 ns 0.66 207 B
CompareSequence main\corerun.exe 1,829,633.22 ns 1.00 330 B
CompareSequence pr1\corerun.exe 1,452,953.83 ns 0.79 306 B
CompareSequence pr2\corerun.exe 1,523,349.48 ns 0.83 316 B

I asked whether it would make sense to split it. I'm fine with the answer being "no" 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright I have reverted the commit that split the inlining, so we're back to just regular aggressive inlined

{
// At least one of the values is NaN.
if (IsNaN(m_value))
return IsNaN(value) ? 0 : -1;
Expand Down
5 changes: 5 additions & 0 deletions src/libraries/System.Private.CoreLib/src/System/Single.cs
Original file line number Diff line number Diff line change
Expand Up @@ -183,12 +183,17 @@ public int CompareTo(object? value)
throw new ArgumentException(SR.Arg_MustBeSingle);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public int CompareTo(float value)
{
if (m_value < value) return -1;
if (m_value > value) return 1;
if (m_value == value) return 0;
return CompareToWithAtLeastOneNaN(value);
}

private int CompareToWithAtLeastOneNaN(float value)
{
// At least one of the values is NaN.
if (IsNaN(m_value))
return IsNaN(value) ? 0 : -1;
Expand Down