Inliner: methodology for code quality measurements
We'd like to measure the impact of inlining on performance, so we can
build predictive models. Ideally we'd measure the isolated impact of
just one inline, so that we can attribute performance change
accurately to the inline that is responsible.
For instance, if A calls B (A->B), we can measure two binaries, one
with A->B and the other with B inlined into A (A+B). By comparing the
performance of the two we can largely attribute the overall
performance change to the changes in A from inlining.
We can also measure inlining ensembles, where there are multiple
inlining differences between two binaries, but the attribution problem
becomes considerably harder. We'll discuss this a bit more
subsequently.
Obligatory note of caution: modern hardware is complex and the changes
in performance may arise from many factors unrelated to inlining. We
can view the impact of the complex HW as a noise source. We need to
minimize this noise as much as possible. Where it can't be
sufficiently minimized, we may need to look into things like
STABILIZER to help average out the noise impact.
So, ideally, we'll measure something relevant to perf, with low
noise. We also want to make a lot of measurements, so something
relatively insensitive to HW details and machine loading would be
ideal. And it would be really nice to attribute perf back to the code
artifacts, so we can be confident the changes in overall perf came
about directly from the changes in the code the compiler created.
And, whatever we do needs to scale to work with realistic benchmark
scenarios.
Below are some possibilities.
Measurement Approaches
- Machine counters. Instructions retired, as measured by the
processor's performance monitoring unit (PMU), seems to be the most
appropriate, with sub-1% noise levels.
Clock cycles might also be appropriate but are quite a bit noisier.
However PMU counters are not usually available in VMs, so we'd need to
run on bare hardware. On Windows, accessing PMU counters requires
kernel mode driver support and the machine must boot with Hyper-V
disabled (note Hyper-V seems to get turned on by default with Windows
10). Possible measurement tools include VTune and
xunit-performance. On Linux, perf can read PMU counters from user
mode (a usage sketch follows this list). Typically all these will
require running on a quiet machine, since counter values are
generally not part of the per-thread context. Correlation back to
jit-produced code might be tricky, since counters are used in
sampling mode, counters can drift, inclusive count data would require
accurate stack walks, and we need accurate symbols for the
correlation.
- Run under some virtualized execution environment. For instance,
something like Pin. Making this work would likely require some
customization to handle jitted code. Has some overhead. Might be able
to correlate directly back to jit artifacts and so work without
symbols.
- Compiler instrumentation. This would need to be done late (unlike
the current IBC done by the jit) since we want actual HW instruction
counts, not logical block counts. There are some challenges
externalizing the data but we could probably hijack parts of the IBC
mechanism. Should be able to correlate directly back to jit artifacts
and so work without symbols. Would have some overhead.
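To make the machine counter option concrete, here's a rough sketch of
collecting instructions-retired counts for a benchmark, assuming Linux
with the perf tool installed and PMU access enabled; the benchmark
binary names are placeholders.

```python
import subprocess

def instructions_retired(cmd, runs=10):
    """Run `cmd` under perf stat and return the instructions-retired
    count for each run. Assumes Linux with perf available and PMU
    access permitted (perf_event_paranoid set low enough)."""
    counts = []
    for _ in range(runs):
        # -x, emits machine-readable CSV; instructions:u counts user-mode
        # instructions retired only, which keeps kernel activity out of
        # the measurement. perf stat writes its report to stderr.
        result = subprocess.run(
            ["perf", "stat", "-x", ",", "-e", "instructions:u"] + cmd,
            capture_output=True, text=True, check=True)
        for line in result.stderr.splitlines():
            fields = line.split(",")
            if len(fields) > 2 and fields[2] == "instructions:u" and fields[0].isdigit():
                counts.append(int(fields[0]))
    return counts

# Example: compare a baseline build against one with A+B inlined.
# baseline = instructions_retired(["./benchmark_noinline"])
# inlined  = instructions_retired(["./benchmark_AplusB"])
```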
Measuring ensembles
It would be nice to measure the impact of more than one inline per
run. Is this doable? Maybe. If we have inlines in unrelated parts of
the call graph then likely their impacts can be measured
concurrently. But if one of the inlines "dominates" the other then it
may potentially alter things to the point where the modelling gets
tricky.
Consider A->B->C, E->F. If we try to measure A+B and B+C in the same
run, there's no guarantee that B+C is even called in the second
run. Even if it is, it may not behave the same if it has
caller-correlated behavior. So attributing perf changes in B+C to the
inline alone is likely to lead to inaccuracies.
Now if we have A+B and E+F, and there's no runtime call stack with all
four methods active, we can possibly measure the impact of the two
inlines in the same run. But this independence may not be knowable.
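As a rough illustration, a conservative static check along these lines
could rule out combining two inlines whose callers can appear on the
same call stack; the call graph representation and method names here
are just for the example.

```python
def independent(call_graph, site1, site2):
    """Rough independence test for measuring two inlines in one run:
    returns False if one inline's caller can reach the other's, i.e.
    both inlined bodies could be active on the same call stack.
    call_graph: dict mapping method -> set of callees.
    site1, site2: (caller, callee) pairs, e.g. ("A", "B") and ("E", "F").
    This is a static approximation; caller-correlated behavior can
    still couple the two inlines at runtime."""
    def reachable(src, dst):
        seen, work = set(), [src]
        while work:
            m = work.pop()
            if m == dst:
                return True
            if m in seen:
                continue
            seen.add(m)
            work.extend(call_graph.get(m, ()))
        return False

    (a, _), (e, _) = site1, site2
    return not (reachable(a, e) or reachable(e, a))

# Example from the text: A+B and E+F are independent, A+B and B+C are not.
graph = {"A": {"B"}, "B": {"C"}, "E": {"F"}}
assert independent(graph, ("A", "B"), ("E", "F"))
assert not independent(graph, ("A", "B"), ("B", "C"))
```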
Interpreting Results
Assume for now we've solved the measurement problem satisfactorily.
What can we make of the data?
So we have two scenarios, S0 (no inlines) and S1 (A+B). Then
impact(A+B) ~ perf(S1) - perf(S0)
This lets us measure the impact of one inline per run. Depending on
noise level we may need to iterate. Also we need to ensure that A is
executed during the run.
Ideally we'd then extract the number of calls to A and produce a
normalized mean impact per call:
meanImpact(A+B) = impact(A+B) / calls(A)
After all, the change in perf might happen because A is called a lot,
and only sometimes calls B, or because A is not called that often, but
calls B in a loop.
Note even if A never calls B in the run, there might still be some
change in A's performance (eg extra spills in prolog/epilog). However
it would probably be best to try and model the direct effects where
possible. That is, the inliner will likely have a hard time modelling
the impact of code changes that are not in the neighborhood of the
call site (a classic example here is register pressure...) So we
might also insist that B have nonzero call count in the first scenario
and verify that the calls to B are reduced in S1.
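Putting the above together, a minimal sketch of the per-inline
bookkeeping might look like the following; the function name, the
parameters, and the source of the call counts are all hypothetical.

```python
from statistics import mean

def inline_impact(perf_s0, perf_s1, calls_a, calls_b_s0, calls_b_s1):
    """Compute impact and per-call mean impact for one inline A+B.
    perf_s0 / perf_s1: repeated measurements (e.g. instructions retired)
    for baseline S0 and the S1 = A+B configuration.
    calls_a: number of calls to A observed in the run.
    calls_b_s0 / calls_b_s1: calls to B observed in each scenario."""
    # Require that the inline had a chance to matter: A ran, B was
    # called in the baseline, and calls to B dropped once inlined.
    if calls_a == 0 or calls_b_s0 == 0 or calls_b_s1 >= calls_b_s0:
        return None
    impact = mean(perf_s1) - mean(perf_s0)   # impact(A+B) ~ perf(S1) - perf(S0)
    mean_impact = impact / calls_a           # meanImpact(A+B) = impact(A+B) / calls(A)
    return impact, mean_impact
```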
A plausible "driver" for a doing inline measurements would be
something like the following:
(1) run some baseline config, capture baseline inline trees.
(2) repeat in perf mode, gather perf data.
(3) attribute perf data to methods.
(4) pick hottest method, iterate on its subtrees, gathering per-inline impacts one by one.
(5) return that method to its default profile, pick another method, repeat.
(6) continue until we've run out of methods that execute enough to give interesting data.
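A minimal sketch of that control flow, with every harness hook left as
a hypothetical callable or value, might look like:

```python
def measurement_driver(run_config, baseline_inline_trees, hot_methods,
                       candidate_inlines, min_samples):
    """Skeleton of the per-inline measurement driver described above.
    All parameters are hypothetical stand-ins for the real harness:
      run_config(overrides) -> perf metric for a run with the given
        per-method inline overrides (empty dict = default policy),
      baseline_inline_trees -> inline trees captured from the baseline run,
      hot_methods -> [(method, samples)] sorted hottest-first,
      candidate_inlines(trees, method) -> inline subtrees to toggle,
      min_samples -> cutoff below which methods aren't interesting."""
    baseline = run_config({})                  # steps (1)-(3): baseline perf
    for method, samples in hot_methods:
        if samples < min_samples:              # step (6): stop when too cold
            break
        for subtree in candidate_inlines(baseline_inline_trees, method):
            # step (4): toggle exactly one inline subtree in this method
            perturbed = run_config({method: subtree})
            yield method, subtree, perturbed - baseline
        # step (5): only one method is perturbed per run, so the method
        # reverts to its default profile before we move to the next one.
```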
Alternatives
Another approach here is to avoid trying to model the actual magnitude
of the effects directly. We instead measure the performance of a whole
bunch of ensembles (generated randomly, say), and "score" each inline
in an ensemble based on the observed performance results. Aggregating
scores for an inline across ensembles then provides a synthetic
benefit metric for a given inline. That is, if the ensembles that
include A+B tend to have better performance, and ensembles that don't
include A+B tend to have worse performance, A+B will have a good
score, and then we can try and build predictors for the score, or set
some kind of threshold for a "good" score and build a classifier to
identify inlines that have sufficiently good scores.
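A sketch of that scoring scheme, assuming a hypothetical measure()
hook that builds and times a binary for a given ensemble of inlines,
might look like:

```python
import random
from statistics import mean

def score_inlines(candidates, measure, num_ensembles=200, seed=0):
    """Synthetic-benefit scoring across random ensembles.
    candidates: candidate inlines, e.g. ("A", "B") caller/callee pairs.
    measure(ensemble): hypothetical callable returning observed perf
    (lower is better) for a build with that set of inlines enabled.
    Score = mean perf of ensembles excluding the inline minus mean perf
    of ensembles including it, so a positive score means beneficial."""
    rng = random.Random(seed)
    with_perf = {c: [] for c in candidates}
    without_perf = {c: [] for c in candidates}
    for _ in range(num_ensembles):
        # Random ensemble: each candidate inline enabled with prob 1/2.
        ensemble = {c for c in candidates if rng.random() < 0.5}
        perf = measure(ensemble)
        for c in candidates:
            (with_perf if c in ensemble else without_perf)[c].append(perf)
    return {c: mean(without_perf[c]) - mean(with_perf[c])
            for c in candidates
            if with_perf[c] and without_perf[c]}
```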
category:cq
theme:inlining
skill-level:expert
cost:large