Inliner: methodology for code quality measurements
We'd like to measure the impact of inlining on performance, so we can
build predictive models. Ideally we'd measure the isolated impact of
just one inline, so that we can attribute performance change
accurately to the inline that is responsible.
For instance, if A calls B (A->B), we can measure two binaries, one
with A->B and the other with B inlined into A (A+B). By comparing the
performance of the two we can largely attribute the overall
performance change to the changes in A from inlining.
We can also measure inlining ensembles, where there are multiple
inlining differences between two binaries, but the attribution problem
becomes considerably harder. We'll discuss this a bit more
subsequently.
Obligatory note of caution: modern hardware is complex and the changes
in performance may arise from many factors unrelated to inlining. We
can view the impact of the complex HW as a noise source. We need to
minimize this noise as much as possible. Where it can't be
sufficiently minimized, we may need to look into things like
STABILIZER to help average out the noise impact.
So, ideally, we'll measure something relevant to perf, with low
noise. We also want to make a lot of measurements, so something
relatively insensitive to HW details and machine loading would be
ideal. And it would be really nice to attribute perf back to the code
artifacts, so we can be confident the changes in overall perf came
about directly from the changes in the code the compiler created.
And, whatever we do needs to scale to work with realistic benchmark
scenarios.
Below are some possibilities.
Measurement Approaches
- Machine counters. Instructions retired, as measured by the
processor's performance monitoring unit (PMU), seems to be the most
appropriate, with sub-1% noise levels.
Clock cycles might also be appropriate but are quite a bit noisier.
However PMU counters are not usually available in VMs, so we'd need to
run on bare hardware. On Windows, accessing PMU counters requires
kernel mode driver support and the machine must boot with Hyper-V
disabled (note Hyper-V seems to get turned on by default with Windows
10). Possible measurement tools include VTune and
xunit-performance. On Linux, perf can read PMU counters from user
mode (a usage sketch follows this list). Typically all these will
require running on a quiet machine, since counter values are
generally not part of the per-thread context. Correlation back to
jit-produced code might be tricky, since counters are used in
sampling mode, counters can drift, inclusive count data would require
accurate stack walks, and we need accurate symbols for the
correlation.
- Run under some virtualized execution environment. For instance,
something like Pin. Making this work would likely require some
customization to handle jitted code. Has some overhead. Might be able
to correlate directly back to jit artifacts and so work without
symbols.
- Compiler instrumentation. This would need to be done late (unlike
the current IBC done by the jit) since we want actual HW instruction
counts, not logical block counts. There are some challenges
externalizing the data but we could probably hijack parts of the IBC
mechanism. Should be able to correlate directly back to jit artifacts
and so work without symbols. Would have some overhead.
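To make the machine counter option concrete, here's a rough sketch of
collecting instructions-retired counts for a benchmark, assuming Linux
with the perf tool installed and PMU access enabled; the benchmark
binary names are placeholders.

```python
import subprocess

def instructions_retired(cmd, runs=10):
    """Run `cmd` under perf stat and return the instructions-retired
    count for each run. Assumes Linux with perf available and PMU
    access permitted (perf_event_paranoid set low enough)."""
    counts = []
    for _ in range(runs):
        # -x, emits machine-readable CSV; instructions:u counts user-mode
        # instructions retired only, which keeps kernel activity out of
        # the measurement. perf stat writes its report to stderr.
        result = subprocess.run(
            ["perf", "stat", "-x", ",", "-e", "instructions:u"] + cmd,
            capture_output=True, text=True, check=True)
        for line in result.stderr.splitlines():
            fields = line.split(",")
            if len(fields) > 2 and fields[2] == "instructions:u" and fields[0].isdigit():
                counts.append(int(fields[0]))
    return counts

# Example: compare a baseline build against one with A+B inlined.
# baseline = instructions_retired(["./benchmark_noinline"])
# inlined  = instructions_retired(["./benchmark_AplusB"])
```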
Measuring ensembles
It would be nice to measure the impact of more than one inline per
run. Is this doable? Maybe. If we have inlines in unrelated parts of
the call graph then likely their impacts can be measured
concurrently. But if one of the inlines "dominates" the other then it
may potentially alter things to the point where the modelling gets
tricky.
Consider A->B->C, E->F. If we try to measure A+B and B+C in the same
run, there's no guarantee that B+C is even called in the second
run. Even if it is, it may not behave the same if it has
caller-correlated behavior. So attributing perf changes in B+C to the
inline alone is likely to lead to inaccuracies.
Now if we have A+B and E+F, and there's no runtime call stack with all
four methods active, we can possibly measure the impact of the two
inlines in the same run. But this independence may not be knowable.
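As a rough illustration, a conservative static check along these lines
could rule out combining two inlines whose callers can appear on the
same call stack; the call graph representation and method names here
are just for the example.

```python
def independent(call_graph, site1, site2):
    """Rough independence test for measuring two inlines in one run:
    returns False if one inline's caller can reach the other's, i.e.
    both inlined bodies could be active on the same call stack.
    call_graph: dict mapping method -> set of callees.
    site1, site2: (caller, callee) pairs, e.g. ("A", "B") and ("E", "F").
    This is a static approximation; caller-correlated behavior can
    still couple the two inlines at runtime."""
    def reachable(src, dst):
        seen, work = set(), [src]
        while work:
            m = work.pop()
            if m == dst:
                return True
            if m in seen:
                continue
            seen.add(m)
            work.extend(call_graph.get(m, ()))
        return False

    (a, _), (e, _) = site1, site2
    return not (reachable(a, e) or reachable(e, a))

# Example from the text: A+B and E+F are independent, A+B and B+C are not.
graph = {"A": {"B"}, "B": {"C"}, "E": {"F"}}
assert independent(graph, ("A", "B"), ("E", "F"))
assert not independent(graph, ("A", "B"), ("B", "C"))
```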
Interpreting Results
Assume for now we've solved the measurement problem satisfactorily.
What can we make of the data?
So we have two scenarios, S0 (no inlines) and S1 (A+B). Then
impact(A+B) ~ perf(S1) - perf(S0)
This lets us measure the impact of one inline per run. Depending on
noise level we may need to iterate. Also we need to ensure that A is
executed during the run.
Ideally we'd then extract the number of calls to A and produce a
normalized mean impact per call:
meanImpact(A+B) = impact(A+B) / calls(A)
After all, the change in perf might happen because A is called a lot,
and only sometimes calls B, or because A is not called that often, but
calls B in a loop.
Note even if A never calls B in the run, there might still be some
change in A's performance (eg extra spills in prolog/epilog). However
it would probably be best to try and model the direct effects where
possible. That is, the inliner will likely have a hard time modelling
the impact of code changes that are not in the neighborhood of the
call site (a classic example here is register pressure...) So we
might also insist that B have nonzero call count in the first scenario
and verify that the calls to B are reduced in S1.
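Putting the above together, a minimal sketch of the per-inline
bookkeeping might look like the following; the function name, the
parameters, and the source of the call counts are all hypothetical.

```python
from statistics import mean

def inline_impact(perf_s0, perf_s1, calls_a, calls_b_s0, calls_b_s1):
    """Compute impact and per-call mean impact for one inline A+B.
    perf_s0 / perf_s1: repeated measurements (e.g. instructions retired)
    for baseline S0 and the S1 = A+B configuration.
    calls_a: number of calls to A observed in the run.
    calls_b_s0 / calls_b_s1: calls to B observed in each scenario."""
    # Require that the inline had a chance to matter: A ran, B was
    # called in the baseline, and calls to B dropped once inlined.
    if calls_a == 0 or calls_b_s0 == 0 or calls_b_s1 >= calls_b_s0:
        return None
    impact = mean(perf_s1) - mean(perf_s0)   # impact(A+B) ~ perf(S1) - perf(S0)
    mean_impact = impact / calls_a           # meanImpact(A+B) = impact(A+B) / calls(A)
    return impact, mean_impact
```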
A plausible "driver" for a doing inline measurements would be
something like the following:
(1) run some baseline config, capture baseline inline trees.
(2) repeat in perf mode, gather perf data.
(3) attribute perf data to methods.
(4) pick hottest method, iterate on its subtrees, gathering per-inline impacts one by one.
(5) return that method to its default profile, pick another method, repeat.
(6) continue until we've run out of methods that execute enough to give interesting data.
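A minimal sketch of that control flow, with every harness hook left as
a hypothetical callable or value, might look like:

```python
def measurement_driver(run_config, baseline_inline_trees, hot_methods,
                       candidate_inlines, min_samples):
    """Skeleton of the per-inline measurement driver described above.
    All parameters are hypothetical stand-ins for the real harness:
      run_config(overrides) -> perf metric for a run with the given
        per-method inline overrides (empty dict = default policy),
      baseline_inline_trees -> inline trees captured from the baseline run,
      hot_methods -> [(method, samples)] sorted hottest-first,
      candidate_inlines(trees, method) -> inline subtrees to toggle,
      min_samples -> cutoff below which methods aren't interesting."""
    baseline = run_config({})                  # steps (1)-(3): baseline perf
    for method, samples in hot_methods:
        if samples < min_samples:              # step (6): stop when too cold
            break
        for subtree in candidate_inlines(baseline_inline_trees, method):
            # step (4): toggle exactly one inline subtree in this method
            perturbed = run_config({method: subtree})
            yield method, subtree, perturbed - baseline
        # step (5): only one method is perturbed per run, so the method
        # reverts to its default profile before we move to the next one.
```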
Alternatives
Another approach here is to avoid trying to model the actual magnitude
of the effects directly. We instead measure the performance of a whole
bunch of ensembles (generated randomly, say), and "score" each inline
in an ensemble based on the observed performance results. Aggregating
scores for an inline across ensembles then provides a synthetic
benefit metric for a given inline. That is, if the ensembles that
include A+B tend to have better performance, and ensembles that don't
include A+B tend to have worse performance, A+B will have a good
score, and then we can try and build predictors for the score, or set
some kind of threshold for a "good" score and build a classifier to
identify inlines that have sufficiently good scores.
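A sketch of that scoring scheme, assuming a hypothetical measure()
hook that builds and times a binary for a given ensemble of inlines,
might look like:

```python
import random
from statistics import mean

def score_inlines(candidates, measure, num_ensembles=200, seed=0):
    """Synthetic-benefit scoring across random ensembles.
    candidates: candidate inlines, e.g. ("A", "B") caller/callee pairs.
    measure(ensemble): hypothetical callable returning observed perf
    (lower is better) for a build with that set of inlines enabled.
    Score = mean perf of ensembles excluding the inline minus mean perf
    of ensembles including it, so a positive score means beneficial."""
    rng = random.Random(seed)
    with_perf = {c: [] for c in candidates}
    without_perf = {c: [] for c in candidates}
    for _ in range(num_ensembles):
        # Random ensemble: each candidate inline enabled with prob 1/2.
        ensemble = {c for c in candidates if rng.random() < 0.5}
        perf = measure(ensemble)
        for c in candidates:
            (with_perf if c in ensemble else without_perf)[c].append(perf)
    return {c: mean(without_perf[c]) - mean(with_perf[c])
            for c in candidates
            if with_perf[c] and without_perf[c]}
```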
category:cq
theme:inlining
skill-level:expert
cost:large