
Add Hierarchical Performance Testing (HPT) technique to compare_to? #168

@mdboom

Description


I recently came across a technique called Hierarchical Performance Testing (HPT) for distilling benchmark measurements into a single number while taking into account that some benchmarks are more consistent/reliable than others. There is an implementation (in bash!!!) for the PARSEC benchmark suite. I ported it to Python and ran it over the big Faster CPython data set.

The results are pretty useful -- for example, while a lot of the main specialization work in 3.11 has a reliability of 100%, some recent changes to the GC show a speed improvement but with lower reliability, reflecting the fact that GC changes involve more randomness (more moving parts and more interaction with whatever else the OS is doing). I think this reliability number, along with the more stable "expected speedup at the 99th percentile", is much more useful than the geometric mean for evaluating a change (especially a small one). I did not, however, see the massive 3.5x discrepancy between the 99th-percentile number and the geometric mean that the paper reports (on a different dataset).
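For context, here is a rough sketch of the kind of computation HPT does as I understand it from the paper: a rank-sum test per benchmark to decide whether a difference is significant at all, then a signed-rank test across benchmarks to get an overall reliability and a speedup-at-confidence figure. The `hpt_compare` helper, the 0.05 threshold, and the 0.01 search step below are placeholders of mine, not the exact PARSEC formulation:

```python
# Sketch only -- not the ported script, and details may differ from the paper.
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon

def hpt_compare(baseline, candidate, alpha=0.05):
    """baseline/candidate: dict mapping benchmark name -> list of run times."""
    speedups = []
    for name, base_times in baseline.items():
        cand_times = candidate[name]
        # Per-benchmark rank-sum test: is the difference significant at all?
        _, p = mannwhitneyu(base_times, cand_times, alternative="two-sided")
        if p < alpha:
            speedups.append(np.median(base_times) / np.median(cand_times))
        else:
            speedups.append(1.0)  # treat an insignificant difference as a tie

    # Cross-benchmark signed-rank test on log-speedups: the reliability is the
    # confidence that the candidate is faster overall.
    _, p = wilcoxon(np.log(speedups), alternative="greater")
    reliability = 1.0 - p

    def speedup_at(confidence, step=0.01):
        # Largest factor s such that, after handicapping the candidate by s,
        # it would still be judged faster with the requested confidence.
        s = 1.0
        while True:
            _, p = wilcoxon(np.log(speedups) - np.log(s + step),
                            alternative="greater")
            if 1.0 - p < confidence:
                return s
            s += step

    return reliability, speedup_at(0.99)
```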

Is there interest in adding this metric to the output of pyperf's compare_to command?
