# Benchmarking Guide

Standard steps for running micro-benchmarks, comparing revisions, and capturing quick profiles when submitting performance-related changes.

## TL;DR (60 seconds)

```bash
# Install benchstat once
go install golang.org/x/perf/cmd/benchstat@latest

# Baseline from main
git checkout main
go test ./... -bench=. -benchmem -run=^$ -count=10 -benchtime=10s > /tmp/base.txt

# Candidate from your branch
git checkout my-perf-branch
go test ./... -bench=. -benchmem -run=^$ -count=10 -benchtime=10s > /tmp/cand.txt

# Compare
benchstat /tmp/base.txt /tmp/cand.txt
```

Paste the `benchstat` table in your PR with Go/OS/CPU details and the flags you used.

---

## Purpose & scope

* Use **micro-benchmarks** to validate small performance changes (allocations, hot functions, handler paths).
* This guide covers **baseline vs candidate** comparisons and quick CPU/memory profiling.
* For end-to-end throughput/tail latency, complement with integration or load tests as appropriate.

## Reproducibility checklist

* Run **baseline and candidate on the same machine** with minimal background load.
* Pin concurrency: set `GOMAXPROCS`, often to your CPU count (see the example after this list).
* Use multiple repetitions (`-count`) and a fixed run time (`-benchtime`) to reduce variance.
* Do one warmup run before collecting results.
* Record **Go version, OS/CPU model**, and the exact flags you used.
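
A minimal sketch of pinning concurrency for a run (the Go runtime honors the `GOMAXPROCS` environment variable; `8` below is a placeholder for your core count):

```bash
# Pin the scheduler to 8 logical CPUs for this run only
GOMAXPROCS=8 go test ./path/to/pkg -bench=. -benchmem -run=^$ -count=10 -benchtime=10s
```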

Example env header to include in your PR:

```
go version: go1.22.x
OS/CPU: <your OS> / <CPU model>
GOMAXPROCS=<n>; flags: -count=10 -benchtime=10s
```
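
On Linux, something like the following collects most of these details (macOS users would swap in `sysctl -n machdep.cpu.brand_string` and `sysctl -n hw.logicalcpu`):

```bash
go version                            # Go toolchain version
uname -srm                            # kernel and architecture
grep -m1 'model name' /proc/cpuinfo   # CPU model (Linux)
nproc                                 # logical CPU count (Linux)
```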

## Running micro-benchmarks

Choose either **all benchmarkable packages** or a **specific scope**.

* All benchmarks (repo-wide):

  ```bash
  go test ./... -bench=. -benchmem -run=^$ -count=10 -benchtime=10s
  ```
* Specific package:

  ```bash
  go test ./path/to/pkg -bench=. -benchmem -run=^$ -count=15 -benchtime=5s
  ```
* Specific benchmark (regex):

  ```bash
  go test ./path/to/pkg -bench='^BenchmarkUnaryEcho$' -benchmem -run=^$ -count=20 -benchtime=1s
  ```

**Flag notes** (a minimal example benchmark follows the list)

* `-bench=.` runs all benchmarks in scope; narrow with a regex when needed.
* `-benchmem` reports `B/op` and `allocs/op`.
* `-run=^$` skips non-benchmark tests.
* `-count` repeats whole runs to stabilize results (10–20 is common).
* `-benchtime` sets per-benchmark run time; increase for noisy benches.
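
For orientation, here is the shape of a benchmark these flags operate on. The function and package names are hypothetical, not code from this repo:

```go
package pkg

import (
	"bytes"
	"testing"
)

// BenchmarkJoin is a hypothetical example; -bench matches on this name.
func BenchmarkJoin(b *testing.B) {
	parts := [][]byte{[]byte("hello"), []byte("world")}

	b.ReportAllocs() // surface B/op and allocs/op even without -benchmem
	b.ResetTimer()   // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		_ = bytes.Join(parts, []byte(", "))
	}
}
```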

## Baseline vs candidate with benchstat

1. **Baseline (main):**

   ```bash
   git checkout main
   go test ./... -bench=. -benchmem -run=^$ -count=10 -benchtime=10s > /tmp/base.txt
   ```
2. **Candidate (your branch):**

   ```bash
   git checkout my-perf-branch
   go test ./... -bench=. -benchmem -run=^$ -count=10 -benchtime=10s > /tmp/cand.txt
   ```
3. **Compare:**

   ```bash
   benchstat /tmp/base.txt /tmp/cand.txt
   ```

**Interpreting `benchstat`**

* Focus on `ns/op`, `B/op`, `allocs/op`.
* Negative **delta** = improvement.
* `p=` is the result of a statistical significance test (smaller is stronger); benchstat prints `~` when the difference is not significant.
* Call out **meaningful** wins (e.g., ≥5–10%) and explain why your change helps.

**Sample output (illustrative)**

```
name               old time/op    new time/op    delta
UnaryEcho/Small-8    12.3µs ± 2%    11.0µs ± 1%  -10.7%  (p=0.002 n=10+10)

name               old alloc/op   new alloc/op   delta
UnaryEcho/Small-8    1.46kB ± 1%    1.29kB ± 1%  -11.4%  (p=0.002 n=10+10)

name               old allocs/op  new allocs/op  delta
UnaryEcho/Small-8      12.0 ± 0%      11.0 ± 0%   -8.3%  (p=0.002 n=10+10)
```

## Quick profiling with pprof (optional)

When you need to see *why* a change moves performance:

```bash
# CPU profile for one benchmark
go test ./path/to/pkg -bench='^BenchmarkUnaryEcho$' -run=^$ -cpuprofile=cpu.out -benchtime=30s

# Memory profile (alloc space)
go test ./path/to/pkg -bench='^BenchmarkUnaryEcho$' -run=^$ -memprofile=mem.out -benchtime=30s
```

Inspect:

```bash
go tool pprof cpu.out   # interactive; try 'top', 'top -cum', 'web'
go tool pprof mem.out
```
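
Non-interactive invocations work too, and the web UI includes a flame graph (these are standard `pprof` flags):

```bash
go tool pprof -top cpu.out                       # print hottest functions and exit
go tool pprof -http=:8080 cpu.out                # browser UI with graph and flame graph views
go tool pprof -sample_index=alloc_space mem.out  # rank by total bytes allocated, not live heap
```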

Include a short note in your PR (e.g., "fewer copies on hot path; top symbol shifted from X to Y").

## Using helper scripts (if present)

If this repository provides helper scripts under `./benchmark` or `./scripts/` to run or capture benchmarks, you may use them to produce **raw outputs** for baseline and candidate with the **same flags**, then compare with `benchstat` as shown above.

Plain `go test -bench` commands are equally fine as long as you capture raw outputs and attach a `benchstat` diff.
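
As an illustration of that flow (the script name here is hypothetical; substitute whatever this repo actually ships):

```bash
# hypothetical helper script; check ./benchmark or ./scripts/ for the real one
git checkout main
./scripts/run_benchmarks.sh > /tmp/base.txt
git checkout my-perf-branch
./scripts/run_benchmarks.sh > /tmp/cand.txt
benchstat /tmp/base.txt /tmp/cand.txt
```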

## What to include in a performance PR

* A **benchstat** table comparing baseline vs candidate
* **Environment header**: Go version, OS/CPU, `GOMAXPROCS`
* **Flags** used: `-count`, `-benchtime`, any selectors
* (Optional) **pprof** highlights (top symbols or a flamegraph)
* One paragraph on *why* the change helps (evidence beats theory)

## Troubleshooting

* **High variance?** Increase `-count` or `-benchtime`, narrow the scope, and close background apps.
* **Network noise?** Prefer in-memory transports for micro-benchmarks (see the sketch after this list).
* **Different machines?** Don’t compare across hosts; run both sides on the same box.
* **Allocs improved but ns/op didn’t?** Still valuable: fewer allocations mean less GC pressure at scale.
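
A minimal sketch of the in-memory idea using `net.Pipe` from the standard library (the benchmark name and message sizes are illustrative):

```go
package pkg

import (
	"io"
	"net"
	"testing"
)

// BenchmarkPipeEcho measures a request/response round trip over an
// in-memory net.Pipe connection, keeping kernel networking out of the loop.
func BenchmarkPipeEcho(b *testing.B) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	// Echo server: write every payload straight back to the sender.
	go func() {
		buf := make([]byte, 1024)
		for {
			n, err := server.Read(buf)
			if err != nil {
				return
			}
			if _, err := server.Write(buf[:n]); err != nil {
				return
			}
		}
	}()

	msg := make([]byte, 512)
	reply := make([]byte, 512)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := client.Write(msg); err != nil {
			b.Fatal(err)
		}
		if _, err := io.ReadFull(client, reply); err != nil {
			b.Fatal(err)
		}
	}
}
```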

---

Maintainers: if you prefer different default `-count` / `-benchtime`, or want a `make benchmark` target that wraps these commands, this can be added in a follow-up PR.