Key Performance Metrics
Four families of metrics describe almost everything that matters under load: latency, throughput, errors and saturation. Most performance-testing mistakes trace back to measuring the wrong one — or summarising it wrongly.
Latency: percentiles or nothing
Response-time distributions are long-tailed, so the mean blends the bulk of fast requests with a few very slow ones and describes neither group well. A service can report a 120 ms average while 1% of users wait 8 seconds. We report:
| Percentile | Meaning | Use |
|---|---|---|
| p50 | Typical experience | Sanity baseline |
| p90/p95 | Unlucky-but-common experience | Primary SLO target |
| p99 | The tail users churn over | Secondary SLO; tail-health signal |
| max | Worst single request | Timeout and outlier investigation |
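
To make the gap between the mean and the percentiles concrete, here is a minimal sketch on synthetic data. The sample values, the `percentile` helper and the choice of the nearest-rank method are assumptions for illustration; load-testing tools may interpolate percentiles differently.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100 (illustrative helper)."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted data
    return ordered[rank - 1]

# Synthetic latencies in ms (assumed): 985 fast requests and 15 very slow ones.
latencies_ms = [40] * 985 + [8000] * 15

mean_ms = sum(latencies_ms) / len(latencies_ms)
summary = {
    "mean": mean_ms,
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
    "max": max(latencies_ms),
}
print(summary)
```

On this data the mean lands near 160 ms, a latency that no request actually experienced, while p50 and p95 sit at 40 ms and p99 exposes the 8-second tail.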
Tail latencies compound: a page that fans out to 10 backend calls waits on a slow backend far more often than 1% of the time. With 10 parallel calls, and assuming the backends' latencies are independent, about 10% of pages (1 − 0.99^10 ≈ 9.6%) see at least one response at or beyond that backend's p99. Tails are a fan-out problem, not an edge case.
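The fan-out arithmetic can be checked with a few lines. This is a sketch that assumes backend latencies are statistically independent, which real systems only approximate; the function name is illustrative.

```python
def p_at_least_one_slow(fan_out, quantile=0.99):
    """Probability that at least one of `fan_out` parallel calls is slower
    than the given latency quantile, assuming independent backends."""
    return 1 - quantile ** fan_out

for calls in (1, 10, 100):
    print(calls, round(p_at_least_one_slow(calls), 3))
```

Under the same independence assumption, a page that fans out to 100 calls hits at least one p99-or-worse response more often than not.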
Throughput: requests vs transactions
Requests/second is what tools report; business transactions/second is what stakeholders mean. A checkout might be 14 requests. We model and report both, with the mapping explicit. Throughput is only meaningful alongside latency: a headline figure such as 10,000 tps says little if you don't care how long responses take.
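The mapping between the two units is simple arithmetic, sketched below. The 14 requests per checkout comes from the example above; the 700 requests/second figure and the names are assumptions, and the conversion only holds when all measured traffic belongs to that one transaction type.

```python
# Requests per business transaction; 14 is the checkout example from the text.
REQUESTS_PER_TRANSACTION = {"checkout": 14}

def transactions_per_second(requests_per_second, requests_per_transaction):
    """Convert a tool-reported request rate into a business transaction rate."""
    return requests_per_second / requests_per_transaction

# Assumed measurement: 700 requests/second, all belonging to checkouts.
checkout_tps = transactions_per_second(700, REQUESTS_PER_TRANSACTION["checkout"])
print(checkout_tps)
```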
Errors: rate, type and honesty
Error rate under load is a first-class result, not a footnote. We classify by mechanism — timeouts vs connection refusals vs HTTP 5xx vs semantic failures (200-with-error-body) — because each implicates a different layer. Semantic failures are the silent killer: status-code-only checks under stress routinely miss them.
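One way to classify by mechanism is sketched below. The `Outcome` record, the category names, and the rule that a JSON body with a top-level `"error"` key marks a semantic failure are all assumptions for illustration; a real check has to match the API's actual error contract.

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Outcome:
    """Hypothetical record of one request, as a load generator might log it."""
    status: Optional[int] = None  # HTTP status, or None if no response arrived
    body: str = ""
    timed_out: bool = False
    refused: bool = False

def has_error_body(body: str) -> bool:
    """Assumed convention: a JSON object with a top-level "error" key."""
    try:
        parsed = json.loads(body)
    except ValueError:  # not JSON, so no error marker we can recognise
        return False
    return isinstance(parsed, dict) and "error" in parsed

def classify(o: Outcome) -> str:
    if o.timed_out:
        return "timeout"
    if o.refused:
        return "connection_refused"
    if o.status is not None and 500 <= o.status <= 599:
        return "http_5xx"
    if o.status == 200:
        # A status-code-only check would stop here and report success.
        return "semantic_failure" if has_error_body(o.body) else "ok"
    return "other"

outcomes = [
    Outcome(status=200, body='{"order_id": 7}'),
    Outcome(status=200, body='{"error": "inventory lock timeout"}'),
    Outcome(status=503),
    Outcome(timed_out=True),
    Outcome(refused=True),
]
counts = Counter(classify(o) for o in outcomes)
print(counts)
```

Note that two of the five sample outcomes carry status 200, but only one of them is a success.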
Saturation: the leading indicator
Latency and errors are symptoms; saturation is the cause. We track utilisation and queue depth on every constrained resource: CPU, memory/GC, connection pools, thread pools, disk and network I/O, database locks. Queue depth matters more than utilisation: a resource at 80% utilisation with a growing queue is in worse shape than one at 95% with none, because a growing queue means work is arriving faster than it is being served. (The mechanism: Little's Law and queueing theory.)
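Little's Law states that the average number of items in a stable system equals the arrival rate multiplied by the average time each item spends in it (L = λW). The sketch below applies it to a connection pool. The pool size, hold time and arrival rates are assumed numbers, and the backlog helper is plain flow balance rather than part of Little's Law.

```python
def mean_in_system(arrival_rate_per_s, mean_time_in_system_s):
    """Little's Law: L = lambda * W, valid for a stable system."""
    return arrival_rate_per_s * mean_time_in_system_s

def backlog_growth_per_s(arrival_rate_per_s, service_rate_per_s):
    """Flow balance: rate at which the queue grows once arrivals exceed capacity."""
    return max(0.0, arrival_rate_per_s - service_rate_per_s)

# Assumed pool: 10 connections, each held for 50 ms per request.
pool_size = 10
hold_time_s = 0.05
capacity_per_s = pool_size / hold_time_s  # about 200 requests/second

# At 160 requests/second the pool is 80% utilised and stable.
busy_connections = mean_in_system(160, hold_time_s)  # about 8 of 10 in use

# At 220 requests/second arrivals exceed capacity and the queue grows without bound.
growth = backlog_growth_per_s(220, capacity_per_s)  # about 20 requests/second

print(busy_connections, capacity_per_s, growth)
```

Once arrivals exceed capacity, utilisation is pinned at 100% and carries no further information, while queue depth keeps climbing and shows how far over capacity the resource is.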