Performance Testing in CI/CD
Annual big-bang performance tests find regressions months after the commit that caused them. Pipeline-integrated testing finds them at merge time, while the diff is one screen of code.
The test pyramid, for performance
| Tier | Trigger | Duration | Scope |
|---|---|---|---|
| Smoke | Every merge | 2–5 min | Critical endpoints, modest fixed load, threshold-gated |
| Baseline | Nightly | 30–60 min | Core workload model at steady load, trend-tracked |
| Full | Per release / scheduled | Hours | Complete workload model, stress & soak variants |
The smoke tier is the highest-value, lowest-cost addition most teams can make: a 3-minute k6 job with a threshold such as p(95)<500 (95th-percentile latency under 500 ms) fails the build when the threshold is breached and catches the worst regressions at almost no cost.
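To make the gating logic concrete, here is a minimal Python sketch of what a p(95)<500 threshold evaluates. It is an illustration only: k6 declares thresholds in its own script options and computes percentiles itself, and the function names and sample latencies below are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples to evaluate")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def smoke_gate(latencies_ms, p95_limit_ms=500):
    """Return (passed, observed_p95). The 500 ms default mirrors the
    example threshold in the text; it is not a recommended value."""
    observed = percentile(latencies_ms, 95)
    return observed < p95_limit_ms, observed


# Hypothetical request latencies in milliseconds from one smoke run.
run = [120, 135, 150, 180, 210, 240, 260, 300, 420, 650]
passed, p95 = smoke_gate(run)
print(f"p95={p95} ms, gate {'passed' if passed else 'FAILED'}")
```

In a pipeline, the job would exit with a non-zero status when the gate fails, which is what stops the merge.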
Budgets as code
Performance budgets live in the repository next to the code they constrain — reviewed, versioned and enforced like any other test. A budget change is a visible, deliberate decision in a pull request, not a silent drift.
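One possible shape for a budget kept in the repository is a small data file plus a check that runs in the pipeline. The sketch below assumes a JSON budget keyed by endpoint; the endpoint names, metric names and limits are hypothetical, and the budget is inlined as a string only to keep the example self-contained.

```python
import json

# Hypothetical budget file content (for example perf-budgets.json in the repo).
BUDGETS_JSON = """
{
  "GET /checkout": {"p95_ms": 500, "error_rate": 0.01},
  "GET /search":   {"p95_ms": 300, "error_rate": 0.01}
}
"""


def check_budgets(budgets, measured):
    """Compare measured metrics with budgets and return a list of
    human-readable violations; an empty list means every budget held."""
    violations = []
    for endpoint, limits in budgets.items():
        results = measured.get(endpoint)
        if results is None:
            violations.append(f"{endpoint}: no measurement")
            continue
        for metric, limit in limits.items():
            value = results.get(metric)
            if value is None:
                violations.append(f"{endpoint}: {metric} not measured")
            elif value > limit:
                violations.append(
                    f"{endpoint}: {metric}={value} exceeds budget {limit}"
                )
    return violations


budgets = json.loads(BUDGETS_JSON)
# Hypothetical results from a test run.
measured = {
    "GET /checkout": {"p95_ms": 430, "error_rate": 0.002},
    "GET /search": {"p95_ms": 340, "error_rate": 0.001},
}
for line in check_budgets(budgets, measured):
    print(line)
```

Because the limits sit in a tracked file, raising the search budget from 300 to 350 shows up as a one-line diff that a reviewer has to approve.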
Making pipeline results trustworthy
Stable environments: noisy shared environments produce flaky gates that teams learn to ignore. Dedicated (if modest) performance environments, or at minimum consistent container resources, are a precondition.
Relative comparison: nightly tiers compare against a rolling baseline of recent runs rather than absolute targets, flagging statistically significant drift. This tolerates environment differences while still catching regressions.
Trend dashboards: per-build latency and throughput trends make slow degradation visible across weeks, the kind no single gate catches.
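The relative comparison can be sketched as a check of the current run against the mean and spread of recent runs. This is a deliberately simple heuristic rather than a formal significance test; the window size, the z-score limit and the sample values are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def drift_check(history, current, window=10, z_limit=3.0):
    """Flag `current` as a regression when it sits more than `z_limit`
    sample standard deviations above the mean of the last `window` runs."""
    recent = history[-window:]
    if len(recent) < 3:
        # Too few runs to estimate spread; do not gate on noise.
        return {"regressed": False, "reason": "insufficient history"}
    baseline = mean(recent)
    spread = stdev(recent)
    if spread == 0:
        # Identical history: any increase counts as drift in this sketch.
        z = float("inf") if current > baseline else 0.0
    else:
        z = (current - baseline) / spread
    return {"regressed": z > z_limit, "baseline": baseline, "z": z}


# Hypothetical nightly p95 latencies in milliseconds, oldest first.
nightly_p95 = [410, 405, 398, 415, 402, 408, 400, 412]
print(drift_check(nightly_p95, 470))  # well above the recent range
print(drift_check(nightly_p95, 412))  # within normal variation
```

Only upward drift is flagged here because the example metric is latency; a throughput check would look for drops instead.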
What pipelines can't replace
CI-scale tests run at reduced load on reduced environments: they catch regressions superbly and predict absolute capacity poorly. Go-live decisions, peak-event readiness and scaling validation still require full-scale engagements against production-parity environments — the two practices complement rather than substitute. We help teams build both.