Common Bottlenecks: A Field Guide
After enough engagements, bottlenecks become recognisable species. Here are the ones we encounter most, with the signatures that identify them.
Connection pool exhaustion
Signature: throughput plateaus at a suspiciously round number; latency climbs
sharply while database CPU stays low; threads pile up waiting on pool acquisition.
Mechanism: the pool (DB, HTTP client, Redis) is smaller than concurrency
demands; requests queue for connections, not for work.
Verification: pool wait-time metrics, thread dumps showing acquisition waits.
Fix pattern: rarely just "make it bigger". Size the pool from Little's Law
(connections in use = arrival rate × mean connection hold time), then confirm
the backend has headroom for the larger pool.
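As a rough illustration of sizing from Little's Law, the sketch below turns an arrival rate and a mean connection hold time into a starting pool size. The function name, the `headroom` safety factor, and the example numbers are our own assumptions, not measured values; treat the result as a first estimate to validate under load, not a final setting.

```python
import math


def pool_size_from_littles_law(arrival_rate_per_s: float,
                               mean_hold_time_s: float,
                               headroom: float = 1.25) -> int:
    """Estimate a pool size via Little's Law: L = lambda * W.

    arrival_rate_per_s: requests per second that need a connection (lambda).
    mean_hold_time_s:   mean time a request holds a connection (W).
    headroom:           assumed safety factor for bursts (hypothetical default).
    """
    if arrival_rate_per_s < 0 or mean_hold_time_s < 0 or headroom < 1:
        raise ValueError("rates and times must be >= 0, headroom must be >= 1")
    # Average number of connections in use at steady state.
    in_use = arrival_rate_per_s * mean_hold_time_s
    return max(1, math.ceil(in_use * headroom))


# Example: 80 requests/s, each holding a connection for 250 ms,
# keeps about 20 connections busy on average.
suggested = pool_size_from_littles_law(80, 0.25)
print(suggested)
```

The estimate covers only the client side; the backend still has to be checked for the capacity to serve that many concurrent connections.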
Garbage collection pressure
Signature: periodic latency spikes with clean resource graphs between them;
p99 far worse than p95; spike cadence correlates with GC logs.
Mechanism: allocation rate outruns the collector, forcing stop-the-world pauses,
or heap sizing forces frequent full GCs.
Verification: GC logs time-aligned with latency spikes; heap-after-GC trend.
Under soak testing, a rising heap-after-GC floor is the classic leak signature.
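One way to make the heap-after-GC trend concrete is to fit a straight line through the post-collection heap sizes from a soak run. The sketch below assumes the samples have already been extracted from GC logs as (elapsed seconds, heap-after-GC in MB) pairs; the function name and sample values are hypothetical, and how steep a slope counts as a leak is a judgment call for each system.

```python
def heap_floor_slope(samples):
    """Least-squares slope of heap-after-GC over time, in MB per second.

    samples: sequence of (elapsed_seconds, heap_after_gc_mb) pairs,
             assumed to be parsed from GC logs beforehand.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    covariance = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one point in time")
    return covariance / variance


# Illustrative (made-up) samples: the floor rises 10 MB every minute.
leaking = [(0, 100.0), (60, 110.0), (120, 120.0), (180, 130.0)]
steady = [(0, 100.0), (60, 100.0), (120, 100.0), (180, 100.0)]
print(heap_floor_slope(leaking) * 3600, "MB/hour")
print(heap_floor_slope(steady) * 3600, "MB/hour")
```

A slope that stays positive over hours of steady load is worth a heap dump; a flat slope with periodic spikes points back at allocation rate or heap sizing instead.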
N+1 queries and chatty data access
Signature: response time scales with result-set size; database query
count per transaction is large; each query is individually fast.
Mechanism: ORM lazy-loading issues one query per row instead of a join or batch.
Invisible in functional tests with 5-row fixtures and devastating against 500-row production data,
which is one reason we insist on production-scale test data.
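The query-count signature is easy to reproduce in miniature. The sketch below uses Python's built-in `sqlite3` module with a hypothetical authors/books schema and a small counting wrapper (all names are ours) to compare the per-row access pattern with a single join.

```python
import sqlite3


class CountingDb:
    """Wraps an in-memory SQLite connection and counts queries issued."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.queries = 0

    def query(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def make_db(n_authors=500):
    """Build a hypothetical schema with one book per author."""
    db = CountingDb()
    db.conn.executescript(
        "CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);"
        "CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);"
    )
    db.conn.executemany(
        "INSERT INTO authors (id, name) VALUES (?, ?)",
        [(i, f"author-{i}") for i in range(n_authors)],
    )
    db.conn.executemany(
        "INSERT INTO books (author_id, title) VALUES (?, ?)",
        [(i, f"book-{i}") for i in range(n_authors)],
    )
    return db


def titles_n_plus_one(db):
    """One query for the authors, then one more per author (the N+1 pattern)."""
    result = {}
    for author_id, name in db.query("SELECT id, name FROM authors"):
        rows = db.query("SELECT title FROM books WHERE author_id = ?", (author_id,))
        result[name] = [title for (title,) in rows]
    return result


def titles_joined(db):
    """The same data fetched with a single join."""
    result = {}
    sql = "SELECT a.name, b.title FROM authors a JOIN books b ON b.author_id = a.id"
    for name, title in db.query(sql):
        result.setdefault(name, []).append(title)
    return result


slow, fast = make_db(), make_db()
titles_n_plus_one(slow)
titles_joined(fast)
print(slow.queries, fast.queries)
```

With 5 authors the difference is 6 queries against 1 and nobody notices; with 500 it is 501 against 1, and each of those round trips pays network latency in a real deployment.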
Lock contention
Signature: throughput stops scaling while CPU stays low everywhere;
latency variance increases with concurrency; database lock-wait or app monitor-wait metrics climb.
Mechanism: a serialised section — a hot row, a table lock, a synchronised
block, a single-threaded event loop stage — caps the whole system (Amdahl's law in the wild).
Verification: lock-wait events, thread dumps showing convoys behind one monitor.
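The Amdahl's law ceiling can be put in numbers. The sketch below computes the ideal speedup for a given serialised fraction and worker count; the 10% serial fraction is an illustrative assumption, not a measurement, and real systems usually do worse than this ideal because contention adds its own overhead.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Ideal speedup under Amdahl's law: 1 / (s + (1 - s) / n)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def amdahl_ceiling(serial_fraction: float) -> float:
    """Upper bound on speedup as the worker count grows without limit."""
    if not 0.0 < serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in (0, 1]")
    return 1.0 / serial_fraction


# Assume 10% of each request runs inside the serialised section.
for n in (1, 2, 8, 64):
    print(n, round(amdahl_speedup(0.10, n), 2))
print("ceiling:", amdahl_ceiling(0.10))
```

The curve flattening well before the hardware is busy is exactly the "throughput stops scaling while CPU stays low" signature.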
Missing indexes at scale
Signature: queries fast in dev, slow under test; database CPU and I/O high;
plans show sequential scans.
Mechanism: the optimiser switches plans as tables grow; a scan acceptable at
10k rows is lethal at 50M. Another argument for production-volume test data.
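Checking the plan before and after adding an index is quick to script. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in for the production database, with a hypothetical `orders` table and index name; other engines expose the same information through their own `EXPLAIN` variants, and their output will read differently.

```python
import sqlite3


def plan_detail(conn, sql, params=()):
    """Return the human-readable plan steps from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the plan description.
    return " | ".join(str(row[3]) for row in rows)


def compare_plans():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        ((i % 1000, float(i)) for i in range(10_000)),
    )
    sql = "SELECT * FROM orders WHERE customer_id = ?"

    # Without an index the only option is to read the whole table.
    before = plan_detail(conn, sql, (42,))

    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

    # With the index the planner can seek directly to matching rows.
    after = plan_detail(conn, sql, (42,))
    return before, after


before, after = compare_plans()
print("before:", before)
print("after: ", after)
```

Running the same check against a production-volume copy matters because the planner's choice depends on table size and statistics, not only on which indexes exist.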
Retry storms and synchronised clients
Signature: a brief blip triggers a sustained outage; load measured at the
server exceeds load offered by the test; recovery requires intervention.
Mechanism: aggressive client retries amplify failures — each slow response
spawns retries that add load to an already saturated system.
Fix pattern: retry budgets, exponential backoff with jitter, circuit breakers
— then verify under stress test conditions.
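A minimal sketch of two of those fixes, exponential backoff with full jitter and a retry budget, follows. The class and function names, the default ratio, and the delay parameters are illustrative assumptions; a circuit breaker is left out for brevity, and a production client would normally catch only errors known to be retryable.

```python
import random
import time


def backoff_delay(attempt, base_s=0.1, cap_s=10.0):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0.0, ceiling)


class RetryBudget:
    """Permit retries only up to a fixed fraction of the requests seen."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        """Return True and count a retry if the budget allows one more."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def call_with_retries(operation, budget, max_attempts=4, sleep=time.sleep):
    """Call operation(), retrying with jittered backoff while budget remains."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # sketch only: narrow this to retryable errors
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not budget.try_spend():
                raise
            sleep(backoff_delay(attempt))
```

The jitter spreads retries from synchronised clients over time, while the budget bounds total retry load: once failures are widespread, most calls fail fast instead of piling extra traffic onto a saturated backend.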