Common Bottlenecks: A Field Guide

After enough engagements, bottlenecks become recognisable species. Here are the ones we encounter most, with the signatures that identify them.

Connection pool exhaustion

Signature: throughput plateaus at a suspiciously round number, typically close to pool size divided by mean connection hold time; latency climbs sharply while database CPU stays low; threads pile up waiting on pool acquisition.
Mechanism: the pool (DB, HTTP client, Redis) is smaller than concurrency demands; requests queue for connections, not for work.
Verification: pool wait-time metrics, thread dumps showing acquisition waits.
Fix pattern: rarely just "make it bigger". Size the pool from Little's Law (connections in use = throughput × mean hold time) and confirm the backend has headroom for the larger pool.
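
The Little's Law sizing can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function name `pool_size`, the 20% headroom default, and the example figures are all assumptions, and the inputs should come from measured throughput and measured connection hold time.

```python
import math

def pool_size(throughput_per_s: float, hold_time_s: float, headroom: float = 1.2) -> int:
    """Estimate connections needed via Little's Law: L = lambda * W.

    throughput_per_s: requests per second that need a connection (lambda)
    hold_time_s:      mean time a request holds a connection (W)
    headroom:         multiplier for bursts (assumed value, tune per system)
    """
    if throughput_per_s < 0 or hold_time_s < 0:
        raise ValueError("throughput and hold time must be non-negative")
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")
    in_use = throughput_per_s * hold_time_s
    # Round before ceil so float noise (e.g. 24.000000001) does not add a connection.
    return max(1, math.ceil(round(in_use * headroom, 9)))

# Hypothetical figures: 400 req/s, each holding a connection for 50 ms.
suggested = pool_size(400, 0.05)
```

The same arithmetic run backwards explains the plateau: a pool of 20 connections held for 50 ms each cannot serve more than 20 / 0.05 = 400 requests per second, whatever the backend could otherwise do.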

Garbage collection pressure

Signature: periodic latency spikes with clean resource graphs between them; p99 far worse than p95; spike cadence correlates with GC logs.
Mechanism: allocation rate outruns the collector, forcing stop-the-world pauses, or heap sizing forces frequent full GCs.
Verification: GC logs time-aligned with latency spikes; heap-after-GC trend. Under soak testing, a rising heap-after-GC floor is the classic leak signature.
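
One way to make the heap-after-GC trend concrete is to fit a straight line through the post-collection heap sizes from a soak run. The sketch below assumes the samples have already been extracted from GC logs as (hours elapsed, heap-after-GC in MB) pairs; the function name and the sample values are hypothetical.

```python
def heap_floor_slope(samples):
    """Least-squares slope of heap-after-GC over time, in MB per hour.

    samples: list of (hours_elapsed, heap_after_gc_mb) tuples.
    A slope that stays clearly positive over a long soak suggests a leak;
    a slope near zero suggests a stable floor.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    covariance = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one point in time")
    return covariance / variance

# Hypothetical soak-test samples: the floor rises about 10 MB per hour.
leaky = [(0, 100), (1, 110), (2, 120), (3, 130)]
stable = [(0, 100), (1, 100), (2, 100), (3, 100)]
```

Fitting only the values recorded after full collections, where the logs distinguish them, gives a cleaner floor than mixing in young-generation collections.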

N+1 queries and chatty data access

Signature: response time scales with result-set size; database query count per transaction is large; each query is individually fast.
Mechanism: ORM lazy-loading issues one query per row instead of a join or batch. Invisible in functional tests with 5-row fixtures; devastating against 500-row production data — one reason we insist on production-scale test data.
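
The difference in query count is easy to reproduce without an ORM. The sketch below uses an in-memory SQLite database as a stand-in; the `orders` and `items` tables, the `CountingConn` wrapper, and both loader functions are hypothetical names invented for the illustration.

```python
import sqlite3

def build_db(n_orders):
    """Create an in-memory database with one item per order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("CREATE TABLE items (order_id INTEGER, sku TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(i, f"c{i}") for i in range(n_orders)])
    conn.executemany("INSERT INTO items VALUES (?, ?)",
                     [(i, f"sku{i}") for i in range(n_orders)])
    return conn

class CountingConn:
    """Thin wrapper that counts how many queries are issued."""
    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def query(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()

def load_n_plus_one(db):
    """One query for the parent rows, then one more per row (the lazy-load pattern)."""
    orders = db.query("SELECT id, customer FROM orders")
    return {
        oid: [sku for (sku,) in db.query(
            "SELECT sku FROM items WHERE order_id = ?", (oid,))]
        for oid, _ in orders
    }

def load_joined(db):
    """The same data fetched with a single join."""
    rows = db.query(
        "SELECT o.id, i.sku FROM orders o LEFT JOIN items i ON i.order_id = o.id")
    result = {}
    for oid, sku in rows:
        result.setdefault(oid, [])
        if sku is not None:
            result[oid].append(sku)
    return result
```

Against 500 orders the first loader issues 501 queries and the second issues one. Each of the 501 is fast, which is why the problem hides behind healthy per-query timings.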

Lock contention

Signature: throughput stops scaling while CPU stays low everywhere; latency variance increases with concurrency; database lock-wait or app monitor-wait metrics climb.
Mechanism: a serialised section — a hot row, a table lock, a synchronised block, a single-threaded event loop stage — caps the whole system (Amdahl's law in the wild).
Verification: lock-wait events, thread dumps showing convoys behind one monitor.
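
Amdahl's law puts a number on the cap: with a serial fraction s and n workers, speedup is 1 / (s + (1 - s) / n), which can never exceed 1 / s. A minimal sketch, with function names chosen for this illustration:

```python
def amdahl_speedup(serial_fraction, workers):
    """Speedup over one worker when serial_fraction of the work cannot run in parallel."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

def speedup_ceiling(serial_fraction):
    """Upper bound on speedup as the worker count grows without limit."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return float("inf") if serial_fraction == 0 else 1.0 / serial_fraction
```

If 10% of each request is spent inside the serialised section, ten workers yield a speedup of roughly 5.3 rather than 10, and no amount of added concurrency gets past 10.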

Missing indexes at scale

Signature: queries fast in dev, slow under test; database CPU and I/O high; plans show sequential scans.
Mechanism: the optimiser switches plans as tables grow; a scan acceptable at 10k rows is lethal at 50M. Another argument for production-volume test data.
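
Checking the plan is the quickest confirmation. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` purely as a stand-in, since every database has its own plan syntax and its own cost model; the `events` table, the index name, and the `query_plan` helper are hypothetical.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of each plan row.
    return [row[-1] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(i % 1000, "x") for i in range(10_000)])

sql = "SELECT * FROM events WHERE user_id = ?"

# Without an index on user_id, the plan reports a full scan of the table.
plan_before = query_plan(conn, sql, (42,))

conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

# With the index in place, the plan reports an index search instead.
plan_after = query_plan(conn, sql, (42,))
```

The exact wording of the plan text varies between SQLite versions, so any automated check is safer matching on the index name or the word SCAN than on the full string.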

Retry storms and synchronised clients

Signature: a brief blip triggers a sustained outage; load measured at the server exceeds load offered by the test; recovery requires intervention.
Mechanism: aggressive client retries amplify failures — each slow response spawns retries that add load to an already saturated system.
Fix pattern: retry budgets, exponential backoff with jitter, circuit breakers — then verify under stress test conditions.
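
As an illustration of one element of the fix pattern, here is a minimal sketch of exponential backoff with full jitter. Retry budgets and circuit breakers are not shown. The function names, the base delay, the cap, and the attempt limit are assumed values for the example, not recommendations.

```python
import random
import time

def backoff_delay(attempt, base_s=0.1, cap_s=10.0, rng=random):
    """Full-jitter delay: uniform between 0 and min(cap_s, base_s * 2**attempt).

    Randomising the whole interval spreads retries out, so clients that
    failed together do not all come back at the same instant.
    """
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))

def call_with_retries(op, max_attempts=4, sleep=time.sleep, rng=random):
    """Call op(), retrying on any exception with jittered exponential backoff.

    The final failure is re-raised so callers still see the error.
    sleep and rng are injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

A hard cap on attempts bounds the amplification per request: with `max_attempts=4`, one failing call can generate at most four times its original load. A real client would also retry only on errors known to be transient rather than on every exception.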