5/4 DB incident series
- When a cardinality=1 single-column index beats the compound index
- DB pool exhaustion cascade: do not mistake victim traces for causes
- Latent slow query incident: why “why today?” may have no clean answer
- 124 index anti-patterns found, 19 dropped — why the rest stayed
- JPA @Index is not prod DB index — 5 Entity-DB drift patterns
Introduction
When “Apparent connection leak detected” warnings fire from eight services at once, the intuitive reaction is: “we have eight leaks.” In a single-pool cascade, that intuition is often wrong. One path can be the cause; the rest can be victims.
This post describes how we separated cause from victim during the incident review.
Background / Problem
At the P0 moment, leak warnings appeared across eight service paths:
| Source | Warning count |
|---|---|
| User progress lookup | 20 |
| Catalog tree lazy load | 11 |
| Virtual task aggregation | 6 |
| Integrated status aggregation | 4 |
| User list lookup | 3 |
| Calendar schedule | 2 |
| Auth login / refresh | 4 |
| Notification scheduler | 1 |
The first instinct was to audit transaction leaks in every domain. That would have turned into several days of unnecessary work.
Mechanism
HikariCP's leakDetectionThreshold defaults to 0, which disables the check. Our setting was 30 seconds.
The warning means: a borrowed connection has not been returned within the threshold. It does not necessarily mean the code forgot to close a connection.
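For reference, a minimal sketch of setting the threshold on HikariCP directly (the JDBC URL and pool size below are placeholders, not our production values; Spring Boot users can set the equivalent spring.datasource.hikari.leak-detection-threshold property):

```java
// Minimal HikariCP setup sketch; only leakDetectionThreshold matters here.
// setLeakDetectionThreshold takes milliseconds; 0 (the default) disables the check.
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolConfig {
    public static HikariDataSource dataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://db-host:3306/app"); // placeholder URL
        config.setMaximumPoolSize(20);                      // illustrative size
        config.setLeakDetectionThreshold(30_000);           // 30 s, as in this incident
        return new HikariDataSource(config);
    }
}
```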
In this incident:
- Several concurrent executions of the same slow path held connections for too long.
- The pool became saturated.
- Other requests waited for connections or ran under heavy DB CPU pressure.
- Normal queries also crossed the 30-second threshold.
- Those normal paths emitted their own stack traces as leak warnings.
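During the saturated window, the pool's own counters tell this story more directly than the leak warnings: active connections pinned at the maximum while pending callers climb. A hedged sketch of sampling them through HikariCP's MXBean (the data-source variable is an assumption; these are roughly the values we tracked as pool_active / pool_pending / pool_idle during review):

```java
// Snapshot of HikariCP pool state, useful while a saturation event is in progress.
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;

public class PoolSnapshot {
    public static void log(HikariDataSource ds) {
        HikariPoolMXBean pool = ds.getHikariPoolMXBean();
        System.out.printf("active=%d idle=%d total=%d pending=%d%n",
                pool.getActiveConnections(),          // borrowed and not yet returned
                pool.getIdleConnections(),            // sitting unused in the pool
                pool.getTotalConnections(),           // active + idle
                pool.getThreadsAwaitingConnection()); // callers blocked on getConnection()
    }
}
```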
Cause / Victim Rule
For each warning source:
```sql
EXPLAIN [that query];
-- then measure standalone runtime on prod-like data
```
- Fast alone: victim of the cascade.
- Slow alone: possible independent cause.
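The standalone check itself does not need tooling; timing the suspect query through plain JDBC on a quiet, prod-like replica is enough (a sketch; the data source and SQL are placeholders, not the actual incident query):

```java
// Times a single query in isolation, with no pool contention in the way.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public class StandaloneCheck {
    public static long timeQueryMillis(DataSource ds, String sql) throws Exception {
        long start = System.nanoTime();
        try (Connection conn = ds.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // drain the result set so the query runs to completion
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```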
The result was simple: one slow query was the cause; the other seven paths were victims. Fixing that single index problem cleared the leak warnings across every path the next day, even under higher traffic.
Lessons
Distributed signals do not imply distributed causes
The same incident can generate many stack traces. Treat them as evidence, not as separate tickets by default.
Validate standalone behavior first
Before starting code changes across seven domains, measure each query on its own. Thirty minutes of diagnosis can save days of unnecessary cleanup.
The word “leak” is misleading
Read the warning as “borrow time exceeded threshold.” It can still reveal a real leak, but it can also reveal slow work under a saturated pool.
Next
After root cause and cascade structure were clear, the next question was: why did it happen today? The next post covers latent incidents where that question has no clean trigger answer.
References
- HikariCP: leakDetectionThreshold
- Useful pool metrics during review: pool_active, pool_pending, pool_idle
- Resilience4j circuit breakers as a first containment layer during pool exhaustion