Case study · Incident retro

Cache Stampede During Deploy — 14 minutes to all-clear

A routine deploy bumped a cache key prefix. Every request became a miss, the backend pool saturated, p99 latency ran from 80ms to 2,400ms. We had 15 minutes of error budget left for the quarter. Here's the play-by-play.

SEV-2 Role · SRE on-call Stack · k8s · Redis · Go · Prometheus Duration · 14 min

The metric — p99 latency vs cache hit rate

Press Replay to scrub the 14 minutes. Two signals: p99 latency (rose) overlaid with cache hit rate (cyan). The story is in the gap between them.

incident-2024-cache-stampede.replay

p99 latency (ms) cache hit rate (%) T+00:00

Timeline

T+00:00
Deploy goes live
Service rev v2.4.1 ships. Cache key prefix was bumped as part of the schema change.
T+00:12
p99 jumps 80ms → 2,400ms
Every request becomes a cache miss. Backend pool starts saturating.
T+00:45
Cache hit rate 94% → 12%
Coalescing isn't in place — N concurrent misses → N upstream calls per key.
T+02:00
PagerDuty: HighErrorBudgetBurn
Burn rate alert fires. I'm on the bridge with the deploy lead inside 90s.
T+03:30
Root cause confirmed
Key prefix bump in the deploy config invalidated the entire cache namespace.
T+05:00
Mitigation rolled
Pushed singleflight coalescing + a synchronous warm-up step on the canary pod.
T+07:00
Hit rate climbing through 60%
Coalescing is doing its job — one backend hit per key per concurrent burst.
T+11:00
p99 back under 200ms
Below SLO. Auth, search and feed dashboards all green.
T+14:00
All-clear, runbook updated
Cache-prefix changes added to the migrations checklist; coalescing standardized in service template.

What I did

Shipped singleflight request coalescing in the cache client — one upstream call per key per concurrent burst, not N.
Added a synchronous cache warm-up gate to the deploy pipeline whenever a key prefix changes; canary pods don't take traffic until hit rate ≥ 70%.
Lowered the error-budget burn alert from 5m to 2m — earlier paging on this exact failure class.
Wrote and dry-ran the runbook: prefix-bump diagnosis in under 60 seconds.

What I learned

Cache prefix changes are deploys disguised as config. Treat them like DB migrations: pre-warm, blue-green, and verify hit rate before draining the old fleet.

Coalescing is cheaper than scaling. A 1-line singleflight wrapper beat 8x of backend capacity at a fraction of the cost.

Synthetic monitoring should include cache hit rate, not just latency. Latency tells you something's wrong; hit rate tells you what.

Outcomes

14 min Total recovery time SLO target: <15 min

0 Customer-facing errors Graceful degrade held

−60% Similar incidents (6mo) After coalescing rollout

99.97% Quarterly availability Within error budget