Grid Powering the dark
  1. WebGPU adapter
  2. Allocating renderer
  3. Compiling shaders
  4. Laying out grid
  5. Calibrating spotlight
  6. Frame 0
Probing WebGPU… 0%
← Back to portfolio

Case study · Incident retro

Cache Stampede During Deploy — 14 minutes to all-clear

A routine deploy bumped a cache key prefix. Every request became a miss, the backend pool saturated, p99 latency ran from 80ms to 2,400ms. We had 15 minutes of error budget left for the quarter. Here's the play-by-play.

SEV-2 Role · SRE on-call Stack · k8s · Redis · Go · Prometheus Duration · 14 min

The metric — p99 latency vs cache hit rate

Press Replay to scrub the 14 minutes. Two signals: p99 latency (rose) overlaid with cache hit rate (cyan). The story is in the gap between them.

incident-2024-cache-stampede.replay
p99 latency (ms) cache hit rate (%) T+00:00

Timeline

  1. T+00:00
    Deploy goes live

    Service rev v2.4.1 ships. Cache key prefix was bumped as part of the schema change.

  2. T+00:12
    p99 jumps 80ms → 2,400ms

    Every request becomes a cache miss. Backend pool starts saturating.

  3. T+00:45
    Cache hit rate 94% → 12%

    Coalescing isn't in place — N concurrent misses → N upstream calls per key.

  4. T+02:00
    PagerDuty: HighErrorBudgetBurn

    Burn rate alert fires. I'm on the bridge with the deploy lead inside 90s.

  5. T+03:30
    Root cause confirmed

    Key prefix bump in the deploy config invalidated the entire cache namespace.

  6. T+05:00
    Mitigation rolled

    Pushed singleflight coalescing + a synchronous warm-up step on the canary pod.

  7. T+07:00
    Hit rate climbing through 60%

    Coalescing is doing its job — one backend hit per key per concurrent burst.

  8. T+11:00
    p99 back under 200ms

    Below SLO. Auth, search and feed dashboards all green.

  9. T+14:00
    All-clear, runbook updated

    Cache-prefix changes added to the migrations checklist; coalescing standardized in service template.

What I did

  • Shipped singleflight request coalescing in the cache client — one upstream call per key per concurrent burst, not N.
  • Added a synchronous cache warm-up gate to the deploy pipeline whenever a key prefix changes; canary pods don't take traffic until hit rate ≥ 70%.
  • Lowered the error-budget burn alert from 5m to 2m — earlier paging on this exact failure class.
  • Wrote and dry-ran the runbook: prefix-bump diagnosis in under 60 seconds.

What I learned

Cache prefix changes are deploys disguised as config. Treat them like DB migrations: pre-warm, blue-green, and verify hit rate before draining the old fleet.

Coalescing is cheaper than scaling. A 1-line singleflight wrapper beat 8x of backend capacity at a fraction of the cost.

Synthetic monitoring should include cache hit rate, not just latency. Latency tells you something's wrong; hit rate tells you what.

Outcomes

14 min Total recovery time SLO target: <15 min
0 Customer-facing errors Graceful degrade held
−60% Similar incidents (6mo) After coalescing rollout
99.97% Quarterly availability Within error budget
Built mono-first @svemulapati · WebGPU · TSL