Modeling Cache Stampedes With A Two-Delay Feedback Loop In Python
Written by
Elena Holos
The tiny production fire I wanted to understand
A while back I chased a weird incident: response times would suddenly spike, then slowly recover—but the system never fully “calmed down” the way my intuition expected. The strangest part was the pattern: it looked like one bad request triggered a cascade, and the cascade took its time.
What I built to understand it was a small simulation that focuses on cache stampedes—a situation where, after a cache entry expires, many requests miss at the same time and all go to the backend together.
To make it precise (and not just “hand-wavy”), I modeled two specific delays that show up in real systems:
- Backend fetch time delay: requests don’t fill the cache instantly; they complete after some time.
- In-flight request decay delay: even after the cache is warm again, the backlog of concurrent requests takes time to drain.
This is a classic fit for system dynamics: a model that tracks how flows (rates) change over time based on feedback loops.
The model: two delays feeding one feedback loop
Here are the parts I simulated, in plain terms.
Stocks (things with “amount”)
C(t): amount of “cached freshness” (0 to 1).- When cache is fresh, fewer requests miss.
I(t): amount of “in-flight backend work” (arbitrary units proportional to concurrent fetches).
Flows (rates that change stocks)
- Request miss rate depends on cache freshness: if
Cis low, more requests go to the backend. - Cache fill rate depends on in-flight backend work but with delay.
- In-flight decay rate depends on how long backend work takes but also with delay.
Two delays I encoded
I used “queue-like” delay chains:
- A delay line for when backend work turns into cache fill.
- Another delay line for when in-flight work turns into “drained” state.
Concretely: I discretized time and pushed “events” through a chain of steps. That’s a practical way to model delays without needing heavy math.
Working Python simulation (step-by-step)
Below is a complete script you can run. It simulates 1 hour of time in 1-second steps.
import math from collections import deque def simulate_cache_stampede( duration_s=3600, dt=1.0, rps=200, # incoming request rate cache_refresh=1/300, # cache naturally refills/refreshes at this rate (per second) miss_sensitivity=8.0, # how quickly misses drop as cache freshness rises stampede_multiplier=1.0,# increase backend fan-out when misses are high (feedback effect) # Delay chain parameters backend_delay_s=8, # time from backend request start to cache being filled in_flight_delay_s=6, # time from backend start to in-flight backlog draining # Backend capacity saturation backend_capacity=50, # effective parallelism; above this, slowdown makes stampede worse ): steps = int(duration_s / dt) # Stocks C = 0.0 # cache freshness in [0,1], starts cold I = 0.0 # in-flight workload (arbitrary units) # Delay lines: # We'll push "cache fill contribution" events into a queue, then apply after backend_delay_s. backend_delay_steps = max(1, int(round(backend_delay_s / dt))) in_flight_delay_steps = max(1, int(round(in_flight_delay_s / dt))) cache_fill_queue = deque([0.0] * backend_delay_steps, maxlen=backend_delay_steps) in_flight_drain_queue = deque([0.0] * in_flight_delay_steps, maxlen=in_flight_delay_steps) # Results history = { "t": [], "C": [], "I": [], "miss_rate": [], "backend_start": [], "cache_fill_applied": [], } # Helper for cache freshness -> miss probability # miss_prob = 1 / (1 + exp(k*(C - 0.5))) would be sigmoid; I used a smoother exponential form. # When C is high, misses are rare; when C is low, misses approach 1. def miss_probability(C_value): # Map C in [0,1] to a miss probability in [0,1]. # As C approaches 1, exp(-k*C) becomes tiny => miss_prob approaches small. return 1.0 - math.exp(-miss_sensitivity * max(0.0, C_value)) for step in range(steps): t = step * dt # Natural decay of cache freshness over time (expiration). # Even without traffic, freshness drifts downward. C = max(0.0, C - cache_refresh * dt) # Optional: introduce a single cache expiry “shock” near t=600s # This is the scenario that triggers stampede behavior. if abs(t - 600) < 0.5: C = 0.0 # Miss probability and miss count p_miss = miss_probability(C) incoming = rps * dt miss = incoming * p_miss # Feedback: when misses are high, systems often amplify load due to retries, # thundering herd behavior, or shared upstream dependencies. # I represent this as a nonlinear multiplier. feedback = 1.0 + (miss / max(1e-9, rps * dt)) ** 2 * (stampede_multiplier - 1.0) # Backend start rate: requests that miss and decide to fetch. # Saturation: beyond backend_capacity, starts still happen but they slow down, # so in-flight grows more than linearly. backend_start = miss * feedback # Saturation effect: if in-flight grows, effective drain later is slower. # I model that by increasing the amount of "work" enqueued that must be drained. saturation_factor = 1.0 + max(0.0, (I / backend_capacity)) ** 1.5 backend_work_started = backend_start * saturation_factor # In-flight stock update: # - starts increase I # - drains are applied with delay via in_flight_drain_queue # Enqueue how much in-flight will drain after delay. in_flight_drain_queue.append(backend_work_started) drained_now = in_flight_drain_queue.popleft() I = max(0.0, I + backend_work_started - drained_now) # Cache fill contribution is delayed relative to backend start. # We'll enqueue a fraction of backend work that results in usable cache freshness. # The more in-flight, the more cache fill happens (diminishing returns via tanh). fill_contribution = 0.9 * math.tanh(backend_work_started / 100.0) cache_fill_queue.append(fill_contribution) cache_fill_applied = cache_fill_queue.popleft() # Apply cache fill to freshness (clamped to 1.0) C = min(1.0, C + cache_fill_applied) # Record history["t"].append(t) history["C"].append(C) history["I"].append(I) history["miss_rate"].append(p_miss) history["backend_start"].append(backend_start) history["cache_fill_applied"].append(cache_fill_applied) return history if __name__ == "__main__": hist = simulate_cache_stampede( duration_s=3600, rps=220, cache_refresh=1/240, miss_sensitivity=10.0, stampede_multiplier=2.0, backend_delay_s=10, in_flight_delay_s=7, backend_capacity=45, ) # Print a few key points to make the pattern obvious without plotting for idx in [0, 590, 600, 610, 650, 900, 1200, 1800, 3599]: t = hist["t"][idx] print( f"t={t:6.0f}s C={hist['C'][idx]:.3f} " f"miss_prob={hist['miss_rate'][idx]:.3f} " f"in_flight={hist['I'][idx]:.1f} " f"backend_start={hist['backend_start'][idx]:.1f}" )
What each important block is doing (and why)
-
Cache freshness
C- I start at
C = 0.0(cold). - Each second I slightly reduce it:
C = C - cache_refresh * dt. That represents expiration drift. - At
t=600seconds I force it to0.0to mimic an expiry event.
- I start at
-
Miss probability
miss_probability(C)converts freshness into a probability of a cache miss.- When
Cis high, misses collapse rapidly; whenCis low, misses rise quickly.
-
Backend start and saturation
- Misses become backend fetch starts:
backend_start = miss * feedback. feedbackis a nonlinear multiplier representing retry/fan-out effects during high miss periods.- Saturation factor increases in-flight:
saturation_factor = 1 + (I / backend_capacity)^1.5.- This is the “stampede gets worse with itself” ingredient.
- Misses become backend fetch starts:
-
Two delay queues
cache_fill_queuedelays when cache fill affectsCbybackend_delay_s.in_flight_drain_queuedelays when in-flight workload drains byin_flight_delay_s.- This difference is what creates the “spike then slow recovery” shape.
What it looks like when you run it
When I ran the script, the output lines typically show this story:
- Before
t=600s, cache freshnessCstabilizes and miss probability stays low. - At
t≈600s,Cdrops to 0, somiss_probjumps. - Backend starts spike; due to saturation,
in_flightkeeps climbing for a while even after cache freshness begins improving. - Because cache fill is delayed,
Cdoesn’t rebound instantly. - The in-flight drain delay means even after fill starts, the system still behaves “busy,” sustaining misses longer than expected.
That last part is the key: the system doesn’t recover at the same time scale as the cache. The two delays desynchronize cause and effect.
Turning the crank: delay mismatch vs. single delay
To see why two delays matter, I reran with matched delays (backend_delay_s == in_flight_delay_s). The “second hump” in in-flight was smaller, and the recovery looked more monotonic.
That’s the system dynamics lesson I didn’t fully appreciate at first:
When feedback loops include multiple time lags, the dynamics can overshoot and recover slowly even if each component individually is “fine.”
Practical insight: modeling stampedes as feedback, not a one-off
My biggest takeaway wasn’t just “stampedes happen.” It was the realization that a stampede is a feedback-driven system dynamics problem:
- Cache expiry reduces freshness → increases misses.
- Misses increase backend work → increases in-flight.
- In-flight causes saturation → slows draining.
- Fill happens after a delay → freshness improves late.
- During the delay mismatch, the feedback keeps pushing.
Even a tiny model like this can make the shape of incidents predictable: spikes, then lingering recovery.
In the end, I learned how to represent a cache stampede as a system dynamics feedback loop using two explicit delays: one for when backend work becomes cache freshness, and another for when in-flight load drains. That mismatch in timing is what produced the “slow calm-down” behavior I observed, and the simulation made the causal chain feel concrete instead of mysterious.