Systems Thinking

June 19, 2026

Modeling Cache Stampedes With A Two-Delay Feedback Loop In Python

Written by Elena Holos

The tiny production fire I wanted to understand

A while back I chased a weird incident: response times would suddenly spike, then slowly recover, but the system never fully “calmed down” the way my intuition expected. The strangest part was the pattern: it looked as if one bad request had triggered a cascade, and the cascade took its time to die out.

To understand it, I built a small simulation that focuses on cache stampedes: situations where, after a cache entry expires, many requests miss at the same time and all go to the backend together.

To make it precise (and not just “hand-wavy”), I modeled two specific delays that show up in real systems:

  • Backend fetch time delay: requests don’t fill the cache instantly; they complete after some time.
  • In-flight request decay delay: even after the cache is warm again, the backlog of concurrent requests takes time to drain.

This is a classic fit for system dynamics: a modeling approach that tracks how stocks (accumulated quantities) and flows (the rates that fill or drain them) change over time under feedback.


The model: two delays feeding one feedback loop

Here are the parts I simulated, in plain terms.

Stocks (things with “amount”)

  • C(t): amount of “cached freshness” (0 to 1).
    • When cache is fresh, fewer requests miss.
  • I(t): amount of “in-flight backend work” (arbitrary units proportional to concurrent fetches).

Flows (rates that change stocks)

  • Request miss rate depends on cache freshness: if C is low, more requests go to the backend.
  • Cache fill rate depends on in-flight backend work but with delay.
  • In-flight decay rate depends on how long backend work takes but also with delay.

Two delays I encoded

I used “queue-like” delay chains:

  • A delay line for when backend work turns into cache fill.
  • Another delay line for when in-flight work turns into “drained” state.
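Before discretizing, the loop can be written as a pair of delay differential equations. This is only a sketch, and every symbol is an assumption introduced here for illustration: \lambda is the request rate, r the freshness decay rate, p(C) any miss probability that falls as C rises, \varphi the retry/fan-out multiplier, s(I) the saturation factor, g a saturating fill function, and \tau_b, \tau_I the two delays.

```latex
\begin{aligned}
w(t) &= \lambda \, p\big(C(t)\big) \, \varphi(t) \, s\big(I(t)\big) \\
\frac{dC}{dt} &= -r + g\big(w(t - \tau_b)\big), \qquad 0 \le C \le 1 \\
\frac{dI}{dt} &= w(t) - w(t - \tau_I), \qquad I \ge 0
\end{aligned}
```

The first line is the backend work started per unit time. The second says freshness decays steadily and is topped up by work that started \tau_b earlier. The third says work leaves the in-flight stock exactly \tau_I after it entered.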

Concretely: I discretized time and pushed “events” through a chain of steps. That’s a practical way to model delays without needing heavy math.
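A delay chain of this kind can be sketched in a few lines. The class name DelayLine and its tick method are hypothetical, chosen for this illustration; the only assumption is that one call to tick corresponds to one time step.

```python
from collections import deque


class DelayLine:
    """Fixed-length delay: a value pushed now comes out `steps` ticks later."""

    def __init__(self, steps, fill=0.0):
        if steps < 1:
            raise ValueError("steps must be >= 1")
        # Pre-fill so the first `steps` outputs are the neutral `fill` value.
        self._queue = deque([fill] * steps)

    def tick(self, value):
        # Pop the oldest value first, then enqueue the new one, so the
        # queue length (and therefore the delay) stays exactly `steps`.
        oldest = self._queue.popleft()
        self._queue.append(value)
        return oldest


line = DelayLine(steps=3)
outputs = [line.tick(x) for x in [10.0, 20.0, 30.0, 40.0, 50.0]]
print(outputs)  # [0.0, 0.0, 0.0, 10.0, 20.0]
```

The order of the two operations matters. A deque created with maxlen=N discards an item from the opposite end when you append to it while full, so appending before popping on such a deque loses one slot and shortens the delay by one step.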


Working Python simulation (step-by-step)

Below is a complete script you can run. It simulates 1 hour of time in 1-second steps.

import math
from collections import deque


def simulate_cache_stampede(
    duration_s=3600,
    dt=1.0,
    rps=200,                  # incoming request rate (requests per second)
    cache_refresh=1 / 300,    # rate at which cache freshness decays (per second)
    miss_sensitivity=8.0,     # how quickly misses drop as cache freshness rises
    stampede_multiplier=1.0,  # backend fan-out when misses are high (1.0 = no feedback)
    # Delay chain parameters
    backend_delay_s=8,        # time from backend request start to cache being filled
    in_flight_delay_s=6,      # time from backend start to in-flight backlog draining
    # Backend capacity saturation
    backend_capacity=50,      # effective parallelism; above this, slowdown makes stampede worse
    max_saturation=5.0,       # ceiling on the saturation factor (arbitrary illustrative value)
):
    steps = int(duration_s / dt)

    # Stocks
    C = 0.0  # cache freshness in [0, 1], starts cold
    I = 0.0  # in-flight workload (arbitrary units)

    # Delay lines, pre-filled with zeros. Each step pops the oldest value and
    # then appends the newest, so a value comes back out exactly N steps later.
    backend_delay_steps = max(1, int(round(backend_delay_s / dt)))
    in_flight_delay_steps = max(1, int(round(in_flight_delay_s / dt)))
    cache_fill_queue = deque([0.0] * backend_delay_steps)
    in_flight_drain_queue = deque([0.0] * in_flight_delay_steps)

    # Results
    history = {
        "t": [],
        "C": [],
        "I": [],
        "miss_rate": [],
        "backend_start": [],
        "cache_fill_applied": [],
    }

    def miss_probability(C_value):
        # Map C in [0, 1] to a miss probability in (0, 1].
        # A sigmoid, 1 / (1 + exp(k * (C - 0.5))), would also work; I used a
        # simpler falling exponential. C = 0 gives 1.0 (every request misses);
        # as C approaches 1, exp(-k * C) becomes tiny.
        return math.exp(-miss_sensitivity * max(0.0, C_value))

    for step in range(steps):
        t = step * dt

        # Natural decay of cache freshness over time (expiration).
        # Even without traffic, freshness drifts downward.
        C = max(0.0, C - cache_refresh * dt)

        # Single cache expiry "shock" at t = 600 s.
        # This is the scenario that triggers stampede behavior.
        if abs(t - 600) < dt / 2:
            C = 0.0

        # Miss probability and miss count for this step
        p_miss = miss_probability(C)
        incoming = rps * dt
        miss = incoming * p_miss

        # Feedback: when misses are high, systems often amplify load through
        # retries, thundering-herd behavior, or shared upstream dependencies.
        # Modeled as a multiplier that grows with the square of the miss fraction.
        feedback = 1.0 + p_miss ** 2 * (stampede_multiplier - 1.0)

        # Backend start rate: requests that miss and go on to fetch.
        backend_start = miss * feedback

        # Saturation: past backend_capacity, fetches slow down, so each new
        # start adds more work that must later be drained. The clamp keeps this
        # self-reinforcing term finite; unclamped, it compounds every step and
        # overflows a float within a few simulated seconds of a cold start.
        saturation_factor = min(
            max_saturation, 1.0 + max(0.0, I / backend_capacity) ** 1.5
        )
        backend_work_started = backend_start * saturation_factor

        # In-flight stock: work started now raises I; the same work is
        # removed again in_flight_delay_s later.
        drained_now = in_flight_drain_queue.popleft()
        in_flight_drain_queue.append(backend_work_started)
        I = max(0.0, I + backend_work_started - drained_now)

        # Cache fill is delayed relative to backend start. Enqueue the share of
        # backend work that ends up as usable freshness; tanh gives diminishing
        # returns as the amount of work grows.
        fill_contribution = 0.9 * math.tanh(backend_work_started / 100.0)
        cache_fill_applied = cache_fill_queue.popleft()
        cache_fill_queue.append(fill_contribution)

        # Apply cache fill to freshness (clamped to 1.0)
        C = min(1.0, C + cache_fill_applied)

        # Record
        history["t"].append(t)
        history["C"].append(C)
        history["I"].append(I)
        history["miss_rate"].append(p_miss)
        history["backend_start"].append(backend_start)
        history["cache_fill_applied"].append(cache_fill_applied)

    return history


if __name__ == "__main__":
    hist = simulate_cache_stampede(
        duration_s=3600,
        rps=220,
        cache_refresh=1 / 240,
        miss_sensitivity=10.0,
        stampede_multiplier=2.0,
        backend_delay_s=10,
        in_flight_delay_s=7,
        backend_capacity=45,
    )

    # Print a few key points to make the pattern obvious without plotting
    for idx in [0, 590, 600, 610, 650, 900, 1200, 1800, 3599]:
        t = hist["t"][idx]
        print(
            f"t={t:6.0f}s  C={hist['C'][idx]:.3f}  "
            f"miss_prob={hist['miss_rate'][idx]:.3f}  "
            f"in_flight={hist['I'][idx]:.1f}  "
            f"backend_start={hist['backend_start'][idx]:.1f}"
        )

What each important block is doing (and why)

  • Cache freshness C

    • I start at C = 0.0 (cold).
    • Each second I slightly reduce it: C = C - cache_refresh * dt. That represents expiration drift.
    • At t=600 seconds I force it to 0.0 to mimic an expiry event.
  • Miss probability

    • miss_probability(C) converts freshness into a probability of a cache miss.
    • When C is high, misses collapse rapidly; when C is low, misses rise quickly.
  • Backend start and saturation

    • Misses become backend fetch starts: backend_start = miss * feedback.
    • feedback is a nonlinear multiplier representing retry/fan-out effects during high miss periods.
    • Saturation factor increases in-flight: saturation_factor = 1 + (I / backend_capacity)^1.5.
      • This is the “stampede gets worse with itself” ingredient.
  • Two delay queues

    • cache_fill_queue delays when cache fill affects C by backend_delay_s.
    • in_flight_drain_queue delays when in-flight workload drains by in_flight_delay_s.
    • This difference is what creates the “spike then slow recovery” shape.
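To get a feel for the three nonlinear pieces in isolation, it helps to tabulate them. The sketch below is standalone and makes some assumptions: the function names are chosen for this illustration, the miss curve is a falling exponential (one form that matches "high C, few misses"), and the default values are the ones passed in the example run (miss_sensitivity=10.0, stampede_multiplier=2.0, backend_capacity=45).

```python
import math


def miss_probability(freshness, sensitivity=10.0):
    # Assumed form: a falling exponential, so a cold cache (0.0) always misses
    # and a fresh cache (1.0) almost never does.
    return math.exp(-sensitivity * max(0.0, freshness))


def feedback_multiplier(p_miss, stampede_multiplier=2.0):
    # Quadratic in the miss fraction: mild at low miss rates, full strength at 1.0.
    return 1.0 + p_miss ** 2 * (stampede_multiplier - 1.0)


def saturation_factor(in_flight, capacity=45.0):
    # 1 + (I / capacity)^1.5: close to 1 well below capacity, steep above it.
    return 1.0 + max(0.0, in_flight / capacity) ** 1.5


for c in (0.0, 0.1, 0.3, 0.6, 1.0):
    p = miss_probability(c)
    print(f"C={c:.1f}  p_miss={p:.4f}  feedback={feedback_multiplier(p):.3f}")

for i in (0.0, 45.0, 180.0):
    print(f"I={i:5.1f}  saturation={saturation_factor(i):.2f}")
```

Because the feedback term is squared, it only matters when most requests are missing, which is exactly the window right after an expiry.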

What it looks like when you run it

When I ran the script, the output lines told this story:

  • Before t=600s, cache freshness C stabilizes and miss probability stays low.
  • At t≈600s, C drops to 0, so miss_prob jumps.
  • Backend starts spike; due to saturation, in_flight keeps climbing for a while even after cache freshness begins improving.
  • Because cache fill is delayed, C doesn’t rebound instantly.
  • The in-flight drain delay means that even after fill starts, the system still behaves “busy”: in_flight stays elevated until the work that is already queued has drained.

That last part is the key: the system doesn’t recover at the same time scale as the cache. The two delays desynchronize cause and effect.
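One way to put numbers on that mismatch is to measure, for each series, how long it takes to settle back to its pre-shock level. The helper below is a sketch: recovery_time, the tolerances, and the two toy series are made up for illustration and are not output from the simulation.

```python
def recovery_time(times, values, shock_time, baseline, tolerance):
    """Time after shock_time until values stay within tolerance of baseline."""
    last_bad = None
    for t, v in zip(times, values):
        if t >= shock_time and abs(v - baseline) > tolerance:
            last_bad = t
    if last_bad is None:
        return 0  # never left the band
    # Recovery is the first sample after the last out-of-band one.
    later = [t for t in times if t > last_bad]
    return later[0] - shock_time if later else float("inf")


# Toy series with a shock at t=2 (illustrative numbers only).
times = list(range(10))
freshness = [0.6, 0.6, 0.0, 0.0, 0.9, 0.6, 0.6, 0.6, 0.6, 0.6]
in_flight = [3.0, 3.0, 400.0, 800.0, 800.0, 500.0, 200.0, 3.0, 3.0, 3.0]

cache_recovery = recovery_time(times, freshness, shock_time=2, baseline=0.6, tolerance=0.05)
load_recovery = recovery_time(times, in_flight, shock_time=2, baseline=3.0, tolerance=1.0)
print(f"cache recovers after {cache_recovery} s, in-flight after {load_recovery} s")
```

With the script's history, the same call would take hist["t"] together with hist["C"] or hist["I"] and shock_time=600; the baseline and tolerance would need to be read off the pre-shock samples.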


Turning the crank: delay mismatch vs. single delay

To see why two delays matter, I reran with matched delays (backend_delay_s == in_flight_delay_s). The “second hump” in in-flight was smaller, and the recovery looked more monotonic.

That’s the system dynamics lesson I didn’t fully appreciate at first:

When feedback loops include multiple time lags, the dynamics can overshoot and recover slowly even if each component individually is “fine.”
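The overshoot effect shows up even in a much smaller loop than the stampede model. The toy below is not the simulation above: it is a single corrective feedback that acts on a stale reading, with a made-up gain and delay, meant only to show that a lag alone can turn a smooth recovery into an overshoot.

```python
def relax(delay_steps, gain=0.2, steps=60):
    """x[t+1] = x[t] - gain * x[t - delay_steps], starting from x = 1.0."""
    history = [1.0] * (delay_steps + 1)  # the past is held at the initial value
    for _ in range(steps):
        delayed = history[-1 - delay_steps]
        history.append(history[-1] - gain * delayed)
    return history[delay_steps:]  # drop the padded past, keep t = 0 onward


no_lag = relax(delay_steps=0)
lagged = relax(delay_steps=5)
print(f"no lag:      min x = {min(no_lag):.3f}")
print(f"5-step lag:  min x = {min(lagged):.3f}")
```

Without a lag, x shrinks by the same fraction every step and never crosses zero. With the correction based on a value five steps old, the loop keeps pushing after x has already reached zero, so it swings past the target before settling.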


Practical insight: modeling stampedes as feedback, not a one-off

My biggest takeaway wasn’t just “stampedes happen.” It was the realization that a stampede is a feedback-driven system dynamics problem:

  • Cache expiry reduces freshness → increases misses.
  • Misses increase backend work → increases in-flight.
  • In-flight causes saturation → slows draining.
  • Fill happens after a delay → freshness improves late.
  • During the delay mismatch, the feedback keeps pushing.

Even a tiny model like this can make the shape of incidents predictable: spikes, then lingering recovery.


In the end, I learned how to represent a cache stampede as a system dynamics feedback loop using two explicit delays: one for when backend work becomes cache freshness, and another for when in-flight load drains. That mismatch in timing is what produced the “slow calm-down” behavior I observed, and the simulation made the causal chain feel concrete instead of mysterious.