Modeling Cache Stampedes With A Two-Delay Feedback Loop In Python

The tiny production fire I wanted to understand

A while back I chased a weird incident: response times would suddenly spike, then slowly recover—but the system never fully “calmed down” the way my intuition expected. The strangest part was the pattern: it looked like one bad request triggered a cascade, and the cascade took its time.

To understand it, I built a small simulation of cache stampedes: after a cache entry expires, many requests miss at the same time and all go to the backend together.

To make it precise (and not just “hand-wavy”), I modeled two specific delays that show up in real systems:

Backend fetch time delay: requests don’t fill the cache instantly; they complete after some time.
In-flight request decay delay: even after the cache is warm again, the backlog of concurrent requests takes time to drain.

This is a classic fit for system dynamics: a modeling approach that tracks how stocks (accumulated amounts) and flows (rates) change over time under feedback loops.

The model: two delays feeding one feedback loop

Here are the parts I simulated, in plain terms.

Stocks (things with “amount”)

C(t): amount of “cached freshness” (0 to 1).
- When cache is fresh, fewer requests miss.
I(t): amount of “in-flight backend work” (arbitrary units proportional to concurrent fetches).

Flows (rates that change stocks)

Request miss rate depends on cache freshness: if C is low, more requests go to the backend.
Cache fill rate depends on in-flight backend work but with delay.
In-flight drain rate depends on how long backend work takes: work that starts now only leaves the in-flight stock after a delay.
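In code, the stock-and-flow vocabulary comes down to one update per stock per time step: add inflow, subtract outflow, scale by the step size. Here is a minimal sketch of that update. The helper name step_stock and the numbers in the example are illustrative assumptions, not part of the simulation.

```python
def step_stock(value, inflow, outflow, dt, lo=0.0, hi=float("inf")):
    """Advance one stock by one Euler step, clamped to [lo, hi]."""
    return min(hi, max(lo, value + (inflow - outflow) * dt))


# Hypothetical numbers: freshness starts at 1.0 and decays at 1/300 per second
# with no fill arriving, so after 60 one-second steps it should be about 0.8.
freshness = 1.0
for _ in range(60):
    freshness = step_stock(freshness, inflow=0.0, outflow=1 / 300, dt=1.0, hi=1.0)
print(round(freshness, 3))  # 0.8
```

The clamp matters for C, which only makes sense between 0 and 1; a stock like I only needs the lower bound.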

Two delays I encoded

I used “queue-like” delay chains:

A delay line for when backend work turns into cache fill.
Another delay line for when in-flight work turns into “drained” state.

Concretely: I discretized time and pushed “events” through a chain of steps. That’s a practical way to model delays without needing heavy math.
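To make the "chain of steps" idea concrete, here is a small sketch of a fixed delay line built on collections.deque. The DelayLine class is a hypothetical helper for illustration; the script uses bare deques rather than a wrapper like this.

```python
from collections import deque


class DelayLine:
    """Fixed delay: a value pushed at step n comes back out at step n + delay_steps."""

    def __init__(self, delay_steps, fill=0.0):
        if delay_steps < 1:
            raise ValueError("delay_steps must be >= 1")
        self._buf = deque([fill] * delay_steps)

    def step(self, value):
        # Pop the oldest entry first, then push the new one, so the buffer
        # always holds exactly delay_steps items between calls.
        out = self._buf.popleft()
        self._buf.append(value)
        return out


line = DelayLine(delay_steps=3)
outputs = [line.step(x) for x in [1.0, 0.0, 0.0, 0.0, 0.0]]
print(outputs)  # [0.0, 0.0, 0.0, 1.0, 0.0]
```

The pulse pushed at step 0 reappears at step 3. Popping before appending is a deliberate choice: it keeps the delay equal to the buffer length.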

Working Python simulation (step-by-step)

Below is a complete script you can run. It simulates 1 hour of time in 1-second steps.

import math
from collections import deque

def simulate_cache_stampede(
    duration_s=3600,
    dt=1.0,
    rps=200,                 # incoming request rate
    cache_refresh=1/300,    # rate at which cache freshness decays (per second), despite the name
    miss_sensitivity=8.0,   # how quickly misses drop as cache freshness rises
    stampede_multiplier=1.0,# increase backend fan-out when misses are high (feedback effect)

    # Delay chain parameters
    backend_delay_s=8,      # time from backend request start to cache being filled
    in_flight_delay_s=6,   # time from backend start to in-flight backlog draining

    # Backend capacity saturation
    backend_capacity=50,    # effective parallelism; above this, slowdown makes stampede worse
):
    steps = int(duration_s / dt)

    # Stocks
    C = 0.0  # cache freshness in [0,1], starts cold
    I = 0.0  # in-flight workload (arbitrary units)

    # Delay lines:
    # We'll push "cache fill contribution" events into a queue, then apply after backend_delay_s.
    backend_delay_steps = max(1, int(round(backend_delay_s / dt)))
    in_flight_delay_steps = max(1, int(round(in_flight_delay_s / dt)))

    # No maxlen here: on a full bounded deque, append() silently evicts the oldest
    # entry before popleft() runs, which would shorten each delay by one step.
    cache_fill_queue = deque([0.0] * backend_delay_steps)
    in_flight_drain_queue = deque([0.0] * in_flight_delay_steps)

    # Results
    history = {
        "t": [],
        "C": [],
        "I": [],
        "miss_rate": [],
        "backend_start": [],
        "cache_fill_applied": [],
    }

    # Cap on the saturation factor (an assumed value). Without a bound, saturation
    # feeds on its own output and the in-flight stock grows without limit.
    max_saturation = 4.0

    # Helper: map cache freshness C in [0,1] to a miss probability in [0,1]
    # using exponential decay. C = 0 gives miss_prob = 1 (every request misses);
    # as C approaches 1, exp(-k*C) becomes tiny, so misses become rare.
    def miss_probability(C_value):
        return math.exp(-miss_sensitivity * max(0.0, C_value))

    for step in range(steps):
        t = step * dt

        # Natural decay of cache freshness over time (expiration).
        # Even without traffic, freshness drifts downward.
        C = max(0.0, C - cache_refresh * dt)

        # Introduce a single cache expiry “shock” at t=600s.
        # This is the scenario that triggers stampede behavior.
        # The interval test fires exactly once for any dt.
        if t <= 600.0 < t + dt:
            C = 0.0

        # Miss probability and miss count for this step
        p_miss = miss_probability(C)
        incoming = rps * dt
        miss = incoming * p_miss

        # Feedback: when misses are high, systems often amplify load due to retries,
        # thundering herd behavior, or shared upstream dependencies.
        # I represent this as a nonlinear multiplier: 1 at p_miss=0, rising
        # quadratically to stampede_multiplier at p_miss=1.
        feedback = 1.0 + p_miss ** 2 * (stampede_multiplier - 1.0)

        # Backend start rate: requests that miss and decide to fetch.
        backend_start = miss * feedback

        # Saturation effect: the more work already in flight relative to
        # backend_capacity, the slower the backend gets, so in-flight grows
        # more than linearly. I model that by inflating the work enqueued.
        saturation_factor = 1.0 + max(0.0, I / backend_capacity) ** 1.5
        saturation_factor = min(saturation_factor, max_saturation)
        backend_work_started = backend_start * saturation_factor

        # In-flight stock update:
        # - starts increase I
        # - drains are applied with delay via in_flight_drain_queue
        # Enqueue how much in-flight will drain after delay.
        in_flight_drain_queue.append(backend_work_started)
        drained_now = in_flight_drain_queue.popleft()
        I = max(0.0, I + backend_work_started - drained_now)

        # Cache fill contribution is delayed relative to backend start.
        # We'll enqueue a fraction of backend work that results in usable cache freshness.
        # The more work started, the more cache fill happens (diminishing returns via tanh).
        fill_contribution = 0.9 * math.tanh(backend_work_started / 100.0)
        cache_fill_queue.append(fill_contribution)
        cache_fill_applied = cache_fill_queue.popleft()

        # Apply cache fill to freshness (clamped to 1.0)
        C = min(1.0, C + cache_fill_applied)

        # Record
        history["t"].append(t)
        history["C"].append(C)
        history["I"].append(I)
        history["miss_rate"].append(p_miss)
        history["backend_start"].append(backend_start)
        history["cache_fill_applied"].append(cache_fill_applied)

    return history


if __name__ == "__main__":
    hist = simulate_cache_stampede(
        duration_s=3600,
        rps=220,
        cache_refresh=1/240,
        miss_sensitivity=10.0,
        stampede_multiplier=2.0,
        backend_delay_s=10,
        in_flight_delay_s=7,
        backend_capacity=45,
    )

    # Print a few key points to make the pattern obvious without plotting
    for idx in [0, 590, 600, 610, 650, 900, 1200, 1800, 3599]:
        t = hist["t"][idx]
        print(
            f"t={t:6.0f}s  C={hist['C'][idx]:.3f}  "
            f"miss_prob={hist['miss_rate'][idx]:.3f}  "
            f"in_flight={hist['I'][idx]:.1f}  "
            f"backend_start={hist['backend_start'][idx]:.1f}"
        )

What each important block is doing (and why)

Cache freshness C
- I start at C = 0.0 (cold).
- Each second I slightly reduce it: C = C - cache_refresh * dt. That represents expiration drift (despite its name, cache_refresh acts as the decay rate here).
- At t=600 seconds I force it to 0.0 to mimic an expiry event.
Miss probability
- miss_probability(C) converts freshness into a probability of a cache miss.
- When C is high, misses collapse rapidly; when C is low, misses rise quickly.
Backend start and saturation
- Misses become backend fetch starts: backend_start = miss * feedback.
- feedback is a nonlinear multiplier representing retry/fan-out effects during high miss periods.
- Saturation factor increases in-flight: saturation_factor = 1 + (I / backend_capacity)^1.5.
  - This is the “stampede gets worse with itself” ingredient.
Two delay queues
- cache_fill_queue delays when cache fill affects C by backend_delay_s.
- in_flight_drain_queue delays when in-flight workload drains by in_flight_delay_s.
- This difference is what creates the “spike then slow recovery” shape.
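To get a feel for the three nonlinear pieces in isolation, the sketch below tabulates them using the values from the __main__ call (miss_sensitivity=10, stampede_multiplier=2, backend_capacity=45). It assumes the miss curve is exp(-k * C), which is the form that matches the description above (misses near 1 when C is 0). The standalone function names are mine, and the saturation formula is shown without any upper bound.

```python
import math


def miss_probability(c, k=10.0):
    """Assumed miss curve: 1.0 at C=0, decaying exponentially as C rises."""
    return math.exp(-k * max(0.0, c))


def feedback(p_miss, stampede_multiplier=2.0):
    """Retry/fan-out multiplier: 1 when nothing misses, stampede_multiplier when all miss."""
    return 1.0 + p_miss ** 2 * (stampede_multiplier - 1.0)


def saturation_factor(in_flight, backend_capacity=45.0):
    """Work inflation as in-flight load approaches and exceeds capacity (unbounded form)."""
    return 1.0 + max(0.0, in_flight / backend_capacity) ** 1.5


for c in (0.0, 0.1, 0.3, 0.6, 1.0):
    p = miss_probability(c)
    print(f"C={c:.1f}  miss_prob={p:.5f}  feedback={feedback(p):.3f}")

for in_flight in (0.0, 45.0, 180.0):
    print(f"I={in_flight:5.1f}  saturation_factor={saturation_factor(in_flight):.2f}")
```

With these settings the saturation factor is 2 when I equals backend_capacity and 9 at four times capacity, which shows how quickly this term can dominate if nothing bounds it.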

What it looks like when you run it

When I ran the script, the output lines told this story:

Before t=600s, cache freshness C stabilizes and miss probability stays low.
At t≈600s, C drops to 0, so miss_prob jumps.
Backend starts spike; saturation inflates each unit of work, so in_flight climbs steeply while the cache is still cold.
Because cache fill is delayed, C doesn’t rebound instantly: it stays near 0 for roughly backend_delay_s.
The in-flight drain delay means that even after fill starts, the system still looks “busy”: in_flight stays elevated for roughly in_flight_delay_s after misses have dropped.

That last part is the key: the system doesn’t recover at the same time scale as the cache. The two delays desynchronize cause and effect.
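One way to put a number on that desynchronization is to measure how long in-flight load stays elevated after misses have already recovered. The sketch below does that for a history dict shaped like the one the script returns (keys "t", "miss_rate", "I"). The helper names and thresholds are assumptions, and the toy history is hand-written to exercise the helper; it is not simulation output.

```python
def first_time_at_or_below(times, values, threshold, after):
    """Return the first time > after at which value <= threshold, or None."""
    for t, v in zip(times, values):
        if t > after and v <= threshold:
            return t
    return None


def recovery_lag(history, shock_t, miss_threshold, in_flight_threshold):
    """Seconds between misses recovering and in-flight load recovering."""
    t_miss = first_time_at_or_below(
        history["t"], history["miss_rate"], miss_threshold, shock_t
    )
    t_busy = first_time_at_or_below(
        history["t"], history["I"], in_flight_threshold, shock_t
    )
    if t_miss is None or t_busy is None:
        return None
    return t_busy - t_miss


# Hand-written toy history (not simulation output).
toy = {
    "t":         [0, 1, 2, 3, 4, 5, 6],
    "miss_rate": [0.0, 1.0, 1.0, 0.1, 0.0, 0.0, 0.0],
    "I":         [0.0, 5.0, 10.0, 10.0, 6.0, 2.0, 0.0],
}
lag = recovery_lag(toy, shock_t=0, miss_threshold=0.2, in_flight_threshold=1.0)
print(lag)  # 3
```

Against the real history you would pass shock_t=600 and pick thresholds slightly above the pre-shock steady-state values.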

Turning the crank: delay mismatch vs. single delay

To see why two delays matter, I reran with matched delays (backend_delay_s == in_flight_delay_s). The “second hump” in in-flight was smaller, and the recovery looked more monotonic.

That’s the system dynamics lesson I didn’t fully appreciate at first:

When feedback loops include multiple time lags, the dynamics can overshoot and recover slowly even if each component individually is “fine.”
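To see how the two lags stack even without any nonlinearity, here is a deliberately stripped-down toy. Every request misses until the first fill lands, and each unit of work leaves the in-flight stock a fixed number of steps after it starts. It omits saturation and retry feedback entirely, so it is a sketch of the timing argument only, not a replacement for the script; busy_window and miss_load are hypothetical names.

```python
from collections import deque


def busy_window(fill_delay, drain_delay, miss_load=100.0, horizon=120):
    """Toy stampede: return (peak in-flight, step at which in-flight is back to zero)."""
    fill_q = deque([False] * fill_delay)   # fetches on their way to filling the cache
    drain_q = deque([0.0] * drain_delay)   # work waiting to leave the in-flight stock
    warm = False
    in_flight = 0.0
    peak = 0.0
    for t in range(horizon):
        started = 0.0 if warm else miss_load  # everything misses until the cache is warm
        filled = fill_q.popleft()
        fill_q.append(not warm)               # a fetch started now fills the cache later
        drained = drain_q.popleft()
        drain_q.append(started)
        in_flight += started - drained
        peak = max(peak, in_flight)
        if filled:
            warm = True
        if warm and in_flight == 0.0:
            return peak, t
    return peak, None


for fill_delay, drain_delay in [(7, 7), (10, 7), (10, 14)]:
    print(fill_delay, drain_delay, busy_window(fill_delay, drain_delay))
```

In this toy the system goes idle at step fill_delay + drain_delay: the drain lag only starts counting once the fill lag has ended the misses. The full model layers saturation and retry feedback on top of that baseline.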

Practical insight: modeling stampedes as feedback, not a one-off

My biggest takeaway wasn’t just “stampedes happen.” It was the realization that a stampede is a feedback-driven system dynamics problem:

Cache expiry reduces freshness → increases misses.
Misses increase backend work → increases in-flight.
In-flight causes saturation → slows draining.
Fill happens after a delay → freshness improves late.
During the delay mismatch, the feedback keeps pushing.

Even a tiny model like this can make the shape of incidents predictable: spikes, then lingering recovery.

In the end, I learned how to represent a cache stampede as a system dynamics feedback loop using two explicit delays: one for when backend work becomes cache freshness, and another for when in-flight load drains. That mismatch in timing is what produced the “slow calm-down” behavior I observed, and the simulation made the causal chain feel concrete instead of mysterious.