Incident Postmortems As Feedback Control For Queue Backlog Oscillations

I used to think incident postmortems were mostly for “remembering what happened.” Then I watched a system almost learn from its failures—and fail anyway.

The pattern was weirdly consistent: after an incident, we’d apply a hotfix that reduced error rates, and everything looked calm for a few days. Then queue backlog would slowly climb, trigger alerts, and the system would flip into a stressful recovery mode. A week later, we’d see the same class of incident again, with slightly different symptoms.

What finally clicked for me was treating incident handling as feedback control: the postmortem isn’t just documentation; it’s a control loop that changes the system’s behavior in response to observed signals (like backlog, latency, and error rate). When that loop is poorly designed, especially when it responds too late or too aggressively, the system can oscillate: backlog builds up, the response over-corrects, the system lurches into “catch-up mode,” and the cycle repeats.

In this post, I’ll show how I modeled “postmortem action selection” as a feedback controller, and how a small change to the rule for what we decide after an incident made the oscillations stop.

The niche failure mode: “Human-tuned backlog recovery” that oscillates

Consider a common architecture:

Workers pull jobs from a queue.
A queue backlog is the number of jobs waiting.
Workers process jobs; processing time varies.
We have an operator (the on-call team) who applies mitigations during incidents.

In the real system, mitigations look like:

scale up workers,
increase consumer concurrency,
throttle producers,
restart services,
tweak timeouts.

The philosophy bug I found wasn’t that any individual hotfix was “wrong.” It was that we were making decisions based on a snapshot without considering how those decisions change the system’s future trajectory.

So I built a tiny simulation: one queue, one worker pool, and a control policy that chooses mitigation intensity based on backlog observed “during the incident.”

A small simulation of backlog oscillation (with step-by-step code)

I wanted something I could run in a terminal, so I wrote a discrete-time model in Python. It tracks:

backlog: jobs waiting
arrival_rate: jobs entering the queue per tick
service_capacity: jobs processed per tick, based on worker count
mitigation: how aggressively we temporarily throttle producers during “incident recovery”
a control rule that changes mitigation after an incident
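
Before running anything, a back-of-the-envelope check on these quantities is useful. The sketch below assumes the same numbers the model in Step 1 uses as defaults (30 arrivals per tick, 10 workers at 2 jobs per tick each, an incident threshold of 120) and ignores the random jitter. The helper names mean_drift and break_even_mitigation are hypothetical and are not part of the simulation.

```python
# Assumed values, mirroring the defaults of the simulation in Step 1.
base_arrival_rate = 30.0           # mean jobs entering the queue per tick
worker_count = 10
base_processing_per_worker = 2.0   # mean jobs one worker completes per tick
incident_threshold = 120.0


def mean_drift(mitigation: float) -> float:
    """Average backlog change per tick at a given throttle level (jitter ignored)."""
    arrivals = base_arrival_rate * (1.0 - mitigation)
    capacity = worker_count * base_processing_per_worker
    return arrivals - capacity


def break_even_mitigation() -> float:
    """Throttle level at which mean arrivals exactly match mean capacity."""
    capacity = worker_count * base_processing_per_worker
    return 1.0 - capacity / base_arrival_rate


# With no throttling the queue gains 10 jobs per tick on average,
# so an empty queue reaches the threshold in about 12 ticks.
ticks_to_threshold = incident_threshold / mean_drift(0.0)
print(mean_drift(0.0), ticks_to_threshold)

# Any throttle above roughly one third makes the backlog shrink on average.
print(break_even_mitigation())
```

Under these assumptions the system is overloaded by design: without mitigation the backlog always climbs, so every revert to zero throttle restarts the ramp.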

Step 1: Write the model

import random
from dataclasses import dataclass

@dataclass
class State:
    backlog: float
    worker_count: int
    mitigation: float  # 0.0 means no throttle, 1.0 means full throttle

def simulate(
    steps=200,
    incident_threshold=120.0,
    base_arrival_rate=30.0,
    base_processing_per_worker=2.0,
    worker_capacity_jitter=0.2,
    production_volatility=0.25,
    recovery_steps=10,
    # "postmortem policy" parameters:
    # how strongly we adjust mitigation after observing an incident
    postmortem_gain=0.10,
    # how quickly we react after an incident is detected
    reaction_delay=0,
    seed=7,
):
    random.seed(seed)
    state = State(backlog=0.0, worker_count=10, mitigation=0.0)

    history = []
    incident_count = 0
    last_incident_step = -10**9

    # A simple "incident recovery schedule": for a fixed number of steps after each incident,
    # we keep the throttling level that the postmortem policy computed.
    recovery_timer = 0
    scheduled_mitigation = 0.0

    for t in range(steps):
        # 1) arrivals fluctuate (think upstream traffic changes)
        arrival = base_arrival_rate * (1.0 + random.uniform(-production_volatility, production_volatility))

        # 2) mitigation throttles producers: higher mitigation => fewer arrivals
        effective_arrival = arrival * (1.0 - state.mitigation)

        # 3) processing capacity fluctuates slightly (think GC pauses / noisy neighbors)
        capacity_jitter = 1.0 + random.uniform(-worker_capacity_jitter, worker_capacity_jitter)
        service_capacity = state.worker_count * base_processing_per_worker * capacity_jitter

        # 4) backlog evolves: backlog increases by arrivals, decreases by service capacity
        state.backlog = max(0.0, state.backlog + effective_arrival - service_capacity)

        # 5) detect incident based on current backlog snapshot
        if state.backlog >= incident_threshold and (t - last_incident_step) > recovery_steps:
            incident_count += 1
            last_incident_step = t
            # "Postmortem action": compute an intensity based on how far we overshot the threshold
            # Overshoot is (backlog - threshold). Larger overshoot => higher mitigation.
            overshoot = state.backlog - incident_threshold

            # Reaction delay models that real postmortem-driven changes are not instantaneous.
            react_at = t + reaction_delay

            # Clamp mitigation so it stays in [0,1]
            scheduled_mitigation = max(0.0, min(1.0, state.mitigation + postmortem_gain * overshoot))
            # Record the step at which the recovery window should start (detection time plus delay).
            recovery_timer = max(recovery_timer, react_at)

        # 6) Apply recovery schedule: if we're past the scheduled incident reaction time,
        # keep mitigation for recovery_steps, then drop back to baseline.
        if recovery_timer and t >= recovery_timer:
            if t < recovery_timer + recovery_steps:
                # Inside the recovery window: hold the scheduled mitigation level
                state.mitigation = scheduled_mitigation
            else:
                # Recovery window ended: mitigation resets (like reverting hotfixes after stability)
                state.mitigation = 0.0
                recovery_timer = 0

        history.append((t, state.backlog, state.worker_count, state.mitigation))

    return history, incident_count

def summarize(history, window=50):
    # Crude stability summary over the last `window` steps:
    # backlog variance plus the minimum and maximum backlog.
    late = history[-window:]
    backlogs = [b for (_, b, _, _) in late]
    mean = sum(backlogs) / len(backlogs)
    var = sum((x - mean) ** 2 for x in backlogs) / len(backlogs)
    return var, min(backlogs), max(backlogs)
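
The summary above reports variance and extremes but does not count how often backlog crosses the incident threshold. A helper along these lines could fill that gap. It is a minimal sketch: count_upward_crossings is a hypothetical name, and the sample history is made-up data in the same (t, backlog, worker_count, mitigation) shape, not output from the simulation.

```python
def count_upward_crossings(backlogs, threshold):
    """Count transitions from below the threshold to at-or-above it."""
    crossings = 0
    for prev, curr in zip(backlogs, backlogs[1:]):
        if prev < threshold <= curr:
            crossings += 1
    return crossings


# Illustrative, hand-written history tuples: (t, backlog, worker_count, mitigation)
sample_history = [
    (0, 100.0, 10, 0.0),
    (1, 125.0, 10, 0.0),
    (2, 90.0, 10, 0.6),
    (3, 130.0, 10, 0.0),
]
sample_backlogs = [b for (_, b, _, _) in sample_history]
print(count_upward_crossings(sample_backlogs, 120.0))  # prints 2
```

Counting only upward crossings keeps one long excursion above the threshold from being counted more than once.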

Step 2: Run a “bad” postmortem policy and watch oscillation

This policy mimics a common failure:

high postmortem_gain: we overreact based on overshoot
reaction_delay: changes are delayed, so we keep suffering before the policy kicks in

history_bad, incidents_bad = simulate(
    steps=220,
    incident_threshold=120.0,
    postmortem_gain=0.12,   # too strong
    reaction_delay=6,       # delayed change
    recovery_steps=12,
    seed=3
)
var_bad, min_bad, max_bad = summarize(history_bad)

print(incidents_bad, var_bad, min_bad, max_bad)

If I plot backlog, the signature looks like this conceptually:

backlog ramps up slowly
once it crosses 120, we’re already in trouble
mitigation kicks in late
throttling cuts arrivals well below capacity
backlog plunges, mitigation resets, backlog ramps again

That’s oscillation: the control loop is “chasing” a signal with too much gain and too much delay.

Turning postmortems into a control loop: reduce gain and react sooner

The systems-thinking insight here is simple: in feedback control, gain and delay matter. A postmortem-driven mitigation policy is effectively a controller that chooses an action intensity based on how bad the system looked at an incident boundary.
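
To see the gain-and-delay interaction in isolation, here is a stripped-down sketch that is deliberately simpler than the queue model: a single error value corrected in proportion to a stale observation of itself. This linear loop is an assumption for illustration (the simulation in this post uses a thresholded, clamped policy instead), and run_delayed_loop and late_amplitude are hypothetical names.

```python
def run_delayed_loop(gain, delay, steps=120, start_error=1.0):
    """Iterate e[t+1] = e[t] - gain * e[t - delay] and return the full trace."""
    # Pre-fill enough history so that e[t - delay] is always defined.
    errors = [start_error] * (delay + 1)
    for _ in range(steps):
        stale_observation = errors[-1 - delay]  # the controller acts on old data
        errors.append(errors[-1] - gain * stale_observation)
    return errors


def late_amplitude(errors, window=10):
    """Largest absolute error over the last `window` steps."""
    return max(abs(e) for e in errors[-window:])


cases = [
    ("gain 0.8, no delay", 0.8, 0),
    ("gain 0.8, delay 3 ", 0.8, 3),
    ("gain 0.2, delay 3 ", 0.2, 3),
]
for label, gain, delay in cases:
    print(label, late_amplitude(run_delayed_loop(gain, delay)))
```

The same gain that settles quickly with fresh observations produces growing swings once the observation is three steps old, and lowering the gain restores convergence. That is the trade-off the two policy changes below are aimed at.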

So I changed the policy to be more conservative and more immediate:

reduce postmortem_gain
reduce reaction_delay (modeling faster operational changes, like feature flags or safer runtime configuration)

history_good, incidents_good = simulate(
    steps=220,
    incident_threshold=120.0,
    postmortem_gain=0.05,   # less aggressive
    reaction_delay=2,       # faster reaction
    recovery_steps=12,
    seed=3
)
var_good, min_good, max_good = summarize(history_good)

print(incidents_good, var_good, min_good, max_good)

In my run, this reduced:

the number of incidents
the late-period variance (backlog stayed more stable)
the “overshoot magnitude” (max backlog was less extreme)

The key is philosophical but practical: postmortems become part of the runtime system. If the changes they encode are too aggressive and arrive too late, they act like a controller with high gain and phase lag, a classic recipe for oscillation.

The “aha” part: measuring what the postmortem changes will do next week

I started mapping postmortem actions into three categories, based on how they change the feedback loop:

Actuators: knobs that directly change system dynamics (worker count, throttling, concurrency)
Sensors: signals that trigger incident handling (backlog threshold, latency SLO, error rate)
Policy: the decision rule after an incident (how much to change, how long to keep it, when to revert)

The oscillation in my simulation happened because the policy was implicitly doing:

“When backlog is high, set mitigation to a big value”
“Keep it only during recovery”
“Reset to zero afterward”

That last line (“reset afterward”) is another hidden control-loop design choice. In real systems, it shows up as reverting hotfixes too quickly or not persisting the improvement that would prevent recurrence.
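
One alternative to a hard reset is to hold the mitigation until backlog is comfortably low and then ease it off in steps. The sketch below is only an illustration of that idea, not the rule used in the simulation: next_mitigation, release_below, and decay are hypothetical names, and the values 60.0 and 0.8 are arbitrary.

```python
def next_mitigation(current, backlog, release_below=60.0, decay=0.8):
    """Hypothetical revert rule: hold while backlog is elevated, then ease off gradually."""
    if backlog >= release_below:
        return current                  # still elevated: keep the throttle as it is
    eased = current * decay             # comfortably low: relax by a fixed fraction
    return eased if eased >= 0.01 else 0.0  # snap to zero once negligible


# Backlog still high: the throttle is held.
print(next_mitigation(0.5, backlog=100.0))

# Backlog low: the throttle relaxes a little on each tick instead of dropping to zero.
level = 0.5
trace = []
for _ in range(30):
    level = next_mitigation(level, backlog=40.0)
    trace.append(level)
print(trace[:3], trace[-1])
```

The release threshold sits well below the incident threshold on purpose, so the revert decision is not made at the same boundary that triggered the incident.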

So I updated my postmortem writing checklist to include a control-loop view:

Postmortem checklist I actually used

What signal triggered the response? (sensor)
What knob changed? (actuator)
How did we scale the magnitude? (gain)
How long until it took effect? (delay)
What was the revert condition? (how long the controller stays “on”)
What stable regime should the system converge to? (desired equilibrium)
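
If it helps to make the checklist hard to skip, the six questions can be captured as a small record that flags unanswered fields. This is a sketch under assumptions: ControlLoopReview and its field names are hypothetical, and the example answers are illustrative values loosely based on the "bad" policy parameters above.

```python
from dataclasses import dataclass, fields


@dataclass
class ControlLoopReview:
    """One postmortem, described in control-loop terms (all fields are free text)."""
    sensor: str              # what signal triggered the response
    actuator: str            # what knob changed
    gain: str                # how the magnitude was scaled
    delay: str               # how long until it took effect
    revert_condition: str    # when the mitigation is removed
    target_equilibrium: str  # the stable regime we expect to converge to

    def unanswered(self):
        """Names of checklist fields left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Illustrative answers only.
review = ControlLoopReview(
    sensor="queue backlog >= 120",
    actuator="producer throttle",
    gain="throttle raised by 0.12 per job of overshoot",
    delay="6 ticks between detection and rollout",
    revert_condition="",
    target_equilibrium="",
)
print(review.unanswered())  # prints ['revert_condition', 'target_equilibrium']
```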

That’s not extra bureaucracy. It’s a way to prevent “documented learning” from turning into “repeated oscillation.”

Connecting this back to real incident culture

When teams treat postmortems as only meaning-making (“here’s why it happened”), they miss that postmortems are also behavior changes. The system doesn’t care that we understood the root cause; it responds to the actions we encode into dashboards, feature flags, autoscaling rules, retry policies, and operational runbooks.

In other words: systems thinking turns root-cause analysis into trajectory analysis—what happens next after we deploy the fix, not just what happened during the incident.

Conclusion

I learned that incident postmortems function like a feedback controller: the “action selection” rule (gain), the time it takes to apply it (delay), and when mitigations revert all shape whether a system stabilizes or oscillates. By modeling backlog recovery as control logic and tuning the postmortem-driven policy to be less aggressive and faster to apply, I saw oscillations dampen and recurrence drop. That reframed tech philosophy for me: postmortems aren’t only about explaining failures—they’re part of how we steer the system’s future dynamics.