Weekend Notes On Designing A Deterministic “Incident Timeline” For Event-Driven Systems

The problem I couldn’t ignore

I once ran an incident where the system behaved “correctly” by every metric—but the story didn’t line up. Alerts told one narrative, logs told another, and our runbook assumed a third. Afterward, we spent hours arguing about what happened first, because we treated the timeline as a byproduct of debugging instead of a first-class artifact.

That experience pushed me toward a very specific design goal:

Build a deterministic incident timeline generator that can turn a stream of events into a single, reproducible “what happened” narrative—even when events arrive out of order.

This is a tech philosophy choice: when reality is messy (and in distributed systems it is), I try to make the process of understanding it deterministic.

The philosophy choice: “Reproducibility beats intuition”

Systems thinking says components interact through feedback loops, delays, and couplings. In incidents, the coupling is between:

Event ingestion timing (arrival order)
Causality (what really caused what)
Human interpretation (how we narrate the incident)

If you can’t reproduce the same incident timeline from the same underlying events, you can’t reliably compare “today’s fix” against “yesterday’s failure.” So I designed the timeline builder to be deterministic.

Two principles guided me:

Sort using explicit ordering rules, not arrival order.
Break ties consistently using stable identifiers, so two runs produce byte-for-byte identical output.
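
A minimal sketch of those two principles, assuming events are reduced to plain (emitted_at, event_id) tuples for brevity. The name order_key is hypothetical and is not part of the implementation later in this post. The loop shuffles the input and confirms that the sorted result does not depend on arrival order.

```python
import random

# Hypothetical, simplified events: (emitted_at, event_id).
# Some share a timestamp, so the tie-breaker decides their relative order.
events = [(200, "b"), (100, "z"), (200, "a"), (100, "c")]


def order_key(event):
    """Explicit ordering rule: time first, then the stable ID as tie-breaker."""
    emitted_at, event_id = event
    return (emitted_at, event_id)


baseline = sorted(events, key=order_key)

# Simulate different arrival orders with a seeded shuffle.
rng = random.Random(0)
for _ in range(20):
    arrival = events[:]
    rng.shuffle(arrival)
    # Arrival order changes; the sorted result does not.
    assert sorted(arrival, key=order_key) == baseline

print(baseline)  # [(100, 'c'), (100, 'z'), (200, 'a'), (200, 'b')]
```

Dropping event_id from the key would leave ties to Python's stable sort, which preserves input order, so the result would again depend on arrival order.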

A concrete model I used

Each event record had:

event_id: unique stable ID (string)
emitted_at: when the system claims it emitted the event (integer timestamp)
received_at: when my collector received it (integer timestamp)
correlation_id: to group related events (string)
type: e.g. request_started, request_failed, service_scaled
causal_ref: optional pointer to another event by event_id

Key detail: causal references

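If an event declares causal_ref and the referenced event is present in the same correlation group, I place it after the event it causally depends on. If it has no causal_ref, or the reference can’t be resolved, I fall back to time ordering.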
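
One case worth checking up front is a causal_ref that points at an event the collector never received. A minimal sketch of how such references could be surfaced before building a timeline, assuming events are dicts with the fields listed above. The helper name find_dangling_refs is hypothetical.

```python
from typing import List


def find_dangling_refs(events: List[dict]) -> List[str]:
    """Return IDs of events whose causal_ref names an event we never saw.

    Hypothetical helper: the result is sorted so the report itself is stable.
    """
    known_ids = {e["event_id"] for e in events}
    dangling = [
        e["event_id"]
        for e in events
        if e.get("causal_ref") is not None and e["causal_ref"] not in known_ids
    ]
    return sorted(dangling)


events = [
    {"event_id": "e1", "causal_ref": None},
    {"event_id": "e2", "causal_ref": "e1"},      # resolvable
    {"event_id": "e3", "causal_ref": "e-lost"},  # parent never arrived
]
print(find_dangling_refs(events))  # ['e3']
```

Whether a dangling reference should be logged, rejected, or silently treated as "no cause" is a policy decision. The sketch only makes the case visible.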

Step-by-step: deterministic timeline builder in code

Below is a working Python implementation that:

Builds a dependency graph from causal_ref.
Produces a topological ordering (a linearization consistent with dependencies).
Uses deterministic tie-breaking so the ordering is stable.
Emits a timeline grouped by correlation_id.

from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, List, Dict, Tuple
import heapq
import json


@dataclass(frozen=True)
class Event:
    event_id: str
    emitted_at: int
    received_at: int
    correlation_id: str
    type: str
    causal_ref: Optional[str] = None


def deterministic_timeline(events: List[Event]) -> Dict[str, List[dict]]:
    """
    Returns a deterministic timeline grouped by correlation_id.

    Determinism rules:
    - Primary ordering comes from causal dependencies (causal_ref graph).
    - When multiple events are available to schedule next, ties are resolved using:
      (emitted_at, received_at, event_id)
    """
    # Group events by correlation id first (keeps timelines readable)
    by_corr: Dict[str, List[Event]] = {}
    for e in events:
        by_corr.setdefault(e.correlation_id, []).append(e)

    timeline_by_corr: Dict[str, List[dict]] = {}

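    # Iterate in sorted correlation_id order so the result does not depend on arrival order.
    for corr_id, corr_events in sorted(by_corr.items(), key=lambda item: item[0]):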
        # Index for quick lookup
        by_id: Dict[str, Event] = {e.event_id: e for e in corr_events}

        # Build adjacency list: ref_event -> list of dependent events
        dependents: Dict[str, List[str]] = {e.event_id: [] for e in corr_events}
        indegree: Dict[str, int] = {e.event_id: 0 for e in corr_events}

        for e in corr_events:
            if e.causal_ref is not None and e.causal_ref in by_id:
                # Edge: causal_ref -> e
                dependents[e.causal_ref].append(e.event_id)
                indegree[e.event_id] += 1

        # Priority queue for "available" nodes (indegree == 0)
        # Heap key provides deterministic tie-breaking.
        def heap_key(event_id: str) -> Tuple[int, int, str]:
            ev = by_id[event_id]
            return (ev.emitted_at, ev.received_at, ev.event_id)

        heap: List[Tuple[Tuple[int, int, str], str]] = []
        for event_id, deg in indegree.items():
            if deg == 0:
                heapq.heappush(heap, (heap_key(event_id), event_id))

        ordered: List[str] = []
        while heap:
            _, event_id = heapq.heappop(heap)
            ordered.append(event_id)
            for child in dependents[event_id]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    heapq.heappush(heap, (heap_key(child), child))

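        # If there are cycles (bad causal data) or duplicate event_ids, the loop
        # above cannot schedule every event. Fall back to pure time ordering for
        # the whole group: causal ordering is dropped for this group, but the
        # output stays reproducible. Cycles should be rare; this is defensive.
        if len(ordered) != len(corr_events):
            ordered = sorted(
                (e.event_id for e in corr_events),
                key=heap_key,
            )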

        timeline_by_corr[corr_id] = [
            {
                "event_id": by_id[eid].event_id,
                "type": by_id[eid].type,
                "emitted_at": by_id[eid].emitted_at,
                "received_at": by_id[eid].received_at,
                "causal_ref": by_id[eid].causal_ref,
            }
            for eid in ordered
        ]

    return timeline_by_corr


if __name__ == "__main__":
    # Example: arrival order is intentionally scrambled.
    raw = [
        {"event_id": "e3", "emitted_at": 300, "received_at": 305, "correlation_id": "c1", "type": "request_failed", "causal_ref": "e2"},
        {"event_id": "e2", "emitted_at": 200, "received_at": 310, "correlation_id": "c1", "type": "timeout_detected", "causal_ref": "e1"},
        {"event_id": "e1", "emitted_at": 100, "received_at": 320, "correlation_id": "c1", "type": "request_started", "causal_ref": None},
        # Another correlation group
        {"event_id": "e4", "emitted_at": 150, "received_at": 400, "correlation_id": "c2", "type": "service_scaled", "causal_ref": None},
        {"event_id": "e5", "emitted_at": 160, "received_at": 410, "correlation_id": "c2", "type": "latency_recovered", "causal_ref": "e4"},
    ]

    events = [Event(**r) for r in raw]
    timeline = deterministic_timeline(events)

    # Print deterministically: sorting keys makes the output stable too.
    print(json.dumps(timeline, indent=2, sort_keys=True))

What each block is doing (and why)

Grouping by correlation_id: I want each timeline to represent one “thread” of causality (e.g., one request), not a blended soup of unrelated events.
Graph construction: If causal_ref is present and known, I create a directed edge: causal_ref -> event. This turns narrative into structure.
Topological ordering: A topological sort produces an ordering that respects dependencies. That means if event e2 claims it was caused by e1, the timeline won’t put e2 before e1.
Deterministic tie-breaking using a heap key: When multiple events are “available” (no unmet causal dependencies), I pick the next one using (emitted_at, received_at, event_id). This is the crucial philosophy part: I don’t trust whatever order events came in from the network or collector. I trust the explicit rules.
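Cycle handling fallback: If the causal graph is inconsistent (cycles), topological sorting can’t produce a complete ordering. I fall back to deterministic time ordering for that whole correlation group, so the output is still reproducible, though it no longer guarantees that every event follows its declared cause.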
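
To see which references a given timeline breaks, a small check can walk the output and flag any event listed before its declared cause. A minimal sketch, assuming a timeline is a list of dicts carrying event_id and causal_ref. The function name causal_violations is hypothetical.

```python
from typing import List, Tuple


def causal_violations(timeline: List[dict]) -> List[Tuple[str, str]]:
    """Return (event_id, causal_ref) pairs where the event precedes its cause.

    References to events missing from the timeline are ignored here.
    """
    position = {e["event_id"]: i for i, e in enumerate(timeline)}
    violations = []
    for e in timeline:
        ref = e.get("causal_ref")
        if ref is not None and ref in position:
            if position[e["event_id"]] < position[ref]:
                violations.append((e["event_id"], ref))
    return violations


# A consistent chain: no violations.
ok = [
    {"event_id": "e1", "causal_ref": None},
    {"event_id": "e2", "causal_ref": "e1"},
]
# A two-event cycle: any linear order must break one of the references.
cyclic = [
    {"event_id": "a", "causal_ref": "b"},
    {"event_id": "b", "causal_ref": "a"},
]
print(causal_violations(ok))      # []
print(causal_violations(cyclic))  # [('a', 'b')]
```

An empty result means the ordering is consistent with every resolvable causal_ref. A non-empty result on a fallback timeline points at the events whose causal data needs a closer look.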

“What happens when I run this?”

In the example, the input list is shuffled so arrival order is misleading:

e3 arrives first in the raw list, but it depends on e2
e2 depends on e1

A deterministic timeline generator should still output:

request_started (e1)
timeout_detected (e2)
request_failed (e3)

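When I run the script, the printed JSON reflects exactly that ordering. In this example the causal chain alone fixes the order; the (emitted_at, received_at, event_id) tie-breaker is what keeps the order stable across runs when several events are available at the same time.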
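
To check run-to-run stability mechanically, one option is to hash a canonical serialization of the timeline and compare digests between runs. A minimal sketch, assuming the timeline is a JSON-serializable dict shaped like the script's output. The function name timeline_fingerprint and the choice of SHA-256 are illustrative assumptions, not part of the script above.

```python
import hashlib
import json


def timeline_fingerprint(timeline: dict) -> str:
    """Hash a canonical JSON form of the timeline (hypothetical helper).

    sort_keys and fixed separators make the serialized bytes independent of
    dict insertion order; list order (the timeline itself) still matters.
    """
    canonical = json.dumps(timeline, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = {"c1": [{"event_id": "e1", "type": "request_started"}],
         "c2": [{"event_id": "e4", "type": "service_scaled"}]}
# Same content, different insertion order of keys.
run_b = {"c2": [{"type": "service_scaled", "event_id": "e4"}],
         "c1": [{"type": "request_started", "event_id": "e1"}]}

assert timeline_fingerprint(run_a) == timeline_fingerprint(run_b)

# Reordering events inside a timeline changes the fingerprint.
original = {"c1": [{"event_id": "e1"}, {"event_id": "e2"}]}
swapped = {"c1": [{"event_id": "e2"}, {"event_id": "e1"}]}
assert timeline_fingerprint(original) != timeline_fingerprint(swapped)
```

Storing the digest next to a postmortem gives a cheap way to confirm later that a regenerated timeline matches the one that was reviewed.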

The systems thinking connection

This small artifact helps with feedback loops in incident response:

Without determinism, teams create interpretation variance (“I think it happened first…”).
With determinism, the organization converges on a stable shared model: “Given these events, the timeline is X.”
That stabilizes learning—postmortems become comparisons against the same narrative, not re-litigated mysteries.

In other words, I treat debugging as a component in the system, not a side quest.

Practical takeaway I carried forward

I stopped thinking of logs as text we scan manually and started thinking of them as inputs to a deterministic transformation that produces an incident “story” we can trust. The philosophical shift is simple: make the understanding pipeline reproducible, so the system’s behavior can be improved rather than endlessly debated.

In the end, I learned that deterministic incident timelines aren’t about fancy algorithms—they’re a systems-thinking move that turns messy event streams into a stable narrative you can learn from.