Weekend Notes On Designing A Deterministic “Incident Timeline” For Event-Driven Systems
Written by
Elena Holos
The problem I couldn’t ignore
I once ran an incident where the system behaved “correctly” by every metric—but the story didn’t line up. Alerts told one narrative, logs told another, and our runbook assumed a third. Afterward, we spent hours arguing about what happened first, because we treated the timeline as a byproduct of debugging instead of a first-class artifact.
That experience pushed me toward a very specific design goal:
Build a deterministic incident timeline generator that can turn a stream of events into a single, reproducible “what happened” narrative—even when events arrive out of order.
This is a tech philosophy choice: when reality is messy (distributed systems are), I try to make the process of understanding deterministic.
The philosophy choice: “Reproducibility beats intuition”
Systems thinking says components interact through feedback loops, delays, and couplings. In incidents, the coupling is between:
- Event ingestion timing (arrival order)
- Causality (what really caused what)
- Human interpretation (how we narrate the incident)
If you can’t reproduce the same incident timeline from the same underlying events, you can’t reliably compare “today’s fix” against “yesterday’s failure.” So I designed the timeline builder to be deterministic.
Two principles guided me:
- Sort using explicit ordering rules, not arrival order.
- Break ties consistently using stable identifiers, so two runs produce byte-for-byte identical output.
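A minimal sketch of these two principles, assuming events are plain dicts carrying only emitted_at and event_id. The helper name canonical_order is hypothetical and exists only for this illustration:

```python
import json
import random


def canonical_order(events):
    """Sort by an explicit rule (emitted_at), breaking ties on the stable event_id."""
    return sorted(events, key=lambda e: (e["emitted_at"], e["event_id"]))


events = [
    {"event_id": "b", "emitted_at": 100},
    {"event_id": "a", "emitted_at": 100},  # same timestamp as "b": a tie
    {"event_id": "c", "emitted_at": 50},
]

# Simulate a different arrival order for the same underlying events.
shuffled = events[:]
random.Random(7).shuffle(shuffled)

first = json.dumps(canonical_order(events), sort_keys=True)
second = json.dumps(canonical_order(shuffled), sort_keys=True)
assert first == second  # identical serialized output regardless of arrival order

print([e["event_id"] for e in canonical_order(shuffled)])  # ['c', 'a', 'b']
```

Without event_id in the key, the relative order of "a" and "b" would depend on how they happened to arrive, because Python's sort is stable and preserves input order for equal keys.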
A concrete model I used
Each event record had:
- event_id: unique stable ID (string)
- emitted_at: when the system claims it emitted the event (integer timestamp)
- received_at: when my collector received it (integer timestamp)
- correlation_id: to group related events (string)
- type: e.g. request_started, request_failed, service_scaled
- causal_ref: optional pointer to another event by event_id
Key detail: causal references
If an event declares causal_ref, I can place it relative to the event it causally depends on. If it doesn’t, I fall back to time ordering.
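To see why the causal reference matters, consider two emitters with skewed clocks. The sketch below is a simplified illustration under that assumption, not the heap-based builder itself; the function name causal_order and the dict-shaped events are hypothetical:

```python
events = [
    {"event_id": "e1", "emitted_at": 200, "causal_ref": None},  # emitter's clock runs fast
    {"event_id": "e2", "emitted_at": 150, "causal_ref": "e1"},  # declares e1 as its cause
]


def time_key(event):
    return (event["emitted_at"], event["event_id"])


def causal_order(events):
    """Walk events in time order, but place a referenced event before its dependent."""
    by_id = {e["event_id"]: e for e in events}
    placed = []
    done = set()

    def place(event, visiting):
        eid = event["event_id"]
        if eid in done or eid in visiting:
            return  # already placed, or a reference cycle: stop recursing
        parent = by_id.get(event["causal_ref"])
        if parent is not None:
            place(parent, visiting | {eid})
        done.add(eid)
        placed.append(eid)

    for event in sorted(events, key=time_key):
        place(event, frozenset())
    return placed


by_time = [e["event_id"] for e in sorted(events, key=time_key)]
print(by_time)               # ['e2', 'e1']  (effect listed before its cause)
print(causal_order(events))  # ['e1', 'e2']  (cause first, despite the skewed clock)
```

Pure time ordering trusts every emitter's clock; the causal reference lets the event itself overrule a clock that is wrong.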
Step-by-step: deterministic timeline builder in code
Below is a working Python implementation that:
- Builds a dependency graph from causal_ref.
- Produces a topological ordering (a linearization consistent with dependencies).
- Uses deterministic tie-breaking so the ordering is stable.
- Emits a timeline grouped by correlation_id.
```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, List, Dict, Tuple
import heapq
import json


@dataclass(frozen=True)
class Event:
    event_id: str
    emitted_at: int
    received_at: int
    correlation_id: str
    type: str
    causal_ref: Optional[str] = None


def deterministic_timeline(events: List[Event]) -> Dict[str, List[dict]]:
    """
    Returns a deterministic timeline grouped by correlation_id.

    Determinism rules:
    - Primary ordering comes from causal dependencies (causal_ref graph).
    - When multiple events are available to schedule next, ties are resolved using:
      (emitted_at, received_at, event_id)
    """
    # Group events by correlation id first (keeps timelines readable)
    by_corr: Dict[str, List[Event]] = {}
    for e in events:
        by_corr.setdefault(e.correlation_id, []).append(e)

    timeline_by_corr: Dict[str, List[dict]] = {}

    for corr_id, corr_events in by_corr.items():
        # Index for quick lookup
        by_id: Dict[str, Event] = {e.event_id: e for e in corr_events}

        # Build adjacency list: ref_event -> list of dependent events
        dependents: Dict[str, List[str]] = {e.event_id: [] for e in corr_events}
        indegree: Dict[str, int] = {e.event_id: 0 for e in corr_events}

        for e in corr_events:
            if e.causal_ref is not None and e.causal_ref in by_id:
                # Edge: causal_ref -> e
                dependents[e.causal_ref].append(e.event_id)
                indegree[e.event_id] += 1

        # Priority queue for "available" nodes (indegree == 0)
        # Heap key provides deterministic tie-breaking.
        def heap_key(event_id: str) -> Tuple[int, int, str]:
            ev = by_id[event_id]
            return (ev.emitted_at, ev.received_at, ev.event_id)

        heap: List[Tuple[Tuple[int, int, str], str]] = []
        for event_id, deg in indegree.items():
            if deg == 0:
                heapq.heappush(heap, (heap_key(event_id), event_id))

        ordered: List[str] = []
        while heap:
            _, event_id = heapq.heappop(heap)
            ordered.append(event_id)
            for child in dependents[event_id]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    heapq.heappush(heap, (heap_key(child), child))

        # If there are cycles (bad causal data), fall back deterministically by time.
        # Cycles should be rare; this is defensive programming.
        if len(ordered) != len(corr_events):
            ordered = sorted((e.event_id for e in corr_events), key=heap_key)

        timeline_by_corr[corr_id] = [
            {
                "event_id": by_id[eid].event_id,
                "type": by_id[eid].type,
                "emitted_at": by_id[eid].emitted_at,
                "received_at": by_id[eid].received_at,
                "causal_ref": by_id[eid].causal_ref,
            }
            for eid in ordered
        ]

    return timeline_by_corr


if __name__ == "__main__":
    # Example: arrival order is intentionally scrambled.
    raw = [
        {"event_id": "e3", "emitted_at": 300, "received_at": 305,
         "correlation_id": "c1", "type": "request_failed", "causal_ref": "e2"},
        {"event_id": "e2", "emitted_at": 200, "received_at": 310,
         "correlation_id": "c1", "type": "timeout_detected", "causal_ref": "e1"},
        {"event_id": "e1", "emitted_at": 100, "received_at": 320,
         "correlation_id": "c1", "type": "request_started", "causal_ref": None},
        # Another correlation group
        {"event_id": "e4", "emitted_at": 150, "received_at": 400,
         "correlation_id": "c2", "type": "service_scaled", "causal_ref": None},
        {"event_id": "e5", "emitted_at": 160, "received_at": 410,
         "correlation_id": "c2", "type": "latency_recovered", "causal_ref": "e4"},
    ]

    events = [Event(**r) for r in raw]
    timeline = deterministic_timeline(events)

    # Print deterministically: sorting keys makes the output stable too.
    print(json.dumps(timeline, indent=2, sort_keys=True))
```
What each block is doing (and why)
- Grouping by correlation_id: I want each timeline to represent one “thread” of causality (e.g., one request), not a blended soup of unrelated events.
- Graph construction: If causal_ref is present and known, I create a directed edge: causal_ref -> event. This turns narrative into structure.
- Topological ordering: A topological sort produces an ordering that respects dependencies. That means if event e2 claims it was caused by e1, the timeline won’t put e2 before e1.
- Deterministic tie-breaking using a heap key: When multiple events are “available” (no unmet causal dependencies), I pick the next one using (emitted_at, received_at, event_id). This is the crucial philosophy part: I don’t trust whatever order events came in from the network or collector. I trust the explicit rules.
- Cycle handling fallback: If the causal graph is inconsistent (cycles), topological sorting can’t produce an ordering. I fall back to deterministic time ordering so the output is still reproducible.
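The fallback keeps the output reproducible, but it can be useful to know which events forced it. Here is a hedged sketch, assuming dict-shaped events and a hypothetical helper named unschedulable, that reports the events Kahn's algorithm could never schedule, meaning they sit in a cycle or downstream of one:

```python
from collections import deque


def unschedulable(events):
    """Return sorted IDs that a topological sort cannot place (in or after a cycle)."""
    ids = {e["event_id"] for e in events}
    indegree = {eid: 0 for eid in ids}
    dependents = {eid: [] for eid in ids}

    for e in events:
        ref = e.get("causal_ref")
        if ref in ids:  # unknown or missing references are ignored
            dependents[ref].append(e["event_id"])
            indegree[e["event_id"]] += 1

    # Standard Kahn's algorithm: repeatedly remove nodes with no unmet dependencies.
    queue = deque(sorted(eid for eid, deg in indegree.items() if deg == 0))
    while queue:
        eid = queue.popleft()
        for child in dependents[eid]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Anything still waiting on a dependency was blocked by a cycle.
    return sorted(eid for eid, deg in indegree.items() if deg > 0)


events = [
    {"event_id": "a", "causal_ref": "b"},
    {"event_id": "b", "causal_ref": "a"},   # a and b reference each other
    {"event_id": "c", "causal_ref": "a"},   # downstream of the cycle
    {"event_id": "d", "causal_ref": None},  # unaffected
]
print(unschedulable(events))  # ['a', 'b', 'c']
```

Logging this list next to the timeline would make it visible when the ordering came from timestamps rather than from causal structure.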
“What happens when I run this?”
In the example, the input list is shuffled so arrival order is misleading:
- e3 arrives first in the raw list, but it depends on e2
- e2 depends on e1
A deterministic timeline generator should still output:
- request_started (e1)
- timeout_detected (e2)
- request_failed (e3)
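When I run the script, the printed JSON reflects exactly that ordering. The order won’t change across runs: here the causal edges fix the sequence, and wherever several events are eligible at once, the (emitted_at, received_at, event_id) key gives a total order because event_id is unique.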
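One way to check the "won't change across runs" claim rather than take it on faith is to shuffle the input many times and compare a hash of the serialized output. The sketch below is an assumption-laden illustration: output_digest, is_order_independent, and the two stand-in builders are hypothetical names, and the builders are deliberately trivial so the block runs on its own.

```python
import hashlib
import json
import random


def output_digest(build, records):
    """Hash the canonical JSON form of build(records)."""
    payload = json.dumps(build(records), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_order_independent(build, records, trials=20, seed=0):
    """True if build() yields the same digest for every shuffled copy of records."""
    rng = random.Random(seed)
    baseline = output_digest(build, records)
    for _ in range(trials):
        shuffled = records[:]
        rng.shuffle(shuffled)
        if output_digest(build, shuffled) != baseline:
            return False
    return True


# Stand-in builders for the demonstration (not the real timeline builder).
def by_explicit_key(records):
    return sorted(records, key=lambda r: (r["emitted_at"], r["event_id"]))


def by_arrival(records):
    return list(records)


records = [{"event_id": f"e{i}", "emitted_at": 100 * i} for i in range(1, 6)]
print(is_order_independent(by_explicit_key, records))  # True
print(is_order_independent(by_arrival, records))       # False
```

To apply the same check to the builder above, the build argument could be a small wrapper such as lambda recs: deterministic_timeline([Event(**r) for r in recs]), passed together with the raw list of dicts.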
The systems thinking connection
This small artifact helps with feedback loops in incident response:
- Without determinism, teams create interpretation variance (“I think it happened first…”).
- With determinism, the organization converges on a stable shared model: “Given these events, the timeline is X.”
- That stabilizes learning—postmortems become comparisons against the same narrative, not re-litigated mysteries.
In other words, I treat debugging as a component in the system, not a side quest.
Practical takeaway I carried forward
I stopped thinking of logs as text we scan manually and started thinking of them as inputs to a deterministic transformation that produces an incident “story” we can trust. The philosophical shift is simple: make the understanding pipeline reproducible, so the system’s behavior can be improved rather than endlessly debated.
In the end, I learned that deterministic incident timelines aren’t about fancy algorithms—they’re a systems-thinking move that turns messy event streams into a stable narrative you can learn from.