Systems Thinking

June 8, 2026

PagerDuty Triage Bots and the “One-Line Blame” Trap

Written by Elena Holos

Last year I inherited an on-call rotation where every incident felt like the same small play: the pager went off, someone posted a terse message like “DB is down,” and the thread instantly turned into blame-ping-pong. The strangest part was that the facts usually were correct—services really did fail—but the collaboration was fragile enough that we wasted the first 15–30 minutes arguing about who “owned” the failing component.

What finally changed my thinking about this wasn’t an org chart overhaul. It was a tiny automation idea: a “triage bot” that posts a structured first message in the incident channel. I expected it to help people coordinate faster. Instead, I learned that the shape of the first message can accidentally encode a blame culture.

This post is about a very specific failure mode I hit while building that bot: the “one-line blame” trap—when a bot’s single summarized line becomes a social claim, not a technical observation.

The incident bot idea I built

My goal was simple: when an alert fires, the bot should:

  1. Collect context (service name, environment, error rate snapshot, recent deploys).
  2. Post a first message with links to dashboards and logs.
  3. Encourage collaboration by framing what we’re looking at.

Here’s the first version of the bot message I generated (conceptually):

“Root cause suspected: DB outage.”

It seemed harmless, but the result was predictable: the incident channel turned into ownership drama. Even though “DB outage” was likely, the sentence sounded like an accusation.

Why “one line” matters (and how the culture got encoded)

In incident collaboration, people rapidly optimize for social clarity because high stress makes careful reasoning harder. A single line that sounds final (“root cause suspected”) does two cultural things:

  • It collapses the inquiry space: people stop exploring alternative hypotheses (e.g., cache stampede, circuit breaker misconfiguration, or downstream saturation).
  • It assigns responsibility: “suspected” is technically a hedge, but the phrasing still reads as an assignment of accountability.

This is where systems thinking helped me. The technical system was failing, but the coordination system was failing too: the bot was a component in the human feedback loop. Its output wasn’t just information—it became an input into team dynamics.

A working example: an alert triage bot that posts the “wrong” kind of message

Below is a small, working Python script that demonstrates the “one-line blame” trap. It’s not a full PagerDuty integration; it’s a clear simulation: given an alert payload, it formats a message and prints it (in real life, it would POST to Slack).
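Before the script itself, a note on the Slack step it leaves out. A minimal sketch of that POST, assuming a Slack incoming webhook that accepts a JSON body with a `text` field: the URL below is a hypothetical placeholder, and `build_slack_request` and `post_to_slack` are illustrative names, not part of the original bot.

```python
import json
import urllib.request

# Hypothetical placeholder; a real webhook URL comes from your Slack app settings.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE/EXAMPLE/EXAMPLE"


def build_slack_request(text: str, webhook_url: str = SLACK_WEBHOOK_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the triage message."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_to_slack(text: str, webhook_url: str = SLACK_WEBHOOK_URL) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(text, webhook_url)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    # Only build and inspect the request here; nothing is sent over the network.
    req = build_slack_request("Service: checkout-api (prod)")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Keeping request construction separate from sending means the message body can be inspected in tests without touching the network. The rest of this post sticks to `print`.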

Step 1: Parse the alert payload

```python
# triage_bot_wrong.py
import json
from datetime import datetime


def parse_alert(payload: dict) -> dict:
    """
    Extracts the key fields the bot needs to build a triage message.
    """
    return {
        "service": payload["service"],
        "environment": payload["environment"],
        "alert_name": payload["alert_name"],
        "firing_at": payload["firing_at"],
        "signal": payload["signal"],
        "evidence": payload["evidence"],  # list of strings
        "recent_deploy": payload.get("recent_deploy"),  # optional dict
    }


if __name__ == "__main__":
    sample = {
        "service": "checkout-api",
        "environment": "prod",
        "alert_name": "SLO_BREACH_5m",
        "firing_at": "2026-06-08T12:34:56Z",
        "signal": "latency_p95",
        "evidence": [
            "p95 latency jumped to 3.8s",
            "error rate increased to 6.2%",
            "DB connection pool saturation detected",
        ],
        "recent_deploy": {
            "version": "2026.06.08-rc2",
            "deployed_at": "2026-06-08T12:10:00Z",
        },
    }
    alert = parse_alert(sample)
    print(json.dumps(alert, indent=2))
```

What this does and why: I keep the extracted fields explicit so later formatting choices are visible. That matters for incident culture—small wording differences are easy to overlook when the code hides them.

Step 2: Format a “confident” first line (the trap)

```python
# triage_bot_wrong.py (add below parse_alert; this __main__ block replaces the one from Step 1)
def format_wrong_message(alert: dict) -> str:
    """
    Demonstrates the one-line blame trap by using a root-cause-sounding sentence
    based on the most salient evidence line.
    """
    firing_at = datetime.fromisoformat(alert["firing_at"].replace("Z", "+00:00"))
    evidence = alert["evidence"]

    # Naively promote one evidence line to "the cause": the third line if there
    # are at least three, otherwise the last one. (An empty list would raise
    # IndexError; the sample always has evidence.)
    # In real alert payloads, this is often the most dramatic signal.
    headline = evidence[2] if len(evidence) >= 3 else evidence[-1]

    return "\n".join([
        f"🚨 Incident started ({alert['alert_name']}) at {firing_at.isoformat()}",
        f"Service: {alert['service']} ({alert['environment']})",
        f"Root cause suspected: {headline.replace('detected', '').strip()}",
        "Evidence:",
        *[f"- {line}" for line in evidence],
        f"Recent deploy: {alert['recent_deploy']['version']}"
        if alert.get("recent_deploy")
        else "Recent deploy: none",
    ])


if __name__ == "__main__":
    sample = {
        "service": "checkout-api",
        "environment": "prod",
        "alert_name": "SLO_BREACH_5m",
        "firing_at": "2026-06-08T12:34:56Z",
        "signal": "latency_p95",
        "evidence": [
            "p95 latency jumped to 3.8s",
            "error rate increased to 6.2%",
            "DB connection pool saturation detected",
        ],
        "recent_deploy": {
            "version": "2026.06.08-rc2",
            "deployed_at": "2026-06-08T12:10:00Z",
        },
    }
    alert = parse_alert(sample)
    msg = format_wrong_message(alert)
    print(msg)
```

What this does and why: it deliberately converts a piece of evidence (“DB connection pool saturation detected”) into a “Root cause suspected” headline. Technically, this is often a true statement—but socially, it behaves like a verdict.

That’s the trap: bots don’t just report facts; they trigger interpretations.
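To see how arbitrary that verdict is, here is a small sketch that isolates the headline-picking logic into a hypothetical helper, `pick_headline` (my name for illustration; it mirrors the indexing in `format_wrong_message`). It runs on the same three evidence lines in two different orders. The reordering is an assumption, standing in for any upstream change to how the alert lists its signals.

```python
def pick_headline(evidence: list[str]) -> str:
    """Mirror format_wrong_message: use the third line, else the last one."""
    line = evidence[2] if len(evidence) >= 3 else evidence[-1]
    return line.replace("detected", "").strip()


original = [
    "p95 latency jumped to 3.8s",
    "error rate increased to 6.2%",
    "DB connection pool saturation detected",
]

# Same facts, different order (hypothetical upstream change).
reordered = [original[2], original[0], original[1]]

if __name__ == "__main__":
    print("Root cause suspected:", pick_headline(original))
    print("Root cause suspected:", pick_headline(reordered))
```

The first line printed blames “DB connection pool saturation”; the second blames “error rate increased to 6.2%”. Nothing about the incident changed, only the list order, yet the channel would have opened with a different accusation.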

The cultural fix: switch from “root cause” to “hypotheses + next observations”

The repair wasn’t “be kinder.” It was to change the bot’s protocol:

  • Avoid “root cause” / “suspected” language in the first line.
  • Present hypotheses as “possible contributing factors.”
  • Add a short “next observation” list that invites technical exploration rather than ownership debate.
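One way to keep the first rule from eroding over time is a small wording check that runs against every message template. The sketch below is an assumption about how that could look, not something the original bot shipped with. `VERDICT_PATTERNS` and `find_verdict_phrases` are hypothetical names, and the phrase list covers only the two phrases named above. It scans the whole message rather than just the first line, which is the stricter choice.

```python
import re

# Phrases taken from the rule above; extend the list for your own team's vocabulary.
VERDICT_PATTERNS = [r"\broot cause\b", r"\bsuspected\b"]


def find_verdict_phrases(message: str) -> list[str]:
    """Return the verdict-sounding phrases found in a message, lowercased."""
    found = []
    for pattern in VERDICT_PATTERNS:
        match = re.search(pattern, message, flags=re.IGNORECASE)
        if match:
            found.append(match.group(0).lower())
    return found


if __name__ == "__main__":
    print(find_verdict_phrases("Root cause suspected: DB outage."))
    print(find_verdict_phrases("Possible contributing factors (not a verdict):"))
```

A check like this could run in a unit test for each message template, so a later wording tweak that reintroduces verdict language fails before it reaches an incident channel.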

Step 3: Format a safer message

```python
# triage_bot_right.py
from datetime import datetime


def format_right_message(alert: dict) -> str:
    """
    Cultural fix: avoid root-cause-sounding headlines.
    Frame evidence as observations and list possible contributing factors.
    """
    firing_at = datetime.fromisoformat(alert["firing_at"].replace("Z", "+00:00"))
    evidence = alert["evidence"]

    # Keep evidence as evidence; do not promote it to a verdict.
    # The hypotheses are hardcoded for this sample; they restate the evidence
    # as possibilities that encourage testing.
    hypotheses = [
        "DB pool saturation may be contributing to latency and errors.",
        "An interaction between recent deploy and downstream capacity could be amplifying load.",
    ]

    next_observations = [
        "Check DB pool saturation timeline vs. deploy timestamp.",
        "Compare p95 latency by endpoint to see if impact is uniform.",
        "Inspect recent query patterns for spikes or regressions.",
    ]

    recent_deploy = (
        f"{alert['recent_deploy']['version']} (deployed {alert['recent_deploy']['deployed_at']})"
        if alert.get("recent_deploy")
        else "none"
    )

    return "\n".join([
        f"🚨 Incident started ({alert['alert_name']}) at {firing_at.isoformat()}",
        f"Service: {alert['service']} ({alert['environment']})",
        "",
        "Observations (from alert signals):",
        *[f"- {line}" for line in evidence],
        "",
        "Possible contributing factors (not a verdict):",
        *[f"- {h}" for h in hypotheses],
        "",
        f"Recent deploy: {recent_deploy}",
        "",
        "Next observations (to unblock debugging):",
        *[f"- {n}" for n in next_observations],
    ])


if __name__ == "__main__":
    sample = {
        "service": "checkout-api",
        "environment": "prod",
        "alert_name": "SLO_BREACH_5m",
        "firing_at": "2026-06-08T12:34:56Z",
        "signal": "latency_p95",
        "evidence": [
            "p95 latency jumped to 3.8s",
            "error rate increased to 6.2%",
            "DB connection pool saturation detected",
        ],
        "recent_deploy": {
            "version": "2026.06.08-rc2",
            "deployed_at": "2026-06-08T12:10:00Z",
        },
    }
    # Same fields parse_alert extracts; that function lives in triage_bot_wrong.py.
    alert = {
        "service": sample["service"],
        "environment": sample["environment"],
        "alert_name": sample["alert_name"],
        "firing_at": sample["firing_at"],
        "signal": sample["signal"],
        "evidence": sample["evidence"],
        "recent_deploy": sample.get("recent_deploy"),
    }
    print(format_right_message(alert))
```

What changed and why: the bot now treats evidence as evidence. The message invites a shared debugging workflow:

  • It does not assign cause as a one-line claim.
  • It gives a testable sequence (timeline checks, endpoint breakdown, query inspection).

That’s incident culture engineering: changing the coordination inputs so the team’s mental models can converge on facts, not blame.
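The Step 3 script hardcodes its hypotheses to fit the sample alert. A possible next step, sketched here under assumptions, is to derive them from the evidence that is actually present, so the bot never hints at a DB or deploy problem the payload does not mention. The keyword table and the `suggest_hypotheses` helper are hypothetical. The hypothesis wording is reused from the script.

```python
from typing import Optional

# Hypothetical keyword rule (an assumption for illustration); the hypothesis
# text reuses the hedged wording from format_right_message.
KEYWORD_HYPOTHESES = [
    ("db connection pool", "DB pool saturation may be contributing to latency and errors."),
]
DEPLOY_HYPOTHESIS = (
    "An interaction between recent deploy and downstream capacity "
    "could be amplifying load."
)


def suggest_hypotheses(evidence: list[str], recent_deploy: Optional[dict] = None) -> list[str]:
    """Return hedged hypotheses supported by the evidence actually present."""
    lowered = [line.lower() for line in evidence]
    hypotheses = [
        text
        for keyword, text in KEYWORD_HYPOTHESES
        if any(keyword in line for line in lowered)
    ]
    # Only mention the deploy when the payload actually reports one.
    if recent_deploy:
        hypotheses.append(DEPLOY_HYPOTHESIS)
    return hypotheses


if __name__ == "__main__":
    sample_evidence = [
        "p95 latency jumped to 3.8s",
        "error rate increased to 6.2%",
        "DB connection pool saturation detected",
    ]
    for hypothesis in suggest_hypotheses(sample_evidence, {"version": "2026.06.08-rc2"}):
        print("-", hypothesis)
```

For the sample alert this reproduces the two hardcoded hypotheses. For an alert with no DB signal and no recent deploy it returns an empty list, which is more honest than a canned guess.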

What happened after deploying the “right” message format

In practice, the biggest visible changes were:

  • The first 10–15 minutes stopped turning into “who owns the DB.”
  • People still discussed DB saturation, but as a hypothesis to validate, not a conclusion.
  • When we did identify a real root cause, it was the result of investigation—not the product of an early summary.

The incident system improved because I treated the bot as part of the socio-technical feedback loop. It wasn’t just alerting; it was shaping how humans interpret uncertainty under stress.

Conclusion

I learned that incident culture isn’t only made of policies and postmortems—it’s also encoded in the smallest technical artifacts, like a bot’s first line. Building a triage bot taught me the “one-line blame” trap: when automation sounds like a verdict, it collapses collaboration into ownership arguments. Switching to evidence-first wording with hypotheses and next observations helped the team debug faster because it kept our mental models aligned with uncertainty rather than accountability.