MDSwap » Workflow » Production Incident Response Runbook

Production Incident Response Runbook

First 15 minutes when a production alert fires — acknowledge, triage, decide, communicate

Production Incident Response Runbook

Phase 1: Acknowledge (0–60 seconds)

Phase 2: Triage (1–5 minutes)

Phase 3: Decide (5–10 minutes)

Path A: Roll back

Whatever your rollback command is — know this command BEFORE the incident

or

or via your platform UI

# Production Incident Response Runbook The first 15 minutes after a production alert fires decide whether the incident is "we shipped a fix in 30 minutes" or "we're writing a public post-mortem tomorrow." This is the runbook. ## Phase 1: Acknowledge (0–60 seconds) **Goal: stop the alert from paging more people.** 1. Acknowledge the page in your alerting tool (PagerDuty, Opsgenie, etc.). Don't snooze without action. 2. Open `#incidents` channel (or equivalent) and type: > `🚨 acking <alert name>. investigating now. will update in 5 min.` 3. If this is your first incident on this service, **also page the secondary**. Don't be a hero. ## Phase 2: Triage (1–5 minutes) **Goal: understand the blast radius. Is it 5 users or 50,000?** Check, in order: 1. **Status page / dashboard.** Is the alert one symptom of something larger? Look for spikes across services. 2. **Error rate vs. baseline.** "Errors are up 10x" tells you almost nothing without baseline. "Errors went from 0.1% to 8% over 4…

By @meliwat - License: -

Raw markdown

Other Workflow MDs

Workflow Contract for AI-Driven Repos — Non-negotiables for shipping safely when parallel agents share one repo and a live user base — pre-flight, branch hygiene, live-site rules.
Workspine — A repo-native delivery spine that keeps AI-assisted work consistent across sessions, agents, and cold starts.
Mom Test User Interview Script — 30-min user interview guide that gets past politeness to real signal