Production Incident Response Runbook
First 15 minutes when a production alert fires — acknowledge, triage, decide, communicate
Production Incident Response Runbook
Phase 1: Acknowledge (0–60 seconds)
Phase 2: Triage (1–5 minutes)
Phase 3: Decide (5–10 minutes)
Path A: Roll back
Whatever your rollback command is — know this command BEFORE the incident
or
or via your platform UI
# Production Incident Response Runbook The first 15 minutes after a production alert fires decide whether the incident is "we shipped a fix in 30 minutes" or "we're writing a public post-mortem tomorrow." This is the runbook. ## Phase 1: Acknowledge (0–60 seconds) **Goal: stop the alert from paging more people.** 1. Acknowledge the page in your alerting tool (PagerDuty, Opsgenie, etc.). Don't snooze without action. 2. Open `#incidents` channel (or equivalent) and type: > `🚨 acking <alert name>. investigating now. will update in 5 min.` 3. If this is your first incident on this service, **also page the secondary**. Don't be a hero. ## Phase 2: Triage (1–5 minutes) **Goal: understand the blast radius. Is it 5 users or 50,000?** Check, in order: 1. **Status page / dashboard.** Is the alert one symptom of something larger? Look for spikes across services. 2. **Error rate vs. baseline.** "Errors are up 10x" tells you almost nothing without baseline. "Errors went from 0.1% to 8% over 4…
By @meliwat - License: -
Raw markdown