# Incident Runbook
You are writing or reviewing an incident runbook for: $ARGUMENTS
A runbook is a recipe, not a novel. It must be written so that an engineer seeing this system for the first time at 3 AM, having just been paged, can follow it and resolve the incident. Every step must be actionable. No step may require the reader to guess.
## Principles
- Assume the reader is tired. Write for 3 AM, not 10 AM. Short sentences. Direct commands. No ambiguity.
- Runbooks are executable, not descriptive. "Check the logs" is not a step. "Run `kubectl logs` and look for X" is a step.
- Every fix has a verification step. After the fix, how do you know it worked?
- Runbooks are living documents. Update them every time they are used. Mark steps that were wrong or missing.
- Link from the alert. The alert that wakes people up must link to the runbook. An unlinked runbook does not exist.
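As a concrete contrast for the "executable, not descriptive" principle, here is a descriptive step turned into an executable one. The log file and format are hypothetical stand-ins for whatever `kubectl logs` would return in your environment:

```shell
# Descriptive (bad): "Check the logs."
# Executable (good): exact command plus expected output.
# Sample log stands in for `kubectl logs deployment/payments` output:
cat > /tmp/payments.log <<'EOF'
INFO  request ok
FATAL connection refused
FATAL connection refused
EOF
fatals=$(grep -c FATAL /tmp/payments.log)
echo "FATAL count: $fatals"
# In the runbook, also state the expectation: "FATAL count: 0" when healthy
```

The point is the pairing: one command, one stated expected output, so the reader never has to interpret.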
## 1. Runbook Structure
Save runbooks to `docs/runbooks/[service]-[scenario].md`:

```
docs/runbooks/
  payments-high-error-rate.md
  database-connection-pool-exhausted.md
  worker-queue-backlog.md
  disk-full.md
  memory-leak.md
  certificate-expiry.md
```
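A small helper can scaffold a new runbook so the naming convention above stays consistent. This is a sketch; the service and scenario names are examples:

```shell
# Scaffold docs/runbooks/<service>-<scenario>.md (names are examples)
service="payments"
scenario="high-error-rate"
path="docs/runbooks/${service}-${scenario}.md"

mkdir -p docs/runbooks
cat > "$path" <<EOF
# Runbook: ${service} — ${scenario}
**Alert:** TODO
**Severity:** TODO
**Last tested:** $(date +%F)
**Owner:** TODO
EOF
echo "created $path"
```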
## 2. Runbook Template
# Runbook: [Service] — [Incident Type]
**Alert:** [Name of the alert that links here]
**Severity:** SEV-1 / SEV-2 / SEV-3
**Last tested:** YYYY-MM-DD
**Owner:** [Team or @person]
---
## Symptoms
- [What the monitoring/alert shows]
- [What users experience]
- [Other observable effects]
---
## Quick diagnosis (2 minutes)
Run these immediately to confirm the incident type and scope:
```bash
# Step 1 — confirm the alert is real
[command to check current state]
# Expected output: [what you should see]
# Step 2 — check scope
[command to check how many users/requests affected]
# Step 3 — check for a recent change that caused this
git log main --oneline --since="2 hours ago"
# or check your CD tool
```

---
## Common Causes

**Cause A: [Name]**
- Indicator: [How to identify this cause]
- Resolution: → Go to §3.A

**Cause B: [Name]**
- Indicator: [How to identify this cause]
- Resolution: → Go to §3.B
---
## Resolutions

### §3.A — [Cause A Resolution]

1. [Specific command or action]

   ```bash
   [command]
   ```

   Expected output: [what you should see when it's working]

2. [Next step]

3. Verify the fix:

   ```bash
   [verification command]
   ```

   Success: [what success looks like]

### §3.B — [Cause B Resolution]

[same structure]
---
## Escalation

If the steps above do not resolve the incident within 30 minutes:

- Escalate to: [name/team/PagerDuty policy]
- Share updates in: [Slack channel]
- Context to provide: current error rate, steps tried, timeline
---
## After Resolution

- Post a status update: "Resolved — [1-sentence summary of cause and fix]"
- Update this runbook with anything that was wrong or missing
- File a post-mortem for SEV-1 or SEV-2 (see the incident-responder agent)
## 3. Filled Example: Database Connection Pool Exhausted
# Runbook: PostgreSQL — Connection Pool Exhausted
**Alert:** `postgres_connection_utilisation > 90%`
**Severity:** SEV-2
**Last tested:** 2026-02-14
**Owner:** Platform team
---
## Symptoms
- API requests failing with "FATAL: remaining connection slots are reserved"
- Connection wait times > 5s in APM
- Monitoring shows `pg_stat_activity` connections near `max_connections` limit
---
## Quick diagnosis (2 minutes)
```bash
# Current connection count vs limit
psql $DATABASE_URL -c "
SELECT count(*) AS active,
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max
FROM pg_stat_activity WHERE state != 'idle';
"
# Who is holding connections?
psql $DATABASE_URL -c "
SELECT pid, usename, application_name, client_addr, state,
now() - query_start AS duration, left(query, 80) AS query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC LIMIT 20;
"
```

---
## Common Causes

**Cause A: Leaked connections (app not returning connections to the pool)**
- Indicator: connections in `idle in transaction` state for > 30 seconds
- Resolution: → §3.A

**Cause B: Pool size not set — default unlimited**
- Indicator: no max set in PgBouncer or ORM pool config
- Resolution: → §3.B

**Cause C: Traffic spike — need to scale**
- Indicator: normal connection counts per pod, but too many pods
- Resolution: → §3.C
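The 90% threshold from the alert can be confirmed by hand from the two numbers the diagnosis query returns. A minimal sketch, using illustrative counts rather than real query output:

```shell
# active/max as returned by the quick-diagnosis query (illustrative values)
active=92
max=100
util=$(( active * 100 / max ))
echo "utilisation: ${util}%"
if [ "$util" -ge 90 ]; then
  echo "confirmed: pool near exhaustion"
fi
```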
### §3.A — Kill leaked connections

```bash
# Kill idle-in-transaction connections older than 5 minutes
psql $DATABASE_URL -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - query_start > interval '5 minutes';
"
```

Then find the leak: grep the application logs for transactions that were started but never committed.

**Verify:** run the quick-diagnosis query again — the connection count should drop.
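The "find the leak" step can be made concrete. A sketch that scans an application log for transactions opened but never committed; the log format and transaction-id convention are assumptions, so adapt the `grep`/`awk` patterns to your own logs:

```shell
# Sample log in an assumed format: one BEGIN/COMMIT event per line, keyed by txn id
cat > /tmp/app.log <<'EOF'
txn=41 BEGIN
txn=41 COMMIT
txn=42 BEGIN
txn=43 BEGIN
txn=43 COMMIT
EOF

# Transactions that began but never committed are leak suspects
grep BEGIN  /tmp/app.log | awk '{print $1}' | sort > /tmp/began
grep COMMIT /tmp/app.log | awk '{print $1}' | sort > /tmp/committed
leaks=$(comm -23 /tmp/began /tmp/committed)
echo "leak suspects: $leaks"
```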
### §3.B — Set a connection pool limit

In the ORM config:

```bash
DATABASE_POOL_MAX=20
```

Restart the app pods:

```bash
kubectl rollout restart deployment/myapp -n production
kubectl rollout status deployment/myapp -n production
```
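When picking `DATABASE_POOL_MAX`, the per-pod limits must sum to less than the server's `max_connections`, with headroom for Postgres's reserved superuser slots. A back-of-envelope check, where every number is an assumption to replace with your own:

```shell
# All values illustrative: 10 pods x 20 connections vs max_connections=200
pods=10
pool_max=20
max_connections=200
superuser_reserved=10   # Postgres keeps some slots back for superusers

needed=$(( pods * pool_max ))
budget=$(( max_connections - superuser_reserved ))
echo "needed=$needed budget=$budget"
if [ "$needed" -gt "$budget" ]; then
  echo "WARNING: these pool settings can still exhaust the server"
fi
```

With these numbers the check fails (200 > 190), which is exactly the trap: a "limited" pool per pod can still exhaust the server once enough pods are running.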
### §3.C — Scale horizontally

```bash
kubectl scale deployment/myapp --replicas=10 -n production
kubectl rollout status deployment/myapp -n production
# Monitor pool usage — if it stays high, the per-pod pool max needs reducing too
```
---
## 4. Runbook Maintenance
After every incident where a runbook was used:
```markdown
## Runbook Update Checklist (after use)
- [ ] Was any step unclear or wrong? Fix it.
- [ ] Were there steps missing that you had to figure out yourself? Add them.
- [ ] Were any commands wrong for the current environment? Update them.
- [ ] Did the "expected output" sections match reality? Update them.
- [ ] Update "Last tested" date at the top.
```

## 5. Alert-to-Runbook Linking
Every monitoring alert must link to its runbook:
```yaml
# Prometheus alerting rule
- alert: PostgresConnectionPoolExhausted
  expr: pg_connection_utilisation > 0.9
  for: 2m
  annotations:
    summary: "PostgreSQL connection pool > 90%"
    runbook_url: "https://docs.internal/runbooks/database-connection-pool-exhausted"
    description: "Current utilisation: {{ $value | humanizePercentage }}"
```
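The "every alert must link to its runbook" rule is easy to enforce in CI. A rough sketch that compares alert count to `runbook_url` count in a rules file; the file path and contents here are illustrative, and a real check would parse the YAML properly rather than grep it:

```shell
# Sample rules file (illustrative)
cat > /tmp/alerts.yml <<'EOF'
- alert: PostgresConnectionPoolExhausted
  annotations:
    runbook_url: "https://docs.internal/runbooks/database-connection-pool-exhausted"
- alert: DiskFull
  annotations:
    summary: "Disk is full"
EOF

alerts=$(grep -c 'alert:' /tmp/alerts.yml)
links=$(grep -c 'runbook_url:' /tmp/alerts.yml)
echo "alerts=$alerts runbook links=$links"
if [ "$alerts" -ne "$links" ]; then
  echo "FAIL: some alerts lack a runbook_url"
fi
```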
```jsonc
// PagerDuty service ruleset
{
  "rule": {
    "conditions": [{ "expression": "event.summary matches 'connection pool'" }],
    "actions": {
      "annotate": { "value": "Runbook: https://docs.internal/runbooks/database-connection-pool-exhausted" }
    }
  }
}
```