# Incident Runbook
You are writing or reviewing an incident runbook for: $ARGUMENTS
A runbook is a recipe, not a novel. It must be written so that an engineer seeing this system for the first time at 3 AM, having just been paged, can follow it and resolve the incident. Every step must be actionable. No step may require the reader to guess.
## Principles
- Assume the reader is tired. Write for 3 AM, not 10 AM. Short sentences. Direct commands. No ambiguity.
- Runbooks are executable, not descriptive. "Check the logs" is not a step. "Run `kubectl logs` and look for X" is a step.
- Every fix has a verification step. After the fix, how do you know it worked?
- Runbooks are living documents. Update them every time they are used. Mark steps that were wrong or missing.
- Link from the alert. The alert that wakes people up must link to the runbook. An unlinked runbook does not exist.
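As a concrete contrast for the "executable, not descriptive" principle, here is a descriptive step turned into an executable one. The log file and format are hypothetical stand-ins for whatever `kubectl logs` would return in your environment:

```shell
# Descriptive (bad): "Check the logs."
# Executable (good): exact command plus expected output.
# Sample log stands in for `kubectl logs deployment/payments` output:
cat > /tmp/payments.log <<'EOF'
INFO  request ok
FATAL connection refused
FATAL connection refused
EOF
fatals=$(grep -c FATAL /tmp/payments.log)
echo "FATAL count: $fatals"
# In the runbook, also state the expectation: "FATAL count: 0" when healthy
```

The point is the pairing: one command, one stated expected output, so the reader never has to interpret.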
## 1. Runbook Structure
Save runbooks to `docs/runbooks/[service]-[scenario].md`:

```
docs/runbooks/
  payments-high-error-rate.md
  database-connection-pool-exhausted.md
  worker-queue-backlog.md
  disk-full.md
  memory-leak.md
  certificate-expiry.md
```
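A small helper can scaffold a new runbook so the naming convention above stays consistent. This is a sketch; the service and scenario names are examples:

```shell
# Scaffold docs/runbooks/<service>-<scenario>.md (names are examples)
service="payments"
scenario="high-error-rate"
path="docs/runbooks/${service}-${scenario}.md"

mkdir -p docs/runbooks
cat > "$path" <<EOF
# Runbook: ${service} — ${scenario}
**Alert:** TODO
**Severity:** TODO
**Last tested:** $(date +%F)
**Owner:** TODO
EOF
echo "created $path"
```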
## 2. Runbook Template
# Runbook: [Service] — [Incident Type]
**Alert:** [Name of the alert that links here]
**Severity:** SEV-1 / SEV-2 / SEV-3
**Last tested:** YYYY-MM-DD
**Owner:** [Team or @person]
---
## Symptoms
- [What the monitoring/alert shows]
- [What users experience]
- [Other observable effects]
---
## Quick diagnosis (2 minutes)
Run these immediately to confirm the incident type and scope:
```bash
# Step 1 — confirm the alert is real
[command to check current state]
# Expected output: [what you should see]
# Step 2 — check scope
[command to check how many users/requests affected]
# Step 3 — check for a recent change that caused this
git log main --oneline --since="2 hours ago"
# or check your CD tool
```

---
## Common Causes

**Cause A: [Name]**
- Indicator: [How to identify this cause]
- Resolution: → Go to §3.A

**Cause B: [Name]**
- Indicator: [How to identify this cause]
- Resolution: → Go to §3.B
---
## Resolutions

### §3.A — [Cause A Resolution]

1. [Specific command or action]

   ```bash
   [command]
   ```

   Expected output: [what you should see when it's working]

2. [Next step]

3. Verify the fix:

   ```bash
   [verification command]
   ```

   Success: [what success looks like]

### §3.B — [Cause B Resolution]

[same structure]
---
## Escalation

If the steps above do not resolve the incident within 30 minutes:

- Escalate to: [name/team/PagerDuty policy]
- Share updates in: [Slack channel]
- Context to provide: current error rate, steps tried, timeline
---
## After Resolution

- Post a status update: "Resolved — [1-sentence summary of cause and fix]"
- Update this runbook with anything that was wrong or missing
- File a post-mortem for SEV-1 or SEV-2 (see the incident-responder agent)
## 3. Filled Example: Database Connection Pool Exhausted
# Runbook: PostgreSQL — Connection Pool Exhausted
**Alert:** `postgres_connection_utilisation > 90%`
**Severity:** SEV-2
**Last tested:** 2026-02-14
**Owner:** Platform team
---
## Symptoms
- API requests failing with "FATAL: remaining connection slots are reserved"
- Connection wait times > 5s in APM
- Monitoring shows `pg_stat_activity` connections near `max_connections` limit
---
## Quick diagnosis (2 minutes)
```bash
# Current connection count vs limit
psql $DATABASE_URL -c "
SELECT count(*) AS active,
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max
FROM pg_stat_activity WHERE state != 'idle';
"
# Who is holding connections?
psql $DATABASE_URL -c "
SELECT pid, usename, application_name, client_addr, state,
now() - query_start AS duration, left(query, 80) AS query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC LIMIT 20;
"
```

---
## Common Causes

**Cause A: Leaked connections (app not returning connections to the pool)**
- Indicator: connections in `idle in transaction` state for > 30 seconds
- Resolution: → §3.A

**Cause B: Pool size not set — default unlimited**
- Indicator: no max set in PgBouncer or ORM pool config
- Resolution: → §3.B

**Cause C: Traffic spike — need to scale**
- Indicator: normal connection counts per pod, but too many pods
- Resolution: → §3.C
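The 90% threshold from the alert can be confirmed by hand from the two numbers the diagnosis query returns. A minimal sketch, using illustrative counts rather than real query output:

```shell
# active/max as returned by the quick-diagnosis query (illustrative values)
active=92
max=100
util=$(( active * 100 / max ))
echo "utilisation: ${util}%"
if [ "$util" -ge 90 ]; then
  echo "confirmed: pool near exhaustion"
fi
```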
### §3.A — Kill leaked connections

```bash
# Kill idle-in-transaction connections older than 5 minutes
psql $DATABASE_URL -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - query_start > interval '5 minutes';
"
```

Then find the leak: grep the application logs for transactions that were started but never committed.

**Verify:** run the quick-diagnosis query again — the connection count should drop.
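The "find the leak" step can be made concrete. A sketch that scans an application log for transactions opened but never committed; the log format and transaction-id convention are assumptions, so adapt the `grep`/`awk` patterns to your own logs:

```shell
# Sample log in an assumed format: one BEGIN/COMMIT event per line, keyed by txn id
cat > /tmp/app.log <<'EOF'
txn=41 BEGIN
txn=41 COMMIT
txn=42 BEGIN
txn=43 BEGIN
txn=43 COMMIT
EOF

# Transactions that began but never committed are leak suspects
grep BEGIN  /tmp/app.log | awk '{print $1}' | sort > /tmp/began
grep COMMIT /tmp/app.log | awk '{print $1}' | sort > /tmp/committed
leaks=$(comm -23 /tmp/began /tmp/committed)
echo "leak suspects: $leaks"
```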
### §3.B — Set a connection pool limit

In the ORM config:

```bash
DATABASE_POOL_MAX=20
```

Restart the app pods:

```bash
kubectl rollout restart deployment/myapp -n production
kubectl rollout status deployment/myapp -n production
```
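When picking `DATABASE_POOL_MAX`, the per-pod limits must sum to less than the server's `max_connections`, with headroom for Postgres's reserved superuser slots. A back-of-envelope check, where every number is an assumption to replace with your own:

```shell
# All values illustrative: 10 pods x 20 connections vs max_connections=200
pods=10
pool_max=20
max_connections=200
superuser_reserved=10   # Postgres keeps some slots back for superusers

needed=$(( pods * pool_max ))
budget=$(( max_connections - superuser_reserved ))
echo "needed=$needed budget=$budget"
if [ "$needed" -gt "$budget" ]; then
  echo "WARNING: these pool settings can still exhaust the server"
fi
```

With these numbers the check fails (200 > 190), which is exactly the trap: a "limited" pool per pod can still exhaust the server once enough pods are running.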
### §3.C — Scale horizontally

```bash
kubectl scale deployment/myapp --replicas=10 -n production
kubectl rollout status deployment/myapp -n production
# Monitor pool usage — if it stays high, the per-pod pool max needs reducing too
```
---
## 4. Runbook Maintenance
After every incident where a runbook was used:
```markdown
## Runbook Update Checklist (after use)
- [ ] Was any step unclear or wrong? Fix it.
- [ ] Were there steps missing that you had to figure out yourself? Add them.
- [ ] Were any commands wrong for the current environment? Update them.
- [ ] Did the "expected output" sections match reality? Update them.
- [ ] Update "Last tested" date at the top.
```

## 5. Alert-to-Runbook Linking
Every monitoring alert must link to its runbook:
```yaml
# Prometheus alerting rule
- alert: PostgresConnectionPoolExhausted
  expr: pg_connection_utilisation > 0.9
  for: 2m
  annotations:
    summary: "PostgreSQL connection pool > 90%"
    runbook_url: "https://docs.internal/runbooks/database-connection-pool-exhausted"
    description: "Current utilisation: {{ $value | humanizePercentage }}"
```
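The "every alert must link to its runbook" rule is easy to enforce in CI. A rough sketch that compares alert count to `runbook_url` count in a rules file; the file path and contents here are illustrative, and a real check would parse the YAML properly rather than grep it:

```shell
# Sample rules file (illustrative)
cat > /tmp/alerts.yml <<'EOF'
- alert: PostgresConnectionPoolExhausted
  annotations:
    runbook_url: "https://docs.internal/runbooks/database-connection-pool-exhausted"
- alert: DiskFull
  annotations:
    summary: "Disk is full"
EOF

alerts=$(grep -c 'alert:' /tmp/alerts.yml)
links=$(grep -c 'runbook_url:' /tmp/alerts.yml)
echo "alerts=$alerts runbook links=$links"
if [ "$alerts" -ne "$links" ]; then
  echo "FAIL: some alerts lack a runbook_url"
fi
```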
```jsonc
// PagerDuty service ruleset
{
  "rule": {
    "conditions": [{ "expression": "event.summary matches 'connection pool'" }],
    "actions": {
      "annotate": { "value": "Runbook: https://docs.internal/runbooks/database-connection-pool-exhausted" }
    }
  }
}
```