Step 3 · Post-Mortem
Goal
Understand exactly what happened, why it happened, and what systemic changes will prevent it from happening again. Leave the system more resilient than it was before the incident.
Instructions
You are in workflow step 3 of the incident-cycle. The incident is resolved. Your job is to analyse it deeply, document it clearly, and turn it into concrete improvements.
Principle: Post-mortems are blameless. We are examining systems and processes, not human beings. People do not fail — systems create conditions for failure. The question is always "what made this possible?", never "who did this?"
Tasks to Perform
1. Schedule the Post-Mortem Meeting
- Within 2 business days for SEV-1
- Within 5 business days for SEV-2
- Invite everyone who was involved (engineers, on-call, anyone who was paged)
- 45 minutes for most incidents; 60 minutes maximum
2. Reconstruct the Timeline
Gather evidence from all sources before the meeting:
```shell
# Git commits around the incident window (roughly 6 hours before the
# incident through 1 hour after resolution; fill in real timestamps)
git log main --format="%h %ai %s" \
  --since="YYYY-MM-DD HH:MM" --until="YYYY-MM-DD HH:MM"

# Deployment history
kubectl rollout history deployment/myapp -n production

# Alert history
# Export from PagerDuty/OpsGenie: all alerts from the incident window

# Log timestamps (-h suppresses filename prefixes when grepping multiple
# files, so awk sees the date and time in the first two fields)
grep -h "ERROR\|CRITICAL" /var/log/app/*.log | \
  awk '{print $1, $2}' | sort -u | head -50
```
Build a timeline with one-minute granularity for the critical window.
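The per-source extracts above can then be merged into one chronological list. A minimal sketch, assuming each source was saved as tab-separated `timestamp<TAB>event` lines (the file names, timestamps, and events below are illustrative sample data, not from a real incident):

```shell
# Merge hypothetical per-source extracts (git.log, deploys.log, alerts.log)
# into a single timeline sorted by minute-granularity timestamp.
mkdir -p /tmp/pm-timeline && cd /tmp/pm-timeline
printf '2024-01-15T09:40\tCommit abc1234 merged to main\n'     > git.log
printf '2024-01-15T09:42\tDeploy of myapp completed\n'         > deploys.log
printf '2024-01-15T09:47\tAlert fired: error rate above 5%%\n' > alerts.log

# ISO-8601 timestamps sort correctly as plain strings, so a default
# lexical sort of all sources yields the merged timeline.
sort git.log deploys.log alerts.log > timeline.tsv
cat timeline.tsv
```

Because ISO-8601 timestamps sort lexically, no custom sort key is needed; the merged `timeline.tsv` can be pasted straight into the post-mortem timeline table.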
3. Write the Post-Mortem Document
Create docs/incidents/YYYY-MM-DD-[short-title].md:
# Post-Mortem: [Title]
**Date:** YYYY-MM-DD
**Severity:** SEV-X
**Duration:** X hours Y minutes
**Impact:** [X% of users could not Y; Z transactions failed; $N estimated lost revenue]
**Authors:** [@names]
**Status:** Draft / Final
---
## Executive Summary
[2–3 sentences: what happened, what was the root cause, what was the business impact, what is being done to prevent recurrence]
---
## Timeline
All times in UTC.
| Time | Event |
|---|---|
| HH:MM | [What happened] |
| HH:MM | [Alert fired — which alert, what threshold] |
| HH:MM | [Incident declared by @who] |
| HH:MM | [Key investigation finding] |
| HH:MM | [Mitigation applied — what specifically] |
| HH:MM | [Service restored — confirmed how] |
| HH:MM | [Incident resolved] |
---
## Root Cause
[A precise, technical, blameless explanation of why the incident occurred.
Name the specific code path, configuration, infrastructure change, or process gap.
Go deep enough that someone new to the codebase would understand it.]
---
## Contributing Factors
These are conditions that made the incident worse, longer, or more likely:
- [Factor 1 — e.g., "No alert existed for this specific error type"]
- [Factor 2 — e.g., "Runbook for this scenario was not documented"]
- [Factor 3 — e.g., "Deployment went out during peak traffic hours"]
---
## What Went Well
[Acknowledge what worked — fast detection, fast rollback, good communication.
This tells the team what practices to reinforce.]
- [e.g., "Rollback was completed in under 3 minutes"]
- [e.g., "On-call engineer responded within 2 minutes of page"]
- [e.g., "Status page was updated before users began contacting support"]
---
## What Went Poorly
[Be honest. This is how teams improve.]
- [e.g., "Alert fired 20 minutes after the error rate started climbing"]
- [e.g., "Root cause took 45 minutes to find; a better runbook would have cut this to 10"]
---
## Action Items
| Action | Owner | Priority | Due |
|---|---|---|---|
| [Specific improvement] | [@person] | P1/P2/P3 | YYYY-MM-DD |
| Add alert for [metric] | [@person] | P1 | YYYY-MM-DD |
| Write runbook for [scenario] | [@person] | P2 | YYYY-MM-DD |
| Fix [root cause] | [@person] | P1 | YYYY-MM-DD |
---
## Acknowledgement
This document was reviewed and approved by: [@names]
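Creating the file from the naming convention above can be scripted. A minimal sketch (the short title and the stub header stamped into the file are illustrative; in practice you would copy the full template above):

```shell
# Scaffold docs/incidents/YYYY-MM-DD-<short-title>.md with today's date.
# TITLE is a hypothetical short title used for demonstration.
mkdir -p /tmp/pm-scaffold && cd /tmp/pm-scaffold
TITLE="db-pool"
FILE="docs/incidents/$(date +%F)-${TITLE}.md"
mkdir -p docs/incidents
printf '# Post-Mortem: %s\n\n**Date:** %s\n**Severity:** SEV-X\n**Status:** Draft\n' \
  "$TITLE" "$(date +%F)" > "$FILE"
echo "Created $FILE"
```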
4. Facilitate the Meeting
Structure:
- Timeline walk (15 min) — walk through the timeline together; let people correct or add to it
- 5 Whys (15 min) — ask "why did this happen?" five times to reach the systemic root cause
- Action items (15 min) — assign specific, timeboxed, ownable actions
- What went well (5 min) — end positively; reinforce good practices
Example 5 Whys:
Why did users see 500 errors? → Database was refusing connections
Why was the database refusing? → Connection pool was exhausted
Why was the pool exhausted? → 5 app pods all opened max connections simultaneously
Why did they open max connections? → New code had no connection pool limit set
Why was there no pool limit? → Code review didn't check for database config
→ Action: Add PR checklist item for database configuration changes
5. Track Action Items to Completion
Add all action items to TODO.md:
- [ ] Add alert for DB connection pool > 80% _(ref: docs/incidents/YYYY-MM-DD-db-pool.md)_
- [ ] Write runbook: database-connection-pool-exhausted _(ref: docs/incidents/YYYY-MM-DD-db-pool.md)_
- [ ] Add DB pool limit to app configuration template _(ref: docs/incidents/YYYY-MM-DD-db-pool.md)_
Review outstanding post-mortem actions in every sprint retrospective.
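Surfacing the outstanding items for a retrospective can be done mechanically, since each action carries an incident reference. A minimal sketch (the TODO.md contents below are illustrative sample data):

```shell
# List unchecked post-mortem action items: lines with an empty "[ ]"
# checkbox that also reference a docs/incidents/ post-mortem document.
mkdir -p /tmp/pm-actions && cd /tmp/pm-actions
cat > TODO.md <<'EOF'
- [x] Add alert for DB connection pool > 80% _(ref: docs/incidents/2024-01-15-db-pool.md)_
- [ ] Write runbook: database-connection-pool-exhausted _(ref: docs/incidents/2024-01-15-db-pool.md)_
- [ ] Unrelated task with no incident reference
EOF

# -F treats the pattern as a fixed string; -- stops option parsing since
# the pattern starts with a dash.
grep -F -- '- [ ]' TODO.md | grep 'docs/incidents/'
```

Completed items (`[x]`) and tasks without an incident reference are filtered out, leaving only the open post-mortem actions to review.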
Exit Criteria
- [ ] Post-mortem document is complete and reviewed by all participants
- [ ] Root cause is described precisely enough that a new engineer would understand it
- [ ] All contributing factors are listed
- [ ] Action items are specific, owned, and time-boxed
- [ ] Action items tracked in TODO.md
- [ ] Document shared with the broader team (not just the incident responders)
- [ ] Any affected runbooks updated with learnings
← Return to Step 1 · Detect and Triage for the next incident.