Step 1 · Detect and Triage
Goal
Confirm the incident is real, assess its impact, assign a severity level, and mobilise the right people — all within the first five minutes.
Instructions
You are in workflow step 1 of the incident-cycle. Something has been flagged as a possible incident. Your job is to confirm it, size it, and declare it so the right actions follow.
Tasks to Perform
1. Confirm the Incident Is Real
Automated alerts can misfire. Before declaring an incident:
# Check current error rate vs baseline
# (use your APM: Datadog, New Relic, Grafana, etc.)
# Check if the monitoring system itself is healthy
# A spike in "unknown" or "timeout" could be a scrape failure
# Quick sanity check: can you reproduce the user-reported issue?
curl -s -o /dev/null -w "%{http_code}" https://yourapp.com/health
If the issue is confirmed, proceed to triage. If it is a false alarm, document it, fix whatever caused the alert to misfire, and only then silence or retune the alert.
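A single probe can be misleading, so it may help to repeat the health check a few times and summarise the results. The sketch below is one way to do that; the function names `classify_code` and `probe_health` are illustrative, and the status labels are an assumption rather than part of this workflow. It relies on curl printing `000` for `%{http_code}` when no HTTP response was received.

```shell
# Map a curl %{http_code} value to a coarse status label.
# curl prints 000 when it received no HTTP response at all.
classify_code() {
  case "$1" in
    2??) echo healthy ;;
    5??) echo failing ;;
    000) echo unreachable ;;
    *)   echo unexpected ;;
  esac
}

# Probe URL ($1) a number of times ($2, default 5), one second apart,
# and print one status label per attempt.
probe_health() {
  url=$1
  attempts=${2:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")
    classify_code "$code"
    i=$((i + 1))
    sleep 1
  done
}
```

For example, `probe_health https://yourapp.com/health 5 | sort | uniq -c` prints a count per label, which makes an intermittent failure easier to spot than a single request would.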
2. Assess Impact
Answer these questions as specifically as possible:
- Who is affected? All users? A subset? A specific region, account type, or browser?
- What can they not do? Name the exact feature or action that is broken.
- Since when? Cross-reference alert start time with deployment times and cron schedules.
- How many? Error rate, affected user count, transaction failure count — get a number.
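# Errors per minute, for the 30 most recent minutes that logged an error
# (assumes each log line starts with "YYYY-MM-DD HH:MM:SS"; adapt to your
# log format, or replace with your logging/APM query)
grep -E "ERROR|500|Exception" /var/log/app/app.log | \
awk '{print $1, $2}' | cut -d: -f1,2 | sort | uniq -c | tail -30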
# Recent deployments (approximated by recent commits on main;
# check your deploy tool's own history if it keeps one)
git log main --oneline --since="3 hours ago"
# Recent cron / batch jobs
# (the unit is "cron" on Debian/Ubuntu and "crond" on RHEL-family systems)
sudo journalctl -u cron --since="3 hours ago" | tail -30
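To turn raw counts into the single number the "How many?" question asks for, a small helper can compute the failure percentage. This is a minimal sketch; `error_pct` is a hypothetical name, and where the failed and total counts come from (APM, access logs, a metrics query) is left to your setup.

```shell
# error_pct FAILED TOTAL
# Prints the failure percentage with two decimals, or "n/a" if TOTAL is not positive.
error_pct() {
  awk -v f="$1" -v t="$2" 'BEGIN {
    if (t <= 0) { print "n/a"; exit }
    printf "%.2f\n", (f / t) * 100
  }'
}
```

For instance, `error_pct 50 1000` prints `5.00`, which can be compared directly with the thresholds in the severity table.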
3. Assign a Severity Level
| Level | Declare when | Response |
|---|---|---|
| SEV-1 | Complete outage or data loss in progress | Wake everyone now |
| SEV-2 | Major feature down, >5% of requests failing | Page primary on-call |
| SEV-3 | Degraded performance, minor feature impaired | Notify team; handle urgently |
| SEV-4 | Low impact, no urgent user effect | Ticket; handle next business day |
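The table can be read as a top-down decision: check the SEV-1 conditions first, then fall through. The sketch below encodes that reading. It is an assumption that "major feature down" and ">5% of requests failing" each qualify for SEV-2 on their own; `suggest_severity` and its argument order are hypothetical, and the result is a suggestion for the Incident Commander, not a replacement for judgment.

```shell
# suggest_severity OUTAGE_OR_DATA_LOSS MAJOR_FEATURE_DOWN ERROR_PCT DEGRADED
# Arguments 1, 2 and 4 are "yes" or "no"; ERROR_PCT is a number such as 7.5.
suggest_severity() {
  # SEV-1: complete outage or data loss in progress
  if [ "$1" = yes ]; then echo SEV-1; return; fi
  # SEV-2: major feature down, or more than 5% of requests failing (assumed "or")
  if [ "$2" = yes ] || awk -v p="$3" 'BEGIN { exit !(p > 5) }'; then
    echo SEV-2; return
  fi
  # SEV-3: degraded performance or a minor feature impaired
  if [ "$4" = yes ]; then echo SEV-3; return; fi
  # SEV-4: everything else
  echo SEV-4
}
```

When the inputs are uncertain, declaring the higher candidate level and downgrading later keeps the response aligned with the table's intent.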
4. Declare and Communicate
Message to post in your incident channel:
🚨 SEV-[X] DECLARED — [One-sentence description]
Impact: [Who is affected and what they cannot do]
Started: [HH:MM UTC] (approximately)
Commander: [@you]
Status: Investigating
Next update: [HH:MM UTC, max 15 min from now]
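Filling the template by hand under pressure invites mistakes, especially the next-update time. The following sketch renders the message from its fields and sets the next update to 15 minutes after the current time. Both function names are hypothetical, all times are assumed to be HH:MM in UTC, and the first line uses a plain hyphen as the separator.

```shell
# add_minutes HH:MM N
# Prints the time N minutes later, wrapping past midnight.
add_minutes() {
  awk -v t="$1" -v n="$2" 'BEGIN {
    split(t, p, ":")
    m = (p[1] * 60 + p[2] + n) % 1440
    printf "%02d:%02d\n", int(m / 60), m % 60
  }'
}

# declare_message SEV DESCRIPTION IMPACT STARTED COMMANDER NOW
# Prints the declaration; the next update is NOW plus 15 minutes.
declare_message() {
  printf '🚨 SEV-%s DECLARED - %s\n' "$1" "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Started: %s UTC (approximately)\n' "$4"
  printf 'Commander: %s\n' "$5"
  printf 'Status: Investigating\n'
  printf 'Next update: %s UTC\n' "$(add_minutes "$6" 15)"
}
```

A call might look like `declare_message 2 "Checkout failing" "EU users cannot pay" 14:20 @alex "$(date -u +%H:%M)"`, with the output pasted into the incident channel.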
Create the incident document:
- Open docs/incidents/YYYY-MM-DD-[title].md
- Post status to your external status page (even "Investigating: we are aware of an issue")
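The incident document path follows the `docs/incidents/YYYY-MM-DD-[title].md` pattern. One possible way to build it consistently is sketched below; `incident_doc_path` is a hypothetical helper, and lowercasing the title and replacing non-alphanumeric characters with hyphens is an assumed convention, not something this workflow prescribes.

```shell
# incident_doc_path DATE TITLE
# Prints docs/incidents/DATE-slug.md, where slug is TITLE lowercased with
# every run of non-alphanumeric characters collapsed to a single hyphen.
incident_doc_path() {
  slug=$(printf '%s' "$2" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-//; s/-$//')
  printf 'docs/incidents/%s-%s.md\n' "$1" "$slug"
}
```

For example, `mkdir -p docs/incidents && touch "$(incident_doc_path "$(date -u +%Y-%m-%d)" "Checkout errors")"` creates an empty document for today's date.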
5. Assign Roles
| Role | Responsibility |
|---|---|
| Incident Commander | Makes decisions; keeps investigation moving |
| Technical Lead | Drives investigation; runs commands |
| Comms Lead | Writes status updates for users and stakeholders |
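Ideally, a different person handles each role. In a small team, the Incident Commander and Comms Lead can be the same person, but the Technical Lead should stay separate so that investigation is not interrupted by coordination.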
Exit Criteria
Before moving to Step 2:
- [ ] Incident confirmed as real (not a false alarm)
- [ ] Impact statement written: who, what, since when, how many
- [ ] Severity level assigned (SEV-1 through SEV-4)
- [ ] Incident declared publicly in team channel
- [ ] Roles assigned
- [ ] Status page updated
- [ ] Initial entry in incident timeline document
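If the checklist above is copied into the incident document (an assumption; this workflow does not require it), a quick check can list any boxes still unticked before the team moves on. `unchecked_items` is a hypothetical helper name.

```shell
# unchecked_items FILE
# Prints every unchecked "- [ ]" line with its line number.
# Returns 0 only when no unchecked items remain.
unchecked_items() {
  if grep -n '^- \[ \]' "$1"; then
    return 1
  fi
  return 0
}
```

Running `unchecked_items docs/incidents/YYYY-MM-DD-[title].md && echo "Ready for Step 2"` prints the confirmation only when every box is ticked.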
Next Step
→ Proceed to Step 2 · Investigate and Resolve