Step 1 · Detect and Triage

Goal

Confirm the incident is real, assess its impact, assign a severity level, and mobilise the right people — all within the first five minutes.

Instructions

You are in step 1 of the incident-cycle workflow. Something has been flagged as a possible incident. Your job is to confirm it, size it, and declare it so the right actions follow.


Tasks to Perform

1. Confirm the Incident Is Real

Automated alerts can misfire. Before declaring an incident:

# Check current error rate vs baseline
# (use your APM: Datadog, New Relic, Grafana, etc.)

# Check if the monitoring system itself is healthy
# A spike in "unknown" or "timeout" could be a scrape failure

# Quick sanity check: can you reproduce the user-reported issue?
# Prints the HTTP status code; 000 means no response was received.
curl -s -o /dev/null --max-time 10 -w "%{http_code}\n" https://yourapp.com/health
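
A single request can hit a transient blip, so it can help to repeat the probe a few times before deciding. The sketch below is one possible way to do that with the same curl call; the function name `probe_health`, the default of five attempts, and the two-second delay are illustrative assumptions, and the URL is the placeholder used above.

```shell
# Probe a URL several times and report how many attempts failed.
# Usage: probe_health URL [ATTEMPTS] [DELAY_SECONDS]
# The parenthesised body runs in a subshell, so its variables stay local.
probe_health() (
  url="$1"; attempts="${2:-5}"; delay="${3:-2}"
  failures=0; i=1
  while [ "$i" -le "$attempts" ]; do
    # curl prints 000 when no HTTP response was received at all
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url") || true
    case "$code" in
      2??) ;;                            # 2xx: treat as healthy
      *) failures=$((failures + 1)) ;;   # anything else counts as a failure
    esac
    echo "attempt $i: HTTP ${code:-none}"
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  echo "failed $failures of $attempts"
  # Exit status is non-zero when any attempt failed
  [ "$failures" -eq 0 ]
)
```

For example, `probe_health https://yourapp.com/health 5 2` prints one line per attempt and a summary. Consistent failures point to a real problem; a single failure among several successes is more likely noise, though it may still deserve a second look.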

If the issue is confirmed, proceed to triage. If it is a false alarm, document it, fix whatever caused the alert to misfire, and only then silence or re-tune the alert.

2. Assess Impact

Answer these questions as specifically as possible:

  • Who is affected? All users? A subset? A specific region, account type, or browser?
  • What can they not do? Name the exact feature or action that is broken.
  • Since when? Cross-reference alert start time with deployment times and cron schedules.
  • How many? Error rate, affected user count, transaction failure count — get a number.
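
# Error count per minute for the 30 most recent minutes that logged an error
# Replace with your logging/APM query. Assumes each log line starts with a
# date and an HH:MM:SS time; "500" also matches unrelated numbers such as 1500.
grep -E "ERROR|500|Exception" /var/log/app/app.log | \
  awk '{print $1, $2}' | cut -d: -f1,2 | sort | uniq -c | tail -30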

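# Recent deployments, approximated by commits on main (commit time is not
# deploy time; prefer your deploy tool's history if it has one).
# Assumes the remote is named "origin".
git fetch origin && git log origin/main --oneline --since="3 hours ago"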

# Recent cron / batch jobs
# (the unit is "cron" on Debian/Ubuntu and "crond" on RHEL/Fedora)
sudo journalctl -u cron --since="3 hours ago" | tail -30
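
To turn "how many?" into a single number, one option is to compute the share of 5xx responses from a web server access log. This is only a sketch: it assumes the log uses the common nginx/Apache "combined" layout, where the HTTP status is the ninth whitespace-separated field, and the name `error_rate` is hypothetical. Adjust the field number for your own log format.

```shell
# Print the percentage of requests that returned a 5xx status.
# Reads the files given as arguments, or standard input if there are none.
# Assumption: the HTTP status code is whitespace-separated field 9.
error_rate() {
  awk '
    # Only count lines whose ninth field looks like a status code
    $9 ~ /^[0-9][0-9][0-9]$/ { total++; if ($9 >= 500) errors++ }
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * errors / total
    }
  ' "$@"
}
```

A call such as `tail -n 5000 /var/log/nginx/access.log | error_rate` (the log path is an assumption) limits the figure to recent traffic. The result can then be compared with the baseline from your APM.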

3. Assign a Severity Level

| Level | Declare when | Response |
|-------|--------------|----------|
| SEV-1 | Complete outage or data loss in progress | Wake everyone now |
| SEV-2 | Major feature down, >5% of requests failing | Page primary on-call |
| SEV-3 | Degraded performance, minor feature impaired | Notify team; handle urgently |
| SEV-4 | Low impact, no urgent user effect | Ticket; handle next business day |
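
The levels above can be roughly encoded as a first-guess helper. This sketch simplifies the table considerably: the function name `suggest_severity` and its three inputs are hypothetical, only the 5% threshold comes from the table, and the result is a starting point for the Incident Commander's judgement, not a replacement for it.

```shell
# Suggest a severity level from three coarse inputs.
# Usage: suggest_severity OUTAGE_OR_DATA_LOSS ERROR_PCT [USER_VISIBLE]
#   OUTAGE_OR_DATA_LOSS  yes|no
#   ERROR_PCT            percentage of failing requests, e.g. 7.5
#   USER_VISIBLE         yes|no (defaults to yes)
suggest_severity() (
  outage="$1"; error_pct="$2"; user_visible="${3:-yes}"
  if [ "$outage" = "yes" ]; then
    echo "SEV-1"
  # The shell has no floating-point comparison, so delegate it to awk
  elif awk -v p="$error_pct" 'BEGIN { exit !(p > 5) }'; then
    echo "SEV-2"
  elif [ "$user_visible" = "yes" ]; then
    echo "SEV-3"
  else
    echo "SEV-4"
  fi
)
```

For instance, `suggest_severity no 7.5` prints `SEV-2`. When a case sits on the boundary between two levels, declaring the higher one and downgrading later is usually the safer choice.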

4. Declare and Communicate

Message to post in your incident channel:

🚨 SEV-[X] DECLARED — [One-sentence description]
Impact: [Who is affected and what they cannot do]
Started: [HH:MM UTC] (approximately)
Commander: [@you]
Status: Investigating
Next update: [HH:MM UTC, max 15 min from now]
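
Filling in the template by hand under pressure invites omissions, so a small helper can render it from arguments. This is a sketch with a hypothetical function name, `declare_message`. It follows the template above, except that a plain hyphen separates the headline from the description.

```shell
# Render the declaration message from positional arguments.
# Usage: declare_message SEV SUMMARY IMPACT STARTED_UTC COMMANDER NEXT_UPDATE_UTC
declare_message() {
  printf '🚨 SEV-%s DECLARED - %s\n' "$1" "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Started: %s UTC (approximately)\n' "$4"
  printf 'Commander: %s\n' "$5"
  printf 'Status: Investigating\n'
  printf 'Next update: %s UTC\n' "$6"
}
```

The next-update time can be computed with `date -u -d '+15 minutes' +%H:%M` on GNU date, or `date -u -v+15M +%H:%M` on BSD and macOS. How the text reaches the incident channel depends on your chat tool and is left out here.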

Create the incident document and update the status page:

  • Open docs/incidents/YYYY-MM-DD-[title].md
  • Post status to your external status page (even "Investigating — we are aware of an issue")
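
Creating the incident document can be scripted so that the file name always follows the docs/incidents/YYYY-MM-DD-[title].md pattern. The sketch below is one way to do it. The function names and the skeleton it writes (a heading plus a first timeline entry) are assumptions, not a prescribed layout.

```shell
# Build docs/incidents/YYYY-MM-DD-title.md from a free-text title.
# Usage: incident_doc_path TITLE [DATE]   (DATE defaults to today in UTC)
incident_doc_path() (
  day="${2:-$(date -u +%F)}"
  # Lower-case the title, collapse every run of other characters into one
  # hyphen, then trim hyphens from both ends
  slug=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' | sed 's/^-*//; s/-*$//')
  printf 'docs/incidents/%s-%s.md\n' "$day" "$slug"
)

# Create the document with a first timeline entry and print its path.
# Usage: start_incident_doc TITLE [DATE]
start_incident_doc() (
  path=$(incident_doc_path "$1" "${2:-}")
  mkdir -p "$(dirname "$path")"
  # Never overwrite a document that someone has already started
  if [ ! -e "$path" ]; then
    {
      printf '# %s\n\n' "$1"
      printf '## Timeline (UTC)\n\n'
      printf '%s %s Incident declared\n' '-' "$(date -u +%H:%M)"
    } > "$path"
  fi
  printf '%s\n' "$path"
)
```

Run from the repository root, `start_incident_doc "Checkout API 500s"` creates the file if it is missing and prints its path, ready to paste into the incident channel.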

5. Assign Roles

| Role | Responsibility |
|------|----------------|
| Incident Commander | Makes decisions; keeps investigation moving |
| Technical Lead | Drives investigation; runs commands |
| Comms Lead | Writes status updates for users and stakeholders |

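Ideally, a different person takes each role. In a small team, the Incident Commander (IC) and Comms Lead can be the same person, but the Technical Lead should be someone else so that investigation is never interrupted by coordination.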


Exit Criteria

Before moving to Step 2:

  • [ ] Incident confirmed as real (not a false alarm)
  • [ ] Impact statement written: who, what, since when, how many
  • [ ] Severity level assigned (SEV-1 through SEV-4)
  • [ ] Incident declared in the incident channel
  • [ ] Roles assigned
  • [ ] Status page updated
  • [ ] Initial entry in incident timeline document

Next Step

→ Proceed to Step 2 · Investigate and Resolve