Step 2 · Investigate and Resolve
Goal
Find the cause, apply the fastest effective mitigation, restore service, and close the incident.
Instructions
You are in step 2 of the incident workflow. The incident has been declared and its severity is known. Your job is to stop the bleeding and restore normal service.
Principle: Mitigate first, explain later. If you can roll back and stop the pain in two minutes, do that now. Root cause analysis happens in the post-mortem — not during the incident.
Tasks to Perform
1. Ask "What Changed?"
Most incidents are caused by a recent change. Check these in order:
# Deployments in the last 4 hours
git log main --oneline --since="4 hours ago"
# Kubernetes rollout history
kubectl rollout history deployment/myapp -n production
# Infrastructure changes
# Check your Terraform apply history, ArgoCD sync history, or AWS CloudTrail
# Configuration changes
# Check your feature flag service, environment variable changes, secret rotations
# External dependency changes
# Status pages: status.github.com, status.aws.amazon.com, status.stripe.com
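The status pages above can also be polled from the command line. GitHub's status page (and many others built on Statuspage) exposes a `/api/v2/status.json` endpoint; the Stripe URL below follows the same convention but should be confirmed for your vendors. A minimal sketch with a jq-free parsing helper:

```shell
# Extract the "description" field from a Statuspage-style status.json payload.
# Plain sed so it works on boxes without jq installed.
parse_status() {
  echo "$1" | sed -n 's/.*"description"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# URLs are examples; substitute the status APIs of your own dependencies.
for url in \
  "https://www.githubstatus.com/api/v2/status.json" \
  "https://status.stripe.com/api/v2/status.json"; do
  body=$(curl -sf --max-time 5 "$url") || { echo "$url: unreachable"; continue; }
  echo "$url: $(parse_status "$body")"
done
```

"All Systems Operational" from every dependency shifts suspicion back to your own recent changes.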
2. Collect Evidence
# Recent error logs
kubectl logs -l app=myapp -n production --since=30m | grep -E "ERROR|FATAL|panic|Exception" | tail -50
# Current pod state
kubectl get pods -n production -l app=myapp
kubectl describe pod [failing-pod-name] -n production | tail -30
# Database health
psql $DATABASE_URL -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
psql $DATABASE_URL -c "
SELECT pid, now() - query_start AS duration, state, left(query, 80)
FROM pg_stat_activity WHERE state != 'idle'
ORDER BY duration DESC LIMIT 10;"
# Memory and CPU
kubectl top pods -n production
kubectl top nodes
# External API latency (check your APM)
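If APM access is slow or unavailable, a quick latency spot-check from the shell can stand in. The URL below is a placeholder for whichever dependency you suspect; take several samples to spot tail latency rather than trusting one lucky request:

```shell
# Print HTTP status and total request time for one request to the given URL.
measure() {
  curl -so /dev/null -w "%{http_code} %{time_total}s\n" --max-time 3 "$1"
}

# api.example.com is a placeholder; point this at the suspect dependency.
for _ in 1 2 3; do measure "https://api.example.com/health"; done
```

A status of 000 means the request never completed (DNS, TLS, or connect failure), which is itself a finding worth posting.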
3. Form and Test Hypotheses
Before acting: state your hypothesis.
"I believe the cause is [X] because [evidence Y] and [evidence Z].
To confirm: I will run [command] and expect to see [result].
To mitigate: I will [action]."
Write this in the incident channel. It keeps the investigation focused and creates a timeline.
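If the incident channel accepts incoming webhooks, the hypothesis can be posted straight from the shell. This is a sketch: `SLACK_WEBHOOK_URL` is an assumed environment variable, not a real endpoint, and the helper only builds the JSON payload.

```shell
# Build a minimal JSON payload for a chat webhook (bash; escapes embedded
# double quotes so the JSON stays valid).
post_hypothesis() {
  local text="$1"
  local escaped=${text//\"/\\\"}
  printf '{"text":"%s"}' "$escaped"
}

payload=$(post_hypothesis 'Hypothesis: cause is X because Y. To confirm: run Z.')
# Hypothetical delivery step -- substitute your chat tool's webhook:
# curl -s -X POST -H 'Content-Type: application/json' -d "$payload" "$SLACK_WEBHOOK_URL"
echo "$payload"
```

Posting from a script rather than by hand also timestamps the hypothesis in the channel, which feeds the post-mortem timeline for free.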
4. Apply Mitigation
Choose the fastest option that stops user impact:
# Option A: Rollback deployment (use when a code change caused it)
kubectl rollout undo deployment/myapp -n production
kubectl rollout status deployment/myapp -n production --timeout=5m
# Option B: Feature flag off (use when a specific feature is the problem)
# Disable flag in your flag management system — no deployment needed
# Option C: Scale out (use when it's a capacity issue)
kubectl scale deployment/myapp --replicas=15 -n production
# Option D: Redirect traffic (use when one region/instance is bad)
# Update load balancer weights or ingress annotations
# Option E: Kill bad database queries (use when queries are causing lock cascades)
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE now() - query_start > interval '5 minutes'
  AND state = 'active'
  AND pid <> pg_backend_pid();"
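Option D above can be sketched with the NGINX ingress controller's canary annotations. This assumes that controller is in use and that a canary ingress named `myapp-canary` already routes to the healthy deployment; both names are illustrative, not part of this runbook's stack.

```shell
# Send 100% of traffic to the healthy backend via the canary ingress.
# Annotation keys are real nginx-ingress annotations; resource names are
# assumptions for this sketch.
kubectl annotate ingress myapp-canary -n production \
  nginx.ingress.kubernetes.io/canary="true" \
  nginx.ingress.kubernetes.io/canary-weight="100" \
  --overwrite
```

Remember to revert the weight once the bad region or instance is healthy again, or the "temporary" redirect becomes permanent topology.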
5. Post Updates Every 15 Minutes
Even if nothing has changed:
[HH:MM UTC] Update
- Status: Still investigating / Mitigation applied, monitoring
- Findings so far: [1-2 sentences]
- Current action: [what you are doing right now]
- Next update: [HH:MM UTC]
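Timestamps in updates are easy to get wrong under pressure. A tiny helper keeps them honest; it assumes GNU `date -d`, with a BSD `date -v` fallback:

```shell
# Current UTC time and the time 15 minutes from now, both as HH:MM.
now_utc()  { date -u +"%H:%M"; }
next_utc() { date -u -d "+15 minutes" +"%H:%M" 2>/dev/null || date -u -v+15M +"%H:%M"; }

cat <<EOF
[$(now_utc) UTC] Update
- Status: Still investigating
- Findings so far: ...
- Current action: ...
- Next update: $(next_utc) UTC
EOF
```

Pasting a pre-filled skeleton also lowers the bar for posting "nothing has changed" updates, which is exactly when people skip them.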
6. Confirm Resolution
Do not declare resolved until you have evidence:
# Error rate back to baseline?
# [Check your APM / monitoring dashboard]
# All pods healthy?
kubectl get pods -n production -l app=myapp
# All should show Running and READY
# Smoke test the affected feature
curl -s -w "\nHTTP %{http_code}\n" https://yourapp.com/api/health
# Run the specific user journey that was failing
# No new errors in logs?
kubectl logs -l app=myapp -n production --since=5m | grep -c "ERROR"
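The smoke test is worth repeating rather than running once: a single lucky 200 can pass while the service is still flapping. A sketch that requires N consecutive healthy responses (the URL reuses the placeholder from above):

```shell
# Succeed only if the endpoint returns HTTP 200 on N consecutive checks,
# one second apart.
smoke() {
  local url="$1" want="$2" ok=0 code
  for _ in $(seq 1 "$want"); do
    code=$(curl -so /dev/null -w "%{http_code}" --max-time 5 "$url")
    [ "$code" = "200" ] && ok=$((ok + 1))
    sleep 1
  done
  [ "$ok" -eq "$want" ]
}

# Usage: smoke "https://yourapp.com/api/health" 5 && echo "stable"
```

Five consecutive passes lines up with the "stable for ≥ 5 minutes" exit criterion below if you stretch the sleep accordingly.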
7. Declare Resolved
✅ SEV-[X] RESOLVED — [HH:MM UTC]
Duration: [X hours Y minutes]
Impact: [brief summary]
Fix applied: [1-sentence description]
Post-mortem: [will be completed by DATE — for SEV-1 and SEV-2]
Status page updated to: All Systems Operational
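The Duration line is easy to miscalculate by hand at the end of a long incident. A sketch assuming GNU `date` (the timestamps are examples, not real incident data):

```shell
# Print "X hours Y minutes" between two UTC timestamps.
duration() {
  local start_s end_s diff
  start_s=$(date -u -d "$1" +%s)
  end_s=$(date -u -d "$2" +%s)
  diff=$(( end_s - start_s ))
  printf '%d hours %d minutes\n' $(( diff / 3600 )) $(( diff % 3600 / 60 ))
}

duration "2024-05-01 14:05" "2024-05-01 16:47"   # → 2 hours 42 minutes
```

Use the declared and resolved timestamps from the incident channel so the resolved message and the post-mortem agree.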
Exit Criteria
- [ ] Error rate back to baseline and confirmed stable for ≥ 5 minutes
- [ ] All affected pods/services healthy
- [ ] User-facing smoke test passes
- [ ] Incident declared resolved in team channel
- [ ] External status page updated to resolved
- [ ] Incident timeline document updated with resolution and fix applied
Next Step
For SEV-1 and SEV-2: → Proceed to Step 3 · Post-Mortem
For SEV-3 and SEV-4: Post-mortem optional; document root cause and action items in the incident ticket and close.