Step 2 · Investigate and Resolve
Goal
Find the cause, apply the fastest effective mitigation, restore service, and close the incident.
Instructions
You are in step 2 of the incident workflow. The incident has been declared and its severity is known. Your job is to stop the bleeding and restore normal service.
Principle: Mitigate first, explain later. If you can roll back and stop the pain in two minutes, do that now. Root cause analysis happens in the post-mortem — not during the incident.
Tasks to Perform
1. Ask "What Changed?"
Most incidents are caused by a recent change. Check these in order:
# Deployments in the last 4 hours
git log main --oneline --since="4 hours ago"
# Kubernetes rollout history
kubectl rollout history deployment/myapp -n production
# Infrastructure changes
# Check your Terraform apply history, ArgoCD sync history, or AWS CloudTrail
# Configuration changes
# Check your feature flag service, environment variable changes, secret rotations
# External dependency changes
# Status pages: status.github.com, status.aws.amazon.com, status.stripe.com
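The status pages above can also be polled from the command line. GitHub's status page (and many others built on Statuspage) exposes a `/api/v2/status.json` endpoint; the Stripe URL below follows the same convention but should be confirmed for your vendors. A minimal sketch with a jq-free parsing helper:

```shell
# Extract the "description" field from a Statuspage-style status.json payload.
# Plain sed so it works on boxes without jq installed.
parse_status() {
  echo "$1" | sed -n 's/.*"description"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# URLs are examples; substitute the status APIs of your own dependencies.
for url in \
  "https://www.githubstatus.com/api/v2/status.json" \
  "https://status.stripe.com/api/v2/status.json"; do
  body=$(curl -sf --max-time 5 "$url") || { echo "$url: unreachable"; continue; }
  echo "$url: $(parse_status "$body")"
done
```

"All Systems Operational" from every dependency shifts suspicion back to your own recent changes.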
2. Collect Evidence
# Recent error logs
kubectl logs -l app=myapp -n production --since=30m | grep -E "ERROR|FATAL|panic|Exception" | tail -50
# Current pod state
kubectl get pods -n production -l app=myapp
kubectl describe pod [failing-pod-name] -n production | tail -30
# Database health
psql $DATABASE_URL -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
psql $DATABASE_URL -c "
SELECT pid, now() - query_start AS duration, state, left(query, 80)
FROM pg_stat_activity WHERE state != 'idle'
ORDER BY duration DESC LIMIT 10;"
# Memory and CPU
kubectl top pods -n production
kubectl top nodes
# External API latency (check your APM)
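If APM access is slow or unavailable, a quick latency spot-check from the shell can stand in. The URL below is a placeholder for whichever dependency you suspect; take several samples to spot tail latency rather than trusting one lucky request:

```shell
# Print HTTP status and total request time for one request to the given URL.
measure() {
  curl -so /dev/null -w "%{http_code} %{time_total}s\n" --max-time 3 "$1"
}

# api.example.com is a placeholder; point this at the suspect dependency.
for _ in 1 2 3; do measure "https://api.example.com/health"; done
```

A status of 000 means the request never completed (DNS, TLS, or connect failure), which is itself a finding worth posting.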
3. Form and Test Hypotheses
Before acting: state your hypothesis.
"I believe the cause is [X] because [evidence Y] and [evidence Z].
To confirm: I will run [command] and expect to see [result].
To mitigate: I will [action]."
Write this in the incident channel. It keeps the investigation focused and creates a timeline.
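If the incident channel accepts incoming webhooks, the hypothesis can be posted straight from the shell. This is a sketch: `SLACK_WEBHOOK_URL` is an assumed environment variable, not a real endpoint, and the helper only builds the JSON payload.

```shell
# Build a minimal JSON payload for a chat webhook (bash; escapes embedded
# double quotes so the JSON stays valid).
post_hypothesis() {
  local text="$1"
  local escaped=${text//\"/\\\"}
  printf '{"text":"%s"}' "$escaped"
}

payload=$(post_hypothesis 'Hypothesis: cause is X because Y. To confirm: run Z.')
# Hypothetical delivery step -- substitute your chat tool's webhook:
# curl -s -X POST -H 'Content-Type: application/json' -d "$payload" "$SLACK_WEBHOOK_URL"
echo "$payload"
```

Posting from a script rather than by hand also timestamps the hypothesis in the channel, which feeds the post-mortem timeline for free.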
4. Apply Mitigation
Choose the fastest option that stops user impact:
# Option A: Rollback deployment (use when a code change caused it)
kubectl rollout undo deployment/myapp -n production
kubectl rollout status deployment/myapp -n production --timeout=5m
# Option B: Feature flag off (use when a specific feature is the problem)
# Disable flag in your flag management system — no deployment needed
# Option C: Scale out (use when it's a capacity issue)
kubectl scale deployment/myapp --replicas=15 -n production
# Option D: Redirect traffic (use when one region/instance is bad)
# Update load balancer weights or ingress annotations
# Option E: Kill bad database queries (use when queries are causing lock cascades)
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE now() - query_start > interval '5 minutes'
  AND state = 'active'
  AND pid <> pg_backend_pid();"
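Option D above can be sketched with the NGINX ingress controller's canary annotations. This assumes that controller is in use and that a canary ingress named `myapp-canary` already routes to the healthy deployment; both names are illustrative, not part of this runbook's stack.

```shell
# Send 100% of traffic to the healthy backend via the canary ingress.
# Annotation keys are real nginx-ingress annotations; resource names are
# assumptions for this sketch.
kubectl annotate ingress myapp-canary -n production \
  nginx.ingress.kubernetes.io/canary="true" \
  nginx.ingress.kubernetes.io/canary-weight="100" \
  --overwrite
```

Remember to revert the weight once the bad region or instance is healthy again, or the "temporary" redirect becomes permanent topology.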
5. Post Updates Every 15 Minutes
Even if nothing has changed:
[HH:MM UTC] Update
- Status: Still investigating / Mitigation applied, monitoring
- Findings so far: [1-2 sentences]
- Current action: [what you are doing right now]
- Next update: [HH:MM UTC]
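Timestamps in updates are easy to get wrong under pressure. A tiny helper keeps them honest; it assumes GNU `date -d`, with a BSD `date -v` fallback:

```shell
# Current UTC time and the time 15 minutes from now, both as HH:MM.
now_utc()  { date -u +"%H:%M"; }
next_utc() { date -u -d "+15 minutes" +"%H:%M" 2>/dev/null || date -u -v+15M +"%H:%M"; }

cat <<EOF
[$(now_utc) UTC] Update
- Status: Still investigating
- Findings so far: ...
- Current action: ...
- Next update: $(next_utc) UTC
EOF
```

Pasting a pre-filled skeleton also lowers the bar for posting "nothing has changed" updates, which is exactly when people skip them.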
6. Confirm Resolution
Do not declare resolved until you have evidence:
# Error rate back to baseline?
# [Check your APM / monitoring dashboard]
# All pods healthy?
kubectl get pods -n production -l app=myapp
# All should show Running and READY
# Smoke test the affected feature
curl -s -w "\nHTTP %{http_code}\n" https://yourapp.com/api/health
# Run the specific user journey that was failing
# No new errors in logs?
kubectl logs -l app=myapp -n production --since=5m | grep -c "ERROR"
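The smoke test is worth repeating rather than running once: a single lucky 200 can pass while the service is still flapping. A sketch that requires N consecutive healthy responses (the URL reuses the placeholder from above):

```shell
# Succeed only if the endpoint returns HTTP 200 on N consecutive checks,
# one second apart.
smoke() {
  local url="$1" want="$2" ok=0 code
  for _ in $(seq 1 "$want"); do
    code=$(curl -so /dev/null -w "%{http_code}" --max-time 5 "$url")
    [ "$code" = "200" ] && ok=$((ok + 1))
    sleep 1
  done
  [ "$ok" -eq "$want" ]
}

# Usage: smoke "https://yourapp.com/api/health" 5 && echo "stable"
```

Five consecutive passes lines up with the "stable for ≥ 5 minutes" exit criterion below if you stretch the sleep accordingly.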
7. Declare Resolved
✅ SEV-[X] RESOLVED — [HH:MM UTC]
Duration: [X hours Y minutes]
Impact: [brief summary]
Fix applied: [1-sentence description]
Post-mortem: [will be completed by DATE — for SEV-1 and SEV-2]
Status page updated to: All Systems Operational
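The Duration line is easy to miscalculate by hand at the end of a long incident. A sketch assuming GNU `date` (the timestamps are examples, not real incident data):

```shell
# Print "X hours Y minutes" between two UTC timestamps.
duration() {
  local start_s end_s diff
  start_s=$(date -u -d "$1" +%s)
  end_s=$(date -u -d "$2" +%s)
  diff=$(( end_s - start_s ))
  printf '%d hours %d minutes\n' $(( diff / 3600 )) $(( diff % 3600 / 60 ))
}

duration "2024-05-01 14:05" "2024-05-01 16:47"   # → 2 hours 42 minutes
```

Use the declared and resolved timestamps from the incident channel so the resolved message and the post-mortem agree.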
Exit Criteria
- [ ] Error rate back to baseline and confirmed stable for ≥ 5 minutes
- [ ] All affected pods/services healthy
- [ ] User-facing smoke test passes
- [ ] Incident declared resolved in team channel
- [ ] External status page updated to resolved
- [ ] Incident timeline document updated with resolution and fix applied
Next Step
For SEV-1 and SEV-2: → Proceed to Step 3 · Post-Mortem
For SEV-3 and SEV-4: Post-mortem optional; document root cause and action items in the incident ticket and close.