Observability
You are setting up observability for: $ARGUMENTS
Observability is the ability to understand the internal state of a system from its external outputs. A system is observable when you can answer "what is happening right now and why?" using only the system's own output — without needing to connect a debugger or push debug code to production.
The three pillars are: Logs (what happened), Metrics (how much / how fast), Traces (where time was spent).
1. Structured Logging
Why structured (JSON) logging
Plain text logs: [2026-03-28 12:00:00] ERROR: Payment failed for user 123
Structured logs:
{
"timestamp": "2026-03-28T12:00:00.000Z",
"level": "error",
"message": "Payment failed",
"request_id": "req_abc123",
"user_id": "usr_456",
"amount_cents": 4999,
"currency": "USD",
"gateway_error_code": "CARD_DECLINED",
"duration_ms": 213
}
Structured logs can be queried, aggregated, and alerted on by a log platform. Plain text requires regex and guesswork.
Log Levels — Use Them Correctly
| Level | Use for |
|---|---|
| ERROR | Something failed that requires human attention. Alert on this. |
| WARN | Unexpected but handled. Monitor for frequency. |
| INFO | Normal milestones (request completed, job processed, user logged in). |
| DEBUG | Detailed diagnostic info — only enabled in dev/staging. |
Never use:
- ERROR for things that are handled and expected (a 404 is not an error)
- DEBUG in production (too noisy, possible sensitive data exposure)
- Unstructured string concatenation: log("User " + userId + " failed") → use structured fields
What to Always Include in Every Log Entry
{
"timestamp": "ISO 8601 UTC",
"level": "error|warn|info|debug",
"message": "Short human-readable description",
"request_id": "propagated from request header or generated",
"service": "api-gateway",
"environment": "production"
}
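A minimal sketch of such a logger in Python (the document doesn't prescribe a language; the `log_event` helper and the hard-coded `service`/`environment` values are assumptions — in practice they would come from configuration):

```python
import json
import sys
import uuid
from datetime import datetime, timezone

def log_event(level, message, request_id=None, **fields):
    """Emit one structured JSON log line carrying the standard fields."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        # Propagated from the request header, or generated if absent.
        "request_id": request_id or f"req_{uuid.uuid4().hex[:12]}",
        "service": "api-gateway",      # normally read from config/env
        "environment": "production",   # normally read from config/env
        **fields,                      # any extra structured context
    }
    sys.stdout.write(json.dumps(entry) + "\n")
    return entry

# Usage: extra fields become queryable attributes, not string fragments.
entry = log_event("error", "Payment failed",
                  request_id="req_abc123",
                  gateway_error_code="CARD_DECLINED",
                  duration_ms=213)
```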
Request ID Propagation
Every request should generate or accept a request ID and propagate it through every log entry:
Client → API (X-Request-ID: req_abc123)
API → Service A (X-Request-ID: req_abc123) → log(request_id: req_abc123)
API → Database — log the query with (request_id: req_abc123)
API → Response (X-Request-ID: req_abc123 in response header)
This makes it possible to find all log entries for a single user-visible request across any number of services.
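One way to sketch this propagation in Python, using `contextvars` so the ID follows the request without threading it through every function signature (the helper names here are illustrative, not a prescribed API):

```python
import uuid
from contextvars import ContextVar

# Holds the current request's ID for the duration of that request.
_request_id: ContextVar[str] = ContextVar("request_id", default="")

def begin_request(headers: dict) -> str:
    """Accept an incoming X-Request-ID or generate one, then bind it."""
    rid = headers.get("X-Request-ID") or f"req_{uuid.uuid4().hex[:12]}"
    _request_id.set(rid)
    return rid

def current_request_id() -> str:
    """Read by the logger so every entry carries the bound ID."""
    return _request_id.get()

def outgoing_headers() -> dict:
    """Headers to attach to downstream calls and the final response."""
    return {"X-Request-ID": current_request_id()}

# Usage: call begin_request() in middleware at the start of each request.
rid = begin_request({"X-Request-ID": "req_abc123"})
# every log line and downstream call now carries req_abc123
```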
Security: What NEVER to Log
- Passwords (even hashed)
- Full payment card numbers
- Authentication tokens / API keys / JWT values
- Session cookies
- Full request bodies for auth endpoints
- PII beyond what policy explicitly permits
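A defensive pattern is to redact by field name before anything reaches the log sink. A minimal sketch (the key list is an assumption — extend it to match your actual policy):

```python
# Substrings of field names that must never reach log output.
# Assumption: adjust this set to your organization's policy.
SENSITIVE_KEYS = {"password", "card_number", "token", "api_key",
                  "authorization", "cookie", "session", "secret"}

def redact(fields: dict) -> dict:
    """Return a copy with sensitive values masked before logging."""
    out = {}
    for key, value in fields.items():
        if any(s in key.lower() for s in SENSITIVE_KEYS):
            out[key] = "[REDACTED]"
        elif isinstance(value, dict):
            out[key] = redact(value)   # recurse into nested objects
        else:
            out[key] = value
    return out
```

Applying `redact()` in the logging helper itself, rather than at call sites, makes the protection hard to forget.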
2. Metrics
The Four Golden Signals (from Google SRE)
| Signal | What it measures | Target |
|---|---|---|
| Latency | How long requests take (p50, p95, p99) | Define per endpoint |
| Traffic | How many requests per second | Baseline for capacity |
| Errors | Error rate (%) | < 0.1% for critical paths |
| Saturation | How full is the system? (CPU %, queue depth, DB pool) | < 80% |
Key Application Metrics to Emit
# HTTP
http_request_duration_ms{method, route, status_code} — histogram
http_requests_total{method, route, status_code} — counter
# Database
db_query_duration_ms{query_name} — histogram
db_pool_active_connections — gauge
db_pool_waiting_requests — gauge
# Queue / Background Jobs
job_duration_ms{job_type} — histogram
job_queue_depth{queue_name} — gauge
job_failures_total{job_type} — counter
# Business metrics (critical path)
orders_created_total — counter
payment_success_total / payment_failure_total — counter
user_signup_total — counter
Metric Types
- Counter — monotonically increasing (total requests, total errors). Never resets.
- Gauge — current value that goes up and down (queue depth, connections, memory).
- Histogram — distribution of values (request durations, response sizes). Enables percentiles.
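The three types can be sketched in a few lines of Python. This is an in-process illustration of the semantics only — a real service would use a metrics library such as a Prometheus client rather than hand-rolling these:

```python
class Counter:
    """Monotonically increasing value (e.g. http_requests_total)."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        assert amount >= 0, "counters never go down"
        self.value += amount

class Gauge:
    """Current value that can move both ways (e.g. queue depth)."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Records observations so percentiles can be derived."""
    def __init__(self):
        self.samples = []
    def observe(self, value):
        self.samples.append(value)
    def percentile(self, p):
        """Nearest-rank percentile, e.g. percentile(95) for p95."""
        ordered = sorted(self.samples)
        index = max(0, round(p / 100 * len(ordered)) - 1)
        return ordered[index]
```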
3. Health Check Endpoints
Every service must expose health check endpoints:
/health/live
Returns 200 if the process is running. Used by orchestrators to decide if the container should be restarted.
HTTP 200
{ "status": "ok" }
/health/ready
Returns 200 only if the service can handle traffic. Returns 503 during startup, after DB connection loss, etc.
HTTP 200
{
"status": "ready",
"checks": {
"database": "ok",
"cache": "ok",
"queue": "ok"
}
}
// When degraded:
HTTP 503
{
"status": "degraded",
"checks": {
"database": "ok",
"cache": "unavailable",
"queue": "ok"
}
}
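The readiness logic above reduces to one aggregation step: all dependency checks pass → 200, otherwise 503. A minimal sketch of that handler core in Python (the function name and the boolean check inputs are assumptions; a real handler would run actual connectivity probes):

```python
def readiness_response(checks: dict) -> tuple:
    """Aggregate dependency checks into the /health/ready payload.

    `checks` maps dependency name -> True (reachable) / False.
    Returns (http_status, body) matching the shapes shown above.
    """
    results = {name: ("ok" if ok else "unavailable")
               for name, ok in checks.items()}
    if all(checks.values()):
        return 200, {"status": "ready", "checks": results}
    return 503, {"status": "degraded", "checks": results}
```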
4. Distributed Tracing
For systems with multiple services (or for diagnosing slow requests within a single service), implement tracing.
A trace shows the full call tree for a single request:
[GET /api/checkout] 450ms total
├── [auth middleware] 12ms
├── [validateCart()] 5ms
├── [db: SELECT products] 45ms
├── [calculateTax()] 2ms
├── [stripe.charge()] 380ms ← THIS is the bottleneck
└── [db: INSERT order] 6ms
Implementation:
- Use OpenTelemetry (vendor-neutral) for instrumentation
- Propagate trace context via the traceparent header between services
- Sample traces in production (don't send every trace — too expensive)
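A real system should use OpenTelemetry as noted above; purely to illustrate the span-nesting idea behind a call tree like the one shown, here is a hand-rolled timer sketch in Python (all names are hypothetical):

```python
import time
from contextlib import contextmanager

# Completed spans as (depth, name, duration_ms), innermost finishing first.
spans = []
_depth = 0

@contextmanager
def span(name):
    """Time a block and record it at its nesting depth in the call tree."""
    global _depth
    _depth += 1
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append((_depth, name, duration_ms))
        _depth -= 1

# Usage: nested spans reproduce the call tree shown above.
with span("GET /api/checkout"):
    with span("auth middleware"):
        time.sleep(0.01)        # stand-in for real work
    with span("stripe.charge()"):
        time.sleep(0.02)        # the slow external call
```

Reading the recorded spans makes the bottleneck visible the same way the tree does: the parent's duration is dominated by its slowest child.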
5. Alerting Strategy
Alerting philosophy: Page on symptoms, not causes
Bad: Alert when CPU > 80% (cause — but maybe not affecting users)
Good: Alert when error rate > 1% for 5 minutes (symptom — users are affected)
Alert tiers
| Tier | Response time | Example condition |
|---|---|---|
| Page (immediate) | < 5 minutes | Error rate > 5%, service completely down |
| Ticket (business hours) | < 4 hours | Error rate > 0.5% sustained for 30min |
| Review (next sprint) | < 1 week | P99 latency creeping up week-over-week |
Essential Alerts
1. Error rate > 1% for 5 minutes → Page
2. P99 response time > 2s for 5 minutes → Page
3. Service health check failing → Page immediately
4. Job queue depth > 1000 for 10 minutes → Ticket
5. Disk usage > 85% → Ticket
6. Failed login rate > 10/minute (brute force signal) → Ticket
7. Zero traffic on critical endpoint for 5 minutes → Page (could be silent failure)
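As a sketch, the essential alerts above can be expressed as a pure function over one evaluation window; the field names (`error_rate`, `p99_ms`, `health_ok`, `queue_depth`, `disk_pct`, `requests`) are assumptions for illustration, and a real alerting system would also track how long each condition has held:

```python
def evaluate_alerts(window: dict) -> list:
    """Map one metrics window onto (tier, condition) pairs.

    Thresholds mirror the essential alerts listed above; sustained-duration
    tracking is omitted for brevity.
    """
    alerts = []
    if not window["health_ok"]:
        alerts.append(("page", "service health check failing"))
    if window["error_rate"] > 0.01:
        alerts.append(("page", "error rate > 1%"))
    if window["p99_ms"] > 2000:
        alerts.append(("page", "p99 response time > 2s"))
    if window["requests"] == 0:
        alerts.append(("page", "zero traffic on critical endpoint"))
    if window["queue_depth"] > 1000:
        alerts.append(("ticket", "job queue depth > 1000"))
    if window["disk_pct"] > 85:
        alerts.append(("ticket", "disk usage > 85%"))
    return alerts
```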
6. Dashboard Design
Every production service needs a single dashboard that answers:
- Is the service healthy right now?
- What is the current traffic level?
- What is the error rate?
- What is the response time?
- Are there any queued/stuck jobs?
Standard panels:
- Request rate (req/sec) — time series
- Error rate (%) — time series, with threshold line
- P50 / P95 / P99 latency — time series
- Active instances / containers — gauge
- Database query time — time series
- Background job queue depth — time series
- Recent errors — log panel (last 50 errors)
7. Observability Checklist
- [ ] All log entries are structured JSON
- [ ] All log entries include request_id, service name, environment
- [ ] Log levels are used correctly (no ERROR for expected conditions)
- [ ] Sensitive data is never logged
- [ ] HTTP metrics are emitted (request count, duration, status code)
- [ ] /health/live and /health/ready endpoints exist
- [ ] At least one dashboard showing the golden signals
- [ ] Alerts defined for service down and high error rate
- [ ] Request IDs propagated across service boundaries