Observability

You are setting up observability for: $ARGUMENTS

Observability is the ability to understand the internal state of a system from its external outputs. A system is observable when you can answer "what is happening right now and why?" using only the system's own output — without needing to connect a debugger or push debug code to production.

The three pillars are: Logs (what happened), Metrics (how much / how fast), Traces (where time was spent).


1. Structured Logging

Why structured (JSON) logging

Plain text logs: [2026-03-28 12:00:00] ERROR: Payment failed for user 123

Structured logs:

{
  "timestamp": "2026-03-28T12:00:00.000Z",
  "level": "error",
  "message": "Payment failed",
  "request_id": "req_abc123",
  "user_id": "usr_456",
  "amount_cents": 4999,
  "currency": "USD",
  "gateway_error_code": "CARD_DECLINED",
  "duration_ms": 213
}

Structured logs can be queried, aggregated, and alerted on by a log platform. Plain text requires regex and guesswork.
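As a minimal sketch, a structured logger can be a single helper that merges standard fields with per-event fields and emits one JSON object per line. The `logEvent` helper and the hardcoded `service`/`environment` values below are illustrative, not a specific library:

```typescript
// Minimal structured-logging sketch: one JSON object per line.
// Field names follow the example above; the helper itself is illustrative.
type Level = "error" | "warn" | "info" | "debug";

function logEvent(level: Level, message: string, fields: Record<string, unknown> = {}): string {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    service: "api-gateway",     // set once per process
    environment: "production",  // from config in a real service
    ...fields,                  // structured fields, not string concatenation
  };
  const line = JSON.stringify(entry);
  console.log(line); // stdout, ready for a log shipper
  return line;
}

logEvent("error", "Payment failed", {
  request_id: "req_abc123",
  user_id: "usr_456",
  amount_cents: 4999,
  gateway_error_code: "CARD_DECLINED",
});
```

Because every entry goes through one function, base fields (timestamp, service, environment) can never be forgotten.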

Log Levels — Use Them Correctly

Level   Use for
ERROR   Something failed that requires human attention. Alert on this.
WARN    Unexpected but handled. Monitor for frequency.
INFO    Normal milestones (request completed, job processed, user logged in).
DEBUG   Detailed diagnostic info — only enabled in dev/staging.

Never use:

  • ERROR for things that are handled and expected (a 404 is not an error)
  • DEBUG in production (too noisy, possible sensitive data exposure)
  • Unstructured string concatenation: log("User " + userId + " failed") → use structured fields

What to Always Include in Every Log Entry

{
  "timestamp": "ISO 8601 UTC",
  "level": "error|warn|info|debug",
  "message": "Short human-readable description",
  "request_id": "propagated from request header or generated",
  "service": "api-gateway",
  "environment": "production"
}

Request ID Propagation

Every request should generate or accept a request ID and propagate it through every log entry:

Client → API     (X-Request-ID: req_abc123)
API → Service A  (X-Request-ID: req_abc123) → log(request_id: req_abc123)
API → Database   — log the query with (request_id: req_abc123)
API → Response   (X-Request-ID: req_abc123 in response header)

This makes it possible to find all log entries for a single user-visible request across any number of services.
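In an Express-style middleware, accept-or-generate plus propagation can be sketched as follows. `Req` and `Res` here are minimal stand-ins for your framework's request and response objects:

```typescript
import { randomUUID } from "node:crypto";

// Accept an incoming X-Request-ID or generate one, then carry it through
// logging and the response. Req/Res are stand-ins for your framework's types.
interface Req { headers: Record<string, string | undefined>; requestId?: string }
interface Res { headers: Record<string, string> }

function requestIdMiddleware(req: Req, res: Res): string {
  const id = req.headers["x-request-id"] ?? `req_${randomUUID()}`;
  req.requestId = id;                // available to all downstream handlers and logs
  res.headers["X-Request-ID"] = id;  // echoed back so clients can report it
  return id;
}

// Outbound calls to other services forward the same header:
function outboundHeaders(req: Req): Record<string, string> {
  return { "X-Request-ID": req.requestId ?? "unknown" };
}
```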

Security: What NEVER to Log

  • Passwords (even hashed)
  • Full payment card numbers
  • Authentication tokens / API keys / JWT values
  • Session cookies
  • Full request bodies for auth endpoints
  • PII beyond what policy explicitly permits
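One common safeguard is a deny-list filter applied to fields before serialization. A sketch (the key list below is a starting point, not exhaustive; an allow-list of known-safe fields is safer still):

```typescript
// Replace known-sensitive keys before a log entry is serialized.
// The deny-list is illustrative; extend it to match your own payloads.
const SENSITIVE_KEYS = new Set([
  "password", "card_number", "authorization", "api_key", "token", "cookie", "set-cookie",
]);

function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : value;
  }
  return out;
}
```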

2. Metrics

The Four Golden Signals (from Google SRE)

Signal       What it measures                                        Target
Latency      How long requests take (p50, p95, p99)                  Define per endpoint
Traffic      How many requests per second                            Baseline for capacity
Errors       Error rate (%)                                          < 0.1% for critical paths
Saturation   How full the system is (CPU %, queue depth, DB pool)    < 80%

Key Application Metrics to Emit

# HTTP
http_request_duration_ms{method, route, status_code} — histogram
http_requests_total{method, route, status_code} — counter

# Database
db_query_duration_ms{query_name} — histogram
db_pool_active_connections — gauge
db_pool_waiting_requests — gauge

# Queue / Background Jobs
job_duration_ms{job_type} — histogram
job_queue_depth{queue_name} — gauge
job_failures_total{job_type} — counter

# Business metrics (critical path)
orders_created_total — counter
payment_success_total / payment_failure_total — counter
user_signup_total — counter

Metric Types

  • Counter — monotonically increasing (total requests, total errors). Resets only when the process restarts.
  • Gauge — current value that goes up and down (queue depth, connections, memory).
  • Histogram — distribution of values (request durations, response sizes). Enables percentiles.
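The semantics of the three types can be made concrete with a toy in-memory implementation (a real service would use a metrics client such as a Prometheus or OpenTelemetry SDK; this sketch only illustrates the behavior):

```typescript
// Toy metrics to illustrate the semantics of each type.
class Counter {
  private value = 0;
  inc(by = 1): void { this.value += by; }   // only ever increases
  get(): number { return this.value; }
}

class Gauge {
  private value = 0;
  set(v: number): void { this.value = v; }  // moves up and down
  get(): number { return this.value; }
}

class Histogram {
  private samples: number[] = [];
  observe(v: number): void { this.samples.push(v); }
  // Nearest-rank percentile, e.g. p = 0.95 for p95. Requires at least one sample.
  percentile(p: number): number {
    const sorted = [...this.samples].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
    return sorted[idx];
  }
}
```

Note how a histogram keeps the distribution, which is why it can answer percentile questions that a plain average cannot.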

3. Health Check Endpoints

Every service must expose health check endpoints:

/health/live

Returns 200 if the process is running. Used by orchestrators to decide if the container should be restarted.

HTTP 200
{ "status": "ok" }

/health/ready

Returns 200 only if the service can handle traffic. Returns 503 during startup, after DB connection loss, etc.

HTTP 200
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "cache": "ok",
    "queue": "ok"
  }
}

// When degraded:
HTTP 503
{
  "status": "degraded",
  "checks": {
    "database": "ok",
    "cache": "unavailable",
    "queue": "ok"
  }
}
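A readiness handler can aggregate dependency checks into exactly this response shape. In the sketch below, each check function is a placeholder for a real ping (e.g. `SELECT 1` against the database, `PING` against Redis):

```typescript
// Aggregate dependency checks into a readiness response.
// Each CheckFn is a placeholder for a real ping against the dependency.
type CheckFn = () => Promise<boolean>;

async function readiness(checks: Record<string, CheckFn>) {
  const results: Record<string, string> = {};
  let allOk = true;
  for (const [name, check] of Object.entries(checks)) {
    const ok = await check().catch(() => false); // a throwing check counts as down
    results[name] = ok ? "ok" : "unavailable";
    if (!ok) allOk = false;
  }
  return {
    httpStatus: allOk ? 200 : 503,
    status: allOk ? "ready" : "degraded",
    checks: results,
  };
}
```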

4. Distributed Tracing

For systems with multiple services (or for diagnosing slow requests within a single service), implement tracing.

A trace shows the full call tree for a single request:

[GET /api/checkout] 450ms total
  ├── [auth middleware] 12ms
  ├── [validateCart()] 5ms
  ├── [db: SELECT products] 45ms
  ├── [calculateTax()] 2ms
  ├── [stripe.charge()] 380ms  ← THIS is the bottleneck
  └── [db: INSERT order] 6ms

Implementation:

  • Use OpenTelemetry (vendor-neutral) for instrumentation
  • Propagate trace context via traceparent header between services
  • Sample traces in production (don't send every trace — too expensive)
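Conceptually, a trace is just a tree of timed spans sharing a trace ID. The simplified sketch below illustrates that structure only; it is not the OpenTelemetry API, which real code should use instead:

```typescript
// Simplified span model: each span records its name, parent, and duration.
// Illustrative only; use OpenTelemetry for real instrumentation.
interface Span {
  traceId: string;
  name: string;
  parent?: string;
  durationMs: number;
}

function recordSpan<T>(
  trace: Span[], traceId: string, name: string, parent: string | undefined, work: () => T,
): T {
  const start = Date.now();
  try {
    return work(); // the span is recorded even if work() throws
  } finally {
    trace.push({ traceId, name, parent, durationMs: Date.now() - start });
  }
}

// The slowest span in a trace is the bottleneck:
function bottleneck(trace: Span[]): Span {
  return trace.reduce((max, s) => (s.durationMs > max.durationMs ? s : max));
}
```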

5. Alerting Strategy

Alerting philosophy: Page on symptoms, not causes

Bad: Alert when CPU > 80% (a cause — users may not be affected at all)
Good: Alert when error rate > 1% for 5 minutes (a symptom — users are affected)

Alert tiers

Tier                      Response time   Example condition
Page (immediate)          < 5 minutes     Error rate > 5%; service completely down
Ticket (business hours)   < 4 hours       Error rate > 0.5% sustained for 30 min
Review (next sprint)      < 1 week        P99 latency creeping up week-over-week

Essential Alerts

1. Error rate > 1% for 5 minutes → Page
2. P99 response time > 2s for 5 minutes → Page
3. Service health check failing → Page immediately
4. Job queue depth > 1000 for 10 minutes → Ticket
5. Disk usage > 85% → Ticket
6. Failed login rate > 10/minute (brute force signal) → Ticket
7. Zero traffic on critical endpoint for 5 minutes → Page (could be silent failure)
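Symptom alerts like rule 1 reduce to a window check over request and error counters. A sketch, where each sample is one scrape interval's totals from a hypothetical metrics backend:

```typescript
// Evaluate "error rate > threshold for every sample in the window".
// Each Sample holds one scrape interval's totals from the metrics backend.
interface Sample { requests: number; errors: number }

function shouldPage(window: Sample[], thresholdPct: number): boolean {
  if (window.length === 0) return false;
  // Fires only if the rate breaches the threshold for the whole window,
  // which filters out single-sample blips.
  return window.every(
    (s) => s.requests > 0 && (s.errors / s.requests) * 100 > thresholdPct,
  );
}
```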

6. Dashboard Design

Every production service needs a single dashboard that answers:

  • Is the service healthy right now?
  • What is the current traffic level?
  • What is the error rate?
  • What is the response time?
  • Are there any queued/stuck jobs?

Standard panels:

  1. Request rate (req/sec) — time series
  2. Error rate (%) — time series, with threshold line
  3. P50 / P95 / P99 latency — time series
  4. Active instances / containers — gauge
  5. Database query time — time series
  6. Background job queue depth — time series
  7. Recent errors — log panel (last 50 errors)

7. Observability Checklist

  • [ ] All log entries are structured JSON
  • [ ] All log entries include request_id, service name, environment
  • [ ] Log levels are used correctly (no ERROR for expected conditions)
  • [ ] Sensitive data is never logged
  • [ ] HTTP metrics are emitted (request count, duration, status code)
  • [ ] /health/live and /health/ready endpoints exist
  • [ ] At least one dashboard showing the golden signals
  • [ ] Alerts defined for service down and high error rate
  • [ ] Request IDs propagated across service boundaries