# Database Migrations

You are planning or reviewing a migration for: $ARGUMENTS

Schema migrations are the most dangerous routine operation in backend engineering. An incorrectly written or timed migration can lock a production table for minutes, cause data loss, or break the running application. Apply every rule below.


## Principles

  1. Backwards compatible or gated. Unless you are taking a maintenance window, every migration must be compatible with the version of the application currently running in production.
  2. Rollback is designed before apply. If you cannot describe how to undo this migration, you are not ready to run it.
  3. Large tables demand special treatment. An `ALTER TABLE` that acquires an exclusive lock on a 500M-row table is an outage.
  4. Data migrations are separate from schema migrations. Never combine DDL and large DML in the same migration file.
  5. Zero-downtime is the default target. Explicitly document when a maintenance window is unavoidable.

## 1. Safe Migration Order (Zero-Downtime)

The expand–contract pattern allows schema changes without downtime:

```
Phase 1: EXPAND (migration only — no app code change)
  → Add new column / table (nullable, no default that requires a table rewrite)
  → Old app code continues to work

Phase 2: MIGRATE APP
  → Deploy new app code that writes to BOTH old and new columns
  → Backfill old rows (see §3)

Phase 3: CONTRACT (after backfill is complete)
  → Make the column NOT NULL (if needed)
  → Drop the old column in a later migration
  → Remove old code paths
```
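The three phases can be walked through end-to-end. This is an illustrative sketch only: an in-memory SQLite database stands in for Postgres, and the `orders` / `user_id` / `customer_id` names match the rename example in §2.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
db.execute("INSERT INTO orders (user_id) VALUES (1), (2)")  # pre-existing rows

# Phase 1: EXPAND — add the new column, nullable, no table rewrite
db.execute("ALTER TABLE orders ADD COLUMN customer_id INTEGER")

# Phase 2: MIGRATE APP — new code writes to BOTH columns...
db.execute("INSERT INTO orders (user_id, customer_id) VALUES (3, 3)")
# ...and old rows are backfilled (batched in production, see §3)
db.execute("UPDATE orders SET customer_id = user_id WHERE customer_id IS NULL")

# Phase 3: CONTRACT — only safe once no NULLs remain
nulls = db.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
).fetchone()[0]
assert nulls == 0  # precondition for SET NOT NULL / dropping user_id
```

The point of the assertion at the end: the contract phase must never run until the backfill is verifiably complete.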

## 2. DDL Safety by Operation

### Adding a Column

```sql
-- SAFE: nullable with no default
ALTER TABLE orders ADD COLUMN notes TEXT;

-- SAFE in Postgres 11+: nullable with a constant default (no table rewrite)
ALTER TABLE orders ADD COLUMN status VARCHAR(20) DEFAULT 'active';

-- DANGEROUS: adding a column as NOT NULL fails outright on a non-empty
-- table if there is no default, and a volatile default forces a full
-- table rewrite. Do this instead:
ALTER TABLE orders ADD COLUMN priority INT;             -- Step 1: add nullable
-- Step 2: backfill in batches (see §3), e.g.
--   UPDATE orders SET priority = 0 WHERE priority IS NULL;
ALTER TABLE orders ALTER COLUMN priority SET DEFAULT 0; -- Step 3: default for new rows
ALTER TABLE orders ALTER COLUMN priority SET NOT NULL;  -- Step 4: once no NULLs remain
-- (on very large tables, use the NOT VALID pattern from
--  "Adding a NOT NULL Constraint" below for Step 4)
```

### Renaming a Column

```
NEVER do:  ALTER TABLE orders RENAME COLUMN user_id TO customer_id;
           (breaks the running application immediately)

DO instead:
  1. Add new column: ALTER TABLE orders ADD COLUMN customer_id BIGINT;
  2. Deploy code that writes to both user_id and customer_id
  3. Backfill customer_id = user_id where customer_id IS NULL
  4. Deploy code that reads from customer_id only
  5. Later: DROP COLUMN user_id
```

### Dropping a Column / Table

```sql
-- Step 1: Deploy code that no longer references the column
-- Step 2: Only then drop it (avoids "column not found" errors)
ALTER TABLE orders DROP COLUMN legacy_field;

-- For tables: same principle — remove all code references first
```

### Adding an Index

```sql
-- DANGEROUS: blocks all writes to the table for the duration of the build
CREATE INDEX idx_orders_email ON orders(email);

-- SAFE: concurrent build, writes continue (Postgres)
CREATE INDEX CONCURRENTLY idx_orders_email ON orders(email);
-- Note: CONCURRENTLY cannot run inside a transaction block
```
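Because `CONCURRENTLY` refuses to run inside a transaction block, drivers that open transactions implicitly must switch to autocommit first. A minimal sketch, assuming a psycopg2-style connection object (the helper name is hypothetical):

```python
def create_index_concurrently(conn, index_name, table, column):
    """Build an index without blocking writes (Postgres).

    CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
    so the connection is switched to autocommit for this statement.
    Assumes a connection exposing .autocommit and .cursor(), as
    psycopg2 connections do.
    """
    previous = conn.autocommit
    conn.autocommit = True
    try:
        cur = conn.cursor()
        cur.execute(
            f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index_name} "
            f"ON {table} ({column})"
        )
        cur.close()
    finally:
        conn.autocommit = previous  # restore the caller's setting
```

Restoring the previous autocommit setting in `finally` matters: a failed concurrent build leaves an `INVALID` index behind, and the caller's session should not be left in an unexpected transaction mode on top of that.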

### Adding a NOT NULL Constraint

```sql
-- DANGEROUS: full table scan while holding an ACCESS EXCLUSIVE lock
ALTER TABLE orders ALTER COLUMN email SET NOT NULL;

-- SAFE: add the constraint unvalidated, then validate separately
ALTER TABLE orders ADD CONSTRAINT orders_email_not_null
  CHECK (email IS NOT NULL) NOT VALID;          -- fast, no table scan
-- Run a backfill to fix any NULL rows, then:
ALTER TABLE orders VALIDATE CONSTRAINT orders_email_not_null;  -- scans, but does not block reads/writes
-- Postgres 12+: with the validated CHECK in place, SET NOT NULL skips the scan
ALTER TABLE orders ALTER COLUMN email SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_email_not_null;
```

## 3. Backfilling Large Tables

Never run a single `UPDATE` statement on millions of rows. It holds row locks for the entire transaction and floods the write-ahead log.

```python
import time

# Batch backfill — process in chunks with a pause between batches
BATCH_SIZE = 1000
MAX_ID = db.execute("SELECT MAX(id) FROM orders").scalar()

for start in range(0, MAX_ID + 1, BATCH_SIZE):  # +1 so the last id is included
    db.execute("""
        UPDATE orders
        SET customer_id = user_id
        WHERE id BETWEEN :start AND :end
          AND customer_id IS NULL
    """, {"start": start, "end": start + BATCH_SIZE - 1})
    db.commit()
    time.sleep(0.05)  # throttle — give other queries a chance
```
Or in pure SQL (Postgres 11+: a `DO` block may issue `COMMIT` when it is not run inside an outer transaction, which releases locks between batches):

```sql
DO $$
DECLARE
  v_rows BIGINT;
BEGIN
  LOOP
    UPDATE orders
    SET customer_id = user_id
    WHERE id IN (
      SELECT id FROM orders
      WHERE customer_id IS NULL
      ORDER BY id
      LIMIT 1000
    );
    GET DIAGNOSTICS v_rows = ROW_COUNT;
    EXIT WHEN v_rows = 0;
    COMMIT;                  -- release locks between batches
    PERFORM pg_sleep(0.05);  -- throttle
  END LOOP;
END $$;
```

## 4. Testing Migrations

```shell
# 1. Always test on a production-sized dataset in staging first
#    (anonymise sensitive columns after loading)
pg_dump production_db | psql staging_db
```

```sql
-- 2. Time the migration on staging (psql) — if it takes more than
--    ~1 second, plan accordingly
\timing
-- run your migration here

-- 3. Test the rollback: run the down migration immediately after the
--    up migration on staging and verify the schema matches the original

-- 4. Check for queries waiting on locks while the migration runs
SELECT mode, granted, pid, query_start, query
FROM pg_locks JOIN pg_stat_activity USING (pid)
WHERE NOT granted;
```
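A related safeguard when applying the migration itself: even a fast DDL statement queues behind long-running transactions, and every later query then queues behind the DDL. A common mitigation is a short `lock_timeout` plus retries. This is a hypothetical helper, where `execute` is any callable that runs one SQL statement and raises on failure:

```python
import time

def run_ddl_with_retry(execute, ddl, attempts=5, lock_timeout="2s", backoff=1.0):
    """Run DDL with a short lock_timeout, retrying if the lock times out.

    With lock_timeout set, Postgres aborts the DDL if it cannot acquire
    its lock in time, instead of silently queueing and blocking all
    traffic behind it. Returns the attempt number that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            execute(f"SET lock_timeout = '{lock_timeout}'")
            execute(ddl)
            return attempt
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)  # back off before retrying
```

A failed attempt here is cheap (the DDL gives up after `lock_timeout`), so retrying a few times usually finds a gap between long-running transactions.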

## 5. Rollback Plan Template

Before every migration, document:

## Migration: add_customer_id_to_orders

### What it does
Adds a `customer_id` column to the `orders` table.

### Estimated time (on production dataset)
< 50 ms (nullable column add on 10M rows, Postgres 14)

### Rollback
```sql
ALTER TABLE orders DROP COLUMN customer_id;
```

### Risk
Low. Backwards-compatible; the running application ignores the new column.

### Dependencies
None. No code change required before running.



---

## 6. Pre-Apply Checklist

- [ ] Migration has been tested on a staging environment with production-scale data
- [ ] Rollback SQL is written and tested
- [ ] Migration does not hold a lock for more than 1 second on production-scale data
- [ ] Large `UPDATE`/`DELETE` uses batching, not a single statement
- [ ] `CREATE INDEX` uses `CONCURRENTLY` on large tables
- [ ] `NOT NULL` constraints use `NOT VALID` + `VALIDATE CONSTRAINT` on large tables
- [ ] Migration does not break the currently deployed application code
- [ ] Migration is scheduled outside of peak traffic (even if zero-downtime, reduces risk)
- [ ] DBA or senior engineer has reviewed if the table has more than 10M rows