# Database Migrations
You are planning or reviewing a migration for: $ARGUMENTS
Schema migrations are the most dangerous routine operation in backend engineering. An incorrectly written or timed migration can lock a production table for minutes, cause data loss, or break the running application. Apply every rule below.
## Principles
- Backwards compatible or gated. Unless you are taking a maintenance window, every migration must be compatible with the version of the application currently running in production.
- Rollback is designed before apply. If you cannot describe how to undo this migration, you are not ready to run it.
- Large tables demand special treatment. An `ALTER TABLE` that acquires an exclusive lock on a 500M-row table is an outage.
- Data migrations are separate from schema migrations. Never combine DDL and large DML in the same migration file.
- Zero-downtime is the default target. Explicitly document when a maintenance window is unavoidable.
## 1. Safe Migration Order (Zero-Downtime)
The expand–contract pattern allows schema changes without downtime:
```
Phase 1: EXPAND (migration only — no app code change)
  → Add new column / table (nullable, no default that requires a table rewrite)
  → Old app code continues to work

Phase 2: MIGRATE APP
  → Deploy new app code that writes to BOTH old and new columns
  → Backfill old rows (see §3)

Phase 3: CONTRACT (after backfill is complete)
  → Make column NOT NULL (if needed)
  → Drop old column in a later migration
  → Remove old code paths
```
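Phase 2's dual write can be sketched in application code. This is a minimal illustration, not a real ORM call: `order_row` and both column names are assumptions, standing in for whatever write path your application uses during a hypothetical `user_id` to `customer_id` rename.

```python
def order_row(user_id: int, notes: str) -> dict:
    """Build the row to insert during the MIGRATE APP phase.

    Write BOTH the old and the new column so the currently deployed
    release (reading user_id) and the next release (reading customer_id)
    see consistent data throughout the migration window.
    """
    return {
        "user_id": user_id,      # old column: still read by the running release
        "customer_id": user_id,  # new column: read after the next deploy
        "notes": notes,
    }
```

Once the backfill is complete and every reader uses the new column, the dual write is deleted in the CONTRACT phase.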
## 2. DDL Safety by Operation
### Adding a Column
```sql
-- SAFE: nullable with no default
ALTER TABLE orders ADD COLUMN notes TEXT;

-- SAFE in Postgres 11+: nullable with a constant default (no table rewrite)
ALTER TABLE orders ADD COLUMN status VARCHAR(20) DEFAULT 'active';

-- DANGEROUS: NOT NULL without a default on a large table (full rewrite)
-- Do this instead:
ALTER TABLE orders ADD COLUMN priority INT;             -- Step 1: add nullable
-- Step 2: backfill in batches (see §3), e.g.
--   UPDATE orders SET priority = 0 WHERE priority IS NULL;
ALTER TABLE orders ALTER COLUMN priority SET DEFAULT 0; -- Step 3: default for new rows
ALTER TABLE orders ALTER COLUMN priority SET NOT NULL;  -- Step 4: after backfill
```
### Renaming a Column
NEVER do: `ALTER TABLE orders RENAME COLUMN user_id TO customer_id;` (breaks the running application immediately).

DO instead:

1. Add the new column: `ALTER TABLE orders ADD COLUMN customer_id BIGINT;`
2. Deploy code that writes to both `user_id` and `customer_id`
3. Backfill `customer_id = user_id` where `customer_id IS NULL` (see §3)
4. Deploy code that reads from `customer_id` only
5. Later: `ALTER TABLE orders DROP COLUMN user_id;`
### Dropping a Column / Table
```sql
-- Step 1: Deploy code that no longer references the column
-- Step 2: Only then drop it (avoids "column not found" errors)
ALTER TABLE orders DROP COLUMN legacy_field;

-- For tables: same principle — remove all code references first
```
### Adding an Index
```sql
-- DANGEROUS: blocks all writes to the table for the duration of the build
CREATE INDEX idx_orders_email ON orders(email);

-- SAFE: concurrent build, writes continue (Postgres)
CREATE INDEX CONCURRENTLY idx_orders_email ON orders(email);
-- Note: CONCURRENTLY cannot run inside a transaction block
```
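One operational wrinkle: if a `CREATE INDEX CONCURRENTLY` build fails partway, Postgres leaves an INVALID index behind that must be dropped before retrying. A minimal sketch of a build helper that handles this; the `execute` callable is an assumption (the caller must supply one that runs each statement in autocommit mode, since `CONCURRENTLY` cannot run inside a transaction):

```python
def build_index_concurrently(execute, index: str, table: str, column: str) -> list[str]:
    """Drop any leftover INVALID index from a previous failed build,
    then build the index concurrently. Returns the statements it ran."""
    statements = [
        f"DROP INDEX CONCURRENTLY IF EXISTS {index}",
        f"CREATE INDEX CONCURRENTLY {index} ON {table}({column})",
    ]
    for sql in statements:  # execute must autocommit each statement
        execute(sql)
    return statements
```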
### Adding a NOT NULL Constraint
```sql
-- DANGEROUS: full table scan while holding an exclusive lock
ALTER TABLE orders ALTER COLUMN email SET NOT NULL;

-- SAFE: add the constraint unvalidated, then validate separately
ALTER TABLE orders ADD CONSTRAINT orders_email_not_null
    CHECK (email IS NOT NULL) NOT VALID;  -- fast, no table scan
-- Run backfill to fix any NULL rows, then:
ALTER TABLE orders VALIDATE CONSTRAINT orders_email_not_null; -- scans, but does not block reads or writes
-- Postgres 12+: SET NOT NULL now skips the scan, because the valid CHECK proves it
ALTER TABLE orders ALTER COLUMN email SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_email_not_null;
```
## 3. Backfilling Large Tables
Never run a single UPDATE statement on millions of rows. It holds locks and fills the write-ahead log.
```python
# Batch backfill — process in chunks with a pause between
import time

BATCH_SIZE = 1000
MAX_ID = db.execute("SELECT MAX(id) FROM orders").scalar()

for start in range(0, MAX_ID + 1, BATCH_SIZE):
    db.execute("""
        UPDATE orders
        SET customer_id = user_id
        WHERE id BETWEEN :start AND :end
          AND customer_id IS NULL
    """, {"start": start, "end": start + BATCH_SIZE - 1})
    db.commit()
    time.sleep(0.05)  # throttle — give other queries a chance
```
```sql
-- Or in pure SQL with a loop (PostgreSQL DO block)
DO $$
DECLARE
  v_rows BIGINT;
BEGIN
  LOOP
    UPDATE orders
    SET customer_id = user_id
    WHERE id IN (
      SELECT id FROM orders
      WHERE customer_id IS NULL
      ORDER BY id
      LIMIT 1000
    );
    GET DIAGNOSTICS v_rows = ROW_COUNT;
    EXIT WHEN v_rows = 0;
    COMMIT;  -- release locks per batch (Postgres 11+, and only when the
             -- DO block is not running inside an enclosing transaction)
    PERFORM pg_sleep(0.05);
  END LOOP;
END$$;
```
## 4. Testing Migrations
```shell
# 1. Always test on a production-sized (anonymised) dataset in staging first
pg_dump production_db | psql staging_db
```

```sql
-- 2. Time the migration on staging — if > 1 second, plan accordingly
\timing
-- run your migration here

-- 3. Test the rollback: run the down migration immediately after the up
--    migration on staging, and verify the schema matches the original

-- 4. Check for queries stuck waiting on a lock
SELECT mode, granted, pid, query_start, query
FROM pg_locks JOIN pg_stat_activity USING (pid)
WHERE NOT granted;
```
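The timing check in step 2 can also live in a migration runner. A minimal sketch, assuming the caller passes a `run` callable that executes one migration step; the 1-second budget mirrors the lock budget stated elsewhere in this document:

```python
import time

def timed_migration(run, budget_s: float = 1.0) -> float:
    """Execute one migration step and return its wall-clock time,
    raising if it exceeds the budget (measure on staging first)."""
    start = time.monotonic()
    run()
    elapsed = time.monotonic() - start
    if elapsed > budget_s:
        raise RuntimeError(
            f"migration took {elapsed:.2f}s, over the {budget_s:.1f}s budget"
        )
    return elapsed
```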
## 5. Rollback Plan Template
Before every migration, document:
## Migration: add_customer_id_to_orders
### What it does
Adds `customer_id` column to `orders` table.
### Estimated time (on production dataset)
< 50ms (nullable column add on 10M rows, Postgres 14)
### Rollback
```sql
ALTER TABLE orders DROP COLUMN customer_id;
```
### Risk
Low. Backwards-compatible. Running application ignores new column.
### Dependencies
None. No code change required before running.
---
## 6. Pre-Apply Checklist
- [ ] Migration has been tested on a staging environment with production-scale data
- [ ] Rollback SQL is written and tested
- [ ] Migration does not hold a lock for more than 1 second on production-scale data
- [ ] Large `UPDATE`/`DELETE` uses batching, not a single statement
- [ ] `CREATE INDEX` uses `CONCURRENTLY` on large tables
- [ ] `NOT NULL` constraints use `NOT VALID` + `VALIDATE CONSTRAINT` on large tables
- [ ] Migration does not break the currently deployed application code
- [ ] Migration is scheduled outside of peak traffic (even if zero-downtime, reduces risk)
- [ ] DBA or senior engineer has reviewed the migration if the table exceeds 10M rows
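The lock-budget item above can also be enforced by the database itself: setting `lock_timeout` before DDL makes the migration fail fast instead of queueing behind a long-running query (and stalling every query queued behind it). A sketch that prepends the session guards; the timeout values are illustrative, not recommendations for your workload:

```python
def guarded_ddl(ddl: str, lock_timeout: str = "2s",
                statement_timeout: str = "5min") -> list[str]:
    """Wrap a DDL statement with Postgres session timeouts so it aborts
    quickly if it cannot acquire its lock, rather than blocking traffic."""
    return [
        f"SET lock_timeout = '{lock_timeout}'",
        f"SET statement_timeout = '{statement_timeout}'",
        ddl,
    ]
```

If the guarded statement aborts on the timeout, retry it off-peak rather than raising the limit.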