Data Modeling
You are designing or reviewing the data model for: $ARGUMENTS
A poor data model is permanent suffering. Fixing a normalization mistake in a 500 M-row production table costs weeks. The time spent designing a schema correctly at the start pays back every day the system runs.
Principles
- Model the domain, not the UI. The database schema reflects business entities and their relationships, not what a web form happens to show.
- Integrity is enforced in the database, not just the application. Foreign keys, constraints, and check constraints are not optional.
- Name things precisely. Ambiguous names create endless confusion.
orders.status— of what type?varchar?enum? What values are valid? - Normalise until it hurts, then denormalise with intention. Start normalised; denormalise only when you have a profiled query to justify it.
- Plan for change. Columns get added. Write a schema that supports forward migration without rewriting the world.
1. Entity–Relationship Design
Identify Entities and Relationships
Before writing a single CREATE TABLE, sketch:
- Entities — the nouns:
User,Order,Product,Invoice - Relationships — the verbs: User places Order, Order contains Products
- Cardinality — one-to-one, one-to-many, many-to-many
User ——< Order >—— OrderItem >—— Product
|
└— Address (billing)
└— Address (shipping)
Primary Key Choice
| Approach | Pros | Cons |
|---|---|---|
BIGINT GENERATED BY DEFAULT AS IDENTITY |
Fast insert, small index | Exposes row count; sequential can be guessed |
UUID v4 |
Globally unique; safe to expose in URLs | 16 bytes; random = index fragmentation |
UUID v7 |
Globally unique + time-sortable | Requires Postgres 17+ or app-generated |
| Natural key | Self-describing | Changes in the real world; joins are verbose |
Use surrogate keys (integer or UUID) as primary keys. Natural keys as unique constraints.
2. Normalisation
First Normal Form (1NF)
- No repeating groups or arrays stored in a single column
- Each column holds atomic values
-- BAD: comma-separated tags violate 1NF
CREATE TABLE posts (id BIGINT, tags TEXT); -- 'php,mysql,performance'
-- GOOD: separate table
CREATE TABLE post_tags (
post_id BIGINT REFERENCES posts(id) ON DELETE CASCADE,
tag VARCHAR(64) NOT NULL,
PRIMARY KEY (post_id, tag)
);
Second Normal Form (2NF)
- Every non-key column depends on the entire primary key (no partial dependency)
Third Normal Form (3NF)
- No transitive dependencies (no column depends on a non-key column)
-- BAD: city depends on zip_code, not on user_id
CREATE TABLE users (
id BIGINT PRIMARY KEY,
zip_code VARCHAR(10),
city VARCHAR(100) -- transitive dependency
);
-- GOOD: extract to separate table
CREATE TABLE zip_codes (
code VARCHAR(10) PRIMARY KEY,
city VARCHAR(100) NOT NULL
);
CREATE TABLE users (
id BIGINT PRIMARY KEY,
zip_code VARCHAR(10) REFERENCES zip_codes(code)
);
3. Column Design
-- Timestamp: always with timezone; always UTC input
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-- Money: never FLOAT — use DECIMAL or integer cents
amount_cents INTEGER NOT NULL, -- store cents, display dollars
-- or
amount DECIMAL(19, 4) NOT NULL,
-- Status/type fields: use ENUM or CHECK constraint
status VARCHAR(20) NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending', 'active', 'cancelled', 'completed')),
-- Boolean: explicit NOT NULL with default
is_active BOOLEAN NOT NULL DEFAULT TRUE,
-- Soft deletes: use a timestamp, not a boolean
deleted_at TIMESTAMPTZ NULL DEFAULT NULL,
-- Audit trail: who changed it?
created_by BIGINT REFERENCES users(id),
updated_by BIGINT REFERENCES users(id),
4. Indexes
-- 1. Index all foreign keys (Postgres does NOT do this automatically)
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
-- 2. Index columns in frequent WHERE conditions
CREATE INDEX idx_orders_status ON orders(status) WHERE deleted_at IS NULL;
-- 3. Composite index — column order matters: most selective first, or match WHERE clause
CREATE INDEX idx_orders_user_status ON orders(user_id, status, created_at DESC);
-- 4. Partial index — enormously more efficient for filtered queries
CREATE INDEX idx_orders_pending ON orders(created_at) WHERE status = 'pending';
-- 5. Unique constraint (also creates an index)
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE (email);
Index audit:
-- Find unused indexes (Postgres)
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY tablename;
-- Find tables with sequential scans (missing indexes)
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY seq_scan DESC;
5. Common Patterns
Audit Log Table
CREATE TABLE audit_log (
id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
table_name VARCHAR(64) NOT NULL,
record_id BIGINT NOT NULL,
action VARCHAR(16) NOT NULL CHECK (action IN ('INSERT', 'UPDATE', 'DELETE')),
changed_by BIGINT REFERENCES users(id),
changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
old_data JSONB,
new_data JSONB
);
CREATE INDEX idx_audit_log_record ON audit_log(table_name, record_id);
Tree / Hierarchy (Closure Table — most query-friendly)
CREATE TABLE categories (
id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
name VARCHAR(255) NOT NULL
);
CREATE TABLE category_paths (
ancestor BIGINT NOT NULL REFERENCES categories(id),
descendant BIGINT NOT NULL REFERENCES categories(id),
depth SMALLINT NOT NULL,
PRIMARY KEY (ancestor, descendant)
);
-- All descendants of category 5:
SELECT c.* FROM categories c
JOIN category_paths cp ON cp.descendant = c.id
WHERE cp.ancestor = 5 AND cp.depth > 0;
Polymorphic Associations (avoid if possible — use exclusive arcs)
-- BAD: polymorphic
CREATE TABLE comments (
id BIGINT PRIMARY KEY,
body TEXT,
subject_type VARCHAR(64), -- 'post', 'video', 'product'
subject_id BIGINT -- can never have a FK
);
-- BETTER: exclusive arcs (nullable FKs, only one set at a time)
CREATE TABLE comments (
id BIGINT PRIMARY KEY,
body TEXT,
post_id BIGINT REFERENCES posts(id),
video_id BIGINT REFERENCES videos(id),
product_id BIGINT REFERENCES products(id),
CHECK (
(post_id IS NOT NULL)::INT +
(video_id IS NOT NULL)::INT +
(product_id IS NOT NULL)::INT = 1
)
);
6. Schema Review Checklist
- [ ] All tables have a surrogate primary key
- [ ] All foreign keys have an index
- [ ] All foreign keys have an explicit
ON DELETEbehaviour defined - [ ] No money stored as FLOAT
- [ ] No comma-separated values in a single column (1NF)
- [ ] Timestamps are
TIMESTAMPTZ(notTIMESTAMP WITHOUT TIME ZONE) - [ ] Status/type columns have a CHECK constraint or are an ENUM
- [ ] Soft-delete uses
deleted_at TIMESTAMPTZ NULL(notis_deleted BOOLEAN) - [ ] Audit columns (
created_at,updated_at,created_by) are present on mutable tables - [ ] Schema diagram or ERD is up to date in
docs/