Data Modeling

You are designing or reviewing the data model for: $ARGUMENTS

A poor data model is permanent suffering. Fixing a normalization mistake in a 500 M-row production table costs weeks. The time spent designing a schema correctly at the start pays back every day the system runs.


Principles

  1. Model the domain, not the UI. The database schema reflects business entities and their relationships, not what a web form happens to show.
  2. Integrity is enforced in the database, not just the application. Foreign keys, constraints, and check constraints are not optional.
  3. Name things precisely. Ambiguous names create endless confusion. orders.status — of what type? varchar? enum? What values are valid?
  4. Normalise until it hurts, then denormalise with intention. Start normalised; denormalise only when you have a profiled query to justify it.
  5. Plan for change. Columns get added. Write a schema that supports forward migration without rewriting the world.

1. Entity–Relationship Design

Identify Entities and Relationships

Before writing a single CREATE TABLE, sketch:

  1. Entities — the nouns: User, Order, Product, Invoice
  2. Relationships — the verbs: User places Order, Order contains Products
  3. Cardinality — one-to-one, one-to-many, many-to-many
User ——< Order >—— OrderItem >—— Product
               |
               └— Address (billing)
               └— Address (shipping)

Primary Key Choice

Approach Pros Cons
BIGINT GENERATED BY DEFAULT AS IDENTITY Fast insert, small index Exposes row count; sequential can be guessed
UUID v4 Globally unique; safe to expose in URLs 16 bytes; random = index fragmentation
UUID v7 Globally unique + time-sortable Requires Postgres 17+ or app-generated
Natural key Self-describing Changes in the real world; joins are verbose

Use surrogate keys (integer or UUID) as primary keys. Natural keys as unique constraints.


2. Normalisation

First Normal Form (1NF)

  • No repeating groups or arrays stored in a single column
  • Each column holds atomic values
-- BAD: comma-separated tags violate 1NF
CREATE TABLE posts (id BIGINT, tags TEXT);  -- 'php,mysql,performance'

-- GOOD: separate table
CREATE TABLE post_tags (
  post_id BIGINT REFERENCES posts(id) ON DELETE CASCADE,
  tag     VARCHAR(64) NOT NULL,
  PRIMARY KEY (post_id, tag)
);

Second Normal Form (2NF)

  • Every non-key column depends on the entire primary key (no partial dependency)

Third Normal Form (3NF)

  • No transitive dependencies (no column depends on a non-key column)
-- BAD: city depends on zip_code, not on user_id
CREATE TABLE users (
  id       BIGINT PRIMARY KEY,
  zip_code VARCHAR(10),
  city     VARCHAR(100)    -- transitive dependency
);

-- GOOD: extract to separate table
CREATE TABLE zip_codes (
  code VARCHAR(10) PRIMARY KEY,
  city VARCHAR(100) NOT NULL
);
CREATE TABLE users (
  id       BIGINT PRIMARY KEY,
  zip_code VARCHAR(10) REFERENCES zip_codes(code)
);

3. Column Design

-- Timestamp: always with timezone; always UTC input
created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),

-- Money: never FLOAT — use DECIMAL or integer cents
amount_cents  INTEGER NOT NULL,          -- store cents, display dollars
-- or
amount  DECIMAL(19, 4) NOT NULL,

-- Status/type fields: use ENUM or CHECK constraint
status  VARCHAR(20) NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'active', 'cancelled', 'completed')),

-- Boolean: explicit NOT NULL with default
is_active  BOOLEAN NOT NULL DEFAULT TRUE,

-- Soft deletes: use a timestamp, not a boolean
deleted_at  TIMESTAMPTZ NULL DEFAULT NULL,

-- Audit trail: who changed it?
created_by  BIGINT REFERENCES users(id),
updated_by  BIGINT REFERENCES users(id),

4. Indexes

-- 1. Index all foreign keys (Postgres does NOT do this automatically)
CREATE INDEX idx_orders_user_id       ON orders(user_id);
CREATE INDEX idx_order_items_order_id ON order_items(order_id);

-- 2. Index columns in frequent WHERE conditions
CREATE INDEX idx_orders_status ON orders(status) WHERE deleted_at IS NULL;

-- 3. Composite index — column order matters: most selective first, or match WHERE clause
CREATE INDEX idx_orders_user_status ON orders(user_id, status, created_at DESC);

-- 4. Partial index — enormously more efficient for filtered queries
CREATE INDEX idx_orders_pending ON orders(created_at) WHERE status = 'pending';

-- 5. Unique constraint (also creates an index)
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE (email);

Index audit:

-- Find unused indexes (Postgres)
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY tablename;

-- Find tables with sequential scans (missing indexes)
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY seq_scan DESC;

5. Common Patterns

Audit Log Table

CREATE TABLE audit_log (
  id         BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  table_name VARCHAR(64)  NOT NULL,
  record_id  BIGINT       NOT NULL,
  action     VARCHAR(16)  NOT NULL CHECK (action IN ('INSERT', 'UPDATE', 'DELETE')),
  changed_by BIGINT       REFERENCES users(id),
  changed_at TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
  old_data   JSONB,
  new_data   JSONB
);
CREATE INDEX idx_audit_log_record ON audit_log(table_name, record_id);

Tree / Hierarchy (Closure Table — most query-friendly)

CREATE TABLE categories (
  id   BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  name VARCHAR(255) NOT NULL
);

CREATE TABLE category_paths (
  ancestor   BIGINT NOT NULL REFERENCES categories(id),
  descendant BIGINT NOT NULL REFERENCES categories(id),
  depth      SMALLINT NOT NULL,
  PRIMARY KEY (ancestor, descendant)
);

-- All descendants of category 5:
SELECT c.* FROM categories c
JOIN category_paths cp ON cp.descendant = c.id
WHERE cp.ancestor = 5 AND cp.depth > 0;

Polymorphic Associations (avoid if possible — use exclusive arcs)

-- BAD: polymorphic
CREATE TABLE comments (
  id          BIGINT PRIMARY KEY,
  body        TEXT,
  subject_type VARCHAR(64),  -- 'post', 'video', 'product'
  subject_id  BIGINT         -- can never have a FK
);

-- BETTER: exclusive arcs (nullable FKs, only one set at a time)
CREATE TABLE comments (
  id         BIGINT PRIMARY KEY,
  body       TEXT,
  post_id    BIGINT REFERENCES posts(id),
  video_id   BIGINT REFERENCES videos(id),
  product_id BIGINT REFERENCES products(id),
  CHECK (
    (post_id IS NOT NULL)::INT +
    (video_id IS NOT NULL)::INT +
    (product_id IS NOT NULL)::INT = 1
  )
);

6. Schema Review Checklist

  • [ ] All tables have a surrogate primary key
  • [ ] All foreign keys have an index
  • [ ] All foreign keys have an explicit ON DELETE behaviour defined
  • [ ] No money stored as FLOAT
  • [ ] No comma-separated values in a single column (1NF)
  • [ ] Timestamps are TIMESTAMPTZ (not TIMESTAMP WITHOUT TIME ZONE)
  • [ ] Status/type columns have a CHECK constraint or are an ENUM
  • [ ] Soft-delete uses deleted_at TIMESTAMPTZ NULL (not is_deleted BOOLEAN)
  • [ ] Audit columns (created_at, updated_at, created_by) are present on mutable tables
  • [ ] Schema diagram or ERD is up to date in docs/