Beyond basic CRUD: Query optimization, execution plans, window functions, transactions, and interview patterns.
Understanding the query lifecycle helps you write better queries and debug performance issues.
Parsing catches syntax errors: `SELECT * FORM users` fails here (`FORM` vs `FROM`). Semantic analysis catches references to objects that don't exist: `SELECT * FROM nonexistent_table` fails here.
Planning is the most complex and important stage. The optimizer enumerates candidate plans (scan methods, join orders, join algorithms), estimates each one's cost from table statistics, and picks the cheapest.
Because those cost estimates come from statistics, running ANALYZE (which updates them) can fix slow queries.
The executor runs the chosen physical plan using a pull-based iterator model: each plan node pulls rows from its children one at a time, on demand.
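The pull-based iterator model can be sketched with Python generators; the node names here are illustrative, not a real engine's API:

```python
# Each plan node is a generator that pulls rows from its child on demand.
# Nothing materializes until the root asks for a row.

def seq_scan(table):
    for row in table:          # leaf node: yields rows one at a time
        yield row

def filter_node(child, predicate):
    for row in child:          # pulls from the child only when asked
        if predicate(row):
            yield row

def limit_node(child, n):
    for i, row in enumerate(child):
        if i >= n:
            return             # stops pulling early - the rest is never scanned
        yield row

users = [{"id": i, "status": "active" if i % 2 else "inactive"}
         for i in range(1, 11)]

plan = limit_node(
    filter_node(seq_scan(users), lambda r: r["status"] == "active"), 2)
result = list(plan)            # rows flow only when the root pulls
```

Because `limit_node` stops pulling after two rows, the scan never touches the rest of the table, which is exactly why `LIMIT` queries can be cheap.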
The workhorse of database indexing. Keeps data sorted for efficient range queries.
-- Equality
SELECT * FROM users WHERE id = 42;
-- Range queries
SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31';
-- Prefix matching
SELECT * FROM users WHERE name LIKE 'John%';
-- Sorting
SELECT * FROM products ORDER BY price LIMIT 10;
-- Leading wildcard (must scan all)
SELECT * FROM users WHERE name LIKE '%smith';
-- Functions on indexed column
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
-- OR conditions on different columns
SELECT * FROM users WHERE status = 'active' OR role = 'admin';
-- NOT conditions
SELECT * FROM users WHERE status != 'deleted';
Composite indexes follow the leftmost-prefix rule: an index on (country, city) is useless for WHERE city = 'Paris'.
-- Index on (country, city, zipcode)
-- Uses full index (all 3 columns)
WHERE country = 'USA' AND city = 'NYC' AND zipcode = '10001'
-- Uses first 2 columns
WHERE country = 'USA' AND city = 'NYC'
-- Uses first column only
WHERE country = 'USA'
-- CANNOT use index (missing leftmost column)
WHERE city = 'NYC'
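The leftmost-prefix rule is easy to observe with SQLite's `EXPLAIN QUERY PLAN` (a sketch; table and index names are illustrative, and PostgreSQL behaves the same way here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE addresses (country TEXT, city TEXT, zipcode TEXT, street TEXT)")
conn.execute("CREATE INDEX idx_addr ON addresses(country, city, zipcode)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in the last column
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

with_prefix = plan(
    "SELECT * FROM addresses WHERE country = 'USA' AND city = 'NYC'")
no_prefix = plan("SELECT * FROM addresses WHERE city = 'NYC'")
# with_prefix is a SEARCH using idx_addr; no_prefix falls back to a full SCAN
```

Dropping the leading `country` column turns an index seek into a full scan, even though `city` is in the index.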
Put equality columns before range columns: (status, created_at) works well for WHERE status = 'active' AND created_at > '2024-01-01'.
When an index contains ALL columns needed by a query, it's a "covering index" - no table access needed.
-- Query
SELECT email, name FROM users WHERE status = 'active';
-- Covering index - includes all columns in query
CREATE INDEX idx_users_status_covering
ON users(status) INCLUDE (email, name);
-- PostgreSQL will show "Index Only Scan" in EXPLAIN
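SQLite has no `INCLUDE` clause, but putting all needed columns in the index key shows the same effect: its plan reports a `COVERING INDEX`, the analogue of PostgreSQL's "Index Only Scan". A sketch with illustrative names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT, email TEXT, name TEXT)")
# All columns the query touches live in the index key
conn.execute(
    "CREATE INDEX idx_users_status_covering ON users(status, email, name)")

detail = " ".join(
    r[-1] for r in conn.execute(
        "EXPLAIN QUERY PLAN SELECT email, name FROM users WHERE status = 'active'"
    )
)
# The plan reads only the index - no table (heap) access at all
```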
Index only a subset of rows. Smaller index = faster updates + fits in memory.
-- Only 5% of orders are pending, but we query them constantly
CREATE INDEX idx_orders_pending
ON orders(created_at)
WHERE status = 'pending';
-- Query must include the filter condition
SELECT * FROM orders
WHERE status = 'pending' AND created_at > NOW() - INTERVAL '1 day';
-- Only for equality comparisons, O(1) lookup
CREATE INDEX idx_users_email_hash ON users USING hash(email);
-- Good for
WHERE email = 'john@example.com'
-- Cannot use for ranges, ordering, or prefix matching
| Index Type | Best For | Limitations |
|---|---|---|
| B-Tree | Equality, ranges, sorting, prefix | Not great for full-text or spatial |
| Hash | Exact equality lookups | No ranges, no ordering |
| GiST | Geometric, full-text, ranges | More complex, slower updates |
| GIN | Arrays, JSONB, full-text | Expensive to update |
| BRIN | Very large tables with natural ordering | Only for correlated data |
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id)
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id;
| Metric | Meaning | Red Flag |
|---|---|---|
| `cost=X..Y` | Startup cost..Total cost (arbitrary units) | Very high numbers |
| `rows=N` | Estimated rows to process | Estimate ≠ Actual (bad stats) |
| `actual time` | Real execution time (ms) | High times on simple ops |
| `loops=N` | Times this node executed | High loops = nested loop issue |
Reads entire table. Bad for large tables with selective conditions.
-- Usually bad
Seq Scan on users
Filter: (status = 'active')
Rows Removed by Filter: 999000
Uses index, then fetches rows from table (random I/O).
-- Good for selective queries
Index Scan using idx_users_status
Index Cond: (status = 'active')
All data from index - no table access. Best possible.
-- Best - covering index
Index Only Scan using idx_covering
Index Cond: (status = 'active')
Heap Fetches: 0
Collects row locations, sorts, then fetches. Good for medium selectivity.
-- Good for OR or multiple indexes
Bitmap Heap Scan on users
-> BitmapOr
-> Bitmap Index Scan
-> Bitmap Index Scan
Common red flags and fixes:
- Estimate far from reality (`rows=100` estimated vs `actual rows=100000`): stale statistics; run `ANALYZE tablename;` to update them.
- `loops=10000` on a Seq Scan: a nested loop is rescanning the table; index the join key.
- `Sort Method: external merge Disk`: the sort spilled to disk; raise `work_mem` or add an index to avoid sorting.
The optimizer chooses join algorithms based on table sizes, indexes, and available memory.
-- Good for nested loop: small outer, indexed inner
SELECT * FROM orders o -- 100 rows (filtered)
JOIN products p ON o.product_id = p.id -- Index on p.id
WHERE o.user_id = 42;
Hash joins only support equality predicates; a join condition like a.date > b.date rules them out (merge join can handle it).
| Algorithm | Time | Memory | Best When |
|---|---|---|---|
| Nested Loop | O(N × M) | O(1) | Small outer + indexed inner |
| Hash Join | O(N + M) | O(N) | No indexes, equality only |
| Merge Join | O(N + M)* | O(1) | Pre-sorted inputs, range joins |
*Plus sort cost if not pre-sorted
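The hash join row in the table above can be sketched in a few lines: build a hash table on the smaller input (the O(N) memory cost), then probe it in one pass over the larger input. Data and names are illustrative:

```python
def hash_join(left, right, left_key, right_key):
    # Build phase: hash the (ideally smaller) left input on its join key
    buckets = {}
    for row in left:
        buckets.setdefault(row[left_key], []).append(row)
    # Probe phase: a single pass over the right input - O(N + M) total
    out = []
    for row in right:
        for match in buckets.get(row[right_key], []):
            out.append({**match, **row})
    return out

users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Lin"}]
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 1},
          {"order_id": 12, "user_id": 3}]

joined = hash_join(users, orders, "user_id", "user_id")
# Two joined rows (both Ada's); order 12 has no matching user
```

The equality-only limitation is visible in the code: the probe is a dict lookup, so there is no way to express `a.date > b.date`.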
Window functions compute values across rows related to the current row without collapsing rows like GROUP BY.
function_name(...) OVER (
PARTITION BY column -- Groups rows (like GROUP BY but keeps all rows)
ORDER BY column -- Defines row order within partition
ROWS BETWEEN ... AND ... -- Frame: which rows to include
)
SELECT
name,
department,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank
FROM employees;
| name | department | salary | row_num | rank | dense_rank |
|---|---|---|---|---|---|
| Alice | Engineering | 150k | 1 | 1 | 1 |
| Bob | Engineering | 150k | 2 | 1 | 1 |
| Carol | Engineering | 120k | 3 | 3 | 2 |
| Dave | Sales | 100k | 1 | 1 | 1 |
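The table above can be reproduced with SQLite (3.25+ ships window functions); a sketch with the same illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("Alice", "Engineering", 150), ("Bob", "Engineering", 150),
    ("Carol", "Engineering", 120), ("Dave", "Sales", 100),
])

rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
           RANK()       OVER (PARTITION BY department ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rnk
    FROM employees
""").fetchall()
by_name = {r[0]: r[1:] for r in rows}
# Carol sits below the Alice/Bob tie: row_num 3, rank 3 (gap), dense_rank 2 (no gap)
```

Note that `ROW_NUMBER` breaks the Alice/Bob tie arbitrarily, which is why the assertions below only check Carol and Dave.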
-- Compare each sale to the previous day's sale
SELECT
date,
revenue,
LAG(revenue, 1) OVER (ORDER BY date) AS prev_day_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_over_day_change,
LEAD(revenue, 1) OVER (ORDER BY date) AS next_day_revenue
FROM daily_sales;
| date | revenue | prev_day | change | next_day |
|---|---|---|---|---|
| Jan 1 | 1000 | NULL | NULL | 1200 |
| Jan 2 | 1200 | 1000 | +200 | 1100 |
| Jan 3 | 1100 | 1200 | -100 | NULL |
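The LAG/LEAD query runs unchanged in SQLite; a sketch reproducing the day-over-day table with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (date TEXT, revenue INT)")
conn.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                 [("2024-01-01", 1000), ("2024-01-02", 1200), ("2024-01-03", 1100)])

rows = conn.execute("""
    SELECT date,
           revenue,
           LAG(revenue, 1)  OVER (ORDER BY date) AS prev_day_revenue,
           revenue - LAG(revenue, 1) OVER (ORDER BY date) AS change,
           LEAD(revenue, 1) OVER (ORDER BY date) AS next_day_revenue
    FROM daily_sales
    ORDER BY date
""").fetchall()
# First row has no previous day (NULL -> None); last row has no next day
```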
SELECT
date,
revenue,
-- Running total
SUM(revenue) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total,
-- 7-day moving average
AVG(revenue) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM daily_sales;
SELECT
department,
name,
salary,
FIRST_VALUE(name) OVER w AS highest_paid,
LAST_VALUE(name) OVER w AS lowest_paid,
salary - FIRST_VALUE(salary) OVER w AS diff_from_top
FROM employees
WINDOW w AS (
PARTITION BY department
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
LAST_VALUE has a gotcha: with an ORDER BY, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
so LAST_VALUE returns the current row, not the actual last.
Always specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING explicitly.
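The gotcha is easy to demonstrate in SQLite: with the default frame, LAST_VALUE just echoes the current row; with an explicit frame it returns the real last row. A minimal sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

rows = conn.execute("""
    SELECT v,
           -- Default frame ends at the current row
           LAST_VALUE(v) OVER (ORDER BY v) AS default_frame,
           -- Explicit frame covers the whole partition
           LAST_VALUE(v) OVER (
               ORDER BY v
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS full_frame
    FROM t ORDER BY v
""").fetchall()
# default_frame mirrors v; full_frame is the true last value for every row
```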
WITH
monthly_totals AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total) AS revenue
FROM orders
GROUP BY 1
),
with_growth AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue
FROM monthly_totals
)
SELECT
month,
revenue,
ROUND((revenue - prev_revenue) * 100.0 / NULLIF(prev_revenue, 0), 2) AS growth_pct
FROM with_growth;
-- Employee org chart: find all reports under a manager
WITH RECURSIVE org_chart AS (
-- Base case: the starting manager
SELECT id, name, manager_id, 0 AS depth
FROM employees
WHERE id = 1 -- CEO
UNION ALL
-- Recursive case: find direct reports
SELECT e.id, e.name, e.manager_id, oc.depth + 1
FROM employees e
JOIN org_chart oc ON e.manager_id = oc.id
)
SELECT * FROM org_chart;
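The org-chart CTE runs as-is in SQLite; a sketch on a small illustrative hierarchy (CEO with two reports, one of whom has a report):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INT, name TEXT, manager_id INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, "CEO", None), (2, "VP Eng", 1), (3, "VP Sales", 1), (4, "Engineer", 2),
])

rows = conn.execute("""
    WITH RECURSIVE org_chart AS (
        -- Base case: the starting manager
        SELECT id, name, manager_id, 0 AS depth
        FROM employees WHERE id = 1
        UNION ALL
        -- Recursive case: direct reports of anyone already in the chart
        SELECT e.id, e.name, e.manager_id, oc.depth + 1
        FROM employees e
        JOIN org_chart oc ON e.manager_id = oc.id
    )
    SELECT name, depth FROM org_chart ORDER BY depth, name
""").fetchall()
# Everyone appears exactly once, tagged with their distance from the CEO
```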
-- Find all paths from node A to node F
WITH RECURSIVE paths AS (
-- Start at node A
SELECT
source,
target,
ARRAY[source, target] AS path,
weight
FROM edges
WHERE source = 'A'
UNION ALL
-- Extend paths
SELECT
p.source,
e.target,
p.path || e.target,
p.weight + e.weight
FROM paths p
JOIN edges e ON p.target = e.source
WHERE NOT e.target = ANY(p.path) -- Prevent cycles
)
SELECT path, weight
FROM paths
WHERE target = 'F'
ORDER BY weight;
Recursive CTEs over graphs need cycle prevention (WHERE NOT x = ANY(path)) or a depth limit (WHERE depth < 10), or they can recurse forever on cyclic data.
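SQLite has no arrays, so the same pattern keeps the path as a string and checks cycles with `instr()`. A sketch on an illustrative graph (A→B→F costing 5, a direct A→F costing 7, and a back-edge B→A that the cycle check must ignore):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (source TEXT, target TEXT, weight INT)")
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)",
                 [("A", "B", 2), ("B", "F", 3), ("A", "F", 7), ("B", "A", 1)])

rows = conn.execute("""
    WITH RECURSIVE paths AS (
        SELECT source, target, source || '>' || target AS path, weight
        FROM edges WHERE source = 'A'
        UNION ALL
        SELECT p.source, e.target, p.path || '>' || e.target, p.weight + e.weight
        FROM paths p
        JOIN edges e ON p.target = e.source
        WHERE instr(p.path, e.target) = 0     -- cycle prevention
    )
    SELECT path, weight FROM paths WHERE target = 'F' ORDER BY weight
""").fetchall()
# Cheapest path first: A>B>F (5) beats the direct A>F (7);
# the B->A edge never fires because 'A' is already in the path
```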
| Property | Meaning | Example |
|---|---|---|
| Atomicity | All or nothing | Transfer fails midway → both accounts unchanged |
| Consistency | Valid state to valid state | Balance never goes negative (if constrained) |
| Isolation | Concurrent txns don't interfere | Two transfers don't overwrite each other |
| Durability | Committed = permanent | Power loss after COMMIT → data survives |
Reading uncommitted changes from another transaction.
T1: UPDATE balance = 0
T2: SELECT balance → 0 # Dirty!
T1: ROLLBACK
# T2 saw data that never existed
Same query returns different results within one transaction.
T1: SELECT balance → 100
T2: UPDATE balance = 50
T2: COMMIT
T1: SELECT balance → 50 # Changed!
New rows appear that match a previous query's conditions.
T1: SELECT COUNT(*) WHERE age>21 → 5
T2: INSERT (age=25)
T2: COMMIT
T1: SELECT COUNT(*) WHERE age>21 → 6
Two transactions overwrite each other's changes.
T1: SELECT balance → 100
T2: SELECT balance → 100
T1: UPDATE balance = 100 + 50
T2: UPDATE balance = 100 + 30
# Expected: 180, Actual: 130
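The lost update above can be simulated in a few lines: two read-modify-write cycles clobber each other, while letting the database do the arithmetic atomically composes correctly. A sketch using SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")

def read_balance():
    return conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

# Lost update: both "transactions" read 100 before either writes
b1 = read_balance()                     # T1 reads 100
b2 = read_balance()                     # T2 reads 100
conn.execute("UPDATE accounts SET balance = ? WHERE id = 1", (b1 + 50,))
conn.execute("UPDATE accounts SET balance = ? WHERE id = 1", (b2 + 30,))
lost = read_balance()                   # T2's write erased T1's +50

# Fix: push the arithmetic into the UPDATE itself
conn.execute("UPDATE accounts SET balance = 100 WHERE id = 1")
conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 1")
conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 1")
safe = read_balance()
```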
| Level | Dirty Read | Non-Repeatable | Phantom | Use Case |
|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Almost never used |
| Read Committed | Prevented | Possible | Possible | PostgreSQL default |
| Repeatable Read | Prevented | Prevented | Varies* | MySQL default (InnoDB) |
| Serializable | Prevented | Prevented | Prevented | Financial systems |
*PostgreSQL's Repeatable Read prevents phantoms; MySQL's does not.
| Lock | Acquired By | Blocks |
|---|---|---|
| ACCESS SHARE | SELECT | ACCESS EXCLUSIVE only |
| ROW SHARE | SELECT FOR UPDATE | EXCLUSIVE, ACCESS EXCLUSIVE |
| ROW EXCLUSIVE | INSERT, UPDATE, DELETE | SHARE, EXCLUSIVE, ACCESS EXCLUSIVE |
| SHARE | CREATE INDEX (non-concurrent) | ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE |
| ACCESS EXCLUSIVE | ALTER TABLE, DROP, VACUUM FULL | Everything |
BEGIN;
-- Lock the row, preventing other transactions from modifying it
SELECT balance FROM accounts
WHERE id = 1
FOR UPDATE;
-- Safe to update - no one else can change it
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;
-- Wait for lock (default)
SELECT * FROM orders WHERE id = 1 FOR UPDATE;
-- Error immediately if locked
SELECT * FROM orders WHERE id = 1 FOR UPDATE NOWAIT;
-- Skip locked rows (great for job queues)
SELECT * FROM jobs WHERE status = 'pending'
LIMIT 1 FOR UPDATE SKIP LOCKED;
-- Add version column to table
ALTER TABLE products ADD COLUMN version INT DEFAULT 1;
-- Read current state
SELECT id, name, price, version FROM products WHERE id = 42;
-- Returns: version = 5
-- Update only if version unchanged
UPDATE products
SET price = 99.99, version = version + 1
WHERE id = 42 AND version = 5;
-- If 0 rows affected → someone else modified → retry
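The retry check maps directly onto the driver's rowcount. A sketch of the pattern in SQLite via Python's `sqlite3` (the helper name is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INT PRIMARY KEY, price REAL, version INT)")
conn.execute("INSERT INTO products VALUES (42, 79.99, 5)")

def update_price(new_price, seen_version):
    cur = conn.execute(
        "UPDATE products SET price = ?, version = version + 1 "
        "WHERE id = 42 AND version = ?",
        (new_price, seen_version),
    )
    # 0 rows affected means someone else won the race -> caller should re-read and retry
    return cur.rowcount == 1

first = update_price(99.99, 5)    # succeeds and bumps version to 6
stale = update_price(89.99, 5)    # fails: version 5 no longer exists
```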
-- Lock multiple rows in consistent order
SELECT * FROM accounts
WHERE id IN (3, 5, 7)
ORDER BY id -- Consistent ordering!
FOR UPDATE;
Problem: Get top 3 highest paid employees per department.
-- Using window function (most common approach)
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY department_id
ORDER BY salary DESC
) AS rn
FROM employees
)
SELECT * FROM ranked WHERE rn <= 3;
-- Alternative: LATERAL join (PostgreSQL)
SELECT d.name, e.*
FROM departments d
CROSS JOIN LATERAL (
SELECT * FROM employees
WHERE department_id = d.id
ORDER BY salary DESC
LIMIT 3
) e;
SELECT
date,
amount,
SUM(amount) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM transactions;
Problem: Find consecutive date ranges from sparse data.
-- Find islands of consecutive dates
WITH grouped AS (
SELECT
date,
date - (ROW_NUMBER() OVER (ORDER BY date))::int AS grp
FROM events
)
SELECT
MIN(date) AS island_start,
MAX(date) AS island_end,
COUNT(*) AS consecutive_days
FROM grouped
GROUP BY grp
ORDER BY island_start;
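The islands query adapts to SQLite by using `julianday()` for date subtraction. A sketch with illustrative data (Jan 1–3 and Jan 10–11 should come out as two islands):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (date TEXT)")
conn.executemany("INSERT INTO events VALUES (?)", [
    ("2024-01-01",), ("2024-01-02",), ("2024-01-03",),
    ("2024-01-10",), ("2024-01-11",),
])

rows = conn.execute("""
    WITH grouped AS (
        SELECT date,
               -- date minus row number is constant within a consecutive run
               julianday(date) - ROW_NUMBER() OVER (ORDER BY date) AS grp
        FROM events
    )
    SELECT MIN(date) AS island_start, MAX(date) AS island_end, COUNT(*) AS days
    FROM grouped GROUP BY grp ORDER BY island_start
""").fetchall()
```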
-- Transform rows to columns
SELECT
user_id,
SUM(CASE WHEN month = 'Jan' THEN revenue ELSE 0 END) AS jan_revenue,
SUM(CASE WHEN month = 'Feb' THEN revenue ELSE 0 END) AS feb_revenue,
SUM(CASE WHEN month = 'Mar' THEN revenue ELSE 0 END) AS mar_revenue
FROM monthly_sales
GROUP BY user_id;
-- PostgreSQL crosstab (requires tablefunc extension)
SELECT * FROM crosstab(
'SELECT user_id, month, revenue FROM monthly_sales ORDER BY 1,2'
) AS ct(user_id INT, jan NUMERIC, feb NUMERIC, mar NUMERIC);
-- Find duplicate emails
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
-- Delete duplicates, keeping lowest id
DELETE FROM users
WHERE id NOT IN (
SELECT MIN(id) FROM users GROUP BY email
);
-- Better: Using CTE with window function
WITH ranked AS (
SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM users
)
DELETE FROM users WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
-- PostgreSQL has built-in percentile functions
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)
FROM employees;
-- Manual approach (works in most databases)
WITH ordered AS (
SELECT
salary,
ROW_NUMBER() OVER (ORDER BY salary) AS rn,
COUNT(*) OVER () AS total
FROM employees
)
SELECT AVG(salary) AS median
FROM ordered
-- Middle row for odd counts, middle two rows for even counts
WHERE rn IN ((total + 1) / 2, (total + 2) / 2);
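A quick check of the middle-rows arithmetic in SQLite, on both odd and even row counts (illustrative salaries, chosen so the mean and median differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (salary INT)")

MEDIAN_SQL = """
    WITH ordered AS (
        SELECT salary,
               ROW_NUMBER() OVER (ORDER BY salary) AS rn,
               COUNT(*) OVER () AS total
        FROM employees
    )
    SELECT AVG(salary) FROM ordered
    -- Integer division picks the middle row (odd) or middle two (even)
    WHERE rn IN ((total + 1) / 2, (total + 2) / 2)
"""

conn.executemany("INSERT INTO employees VALUES (?)", [(10,), (20,), (1000,)])
odd_median = conn.execute(MEDIAN_SQL).fetchone()[0]    # middle of 3 rows

conn.execute("INSERT INTO employees VALUES (30)")
even_median = conn.execute(MEDIAN_SQL).fetchone()[0]   # avg of the middle two
```

The 1000 outlier drags the mean far above the median, which is why interviews ask for the median in the first place.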
Problem: Find users who logged in 3+ consecutive days.
WITH login_groups AS (
SELECT
user_id,
login_date,
login_date - ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY login_date
)::int AS grp
FROM logins
GROUP BY user_id, login_date -- Dedupe same-day logins
)
SELECT user_id, COUNT(*) AS consecutive_days
FROM login_groups
GROUP BY user_id, grp
HAVING COUNT(*) >= 3;
-- Find employees who earn more than their manager
SELECT e.name AS employee, e.salary, m.name AS manager, m.salary AS mgr_salary
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
-- Find customers who never ordered
-- NOT IN: BEWARE of NULLs! Returns empty if any NULL in subquery
SELECT * FROM customers
WHERE id NOT IN (SELECT customer_id FROM orders);
-- NOT EXISTS: NULL-safe, often faster
SELECT * FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
-- LEFT JOIN / IS NULL: Also NULL-safe
SELECT c.* FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.id IS NULL;
NOT IN with a subquery that returns even one NULL yields zero rows, because id <> NULL evaluates to unknown.
Always prefer NOT EXISTS for anti-joins.
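The trap is easy to reproduce: one NULL `customer_id` makes NOT IN return nothing, while NOT EXISTS still finds the customer who never ordered. A sketch in SQLite with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INT)")
conn.execute("CREATE TABLE orders (id INT, customer_id INT)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
# Order 11 has a NULL customer_id - the poison pill for NOT IN
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, None)])

not_in = conn.execute(
    "SELECT id FROM customers WHERE id NOT IN (SELECT customer_id FROM orders)"
).fetchall()
not_exists = conn.execute(
    "SELECT id FROM customers c WHERE NOT EXISTS "
    "(SELECT 1 FROM orders o WHERE o.customer_id = c.id)"
).fetchall()
# not_in is empty: 2 <> NULL is unknown, so the row is filtered out
# not_exists correctly returns customer 2
```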
-- Truncate to start of period
SELECT DATE_TRUNC('month', created_at) AS month_start FROM orders;
-- Generate date series (for filling gaps)
SELECT generate_series(
'2024-01-01'::date,
'2024-12-31'::date,
'1 day'::interval
) AS date;
-- Join with date series to include zero-count days
WITH dates AS (
SELECT generate_series('2024-01-01'::date, '2024-01-31', '1 day'::interval)::date AS d
)
SELECT d.d, COUNT(o.id) AS order_count -- COUNT(o.id) skips NULLs from unmatched days
FROM dates d
LEFT JOIN orders o ON d.d = o.order_date::date
GROUP BY d.d
ORDER BY d.d;
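`generate_series` is PostgreSQL-specific; in SQLite a recursive CTE builds the same date spine. A sketch with illustrative data (two orders on Jan 1, none on Jan 2, one on Jan 3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INT, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-01"), (3, "2024-01-03")])

rows = conn.execute("""
    WITH RECURSIVE dates(d) AS (
        SELECT '2024-01-01'
        UNION ALL
        SELECT date(d, '+1 day') FROM dates WHERE d < '2024-01-03'
    )
    SELECT d, COUNT(o.id) AS order_count
    FROM dates LEFT JOIN orders o ON o.order_date = dates.d
    GROUP BY d ORDER BY d
""").fetchall()
# Jan 2 appears with a 0 instead of vanishing from the report
```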
- EXPLAIN ANALYZE - check for Seq Scans on large tables
- ANALYZE - keep planner statistics fresh
- NOT EXISTS over NOT IN for anti-joins
- ROW_NUMBER() OVER (PARTITION BY ...) for top-N per group
- FOR UPDATE SKIP LOCKED for job queues