Query Optimization, Indexing, and Best Practices
An index is a data structure that improves query performance by reducing the number of rows scanned. Think of it like a book index - instead of reading every page to find "database", you check the index.
Rule of thumb: index columns used in WHERE clauses, JOIN conditions, and ORDER BY clauses.
B-tree: the default index type in PostgreSQL and MySQL. Balanced tree structure.

Good for:
- Equality: WHERE user_id = 123
- Ranges: WHERE created_at > '2024-01-01'
- Sorting: ORDER BY last_name
- Prefix matching: WHERE email LIKE 'john%' (but NOT '%@gmail.com')

Not good for:
- Leading-wildcard searches: LIKE '%search%'
- Very low-cardinality columns, e.g. gender CHAR(1)

-- Create B-tree index (default)
CREATE INDEX idx_users_email ON users(email);
-- Composite index (multiple columns)
CREATE INDEX idx_users_name ON users(last_name, first_name);
-- Unique index
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);
Hash: fast equality checks only; no range queries.

Good for exact lookups (e.g. WHERE id = 123).
-- PostgreSQL hash index
CREATE INDEX idx_users_id_hash ON users USING HASH (id);
-- Use case: primary key lookups (though B-tree is usually fine)
GIN (Generalized Inverted Index): for columns holding multiple values, such as arrays, JSONB, and full-text search.

Good for:
- Array containment: WHERE tags @> ARRAY['postgresql']
- JSONB containment: WHERE data @> '{"status": "active"}'
- Full-text search: WHERE to_tsvector(content) @@ to_tsquery('database')

-- Index on JSONB column
CREATE INDEX idx_users_metadata ON users USING GIN (metadata);
-- Query:
SELECT * FROM users WHERE metadata @> '{"premium": true}';
-- Full-text search index
CREATE INDEX idx_posts_search ON posts USING GIN (to_tsvector('english', content));
GiST (Generalized Search Tree): for geometric data, full-text search, and custom data types.

Good for:
- Spatial overlap: WHERE location && 'BOX(...)'
- Range overlap: WHERE date_range && '[2024-01-01,2024-12-31]'

-- Spatial index (PostGIS)
CREATE INDEX idx_locations_geom ON locations USING GIST (geom);
-- Query:
SELECT * FROM locations
WHERE ST_DWithin(geom, ST_MakePoint(-122.4194, 37.7749), 1000);
Partial indexes: index only the subset of rows matching a condition.
-- Only index active users
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';
-- Smaller index, faster for:
SELECT * FROM users WHERE status = 'active' AND email = 'john@example.com';
-- Use case: 95% of queries filter by status = 'active'
-- Partial index is much smaller and faster than full index
Covering indexes: the index contains every column the query needs, so no table lookup is required.
-- Include extra columns in index
CREATE INDEX idx_users_email_name ON users(email) INCLUDE (first_name, last_name);
-- Query can be satisfied entirely from index:
SELECT first_name, last_name FROM users WHERE email = 'john@example.com';
-- No need to access table, just scan index
Column order matters: an index on (last_name, first_name) is not the same as one on (first_name, last_name). A composite index can only be used when the query filters on a leftmost prefix of its columns.
-- Index on (last_name, first_name, age)
CREATE INDEX idx_users_composite ON users(last_name, first_name, age);
-- Can use index (leftmost prefix):
SELECT * FROM users WHERE last_name = 'Smith'; -- ✓ Uses index
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John'; -- ✓ Uses index
SELECT * FROM users WHERE last_name = 'Smith' AND age = 30; -- ✓ Uses index on last_name; age is filtered within the scan, not matched directly
-- Cannot use index efficiently:
SELECT * FROM users WHERE first_name = 'John'; -- ✗ Skips leftmost
SELECT * FROM users WHERE age = 30; -- ✗ Skips leftmost
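The leftmost-prefix rule can be sketched in a few lines of Python. This is a simplification (a real planner can still use the index with extra filtering, as the `age` example above shows), but it captures which columns are matched directly:

```python
def usable_prefix(index_columns, filter_columns):
    """Return the index columns matched left to right; a gap stops the match."""
    usable = []
    for col in index_columns:
        if col in filter_columns:
            usable.append(col)
        else:
            break  # a missing column breaks the prefix
    return usable

index_cols = ["last_name", "first_name", "age"]
print(usable_prefix(index_cols, {"last_name", "first_name"}))  # ['last_name', 'first_name']
print(usable_prefix(index_cols, {"last_name", "age"}))         # ['last_name']
print(usable_prefix(index_cols, {"first_name"}))               # []
```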
Put equality columns before range columns: = before >, <, BETWEEN.

-- Good order: status (low cardinality) after email (high cardinality)
CREATE INDEX idx_users_email_status ON users(email, status);
-- Bad order: status first limits index usefulness
CREATE INDEX idx_users_status_email ON users(status, email);
-- PostgreSQL
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'john@example.com';
Seq Scan on users  (cost=0.00..1693.00 rows=1 width=54) (actual time=0.234..15.678 rows=1 loops=1)
  Filter: (email = 'john@example.com'::text)
  Rows Removed by Filter: 99999
Planning Time: 0.123 ms
Execution Time: 15.701 ms
| Metric | Meaning |
|---|---|
| Seq Scan | Sequential scan (reads entire table) - usually BAD for large tables |
| Index Scan | Uses index - GOOD |
| Index Only Scan | Uses only index, no table lookup - BEST |
| Bitmap Heap Scan | Uses index to find rows, then fetches from table - OK |
| cost=0.00..1693.00 | Estimated cost (startup..total). Lower is better. Units are arbitrary. |
| rows=1 | Estimated rows returned |
| actual time=0.234..15.678 | Actual time (startup..total) in milliseconds (with ANALYZE) |
| rows=1 (actual) | Actual rows returned |
CREATE INDEX idx_users_email ON users(email);
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'john@example.com';
Index Scan using idx_users_email on users  (cost=0.29..8.31 rows=1 width=54) (actual time=0.045..0.047 rows=1 loops=1)
  Index Cond: (email = 'john@example.com'::text)
Planning Time: 0.234 ms
Execution Time: 0.078 ms
Improvement: 15.7ms → 0.078ms (200x faster!)
SELECT * FROM users WHERE age > 25;
Seq Scan on users  (cost=0.00..1693.00 rows=50000 width=54)
  Filter: (age > 25)
Reads entire table. OK for small tables or when returning most rows.
SELECT * FROM users WHERE email = 'john@example.com';
Index Scan using idx_users_email on users  (cost=0.29..8.31 rows=1 width=54)
  Index Cond: (email = 'john@example.com'::text)
Uses index, then fetches rows from table. Good for selective queries.
SELECT email FROM users WHERE email = 'john@example.com';
Index Only Scan using idx_users_email on users  (cost=0.29..4.31 rows=1 width=32)
  Index Cond: (email = 'john@example.com'::text)
  Heap Fetches: 0
All data in index, no table access. Fastest!
SELECT * FROM users WHERE age BETWEEN 25 AND 35;
Bitmap Heap Scan on users (cost=100.29..1500.31 rows=5000 width=54)
Recheck Cond: ((age >= 25) AND (age <= 35))
-> Bitmap Index Scan on idx_users_age (cost=0.00..99.04 rows=5000 width=0)
Index Cond: ((age >= 25) AND (age <= 35))
Builds bitmap of matching rows from index, then fetches in physical order. Good for large result sets.
SELECT * FROM users u JOIN orders o ON u.id = o.user_id;
Nested Loop (cost=0.29..1234.56 rows=100 width=...)
-> Seq Scan on users u (cost=0.00..123.00 rows=100 width=...)
-> Index Scan using idx_orders_user_id on orders o (cost=0.29..10.50 rows=5 width=...)
Index Cond: (user_id = u.id)
For each row in outer table, scan inner table. Good when outer table is small.
SELECT * FROM users u JOIN orders o ON u.id = o.user_id;
Hash Join (cost=234.56..5678.90 rows=10000 width=...)
Hash Cond: (o.user_id = u.id)
-> Seq Scan on orders o (cost=0.00..4000.00 rows=100000 width=...)
-> Hash (cost=123.00..123.00 rows=5000 width=...)
-> Seq Scan on users u (cost=0.00..123.00 rows=5000 width=...)
Build hash table from smaller table, probe with larger table. Good for large, unindexed joins.
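The hash-join strategy can be shown in miniature with plain Python: build a dict from the smaller table, then probe it once per row of the larger table. Table contents here are illustrative:

```python
users = [(1, "Alice"), (2, "Bob")]            # smaller table: (id, name)
orders = [(10, 1), (11, 1), (12, 2)]          # larger table: (order_id, user_id)

build = dict(users)                            # build phase: hash the smaller table
joined = [(oid, build[uid]) for oid, uid in orders
          if uid in build]                     # probe phase: one lookup per order
print(joined)  # [(10, 'Alice'), (11, 'Alice'), (12, 'Bob')]
```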
SELECT * FROM users u JOIN orders o ON u.id = o.user_id
ORDER BY u.id;
Merge Join  (cost=123.45..2345.67 rows=10000 width=...)
  Merge Cond: (u.id = o.user_id)
  ->  Index Scan using users_pkey on users u  (cost=0.29..500.00 rows=5000 width=...)
  ->  Index Scan using idx_orders_user_id on orders o  (cost=0.29..1500.00 rows=50000 width=...)
Both tables sorted, merge in order. Good when both tables indexed on join key.
-- BAD: Fetches all columns, including large BLOBs
SELECT * FROM users;
-- GOOD: Fetch only what you need
SELECT id, email, first_name FROM users;
Benefits: Less data transferred, can use covering indexes, less memory
-- BAD: Fetches millions of rows
SELECT * FROM logs ORDER BY created_at DESC;
-- GOOD: Pagination
SELECT * FROM logs ORDER BY created_at DESC LIMIT 100 OFFSET 0;
-- BETTER: Cursor-based pagination (no OFFSET)
SELECT * FROM logs
WHERE created_at < '2024-01-01 12:00:00'
ORDER BY created_at DESC
LIMIT 100;
OFFSET 10000 still scans the first 10,000 rows and then discards them.
Better: use cursor-based pagination with a WHERE clause on the sort key.
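The cursor-based approach can be sketched in pure Python over an in-memory, newest-first list: each page continues from the last value seen instead of re-skipping rows, which is the same idea the WHERE clause implements in SQL:

```python
def keyset_page(sorted_desc, cursor, page_size):
    """Return (page, next_cursor) of items strictly less than `cursor`."""
    if cursor is None:
        start = 0
    else:
        # first index whose value is older than the cursor
        start = next((i for i, v in enumerate(sorted_desc) if v < cursor),
                     len(sorted_desc))
    page = sorted_desc[start:start + page_size]
    next_cursor = page[-1] if page else None   # resume from the last row seen
    return page, next_cursor

logs = list(range(10, 0, -1))                  # 10, 9, ..., 1 (newest first)
page1, c1 = keyset_page(logs, None, 3)
page2, c2 = keyset_page(logs, c1, 3)
print(page1, page2)  # [10, 9, 8] [7, 6, 5]
```

In a real query the cursor would be the last row's `created_at`, passed back by the client on the next request.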
-- BAD: Function prevents index usage
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
SELECT * FROM orders WHERE YEAR(created_at) = 2024; -- (MySQL; the same issue applies to EXTRACT() in PostgreSQL)
-- GOOD: Store lowercase in column or use functional index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'john@example.com'; -- Now uses index
-- GOOD: Use range query instead of function
SELECT * FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';
-- BAD: Leading wildcard prevents index usage
SELECT * FROM users WHERE email LIKE '%@gmail.com';
-- GOOD: Prefix search can use index
SELECT * FROM users WHERE email LIKE 'john%';
-- GOOD: For suffix search, use full-text search or reverse index
CREATE INDEX idx_users_email_reverse ON users(REVERSE(email));
SELECT * FROM users WHERE REVERSE(email) LIKE REVERSE('@gmail.com') || '%';
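The reason the REVERSE trick works: reversing both the value and the suffix turns a suffix match into a prefix match, which a B-tree index on REVERSE(email) can serve. A two-line Python demonstration:

```python
def matches_suffix_via_reverse(email, suffix):
    """Suffix match expressed as a prefix match on the reversed strings."""
    return email[::-1].startswith(suffix[::-1])

print(matches_suffix_via_reverse("john@gmail.com", "@gmail.com"))  # True
print(matches_suffix_via_reverse("john@yahoo.com", "@gmail.com"))  # False
```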
-- BAD: IN with subquery can be slow
SELECT * FROM users
WHERE id IN (SELECT user_id FROM orders WHERE total > 100);
-- GOOD: EXISTS can stop early
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.user_id = u.id AND o.total > 100
);
-- BETTER: JOIN if you need columns from orders
SELECT DISTINCT u.* FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.total > 100;
EXISTS stops at first match. IN builds entire list first.
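The same short-circuit behavior has a direct Python analogy: `any()` over a generator stops at the first match, while a comprehension materializes the full set first. Data here is illustrative:

```python
orders = [{"user_id": 1, "total": 250}, {"user_id": 1, "total": 50},
          {"user_id": 2, "total": 30}]

def has_big_order(user_id, threshold=100):
    # EXISTS-style: short-circuits on the first qualifying order
    return any(o["user_id"] == user_id and o["total"] > threshold
               for o in orders)

# IN-style: builds the whole set of qualifying user_ids up front
big_spenders = {o["user_id"] for o in orders if o["total"] > 100}

print(has_big_order(1), 1 in big_spenders)  # True True
print(has_big_order(2))                     # False
```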
-- BAD: 1000 individual INSERT statements (application-side pseudocode)
for user in users:
    execute("INSERT INTO users (name, email) VALUES (?, ?)", user.name, user.email)
-- GOOD: Single INSERT with multiple rows
INSERT INTO users (name, email) VALUES
('Alice', 'alice@example.com'),
('Bob', 'bob@example.com'),
('Charlie', 'charlie@example.com'),
... (1000 rows)
-- Or use COPY (PostgreSQL) for bulk loading
COPY users(name, email) FROM '/tmp/users.csv' WITH CSV HEADER;
Result: 1000 round trips → 1 round trip. ~100-1000x faster.
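Client-side, the batching step is just chunking the rows and building one multi-row statement per chunk. A sketch (placeholder-only; executing the statements with a real driver is left out):

```python
def chunked(rows, size):
    """Split rows into lists of at most `size` items."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def batched_insert_sql(rows, batch_size=500):
    """Build one INSERT ... VALUES statement per batch of rows."""
    stmts = []
    for batch in chunked(rows, batch_size):
        placeholders = ", ".join(["(%s, %s)"] * len(batch))
        stmts.append(f"INSERT INTO users (name, email) VALUES {placeholders}")
    return stmts

rows = [(f"u{i}", f"u{i}@example.com") for i in range(1000)]
stmts = batched_insert_sql(rows, 500)
print(len(stmts))  # 2 statements instead of 1000
```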
Common ORM anti-pattern: 1 query to fetch N records, then N queries to fetch related data.
# Fetch all users (1 query)
users = User.objects.all() # SELECT * FROM users
# For each user, fetch their orders (N queries!)
for user in users:
    orders = user.orders.all()  # SELECT * FROM orders WHERE user_id = ?
    print(f"{user.name}: {orders.count()} orders")
# Result: 1 + N queries (if 100 users, 101 queries!)
# Fetch users and their orders in 2 queries total
users = User.objects.prefetch_related('orders')
# Query 1: SELECT * FROM users
# Query 2: SELECT * FROM orders WHERE user_id IN (1,2,3,...)
for user in users:
    orders = user.orders.all()  # No query! Already loaded
    print(f"{user.name}: {len(orders)} orders")  # len() uses the prefetched cache; .count() would issue a new query
# Result: 2 queries regardless of N
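The two-query prefetch pattern can be sketched without an ORM: fetch the parents, run one IN query for all the children, then group in application code. The dicts below stand in for query results:

```python
from collections import defaultdict

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]   # query 1
orders = [{"user_id": 1, "total": 50}, {"user_id": 1, "total": 75},
          {"user_id": 2, "total": 20}]  # query 2: WHERE user_id IN (1, 2)

# Group children by parent id, replacing the per-user query
orders_by_user = defaultdict(list)
for o in orders:
    orders_by_user[o["user_id"]].append(o)

counts = {u["name"]: len(orders_by_user[u["id"]]) for u in users}
print(counts)  # {'Alice': 2, 'Bob': 1}
```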
# Single query with JOIN
users = User.objects.select_related('profile')
# SELECT users.*, profiles.*
# FROM users
# LEFT JOIN profiles ON users.id = profiles.user_id
for user in users:
    print(f"{user.name}: {user.profile.bio}")  # No query!
# Result: 1 query total
| Method | Relationship | Use Case |
|---|---|---|
| select_related() | ForeignKey, OneToOne | 1:1 or N:1 (uses JOIN) |
| prefetch_related() | ManyToMany, Reverse ForeignKey | 1:N or M:N (uses separate query + IN) |
# Django: Enable query logging
DEBUG = True # or use django-debug-toolbar
# Count queries
from django.db import connection
print(len(connection.queries)) # Number of queries executed
-- PostgreSQL: Enable slow query log
ALTER DATABASE mydb SET log_min_duration_statement = 100; -- Log queries > 100ms
-- Analyze logs for repeated similar queries
-- Example: Same query with different parameters 100 times = N+1
# Each request creates new connection
1. Open TCP connection to database (50-100ms)
2. Authenticate (10-50ms)
3. Execute query (5ms)
4. Close connection
Total: 65-155ms per request (overhead >> query time!)
# Reuse existing connections
1. Get connection from pool (< 1ms)
2. Execute query (5ms)
3. Return connection to pool
Total: 6ms (10-25x faster!)
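A connection pool is, at its core, a thread-safe queue of open connections. A minimal sketch (`connect` stands in for a real driver call; production pools add health checks, idle timeouts, and max lifetimes):

```python
import queue

class Pool:
    def __init__(self, connect, size):
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(connect())   # open the connections up front

    def acquire(self, timeout=30):
        # blocks until a connection is free, like connection_timeout
        return self._q.get(timeout=timeout)

    def release(self, conn):
        self._q.put(conn)            # hand the connection back for reuse

# Usage with a dummy connection factory that counts how many were opened:
opened = []
def connect():
    conn = object()
    opened.append(conn)
    return conn

pool = Pool(connect, size=2)
a = pool.acquire()
b = pool.acquire()
pool.release(a)
c = pool.acquire()
print(c is a, len(opened))  # True 2  (reused, nothing new opened)
```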
Application Pool Database
| | |
|---Request Conn-------->| |
| [Idle Conn 1]--------------->| (reuse)
|<--Return Conn 1--------| |
| | |
|---Execute Query------->|---------------------->|
| | |
|<--Return Result--------|<----------------------|
| | |
|---Release Conn-------->| |
| [Idle Conn 1] (back to pool) |
| Parameter | Meaning | Typical Value |
|---|---|---|
| min_connections | Minimum pooled connections (always open) | 2-10 |
| max_connections | Maximum pooled connections | 10-100 (depends on load) |
| max_idle_time | Close idle connections after this time | 10-30 minutes |
| connection_timeout | Wait time for available connection | 5-30 seconds |
| max_lifetime | Maximum connection age (prevent stale connections) | 30-60 minutes |
from sqlalchemy import create_engine
engine = create_engine(
'postgresql://user:pass@localhost/db',
pool_size=10, # Normal pool size
max_overflow=20, # Can create 20 extra connections if needed
pool_timeout=30, # Wait 30s for connection
pool_recycle=3600, # Recycle connections after 1 hour
pool_pre_ping=True # Test connection before using
)
Java (HikariCP - fastest):
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://localhost/db");
config.setUsername("user");
config.setPassword("pass");
config.setMaximumPoolSize(20);
config.setMinimumIdle(5);
config.setConnectionTimeout(30000);
config.setIdleTimeout(600000);
config.setMaxLifetime(1800000);
HikariDataSource dataSource = new HikariDataSource(config);
Formula: connections = ((core_count * 2) + effective_spindle_count)
Example: 4 CPU cores, SSD (treat as 1 spindle) = (4 × 2) + 1 = 9 connections
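The formula as code, with the example values above (core and spindle counts are inputs you measure on your own hardware):

```python
def pool_size(core_count, effective_spindle_count):
    """Pool-sizing rule of thumb: (cores * 2) + effective spindles."""
    return core_count * 2 + effective_spindle_count

print(pool_size(4, 1))  # 9
```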
-- BAD: Fetches all columns including BLOBs, JSON, etc.
SELECT * FROM products; -- 50 columns, 10MB per row
-- GOOD: Only what you need
SELECT id, name, price FROM products;
-- BAD: JOIN without index on foreign key
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INT REFERENCES users(id) -- No index!
);
SELECT * FROM users u JOIN orders o ON u.id = o.user_id; -- Slow!
-- GOOD: Always index foreign keys
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- BAD: user_id is INT, but the query compares it to a string
SELECT * FROM users WHERE user_id = '123'; -- implicit cast can prevent index use
-- GOOD: Match column type
SELECT * FROM users WHERE user_id = 123;
-- BAD: Can't use indexes efficiently
SELECT * FROM products
WHERE category_id = 5 OR price < 10; -- Two different columns!
-- BETTER: UNION (can use indexes)
SELECT * FROM products WHERE category_id = 5
UNION
SELECT * FROM products WHERE price < 10;
-- BAD: Huge IN list
SELECT * FROM users WHERE id IN (1,2,3,...,10000); -- 10,000 IDs!
-- BETTER: Temp table + JOIN
CREATE TEMP TABLE user_ids (id INT);
INSERT INTO user_ids VALUES (1),(2),(3),...,(10000);
SELECT u.* FROM users u JOIN user_ids t ON u.id = t.id;
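An application-side alternative to the temp table is to split the id list into bounded chunks and run one query per chunk, merging the results. A sketch that only builds the statements (`%s` placeholders; executing them with a real driver is omitted):

```python
def in_clause_chunks(ids, chunk_size=1000):
    """Yield (sql, params) pairs, each with at most chunk_size placeholders."""
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i + chunk_size]
        placeholders = ", ".join(["%s"] * len(chunk))
        yield f"SELECT * FROM users WHERE id IN ({placeholders})", chunk

queries = list(in_clause_chunks(list(range(1, 10001)), 1000))
print(len(queries))  # 10 queries of 1,000 ids each
```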