System Design Heuristics - Quick Reference

Write Spike

Write + Spike -> Queue

When you have sudden bursts of write traffic that could overwhelm your system, use a message queue to buffer and smooth out the load.

When to use: Flash sales, user registration spikes, batch imports, webhook delivery

Trade-offs: Adds latency (asynchronous), eventual consistency, requires queue infrastructure

Technologies:

RabbitMQ, Apache Kafka, Amazon SQS, Redis Streams (plain Redis Pub/Sub does not store messages, so it cannot buffer a spike)
Google Cloud Pub/Sub, Azure Service Bus

Use Cases:

E-commerce: Order processing during Black Friday
Social media: Post creation during viral events
Analytics: Event tracking from millions of users
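
A minimal sketch of the buffering idea, using an in-memory deque as a stand-in for a real broker. The WriteBuffer class, its method names, and the batch sizes are illustrative assumptions, not any product's API.

```python
from collections import deque


class WriteBuffer:
    """In-memory stand-in for a message queue that absorbs write bursts."""

    def __init__(self, max_size):
        self.max_size = max_size      # bound the queue so memory cannot grow forever
        self._pending = deque()

    def enqueue(self, record):
        """Accept a write quickly; return False when the buffer is full."""
        if len(self._pending) >= self.max_size:
            return False              # caller should shed load or retry later
        self._pending.append(record)
        return True

    def drain(self, batch_size):
        """Hand the consumer at most batch_size records, oldest first."""
        batch = []
        while self._pending and len(batch) < batch_size:
            batch.append(self._pending.popleft())
        return batch


buffer = WriteBuffer(max_size=1000)

# A burst of 250 writes arrives at once and is accepted immediately.
accepted = sum(buffer.enqueue({"order_id": i}) for i in range(250))

# The consumer writes to the database at its own steady pace.
batches = []
while True:
    batch = buffer.drain(batch_size=100)
    if not batch:
        break
    batches.append(batch)
```

The producer only pays the cost of an append, while the consumer decides how fast the database is written to. A real queue would add durability and acknowledgements, which this sketch leaves out.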

Global Latency

Latency + Global -> CDN

For globally distributed users, reduce latency by caching static content at edge locations close to users.

When to use: Static assets (images, CSS, JS), video streaming, downloadable content

Trade-offs: Cache invalidation complexity, increased cost, stale content risk

Technologies:

Cloudflare, Amazon CloudFront, Fastly, Akamai
Google Cloud CDN, Azure CDN

Use Cases:

Netflix: Video delivery to global audience
E-commerce: Product images and assets
News sites: Articles and media content

Load Growth

Load + Growth -> Scale Out (Horizontal Scaling)

When traffic grows beyond the capacity of a single machine, add more machines rather than upgrading to a larger one.

When to use: Unpredictable growth, need for redundancy, cost efficiency at scale

Trade-offs: Increased complexity, distributed system challenges, load balancing required

Technologies:

Kubernetes, Docker Swarm, AWS Auto Scaling
Horizontal Pod Autoscaler (HPA)

Use Cases:

Web servers: Add more application instances
Microservices: Scale individual services independently
API gateways: Handle increased request volume

Read Bottleneck

Read + Bottleneck -> Cache

When database reads become a bottleneck, introduce a caching layer that serves frequently accessed data from memory.

When to use: Read-heavy workloads, expensive queries, frequently accessed data

Trade-offs: Cache invalidation, stale data, increased memory costs, cache warming

Technologies:

Redis, Memcached, Amazon ElastiCache
Varnish, Nginx caching, Application-level caches

Use Cases:

Social media: User profile data, friend lists
E-commerce: Product catalog, pricing info
News sites: Article content, homepage data
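
One common way to apply this is the cache-aside pattern with a time-to-live. The sketch below is an assumption-laden illustration: TTLCache, load_product, and the injected clock are hypothetical names, and a production setup would typically use Redis or Memcached instead of a dict.

```python
import time


class TTLCache:
    """Cache-aside helper: look in memory first, fall back to the loader on a miss."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock            # injectable so expiry can be tested without sleeping
        self._entries = {}            # key -> (value, expires_at)

    def get(self, key, loader):
        entry = self._entries.get(key)
        now = self.clock()
        if entry is not None and entry[1] > now:
            return entry[0]           # hit: no database call
        value = loader(key)           # miss or expired: read from the source of truth
        self._entries[key] = (value, now + self.ttl_seconds)
        return value

    def invalidate(self, key):
        """Call after a write so readers do not keep seeing stale data."""
        self._entries.pop(key, None)


db_calls = []


def load_product(product_id):
    """Hypothetical slow database read; records each call for the demo."""
    db_calls.append(product_id)
    return {"id": product_id, "price": 100}


fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get("sku-1", load_product)     # miss: hits the database
second = cache.get("sku-1", load_product)    # hit: served from memory
fake_now[0] = 61.0
third = cache.get("sku-1", load_product)     # expired: loaded again
```

The TTL bounds how stale a value can get, and invalidate covers the case where the writer knows the data changed.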

Request Spike

Requests + Spike -> Throttle (Rate Limiting)

Protect your system from being overwhelmed by limiting the rate of requests from clients.

When to use: API protection, abuse prevention, fair resource allocation, DDoS mitigation

Trade-offs: Legitimate users may be blocked, complex quota management, distributed state

Technologies:

Token Bucket, Leaky Bucket algorithms
API Gateway rate limiting (AWS, Kong, Nginx)
Redis for distributed rate limiting

Use Cases:

APIs: 1000 requests per hour per user
Login: Prevent brute force attacks
Email: Limit sending to prevent spam
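
As an illustration of the token bucket algorithm listed above, here is a single-process sketch. The class name, parameters, and injected clock are assumptions made for the example; a distributed limiter would keep this state in a shared store such as Redis.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `refill_per_second` rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)     # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost=1):
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                      # caller would typically respond with HTTP 429


fake_now = [0.0]
bucket = TokenBucket(capacity=3, refill_per_second=1.0, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(5)]   # 5 requests in the same instant
fake_now[0] = 2.0                            # two seconds later, 2 tokens have refilled
later = [bucket.allow() for _ in range(3)]
```

Here the first three requests of the burst pass and the rest are rejected; after two seconds, two more requests are allowed.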

Retry Safety

Retry + Safety -> Idempotent Operations

Design operations so they can be safely retried multiple times with the same result, enabling resilience to failures.

When to use: Network unreliability, distributed systems, payment processing

Trade-offs: Implementation complexity, idempotency key management, storage overhead

Techniques:

Idempotency keys (UUIDs)
Natural idempotency (PUT, DELETE in REST)
Database constraints (unique indexes)

Use Cases:

Payments: Prevent double charging
Email: Don't send duplicate notifications
Database updates: Safe to retry
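
The idempotency-key technique can be sketched as follows. PaymentService and its fields are hypothetical, and the sketch assumes a single process; a real service would store the key and the charge in the same database transaction (for example behind a unique index) so the check and the side effect cannot be separated by a crash.

```python
import uuid


class PaymentService:
    """Remembers the result of each idempotency key so retries do not charge twice."""

    def __init__(self):
        self._results = {}            # idempotency key -> stored response
        self.charges = []             # side effects actually performed

    def charge(self, idempotency_key, account, amount_cents):
        if idempotency_key in self._results:
            # Retry of a request we already handled: replay the stored response.
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {"charge_id": len(self.charges), "status": "succeeded"}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())               # the client generates one key per logical operation

first = service.charge(key, "acct-42", 1999)
retry = service.charge(key, "acct-42", 1999)   # e.g. the client timed out and retried
```

The client reuses the same key for every retry of one operation and generates a new key only for a genuinely new operation.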

Core Failure

Core + Failure -> Redundancy

For critical components, maintain multiple copies to ensure availability even when individual components fail.

When to use: Mission-critical systems, high availability requirements, disaster recovery

Trade-offs: Increased cost, synchronization complexity, potential inconsistency

Strategies:

Active-Active: All replicas serve traffic
Active-Passive: Standby for failover
Multi-region deployment

Use Cases:

Databases: Primary-replica (master-slave) replication
Web servers: Load balanced instances
DNS: Multiple nameservers

Dataset Growth

Dataset + Growth -> Sharding (Partitioning)

When a single database becomes too large, split data across multiple databases based on a partition key.

When to use: Data exceeds single DB capacity, write throughput bottleneck, horizontal scaling

Trade-offs: Cross-shard queries complex, rebalancing difficulty, hotspot risks

Strategies:

Hash-based: shard = hash(key) % num_shards
Range-based: user_id 1 to 1,000,000 -> shard1, 1,000,001 to 2,000,000 -> shard2
Geographic: US users -> US DB, EU users -> EU DB

Use Cases:

Social media: User data sharded by user_id
E-commerce: Orders sharded by region
Gaming: Players sharded by game server
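
A small sketch of the hash-based strategy, which also illustrates the rebalancing trade-off. The function names are illustrative. It uses hashlib rather than Python's built-in hash(), because the built-in is randomized per process for strings and would route the same key differently after a restart.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Map a key to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys, old_count, new_count):
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count))
    return moved / len(keys)


user_ids = range(10_000)

# Keys spread roughly evenly across 4 shards.
placement = Counter(shard_for(uid, 4) for uid in user_ids)

# Going from 4 to 5 shards with plain modulo moves most keys.
fraction = moved_fraction(user_ids, 4, 5)
```

With plain modulo hashing most keys change shard when the shard count changes, which is why consistent hashing or a fixed number of virtual shards is often used when resizing is expected.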

Text Search

Text + Search -> Inverted Index

For full-text search, build an index mapping words to documents containing them, enabling fast search queries.

When to use: Full-text search, document retrieval, autocomplete, fuzzy matching

Trade-offs: Index storage overhead, index update latency, complex relevance ranking

Technologies:

Elasticsearch, Apache Solr, Amazon CloudSearch
PostgreSQL full-text search, MongoDB text indexes

Use Cases:

E-commerce: Product search
Documentation: Search help articles
Social media: Search posts and users
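
A toy inverted index makes the data structure concrete. The tokenizer and function names are assumptions for this sketch; engines such as Elasticsearch add stemming, relevance scoring, and on-disk storage on top of the same basic mapping.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


products = {
    1: "Red running shoes",
    2: "Blue running jacket",
    3: "Red wool jacket",
}
index = build_index(products)

running = search(index, "running")
red_jacket = search(index, "Red jacket")
nothing = search(index, "green")
```

A query touches only the posting sets of its own terms, so lookup cost depends on the terms searched rather than on the total number of documents.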

Large File Upload

Upload + Large-file -> Chunking (Multipart Upload)

Split large files into smaller chunks for parallel upload, retry on failure, and resumable uploads.

When to use: Video uploads, large file transfers, unreliable networks

Trade-offs: Implementation complexity, chunk reassembly, orphaned chunks cleanup

Technologies:

S3 Multipart Upload, Azure Blob Storage
Resumable.js, tus protocol

Use Cases:

YouTube: Video uploads
Dropbox: Large file sync
Cloud storage: Backup uploads
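
The split and reassemble steps can be sketched in memory as below. The chunk layout (index, checksum, body) and the function names are assumptions for illustration and do not mirror the S3 or tus wire formats.

```python
import hashlib


def split_into_chunks(data, chunk_size):
    """Cut a byte string into numbered chunks, each with its own checksum."""
    chunks = []
    for offset in range(0, len(data), chunk_size):
        body = data[offset:offset + chunk_size]
        chunks.append({
            "index": offset // chunk_size,
            "sha256": hashlib.sha256(body).hexdigest(),
            "body": body,
        })
    return chunks


def reassemble(chunks, expected_count):
    """Rebuild the file from chunks received in any order, verifying each one."""
    by_index = {chunk["index"]: chunk for chunk in chunks}
    missing = [i for i in range(expected_count) if i not in by_index]
    if missing:
        # A resumable client would re-upload only these chunks.
        raise ValueError(f"missing chunks: {missing}")
    for chunk in by_index.values():
        if hashlib.sha256(chunk["body"]).hexdigest() != chunk["sha256"]:
            raise ValueError(f"corrupt chunk: {chunk['index']}")
    return b"".join(by_index[i]["body"] for i in range(expected_count))


payload = bytes(range(256)) * 10                 # 2560 bytes of sample data
chunks = split_into_chunks(payload, chunk_size=1000)

# Chunks may arrive out of order when uploaded in parallel.
restored = reassemble(list(reversed(chunks)), expected_count=len(chunks))
```

Because each chunk carries its index and checksum, a failed or corrupted chunk can be retried on its own without restarting the whole upload.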

Durability Failure

Durability + Failure -> Replication

Ensure data survives failures by maintaining multiple copies across different nodes or data centers.

When to use: Critical data, disaster recovery, data loss prevention

Trade-offs: Storage cost multiplier, replication lag, consistency challenges

Strategies:

Synchronous: Wait for all replicas (strong consistency)
Asynchronous: Don't wait (eventual consistency)
Quorum: Wait for majority

Use Cases:

Databases: MySQL replication, Cassandra RF=3
File storage: HDFS replication, S3 durability
Message queues: Kafka topic replication
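
The three strategies differ mainly in how many acknowledgements a write waits for. The sketch below models that with a hypothetical Replica class and a required_acks parameter. It is deliberately simplified: writes are applied sequentially, and a write that misses its target is not rolled back on the replicas that accepted it.

```python
class Replica:
    """Hypothetical storage node that may be down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}

    def write(self, key, value):
        if not self.healthy:
            return False              # no acknowledgement from a failed node
        self.data[key] = value
        return True


def replicated_write(replicas, key, value, required_acks):
    """Send the write to every replica; succeed if enough of them acknowledge."""
    acks = sum(1 for replica in replicas if replica.write(key, value))
    return acks >= required_acks


replicas = [Replica("a"), Replica("b"), Replica("c", healthy=False)]
quorum = len(replicas) // 2 + 1       # majority: 2 of 3

# Quorum: tolerates one failed replica.
quorum_ok = replicated_write(replicas, "user:1", "v1", required_acks=quorum)

# Synchronous to all replicas: a single failed node blocks the write.
all_ok = replicated_write(replicas, "user:1", "v2", required_acks=len(replicas))
```

Waiting for every replica gives the strongest guarantee but the lowest availability, while a majority quorum keeps accepting writes with one of three nodes down.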

Real-time Broadcast

Broadcast + Realtime -> Pub/Sub

Enable real-time communication where publishers send messages to topics and subscribers receive them.

When to use: Live notifications, chat systems, event-driven architecture

Trade-offs: Message ordering challenges, delivery guarantees, scaling subscribers

Technologies:

Redis Pub/Sub, Apache Kafka, RabbitMQ
Google Pub/Sub, AWS SNS, Azure Event Grid

Use Cases:

Chat apps: Message broadcast to room
Stock ticker: Price updates to subscribers
Notifications: Push to all connected clients
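
An in-process sketch of the publish/subscribe contract. Broker and its methods are illustrative names, and delivery here is synchronous and non-durable, much like plain Redis Pub/Sub: a subscriber that is not connected at publish time misses the message.

```python
from collections import defaultdict


class Broker:
    """Routes each published message to every subscriber of its topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)    # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver to current subscribers and return how many received it."""
        delivered = 0
        for callback in list(self._subscribers.get(topic, [])):
            callback(message)
            delivered += 1
        return delivered


broker = Broker()
alice_inbox, bob_inbox = [], []

broker.subscribe("room:general", alice_inbox.append)
broker.subscribe("room:general", bob_inbox.append)

sent = broker.publish("room:general", "hello")
dropped = broker.publish("room:empty", "anyone there?")   # no subscribers
```

The publisher never references subscribers directly, which is what lets either side be added or removed independently.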

Location Search

Location + Search -> Geohashing

Convert latitude/longitude into a string that allows efficient proximity searches and location grouping.

When to use: Nearby search, location-based services, geofencing

Trade-offs: Boundary issues (two nearby points on either side of a cell edge can share no prefix), precision vs. string length, prefix matching only approximates distance

Technologies:

Geohash algorithm, S2 Geometry (Google)
PostGIS, MongoDB geospatial indexes
Redis GEO commands

Use Cases:

Uber: Find nearby drivers
Yelp: Restaurants near me
Pokemon Go: Nearby Pokemon
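
The encoding step can be sketched with the standard geohash scheme: alternate longitude and latitude bits, halving the range each time, and emit one base-32 character per 5 bits. The function name and default precision are choices made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True                    # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1    # upper half: emit 1 and narrow the range upward
            rng[0] = mid
        else:
            bits = bits << 1          # lower half: emit 0 and narrow the range downward
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:            # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


here = geohash_encode(57.64911, 10.40744, precision=5)
nearby = geohash_encode(57.6492, 10.4075, precision=5)
far_away = geohash_encode(40.7128, -74.0060, precision=5)
```

Points in the same cell share a prefix, so a nearby search can become a prefix lookup on an ordinary string index. Because of the boundary issue, implementations usually also query the neighboring cells.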

Write Conflict

Write + Conflict -> Optimistic Lock

Allow concurrent reads but detect conflicts at write time using version numbers, avoiding expensive locks.

When to use: Low conflict rate, read-heavy workloads, distributed systems

Trade-offs: Retry logic required, wasted work on conflict, user experience on retry

Techniques:

Version numbers: UPDATE ... WHERE version = old_version
Timestamps: Check if last_modified changed
ETags: HTTP conditional requests

Use Cases:

E-commerce: Inventory updates
Collaborative editing: Document versions
Bank account: Balance updates
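
The version-number technique can be tried end to end with the standard library's sqlite3 module. The table, column names, and helper functions below are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO inventory VALUES ('sku-1', 10, 1)")
conn.commit()


def read_item(conn, sku):
    row = conn.execute(
        "SELECT quantity, version FROM inventory WHERE sku = ?", (sku,)
    ).fetchone()
    return {"quantity": row[0], "version": row[1]}


def try_update(conn, sku, new_quantity, expected_version):
    """Write only if nobody changed the row since we read it."""
    cursor = conn.execute(
        "UPDATE inventory SET quantity = ?, version = version + 1 "
        "WHERE sku = ? AND version = ?",
        (new_quantity, sku, expected_version),
    )
    conn.commit()
    return cursor.rowcount == 1       # 0 rows updated means a conflict


# Two writers read the same version of the row.
writer_a = read_item(conn, "sku-1")
writer_b = read_item(conn, "sku-1")

ok_a = try_update(conn, "sku-1", writer_a["quantity"] - 1, writer_a["version"])
ok_b = try_update(conn, "sku-1", writer_b["quantity"] - 2, writer_b["version"])

# Writer B detects the conflict, re-reads, and retries.
fresh = read_item(conn, "sku-1")
ok_retry = try_update(conn, "sku-1", fresh["quantity"] - 2, fresh["version"])
final = read_item(conn, "sku-1")
```

No lock is held between the read and the write; the cost is that the losing writer must re-read and redo its work.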

Untrusted Execution

Untrusted + Execution -> Container/Sandbox

Isolate untrusted code execution in containers or sandboxes to prevent security breaches and resource abuse.

When to use: User-submitted code, multi-tenant systems, plugin architectures

Trade-offs: Performance overhead, complexity, resource limits enforcement

Technologies:

Docker, Kubernetes pods, gVisor
AWS Lambda, Google Cloud Functions
V8 isolates, WebAssembly

Use Cases:

Code playgrounds: CodePen, Replit
CI/CD: Build pipeline execution
Serverless: Function execution

Real-time Updates

Realtime + Updates -> WebSockets

Establish a persistent, bidirectional connection so the server can push data to the client in real time.

When to use: Chat, live dashboards, real-time collaboration, live sports scores

Trade-offs: Connection management complexity, scaling challenges, fallback needed

Technologies:

Socket.IO, WebSocket API, SignalR
Server-Sent Events (SSE) for one-way
Long polling as fallback

Use Cases:

Slack: Real-time messaging
Stock trading: Live price updates
Google Docs: Collaborative editing

Traffic Reliability

Traffic + Reliability -> Load Balancer

Distribute incoming traffic across multiple servers, improving reliability and handling capacity.

When to use: Multiple backend servers, high availability needs, horizontal scaling

Trade-offs: The load balancer itself can become a single point of failure (it needs redundancy), session affinity complexity, cost

Algorithms:

Round Robin: Distribute evenly
Least Connections: Send to least busy
IP Hash: Sticky sessions
Weighted: Based on server capacity

Technologies:

Nginx, HAProxy, AWS ELB/ALB
Google Cloud Load Balancing, Azure Load Balancer
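
Two of the algorithms above, sketched as simple server pickers. The class names and server labels are hypothetical; real load balancers also track health checks and timeouts.

```python
import itertools


class RoundRobin:
    """Hands out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Sends each request to the server with the fewest open connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1      # the new request opens a connection
        return server

    def release(self, server):
        self.active[server] -= 1      # call when the request finishes


servers = ["app-1", "app-2", "app-3"]

rr = RoundRobin(servers)
rr_picks = [rr.pick() for _ in range(4)]

lc = LeastConnections(servers)
lc_picks = [lc.pick() for _ in range(3)]     # one connection on each server
lc.release("app-2")                          # app-2 finishes first
next_pick = lc.pick()
```

Round robin ignores how long requests take, while least connections adapts when some requests are slower than others.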

Distributed Transaction

Distributed + Transaction -> Saga Pattern

Manage distributed transactions as a sequence of local transactions with compensating actions for rollback.

When to use: Microservices, long-running transactions, cross-service workflows

Trade-offs: Eventual consistency, complex rollback logic, partial failure handling

Types:

Choreography: Services emit events
Orchestration: Central coordinator

Use Cases:

E-commerce: Order -> Payment -> Inventory -> Shipping
Travel booking: Flight + Hotel + Car (all or none)
Banking: Transfer between accounts
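
An orchestration-style saga can be sketched as a list of steps, each paired with a compensating action. The run_saga function and the order/payment/inventory steps are illustrative stand-ins for calls to real services.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; undo completed ones on failure."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"{name}: done")
            completed.append((name, compensate))
        except Exception as error:
            log.append(f"{name}: failed ({error})")
            # Compensate in reverse order, newest completed step first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name}: compensated")
            return False, log
    return True, log


state = {"order": None, "charged": 0}


def create_order():
    state["order"] = "created"


def cancel_order():
    state["order"] = "cancelled"


def charge_payment():
    state["charged"] = 50


def refund_payment():
    state["charged"] = 0


def reserve_inventory():
    raise RuntimeError("out of stock")


def release_inventory():
    pass


ok, log = run_saga([
    ("order", create_order, cancel_order),
    ("payment", charge_payment, refund_payment),
    ("inventory", reserve_inventory, release_inventory),
])
```

Each step commits locally, so the rollback is a new action (a refund, a cancellation) rather than a database-level undo. In a real system the compensations must themselves be retryable.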

Concurrency Consistency

Concurrency + Consistency -> Row Locking

Prevent concurrent modifications to the same data by locking specific rows for the duration of a transaction.

When to use: High conflict scenarios, critical updates, financial transactions

Trade-offs: Reduced concurrency, deadlock risk, performance impact

Types:

Pessimistic: Lock rows as they are read (SELECT ... FOR UPDATE)
Optimistic (alternative that holds no lock): Check the version at write time
Shared locks: Multiple readers
Exclusive locks: Single writer

Use Cases:

Bank transfers: Lock both accounts
Ticket booking: Lock seat during purchase
Inventory: Prevent overselling
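
As an in-process analogy for locking both accounts in a bank transfer, the sketch below uses one threading.Lock per account in place of a database row lock. The Accounts class is hypothetical. Acquiring the locks in a fixed (sorted) order is one common way to avoid the deadlock risk noted in the trade-offs.

```python
import threading


class Accounts:
    """Each account has its own lock, similar to an exclusive row lock."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self._locks = {account_id: threading.Lock() for account_id in balances}

    def transfer(self, source, target, amount):
        if source == target:
            raise ValueError("source and target must differ")
        # Always lock in the same order so two opposite transfers cannot deadlock.
        first, second = sorted([source, target])
        with self._locks[first], self._locks[second]:
            if self.balances[source] < amount:
                return False          # insufficient funds; nothing changes
            self.balances[source] -= amount
            self.balances[target] += amount
            return True


accounts = Accounts({"alice": 1000, "bob": 1000})


def worker(source, target):
    for _ in range(500):
        accounts.transfer(source, target, 1)


# Two threads transfer in opposite directions at the same time.
threads = [
    threading.Thread(target=worker, args=("alice", "bob")),
    threading.Thread(target=worker, args=("bob", "alice")),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

total = sum(accounts.balances.values())
```

In a database the same idea applies: lock the rows with SELECT ... FOR UPDATE in a consistent order (for example by account ID) within one transaction.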

Quick Decision Matrix

Problem                        Solution Pattern    Key Benefit
Too many writes at once        Queue               Smooth out spikes
Slow for global users          CDN                 Reduce latency
Single server overwhelmed      Horizontal scaling  Add capacity
Database reads slow            Cache               Fast in-memory access
Too many API calls             Rate limiting       Protect resources
Network failures cause issues  Idempotency         Safe retries
System goes down               Redundancy          High availability
Database too large             Sharding            Distribute data
Need to search text            Inverted index      Fast full-text search
Real-time updates needed       WebSockets          Bidirectional push