System Design Heuristics - Quick Reference Guide

Heuristics are rules of thumb that help you quickly identify appropriate solutions for common system design problems. These patterns have been battle-tested in production systems and provide a starting point for design decisions.

During interviews, recognizing these patterns quickly demonstrates experience and helps structure your approach.

Core Heuristics

Write Spike
Write + Spike -> Queue
When you have sudden bursts of write traffic that could overwhelm your system, use a message queue to buffer and smooth out the load.
When to use: Flash sales, user registration spikes, batch imports, webhook delivery
Trade-offs: Adds latency (asynchronous), eventual consistency, requires queue infrastructure

Technologies:

  • RabbitMQ, Apache Kafka, Amazon SQS, Redis Streams
  • Google Cloud Pub/Sub, Azure Service Bus

Use Cases:

  • E-commerce: Order processing during Black Friday
  • Social media: Post creation during viral events
  • Analytics: Event tracking from millions of users
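
The buffering idea can be sketched in memory as follows. This is a minimal illustration, not a production queue: `WriteBuffer`, its `capacity`, and the per-tick drain size are hypothetical names and numbers chosen for the example, and a real deployment would use one of the brokers listed above.

```python
from collections import deque


class WriteBuffer:
    """Bounded in-memory queue: absorbs bursts, drained at a fixed rate."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pending: deque = deque()
        self.stored: list = []  # stands in for the database

    def enqueue(self, item) -> bool:
        """Accept the write if there is room, otherwise signal back-pressure."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(item)
        return True

    def drain(self, max_items: int) -> int:
        """Process at most max_items writes, as one worker tick would."""
        done = 0
        while self.pending and done < max_items:
            self.stored.append(self.pending.popleft())
            done += 1
        return done


buffer = WriteBuffer(capacity=100)
# A burst of 50 writes arrives at once and is accepted immediately.
accepted = sum(buffer.enqueue(f"order-{i}") for i in range(50))
ticks = 0
while buffer.pending:
    buffer.drain(max_items=10)  # the database sees at most 10 writes per tick
    ticks += 1
```

The burst is acknowledged right away, while the database only ever sees the drain rate. That is where the added latency and eventual consistency in the trade-offs come from.
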
Global Latency
Latency + Global -> CDN
For globally distributed users, reduce latency by caching static content at edge locations close to users.
When to use: Static assets (images, CSS, JS), video streaming, downloadable content
Trade-offs: Cache invalidation complexity, increased cost, stale content risk

Technologies:

  • Cloudflare, Amazon CloudFront, Fastly, Akamai
  • Google Cloud CDN, Azure CDN

Use Cases:

  • Netflix: Video delivery to global audience
  • E-commerce: Product images and assets
  • News sites: Articles and media content
Load Growth
Load + Growth -> Scale Out (Horizontal Scaling)
When traffic grows beyond single machine capacity, add more machines rather than upgrading to larger machines.
When to use: Unpredictable growth, need for redundancy, cost efficiency at scale
Trade-offs: Increased complexity, distributed system challenges, load balancing required

Technologies:

  • Kubernetes, Docker Swarm, AWS Auto Scaling
  • Horizontal Pod Autoscaler (HPA)

Use Cases:

  • Web servers: Add more application instances
  • Microservices: Scale individual services independently
  • API gateways: Handle increased request volume
Read Bottleneck
Read + Bottleneck -> Cache
When database reads become a bottleneck, introduce a caching layer to serve frequently accessed data from memory.
When to use: Read-heavy workloads, expensive queries, frequently accessed data
Trade-offs: Cache invalidation, stale data, increased memory costs, cache warming

Technologies:

  • Redis, Memcached, Amazon ElastiCache
  • Varnish, Nginx caching, Application-level caches

Use Cases:

  • Social media: User profile data, friend lists
  • E-commerce: Product catalog, pricing info
  • News sites: Article content, homepage data
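
One common way to apply this is the cache-aside pattern with a time-to-live. The sketch below assumes a hypothetical `CacheAside` class and `load_profile` loader; a plain dictionary stands in for Redis or Memcached, and the clock is injectable only so the example is deterministic.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Check the cache first; on a miss, load from the source and store with a TTL."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expiry)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self._loader(key)  # the expensive read
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so readers do not see stale data."""
        self._entries.pop(key, None)


now = [0.0]    # fake clock for a deterministic example
db_calls = []


def load_profile(user_id: str) -> str:
    db_calls.append(user_id)  # stands in for a slow database query
    return f"profile:{user_id}"


cache = CacheAside(load_profile, ttl_seconds=60, clock=lambda: now[0])
cache.get("u1")  # miss: loads from the "database"
cache.get("u1")  # hit: served from memory
now[0] = 61.0
cache.get("u1")  # entry expired, so it is loaded again
```

The TTL bounds how stale an entry can get, and `invalidate` covers the case where the application knows the data changed.
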
Request Spike
Requests + Spike -> Throttle (Rate Limiting)
Protect your system from being overwhelmed by limiting the rate of requests from clients.
When to use: API protection, abuse prevention, fair resource allocation, DDoS mitigation
Trade-offs: Legitimate users may be blocked, complex quota management, distributed state

Technologies:

  • Token Bucket, Leaky Bucket algorithms
  • API Gateway rate limiting (AWS, Kong, Nginx)
  • Redis for distributed rate limiting

Use Cases:

  • APIs: 1000 requests per hour per user
  • Login: Prevent brute force attacks
  • Email: Limit sending to prevent spam
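
The token bucket algorithm mentioned above can be sketched in a few lines. This is a single-process illustration under assumed parameters (`rate`, `capacity`); a distributed limiter would keep the same state in a shared store such as Redis.

```python
import time
from typing import Callable


class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; each request spends one."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so short bursts are allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


now = [0.0]  # fake clock for a deterministic example
bucket = TokenBucket(rate=1.0, capacity=3, clock=lambda: now[0])
burst = [bucket.allow() for _ in range(4)]  # three pass, the fourth is throttled
now[0] = 2.0                                # two seconds later: two tokens refilled
later = [bucket.allow() for _ in range(3)]
```

Capacity controls how large a burst is tolerated, while the refill rate sets the sustained limit.
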
Retry Safety
Retry + Safety -> Idempotent Operations
Design operations so they can be safely retried multiple times with the same result, enabling resilience to failures.
When to use: Network unreliability, distributed systems, payment processing
Trade-offs: Implementation complexity, idempotency key management, storage overhead

Techniques:

  • Idempotency keys (UUIDs)
  • Natural idempotency (PUT, DELETE in REST)
  • Database constraints (unique indexes)

Use Cases:

  • Payments: Prevent double charging
  • Email: Don't send duplicate notifications
  • Database updates: Safe to retry
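
The idempotency-key technique might look like the sketch below. `PaymentService` and its fields are hypothetical; the in-memory dictionary stands in for a table with a unique index on the key, which is what makes the check-and-store step atomic in a real system.

```python
import uuid
from typing import Dict, List, Tuple


class PaymentService:
    """Remembers the result for each idempotency key so retries do not charge twice."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}       # idempotency key -> stored response
        self.charges: List[Tuple[str, int]] = []  # side effects actually performed

    def charge(self, idempotency_key: str, account: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # Retry of a request already handled: return the original result.
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {"status": "charged", "account": account, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())  # generated once by the client, reused on every retry
first = service.charge(key, "acct-1", 500)
retry = service.charge(key, "acct-1", 500)  # e.g. the first response was lost
```

The client sees the same response both times, but only one charge is recorded.
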
Core Failure
Core + Failure -> Redundancy
For critical components, maintain multiple copies to ensure availability even when individual components fail.
When to use: Mission-critical systems, high availability requirements, disaster recovery
Trade-offs: Increased cost, synchronization complexity, potential inconsistency

Strategies:

  • Active-Active: All replicas serve traffic
  • Active-Passive: Standby for failover
  • Multi-region deployment

Use Cases:

  • Databases: Primary-replica (master-slave) replication
  • Web servers: Load balanced instances
  • DNS: Multiple nameservers
Dataset Growth
Dataset + Growth -> Sharding (Partitioning)
When a single database becomes too large, split data across multiple databases based on a partition key.
When to use: Data exceeds single DB capacity, write throughput bottleneck, horizontal scaling
Trade-offs: Cross-shard queries complex, rebalancing difficulty, hotspot risks

Strategies:

  • Hash-based: shard = hash(key) % num_shards
  • Range-based: user_id 1 to 1,000,000 -> shard1, 1,000,001 to 2,000,000 -> shard2
  • Geographic: US users -> US DB, EU users -> EU DB

Use Cases:

  • Social media: User data sharded by user_id
  • E-commerce: Orders sharded by region
  • Gaming: Players sharded by game server
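
The hash-based strategy, shard = hash(key) % num_shards, can be sketched as follows. The function name `shard_for` is an assumption for illustration. The sketch uses `hashlib` rather than Python's built-in `hash()`, because the built-in is salted per process for strings and would route the same key to different shards after a restart.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a partition key to a shard index using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Route 1000 hypothetical user IDs across 4 shards.
shards = [shard_for(f"user-{i}", 4) for i in range(1000)]
```

Note that changing `num_shards` remaps most keys under plain modulo hashing, which is the rebalancing difficulty listed in the trade-offs.
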
Text Search
Text + Search -> Inverted Index
For full-text search, build an index mapping words to documents containing them, enabling fast search queries.
When to use: Full-text search, document retrieval, autocomplete, fuzzy matching
Trade-offs: Index storage overhead, index update latency, complex relevance ranking

Technologies:

  • Elasticsearch, Apache Solr, Amazon CloudSearch
  • PostgreSQL full-text search, MongoDB text indexes

Use Cases:

  • E-commerce: Product search
  • Documentation: Search help articles
  • Social media: Search posts and users
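
A toy inverted index shows the core data structure. The helper names (`tokenize`, `build_index`, `search`) and the sample documents are made up for this sketch; real engines such as Elasticsearch add stemming, relevance ranking, and on-disk storage on top of the same idea.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of document IDs that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the documents containing every term in the query (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {1: "Red running shoes", 2: "Blue running jacket", 3: "Red wool jacket"}
index = build_index(docs)
running = search(index, "running")
red_jacket = search(index, "red jacket")
```

A query only touches the posting sets for its own terms, so lookup cost depends on the query rather than on scanning every document.
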
Large File Upload
Upload + Large-file -> Chunking (Multipart Upload)
Split large files into smaller chunks for parallel upload, retry on failure, and resumable uploads.
When to use: Video uploads, large file transfers, unreliable networks
Trade-offs: Implementation complexity, chunk reassembly, orphaned chunks cleanup

Technologies:

  • S3 Multipart Upload, Azure Blob Storage
  • Resumable.js, tus protocol

Use Cases:

  • YouTube: Video uploads
  • Dropbox: Large file sync
  • Cloud storage: Backup uploads
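
Chunking with per-chunk checksums can be sketched like this. The function names and the 1024-byte chunk size are illustrative assumptions; services such as S3 Multipart Upload define their own part sizes and completion calls.

```python
import hashlib
from typing import List, Tuple

Chunk = Tuple[int, bytes, str]  # (index, payload, SHA-256 hex digest)


def split_into_chunks(data: bytes, chunk_size: int) -> List[Chunk]:
    """Cut the payload into fixed-size chunks, each tagged with index and checksum."""
    chunks = []
    for offset in range(0, len(data), chunk_size):
        payload = data[offset:offset + chunk_size]
        chunks.append((offset // chunk_size, payload, hashlib.sha256(payload).hexdigest()))
    return chunks


def reassemble(chunks: List[Chunk]) -> bytes:
    """Verify every chunk, then join them in index order."""
    parts = []
    for index, payload, checksum in sorted(chunks, key=lambda c: c[0]):
        if hashlib.sha256(payload).hexdigest() != checksum:
            raise ValueError(f"chunk {index} is corrupt; only this chunk needs a retry")
        parts.append(payload)
    return b"".join(parts)


data = bytes(range(256)) * 10                  # 2560 bytes standing in for a large file
chunks = split_into_chunks(data, 1024)         # chunks of 1024, 1024 and 512 bytes
restored = reassemble(list(reversed(chunks)))  # arrival order does not matter
```

Because each chunk carries its own index and checksum, chunks can be uploaded in parallel and a failure costs one chunk rather than the whole file.
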
Durability Failure
Durability + Failure -> Replication
Ensure data survives failures by maintaining multiple copies across different nodes or data centers.
When to use: Critical data, disaster recovery, data loss prevention
Trade-offs: Storage cost multiplier, replication lag, consistency challenges

Strategies:

  • Synchronous: Wait for all replicas (strong consistency)
  • Asynchronous: Don't wait (eventual consistency)
  • Quorum: Wait for majority

Use Cases:

  • Databases: MySQL replication, Cassandra RF=3
  • File storage: HDFS replication, S3 durability
  • Message queues: Kafka topic replication
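
The quorum strategy reduces to simple arithmetic, sketched below with hypothetical helper names. With N replicas, a write that waits for W acknowledgements and a read that consults R replicas are guaranteed to overlap on at least one replica when W + R > N.

```python
from typing import List


def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def quorum_write_succeeded(acks: List[bool]) -> bool:
    """A write is committed once a majority of replicas acknowledged it."""
    return sum(acks) >= majority(len(acks))


def read_overlaps_write(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    return w + r > n


# Replication factor 3, as in the Cassandra RF=3 example: one replica is down.
committed = quorum_write_succeeded([True, True, False])
overlap = read_overlaps_write(n=3, w=2, r=2)
```

Here the write still commits with one replica down, which is the availability gain over fully synchronous replication.
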
Real-time Broadcast
Broadcast + Realtime -> Pub/Sub
Enable real-time communication where publishers send messages to topics and subscribers receive them.
When to use: Live notifications, chat systems, event-driven architecture
Trade-offs: Message ordering challenges, delivery guarantees, scaling subscribers

Technologies:

  • Redis Pub/Sub, Apache Kafka, RabbitMQ
  • Google Pub/Sub, AWS SNS, Azure Event Grid

Use Cases:

  • Chat apps: Message broadcast to room
  • Stock ticker: Price updates to subscribers
  • Notifications: Push to all connected clients
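
The publisher, topic, and subscriber relationship can be illustrated with an in-process broker. `Broker` and its methods are hypothetical names for this sketch; it delivers synchronously and keeps no history, so it only shows the fan-out, not the delivery guarantees a real broker provides.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class Broker:
    """Routes each published message to every handler subscribed to that topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver to all current subscribers; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
alice_inbox: List[str] = []
bob_inbox: List[str] = []
broker.subscribe("room-1", alice_inbox.append)
broker.subscribe("room-1", bob_inbox.append)
delivered = broker.publish("room-1", "hello")  # both subscribers receive it
unheard = broker.publish("room-2", "anyone?")  # no subscribers, nobody receives it
```

The publisher never references subscribers directly, which is what lets either side scale or change independently.
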
Location Search
Location + Search -> Geohashing
Convert latitude/longitude into a string that allows efficient proximity searches and location grouping.
When to use: Nearby search, location-based services, geofencing
Trade-offs: Boundary issues (nearby points can fall in different cells), precision vs. string length, a shared prefix only approximates distance

Technologies:

  • Geohash algorithm, S2 Geometry (Google)
  • PostGIS, MongoDB geospatial indexes
  • Redis GEO commands

Use Cases:

  • Uber: Find nearby drivers
  • Yelp: Restaurants near me
  • Pokemon Go: Nearby Pokemon
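
The conversion from coordinates to a string works by repeatedly halving the longitude and latitude ranges and packing the resulting bits into base-32 characters. Below is a minimal sketch of standard geohash encoding; the function name and default precision are choices made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


here = geohash_encode(57.64911, 10.40744, precision=5)
nearby = geohash_encode(57.6492, 10.4075, precision=5)  # a point a few metres away
```

Both points fall in the same cell and so share the same five-character hash, which is what makes prefix matching usable for proximity queries. Points just across a cell boundary would not share it, so practical lookups also check neighboring cells.
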
Write Conflict
Write + Conflict -> Optimistic Lock
Allow concurrent reads but detect conflicts at write time using version numbers, avoiding expensive locks.
When to use: Low conflict rate, read-heavy workloads, distributed systems
Trade-offs: Retry logic required, wasted work on conflict, user experience on retry

Techniques:

  • Version numbers: UPDATE ... WHERE version = old_version
  • Timestamps: Check if last_modified changed
  • ETags: HTTP conditional requests

Use Cases:

  • E-commerce: Inventory updates
  • Collaborative editing: Document versions
  • Bank account: Balance updates
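
The version-number technique (UPDATE ... WHERE version = old_version) can be tried out with Python's built-in `sqlite3` module. The `inventory` table, its columns, and the helper names are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO inventory VALUES ('sku-1', 10, 1)")
conn.commit()


def read_item(conn: sqlite3.Connection, sku: str):
    """Return (quantity, version) as seen by this reader."""
    return conn.execute(
        "SELECT quantity, version FROM inventory WHERE sku = ?", (sku,)
    ).fetchone()


def try_update(conn: sqlite3.Connection, sku: str,
               new_quantity: int, expected_version: int) -> bool:
    """Write only if nobody changed the row since it was read."""
    cur = conn.execute(
        "UPDATE inventory SET quantity = ?, version = version + 1 "
        "WHERE sku = ? AND version = ?",
        (new_quantity, sku, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means the version was stale


# Two clients read the same version, then both try to write.
qty_a, ver_a = read_item(conn, "sku-1")
qty_b, ver_b = read_item(conn, "sku-1")
first = try_update(conn, "sku-1", qty_a - 1, ver_a)   # succeeds, version becomes 2
second = try_update(conn, "sku-1", qty_b - 3, ver_b)  # stale version, nothing updated
```

The second writer has to re-read the row and retry, which is the retry logic listed in the trade-offs.
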
Untrusted Execution
Untrusted + Execution -> Container/Sandbox
Isolate untrusted code execution in containers or sandboxes to prevent security breaches and resource abuse.
When to use: User-submitted code, multi-tenant systems, plugin architectures
Trade-offs: Performance overhead, complexity, resource limits enforcement

Technologies:

  • Docker, Kubernetes pods, gVisor
  • AWS Lambda, Google Cloud Functions
  • V8 isolates, WebAssembly

Use Cases:

  • Code playgrounds: CodePen, Replit
  • CI/CD: Build pipeline execution
  • Serverless: Function execution
Real-time Updates
Realtime + Updates -> WebSockets
Establish a persistent bidirectional connection so the server can push data to the client in real time.
When to use: Chat, live dashboards, real-time collaboration, live sports scores
Trade-offs: Connection management complexity, scaling challenges, fallback needed

Technologies:

  • Socket.IO, WebSocket API, SignalR
  • Server-Sent Events (SSE) for one-way
  • Long polling as fallback

Use Cases:

  • Slack: Real-time messaging
  • Stock trading: Live price updates
  • Google Docs: Collaborative editing
Traffic Reliability
Traffic + Reliability -> Load Balancer
Distribute incoming traffic across multiple servers to improve reliability and increase the capacity the system can handle.
When to use: Multiple backend servers, high availability needs, horizontal scaling
Trade-offs: Single point of failure (need redundancy), session affinity complexity, cost

Algorithms:

  • Round Robin: Distribute evenly
  • Least Connections: Send to least busy
  • IP Hash: Sticky sessions
  • Weighted: Based on server capacity

Technologies:

  • Nginx, HAProxy, AWS ELB/ALB
  • Google Cloud Load Balancing, Azure Load Balancer
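
Two of the algorithms above, round robin and least connections, can be sketched as small selection classes. The class names and server labels are hypothetical; real load balancers such as Nginx or HAProxy add health checks and weights on top of this.

```python
import itertools
from typing import Dict, List


class RoundRobin:
    """Hand out servers in a fixed rotation."""

    def __init__(self, servers: List[str]) -> None:
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnections:
    """Send each request to the server with the fewest active connections."""

    def __init__(self, servers: List[str]) -> None:
        self.active: Dict[str, int] = {server: 0 for server in servers}

    def pick(self) -> str:
        server = min(self.active, key=self.active.get)  # ties go to the first listed
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        """Call when a request finishes."""
        self.active[server] -= 1


rr = RoundRobin(["a", "b", "c"])
rr_order = [rr.pick() for _ in range(4)]

lc = LeastConnections(["a", "b", "c"])
lc_order = [lc.pick() for _ in range(3)]  # one connection on each server
lc.release("b")                           # b finishes its request first
next_pick = lc.pick()                     # so b is now the least busy
```

Round robin ignores how busy each server is, while least connections adapts when request durations vary.
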
Distributed Transaction
Distributed + Transaction -> Saga Pattern
Manage distributed transactions as a sequence of local transactions with compensating actions for rollback.
When to use: Microservices, long-running transactions, cross-service workflows
Trade-offs: Eventual consistency, complex rollback logic, partial failure handling

Types:

  • Choreography: Services emit events
  • Orchestration: Central coordinator

Use Cases:

  • E-commerce: Order -> Payment -> Inventory -> Shipping
  • Travel booking: Flight + Hotel + Car (all or none)
  • Banking: Transfer between accounts
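
An orchestrated saga can be sketched as a coordinator that runs local steps in order and, on failure, runs the compensations of completed steps in reverse. The `run_saga` function and the step names are made up for this illustration, and lambdas stand in for calls to other services.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, action, compensation)


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; if one fails, undo the completed ones in reverse order."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    log: List[str] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


events: List[str] = []


def reserve_inventory() -> None:
    raise RuntimeError("out of stock")  # simulated failure in the third service


steps: List[Step] = [
    ("order", lambda: events.append("create order"), lambda: events.append("cancel order")),
    ("payment", lambda: events.append("charge card"), lambda: events.append("refund card")),
    ("inventory", reserve_inventory, lambda: events.append("release stock")),
]
ok, log = run_saga(steps)
```

The payment is refunded and the order cancelled, but between the charge and the refund other services could observe the intermediate state. That window is the eventual consistency named in the trade-offs.
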
Concurrency Consistency
Concurrency + Consistency -> Row Locking
Prevent concurrent modifications to the same data by locking specific rows for the duration of a transaction.
When to use: High conflict scenarios, critical updates, financial transactions
Trade-offs: Reduced concurrency, deadlock risk, performance impact

Types:

  • Pessimistic: Lock before read (SELECT FOR UPDATE)
  • Optimistic: No lock held; check version at write time (listed for contrast)
  • Shared locks: Multiple readers
  • Exclusive locks: Single writer

Use Cases:

  • Bank transfers: Lock both accounts
  • Ticket booking: Lock seat during purchase
  • Inventory: Prevent overselling

Quick Decision Matrix

Problem                         Solution Pattern      Key Benefit
------------------------------  --------------------  ----------------------
Too many writes at once         Queue                 Smooth out spikes
Slow for global users           CDN                   Reduce latency
Single server overwhelmed       Horizontal scaling    Add capacity
Database reads slow             Cache                 Fast in-memory access
Too many API calls              Rate limiting         Protect resources
Network failures cause issues   Idempotency           Safe retries
System goes down                Redundancy            High availability
Database too large              Sharding              Distribute data
Need to search text             Inverted index        Fast full-text search
Real-time updates needed        WebSockets            Bidirectional push