Distributed Systems Concepts

Core distributed systems knowledge at the right level for a Distinguished Architect interview. Focus on trade-offs, not theory.

CAP Theorem (and PACELC)

One-liner
In a distributed system you can guarantee at most two of three properties: Consistency, Availability, and Partition tolerance. In practice, partitions will happen, so you're really choosing between Consistency and Availability during a partition (PACELC adds: Else, when there is no partition, choose between Latency and Consistency).
When to discuss
  • Designing multi-region systems
  • Explaining database choices
  • Trade-offs between consistency and availability
  • Failure mode discussions
Trade-offs
| Choice | Means | Example |
|---|---|---|
| CP (Consistency + Partition Tolerance) | System becomes unavailable during partitions to maintain consistency | Traditional RDBMS with strong consistency, ZooKeeper |
| AP (Availability + Partition Tolerance) | System stays available but may return stale data | DynamoDB, Cassandra, DNS |
| CA (Consistency + Availability) | Only possible on a single node (not distributed) | Single-server databases |
Twilio Application
Messaging Pipeline: AP system - messages should still be accepted even if you can't immediately confirm they're replicated everywhere. Better to accept the message (availability) and deal with eventual consistency than reject messages during a partition.

Account/Billing Data: More CP - better to briefly be unavailable than charge a customer twice or allow unauthorized access.
2-Minute Interview Answer

"When would you choose consistency over availability for Twilio's messaging platform?"

"The messaging pipeline itself should be AP - we want to accept messages even during network partitions. If we can't replicate immediately, that's okay - we queue it and eventually deliver. However, for the control plane - account authentication, API key validation, rate limit state - I'd lean more CP.

Here's why: If a customer sends a message and we temporarily can't persist it due to a partition, we can return a retry-able error and they'll resend. But if we accept an API request with an invalid key because we couldn't reach the auth service, that's a security issue.

In practice, I'd use PACELC thinking: Partitions are rare, so optimize for the else case - low latency and eventual consistency for the data plane, stronger consistency with slightly higher latency for the control plane. We'd use DynamoDB with eventually consistent reads for message metadata, but strongly consistent reads for account authentication."

Consistency Models

One-liner
Consistency models define what guarantees you get when reading data after writes in a distributed system. The spectrum runs from "always see the latest write" (strong) to "eventually you'll see it" (eventual).
Consistency Models Spectrum
| Model | Guarantee | Latency Cost | When to Use |
|---|---|---|---|
| Strong Consistency | Read always returns latest write | High - must coordinate across nodes | Financial transactions, inventory |
| Eventual Consistency | Eventually all replicas converge | Low - no coordination needed | Social media feeds, message status |
| Causal Consistency | Related writes seen in order | Medium - track causality | Message threads, comment replies |
| Read-Your-Writes | You see your own writes immediately | Low-Medium - session affinity | User settings, profile updates |
Twilio Application
Message Delivery Status: Eventual consistency is fine. If a message was delivered 2 seconds ago but the status check shows "sent" for another second, that's acceptable.

Rate Limiting: Causal consistency matters. If a customer hits their rate limit, subsequent requests must see that state or you'll over-deliver.

Account Settings: Read-your-writes consistency. When a developer updates a webhook URL, they should immediately see that change in the console.
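The read-your-writes pattern above can be sketched as a routing layer with session affinity. This is a minimal illustration, not a production design: the replica names, the single-writer assumption, and the in-memory session map are all hypothetical stand-ins for real infrastructure.

```python
import random

class ReplicaRouter:
    """Sketch of read-your-writes via session affinity.

    Writes go to a designated writer replica; a session that has written
    is routed back to that replica for reads (so it sees its own writes),
    while other sessions may read any, possibly stale, replica.
    """

    def __init__(self, replicas):
        self.replicas = replicas          # e.g. ["az-1", "az-2", "az-3"] (illustrative)
        self.last_write_replica = {}      # session_id -> replica that took the write

    def route_write(self, session_id):
        writer = self.replicas[0]         # assume the first replica accepts writes
        self.last_write_replica[session_id] = writer
        return writer

    def route_read(self, session_id):
        # The writing session reads where it wrote; everyone else gets
        # any replica and tolerates eventual consistency.
        if session_id in self.last_write_replica:
            return self.last_write_replica[session_id]
        return random.choice(self.replicas)

router = ReplicaRouter(["az-1", "az-2", "az-3"])
router.route_write("dev-session")        # write lands on az-1
router.route_read("dev-session")         # same session reads az-1: sees its write
```

In a real deployment the session map would live in a cookie or token and the affinity would expire once replication lag has passed.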
2-Minute Interview Answer

"How would you design message status tracking for billions of messages per day?"

"I'd use eventual consistency with DynamoDB. Here's the reasoning: message status is fundamentally asynchronous - the message goes through multiple states (queued, sent, delivered, read) over seconds or minutes. There's no user expectation of instant status updates.

The write path persists status to DynamoDB with eventually consistent replication across AZs. The read path (the status-check API) does eventually consistent reads - we're optimizing for low latency and high throughput.

However, I'd add read-your-writes consistency for the customer's own writes. If they send a message and immediately query its status, we route that to the same AZ using session affinity so they see their own write. But if someone else queries that message status, eventual consistency is fine.

This gives us the throughput to handle billions of writes/day while keeping p99 latency under 10ms for status checks."

Distributed Consensus (Paxos/Raft)

One-liner
Distributed consensus algorithms (Paxos, Raft) let multiple nodes agree on a value even in the presence of failures, but they're expensive - requiring multiple round trips and a quorum of nodes.
When Consensus is Needed
  • Leader election: Deciding which node is the primary
  • Configuration management: Cluster membership, feature flags
  • Distributed locks: Ensuring only one process does something
  • Atomic broadcast: Ensuring all nodes see events in same order
Trade-offs
Pros:
  • Strong consistency guarantees
  • Fault tolerant (can survive minority failures)
  • Well-understood algorithms
Cons:
  • High latency (multiple round trips)
  • Reduced availability (needs quorum)
  • Doesn't scale to many nodes (typically 3-7 nodes)
  • Complex to implement correctly
Twilio Application
Where to use consensus:
  • Leader election for message routing cells
  • Managing which nodes are in the cluster
  • Coordinating schema migrations
Where NOT to use consensus:
  • Every message send (way too slow)
  • Message status updates (don't need strong consistency)
  • Analytics/metrics collection (eventual consistency is fine)
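The "typically 3-7 nodes" constraint comes from simple quorum arithmetic: a consensus group needs a majority alive, so tolerating f crash failures requires 2f+1 nodes. A tiny sketch of that math:

```python
def majority_quorum(n):
    """Smallest number of nodes that forms a majority of an n-node group."""
    return n // 2 + 1

def nodes_needed(f):
    """Nodes required for a consensus group to tolerate f crashed nodes."""
    return 2 * f + 1

# A 5-node group needs 3 votes per decision and survives 2 failures;
# going from 5 to 7 nodes buys only one more failure but adds two more
# replicas to every round trip - which is why groups stay small.
print(majority_quorum(5), nodes_needed(2))
```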
2-Minute Interview Answer

"Would you use consensus for Twilio's messaging pipeline?"

"No, not for the data plane. Consensus requires multiple round trips and a quorum, which would add 10-50ms latency per message and limit throughput. For a system handling millions of messages per second, that's a non-starter.

Instead, I'd use consensus sparingly for control plane operations: leader election in each cell, cluster membership management, and distributed configuration. Let etcd or ZooKeeper handle that - don't build it yourself.

For the message pipeline itself, I'd design for eventual consistency with idempotent operations. Messages get written to Kafka or Kinesis (which uses quorum internally but hides that complexity), processed by workers, and delivery status is eventually consistent.

The key insight: use consensus to set up the infrastructure (which cells are active, who's the leader), but not for every transaction. That's how you get both reliability and scale."

Message Delivery Semantics

One-liner
At-most-once (might lose messages), at-least-once (might duplicate), and exactly-once (expensive/complex). For most systems, design for at-least-once with idempotent consumers.
Delivery Semantics Comparison
| Semantic | Guarantee | How to Achieve | Use Case |
|---|---|---|---|
| At-most-once | Message delivered 0 or 1 times | Send and forget, no retries | Metrics, logs, non-critical events |
| At-least-once | Message delivered 1+ times | Retry until acknowledged | Most messaging, event processing |
| Exactly-once | Message delivered exactly 1 time | Distributed transactions or deduplication | Financial transactions, billing |
The Pragmatic Approach

Design for at-least-once delivery with idempotent consumers

  • Producer retries until acknowledged
  • Consumer processes each message but makes operations idempotent
  • Use idempotency keys, deduplication windows, or natural idempotency
  • This gives you 99.9% of exactly-once semantics with 10% of the complexity
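The at-least-once-plus-idempotency pattern above can be sketched in a few lines. This is illustrative only: the in-memory set stands in for a durable store keyed by the idempotency key, and the field names are hypothetical.

```python
class IdempotentConsumer:
    """Consumer for an at-least-once queue, made safe via idempotency keys."""

    def __init__(self):
        self.processed = set()     # stand-in for a durable dedup table
        self.side_effects = []     # stand-in for the real work (e.g. carrier send)

    def handle(self, message):
        key = message["idempotency_key"]
        if key in self.processed:
            return "duplicate-ignored"          # redelivery: safe to ack again
        self.side_effects.append(message["body"])  # do the real work once
        self.processed.add(key)                 # record only after the work succeeds
        return "processed"

consumer = IdempotentConsumer()
msg = {"idempotency_key": "sm-123", "body": "hello"}
consumer.handle(msg)   # first delivery: work happens
consumer.handle(msg)   # queue redelivers: deduplicated, no double side effect
```

Note the ordering: the dedup record is written after the work, so a crash mid-processing causes a retry (duplicate work you must make idempotent) rather than a lost message.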
Twilio Application
SMS Delivery:
  • Accept message with idempotency key
  • Retry sending to carrier up to N times (at-least-once to carrier)
  • Use message SID for deduplication (idempotent)
  • Webhook callbacks include unique message ID for customer deduplication

Result: End-to-end reliability without expensive distributed transactions.

2-Minute Interview Answer

"How would you ensure messages aren't delivered twice?"

"I'd design for at-least-once delivery with idempotency, not exactly-once delivery. Here's why: true exactly-once delivery requires distributed transactions across the message queue, database, and external carriers - that's too expensive and complex.

Instead, the architecture would be:

1. Ingestion: Accept message with client-provided idempotency key. Store in DynamoDB with key as primary key - duplicate POSTs get deduplicated automatically.

2. Processing: Write to Kafka with message SID. Workers consume with at-least-once semantics. If a worker crashes mid-processing, Kafka redelivers.

3. Carrier delivery: Workers send to carrier. If they get a timeout, they retry. Carrier might receive duplicate requests, but we include the same message ID so they can deduplicate.

4. Customer webhooks: We might send delivery status webhook twice if our system restarts. We include message SID and status update ID so customers can deduplicate.

This gives end-to-end idempotency without distributed transactions. Each layer handles duplicates locally."

Partitioning and Sharding

One-liner
Partitioning (sharding) splits data across multiple nodes to scale beyond a single machine. The hard part is choosing the partition key and handling resharding.
Partitioning Strategies
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Hash-based | hash(key) % num_partitions | Even distribution | Hard to reshard, can't do range queries |
| Range-based | Key ranges (A-M, N-Z) | Range queries efficient | Hotspots if keys aren't distributed |
| Consistent Hashing | Keys and nodes on a ring | Adding nodes only affects neighbors | More complex, potential imbalance |
| Geography-based | Partition by region/country | Locality, data residency | Uneven load across regions |
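The consistent-hashing row is worth a concrete sketch, since it's the least obvious strategy. This minimal version places virtual nodes on a ring to smooth out the imbalance noted above; the hash choice (MD5) and vnode count are arbitrary illustrations, not recommendations.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Keys and nodes hash onto the same ring; each key is owned by the
    first node clockwise. Virtual nodes spread each physical node around
    the ring so load stays roughly even."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []                      # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ""))   # first ring point clockwise
        return self.ring[idx % len(self.ring)][1] # wrap around at the top

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
ring.owner("account-123")   # deterministic: same key always maps to same node
```

The payoff: adding a fourth node moves only roughly a quarter of the keys, instead of reshuffling nearly everything as `hash(key) % num_partitions` would.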
Twilio Application
Account Data: Partition by account_id (hash-based). Each account's data is on one shard, no cross-shard queries needed.

Message History: Partition by account_id + time_bucket. Keeps recent messages for each account together, enables efficient time-range queries.

Phone Number Inventory: Partition by country code + region. Supports locality (US customers query US numbers) and data residency requirements.
Key Considerations
  • Partition key choice: Must distribute load evenly and support common query patterns
  • Cross-partition queries: Expensive - design to avoid them or make them rare
  • Hotspots: If one partition gets more traffic, it becomes a bottleneck
  • Resharding: Adding/removing partitions requires data movement - plan for it upfront
2-Minute Interview Answer

"How would you partition Twilio's message data?"

"I'd use a hybrid approach: partition by account_id and time_bucket.

Primary partition key: account_id - This gives us tenant isolation, which is critical for Twilio. Each customer's data lives on specific shards. This also aligns with access patterns - customers query their own messages, not across accounts.

Secondary partition: time_bucket - Within each account, partition by month or week. This keeps recent data hot and allows us to archive old messages to cheaper storage.

Number of shards: Start with 256 logical shards mapped to fewer physical nodes. This gives us room to scale - we can add physical nodes and rebalance logical shards without changing the partition scheme.

Handling hotspots: Large customers (enterprises sending millions of messages/day) get dedicated shards. We detect this in the API layer and route their traffic separately. Small customers share shards.

This approach scales to billions of messages, maintains tenant isolation, supports time-range queries, and handles heterogeneous customer sizes."
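The logical-to-physical shard mapping described in the answer above can be sketched as follows. The shard count and node names are taken from or invented for the example; a real system would keep the map in a config store, not a dict.

```python
import hashlib

NUM_LOGICAL_SHARDS = 256   # fixed up front so the partition scheme never changes

def logical_shard(account_id):
    """Hash the partition key to one of the fixed logical shards."""
    h = int(hashlib.sha256(account_id.encode()).hexdigest(), 16)
    return h % NUM_LOGICAL_SHARDS

def physical_node(account_id, shard_map):
    """shard_map assigns each logical shard to a physical node. Rebalancing
    means reassigning entries in this map and moving those shards - the
    hash function and shard count stay untouched."""
    return shard_map[logical_shard(account_id)]

# e.g. 4 physical nodes, 64 logical shards each (illustrative)
shard_map = {s: f"node-{s % 4}" for s in range(NUM_LOGICAL_SHARDS)}
physical_node("AC1234567890", shard_map)
```

A dedicated shard for a large customer is then just a map override: point that customer's logical shards at their own nodes.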

Clock Synchronization and Ordering

One-liner
You can't rely on wall-clock time in distributed systems. Use logical clocks (Lamport timestamps, vector clocks) or hybrid logical clocks when ordering matters.
Time and Ordering Mechanisms
| Approach | How It Works | When to Use |
|---|---|---|
| Wall-clock time (NTP) | Synchronized physical clocks | Approximate ordering, timestamps for humans |
| Lamport timestamps | Logical counter, incremented per event | Total ordering of events in a system |
| Vector clocks | Per-node counters, tracks causality | Detecting concurrent updates, conflict resolution |
| Hybrid logical clocks | Combines wall time + logical counter | Modern databases (CockroachDB, MongoDB) |
Why Wall-Clock Time is Problematic
  • Clocks drift (NTP can only sync within milliseconds)
  • Clocks can jump backwards (NTP corrections, leap seconds)
  • Can't determine "happens-before" relationships reliably
  • Cross-region clock skew can be 100ms+
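A hybrid logical clock addresses these problems directly. The sketch below covers only the local-event half of the algorithm (merging a timestamp received from another node additionally takes a max over both components); it is a simplified illustration, not a full HLC implementation.

```python
import time

class HybridLogicalClock:
    """Wall time for human readability, plus a logical counter that keeps
    timestamps monotonic within the same millisecond - and even if the
    wall clock jumps backwards after an NTP correction."""

    def __init__(self):
        self.wall = 0       # last observed physical time (ms)
        self.counter = 0    # logical component

    def now(self):
        physical = int(time.time() * 1000)
        if physical > self.wall:
            self.wall, self.counter = physical, 0   # physical time advanced
        else:
            self.counter += 1   # same ms, or clock went backwards: count up
        return (self.wall, self.counter)

clock = HybridLogicalClock()
a = clock.now()
b = clock.now()
assert b > a   # monotonic even when both calls land in the same millisecond
```

Comparing timestamps as `(wall, counter)` tuples gives the precise event ordering that raw wall-clock time cannot.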
Twilio Application
Message ordering: Don't guarantee strict ordering unless explicitly required. Most customers don't care if two messages sent 10ms apart arrive slightly out of order.

When ordering matters (e.g., conversation threads): Use sequence numbers per conversation, not timestamps.

Audit logs: Use hybrid logical clocks - wall time for human readability, logical counter for precise ordering.
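The per-conversation sequence-number idea can be sketched as a small hold-back buffer: out-of-order arrivals wait until the gap before them fills. Field names and the in-memory structures are illustrative only.

```python
class ConversationBuffer:
    """Delivers messages in per-conversation sequence order (sketch).

    A message that arrives ahead of a gap is held; once the missing
    sequence numbers arrive, the whole run is delivered in order."""

    def __init__(self):
        self.next_seq = 1
        self.held = {}        # seq -> body, waiting for earlier messages
        self.delivered = []

    def receive(self, seq, body):
        self.held[seq] = body
        while self.next_seq in self.held:   # deliver any contiguous run
            self.delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1

buf = ConversationBuffer()
buf.receive(2, "world")   # arrives early: held back, nothing delivered
buf.receive(1, "hello")   # fills the gap: both deliver, in order
```

A production version would also need a timeout for permanently missing sequence numbers, so one lost message can't stall a conversation forever.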
2-Minute Interview Answer

"How would you order message delivery when messages come from multiple regions?"

"First, I'd ask: do we actually need strict ordering? For most messaging use cases, best-effort ordering is sufficient. Messages sent seconds apart don't need microsecond precision.

If the customer doesn't require ordering, we accept messages in any region with a timestamp for approximate ordering, but we don't guarantee it. This is the cheapest and fastest approach.

If ordering IS required (like a conversation thread), I wouldn't rely on timestamps. Instead:

1. Each conversation gets a logical sequence number
2. Messages include a client-generated sequence number or explicitly reference the previous message
3. We detect gaps and can hold later messages until earlier ones arrive

For internal systems (audit logs, event sourcing), I'd use hybrid logical clocks: each event gets a timestamp that's mostly wall-clock time but includes a logical counter for events that happen on the same node in the same millisecond. This gives us human-readable timestamps that also support precise ordering.

The key is: don't rely on synchronized clocks across regions for correctness, only for approximate ordering."

Replication Patterns

One-liner
Replication provides redundancy and read scalability. Leader-follower is most common; multi-leader adds complexity but enables multi-region writes; leaderless (Dynamo-style) maximizes availability.
Replication Strategies
| Pattern | Writes | Reads | Complexity | Use Case |
|---|---|---|---|---|
| Leader-Follower | Leader only | Any replica | Low | Most databases, read-heavy workloads |
| Multi-Leader | Any leader | Any replica | High (conflicts!) | Multi-region apps, collaborative editing |
| Leaderless (Quorum) | Quorum of nodes | Quorum of nodes | Medium | High availability, DynamoDB/Cassandra |
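The leaderless row deserves a concrete sketch of why quorums work: with R + W > N, every read quorum overlaps every write quorum, so a read always touches at least one replica holding the latest acknowledged write. This toy model fakes the replica selection to show the overlap; real systems pick whichever replicas respond first and add read repair.

```python
class QuorumStore:
    """Leaderless replication sketch: write to W of N replicas, read from R."""

    def __init__(self, n=3, w=2, r=2):
        assert r + w > n, "quorums must overlap"
        self.replicas = [dict() for _ in range(n)]   # each maps key -> (version, value)
        self.n, self.w, self.r = n, w, r

    def write(self, key, value, version):
        # Pretend only the first W replicas acknowledged the write.
        for replica in self.replicas[: self.w]:
            replica[key] = (version, value)

    def read(self, key):
        # Query R replicas (here, the last R - the worst-case overlap of
        # exactly one node) and return the highest-versioned value seen.
        responses = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(responses)[1] if responses else None

store = QuorumStore()
store.write("status", "sent", version=1)
store.write("status", "delivered", version=2)
store.read("status")   # sees the latest write via the overlapping replica
```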
Replication Lag Challenges
Synchronous replication:
  • Pro: Followers are guaranteed up-to-date
  • Con: Writes are slow (wait for replicas)
Asynchronous replication:
  • Pro: Writes are fast (don't wait for replicas)
  • Con: Followers might lag, can lose data if leader fails
Semi-synchronous (common):
  • Wait for 1 follower, others async
  • Balances durability and performance
Twilio Application
Account data (RDS/Aurora): Leader-follower with async replication to read replicas in same region. Control plane tolerates slight lag.

Message metadata (DynamoDB): Leaderless with quorum reads/writes. Cross-region replication for disaster recovery.

Global configuration: Multi-leader replication (via DynamoDB Global Tables). Each region can update, conflicts resolved via last-write-wins.
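The last-write-wins resolution mentioned above is easy to illustrate. This is the general idea only - DynamoDB Global Tables' actual tie-breaking is internal to the service - and the timestamps, region names, and record shape here are made up for the example.

```python
def lww_merge(a, b):
    """Deterministic last-write-wins for multi-leader conflicts.

    Records are (timestamp_ms, region_id, value). Comparing the
    (timestamp, region_id) pair means every replica picks the same
    winner even on a timestamp tie - but the losing concurrent write
    is silently discarded, which is the core LWW trade-off."""
    return a if (a[0], a[1]) >= (b[0], b[1]) else b

# Two regions update the same config entry in the same millisecond:
us = (1700000000500, "us-east-1", {"webhook": "https://a.example"})
eu = (1700000000500, "eu-west-1", {"webhook": "https://b.example"})
winner = lww_merge(us, eu)   # tie on timestamp: region id breaks the tie
```

Because the merge is commutative and deterministic, replicas that apply updates in different orders still converge to the same value - the property multi-leader replication needs.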
2-Minute Interview Answer

"How would you design multi-region replication for Twilio's account database?"

"I'd use leader-follower with regional read replicas and cross-region disaster recovery, not multi-leader.

Architecture:
- Primary region (e.g., us-east-1) has the leader database (Aurora)
- Same region has 2+ read replicas for read scaling
- Secondary region (e.g., us-west-2) has cross-region read replica for disaster recovery

Why not multi-leader? Account data has strong consistency requirements - you can't have two regions accepting conflicting updates to the same account. Multi-leader would require conflict resolution, which is complex and error-prone for structured data like accounts and billing.

Write path: All writes go to the leader in the primary region. Control plane writes are infrequent enough (account creation, setting updates) that cross-region latency is acceptable.

Read path:
- In-region reads go to read replicas (fast)
- Cross-region reads tolerate replication lag (eventual consistency)
- For read-your-writes consistency, use session affinity to the region where you wrote

Failover: If primary region fails, promote the cross-region replica to leader. This is rare but tested regularly.

This gives us read scalability, disaster recovery, and avoids the complexity of multi-leader conflict resolution."