Architectural Patterns for Reliability

Resilience patterns are critical for distributed systems like Twilio's. The focus here is on preventing cascading failures and maintaining availability.

Circuit Breaker

One-liner
Prevent cascading failures by stopping calls to a failing dependency. Like an electrical circuit breaker, it "trips" when failures exceed a threshold.
How It Works
Three states:
  1. Closed (normal): Requests pass through. Track failures.
  2. Open (tripped): Reject requests immediately, return fallback. Don't call dependency.
  3. Half-Open (testing): Allow limited requests through to test if dependency recovered.

State transitions:
  • Closed → Open: When failure rate exceeds threshold (e.g., 50% failures in 10 seconds)
  • Open → Half-Open: After timeout (e.g., 30 seconds)
  • Half-Open → Closed: If test requests succeed
  • Half-Open → Open: If test requests fail
┌──────────────────────────────────────────────────┐
│                 CIRCUIT BREAKER                  │
│                                                  │
│   CLOSED ──[failures > threshold]──> OPEN        │
│     ▲                                  │         │
│     │                        [timeout elapsed]   │
│     │                                  │         │
│     │                                  ▼         │
│     └──[test succeeds]── HALF-OPEN ◄───┘         │
│                             │                    │
│                       [test fails]               │
│                             │                    │
│                             ▼                    │
│                           OPEN                   │
└──────────────────────────────────────────────────┘
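The three states and their transitions can be sketched as a small class. This is a minimal illustration, not a production implementation: it tracks consecutive failures in-process and uses a single test request for the Half-Open probe (real libraries such as resilience4j or Polly add sliding windows, metrics, and thread safety).

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after a failure threshold,
    half-opens after a timeout, closes again on a successful probe."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, fallback=None, **kwargs):
        if self.state == "open":
            # After the timeout, let one test request through (Half-Open).
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"
            else:
                return fallback  # fail fast; don't touch the dependency
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            return fallback
        # Success: close the circuit and reset the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

A caller wraps each dependency call, e.g. `breaker.call(send_to_carrier, msg, fallback=None)`, and checks for the fallback value instead of waiting out a timeout.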
When to Use
  • Calling external APIs that might fail or be slow
  • Dependencies that could timeout
  • Services under your control that might overload
  • Preventing retry storms
Trade-offs
Pros:
  • Prevents cascading failures and retry storms
  • Fails fast rather than waiting for timeouts
  • Gives failing service time to recover
  • Provides fallback experience
Cons:
  • Adds complexity to client code
  • Need to tune thresholds (false positives vs late detection)
  • Requires fallback strategy
Twilio Application
SMS carrier integration: Each carrier gets a circuit breaker. If Carrier A is timing out, open the circuit and fail fast. Route traffic to Carrier B or return error to customer faster.

Webhook delivery: Circuit breaker per customer webhook URL. If their endpoint is down, open circuit and stop hammering it. Queue webhooks for later retry.

Internal service calls: Identity service, billing service, analytics - all get circuit breakers. If billing is slow, don't wait for timeout on every request.
2-Minute Interview Answer

"How would you prevent a slow carrier API from impacting Twilio's message delivery?"

"I'd implement circuit breakers per carrier with aggressive timeouts.

Setup: Each carrier integration has a circuit breaker tracking success/failure rate over a sliding window - say, 100 requests or 10 seconds.

Failure threshold: If more than 50% of requests fail or timeout in that window, trip the circuit to Open. Don't send any more requests to that carrier for 30 seconds.

Fast failure: While Open, immediately reject new messages for that carrier. Either return an error to the customer or route to a backup carrier if available.

Recovery testing: After 30 seconds, enter Half-Open. Send a few test messages. If they succeed, close the circuit and resume normal traffic. If they fail, stay Open for another interval.

Why this matters: Without circuit breakers, you'd have thousands of threads blocked waiting for timeouts to that slow carrier. With circuit breakers, you fail fast and preserve capacity to deliver messages through healthy carriers.

I'd also add metrics and alerting on circuit breaker state transitions so we know when carriers are having issues."

Bulkhead / Fault Isolation

One-liner
Isolate failures by partitioning resources. Like bulkheads in a ship prevent one leak from sinking the whole vessel.
How It Works
Resource isolation strategies:
  • Thread pools: Separate thread pool per dependency (don't let one slow service starve all threads)
  • Connection pools: Separate DB connection pool per tenant or use case
  • Infrastructure: Separate compute clusters, VPCs, regions (cells)
  • Rate limits: Per-tenant quotas prevent noisy neighbors
┌─────────────────────────────────────────────────┐
│                BULKHEAD PATTERN                 │
│                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │ Tenant A │  │ Tenant B │  │ Tenant C │       │
│  │ Traffic  │  │ Traffic  │  │ Traffic  │       │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘       │
│       │             │             │             │
│  ┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐       │
│  │  Cell 1  │  │  Cell 2  │  │  Cell 3  │       │
│  │ (isolated│  │(isolated)│  │(isolated)│       │
│  │  infra)  │  │          │  │          │       │
│  └──────────┘  └──────────┘  └──────────┘       │
│                                                 │
│  If Cell 1 fails, Cells 2 & 3 unaffected        │
└─────────────────────────────────────────────────┘
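Thread-pool bulkheads, the first strategy above, can be sketched with one bounded executor per dependency. The dependency names and pool sizes here are illustrative, not tuned values; the point is that a hung dependency can only exhaust its own pool.

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency: a slow carrier can block
# at most its own max_workers threads, never the whole worker fleet.
pools = {
    "carrier_a": ThreadPoolExecutor(max_workers=10, thread_name_prefix="carrier_a"),
    "carrier_b": ThreadPoolExecutor(max_workers=10, thread_name_prefix="carrier_b"),
    "billing":   ThreadPoolExecutor(max_workers=5,  thread_name_prefix="billing"),
}

def submit(dependency, fn, *args):
    """Route work to the bulkheaded pool for that dependency."""
    return pools[dependency].submit(fn, *args)

# If carrier_a hangs, only its 10 threads block; carrier_b keeps flowing.
future = submit("carrier_b", lambda msg: f"sent:{msg}", "hello")
```

The same shape applies to connection pools: give each tenant or use case its own bounded pool rather than one shared pool that a single noisy consumer can drain.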
When to Use
  • Multi-tenant systems (prevent one tenant from affecting others)
  • Multiple external dependencies (isolate their failures)
  • Mixed workload priorities (batch vs real-time)
  • Scaling beyond single datacenter/region
Twilio Application (This is YOUR expertise!)
Cell-based architecture: Enterprise customers get dedicated cells. Startups share cells. One cell failure doesn't cascade.

Thread pool isolation: Worker threads for carrier A, B, C are separate. If carrier A is slow, it only blocks threads in its pool.

Database connection pools: Control plane queries get separate pool from data plane. Analytics queries don't starve production traffic.

Rate limiting per tenant: Each customer has quota. One customer's burst doesn't consume all capacity.
Trade-offs
Pros:
  • Limits blast radius of failures
  • Prevents noisy neighbor problems
  • Enables independent scaling
  • Better capacity management
Cons:
  • Resource inefficiency (reserved capacity may be idle)
  • Operational complexity (more infrastructure to manage)
  • Requires careful capacity planning
2-Minute Interview Answer

"How would you prevent one large customer from impacting Twilio's other customers?"

"This is exactly what cell-based architecture solves. I'd design multiple fault-isolated cells where enterprise customers get dedicated cells and smaller customers share cells.

Cell structure: Each cell is a full stack - API gateways, workers, databases, message queues - deployed in its own VPC with its own capacity.

Tenant → cell mapping: Routing layer maps account_id to cell_id. Enterprise customer 'ACME Corp' routes to Cell-ACME. Thousands of small customers route to Cell-Shared-1, Cell-Shared-2, etc.

Blast radius: If ACME sends a burst of 1 million messages and overwhelms their cell, it only affects them. Cell-Shared-1 is completely isolated.

Within shared cells, further isolation: Rate limiting per account. Thread pool bulkheads for different operations. Connection pool limits.

Resource efficiency: Shared cells run at higher utilization. Dedicated cells have reserved capacity but may be underutilized - that's the price of isolation.

This is the same philosophy I've implemented at PayPal - fault isolation is more important than resource efficiency for reliability."

Retry with Exponential Backoff

One-liner
Retry failed requests with increasing delays between attempts. Add jitter to prevent thundering herd.
How It Works
Basic exponential backoff:
  • Attempt 1: Immediate
  • Attempt 2: Wait 1 second
  • Attempt 3: Wait 2 seconds
  • Attempt 4: Wait 4 seconds
  • Attempt 5: Wait 8 seconds (or give up)

With jitter (randomization):
  • Instead of exactly 4 seconds, wait random(2, 4) seconds
  • Prevents all clients retrying at the same time (thundering herd)

Formula: wait = min(max_delay, base_delay * 2^attempt * random(0.5, 1.5)) — the cap is applied after jitter so the wait never exceeds max_delay.
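The formula translates directly into a small helper. This is a sketch with assumed defaults (1-second base, 60-second cap); the jitter factor of random(0.5, 1.5) matches the formula above, and the cap is applied after jitter so the result never exceeds max_delay.

```python
import random

def backoff_delay(attempt, base_delay=1.0, max_delay=60.0):
    """Exponential backoff with jitter.
    attempt is 0-based: attempt 0 -> ~base_delay, attempt 1 -> ~2x, etc."""
    jittered = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
    return min(max_delay, jittered)
```

A retry loop would call this between attempts: `time.sleep(backoff_delay(attempt))`.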
When to Use
  • Network errors (transient failures)
  • Rate limiting (429 responses)
  • Service temporarily overloaded (503 responses)
  • Distributed systems where eventual success is expected
When NOT to Retry
  • 4xx errors (except 429): Bad request won't succeed on retry
  • Non-idempotent operations: Unless you have idempotency keys
  • User-facing synchronous requests: Don't make users wait through multiple retries
  • When circuit breaker is open: Respect the circuit breaker
Twilio Application
Carrier API calls: Retry with exponential backoff up to 3 attempts. If carrier returns 503, they're overloaded - back off.

Webhook delivery: Customer webhook fails? Retry with backoff up to 24 hours. Common pattern: 1min, 5min, 30min, 1hr, 6hr, 24hr.

Internal queue processing: Worker fails to process message? Requeue with exponential backoff, max 5 attempts, then dead letter queue.
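The internal-queue pattern above (requeue with backoff, then dead-letter after max attempts) can be sketched as a worker's failure path. The queue interfaces here are plain lists standing in for a real broker; `not_before` represents a broker-specific delayed-delivery mechanism.

```python
import random

MAX_ATTEMPTS = 5

def handle_failure(message, queue, dead_letter_queue):
    """Requeue a failed message with exponential backoff;
    dead-letter it once MAX_ATTEMPTS is reached."""
    message["attempts"] = message.get("attempts", 0) + 1
    if message["attempts"] >= MAX_ATTEMPTS:
        dead_letter_queue.append(message)   # park for inspection / manual replay
    else:
        # Delay doubles per attempt, capped at 5 minutes, with ~10% jitter.
        delay = min(300, 2 ** message["attempts"]) * random.uniform(0.9, 1.1)
        message["not_before"] = delay       # broker-specific delayed requeue
        queue.append(message)
```

The same shape drives webhook retries, just with the longer schedule (1min, 5min, 30min, ...) in place of the doubling delay.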
Trade-offs
Pros:
  • Handles transient failures automatically
  • Reduces load on struggling service (backs off)
  • Jitter prevents thundering herd
Cons:
  • Increases latency (waiting for retries)
  • Can hide underlying problems (masking failures)
  • May exhaust retries on permanent failures
2-Minute Interview Answer

"How would you handle webhook delivery failures?"

"I'd implement an asynchronous retry system with exponential backoff and eventual dead-lettering.

Immediate delivery attempt: When event occurs (message delivered), synchronously attempt webhook POST. If it succeeds in < 5 seconds, done.

First failure: If it fails (timeout, 5xx, network error), enqueue for retry. Don't block the main message pipeline waiting for retries.

Retry schedule with exponential backoff:
- Retry 1: 1 minute later
- Retry 2: 5 minutes after retry 1
- Retry 3: 30 minutes after retry 2
- Retry 4: 1 hour after retry 3
- Retry 5: 6 hours after retry 4
- Final retry: 24 hours after retry 5

Jitter: Add random 10-20% jitter to each delay to prevent thundering herd if many webhooks fail simultaneously.

Circuit breaker: If a customer's webhook URL fails repeatedly, open circuit breaker for that URL. Stop trying for a longer period.

Dead letter queue: After final retry fails, move to DLQ. Customer can see failed webhooks in console and manually replay.

Don't retry on 4xx (except 429): If customer returns 400, their endpoint doesn't want this webhook - don't retry."

Rate Limiting / Throttling

One-liner
Control the rate of requests to prevent overload. Essential for multi-tenant systems and protecting downstream dependencies.
Common Rate Limiting Algorithms
  • Token Bucket: bucket fills with tokens at a fixed rate; each request consumes a token. Pros: allows bursts, smooth rate limiting. Cons: can burst above the average rate.
  • Leaky Bucket: requests enter a bucket and are processed at a fixed rate. Pros: enforces a strict rate, smooths traffic. Cons: no bursting allowed.
  • Fixed Window: count requests per time window (e.g., per minute). Pros: simple, low memory. Cons: bursts at window edges (up to 2x the rate possible).
  • Sliding Window: weighted count based on the current position in the window. Pros: more accurate than fixed window. Cons: more complex, more memory.
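The token bucket, the most common of these, fits in a few lines. This in-process sketch illustrates the algorithm only; a distributed limiter would keep the token count in shared state such as Redis.

```python
import time

class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at
    `rate` tokens/sec. Each allowed request consumes one token."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bucket with `rate=10, capacity=20` allows a burst of 20 requests, then settles to roughly 10 requests/sec — the "allows bursts" property from the table.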
Where to Apply Rate Limiting
  • API gateway: Per account, per API key
  • Protecting dependencies: Limit calls to external APIs
  • Resource protection: DB connections, CPU, memory
  • Abuse prevention: Prevent DoS, bot traffic
  • Fair sharing: Multi-tenant resource allocation
Twilio Application
Per-account API rate limits: Each account has message send quota (e.g., 1000 msgs/sec). Enforced at API gateway.

Carrier protection: Rate limit outbound calls to each carrier to respect their limits.

Webhook delivery: Rate limit webhook POSTs per customer URL (don't overwhelm their servers).

Tiered limits: Free tier: 100 msgs/day. Pro tier: 10k msgs/day. Enterprise: custom limits.
Trade-offs
Where to implement:
  • Client-side (SDK): Cheap, but can't trust it (users can bypass)
  • API gateway: Good for per-account limits, but single point of failure
  • Distributed (Redis): Scales well, but adds dependency and latency
  • Local (in-process): Fast, but not accurate in distributed system
2-Minute Interview Answer

"How would you implement rate limiting for Twilio's API?"

"I'd implement tiered rate limiting at the API gateway level with distributed state in Redis.

Architecture:
- API gateway receives request, extracts account_id from API key
- Check rate limit in Redis: increment counter for this account in current time window
- If under limit, allow request. If over limit, return 429 Too Many Requests.

Algorithm: Sliding window counter in Redis
Use sorted set with timestamp as score. For each request:
1. Remove entries older than window (1 minute)
2. Count remaining entries
3. If count < limit, add new entry
4. Else reject with 429

Multiple tiers:
- Free: 100 msgs/minute
- Pro: 10,000 msgs/minute
- Enterprise: 100,000 msgs/minute or custom
Look up tier from account metadata.

Graceful degradation: If Redis is down, fail open with local in-memory rate limiting (less accurate but keeps system available).

Response headers: Include X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset in every response so clients can self-regulate.

This balances accuracy, performance, and operational simplicity."

Saga Pattern (Distributed Transactions)

One-liner
Coordinate multi-step transactions across services without distributed locks. Use compensating transactions to undo on failure.
How It Works
Two approaches:

1. Choreography (event-driven):
  • Each service listens for events and triggers next step
  • No central coordinator
  • Services publish events, others subscribe

2. Orchestration (coordinator):
  • Central orchestrator service tells each service what to do
  • Orchestrator tracks state and handles failures
  • Easier to understand and debug

Compensating transactions:
If step 4 fails, execute compensating transactions for steps 3, 2, 1 in reverse order to undo the work.
SAGA PATTERN - Orchestration

┌────────────┐
│Orchestrator│
└──────┬─────┘
       │
       ├─ 1. Reserve Inventory ──> ✓
       │
       ├─ 2. Charge Payment ──────> ✓
       │
       ├─ 3. Ship Order ──────────> ✗ FAILED!
       │
       ├─ Compensate: Refund ─────> ✓
       │
       └─ Compensate: Unreserve ──> ✓

Result: Transaction rolled back
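The orchestration core is a loop over (action, compensate) pairs, undoing completed steps in reverse on failure. This sketch uses an in-memory log and a hypothetical order-placement saga; a real orchestrator would persist saga state so it can resume after a crash.

```python
def run_saga(steps):
    """Each step is (action, compensate). On any failure, run the
    compensations for completed steps in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                undo()          # compensating transaction
            return False        # saga rolled back
    return True                 # saga committed

# Hypothetical order-placement saga: step 3 fails, as in the diagram.
log = []

def ship_order():
    raise RuntimeError("carrier down")

steps = [
    (lambda: log.append("reserve_inventory"), lambda: log.append("unreserve")),
    (lambda: log.append("charge_payment"),    lambda: log.append("refund")),
    (ship_order,                              lambda: log.append("unship")),
]
```

Running `run_saga(steps)` returns False and leaves the log showing the forward steps followed by the compensations in reverse order: refund, then unreserve.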
When to Use
  • Multi-service transactions (e.g., order placement across inventory, payment, shipping)
  • Long-running workflows
  • When 2-phase commit is too expensive
  • When services are independently deployed and scaled
Twilio Application
Message send workflow (orchestration saga):
  1. Validate account & API key (auth service)
  2. Check rate limit & decrement quota (billing service)
  3. Deduct account balance (billing service)
  4. Queue message for delivery (messaging service)
  5. Return SID to customer
If step 4 fails: Compensate by refunding balance, restoring quota.

Account provisioning (choreography saga):
Account created → Billing service provisions account → Identity service creates credentials → Welcome email sent
Trade-offs
  • Choreography — Pros: decentralized, no single point of failure. Cons: hard to track the workflow; circular dependencies possible.
  • Orchestration — Pros: clear workflow, easy to debug, centralized state. Cons: orchestrator is a SPOF and can become complex.
2-Minute Interview Answer

"How would you handle a message send that requires checking quota, charging balance, and queuing delivery?"

"I'd use an orchestrator-based saga pattern to coordinate the multi-service transaction.

Orchestrator flow:
1. Customer POSTs to /Messages API
2. API gateway calls orchestrator service
3. Orchestrator executes saga:
a. Call auth service: validate API key → get account_id
b. Call billing service: check if account has quota and balance
c. Call billing service: reserve balance (soft charge)
d. Call messaging service: queue message with unique SID
e. Call billing service: confirm charge (hard commit)
4. Return SID to customer

If step 3d fails (message queue is down):
Orchestrator executes compensating transaction:
- Call billing service: release reserved balance
- Return 503 to customer: 'Please retry'

Why orchestration over choreography: This is a synchronous user-facing API. We need to respond within 500ms. Orchestration gives us:
- Clear timeout boundaries
- Easy retry logic
- Centralized monitoring of success rate

Idempotency: Accept idempotency key from customer. If they retry the same POST, return same SID without executing saga again.

This approach gives ACID-like semantics across microservices without distributed transactions."

Event Sourcing & CQRS

One-liner
Event Sourcing: store every change as an immutable event and derive current state by replaying them. CQRS: separate read and write models so each can be optimized independently.
Event Sourcing
Instead of storing current state:
  • Store every event that changed the state
  • Current state = replay all events from beginning
  • Events are immutable, append-only

Example - Bank Account:
Don't store: balance = $500
Do store: [AccountCreated($0), Deposited($1000), Withdrew($500)]
Replay events → balance = $500
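The bank-account example above amounts to a fold over the event list: state is never stored, only derived. A minimal sketch:

```python
# Events are immutable (type, amount) pairs, matching the example above.
events = [
    ("AccountCreated", 0),
    ("Deposited", 1000),
    ("Withdrew", 500),
]

def replay(events):
    """Derive current balance by folding over the event stream."""
    balance = 0
    for kind, amount in events:
        if kind == "Deposited":
            balance += amount
        elif kind == "Withdrew":
            balance -= amount
    return balance
```

Time-travel queries fall out for free: replaying a prefix of the stream gives the state as of any earlier point.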

CQRS (Command Query Responsibility Segregation)
Separate write model from read model:
  • Write model (commands): Optimized for validation and writes
  • Read model (queries): Optimized for query performance, often denormalized
  • Async synchronization between models
When to Use
Event Sourcing:
  • Need full audit history (compliance, debugging)
  • Time-travel queries ("what was balance on March 1?")
  • Event-driven architectures
  • Complex domain logic that's naturally event-based
CQRS:
  • Very different read vs write patterns
  • Read scalability needs (many readers, few writers)
  • Complex queries on denormalized data
Twilio Application
Message lifecycle (Event Sourcing):
Events: MessageReceived → MessageValidated → MessageQueued → MessageSentToCarrier → MessageDelivered
Store all events in Kafka. Current message status = latest event.

Account dashboard (CQRS):
Write model: Validate & store messages in normalized tables
Read model: Denormalized view for dashboard: "messages sent today by account" - updated by consuming Kafka events
Trade-offs
Pros:
  • Complete audit trail
  • Can rebuild state at any point in time
  • Easy to add new read models retroactively
  • Natural fit for event-driven systems
Cons:
  • Complexity - not familiar to most developers
  • Eventual consistency in read models
  • Event schema evolution is hard
  • Storage cost (every event is kept)
2-Minute Interview Answer

"How would you design Twilio's message status tracking system?"

"I'd use event sourcing with CQRS to optimize for both write throughput and query performance.

Write path (Event Sourcing):
Every message state change is an event written to Kafka:
- MessageCreated(sid, to, from, body)
- MessageQueued(sid, timestamp)
- MessageSentToCarrier(sid, carrier_id, timestamp)
- MessageDelivered(sid, timestamp)

Why Kafka? We're writing millions of events per second. Kafka is built for this - append-only, partitioned by message SID for ordering.

Read path (CQRS):
Multiple read models optimized for different queries:

1. Status Check API: DynamoDB with message SID as key, stores latest status. Updated by Kafka consumer.
2. Analytics Dashboard: Aggregate counts in DynamoDB: messages sent today per account. Updated in real-time from Kafka.
3. Message History: Elasticsearch for full-text search & filtering ('show me all failed messages to +1-555-...'). Bulk indexed from Kafka.

Benefits:
- Append-only writes = max throughput
- Read models optimized for specific queries
- Can add new read models without touching write path
- Full audit trail for compliance

Trade-off: Read models are eventually consistent (lag 100ms - 1s), but that's acceptable for status checks."

Strangler Fig Pattern (Migration)

One-liner
Incrementally migrate from old system to new by routing traffic to new system piece by piece, eventually "strangling" the old system.
How It Works
  1. Add routing layer: Proxy/gateway that can route to old or new system
  2. Implement one feature in new system: While old system still handles everything else
  3. Route portion of traffic to new system: Small percentage, specific accounts, specific features
  4. Gradually increase traffic to new system: Monitor, validate, increase percentage
  5. Repeat for each feature: Until old system handles 0% traffic
  6. Decommission old system: It's fully replaced
STRANGLER FIG MIGRATION

Phase 1: All traffic to old system
┌─────┐     ┌──────────┐
│Users│────>│ Old API  │
└─────┘     └──────────┘

Phase 2: Route reads to new, writes to old
┌─────┐     ┌─────────┐     ┌──────────┐
│Users│────>│ Router  │────>│ Old API  │ (writes)
└─────┘     └────┬────┘     └──────────┘
                 │          ┌──────────┐
                 └─────────>│ New API  │ (reads)
                            └──────────┘

Phase 3: All traffic to new
┌─────┐     ┌──────────┐
│Users│────>│ New API  │
└─────┘     └──────────┘
            ┌──────────┐
            │ Old API  │ (decommission)
            └──────────┘
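The routing layer in step 1 can be sketched as a lookup plus a deterministic percentage rollout. The cell map and account ids are hypothetical; a real router would read them from a config store, and the stable hash guarantees an account always routes the same way across retries and restarts.

```python
import zlib

# Hypothetical routing config: explicitly migrated accounts plus a
# percentage rollout for everyone else.
CELL_MAP = {"AC_ACME": "cell-acme"}
MIGRATED_PERCENT = 10   # % of unmapped accounts sent to the new system

def route(account_id):
    """Return which backend should serve this account."""
    if account_id in CELL_MAP:              # explicitly migrated accounts
        return CELL_MAP[account_id]
    # Deterministic bucket in [0, 100): crc32 is stable across processes,
    # so the same account always lands in the same bucket.
    bucket = zlib.crc32(account_id.encode()) % 100
    return "cell-shared-1" if bucket < MIGRATED_PERCENT else "legacy"
```

Raising MIGRATED_PERCENT from 1 to 5 to 25 to 100 is the "gradually increase traffic" step; rollback is just lowering the number.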
When to Use
  • Migrating monolith to microservices
  • Replatforming (e.g., datacenter to cloud)
  • Technology stack replacement
  • Any migration too large for big-bang cutover
Twilio Application
Migrating messaging pipeline to cell-based architecture:
  1. Add routing layer at API gateway that can route to legacy or new cells
  2. Build Cell-1 with full messaging stack
  3. Route 1% of traffic (new accounts) to Cell-1
  4. Monitor error rates, latency, delivery success
  5. Increase to 5%, 10%, 25%, 50%, 100%
  6. Build Cell-2, Cell-3 as needed
  7. Migrate existing accounts gradually
  8. Eventually decommission legacy system
Trade-offs
Pros:
  • Incremental, low-risk migration
  • Can validate each step before proceeding
  • Easy rollback (route back to old system)
  • No "big bang" cutover
Cons:
  • Maintain both systems during migration (cost, complexity)
  • Need routing/proxy layer (added latency)
  • Data synchronization challenges
  • Migration can take months/years
2-Minute Interview Answer

"How would you migrate Twilio's monolithic messaging system to cell-based architecture?"

"I'd use strangler fig pattern to incrementally migrate over 6-12 months.

Phase 1: Routing infrastructure (Month 1)
Build routing layer at API gateway. Route based on account_id to either 'legacy' or cell_id. Initially, all routes to legacy.

Phase 2: Build first cell (Month 2-3)
Build Cell-1 with full stack: API tier, workers, queues, DB. Deploy in isolated VPC.

Phase 3: Route new accounts (Month 3)
All new account sign-ups route to Cell-1. Existing accounts stay on legacy. This limits blast radius - new accounts don't have production expectations yet.

Phase 4: Gradual rollout (Month 4-6)
Migrate existing accounts in batches:
- Start with internal test accounts
- Then free tier accounts (lower risk)
- Then pro tier
- Finally enterprise (with explicit communication)

Phase 5: Horizontal scaling (Month 6-9)
Build Cell-2, Cell-3 as Cell-1 reaches capacity. Distribute existing accounts across cells.

Phase 6: Decommission legacy (Month 10-12)
Once legacy handles 0% traffic for 30 days, decommission.

Critical success factors:
- Feature parity before migration
- Automated testing in both systems
- Metrics to compare legacy vs cells
- Fast rollback mechanism
- Clear communication to customers

This is exactly how I'd approach it at PayPal."