Architectural Patterns for Reliability

Resilience patterns are critical for distributed systems like Twilio's. The focus here is on preventing cascading failures and maintaining availability.

Circuit Breaker

One-liner
Prevent cascading failures by stopping calls to a failing dependency. Like an electrical circuit breaker, it "trips" when failures exceed a threshold.
How It Works
Three states:
  1. Closed (normal): Requests pass through. Track failures.
  2. Open (tripped): Reject requests immediately, return fallback. Don't call dependency.
  3. Half-Open (testing): Allow limited requests through to test if dependency recovered.

State transitions:
  • Closed → Open: When failure rate exceeds threshold (e.g., 50% failures in 10 seconds)
  • Open → Half-Open: After timeout (e.g., 30 seconds)
  • Half-Open → Closed: If test requests succeed
  • Half-Open → Open: If test requests fail
┌──────────────────────────────────────────────────┐
│                 CIRCUIT BREAKER                  │
│                                                  │
│   CLOSED ──[failures > threshold]──> OPEN        │
│     ▲                                  │         │
│     │                        [timeout elapsed]   │
│     │                                  │         │
│     │                                  ▼         │
│     └──[test succeeds]── HALF-OPEN ◄───┘         │
│                             │                    │
│                       [test fails]               │
│                             │                    │
│                             ▼                    │
│                           OPEN                   │
└──────────────────────────────────────────────────┘
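The three states and their transitions can be sketched as a small class. This is a minimal illustration, not a production implementation: it tracks consecutive failures in-process and uses a single test request for the Half-Open probe (real libraries such as resilience4j or Polly add sliding windows, metrics, and thread safety).

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after a failure threshold,
    half-opens after a timeout, closes again on a successful probe."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, fallback=None, **kwargs):
        if self.state == "open":
            # After the timeout, let one test request through (Half-Open).
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"
            else:
                return fallback  # fail fast; don't touch the dependency
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            return fallback
        # Success: close the circuit and reset the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

A caller wraps each dependency call, e.g. `breaker.call(send_to_carrier, msg, fallback=None)`, and checks for the fallback value instead of waiting out a timeout.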
When to Use
  • Calling external APIs that might fail or be slow
  • Dependencies that could timeout
  • Services under your control that might overload
  • Preventing retry storms
Trade-offs
Pros:
  • Prevents cascading failures and retry storms
  • Fails fast rather than waiting for timeouts
  • Gives failing service time to recover
  • Provides fallback experience
Cons:
  • Adds complexity to client code
  • Need to tune thresholds (false positives vs late detection)
  • Requires fallback strategy
Twilio Application
SMS carrier integration: Each carrier gets a circuit breaker. If Carrier A is timing out, open the circuit and fail fast. Route traffic to Carrier B or return error to customer faster.

Webhook delivery: Circuit breaker per customer webhook URL. If their endpoint is down, open circuit and stop hammering it. Queue webhooks for later retry.

Internal service calls: Identity service, billing service, analytics - all get circuit breakers. If billing is slow, don't wait for timeout on every request.
2-Minute Interview Answer

"How would you prevent a slow carrier API from impacting Twilio's message delivery?"

"I'd implement circuit breakers per carrier with aggressive timeouts.

Setup: Each carrier integration has a circuit breaker tracking success/failure rate over a sliding window - say, 100 requests or 10 seconds.

Failure threshold: If more than 50% of requests fail or timeout in that window, trip the circuit to Open. Don't send any more requests to that carrier for 30 seconds.

Fast failure: While Open, immediately reject new messages for that carrier. Either return an error to the customer or route to a backup carrier if available.

Recovery testing: After 30 seconds, enter Half-Open. Send a few test messages. If they succeed, close the circuit and resume normal traffic. If they fail, stay Open for another interval.

Why this matters: Without circuit breakers, you'd have thousands of threads blocked waiting for timeouts to that slow carrier. With circuit breakers, you fail fast and preserve capacity to deliver messages through healthy carriers.

I'd also add metrics and alerting on circuit breaker state transitions so we know when carriers are having issues."

Bulkhead / Fault Isolation

One-liner
Isolate failures by partitioning resources. Like bulkheads in a ship prevent one leak from sinking the whole vessel.
How It Works
Resource isolation strategies:
  • Thread pools: Separate thread pool per dependency (don't let one slow service starve all threads)
  • Connection pools: Separate DB connection pool per tenant or use case
  • Infrastructure: Separate compute clusters, VPCs, regions (cells)
  • Rate limits: Per-tenant quotas prevent noisy neighbors
┌─────────────────────────────────────────────────┐
│                BULKHEAD PATTERN                 │
│                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │ Tenant A │  │ Tenant B │  │ Tenant C │       │
│  │ Traffic  │  │ Traffic  │  │ Traffic  │       │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘       │
│       │             │             │             │
│  ┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐       │
│  │  Cell 1  │  │  Cell 2  │  │  Cell 3  │       │
│  │ (isolated│  │(isolated)│  │(isolated)│       │
│  │  infra)  │  │          │  │          │       │
│  └──────────┘  └──────────┘  └──────────┘       │
│                                                 │
│  If Cell 1 fails, Cells 2 & 3 unaffected        │
└─────────────────────────────────────────────────┘
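Thread-pool bulkheads, the first strategy above, can be sketched with one bounded executor per dependency. The dependency names and pool sizes here are illustrative, not tuned values; the point is that a hung dependency can only exhaust its own pool.

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency: a slow carrier can block
# at most its own max_workers threads, never the whole worker fleet.
pools = {
    "carrier_a": ThreadPoolExecutor(max_workers=10, thread_name_prefix="carrier_a"),
    "carrier_b": ThreadPoolExecutor(max_workers=10, thread_name_prefix="carrier_b"),
    "billing":   ThreadPoolExecutor(max_workers=5,  thread_name_prefix="billing"),
}

def submit(dependency, fn, *args):
    """Route work to the bulkheaded pool for that dependency."""
    return pools[dependency].submit(fn, *args)

# If carrier_a hangs, only its 10 threads block; carrier_b keeps flowing.
future = submit("carrier_b", lambda msg: f"sent:{msg}", "hello")
```

The same shape applies to connection pools: give each tenant or use case its own bounded pool rather than one shared pool that a single noisy consumer can drain.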
When to Use
  • Multi-tenant systems (prevent one tenant from affecting others)
  • Multiple external dependencies (isolate their failures)
  • Mixed workload priorities (batch vs real-time)
  • Scaling beyond single datacenter/region
Twilio Application (This is YOUR expertise!)
Cell-based architecture: Enterprise customers get dedicated cells. Startups share cells. One cell failure doesn't cascade.

Thread pool isolation: Worker threads for carrier A, B, C are separate. If carrier A is slow, it only blocks threads in its pool.

Database connection pools: Control plane queries get separate pool from data plane. Analytics queries don't starve production traffic.

Rate limiting per tenant: Each customer has quota. One customer's burst doesn't consume all capacity.
Trade-offs
Pros:
  • Limits blast radius of failures
  • Prevents noisy neighbor problems
  • Enables independent scaling
  • Better capacity management
Cons:
  • Resource inefficiency (reserved capacity may be idle)
  • Operational complexity (more infrastructure to manage)
  • Requires careful capacity planning
2-Minute Interview Answer

"How would you prevent one large customer from impacting Twilio's other customers?"

"This is exactly what cell-based architecture solves. I'd design multiple fault-isolated cells where enterprise customers get dedicated cells and smaller customers share cells.

Cell structure: Each cell is a full stack - API gateways, workers, databases, message queues - deployed in its own VPC with its own capacity.

Tenant → cell mapping: Routing layer maps account_id to cell_id. Enterprise customer 'ACME Corp' routes to Cell-ACME. Thousands of small customers route to Cell-Shared-1, Cell-Shared-2, etc.

Blast radius: If ACME sends a burst of 1 million messages and overwhelms their cell, it only affects them. Cell-Shared-1 is completely isolated.

Within shared cells, further isolation: Rate limiting per account. Thread pool bulkheads for different operations. Connection pool limits.

Resource efficiency: Shared cells run at higher utilization. Dedicated cells have reserved capacity but may be underutilized - that's the price of isolation.

This is the same philosophy I've implemented at PayPal - fault isolation is more important than resource efficiency for reliability."

Retry with Exponential Backoff

One-liner
Retry failed requests with increasing delays between attempts. Add jitter to prevent thundering herd.
How It Works
Basic exponential backoff:
  • Attempt 1: Immediate
  • Attempt 2: Wait 1 second
  • Attempt 3: Wait 2 seconds
  • Attempt 4: Wait 4 seconds
  • Attempt 5: Wait 8 seconds (or give up)

With jitter (randomization):
  • Instead of exactly 4 seconds, wait random(2, 4) seconds
  • Prevents all clients retrying at the same time (thundering herd)

Formula: wait = min(max_delay, base_delay * 2^attempt * random(0.5, 1.5)) — the cap is applied after jitter so the wait never exceeds max_delay.
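The formula translates directly into a small helper. This is a sketch with assumed defaults (1-second base, 60-second cap); the jitter factor of random(0.5, 1.5) matches the formula above, and the cap is applied after jitter so the result never exceeds max_delay.

```python
import random

def backoff_delay(attempt, base_delay=1.0, max_delay=60.0):
    """Exponential backoff with jitter.
    attempt is 0-based: attempt 0 -> ~base_delay, attempt 1 -> ~2x, etc."""
    jittered = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
    return min(max_delay, jittered)
```

A retry loop would call this between attempts: `time.sleep(backoff_delay(attempt))`.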
When to Use
  • Network errors (transient failures)
  • Rate limiting (429 responses)
  • Service temporarily overloaded (503 responses)
  • Distributed systems where eventual success is expected
When NOT to Retry
  • 4xx errors (except 429): Bad request won't succeed on retry
  • Non-idempotent operations: Unless you have idempotency keys
  • User-facing synchronous requests: Don't make users wait through multiple retries
  • When circuit breaker is open: Respect the circuit breaker
Twilio Application
Carrier API calls: Retry with exponential backoff up to 3 attempts. If carrier returns 503, they're overloaded - back off.

Webhook delivery: Customer webhook fails? Retry with backoff up to 24 hours. Common pattern: 1min, 5min, 30min, 1hr, 6hr, 24hr.

Internal queue processing: Worker fails to process message? Requeue with exponential backoff, max 5 attempts, then dead letter queue.
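The internal-queue pattern above (requeue with backoff, then dead-letter after max attempts) can be sketched as a worker's failure path. The queue interfaces here are plain lists standing in for a real broker; `not_before` represents a broker-specific delayed-delivery mechanism.

```python
import random

MAX_ATTEMPTS = 5

def handle_failure(message, queue, dead_letter_queue):
    """Requeue a failed message with exponential backoff;
    dead-letter it once MAX_ATTEMPTS is reached."""
    message["attempts"] = message.get("attempts", 0) + 1
    if message["attempts"] >= MAX_ATTEMPTS:
        dead_letter_queue.append(message)   # park for inspection / manual replay
    else:
        # Delay doubles per attempt, capped at 5 minutes, with ~10% jitter.
        delay = min(300, 2 ** message["attempts"]) * random.uniform(0.9, 1.1)
        message["not_before"] = delay       # broker-specific delayed requeue
        queue.append(message)
```

The same shape drives webhook retries, just with the longer schedule (1min, 5min, 30min, ...) in place of the doubling delay.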
Trade-offs
Pros:
  • Handles transient failures automatically
  • Reduces load on struggling service (backs off)
  • Jitter prevents thundering herd
Cons:
  • Increases latency (waiting for retries)
  • Can hide underlying problems (masking failures)
  • May exhaust retries on permanent failures
2-Minute Interview Answer

"How would you handle webhook delivery failures?"

"I'd implement an asynchronous retry system with exponential backoff and eventual dead-lettering.

Immediate delivery attempt: When event occurs (message delivered), synchronously attempt webhook POST. If it succeeds in < 5 seconds, done.

First failure: If it fails (timeout, 5xx, network error), enqueue for retry. Don't block the main message pipeline waiting for retries.

Retry schedule with exponential backoff:
- Retry 1: 1 minute later
- Retry 2: 5 minutes after retry 1
- Retry 3: 30 minutes after retry 2
- Retry 4: 1 hour after retry 3
- Retry 5: 6 hours after retry 4
- Final retry: 24 hours after retry 5

Jitter: Add random 10-20% jitter to each delay to prevent thundering herd if many webhooks fail simultaneously.

Circuit breaker: If a customer's webhook URL fails repeatedly, open circuit breaker for that URL. Stop trying for a longer period.

Dead letter queue: After final retry fails, move to DLQ. Customer can see failed webhooks in console and manually replay.

Don't retry on 4xx (except 429): If customer returns 400, their endpoint doesn't want this webhook - don't retry."

Rate Limiting / Throttling

One-liner
Control the rate of requests to prevent overload. Essential for multi-tenant systems and protecting downstream dependencies.
Common Rate Limiting Algorithms
  • Token Bucket: bucket fills with tokens at a fixed rate; each request consumes a token. Pros: allows bursts, smooth rate limiting. Cons: can burst above the average rate.
  • Leaky Bucket: requests enter a bucket and are processed at a fixed rate. Pros: enforces a strict rate, smooths traffic. Cons: no bursting allowed.
  • Fixed Window: count requests per time window (e.g., per minute). Pros: simple, low memory. Cons: bursts at window edges (up to 2x the rate possible).
  • Sliding Window: weighted count based on the current position in the window. Pros: more accurate than fixed window. Cons: more complex, more memory.
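The token bucket, the most common of these, fits in a few lines. This in-process sketch illustrates the algorithm only; a distributed limiter would keep the token count in shared state such as Redis.

```python
import time

class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at
    `rate` tokens/sec. Each allowed request consumes one token."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bucket with `rate=10, capacity=20` allows a burst of 20 requests, then settles to roughly 10 requests/sec — the "allows bursts" property from the table.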
Where to Apply Rate Limiting
  • API gateway: Per account, per API key
  • Protecting dependencies: Limit calls to external APIs
  • Resource protection: DB connections, CPU, memory
  • Abuse prevention: Prevent DoS, bot traffic
  • Fair sharing: Multi-tenant resource allocation
Twilio Application
Per-account API rate limits: Each account has message send quota (e.g., 1000 msgs/sec). Enforced at API gateway.

Carrier protection: Rate limit outbound calls to each carrier to respect their limits.

Webhook delivery: Rate limit webhook POSTs per customer URL (don't overwhelm their servers).

Tiered limits: Free tier: 100 msgs/day. Pro tier: 10k msgs/day. Enterprise: custom limits.
Trade-offs
Where to implement:
  • Client-side (SDK): Cheap, but can't trust it (users can bypass)
  • API gateway: Good for per-account limits, but single point of failure
  • Distributed (Redis): Scales well, but adds dependency and latency
  • Local (in-process): Fast, but not accurate in distributed system
2-Minute Interview Answer

"How would you implement rate limiting for Twilio's API?"

"I'd implement tiered rate limiting at the API gateway level with distributed state in Redis.

Architecture:
- API gateway receives request, extracts account_id from API key
- Check rate limit in Redis: increment counter for this account in current time window
- If under limit, allow request. If over limit, return 429 Too Many Requests.

Algorithm: Sliding window counter in Redis
Use sorted set with timestamp as score. For each request:
1. Remove entries older than window (1 minute)
2. Count remaining entries
3. If count < limit, add new entry
4. Else reject with 429

Multiple tiers:
- Free: 100 msgs/minute
- Pro: 10,000 msgs/minute
- Enterprise: 100,000 msgs/minute or custom
Look up tier from account metadata.

Graceful degradation: If Redis is down, fail open with local in-memory rate limiting (less accurate but keeps system available).

Response headers: Include X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset in every response so clients can self-regulate.

This balances accuracy, performance, and operational simplicity."

Saga Pattern (Distributed Transactions)

One-liner
Coordinate multi-step transactions across services without distributed locks. Use compensating transactions to undo on failure.
How It Works
Two approaches:

1. Choreography (event-driven):
  • Each service listens for events and triggers next step
  • No central coordinator
  • Services publish events, others subscribe

2. Orchestration (coordinator):
  • Central orchestrator service tells each service what to do
  • Orchestrator tracks state and handles failures
  • Easier to understand and debug

Compensating transactions:
If step 4 fails, execute compensating transactions for steps 3, 2, 1 in reverse order to undo the work.
SAGA PATTERN - Orchestration

┌────────────┐
│Orchestrator│
└──────┬─────┘
       │
       ├─ 1. Reserve Inventory ──> ✓
       │
       ├─ 2. Charge Payment ──────> ✓
       │
       ├─ 3. Ship Order ──────────> ✗ FAILED!
       │
       ├─ Compensate: Refund ─────> ✓
       │
       └─ Compensate: Unreserve ──> ✓

Result: Transaction rolled back
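The orchestration core is a loop over (action, compensate) pairs, undoing completed steps in reverse on failure. This sketch uses an in-memory log and a hypothetical order-placement saga; a real orchestrator would persist saga state so it can resume after a crash.

```python
def run_saga(steps):
    """Each step is (action, compensate). On any failure, run the
    compensations for completed steps in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                undo()          # compensating transaction
            return False        # saga rolled back
    return True                 # saga committed

# Hypothetical order-placement saga: step 3 fails, as in the diagram.
log = []

def ship_order():
    raise RuntimeError("carrier down")

steps = [
    (lambda: log.append("reserve_inventory"), lambda: log.append("unreserve")),
    (lambda: log.append("charge_payment"),    lambda: log.append("refund")),
    (ship_order,                              lambda: log.append("unship")),
]
```

Running `run_saga(steps)` returns False and leaves the log showing the forward steps followed by the compensations in reverse order: refund, then unreserve.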
When to Use
  • Multi-service transactions (e.g., order placement across inventory, payment, shipping)
  • Long-running workflows
  • When 2-phase commit is too expensive
  • When services are independently deployed and scaled
Twilio Application
Message send workflow (orchestration saga):
  1. Validate account & API key (auth service)
  2. Check rate limit & decrement quota (billing service)
  3. Deduct account balance (billing service)
  4. Queue message for delivery (messaging service)
  5. Return SID to customer
If step 4 fails: Compensate by refunding balance, restoring quota.

Account provisioning (choreography saga):
Account created → Billing service provisions account → Identity service creates credentials → Welcome email sent
Trade-offs
  • Choreography — Pros: decentralized, no single point of failure. Cons: hard to track the workflow; circular dependencies possible.
  • Orchestration — Pros: clear workflow, easy to debug, centralized state. Cons: orchestrator is a SPOF and can become complex.
2-Minute Interview Answer

"How would you handle a message send that requires checking quota, charging balance, and queuing delivery?"

"I'd use an orchestrator-based saga pattern to coordinate the multi-service transaction.

Orchestrator flow:
1. Customer POSTs to /Messages API
2. API gateway calls orchestrator service
3. Orchestrator executes saga:
a. Call auth service: validate API key → get account_id
b. Call billing service: check if account has quota and balance
c. Call billing service: reserve balance (soft charge)
d. Call messaging service: queue message with unique SID
e. Call billing service: confirm charge (hard commit)
4. Return SID to customer

If step 3d fails (message queue is down):
Orchestrator executes compensating transaction:
- Call billing service: release reserved balance
- Return 503 to customer: 'Please retry'

Why orchestration over choreography: This is a synchronous user-facing API. We need to respond within 500ms. Orchestration gives us:
- Clear timeout boundaries
- Easy retry logic
- Centralized monitoring of success rate

Idempotency: Accept idempotency key from customer. If they retry the same POST, return same SID without executing saga again.

This approach gives ACID-like semantics across microservices without distributed transactions."

Event Sourcing & CQRS

One-liner
Event Sourcing: store every change as an immutable event and derive current state by replaying them. CQRS: separate read and write models so each can be optimized independently.
Event Sourcing
Instead of storing current state:
  • Store every event that changed the state
  • Current state = replay all events from beginning
  • Events are immutable, append-only

Example - Bank Account:
Don't store: balance = $500
Do store: [AccountCreated($0), Deposited($1000), Withdrew($500)]
Replay events → balance = $500
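The bank-account example above amounts to a fold over the event list: state is never stored, only derived. A minimal sketch:

```python
# Events are immutable (type, amount) pairs, matching the example above.
events = [
    ("AccountCreated", 0),
    ("Deposited", 1000),
    ("Withdrew", 500),
]

def replay(events):
    """Derive current balance by folding over the event stream."""
    balance = 0
    for kind, amount in events:
        if kind == "Deposited":
            balance += amount
        elif kind == "Withdrew":
            balance -= amount
    return balance
```

Time-travel queries fall out for free: replaying a prefix of the stream gives the state as of any earlier point.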

CQRS (Command Query Responsibility Segregation)
Separate write model from read model:
  • Write model (commands): Optimized for validation and writes
  • Read model (queries): Optimized for query performance, often denormalized
  • Async synchronization between models
When to Use
Event Sourcing:
  • Need full audit history (compliance, debugging)
  • Time-travel queries ("what was balance on March 1?")
  • Event-driven architectures
  • Complex domain logic that's naturally event-based
CQRS:
  • Very different read vs write patterns
  • Read scalability needs (many readers, few writers)
  • Complex queries on denormalized data
Twilio Application
Message lifecycle (Event Sourcing):
Events: MessageReceived → MessageValidated → MessageQueued → MessageSentToCarrier → MessageDelivered
Store all events in Kafka. Current message status = latest event.

Account dashboard (CQRS):
Write model: Validate & store messages in normalized tables
Read model: Denormalized view for dashboard: "messages sent today by account" - updated by consuming Kafka events
Trade-offs
Pros:
  • Complete audit trail
  • Can rebuild state at any point in time
  • Easy to add new read models retroactively
  • Natural fit for event-driven systems
Cons:
  • Complexity - not familiar to most developers
  • Eventual consistency in read models
  • Event schema evolution is hard
  • Storage cost (every event is kept)
2-Minute Interview Answer

"How would you design Twilio's message status tracking system?"

"I'd use event sourcing with CQRS to optimize for both write throughput and query performance.

Write path (Event Sourcing):
Every message state change is an event written to Kafka:
- MessageCreated(sid, to, from, body)
- MessageQueued(sid, timestamp)
- MessageSentToCarrier(sid, carrier_id, timestamp)
- MessageDelivered(sid, timestamp)

Why Kafka? We're writing millions of events per second. Kafka is built for this - append-only, partitioned by message SID for ordering.

Read path (CQRS):
Multiple read models optimized for different queries:

1. Status Check API: DynamoDB with message SID as key, stores latest status. Updated by Kafka consumer.
2. Analytics Dashboard: Aggregate counts in DynamoDB: messages sent today per account. Updated in real-time from Kafka.
3. Message History: Elasticsearch for full-text search & filtering ('show me all failed messages to +1-555-...'). Bulk indexed from Kafka.

Benefits:
- Append-only writes = max throughput
- Read models optimized for specific queries
- Can add new read models without touching write path
- Full audit trail for compliance

Trade-off: Read models are eventually consistent (lag 100ms - 1s), but that's acceptable for status checks."

Strangler Fig Pattern (Migration)

One-liner
Incrementally migrate from old system to new by routing traffic to new system piece by piece, eventually "strangling" the old system.
How It Works
  1. Add routing layer: Proxy/gateway that can route to old or new system
  2. Implement one feature in new system: While old system still handles everything else
  3. Route portion of traffic to new system: Small percentage, specific accounts, specific features
  4. Gradually increase traffic to new system: Monitor, validate, increase percentage
  5. Repeat for each feature: Until old system handles 0% traffic
  6. Decommission old system: It's fully replaced
STRANGLER FIG MIGRATION

Phase 1: All traffic to old system
┌─────┐     ┌──────────┐
│Users│────>│ Old API  │
└─────┘     └──────────┘

Phase 2: Route reads to new, writes to old
┌─────┐     ┌─────────┐     ┌──────────┐
│Users│────>│ Router  │────>│ Old API  │ (writes)
└─────┘     └────┬────┘     └──────────┘
                 │          ┌──────────┐
                 └─────────>│ New API  │ (reads)
                            └──────────┘

Phase 3: All traffic to new
┌─────┐     ┌──────────┐
│Users│────>│ New API  │
└─────┘     └──────────┘
            ┌──────────┐
            │ Old API  │ (decommission)
            └──────────┘
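The routing layer in step 1 can be sketched as a lookup plus a deterministic percentage rollout. The cell map and account ids are hypothetical; a real router would read them from a config store, and the stable hash guarantees an account always routes the same way across retries and restarts.

```python
import zlib

# Hypothetical routing config: explicitly migrated accounts plus a
# percentage rollout for everyone else.
CELL_MAP = {"AC_ACME": "cell-acme"}
MIGRATED_PERCENT = 10   # % of unmapped accounts sent to the new system

def route(account_id):
    """Return which backend should serve this account."""
    if account_id in CELL_MAP:              # explicitly migrated accounts
        return CELL_MAP[account_id]
    # Deterministic bucket in [0, 100): crc32 is stable across processes,
    # so the same account always lands in the same bucket.
    bucket = zlib.crc32(account_id.encode()) % 100
    return "cell-shared-1" if bucket < MIGRATED_PERCENT else "legacy"
```

Raising MIGRATED_PERCENT from 1 to 5 to 25 to 100 is the "gradually increase traffic" step; rollback is just lowering the number.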
When to Use
  • Migrating monolith to microservices
  • Replatforming (e.g., datacenter to cloud)
  • Technology stack replacement
  • Any migration too large for big-bang cutover
Twilio Application
Migrating messaging pipeline to cell-based architecture:
  1. Add routing layer at API gateway that can route to legacy or new cells
  2. Build Cell-1 with full messaging stack
  3. Route 1% of traffic (new accounts) to Cell-1
  4. Monitor error rates, latency, delivery success
  5. Increase to 5%, 10%, 25%, 50%, 100%
  6. Build Cell-2, Cell-3 as needed
  7. Migrate existing accounts gradually
  8. Eventually decommission legacy system
Trade-offs
Pros:
  • Incremental, low-risk migration
  • Can validate each step before proceeding
  • Easy rollback (route back to old system)
  • No "big bang" cutover
Cons:
  • Maintain both systems during migration (cost, complexity)
  • Need routing/proxy layer (added latency)
  • Data synchronization challenges
  • Migration can take months/years
2-Minute Interview Answer

"How would you migrate Twilio's monolithic messaging system to cell-based architecture?"

"I'd use strangler fig pattern to incrementally migrate over 6-12 months.

Phase 1: Routing infrastructure (Month 1)
Build routing layer at API gateway. Route based on account_id to either 'legacy' or cell_id. Initially, all routes to legacy.

Phase 2: Build first cell (Month 2-3)
Build Cell-1 with full stack: API tier, workers, queues, DB. Deploy in isolated VPC.

Phase 3: Route new accounts (Month 3)
All new account sign-ups route to Cell-1. Existing accounts stay on legacy. This limits blast radius - new accounts don't have production expectations yet.

Phase 4: Gradual rollout (Month 4-6)
Migrate existing accounts in batches:
- Start with internal test accounts
- Then free tier accounts (lower risk)
- Then pro tier
- Finally enterprise (with explicit communication)

Phase 5: Horizontal scaling (Month 6-9)
Build Cell-2, Cell-3 as Cell-1 reaches capacity. Distribute existing accounts across cells.

Phase 6: Decommission legacy (Month 10-12)
Once legacy handles 0% traffic for 30 days, decommission.

Critical success factors:
- Feature parity before migration
- Automated testing in both systems
- Metrics to compare legacy vs cells
- Fast rollback mechanism
- Clear communication to customers

This is exactly how I'd approach it at PayPal."