Resilience patterns are critical for distributed systems like Twilio's. This section focuses on preventing cascading failures and maintaining availability.
"How would you prevent a slow carrier API from impacting Twilio's message delivery?"
"I'd implement circuit breakers per carrier with aggressive timeouts.
Setup: Each carrier integration has a circuit breaker tracking success/failure rate over a sliding window - say, 100 requests or 10 seconds.
Failure threshold: If more than 50% of requests fail or time out in that window, trip the circuit to Open. Don't send any more requests to that carrier for 30 seconds.
Fast failure: While Open, immediately reject new messages for that carrier. Either return an error to the customer or route to a backup carrier if available.
Recovery testing: After 30 seconds, enter Half-Open. Send a few test messages. If they succeed, close the circuit and resume normal traffic. If they fail, stay Open for another interval.
Why this matters: Without circuit breakers, you'd have thousands of threads blocked waiting for timeouts to that slow carrier. With circuit breakers, you fail fast and preserve capacity to deliver messages through healthy carriers.
I'd also add metrics and alerting on circuit breaker state transitions so we know when carriers are having issues."
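The three-state breaker described in this answer can be sketched roughly as follows. This is a minimal illustration, not a production implementation; the class name, window size, and thresholds are placeholders for whatever values a real carrier integration would tune.

```python
import time
from collections import deque

class CircuitBreaker:
    """Per-carrier breaker sketch: CLOSED (normal), OPEN (fail fast),
    HALF_OPEN (probing). Thresholds and windows are illustrative."""

    def __init__(self, window_size=100, failure_threshold=0.5, open_seconds=30):
        self.window = deque(maxlen=window_size)  # sliding window of True/False outcomes
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.state = "CLOSED"
        self.opened_at = None

    def allow_request(self):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.open_seconds:
                self.state = "HALF_OPEN"  # let a probe request through
                return True
            return False                  # fail fast while open
        return True

    def record(self, success):
        if self.state == "HALF_OPEN":
            if success:
                self.state = "CLOSED"     # probe succeeded: resume normal traffic
                self.window.clear()
            else:
                self._trip()              # probe failed: stay open another interval
            return
        self.window.append(success)
        failures = self.window.count(False)
        if len(self.window) >= 10 and failures / len(self.window) > self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = time.monotonic()
```

A caller wraps each carrier request with `allow_request()` and reports the outcome via `record()`; state transitions (CLOSED→OPEN, etc.) are the natural place to emit the metrics and alerts mentioned above.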
"How would you prevent one large customer from impacting Twilio's other customers?"
"This is exactly what cell-based architecture solves. I'd design multiple fault-isolated cells where enterprise customers get dedicated cells and smaller customers share cells.
Cell structure: Each cell is a full stack - API gateways, workers, databases, message queues - deployed in its own VPC with its own capacity.
Tenant → cell mapping: Routing layer maps account_id to cell_id. Enterprise customer 'ACME Corp' routes to Cell-ACME. Thousands of small customers route to Cell-Shared-1, Cell-Shared-2, etc.
Blast radius: If ACME sends a burst of 1 million messages and overwhelms their cell, it only affects them. Cell-Shared-1 is completely isolated.
Within shared cells, further isolation: Rate limiting per account. Thread pool bulkheads for different operations. Connection pool limits.
Resource efficiency: Shared cells run at higher utilization. Dedicated cells have reserved capacity but may be underutilized - that's the price of isolation.
This is the same philosophy I've implemented at PayPal - fault isolation is more important than resource efficiency for reliability."
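The tenant-to-cell mapping above can be sketched as a small routing function. The cell names and the dedicated-cell table here are hypothetical; the key property is that enterprise accounts get a fixed dedicated cell while everyone else hashes deterministically onto a shared cell.

```python
import hashlib

# Hypothetical routing table; names are illustrative.
DEDICATED_CELLS = {"ACME_CORP": "cell-acme"}
SHARED_CELLS = ["cell-shared-1", "cell-shared-2", "cell-shared-3"]

def cell_for_account(account_id):
    """Route enterprise tenants to their dedicated cell; hash all other
    accounts onto a shared cell so the mapping is stable across requests."""
    if account_id in DEDICATED_CELLS:
        return DEDICATED_CELLS[account_id]
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    return SHARED_CELLS[int(digest, 16) % len(SHARED_CELLS)]
```

Because the hash is deterministic, a given small account always lands on the same shared cell, which keeps its data and rate-limit state in one place.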
Exponential backoff with jitter, the retry formula used throughout these answers:

wait = min(max_delay, base_delay * 2^attempt) * random(0.5, 1.5)
"How would you handle webhook delivery failures?"
"I'd implement an asynchronous retry system with exponential backoff and eventual dead-lettering.
Immediate delivery attempt: When an event occurs (e.g., message delivered), synchronously attempt the webhook POST. If it succeeds in < 5 seconds, done.
First failure: If it fails (timeout, 5xx, network error), enqueue for retry. Don't block the main message pipeline waiting for retries.
Retry schedule with exponential backoff:
- Retry 1: 1 minute later
- Retry 2: 5 minutes after retry 1
- Retry 3: 30 minutes after retry 2
- Retry 4: 1 hour after retry 3
- Retry 5: 6 hours after retry 4
- Final retry: 24 hours after retry 5
Jitter: Add random 10-20% jitter to each delay to prevent thundering herd if many webhooks fail simultaneously.
Circuit breaker: If a customer's webhook URL fails repeatedly, open circuit breaker for that URL. Stop trying for a longer period.
Dead letter queue: After final retry fails, move to DLQ. Customer can see failed webhooks in console and manually replay.
Don't retry on 4xx (except 429): If customer returns 400, their endpoint doesn't want this webhook - don't retry."
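The retry schedule and 4xx policy above can be sketched as two small helpers. This is an illustrative sketch; function names are assumptions, and the jitter range follows the 10-20% figure mentioned above.

```python
import random

# Retry delays in seconds, matching the schedule above
# (1 min, 5 min, 30 min, 1 h, 6 h, 24 h between successive retries).
RETRY_DELAYS = [60, 5 * 60, 30 * 60, 3600, 6 * 3600, 24 * 3600]

def next_retry_delay(attempt):
    """Delay before retry number `attempt` (1-based), with 10-20% jitter
    added to avoid a thundering herd. Returns None once retries are
    exhausted and the webhook should move to the dead letter queue."""
    if attempt > len(RETRY_DELAYS):
        return None
    return RETRY_DELAYS[attempt - 1] * (1 + random.uniform(0.10, 0.20))

def should_retry(status_code):
    """Retry on 5xx and 429; any other 4xx means the endpoint rejected
    this specific webhook, so retrying won't help."""
    return status_code >= 500 or status_code == 429
```

A worker consuming the retry queue would call `next_retry_delay` after each failed attempt and dead-letter the webhook when it returns None.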
| Algorithm | How It Works | Pros | Cons |
|---|---|---|---|
| Token Bucket | Bucket fills with tokens at fixed rate. Each request consumes token. | Allows bursts, smooth rate limiting | Can burst above average rate |
| Leaky Bucket | Requests enter bucket, processed at fixed rate | Enforces strict rate, smooths traffic | No bursting allowed |
| Fixed Window | Count requests per time window (e.g., per minute) | Simple, low memory | Burst at window edges (2x rate possible) |
| Sliding Window | Weighted count based on current position in window | More accurate than fixed window | More complex, more memory |
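As a concrete reference for the first row of the table, here is a minimal token bucket sketch (parameter names are illustrative): tokens refill continuously at a fixed rate, and a burst up to `capacity` is allowed before requests are throttled.

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/second up to
    `capacity`. Bursts up to `capacity` pass; beyond that, requests
    are rejected until tokens refill."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```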
"How would you implement rate limiting for Twilio's API?"
"I'd implement tiered rate limiting at the API gateway level with distributed state in Redis.
Architecture:
- API gateway receives request, extracts account_id from API key
- Check the rate limit in Redis: record this request against the account's current time window
- If under limit, allow request. If over limit, return 429 Too Many Requests.
Algorithm: Sliding window counter in Redis
Use sorted set with timestamp as score. For each request:
1. Remove entries older than window (1 minute)
2. Count remaining entries
3. If count < limit, add new entry
4. Else reject with 429
Multiple tiers:
- Free: 100 msgs/minute
- Pro: 10,000 msgs/minute
- Enterprise: 100,000 msgs/minute or custom
Look up tier from account metadata.
Graceful degradation:
If Redis is down, fail open with local in-memory rate limiting (less accurate but keeps system available).
Response headers:
Include X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset in every response so clients can self-regulate.
This balances accuracy, performance, and operational simplicity."
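The four-step sliding window algorithm above can be sketched with an in-memory stand-in for the Redis sorted set. In Redis the same steps map to ZREMRANGEBYSCORE (evict old entries), ZCARD (count), and ZADD (record the request); the in-memory version below is illustrative only.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """In-memory stand-in for the Redis sorted-set limiter described
    above. One timestamp deque per account plays the role of the
    per-account sorted set."""

    def __init__(self, limit, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self.requests = defaultdict(deque)  # account_id -> request timestamps

    def allow(self, account_id, now):
        q = self.requests[account_id]
        while q and q[0] <= now - self.window:  # step 1: remove entries older than the window
            q.popleft()
        if len(q) < self.limit:                 # step 2-3: count, and admit if under limit
            q.append(now)
            return True
        return False                            # step 4: caller returns 429
```

The `limit` would come from the account's tier lookup (100, 10,000, or 100,000 per minute), and the remaining count `limit - len(q)` is what would populate the X-RateLimit-Remaining header.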
| Approach | Pros | Cons |
|---|---|---|
| Choreography | Decentralized, no single point of failure | Hard to track workflow, circular dependencies possible |
| Orchestration | Clear workflow, easy to debug, centralized state | Orchestrator is SPOF, can become complex |
"How would you handle a message send that requires checking quota, charging balance, and queuing delivery?"
"I'd use an orchestrator-based saga pattern to coordinate the multi-service transaction.
Orchestrator flow:
1. Customer POSTs to /Messages API
2. API gateway calls orchestrator service
3. Orchestrator executes saga:
a. Call auth service: validate API key → get account_id
b. Call billing service: check if account has quota and balance
c. Call billing service: reserve balance (soft charge)
d. Call messaging service: queue message with unique SID
e. Call billing service: confirm charge (hard commit)
4. Return SID to customer
If step 3d fails (message queue is down):
Orchestrator executes compensating transaction:
- Call billing service: release reserved balance
- Return 503 to customer: 'Please retry'
Why orchestration over choreography:
This is a synchronous user-facing API. We need to respond within 500ms. Orchestration gives us:
- Clear timeout boundaries
- Easy retry logic
- Centralized monitoring of success rate
Idempotency:
Accept idempotency key from customer. If they retry the same POST, return same SID without executing saga again.
This approach gives ACID-like semantics across microservices without distributed transactions."
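The saga above can be sketched as an orchestrator function. All service interfaces here are assumptions for illustration; real calls would be RPC/HTTP with timeouts, and the idempotency store would be durable rather than a dict.

```python
class SagaAborted(Exception):
    """Raised when the saga fails after compensation has run."""

def send_message_saga(auth, billing, messaging, api_key, message,
                      idempotency_key, seen):
    """Orchestrator sketch: validate, check quota, reserve balance,
    queue, confirm. If queuing fails, release the reservation
    (compensating transaction). `seen` is a stand-in idempotency store."""
    if idempotency_key in seen:                    # retry of a completed saga
        return seen[idempotency_key]               # return same SID, no re-execution
    account_id = auth.validate(api_key)            # step a
    billing.check_quota(account_id)                # step b
    reservation = billing.reserve(account_id, message.price)  # step c (soft charge)
    try:
        sid = messaging.queue(account_id, message)            # step d
    except Exception:
        billing.release(reservation)               # compensating transaction
        raise SagaAborted("queue unavailable; balance released, please retry")
    billing.confirm(reservation)                   # step e (hard commit)
    seen[idempotency_key] = sid
    return sid
```

The try/except around step d is the heart of the pattern: the soft charge is only hardened after the message is durably queued, and any failure in between rolls the reservation back.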
Event sourcing in one line: replaying the event log [AccountCreated($0), Deposited($1000), Withdrew($500)] yields the current state, balance = $500.
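The replay above can be expressed as a fold over the event log. Event names follow the example; the payload shape (tuples of kind and amount) is an illustrative assumption.

```python
def replay_balance(events):
    """Rebuild current balance by folding over the event log,
    as in the AccountCreated/Deposited/Withdrew example above."""
    balance = 0
    for kind, amount in events:
        if kind == "AccountCreated":
            balance = amount
        elif kind == "Deposited":
            balance += amount
        elif kind == "Withdrew":
            balance -= amount
    return balance
```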
"How would you design Twilio's message status tracking system?"
"I'd use event sourcing with CQRS to optimize for both write throughput and query performance.
Write path (Event Sourcing):
Every message state change is an event written to Kafka:
- MessageCreated(sid, to, from, body)
- MessageQueued(sid, timestamp)
- MessageSentToCarrier(sid, carrier_id, timestamp)
- MessageDelivered(sid, timestamp)
Why Kafka? We're writing millions of events per second. Kafka is built for this - append-only, partitioned by message SID for ordering.
Read path (CQRS):
Multiple read models optimized for different queries:
1. Status Check API: DynamoDB with message SID as key, stores latest status. Updated by Kafka consumer.
2. Analytics Dashboard: Aggregate counts in DynamoDB: messages sent today per account. Updated in real-time from Kafka.
3. Message History: ElasticSearch for full-text search & filtering ('show me all failed messages to +1-555-...'). Bulk indexed from Kafka.
Benefits:
- Append-only writes = max throughput
- Read models optimized for specific queries
- Can add new read models without touching write path
- Full audit trail for compliance
Trade-off:
Read models are eventually consistent (lag 100ms - 1s), but that's acceptable for status checks."
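The status-check read model above can be sketched as a projector that consumes ordered events for a SID and keeps only the latest status. The event-to-status mapping is an illustrative assumption; a real consumer would write to DynamoDB rather than a dict.

```python
# Event type -> user-visible status; names follow the write path above.
STATUS_OF_EVENT = {
    "MessageCreated": "created",
    "MessageQueued": "queued",
    "MessageSentToCarrier": "sent",
    "MessageDelivered": "delivered",
}

def project_status(read_model, event):
    """Apply one event to the status read model (sid -> latest status).
    Because Kafka partitions by SID, events for one SID arrive in order,
    so last-write-wins yields the correct current status."""
    read_model[event["sid"]] = STATUS_OF_EVENT[event["type"]]
```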
"How would you migrate Twilio's monolithic messaging system to cell-based architecture?"
"I'd use the strangler fig pattern to incrementally migrate over 6-12 months.
Phase 1: Routing infrastructure (Month 1)
Build a routing layer at the API gateway. Route based on account_id to either 'legacy' or a cell_id. Initially, all traffic routes to legacy.
Phase 2: Build first cell (Month 2-3)
Build Cell-1 with full stack: API tier, workers, queues, DB. Deploy in isolated VPC.
Phase 3: Route new accounts (Month 3)
All new account sign-ups route to Cell-1. Existing accounts stay on legacy. This limits blast radius - new accounts don't have production expectations yet.
Phase 4: Gradual rollout (Month 4-6)
Migrate existing accounts in batches:
- Start with internal test accounts
- Then free tier accounts (lower risk)
- Then pro tier
- Finally enterprise (with explicit communication)
Phase 5: Horizontal scaling (Month 6-9)
Build Cell-2, Cell-3 as Cell-1 reaches capacity. Distribute existing accounts across cells.
Phase 6: Decommission legacy (Month 10-12)
Once legacy handles 0% traffic for 30 days, decommission.
Critical success factors:
- Feature parity before migration
- Automated testing in both systems
- Metrics to compare legacy vs cells
- Fast rollback mechanism
- Clear communication to customers
This is exactly how I'd approach it at PayPal."