Algorithm Overview
🎯 Goals of Cell Routing
- Stable Assignment: Customers stay in the same cell (DynamoDB is source of truth)
- Optimal Load Distribution: Assign new customers to least-loaded cells using real metrics
- Fast Lookups: Sub-10ms lookup time via Redis cache (95% hit rate)
- Segment Awareness: Route Enterprise to Enterprise cells, SMB to SMB cells
- Geographic Affinity: Route customers to geographically closest cells
- Controlled Migration: Explicit rebalancing via Control Plane when needed
✅ Why This Approach is Better Than Consistent Hashing
Key Insight: If you're using DynamoDB as the source of truth and already doing lookups, consistent hashing adds complexity without benefit.
- Simpler: No hash ring to maintain, version, or synchronize across Cell Routers
- Better Load Distribution: Assign to least-loaded cell based on actual CPU/customer count, not hash distribution
- Same Performance: DynamoDB lookup is required either way (for existing customers)
- Controlled Migration: Control Plane explicitly chooses which customers to migrate, when, and where
- Easier to Debug: Direct database queries show exactly where each customer is assigned
Consistent hashing is valuable for stateless distributed caches or database sharding where the hash IS the source of truth. But when you have a database, use it!
Architecture Components
Data Model: DynamoDB (Workflow-Affinity Routing)
Cell Routing Algorithm
🎯 Core Principle: Workflow-Affinity Routing
Every customer × service_category → cell assignment is stored in DynamoDB Global Tables. Same customer can have different cells for different workflow categories (messaging, realtime, verify, async).
Why? Services with different scaling characteristics (bursty SMS vs sustained Voice) can scale independently while services that call each other (SMS + WhatsApp in fallback chains) stay co-located.
Complete Routing Flow
Step-by-Step Routing Decision
Step 1: Parse the request
- Parse `customer_id` from JWT token or API key
- Determine `region` from request origin (Route 53 geolocation)
- Extract `service_category` from API path:
  - `/v1/sms`, `/v1/mms`, `/v1/whatsapp` → messaging
  - `/v1/voice`, `/v1/video` → realtime
  - `/v1/verify`, `/v1/lookup` → verify
  - `/v1/email`, `/v1/fax` → async

Step 2: Check Redis cache
- Redis key: `cell:{customer_id}:{region}:{category}`
- If HIT: Return cached cell_id (~5ms latency)
- If MISS: Proceed to Step 3

Step 3: Query DynamoDB
- Query: `SELECT cell_id FROM customer_cell_mapping WHERE customer_id = ? AND region#category = ?`
- If found: Cache in Redis (TTL: 1 hour), return cell_id (~15ms latency)
- If not found (new customer or new category): Proceed to Step 4

Step 4: Assign a new cell
- Determine customer segment (from signup plan or API tier)
- Call `getLeastLoadedCell(region, segment, category)` (see Cell Assignment section)
- Write to DynamoDB with conditional expression: `attribute_not_exists(customer_id) AND attribute_not_exists(region#category)`
- Cache in Redis (TTL: 1 hour)
- Return cell_id (~20ms latency)

Step 5: Forward the request
- Set header: `X-Twilio-Cell-ID: enterprise-us-east-1-a`
- VPC Lattice reads the header and routes to the correct cell's service endpoint
- Request arrives at the cell's ALB → EKS cluster → Kubernetes pod
Routing Service Implementation (Lambda)
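The flow above could be sketched as the core of a Lambda handler. This is a hedged illustration, not Twilio's actual implementation: a redis-like cache object and two plain callables stand in for ElastiCache, the DynamoDB lookup, and the least-loaded assignment.

```python
# Hypothetical sketch of the Cell Router decision flow; all dependencies
# are injected so the routing logic itself stays small and testable.

CATEGORY_BY_PREFIX = {
    "/v1/sms": "messaging", "/v1/mms": "messaging", "/v1/whatsapp": "messaging",
    "/v1/voice": "realtime", "/v1/video": "realtime",
    "/v1/verify": "verify", "/v1/lookup": "verify",
    "/v1/email": "async", "/v1/fax": "async",
}

def category_for_path(path):
    """Step 1: map an API path to its service category."""
    for prefix, category in CATEGORY_BY_PREFIX.items():
        if path.startswith(prefix):
            return category
    raise ValueError(f"unknown API path: {path}")

def cache_key(customer_id, region, category):
    return f"cell:{customer_id}:{region}:{category}"

def resolve_cell(customer_id, region, path, cache, lookup_mapping, assign_new):
    """Steps 2-4: Redis cache -> DynamoDB mapping -> least-loaded assignment."""
    category = category_for_path(path)
    key = cache_key(customer_id, region, category)
    cell_id = cache.get(key)                                           # Step 2 (~5 ms)
    if cell_id is None:
        cell_id = lookup_mapping(customer_id, f"{region}#{category}")  # Step 3 (~15 ms)
        if cell_id is None:
            cell_id = assign_new(customer_id, region, category)        # Step 4 (~20 ms)
        cache.set(key, cell_id, ttl=3600)                              # TTL: 1 hour
    return cell_id
```

The 1-hour TTL means a migrated customer is re-resolved from DynamoDB within an hour at worst; explicit cache invalidation during cutover shortens that window.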
Performance Metrics
- Cache Hit Rate: Redis cache hit rate
- Cache Hit Latency: Redis lookup time
- Cache Miss Latency: DynamoDB lookup
- New Customer: Assignment + DDB write
Cell Assignment Algorithm
🎯 Goal: Assign to Least-Loaded Workflow-Affinity Cell
When a customer first uses a service category, assign them to the cell with the most available capacity for that workflow type in their segment and region.
Same customer can have different cells for different service categories (messaging, realtime, verify, async).
Least-Loaded Assignment Algorithm
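A minimal sketch of how least-loaded selection might work, assuming cell metadata rows carry `customer_count`, `max_customers`, and `cpu_utilization` fields. The field names and the equal weighting of fill ratio vs CPU are assumptions for illustration, not a documented schema.

```python
# Illustrative least-loaded cell selection for a segment/region/category.

def load_score(cell):
    """Lower score = more available capacity."""
    fill = cell["customer_count"] / cell["max_customers"]
    return 0.5 * fill + 0.5 * cell["cpu_utilization"]

def least_loaded_cell(cells, region, segment, category):
    """Pick the active cell with the most headroom for this workflow type."""
    candidates = [
        c for c in cells
        if c["region"] == region
        and c["segment"] == segment
        and c["category"] == category
        and c["status"] == "active"
    ]
    if not candidates:
        raise RuntimeError(f"no active {segment} cells for {category} in {region}")
    return min(candidates, key=load_score)
```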
Load Balancing Comparison
Hash-Based Assignment
Method: hash(customer_id) % num_cells
Distribution:
- Pseudo-random based on hash function
- No awareness of actual cell load
- Can assign to overloaded cells
Adding cells: Requires remapping existing customers
Least-Loaded Assignment (Recommended)
Method: Query cell metrics, pick lowest load
Distribution:
- ✅ Based on real metrics (CPU, customer count)
- ✅ Avoids overloaded cells
- ✅ Adapts to heterogeneous cell sizes
Adding cells: New customers automatically go to new cell
Rebalancing: Explicit migration by Control Plane when needed
Race Condition Handling
What if two Cell Routers assign the same customer simultaneously?
Scenario: Customer makes first API call โ Load balancer sends to 2 Cell Router instances simultaneously
Solution: DynamoDB Conditional Write
ConditionExpression: 'attribute_not_exists(customer_id)'
First write succeeds, second write gets ConditionalCheckFailedException
Second Cell Router catches exception, re-queries DynamoDB, uses the cell_id from first write
Result: Atomic assignment, no duplicate customers in different cells!
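The race-handling logic can be illustrated with an in-memory stand-in for the conditional put. In production this is DynamoDB's `put_item` with the `ConditionExpression` shown above, and the failure surfaces as a `ConditionalCheckFailedException`; the semantics below are the same.

```python
# In-memory simulation of DynamoDB's conditional write: the first
# assignment wins, and a losing router adopts the winner's cell_id.

class ConditionalCheckFailed(Exception):
    pass

class MappingStore:
    """Stands in for the customer_cell_mapping table."""
    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, cell_id):
        # Mirrors ConditionExpression: attribute_not_exists(customer_id)
        if key in self._items:
            raise ConditionalCheckFailed(key)
        self._items[key] = cell_id

    def get(self, key):
        return self._items[key]

def assign_cell(store, customer_id, region_category, proposed_cell):
    """First writer wins; a losing router re-reads the winning assignment."""
    key = (customer_id, region_category)
    try:
        store.put_if_absent(key, proposed_cell)
        return proposed_cell        # we won the race
    except ConditionalCheckFailed:
        return store.get(key)       # lost: adopt the authoritative cell_id
```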
Alternative: Consistent Hashing (Background Knowledge)
⚠️ When to Use Consistent Hashing
Consistent hashing is valuable in specific scenarios:
- Stateless distributed caches: Memcached/Redis clusters where hash function IS the source of truth
- Database sharding: Cassandra-style partitioning where no central lookup table exists
- Minimal state scenarios: When storing every mapping in a database is impractical
NOT recommended when: You already have a database (like DynamoDB) storing mappings!
Consistent Hashing Algorithm (Reference)
Why Consistent Hashing?
Problem: Simple hash modulo (e.g., hash(customer_id) % num_cells) causes massive redistribution when cells are added/removed.
Example: With 10 cells, adding 1 cell (11 total) reassigns ~91% of customers!
Solution: Consistent hashing minimizes redistribution to K/N where K = keys, N = nodes. Adding a cell only redistributes ~1/N customers.
How Consistent Hashing Works
Algorithm Pseudocode
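For reference, a ring with virtual nodes might be implemented as follows. This is a sketch using MD5 placement and Python's `bisect`, not tied to any particular production library.

```python
# Reference consistent hash ring with 150 virtual nodes per cell.
import bisect
import hashlib

class HashRing:
    def __init__(self, cells=(), vnodes=150):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, cell_id) virtual nodes
        for cell in cells:
            self.add_cell(cell)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_cell(self, cell_id):
        # 150 virtual nodes per cell smooth out the distribution
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{cell_id}#vn{i}"), cell_id))

    def remove_cell(self, cell_id):
        self._ring = [(h, c) for h, c in self._ring if c != cell_id]

    def get_cell(self, customer_id):
        # Walk clockwise to the first virtual node at or past the key's hash
        h = self._hash(customer_id)
        idx = bisect.bisect_left(self._ring, (h, ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]
```

Adding a cell with `add_cell` only reassigns the keys whose clockwise-nearest virtual node changed, roughly 1/N of customers, which is the redistribution property described above.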
Virtual Nodes: Why 150?
Few Virtual Nodes (e.g., 3)
Pros:
- Lower memory usage
- Faster ring operations
Cons:
- ❌ Uneven distribution
- ❌ Large gaps between nodes
- ❌ Some cells get 2x load
150 Virtual Nodes (Recommended)
Pros:
- ✅ Excellent load distribution (±1-2%)
- ✅ Smooth customer redistribution
- ✅ Industry standard (Cassandra uses 256)
Cons:
- Moderate memory (150 entries × cells)
Math: 10 cells × 150 vnodes = 1,500 ring entries (negligible memory)
Many Virtual Nodes (e.g., 1000)
Pros:
- Perfect distribution
Cons:
- ❌ Overkill for most systems
- ❌ Slower lookups
- ❌ Higher memory usage
- ❌ Diminishing returns
Routing Algorithm Implementation (Workflow-Affinity)
Complete Routing Flow
Step-by-Step Routing Decision
Step 1: Parse the request
- Parse `customer_id` from JWT token or API key
- Determine `region` from request origin (Route 53 geolocation)
- Extract `service_category` from API path:
  - `/v1/sms`, `/v1/mms`, `/v1/whatsapp` → messaging
  - `/v1/voice`, `/v1/video` → realtime
  - `/v1/verify`, `/v1/lookup` → verify
  - `/v1/email`, `/v1/fax` → async

Step 2: Resolve the customer segment
- Check ElastiCache for cached segment: `GET customer:{customer_id}:segment`
- If miss, query DynamoDB: `customer_cell_mapping` table
- If new customer, classify based on signup plan (free → SMB, paid → Mid-Market, enterprise contract → Enterprise)
- Cache result with 1-hour TTL

Step 3: Check the cell cache
- Redis key: `cell:{customer_id}:{region}:{category}`
- If HIT: Return cached cell_id (95% of requests)
- If MISS: Proceed to Step 4

Step 4: Query the DynamoDB mapping
- Query DynamoDB: `customer_cell_mapping` with PK=customer_id, SK=region#category
- If exists: Return cell_id, cache in Redis (TTL: 1 hour)
- If not exists (new customer or first use of this category): Assign to the least-loaded cell

Step 5: Assign the least-loaded cell
- Query cells for segment + region + category
- Select the cell with the lowest load score (customer count + category-specific metric)
- Write to DynamoDB: `customer_cell_mapping` table
- Attributes: customer_id, region#category, cell_id, service_category, segment, assigned_at
- Cache in Redis with 1-hour TTL
- Emit CloudWatch metric: `NewCustomerCategoryAssignment`

Step 6: Forward the request
- Update request header: `X-Twilio-Cell-ID: enterprise-us-east-1-b`
- VPC Lattice routes to the correct EKS cluster
- Service Discovery resolves cell DNS: `enterprise-us-east-1-b.internal`
- ALB forwards to the target group (EKS worker nodes)
- Kubernetes ingress routes to namespace: `customer-12345`
Routing Service Implementation (AWS Lambda)
Performance Metrics
- Cache Hit Rate: Redis cache hit rate
- Cache Hit Latency: Redis lookup time
- Cache Miss Latency: DynamoDB lookup + hash
- New Customer: Hash + DDB write + cache
Cell Rebalancing Strategy
When to Rebalance?
- New Cell Added: Capacity expansion, new region launch
- Cell Removed: Decommissioning, failure, cost optimization
- Cell Overloaded: CPU > 80%, request rate > threshold
- Customer Reclassification: SMB → Mid-Market → Enterprise promotion
- Scheduled Optimization: Quarterly rebalancing to optimize cost/performance
Rebalancing Strategies
| Strategy | Trigger | Scope | Downtime | Duration |
|---|---|---|---|---|
| Automatic (Consistent Hash) | New cell added/removed | ~1/N customers | Zero (dual-write) | Minutes to hours |
| Manual Migration | Customer upgrade (SMB→Enterprise) | Single customer | Zero (blue/green) | Seconds |
| Gradual Drain | Cell decommissioning | All customers in cell | Zero (redirect) | Days (batched) |
| Load-Based | Cell CPU > 80% | Hottest customers | Zero | Minutes |
| Scheduled Optimization | Quarterly maintenance | Imbalanced customers | Zero | Hours (off-peak) |
Zero-Downtime Migration Process
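The migration sequence (dual-write, background sync, atomic cutover, cleanup) can be expressed as an orchestration skeleton. Each phase is a method on an injected `ops` object, standing in for Step Function states; the names are assumptions for illustration.

```python
# Orchestration skeleton for zero-downtime customer migration.

def migrate_customer(customer_id, source_cell, target_cell, ops):
    """Run the four zero-downtime phases in order; returns the new cell."""
    ops.enable_dual_write(customer_id, source_cell, target_cell)  # writes hit both cells
    ops.backfill(customer_id, source_cell, target_cell)           # copy historical data
    ops.cutover(customer_id, target_cell)                         # atomic mapping update + cache invalidation
    ops.cleanup(customer_id, source_cell)                         # stop dual-write, drop source data
    return target_cell
```

In a Step Function, each phase maps to a state with its own retries and a rollback path, which is what keeps a failed cutover from stranding a customer between cells.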
Rebalancing Automation (Step Function)
Load-Based Rebalancing
Automatic Hotspot Detection
Scenario: Enterprise Cell A has CPU at 85%, while Cell B is at 40%.
Solution: Identify "hottest" customers in Cell A and migrate to Cell B.
Algorithm:
- CloudWatch alarm triggers when Cell CPU > 80% for 5 minutes
- Query DynamoDB for customers in the overloaded cell, sorted by `api_calls_30d DESC`
- Select top N customers representing ~20% of traffic
- Find underutilized cell in same segment/region (CPU < 50%)
- Migrate hot customers to underutilized cell using zero-downtime process
- Monitor for 1 hour, verify load distribution improved
Expected Result: Cell A CPU drops to ~65%, Cell B increases to ~60% (balanced)
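The hot-customer selection in the algorithm above might look like the following sketch. The `api_calls_30d` field comes from the text; the greedy cutoff at a traffic share is an assumption.

```python
# Pick the fewest top-traffic customers covering ~20% of cell traffic.

def pick_customers_to_migrate(customers, traffic_share=0.20):
    ranked = sorted(customers, key=lambda c: c["api_calls_30d"], reverse=True)
    total = sum(c["api_calls_30d"] for c in customers)
    picked, covered = [], 0
    for cust in ranked:
        if covered >= traffic_share * total:
            break  # we have drained enough traffic
        picked.append(cust)
        covered += cust["api_calls_30d"]
    return picked
```

Migrating the heaviest customers first moves the most load with the fewest migrations, which keeps the rebalancing window short.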
Production Implementation Considerations
DynamoDB Table Design
Monitoring & Alerting
| Metric | Threshold | Alert | Action |
|---|---|---|---|
| Cell CPU Utilization | > 80% for 5 min | PagerDuty P2 | Auto-scale or trigger load-based rebalancing |
| Cell Imbalance | Stddev > 20% across cells | Slack warning | Schedule rebalancing during off-peak |
| Cache Hit Rate | < 90% | Slack warning | Investigate cache eviction, increase Redis capacity |
| Cell Routing Latency | p99 > 50ms | PagerDuty P3 | Check DynamoDB throttling, Redis latency |
| Migration Failures | > 5 failures in 1 hour | PagerDuty P1 | Pause migrations, investigate root cause |
| Customer Distribution | Cell has < 10 customers | CloudWatch alarm | Consider cell decommissioning |
Cost Optimization
DynamoDB Costs
Assumptions:
- 10M customers
- 3 regions (Global Tables)
- 10 KB avg item size
Storage: 10M × 10KB × 3 regions = 300 GB = $75/month
Reads (cached): 1B req/month × 5% miss rate = 50M reads = $6/month
Writes: 10M new customers/month = $13/month
Replication: Included in Global Tables
Total: ~$100/month
ElastiCache Costs
Configuration:
- cache.r7g.large (2 vCPU, 13 GB RAM)
- 3 AZs, 3 replicas
- Per region
Per Region: $0.188/hr × 3 nodes × 730 hrs = $411/month
3 Regions: $1,233/month
Alternative: cache.r7g.xlarge (26 GB) if > 10M customers
Cell Router Lambda
Configuration:
- 1024 MB memory
- ~20ms avg execution
- 1B requests/month
Compute: 1B × 20ms = 20M GB-sec = $333/month
Requests: 1B requests = $200/month
Total: ~$533/month
Note: 95% cache hit rate saves significant Lambda cost!
💰 Total Cell Routing Infrastructure Cost
Monthly: ~$1,866/month for 10M customers, 1B API calls
Per Customer: $0.0001866/month
Per API Call: $0.000001866
ROI: Enables reliable cell isolation, fast routing, zero-downtime migrations - well worth the investment!
Control Plane: Cell Lifecycle Management
🎮 What is the Control Plane?
The Control Plane is the operational automation layer that manages the entire cell lifecycle. Think of it as the "cockpit of the platform" that orchestrates cell provisioning, customer assignment, rebalancing, and decommissioning.
It works in conjunction with the Cloud Native Landing Zone (AWS Control Tower), which provides the foundational multi-account governance structure.
Control Plane vs Cell Router
Cell Router (Data Plane)
Purpose: Route customer requests to the correct cell
Latency: ~5-15ms per request
Operations:
- Extract customer_id from request
- Query cache/DynamoDB for cell assignment
- Apply consistent hashing if new customer
- Set X-Twilio-Cell-ID header
- Forward request via VPC Lattice
Scale: Millions of requests per second
Control Plane (Management Plane) NEW
Purpose: Manage cell lifecycle and infrastructure
Latency: Minutes to hours (batch operations)
Operations:
- Provision new AWS accounts for cells
- Deploy cell infrastructure (VPC, EKS, DB)
- Update consistent hash ring
- Orchestrate customer migrations
- Decommission old cells
Scale: Hundreds of operations per day
Control Plane Architecture
Control Plane Operations
Operation 1: Provision New Cell
Step 1: Create the AWS account
- Call AWS Control Tower Account Factory API
- Account name: `enterprise-cell-{region}-{letter}`
- Organizational Unit: `/Cells/Enterprise`
- Wait for account creation (2-5 minutes)

Step 2: Deploy cell infrastructure
- VPC with CIDR 10.0.0.0/16 (overlapping IPs OK)
- 3 private subnets across AZs (10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20)
- 3 public subnets for NAT gateways
- EKS cluster with managed node groups (20-50 nodes)
- DynamoDB tables (replicated via Global Tables)
- ElastiCache Redis cluster

Step 3: Register with VPC Lattice
- Create VPC Lattice service for the cell
- Service name: `enterprise-{region}-{letter}-api`
- Target: ALB in cell's VPC
- Associate service with the regional VPC Lattice network

Step 4: Update routing metadata
- Write to DynamoDB `cell_metadata` table
- Add 150 virtual nodes for the new cell to the consistent hash ring
- Increment ring version number
- Publish SNS notification → all Cell Router instances refresh the ring

Step 5: Rebalance customers
- Calculate which customers should move to the new cell (consistent hashing)
- Start Step Function: `MigrateCellBatch`
- Migrate ~1/N customers from existing cells (N = total cells)
🔑 Key Insight: Control Plane is Asynchronous
Unlike the Cell Router which operates in the critical path (milliseconds), the Control Plane operates asynchronously (minutes to hours). This separation allows:
- Cell Router to be ultra-fast and simple
- Control Plane to be complex and robust (retries, validation, rollback)
- Zero downtime during infrastructure changes
- Safe experimentation with new cell configurations
Interview Talking Points: Control Plane
"How would you automate cell provisioning and lifecycle management?"
"I'd build a Control Plane as a set of Step Functions workflows that orchestrate cell lifecycle operations.
Architecture: The Control Plane sits on top of AWS Control Tower, which provides the multi-account Landing Zone. When we need a new cell, the Control Plane calls the Account Factory API to create an AWS account, then uses Terraform to deploy the standard cell stackโVPC, EKS, DynamoDB, etc.
Separation of concerns: The Cell Router handles request-time routing in milliseconds, reading from a DynamoDB table and cache. The Control Plane handles infrastructure changes asynchronously in minutes, updating that table and notifying Cell Routers via SNS when the hash ring changes.
Capacity management: EventBridge rules monitor cell capacity metrics. When a cell hits 70% CPU or 100 customers, it triggers the provisioning workflow automatically. This ensures we're always ahead of demand.
Hash ring updates: The Control Plane is the single writer to the hash ring metadata in DynamoDB. It increments a version number with each change and publishes to SNS. Cell Routers poll every 30 seconds or receive push notifications to refresh their in-memory ring.
Why Step Functions: Cell provisioning involves multiple AWS API calls, Terraform applies, health checks, and rollback logic. Step Functions provides the orchestration, retries, and error handling out of the box. A single workflow can provision a cell end-to-end in about 15 minutes.
This architecture keeps the data plane simple and fast while allowing complex operational automation in the control plane."
Edge Cases & Failure Scenarios
Scenario 1: DynamoDB Unavailable
Problem
DynamoDB Global Table in us-east-1 is experiencing throttling or regional outage.
Impact
- Cache misses cannot be resolved (5% of traffic)
- New customer assignments fail
Mitigation
- Cross-Region Failover: Route 53 health check detects DynamoDB unavailability, routes requests to EU region
- Degraded Mode: Cell Router falls back to pure consistent hashing (no persistence check)
- Extended Cache TTL: Temporarily extend Redis TTL from 1 hour to 24 hours
- Rate Limiting: Reduce new customer signup rate to avoid overwhelming when DynamoDB recovers
Recovery
When DynamoDB recovers, background job reconciles any assignments made during outage.
Scenario 2: Redis Cache Failure
Problem
ElastiCache cluster in us-east-1 fails (e.g., all nodes restart simultaneously).
Impact
- 100% cache miss rate (vs 5% normal)
- 20x increase in DynamoDB read load
- Routing latency increases from 5ms → 15ms
Mitigation
- Multi-AZ: ElastiCache cluster mode with 3 replicas across AZs (automatic failover)
- DynamoDB Auto-Scaling: On-demand mode handles read spike
- Circuit Breaker: If cache is down, skip cache entirely (direct to DynamoDB)
- Graceful Degradation: Inform customers of slight latency increase via status page
Scenario 3: Customer Segment Changes Mid-Request
Problem
Customer upgrades from SMB → Enterprise while their request is in flight.
Impact
- Request routed to SMB cell, but customer now belongs in Enterprise cell
- Data may be written to wrong cell
Mitigation
- Optimistic Locking: Use DynamoDB `version` field with conditional writes
- Dual-Write Period: During upgrade, write to both SMB and Enterprise cells for 5 minutes
- Cell Router Check: Before routing, verify segment hasn't changed (compare cached vs DynamoDB)
- Automatic Redirect: If segment mismatch detected, invalidate cache and re-route
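The Cell Router check and automatic redirect described above can be combined into a small helper. This is a sketch with injected lookups; `fetch_segment` and `invalidate` stand in for the authoritative DynamoDB read and the cache eviction.

```python
# Detect a stale cached segment and signal that the request must be re-routed.

def verify_segment(customer_id, cached_segment, fetch_segment, invalidate):
    current = fetch_segment(customer_id)   # authoritative DynamoDB read
    if current != cached_segment:
        invalidate(customer_id)            # drop stale cache entries
        return current, True               # True = caller must re-route
    return current, False
```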
Scenario 4: Hash Ring Desynchronization
Problem
Cell Router instances have inconsistent views of the hash ring (e.g., new cell added but not all routers refreshed).
Impact
- Same customer routed to different cells by different routers
- Data inconsistency
Mitigation
- Authoritative Source: DynamoDB `cell_metadata` table is the source of truth
- Version Number: Hash ring has a version number; routers check the version before routing
- Refresh Frequency: Cell Routers poll DynamoDB every 30 seconds for ring updates
- SNS Notification: When a cell is added/removed, publish to SNS → all routers immediately refresh
- Consistency Check: If the routing result doesn't match the DynamoDB assignment, trust DynamoDB and update the local ring
Scenario 5: Entire Cell Failure
Problem
Enterprise Cell A (us-east-1) experiences complete failure (EKS cluster down, AZ outage, etc.).
Impact
- All customers in Cell A cannot access services
- ~1,000 enterprise customers affected
Mitigation
- Health Checks: ALB target group health checks detect cell failure within 30 seconds
- Automatic Failover: Update DynamoDB: set `cell_status = inactive`
- Emergency Rebalancing: Step Function immediately migrates all Cell A customers to Cell B (same region)
- Dual-Cell Strategy: Enterprise customers replicated to 2 cells (primary + hot standby)
- RTO: < 5 minutes for full recovery
- RPO: < 1 second (DynamoDB Global Tables replication lag)
Interview Talking Points
🎤 2-Minute Summary: Cell Routing & Assignment
"For Twilio's cell routing, I use DynamoDB Global Tables as the source of truth for customer → cell mappings. The Cell Router is simple: check Redis cache (95% hit rate, ~5ms), query DynamoDB on miss (~15ms), or assign new customers to the least-loaded cell (~20ms)."
"For new customer assignment, I query cell metadata from DynamoDB to find the least-loaded cell in the customer's segment and region. The assignment uses a load score combining customer count and CPU utilization, so we always assign to the cell with the most capacity. A conditional write prevents race conditions if two Cell Routers try to assign the same customer simultaneously."
"This approach is simpler than consistent hashing because DynamoDB is already the source of truthโwhy calculate hashes when you can just look it up? It also gives better load distribution since we use real metrics instead of pseudo-random hash distribution."
"When adding new cells, the Control Plane provisions the AWS account and infrastructure. New customers automatically go to the new cell since it has the lowest load. For rebalancing existing customers, the Control Plane explicitly migrates them using a zero-downtime process: dual-write, background sync, atomic cutover, cleanup."
"The entire routing infrastructure costs about $1,866/month for 10M customers and 1B API callsโDynamoDB global tables, ElastiCache, and Lambda. That's roughly $0.000002 per API call for reliable cell isolation and sub-20ms routing."
💡 Key Interview Points
- Simplicity: "DynamoDB is the source of truth, so we just look it up. No hash ring versioning or synchronization."
- Performance: "95% cache hit rate means ~5ms latency for most requests. Cache miss is ~15ms from DynamoDB."
- Load Distribution: "Least-loaded assignment based on real metrics (CPU, customer count) beats hash-based distribution."
- Atomic Assignment: "DynamoDB conditional writes prevent race conditions when multiple Cell Routers assign same customer."
- Controlled Migration: "Control Plane explicitly decides which customers to migrate, when, and whereโnot dictated by hash ring changes."
- When Hashing Makes Sense: "Consistent hashing is great for stateless caches (Memcached) or database sharding (Cassandra), but not when you already have a database!"