Algorithm Overview
🎯 Goals of Cell Routing
- Stable Assignment: Customers stay in the same cell (DynamoDB is source of truth)
- Optimal Load Distribution: Assign new customers to least-loaded cells using real metrics
- Fast Lookups: Sub-10ms lookup time via Redis cache (95% hit rate)
- Segment Awareness: Route Enterprise to Enterprise cells, SMB to SMB cells
- Geographic Affinity: Route customers to geographically closest cells
- Controlled Migration: Explicit rebalancing via Control Plane when needed
✅ Why This Approach is Better Than Consistent Hashing
Key Insight: If you're using DynamoDB as the source of truth and already doing lookups, consistent hashing adds complexity without benefit.
- Simpler: No hash ring to maintain, version, or synchronize across Cell Routers
- Better Load Distribution: Assign to least-loaded cell based on actual CPU/customer count, not hash distribution
- Same Performance: DynamoDB lookup is required either way (for existing customers)
- Controlled Migration: Control Plane explicitly chooses which customers to migrate, when, and where
- Easier to Debug: Direct database queries show exactly where each customer is assigned
Consistent hashing is valuable for stateless distributed caches or database sharding where the hash IS the source of truth. But when you have a database, use it!
Architecture Components
Data Model: DynamoDB (Workflow-Affinity Routing)
Cell Routing Algorithm
🎯 Core Principle: Workflow-Affinity Routing
Every customer × service_category → cell assignment is stored in DynamoDB Global Tables. Same customer can have different cells for different workflow categories (messaging, realtime, verify, async).
Why? Services with different scaling characteristics (bursty SMS vs sustained Voice) can scale independently while services that call each other (SMS + WhatsApp in fallback chains) stay co-located.
Complete Routing Flow
Step-by-Step Routing Decision
Step 1: Parse the request
- Parse `customer_id` from JWT token or API key
- Determine `region` from request origin (Route 53 geolocation)
- Extract `service_category` from API path:
  - `/v1/sms`, `/v1/mms`, `/v1/whatsapp` → messaging
  - `/v1/voice`, `/v1/video` → realtime
  - `/v1/verify`, `/v1/lookup` → verify
  - `/v1/email`, `/v1/fax` → async

Step 2: Check Redis cache
- Redis key: `cell:{customer_id}:{region}:{category}`
- If HIT: Return cached cell_id (~5ms latency)
- If MISS: Proceed to Step 3

Step 3: Query DynamoDB
- Query: `SELECT cell_id FROM customer_cell_mapping WHERE customer_id = ? AND region#category = ?`
- If found: Cache in Redis (TTL: 1 hour), return cell_id (~15ms latency)
- If not found (new customer or new category): Proceed to Step 4

Step 4: Assign a new cell
- Determine customer segment (from signup plan or API tier)
- Call `getLeastLoadedCell(region, segment, category)` (see Cell Assignment section)
- Write to DynamoDB with conditional expression: `attribute_not_exists(customer_id) AND attribute_not_exists(region#category)`
- Cache in Redis (TTL: 1 hour)
- Return cell_id (~20ms latency)

Step 5: Forward the request
- Set header: `X-Twilio-Cell-ID: enterprise-us-east-1-a`
- VPC Lattice reads the header and routes to the correct cell's service endpoint
- Request arrives at the cell's ALB → EKS cluster → Kubernetes pod
Routing Service Implementation (Lambda)
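The flow above could be sketched as the core of a Lambda handler. This is a hedged illustration, not Twilio's actual implementation: a redis-like cache object and two plain callables stand in for ElastiCache, the DynamoDB lookup, and the least-loaded assignment.

```python
# Hypothetical sketch of the Cell Router decision flow; all dependencies
# are injected so the routing logic itself stays small and testable.

CATEGORY_BY_PREFIX = {
    "/v1/sms": "messaging", "/v1/mms": "messaging", "/v1/whatsapp": "messaging",
    "/v1/voice": "realtime", "/v1/video": "realtime",
    "/v1/verify": "verify", "/v1/lookup": "verify",
    "/v1/email": "async", "/v1/fax": "async",
}

def category_for_path(path):
    """Step 1: map an API path to its service category."""
    for prefix, category in CATEGORY_BY_PREFIX.items():
        if path.startswith(prefix):
            return category
    raise ValueError(f"unknown API path: {path}")

def cache_key(customer_id, region, category):
    return f"cell:{customer_id}:{region}:{category}"

def resolve_cell(customer_id, region, path, cache, lookup_mapping, assign_new):
    """Steps 2-4: Redis cache -> DynamoDB mapping -> least-loaded assignment."""
    category = category_for_path(path)
    key = cache_key(customer_id, region, category)
    cell_id = cache.get(key)                                           # Step 2 (~5 ms)
    if cell_id is None:
        cell_id = lookup_mapping(customer_id, f"{region}#{category}")  # Step 3 (~15 ms)
        if cell_id is None:
            cell_id = assign_new(customer_id, region, category)        # Step 4 (~20 ms)
        cache.set(key, cell_id, ttl=3600)                              # TTL: 1 hour
    return cell_id
```

The 1-hour TTL means a migrated customer is re-resolved from DynamoDB within an hour at worst; explicit cache invalidation during cutover shortens that window.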
Performance Metrics
- Cache Hit Rate: Redis cache hit rate
- Cache Hit Latency: Redis lookup time
- Cache Miss Latency: DynamoDB lookup
- New Customer: Assignment + DDB write
Cell Assignment Algorithm
🎯 Goal: Assign to Least-Loaded Workflow-Affinity Cell
When a customer first uses a service category, assign them to the cell with the most available capacity for that workflow type in their segment and region.
Same customer can have different cells for different service categories (messaging, realtime, verify, async).
Least-Loaded Assignment Algorithm
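A minimal sketch of how least-loaded selection might work, assuming cell metadata rows carry `customer_count`, `max_customers`, and `cpu_utilization` fields. The field names and the equal weighting of fill ratio vs CPU are assumptions for illustration, not a documented schema.

```python
# Illustrative least-loaded cell selection for a segment/region/category.

def load_score(cell):
    """Lower score = more available capacity."""
    fill = cell["customer_count"] / cell["max_customers"]
    return 0.5 * fill + 0.5 * cell["cpu_utilization"]

def least_loaded_cell(cells, region, segment, category):
    """Pick the active cell with the most headroom for this workflow type."""
    candidates = [
        c for c in cells
        if c["region"] == region
        and c["segment"] == segment
        and c["category"] == category
        and c["status"] == "active"
    ]
    if not candidates:
        raise RuntimeError(f"no active {segment} cells for {category} in {region}")
    return min(candidates, key=load_score)
```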
Load Balancing Comparison
Hash-Based Assignment
Method: hash(customer_id) % num_cells
Distribution:
- Pseudo-random based on hash function
- No awareness of actual cell load
- Can assign to overloaded cells
Adding cells: Requires remapping existing customers
Least-Loaded Assignment (Recommended)
Method: Query cell metrics, pick lowest load
Distribution:
- ✅ Based on real metrics (CPU, customer count)
- ✅ Avoids overloaded cells
- ✅ Adapts to heterogeneous cell sizes
Adding cells: New customers automatically go to new cell
Rebalancing: Explicit migration by Control Plane when needed
Race Condition Handling
What if two Cell Routers assign the same customer simultaneously?
Scenario: Customer makes first API call โ Load balancer sends to 2 Cell Router instances simultaneously
Solution: DynamoDB Conditional Write
ConditionExpression: 'attribute_not_exists(customer_id)'
First write succeeds, second write gets ConditionalCheckFailedException
Second Cell Router catches exception, re-queries DynamoDB, uses the cell_id from first write
Result: Atomic assignment, no duplicate customers in different cells!
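The race-handling logic can be illustrated with an in-memory stand-in for the conditional put. In production this is DynamoDB's `put_item` with the `ConditionExpression` shown above, and the failure surfaces as a `ConditionalCheckFailedException`; the semantics below are the same.

```python
# In-memory simulation of DynamoDB's conditional write: the first
# assignment wins, and a losing router adopts the winner's cell_id.

class ConditionalCheckFailed(Exception):
    pass

class MappingStore:
    """Stands in for the customer_cell_mapping table."""
    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, cell_id):
        # Mirrors ConditionExpression: attribute_not_exists(customer_id)
        if key in self._items:
            raise ConditionalCheckFailed(key)
        self._items[key] = cell_id

    def get(self, key):
        return self._items[key]

def assign_cell(store, customer_id, region_category, proposed_cell):
    """First writer wins; a losing router re-reads the winning assignment."""
    key = (customer_id, region_category)
    try:
        store.put_if_absent(key, proposed_cell)
        return proposed_cell        # we won the race
    except ConditionalCheckFailed:
        return store.get(key)       # lost: adopt the authoritative cell_id
```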
Alternative: Consistent Hashing (Background Knowledge)
⚠️ When to Use Consistent Hashing
Consistent hashing is valuable in specific scenarios:
- Stateless distributed caches: Memcached/Redis clusters where hash function IS the source of truth
- Database sharding: Cassandra-style partitioning where no central lookup table exists
- Minimal state scenarios: When storing every mapping in a database is impractical
NOT recommended when: You already have a database (like DynamoDB) storing mappings!
Consistent Hashing Algorithm (Reference)
Why Consistent Hashing?
Problem: Simple hash modulo (e.g., hash(customer_id) % num_cells) causes massive redistribution when cells are added/removed.
Example: With 10 cells, adding 1 cell (11 total) reassigns ~91% of customers!
Solution: Consistent hashing minimizes redistribution to K/N where K = keys, N = nodes. Adding a cell only redistributes ~1/N customers.
How Consistent Hashing Works
Algorithm Pseudocode
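For reference, a ring with virtual nodes might be implemented as follows. This is a sketch using MD5 placement and Python's `bisect`, not tied to any particular production library.

```python
# Reference consistent hash ring with 150 virtual nodes per cell.
import bisect
import hashlib

class HashRing:
    def __init__(self, cells=(), vnodes=150):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, cell_id) virtual nodes
        for cell in cells:
            self.add_cell(cell)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_cell(self, cell_id):
        # 150 virtual nodes per cell smooth out the distribution
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{cell_id}#vn{i}"), cell_id))

    def remove_cell(self, cell_id):
        self._ring = [(h, c) for h, c in self._ring if c != cell_id]

    def get_cell(self, customer_id):
        # Walk clockwise to the first virtual node at or past the key's hash
        h = self._hash(customer_id)
        idx = bisect.bisect_left(self._ring, (h, ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]
```

Adding a cell with `add_cell` only reassigns the keys whose clockwise-nearest virtual node changed, roughly 1/N of customers, which is the redistribution property described above.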
Virtual Nodes: Why 150?
Few Virtual Nodes (e.g., 3)
Pros:
- Lower memory usage
- Faster ring operations
Cons:
- ❌ Uneven distribution
- ❌ Large gaps between nodes
- ❌ Some cells get 2x load
150 Virtual Nodes (Recommended)
Pros:
- ✅ Excellent load distribution (±1-2%)
- ✅ Smooth customer redistribution
- ✅ Industry standard (Cassandra uses 256)
Cons:
- Moderate memory (150 entries × cells)
Math: 10 cells × 150 vnodes = 1,500 ring entries (negligible memory)
Many Virtual Nodes (e.g., 1000)
Pros:
- Perfect distribution
Cons:
- ❌ Overkill for most systems
- ❌ Slower lookups
- ❌ Higher memory usage
- ❌ Diminishing returns
Routing Algorithm Implementation (Workflow-Affinity)
Complete Routing Flow
Step-by-Step Routing Decision
Step 1: Parse the request
- Parse `customer_id` from JWT token or API key
- Determine `region` from request origin (Route 53 geolocation)
- Extract `service_category` from API path:
  - `/v1/sms`, `/v1/mms`, `/v1/whatsapp` → messaging
  - `/v1/voice`, `/v1/video` → realtime
  - `/v1/verify`, `/v1/lookup` → verify
  - `/v1/email`, `/v1/fax` → async

Step 2: Resolve the customer segment
- Check ElastiCache for cached segment: `GET customer:{customer_id}:segment`
- If miss, query DynamoDB: `customer_cell_mapping` table
- If new customer, classify based on signup plan (free → SMB, paid → Mid-Market, enterprise contract → Enterprise)
- Cache result with 1-hour TTL

Step 3: Check the cell cache
- Redis key: `cell:{customer_id}:{region}:{category}`
- If HIT: Return cached cell_id (95% of requests)
- If MISS: Proceed to Step 4

Step 4: Query the DynamoDB mapping
- Query DynamoDB: `customer_cell_mapping` with PK=customer_id, SK=region#category
- If exists: Return cell_id, cache in Redis (TTL: 1 hour)
- If not exists (new customer or first use of this category): Assign to the least-loaded cell

Step 5: Assign the least-loaded cell
- Query cells for segment + region + category
- Select the cell with the lowest load score (customer count + category-specific metric)
- Write to DynamoDB: `customer_cell_mapping` table
- Attributes: customer_id, region#category, cell_id, service_category, segment, assigned_at
- Cache in Redis with 1-hour TTL
- Emit CloudWatch metric: `NewCustomerCategoryAssignment`

Step 6: Forward the request
- Update request header: `X-Twilio-Cell-ID: enterprise-us-east-1-b`
- VPC Lattice routes to the correct EKS cluster
- Service Discovery resolves cell DNS: `enterprise-us-east-1-b.internal`
- ALB forwards to the target group (EKS worker nodes)
- Kubernetes ingress routes to namespace: `customer-12345`
Routing Service Implementation (AWS Lambda)
Performance Metrics
- Cache Hit Rate: Redis cache hit rate
- Cache Hit Latency: Redis lookup time
- Cache Miss Latency: DynamoDB lookup + hash
- New Customer: Hash + DDB write + cache
Cell Rebalancing Strategy
When to Rebalance?
- New Cell Added: Capacity expansion, new region launch
- Cell Removed: Decommissioning, failure, cost optimization
- Cell Overloaded: CPU > 80%, request rate > threshold
- Customer Reclassification: SMB → Mid-Market → Enterprise promotion
- Scheduled Optimization: Quarterly rebalancing to optimize cost/performance
Rebalancing Strategies
| Strategy | Trigger | Scope | Downtime | Duration |
|---|---|---|---|---|
| Automatic (Consistent Hash) | New cell added/removed | ~1/N customers | Zero (dual-write) | Minutes to hours |
| Manual Migration | Customer upgrade (SMB→Enterprise) | Single customer | Zero (blue/green) | Seconds |
| Gradual Drain | Cell decommissioning | All customers in cell | Zero (redirect) | Days (batched) |
| Load-Based | Cell CPU > 80% | Hottest customers | Zero | Minutes |
| Scheduled Optimization | Quarterly maintenance | Imbalanced customers | Zero | Hours (off-peak) |
Zero-Downtime Migration Process
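The migration sequence (dual-write, background sync, atomic cutover, cleanup) can be expressed as an orchestration skeleton. Each phase is a method on an injected `ops` object, standing in for Step Function states; the names are assumptions for illustration.

```python
# Orchestration skeleton for zero-downtime customer migration.

def migrate_customer(customer_id, source_cell, target_cell, ops):
    """Run the four zero-downtime phases in order; returns the new cell."""
    ops.enable_dual_write(customer_id, source_cell, target_cell)  # writes hit both cells
    ops.backfill(customer_id, source_cell, target_cell)           # copy historical data
    ops.cutover(customer_id, target_cell)                         # atomic mapping update + cache invalidation
    ops.cleanup(customer_id, source_cell)                         # stop dual-write, drop source data
    return target_cell
```

In a Step Function, each phase maps to a state with its own retries and a rollback path, which is what keeps a failed cutover from stranding a customer between cells.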
Rebalancing Automation (Step Function)
Load-Based Rebalancing
Automatic Hotspot Detection
Scenario: Enterprise Cell A has CPU at 85%, while Cell B is at 40%.
Solution: Identify "hottest" customers in Cell A and migrate to Cell B.
Algorithm:
- CloudWatch alarm triggers when Cell CPU > 80% for 5 minutes
- Query DynamoDB for customers in the overloaded cell, sorted by `api_calls_30d DESC`
- Select top N customers representing ~20% of traffic
- Find underutilized cell in same segment/region (CPU < 50%)
- Migrate hot customers to underutilized cell using zero-downtime process
- Monitor for 1 hour, verify load distribution improved
Expected Result: Cell A CPU drops to ~65%, Cell B increases to ~60% (balanced)
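The hot-customer selection in the algorithm above might look like the following sketch. The `api_calls_30d` field comes from the text; the greedy cutoff at a traffic share is an assumption.

```python
# Pick the fewest top-traffic customers covering ~20% of cell traffic.

def pick_customers_to_migrate(customers, traffic_share=0.20):
    ranked = sorted(customers, key=lambda c: c["api_calls_30d"], reverse=True)
    total = sum(c["api_calls_30d"] for c in customers)
    picked, covered = [], 0
    for cust in ranked:
        if covered >= traffic_share * total:
            break  # we have drained enough traffic
        picked.append(cust)
        covered += cust["api_calls_30d"]
    return picked
```

Migrating the heaviest customers first moves the most load with the fewest migrations, which keeps the rebalancing window short.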
Production Implementation Considerations
DynamoDB Table Design
Monitoring & Alerting
| Metric | Threshold | Alert | Action |
|---|---|---|---|
| Cell CPU Utilization | > 80% for 5 min | PagerDuty P2 | Auto-scale or trigger load-based rebalancing |
| Cell Imbalance | Stddev > 20% across cells | Slack warning | Schedule rebalancing during off-peak |
| Cache Hit Rate | < 90% | Slack warning | Investigate cache eviction, increase Redis capacity |
| Cell Routing Latency | p99 > 50ms | PagerDuty P3 | Check DynamoDB throttling, Redis latency |
| Migration Failures | > 5 failures in 1 hour | PagerDuty P1 | Pause migrations, investigate root cause |
| Customer Distribution | Cell has < 10 customers | CloudWatch alarm | Consider cell decommissioning |
Cost Optimization
DynamoDB Costs
Assumptions:
- 10M customers
- 3 regions (Global Tables)
- 10 KB avg item size
Storage: 10M × 10KB × 3 regions = 300 GB = $75/month
Reads (cached): 1B req/month × 5% miss rate = 50M reads = $6/month
Writes: 10M new customers/month = $13/month
Replication: Included in Global Tables
Total: ~$100/month
ElastiCache Costs
Configuration:
- cache.r7g.large (2 vCPU, 13 GB RAM)
- 3 AZs, 3 replicas
- Per region
Per Region: $0.188/hr × 3 nodes × 730 hrs = $411/month
3 Regions: $1,233/month
Alternative: cache.r7g.xlarge (26 GB) if > 10M customers
Cell Router Lambda
Configuration:
- 1024 MB memory
- ~20ms avg execution
- 1B requests/month
Compute: 1B × 20ms = 20M GB-sec = $333/month
Requests: 1B requests = $200/month
Total: ~$533/month
Note: 95% cache hit rate saves significant Lambda cost!
💰 Total Cell Routing Infrastructure Cost
Monthly: ~$1,866/month for 10M customers, 1B API calls
Per Customer: $0.0001866/month
Per API Call: $0.000001866
ROI: Enables reliable cell isolation, fast routing, zero-downtime migrations - well worth the investment!
Control Plane: Cell Lifecycle Management
🎮 What is the Control Plane?
The Control Plane is the operational automation layer that manages the entire cell lifecycle. Think of it as the "cockpit of the platform" that orchestrates cell provisioning, customer assignment, rebalancing, and decommissioning.
It works in conjunction with the Cloud Native Landing Zone (AWS Control Tower), which provides the foundational multi-account governance structure.
Control Plane vs Cell Router
Cell Router (Data Plane)
Purpose: Route customer requests to the correct cell
Latency: ~5-15ms per request
Operations:
- Extract customer_id from request
- Query cache/DynamoDB for cell assignment
- Apply consistent hashing if new customer
- Set X-Twilio-Cell-ID header
- Forward request via VPC Lattice
Scale: Millions of requests per second
Control Plane (Management Plane) NEW
Purpose: Manage cell lifecycle and infrastructure
Latency: Minutes to hours (batch operations)
Operations:
- Provision new AWS accounts for cells
- Deploy cell infrastructure (VPC, EKS, DB)
- Update consistent hash ring
- Orchestrate customer migrations
- Decommission old cells
Scale: Hundreds of operations per day
Control Plane Architecture
Control Plane Operations
Operation 1: Provision New Cell
Step 1: Create the AWS account
- Call AWS Control Tower Account Factory API
- Account name: `enterprise-cell-{region}-{letter}`
- Organizational Unit: `/Cells/Enterprise`
- Wait for account creation (2-5 minutes)

Step 2: Deploy cell infrastructure
- VPC with CIDR 10.0.0.0/16 (overlapping IPs OK)
- 3 private subnets across AZs (10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20)
- 3 public subnets for NAT gateways
- EKS cluster with managed node groups (20-50 nodes)
- DynamoDB tables (replicated via Global Tables)
- ElastiCache Redis cluster

Step 3: Register with VPC Lattice
- Create VPC Lattice service for the cell
- Service name: `enterprise-{region}-{letter}-api`
- Target: ALB in cell's VPC
- Associate service with the regional VPC Lattice network

Step 4: Update routing metadata
- Write to DynamoDB `cell_metadata` table
- Add 150 virtual nodes for the new cell to the consistent hash ring
- Increment ring version number
- Publish SNS notification → all Cell Router instances refresh the ring

Step 5: Rebalance customers
- Calculate which customers should move to the new cell (consistent hashing)
- Start Step Function: `MigrateCellBatch`
- Migrate ~1/N customers from existing cells (N = total cells)
🔑 Key Insight: Control Plane is Asynchronous
Unlike the Cell Router which operates in the critical path (milliseconds), the Control Plane operates asynchronously (minutes to hours). This separation allows:
- Cell Router to be ultra-fast and simple
- Control Plane to be complex and robust (retries, validation, rollback)
- Zero downtime during infrastructure changes
- Safe experimentation with new cell configurations
Interview Talking Points: Control Plane
"How would you automate cell provisioning and lifecycle management?"
"I'd build a Control Plane as a set of Step Functions workflows that orchestrate cell lifecycle operations.
Architecture: The Control Plane sits on top of AWS Control Tower, which provides the multi-account Landing Zone. When we need a new cell, the Control Plane calls the Account Factory API to create an AWS account, then uses Terraform to deploy the standard cell stackโVPC, EKS, DynamoDB, etc.
Separation of concerns: The Cell Router handles request-time routing in milliseconds, reading from a DynamoDB table and cache. The Control Plane handles infrastructure changes asynchronously in minutes, updating that table and notifying Cell Routers via SNS when the hash ring changes.
Capacity management: EventBridge rules monitor cell capacity metrics. When a cell hits 70% CPU or 100 customers, it triggers the provisioning workflow automatically. This ensures we're always ahead of demand.
Hash ring updates: The Control Plane is the single writer to the hash ring metadata in DynamoDB. It increments a version number with each change and publishes to SNS. Cell Routers poll every 30 seconds or receive push notifications to refresh their in-memory ring.
Why Step Functions: Cell provisioning involves multiple AWS API calls, Terraform applies, health checks, and rollback logic. Step Functions provides the orchestration, retries, and error handling out of the box. A single workflow can provision a cell end-to-end in about 15 minutes.
This architecture keeps the data plane simple and fast while allowing complex operational automation in the control plane."
Edge Cases & Failure Scenarios
Scenario 1: DynamoDB Unavailable
Problem
DynamoDB Global Table in us-east-1 is experiencing throttling or regional outage.
Impact
- Cache misses cannot be resolved (5% of traffic)
- New customer assignments fail
Mitigation
- Cross-Region Failover: Route 53 health check detects DynamoDB unavailability, routes requests to EU region
- Degraded Mode: Cell Router falls back to pure consistent hashing (no persistence check)
- Extended Cache TTL: Temporarily extend Redis TTL from 1 hour to 24 hours
- Rate Limiting: Reduce new customer signup rate to avoid overwhelming when DynamoDB recovers
Recovery
When DynamoDB recovers, background job reconciles any assignments made during outage.
Scenario 2: Redis Cache Failure
Problem
ElastiCache cluster in us-east-1 fails (e.g., all nodes restart simultaneously).
Impact
- 100% cache miss rate (vs 5% normal)
- 20x increase in DynamoDB read load
- Routing latency increases from 5ms → 15ms
Mitigation
- Multi-AZ: ElastiCache cluster mode with 3 replicas across AZs (automatic failover)
- DynamoDB Auto-Scaling: On-demand mode handles read spike
- Circuit Breaker: If cache is down, skip cache entirely (direct to DynamoDB)
- Graceful Degradation: Inform customers of slight latency increase via status page
Scenario 3: Customer Segment Changes Mid-Request
Problem
Customer upgrades from SMB → Enterprise while their request is in flight.
Impact
- Request routed to SMB cell, but customer now belongs in Enterprise cell
- Data may be written to wrong cell
Mitigation
- Optimistic Locking: Use DynamoDB `version` field with conditional writes
- Dual-Write Period: During upgrade, write to both SMB and Enterprise cells for 5 minutes
- Cell Router Check: Before routing, verify segment hasn't changed (compare cached vs DynamoDB)
- Automatic Redirect: If segment mismatch detected, invalidate cache and re-route
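The Cell Router check and automatic redirect described above can be combined into a small helper. This is a sketch with injected lookups; `fetch_segment` and `invalidate` stand in for the authoritative DynamoDB read and the cache eviction.

```python
# Detect a stale cached segment and signal that the request must be re-routed.

def verify_segment(customer_id, cached_segment, fetch_segment, invalidate):
    current = fetch_segment(customer_id)   # authoritative DynamoDB read
    if current != cached_segment:
        invalidate(customer_id)            # drop stale cache entries
        return current, True               # True = caller must re-route
    return current, False
```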
Scenario 4: Hash Ring Desynchronization
Problem
Cell Router instances have inconsistent views of the hash ring (e.g., new cell added but not all routers refreshed).
Impact
- Same customer routed to different cells by different routers
- Data inconsistency
Mitigation
- Authoritative Source: DynamoDB `cell_metadata` table is the source of truth
- Version Number: Hash ring has a version number; routers check the version before routing
- Refresh Frequency: Cell Routers poll DynamoDB every 30 seconds for ring updates
- SNS Notification: When a cell is added/removed, publish to SNS → all routers immediately refresh
- Consistency Check: If the routing result doesn't match the DynamoDB assignment, trust DynamoDB and update the local ring
Scenario 5: Entire Cell Failure
Problem
Enterprise Cell A (us-east-1) experiences complete failure (EKS cluster down, AZ outage, etc.).
Impact
- All customers in Cell A cannot access services
- ~1,000 enterprise customers affected
Mitigation
- Health Checks: ALB target group health checks detect cell failure within 30 seconds
- Automatic Failover: Update DynamoDB: set `cell_status = inactive`
- Emergency Rebalancing: Step Function immediately migrates all Cell A customers to Cell B (same region)
- Dual-Cell Strategy: Enterprise customers replicated to 2 cells (primary + hot standby)
- RTO: < 5 minutes for full recovery
- RPO: < 1 second (DynamoDB Global Tables replication lag)
Interview Talking Points
🎤 2-Minute Summary: Cell Routing & Assignment
"For Twilio's cell routing, I use DynamoDB Global Tables as the source of truth for customer → cell mappings. The Cell Router is simple: check Redis cache (95% hit rate, ~5ms), query DynamoDB on miss (~15ms), or assign new customers to the least-loaded cell (~20ms)."
"For new customer assignment, I query cell metadata from DynamoDB to find the least-loaded cell in the customer's segment and region. The assignment uses a load score combining customer count and CPU utilization, so we always assign to the cell with the most capacity. A conditional write prevents race conditions if two Cell Routers try to assign the same customer simultaneously."
"This approach is simpler than consistent hashing because DynamoDB is already the source of truthโwhy calculate hashes when you can just look it up? It also gives better load distribution since we use real metrics instead of pseudo-random hash distribution."
"When adding new cells, the Control Plane provisions the AWS account and infrastructure. New customers automatically go to the new cell since it has the lowest load. For rebalancing existing customers, the Control Plane explicitly migrates them using a zero-downtime process: dual-write, background sync, atomic cutover, cleanup."
"The entire routing infrastructure costs about $1,866/month for 10M customers and 1B API callsโDynamoDB global tables, ElastiCache, and Lambda. That's roughly $0.000002 per API call for reliable cell isolation and sub-20ms routing."
💡 Key Interview Points
- Simplicity: "DynamoDB is the source of truth, so we just look it up. No hash ring versioning or synchronization."
- Performance: "95% cache hit rate means ~5ms latency for most requests. Cache miss is ~15ms from DynamoDB."
- Load Distribution: "Least-loaded assignment based on real metrics (CPU, customer count) beats hash-based distribution."
- Atomic Assignment: "DynamoDB conditional writes prevent race conditions when multiple Cell Routers assign same customer."
- Controlled Migration: "Control Plane explicitly decides which customers to migrate, when, and whereโnot dictated by hash ring changes."
- When Hashing Makes Sense: "Consistent hashing is great for stateless caches (Memcached) or database sharding (Cassandra), but not when you already have a database!"