๐Ÿ—๏ธ Twilio Cell-Based Architecture on AWS

Multi-Region, Multi-Master Database Design for Enterprise Scale

Architecture Overview

🎯 Core Design Principles

  • Full Cell Autonomy: Each enterprise cell is a completely isolated VPC with independent IP space (overlapping CIDRs allowed)
  • Blast Radius Containment: Cell failures affect only customers in that cell (~100 enterprise customers max per cell)
  • Multi-Region Active/Active: Customers write locally to their region, DynamoDB Global Tables replicate globally
  • VPC Lattice for Inter-Cell Routing: Service-based routing (not IP-based) enables overlapping IP spaces
  • Consistency by Use Case: Multi-master (DynamoDB) for logs/events, leader-follower (Aurora) for phone inventory/payments

๐Ÿ“ Key Architectural Decisions

| Decision | Approach | Rationale |
|---|---|---|
| Cell Isolation | VPC-per-cell for Enterprise, shared VPC for Mid-Market/SMB | Balance isolation (enterprise) vs operational efficiency (SMB) |
| IP Addressing | Overlapping RFC1918 (all cells use 10.0.0.0/16) | Full autonomy, simplified addressing, VPC Lattice handles routing |
| Cell Edge | Dedicated ALB (Enterprise) or shared ALB (Mid-Market/SMB); NLB only for non-HTTP or source-IP preservation | ALB for WAF/path routing, NLB for ultra-low latency |
| Inter-Cell Routing | VPC Lattice (service mesh) | Supports overlapping IPs, IAM auth, cross-VPC without peering |
| Database Strategy | DynamoDB Global Tables (multi-master) + Aurora Global (leader-follower) | Match consistency model to use case requirements |

How Overlapping IP Spaces Work with VPC Lattice

🔑 The Problem: Cell Autonomy with RFC1918 IP Overlaps

Challenge: For true cell autonomy, each Enterprise cell should be a fully independent VPC. But if we have 100+ enterprise cells, we'd run out of unique RFC1918 IP space. The solution: allow all cells to use the same CIDR blocks (e.g., 10.0.0.0/16).

✅ The Solution: VPC Lattice Service-Based Routing

VPC Lattice routes based on service names, not IP addresses. This means overlapping IP spaces across VPCs are perfectly fine.

How It Works:

  1. Each cell is a dedicated VPC with the same CIDR (10.0.0.0/16)
  2. VPC Lattice creates a Service Network that spans all VPCs in the region
  3. Each cell registers its services with unique service names (e.g., enterprise-cell-a-api, enterprise-cell-b-api)
  4. Cell Router sets X-Twilio-Cell-ID header (e.g., "enterprise-us-east-1-a")
  5. VPC Lattice routes to the service by name, not by IP address
  6. Service discovery happens via AWS Cloud Map - VPC Lattice resolves service name to the correct VPC's load balancer
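The steps above can be sketched as a toy routing table. This is illustrative only (real VPC Lattice routing is AWS-managed); the `SERVICE_REGISTRY` and `route` names are hypothetical. Note that every cell deliberately reuses the same pod IP: routing is keyed on service name, so no collision occurs.

```python
# Toy model of service-name routing across cells with overlapping CIDRs.
# All three cells expose a pod at 10.0.50.10 -- the IP is never used as a
# routing key, so the overlap is harmless.
SERVICE_REGISTRY = {
    "enterprise-cell-a-api": {"vpc": "enterprise-cell-a-vpc", "pod_ip": "10.0.50.10"},
    "enterprise-cell-b-api": {"vpc": "enterprise-cell-b-vpc", "pod_ip": "10.0.50.10"},
    "enterprise-cell-c-api": {"vpc": "enterprise-cell-c-vpc", "pod_ip": "10.0.50.10"},
}

def route(headers: dict) -> dict:
    """Resolve the target cell from the X-Twilio-Cell-ID header, not from IPs."""
    cell_id = headers["X-Twilio-Cell-ID"]            # e.g. "enterprise-us-east-1-b"
    service_name = f"enterprise-cell-{cell_id.rsplit('-', 1)[-1]}-api"
    return SERVICE_REGISTRY[service_name]

target = route({"X-Twilio-Cell-ID": "enterprise-us-east-1-b"})
print(target["vpc"])  # enterprise-cell-b-vpc
```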

Example: Three cells with overlapping IPs

  • enterprise-cell-a-vpc: 10.0.0.0/16 → Service: enterprise-cell-a-api
  • enterprise-cell-b-vpc: 10.0.0.0/16 → Service: enterprise-cell-b-api
  • enterprise-cell-c-vpc: 10.0.0.0/16 → Service: enterprise-cell-c-api

All three cells have pod IPs like 10.0.50.10, but VPC Lattice routes correctly because it uses service names, not IPs.

💡 Key Benefit: No VPC peering needed! VPC Lattice handles cross-VPC communication with built-in IAM authentication, observability, and no risk of IP conflicts.

Architecture Diagrams

๐Ÿ“ Creating Professional AWS Architecture Diagrams

For interview presentations, create professional diagrams in draw.io or CloudCraft using the official AWS Architecture Icons.

Complete Request Flow Architecture

🔄 End-to-End Request Flow

Diagram summary (original graphic flattened to text):

  • Global layer: Client (HTTPS) → Route 53 (geolocation routing) → CloudFront CDN (TLS termination, WAF, caching) → us-east-1 / eu-west-1 / ap-southeast-1
  • Regional layer (us-east-1): Global Accelerator (anycast IPs) → internet-facing regional ALB → Cell Router Lambda (1. extract customer_id from JWT, 2. query ElastiCache/DynamoDB, 3. assign new customers to the least-loaded cell) → VPC Lattice Service Network (routes by service name, NOT IP, so overlapping 10.0.0.0/16 CIDRs are fine)
  • Cell layer (overlapping IPs): Cells A-D, each a VPC with CIDR 10.0.0.0/16, fronted by an ALB (HTTPS:443) into an EKS cluster (Cell A: Voice 100 pods, SMS 80 pods) plus DynamoDB and Aurora; ALL cells share the 10.0.0.0/16 range
  • Data layer (multi-region replication): DynamoDB Global Tables (multi-master, eventual consistency, active-active writes) for logs, events, sessions, and the customer_id → cell_id mapping; Aurora Global Database (leader-follower, strong consistency, read replicas in each region) for phone inventory and payment transactions
  • Total latency: CloudFront (5ms) + ALB (5ms) + Cell Router (10ms) + VPC Lattice (2ms) + App (50ms) + DB (5ms) ≈ 77ms

📋 Diagram Specifications for Interview

Create these diagrams using draw.io or CloudCraft with AWS icons:

1. Multi-Region Overview Diagram

Components to include:

  • Route 53 (global) → CloudFront (global) → Regional ALBs (us-east-1, eu-west-1, ap-southeast-1)
  • Cell Router Lambda in each region
  • VPC Lattice service network (spans multiple VPCs in region)
  • 3-4 cell VPCs per region, all labeled "10.0.0.0/16" to show overlapping IPs
  • Arrows showing data replication (DynamoDB bidirectional, Aurora primary → replica)
2. Cell Detail Diagram

Components to include:

  • VPC (10.0.0.0/16) with public/private subnets across 3 AZs
  • ALB in public subnets
  • EKS cluster nodes in private subnets
  • Pods (Voice, SMS, Auth services)
  • DynamoDB and Aurora databases
  • ElastiCache Redis cluster
  • NAT Gateways, Internet Gateway
  • Security group arrows showing traffic flow
3. VPC Lattice Overlapping IP Diagram

Show how overlapping IPs work:

  • VPC Lattice service network at the top
  • 3 VPCs below, ALL labeled "10.0.0.0/16"
  • Each VPC has different service name (enterprise-a-api, enterprise-b-api, enterprise-c-api)
  • Annotation: "VPC Lattice routes by service name, not IP"
  • Show pods in each VPC with same IP (e.g., 10.0.50.10) to emphasize overlapping
Multi-Region Request Flow (diagram summary)

  • Entry: Client HTTPS request → Route 53 (hosted zone twilio.com, geolocation routing policy, /health checks every 30s) → CloudFront (distribution api.twilio.com, custom ACM certificate, cache TTL 60s for GET / 0s for POST, origin failover group of regional ALBs, WAF rate limiting 10k req/5min per IP) → US / EU / APAC traffic
  • us-east-1: Global Accelerator (2 static anycast IPs, regional ALB endpoint at weight 100) → API Gateway (regional REST API, custom domain, 10k req/sec throttle) → ALB (TLS 1.3 termination, path-based routing, sticky sessions) → Cell Router Lambda (customer → cell lookup via ElastiCache/DynamoDB, least-loaded assignment for new customers, sets X-Twilio-Cell-ID; Node.js 20, 1024MB, 10s timeout) → VPC Lattice (routes on X-Twilio-Cell-ID, enabling overlapping IPs). Cells: Enterprise (EKS, 1000 m6i.4xlarge nodes, 3 AZs, pod autoscaling, ~100 customers), Mid-Market shared VPC (EKS, 300 m6i.2xlarge nodes, Spot instances, ~5,000 customers), SMB shared VPC (ECS Fargate, 1 vCPU/2GB tasks, 50-500 tasks, ~50,000 customers). Data layer: DynamoDB Global Tables (multi-master, on-demand), Aurora PostgreSQL 15 (primary region, Multi-AZ), ElastiCache Redis 7 (cluster mode, 3 shards, 2 replicas). VPC 10.0.0.0/16, 3 AZs, private subnets for compute
  • eu-west-1: same anycast IPs via Global Accelerator; API Gateway with GDPR-compliant data residency; ALB and Cell Router with the same logic (reads the local DynamoDB replica, writes to the Global Table) → VPC Lattice. Cells: Enterprise (EKS, 800 m6i.4xlarge nodes, ~80 customers), Mid-Market (EKS, 250 m6i.2xlarge nodes, Spot instances, ~4,000 customers), SMB (Fargate, 40-400 tasks, ~40,000 customers). Data layer: DynamoDB replica region (local writes OK), Aurora read replica (read-only, < 1s lag), ElastiCache Redis 7 (3 shards, 2 replicas). VPC 10.1.0.0/16, GDPR-compliant data residency
  • ap-southeast-1: same anycast IPs; APAC-optimized ALB (TLS 1.3, Singapore edge location, low-latency routing); Cell Router with the same logic → VPC Lattice. Cells: Enterprise (EKS, 500 m6i.4xlarge nodes, ~50 customers), Mid-Market (EKS, 200 m6i.2xlarge nodes, ~3,000 customers), SMB (Fargate, 30-300 tasks, ~30,000 customers). Data layer: DynamoDB replica region (local writes OK), Aurora read replica (< 1s lag), ElastiCache Redis 7 (2 shards, 2 replicas). VPC 10.2.0.0/16
  • DynamoDB Global Tables replicate between all three regions

🔧 Key Configuration Details

Route 53 Configuration

  • Routing Policy: Geolocation (US → us-east-1, EU → eu-west-1, APAC → ap-southeast-1)
  • Health Checks: /health endpoint, 30s interval, 3 failures = unhealthy
  • Failover: If primary region unhealthy, route to secondary region
  • TTL: 60 seconds (balance between failover speed and DNS query volume)

CloudFront Configuration

  • Origin: Origin group with regional ALBs (automatic failover)
  • Cache Behavior: GET requests cached 60s, POST/PUT/DELETE bypass cache
  • SSL/TLS: Custom certificate from ACM, TLS 1.2+ only
  • WAF: Rate limiting (10,000 requests per 5 minutes per IP)

Global Accelerator Configuration

  • Anycast IPs: 2 static IPv4 addresses (global)
  • Endpoints: Regional ALBs in each region (weight: 100)
  • Health Checks: Port 443, HTTPS protocol, 30s interval
  • Traffic Dial: 100% to healthy endpoints, automatic failover

VPC Lattice Configuration

  • Service Network: Cross-VPC service discovery and routing
  • Routing: Header-based routing on X-Twilio-Cell-ID
  • Auth: IAM-based service-to-service authentication
  • Observability: Access logs to S3, metrics to CloudWatch

ALB Configuration

  • Scheme: Internet-facing, cross-zone load balancing enabled
  • TLS: TLS 1.3, certificate from ACM, ALPN support
  • Routing: Path-based routing (/v1/voice, /v1/sms, etc.)
  • Sticky Sessions: Application cookie, 1-hour duration

Cell Router Lambda Configuration

  • Runtime: Node.js 20, arm64 (Graviton2)
  • Memory: 1024MB, Timeout: 10 seconds
  • Concurrency: Reserved concurrency per region (1000)
  • VPC: Deployed in VPC to access ElastiCache privately
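The router's three steps (cache lookup, table fallback, least-loaded assignment) can be sketched as follows. The doc specifies a Node.js 20 Lambda; this Python sketch is for illustration only, with dicts standing in for ElastiCache and DynamoDB, and the JWT payload decoded without signature verification (real code must verify first). All names are hypothetical.

```python
import base64
import json

# In-memory stand-ins for ElastiCache and the DynamoDB mapping table.
CACHE: dict[str, str] = {}
CUSTOMER_CELL_TABLE = {"cust-123": "enterprise-us-east-1-a"}
CELL_LOAD = {"enterprise-us-east-1-a": 92, "enterprise-us-east-1-b": 40}

def extract_customer_id(jwt: str) -> str:
    # Decode the JWT payload only; production code verifies the signature.
    payload = jwt.split(".")[1]
    payload += "=" * (-len(payload) % 4)           # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))["customer_id"]

def assign_cell(customer_id: str) -> str:
    if customer_id in CACHE:                       # step 2: cache hit
        return CACHE[customer_id]
    cell = CUSTOMER_CELL_TABLE.get(customer_id)    # step 2: table fallback
    if cell is None:                               # step 3: new customer
        cell = min(CELL_LOAD, key=CELL_LOAD.get)   # least-loaded cell wins
        CUSTOMER_CELL_TABLE[customer_id] = cell
    CACHE[customer_id] = cell
    return cell

def handler(event: dict) -> dict:
    customer_id = extract_customer_id(event["headers"]["authorization"])
    event["headers"]["X-Twilio-Cell-ID"] = assign_cell(customer_id)
    return event
```

A new customer lands in the least-loaded cell and the assignment is then sticky via the cache and table.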

📊 Traffic Flow Example

  1. Client makes HTTPS request to api.twilio.com
  2. Route 53 resolves DNS based on geolocation → Returns CloudFront distribution
  3. CloudFront terminates TLS, applies WAF rules, routes to nearest regional ALB
  4. Global Accelerator (optional) provides anycast routing to nearest region
  5. ALB terminates TLS (if not via CloudFront), routes based on path to API Gateway or directly to Cell Router
  6. Cell Router Lambda extracts customer ID from JWT, queries ElastiCache/DynamoDB for cell assignment, assigns new customers to least-loaded cell, sets X-Twilio-Cell-ID header
  7. VPC Lattice routes request to appropriate EKS cluster based on X-Twilio-Cell-ID header
  8. EKS/ECS processes request in customer's assigned cell
  9. Data Layer reads from local DynamoDB replica (eventual consistency) or Aurora read replica, writes to DynamoDB Global Table (multi-master) or Aurora primary (strong consistency)
  10. Response flows back through VPC Lattice → Cell Router → ALB → CloudFront → Client

⚡ Total Latency Budget: CloudFront (5ms) + ALB (5ms) + Cell Router (5-15ms) + VPC Lattice (2ms) + Application (20-100ms) + Database (5-10ms) = 42-137ms end-to-end
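The latency budget is just the sum of the per-hop ranges; a quick check that the stated 42-137ms total is consistent with the components:

```python
# Each component is a (low_ms, high_ms) range taken from the budget above.
BUDGET = {
    "CloudFront":  (5, 5),
    "ALB":         (5, 5),
    "Cell Router": (5, 15),
    "VPC Lattice": (2, 2),
    "Application": (20, 100),
    "Database":    (5, 10),
}

low = sum(lo for lo, _ in BUDGET.values())
high = sum(hi for _, hi in BUDGET.values())
print(f"{low}-{high} ms end-to-end")  # 42-137 ms end-to-end
```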

🔀 Routing Responsibilities: VPC Lattice vs EKS

Inter-Cell Routing (VPC Lattice)

Responsibility: Route requests between cells based on customer assignment

How it works:

  1. Cell Router Lambda sets X-Twilio-Cell-ID header (e.g., "enterprise-us-east-1-a")
  2. VPC Lattice inspects header and routes to correct cell's service network
  3. Works across VPCs, availability zones, and accounts
  4. Provides IAM-based authentication between services

Example: Customer A (assigned to Enterprise Cell A) → VPC Lattice → Enterprise Cell A EKS cluster

Intra-Cell Routing (EKS/Kubernetes)

Responsibility: Route requests within a cell between microservices

How it works:

  1. Request arrives at cell's EKS cluster via VPC Lattice
  2. Kubernetes Ingress Controller (NGINX/ALB Ingress) receives request
  3. Routes to appropriate service based on path (/v1/voice → voice-service)
  4. Service mesh (Istio/Linkerd) handles service-to-service communication

Example: Within Enterprise Cell A: voice-service → auth-service → database-service

🔑 Key Distinction: VPC Lattice routes to the correct cell (inter-cell), while EKS/Kubernetes routes within the cell to the correct microservice (intra-cell). This separation allows cells to be completely isolated while still enabling centralized routing logic.
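The two layers can be modeled as two independent lookups, which is the whole point of the separation: the inter-cell layer never needs to know about paths, and the intra-cell layer never needs to know about other cells. All names here are illustrative.

```python
# Layer 1 (inter-cell): VPC Lattice picks the cell from the header.
CELL_ENDPOINTS = {"enterprise-us-east-1-a": "cell-a-ingress"}
# Layer 2 (intra-cell): the in-cell ingress picks the microservice by path.
PATH_ROUTES = {"/v1/voice": "voice-service", "/v1/sms": "sms-service"}

def inter_cell(headers: dict) -> str:   # header-based, crosses VPC boundaries
    return CELL_ENDPOINTS[headers["X-Twilio-Cell-ID"]]

def intra_cell(path: str) -> str:       # path-based, stays inside the cell
    return PATH_ROUTES[path]

ingress = inter_cell({"X-Twilio-Cell-ID": "enterprise-us-east-1-a"})
service = intra_cell("/v1/voice")
print(ingress, "->", service)  # cell-a-ingress -> voice-service
```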

💡 Alternative: For larger enterprises, you could use Istio multi-cluster to span a service mesh across cells, but this increases complexity. VPC Lattice + per-cell Kubernetes Ingress is simpler and maintains better fault isolation.

Network Boundaries & Cell Edge Architecture

๐Ÿ—๏ธ Cell Network Boundary Design

VPC Structure Options

| Approach | VPC Topology | Isolation Level | Use Case | Cell Edge Device |
|---|---|---|---|---|
| Option 1: VPC-per-Cell ⭐ | Each cell = dedicated VPC | 🟢 Strongest (network isolation) | Enterprise cells | VPC Lattice → ALB → EKS Pods (IP mode) |
| Option 2: Shared VPC, Subnet Isolation | Cells in different subnets | 🟡 Moderate (security groups) | Mid-Market, SMB cells | VPC Lattice → ALB (target groups) → EKS |
| Option 3: Namespace Isolation | Cells = Kubernetes namespaces | 🟠 Weakest (logical only) | Dev/test environments | Shared ALB → Ingress Controller → Namespaces |

Recommended Architecture: Hybrid Approach

Enterprise Cells (Top 100 customers):

  • VPC Boundary: Dedicated VPC per cell (e.g., enterprise-cell-a-vpc, enterprise-cell-b-vpc) - ALL use 10.0.0.0/16 (overlapping IPs)
  • Subnets: Private subnets across 3 AZs for EKS, public subnets for NAT gateways
  • Edge Device: Application Load Balancer (ALB) registered with VPC Lattice as target
  • Why ALB: Layer 7 routing (path/host), WAF integration, TLS termination, managed by AWS Load Balancer Controller
  • Traffic Flow: VPC Lattice → ALB → Target Group (EKS pods in IP mode) → Pods
  • ✓ VPC Lattice enables overlapping IPs - routes by service name, not IP address
  • Note: Use NLB instead if you need source IP preservation or non-HTTP protocols

Mid-Market Cells:

  • VPC Boundary: Shared VPC with isolated subnets per cell (e.g., midmarket-vpc = 10.1.0.0/16)
  • Subnets: Cell A (10.1.0.0/18), Cell B (10.1.64.0/18), Cell C (10.1.128.0/18)
  • Edge Device: Application Load Balancer (ALB) with separate target groups per cell
  • Why ALB: Layer 7, path-based routing, WAF integration, TLS termination
  • Traffic Flow: VPC Lattice → ALB → Target Group (Cell-specific) → EKS Pods

SMB Cells (ECS Fargate):

  • VPC Boundary: Shared VPC with Fargate tasks in isolated subnets
  • Edge Device: ALB with target groups pointing to Fargate service
  • Why ALB: Native ECS integration, automatic task registration/deregistration
  • Traffic Flow: VPC Lattice → ALB → Fargate Tasks (via AWS Cloud Map service discovery)

What Sits at the Cell Edge?

🎯 Answer: It depends on the cell type, but generally:

For Enterprise Cells (VPC-per-cell):

  1. VPC Lattice Service Network routes traffic to the cell's VPC service
  2. Application Load Balancer (ALB) sits at the VPC edge, registered as VPC Lattice target
  3. ALB routes directly to Target Group containing EKS pods (IP mode)
  4. AWS Load Balancer Controller manages ALB lifecycle and target registration

For Mid-Market/SMB Cells (Shared VPC):

  1. VPC Lattice routes to a shared ALB within the VPC
  2. ALB sits at the cell's logical edge, with listener rules routing to cell-specific target groups
  3. Target groups contain either EKS worker nodes (NodePort) or Fargate tasks
  4. Kubernetes/ECS service discovery handles pod-level routing

🔑 Key Point: The ALB/NLB is NOT shared across all cells. Enterprise cells get dedicated ALBs (or NLBs for non-HTTP traffic) in their own VPCs. Mid-Market/SMB cells may share an ALB but use separate target groups for isolation.

Regional vs Cell-Level ALB

โš ๏ธ Clarification on the Architecture Diagram:

In the diagram above, the regional ALB shown is actually the ingress ALB that sits in front of the Cell Router Lambda, NOT the cell edge devices. Here's the corrected flow:

  1. CloudFront/Global Accelerator → Regional Ingress ALB (internet-facing, routes to Cell Router Lambda)
  2. Cell Router Lambda determines cell assignment, sets X-Twilio-Cell-ID header
  3. VPC Lattice reads header, routes to cell-specific NLB/ALB
  4. Cell Edge NLB/ALB → Kubernetes Ingress or ECS Service
  5. Ingress/Service → Application Pods/Tasks

So there are TWO load balancer layers:

  • Layer 1 (Regional): Ingress ALB for Cell Router Lambda (1 per region)
  • Layer 2 (Cell): ALB at each cell's edge, registered with VPC Lattice (1 per enterprise cell, or shared for SMB)

💡 Why ALB at cell edge, not NLB: VPC Lattice supports ALB as a target type. ALB provides Layer 7 features (path routing, host headers, WAF integration) while VPC Lattice handles the service discovery. Only use NLB if you need source IP preservation or non-HTTP protocols.

Example: Enterprise Cell Network Boundary

VPC: enterprise-cell-a-vpc (10.0.0.0/16) ← Same CIDR as all other enterprise cells

Subnets:

  • Public: 10.0.0.0/20 (AZ-a), 10.0.16.0/20 (AZ-b), 10.0.32.0/20 (AZ-c) - ALB nodes
  • Private: 10.0.64.0/18 (AZ-a), 10.0.128.0/18 (AZ-b), 10.0.192.0/18 (AZ-c) - EKS nodes

Edge Device: ALB (enterprise-a-alb) created by AWS Load Balancer Controller, registered with VPC Lattice

VPC Lattice Target: ALB ARN registered as service target

ALB Target Group: EKS pods in IP target mode (port 8080)

Kubernetes: AWS Load Balancer Controller watches Ingress resources, auto-creates ALB + target groups

Security Groups:

  • ALB SG: Allow 443 from VPC Lattice service network CIDR (169.254.171.0/24)
  • EKS Pod SG: Allow 8080 from ALB SG, allow pod-to-pod within VPC
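The subnet plan above can be sanity-checked mechanically with the standard library: every subnet must fit inside the VPC CIDR and no two subnets may overlap.

```python
import ipaddress

# The enterprise-cell-a-vpc subnet plan from the example above.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = [ipaddress.ip_network(c) for c in [
    "10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20",     # public (ALB nodes)
    "10.0.64.0/18", "10.0.128.0/18", "10.0.192.0/18",  # private (EKS nodes)
]]

# Every subnet is contained in the VPC, and no pair overlaps.
assert all(s.subnet_of(vpc) for s in subnets)
assert not any(a.overlaps(b) for i, a in enumerate(subnets)
               for b in subnets[i + 1:])
print("subnet plan is valid")
```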

Kubernetes Ingress Controller: AWS Load Balancer Controller vs NGINX

For the cell edge ingress layer, there are two primary options. Here's the trade-off analysis:

✅ AWS Load Balancer Controller

Recommended for this AWS-native architecture

How it works:

  • Kubernetes controller watches Ingress resources
  • Automatically provisions ALB/NLB per Ingress
  • Integrates directly with VPC Lattice targets
  • Uses AWS SDK to manage load balancers

Advantages:

  • ✅ AWS-native integration (VPC Lattice, WAF, Shield)
  • ✅ Automatic ALB/NLB lifecycle management
  • ✅ Native support for target groups, health checks
  • ✅ Managed by AWS (security patches, updates)
  • ✅ Better observability (CloudWatch, X-Ray)
  • ✅ IP or instance target modes

Disadvantages:

  • โŒ AWS vendor lock-in (can't move to GCP/Azure easily)
  • โŒ Less feature-rich than NGINX (no Lua, plugins)
  • โŒ ALB costs (~$20/month per Ingress)
  • โŒ Limited advanced routing (no regex, rewrite limits)

Best for: AWS-committed architectures, teams preferring managed services, production workloads needing WAF/Shield integration

NGINX Ingress Controller

Alternative for multi-cloud flexibility

How it works:

  • NGINX pods run as DaemonSet on nodes
  • Exposed via single NLB (cost-efficient)
  • NGINX handles all routing internally
  • Config via Kubernetes Ingress annotations

Advantages:

  • ✅ Multi-cloud portable (works on EKS, GKE, AKS)
  • ✅ Feature-rich (rate limiting, auth, rewrites, Lua)
  • ✅ Battle-tested, huge community
  • ✅ Single NLB cost (vs ALB per Ingress)
  • ✅ Advanced routing (regex, complex rewrites)
  • ✅ Extensive customization via ConfigMaps

Disadvantages:

  • โŒ Self-managed (you patch, upgrade, scale)
  • โŒ Extra network hop (NLB โ†’ NGINX pod โ†’ app pod)
  • โŒ Less AWS-native integration
  • โŒ Resource overhead (NGINX pods consume CPU/memory)
  • โŒ Need to manage NGINX pod scaling

Best for: Multi-cloud strategy, need advanced routing features, team has NGINX expertise, cost-sensitive (fewer ALBs)

📊 Recommendation for This Architecture

Use AWS Load Balancer Controller for the following reasons:

  1. Already committed to AWS: You're using VPC Lattice, DynamoDB Global Tables, Aurora - already AWS-native
  2. VPC Lattice integration: ALB/NLB created by controller can be registered as VPC Lattice targets seamlessly
  3. WAF requirement: Twilio needs DDoS protection - ALB integrates directly with WAF
  4. Operational simplicity: Managed service reduces ops burden vs self-managing NGINX at scale
  5. Cell isolation: Separate ALBs per cell provide better blast radius containment than shared NGINX

When to use NGINX instead: If you need advanced routing (complex regex, Lua scripts) or if maintaining multi-cloud optionality is critical

Cell Network Boundary Detail

Diagram summary (original graphic flattened to text):

  • VPC Lattice Service Network: routes based on the X-Twilio-Cell-ID header; service network CIDR 169.254.171.0/24 (AWS-managed)
  • Enterprise Cell A (dedicated VPC, 10.10.0.0/16): Network Load Balancer at the cell edge in private subnets across 3 AZs (listener 443/TCP → NGINX Ingress), registered with VPC Lattice as a target; the NGINX Ingress Controller routes by path/host to Kubernetes Services in namespace customer-12345: Voice (100 pods, 2 cores/4GB each, HPA 50-200), SMS (80 pods, 1 core/2GB, HPA 40-150), Auth (20 pods, 0.5 core/1GB, HPA 10-50)
  • Mid-Market (shared VPC, 10.20.0.0/16): shared ALB at the cell edge with a target group per cell, routing by X-Twilio-Cell-ID, security groups enforcing isolation; ALB → NodePort (30000-32767) → pods in a shared EKS cluster with per-customer namespaces (customer-001: 100 pods, customer-002: 80 pods, customer-003: 120 pods), isolated via Network Policies and Resource Quotas
  • SMB (shared VPC, 10.30.0.0/16): ALB at the cell edge targeting an ECS Fargate service, tasks auto-registered via AWS Cloud Map; 50-500 auto-scaling tasks at 1 vCPU/2GB, ~100 customers per task, awsvpc networking (ENI per task)

Note: this diagram shows the NLB + NGINX variant at the enterprise cell edge and non-overlapping CIDRs; the recommended edge for this architecture is the ALB-based one described earlier.

Cell Architecture Terminology & AWS Implementation

๐Ÿ—๏ธ Mapping Cell-Based Architecture to AWS Implementation

Implementing cell boundaries as separate AWS accounts and VPCs is recognized as a best practice for managing scale and isolation. This physical realization directly addresses account quotas and VPC limits such as Network Address Usage (NAU).

| Your Term / Component | Industry Term | AWS Implementation | Function |
|---|---|---|---|
| Cell | Unit of Scale / Deployment Cell | Dedicated AWS Account + VPC (10.0.0.0/16) | Logical grouping representing a unit of scale and deployment; change-set boundary at deployment time to limit blast radius. Contains a size-capped workload replica. |
| Cloud Native Landing Zone | Management Plane | AWS Control Tower Multi-Account Environment | Foundational substrate that sets up and governs a secure multi-account AWS environment. Provides design-time governance, security guardrails, and account-level orchestration. |
| Control Plane | Control Plane / Orchestrator | Infrastructure-as-Code + Lambda/Step Functions | The "cockpit of the platform" that interacts with your Landing Zone to manage cell lifecycle: provisioning new cell accounts, de-provisioning old ones, and migrating tenants between cells. |
| Platform Services / Global Cell | Shared Singleton Services / Tier 0 | Multi-Region Deployment (Identity, IAM, etc.) | Critical services that cannot be easily cellularized (Identity, Access Management, Scheduler). Often reside in a "Global Cell" as Tier 0 dependencies; must be multi-region to avoid a single point of failure. |
| Cell Router | Traffic Partitioning Layer / Router | Route 53 / API Gateway / Lambda | The "thinnest possible layer" that uses a partition key (Customer ID) to route incoming requests to the specific endpoint of the AWS account/VPC where that customer's cell resides. |

🚢 Shipping Port Analogy

  • Cloud Native Landing Zone (Management Plane) = The port authority, providing the docks and security rules
  • Cell (AWS Account + VPC) = A standardized shipping container; if one container is damaged (deployment failure), the contents of others are safe
  • Cell Router = The harbor pilot who knows exactly which container belongs to which customer
  • Control Plane = The crane operator who brings new containers onto the dock as the port grows
  • Platform Services/Global Cell = The port's central administration building that all ships must check in with

Why Separate AWS Accounts for Cells?

✅ Account-Level Benefits

  • Natural Isolation Boundary: Actions in one account cannot impact others, effectively bypassing regional account-level limits
  • Quota Independence: Each cell gets its own set of AWS service quotas (EC2 instances, VPCs, NAT Gateways, etc.)
  • Blast Radius Containment: IAM misconfigurations, security incidents, or service disruptions are contained within the account
  • Cost Allocation: Perfect cost attribution per cell/customer segment using AWS Cost Explorer and tagging
  • Compliance & Auditing: Clear security boundaries for regulatory requirements (SOC2, HIPAA, PCI-DSS)
  • VPC Limit Bypass: Avoid VPC-level Network Address Usage (NAU) limits by having dedicated VPCs per account

📊 Sizing Example: Enterprise Cells

Cell Capacity Planning:

  • Max Customers per Cell: 100 enterprise customers
  • EKS Cluster per Cell: 1 cluster with 20-50 nodes
  • Fixed Maximum Size: When Cell A reaches 100 customers, Control Plane automatically provisions Cell B
  • Saturation Limit: CPU > 70%, Memory > 75%, or customer count = max triggers new cell creation
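The saturation rule above is a simple predicate; a minimal sketch (thresholds and function name taken from or invented for this example):

```python
# A cell triggers new-cell provisioning when CPU > 70%, memory > 75%,
# or the customer count reaches the fixed maximum.
MAX_CUSTOMERS = 100

def needs_new_cell(cpu_pct: float, mem_pct: float, customers: int) -> bool:
    return cpu_pct > 70 or mem_pct > 75 or customers >= MAX_CUSTOMERS

print(needs_new_cell(65, 60, 100))  # True  (customer cap reached)
print(needs_new_cell(50, 50, 80))   # False (healthy headroom)
```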

Control Plane Automation

Control Plane Responsibilities

  1. Cell Provisioning: Use AWS Control Tower Account Factory API to create new AWS account → Deploy VPC + EKS + DynamoDB via Terraform/CloudFormation → Register cell with VPC Lattice Service Network
  2. Tenant Onboarding: When new customer signs up → Determine segment (SMB/Mid-Market/Enterprise) → Assign to least-loaded cell in segment → Update DynamoDB customer_cell_mapping table → Cache assignment in Redis
  3. Cell Scaling: Monitor cell capacity metrics → Trigger Step Function when saturation threshold reached → Provision new cell account → New customers automatically route to new cell (lowest load)
  4. Cell Migration: Dual-write phase → Background data sync → Atomic cutover (update DynamoDB + invalidate cache) → Cleanup old cell data after 24 hours
  5. Cell Decommissioning: Drain traffic → Migrate all customers to other cells → Delete AWS account via Control Tower
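The migration sequence (step 4) can be sketched as a small state change over in-memory stand-ins; real code would use DynamoDB, Redis, and a Step Function, and all names here are illustrative.

```python
# Dicts stand in for the mapping table, the Redis cache, and per-cell data.
mapping = {"cust-123": "cell-a"}
cache = dict(mapping)
cell_data = {"cell-a": {"cust-123": ["rec1", "rec2"]}, "cell-b": {}}

def migrate(customer: str, src: str, dst: str) -> None:
    cell_data[dst][customer] = []                              # 1. dual-write begins
    cell_data[dst][customer] = list(cell_data[src][customer])  # 2. background backfill
    mapping[customer] = dst                                    # 3a. atomic cutover
    cache.pop(customer, None)                                  # 3b. invalidate cache
    # 4. source-cell data retained for 24 hours, then deleted (not modeled)

migrate("cust-123", "cell-a", "cell-b")
print(mapping["cust-123"])  # cell-b
```

The cutover is "atomic" from the router's perspective: once the mapping row flips and the cache entry is gone, every new lookup resolves to the destination cell.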

Multi-Region Architecture Considerations

๐ŸŒ Regional Cell Deployment

Each region (us-east-1, eu-west-1, ap-southeast-1) has its own set of cells:

Example Cell Distribution:

  • us-east-1: enterprise-cell-a, enterprise-cell-b, midmarket-cell-shared, smb-cell-shared
  • eu-west-1: enterprise-cell-eu-a, enterprise-cell-eu-b, midmarket-cell-eu-shared
  • ap-southeast-1: enterprise-cell-apac-a, midmarket-cell-apac-shared

Global Cell (Platform Services):

  • Identity Service: Deployed in all 3 regions with DynamoDB Global Tables for user authentication
  • API Key Management: Replicated globally for low-latency validation
  • Billing Service: Multi-region for quota checks and balance deductions

โš ๏ธ Critical: Platform Services are "Tier 0" dependencies - if Identity service fails globally, all cells are affected. Therefore, these MUST be deployed multi-region with active-active replication.

Interview Talking Points

🎤 2-Minute Summary: Cell-Based Architecture on AWS

"For Twilio's cell-based architecture, I'd implement cells as dedicated AWS accounts with their own VPCs, all using the same CIDR block (10.0.0.0/16) for maximum autonomy. This bypasses VPC-level quotas and enables true fault isolation."

"The Cloud Native Landing Zone is built on AWS Control Tower, which sets up the multi-account governance structure. The Control Plane automates cell lifecycleโ€”provisioning new accounts when capacity is reached, using infrastructure-as-code to deploy identical cell stacks, and managing customer migrations."

"VPC Lattice solves the overlapping IP problem by routing based on service names, not IP addresses. The Cell Router looks up which cell a customer belongs to in DynamoDB (cached in Redis), then VPC Lattice routes the request to that cell's service endpointโ€”no VPC peering needed."

"Platform Services like Identity and IAM run in a separate Global Cell, deployed multi-region as Tier 0 dependencies. All regional cells depend on these for authentication and authorization, so they must be highly available across multiple regions."

"This architecture balances operational efficiency with isolation: enterprise customers get dedicated cells (accounts), mid-market shares cells in a VPC, and SMB uses namespace isolation within shared clusters. The Control Plane automates it all."

Cell Partitioning Strategy

Why Cell-Based Architecture?

Cell-based architecture is a pattern where you partition your infrastructure into isolated, independent units (cells). Each cell is a complete deployment of your application stack, limiting the blast radius of failures.

Benefits: Fault isolation, independent scaling, easier testing, phased rollouts, regional compliance.

Cell Partitioning Dimensions

| Partition Type | Strategy | Rationale | AWS Implementation |
|---|---|---|---|
| Customer Size | Enterprise, Mid-Market, SMB | Different SLAs, resource needs, isolation requirements | Separate EKS clusters, dedicated capacity reservations |
| Geography | US, EU, APAC regions | Data residency, latency, compliance (GDPR, etc.) | Multi-region deployment, Route 53 geolocation routing |
| Product Vertical | Voice, SMS, Video, Verify | Different latency/consistency requirements | Product-specific microservices in each cell |
| Availability Zone | Multi-AZ within region | High availability, AZ failure isolation | EKS node groups across 3 AZs, Aurora Multi-AZ |

Cell Design Per Customer Segment

๐Ÿข Enterprise Cell

Customers: Uber, Lyft, Airbnb (top 100 customers)

Characteristics:

  • Dedicated EKS cluster (1000+ nodes)
  • Reserved capacity, dedicated NAT gateways
  • Aurora Multi-Master for writes
  • DynamoDB Global Tables
  • 99.99% SLA
  • Priority support, custom metrics

๐Ÿช Mid-Market Cell

Customers: Growing startups (1000-10,000 customers)

Characteristics:

  • Shared EKS cluster (200-500 nodes)
  • Burstable capacity
  • Aurora read replicas (writes to primary)
  • DynamoDB Global Tables
  • 99.95% SLA
  • Standard support

๐Ÿ  SMB Cells (A/B/C/D)

Customers: Small businesses (100,000+ customers)

Characteristics:

  • ECS Fargate (serverless, cost-optimized)
  • Spot instances for batch workloads
  • Aurora read replicas (shared primary)
  • DynamoDB on-demand pricing
  • 99.9% SLA
  • Community support

⚡ Cell Routing Strategy

How requests route to cells:

  1. Route 53: Geolocation routing directs to nearest region
  2. API Gateway / ALB: Custom domain with cell identifier in header/path
  3. Cell Router Service: Looks up customer โ†’ cell mapping in DynamoDB
  4. VPC Lattice: Service mesh routes to correct EKS cluster/namespace

Customer ID → Hash → Cell Assignment → Persistent mapping in DynamoDB
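A minimal sketch of that hash-then-persist flow, with a dict standing in for the DynamoDB mapping table (names are illustrative): the hash only decides the initial placement, and the persisted mapping keeps the assignment sticky even if cells are later added.

```python
import hashlib

CELLS = ["smb-cell-a", "smb-cell-b", "smb-cell-c", "smb-cell-d"]
ASSIGNMENTS: dict[str, str] = {}   # stand-in for the DynamoDB mapping table

def assign(customer_id: str) -> str:
    if customer_id not in ASSIGNMENTS:              # persisted mapping wins
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        ASSIGNMENTS[customer_id] = CELLS[int(digest, 16) % len(CELLS)]
    return ASSIGNMENTS[customer_id]

assert assign("AC123") == assign("AC123")           # deterministic and sticky
```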

Cell Taxonomy: Workflow-Affinity, Operationally-Differentiated

🎯 Core Principle: Group by Workflow Affinity, Not Product Catalog

Services that communicate at runtime should share a cell. Services with uncorrelated demand and no runtime dependencies can be separated for better capacity efficiency.

Why? Co-location solves cross-cell API calls and cascading failures. But forcing unrelated services together creates capacity modeling challengesโ€”you're sizing cells for combined peak load even when services never peak together.

The Core Trade-off

| Approach | Cross-Cell Calls | Capacity Efficiency | Routing Complexity |
| --- | --- | --- | --- |
| All services, one cell | None | Poor (uncorrelated demand) | Simple |
| Workflow-affinity cells ✅ | Rare (across workflow boundaries) | Good | Moderate |
| Service-per-cell | Frequent | Optimal | Complex |

When to Co-locate Services (Same Cell)

โœ… Co-locate when:

  • Services call each other synchronously โ€” SMS delivery โ†’ Voice callback โ†’ WhatsApp fallback
  • Customer workflows span multiple services in a single request path โ€” Verify API uses SMS + Voice
  • Failure in one impacts the other anyway โ€” Tightly coupled dependencies
  • Similar scaling characteristics โ€” Both bursty, both sustained, etc.

When to Separate Services (Different Cells)

๐Ÿ”€ Separate when:

  • Services rarely or never communicate at runtime โ€” No cross-cell call risk
  • Vastly different scaling characteristics โ€” SMS is bursty (millions/minute), Voice is sustained (concurrent calls), Video is bandwidth-bound
  • Different instance type optimizations โ€” CPU-bound vs memory-bound vs I/O-bound workloads
  • Uncorrelated demand patterns โ€” Don't want to size for combined peak when services never peak together

The Auto-Scaling Reality Check

Auto-scaling helps but has limits:

| Factor | Challenge |
| --- | --- |
| Scale-up latency | 2-5 minutes for EC2; need a headroom buffer |
| Minimum instances | Cost floor regardless of actual demand |
| Different triggers | SMS scales on queue depth, Voice on concurrent connections |
| Compounding headroom | If SMS needs 30% headroom and Voice needs 30%, a combined cell carries both, even if they never peak together |
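A toy calculation makes the compounding-headroom point concrete. All numbers are hypothetical:

```python
# Hypothetical peak loads in capacity units; the two peaks never coincide.
sms_peak, voice_peak = 100, 60
headroom = 0.30  # scale-up buffer each service needs

# Combined cell: without a correlation model you must size
# for both peaks occurring at once.
combined_cell = (sms_peak + voice_peak) * (1 + headroom)

# Worst instantaneous demand if the peaks really never overlap.
true_need = max(sms_peak, voice_peak) * (1 + headroom)

print(combined_cell)  # 208.0 units held in the combined design
print(true_need)      # 130.0 units actually required at any instant
```

The gap between the two figures is the capacity-modeling tax of co-locating services with uncorrelated demand.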

Recommended: Workflow-Affinity Cell Types

๐Ÿ“ฑ Messaging Cell

  • Services: SMS, MMS, WhatsApp, RCS
  • Why together: Similar patterns (message queues), often used in fallback chains
  • Scaling: Queue depth, messages/second

๐ŸŽฅ Real-time Media Cell

  • Services: Voice, Video, WebRTC
  • Why together: Connection-based, jitter-sensitive, often combined in apps
  • Scaling: Concurrent connections, bandwidth

โœ‰๏ธ Async Communication Cell

  • Services: Email (SendGrid), Fax
  • Why together: Batch-oriented, store-and-forward, different SLA expectations
  • Scaling: Throughput, delivery rate

๐Ÿ” Verify Cell

  • Services: Verify API (2FA), Lookup
  • Why together: Cross-channel verification workflows (SMS + Voice + Email)
  • Scaling: Requests/second, verification attempts
# Workflow-affinity routing:
customer_id + region + service_category โ†’ cell_id

# Example mappings:
customer: acme-corp, region: us-east-1, category: messaging  โ†’ messaging-enterprise-us-001
customer: acme-corp, region: us-east-1, category: realtime   โ†’ realtime-enterprise-us-001
customer: acme-corp, region: us-east-1, category: verify     โ†’ verify-enterprise-us-001

# Same customer, different cells by workflow affinity
# Cross-category calls are rare (by design)
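A runnable stand-in for the mapping above, with the cell IDs copied from the example; the real lookup would be a DynamoDB GetItem keyed on customer and category:

```python
# In-memory stand-in for the DynamoDB routing table.
ROUTING_TABLE = {
    ("acme-corp", "us-east-1", "messaging"): "messaging-enterprise-us-001",
    ("acme-corp", "us-east-1", "realtime"):  "realtime-enterprise-us-001",
    ("acme-corp", "us-east-1", "verify"):    "verify-enterprise-us-001",
}

def resolve_cell(customer_id: str, region: str, category: str) -> str:
    """Same customer, same region: different cells per workflow category."""
    return ROUTING_TABLE[(customer_id, region, category)]
```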

Cell Taxonomy Dimensions

Primary Dimensions (Tier 1 - Must Have)

  1. Geographic/Regional
    • Why: Legal requirement for data residency (GDPR, Chinese data laws)
    • Why: Latency matters for real-time voice/video
    • Impact: Minimum deployment in US, EU, APAC regions
    • Interview value: Shows understanding of regulatory constraints
  2. Customer Segment
    • Why: Enterprise vs SMB have fundamentally different operational needs
    • Examples:
      • Enterprise: Dedicated cells, custom SLAs, contractual guarantees, 24/7 support
      • SMB/Developer: Shared multi-tenant cells, best-effort, self-service
    • Impact: Different blast radius tolerance, isolation requirements, cost models
    • Interview value: Demonstrates business/technical balance
  3. Compliance/Regulatory
    • Why: You CANNOT mix HIPAA healthcare traffic with regular SMS
    • Why: PCI-DSS for payment flows requires isolation
    • Impact: Separate cell infrastructure with auditable controls
    • Interview value: Shows enterprise architecture experience

Secondary Dimensions (Conditional)

  • Workload Characteristics (May be implied by product type):
    • Transactional: 2FA codes, alerts (low-latency, predictable)
    • Marketing/Bulk: Campaign SMS (high-throughput, bursty)
    • Real-time: Voice/Video (sustained connections, jitter-sensitive)

Note: This might be handled within cells via different service tiers rather than requiring separate cell types.

What to AVOID: Per-Service Cells Without Workflow Analysis

Don't blindly create one cell type per product. Instead, analyze actual workflow dependencies:

  • Ask: "If a customer uses Service A, how often do they call Service B in the same request?"
  • If frequently: Co-locate in the same workflow-affinity cell
  • If rarely/never: Separate cells are fineโ€”no cross-cell call risk
Bad: SMS-Cell, Voice-Cell, Video-Cell (arbitrary per-service split, ignores workflows)
Better: Messaging-Cell (SMS + MMS + WhatsApp), Realtime-Cell (Voice + Video + WebRTC)
Best: Workflow analysis determines grouping based on actual runtime dependencies

Key insight: The goal isn't "fewer cells" or "more cells"โ€”it's minimizing cross-cell calls while avoiding capacity modeling nightmares from forcing unrelated services together.
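The workflow analysis described above can be sketched as a grouping pass over observed call rates. The rates, the threshold, and the greedy first-match grouping are all simplifying assumptions; a real analysis would use tracing data and a proper clustering step:

```python
# Observed runtime call rates between services (calls per 1k requests).
# Numbers are hypothetical; in practice they come from distributed tracing.
CALL_RATES = {
    ("sms", "whatsapp"): 120,  # fallback chain: frequent
    ("voice", "video"):   80,  # combined in many apps
    ("sms", "video"):      1,  # effectively never
}

def call_rate(a: str, b: str) -> int:
    return CALL_RATES.get((a, b), 0) + CALL_RATES.get((b, a), 0)

def group_by_affinity(services, threshold=10):
    """Greedy grouping: join a service to the first group it talks to."""
    groups = []
    for svc in services:
        for g in groups:
            if any(call_rate(svc, member) >= threshold for member in g):
                g.add(svc)
                break
        else:
            groups.append({svc})
    return groups
```

With these rates, `group_by_affinity(["sms", "whatsapp", "voice", "video"])` yields a messaging group and a realtime group, matching the cell types recommended above.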

Operational Differentiation: Same Code, Different Posture

Every cell runs the same microservices (SMS Service, Voice Service, Video Service, etc.). What changes is the operational configuration:

| Aspect | Enterprise Cell | SMB Cell |
| --- | --- | --- |
| Availability | Multi-AZ, N+2 redundancy | Multi-AZ, N+1 redundancy |
| Compute | r7g.2xlarge (dedicated) | r7g.large (shared, burstable) |
| Database | Multi-AZ RDS, 6 read replicas | Multi-AZ RDS, 2 read replicas |
| Capacity | Reserved instances, pre-scaled | Auto-scaling, spot instances |
| Network | Dedicated NAT Gateways, Transit Gateway | Shared NAT, standard VPC routing |
| Storage | Provisioned IOPS SSD (io2) | General Purpose SSD (gp3) |
| Monitoring | Sub-minute CloudWatch, custom metrics | 5-minute CloudWatch, standard metrics |
| Blast Radius | 10-50 customers per cell | 1,000+ customers per cell |
| SLA | 99.99% (52 min downtime/year) | 99.9% (8.7 hrs downtime/year) |
| Cost per Customer | ~$500-2000/month | ~$10-50/month |

Key Benefits of This Architecture

1๏ธโƒฃ No Cross-Cell Cascading Failures

  • Customer uses SMS + Voice + Video โ†’ all in same cell
  • No distributed transactions across cells
  • Failures stay contained within cell boundary

2๏ธโƒฃ Noisy Neighbor Isolation

  • SMB customer traffic spike โ†’ only affects other SMB customers
  • Enterprise customers in separate cells โ†’ unaffected
  • No SLA violations for high-paying customers

3๏ธโƒฃ Simple Customer Migration

  • Upgrade SMB โ†’ Enterprise: Update DynamoDB routing table
  • Zero code changes, just point to new cell
  • Gradual traffic shifting (10% โ†’ 50% โ†’ 100%)
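The gradual traffic shift can be implemented with deterministic request bucketing, so the split stays stable while the percentage is ramped. The function and its parameters are an illustrative sketch:

```python
import hashlib

def route_request(request_id: str, old_cell: str, new_cell: str,
                  shift_pct: int) -> str:
    """Send shift_pct% of requests to the new cell during migration.

    Hashing the request ID gives a stable, uniform split as the
    percentage is ramped through the stages above (10 -> 50 -> 100).
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return new_cell if bucket < shift_pct else old_cell
```

At `shift_pct=0` every request stays in the old cell; at `shift_pct=100` the cutover is complete and the DynamoDB routing entry can be updated permanently.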

4๏ธโƒฃ Economic Optimization

  • Enterprise: Infrastructure dollars proportional to revenue
  • SMB: Economies of scale through higher density
  • Right-sized spending per customer tier

5๏ธโƒฃ Operational Simplicity

  • Every cell has identical structure (same services)
  • One deployment pipeline, one monitoring setup
  • Only Terraform variables change (instance sizes, etc.)

Infrastructure-as-Code Example

# Terraform module - workflow-affinity cells with operational tiers
module "messaging_cell" {
  source = "./modules/twilio-cell"

  # Cell identity
  cell_id  = "messaging-enterprise-us-001"
  category = "messaging"
  region   = "us-east-1"
  segment  = "enterprise"

  # Services in this workflow affinity group
  services = [
    "sms-service",
    "mms-service",
    "whatsapp-service",
    "rcs-service"
  ]

  # Operational config (enterprise tier)
  instance_type = "r7g.2xlarge"
  min_instances = 6
  scaling_metric = "queue_depth"  # Messaging scales on queue depth
}

module "realtime_cell" {
  source = "./modules/twilio-cell"

  cell_id  = "realtime-enterprise-us-001"
  category = "realtime"
  region   = "us-east-1"
  segment  = "enterprise"

  # Services in this workflow affinity group
  services = [
    "voice-service",
    "video-service",
    "webrtc-service"
  ]

  # Different scaling characteristics
  instance_type = "c7g.2xlarge"  # CPU-optimized for media processing
  min_instances = 4
  scaling_metric = "concurrent_connections"  # Realtime scales on connections
}

# Same customer can be in both cells (different workflow categories)
# Cross-category calls are rare by design

๐Ÿ’ก Interview Talking Point: Cell Taxonomy

"I'd group services by workflow affinity rather than putting everything in one cell or splitting by individual product.

The key question is: 'Do these services call each other at runtime?' If yes, co-locate them to avoid cross-cell latency and cascading failures. If they rarely or never communicate, separate cells let them scale independently with better capacity efficiency.

For Twilio, this might mean: A Messaging Cell (SMS, MMS, WhatsAppโ€”similar patterns, fallback chains), a Real-time Media Cell (Voice, Video, WebRTCโ€”connection-based, jitter-sensitive), and an Async Cell (Email, Faxโ€”batch-oriented, different SLAs).

Why not just one cell with everything? Capacity modeling becomes a nightmare. SMS is bursty, Voice is sustained, Video is bandwidth-heavy. If you force them together, you're sizing for combined peak even when they never peak simultaneously. Auto-scaling helps, but you still pay the minimum instance cost floor and headroom buffer for each service.

Within each workflow category, I'd still differentiate by operational tierโ€”Enterprise cells over-provisioned for SLA guarantees, SMB cells cost-optimized with higher density. Same deployment pipeline, different operational posture.

The goal is minimizing cross-cell calls while avoiding the capacity modeling tax of forcing unrelated services together."

The "So What?" Test for Cell Taxonomy

For each dimension, ask: "Does this require materially different infrastructure, SLAs, or operational procedures?"

  • โœ… Geography โ†’ Yes (different AWS regions, data residency laws)
  • โœ… Segment โ†’ Yes (dedicated vs shared, different SLAs)
  • โœ… Compliance โ†’ Yes (audit logs, encryption, access controls)
  • โœ… Workflow affinity โ†’ Yes (different scaling characteristics, instance types, capacity models)
  • โš ๏ธ Individual product โ†’ Only if services rarely communicate AND have very different scaling needs

Principle: Segment on operational characteristics AND runtime dependencies. Co-locate services that call each other; separate services with uncorrelated demand that don't communicate.

Example Cell Naming Convention

# Workflow-affinity cell naming: {category}-{segment}-{region}-{compliance}
Messaging-Enterprise-US-HIPAA    # Healthcare customer, messaging services, US
Messaging-SMB-EU-Standard        # Small business, messaging services, EU
Realtime-Enterprise-US-Standard  # Enterprise customer, voice/video, US
Realtime-SMB-APAC-Standard       # SMB customer, voice/video, Asia-Pacific
Async-Enterprise-US-PCI          # Payment processor, email/fax, US
Verify-Enterprise-US-HIPAA       # Healthcare, 2FA workflows, US

# Cell Router lookup (now includes service category):
Customer ID + Service Category โ†’ DynamoDB โ†’ Cell Assignment โ†’ VPC Lattice โ†’ Correct Cell

# Same customer, multiple cells:
acme-corp โ†’ messaging  โ†’ Messaging-Enterprise-US-Standard
acme-corp โ†’ realtime   โ†’ Realtime-Enterprise-US-Standard
acme-corp โ†’ verify     โ†’ Verify-Enterprise-US-Standard

Detailed Architecture Layers

Request Flow Through Architecture

[Diagram: request flow, client to data layer]

  • Client
  • Layer 1 (Global Edge): Route 53 (DNS), CloudFront (CDN), WAF/Shield; geolocation routing, DDoS protection
  • Layer 2 (Regional Gateway): Global Accelerator, API Gateway, ALB; regional failover, TLS termination
  • Layer 3 (Cell Router): customer → cell mapping via DynamoDB lookup, routing logic, VPC Lattice
  • Layer 4 (Compute Cells): Enterprise Cell (EKS cluster, Voice/SMS/Video microservices, VPC Lattice, Horizontal Pod Autoscaling); Mid-Market Cell (shared EKS cluster, microservices, VPC Lattice, Cluster Autoscaling); SMB Cell (ECS Fargate, microservices, service discovery, auto scaling)
  • Layer 5 (Data Layer): DynamoDB Global Tables (multi-master: account data, SMS logs); Aurora Multi-Master (PostgreSQL: enterprise config); Aurora Replicas (leader-follower: phone inventory); ElastiCache Redis cluster (session cache); Kinesis Streams (event streaming); SQS FIFO (message ordering); S3 + Glacier (call recordings, message archives, compliance data)

AWS Service Mapping by Layer

Layer 1: Global Edge & Traffic Management

  • Route 53: DNS with health checks, geolocation routing, latency-based routing
  • CloudFront: CDN for static assets, API caching, geo-restrictions
  • AWS Global Accelerator: Anycast IPs, automatic failover between regions
  • WAF & Shield: DDoS protection, rate limiting, bot mitigation

Layer 2: Regional Gateway

  • API Gateway: Regional REST/WebSocket APIs, request validation, throttling
  • Application Load Balancer (ALB): Layer 7 load balancing, TLS termination, path-based routing
  • Network Load Balancer (NLB): Layer 4 for Voice/Video (UDP/TCP), ultra-low latency
  • VPC Peering / Transit Gateway: Cross-region connectivity

Layer 3: Cell Router & Service Mesh

  • Cell Router Service: Custom Lambda/ECS service that maps customer โ†’ cell
  • DynamoDB: Stores customer-to-cell assignments, low-latency lookups
  • VPC Lattice: Service mesh for inter-service communication, service-to-service connectivity
  • AWS Cloud Map: Service discovery for microservices

Layer 4: Compute Cells

  • EKS (Kubernetes): Enterprise and mid-market cells, full control, custom scheduling
  • ECS Fargate: SMB cells, serverless, lower operational overhead
  • EC2 Auto Scaling: EKS worker nodes, reserved + spot instances
  • Lambda: Event-driven workloads (webhooks, async processing)

Layer 5: Data & Storage

  • DynamoDB Global Tables: Multi-master, active-active replication across regions
  • Aurora Multi-Master: PostgreSQL with multi-master for enterprise cells
  • Aurora with Read Replicas: Cross-region replicas, leader-follower pattern
  • ElastiCache (Redis): Session management, rate limiting, caching
  • S3: Call recordings, message attachments, static assets
  • Kinesis: Real-time event streaming, analytics pipeline
  • SQS/SNS: Asynchronous messaging, fan-out patterns

Database Layer Strategy

Multi-Master vs Leader-Follower Decision Matrix

The choice between multi-master and leader-follower depends on:

  • Write Pattern: Multi-region writes? Geographic distribution of writes?
  • Consistency Requirements: Strong consistency needed? Conflict tolerance?
  • Latency Sensitivity: Can tolerate cross-region write latency?
  • Data Characteristics: High contention? Append-only? Partitionable?

Database Selection by Use Case

Aurora PostgreSQL Multi-Master

Pattern: Multi-Master (Active/Active)

Use Cases:

  • Enterprise customer configurations
  • Complex queries with ACID
  • Multi-table transactions

Why:

  • โœ… ACID transactions across masters
  • โœ… SQL interface, complex queries
  • โœ… Up to 2 masters per region
  • โš ๏ธ Limited to single region multi-master
  • โš ๏ธ Higher latency than DynamoDB

Configuration: 2 masters + 3 read replicas per region

Aurora Global Database (Leader-Follower)

Pattern: Leader-Follower (Active/Passive)

Use Cases:

  • Phone number inventory
  • Payment transactions
  • Strong consistency requirements
  • Regulatory compliance data

Why:

  • โœ… Strong consistency guarantees
  • โœ… < 1 second cross-region replication
  • โœ… Failover in < 1 minute (RPO < 1s)
  • โœ… Read replicas in secondary regions
  • โš ๏ธ Writes only to primary region

Configuration: Primary in US-East-1, replicas in EU + APAC

Cassandra on EC2/Kubernetes

Pattern: Masterless (Peer-to-Peer)

Use Cases:

  • Time-series data (call metrics)
  • Extreme scale (billions of records)
  • High write throughput
  • When DynamoDB limits hit

Why:

  • โœ… Linear scalability
  • โœ… No single point of failure
  • โœ… Tunable consistency (CL=QUORUM)
  • โœ… Multi-datacenter replication
  • โš ๏ธ Higher operational complexity
  • โš ๏ธ Need dedicated ops team

When to use: Only if DynamoDB can't meet scale/cost requirements

Database Replication Topology

[Diagram: replication topologies]

  • DynamoDB Global Tables (multi-master): US-EAST-1, EU-WEST-1, and AP-SOUTHEAST-1 each accept writes and replicate to the other two regions; < 1 second replication, last-write-wins, eventual consistency
  • Aurora Global Database (leader-follower): the US-EAST-1 primary (leader) accepts all writes and replicates to secondaries; the EU-WEST-1 secondary (follower) is read-only and can be promoted to leader

Conflict Resolution Strategies

| Database | Conflict Resolution | Use Case Fit |
| --- | --- | --- |
| DynamoDB Global Tables | Last-write-wins (LWW) based on timestamp | ✅ Account updates, configs (rare conflicts); ✅ append-only logs (no conflicts) |
| Aurora Multi-Master | Automatic deadlock detection, transaction rollback | ✅ Enterprise configs with transactions; ⚠️ limited to 2 masters in same region |
| Cassandra | Tunable: LWW, custom application logic, CRDTs | ✅ Time-series (no conflicts); ✅ counters with CRDT increments |
| Application-Level | Version vectors, CRDTs, custom merge logic | ✅ Complex business logic; ✅ shopping carts, collaborative editing |

Twilio Product to Database Mapping

Principle: Match Consistency Model to Product Requirements

Different Twilio products have different consistency, latency, and scale requirements. Choose the right database pattern for each.

| Twilio Product | Data Type | Database Choice | Replication Pattern | Consistency Model | Rationale |
| --- | --- | --- | --- | --- | --- |
| Programmable Voice | Call records, CDRs | DynamoDB Global Tables | Multi-master | Eventual | Append-only, high volume, sub-second replication lag acceptable, query by CallSID |
| Programmable Voice | Active call state | ElastiCache Redis | N/A (cache) | Strong* | Low latency, session state, ephemeral (TTL). *Single-node writes within region |
| Programmable SMS | Message logs | DynamoDB Global Tables | Multi-master | Eventual | Append-only, billions of messages, partitioned by phone number, ~1 s replication lag |
| Programmable SMS | Message archives | S3 → Glacier | Cross-region replication | Eventual | Compliance, long-term retention, cost-optimized, S3 read-after-write consistency |
| Phone Numbers | Inventory management | Aurora Global Database | Leader-follower | Strong | CRITICAL: can't double-assign numbers. ACID transactions, serializable isolation |
| Conversations API | Messages, participants | DynamoDB Global Tables | Multi-master | Eventual | Append-only messages, out-of-order delivery acceptable, partition by conversation |
| Verify API | Verification tokens | ElastiCache Redis | N/A (cache) | Strong* | Short-lived (5-10 min TTL), low latency. *Regional strong consistency |
| Account Management | Customer accounts | DynamoDB Global Tables | Multi-master | Eventual | Low write contention, global access, last-write-wins acceptable, ~1 s replication |
| Billing | Usage metrics | Kinesis → S3 → Redshift | Event streaming | Eventual | High volume, analytics, batch processing, eventual aggregation acceptable |
| Billing | Invoices, payments | Aurora Global Database | Leader-follower | Strong | CRITICAL: financial transactions require ACID guarantees, no double-charging |
| Video | Room state | ElastiCache Redis | N/A (cache) | Strong* | Real-time, low latency, ephemeral. *Regional consistency, conflicts rare |
| Video | Recordings | S3 Multi-Region | Cross-region replication | Eventual | Large files, CDN integration, durability, S3 eventual cross-region consistency |

๐Ÿ“– Consistency Models Explained

Eventual Consistency

Definition: Writes to one replica will eventually propagate to all replicas, but reads may see stale data during replication lag (~1 second for DynamoDB Global Tables).

When to use:

  • Append-only data (logs, messages, events)
  • Data where conflicts are rare or easily resolved (last-write-wins)
  • High availability and low latency are more important than consistency
  • Multi-region writes are required

Trade-off: Clients might read stale data briefly, but system remains available during network partitions.

Strong Consistency

Definition: All reads see the most recent write. Once a write is acknowledged, all subsequent reads from any replica will see that write.

When to use:

  • Financial transactions (billing, payments)
  • Inventory management (phone numbers, seats)
  • Any data where stale reads cause business problems
  • ACID transaction requirements

Trade-off: Higher latency for cross-region reads, limited write scalability (single leader), reduced availability during partitions.

๐Ÿ”‘ Interview Tip: Always justify your consistency choice. For Twilio, voice/SMS logs can tolerate eventual consistency (append-only, no conflicts), but phone number inventory requires strong consistency (can't double-assign). This is a key architectural trade-off!

Data Flow Example: SMS Message

API Request
(EU customer)
โ†’
Cell Router
(DynamoDB lookup)
โ†’
EU Cell
(EKS pod)
โ†’
DynamoDB EU
(local write)
โ†’
Replicate to
US + APAC
(<1 sec)
Kinesis
(event stream)
โ†’
Firehose
(batch)
โ†’
S3
(compliance archive)
โ†’
Glacier
(7 year retention)

AWS Well-Architected Framework Alignment

๐ŸŽฏ Operational Excellence

  • IaC: Terraform/CloudFormation for all cells
  • CI/CD: Blue/green deployments per cell
  • Observability: CloudWatch, X-Ray, cell-level dashboards
  • Runbooks: Automated remediation with Systems Manager
  • Chaos Engineering: Fault injection per cell (GameDay exercises)

๐Ÿ”’ Security

  • IAM: Service roles per cell, least privilege
  • Encryption: At-rest (KMS) and in-transit (TLS 1.3)
  • Secrets: Secrets Manager, automatic rotation
  • Network: VPC isolation per cell, PrivateLink
  • Compliance: GDPR data residency, SOC2, HIPAA

๐Ÿ’ช Reliability

  • Multi-AZ: All cells across 3 AZs
  • Multi-Region: Active/active in 3 regions
  • Cell Isolation: Blast radius limited to single cell
  • Backups: Automated DynamoDB PITR, Aurora snapshots
  • Disaster Recovery: RTO < 5min, RPO < 1sec

โšก Performance Efficiency

  • Global Accelerator: Anycast routing, TCP optimization
  • DynamoDB: Single-digit ms latency, auto-scaling
  • ElastiCache: Sub-millisecond caching
  • Compute: Graviton instances (40% better perf/cost)
  • Observability: X-Ray distributed tracing

๐Ÿ’ฐ Cost Optimization

  • Right-Sizing: Enterprise (reserved), SMB (spot/fargate)
  • DynamoDB: On-demand for SMB, provisioned for enterprise
  • S3 Lifecycle: Glacier for compliance archives
  • Compute Savings Plans: 30-50% savings
  • Monitoring: Cost anomaly detection per cell

๐ŸŒ Sustainability

  • Regions: Choose AWS regions with renewable energy
  • Graviton: 60% less energy than x86
  • Serverless: Fargate, Lambda for variable workloads
  • S3 Intelligent-Tiering: Automatic archival
  • Auto-Scaling: Scale down during off-peak

Key Architecture Decisions & Trade-offs

โœ… Decision: DynamoDB Global Tables over Cassandra

Rationale: Managed service, lower ops overhead, sufficient scale for most Twilio use cases

Trade-off: Less control over replication topology, higher cost at extreme scale

When to reconsider: If DynamoDB costs > $1M/month or need more than 5 regions

โœ… Decision: Cell-based architecture by customer size

Rationale: Different SLAs, resource isolation, blast radius containment

Trade-off: More complex routing, cell rebalancing overhead

Mitigation: Automated cell assignment, gradual customer migration tools

โœ… Decision: EKS for Enterprise, ECS Fargate for SMB

Rationale: Enterprise needs custom scheduling/control, SMB needs cost efficiency

Trade-off: Managing two orchestration platforms

Mitigation: Shared CI/CD pipelines, standardized observability

โœ… Decision: Aurora Global Database for phone inventory

Rationale: Strong consistency required, can't double-assign phone numbers

Trade-off: Single write region, cross-region writes have higher latency

Mitigation: Regional inventory pools, write to nearest region's pool

Scalability Estimates

| Component | Current Scale | Target Scale | Bottleneck | Mitigation |
| --- | --- | --- | --- | --- |
| DynamoDB Global Tables | 10K RCU/WCU per table | 100K RCU/WCU | Partition hotspots | Partition key design (customer_id + timestamp), adaptive capacity |
| Aurora Global | 100K transactions/sec | 500K transactions/sec | Write throughput | Horizontal sharding by region, read replicas |
| EKS Cluster (Enterprise) | 1000 nodes | 5000 nodes | Control plane limits | Multiple clusters per region, federated control plane |
| API Gateway | 10K req/sec | 100K req/sec | Regional quotas | Multi-region, ALB for high-throughput paths |
| Kinesis Streams | 1000 shards | 10K shards | Shard management | Automatic resharding, Kinesis Data Firehose for batch |
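The hot-partition mitigation named for DynamoDB can be sketched as a time-bucketed composite key: the hour-sized bucket is an assumed tuning knob, and smaller buckets spread load further at the cost of more read fan-out:

```python
def partition_key(customer_id: str, epoch_seconds: int) -> str:
    """customer_id + hour bucket: a busy customer's writes rotate across
    partitions over time instead of concentrating on a single key."""
    return f"{customer_id}#{epoch_seconds // 3600}"

print(partition_key("acme-corp", 1_700_000_000))  # acme-corp#472222
```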

Monitoring & Observability

Cell-Level Dashboards

  • CloudWatch: Custom metrics per cell (latency, error rate, throughput)
  • X-Ray: Distributed tracing across cells, service map
  • Prometheus + Grafana: EKS cluster metrics, pod-level observability
  • DynamoDB Contributor Insights: Identify hot partitions
  • Aurora Performance Insights: Database query analysis
  • CloudWatch Alarms: Cell health checks, automatic remediation via Lambda

SLI/SLO Tracking:

  • API latency p99 < 200ms (per cell, per product)
  • Error rate < 0.1% (per cell)
  • Availability > 99.99% (enterprise cells), > 99.95% (mid-market), > 99.9% (SMB)
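The downtime figures quoted for the cell tiers follow directly from the availability targets:

```python
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_min(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MIN_PER_YEAR * (1 - availability)

print(round(downtime_budget_min(0.9999), 1))      # 52.6 min/year (99.99%)
print(round(downtime_budget_min(0.999) / 60, 2))  # 8.76 hrs/year (99.9%)
```

This budget is what error-budget-based alerting burns against: a single hour-long cell outage consumes the entire annual enterprise budget, which is why enterprise cells get N+2 redundancy and sub-minute monitoring.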

Summary & Interview Talking Points

๐ŸŽค 2-Minute Elevator Pitch

"For Twilio's global platform, I'd design a cell-based architecture partitioned by customer sizeโ€”enterprise, mid-market, and SMBโ€”deployed across three AWS regions for multi-region active/active redundancy."

"At the data layer, we'd use DynamoDB Global Tables for multi-master replication where eventual consistency is acceptableโ€”like customer accounts and message logs. For strong consistency needs like phone number inventory, we'd use Aurora Global Database with leader-follower replication."

"Enterprise customers get dedicated EKS clusters with Aurora Multi-Master and reserved capacity. Mid-market shares EKS with Aurora replicas. SMB runs on cost-optimized ECS Fargate."

"This design aligns with AWS Well-Architected: operational excellence through IaC and cell-level observability, security via VPC isolation and encryption, reliability through multi-AZ/multi-region with blast radius containment, performance via low-latency databases and global accelerator, and cost optimization by right-sizing per customer segment."

"The cell architecture limits blast radiusโ€”if one cell fails, it only affects that customer segment. We can independently scale, deploy, and test each cell."

๐Ÿ“‹ Key Takeaways

  1. Cell-based architecture limits blast radius and enables independent scaling
  2. Multi-master (DynamoDB Global Tables) for low-latency global writes with eventual consistency
  3. Leader-follower (Aurora Global Database) for strong consistency requirements
  4. Customer segmentation (Enterprise/Mid-Market/SMB) drives compute and database choices
  5. Product-specific patterns match data consistency to business requirements
  6. AWS-native services reduce operational overhead vs self-managed (DynamoDB > Cassandra)
  7. Well-Architected Framework ensures holistic design across all pillars

๐Ÿš€ Next Steps for Deep Dive

  • Design detailed cell routing algorithm (DynamoDB lookups, least-loaded assignment, cell rebalancing)
  • Plan disaster recovery runbooks (regional failover, data loss scenarios)
  • Cost modeling per customer segment (reserved vs on-demand, DynamoDB pricing)
  • Security deep-dive (GDPR compliance, data residency, encryption key management)
  • Migration strategy (monolith โ†’ cells, zero-downtime migration)