Architecture Overview
Core Design Principles
- Full Cell Autonomy: Each enterprise cell is a completely isolated VPC with independent IP space (overlapping CIDRs allowed)
- Blast Radius Containment: Cell failures affect only customers in that cell (~100 enterprise customers max per cell)
- Multi-Region Active/Active: Customers write locally to their region, DynamoDB Global Tables replicate globally
- VPC Lattice for Inter-Cell Routing: Service-based routing (not IP-based) enables overlapping IP spaces
- Consistency by Use Case: Multi-master (DynamoDB) for logs/events, leader-follower (Aurora) for phone inventory/payments
Key Architectural Decisions
| Decision | Approach | Rationale |
|---|---|---|
| Cell Isolation | VPC-per-cell for Enterprise, shared VPC for Mid-Market/SMB | Balance isolation (enterprise) vs operational efficiency (SMB) |
| IP Addressing | Overlapping RFC1918 (all cells use 10.0.0.0/16) | Full autonomy, simplified addressing, VPC Lattice handles routing |
| Cell Edge | Dedicated ALB (Enterprise) or shared ALB (Mid-Market/SMB) | Layer 7 routing + WAF at the cell edge; NLB only for source IP preservation or non-HTTP protocols |
| Inter-Cell Routing | VPC Lattice (service mesh) | Supports overlapping IPs, IAM auth, cross-VPC without peering |
| Database Strategy | DynamoDB Global Tables (multi-master) + Aurora Global (leader-follower) | Match consistency model to use case requirements |
How Overlapping IP Spaces Work with VPC Lattice
The Problem: Cell Autonomy with RFC1918 IP Overlaps
Challenge: For true cell autonomy, each Enterprise cell should be a fully independent VPC. But if we have 100+ enterprise cells, we'd run out of unique RFC1918 IP space. The solution: allow all cells to use the same CIDR blocks (e.g., 10.0.0.0/16).
The Solution: VPC Lattice Service-Based Routing
VPC Lattice routes based on service names, not IP addresses. This means overlapping IP spaces across VPCs are perfectly fine.
How It Works:
- Each cell is a dedicated VPC with the same CIDR (10.0.0.0/16)
- VPC Lattice creates a Service Network that spans all VPCs in the region
- Each cell registers its services with unique service names (e.g., enterprise-cell-a-api, enterprise-cell-b-api)
- Cell Router sets X-Twilio-Cell-ID header (e.g., "enterprise-us-east-1-a")
- VPC Lattice routes to the service by name, not by IP address
- Service discovery happens via AWS Cloud Map - VPC Lattice resolves service name to the correct VPC's load balancer
Example: Three cells with overlapping IPs
- enterprise-cell-a-vpc: 10.0.0.0/16 → Service: enterprise-cell-a-api
- enterprise-cell-b-vpc: 10.0.0.0/16 → Service: enterprise-cell-b-api
- enterprise-cell-c-vpc: 10.0.0.0/16 → Service: enterprise-cell-c-api
All three cells have pod IPs like 10.0.50.10, but VPC Lattice routes correctly because it uses service names, not IPs.
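The name-based lookup above can be sketched with a toy registry. This is a minimal sketch: the dict stands in for AWS Cloud Map / VPC Lattice, and all service names and load balancer DNS names are hypothetical.

```python
# Illustrative stand-in for Cloud Map / VPC Lattice service resolution.
# Pods in every VPC may share IPs like 10.0.50.10; routing never consults
# an IP, only the unique service name.
SERVICE_REGISTRY = {
    # service name -> (VPC identifier, load balancer DNS name)
    "enterprise-cell-a-api": ("enterprise-cell-a-vpc", "internal-cell-a-alb.example"),
    "enterprise-cell-b-api": ("enterprise-cell-b-vpc", "internal-cell-b-alb.example"),
    "enterprise-cell-c-api": ("enterprise-cell-c-vpc", "internal-cell-c-alb.example"),
}

def route(cell_id: str) -> tuple[str, str]:
    """Resolve a cell ID (from X-Twilio-Cell-ID) to a concrete VPC target."""
    return SERVICE_REGISTRY[f"{cell_id}-api"]

vpc, lb = route("enterprise-cell-b")
print(vpc, lb)  # enterprise-cell-b-vpc internal-cell-b-alb.example
```

Because the key is a globally unique name, three VPCs with identical CIDRs never collide in this table.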
Key Benefit: No VPC peering needed! VPC Lattice handles cross-VPC communication with built-in IAM authentication, observability, and no risk of IP conflicts.
Architecture Diagrams
Creating Professional AWS Architecture Diagrams
For interview presentations, create professional diagrams using these tools with official AWS Architecture Icons:
- draw.io (diagrams.net) - Free, has built-in AWS icon library (app.diagrams.net)
- CloudCraft - AWS-specific, 3D diagrams (cloudcraft.co)
- Lucidchart - Professional diagramming tool with AWS shapes
- AWS Architecture Icons - Download official icons: aws.amazon.com/architecture/icons/
Complete Request Flow Architecture
End-to-End Request Flow
Diagram Specifications for Interview
Create these diagrams using draw.io or CloudCraft with AWS icons:
1. Multi-Region Overview Diagram
Components to include:
- Route 53 (global) → CloudFront (global) → Regional ALBs (us-east-1, eu-west-1, ap-southeast-1)
- Cell Router Lambda in each region
- VPC Lattice service network (spans multiple VPCs in region)
- 3-4 cell VPCs per region, all labeled "10.0.0.0/16" to show overlapping IPs
- Arrows showing data replication (DynamoDB bidirectional, Aurora primary → replica)
2. Cell Detail Diagram
Components to include:
- VPC (10.0.0.0/16) with public/private subnets across 3 AZs
- ALB in public subnets
- EKS cluster nodes in private subnets
- Pods (Voice, SMS, Auth services)
- DynamoDB and Aurora databases
- ElastiCache Redis cluster
- NAT Gateways, Internet Gateway
- Security group arrows showing traffic flow
3. VPC Lattice Overlapping IP Diagram
Show how overlapping IPs work:
- VPC Lattice service network at the top
- 3 VPCs below, ALL labeled "10.0.0.0/16"
- Each VPC has different service name (enterprise-a-api, enterprise-b-api, enterprise-c-api)
- Annotation: "VPC Lattice routes by service name, not IP"
- Show pods in each VPC with same IP (e.g., 10.0.50.10) to emphasize overlapping
Key Configuration Details
Route 53 Configuration
- Routing Policy: Geolocation (US → us-east-1, EU → eu-west-1, APAC → ap-southeast-1)
- Health Checks: /health endpoint, 30s interval, 3 failures = unhealthy
- Failover: If primary region unhealthy, route to secondary region
- TTL: 60 seconds (balance between failover speed and DNS query volume)
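Under these settings, worst-case failover is roughly detection time plus one DNS TTL. A quick back-of-envelope check (assuming three consecutive failed checks are required before the endpoint is marked unhealthy):

```python
# Failover bound derived from the Route 53 settings above.
health_check_interval_s = 30
failure_threshold = 3
dns_ttl_s = 60

detection_s = health_check_interval_s * failure_threshold   # 90s to mark unhealthy
worst_case_failover_s = detection_s + dns_ttl_s             # plus stale cached DNS answers

print(worst_case_failover_s)  # 150
```

So clients should converge on the secondary region within about 2.5 minutes, which is the trade-off the 60-second TTL is balancing against DNS query volume.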
CloudFront Configuration
- Origin: Origin group with regional ALBs (automatic failover)
- Cache Behavior: GET requests cached 60s, POST/PUT/DELETE bypass cache
- SSL/TLS: Custom certificate from ACM, TLS 1.2+ only
- WAF: Rate limiting (10,000 requests per 5 minutes per IP)
Global Accelerator Configuration
- Anycast IPs: 2 static IPv4 addresses (global)
- Endpoints: Regional ALBs in each region (weight: 100)
- Health Checks: Port 443, HTTPS protocol, 30s interval
- Traffic Dial: 100% to healthy endpoints, automatic failover
VPC Lattice Configuration
- Service Network: Cross-VPC service discovery and routing
- Routing: Header-based routing on X-Twilio-Cell-ID
- Auth: IAM-based service-to-service authentication
- Observability: Access logs to S3, metrics to CloudWatch
ALB Configuration
- Scheme: Internet-facing, cross-zone load balancing enabled
- TLS: TLS 1.3, certificate from ACM, ALPN support
- Routing: Path-based routing (/v1/voice, /v1/sms, etc.)
- Sticky Sessions: Application cookie, 1-hour duration
Cell Router Lambda Configuration
- Runtime: Node.js 20, arm64 (Graviton2)
- Memory: 1024MB, Timeout: 10 seconds
- Concurrency: Reserved concurrency per region (1000)
- VPC: Deployed in VPC to access ElastiCache privately
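The router's decision logic can be sketched as follows (Python for illustration, although the configured runtime above is Node.js 20; the in-memory dicts stand in for ElastiCache and DynamoDB, and all cell IDs and load numbers are hypothetical):

```python
# Stand-ins for ElastiCache (hot cache) and DynamoDB (durable mapping).
CACHE: dict[str, str] = {}
CUSTOMER_CELL_TABLE = {"cust-123": "enterprise-us-east-1-a"}
CELL_LOAD = {"enterprise-us-east-1-a": 92, "enterprise-us-east-1-b": 41}

def resolve_cell(customer_id: str) -> str:
    """Cache -> durable mapping -> least-loaded assignment for new customers."""
    if customer_id in CACHE:
        return CACHE[customer_id]
    cell = CUSTOMER_CELL_TABLE.get(customer_id)
    if cell is None:
        cell = min(CELL_LOAD, key=CELL_LOAD.get)  # least-loaded cell
        CUSTOMER_CELL_TABLE[customer_id] = cell   # persist the new assignment
    CACHE[customer_id] = cell
    return cell

def handler(event: dict) -> dict:
    """Lambda-style entry point: stamp the routing header onto the request."""
    headers = dict(event.get("headers", {}))
    headers["X-Twilio-Cell-ID"] = resolve_cell(event["customer_id"])
    return {"headers": headers}

print(handler({"customer_id": "cust-999"})["headers"]["X-Twilio-Cell-ID"])
# enterprise-us-east-1-b (the least-loaded cell)
```

The cache-first read is what keeps the router inside its 5-15ms latency budget; DynamoDB is only consulted on cache misses and first-time assignments.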
Traffic Flow Example
- Client makes HTTPS request to api.twilio.com
- Route 53 resolves DNS based on geolocation → returns CloudFront distribution
- CloudFront terminates TLS, applies WAF rules, routes to nearest regional ALB
- Global Accelerator (optional) provides anycast routing to nearest region
- ALB terminates TLS (if not via CloudFront), routes based on path to API Gateway or directly to Cell Router
- Cell Router Lambda extracts customer ID from JWT, queries ElastiCache/DynamoDB for cell assignment, assigns new customers to the least-loaded cell, sets X-Twilio-Cell-ID header
- VPC Lattice routes request to the appropriate EKS cluster based on X-Twilio-Cell-ID header
- EKS/ECS processes request in customer's assigned cell
- Data Layer reads from local DynamoDB replica (eventual consistency) or Aurora read replica, writes to DynamoDB Global Table (multi-master) or Aurora primary (strong consistency)
- Response flows back through VPC Lattice → Cell Router → ALB → CloudFront → Client
Total Latency Budget: CloudFront (5ms) + ALB (5ms) + Cell Router (5-15ms) + VPC Lattice (2ms) + Application (20-100ms) + Database (5-10ms) = 42-137ms end-to-end
Routing Responsibilities: VPC Lattice vs EKS
Inter-Cell Routing (VPC Lattice)
Responsibility: Route requests between cells based on customer assignment
How it works:
- Cell Router Lambda sets X-Twilio-Cell-ID header (e.g., "enterprise-us-east-1-a")
- VPC Lattice inspects the header and routes to the correct cell's service network
- Works across VPCs, availability zones, and accounts
- Provides IAM-based authentication between services
Example: Customer A (assigned to Enterprise Cell A) → VPC Lattice → Enterprise Cell A EKS cluster
Intra-Cell Routing (EKS/Kubernetes)
Responsibility: Route requests within a cell between microservices
How it works:
- Request arrives at cell's EKS cluster via VPC Lattice
- Kubernetes Ingress Controller (NGINX/ALB Ingress) receives request
- Routes to appropriate service based on path (/v1/voice → voice-service)
- Service mesh (Istio/Linkerd) handles service-to-service communication
Example: Within Enterprise Cell A: voice-service → auth-service → database-service
Key Distinction: VPC Lattice routes to the correct cell (inter-cell), while EKS/Kubernetes routes within the cell to the correct microservice (intra-cell). This separation allows cells to be completely isolated while still enabling centralized routing logic.
Alternative: For larger enterprises, you could use Istio multi-cluster to span a service mesh across cells, but this increases complexity. VPC Lattice + per-cell Kubernetes Ingress is simpler and maintains better fault isolation.
Network Boundaries & Cell Edge Architecture
Cell Network Boundary Design
VPC Structure Options
| Approach | VPC Topology | Isolation Level | Use Case | Cell Edge Device |
|---|---|---|---|---|
| Option 1: VPC-per-Cell (recommended) | Each cell = dedicated VPC | Strongest (network isolation) | Enterprise cells | VPC Lattice → ALB → EKS Pods (IP mode) |
| Option 2: Shared VPC, Subnet Isolation | Cells in different subnets | Moderate (security groups) | Mid-Market, SMB cells | VPC Lattice → ALB (target groups) → EKS |
| Option 3: Namespace Isolation | Cells = Kubernetes namespaces | Weakest (logical only) | Dev/test environments | Shared ALB → Ingress Controller → Namespaces |
Recommended Architecture: Hybrid Approach
Enterprise Cells (Top 100 customers):
- VPC Boundary: Dedicated VPC per cell (e.g., enterprise-cell-a-vpc, enterprise-cell-b-vpc); ALL use 10.0.0.0/16 (overlapping IPs)
- Subnets: Private subnets across 3 AZs for EKS, public subnets for NAT gateways
- Edge Device: Application Load Balancer (ALB) registered with VPC Lattice as target
- Why ALB: Layer 7 routing (path/host), WAF integration, TLS termination, managed by AWS Load Balancer Controller
- Traffic Flow: VPC Lattice → ALB → Target Group (EKS pods in IP mode) → Pods
- VPC Lattice enables overlapping IPs: it routes by service name, not IP address
- Note: Use NLB instead if you need source IP preservation or non-HTTP protocols
Mid-Market Cells:
- VPC Boundary: Shared VPC with isolated subnets per cell (e.g., midmarket-vpc = 10.1.0.0/16)
- Subnets: Cell A (10.1.0.0/18), Cell B (10.1.64.0/18), Cell C (10.1.128.0/18)
- Edge Device: Application Load Balancer (ALB) with separate target groups per cell
- Why ALB: Layer 7, path-based routing, WAF integration, TLS termination
- Traffic Flow: VPC Lattice → ALB → Target Group (cell-specific) → EKS Pods
SMB Cells (ECS Fargate):
- VPC Boundary: Shared VPC with Fargate tasks in isolated subnets
- Edge Device: ALB with target groups pointing to Fargate service
- Why ALB: Native ECS integration, automatic task registration/deregistration
- Traffic Flow: VPC Lattice → ALB → Fargate Tasks (via AWS Cloud Map service discovery)
What Sits at the Cell Edge?
Answer: It depends on the cell type, but generally:
For Enterprise Cells (VPC-per-cell):
- VPC Lattice Service Network routes traffic to the cell's VPC service
- Application Load Balancer (ALB) sits at the VPC edge, registered as VPC Lattice target
- ALB routes directly to Target Group containing EKS pods (IP mode)
- AWS Load Balancer Controller manages ALB lifecycle and target registration
For Mid-Market/SMB Cells (Shared VPC):
- VPC Lattice routes to a shared ALB within the VPC
- ALB sits at the cell's logical edge, with listener rules routing to cell-specific target groups
- Target groups contain either EKS worker nodes (NodePort) or Fargate tasks
- Kubernetes/ECS service discovery handles pod-level routing
Key Point: The ALB/NLB is NOT shared across all cells. Enterprise cells get dedicated load balancers in their own VPCs; Mid-Market/SMB cells may share an ALB but use separate target groups for isolation.
Regional vs Cell-Level ALB
Clarification on the Architecture Diagram:
In the diagram above, the regional ALB shown is actually the ingress ALB that sits in front of the Cell Router Lambda, NOT the cell edge devices. Here's the corrected flow:
- CloudFront/Global Accelerator → Regional Ingress ALB (internet-facing, routes to Cell Router Lambda)
- Cell Router Lambda determines cell assignment, sets X-Twilio-Cell-ID header
- VPC Lattice reads header, routes to cell-specific NLB/ALB
- Cell Edge NLB/ALB → Kubernetes Ingress or ECS Service
- Ingress/Service → Application Pods/Tasks
So there are TWO load balancer layers:
- Layer 1 (Regional): Ingress ALB for Cell Router Lambda (1 per region)
- Layer 2 (Cell): ALB at each cell's edge, registered with VPC Lattice (1 per enterprise cell, or shared for SMB)
Why ALB at cell edge, not NLB: VPC Lattice supports ALB as a target type. ALB provides Layer 7 features (path routing, host headers, WAF integration) while VPC Lattice handles the service discovery. Only use NLB if you need source IP preservation or non-HTTP protocols.
Example: Enterprise Cell Network Boundary
VPC: enterprise-cell-a-vpc (10.0.0.0/16), the same CIDR as all other enterprise cells
Subnets:
- Public: 10.0.0.0/20 (AZ-a), 10.0.16.0/20 (AZ-b), 10.0.32.0/20 (AZ-c) - ALB nodes
- Private: 10.0.64.0/18 (AZ-a), 10.0.128.0/18 (AZ-b), 10.0.192.0/18 (AZ-c) - EKS nodes
Edge Device: ALB (enterprise-a-alb) created by AWS Load Balancer Controller, registered with VPC Lattice
VPC Lattice Target: ALB ARN registered as service target
ALB Target Group: EKS pods in IP target mode (port 8080)
Kubernetes: AWS Load Balancer Controller watches Ingress resources, auto-creates ALB + target groups
Security Groups:
- ALB SG: Allow 443 from VPC Lattice service network CIDR (169.254.171.0/24)
- EKS Pod SG: Allow 8080 from ALB SG, allow pod-to-pod within VPC
Kubernetes Ingress Controller: AWS Load Balancer Controller vs NGINX
For the cell edge ingress layer, there are two primary options. Here's the trade-off analysis:
AWS Load Balancer Controller
Recommended for this AWS-native architecture
How it works:
- Kubernetes controller watches Ingress resources
- Automatically provisions ALB/NLB per Ingress
- Integrates directly with VPC Lattice targets
- Uses AWS SDK to manage load balancers
Advantages:
- AWS-native integration (VPC Lattice, WAF, Shield)
- Automatic ALB/NLB lifecycle management
- Native support for target groups, health checks
- Managed by AWS (security patches, updates)
- Better observability (CloudWatch, X-Ray)
- IP or instance target modes
Disadvantages:
- AWS vendor lock-in (can't move to GCP/Azure easily)
- Less feature-rich than NGINX (no Lua, plugins)
- ALB costs (~$20/month per Ingress)
- Limited advanced routing (no regex, rewrite limits)
Best for: AWS-committed architectures, teams preferring managed services, production workloads needing WAF/Shield integration
NGINX Ingress Controller
Alternative for multi-cloud flexibility
How it works:
- NGINX pods run as DaemonSet on nodes
- Exposed via single NLB (cost-efficient)
- NGINX handles all routing internally
- Config via Kubernetes Ingress annotations
Advantages:
- Multi-cloud portable (works on EKS, GKE, AKS)
- Feature-rich (rate limiting, auth, rewrites, Lua)
- Battle-tested, huge community
- Single NLB cost (vs ALB per Ingress)
- Advanced routing (regex, complex rewrites)
- Extensive customization via ConfigMaps
Disadvantages:
- Self-managed (you patch, upgrade, scale)
- Extra network hop (NLB → NGINX pod → app pod)
- Less AWS-native integration
- Resource overhead (NGINX pods consume CPU/memory)
- Need to manage NGINX pod scaling
Best for: Multi-cloud strategy, need advanced routing features, team has NGINX expertise, cost-sensitive (fewer ALBs)
Recommendation for This Architecture
Use AWS Load Balancer Controller for the following reasons:
- Already committed to AWS: You're using VPC Lattice, DynamoDB Global Tables, Aurora - already AWS-native
- VPC Lattice integration: ALB/NLB created by controller can be registered as VPC Lattice targets seamlessly
- WAF requirement: Twilio needs DDoS protection - ALB integrates directly with WAF
- Operational simplicity: Managed service reduces ops burden vs self-managing NGINX at scale
- Cell isolation: Separate ALBs per cell provide better blast radius containment than shared NGINX
When to use NGINX instead: If you need advanced routing (complex regex, Lua scripts) or if maintaining multi-cloud optionality is critical
Cell Network Boundary Detail
Cell Architecture Terminology & AWS Implementation
Mapping Cell-Based Architecture to AWS Implementation
Implementing cell boundaries as separate AWS accounts and VPCs is a recognized best practice for managing scale and isolation. This physical realization directly addresses account quotas and VPC limits such as Network Address Usage (NAU).
| Your Term / Component | Industry Term | AWS Implementation | Function |
|---|---|---|---|
| Cell | Unit of Scale / Deployment Cell | Dedicated AWS Account + VPC (10.0.0.0/16) | Logical grouping representing a unit of scale and deployment; change-set boundary at deployment time to limit blast radius. Contains size-capped workload replica. |
| Cloud Native Landing Zone | Management Plane | AWS Control Tower Multi-Account Environment | Foundational substrate that sets up and governs a secure multi-account AWS environment. Provides design-time governance, security guardrails, and account-level orchestration. |
| Control Plane | Control Plane / Orchestrator | Infrastructure-as-Code + Lambda/Step Functions | The "cockpit of the platform" that interacts with your Landing Zone to manage cell lifecycle: provisioning new cell accounts, de-provisioning old ones, and migrating tenants between cells. |
| Platform Services / Global Cell | Shared Singleton Services / Tier 0 | Multi-Region Deployment (Identity, IAM, etc.) | Critical services that cannot be easily cellularized (Identity, Access Management, Scheduler). Often reside in a "Global Cell" as Tier 0 dependencies. Must be multi-region to avoid single point of failure. |
| Cell Router | Traffic Partitioning Layer / Router | Route 53 / API Gateway / Lambda | The "thinnest possible layer" that uses a partition key (Customer ID) to route incoming requests to the specific endpoint of the AWS account/VPC where that customer's cell resides. |
Shipping Port Analogy
Cloud Native Landing Zone (Management Plane) = The port authority, providing the docks and security rules
Cell (AWS Account + VPC) = A standardized shipping container; if one container is damaged (deployment failure), the contents of others are safe
Cell Router = The harbor pilot who knows exactly which container belongs to which customer
Control Plane = The crane operator who brings new containers onto the dock as the port grows
Platform Services/Global Cell = The port's central administration building that all ships must check in with
Why Separate AWS Accounts for Cells?
Account-Level Benefits
- Natural Isolation Boundary: Actions in one account cannot impact others, effectively bypassing regional account-level limits
- Quota Independence: Each cell gets its own set of AWS service quotas (EC2 instances, VPCs, NAT Gateways, etc.)
- Blast Radius Containment: IAM misconfigurations, security incidents, or service disruptions are contained within the account
- Cost Allocation: Perfect cost attribution per cell/customer segment using AWS Cost Explorer and tagging
- Compliance & Auditing: Clear security boundaries for regulatory requirements (SOC2, HIPAA, PCI-DSS)
- VPC Limit Bypass: Avoid VPC-level Network Address Usage (NAU) limits by having dedicated VPCs per account
Sizing Example: Enterprise Cells
Cell Capacity Planning:
- Max Customers per Cell: 100 enterprise customers
- EKS Cluster per Cell: 1 cluster with 20-50 nodes
- Fixed Maximum Size: When Cell A reaches 100 customers, Control Plane automatically provisions Cell B
- Saturation Limit: CPU > 70%, Memory > 75%, or customer count = max triggers new cell creation
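The saturation rule above can be written as a one-line predicate (a minimal sketch; thresholds come from the text, and the function name is our own):

```python
# New-cell trigger: CPU > 70%, memory > 75%, or the customer cap is reached.
MAX_CUSTOMERS = 100  # enterprise cell cap from the capacity plan above

def needs_new_cell(cpu_pct: float, mem_pct: float, customers: int) -> bool:
    return cpu_pct > 70 or mem_pct > 75 or customers >= MAX_CUSTOMERS

print(needs_new_cell(65, 60, 100))  # True - customer cap reached
print(needs_new_cell(65, 60, 80))   # False - headroom remains
```

In practice the Control Plane would evaluate this against CloudWatch metrics and kick off a Step Function when it returns True.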
Control Plane Automation
Control Plane Responsibilities
- Cell Provisioning: Use AWS Control Tower Account Factory API to create a new AWS account → deploy VPC + EKS + DynamoDB via Terraform/CloudFormation → register cell with VPC Lattice Service Network
- Tenant Onboarding: When a new customer signs up → determine segment (SMB/Mid-Market/Enterprise) → assign to least-loaded cell in segment → update DynamoDB customer_cell_mapping table → cache assignment in Redis
- Cell Scaling: Monitor cell capacity metrics → trigger Step Function when saturation threshold is reached → provision new cell account → new customers automatically route to the new cell (lowest load)
- Cell Migration: Dual-write phase → background data sync → atomic cutover (update DynamoDB + invalidate cache) → clean up old cell data after 24 hours
- Cell Decommissioning: Drain traffic → migrate all customers to other cells → delete AWS account via Control Tower
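The migration step ordering can be sketched like this (in-memory stand-ins for DynamoDB and Redis; every helper name is hypothetical):

```python
# Stand-ins for the DynamoDB mapping table and the Redis cache.
customer_cell_mapping = {"cust-123": "enterprise-cell-a"}
redis_cache = {"cust-123": "enterprise-cell-a"}

# Hypothetical helpers - in the real system these would be Step Function tasks.
def enable_dual_write(customer_id: str, target_cell: str) -> None:
    pass  # writes start going to both old and new cells

def sync_historical_data(customer_id: str, target_cell: str) -> None:
    pass  # background copy of existing data

def schedule_cleanup(customer_id: str, delay_hours: int) -> None:
    pass  # delete old-cell data after the retention window

def migrate(customer_id: str, target_cell: str) -> None:
    enable_dual_write(customer_id, target_cell)
    sync_historical_data(customer_id, target_cell)
    # Atomic cutover: flip the durable mapping, then invalidate the cache
    # so the next request re-reads the new cell assignment.
    customer_cell_mapping[customer_id] = target_cell
    redis_cache.pop(customer_id, None)
    schedule_cleanup(customer_id, delay_hours=24)

migrate("cust-123", "enterprise-cell-b")
print(customer_cell_mapping["cust-123"])  # enterprise-cell-b
print("cust-123" in redis_cache)          # False
```

The ordering matters: dual-write before sync avoids losing in-flight writes, and the cache invalidation must follow the mapping update so a stale entry can never be repopulated from old data.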
Multi-Region Architecture Considerations
Regional Cell Deployment
Each region (us-east-1, eu-west-1, ap-southeast-1) has its own set of cells:
Example Cell Distribution:
- us-east-1: enterprise-cell-a, enterprise-cell-b, midmarket-cell-shared, smb-cell-shared
- eu-west-1: enterprise-cell-eu-a, enterprise-cell-eu-b, midmarket-cell-eu-shared
- ap-southeast-1: enterprise-cell-apac-a, midmarket-cell-apac-shared
Global Cell (Platform Services):
- Identity Service: Deployed in all 3 regions with DynamoDB Global Tables for user authentication
- API Key Management: Replicated globally for low-latency validation
- Billing Service: Multi-region for quota checks and balance deductions
Critical: Platform Services are "Tier 0" dependencies: if the Identity service fails globally, all cells are affected. Therefore, these MUST be deployed multi-region with active-active replication.
Interview Talking Points
2-Minute Summary: Cell-Based Architecture on AWS
"For Twilio's cell-based architecture, I'd implement cells as dedicated AWS accounts with their own VPCs, all using the same CIDR block (10.0.0.0/16) for maximum autonomy. This bypasses VPC-level quotas and enables true fault isolation."
"The Cloud Native Landing Zone is built on AWS Control Tower, which sets up the multi-account governance structure. The Control Plane automates cell lifecycleโprovisioning new accounts when capacity is reached, using infrastructure-as-code to deploy identical cell stacks, and managing customer migrations."
"VPC Lattice solves the overlapping IP problem by routing based on service names, not IP addresses. The Cell Router looks up which cell a customer belongs to in DynamoDB (cached in Redis), then VPC Lattice routes the request to that cell's service endpointโno VPC peering needed."
"Platform Services like Identity and IAM run in a separate Global Cell, deployed multi-region as Tier 0 dependencies. All regional cells depend on these for authentication and authorization, so they must be highly available across multiple regions."
"This architecture balances operational efficiency with isolation: enterprise customers get dedicated cells (accounts), mid-market shares cells in a VPC, and SMB uses namespace isolation within shared clusters. The Control Plane automates it all."
Cell Partitioning Strategy
Why Cell-Based Architecture?
Cell-based architecture is a pattern where you partition your infrastructure into isolated, independent units (cells). Each cell is a complete deployment of your application stack, limiting the blast radius of failures.
Benefits: Fault isolation, independent scaling, easier testing, phased rollouts, regional compliance.
Cell Partitioning Dimensions
| Partition Type | Strategy | Rationale | AWS Implementation |
|---|---|---|---|
| Customer Size | Enterprise, Mid-Market, SMB | Different SLAs, resource needs, isolation requirements | Separate EKS clusters, dedicated capacity reservations |
| Geography | US, EU, APAC regions | Data residency, latency, compliance (GDPR, etc.) | Multi-region deployment, Route 53 geolocation routing |
| Product Vertical | Voice, SMS, Video, Verify | Different latency/consistency requirements | Product-specific microservices in each cell |
| Availability Zone | Multi-AZ within region | High availability, AZ failure isolation | EKS node groups across 3 AZs, Aurora Multi-AZ |
Cell Design Per Customer Segment
Enterprise Cell
Customers: Uber, Lyft, Airbnb (top 100 customers)
Characteristics:
- Dedicated EKS cluster (1000+ nodes)
- Reserved capacity, dedicated NAT gateways
- Aurora Global Database (dedicated writer) for strong-consistency writes
- DynamoDB Global Tables
- 99.99% SLA
- Priority support, custom metrics
Mid-Market Cell
Customers: Growing startups (1000-10,000 customers)
Characteristics:
- Shared EKS cluster (200-500 nodes)
- Burstable capacity
- Aurora read replicas (writes to primary)
- DynamoDB Global Tables
- 99.95% SLA
- Standard support
SMB Cells (A/B/C/D)
Customers: Small businesses (100,000+ customers)
Characteristics:
- ECS Fargate (serverless, cost-optimized)
- Spot instances for batch workloads
- Aurora read replicas (shared primary)
- DynamoDB on-demand pricing
- 99.9% SLA
- Community support
Cell Routing Strategy
How requests route to cells:
- Route 53: Geolocation routing directs to nearest region
- API Gateway / ALB: Custom domain with cell identifier in header/path
- Cell Router Service: Looks up customer → cell mapping in DynamoDB
- VPC Lattice: Service mesh routes to correct EKS cluster/namespace
Customer ID → Hash → Cell Assignment → Persistent mapping in DynamoDB
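The hash step can be sketched as a stable modulo over the cell list (a sketch, not Twilio's actual scheme; cell names are illustrative, and in practice the result is persisted in DynamoDB so later changes to the cell count don't reshuffle existing customers):

```python
import hashlib

# Illustrative SMB cell pool matching the A/B/C/D naming above.
SMB_CELLS = ["smb-cell-a", "smb-cell-b", "smb-cell-c", "smb-cell-d"]

def initial_cell(customer_id: str) -> str:
    """Deterministic first-time assignment: same customer, same cell."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return SMB_CELLS[int(digest, 16) % len(SMB_CELLS)]

print(initial_cell("AC1234"))                             # stable across runs
print(initial_cell("AC1234") == initial_cell("AC1234"))   # True
```

Using SHA-256 (rather than Python's built-in `hash`) keeps the assignment stable across processes, which is why the persisted mapping and the hash never disagree.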
Cell Taxonomy: Workflow-Affinity, Operationally-Differentiated
Core Principle: Group by Workflow Affinity, Not Product Catalog
Services that communicate at runtime should share a cell. Services with uncorrelated demand and no runtime dependencies can be separated for better capacity efficiency.
Why? Co-location avoids cross-cell API calls and cascading failures. But forcing unrelated services together creates capacity modeling challenges: you end up sizing cells for combined peak load even when the services never peak together.
The Core Trade-off
| Approach | Cross-Cell Calls | Capacity Efficiency | Routing Complexity |
|---|---|---|---|
| All services, one cell | None | Poor (uncorrelated demand) | Simple |
| Workflow-affinity cells (recommended) | Rare (across workflow boundaries) | Good | Moderate |
| Service-per-cell | Frequent | Optimal | Complex |
When to Co-locate Services (Same Cell)
Co-locate when:
- Services call each other synchronously (e.g., SMS delivery → Voice callback → WhatsApp fallback)
- Customer workflows span multiple services in a single request path (e.g., Verify API uses SMS + Voice)
- Failure in one impacts the other anyway (tightly coupled dependencies)
- Similar scaling characteristics (both bursty, both sustained, etc.)
When to Separate Services (Different Cells)
Separate when:
- Services rarely or never communicate at runtime (no cross-cell call risk)
- Vastly different scaling characteristics: SMS is bursty (millions/minute), Voice is sustained (concurrent calls), Video is bandwidth-bound
- Different instance type optimizations: CPU-bound vs memory-bound vs I/O-bound workloads
- Uncorrelated demand patterns: don't size for a combined peak when services never peak together
The Auto-Scaling Reality Check
Auto-scaling helps but has limits:
| Factor | Challenge |
|---|---|
| Scale-up latency | 2-5 minutes for EC2, need headroom buffer |
| Minimum instances | Cost floor regardless of actual demand |
| Different triggers | SMS scales on queue depth, Voice on concurrent connections |
| Compounding headroom | If SMS needs 30% headroom and Voice needs 30%, a combined cell needs both, even if they never peak together |
Recommended: Workflow-Affinity Cell Types
Messaging Cell
- Services: SMS, MMS, WhatsApp, RCS
- Why together: Similar patterns (message queues), often used in fallback chains
- Scaling: Queue depth, messages/second
Real-time Media Cell
- Services: Voice, Video, WebRTC
- Why together: Connection-based, jitter-sensitive, often combined in apps
- Scaling: Concurrent connections, bandwidth
Async Communication Cell
- Services: Email (SendGrid), Fax
- Why together: Batch-oriented, store-and-forward, different SLA expectations
- Scaling: Throughput, delivery rate
Verify Cell
- Services: Verify API (2FA), Lookup
- Why together: Cross-channel verification workflows (SMS + Voice + Email)
- Scaling: Requests/second, verification attempts
```
# Workflow-affinity routing: customer_id + region + service_category → cell_id
# Example mappings:
customer: acme-corp, region: us-east-1, category: messaging → messaging-enterprise-us-001
customer: acme-corp, region: us-east-1, category: realtime  → realtime-enterprise-us-001
customer: acme-corp, region: us-east-1, category: verify    → verify-enterprise-us-001
# Same customer, different cells by workflow affinity
# Cross-category calls are rare (by design)
```
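The same mapping as a minimal lookup sketch (table contents mirror the example above and are illustrative):

```python
# (customer, region, service category) -> workflow-affinity cell.
ROUTING_TABLE = {
    ("acme-corp", "us-east-1", "messaging"): "messaging-enterprise-us-001",
    ("acme-corp", "us-east-1", "realtime"):  "realtime-enterprise-us-001",
    ("acme-corp", "us-east-1", "verify"):    "verify-enterprise-us-001",
}

def cell_for(customer: str, region: str, category: str) -> str:
    return ROUTING_TABLE[(customer, region, category)]

print(cell_for("acme-corp", "us-east-1", "messaging"))  # messaging-enterprise-us-001
```

Adding the service category to the routing key is the only change versus plain customer-to-cell routing; everything downstream (VPC Lattice, cell edge) stays the same.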
Cell Taxonomy Dimensions
Primary Dimensions (Tier 1 - Must Have)
1. Geographic/Regional
   - Why: Legal requirement for data residency (GDPR, Chinese data laws)
   - Why: Latency matters for real-time voice/video
   - Impact: Minimum deployment in US, EU, APAC regions
   - Interview value: Shows understanding of regulatory constraints
2. Customer Segment
   - Why: Enterprise vs SMB have fundamentally different operational needs
   - Examples:
     - Enterprise: Dedicated cells, custom SLAs, contractual guarantees, 24/7 support
     - SMB/Developer: Shared multi-tenant cells, best-effort, self-service
   - Impact: Different blast radius tolerance, isolation requirements, cost models
   - Interview value: Demonstrates business/technical balance
3. Compliance/Regulatory
   - Why: You CANNOT mix HIPAA healthcare traffic with regular SMS
   - Why: PCI-DSS for payment flows requires isolation
   - Impact: Separate cell infrastructure with auditable controls
   - Interview value: Shows enterprise architecture experience
Secondary Dimensions (Conditional)
- Workload Characteristics (May be implied by product type):
- Transactional: 2FA codes, alerts (low-latency, predictable)
- Marketing/Bulk: Campaign SMS (high-throughput, bursty)
- Real-time: Voice/Video (sustained connections, jitter-sensitive)
Note: This might be handled within cells via different service tiers rather than requiring separate cell types.
What to AVOID: Per-Service Cells Without Workflow Analysis
Don't blindly create one cell type per product. Instead, analyze actual workflow dependencies:
- Ask: "If a customer uses Service A, how often do they call Service B in the same request?"
- If frequently: Co-locate in the same workflow-affinity cell
- If rarely/never: Separate cells are fine; there is no cross-cell call risk
Better: Messaging-Cell (SMS + MMS + WhatsApp), Realtime-Cell (Voice + Video + WebRTC)
Best: Workflow analysis determines grouping based on actual runtime dependencies
Key insight: The goal isn't "fewer cells" or "more cells"; it's minimizing cross-cell calls while avoiding the capacity-modeling nightmares of forcing unrelated services together.
Operational Differentiation: Same Code, Different Posture
Every cell runs the same microservices (SMS Service, Voice Service, Video Service, etc.). What changes is the operational configuration:
| Aspect | Enterprise Cell | SMB Cell |
|---|---|---|
| Availability | Multi-AZ, N+2 redundancy | Multi-AZ, N+1 redundancy |
| Compute | r7g.2xlarge (dedicated) | r7g.large (shared, burstable) |
| Database | Multi-AZ RDS, 6 read replicas | Multi-AZ RDS, 2 read replicas |
| Capacity | Reserved instances, pre-scaled | Auto-scaling, spot instances |
| Network | Dedicated NAT Gateways, Transit Gateway | Shared NAT, standard VPC routing |
| Storage | Provisioned IOPS SSD (io2) | General Purpose SSD (gp3) |
| Monitoring | Sub-minute CloudWatch, custom metrics | 5-minute CloudWatch, standard metrics |
| Blast Radius | 10-50 customers per cell | 1,000+ customers per cell |
| SLA | 99.99% (52 min downtime/year) | 99.9% (8.7 hrs downtime/year) |
| Cost per Customer | ~$500-2000/month | ~$10-50/month |
Key Benefits of This Architecture
1️⃣ No Cross-Cell Cascading Failures
- Customer uses SMS + Voice + Video → all in same cell
- No distributed transactions across cells
- Failures stay contained within cell boundary
2️⃣ Noisy Neighbor Isolation
- SMB customer traffic spike → only affects other SMB customers
- Enterprise customers in separate cells → unaffected
- No SLA violations for high-paying customers
3️⃣ Simple Customer Migration
- Upgrade SMB → Enterprise: Update DynamoDB routing table
- Zero code changes, just point to new cell
- Gradual traffic shifting (10% → 50% → 100%)
4️⃣ Economic Optimization
- Enterprise: Infrastructure dollars proportional to revenue
- SMB: Economies of scale through higher density
- Right-sized spending per customer tier
5️⃣ Operational Simplicity
- Every cell has identical structure (same services)
- One deployment pipeline, one monitoring setup
- Only Terraform variables change (instance sizes, etc.)
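The gradual traffic shifting described under "Simple Customer Migration" can be sketched with deterministic hash bucketing, so a given request stays sticky to its chosen cell as the shift percentage grows from 10% to 100%. The `route_cell` helper and the cell names are hypothetical.

```python
import hashlib

def route_cell(request_id, old_cell, new_cell, shift_pct):
    """Deterministically route shift_pct% of requests to the new cell.
    Hash bucketing keeps a request sticky: once it moves at 10%,
    it stays moved at 50% and 100%."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return new_cell if bucket < shift_pct else old_cell

# Hypothetical cell names; 0% keeps everything on the old cell,
# 100% moves everything to the new one.
print(route_cell("req-42", "Messaging-SMB-US-Standard",
                 "Messaging-Enterprise-US-Standard", 10))
```

In practice the shift percentage would live alongside the DynamoDB routing entry so the Cell Router reads both in one lookup.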
Infrastructure-as-Code Example
```hcl
# Terraform module - workflow-affinity cells with operational tiers
module "messaging_cell" {
  source = "./modules/twilio-cell"

  # Cell identity
  cell_id  = "messaging-enterprise-us-001"
  category = "messaging"
  region   = "us-east-1"
  segment  = "enterprise"

  # Services in this workflow affinity group
  services = [
    "sms-service",
    "mms-service",
    "whatsapp-service",
    "rcs-service",
  ]

  # Operational config (enterprise tier)
  instance_type  = "r7g.2xlarge"
  min_instances  = 6
  scaling_metric = "queue_depth" # Messaging scales on queue depth
}

module "realtime_cell" {
  source = "./modules/twilio-cell"

  cell_id  = "realtime-enterprise-us-001"
  category = "realtime"
  region   = "us-east-1"
  segment  = "enterprise"

  # Services in this workflow affinity group
  services = [
    "voice-service",
    "video-service",
    "webrtc-service",
  ]

  # Different scaling characteristics
  instance_type  = "c7g.2xlarge" # CPU-optimized for media processing
  min_instances  = 4
  scaling_metric = "concurrent_connections" # Realtime scales on connections
}

# Same customer can be in both cells (different workflow categories)
# Cross-category calls are rare by design
```
# Same customer can be in both cells (different workflow categories)
# Cross-category calls are rare by design
💡 Interview Talking Point: Cell Taxonomy
"I'd group services by workflow affinity rather than putting everything in one cell or splitting by individual product.
The key question is: 'Do these services call each other at runtime?' If yes, co-locate them to avoid cross-cell latency and cascading failures. If they rarely or never communicate, separate cells let them scale independently with better capacity efficiency.
For Twilio, this might mean: a Messaging Cell (SMS, MMS, WhatsApp; similar patterns, fallback chains), a Real-time Media Cell (Voice, Video, WebRTC; connection-based, jitter-sensitive), and an Async Cell (Email, Fax; batch-oriented, different SLAs).
Why not just one cell with everything? Capacity modeling becomes a nightmare. SMS is bursty, Voice is sustained, Video is bandwidth-heavy. If you force them together, you're sizing for combined peak even when they never peak simultaneously. Auto-scaling helps, but you still pay the minimum instance cost floor and headroom buffer for each service.
Within each workflow category, I'd still differentiate by operational tier: Enterprise cells over-provisioned for SLA guarantees, SMB cells cost-optimized with higher density. Same deployment pipeline, different operational posture.
The goal is minimizing cross-cell calls while avoiding the capacity modeling tax of forcing unrelated services together."
The "So What?" Test for Cell Taxonomy
For each dimension, ask: "Does this require materially different infrastructure, SLAs, or operational procedures?"
- ✅ Geography → Yes (different AWS regions, data residency laws)
- ✅ Segment → Yes (dedicated vs shared, different SLAs)
- ✅ Compliance → Yes (audit logs, encryption, access controls)
- ✅ Workflow affinity → Yes (different scaling characteristics, instance types, capacity models)
- ⚠️ Individual product → Only if services rarely communicate AND have very different scaling needs
Principle: Segment on operational characteristics AND runtime dependencies. Co-locate services that call each other; separate services with uncorrelated demand that don't communicate.
Example Cell Naming Convention
```
# Workflow-affinity cell naming: {category}-{segment}-{region}-{compliance}
Messaging-Enterprise-US-HIPAA   # Healthcare customer, messaging services, US
Messaging-SMB-EU-Standard       # Small business, messaging services, EU
Realtime-Enterprise-US-Standard # Enterprise customer, voice/video, US
Realtime-SMB-APAC-Standard      # SMB customer, voice/video, Asia-Pacific
Async-Enterprise-US-PCI         # Payment processor, email/fax, US
Verify-Enterprise-US-HIPAA      # Healthcare, 2FA workflows, US

# Cell Router lookup (now includes service category):
Customer ID + Service Category → DynamoDB → Cell Assignment → VPC Lattice → Correct Cell

# Same customer, multiple cells:
acme-corp → messaging → Messaging-Enterprise-US-Standard
acme-corp → realtime  → Realtime-Enterprise-US-Standard
acme-corp → verify    → Verify-Enterprise-US-Standard
```
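A minimal sketch of the router lookup above, with an in-memory dict standing in for the DynamoDB assignments table (in production this would be a `GetItem` on a composite key of customer ID and service category):

```python
# Dict stands in for the DynamoDB routing table so the logic is visible.
ROUTING_TABLE = {
    ("acme-corp", "messaging"): "Messaging-Enterprise-US-Standard",
    ("acme-corp", "realtime"):  "Realtime-Enterprise-US-Standard",
    ("acme-corp", "verify"):    "Verify-Enterprise-US-Standard",
}

def lookup_cell(customer_id, service_category):
    """Customer ID + service category -> cell. Customers without an explicit
    assignment fall back to the shared SMB cell for that category
    (hypothetical default naming)."""
    default = f"{service_category.capitalize()}-SMB-US-Standard"
    return ROUTING_TABLE.get((customer_id, service_category), default)

print(lookup_cell("acme-corp", "realtime"))
# -> Realtime-Enterprise-US-Standard
```

The fallback default is what makes migration cheap: upgrading a customer is just inserting a row, and removing the row reverts them to the shared cell.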
Detailed Architecture Layers
Request Flow Through Architecture
AWS Service Mapping by Layer
Layer 1: Global Edge & Traffic Management
- Route 53: DNS with health checks, geolocation routing, latency-based routing
- CloudFront: CDN for static assets, API caching, geo-restrictions
- AWS Global Accelerator: Anycast IPs, automatic failover between regions
- WAF & Shield: DDoS protection, rate limiting, bot mitigation
Layer 2: Regional Gateway
- API Gateway: Regional REST/WebSocket APIs, request validation, throttling
- Application Load Balancer (ALB): Layer 7 load balancing, TLS termination, path-based routing
- Network Load Balancer (NLB): Layer 4 for Voice/Video (UDP/TCP), ultra-low latency
- VPC Peering / Transit Gateway: Cross-region connectivity
Layer 3: Cell Router & Service Mesh
- Cell Router Service: Custom Lambda/ECS service that maps customer → cell
- DynamoDB: Stores customer-to-cell assignments, low-latency lookups
- VPC Lattice: Service network for cross-VPC, service-to-service communication (handles overlapping CIDRs)
- AWS Cloud Map: Service discovery for microservices
Layer 4: Compute Cells
- EKS (Kubernetes): Enterprise and mid-market cells, full control, custom scheduling
- ECS Fargate: SMB cells, serverless, lower operational overhead
- EC2 Auto Scaling: EKS worker nodes, reserved + spot instances
- Lambda: Event-driven workloads (webhooks, async processing)
Layer 5: Data & Storage
- DynamoDB Global Tables: Multi-master, active-active replication across regions
- Aurora Multi-Master: Multi-writer clusters for enterprise cells (note: MySQL-compatible only, not offered for PostgreSQL)
- Aurora with Read Replicas: Cross-region replicas, leader-follower pattern
- ElastiCache (Redis): Session management, rate limiting, caching
- S3: Call recordings, message attachments, static assets
- Kinesis: Real-time event streaming, analytics pipeline
- SQS/SNS: Asynchronous messaging, fan-out patterns
Database Layer Strategy
Multi-Master vs Leader-Follower Decision Matrix
The choice between multi-master and leader-follower depends on:
- Write Pattern: Multi-region writes? Geographic distribution of writes?
- Consistency Requirements: Strong consistency needed? Conflict tolerance?
- Latency Sensitivity: Can tolerate cross-region write latency?
- Data Characteristics: High contention? Append-only? Partitionable?
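As a first cut, the two dominant questions in this list can be encoded directly; real decisions would also weigh latency sensitivity and data shape. The function and its return strings are illustrative, not a complete decision procedure.

```python
def replication_pattern(multi_region_writes, needs_strong_consistency):
    """First-cut mapping from the decision matrix to a replication pattern.
    Strong consistency dominates: a single leader is the simplest way to
    guarantee it, even if writes must then cross regions."""
    if needs_strong_consistency:
        return "leader-follower (Aurora Global Database)"
    if multi_region_writes:
        return "multi-master (DynamoDB Global Tables)"
    return "single-region primary + read replicas"

# Phone number inventory: strong consistency wins over write locality.
print(replication_pattern(multi_region_writes=True, needs_strong_consistency=True))
# -> leader-follower (Aurora Global Database)
```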
Database Selection by Use Case
DynamoDB Global Tables (RECOMMENDED)
Pattern: Multi-Master (Active/Active)
Use Cases:
- Customer account data
- API keys & credentials
- SMS message logs (append-only)
- Usage metrics & billing
- Configuration data
Why:
- ✅ Fully managed, no ops overhead
- ✅ Sub-10ms latency
- ✅ Auto-scaling, pay-per-request
- ✅ Last-write-wins conflict resolution
- ✅ 99.999% availability SLA
Regions: US-East-1, EU-West-1, AP-Southeast-1
Aurora Multi-Master (MySQL-Compatible)
Pattern: Multi-Master (Active/Active)
Use Cases:
- Enterprise customer configurations
- Complex queries with ACID
- Multi-table transactions
Why:
- ✅ ACID transactions across masters
- ✅ SQL interface, complex queries
- ✅ Up to 2 masters per region
- ⚠️ Limited to single-region multi-master
- ⚠️ Higher latency than DynamoDB
Configuration: 2 masters + 3 read replicas per region
Aurora Global Database (Leader-Follower)
Pattern: Leader-Follower (Active/Passive)
Use Cases:
- Phone number inventory
- Payment transactions
- Strong consistency requirements
- Regulatory compliance data
Why:
- ✅ Strong consistency guarantees
- ✅ < 1 second cross-region replication
- ✅ Failover in < 1 minute (RTO), RPO < 1s
- ✅ Read replicas in secondary regions
- ⚠️ Writes only to primary region
Configuration: Primary in US-East-1, replicas in EU + APAC
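The no-double-assignment property can be sketched with a conditional update (compare-and-set): the `UPDATE ... WHERE owner IS NULL` either wins the row or matches nothing. sqlite3 stands in for the Aurora primary here; on Aurora PostgreSQL you would typically use `SELECT ... FOR UPDATE SKIP LOCKED` inside a transaction instead.

```python
import sqlite3

# sqlite3 stands in for the Aurora primary; the pattern, not the engine,
# is the point: claim a number with a conditional UPDATE so two concurrent
# callers can never be handed the same row.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE numbers (e164 TEXT PRIMARY KEY, owner TEXT)")
db.executemany("INSERT INTO numbers VALUES (?, NULL)",
               [("+15550000001",), ("+15550000002",)])
db.commit()

def claim_number(conn, customer_id):
    """Compare-and-set assignment: the UPDATE only succeeds if the row is
    still unowned, so a losing racer sees rowcount 0 and retries."""
    while True:
        row = conn.execute(
            "SELECT e164 FROM numbers WHERE owner IS NULL LIMIT 1").fetchone()
        if row is None:
            return None  # inventory exhausted
        cur = conn.execute(
            "UPDATE numbers SET owner = ? WHERE e164 = ? AND owner IS NULL",
            (customer_id, row[0]))
        conn.commit()
        if cur.rowcount == 1:
            return row[0]

first = claim_number(db, "cust-a")
second = claim_number(db, "cust-b")
assert first != second  # never double-assigned
```

This is exactly why the inventory sits behind a single write region: the conditional update only prevents double-assignment when all writers see the same authoritative row.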
Cassandra on EC2/Kubernetes
Pattern: Masterless (Peer-to-Peer)
Use Cases:
- Time-series data (call metrics)
- Extreme scale (billions of records)
- High write throughput
- When DynamoDB limits hit
Why:
- ✅ Linear scalability
- ✅ No single point of failure
- ✅ Tunable consistency (CL=QUORUM)
- ✅ Multi-datacenter replication
- ⚠️ Higher operational complexity
- ⚠️ Need dedicated ops team
When to use: Only if DynamoDB can't meet scale/cost requirements
Database Replication Topology
Conflict Resolution Strategies
| Database | Conflict Resolution | Use Case Fit |
|---|---|---|
| DynamoDB Global Tables | Last-write-wins (LWW) based on timestamp | ✅ Account updates, configs (rare conflicts); ✅ Append-only logs (no conflicts) |
| Aurora Multi-Master | Automatic deadlock detection, transaction rollback | ✅ Enterprise configs with transactions; ⚠️ Limited to 2 masters in same region |
| Cassandra | Tunable: LWW, custom application logic, CRDTs | ✅ Time-series (no conflicts); ✅ Counters with CRDT increments |
| Application-Level | Version vectors, CRDTs, custom merge logic | ✅ Complex business logic; ✅ Shopping carts, collaborative editing |
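A minimal sketch of last-write-wins as described for DynamoDB Global Tables: the newer timestamp wins, with a deterministic tie-break so every region converges to the same item. DynamoDB's actual internal tie-break is not exposed; the field-based tie-break here is illustrative.

```python
def lww_merge(local, remote):
    """Last-write-wins: the item with the newer timestamp replaces the
    other; ties break on a stable field so all replicas converge
    identically regardless of merge order."""
    if remote["updated_at"] == local["updated_at"]:
        # illustrative tie-break; DynamoDB's internal rule is not exposed
        return max(local, remote, key=lambda item: item["region"])
    return remote if remote["updated_at"] > local["updated_at"] else local

us = {"region": "us-east-1", "plan": "enterprise", "updated_at": 1700000002}
eu = {"region": "eu-west-1", "plan": "mid-market", "updated_at": 1700000001}
print(lww_merge(us, eu)["plan"])
# -> enterprise
```

Note the trade-off the table implies: LWW silently discards the losing write, which is fine for configs and append-only logs but is exactly why payments and inventory go to a single-leader store instead.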
Twilio Product to Database Mapping
Principle: Match Consistency Model to Product Requirements
Different Twilio products have different consistency, latency, and scale requirements. Choose the right database pattern for each.
| Twilio Product | Data Type | Database Choice | Replication Pattern | Consistency Model | Rationale |
|---|---|---|---|---|---|
| Programmable Voice | Call records, CDRs | DynamoDB Global Tables | Multi-Master | Eventual | Append-only, high volume, sub-second replication lag acceptable, query by CallSID |
| Programmable Voice | Active call state | ElastiCache Redis | N/A (Cache) | Strong* | Low latency, session state, ephemeral (TTL). *Single-node writes within region |
| Programmable SMS | Message logs | DynamoDB Global Tables | Multi-Master | Eventual | Append-only, billions of messages, partitioned by phone number, ~1s replication lag |
| Programmable SMS | Message archives | S3 → Glacier | Cross-region replication | Eventual | Compliance, long-term retention, cost-optimized, S3 read-after-write consistency |
| Phone Numbers | Inventory management | Aurora Global Database | Leader-Follower | Strong | CRITICAL: Can't double-assign numbers. ACID transactions, serializable isolation |
| Conversations API | Messages, participants | DynamoDB Global Tables | Multi-Master | Eventual | Append-only messages, out-of-order delivery acceptable, partition by conversation |
| Verify API | Verification tokens | ElastiCache Redis | N/A (Cache) | Strong* | Short-lived (5-10 min TTL), low latency. *Regional strong consistency |
| Account Management | Customer accounts | DynamoDB Global Tables | Multi-Master | Eventual | Low write contention, global access, last-write-wins acceptable, ~1s replication |
| Billing | Usage metrics | Kinesis → S3 → Redshift | Event streaming | Eventual | High volume, analytics, batch processing, eventual aggregation acceptable |
| Billing | Invoices, payments | Aurora Global Database | Leader-Follower | Strong | CRITICAL: Financial transactions require ACID guarantees, no double-charging |
| Video | Room state | ElastiCache Redis | N/A (Cache) | Strong* | Real-time, low latency, ephemeral. *Regional consistency, conflicts rare |
| Video | Recordings | S3 Multi-Region | Cross-region replication | Eventual | Large files, CDN integration, durability, S3 eventual cross-region consistency |
📊 Consistency Models Explained
Eventual Consistency
Definition: Writes to one replica will eventually propagate to all replicas, but reads may see stale data during replication lag (~1 second for DynamoDB Global Tables).
When to use:
- Append-only data (logs, messages, events)
- Data where conflicts are rare or easily resolved (last-write-wins)
- High availability and low latency are more important than consistency
- Multi-region writes are required
Trade-off: Clients might read stale data briefly, but system remains available during network partitions.
Strong Consistency
Definition: All reads see the most recent write. Once a write is acknowledged, all subsequent reads from any replica will see that write.
When to use:
- Financial transactions (billing, payments)
- Inventory management (phone numbers, seats)
- Any data where stale reads cause business problems
- ACID transaction requirements
Trade-off: Higher latency for cross-region reads, limited write scalability (single leader), reduced availability during partitions.
🎓 Interview Tip: Always justify your consistency choice. For Twilio, voice/SMS logs can tolerate eventual consistency (append-only, no conflicts), but phone number inventory requires strong consistency (can't double-assign). This is a key architectural trade-off!
Data Flow Example: SMS Message
EU customer → Cell Router (DynamoDB lookup) → SMS service (EKS pod) → DynamoDB (local write) → Global Tables replication to US + APAC (<1 sec) → Kinesis (event stream) → S3 (batch) → Glacier (compliance archive, 7 year retention)
AWS Well-Architected Framework Alignment
🎯 Operational Excellence
- IaC: Terraform/CloudFormation for all cells
- CI/CD: Blue/green deployments per cell
- Observability: CloudWatch, X-Ray, cell-level dashboards
- Runbooks: Automated remediation with Systems Manager
- Chaos Engineering: Fault injection per cell (GameDay exercises)
🔒 Security
- IAM: Service roles per cell, least privilege
- Encryption: At-rest (KMS) and in-transit (TLS 1.3)
- Secrets: Secrets Manager, automatic rotation
- Network: VPC isolation per cell, PrivateLink
- Compliance: GDPR data residency, SOC2, HIPAA
💪 Reliability
- Multi-AZ: All cells across 3 AZs
- Multi-Region: Active/active in 3 regions
- Cell Isolation: Blast radius limited to single cell
- Backups: Automated DynamoDB PITR, Aurora snapshots
- Disaster Recovery: RTO < 5min, RPO < 1sec
⚡ Performance Efficiency
- Global Accelerator: Anycast routing, TCP optimization
- DynamoDB: Single-digit ms latency, auto-scaling
- ElastiCache: Sub-millisecond caching
- Compute: Graviton instances (40% better perf/cost)
- Observability: X-Ray distributed tracing
💰 Cost Optimization
- Right-Sizing: Enterprise (reserved), SMB (spot/fargate)
- DynamoDB: On-demand for SMB, provisioned for enterprise
- S3 Lifecycle: Glacier for compliance archives
- Compute Savings Plans: 30-50% savings
- Monitoring: Cost anomaly detection per cell
🌍 Sustainability
- Regions: Choose AWS regions with renewable energy
- Graviton: 60% less energy than x86
- Serverless: Fargate, Lambda for variable workloads
- S3 Intelligent-Tiering: Automatic archival
- Auto-Scaling: Scale down during off-peak
Key Architecture Decisions & Trade-offs
✅ Decision: DynamoDB Global Tables over Cassandra
Rationale: Managed service, lower ops overhead, sufficient scale for most Twilio use cases
Trade-off: Less control over replication topology, higher cost at extreme scale
When to reconsider: If DynamoDB costs > $1M/month or need more than 5 regions
✅ Decision: Cell-based architecture by customer size
Rationale: Different SLAs, resource isolation, blast radius containment
Trade-off: More complex routing, cell rebalancing overhead
Mitigation: Automated cell assignment, gradual customer migration tools
✅ Decision: EKS for Enterprise, ECS Fargate for SMB
Rationale: Enterprise needs custom scheduling/control, SMB needs cost efficiency
Trade-off: Managing two orchestration platforms
Mitigation: Shared CI/CD pipelines, standardized observability
✅ Decision: Aurora Global Database for phone inventory
Rationale: Strong consistency required, can't double-assign phone numbers
Trade-off: Single write region, cross-region writes have higher latency
Mitigation: Regional inventory pools, write to nearest region's pool
Scalability Estimates
| Component | Current Scale | Target Scale | Bottleneck | Mitigation |
|---|---|---|---|---|
| DynamoDB Global Tables | 10K RCU/WCU per table | 100K RCU/WCU | Partition hotspots | Partition key design (customer_id + timestamp), adaptive capacity |
| Aurora Global | 100K transactions/sec | 500K transactions/sec | Write throughput | Horizontal sharding by region, read replicas |
| EKS Cluster (Enterprise) | 1000 nodes | 5000 nodes | Control plane limits | Multiple clusters per region, federated control plane |
| API Gateway | 10K req/sec | 100K req/sec | Regional quotas | Multi-region, ALB for high throughput paths |
| Kinesis Streams | 1000 shards | 10K shards | Shard management | Automatic resharding, Kinesis Data Firehose for batch |
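The partition-hotspot mitigation in the first row can be sketched as write sharding: suffix the partition key with a small hash-derived shard number so one hot customer's writes spread across several partitions while sort keys stay time-ordered within each shard. The `pk`/`sk` shapes and the shard count of 10 are illustrative.

```python
import hashlib

def message_log_key(customer_id, timestamp_ms, shards=10):
    """Composite key with a write-shard suffix: a hot customer's writes
    spread across `shards` partitions; readers fan out over all suffixes."""
    shard = int(hashlib.sha256(f"{customer_id}:{timestamp_ms}".encode())
                .hexdigest(), 16) % shards
    return {"pk": f"{customer_id}#{shard}", "sk": f"MSG#{timestamp_ms}"}

key = message_log_key("acme-corp", 1700000000000)
print(key["pk"], key["sk"])
```

The cost of sharding is on the read path: a query for one customer's messages must issue one query per shard suffix and merge, which is why the shard count stays small.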
Monitoring & Observability
Cell-Level Dashboards
- CloudWatch: Custom metrics per cell (latency, error rate, throughput)
- X-Ray: Distributed tracing across cells, service map
- Prometheus + Grafana: EKS cluster metrics, pod-level observability
- DynamoDB Contributor Insights: Identify hot partitions
- Aurora Performance Insights: Database query analysis
- CloudWatch Alarms: Cell health checks, automatic remediation via Lambda
SLI/SLO Tracking:
- API latency p99 < 200ms (per cell, per product)
- Error rate < 0.1% (per cell)
- Availability > 99.99% (enterprise cells), > 99.95% (mid-market), > 99.9% (SMB)
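The availability targets above translate directly into annual error budgets; a quick sketch of the arithmetic (99.99% works out to about 52.6 minutes of downtime per year, matching the SLA table earlier):

```python
def downtime_budget_minutes(availability_pct, days=365):
    """Allowed downtime per year, in minutes, for an availability target."""
    return (1 - availability_pct / 100) * days * 24 * 60

# 99.99% -> ~52.6 min/year; 99.9% -> ~8.76 hrs/year
for target in (99.99, 99.95, 99.9):
    print(f"{target}%: {downtime_budget_minutes(target):.1f} min/year")
```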
Summary & Interview Talking Points
🎤 2-Minute Elevator Pitch
"For Twilio's global platform, I'd design a cell-based architecture partitioned by customer sizeโenterprise, mid-market, and SMBโdeployed across three AWS regions for multi-region active/active redundancy."
"At the data layer, we'd use DynamoDB Global Tables for multi-master replication where eventual consistency is acceptableโlike customer accounts and message logs. For strong consistency needs like phone number inventory, we'd use Aurora Global Database with leader-follower replication."
"Enterprise customers get dedicated EKS clusters with Aurora Multi-Master and reserved capacity. Mid-market shares EKS with Aurora replicas. SMB runs on cost-optimized ECS Fargate."
"This design aligns with AWS Well-Architected: operational excellence through IaC and cell-level observability, security via VPC isolation and encryption, reliability through multi-AZ/multi-region with blast radius containment, performance via low-latency databases and global accelerator, and cost optimization by right-sizing per customer segment."
"The cell architecture limits blast radiusโif one cell fails, it only affects that customer segment. We can independently scale, deploy, and test each cell."
📌 Key Takeaways
- Cell-based architecture limits blast radius and enables independent scaling
- Multi-master (DynamoDB Global Tables) for low-latency global writes with eventual consistency
- Leader-follower (Aurora Global Database) for strong consistency requirements
- Customer segmentation (Enterprise/Mid-Market/SMB) drives compute and database choices
- Product-specific patterns match data consistency to business requirements
- AWS-native services reduce operational overhead vs self-managed (DynamoDB > Cassandra)
- Well-Architected Framework ensures holistic design across all pillars
🚀 Next Steps for Deep Dive
- Design detailed cell routing algorithm (DynamoDB lookups, least-loaded assignment, cell rebalancing)
- Plan disaster recovery runbooks (regional failover, data loss scenarios)
- Cost modeling per customer segment (reserved vs on-demand, DynamoDB pricing)
- Security deep-dive (GDPR compliance, data residency, encryption key management)
- Migration strategy (monolith → cells, zero-downtime migration)