๐Ÿ—๏ธ Twilio Cell-Based Architecture on AWS

Multi-Region, Multi-Master Database Design for Enterprise Scale

Architecture Overview

🎯 Core Design Principles

  • Full Cell Autonomy: Each enterprise cell is a completely isolated VPC with independent IP space (overlapping CIDRs allowed)
  • Blast Radius Containment: Cell failures affect only customers in that cell (~100 enterprise customers max per cell)
  • Multi-Region Active/Active: Customers write locally to their region, DynamoDB Global Tables replicate globally
  • VPC Lattice for Inter-Cell Routing: Service-based routing (not IP-based) enables overlapping IP spaces
  • Consistency by Use Case: Multi-master (DynamoDB) for logs/events, leader-follower (Aurora) for phone inventory/payments

๐Ÿ“ Key Architectural Decisions

| Decision | Approach | Rationale |
|---|---|---|
| Cell Isolation | VPC-per-cell for Enterprise, shared VPC for Mid-Market/SMB | Balance isolation (enterprise) vs operational efficiency (SMB) |
| IP Addressing | Overlapping RFC1918 (all cells use 10.0.0.0/16) | Full autonomy, simplified addressing, VPC Lattice handles routing |
| Cell Edge | Dedicated ALB (Enterprise) or shared ALB (Mid-Market/SMB); NLB only for non-HTTP or source-IP preservation | ALB for WAF/path routing, NLB for ultra-low latency |
| Inter-Cell Routing | VPC Lattice (service mesh) | Supports overlapping IPs, IAM auth, cross-VPC without peering |
| Database Strategy | DynamoDB Global Tables (multi-master) + Aurora Global (leader-follower) | Match consistency model to use case requirements |

How Overlapping IP Spaces Work with VPC Lattice

🔑 The Problem: Cell Autonomy with RFC1918 IP Overlaps

Challenge: For true cell autonomy, each Enterprise cell should be a fully independent VPC. But if we have 100+ enterprise cells, we'd run out of unique RFC1918 IP space. The solution: allow all cells to use the same CIDR blocks (e.g., 10.0.0.0/16).

✅ The Solution: VPC Lattice Service-Based Routing

VPC Lattice routes based on service names, not IP addresses. This means overlapping IP spaces across VPCs are perfectly fine.

How It Works:

  1. Each cell is a dedicated VPC with the same CIDR (10.0.0.0/16)
  2. VPC Lattice creates a Service Network that spans all VPCs in the region
  3. Each cell registers its services with unique service names (e.g., enterprise-cell-a-api, enterprise-cell-b-api)
  4. Cell Router sets X-Twilio-Cell-ID header (e.g., "enterprise-us-east-1-a")
  5. VPC Lattice routes to the service by name, not by IP address
  6. Service discovery happens via AWS Cloud Map - VPC Lattice resolves service name to the correct VPC's load balancer
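The steps above can be sketched as a toy routing table. This is illustrative only (real VPC Lattice routing is AWS-managed); the `SERVICE_REGISTRY` and `route` names are hypothetical. Note that every cell deliberately reuses the same pod IP: routing is keyed on service name, so no collision occurs.

```python
# Toy model of service-name routing across cells with overlapping CIDRs.
# All three cells expose a pod at 10.0.50.10 -- the IP is never used as a
# routing key, so the overlap is harmless.
SERVICE_REGISTRY = {
    "enterprise-cell-a-api": {"vpc": "enterprise-cell-a-vpc", "pod_ip": "10.0.50.10"},
    "enterprise-cell-b-api": {"vpc": "enterprise-cell-b-vpc", "pod_ip": "10.0.50.10"},
    "enterprise-cell-c-api": {"vpc": "enterprise-cell-c-vpc", "pod_ip": "10.0.50.10"},
}

def route(headers: dict) -> dict:
    """Resolve the target cell from the X-Twilio-Cell-ID header, not from IPs."""
    cell_id = headers["X-Twilio-Cell-ID"]            # e.g. "enterprise-us-east-1-b"
    service_name = f"enterprise-cell-{cell_id.rsplit('-', 1)[-1]}-api"
    return SERVICE_REGISTRY[service_name]

target = route({"X-Twilio-Cell-ID": "enterprise-us-east-1-b"})
print(target["vpc"])  # enterprise-cell-b-vpc
```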

Example: Three cells with overlapping IPs

  • enterprise-cell-a-vpc: 10.0.0.0/16 → Service: enterprise-cell-a-api
  • enterprise-cell-b-vpc: 10.0.0.0/16 → Service: enterprise-cell-b-api
  • enterprise-cell-c-vpc: 10.0.0.0/16 → Service: enterprise-cell-c-api

All three cells have pod IPs like 10.0.50.10, but VPC Lattice routes correctly because it uses service names, not IPs.

💡 Key Benefit: No VPC peering needed! VPC Lattice handles cross-VPC communication with built-in IAM authentication, observability, and no risk of IP conflicts.

Architecture Diagrams

๐Ÿ“ Creating Professional AWS Architecture Diagrams

For interview presentations, create professional diagrams in draw.io or CloudCraft using the official AWS Architecture Icons.

Complete Request Flow Architecture

🔄 End-to-End Request Flow

Diagram summary (original graphic flattened to text):

  • Global layer: Client (HTTPS) → Route 53 (geolocation routing) → CloudFront CDN (TLS termination, WAF, caching) → us-east-1 / eu-west-1 / ap-southeast-1
  • Regional layer (us-east-1): Global Accelerator (anycast IPs) → internet-facing regional ALB → Cell Router Lambda (1. extract customer_id from JWT, 2. query ElastiCache/DynamoDB, 3. assign new customers to the least-loaded cell) → VPC Lattice Service Network (routes by service name, NOT IP, so overlapping 10.0.0.0/16 CIDRs are fine)
  • Cell layer (overlapping IPs): Cells A-D, each a VPC with CIDR 10.0.0.0/16, fronted by an ALB (HTTPS:443) into an EKS cluster (Cell A: Voice 100 pods, SMS 80 pods) plus DynamoDB and Aurora; ALL cells share the 10.0.0.0/16 range
  • Data layer (multi-region replication): DynamoDB Global Tables (multi-master, eventual consistency, active-active writes) for logs, events, sessions, and the customer_id → cell_id mapping; Aurora Global Database (leader-follower, strong consistency, read replicas in each region) for phone inventory and payment transactions
  • Total latency: CloudFront (5ms) + ALB (5ms) + Cell Router (10ms) + VPC Lattice (2ms) + App (50ms) + DB (5ms) ≈ 77ms

📋 Diagram Specifications for Interview

Create these diagrams using draw.io or CloudCraft with AWS icons:

1. Multi-Region Overview Diagram

Components to include:

  • Route 53 (global) → CloudFront (global) → Regional ALBs (us-east-1, eu-west-1, ap-southeast-1)
  • Cell Router Lambda in each region
  • VPC Lattice service network (spans multiple VPCs in region)
  • 3-4 cell VPCs per region, all labeled "10.0.0.0/16" to show overlapping IPs
  • Arrows showing data replication (DynamoDB bidirectional, Aurora primary → replica)
2. Cell Detail Diagram

Components to include:

  • VPC (10.0.0.0/16) with public/private subnets across 3 AZs
  • ALB in public subnets
  • EKS cluster nodes in private subnets
  • Pods (Voice, SMS, Auth services)
  • DynamoDB and Aurora databases
  • ElastiCache Redis cluster
  • NAT Gateways, Internet Gateway
  • Security group arrows showing traffic flow
3. VPC Lattice Overlapping IP Diagram

Show how overlapping IPs work:

  • VPC Lattice service network at the top
  • 3 VPCs below, ALL labeled "10.0.0.0/16"
  • Each VPC has different service name (enterprise-a-api, enterprise-b-api, enterprise-c-api)
  • Annotation: "VPC Lattice routes by service name, not IP"
  • Show pods in each VPC with same IP (e.g., 10.0.50.10) to emphasize overlapping
Multi-Region Request Flow (diagram summary)

  • Entry: Client HTTPS request → Route 53 (hosted zone twilio.com, geolocation routing policy, /health checks every 30s) → CloudFront (distribution api.twilio.com, custom ACM certificate, cache TTL 60s for GET / 0s for POST, origin failover group of regional ALBs, WAF rate limiting 10k req/5min per IP) → US / EU / APAC traffic
  • us-east-1: Global Accelerator (2 static anycast IPs, regional ALB endpoint at weight 100) → API Gateway (regional REST API, custom domain, 10k req/sec throttle) → ALB (TLS 1.3 termination, path-based routing, sticky sessions) → Cell Router Lambda (customer → cell lookup via ElastiCache/DynamoDB, least-loaded assignment for new customers, sets X-Twilio-Cell-ID; Node.js 20, 1024MB, 10s timeout) → VPC Lattice (routes on X-Twilio-Cell-ID, enabling overlapping IPs). Cells: Enterprise (EKS, 1000 m6i.4xlarge nodes, 3 AZs, pod autoscaling, ~100 customers), Mid-Market shared VPC (EKS, 300 m6i.2xlarge nodes, Spot instances, ~5,000 customers), SMB shared VPC (ECS Fargate, 1 vCPU/2GB tasks, 50-500 tasks, ~50,000 customers). Data layer: DynamoDB Global Tables (multi-master, on-demand), Aurora PostgreSQL 15 (primary region, Multi-AZ), ElastiCache Redis 7 (cluster mode, 3 shards, 2 replicas). VPC 10.0.0.0/16, 3 AZs, private subnets for compute
  • eu-west-1: same anycast IPs via Global Accelerator; API Gateway with GDPR-compliant data residency; ALB and Cell Router with the same logic (reads the local DynamoDB replica, writes to the Global Table) → VPC Lattice. Cells: Enterprise (EKS, 800 m6i.4xlarge nodes, ~80 customers), Mid-Market (EKS, 250 m6i.2xlarge nodes, Spot instances, ~4,000 customers), SMB (Fargate, 40-400 tasks, ~40,000 customers). Data layer: DynamoDB replica region (local writes OK), Aurora read replica (read-only, < 1s lag), ElastiCache Redis 7 (3 shards, 2 replicas). VPC 10.1.0.0/16, GDPR-compliant data residency
  • ap-southeast-1: same anycast IPs; APAC-optimized ALB (TLS 1.3, Singapore edge location, low-latency routing); Cell Router with the same logic → VPC Lattice. Cells: Enterprise (EKS, 500 m6i.4xlarge nodes, ~50 customers), Mid-Market (EKS, 200 m6i.2xlarge nodes, ~3,000 customers), SMB (Fargate, 30-300 tasks, ~30,000 customers). Data layer: DynamoDB replica region (local writes OK), Aurora read replica (< 1s lag), ElastiCache Redis 7 (2 shards, 2 replicas). VPC 10.2.0.0/16
  • DynamoDB Global Tables replicate between all three regions

🔧 Key Configuration Details

Route 53 Configuration

  • Routing Policy: Geolocation (US → us-east-1, EU → eu-west-1, APAC → ap-southeast-1)
  • Health Checks: /health endpoint, 30s interval, 3 failures = unhealthy
  • Failover: If primary region unhealthy, route to secondary region
  • TTL: 60 seconds (balance between failover speed and DNS query volume)

CloudFront Configuration

  • Origin: Origin group with regional ALBs (automatic failover)
  • Cache Behavior: GET requests cached 60s, POST/PUT/DELETE bypass cache
  • SSL/TLS: Custom certificate from ACM, TLS 1.2+ only
  • WAF: Rate limiting (10,000 requests per 5 minutes per IP)

Global Accelerator Configuration

  • Anycast IPs: 2 static IPv4 addresses (global)
  • Endpoints: Regional ALBs in each region (weight: 100)
  • Health Checks: Port 443, HTTPS protocol, 30s interval
  • Traffic Dial: 100% to healthy endpoints, automatic failover

VPC Lattice Configuration

  • Service Network: Cross-VPC service discovery and routing
  • Routing: Header-based routing on X-Twilio-Cell-ID
  • Auth: IAM-based service-to-service authentication
  • Observability: Access logs to S3, metrics to CloudWatch

ALB Configuration

  • Scheme: Internet-facing, cross-zone load balancing enabled
  • TLS: TLS 1.3, certificate from ACM, ALPN support
  • Routing: Path-based routing (/v1/voice, /v1/sms, etc.)
  • Sticky Sessions: Application cookie, 1-hour duration

Cell Router Lambda Configuration

  • Runtime: Node.js 20, arm64 (Graviton2)
  • Memory: 1024MB, Timeout: 10 seconds
  • Concurrency: Reserved concurrency per region (1000)
  • VPC: Deployed in VPC to access ElastiCache privately
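The router's three steps (cache lookup, table fallback, least-loaded assignment) can be sketched as follows. The doc specifies a Node.js 20 Lambda; this Python sketch is for illustration only, with dicts standing in for ElastiCache and DynamoDB, and the JWT payload decoded without signature verification (real code must verify first). All names are hypothetical.

```python
import base64
import json

# In-memory stand-ins for ElastiCache and the DynamoDB mapping table.
CACHE: dict[str, str] = {}
CUSTOMER_CELL_TABLE = {"cust-123": "enterprise-us-east-1-a"}
CELL_LOAD = {"enterprise-us-east-1-a": 92, "enterprise-us-east-1-b": 40}

def extract_customer_id(jwt: str) -> str:
    # Decode the JWT payload only; production code verifies the signature.
    payload = jwt.split(".")[1]
    payload += "=" * (-len(payload) % 4)           # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))["customer_id"]

def assign_cell(customer_id: str) -> str:
    if customer_id in CACHE:                       # step 2: cache hit
        return CACHE[customer_id]
    cell = CUSTOMER_CELL_TABLE.get(customer_id)    # step 2: table fallback
    if cell is None:                               # step 3: new customer
        cell = min(CELL_LOAD, key=CELL_LOAD.get)   # least-loaded cell wins
        CUSTOMER_CELL_TABLE[customer_id] = cell
    CACHE[customer_id] = cell
    return cell

def handler(event: dict) -> dict:
    customer_id = extract_customer_id(event["headers"]["authorization"])
    event["headers"]["X-Twilio-Cell-ID"] = assign_cell(customer_id)
    return event
```

A new customer lands in the least-loaded cell and the assignment is then sticky via the cache and table.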

📊 Traffic Flow Example

  1. Client makes HTTPS request to api.twilio.com
  2. Route 53 resolves DNS based on geolocation → Returns CloudFront distribution
  3. CloudFront terminates TLS, applies WAF rules, routes to nearest regional ALB
  4. Global Accelerator (optional) provides anycast routing to nearest region
  5. ALB terminates TLS (if not via CloudFront), routes based on path to API Gateway or directly to Cell Router
  6. Cell Router Lambda extracts customer ID from JWT, queries ElastiCache/DynamoDB for cell assignment, assigns new customers to least-loaded cell, sets X-Twilio-Cell-ID header
  7. VPC Lattice routes request to appropriate EKS cluster based on X-Twilio-Cell-ID header
  8. EKS/ECS processes request in customer's assigned cell
  9. Data Layer reads from local DynamoDB replica (eventual consistency) or Aurora read replica, writes to DynamoDB Global Table (multi-master) or Aurora primary (strong consistency)
  10. Response flows back through VPC Lattice → Cell Router → ALB → CloudFront → Client

⚡ Total Latency Budget: CloudFront (5ms) + ALB (5ms) + Cell Router (5-15ms) + VPC Lattice (2ms) + Application (20-100ms) + Database (5-10ms) = 42-137ms end-to-end
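The latency budget is just the sum of the per-hop ranges; a quick check that the stated 42-137ms total is consistent with the components:

```python
# Each component is a (low_ms, high_ms) range taken from the budget above.
BUDGET = {
    "CloudFront":  (5, 5),
    "ALB":         (5, 5),
    "Cell Router": (5, 15),
    "VPC Lattice": (2, 2),
    "Application": (20, 100),
    "Database":    (5, 10),
}

low = sum(lo for lo, _ in BUDGET.values())
high = sum(hi for _, hi in BUDGET.values())
print(f"{low}-{high} ms end-to-end")  # 42-137 ms end-to-end
```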

🔀 Routing Responsibilities: VPC Lattice vs EKS

Inter-Cell Routing (VPC Lattice)

Responsibility: Route requests between cells based on customer assignment

How it works:

  1. Cell Router Lambda sets X-Twilio-Cell-ID header (e.g., "enterprise-us-east-1-a")
  2. VPC Lattice inspects header and routes to correct cell's service network
  3. Works across VPCs, availability zones, and accounts
  4. Provides IAM-based authentication between services

Example: Customer A (assigned to Enterprise Cell A) → VPC Lattice → Enterprise Cell A EKS cluster

Intra-Cell Routing (EKS/Kubernetes)

Responsibility: Route requests within a cell between microservices

How it works:

  1. Request arrives at cell's EKS cluster via VPC Lattice
  2. Kubernetes Ingress Controller (NGINX/ALB Ingress) receives request
  3. Routes to appropriate service based on path (/v1/voice → voice-service)
  4. Service mesh (Istio/Linkerd) handles service-to-service communication

Example: Within Enterprise Cell A: voice-service → auth-service → database-service

🔑 Key Distinction: VPC Lattice routes to the correct cell (inter-cell), while EKS/Kubernetes routes within the cell to the correct microservice (intra-cell). This separation allows cells to be completely isolated while still enabling centralized routing logic.
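The two layers can be modeled as two independent lookups, which is the whole point of the separation: the inter-cell layer never needs to know about paths, and the intra-cell layer never needs to know about other cells. All names here are illustrative.

```python
# Layer 1 (inter-cell): VPC Lattice picks the cell from the header.
CELL_ENDPOINTS = {"enterprise-us-east-1-a": "cell-a-ingress"}
# Layer 2 (intra-cell): the in-cell ingress picks the microservice by path.
PATH_ROUTES = {"/v1/voice": "voice-service", "/v1/sms": "sms-service"}

def inter_cell(headers: dict) -> str:   # header-based, crosses VPC boundaries
    return CELL_ENDPOINTS[headers["X-Twilio-Cell-ID"]]

def intra_cell(path: str) -> str:       # path-based, stays inside the cell
    return PATH_ROUTES[path]

ingress = inter_cell({"X-Twilio-Cell-ID": "enterprise-us-east-1-a"})
service = intra_cell("/v1/voice")
print(ingress, "->", service)  # cell-a-ingress -> voice-service
```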

💡 Alternative: For larger enterprises, you could use Istio multi-cluster to span a service mesh across cells, but this increases complexity. VPC Lattice + per-cell Kubernetes Ingress is simpler and maintains better fault isolation.

Network Boundaries & Cell Edge Architecture

๐Ÿ—๏ธ Cell Network Boundary Design

VPC Structure Options

| Approach | VPC Topology | Isolation Level | Use Case | Cell Edge Device |
|---|---|---|---|---|
| Option 1: VPC-per-Cell ⭐ | Each cell = dedicated VPC | 🟢 Strongest (network isolation) | Enterprise cells | VPC Lattice → ALB → EKS Pods (IP mode) |
| Option 2: Shared VPC, Subnet Isolation | Cells in different subnets | 🟡 Moderate (security groups) | Mid-Market, SMB cells | VPC Lattice → ALB (target groups) → EKS |
| Option 3: Namespace Isolation | Cells = Kubernetes namespaces | 🟠 Weakest (logical only) | Dev/test environments | Shared ALB → Ingress Controller → Namespaces |

Recommended Architecture: Hybrid Approach

Enterprise Cells (Top 100 customers):

  • VPC Boundary: Dedicated VPC per cell (e.g., enterprise-cell-a-vpc, enterprise-cell-b-vpc) - ALL use 10.0.0.0/16 (overlapping IPs)
  • Subnets: Private subnets across 3 AZs for EKS, public subnets for NAT gateways
  • Edge Device: Application Load Balancer (ALB) registered with VPC Lattice as target
  • Why ALB: Layer 7 routing (path/host), WAF integration, TLS termination, managed by AWS Load Balancer Controller
  • Traffic Flow: VPC Lattice → ALB → Target Group (EKS pods in IP mode) → Pods
  • ✓ VPC Lattice enables overlapping IPs - routes by service name, not IP address
  • Note: Use NLB instead if you need source IP preservation or non-HTTP protocols

Mid-Market Cells:

  • VPC Boundary: Shared VPC with isolated subnets per cell (e.g., midmarket-vpc = 10.1.0.0/16)
  • Subnets: Cell A (10.1.0.0/18), Cell B (10.1.64.0/18), Cell C (10.1.128.0/18)
  • Edge Device: Application Load Balancer (ALB) with separate target groups per cell
  • Why ALB: Layer 7, path-based routing, WAF integration, TLS termination
  • Traffic Flow: VPC Lattice → ALB → Target Group (Cell-specific) → EKS Pods

SMB Cells (ECS Fargate):

  • VPC Boundary: Shared VPC with Fargate tasks in isolated subnets
  • Edge Device: ALB with target groups pointing to Fargate service
  • Why ALB: Native ECS integration, automatic task registration/deregistration
  • Traffic Flow: VPC Lattice → ALB → Fargate Tasks (via AWS Cloud Map service discovery)

What Sits at the Cell Edge?

🎯 Answer: It depends on the cell type, but generally:

For Enterprise Cells (VPC-per-cell):

  1. VPC Lattice Service Network routes traffic to the cell's VPC service
  2. Application Load Balancer (ALB) sits at the VPC edge, registered as VPC Lattice target
  3. ALB routes directly to Target Group containing EKS pods (IP mode)
  4. AWS Load Balancer Controller manages ALB lifecycle and target registration

For Mid-Market/SMB Cells (Shared VPC):

  1. VPC Lattice routes to a shared ALB within the VPC
  2. ALB sits at the cell's logical edge, with listener rules routing to cell-specific target groups
  3. Target groups contain either EKS worker nodes (NodePort) or Fargate tasks
  4. Kubernetes/ECS service discovery handles pod-level routing

🔑 Key Point: The ALB/NLB is NOT shared across all cells. Enterprise cells get dedicated ALBs (or NLBs for non-HTTP traffic) in their own VPCs. Mid-Market/SMB cells may share an ALB but use separate target groups for isolation.

Regional vs Cell-Level ALB

โš ๏ธ Clarification on the Architecture Diagram:

In the diagram above, the regional ALB shown is actually the ingress ALB that sits in front of the Cell Router Lambda, NOT the cell edge devices. Here's the corrected flow:

  1. CloudFront/Global Accelerator → Regional Ingress ALB (internet-facing, routes to Cell Router Lambda)
  2. Cell Router Lambda determines cell assignment, sets X-Twilio-Cell-ID header
  3. VPC Lattice reads header, routes to cell-specific NLB/ALB
  4. Cell Edge NLB/ALB → Kubernetes Ingress or ECS Service
  5. Ingress/Service → Application Pods/Tasks

So there are TWO load balancer layers:

  • Layer 1 (Regional): Ingress ALB for Cell Router Lambda (1 per region)
  • Layer 2 (Cell): ALB at each cell's edge, registered with VPC Lattice (1 per enterprise cell, or shared for SMB)

💡 Why ALB at cell edge, not NLB: VPC Lattice supports ALB as a target type. ALB provides Layer 7 features (path routing, host headers, WAF integration) while VPC Lattice handles the service discovery. Only use NLB if you need source IP preservation or non-HTTP protocols.

Example: Enterprise Cell Network Boundary

VPC: enterprise-cell-a-vpc (10.0.0.0/16) ← Same CIDR as all other enterprise cells

Subnets:

  • Public: 10.0.0.0/20 (AZ-a), 10.0.16.0/20 (AZ-b), 10.0.32.0/20 (AZ-c) - ALB nodes
  • Private: 10.0.64.0/18 (AZ-a), 10.0.128.0/18 (AZ-b), 10.0.192.0/18 (AZ-c) - EKS nodes

Edge Device: ALB (enterprise-a-alb) created by AWS Load Balancer Controller, registered with VPC Lattice

VPC Lattice Target: ALB ARN registered as service target

ALB Target Group: EKS pods in IP target mode (port 8080)

Kubernetes: AWS Load Balancer Controller watches Ingress resources, auto-creates ALB + target groups

Security Groups:

  • ALB SG: Allow 443 from VPC Lattice service network CIDR (169.254.171.0/24)
  • EKS Pod SG: Allow 8080 from ALB SG, allow pod-to-pod within VPC
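The subnet plan above can be sanity-checked mechanically with the standard library: every subnet must fit inside the VPC CIDR and no two subnets may overlap.

```python
import ipaddress

# The enterprise-cell-a-vpc subnet plan from the example above.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = [ipaddress.ip_network(c) for c in [
    "10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20",     # public (ALB nodes)
    "10.0.64.0/18", "10.0.128.0/18", "10.0.192.0/18",  # private (EKS nodes)
]]

# Every subnet is contained in the VPC, and no pair overlaps.
assert all(s.subnet_of(vpc) for s in subnets)
assert not any(a.overlaps(b) for i, a in enumerate(subnets)
               for b in subnets[i + 1:])
print("subnet plan is valid")
```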

Kubernetes Ingress Controller: AWS Load Balancer Controller vs NGINX

For the cell edge ingress layer, there are two primary options. Here's the trade-off analysis:

✅ AWS Load Balancer Controller

Recommended for this AWS-native architecture

How it works:

  • Kubernetes controller watches Ingress resources
  • Automatically provisions ALB/NLB per Ingress
  • Integrates directly with VPC Lattice targets
  • Uses AWS SDK to manage load balancers

Advantages:

  • ✅ AWS-native integration (VPC Lattice, WAF, Shield)
  • ✅ Automatic ALB/NLB lifecycle management
  • ✅ Native support for target groups, health checks
  • ✅ Managed by AWS (security patches, updates)
  • ✅ Better observability (CloudWatch, X-Ray)
  • ✅ IP or instance target modes

Disadvantages:

  • โŒ AWS vendor lock-in (can't move to GCP/Azure easily)
  • โŒ Less feature-rich than NGINX (no Lua, plugins)
  • โŒ ALB costs (~$20/month per Ingress)
  • โŒ Limited advanced routing (no regex, rewrite limits)

Best for: AWS-committed architectures, teams preferring managed services, production workloads needing WAF/Shield integration

NGINX Ingress Controller

Alternative for multi-cloud flexibility

How it works:

  • NGINX pods run as DaemonSet on nodes
  • Exposed via single NLB (cost-efficient)
  • NGINX handles all routing internally
  • Config via Kubernetes Ingress annotations

Advantages:

  • ✅ Multi-cloud portable (works on EKS, GKE, AKS)
  • ✅ Feature-rich (rate limiting, auth, rewrites, Lua)
  • ✅ Battle-tested, huge community
  • ✅ Single NLB cost (vs ALB per Ingress)
  • ✅ Advanced routing (regex, complex rewrites)
  • ✅ Extensive customization via ConfigMaps

Disadvantages:

  • โŒ Self-managed (you patch, upgrade, scale)
  • โŒ Extra network hop (NLB โ†’ NGINX pod โ†’ app pod)
  • โŒ Less AWS-native integration
  • โŒ Resource overhead (NGINX pods consume CPU/memory)
  • โŒ Need to manage NGINX pod scaling

Best for: Multi-cloud strategy, need advanced routing features, team has NGINX expertise, cost-sensitive (fewer ALBs)

📊 Recommendation for This Architecture

Use AWS Load Balancer Controller for the following reasons:

  1. Already committed to AWS: You're using VPC Lattice, DynamoDB Global Tables, Aurora - already AWS-native
  2. VPC Lattice integration: ALB/NLB created by controller can be registered as VPC Lattice targets seamlessly
  3. WAF requirement: Twilio needs DDoS protection - ALB integrates directly with WAF
  4. Operational simplicity: Managed service reduces ops burden vs self-managing NGINX at scale
  5. Cell isolation: Separate ALBs per cell provide better blast radius containment than shared NGINX

When to use NGINX instead: If you need advanced routing (complex regex, Lua scripts) or if maintaining multi-cloud optionality is critical

Cell Network Boundary Detail

Diagram summary (original graphic flattened to text):

  • VPC Lattice Service Network: routes based on the X-Twilio-Cell-ID header; service network CIDR 169.254.171.0/24 (AWS-managed)
  • Enterprise Cell A (dedicated VPC, 10.10.0.0/16): Network Load Balancer at the cell edge in private subnets across 3 AZs (listener 443/TCP → NGINX Ingress), registered with VPC Lattice as a target; the NGINX Ingress Controller routes by path/host to Kubernetes Services in namespace customer-12345: Voice (100 pods, 2 cores/4GB each, HPA 50-200), SMS (80 pods, 1 core/2GB, HPA 40-150), Auth (20 pods, 0.5 core/1GB, HPA 10-50)
  • Mid-Market (shared VPC, 10.20.0.0/16): shared ALB at the cell edge with a target group per cell, routing by X-Twilio-Cell-ID, security groups enforcing isolation; ALB → NodePort (30000-32767) → pods in a shared EKS cluster with per-customer namespaces (customer-001: 100 pods, customer-002: 80 pods, customer-003: 120 pods), isolated via Network Policies and Resource Quotas
  • SMB (shared VPC, 10.30.0.0/16): ALB at the cell edge targeting an ECS Fargate service, tasks auto-registered via AWS Cloud Map; 50-500 auto-scaling tasks at 1 vCPU/2GB, ~100 customers per task, awsvpc networking (ENI per task)

Note: this diagram shows the NLB + NGINX variant at the enterprise cell edge and non-overlapping CIDRs; the recommended edge for this architecture is the ALB-based one described earlier.

Cell Architecture Terminology & AWS Implementation

๐Ÿ—๏ธ Mapping Cell-Based Architecture to AWS Implementation

Implementing cell boundaries as separate AWS accounts and VPCs is recognized as a best practice for managing scale and isolation. This physical realization directly addresses account quotas and VPC limits such as Network Address Usage (NAU).

| Your Term / Component | Industry Term | AWS Implementation | Function |
|---|---|---|---|
| Cell | Unit of Scale / Deployment Cell | Dedicated AWS Account + VPC (10.0.0.0/16) | Logical grouping representing a unit of scale and deployment; change-set boundary at deployment time to limit blast radius. Contains a size-capped workload replica. |
| Cloud Native Landing Zone | Management Plane | AWS Control Tower Multi-Account Environment | Foundational substrate that sets up and governs a secure multi-account AWS environment. Provides design-time governance, security guardrails, and account-level orchestration. |
| Control Plane | Control Plane / Orchestrator | Infrastructure-as-Code + Lambda/Step Functions | The "cockpit of the platform" that interacts with your Landing Zone to manage cell lifecycle: provisioning new cell accounts, de-provisioning old ones, and migrating tenants between cells. |
| Platform Services / Global Cell | Shared Singleton Services / Tier 0 | Multi-Region Deployment (Identity, IAM, etc.) | Critical services that cannot be easily cellularized (Identity, Access Management, Scheduler). Often reside in a "Global Cell" as Tier 0 dependencies; must be multi-region to avoid a single point of failure. |
| Cell Router | Traffic Partitioning Layer / Router | Route 53 / API Gateway / Lambda | The "thinnest possible layer" that uses a partition key (Customer ID) to route incoming requests to the specific endpoint of the AWS account/VPC where that customer's cell resides. |

🚢 Shipping Port Analogy

  • Cloud Native Landing Zone (Management Plane) = The port authority, providing the docks and security rules
  • Cell (AWS Account + VPC) = A standardized shipping container; if one container is damaged (deployment failure), the contents of others are safe
  • Cell Router = The harbor pilot who knows exactly which container belongs to which customer
  • Control Plane = The crane operator who brings new containers onto the dock as the port grows
  • Platform Services/Global Cell = The port's central administration building that all ships must check in with

Why Separate AWS Accounts for Cells?

✅ Account-Level Benefits

  • Natural Isolation Boundary: Actions in one account cannot impact others, effectively bypassing regional account-level limits
  • Quota Independence: Each cell gets its own set of AWS service quotas (EC2 instances, VPCs, NAT Gateways, etc.)
  • Blast Radius Containment: IAM misconfigurations, security incidents, or service disruptions are contained within the account
  • Cost Allocation: Perfect cost attribution per cell/customer segment using AWS Cost Explorer and tagging
  • Compliance & Auditing: Clear security boundaries for regulatory requirements (SOC2, HIPAA, PCI-DSS)
  • VPC Limit Bypass: Avoid VPC-level Network Address Usage (NAU) limits by having dedicated VPCs per account

📊 Sizing Example: Enterprise Cells

Cell Capacity Planning:

  • Max Customers per Cell: 100 enterprise customers
  • EKS Cluster per Cell: 1 cluster with 20-50 nodes
  • Fixed Maximum Size: When Cell A reaches 100 customers, Control Plane automatically provisions Cell B
  • Saturation Limit: CPU > 70%, Memory > 75%, or customer count = max triggers new cell creation
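The saturation rule above is a simple predicate; a minimal sketch (thresholds and function name taken from or invented for this example):

```python
# A cell triggers new-cell provisioning when CPU > 70%, memory > 75%,
# or the customer count reaches the fixed maximum.
MAX_CUSTOMERS = 100

def needs_new_cell(cpu_pct: float, mem_pct: float, customers: int) -> bool:
    return cpu_pct > 70 or mem_pct > 75 or customers >= MAX_CUSTOMERS

print(needs_new_cell(65, 60, 100))  # True  (customer cap reached)
print(needs_new_cell(50, 50, 80))   # False (healthy headroom)
```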

Control Plane Automation

Control Plane Responsibilities

  1. Cell Provisioning: Use AWS Control Tower Account Factory API to create new AWS account → Deploy VPC + EKS + DynamoDB via Terraform/CloudFormation → Register cell with VPC Lattice Service Network
  2. Tenant Onboarding: When new customer signs up → Determine segment (SMB/Mid-Market/Enterprise) → Assign to least-loaded cell in segment → Update DynamoDB customer_cell_mapping table → Cache assignment in Redis
  3. Cell Scaling: Monitor cell capacity metrics → Trigger Step Function when saturation threshold reached → Provision new cell account → New customers automatically route to new cell (lowest load)
  4. Cell Migration: Dual-write phase → Background data sync → Atomic cutover (update DynamoDB + invalidate cache) → Cleanup old cell data after 24 hours
  5. Cell Decommissioning: Drain traffic → Migrate all customers to other cells → Delete AWS account via Control Tower
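The migration sequence (step 4) can be sketched as a small state change over in-memory stand-ins; real code would use DynamoDB, Redis, and a Step Function, and all names here are illustrative.

```python
# Dicts stand in for the mapping table, the Redis cache, and per-cell data.
mapping = {"cust-123": "cell-a"}
cache = dict(mapping)
cell_data = {"cell-a": {"cust-123": ["rec1", "rec2"]}, "cell-b": {}}

def migrate(customer: str, src: str, dst: str) -> None:
    cell_data[dst][customer] = []                              # 1. dual-write begins
    cell_data[dst][customer] = list(cell_data[src][customer])  # 2. background backfill
    mapping[customer] = dst                                    # 3a. atomic cutover
    cache.pop(customer, None)                                  # 3b. invalidate cache
    # 4. source-cell data retained for 24 hours, then deleted (not modeled)

migrate("cust-123", "cell-a", "cell-b")
print(mapping["cust-123"])  # cell-b
```

The cutover is "atomic" from the router's perspective: once the mapping row flips and the cache entry is gone, every new lookup resolves to the destination cell.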

Multi-Region Architecture Considerations

๐ŸŒ Regional Cell Deployment

Each region (us-east-1, eu-west-1, ap-southeast-1) has its own set of cells:

Example Cell Distribution:

  • us-east-1: enterprise-cell-a, enterprise-cell-b, midmarket-cell-shared, smb-cell-shared
  • eu-west-1: enterprise-cell-eu-a, enterprise-cell-eu-b, midmarket-cell-eu-shared
  • ap-southeast-1: enterprise-cell-apac-a, midmarket-cell-apac-shared

Global Cell (Platform Services):

  • Identity Service: Deployed in all 3 regions with DynamoDB Global Tables for user authentication
  • API Key Management: Replicated globally for low-latency validation
  • Billing Service: Multi-region for quota checks and balance deductions

โš ๏ธ Critical: Platform Services are "Tier 0" dependencies - if Identity service fails globally, all cells are affected. Therefore, these MUST be deployed multi-region with active-active replication.

Interview Talking Points

🎤 2-Minute Summary: Cell-Based Architecture on AWS

"For Twilio's cell-based architecture, I'd implement cells as dedicated AWS accounts with their own VPCs, all using the same CIDR block (10.0.0.0/16) for maximum autonomy. This bypasses VPC-level quotas and enables true fault isolation."

"The Cloud Native Landing Zone is built on AWS Control Tower, which sets up the multi-account governance structure. The Control Plane automates cell lifecycleโ€”provisioning new accounts when capacity is reached, using infrastructure-as-code to deploy identical cell stacks, and managing customer migrations."

"VPC Lattice solves the overlapping IP problem by routing based on service names, not IP addresses. The Cell Router looks up which cell a customer belongs to in DynamoDB (cached in Redis), then VPC Lattice routes the request to that cell's service endpointโ€”no VPC peering needed."

"Platform Services like Identity and IAM run in a separate Global Cell, deployed multi-region as Tier 0 dependencies. All regional cells depend on these for authentication and authorization, so they must be highly available across multiple regions."

"This architecture balances operational efficiency with isolation: enterprise customers get dedicated cells (accounts), mid-market shares cells in a VPC, and SMB uses namespace isolation within shared clusters. The Control Plane automates it all."

Cell Partitioning Strategy

Why Cell-Based Architecture?

Cell-based architecture is a pattern where you partition your infrastructure into isolated, independent units (cells). Each cell is a complete deployment of your application stack, limiting the blast radius of failures.

Benefits: Fault isolation, independent scaling, easier testing, phased rollouts, regional compliance.

Cell Partitioning Dimensions

| Partition Type | Strategy | Rationale | AWS Implementation |
|---|---|---|---|
| Customer Size | Enterprise, Mid-Market, SMB | Different SLAs, resource needs, isolation requirements | Separate EKS clusters, dedicated capacity reservations |
| Geography | US, EU, APAC regions | Data residency, latency, compliance (GDPR, etc.) | Multi-region deployment, Route 53 geolocation routing |
| Product Vertical | Voice, SMS, Video, Verify | Different latency/consistency requirements | Product-specific microservices in each cell |
| Availability Zone | Multi-AZ within region | High availability, AZ failure isolation | EKS node groups across 3 AZs, Aurora Multi-AZ |

Cell Design Per Customer Segment

๐Ÿข Enterprise Cell

Customers: Uber, Lyft, Airbnb (top 100 customers)

Characteristics:

  • Dedicated EKS cluster (1000+ nodes)
  • Reserved capacity, dedicated NAT gateways
  • Aurora Multi-Master for writes
  • DynamoDB Global Tables
  • 99.99% SLA
  • Priority support, custom metrics

๐Ÿช Mid-Market Cell

Customers: Growing startups (1000-10,000 customers)

Characteristics:

  • Shared EKS cluster (200-500 nodes)
  • Burstable capacity
  • Aurora read replicas (writes to primary)
  • DynamoDB Global Tables
  • 99.95% SLA
  • Standard support

๐Ÿ  SMB Cells (A/B/C/D)

Customers: Small businesses (100,000+ customers)

Characteristics:

  • ECS Fargate (serverless, cost-optimized)
  • Spot instances for batch workloads
  • Aurora read replicas (shared primary)
  • DynamoDB on-demand pricing
  • 99.9% SLA
  • Community support

⚡ Cell Routing Strategy

How requests route to cells:

  1. Route 53: Geolocation routing directs to nearest region
  2. API Gateway / ALB: Custom domain with cell identifier in header/path
  3. Cell Router Service: Looks up customer โ†’ cell mapping in DynamoDB
  4. VPC Lattice: Service mesh routes to correct EKS cluster/namespace

Customer ID → Hash → Cell Assignment → Persistent mapping in DynamoDB
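A minimal sketch of that hash-then-persist flow, with a dict standing in for the DynamoDB mapping table (names are illustrative): the hash only decides the initial placement, and the persisted mapping keeps the assignment sticky even if cells are later added.

```python
import hashlib

CELLS = ["smb-cell-a", "smb-cell-b", "smb-cell-c", "smb-cell-d"]
ASSIGNMENTS: dict[str, str] = {}   # stand-in for the DynamoDB mapping table

def assign(customer_id: str) -> str:
    if customer_id not in ASSIGNMENTS:              # persisted mapping wins
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        ASSIGNMENTS[customer_id] = CELLS[int(digest, 16) % len(CELLS)]
    return ASSIGNMENTS[customer_id]

assert assign("AC123") == assign("AC123")           # deterministic and sticky
```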

Cell Taxonomy: Workflow-Affinity, Operationally-Differentiated

🎯 Core Principle: Group by Workflow Affinity, Not Product Catalog

Services that communicate at runtime should share a cell. Services with uncorrelated demand and no runtime dependencies can be separated for better capacity efficiency.

Why? Co-location solves cross-cell API calls and cascading failures. But forcing unrelated services together creates capacity modeling challengesโ€”you're sizing cells for combined peak load even when services never peak together.

The Core Trade-off

| Approach | Cross-Cell Calls | Capacity Efficiency | Routing Complexity |
| --- | --- | --- | --- |
| All services, one cell | None | Poor (uncorrelated demand) | Simple |
| Workflow-affinity cells ✅ | Rare (across workflow boundaries) | Good | Moderate |
| Service-per-cell | Frequent | Optimal | Complex |

When to Co-locate Services (Same Cell)

โœ… Co-locate when:

  • Services call each other synchronously โ€” SMS delivery โ†’ Voice callback โ†’ WhatsApp fallback
  • Customer workflows span multiple services in a single request path โ€” Verify API uses SMS + Voice
  • Failure in one impacts the other anyway โ€” Tightly coupled dependencies
  • Similar scaling characteristics โ€” Both bursty, both sustained, etc.

When to Separate Services (Different Cells)

๐Ÿ”€ Separate when:

  • Services rarely or never communicate at runtime โ€” No cross-cell call risk
  • Vastly different scaling characteristics โ€” SMS is bursty (millions/minute), Voice is sustained (concurrent calls), Video is bandwidth-bound
  • Different instance type optimizations โ€” CPU-bound vs memory-bound vs I/O-bound workloads
  • Uncorrelated demand patterns โ€” Don't want to size for combined peak when services never peak together

The Auto-Scaling Reality Check

Auto-scaling helps but has limits:

| Factor | Challenge |
| --- | --- |
| Scale-up latency | 2-5 minutes for EC2; need a headroom buffer |
| Minimum instances | Cost floor regardless of actual demand |
| Different triggers | SMS scales on queue depth, Voice on concurrent connections |
| Compounding headroom | If SMS needs 30% headroom and Voice needs 30%, a combined cell carries both, even if they never peak together |
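A toy calculation makes the compounding-headroom point concrete. All numbers are hypothetical:

```python
# Hypothetical peak loads in capacity units; the two peaks never coincide.
sms_peak, voice_peak = 100, 60
headroom = 0.30  # scale-up buffer each service needs

# Combined cell: without a correlation model you must size
# for both peaks occurring at once.
combined_cell = (sms_peak + voice_peak) * (1 + headroom)

# Worst instantaneous demand if the peaks really never overlap.
true_need = max(sms_peak, voice_peak) * (1 + headroom)

print(combined_cell)  # 208.0 units held in the combined design
print(true_need)      # 130.0 units actually required at any instant
```

The gap between the two figures is the capacity-modeling tax of co-locating services with uncorrelated demand.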

Recommended: Workflow-Affinity Cell Types

๐Ÿ“ฑ Messaging Cell

  • Services: SMS, MMS, WhatsApp, RCS
  • Why together: Similar patterns (message queues), often used in fallback chains
  • Scaling: Queue depth, messages/second

๐ŸŽฅ Real-time Media Cell

  • Services: Voice, Video, WebRTC
  • Why together: Connection-based, jitter-sensitive, often combined in apps
  • Scaling: Concurrent connections, bandwidth

โœ‰๏ธ Async Communication Cell

  • Services: Email (SendGrid), Fax
  • Why together: Batch-oriented, store-and-forward, different SLA expectations
  • Scaling: Throughput, delivery rate

๐Ÿ” Verify Cell

  • Services: Verify API (2FA), Lookup
  • Why together: Cross-channel verification workflows (SMS + Voice + Email)
  • Scaling: Requests/second, verification attempts
# Workflow-affinity routing:
customer_id + region + service_category โ†’ cell_id

# Example mappings:
customer: acme-corp, region: us-east-1, category: messaging  โ†’ messaging-enterprise-us-001
customer: acme-corp, region: us-east-1, category: realtime   โ†’ realtime-enterprise-us-001
customer: acme-corp, region: us-east-1, category: verify     โ†’ verify-enterprise-us-001

# Same customer, different cells by workflow affinity
# Cross-category calls are rare (by design)
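A runnable stand-in for the mapping above, with the cell IDs copied from the example; the real lookup would be a DynamoDB GetItem keyed on customer and category:

```python
# In-memory stand-in for the DynamoDB routing table.
ROUTING_TABLE = {
    ("acme-corp", "us-east-1", "messaging"): "messaging-enterprise-us-001",
    ("acme-corp", "us-east-1", "realtime"):  "realtime-enterprise-us-001",
    ("acme-corp", "us-east-1", "verify"):    "verify-enterprise-us-001",
}

def resolve_cell(customer_id: str, region: str, category: str) -> str:
    """Same customer, same region: different cells per workflow category."""
    return ROUTING_TABLE[(customer_id, region, category)]
```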

Cell Taxonomy Dimensions

Primary Dimensions (Tier 1 - Must Have)

  1. Geographic/Regional
    • Why: Legal requirement for data residency (GDPR, Chinese data laws)
    • Why: Latency matters for real-time voice/video
    • Impact: Minimum deployment in US, EU, APAC regions
    • Interview value: Shows understanding of regulatory constraints
  2. Customer Segment
    • Why: Enterprise vs SMB have fundamentally different operational needs
    • Examples:
      • Enterprise: Dedicated cells, custom SLAs, contractual guarantees, 24/7 support
      • SMB/Developer: Shared multi-tenant cells, best-effort, self-service
    • Impact: Different blast radius tolerance, isolation requirements, cost models
    • Interview value: Demonstrates business/technical balance
  3. Compliance/Regulatory
    • Why: You CANNOT mix HIPAA healthcare traffic with regular SMS
    • Why: PCI-DSS for payment flows requires isolation
    • Impact: Separate cell infrastructure with auditable controls
    • Interview value: Shows enterprise architecture experience

Secondary Dimensions (Conditional)

  • Workload Characteristics (May be implied by product type):
    • Transactional: 2FA codes, alerts (low-latency, predictable)
    • Marketing/Bulk: Campaign SMS (high-throughput, bursty)
    • Real-time: Voice/Video (sustained connections, jitter-sensitive)

Note: This might be handled within cells via different service tiers rather than requiring separate cell types.

What to AVOID: Per-Service Cells Without Workflow Analysis

Don't blindly create one cell type per product. Instead, analyze actual workflow dependencies:

  • Ask: "If a customer uses Service A, how often do they call Service B in the same request?"
  • If frequently: Co-locate in the same workflow-affinity cell
  • If rarely/never: Separate cells are fineโ€”no cross-cell call risk
Bad: SMS-Cell, Voice-Cell, Video-Cell (arbitrary per-service split, ignores workflows)
Better: Messaging-Cell (SMS + MMS + WhatsApp), Realtime-Cell (Voice + Video + WebRTC)
Best: Workflow analysis determines grouping based on actual runtime dependencies

Key insight: The goal isn't "fewer cells" or "more cells"โ€”it's minimizing cross-cell calls while avoiding capacity modeling nightmares from forcing unrelated services together.
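The workflow analysis described above can be sketched as a grouping pass over observed call rates. The rates, the threshold, and the greedy first-match grouping are all simplifying assumptions; a real analysis would use tracing data and a proper clustering step:

```python
# Observed runtime call rates between services (calls per 1k requests).
# Numbers are hypothetical; in practice they come from distributed tracing.
CALL_RATES = {
    ("sms", "whatsapp"): 120,  # fallback chain: frequent
    ("voice", "video"):   80,  # combined in many apps
    ("sms", "video"):      1,  # effectively never
}

def call_rate(a: str, b: str) -> int:
    return CALL_RATES.get((a, b), 0) + CALL_RATES.get((b, a), 0)

def group_by_affinity(services, threshold=10):
    """Greedy grouping: join a service to the first group it talks to."""
    groups = []
    for svc in services:
        for g in groups:
            if any(call_rate(svc, member) >= threshold for member in g):
                g.add(svc)
                break
        else:
            groups.append({svc})
    return groups
```

With these rates, `group_by_affinity(["sms", "whatsapp", "voice", "video"])` yields a messaging group and a realtime group, matching the cell types recommended above.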

Operational Differentiation: Same Code, Different Posture

Every cell runs the same microservices (SMS Service, Voice Service, Video Service, etc.). What changes is the operational configuration:

| Aspect | Enterprise Cell | SMB Cell |
| --- | --- | --- |
| Availability | Multi-AZ, N+2 redundancy | Multi-AZ, N+1 redundancy |
| Compute | r7g.2xlarge (dedicated) | r7g.large (shared, burstable) |
| Database | Multi-AZ RDS, 6 read replicas | Multi-AZ RDS, 2 read replicas |
| Capacity | Reserved instances, pre-scaled | Auto-scaling, spot instances |
| Network | Dedicated NAT Gateways, Transit Gateway | Shared NAT, standard VPC routing |
| Storage | Provisioned IOPS SSD (io2) | General Purpose SSD (gp3) |
| Monitoring | Sub-minute CloudWatch, custom metrics | 5-minute CloudWatch, standard metrics |
| Blast Radius | 10-50 customers per cell | 1,000+ customers per cell |
| SLA | 99.99% (52 min downtime/year) | 99.9% (8.7 hrs downtime/year) |
| Cost per Customer | ~$500-2000/month | ~$10-50/month |

Key Benefits of This Architecture

1๏ธโƒฃ No Cross-Cell Cascading Failures

  • Customer uses SMS + Voice + Video โ†’ all in same cell
  • No distributed transactions across cells
  • Failures stay contained within cell boundary

2๏ธโƒฃ Noisy Neighbor Isolation

  • SMB customer traffic spike โ†’ only affects other SMB customers
  • Enterprise customers in separate cells โ†’ unaffected
  • No SLA violations for high-paying customers

3๏ธโƒฃ Simple Customer Migration

  • Upgrade SMB โ†’ Enterprise: Update DynamoDB routing table
  • Zero code changes, just point to new cell
  • Gradual traffic shifting (10% โ†’ 50% โ†’ 100%)
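The gradual traffic shift can be implemented with deterministic request bucketing, so the split stays stable while the percentage is ramped. The function and its parameters are an illustrative sketch:

```python
import hashlib

def route_request(request_id: str, old_cell: str, new_cell: str,
                  shift_pct: int) -> str:
    """Send shift_pct% of requests to the new cell during migration.

    Hashing the request ID gives a stable, uniform split as the
    percentage is ramped through the stages above (10 -> 50 -> 100).
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return new_cell if bucket < shift_pct else old_cell
```

At `shift_pct=0` every request stays in the old cell; at `shift_pct=100` the cutover is complete and the DynamoDB routing entry can be updated permanently.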

4๏ธโƒฃ Economic Optimization

  • Enterprise: Infrastructure dollars proportional to revenue
  • SMB: Economies of scale through higher density
  • Right-sized spending per customer tier

5๏ธโƒฃ Operational Simplicity

  • Every cell has identical structure (same services)
  • One deployment pipeline, one monitoring setup
  • Only Terraform variables change (instance sizes, etc.)

Infrastructure-as-Code Example

# Terraform module - workflow-affinity cells with operational tiers
module "messaging_cell" {
  source = "./modules/twilio-cell"

  # Cell identity
  cell_id  = "messaging-enterprise-us-001"
  category = "messaging"
  region   = "us-east-1"
  segment  = "enterprise"

  # Services in this workflow affinity group
  services = [
    "sms-service",
    "mms-service",
    "whatsapp-service",
    "rcs-service"
  ]

  # Operational config (enterprise tier)
  instance_type = "r7g.2xlarge"
  min_instances = 6
  scaling_metric = "queue_depth"  # Messaging scales on queue depth
}

module "realtime_cell" {
  source = "./modules/twilio-cell"

  cell_id  = "realtime-enterprise-us-001"
  category = "realtime"
  region   = "us-east-1"
  segment  = "enterprise"

  # Services in this workflow affinity group
  services = [
    "voice-service",
    "video-service",
    "webrtc-service"
  ]

  # Different scaling characteristics
  instance_type = "c7g.2xlarge"  # CPU-optimized for media processing
  min_instances = 4
  scaling_metric = "concurrent_connections"  # Realtime scales on connections
}

# Same customer can be in both cells (different workflow categories)
# Cross-category calls are rare by design

๐Ÿ’ก Interview Talking Point: Cell Taxonomy

"I'd group services by workflow affinity rather than putting everything in one cell or splitting by individual product.

The key question is: 'Do these services call each other at runtime?' If yes, co-locate them to avoid cross-cell latency and cascading failures. If they rarely or never communicate, separate cells let them scale independently with better capacity efficiency.

For Twilio, this might mean: A Messaging Cell (SMS, MMS, WhatsAppโ€”similar patterns, fallback chains), a Real-time Media Cell (Voice, Video, WebRTCโ€”connection-based, jitter-sensitive), and an Async Cell (Email, Faxโ€”batch-oriented, different SLAs).

Why not just one cell with everything? Capacity modeling becomes a nightmare. SMS is bursty, Voice is sustained, Video is bandwidth-heavy. If you force them together, you're sizing for combined peak even when they never peak simultaneously. Auto-scaling helps, but you still pay the minimum instance cost floor and headroom buffer for each service.

Within each workflow category, I'd still differentiate by operational tierโ€”Enterprise cells over-provisioned for SLA guarantees, SMB cells cost-optimized with higher density. Same deployment pipeline, different operational posture.

The goal is minimizing cross-cell calls while avoiding the capacity modeling tax of forcing unrelated services together."

The "So What?" Test for Cell Taxonomy

For each dimension, ask: "Does this require materially different infrastructure, SLAs, or operational procedures?"

  • โœ… Geography โ†’ Yes (different AWS regions, data residency laws)
  • โœ… Segment โ†’ Yes (dedicated vs shared, different SLAs)
  • โœ… Compliance โ†’ Yes (audit logs, encryption, access controls)
  • โœ… Workflow affinity โ†’ Yes (different scaling characteristics, instance types, capacity models)
  • โš ๏ธ Individual product โ†’ Only if services rarely communicate AND have very different scaling needs

Principle: Segment on operational characteristics AND runtime dependencies. Co-locate services that call each other; separate services with uncorrelated demand that don't communicate.

Example Cell Naming Convention

# Workflow-affinity cell naming: {category}-{segment}-{region}-{compliance}
Messaging-Enterprise-US-HIPAA    # Healthcare customer, messaging services, US
Messaging-SMB-EU-Standard        # Small business, messaging services, EU
Realtime-Enterprise-US-Standard  # Enterprise customer, voice/video, US
Realtime-SMB-APAC-Standard       # SMB customer, voice/video, Asia-Pacific
Async-Enterprise-US-PCI          # Payment processor, email/fax, US
Verify-Enterprise-US-HIPAA       # Healthcare, 2FA workflows, US

# Cell Router lookup (now includes service category):
Customer ID + Service Category โ†’ DynamoDB โ†’ Cell Assignment โ†’ VPC Lattice โ†’ Correct Cell

# Same customer, multiple cells:
acme-corp โ†’ messaging  โ†’ Messaging-Enterprise-US-Standard
acme-corp โ†’ realtime   โ†’ Realtime-Enterprise-US-Standard
acme-corp โ†’ verify     โ†’ Verify-Enterprise-US-Standard

Detailed Architecture Layers

Request Flow Through Architecture

[Diagram: request flow, client to data layer]

  • Client
  • Layer 1 (Global Edge): Route 53 (DNS), CloudFront (CDN), WAF/Shield; geolocation routing, DDoS protection
  • Layer 2 (Regional Gateway): Global Accelerator, API Gateway, ALB; regional failover, TLS termination
  • Layer 3 (Cell Router): customer → cell mapping via DynamoDB lookup, routing logic, VPC Lattice
  • Layer 4 (Compute Cells): Enterprise Cell (EKS cluster, Voice/SMS/Video microservices, VPC Lattice, Horizontal Pod Autoscaling); Mid-Market Cell (shared EKS cluster, microservices, VPC Lattice, Cluster Autoscaling); SMB Cell (ECS Fargate, microservices, service discovery, auto scaling)
  • Layer 5 (Data Layer): DynamoDB Global Tables (multi-master: account data, SMS logs); Aurora Multi-Master (PostgreSQL: enterprise config); Aurora Replicas (leader-follower: phone inventory); ElastiCache Redis cluster (session cache); Kinesis Streams (event streaming); SQS FIFO (message ordering); S3 + Glacier (call recordings, message archives, compliance data)

AWS Service Mapping by Layer

Layer 1: Global Edge & Traffic Management

  • Route 53: DNS with health checks, geolocation routing, latency-based routing
  • CloudFront: CDN for static assets, API caching, geo-restrictions
  • AWS Global Accelerator: Anycast IPs, automatic failover between regions
  • WAF & Shield: DDoS protection, rate limiting, bot mitigation

Layer 2: Regional Gateway

  • API Gateway: Regional REST/WebSocket APIs, request validation, throttling
  • Application Load Balancer (ALB): Layer 7 load balancing, TLS termination, path-based routing
  • Network Load Balancer (NLB): Layer 4 for Voice/Video (UDP/TCP), ultra-low latency
  • VPC Peering / Transit Gateway: Cross-region connectivity

Layer 3: Cell Router & Service Mesh

  • Cell Router Service: Custom Lambda/ECS service that maps customer โ†’ cell
  • DynamoDB: Stores customer-to-cell assignments, low-latency lookups
  • VPC Lattice: Service mesh for inter-service communication, service-to-service connectivity
  • AWS Cloud Map: Service discovery for microservices

Layer 4: Compute Cells

  • EKS (Kubernetes): Enterprise and mid-market cells, full control, custom scheduling
  • ECS Fargate: SMB cells, serverless, lower operational overhead
  • EC2 Auto Scaling: EKS worker nodes, reserved + spot instances
  • Lambda: Event-driven workloads (webhooks, async processing)

Layer 5: Data & Storage

  • DynamoDB Global Tables: Multi-master, active-active replication across regions
  • Aurora Multi-Master: PostgreSQL with multi-master for enterprise cells
  • Aurora with Read Replicas: Cross-region replicas, leader-follower pattern
  • ElastiCache (Redis): Session management, rate limiting, caching
  • S3: Call recordings, message attachments, static assets
  • Kinesis: Real-time event streaming, analytics pipeline
  • SQS/SNS: Asynchronous messaging, fan-out patterns

Database Layer Strategy

Multi-Master vs Leader-Follower Decision Matrix

The choice between multi-master and leader-follower depends on:

  • Write Pattern: Multi-region writes? Geographic distribution of writes?
  • Consistency Requirements: Strong consistency needed? Conflict tolerance?
  • Latency Sensitivity: Can tolerate cross-region write latency?
  • Data Characteristics: High contention? Append-only? Partitionable?

Database Selection by Use Case

Aurora PostgreSQL Multi-Master

Pattern: Multi-Master (Active/Active)

Use Cases:

  • Enterprise customer configurations
  • Complex queries with ACID
  • Multi-table transactions

Why:

  • โœ… ACID transactions across masters
  • โœ… SQL interface, complex queries
  • โœ… Up to 2 masters per region
  • โš ๏ธ Limited to single region multi-master
  • โš ๏ธ Higher latency than DynamoDB

Configuration: 2 masters + 3 read replicas per region

Aurora Global Database (Leader-Follower)

Pattern: Leader-Follower (Active/Passive)

Use Cases:

  • Phone number inventory
  • Payment transactions
  • Strong consistency requirements
  • Regulatory compliance data

Why:

  • โœ… Strong consistency guarantees
  • โœ… < 1 second cross-region replication
  • โœ… Failover in < 1 minute (RPO < 1s)
  • โœ… Read replicas in secondary regions
  • โš ๏ธ Writes only to primary region

Configuration: Primary in US-East-1, replicas in EU + APAC

Cassandra on EC2/Kubernetes

Pattern: Masterless (Peer-to-Peer)

Use Cases:

  • Time-series data (call metrics)
  • Extreme scale (billions of records)
  • High write throughput
  • When DynamoDB limits hit

Why:

  • โœ… Linear scalability
  • โœ… No single point of failure
  • โœ… Tunable consistency (CL=QUORUM)
  • โœ… Multi-datacenter replication
  • โš ๏ธ Higher operational complexity
  • โš ๏ธ Need dedicated ops team

When to use: Only if DynamoDB can't meet scale/cost requirements

Database Replication Topology

[Diagram: replication topologies]

  • DynamoDB Global Tables (multi-master): US-EAST-1, EU-WEST-1, and AP-SOUTHEAST-1 each accept writes and replicate to the other two regions; < 1 second replication, last-write-wins, eventual consistency
  • Aurora Global Database (leader-follower): the US-EAST-1 primary (leader) accepts all writes and replicates to secondaries; the EU-WEST-1 secondary (follower) is read-only and can be promoted to leader

Conflict Resolution Strategies

| Database | Conflict Resolution | Use Case Fit |
| --- | --- | --- |
| DynamoDB Global Tables | Last-write-wins (LWW) based on timestamp | ✅ Account updates, configs (rare conflicts); ✅ append-only logs (no conflicts) |
| Aurora Multi-Master | Automatic deadlock detection, transaction rollback | ✅ Enterprise configs with transactions; ⚠️ limited to 2 masters in same region |
| Cassandra | Tunable: LWW, custom application logic, CRDTs | ✅ Time-series (no conflicts); ✅ counters with CRDT increments |
| Application-Level | Version vectors, CRDTs, custom merge logic | ✅ Complex business logic; ✅ shopping carts, collaborative editing |

Twilio Product to Database Mapping

Principle: Match Consistency Model to Product Requirements

Different Twilio products have different consistency, latency, and scale requirements. Choose the right database pattern for each.

| Twilio Product | Data Type | Database Choice | Replication Pattern | Consistency Model | Rationale |
| --- | --- | --- | --- | --- | --- |
| Programmable Voice | Call records, CDRs | DynamoDB Global Tables | Multi-master | Eventual | Append-only, high volume, sub-second replication lag acceptable, query by CallSID |
| Programmable Voice | Active call state | ElastiCache Redis | N/A (cache) | Strong* | Low latency, session state, ephemeral (TTL). *Single-node writes within region |
| Programmable SMS | Message logs | DynamoDB Global Tables | Multi-master | Eventual | Append-only, billions of messages, partitioned by phone number, ~1 s replication lag |
| Programmable SMS | Message archives | S3 → Glacier | Cross-region replication | Eventual | Compliance, long-term retention, cost-optimized, S3 read-after-write consistency |
| Phone Numbers | Inventory management | Aurora Global Database | Leader-follower | Strong | CRITICAL: can't double-assign numbers. ACID transactions, serializable isolation |
| Conversations API | Messages, participants | DynamoDB Global Tables | Multi-master | Eventual | Append-only messages, out-of-order delivery acceptable, partition by conversation |
| Verify API | Verification tokens | ElastiCache Redis | N/A (cache) | Strong* | Short-lived (5-10 min TTL), low latency. *Regional strong consistency |
| Account Management | Customer accounts | DynamoDB Global Tables | Multi-master | Eventual | Low write contention, global access, last-write-wins acceptable, ~1 s replication |
| Billing | Usage metrics | Kinesis → S3 → Redshift | Event streaming | Eventual | High volume, analytics, batch processing, eventual aggregation acceptable |
| Billing | Invoices, payments | Aurora Global Database | Leader-follower | Strong | CRITICAL: financial transactions require ACID guarantees, no double-charging |
| Video | Room state | ElastiCache Redis | N/A (cache) | Strong* | Real-time, low latency, ephemeral. *Regional consistency, conflicts rare |
| Video | Recordings | S3 Multi-Region | Cross-region replication | Eventual | Large files, CDN integration, durability, S3 eventual cross-region consistency |

๐Ÿ“– Consistency Models Explained

Eventual Consistency

Definition: Writes to one replica will eventually propagate to all replicas, but reads may see stale data during replication lag (~1 second for DynamoDB Global Tables).

When to use:

  • Append-only data (logs, messages, events)
  • Data where conflicts are rare or easily resolved (last-write-wins)
  • High availability and low latency are more important than consistency
  • Multi-region writes are required

Trade-off: Clients might read stale data briefly, but system remains available during network partitions.

Strong Consistency

Definition: All reads see the most recent write. Once a write is acknowledged, all subsequent reads from any replica will see that write.

When to use:

  • Financial transactions (billing, payments)
  • Inventory management (phone numbers, seats)
  • Any data where stale reads cause business problems
  • ACID transaction requirements

Trade-off: Higher latency for cross-region reads, limited write scalability (single leader), reduced availability during partitions.

๐Ÿ”‘ Interview Tip: Always justify your consistency choice. For Twilio, voice/SMS logs can tolerate eventual consistency (append-only, no conflicts), but phone number inventory requires strong consistency (can't double-assign). This is a key architectural trade-off!

Data Flow Example: SMS Message

API Request
(EU customer)
โ†’
Cell Router
(DynamoDB lookup)
โ†’
EU Cell
(EKS pod)
โ†’
DynamoDB EU
(local write)
โ†’
Replicate to
US + APAC
(<1 sec)
Kinesis
(event stream)
โ†’
Firehose
(batch)
โ†’
S3
(compliance archive)
โ†’
Glacier
(7 year retention)

AWS Well-Architected Framework Alignment

๐ŸŽฏ Operational Excellence

  • IaC: Terraform/CloudFormation for all cells
  • CI/CD: Blue/green deployments per cell
  • Observability: CloudWatch, X-Ray, cell-level dashboards
  • Runbooks: Automated remediation with Systems Manager
  • Chaos Engineering: Fault injection per cell (GameDay exercises)

๐Ÿ”’ Security

  • IAM: Service roles per cell, least privilege
  • Encryption: At-rest (KMS) and in-transit (TLS 1.3)
  • Secrets: Secrets Manager, automatic rotation
  • Network: VPC isolation per cell, PrivateLink
  • Compliance: GDPR data residency, SOC2, HIPAA

๐Ÿ’ช Reliability

  • Multi-AZ: All cells across 3 AZs
  • Multi-Region: Active/active in 3 regions
  • Cell Isolation: Blast radius limited to single cell
  • Backups: Automated DynamoDB PITR, Aurora snapshots
  • Disaster Recovery: RTO < 5min, RPO < 1sec

โšก Performance Efficiency

  • Global Accelerator: Anycast routing, TCP optimization
  • DynamoDB: Single-digit ms latency, auto-scaling
  • ElastiCache: Sub-millisecond caching
  • Compute: Graviton instances (40% better perf/cost)
  • Observability: X-Ray distributed tracing

๐Ÿ’ฐ Cost Optimization

  • Right-Sizing: Enterprise (reserved), SMB (spot/fargate)
  • DynamoDB: On-demand for SMB, provisioned for enterprise
  • S3 Lifecycle: Glacier for compliance archives
  • Compute Savings Plans: 30-50% savings
  • Monitoring: Cost anomaly detection per cell

๐ŸŒ Sustainability

  • Regions: Choose AWS regions with renewable energy
  • Graviton: 60% less energy than x86
  • Serverless: Fargate, Lambda for variable workloads
  • S3 Intelligent-Tiering: Automatic archival
  • Auto-Scaling: Scale down during off-peak

Key Architecture Decisions & Trade-offs

โœ… Decision: DynamoDB Global Tables over Cassandra

Rationale: Managed service, lower ops overhead, sufficient scale for most Twilio use cases

Trade-off: Less control over replication topology, higher cost at extreme scale

When to reconsider: If DynamoDB costs > $1M/month or need more than 5 regions

โœ… Decision: Cell-based architecture by customer size

Rationale: Different SLAs, resource isolation, blast radius containment

Trade-off: More complex routing, cell rebalancing overhead

Mitigation: Automated cell assignment, gradual customer migration tools

โœ… Decision: EKS for Enterprise, ECS Fargate for SMB

Rationale: Enterprise needs custom scheduling/control, SMB needs cost efficiency

Trade-off: Managing two orchestration platforms

Mitigation: Shared CI/CD pipelines, standardized observability

โœ… Decision: Aurora Global Database for phone inventory

Rationale: Strong consistency required, can't double-assign phone numbers

Trade-off: Single write region, cross-region writes have higher latency

Mitigation: Regional inventory pools, write to nearest region's pool

Scalability Estimates

| Component | Current Scale | Target Scale | Bottleneck | Mitigation |
| --- | --- | --- | --- | --- |
| DynamoDB Global Tables | 10K RCU/WCU per table | 100K RCU/WCU | Partition hotspots | Partition key design (customer_id + timestamp), adaptive capacity |
| Aurora Global | 100K transactions/sec | 500K transactions/sec | Write throughput | Horizontal sharding by region, read replicas |
| EKS Cluster (Enterprise) | 1000 nodes | 5000 nodes | Control plane limits | Multiple clusters per region, federated control plane |
| API Gateway | 10K req/sec | 100K req/sec | Regional quotas | Multi-region, ALB for high-throughput paths |
| Kinesis Streams | 1000 shards | 10K shards | Shard management | Automatic resharding, Kinesis Data Firehose for batch |
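The hot-partition mitigation named for DynamoDB can be sketched as a time-bucketed composite key: the hour-sized bucket is an assumed tuning knob, and smaller buckets spread load further at the cost of more read fan-out:

```python
def partition_key(customer_id: str, epoch_seconds: int) -> str:
    """customer_id + hour bucket: a busy customer's writes rotate across
    partitions over time instead of concentrating on a single key."""
    return f"{customer_id}#{epoch_seconds // 3600}"

print(partition_key("acme-corp", 1_700_000_000))  # acme-corp#472222
```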

Monitoring & Observability

Cell-Level Dashboards

  • CloudWatch: Custom metrics per cell (latency, error rate, throughput)
  • X-Ray: Distributed tracing across cells, service map
  • Prometheus + Grafana: EKS cluster metrics, pod-level observability
  • DynamoDB Contributor Insights: Identify hot partitions
  • Aurora Performance Insights: Database query analysis
  • CloudWatch Alarms: Cell health checks, automatic remediation via Lambda

SLI/SLO Tracking:

  • API latency p99 < 200ms (per cell, per product)
  • Error rate < 0.1% (per cell)
  • Availability > 99.99% (enterprise cells), > 99.95% (mid-market), > 99.9% (SMB)
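The downtime figures quoted for the cell tiers follow directly from the availability targets:

```python
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_min(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MIN_PER_YEAR * (1 - availability)

print(round(downtime_budget_min(0.9999), 1))      # 52.6 min/year (99.99%)
print(round(downtime_budget_min(0.999) / 60, 2))  # 8.76 hrs/year (99.9%)
```

This budget is what error-budget-based alerting burns against: a single hour-long cell outage consumes the entire annual enterprise budget, which is why enterprise cells get N+2 redundancy and sub-minute monitoring.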

Summary & Interview Talking Points

๐ŸŽค 2-Minute Elevator Pitch

"For Twilio's global platform, I'd design a cell-based architecture partitioned by customer sizeโ€”enterprise, mid-market, and SMBโ€”deployed across three AWS regions for multi-region active/active redundancy."

"At the data layer, we'd use DynamoDB Global Tables for multi-master replication where eventual consistency is acceptableโ€”like customer accounts and message logs. For strong consistency needs like phone number inventory, we'd use Aurora Global Database with leader-follower replication."

"Enterprise customers get dedicated EKS clusters with Aurora Multi-Master and reserved capacity. Mid-market shares EKS with Aurora replicas. SMB runs on cost-optimized ECS Fargate."

"This design aligns with AWS Well-Architected: operational excellence through IaC and cell-level observability, security via VPC isolation and encryption, reliability through multi-AZ/multi-region with blast radius containment, performance via low-latency databases and global accelerator, and cost optimization by right-sizing per customer segment."

"The cell architecture limits blast radiusโ€”if one cell fails, it only affects that customer segment. We can independently scale, deploy, and test each cell."

๐Ÿ“‹ Key Takeaways

  1. Cell-based architecture limits blast radius and enables independent scaling
  2. Multi-master (DynamoDB Global Tables) for low-latency global writes with eventual consistency
  3. Leader-follower (Aurora Global Database) for strong consistency requirements
  4. Customer segmentation (Enterprise/Mid-Market/SMB) drives compute and database choices
  5. Product-specific patterns match data consistency to business requirements
  6. AWS-native services reduce operational overhead vs self-managed (DynamoDB > Cassandra)
  7. Well-Architected Framework ensures holistic design across all pillars

๐Ÿš€ Next Steps for Deep Dive

  • Design detailed cell routing algorithm (DynamoDB lookups, least-loaded assignment, cell rebalancing)
  • Plan disaster recovery runbooks (regional failover, data loss scenarios)
  • Cost modeling per customer segment (reserved vs on-demand, DynamoDB pricing)
  • Security deep-dive (GDPR compliance, data residency, encryption key management)
  • Migration strategy (monolith โ†’ cells, zero-downtime migration)