The Challenge: Product-Agnostic Cells Add Deployment Complexity
When each cell runs ALL Twilio services (SMS, Voice, Video, WhatsApp, Email), releasing software across cells becomes complicated. This is the fundamental trade-off of product-agnostic cells:
| Benefit | Trade-off |
|---|---|
| No cross-cell API calls (runtime simplicity) | More cells to update during deployments |
| Customer's services co-located | Breaking changes require global coordination |
| Blast radius limited to cell | Schema migrations must be orchestrated carefully |
The Meta-Insight
Product-agnostic cells shift complexity from runtime (no cross-cell calls) to deployment (more cells to update). That's a good trade—deployment complexity is solved with automation; runtime complexity causes outages.
Deployment Strategy: Progressive Rollout
Wave-Based Deployment
Wave 0: Canary Cell (1 cell)
└── Internal traffic or opt-in beta customers
└── Bake time: 30-60 minutes
└── Automated rollback on error rate spike
Wave 1: SMB Cells (25% of SMB)
└── Lower blast radius, self-service customers
└── Bake time: 2-4 hours
└── Monitor error rates, latency p99
Wave 2: Remaining SMB Cells (75%)
└── Bake time: 4-8 hours
Wave 3: Enterprise Cells (one at a time)
└── Bake time: 24 hours per cell
└── Customer notification for maintenance windows
Why This Works
| Concern | Solution |
|---|---|
| Bad deploy takes down everything | Wave-based rollout limits blast radius |
| Rolling back is hard | Each cell is independent - rollback only affected cells |
| Enterprise customers need stability | They're always last, with longest bake times |
| Speed vs safety | SMB cells move fast; Enterprise cells move carefully |
Automated Deployment Pipeline
```yaml
# deployment-pipeline.yaml (Step Functions or similar)
states:
  DeployToCanary:
    cell: canary-us-internal
    action: deploy
    next: ValidateCanary

  ValidateCanary:
    checks:
      - error_rate < 0.1%
      - p99_latency < 500ms
      - no_pagerduty_alerts: 30m
    on_failure: RollbackCanary
    on_success: DeployToSMBWave1

  DeployToSMBWave1:
    cells: [smb-us-1, smb-us-2, smb-eu-1]  # 25% of SMB
    parallel: true
    next: ValidateSMBWave1

  ValidateSMBWave1:
    bake_time: 4h
    checks: [error_rate, latency, customer_complaints]
    on_failure: RollbackSMBWave1
    on_success: DeployToSMBWave2

  # ... continues through all waves

  DeployToEnterprise:
    cells: [enterprise-us-hipaa]
    requires_approval: true  # Human gate for enterprise
    notification: slack://enterprise-releases
```
Decouple Deployment from Activation
The Principle
Deploy code to all cells quickly, activate features slowly. Feature flags let you deploy code without activating it.
```javascript
// Feature flags let you deploy code without activating it
const features = {
  newVoiceCodec: {
    enabled: false,       // Deployed everywhere, activated nowhere
    rolloutPercentage: 0,
    allowlist: ['enterprise-customer-123'] // Beta testers
  }
};

// Gradual activation AFTER deployment is complete
// Day 1: Deploy to all cells (feature off)
// Day 2: Enable for 5% of traffic
// Day 3: Enable for 25%
// Day 7: Enable for 100%
```
Separation of Concerns
┌─────────────────────────────────────────────────────────────────┐
│ Breaking changes are a ROUTING problem, not a DEPLOYMENT │
│ problem. │
│ │
│ Deploy code progressively → Cells (wave-based, safe) │
│ Enable versions instantly → Landing Zone (global, atomic) │
└─────────────────────────────────────────────────────────────────┘
Operations Strategy: Centralized Control, Decentralized Execution
GitOps Model
```
infrastructure-repo/
├── modules/
│   └── cell/                    # Shared cell template
│       ├── sms-service.tf
│       ├── voice-service.tf
│       ├── video-service.tf
│       └── variables.tf
│
├── cells/
│   ├── enterprise-us-hipaa/
│   │   └── terragrunt.hcl       # Just variables, includes module
│   ├── enterprise-eu-standard/
│   │   └── terragrunt.hcl
│   ├── smb-us-standard/
│   │   └── terragrunt.hcl
│   └── ...
│
└── argocd/
    └── applications/            # One ArgoCD app per cell
```
Operational Pattern
┌─────────────────────────────────────────────────────┐
│ Central Control Plane │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ ArgoCD │ │ Datadog/ │ │ PagerDuty │ │
│ │ (Deploy) │ │ Grafana │ │ (Alerting) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
└─────────┼────────────────┼────────────────┼─────────┘
│ │ │
┌─────┴─────┬──────────┴────────┬───────┴────┐
▼ ▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│Cell 1 │ │Cell 2 │ ... │Cell N │ │Cell M │
│ (SMB) │ │ (SMB) │ │(Ent.) │ │(Ent.) │
└───────┘ └───────┘ └───────┘ └───────┘
One Dashboard, Drill-Down Capability
```yaml
# Datadog dashboard hierarchy
Global View:
  - All cells health heatmap (green/yellow/red)
  - Cross-cell comparison: latency, error rates, CPU
  - Deployment status: which cells on which version

Cell View (drill down):
  - Per-service metrics within cell
  - Customer impact (affected accounts)
  - Recent deployments and config changes

Service View (drill down further):
  - Individual pod/container metrics
  - Traces, logs, spans
```
Key Operational Simplifications
| Challenge | Solution |
|---|---|
| "Which cell is on which version?" | ArgoCD dashboard shows all cells |
| "How do I debug across cells?" | Centralized logging (Datadog/Splunk) with cell_id tag |
| "Runbooks for each cell?" | Same runbook, parameterized by cell_id |
| "Alert fatigue from N cells?" | Aggregate alerts by issue type, not by cell |
| "Config drift between cells?" | GitOps ensures cells match repo state |
API Versioning at the Edge (Landing Zone)
The Principle
Breaking API changes don't roll through cells—they're handled at the Landing Zone edge. Cells run multiple API versions simultaneously; the edge routes /v1/* to v1 handlers and /v2/* to v2 handlers.
Architecture: Version Routing at Edge
┌─────────────────────────────────────┐
│ Cloud Native Landing Zone │
│ ┌─────────────────────────────────┐ │
│ │ API Gateway (Version Router) │ │
│ │ │ │
POST /v1/Messages│ │ /v1/* ──→ v1 handler │ │
─────────────────┼─▶│ /v2/* ──→ v2 handler │ │
POST /v2/Messages│ │ │ │
│ └───────────┬─────────────────────┘ │
└──────────────┼───────────────────────┘
│
┌──────────────┼───────────────────────┐
│ Cell Router │
│ (Routes to correct cell) │
└──────────────┼───────────────────────┘
│
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Cell 1 │ │ Cell 2 │ │ Cell 3 │
│ (v1+v2) │ │ (v1+v2) │ │ (v1+v2) │
└──────────┘ └──────────┘ └──────────┘
Cells Run Multiple API Versions
```javascript
// Cells are version-agnostic - they run ALL supported versions
// The edge determines which version handler to invoke

// In Landing Zone API Gateway:
const routes = {
  'POST /v1/Messages': 'sms-service:handleV1Message',
  'POST /v2/Messages': 'sms-service:handleV2Message', // Breaking change
  'POST /v1/Calls':    'voice-service:handleV1Call',
  'POST /v2/Calls':    'voice-service:handleV2Call',
};

// Inside each cell - BOTH handlers exist simultaneously
class SmsService {
  handleV1Message(req) {
    // Old contract: { To, From, Body }
    return this.sendMessage(req.To, req.From, req.Body);
  }

  handleV2Message(req) {
    // New contract: { recipient, sender, content, metadata }
    return this.sendMessage(req.recipient, req.sender, req.content);
  }

  // Shared internal logic - only contract differs
  sendMessage(to, from, body) { /* ... */ }
}
```
What Lives Where
| Layer | Responsibilities | Breaking Change Strategy |
|---|---|---|
| Landing Zone | API Gateway, version routing, rate limiting, auth | Routes /v1/* vs /v2/* to correct handlers |
| Control Plane | Cell provisioning, customer routing, capacity | Tracks which cells support which versions |
| Cell Router | Customer → Cell mapping | Version-unaware (all cells run all versions) |
| Cell | Run services, process requests | Contains BOTH v1 and v2 handlers |
Breaking Change Rollout Timeline
Month 0: Deploy v2 handlers to all cells (v2 disabled at edge)
Progressive rollout, normal wave-based deployment. v2 code exists but receives zero traffic.
Month 1: Enable v2 at API Gateway
Single config change in Landing Zone. Instant global availability. v1 still works, no customer impact.
Month 3: Announce v1 deprecation
Customer notification. v1 continues working.
Month 12: Disable v1 at API Gateway
Single config change. Remove v1 handlers from cells (cleanup).
Risk Model Inversion
Deployment is low-risk because the new code isn't reachable. Enablement is low-risk because all cells already have the code. And sunset is just another config change—disable the old route when customers have migrated.
Landing Zone API Gateway Config
```yaml
# api-gateway-config.yaml (AWS API Gateway / Kong / Envoy)
routes:
  - path: /v1/Messages
    status: active
    backend: ${cell_router}/sms/v1/messages
    deprecation_date: null

  - path: /v2/Messages
    status: active
    backend: ${cell_router}/sms/v2/messages
    introduced: 2025-01-15

  - path: /v0/Messages
    status: deprecated
    backend: ${cell_router}/sms/v1/messages  # Internally aliased
    sunset_date: 2025-06-01
    response_headers:
      Deprecation: "true"
      Sunset: "Sat, 01 Jun 2025 00:00:00 GMT"
```
Control Plane Version Tracking
```javascript
// Control Plane tracks cell capabilities
const cellMetadata = {
  'enterprise-us-hipaa': {
    versions: {
      sms:   ['v1', 'v2'],
      voice: ['v1', 'v2'],
      video: ['v1']        // v2 not yet deployed to this cell
    },
    lastDeployed: '2025-01-15T10:30:00Z'
  }
};

// During v2 rollout, Control Plane can:
// 1. Track which cells have v2 code
// 2. Gate v2 enablement until all cells ready
// 3. Provide rollout status dashboard
```
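The gating capability is the one that keeps enablement safe: the edge config flip must wait until every cell reports the new version. A minimal sketch of that check; the `canEnableVersion` helper is a hypothetical name, not an existing API:

```javascript
// Hypothetical sketch: gate edge enablement of an API version until
// every cell's metadata reports that version for the service.
function canEnableVersion(cellMetadata, service, version) {
  return Object.values(cellMetadata).every(
    cell => (cell.versions[service] || []).includes(version)
  );
}

const cellMetadata = {
  'smb-us-1':            { versions: { video: ['v1', 'v2'] } },
  'enterprise-us-hipaa': { versions: { video: ['v1'] } }, // v2 not deployed yet
};

canEnableVersion(cellMetadata, 'video', 'v2'); // false - hold the edge config change
canEnableVersion(cellMetadata, 'video', 'v1'); // true
```

This is the "two separate concerns" split in code form: deployment updates `cellMetadata`, and only when this predicate flips to true does the Landing Zone route change become eligible.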
Interview Talking Point: API Versioning
"Breaking API changes don't roll through cells—they're handled at the Landing Zone. Cells run multiple API versions simultaneously; the edge routes/v1/*to v1 handlers and/v2/*to v2 handlers.
The rollout becomes two separate concerns: First, deploy the new handlers to cells using normal progressive rollout—code exists but receives no traffic. Then, enable the new version at the API Gateway—a single global config change that's instant and atomic.
This inverts the risk model. Deployment is low-risk because the new code isn't reachable. Enablement is low-risk because all cells already have the code. And sunset is just another config change—disable the old route when customers have migrated.
The Control Plane tracks which cells support which versions, so we can gate enablement until all cells are ready. But the versioning logic itself lives at the edge where it belongs."
Database Schema Migrations Across Cells
Key Advantage
Cell isolation actually helps with schema migrations. Each cell has its own database, so you migrate cell-by-cell rather than globally. The blast radius of a failed migration is one cell, not the whole platform.
The Core Pattern: Expand-Contract Migrations
┌─────────────────────────────────────────────────────────────────┐
│ EXPAND-CONTRACT PATTERN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Phase 1: EXPAND (Schema) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ALTER TABLE messages ADD COLUMN recipient_id VARCHAR; │ │
│ │ -- Nullable, old code ignores it, new code can write │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Phase 2: MIGRATE (Code + Data) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ - Deploy code that writes to BOTH old and new columns │ │
│ │ - Backfill: UPDATE messages SET recipient_id = to_phone │ │
│ │ - Code reads from new column, falls back to old │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Phase 3: CONTRACT (Cleanup) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ - Deploy code that ONLY uses new column │ │
│ │ - ALTER TABLE messages DROP COLUMN to_phone; │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Cell-by-Cell Migration Strategy
Control Plane
(Migration Orchestrator)
│
┌─────────────────┼─────────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Cell 1 │ │ Cell 2 │ │ Cell 3 │
│ v2.1 │ │ v2.0 │ │ v2.0 │
│ Schema │ │ Schema │ │ Schema │
│ ✅ Done │ │ 🔄 Running│ │ ⏳ Pending│
└─────────┘ └─────────┘ └─────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ DB │ │ DB │ │ DB │
│ (new) │ │(migrating)│ │ (old) │
└─────────┘ └─────────┘ └─────────┘
Migration order: Canary → SMB Wave 1 → SMB Wave 2 → Enterprise
The Golden Rule: Code Must Handle Both Schemas
```javascript
// During migration window, code must be "schema-bilingual"
class MessageRepository {
  async getMessage(id) {
    const row = await db.query('SELECT * FROM messages WHERE id = ?', [id]);

    // Handle both old and new schema
    return {
      id: row.id,
      // New column with fallback to old
      recipientId: row.recipient_id || row.to_phone,
      senderId: row.sender_id || row.from_phone,
      // ... rest of fields
    };
  }

  async createMessage(msg) {
    // Write to BOTH during migration (dual-write)
    await db.query(`
      INSERT INTO messages (id, to_phone, from_phone, recipient_id, sender_id, body)
      VALUES (?, ?, ?, ?, ?, ?)
    `, [msg.id, msg.recipientId, msg.senderId, msg.recipientId, msg.senderId, msg.body]);
    //          ↑ old columns                  ↑ new columns (same values)
  }
}
```
Migration Orchestration in Control Plane
```yaml
# migration-manifest.yaml
migration:
  id: "2025-01-15-recipient-id-column"
  description: "Rename to_phone → recipient_id"

  phases:
    - name: expand
      type: schema
      script: migrations/001_add_recipient_id.sql
      rollout:
        - cells: [canary-internal]
          wait: 1h
        - cells: [smb-us-1, smb-us-2]
          wait: 4h
        - cells: [smb-*]
          wait: 24h
        - cells: [enterprise-*]
          wait: 48h
          requires_approval: true

    - name: migrate-code
      type: deployment
      version: "v2.5.0"           # Code with dual-write
      rollout: standard           # Normal wave-based deployment

    - name: backfill
      type: data
      script: migrations/001_backfill_recipient_id.py
      parallel_cells: 3           # Max 3 cells backfilling at once
      checkpoint_interval: 10000  # Rows between checkpoints

    - name: migrate-code-readonly
      type: deployment
      version: "v2.6.0"           # Code reads only new column
      rollout: standard

    - name: contract
      type: schema
      script: migrations/001_drop_to_phone.sql
      rollout:
        - cells: [canary-internal]
          wait: 24h
        - cells: [smb-*]
          wait: 48h
        - cells: [enterprise-*]
          wait: 168h              # 1 week bake time
          requires_approval: true
```
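The backfill phase's `checkpoint_interval` deserves a concrete illustration: process rows in batches and persist the high-water mark, so an interrupted backfill resumes where it stopped instead of rescanning the table. A sketch under stated assumptions; the `db` and `checkpoints` interfaces are illustrative, not a real driver:

```javascript
// Hypothetical sketch of a resumable, checkpointed backfill for the
// to_phone → recipient_id migration. `db.query` and `checkpoints`
// are assumed interfaces.
async function backfillRecipientId(db, checkpoints, batchSize = 10000) {
  // Resume from the last checkpoint (0 on first run)
  let lastId = (await checkpoints.get('001_backfill_recipient_id')) || 0;

  for (;;) {
    // Fetch the next batch of rows still missing the new column value
    const rows = await db.query(
      'SELECT id, to_phone FROM messages WHERE id > ? AND recipient_id IS NULL ORDER BY id LIMIT ?',
      [lastId, batchSize]
    );
    if (rows.length === 0) break; // Backfill complete

    // Copy the old column's value into the new column
    for (const row of rows) {
      await db.query('UPDATE messages SET recipient_id = ? WHERE id = ?',
        [row.to_phone, row.id]);
    }

    // Persist progress so a crash or pause resumes here, not from zero
    lastId = rows[rows.length - 1].id;
    await checkpoints.set('001_backfill_recipient_id', lastId);
  }
}
```

Batching also keeps the backfill from monopolizing the cell's database: between checkpoints you can throttle, pause, or let the orchestrator stop the phase entirely, which is exactly the "stop backfill, no harm done" rollback property noted below.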
Online Schema Change Tools (Zero Downtime)
```bash
# pt-online-schema-change (Percona) - for MySQL
# Creates shadow table, syncs with triggers, atomic swap
pt-online-schema-change \
  --alter "ADD COLUMN recipient_id VARCHAR(50)" \
  --execute \
  D=twilio,t=messages

# gh-ost (GitHub) - for MySQL
# Binlog-based replication, no triggers
gh-ost \
  --alter="ADD COLUMN recipient_id VARCHAR(50)" \
  --database=twilio \
  --table=messages \
  --execute

# For PostgreSQL - native online DDL is better
# But for large tables, use pg_repack or similar
```
Control Plane Migration State Tracking
```javascript
// Control Plane tracks migration state per cell
const migrationState = {
  'migration-2025-01-15-recipient-id': {
    status: 'in_progress',
    currentPhase: 'backfill',
    cells: {
      'canary-internal':     { phase: 'contract', status: 'complete' },
      'smb-us-1':            { phase: 'backfill', status: 'complete', rows: 1500000 },
      'smb-us-2':            { phase: 'backfill', status: 'running', rows: 850000, total: 2000000 },
      'smb-eu-1':            { phase: 'migrate-code', status: 'complete' },
      'enterprise-us-hipaa': { phase: 'expand', status: 'pending' },
    },
    canRollback: true,  // Until contract phase
    startedAt: '2025-01-15T10:00:00Z',
  }
};

// Dashboard shows:
// ████████████░░░░░░░░ 60% complete (12/20 cells)
// Current: Backfilling smb-us-2 (42% of rows)
// Next: smb-eu-2 (waiting for smb-us-2)
```
The Rollback Question
| Phase | Rollback Strategy |
|---|---|
| Expand | Just drop the new column (no data loss) |
| Migrate Code | Roll back to old code version (still reads old columns) |
| Backfill | Stop backfill, no harm done |
| Contract | Point of no return - old column is gone |
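This table maps directly to a guard the migration orchestrator can enforce in code. A hypothetical sketch, with the `migrate-code-readonly` phase added to match the manifest above; the phase names and `rollbackPlan` helper are illustrative:

```javascript
// Hypothetical sketch: the orchestrator refuses automatic rollback once
// a cell has entered the irreversible contract phase.
const ROLLBACK_STRATEGY = {
  'expand':                'drop new column (no data loss)',
  'migrate-code':          'redeploy previous code version',
  'backfill':              'stop backfill (no harm done)',
  'migrate-code-readonly': 'redeploy dual-read code version',
  'contract':              null, // point of no return
};

function rollbackPlan(phase) {
  const strategy = ROLLBACK_STRATEGY[phase];
  if (strategy === null) {
    throw new Error(`Cannot roll back '${phase}': old column already dropped`);
  }
  return strategy;
}

rollbackPlan('backfill');   // 'stop backfill (no harm done)'
// rollbackPlan('contract') would throw - recovery requires human judgment
```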
Point of No Return
The contract phase (dropping old columns) is irreversible. This is why enterprise cells get 1-week bake times and require human approval before contract phase.
What About Global/Platform Services?
Platform Services (Identity, Billing) have shared databases that span regions:
Option 1: DynamoDB Global Tables (Schema-less)
- No formal schema to migrate
- Code handles missing/new attributes gracefully
- Just deploy code that writes new attributes
Option 2: Aurora Global Database
- Primary region does schema change
- Replicates to read replicas automatically
- Use expand-contract (longer timeline)
- Schema change is atomic, replication handles it
Option 3: Blue-Green Database
- Provision new database with new schema
- Dual-write during migration
- Cut over when backfill complete
- More complex, but zero-risk rollback
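For Option 1, "code handles missing/new attributes gracefully" usually means defaulting at read time: old items simply lack newer attributes, and the code fills in sensible defaults instead of requiring a migration. A sketch with an illustrative item shape; the attribute names and `hydrateAccount` helper are assumptions:

```javascript
// Hypothetical sketch: reading a DynamoDB-style item where newer
// attributes may be absent on older records. Defaults are applied at
// read time, so no schema migration is needed.
function hydrateAccount(item) {
  return {
    accountId: item.accountId,
    plan: item.plan,
    // Attribute added later: old items simply don't have it
    dataResidency: item.dataResidency ?? 'us',
    // Attribute renamed: prefer the new name, fall back to the old one
    displayName: item.displayName ?? item.friendly_name ?? '',
  };
}

hydrateAccount({ accountId: 'AC1', plan: 'smb' });
// → { accountId: 'AC1', plan: 'smb', dataResidency: 'us', displayName: '' }
```

This is expand-contract in miniature: the read path is "bilingual" across old and new attribute names, and the old name can be dropped from the code once all hot items have been rewritten.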
Interview Talking Point: Schema Migrations
"Schema migrations in a cell-based architecture are actually easier because each cell has its own database—you migrate cell by cell, not globally. The blast radius of a failed migration is one cell, not the whole platform.
The pattern is expand-contract: add the new column as nullable, deploy code that dual-writes, backfill existing data, then deploy code that only uses the new column, and finally drop the old column. Code must be schema-bilingual during the migration window.
The Control Plane orchestrates this—tracking which cells are in which phase, managing the rollout order (canary, SMB, then enterprise), and knowing when rollback is still possible. The point of no return is the contract phase when you drop the old column.
For global Platform Services like Identity, I'd use DynamoDB Global Tables where possible—no formal schema means no schema migrations. For relational data, Aurora Global handles replication automatically, but you still use expand-contract with longer timelines since there's no cell isolation to limit blast radius."
Interview Summary: Deployment & Operations
Comprehensive Talking Point
"Product-agnostic cells add deployment complexity, but we solve it with progressive rollout and decoupled activation. Code deploys in waves—canary cell first, then SMB cells, then enterprise cells last with 24-hour bake times. But I separate deployment from activation using feature flags, so we can deploy everywhere quickly but enable features gradually.
For operations, it's GitOps: one cell module, many instances with different variables. ArgoCD syncs the desired state, Datadog gives us one dashboard with drill-down to any cell, and runbooks are parameterized by cell_id. The complexity is in the automation, not the operator's head.
Breaking API changes don't roll through cells—they're handled at the Landing Zone. And schema migrations are actually easier with cell isolation—you migrate cell by cell, limiting blast radius to one cell instead of the whole platform."
Key Principles Summary
| Principle | Implementation |
|---|---|
| Progressive Rollout | Canary → SMB → Enterprise with increasing bake times |
| Decouple Deploy from Enable | Feature flags let you deploy everywhere, activate gradually |
| API Versioning at Edge | Landing Zone routes /v1/* and /v2/* to different handlers |
| Schema: Expand-Contract | Add column → Dual-write → Backfill → Drop old column |
| Cell Isolation Helps | Blast radius limited to one cell for both deploys and migrations |
| GitOps Everything | One module, N instances; ArgoCD ensures cells match repo state |
The Meta-Principle
Cells are execution environments. They shouldn't make routing decisions about API versions. Push that complexity UP to the Landing Zone where you have global visibility and atomic control. Push operational complexity INTO automation, not operators' heads.