How to Use This Document
This document prepares you for Distinguished Architect-level questions about the specific architectures used at Twilio Segment. Questions progress from foundational understanding to deep technical reasoning to trade-off analysis.
Kafka & Event Streaming Intermediate
Exactly-once in Kafka requires three components working together:
- Idempotent producer: Set `enable.idempotence=true`. Kafka assigns a producer ID and sequence number to each message, deduplicating retries within a session.
- Transactional writes: For cross-partition atomicity, use `initTransactions()`, `beginTransaction()`, `commitTransaction()`. Either all partitions get the message or none do.
- Consumer isolation: Set `isolation.level=read_committed` so consumers only see committed messages.
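The idempotent-producer half of this can be sketched as a toy simulation: the broker tracks the highest sequence number it has appended per producer ID and silently drops retries of already-acked batches. This is illustrative only (`BrokerPartition` is an invented name, not real Kafka broker code).

```javascript
// Sketch (not the real broker code): how Kafka's idempotent producer
// avoids duplicates. The broker tracks the highest sequence number seen
// per producer ID on each partition and drops retried batches it has
// already appended.
class BrokerPartition {
  constructor() {
    this.log = [];
    this.lastSeq = new Map(); // producerId -> last appended sequence
  }
  append(producerId, seq, value) {
    const last = this.lastSeq.has(producerId) ? this.lastSeq.get(producerId) : -1;
    if (seq <= last) return 'duplicate-dropped'; // retry of an acked batch
    this.log.push(value);
    this.lastSeq.set(producerId, seq);
    return 'appended';
  }
}

const p = new BrokerPartition();
p.append('producer-1', 0, 'event-a'); // appended
p.append('producer-1', 1, 'event-b'); // appended
p.append('producer-1', 1, 'event-b'); // network retry: duplicate-dropped
console.log(p.log); // ['event-a', 'event-b']
```

Note the scope limit the text describes: this dedup is per producer session and per partition, which is exactly why client-side retries with fresh producer IDs still get through.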
But this is only exactly-once within Kafka. The real challenge is external system integration, which is why Segment uses application-level deduplication with RocksDB.
Kafka's exactly-once is scoped to the Kafka cluster. Once you publish to an external destination (Google Analytics, Salesforce), you're back to at-least-once. Mobile clients also retry on network failure, creating duplicates before data reaches Kafka. The 0.6% duplicate rate comes from these client retries, not from Kafka itself. RocksDB provides application-level deduplication that persists across consumer restarts.
Partitioning by messageId concentrates duplicates:
- Same messageId always goes to the same partition and consumer
- Deduplication becomes local - check one RocksDB instance, not a distributed lookup
- If we partitioned by customerId, duplicates could land on different partitions, requiring cross-partition deduplication (expensive)
The trade-off: we lose per-customer ordering. But for analytics events, global ordering matters less than throughput. If you need per-customer ordering, you'd partition by customerId and accept the distributed dedup complexity.
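The locality argument can be shown with a minimal sketch: a deterministic hash of the messageId picks the partition, so a client retry of the same messageId always lands on the same worker. (djb2 stands in here for Kafka's actual murmur2 partitioner; the partition count is arbitrary.)

```javascript
// Sketch of why partitioning by messageId localizes deduplication:
// the same key always hashes to the same partition, so any retry of a
// message lands on the worker that already saw it.
function hash(s) {
  let h = 5381; // djb2-style hash, stand-in for Kafka's murmur2
  for (const c of s) h = ((h * 33) ^ c.charCodeAt(0)) >>> 0;
  return h;
}
const NUM_PARTITIONS = 12;
const partitionFor = (messageId) => hash(messageId) % NUM_PARTITIONS;

// A client retry of the same messageId maps to the same partition:
console.log(partitionFor('msg-123') === partitionFor('msg-123')); // true
```

Partitioning by customerId instead would make `partitionFor` stable per customer but not per message, which is the cross-partition dedup problem described above.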
Multi-tier failover within each TAPI shard:
- Primary cluster in the same AZ as the TAPI shard
- Secondary cluster in a different AZ, continuously replicated
- Replicated service monitors broker health metrics (latency, error rate, ISR shrinkage)
- When primary degrades beyond threshold, routes new writes to secondary
- Automatic revert when primary recovers
This is more sophisticated than Kafka's built-in replication because it handles entire cluster failures, not just broker failures.
RocksDB & Deduplication Advanced
RocksDB serves as a high-performance, embedded deduplication index. The key requirements:
- 60 billion keys with 4-week retention
- Sub-millisecond lookups for every incoming message
- 1.5TB storage per partition
Why not alternatives:
| Option | Why Not |
|---|---|
| Redis | 60B keys × 40 bytes = 2.4TB RAM. Cost-prohibitive for a cache. |
| DynamoDB | Network latency on every message (~5-15ms) would throttle throughput. |
| In-memory HashMap | Same RAM problem as Redis, plus no persistence. |
RocksDB is embedded (no network hop), persistent (EBS-backed), and uses Bloom filters for the common case (new messages that haven't been seen).
| Aspect | Traditional Cache | RocksDB Here |
|---|---|---|
| Location | Remote (network hop) | Embedded (local) |
| Cache miss | Query source of truth | Never - it has ALL data for its partition |
| Durability | Ephemeral | Persistent (EBS) |
| Rebuild | Cold start, gradual fill | Read Kafka output once |
It's a "cache" in that it's derived data (can be rebuilt from Kafka), but it never has misses because it's complete for its partition.
Bloom filters are probabilistic data structures that answer "definitely not in set" or "possibly in set."
How they work:
- Hash the key with k hash functions
- Set k bits in a bit array
- To query: hash again, check if all k bits are set
- If any bit is 0: definitely not seen (no disk read needed)
- If all bits are 1: possibly seen (read from LSM tree to confirm)
Why critical here: Most messages are new (not duplicates). For new messages, the Bloom filter returns "not seen" without touching disk. At 400K messages/second, avoiding disk reads on 99.4% of lookups is essential.
RocksDB default is ~1% false positive rate. You tune it with bits per key (10 bits = 1%, 15 bits = 0.1%). For deduplication, false positives mean unnecessarily reading disk for a new message - annoying but not catastrophic. False negatives are impossible (the design guarantees this), which matters because a false negative would mean letting a duplicate through.
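A minimal Bloom filter following the steps above: k hash functions set k bits, and a query is "definitely absent" if any bit is 0. Sizes here are toy values, not RocksDB's tuned defaults, and the FNV-style hashing is an assumption for illustration.

```javascript
// Minimal Bloom filter sketch: k hash functions set k bits; a lookup is
// "definitely not seen" if any bit is 0, "possibly seen" if all are 1.
class BloomFilter {
  constructor(bits = 1024, k = 3) {
    this.bits = new Uint8Array(bits);
    this.k = k;
  }
  hashes(key) {
    const out = [];
    let h = 2166136261; // FNV-1a style; index i varies the input per hash
    for (let i = 0; i < this.k; i++) {
      for (const c of key + i) {
        h ^= c.charCodeAt(0);
        h = (h * 16777619) >>> 0;
      }
      out.push(h % this.bits.length);
    }
    return out;
  }
  add(key) { for (const i of this.hashes(key)) this.bits[i] = 1; }
  mightContain(key) { return this.hashes(key).every((i) => this.bits[i] === 1); }
}

const bf = new BloomFilter();
bf.add('msg-abc');
console.log(bf.mightContain('msg-abc')); // true — a false negative is impossible
```

The no-false-negative guarantee falls out of the structure: `add` sets exactly the bits that `mightContain` later checks, so an added key can never read a 0 bit.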
Crash Scenario:
1. Worker reads from Kafka INPUT
2. Writes to RocksDB
3. Publishes to Kafka OUTPUT
4. Commits offset to Kafka INPUT
Problem: what happens if the worker crashes between two of these steps?
Case A: Crash after step 2, before step 3
- RocksDB has the ID, Kafka OUTPUT doesn't
- On restart: rebuild RocksDB from Kafka OUTPUT
- ID is NOT in rebuilt RocksDB, message is re-processed
- Correct behavior (at-least-once)
Case B: Crash after step 3, before step 4
- Kafka OUTPUT has the message
- On restart: Kafka replays from last committed offset
- Message re-processed, but RocksDB (rebuilt) knows it's a duplicate
- Correct behavior (deduplication handles it)
Key insight: Kafka OUTPUT is the commit point.
RocksDB is rebuildable from OUTPUT.
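Both crash cases can be checked with a toy simulation, assuming a Map stands in for RocksDB and an array for the OUTPUT topic (names like `rebuildDedup` and `handleMessage` are invented for the sketch). Because the dedup state is rebuilt by replaying OUTPUT, the replayed message is either re-emitted exactly once (Case A) or recognized as a duplicate (Case B).

```javascript
// Sketch of the crash cases: dedup state is derived from the OUTPUT
// topic, so the commit point decides the outcome of a replay.
function rebuildDedup(outputTopic) {
  const dedup = new Map();
  for (const msg of outputTopic) dedup.set(msg.id, true); // replay OUTPUT
  return dedup;
}
function handleMessage(msg, dedup, outputTopic) {
  if (dedup.has(msg.id)) return 'dropped-duplicate';
  dedup.set(msg.id, true); // step 2: write RocksDB
  outputTopic.push(msg);   // step 3: publish to OUTPUT (the commit point)
  return 'delivered';
}

// Case A: crashed after step 2, before step 3 — OUTPUT is empty, so the
// rebuilt state forgets the ID and the replayed message goes through once.
const output = [];
let dedup = rebuildDedup(output);
console.log(handleMessage({ id: 'm1' }, dedup, output)); // 'delivered'

// Case B: crashed after step 3, before the offset commit — OUTPUT has the
// message, so the rebuilt state drops the replay.
dedup = rebuildDedup(output);
console.log(handleMessage({ id: 'm1' }, dedup, output)); // 'dropped-duplicate'
```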
Their engineering blog explicitly states:
"We no longer have a set of large Memcached instances which require failover replicas. Instead we use embedded RocksDB databases which require zero coordination."
Benefits of the switch:
- No network hop: Memcached ~1ms, RocksDB local ~0.01ms
- No coordination: No distributed cache invalidation protocols
- No failover replicas: RocksDB is rebuilt from Kafka, not replicated
- Simpler ops: No Memcached cluster to manage
Trade-off: Rebuild time on cold start (minutes). Acceptable because failures are rare and Kafka buffers during rebuild.
Centrifuge: Database-as-Queue Advanced
The core problem: 88,000 logical queues.
Segment has 42K sources sending to an average of 2.1 destinations = 88K source-destination pairs. Each needs:
- Fair scheduling (big customer shouldn't block small customer)
- Isolated failure handling (Google Analytics down shouldn't block Salesforce)
- Per-pair retry policies
Why queues fail:
| Approach | Problem |
|---|---|
| Single queue | One slow endpoint backs up everything |
| Per-destination queue | Big customers dominate, small customers starve |
| Per-pair queue (88K) | Kafka/RabbitMQ can't handle this cardinality |
Database-as-queue insight: MySQL gives SQL flexibility for dynamic QoS. "Change QoS by running a single SQL statement." Need to deprioritize a failing destination? One UPDATE. Queues don't give you that.
Database-as-queue is an anti-pattern when implemented with traditional UPDATE-based designs. Centrifuge works because:
- Immutable rows: No UPDATEs, only INSERTs. No write contention, no index fragmentation.
- KSUID primary key: Time-sorted, so inserts are sequential (append-only).
- TABLE DROP instead of DELETE: Cycle databases every 30 min, drop tables instead of random deletes.
- In-memory caching: Directors cache active jobs, minimizing queries.
They took the database-as-queue pattern and fixed all the reasons it's usually slow.
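The KSUID point can be illustrated with a toy sortable ID: a timestamp prefix plus random payload, so lexicographic order tracks creation time and B-tree inserts stay append-only. (Real KSUIDs are 27-character base62 strings with a 160-bit layout; `toyKsuid` is a simplified stand-in.)

```javascript
// Sketch of why a KSUID-style primary key makes inserts sequential:
// the ID starts with a sortable timestamp, so newer rows always sort
// after older ones and land at the end of the index.
function toyKsuid(timestampMs, rand = Math.random) {
  const ts = timestampMs.toString(36).padStart(9, '0'); // sortable time prefix
  const tail = Math.floor(rand() * 1e12).toString(36).padStart(8, '0'); // randomness
  return ts + tail;
}

const a = toyKsuid(1700000000000);
const b = toyKsuid(1700000001000); // created 1s later
console.log(a < b); // true — later jobs sort after earlier ones
```

A random UUIDv4 key, by contrast, scatters inserts across the whole index, which is one of the usual reasons database-as-queue is slow.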
Schema design:
-- jobs: immutable after insert
CREATE TABLE jobs (
ksuid CHAR(27) PRIMARY KEY,
payload BLOB,
endpoint VARCHAR(255),
expires_at TIMESTAMP
);
-- job_state_transitions: append-only log
CREATE TABLE job_state_transitions (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  job_ksuid CHAR(27),
  to_state ENUM('awaiting_scheduling','executing','succeeded','discarded','awaiting_retry'),
  transitioned_at TIMESTAMP
);
-- Current state = most recent transition
SELECT j.* FROM jobs j
JOIN (SELECT job_ksuid, MAX(id) AS latest
      FROM job_state_transitions GROUP BY job_ksuid) t
  ON j.ksuid = t.job_ksuid
JOIN job_state_transitions s
  ON s.job_ksuid = t.job_ksuid AND s.id = t.latest
WHERE s.to_state = 'awaiting_scheduling';
Benefits:
- No write contention (no row locks on UPDATE)
- No index fragmentation (KSUID is time-ordered)
- Audit trail for free (every state change recorded)
- Simple debugging (replay state history)
Traditional approach:
DELETE FROM jobs WHERE status = 'completed' AND created_at < NOW() - INTERVAL 1 HOUR;
-- Problems:
-- 1. Locks rows during deletion
-- 2. Creates index fragmentation
-- 3. Triggers expensive vacuuming/compaction
-- 4. Slow for millions of rows
Centrifuge approach:
1. Create new JobDB every ~30 minutes
2. Route new jobs to newest database
3. Old databases drain retry queues (4-hour window)
4. When database empty:
DROP TABLE jobs;
DROP TABLE job_state_transitions;
-- Instant, regardless of table size
DROP TABLE is O(1) - it just removes metadata. DELETE is O(n) with all the cleanup overhead. At 400K writes/second, this difference is massive.
- Acquire lock: Consul session locks a specific JobDB (one Director per database)
- Load jobs: SELECT awaiting_scheduling jobs into memory cache
- Accept RPCs: Upstream Kafka consumers call Director to log new jobs (INSERT)
- Execute HTTP: Pop from memory cache, make HTTP call to external API
- Handle response:
- 200 OK: INSERT state → succeeded
- 5xx/timeout: INSERT state → awaiting_retry
- 4xx: INSERT state → discarded
- Retry loop: Background thread scans awaiting_retry with exponential backoff
- Archive: After 4 hours, write to S3, INSERT state → archived
Scale: 80-300 Directors at peak, scaled by CPU utilization. Each Director owns one JobDB.
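The response handling and retry loop above can be sketched in a few lines. The base delay, cap, and treating timeouts as retryable are assumptions for illustration, not Segment's published values; only the 4-hour window comes from the text.

```javascript
// Sketch of the Director's response handling: 5xx/timeouts retry with
// exponential backoff inside a 4-hour window, 4xx are discarded.
const FOUR_HOURS_MS = 4 * 60 * 60 * 1000; // retry window before S3 archival

function nextAttemptDelayMs(attempt, baseMs = 1000, capMs = 5 * 60 * 1000) {
  // Exponential backoff with a cap (jitter omitted for brevity)
  return Math.min(capMs, baseMs * 2 ** attempt);
}

function classify(status) {
  if (status === 200) return 'succeeded';
  if (status >= 500 || status === 0) return 'awaiting_retry'; // 0 = timeout
  return 'discarded'; // 4xx: the request itself is bad, retrying won't help
}

console.log(classify(200));         // 'succeeded'
console.log(classify(503));         // 'awaiting_retry'
console.log(classify(404));         // 'discarded'
console.log(nextAttemptDelayMs(3)); // 8000
```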
Cell-Based Architecture for Twilio Advanced
Core principle: Product-agnostic cells, operationally-differentiated tiers.
Each cell runs ALL Twilio services—SMS, Voice, Video, WhatsApp, Email (SendGrid), Verify, Segment CDP. The differentiation is HOW cells are operated (Enterprise vs SMB), not WHAT they run.
┌─────────────────────────────────────────────────────────────────────────────┐
│ TWILIO CELL-BASED ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ LANDING ZONE (Management Plane) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ AWS Control Tower │ IAM Identity Center │ CloudTrail │ SCPs │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ GLOBAL/PLATFORM SERVICES (Tier 0 - Multi-Region) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ Identity (Stytch) │ API Key Mgmt │ Billing │ Cell Router │ TrustHub │ │
│ │ [DynamoDB Global Tables - Multi-Master Replication] │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ CELL ROUTER (Data Plane - Lambda @ Edge) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Check Redis cache (95% hit, ~5ms) │ │
│ │ 2. Query DynamoDB on miss (~15ms) │ │
│ │ 3. Assign new customers to least-loaded cell (~20ms) │ │
│ │ 4. Set X-Twilio-Cell-ID header → VPC Lattice routes to cell │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────┼────────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ ENTERPRISE CELL │ │ MID-MARKET CELL │ │ SMB CELL │ │
│ │ (AWS Account) │ │ (AWS Account) │ │ (AWS Account) │ │
│ │ │ │ │ │ │ │
│ │ VPC 10.0.0.0/16 │ │ VPC 10.0.0.0/16 │ │ VPC 10.0.0.0/16 │ │
│ │ (overlapping OK │ │ (overlapping OK │ │ (overlapping OK │ │
│ │ via VPC Lattice)│ │ via VPC Lattice)│ │ via VPC Lattice)│ │
│ │ │ │ │ │ │ │
│ │ ALL SERVICES: │ │ ALL SERVICES: │ │ ALL SERVICES: │ │
│ │ • SMS Gateway │ │ • SMS Gateway │ │ • SMS Gateway │ │
│ │ • Voice (WebRTC)│ │ • Voice (WebRTC)│ │ • Voice (WebRTC)│ │
│ │ • Video SFU │ │ • Video SFU │ │ • Video SFU │ │
│ │ • WhatsApp │ │ • WhatsApp │ │ • WhatsApp │ │
│ │ • SendGrid │ │ • SendGrid │ │ • SendGrid │ │
│ │ • Verify │ │ • Verify │ │ • Verify │ │
│ │ • Segment CDP │ │ • Segment CDP │ │ • Segment CDP │ │
│ │ (Kafka+RocksDB│ │ (Kafka+RocksDB│ │ (Kafka+RocksDB│ │
│ │ +Centrifuge) │ │ +Centrifuge) │ │ +Centrifuge) │ │
│ │ │ │ │ │ │ │
│ │ 10-50 customers │ │ 100-500 customers│ │ 1000+ customers │ │
│ │ 99.99% SLA │ │ 99.95% SLA │ │ 99.9% SLA │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Three critical reasons:
- No cross-cell API calls: Twilio customers use multiple products together. If Voice is in Cell A and SMS is in Cell B, a ConversationRelay call that sends an SMS creates a cross-cell dependency. Failures cascade.
- Simpler customer migration: Upgrading from SMB to Enterprise is just updating the routing table. No data migration across product-specific cells.
- Unified observability: One customer's full journey (Segment event → Journey trigger → SMS send → Voice callback) is visible in one cell's logs.
Each cell has its own complete Segment CDP stack:
WITHIN EACH CELL:
┌─────────────────────────────────────────────────────────────────┐
│ CELL: ENTERPRISE-US-EAST │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SEGMENT CDP (per-cell instance) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Tracking API (Go) ──► NSQ buffer ──► Kafka (cell-local) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Dedup Workers ◄── RocksDB (EBS-backed, per-partition) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Centrifuge (MySQL/RDS per-cell) ──► External destinations │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ MESSAGING SERVICES (per-cell instance) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ SMS Gateway ──► Account Queue ──► Super Network │ │
│ │ Voice Gateway ──► Media Server ──► PSTN/SIP │ │
│ │ WhatsApp Business API ──► Meta Cloud API │ │
│ │ SendGrid MTA ──► ISP connections │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ SHARED WITHIN CELL │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ EKS Cluster (100-200 nodes) │ Aurora PostgreSQL (primary) │ │
│ │ ElastiCache Redis │ MSK (Managed Kafka) │ S3 buckets │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
KEY INSIGHT: Kafka is CELL-LOCAL, not global.
- Each cell has its own MSK cluster
- No cross-cell Kafka consumers
- Blast radius = one cell
Why cell-local Kafka?
- Global Kafka = single point of failure for entire platform
- Cell-local Kafka = cell failure affects only that cell's customers
- Scale independently: Enterprise cells get bigger Kafka clusters
Customer profiles are cell-local by design. A customer belongs to ONE cell. All their data (events, profiles, traits) lives in that cell. If they need to be migrated (SMB → Enterprise), the Control Plane orchestrates a full data migration with traffic shifting.
What IS global:
- Customer → Cell routing (DynamoDB Global Tables)
- API Key validation (Global identity service)
- Billing aggregation (Global, but not in request path)
- Compliance registry (TrustHub, global read-mostly)
The Super Network is a special case: it's shared, not cell-local.
┌─────────────────────────────────────────────────────────────────┐
│ SUPER NETWORK ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CELLS (Data Plane) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Enterprise│ │Mid-Market│ │ SMB │ │
│ │ Cell │ │ Cell │ │ Cell │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ └─────────────┼─────────────┘ │
│ ▼ │
│ SUPER NETWORK (Shared Infrastructure) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Traffic Optimization Engine │ │
│ │ • 900+ data points per message │ │
│ │ • 3.2B data points monitored daily │ │
│ │ • 4x redundant routes per destination │ │
│ │ │ │
│ │ SMPP Gateways (Regional) │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ US-EAST │ │ US-WEST │ │ EU │ │ APAC │ │ │
│ │ │ Gateway │ │ Gateway │ │ Gateway │ │ Gateway │ │ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │
│ │ │ │ │ │ │ │
│ │ └────────────┴────────────┴────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ AT&T │ │ Verizon │ │T-Mobile │ │Vodafone │ ...4800 │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
WHY SHARED?
• Carrier connections are expensive to establish (contracts, compliance)
• 4,800 connections × N cells = unsustainable
• Traffic Optimization Engine needs global visibility for routing decisions
• Carriers don't care about your internal cell boundaries
Isolation strategy: Cells send to Super Network via internal API. Super Network is stateless (no customer data). If Super Network degrades, all cells degrade (acceptable - it's the carrier connection, not Twilio's logic).
REQUEST FLOW: Customer sends SMS via API
1. REQUEST ARRIVES AT EDGE
POST https://api.twilio.com/2010-04-01/Accounts/AC123/Messages.json
Authorization: Basic {API_KEY}
2. GLOBAL IDENTITY CHECK (Platform Services)
→ Validate API key (DynamoDB Global Tables)
→ Extract account_id: AC123
→ Check TrustHub compliance status
3. CELL ROUTER (Lambda @ Edge)
async function routeToCell(accountId, region, segment) {
  // Step 1: Check Redis cache (95% hit rate)
  let cellId = await redis.get(`cell:${accountId}:${region}`);
  if (cellId) return { cellId, latency: '~5ms' };

  // Step 2: Query DynamoDB
  const record = await dynamodb.get({
    TableName: 'customer_cell_mapping',
    Key: { account_id: accountId, region: region }
  });
  if (record.Item) {
    await redis.setEx(`cell:${accountId}:${region}`, 3600, record.Item.cell_id);
    return { cellId: record.Item.cell_id, latency: '~15ms' };
  }

  // Step 3: New customer - assign to least-loaded cell in their segment
  const cells = await getCellMetrics(region, segment); // CPU, customer count
  const bestCell = cells.sort((a, b) => a.loadScore - b.loadScore)[0];
  await dynamodb.put({
    TableName: 'customer_cell_mapping',
    Item: { account_id: accountId, region, cell_id: bestCell.id },
    ConditionExpression: 'attribute_not_exists(account_id)' // Prevent race
  });
  return { cellId: bestCell.id, latency: '~20ms' };
}
4. ROUTE TO CELL VIA VPC LATTICE
→ Set header: X-Twilio-Cell-ID: enterprise-us-east-1
→ VPC Lattice routes by service name (not IP address)
→ Request arrives at cell's SMS Gateway
5. CELL PROCESSES REQUEST
→ SMS Gateway validates, queues, sends to Super Network
→ All processing happens within cell boundary
→ Response returns via same path
Latency optimization:
- DynamoDB: ~5-15ms (network to AWS region)
- Redis (ElastiCache): ~1-5ms (same VPC)
- At 400K requests/second, saving 10ms per request is significant
Redis is a cache, DynamoDB is source of truth:
- Redis TTL = 1 hour (customer rarely changes cells)
- Cache invalidation: Control Plane publishes to SNS on migration
- 95% cache hit rate = only 5% pay DynamoDB latency
Control Plane orchestrates multi-phase migration:
MIGRATION PHASES:
Phase 1: PREPARE (hours before)
├── Snapshot source cell data for customer
├── Begin async replication to target cell
└── Verify target cell has capacity
Phase 2: DUAL-WRITE (brief window)
├── Update routing: both cells receive writes
├── Source cell forwards to target cell
└── Ensures no data loss during cutover
Phase 3: CUTOVER (seconds)
├── Update DynamoDB: customer → new cell
├── Invalidate Redis cache (SNS broadcast)
├── New requests go to target cell immediately
└── Source cell rejects with redirect
Phase 4: DRAIN (minutes to hours)
├── Source cell processes in-flight requests
├── Centrifuge drains retry queues
├── Kafka consumers process remaining messages
└── Verify all data migrated
Phase 5: CLEANUP
├── Delete customer data from source cell
├── Update billing/metering
└── Notify customer (if requested)
TRAFFIC SHIFTING (optional for large customers):
Instead of instant cutover, shift gradually:
- 10% → target cell (monitor errors)
- 50% → target cell (monitor latency)
- 100% → target cell (complete migration)
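One stateless way to implement the gradual shift is hashing the request ID against the target percentage, so the split holds steady without per-request coordination. This is a sketch under that assumption; `pickCell` and the djb2-style hash are invented for illustration.

```javascript
// Sketch of gradual traffic shifting: a deterministic hash of the
// request ID routes a stable percentage of traffic to the target cell.
function hash(s) {
  let h = 5381; // djb2-style string hash
  for (const c of s) h = ((h * 33) ^ c.charCodeAt(0)) >>> 0;
  return h;
}
function pickCell(requestId, targetPercent, sourceCell, targetCell) {
  return hash(requestId) % 100 < targetPercent ? targetCell : sourceCell;
}

// At 0% everything stays on the source; at 100% everything moves.
console.log(pickCell('req-1', 0, 'smb-1', 'ent-1'));   // 'smb-1'
console.log(pickCell('req-1', 100, 'smb-1', 'ent-1')); // 'ent-1'
```

Raising `targetPercent` from 10 to 50 to 100 reproduces the rollout stages above while keeping any individual request's routing deterministic.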
| Benefit | Cost |
|---|---|
| Blast radius reduction: Cell failure = 10-1000 customers, not millions | Operational overhead: N cells to deploy, monitor, patch |
| Noisy neighbor isolation: SMB traffic spike doesn't affect Enterprise SLAs | Resource duplication: Each cell runs full stack (Kafka, RDS, EKS) |
| Independent scaling: Scale Enterprise cells without touching SMB | Cross-cell complexity: Migrations, global services, routing layer |
| Compliance isolation: HIPAA customers in dedicated cells with audit controls | Cost inefficiency: Under-utilized cells still have base costs |
| Simpler capacity planning: Plan per cell, not globally | Engineering complexity: Team must understand cell boundaries |
Net assessment: For Twilio's scale (10M+ developers, enterprise SLAs, global presence), the blast radius reduction and noisy neighbor isolation justify the operational overhead. The alternative—one global deployment—means any failure affects everyone.
Cell capacity thresholds (Control Plane monitors these):
| Metric | Enterprise Cell | SMB Cell | Action |
|---|---|---|---|
| Customer count | 50 | 2,000 | Create new cell when exceeded |
| CPU utilization | 60% | 70% | Migrate customers or scale cell |
| Kafka lag | 1 minute | 5 minutes | Scale consumers or create cell |
| RDS connections | 70% | 80% | Scale RDS or migrate customers |
| API latency P99 | 200ms | 500ms | Investigate, possibly migrate |
New cell provisioning (Control Plane automation):
1. Trigger: Cell at 70% capacity threshold
2. Control Plane initiates AWS Account Factory
3. Terraform provisions: VPC, EKS, RDS, MSK, etc. (~30 min)
4. Deploy services via ArgoCD (~10 min)
5. Health checks pass
6. Update hash ring: new cell joins rotation
7. New customers assigned to new cell
8. Existing customers stay put (no automatic migration)
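The "hash ring" in step 6 presumably refers to consistent hashing for new-customer assignment; a minimal sketch (with invented names `CellRing`/`cellFor`) shows why a new cell can join rotation without remapping existing assignments wholesale.

```javascript
// Sketch of a consistent-hash ring for cell assignment: cells sit at
// hashed points on a ring; an account maps to the first cell clockwise
// from its own hash. Adding a cell only remaps the slice of accounts
// between it and its predecessor.
function hash(s) {
  let h = 5381; // djb2-style string hash
  for (const c of s) h = ((h * 33) ^ c.charCodeAt(0)) >>> 0;
  return h;
}
class CellRing {
  constructor() { this.points = []; } // [{ point, cellId }] sorted by point
  addCell(cellId) {
    this.points.push({ point: hash(cellId), cellId });
    this.points.sort((a, b) => a.point - b.point);
  }
  cellFor(accountId) {
    const h = hash(accountId);
    const hit = this.points.find((p) => p.point >= h);
    return (hit || this.points[0]).cellId; // wrap around the ring
  }
}

const ring = new CellRing();
ring.addCell('smb-us-east-1');
ring.addCell('smb-us-east-2');
ring.addCell('smb-us-east-3'); // new cell joins rotation; most accounts keep their cell
```

In practice the assignment is recorded in the DynamoDB mapping table, so existing customers stay put (step 8) even after the ring changes.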
Interview Talking Point: The 2-Minute Cell Architecture Answer
"I'd design Twilio's platform with product-agnostic, operationally-differentiated cells. Each cell runs ALL services—SMS, Voice, Video, Segment, Verify, SendGrid—in a single AWS account with its own VPC. The key insight is aligning customers to cells, not products, which eliminates cross-cell API calls and cascading failures.
The differentiation is operational tier: Enterprise cells have 10-50 customers with N+2 redundancy and 99.99% SLAs. SMB cells pack 1000+ customers with cost-optimized infrastructure and 99.9% SLAs. Same services, different operational posture.
For routing, I'd use a Cell Router at the edge—Lambda checking Redis cache first (95% hit rate), falling back to DynamoDB Global Tables. New customers get assigned to the least-loaded cell in their segment. VPC Lattice routes requests by service name, not IP address, which lets every cell use the same 10.0.0.0/16 CIDR without conflicts.
The trade-off is operational overhead—N cells to deploy and monitor—but the blast radius reduction is worth it. A cell failure affects 50 customers, not millions. And noisy neighbor isolation means an SMB traffic spike can't impact Enterprise SLAs."
Trade-off Questions Advanced
| Aspect | Embedded RocksDB | Centralized Redis |
|---|---|---|
| Latency | ~0.01ms (local) | ~1ms (network) |
| Operational complexity | None (embedded) | Cluster management, failover |
| Data locality | Must partition by messageId | Any partitioning works |
| Rebuild time | Minutes (read Kafka) | Gradual (cache warming) |
| Memory efficiency | Disk-backed (1.5TB ok) | RAM-backed (2.4TB expensive) |
| Cross-partition queries | Impossible | Easy |
When to choose RocksDB: High throughput, predictable access patterns, data fits single partition, can tolerate rebuild time.
When to choose Redis: Cross-partition queries needed, unpredictable access patterns, millisecond latency acceptable, data fits in RAM budget.
| Technology | Use When | Example in Segment |
|---|---|---|
| Kafka | Durable log, replay needed, ordering within partition | Event backbone, source of truth |
| RocksDB | Local state, high-throughput lookups, derived from log | Deduplication index (60B keys) |
| MySQL (Centrifuge) | Complex querying, dynamic QoS, high cardinality queues | Delivery to 700+ external APIs |
Design principle: Use the simplest tool that meets requirements. Kafka when you need a log. RocksDB when you need local state. SQL when you need query flexibility.
| Failure | Impact | Mitigation |
|---|---|---|
| Kafka broker failure | Writes blocked | Multi-cluster failover, secondary cluster in different AZ |
| RocksDB corruption | Dedup state lost | Rebuild from Kafka OUTPUT topic |
| Director crash | Jobs orphaned | Consul TTL releases lock, new Director picks up |
| External API outage | Delivery blocked | 4-hour retry queue, S3 archival |
| AZ failure | Partial outage | TAPI shards + Kafka clusters span AZs |
Key insight: Every component is either stateless (can restart) or has state derived from a durable log (can rebuild). This is the core of event sourcing reliability.
System Design Questions Advanced
This is essentially Segment's architecture. Walk through it:
1. INGESTION LAYER
- Sharded API servers (TAPI) behind load balancer
- Local NSQ buffer per server for backpressure
- Write to Kafka with idempotent producer
2. KAFKA BACKBONE
- Multi-AZ deployment with cluster failover
- Partition by messageId (for dedup locality)
- Retain 7 days for replay
3. DEDUPLICATION WORKERS
- Consume from Kafka INPUT
- Embedded RocksDB for 60B key lookups
- Bloom filters for fast "not seen" path
- Publish to Kafka OUTPUT (this is the commit point)
4. TRANSFORMATION WORKERS
- JSON parsing, schema validation
- Custom transforms per destination
- Partition by destination for locality
5. DELIVERY LAYER (Centrifuge-style)
- MySQL-based job queue (not traditional queue)
- Immutable rows, state transition table
- Directors with Consul locks
- 4-hour retry with exponential backoff
- S3 archival for undelivered
6. OBSERVABILITY
- Per-stage latency histograms
- Dedup rate monitoring (expect ~0.6%)
- Per-destination success rate
- Alerting on retry queue growth
Scale each layer independently:
- Ingestion: Add TAPI shards. Horizontal scaling, just DNS/LB changes.
- Kafka: Add partitions (careful: can't reduce later). Consumer groups auto-rebalance.
- Dedup workers: Scale with partitions. Each worker owns a partition subset.
- RocksDB: More partitions = smaller RocksDB per worker. Or upgrade to larger EBS volumes.
- Centrifuge: More Directors + JobDBs. Scale by CPU utilization.
Key insight: The architecture was designed for horizontal scaling. No single component is a bottleneck because state is partitioned.
Segment's architecture is 2016-era. Modern alternatives:
- Kafka Streams/Flink: Instead of custom dedup workers, use streaming frameworks with built-in state management.
- FoundationDB/TiKV: Instead of per-partition RocksDB, use distributed KV store with transactions.
- Temporal/Cadence: Instead of custom Centrifuge, use workflow orchestration for reliable delivery.
- Pulsar: Instead of Kafka + failover logic, use built-in geo-replication.
But: Segment's custom approach gives fine-grained control. The TABLE DROP trick in Centrifuge is hard to replicate in off-the-shelf systems. Sometimes custom wins.
Technical Leadership Questions Intermediate
Analogy: "Imagine a hotel front desk checking if a guest has already checked in. They could call every room (slow), keep a list in their head (impossible for 60 billion guests), or have a quick lookup book right at the desk. RocksDB is that lookup book - it's right next to the worker, instant to check, and can handle billions of entries."
Business impact: "This lets us process 400,000 events per second without duplicates. Duplicates mean wrong analytics, double-charged customers, and broken integrations. RocksDB prevents that at scale."
Build custom when:
- Off-the-shelf doesn't meet scale requirements (88K queues)
- Critical differentiator (reliability IS the product)
- Control needed for specific optimizations (TABLE DROP trick)
- Team has expertise to build and maintain
Use off-the-shelf when:
- Not a differentiator (auth, payments, logging)
- Maintenance burden exceeds value
- Faster time to market needed
Segment's choice: Custom Centrifuge because queue cardinality was a hard requirement. Used AWS RDS (off-the-shelf) for the database underneath.
- Documentation: Architecture decision records (ADRs) explaining WHY, not just WHAT.
- Observability: Every component emits metrics. Dashboards show system health at a glance.
- Runbooks: Step-by-step guides for common failures. "RocksDB rebuild takes 15 minutes, here's how to monitor."
- Testing: Chaos engineering (kill Directors, corrupt RocksDB) to verify recovery works.
- Simplicity: Each component has ONE job. Dedup workers dedupe. Directors deliver. No god services.
Behavioral & Leadership Questions DA Focus
Influencing Without Authority
What they're assessing: Can you influence at scale without positional authority?
Structure your answer:
- Context: What was the architectural problem? Why did it matter to the business?
- Stakeholder landscape: Who needed to be convinced? What were their concerns?
- Your approach: How did you build alignment? (Data, prototypes, 1:1s, written proposals)
- Resistance: What pushback did you face? How did you address it?
- Outcome: What changed? What was the business impact?
Example themes:
- Migrating from monolith to microservices (or vice versa)
- Adopting cell-based architecture across product lines
- Standardizing on a new database or messaging technology
- Implementing SLO/SLI culture across engineering
Good answer elements:
- Seek to understand their perspective first (maybe they see something you don't)
- Find common ground on the problem definition
- Use data and prototypes, not authority
- Disagree and commit if consensus isn't possible
- Preserve the relationship—you'll work together again
What they're assessing: Can you hold technical standards while maintaining business partnerships?
Key elements:
- Understand the "why": What business outcome were they trying to achieve?
- Quantify the risk: "This would increase P0 incidents by 3x" not "this is bad"
- Offer alternatives: Never just say no—propose a path that meets the need safely
- Escalate appropriately: If you can't align, bring in leadership with clear trade-offs
Example: "Product wanted to ship a feature that bypassed our rate limiting. I explained that our rate limits protect carrier relationships—if we get flagged for spam, we lose sender reputation for ALL customers. Instead, I proposed a tiered approach where verified enterprise customers get higher limits through a separate queue."
Technical Decision Making
What they're assessing: Self-awareness, learning orientation, intellectual honesty.
Structure:
- The decision: What did you decide and why did it seem right at the time?
- The signals: How did you realize it was wrong? (metrics, team feedback, incidents)
- Your response: Did you course-correct quickly or defend the decision?
- The fix: What did you do to remediate?
- The learning: What would you do differently? How did you prevent similar mistakes?
Good example themes:
- Over-engineering (built for scale that never came)
- Under-engineering (cut corners that caused incidents)
- Technology choice that didn't fit the team's skills
- Premature optimization vs. premature abstraction
DA-level answer:
- Identify what information would change the decision—focus on gathering that
- Make reversible decisions quickly, irreversible decisions carefully
- Set decision criteria upfront: "If X happens, we'll do Y"
- Time-box analysis: "We'll decide by Friday with whatever we know"
- Document assumptions explicitly so you can revisit them
What they're assessing: Strategic thinking, ability to communicate technical debt.
Framework:
SHORT-TERM vs LONG-TERM DECISION FRAMEWORK
1. CLASSIFY THE DEBT
- Deliberate & Prudent: "We know this is a shortcut, ship now, fix in Q2"
- Deliberate & Reckless: "We don't have time for design" (avoid this)
- Inadvertent: "We didn't know better" (learning opportunity)
2. QUANTIFY THE COST
- Developer productivity impact (hours/week)
- Incident frequency and MTTR
- Feature velocity slowdown
- Recruitment/retention impact
3. MAKE IT VISIBLE
- Tech debt backlog with business impact estimates
- "Debt service" time in sprint planning (20% rule)
- Architecture fitness functions that alert on degradation
4. NEGOTIATE EXPLICITLY
- "We can ship in 2 weeks with debt, or 4 weeks clean"
- "This debt will cost us 2 engineer-weeks per quarter until fixed"
- Get business sign-off on the trade-off
Key quote: "I never let debt accumulate silently. If we're taking a shortcut, I make sure leadership understands the interest payments we'll make later."
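The "architecture fitness functions" in step 3 can be as lightweight as an automated check that fails CI when a structural rule is violated. A minimal sketch, assuming a Python codebase; the module names and the forbidden-dependency rule are hypothetical placeholders:

```python
# Minimal architecture fitness function: fail the build if a module grows a
# forbidden dependency, so architectural degradation surfaces as an alert
# instead of accumulating silently.
import ast
from pathlib import Path

FORBIDDEN = {
    # (module directory, package it must never import) -- hypothetical rule
    ("billing", "internal_db"),  # billing must go through the API layer
}

def imported_packages(source: str) -> set[str]:
    """Top-level packages imported by a Python source file."""
    tree = ast.parse(source)
    packages = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            packages.add(node.module.split(".")[0])
    return packages

def check_fitness(repo_root: str) -> list[str]:
    """Return a list of violations; an empty list means the rule holds."""
    violations = []
    for module, banned in FORBIDDEN:
        for path in Path(repo_root, module).rglob("*.py"):
            if banned in imported_packages(path.read_text()):
                violations.append(f"{path}: imports forbidden package '{banned}'")
    return violations
```

Run in CI, a non-empty result blocks the merge; the same pattern extends to latency budgets, dependency counts, or any measurable architectural property.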
Organizational Impact
What they're assessing: Ability to scale your impact beyond your immediate team.
Examples of raising the bar:
- Standards & Guidelines: Created API design standards adopted across 12 teams
- Review Processes: Established architecture review board for cross-cutting changes
- Tooling: Built internal platforms that encoded best practices
- Education: Ran workshops, wrote internal docs, mentored senior engineers
- Hiring: Raised interview bar, calibrated technical assessments
- Culture: Modeled behaviors (blameless postmortems, design docs, ADRs)
Metrics to cite:
- Reduction in P0 incidents
- Improved deployment frequency
- Faster onboarding time for new engineers
- Increased reuse of shared components
What they're assessing: Cross-functional leadership, prioritization, stakeholder management.
Structure:
- The initiative: What were you trying to accomplish? Why did it need multiple teams?
- The conflicts: What were the competing priorities? Why couldn't teams just agree?
- Your role: How did you facilitate alignment? (workshops, proposals, escalation)
- The trade-offs: What did you give up to get alignment? What was non-negotiable?
- The outcome: Did the initiative succeed? What would you do differently?
Example: "We needed to migrate to cell-based architecture, but each product team had different timelines. I created a migration framework where teams could adopt at their own pace, but I held firm on the shared cell router and routing table schema. Teams got flexibility on timing; we got architectural consistency."
Mentorship & Team Development
What they're assessing: Do you multiply your impact through others?
Development approaches:
| From | To | Development Focus |
|---|---|---|
| Senior | Staff | Scope expansion: own a system, not just features |
| Staff | Principal | Influence expansion: drive org-wide initiatives |
| Principal | Distinguished | Industry impact: thought leadership, external influence |
Concrete tactics:
- Stretch assignments: Give them ownership of ambiguous, cross-team problems
- Visibility: Put them in front of leadership, let them present their work
- Feedback loops: Regular 1:1s focused on growth, not just status
- Sponsorship: Advocate for them in calibration and promotion discussions
- Modeling: Show them what good looks like (design docs, influence, decisions)
What they're assessing: Conflict resolution, maintaining psychological safety.
Framework:
- Understand both perspectives: 1:1 conversations before any group discussion
- Separate people from positions: Focus on the technical trade-offs, not personalities
- Find the shared goal: What do both parties actually want for the system/team?
- Make it safe to disagree: "We can disagree on approach and still respect each other"
- Decide and move on: If consensus isn't possible, make a call and commit together
Red flags to avoid:
- Taking sides publicly
- Letting conflict fester ("they'll work it out")
- Making it about who's "right"
- Punishing dissent
Crisis & Incident Leadership
What they're assessing: Composure under pressure, systematic thinking, communication.
Structure (Incident Timeline):
1. DETECTION
- How did you find out? (Alerts, customer report, intuition)
- How long from start to detection?
2. TRIAGE
- How did you assess severity and scope?
- Who did you pull in? How did you communicate?
3. MITIGATION
- What was the first thing you did to stop the bleeding?
- Trade-offs: speed vs. thoroughness
4. RESOLUTION
- Root cause identification
- The actual fix
5. COMMUNICATION
- Internal: leadership, other teams
- External: customers, status page
6. FOLLOW-UP
- Blameless postmortem
- Action items and ownership
- Systemic improvements
DA-level additions:
- How did you shield the team from exec pressure during the incident?
- What systemic changes did you drive afterward?
- How did you turn the incident into a learning opportunity?
What they're assessing: Can you shape culture, not just respond to incidents?
Cultural elements:
- SLOs as contracts: Teams own their SLOs and error budgets
- Blameless postmortems: Focus on systems, not individuals
- On-call as first-class work: Not a burden, a learning opportunity
- Chaos engineering: Proactively find weaknesses
- Celebrate reliability: Recognize teams that improve MTTR, reduce incidents
Concrete practices:
- Weekly reliability review with leadership
- Error budget policies that slow feature work when budgets are exhausted
- Runbook requirements before production deployment
- Game days and disaster recovery drills
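The error-budget arithmetic behind "slow feature work when budgets are exhausted" is simple enough to sketch. A 99.9% SLO over a window allows 0.1% of requests (or minutes) to fail; once actual failures consume that allowance, the policy triggers. The function names and the freeze policy below are illustrative, not any particular team's implementation:

```python
# Error budget math: an SLO of 0.999 over 1,000,000 requests allows
# 1,000 failures; the budget is the fraction of that allowance unspent.

def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative = exhausted).

    slo    -- target success ratio, e.g. 0.999
    total  -- total requests (or minutes) in the window
    failed -- failed requests (or bad minutes) in the window
    """
    budget = (1.0 - slo) * total  # allowed failures this window
    return (budget - failed) / budget if budget else 0.0

def feature_work_allowed(slo: float, total: int, failed: int) -> bool:
    """Example policy: freeze feature launches once the budget is spent."""
    return error_budget_remaining(slo, total, failed) > 0.0
```

For example, 400 failures against a 99.9% SLO over 1,000,000 requests leaves 60% of the budget, so feature work proceeds; 1,200 failures exhausts it and triggers the slowdown.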
Strategy & Vision
What they're assessing: Strategic thinking, communication at multiple levels.
Vision development process:
- Understand business direction: Where is the company going in 3-5 years?
- Assess current state: What are our technical strengths and weaknesses?
- Identify gaps: What technical capabilities do we need that we don't have?
- Prioritize ruthlessly: What 3 things matter most?
- Create milestones: What does year 1, year 2, year 3 look like?
Communication at different levels:
| Audience | Focus | Format |
|---|---|---|
| Executives | Business outcomes, investment required | 1-page strategy doc |
| Engineering leadership | Technical approach, team implications | Architecture proposal |
| Individual engineers | How their work connects to the vision | All-hands, team meetings |
What they're assessing: Pragmatism, business acumen, long-term thinking.
Decision framework:
BUILD when:
✓ Core differentiator (this IS your product)
✓ Unique requirements that off-the-shelf can't meet
✓ Team has expertise to build AND maintain
✓ Control is critical (security, compliance, performance)
BUY when:
✓ Commodity capability (auth, payments, logging)
✓ Time-to-market is critical
✓ Maintenance burden exceeds value
✓ Vendor has expertise you don't
EXAMPLES from Twilio:
- BUILD: Centrifuge (unique 88K queue cardinality requirement)
- BUY: AWS RDS for MySQL underneath Centrifuge
- BUILD: Custom Cell Router (core to architecture)
- BUY: Consul for distributed locking (commodity)
Key insight: "The question isn't 'can we build it?' It's 'should we spend our engineering calories here?' Every hour spent building commodity infrastructure is an hour not spent on customer value."
Interview Tips: Behavioral Questions
Common pitfalls:
- Being too technical: These questions assess leadership, not architecture. Lead with the people/org impact.
- Taking all the credit: Use "we" for team accomplishments, "I" only for your specific contributions.
- No concrete examples: Vague answers like "I always communicate well" don't demonstrate competence.
- Badmouthing previous employers: Even if the situation was bad, focus on what you learned.
- Not quantifying impact: "Improved reliability" → "Reduced P0 incidents from 4/month to 1/quarter"
Structure (STAR+R):
- Situation: Context (brief—10% of answer)
- Task: Your specific responsibility
- Action: What YOU did (detailed—50% of answer)
- Result: Outcome with metrics
- +Reflection: What you learned, what you'd do differently (DA differentiator)
Quick Reference: Key Numbers to Remember
Segment CDP / Data Infrastructure
| Metric | Value | Context |
|---|---|---|
| Events/second | 400,000 (sustained), 1M (peak) | Tracking API throughput |
| Duplicate rate | 0.6% | From mobile client retries |
| RocksDB keys | 60 billion | 4-week deduplication window |
| RocksDB size | 1.5 TB | Per partition |
| Centrifuge throughput | 400K req/sec (sustained), 2M (tested) | Outbound HTTP delivery |
| Source-destination pairs | 88,000 | 42K sources × 2.1 avg destinations |
| Directors | 80-300 | Scaled by CPU utilization |
| Retry window | 4 hours | With exponential backoff |
| JobDB rotation | ~30 minutes | At 100% fill |
| Destinations supported | 700+ | External integrations |
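The 4-week deduplication window in the table above works by keying on messageId. A minimal sketch of the idea, with a plain dict standing in for RocksDB (the production system persists ~60 billion keys on disk and partitions by messageId so each consumer checks only local state); class and method names are illustrative:

```python
import time

# Sketch of messageId-based deduplication with a sliding retention window.
# A dict stands in for RocksDB here; the real system ages out old keys on
# disk rather than holding them in memory.

WINDOW_SECONDS = 4 * 7 * 24 * 3600  # 4-week dedup window

class Deduplicator:
    def __init__(self, window: int = WINDOW_SECONDS):
        self.window = window
        self.seen: dict[str, float] = {}  # messageId -> first-seen timestamp

    def is_duplicate(self, message_id: str, now=None) -> bool:
        """Record message_id; return True if it was seen within the window."""
        now = time.time() if now is None else now
        first_seen = self.seen.get(message_id)
        if first_seen is not None and now - first_seen < self.window:
            return True
        self.seen[message_id] = now
        return False

    def compact(self, now=None) -> None:
        """Drop entries older than the window, bounding state size."""
        now = time.time() if now is None else now
        self.seen = {m: t for m, t in self.seen.items() if now - t < self.window}
```

Because partitioning is by messageId, a retried event lands on the same consumer that saw the original, so this check never needs a cross-partition lookup.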
Cell-Based Architecture
| Metric | Value | Context |
|---|---|---|
| Enterprise cell customers | 10-50 | Dedicated resources, 99.99% SLA |
| SMB cell customers | 1,000+ | Shared resources, 99.9% SLA |
| Cell Router cache hit rate | 95% | Redis ElastiCache |
| Cache lookup latency | ~5ms | Redis hit |
| DynamoDB lookup latency | ~15ms | Cache miss |
| New customer assignment | ~20ms | Least-loaded cell selection |
| VPC CIDR per cell | 10.0.0.0/16 | Overlapping OK via VPC Lattice |
| Cell provisioning time | ~40 minutes | Account Factory + Terraform + ArgoCD |
| EKS nodes per cell | 100-200 | Enterprise vs SMB sizing |
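The cache hit rate and lookup latencies in the table above imply a cache-aside pattern: check Redis first (~5ms, ~95% of lookups), fall through to DynamoDB on a miss (~15ms), assign a least-loaded cell for new customers, and backfill the cache. A minimal sketch with plain dicts standing in for Redis and DynamoDB; the class and its interfaces are illustrative, not the actual router code:

```python
# Cache-aside routing lookup: cache hit -> return; miss -> authoritative
# store; unknown customer -> assign a cell and persist; always backfill.

class CellRouter:
    def __init__(self, cache: dict, routing_table: dict, assign_cell):
        self.cache = cache                   # stand-in for Redis ElastiCache
        self.routing_table = routing_table   # stand-in for DynamoDB
        self.assign_cell = assign_cell       # least-loaded cell selection
        self.hits = 0
        self.misses = 0

    def route(self, customer_id: str) -> str:
        cell = self.cache.get(customer_id)
        if cell is not None:
            self.hits += 1                   # ~5ms path, ~95% of lookups
            return cell
        self.misses += 1                     # ~15ms path on cache miss
        cell = self.routing_table.get(customer_id)
        if cell is None:
            # New customer: pick the least-loaded cell and persist it.
            cell = self.assign_cell(customer_id)
            self.routing_table[customer_id] = cell
        self.cache[customer_id] = cell       # backfill so next lookup hits
        return cell
```

The backfill step is what sustains the high hit rate: every miss converts the next lookup for that customer into a cache hit.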
Super Network / Messaging
| Metric | Value | Context |
|---|---|---|
| Carrier connections | 4,800 | Global SMPP/SIP connections |
| Data points per message | 900+ | Routing decision inputs |
| Data points monitored daily | 3.2 billion | Network health monitoring |
| Redundant routes | 4x average | Per destination failover |
| GLL Edge locations | 9 | Voice/Video low-latency edge |
| Short code throughput | 100+ MPS | Messages per second |
| Toll-free throughput | 3 MPS (upgradeable) | Default rate |
| Queue timeout | 4 hours | Max queue time before error 30001 |