Twilio System Design Scenarios

Practice system design questions specifically tailored to Twilio's domain, each followed by a worked approach.

How to Approach System Design Interviews

Interview Framework (URDAD)

  1. Understand & Clarify Requirements
    • Ask about scale (requests/sec, data size)
    • Clarify functional requirements
    • Understand non-functional requirements (latency, availability, consistency)
    • Identify constraints (budget, technology, timeline)
  2. Requirements → API Design
    • Define APIs (REST, GraphQL, gRPC)
    • Show request/response formats
  3. Data Model
    • What data to store
    • Storage technology choices
    • Schema design
  4. Architecture & Components
    • High-level architecture diagram
    • Component responsibilities
    • Data flow
  5. Deep Dive
    • Focus on interesting/complex parts
    • Discuss trade-offs
    • Address failure scenarios

Scenario 1: SMS Delivery Pipeline

Medium
Design Twilio's SMS delivery system that can handle 1 million messages per second globally with 99.95% availability.
Functional Requirements
  • Accept SMS messages via REST API
  • Validate phone numbers and message content
  • Route messages to appropriate carriers
  • Track delivery status (queued, sent, delivered, failed)
  • Provide webhook callbacks for status updates
  • Support message retry on failures
Non-Functional Requirements
  • Scale: 1M messages/second peak, 500K sustained
  • Latency: < 500ms API response time p99
  • Availability: 99.95% (4.4 hours downtime/year)
  • Durability: Once accepted, message must not be lost
  • Consistency: Eventual consistency acceptable for status
High-Level Architecture
Customer ──POST──> API Gateway (Regional) ──> Kafka (Events)
        <─response─┘                              │
                                   ┌──────────────┴─────┐
                                   ▼                    ▼
                          DynamoDB (Metadata)     Workers (Fleet)
                                                        │
                                                        ▼
                                                Carriers (AT&T, etc.)

Component Design

1. API Gateway Layer

  • Technology: AWS API Gateway + Lambda or ECS/Fargate
  • Responsibilities:
    • Authentication (API key validation)
    • Rate limiting (per account)
    • Request validation
    • Idempotency key handling
    • Generate unique message SID
  • Scaling: Multi-region deployment, auto-scaling based on request rate
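The idempotency-key handling above can be sketched with an in-memory store standing in for DynamoDB (the `accept_message` name and the response shape are illustrative, not Twilio's actual API):

```python
import uuid

# In-memory stand-in for a persistent idempotency store (DynamoDB in the design).
_idempotency_store = {}

def accept_message(idempotency_key, to, body):
    """Return the original response for a repeated idempotency key,
    otherwise mint a new message SID and record the response."""
    if idempotency_key in _idempotency_store:
        return _idempotency_store[idempotency_key]  # replay: no double-send
    response = {"sid": "SM" + uuid.uuid4().hex, "to": to, "body": body, "status": "queued"}
    _idempotency_store[idempotency_key] = response
    return response
```

A client that retries a timed-out POST with the same key gets the original SID back instead of creating a second message.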

2. Message Queue (Kafka)

  • Why Kafka: High throughput, durability, ordering per partition
  • Partitioning: By message_sid to spread load evenly; Kafka only orders within a partition, so partition by account_id or destination number instead if ordering across messages matters
  • Topics:
    • messages.incoming - Newly accepted messages
    • messages.carrier-delivery - Ready to send to carrier
    • messages.status-updates - Delivery status changes
  • Retention: 7 days for replay capability
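Key-based partition assignment is what makes per-partition ordering hold: the same key always lands on the same partition. A sketch of the idea, using MD5 here rather than Kafka's actual murmur2 partitioner:

```python
import hashlib

NUM_PARTITIONS = 64  # illustrative partition count

def partition_for(message_sid, num_partitions=NUM_PARTITIONS):
    """Stable key -> partition mapping: the same SID always maps to the
    same partition, so all its events share one ordered log."""
    digest = hashlib.md5(message_sid.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```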

3. Worker Fleet

  • Technology: ECS/Kubernetes pods
  • Consumer groups: Parallel processing, rebalancing on failures
  • Responsibilities:
    • Consume from Kafka
    • Route to appropriate carrier based on destination
    • Handle carrier-specific protocols
    • Retry logic with exponential backoff
    • Publish status updates to Kafka
  • Circuit breakers: Per carrier to handle failures
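A per-carrier circuit breaker can be sketched as a small state machine (the threshold and reset window here are illustrative defaults):

```python
import time

class CircuitBreaker:
    """Minimal per-carrier breaker: open after `threshold` consecutive
    failures, allow a probe request after `reset_after` seconds."""

    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_after:
            return True  # half-open: let one probe through
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()
```

Workers hold one breaker per carrier, so a failing carrier is skipped without slowing delivery to the others.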

4. Storage (DynamoDB)

  • Table: Messages
    • Partition key: account_id
    • Sort key: message_sid
    • Attributes: to, from, body, status, timestamps
    • GSI on message_sid for lookups
  • Consistency: Eventually consistent reads for status checks
  • Auto-scaling: On-demand capacity mode

5. Webhook Delivery

  • Separate worker fleet consuming from messages.status-updates
  • POST to customer webhook URL
  • Retry with exponential backoff: 1min, 5min, 30min, 1hr, 6hr
  • Circuit breaker per webhook URL
  • Dead letter queue for failed webhooks
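The retry schedule above maps the attempt number to a delay; a minimal sketch:

```python
# Retry schedule from the design: 1 min, 5 min, 30 min, 1 hr, 6 hr, then DLQ.
RETRY_DELAYS_SECONDS = [60, 300, 1800, 3600, 21600]

def next_retry_delay(attempt):
    """Delay in seconds before the given retry attempt (1-based), or None
    once the schedule is exhausted and the webhook goes to the dead letter queue."""
    if 1 <= attempt <= len(RETRY_DELAYS_SECONDS):
        return RETRY_DELAYS_SECONDS[attempt - 1]
    return None  # exhausted -> dead letter queue
```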
Key Design Decisions
  • Kafka over SQS: higher throughput, ordering guarantees, and replay capability. Alternative considered: SQS is simpler but lacks ordering and replay.
  • DynamoDB over RDS: auto-scaling, low latency, handles a write-heavy workload. Alternative considered: RDS would need sharding at this scale.
  • Async processing: decouples the API response from carrier delivery for reliability. Alternative considered: synchronous delivery would time out frequently.
  • Regional deployment: low latency for customers in each region. Alternative considered: a single region means higher latency globally.
What to Emphasize in Interview
  • At-least-once delivery with idempotency: "We design for at-least-once semantics with idempotency keys rather than expensive exactly-once"
  • Fault isolation: "Circuit breakers per carrier prevent cascading failures"
  • Durability: "Once the API returns 201, the message is persisted in Kafka - it cannot be lost, and delivery is retried until it succeeds or is dead-lettered"
  • Observability: "Every state transition is an event - full audit trail"
  • Scalability: "Kafka partitions + worker auto-scaling handle traffic spikes"

Scenario 2: Multi-Region Active-Active Messaging

Hard
Design Twilio's messaging platform to run active-active in 3 regions (US, EU, APAC) where customers can send messages from any region and experience consistent behavior.
Functional Requirements
  • Accept messages in any region
  • Customers see consistent account state globally
  • Message history accessible from any region
  • Account balance/quota enforced globally
  • Survive full region failure
Constraints
  • Cross-region latency: 100-200ms
  • Each region should work independently during partition
  • No global locks or synchronous cross-region coordination
  • Data residency: EU customer data stays in EU
High-Level Architecture
               GLOBAL ROUTING LAYER
         (DNS-based or Anycast IP routing)
         │               │               │
         ▼               ▼               ▼
     US Region       EU Region      APAC Region
      API GW          API GW          API GW
        │               │               │
      Kafka ◄────────► Kafka ◄───────► Kafka
        │               │               │
      Workers         Workers         Workers
        │               │               │
     DynamoDB ◄─────► DynamoDB ◄────► DynamoDB
          (Global Tables, cross-region replication)

Design Approach

1. Data Classification

Different data has different consistency needs:

  • Message events: regional consistency, eventual globally. Storage: regional Kafka with async replication.
  • Account metadata: global eventual consistency. Storage: DynamoDB Global Tables.
  • Account balance/quota: strong consistency needed (avoid where possible). Storage: regional balances with periodic reconciliation.
  • Message history: regional, asynchronously replicated. Storage: regional DynamoDB with cross-region replication.

2. Routing Strategy

  • Home region per account: Each account has a primary region (set at signup or based on location)
  • Regional affinity: Route customer's API calls to their home region when possible
  • Graceful degradation: If home region is down, route to nearest healthy region
  • Data residency: EU accounts MUST have home region in EU (GDPR)

3. Handling Cross-Region Writes

When an EU customer's request lands in the US region:

  1. Option A (Proxy): US region proxies request to EU region, waits for response
    • Pro: Strong consistency
    • Con: Higher latency (200ms+ penalty)
  2. Option B (Local write + async reconcile): Accept in US, async replicate to EU
    • Pro: Low latency
    • Con: Potential conflicts, violates data residency
  3. Recommended: Option A for control plane, local writes for data plane

4. Quota/Balance Enforcement

Challenge: Can't do synchronous cross-region checks (too slow)

Solution: Regional quota allocation
  • Account has global limit: 10,000 messages/day
  • Allocate quota to each region: US=5000, EU=5000, APAC=0 (customer doesn't use APAC)
  • Each region enforces its local quota independently
  • Nightly reconciliation job redistributes unused quota
  • If region runs out, can request more from global coordinator (slower path)
Trade-off: Might reject messages even if global quota available (CAP theorem - choosing availability over perfect consistency)
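The nightly reconciliation step can be sketched as reallocating tomorrow's regional quotas in proportion to today's usage; the 1% per-region floor is an illustrative assumption, not part of the design above:

```python
def reconcile_quota(global_limit, usage):
    """Reallocate tomorrow's regional quotas in proportion to today's usage,
    guaranteeing every region a small floor so it can serve stray traffic."""
    floor = max(1, global_limit // 100)  # assumed 1% floor per region
    total_used = sum(usage.values())
    allocatable = global_limit - floor * len(usage)
    if total_used == 0:
        share = allocatable // len(usage)
        return {region: floor + share for region in usage}
    return {
        region: floor + (allocatable * used) // total_used
        for region, used in usage.items()
    }
```

With a 10K/day account that used US=4000, EU=1000, APAC=0, the heavy-use region gets most of the next day's budget while APAC keeps only the floor.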

5. Message History Queries

  • Messages primarily queried in the region they were sent
  • Each region maintains local message history
  • Async replication to other regions (DynamoDB Global Tables or custom)
  • If customer queries in non-home region: eventual consistency acceptable (might miss very recent messages)

6. Failure Scenarios

Scenario: EU region goes down
  • DNS/routing layer detects failure, stops routing to EU
  • EU customers routed to US region (nearest)
  • US region acts as backup for EU accounts
  • When EU recovers, catch up on replicated data, resume normal routing
Scenario: Network partition between regions
  • Each region continues operating independently (AP in CAP)
  • Quota enforcement becomes per-region (might over-deliver globally)
  • When partition heals, reconcile quota usage
  • If over-delivered, customer gets charged (better than losing messages)
Key Trade-offs Discussion

Why not single global database?

Would require synchronous cross-region writes → 200ms+ latency → unacceptable for messaging API.

Why not multi-master everywhere?

Conflict resolution is complex for messaging. "Last write wins" doesn't work for quota enforcement.

Chosen approach: Hybrid

  • Regional processing (fast, isolated)
  • Async replication (eventual global view)
  • Partitioned quota (local enforcement with global budget)
  • Graceful degradation (AP during partitions, reconcile later)
Interview Talking Points
  • "Active-active doesn't mean all data is global" - Partition data by access patterns
  • "Choose consistency model per data type" - Messages can be eventual, auth cannot
  • "Regional independence for resilience" - Each region should survive alone
  • "Cross-region latency is the enemy" - Design to minimize synchronous cross-region calls
  • "Reconcile, don't prevent" - Allow local decisions, fix conflicts async

Scenario 3: Cell-Based Architecture for Fault Isolation

Hard
Design a cell-based architecture for Twilio's messaging platform that provides fault isolation between customers while efficiently using resources.
Requirements
  • Enterprise customers should be isolated from each other
  • Small customers can share infrastructure
  • One customer's traffic spike shouldn't affect others
  • Cell failure should have bounded blast radius
  • Support gradual rollout of changes to minimize risk

This is YOUR wheelhouse! Align with your PayPal experience.

Cell Architecture
                    CONTROL PLANE (Global)
                    - Account → Cell mapping
                    - Cell health monitoring
                    - Routing configuration
          │              │              │              │
          ▼              ▼              ▼              ▼
     Cell-ACME      Cell-Nike     Cell-Shared-1  Cell-Shared-2
     (Dedicated)    (Dedicated)   (100 SMBs)     (100 SMBs)

Each cell runs its own full stack: API GW, Kafka, Workers, DynamoDB, VPC.
Blast radius: if Cell-ACME fails, only ACME is affected.

Design Principles

1. Cell Definition

A cell is:

  • Fully isolated infrastructure stack (compute, storage, network)
  • Separate VPC or namespace
  • Independent deployment unit
  • Handles subset of total traffic
  • Fails independently without cascading

2. Cell Sizing Strategy

  • Dedicated Large: 1 enterprise tenant, 100K msgs/sec. Use case: top 10 customers.
  • Dedicated Medium: 1 enterprise tenant, 10K msgs/sec. Use case: top 100 customers.
  • Shared Large: 100-500 SMB tenants, 50K msgs/sec. Use case: paid customers.
  • Shared Small: 1000+ free-tier tenants, 10K msgs/sec. Use case: free/trial users.

3. Routing Layer

  • Mapping Service: Maps account_id → cell_id
  • Storage: DynamoDB Global Table (highly available)
  • Caching: Cached in API gateway for low latency
  • Updates: Account growth triggers cell migration

4. Within Shared Cells: Additional Isolation

Even within shared cells, prevent noisy neighbors:

  • Rate limiting: Per account
  • Thread pool bulkheads: Per customer or per priority class
  • CPU/Memory limits: Kubernetes resource quotas per customer namespace
  • Connection pool limits: Per account to prevent DB exhaustion

5. Cell Migration

When customer outgrows shared cell:

  1. Provision new dedicated cell
  2. Dual-write messages to both old cell and new cell (shadowing)
  3. Validate new cell working correctly
  4. Update routing: new messages go to new cell only
  5. Backfill historical data async if needed
  6. Decommission old cell's resources for this customer

6. Operational Benefits

  • Gradual rollout: Deploy change to Cell-Canary (synthetic traffic), then 1 shared cell, then all
  • Feature flags per cell: Test new features on specific cells
  • Blast radius: Bug in new deployment only affects one cell
  • Capacity planning: Add cells when overall capacity reaches threshold
Key Trade-offs

Resource Efficiency vs Isolation

  • Shared cells: 70-80% resource utilization, but noisy neighbor risk
  • Dedicated cells: 40-60% utilization, but perfect isolation
  • Decision: Use dedicated for top customers (who pay for it), shared for long tail

Operational Complexity

  • Cost: More cells = more infrastructure to manage
  • Mitigation: Heavy automation, infrastructure-as-code, cell templates

Data Locality

  • Pro: Each cell has full data for its tenants - no cross-cell queries
  • Con: Global analytics requires aggregating across cells
  • Solution: Event streaming to central data warehouse
What to Emphasize (This is your expertise!)
  • "Fault isolation is non-negotiable" - At PayPal, we prioritize blast radius reduction over resource efficiency
  • "Cells are a socio-technical pattern" - Not just infrastructure, also ownership boundaries. Each cell can have a dedicated team.
  • "Start with fewer, larger cells" - Don't over-optimize for isolation on day 1. Add more cells as you scale.
  • "Automated cell provisioning" - If spinning up a new cell takes manual work, you won't do it. Make it push-button.
  • "Cells enable organizational scaling" - Conway's Law - cell architecture allows teams to own end-to-end infrastructure

Scenario 4: Identity & Authentication Service

Medium
Design Twilio's identity service that handles authentication for developers, service-to-service auth, and customer (end-user) verification.
Requirements
  • Developer authentication (console login, API keys)
  • Service-to-service authentication (inter-cell communication)
  • Account hierarchy (parent accounts, sub-accounts)
  • Integrate with verification (2FA, phone verification)
  • Support for SSO (SAML, OAuth) for enterprise
Constraints
  • 100K API auth checks per second
  • Sub-10ms latency for API key validation
  • 99.99% availability (auth failure = total outage)
  • Audit trail for compliance

Identity Architecture

                    IDENTITY & AUTH SERVICE

   Developer Auth       Service Auth          End-User Verify
   - Console            - mTLS                - 2FA
   - API Keys           - JWTs                - Phone Verify
   - SSO/SAML           - Service Accounts
         │                    │                     │
         └────────────────────┼─────────────────────┘
                              ▼
                          Auth Core
                          - Account Store
                          - Token Service
                          - Audit

1. Developer Authentication

API Key Validation (Hot Path)
  • Storage: DynamoDB
    • PK: api_key_sid
    • Attributes: account_id, permissions, created_at, last_used
  • Caching: Redis/ElastiCache
    • Cache API key → account mapping for 5 minutes
    • 99% cache hit rate → sub-1ms auth check
    • Cache miss → query DynamoDB (5-10ms)
  • Key rotation: Support multiple active keys, graceful deprecation
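A sketch of the cache-aside hot path, including forced invalidation on explicit revocation (in-memory dicts stand in for DynamoDB and Redis; the record shape is illustrative):

```python
import time

CACHE_TTL = 300  # 5 minutes, per the design
_key_table = {"SK123": {"account_id": "AC1", "permissions": ["sms:send"]}}  # DynamoDB stand-in
_key_cache = {}  # api_key_sid -> (record, cached_at); Redis stand-in

def validate_key(api_key_sid):
    """Serve from cache when fresh; on a miss, read the key table and cache it."""
    hit = _key_cache.get(api_key_sid)
    if hit and time.time() - hit[1] < CACHE_TTL:
        return hit[0]
    record = _key_table.get(api_key_sid)  # cache miss -> DynamoDB
    if record is not None:
        _key_cache[api_key_sid] = (record, time.time())
    return record

def revoke_key(api_key_sid):
    """Explicit revocation invalidates the cache immediately rather than
    waiting out the TTL."""
    _key_table.pop(api_key_sid, None)
    _key_cache.pop(api_key_sid, None)
```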
Console Login (OAuth2 / OIDC)
  • Use Auth0, Cognito, or custom OAuth2 provider
  • Support username/password + MFA
  • Issue JWT tokens for session management
  • Short-lived access tokens (15 min), long-lived refresh tokens
Enterprise SSO
  • SAML 2.0 integration for enterprise customers
  • Customer configures their IdP (Okta, Azure AD)
  • Twilio acts as service provider (SP)
  • JIT (Just-In-Time) provisioning of accounts

2. Service-to-Service Authentication

Why it matters

When API Gateway calls Billing Service, how do we authenticate?

Approach: Service Accounts + JWTs
  • Each service has a service account with credentials
  • Services request short-lived JWT tokens from auth service
  • JWT includes: service_id, permissions, expiry
  • Target service validates JWT (using public key or shared secret)
  • Token expires in 1 hour, auto-renewed
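A minimal HS256-style issue/validate sketch using only the standard library (the shared secret and claim names are illustrative; a production system would use a JWT library and proper key management):

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"shared-secret-for-illustration"  # assumed; real systems use a key service

def _b64(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(service_id, permissions, ttl_seconds=3600):
    """Sign {service_id, permissions, exp} as a compact header.payload.sig token."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({
        "service_id": service_id,
        "permissions": permissions,
        "exp": int(time.time()) + ttl_seconds,
    }).encode())
    sig = _b64(hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def validate_token(token):
    """Return the claims if the signature verifies and the token is unexpired."""
    header, payload, sig = token.split(".")
    expected = _b64(hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        return None
    return claims
```

The target service only needs the verification key, so validation stays local with no call back to the auth service.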
Alternative: mTLS (Mutual TLS)
  • Each service has X.509 certificate
  • TLS handshake validates both client and server
  • More secure but more operational overhead
  • Good for sensitive services (billing, identity)

3. Account Hierarchy

Twilio supports parent accounts with sub-accounts:

Parent Account (Enterprise Corp)
  ├── Sub-Account (APAC Division)
  └── Sub-Account (EU Division)

  • Each sub-account has its own API keys
  • Parent can view all sub-account usage
  • Billing rolls up to the parent
Data Model
  • Accounts table:
    • account_sid (PK), parent_account_sid, name, status
  • API Keys table:
    • api_key_sid (PK), account_sid, permissions_scope
  • Query pattern: Given API key, lookup account, check if account is active, check permissions

4. Audit Logging

  • Every auth event logged: API key used, token issued, login attempt
  • Log to Kafka → S3 for compliance
  • Elasticsearch for a searchable audit trail
  • Include: timestamp, account_id, action, ip_address, result

5. Rate Limiting & Abuse Prevention

  • Rate limit failed login attempts per IP: 10 per minute
  • Rate limit API key validation per key: 10K per second
  • CAPTCHA after 3 failed logins
  • Temporary account lockout after 10 failed attempts
Key Design Decisions

API Key Storage: DynamoDB vs RDS

  • Choice: DynamoDB
  • Reason: Simple key-value lookup, auto-scaling, low latency
  • Alternative: RDS would work but requires more capacity planning

Caching: How aggressive?

  • Choice: 5-minute cache TTL
  • Trade-off: API key revocation takes up to 5 min to propagate
  • Mitigation: Force cache invalidation on explicit revocation

Service Auth: JWT vs mTLS

  • Choice: JWT for most services, mTLS for sensitive
  • JWT easier operationally (no cert management)
  • mTLS for billing, identity (defense in depth)
Interview Talking Points
  • "Auth is critical path - optimize for p99 latency" - Every API call goes through auth
  • "Defense in depth" - Multiple layers: API key + rate limiting + network isolation
  • "Graceful degradation" - If Redis cache is down, fall back to DynamoDB (slower but works)
  • "Audit everything" - Auth events are critical for security and compliance

Scenario 5: Rate Limiting at Scale

Medium
Design a distributed rate limiting system that can enforce per-account quotas across multiple API gateway instances globally.
Requirements
  • Enforce rate limits per account (e.g., 1000 messages/minute)
  • Work across multiple API gateway instances (no single point)
  • Support different tiers (free, pro, enterprise)
  • Real-time enforcement (not best-effort)
  • Provide API clients with rate limit status in response headers
Constraints
  • 10K auth checks per second per gateway instance
  • Sub-5ms overhead for rate limit check
  • Minimize false positives (incorrectly blocking)

Try designing it yourself before reading the approach below!

Recommended Approach: Redis with Sliding Window

Architecture

   API GW 1        API GW 2        API GW 3
       │               │               │
       └───────────────┼───────────────┘
                       ▼
             Redis Cluster (Shared)

Each gateway instance checks Redis before allowing a request.

Redis Data Structure: Sorted Set

For each account, store timestamps of recent requests:

  • Key: rate_limit:{account_id}:{window}
  • Value: Sorted set with score = timestamp
  • Example: rate_limit:AC123:minute → {1700000001, 1700000002, ...}

Algorithm: Sliding Window Counter

In Python with the redis-py client (these steps are not atomic on their own; the Lua script below addresses that):

import time
import uuid

import redis

r = redis.Redis()

def check_rate_limit(account_id, limit, window_seconds):
    key = f"rate_limit:{account_id}:{window_seconds}"
    current_time = time.time()
    window_start = current_time - window_seconds

    # 1. Remove entries older than the window
    r.zremrangebyscore(key, "-inf", window_start)

    # 2. Count the remaining entries
    count = r.zcard(key)

    # 3. Check if under the limit
    if count < limit:
        # 4. Record this request; the member must be unique, otherwise two
        #    requests at the same timestamp collapse into one entry
        r.zadd(key, {f"{current_time}:{uuid.uuid4().hex}": current_time})

        # 5. TTL auto-expires idle keys
        r.expire(key, window_seconds)
        return "ALLOWED", limit - count - 1

    return "DENIED", 0

Making it Atomic: Lua Script

Run all Redis commands in a single Lua script for atomicity:

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current_time = tonumber(ARGV[3])
local member = ARGV[4]  -- unique per request so simultaneous requests don't collapse
local window_start = current_time - window

redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
local count = redis.call('ZCARD', key)

if count < limit then
    redis.call('ZADD', key, current_time, member)
    redis.call('EXPIRE', key, window)
    return {1, limit - count - 1}  -- allowed, remaining
else
    return {0, 0}  -- denied, remaining
end

Response Headers

Include in every API response:

  • X-RateLimit-Limit: 1000 - Total limit
  • X-RateLimit-Remaining: 742 - Requests remaining
  • X-RateLimit-Reset: 1700000060 - When window resets

Handling Multiple Time Windows

Support multiple limits simultaneously:

  • 1000 requests per minute
  • 10,000 requests per hour
  • 100,000 requests per day

Solution: Check each limit independently, deny if ANY limit exceeded
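Checking several windows is then a conjunction; a trivial sketch assuming the per-window counts are already available:

```python
# Limits from the example above: window_seconds -> max_requests.
LIMITS = [(60, 1000), (3600, 10_000), (86_400, 100_000)]

def allowed(counts_by_window):
    """Deny if ANY window's current count has reached its limit."""
    return all(counts_by_window.get(window, 0) < limit for window, limit in LIMITS)
```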

Scaling Redis

  • Redis Cluster: Partition by account_id (consistent hashing)
  • Replication: Redis replica for read availability
  • Fallback: If Redis is down, fail open or use local in-memory limit (less accurate)

Alternative Approaches

  • Fixed Window Counter: simple, low memory. Con: bursts at window edges (up to 2x the limit).
  • Token Bucket: allows bursts, smooth. Con: more complex state.
  • Sliding Window (chosen): accurate, no edge bursts. Con: higher memory (stores timestamps).
  • Leaky Bucket: enforces a strict rate. Con: no bursting, more complex.
Trade-offs Discussion

Accuracy vs Performance

  • Sliding window is most accurate but requires sorted set (more memory)
  • Fixed window uses single counter (less memory) but allows edge bursts
  • Decision: Accuracy matters for billing - use sliding window

Centralized (Redis) vs Distributed (local)

  • Redis: Accurate but adds dependency and latency
  • Local: Fast but inaccurate in distributed system (each instance tracks separately)
  • Decision: Use Redis with aggressive caching

Fail Open vs Fail Closed

  • If Redis is down:
    • Fail open: Allow all requests (better availability, risk of abuse)
    • Fail closed: Deny all requests (worse availability, safe)
  • Decision: Fail open with local rate limiting as backup
Interview Talking Points
  • "Rate limiting is about fairness and protection" - Prevent one customer from consuming all capacity
  • "Choose algorithm based on requirements" - If bursting is okay, use token bucket. If strict limit needed, use leaky bucket or sliding window
  • "Graceful degradation" - If centralized rate limiter fails, fall back to local limits
  • "Observability" - Track rate limit denials as a metric - might indicate customer needs to upgrade tier