BizComms Overview
A production-grade, multi-region alert notification system built on AWS with a cell-based architecture, demonstrating enterprise patterns for failure-domain isolation, strong data consistency, and automated operations.
What is BizComms?
BizComms is a real-time alert notification system that monitors HackerNews for new articles and sends SMS and voice notifications to subscribers via Twilio. It's designed to demonstrate:
- Cell-Based Architecture: failure domain isolation with independent regional deployments (cells)
- Multi-Region Design: active-passive deployment across us-east-1 and us-west-1 with automatic failover
- Strong Consistency: Aurora Global Database with <1s replication lag for transactional data
- Event-Driven: SNS/SQS messaging with at-least-once delivery guarantees
Key Features
- Infrastructure as Code: Complete Terraform modules for AWS
- Kubernetes Orchestration: EKS with Karpenter autoscaling
- CI/CD Pipelines: GitHub Actions for automated deployments
- Observability: CloudWatch metrics, logs, and alarms
- Security Best Practices: IRSA, Secrets Manager, WAF
- High Availability: Multi-AZ deployment with health checks
- Auto-scaling: HPA for pods, Karpenter for nodes
- Cost Optimized: Spot instances, on-demand DynamoDB
System Architecture
High-Level Architecture
*(Diagram: Route53 latency-based routing sends users to the us-east-1 primary cell (cell-use1-01) — WAF + ALB in front of an EKS cluster running API pods (x3), Worker pods (x2), and a Poller pod (x1), backed by the Aurora primary, DynamoDB, and SNS → SQS — with failover to the identical us-west-1 secondary cell (cell-usw1-01), whose Aurora replica is read-only. Aurora Global Database and DynamoDB Global Tables replicate data cross-region; pollers in both cells watch HackerNews, and workers deliver SMS and voice via the Twilio API.)*
Cell-Based Architecture
Each region operates as an independent cell with complete isolation. This design limits the blast radius of failures and enables independent scaling.
What is a Cell?
A cell is a self-contained deployment unit with its own compute, storage, and messaging infrastructure. Cells share global data via Aurora Global Database and DynamoDB Global Tables but operate independently.
Cell Components
| Component | Type | Purpose | Replication |
|---|---|---|---|
| VPC | Network | Isolated network per region | None (regional) |
| EKS Cluster | Compute | Kubernetes orchestration | None (regional) |
| Aurora Cluster | Storage | Transactional data (alerts, recipients) | <1s (Global DB) |
| DynamoDB Table | Storage | Delivery status tracking | Eventually consistent (Global Tables) |
| SQS Queue | Messaging | Alert processing queue | None (regional) |
| SNS Topic | Messaging | Event pub/sub | None (regional) |
Network Architecture
*(Diagram: a single-region VPC with three public subnets (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24) hosting the NAT Gateways and the Application Load Balancer, three private subnets (10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24) hosting the EKS pods, and three database subnets (10.0.21.0/24, 10.0.22.0/24, 10.0.23.0/24) hosting Aurora. Internet traffic enters through the Internet Gateway to the ALB; pods reach the internet outbound only via the NAT Gateways.)*
Subnet Design
- Public Subnets: NAT Gateways, Application Load Balancer
- Private Subnets: EKS nodes and pods (no direct internet access)
- Database Subnets: Aurora clusters (isolated from internet)
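As a sanity check, the three subnet tiers can be validated against the VPC CIDR with Python's standard `ipaddress` module — a quick sketch, not project code:

```python
import ipaddress

# Subnet plan from the tiers above (one cell's VPC).
VPC = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public-1a":  "10.0.1.0/24",  "public-1b":  "10.0.2.0/24",  "public-1c":  "10.0.3.0/24",
    "private-1a": "10.0.11.0/24", "private-1b": "10.0.12.0/24", "private-1c": "10.0.13.0/24",
    "db-1a":      "10.0.21.0/24", "db-1b":      "10.0.22.0/24", "db-1c":      "10.0.23.0/24",
}

def validate_subnet_plan(vpc, subnets):
    """True if every subnet fits inside the VPC and no two overlap."""
    nets = [ipaddress.ip_network(cidr) for cidr in subnets.values()]
    if not all(n.subnet_of(vpc) for n in nets):
        return False
    # Pairwise overlap check is fine for a handful of subnets.
    return not any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])
```

Running `validate_subnet_plan(VPC, SUBNETS)` confirms the nine /24s are disjoint and contained in 10.0.0.0/16.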
Infrastructure Components
AWS Services Used
Networking
- VPC: 10.0.0.0/16 with 3 subnet tiers across 3 AZs
- Route53: Latency-based routing with health checks
- ALB: Application Load Balancer with target groups
- WAF: Web Application Firewall with managed rules
- NAT Gateway: Outbound internet access for private subnets
Compute
- EKS: Kubernetes 1.28+ with managed node groups
- Karpenter: Node autoscaling with spot instance support
- Fargate: Serverless pods for system components (optional)
Data Storage
- Aurora PostgreSQL: Global Database with primary in us-east-1
- DynamoDB: Global Tables for delivery status
- S3: Terraform state, CloudWatch log archive
Messaging
- SQS: Alert processing queue with DLQ
- SNS: Event topics for alert events and delivery status
Security
- Secrets Manager: Twilio credentials, DB passwords, API keys
- KMS: Encryption keys for data at rest
- IAM: IRSA (IAM Roles for Service Accounts)
- Security Groups: Network-level firewall rules
Observability
- CloudWatch Logs: Centralized logging for all services
- CloudWatch Metrics: Custom metrics and dashboards
- CloudWatch Alarms: Automated alerting
Terraform Module Structure
Resource Sizing
| Resource | Development | Production | Notes |
|---|---|---|---|
| Aurora Instance | db.t4g.medium | db.r6g.large | 2 vCPU, 8 GB RAM (prod) |
| EKS Node Group | t3.medium | t3.large | Karpenter manages scaling |
| DynamoDB | On-Demand | On-Demand | Auto-scales based on traffic |
| NAT Gateway | 1 per AZ | 1 per AZ | High availability |
Microservices
Service Architecture
*(Diagram: clients call the API Service (FastAPI on port 8000, 11 endpoints, backed by Aurora); the API publishes to an SNS topic, which feeds an SQS queue consumed by the Worker Service (async SQS consumer plus Twilio client, writing delivery status to DynamoDB); the Poller Service fetches from the HackerNews API, POSTs new stories to the API, and keeps its state in storage.)*
1. API Service
Technology: FastAPI (Python 3.11), SQLAlchemy 2.0 (async), Pydantic v2
Purpose: RESTful API for managing alerts and recipients
Endpoints
| Method | Path | Description | Response |
|---|---|---|---|
| GET | /health | Health check | 200 OK |
| POST | /v1/alerts | Create new alert | 202 Accepted |
| GET | /v1/alerts | List alerts (paginated) | 200 OK |
| GET | /v1/alerts/{id} | Get alert details | 200 OK |
| GET | /v1/alerts/{id}/status | Get delivery status | 200 OK |
| PUT | /v1/alerts/{id} | Update alert | 200 OK |
| DELETE | /v1/alerts/{id} | Cancel alert | 204 No Content |
| POST | /v1/recipients | Create recipient | 201 Created |
| GET | /v1/recipients | List recipients | 200 OK |
| GET | /v1/recipients/{id} | Get recipient details | 200 OK |
| PUT | /v1/recipients/{id} | Update recipient | 200 OK |
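To illustrate the create-alert contract, here is a hypothetical client-side helper that builds the `POST /v1/alerts` request body; the allowed priority and delivery-method values are assumptions, not taken from the API schema:

```python
# Hypothetical helper; ALLOWED_PRIORITIES and ALLOWED_METHODS are assumed
# enum values, not confirmed by the API schema.
ALLOWED_PRIORITIES = {"low", "medium", "high"}
ALLOWED_METHODS = {"sms", "voice"}

def build_alert_payload(title, message, priority="high",
                        delivery_method="sms", tags=None):
    """Validate inputs and assemble the POST /v1/alerts body."""
    if priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    if delivery_method not in ALLOWED_METHODS:
        raise ValueError(f"unknown delivery_method: {delivery_method}")
    payload = {"title": title, "message": message,
               "priority": priority, "delivery_method": delivery_method}
    if tags:
        payload["recipient_filters"] = {"tags": list(tags), "active_only": True}
    return payload
```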
Configuration
```yaml
# Kubernetes Deployment
replicas: 3
resources:
  requests:
    memory: 256Mi
    cpu: 250m
  limits:
    memory: 512Mi
    cpu: 500m

# Auto-scaling (HPA)
minReplicas: 2
maxReplicas: 10
targetCPU: 70%
targetMemory: 80%
```
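The HPA settings above follow Kubernetes' standard scaling rule, `desired = ceil(currentReplicas × currentMetric / targetMetric)`, clamped to the min/max bounds, which can be sketched as:

```python
import math

def hpa_desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=70,
                         min_replicas=2, max_replicas=10):
    """Kubernetes HPA rule: desired = ceil(current * metric / target),
    clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 replicas averaging 140% of requested CPU against a 70% target scale out to 6 replicas.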
2. Worker Service
Technology: Python 3.11 (async), boto3, Twilio SDK
Purpose: Process alert messages from SQS and deliver via Twilio
Processing Flow
Error Handling
- Retry Logic: SQS visibility timeout (30s) with max receive count (3)
- Dead Letter Queue: Failed messages move to DLQ after 3 attempts
- Idempotency: Message deduplication using message_id
3. Poller Service
Technology: Python 3.11 (async), httpx
Purpose: Monitor HackerNews API and create alerts for new stories
Polling Logic
```text
# Every 5 minutes:
1. Fetch top 30 stories from the HackerNews API
2. Compare with previous state (stored in DynamoDB)
3. For each new story:
   - Fetch story details
   - POST to the /v1/alerts endpoint
   - Update state in DynamoDB
4. Sleep until the next poll interval
```
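The compare step reduces to a set difference that preserves HackerNews' ranking order — a minimal sketch:

```python
def find_new_stories(current_top_ids, previous_ids):
    """Return IDs present in the latest top-stories fetch but absent from
    the previously stored state, keeping HN's ranking order."""
    seen = set(previous_ids)
    return [sid for sid in current_top_ids if sid not in seen]
```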
Alternative: Lambda + EventBridge
The poller could instead be implemented as a Lambda function triggered by an Amazon EventBridge schedule (cron: every 5 minutes). This would be more cost-effective (~$0.20/month vs ~$15/month for a continuously running pod).
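A sketch of what that Lambda entry point might look like; `poll_once` is a hypothetical stand-in for the fetch/compare/POST cycle, injected as a parameter purely so the sketch is testable:

```python
def lambda_handler(event, context, poll_once=None):
    """Hypothetical entry point for the EventBridge-scheduled variant.

    In Lambda, EventBridge invokes this every 5 minutes; poll_once would
    run the fetch/compare/POST cycle and return the number of alerts
    created (an assumed contract, not code from this repo).
    """
    created = poll_once() if poll_once else 0
    return {"statusCode": 200, "body": f"created {created} alerts"}
```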
Data Flow & Patterns
End-to-End Alert Flow
Data Consistency Model
Strong Consistency (Aurora)
Used for transactional data requiring ACID guarantees:
- Alerts table: Alert metadata, status, created_at
- Recipients table: Phone numbers, preferences, active status
- Replication: Aurora Global Database (<1 second lag)
Eventual Consistency (DynamoDB)
Used for high-volume, time-series data:
- Delivery status: Twilio SID, status, timestamps
- Replication: DynamoDB Global Tables (typically <1 second)
- TTL: Auto-delete records after 30 days
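The TTL attribute is just an epoch-seconds timestamp stamped onto each item at write time; a sketch assuming the attribute is named `ttl` (the name is configurable on the table):

```python
import time

TTL_DAYS = 30  # matches the 30-day retention above

def with_ttl(item, now=None):
    """Attach the epoch-seconds attribute DynamoDB's TTL feature expires on."""
    now = int(now if now is not None else time.time())
    return {**item, "ttl": now + TTL_DAYS * 86400}
```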
Messaging Patterns
1. Fan-Out Pattern (SNS → SQS)
*(Diagram: the alert-events SNS topic fans out to the alert-processing SQS queue consumed by the worker pods today, with future subscribers sketched in for analytics and webhooks queues.)*
2. At-Least-Once Delivery
SQS guarantees at-least-once delivery with these mechanisms:
- Visibility Timeout: 30 seconds (message hidden during processing)
- Max Receive Count: 3 attempts before moving to DLQ
- Message Retention: 14 days
- Dead Letter Queue: Failed messages for investigation
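The receive-count/DLQ interaction can be modeled in a few lines (real SQS tracks `ApproximateReceiveCount` and redrives server-side; this is only an illustration):

```python
# Toy model of the redrive policy above: a message that fails
# MAX_RECEIVE_COUNT receives is parked on the dead-letter queue.
MAX_RECEIVE_COUNT = 3

def process_with_redrive(message, handler, dlq):
    """Return the attempt number on success, or None after moving to the DLQ."""
    for attempt in range(1, MAX_RECEIVE_COUNT + 1):
        try:
            handler(message)
            return attempt  # delivered on this attempt
        except Exception:
            continue  # message becomes visible again after the timeout
    dlq.append(message)
    return None
```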
3. Idempotency
```python
# Worker ensures idempotent processing
message_id = message.get("MessageId")

# Check if already processed (recorded in DynamoDB)
if delivery_status_exists(message_id):
    logger.info(f"Message {message_id} already processed")
    return

# Process and store status atomically
process_alert(alert_id)
store_delivery_status(message_id, status)
```
Regional Failover
*(Diagram: in normal operation Route53 routes traffic to the healthy us-east-1 primary, with us-west-1 on standby; after failover, Route53 routes to us-west-1, which is promoted to primary, while the unhealthy us-east-1 is taken out of rotation.)*
Failover Process:
- Route53 health check detects primary region failure
- DNS automatically routes traffic to secondary region
- Aurora Global Database promotes secondary to primary (manual)
- Applications in secondary region continue serving requests
API Reference
Authentication
API Key Required: Include X-API-Key header in all requests
```bash
curl -H "X-API-Key: your-api-key-here" \
  https://api.bizcomms.example.com/v1/alerts
```
Create Alert
POST /v1/alerts
Request Body
```json
{
  "title": "New HackerNews Story",
  "message": "Check out this awesome article!",
  "priority": "high",
  "delivery_method": "sms",
  "recipient_filters": {
    "tags": ["tech", "news"],
    "active_only": true
  }
}
```
Response (202 Accepted)
```json
{
  "alert_id": "alert_abc123",
  "status": "queued",
  "created_at": "2025-01-01T12:00:00Z",
  "estimated_recipients": 42
}
```
Get Alert Status
GET /v1/alerts/{id}/status
Response (200 OK)
```json
{
  "alert_id": "alert_abc123",
  "total_recipients": 42,
  "delivery_status": {
    "sent": 40,
    "delivered": 38,
    "failed": 2,
    "pending": 0
  },
  "deliveries": [
    {
      "recipient_id": "recip_123",
      "phone_number": "+1234567890",
      "status": "delivered",
      "twilio_sid": "SM1234567890abcdef",
      "sent_at": "2025-01-01T12:00:05Z",
      "delivered_at": "2025-01-01T12:00:07Z"
    }
  ]
}
```
Create Recipient
POST /v1/recipients
Request Body
```json
{
  "name": "John Doe",
  "phone_number": "+1234567890",
  "preferences": {
    "sms_enabled": true,
    "voice_enabled": false
  },
  "tags": ["tech", "news"],
  "active": true
}
```
Response (201 Created)
```json
{
  "recipient_id": "recip_123",
  "name": "John Doe",
  "phone_number": "+1234567890",
  "created_at": "2025-01-01T12:00:00Z"
}
```
Error Responses
| Status Code | Error | Description |
|---|---|---|
| 400 | Bad Request | Invalid request body or parameters |
| 401 | Unauthorized | Missing or invalid API key |
| 404 | Not Found | Resource not found |
| 429 | Too Many Requests | Rate limit exceeded (WAF) |
| 500 | Internal Server Error | Server-side error |
| 503 | Service Unavailable | Service temporarily unavailable |
Deployment Guide
Prerequisites
- AWS Account with Administrator access
- AWS CLI configured
- Terraform 1.6+
- kubectl 1.28+
- Helm 3.13+
- Docker 24.0+
Deployment Steps
Step 1: Terraform Backend Setup
```bash
cd terraform/backend-setup
terraform init
terraform apply
# Output: S3 bucket and DynamoDB table for state
```
Step 2: Deploy Primary Region (us-east-1)
```bash
cd ../environments/us-east-1

# Initialize Terraform
terraform init

# Review plan
terraform plan

# Apply infrastructure
terraform apply

# Capture outputs
terraform output -json > outputs.json
```
Step 3: Deploy Secondary Region (us-west-1)
```bash
cd ../us-west-1
terraform init
terraform plan
terraform apply
terraform output -json > outputs.json
```
Step 4: Build and Push Docker Images
```bash
# Authenticate to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

# Build images
docker build -f docker/Dockerfile.api -t bizcomms-api:latest .
docker build -f docker/Dockerfile.worker -t bizcomms-worker:latest .
docker build -f docker/Dockerfile.poller -t bizcomms-poller:latest .

# Tag and push
docker tag bizcomms-api:latest ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/bizcomms-api:latest
docker push ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/bizcomms-api:latest
# Repeat for worker and poller
```
Step 5: Deploy Kubernetes Applications
```bash
# Configure kubectl
aws eks update-kubeconfig --region us-east-1 --name bizcomms-cell-use1-01-eks

# Get the WAF ARN from Terraform
WAF_ARN=$(cd terraform/environments/us-east-1 && terraform output -raw waf_web_acl_arn)

# Deploy with Helm
helm install bizcomms ./helm/bizcomms \
  -f helm/bizcomms/values-us-east-1.yaml \
  --set ingress.wafAclArn="$WAF_ARN" \
  --namespace bizcomms \
  --create-namespace

# Verify deployment
kubectl get pods -n bizcomms
kubectl get ingress -n bizcomms
```
Step 6: Update Route53
```bash
# Get the ALB DNS name
ALB_DNS=$(kubectl get ingress bizcomms-ingress -n bizcomms -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# Update Terraform variables
cd terraform/environments/us-east-1
echo "alb_dns_name = \"$ALB_DNS\"" >> terraform.tfvars

# Re-apply to update Route53
terraform apply -target=module.route53
```
Deployment Architecture
Verification
```bash
# Check API health
curl https://api.bizcomms.example.com/health

# Create a test alert
curl -X POST https://api.bizcomms.example.com/v1/alerts \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Test Alert",
    "message": "Hello from BizComms!",
    "priority": "high",
    "delivery_method": "sms"
  }'

# Check logs
kubectl logs -n bizcomms -l app=bizcomms-api --tail=100
kubectl logs -n bizcomms -l app=bizcomms-worker --tail=100
```
Security
Security Architecture
*(Diagram: edge protection — WAF (SQL injection, XSS, rate limiting) in front of the ALB (SSL termination) and security groups (firewall rules); identity and access — IRSA pod IAM roles, Secrets Manager for credential storage, KMS for encryption keys; data protection — TLS in transit, encryption at rest, automated backups. Internet traffic flows WAF → ALB → security groups → EKS pods, which reach Secrets Manager via IRSA, with secrets encrypted by KMS.)*
Security Best Practices Implemented
1. No Hardcoded Credentials
- IRSA: Pods assume IAM roles via OIDC provider
- Secrets Manager: Twilio credentials, DB passwords, API keys
- Environment Variables: Only non-sensitive config
2. Least Privilege IAM
API pod role policy statement (excerpt):

```json
{
  "Effect": "Allow",
  "Action": [
    "secretsmanager:GetSecretValue",
    "sns:Publish"
  ],
  "Resource": [
    "arn:aws:secretsmanager:*:*:secret:bizcomms-*",
    "arn:aws:sns:*:*:bizcomms-*"
  ]
}
```
3. Network Isolation
| Component | Subnet Type | Internet Access |
|---|---|---|
| ALB | Public | Direct (IGW) |
| EKS Pods | Private | Via NAT Gateway |
| Aurora | Database | None |
4. Encryption
- In Transit: TLS 1.2+ for all API calls, Aurora connections
- At Rest: KMS encryption for Aurora, DynamoDB, S3
- Secrets: KMS-encrypted in Secrets Manager
5. WAF Protection
```text
# Managed rules applied
- AWSManagedRulesSQLiRuleSet            # SQL injection
- AWSManagedRulesKnownBadInputsRuleSet  # Known threats
- AWSManagedRulesCommonRuleSet          # OWASP Top 10

# Rate limiting
- 1000 requests per 5 minutes per IP
- Custom response: 429 Too Many Requests
```
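The rate rule behaves roughly like a per-source-IP counter over a 5-minute window; a fixed-window sketch for intuition (WAF actually evaluates a rolling window, so this is only an approximation):

```python
WINDOW_SECONDS = 300  # 5-minute window
LIMIT = 1000          # requests per window per IP

class RateLimiter:
    """Fixed-window counter approximating the WAF rate rule above."""
    def __init__(self):
        self.windows = {}  # (ip, window index) -> request count

    def allow(self, ip, now):
        key = (ip, int(now) // WINDOW_SECONDS)
        self.windows[key] = self.windows.get(key, 0) + 1
        return self.windows[key] <= LIMIT  # False -> respond 429
```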
6. Audit Logging
- CloudTrail: All AWS API calls logged
- CloudWatch Logs: Application logs (API, Worker, Poller)
- VPC Flow Logs: Network traffic analysis
Security Checklist
Infrastructure
- VPC with private subnets
- Security groups (least privilege)
- WAF with managed rules
- NACLs for subnet protection
Application
- IRSA for pod authentication
- Secrets Manager integration
- Input validation (Pydantic)
- Container image scanning (Trivy)
Monitoring & Observability
Observability Stack
CloudWatch Dashboards
Automated dashboards created by Terraform monitoring module:
1. API Service Dashboard
- Request count (per endpoint)
- Latency (p50, p95, p99)
- Error rate (4xx, 5xx)
- Active connections
2. Worker Service Dashboard
- SQS queue depth
- Messages processed/minute
- Processing duration
- Twilio delivery success rate
3. Infrastructure Dashboard
- Aurora: CPU, connections, IOPS, replication lag
- DynamoDB: Read/write capacity, throttles
- EKS: Pod count, node utilization
CloudWatch Alarms
| Alarm Name | Metric | Threshold | Action |
|---|---|---|---|
| API High Error Rate | 5xx errors | > 10/minute | SNS notification |
| SQS Queue Backup | Queue depth | > 100 messages | SNS + Auto-scale workers |
| Aurora High CPU | CPU utilization | > 80% | SNS notification |
| Aurora Replication Lag | Replication lag | > 5 seconds | SNS notification |
| EKS Node Failure | Node count | < 2 nodes | SNS notification |
Log Aggregation
```bash
# Query API logs for errors
aws logs tail /aws/eks/bizcomms-cell-use1-01/api --follow \
  --filter-pattern "ERROR"
```

```text
# CloudWatch Logs Insights query
fields @timestamp, @message
| filter @message like /5xx/
| stats count() by bin(5m)
```
Health Checks
Kubernetes Probes
```yaml
# Liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10

# Readiness probe
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
```
Route53 Health Checks
- Endpoint: https://api.bizcomms.example.com/health
- Interval: 30 seconds
- Failure threshold: 3 consecutive failures
- Action: Failover to secondary region
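The failure-threshold rule is simple consecutive-failure counting, sketched here:

```python
FAILURE_THRESHOLD = 3  # consecutive failed checks before failover

class HealthTracker:
    """Mirrors Route53's rule: the endpoint is considered unhealthy
    (triggering failover) after 3 failed checks in a row."""
    def __init__(self):
        self.consecutive_failures = 0

    def record(self, check_passed):
        """Record one health check; return True while still healthy."""
        self.consecutive_failures = 0 if check_passed else self.consecutive_failures + 1
        return self.consecutive_failures < FAILURE_THRESHOLD
```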
Cost Optimization
Estimated Monthly Costs (Production)
| Service | Configuration | Monthly Cost (USD) |
|---|---|---|
| EKS Cluster | Control plane (2 regions) | $146 |
| EC2 Instances (Karpenter) | 3x t3.large (spot 70%) | $76 |
| Aurora Global Database | 2x db.r6g.large | $438 |
| DynamoDB | On-Demand (10GB, 1M reads/writes) | $26 |
| ALB | 2 load balancers | $33 |
| NAT Gateway | 2 regions × 3 AZs | $194 |
| Data Transfer | Cross-region, internet egress | $50 |
| CloudWatch | Logs, metrics, alarms | $25 |
| Route53 | Hosted zone + health checks | $1 |
| Secrets Manager | 6 secrets | $2.40 |
| TOTAL | | ~$991 |
Cost Optimization Strategies
1. Compute Optimization
Use Karpenter with Spot Instances
Savings: 60-70% reduction on compute costs
```yaml
# Karpenter Provisioner (excerpt)
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi
```
2. Database Optimization
- Aurora Serverless v2: Auto-scales based on load (future consideration)
- Read Replicas: Add read replicas only when needed
- Right-Sizing: Start with db.t4g.medium for dev/staging
3. Storage Optimization
- DynamoDB On-Demand: Pay only for actual reads/writes (no provisioned capacity waste)
- TTL: Auto-delete delivery status after 30 days
- S3 Lifecycle: Move CloudWatch logs to Glacier after 90 days
4. Network Optimization
NAT Gateway is Expensive
Alternative: Use VPC Endpoints for AWS services (S3, DynamoDB, Secrets Manager)
Savings: ~$30-50/month by eliminating NAT Gateway traffic to AWS services
5. Monitoring Optimization
- Log Retention: 7 days for debug logs, 30 days for error logs
- Metric Filters: Only send necessary metrics to CloudWatch
- Sampling: Sample debug logs (10%) instead of 100%
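For the sampling point, hash-based selection is preferable to a bare `random()` check because every debug line for one request is kept or dropped together — a sketch assuming a per-request trace ID is available:

```python
import zlib

SAMPLE_RATE = 0.10  # keep 10% of debug logs, as above

def should_sample(trace_id, rate=SAMPLE_RATE):
    """Deterministic hash-based sampling: the same trace_id always maps to
    the same keep/drop decision, so sampled requests stay fully logged."""
    bucket = zlib.crc32(trace_id.encode()) % 10_000
    return bucket < rate * 10_000
```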
Cost by Environment
| Environment | Configuration | Monthly Cost |
|---|---|---|
| Development | Single region, t3.medium, db.t4g.medium, no multi-AZ | ~$180 |
| Staging | Single region, t3.large, db.r6g.large, multi-AZ | ~$400 |
| Production | Multi-region, spot instances, db.r6g.large, HA | ~$991 |