BizComms Overview
A production-grade, multi-region alert notification system built on AWS with a cell-based architecture, demonstrating enterprise patterns for failure-domain isolation, strong data consistency, and automated operations.
What is BizComms?
BizComms is a real-time alert notification system that monitors HackerNews for new articles and sends SMS and voice notifications to subscribers via Twilio. It's designed to demonstrate:
- Cell-Based Architecture: failure domain isolation with independent regional deployments (cells)
- Multi-Region Design: active-passive deployment across us-east-1 and us-west-1 with automatic failover
- Strong Consistency: Aurora Global Database with <1s replication lag for transactional data
- Event-Driven: SNS/SQS messaging with at-least-once delivery guarantees
Key Features
- Infrastructure as Code: Complete Terraform modules for AWS
- Kubernetes Orchestration: EKS with Karpenter autoscaling
- CI/CD Pipelines: GitHub Actions for automated deployments
- Observability: CloudWatch metrics, logs, and alarms
- Security Best Practices: IRSA, Secrets Manager, WAF
- High Availability: Multi-AZ deployment with health checks
- Auto-scaling: HPA for pods, Karpenter for nodes
- Cost Optimized: Spot instances, on-demand DynamoDB
System Architecture
High-Level Architecture
*(Diagram: Route53 latency-based routing sends users to the us-east-1 primary cell (cell-use1-01) — WAF + ALB in front of an EKS cluster running API pods (x3), Worker pods (x2), and a Poller pod (x1), backed by the Aurora primary, DynamoDB, and SNS → SQS — with failover to the identical us-west-1 secondary cell (cell-usw1-01), whose Aurora replica is read-only. Aurora Global Database and DynamoDB Global Tables replicate data cross-region; pollers in both cells watch HackerNews, and workers deliver SMS and voice via the Twilio API.)*
Cell-Based Architecture
Each region operates as an independent cell with complete isolation. This design limits the blast radius of failures and enables independent scaling.
What is a Cell?
A cell is a self-contained deployment unit with its own compute, storage, and messaging infrastructure. Cells share global data via Aurora Global Database and DynamoDB Global Tables but operate independently.
Cell Components
| Component | Type | Purpose | Replication |
|---|---|---|---|
| VPC | Network | Isolated network per region | None (regional) |
| EKS Cluster | Compute | Kubernetes orchestration | None (regional) |
| Aurora Cluster | Storage | Transactional data (alerts, recipients) | <1s (Global DB) |
| DynamoDB Table | Storage | Delivery status tracking | Eventually consistent (Global Tables) |
| SQS Queue | Messaging | Alert processing queue | None (regional) |
| SNS Topic | Messaging | Event pub/sub | None (regional) |
Network Architecture
*(Diagram: a single-region VPC with three public subnets (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24) hosting the NAT Gateways and the Application Load Balancer, three private subnets (10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24) hosting the EKS pods, and three database subnets (10.0.21.0/24, 10.0.22.0/24, 10.0.23.0/24) hosting Aurora. Internet traffic enters through the Internet Gateway to the ALB; pods reach the internet outbound only via the NAT Gateways.)*
Subnet Design
- Public Subnets: NAT Gateways, Application Load Balancer
- Private Subnets: EKS nodes and pods (no direct internet access)
- Database Subnets: Aurora clusters (isolated from internet)
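As a sanity check, the three subnet tiers can be validated against the VPC CIDR with Python's standard `ipaddress` module — a quick sketch, not project code:

```python
import ipaddress

# Subnet plan from the tiers above (one cell's VPC).
VPC = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public-1a":  "10.0.1.0/24",  "public-1b":  "10.0.2.0/24",  "public-1c":  "10.0.3.0/24",
    "private-1a": "10.0.11.0/24", "private-1b": "10.0.12.0/24", "private-1c": "10.0.13.0/24",
    "db-1a":      "10.0.21.0/24", "db-1b":      "10.0.22.0/24", "db-1c":      "10.0.23.0/24",
}

def validate_subnet_plan(vpc, subnets):
    """True if every subnet fits inside the VPC and no two overlap."""
    nets = [ipaddress.ip_network(cidr) for cidr in subnets.values()]
    if not all(n.subnet_of(vpc) for n in nets):
        return False
    # Pairwise overlap check is fine for a handful of subnets.
    return not any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])
```

Running `validate_subnet_plan(VPC, SUBNETS)` confirms the nine /24s are disjoint and contained in 10.0.0.0/16.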
Infrastructure Components
AWS Services Used
Networking
- VPC: 10.0.0.0/16 with 3 subnet tiers across 3 AZs
- Route53: Latency-based routing with health checks
- ALB: Application Load Balancer with target groups
- WAF: Web Application Firewall with managed rules
- NAT Gateway: Outbound internet access for private subnets
Compute
- EKS: Kubernetes 1.28+ with managed node groups
- Karpenter: Node autoscaling with spot instance support
- Fargate: Serverless pods for system components (optional)
Data Storage
- Aurora PostgreSQL: Global Database with primary in us-east-1
- DynamoDB: Global Tables for delivery status
- S3: Terraform state, CloudWatch log archive
Messaging
- SQS: Alert processing queue with DLQ
- SNS: Event topics for alert events and delivery status
Security
- Secrets Manager: Twilio credentials, DB passwords, API keys
- KMS: Encryption keys for data at rest
- IAM: IRSA (IAM Roles for Service Accounts)
- Security Groups: Network-level firewall rules
Observability
- CloudWatch Logs: Centralized logging for all services
- CloudWatch Metrics: Custom metrics and dashboards
- CloudWatch Alarms: Automated alerting
Terraform Module Structure
Resource Sizing
| Resource | Development | Production | Notes |
|---|---|---|---|
| Aurora Instance | db.t4g.medium | db.r6g.large | 2 vCPU, 8 GB RAM (prod) |
| EKS Node Group | t3.medium | t3.large | Karpenter manages scaling |
| DynamoDB | On-Demand | On-Demand | Auto-scales based on traffic |
| NAT Gateway | 1 per AZ | 1 per AZ | High availability |
Microservices
Service Architecture
*(Diagram: clients call the API Service (FastAPI on port 8000, 11 endpoints, backed by Aurora); the API publishes to an SNS topic, which feeds an SQS queue consumed by the Worker Service (async SQS consumer plus Twilio client, writing delivery status to DynamoDB); the Poller Service fetches from the HackerNews API, POSTs new stories to the API, and keeps its state in storage.)*
1. API Service
Technology: FastAPI (Python 3.11), SQLAlchemy 2.0 (async), Pydantic v2
Purpose: RESTful API for managing alerts and recipients
Endpoints
| Method | Path | Description | Response |
|---|---|---|---|
| GET | /health | Health check | 200 OK |
| POST | /v1/alerts | Create new alert | 202 Accepted |
| GET | /v1/alerts | List alerts (paginated) | 200 OK |
| GET | /v1/alerts/{id} | Get alert details | 200 OK |
| GET | /v1/alerts/{id}/status | Get delivery status | 200 OK |
| PUT | /v1/alerts/{id} | Update alert | 200 OK |
| DELETE | /v1/alerts/{id} | Cancel alert | 204 No Content |
| POST | /v1/recipients | Create recipient | 201 Created |
| GET | /v1/recipients | List recipients | 200 OK |
| GET | /v1/recipients/{id} | Get recipient details | 200 OK |
| PUT | /v1/recipients/{id} | Update recipient | 200 OK |
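To illustrate the create-alert contract, here is a hypothetical client-side helper that builds the `POST /v1/alerts` request body; the allowed priority and delivery-method values are assumptions, not taken from the API schema:

```python
# Hypothetical helper; ALLOWED_PRIORITIES and ALLOWED_METHODS are assumed
# enum values, not confirmed by the API schema.
ALLOWED_PRIORITIES = {"low", "medium", "high"}
ALLOWED_METHODS = {"sms", "voice"}

def build_alert_payload(title, message, priority="high",
                        delivery_method="sms", tags=None):
    """Validate inputs and assemble the POST /v1/alerts body."""
    if priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    if delivery_method not in ALLOWED_METHODS:
        raise ValueError(f"unknown delivery_method: {delivery_method}")
    payload = {"title": title, "message": message,
               "priority": priority, "delivery_method": delivery_method}
    if tags:
        payload["recipient_filters"] = {"tags": list(tags), "active_only": True}
    return payload
```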
Configuration
```yaml
# Kubernetes Deployment
replicas: 3
resources:
  requests:
    memory: 256Mi
    cpu: 250m
  limits:
    memory: 512Mi
    cpu: 500m

# Auto-scaling (HPA)
minReplicas: 2
maxReplicas: 10
targetCPU: 70%
targetMemory: 80%
```
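The HPA settings above follow Kubernetes' standard scaling rule, `desired = ceil(currentReplicas × currentMetric / targetMetric)`, clamped to the min/max bounds, which can be sketched as:

```python
import math

def hpa_desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=70,
                         min_replicas=2, max_replicas=10):
    """Kubernetes HPA rule: desired = ceil(current * metric / target),
    clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 replicas averaging 140% of requested CPU against a 70% target scale out to 6 replicas.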
2. Worker Service
Technology: Python 3.11 (async), boto3, Twilio SDK
Purpose: Process alert messages from SQS and deliver via Twilio
Processing Flow
Error Handling
- Retry Logic: SQS visibility timeout (30s) with max receive count (3)
- Dead Letter Queue: Failed messages move to DLQ after 3 attempts
- Idempotency: Message deduplication using message_id
3. Poller Service
Technology: Python 3.11 (async), httpx
Purpose: Monitor HackerNews API and create alerts for new stories
Polling Logic
```text
# Every 5 minutes:
1. Fetch top 30 stories from the HackerNews API
2. Compare with previous state (stored in DynamoDB)
3. For each new story:
   - Fetch story details
   - POST to the /v1/alerts endpoint
   - Update state in DynamoDB
4. Sleep until the next poll interval
```
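The compare step reduces to a set difference that preserves HackerNews' ranking order — a minimal sketch:

```python
def find_new_stories(current_top_ids, previous_ids):
    """Return IDs present in the latest top-stories fetch but absent from
    the previously stored state, keeping HN's ranking order."""
    seen = set(previous_ids)
    return [sid for sid in current_top_ids if sid not in seen]
```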
Alternative: Lambda + EventBridge
The poller could instead be implemented as a Lambda function triggered by an Amazon EventBridge schedule (cron: every 5 minutes). This would be more cost-effective (~$0.20/month vs ~$15/month for a continuously running pod).
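A sketch of what that Lambda entry point might look like; `poll_once` is a hypothetical stand-in for the fetch/compare/POST cycle, injected as a parameter purely so the sketch is testable:

```python
def lambda_handler(event, context, poll_once=None):
    """Hypothetical entry point for the EventBridge-scheduled variant.

    In Lambda, EventBridge invokes this every 5 minutes; poll_once would
    run the fetch/compare/POST cycle and return the number of alerts
    created (an assumed contract, not code from this repo).
    """
    created = poll_once() if poll_once else 0
    return {"statusCode": 200, "body": f"created {created} alerts"}
```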
Data Flow & Patterns
End-to-End Alert Flow
Data Consistency Model
Strong Consistency (Aurora)
Used for transactional data requiring ACID guarantees:
- Alerts table: Alert metadata, status, created_at
- Recipients table: Phone numbers, preferences, active status
- Replication: Aurora Global Database (<1 second lag)
Eventual Consistency (DynamoDB)
Used for high-volume, time-series data:
- Delivery status: Twilio SID, status, timestamps
- Replication: DynamoDB Global Tables (typically <1 second)
- TTL: Auto-delete records after 30 days
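The TTL attribute is just an epoch-seconds timestamp stamped onto each item at write time; a sketch assuming the attribute is named `ttl` (the name is configurable on the table):

```python
import time

TTL_DAYS = 30  # matches the 30-day retention above

def with_ttl(item, now=None):
    """Attach the epoch-seconds attribute DynamoDB's TTL feature expires on."""
    now = int(now if now is not None else time.time())
    return {**item, "ttl": now + TTL_DAYS * 86400}
```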
Messaging Patterns
1. Fan-Out Pattern (SNS → SQS)
*(Diagram: the alert-events SNS topic fans out to the alert-processing SQS queue consumed by the worker pods today, with future subscribers sketched in for analytics and webhooks queues.)*
2. At-Least-Once Delivery
SQS guarantees at-least-once delivery with these mechanisms:
- Visibility Timeout: 30 seconds (message hidden during processing)
- Max Receive Count: 3 attempts before moving to DLQ
- Message Retention: 14 days
- Dead Letter Queue: Failed messages for investigation
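The receive-count/DLQ interaction can be modeled in a few lines (real SQS tracks `ApproximateReceiveCount` and redrives server-side; this is only an illustration):

```python
# Toy model of the redrive policy above: a message that fails
# MAX_RECEIVE_COUNT receives is parked on the dead-letter queue.
MAX_RECEIVE_COUNT = 3

def process_with_redrive(message, handler, dlq):
    """Return the attempt number on success, or None after moving to the DLQ."""
    for attempt in range(1, MAX_RECEIVE_COUNT + 1):
        try:
            handler(message)
            return attempt  # delivered on this attempt
        except Exception:
            continue  # message becomes visible again after the timeout
    dlq.append(message)
    return None
```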
3. Idempotency
```python
# Worker ensures idempotent processing
message_id = message.get("MessageId")

# Check if already processed (recorded in DynamoDB)
if delivery_status_exists(message_id):
    logger.info(f"Message {message_id} already processed")
    return

# Process and store status atomically
process_alert(alert_id)
store_delivery_status(message_id, status)
```
Regional Failover
*(Diagram: in normal operation Route53 routes traffic to the healthy us-east-1 primary, with us-west-1 on standby; after failover, Route53 routes to us-west-1, which is promoted to primary, while the unhealthy us-east-1 is taken out of rotation.)*
Failover Process:
- Route53 health check detects primary region failure
- DNS automatically routes traffic to secondary region
- Aurora Global Database promotes secondary to primary (manual)
- Applications in secondary region continue serving requests
API Reference
Authentication
API Key Required: Include X-API-Key header in all requests
```bash
curl -H "X-API-Key: your-api-key-here" \
  https://api.bizcomms.example.com/v1/alerts
```
Create Alert
POST /v1/alerts
Request Body
```json
{
  "title": "New HackerNews Story",
  "message": "Check out this awesome article!",
  "priority": "high",
  "delivery_method": "sms",
  "recipient_filters": {
    "tags": ["tech", "news"],
    "active_only": true
  }
}
```
Response (202 Accepted)
```json
{
  "alert_id": "alert_abc123",
  "status": "queued",
  "created_at": "2025-01-01T12:00:00Z",
  "estimated_recipients": 42
}
```
Get Alert Status
GET /v1/alerts/{id}/status
Response (200 OK)
```json
{
  "alert_id": "alert_abc123",
  "total_recipients": 42,
  "delivery_status": {
    "sent": 40,
    "delivered": 38,
    "failed": 2,
    "pending": 0
  },
  "deliveries": [
    {
      "recipient_id": "recip_123",
      "phone_number": "+1234567890",
      "status": "delivered",
      "twilio_sid": "SM1234567890abcdef",
      "sent_at": "2025-01-01T12:00:05Z",
      "delivered_at": "2025-01-01T12:00:07Z"
    }
  ]
}
```
Create Recipient
POST /v1/recipients
Request Body
```json
{
  "name": "John Doe",
  "phone_number": "+1234567890",
  "preferences": {
    "sms_enabled": true,
    "voice_enabled": false
  },
  "tags": ["tech", "news"],
  "active": true
}
```
Response (201 Created)
```json
{
  "recipient_id": "recip_123",
  "name": "John Doe",
  "phone_number": "+1234567890",
  "created_at": "2025-01-01T12:00:00Z"
}
```
Error Responses
| Status Code | Error | Description |
|---|---|---|
| 400 | Bad Request | Invalid request body or parameters |
| 401 | Unauthorized | Missing or invalid API key |
| 404 | Not Found | Resource not found |
| 429 | Too Many Requests | Rate limit exceeded (WAF) |
| 500 | Internal Server Error | Server-side error |
| 503 | Service Unavailable | Service temporarily unavailable |
Deployment Guide
Prerequisites
- AWS Account with Administrator access
- AWS CLI configured
- Terraform 1.6+
- kubectl 1.28+
- Helm 3.13+
- Docker 24.0+
Deployment Steps
Step 1: Terraform Backend Setup
```bash
cd terraform/backend-setup
terraform init
terraform apply
# Output: S3 bucket and DynamoDB table for state
```
Step 2: Deploy Primary Region (us-east-1)
```bash
cd ../environments/us-east-1

# Initialize Terraform
terraform init

# Review plan
terraform plan

# Apply infrastructure
terraform apply

# Capture outputs
terraform output -json > outputs.json
```
Step 3: Deploy Secondary Region (us-west-1)
```bash
cd ../us-west-1
terraform init
terraform plan
terraform apply
terraform output -json > outputs.json
```
Step 4: Build and Push Docker Images
```bash
# Authenticate to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

# Build images
docker build -f docker/Dockerfile.api -t bizcomms-api:latest .
docker build -f docker/Dockerfile.worker -t bizcomms-worker:latest .
docker build -f docker/Dockerfile.poller -t bizcomms-poller:latest .

# Tag and push
docker tag bizcomms-api:latest ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/bizcomms-api:latest
docker push ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/bizcomms-api:latest
# Repeat for worker and poller
```
Step 5: Deploy Kubernetes Applications
```bash
# Configure kubectl
aws eks update-kubeconfig --region us-east-1 --name bizcomms-cell-use1-01-eks

# Get the WAF ARN from Terraform
WAF_ARN=$(cd terraform/environments/us-east-1 && terraform output -raw waf_web_acl_arn)

# Deploy with Helm
helm install bizcomms ./helm/bizcomms \
  -f helm/bizcomms/values-us-east-1.yaml \
  --set ingress.wafAclArn="$WAF_ARN" \
  --namespace bizcomms \
  --create-namespace

# Verify deployment
kubectl get pods -n bizcomms
kubectl get ingress -n bizcomms
```
Step 6: Update Route53
```bash
# Get the ALB DNS name
ALB_DNS=$(kubectl get ingress bizcomms-ingress -n bizcomms -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# Update Terraform variables
cd terraform/environments/us-east-1
echo "alb_dns_name = \"$ALB_DNS\"" >> terraform.tfvars

# Re-apply to update Route53
terraform apply -target=module.route53
```
Deployment Architecture
Verification
```bash
# Check API health
curl https://api.bizcomms.example.com/health

# Create a test alert
curl -X POST https://api.bizcomms.example.com/v1/alerts \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Test Alert",
    "message": "Hello from BizComms!",
    "priority": "high",
    "delivery_method": "sms"
  }'

# Check logs
kubectl logs -n bizcomms -l app=bizcomms-api --tail=100
kubectl logs -n bizcomms -l app=bizcomms-worker --tail=100
```
Security
Security Architecture
*(Diagram: edge protection — WAF (SQL injection, XSS, rate limiting) in front of the ALB (SSL termination) and security groups (firewall rules); identity and access — IRSA pod IAM roles, Secrets Manager for credential storage, KMS for encryption keys; data protection — TLS in transit, encryption at rest, automated backups. Internet traffic flows WAF → ALB → security groups → EKS pods, which reach Secrets Manager via IRSA, with secrets encrypted by KMS.)*
Security Best Practices Implemented
1. No Hardcoded Credentials
- IRSA: Pods assume IAM roles via OIDC provider
- Secrets Manager: Twilio credentials, DB passwords, API keys
- Environment Variables: Only non-sensitive config
2. Least Privilege IAM
API pod role policy statement (excerpt):

```json
{
  "Effect": "Allow",
  "Action": [
    "secretsmanager:GetSecretValue",
    "sns:Publish"
  ],
  "Resource": [
    "arn:aws:secretsmanager:*:*:secret:bizcomms-*",
    "arn:aws:sns:*:*:bizcomms-*"
  ]
}
```
3. Network Isolation
| Component | Subnet Type | Internet Access |
|---|---|---|
| ALB | Public | Direct (IGW) |
| EKS Pods | Private | Via NAT Gateway |
| Aurora | Database | None |
4. Encryption
- In Transit: TLS 1.2+ for all API calls, Aurora connections
- At Rest: KMS encryption for Aurora, DynamoDB, S3
- Secrets: KMS-encrypted in Secrets Manager
5. WAF Protection
```text
# Managed rules applied
- AWSManagedRulesSQLiRuleSet            # SQL injection
- AWSManagedRulesKnownBadInputsRuleSet  # Known threats
- AWSManagedRulesCommonRuleSet          # OWASP Top 10

# Rate limiting
- 1000 requests per 5 minutes per IP
- Custom response: 429 Too Many Requests
```
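The rate rule behaves roughly like a per-source-IP counter over a 5-minute window; a fixed-window sketch for intuition (WAF actually evaluates a rolling window, so this is only an approximation):

```python
WINDOW_SECONDS = 300  # 5-minute window
LIMIT = 1000          # requests per window per IP

class RateLimiter:
    """Fixed-window counter approximating the WAF rate rule above."""
    def __init__(self):
        self.windows = {}  # (ip, window index) -> request count

    def allow(self, ip, now):
        key = (ip, int(now) // WINDOW_SECONDS)
        self.windows[key] = self.windows.get(key, 0) + 1
        return self.windows[key] <= LIMIT  # False -> respond 429
```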
6. Audit Logging
- CloudTrail: All AWS API calls logged
- CloudWatch Logs: Application logs (API, Worker, Poller)
- VPC Flow Logs: Network traffic analysis
Security Checklist
Infrastructure
- VPC with private subnets
- Security groups (least privilege)
- WAF with managed rules
- NACLs for subnet protection
Application
- IRSA for pod authentication
- Secrets Manager integration
- Input validation (Pydantic)
- Container image scanning (Trivy)
Monitoring & Observability
Observability Stack
CloudWatch Dashboards
Automated dashboards created by Terraform monitoring module:
1. API Service Dashboard
- Request count (per endpoint)
- Latency (p50, p95, p99)
- Error rate (4xx, 5xx)
- Active connections
2. Worker Service Dashboard
- SQS queue depth
- Messages processed/minute
- Processing duration
- Twilio delivery success rate
3. Infrastructure Dashboard
- Aurora: CPU, connections, IOPS, replication lag
- DynamoDB: Read/write capacity, throttles
- EKS: Pod count, node utilization
CloudWatch Alarms
| Alarm Name | Metric | Threshold | Action |
|---|---|---|---|
| API High Error Rate | 5xx errors | > 10/minute | SNS notification |
| SQS Queue Backup | Queue depth | > 100 messages | SNS + Auto-scale workers |
| Aurora High CPU | CPU utilization | > 80% | SNS notification |
| Aurora Replication Lag | Replication lag | > 5 seconds | SNS notification |
| EKS Node Failure | Node count | < 2 nodes | SNS notification |
Log Aggregation
```bash
# Query API logs for errors
aws logs tail /aws/eks/bizcomms-cell-use1-01/api --follow \
  --filter-pattern "ERROR"
```

```text
# CloudWatch Logs Insights query
fields @timestamp, @message
| filter @message like /5xx/
| stats count() by bin(5m)
```
Health Checks
Kubernetes Probes
```yaml
# Liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10

# Readiness probe
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
```
Route53 Health Checks
- Endpoint: https://api.bizcomms.example.com/health
- Interval: 30 seconds
- Failure threshold: 3 consecutive failures
- Action: Failover to secondary region
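The failure-threshold rule is simple consecutive-failure counting, sketched here:

```python
FAILURE_THRESHOLD = 3  # consecutive failed checks before failover

class HealthTracker:
    """Mirrors Route53's rule: the endpoint is considered unhealthy
    (triggering failover) after 3 failed checks in a row."""
    def __init__(self):
        self.consecutive_failures = 0

    def record(self, check_passed):
        """Record one health check; return True while still healthy."""
        self.consecutive_failures = 0 if check_passed else self.consecutive_failures + 1
        return self.consecutive_failures < FAILURE_THRESHOLD
```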
Cost Optimization
Estimated Monthly Costs (Production)
| Service | Configuration | Monthly Cost (USD) |
|---|---|---|
| EKS Cluster | Control plane (2 regions) | $146 |
| EC2 Instances (Karpenter) | 3x t3.large (spot 70%) | $76 |
| Aurora Global Database | 2x db.r6g.large | $438 |
| DynamoDB | On-Demand (10GB, 1M reads/writes) | $26 |
| ALB | 2 load balancers | $33 |
| NAT Gateway | 2 regions × 3 AZs | $194 |
| Data Transfer | Cross-region, internet egress | $50 |
| CloudWatch | Logs, metrics, alarms | $25 |
| Route53 | Hosted zone + health checks | $1 |
| Secrets Manager | 6 secrets | $2.40 |
| TOTAL | | ~$991 |
Cost Optimization Strategies
1. Compute Optimization
Use Karpenter with Spot Instances
Savings: 60-70% reduction on compute costs
```yaml
# Karpenter Provisioner (excerpt)
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi
```
2. Database Optimization
- Aurora Serverless v2: Auto-scales based on load (future consideration)
- Read Replicas: Add read replicas only when needed
- Right-Sizing: Start with db.t4g.medium for dev/staging
3. Storage Optimization
- DynamoDB On-Demand: Pay only for actual reads/writes (no provisioned capacity waste)
- TTL: Auto-delete delivery status after 30 days
- S3 Lifecycle: Move CloudWatch logs to Glacier after 90 days
4. Network Optimization
NAT Gateway is Expensive
Alternative: Use VPC Endpoints for AWS services (S3, DynamoDB, Secrets Manager)
Savings: ~$30-50/month by eliminating NAT Gateway traffic to AWS services
5. Monitoring Optimization
- Log Retention: 7 days for debug logs, 30 days for error logs
- Metric Filters: Only send necessary metrics to CloudWatch
- Sampling: Sample debug logs (10%) instead of 100%
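For the sampling point, hash-based selection is preferable to a bare `random()` check because every debug line for one request is kept or dropped together — a sketch assuming a per-request trace ID is available:

```python
import zlib

SAMPLE_RATE = 0.10  # keep 10% of debug logs, as above

def should_sample(trace_id, rate=SAMPLE_RATE):
    """Deterministic hash-based sampling: the same trace_id always maps to
    the same keep/drop decision, so sampled requests stay fully logged."""
    bucket = zlib.crc32(trace_id.encode()) % 10_000
    return bucket < rate * 10_000
```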
Cost by Environment
| Environment | Configuration | Monthly Cost |
|---|---|---|
| Development | Single region, t3.medium, db.t4g.medium, no multi-AZ | ~$180 |
| Staging | Single region, t3.large, db.r6g.large, multi-AZ | ~$400 |
| Production | Multi-region, spot instances, db.r6g.large, HA | ~$991 |