BizComms Overview

A production-grade, multi-region alert notification system built on AWS with cell-based architecture, demonstrating enterprise patterns for failure domain isolation, strong data consistency, and automated operations.

At a glance: 2 AWS Regions · 3 Microservices · 11 API Endpoints · <1s DB Replication

What is BizComms?

BizComms is a real-time alert notification system that monitors HackerNews for new articles and sends SMS and voice notifications to subscribers via Twilio. It's designed to demonstrate:

Cell-Based Architecture

Failure domain isolation with independent regional deployments (cells)

Multi-Region Design

Active-passive deployment across us-east-1 and us-west-1 with automatic failover

Strong Consistency

Aurora Global Database with <1s replication lag for transactional data

Event-Driven

SNS/SQS messaging with at-least-once delivery guarantees

Key Features

  • Infrastructure as Code: Complete Terraform modules for AWS
  • Kubernetes Orchestration: EKS with Karpenter autoscaling
  • CI/CD Pipelines: GitHub Actions for automated deployments
  • Observability: CloudWatch metrics, logs, and alarms
  • Security Best Practices: IRSA, Secrets Manager, WAF
  • High Availability: Multi-AZ deployment with health checks
  • Auto-scaling: HPA for pods, Karpenter for nodes
  • Cost Optimized: Spot instances, on-demand DynamoDB

System Architecture

High-Level Architecture

graph TB
    subgraph Internet
        Users[Users/Clients]
        HN[HackerNews API]
    end
    subgraph "Route53 Global DNS"
        R53["Route53<br/>Latency-Based Routing"]
    end
    Users --> R53
    subgraph "us-east-1 PRIMARY"
        subgraph "Cell: cell-use1-01"
            WAF1[WAF + ALB]
            EKS1[EKS Cluster]
            API1[API Pods x3]
            Worker1[Worker Pods x2]
            Poller1[Poller Pod x1]
            Aurora1[(Aurora Primary)]
            DDB1[(DynamoDB)]
            SQS1[SQS Queue]
            SNS1[SNS Topic]
            WAF1 --> API1
            API1 --> Aurora1
            API1 --> SNS1
            SNS1 --> SQS1
            SQS1 --> Worker1
            Worker1 --> Aurora1
            Worker1 --> DDB1
            Poller1 --> API1
            HN -.-> Poller1
        end
    end
    subgraph "us-west-1 SECONDARY"
        subgraph "Cell: cell-usw1-01"
            WAF2[WAF + ALB]
            EKS2[EKS Cluster]
            API2[API Pods x3]
            Worker2[Worker Pods x2]
            Poller2[Poller Pod x1]
            Aurora2[("Aurora Replica<br/>Read-Only")]
            DDB2[(DynamoDB)]
            SQS2[SQS Queue]
            SNS2[SNS Topic]
            WAF2 --> API2
            API2 --> Aurora2
            API2 --> SNS2
            SNS2 --> SQS2
            SQS2 --> Worker2
            Worker2 --> Aurora2
            Worker2 --> DDB2
            Poller2 --> API2
            HN -.-> Poller2
        end
    end
    R53 --> WAF1
    R53 -.Failover.-> WAF2
    Aurora1 -.Global DB Replication.-> Aurora2
    DDB1 -.Global Tables Replication.-> DDB2
    Worker1 --> Twilio["Twilio API<br/>SMS & Voice"]
    Worker2 --> Twilio
    style API1 fill:#0d6efd,color:#fff
    style API2 fill:#0d6efd,color:#fff
    style Worker1 fill:#198754,color:#fff
    style Worker2 fill:#198754,color:#fff
    style Aurora1 fill:#dc3545,color:#fff
    style Aurora2 fill:#ffc107,color:#000
    style R53 fill:#0dcaf0,color:#000

Cell-Based Architecture

Each region operates as an independent cell with complete isolation. This design limits the blast radius of failures and enables independent scaling.

What is a Cell?

A cell is a self-contained deployment unit with its own compute, storage, and messaging infrastructure. Cells share global data via Aurora Global Database and DynamoDB Global Tables but operate independently.

Cell Components

Component      | Type      | Purpose                                  | Replication
VPC            | Network   | Isolated network per region              | None (regional)
EKS Cluster    | Compute   | Kubernetes orchestration                 | None (regional)
Aurora Cluster | Storage   | Transactional data (alerts, recipients)  | <1s (Global DB)
DynamoDB Table | Storage   | Delivery status tracking                 | Eventually consistent (Global Tables)
SQS Queue      | Messaging | Alert processing queue                   | None (regional)
SNS Topic      | Messaging | Event pub/sub                            | None (regional)

Network Architecture

graph TB
    subgraph "VPC: 10.0.0.0/16"
        subgraph "Public Subnets"
            PUB1["Public-1a<br/>10.0.1.0/24"]
            PUB2["Public-1b<br/>10.0.2.0/24"]
            PUB3["Public-1c<br/>10.0.3.0/24"]
            NAT1[NAT Gateway]
            NAT2[NAT Gateway]
            ALB[Application LB]
        end
        subgraph "Private Subnets"
            PRIV1["Private-1a<br/>10.0.11.0/24"]
            PRIV2["Private-1b<br/>10.0.12.0/24"]
            PRIV3["Private-1c<br/>10.0.13.0/24"]
            EKS_PODS[EKS Pods]
        end
        subgraph "Database Subnets"
            DB1["Database-1a<br/>10.0.21.0/24"]
            DB2["Database-1b<br/>10.0.22.0/24"]
            DB3["Database-1c<br/>10.0.23.0/24"]
            AURORA[(Aurora)]
        end
        IGW[Internet Gateway]
    end
    Internet[Internet] --> IGW
    IGW --> ALB
    ALB --> EKS_PODS
    EKS_PODS --> NAT1
    EKS_PODS --> NAT2
    NAT1 --> IGW
    NAT2 --> IGW
    EKS_PODS --> AURORA
    style PUB1 fill:#d4edda,color:#000
    style PUB2 fill:#d4edda,color:#000
    style PUB3 fill:#d4edda,color:#000
    style PRIV1 fill:#cfe2ff,color:#000
    style PRIV2 fill:#cfe2ff,color:#000
    style PRIV3 fill:#cfe2ff,color:#000
    style DB1 fill:#f8d7da,color:#000
    style DB2 fill:#f8d7da,color:#000
    style DB3 fill:#f8d7da,color:#000

Subnet Design

Tier     | CIDRs (AZs 1a, 1b, 1c)                     | Workloads
Public   | 10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24      | ALB, NAT Gateways
Private  | 10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24   | EKS pods
Database | 10.0.21.0/24, 10.0.22.0/24, 10.0.23.0/24   | Aurora

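The CIDR layout shown in the network diagram can be sanity-checked with Python's standard ipaddress module; a small sketch (the CIDRs are the ones from the diagram, the helper name is illustrative):

```python
import ipaddress

# CIDRs as shown in the network diagram: three tiers across three AZs
VPC = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public":   ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"],
    "private":  ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"],
    "database": ["10.0.21.0/24", "10.0.22.0/24", "10.0.23.0/24"],
}

def validate_subnets() -> bool:
    """Every subnet must fall inside the VPC CIDR, and no two may overlap."""
    nets = [ipaddress.ip_network(c) for tier in SUBNETS.values() for c in tier]
    inside = all(n.subnet_of(VPC) for n in nets)
    disjoint = all(not a.overlaps(b)
                   for i, a in enumerate(nets) for b in nets[i + 1:])
    return inside and disjoint
```

A check like this can run in CI against the Terraform variables to catch CIDR typos before an apply.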
Infrastructure Components

AWS Services Used

Networking

  • VPC: 10.0.0.0/16 with 3 subnet tiers across 3 AZs
  • Route53: Latency-based routing with health checks
  • ALB: Application Load Balancer with target groups
  • WAF: Web Application Firewall with managed rules
  • NAT Gateway: Outbound internet access for private subnets

Compute

  • EKS: Kubernetes 1.28+ with managed node groups
  • Karpenter: Node autoscaling with spot instance support
  • Fargate: Serverless pods for system components (optional)

Data Storage

  • Aurora PostgreSQL: Global Database with primary in us-east-1
  • DynamoDB: Global Tables for delivery status
  • S3: Terraform state, CloudWatch log archive

Messaging

  • SQS: Alert processing queue with DLQ
  • SNS: Event topics for alert events and delivery status

Security

  • Secrets Manager: Twilio credentials, DB passwords, API keys
  • KMS: Encryption keys for data at rest
  • IAM: IRSA (IAM Roles for Service Accounts)
  • Security Groups: Network-level firewall rules

Observability

  • CloudWatch Logs: Centralized logging for all services
  • CloudWatch Metrics: Custom metrics and dashboards
  • CloudWatch Alarms: Automated alerting

Terraform Module Structure

graph LR
    subgraph "Environment: us-east-1"
        ENV1[main.tf]
    end
    subgraph "Reusable Modules"
        VPC[VPC Module]
        AURORA[Aurora Module]
        DDB[DynamoDB Module]
        EKS[EKS Module]
        SQS_SNS[SQS/SNS Module]
        WAF[WAF Module]
        ROUTE53[Route53 Module]
        SECRETS[Secrets Module]
        MONITOR[Monitoring Module]
    end
    ENV1 --> VPC
    ENV1 --> AURORA
    ENV1 --> DDB
    ENV1 --> EKS
    ENV1 --> SQS_SNS
    ENV1 --> WAF
    ENV1 --> ROUTE53
    ENV1 --> SECRETS
    ENV1 --> MONITOR
    style ENV1 fill:#0d6efd,color:#fff
    style VPC fill:#198754,color:#fff
    style AURORA fill:#dc3545,color:#fff
    style EKS fill:#ffc107,color:#000

Resource Sizing

Resource        | Development   | Production    | Notes
Aurora Instance | db.t4g.medium | db.r6g.large  | 2 vCPU, 16 GB RAM (prod)
EKS Node Group  | t3.medium     | t3.large      | Karpenter manages scaling
DynamoDB        | On-Demand     | On-Demand     | Auto-scales based on traffic
NAT Gateway     | 1 per AZ      | 1 per AZ      | High availability

Microservices

Service Architecture

graph LR
    subgraph "API Service"
        API["FastAPI<br/>Port 8000"]
        API_ROUTES[11 Endpoints]
        API_DB[(Aurora)]
    end
    subgraph "Worker Service"
        WORKER[Async Worker]
        WORKER_SQS[SQS Consumer]
        WORKER_TWILIO[Twilio Client]
        WORKER_DDB[(DynamoDB)]
    end
    subgraph "Poller Service"
        POLLER[Async Poller]
        POLLER_HN[HN API Client]
        POLLER_STATE[(State Storage)]
    end
    CLIENT[Clients] --> API
    API --> API_DB
    API --> SNS[SNS Topic]
    SNS --> SQS[SQS Queue]
    SQS --> WORKER_SQS
    WORKER --> WORKER_TWILIO
    WORKER --> WORKER_DDB
    POLLER --> POLLER_HN
    POLLER --> API
    POLLER --> POLLER_STATE
    style API fill:#0d6efd,color:#fff
    style WORKER fill:#198754,color:#fff
    style POLLER fill:#ffc107,color:#000

1. API Service

Technology: FastAPI (Python 3.11), SQLAlchemy 2.0 (async), Pydantic v2

Purpose: RESTful API for managing alerts and recipients

Endpoints

Method | Path                    | Description             | Response
GET    | /health                 | Health check            | 200 OK
POST   | /v1/alerts              | Create new alert        | 202 Accepted
GET    | /v1/alerts              | List alerts (paginated) | 200 OK
GET    | /v1/alerts/{id}         | Get alert details       | 200 OK
GET    | /v1/alerts/{id}/status  | Get delivery status     | 200 OK
PUT    | /v1/alerts/{id}         | Update alert            | 200 OK
DELETE | /v1/alerts/{id}         | Cancel alert            | 204 No Content
POST   | /v1/recipients          | Create recipient        | 201 Created
GET    | /v1/recipients          | List recipients         | 200 OK
GET    | /v1/recipients/{id}     | Get recipient details   | 200 OK
PUT    | /v1/recipients/{id}     | Update recipient        | 200 OK

Configuration

# Kubernetes Deployment
replicas: 3
resources:
  requests:
    memory: 256Mi
    cpu: 250m
  limits:
    memory: 512Mi
    cpu: 500m

# Auto-scaling
minReplicas: 2
maxReplicas: 10
targetCPU: 70%
targetMemory: 80%
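The HPA settings above (2–10 replicas, 70% CPU target) follow the standard Kubernetes scaling rule: desired replicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A small illustrative sketch of that arithmetic (the function name is ours, not a Kubernetes API):

```python
import math

def hpa_desired_replicas(current: int, current_cpu_pct: float,
                         target_cpu_pct: float = 70.0,
                         min_replicas: int = 2, max_replicas: int = 10) -> int:
    """HPA rule: desired = ceil(current * current/target), clamped to [min, max]."""
    desired = math.ceil(current * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 pods averaging 140% of their CPU request scale out to 6; a quiet deployment shrinks back down but never below the 2-replica floor.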

2. Worker Service

Technology: Python 3.11 (async), boto3, Twilio SDK

Purpose: Process alert messages from SQS and deliver via Twilio

Processing Flow

sequenceDiagram
    participant SQS as SQS Queue
    participant Worker as Worker Pod
    participant Aurora as Aurora DB
    participant Twilio as Twilio API
    participant DDB as DynamoDB
    SQS->>Worker: Poll messages (long polling)
    Worker->>Aurora: Fetch alert details
    Aurora-->>Worker: Alert + Recipients
    loop For each recipient
        Worker->>Twilio: Send SMS/Voice
        Twilio-->>Worker: Delivery SID
        Worker->>DDB: Store delivery status
    end
    Worker->>SQS: Delete message
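The poll-process-delete loop can be sketched with boto3 (a minimal sketch, assuming standard SNS→SQS fan-out, which wraps each payload in an SNS envelope the worker must unwrap; the helper names are ours):

```python
import json

def parse_alert_event(raw_body: str) -> dict:
    """Unwrap the SNS envelope that SNS->SQS fan-out adds around the payload."""
    envelope = json.loads(raw_body)
    return json.loads(envelope["Message"])

def consume(queue_url: str) -> None:
    import boto3  # deferred import so the pure helper above has no AWS dependency
    sqs = boto3.client("sqs")
    resp = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)  # long polling
    for msg in resp.get("Messages", []):
        event = parse_alert_event(msg["Body"])
        # ... fetch recipients from Aurora, send via Twilio, write status to DynamoDB ...
        sqs.delete_message(QueueUrl=queue_url,
                          ReceiptHandle=msg["ReceiptHandle"])  # only after success
```

Deleting the message only after successful processing is what makes the visibility-timeout retry behavior described below work.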

Error Handling

If the worker fails before deleting a message, the SQS visibility timeout expires and the message is redelivered; after the maximum receive count it moves to the dead-letter queue (DLQ) for inspection. Duplicate redeliveries are absorbed by the idempotency check (see Messaging Patterns below).

3. Poller Service

Technology: Python 3.11 (async), httpx

Purpose: Monitor HackerNews API and create alerts for new stories

Polling Logic

# Every 5 minutes:
1. Fetch top 30 stories from HackerNews API
2. Compare with previous state (stored in DynamoDB)
3. For each new story:
   - Fetch story details
   - POST to /v1/alerts endpoint
   - Update state in DynamoDB
4. Sleep until next poll interval

Alternative: Lambda + EventBridge

The poller could instead run as a Lambda function triggered on an Amazon EventBridge schedule (cron: every 5 minutes). This would be more cost-effective (~$0.20/month vs ~$15/month for a continuously running pod).
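The polling steps above can be sketched in a few lines (a sketch, not the actual poller: the topstories URL is the official HackerNews API endpoint, httpx is the client named in the stack above, and persistence of `seen_ids` to DynamoDB is left to the caller):

```python
def find_new_stories(top_ids: list[int], seen_ids: set[int], limit: int = 30) -> list[int]:
    """Return story IDs in the current top-N that have not been alerted on yet."""
    return [sid for sid in top_ids[:limit] if sid not in seen_ids]

def poll_once(seen_ids: set[int]) -> list[int]:
    import httpx  # deferred; httpx is the poller's HTTP client per the stack above
    top_ids = httpx.get(
        "https://hacker-news.firebaseio.com/v0/topstories.json", timeout=10
    ).json()
    new_ids = find_new_stories(top_ids, seen_ids)
    seen_ids.update(new_ids)  # caller persists the updated set to DynamoDB
    return new_ids            # each new ID becomes a POST /v1/alerts call
```

Keeping the diff logic (`find_new_stories`) pure makes it trivial to unit-test without network access.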

Data Flow & Patterns

End-to-End Alert Flow

sequenceDiagram
    autonumber
    participant Client
    participant API
    participant Aurora
    participant SNS
    participant SQS
    participant Worker
    participant Twilio
    participant DDB
    Client->>API: POST /v1/alerts
    API->>Aurora: INSERT INTO alerts
    Aurora-->>API: Alert ID
    API->>SNS: Publish alert event
    API-->>Client: 202 Accepted (alert_id)
    SNS->>SQS: Fan-out to queue
    SQS->>Worker: Poll message
    Worker->>Aurora: SELECT recipients
    Aurora-->>Worker: Recipient list
    loop Each recipient
        Worker->>Twilio: Send SMS/Voice
        Twilio-->>Worker: Delivery SID + Status
        Worker->>DDB: PUT delivery status
    end
    Worker->>SQS: Delete message
    Note over Client,DDB: Client can query GET /v1/alerts/{id}/status
    Client->>API: GET /v1/alerts/{id}/status
    API->>DDB: Query delivery status
    DDB-->>API: Status records
    API-->>Client: Delivery statuses

Data Consistency Model

Strong Consistency (Aurora)

Used for transactional data requiring ACID guarantees: alerts and recipients, where a read immediately after a write must return the latest state (e.g., the worker fetching the recipient list for a just-created alert).

Eventual Consistency (DynamoDB)

Used for high-volume, time-series data: per-recipient delivery status records, where a short cross-region replication delay is acceptable.

Messaging Patterns

1. Fan-Out Pattern (SNS → SQS)

graph LR
    API[API Service] --> SNS["SNS Topic:<br/>alert-events"]
    SNS --> SQS1["SQS Queue:<br/>alert-processing"]
    SNS -.Future.-> SQS2["SQS Queue:<br/>analytics"]
    SNS -.Future.-> SQS3["SQS Queue:<br/>webhooks"]
    SQS1 --> Worker1[Worker Pods]
    SQS2 -.-> Analytics[Analytics Service]
    SQS3 -.-> Webhook[Webhook Service]
    style SNS fill:#ffc107,color:#000
    style SQS1 fill:#0d6efd,color:#fff
    style Worker1 fill:#198754,color:#fff

2. At-Least-Once Delivery

SQS guarantees at-least-once delivery with these mechanisms:

  • Visibility timeout: a received message that is not deleted in time becomes visible again and is redelivered
  • Explicit deletion: the worker deletes a message only after processing succeeds
  • Dead-letter queue: messages that exceed the maximum receive count move to the DLQ for inspection

3. Idempotency

# Worker ensures idempotent processing
import json

def handle_message(message: dict) -> None:
    message_id = message["MessageId"]
    alert_id = json.loads(message["Body"])["alert_id"]

    # Check if already processed (delivery status keyed by message ID in DynamoDB)
    if delivery_status_exists(message_id):
        logger.info(f"Message {message_id} already processed")
        return

    # Process, then store status under the same key
    status = process_alert(alert_id)
    store_delivery_status(message_id, status)

Regional Failover

graph TB
    subgraph "Normal Operation"
        R53_N[Route53] --> USE1_N["us-east-1<br/>PRIMARY<br/>✓ Healthy"]
        R53_N -.Standby.-> USW1_N["us-west-1<br/>SECONDARY"]
    end
    subgraph "After Failover"
        R53_F[Route53] -.Failed.-> USE1_F["us-east-1<br/>✗ Unhealthy"]
        R53_F --> USW1_F["us-west-1<br/>PROMOTED<br/>✓ Healthy"]
    end
    style USE1_N fill:#198754,color:#fff
    style USW1_F fill:#198754,color:#fff
    style USE1_F fill:#dc3545,color:#fff

Failover Process:

  1. Route53 health check detects primary region failure
  2. DNS automatically routes traffic to secondary region
  3. Aurora Global Database promotes secondary to primary (manual)
  4. Applications in secondary region continue serving requests
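The routing decision in steps 1–2 reduces to a simple policy; a toy sketch (illustrative only, not Route53's implementation):

```python
def pick_region(healthy: dict[str, bool],
                primary: str = "us-east-1",
                secondary: str = "us-west-1") -> str:
    """Failover routing: serve from the primary while its health check passes."""
    return primary if healthy.get(primary, False) else secondary
```

Note that step 3 (Aurora promotion) is deliberately manual here: DNS failover happens automatically, but the secondary database stays read-only until an operator promotes it.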

API Reference

Authentication

API Key Required: Include X-API-Key header in all requests

curl -H "X-API-Key: your-api-key-here" \
  https://api.bizcomms.example.com/v1/alerts

Create Alert

POST /v1/alerts

Request Body

{
  "title": "New HackerNews Story",
  "message": "Check out this awesome article!",
  "priority": "high",
  "delivery_method": "sms",
  "recipient_filters": {
    "tags": ["tech", "news"],
    "active_only": true
  }
}

Response (202 Accepted)

{
  "alert_id": "alert_abc123",
  "status": "queued",
  "created_at": "2025-01-01T12:00:00Z",
  "estimated_recipients": 42
}
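The recipient_filters block in the request selects the audience. A sketch of plausible matching logic (a hypothetical helper; the source does not specify whether tags match on any overlap or require all tags — we assume any overlap here):

```python
def match_recipients(recipients: list[dict], filters: dict) -> list[dict]:
    """Select recipients whose tags overlap the filter tags (and are active if asked)."""
    wanted = set(filters.get("tags", []))
    active_only = filters.get("active_only", False)
    matched = []
    for r in recipients:
        if active_only and not r.get("active", False):
            continue
        if wanted and not wanted & set(r.get("tags", [])):
            continue
        matched.append(r)
    return matched
```

The length of the matched list is what feeds the estimated_recipients field in the 202 response.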

Get Alert Status

GET /v1/alerts/{id}/status

Response (200 OK)

{
  "alert_id": "alert_abc123",
  "total_recipients": 42,
  "delivery_status": {
    "sent": 40,
    "delivered": 38,
    "failed": 2,
    "pending": 0
  },
  "deliveries": [
    {
      "recipient_id": "recip_123",
      "phone_number": "+1234567890",
      "status": "delivered",
      "twilio_sid": "SM1234567890abcdef",
      "sent_at": "2025-01-01T12:00:05Z",
      "delivered_at": "2025-01-01T12:00:07Z"
    }
  ]
}
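The aggregate delivery_status block can be rolled up from the per-recipient records in DynamoDB; a minimal sketch (helper name is ours):

```python
from collections import Counter

def summarize_deliveries(deliveries: list[dict]) -> dict:
    """Roll per-recipient delivery records up into the aggregate status counts."""
    counts = Counter(d["status"] for d in deliveries)
    return {s: counts.get(s, 0) for s in ("sent", "delivered", "failed", "pending")}
```

This keeps the API read path cheap: one DynamoDB query per status request, aggregated in memory.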

Create Recipient

POST /v1/recipients

Request Body

{
  "name": "John Doe",
  "phone_number": "+1234567890",
  "preferences": {
    "sms_enabled": true,
    "voice_enabled": false
  },
  "tags": ["tech", "news"],
  "active": true
}

Response (201 Created)

{
  "recipient_id": "recip_123",
  "name": "John Doe",
  "phone_number": "+1234567890",
  "created_at": "2025-01-01T12:00:00Z"
}

Error Responses

Status Code | Error                 | Description
400         | Bad Request           | Invalid request body or parameters
401         | Unauthorized          | Missing or invalid API key
404         | Not Found             | Resource not found
429         | Too Many Requests     | Rate limit exceeded (WAF)
500         | Internal Server Error | Server-side error
503         | Service Unavailable   | Service temporarily unavailable

Deployment Guide

Prerequisites

  • AWS account with credentials configured for the AWS CLI
  • Terraform, Docker, kubectl, and Helm installed locally
  • Twilio account credentials (stored in Secrets Manager)
  • Route53 hosted zone for the API domain

Deployment Steps

Step 1: Terraform Backend Setup

cd terraform/backend-setup
terraform init
terraform apply

# Output: S3 bucket and DynamoDB table for state

Step 2: Deploy Primary Region (us-east-1)

cd ../environments/us-east-1

# Initialize Terraform
terraform init

# Review plan
terraform plan

# Apply infrastructure
terraform apply

# Capture outputs
terraform output -json > outputs.json

Step 3: Deploy Secondary Region (us-west-1)

cd ../us-west-1

terraform init
terraform plan
terraform apply

terraform output -json > outputs.json

Step 4: Build and Push Docker Images

# Authenticate to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

# Build images
docker build -f docker/Dockerfile.api -t bizcomms-api:latest .
docker build -f docker/Dockerfile.worker -t bizcomms-worker:latest .
docker build -f docker/Dockerfile.poller -t bizcomms-poller:latest .

# Tag and push
docker tag bizcomms-api:latest ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/bizcomms-api:latest
docker push ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/bizcomms-api:latest

# Repeat for worker and poller

Step 5: Deploy Kubernetes Applications

# Configure kubectl
aws eks update-kubeconfig --region us-east-1 --name bizcomms-cell-use1-01-eks

# Get WAF ARN from Terraform
WAF_ARN=$(cd terraform/environments/us-east-1 && terraform output -raw waf_web_acl_arn)

# Deploy with Helm
helm install bizcomms ./helm/bizcomms \
  -f helm/bizcomms/values-us-east-1.yaml \
  --set ingress.wafAclArn="$WAF_ARN" \
  --namespace bizcomms \
  --create-namespace

# Verify deployment
kubectl get pods -n bizcomms
kubectl get ingress -n bizcomms

Step 6: Update Route53

# Get ALB DNS name
ALB_DNS=$(kubectl get ingress bizcomms-ingress -n bizcomms -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# Update Terraform variables
cd terraform/environments/us-east-1
echo "alb_dns_name = \"$ALB_DNS\"" >> terraform.tfvars

# Re-apply to update Route53
terraform apply -target=module.route53

Deployment Architecture

graph TB
    START[Start] --> BACKEND[1. Terraform Backend]
    BACKEND --> TF_USE1[2. Terraform us-east-1]
    TF_USE1 --> TF_USW1[3. Terraform us-west-1]
    TF_USW1 --> DOCKER[4. Build Docker Images]
    DOCKER --> K8S_USE1[5. Deploy K8s us-east-1]
    K8S_USE1 --> K8S_USW1[6. Deploy K8s us-west-1]
    K8S_USW1 --> R53[7. Update Route53]
    R53 --> VERIFY[8. Verify Deployment]
    VERIFY --> END[Complete]
    style START fill:#198754,color:#fff
    style END fill:#198754,color:#fff
    style TF_USE1 fill:#0d6efd,color:#fff
    style K8S_USE1 fill:#ffc107,color:#000

Verification

# Check API health
curl https://api.bizcomms.example.com/health

# Create test alert
curl -X POST https://api.bizcomms.example.com/v1/alerts \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Test Alert",
    "message": "Hello from BizComms!",
    "priority": "high",
    "delivery_method": "sms"
  }'

# Check logs
kubectl logs -n bizcomms -l app=bizcomms-api --tail=100
kubectl logs -n bizcomms -l app=bizcomms-worker --tail=100

Security

Security Architecture

graph TB
    subgraph "Network Security"
        WAF["WAF<br/>SQL Injection<br/>XSS Protection<br/>Rate Limiting"]
        ALB["ALB<br/>SSL Termination"]
        SG["Security Groups<br/>Firewall Rules"]
    end
    subgraph "Identity & Access"
        IRSA["IRSA<br/>Pod IAM Roles"]
        SM["Secrets Manager<br/>Credential Storage"]
        KMS["KMS<br/>Encryption Keys"]
    end
    subgraph "Data Protection"
        TLS[TLS in Transit]
        ENC_REST[Encryption at Rest]
        BACKUP[Automated Backups]
    end
    Internet[Internet] --> WAF
    WAF --> ALB
    ALB --> SG
    SG --> PODS[EKS Pods]
    PODS --> IRSA
    IRSA --> SM
    SM --> KMS
    style WAF fill:#dc3545,color:#fff
    style IRSA fill:#198754,color:#fff
    style KMS fill:#ffc107,color:#000

Security Best Practices Implemented

1. No Hardcoded Credentials

All secrets (Twilio credentials, database passwords, API keys) live in Secrets Manager; pods fetch them at runtime through IRSA-scoped IAM permissions, so nothing sensitive is baked into images, manifests, or Terraform code.

2. Least Privilege IAM

// API Pod Role
{
  "Effect": "Allow",
  "Action": [
    "secretsmanager:GetSecretValue",
    "sns:Publish"
  ],
  "Resource": [
    "arn:aws:secretsmanager:*:*:secret:bizcomms-*",
    "arn:aws:sns:*:*:bizcomms-*"
  ]
}

3. Network Isolation

Component | Subnet Type | Internet Access
ALB       | Public      | Direct (IGW)
EKS Pods  | Private     | Via NAT Gateway
Aurora    | Database    | None

4. Encryption

Data is encrypted in transit (TLS terminated at the ALB) and at rest (Aurora, DynamoDB, and S3 encrypted with KMS keys).

5. WAF Protection

# Managed Rules Applied
- AWSManagedRulesSQLiRuleSet    # SQL injection
- AWSManagedRulesKnownBadInputsRuleSet  # Known threats
- AWSManagedRulesCommonRuleSet  # OWASP Top 10

# Rate Limiting
- 1000 requests per 5 minutes per IP
- Custom response: 429 Too Many Requests
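The WAF rate rule (1000 requests per 5 minutes per IP) can be illustrated as a sliding-window counter; an in-process sketch, not how WAF is implemented:

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

WINDOW_SECONDS = 300   # 5 minutes, matching the WAF rule above
LIMIT = 1000           # requests per window per IP

_hits: Dict[str, Deque[float]] = defaultdict(deque)

def allow_request(ip: str, now: Optional[float] = None) -> bool:
    """Sliding-window check mirroring the WAF rate rule (illustrative only)."""
    now = time.time() if now is None else now
    window = _hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()          # expire hits older than the window
    if len(window) >= LIMIT:
        return False              # WAF would answer 429 Too Many Requests
    window.append(now)
    return True
```

In production the check happens at the edge in WAF, so rejected traffic never reaches the pods at all.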

6. Audit Logging

API, worker, and poller activity is logged centrally to CloudWatch Logs, and CloudWatch Alarms flag anomalous error rates for review.

Security Checklist

Infrastructure

  • VPC with private subnets
  • Security groups (least privilege)
  • WAF with managed rules
  • NACLs for subnet protection

Application

  • IRSA for pod authentication
  • Secrets Manager integration
  • Input validation (Pydantic)
  • Container image scanning (Trivy)

Monitoring & Observability

Observability Stack

graph TB
    subgraph "Data Collection"
        LOGS[CloudWatch Logs]
        METRICS[CloudWatch Metrics]
        TRACES[X-Ray Traces]
    end
    subgraph "Visualization"
        DASH[CloudWatch Dashboards]
        INSIGHTS[Log Insights]
    end
    subgraph "Alerting"
        ALARMS[CloudWatch Alarms]
        SNS_ALERT[SNS Topic]
        EMAIL[Email/SMS]
    end
    API[API Service] --> LOGS
    WORKER[Worker Service] --> LOGS
    POLLER[Poller Service] --> LOGS
    API --> METRICS
    WORKER --> METRICS
    EKS[EKS Cluster] --> METRICS
    AURORA[(Aurora)] --> METRICS
    LOGS --> INSIGHTS
    METRICS --> DASH
    METRICS --> ALARMS
    ALARMS --> SNS_ALERT
    SNS_ALERT --> EMAIL
    style LOGS fill:#ffc107,color:#000
    style DASH fill:#0d6efd,color:#fff
    style ALARMS fill:#dc3545,color:#fff

CloudWatch Dashboards

Automated dashboards created by the Terraform monitoring module:

1. API Service Dashboard

2. Worker Service Dashboard

3. Infrastructure Dashboard

CloudWatch Alarms

Alarm Name             | Metric          | Threshold      | Action
API High Error Rate    | 5xx errors      | > 10/minute    | SNS notification
SQS Queue Backup       | Queue depth     | > 100 messages | SNS + auto-scale workers
Aurora High CPU        | CPU utilization | > 80%          | SNS notification
Aurora Replication Lag | Replication lag | > 5 seconds    | SNS notification
EKS Node Failure       | Node count      | < 2 nodes      | SNS notification

Log Aggregation

# Query API logs for errors
aws logs tail /aws/eks/bizcomms-cell-use1-01/api --follow \
  --filter-pattern "ERROR"

# CloudWatch Insights query
fields @timestamp, @message
| filter @message like /5xx/
| stats count() by bin(5m)

Health Checks

Kubernetes Probes

# Liveness Probe
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10

# Readiness Probe
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
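The probes above just need a fast, dependency-free 200 from /health on the container port. A minimal stdlib sketch of that contract (the real service serves it from FastAPI; this handler is illustrative):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal /health endpoint matching the probe contract above."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # silence default per-request logging
```

A readiness probe should additionally verify downstream dependencies (database connectivity) before reporting ready, which is why its initial delay is shorter but its checks are logically stricter.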

Route53 Health Checks

Route53 probes each region's /health endpoint through the ALB; when the primary region's checks fail, DNS routing fails over to the secondary region (see Regional Failover above).

Cost Optimization

Estimated Monthly Costs (Production)

Service                   | Configuration                      | Monthly Cost (USD)
EKS Cluster               | Control plane (2 regions)          | $146
EC2 Instances (Karpenter) | 3x t3.large (spot 70%)             | $76
Aurora Global Database    | 2x db.r6g.large                    | $438
DynamoDB                  | On-Demand (10 GB, 1M reads/writes) | $26
ALB                       | 2 load balancers                   | $33
NAT Gateway               | 2 regions × 3 AZs                  | $194
Data Transfer             | Cross-region, internet egress      | $50
CloudWatch                | Logs, metrics, alarms              | $25
Route53                   | Hosted zone + health checks        | $1
Secrets Manager           | 6 secrets                          | $2.40
TOTAL                     |                                    | ~$991/month
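A quick arithmetic check that the line items above actually sum to the stated total:

```python
monthly_usd = {
    "EKS control plane (2 regions)": 146.00,
    "EC2 via Karpenter (spot)": 76.00,
    "Aurora Global Database": 438.00,
    "DynamoDB on-demand": 26.00,
    "ALB (2)": 33.00,
    "NAT Gateway": 194.00,
    "Data transfer": 50.00,
    "CloudWatch": 25.00,
    "Route53": 1.00,
    "Secrets Manager": 2.40,
}
total = round(sum(monthly_usd.values()), 2)  # ≈ $991/month
```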

Cost Optimization Strategies

1. Compute Optimization

Use Karpenter with Spot Instances

Savings: 60-70% reduction in compute costs

# Karpenter Provisioner
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi

2. Database Optimization

3. Storage Optimization

4. Network Optimization

NAT Gateway is Expensive

Alternative: Use VPC Endpoints for AWS services (S3, DynamoDB, Secrets Manager)

Savings: ~$30-50/month by eliminating NAT Gateway traffic to AWS services

5. Monitoring Optimization

Cost by Environment

Environment | Configuration                                         | Monthly Cost
Development | Single region, t3.medium, db.t4g.medium, no multi-AZ  | ~$180
Staging     | Single region, t3.large, db.r6g.large, multi-AZ       | ~$400
Production  | Multi-region, spot instances, db.r6g.large, HA        | ~$991