AWS Architecture Guide

Building Scalable, Secure Cloud Infrastructure

Guide Overview

This guide covers AWS architecture from the ground up: account organization, networking, compute, storage, databases, and security. Each section includes Terraform examples, comparison tables, and real-world architecture patterns for senior/staff engineer interviews.

Cross-references: See Messaging Systems for SQS, SNS, Kinesis, EventBridge details.

1. AWS Organizations & Account Structure

1.1 Multi-Account Strategy

Why Multiple Accounts?

Separate AWS accounts are the strongest isolation boundary AWS provides: they contain blast radius, separate billing, and let you apply different guardrails (SCPs) per environment.

graph TB
    Root["AWS Organization Root<br/>Management Account"]
    Root --> Core["Core OU"]
    Root --> Workloads["Workloads OU"]
    Root --> Security["Security OU"]
    Core --> Log["Log Archive Account<br/>(CloudTrail, Config)"]
    Core --> Audit["Audit Account<br/>(Security Hub, GuardDuty)"]
    Workloads --> Dev["Development Account"]
    Workloads --> Staging["Staging Account"]
    Workloads --> Prod["Production Account"]
    Security --> SecurityTools["Security Tooling Account<br/>(Inspector, Macie)"]
    Root -.->|"SCPs"| Core
    Root -.->|"SCPs"| Workloads
    Root -.->|"SCPs"| Security
    style Root fill:#ff9900,stroke:#232f3e,stroke-width:3px,color:#232f3e
    style Prod fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
    style Security fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440

Account Structure Best Practices

| Account Type | Purpose | Who Has Access |
|---|---|---|
| Management (Root) | AWS Organizations, billing, minimal workloads | Finance, senior ops only |
| Log Archive | Centralized logging (CloudTrail, VPC Flow Logs) | Security team (read-only), automated systems (write) |
| Audit/Security | Security Hub, GuardDuty aggregation, compliance | Security team, compliance auditors |
| Shared Services | Active Directory, VPN, Transit Gateway | Network team, infra team |
| Development | Development/testing workloads | All engineers (broad permissions) |
| Staging | Pre-production testing | Engineers (limited), QA team |
| Production | Live customer-facing workloads | On-call engineers (read-only + incident response) |

Terraform: AWS Organizations Setup

# Create AWS Organization (run in management account)
resource "aws_organizations_organization" "org" {
  feature_set = "ALL"

  aws_service_access_principals = [
    "cloudtrail.amazonaws.com",
    "config.amazonaws.com",
    "guardduty.amazonaws.com",
    "securityhub.amazonaws.com"
  ]

  enabled_policy_types = [
    "SERVICE_CONTROL_POLICY",
    "TAG_POLICY"
  ]
}

# Create Organizational Units
resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads"
  parent_id = aws_organizations_organization.org.roots[0].id
}

resource "aws_organizations_organizational_unit" "security" {
  name      = "Security"
  parent_id = aws_organizations_organization.org.roots[0].id
}

# Create Production Account
resource "aws_organizations_account" "production" {
  name      = "production"
  email     = "aws-prod@company.com"
  parent_id = aws_organizations_organizational_unit.workloads.id

  role_name = "OrganizationAccountAccessRole"

  tags = {
    Environment = "production"
    CostCenter  = "engineering"
  }
}

# Service Control Policy (SCP) - Prevent region usage outside allowed list
resource "aws_organizations_policy" "require_regions" {
  name        = "RequireAllowedRegions"
  description = "Restrict operations to us-east-1 and us-west-2"
  type        = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyAllOutsideAllowedRegions"
        Effect = "Deny"
        NotAction = [
          "iam:*",
          "organizations:*",
          "route53:*",
          "budgets:*",
          "cloudfront:*",
          "support:*",
          "sts:*"
        ]
        Resource = "*"
        Condition = {
          StringNotEquals = {
            "aws:RequestedRegion" = [
              "us-east-1",
              "us-west-2"
            ]
          }
        }
      }
    ]
  })
}

# Attach SCP to Workloads OU
resource "aws_organizations_policy_attachment" "workloads_regions" {
  policy_id = aws_organizations_policy.require_regions.id
  target_id = aws_organizations_organizational_unit.workloads.id
}
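The deny logic of this SCP can be sketched in a few lines of plain Python (an illustration of the evaluation rule, not an AWS API): any action outside the NotAction list is denied whenever the requested region is off the allowlist.

```python
ALLOWED_REGIONS = {"us-east-1", "us-west-2"}
# Global/exempt service prefixes, mirroring the SCP's NotAction list
EXEMPT_PREFIXES = ("iam:", "organizations:", "route53:", "budgets:",
                   "cloudfront:", "support:", "sts:")

def scp_denies(action: str, requested_region: str) -> bool:
    """Return True if the RequireAllowedRegions SCP would deny this call."""
    if action.startswith(EXEMPT_PREFIXES):
        return False  # NotAction: global services stay usable everywhere
    return requested_region not in ALLOWED_REGIONS

print(scp_denies("ec2:RunInstances", "eu-west-1"))  # True: region not allowed
print(scp_denies("iam:CreateRole", "eu-west-1"))    # False: IAM is exempt
print(scp_denies("s3:PutObject", "us-east-1"))      # False: allowed region
```

Note the Deny/NotAction pattern: the SCP never grants anything (SCPs only bound permissions); it denies everything outside the region allowlist except the listed global services.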

1.2 IAM Foundation

IAM Best Practices:

- Prefer roles with temporary credentials over IAM users with long-lived access keys
- Grant least privilege: scope policies to specific actions and resources
- Require MFA for human users; never use the root user for daily work
- Audit permissions regularly (CloudTrail, IAM Access Analyzer)

IAM Components

| Component | Purpose | Example Use Case |
|---|---|---|
| IAM Users | Human users (avoid for services) | Developer accessing AWS Console |
| IAM Roles | Assumed by services or federated users | EC2 instance accessing S3, cross-account access |
| IAM Policies | JSON documents defining permissions | Allow S3 read access to a specific bucket |
| IAM Groups | Collections of users for permission management | Engineers group with EC2/S3 access |
| Service Control Policies (SCPs) | Org-level permission boundaries | Prevent production account from deleting CloudTrail |

Terraform: IAM Roles for Services

# IAM Role for EC2 instances to access S3 and DynamoDB
resource "aws_iam_role" "app_server" {
  name = "app-server-role"

  # Trust policy - who can assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Purpose = "application-server"
  }
}

# Permission policy - what this role can do
resource "aws_iam_role_policy" "app_server_permissions" {
  name = "app-server-permissions"
  role = aws_iam_role.app_server.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # S3 access
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "arn:aws:s3:::my-app-bucket/*"
      },
      {
        # DynamoDB access
        Effect = "Allow"
        Action = [
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:Query"
        ]
        Resource = "arn:aws:dynamodb:us-east-1:*:table/Users"
      },
      {
        # Secrets Manager access
        Effect = "Allow"
        Action = [
          "secretsmanager:GetSecretValue"
        ]
        Resource = "arn:aws:secretsmanager:us-east-1:*:secret:app/*"
      }
    ]
  })
}

# Instance profile (wrapper for role used by EC2)
resource "aws_iam_instance_profile" "app_server" {
  name = "app-server-profile"
  role = aws_iam_role.app_server.name
}

# Attach instance profile to EC2 instance
resource "aws_instance" "app" {
  ami                  = "ami-12345678"
  instance_type        = "t3.medium"
  iam_instance_profile = aws_iam_instance_profile.app_server.name

  # Now this instance can access S3, DynamoDB, Secrets Manager
  # without any API keys or credentials!
}

Cross-Account Access Pattern

# In Production Account: Role that Dev Account can assume
resource "aws_iam_role" "prod_readonly_from_dev" {
  name = "prod-readonly-cross-account"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::111111111111:root"  # Dev account ID
        }
        Action = "sts:AssumeRole"
        Condition = {
          StringEquals = {
            "sts:ExternalId" = "unique-external-id-12345"
          }
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "readonly" {
  role       = aws_iam_role.prod_readonly_from_dev.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# In Dev Account: Policy allowing users to assume prod role
resource "aws_iam_policy" "assume_prod_readonly" {
  name = "assume-prod-readonly"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = "sts:AssumeRole"
        Resource = "arn:aws:iam::222222222222:role/prod-readonly-cross-account"
      }
    ]
  })
}

# Attach to engineers group
resource "aws_iam_group_policy_attachment" "engineers_assume_prod" {
  group      = aws_iam_group.engineers.name
  policy_arn = aws_iam_policy.assume_prod_readonly.arn
}
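From the Dev account, an engineer obtains temporary production credentials by calling STS AssumeRole with the role ARN and external ID above. A minimal sketch — the session name and duration are illustrative, and the boto3 call itself is shown only in a comment:

```python
# Parameters a Dev-account principal passes to STS AssumeRole to get
# temporary credentials for the production read-only role. The session
# name is illustrative; the ARN and ExternalId match the Terraform above.
def build_assume_role_params(session_name: str) -> dict:
    return {
        "RoleArn": "arn:aws:iam::222222222222:role/prod-readonly-cross-account",
        "RoleSessionName": session_name,
        "ExternalId": "unique-external-id-12345",  # must match the trust policy condition
        "DurationSeconds": 3600,
    }

# With boto3 this would be:
#   creds = boto3.client("sts").assume_role(**build_assume_role_params("audit"))["Credentials"]
# The returned AccessKeyId/SecretAccessKey/SessionToken are temporary and expire.
params = build_assume_role_params("audit")
print(params["RoleArn"])
```

The ExternalId condition protects against the confused-deputy problem: a caller who knows the role ARN but not the external ID cannot assume the role.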

2. Networking Architecture

2.1 VPC Fundamentals

What is a VPC?

A Virtual Private Cloud (VPC) is an isolated network within AWS where you launch resources. Think of it as your own private data center in the cloud, with complete control over:

- IP address ranges (CIDR blocks)
- Subnet layout across Availability Zones
- Route tables and gateways (internet, NAT, VPN)
- Network-level security (security groups, NACLs)

graph TB
    Internet["Internet"]
    subgraph VPC["VPC: 10.0.0.0/16"]
        IGW["Internet Gateway"]
        subgraph AZ1["Availability Zone us-east-1a"]
            PubSub1["Public Subnet<br/>10.0.1.0/24"]
            PrivSub1["Private Subnet<br/>10.0.10.0/24"]
            NAT1["NAT Gateway"]
        end
        subgraph AZ2["Availability Zone us-east-1b"]
            PubSub2["Public Subnet<br/>10.0.2.0/24"]
            PrivSub2["Private Subnet<br/>10.0.20.0/24"]
            NAT2["NAT Gateway"]
        end
        Web1["Web Server<br/>EC2"]
        Web2["Web Server<br/>EC2"]
        App1["App Server<br/>EC2"]
        App2["App Server<br/>EC2"]
        PubSub1 --> Web1
        PubSub2 --> Web2
        PrivSub1 --> App1
        PrivSub2 --> App2
        PubSub1 --> NAT1
        PubSub2 --> NAT2
    end
    Internet <--> IGW
    IGW <--> PubSub1
    IGW <--> PubSub2
    PrivSub1 --> NAT1
    PrivSub2 --> NAT2
    NAT1 --> IGW
    NAT2 --> IGW
    style VPC fill:#2e3440,stroke:#88c0d0,stroke-width:3px,color:#e0e0e0
    style PubSub1 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
    style PubSub2 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
    style PrivSub1 fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
    style PrivSub2 fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440

Public vs Private Subnets

| Subnet Type | Internet Access | Route Table | Use Cases |
|---|---|---|---|
| Public Subnet | Direct (via Internet Gateway) | 0.0.0.0/0 → Internet Gateway | Load balancers, bastion hosts, NAT gateways |
| Private Subnet | Outbound only (via NAT Gateway) | 0.0.0.0/0 → NAT Gateway | Application servers, databases, internal services |
| Isolated Subnet | None | Local VPC routes only | Databases with no internet access needed |
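Python's standard ipaddress module is handy for sanity-checking a subnet plan like this one. A small sketch carving the 10.0.0.0/16 VPC into /24 subnets (remember that AWS reserves 5 IPs in every subnet: network, VPC router, DNS, one reserved, broadcast):

```python
import ipaddress

# Carve the 10.0.0.0/16 VPC into /24 subnets, matching the CIDRs used
# in the Terraform below (public: .1/.2, private: .10/.20).
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))  # 256 possible /24s

public_1 = subnets[1]    # 10.0.1.0/24
private_1 = subnets[10]  # 10.0.10.0/24

print(public_1, private_1)
print(public_1.num_addresses - 5)  # usable IPs in an AWS /24 subnet: 251
```

Checking for overlaps up front (`ipaddress.ip_network(...).overlaps(...)`) is cheap insurance: CIDR ranges cannot be changed after a subnet is created.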

Terraform: Complete VPC Setup

# VPC
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "production-vpc"
  }
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "production-igw"
  }
}

# Public Subnet AZ1
resource "aws_subnet" "public_1" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true

  tags = {
    Name = "public-subnet-1a"
    Type = "public"
  }
}

# Public Subnet AZ2
resource "aws_subnet" "public_2" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.2.0/24"
  availability_zone       = "us-east-1b"
  map_public_ip_on_launch = true

  tags = {
    Name = "public-subnet-1b"
    Type = "public"
  }
}

# Private Subnet AZ1
resource "aws_subnet" "private_1" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "us-east-1a"

  tags = {
    Name = "private-subnet-1a"
    Type = "private"
  }
}

# Private Subnet AZ2
resource "aws_subnet" "private_2" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.20.0/24"
  availability_zone = "us-east-1b"

  tags = {
    Name = "private-subnet-1b"
    Type = "private"
  }
}

# Elastic IP for NAT Gateway
resource "aws_eip" "nat_1" {
  domain = "vpc"

  tags = {
    Name = "nat-gateway-eip-1a"
  }
}

# NAT Gateway (for private subnets to access internet)
resource "aws_nat_gateway" "nat_1" {
  allocation_id = aws_eip.nat_1.id
  subnet_id     = aws_subnet.public_1.id

  tags = {
    Name = "nat-gateway-1a"
  }

  depends_on = [aws_internet_gateway.main]
}

# Public Route Table
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "public-route-table"
  }
}

# Private Route Table
resource "aws_route_table" "private_1" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_1.id
  }

  tags = {
    Name = "private-route-table-1a"
  }
}

# Route Table Associations
resource "aws_route_table_association" "public_1" {
  subnet_id      = aws_subnet.public_1.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "public_2" {
  subnet_id      = aws_subnet.public_2.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private_1" {
  subnet_id      = aws_subnet.private_1.id
  route_table_id = aws_route_table.private_1.id
}

2.2 Security Groups vs NACLs

| Feature | Security Groups | Network ACLs |
|---|---|---|
| Level | Instance (ENI) level | Subnet level |
| State | Stateful (return traffic auto-allowed) | Stateless (must explicitly allow return traffic) |
| Rules | Allow rules only | Allow and Deny rules |
| Rule Evaluation | All rules evaluated | Rules processed in number order |
| Default | Deny all inbound, allow all outbound | Allow all inbound and outbound |
| Use Case | Primary firewall (granular control) | Additional subnet-level protection |
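"Rules processed in number order" is the key operational difference: the first matching NACL rule wins, and an implicit deny sits at the end. A small sketch of that evaluation, with illustrative rule numbers and CIDRs:

```python
import ipaddress

# NACL rules evaluated in ascending rule-number order; first match wins.
# Stateless: return traffic on ephemeral ports needs its own allow rule.
RULES = [
    (100, "allow", "0.0.0.0/0", (443, 443)),      # HTTPS in
    (110, "allow", "0.0.0.0/0", (1024, 65535)),   # ephemeral return traffic
    (200, "deny",  "192.0.2.0/24", (0, 65535)),   # block a bad CIDR
]

def nacl_allows(src_ip: str, port: int) -> bool:
    ip = ipaddress.ip_address(src_ip)
    for _num, action, cidr, (lo, hi) in sorted(RULES):
        if ip in ipaddress.ip_network(cidr) and lo <= port <= hi:
            return action == "allow"   # first match decides
    return False                       # implicit deny ("*" rule)

print(nacl_allows("203.0.113.5", 443))  # True: rule 100
print(nacl_allows("203.0.113.5", 22))   # False: falls through to implicit deny
print(nacl_allows("192.0.2.9", 443))    # True: rule 100 matches before deny 200
```

The last case is the classic gotcha: the broad allow at rule 100 shadows the deny at rule 200, so deny rules must be numbered lower than any allow they are meant to override.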
# Security Group for Web Servers
resource "aws_security_group" "web" {
  name        = "web-servers"
  description = "Security group for web tier"
  vpc_id      = aws_vpc.main.id

  # Inbound rules
  ingress {
    description = "HTTPS from internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTP from internet"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Outbound rules (allow all by default)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "web-sg"
  }
}

# Security Group for Application Servers
resource "aws_security_group" "app" {
  name        = "app-servers"
  description = "Security group for application tier"
  vpc_id      = aws_vpc.main.id

  # Only allow traffic from web tier
  ingress {
    description     = "HTTP from web tier"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.web.id]  # Reference web SG
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "app-sg"
  }
}

# Security Group for Database
resource "aws_security_group" "db" {
  name        = "database"
  description = "Security group for database tier"
  vpc_id      = aws_vpc.main.id

  # Only allow traffic from app tier
  ingress {
    description     = "PostgreSQL from app tier"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "db-sg"
  }
}

2.3 VPC Peering & Transit Gateway

VPC Peering

Use VPC Peering when you need private connectivity between a small number of VPCs and a full mesh is still manageable.

Limitations: Non-transitive (A↔B and B↔C does not give A↔C), and the connection count grows quadratically, so it doesn't scale well beyond ~10 VPCs.
# VPC Peering Connection
resource "aws_vpc_peering_connection" "prod_to_shared" {
  vpc_id      = aws_vpc.production.id
  peer_vpc_id = aws_vpc.shared_services.id

  # auto_accept works only when both VPCs are in the same account and region;
  # for cross-region peering, set peer_region and accept with a separate
  # aws_vpc_peering_connection_accepter resource instead
  auto_accept = true

  tags = {
    Name = "prod-to-shared-services"
  }
}

# Add route in Production VPC to Shared Services VPC
resource "aws_route" "prod_to_shared" {
  route_table_id            = aws_route_table.production_private.id
  destination_cidr_block    = "10.1.0.0/16"  # Shared Services CIDR
  vpc_peering_connection_id = aws_vpc_peering_connection.prod_to_shared.id
}

# Add route in Shared Services VPC to Production VPC
resource "aws_route" "shared_to_prod" {
  route_table_id            = aws_route_table.shared_private.id
  destination_cidr_block    = "10.0.0.0/16"  # Production CIDR
  vpc_peering_connection_id = aws_vpc_peering_connection.prod_to_shared.id
}
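Because peering is point-to-point and non-transitive, full connectivity among n VPCs requires n(n-1)/2 connections, while a Transit Gateway needs only n attachments. The arithmetic behind the "~10 VPCs" rule of thumb:

```python
def full_mesh_peering(n: int) -> int:
    """Peering connections for full connectivity among n VPCs (non-transitive)."""
    return n * (n - 1) // 2

def tgw_attachments(n: int) -> int:
    """Transit Gateway attachments for the same connectivity (hub-and-spoke)."""
    return n

for n in (3, 10, 25):
    print(n, full_mesh_peering(n), tgw_attachments(n))
# 10 VPCs: 45 peering connections (each with routes on both sides)
# vs 10 TGW attachments
```

Each peering connection also needs route table entries in both VPCs, so the operational burden grows even faster than the connection count.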

Transit Gateway (TGW)

Use Transit Gateway when you connect many VPCs (roughly 10+), need on-premises connectivity via VPN or Direct Connect, or want a hub-and-spoke topology.

Advantages: Transitive routing, simpler management at scale, centralized network monitoring.
graph TB
    OnPrem["On-Premises<br/>Data Center"]
    TGW["Transit Gateway<br/>(Hub)"]
    VPC1["Production VPC<br/>10.0.0.0/16"]
    VPC2["Staging VPC<br/>10.1.0.0/16"]
    VPC3["Dev VPC<br/>10.2.0.0/16"]
    VPC4["Shared Services VPC<br/>10.3.0.0/16"]
    OnPrem <-->|"VPN/Direct Connect"| TGW
    TGW <--> VPC1
    TGW <--> VPC2
    TGW <--> VPC3
    TGW <--> VPC4
    style TGW fill:#ff9900,stroke:#232f3e,stroke-width:3px,color:#232f3e
    style VPC1 fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440
    style VPC4 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440
# Transit Gateway
resource "aws_ec2_transit_gateway" "main" {
  description                     = "Main TGW for all VPCs"
  default_route_table_association = "enable"
  default_route_table_propagation = "enable"

  tags = {
    Name = "main-tgw"
  }
}

# Attach Production VPC to TGW
resource "aws_ec2_transit_gateway_vpc_attachment" "production" {
  subnet_ids         = [aws_subnet.prod_private_1.id, aws_subnet.prod_private_2.id]
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = aws_vpc.production.id

  tags = {
    Name = "production-tgw-attachment"
  }
}

# Attach Shared Services VPC to TGW
resource "aws_ec2_transit_gateway_vpc_attachment" "shared" {
  subnet_ids         = [aws_subnet.shared_private_1.id, aws_subnet.shared_private_2.id]
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = aws_vpc.shared_services.id

  tags = {
    Name = "shared-services-tgw-attachment"
  }
}

# Route from Production VPC to TGW (for reaching other VPCs)
resource "aws_route" "prod_to_tgw" {
  route_table_id         = aws_route_table.production_private.id
  destination_cidr_block = "10.0.0.0/8"  # All internal networks
  transit_gateway_id     = aws_ec2_transit_gateway.main.id
}

2.4 VPC Endpoints (PrivateLink)

Why VPC Endpoints?

Access AWS services (S3, DynamoDB, etc.) privately, without routing traffic over the public internet. Benefits:

- Traffic stays on the AWS network, shrinking the attack surface
- Private subnets can reach AWS APIs without a NAT gateway (and without its data processing charges)
- Endpoint policies add another layer of access control

| Endpoint Type | Services | How It Works |
|---|---|---|
| Gateway Endpoint | S3, DynamoDB | Route table entry; no ENI created |
| Interface Endpoint | Most AWS services (EC2, SNS, SQS, etc.) | ENI with a private IP in your subnet |
# S3 Gateway Endpoint (free)
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"

  route_table_ids = [
    aws_route_table.private_1.id,
    aws_route_table.private_2.id
  ]

  tags = {
    Name = "s3-gateway-endpoint"
  }
}

# DynamoDB Gateway Endpoint (free)
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.dynamodb"

  route_table_ids = [
    aws_route_table.private_1.id,
    aws_route_table.private_2.id
  ]

  tags = {
    Name = "dynamodb-gateway-endpoint"
  }
}

# Secrets Manager Interface Endpoint
resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true

  subnet_ids = [
    aws_subnet.private_1.id,
    aws_subnet.private_2.id
  ]

  security_group_ids = [
    aws_security_group.vpc_endpoints.id
  ]

  tags = {
    Name = "secretsmanager-interface-endpoint"
  }
}

3. Compute Services

3.1 Compute Service Comparison

| Service | Type | When to Use | Pros | Cons |
|---|---|---|---|---|
| EC2 | Virtual Machines | Full OS control, legacy apps, persistent workloads | Complete control, any software | Must manage OS, scaling complexity |
| Lambda | Serverless Functions | Event-driven, short tasks (<15 min), API backends | No servers, auto-scaling, pay-per-use | 15 min timeout, cold starts, limited runtime options |
| ECS | Container Orchestration | Docker containers, AWS-native orchestration | Simple, integrates well with AWS | AWS-only, smaller ecosystem than Kubernetes |
| EKS | Managed Kubernetes | Need Kubernetes, complex microservices, multi-cloud | Industry standard, portable, huge ecosystem | Complex, expensive, steep learning curve |
| Fargate | Serverless Containers | Containers without managing servers (ECS/EKS) | No server management, auto-scaling | More expensive than EC2, less control |
| App Runner | Fully Managed Apps | Deploy from source code/container with zero config | Easiest deployment, auto-scaling | Less control, newer service |

3.2 EC2 Fundamentals

Instance Types Decision Tree

| Family | Type | Use Case | Example |
|---|---|---|---|
| T3/T4g | Burstable | Web servers, dev/test, low-traffic apps | t3.medium (2 vCPU, 4 GB) |
| M5/M6i | General Purpose | Balanced compute/memory, app servers | m5.xlarge (4 vCPU, 16 GB) |
| C5/C6i | Compute Optimized | CPU-intensive (batch processing, ML inference) | c5.2xlarge (8 vCPU, 16 GB) |
| R5/R6i | Memory Optimized | In-memory databases, caches, big data | r5.xlarge (4 vCPU, 32 GB) |
| P3/P4 | GPU | Machine learning training, HPC | p3.2xlarge (8 vCPU, 1 GPU) |
# Auto Scaling Group for web servers
resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = "ami-12345678"  # Amazon Linux 2023
  instance_type = "t3.medium"

  # IAM instance profile
  iam_instance_profile {
    name = aws_iam_instance_profile.web.name
  }

  # Security group
  vpc_security_group_ids = [aws_security_group.web.id]

  # User data script (runs on boot)
  user_data = base64encode(<<-EOF
    #!/bin/bash
    yum update -y
    yum install -y httpd
    systemctl start httpd
    systemctl enable httpd
    echo "<h1>Hello from $(hostname)</h1>" > /var/www/html/index.html
  EOF
  )

  # EBS volume configuration
  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size           = 20
      volume_type           = "gp3"
      encrypted             = true
      delete_on_termination = true
    }
  }

  # Metadata options (IMDSv2 enforced)
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"  # Require IMDSv2
    http_put_response_hop_limit = 1
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "web-server"
    }
  }
}

# Auto Scaling Group
resource "aws_autoscaling_group" "web" {
  name                      = "web-asg"
  vpc_zone_identifier       = [aws_subnet.public_1.id, aws_subnet.public_2.id]
  target_group_arns         = [aws_lb_target_group.web.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 300

  min_size         = 2
  max_size         = 10
  desired_capacity = 2

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "web-server"
    propagate_at_launch = true
  }
}

# Auto Scaling Policy (CPU-based)
resource "aws_autoscaling_policy" "web_cpu" {
  name                   = "web-cpu-scaling"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0  # Target 70% CPU
  }
}
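Target tracking behaves roughly like proportional control: it adjusts capacity so average CPU returns to the target value. A back-of-envelope sketch of that math (the real CloudWatch-driven behavior also involves alarms, cooldowns, and instance warm-up, which this ignores):

```python
import math

def desired_capacity(current: int, actual_cpu: float, target_cpu: float = 70.0,
                     min_size: int = 2, max_size: int = 10) -> int:
    """Approximate target-tracking math: scale capacity proportionally so
    average CPU returns to the target, clamped to the ASG bounds."""
    proposed = math.ceil(current * actual_cpu / target_cpu)
    return max(min_size, min(max_size, proposed))

print(desired_capacity(2, 95.0))   # 3  -> scale out
print(desired_capacity(6, 20.0))   # 2  -> scale in, clamped to min_size
print(desired_capacity(10, 99.0))  # 10 -> already at max_size
```

The ceil keeps scaling conservative in the out direction; in practice AWS also scales in more slowly than it scales out to avoid flapping.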

3.3 Lambda (Serverless Functions)

Lambda Best Practices:

- Keep functions small and single-purpose; initialize SDK clients outside the handler so warm invocations reuse them
- Right-size memory (it also scales CPU) and set realistic timeouts
- Attach to a VPC only when the function must reach private resources
- Grant each function its own least-privilege IAM role
# Lambda function for S3 image processing
resource "aws_lambda_function" "image_processor" {
  filename      = "image_processor.zip"
  function_name = "process-uploaded-images"
  role          = aws_iam_role.lambda_exec.arn
  handler       = "index.handler"
  runtime       = "python3.11"

  source_code_hash = filebase64sha256("image_processor.zip")

  # Memory and timeout
  memory_size = 1024  # MB (also affects CPU)
  timeout     = 60    # seconds

  # Environment variables
  environment {
    variables = {
      DEST_BUCKET = aws_s3_bucket.processed_images.id
      MAX_WIDTH   = "1920"
      MAX_HEIGHT  = "1080"
    }
  }

  # VPC configuration (only if needed)
  # vpc_config {
  #   subnet_ids         = [aws_subnet.private_1.id, aws_subnet.private_2.id]
  #   security_group_ids = [aws_security_group.lambda.id]
  # }

  tags = {
    Purpose = "image-processing"
  }
}

# IAM role for Lambda
resource "aws_iam_role" "lambda_exec" {
  name = "lambda-image-processor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}

# Lambda policy - S3 access
resource "aws_iam_role_policy" "lambda_s3" {
  name = "lambda-s3-access"
  role = aws_iam_role.lambda_exec.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject"
        ]
        Resource = "${aws_s3_bucket.uploads.arn}/*"
      },
      {
        Effect = "Allow"
        Action = [
          "s3:PutObject"
        ]
        Resource = "${aws_s3_bucket.processed_images.arn}/*"
      }
    ]
  })
}

# Attach AWS managed policy for Lambda basics (CloudWatch Logs)
resource "aws_iam_role_policy_attachment" "lambda_basic" {
  role       = aws_iam_role.lambda_exec.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

# S3 trigger for Lambda
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowExecutionFromS3"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.image_processor.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.uploads.arn
}

resource "aws_s3_bucket_notification" "upload_trigger" {
  bucket = aws_s3_bucket.uploads.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.image_processor.arn
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "uploads/"
    filter_suffix       = ".jpg"
  }

  depends_on = [aws_lambda_permission.allow_s3]
}
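The Terraform above wires S3 notifications to a handler named index.handler. A hedged sketch of what that Python entry point might look like — the event parsing follows the documented S3 notification shape, while the resize/upload steps are placeholders since the actual implementation isn't shown here:

```python
import urllib.parse

# Sketch of the index.handler entry point referenced by the Terraform above.
# Event parsing is real; the resize/upload steps are placeholder comments.
def parse_s3_records(event: dict) -> list:
    """Extract (bucket, key) pairs from an S3 ObjectCreated event."""
    out = []
    for rec in event.get("Records", []):
        bucket = rec["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become '+')
        key = urllib.parse.unquote_plus(rec["s3"]["object"]["key"])
        out.append((bucket, key))
    return out

def handler(event, context):
    for bucket, key in parse_s3_records(event):
        # download from `bucket`, resize to MAX_WIDTH x MAX_HEIGHT,
        # upload to DEST_BUCKET (boto3 calls omitted in this sketch)
        print(f"processing s3://{bucket}/{key}")

event = {"Records": [{"s3": {"bucket": {"name": "uploads"},
                             "object": {"key": "uploads/photo+1.jpg"}}}]}
print(parse_s3_records(event))  # [('uploads', 'uploads/photo 1.jpg')]
```

The unquote_plus step matters: writing back to the same bucket with an un-decoded key is a classic source of recursive-trigger bugs (which the filter_prefix/filter_suffix in the notification also guards against).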

3.4 ECS (Elastic Container Service)

ECS vs EKS Decision: choose ECS when you want AWS-native simplicity and tight integration with IAM, ALB, and CloudWatch; choose EKS when you need the Kubernetes ecosystem, multi-cloud portability, or already have Kubernetes expertise in-house.
# ECS Cluster
resource "aws_ecs_cluster" "main" {
  name = "production-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = {
    Environment = "production"
  }
}

# ECS Task Definition (defines container configuration)
resource "aws_ecs_task_definition" "app" {
  family                   = "app-service"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "512"   # 0.5 vCPU
  memory                   = "1024"  # 1 GB

  # Task execution role (pulls images, writes logs)
  execution_role_arn = aws_iam_role.ecs_task_execution.arn

  # Task role (permissions for app itself)
  task_role_arn = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name  = "app"
      image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest"

      portMappings = [
        {
          containerPort = 8080
          protocol      = "tcp"
        }
      ]

      environment = [
        {
          name  = "ENV"
          value = "production"
        }
      ]

      # Secrets from Secrets Manager
      secrets = [
        {
          name      = "DB_PASSWORD"
          valueFrom = aws_secretsmanager_secret.db_password.arn
        }
      ]

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.ecs.name
          "awslogs-region"        = "us-east-1"
          "awslogs-stream-prefix" = "app"
        }
      }

      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 60
      }
    }
  ])
}

# ECS Service (runs and maintains task instances)
resource "aws_ecs_service" "app" {
  name            = "app-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = [aws_subnet.private_1.id, aws_subnet.private_2.id]
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  # Load balancer configuration
  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 8080
  }

  # Auto-scaling
  lifecycle {
    ignore_changes = [desired_count]  # Let auto-scaling manage this
  }

  depends_on = [aws_lb_listener.app]
}

# Auto-scaling for ECS service
resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = 10
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_cpu" {
  name               = "ecs-cpu-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

4. Storage Services

4.1 Storage Service Comparison

| Service | Type | Use Case | Max Size | Access |
|---|---|---|---|---|
| S3 | Object Storage | Static assets, backups, data lake, archives | Unlimited (5 TB per object) | HTTP API |
| EBS | Block Storage | EC2 instance boot/data volumes | 64 TiB | Single EC2 instance (except io2 Multi-Attach) |
| EFS | File Storage (NFS) | Shared file storage across multiple EC2/containers | Unlimited | Multiple instances concurrently |
| FSx | Managed File Systems | Windows (SMB), Lustre (HPC), NetApp, OpenZFS | Varies | Multiple instances |

4.2 S3 (Simple Storage Service)

S3 Storage Classes

| Storage Class | Use Case | Availability | Retrieval | Cost |
|---|---|---|---|---|
| S3 Standard | Frequently accessed data | 99.99% | Instant | $$$$ |
| S3 Intelligent-Tiering | Unknown/changing access patterns | 99.9% | Instant | $$$ + monitoring fee |
| S3 Standard-IA | Infrequently accessed (backups) | 99.9% | Instant | $$ + retrieval fee |
| S3 One Zone-IA | Reproducible data (thumbnails) | 99.5% | Instant | $ + retrieval fee |
| S3 Glacier Instant Retrieval | Archive with instant access | 99.9% | Instant | $ + retrieval fee |
| S3 Glacier Flexible Retrieval | Archives (minutes to hours) | 99.99% | Minutes to hours | Very low |
| S3 Glacier Deep Archive | Long-term archive (~12 h retrieval) | 99.99% | ~12 hours | Lowest |
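A quick way to reason about Standard vs Standard-IA is a break-even calculation: IA's lower storage price trades against a per-GB retrieval fee. The prices below are illustrative us-east-1-style numbers (an assumption; check current pricing before deciding):

```python
# Back-of-envelope cost comparison with illustrative prices (assumptions,
# not current list prices). Standard-IA is cheaper per GB stored but
# charges per GB retrieved, so it only wins for rarely-read data.
STANDARD_GB_MONTH = 0.023
IA_GB_MONTH = 0.0125
IA_RETRIEVAL_GB = 0.01

def monthly_cost_standard(gb_stored: float, gb_read: float) -> float:
    return gb_stored * STANDARD_GB_MONTH  # no retrieval fee

def monthly_cost_ia(gb_stored: float, gb_read: float) -> float:
    return gb_stored * IA_GB_MONTH + gb_read * IA_RETRIEVAL_GB

# 1 TB of backups read about once a quarter: IA is clearly cheaper
print(monthly_cost_standard(1024, 0))
print(monthly_cost_ia(1024, 1024 / 3))
```

The same logic drives the lifecycle policy shown below: data transitions to IA and then Glacier once its expected read rate drops below the break-even point.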
# S3 bucket with security best practices
resource "aws_s3_bucket" "app_data" {
  bucket = "my-app-data-bucket"

  tags = {
    Purpose = "application-data"
  }
}

# Block public access (CRITICAL)
resource "aws_s3_bucket_public_access_block" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Enable versioning
resource "aws_s3_bucket_versioning" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Server-side encryption (AES-256 or KMS)
resource "aws_s3_bucket_server_side_encryption_configuration" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.s3.id
    }
    bucket_key_enabled = true  # Reduces KMS costs
  }
}

# Lifecycle policy (auto-transition to cheaper storage)
resource "aws_s3_bucket_lifecycle_configuration" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  rule {
    id     = "transition-old-data"
    status = "Enabled"

    # Transition to IA after 30 days
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    # Transition to Glacier after 90 days
    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    # Delete after 365 days
    expiration {
      days = 365
    }

    # Clean up old versions
    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}

# Bucket policy (restrict access)
resource "aws_s3_bucket_policy" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Enforce SSL/TLS
        Sid    = "DenyInsecureTransport"
        Effect = "Deny"
        Principal = "*"
        Action = "s3:*"
        Resource = [
          aws_s3_bucket.app_data.arn,
          "${aws_s3_bucket.app_data.arn}/*"
        ]
        Condition = {
          Bool = {
            "aws:SecureTransport" = "false"
          }
        }
      }
    ]
  })
}

4.3 EBS (Elastic Block Store)

EBS Volume Types

| Type | Use Case | IOPS | Throughput | Cost |
|---|---|---|---|---|
| gp3 | General purpose (default choice) | 3,000-16,000 (configurable) | 125-1,000 MB/s | $0.08/GB-month |
| gp2 | General purpose (older) | 100-16,000 (size-based) | Up to 250 MB/s | $0.10/GB-month |
| io2 | High-performance databases | Up to 64,000 | Up to 1,000 MB/s | $0.125/GB-month + IOPS cost |
| st1 | Big data, data warehouses (HDD) | N/A (throughput optimized) | Up to 500 MB/s | $0.045/GB-month |
| sc1 | Cold storage (HDD) | N/A | Up to 250 MB/s | $0.015/GB-month |
# EBS volume for database
resource "aws_ebs_volume" "database" {
  availability_zone = "us-east-1a"
  size              = 100  # GB
  type              = "gp3"
  iops              = 3000
  throughput        = 125  # MB/s
  encrypted         = true
  kms_key_id        = aws_kms_key.ebs.id

  tags = {
    Name = "database-volume"
  }
}

# Attach to EC2 instance
resource "aws_volume_attachment" "database" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.database.id
  instance_id = aws_instance.database.id
}

# One-off EBS snapshot for backup
resource "aws_ebs_snapshot" "database_backup" {
  volume_id   = aws_ebs_volume.database.id
  # Caution: timestamp() changes on every plan, producing a perpetual diff;
  # prefer the DLM policy below for recurring backups
  description = "Database backup ${formatdate("YYYY-MM-DD", timestamp())}"

  tags = {
    Name = "database-snapshot"
  }
}

# Data Lifecycle Manager - automated snapshots
resource "aws_dlm_lifecycle_policy" "database_backups" {
  description        = "Daily database snapshots"
  execution_role_arn = aws_iam_role.dlm.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "Daily snapshots"

      create_rule {
        interval      = 24  # hours
        interval_unit = "HOURS"
        times         = ["03:00"]  # 3 AM UTC
      }

      retain_rule {
        count = 7  # Keep 7 daily snapshots
      }

      tags_to_add = {
        SnapshotCreator = "DLM"
      }

      copy_tags = true
    }

    target_tags = {
      Backup = "true"
    }
  }
}

4.4 EFS (Elastic File System)

EFS vs EBS:

  • EBS: block storage attached to a single EC2 instance, scoped to one AZ
  • EFS: shared NFS file system mountable by many instances across AZs simultaneously
  • EBS: capacity provisioned up front; EFS: elastic, pay per GB actually stored
  • EBS: lower latency; EFS: higher per-operation latency but shared POSIX semantics

# EFS file system
resource "aws_efs_file_system" "shared_data" {
  creation_token = "shared-app-data"
  encrypted      = true
  kms_key_id     = aws_kms_key.efs.id

  # Performance mode
  performance_mode = "generalPurpose"  # or "maxIO" for high parallelism

  # Throughput mode
  throughput_mode = "bursting"  # or "provisioned" for consistent throughput

  lifecycle_policy {
    transition_to_ia = "AFTER_30_DAYS"  # Move to cheaper Infrequent Access
  }

  tags = {
    Name = "shared-data"
  }
}

# Mount targets (one per AZ)
resource "aws_efs_mount_target" "az1" {
  file_system_id  = aws_efs_file_system.shared_data.id
  subnet_id       = aws_subnet.private_1.id
  security_groups = [aws_security_group.efs.id]
}

resource "aws_efs_mount_target" "az2" {
  file_system_id  = aws_efs_file_system.shared_data.id
  subnet_id       = aws_subnet.private_2.id
  security_groups = [aws_security_group.efs.id]
}

# Security group for EFS
resource "aws_security_group" "efs" {
  name        = "efs-mount-targets"
  description = "Security group for EFS mount targets"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "NFS from app servers"
    from_port       = 2049
    to_port         = 2049
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  tags = {
    Name = "efs-sg"
  }
}

# User data to mount EFS on EC2 instance
resource "aws_instance" "app" {
  # ... other config ...

  user_data = <<-EOF
    #!/bin/bash
    yum install -y amazon-efs-utils
    mkdir /mnt/efs
    mount -t efs ${aws_efs_file_system.shared_data.id}:/ /mnt/efs
    echo "${aws_efs_file_system.shared_data.id}:/ /mnt/efs efs defaults,_netdev 0 0" >> /etc/fstab
  EOF
}

5. Database Services

5.1 Database Service Comparison

Service | Type | Engine | When to Use | Scaling
RDS | Relational | PostgreSQL, MySQL, MariaDB, Oracle, SQL Server | Traditional RDBMS needs, ACID compliance | Vertical (read replicas for reads)
Aurora | Relational | MySQL, PostgreSQL compatible | High-performance relational, cloud-native | Vertical + auto-scaling read replicas
DynamoDB | NoSQL (Key-Value/Document) | Proprietary | Massive scale, single-digit ms latency, serverless | Horizontal (unlimited)
ElastiCache | In-Memory Cache | Redis, Memcached | Caching, session store, real-time analytics | Horizontal (cluster mode)
DocumentDB | NoSQL (Document) | MongoDB compatible | MongoDB workloads on AWS | Vertical + read replicas
Neptune | Graph | Gremlin, SPARQL | Graph databases (social networks, fraud detection) | Vertical + read replicas

Relational vs NoSQL Decision Tree

Choose Relational (RDS/Aurora) when:

  • You need ACID transactions spanning multiple entities
  • Access patterns are varied or unknown (ad-hoc queries, joins, aggregations)
  • Data is naturally relational (foreign keys, normalization)

Choose NoSQL (DynamoDB) when:

  • Access patterns are known up front and are key-based
  • You need massive scale with predictable single-digit-ms latency
  • You want serverless operation with no capacity or patching management
5.2 RDS (Relational Database Service)

# RDS PostgreSQL instance
resource "aws_db_instance" "postgres" {
  identifier     = "production-postgres"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.t3.medium"

  # Storage
  allocated_storage     = 100  # GB
  max_allocated_storage = 1000  # Auto-scaling up to 1TB
  storage_type          = "gp3"
  storage_encrypted     = true
  kms_key_id            = aws_kms_key.rds.id

  # Database config
  db_name  = "myapp"
  username = "admin"
  password = random_password.db_master.result  # Use random password!
  port     = 5432

  # Networking
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.rds.id]
  publicly_accessible    = false  # NEVER true for production!

  # High Availability
  multi_az               = true  # Standby in different AZ

  # Backups
  backup_retention_period = 7  # Days
  backup_window           = "03:00-04:00"  # UTC
  maintenance_window      = "Mon:04:00-Mon:05:00"  # UTC

  # Monitoring
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
  monitoring_interval             = 60  # Enhanced monitoring every 60s
  monitoring_role_arn             = aws_iam_role.rds_monitoring.arn

  # Security
  deletion_protection       = true  # Prevent accidental deletion
  skip_final_snapshot       = false
  final_snapshot_identifier = "production-postgres-final-snapshot"

  # Performance Insights
  performance_insights_enabled    = true
  performance_insights_kms_key_id = aws_kms_key.rds.id
  performance_insights_retention_period = 7

  tags = {
    Name = "production-postgres"
  }
}

# DB Subnet Group (spans multiple AZs)
resource "aws_db_subnet_group" "main" {
  name       = "main-db-subnet-group"
  subnet_ids = [aws_subnet.private_1.id, aws_subnet.private_2.id]

  tags = {
    Name = "main-db-subnet-group"
  }
}

# Read Replica (for read scaling)
resource "aws_db_instance" "postgres_replica" {
  identifier              = "production-postgres-replica"
  replicate_source_db     = aws_db_instance.postgres.identifier
  instance_class          = "db.t3.medium"
  publicly_accessible     = false

  # A cross-region replica (for disaster recovery) requires the source
  # instance ARN in replicate_source_db and a provider alias for that region

  tags = {
    Name = "production-postgres-replica"
  }
}

# Random password for DB
resource "random_password" "db_master" {
  length  = 32
  special = true
}

# Store password in Secrets Manager
resource "aws_secretsmanager_secret" "db_password" {
  name = "production/postgres/master-password"

  recovery_window_in_days = 7
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.db_master.result
}

5.3 Aurora (Cloud-Native Relational)

Aurora Advantages over RDS:

  • Storage auto-scales (up to 128 TB) and is replicated six ways across three AZs
  • Up to 15 low-latency read replicas sharing the same storage volume
  • Faster failover (typically ~30 seconds) and continuous backup to S3
  • Higher throughput than stock MySQL/PostgreSQL for many workloads

Trade-off: roughly 20-30% more expensive than RDS, but often worth it.
# Aurora Cluster
resource "aws_rds_cluster" "aurora" {
  cluster_identifier      = "production-aurora-cluster"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  database_name           = "myapp"
  master_username         = "admin"
  master_password         = random_password.aurora.result

  # Networking
  db_subnet_group_name    = aws_db_subnet_group.main.name
  vpc_security_group_ids  = [aws_security_group.aurora.id]

  # Backups
  backup_retention_period = 7
  preferred_backup_window = "03:00-04:00"

  # Encryption
  storage_encrypted = true
  kms_key_id        = aws_kms_key.aurora.id

  # Deletion protection
  deletion_protection = true
  skip_final_snapshot = false
  final_snapshot_identifier = "aurora-final-snapshot"

  # Backtrack (rewind the cluster in place without restoring a backup) is
  # Aurora MySQL only and is specified in seconds (max 259200 = 72 hours);
  # it is not supported on aurora-postgresql, so it is omitted here

  # CloudWatch Logs
  enabled_cloudwatch_logs_exports = ["postgresql"]

  tags = {
    Name = "production-aurora"
  }
}

# Aurora Writer Instance
resource "aws_rds_cluster_instance" "aurora_writer" {
  identifier         = "production-aurora-writer"
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.aurora.engine
  engine_version     = aws_rds_cluster.aurora.engine_version

  performance_insights_enabled = true

  tags = {
    Name = "aurora-writer"
    Role = "writer"
  }
}

# Aurora Reader Instance (auto-scaling target)
resource "aws_rds_cluster_instance" "aurora_reader_1" {
  identifier         = "production-aurora-reader-1"
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.aurora.engine
  engine_version     = aws_rds_cluster.aurora.engine_version

  performance_insights_enabled = true

  tags = {
    Name = "aurora-reader-1"
    Role = "reader"
  }
}

# Auto-scaling for Aurora read replicas
resource "aws_appautoscaling_target" "aurora_replicas" {
  max_capacity       = 5
  min_capacity       = 1
  resource_id        = "cluster:${aws_rds_cluster.aurora.cluster_identifier}"
  scalable_dimension = "rds:cluster:ReadReplicaCount"
  service_namespace  = "rds"
}

resource "aws_appautoscaling_policy" "aurora_replicas" {
  name               = "aurora-cpu-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.aurora_replicas.resource_id
  scalable_dimension = aws_appautoscaling_target.aurora_replicas.scalable_dimension
  service_namespace  = aws_appautoscaling_target.aurora_replicas.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "RDSReaderAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

5.4 DynamoDB (NoSQL)

DynamoDB Basics:

  • Items are located by a partition (hash) key, optionally plus a sort (range) key
  • Model tables around known access patterns, not entities; use GSIs for alternate queries
  • Billing is on-demand (pay per request) or provisioned (RCU/WCU)
  • Item size is capped at 400 KB; queries within one partition are fast and cheap

# DynamoDB table
resource "aws_dynamodb_table" "users" {
  name           = "Users"
  billing_mode   = "PAY_PER_REQUEST"  # On-demand pricing
  # Or "PROVISIONED" with read_capacity and write_capacity

  hash_key  = "user_id"       # Partition key
  range_key = "created_at"    # Sort key (optional)

  attribute {
    name = "user_id"
    type = "S"  # String
  }

  attribute {
    name = "created_at"
    type = "N"  # Number (Unix timestamp)
  }

  attribute {
    name = "email"
    type = "S"
  }

  # Global Secondary Index (query by email)
  global_secondary_index {
    name            = "EmailIndex"
    hash_key        = "email"
    projection_type = "ALL"  # Include all attributes in index

    # For on-demand billing, no need to specify capacity
  }

  # Point-in-time recovery
  point_in_time_recovery {
    enabled = true
  }

  # Server-side encryption
  server_side_encryption {
    enabled     = true
    kms_key_arn = aws_kms_key.dynamodb.arn
  }

  # TTL (auto-delete expired items)
  ttl {
    attribute_name = "expiration_time"
    enabled        = true
  }

  # DynamoDB Streams (for change data capture)
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  tags = {
    Name = "users-table"
  }
}

# DynamoDB Global Table (multi-region replication)
resource "aws_dynamodb_table" "orders_global" {
  name           = "Orders"
  billing_mode   = "PAY_PER_REQUEST"
  hash_key       = "order_id"

  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "order_id"
    type = "S"
  }

  # Replicate to multiple regions
  replica {
    region_name = "us-west-2"
  }

  replica {
    region_name = "eu-west-1"
  }

  tags = {
    Name = "orders-global-table"
  }
}

5.5 ElastiCache (Redis/Memcached)

Redis vs Memcached:

  • Redis: rich data structures (lists, sets, sorted sets), persistence, replication, pub/sub, automatic failover
  • Memcached: simple multi-threaded key-value cache with no persistence or replication
  • Default to Redis unless you only need a simple, ephemeral cache

See Messaging Systems Guide for Redis Pub/Sub details.

# ElastiCache Redis Cluster
resource "aws_elasticache_replication_group" "redis" {
  replication_group_id = "production-redis"
  description          = "Redis cluster for session storage and caching"

  engine               = "redis"
  engine_version       = "7.0"
  node_type            = "cache.r6g.large"
  num_cache_clusters   = 3  # 1 primary + 2 replicas
  parameter_group_name = "default.redis7"
  port                 = 6379

  # Networking
  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.redis.id]

  # High Availability
  automatic_failover_enabled = true  # Auto-failover to replica
  multi_az_enabled           = true

  # Backups
  snapshot_retention_limit = 5  # Days
  snapshot_window          = "03:00-04:00"
  maintenance_window       = "sun:05:00-sun:06:00"

  # Encryption (setting auth_token requires transit encryption)
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = random_password.redis_auth.result

  # Logs
  log_delivery_configuration {
    destination      = aws_cloudwatch_log_group.redis.name
    destination_type = "cloudwatch-logs"
    log_format       = "json"
    log_type         = "slow-log"
  }

  tags = {
    Name = "production-redis"
  }
}

# Subnet group for ElastiCache
resource "aws_elasticache_subnet_group" "main" {
  name       = "main-cache-subnet-group"
  subnet_ids = [aws_subnet.private_1.id, aws_subnet.private_2.id]

  tags = {
    Name = "main-cache-subnet-group"
  }
}

# Security group for Redis
resource "aws_security_group" "redis" {
  name        = "redis-cluster"
  description = "Security group for Redis cluster"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "Redis from app servers"
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  tags = {
    Name = "redis-sg"
  }
}

6. Security Deep Dive

6.1 KMS (Key Management Service)

# KMS key for encrypting data
resource "aws_kms_key" "app_data" {
  description             = "KMS key for application data encryption"
  deletion_window_in_days = 30  # Grace period before deletion
  enable_key_rotation     = true

  # Key policy
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
        }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Allow services to use the key"
        Effect = "Allow"
        Principal = {
          Service = [
            "s3.amazonaws.com",
            "rds.amazonaws.com",
            "dynamodb.amazonaws.com",
            "secretsmanager.amazonaws.com"
          ]
        }
        Action = [
          "kms:Decrypt",
          "kms:GenerateDataKey"
        ]
        Resource = "*"
      }
    ]
  })

  tags = {
    Name = "app-data-key"
  }
}

# Alias for easier reference
resource "aws_kms_alias" "app_data" {
  name          = "alias/app-data"
  target_key_id = aws_kms_key.app_data.key_id
}

6.2 Secrets Manager

# Secrets Manager secret
resource "aws_secretsmanager_secret" "api_key" {
  name                    = "production/api/external-service-key"
  description             = "API key for external service"
  recovery_window_in_days = 7

  tags = {
    Purpose = "external-api-auth"
  }
}

# Secret value
resource "aws_secretsmanager_secret_version" "api_key" {
  secret_id     = aws_secretsmanager_secret.api_key.id
  secret_string = jsonencode({
    api_key    = "super-secret-key"
    api_secret = "super-secret-value"
  })
}

# IAM policy to allow Lambda to read secret
resource "aws_iam_role_policy" "lambda_read_secret" {
  name = "read-api-secret"
  role = aws_iam_role.lambda.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "secretsmanager:GetSecretValue"
        ]
        Resource = aws_secretsmanager_secret.api_key.arn
      }
    ]
  })
}

# Automatic rotation (Lambda function rotates secret)
resource "aws_secretsmanager_secret_rotation" "api_key" {
  secret_id           = aws_secretsmanager_secret.api_key.id
  rotation_lambda_arn = aws_lambda_function.rotate_secret.arn

  rotation_rules {
    automatically_after_days = 30
  }
}

6.3 Security Monitoring & Compliance

Essential Security Services

Service | Purpose | What It Detects
GuardDuty | Threat detection | Unusual API calls, compromised instances, reconnaissance
Security Hub | Centralized security findings | Aggregates GuardDuty, Inspector, Macie, Config findings
CloudTrail | API audit logging | Who did what, when (all AWS API calls)
Config | Resource compliance | Non-compliant resources (unencrypted EBS, public S3)
Inspector | Vulnerability scanning | EC2/ECR vulnerabilities, network exposure
Macie | Data discovery & protection | Sensitive data in S3 (PII, credentials)
WAF | Web application firewall | SQL injection, XSS, bot traffic

# Enable GuardDuty (threat detection)
resource "aws_guardduty_detector" "main" {
  enable = true

  finding_publishing_frequency = "FIFTEEN_MINUTES"

  datasources {
    s3_logs {
      enable = true
    }
    kubernetes {
      audit_logs {
        enable = true
      }
    }
  }

  tags = {
    Name = "main-guardduty"
  }
}

# Enable Security Hub (compliance & security findings)
resource "aws_securityhub_account" "main" {
  enable_default_standards = true
  control_finding_generator = "SECURITY_CONTROL"
}

# Enable CloudTrail (API logging)
resource "aws_cloudtrail" "main" {
  name                          = "organization-trail"
  s3_bucket_name                = aws_s3_bucket.cloudtrail.id
  include_global_service_events = true
  is_multi_region_trail         = true
  is_organization_trail         = true

  # Enable logging of data events
  event_selector {
    read_write_type           = "All"
    include_management_events = true

    # Log S3 data events (the prefix "arn:aws:s3" matches all buckets/objects)
    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3"]
    }

    # Log Lambda invocations (the prefix "arn:aws:lambda" matches all functions)
    data_resource {
      type   = "AWS::Lambda::Function"
      values = ["arn:aws:lambda"]
    }
  }

  # CloudWatch Logs integration
  cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.cloudtrail.arn}:*"
  cloud_watch_logs_role_arn  = aws_iam_role.cloudtrail_cloudwatch.arn

  tags = {
    Name = "organization-cloudtrail"
  }
}

# Enable Config (resource compliance)
resource "aws_config_configuration_recorder" "main" {
  name     = "main-config-recorder"
  role_arn = aws_iam_role.config.arn

  recording_group {
    all_supported                 = true
    include_global_resource_types = true
  }
}

resource "aws_config_configuration_recorder_status" "main" {
  name       = aws_config_configuration_recorder.main.name
  is_enabled = true

  depends_on = [aws_config_delivery_channel.main]
}

# Config rules (compliance checks)
resource "aws_config_config_rule" "encrypted_volumes" {
  name = "encrypted-volumes"

  source {
    owner             = "AWS"
    source_identifier = "ENCRYPTED_VOLUMES"
  }

  depends_on = [aws_config_configuration_recorder.main]
}

resource "aws_config_config_rule" "s3_bucket_public_read_prohibited" {
  name = "s3-bucket-public-read-prohibited"

  source {
    owner             = "AWS"
    source_identifier = "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  }

  depends_on = [aws_config_configuration_recorder.main]
}

7. Real-World Architecture Patterns

7.1 Multi-Tier Web Application

Architecture: 3-Tier Web App with Auto-Scaling

Use Case: E-commerce website with variable traffic

graph TB Users["Users"] subgraph "AWS Cloud" Route53["Route 53
DNS"] CloudFront["CloudFront CDN
(Static Assets)"] subgraph "VPC" subgraph "Public Subnets" ALB["Application Load Balancer
(us-east-1a, us-east-1b)"] end subgraph "Private Subnets - Web Tier" ASG1["Auto Scaling Group
EC2 Web Servers"] end subgraph "Private Subnets - App Tier" ASG2["Auto Scaling Group
EC2 App Servers"] Cache["ElastiCache Redis
(Session Store)"] end subgraph "Private Subnets - Data Tier" RDS["Aurora PostgreSQL
(Multi-AZ)"] S3["S3
(User Uploads)"] end end end Users --> Route53 Route53 --> CloudFront CloudFront --> S3 Route53 --> ALB ALB --> ASG1 ASG1 --> ASG2 ASG2 --> Cache ASG2 --> RDS ASG2 --> S3 style CloudFront fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#232f3e style ALB fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440 style RDS fill:#bf616a,stroke:#d08770,stroke-width:2px,color:#2e3440

Key Components:

  • Route 53 for DNS; CloudFront serves static assets from S3
  • ALB in public subnets across two AZs; all compute stays in private subnets
  • Separate Auto Scaling Groups for the web and app tiers
  • ElastiCache Redis as a session store so instances remain stateless
  • Aurora PostgreSQL (Multi-AZ) and S3 for the data tier

See Full Terraform:

Combines the VPC setup, ALB, Auto Scaling Group, RDS, and ElastiCache examples above.

7.2 Serverless Microservices

Architecture: Event-Driven Serverless

Use Case: Order processing system with multiple services

graph TB API["API Gateway
(REST API)"] Lambda1["Lambda: Create Order"] Lambda2["Lambda: Process Payment"] Lambda3["Lambda: Update Inventory"] Lambda4["Lambda: Send Notification"] DDB["DynamoDB: Orders Table"] SQS1["SQS: Payment Queue"] SQS2["SQS: Inventory Queue"] SNS["SNS: Order Events Topic"] S3["S3: Order Receipts"] SES["SES: Email Service"] API --> Lambda1 Lambda1 --> DDB Lambda1 --> SQS1 Lambda1 --> SQS2 Lambda1 --> SNS SQS1 --> Lambda2 Lambda2 --> DDB SQS2 --> Lambda3 Lambda3 --> DDB SNS --> Lambda4 Lambda4 --> S3 Lambda4 --> SES style API fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#232f3e style DDB fill:#88c0d0,stroke:#81a1c1,stroke-width:2px,color:#2e3440 style SNS fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440

Flow:

  1. User calls API Gateway → Lambda creates order in DynamoDB
  2. Lambda publishes to SQS queues (payment, inventory)
  3. Lambda publishes to SNS topic (order-events)
  4. Payment Lambda processes from SQS, updates DynamoDB
  5. Inventory Lambda processes from SQS, updates DynamoDB
  6. Notification Lambda subscribes to SNS, generates receipt (S3), sends email (SES)

Benefits:

  • No servers to manage: Lambda auto-scales
  • Cost-effective: Pay per request
  • Decoupled: Services communicate via SQS/SNS
  • Resilient: SQS retries failed messages, DLQ for poison messages
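The retry/DLQ behavior above can be sketched in Terraform; queue and function names here are illustrative, not from the original:

```hcl
# Dead-letter queue for poison messages
resource "aws_sqs_queue" "payment_dlq" {
  name = "payment-queue-dlq"
}

# Main payment queue: after 3 failed receives, a message moves to the DLQ
resource "aws_sqs_queue" "payment" {
  name                       = "payment-queue"
  visibility_timeout_seconds = 60  # Should exceed the consuming Lambda's timeout

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.payment_dlq.arn
    maxReceiveCount     = 3
  })
}

# Wire the queue to the payment Lambda (assumed to exist elsewhere)
resource "aws_lambda_event_source_mapping" "payment" {
  event_source_arn = aws_sqs_queue.payment.arn
  function_name    = aws_lambda_function.process_payment.arn
  batch_size       = 10  # Up to 10 messages per invocation
}
```

With this wiring, Lambda deletes successfully processed messages automatically; failures become visible again after the visibility timeout until the redrive policy sends them to the DLQ.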

See also: Messaging Systems Guide for detailed SQS/SNS/EventBridge patterns

7.3 Data Processing Pipeline

Architecture: Real-Time Analytics Pipeline

Use Case: Process clickstream data for real-time dashboards

graph LR App["Web App"] Kinesis["Kinesis Data Stream
(Clickstream)"] Lambda1["Lambda: Transform"] Firehose["Kinesis Firehose"] S3["S3: Data Lake
(Parquet)"] Athena["Athena
(SQL Queries)"] QuickSight["QuickSight
(Dashboards)"] Lambda2["Lambda: Real-time Alerts"] DDB["DynamoDB: Aggregates"] App --> Kinesis Kinesis --> Lambda1 Lambda1 --> Firehose Firehose --> S3 S3 --> Athena Athena --> QuickSight Kinesis --> Lambda2 Lambda2 --> DDB style Kinesis fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#232f3e style S3 fill:#a3be8c,stroke:#8fbcbb,stroke-width:2px,color:#2e3440

Flow:

  1. Web app sends events to Kinesis Data Stream
  2. Lambda transforms/enriches events in real-time
  3. Firehose batches and writes to S3 (Parquet format)
  4. Athena queries S3 data lake (SQL)
  5. QuickSight creates dashboards from Athena queries
  6. Separate Lambda reads stream for real-time alerts, writes aggregates to DynamoDB
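Steps 1 and 3 of the flow can be sketched in Terraform. Names and buffering values are illustrative; Parquet conversion additionally requires a Glue table for the schema, which is omitted here:

```hcl
# Clickstream ingestion stream
resource "aws_kinesis_stream" "clickstream" {
  name = "clickstream"

  stream_mode_details {
    stream_mode = "ON_DEMAND"  # No shard capacity management
  }
}

# Firehose batches stream records into S3 objects
resource "aws_kinesis_firehose_delivery_stream" "to_s3" {
  name        = "clickstream-to-s3"
  destination = "extended_s3"

  kinesis_source_configuration {
    kinesis_stream_arn = aws_kinesis_stream.clickstream.arn
    role_arn           = aws_iam_role.firehose.arn  # Assumed IAM role
  }

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose.arn
    bucket_arn = aws_s3_bucket.data_lake.arn  # Assumed data lake bucket

    buffering_size     = 64   # MB per object (larger objects query faster in Athena)
    buffering_interval = 300  # Seconds
  }
}
```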

Interview Tips

Key Topics to Master:
  • Multi-account strategy: Why use multiple accounts, SCPs, cross-account access
  • VPC design: Public/private subnets, NAT Gateway, VPC Peering vs Transit Gateway
  • Security layers: Security Groups (stateful) vs NACLs (stateless)
  • Compute trade-offs: EC2 vs Lambda vs ECS vs EKS - when to use each
  • Database selection: RDS vs Aurora vs DynamoDB decision tree
  • Storage classes: S3 Standard → IA → Glacier based on access patterns
  • IAM best practices: Roles over users, least privilege, service accounts
  • Encryption: KMS for data-at-rest, TLS for data-in-transit
  • Monitoring: GuardDuty, Security Hub, CloudTrail, Config
  • Architecture patterns: Multi-tier, serverless, data pipelines
Common Interview Questions:
  1. Design a highly available, scalable web application on AWS
  2. How would you secure access to a database in a private subnet?
  3. Explain the difference between Security Groups and NACLs
  4. When would you choose DynamoDB over RDS?
  5. How do you enable cross-account access securely?
  6. Design a disaster recovery strategy for a critical application
  7. How would you handle a Lambda function that needs to access resources in a VPC?
  8. Explain the difference between VPC Peering and Transit Gateway
  9. How do you implement defense-in-depth security in AWS?
  10. Design a data pipeline for processing millions of events per day
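For question 5, the standard answer is an IAM role in the target account whose trust policy names the other account, which that account's principals then assume via sts:AssumeRole. A hedged sketch (account ID, external ID, and names are placeholders):

```hcl
# Role in the target account that the trusted account may assume
resource "aws_iam_role" "cross_account_readonly" {
  name = "cross-account-readonly"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        AWS = "arn:aws:iam::111111111111:root"  # Trusted (source) account
      }
      Action = "sts:AssumeRole"
      Condition = {
        # Optional external ID guards against the confused-deputy problem
        StringEquals = { "sts:ExternalId" = "example-external-id" }
      }
    }]
  })
}

# Grant the role least-privilege permissions (managed read-only here)
resource "aws_iam_role_policy_attachment" "readonly" {
  role       = aws_iam_role.cross_account_readonly.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```

The source account still needs an identity-based policy allowing sts:AssumeRole on this role's ARN; both sides must agree before access works.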