CD-02: Infrastructure as Code Pipeline

User Story

As a DevOps Engineer on the MangaAssist AI Chatbot team, I want to establish an automated Infrastructure as Code (IaC) pipeline that provisions, validates, and deploys all AWS resources through version-controlled code, So that infrastructure changes are reviewable, repeatable, and auditable — eliminating manual console changes and enabling the 1-2 person team to manage 20+ AWS services safely.

Acceptance Criteria

High-Level Design

Infrastructure Pipeline Overview

flowchart LR
    subgraph "Developer"
        A[CDK Code Change] --> B[PR Created]
    end

    subgraph "CI Phase"
        B --> C[CDK Synth]
        C --> D[CDK Unit Tests]
        D --> E[cdk diff — Post to PR]
        E --> F[Security Scan — cfn-nag]
        F --> G[Cost Estimate]
    end

    subgraph "Staging Deploy"
        G --> H[Deploy to Staging]
        H --> I[Integration Tests]
        I --> J{Tests Pass?}
    end

    subgraph "Production Deploy"
        J -->|Yes| K[Manual Approval Gate]
        K --> L[Deploy to Production]
        L --> M[Post-Deploy Validation]
        J -->|No| N[Fail + Notify]
    end

    style K fill:#ff9900,color:#000
    style L fill:#1B660F,color:#fff
    style N fill:#DD344C,color:#fff

AWS Resources Managed by This Pipeline

Stack	Resources	Change Frequency
Networking	VPC, Subnets, Security Groups, NAT Gateway	Rare (quarterly)
Compute	ECS Cluster, Task Definitions, ALB, Lambda Functions	Weekly
Data	DynamoDB Tables, DAX Clusters, ElastiCache Redis	Monthly
AI/ML	SageMaker Endpoints, Bedrock Config, Model Registry	Monthly
Search	OpenSearch Serverless Collections, Indexes	Monthly
Edge	CloudFront Distribution, API Gateway, WAF Rules	Bi-weekly
Observability	CloudWatch Dashboards, Alarms, X-Ray Groups	Weekly
Security	IAM Roles, Cognito User Pool, Secrets Manager	Rare
Storage	S3 Buckets, Lifecycle Policies	Rare

Low-Level Design

1. CDK Project Structure — Multi-Stack Architecture

infra/
├── bin/
│   └── app.ts                  # Stack instantiation + environment config
├── lib/
│   ├── stacks/
│   │   ├── networking-stack.ts     # VPC, subnets, security groups
│   │   ├── data-stack.ts           # DynamoDB, DAX, ElastiCache
│   │   ├── compute-stack.ts        # ECS, ALB, Lambda
│   │   ├── ai-ml-stack.ts          # SageMaker, Bedrock config
│   │   ├── search-stack.ts         # OpenSearch Serverless
│   │   ├── edge-stack.ts           # CloudFront, API Gateway, WAF
│   │   ├── observability-stack.ts  # CloudWatch, X-Ray
│   │   └── security-stack.ts       # IAM, Cognito, Secrets
│   ├── constructs/
│   │   ├── fargate-service.ts      # Reusable ECS Fargate pattern
│   │   ├── dynamodb-table.ts       # Table with standard config
│   │   └── alarm-factory.ts        # CloudWatch alarm patterns
│   └── config/
│       ├── staging.ts              # Staging environment config
│       └── production.ts           # Production environment config
├── test/
│   ├── stacks/
│   │   ├── networking-stack.test.ts
│   │   ├── compute-stack.test.ts
│   │   └── ...
│   └── constructs/
│       └── fargate-service.test.ts
├── cdk.json
├── tsconfig.json
└── package.json

2. Stack Dependency Graph

graph TD
    NET["Networking Stack<br/>(VPC, Subnets, SGs)"]
    SEC["Security Stack<br/>(IAM, Cognito, Secrets)"]
    DATA["Data Stack<br/>(DynamoDB, Redis, DAX)"]
    SEARCH["Search Stack<br/>(OpenSearch Serverless)"]
    AIML["AI/ML Stack<br/>(SageMaker, Bedrock)"]
    COMP["Compute Stack<br/>(ECS, ALB, Lambda)"]
    EDGE["Edge Stack<br/>(CloudFront, API GW, WAF)"]
    OBS["Observability Stack<br/>(CloudWatch, X-Ray)"]

    NET --> DATA
    NET --> SEARCH
    NET --> AIML
    NET --> COMP
    SEC --> COMP
    SEC --> AIML
    SEC --> DATA
    DATA --> COMP
    SEARCH --> COMP
    AIML --> COMP
    COMP --> EDGE
    COMP --> OBS
    EDGE --> OBS

    style NET fill:#ff9900,color:#000
    style COMP fill:#146eb4,color:#fff
    style AIML fill:#8C4FFF,color:#fff
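
The deploy order implied by this graph can be derived mechanically rather than maintained by hand. The following is a minimal sketch, assuming the edges are transcribed exactly as drawn above; the `STACK_DEPENDENCIES` mapping and the `deploy_order` helper are illustrative names, not part of the CDK project.

```python
from graphlib import TopologicalSorter

# Each stack maps to the set of stacks it depends on
# (an edge "A --> B" in the graph means B depends on A).
STACK_DEPENDENCIES: dict[str, set[str]] = {
    "Networking": set(),
    "Security": set(),
    "Data": {"Networking", "Security"},
    "Search": {"Networking"},
    "AI/ML": {"Networking", "Security"},
    "Compute": {"Networking", "Security", "Data", "Search", "AI/ML"},
    "Edge": {"Compute"},
    "Observability": {"Compute", "Edge"},
}


def deploy_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one valid deploy order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(" -> ".join(deploy_order(STACK_DEPENDENCIES)))
```

Several orders are valid for this graph (for example, Networking and Security can swap), so a pipeline should rely only on the guarantee that every stack appears after its dependencies. In practice `cdk deploy --all` resolves this ordering itself from cross-stack references; the sketch is mainly useful for checking a hand-written order against the graph.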

3. CDK Stack Example — Compute Stack

import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import { Construct } from 'constructs';

interface ComputeStackProps extends cdk.StackProps {
  vpc: ec2.IVpc;
  cluster?: ecs.ICluster;
  environment: 'staging' | 'production';
}

export class ComputeStack extends cdk.Stack {
  public readonly cluster: ecs.Cluster;
  public readonly service: ecs.FargateService;

  constructor(scope: Construct, id: string, props: ComputeStackProps) {
    super(scope, id, props);

    const config = {
      staging: { desiredCount: 2, cpu: 512, memory: 1024 },
      production: { desiredCount: 4, cpu: 1024, memory: 2048 },
    }[props.environment];

    this.cluster = new ecs.Cluster(this, 'ChatbotCluster', {
      vpc: props.vpc,
      containerInsights: true,
    });

    const taskDef = new ecs.FargateTaskDefinition(this, 'ChatbotTask', {
      cpu: config.cpu,
      memoryLimitMiB: config.memory,
    });

    taskDef.addContainer('chatbot', {
      image: ecs.ContainerImage.fromEcrRepository(/* ECR repo reference */),
      portMappings: [{ containerPort: 8080 }],
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'chatbot' }),
      healthCheck: {
        command: ['CMD-SHELL', 'curl -f http://localhost:8080/health || exit 1'],
        interval: cdk.Duration.seconds(15),
        timeout: cdk.Duration.seconds(5),
        retries: 3,
      },
    });

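    this.service = new ecs.FargateService(this, 'ChatbotService', {
      cluster: this.cluster,
      taskDefinition: taskDef,
      desiredCount: config.desiredCount,
      // CodeDeploy owns rollout and rollback for this service. The ECS
      // deployment circuit breaker is only supported with the default ECS
      // (rolling update) controller, so it is not configured here.
      deploymentController: { type: ecs.DeploymentControllerType.CODE_DEPLOY },
    });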
  }
}

4. CDK Unit Tests

import { Template, Match } from 'aws-cdk-lib/assertions';
import * as cdk from 'aws-cdk-lib';
import { ComputeStack } from '../lib/stacks/compute-stack';
import { NetworkingStack } from '../lib/stacks/networking-stack';

describe('ComputeStack', () => {
  let template: Template;

  beforeAll(() => {
    const app = new cdk.App();
    // AWS account IDs are 12 digits; this one is a dummy value for synthesis only.
    const networkStack = new NetworkingStack(app, 'TestNetwork', {
      env: { account: '123456789012', region: 'us-east-1' },
    });
    const stack = new ComputeStack(app, 'TestCompute', {
      vpc: networkStack.vpc,
      environment: 'production',
      env: { account: '123456789012', region: 'us-east-1' },
    });
    template = Template.fromStack(stack);
  });

  test('ECS service uses CodeDeploy deployment controller', () => {
    template.hasResourceProperties('AWS::ECS::Service', {
      DeploymentController: { Type: 'CODE_DEPLOY' },
    });
  });

  test('Production has 4 desired tasks', () => {
    template.hasResourceProperties('AWS::ECS::Service', {
      DesiredCount: 4,
    });
  });

  test('Task definition uses 1024 CPU for production', () => {
    template.hasResourceProperties('AWS::ECS::TaskDefinition', {
      Cpu: '1024',
    });
  });

  test('Container has health check configured', () => {
    template.hasResourceProperties('AWS::ECS::TaskDefinition', {
      ContainerDefinitions: Match.arrayWith([
        Match.objectLike({
          HealthCheck: Match.objectLike({
            Command: Match.arrayWith(['CMD-SHELL']),
          }),
        }),
      ]),
    });
  });

  test('No public subnets for ECS tasks', () => {
    template.hasResourceProperties('AWS::ECS::Service', {
      NetworkConfiguration: Match.objectLike({
        AwsvpcConfiguration: Match.objectLike({
          AssignPublicIp: 'DISABLED',
        }),
      }),
    });
  });
});

5. CI Pipeline — CDK Validation

name: infra-validate
on:
  pull_request:
    paths: ['infra/**']

permissions:
  id-token: write  # OIDC for AWS auth
  contents: read
  pull-requests: write

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/github-actions-infra  # replace ACCOUNT with the 12-digit account ID
          aws-region: us-east-1

      - name: Install dependencies
        run: cd infra && npm ci

      - name: CDK Synth
        run: cd infra && npx cdk synth --all

      - name: CDK Unit Tests
        run: cd infra && npm test

      - name: Security Scan (cfn-nag)
        run: |
          gem install cfn-nag
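          # --input-path takes a single file or directory, so scan the directory
          # and filter to synthesized templates instead of passing a shell glob.
          cfn_nag_scan --input-path infra/cdk.out --template-pattern '.*\.template\.json$'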

      - name: CDK Diff (Post to PR)
        run: |
          cd infra
          DIFF=$(npx cdk diff --all 2>&1) || true
          echo "## Infrastructure Changes" > diff.md
          echo '```' >> diff.md
          echo "$DIFF" >> diff.md
          echo '```' >> diff.md

      - name: Comment diff on PR
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: infra/diff.md

6. CD Pipeline — Staged Deployment

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant CI as CI Pipeline
    participant STG as Staging Account
    participant PROD as Production Account
    participant CW as CloudWatch

    Dev->>GH: Merge PR to main
    GH->>CI: Trigger deploy pipeline
    CI->>CI: cdk synth + tests
    CI->>STG: cdk deploy --all
    STG-->>CI: Stack outputs
    CI->>STG: Run integration tests
    STG-->>CI: Tests pass

    CI->>GH: Request manual approval
    Dev->>GH: Approve deployment
    GH->>CI: Approval received

    CI->>PROD: cdk deploy --all (stack-by-stack)
    Note over CI,PROD: Deploy order: Networking → Security → Data → Search → AI/ML → Compute → Edge → Observability
    PROD-->>CI: All stacks deployed
    CI->>CW: Verify alarms healthy
    CI->>GH: Post deployment summary
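
The "Verify alarms healthy" step can be a small script that fails the pipeline when any alarm is firing. The sketch below assumes alarms share a `MangaAssist-` name prefix (a hypothetical convention, not stated in this document) and takes the CloudWatch client as a parameter, so it can be called with `boto3.client("cloudwatch")` in the pipeline or with a stub in tests.

```python
from typing import Any


def alarms_in_alarm_state(cloudwatch: Any, name_prefix: str) -> list[str]:
    """Return names of metric alarms with the given prefix that are in ALARM."""
    names: list[str] = []
    kwargs: dict[str, Any] = {"AlarmNamePrefix": name_prefix, "StateValue": "ALARM"}
    while True:
        page = cloudwatch.describe_alarms(**kwargs)
        names.extend(alarm["AlarmName"] for alarm in page.get("MetricAlarms", []))
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token  # fetch the next page of results


def verify_deployment(cloudwatch: Any, name_prefix: str = "MangaAssist-") -> None:
    """Raise if any matching alarm is firing, so the pipeline step fails."""
    firing = alarms_in_alarm_state(cloudwatch, name_prefix)
    if firing:
        raise RuntimeError(f"Alarms in ALARM state after deploy: {', '.join(firing)}")
```

A real check would probably wait a few minutes after the deploy before querying, since alarms need at least one evaluation period to react to a bad release.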

7. Drift Detection

# Daily drift detection Lambda (triggered by EventBridge)
import json
import time

import boto3

POLL_INTERVAL_SECONDS = 5


def lambda_handler(event, context):
    cfn = boto3.client('cloudformation')
    sns = boto3.client('sns')

    stacks = [
        'MangaAssist-Networking',
        'MangaAssist-Data',
        'MangaAssist-Compute',
        'MangaAssist-Edge',
        'MangaAssist-Observability',
    ]

    drifted_stacks = []
    for stack_name in stacks:
        # detect_stack_drift is asynchronous and returns an ID to poll.
        detection_id = cfn.detect_stack_drift(
            StackName=stack_name
        )['StackDriftDetectionId']

        # boto3 has no built-in waiter for drift detection, so poll until it
        # finishes. Keep the Lambda timeout high enough to cover all stacks.
        # Note: In production, use Step Functions for async wait
        while True:
            status = cfn.describe_stack_drift_detection_status(
                StackDriftDetectionId=detection_id
            )
            if status['DetectionStatus'] != 'DETECTION_IN_PROGRESS':
                break
            time.sleep(POLL_INTERVAL_SECONDS)

        if status.get('StackDriftStatus') == 'DRIFTED':
            drifted_stacks.append({
                'stack': stack_name,
                'drifted_resources': status['DriftedStackResourceCount'],
            })

    if drifted_stacks:
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:ACCOUNT:infra-drift-alerts',
            Subject=f'DRIFT DETECTED: {len(drifted_stacks)} stacks',
            Message=json.dumps(drifted_stacks, indent=2),
        )

    return {'drifted': len(drifted_stacks), 'details': drifted_stacks}

Critical Decisions

Decision 1: IaC Tool — AWS CDK (TypeScript) vs Terraform vs Raw CloudFormation

Criteria (Weight)	AWS CDK (TypeScript)	Terraform (HCL)	Raw CloudFormation (YAML)
Abstraction Level (20%)	9/10 — L2/L3 constructs, loops, conditions	7/10 — modules, for_each	4/10 — verbose, repetitive
AWS Integration (20%)	10/10 — synthesizes to CFN natively	7/10 — AWS provider, some lag	10/10 — first-party
Type Safety (15%)	10/10 — full TypeScript types	5/10 — HCL validation only	3/10 — YAML, no types
State Management (15%)	8/10 — CFN manages state	6/10 — state file (locking, remote)	8/10 — CFN manages state
Multi-Cloud (5%)	2/10 — AWS only	10/10 — any provider	1/10 — AWS only
Testing (10%)	9/10 — CDK assertions, snapshot tests	7/10 — terratest, plan validation	4/10 — cfn-lint only
Team Learning Curve (10%)	8/10 — team knows TypeScript	5/10 — must learn HCL	6/10 — verbose but readable
Drift Detection (5%)	7/10 — CFN drift detection	9/10 — terraform plan detects drift	7/10 — CFN drift detection
Weighted Score	8.7/10	6.6/10	5.9/10
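
The weighted totals follow directly from the per-criterion weights and scores. As a quick arithmetic check, the sketch below recomputes them from the rows of this matrix; the `CRITERIA` layout and the `weighted_totals` helper are illustrative only.

```python
# Weights in percent and per-criterion scores in the order
# (AWS CDK, Terraform, Raw CloudFormation), copied from the matrix rows.
CRITERIA = {
    "Abstraction Level": (20, (9, 7, 4)),
    "AWS Integration": (20, (10, 7, 10)),
    "Type Safety": (15, (10, 5, 3)),
    "State Management": (15, (8, 6, 8)),
    "Multi-Cloud": (5, (2, 10, 1)),
    "Testing": (10, (9, 7, 4)),
    "Team Learning Curve": (10, (8, 5, 6)),
    "Drift Detection": (5, (7, 9, 7)),
}
TOOLS = ("AWS CDK", "Terraform", "Raw CloudFormation")


def weighted_totals() -> dict[str, float]:
    """Sum weight * score per tool; weights are percentages, so divide by 100."""
    assert sum(weight for weight, _ in CRITERIA.values()) == 100
    totals = {}
    for i, tool in enumerate(TOOLS):
        points = sum(weight * scores[i] for weight, scores in CRITERIA.values())
        totals[tool] = points / 100
    return totals


if __name__ == "__main__":
    for tool, total in weighted_totals().items():
        print(f"{tool}: {total:.2f}")
```

With these inputs the totals are 8.65 for AWS CDK, 6.60 for Terraform, and 5.85 for raw CloudFormation, so the ranking is not sensitive to small changes in any single weight.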

Decision: AWS CDK (TypeScript)

Rationale: The MangaAssist chatbot is 100% AWS. Multi-cloud support (Terraform's strength) provides zero value here. CDK's TypeScript constructs dramatically reduce code volume — a DynamoDB table with GSIs, auto-scaling, and alarms takes ~30 lines in CDK vs ~150 in CloudFormation YAML. The team already writes TypeScript for the frontend, so CDK adds no new language overhead.

Why not Terraform? State management is a liability for a small team. S3 + DynamoDB state locking works but is another system to maintain. State corruption requires manual surgery. CDK delegates state entirely to CloudFormation — one less thing to break.

Why not raw CloudFormation? The chatbot has 20+ AWS services across 8+ stacks. Raw CloudFormation YAML would be thousands of lines of repetitive configuration. CDK's constructs generate much of this automatically: for example, new ecs.FargateService() creates a default security group, new ecs.FargateTaskDefinition() creates the task's IAM roles, and the awsLogs log driver creates the CloudWatch log group. This reduces hand-written configuration by roughly 5-10x.

Decision 2: Stack Architecture — Single Stack vs Multi-Stack

Criteria	Single Stack	Multi-Stack (Current Choice)
Blast Radius	HIGH — one failure affects everything	LOW — failure isolated to one stack
Deploy Speed	Slow — CFN evaluates all resources	Fast — only changed stack deploys
Dependency Management	Simple — no cross-stack refs	Complex — stack exports, deploy order
Rollback Scope	All or nothing	Per-stack rollback
CFN Resource Limit	Risk hitting 500 resource limit	Each stack well under limit
Team Parallelism	One person deploys at a time	Different stacks can deploy independently

Decision: Multi-Stack (8 stacks)

Rationale: The chatbot uses 20+ different AWS services. A single stack would approach CloudFormation's 500-resource limit and create unacceptable blast radius — a typo in a CloudWatch alarm config should not risk rolling back the VPC. Multi-stack deployment adds ~5 minutes to the pipeline (sequential stack deploys) but eliminates the risk of cascading failures.

Decision 3: Environment Isolation — Separate AWS Accounts vs Same Account

Criteria	Separate Accounts	Same Account + Prefixes
Security Isolation	Strong — IAM boundary at account level	Weak — shared IAM, risk of cross-env access
Cost Visibility	Clear — per-account billing	Murky — requires cost allocation tags
Operational Complexity	Higher — cross-account roles, multiple credentials	Lower — single set of credentials
Resource Limits	Independent per account	Shared limits (could conflict)
Team Size Fit (1-2)	Overhead for small team	Simpler for small team
AWS Organizations	Required	Not needed

Decision: Separate AWS Accounts (via AWS Organizations)

graph LR
    subgraph "AWS Organization"
        MGMT["Management Account<br/>(Billing, SSO)"]
        STG["Staging Account<br/>(MangaAssist-Staging)"]
        PROD["Production Account<br/>(MangaAssist-Production)"]
        SHARED["Shared Services<br/>(ECR, Artifacts)"]
    end

    MGMT --> STG
    MGMT --> PROD
    MGMT --> SHARED
    SHARED -->|"Cross-account ECR pull"| STG
    SHARED -->|"Cross-account ECR pull"| PROD

Rationale: Even for a 1-2 person team, account-level isolation prevents the most dangerous class of mistakes — a staging deployment accidentally modifying production resources. AWS Organizations + IAM Identity Center (SSO) makes multi-account management nearly as simple as single-account. The overhead is worth the safety guarantee.

Tradeoffs

The Debate: Infrastructure Change Velocity vs Safety

graph TD
    subgraph "Product Manager"
        PM1["Fast infrastructure changes"]
        PM2["New features need new AWS resources quickly"]
        PM3["Don't want infra to be the bottleneck"]
    end

    subgraph "Architect"
        AR1["Every infra change reviewed by human"]
        AR2["Staging must mirror production exactly"]
        AR3["Zero-tolerance for infra drift"]
    end

    subgraph "DevOps Engineer"
        DE1["I'm the only one who can approve"]
        DE2["Manual approval creates bottleneck"]
        DE3["Automate gates, not approvals"]
    end

    PM2 ---|"Tension"| AR1
    PM3 ---|"Tension"| DE1
    AR2 ---|"Cost tension"| PM1
    DE2 ---|"Agrees with"| PM3
    AR3 ---|"Enables trust in"| DE3

Resolution

Concern	Solution	Compromise
PM: Infra changes slow down features	Separate compute stack deploys in < 10 min	Networking/security changes still slow (and should be)
Architect: Human review required	`cdk diff` auto-posted to every PR — review is fast	Only production gets manual approval; staging is fully automated
Architect: Staging must mirror prod	Same CDK code, different config values per environment	Staging runs at 50% capacity (cost saving) — not identical
DevOps: Single approver bottleneck	Expand approval to 2 people (DevOps + Tech Lead)	Tech Lead review may slow things slightly but removes bus factor
All: Drift must not happen	Daily drift detection Lambda + SNS alerts	Detection, not prevention — manual console access not blocked
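
The "same code, different config values" compromise can be made explicit by deriving staging sizing from production instead of maintaining two independent sets of numbers. This is a minimal sketch of that idea; `ServiceSizing` and `scaled` are hypothetical names, and the production values are the ones used in the Compute Stack example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSizing:
    desired_count: int
    cpu: int
    memory_mib: int


# Production values from the Compute Stack example.
PRODUCTION = ServiceSizing(desired_count=4, cpu=1024, memory_mib=2048)


def scaled(base: ServiceSizing, factor: float) -> ServiceSizing:
    """Scale every sizing field by factor, never dropping below one task."""
    return ServiceSizing(
        desired_count=max(1, round(base.desired_count * factor)),
        cpu=round(base.cpu * factor),
        memory_mib=round(base.memory_mib * factor),
    )


# Staging runs at 50% of production capacity.
STAGING = scaled(PRODUCTION, 0.5)

if __name__ == "__main__":
    print(STAGING)
```

A factor of 0.5 reproduces the staging values in the Compute Stack example (2 tasks, 512 CPU units, 1024 MiB). Fargate accepts only specific CPU and memory combinations, so an arbitrary factor may yield a size that has to be rounded to a supported pair.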

Key Tradeoff: CDK Abstraction Power vs Debugging Difficulty

The double-edged sword of CDK: High-level constructs (L2/L3) auto-generate resources you didn't explicitly define (security groups, IAM policies, log groups). This is powerful but means debugging CloudFormation failures requires understanding the generated template.

CDK Code:       ~200 lines TypeScript
Generated CFN:  ~2,000 lines YAML
Debug Surface:  10x larger than source code

Mitigation:

1. Always review cdk diff output (not just CDK code) before approving
2. Use CDK snapshot tests to catch unexpected resource changes
3. Keep cdk.out/ artifacts for debugging failed deployments
4. Prefer L2 constructs over L3 (less magic, more control)
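
The snapshot-test mitigation works by storing the synthesized template and failing when a later synth differs. In the CDK project itself this would typically be a Jest snapshot assertion on `Template.fromStack(stack).toJSON()`; the sketch below shows the same mechanism in plain Python so it is visible in one place. `check_snapshot` and the snapshot file path are illustrative assumptions.

```python
import json
from pathlib import Path


def check_snapshot(template: dict, snapshot_path: Path, update: bool = False) -> list[str]:
    """Compare a synthesized template with its stored snapshot.

    Returns the logical IDs that were added, removed, or changed. The first
    run (or update=True) writes the snapshot and reports no differences.
    """
    if update or not snapshot_path.exists():
        snapshot_path.write_text(json.dumps(template, indent=2, sort_keys=True))
        return []
    old = json.loads(snapshot_path.read_text()).get("Resources", {})
    new = template.get("Resources", {})
    return sorted(
        logical_id
        for logical_id in old.keys() | new.keys()
        if old.get(logical_id) != new.get(logical_id)
    )
```

A non-empty result means the generated CloudFormation changed, including resources the CDK code never names explicitly, which is exactly the class of surprise that high-level constructs can introduce. The reviewer then either fixes the code or accepts the change by regenerating the snapshot.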