
CD-02: Infrastructure as Code Pipeline

User Story

As a DevOps Engineer on the MangaAssist AI Chatbot team, I want an automated Infrastructure as Code (IaC) pipeline that provisions, validates, and deploys all AWS resources through version-controlled code, so that infrastructure changes are reviewable, repeatable, and auditable. This eliminates manual console changes and lets the 1-2 person team manage 20+ AWS services safely.


Acceptance Criteria

  • All AWS resources are defined in CDK (TypeScript) — zero manual console provisioning
  • Every infrastructure change goes through PR review with cdk diff output attached
  • CDK unit tests validate resource configurations before deployment
  • Staging environment is deployed and integration-tested before production
  • Production deployments require manual approval (the only manual gate in any pipeline)
  • Drift detection runs daily and alerts on unauthorized console changes
  • Stack rollback is automatic on CloudFormation deployment failure
  • Infrastructure costs are estimated before deployment via cdk diff + AWS Pricing API
  • Secrets and sensitive config are stored in AWS Secrets Manager (never in code)
  • Multi-stack architecture limits blast radius per deployment

High-Level Design

Infrastructure Pipeline Overview

flowchart LR
    subgraph "Developer"
        A[CDK Code Change] --> B[PR Created]
    end

    subgraph "CI Phase"
        B --> C[CDK Synth]
        C --> D[CDK Unit Tests]
        D --> E[cdk diff — Post to PR]
        E --> F[Security Scan — cfn-nag]
        F --> G[Cost Estimate]
    end

    subgraph "Staging Deploy"
        G --> H[Deploy to Staging]
        H --> I[Integration Tests]
        I --> J{Tests Pass?}
    end

    subgraph "Production Deploy"
        J -->|Yes| K[Manual Approval Gate]
        K --> L[Deploy to Production]
        L --> M[Post-Deploy Validation]
        J -->|No| N[Fail + Notify]
    end

    style K fill:#ff9900,color:#000
    style L fill:#1B660F,color:#fff
    style N fill:#DD344C,color:#fff

AWS Resources Managed by This Pipeline

Stack | Resources | Change Frequency
Networking | VPC, Subnets, Security Groups, NAT Gateway | Rare (quarterly)
Compute | ECS Cluster, Task Definitions, ALB, Lambda Functions | Weekly
Data | DynamoDB Tables, DAX Clusters, ElastiCache Redis | Monthly
AI/ML | SageMaker Endpoints, Bedrock Config, Model Registry | Monthly
Search | OpenSearch Serverless Collections, Indexes | Monthly
Edge | CloudFront Distribution, API Gateway, WAF Rules | Bi-weekly
Observability | CloudWatch Dashboards, Alarms, X-Ray Groups | Weekly
Security | IAM Roles, Cognito User Pool, Secrets Manager | Rare
Storage | S3 Buckets, Lifecycle Policies | Rare

Low-Level Design

1. CDK Project Structure — Multi-Stack Architecture

infra/
├── bin/
│   └── app.ts                  # Stack instantiation + environment config
├── lib/
│   ├── stacks/
│   │   ├── networking-stack.ts     # VPC, subnets, security groups
│   │   ├── data-stack.ts           # DynamoDB, DAX, ElastiCache
│   │   ├── compute-stack.ts        # ECS, ALB, Lambda
│   │   ├── ai-ml-stack.ts          # SageMaker, Bedrock config
│   │   ├── search-stack.ts         # OpenSearch Serverless
│   │   ├── edge-stack.ts           # CloudFront, API Gateway, WAF
│   │   ├── observability-stack.ts  # CloudWatch, X-Ray
│   │   └── security-stack.ts       # IAM, Cognito, Secrets
│   ├── constructs/
│   │   ├── fargate-service.ts      # Reusable ECS Fargate pattern
│   │   ├── dynamodb-table.ts       # Table with standard config
│   │   └── alarm-factory.ts        # CloudWatch alarm patterns
│   └── config/
│       ├── staging.ts              # Staging environment config
│       └── production.ts           # Production environment config
├── test/
│   ├── stacks/
│   │   ├── networking-stack.test.ts
│   │   ├── compute-stack.test.ts
│   │   └── ...
│   └── constructs/
│       └── fargate-service.test.ts
├── cdk.json
├── tsconfig.json
└── package.json
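
The tree lists `config/staging.ts` and `config/production.ts` without showing their contents. A minimal sketch of how the per-environment values could be typed and looked up, assuming the sizing numbers from the Compute Stack example in this document. `EnvironmentConfig`, `ENVIRONMENTS`, and `loadConfig` are hypothetical names, not taken from the repository.

```typescript
// Hypothetical shape for lib/config/*.ts; names are illustrative only.
export type EnvironmentName = 'staging' | 'production';

export interface EnvironmentConfig {
  readonly desiredCount: number;   // number of ECS tasks
  readonly cpu: number;            // Fargate CPU units
  readonly memoryLimitMiB: number; // Fargate memory in MiB
}

// Values copied from the Compute Stack example in this document.
const ENVIRONMENTS: Record<EnvironmentName, EnvironmentConfig> = {
  staging: { desiredCount: 2, cpu: 512, memoryLimitMiB: 1024 },
  production: { desiredCount: 4, cpu: 1024, memoryLimitMiB: 2048 },
};

// Type guard so arbitrary strings (e.g. from process.env) are validated once.
function isEnvironmentName(value: string): value is EnvironmentName {
  return value === 'staging' || value === 'production';
}

export function loadConfig(environment: string): EnvironmentConfig {
  if (!isEnvironmentName(environment)) {
    throw new Error(`Unknown environment "${environment}"; expected staging or production`);
  }
  return ENVIRONMENTS[environment];
}
```

Keeping the values in one typed record means a typo in an environment name fails at synth time rather than silently falling back to a default.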

2. Stack Dependency Graph

graph TD
    NET["Networking Stack<br/>(VPC, Subnets, SGs)"]
    SEC["Security Stack<br/>(IAM, Cognito, Secrets)"]
    DATA["Data Stack<br/>(DynamoDB, Redis, DAX)"]
    SEARCH["Search Stack<br/>(OpenSearch Serverless)"]
    AIML["AI/ML Stack<br/>(SageMaker, Bedrock)"]
    COMP["Compute Stack<br/>(ECS, ALB, Lambda)"]
    EDGE["Edge Stack<br/>(CloudFront, API GW, WAF)"]
    OBS["Observability Stack<br/>(CloudWatch, X-Ray)"]

    NET --> DATA
    NET --> SEARCH
    NET --> AIML
    NET --> COMP
    SEC --> COMP
    SEC --> AIML
    SEC --> DATA
    DATA --> COMP
    SEARCH --> COMP
    AIML --> COMP
    COMP --> EDGE
    COMP --> OBS
    EDGE --> OBS

    style NET fill:#ff9900,color:#000
    style COMP fill:#146eb4,color:#fff
    style AIML fill:#8C4FFF,color:#fff
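
A valid deploy order can be derived from this graph rather than maintained by hand. The sketch below applies Kahn's topological sort to the edges shown above; `STACK_EDGES` and `deployOrder` are hypothetical names, and the short stack labels are illustrative. CDK itself can infer ordering from cross-stack references, so this is only meant to show the constraint the graph encodes.

```typescript
// Edges copied from the dependency graph above: [dependency, dependent].
export const STACK_EDGES: ReadonlyArray<readonly [string, string]> = [
  ['Networking', 'Data'], ['Networking', 'Search'], ['Networking', 'AI/ML'],
  ['Networking', 'Compute'], ['Security', 'Compute'], ['Security', 'AI/ML'],
  ['Security', 'Data'], ['Data', 'Compute'], ['Search', 'Compute'],
  ['AI/ML', 'Compute'], ['Compute', 'Edge'], ['Compute', 'Observability'],
  ['Edge', 'Observability'],
];

// Kahn's algorithm: repeatedly emit a stack whose dependencies are all deployed.
export function deployOrder(edges: ReadonlyArray<readonly [string, string]>): string[] {
  const inDegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const [dependency, dependent] of edges) {
    inDegree.set(dependency, inDegree.get(dependency) ?? 0);
    inDegree.set(dependent, (inDegree.get(dependent) ?? 0) + 1);
    dependents.set(dependency, [...(dependents.get(dependency) ?? []), dependent]);
  }

  // Start with stacks that depend on nothing.
  const ready = Array.from(inDegree.keys()).filter((s) => inDegree.get(s) === 0);
  const order: string[] = [];
  while (ready.length > 0) {
    const stack = ready.shift() as string;
    order.push(stack);
    for (const next of dependents.get(stack) ?? []) {
      const remaining = (inDegree.get(next) ?? 0) - 1;
      inDegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }

  // If some stack was never emitted, the graph contains a cycle.
  if (order.length !== inDegree.size) {
    throw new Error('Cycle detected in stack dependencies');
  }
  return order;
}
```

Any order the function returns places Networking and Security first and Observability last, since every other stack sits between them in the graph.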

3. CDK Stack Example — Compute Stack

import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import { Construct } from 'constructs';

interface ComputeStackProps extends cdk.StackProps {
  vpc: ec2.IVpc;
  cluster?: ecs.ICluster;
  environment: 'staging' | 'production';
}

export class ComputeStack extends cdk.Stack {
  public readonly cluster: ecs.Cluster;
  public readonly service: ecs.FargateService;

  constructor(scope: Construct, id: string, props: ComputeStackProps) {
    super(scope, id, props);

    const config = {
      staging: { desiredCount: 2, cpu: 512, memory: 1024 },
      production: { desiredCount: 4, cpu: 1024, memory: 2048 },
    }[props.environment];

    this.cluster = new ecs.Cluster(this, 'ChatbotCluster', {
      vpc: props.vpc,
      containerInsights: true,
    });

    const taskDef = new ecs.FargateTaskDefinition(this, 'ChatbotTask', {
      cpu: config.cpu,
      memoryLimitMiB: config.memory,
    });

    taskDef.addContainer('chatbot', {
      image: ecs.ContainerImage.fromEcrRepository(/* ECR repo reference */),
      portMappings: [{ containerPort: 8080 }],
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'chatbot' }),
      healthCheck: {
        command: ['CMD-SHELL', 'curl -f http://localhost:8080/health || exit 1'],
        interval: cdk.Duration.seconds(15),
        timeout: cdk.Duration.seconds(5),
        retries: 3,
      },
    });

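    // CodeDeploy (blue/green) performs its own rollback. The ECS deployment
    // circuit breaker requires the default ECS deployment controller, so it
    // is not combined with CODE_DEPLOY here.
    this.service = new ecs.FargateService(this, 'ChatbotService', {
      cluster: this.cluster,
      taskDefinition: taskDef,
      desiredCount: config.desiredCount,
      deploymentController: { type: ecs.DeploymentControllerType.CODE_DEPLOY },
    });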
  }
}

4. CDK Unit Tests

import { Template, Match } from 'aws-cdk-lib/assertions';
import * as cdk from 'aws-cdk-lib';
import { ComputeStack } from '../lib/stacks/compute-stack';
import { NetworkingStack } from '../lib/stacks/networking-stack';

describe('ComputeStack', () => {
  let template: Template;

  beforeAll(() => {
    const app = new cdk.App();
    // AWS account IDs are 12 digits; this one is a dummy value for tests
    const networkStack = new NetworkingStack(app, 'TestNetwork', {
      env: { account: '123456789012', region: 'us-east-1' },
    });
    const stack = new ComputeStack(app, 'TestCompute', {
      vpc: networkStack.vpc,
      environment: 'production',
      env: { account: '123456789012', region: 'us-east-1' },
    });
    template = Template.fromStack(stack);
  });

  test('ECS service uses CodeDeploy deployment controller', () => {
    template.hasResourceProperties('AWS::ECS::Service', {
      DeploymentController: { Type: 'CODE_DEPLOY' },
    });
  });

  test('Production has 4 desired tasks', () => {
    template.hasResourceProperties('AWS::ECS::Service', {
      DesiredCount: 4,
    });
  });

  test('Task definition uses 1024 CPU for production', () => {
    template.hasResourceProperties('AWS::ECS::TaskDefinition', {
      Cpu: '1024',
    });
  });

  test('Container has health check configured', () => {
    template.hasResourceProperties('AWS::ECS::TaskDefinition', {
      ContainerDefinitions: Match.arrayWith([
        Match.objectLike({
          HealthCheck: Match.objectLike({
            Command: Match.arrayWith(['CMD-SHELL']),
          }),
        }),
      ]),
    });
  });

  test('No public subnets for ECS tasks', () => {
    template.hasResourceProperties('AWS::ECS::Service', {
      NetworkConfiguration: Match.objectLike({
        AwsvpcConfiguration: Match.objectLike({
          AssignPublicIp: 'DISABLED',
        }),
      }),
    });
  });
});

5. CI Pipeline — CDK Validation

name: infra-validate
on:
  pull_request:
    paths: ['infra/**']

permissions:
  id-token: write  # OIDC for AWS auth
  contents: read
  pull-requests: write

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/github-actions-infra  # ACCOUNT = 12-digit account ID
          aws-region: us-east-1

      - name: Install dependencies
        run: cd infra && npm ci

      - name: CDK Synth
        run: cd infra && npx cdk synth --all

      - name: CDK Unit Tests
        run: cd infra && npm test

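      - name: Security Scan (cfn-nag)
        run: |
          gem install cfn-nag
          # --input-path takes one file or directory, so scan the directory
          # and filter to synthesized templates instead of passing a shell glob
          cfn_nag_scan --input-path infra/cdk.out --template-pattern '.*\.template\.json'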

      - name: CDK Diff (Post to PR)
        run: |
          cd infra
          DIFF=$(npx cdk diff --all 2>&1) || true
          echo "## Infrastructure Changes" > diff.md
          echo '```' >> diff.md
          echo "$DIFF" >> diff.md
          echo '```' >> diff.md

      - name: Comment diff on PR
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: infra/diff.md

6. CD Pipeline — Staged Deployment

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant CI as CI Pipeline
    participant STG as Staging Account
    participant PROD as Production Account
    participant CW as CloudWatch

    Dev->>GH: Merge PR to main
    GH->>CI: Trigger deploy pipeline
    CI->>CI: cdk synth + tests
    CI->>STG: cdk deploy --all
    STG-->>CI: Stack outputs
    CI->>STG: Run integration tests
    STG-->>CI: Tests pass

    CI->>GH: Request manual approval
    Dev->>GH: Approve deployment
    GH->>CI: Approval received

    CI->>PROD: cdk deploy --all (stack-by-stack)
    Note over CI,PROD: Deploy order: Networking → Security → Data → Search → AI/ML → Compute → Edge → Observability
    PROD-->>CI: All stacks deployed
    CI->>CW: Verify alarms healthy
    CI->>GH: Post deployment summary
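
The "Verify alarms healthy" step can be reduced to a pure check over alarm states. A minimal sketch, assuming the pipeline has already fetched alarm names and states from CloudWatch (for example with the DescribeAlarms API); `AlarmSummary` and `validateAlarms` are hypothetical names, and treating `INSUFFICIENT_DATA` as acceptable right after a deploy is an assumption to adjust per alarm.

```typescript
// The three alarm states CloudWatch reports.
type AlarmState = 'OK' | 'ALARM' | 'INSUFFICIENT_DATA';

interface AlarmSummary {
  readonly alarmName: string;
  readonly state: AlarmState;
}

interface ValidationResult {
  readonly healthy: boolean;
  readonly failing: string[]; // names of alarms that block the deployment
}

// Decide whether post-deploy validation passes.
// allowInsufficientData: freshly created alarms often have no data points yet.
export function validateAlarms(
  alarms: ReadonlyArray<AlarmSummary>,
  allowInsufficientData: boolean = true,
): ValidationResult {
  const failing = alarms
    .filter((a) =>
      a.state === 'ALARM' ||
      (!allowInsufficientData && a.state === 'INSUFFICIENT_DATA'))
    .map((a) => a.alarmName);
  return { healthy: failing.length === 0, failing };
}
```

The pipeline would fail the job and post the `failing` list in the deployment summary when `healthy` is false.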

7. Drift Detection

# Daily drift detection Lambda (triggered by EventBridge)
import boto3
import json
import time

def lambda_handler(event, context):
    cfn = boto3.client('cloudformation')
    sns = boto3.client('sns')

    stacks = [
        'MangaAssist-Networking',
        'MangaAssist-Data',
        'MangaAssist-Compute',
        'MangaAssist-Edge',
        'MangaAssist-Observability',
    ]

    drifted_stacks = []
    for stack_name in stacks:
        # detect_stack_drift is asynchronous; keep the ID so the result can be polled
        detection_id = cfn.detect_stack_drift(StackName=stack_name)['StackDriftDetectionId']

        # Poll until detection finishes.
        # Note: In production, use Step Functions for async wait
        while True:
            status = cfn.describe_stack_drift_detection_status(
                StackDriftDetectionId=detection_id
            )
            if status['DetectionStatus'] != 'DETECTION_IN_PROGRESS':
                break
            time.sleep(5)

        if status.get('StackDriftStatus') == 'DRIFTED':
            drifted_stacks.append({
                'stack': stack_name,
                'drifted_resources': status['DriftedStackResourceCount'],
            })

    if drifted_stacks:
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:ACCOUNT:infra-drift-alerts',
            Subject=f'DRIFT DETECTED: {len(drifted_stacks)} stacks',
            Message=json.dumps(drifted_stacks, indent=2),
        )

    return {'drifted': len(drifted_stacks), 'details': drifted_stacks}

Critical Decisions

Decision 1: IaC Tool — AWS CDK (TypeScript) vs Terraform vs Raw CloudFormation

Criteria (Weight) | AWS CDK (TypeScript) | Terraform (HCL) | Raw CloudFormation (YAML)
Abstraction Level (20%) | 9/10: L2/L3 constructs, loops, conditions | 7/10: modules, for_each | 4/10: verbose, repetitive
AWS Integration (20%) | 10/10: synthesizes to CFN natively | 7/10: AWS provider, some lag | 10/10: first-party
Type Safety (15%) | 10/10: full TypeScript types | 5/10: HCL validation only | 3/10: YAML, no types
State Management (15%) | 8/10: CFN manages state | 6/10: state file (locking, remote) | 8/10: CFN manages state
Multi-Cloud (5%) | 2/10: AWS only | 10/10: any provider | 1/10: AWS only
Testing (10%) | 9/10: CDK assertions, snapshot tests | 7/10: terratest, plan validation | 4/10: cfn-lint only
Team Learning Curve (10%) | 8/10: team knows TypeScript | 5/10: must learn HCL | 6/10: verbose but readable
Drift Detection (5%) | 7/10: CFN drift detection | 9/10: terraform plan detects drift | 7/10: CFN drift detection
Weighted Score | 8.65/10 | 6.60/10 | 5.85/10
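
The weighted totals follow directly from the per-criterion scores and weights in the table. A small sketch that recomputes them, so the totals can be rechecked whenever a score or weight changes; `CRITERIA` and `weightedScore` are hypothetical names, and the numbers are copied from the rows above.

```typescript
type Tool = 'cdk' | 'terraform' | 'cloudformation';

interface Criterion {
  readonly criterion: string;
  readonly weight: number; // percent; all weights sum to 100
  readonly scores: Record<Tool, number>; // each out of 10
}

// Weights and scores copied from the comparison table.
const CRITERIA: ReadonlyArray<Criterion> = [
  { criterion: 'Abstraction Level',   weight: 20, scores: { cdk: 9,  terraform: 7,  cloudformation: 4 } },
  { criterion: 'AWS Integration',     weight: 20, scores: { cdk: 10, terraform: 7,  cloudformation: 10 } },
  { criterion: 'Type Safety',         weight: 15, scores: { cdk: 10, terraform: 5,  cloudformation: 3 } },
  { criterion: 'State Management',    weight: 15, scores: { cdk: 8,  terraform: 6,  cloudformation: 8 } },
  { criterion: 'Multi-Cloud',         weight: 5,  scores: { cdk: 2,  terraform: 10, cloudformation: 1 } },
  { criterion: 'Testing',             weight: 10, scores: { cdk: 9,  terraform: 7,  cloudformation: 4 } },
  { criterion: 'Team Learning Curve', weight: 10, scores: { cdk: 8,  terraform: 5,  cloudformation: 6 } },
  { criterion: 'Drift Detection',     weight: 5,  scores: { cdk: 7,  terraform: 9,  cloudformation: 7 } },
];

// Sum weight * score in integer percent-points, then scale back to a 0-10 score.
export function weightedScore(tool: Tool): number {
  const points = CRITERIA.reduce((sum, c) => sum + c.weight * c.scores[tool], 0);
  return points / 100;
}

console.log(weightedScore('cdk').toFixed(2));            // 8.65
console.log(weightedScore('terraform').toFixed(2));      // 6.60
console.log(weightedScore('cloudformation').toFixed(2)); // 5.85
```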

Decision: AWS CDK (TypeScript)

Rationale: The MangaAssist chatbot is 100% AWS. Multi-cloud support (Terraform's strength) provides zero value here. CDK's TypeScript constructs dramatically reduce code volume — a DynamoDB table with GSIs, auto-scaling, and alarms takes ~30 lines in CDK vs ~150 in CloudFormation YAML. The team already writes TypeScript for the frontend, so CDK adds no new language overhead.

Why not Terraform? State management is a liability for a small team. S3 + DynamoDB state locking works but is another system to maintain. State corruption requires manual surgery. CDK delegates state entirely to CloudFormation — one less thing to break.

Why not raw CloudFormation? The chatbot has 20+ AWS services across 8+ stacks. Raw CloudFormation YAML would be thousands of lines of repetitive configuration. CDK's constructs generate the supporting resources for you: an ecs.FargateService with an ecs.FargateTaskDefinition creates the service security group, the task and execution IAM roles, and (with the awsLogs driver) the CloudWatch log group. The team estimates this reduces the code volume by 5-10x.


Decision 2: Stack Architecture — Single Stack vs Multi-Stack

Criteria | Single Stack | Multi-Stack (Current Choice)
Blast Radius | HIGH: one failure affects everything | LOW: failure isolated to one stack
Deploy Speed | Slow: CFN evaluates all resources | Fast: only changed stack deploys
Dependency Management | Simple: no cross-stack refs | Complex: stack exports, deploy order
Rollback Scope | All or nothing | Per-stack rollback
CFN Resource Limit | Risk hitting 500 resource limit | Each stack well under limit
Team Parallelism | One person deploys at a time | Different stacks can deploy independently

Decision: Multi-Stack (8 stacks)

Rationale: The chatbot uses 20+ different AWS services. A single stack would approach CloudFormation's 500-resource limit and create unacceptable blast radius — a typo in a CloudWatch alarm config should not risk rolling back the VPC. Multi-stack deployment adds ~5 minutes to the pipeline (sequential stack deploys) but eliminates the risk of cascading failures.
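
The 500-resource limit can be watched automatically instead of discovered at deploy time. A minimal sketch that counts resources in a synthesized template (for example one parsed from `cdk.out/*.template.json`) and flags stacks approaching the limit; `checkResourceBudget` and the 80% warning threshold are assumptions for illustration, not part of the existing pipeline.

```typescript
// Only the part of a CloudFormation template this check needs.
interface SynthesizedTemplate {
  readonly Resources?: Record<string, { Type: string }>;
}

const CFN_RESOURCE_LIMIT = 500; // per-stack limit cited in the rationale above
const WARN_FRACTION = 0.8;      // hypothetical early-warning threshold

type BudgetStatus = 'ok' | 'warn' | 'over';

interface BudgetReport {
  readonly stackName: string;
  readonly resourceCount: number;
  readonly status: BudgetStatus;
}

export function checkResourceBudget(
  stackName: string,
  template: SynthesizedTemplate,
): BudgetReport {
  const resourceCount = Object.keys(template.Resources ?? {}).length;
  let status: BudgetStatus = 'ok';
  if (resourceCount > CFN_RESOURCE_LIMIT) {
    status = 'over';
  } else if (resourceCount >= CFN_RESOURCE_LIMIT * WARN_FRACTION) {
    status = 'warn';
  }
  return { stackName, resourceCount, status };
}
```

A `warn` result is a prompt to split the stack further before the limit blocks a deployment.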


Decision 3: Environment Isolation — Separate AWS Accounts vs Same Account

Criteria | Separate Accounts | Same Account + Prefixes
Security Isolation | Strong: IAM boundary at account level | Weak: shared IAM, risk of cross-env access
Cost Visibility | Clear: per-account billing | Murky: requires cost allocation tags
Operational Complexity | Higher: cross-account roles, multiple credentials | Lower: single set of credentials
Resource Limits | Independent per account | Shared limits (could conflict)
Team Size Fit (1-2) | Overhead for small team | Simpler for small team
AWS Organizations | Required | Not needed

Decision: Separate AWS Accounts (via AWS Organizations)

graph LR
    subgraph "AWS Organization"
        MGMT["Management Account<br/>(Billing, SSO)"]
        STG["Staging Account<br/>(MangaAssist-Staging)"]
        PROD["Production Account<br/>(MangaAssist-Production)"]
        SHARED["Shared Services<br/>(ECR, Artifacts)"]
    end

    MGMT --> STG
    MGMT --> PROD
    MGMT --> SHARED
    SHARED -->|"Cross-account ECR pull"| STG
    SHARED -->|"Cross-account ECR pull"| PROD

Rationale: Even for a 1-2 person team, account-level isolation prevents the most dangerous class of mistakes — a staging deployment accidentally modifying production resources. AWS Organizations + IAM Identity Center (SSO) makes multi-account management nearly as simple as single-account. The overhead is worth the safety guarantee.
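
Account isolation is strongest when the pipeline also refuses to deploy with the wrong credentials. A minimal sketch of such a guard, assuming the caller's account ID has already been resolved (for example from STS GetCallerIdentity); `EXPECTED_ACCOUNTS`, `assertTargetAccount`, and the account IDs are hypothetical placeholders.

```typescript
type Environment = 'staging' | 'production';

// Hypothetical placeholder IDs; replace with the real Organization member accounts.
const EXPECTED_ACCOUNTS: Record<Environment, string> = {
  staging: '111111111111',
  production: '222222222222',
};

// Throws before any deploy step runs if the credentials point at the wrong account.
export function assertTargetAccount(
  environment: Environment,
  callerAccountId: string,
): void {
  const expected = EXPECTED_ACCOUNTS[environment];
  if (callerAccountId !== expected) {
    throw new Error(
      `Refusing to deploy ${environment}: credentials resolve to account ` +
      `${callerAccountId}, expected ${expected}`,
    );
  }
}
```

Running this as the first step of each deploy job turns a misconfigured role into an immediate failure instead of a staging change landing in production.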


Tradeoffs

The Debate: Infrastructure Change Velocity vs Safety

graph TD
    subgraph "Product Manager"
        PM1["Fast infrastructure changes"]
        PM2["New features need new AWS resources quickly"]
        PM3["Don't want infra to be the bottleneck"]
    end

    subgraph "Architect"
        AR1["Every infra change reviewed by human"]
        AR2["Staging must mirror production exactly"]
        AR3["Zero-tolerance for infra drift"]
    end

    subgraph "DevOps Engineer"
        DE1["I'm the only one who can approve"]
        DE2["Manual approval creates bottleneck"]
        DE3["Automate gates, not approvals"]
    end

    PM2 ---|"Tension"| AR1
    PM3 ---|"Tension"| DE1
    AR2 ---|"Cost tension"| PM1
    DE2 ---|"Agrees with"| PM3
    AR3 ---|"Enables trust in"| DE3

Resolution

Concern | Solution | Compromise
PM: Infra changes slow down features | Separate compute stack deploys in < 10 min | Networking/security changes still slow (and should be)
Architect: Human review required | cdk diff auto-posted to every PR, so review is fast | Only production gets manual approval; staging is fully automated
Architect: Staging must mirror prod | Same CDK code, different config values per environment | Staging runs at 50% capacity (cost saving), so it is not identical
DevOps: Single approver bottleneck | Expand approval to 2 people (DevOps + Tech Lead) | Tech Lead review may slow things slightly but removes bus factor
All: Drift must not happen | Daily drift detection Lambda + SNS alerts | Detection, not prevention: manual console access is not blocked

Key Tradeoff: CDK Abstraction Power vs Debugging Difficulty

The double-edged sword of CDK: High-level constructs (L2/L3) auto-generate resources you didn't explicitly define (security groups, IAM policies, log groups). This is powerful but means debugging CloudFormation failures requires understanding the generated template.

CDK Code:       ~200 lines TypeScript
Generated CFN:  ~2,000 lines YAML
Debug Surface:  10x larger than source code

Mitigation:

  1. Always review cdk diff output (not just CDK code) before approving
  2. Use CDK snapshot tests to catch unexpected resource changes
  3. Keep cdk.out/ artifacts for debugging failed deployments
  4. Prefer L2 constructs over L3 (less magic, more control)
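
The idea behind reviewing the generated template rather than the CDK source can be shown with a small comparison of two synthesized templates. This is a sketch of the concept only, not how `cdk diff` or Jest snapshots are implemented; `summarizeResourceChanges` is a hypothetical helper, and the `JSON.stringify` comparison is sensitive to property order.

```typescript
interface TemplateResource {
  readonly Type: string;
  readonly Properties?: unknown;
}

interface CfnTemplate {
  readonly Resources: Record<string, TemplateResource>;
}

interface ResourceChanges {
  readonly added: string[];   // logical IDs only in the new template
  readonly removed: string[]; // logical IDs only in the old template
  readonly changed: string[]; // logical IDs whose definition differs
}

function has(resources: Record<string, TemplateResource>, id: string): boolean {
  return Object.prototype.hasOwnProperty.call(resources, id);
}

export function summarizeResourceChanges(
  before: CfnTemplate,
  after: CfnTemplate,
): ResourceChanges {
  const beforeIds = Object.keys(before.Resources);
  const afterIds = Object.keys(after.Resources);
  return {
    added: afterIds.filter((id) => !has(before.Resources, id)),
    removed: beforeIds.filter((id) => !has(after.Resources, id)),
    // Naive deep comparison; a reordered property would also count as a change.
    changed: afterIds.filter(
      (id) =>
        has(before.Resources, id) &&
        JSON.stringify(before.Resources[id]) !== JSON.stringify(after.Resources[id]),
    ),
  };
}
```

A non-empty `removed` list is the case most worth a second look in review, since an auto-generated resource disappearing is easy to miss in the CDK source.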