CD-02: Infrastructure as Code Pipeline
User Story
As a DevOps Engineer on the MangaAssist AI Chatbot team, I want to establish an automated Infrastructure as Code (IaC) pipeline that provisions, validates, and deploys all AWS resources through version-controlled code, So that infrastructure changes are reviewable, repeatable, and auditable — eliminating manual console changes and enabling the 1-2 person team to manage 20+ AWS services safely.
Acceptance Criteria
- All AWS resources are defined in CDK (TypeScript) — zero manual console provisioning
- Every infrastructure change goes through PR review with cdk diff output attached
- CDK unit tests validate resource configurations before deployment
- Staging environment is deployed and integration-tested before production
- Production deployments require manual approval (the only manual gate in any pipeline)
- Drift detection runs daily and alerts on unauthorized console changes
- Stack rollback is automatic on CloudFormation deployment failure
- Infrastructure costs are estimated before deployment via cdk diff + AWS Pricing API
- Secrets and sensitive config are stored in AWS Secrets Manager (never in code)
- Multi-stack architecture limits blast radius per deployment
High-Level Design
Infrastructure Pipeline Overview
flowchart LR
subgraph "Developer"
A[CDK Code Change] --> B[PR Created]
end
subgraph "CI Phase"
B --> C[CDK Synth]
C --> D[CDK Unit Tests]
D --> E[cdk diff — Post to PR]
E --> F[Security Scan — cfn-nag]
F --> G[Cost Estimate]
end
subgraph "Staging Deploy"
G --> H[Deploy to Staging]
H --> I[Integration Tests]
I --> J{Tests Pass?}
end
subgraph "Production Deploy"
J -->|Yes| K[Manual Approval Gate]
K --> L[Deploy to Production]
L --> M[Post-Deploy Validation]
J -->|No| N[Fail + Notify]
end
style K fill:#ff9900,color:#000
style L fill:#1B660F,color:#fff
style N fill:#DD344C,color:#fff
AWS Resources Managed by This Pipeline
| Stack | Resources | Change Frequency |
|---|---|---|
| Networking | VPC, Subnets, Security Groups, NAT Gateway | Rare (quarterly) |
| Compute | ECS Cluster, Task Definitions, ALB, Lambda Functions | Weekly |
| Data | DynamoDB Tables, DAX Clusters, ElastiCache Redis | Monthly |
| AI/ML | SageMaker Endpoints, Bedrock Config, Model Registry | Monthly |
| Search | OpenSearch Serverless Collections, Indexes | Monthly |
| Edge | CloudFront Distribution, API Gateway, WAF Rules | Bi-weekly |
| Observability | CloudWatch Dashboards, Alarms, X-Ray Groups | Weekly |
| Security | IAM Roles, Cognito User Pool, Secrets Manager | Rare |
| Storage | S3 Buckets, Lifecycle Policies | Rare |
Low-Level Design
1. CDK Project Structure — Multi-Stack Architecture
infra/
├── bin/
│ └── app.ts # Stack instantiation + environment config
├── lib/
│ ├── stacks/
│ │ ├── networking-stack.ts # VPC, subnets, security groups
│ │ ├── data-stack.ts # DynamoDB, DAX, ElastiCache
│ │ ├── compute-stack.ts # ECS, ALB, Lambda
│ │ ├── ai-ml-stack.ts # SageMaker, Bedrock config
│ │ ├── search-stack.ts # OpenSearch Serverless
│ │ ├── edge-stack.ts # CloudFront, API Gateway, WAF
│ │ ├── observability-stack.ts # CloudWatch, X-Ray
│ │ └── security-stack.ts # IAM, Cognito, Secrets
│ ├── constructs/
│ │ ├── fargate-service.ts # Reusable ECS Fargate pattern
│ │ ├── dynamodb-table.ts # Table with standard config
│ │ └── alarm-factory.ts # CloudWatch alarm patterns
│ └── config/
│ ├── staging.ts # Staging environment config
│ └── production.ts # Production environment config
├── test/
│ ├── stacks/
│ │ ├── networking-stack.test.ts
│ │ ├── compute-stack.test.ts
│ │ └── ...
│ └── constructs/
│ └── fargate-service.test.ts
├── cdk.json
├── tsconfig.json
└── package.json
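The config/ directory is the only place where staging and production diverge; every stack reads from it rather than branching on environment names. A minimal sketch of what such a module might look like (the natGateways field and the resolveEnv helper are illustrative assumptions, not the project's actual settings):

```typescript
// lib/config/environments.ts — sketch of a typed per-environment config.
export type EnvName = 'staging' | 'production';

export interface EnvConfig {
  desiredCount: number; // ECS task count
  cpu: number;          // Fargate CPU units
  memoryMiB: number;    // Fargate memory
  natGateways: number;  // cost lever: staging can run a single NAT gateway
}

export const ENVIRONMENTS: Record<EnvName, EnvConfig> = {
  staging:    { desiredCount: 2, cpu: 512,  memoryMiB: 1024, natGateways: 1 },
  production: { desiredCount: 4, cpu: 1024, memoryMiB: 2048, natGateways: 2 },
};

// Resolve the target environment from a CDK context value or env var,
// failing fast on typos instead of silently deploying the wrong config.
export function resolveEnv(name: string | undefined): EnvConfig {
  if (name !== 'staging' && name !== 'production') {
    throw new Error(`Unknown environment: ${name ?? '(unset)'}`);
  }
  return ENVIRONMENTS[name];
}
```

Failing fast here matters because bin/app.ts instantiates every stack from this config; a typo should abort cdk synth, not deploy defaults.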
2. Stack Dependency Graph
graph TD
NET["Networking Stack<br/>(VPC, Subnets, SGs)"]
SEC["Security Stack<br/>(IAM, Cognito, Secrets)"]
DATA["Data Stack<br/>(DynamoDB, Redis, DAX)"]
SEARCH["Search Stack<br/>(OpenSearch Serverless)"]
AIML["AI/ML Stack<br/>(SageMaker, Bedrock)"]
COMP["Compute Stack<br/>(ECS, ALB, Lambda)"]
EDGE["Edge Stack<br/>(CloudFront, API GW, WAF)"]
OBS["Observability Stack<br/>(CloudWatch, X-Ray)"]
NET --> DATA
NET --> SEARCH
NET --> AIML
NET --> COMP
SEC --> COMP
SEC --> AIML
SEC --> DATA
DATA --> COMP
SEARCH --> COMP
AIML --> COMP
COMP --> EDGE
COMP --> OBS
EDGE --> OBS
style NET fill:#ff9900,color:#000
style COMP fill:#146eb4,color:#fff
style AIML fill:#8C4FFF,color:#fff
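The deploy order the CD pipeline follows can be derived from this graph rather than hand-maintained, so adding a stack cannot silently break ordering. A sketch using Kahn's topological sort; the edge list mirrors the diagram above:

```typescript
// Derive a safe stack deploy order from the dependency graph.
type Edge = [from: string, to: string]; // "from" must deploy before "to"

const EDGES: Edge[] = [
  ['Networking', 'Data'], ['Networking', 'Search'], ['Networking', 'AI/ML'],
  ['Networking', 'Compute'], ['Security', 'Compute'], ['Security', 'AI/ML'],
  ['Security', 'Data'], ['Data', 'Compute'], ['Search', 'Compute'],
  ['AI/ML', 'Compute'], ['Compute', 'Edge'], ['Compute', 'Observability'],
  ['Edge', 'Observability'],
];

function deployOrder(edges: Edge[]): string[] {
  const nodes = new Set(edges.flat());
  const indegree = new Map<string, number>([...nodes].map((n) => [n, 0]));
  for (const [, to] of edges) indegree.set(to, indegree.get(to)! + 1);

  // Repeatedly deploy every stack whose dependencies are already deployed.
  const order: string[] = [];
  const ready = [...nodes].filter((n) => indegree.get(n) === 0);
  while (ready.length) {
    const n = ready.shift()!;
    order.push(n);
    for (const [from, to] of edges) {
      if (from !== n) continue;
      indegree.set(to, indegree.get(to)! - 1);
      if (indegree.get(to) === 0) ready.push(to);
    }
  }
  if (order.length !== nodes.size) throw new Error('Cycle in stack graph');
  return order;
}
```

The cycle check doubles as a guard: a circular stack dependency is a CDK design error and should fail CI, not surface as a CloudFormation export conflict at deploy time.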
3. CDK Stack Example — Compute Stack
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import { Construct } from 'constructs';
interface ComputeStackProps extends cdk.StackProps {
  vpc: ec2.IVpc;
  cluster?: ecs.ICluster;
  environment: 'staging' | 'production';
}

export class ComputeStack extends cdk.Stack {
  public readonly cluster: ecs.Cluster;
  public readonly service: ecs.FargateService;

  constructor(scope: Construct, id: string, props: ComputeStackProps) {
    super(scope, id, props);

    // Environment-specific sizing; staging runs at half production capacity.
    const config = {
      staging: { desiredCount: 2, cpu: 512, memory: 1024 },
      production: { desiredCount: 4, cpu: 1024, memory: 2048 },
    }[props.environment];

    this.cluster = new ecs.Cluster(this, 'ChatbotCluster', {
      vpc: props.vpc,
      containerInsights: true,
    });

    const taskDef = new ecs.FargateTaskDefinition(this, 'ChatbotTask', {
      cpu: config.cpu,
      memoryLimitMiB: config.memory,
    });

    taskDef.addContainer('chatbot', {
      image: ecs.ContainerImage.fromEcrRepository(/* ECR repo reference */),
      portMappings: [{ containerPort: 8080 }],
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'chatbot' }),
      healthCheck: {
        command: ['CMD-SHELL', 'curl -f http://localhost:8080/health || exit 1'],
        interval: cdk.Duration.seconds(15),
        timeout: cdk.Duration.seconds(5),
        retries: 3,
      },
    });

    this.service = new ecs.FargateService(this, 'ChatbotService', {
      cluster: this.cluster,
      taskDefinition: taskDef,
      desiredCount: config.desiredCount,
      // CodeDeploy owns the deployment lifecycle (blue/green) and rolls back
      // failed deployments itself. The ECS deployment circuit breaker is only
      // valid with the default ECS deployment controller, so it is omitted here.
      deploymentController: { type: ecs.DeploymentControllerType.CODE_DEPLOY },
    });
  }
}
4. CDK Unit Tests
import { Template, Match } from 'aws-cdk-lib/assertions';
import * as cdk from 'aws-cdk-lib';
import { ComputeStack } from '../lib/stacks/compute-stack';
import { NetworkingStack } from '../lib/stacks/networking-stack';
describe('ComputeStack', () => {
  let template: Template;

  beforeAll(() => {
    const app = new cdk.App();
    const networkStack = new NetworkingStack(app, 'TestNetwork', {
      env: { account: '123456789012', region: 'us-east-1' },
    });
    const stack = new ComputeStack(app, 'TestCompute', {
      vpc: networkStack.vpc,
      environment: 'production',
      env: { account: '123456789012', region: 'us-east-1' },
    });
    template = Template.fromStack(stack);
  });

  test('ECS service uses CodeDeploy deployment controller', () => {
    template.hasResourceProperties('AWS::ECS::Service', {
      DeploymentController: { Type: 'CODE_DEPLOY' },
    });
  });

  test('Production has 4 desired tasks', () => {
    template.hasResourceProperties('AWS::ECS::Service', {
      DesiredCount: 4,
    });
  });

  test('Task definition uses 1024 CPU for production', () => {
    template.hasResourceProperties('AWS::ECS::TaskDefinition', {
      Cpu: '1024',
    });
  });

  test('Container has health check configured', () => {
    template.hasResourceProperties('AWS::ECS::TaskDefinition', {
      ContainerDefinitions: Match.arrayWith([
        Match.objectLike({
          HealthCheck: Match.objectLike({
            Command: Match.arrayWith(['CMD-SHELL']),
          }),
        }),
      ]),
    });
  });

  test('No public subnets for ECS tasks', () => {
    template.hasResourceProperties('AWS::ECS::Service', {
      NetworkConfiguration: Match.objectLike({
        AwsvpcConfiguration: Match.objectLike({
          AssignPublicIp: 'DISABLED',
        }),
      }),
    });
  });
});
5. CI Pipeline — CDK Validation
name: infra-validate

on:
  pull_request:
    paths: ['infra/**']

permissions:
  id-token: write   # OIDC for AWS auth
  contents: read
  pull-requests: write

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT_ID:role/github-actions-infra
          aws-region: us-east-1

      - name: Install dependencies
        run: cd infra && npm ci

      - name: CDK Synth
        run: cd infra && npx cdk synth --all

      - name: CDK Unit Tests
        run: cd infra && npm test

      - name: Security Scan (cfn-nag)
        run: |
          gem install cfn-nag
          cfn_nag_scan --input-path infra/cdk.out --template-pattern '.*\.template\.json'

      - name: CDK Diff (Post to PR)
        run: |
          cd infra
          DIFF=$(npx cdk diff --all 2>&1) || true
          echo "## Infrastructure Changes" > diff.md
          echo '```' >> diff.md
          echo "$DIFF" >> diff.md
          echo '```' >> diff.md

      - name: Comment diff on PR
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: infra/diff.md
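The Cost Estimate step needs a machine-readable summary of what a change adds and removes, but cdk diff emits plain text. A sketch of a parser that counts resource-level changes, assuming the current CDK convention of prefixing resource lines with [+], [-], and [~] (the summary would then feed Pricing API lookups):

```typescript
// Count resource-level additions/removals/updates in `cdk diff` text output.
interface DiffSummary { added: number; removed: number; updated: number }

function summarizeCdkDiff(diff: string): DiffSummary {
  const summary: DiffSummary = { added: 0, removed: 0, updated: 0 };
  for (const line of diff.split('\n')) {
    // Match lines like: "[+] AWS::DynamoDB::Table ChatHistory"
    const m = line.trim().match(/^\[([+\-~])\]\s+AWS::/);
    if (!m) continue;
    if (m[1] === '+') summary.added++;
    else if (m[1] === '-') summary.removed++;
    else summary.updated++;
  }
  return summary;
}
```

Text parsing is brittle if the CDK output format changes; pinning the aws-cdk version in package.json keeps the format stable between upgrades.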
6. CD Pipeline — Staged Deployment
sequenceDiagram
participant Dev as Developer
participant GH as GitHub
participant CI as CI Pipeline
participant STG as Staging Account
participant PROD as Production Account
participant CW as CloudWatch
Dev->>GH: Merge PR to main
GH->>CI: Trigger deploy pipeline
CI->>CI: cdk synth + tests
CI->>STG: cdk deploy --all
STG-->>CI: Stack outputs
CI->>STG: Run integration tests
STG-->>CI: Tests pass
CI->>GH: Request manual approval
Dev->>GH: Approve deployment
GH->>CI: Approval received
CI->>PROD: cdk deploy --all (stack-by-stack)
Note over CI,PROD: Deploy order: Networking → Security → Data → AI/ML → Compute → Edge → Observability
PROD-->>CI: All stacks deployed
CI->>CW: Verify alarms healthy
CI->>GH: Post deployment summary
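The manual approval gate in the sequence above maps to a GitHub Actions environment with required reviewers. A sketch of the production deploy job under that assumption (the job names, role ARN, and account placeholder are illustrative):

```yaml
deploy-production:
  needs: integration-test-staging
  runs-on: ubuntu-latest
  environment: production   # required reviewers on this environment = the manual gate
  steps:
    - uses: actions/checkout@v4
    - name: Configure AWS credentials (OIDC)
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::PROD_ACCOUNT_ID:role/github-actions-deploy
        aws-region: us-east-1
    - name: Deploy stacks in dependency order
      run: cd infra && npx cdk deploy --all --require-approval never
```

Note that --require-approval never is safe here because the human approval happens at the GitHub environment level, before the job starts; the CDK CLI prompt would otherwise hang a non-interactive runner.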
7. Drift Detection
# Daily drift detection Lambda (triggered by an EventBridge schedule)
import json
import time

import boto3

cfn = boto3.client('cloudformation')
sns = boto3.client('sns')

STACKS = [
    'MangaAssist-Networking',
    'MangaAssist-Data',
    'MangaAssist-Compute',
    'MangaAssist-Edge',
    'MangaAssist-Observability',
]


def lambda_handler(event, context):
    drifted_stacks = []
    for stack_name in STACKS:
        detection_id = cfn.detect_stack_drift(
            StackName=stack_name
        )['StackDriftDetectionId']
        # Poll until detection completes. For long-running detections,
        # prefer Step Functions over in-Lambda polling.
        while True:
            status = cfn.describe_stack_drift_detection_status(
                StackDriftDetectionId=detection_id
            )
            if status['DetectionStatus'] != 'DETECTION_IN_PROGRESS':
                break
            time.sleep(5)
        if status.get('StackDriftStatus') == 'DRIFTED':
            drifted_stacks.append({
                'stack': stack_name,
                'drifted_resources': status['DriftedStackResourceCount'],
            })
    if drifted_stacks:
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:ACCOUNT:infra-drift-alerts',
            Subject=f'DRIFT DETECTED: {len(drifted_stacks)} stacks',
            Message=json.dumps(drifted_stacks, indent=2),
        )
    return {'drifted': len(drifted_stacks), 'details': drifted_stacks}
Critical Decisions
Decision 1: IaC Tool — AWS CDK (TypeScript) vs Terraform vs Raw CloudFormation
| Criteria (Weight) | AWS CDK (TypeScript) | Terraform (HCL) | Raw CloudFormation (YAML) |
|---|---|---|---|
| Abstraction Level (20%) | 9/10 — L2/L3 constructs, loops, conditions | 7/10 — modules, for_each | 4/10 — verbose, repetitive |
| AWS Integration (20%) | 10/10 — synthesizes to CFN natively | 7/10 — AWS provider, some lag | 10/10 — first-party |
| Type Safety (15%) | 10/10 — full TypeScript types | 5/10 — HCL validation only | 3/10 — YAML, no types |
| State Management (15%) | 8/10 — CFN manages state | 6/10 — state file (locking, remote) | 8/10 — CFN manages state |
| Multi-Cloud (5%) | 2/10 — AWS only | 10/10 — any provider | 1/10 — AWS only |
| Testing (10%) | 9/10 — CDK assertions, snapshot tests | 7/10 — terratest, plan validation | 4/10 — cfn-lint only |
| Team Learning Curve (10%) | 8/10 — team knows TypeScript | 5/10 — must learn HCL | 6/10 — verbose but readable |
| Drift Detection (5%) | 7/10 — CFN drift detection | 9/10 — terraform plan detects drift | 7/10 — CFN drift detection |
| Weighted Score | 8.7/10 | 6.6/10 | 5.9/10 |
Decision: AWS CDK (TypeScript)
Rationale: The MangaAssist chatbot is 100% AWS. Multi-cloud support (Terraform's strength) provides zero value here. CDK's TypeScript constructs dramatically reduce code volume — a DynamoDB table with GSIs, auto-scaling, and alarms takes ~30 lines in CDK vs ~150 in CloudFormation YAML. The team already writes TypeScript for the frontend, so CDK adds no new language overhead.
Why not Terraform? State management is a liability for a small team. S3 + DynamoDB state locking works but is another system to maintain. State corruption requires manual surgery. CDK delegates state entirely to CloudFormation — one less thing to break.
Why not raw CloudFormation? The chatbot has 20+ AWS services across 8+ stacks. Raw CloudFormation YAML would be thousands of lines of repetitive configuration. CDK's constructs (e.g., new ecs.FargateService() auto-creates task definition, security groups, IAM roles, CloudWatch log groups) reduce this by 5-10x.
Decision 2: Stack Architecture — Single Stack vs Multi-Stack
| Criteria | Single Stack | Multi-Stack (Current Choice) |
|---|---|---|
| Blast Radius | HIGH — one failure affects everything | LOW — failure isolated to one stack |
| Deploy Speed | Slow — CFN evaluates all resources | Fast — only changed stack deploys |
| Dependency Management | Simple — no cross-stack refs | Complex — stack exports, deploy order |
| Rollback Scope | All or nothing | Per-stack rollback |
| CFN Resource Limit | Risk hitting 500 resource limit | Each stack well under limit |
| Team Parallelism | One person deploys at a time | Different stacks can deploy independently |
Decision: Multi-Stack (8 stacks)
Rationale: The chatbot uses 20+ different AWS services. A single stack would approach CloudFormation's 500-resource limit and create unacceptable blast radius — a typo in a CloudWatch alarm config should not risk rolling back the VPC. Multi-stack deployment adds ~5 minutes to the pipeline (sequential stack deploys) but eliminates the risk of cascading failures.
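Because the 500-resource limit is the hard constraint behind this decision, it is worth enforcing in CI rather than discovering at deploy time. A sketch of a guard that runs against each synthesized template in cdk.out (the 400-resource warning threshold is an assumption, not an AWS limit):

```typescript
// Guard against drifting toward CloudFormation's 500-resources-per-stack limit.
const CFN_RESOURCE_LIMIT = 500;

interface LimitCheck { count: number; warn: boolean; fail: boolean }

function checkResourceCount(
  template: { Resources?: Record<string, unknown> },
  warnAt = 400, // illustrative early-warning threshold
): LimitCheck {
  const count = Object.keys(template.Resources ?? {}).length;
  return { count, warn: count >= warnAt, fail: count > CFN_RESOURCE_LIMIT };
}
```

Wired into the infra-validate job, a warn result posts a PR comment suggesting the stack be split before the limit is actually hit.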
Decision 3: Environment Isolation — Separate AWS Accounts vs Same Account
| Criteria | Separate Accounts | Same Account + Prefixes |
|---|---|---|
| Security Isolation | Strong — IAM boundary at account level | Weak — shared IAM, risk of cross-env access |
| Cost Visibility | Clear — per-account billing | Murky — requires cost allocation tags |
| Operational Complexity | Higher — cross-account roles, multiple credentials | Lower — single set of credentials |
| Resource Limits | Independent per account | Shared limits (could conflict) |
| Team Size Fit (1-2) | Overhead for small team | Simpler for small team |
| AWS Organizations | Required | Not needed |
Decision: Separate AWS Accounts (via AWS Organizations)
graph LR
subgraph "AWS Organization"
MGMT["Management Account<br/>(Billing, SSO)"]
STG["Staging Account<br/>(MangaAssist-Staging)"]
PROD["Production Account<br/>(MangaAssist-Production)"]
SHARED["Shared Services<br/>(ECR, Artifacts)"]
end
MGMT --> STG
MGMT --> PROD
MGMT --> SHARED
SHARED -->|"Cross-account ECR pull"| STG
SHARED -->|"Cross-account ECR pull"| PROD
Rationale: Even for a 1-2 person team, account-level isolation prevents the most dangerous class of mistakes — a staging deployment accidentally modifying production resources. AWS Organizations + IAM Identity Center (SSO) makes multi-account management nearly as simple as single-account. The overhead is worth the safety guarantee.
Tradeoffs
The Debate: Infrastructure Change Velocity vs Safety
graph TD
subgraph "Product Manager"
PM1["Fast infrastructure changes"]
PM2["New features need new AWS resources quickly"]
PM3["Don't want infra to be the bottleneck"]
end
subgraph "Architect"
AR1["Every infra change reviewed by human"]
AR2["Staging must mirror production exactly"]
AR3["Zero-tolerance for infra drift"]
end
subgraph "DevOps Engineer"
DE1["I'm the only one who can approve"]
DE2["Manual approval creates bottleneck"]
DE3["Automate gates, not approvals"]
end
PM2 ---|"Tension"| AR1
PM3 ---|"Tension"| DE1
AR2 ---|"Cost tension"| PM1
DE2 ---|"Agrees with"| PM3
AR3 ---|"Enables trust in"| DE3
Resolution
| Concern | Solution | Compromise |
|---|---|---|
| PM: Infra changes slow down features | Separate compute stack deploys in < 10 min | Networking/security changes still slow (and should be) |
| Architect: Human review required | cdk diff auto-posted to every PR — review is fast | Only production gets manual approval; staging is fully automated |
| Architect: Staging must mirror prod | Same CDK code, different config values per environment | Staging runs at 50% capacity (cost saving) — not identical |
| DevOps: Single approver bottleneck | Expand approval to 2 people (DevOps + Tech Lead) | Tech Lead review may slow things slightly but removes bus factor |
| All: Drift must not happen | Daily drift detection Lambda + SNS alerts | Detection, not prevention — manual console access not blocked |
Key Tradeoff: CDK Abstraction Power vs Debugging Difficulty
The double-edged sword of CDK: High-level constructs (L2/L3) auto-generate resources you didn't explicitly define (security groups, IAM policies, log groups). This is powerful but means debugging CloudFormation failures requires understanding the generated template.
CDK Code: ~200 lines TypeScript
Generated CFN: ~2,000 lines YAML
Debug Surface: 10x larger than source code
Mitigation:
1. Always review cdk diff output (not just CDK code) before approving
2. Use CDK snapshot tests to catch unexpected resource changes
3. Keep cdk.out/ artifacts for debugging failed deployments
4. Prefer L2 constructs over L3 (less magic, more control)
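Mitigation 2 can be made concrete without full snapshot files: comparing the previous and current synthesized templates and listing the logical IDs whose definitions changed surfaces exactly the "magic" resources a construct upgrade touched. A minimal sketch operating on parsed template JSON:

```typescript
// List logical IDs whose resource definitions differ between two synthesized
// CloudFormation templates — a lightweight snapshot check.
type CfnTemplate = { Resources?: Record<string, unknown> };

function changedResources(before: CfnTemplate, after: CfnTemplate): string[] {
  const a = before.Resources ?? {};
  const b = after.Resources ?? {};
  const ids = new Set([...Object.keys(a), ...Object.keys(b)]);
  // Structural comparison via serialization; added or removed resources
  // also show up because one side serializes to undefined.
  return [...ids].filter((id) => JSON.stringify(a[id]) !== JSON.stringify(b[id]));
}
```

Running this against the cdk.out artifacts kept per mitigation 3 turns "10x debug surface" into a short, reviewable list of changed logical IDs per deploy.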