
CD-01: Application Code Deployment Pipeline

User Story

As a Senior DevOps Engineer on the MangaAssist AI Chatbot team, I want to establish a fully automated CI/CD pipeline for deploying application code to ECS Fargate (baseline) and Lambda (burst overflow), so that every code change is tested, containerized, deployed with zero downtime, and automatically rolled back if health checks fail — enabling the 1–2 person DevOps team to ship multiple times per day with confidence.


Acceptance Criteria

  • Every push to main triggers the full pipeline (lint → test → build → deploy)
  • Docker images are built, scanned for vulnerabilities, and pushed to ECR
  • ECS Fargate services are updated via blue/green deployment with automatic rollback
  • Lambda functions are updated with versioning and alias-based traffic shifting
  • API Gateway configuration is deployed atomically alongside compute changes
  • Pipeline completes in < 15 minutes (commit to production traffic)
  • Failed deployments automatically roll back within 2 minutes
  • Deployment metrics (success rate, duration, rollback count) are published to CloudWatch
  • Slack/PagerDuty notifications on deployment success, failure, and rollback
  • Feature flags gate new functionality independent of deployment

High-Level Design

Pipeline Architecture Overview

flowchart LR
    subgraph "Source"
        A[GitHub Push to main] --> B[Webhook Trigger]
    end

    subgraph "CI Phase (5 min)"
        B --> C[Lint + Static Analysis]
        C --> D[Unit Tests + Coverage]
        D --> E[Integration Tests]
        E --> F[Build Docker Image]
        F --> G[ECR Push + Vulnerability Scan]
    end

    subgraph "CD Phase (8 min)"
        G --> H{ECS or Lambda?}
        H -->|ECS Services| I[Blue/Green Deploy]
        H -->|Lambda Functions| J[Version + Alias Shift]
        I --> K[Health Check + Canary]
        J --> K
        K -->|Pass| L[Full Traffic Shift]
        K -->|Fail| M[Automatic Rollback]
    end

    subgraph "Post-Deploy"
        L --> N[Smoke Tests]
        N --> O[CloudWatch Metrics]
        O --> P[Slack Notification]
    end

    M --> P

    style A fill:#ff9900,color:#000
    style L fill:#1B660F,color:#fff
    style M fill:#DD344C,color:#fff

Deployment Targets

| Component | Compute | Deployment Strategy | Traffic Shift |
|---|---|---|---|
| Orchestrator Service | ECS Fargate | Blue/Green via CodeDeploy | 10% → 50% → 100% over 8 min |
| Intent Classifier Proxy | ECS Fargate | Blue/Green via CodeDeploy | Same as orchestrator |
| RAG Service | ECS Fargate | Blue/Green via CodeDeploy | Same as orchestrator |
| Guardrails Service | ECS Fargate | Blue/Green via CodeDeploy | Same as orchestrator |
| Burst Overflow Handler | Lambda | Version + alias weighted | 5% → 100% over 5 min |
| WebSocket Handler | Lambda@Edge | Version + alias weighted | 5% → 100% over 5 min |
| API Gateway | REST + WebSocket | Stage deployment | Atomic swap |

Low-Level Design

1. Source Stage — Trunk-Based Development

gitGraph
    commit id: "feature-A merged"
    branch feature-B
    commit id: "WIP"
    checkout main
    commit id: "feature-C merged"
    commit id: "hotfix-1"
    checkout feature-B
    commit id: "ready"
    checkout main
    merge feature-B id: "feature-B merged"
    commit id: "deploy-tag-v1.42"

Branch Strategy: Trunk-based development with short-lived feature branches (< 2 days). All merges to main trigger the pipeline. Feature flags decouple deployment from release.

Trigger Configuration (GitHub Actions):

name: deploy-chatbot
on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'Dockerfile'
      - 'requirements.txt'
      - 'package.json'
  workflow_dispatch:
    inputs:
      environment:
        type: choice
        options: [staging, production]

concurrency:
  group: deploy-${{ github.ref }}
  cancel-in-progress: false  # Never cancel in-flight deploys

2. CI Phase — Build and Test

flowchart TD
    A[Checkout Code] --> B[Install Dependencies]
    B --> C{Parallel Jobs}
    C --> D[Python Lint — ruff + mypy]
    C --> E[Unit Tests — pytest]
    C --> F[TypeScript Lint — eslint]
    C --> G[Security Scan — Bandit + Semgrep]

    D --> H{All Pass?}
    E --> H
    F --> H
    G --> H

    H -->|Yes| I[Build Docker Image]
    H -->|No| J[Fail + Notify]

    I --> K[Trivy Vulnerability Scan]
    K -->|No Critical/High| L[Push to ECR]
    K -->|Critical/High Found| J

    L --> M[Tag: sha-abc123, immutable]

Docker Build (multi-stage for minimal image size):

# Build stage
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/deps -r requirements.txt

# Runtime stage
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /deps /usr/local/lib/python3.11/site-packages
COPY src/ ./src/
EXPOSE 8080
# python:3.11-slim ships without curl, so probe /health with the stdlib
HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]

ECR Push with Immutable Tags:

- name: Build and push to ECR
  env:
    ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
    IMAGE_TAG: sha-${{ github.sha }}
  run: |
    docker build -t $ECR_REGISTRY/mangaassist-chatbot:$IMAGE_TAG .
    docker push $ECR_REGISTRY/mangaassist-chatbot:$IMAGE_TAG
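The "push + vulnerability scan" gate can also be enforced from the pipeline itself. A minimal sketch, assuming ECR basic scanning ("scan on push") is enabled on the repository; the repository and tag names are illustrative:

```python
# Sketch: block the deploy if ECR's image scan reports Critical/High CVEs.
# Assumes "scan on push" is enabled; repository/tag names are illustrative.

def has_blocking_findings(severity_counts: dict) -> bool:
    """True if the scan found any CRITICAL or HIGH severity CVEs."""
    return any(severity_counts.get(sev, 0) > 0 for sev in ("CRITICAL", "HIGH"))

def gate_on_scan(repository: str, image_tag: str) -> None:
    import boto3  # imported lazily so has_blocking_findings is testable offline

    ecr = boto3.client('ecr')
    # Wait for the scan-on-push results, then read the severity summary
    ecr.get_waiter('image_scan_complete').wait(
        repositoryName=repository, imageId={'imageTag': image_tag}
    )
    findings = ecr.describe_image_scan_findings(
        repositoryName=repository, imageId={'imageTag': image_tag}
    )['imageScanFindings']
    if has_blocking_findings(findings.get('findingSeverityCounts', {})):
        raise SystemExit(f"Blocking CVEs found in {repository}:{image_tag}")
```

Trivy still runs in CI as the primary gate; this server-side check is a belt-and-braces backstop before CodeDeploy is invoked.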

3. CD Phase — ECS Blue/Green Deployment

sequenceDiagram
    participant GH as GitHub Actions
    participant CD as CodeDeploy
    participant ECS as ECS Fargate
    participant ALB as ALB
    participant CW as CloudWatch

    GH->>CD: Create deployment (new task def)
    CD->>ECS: Launch GREEN task set
    ECS-->>CD: Tasks healthy
    CD->>ALB: Route 10% to GREEN
    CD->>CW: Start canary monitoring (5 min)

    alt Metrics healthy
        CD->>ALB: Route 50% to GREEN
        CD->>CW: Monitor 3 more min
        CD->>ALB: Route 100% to GREEN
        CD->>ECS: Drain BLUE task set
        CD-->>GH: Deployment SUCCESS
    else Metrics degraded
        CD->>ALB: Route 100% back to BLUE
        CD->>ECS: Terminate GREEN tasks
        CD-->>GH: Deployment ROLLED BACK
    end

ECS Task Definition Update:

# Update task definition with new image
import boto3

def update_task_definition(family: str, new_image: str) -> str:
    ecs = boto3.client('ecs')

    # Get current task definition
    current = ecs.describe_task_definition(taskDefinition=family)
    task_def = current['taskDefinition']

    # Update container image
    for container in task_def['containerDefinitions']:
        if container['name'] == 'chatbot':
            container['image'] = new_image

    # Register new revision (keep all other config identical)
    response = ecs.register_task_definition(
        family=family,
        containerDefinitions=task_def['containerDefinitions'],
        taskRoleArn=task_def['taskRoleArn'],
        executionRoleArn=task_def['executionRoleArn'],
        networkMode=task_def['networkMode'],
        requiresCompatibilities=['FARGATE'],
        cpu=task_def['cpu'],
        memory=task_def['memory'],
    )
    return response['taskDefinition']['taskDefinitionArn']

CodeDeploy AppSpec (appspec.yaml):

version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: <TASK_DEFINITION>
        LoadBalancerInfo:
          ContainerName: "chatbot"
          ContainerPort: 8080
        PlatformVersion: "LATEST"
Hooks:
  - BeforeAllowTraffic: "validate-deployment"
  - AfterAllowTraffic: "run-smoke-tests"
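GitHub Actions hands this AppSpec to CodeDeploy as an inline revision. A sketch of that handoff; the application and deployment-group names are illustrative:

```python
# Sketch: start the blue/green deployment with an inline AppSpec.
# applicationName/deploymentGroupName values are illustrative.
import json

def build_appspec(task_def_arn: str) -> dict:
    """Mirror of appspec.yaml, with the new task definition ARN substituted in."""
    return {
        "version": 0.0,
        "Resources": [{
            "TargetService": {
                "Type": "AWS::ECS::Service",
                "Properties": {
                    "TaskDefinition": task_def_arn,
                    "LoadBalancerInfo": {
                        "ContainerName": "chatbot",
                        "ContainerPort": 8080,
                    },
                    "PlatformVersion": "LATEST",
                },
            }
        }],
        "Hooks": [
            {"BeforeAllowTraffic": "validate-deployment"},
            {"AfterAllowTraffic": "run-smoke-tests"},
        ],
    }

def create_deployment(task_def_arn: str) -> str:
    import boto3  # imported lazily so build_appspec is testable offline

    cd = boto3.client('codedeploy')
    resp = cd.create_deployment(
        applicationName='mangaassist-chatbot',
        deploymentGroupName='orchestrator-bluegreen',
        revision={
            'revisionType': 'AppSpecContent',
            'appSpecContent': {'content': json.dumps(build_appspec(task_def_arn))},
        },
    )
    return resp['deploymentId']
```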

4. CD Phase — Lambda Deployment

# Lambda version + alias deployment with weighted traffic
import boto3

def deploy_lambda(function_name: str, s3_bucket: str, s3_key: str) -> str:
    lmb = boto3.client('lambda')

    # Update function code
    lmb.update_function_code(
        FunctionName=function_name,
        S3Bucket=s3_bucket,
        S3Key=s3_key,
    )
    lmb.get_waiter('function_updated_v2').wait(FunctionName=function_name)

    # Publish new version
    version = lmb.publish_version(
        FunctionName=function_name,
        Description=f"Deploy {s3_key}"
    )['Version']

    # Keep the alias pointed at the current stable version and route a 5%
    # canary to the new version. (Pointing FunctionVersion at the new version
    # while also listing it in AdditionalVersionWeights is invalid — and would
    # send the new version the majority of traffic.)
    stable = lmb.get_alias(FunctionName=function_name, Name='live')['FunctionVersion']
    lmb.update_alias(
        FunctionName=function_name,
        Name='live',
        FunctionVersion=stable,
        RoutingConfig={
            'AdditionalVersionWeights': {
                version: 0.05  # 5% canary
            }
        }
    )
    return version
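After the 5-minute canary window, the alias is either promoted to 100% or rolled back based on the canary alarms. A sketch; the alarm and alias names are illustrative:

```python
# Sketch: promote the Lambda canary to 100% or roll it back, driven by the
# canary CloudWatch alarms. Alarm/alias names are illustrative.

def canary_decision(alarm_states: list) -> str:
    """'rollback' if any canary alarm fired, else 'promote'."""
    return 'rollback' if 'ALARM' in alarm_states else 'promote'

def finish_canary(function_name: str, new_version: str, alarm_names: list) -> str:
    import boto3  # imported lazily so canary_decision is testable offline

    cw = boto3.client('cloudwatch')
    states = [
        a['StateValue']
        for a in cw.describe_alarms(AlarmNames=alarm_names)['MetricAlarms']
    ]
    decision = canary_decision(states)

    lmb = boto3.client('lambda')
    if decision == 'promote':
        # Point the alias fully at the new version and clear the canary weight
        lmb.update_alias(
            FunctionName=function_name, Name='live',
            FunctionVersion=new_version,
            RoutingConfig={'AdditionalVersionWeights': {}},
        )
    else:
        # Drop the canary weight; the alias still points at the stable version
        lmb.update_alias(
            FunctionName=function_name, Name='live',
            RoutingConfig={'AdditionalVersionWeights': {}},
        )
    return decision
```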

5. Health Check and Rollback Logic

Canary Validation CloudWatch Alarms:

| Metric | Threshold | Evaluation Period | Action on Breach |
|---|---|---|---|
| HTTP 5xx rate | > 1% | 2 of 3 minutes | Auto-rollback |
| P95 latency | > 3 s | 3 of 5 minutes | Auto-rollback |
| Task health check failures | > 0 | 1 of 1 minute | Auto-rollback |
| Error log rate | > 5x baseline | 2 of 3 minutes | Alert + manual review |
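The first row of this table maps onto a CloudWatch metric math alarm on the ALB target group. A sketch of the alarm parameters; the alarm name and dimension values are illustrative:

```python
# Sketch: "HTTP 5xx rate > 1%, 2 of 3 minutes" as a CloudWatch metric math
# alarm on the ALB target group. Names/dimension values are illustrative.

def error_rate_alarm_params(target_group: str, load_balancer: str) -> dict:
    dims = [
        {'Name': 'TargetGroup', 'Value': target_group},
        {'Name': 'LoadBalancer', 'Value': load_balancer},
    ]

    def stat(metric_id: str, name: str) -> dict:
        # One-minute sums of an AWS/ApplicationELB metric, hidden from output
        return {
            'Id': metric_id,
            'MetricStat': {
                'Metric': {'Namespace': 'AWS/ApplicationELB',
                           'MetricName': name, 'Dimensions': dims},
                'Period': 60,
                'Stat': 'Sum',
            },
            'ReturnData': False,
        }

    return {
        'AlarmName': 'chatbot-canary-5xx-rate',
        'Metrics': [
            stat('requests', 'RequestCount'),
            stat('errors', 'HTTPCode_Target_5XX_Count'),
            {'Id': 'rate', 'Expression': '100 * errors / requests',
             'Label': '5xx rate (%)', 'ReturnData': True},
        ],
        'Threshold': 1.0,
        'ComparisonOperator': 'GreaterThanThreshold',
        'EvaluationPeriods': 3,
        'DatapointsToAlarm': 2,   # "2 of 3 minutes"
        'TreatMissingData': 'notBreaching',
    }

# boto3.client('cloudwatch').put_metric_alarm(**error_rate_alarm_params(...))
```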

Rollback Decision Flow:

flowchart TD
    A[Canary Running] --> B{5xx > 1%?}
    B -->|Yes| C[IMMEDIATE ROLLBACK]
    B -->|No| D{P95 > 3s?}
    D -->|Yes| C
    D -->|No| E{Health checks failing?}
    E -->|Yes| C
    E -->|No| F{Error log spike?}
    F -->|Yes| G[Alert Team — Hold Deploy]
    F -->|No| H[Proceed to Next Traffic %]

    C --> I[Route 100% to BLUE]
    C --> J[Post to Slack with Root Cause]
    C --> K[Create Incident Ticket]

6. Post-Deployment Smoke Tests

# Automated smoke test suite run after each deployment
import httpx

SMOKE_TESTS = [
    {
        "name": "health_check",
        "method": "GET",
        "path": "/health",
        "expected_status": 200,
    },
    {
        "name": "chat_basic",
        "method": "POST",
        "path": "/api/v1/chat",
        "body": {"message": "What manga do you recommend?", "session_id": "smoke-test"},
        "expected_status": 200,
        "max_latency_ms": 3000,
    },
    {
        "name": "intent_classification",
        "method": "POST",
        "path": "/api/v1/classify",
        "body": {"message": "I want to return my order"},
        "expected_status": 200,
        "expected_intent": "order_return",
    },
]

async def run_smoke_tests(base_url: str) -> bool:
    async with httpx.AsyncClient(timeout=10.0) as client:
        for test in SMOKE_TESTS:
            response = await client.request(
                test["method"],
                f"{base_url}{test['path']}",
                json=test.get("body"),
            )
            if response.status_code != test["expected_status"]:
                return False
            if "max_latency_ms" in test:
                if response.elapsed.total_seconds() * 1000 > test["max_latency_ms"]:
                    return False
            # Verify the classifier returned the expected intent, if specified
            if "expected_intent" in test:
                if response.json().get("intent") != test["expected_intent"]:
                    return False
    return True
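The appspec's run-smoke-tests AfterAllowTraffic hook can wrap this suite in a Lambda handler that reports the result back to CodeDeploy, so a failing suite triggers rollback. A sketch, assuming the suite above is packaged with the handler; the BASE_URL environment variable is illustrative:

```python
# Sketch: Lambda handler for the "run-smoke-tests" AfterAllowTraffic hook.
# Reports the suite result to CodeDeploy so a failure triggers rollback.
# Assumes run_smoke_tests (above) is packaged alongside; BASE_URL is illustrative.
import asyncio
import os

def hook_status(passed: bool) -> str:
    """CodeDeploy lifecycle hooks expect 'Succeeded' or 'Failed'."""
    return 'Succeeded' if passed else 'Failed'

def handler(event, context):
    import boto3  # imported lazily so hook_status is testable offline

    passed = asyncio.run(run_smoke_tests(os.environ['BASE_URL']))
    boto3.client('codedeploy').put_lifecycle_event_hook_execution_status(
        deploymentId=event['DeploymentId'],
        lifecycleEventHookExecutionId=event['LifecycleEventHookExecutionId'],
        status=hook_status(passed),
    )
    return hook_status(passed)
```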

Critical Decisions

Decision 1: CI/CD Platform — GitHub Actions vs AWS CodePipeline vs GitLab CI

graph LR
    subgraph "GitHub Actions"
        GA1[Native GitHub integration]
        GA2[Rich marketplace]
        GA3[YAML workflows]
        GA4[Community familiarity]
    end

    subgraph "AWS CodePipeline"
        CP1[Deep AWS integration]
        CP2[IAM native auth]
        CP3[CodeDeploy blue/green]
        CP4[No external dependency]
    end

    subgraph "GitLab CI"
        GL1[Built-in registry]
        GL2[Auto DevOps]
        GL3[Self-hosted runners]
        GL4[Compliance features]
    end
| Criteria (Weight) | GitHub Actions | AWS CodePipeline | GitLab CI |
|---|---|---|---|
| AWS Integration (25%) | 7/10 — via OIDC + actions | 10/10 — native | 6/10 — via CLI/SDK |
| Developer Experience (20%) | 9/10 — most devs know it | 5/10 — clunky console UI | 8/10 — good YAML DX |
| Cost (15%) | 7/10 — free tier + $0.008/min | 8/10 — $1/pipeline/month | 6/10 — runner costs |
| Ecosystem/Marketplace (15%) | 10/10 — largest marketplace | 4/10 — limited actions | 7/10 — decent templates |
| Blue/Green ECS (10%) | 6/10 — needs CodeDeploy anyway | 10/10 — built-in | 5/10 — custom scripts |
| Secrets Management (10%) | 7/10 — GitHub Secrets | 9/10 — IAM + Secrets Manager | 7/10 — CI variables |
| Audit/Compliance (5%) | 7/10 — audit logs | 9/10 — CloudTrail | 8/10 — compliance dashboard |
| Weighted Score | 7.8/10 | 7.7/10 | 6.7/10 |
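The margin between the top two options is narrow enough to be worth sanity-checking. Recomputing the weighted sums from the row scores:

```python
# Recompute the decision-matrix totals from the per-criterion scores.
weights = [0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05]  # sums to 1.0
scores = {
    'GitHub Actions':   [7, 9, 7, 10, 6, 7, 7],
    'AWS CodePipeline': [10, 5, 8, 4, 10, 9, 9],
    'GitLab CI':        [6, 8, 6, 7, 5, 7, 8],
}
totals = {tool: round(sum(w * s for w, s in zip(weights, row)), 2)
          for tool, row in scores.items()}
# totals == {'GitHub Actions': 7.75, 'AWS CodePipeline': 7.65, 'GitLab CI': 6.65}
```

GitHub Actions edges out CodePipeline 7.75 to 7.65 (7.8 vs 7.7 at one decimal) — a thin lead, which is part of why the decision below is a hybrid rather than a winner-takes-all.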

Decision: Hybrid — GitHub Actions for CI + AWS CodeDeploy for CD

Rationale: GitHub Actions provides the best developer experience for the CI phase (lint, test, build, scan). AWS CodeDeploy provides native ECS blue/green deployment that no other tool matches. The hybrid approach gives us the best of both worlds:

  • GitHub Actions: Source trigger, parallel test jobs, Docker build, ECR push
  • AWS CodeDeploy: Blue/green traffic shifting, automatic rollback, health check integration

Why not pure CodePipeline? The team already uses GitHub. CodePipeline's UI is dated, and the 1-2 person DevOps team values developer experience over all-AWS purity. CodePipeline adds a $1/pipeline/month cost for minimal additional value when GitHub Actions already handles CI.

Why not pure GitHub Actions? ECS blue/green deployment requires CodeDeploy regardless. Trying to replicate blue/green via raw aws ecs update-service loses automatic rollback, traffic shifting granularity, and lifecycle hooks.


Decision 2: Deployment Strategy — Blue/Green vs Canary vs Rolling

| Criteria | Blue/Green | Canary | Rolling |
|---|---|---|---|
| Rollback Speed | Instant (swap ALB target) | Fast (stop traffic shift) | Slow (must redeploy) |
| Cost During Deploy | 2x capacity for ~10 min | 1x + small canary | 1x (gradual replace) |
| Risk Exposure | None until traffic shift | Small % exposed early | Gradual exposure |
| Complexity | Medium (CodeDeploy handles) | High (custom metrics) | Low (ECS native) |
| Validation Window | Pre-traffic health check | Real traffic validation | Real traffic per-task |
| Zero Downtime | Yes | Yes | Yes (with min healthy %) |
| Best For | Critical services, fast rollback | ML endpoints, gradual | Non-critical, cost-sensitive |

Decision: Blue/Green for ECS services + Canary-like traffic shifting (10% → 50% → 100%)

flowchart LR
    A[New Version Deployed] --> B[0% Traffic — Health Checks Only]
    B -->|All healthy| C[10% Traffic — 5 min canary]
    C -->|Metrics OK| D[50% Traffic — 3 min]
    D -->|Metrics OK| E[100% Traffic — BLUE drained]

    C -->|Metrics Bad| F[Rollback to BLUE]
    D -->|Metrics Bad| F

Rationale: Pure blue/green gives instant rollback capability (critical for a customer-facing chatbot). Adding graduated traffic shifting (10% → 50% → 100%) gives us canary-like validation on real traffic without the complexity of managing separate canary infrastructure. CodeDeploy supports this natively via TimeBasedLinear deployment config.

Why not pure canary? For a 1-2 person team, maintaining separate canary infrastructure (dedicated canary tasks, canary-specific routing rules, custom metrics aggregation) is operational overhead. Blue/green with traffic shifting achieves 80% of the value at 20% of the complexity.

Why not rolling? Rolling updates cannot be instantly rolled back — if task 5 of 10 is bad, you must wait for a new rolling deployment to replace all tasks. For an AI chatbot where a bad deployment could cause hallucinations or incorrect responses, instant rollback is non-negotiable.


Decision 3: Branching Strategy — Trunk-Based vs GitFlow vs GitHub Flow

| Criteria | Trunk-Based | GitFlow | GitHub Flow |
|---|---|---|---|
| Deploy Frequency | Multiple per day | Weekly/biweekly releases | Multiple per day |
| Branch Complexity | Low (main + short-lived) | High (main, develop, release, hotfix) | Low (main + feature) |
| Feature Flags Needed | Yes (required) | No (version-based) | Optional |
| CI/CD Complexity | Low (one pipeline) | High (multiple pipelines) | Low (one pipeline) |
| Team Size Fit (1-2) | Excellent | Poor (too much overhead) | Good |
| Merge Conflicts | Rare (short branches) | Frequent (long-lived branches) | Moderate |
| Hotfix Speed | Instant (commit to main) | Slow (branch + merge + cherry-pick) | Fast (PR to main) |

Decision: Trunk-Based Development with feature flags

Rationale: With a 1-2 person DevOps team, GitFlow's branch management overhead (maintaining develop, release, hotfix branches) is unjustifiable. Trunk-based development means every merge to main is deployable. Feature flags (via AWS AppConfig — see CD-06) decouple deployment from release, allowing us to ship code that's not yet customer-visible.
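A minimal sketch of the flag check the services perform, using the AWS AppConfig data API; the application/environment/profile identifiers are illustrative, and the full client (with polling and caching) belongs to CD-06:

```python
# Sketch: evaluate a feature flag fetched from AWS AppConfig.
# Identifiers are illustrative; the production client lives in CD-06.
import json

def flag_enabled(config: dict, flag: str) -> bool:
    """AppConfig feature-flag payloads expose an 'enabled' bool per flag."""
    return bool(config.get(flag, {}).get('enabled', False))

def fetch_flags() -> dict:
    import boto3  # imported lazily so flag_enabled is testable offline

    client = boto3.client('appconfigdata')
    session = client.start_configuration_session(
        ApplicationIdentifier='mangaassist-chatbot',
        EnvironmentIdentifier='production',
        ConfigurationProfileIdentifier='feature-flags',
    )
    resp = client.get_latest_configuration(
        ConfigurationToken=session['InitialConfigurationToken']
    )
    return json.loads(resp['Configuration'].read())

# Usage: code is deployed dark, then released by flipping the flag in AppConfig
# if flag_enabled(fetch_flags(), 'new-recommendation-engine'): ...
```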


Tradeoffs

The Debate: Deployment Speed vs Safety Gates

graph TD
    subgraph "Product Manager"
        PM1["Ship features faster"]
        PM2["Daily deploys minimum"]
        PM3["Customers waiting for fixes"]
    end

    subgraph "Architect"
        AR1["Zero production incidents"]
        AR2["Full regression before deploy"]
        AR3["1-hour canary minimum"]
    end

    subgraph "Team Lead"
        TL1["Team doesn't burn out"]
        TL2["Auto-everything, manual-nothing"]
        TL3["Pipeline < 15 min"]
    end

    PM1 ---|"Tension"| AR2
    PM2 ---|"Tension"| AR3
    AR1 ---|"Tension"| TL3
    TL2 ---|"Enables"| PM2
    TL2 ---|"Enables"| AR1

Resolution: Automated Speed with Safety Nets

| Concern | Solution | Compromise |
|---|---|---|
| PM wants daily deploys | Trunk-based + feature flags enable multiple daily deploys | Features are deployed but gated — PM must wait for flag activation |
| Architect wants full regression | Automated test suite runs in < 5 min (not 1-hour manual regression) | Trade comprehensive manual testing for fast automated coverage |
| Architect wants 1-hour canary | 8-minute graduated traffic shift (10% → 50% → 100%) | Shorter canary window but with automated rollback — incidents last minutes not hours |
| Team Lead wants < 15 min pipeline | CI (5 min) + CD (8 min) + smoke tests (2 min) = 15 min total | Tight budget — test suite must stay fast, no room for slow integration tests in main pipeline |
| Team Lead wants zero manual steps | Everything automated — approval only for infrastructure changes (CD-02) | Lose the "human in the loop" safety net for app deploys — rely on monitoring instead |

Key Tradeoff: Pipeline Speed vs Test Coverage

pie title "15-Minute Pipeline Budget Allocation"
    "Lint + Static Analysis" : 1
    "Unit Tests" : 2
    "Integration Tests" : 2
    "Docker Build + Scan" : 3
    "Blue/Green Deploy" : 5
    "Smoke Tests" : 2

What we sacrifice for speed:

  • No end-to-end browser tests in the deploy pipeline (moved to nightly runs)
  • No load/performance tests per deploy (weekly scheduled)
  • No manual QA gate (replaced by automated smoke tests)
  • No multi-region deploy validation (single-region MVP per 14-mvp-vs-future.md)

What we gain:

  • Multiple daily deploys, safe for a 1–2 person team
  • Instant automated rollback on any regression
  • Developer confidence: merge to main = production in 15 minutes
  • No deployment anxiety or "deploy freezes"


Failure Scenarios and Recovery

| Scenario | Detection | Recovery | RTO |
|---|---|---|---|
| Bad code passes tests | Canary 5xx alarm in 2 min | Auto-rollback via CodeDeploy | < 3 min |
| Docker image has CVE | Trivy scan blocks ECR push | Fix dependency, re-push | Pipeline re-run (15 min) |
| ECR push timeout | GitHub Actions retry (3 attempts) | Auto-retry, then fail + notify | < 5 min |
| ECS tasks won't start | Task health check timeout (3 min) | CodeDeploy cancels, keeps BLUE | < 5 min |
| Lambda cold start spike | P95 latency alarm | Keep provisioned concurrency, roll back if persistent | < 2 min |
| API Gateway stage deploy fails | CloudFormation rollback | Previous stage preserved | < 5 min |

Monitoring This Pipeline

| Metric | Source | Alert Threshold |
|---|---|---|
| Deploy success rate | CloudWatch custom metric | < 95% over 7 days |
| Deploy duration P95 | GitHub Actions API | > 20 min |
| Rollback count | CodeDeploy events | > 2/week |
| Mean time to recovery | CloudWatch composite | > 5 min |
| ECR image count | ECR lifecycle policy | > 50 untagged images |
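The custom deployment metrics above (and the acceptance criterion that success rate, duration, and rollback count reach CloudWatch) can be published from the pipeline's final step. A sketch; the namespace and dimension names are illustrative:

```python
# Sketch: publish deployment outcome metrics from the pipeline's final step.
# Namespace and dimension names are illustrative.

def deploy_metrics(service: str, success: bool, duration_s: float) -> list:
    """Build the put_metric_data payload for one deployment."""
    dims = [{'Name': 'Service', 'Value': service}]
    return [
        {'MetricName': 'DeploySuccess', 'Dimensions': dims,
         'Value': 1.0 if success else 0.0, 'Unit': 'Count'},
        {'MetricName': 'DeployDuration', 'Dimensions': dims,
         'Value': duration_s, 'Unit': 'Seconds'},
    ]

def publish(service: str, success: bool, duration_s: float) -> None:
    import boto3  # imported lazily so deploy_metrics is testable offline

    boto3.client('cloudwatch').put_metric_data(
        Namespace='MangaAssist/Deployments',
        MetricData=deploy_metrics(service, success, duration_s),
    )
```

The "deploy success rate < 95% over 7 days" alert then becomes a simple Average-statistic alarm on DeploySuccess.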