
CD-01: Application Code Deployment Pipeline

User Story

As a Senior DevOps Engineer on the MangaAssist AI Chatbot team, I want to establish a fully automated CI/CD pipeline for deploying application code to ECS Fargate (baseline) and Lambda (burst overflow), so that every code change is tested, containerized, deployed with zero downtime, and automatically rolled back if health checks fail — enabling the 1–2 person DevOps team to ship multiple times per day with confidence.


Acceptance Criteria

  • Every push to main triggers the full pipeline (lint → test → build → deploy)
  • Docker images are built, scanned for vulnerabilities, and pushed to ECR
  • ECS Fargate services are updated via blue/green deployment with automatic rollback
  • Lambda functions are updated with versioning and alias-based traffic shifting
  • API Gateway configuration is deployed atomically alongside compute changes
  • Pipeline completes in < 15 minutes (commit to production traffic)
  • Failed deployments automatically roll back within 2 minutes
  • Deployment metrics (success rate, duration, rollback count) are published to CloudWatch
  • Slack/PagerDuty notifications on deployment success, failure, and rollback
  • Feature flags gate new functionality independent of deployment

High-Level Design

Pipeline Architecture Overview

flowchart LR
    subgraph "Source"
        A[GitHub Push to main] --> B[Webhook Trigger]
    end

    subgraph "CI Phase (5 min)"
        B --> C[Lint + Static Analysis]
        C --> D[Unit Tests + Coverage]
        D --> E[Integration Tests]
        E --> F[Build Docker Image]
        F --> G[ECR Push + Vulnerability Scan]
    end

    subgraph "CD Phase (8 min)"
        G --> H{ECS or Lambda?}
        H -->|ECS Services| I[Blue/Green Deploy]
        H -->|Lambda Functions| J[Version + Alias Shift]
        I --> K[Health Check + Canary]
        J --> K
        K -->|Pass| L[Full Traffic Shift]
        K -->|Fail| M[Automatic Rollback]
    end

    subgraph "Post-Deploy"
        L --> N[Smoke Tests]
        N --> O[CloudWatch Metrics]
        O --> P[Slack Notification]
    end

    M --> P

    style A fill:#ff9900,color:#000
    style L fill:#1B660F,color:#fff
    style M fill:#DD344C,color:#fff

Deployment Targets

| Component | Compute | Deployment Strategy | Traffic Shift |
|---|---|---|---|
| Orchestrator Service | ECS Fargate | Blue/Green via CodeDeploy | 10% → 50% → 100% over 8 min |
| Intent Classifier Proxy | ECS Fargate | Blue/Green via CodeDeploy | Same as orchestrator |
| RAG Service | ECS Fargate | Blue/Green via CodeDeploy | Same as orchestrator |
| Guardrails Service | ECS Fargate | Blue/Green via CodeDeploy | Same as orchestrator |
| Burst Overflow Handler | Lambda | Version + alias weighted | 5% → 100% over 5 min |
| WebSocket Handler | Lambda@Edge | Version + alias weighted | 5% → 100% over 5 min |
| API Gateway | REST + WebSocket | Stage deployment | Atomic swap |

Low-Level Design

1. Source Stage — Trunk-Based Development

gitGraph
    commit id: "feature-A merged"
    branch feature-B
    commit id: "WIP"
    checkout main
    commit id: "feature-C merged"
    commit id: "hotfix-1"
    checkout feature-B
    commit id: "ready"
    checkout main
    merge feature-B id: "feature-B merged"
    commit id: "deploy-tag-v1.42"

Branch Strategy: Trunk-based development with short-lived feature branches (< 2 days). All merges to main trigger the pipeline. Feature flags decouple deployment from release.

Trigger Configuration (GitHub Actions):

name: deploy-chatbot
on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'Dockerfile'
      - 'requirements.txt'
      - 'package.json'
  workflow_dispatch:
    inputs:
      environment:
        type: choice
        options: [staging, production]

concurrency:
  group: deploy-${{ github.ref }}
  cancel-in-progress: false  # Never cancel in-flight deploys

2. CI Phase — Build and Test

flowchart TD
    A[Checkout Code] --> B[Install Dependencies]
    B --> C{Parallel Jobs}
    C --> D[Python Lint — ruff + mypy]
    C --> E[Unit Tests — pytest]
    C --> F[TypeScript Lint — eslint]
    C --> G[Security Scan — Bandit + Semgrep]

    D --> H{All Pass?}
    E --> H
    F --> H
    G --> H

    H -->|Yes| I[Build Docker Image]
    H -->|No| J[Fail + Notify]

    I --> K[Trivy Vulnerability Scan]
    K -->|No Critical/High| L[Push to ECR]
    K -->|Critical/High Found| J

    L --> M[Tag: sha-abc123, immutable]

Docker Build (multi-stage for minimal image size):

# Build stage
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/deps -r requirements.txt

# Runtime stage
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /deps /usr/local/lib/python3.11/site-packages
COPY src/ ./src/
EXPOSE 8080
# python:3.11-slim ships without curl, so probe /health with the stdlib
HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]

ECR Push with Immutable Tags:

- name: Build and push to ECR
  env:
    ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
    IMAGE_TAG: sha-${{ github.sha }}
  run: |
    docker build -t $ECR_REGISTRY/mangaassist-chatbot:$IMAGE_TAG .
    docker push $ECR_REGISTRY/mangaassist-chatbot:$IMAGE_TAG
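The "push + vulnerability scan" gate can also be enforced from the pipeline itself. A minimal sketch, assuming ECR basic scanning ("scan on push") is enabled on the repository; the repository and tag names are illustrative:

```python
# Sketch: block the deploy if ECR's image scan reports Critical/High CVEs.
# Assumes "scan on push" is enabled; repository/tag names are illustrative.

def has_blocking_findings(severity_counts: dict) -> bool:
    """True if the scan found any CRITICAL or HIGH severity CVEs."""
    return any(severity_counts.get(sev, 0) > 0 for sev in ("CRITICAL", "HIGH"))

def gate_on_scan(repository: str, image_tag: str) -> None:
    import boto3  # imported lazily so has_blocking_findings is testable offline

    ecr = boto3.client('ecr')
    # Wait for the scan-on-push results, then read the severity summary
    ecr.get_waiter('image_scan_complete').wait(
        repositoryName=repository, imageId={'imageTag': image_tag}
    )
    findings = ecr.describe_image_scan_findings(
        repositoryName=repository, imageId={'imageTag': image_tag}
    )['imageScanFindings']
    if has_blocking_findings(findings.get('findingSeverityCounts', {})):
        raise SystemExit(f"Blocking CVEs found in {repository}:{image_tag}")
```

Trivy still runs in CI as the primary gate; this server-side check is a belt-and-braces backstop before CodeDeploy is invoked.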

3. CD Phase — ECS Blue/Green Deployment

sequenceDiagram
    participant GH as GitHub Actions
    participant CD as CodeDeploy
    participant ECS as ECS Fargate
    participant ALB as ALB
    participant CW as CloudWatch

    GH->>CD: Create deployment (new task def)
    CD->>ECS: Launch GREEN task set
    ECS-->>CD: Tasks healthy
    CD->>ALB: Route 10% to GREEN
    CD->>CW: Start canary monitoring (5 min)

    alt Metrics healthy
        CD->>ALB: Route 50% to GREEN
        CD->>CW: Monitor 3 more min
        CD->>ALB: Route 100% to GREEN
        CD->>ECS: Drain BLUE task set
        CD-->>GH: Deployment SUCCESS
    else Metrics degraded
        CD->>ALB: Route 100% back to BLUE
        CD->>ECS: Terminate GREEN tasks
        CD-->>GH: Deployment ROLLED BACK
    end

ECS Task Definition Update:

# Update task definition with new image
import boto3

def update_task_definition(family: str, new_image: str) -> str:
    ecs = boto3.client('ecs')

    # Get current task definition
    current = ecs.describe_task_definition(taskDefinition=family)
    task_def = current['taskDefinition']

    # Update container image
    for container in task_def['containerDefinitions']:
        if container['name'] == 'chatbot':
            container['image'] = new_image

    # Register new revision (keep all other config identical)
    response = ecs.register_task_definition(
        family=family,
        containerDefinitions=task_def['containerDefinitions'],
        taskRoleArn=task_def['taskRoleArn'],
        executionRoleArn=task_def['executionRoleArn'],
        networkMode=task_def['networkMode'],
        requiresCompatibilities=['FARGATE'],
        cpu=task_def['cpu'],
        memory=task_def['memory'],
    )
    return response['taskDefinition']['taskDefinitionArn']

CodeDeploy AppSpec (appspec.yaml):

version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: <TASK_DEFINITION>
        LoadBalancerInfo:
          ContainerName: "chatbot"
          ContainerPort: 8080
        PlatformVersion: "LATEST"
Hooks:
  - BeforeAllowTraffic: "validate-deployment"
  - AfterAllowTraffic: "run-smoke-tests"
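GitHub Actions hands this AppSpec to CodeDeploy as an inline revision. A sketch of that handoff; the application and deployment-group names are illustrative:

```python
# Sketch: start the blue/green deployment with an inline AppSpec.
# applicationName/deploymentGroupName values are illustrative.
import json

def build_appspec(task_def_arn: str) -> dict:
    """Mirror of appspec.yaml, with the new task definition ARN substituted in."""
    return {
        "version": 0.0,
        "Resources": [{
            "TargetService": {
                "Type": "AWS::ECS::Service",
                "Properties": {
                    "TaskDefinition": task_def_arn,
                    "LoadBalancerInfo": {
                        "ContainerName": "chatbot",
                        "ContainerPort": 8080,
                    },
                    "PlatformVersion": "LATEST",
                },
            }
        }],
        "Hooks": [
            {"BeforeAllowTraffic": "validate-deployment"},
            {"AfterAllowTraffic": "run-smoke-tests"},
        ],
    }

def create_deployment(task_def_arn: str) -> str:
    import boto3  # imported lazily so build_appspec is testable offline

    cd = boto3.client('codedeploy')
    resp = cd.create_deployment(
        applicationName='mangaassist-chatbot',
        deploymentGroupName='orchestrator-bluegreen',
        revision={
            'revisionType': 'AppSpecContent',
            'appSpecContent': {'content': json.dumps(build_appspec(task_def_arn))},
        },
    )
    return resp['deploymentId']
```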

4. CD Phase — Lambda Deployment

# Lambda version + alias deployment with weighted traffic
import boto3

def deploy_lambda(function_name: str, s3_bucket: str, s3_key: str) -> str:
    lmb = boto3.client('lambda')

    # Update function code
    lmb.update_function_code(
        FunctionName=function_name,
        S3Bucket=s3_bucket,
        S3Key=s3_key,
    )
    lmb.get_waiter('function_updated_v2').wait(FunctionName=function_name)

    # Publish new version
    version = lmb.publish_version(
        FunctionName=function_name,
        Description=f"Deploy {s3_key}"
    )['Version']

    # Keep the alias pointed at the current stable version and route a 5%
    # canary to the new version. (Pointing FunctionVersion at the new version
    # while also listing it in AdditionalVersionWeights is invalid — and would
    # send the new version the majority of traffic.)
    stable = lmb.get_alias(FunctionName=function_name, Name='live')['FunctionVersion']
    lmb.update_alias(
        FunctionName=function_name,
        Name='live',
        FunctionVersion=stable,
        RoutingConfig={
            'AdditionalVersionWeights': {
                version: 0.05  # 5% canary
            }
        }
    )
    return version
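After the 5-minute canary window, the alias is either promoted to 100% or rolled back based on the canary alarms. A sketch; the alarm and alias names are illustrative:

```python
# Sketch: promote the Lambda canary to 100% or roll it back, driven by the
# canary CloudWatch alarms. Alarm/alias names are illustrative.

def canary_decision(alarm_states: list) -> str:
    """'rollback' if any canary alarm fired, else 'promote'."""
    return 'rollback' if 'ALARM' in alarm_states else 'promote'

def finish_canary(function_name: str, new_version: str, alarm_names: list) -> str:
    import boto3  # imported lazily so canary_decision is testable offline

    cw = boto3.client('cloudwatch')
    states = [
        a['StateValue']
        for a in cw.describe_alarms(AlarmNames=alarm_names)['MetricAlarms']
    ]
    decision = canary_decision(states)

    lmb = boto3.client('lambda')
    if decision == 'promote':
        # Point the alias fully at the new version and clear the canary weight
        lmb.update_alias(
            FunctionName=function_name, Name='live',
            FunctionVersion=new_version,
            RoutingConfig={'AdditionalVersionWeights': {}},
        )
    else:
        # Drop the canary weight; the alias still points at the stable version
        lmb.update_alias(
            FunctionName=function_name, Name='live',
            RoutingConfig={'AdditionalVersionWeights': {}},
        )
    return decision
```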

5. Health Check and Rollback Logic

Canary Validation CloudWatch Alarms:

| Metric | Threshold | Evaluation Period | Action on Breach |
|---|---|---|---|
| HTTP 5xx rate | > 1% | 2 of 3 minutes | Auto-rollback |
| P95 latency | > 3 s | 3 of 5 minutes | Auto-rollback |
| Task health check failures | > 0 | 1 of 1 minute | Auto-rollback |
| Error log rate | > 5x baseline | 2 of 3 minutes | Alert + manual review |
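The first row of this table maps onto a CloudWatch metric math alarm on the ALB target group. A sketch of the alarm parameters; the alarm name and dimension values are illustrative:

```python
# Sketch: "HTTP 5xx rate > 1%, 2 of 3 minutes" as a CloudWatch metric math
# alarm on the ALB target group. Names/dimension values are illustrative.

def error_rate_alarm_params(target_group: str, load_balancer: str) -> dict:
    dims = [
        {'Name': 'TargetGroup', 'Value': target_group},
        {'Name': 'LoadBalancer', 'Value': load_balancer},
    ]

    def stat(metric_id: str, name: str) -> dict:
        # One-minute sums of an AWS/ApplicationELB metric, hidden from output
        return {
            'Id': metric_id,
            'MetricStat': {
                'Metric': {'Namespace': 'AWS/ApplicationELB',
                           'MetricName': name, 'Dimensions': dims},
                'Period': 60,
                'Stat': 'Sum',
            },
            'ReturnData': False,
        }

    return {
        'AlarmName': 'chatbot-canary-5xx-rate',
        'Metrics': [
            stat('requests', 'RequestCount'),
            stat('errors', 'HTTPCode_Target_5XX_Count'),
            {'Id': 'rate', 'Expression': '100 * errors / requests',
             'Label': '5xx rate (%)', 'ReturnData': True},
        ],
        'Threshold': 1.0,
        'ComparisonOperator': 'GreaterThanThreshold',
        'EvaluationPeriods': 3,
        'DatapointsToAlarm': 2,   # "2 of 3 minutes"
        'TreatMissingData': 'notBreaching',
    }

# boto3.client('cloudwatch').put_metric_alarm(**error_rate_alarm_params(...))
```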

Rollback Decision Flow:

flowchart TD
    A[Canary Running] --> B{5xx > 1%?}
    B -->|Yes| C[IMMEDIATE ROLLBACK]
    B -->|No| D{P95 > 3s?}
    D -->|Yes| C
    D -->|No| E{Health checks failing?}
    E -->|Yes| C
    E -->|No| F{Error log spike?}
    F -->|Yes| G[Alert Team — Hold Deploy]
    F -->|No| H[Proceed to Next Traffic %]

    C --> I[Route 100% to BLUE]
    C --> J[Post to Slack with Root Cause]
    C --> K[Create Incident Ticket]

6. Post-Deployment Smoke Tests

# Automated smoke test suite run after each deployment
import httpx

SMOKE_TESTS = [
    {
        "name": "health_check",
        "method": "GET",
        "path": "/health",
        "expected_status": 200,
    },
    {
        "name": "chat_basic",
        "method": "POST",
        "path": "/api/v1/chat",
        "body": {"message": "What manga do you recommend?", "session_id": "smoke-test"},
        "expected_status": 200,
        "max_latency_ms": 3000,
    },
    {
        "name": "intent_classification",
        "method": "POST",
        "path": "/api/v1/classify",
        "body": {"message": "I want to return my order"},
        "expected_status": 200,
        "expected_intent": "order_return",
    },
]

async def run_smoke_tests(base_url: str) -> bool:
    async with httpx.AsyncClient(timeout=10.0) as client:
        for test in SMOKE_TESTS:
            response = await client.request(
                test["method"],
                f"{base_url}{test['path']}",
                json=test.get("body"),
            )
            if response.status_code != test["expected_status"]:
                return False
            if "max_latency_ms" in test:
                if response.elapsed.total_seconds() * 1000 > test["max_latency_ms"]:
                    return False
            # Verify the classifier returned the expected intent, if specified
            if "expected_intent" in test:
                if response.json().get("intent") != test["expected_intent"]:
                    return False
    return True
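The appspec's run-smoke-tests AfterAllowTraffic hook can wrap this suite in a Lambda handler that reports the result back to CodeDeploy, so a failing suite triggers rollback. A sketch, assuming the suite above is packaged with the handler; the BASE_URL environment variable is illustrative:

```python
# Sketch: Lambda handler for the "run-smoke-tests" AfterAllowTraffic hook.
# Reports the suite result to CodeDeploy so a failure triggers rollback.
# Assumes run_smoke_tests (above) is packaged alongside; BASE_URL is illustrative.
import asyncio
import os

def hook_status(passed: bool) -> str:
    """CodeDeploy lifecycle hooks expect 'Succeeded' or 'Failed'."""
    return 'Succeeded' if passed else 'Failed'

def handler(event, context):
    import boto3  # imported lazily so hook_status is testable offline

    passed = asyncio.run(run_smoke_tests(os.environ['BASE_URL']))
    boto3.client('codedeploy').put_lifecycle_event_hook_execution_status(
        deploymentId=event['DeploymentId'],
        lifecycleEventHookExecutionId=event['LifecycleEventHookExecutionId'],
        status=hook_status(passed),
    )
    return hook_status(passed)
```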

Critical Decisions

Decision 1: CI/CD Platform — GitHub Actions vs AWS CodePipeline vs GitLab CI

graph LR
    subgraph "GitHub Actions"
        GA1[Native GitHub integration]
        GA2[Rich marketplace]
        GA3[YAML workflows]
        GA4[Community familiarity]
    end

    subgraph "AWS CodePipeline"
        CP1[Deep AWS integration]
        CP2[IAM native auth]
        CP3[CodeDeploy blue/green]
        CP4[No external dependency]
    end

    subgraph "GitLab CI"
        GL1[Built-in registry]
        GL2[Auto DevOps]
        GL3[Self-hosted runners]
        GL4[Compliance features]
    end
| Criteria (Weight) | GitHub Actions | AWS CodePipeline | GitLab CI |
|---|---|---|---|
| AWS Integration (25%) | 7/10 — via OIDC + actions | 10/10 — native | 6/10 — via CLI/SDK |
| Developer Experience (20%) | 9/10 — most devs know it | 5/10 — clunky console UI | 8/10 — good YAML DX |
| Cost (15%) | 7/10 — free tier + $0.008/min | 8/10 — $1/pipeline/month | 6/10 — runner costs |
| Ecosystem/Marketplace (15%) | 10/10 — largest marketplace | 4/10 — limited actions | 7/10 — decent templates |
| Blue/Green ECS (10%) | 6/10 — needs CodeDeploy anyway | 10/10 — built-in | 5/10 — custom scripts |
| Secrets Management (10%) | 7/10 — GitHub Secrets | 9/10 — IAM + Secrets Manager | 7/10 — CI variables |
| Audit/Compliance (5%) | 7/10 — audit logs | 9/10 — CloudTrail | 8/10 — compliance dashboard |
| Weighted Score | 7.8/10 | 7.7/10 | 6.7/10 |
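The margin between the top two options is narrow enough to be worth sanity-checking. Recomputing the weighted sums from the row scores:

```python
# Recompute the decision-matrix totals from the per-criterion scores.
weights = [0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05]  # sums to 1.0
scores = {
    'GitHub Actions':   [7, 9, 7, 10, 6, 7, 7],
    'AWS CodePipeline': [10, 5, 8, 4, 10, 9, 9],
    'GitLab CI':        [6, 8, 6, 7, 5, 7, 8],
}
totals = {tool: round(sum(w * s for w, s in zip(weights, row)), 2)
          for tool, row in scores.items()}
# totals == {'GitHub Actions': 7.75, 'AWS CodePipeline': 7.65, 'GitLab CI': 6.65}
```

GitHub Actions edges out CodePipeline 7.75 to 7.65 (7.8 vs 7.7 at one decimal) — a thin lead, which is part of why the decision below is a hybrid rather than a winner-takes-all.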

Decision: Hybrid — GitHub Actions for CI + AWS CodeDeploy for CD

Rationale: GitHub Actions provides the best developer experience for the CI phase (lint, test, build, scan). AWS CodeDeploy provides native ECS blue/green deployment that no other tool matches. The hybrid approach gives us the best of both worlds:

  • GitHub Actions: Source trigger, parallel test jobs, Docker build, ECR push
  • AWS CodeDeploy: Blue/green traffic shifting, automatic rollback, health check integration

Why not pure CodePipeline? The team already uses GitHub. CodePipeline's UI is dated, and the 1-2 person DevOps team values developer experience over all-AWS purity. CodePipeline adds a $1/pipeline/month cost for minimal additional value when GitHub Actions already handles CI.

Why not pure GitHub Actions? ECS blue/green deployment requires CodeDeploy regardless. Trying to replicate blue/green via raw aws ecs update-service loses automatic rollback, traffic shifting granularity, and lifecycle hooks.


Decision 2: Deployment Strategy — Blue/Green vs Canary vs Rolling

| Criteria | Blue/Green | Canary | Rolling |
|---|---|---|---|
| Rollback Speed | Instant (swap ALB target) | Fast (stop traffic shift) | Slow (must redeploy) |
| Cost During Deploy | 2x capacity for ~10 min | 1x + small canary | 1x (gradual replace) |
| Risk Exposure | None until traffic shift | Small % exposed early | Gradual exposure |
| Complexity | Medium (CodeDeploy handles) | High (custom metrics) | Low (ECS native) |
| Validation Window | Pre-traffic health check | Real traffic validation | Real traffic per-task |
| Zero Downtime | Yes | Yes | Yes (with min healthy %) |
| Best For | Critical services, fast rollback | ML endpoints, gradual | Non-critical, cost-sensitive |

Decision: Blue/Green for ECS services + Canary-like traffic shifting (10% → 50% → 100%)

flowchart LR
    A[New Version Deployed] --> B[0% Traffic — Health Checks Only]
    B -->|All healthy| C[10% Traffic — 5 min canary]
    C -->|Metrics OK| D[50% Traffic — 3 min]
    D -->|Metrics OK| E[100% Traffic — BLUE drained]

    C -->|Metrics Bad| F[Rollback to BLUE]
    D -->|Metrics Bad| F

Rationale: Pure blue/green gives instant rollback capability (critical for a customer-facing chatbot). Adding graduated traffic shifting (10% → 50% → 100%) gives us canary-like validation on real traffic without the complexity of managing separate canary infrastructure. CodeDeploy supports this natively via TimeBasedLinear deployment config.

Why not pure canary? For a 1-2 person team, maintaining separate canary infrastructure (dedicated canary tasks, canary-specific routing rules, custom metrics aggregation) is operational overhead. Blue/green with traffic shifting achieves 80% of the value at 20% of the complexity.

Why not rolling? Rolling updates cannot be instantly rolled back — if task 5 of 10 is bad, you must wait for a new rolling deployment to replace all tasks. For an AI chatbot where a bad deployment could cause hallucinations or incorrect responses, instant rollback is non-negotiable.


Decision 3: Branching Strategy — Trunk-Based vs GitFlow vs GitHub Flow

| Criteria | Trunk-Based | GitFlow | GitHub Flow |
|---|---|---|---|
| Deploy Frequency | Multiple per day | Weekly/biweekly releases | Multiple per day |
| Branch Complexity | Low (main + short-lived) | High (main, develop, release, hotfix) | Low (main + feature) |
| Feature Flags Needed | Yes (required) | No (version-based) | Optional |
| CI/CD Complexity | Low (one pipeline) | High (multiple pipelines) | Low (one pipeline) |
| Team Size Fit (1-2) | Excellent | Poor (too much overhead) | Good |
| Merge Conflicts | Rare (short branches) | Frequent (long-lived branches) | Moderate |
| Hotfix Speed | Instant (commit to main) | Slow (branch + merge + cherry-pick) | Fast (PR to main) |

Decision: Trunk-Based Development with feature flags

Rationale: With a 1-2 person DevOps team, GitFlow's branch management overhead (maintaining develop, release, hotfix branches) is unjustifiable. Trunk-based development means every merge to main is deployable. Feature flags (via AWS AppConfig — see CD-06) decouple deployment from release, allowing us to ship code that's not yet customer-visible.
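A minimal sketch of the flag check the services perform, using the AWS AppConfig data API; the application/environment/profile identifiers are illustrative, and the full client (with polling and caching) belongs to CD-06:

```python
# Sketch: evaluate a feature flag fetched from AWS AppConfig.
# Identifiers are illustrative; the production client lives in CD-06.
import json

def flag_enabled(config: dict, flag: str) -> bool:
    """AppConfig feature-flag payloads expose an 'enabled' bool per flag."""
    return bool(config.get(flag, {}).get('enabled', False))

def fetch_flags() -> dict:
    import boto3  # imported lazily so flag_enabled is testable offline

    client = boto3.client('appconfigdata')
    session = client.start_configuration_session(
        ApplicationIdentifier='mangaassist-chatbot',
        EnvironmentIdentifier='production',
        ConfigurationProfileIdentifier='feature-flags',
    )
    resp = client.get_latest_configuration(
        ConfigurationToken=session['InitialConfigurationToken']
    )
    return json.loads(resp['Configuration'].read())

# Usage: code is deployed dark, then released by flipping the flag in AppConfig
# if flag_enabled(fetch_flags(), 'new-recommendation-engine'): ...
```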


Tradeoffs

The Debate: Deployment Speed vs Safety Gates

graph TD
    subgraph "Product Manager"
        PM1["Ship features faster"]
        PM2["Daily deploys minimum"]
        PM3["Customers waiting for fixes"]
    end

    subgraph "Architect"
        AR1["Zero production incidents"]
        AR2["Full regression before deploy"]
        AR3["1-hour canary minimum"]
    end

    subgraph "Team Lead"
        TL1["Team doesn't burn out"]
        TL2["Auto-everything, manual-nothing"]
        TL3["Pipeline < 15 min"]
    end

    PM1 ---|"Tension"| AR2
    PM2 ---|"Tension"| AR3
    AR1 ---|"Tension"| TL3
    TL2 ---|"Enables"| PM2
    TL2 ---|"Enables"| AR1

Resolution: Automated Speed with Safety Nets

| Concern | Solution | Compromise |
|---|---|---|
| PM wants daily deploys | Trunk-based + feature flags enable multiple daily deploys | Features are deployed but gated — PM must wait for flag activation |
| Architect wants full regression | Automated test suite runs in < 5 min (not 1-hour manual regression) | Trade comprehensive manual testing for fast automated coverage |
| Architect wants 1-hour canary | 8-minute graduated traffic shift (10% → 50% → 100%) | Shorter canary window but with automated rollback — incidents last minutes not hours |
| Team Lead wants < 15 min pipeline | CI (5 min) + CD (8 min) + smoke tests (2 min) = 15 min total | Tight budget — test suite must stay fast, no room for slow integration tests in main pipeline |
| Team Lead wants zero manual steps | Everything automated — approval only for infrastructure changes (CD-02) | Lose the "human in the loop" safety net for app deploys — rely on monitoring instead |

Key Tradeoff: Pipeline Speed vs Test Coverage

pie title "15-Minute Pipeline Budget Allocation"
    "Lint + Static Analysis" : 1
    "Unit Tests" : 2
    "Integration Tests" : 2
    "Docker Build + Scan" : 3
    "Blue/Green Deploy" : 5
    "Smoke Tests" : 2

What we sacrifice for speed:

  • No end-to-end browser tests in the deploy pipeline (moved to nightly runs)
  • No load/performance tests per deploy (weekly scheduled)
  • No manual QA gate (replaced by automated smoke tests)
  • No multi-region deploy validation (single-region MVP per 14-mvp-vs-future.md)

What we gain:

  • Multiple daily deploys, safe for a 1–2 person team
  • Instant automated rollback on any regression
  • Developer confidence: merge to main = production in 15 minutes
  • No deployment anxiety or "deploy freezes"


Failure Scenarios and Recovery

| Scenario | Detection | Recovery | RTO |
|---|---|---|---|
| Bad code passes tests | Canary 5xx alarm in 2 min | Auto-rollback via CodeDeploy | < 3 min |
| Docker image has CVE | Trivy scan blocks ECR push | Fix dependency, re-push | Pipeline re-run (15 min) |
| ECR push timeout | GitHub Actions retry (3 attempts) | Auto-retry, then fail + notify | < 5 min |
| ECS tasks won't start | Task health check timeout (3 min) | CodeDeploy cancels, keeps BLUE | < 5 min |
| Lambda cold start spike | P95 latency alarm | Keep provisioned concurrency, roll back if persistent | < 2 min |
| API Gateway stage deploy fails | CloudFormation rollback | Previous stage preserved | < 5 min |

Monitoring This Pipeline

| Metric | Source | Alert Threshold |
|---|---|---|
| Deploy success rate | CloudWatch custom metric | < 95% over 7 days |
| Deploy duration P95 | GitHub Actions API | > 20 min |
| Rollback count | CodeDeploy events | > 2/week |
| Mean time to recovery | CloudWatch composite | > 5 min |
| ECR image count | ECR lifecycle policy | > 50 untagged images |
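The custom deployment metrics above (and the acceptance criterion that success rate, duration, and rollback count reach CloudWatch) can be published from the pipeline's final step. A sketch; the namespace and dimension names are illustrative:

```python
# Sketch: publish deployment outcome metrics from the pipeline's final step.
# Namespace and dimension names are illustrative.

def deploy_metrics(service: str, success: bool, duration_s: float) -> list:
    """Build the put_metric_data payload for one deployment."""
    dims = [{'Name': 'Service', 'Value': service}]
    return [
        {'MetricName': 'DeploySuccess', 'Dimensions': dims,
         'Value': 1.0 if success else 0.0, 'Unit': 'Count'},
        {'MetricName': 'DeployDuration', 'Dimensions': dims,
         'Value': duration_s, 'Unit': 'Seconds'},
    ]

def publish(service: str, success: bool, duration_s: float) -> None:
    import boto3  # imported lazily so deploy_metrics is testable offline

    boto3.client('cloudwatch').put_metric_data(
        Namespace='MangaAssist/Deployments',
        MetricData=deploy_metrics(service, success, duration_s),
    )
```

The "deploy success rate < 95% over 7 days" alert then becomes a simple Average-statistic alarm on DeploySuccess.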