CD-01: Application Code Deployment Pipeline
User Story
As a Senior DevOps Engineer on the MangaAssist AI Chatbot team, I want to establish a fully automated CI/CD pipeline for deploying application code to ECS Fargate (baseline) and Lambda (burst overflow), So that every code change is tested, containerized, deployed with zero downtime, and automatically rolled back if health checks fail — enabling the 1–2 person DevOps team to ship multiple times per day with confidence.
Acceptance Criteria
- Every push to main triggers the full pipeline (lint → test → build → deploy)
- Docker images are built, scanned for vulnerabilities, and pushed to ECR
- ECS Fargate services are updated via blue/green deployment with automatic rollback
- Lambda functions are updated with versioning and alias-based traffic shifting
- API Gateway configuration is deployed atomically alongside compute changes
- Pipeline completes in < 15 minutes (commit to production traffic)
- Failed deployments automatically roll back within 2 minutes
- Deployment metrics (success rate, duration, rollback count) are published to CloudWatch
- Slack/PagerDuty notifications on deployment success, failure, and rollback
- Feature flags gate new functionality independent of deployment
High-Level Design
Pipeline Architecture Overview
flowchart LR
subgraph "Source"
A[GitHub Push to main] --> B[Webhook Trigger]
end
subgraph "CI Phase (5 min)"
B --> C[Lint + Static Analysis]
C --> D[Unit Tests + Coverage]
D --> E[Integration Tests]
E --> F[Build Docker Image]
F --> G[ECR Push + Vulnerability Scan]
end
subgraph "CD Phase (8 min)"
G --> H{ECS or Lambda?}
H -->|ECS Services| I[Blue/Green Deploy]
H -->|Lambda Functions| J[Version + Alias Shift]
I --> K[Health Check + Canary]
J --> K
K -->|Pass| L[Full Traffic Shift]
K -->|Fail| M[Automatic Rollback]
end
subgraph "Post-Deploy"
L --> N[Smoke Tests]
N --> O[CloudWatch Metrics]
O --> P[Slack Notification]
end
M --> P
style A fill:#ff9900,color:#000
style L fill:#1B660F,color:#fff
style M fill:#DD344C,color:#fff
Deployment Targets
| Component | Compute | Deployment Strategy | Traffic Shift |
|---|---|---|---|
| Orchestrator Service | ECS Fargate | Blue/Green via CodeDeploy | 10% → 50% → 100% over ~8 min |
| Intent Classifier Proxy | ECS Fargate | Blue/Green via CodeDeploy | Same as orchestrator |
| RAG Service | ECS Fargate | Blue/Green via CodeDeploy | Same as orchestrator |
| Guardrails Service | ECS Fargate | Blue/Green via CodeDeploy | Same as orchestrator |
| Burst Overflow Handler | Lambda | Version + Alias weighted | 5% → 100% over 5 min |
| WebSocket Handler | Lambda | Version + Alias weighted | 5% → 100% over 5 min |
| API Gateway | REST + WebSocket | Stage deployment | Atomic swap |
Low-Level Design
1. Source Stage — Trunk-Based Development
gitGraph
commit id: "feature-A merged"
branch feature-B
commit id: "WIP"
checkout main
commit id: "feature-C merged"
commit id: "hotfix-1"
checkout feature-B
commit id: "ready"
checkout main
merge feature-B id: "feature-B merged"
commit id: "deploy-tag-v1.42"
Branch Strategy: Trunk-based development with short-lived feature branches (< 2 days). All merges to main trigger the pipeline. Feature flags decouple deployment from release.
Trigger Configuration (GitHub Actions):
name: deploy-chatbot
on:
push:
branches: [main]
paths:
- 'src/**'
- 'Dockerfile'
- 'requirements.txt'
- 'package.json'
workflow_dispatch:
inputs:
environment:
type: choice
options: [staging, production]
concurrency:
group: deploy-${{ github.ref }}
cancel-in-progress: false # Never cancel in-flight deploys
2. CI Phase — Build and Test
flowchart TD
A[Checkout Code] --> B[Install Dependencies]
B --> C{Parallel Jobs}
C --> D[Python Lint — ruff + mypy]
C --> E[Unit Tests — pytest]
C --> F[TypeScript Lint — eslint]
C --> G[Security Scan — Bandit + Semgrep]
D --> H{All Pass?}
E --> H
F --> H
G --> H
H -->|Yes| I[Build Docker Image]
H -->|No| J[Fail + Notify]
I --> K[Trivy Vulnerability Scan]
K -->|No Critical/High| L[Push to ECR]
K -->|Critical/High Found| J
L --> M[Immutable tag: sha-abc123]
Docker Build (multi-stage for minimal image size):
# Build stage
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/deps -r requirements.txt
# Runtime stage
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /deps /usr/local/lib/python3.11/site-packages
COPY src/ ./src/
EXPOSE 8080
# slim images ship without curl, so probe the health endpoint with the stdlib
HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"
# pip --target does not put console scripts on PATH, so launch via python -m
CMD ["python", "-m", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]
ECR Push with Immutable Tags:
- name: Build and push to ECR
env:
ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
IMAGE_TAG: sha-${{ github.sha }}
run: |
docker build -t $ECR_REGISTRY/mangaassist-chatbot:$IMAGE_TAG .
docker push $ECR_REGISTRY/mangaassist-chatbot:$IMAGE_TAG
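Immutability is enforced on the repository itself, so a pushed sha-<commit> tag can never be silently overwritten. A minimal sketch, assuming the repository name used above:
import boto3

# Reject any re-push of an existing tag (repository name assumed)
ecr = boto3.client('ecr')
ecr.put_image_tag_mutability(
    repositoryName='mangaassist-chatbot',
    imageTagMutability='IMMUTABLE',
)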
3. CD Phase — ECS Blue/Green Deployment
sequenceDiagram
participant GH as GitHub Actions
participant CD as CodeDeploy
participant ECS as ECS Fargate
participant ALB as ALB
participant CW as CloudWatch
GH->>CD: Create deployment (new task def)
CD->>ECS: Launch GREEN task set
ECS-->>CD: Tasks healthy
CD->>ALB: Route 10% to GREEN
CD->>CW: Start canary monitoring (5 min)
alt Metrics healthy
CD->>ALB: Route 50% to GREEN
CD->>CW: Monitor 3 more min
CD->>ALB: Route 100% to GREEN
CD->>ECS: Drain BLUE task set
CD-->>GH: Deployment SUCCESS
else Metrics degraded
CD->>ALB: Route 100% back to BLUE
CD->>ECS: Terminate GREEN tasks
CD-->>GH: Deployment ROLLED BACK
end
ECS Task Definition Update:
# Update task definition with new image and register a new revision
import boto3

def update_task_definition(family: str, new_image: str) -> str:
ecs = boto3.client('ecs')
# Get current task definition
current = ecs.describe_task_definition(taskDefinition=family)
task_def = current['taskDefinition']
# Update container image
for container in task_def['containerDefinitions']:
if container['name'] == 'chatbot':
container['image'] = new_image
# Register new revision (keep all other config identical)
response = ecs.register_task_definition(
family=family,
containerDefinitions=task_def['containerDefinitions'],
taskRoleArn=task_def['taskRoleArn'],
executionRoleArn=task_def['executionRoleArn'],
networkMode=task_def['networkMode'],
requiresCompatibilities=['FARGATE'],
cpu=task_def['cpu'],
memory=task_def['memory'],
)
return response['taskDefinition']['taskDefinitionArn']
CodeDeploy AppSpec (appspec.yaml):
version: 0.0
Resources:
- TargetService:
Type: AWS::ECS::Service
Properties:
TaskDefinition: <TASK_DEFINITION>
LoadBalancerInfo:
ContainerName: "chatbot"
ContainerPort: 8080
PlatformVersion: "LATEST"
Hooks:
- BeforeAllowTraffic: "validate-deployment"
- AfterAllowTraffic: "run-smoke-tests"
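GitHub Actions hands the release to CodeDeploy by creating a deployment whose revision is the AppSpec above, inlined with the freshly registered task definition. A sketch; the application and deployment-group names are assumptions:
import json
import boto3

def start_blue_green_deployment(task_def_arn: str) -> str:
    cd = boto3.client('codedeploy')
    # Inline AppSpec mirroring appspec.yaml, with the new task definition substituted
    appspec = {
        "version": 0.0,
        "Resources": [{
            "TargetService": {
                "Type": "AWS::ECS::Service",
                "Properties": {
                    "TaskDefinition": task_def_arn,
                    "LoadBalancerInfo": {
                        "ContainerName": "chatbot",
                        "ContainerPort": 8080,
                    },
                },
            },
        }],
    }
    response = cd.create_deployment(
        applicationName='mangaassist-chatbot',     # assumed name
        deploymentGroupName='chatbot-blue-green',  # assumed name
        revision={
            'revisionType': 'AppSpecContent',
            'appSpecContent': {'content': json.dumps(appspec)},
        },
        autoRollbackConfiguration={
            'enabled': True,
            'events': ['DEPLOYMENT_FAILURE', 'DEPLOYMENT_STOP_ON_ALARM'],
        },
    )
    return response['deploymentId']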
4. CD Phase — Lambda Deployment
# Lambda version + alias deployment with weighted traffic
import boto3

def deploy_lambda(function_name: str, s3_bucket: str, s3_key: str) -> str:
    lmb = boto3.client('lambda')
    # Update function code
    lmb.update_function_code(
        FunctionName=function_name,
        S3Bucket=s3_bucket,
        S3Key=s3_key,
    )
    lmb.get_waiter('function_updated_v2').wait(FunctionName=function_name)
    # Publish new version
    version = lmb.publish_version(
        FunctionName=function_name,
        Description=f"Deploy {s3_key}"
    )['Version']
    # Shift 5% of traffic to the new version for the canary. The alias keeps
    # the current stable version as its primary; the new version rides in
    # AdditionalVersionWeights (Lambda rejects a routing config whose extra
    # version equals the primary).
    stable = lmb.get_alias(FunctionName=function_name, Name='live')['FunctionVersion']
    lmb.update_alias(
        FunctionName=function_name,
        Name='live',
        FunctionVersion=stable,
        RoutingConfig={
            'AdditionalVersionWeights': {
                version: 0.05  # 5% canary
            }
        }
    )
    return version
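Once the canary window passes, the alias is promoted to the new version and the weighted routing cleared (the 5% → 100% step in the targets table). A sketch continuing the module above:
import boto3

def promote_lambda(function_name: str, version: str) -> None:
    # Point the alias fully at the new version and clear the canary weights
    lmb = boto3.client('lambda')
    lmb.update_alias(
        FunctionName=function_name,
        Name='live',
        FunctionVersion=version,
        RoutingConfig={'AdditionalVersionWeights': {}},
    )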
5. Health Check and Rollback Logic
Canary Validation CloudWatch Alarms:
| Metric | Threshold | Evaluation Period | Action on Breach |
|---|---|---|---|
| HTTP 5xx rate | > 1% | 2 of 3 minutes | Auto-rollback |
| P95 latency | > 3s | 3 of 5 minutes | Auto-rollback |
| Task health check failures | > 0 | 1 of 1 minute | Auto-rollback |
| Error log rate | > 5x baseline | 2 of 3 minutes | Alert + manual review |
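These are ordinary CloudWatch metric-math alarms registered in the CodeDeploy deployment group's alarm configuration, so a breach stops the deployment and triggers rollback. A sketch of the 5xx-rate alarm; the ALB dimension values are placeholders:
import boto3

def alb_metric(metric_id: str, name: str, target_group: str, lb: str) -> dict:
    # Helper building a one-minute Sum query against ALB target metrics
    return {
        'Id': metric_id,
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/ApplicationELB',
                'MetricName': name,
                'Dimensions': [
                    {'Name': 'TargetGroup', 'Value': target_group},
                    {'Name': 'LoadBalancer', 'Value': lb},
                ],
            },
            'Period': 60,
            'Stat': 'Sum',
        },
        'ReturnData': False,
    }

def create_5xx_canary_alarm(target_group: str, lb: str) -> None:
    cw = boto3.client('cloudwatch')
    cw.put_metric_alarm(
        AlarmName='chatbot-canary-5xx-rate',  # assumed name
        EvaluationPeriods=3,
        DatapointsToAlarm=2,   # 2 of 3 one-minute periods
        Threshold=1.0,         # > 1% of requests
        ComparisonOperator='GreaterThanThreshold',
        TreatMissingData='notBreaching',
        Metrics=[
            {'Id': 'rate', 'Expression': '100 * (errors / requests)',
             'Label': '5xx rate (%)', 'ReturnData': True},
            alb_metric('errors', 'HTTPCode_Target_5XX_Count', target_group, lb),
            alb_metric('requests', 'RequestCount', target_group, lb),
        ],
    )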
Rollback Decision Flow:
flowchart TD
A[Canary Running] --> B{5xx > 1%?}
B -->|Yes| C[IMMEDIATE ROLLBACK]
B -->|No| D{P95 > 3s?}
D -->|Yes| C
D -->|No| E{Health checks failing?}
E -->|Yes| C
E -->|No| F{Error log spike?}
F -->|Yes| G[Alert Team — Hold Deploy]
F -->|No| H[Proceed to Next Traffic %]
C --> I[Route 100% to BLUE]
C --> J[Post to Slack with Root Cause]
C --> K[Create Incident Ticket]
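The Slack step in this flow is a plain incoming-webhook post. A minimal sketch; the webhook URL and message format are placeholders:
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def notify_slack(status: str, deployment_id: str, detail: str = "") -> None:
    # One message per terminal pipeline state: SUCCESS, FAILURE, or ROLLBACK
    payload = {"text": f"Deployment {deployment_id}: {status} {detail}".strip()}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)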
6. Post-Deployment Smoke Tests
# Automated smoke test suite run after each deployment
import httpx
SMOKE_TESTS = [
{
"name": "health_check",
"method": "GET",
"path": "/health",
"expected_status": 200,
},
{
"name": "chat_basic",
"method": "POST",
"path": "/api/v1/chat",
"body": {"message": "What manga do you recommend?", "session_id": "smoke-test"},
"expected_status": 200,
"max_latency_ms": 3000,
},
{
"name": "intent_classification",
"method": "POST",
"path": "/api/v1/classify",
"body": {"message": "I want to return my order"},
"expected_status": 200,
"expected_intent": "order_return",
},
]
async def run_smoke_tests(base_url: str) -> bool:
    async with httpx.AsyncClient(timeout=10.0) as client:
        for test in SMOKE_TESTS:
            response = await client.request(
                test["method"],
                f"{base_url}{test['path']}",
                json=test.get("body"),
            )
            if response.status_code != test["expected_status"]:
                return False
            if "max_latency_ms" in test:
                if response.elapsed.total_seconds() * 1000 > test["max_latency_ms"]:
                    return False
            # Verify classifier output where the test specifies it
            if "expected_intent" in test:
                if response.json().get("intent") != test["expected_intent"]:
                    return False
    return True
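The AfterAllowTraffic hook ("run-smoke-tests" in the AppSpec above) passes or fails on this suite's exit code. A hypothetical entrypoint wiring that up, continuing the module above:
import asyncio
import sys

if __name__ == "__main__":
    # Non-zero exit fails the CodeDeploy lifecycle hook, triggering rollback
    passed = asyncio.run(run_smoke_tests(sys.argv[1]))
    sys.exit(0 if passed else 1)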
Critical Decisions
Decision 1: CI/CD Platform — GitHub Actions vs AWS CodePipeline vs GitLab CI
graph LR
subgraph "GitHub Actions"
GA1[Native GitHub integration]
GA2[Rich marketplace]
GA3[YAML workflows]
GA4[Community familiarity]
end
subgraph "AWS CodePipeline"
CP1[Deep AWS integration]
CP2[IAM native auth]
CP3[CodeDeploy blue/green]
CP4[No external dependency]
end
subgraph "GitLab CI"
GL1[Built-in registry]
GL2[Auto DevOps]
GL3[Self-hosted runners]
GL4[Compliance features]
end
| Criteria (Weight) | GitHub Actions | AWS CodePipeline | GitLab CI |
|---|---|---|---|
| AWS Integration (25%) | 7/10 — via OIDC + actions | 10/10 — native | 6/10 — via CLI/SDK |
| Developer Experience (20%) | 9/10 — most devs know it | 5/10 — clunky console UI | 8/10 — good YAML DX |
| Cost (15%) | 7/10 — free tier + $0.008/min | 8/10 — $1/pipeline/month | 6/10 — runner costs |
| Ecosystem/Marketplace (15%) | 10/10 — largest marketplace | 4/10 — limited actions | 7/10 — decent templates |
| Blue/Green ECS (10%) | 6/10 — needs CodeDeploy anyway | 10/10 — built-in | 5/10 — custom scripts |
| Secrets Management (10%) | 7/10 — GitHub Secrets | 9/10 — IAM + Secrets Manager | 7/10 — CI variables |
| Audit/Compliance (5%) | 7/10 — audit logs | 9/10 — CloudTrail | 8/10 — compliance dashboard |
| Weighted Score | 7.8/10 | 7.7/10 | 6.7/10 |
Decision: Hybrid — GitHub Actions for CI + AWS CodeDeploy for CD
Rationale: GitHub Actions provides the best developer experience for the CI phase (lint, test, build, scan). AWS CodeDeploy provides native ECS blue/green deployment that no other tool matches. The hybrid approach gives us the best of both worlds:
- GitHub Actions: source trigger, parallel test jobs, Docker build, ECR push
- AWS CodeDeploy: blue/green traffic shifting, automatic rollback, health check integration
Why not pure CodePipeline? The team already uses GitHub. CodePipeline's UI is dated, and the 1-2 person DevOps team values developer experience over all-AWS purity. CodePipeline adds a $1/pipeline/month cost for minimal additional value when GitHub Actions already handles CI.
Why not pure GitHub Actions? ECS blue/green deployment requires CodeDeploy regardless. Trying to replicate blue/green via raw aws ecs update-service loses automatic rollback, traffic shifting granularity, and lifecycle hooks.
Decision 2: Deployment Strategy — Blue/Green vs Canary vs Rolling
| Criteria | Blue/Green | Canary | Rolling |
|---|---|---|---|
| Rollback Speed | Instant (swap ALB target) | Fast (stop traffic shift) | Slow (must redeploy) |
| Cost During Deploy | 2x capacity for ~10 min | 1x + small canary | 1x (gradual replace) |
| Risk Exposure | None until traffic shift | Small % exposed early | Gradual exposure |
| Complexity | Medium (CodeDeploy handles) | High (custom metrics) | Low (ECS native) |
| Validation Window | Pre-traffic health check | Real traffic validation | Real traffic per-task |
| Zero Downtime | Yes | Yes | Yes (with min healthy %) |
| Best For | Critical services, fast rollback | ML endpoints, gradual | Non-critical, cost-sensitive |
Decision: Blue/Green for ECS services + Canary-like traffic shifting (10% → 50% → 100%)
flowchart LR
A[New Version Deployed] --> B[0% Traffic — Health Checks Only]
B -->|All healthy| C[10% Traffic — 5 min canary]
C -->|Metrics OK| D[50% Traffic — 3 min]
D -->|Metrics OK| E[100% Traffic — BLUE drained]
C -->|Metrics Bad| F[Rollback to BLUE]
D -->|Metrics Bad| F
Rationale: Pure blue/green gives instant rollback capability (critical for a customer-facing chatbot). Adding graduated traffic shifting (10% → 50% → 100%) gives us canary-like validation on real traffic without the complexity of managing separate canary infrastructure. CodeDeploy supports graduated shifting natively through its TimeBasedCanary and TimeBasedLinear traffic-routing configurations, as sketched below.
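A sketch of registering such a configuration; since no predefined config matches the exact 10% → 50% → 100% curve, this approximates the first hold with TimeBasedCanary (names are illustrative):
import boto3

cd = boto3.client('codedeploy')
# 10% of live traffic for 5 minutes, then the remainder shifts at once
cd.create_deployment_config(
    deploymentConfigName='Chatbot10PercentCanary5Minutes',  # illustrative name
    computePlatform='ECS',
    trafficRoutingConfig={
        'type': 'TimeBasedCanary',
        'timeBasedCanary': {
            'canaryPercentage': 10,  # first shift: 10% of live traffic
            'canaryInterval': 5,     # minutes before shifting the rest
        },
    },
)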
Why not pure canary? For a 1-2 person team, maintaining separate canary infrastructure (dedicated canary tasks, canary-specific routing rules, custom metrics aggregation) is operational overhead. Blue/green with traffic shifting achieves 80% of the value at 20% of the complexity.
Why not rolling? Rolling updates cannot be instantly rolled back — if task 5 of 10 is bad, you must wait for a new rolling deployment to replace all tasks. For an AI chatbot where a bad deployment could cause hallucinations or incorrect responses, instant rollback is non-negotiable.
Decision 3: Branching Strategy — Trunk-Based vs GitFlow vs GitHub Flow
| Criteria | Trunk-Based | GitFlow | GitHub Flow |
|---|---|---|---|
| Deploy Frequency | Multiple per day | Weekly/biweekly releases | Multiple per day |
| Branch Complexity | Low (main + short-lived) | High (main, develop, release, hotfix) | Low (main + feature) |
| Feature Flags Needed | Yes (required) | No (version-based) | Optional |
| CI/CD Complexity | Low (one pipeline) | High (multiple pipelines) | Low (one pipeline) |
| Team Size Fit (1-2) | Excellent | Poor (too much overhead) | Good |
| Merge Conflicts | Rare (short branches) | Frequent (long-lived branches) | Moderate |
| Hotfix Speed | Instant (commit to main) | Slow (branch + merge + cherry-pick) | Fast (PR to main) |
Decision: Trunk-Based Development with feature flags
Rationale: With a 1-2 person DevOps team, GitFlow's branch management overhead (maintaining develop, release, hotfix branches) is unjustifiable. Trunk-based development means every merge to main is deployable. Feature flags (via AWS AppConfig — see CD-06) decouple deployment from release, allowing us to ship code that's not yet customer-visible.
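A minimal sketch of a runtime flag check against AWS AppConfig; the application, environment, and profile identifiers are placeholders (CD-06 covers the flag store itself):
import json
import boto3

appconfig = boto3.client('appconfigdata')

def is_flag_enabled(flag_name: str) -> bool:
    # Identifiers below are placeholders; production code would cache the
    # session token between polls rather than starting a session per check
    session = appconfig.start_configuration_session(
        ApplicationIdentifier='mangaassist-chatbot',
        EnvironmentIdentifier='production',
        ConfigurationProfileIdentifier='feature-flags',
    )
    config = appconfig.get_latest_configuration(
        ConfigurationToken=session['InitialConfigurationToken'],
    )
    flags = json.loads(config['Configuration'].read())
    return flags.get(flag_name, {}).get('enabled', False)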
Tradeoffs
The Debate: Deployment Speed vs Safety Gates
graph TD
subgraph "Product Manager"
PM1["Ship features faster"]
PM2["Daily deploys minimum"]
PM3["Customers waiting for fixes"]
end
subgraph "Architect"
AR1["Zero production incidents"]
AR2["Full regression before deploy"]
AR3["1-hour canary minimum"]
end
subgraph "Team Lead"
TL1["Team doesn't burn out"]
TL2["Auto-everything, manual-nothing"]
TL3["Pipeline < 15 min"]
end
PM1 ---|"Tension"| AR2
PM2 ---|"Tension"| AR3
AR1 ---|"Tension"| TL3
TL2 ---|"Enables"| PM2
TL2 ---|"Enables"| AR1
Resolution: Automated Speed with Safety Nets
| Concern | Solution | Compromise |
|---|---|---|
| PM wants daily deploys | Trunk-based + feature flags enable multiple daily deploys | Features are deployed but gated — PM must wait for flag activation |
| Architect wants full regression | Automated test suite runs in < 5 min (not 1-hour manual regression) | Trade comprehensive manual testing for fast automated coverage |
| Architect wants 1-hour canary | 8-minute graduated traffic shift (10% → 50% → 100%) | Shorter canary window but with automated rollback — incidents last minutes not hours |
| Team Lead wants < 15 min pipeline | CI (5 min) + CD (8 min) + smoke tests (2 min) = 15 min total | Tight budget — test suite must stay fast, no room for slow integration tests in main pipeline |
| Team Lead wants zero manual steps | Everything automated — approval only for infrastructure changes (CD-02) | Lose the "human in the loop" safety net for app deploys — rely on monitoring instead |
Key Tradeoff: Pipeline Speed vs Test Coverage
pie title "15-Minute Pipeline Budget Allocation"
"Lint + Static Analysis" : 1
"Unit Tests" : 2
"Integration Tests" : 2
"Docker Build + Scan" : 3
"Blue/Green Deploy" : 5
"Smoke Tests" : 2
What we sacrifice for speed:
- No end-to-end browser tests in the deploy pipeline (moved to nightly runs)
- No load/performance tests per deploy (weekly scheduled)
- No manual QA gate (replaced by automated smoke tests)
- No multi-region deploy validation (single-region MVP per 14-mvp-vs-future.md)
What we gain:
- Multiple daily deploys safe for a 1-2 person team
- Instant automated rollback on any regression
- Developer confidence: merge to main = production in 15 minutes
- No deployment anxiety or "deploy freezes"
Failure Scenarios and Recovery
| Scenario | Detection | Recovery | RTO |
|---|---|---|---|
| Bad code passes tests | Canary 5xx alarm in 2 min | Auto-rollback via CodeDeploy | < 3 min |
| Docker image has CVE | Trivy scan blocks ECR push | Fix dependency, re-push | Pipeline re-run (15 min) |
| ECR push timeout | GitHub Actions retry (3 attempts) | Auto-retry, then fail + notify | < 5 min |
| ECS tasks won't start | Task health check timeout (3 min) | CodeDeploy cancels, keeps BLUE | < 5 min |
| Lambda cold start spike | P95 latency alarm | Keep provisioned concurrency, rollback if persistent | < 2 min |
| API Gateway stage deploy fails | CloudFormation rollback | Previous stage preserved | < 5 min |
Monitoring This Pipeline
| Metric | Source | Alert Threshold |
|---|---|---|
| Deploy success rate | CloudWatch custom metric | < 95% over 7 days |
| Deploy duration P95 | GitHub Actions API | > 20 min |
| Rollback count | CodeDeploy events | > 2/week |
| Mean time to recovery | CloudWatch composite | > 5 min |
| ECR image count | ECR lifecycle policy | > 50 untagged images |
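The success-rate and duration metrics in this table are custom metrics published at the end of each pipeline run. A sketch, with the namespace assumed:
import boto3

def publish_deploy_metrics(succeeded: bool, duration_s: float) -> None:
    # One datapoint per pipeline run; the success-rate and duration alarms
    # above are computed from these raw metrics
    cw = boto3.client('cloudwatch')
    cw.put_metric_data(
        Namespace='MangaAssist/Deployments',  # assumed namespace
        MetricData=[
            {'MetricName': 'DeploymentSuccess',
             'Value': 1.0 if succeeded else 0.0, 'Unit': 'Count'},
            {'MetricName': 'DeploymentDuration',
             'Value': duration_s, 'Unit': 'Seconds'},
        ],
    )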