
US-04: Compute Cost Optimization (ECS Fargate + Lambda)

User Story

As a DevOps engineer, I want to right-size ECS Fargate tasks, use Fargate Spot for non-critical workloads, and optimize the Lambda burst-worker configuration, so that compute costs decrease by 35-55% while P99 latency SLAs are maintained.

Acceptance Criteria

  • ECS task CPU/memory is right-sized based on production utilization metrics.
  • Fargate Spot is used for at least 30% of Orchestrator tasks during non-peak hours.
  • Auto-scaling policies respond within 2 minutes to traffic spikes.
  • Lambda burst workers use provisioned concurrency only during peak hours.
  • WebSocket handler tasks use ARM64 (Graviton) for 20% cost savings.
  • Total compute costs decrease by 35-55%.

High-Level Design

Cost Problem

The deployment architecture (HLD) specifies:

  • Orchestrator: 10-100 ECS Fargate tasks (auto-scaling)
  • WebSocket Handler: sticky-session tasks
  • Lambda burst workers: up to 1,000 concurrent

At steady state with 30 Orchestrator tasks running 24/7:

  • Fargate tasks: 30 × 2 vCPU × $0.04048/hr + 30 × 4 GB × $0.004445/hr ≈ $2.96/hr ≈ $2,162/month
  • Lambda (burst): ~500K invocations/day × $0.0000002/request × 30 days ≈ $3/month in request charges, plus duration charges
  • Baseline compute (including Multi-AZ overhead, ALB, and scale-out bursts): ~$3,500-4,500/month
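
The steady-state figure follows directly from the published Fargate rates. A quick sketch (rates hardcoded from the figures above; verify against the current Fargate pricing page before relying on the totals):

```python
def fargate_monthly_cost(
    tasks: int,
    vcpu: float,
    mem_gb: float,
    hours: float = 730.0,          # average hours per month
    vcpu_rate: float = 0.04048,    # USD per vCPU-hour (x86, ap-northeast-1)
    gb_rate: float = 0.004445,     # USD per GB-hour
) -> float:
    """Steady-state monthly Fargate cost for a fleet of identical tasks."""
    hourly = tasks * (vcpu * vcpu_rate + mem_gb * gb_rate)
    return hourly * hours


# Current fleet: 30 tasks at 2 vCPU / 4 GB
baseline = fargate_monthly_cost(tasks=30, vcpu=2, mem_gb=4)
# Right-sized fleet: 30 tasks at 1 vCPU / 2 GB
rightsized = fargate_monthly_cost(tasks=30, vcpu=1, mem_gb=2)

print(f"baseline:    ${baseline:,.0f}/month")    # ≈ $2,162
print(f"right-sized: ${rightsized:,.0f}/month")  # ≈ $1,081
```

The model covers task-hours only; ALB, NAT, and logging costs sit on top of it.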

Optimization Architecture

graph TD
    subgraph "Peak Hours (9am-11pm JST)"
        A[ALB] --> B[Orchestrator<br>20-60 tasks<br>On-Demand]
        A --> C[WebSocket Handler<br>5-15 tasks<br>Graviton ARM64]
    end

    subgraph "Off-Peak Hours (11pm-9am JST)"
        D[ALB] --> E[Orchestrator<br>5-15 tasks<br>70% Fargate Spot]
        D --> F[WebSocket Handler<br>2-5 tasks<br>Graviton ARM64]
    end

    subgraph "Burst Overflow"
        B --> G[Lambda Burst Workers<br>Provisioned: peak only]
        E --> H[Lambda Burst Workers<br>On-Demand only]
    end

    style E fill:#2d8,stroke:#333
    style F fill:#2d8,stroke:#333
    style G fill:#fd2,stroke:#333

Savings Breakdown

| Technique | Savings | Monthly Impact |
| --- | --- | --- |
| Right-sizing (2 vCPU/4 GB → 1 vCPU/2 GB) | ~40% per task | ~$1,180 |
| Fargate Spot (off-peak, 70% of tasks) | ~70% per Spot task | ~$430 |
| Graviton ARM64 (WebSocket handler) | ~20% per task | ~$120 |
| Lambda provisioned concurrency (peak only) | ~60% of Lambda spend | ~$54 |
| Auto-scaling tightening | ~15% over-provisioning removed | ~$220 |
| Total | | ~$2,004/month (45%) |

Low-Level Design

1. Task Right-Sizing

The Orchestrator coordinates work but does not do heavy compute — most CPU time is spent waiting for downstream responses.

graph TD
    A[Current: 2 vCPU / 4 GB<br>$98/month per task] --> B[Profile Production<br>CPU + Memory Utilization]
    B --> C{Avg CPU < 30%?<br>Avg Mem < 40%?}
    C -->|Yes| D[Right-size to<br>1 vCPU / 2 GB<br>$49/month per task]
    C -->|No| E[Keep current sizing]
    D --> F[Run canary for 1 week]
    F --> G{P99 latency<br>within SLA?}
    G -->|Yes| H[Roll out to all tasks]
    G -->|No| I[Try 1 vCPU / 3 GB]
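
The decision in the flow above reduces to a threshold check. A minimal sketch (the 30%/40% thresholds and the two task sizes come from the diagram; the helper and type names are illustrative):

```python
from dataclasses import dataclass


@dataclass
class UtilizationProfile:
    avg_cpu_pct: float   # average CPU utilization over the profiling window
    avg_mem_pct: float   # average memory utilization over the same window


def recommend_task_size(
    profile: UtilizationProfile,
    cpu_threshold: float = 30.0,
    mem_threshold: float = 40.0,
) -> tuple[str, str]:
    """Return (cpu, memory) in ECS task-definition units per the flow above."""
    if profile.avg_cpu_pct < cpu_threshold and profile.avg_mem_pct < mem_threshold:
        return ("1024", "2048")   # right-size to 1 vCPU / 2 GB, then canary
    return ("2048", "4096")       # keep the current sizing
```

A recommendation is only a starting point; the canary week decides whether it sticks.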

Code Example: ECS Task Definition (Right-Sized)

import boto3


def create_optimized_task_definition() -> dict:
    """Create a right-sized ECS task definition for the Orchestrator."""

    ecs = boto3.client("ecs")

    response = ecs.register_task_definition(
        family="manga-orchestrator",
        networkMode="awsvpc",
        requiresCompatibilities=["FARGATE"],
        runtimePlatform={
            "cpuArchitecture": "X86_64",
            "operatingSystemFamily": "LINUX",
        },
        cpu="1024",       # 1 vCPU (down from 2)
        memory="2048",    # 2 GB (down from 4)
        containerDefinitions=[
            {
                "name": "orchestrator",
                "image": "123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/"
                         "manga-orchestrator:latest",
                "portMappings": [
                    {"containerPort": 8080, "protocol": "tcp"}
                ],
                "environment": [
                    {"name": "JAVA_OPTS", "value": "-Xmx1536m -XX:+UseG1GC"},
                    {"name": "MAX_CONCURRENT_REQUESTS", "value": "50"},
                ],
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        "awslogs-group": "/ecs/manga-orchestrator",
                        "awslogs-region": "ap-northeast-1",
                        "awslogs-stream-prefix": "ecs",
                    },
                },
                "healthCheck": {
                    "command": ["CMD-SHELL", "curl -f http://localhost:8080/health"],
                    "interval": 30,
                    "timeout": 5,
                    "retries": 3,
                },
            }
        ],
    )
    return response["taskDefinition"]


def create_graviton_ws_task_definition() -> dict:
    """WebSocket handler on Graviton ARM64 — 20% cheaper."""

    ecs = boto3.client("ecs")

    response = ecs.register_task_definition(
        family="manga-websocket-handler",
        networkMode="awsvpc",
        requiresCompatibilities=["FARGATE"],
        runtimePlatform={
            "cpuArchitecture": "ARM64",     # Graviton — 20% savings
            "operatingSystemFamily": "LINUX",
        },
        cpu="512",        # 0.5 vCPU is enough for WebSocket relay
        memory="1024",    # 1 GB
        containerDefinitions=[
            {
                "name": "websocket-handler",
                "image": "123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/"
                         "manga-ws-handler:latest-arm64",
                "portMappings": [
                    {"containerPort": 8081, "protocol": "tcp"}
                ],
                "environment": [
                    {"name": "MAX_CONNECTIONS", "value": "2000"},
                    {"name": "HEARTBEAT_INTERVAL_SEC", "value": "30"},
                    {"name": "IDLE_TIMEOUT_SEC", "value": "300"},
                ],
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        "awslogs-group": "/ecs/manga-websocket",
                        "awslogs-region": "ap-northeast-1",
                        "awslogs-stream-prefix": "ecs",
                    },
                },
            }
        ],
    )
    return response["taskDefinition"]

2. Fargate Spot for Off-Peak

graph LR
    subgraph "Capacity Provider Strategy"
        A[Off-Peak<br>11pm-9am] --> B[FARGATE_SPOT: 70%<br>FARGATE: 30%]
        C[Peak<br>9am-11pm] --> D[FARGATE: 100%]
    end

    subgraph "Spot Interruption Handling"
        E[Spot Interruption<br>2-min warning] --> F[Drain Connections]
        F --> G[Finish In-Flight<br>Requests]
        G --> H[Graceful Shutdown]
        H --> I[New Task Starts<br>on On-Demand]
    end

Code Example: Capacity Provider Strategy

import boto3


def configure_capacity_providers(cluster_name: str = "manga-chatbot") -> None:
    """Set up Fargate Spot capacity provider strategy."""

    ecs = boto3.client("ecs")

    # Update cluster with both capacity providers
    ecs.put_cluster_capacity_providers(
        cluster=cluster_name,
        capacityProviders=["FARGATE", "FARGATE_SPOT"],
        defaultCapacityProviderStrategy=[
            {"capacityProvider": "FARGATE", "weight": 100, "base": 5},
        ],
    )


def update_service_for_off_peak(
    cluster_name: str = "manga-chatbot",
    service_name: str = "orchestrator",
) -> None:
    """Switch to Spot-heavy strategy during off-peak hours."""

    ecs = boto3.client("ecs")

    ecs.update_service(
        cluster=cluster_name,
        service=service_name,
        capacityProviderStrategy=[
            # Keep 30% on-demand as a base
            {"capacityProvider": "FARGATE", "weight": 30, "base": 3},
            # 70% on Spot for off-peak savings
            {"capacityProvider": "FARGATE_SPOT", "weight": 70, "base": 0},
        ],
    )


def update_service_for_peak(
    cluster_name: str = "manga-chatbot",
    service_name: str = "orchestrator",
) -> None:
    """Switch to all on-demand during peak hours for reliability."""

    ecs = boto3.client("ecs")

    ecs.update_service(
        cluster=cluster_name,
        service=service_name,
        capacityProviderStrategy=[
            {"capacityProvider": "FARGATE", "weight": 100, "base": 10},
        ],
    )
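
Something still has to invoke these two helpers on schedule. One option (an assumption, not specified in the story) is an EventBridge-triggered Lambda that classifies the current hour in JST and picks a strategy:

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def is_peak(now_utc: datetime) -> bool:
    """Peak window is 9am-11pm JST, matching the strategy above."""
    return 9 <= now_utc.astimezone(JST).hour < 23


def select_strategy(now_utc: datetime) -> list[dict]:
    """Pick the capacity provider strategy for ecs.update_service()."""
    if is_peak(now_utc):
        return [{"capacityProvider": "FARGATE", "weight": 100, "base": 10}]
    return [
        {"capacityProvider": "FARGATE", "weight": 30, "base": 3},
        {"capacityProvider": "FARGATE_SPOT", "weight": 70, "base": 0},
    ]
```

The returned list can be passed straight to update_service(capacityProviderStrategy=...).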

3. Auto-Scaling Policy Optimization

graph TD
    subgraph "Scaling Triggers"
        A[Target Tracking:<br>CPU Utilization 60%]
        B[Target Tracking:<br>Request Count per Target 200]
        C[Step Scaling:<br>Active Connections]
    end

    A --> D{CPU > 60%?}
    D -->|Yes, 2 min| E[Scale Out +2 tasks]
    D -->|No, CPU < 30%<br>for 10 min| F[Scale In -1 task]

    B --> G{Requests > 200/target?}
    G -->|Yes| E

    subgraph "Schedule-Based"
        H[8:45am JST] --> I[Pre-scale to 20 tasks]
        J[11:00pm JST] --> K[Scale down to 5 tasks]
    end

Code Example: Auto-Scaling Configuration

import boto3


def configure_orchestrator_autoscaling(
    cluster_name: str = "manga-chatbot",
    service_name: str = "orchestrator",
) -> None:
    """Configure auto-scaling for the Orchestrator ECS service."""

    aas = boto3.client("application-autoscaling")
    resource_id = f"service/{cluster_name}/{service_name}"

    # Register scalable target
    aas.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=5,
        MaxCapacity=60,
    )

    # CPU-based target tracking
    aas.put_scaling_policy(
        PolicyName="orchestrator-cpu-scaling",
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 60.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
            },
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 120,
        },
    )

    # Request-count-based target tracking
    aas.put_scaling_policy(
        PolicyName="orchestrator-request-scaling",
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 200.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
            },
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    )

    # Scheduled scaling: pre-warm before peak
    aas.put_scheduled_action(
        ServiceNamespace="ecs",
        ScheduledActionName="pre-warm-peak",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        Schedule="cron(45 23 * * ? *)",   # 8:45am JST = 23:45 UTC
        ScalableTargetAction={"MinCapacity": 20, "MaxCapacity": 60},
    )

    # Scheduled scaling: scale down for off-peak
    aas.put_scheduled_action(
        ServiceNamespace="ecs",
        ScheduledActionName="scale-down-off-peak",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        Schedule="cron(0 14 * * ? *)",    # 11pm JST = 14:00 UTC
        ScalableTargetAction={"MinCapacity": 5, "MaxCapacity": 20},
    )

4. Lambda Burst Worker Optimization

graph TD
    A[Lambda Burst Workers] --> B{Time of Day?}
    B -->|Peak 9am-11pm| C[Provisioned Concurrency: 50<br>Eliminates cold starts]
    B -->|Off-Peak| D[On-Demand Only<br>No provisioned concurrency]

    E[Optimization] --> F[Right-size memory<br>Profile and reduce]
    E --> G[ARM64 architecture<br>20% cheaper per ms]
    E --> H[Optimize package size<br>Faster cold starts]

Code Example: Lambda Configuration

import boto3


def configure_burst_worker_lambda() -> dict:
    """Create optimized Lambda function for burst overflow."""

    lambda_client = boto3.client("lambda")

    # Create/update function with ARM64
    response = lambda_client.create_function(
        FunctionName="manga-burst-worker",
        Runtime="python3.12",
        Handler="handler.handle_message",
        Code={"S3Bucket": "manga-deploy", "S3Key": "burst-worker.zip"},
        MemorySize=512,         # Right-sized from profiling (was 1024)
        Timeout=30,
        Architectures=["arm64"],  # Graviton — 20% savings
        Environment={
            "Variables": {
                "REDIS_HOST": "manga-cache.xxxxxx.cache.amazonaws.com",
                "BEDROCK_REGION": "ap-northeast-1",
            }
        },
    )

    return response


def manage_provisioned_concurrency(
    function_name: str = "manga-burst-worker",
    is_peak: bool = True,
) -> None:
    """Toggle provisioned concurrency based on time of day."""

    lambda_client = boto3.client("lambda")

    if is_peak:
        lambda_client.put_provisioned_concurrency_config(
            FunctionName=function_name,
            Qualifier="prod",
            ProvisionedConcurrentExecutions=50,
        )
    else:
        lambda_client.delete_provisioned_concurrency_config(
            FunctionName=function_name,
            Qualifier="prod",
        )

Monitoring and Metrics

| Metric | Target | Alert |
| --- | --- | --- |
| ECS avg CPU utilization | 50-65% | < 25% (over-provisioned) or > 80% |
| ECS avg memory utilization | 50-70% | > 85% |
| Fargate Spot interruption rate | < 5% | > 10% |
| Auto-scaling lag (time to scale) | < 2 min | > 5 min |
| Lambda cold start rate (peak) | < 5% | > 15% |
| P99 request latency | < 3s | > 5s |

Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Fargate Spot interruption during off-peak | Active requests dropped | Graceful drain on SIGTERM; 30% on-demand base; auto-replace within 2 min |
| Under-provisioned tasks after right-sizing | Higher latency, dropped requests | Canary deployment; rollback if P99 > SLA for 5 min |
| Scheduled scaling misaligned with traffic | Over- or under-provisioned | Combine scheduled + reactive scaling; adjust schedule quarterly |
| Lambda cold starts during peak | Increased P99 latency | Provisioned concurrency during peak hours; ARM64 reduces cold start time |

Deep Dive: Why This Works on a Manga Chatbot Workload

Compute optimization on a chatbot is unusual because the workload mixes two cost shapes that are normally optimized separately: stateful long-lived connections (WebSocket sessions for streaming responses) and stateless burst work (Orchestrator request handling). Treating them as one workload — putting both on the same Fargate fleet at the same vCPU/memory ratio — is what causes the over-provisioning the story addresses. The savings break down into four mechanically separable wins.

Property 1: Compute right-sizing is a measurement problem, not a design problem. The original 2 vCPU / 4 GB Fargate task was sized at provisioning time without traffic data. Most chatbot Orchestrator workloads are I/O-bound (waiting on Bedrock, OpenSearch, DynamoDB) and CPU-bound only during prompt assembly and JSON serialization. Profiling typically shows 25–40% CPU utilization at peak — meaning a 1 vCPU / 2 GB task with the same connection budget is functionally equivalent. The 40% right-sizing saving in the story is not "doing more with less"; it is "stopping over-provisioning." The architectural assumption is that headroom for traffic spikes comes from auto-scaling (more tasks), not from over-sizing each task — this is the AWS Compute Optimizer doctrine.

Property 2: Spot is safe for stateless workers, fatal for stateful ones. The Orchestrator request handler is stateless (request → response, no in-memory session). It can run on Spot with graceful drain on SIGTERM and a 70/30 mix during off-peak. The WebSocket connection handler is stateful (open connection, streaming response in flight) — Spot interruption would cut user mid-response. This is why the story splits the workload: WebSocket handlers run on on-demand ARM64 (cheaper than x86 on-demand, but never Spot); Orchestrator workers run on the 70% Spot mix off-peak. The 30% on-demand baseline exists so that Spot interruption events (~5% per AWS-published rates for general-purpose families) cannot strand the system without compute — the on-demand floor absorbs the gap during the ~2 minute Spot replacement window.

Property 3: Graviton ARM64 is a Pareto improvement for the WebSocket workload. The published 20–40% price-perf gain from Graviton (Snap, Twitter, Honeycomb case studies) is real if your workload is portable. WebSocket handlers in this story are pure Python with no native dependencies — they are trivially portable to ARM64. Orchestrator workers, by contrast, may pull in native dependencies (numpy for embedding ops, sentencepiece for tokenization) where ARM64 wheels lag. This is why the story recommends ARM64 for WebSocket only and leaves Orchestrator on x86 — testing-effort-aware optimization, not blanket migration. The failure signal would be missing ARM64 wheels for a new dependency; a CI gate that builds the image on both architectures catches it.

Property 4: Lambda burst workers and provisioned concurrency are a peak-only contract. Lambda is correctly used for burst/spillover work (analytics enqueue, post-processing) where average traffic is low but peak is bursty. Provisioned concurrency eliminates cold starts but bills hourly even when idle — turning it on 24/7 negates the cost win. The story's "provisioned concurrency during peak only" + scheduled scaling of the concurrency value is the standard cost-optimal pattern; turning it off at 11pm JST and on at 8:45am JST captures the cost saving without sacrificing peak-hour latency.

Bottom line: the savings stack additively, not multiplicatively, because the four techniques target different cost components: right-sizing reduces the per-task baseline, Spot reduces the per-task hourly rate (off-peak only), Graviton reduces the per-task hourly rate further (WebSocket only), and provisioned-concurrency scheduling reduces Lambda idle waste. Pulling any one out leaves a measurable 10–20% saving on the table.


Real-World Validation

Industry Benchmarks & Case Studies

  • AWS Compute Optimizer service — Published anonymized data shows 30–50% over-provisioning on average across customer Fargate workloads. The story's 40% right-sizing target sits in the middle of this band.
  • AWS Spot Instance Advisor (public dashboard) — General-purpose Fargate Spot interruption rate sits at < 5% median across most regions for typical task families, with upward spikes of 10–20% during regional capacity crunches. The story's "70% Spot off-peak / 100% on-demand peak" mix matches AWS-published Spot best-practice for stateless workloads.
  • Graviton case studies (Snap, Honeycomb, Twitter, Datadog) — All report 20–40% price-performance improvement after migrating to ARM64. Snap's 2022 re:Invent talk on Graviton WebSocket workloads is the closest analogue to this story's WebSocket workload.
  • AWS Lambda Provisioned Concurrency pricing model — $0.0000041667/GB-second for provisioned + standard invoke pricing on top. The break-even vs on-demand cold-starts is at ~50% of capacity in steady use; below that, on-demand cold-starts win on cost.
  • AWS Fargate ARM64 pricing — 20% discount vs x86 Fargate (validated against current AWS Fargate pricing page). ARM64-on-Fargate is GA since 2022 and supports both Linux distributions used in this story's containers.
  • Internal cross-reference: POC-to-Production-War-Story/02-seven-production-catastrophes.md — The "WebSocket meltdown" catastrophe was caused by under-provisioned WebSocket handlers running on Spot — informs the on-demand-only WebSocket policy here.
  • Internal cross-reference: Optimization-Tradeoffs-User-Stories/ — Covers autoscaling-vs-static-provisioning trade-off curves; this story is the "lean with elastic spike absorption" operating point.
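
The provisioned-concurrency break-even cited above can be derived from the published x86 Lambda rates (hardcoded below as assumptions; verify against the current Lambda pricing page). Note that this pure-cost model lands near 60%; the ~50% figure is a rule of thumb that also values the avoided cold-start latency:

```python
def breakeven_utilization(
    on_demand_gbs: float = 0.0000166667,    # USD/GB-s, x86 on-demand duration
    provisioned_gbs: float = 0.0000097222,  # USD/GB-s, duration with provisioned
    capacity_gbs: float = 0.0000041667,     # USD/GB-s, provisioned-capacity charge
) -> float:
    """Utilization above which provisioned concurrency beats on-demand.

    Solve capacity + u * provisioned = u * on_demand for u.
    """
    return capacity_gbs / (on_demand_gbs - provisioned_gbs)


print(f"break-even utilization ≈ {breakeven_utilization():.0%}")
```

Below that utilization, idle provisioned capacity costs more than occasional cold starts; above it, the discounted duration rate wins.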

Math Validation

  • Fargate on-demand x86: $0.04048/vCPU-hour + $0.004445/GB-hour. Original 2vCPU/4GB task: $0.04048×2 + $0.004445×4 = $0.0987/hr × 730 hr × 30 tasks = ~$2,162/month. Story baseline of "$3.5–4.5K/month for 30 tasks" appears to include Multi-AZ overhead, ALB cost, and burst tasks during scale-out — which would put it at the realistic end of the range.
  • Right-sized 1vCPU/2GB: $0.04048 + $0.008890 = $0.04937/hr — exactly 50% of original; 40% savings claim accounts for not-all-tasks being right-sizable. ✅
  • Fargate Spot discount: ~70% off on-demand for spot tasks. 70% Spot × 70% discount = 49% reduction off-peak — matches story's "Spot 70% off off-peak" ✅.
  • Fargate ARM64 (Graviton): 20% discount vs x86. 1vCPU/2GB ARM64 = $0.03950/hr — about $0.01 cheaper per task-hour. ✅
  • Lambda ARM64 + 512MB: $0.0000133334/GB-second × 0.5 GB = $0.00000667/sec. 100ms typical invocation = $0.000000667 per call. At 1M calls/day = $0.67/day = $20/month — Lambda burst cost is small; the savings here are dominated by the Fargate stack.
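
The per-hour arithmetic above can be checked mechanically (rates copied from the bullets; a sanity script, not an official calculator):

```python
VCPU_RATE = 0.04048   # USD per vCPU-hour (Fargate x86 on-demand)
GB_RATE = 0.004445    # USD per GB-hour

orig_hr = 2 * VCPU_RATE + 4 * GB_RATE    # 2 vCPU / 4 GB task
small_hr = 1 * VCPU_RATE + 2 * GB_RATE   # 1 vCPU / 2 GB task
arm_hr = small_hr * 0.80                 # Graviton: 20% off the x86 rate

assert abs(orig_hr - 0.0987) < 0.0001         # matches the $0.0987/hr figure
assert abs(small_hr / orig_hr - 0.50) < 0.01  # right-sizing halves the rate
assert abs(arm_hr - 0.03950) < 0.0001         # matches the ARM64 figure
```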

Conservative vs Aggressive Savings Bounds

| Bound | Assumptions | Total monthly savings |
| --- | --- | --- |
| Conservative | Right-sizing only; no Spot, no ARM64 | ~30% (~$1,200/month) |
| Aggressive | Full Spot + ARM64 + scheduled scale-down + Lambda burst migration of ~40% of peak load | ~60% (~$2,500/month) |
| Story projection | Realistic for a 30-task baseline; depends on Spot interruption rate and ARM64 portability of all dependencies | 35-55% (~$1,500-$2,200/month) |

Cross-Story Interactions & Conflicts

  • US-08 (Traffic-Based) — Authoritative side: US-08. Conflict mode: auto-scaling (this story) responds to load by adding tasks; degradation (US-08) responds to load by shedding requests. If both fire on the same trigger window, you get oscillation: degradation cuts demand → CPU drops → auto-scaler scales in → next traffic burst hits with no headroom → degradation fires again. Resolution: US-08 emits a degradation_active=true signal; the auto-scaler suspends scale-in (but allows scale-out) for the duration. This is a one-way coupling — degradation can prevent scale-in, scale-in cannot prevent degradation.
  • US-06 (RAG) — Authoritative side: US-06 owns OpenSearch capacity. Conflict mode: when this story's auto-scaler scales Fargate to zero (or near-zero) overnight, OpenSearch Serverless still bills the 4-OCU minimum (~$23/day = ~$691/month). Compute-tier cost goes to zero overnight, but the OpenSearch tier doesn't follow. Resolution: do not assume OpenSearch capacity tracks Fargate capacity; US-06's batch indexing window (2am–4am JST) intentionally uses the otherwise-idle OpenSearch capacity overnight, so total OpenSearch utility is preserved.
  • US-01 (LLM Tokens) — Indirect interaction. Conflict mode: if Fargate workers are right-sized too aggressively (1vCPU/2GB), prompt assembly and tokenization for large prompts may become CPU-bound and increase per-request latency. Right-sizing must measure CPU utilization during peak prompt-compression activity (US-01's compressor), not just during typical request handling.
  • US-05 (DynamoDB) — Indirect interaction. The DDB write buffer in US-05 (100 ms flush interval) lives in-process on these Fargate tasks. Spot interruption could drop up to 100 ms of buffered writes; US-05's DLQ pattern absorbs this, but Spot interruption rate (5%) × 30 tasks × buffer drops is the real loss-tolerance budget — measure it.

Rollback & Experimentation

Shadow-Mode Plan

  • Right-sizing: deploy a parallel ECS service at 1 vCPU / 2 GB with 5% traffic shadow-routed; measure CPU/memory utilization and p99 latency for 1 week before rolling out.
  • Spot mix: enable 30% Spot off-peak first, observe interruption rate for 2 weeks, then ramp to 70% if interruption rate stays < 5%.
  • ARM64 WebSocket: deploy ARM64 service at 10% traffic for 1 week; compare connection error rate, message throughput, and per-task cost.

Canary Thresholds

  • Right-sizing: 10% → 25% → 50% → 100% over 2 weeks; abort if p99 latency rises > 15% or CPU sustained > 75%.
  • Spot ramp: 30% → 50% → 70% over 3 weeks; abort if interruption-induced 5xx rate > 0.5%.
  • Provisioned concurrency schedule: deploy with the current 24/7 schedule, then drop to peak-only after observing 1 full traffic cycle.
  • Abort criteria (any one trips): p99 latency rise > 15%, request error rate > 1%, Spot interruption rate > 10%, ARM64-induced runtime errors detected.
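
The abort criteria reduce to a single any-trips predicate. A sketch with illustrative names, thresholds taken from the bullet above:

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    p99_latency_rise_pct: float        # p99 rise vs control, e.g. 12.0 = +12%
    error_rate_pct: float
    spot_interruption_rate_pct: float
    arm64_runtime_errors: int


def should_abort(m: CanaryMetrics) -> bool:
    """Abort the canary if any single criterion trips."""
    return (
        m.p99_latency_rise_pct > 15.0
        or m.error_rate_pct > 1.0
        or m.spot_interruption_rate_pct > 10.0
        or m.arm64_runtime_errors > 0
    )
```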

Kill Switch

  • Three independent flags: compute_rightsize_enabled (reverts to 2vCPU/4GB tasks), compute_spot_enabled (reverts to 100% on-demand), compute_arm64_enabled (reverts WebSocket handlers to x86). Each flag flips a CDK / Terraform toggle that triggers a rolling redeployment within ~10 minutes.

Quality Regression Criteria (story-specific)

  • P99 request latency: ≤ 3.0 s (matches story metric line 437); breach for > 10 min triggers investigation.
  • Spot interruption-driven 5xx rate: ≤ 0.5% (above this, drop to 50% Spot mix).
  • ARM64 build failures (CI): 0 (any failure blocks deployment).
  • Auto-scaling oscillation rate (scale-out + scale-in within same 5-min window): ≤ 2 events/day.

Multi-Reviewer Validation Findings & Resolutions

The cross-reviewer pass identified the following story-specific findings. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.

S1 (must-fix before production)

Spot interruption + DDB write buffer = silent data loss. US-04 runs 70% Spot off-peak; US-05 runs a 100ms in-process DDB write buffer on those tasks. Spot interruption with default SIGTERM grace can drop up to 100ms of buffered writes. At 5% interruption rate × 30 tasks × 100ms × peak QPS, expect ~0.5–1% of turns lost per off-peak day. Resolution: SIGTERM handler must (a) stop accepting new buffered writes immediately, (b) flush remaining buffer synchronously with a hard 20s timeout, (c) write any remaining items to a DLQ DDB table on timeout, (d) confirm Fargate's 30s graceful-stop window is honored end-to-end. Validated via quarterly chaos drill (force-terminate 10% of tasks; reconcile DLQ vs writes).

Three independent kill-switch flags create incident confusion. compute_rightsize_enabled + compute_spot_enabled + compute_arm64_enabled — at 3am, an SRE has to think about which combination to flip. Resolution: consolidate to a single compute_optimization_enabled flag at the story level; per-technique flags become internal-only (developer testing). Store the flag in the central feature-flag evaluator per README precedence rules.

S2 (fix before scale-up)

Auto-scaling lag (2 min) contradicts the relative canary abort threshold. During the 2-minute scale-out window after a traffic burst, p99 latency can hit 10–15s on already-running tasks before new capacity warms up. The "abort if p99 > +15%" canary criterion will trip on every traffic spike. Resolution: anchor canary aborts to absolute SLA (p99 > 3.0s sustained for > 5 min) per US-04 line 437, not to relative percentage rise. Use scheduled scale-out 5 minutes before known traffic peaks (8:55am JST) to pre-warm.

ARM64 wheel-availability risk acknowledged but unmitigated. Story flags missing wheels but only adds a CI gate. Resolution: add a pre-implementation dependency audit — run pip download --only-binary=:all: --platform manylinux_2_17_aarch64 -r requirements.txt (pip requires --only-binary or --no-deps whenever --platform is set) and confirm zero failures before the ARM64 migration starts. Native dependencies (numpy, sentencepiece, tokenizers) all have aarch64 wheels at current pinned versions; verify per release.

Image scanning baseline must extend to ARM64. ARM64 base images may have different CVE coverage in your scanner. Resolution: CI gate runs scanner on both arm64 and x86_64 images; both must pass before merge. Document the security baseline assumption per architecture.

Baseline cost decomposition opaque. Story claims $3.5–4.5K/month for 30 tasks but math validation derives $2,162 task cost only. The remaining ~$1.5–2K is not broken out. Resolution: explicit baseline decomposition: task vCPU/memory + ALB hour + ALB LCU + NAT Gateway data processing + CloudWatch logs ingestion + ECR storage + KMS calls. NAT Gateway data charges in particular often dominate Fargate egress to Bedrock.

Auto-scaling oscillation with US-08 degradation not bidirectionally captured. US-04 acknowledges the conflict; US-08 must explicitly emit suspend_scale_in=true. Resolution: add an alarm on simultaneous scale-out + degradation events > 2/day; tune US-08 hysteresis if alarm fires.

S3 (acknowledged / future work)

  • Compute Savings Plans (1-year commit on Fargate + Lambda; ~15–30% discount) — FinOps-lead-owned at quarterly review.
  • Multi-region active-active for compute — out of scope.
  • Per-request compute cost attribution to US-07 (request_id-keyed task-second sampling).

Runbook: Spot Interruption Cascade

Symptoms: > 10% of Orchestrator tasks interrupted within 10 minutes; 5xx error spike on requests that were in-flight on terminated tasks.

Triage (in order):

  1. Validate ALB has unregistered terminated tasks (target group health check; should happen within 15s).
  2. Check Spot interruption rate via AWS Spot Instance Advisor — is this regional capacity crunch (broad spike) or workload-specific (localized)?
  3. If sustained > 10% for 15 minutes, flip compute_optimization_enabled=false to fall back to 100% on-demand (rolling deploy ~10 min).
  4. Verify DDB DLQ is catching dropped buffer writes; reconcile via the periodic replay job.
  5. If interruption-driven 5xx > 1% sustained, page; this is a customer-impact incident.

Escalation: if region-wide Spot capacity is exhausted, on-demand fallback may also struggle to acquire capacity. Pre-warmed reserved-capacity (Compute Savings Plans) is the longer-term mitigation; flag at next FinOps review.