US-04: Compute Cost Optimization (ECS Fargate + Lambda)
User Story
As a DevOps engineer, I want to right-size ECS Fargate tasks, leverage Fargate Spot for non-critical workloads, and optimize the Lambda burst worker configuration, So that compute costs decrease by 35-55% while maintaining P99 latency SLAs.
Acceptance Criteria
- ECS task CPU/memory is right-sized based on production utilization metrics.
- Fargate Spot is used for at least 30% of Orchestrator tasks during non-peak hours.
- Auto-scaling policies respond within 2 minutes to traffic spikes.
- Lambda burst workers use provisioned concurrency only during peak hours.
- WebSocket handler tasks use ARM64 (Graviton) for 20% cost savings.
- Total compute costs decrease by 35-55%.
High-Level Design
Cost Problem
The deployment architecture (HLD) specifies:
- Orchestrator: 10-100 ECS Fargate tasks (auto-scaling)
- WebSocket Handler: sticky-session tasks
- Lambda burst workers: up to 1,000 concurrent
At steady state with 30 Orchestrator tasks running 24/7:
- Fargate: 30 × 2 vCPU × $0.04048/hr + 30 × 4 GB × $0.004445/hr ≈ $2.96/hr ≈ $2,162/month
- Lambda (burst): ~500K invocations/day (~15M/month) ≈ $3/month in request charges plus duration charges — ~$90/month all-in
- Baseline compute (including Multi-AZ overhead, ALB, and burst tasks during scale-out): ~$3,500-4,500/month
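As a sanity check on these numbers, a minimal cost-model sketch in Python, assuming the ap-northeast-1 x86 Fargate on-demand rates quoted above and a 730-hour month (the helper name is illustrative):
# Fargate fleet cost model using the rates quoted above (USD, ap-northeast-1).
VCPU_HR = 0.04048    # per vCPU-hour, x86 on-demand
GB_HR = 0.004445     # per GB-hour
HOURS_PER_MONTH = 730

def monthly_fleet_cost(vcpu: float, mem_gb: float, tasks: int) -> float:
    """Monthly Fargate cost for a fleet of identical always-on tasks."""
    return (vcpu * VCPU_HR + mem_gb * GB_HR) * HOURS_PER_MONTH * tasks

print(monthly_fleet_cost(2, 4, 30))   # ~2162 — current 2 vCPU / 4 GB fleet
print(monthly_fleet_cost(1, 2, 30))   # ~1081 — right-sized 1 vCPU / 2 GB fleet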
Optimization Architecture
graph TD
subgraph "Peak Hours (9am-11pm JST)"
A[ALB] --> B[Orchestrator<br>20-60 tasks<br>On-Demand]
A --> C[WebSocket Handler<br>5-15 tasks<br>Graviton ARM64]
end
subgraph "Off-Peak Hours (11pm-9am JST)"
D[ALB] --> E[Orchestrator<br>5-15 tasks<br>70% Fargate Spot]
D --> F[WebSocket Handler<br>2-5 tasks<br>Graviton ARM64]
end
subgraph "Burst Overflow"
B --> G[Lambda Burst Workers<br>Provisioned: peak only]
E --> H[Lambda Burst Workers<br>On-Demand only]
end
style E fill:#2d8,stroke:#333
style F fill:#2d8,stroke:#333
style G fill:#fd2,stroke:#333
Savings Breakdown
| Technique | Savings | Monthly Impact |
|---|---|---|
| Right-sizing (2 vCPU/4 GB → 1 vCPU/2 GB) | ~40% per task | ~$1,180 |
| Fargate Spot (off-peak, 70% of tasks) | ~70% per Spot task | ~$430 |
| Graviton ARM64 (WebSocket handler) | ~20% per task | ~$120 |
| Lambda provisioned concurrency (peak only) | ~60% of Lambda spend | ~$54 |
| Auto-scaling tightening | ~15% over-provisioning removed | ~$220 |
| Total | ~45% | ~$2,004/month |
Low-Level Design
1. Task Right-Sizing
The Orchestrator coordinates work but does not do heavy compute — most of its wall-clock time is spent waiting on downstream responses, not burning CPU.
graph TD
A[Current: 2 vCPU / 4 GB<br>$98/month per task] --> B[Profile Production<br>CPU + Memory Utilization]
B --> C{Avg CPU < 30%?<br>Avg Mem < 40%?}
C -->|Yes| D[Right-size to<br>1 vCPU / 2 GB<br>$49/month per task]
C -->|No| E[Keep current sizing]
D --> F[Run canary for 1 week]
F --> G{P99 latency<br>within SLA?}
G -->|Yes| H[Roll out to all tasks]
G -->|No| I[Try 1 vCPU / 3 GB]
Code Example: ECS Task Definition (Right-Sized)
import boto3
def create_optimized_task_definition() -> dict:
"""Create a right-sized ECS task definition for the Orchestrator."""
ecs = boto3.client("ecs")
    response = ecs.register_task_definition(
        family="manga-orchestrator",
        networkMode="awsvpc",
        requiresCompatibilities=["FARGATE"],
        # Required on Fargate for ECR image pulls and awslogs log delivery;
        # the role name here is illustrative
        executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
runtimePlatform={
"cpuArchitecture": "X86_64",
"operatingSystemFamily": "LINUX",
},
cpu="1024", # 1 vCPU (down from 2)
memory="2048", # 2 GB (down from 4)
containerDefinitions=[
{
"name": "orchestrator",
"image": "123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/"
"manga-orchestrator:latest",
"portMappings": [
{"containerPort": 8080, "protocol": "tcp"}
],
"environment": [
{"name": "JAVA_OPTS", "value": "-Xmx1536m -XX:+UseG1GC"},
{"name": "MAX_CONCURRENT_REQUESTS", "value": "50"},
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/manga-orchestrator",
"awslogs-region": "ap-northeast-1",
"awslogs-stream-prefix": "ecs",
},
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:8080/health"],
"interval": 30,
"timeout": 5,
"retries": 3,
},
}
],
)
return response["taskDefinition"]
def create_graviton_ws_task_definition() -> dict:
"""WebSocket handler on Graviton ARM64 — 20% cheaper."""
ecs = boto3.client("ecs")
    response = ecs.register_task_definition(
        family="manga-websocket-handler",
        networkMode="awsvpc",
        requiresCompatibilities=["FARGATE"],
        # Same execution-role requirement as the Orchestrator task definition
        executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
runtimePlatform={
"cpuArchitecture": "ARM64", # Graviton — 20% savings
"operatingSystemFamily": "LINUX",
},
cpu="512", # 0.5 vCPU is enough for WebSocket relay
memory="1024", # 1 GB
containerDefinitions=[
{
"name": "websocket-handler",
"image": "123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/"
"manga-ws-handler:latest-arm64",
"portMappings": [
{"containerPort": 8081, "protocol": "tcp"}
],
"environment": [
{"name": "MAX_CONNECTIONS", "value": "2000"},
{"name": "HEARTBEAT_INTERVAL_SEC", "value": "30"},
{"name": "IDLE_TIMEOUT_SEC", "value": "300"},
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/manga-websocket",
"awslogs-region": "ap-northeast-1",
"awslogs-stream-prefix": "ecs",
},
},
}
],
)
return response["taskDefinition"]
2. Fargate Spot for Off-Peak
graph LR
subgraph "Capacity Provider Strategy"
A[Off-Peak<br>11pm-9am] --> B[FARGATE_SPOT: 70%<br>FARGATE: 30%]
C[Peak<br>9am-11pm] --> D[FARGATE: 100%]
end
subgraph "Spot Interruption Handling"
E[Spot Interruption<br>2-min warning] --> F[Drain Connections]
F --> G[Finish In-Flight<br>Requests]
G --> H[Graceful Shutdown]
H --> I[New Task Starts<br>on On-Demand]
end
Code Example: Capacity Provider Strategy
import boto3
def configure_capacity_providers(cluster_name: str = "manga-chatbot") -> None:
"""Set up Fargate Spot capacity provider strategy."""
ecs = boto3.client("ecs")
# Update cluster with both capacity providers
ecs.put_cluster_capacity_providers(
cluster=cluster_name,
capacityProviders=["FARGATE", "FARGATE_SPOT"],
defaultCapacityProviderStrategy=[
{"capacityProvider": "FARGATE", "weight": 100, "base": 5},
],
)
def update_service_for_off_peak(
cluster_name: str = "manga-chatbot",
service_name: str = "orchestrator",
) -> None:
"""Switch to Spot-heavy strategy during off-peak hours."""
ecs = boto3.client("ecs")
ecs.update_service(
cluster=cluster_name,
service=service_name,
capacityProviderStrategy=[
# Keep 30% on-demand as a base
{"capacityProvider": "FARGATE", "weight": 30, "base": 3},
# 70% on Spot for off-peak savings
{"capacityProvider": "FARGATE_SPOT", "weight": 70, "base": 0},
        ],
        # A changed capacity provider strategy only applies to tasks started
        # in a new deployment, so force one
        forceNewDeployment=True,
    )
def update_service_for_peak(
cluster_name: str = "manga-chatbot",
service_name: str = "orchestrator",
) -> None:
"""Switch to all on-demand during peak hours for reliability."""
ecs = boto3.client("ecs")
ecs.update_service(
cluster=cluster_name,
service=service_name,
capacityProviderStrategy=[
{"capacityProvider": "FARGATE", "weight": 100, "base": 10},
        ],
        forceNewDeployment=True,  # apply the strategy change immediately
    )
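To run these flips on the JST schedule, one option is a pair of EventBridge rules targeting a small operational Lambda that wraps the two functions above — a sketch; the rule names, Lambda ARN, and payload shape are assumptions, and the target Lambda also needs an invoke permission (lambda add_permission) not shown here:
import boto3

def schedule_capacity_switches() -> None:
    """Cron-trigger the peak/off-peak capacity strategy flips (JST schedule)."""
    events = boto3.client("events")
    # 9:00am JST = 00:00 UTC — switch to 100% on-demand for peak
    events.put_rule(Name="orchestrator-peak-capacity",
                    ScheduleExpression="cron(0 0 * * ? *)")
    # 11:00pm JST = 14:00 UTC — switch to the 70/30 Spot mix
    events.put_rule(Name="orchestrator-offpeak-capacity",
                    ScheduleExpression="cron(0 14 * * ? *)")
    for rule, mode in [("orchestrator-peak-capacity", "peak"),
                       ("orchestrator-offpeak-capacity", "off_peak")]:
        events.put_targets(
            Rule=rule,
            Targets=[{
                "Id": "capacity-switcher",
                # Hypothetical operational Lambda that calls
                # update_service_for_peak() / update_service_for_off_peak()
                "Arn": "arn:aws:lambda:ap-northeast-1:123456789012:"
                       "function:capacity-switcher",
                "Input": '{"mode": "%s"}' % mode,
            }],
        )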
3. Auto-Scaling Policy Optimization
graph TD
subgraph "Scaling Triggers"
A[Target Tracking:<br>CPU Utilization 60%]
B[Target Tracking:<br>Request Count per Target 200]
C[Step Scaling:<br>Active Connections]
end
A --> D{CPU > 60%?}
D -->|Yes, 2 min| E[Scale Out +2 tasks]
D -->|No, CPU < 30%<br>for 10 min| F[Scale In -1 task]
B --> G{Requests > 200/target?}
G -->|Yes| E
subgraph "Schedule-Based"
H[8:45am JST] --> I[Pre-scale to 20 tasks]
J[11:00pm JST] --> K[Scale down to 5 tasks]
end
Code Example: Auto-Scaling Configuration
import boto3
def configure_orchestrator_autoscaling(
cluster_name: str = "manga-chatbot",
service_name: str = "orchestrator",
) -> None:
"""Configure auto-scaling for the Orchestrator ECS service."""
aas = boto3.client("application-autoscaling")
resource_id = f"service/{cluster_name}/{service_name}"
# Register scalable target
aas.register_scalable_target(
ServiceNamespace="ecs",
ResourceId=resource_id,
ScalableDimension="ecs:service:DesiredCount",
MinCapacity=5,
MaxCapacity=60,
)
# CPU-based target tracking
aas.put_scaling_policy(
PolicyName="orchestrator-cpu-scaling",
ServiceNamespace="ecs",
ResourceId=resource_id,
ScalableDimension="ecs:service:DesiredCount",
PolicyType="TargetTrackingScaling",
TargetTrackingScalingPolicyConfiguration={
"TargetValue": 60.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization",
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 120,
},
)
# Request-count-based target tracking
aas.put_scaling_policy(
PolicyName="orchestrator-request-scaling",
ServiceNamespace="ecs",
ResourceId=resource_id,
ScalableDimension="ecs:service:DesiredCount",
PolicyType="TargetTrackingScaling",
TargetTrackingScalingPolicyConfiguration={
"TargetValue": 200.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ALBRequestCountPerTarget",
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60,
},
)
# Scheduled scaling: pre-warm before peak
aas.put_scheduled_action(
ServiceNamespace="ecs",
ScheduledActionName="pre-warm-peak",
ResourceId=resource_id,
ScalableDimension="ecs:service:DesiredCount",
Schedule="cron(45 23 * * ? *)", # 8:45am JST = 23:45 UTC
ScalableTargetAction={"MinCapacity": 20, "MaxCapacity": 60},
)
# Scheduled scaling: scale down for off-peak
aas.put_scheduled_action(
ServiceNamespace="ecs",
ScheduledActionName="scale-down-off-peak",
ResourceId=resource_id,
ScalableDimension="ecs:service:DesiredCount",
Schedule="cron(0 14 * * ? *)", # 11pm JST = 14:00 UTC
ScalableTargetAction={"MinCapacity": 5, "MaxCapacity": 20},
)
4. Lambda Burst Worker Optimization
graph TD
A[Lambda Burst Workers] --> B{Time of Day?}
B -->|Peak 9am-11pm| C[Provisioned Concurrency: 50<br>Eliminates cold starts]
B -->|Off-Peak| D[On-Demand Only<br>No provisioned concurrency]
E[Optimization] --> F[Right-size memory<br>Profile and reduce]
E --> G[ARM64 architecture<br>20% cheaper per ms]
E --> H[Optimize package size<br>Faster cold starts]
Code Example: Lambda Configuration
import boto3
def configure_burst_worker_lambda() -> dict:
"""Create optimized Lambda function for burst overflow."""
lambda_client = boto3.client("lambda")
    # Create the function on ARM64 (use update_function_configuration for updates)
    response = lambda_client.create_function(
        FunctionName="manga-burst-worker",
        Runtime="python3.12",
        Handler="handler.handle_message",
        # Role is required by CreateFunction; this ARN is illustrative
        Role="arn:aws:iam::123456789012:role/manga-burst-worker-role",
        Code={"S3Bucket": "manga-deploy", "S3Key": "burst-worker.zip"},
MemorySize=512, # Right-sized from profiling (was 1024)
Timeout=30,
Architectures=["arm64"], # Graviton — 20% savings
Environment={
"Variables": {
"REDIS_HOST": "manga-cache.xxxxxx.cache.amazonaws.com",
"BEDROCK_REGION": "ap-northeast-1",
}
},
)
return response
def manage_provisioned_concurrency(
function_name: str = "manga-burst-worker",
is_peak: bool = True,
) -> None:
"""Toggle provisioned concurrency based on time of day."""
lambda_client = boto3.client("lambda")
if is_peak:
lambda_client.put_provisioned_concurrency_config(
FunctionName=function_name,
Qualifier="prod",
ProvisionedConcurrentExecutions=50,
)
else:
lambda_client.delete_provisioned_concurrency_config(
FunctionName=function_name,
Qualifier="prod",
)
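On the scheduling side, a minimal EventBridge-invoked wrapper for this toggle might look like the following — a sketch; the handler name and event shape are assumptions, and the schedule mirrors the ECS scheduled actions above:
def handle_schedule_event(event: dict, context: object) -> dict:
    """EventBridge target: invoked with {"is_peak": true} at 23:45 UTC
    (8:45am JST pre-warm) and {"is_peak": false} at 14:00 UTC (11pm JST)."""
    is_peak = bool(event.get("is_peak", False))
    manage_provisioned_concurrency(is_peak=is_peak)
    return {"provisioned_concurrency": 50 if is_peak else 0}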
Monitoring and Metrics
| Metric | Target | Alert |
|---|---|---|
| ECS avg CPU utilization | 50-65% | < 25% (over-provisioned) or > 80% |
| ECS avg memory utilization | 50-70% | > 85% |
| Fargate Spot interruption rate | < 5% | > 10% |
| Auto-scaling lag (time to scale) | < 2 min | > 5 min |
| Lambda cold start rate (peak) | < 5% | > 15% |
| P99 request latency | < 3s | > 5s |
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Fargate Spot interruption during off-peak | Active requests dropped | Graceful drain on SIGTERM; 30% on-demand base; auto-replace within 2 min |
| Under-provisioned tasks after right-sizing | Higher latency, dropped requests | Canary deployment; rollback if P99 > SLA for 5 min |
| Scheduled scaling misaligned with traffic | Over or under-provisioned | Combine scheduled + reactive scaling; adjust schedule quarterly |
| Lambda cold starts during peak | Increased P99 latency | Provisioned concurrency during peak hours; ARM64 reduces cold start time |
Deep Dive: Why This Works on a Manga Chatbot Workload
Compute optimization on a chatbot is unusual because the workload mixes two cost shapes that are normally optimized separately: stateful long-lived connections (WebSocket sessions for streaming responses) and stateless burst work (Orchestrator request handling). Treating them as one workload — putting both on the same Fargate fleet at the same vCPU/memory ratio — is what causes the over-provisioning the story addresses. The savings break down into four mechanically-separable wins.
Property 1: Compute right-sizing is a measurement problem, not a design problem. The original 2 vCPU / 4 GB Fargate task was sized at provisioning time without traffic data. Most chatbot Orchestrator workloads are I/O-bound (waiting on Bedrock, OpenSearch, DynamoDB) and CPU-bound only during prompt assembly and JSON serialization. Profiling typically shows 25–40% CPU utilization at peak — meaning a 1 vCPU / 2 GB task with the same connection budget is functionally equivalent. The 40% right-sizing saving in the story is not "doing more with less"; it is "stopping over-provisioning." The architectural assumption is that headroom for traffic spikes comes from auto-scaling (more tasks), not from over-sizing each task — this is the AWS Compute Optimizer doctrine.
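The measurement step Property 1 describes can be done directly against CloudWatch's standard ECS service metrics — a sketch, assuming a two-week lookback and the cluster/service names used elsewhere in this story:
import boto3
from datetime import datetime, timedelta

def avg_service_utilization(cluster: str, service: str, days: int = 14) -> dict:
    """Average CPU/memory utilization for an ECS service from CloudWatch."""
    cw = boto3.client("cloudwatch")
    out = {}
    for metric in ("CPUUtilization", "MemoryUtilization"):
        resp = cw.get_metric_statistics(
            Namespace="AWS/ECS",
            MetricName=metric,
            Dimensions=[
                {"Name": "ClusterName", "Value": cluster},
                {"Name": "ServiceName", "Value": service},
            ],
            StartTime=datetime.utcnow() - timedelta(days=days),
            EndTime=datetime.utcnow(),
            Period=3600,
            Statistics=["Average", "Maximum"],
        )
        points = resp["Datapoints"]
        out[metric] = sum(p["Average"] for p in points) / max(len(points), 1)
    return out

# Right-size only if the data supports it (thresholds from the flow above)
util = avg_service_utilization("manga-chatbot", "orchestrator")
if util["CPUUtilization"] < 30 and util["MemoryUtilization"] < 40:
    print("Candidate for 1 vCPU / 2 GB — run the 1-week canary")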
Property 2: Spot is safe for stateless workers, fatal for stateful ones. The Orchestrator request handler is stateless (request → response, no in-memory session). It can run on Spot with graceful drain on SIGTERM and a 70/30 mix during off-peak. The WebSocket connection handler is stateful (open connection, streaming response in flight) — Spot interruption would cut user mid-response. This is why the story splits the workload: WebSocket handlers run on on-demand ARM64 (cheaper than x86 on-demand, but never Spot); Orchestrator workers run on the 70% Spot mix off-peak. The 30% on-demand baseline exists so that Spot interruption events (~5% per AWS-published rates for general-purpose families) cannot strand the system without compute — the on-demand floor absorbs the gap during the ~2 minute Spot replacement window.
Property 3: Graviton ARM64 is a Pareto improvement for the WebSocket workload. The published 20–40% price-perf gain from Graviton (Snap, Twitter, Honeycomb case studies) is real if your workload is portable. WebSocket handlers in this story are pure Python with no native dependencies — they are trivially portable to ARM64. Orchestrator workers, by contrast, may pull in native dependencies (numpy for embedding ops, sentencepiece for tokenization) where ARM64 wheels lag. This is why the story recommends ARM64 for WebSocket only and leaves Orchestrator on x86 — testing-effort-aware optimization, not blanket migration. The failure signal would be missing ARM64 wheels for a new dependency; a CI gate that builds the image on both architectures catches it.
Property 4: Lambda burst workers and provisioned concurrency are a peak-only contract. Lambda is correctly used for burst/spillover work (analytics enqueue, post-processing) where average traffic is low but peak is bursty. Provisioned concurrency eliminates cold starts but bills hourly even when idle — turning it on 24/7 negates the cost win. The story's "provisioned concurrency during peak only" + scheduled scaling of the concurrency value is the standard cost-optimal pattern; turning it off at 11pm JST and on at 8:45am JST captures the cost saving without sacrificing peak-hour latency.
Bottom line: the savings stack additively, not multiplicatively, because the four techniques target different cost components: right-sizing reduces the per-task baseline, Spot reduces the per-task hourly rate (off-peak only), Graviton reduces the per-task hourly rate further (WebSocket only), and provisioned-concurrency scheduling reduces Lambda idle waste. Pulling any one out leaves a measurable 10–20% saving on the table.
Real-World Validation
Industry Benchmarks & Case Studies
- AWS Compute Optimizer service — Published anonymized data shows 30–50% over-provisioning on average across customer Fargate workloads. The story's 40% right-sizing target sits in the middle of this band.
- AWS Spot Instance Advisor (public dashboard) — General-purpose Fargate Spot interruption rate sits at < 5% median across most regions for typical task families, with upward spikes of 10–20% during regional capacity crunches. The story's "70% Spot off-peak / 100% on-demand peak" mix matches AWS-published Spot best-practice for stateless workloads.
- Graviton case studies (Snap, Honeycomb, Twitter, Datadog) — All report 20–40% price-performance improvement after migrating to ARM64. Snap's 2022 re:Invent talk on Graviton WebSocket workloads is the closest analogue to this story's WebSocket workload.
- AWS Lambda Provisioned Concurrency pricing model — $0.0000041667/GB-second for provisioned capacity + standard invoke pricing on top. The break-even vs on-demand cold starts is at ~50% utilization of the provisioned pool in steady use; below that, on-demand cold starts win on cost (see the worked check after this list).
- AWS Fargate ARM64 pricing — 20% discount vs x86 Fargate (validated against current AWS Fargate pricing page). ARM64-on-Fargate is GA since 2022 and supports both Linux distributions used in this story's containers.
- Internal cross-reference: POC-to-Production-War-Story/02-seven-production-catastrophes.md — the "WebSocket meltdown" catastrophe was caused by under-provisioned WebSocket handlers running on Spot; it informs the on-demand-only WebSocket policy here.
- Internal cross-reference: Optimization-Tradeoffs-User-Stories/ — covers autoscaling-vs-static-provisioning trade-off curves; this story is the "lean with elastic spike absorption" operating point.
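A back-of-envelope version of the provisioned-concurrency break-even referenced above. The idle rate is the one quoted in this list; the provisioned-duration ($0.0000097222/GB-s) and on-demand ($0.0000166667/GB-s) x86 rates are assumptions to verify against the current pricing page:
# Break-even utilization for provisioned concurrency (PC) vs pure on-demand.
# PC bills an idle rate on the whole reserved pool but discounts duration;
# on-demand bills duration only. Rates below are x86 and should be re-checked.
PC_IDLE = 0.0000041667      # USD/GB-s, billed on the provisioned pool
PC_DURATION = 0.0000097222  # USD/GB-s while executing (assumed rate)
OD_DURATION = 0.0000166667  # USD/GB-s on-demand (assumed rate)

# cost_pc(u) = PC_IDLE + u * PC_DURATION ; cost_od(u) = u * OD_DURATION
# Equal when u = PC_IDLE / (OD_DURATION - PC_DURATION)
break_even = PC_IDLE / (OD_DURATION - PC_DURATION)
print(f"break-even pool utilization ≈ {break_even:.0%}")  # ≈ 60%
With these rates the crossover lands near 60% steady utilization of the pool — the same ballpark as the ~50% rule of thumb above; below the crossover, eating on-demand cold starts is cheaper.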
Math Validation
- Fargate on-demand x86: $0.04048/vCPU-hour + $0.004445/GB-hour. Original 2vCPU/4GB task: $0.04048×2 + $0.004445×4 = $0.0987/hr × 730 hr × 30 tasks = ~$2,162/month. Story baseline of "$3.5–4.5K/month for 30 tasks" appears to include Multi-AZ overhead, ALB cost, and burst tasks during scale-out — which would put it at the realistic end of the range.
- Right-sized 1 vCPU/2 GB: $0.04048 + $0.008890 = $0.04937/hr — exactly 50% of the original; the story's ~40% claim accounts for not all tasks being right-sizable. ✅
- Fargate Spot discount: ~70% off on-demand for Spot tasks. 70% Spot × 70% discount = ~49% reduction off-peak — consistent with the story's "~70% per Spot task" figure. ✅
- Fargate ARM64 (Graviton): 20% discount vs x86. 1vCPU/2GB ARM64 = $0.03950/hr — about $0.01 cheaper per task-hour. ✅
- Lambda ARM64 + 512MB: $0.0000133334/GB-second × 0.5 GB = $0.00000667/sec. 100ms typical invocation = $0.000000667 per call. At 1M calls/day = $0.67/day = $20/month — Lambda burst cost is small; the savings here are dominated by the Fargate stack.
Conservative vs Aggressive Savings Bounds
| Bound | Source | Total monthly savings |
|---|---|---|
| Conservative | Right-sizing only, no Spot, no ARM64 | ~30% (~$1,200/month) |
| Aggressive | Full Spot + ARM64 + scheduled scale-down + Lambda burst migration of ~40% peak load | ~60% (~$2,500/month) |
| Story's projection (35–55%) | All techniques at the story's stated mix; realistic for a 30-task baseline, contingent on Spot interruption rate and ARM64 portability of all dependencies | ~$1,500–$2,200/month |
Cross-Story Interactions & Conflicts
- US-08 (Traffic-Based) — Authoritative side: US-08. Conflict mode: auto-scaling (this story) responds to load by adding tasks; degradation (US-08) responds to load by shedding requests. If both fire on the same trigger window, you get oscillation: degradation cuts demand → CPU drops → auto-scaler scales in → next traffic burst hits with no headroom → degradation fires again. Resolution: US-08 emits a degradation_active=true signal; the auto-scaler suspends scale-in (but allows scale-out) for the duration. This is a one-way coupling — degradation can prevent scale-in, scale-in cannot prevent degradation. A sketch of this coupling follows this list.
- US-06 (RAG) — Authoritative side: US-06 owns OpenSearch capacity. Conflict mode: when this story's auto-scaler scales Fargate to zero (or near-zero) overnight, OpenSearch Serverless still bills the 4-OCU minimum (~$22/day = ~$691/month). Compute-tier cost goes to zero overnight, but the OpenSearch tier doesn't follow. Resolution: do not assume OpenSearch capacity tracks Fargate capacity; US-06's batch indexing window (2am-4am JST) intentionally uses the otherwise-idle OpenSearch capacity overnight, so total OpenSearch utility is preserved.
- US-01 (LLM Tokens) — Indirect interaction. Conflict mode: if Fargate workers are right-sized too aggressively (1vCPU/2GB), prompt assembly and tokenization for large prompts may become CPU-bound and increase per-request latency. Right-sizing must measure CPU utilization during peak prompt-compression activity (US-01's compressor), not just during typical request handling.
- US-05 (DynamoDB) — Indirect interaction. The DDB write buffer in US-05 (100 ms flush interval) lives in-process on these Fargate tasks. Spot interruption could drop up to 100 ms of buffered writes; US-05's DLQ pattern absorbs this, but Spot interruption rate (5%) × 30 tasks × buffer drops is the real loss-tolerance budget — measure it.
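The one-way coupling in the US-08 bullet maps directly onto Application Auto Scaling's suspend flags — a sketch, assuming the degradation_active signal arrives as a boolean (the signal plumbing itself is US-08's side of the contract):
import boto3

def set_scale_in_suspension(degradation_active: bool,
                            cluster: str = "manga-chatbot",
                            service: str = "orchestrator") -> None:
    """While degradation is active, block scale-in but keep scale-out enabled."""
    aas = boto3.client("application-autoscaling")
    aas.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=f"service/{cluster}/{service}",
        ScalableDimension="ecs:service:DesiredCount",
        SuspendedState={
            "DynamicScalingInSuspended": degradation_active,  # one-way coupling
            "DynamicScalingOutSuspended": False,  # scale-out always allowed
            "ScheduledScalingSuspended": False,
        },
    )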
Rollback & Experimentation
Shadow-Mode Plan
- Right-sizing: deploy a parallel ECS service at 1 vCPU / 2 GB with 5% traffic shadow-routed; measure CPU/memory utilization and p99 latency for 1 week before rolling out.
- Spot mix: enable 30% Spot off-peak first, observe interruption rate for 2 weeks, then ramp to 70% if interruption rate stays < 5%.
- ARM64 WebSocket: deploy ARM64 service at 10% traffic for 1 week; compare connection error rate, message throughput, and per-task cost.
Canary Thresholds
- Right-sizing: 10% → 25% → 50% → 100% over 2 weeks; abort if p99 latency rises > 15% or CPU sustained > 75%.
- Spot ramp: 30% → 50% → 70% over 3 weeks; abort if interruption-induced 5xx rate > 0.5%.
- Provisioned concurrency schedule: deploy with the current 24/7 schedule, then drop to peak-only after observing 1 full traffic cycle.
- Abort criteria (any one trips): p99 latency rise > 15%, request error rate > 1%, Spot interruption rate > 10%, ARM64-induced runtime errors detected.
Kill Switch
- Three independent flags: compute_rightsize_enabled (reverts to 2 vCPU/4 GB tasks), compute_spot_enabled (reverts to 100% on-demand), compute_arm64_enabled (reverts WebSocket handlers to x86). Each flag flips a CDK / Terraform toggle that triggers a rolling redeployment within ~10 minutes.
Quality Regression Criteria (story-specific)
- P99 request latency: ≤ 3.0 s (matches story metric line 437); breach for > 10 min triggers investigation.
- Spot interruption-driven 5xx rate: ≤ 0.5% (above this, drop to 50% Spot mix).
- ARM64 build failures (CI): 0 (any failure blocks deployment).
- Auto-scaling oscillation rate (scale-out + scale-in within same 5-min window): ≤ 2 events/day.
Multi-Reviewer Validation Findings & Resolutions
The cross-reviewer pass identified the following story-specific findings. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.
S1 (must-fix before production)
Spot interruption + DDB write buffer = silent data loss. US-04 runs 70% Spot off-peak; US-05 runs a 100ms in-process DDB write buffer on those tasks. Spot interruption with default SIGTERM grace can drop up to 100ms of buffered writes. At 5% interruption rate × 30 tasks × 100ms × peak QPS, expect ~0.5-1% of turns lost per off-peak day. Resolution: SIGTERM handler must (a) stop accepting new buffered writes immediately, (b) flush remaining buffer synchronously with a hard 20s timeout, (c) write any remaining items to a DLQ DDB table on timeout, (d) confirm Fargate's 30s graceful-stop window is honored end-to-end. Validated via quarterly chaos drill (force-terminate 10% of tasks; reconcile DLQ vs writes).
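A sketch of the SIGTERM handler this resolution describes. write_buffer and dlq_table are stand-ins for US-05's writer and a DLQ DynamoDB table — their APIs here are assumptions:
import signal, time

SHUTDOWN = False

def _on_sigterm(signum, frame):
    """Spot interruption / task stop: stop buffering, flush, DLQ the remainder."""
    global SHUTDOWN
    SHUTDOWN = True                      # (a) reject new buffered writes
    deadline = time.monotonic() + 20     # (b) hard 20s flush budget
    while write_buffer.pending() and time.monotonic() < deadline:
        write_buffer.flush_batch()       # synchronous DDB BatchWriteItem
    for item in write_buffer.drain():    # (c) anything left goes to the DLQ
        dlq_table.put_item(Item=item)

signal.signal(signal.SIGTERM, _on_sigterm)
# (d) requires ECS stopTimeout >= 30 so the grace window is honored end-to-end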
Three independent kill-switch flags create incident confusion. compute_rightsize_enabled + compute_spot_enabled + compute_arm64_enabled — at 3am, an SRE has to think about which combination to flip. Resolution: consolidate to a single compute_optimization_enabled flag at the story level; per-technique flags become internal-only (developer testing). Store the flag in the central feature-flag evaluator per README precedence rules.
S2 (fix before scale-up)
Auto-scaling lag (2 min) conflicts with the relative canary abort threshold. During the 2-minute scale-out window after a traffic burst, p99 latency can hit 10-15s on already-running tasks before new capacity warms up. The "abort if p99 > +15%" canary criterion will trip on every traffic spike. Resolution: anchor canary aborts to the absolute SLA (p99 > 3.0s sustained for > 5 min) per US-04 line 437, not to a relative percentage rise. Use scheduled scale-out 5 minutes before known traffic peaks (8:55am JST) to pre-warm.
ARM64 wheel-availability risk acknowledged but unmitigated. Story flags missing wheels but only adds a CI gate. Resolution: add a pre-implementation dependency audit — run pip download --platform manylinux_2_17_aarch64 -r requirements.txt and confirm zero failures before the ARM64 migration starts. Native dependencies (numpy, sentencepiece, tokenizers) all have aarch64 wheels at current pinned versions; verify per release.
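A minimal CI wrapper for that audit — note pip requires --only-binary :all: (or --no-deps) when --platform is set, which also makes the gate strict about sdist-only packages; the script and requirements path are assumptions:
import subprocess, sys

def audit_aarch64_wheels(requirements: str = "requirements.txt") -> int:
    """Fail CI if any pinned dependency lacks a manylinux aarch64 wheel."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "download",
         "--platform", "manylinux_2_17_aarch64",
         "--only-binary", ":all:",        # refuse sdists — wheel or fail
         "--dest", "/tmp/aarch64-audit",
         "-r", requirements],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(result.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(audit_aarch64_wheels())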
Image scanning baseline must extend to ARM64. ARM64 base images may have different CVE coverage in your scanner. Resolution: CI gate runs scanner on both arm64 and x86_64 images; both must pass before merge. Document the security baseline assumption per architecture.
Baseline cost decomposition opaque. Story claims $3.5–4.5K/month for 30 tasks but math validation derives $2,162 task cost only. The remaining ~$1.5–2K is not broken out. Resolution: explicit baseline decomposition: task vCPU/memory + ALB hour + ALB LCU + NAT Gateway data processing + CloudWatch logs ingestion + ECR storage + KMS calls. NAT Gateway data charges in particular often dominate Fargate egress to Bedrock.
Auto-scaling oscillation with US-08 degradation not bidirectionally captured. US-04 acknowledges the conflict; US-08 must explicitly emit suspend_scale_in=true. Resolution: add an alarm on simultaneous scale-out + degradation events > 2/day; tune US-08 hysteresis if alarm fires.
S3 (acknowledged / future work)
- Compute Savings Plans (1-year commit on Fargate + Lambda; ~15–30% discount) — FinOps-lead-owned at quarterly review.
- Multi-region active-active for compute — out of scope.
- Per-request compute cost attribution to US-07 (request_id-keyed task-second sampling).
Runbook: Spot Interruption Cascade
Symptoms: > 10% of Orchestrator tasks interrupted within 10 minutes; 5xx error spike on requests that were in-flight on terminated tasks.
Triage (in order):
- Validate ALB has unregistered terminated tasks (target group health check; should happen within 15s).
- Check Spot interruption rate via AWS Spot Instance Advisor — is this regional capacity crunch (broad spike) or workload-specific (localized)?
- If sustained > 10% for 15 minutes, flip
compute_optimization_enabled=falseto fall back to 100% on-demand (rolling deploy ~10 min). - Verify DDB DLQ is catching dropped buffer writes; reconcile via the periodic replay job.
- If interruption-driven 5xx > 1% sustained, page; this is a customer-impact incident.
Escalation: if region-wide Spot capacity is exhausted, on-demand fallback may also struggle to acquire capacity. Pre-warmed reserved-capacity (Compute Savings Plans) is the longer-term mitigation; flag at next FinOps review.