
US-08: Autoscaling Strategy — Cost vs Performance Headroom

User Story

As an SRE lead for MangaAssist, I want to define an autoscaling strategy that provides enough capacity headroom to meet latency SLAs without paying for idle resources, so that peak traffic gets fast responses and off-peak traffic doesn't burn budget on empty containers.

The Debate

graph TD
    subgraph "Performance Team"
        P["Keep 50% headroom at all times.<br/>Autoscaling is reactive — by the<br/>time new tasks spin up (60-90s),<br/>users have already seen timeouts.<br/>Pre-provision for peak."]
    end

    subgraph "Cost Team"
        C["50% headroom means paying for<br/>containers that sit idle 70% of<br/>the time. At $0.04/vCPU-hour,<br/>that's $28K/month in waste.<br/>Scale tighter, accept occasional<br/>latency spikes."]
    end

    subgraph "Inference Team"
        I["If tasks are overloaded,<br/>LLM requests queue up.<br/>Queued requests = stale context<br/>(user may have navigated away).<br/>Overloaded tasks also drop<br/>WebSocket connections."]
    end

    P ---|"Idle cost"| C
    C ---|"Quality under<br/>load"| I
    I ---|"Capacity<br/>planning"| P

    style P fill:#4ecdc4,stroke:#333,color:#000
    style C fill:#f9d71c,stroke:#333,color:#000
    style I fill:#ff6b6b,stroke:#333,color:#000

Acceptance Criteria

  • p95 latency stays under 2 seconds during traffic spikes up to 2x baseline (see the alarm sketch after this list).
  • Scale-up responds within 90 seconds of a sustained traffic increase.
  • Idle resource cost (capacity above actual usage) stays under 25% of compute budget.
  • Zero dropped requests during scaling events (queue absorbs burst).
  • Predictive scaling pre-provisions for known traffic patterns (weekday evenings, manga release days).
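
The first criterion maps onto a straightforward latency alarm. A minimal sketch, assuming the API sits behind an Application Load Balancer and alerts go to an SNS topic; the load balancer name, account ID, and topic ARN are placeholders, not values from this document.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

# p95 TargetResponseTime > 2s, sustained for 3 minutes, notifies on-call.
# The LoadBalancer dimension and SNS topic ARN below are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-p95-latency-sla",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/mangaassist-api/0123456789abcdef"}],
    ExtendedStatistic="p95",           # matches the p95 wording of the SLA
    Period=60,                         # 1-minute evaluation windows
    EvaluationPeriods=3,               # a sustained breach, not a single blip
    Threshold=2.0,                     # seconds, per the acceptance criterion
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:oncall-alerts"],
)
```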

Traffic Patterns and the Scaling Challenge

MangaAssist Daily Traffic Profile (JST)

graph LR
    subgraph "Traffic Pattern — Typical Weekday"
        direction TB
        T1["6AM: Wake-up ramp<br/>200 RPS"]
        T2["12PM: Lunch spike<br/>800 RPS"]
        T3["3PM: Afternoon lull<br/>500 RPS"]
        T4["7PM: Evening peak<br/>1,500 RPS"]
        T5["10PM: Wind-down<br/>600 RPS"]
        T6["1AM: Off-peak<br/>50 RPS"]
    end

    T1 --> T2 --> T3 --> T4 --> T5 --> T6

    style T4 fill:#eb3b5a,stroke:#333,color:#fff
    style T6 fill:#2d8659,stroke:#333,color:#fff

The Scaling Dilemma at the Evening Spike

graph TD
    subgraph "6:30 PM — Traffic starts climbing"
        A1["Current: 600 RPS<br/>Capacity: 800 RPS<br/>Headroom: 33%"]
    end

    subgraph "7:00 PM — Spike hits"
        A2["Demand: 1,500 RPS<br/>Capacity: 800 RPS<br/>Deficit: 700 RPS ❌"]
    end

    subgraph "7:02 PM — Autoscaler triggers"
        A3["Scaling event fired<br/>New tasks launching...<br/>Time to ready: 60-90s"]
    end

    subgraph "7:03 PM — Gap period"
        A4["700 RPS queued or dropped<br/>Latency spikes to 5-8 seconds<br/>15% of users see errors"]
    end

    A1 --> A2 --> A3 --> A4

    style A2 fill:#eb3b5a,stroke:#333,color:#fff
    style A4 fill:#ff6b6b,stroke:#333,color:#000

This is why reactive autoscaling alone is not enough.


The Three Scaling Strategies

Strategy A: Reactive Only (Cost Team Preferred)

graph TD
    A["Monitor: CPU > 70%<br/>for 2 minutes"] --> B["Scale up:<br/>add 20% capacity"]
    B --> C["Tasks ready<br/>in 60-90 seconds"]
    C --> D["Gap: 2-3 minutes<br/>of degraded performance"]

    style D fill:#eb3b5a,stroke:#333,color:#fff

| Metric | Value |
|---|---|
| Monthly compute cost | $42,000 |
| Idle resource % | 10% |
| p95 latency during spike | 4.2 seconds (SLA violation) |
| Dropped requests during scaling | 2-5% |
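
Strategy A corresponds directly to a step-scaling policy on the ECS service through Application Auto Scaling. A minimal sketch, assuming placeholder cluster/service names (manga-prod / mangaassist-api); the 70% CPU for 2 minutes trigger and the 20% step come from the figure above.

```python
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="ap-northeast-1")
cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

resource_id = "service/manga-prod/mangaassist-api"  # placeholder cluster/service

# Register the ECS service as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,      # floor of 10 tasks, per the scale-down rules
    MaxCapacity=200,
)

# Step policy: add 20% capacity whenever the alarm below fires.
policy = autoscaling.put_scaling_policy(
    PolicyName="reactive-scale-up-20pct",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "PercentChangeInCapacity",
        "Cooldown": 120,
        "StepAdjustments": [{"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 20}],
    },
)

# CPU > 70% for 2 consecutive minutes triggers the policy.
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "manga-prod"},
        {"Name": "ServiceName", "Value": "mangaassist-api"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```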

Strategy B: Over-Provisioned (Performance Team Preferred)

graph TD
    A["Always maintain:<br/>50% headroom above<br/>current baseline"] --> B["No scaling gap:<br/>capacity always available"]

    style B fill:#2d8659,stroke:#333,color:#fff

| Metric | Value |
|---|---|
| Monthly compute cost | $78,000 |
| Idle resource % | 45% |
| p95 latency during spike | 1.2 seconds (within SLA) |
| Dropped requests during scaling | 0% |

Strategy C: Predictive + Reactive Hybrid (Decision)

graph TD
    A["Predictive Scaling<br/>(scheduled)"] --> B["Pre-provision for known<br/>patterns:<br/>• Evening peak at 6:30 PM<br/>• Lunch spike at 11:30 AM<br/>• Manga release days"]

    C["Reactive Scaling<br/>(metric-based)"] --> D["Handle unexpected spikes:<br/>• Viral recommendation<br/>• Flash sale<br/>• External event"]

    B --> E["Combined: Capacity ready<br/>before predictable spikes<br/>+ reactive for surprises"]
    D --> E

    E --> F["Lambda Burst Pool<br/>(overflow)"] 

    style B fill:#2d8659,stroke:#333,color:#fff
    style D fill:#fd9644,stroke:#333,color:#000
    style F fill:#f9d71c,stroke:#333,color:#000

| Metric | Value |
|---|---|
| Monthly compute cost | $56,000 |
| Idle resource % | 22% |
| p95 latency during spike | 1.5 seconds (within SLA) |
| Dropped requests during scaling | 0% |

Predictive Scaling: How It Works

sequenceDiagram
    participant Scheduler as Predictive Scheduler
    participant ASG as ECS Auto Scaling
    participant Fargate as Fargate Tasks
    participant Lambda as Lambda Burst Pool
    participant CW as CloudWatch

    Note over Scheduler: 6:15 PM JST, 45 min before the predicted 7:00 PM peak
    Scheduler->>ASG: Scale to peak capacity (1,500 RPS)
    ASG->>Fargate: Launch additional tasks
    Note over Fargate: Tasks ready by 6:25 PM

    Note over CW: 7:00 PM — Actual peak arrives
    CW->>ASG: CPU at 65% (within target)
    Note over Fargate: Handling peak without scaling gap ✅

    Note over CW: 7:30 PM — Unexpected viral spike (+500 RPS)
    CW->>ASG: CPU at 85% — reactive scale-up
    ASG->>Fargate: Launch more tasks (60s warm-up)
    CW->>Lambda: Overflow to Lambda burst pool (instant)
    Note over Lambda: Lambda handles overflow during 60s gap

Predictive Schedule

| Time (JST) | Target Capacity | Trigger | Rationale |
|---|---|---|---|
| 5:30 AM | Minimum (100 RPS) | Scheduled | Deep off-peak |
| 6:00 AM | Morning base (300 RPS) | Scheduled | Pre-commute browsing |
| 11:30 AM | Lunch peak (900 RPS) | Scheduled | 30 min before lunch |
| 1:30 PM | Afternoon (600 RPS) | Scheduled | Post-lunch decline |
| 6:15 PM | Evening peak (1,600 RPS) | Scheduled | 45 min before peak |
| 10:00 PM | Wind-down (700 RPS) | Scheduled | Post-peak descent |
| 12:00 AM | Night minimum (100 RPS) | Scheduled | Night mode |
| Release day | 2x normal capacity | Calendar event | Manga volume release days see 2-3x traffic |
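
One way to implement this schedule is a set of Application Auto Scaling scheduled actions that raise MinCapacity ahead of each window; reactive policies can still scale above the floor. A sketch with the same placeholder resource names as above; the task counts are illustrative stand-ins for the RPS tiers, and the Timezone parameter keeps the cron expressions in JST.

```python
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="ap-northeast-1")
resource_id = "service/manga-prod/mangaassist-api"  # placeholder cluster/service

# (name, JST cron expression, minimum tasks). Task counts are illustrative
# stand-ins for the RPS tiers in the schedule table.
schedule = [
    ("night-minimum", "cron(0 0 * * ? *)",   10),
    ("deep-off-peak", "cron(30 5 * * ? *)",  10),
    ("morning-base",  "cron(0 6 * * ? *)",   30),
    ("lunch-peak",    "cron(30 11 * * ? *)", 90),
    ("afternoon",     "cron(30 13 * * ? *)", 60),
    ("evening-peak",  "cron(15 18 * * ? *)", 160),
    ("wind-down",     "cron(0 22 * * ? *)",  70),
]

for name, cron, min_tasks in schedule:
    autoscaling.put_scheduled_action(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        ScheduledActionName=f"predictive-{name}",
        Schedule=cron,
        Timezone="Asia/Tokyo",   # evaluate the cron expressions in JST
        # Raising MinCapacity pre-provisions capacity; reactive scaling can
        # still go above it, and scale-in cannot dip below it.
        ScalableTargetAction={"MinCapacity": min_tasks, "MaxCapacity": 200},
    )

# Release days would get one-off "at(YYYY-MM-DDTHH:MM:SS)" scheduled actions
# driven by the release calendar rather than a recurring cron entry.
```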

Lambda Burst Pool: The Safety Valve

graph TD
    A["Request arrives"] --> B{"Fargate tasks<br/>available?"}
    B -->|"Yes"| C["Route to Fargate<br/>(preferred — WebSocket,<br/>stateful)"]
    B -->|"No — at capacity"| D["Route to Lambda<br/>(overflow — HTTPS only,<br/>stateless)"]

    D --> E["Lambda limitations:<br/>• No WebSocket streaming<br/>• Stateless (no session affinity)<br/>• Cold start: 200-500ms<br/>• Concurrency limit: 1,000"]

    style C fill:#2d8659,stroke:#333,color:#fff
    style D fill:#fd9644,stroke:#333,color:#000
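
How the overflow decision could look at the application edge, as a sketch: prefer the Fargate tier and spill to the Lambda pool when Fargate signals saturation. Both endpoints are placeholders, and a real deployment would more likely express this with load balancer routing or a queue in front of the service than an in-process fallback.

```python
import requests

FARGATE_URL = "https://fargate.internal.mangaassist.example/v1/assist"  # placeholder
LAMBDA_URL = "https://xxxxxxxx.lambda-url.ap-northeast-1.on.aws/"       # placeholder

def route_request(payload: dict, timeout_s: float = 2.0) -> dict:
    """Prefer Fargate; fall back to the stateless Lambda pool when the
    Fargate tier is saturated (HTTP 503) or too slow to answer."""
    try:
        resp = requests.post(FARGATE_URL, json=payload, timeout=timeout_s)
        if resp.status_code != 503:        # 503 = at capacity, spill over
            resp.raise_for_status()
            return resp.json()
    except requests.Timeout:
        pass  # treat a timeout like saturation and spill over

    # Lambda path: HTTPS only, no WebSocket streaming, no session affinity.
    resp = requests.post(LAMBDA_URL, json=payload, timeout=timeout_s + 1.0)
    resp.raise_for_status()
    return resp.json()
```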

Lambda vs Fargate Cost at Overflow Volume

| Scenario | Duration | Volume | Fargate Cost | Lambda Cost | Winner |
|---|---|---|---|---|---|
| Spike overflow (15 min, 500 RPS) | 15 min | 450K requests | $0 (already scaled) | $180 | Lambda (no pre-provision needed) |
| Sustained high (2 hours, 500 RPS) | 2 hours | 3.6M requests | $120 | $1,440 | Fargate (cheaper sustained) |
| Flash spike (5 min, 2,000 RPS) | 5 min | 600K requests | $0 (can't scale in time) | $240 | Lambda (only option) |

Decision: Lambda for bursts under 20 minutes. Beyond that, reactive Fargate scaling should have caught up.
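
A quick check of the volume and Lambda-cost columns above, using the effective ~$0.0004 per overflow request implied by the first row (a blended request-plus-duration figure assumed here, not a quoted AWS price):

```python
# Effective blended cost per overflow request implied by the first row of
# the table ($180 / 450K). An assumption, not a quoted AWS price.
LAMBDA_COST_PER_REQUEST = 0.0004

scenarios = [
    ("Spike overflow",  500,  15),   # (name, RPS, minutes)
    ("Sustained high",  500, 120),
    ("Flash spike",    2000,   5),
]

for name, rps, minutes in scenarios:
    volume = rps * 60 * minutes
    lambda_cost = volume * LAMBDA_COST_PER_REQUEST
    print(f"{name:<15} {volume/1e3:>7,.0f}K requests  Lambda ~ ${lambda_cost:,.0f}")

# Reproduces the table: 450K -> $180, 3.6M -> $1,440, 600K -> $240.
```

At sustained load the same overflow capacity on Fargate works out to roughly $60 per hour ($120 for the 2-hour row), which is why Lambda only wins for short bursts or spikes that Fargate cannot scale into in time.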


The Provisioned Throughput Decision (Bedrock)

The most expensive scaling decision is whether to pay for Bedrock provisioned throughput:

graph TD
    subgraph "On-Demand Bedrock"
        OD["Pay per token<br/>No capacity guarantee<br/>Latency: 200-800ms TTFT<br/>Risk: throttling at peak"]
    end

    subgraph "Provisioned Throughput"
        PT["Reserved capacity<br/>$25K/month base<br/>Latency: 150-400ms TTFT<br/>Guaranteed throughput"]
    end

    subgraph "Hybrid (Decision)"
        HY["Provisioned for peak hours<br/>(8AM-11PM JST)<br/>On-demand for off-peak<br/>Saves ~$10K/month vs 24/7"]
    end

    style OD fill:#2d8659,stroke:#333,color:#fff
    style PT fill:#eb3b5a,stroke:#333,color:#fff
    style HY fill:#fd9644,stroke:#333,color:#000

| Configuration | Monthly Cost | Peak TTFT p95 | Off-Peak TTFT p95 | Risk |
|---|---|---|---|---|
| Always on-demand | $0 base + per-token | 500-800ms | 300-500ms | Throttling at peak |
| Always provisioned | $25,000 + per-token | 200-400ms | 200-400ms | Paying for idle off-peak |
| Hybrid (Decision) | $15,000 + per-token | 200-400ms | 400-700ms | Slightly higher off-peak latency |
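
A sketch of the hybrid routing rule, assuming the provisioned capacity is addressed by its provisioned-model ARN and off-peak traffic falls back to an on-demand model ID; both identifiers below are placeholders, and Bedrock accepts either form as the modelId of a Converse call.

```python
import datetime
import zoneinfo

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")

# Placeholders: real values would come from the Bedrock console or IaC.
PROVISIONED_MODEL_ARN = "arn:aws:bedrock:ap-northeast-1:123456789012:provisioned-model/abc123"
ON_DEMAND_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

JST = zoneinfo.ZoneInfo("Asia/Tokyo")

def pick_model_id(now: datetime.datetime | None = None) -> str:
    """Provisioned throughput during peak hours (8AM-11PM JST), on-demand otherwise."""
    now = now or datetime.datetime.now(JST)
    return PROVISIONED_MODEL_ARN if 8 <= now.hour < 23 else ON_DEMAND_MODEL_ID

def recommend(prompt: str) -> str:
    response = bedrock.converse(
        modelId=pick_model_id(),
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]
```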

Scale-Down Strategy: Equally Important

Scaling down too aggressively throws away warm capacity that the next ramp will need; scaling down too slowly pays for idle tasks.

graph TD
    A["Traffic decreasing"] --> B{"CPU < 40%<br/>for 10 minutes?"}
    B -->|"Yes"| C["Scale down by<br/>10% of current capacity"]
    B -->|"No"| D["Hold current capacity"]

    C --> E{"Minimum floor<br/>reached?"}
    E -->|"Yes"| F["Stop scaling down"]
    E -->|"No"| A

    style C fill:#f9d71c,stroke:#333,color:#000
    style F fill:#eb3b5a,stroke:#333,color:#fff

Scale-Down Rules

| Rule | Value | Rationale |
|---|---|---|
| Cool-down period after scale-up | 10 minutes | Prevent thrashing on bursty traffic |
| CPU threshold for scale-down | < 40% for 10 min | Conservative to avoid premature scale-down |
| Scale-down step size | 10% of current capacity | Gradual; don't kill too many tasks at once |
| Minimum tasks (floor) | 10 tasks | Enough for baseline + unexpected traffic |
| Scale-down block during known peaks | 30 min before predicted peak | Don't scale down into a known spike |
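
The last rule (blocking scale-in ahead of known peaks) can be implemented by suspending dynamic scale-in on the scalable target shortly before each predicted spike and re-enabling it afterwards. A sketch, reusing the placeholder resource ID from above and assuming the 30-minute window is driven by the same scheduler that runs the predictive actions.

```python
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="ap-northeast-1")
resource_id = "service/manga-prod/mangaassist-api"  # placeholder cluster/service

def set_scale_in_blocked(blocked: bool) -> None:
    """Suspend (or resume) dynamic scale-in; scale-out stays active either way."""
    autoscaling.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        SuspendedState={
            "DynamicScalingInSuspended": blocked,   # block scale-in before a known peak
            "DynamicScalingOutSuspended": False,
            "ScheduledScalingSuspended": False,
        },
    )

# Called by the predictive scheduler 30 minutes before a predicted peak,
# and again once the peak has passed.
set_scale_in_blocked(True)    # e.g. 6:30 PM JST, ahead of the 7:00 PM peak
# ... peak window ...
set_scale_in_blocked(False)
```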

2026 Update: Scale on Generative Workload Signals, Not Only CPU

Treat everything above this section as the baseline autoscaling architecture. This update keeps that design intact and describes how the same architecture should scale on generative-serving signals.

Current LLM-serving systems are moving away from CPU-driven autoscaling toward metrics that reflect actual generative bottlenecks.

  • For managed Bedrock, use cross-Region or global inference profiles to absorb burst demand and raise throughput on on-demand traffic. They complement but do not replace Provisioned Throughput.
  • Apply latency-optimized inference selectively on user-facing routes where TTFT matters most. It is usually a route-level optimization, not an all-traffic default.
  • For self-hosted stacks, autoscale on waiting requests, queue time, TTFT, prompt tokens/sec, generation tokens/sec, and KV cache pressure rather than raw CPU (see the sketch after the references below).
  • Separate prefill-heavy and decode-heavy bottlenecks where possible. Disaggregated prefill is often a better control knob than simply holding more generic compute headroom.
  • Maintain a graceful degradation plan for spikes: smaller output cap, cheaper model, reduced retrieval depth, or safe fallback before user-visible errors.

Recent references: AWS Bedrock cross-Region inference, AWS Bedrock global cross-Region inference, AWS Bedrock latency-optimized inference, KServe autoscaling with LLM metrics, vLLM metrics, vLLM disaggregated prefilling.
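
For the self-hosted path, a sketch of what scaling on generative signals might look like: poll Prometheus for vLLM's queue-depth and KV-cache gauges and derive a replica count from waiting requests instead of CPU. The Prometheus endpoint, target values, and replica math are all assumptions, and the metric names follow vLLM's documented Prometheus exporter (they can vary by version); in Kubernetes this logic would typically live in an HPA or KEDA configuration on the same metrics rather than a hand-rolled loop.

```python
import math

import requests

PROMETHEUS_URL = "http://prometheus.monitoring:9090"   # placeholder endpoint

# Target operating points: illustrative assumptions, not tuned values.
TARGET_WAITING_PER_REPLICA = 4      # queued requests each replica should absorb
MAX_KV_CACHE_USAGE = 0.90           # scale out before the KV cache saturates

def prom_scalar(query: str) -> float:
    """Run an instant PromQL query and return the first value (0.0 if empty)."""
    r = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=5)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def desired_replicas(current_replicas: int) -> int:
    # Metric names as exposed by vLLM's Prometheus endpoint.
    waiting = prom_scalar("sum(vllm:num_requests_waiting)")
    kv_usage = prom_scalar("max(vllm:gpu_cache_usage_perc)")   # fraction 0-1

    # Size primarily on queue depth rather than CPU.
    replicas = max(1, math.ceil(waiting / TARGET_WAITING_PER_REPLICA))
    # Hold or add capacity while the KV cache is near saturation.
    if kv_usage >= MAX_KV_CACHE_USAGE:
        replicas = max(replicas, current_replicas + 1)
    return replicas
```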

Reversal Triggers

| Trigger | Action |
|---|---|
| p95 latency exceeds 2s during a scaling event | Increase predictive headroom by 20% |
| Idle resource cost exceeds 30% of compute budget | Reduce predictive scaling buffer; tighten reactive thresholds |
| Lambda burst pool hits concurrency limit | Increase Lambda concurrency or expand Fargate reactive scaling speed |
| Predictive schedule misaligns with actual traffic | Retrain prediction model with last 30 days of traffic |
| Manga release spike significantly exceeds 2x prediction | Add calendar-based alerts for "big release days" with manual capacity override |

Impact on Trilemma

| Dimension | Reactive Only | Over-Provisioned | Predictive Hybrid (Decision) |
|---|---|---|---|
| Cost | $42K/mo (cheapest) | $78K/mo (most expensive) | $56K/mo (~28% savings vs over-provisioned) |
| Performance (peak p95) | 4.2s (SLA violation) | 1.2s (best) | 1.5s (within SLA) |
| Availability | 95-98% during spikes | 99.9% | 99.5% (Lambda overflow covers gaps) |
| QACPI | Low (latency kills it) | Medium (cost drags) | Highest |