
US-08: Autoscaling Strategy — Cost vs Performance Headroom

User Story

As an SRE lead for MangaAssist, I want to define an autoscaling strategy that provides enough capacity headroom to meet latency SLAs without paying for idle resources, so that peak traffic gets fast responses and off-peak traffic doesn't burn budget on empty containers.

The Debate

graph TD
    subgraph "Performance Team"
        P["Keep 50% headroom at all times.<br/>Autoscaling is reactive — by the<br/>time new tasks spin up (60-90s),<br/>users have already seen timeouts.<br/>Pre-provision for peak."]
    end

    subgraph "Cost Team"
        C["50% headroom means paying for<br/>containers that sit idle 70% of<br/>the time. At $0.04/vCPU-hour,<br/>that's $28K/month in waste.<br/>Scale tighter, accept occasional<br/>latency spikes."]
    end

    subgraph "Inference Team"
        I["If tasks are overloaded,<br/>LLM requests queue up.<br/>Queued requests = stale context<br/>(user may have navigated away).<br/>Overloaded tasks also drop<br/>WebSocket connections."]
    end

    P ---|"Idle cost"| C
    C ---|"Quality under<br/>load"| I
    I ---|"Capacity<br/>planning"| P

    style P fill:#4ecdc4,stroke:#333,color:#000
    style C fill:#f9d71c,stroke:#333,color:#000
    style I fill:#ff6b6b,stroke:#333,color:#000

Acceptance Criteria

  • p95 latency stays under 2 seconds during traffic spikes up to 2x baseline (see the alarm sketch after this list).
  • Scale-up responds within 90 seconds of a sustained traffic increase.
  • Idle resource cost (capacity above actual usage) stays under 25% of compute budget.
  • Zero dropped requests during scaling events (queue absorbs burst).
  • Predictive scaling pre-provisions for known traffic patterns (weekday evenings, manga release days).
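
The first criterion maps onto a straightforward latency alarm. A minimal sketch, assuming the API sits behind an Application Load Balancer and alerts go to an SNS topic; the load balancer name, account ID, and topic ARN are placeholders, not values from this document.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

# p95 TargetResponseTime > 2s, sustained for 3 minutes, notifies on-call.
# The LoadBalancer dimension and SNS topic ARN below are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-p95-latency-sla",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/mangaassist-api/0123456789abcdef"}],
    ExtendedStatistic="p95",           # matches the p95 wording of the SLA
    Period=60,                         # 1-minute evaluation windows
    EvaluationPeriods=3,               # a sustained breach, not a single blip
    Threshold=2.0,                     # seconds, per the acceptance criterion
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:oncall-alerts"],
)
```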

Traffic Patterns and the Scaling Challenge

MangaAssist Daily Traffic Profile (JST)

graph LR
    subgraph "Traffic Pattern — Typical Weekday"
        direction TB
        T1["6AM: Wake-up ramp<br/>200 RPS"]
        T2["12PM: Lunch spike<br/>800 RPS"]
        T3["3PM: Afternoon lull<br/>500 RPS"]
        T4["7PM: Evening peak<br/>1,500 RPS"]
        T5["10PM: Wind-down<br/>600 RPS"]
        T6["1AM: Off-peak<br/>50 RPS"]
    end

    T1 --> T2 --> T3 --> T4 --> T5 --> T6

    style T4 fill:#eb3b5a,stroke:#333,color:#fff
    style T6 fill:#2d8659,stroke:#333,color:#fff

The Scaling Dilemma at the Evening Spike

graph TD
    subgraph "6:30 PM — Traffic starts climbing"
        A1["Current: 600 RPS<br/>Capacity: 800 RPS<br/>Headroom: 33%"]
    end

    subgraph "7:00 PM — Spike hits"
        A2["Demand: 1,500 RPS<br/>Capacity: 800 RPS<br/>Deficit: 700 RPS ❌"]
    end

    subgraph "7:02 PM — Autoscaler triggers"
        A3["Scaling event fired<br/>New tasks launching...<br/>Time to ready: 60-90s"]
    end

    subgraph "7:03 PM — Gap period"
        A4["700 RPS queued or dropped<br/>Latency spikes to 5-8 seconds<br/>15% of users see errors"]
    end

    A1 --> A2 --> A3 --> A4

    style A2 fill:#eb3b5a,stroke:#333,color:#fff
    style A4 fill:#ff6b6b,stroke:#333,color:#000

This is why reactive autoscaling alone is not enough.


The Three Scaling Strategies

Strategy A: Reactive Only (Cost Team Preferred)

graph TD
    A["Monitor: CPU > 70%<br/>for 2 minutes"] --> B["Scale up:<br/>add 20% capacity"]
    B --> C["Tasks ready<br/>in 60-90 seconds"]
    C --> D["Gap: 2-3 minutes<br/>of degraded performance"]

    style D fill:#eb3b5a,stroke:#333,color:#fff

| Metric | Value |
|---|---|
| Monthly compute cost | $42,000 |
| Idle resource % | 10% |
| p95 latency during spike | 4.2 seconds (SLA violation) |
| Dropped requests during scaling | 2-5% |
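
Strategy A corresponds directly to a step-scaling policy on the ECS service through Application Auto Scaling. A minimal sketch, assuming placeholder cluster/service names (manga-prod / mangaassist-api); the 70% CPU for 2 minutes trigger and the 20% step come from the figure above.

```python
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="ap-northeast-1")
cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

resource_id = "service/manga-prod/mangaassist-api"  # placeholder cluster/service

# Register the ECS service as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,      # floor of 10 tasks, per the scale-down rules
    MaxCapacity=200,
)

# Step policy: add 20% capacity whenever the alarm below fires.
policy = autoscaling.put_scaling_policy(
    PolicyName="reactive-scale-up-20pct",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "PercentChangeInCapacity",
        "Cooldown": 120,
        "StepAdjustments": [{"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 20}],
    },
)

# CPU > 70% for 2 consecutive minutes triggers the policy.
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "manga-prod"},
        {"Name": "ServiceName", "Value": "mangaassist-api"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```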

Strategy B: Over-Provisioned (Performance Team Preferred)

graph TD
    A["Always maintain:<br/>50% headroom above<br/>current baseline"] --> B["No scaling gap:<br/>capacity always available"]

    style B fill:#2d8659,stroke:#333,color:#fff

| Metric | Value |
|---|---|
| Monthly compute cost | $78,000 |
| Idle resource % | 45% |
| p95 latency during spike | 1.2 seconds (within SLA) |
| Dropped requests during scaling | 0% |

Strategy C: Predictive + Reactive Hybrid (Decision)

graph TD
    A["Predictive Scaling<br/>(scheduled)"] --> B["Pre-provision for known<br/>patterns:<br/>• Evening peak at 6:30 PM<br/>• Lunch spike at 11:30 AM<br/>• Manga release days"]

    C["Reactive Scaling<br/>(metric-based)"] --> D["Handle unexpected spikes:<br/>• Viral recommendation<br/>• Flash sale<br/>• External event"]

    B --> E["Combined: Capacity ready<br/>before predictable spikes<br/>+ reactive for surprises"]
    D --> E

    E --> F["Lambda Burst Pool<br/>(overflow)"] 

    style B fill:#2d8659,stroke:#333,color:#fff
    style D fill:#fd9644,stroke:#333,color:#000
    style F fill:#f9d71c,stroke:#333,color:#000

| Metric | Value |
|---|---|
| Monthly compute cost | $56,000 |
| Idle resource % | 22% |
| p95 latency during spike | 1.5 seconds (within SLA) |
| Dropped requests during scaling | 0% |

Predictive Scaling: How It Works

sequenceDiagram
    participant Scheduler as Predictive Scheduler
    participant ASG as ECS Auto Scaling
    participant Fargate as Fargate Tasks
    participant Lambda as Lambda Burst Pool
    participant CW as CloudWatch

    Note over Scheduler: 6:15 PM JST, 45 min before the predicted 7:00 PM peak
    Scheduler->>ASG: Scale to peak capacity (1,500 RPS)
    ASG->>Fargate: Launch additional tasks
    Note over Fargate: Tasks ready by 6:25 PM

    Note over CW: 7:00 PM — Actual peak arrives
    CW->>ASG: CPU at 65% (within target)
    Note over Fargate: Handling peak without scaling gap ✅

    Note over CW: 7:30 PM — Unexpected viral spike (+500 RPS)
    CW->>ASG: CPU at 85% — reactive scale-up
    ASG->>Fargate: Launch more tasks (60s warm-up)
    CW->>Lambda: Overflow to Lambda burst pool (instant)
    Note over Lambda: Lambda handles overflow during 60s gap

Predictive Schedule

| Time (JST) | Target Capacity | Trigger | Rationale |
|---|---|---|---|
| 5:30 AM | Minimum (100 RPS) | Scheduled | Deep off-peak |
| 6:00 AM | Morning base (300 RPS) | Scheduled | Pre-commute browsing |
| 11:30 AM | Lunch peak (900 RPS) | Scheduled | 30 min before lunch |
| 1:30 PM | Afternoon (600 RPS) | Scheduled | Post-lunch decline |
| 6:15 PM | Evening peak (1,600 RPS) | Scheduled | 45 min before peak |
| 10:00 PM | Wind-down (700 RPS) | Scheduled | Post-peak descent |
| 12:00 AM | Night minimum (100 RPS) | Scheduled | Night mode |
| Release day | 2x normal capacity | Calendar event | Manga volume release days see 2-3x traffic |
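
One way to implement this schedule is a set of Application Auto Scaling scheduled actions that raise MinCapacity ahead of each window; reactive policies can still scale above the floor. A sketch with the same placeholder resource names as above; the task counts are illustrative stand-ins for the RPS tiers, and the Timezone parameter keeps the cron expressions in JST.

```python
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="ap-northeast-1")
resource_id = "service/manga-prod/mangaassist-api"  # placeholder cluster/service

# (name, JST cron expression, minimum tasks). Task counts are illustrative
# stand-ins for the RPS tiers in the schedule table.
schedule = [
    ("night-minimum", "cron(0 0 * * ? *)",   10),
    ("deep-off-peak", "cron(30 5 * * ? *)",  10),
    ("morning-base",  "cron(0 6 * * ? *)",   30),
    ("lunch-peak",    "cron(30 11 * * ? *)", 90),
    ("afternoon",     "cron(30 13 * * ? *)", 60),
    ("evening-peak",  "cron(15 18 * * ? *)", 160),
    ("wind-down",     "cron(0 22 * * ? *)",  70),
]

for name, cron, min_tasks in schedule:
    autoscaling.put_scheduled_action(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        ScheduledActionName=f"predictive-{name}",
        Schedule=cron,
        Timezone="Asia/Tokyo",   # evaluate the cron expressions in JST
        # Raising MinCapacity pre-provisions capacity; reactive scaling can
        # still go above it, and scale-in cannot dip below it.
        ScalableTargetAction={"MinCapacity": min_tasks, "MaxCapacity": 200},
    )

# Release days would get one-off "at(YYYY-MM-DDTHH:MM:SS)" scheduled actions
# driven by the release calendar rather than a recurring cron entry.
```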

Lambda Burst Pool: The Safety Valve

graph TD
    A["Request arrives"] --> B{"Fargate tasks<br/>available?"}
    B -->|"Yes"| C["Route to Fargate<br/>(preferred — WebSocket,<br/>stateful)"]
    B -->|"No — at capacity"| D["Route to Lambda<br/>(overflow — HTTPS only,<br/>stateless)"]

    D --> E["Lambda limitations:<br/>• No WebSocket streaming<br/>• Stateless (no session affinity)<br/>• Cold start: 200-500ms<br/>• Concurrency limit: 1,000"]

    style C fill:#2d8659,stroke:#333,color:#fff
    style D fill:#fd9644,stroke:#333,color:#000
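
How the overflow decision could look at the application edge, as a sketch: prefer the Fargate tier and spill to the Lambda pool when Fargate signals saturation. Both endpoints are placeholders, and a real deployment would more likely express this with load balancer routing or a queue in front of the service than an in-process fallback.

```python
import requests

FARGATE_URL = "https://fargate.internal.mangaassist.example/v1/assist"  # placeholder
LAMBDA_URL = "https://xxxxxxxx.lambda-url.ap-northeast-1.on.aws/"       # placeholder

def route_request(payload: dict, timeout_s: float = 2.0) -> dict:
    """Prefer Fargate; fall back to the stateless Lambda pool when the
    Fargate tier is saturated (HTTP 503) or too slow to answer."""
    try:
        resp = requests.post(FARGATE_URL, json=payload, timeout=timeout_s)
        if resp.status_code != 503:        # 503 = at capacity, spill over
            resp.raise_for_status()
            return resp.json()
    except requests.Timeout:
        pass  # treat a timeout like saturation and spill over

    # Lambda path: HTTPS only, no WebSocket streaming, no session affinity.
    resp = requests.post(LAMBDA_URL, json=payload, timeout=timeout_s + 1.0)
    resp.raise_for_status()
    return resp.json()
```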

Lambda vs Fargate Cost at Overflow Volume

| Scenario | Duration | Volume | Fargate Cost | Lambda Cost | Winner |
|---|---|---|---|---|---|
| Spike overflow (15 min, 500 RPS) | 15 min | 450K requests | $0 (already scaled) | $180 | Lambda (no pre-provision needed) |
| Sustained high (2 hours, 500 RPS) | 2 hours | 3.6M requests | $120 | $1,440 | Fargate (cheaper sustained) |
| Flash spike (5 min, 2,000 RPS) | 5 min | 600K requests | $0 (can't scale in time) | $240 | Lambda (only option) |

Decision: Lambda for bursts under 20 minutes. Beyond that, reactive Fargate scaling should have caught up.
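
A quick check of the volume and Lambda-cost columns above, using the effective ~$0.0004 per overflow request implied by the first row (a blended request-plus-duration figure assumed here, not a quoted AWS price):

```python
# Effective blended cost per overflow request implied by the first row of
# the table ($180 / 450K). An assumption, not a quoted AWS price.
LAMBDA_COST_PER_REQUEST = 0.0004

scenarios = [
    ("Spike overflow",  500,  15),   # (name, RPS, minutes)
    ("Sustained high",  500, 120),
    ("Flash spike",    2000,   5),
]

for name, rps, minutes in scenarios:
    volume = rps * 60 * minutes
    lambda_cost = volume * LAMBDA_COST_PER_REQUEST
    print(f"{name:<15} {volume/1e3:>7,.0f}K requests  Lambda ~ ${lambda_cost:,.0f}")

# Reproduces the table: 450K -> $180, 3.6M -> $1,440, 600K -> $240.
```

At sustained load the same overflow capacity on Fargate works out to roughly $60 per hour ($120 for the 2-hour row), which is why Lambda only wins for short bursts or spikes that Fargate cannot scale into in time.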


The Provisioned Throughput Decision (Bedrock)

The most expensive scaling decision is whether to pay for Bedrock provisioned throughput:

graph TD
    subgraph "On-Demand Bedrock"
        OD["Pay per token<br/>No capacity guarantee<br/>Latency: 200-800ms TTFT<br/>Risk: throttling at peak"]
    end

    subgraph "Provisioned Throughput"
        PT["Reserved capacity<br/>$25K/month base<br/>Latency: 150-400ms TTFT<br/>Guaranteed throughput"]
    end

    subgraph "Hybrid (Decision)"
        HY["Provisioned for peak hours<br/>(8AM-11PM JST)<br/>On-demand for off-peak<br/>Saves ~$10K/month vs 24/7"]
    end

    style OD fill:#2d8659,stroke:#333,color:#fff
    style PT fill:#eb3b5a,stroke:#333,color:#fff
    style HY fill:#fd9644,stroke:#333,color:#000

| Configuration | Monthly Cost | Peak TTFT p95 | Off-Peak TTFT p95 | Risk |
|---|---|---|---|---|
| Always on-demand | $0 base + per-token | 500-800ms | 300-500ms | Throttling at peak |
| Always provisioned | $25,000 + per-token | 200-400ms | 200-400ms | Paying for idle off-peak |
| Hybrid (Decision) | $15,000 + per-token | 200-400ms | 400-700ms | Slightly higher off-peak latency |
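
A sketch of the hybrid routing rule, assuming the provisioned capacity is addressed by its provisioned-model ARN and off-peak traffic falls back to an on-demand model ID; both identifiers below are placeholders, and Bedrock accepts either form as the modelId of a Converse call.

```python
import datetime
import zoneinfo

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")

# Placeholders: real values would come from the Bedrock console or IaC.
PROVISIONED_MODEL_ARN = "arn:aws:bedrock:ap-northeast-1:123456789012:provisioned-model/abc123"
ON_DEMAND_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

JST = zoneinfo.ZoneInfo("Asia/Tokyo")

def pick_model_id(now: datetime.datetime | None = None) -> str:
    """Provisioned throughput during peak hours (8AM-11PM JST), on-demand otherwise."""
    now = now or datetime.datetime.now(JST)
    return PROVISIONED_MODEL_ARN if 8 <= now.hour < 23 else ON_DEMAND_MODEL_ID

def recommend(prompt: str) -> str:
    response = bedrock.converse(
        modelId=pick_model_id(),
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]
```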

Scale-Down Strategy: Equally Important

Scaling down too aggressively throws away warm capacity that the next ramp will need; scaling down too slowly pays for idle tasks.

graph TD
    A["Traffic decreasing"] --> B{"CPU < 40%<br/>for 10 minutes?"}
    B -->|"Yes"| C["Scale down by<br/>10% of current capacity"]
    B -->|"No"| D["Hold current capacity"]

    C --> E{"Minimum floor<br/>reached?"}
    E -->|"Yes"| F["Stop scaling down"]
    E -->|"No"| A

    style C fill:#f9d71c,stroke:#333,color:#000
    style F fill:#eb3b5a,stroke:#333,color:#fff

Scale-Down Rules

| Rule | Value | Rationale |
|---|---|---|
| Cool-down period after scale-up | 10 minutes | Prevent thrashing on bursty traffic |
| CPU threshold for scale-down | < 40% for 10 min | Conservative to avoid premature scale-down |
| Scale-down step size | 10% of current capacity | Gradual; don't kill too many tasks at once |
| Minimum tasks (floor) | 10 tasks | Enough for baseline + unexpected traffic |
| Scale-down block during known peaks | 30 min before predicted peak | Don't scale down into a known spike |
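
The last rule (blocking scale-in ahead of known peaks) can be implemented by suspending dynamic scale-in on the scalable target shortly before each predicted spike and re-enabling it afterwards. A sketch, reusing the placeholder resource ID from above and assuming the 30-minute window is driven by the same scheduler that runs the predictive actions.

```python
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="ap-northeast-1")
resource_id = "service/manga-prod/mangaassist-api"  # placeholder cluster/service

def set_scale_in_blocked(blocked: bool) -> None:
    """Suspend (or resume) dynamic scale-in; scale-out stays active either way."""
    autoscaling.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        SuspendedState={
            "DynamicScalingInSuspended": blocked,   # block scale-in before a known peak
            "DynamicScalingOutSuspended": False,
            "ScheduledScalingSuspended": False,
        },
    )

# Called by the predictive scheduler 30 minutes before a predicted peak,
# and again once the peak has passed.
set_scale_in_blocked(True)    # e.g. 6:30 PM JST, ahead of the 7:00 PM peak
# ... peak window ...
set_scale_in_blocked(False)
```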

2026 Update: Scale on Generative Workload Signals, Not Only CPU

Treat everything above this section as the baseline autoscaling architecture. This update keeps that design intact and describes how the same architecture should scale on generative-serving signals.

Current LLM-serving systems are moving away from CPU-driven autoscaling toward metrics that reflect actual generative bottlenecks.

  • For managed Bedrock, use cross-Region or global inference profiles to absorb burst demand and raise throughput on on-demand traffic. They complement but do not replace Provisioned Throughput.
  • Apply latency-optimized inference selectively on user-facing routes where TTFT matters most. It is usually a route-level optimization, not an all-traffic default.
  • For self-hosted stacks, autoscale on waiting requests, queue time, TTFT, prompt tokens/sec, generation tokens/sec, and KV cache pressure rather than raw CPU (see the sketch after the references below).
  • Separate prefill-heavy and decode-heavy bottlenecks where possible. Disaggregated prefill is often a better control knob than simply holding more generic compute headroom.
  • Maintain a graceful degradation plan for spikes: smaller output cap, cheaper model, reduced retrieval depth, or safe fallback before user-visible errors.

Recent references: AWS Bedrock cross-Region inference, AWS Bedrock global cross-Region inference, AWS Bedrock latency-optimized inference, KServe autoscaling with LLM metrics, vLLM metrics, vLLM disaggregated prefilling.
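
For the self-hosted path, a sketch of what scaling on generative signals might look like: poll Prometheus for vLLM's queue-depth and KV-cache gauges and derive a replica count from waiting requests instead of CPU. The Prometheus endpoint, target values, and replica math are all assumptions, and the metric names follow vLLM's documented Prometheus exporter (they can vary by version); in Kubernetes this logic would typically live in an HPA or KEDA configuration on the same metrics rather than a hand-rolled loop.

```python
import math

import requests

PROMETHEUS_URL = "http://prometheus.monitoring:9090"   # placeholder endpoint

# Target operating points: illustrative assumptions, not tuned values.
TARGET_WAITING_PER_REPLICA = 4      # queued requests each replica should absorb
MAX_KV_CACHE_USAGE = 0.90           # scale out before the KV cache saturates

def prom_scalar(query: str) -> float:
    """Run an instant PromQL query and return the first value (0.0 if empty)."""
    r = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=5)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def desired_replicas(current_replicas: int) -> int:
    # Metric names as exposed by vLLM's Prometheus endpoint.
    waiting = prom_scalar("sum(vllm:num_requests_waiting)")
    kv_usage = prom_scalar("max(vllm:gpu_cache_usage_perc)")   # fraction 0-1

    # Size primarily on queue depth rather than CPU.
    replicas = max(1, math.ceil(waiting / TARGET_WAITING_PER_REPLICA))
    # Hold or add capacity while the KV cache is near saturation.
    if kv_usage >= MAX_KV_CACHE_USAGE:
        replicas = max(replicas, current_replicas + 1)
    return replicas
```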

Reversal Triggers

| Trigger | Action |
|---|---|
| p95 latency exceeds 2s during a scaling event | Increase predictive headroom by 20% |
| Idle resource cost exceeds 30% of compute budget | Reduce predictive scaling buffer; tighten reactive thresholds |
| Lambda burst pool hits concurrency limit | Increase Lambda concurrency or expand Fargate reactive scaling speed |
| Predictive schedule misaligns with actual traffic | Retrain prediction model with last 30 days of traffic |
| Manga release spike significantly exceeds 2x prediction | Add calendar-based alerts for "big release days" with manual capacity override |

Impact on Trilemma

| Dimension | Reactive Only | Over-Provisioned | Predictive Hybrid (Decision) |
|---|---|---|---|
| Cost | $42K/mo (cheapest) | $78K/mo (most expensive) | $56K/mo (~28% savings vs over-provisioned) |
| Performance (peak p95) | 4.2s (SLA violation) | 1.2s (best) | 1.5s (within SLA) |
| Availability | 95-98% during spikes | 99.9% | 99.5% (Lambda overflow covers gaps) |
| QACPI | Low (latency kills it) | Medium (cost drags) | Highest |