User Story
As an SRE lead for MangaAssist,
I want to define an autoscaling strategy that provides enough capacity headroom for latency SLAs without paying for idle resources,
So that peak traffic gets fast responses and off-peak traffic doesn't burn budget on empty containers.
The Debate
```mermaid
graph TD
subgraph "Performance Team"
P["Keep 50% headroom at all times.<br/>Autoscaling is reactive — by the<br/>time new tasks spin up (60-90s),<br/>users have already seen timeouts.<br/>Pre-provision for peak."]
end
subgraph "Cost Team"
C["50% headroom means paying for<br/>containers that sit idle 70% of<br/>the time. At $0.04/vCPU-hour,<br/>that's $28K/month in waste.<br/>Scale tighter, accept occasional<br/>latency spikes."]
end
subgraph "Inference Team"
I["If tasks are overloaded,<br/>LLM requests queue up.<br/>Queued requests = stale context<br/>(user may have navigated away).<br/>Overloaded tasks also drop<br/>WebSocket connections."]
end
P ---|"Idle cost"| C
C ---|"Quality under<br/>load"| I
I ---|"Capacity<br/>planning"| P
style P fill:#4ecdc4,stroke:#333,color:#000
style C fill:#f9d71c,stroke:#333,color:#000
style I fill:#ff6b6b,stroke:#333,color:#000
```
Acceptance Criteria
Traffic Patterns and the Scaling Challenge
MangaAssist Daily Traffic Profile (JST)
```mermaid
graph LR
subgraph "Traffic Pattern — Typical Weekday"
direction TB
T1["6AM: Wake-up ramp<br/>200 RPS"]
T2["12PM: Lunch spike<br/>800 RPS"]
T3["3PM: Afternoon lull<br/>500 RPS"]
T4["7PM: Evening peak<br/>1,500 RPS"]
T5["10PM: Wind-down<br/>600 RPS"]
T6["1AM: Off-peak<br/>50 RPS"]
end
T1 --> T2 --> T3 --> T4 --> T5 --> T6
style T4 fill:#eb3b5a,stroke:#333,color:#fff
style T6 fill:#2d8659,stroke:#333,color:#fff
```
The Scaling Dilemma at the Evening Spike
```mermaid
graph TD
subgraph "6:30 PM — Traffic starts climbing"
A1["Current: 600 RPS<br/>Capacity: 800 RPS<br/>Headroom: 33%"]
end
subgraph "7:00 PM — Spike hits"
A2["Demand: 1,500 RPS<br/>Capacity: 800 RPS<br/>Deficit: 700 RPS ❌"]
end
subgraph "7:02 PM — Autoscaler triggers"
A3["Scaling event fired<br/>New tasks launching...<br/>Time to ready: 60-90s"]
end
subgraph "7:03 PM — Gap period"
A4["700 RPS queued or dropped<br/>Latency spikes to 5-8 seconds<br/>15% of users see errors"]
end
A1 --> A2 --> A3 --> A4
style A2 fill:#eb3b5a,stroke:#333,color:#fff
style A4 fill:#ff6b6b,stroke:#333,color:#000
```
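The diagram's own numbers make the stakes concrete. A rough back-of-the-envelope check, assuming the gap runs from the 7:00 PM spike until new tasks are ready (roughly 3 minutes here):

```python
# Rough impact of the 7:00 PM scaling gap, using the figures above.
demand_rps = 1_500
capacity_rps = 800
gap_seconds = 180  # ~2 min detection + 60-90 s task warm-up (approximate)

deficit_rps = demand_rps - capacity_rps   # 700 RPS short
affected = deficit_rps * gap_seconds      # ~126,000 requests
print(f"~{affected:,} requests queued or dropped during the gap")
```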
This is why reactive autoscaling alone is not enough.
The Three Scaling Strategies
Strategy A: Reactive Only (Cost Team Preferred)
```mermaid
graph TD
A["Monitor: CPU > 70%<br/>for 2 minutes"] --> B["Scale up:<br/>add 20% capacity"]
B --> C["Tasks ready<br/>in 60-90 seconds"]
C --> D["Gap: 2-3 minutes<br/>of degraded performance"]
style D fill:#eb3b5a,stroke:#333,color:#fff
```
| Metric | Value |
|---|---|
| Monthly compute cost | $42,000 |
| Idle resource % | 10% |
| p95 latency during spike | 4.2 seconds (SLA violation) |
| Dropped requests during scaling | 2-5% |
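Strategy A maps almost directly onto an Application Auto Scaling step policy plus a CloudWatch alarm. A minimal sketch, assuming a hypothetical `manga-assist` cluster with an `api` service and illustrative capacity bounds:

```python
import boto3

aas = boto3.client("application-autoscaling")
cw = boto3.client("cloudwatch")

RESOURCE = "service/manga-assist/api"  # hypothetical cluster/service names

# Register the ECS service as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=RESOURCE,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,   # floor from the scale-down rules below
    MaxCapacity=200,  # illustrative ceiling
)

# "Add 20% capacity" whenever the alarm below fires.
policy = aas.put_scaling_policy(
    PolicyName="reactive-cpu-scale-out",
    ServiceNamespace="ecs",
    ResourceId=RESOURCE,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "PercentChangeInCapacity",
        "StepAdjustments": [{"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 20}],
        "Cooldown": 120,
        "MetricAggregationType": "Average",
    },
)

# "CPU > 70% for 2 minutes": two consecutive 60-second periods above threshold.
cw.put_metric_alarm(
    AlarmName="manga-assist-api-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "manga-assist"},
        {"Name": "ServiceName", "Value": "api"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```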
Strategy B: Over-Provisioned (Performance Team Preferred)
```mermaid
graph TD
A["Always maintain:<br/>50% headroom above<br/>current baseline"] --> B["No scaling gap:<br/>capacity always available"]
style B fill:#2d8659,stroke:#333,color:#fff
```
| Metric | Value |
|---|---|
| Monthly compute cost | $78,000 |
| Idle resource % | 45% |
| p95 latency during spike | 1.2 seconds (within SLA) |
| Dropped requests during scaling | 0% |
Strategy C: Predictive + Reactive Hybrid (Decision)
```mermaid
graph TD
A["Predictive Scaling<br/>(scheduled)"] --> B["Pre-provision for known<br/>patterns:<br/>• Evening peak at 6:30 PM<br/>• Lunch spike at 11:30 AM<br/>• Manga release days"]
C["Reactive Scaling<br/>(metric-based)"] --> D["Handle unexpected spikes:<br/>• Viral recommendation<br/>• Flash sale<br/>• External event"]
B --> E["Combined: Capacity ready<br/>before predictable spikes<br/>+ reactive for surprises"]
D --> E
E --> F["Lambda Burst Pool<br/>(overflow)"]
style B fill:#2d8659,stroke:#333,color:#fff
style D fill:#fd9644,stroke:#333,color:#000
style F fill:#f9d71c,stroke:#333,color:#000
```
| Metric | Value |
|---|---|
| Monthly compute cost | $56,000 |
| Idle resource % | 22% |
| p95 latency during spike | 1.5 seconds (within SLA) |
| Dropped requests during scaling | 0% |
Predictive Scaling: How It Works
```mermaid
sequenceDiagram
participant Scheduler as Predictive Scheduler
participant ASG as ECS Auto Scaling
participant Fargate as Fargate Tasks
participant Lambda as Lambda Burst Pool
participant CW as CloudWatch
Note over Scheduler: 6:15 PM JST (45 min before predicted 7:00 PM peak)
Scheduler->>ASG: Scale to evening peak target (1,600 RPS)
ASG->>Fargate: Launch additional tasks
Note over Fargate: Tasks ready by 6:25 PM
Note over CW: 7:00 PM — Actual peak arrives
CW->>ASG: CPU at 65% (within target)
Note over Fargate: Handling peak without scaling gap ✅
Note over CW: 7:30 PM — Unexpected viral spike (+500 RPS)
CW->>ASG: CPU at 85% — reactive scale-up
ASG->>Fargate: Launch more tasks (60s warm-up)
CW->>Lambda: Overflow to Lambda burst pool (instant)
Note over Lambda: Lambda handles overflow during 60s gap
```
Predictive Schedule
| Time (JST) | Target Capacity | Trigger | Rationale |
|---|---|---|---|
| 5:30 AM | Minimum (100 RPS) | Scheduled | Deep off-peak |
| 6:00 AM | Morning base (300 RPS) | Scheduled | Pre-commute browsing |
| 11:30 AM | Lunch peak (900 RPS) | Scheduled | 30 min before lunch |
| 1:30 PM | Afternoon (600 RPS) | Scheduled | Post-lunch decline |
| 6:15 PM | Evening peak (1,600 RPS) | Scheduled | 45 min before peak |
| 10:00 PM | Wind-down (700 RPS) | Scheduled | Post-peak descent |
| 12:00 AM | Night minimum (100 RPS) | Scheduled | Night mode |
| Release day | 2x normal capacity | Calendar event | Manga volume release days see 2-3x traffic |
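Each row of this schedule becomes one Application Auto Scaling scheduled action. A sketch of the 6:15 PM entry, assuming roughly 10 RPS per Fargate task (an illustrative conversion; calibrate from load tests) and the same hypothetical resource IDs as above:

```python
import boto3

aas = boto3.client("application-autoscaling")

RPS_PER_TASK = 10  # assumed per-task throughput, not a measured value

def scheduled_capacity(name: str, cron: str, target_rps: int) -> None:
    """Raise the capacity floor ahead of a predicted peak; reactive
    scaling still handles anything above the target."""
    tasks = -(-target_rps // RPS_PER_TASK)  # ceiling division
    aas.put_scheduled_action(
        ServiceNamespace="ecs",
        ResourceId="service/manga-assist/api",  # hypothetical
        ScalableDimension="ecs:service:DesiredCount",
        ScheduledActionName=name,
        Schedule=f"cron({cron})",
        Timezone="Asia/Tokyo",  # the schedule above is defined in JST
        ScalableTargetAction={"MinCapacity": tasks},
    )

# 6:15 PM JST evening-peak entry (1,600 RPS target -> 160 tasks).
scheduled_capacity("evening-peak", "15 18 * * ? *", 1_600)
```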
Lambda Burst Pool: The Safety Valve
```mermaid
graph TD
A["Request arrives"] --> B{"Fargate tasks<br/>available?"}
B -->|"Yes"| C["Route to Fargate<br/>(preferred — WebSocket,<br/>stateful)"]
B -->|"No — at capacity"| D["Route to Lambda<br/>(overflow — HTTPS only,<br/>stateless)"]
D --> E["Lambda limitations:<br/>• No WebSocket streaming<br/>• Stateless (no session affinity)<br/>• Cold start: 200-500ms<br/>• Concurrency limit: 1,000"]
style C fill:#2d8659,stroke:#333,color:#fff
style D fill:#fd9644,stroke:#333,color:#000
```
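The routing decision itself can live in the API layer. A minimal sketch of the overflow check; the saturation threshold and the `manga-assist-overflow` function name are hypothetical:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

FARGATE_MAX_IN_FLIGHT = 800  # illustrative saturation threshold

def route(request: dict, fargate_in_flight: int):
    """Prefer Fargate; spill stateless HTTPS traffic to Lambda at capacity."""
    if fargate_in_flight < FARGATE_MAX_IN_FLIGHT:
        return "fargate"  # normal path: stateful, WebSocket-capable
    if request.get("websocket"):
        # Lambda overflow is HTTPS-only, so WebSocket upgrades cannot spill;
        # reject with a retry hint instead.
        return "shed"
    resp = lambda_client.invoke(
        FunctionName="manga-assist-overflow",  # hypothetical function name
        Payload=json.dumps(request).encode(),
    )
    return json.loads(resp["Payload"].read())
```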
Lambda vs Fargate Cost at Overflow Volume
| Scenario | Duration | Volume | Fargate Cost | Lambda Cost | Winner |
|---|---|---|---|---|---|
| Spike overflow (500 RPS) | 15 min | 450K requests | $0 (already scaled) | $180 | Lambda (no pre-provision needed) |
| Sustained high (500 RPS) | 2 hours | 3.6M requests | $120 | $1,440 | Fargate (cheaper sustained) |
| Flash spike (2,000 RPS) | 5 min | 600K requests | $0 (can't scale in time) | $240 | Lambda (only option) |
Decision: Lambda for bursts under 20 minutes. Beyond that, reactive Fargate scaling should have caught up.
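The break-even can be reproduced from the table's implied unit rates. A quick check, back-deriving ~$0.0004 per Lambda request from the $180 / 450K row (these rates come from the table, not from quoted AWS pricing):

```python
LAMBDA_PER_REQUEST = 180 / 450_000  # ~$0.0004/request, from the spike-overflow row
FARGATE_PER_MINUTE = 120 / 120      # $1/min sustained at 500 RPS, from row 2

def lambda_cost(rps: int, minutes: int) -> float:
    return rps * minutes * 60 * LAMBDA_PER_REQUEST

print(lambda_cost(500, 15))   # 180.0  -> matches "spike overflow"
print(lambda_cost(500, 120))  # 1440.0 -> matches "sustained high"
print(lambda_cost(2000, 5))   # 240.0  -> matches "flash spike"

# At 500 RPS, Lambda runs ~$12/min vs ~$1/min for warm Fargate, so once
# reactive Fargate has caught up (well inside 20 minutes), Lambda's
# per-request premium dominates and traffic should move back.
```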
The Provisioned Throughput Decision (Bedrock)
The most expensive scaling decision is Bedrock provisioned throughput:
```mermaid
graph TD
subgraph "On-Demand Bedrock"
OD["Pay per token<br/>No capacity guarantee<br/>Latency: 200-800ms TTFT<br/>Risk: throttling at peak"]
end
subgraph "Provisioned Throughput"
PT["Reserved capacity<br/>$25K/month base<br/>Latency: 150-400ms TTFT<br/>Guaranteed throughput"]
end
subgraph "Hybrid (Decision)"
HY["Provisioned for peak hours<br/>(8AM-11PM JST)<br/>On-demand for off-peak<br/>Saves ~$10K/month vs 24/7"]
end
style OD fill:#2d8659,stroke:#333,color:#fff
style PT fill:#eb3b5a,stroke:#333,color:#fff
style HY fill:#fd9644,stroke:#333,color:#000
```
| Configuration | Monthly Cost | Peak TTFT p95 | Off-Peak TTFT p95 | Risk |
|---|---|---|---|---|
| Always on-demand | $0 base + per-token | 500-800ms | 300-500ms | Throttling at peak |
| Always provisioned | $25,000 + per-token | 200-400ms | 200-400ms | Paying for idle off-peak |
| Hybrid (Decision) | $15,000 + per-token | 200-400ms | 400-700ms | Slightly higher off-peak latency |
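Operationally, the hybrid can be as simple as choosing the model identifier per request: Bedrock accepts a provisioned-throughput ARN in place of the on-demand model ID. A sketch with hypothetical identifiers:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical identifiers; substitute the real provisioned model ARN.
PROVISIONED_ARN = "arn:aws:bedrock:ap-northeast-1:123456789012:provisioned-model/EXAMPLE"
ON_DEMAND_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # illustrative

def bedrock_model_id(now: datetime | None = None) -> str:
    """Provisioned throughput during peak hours (8AM-11PM JST), on-demand otherwise."""
    now = now or datetime.now(ZoneInfo("Asia/Tokyo"))
    return PROVISIONED_ARN if 8 <= now.hour < 23 else ON_DEMAND_MODEL
```

The returned ID is then passed as the `modelId` parameter of `bedrock-runtime` calls such as `InvokeModel` or `Converse`.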
Scale-Down Strategy: Equally Important
Scaling down too aggressively wastes the investment in warm tasks. Scaling down too slowly wastes money.
```mermaid
graph TD
A["Traffic decreasing"] --> B{"CPU < 40%<br/>for 10 minutes?"}
B -->|"Yes"| C["Scale down by<br/>10% of current capacity"]
B -->|"No"| D["Hold current capacity"]
C --> E{"Minimum floor<br/>reached?"}
E -->|"Yes"| F["Stop scaling down"]
E -->|"No"| A
style C fill:#f9d71c,stroke:#333,color:#000
style F fill:#eb3b5a,stroke:#333,color:#fff
```
Scale-Down Rules
| Rule | Value | Rationale |
|---|---|---|
| Cool-down period after scale-up | 10 minutes | Prevent thrashing on bursty traffic |
| CPU threshold for scale-down | < 40% for 10 min | Conservative to avoid premature scale-down |
| Scale-down step size | 10% of current capacity | Gradual; don't kill too many tasks at once |
| Minimum tasks (floor) | 10 tasks | Enough for baseline + unexpected traffic |
| Scale-down block during known peaks | 30 min before predicted peak | Don't scale down into a known spike |
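The last rule ("block scale-down before known peaks") maps onto Application Auto Scaling's suspend/resume switches. A sketch that suspends only dynamic scale-in 30 minutes before a predicted peak, assuming the same hypothetical scalable target as above:

```python
import boto3

aas = boto3.client("application-autoscaling")

def set_scale_in_suspended(suspended: bool) -> None:
    """Toggle dynamic scale-in; scale-out and scheduled actions stay active."""
    aas.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId="service/manga-assist/api",  # hypothetical
        ScalableDimension="ecs:service:DesiredCount",
        SuspendedState={
            "DynamicScalingInSuspended": suspended,
            "DynamicScalingOutSuspended": False,
            "ScheduledScalingSuspended": False,
        },
    )

# Called by a scheduler 30 min before each predicted peak...
set_scale_in_suspended(True)
# ...and again once the peak has passed.
set_scale_in_suspended(False)
```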
2026 Update: Scale on Generative Workload Signals, Not Only CPU
Everything above this section remains the baseline autoscaling architecture. This update keeps that design intact and describes how a current deployment should scale on generative-serving signals rather than CPU alone.
Current LLM-serving systems are moving away from CPU-driven autoscaling toward metrics that reflect actual generative bottlenecks.
- For managed Bedrock, use cross-Region or global inference profiles to absorb burst demand and raise throughput on on-demand traffic. They complement but do not replace Provisioned Throughput.
- Apply latency-optimized inference selectively on user-facing routes where TTFT matters most. It is usually a route-level optimization, not an all-traffic default.
- For self-hosted stacks, autoscale on waiting requests, queue time, TTFT, prompt tokens/sec, generation tokens/sec, and KV cache pressure rather than raw CPU (see the poller sketch after this list).
- Separate prefill-heavy and decode-heavy bottlenecks where possible. Disaggregated prefill is often a better control knob than simply holding more generic compute headroom.
- Maintain a graceful degradation plan for spikes: smaller output cap, cheaper model, reduced retrieval depth, or safe fallback before user-visible errors.
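For the self-hosted case, the queue-depth signal is exposed directly on vLLM's Prometheus endpoint as `vllm:num_requests_waiting`. A minimal poller sketch; the scaling hook it feeds (KEDA, HPA, KServe) is left abstract, and the endpoint URL and threshold are illustrative:

```python
import urllib.request

def vllm_requests_waiting(metrics_url: str = "http://localhost:8000/metrics") -> float:
    """Read vLLM's queued-request gauge from its Prometheus /metrics endpoint."""
    with urllib.request.urlopen(metrics_url) as resp:
        for line in resp.read().decode().splitlines():
            # Gauge lines look like: vllm:num_requests_waiting{...} 3.0
            if line.startswith("vllm:num_requests_waiting"):
                return float(line.rsplit(" ", 1)[-1])
    return 0.0

# Example policy (threshold illustrative): scale out when queued work
# exceeds ~2 waiting requests per replica, instead of watching CPU.
```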
Recent references: AWS Bedrock cross-Region inference, AWS Bedrock global cross-Region inference, AWS Bedrock latency-optimized inference, KServe autoscaling with LLM metrics, vLLM metrics, vLLM disaggregated prefilling.
Reversal Triggers
| Trigger | Action |
|---|---|
| p95 latency exceeds 2s during a scaling event | Increase predictive headroom by 20% |
| Idle resource cost exceeds 30% of compute budget | Reduce predictive scaling buffer; tighten reactive thresholds |
| Lambda burst pool hits concurrency limit | Increase Lambda concurrency or speed up Fargate reactive scaling |
| Predictive schedule misaligns with actual traffic | Retrain the prediction model on the last 30 days of traffic |
| Manga release spike significantly exceeds 2x prediction | Add calendar-based alerts for "big release days" with manual capacity override |
Impact on Trilemma
| Dimension | Reactive Only | Over-Provisioned | Predictive Hybrid (Decision) |
|---|---|---|---|
| Cost | $42K/mo (cheapest) | $78K/mo (most expensive) | $56K/mo (28% savings vs over-provisioned) |
| Performance (peak p95) | 4.2s (SLA violation) | 1.2s (best) | 1.5s (within SLA) |
| Availability | 95-98% during spikes | 99.9% | 99.5% (Lambda overflow covers gaps) |
| QACPI | Low (latency kills it) | Medium (cost drags) | Highest |