02: Auto-Scaling and Utilization for GenAI Workloads
MangaAssist is a chatbot for a Japanese online manga store, running on AWS. It uses Bedrock Claude 3 (Sonnet for complex queries, Haiku for simple ones), OpenSearch Serverless for vector retrieval, DynamoDB for session and catalog data, ECS Fargate for orchestration, API Gateway WebSocket APIs for real-time delivery, and ElastiCache Redis for caching. The system handles 1M messages/day with a target end-to-end response time of under 3 seconds.
Skill Mapping
| Field | Value |
|---|---|
| Domain | 4 -- Operational Efficiency Optimization |
| Task | 4.1 -- Optimize foundation model cost and performance |
| Skill | 4.1.3 -- Implement strategies for high-performance FM systems including batching, capacity planning, utilization monitoring, auto-scaling, and provisioned throughput optimization |
| Focus | Deep-dive into auto-scaling patterns, traffic analysis, Bedrock-aware scaling signals, and dynamic provisioned throughput adjustment |
1. Why FM Workloads Break Traditional Auto-Scaling
1.1 Challenges Unique to Foundation Model Workloads
Auto-scaling rules designed for stateless web services (scale on CPU > 70%, scale on request count > 1000/min) fail for FM workloads because of three properties:
| Property | Web Service Behavior | FM Workload Behavior | Impact on Scaling |
|---|---|---|---|
| Variable token counts | Request sizes are roughly uniform (a few KB) | A greeting is 50 tokens; a multi-turn recommendation is 2,000 tokens. Same request rate, 40x compute variance. | CPU-based scaling under-provisions for heavy queries and over-provisions for light ones |
| Variable latency | Response times are predictable (10-100ms) | Haiku responds in 200ms, Sonnet in 1-4s, long context in 8s+. Identical request rate, wildly different in-flight concurrency. | Request-count scaling misses the concurrency explosion during Sonnet-heavy periods |
| Burstiness | Traffic ramps gradually | Manga releases trigger step-function surges: 5x in under 60 seconds | Cooldown periods designed for gradual ramps cause under-scaling during bursts |
1.2 The Compound Problem
These three properties interact. During a manga release event:
- Traffic volume spikes 5x (burstiness).
- Users ask complex questions about the new release -- "Compare this volume to the previous arc" -- routing to Sonnet instead of Haiku (variable latency).
- Recommendation queries with full context windows consume 2,000+ tokens each (variable token counts).
The result: effective compute demand spikes 10-15x even though raw request count only went up 5x. A scaling policy that only watches request count scales to 5x capacity when it needs 15x.
graph LR
subgraph "What Request-Count Scaling Sees"
A[5x request spike] --> B[Scale to 5x capacity]
end
subgraph "What Actually Happens"
C[5x request spike] --> D[Heavier intent mix<br>+2x Sonnet ratio]
D --> E[Longer token counts<br>+3x avg tokens]
E --> F[Effective demand: 15x]
end
B --> G[Result: Severe<br>under-provisioning]
F --> G
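A back-of-the-envelope calculation makes the gap concrete. The 5x request and 3x token multipliers come from the diagram; the baseline figures below are illustrative assumptions:
```python
# Illustrative only: why a 5x request spike becomes ~15x effective demand.
baseline_rps = 100                 # assumed steady-state requests/second
baseline_avg_tokens = 400          # assumed average tokens per request

spike_rps = baseline_rps * 5                # burstiness: 5x request volume
spike_avg_tokens = baseline_avg_tokens * 3  # heavier queries: 3x average tokens

baseline_tps = baseline_rps * baseline_avg_tokens  # 40,000 tokens/s
spike_tps = spike_rps * spike_avg_tokens           # 600,000 tokens/s

print(spike_tps / baseline_tps)  # 15.0x token throughput vs. only 5x requests
```
The Sonnet shift compounds this further on the latency and cost side, since each of those tokens is now more likely to be processed by the slower, more expensive model.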
2. MangaAssist Traffic Patterns
2.1 Daily Curve -- JP Evening Peak
MangaAssist's traffic follows a consistent daily pattern driven by Japanese consumer habits. All times below are in JST (UTC+9).
| Time Window (JST) | Traffic Level | Dominant Intents | Scaling Posture |
|---|---|---|---|
| 00:00 - 02:00 | Moderate declining | Late-night browsing, recommendations | Scale-in beginning |
| 02:00 - 06:00 | Trough (~15% of peak) | Automated health checks, occasional insomnia browsing | Minimum capacity (5 tasks) |
| 06:00 - 09:00 | Gradual ramp | Commute browsing, order status checks | Scheduled pre-scale to 10 tasks |
| 09:00 - 12:00 | Moderate (~40% of peak) | Product search, FAQ | Target tracking active |
| 12:00 - 13:00 | Lunch spike (~60% of peak) | Browse recommendations, wishlist queries | Step scaling responds |
| 13:00 - 17:00 | Moderate steady (~45% of peak) | Mixed intents | Steady state |
| 17:00 - 19:00 | Rapid ramp to peak | After-work browsing, recommendations | Scheduled pre-scale to 30 tasks |
| 19:00 - 23:00 | Peak (~100%) | Recommendations, multi-turn conversations, purchases | Full provisioned throughput |
| 23:00 - 00:00 | Peak declining | Late purchases, order confirmations | Begin scheduled scale-in |
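The scheduled pre-scale rows above map directly onto Application Auto Scaling scheduled actions. A minimal sketch of the 17:00 JST evening pre-scale (cluster and service names are hypothetical; note the JST-to-UTC conversion in the cron expression):
```python
import boto3

aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")

# Raise the floor to 30 tasks every day at 17:00 JST (08:00 UTC), ahead of
# the evening ramp. Target tracking still handles everything above the floor.
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="evening-pre-scale",
    ResourceId="service/mangaassist-cluster/mangaassist-orchestrator",
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 8 * * ? *)",  # 08:00 UTC == 17:00 JST
    ScalableTargetAction={"MinCapacity": 30, "MaxCapacity": 100},
)
```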
2.2 Weekly Pattern -- Release Days and Weekends
| Day | Modifier | Reason |
|---|---|---|
| Monday | +30% above baseline | Weekly Shonen Jump digital release (midnight JST) |
| Tuesday - Thursday | Baseline | Normal traffic |
| Friday | +15% above baseline | Weekend anticipation, paycheck spending |
| Saturday | +40% above baseline | Full-day browsing, sustained peak window |
| Sunday | +25% above baseline | Morning browsing; evening traffic trails off ahead of the Monday release |
2.3 Seasonal Events -- Manga Sales and Promotions
gantt
title MangaAssist Annual Capacity Calendar
dateFormat YYYY-MM-DD
axisFormat %b
section Sustained High
New Year Sale (5x) :crit, 2026-01-01, 7d
Golden Week (3x) :crit, 2026-04-29, 7d
Amazon Prime Day (4x) :crit, 2026-07-15, 3d
Summer Comiket (3x) :crit, 2026-08-10, 4d
Black Friday (5x) :crit, 2026-11-27, 4d
Cyber Monday (4x) :crit, 2026-11-30, 2d
Year-End Sale (4x) :crit, 2026-12-20, 12d
section Burst Events
Jump Festa (3x) :active, 2026-12-19, 2d
Manga Award Season (2x) :active, 2026-03-01, 14d
Anime Season Premieres (2x) :active, 2026-01-10, 7d
Anime Season Premieres (2x) :active, 2026-04-10, 7d
Anime Season Premieres (2x) :active, 2026-07-10, 7d
Anime Season Premieres (2x) :active, 2026-10-10, 7d
3. ECS Fargate Scaling Strategies
3.1 Why CPU/Memory Alone Is Insufficient
The MangaAssist orchestrator running on ECS Fargate is I/O bound, not CPU bound. During peak traffic:
| Metric | Value During Peak | Why It Misleads |
|---|---|---|
| CPU utilization | 25-35% | Most time is spent `await`-ing Bedrock responses and Redis lookups |
| Memory utilization | 40-50% | Stable; conversation context is small per request |
| Network I/O | High | Reflects Bedrock calls, but not exposed as a scaling metric natively |
| Bedrock queue depth | 50-200 | The real bottleneck indicator -- how many requests are waiting for Bedrock |
| Active Bedrock invocations | 30-80 per task | Each task can hold many concurrent async invocations |
With a CPU-based policy at a 70% threshold, the service never scales out: CPU sits near 30% while Bedrock queue depth climbs to 200 and users experience 8-second latencies.
3.2 Custom Metric-Based Scaling
MangaAssist publishes three custom CloudWatch metrics from each ECS task and scales on them:
graph TD
subgraph "ECS Task Metrics Published Every 10s"
M1[bedrock_queue_depth<br>In-flight Bedrock calls waiting]
M2[active_bedrock_invocations<br>Concurrent calls to Bedrock]
M3[request_latency_p99<br>99th percentile response time]
end
subgraph "CloudWatch Aggregation"
M1 --> A1[Aggregate: Maximum across tasks]
M2 --> A2[Aggregate: Sum across tasks]
M3 --> A3[Aggregate: Maximum across tasks]
end
subgraph "Scaling Policies"
A1 --> P1[Step Scaling Policy<br>Queue Depth]
A2 --> P2[Target Tracking Policy<br>Invocations per Task = 40]
A3 --> P3[Step Scaling Policy<br>Latency Breach]
end
subgraph "ECS Auto Scaling"
P1 --> S[Desired Task Count]
P2 --> S
P3 --> S
S --> T[ECS Fargate Tasks<br>min=5, max=100]
end
Policy interaction: When multiple policies disagree, ECS Application Auto Scaling uses the maximum desired count across all policies. This ensures that the most aggressive policy wins during a crisis.
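A minimal sketch of how each task could publish these metrics with boto3 (the metric names match the diagram; the TaskId dimension and the 10-second cadence are assumptions from the text):
```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def publish_task_metrics(
    task_id: str, queue_depth: int, active_invocations: int, p99_ms: float
) -> None:
    """Publish the three scaling signals for this task; called every 10s."""
    dims = [{"Name": "TaskId", "Value": task_id}]
    cloudwatch.put_metric_data(
        Namespace="MangaAssist",
        MetricData=[
            {"MetricName": "bedrock_queue_depth", "Value": queue_depth,
             "Unit": "Count", "Dimensions": dims},
            {"MetricName": "active_bedrock_invocations", "Value": active_invocations,
             "Unit": "Count", "Dimensions": dims},
            {"MetricName": "request_latency_p99", "Value": p99_ms,
             "Unit": "Milliseconds", "Dimensions": dims},
        ],
    )
```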
3.3 Scaling Configuration Details
| Policy | Metric | Scale-Out Trigger | Scale-Out Action | Scale-In Trigger | Scale-In Action | Cooldown (Out/In) |
|---|---|---|---|---|---|---|
| Queue Depth Step | `bedrock_queue_depth` max | > 50 for 2 min | +2 tasks | < 20 for 10 min | -1 task | 60s / 300s |
| Queue Depth Emergency | `bedrock_queue_depth` max | > 200 for 30s | +10 tasks | N/A | N/A | 60s / N/A |
| Invocations Target Tracking | `active_bedrock_invocations` sum / task count | > 40 per task | Auto-calculated | < 40 per task | Auto-calculated | 120s / 300s |
| Latency Step | `request_latency_p99` max | > 4,000ms for 2 min | +3 tasks | < 2,000ms for 15 min | -1 task | 90s / 300s |
| Scheduled: Evening Pre-Scale | Time-based | 17:00 JST daily | min_capacity = 30 | 23:30 JST daily | min_capacity = 10 | N/A |
| Scheduled: Overnight Trough | Time-based | 02:00 JST daily | min_capacity = 5 | 06:00 JST daily | min_capacity = 10 | N/A |
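As a sketch, the Queue Depth Step row could be registered like this with Application Auto Scaling (resource names are hypothetical, and the CloudWatch alarm that fires at queue depth > 50 is created separately and pointed at this policy's ARN):
```python
import boto3

aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")

# Step intervals are offsets from the alarm threshold (50). Metric 50-200
# adds 2 tasks; above 200 adds 10, folding in the emergency tier.
aas.put_scaling_policy(
    PolicyName="queue-depth-step-out",
    ServiceNamespace="ecs",
    ResourceId="service/mangaassist-cluster/mangaassist-orchestrator",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 60,
        "MetricAggregationType": "Maximum",
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 150,
             "ScalingAdjustment": 2},
            {"MetricIntervalLowerBound": 150, "ScalingAdjustment": 10},
        ],
    },
)
```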
4. Bedrock-Aware Scaling
4.1 Key Bedrock Metrics for Scaling Decisions
The AWS/Bedrock CloudWatch namespace provides metrics that MangaAssist's scaling system consumes:
| Bedrock Metric | Dimension | How MangaAssist Uses It |
|---|---|---|
| `Invocations` | ModelId | Trend detection: is traffic shifting from Haiku to Sonnet? If so, each request is slower and more expensive, so scale pre-emptively. |
| `InvocationLatency` | ModelId | If P99 latency for Sonnet crosses 5s, queue depth will rise within roughly 30s. Scale out proactively before the queue-depth alarms fire. |
| `InvocationThrottles` | ModelId | Any throttle is an emergency. Adding ECS tasks alone will not help (the bottleneck is Bedrock capacity; see 4.2), so also trigger an immediate provisioned throughput increase. |
| `InputTokenCount` / `OutputTokenCount` | ModelId | Compute tokens-per-second to determine whether provisioned throughput needs adjustment. |
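These metrics are read back with standard CloudWatch calls. A sketch of polling throttles for one model (the model ID shown is the public Claude 3 Sonnet identifier; adjust as needed):
```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def recent_throttles(model_id: str, minutes: int = 5) -> int:
    """Sum of Bedrock InvocationThrottles for one model over a recent window."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName="InvocationThrottles",
        Dimensions=[{"Name": "ModelId", "Value": model_id}],
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=60,
        Statistics=["Sum"],
    )
    return int(sum(dp["Sum"] for dp in stats["Datapoints"]))

# recent_throttles("anthropic.claude-3-sonnet-20240229-v1:0") > 0  -> emergency
```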
4.2 Throttle Response Strategy
When Bedrock throttles MangaAssist, adding more ECS tasks does not help because the bottleneck is Bedrock capacity, not orchestrator capacity. The correct response is layered:
flowchart TD
A[Bedrock InvocationThrottles > 0] --> B{Provisioned<br>throughput active?}
B -->|Yes| C[Increase provisioned<br>model units via API]
B -->|No| D[Enable provisioned<br>throughput immediately]
A --> E[Activate request<br>coalescing aggressively]
A --> F[Shift eligible queries<br>from Sonnet to Haiku]
A --> G[Extend Redis cache TTL<br>from 30s to 120s]
A --> H[Enable degraded mode:<br>return cached/template<br>responses for FAQ intent]
C --> I[Monitor: Throttles<br>should drop within 5 min]
D --> I
E --> I
F --> I
G --> I
H --> I
I -->|Throttles persist| J[Escalate: Page on-call,<br>consider cross-region failover]
I -->|Throttles resolved| K[Gradually revert<br>to normal settings]
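A sketch of the mitigation layers as code. The flag names and the `flags` store are assumptions standing in for whatever runtime-config mechanism the orchestrator actually uses (AppConfig, Parameter Store, Redis keys):
```python
def mitigate_throttles(flags: dict, provisioned_active: bool) -> list[str]:
    """Apply the layered throttle response from the flowchart (hypothetical flags)."""
    actions = ["increase provisioned model units"
               if provisioned_active else "enable provisioned throughput"]
    flags["request_coalescing"] = "aggressive"  # merge duplicate in-flight queries
    flags["sonnet_downgrade_eligible"] = True   # route eligible intents to Haiku
    flags["cache_ttl_seconds"] = 120            # extend Redis TTL from 30s
    flags["faq_degraded_mode"] = True           # template responses for FAQ intent
    return actions
```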
4.3 Proactive Scaling Using Latency Prediction
Rather than waiting for throttles (which cause user-visible errors), MangaAssist predicts approaching capacity limits using a simple linear model:
Predicted TPS in 5 minutes = current_TPS + (TPS_slope * 300)
Where TPS_slope is the rate of change in tokens-per-second over the last 5 minutes. If the predicted TPS exceeds 85% of provisioned capacity, the system pre-emptively:
- Adds ECS tasks (60-second lead time for Fargate task launch).
- Requests provisioned throughput increase (may take minutes to activate).
- Warms the Redis cache with anticipated popular queries.
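Stripped of the surrounding machinery, the check is just the following (a standalone sketch; the full slope estimate via least squares appears in `_predict_tps` in section 6):
```python
def predicted_tps_in_5m(current_tps: float, tps_slope_per_s: float) -> float:
    """Linear forecast: predicted TPS = current_TPS + TPS_slope * 300."""
    return max(current_tps + tps_slope_per_s * 300, 0.0)

def should_pre_scale(current_tps: float, tps_slope_per_s: float,
                     provisioned_tps: float) -> bool:
    """Pre-scale when the 5-minute forecast exceeds 85% of provisioned capacity."""
    return predicted_tps_in_5m(current_tps, tps_slope_per_s) > 0.85 * provisioned_tps
```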
5. Provisioned Throughput Dynamic Adjustment
5.1 Time-of-Day Scheduling
MangaAssist dynamically adjusts Bedrock provisioned throughput on a schedule aligned with the daily traffic curve:
| Time Window (JST) | Sonnet Model Units | Haiku Model Units | Est. Combined TPS Capacity | Monthly Cost (Window) |
|---|---|---|---|---|
| 02:00 - 06:00 | 0 (on-demand only) | 0 (on-demand only) | On-demand limits | $0 provisioned |
| 06:00 - 12:00 | 5 | 3 | ~11,000 tokens/s | $7,200 |
| 12:00 - 17:00 | 8 | 3 | ~14,000 tokens/s | $9,900 |
| 17:00 - 02:00 | 15 | 5 | ~25,000 tokens/s | $18,000 |
| Event override | 25 | 8 | ~41,000 tokens/s | Variable |
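The schedule reduces to a simple JST-hour lookup. A sketch (the peak window wraps past midnight, so it appears as two ranges):
```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))

# (start_hour_inclusive, end_hour_exclusive, sonnet_units, haiku_units)
SCHEDULE = [
    (2, 6, 0, 0),     # overnight trough: on-demand only
    (6, 12, 5, 3),
    (12, 17, 8, 3),
    (17, 24, 15, 5),  # evening peak runs 17:00-02:00...
    (0, 2, 15, 5),    # ...so 00:00-02:00 keeps peak capacity
]

def scheduled_units(now: datetime | None = None) -> tuple[int, int]:
    hour = (now or datetime.now(JST)).hour
    for start, end, sonnet, haiku in SCHEDULE:
        if start <= hour < end:
            return sonnet, haiku
    return 5, 3  # conservative fallback; should be unreachable
```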
5.2 Event-Driven Adjustment
When the MangaAssist event calendar signals an upcoming high-traffic event, the system increases provisioned throughput ahead of time:
sequenceDiagram
participant CAL as Event Calendar
participant SCHED as Throughput Scheduler
participant BPT as Bedrock Provisioned<br>Throughput API
participant CW as CloudWatch
participant OPS as On-Call Engineer
CAL->>SCHED: Monday manga release in 3 hours
SCHED->>BPT: Increase to event capacity (25 Sonnet + 8 Haiku)
BPT-->>SCHED: Provisioning in progress (ETA 15 min)
SCHED->>CW: Publish "ProvisioningState: SCALING" metric
Note over BPT: 15 minutes later...
BPT-->>SCHED: Provisioning complete
SCHED->>CW: Publish "ProvisioningState: ACTIVE" metric
Note over CAL: Manga release begins
CW-->>SCHED: TPS climbing: 15K -> 25K -> 32K
alt TPS exceeds provisioned capacity
SCHED->>BPT: Emergency increase request
SCHED->>OPS: Alert: approaching provisioned ceiling
else TPS within capacity
SCHED->>CW: All nominal
end
Note over CAL: 4 hours later, event subsides
SCHED->>BPT: Scale back to evening baseline (15 Sonnet + 5 Haiku)
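The scheduler's capacity changes go through the Bedrock control-plane API. A sketch using boto3 (the provisioned model ARN is a placeholder; unit changes can take minutes to activate, hence the status poll):
```python
import boto3

bedrock = boto3.client("bedrock", region_name="ap-northeast-1")
PROVISIONED_SONNET = (
    "arn:aws:bedrock:ap-northeast-1:123456789012:provisioned-model/EXAMPLE"
)

# Raise Sonnet to event capacity, then watch for the model to leave "Updating".
bedrock.update_provisioned_model_throughput(
    provisionedModelId=PROVISIONED_SONNET,
    desiredModelUnits=25,
)
status = bedrock.get_provisioned_model_throughput(
    provisionedModelId=PROVISIONED_SONNET
)["status"]  # e.g. "Updating" until the new units are active
```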
6. Python Implementation -- GenAIAutoScaler
"""
GenAI-aware auto-scaler for MangaAssist.
Combines ECS task scaling with Bedrock provisioned throughput adjustment.
"""
import time
import json
import logging
from dataclasses import dataclass
from datetime import datetime, timezone, timedelta
from enum import Enum
from typing import Optional
import boto3
logger = logging.getLogger("mangaassist.autoscaler")
JST = timezone(timedelta(hours=9))
class ScalingState(Enum):
STABLE = "stable"
SCALING_OUT = "scaling_out"
SCALING_IN = "scaling_in"
EMERGENCY = "emergency"
EVENT_OVERRIDE = "event_override"
@dataclass
class TrafficSnapshot:
"""Point-in-time traffic measurements."""
timestamp: float
tokens_per_second: float
bedrock_queue_depth: int
active_invocations: int
throttle_count: int
p99_latency_ms: float
sonnet_ratio: float # Fraction of invocations going to Sonnet vs Haiku
ecs_task_count: int
provisioned_sonnet_units: int
provisioned_haiku_units: int
@dataclass
class ScalingDecision:
"""Output of the scaling evaluation."""
state: ScalingState
ecs_desired_count: Optional[int]
sonnet_units: Optional[int]
haiku_units: Optional[int]
reason: str
actions: list[str]
class GenAIAutoScaler:
"""
Auto-scaler for MangaAssist that understands FM workload characteristics.
Key differences from a generic auto-scaler:
1. Scales on tokens-per-second and queue depth, not CPU.
2. Adjusts Bedrock provisioned throughput alongside ECS tasks.
3. Uses traffic slope prediction to scale proactively.
4. Integrates with an event calendar for pre-provisioning.
5. Implements throttle-response strategies beyond just adding capacity.
"""
# Capacity constants
SONNET_TPS_PER_UNIT = 1_000
HAIKU_TPS_PER_UNIT = 2_000
TPS_PER_ECS_TASK = 500
ECS_MIN_TASKS = 5
ECS_MAX_TASKS = 100
# Thresholds
TPS_UTILIZATION_SCALE_OUT = 0.80 # Scale out at 80% utilization
TPS_UTILIZATION_SCALE_IN = 0.40 # Scale in at 40% utilization
TPS_UTILIZATION_EMERGENCY = 0.95 # Emergency at 95%
QUEUE_DEPTH_SCALE_OUT = 50
QUEUE_DEPTH_EMERGENCY = 200
LATENCY_SCALE_OUT_MS = 4_000
THROTTLE_EMERGENCY = 1 # Any throttle is an emergency
# Prediction
PREDICTION_WINDOW_SECONDS = 300 # Look ahead 5 minutes
def __init__(
self,
ecs_cluster: str,
ecs_service: str,
region: str = "ap-northeast-1",
event_calendar: Optional[dict] = None,
):
self.ecs_cluster = ecs_cluster
self.ecs_service = ecs_service
self.region = region
self.event_calendar = event_calendar or {}
self._ecs_client = boto3.client("ecs", region_name=region)
self._bedrock_client = boto3.client("bedrock", region_name=region)
self._cloudwatch_client = boto3.client("cloudwatch", region_name=region)
self._autoscaling_client = boto3.client("application-autoscaling", region_name=region)
# Rolling window of snapshots for slope calculation
self._snapshot_history: list[TrafficSnapshot] = []
self._max_history = 60 # Keep last 60 snapshots (10 minutes at 10s intervals)
self._last_scale_action_time = 0.0
self._cooldown_seconds = 60.0
def evaluate(self, snapshot: TrafficSnapshot) -> ScalingDecision:
"""
Evaluate current traffic and return a scaling decision.
This is the main entry point, called every 10 seconds.
"""
self._snapshot_history.append(snapshot)
if len(self._snapshot_history) > self._max_history:
self._snapshot_history = self._snapshot_history[-self._max_history:]
# Check for event override first
event_decision = self._check_event_override(snapshot)
if event_decision is not None:
return event_decision
# Check for emergency conditions
if snapshot.throttle_count >= self.THROTTLE_EMERGENCY:
return self._handle_throttle_emergency(snapshot)
if snapshot.bedrock_queue_depth >= self.QUEUE_DEPTH_EMERGENCY:
return self._handle_queue_emergency(snapshot)
# Calculate provisioned capacity
provisioned_tps = (
snapshot.provisioned_sonnet_units * self.SONNET_TPS_PER_UNIT
+ snapshot.provisioned_haiku_units * self.HAIKU_TPS_PER_UNIT
)
        # During on-demand-only windows (provisioned_tps == 0) there is no
        # provisioned ceiling to measure against; treat utilization as 0.0 and
        # let queue depth and latency drive scaling, rather than forcing a
        # permanent 100%-utilization scale-out.
        utilization = snapshot.tokens_per_second / provisioned_tps if provisioned_tps > 0 else 0.0
        # Predictive check: will we breach in 5 minutes?
        predicted_tps = self._predict_tps()
        predicted_utilization = predicted_tps / provisioned_tps if provisioned_tps > 0 else 0.0
# Scale-out evaluation
if (
utilization > self.TPS_UTILIZATION_SCALE_OUT
or predicted_utilization > self.TPS_UTILIZATION_EMERGENCY
or snapshot.bedrock_queue_depth > self.QUEUE_DEPTH_SCALE_OUT
or snapshot.p99_latency_ms > self.LATENCY_SCALE_OUT_MS
):
return self._compute_scale_out(snapshot, utilization, predicted_tps)
# Scale-in evaluation
if (
utilization < self.TPS_UTILIZATION_SCALE_IN
and snapshot.bedrock_queue_depth < 10
and snapshot.p99_latency_ms < 2_000
            and not self._in_cooldown()
):
return self._compute_scale_in(snapshot, utilization)
# Stable -- no action needed
return ScalingDecision(
state=ScalingState.STABLE,
ecs_desired_count=None,
sonnet_units=None,
haiku_units=None,
reason=f"Stable: utilization={utilization:.1%}, queue={snapshot.bedrock_queue_depth}, p99={snapshot.p99_latency_ms:.0f}ms",
actions=[],
)
def _predict_tps(self) -> float:
"""
Predict tokens-per-second in 5 minutes using linear regression
on the recent snapshot history.
"""
if len(self._snapshot_history) < 6:
# Not enough data; return current value
return self._snapshot_history[-1].tokens_per_second if self._snapshot_history else 0
recent = self._snapshot_history[-30:] # Last 5 minutes of 10s snapshots
if len(recent) < 2:
return recent[-1].tokens_per_second
# Simple linear regression: TPS = a + b * t
t_values = [s.timestamp - recent[0].timestamp for s in recent]
tps_values = [s.tokens_per_second for s in recent]
n = len(t_values)
sum_t = sum(t_values)
sum_tps = sum(tps_values)
sum_t_tps = sum(t * tps for t, tps in zip(t_values, tps_values))
sum_t_sq = sum(t * t for t in t_values)
denominator = n * sum_t_sq - sum_t * sum_t
if denominator == 0:
return tps_values[-1]
slope = (n * sum_t_tps - sum_t * sum_tps) / denominator
intercept = (sum_tps - slope * sum_t) / n
# Predict 5 minutes into the future
future_t = t_values[-1] + self.PREDICTION_WINDOW_SECONDS
predicted = intercept + slope * future_t
return max(predicted, 0) # TPS can't be negative
def _handle_throttle_emergency(self, snapshot: TrafficSnapshot) -> ScalingDecision:
"""Respond to Bedrock throttling with multi-layered mitigation."""
        # Cap the emergency increment so a throttle storm cannot request an
        # unbounded number of model units in a single step.
        additional_sonnet = min(max(2, snapshot.throttle_count), 10)  # at least 2 extra units
        additional_haiku = min(max(1, snapshot.throttle_count // 2), 5)
return ScalingDecision(
state=ScalingState.EMERGENCY,
ecs_desired_count=min(snapshot.ecs_task_count + 10, self.ECS_MAX_TASKS),
sonnet_units=snapshot.provisioned_sonnet_units + additional_sonnet,
haiku_units=snapshot.provisioned_haiku_units + additional_haiku,
reason=f"EMERGENCY: {snapshot.throttle_count} throttles detected",
actions=[
f"Increase provisioned Sonnet by {additional_sonnet} units",
f"Increase provisioned Haiku by {additional_haiku} units",
"Activate aggressive request coalescing",
"Shift eligible queries from Sonnet to Haiku",
"Extend Redis cache TTL to 120s",
"Page on-call engineer",
],
)
def _handle_queue_emergency(self, snapshot: TrafficSnapshot) -> ScalingDecision:
"""Respond to extreme queue depth."""
return ScalingDecision(
state=ScalingState.EMERGENCY,
ecs_desired_count=min(snapshot.ecs_task_count + 10, self.ECS_MAX_TASKS),
sonnet_units=snapshot.provisioned_sonnet_units + 3,
haiku_units=snapshot.provisioned_haiku_units + 2,
reason=f"EMERGENCY: Queue depth {snapshot.bedrock_queue_depth} exceeds {self.QUEUE_DEPTH_EMERGENCY}",
actions=[
"Add 10 ECS tasks immediately",
"Increase provisioned throughput",
"Enable degraded mode for FAQ queries (template responses)",
"Alert on-call engineer",
],
)
def _compute_scale_out(
self, snapshot: TrafficSnapshot, utilization: float, predicted_tps: float
) -> ScalingDecision:
"""Calculate how much to scale out."""
# ECS task scaling: target TPS_PER_ECS_TASK per task
desired_tasks = max(
snapshot.ecs_task_count + 2,
int(predicted_tps / self.TPS_PER_ECS_TASK) + 1,
)
desired_tasks = min(desired_tasks, self.ECS_MAX_TASKS)
# Provisioned throughput: add units if utilization is high
sonnet_units = snapshot.provisioned_sonnet_units
haiku_units = snapshot.provisioned_haiku_units
if utilization > self.TPS_UTILIZATION_SCALE_OUT:
# Add units proportional to the gap
needed_tps = predicted_tps * 1.3 # 30% headroom
current_capacity = (
sonnet_units * self.SONNET_TPS_PER_UNIT
+ haiku_units * self.HAIKU_TPS_PER_UNIT
)
gap = needed_tps - current_capacity
if gap > 0:
# Split additional capacity between Sonnet and Haiku
# based on current Sonnet ratio
sonnet_gap = int(gap * snapshot.sonnet_ratio / self.SONNET_TPS_PER_UNIT) + 1
haiku_gap = int(gap * (1 - snapshot.sonnet_ratio) / self.HAIKU_TPS_PER_UNIT) + 1
sonnet_units += sonnet_gap
haiku_units += haiku_gap
actions = []
if desired_tasks > snapshot.ecs_task_count:
actions.append(
f"Scale ECS from {snapshot.ecs_task_count} to {desired_tasks} tasks"
)
if sonnet_units > snapshot.provisioned_sonnet_units:
actions.append(
f"Increase Sonnet provisioned from {snapshot.provisioned_sonnet_units} to {sonnet_units} units"
)
if haiku_units > snapshot.provisioned_haiku_units:
actions.append(
f"Increase Haiku provisioned from {snapshot.provisioned_haiku_units} to {haiku_units} units"
)
return ScalingDecision(
state=ScalingState.SCALING_OUT,
ecs_desired_count=desired_tasks,
sonnet_units=sonnet_units,
haiku_units=haiku_units,
reason=f"Scale-out: utilization={utilization:.1%}, predicted_tps={predicted_tps:.0f}, queue={snapshot.bedrock_queue_depth}",
actions=actions,
)
def _compute_scale_in(self, snapshot: TrafficSnapshot, utilization: float) -> ScalingDecision:
"""Calculate how much to scale in (conservative)."""
# Never scale in by more than 10% at a time
reduction = max(1, int(snapshot.ecs_task_count * 0.10))
desired_tasks = max(snapshot.ecs_task_count - reduction, self.ECS_MIN_TASKS)
# Reduce provisioned throughput only if utilization is very low
sonnet_units = snapshot.provisioned_sonnet_units
haiku_units = snapshot.provisioned_haiku_units
        if utilization < 0.25:
            # Step down one unit at a time, never below the daytime floor.
            # Guarding on the current value avoids accidentally *raising*
            # units during an on-demand-only (zero-unit) window.
            if sonnet_units > 3:
                sonnet_units -= 1
            if haiku_units > 2:
                haiku_units -= 1
actions = []
if desired_tasks < snapshot.ecs_task_count:
actions.append(
f"Scale ECS from {snapshot.ecs_task_count} to {desired_tasks} tasks"
)
if sonnet_units < snapshot.provisioned_sonnet_units:
actions.append(
f"Reduce Sonnet provisioned from {snapshot.provisioned_sonnet_units} to {sonnet_units} units"
)
return ScalingDecision(
state=ScalingState.SCALING_IN,
ecs_desired_count=desired_tasks,
sonnet_units=sonnet_units,
haiku_units=haiku_units,
reason=f"Scale-in: utilization={utilization:.1%}, queue={snapshot.bedrock_queue_depth}",
actions=actions,
)
def _check_event_override(self, snapshot: TrafficSnapshot) -> Optional[ScalingDecision]:
"""Check if an event calendar entry requires override scaling."""
now_jst = datetime.now(JST)
for event_name, event_config in self.event_calendar.items():
start = event_config["start"]
end = event_config["end"]
if start <= now_jst <= end:
return ScalingDecision(
state=ScalingState.EVENT_OVERRIDE,
ecs_desired_count=min(event_config["ecs_tasks"], self.ECS_MAX_TASKS),
sonnet_units=event_config["sonnet_units"],
haiku_units=event_config["haiku_units"],
reason=f"Event override: {event_name} ({start} to {end})",
actions=[
f"Event '{event_name}' active -- using override capacity",
f"ECS: {event_config['ecs_tasks']} tasks",
f"Sonnet: {event_config['sonnet_units']} units",
f"Haiku: {event_config['haiku_units']} units",
],
)
return None
def _in_cooldown(self) -> bool:
"""Check if we are in a cooldown period after a recent scale action."""
return (time.time() - self._last_scale_action_time) < self._cooldown_seconds
    def apply_decision(self, decision: ScalingDecision) -> None:
        """Apply a scaling decision to AWS resources. boto3 calls are synchronous."""
if decision.state == ScalingState.STABLE:
return
logger.info(
"Applying scaling decision: state=%s reason=%s actions=%s",
decision.state.value, decision.reason, decision.actions,
)
# Apply ECS scaling
if decision.ecs_desired_count is not None:
self._ecs_client.update_service(
cluster=self.ecs_cluster,
service=self.ecs_service,
desiredCount=decision.ecs_desired_count,
)
logger.info("ECS desired count set to %d", decision.ecs_desired_count)
# Apply provisioned throughput changes
if decision.sonnet_units is not None:
# Bedrock provisioned throughput API call would go here.
# The exact API depends on the provisioned throughput model
# (e.g., CreateProvisionedModelThroughput or UpdateProvisionedModelThroughput).
logger.info(
"Provisioned throughput update: Sonnet=%d units, Haiku=%d units",
decision.sonnet_units,
decision.haiku_units or 0,
)
# Publish scaling event to CloudWatch
self._cloudwatch_client.put_metric_data(
Namespace="MangaAssist",
MetricData=[
{
"MetricName": "ScalingAction",
"Value": 1,
"Unit": "Count",
"Dimensions": [
{"Name": "State", "Value": decision.state.value},
{"Name": "Service", "Value": "GenAIAutoScaler"},
],
},
],
)
self._last_scale_action_time = time.time()
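A minimal driver loop for the class above. `collect_snapshot()` is a hypothetical helper that gathers the CloudWatch and ECS values into a `TrafficSnapshot`; it is not shown here:
```python
import time

scaler = GenAIAutoScaler(
    ecs_cluster="mangaassist-cluster",        # hypothetical names
    ecs_service="mangaassist-orchestrator",
    event_calendar={},                        # populated from the event calendar
)

while True:
    snapshot = collect_snapshot()    # hypothetical metric collector
    decision = scaler.evaluate(snapshot)
    scaler.apply_decision(decision)  # no-op when state is STABLE
    time.sleep(10)                   # matches the 10s metric cadence
```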
7. Manga Release Scale-Up Sequence
The following diagram shows the complete scale-up sequence during a Monday manga release event, from the calendar trigger through capacity stabilization:
sequenceDiagram
participant CAL as Event Calendar<br>(Lambda CronJob)
participant SCALER as GenAIAutoScaler
participant ECS as ECS Fargate<br>Service
participant BPT as Bedrock Provisioned<br>Throughput
participant REDIS as ElastiCache Redis
participant CW as CloudWatch
participant OPS as On-Call Eng
Note over CAL: Sunday 21:00 JST<br>3 hours before release
CAL->>SCALER: Event: "weekly_shonen_jump"<br>starts in 3h
SCALER->>BPT: UpdateProvisionedThroughput<br>Sonnet: 15 → 25 units<br>Haiku: 5 → 8 units
BPT-->>SCALER: Status: PROVISIONING
SCALER->>ECS: UpdateService<br>desiredCount: 30 → 50
ECS-->>SCALER: Tasks launching...
SCALER->>REDIS: Pre-warm popular queries<br>(top 100 manga titles)
SCALER->>CW: Publish ScalingState=EVENT_OVERRIDE
Note over ECS: 5 minutes later<br>50 tasks running
Note over BPT: 15 minutes later<br>Provisioned throughput active
SCALER->>CW: Publish ProvisionedState=ACTIVE
SCALER->>OPS: INFO: Event capacity ready
Note over CAL: Monday 00:00 JST<br>Release goes live
CW-->>SCALER: TPS: 7K → 15K → 22K → 30K
Note over SCALER: TPS within provisioned<br>capacity (41K). Stable.
CW-->>SCALER: TPS: 30K → 33K → 36K
Note over SCALER: Forecast breaches the<br>41K provisioned ceiling!
SCALER->>BPT: Emergency: Sonnet 25 → 28 units
SCALER->>ECS: UpdateService: 50 → 65 tasks
SCALER->>REDIS: Extend cache TTL: 30s → 90s
SCALER->>OPS: WARN: Approaching capacity ceiling
Note over CW: 30 minutes later<br>Surge subsides
CW-->>SCALER: TPS: 36K → 28K → 20K
SCALER->>SCALER: Predicted TPS in 5m: 16K<br>Safe to scale in
Note over CAL: Monday 04:00 JST<br>Event window ends
CAL->>SCALER: Event "weekly_shonen_jump" ended
SCALER->>BPT: Reduce to evening baseline<br>Sonnet: 28 → 15 units
SCALER->>ECS: UpdateService: 65 → 30 tasks
SCALER->>REDIS: Restore cache TTL: 90s → 30s
SCALER->>CW: Publish ScalingState=STABLE
8. Monitoring Dashboard Layout
MangaAssist uses a CloudWatch dashboard with the following panels to give the on-call engineer full visibility into auto-scaling behavior:
| Panel | Metrics Shown | Time Range | Purpose |
|---|---|---|---|
| Tokens per Second | `MangaAssist/TokensPerSecond` (line), provisioned capacity (horizontal line) | 6 hours | See how close live traffic is to the provisioned ceiling |
| Bedrock Queue Depth | `MangaAssist/BedrockQueueDepth` max per task | 1 hour | Spot queue buildup before it hits latency |
| P99 Latency | `MangaAssist/RequestLatencyP99` | 1 hour | Confirm user-facing impact of any scaling event |
| ECS Task Count | `AWS/ECS` `DesiredCount`, `RunningCount` | 6 hours | Verify scaling actions are applied and tasks are healthy |
| Bedrock Throttles | `AWS/Bedrock` `InvocationThrottles` sum | 1 hour | Any non-zero value is an immediate concern |
| Provisioned Throughput Units | Custom metric tracking model units | 24 hours | See time-of-day provisioning changes |
| Scaling Decisions | `MangaAssist/ScalingAction` by state | 6 hours | Audit trail of every scale-out/in/emergency decision |
| Cost Rate | Computed: (tokens * price_per_token) per hour | 24 hours | Real-time cost tracking against budget |
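The dashboard itself can be provisioned from code. A sketch of the first panel via `put_dashboard` (the remaining panels follow the same widget pattern; the 25,000 ceiling annotation matches the evening baseline):
```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

body = {
    "widgets": [{
        "type": "metric", "width": 12, "height": 6,
        "properties": {
            "title": "Tokens per Second vs Provisioned Ceiling",
            "region": "ap-northeast-1",
            "metrics": [["MangaAssist", "TokensPerSecond"]],
            "period": 60, "stat": "Average",
            "annotations": {
                "horizontal": [{"label": "Provisioned TPS", "value": 25000}]
            },
        },
    }]
}
cloudwatch.put_dashboard(
    DashboardName="MangaAssist-AutoScaling", DashboardBody=json.dumps(body)
)
```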
Key Takeaways
- CPU utilization is the wrong signal for FM workloads -- MangaAssist's orchestrator sits at 30% CPU during peak traffic because it is I/O bound waiting on Bedrock. Queue depth and tokens-per-second are the correct scaling metrics.
- Prediction prevents throttling -- By the time Bedrock starts throttling, users are already affected. A 5-minute linear prediction of TPS allows pre-emptive scaling that keeps the system ahead of demand.
- Throttle response is multi-layered -- Adding ECS tasks when Bedrock is throttled does not help. The correct response combines provisioned throughput increases, cache extension, model downgrade (Sonnet to Haiku), and degraded-mode template responses.
- Scale-in must be conservative -- Asymmetric cooldowns (60s for scale-out, 300s for scale-in) prevent oscillation during the evening ramp, when traffic can dip briefly between surges.
- Event calendars are not optional -- Manga releases happen on predictable schedules. Pre-provisioning 3 hours ahead eliminates the cold-start penalty of provisioned throughput activation.