
02: Auto-Scaling and Utilization for GenAI Workloads

MangaAssist is a Japanese manga store chatbot running on AWS. It uses Amazon Bedrock Claude 3 models (Sonnet for complex queries, Haiku for simple ones), OpenSearch Serverless for vector retrieval, DynamoDB for session and catalog data, ECS Fargate for orchestration, API Gateway WebSockets for real-time delivery, and ElastiCache for Redis for caching. The system handles 1M messages/day with a target end-to-end response time of under 3 seconds.


Skill Mapping

| Field | Value |
| --- | --- |
| Domain | 4 -- Operational Efficiency Optimization |
| Task | 4.1 -- Optimize foundation model cost and performance |
| Skill | 4.1.3 -- Implement strategies for high-performance FM systems including batching, capacity planning, utilization monitoring, auto-scaling, and provisioned throughput optimization |
| Focus | Deep-dive into auto-scaling patterns, traffic analysis, Bedrock-aware scaling signals, and dynamic provisioned throughput adjustment |

1. Why FM Workloads Break Traditional Auto-Scaling

1.1 Challenges Unique to Foundation Model Workloads

Auto-scaling rules designed for stateless web services (scale on CPU > 70%, scale on request count > 1000/min) fail for FM workloads because of three properties:

| Property | Web Service Behavior | FM Workload Behavior | Impact on Scaling |
| --- | --- | --- | --- |
| Variable token counts | Request sizes are roughly uniform (a few KB) | A greeting is 50 tokens; a multi-turn recommendation is 2,000 tokens. Same request rate, 40x compute variance. | CPU-based scaling under-provisions for heavy queries and over-provisions for light ones |
| Variable latency | Response times are predictable (10-100ms) | Haiku responds in 200ms, Sonnet in 1-4s, long context in 8s+. Identical request rate, wildly different in-flight concurrency. | Request-count scaling misses the concurrency explosion during Sonnet-heavy periods |
| Burstiness | Traffic ramps gradually | Manga releases trigger step-function surges: 5x in under 60 seconds | Cooldown periods designed for gradual ramps cause under-scaling during bursts |

1.2 The Compound Problem

These three properties interact. During a manga release event:

  1. Traffic volume spikes 5x (burstiness).
  2. Users ask complex questions about the new release -- "Compare this volume to the previous arc" -- routing to Sonnet instead of Haiku (variable latency).
  3. Recommendation queries with full context windows consume 2,000+ tokens each (variable token counts).

The result: effective compute demand spikes 10-15x even though raw request count only went up 5x. A scaling policy that only watches request count scales to 5x capacity when it needs 15x.

graph LR
    subgraph "What Request-Count Scaling Sees"
        A[5x request spike] --> B[Scale to 5x capacity]
    end

    subgraph "What Actually Happens"
        C[5x request spike] --> D[Heavier intent mix<br>+2x Sonnet ratio]
        D --> E[Longer token counts<br>+3x avg tokens]
        E --> F[Effective demand: 15x]
    end

    B --> G[Result: Severe<br>under-provisioning]
    F --> G
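
A back-of-the-envelope calculation makes the gap concrete (the multipliers below are illustrative, taken from the scenario above):

# Illustrative release-night multipliers, not measured values
request_multiplier = 5.0      # raw request spike
avg_token_multiplier = 3.0    # longer prompts and fuller context windows per request

effective_token_demand = request_multiplier * avg_token_multiplier   # 15x tokens/s vs baseline
# A request-count policy provisions for 5x; a tokens-per-second signal sees the full 15x.
# The shift toward Sonnet compounds this further, since each token is slower and costlier to serve.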

2. MangaAssist Traffic Patterns

2.1 Daily Curve -- JP Evening Peak

MangaAssist's traffic follows a consistent daily pattern driven by Japanese consumer habits. All times below are in JST (UTC+9).

| Time Window (JST) | Traffic Level | Dominant Intents | Scaling Posture |
| --- | --- | --- | --- |
| 00:00 - 02:00 | Moderate, declining | Late-night browsing, recommendations | Scale-in beginning |
| 02:00 - 06:00 | Trough (~15% of peak) | Automated health checks, occasional insomnia browsing | Minimum capacity (5 tasks) |
| 06:00 - 09:00 | Gradual ramp | Commute browsing, order status checks | Scheduled pre-scale to 10 tasks |
| 09:00 - 12:00 | Moderate (~40% of peak) | Product search, FAQ | Target tracking active |
| 12:00 - 13:00 | Lunch spike (~60% of peak) | Browse recommendations, wishlist queries | Step scaling responds |
| 13:00 - 17:00 | Moderate, steady (~45% of peak) | Mixed intents | Steady state |
| 17:00 - 19:00 | Rapid ramp to peak | After-work browsing, recommendations | Scheduled pre-scale to 30 tasks |
| 19:00 - 23:00 | Peak (~100%) | Recommendations, multi-turn conversations, purchases | Full provisioned throughput |
| 23:00 - 00:00 | Peak, declining | Late purchases, order confirmations | Begin scheduled scale-in |

2.2 Weekly Pattern -- Release Days and Weekends

| Day | Modifier | Reason |
| --- | --- | --- |
| Monday | +30% above baseline | Weekly Shonen Jump digital release (midnight JST) |
| Tuesday - Thursday | Baseline | Normal traffic |
| Friday | +15% above baseline | Weekend anticipation, paycheck spending |
| Saturday | +40% above baseline | Full-day browsing, sustained peak window |
| Sunday | +25% above baseline | Morning browsing, evening trails off (pre-Monday release) |

2.3 Seasonal Events -- Manga Sales and Promotions

gantt
    title MangaAssist Annual Capacity Calendar
    dateFormat  YYYY-MM-DD
    axisFormat  %b

    section Sustained High
    New Year Sale (5x)          :crit, 2026-01-01, 7d
    Golden Week (3x)            :crit, 2026-04-29, 7d
    Amazon Prime Day (4x)       :crit, 2026-07-15, 3d
    Summer Comiket (3x)         :crit, 2026-08-10, 4d
    Black Friday (5x)           :crit, 2026-11-27, 4d
    Cyber Monday (4x)           :crit, 2026-11-30, 2d
    Year-End Sale (4x)          :crit, 2026-12-20, 12d

    section Burst Events
    Jump Festa (3x)             :active, 2026-12-19, 2d
    Manga Award Season (2x)     :active, 2026-03-01, 14d
    Anime Season Premieres (2x) :active, 2026-01-10, 7d
    Anime Season Premieres (2x) :active, 2026-04-10, 7d
    Anime Season Premieres (2x) :active, 2026-07-10, 7d
    Anime Season Premieres (2x) :active, 2026-10-10, 7d

3. ECS Fargate Scaling Strategies

3.1 Why CPU/Memory Alone Is Insufficient

The MangaAssist orchestrator running on ECS Fargate is I/O bound, not CPU bound. During peak traffic:

| Metric | Value During Peak | Why It Misleads |
| --- | --- | --- |
| CPU utilization | 25-35% | Most time spent await-ing Bedrock responses and Redis lookups |
| Memory utilization | 40-50% | Stable; conversation context is small per request |
| Network I/O | High | Reflects Bedrock calls, but not exposed as a scaling metric natively |
| Bedrock queue depth | 50-200 | The real bottleneck indicator -- how many requests are waiting for Bedrock |
| Active Bedrock invocations | 30-80 per task | Each task can hold many concurrent async invocations |

With a CPU-based policy at a 70% threshold, the service never scales out because CPU hovers around 30%. Meanwhile, Bedrock queue depth climbs to 200 and users experience 8-second latencies.

3.2 Custom Metric-Based Scaling

MangaAssist publishes three custom CloudWatch metrics from each ECS task and scales on them:

graph TD
    subgraph "ECS Task Metrics Published Every 10s"
        M1[bedrock_queue_depth<br>In-flight Bedrock calls waiting]
        M2[active_bedrock_invocations<br>Concurrent calls to Bedrock]
        M3[request_latency_p99<br>99th percentile response time]
    end

    subgraph "CloudWatch Aggregation"
        M1 --> A1[Aggregate: Maximum across tasks]
        M2 --> A2[Aggregate: Sum across tasks]
        M3 --> A3[Aggregate: Maximum across tasks]
    end

    subgraph "Scaling Policies"
        A1 --> P1[Step Scaling Policy<br>Queue Depth]
        A2 --> P2[Target Tracking Policy<br>Invocations per Task = 40]
        A3 --> P3[Step Scaling Policy<br>Latency Breach]
    end

    subgraph "ECS Auto Scaling"
        P1 --> S[Desired Task Count]
        P2 --> S
        P3 --> S
        S --> T[ECS Fargate Tasks<br>min=5, max=100]
    end

Policy interaction: When multiple policies disagree, Application Auto Scaling scales to the largest desired count requested by any policy. This ensures that the most aggressive policy wins during a crisis.
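
A minimal sketch of how each task could publish these metrics every 10 seconds with boto3. The metric names mirror the diagram above; publishing them without a per-task dimension (an assumption, not stated in the diagram) lets CloudWatch aggregate across tasks with Sum or Maximum:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def publish_task_metrics(queue_depth: int, active_invocations: int, p99_latency_ms: float) -> None:
    """Publish this task's three scaling signals to the shared MangaAssist namespace."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist",
        MetricData=[
            # All tasks write to the same metric names so CloudWatch can aggregate across tasks.
            {"MetricName": "bedrock_queue_depth", "Value": queue_depth, "Unit": "Count"},
            {"MetricName": "active_bedrock_invocations", "Value": active_invocations, "Unit": "Count"},
            {"MetricName": "request_latency_p99", "Value": p99_latency_ms, "Unit": "Milliseconds"},
        ],
    )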

3.3 Scaling Configuration Details

| Policy | Metric | Scale-Out Trigger | Scale-Out Action | Scale-In Trigger | Scale-In Action | Cooldown (Out/In) |
| --- | --- | --- | --- | --- | --- | --- |
| Queue Depth Step | bedrock_queue_depth max | > 50 for 2 min | +2 tasks | < 20 for 10 min | -1 task | 60s / 300s |
| Queue Depth Emergency | bedrock_queue_depth max | > 200 for 30s | +10 tasks | N/A | N/A | 60s / N/A |
| Invocations Target Tracking | active_bedrock_invocations sum / task count | > 40 per task | Auto-calculated | < 40 per task | Auto-calculated | 120s / 300s |
| Latency Step | request_latency_p99 max | > 4,000ms for 2 min | +3 tasks | < 2,000ms for 15 min | -1 task | 90s / 300s |
| Scheduled: Evening Pre-Scale | Time-based | 17:00 JST daily | min_capacity = 30 | 23:30 JST daily | min_capacity = 10 | N/A |
| Scheduled: Overnight Trough | Time-based | 02:00 JST daily | min_capacity = 5 | 06:00 JST daily | min_capacity = 10 | N/A |
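
A hedged sketch of registering two of these rows with Application Auto Scaling -- the custom-metric target tracking policy and the evening scheduled pre-scale. Cluster, service, and policy names are placeholders, and the ECS service must already be registered as a scalable target:

import boto3

aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")
resource_id = "service/mangaassist-cluster/orchestrator"   # placeholder cluster/service names

# Target tracking on the custom per-task invocation metric
aas.put_scaling_policy(
    PolicyName="invocations-per-task-40",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 40.0,
        "CustomizedMetricSpecification": {
            "Namespace": "MangaAssist",
            "MetricName": "active_bedrock_invocations",
            "Statistic": "Average",   # average per task approximates "sum / task count"
        },
        "ScaleOutCooldown": 120,
        "ScaleInCooldown": 300,
    },
)

# Scheduled evening pre-scale at 17:00 JST
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="evening-pre-scale",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 17 * * ? *)",
    Timezone="Asia/Tokyo",
    ScalableTargetAction={"MinCapacity": 30},
)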

4. Bedrock-Aware Scaling

4.1 Key Bedrock Metrics for Scaling Decisions

The AWS/Bedrock CloudWatch namespace provides metrics that MangaAssist's scaling system consumes:

| Bedrock Metric | Dimension | How MangaAssist Uses It |
| --- | --- | --- |
| Invocations | ModelId | Trend detection: is traffic shifting from Haiku to Sonnet? If so, each request is slower and more expensive -- pre-emptively scale. |
| InvocationLatency | ModelId | If P99 latency for Sonnet crosses 5s, the queue depth will rise within 30s. Proactive scale-out before queue depth alarms fire. |
| InvocationThrottles | ModelId | Any throttle is an emergency. Immediate scale action: increase ECS tasks (more concurrent callers may not help, but see below) and trigger a provisioned throughput increase. |
| InputTokenCount / OutputTokenCount | ModelId | Compute tokens-per-second to determine if provisioned throughput needs adjustment. |
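
A sketch of pulling two of these signals with the CloudWatch GetMetricData API. The model ID shown is the public Claude 3 Sonnet identifier and should match whatever model is actually deployed:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_data(
    StartTime=now - timedelta(minutes=10),
    EndTime=now,
    MetricDataQueries=[
        {"Id": "throttles", "MetricStat": {
            "Metric": {"Namespace": "AWS/Bedrock", "MetricName": "InvocationThrottles",
                       "Dimensions": [{"Name": "ModelId", "Value": model_id}]},
            "Period": 60, "Stat": "Sum"}},
        {"Id": "p99_latency", "MetricStat": {
            "Metric": {"Namespace": "AWS/Bedrock", "MetricName": "InvocationLatency",
                       "Dimensions": [{"Name": "ModelId", "Value": model_id}]},
            "Period": 60, "Stat": "p99"}},
    ],
)
# resp["MetricDataResults"] holds one time series per query id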

4.2 Throttle Response Strategy

When Bedrock throttles MangaAssist, adding more ECS tasks does not help because the bottleneck is Bedrock capacity, not orchestrator capacity. The correct response is layered:

flowchart TD
    A[Bedrock InvocationThrottles > 0] --> B{Provisioned<br>throughput active?}

    B -->|Yes| C[Increase provisioned<br>model units via API]
    B -->|No| D[Enable provisioned<br>throughput immediately]

    A --> E[Activate request<br>coalescing aggressively]
    A --> F[Shift eligible queries<br>from Sonnet to Haiku]
    A --> G[Extend Redis cache TTL<br>from 30s to 120s]
    A --> H[Enable degraded mode:<br>return cached/template<br>responses for FAQ intent]

    C --> I[Monitor: Throttles<br>should drop within 5 min]
    D --> I
    E --> I
    F --> I
    G --> I
    H --> I

    I -->|Throttles persist| J[Escalate: Page on-call,<br>consider cross-region failover]
    I -->|Throttles resolved| K[Gradually revert<br>to normal settings]
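
At the orchestrator level, the Sonnet-to-Haiku downgrade and cached-response layers can be applied at the point where the throttle is observed. A minimal client-side sketch, assuming a hypothetical cache_lookup helper for the Redis/template fallback:

import json
import boto3
from botocore.exceptions import ClientError

bedrock_runtime = boto3.client("bedrock-runtime", region_name="ap-northeast-1")

SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

def invoke_with_throttle_fallback(body: dict, cache_lookup) -> dict:
    """Try Sonnet; on throttling, retry on Haiku; on a second throttle, fall back to cache."""
    for model_id in (SONNET, HAIKU):
        try:
            resp = bedrock_runtime.invoke_model(modelId=model_id, body=json.dumps(body))
            return json.loads(resp["body"].read())
        except ClientError as exc:
            if exc.response["Error"]["Code"] != "ThrottlingException":
                raise
            # Throttled: fall through to the cheaper model, then to the cache
    cached = cache_lookup(body)   # hypothetical Redis/template lookup
    if cached is not None:
        return cached
    raise RuntimeError("Bedrock throttled on all models and no cached response available")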

4.3 Proactive Scaling Using Latency Prediction

Rather than waiting for throttles (which cause user-visible errors), MangaAssist predicts approaching capacity limits using a simple linear model:

Predicted TPS in 5 minutes = current_TPS + (TPS_slope * 300)

Where TPS_slope is the rate of change in tokens-per-second over the last 5 minutes. If the predicted TPS exceeds 85% of provisioned capacity, the system pre-emptively:

  1. Adds ECS tasks (60-second lead time for Fargate task launch).
  2. Requests provisioned throughput increase (may take minutes to activate).
  3. Warms the Redis cache with anticipated popular queries.
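
A quick worked example of this check, using illustrative numbers:

# Worked example of the 5-minute look-ahead; all values are illustrative.
current_tps = 20_000        # tokens/s now
tps_slope = 18.0            # change in tokens/s per second over the last 5 minutes
provisioned_tps = 25_000    # current provisioned ceiling

predicted_tps = current_tps + tps_slope * 300      # 25,400 tokens/s five minutes out
if predicted_tps > 0.85 * provisioned_tps:         # 25,400 > 21,250 -> act before the breach
    print("Pre-scale: add ECS tasks, request more model units, warm the cache")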

5. Provisioned Throughput Dynamic Adjustment

5.1 Time-of-Day Scheduling

MangaAssist dynamically adjusts Bedrock provisioned throughput on a schedule aligned with the daily traffic curve:

| Time Window (JST) | Sonnet Model Units | Haiku Model Units | Est. Combined TPS Capacity | Monthly Cost (Window) |
| --- | --- | --- | --- | --- |
| 02:00 - 06:00 | 0 (on-demand only) | 0 (on-demand only) | On-demand limits | $0 provisioned |
| 06:00 - 12:00 | 5 | 3 | ~11,000 tokens/s | $7,200 |
| 12:00 - 17:00 | 8 | 3 | ~14,000 tokens/s | $9,900 |
| 17:00 - 02:00 | 15 | 5 | ~25,000 tokens/s | $18,000 |
| Event override | 25 | 8 | ~41,000 tokens/s | Variable |
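
One way to express this schedule in code is a simple window table consulted by the throughput scheduler. The sketch below mirrors the table above and treats the midnight-wrapping 17:00 - 02:00 window as the default:

from datetime import datetime, time, timedelta, timezone

JST = timezone(timedelta(hours=9))

# (window start, window end, provisioned units) -- values taken from the table above
SCHEDULE = [
    (time(2, 0),  time(6, 0),  {"sonnet": 0,  "haiku": 0}),   # on-demand only overnight
    (time(6, 0),  time(12, 0), {"sonnet": 5,  "haiku": 3}),
    (time(12, 0), time(17, 0), {"sonnet": 8,  "haiku": 3}),
]
PEAK_UNITS = {"sonnet": 15, "haiku": 5}   # 17:00 - 02:00 wraps midnight, so it is the default

def scheduled_units(now: datetime) -> dict:
    """Return the provisioned model units for the current JST time window."""
    t = now.astimezone(JST).time()
    for start, end, units in SCHEDULE:
        if start <= t < end:
            return units
    return PEAK_UNITS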

5.2 Event-Driven Adjustment

When the MangaAssist event calendar signals an upcoming high-traffic event, the system increases provisioned throughput ahead of time:

sequenceDiagram
    participant CAL as Event Calendar
    participant SCHED as Throughput Scheduler
    participant BPT as Bedrock Provisioned<br>Throughput API
    participant CW as CloudWatch
    participant OPS as On-Call Engineer

    CAL->>SCHED: Monday manga release in 3 hours
    SCHED->>BPT: Increase to event capacity (25 Sonnet + 8 Haiku)
    BPT-->>SCHED: Provisioning in progress (ETA 15 min)
    SCHED->>CW: Publish "ProvisioningState: SCALING" metric

    Note over BPT: 15 minutes later...
    BPT-->>SCHED: Provisioning complete
    SCHED->>CW: Publish "ProvisioningState: ACTIVE" metric

    Note over CAL: Manga release begins
    CW-->>SCHED: TPS climbing: 15K -> 25K -> 32K

    alt TPS exceeds provisioned capacity
        SCHED->>BPT: Emergency increase request
        SCHED->>OPS: Alert: approaching provisioned ceiling
    else TPS within capacity
        SCHED->>CW: All nominal
    end

    Note over CAL: 4 hours later, event subsides
    SCHED->>BPT: Scale back to evening baseline (15 Sonnet + 5 Haiku)
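
The 3-hour lead time in this sequence reduces to a small calendar check. The sketch below uses the same event-entry keys that the GenAIAutoScaler in Section 6 consumes:

from datetime import datetime, timedelta

LEAD_TIME = timedelta(hours=3)   # pre-provision this far ahead of an event start

def events_needing_preprovision(event_calendar: dict, now: datetime) -> list[str]:
    """Return names of events whose start time falls inside the pre-provisioning window."""
    return [
        name
        for name, cfg in event_calendar.items()
        if now <= cfg["start"] <= now + LEAD_TIME
    ]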

6. Python Implementation -- GenAIAutoScaler

"""
GenAI-aware auto-scaler for MangaAssist.
Combines ECS task scaling with Bedrock provisioned throughput adjustment.
"""

import time
import json
import logging
from dataclasses import dataclass
from datetime import datetime, timezone, timedelta
from enum import Enum
from typing import Optional

import boto3

logger = logging.getLogger("mangaassist.autoscaler")

JST = timezone(timedelta(hours=9))


class ScalingState(Enum):
    STABLE = "stable"
    SCALING_OUT = "scaling_out"
    SCALING_IN = "scaling_in"
    EMERGENCY = "emergency"
    EVENT_OVERRIDE = "event_override"


@dataclass
class TrafficSnapshot:
    """Point-in-time traffic measurements."""
    timestamp: float
    tokens_per_second: float
    bedrock_queue_depth: int
    active_invocations: int
    throttle_count: int
    p99_latency_ms: float
    sonnet_ratio: float  # Fraction of invocations going to Sonnet vs Haiku
    ecs_task_count: int
    provisioned_sonnet_units: int
    provisioned_haiku_units: int


@dataclass
class ScalingDecision:
    """Output of the scaling evaluation."""
    state: ScalingState
    ecs_desired_count: Optional[int]
    sonnet_units: Optional[int]
    haiku_units: Optional[int]
    reason: str
    actions: list[str]


class GenAIAutoScaler:
    """
    Auto-scaler for MangaAssist that understands FM workload characteristics.

    Key differences from a generic auto-scaler:
    1. Scales on tokens-per-second and queue depth, not CPU.
    2. Adjusts Bedrock provisioned throughput alongside ECS tasks.
    3. Uses traffic slope prediction to scale proactively.
    4. Integrates with an event calendar for pre-provisioning.
    5. Implements throttle-response strategies beyond just adding capacity.
    """

    # Capacity constants
    SONNET_TPS_PER_UNIT = 1_000
    HAIKU_TPS_PER_UNIT = 2_000
    TPS_PER_ECS_TASK = 500
    ECS_MIN_TASKS = 5
    ECS_MAX_TASKS = 100

    # Thresholds
    TPS_UTILIZATION_SCALE_OUT = 0.80  # Scale out at 80% utilization
    TPS_UTILIZATION_SCALE_IN = 0.40   # Scale in at 40% utilization
    TPS_UTILIZATION_EMERGENCY = 0.95  # Emergency at 95%
    QUEUE_DEPTH_SCALE_OUT = 50
    QUEUE_DEPTH_EMERGENCY = 200
    LATENCY_SCALE_OUT_MS = 4_000
    THROTTLE_EMERGENCY = 1  # Any throttle is an emergency

    # Prediction
    PREDICTION_WINDOW_SECONDS = 300  # Look ahead 5 minutes

    def __init__(
        self,
        ecs_cluster: str,
        ecs_service: str,
        region: str = "ap-northeast-1",
        event_calendar: Optional[dict] = None,
    ):
        self.ecs_cluster = ecs_cluster
        self.ecs_service = ecs_service
        self.region = region
        self.event_calendar = event_calendar or {}

        self._ecs_client = boto3.client("ecs", region_name=region)
        self._bedrock_client = boto3.client("bedrock", region_name=region)
        self._cloudwatch_client = boto3.client("cloudwatch", region_name=region)
        self._autoscaling_client = boto3.client("application-autoscaling", region_name=region)

        # Rolling window of snapshots for slope calculation
        self._snapshot_history: list[TrafficSnapshot] = []
        self._max_history = 60  # Keep last 60 snapshots (10 minutes at 10s intervals)

        self._last_scale_action_time = 0.0
        self._cooldown_seconds = 60.0

    def evaluate(self, snapshot: TrafficSnapshot) -> ScalingDecision:
        """
        Evaluate current traffic and return a scaling decision.
        This is the main entry point, called every 10 seconds.
        """
        self._snapshot_history.append(snapshot)
        if len(self._snapshot_history) > self._max_history:
            self._snapshot_history = self._snapshot_history[-self._max_history:]

        # Check for event override first
        event_decision = self._check_event_override(snapshot)
        if event_decision is not None:
            return event_decision

        # Check for emergency conditions
        if snapshot.throttle_count >= self.THROTTLE_EMERGENCY:
            return self._handle_throttle_emergency(snapshot)

        if snapshot.bedrock_queue_depth >= self.QUEUE_DEPTH_EMERGENCY:
            return self._handle_queue_emergency(snapshot)

        # Calculate provisioned capacity
        provisioned_tps = (
            snapshot.provisioned_sonnet_units * self.SONNET_TPS_PER_UNIT
            + snapshot.provisioned_haiku_units * self.HAIKU_TPS_PER_UNIT
        )
        utilization = snapshot.tokens_per_second / provisioned_tps if provisioned_tps > 0 else 1.0

        # Predictive check: will we breach in 5 minutes?
        predicted_tps = self._predict_tps()
        predicted_utilization = predicted_tps / provisioned_tps if provisioned_tps > 0 else 1.0

        # Scale-out evaluation
        if (
            utilization > self.TPS_UTILIZATION_SCALE_OUT
            or predicted_utilization > self.TPS_UTILIZATION_EMERGENCY
            or snapshot.bedrock_queue_depth > self.QUEUE_DEPTH_SCALE_OUT
            or snapshot.p99_latency_ms > self.LATENCY_SCALE_OUT_MS
        ):
            return self._compute_scale_out(snapshot, utilization, predicted_tps)

        # Scale-in evaluation
        if (
            utilization < self.TPS_UTILIZATION_SCALE_IN
            and snapshot.bedrock_queue_depth < 10
            and snapshot.p99_latency_ms < 2_000
            and not self._in_cooldown()
        ):
            return self._compute_scale_in(snapshot, utilization)

        # Stable -- no action needed
        return ScalingDecision(
            state=ScalingState.STABLE,
            ecs_desired_count=None,
            sonnet_units=None,
            haiku_units=None,
            reason=f"Stable: utilization={utilization:.1%}, queue={snapshot.bedrock_queue_depth}, p99={snapshot.p99_latency_ms:.0f}ms",
            actions=[],
        )

    def _predict_tps(self) -> float:
        """
        Predict tokens-per-second in 5 minutes using linear regression
        on the recent snapshot history.
        """
        if len(self._snapshot_history) < 6:
            # Not enough data; return current value
            return self._snapshot_history[-1].tokens_per_second if self._snapshot_history else 0

        recent = self._snapshot_history[-30:]  # Last 5 minutes of 10s snapshots
        if len(recent) < 2:
            return recent[-1].tokens_per_second

        # Simple linear regression: TPS = a + b * t
        t_values = [s.timestamp - recent[0].timestamp for s in recent]
        tps_values = [s.tokens_per_second for s in recent]

        n = len(t_values)
        sum_t = sum(t_values)
        sum_tps = sum(tps_values)
        sum_t_tps = sum(t * tps for t, tps in zip(t_values, tps_values))
        sum_t_sq = sum(t * t for t in t_values)

        denominator = n * sum_t_sq - sum_t * sum_t
        if denominator == 0:
            return tps_values[-1]

        slope = (n * sum_t_tps - sum_t * sum_tps) / denominator
        intercept = (sum_tps - slope * sum_t) / n

        # Predict 5 minutes into the future
        future_t = t_values[-1] + self.PREDICTION_WINDOW_SECONDS
        predicted = intercept + slope * future_t

        return max(predicted, 0)  # TPS can't be negative

    def _handle_throttle_emergency(self, snapshot: TrafficSnapshot) -> ScalingDecision:
        """Respond to Bedrock throttling with multi-layered mitigation."""
        additional_sonnet = max(2, snapshot.throttle_count)  # At least 2 extra units
        additional_haiku = max(1, snapshot.throttle_count // 2)

        return ScalingDecision(
            state=ScalingState.EMERGENCY,
            ecs_desired_count=min(snapshot.ecs_task_count + 10, self.ECS_MAX_TASKS),
            sonnet_units=snapshot.provisioned_sonnet_units + additional_sonnet,
            haiku_units=snapshot.provisioned_haiku_units + additional_haiku,
            reason=f"EMERGENCY: {snapshot.throttle_count} throttles detected",
            actions=[
                f"Increase provisioned Sonnet by {additional_sonnet} units",
                f"Increase provisioned Haiku by {additional_haiku} units",
                "Activate aggressive request coalescing",
                "Shift eligible queries from Sonnet to Haiku",
                "Extend Redis cache TTL to 120s",
                "Page on-call engineer",
            ],
        )

    def _handle_queue_emergency(self, snapshot: TrafficSnapshot) -> ScalingDecision:
        """Respond to extreme queue depth."""
        return ScalingDecision(
            state=ScalingState.EMERGENCY,
            ecs_desired_count=min(snapshot.ecs_task_count + 10, self.ECS_MAX_TASKS),
            sonnet_units=snapshot.provisioned_sonnet_units + 3,
            haiku_units=snapshot.provisioned_haiku_units + 2,
            reason=f"EMERGENCY: Queue depth {snapshot.bedrock_queue_depth} exceeds {self.QUEUE_DEPTH_EMERGENCY}",
            actions=[
                "Add 10 ECS tasks immediately",
                "Increase provisioned throughput",
                "Enable degraded mode for FAQ queries (template responses)",
                "Alert on-call engineer",
            ],
        )

    def _compute_scale_out(
        self, snapshot: TrafficSnapshot, utilization: float, predicted_tps: float
    ) -> ScalingDecision:
        """Calculate how much to scale out."""
        # ECS task scaling: target TPS_PER_ECS_TASK per task
        desired_tasks = max(
            snapshot.ecs_task_count + 2,
            int(predicted_tps / self.TPS_PER_ECS_TASK) + 1,
        )
        desired_tasks = min(desired_tasks, self.ECS_MAX_TASKS)

        # Provisioned throughput: add units if utilization is high
        sonnet_units = snapshot.provisioned_sonnet_units
        haiku_units = snapshot.provisioned_haiku_units

        if utilization > self.TPS_UTILIZATION_SCALE_OUT:
            # Add units proportional to the gap
            needed_tps = predicted_tps * 1.3  # 30% headroom
            current_capacity = (
                sonnet_units * self.SONNET_TPS_PER_UNIT
                + haiku_units * self.HAIKU_TPS_PER_UNIT
            )
            gap = needed_tps - current_capacity
            if gap > 0:
                # Split additional capacity between Sonnet and Haiku
                # based on current Sonnet ratio
                sonnet_gap = int(gap * snapshot.sonnet_ratio / self.SONNET_TPS_PER_UNIT) + 1
                haiku_gap = int(gap * (1 - snapshot.sonnet_ratio) / self.HAIKU_TPS_PER_UNIT) + 1
                sonnet_units += sonnet_gap
                haiku_units += haiku_gap

        actions = []
        if desired_tasks > snapshot.ecs_task_count:
            actions.append(
                f"Scale ECS from {snapshot.ecs_task_count} to {desired_tasks} tasks"
            )
        if sonnet_units > snapshot.provisioned_sonnet_units:
            actions.append(
                f"Increase Sonnet provisioned from {snapshot.provisioned_sonnet_units} to {sonnet_units} units"
            )
        if haiku_units > snapshot.provisioned_haiku_units:
            actions.append(
                f"Increase Haiku provisioned from {snapshot.provisioned_haiku_units} to {haiku_units} units"
            )

        return ScalingDecision(
            state=ScalingState.SCALING_OUT,
            ecs_desired_count=desired_tasks,
            sonnet_units=sonnet_units,
            haiku_units=haiku_units,
            reason=f"Scale-out: utilization={utilization:.1%}, predicted_tps={predicted_tps:.0f}, queue={snapshot.bedrock_queue_depth}",
            actions=actions,
        )

    def _compute_scale_in(self, snapshot: TrafficSnapshot, utilization: float) -> ScalingDecision:
        """Calculate how much to scale in (conservative)."""
        # Never scale in by more than 10% at a time
        reduction = max(1, int(snapshot.ecs_task_count * 0.10))
        desired_tasks = max(snapshot.ecs_task_count - reduction, self.ECS_MIN_TASKS)

        # Reduce provisioned throughput only if utilization is very low
        sonnet_units = snapshot.provisioned_sonnet_units
        haiku_units = snapshot.provisioned_haiku_units

        if utilization < 0.25:
            sonnet_units = max(3, sonnet_units - 1)
            haiku_units = max(2, haiku_units - 1)

        actions = []
        if desired_tasks < snapshot.ecs_task_count:
            actions.append(
                f"Scale ECS from {snapshot.ecs_task_count} to {desired_tasks} tasks"
            )
        if sonnet_units < snapshot.provisioned_sonnet_units:
            actions.append(
                f"Reduce Sonnet provisioned from {snapshot.provisioned_sonnet_units} to {sonnet_units} units"
            )

        return ScalingDecision(
            state=ScalingState.SCALING_IN,
            ecs_desired_count=desired_tasks,
            sonnet_units=sonnet_units,
            haiku_units=haiku_units,
            reason=f"Scale-in: utilization={utilization:.1%}, queue={snapshot.bedrock_queue_depth}",
            actions=actions,
        )

    def _check_event_override(self, snapshot: TrafficSnapshot) -> Optional[ScalingDecision]:
        """Check if an event calendar entry requires override scaling."""
        now_jst = datetime.now(JST)
        for event_name, event_config in self.event_calendar.items():
            start = event_config["start"]
            end = event_config["end"]
            if start <= now_jst <= end:
                return ScalingDecision(
                    state=ScalingState.EVENT_OVERRIDE,
                    ecs_desired_count=min(event_config["ecs_tasks"], self.ECS_MAX_TASKS),
                    sonnet_units=event_config["sonnet_units"],
                    haiku_units=event_config["haiku_units"],
                    reason=f"Event override: {event_name} ({start} to {end})",
                    actions=[
                        f"Event '{event_name}' active -- using override capacity",
                        f"ECS: {event_config['ecs_tasks']} tasks",
                        f"Sonnet: {event_config['sonnet_units']} units",
                        f"Haiku: {event_config['haiku_units']} units",
                    ],
                )
        return None

    def _in_cooldown(self) -> bool:
        """Check if we are in a cooldown period after a recent scale action."""
        return (time.time() - self._last_scale_action_time) < self._cooldown_seconds

    async def apply_decision(self, decision: ScalingDecision):
        """Apply a scaling decision to AWS resources."""
        if decision.state == ScalingState.STABLE:
            return

        logger.info(
            "Applying scaling decision: state=%s reason=%s actions=%s",
            decision.state.value, decision.reason, decision.actions,
        )

        # Apply ECS scaling
        if decision.ecs_desired_count is not None:
            self._ecs_client.update_service(
                cluster=self.ecs_cluster,
                service=self.ecs_service,
                desiredCount=decision.ecs_desired_count,
            )
            logger.info("ECS desired count set to %d", decision.ecs_desired_count)

        # Apply provisioned throughput changes
        if decision.sonnet_units is not None:
            # Bedrock provisioned throughput API call would go here.
            # The exact API depends on the provisioned throughput model
            # (e.g., CreateProvisionedModelThroughput or UpdateProvisionedModelThroughput).
            logger.info(
                "Provisioned throughput update: Sonnet=%d units, Haiku=%d units",
                decision.sonnet_units,
                decision.haiku_units or 0,
            )

        # Publish scaling event to CloudWatch
        self._cloudwatch_client.put_metric_data(
            Namespace="MangaAssist",
            MetricData=[
                {
                    "MetricName": "ScalingAction",
                    "Value": 1,
                    "Unit": "Count",
                    "Dimensions": [
                        {"Name": "State", "Value": decision.state.value},
                        {"Name": "Service", "Value": "GenAIAutoScaler"},
                    ],
                },
            ],
        )

        self._last_scale_action_time = time.time()
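
A usage sketch, continuing from the module above; the cluster, service, event dates, and snapshot values are placeholders:

scaler = GenAIAutoScaler(
    ecs_cluster="mangaassist-cluster",
    ecs_service="orchestrator",
    event_calendar={
        "weekly_shonen_jump": {
            "start": datetime(2026, 1, 4, 21, 0, tzinfo=JST),   # Sunday 21:00 JST
            "end": datetime(2026, 1, 5, 4, 0, tzinfo=JST),      # Monday 04:00 JST
            "ecs_tasks": 50,
            "sonnet_units": 25,
            "haiku_units": 8,
        },
    },
)

snapshot = TrafficSnapshot(
    timestamp=time.time(),
    tokens_per_second=21_500,
    bedrock_queue_depth=35,
    active_invocations=42,
    throttle_count=0,
    p99_latency_ms=2_800,
    sonnet_ratio=0.55,
    ecs_task_count=30,
    provisioned_sonnet_units=15,
    provisioned_haiku_units=5,
)

decision = scaler.evaluate(snapshot)
print(decision.state, decision.reason)
# asyncio.run(scaler.apply_decision(decision))  # applies ECS and CloudWatch changes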

7. Manga Release Scale-Up Sequence

The following diagram shows the complete scale-up sequence during a Monday manga release event, from the calendar trigger through capacity stabilization:

sequenceDiagram
    participant CAL as Event Calendar<br>(Lambda CronJob)
    participant SCALER as GenAIAutoScaler
    participant ECS as ECS Fargate<br>Service
    participant BPT as Bedrock Provisioned<br>Throughput
    participant REDIS as ElastiCache Redis
    participant CW as CloudWatch
    participant OPS as On-Call Eng

    Note over CAL: Sunday 21:00 JST<br>3 hours before release

    CAL->>SCALER: Event: "weekly_shonen_jump"<br>starts in 3h
    SCALER->>BPT: UpdateProvisionedThroughput<br>Sonnet: 15 → 25 units<br>Haiku: 5 → 8 units
    BPT-->>SCALER: Status: PROVISIONING
    SCALER->>ECS: UpdateService<br>desiredCount: 30 → 50
    ECS-->>SCALER: Tasks launching...
    SCALER->>REDIS: Pre-warm popular queries<br>(top 100 manga titles)
    SCALER->>CW: Publish ScalingState=EVENT_OVERRIDE

    Note over ECS: 5 minutes later<br>50 tasks running

    Note over BPT: 15 minutes later<br>Provisioned throughput active

    SCALER->>CW: Publish ProvisionedState=ACTIVE
    SCALER->>OPS: INFO: Event capacity ready

    Note over CAL: Monday 00:00 JST<br>Release goes live

    CW-->>SCALER: TPS: 7K → 15K → 22K → 30K
    Note over SCALER: TPS within provisioned<br>capacity (41K). Stable.

    CW-->>SCALER: TPS: 30K → 33K → 36K
    Note over SCALER: TPS approaching the<br>provisioned ceiling (>85% of 41K)

    SCALER->>BPT: Emergency: Sonnet 25 → 28 units
    SCALER->>ECS: UpdateService: 50 → 65 tasks
    SCALER->>REDIS: Extend cache TTL: 30s → 90s
    SCALER->>OPS: WARN: Approaching capacity ceiling

    Note over CW: 30 minutes later<br>Surge subsides

    CW-->>SCALER: TPS: 36K → 28K → 20K
    SCALER->>SCALER: Predicted TPS in 5m: 16K<br>Safe to scale in

    Note over CAL: Monday 04:00 JST<br>Event window ends

    CAL->>SCALER: Event "weekly_shonen_jump" ended
    SCALER->>BPT: Reduce to evening baseline<br>Sonnet: 28 → 15 units
    SCALER->>ECS: UpdateService: 65 → 30 tasks
    SCALER->>REDIS: Restore cache TTL: 90s → 30s
    SCALER->>CW: Publish ScalingState=STABLE

8. Monitoring Dashboard Layout

MangaAssist uses a CloudWatch dashboard with the following panels to give the on-call engineer full visibility into auto-scaling behavior:

| Panel | Metrics Shown | Time Range | Purpose |
| --- | --- | --- | --- |
| Tokens per Second | MangaAssist/TokensPerSecond (line), provisioned capacity (horizontal line) | 6 hours | See how close live traffic is to the provisioned ceiling |
| Bedrock Queue Depth | MangaAssist/BedrockQueueDepth max per task | 1 hour | Spot queue buildup before it hits latency |
| P99 Latency | MangaAssist/RequestLatencyP99 | 1 hour | Confirm user-facing impact of any scaling event |
| ECS Task Count | AWS/ECS DesiredCount, RunningCount | 6 hours | Verify scaling actions are applied and tasks are healthy |
| Bedrock Throttles | AWS/Bedrock InvocationThrottles sum | 1 hour | Any non-zero value is an immediate concern |
| Provisioned Throughput Units | Custom metric tracking model units | 24 hours | See time-of-day provisioning changes |
| Scaling Decisions | MangaAssist/ScalingAction by state | 6 hours | Audit trail of every scale-out/in/emergency decision |
| Cost Rate | Computed: (tokens * price_per_token) per hour | 24 hours | Real-time cost tracking against budget |
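
A truncated sketch of creating two of these panels with the CloudWatch PutDashboard API; the dashboard name and widget layout values are placeholders:

import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

dashboard_body = {
    "widgets": [
        {
            "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Tokens per Second",
                "metrics": [["MangaAssist", "TokensPerSecond"]],
                "stat": "Average", "period": 60, "region": "ap-northeast-1",
            },
        },
        {
            "type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Bedrock Throttles",
                "metrics": [["AWS/Bedrock", "InvocationThrottles"]],
                "stat": "Sum", "period": 60, "region": "ap-northeast-1",
            },
        },
    ]
}

cloudwatch.put_dashboard(
    DashboardName="MangaAssist-AutoScaling",
    DashboardBody=json.dumps(dashboard_body),
)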

Key Takeaways

  1. CPU utilization is the wrong signal for FM workloads -- MangaAssist's orchestrator sits at 30% CPU during peak traffic because it is I/O bound waiting on Bedrock. Queue depth and tokens-per-second are the correct scaling metrics.
  2. Prediction prevents throttling -- By the time Bedrock starts throttling, users are already affected. A 5-minute linear prediction of TPS allows pre-emptive scaling that keeps the system ahead of demand.
  3. Throttle response is multi-layered -- Adding ECS tasks when Bedrock is throttled does not help. The correct response combines provisioned throughput increases, cache extension, model downgrade (Sonnet to Haiku), and degraded-mode template responses.
  4. Scale-in must be conservative -- Asymmetric cooldowns (60s for scale-out, 300s for scale-in) prevent oscillation during the evening ramp where traffic can dip briefly between surges.
  5. Event calendars are not optional -- Manga releases happen on predictable schedules. Pre-provisioning 3 hours ahead eliminates the cold-start penalty of provisioned throughput activation.