01: High-Performance FM Architecture and Throughput Management
MangaAssist is a chatbot for a Japanese manga store, running on AWS. It uses Bedrock Claude 3 (Sonnet for complex queries, Haiku for simple ones), OpenSearch Serverless for vector retrieval, DynamoDB for session and catalog data, ECS Fargate for orchestration, API Gateway WebSocket for real-time delivery, and ElastiCache Redis for caching. The system handles 1M messages/day with a target of under 3 seconds end-to-end response time.
Skill Mapping
| Field | Value |
|---|---|
| Domain | 4 -- Operational Efficiency Optimization |
| Task | 4.1 -- Optimize foundation model cost and performance |
| Skill | 4.1.3 -- Implement strategies for high-performance FM systems including batching, capacity planning, utilization monitoring, auto-scaling, and provisioned throughput optimization |
| MangaAssist Focus | Throughput management for 1M daily messages with sub-3s latency, burst handling for manga releases and seasonal events, cost-efficient provisioning |
High-Performance FM Dimensions
mindmap
  root((High-Performance FM Systems))
    Batching
      Micro-Batching by Intent
      Request Coalescing
      Batch Window Tuning
      Priority Queue Separation
    Capacity Planning
      Token Throughput Targets
      Peak Hour Analysis
      Event Provisioning
      Growth Forecasting
    Utilization Monitoring
      Invocation Metrics
      Tokens-per-Second Tracking
      Throttle Rate Monitoring
      Waste Detection
    Auto-Scaling
      ECS Fargate Scaling
      Queue-Depth Triggers
      Step Scaling Policies
      Target Tracking
    Provisioned Throughput
      Model Units Sizing
      On-Demand vs Provisioned
      Breakeven Analysis
      Dynamic Adjustment
1. Batching Strategies
1.1 Micro-Batching for Similar Intents
MangaAssist classifies every incoming message into one of several intents: product_search, order_status, faq, recommendation, greeting, etc. Messages with the same intent often share prompt templates, system instructions, and retrieval patterns. Micro-batching groups these messages together so the orchestrator can:
- share a single retrieval call across multiple queries searching the same index partition,
- pack multiple user queries into a single Bedrock invocation using a batched prompt layout (where the model processes N user questions in one call),
- reduce per-request overhead (TLS handshake, connection setup, retry envelope).
The trade-off is added latency: each message waits in the batch window before processing begins. For MangaAssist, the window must stay under 200ms to preserve the 3-second budget.
sequenceDiagram
participant U1 as User A (product_search)
participant U2 as User B (product_search)
participant U3 as User C (order_status)
participant Q as Intent Queue
participant MB as Micro-Batcher
participant B as Bedrock Claude
U1->>Q: "Find shonen manga under 800 yen"
U2->>Q: "Any new isekai releases?"
U3->>Q: "Where is my order 12345?"
Q->>MB: Batch [U1, U2] (same intent: product_search)
Q->>MB: Single [U3] (order_status, different template)
MB->>B: Batched invocation for product_search (2 queries)
MB->>B: Single invocation for order_status
B-->>MB: Batched response [resp_A, resp_B]
B-->>MB: Single response [resp_C]
MB-->>U1: resp_A
MB-->>U2: resp_B
MB-->>U3: resp_C
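The batched prompt layout can be sketched as follows. The numbered Q/A convention is an assumption for illustration, not an official Bedrock feature, and the helper names (`pack_batched_prompt`, `split_batched_response`) are hypothetical:

```python
# Sketch of a batched prompt layout: N user questions are packed into one
# prompt and the model is instructed to answer each under a numbered header.

def pack_batched_prompt(queries: list[str]) -> str:
    """Build a single prompt asking the model to answer N questions."""
    numbered = "\n".join(f"Q{i + 1}: {q}" for i, q in enumerate(queries))
    return (
        "Answer each question separately. Prefix every answer with "
        "'A<n>:' on its own line, matching the question number.\n\n"
        + numbered
    )

def split_batched_response(text: str, n: int) -> list[str]:
    """Split a model response back into per-question answers."""
    answers = [""] * n
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("A") and ":" in stripped:
            head, _, rest = stripped.partition(":")
            if head[1:].isdigit() and 1 <= int(head[1:]) <= n:
                current = int(head[1:]) - 1
                answers[current] = rest.strip()
                continue
        if current is not None and stripped:
            # Continuation line of the current answer
            answers[current] += ("\n" if answers[current] else "") + stripped
    return answers
```

The splitter must tolerate continuation lines because the model may answer in multiple paragraphs per question.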
1.2 Request Coalescing for Identical Queries
During peak hours or manga release events, many users ask near-identical questions within seconds: "Is volume 42 of One Piece available?", "One Piece 42 in stock?", "Do you have One Piece vol 42?". After intent classification and query normalization, these resolve to the same semantic query.
Request coalescing detects duplicate normalized queries arriving within a coalescing window (typically 500ms-2s) and fans out a single Bedrock response to all waiting callers. This eliminates redundant invocations, reduces cost, and keeps the system under its throttle ceiling.
Key design decisions:
- Normalization depth: MangaAssist normalizes to (intent, product_id, query_hash) where query_hash is a semantic hash (embedding-based) rather than an exact string match.
- Staleness risk: Coalesced responses must not be stale. For order-status queries, coalescing is disabled because each user's order is unique.
- Cache interaction: Coalesced results are written to Redis with a short TTL (30s) so subsequent identical queries within the TTL skip Bedrock entirely.
2. Capacity Planning
2.1 Token Throughput Requirements
MangaAssist must plan for tokens, not just requests. A simple greeting consumes ~50 tokens total (input + output), while a detailed recommendation with product cards can consume ~2,000 tokens. The blended average across all intents is approximately 600 tokens per request.
| Metric | Calculation | Value |
|---|---|---|
| Daily messages | Given | 1,000,000 |
| Avg tokens per message | Blended across intents | 600 |
| Daily tokens | 1M x 600 | 600,000,000 |
| Avg tokens/second (uniform) | 600M / 86,400 | ~6,944 tokens/s |
| Peak multiplier (JP evening) | 3x average | ~20,833 tokens/s |
| Burst multiplier (manga release) | 5x average | ~34,722 tokens/s |
| Target provisioned capacity | Peak + 30% headroom | ~27,083 tokens/s |
| Burst capacity (on-demand spillover) | Burst - provisioned | ~7,639 tokens/s on-demand |
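The table's arithmetic, as a quick check:

```python
# Reproduces the capacity-planning arithmetic from the table above.
DAILY_MESSAGES = 1_000_000
AVG_TOKENS_PER_MESSAGE = 600
SECONDS_PER_DAY = 86_400

daily_tokens = DAILY_MESSAGES * AVG_TOKENS_PER_MESSAGE   # 600,000,000
avg_tps = daily_tokens / SECONDS_PER_DAY                 # ~6,944 tokens/s
peak_tps = avg_tps * 3                                   # ~20,833 (JP evening)
burst_tps = avg_tps * 5                                  # ~34,722 (manga release)
provisioned_tps = peak_tps * 1.3                         # peak + 30% headroom
on_demand_spillover = burst_tps - provisioned_tps        # ~7,639 on-demand
```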
2.2 Peak Hour Analysis -- JP Timezone
MangaAssist traffic follows a strong daily pattern tied to Japanese consumer behavior. The store's primary users are in JST (UTC+9). Traffic analysis from production logs reveals:
Tokens/sec
35,000 | * (manga release spike)
30,000 |
25,000 | ***
20,000 | ** **
15,000 | ** **
10,000 | ********** * ***
7,000 |--- avg ------*****---------------------------------****-------
5,000 | *** ***
3,000 | *** ***
1,000 | ******* **
|_____|_____|_____|_____|_____|_____|_____|_____|_____|_____|____
00 03 06 09 12 15 18 21 00 03 06 JST
trough lunch peak evening overnight
Key observations:
- Trough: 02:00-06:00 JST -- under 2,000 tokens/s. Over-provisioning here is pure waste.
- Lunch bump: 11:00-13:00 JST -- brief 10,000 tokens/s spike, subsides quickly.
- Peak evening: 18:00-23:00 JST -- sustained 15,000-25,000 tokens/s. This is where provisioned throughput earns its keep.
- Manga release events: Overlaid on the evening peak, release events can push to 35,000 tokens/s for 30-60 minutes.
2.3 Event Provisioning -- Black Friday and Manga Releases
Scheduled events require pre-provisioning because Bedrock provisioned throughput changes take time to activate. MangaAssist maintains an event calendar:
| Event Type | Lead Time for Provisioning | Capacity Multiplier | Duration |
|---|---|---|---|
| Weekly Shonen Jump release (Monday) | 2 hours before midnight JST Sunday | 2x baseline peak | 4 hours |
| Major manga volume release | 6 hours before release time | 3x baseline peak | 6 hours |
| Black Friday / Cyber Monday | 24 hours before event start | 5x baseline peak | 72 hours |
| Amazon Prime Day | 24 hours before event start | 4x baseline peak | 48 hours |
| New Year sale | 12 hours before Dec 31 JST | 3x baseline peak | 96 hours |
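A minimal sketch of how the event calendar could drive a capacity multiplier. The `ProvisioningEvent` record is a hypothetical structure (JST-naive datetimes for brevity):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ProvisioningEvent:
    name: str
    start: datetime            # event start (JST, naive for brevity)
    duration_hours: int
    lead_time_hours: int       # provision this far ahead of start
    capacity_multiplier: float

def required_multiplier(events: list[ProvisioningEvent], now: datetime) -> float:
    """Return the capacity multiplier in force at `now`, accounting for
    lead time: capacity must be up before the event begins."""
    multiplier = 1.0
    for ev in events:
        provision_from = ev.start - timedelta(hours=ev.lead_time_hours)
        provision_until = ev.start + timedelta(hours=ev.duration_hours)
        if provision_from <= now < provision_until:
            multiplier = max(multiplier, ev.capacity_multiplier)
    return multiplier
```

With the Weekly Shonen Jump row above (2x baseline, 2-hour lead, 4-hour duration), the multiplier flips to 2.0 at 22:00 JST Sunday and back to 1.0 at 04:00 Monday.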
3. Utilization Monitoring
3.1 Bedrock Invocation Metrics
MangaAssist tracks the following CloudWatch metrics under the AWS/Bedrock namespace:
| Metric | What It Tells Us | Alarm Threshold |
|---|---|---|
| Invocations | Total API calls per period | Trend analysis, no hard alarm |
| InvocationLatency | P50/P90/P99 latency per invocation | P99 > 5s triggers investigation |
| InvocationClientErrors (4xx) | Validation failures, malformed requests | > 1% of invocations |
| InvocationServerErrors (5xx) | Bedrock-side failures | > 0.1% of invocations |
| InvocationThrottles | Requests rejected due to throughput limits | > 0 triggers immediate alert |
| InputTokenCount | Tokens consumed in prompts | Cost tracking, prompt bloat detection |
| OutputTokenCount | Tokens generated in responses | Cost tracking, verbosity detection |
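These metrics can be fetched in one `GetMetricData` call. A sketch of the query builder, where the `ModelId` dimension is the standard per-model dimension in the AWS/Bedrock namespace and the query ids are illustrative:

```python
def bedrock_metric_queries(model_id: str, period: int = 60) -> list[dict]:
    """Build CloudWatch GetMetricData queries for the Bedrock metrics in
    the table above; pass the result to
    cloudwatch.get_metric_data(MetricDataQueries=...)."""
    metrics = [
        ("Invocations", "Sum"),
        ("InvocationLatency", "p99"),
        ("InvocationClientErrors", "Sum"),
        ("InvocationServerErrors", "Sum"),
        ("InvocationThrottles", "Sum"),
        ("InputTokenCount", "Sum"),
        ("OutputTokenCount", "Sum"),
    ]
    return [
        {
            "Id": name.lower(),  # query ids must be lowercase
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": name,
                    "Dimensions": [{"Name": "ModelId", "Value": model_id}],
                },
                "Period": period,
                "Stat": stat,
            },
        }
        for name, stat in metrics
    ]
```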
3.2 Tokens-per-Second Tracking
Tokens-per-second (TPS) is the primary capacity metric. MangaAssist publishes a custom CloudWatch metric MangaAssist/TokensPerSecond computed as:
TPS = (InputTokenCount + OutputTokenCount) / period_seconds
This metric drives:
- Provisioned throughput sizing: If sustained TPS exceeds 80% of provisioned capacity for 10 minutes, scale-up is triggered.
- Cost allocation: TPS per intent allows finance to allocate Bedrock costs to product teams (recommendations team vs. FAQ team).
- Anomaly detection: CloudWatch Anomaly Detection on TPS catches unexpected surges (bot attacks, prompt injection attempts causing token inflation).
3.3 Waste Detection -- Over-Provisioned Capacity
Provisioned throughput that sits idle is wasted money. MangaAssist defines waste as:
Waste Ratio = 1 - (actual_TPS / provisioned_TPS)
| Waste Ratio | Interpretation | Action |
|---|---|---|
| < 20% | Healthy headroom | No action |
| 20-40% | Mild over-provisioning | Review at next capacity planning cycle |
| 40-60% | Significant waste | Schedule scale-down within 1 hour |
| > 60% | Critical waste | Immediate scale-down (overnight trough likely) |
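The formula and action bands as a small sketch (function names are illustrative):

```python
def waste_ratio(actual_tps: float, provisioned_tps: float) -> float:
    """Waste Ratio = 1 - actual_TPS / provisioned_TPS, clamped at 0
    (running above provisioned capacity is not waste)."""
    if provisioned_tps <= 0:
        return 0.0
    return max(0.0, 1.0 - actual_tps / provisioned_tps)

def waste_action(ratio: float) -> str:
    """Map a waste ratio onto the action bands in the table above."""
    if ratio < 0.20:
        return "no action"
    if ratio < 0.40:
        return "review at next capacity planning cycle"
    if ratio < 0.60:
        return "schedule scale-down within 1 hour"
    return "immediate scale-down"
```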
4. Auto-Scaling Configurations
4.1 ECS Fargate Auto-Scaling Tied to Bedrock Queue Depth
ECS Fargate runs the MangaAssist orchestrator service. Scaling this service based on CPU alone is insufficient because the orchestrator spends most of its time waiting on Bedrock (I/O bound, not CPU bound). Instead, MangaAssist scales on a custom metric: Bedrock request queue depth -- the number of in-flight Bedrock invocations waiting for a response.
graph TD
subgraph "Auto-Scaling Loop"
A[CloudWatch Custom Metric<br>bedrock_queue_depth] --> B{Queue Depth<br>Evaluation}
B -->|depth > 50 for 2 min| C[Step Scaling: +2 tasks]
B -->|depth > 100 for 1 min| D[Step Scaling: +5 tasks]
B -->|depth > 200 for 30s| E[Step Scaling: +10 tasks<br>Emergency]
B -->|depth < 20 for 10 min| F[Scale-in: -1 task]
B -->|depth < 5 for 15 min| G[Scale-in: -2 tasks]
end
subgraph "Metric Publisher"
H[ECS Orchestrator] --> I[Publish queue depth<br>every 10 seconds]
I --> A
end
subgraph "Bedrock"
H --> J[Bedrock Invocations]
J --> K[InvocationThrottles metric]
K --> A
end
4.2 Step Scaling for Gradual Ramp
Step scaling prevents oscillation during gradual traffic increases (e.g., the evening ramp from 17:00-19:00 JST). Each step adds capacity proportional to the severity of the breach:
| Queue Depth | Breach Duration | Action | Cooldown |
|---|---|---|---|
| 50 | 2 minutes | +2 ECS tasks | 120s |
| 100 | 1 minute | +5 ECS tasks | 90s |
| 200 | 30 seconds | +10 ECS tasks | 60s |
| < 20 | 10 minutes | -1 ECS task | 300s |
| < 5 | 15 minutes | -2 ECS tasks | 300s |
Scale-in cooldowns are deliberately longer than scale-out cooldowns to avoid premature shrinkage during brief traffic dips within a peak window.
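The step table can be expressed as a decision function. This is a simplified sketch: in production the steps are evaluated by CloudWatch alarms and Application Auto Scaling, not application code:

```python
def scale_step(queue_depth: int, breach_seconds: float) -> int:
    """Task-count delta for a given queue depth and breach duration,
    mirroring the step-scaling table above. Scale-out thresholds are
    checked most-severe first; scale-in requires a long quiet period."""
    if queue_depth >= 200 and breach_seconds >= 30:
        return +10  # emergency
    if queue_depth >= 100 and breach_seconds >= 60:
        return +5
    if queue_depth >= 50 and breach_seconds >= 120:
        return +2
    if queue_depth < 5 and breach_seconds >= 900:
        return -2
    if queue_depth < 20 and breach_seconds >= 600:
        return -1
    return 0
```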
4.3 Target Tracking for Tokens-per-Second
In addition to step scaling, MangaAssist uses a target tracking policy on the custom MangaAssist/TokensPerSecond metric. The target is set to maintain each ECS task at approximately 500 tokens/s throughput:
Target TPS per task = 500
Desired task count = ceil(current_total_TPS / 500)
This provides a smooth, predictive scaling signal that complements the reactive step scaling on queue depth.
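The sizing rule in code form, clamped to an assumed 5-100 task range matching the stack bounds in section 7.2:

```python
import math

TARGET_TPS_PER_TASK = 500

def desired_task_count(current_total_tps: float,
                       min_tasks: int = 5,
                       max_tasks: int = 100) -> int:
    """Desired count = ceil(total_TPS / 500), clamped to the
    service's minimum and maximum capacity."""
    wanted = math.ceil(current_total_tps / TARGET_TPS_PER_TASK)
    return max(min_tasks, min(max_tasks, wanted))
```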
5. Provisioned Throughput Optimization
5.1 Bedrock Provisioned Throughput -- Model Units
AWS Bedrock offers provisioned throughput purchased in model units. Each model unit provides a guaranteed tokens-per-second capacity for a specific model. For MangaAssist:
| Model | Use Case | On-Demand Rate | Provisioned Unit Capacity | Units Needed (Peak) |
|---|---|---|---|---|
| Claude 3 Sonnet | Complex queries (recommendations, multi-turn) | Variable, subject to throttling | ~1,000 tokens/s per unit | 15 units |
| Claude 3 Haiku | Simple queries (greetings, FAQs, order status) | Variable, subject to throttling | ~2,000 tokens/s per unit | 5 units |
5.2 On-Demand vs Provisioned -- Breakeven Analysis
The decision between on-demand and provisioned depends on sustained utilization. Provisioned throughput has a fixed hourly cost regardless of actual usage; on-demand charges per token.
| Traffic Level (tokens/s) | Monthly On-Demand Cost (est.) | Monthly Provisioned Cost (est.) | Savings with Provisioned | Recommendation |
|---|---|---|---|---|
| 2,000 (overnight trough) | $8,600 | $18,000 (5 units) | -$9,400 (loss) | On-demand |
| 7,000 (daily average) | $30,100 | $18,000 (5 units) | +$12,100 | Provisioned |
| 15,000 (evening peak) | $64,500 | $36,000 (10 units) | +$28,500 | Provisioned |
| 25,000 (peak evening) | $107,500 | $54,000 (15 units) | +$53,500 | Provisioned |
| 35,000 (manga release burst) | $150,500 | $54,000 + on-demand overflow | +$61,500 | Hybrid |
MangaAssist strategy: Provision for the sustained evening peak (15 units Sonnet + 5 units Haiku) and let on-demand handle bursts above that. During overnight trough, scale provisioned throughput down to minimum (3 units Sonnet + 2 units Haiku).
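The breakeven arithmetic can be sketched as below. All prices are placeholder assumptions for illustration, not actual Bedrock rates:

```python
SECONDS_PER_MONTH = 30 * 86_400  # 30-day month

def monthly_costs(avg_tps: float,
                  od_price_per_1k_tokens: float,
                  unit_price_per_hour: float,
                  units: int) -> tuple[float, float]:
    """Compare on-demand vs provisioned monthly cost at a sustained TPS.
    On-demand charges per token; provisioned charges per unit-hour
    regardless of usage."""
    on_demand = avg_tps * SECONDS_PER_MONTH / 1_000 * od_price_per_1k_tokens
    provisioned = units * unit_price_per_hour * 24 * 30
    return on_demand, provisioned

def breakeven_tps(od_price_per_1k_tokens: float,
                  unit_price_per_hour: float,
                  units: int) -> float:
    """Sustained TPS above which `units` of provisioned throughput is
    cheaper than pure on-demand."""
    provisioned = units * unit_price_per_hour * 24 * 30
    return provisioned / (SECONDS_PER_MONTH / 1_000 * od_price_per_1k_tokens)
```

With placeholder rates, the breakeven lands in the low thousands of tokens/s for a small unit count, which is why the overnight trough stays on-demand while the evening peak is provisioned.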
6. Throughput Management Architecture
graph TB
subgraph "Client Layer"
WS[WebSocket Clients<br>1M messages/day]
end
subgraph "Ingress"
AG[API Gateway WebSocket]
end
subgraph "Orchestration Layer (ECS Fargate)"
IC[Intent Classifier]
MB[Micro-Batcher]
RC[Request Coalescer]
PQ[Priority Queue<br>High / Normal / Low]
RL[Rate Limiter]
end
subgraph "Cache Layer"
REDIS[ElastiCache Redis<br>Coalesced Response Cache<br>Embedding Cache]
end
subgraph "FM Layer"
BPT[Bedrock Provisioned Throughput<br>Sonnet: 15 units<br>Haiku: 5 units]
BOD[Bedrock On-Demand<br>Overflow / Burst]
end
subgraph "Monitoring"
CW[CloudWatch Metrics<br>TPS, Queue Depth, Throttles]
AS[Auto-Scaling Policies<br>Step + Target Tracking]
ALARM[Alarms + SNS]
end
WS --> AG --> IC
IC --> MB
MB --> RC
RC -->|cache hit| REDIS
RC -->|cache miss| PQ
PQ --> RL
RL -->|within provisioned capacity| BPT
RL -->|overflow| BOD
BPT --> REDIS
BOD --> REDIS
CW --> AS
AS -->|scale ECS| IC
AS -->|adjust provisioned units| BPT
CW --> ALARM
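The Priority Queue component in the diagram can be sketched with a heap. This is an illustrative class, not the production implementation:

```python
import heapq
import itertools

# Lower number = higher priority; a monotonic counter keeps FIFO order
# within the same priority band (heapq tuples compare element-wise).
PRIORITY = {"high": 0, "normal": 1, "low": 2}

class PriorityRequestQueue:
    def __init__(self):
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()

    def push(self, request_id: str, priority: str = "normal") -> None:
        heapq.heappush(self._heap, (PRIORITY[priority], next(self._counter), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```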
7. Python Implementation
7.1 ThroughputManager with Batching and Rate Limiting
import asyncio
import time
import hashlib
import json
from dataclasses import dataclass, field
from typing import Optional
from collections import defaultdict
import boto3
@dataclass
class InferenceRequest:
"""A single MangaAssist inference request."""
request_id: str
user_id: str
intent: str
prompt: str
model_id: str
priority: str = "normal" # high, normal, low
created_at: float = field(default_factory=time.time)
normalized_query_hash: Optional[str] = None
@dataclass
class InferenceResponse:
"""Response from Bedrock invocation."""
request_id: str
output_text: str
input_tokens: int
output_tokens: int
latency_ms: float
from_cache: bool = False
from_coalesce: bool = False
class ThroughputManager:
"""
Manages Bedrock throughput for MangaAssist with:
- Micro-batching by intent
- Request coalescing for identical queries
- Priority-based rate limiting
- Provisioned vs on-demand routing
"""
def __init__(
self,
provisioned_tps: int = 20_000,
batch_window_ms: int = 150,
coalesce_window_ms: int = 1_000,
max_batch_size: int = 8,
max_concurrent_invocations: int = 100,
redis_client=None,
):
self.provisioned_tps = provisioned_tps
self.batch_window_ms = batch_window_ms
self.coalesce_window_ms = coalesce_window_ms
self.max_batch_size = max_batch_size
self.redis_client = redis_client
# Semaphore to limit concurrent Bedrock invocations
self._invocation_semaphore = asyncio.Semaphore(max_concurrent_invocations)
# Intent-based batch queues: intent -> list of (request, future)
self._batch_queues: dict[str, list] = defaultdict(list)
self._batch_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
# Coalescing: query_hash -> (future, timestamp)
        self._coalesce_map: dict[str, tuple[asyncio.Future, float]] = {}
self._coalesce_lock = asyncio.Lock()
# Metrics tracking
self._tokens_this_second = 0
self._tokens_lock = asyncio.Lock()
self._last_token_reset = time.time()
# Rate limiting buckets
self._rate_limits = {
"high": asyncio.Semaphore(50),
"normal": asyncio.Semaphore(40),
"low": asyncio.Semaphore(10),
}
# Bedrock client
self._bedrock_runtime = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
async def process_request(self, request: InferenceRequest) -> InferenceResponse:
"""
Main entry point. Routes through coalescing, batching, and rate limiting.
"""
# Step 1: Check coalescing (skip for order-status, unique per user)
if request.intent not in ("order_status", "account_info"):
coalesced = await self._try_coalesce(request)
if coalesced is not None:
return coalesced
        # Step 2: Check Redis cache
        cached = await self._check_cache(request)
        if cached is not None:
            # Resolve any coalesce registration made in step 1 so
            # concurrent duplicates are not left awaiting forever
            if request.normalized_query_hash:
                entry = self._coalesce_map.pop(request.normalized_query_hash, None)
                if entry is not None and not entry[0].done():
                    entry[0].set_result(cached)
            return cached
# Step 3: Enqueue for micro-batching
response = await self._enqueue_for_batch(request)
return response
def _compute_query_hash(self, request: InferenceRequest) -> str:
"""Compute a normalized hash for coalescing identical queries."""
normalized = json.dumps({
"intent": request.intent,
"prompt_hash": hashlib.sha256(
request.prompt.strip().lower().encode()
).hexdigest()[:16],
"model_id": request.model_id,
}, sort_keys=True)
return hashlib.sha256(normalized.encode()).hexdigest()[:32]
    async def _try_coalesce(self, request: InferenceRequest) -> Optional[InferenceResponse]:
        """
        If an identical query is already in-flight, wait for its result
        instead of making a duplicate Bedrock call.
        """
        query_hash = self._compute_query_hash(request)
        request.normalized_query_hash = query_hash
        existing_future = None
        async with self._coalesce_lock:
            if query_hash in self._coalesce_map:
                candidate, created_at = self._coalesce_map[query_hash]
                age_ms = (time.time() - created_at) * 1000
                if age_ms < self.coalesce_window_ms:
                    existing_future = candidate
            if existing_future is None:
                # No in-flight duplicate; register this request for coalescing
                future = asyncio.get_running_loop().create_future()
                self._coalesce_map[query_hash] = (future, time.time())
                return None  # Proceed to batching/invocation
        # Await outside the lock so other requests are not blocked
        # while the in-flight invocation completes
        result = await existing_future
        return InferenceResponse(
            request_id=request.request_id,
            output_text=result.output_text,
            input_tokens=result.input_tokens,
            output_tokens=result.output_tokens,
            latency_ms=result.latency_ms,
            from_coalesce=True,
        )
async def _check_cache(self, request: InferenceRequest) -> Optional[InferenceResponse]:
"""Check Redis for a cached response."""
if self.redis_client is None:
return None
cache_key = f"manga:resp:{request.normalized_query_hash or self._compute_query_hash(request)}"
cached_data = self.redis_client.get(cache_key)
if cached_data:
data = json.loads(cached_data)
return InferenceResponse(
request_id=request.request_id,
output_text=data["output_text"],
input_tokens=data["input_tokens"],
output_tokens=data["output_tokens"],
latency_ms=0.0,
from_cache=True,
)
return None
async def _enqueue_for_batch(self, request: InferenceRequest) -> InferenceResponse:
"""
Add request to the intent-specific batch queue.
If the batch fills up or the window expires, flush the batch.
"""
        future = asyncio.get_running_loop().create_future()
async with self._batch_locks[request.intent]:
self._batch_queues[request.intent].append((request, future))
if len(self._batch_queues[request.intent]) >= self.max_batch_size:
batch = self._batch_queues[request.intent][:self.max_batch_size]
self._batch_queues[request.intent] = self._batch_queues[request.intent][self.max_batch_size:]
asyncio.create_task(self._flush_batch(request.intent, batch))
elif len(self._batch_queues[request.intent]) == 1:
# First item in batch -- start the window timer
asyncio.create_task(self._batch_timer(request.intent))
return await future
async def _batch_timer(self, intent: str):
"""Wait for the batch window, then flush whatever is queued."""
await asyncio.sleep(self.batch_window_ms / 1000.0)
async with self._batch_locks[intent]:
if self._batch_queues[intent]:
batch = self._batch_queues[intent][:]
self._batch_queues[intent] = []
asyncio.create_task(self._flush_batch(intent, batch))
    async def _flush_batch(self, intent: str, batch: list):
        """
        Invoke Bedrock for a batch of requests concurrently.
        Uses rate limiting and routes to provisioned or on-demand.
        """
        await asyncio.gather(
            *(self._process_one(request, future) for request, future in batch)
        )

    async def _process_one(self, request: InferenceRequest, future: asyncio.Future):
        """Invoke Bedrock for one request and fan the result out."""
        priority_sem = self._rate_limits.get(request.priority, self._rate_limits["normal"])
        await priority_sem.acquire()
        try:
            async with self._invocation_semaphore:
                response = await self._invoke_bedrock(request)
            future.set_result(response)
            # Resolve any coalesced waiters for the same query
            self._resolve_coalesce(request, result=response)
            # Cache the result
            await self._cache_response(request, response)
            # Track tokens
            await self._track_tokens(response.input_tokens + response.output_tokens)
        except Exception as e:
            if not future.done():
                future.set_exception(e)
            # Propagate the failure to coalesced waiters too, so they
            # are not left awaiting a future that never resolves
            self._resolve_coalesce(request, error=e)
        finally:
            priority_sem.release()

    def _resolve_coalesce(self, request: InferenceRequest, result=None, error=None):
        """Complete the coalescing future registered for this query, if any."""
        query_hash = request.normalized_query_hash
        if query_hash and query_hash in self._coalesce_map:
            coalesce_future, _ = self._coalesce_map.pop(query_hash)
            if not coalesce_future.done():
                if error is not None:
                    coalesce_future.set_exception(error)
                else:
                    coalesce_future.set_result(result)
async def _invoke_bedrock(self, request: InferenceRequest) -> InferenceResponse:
"""Call Bedrock and return the response."""
start = time.time()
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"messages": [{"role": "user", "content": request.prompt}],
})
        # Route: provisioned throughput if within capacity, else on-demand.
        # In production the provisioned path is invoked via its provisioned
        # model ARN; this sketch uses the base model ID for both paths.
        current_tps = await self._get_current_tps()
        model_id = request.model_id
        if current_tps > self.provisioned_tps * 0.9:
            model_id = request.model_id  # near capacity: spill to on-demand
        # boto3 is synchronous; run the call in a worker thread so the
        # event loop is not blocked while Bedrock responds
        response = await asyncio.to_thread(
            self._bedrock_runtime.invoke_model,
            modelId=model_id,
            contentType="application/json",
            accept="application/json",
            body=body,
        )
response_body = json.loads(response["body"].read())
latency_ms = (time.time() - start) * 1000
return InferenceResponse(
request_id=request.request_id,
output_text=response_body["content"][0]["text"],
input_tokens=response_body["usage"]["input_tokens"],
output_tokens=response_body["usage"]["output_tokens"],
latency_ms=latency_ms,
)
async def _track_tokens(self, token_count: int):
"""Track tokens-per-second for monitoring and scaling decisions."""
async with self._tokens_lock:
now = time.time()
if now - self._last_token_reset >= 1.0:
# Publish the previous second's count to CloudWatch
await self._publish_tps_metric(self._tokens_this_second)
self._tokens_this_second = 0
self._last_token_reset = now
self._tokens_this_second += token_count
async def _get_current_tps(self) -> int:
"""Return the current tokens-per-second rate."""
async with self._tokens_lock:
return self._tokens_this_second
    async def _publish_tps_metric(self, tps: int):
        """Publish tokens-per-second to CloudWatch."""
        cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")
        # put_metric_data is a blocking network call; keep it off the loop
        await asyncio.to_thread(
            cloudwatch.put_metric_data,
            Namespace="MangaAssist",
            MetricData=[{
                "MetricName": "TokensPerSecond",
                "Value": tps,
                "Unit": "Count/Second",
                "Dimensions": [
                    {"Name": "Service", "Value": "ThroughputManager"},
                    {"Name": "Environment", "Value": "production"},
                ],
            }],
        )
async def _cache_response(self, request: InferenceRequest, response: InferenceResponse):
"""Cache the response in Redis with a short TTL."""
if self.redis_client is None:
return
if request.intent in ("order_status", "account_info"):
return # Do not cache user-specific responses
cache_key = f"manga:resp:{request.normalized_query_hash}"
self.redis_client.setex(
cache_key,
30, # 30-second TTL
json.dumps({
"output_text": response.output_text,
"input_tokens": response.input_tokens,
"output_tokens": response.output_tokens,
}),
)
def get_metrics_snapshot(self) -> dict:
"""Return a snapshot of current throughput metrics."""
return {
"current_tps": self._tokens_this_second,
"provisioned_tps": self.provisioned_tps,
"utilization_pct": round(
(self._tokens_this_second / self.provisioned_tps) * 100, 1
) if self.provisioned_tps > 0 else 0,
"batch_queue_depths": {
intent: len(queue)
for intent, queue in self._batch_queues.items()
},
"coalesce_map_size": len(self._coalesce_map),
}
7.2 Auto-Scaling Policy Configuration (CDK-Style)
"""
CDK-style auto-scaling configuration for MangaAssist ECS Fargate service
with Bedrock-aware custom metrics.
"""
from aws_cdk import (
Stack,
Duration,
aws_ecs as ecs,
aws_applicationautoscaling as appscaling,
aws_cloudwatch as cloudwatch,
)
from constructs import Construct
class MangaAssistAutoScalingStack(Stack):
"""
Configures auto-scaling for the MangaAssist orchestrator service
running on ECS Fargate, using Bedrock queue depth and tokens-per-second
as scaling signals instead of CPU/memory alone.
"""
def __init__(self, scope: Construct, id: str, ecs_service: ecs.FargateService, **kwargs):
super().__init__(scope, id, **kwargs)
# --- Scalable Target ---
scaling = ecs_service.auto_scale_task_count(
min_capacity=5, # Minimum tasks even at overnight trough
max_capacity=100, # Hard ceiling for cost safety
)
        # --- Policy 1: Target Tracking on Tokens-per-Second ---
        # Keep each ECS task at roughly 500 tokens/s. Target tracking
        # needs a metric that falls as capacity is added, so the metric
        # consumed here is assumed to be normalized per task
        # (total TPS / running task count).
        tps_metric = cloudwatch.Metric(
            namespace="MangaAssist",
            metric_name="TokensPerSecond",
            dimensions_map={
                "Service": "ThroughputManager",
                "Environment": "production",
            },
            statistic="Average",
            period=Duration.seconds(60),
        )
        scaling.scale_to_track_custom_metric(
            "ScaleOnTPS",
            metric=tps_metric,
            target_value=500,  # tokens/s per task
            scale_out_cooldown=Duration.seconds(120),
            scale_in_cooldown=Duration.seconds(300),
        )
# --- Policy 2: Step Scaling on Bedrock Queue Depth ---
queue_depth_metric = cloudwatch.Metric(
namespace="MangaAssist",
metric_name="BedrockQueueDepth",
dimensions_map={
"Service": "Orchestrator",
"Environment": "production",
},
statistic="Maximum",
period=Duration.seconds(30),
)
# Scale-out steps (aggressive)
scaling.scale_on_metric(
"ScaleOutOnQueueDepth",
metric=queue_depth_metric,
scaling_steps=[
appscaling.ScalingInterval(change=2, lower=50, upper=100),
appscaling.ScalingInterval(change=5, lower=100, upper=200),
appscaling.ScalingInterval(change=10, lower=200),
],
adjustment_type=appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
cooldown=Duration.seconds(60),
)
# Scale-in steps (conservative)
scaling.scale_on_metric(
"ScaleInOnQueueDepth",
metric=queue_depth_metric,
scaling_steps=[
appscaling.ScalingInterval(change=-1, upper=20),
appscaling.ScalingInterval(change=-2, upper=5),
],
adjustment_type=appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
cooldown=Duration.seconds(300),
)
# --- Policy 3: Scheduled Scaling for Known Peaks ---
# Evening peak: 17:00-23:59 JST (08:00-14:59 UTC)
scaling.scale_on_schedule(
"EveningPeakScaleUp",
schedule=appscaling.Schedule.cron(hour="8", minute="0"),
min_capacity=30,
)
scaling.scale_on_schedule(
"EveningPeakScaleDown",
schedule=appscaling.Schedule.cron(hour="15", minute="0"),
min_capacity=10,
)
# Overnight trough: 02:00-06:00 JST (17:00-21:00 UTC)
scaling.scale_on_schedule(
"OvernightTrough",
schedule=appscaling.Schedule.cron(hour="17", minute="0"),
min_capacity=5,
)
# Monday manga release: Scale up Sunday 22:00 JST (13:00 UTC)
scaling.scale_on_schedule(
"MondayMangaReleasePrepare",
schedule=appscaling.Schedule.cron(
week_day="SUN", hour="13", minute="0"
),
min_capacity=50,
)
scaling.scale_on_schedule(
"MondayMangaReleaseEnd",
schedule=appscaling.Schedule.cron(
week_day="MON", hour="6", minute="0" # 15:00 JST
),
min_capacity=10,
)
        # --- Policy 4: Throttle-Based Emergency Scaling ---
        throttle_metric = cloudwatch.Metric(
            namespace="AWS/Bedrock",
            metric_name="InvocationThrottles",
            statistic="Sum",
            period=Duration.seconds(60),
        )
        # The alarm pages on-call; attach an SNS topic via
        # throttle_alarm.add_alarm_action(...) to wire up notifications.
        throttle_alarm = cloudwatch.Alarm(
            self,
            "BedrockThrottleAlarm",
            metric=throttle_metric,
            threshold=5,
            evaluation_periods=1,
            comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
            alarm_description="Bedrock throttling detected -- emergency scale-out",
        )
        # The step policy reacts to the same metric independently of the alarm
        scaling.scale_on_metric(
            "EmergencyScaleOnThrottle",
            metric=throttle_metric,
            scaling_steps=[
                appscaling.ScalingInterval(change=5, lower=5, upper=20),
                appscaling.ScalingInterval(change=15, lower=20),
            ],
            adjustment_type=appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
            cooldown=Duration.seconds(60),
        )
8. Cost Comparison: On-Demand vs Provisioned Throughput
The following table uses estimated pricing to illustrate the breakeven analysis. Actual costs depend on the specific Bedrock model and region.
| Scenario | Avg TPS | Monthly On-Demand | Provisioned Units | Monthly Provisioned | Net Savings | Strategy |
|---|---|---|---|---|---|---|
| Overnight (02:00-06:00 JST) | 1,500 | $6,450 | 3 Sonnet + 2 Haiku | $14,400 | -$7,950 | On-demand only |
| Morning ramp (06:00-12:00 JST) | 5,000 | $21,500 | 5 Sonnet + 3 Haiku | $18,000 | +$3,500 | Provisioned |
| Afternoon steady (12:00-17:00 JST) | 8,000 | $34,400 | 8 Sonnet + 3 Haiku | $25,200 | +$9,200 | Provisioned |
| Evening peak (17:00-23:00 JST) | 20,000 | $86,000 | 15 Sonnet + 5 Haiku | $54,000 | +$32,000 | Provisioned |
| Manga release burst (30 min) | 35,000 | $150,500 | 15 Sonnet + 5 Haiku + overflow | $54,000 + $12,000 overflow | +$84,500 | Hybrid |
| Monthly blended | ~7,000 | $158,750 | Dynamic schedule | $111,600 | +$47,150 (30% savings) | Time-of-day provisioning |
Key takeaway: MangaAssist saves approximately 30% on Bedrock costs by using time-of-day provisioned throughput scheduling instead of pure on-demand. The overnight period is left on-demand, the evening peak is fully provisioned, and manga release bursts use a hybrid approach.
Key Takeaways
- Batching reduces cost but adds latency -- MangaAssist keeps the micro-batch window under 150ms and coalescing window under 1s to stay within the 3-second budget.
- Capacity planning is token-centric, not request-centric -- A recommendation query consumes 40x more tokens than a greeting; planning on request counts alone leads to under-provisioning.
- Auto-scaling must use Bedrock-aware metrics -- CPU utilization on ECS tasks is misleading for I/O-bound LLM workloads; queue depth and TPS are the correct signals.
- Provisioned throughput saves money only at sustained utilization -- The breakeven point for MangaAssist is around 5,000 tokens/s; below that, on-demand is cheaper.
- Event provisioning requires a calendar -- Manga releases and sales events must be pre-provisioned hours in advance because throughput changes are not instantaneous.