
01: High-Performance FM Architecture and Throughput Management

MangaAssist is a JP manga store chatbot running on AWS. It uses Claude 3 on Amazon Bedrock (Sonnet for complex queries, Haiku for simple ones), OpenSearch Serverless for vector retrieval, DynamoDB for session and catalog data, ECS Fargate for orchestration, API Gateway WebSocket APIs for real-time delivery, and ElastiCache for Redis for caching. The system handles 1M messages/day with a target end-to-end response time of under 3 seconds.


Skill Mapping

| Field | Value |
|---|---|
| Domain | 4 -- Operational Efficiency Optimization |
| Task | 4.1 -- Optimize foundation model cost and performance |
| Skill | 4.1.3 -- Implement strategies for high-performance FM systems including batching, capacity planning, utilization monitoring, auto-scaling, and provisioned throughput optimization |
| MangaAssist Focus | Throughput management for 1M daily messages with sub-3s latency, burst handling for manga releases and seasonal events, cost-efficient provisioning |

High-Performance FM Dimensions

mindmap
  root((High-Performance
    FM Systems))
    Batching
      Micro-Batching by Intent
      Request Coalescing
      Batch Window Tuning
      Priority Queue Separation
    Capacity Planning
      Token Throughput Targets
      Peak Hour Analysis
      Event Provisioning
      Growth Forecasting
    Utilization Monitoring
      Invocation Metrics
      Tokens-per-Second Tracking
      Throttle Rate Monitoring
      Waste Detection
    Auto-Scaling
      ECS Fargate Scaling
      Queue-Depth Triggers
      Step Scaling Policies
      Target Tracking
    Provisioned Throughput
      Model Units Sizing
      On-Demand vs Provisioned
      Breakeven Analysis
      Dynamic Adjustment

1. Batching Strategies

1.1 Micro-Batching for Similar Intents

MangaAssist classifies every incoming message into one of several intents: product_search, order_status, faq, recommendation, greeting, etc. Messages with the same intent often share prompt templates, system instructions, and retrieval patterns. Micro-batching groups these messages together so the orchestrator can:

  • share a single retrieval call across multiple queries searching the same index partition,
  • pack multiple user queries into a single Bedrock invocation using a batched prompt layout (where the model processes N user questions in one call),
  • reduce per-request overhead (TLS handshake, connection setup, retry envelope).

The trade-off is added latency: each message waits in the batch window before processing begins. For MangaAssist, the window must stay under 200ms to preserve the 3-second budget.

sequenceDiagram
    participant U1 as User A (product_search)
    participant U2 as User B (product_search)
    participant U3 as User C (order_status)
    participant Q as Intent Queue
    participant MB as Micro-Batcher
    participant B as Bedrock Claude

    U1->>Q: "Find shonen manga under 800 yen"
    U2->>Q: "Any new isekai releases?"
    U3->>Q: "Where is my order 12345?"

    Q->>MB: Batch [U1, U2] (same intent: product_search)
    Q->>MB: Single [U3] (order_status, different template)

    MB->>B: Batched invocation for product_search (2 queries)
    MB->>B: Single invocation for order_status

    B-->>MB: Batched response [resp_A, resp_B]
    B-->>MB: Single response [resp_C]

    MB-->>U1: resp_A
    MB-->>U2: resp_B
    MB-->>U3: resp_C

1.2 Request Coalescing for Identical Queries

During peak hours or manga release events, many users ask near-identical questions within seconds: "Is volume 42 of One Piece available?", "One Piece 42 in stock?", "Do you have One Piece vol 42?". After intent classification and query normalization, these resolve to the same semantic query.

Request coalescing detects duplicate normalized queries arriving within a coalescing window (typically 500ms-2s) and fans out a single Bedrock response to all waiting callers. This eliminates redundant invocations, reduces cost, and keeps the system under its throttle ceiling.

Key design decisions:

  • Normalization depth: MangaAssist normalizes to (intent, product_id, query_hash), where query_hash is a semantic hash (embedding-based) rather than an exact string match.
  • Staleness risk: Coalesced responses must not be stale. For order-status queries, coalescing is disabled because each user's order is unique.
  • Cache interaction: Coalesced results are written to Redis with a short TTL (30s), so subsequent identical queries within the TTL skip Bedrock entirely.


2. Capacity Planning

2.1 Token Throughput Requirements

MangaAssist must plan for tokens, not just requests. A simple greeting consumes ~50 tokens total (input + output), while a detailed recommendation with product cards can consume ~2,000 tokens. The blended average across all intents is approximately 600 tokens per request.

| Metric | Calculation | Value |
|---|---|---|
| Daily messages | Given | 1,000,000 |
| Avg tokens per message | Blended across intents | 600 |
| Daily tokens | 1M x 600 | 600,000,000 |
| Avg tokens/second (uniform) | 600M / 86,400 | ~6,944 tokens/s |
| Peak multiplier (JP evening) | 3x average | ~20,833 tokens/s |
| Burst multiplier (manga release) | 5x average | ~34,722 tokens/s |
| Target provisioned capacity | Peak + 30% headroom | ~27,083 tokens/s |
| Burst capacity (on-demand spillover) | Burst - provisioned | ~7,639 tokens/s on-demand |
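
The same arithmetic can be scripted so the plan refreshes whenever the traffic assumptions change. A minimal sketch (the function name and defaults mirror the table above; the multipliers and the 600-token blended average are planning assumptions, not measured constants):

import math

def capacity_plan(daily_messages: int = 1_000_000,
                  avg_tokens_per_message: int = 600,
                  peak_multiplier: float = 3.0,
                  burst_multiplier: float = 5.0,
                  headroom: float = 0.30) -> dict:
    """Token-centric capacity plan mirroring the table above."""
    daily_tokens = daily_messages * avg_tokens_per_message
    avg_tps = daily_tokens / 86_400                      # uniform average over the day
    peak_tps = avg_tps * peak_multiplier                 # JP evening peak
    burst_tps = avg_tps * burst_multiplier               # manga release burst
    provisioned_target = peak_tps * (1 + headroom)       # peak + 30% headroom
    on_demand_spillover = max(burst_tps - provisioned_target, 0)
    return {
        "avg_tps": round(avg_tps),
        "peak_tps": round(peak_tps),
        "burst_tps": round(burst_tps),
        "provisioned_target_tps": round(provisioned_target),
        "on_demand_spillover_tps": round(on_demand_spillover),
    }

# capacity_plan() -> {'avg_tps': 6944, 'peak_tps': 20833, 'burst_tps': 34722,
#                     'provisioned_target_tps': 27083, 'on_demand_spillover_tps': 7639}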

2.2 Peak Hour Analysis -- JP Timezone

MangaAssist traffic follows a strong daily pattern tied to Japanese consumer behavior. The store's primary users are in JST (UTC+9). Traffic analysis from production logs reveals:

Tokens/sec
35,000 |                                          * (manga release spike)
30,000 |
25,000 |                                    ***
20,000 |                                  **   **
15,000 |                                **       **
10,000 |                    ********** *           ***
 7,000 |--- avg ------*****---------------------------------****-------
 5,000 |           ***                                          ***
 3,000 |        ***                                                ***
 1,000 | *******                                                      **
        |_____|_____|_____|_____|_____|_____|_____|_____|_____|_____|____
        00    03    06    09    12    15    18    21    00    03    06  JST
              trough        lunch     peak evening       overnight

Key observations:

  • Trough: 02:00-06:00 JST -- under 2,000 tokens/s. Over-provisioning here is pure waste.
  • Lunch bump: 11:00-13:00 JST -- a brief 10,000 tokens/s spike that subsides quickly.
  • Peak evening: 18:00-23:00 JST -- sustained 15,000-25,000 tokens/s. This is where provisioned throughput earns its keep.
  • Manga release events: Overlaid on the evening peak, release events can push traffic to 35,000 tokens/s for 30-60 minutes.

2.3 Event Provisioning -- Black Friday and Manga Releases

Scheduled events require pre-provisioning because Bedrock provisioned throughput changes take time to activate. MangaAssist maintains an event calendar:

| Event Type | Lead Time for Provisioning | Capacity Multiplier | Duration |
|---|---|---|---|
| Weekly Shonen Jump release (Monday) | 2 hours before midnight JST Sunday | 2x baseline peak | 4 hours |
| Major manga volume release | 6 hours before release time | 3x baseline peak | 6 hours |
| Black Friday / Cyber Monday | 24 hours before event start | 5x baseline peak | 72 hours |
| Amazon Prime Day | 24 hours before event start | 4x baseline peak | 48 hours |
| New Year sale | 12 hours before Dec 31 JST | 3x baseline peak | 96 hours |
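
A small scheduler can walk this calendar and report which events are currently inside their provisioning lead-time window. A minimal sketch (EventWindow, due_events, and the sample dates are illustrative names and values, not production code):

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))

@dataclass
class EventWindow:
    name: str
    start: datetime             # event start, JST
    lead_time: timedelta        # how far ahead provisioning must be in place
    capacity_multiplier: float  # applied to baseline peak tokens/s
    duration: timedelta

EVENT_CALENDAR = [
    EventWindow("Weekly Shonen Jump release", datetime(2024, 7, 1, 0, 0, tzinfo=JST),
                timedelta(hours=2), 2.0, timedelta(hours=4)),
    EventWindow("Black Friday", datetime(2024, 11, 29, 0, 0, tzinfo=JST),
                timedelta(hours=24), 5.0, timedelta(hours=72)),
]

def due_events(now: datetime, baseline_peak_tps: float = 20_833) -> list[dict]:
    """Return events whose lead time has arrived and that have not yet ended."""
    due = []
    for ev in EVENT_CALENDAR:
        if ev.start - ev.lead_time <= now <= ev.start + ev.duration:
            due.append({
                "event": ev.name,
                "required_tps": round(baseline_peak_tps * ev.capacity_multiplier),
            })
    return due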

3. Utilization Monitoring

3.1 Bedrock Invocation Metrics

MangaAssist tracks the following CloudWatch metrics under the AWS/Bedrock namespace:

| Metric | What It Tells Us | Alarm Threshold |
|---|---|---|
| Invocations | Total API calls per period | Trend analysis, no hard alarm |
| InvocationLatency | P50/P90/P99 latency per invocation | P99 > 5s triggers investigation |
| InvocationClientErrors (4xx) | Validation failures, malformed requests | > 1% of invocations |
| InvocationServerErrors (5xx) | Bedrock-side failures | > 0.1% of invocations |
| InvocationThrottles | Requests rejected due to throughput limits | > 0 triggers immediate alert |
| InputTokenCount | Tokens consumed in prompts | Cost tracking, prompt bloat detection |
| OutputTokenCount | Tokens generated in responses | Cost tracking, verbosity detection |

3.2 Tokens-per-Second Tracking

Tokens-per-second (TPS) is the primary capacity metric. MangaAssist publishes a custom CloudWatch metric MangaAssist/TokensPerSecond computed as:

TPS = (InputTokenCount + OutputTokenCount) / period_seconds

This metric drives:

  • Provisioned throughput sizing: If sustained TPS exceeds 80% of provisioned capacity for 10 minutes, a scale-up is triggered.
  • Cost allocation: TPS per intent lets finance allocate Bedrock costs to product teams (recommendations team vs. FAQ team).
  • Anomaly detection: CloudWatch Anomaly Detection on TPS catches unexpected surges (bot attacks, prompt injection attempts that inflate token counts).
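
An equivalent TPS series can also be derived from the AWS/Bedrock token metrics with CloudWatch metric math, as a cross-check on the custom metric. A hedged sketch (the ModelId dimension value and the 5-minute window are example choices; the dimension must match the deployed model or provisioned ARN):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")
dims = [{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}]

now = datetime.now(timezone.utc)
result = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "in_tok", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "AWS/Bedrock", "MetricName": "InputTokenCount",
                       "Dimensions": dims},
            "Period": 60, "Stat": "Sum"}},
        {"Id": "out_tok", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "AWS/Bedrock", "MetricName": "OutputTokenCount",
                       "Dimensions": dims},
            "Period": 60, "Stat": "Sum"}},
        # Metric math: total tokens per period divided by the period length in seconds
        {"Id": "tps", "Expression": "(in_tok + out_tok) / PERIOD(in_tok)",
         "Label": "TokensPerSecond"},
    ],
    StartTime=now - timedelta(minutes=5),
    EndTime=now,
)
# result["MetricDataResults"][0]["Values"] holds one tokens/s sample per minute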

3.3 Waste Detection -- Over-Provisioned Capacity

Provisioned throughput that sits idle is wasted money. MangaAssist defines waste as:

Waste Ratio = 1 - (actual_TPS / provisioned_TPS)

| Waste Ratio | Interpretation | Action |
|---|---|---|
| < 20% | Healthy headroom | No action |
| 20-40% | Mild over-provisioning | Review at next capacity planning cycle |
| 40-60% | Significant waste | Schedule scale-down within 1 hour |
| > 60% | Critical waste | Immediate scale-down (overnight trough likely) |
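
These thresholds map directly onto a small policy function the capacity planner can evaluate each period. A minimal sketch (the function name and action strings are illustrative):

def waste_action(actual_tps: float, provisioned_tps: float) -> tuple[float, str]:
    """Map the waste ratio to the action defined in the table above."""
    if provisioned_tps <= 0:
        return 0.0, "no provisioned capacity"
    waste = 1 - (actual_tps / provisioned_tps)
    if waste < 0.20:
        return waste, "healthy headroom -- no action"
    if waste < 0.40:
        return waste, "mild over-provisioning -- review at next planning cycle"
    if waste < 0.60:
        return waste, "significant waste -- schedule scale-down within 1 hour"
    return waste, "critical waste -- immediate scale-down"

# waste_action(actual_tps=4_000, provisioned_tps=20_000)
# -> (0.8, 'critical waste -- immediate scale-down')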

4. Auto-Scaling Configurations

4.1 ECS Fargate Auto-Scaling Tied to Bedrock Queue Depth

ECS Fargate runs the MangaAssist orchestrator service. Scaling this service based on CPU alone is insufficient because the orchestrator spends most of its time waiting on Bedrock (I/O bound, not CPU bound). Instead, MangaAssist scales on a custom metric: Bedrock request queue depth -- the number of in-flight Bedrock invocations waiting for a response.

graph TD
    subgraph "Auto-Scaling Loop"
        A[CloudWatch Custom Metric<br>bedrock_queue_depth] --> B{Queue Depth<br>Evaluation}
        B -->|depth > 50 for 2 min| C[Step Scaling: +2 tasks]
        B -->|depth > 100 for 1 min| D[Step Scaling: +5 tasks]
        B -->|depth > 200 for 30s| E[Step Scaling: +10 tasks<br>Emergency]
        B -->|depth < 20 for 10 min| F[Scale-in: -1 task]
        B -->|depth < 5 for 15 min| G[Scale-in: -2 tasks]
    end

    subgraph "Metric Publisher"
        H[ECS Orchestrator] --> I[Publish queue depth<br>every 10 seconds]
        I --> A
    end

    subgraph "Bedrock"
        H --> J[Bedrock Invocations]
        J --> K[InvocationThrottles metric]
        K --> A
    end

4.2 Step Scaling for Gradual Ramp

Step scaling prevents oscillation during gradual traffic increases (e.g., the evening ramp from 17:00-19:00 JST). Each step adds capacity proportional to the severity of the breach:

| Queue Depth | Breach Duration | Action | Cooldown |
|---|---|---|---|
| > 50 | 2 minutes | +2 ECS tasks | 120s |
| > 100 | 1 minute | +5 ECS tasks | 90s |
| > 200 | 30 seconds | +10 ECS tasks | 60s |
| < 20 | 10 minutes | -1 ECS task | 300s |
| < 5 | 15 minutes | -2 ECS tasks | 300s |

Scale-in cooldowns are deliberately longer than scale-out cooldowns to avoid premature shrinkage during brief traffic dips within a peak window.

4.3 Target Tracking for Tokens-per-Second

In addition to step scaling, MangaAssist uses a target tracking policy on the custom MangaAssist/TokensPerSecond metric. The target is set to maintain each ECS task at approximately 500 tokens/s throughput:

Target TPS per task = 500
Desired task count  = ceil(current_total_TPS / 500)

This provides a smooth, predictive scaling signal that complements the reactive step scaling on queue depth.
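
Expressed as code, the rule is a one-line ceiling bounded by the task-count limits the ECS service already enforces (5-100 tasks, per the CDK configuration in section 7.2):

import math

def desired_task_count(current_total_tps: float, target_tps_per_task: float = 500,
                       min_tasks: int = 5, max_tasks: int = 100) -> int:
    """Tasks needed to keep each orchestrator task near its per-task TPS target."""
    needed = math.ceil(current_total_tps / target_tps_per_task)
    return max(min_tasks, min(needed, max_tasks))

# desired_task_count(20_833) -> 42 tasks during the evening peak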


5. Provisioned Throughput Optimization

5.1 Bedrock Provisioned Throughput -- Model Units

Amazon Bedrock offers provisioned throughput purchased in model units. Each model unit provides a guaranteed tokens-per-second capacity for a specific model. For MangaAssist:

| Model | Use Case | On-Demand Rate | Provisioned Unit Capacity | Units Needed (Peak) |
|---|---|---|---|---|
| Claude 3 Sonnet | Complex queries (recommendations, multi-turn) | Variable, subject to throttling | ~1,000 tokens/s per unit | 15 units |
| Claude 3 Haiku | Simple queries (greetings, FAQs, order status) | Variable, subject to throttling | ~2,000 tokens/s per unit | 5 units |
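
Unit sizing is a ceiling division of each model's share of the peak. A minimal sketch (the 15,000 / 10,000 tokens/s split between Sonnet and Haiku is an illustrative assumption consistent with the unit counts above; per-unit capacities come from the table):

import math

UNIT_CAPACITY_TPS = {"sonnet": 1_000, "haiku": 2_000}  # approx tokens/s per model unit

def units_needed(peak_tps_by_model: dict[str, float]) -> dict[str, int]:
    """Round each model's peak tokens/s up to whole model units."""
    return {
        model: math.ceil(tps / UNIT_CAPACITY_TPS[model])
        for model, tps in peak_tps_by_model.items()
    }

# Assuming the ~25,000 tokens/s evening peak splits ~15,000 Sonnet / ~10,000 Haiku:
# units_needed({"sonnet": 15_000, "haiku": 10_000}) -> {'sonnet': 15, 'haiku': 5}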

5.2 On-Demand vs Provisioned -- Breakeven Analysis

The decision between on-demand and provisioned depends on sustained utilization. Provisioned throughput has a fixed hourly cost regardless of actual usage; on-demand charges per token.

| Traffic Level (tokens/s) | Monthly On-Demand Cost (est.) | Monthly Provisioned Cost (est.) | Savings with Provisioned | Recommendation |
|---|---|---|---|---|
| 2,000 (overnight trough) | $8,600 | $18,000 (5 units) | -$9,400 (loss) | On-demand |
| 7,000 (daily average) | $30,100 | $18,000 (5 units) | +$12,100 | Provisioned |
| 15,000 (evening peak) | $64,500 | $36,000 (10 units) | +$28,500 | Provisioned |
| 25,000 (peak evening) | $107,500 | $54,000 (15 units) | +$53,500 | Provisioned |
| 35,000 (manga release burst) | $150,500 | $54,000 + on-demand overflow | +$61,500 | Hybrid |

MangaAssist strategy: Provision for the sustained evening peak (15 units Sonnet + 5 units Haiku) and let on-demand handle bursts above that. During overnight trough, scale provisioned throughput down to minimum (3 units Sonnet + 2 units Haiku).
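
The breakeven traffic level is simply the point where the fixed provisioned cost equals the pay-per-token bill. A minimal sketch (the $4.30 per sustained token/s per month rate is backed out of the table's estimates and stands in for real pricing):

def breakeven_tps(monthly_provisioned_cost: float,
                  on_demand_cost_per_tps_month: float) -> float:
    """Sustained tokens/s above which provisioned throughput beats on-demand."""
    return monthly_provisioned_cost / on_demand_cost_per_tps_month

# Using the table's estimates: $18,000/month for 5 units, and roughly
# $8,600 / 2,000 tokens/s = $4.30 per sustained token/s per month on-demand.
# breakeven_tps(18_000, 4.30) -> ~4,186 tokens/s of sustained traffic, in the same
# range as the ~5,000 tokens/s breakeven cited in the Key Takeaways.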


6. Throughput Management Architecture

graph TB
    subgraph "Client Layer"
        WS[WebSocket Clients<br>1M messages/day]
    end

    subgraph "Ingress"
        AG[API Gateway WebSocket]
    end

    subgraph "Orchestration Layer (ECS Fargate)"
        IC[Intent Classifier]
        MB[Micro-Batcher]
        RC[Request Coalescer]
        PQ[Priority Queue<br>High / Normal / Low]
        RL[Rate Limiter]
    end

    subgraph "Cache Layer"
        REDIS[ElastiCache Redis<br>Coalesced Response Cache<br>Embedding Cache]
    end

    subgraph "FM Layer"
        BPT[Bedrock Provisioned Throughput<br>Sonnet: 15 units<br>Haiku: 5 units]
        BOD[Bedrock On-Demand<br>Overflow / Burst]
    end

    subgraph "Monitoring"
        CW[CloudWatch Metrics<br>TPS, Queue Depth, Throttles]
        AS[Auto-Scaling Policies<br>Step + Target Tracking]
        ALARM[Alarms + SNS]
    end

    WS --> AG --> IC
    IC --> MB
    MB --> RC
    RC -->|cache hit| REDIS
    RC -->|cache miss| PQ
    PQ --> RL
    RL -->|within provisioned capacity| BPT
    RL -->|overflow| BOD
    BPT --> REDIS
    BOD --> REDIS

    CW --> AS
    AS -->|scale ECS| IC
    AS -->|adjust provisioned units| BPT
    CW --> ALARM

7. Python Implementation

7.1 ThroughputManager with Batching and Rate Limiting

import asyncio
import time
import hashlib
import json
from dataclasses import dataclass, field
from typing import Optional
from collections import defaultdict

import boto3


@dataclass
class InferenceRequest:
    """A single MangaAssist inference request."""
    request_id: str
    user_id: str
    intent: str
    prompt: str
    model_id: str
    priority: str = "normal"  # high, normal, low
    created_at: float = field(default_factory=time.time)
    normalized_query_hash: Optional[str] = None


@dataclass
class InferenceResponse:
    """Response from Bedrock invocation."""
    request_id: str
    output_text: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    from_cache: bool = False
    from_coalesce: bool = False


class ThroughputManager:
    """
    Manages Bedrock throughput for MangaAssist with:
    - Micro-batching by intent
    - Request coalescing for identical queries
    - Priority-based rate limiting
    - Provisioned vs on-demand routing
    """

    def __init__(
        self,
        provisioned_tps: int = 20_000,
        batch_window_ms: int = 150,
        coalesce_window_ms: int = 1_000,
        max_batch_size: int = 8,
        max_concurrent_invocations: int = 100,
        redis_client=None,
    ):
        self.provisioned_tps = provisioned_tps
        self.batch_window_ms = batch_window_ms
        self.coalesce_window_ms = coalesce_window_ms
        self.max_batch_size = max_batch_size
        self.redis_client = redis_client

        # Semaphore to limit concurrent Bedrock invocations
        self._invocation_semaphore = asyncio.Semaphore(max_concurrent_invocations)

        # Intent-based batch queues: intent -> list of (request, future)
        self._batch_queues: dict[str, list] = defaultdict(list)
        self._batch_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

        # Coalescing: query_hash -> (future, registration timestamp)
        self._coalesce_map: dict[str, tuple[asyncio.Future, float]] = {}
        self._coalesce_lock = asyncio.Lock()

        # Metrics tracking
        self._tokens_this_second = 0
        self._tokens_lock = asyncio.Lock()
        self._last_token_reset = time.time()

        # Rate limiting buckets
        self._rate_limits = {
            "high": asyncio.Semaphore(50),
            "normal": asyncio.Semaphore(40),
            "low": asyncio.Semaphore(10),
        }

        # AWS clients (created once and reused; boto3 low-level clients are thread-safe)
        self._bedrock_runtime = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
        self._cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

    async def process_request(self, request: InferenceRequest) -> InferenceResponse:
        """
        Main entry point. Routes through coalescing, batching, and rate limiting.
        """
        # Step 1: Check coalescing (skip for order-status, unique per user)
        if request.intent not in ("order_status", "account_info"):
            coalesced = await self._try_coalesce(request)
            if coalesced is not None:
                return coalesced

        # Step 2: Check Redis cache
        cached = await self._check_cache(request)
        if cached is not None:
            # Resolve the coalesce entry registered in step 1 so concurrent
            # duplicates waiting on this query hash receive the cached result
            # instead of hanging on a never-resolved future.
            if request.normalized_query_hash:
                async with self._coalesce_lock:
                    entry = self._coalesce_map.pop(request.normalized_query_hash, None)
                if entry is not None and not entry[0].done():
                    entry[0].set_result(cached)
            return cached

        # Step 3: Enqueue for micro-batching
        return await self._enqueue_for_batch(request)

    def _compute_query_hash(self, request: InferenceRequest) -> str:
        """Compute a normalized hash for coalescing identical queries."""
        normalized = json.dumps({
            "intent": request.intent,
            "prompt_hash": hashlib.sha256(
                request.prompt.strip().lower().encode()
            ).hexdigest()[:16],
            "model_id": request.model_id,
        }, sort_keys=True)
        return hashlib.sha256(normalized.encode()).hexdigest()[:32]

    async def _try_coalesce(self, request: InferenceRequest) -> Optional[InferenceResponse]:
        """
        If an identical query is already in-flight, wait for its result
        instead of making a duplicate Bedrock call.
        """
        query_hash = self._compute_query_hash(request)
        request.normalized_query_hash = query_hash

        async with self._coalesce_lock:
            if query_hash in self._coalesce_map:
                existing_future, created_at = self._coalesce_map[query_hash]
                age_ms = (time.time() - created_at) * 1000
                if age_ms < self.coalesce_window_ms:
                    # Wait for the in-flight request to complete
                    result = await existing_future
                    return InferenceResponse(
                        request_id=request.request_id,
                        output_text=result.output_text,
                        input_tokens=result.input_tokens,
                        output_tokens=result.output_tokens,
                        latency_ms=result.latency_ms,
                        from_coalesce=True,
                    )

            # No in-flight duplicate; register this request for coalescing
            future = asyncio.get_running_loop().create_future()
            self._coalesce_map[query_hash] = (future, time.time())

        return None  # Proceed to batching/invocation

    async def _check_cache(self, request: InferenceRequest) -> Optional[InferenceResponse]:
        """Check Redis for a cached response."""
        if self.redis_client is None:
            return None

        cache_key = f"manga:resp:{request.normalized_query_hash or self._compute_query_hash(request)}"
        cached_data = self.redis_client.get(cache_key)
        if cached_data:
            data = json.loads(cached_data)
            return InferenceResponse(
                request_id=request.request_id,
                output_text=data["output_text"],
                input_tokens=data["input_tokens"],
                output_tokens=data["output_tokens"],
                latency_ms=0.0,
                from_cache=True,
            )
        return None

    async def _enqueue_for_batch(self, request: InferenceRequest) -> InferenceResponse:
        """
        Add request to the intent-specific batch queue.
        If the batch fills up or the window expires, flush the batch.
        """
        future = asyncio.get_running_loop().create_future()

        async with self._batch_locks[request.intent]:
            self._batch_queues[request.intent].append((request, future))

            if len(self._batch_queues[request.intent]) >= self.max_batch_size:
                batch = self._batch_queues[request.intent][:self.max_batch_size]
                self._batch_queues[request.intent] = self._batch_queues[request.intent][self.max_batch_size:]
                asyncio.create_task(self._flush_batch(request.intent, batch))
            elif len(self._batch_queues[request.intent]) == 1:
                # First item in batch -- start the window timer
                asyncio.create_task(self._batch_timer(request.intent))

        return await future

    async def _batch_timer(self, intent: str):
        """Wait for the batch window, then flush whatever is queued."""
        await asyncio.sleep(self.batch_window_ms / 1000.0)
        async with self._batch_locks[intent]:
            if self._batch_queues[intent]:
                batch = self._batch_queues[intent][:]
                self._batch_queues[intent] = []
                asyncio.create_task(self._flush_batch(intent, batch))

    async def _flush_batch(self, intent: str, batch: list):
        """
        Invoke Bedrock for a batch of requests.
        Uses rate limiting and routes to provisioned or on-demand.
        """
        for request, future in batch:
            priority_sem = self._rate_limits.get(request.priority, self._rate_limits["normal"])
            await priority_sem.acquire()
            try:
                async with self._invocation_semaphore:
                    response = await self._invoke_bedrock(request)
                    future.set_result(response)

                    # Update coalescing map
                    if request.normalized_query_hash and request.normalized_query_hash in self._coalesce_map:
                        coalesce_future, _ = self._coalesce_map.pop(request.normalized_query_hash)
                        if not coalesce_future.done():
                            coalesce_future.set_result(response)

                    # Cache the result
                    await self._cache_response(request, response)

                    # Track tokens
                    await self._track_tokens(response.input_tokens + response.output_tokens)
            except Exception as e:
                if not future.done():
                    future.set_exception(e)
                # Fail any coalesced waiters for this query as well so they do not hang
                if request.normalized_query_hash:
                    async with self._coalesce_lock:
                        entry = self._coalesce_map.pop(request.normalized_query_hash, None)
                    if entry is not None and not entry[0].done():
                        entry[0].set_exception(e)
            finally:
                priority_sem.release()

    async def _invoke_bedrock(self, request: InferenceRequest) -> InferenceResponse:
        """Call Bedrock and return the response."""
        start = time.time()

        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": request.prompt}],
        })

        # Route: stay on provisioned throughput while under capacity, spill to
        # on-demand when near the ceiling. In production the provisioned-throughput
        # ARN would be passed as modelId on the provisioned path; this sketch uses
        # the base model ID on both paths.
        current_tps = await self._get_current_tps()
        model_id = request.model_id
        if current_tps > self.provisioned_tps * 0.9:
            # Near provisioned capacity -- fall back to the on-demand endpoint
            model_id = request.model_id  # base model ID doubles as the on-demand ID

        # boto3 is synchronous; run the call in a worker thread so the event loop
        # keeps serving other requests while Bedrock responds.
        response = await asyncio.to_thread(
            self._bedrock_runtime.invoke_model,
            modelId=model_id,
            contentType="application/json",
            accept="application/json",
            body=body,
        )

        response_body = json.loads(response["body"].read())
        latency_ms = (time.time() - start) * 1000

        return InferenceResponse(
            request_id=request.request_id,
            output_text=response_body["content"][0]["text"],
            input_tokens=response_body["usage"]["input_tokens"],
            output_tokens=response_body["usage"]["output_tokens"],
            latency_ms=latency_ms,
        )

    async def _track_tokens(self, token_count: int):
        """Track tokens-per-second for monitoring and scaling decisions."""
        async with self._tokens_lock:
            now = time.time()
            if now - self._last_token_reset >= 1.0:
                # Publish the previous second's count to CloudWatch
                await self._publish_tps_metric(self._tokens_this_second)
                self._tokens_this_second = 0
                self._last_token_reset = now
            self._tokens_this_second += token_count

    async def _get_current_tps(self) -> int:
        """Return the current tokens-per-second rate."""
        async with self._tokens_lock:
            return self._tokens_this_second

    async def _publish_tps_metric(self, tps: int):
        """Publish tokens-per-second to CloudWatch without blocking the event loop."""
        await asyncio.to_thread(
            self._cloudwatch.put_metric_data,
            Namespace="MangaAssist",
            MetricData=[{
                "MetricName": "TokensPerSecond",
                "Value": tps,
                "Unit": "Count/Second",
                "Dimensions": [
                    {"Name": "Service", "Value": "ThroughputManager"},
                    {"Name": "Environment", "Value": "production"},
                ],
            }],
        )

    async def _cache_response(self, request: InferenceRequest, response: InferenceResponse):
        """Cache the response in Redis with a short TTL."""
        if self.redis_client is None:
            return
        if request.intent in ("order_status", "account_info"):
            return  # Do not cache user-specific responses

        cache_key = f"manga:resp:{request.normalized_query_hash}"
        self.redis_client.setex(
            cache_key,
            30,  # 30-second TTL
            json.dumps({
                "output_text": response.output_text,
                "input_tokens": response.input_tokens,
                "output_tokens": response.output_tokens,
            }),
        )

    def get_metrics_snapshot(self) -> dict:
        """Return a snapshot of current throughput metrics."""
        return {
            "current_tps": self._tokens_this_second,
            "provisioned_tps": self.provisioned_tps,
            "utilization_pct": round(
                (self._tokens_this_second / self.provisioned_tps) * 100, 1
            ) if self.provisioned_tps > 0 else 0,
            "batch_queue_depths": {
                intent: len(queue)
                for intent, queue in self._batch_queues.items()
            },
            "coalesce_map_size": len(self._coalesce_map),
        }
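
A minimal usage sketch of the class above (the Redis endpoint is a placeholder; the model ID is Bedrock's Claude 3 Haiku identifier):

import asyncio
import uuid

import redis

async def main():
    manager = ThroughputManager(
        provisioned_tps=20_000,
        redis_client=redis.Redis(host="manga-cache.example.internal", port=6379),
    )
    request = InferenceRequest(
        request_id=str(uuid.uuid4()),
        user_id="user-123",
        intent="product_search",
        prompt="Find shonen manga under 800 yen",
        model_id="anthropic.claude-3-haiku-20240307-v1:0",
        priority="normal",
    )
    response = await manager.process_request(request)
    print(response.output_text, manager.get_metrics_snapshot())

if __name__ == "__main__":
    asyncio.run(main())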

7.2 Auto-Scaling Policy Configuration (CDK-Style)

"""
CDK-style auto-scaling configuration for MangaAssist ECS Fargate service
with Bedrock-aware custom metrics.
"""

from aws_cdk import (
    Stack,
    Duration,
    aws_ecs as ecs,
    aws_applicationautoscaling as appscaling,
    aws_cloudwatch as cloudwatch,
)
from constructs import Construct


class MangaAssistAutoScalingStack(Stack):
    """
    Configures auto-scaling for the MangaAssist orchestrator service
    running on ECS Fargate, using Bedrock queue depth and tokens-per-second
    as scaling signals instead of CPU/memory alone.
    """

    def __init__(self, scope: Construct, id: str, ecs_service: ecs.FargateService, **kwargs):
        super().__init__(scope, id, **kwargs)

        # --- Scalable Target ---
        scaling = ecs_service.auto_scale_task_count(
            min_capacity=5,    # Minimum tasks even at overnight trough
            max_capacity=100,  # Hard ceiling for cost safety
        )

        # --- Policy 1: Step Scaling on Tokens-per-Second ---
        # Each ECS task should handle roughly 500 tokens/s. The steps below
        # approximate that target against the aggregate TPS metric; true target
        # tracking (section 4.3) would require a per-task TPS metric.
        tps_metric = cloudwatch.Metric(
            namespace="MangaAssist",
            metric_name="TokensPerSecond",
            dimensions_map={
                "Service": "ThroughputManager",
                "Environment": "production",
            },
            statistic="Average",
            period=Duration.seconds(60),
        )

        scaling.scale_on_metric(
            "ScaleOnTPS",
            metric=tps_metric,
            scaling_steps=[
                appscaling.ScalingInterval(change=-2, lower=0, upper=2_000),
                appscaling.ScalingInterval(change=0, lower=2_000, upper=5_000),
                appscaling.ScalingInterval(change=2, lower=5_000, upper=10_000),
                appscaling.ScalingInterval(change=5, lower=10_000, upper=20_000),
                appscaling.ScalingInterval(change=10, lower=20_000),
            ],
            adjustment_type=appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
            cooldown=Duration.seconds(120),
        )

        # --- Policy 2: Step Scaling on Bedrock Queue Depth ---
        queue_depth_metric = cloudwatch.Metric(
            namespace="MangaAssist",
            metric_name="BedrockQueueDepth",
            dimensions_map={
                "Service": "Orchestrator",
                "Environment": "production",
            },
            statistic="Maximum",
            period=Duration.seconds(30),
        )

        # Scale-out steps (aggressive)
        scaling.scale_on_metric(
            "ScaleOutOnQueueDepth",
            metric=queue_depth_metric,
            scaling_steps=[
                appscaling.ScalingInterval(change=2, lower=50, upper=100),
                appscaling.ScalingInterval(change=5, lower=100, upper=200),
                appscaling.ScalingInterval(change=10, lower=200),
            ],
            adjustment_type=appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
            cooldown=Duration.seconds(60),
        )

        # Scale-in steps (conservative)
        scaling.scale_on_metric(
            "ScaleInOnQueueDepth",
            metric=queue_depth_metric,
            scaling_steps=[
                appscaling.ScalingInterval(change=-1, upper=20),
                appscaling.ScalingInterval(change=-2, upper=5),
            ],
            adjustment_type=appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
            cooldown=Duration.seconds(300),
        )

        # --- Policy 3: Scheduled Scaling for Known Peaks ---
        # Evening peak: 17:00-23:59 JST (08:00-14:59 UTC)
        scaling.scale_on_schedule(
            "EveningPeakScaleUp",
            schedule=appscaling.Schedule.cron(hour="8", minute="0"),
            min_capacity=30,
        )
        scaling.scale_on_schedule(
            "EveningPeakScaleDown",
            schedule=appscaling.Schedule.cron(hour="15", minute="0"),
            min_capacity=10,
        )

        # Overnight trough: 02:00-06:00 JST (17:00-21:00 UTC)
        scaling.scale_on_schedule(
            "OvernightTrough",
            schedule=appscaling.Schedule.cron(hour="17", minute="0"),
            min_capacity=5,
        )

        # Monday manga release: Scale up Sunday 22:00 JST (13:00 UTC)
        scaling.scale_on_schedule(
            "MondayMangaReleasePrepare",
            schedule=appscaling.Schedule.cron(
                week_day="SUN", hour="13", minute="0"
            ),
            min_capacity=50,
        )
        scaling.scale_on_schedule(
            "MondayMangaReleaseEnd",
            schedule=appscaling.Schedule.cron(
                week_day="MON", hour="6", minute="0"  # 15:00 JST
            ),
            min_capacity=10,
        )

        # --- Policy 4: Throttle-Based Emergency Scaling ---
        # The alarm below exists for operator notification (an SNS notification
        # action would be attached separately); the scaling reaction itself comes
        # from the step policy that follows.
        throttle_alarm = cloudwatch.Alarm(
            self,
            "BedrockThrottleAlarm",
            metric=cloudwatch.Metric(
                namespace="AWS/Bedrock",
                metric_name="InvocationThrottles",
                statistic="Sum",
                period=Duration.seconds(60),
            ),
            threshold=5,
            evaluation_periods=1,
            comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
            alarm_description="Bedrock throttling detected -- emergency scale-out",
        )

        scaling.scale_on_metric(
            "EmergencyScaleOnThrottle",
            metric=cloudwatch.Metric(
                namespace="AWS/Bedrock",
                metric_name="InvocationThrottles",
                statistic="Sum",
                period=Duration.seconds(60),
            ),
            scaling_steps=[
                appscaling.ScalingInterval(change=5, lower=5, upper=20),
                appscaling.ScalingInterval(change=15, lower=20),
            ],
            adjustment_type=appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
            cooldown=Duration.seconds(60),
        )

8. Cost Comparison: On-Demand vs Provisioned Throughput

The following table uses estimated pricing to illustrate the breakeven analysis. Actual costs depend on the specific Bedrock model and region.

| Scenario | Avg TPS | Monthly On-Demand | Provisioned Units | Monthly Provisioned | Net Savings | Strategy |
|---|---|---|---|---|---|---|
| Overnight (02:00-06:00 JST) | 1,500 | $6,450 | 3 Sonnet + 2 Haiku | $14,400 | -$7,950 | On-demand only |
| Morning ramp (06:00-12:00 JST) | 5,000 | $21,500 | 5 Sonnet + 3 Haiku | $18,000 | +$3,500 | Provisioned |
| Afternoon steady (12:00-17:00 JST) | 8,000 | $34,400 | 8 Sonnet + 3 Haiku | $25,200 | +$9,200 | Provisioned |
| Evening peak (17:00-23:00 JST) | 20,000 | $86,000 | 15 Sonnet + 5 Haiku | $54,000 | +$32,000 | Provisioned |
| Manga release burst (30 min) | 35,000 | $150,500 | 15 Sonnet + 5 Haiku + overflow | $54,000 + $12,000 overflow | +$84,500 | Hybrid |
| Monthly blended | ~7,000 | $158,750 | Dynamic schedule | $111,600 | +$47,150 (30% savings) | Time-of-day provisioning |

Key takeaway: MangaAssist saves approximately 30% on Bedrock costs by using time-of-day provisioned throughput scheduling instead of pure on-demand. The overnight period is left on-demand, the evening peak is fully provisioned, and manga release bursts use a hybrid approach.


Key Takeaways

  1. Batching reduces cost but adds latency -- MangaAssist keeps the micro-batch window under 150ms and coalescing window under 1s to stay within the 3-second budget.
  2. Capacity planning is token-centric, not request-centric -- A recommendation query consumes 40x more tokens than a greeting; planning on request counts alone leads to under-provisioning.
  3. Auto-scaling must use Bedrock-aware metrics -- CPU utilization on ECS tasks is misleading for I/O-bound LLM workloads; queue depth and TPS are the correct signals.
  4. Provisioned throughput saves money only at sustained utilization -- The breakeven point for MangaAssist is around 5,000 tokens/s; below that, on-demand is cheaper.
  5. Event provisioning requires a calendar -- Manga releases and sales events must be pre-provisioned hours in advance because throughput changes are not instantaneous.