01: High-Performance FM Architecture and Throughput Management
MangaAssist is a chatbot for a Japanese manga store, running on AWS. It uses Bedrock Claude 3 (Sonnet for complex queries, Haiku for simple ones), OpenSearch Serverless for vector retrieval, DynamoDB for session and catalog data, ECS Fargate for orchestration, API Gateway WebSocket for real-time delivery, and ElastiCache Redis for caching. The system handles 1M messages/day with a target of under 3 seconds end-to-end response time.
Skill Mapping
| Field | Value |
|---|---|
| Domain | 4 -- Operational Efficiency Optimization |
| Task | 4.1 -- Optimize foundation model cost and performance |
| Skill | 4.1.3 -- Implement strategies for high-performance FM systems including batching, capacity planning, utilization monitoring, auto-scaling, and provisioned throughput optimization |
| MangaAssist Focus | Throughput management for 1M daily messages with sub-3s latency, burst handling for manga releases and seasonal events, cost-efficient provisioning |
High-Performance FM Dimensions
mindmap
  root((High-Performance FM Systems))
    Batching
      Micro-Batching by Intent
      Request Coalescing
      Batch Window Tuning
      Priority Queue Separation
    Capacity Planning
      Token Throughput Targets
      Peak Hour Analysis
      Event Provisioning
      Growth Forecasting
    Utilization Monitoring
      Invocation Metrics
      Tokens-per-Second Tracking
      Throttle Rate Monitoring
      Waste Detection
    Auto-Scaling
      ECS Fargate Scaling
      Queue-Depth Triggers
      Step Scaling Policies
      Target Tracking
    Provisioned Throughput
      Model Units Sizing
      On-Demand vs Provisioned
      Breakeven Analysis
      Dynamic Adjustment
1. Batching Strategies
1.1 Micro-Batching for Similar Intents
MangaAssist classifies every incoming message into one of several intents: product_search, order_status, faq, recommendation, greeting, etc. Messages with the same intent often share prompt templates, system instructions, and retrieval patterns. Micro-batching groups these messages together so the orchestrator can:
- share a single retrieval call across multiple queries searching the same index partition,
- pack multiple user queries into a single Bedrock invocation using a batched prompt layout (where the model processes N user questions in one call),
- reduce per-request overhead (TLS handshake, connection setup, retry envelope).
The trade-off is added latency: each message waits in the batch window before processing begins. For MangaAssist, the window must stay under 200ms to preserve the 3-second budget.
sequenceDiagram
participant U1 as User A (product_search)
participant U2 as User B (product_search)
participant U3 as User C (order_status)
participant Q as Intent Queue
participant MB as Micro-Batcher
participant B as Bedrock Claude
U1->>Q: "Find shonen manga under 800 yen"
U2->>Q: "Any new isekai releases?"
U3->>Q: "Where is my order 12345?"
Q->>MB: Batch [U1, U2] (same intent: product_search)
Q->>MB: Single [U3] (order_status, different template)
MB->>B: Batched invocation for product_search (2 queries)
MB->>B: Single invocation for order_status
B-->>MB: Batched response [resp_A, resp_B]
B-->>MB: Single response [resp_C]
MB-->>U1: resp_A
MB-->>U2: resp_B
MB-->>U3: resp_C
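The batched prompt layout can be sketched as follows. The numbered Q/A convention is an assumption for illustration, not an official Bedrock feature, and the helper names (`pack_batched_prompt`, `split_batched_response`) are hypothetical:

```python
# Sketch of a batched prompt layout: N user questions are packed into one
# prompt and the model is instructed to answer each under a numbered header.

def pack_batched_prompt(queries: list[str]) -> str:
    """Build a single prompt asking the model to answer N questions."""
    numbered = "\n".join(f"Q{i + 1}: {q}" for i, q in enumerate(queries))
    return (
        "Answer each question separately. Prefix every answer with "
        "'A<n>:' on its own line, matching the question number.\n\n"
        + numbered
    )

def split_batched_response(text: str, n: int) -> list[str]:
    """Split a model response back into per-question answers."""
    answers = [""] * n
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("A") and ":" in stripped:
            head, _, rest = stripped.partition(":")
            if head[1:].isdigit() and 1 <= int(head[1:]) <= n:
                current = int(head[1:]) - 1
                answers[current] = rest.strip()
                continue
        if current is not None and stripped:
            # Continuation line of the current answer
            answers[current] += ("\n" if answers[current] else "") + stripped
    return answers
```

The splitter must tolerate continuation lines because the model may answer in multiple paragraphs per question.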
1.2 Request Coalescing for Identical Queries
During peak hours or manga release events, many users ask near-identical questions within seconds: "Is volume 42 of One Piece available?", "One Piece 42 in stock?", "Do you have One Piece vol 42?". After intent classification and query normalization, these resolve to the same semantic query.
Request coalescing detects duplicate normalized queries arriving within a coalescing window (typically 500ms-2s) and fans out a single Bedrock response to all waiting callers. This eliminates redundant invocations, reduces cost, and keeps the system under its throttle ceiling.
Key design decisions:
- Normalization depth: MangaAssist normalizes to (intent, product_id, query_hash) where query_hash is a semantic hash (embedding-based) rather than an exact string match.
- Staleness risk: Coalesced responses must not be stale. For order-status queries, coalescing is disabled because each user's order is unique.
- Cache interaction: Coalesced results are written to Redis with a short TTL (30s) so subsequent identical queries within the TTL skip Bedrock entirely.
2. Capacity Planning
2.1 Token Throughput Requirements
MangaAssist must plan for tokens, not just requests. A simple greeting consumes ~50 tokens total (input + output), while a detailed recommendation with product cards can consume ~2,000 tokens. The blended average across all intents is approximately 600 tokens per request.
| Metric | Calculation | Value |
|---|---|---|
| Daily messages | Given | 1,000,000 |
| Avg tokens per message | Blended across intents | 600 |
| Daily tokens | 1M x 600 | 600,000,000 |
| Avg tokens/second (uniform) | 600M / 86,400 | ~6,944 tokens/s |
| Peak multiplier (JP evening) | 3x average | ~20,833 tokens/s |
| Burst multiplier (manga release) | 5x average | ~34,722 tokens/s |
| Target provisioned capacity | Peak + 30% headroom | ~27,083 tokens/s |
| Burst capacity (on-demand spillover) | Burst - provisioned | ~7,639 tokens/s on-demand |
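The table's arithmetic, as a quick check:

```python
# Reproduces the capacity-planning arithmetic from the table above.
DAILY_MESSAGES = 1_000_000
AVG_TOKENS_PER_MESSAGE = 600
SECONDS_PER_DAY = 86_400

daily_tokens = DAILY_MESSAGES * AVG_TOKENS_PER_MESSAGE   # 600,000,000
avg_tps = daily_tokens / SECONDS_PER_DAY                 # ~6,944 tokens/s
peak_tps = avg_tps * 3                                   # ~20,833 (JP evening)
burst_tps = avg_tps * 5                                  # ~34,722 (manga release)
provisioned_tps = peak_tps * 1.3                         # peak + 30% headroom
on_demand_spillover = burst_tps - provisioned_tps        # ~7,639 on-demand
```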
2.2 Peak Hour Analysis -- JP Timezone
MangaAssist traffic follows a strong daily pattern tied to Japanese consumer behavior. The store's primary users are in JST (UTC+9). Traffic analysis from production logs reveals:
Tokens/sec
35,000 | * (manga release spike)
30,000 |
25,000 | ***
20,000 | ** **
15,000 | ** **
10,000 | ********** * ***
7,000 |--- avg ------*****---------------------------------****-------
5,000 | *** ***
3,000 | *** ***
1,000 | ******* **
|_____|_____|_____|_____|_____|_____|_____|_____|_____|_____|____
00 03 06 09 12 15 18 21 00 03 06 JST
trough lunch peak evening overnight
Key observations:
- Trough: 02:00-06:00 JST -- under 2,000 tokens/s. Over-provisioning here is pure waste.
- Lunch bump: 11:00-13:00 JST -- brief 10,000 tokens/s spike, subsides quickly.
- Peak evening: 18:00-23:00 JST -- sustained 15,000-25,000 tokens/s. This is where provisioned throughput earns its keep.
- Manga release events: Overlaid on the evening peak, release events can push to 35,000 tokens/s for 30-60 minutes.
2.3 Event Provisioning -- Black Friday and Manga Releases
Scheduled events require pre-provisioning because Bedrock provisioned throughput changes take time to activate. MangaAssist maintains an event calendar:
| Event Type | Lead Time for Provisioning | Capacity Multiplier | Duration |
|---|---|---|---|
| Weekly Shonen Jump release (Monday) | 2 hours before midnight JST Sunday | 2x baseline peak | 4 hours |
| Major manga volume release | 6 hours before release time | 3x baseline peak | 6 hours |
| Black Friday / Cyber Monday | 24 hours before event start | 5x baseline peak | 72 hours |
| Amazon Prime Day | 24 hours before event start | 4x baseline peak | 48 hours |
| New Year sale | 12 hours before Dec 31 JST | 3x baseline peak | 96 hours |
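A minimal sketch of how the event calendar could drive a capacity multiplier. The `ProvisioningEvent` record is a hypothetical structure (JST-naive datetimes for brevity):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ProvisioningEvent:
    name: str
    start: datetime            # event start (JST, naive for brevity)
    duration_hours: int
    lead_time_hours: int       # provision this far ahead of start
    capacity_multiplier: float

def required_multiplier(events: list[ProvisioningEvent], now: datetime) -> float:
    """Return the capacity multiplier in force at `now`, accounting for
    lead time: capacity must be up before the event begins."""
    multiplier = 1.0
    for ev in events:
        provision_from = ev.start - timedelta(hours=ev.lead_time_hours)
        provision_until = ev.start + timedelta(hours=ev.duration_hours)
        if provision_from <= now < provision_until:
            multiplier = max(multiplier, ev.capacity_multiplier)
    return multiplier
```

With the Weekly Shonen Jump row above (2x baseline, 2-hour lead, 4-hour duration), the multiplier flips to 2.0 at 22:00 JST Sunday and back to 1.0 at 04:00 Monday.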
3. Utilization Monitoring
3.1 Bedrock Invocation Metrics
MangaAssist tracks the following CloudWatch metrics under the AWS/Bedrock namespace:
| Metric | What It Tells Us | Alarm Threshold |
|---|---|---|
| Invocations | Total API calls per period | Trend analysis, no hard alarm |
| InvocationLatency | P50/P90/P99 latency per invocation | P99 > 5s triggers investigation |
| InvocationClientErrors (4xx) | Validation failures, malformed requests | > 1% of invocations |
| InvocationServerErrors (5xx) | Bedrock-side failures | > 0.1% of invocations |
| InvocationThrottles | Requests rejected due to throughput limits | > 0 triggers immediate alert |
| InputTokenCount | Tokens consumed in prompts | Cost tracking, prompt bloat detection |
| OutputTokenCount | Tokens generated in responses | Cost tracking, verbosity detection |
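These metrics can be fetched in one `GetMetricData` call. A sketch of the query builder, where the `ModelId` dimension is the standard per-model dimension in the AWS/Bedrock namespace and the query ids are illustrative:

```python
def bedrock_metric_queries(model_id: str, period: int = 60) -> list[dict]:
    """Build CloudWatch GetMetricData queries for the Bedrock metrics in
    the table above; pass the result to
    cloudwatch.get_metric_data(MetricDataQueries=...)."""
    metrics = [
        ("Invocations", "Sum"),
        ("InvocationLatency", "p99"),
        ("InvocationClientErrors", "Sum"),
        ("InvocationServerErrors", "Sum"),
        ("InvocationThrottles", "Sum"),
        ("InputTokenCount", "Sum"),
        ("OutputTokenCount", "Sum"),
    ]
    return [
        {
            "Id": name.lower(),  # query ids must be lowercase
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": name,
                    "Dimensions": [{"Name": "ModelId", "Value": model_id}],
                },
                "Period": period,
                "Stat": stat,
            },
        }
        for name, stat in metrics
    ]
```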
3.2 Tokens-per-Second Tracking
Tokens-per-second (TPS) is the primary capacity metric. MangaAssist publishes a custom CloudWatch metric MangaAssist/TokensPerSecond computed as:
TPS = (InputTokenCount + OutputTokenCount) / period_seconds
This metric drives:
- Provisioned throughput sizing: If sustained TPS exceeds 80% of provisioned capacity for 10 minutes, scale-up is triggered.
- Cost allocation: TPS per intent allows finance to allocate Bedrock costs to product teams (recommendations team vs. FAQ team).
- Anomaly detection: CloudWatch Anomaly Detection on TPS catches unexpected surges (bot attacks, prompt injection attempts causing token inflation).
3.3 Waste Detection -- Over-Provisioned Capacity
Provisioned throughput that sits idle is wasted money. MangaAssist defines waste as:
Waste Ratio = 1 - (actual_TPS / provisioned_TPS)
| Waste Ratio | Interpretation | Action |
|---|---|---|
| < 20% | Healthy headroom | No action |
| 20-40% | Mild over-provisioning | Review at next capacity planning cycle |
| 40-60% | Significant waste | Schedule scale-down within 1 hour |
| > 60% | Critical waste | Immediate scale-down (overnight trough likely) |
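The formula and action bands as a small sketch (function names are illustrative):

```python
def waste_ratio(actual_tps: float, provisioned_tps: float) -> float:
    """Waste Ratio = 1 - actual_TPS / provisioned_TPS, clamped at 0
    (running above provisioned capacity is not waste)."""
    if provisioned_tps <= 0:
        return 0.0
    return max(0.0, 1.0 - actual_tps / provisioned_tps)

def waste_action(ratio: float) -> str:
    """Map a waste ratio onto the action bands in the table above."""
    if ratio < 0.20:
        return "no action"
    if ratio < 0.40:
        return "review at next capacity planning cycle"
    if ratio < 0.60:
        return "schedule scale-down within 1 hour"
    return "immediate scale-down"
```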
4. Auto-Scaling Configurations
4.1 ECS Fargate Auto-Scaling Tied to Bedrock Queue Depth
ECS Fargate runs the MangaAssist orchestrator service. Scaling this service based on CPU alone is insufficient because the orchestrator spends most of its time waiting on Bedrock (I/O bound, not CPU bound). Instead, MangaAssist scales on a custom metric: Bedrock request queue depth -- the number of in-flight Bedrock invocations waiting for a response.
graph TD
subgraph "Auto-Scaling Loop"
A[CloudWatch Custom Metric<br>bedrock_queue_depth] --> B{Queue Depth<br>Evaluation}
B -->|depth > 50 for 2 min| C[Step Scaling: +2 tasks]
B -->|depth > 100 for 1 min| D[Step Scaling: +5 tasks]
B -->|depth > 200 for 30s| E[Step Scaling: +10 tasks<br>Emergency]
B -->|depth < 20 for 10 min| F[Scale-in: -1 task]
B -->|depth < 5 for 15 min| G[Scale-in: -2 tasks]
end
subgraph "Metric Publisher"
H[ECS Orchestrator] --> I[Publish queue depth<br>every 10 seconds]
I --> A
end
subgraph "Bedrock"
H --> J[Bedrock Invocations]
J --> K[InvocationThrottles metric]
K --> A
end
4.2 Step Scaling for Gradual Ramp
Step scaling prevents oscillation during gradual traffic increases (e.g., the evening ramp from 17:00-19:00 JST). Each step adds capacity proportional to the severity of the breach:
| Queue Depth | Breach Duration | Action | Cooldown |
|---|---|---|---|
| 50 | 2 minutes | +2 ECS tasks | 120s |
| 100 | 1 minute | +5 ECS tasks | 90s |
| 200 | 30 seconds | +10 ECS tasks | 60s |
| < 20 | 10 minutes | -1 ECS task | 300s |
| < 5 | 15 minutes | -2 ECS tasks | 300s |
Scale-in cooldowns are deliberately longer than scale-out cooldowns to avoid premature shrinkage during brief traffic dips within a peak window.
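The step table can be expressed as a decision function. This is a simplified sketch: in production the steps are evaluated by CloudWatch alarms and Application Auto Scaling, not application code:

```python
def scale_step(queue_depth: int, breach_seconds: float) -> int:
    """Task-count delta for a given queue depth and breach duration,
    mirroring the step-scaling table above. Scale-out thresholds are
    checked most-severe first; scale-in requires a long quiet period."""
    if queue_depth >= 200 and breach_seconds >= 30:
        return +10  # emergency
    if queue_depth >= 100 and breach_seconds >= 60:
        return +5
    if queue_depth >= 50 and breach_seconds >= 120:
        return +2
    if queue_depth < 5 and breach_seconds >= 900:
        return -2
    if queue_depth < 20 and breach_seconds >= 600:
        return -1
    return 0
```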
4.3 Target Tracking for Tokens-per-Second
In addition to step scaling, MangaAssist uses a target tracking policy on the custom MangaAssist/TokensPerSecond metric. The target is set to maintain each ECS task at approximately 500 tokens/s throughput:
Target TPS per task = 500
Desired task count = ceil(current_total_TPS / 500)
This provides a smooth, predictive scaling signal that complements the reactive step scaling on queue depth.
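The sizing rule in code form, clamped to an assumed 5-100 task range matching the stack bounds in section 7.2:

```python
import math

TARGET_TPS_PER_TASK = 500

def desired_task_count(current_total_tps: float,
                       min_tasks: int = 5,
                       max_tasks: int = 100) -> int:
    """Desired count = ceil(total_TPS / 500), clamped to the
    service's minimum and maximum capacity."""
    wanted = math.ceil(current_total_tps / TARGET_TPS_PER_TASK)
    return max(min_tasks, min(max_tasks, wanted))
```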
5. Provisioned Throughput Optimization
5.1 Bedrock Provisioned Throughput -- Model Units
AWS Bedrock offers provisioned throughput purchased in model units. Each model unit provides a guaranteed tokens-per-second capacity for a specific model. For MangaAssist:
| Model | Use Case | On-Demand Rate | Provisioned Unit Capacity | Units Needed (Peak) |
|---|---|---|---|---|
| Claude 3 Sonnet | Complex queries (recommendations, multi-turn) | Variable, subject to throttling | ~1,000 tokens/s per unit | 15 units |
| Claude 3 Haiku | Simple queries (greetings, FAQs, order status) | Variable, subject to throttling | ~2,000 tokens/s per unit | 5 units |
5.2 On-Demand vs Provisioned -- Breakeven Analysis
The decision between on-demand and provisioned depends on sustained utilization. Provisioned throughput has a fixed hourly cost regardless of actual usage; on-demand charges per token.
| Traffic Level (tokens/s) | Monthly On-Demand Cost (est.) | Monthly Provisioned Cost (est.) | Savings with Provisioned | Recommendation |
|---|---|---|---|---|
| 2,000 (overnight trough) | $8,600 | $18,000 (5 units) | -$9,400 (loss) | On-demand |
| 7,000 (daily average) | $30,100 | $18,000 (5 units) | +$12,100 | Provisioned |
| 15,000 (evening peak) | $64,500 | $36,000 (10 units) | +$28,500 | Provisioned |
| 25,000 (peak evening) | $107,500 | $54,000 (15 units) | +$53,500 | Provisioned |
| 35,000 (manga release burst) | $150,500 | $54,000 + on-demand overflow | +$61,500 | Hybrid |
MangaAssist strategy: Provision for the sustained evening peak (15 units Sonnet + 5 units Haiku) and let on-demand handle bursts above that. During overnight trough, scale provisioned throughput down to minimum (3 units Sonnet + 2 units Haiku).
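The breakeven arithmetic can be sketched as below. All prices are placeholder assumptions for illustration, not actual Bedrock rates:

```python
SECONDS_PER_MONTH = 30 * 86_400  # 30-day month

def monthly_costs(avg_tps: float,
                  od_price_per_1k_tokens: float,
                  unit_price_per_hour: float,
                  units: int) -> tuple[float, float]:
    """Compare on-demand vs provisioned monthly cost at a sustained TPS.
    On-demand charges per token; provisioned charges per unit-hour
    regardless of usage."""
    on_demand = avg_tps * SECONDS_PER_MONTH / 1_000 * od_price_per_1k_tokens
    provisioned = units * unit_price_per_hour * 24 * 30
    return on_demand, provisioned

def breakeven_tps(od_price_per_1k_tokens: float,
                  unit_price_per_hour: float,
                  units: int) -> float:
    """Sustained TPS above which `units` of provisioned throughput is
    cheaper than pure on-demand."""
    provisioned = units * unit_price_per_hour * 24 * 30
    return provisioned / (SECONDS_PER_MONTH / 1_000 * od_price_per_1k_tokens)
```

With placeholder rates, the breakeven lands in the low thousands of tokens/s for a small unit count, which is why the overnight trough stays on-demand while the evening peak is provisioned.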
6. Throughput Management Architecture
graph TB
subgraph "Client Layer"
WS[WebSocket Clients<br>1M messages/day]
end
subgraph "Ingress"
AG[API Gateway WebSocket]
end
subgraph "Orchestration Layer (ECS Fargate)"
IC[Intent Classifier]
MB[Micro-Batcher]
RC[Request Coalescer]
PQ[Priority Queue<br>High / Normal / Low]
RL[Rate Limiter]
end
subgraph "Cache Layer"
REDIS[ElastiCache Redis<br>Coalesced Response Cache<br>Embedding Cache]
end
subgraph "FM Layer"
BPT[Bedrock Provisioned Throughput<br>Sonnet: 15 units<br>Haiku: 5 units]
BOD[Bedrock On-Demand<br>Overflow / Burst]
end
subgraph "Monitoring"
CW[CloudWatch Metrics<br>TPS, Queue Depth, Throttles]
AS[Auto-Scaling Policies<br>Step + Target Tracking]
ALARM[Alarms + SNS]
end
WS --> AG --> IC
IC --> MB
MB --> RC
RC -->|cache hit| REDIS
RC -->|cache miss| PQ
PQ --> RL
RL -->|within provisioned capacity| BPT
RL -->|overflow| BOD
BPT --> REDIS
BOD --> REDIS
CW --> AS
AS -->|scale ECS| IC
AS -->|adjust provisioned units| BPT
CW --> ALARM
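The Priority Queue component in the diagram can be sketched with a heap. This is an illustrative class, not the production implementation:

```python
import heapq
import itertools

# Lower number = higher priority; a monotonic counter keeps FIFO order
# within the same priority band (heapq tuples compare element-wise).
PRIORITY = {"high": 0, "normal": 1, "low": 2}

class PriorityRequestQueue:
    def __init__(self):
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()

    def push(self, request_id: str, priority: str = "normal") -> None:
        heapq.heappush(self._heap, (PRIORITY[priority], next(self._counter), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```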
7. Python Implementation
7.1 ThroughputManager with Batching and Rate Limiting
import asyncio
import time
import hashlib
import json
from dataclasses import dataclass, field
from typing import Optional
from collections import defaultdict
import boto3
@dataclass
class InferenceRequest:
"""A single MangaAssist inference request."""
request_id: str
user_id: str
intent: str
prompt: str
model_id: str
priority: str = "normal" # high, normal, low
created_at: float = field(default_factory=time.time)
normalized_query_hash: Optional[str] = None
@dataclass
class InferenceResponse:
"""Response from Bedrock invocation."""
request_id: str
output_text: str
input_tokens: int
output_tokens: int
latency_ms: float
from_cache: bool = False
from_coalesce: bool = False
class ThroughputManager:
"""
Manages Bedrock throughput for MangaAssist with:
- Micro-batching by intent
- Request coalescing for identical queries
- Priority-based rate limiting
- Provisioned vs on-demand routing
"""
def __init__(
self,
provisioned_tps: int = 20_000,
batch_window_ms: int = 150,
coalesce_window_ms: int = 1_000,
max_batch_size: int = 8,
max_concurrent_invocations: int = 100,
redis_client=None,
):
self.provisioned_tps = provisioned_tps
self.batch_window_ms = batch_window_ms
self.coalesce_window_ms = coalesce_window_ms
self.max_batch_size = max_batch_size
self.redis_client = redis_client
# Semaphore to limit concurrent Bedrock invocations
self._invocation_semaphore = asyncio.Semaphore(max_concurrent_invocations)
# Intent-based batch queues: intent -> list of (request, future)
self._batch_queues: dict[str, list] = defaultdict(list)
self._batch_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
# Coalescing: query_hash -> (future, timestamp)
        self._coalesce_map: dict[str, tuple[asyncio.Future, float]] = {}
self._coalesce_lock = asyncio.Lock()
# Metrics tracking
self._tokens_this_second = 0
self._tokens_lock = asyncio.Lock()
self._last_token_reset = time.time()
# Rate limiting buckets
self._rate_limits = {
"high": asyncio.Semaphore(50),
"normal": asyncio.Semaphore(40),
"low": asyncio.Semaphore(10),
}
# Bedrock client
self._bedrock_runtime = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
async def process_request(self, request: InferenceRequest) -> InferenceResponse:
"""
Main entry point. Routes through coalescing, batching, and rate limiting.
"""
# Step 1: Check coalescing (skip for order-status, unique per user)
if request.intent not in ("order_status", "account_info"):
coalesced = await self._try_coalesce(request)
if coalesced is not None:
return coalesced
        # Step 2: Check Redis cache
        cached = await self._check_cache(request)
        if cached is not None:
            # Resolve any coalesce registration made in step 1 so
            # concurrent duplicates are not left awaiting forever
            if request.normalized_query_hash:
                entry = self._coalesce_map.pop(request.normalized_query_hash, None)
                if entry is not None and not entry[0].done():
                    entry[0].set_result(cached)
            return cached
# Step 3: Enqueue for micro-batching
response = await self._enqueue_for_batch(request)
return response
def _compute_query_hash(self, request: InferenceRequest) -> str:
"""Compute a normalized hash for coalescing identical queries."""
normalized = json.dumps({
"intent": request.intent,
"prompt_hash": hashlib.sha256(
request.prompt.strip().lower().encode()
).hexdigest()[:16],
"model_id": request.model_id,
}, sort_keys=True)
return hashlib.sha256(normalized.encode()).hexdigest()[:32]
    async def _try_coalesce(self, request: InferenceRequest) -> Optional[InferenceResponse]:
        """
        If an identical query is already in-flight, wait for its result
        instead of making a duplicate Bedrock call.
        """
        query_hash = self._compute_query_hash(request)
        request.normalized_query_hash = query_hash
        existing_future = None
        async with self._coalesce_lock:
            if query_hash in self._coalesce_map:
                candidate, created_at = self._coalesce_map[query_hash]
                age_ms = (time.time() - created_at) * 1000
                if age_ms < self.coalesce_window_ms:
                    existing_future = candidate
            if existing_future is None:
                # No in-flight duplicate; register this request for coalescing
                future = asyncio.get_running_loop().create_future()
                self._coalesce_map[query_hash] = (future, time.time())
                return None  # Proceed to batching/invocation
        # Await outside the lock so other requests are not blocked
        # while the in-flight invocation completes
        result = await existing_future
        return InferenceResponse(
            request_id=request.request_id,
            output_text=result.output_text,
            input_tokens=result.input_tokens,
            output_tokens=result.output_tokens,
            latency_ms=result.latency_ms,
            from_coalesce=True,
        )
async def _check_cache(self, request: InferenceRequest) -> Optional[InferenceResponse]:
"""Check Redis for a cached response."""
if self.redis_client is None:
return None
cache_key = f"manga:resp:{request.normalized_query_hash or self._compute_query_hash(request)}"
cached_data = self.redis_client.get(cache_key)
if cached_data:
data = json.loads(cached_data)
return InferenceResponse(
request_id=request.request_id,
output_text=data["output_text"],
input_tokens=data["input_tokens"],
output_tokens=data["output_tokens"],
latency_ms=0.0,
from_cache=True,
)
return None
async def _enqueue_for_batch(self, request: InferenceRequest) -> InferenceResponse:
"""
Add request to the intent-specific batch queue.
If the batch fills up or the window expires, flush the batch.
"""
        future = asyncio.get_running_loop().create_future()
async with self._batch_locks[request.intent]:
self._batch_queues[request.intent].append((request, future))
if len(self._batch_queues[request.intent]) >= self.max_batch_size:
batch = self._batch_queues[request.intent][:self.max_batch_size]
self._batch_queues[request.intent] = self._batch_queues[request.intent][self.max_batch_size:]
asyncio.create_task(self._flush_batch(request.intent, batch))
elif len(self._batch_queues[request.intent]) == 1:
# First item in batch -- start the window timer
asyncio.create_task(self._batch_timer(request.intent))
return await future
async def _batch_timer(self, intent: str):
"""Wait for the batch window, then flush whatever is queued."""
await asyncio.sleep(self.batch_window_ms / 1000.0)
async with self._batch_locks[intent]:
if self._batch_queues[intent]:
batch = self._batch_queues[intent][:]
self._batch_queues[intent] = []
asyncio.create_task(self._flush_batch(intent, batch))
    async def _flush_batch(self, intent: str, batch: list):
        """
        Invoke Bedrock for a batch of requests concurrently.
        Uses rate limiting and routes to provisioned or on-demand.
        """
        await asyncio.gather(
            *(self._process_one(request, future) for request, future in batch)
        )

    async def _process_one(self, request: InferenceRequest, future: asyncio.Future):
        """Invoke Bedrock for one request and fan the result out."""
        priority_sem = self._rate_limits.get(request.priority, self._rate_limits["normal"])
        await priority_sem.acquire()
        try:
            async with self._invocation_semaphore:
                response = await self._invoke_bedrock(request)
            future.set_result(response)
            # Resolve any coalesced waiters for the same query
            self._resolve_coalesce(request, result=response)
            # Cache the result
            await self._cache_response(request, response)
            # Track tokens
            await self._track_tokens(response.input_tokens + response.output_tokens)
        except Exception as e:
            if not future.done():
                future.set_exception(e)
            # Propagate the failure to coalesced waiters too, so they
            # are not left awaiting a future that never resolves
            self._resolve_coalesce(request, error=e)
        finally:
            priority_sem.release()

    def _resolve_coalesce(self, request: InferenceRequest, result=None, error=None):
        """Complete the coalescing future registered for this query, if any."""
        query_hash = request.normalized_query_hash
        if query_hash and query_hash in self._coalesce_map:
            coalesce_future, _ = self._coalesce_map.pop(query_hash)
            if not coalesce_future.done():
                if error is not None:
                    coalesce_future.set_exception(error)
                else:
                    coalesce_future.set_result(result)
async def _invoke_bedrock(self, request: InferenceRequest) -> InferenceResponse:
"""Call Bedrock and return the response."""
start = time.time()
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"messages": [{"role": "user", "content": request.prompt}],
})
        # Route: provisioned throughput if within capacity, else on-demand.
        # In production the provisioned path is invoked via its provisioned
        # model ARN; this sketch uses the base model ID for both paths.
        current_tps = await self._get_current_tps()
        model_id = request.model_id
        if current_tps > self.provisioned_tps * 0.9:
            model_id = request.model_id  # near capacity: spill to on-demand
        # boto3 is synchronous; run the call in a worker thread so the
        # event loop is not blocked while Bedrock responds
        response = await asyncio.to_thread(
            self._bedrock_runtime.invoke_model,
            modelId=model_id,
            contentType="application/json",
            accept="application/json",
            body=body,
        )
response_body = json.loads(response["body"].read())
latency_ms = (time.time() - start) * 1000
return InferenceResponse(
request_id=request.request_id,
output_text=response_body["content"][0]["text"],
input_tokens=response_body["usage"]["input_tokens"],
output_tokens=response_body["usage"]["output_tokens"],
latency_ms=latency_ms,
)
async def _track_tokens(self, token_count: int):
"""Track tokens-per-second for monitoring and scaling decisions."""
async with self._tokens_lock:
now = time.time()
if now - self._last_token_reset >= 1.0:
# Publish the previous second's count to CloudWatch
await self._publish_tps_metric(self._tokens_this_second)
self._tokens_this_second = 0
self._last_token_reset = now
self._tokens_this_second += token_count
async def _get_current_tps(self) -> int:
"""Return the current tokens-per-second rate."""
async with self._tokens_lock:
return self._tokens_this_second
    async def _publish_tps_metric(self, tps: int):
        """Publish tokens-per-second to CloudWatch."""
        cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")
        # put_metric_data is a blocking network call; keep it off the loop
        await asyncio.to_thread(
            cloudwatch.put_metric_data,
            Namespace="MangaAssist",
            MetricData=[{
                "MetricName": "TokensPerSecond",
                "Value": tps,
                "Unit": "Count/Second",
                "Dimensions": [
                    {"Name": "Service", "Value": "ThroughputManager"},
                    {"Name": "Environment", "Value": "production"},
                ],
            }],
        )
async def _cache_response(self, request: InferenceRequest, response: InferenceResponse):
"""Cache the response in Redis with a short TTL."""
if self.redis_client is None:
return
if request.intent in ("order_status", "account_info"):
return # Do not cache user-specific responses
cache_key = f"manga:resp:{request.normalized_query_hash}"
self.redis_client.setex(
cache_key,
30, # 30-second TTL
json.dumps({
"output_text": response.output_text,
"input_tokens": response.input_tokens,
"output_tokens": response.output_tokens,
}),
)
def get_metrics_snapshot(self) -> dict:
"""Return a snapshot of current throughput metrics."""
return {
"current_tps": self._tokens_this_second,
"provisioned_tps": self.provisioned_tps,
"utilization_pct": round(
(self._tokens_this_second / self.provisioned_tps) * 100, 1
) if self.provisioned_tps > 0 else 0,
"batch_queue_depths": {
intent: len(queue)
for intent, queue in self._batch_queues.items()
},
"coalesce_map_size": len(self._coalesce_map),
}
7.2 Auto-Scaling Policy Configuration (CDK-Style)
"""
CDK-style auto-scaling configuration for MangaAssist ECS Fargate service
with Bedrock-aware custom metrics.
"""
from aws_cdk import (
Stack,
Duration,
aws_ecs as ecs,
aws_applicationautoscaling as appscaling,
aws_cloudwatch as cloudwatch,
)
from constructs import Construct
class MangaAssistAutoScalingStack(Stack):
"""
Configures auto-scaling for the MangaAssist orchestrator service
running on ECS Fargate, using Bedrock queue depth and tokens-per-second
as scaling signals instead of CPU/memory alone.
"""
def __init__(self, scope: Construct, id: str, ecs_service: ecs.FargateService, **kwargs):
super().__init__(scope, id, **kwargs)
# --- Scalable Target ---
scaling = ecs_service.auto_scale_task_count(
min_capacity=5, # Minimum tasks even at overnight trough
max_capacity=100, # Hard ceiling for cost safety
)
        # --- Policy 1: Target Tracking on Tokens-per-Second ---
        # Keep each ECS task at roughly 500 tokens/s. Target tracking
        # needs a metric that falls as capacity is added, so the metric
        # consumed here is assumed to be normalized per task
        # (total TPS / running task count).
        tps_metric = cloudwatch.Metric(
            namespace="MangaAssist",
            metric_name="TokensPerSecond",
            dimensions_map={
                "Service": "ThroughputManager",
                "Environment": "production",
            },
            statistic="Average",
            period=Duration.seconds(60),
        )
        scaling.scale_to_track_custom_metric(
            "ScaleOnTPS",
            metric=tps_metric,
            target_value=500,  # tokens/s per task
            scale_out_cooldown=Duration.seconds(120),
            scale_in_cooldown=Duration.seconds(300),
        )
# --- Policy 2: Step Scaling on Bedrock Queue Depth ---
queue_depth_metric = cloudwatch.Metric(
namespace="MangaAssist",
metric_name="BedrockQueueDepth",
dimensions_map={
"Service": "Orchestrator",
"Environment": "production",
},
statistic="Maximum",
period=Duration.seconds(30),
)
# Scale-out steps (aggressive)
scaling.scale_on_metric(
"ScaleOutOnQueueDepth",
metric=queue_depth_metric,
scaling_steps=[
appscaling.ScalingInterval(change=2, lower=50, upper=100),
appscaling.ScalingInterval(change=5, lower=100, upper=200),
appscaling.ScalingInterval(change=10, lower=200),
],
adjustment_type=appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
cooldown=Duration.seconds(60),
)
# Scale-in steps (conservative)
scaling.scale_on_metric(
"ScaleInOnQueueDepth",
metric=queue_depth_metric,
scaling_steps=[
appscaling.ScalingInterval(change=-1, upper=20),
appscaling.ScalingInterval(change=-2, upper=5),
],
adjustment_type=appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
cooldown=Duration.seconds(300),
)
# --- Policy 3: Scheduled Scaling for Known Peaks ---
# Evening peak: 17:00-23:59 JST (08:00-14:59 UTC)
scaling.scale_on_schedule(
"EveningPeakScaleUp",
schedule=appscaling.Schedule.cron(hour="8", minute="0"),
min_capacity=30,
)
scaling.scale_on_schedule(
"EveningPeakScaleDown",
schedule=appscaling.Schedule.cron(hour="15", minute="0"),
min_capacity=10,
)
# Overnight trough: 02:00-06:00 JST (17:00-21:00 UTC)
scaling.scale_on_schedule(
"OvernightTrough",
schedule=appscaling.Schedule.cron(hour="17", minute="0"),
min_capacity=5,
)
# Monday manga release: Scale up Sunday 22:00 JST (13:00 UTC)
scaling.scale_on_schedule(
"MondayMangaReleasePrepare",
schedule=appscaling.Schedule.cron(
week_day="SUN", hour="13", minute="0"
),
min_capacity=50,
)
scaling.scale_on_schedule(
"MondayMangaReleaseEnd",
schedule=appscaling.Schedule.cron(
week_day="MON", hour="6", minute="0" # 15:00 JST
),
min_capacity=10,
)
        # --- Policy 4: Throttle-Based Emergency Scaling ---
        throttle_metric = cloudwatch.Metric(
            namespace="AWS/Bedrock",
            metric_name="InvocationThrottles",
            statistic="Sum",
            period=Duration.seconds(60),
        )
        # The alarm pages on-call; attach an SNS topic via
        # throttle_alarm.add_alarm_action(...) to wire up notifications.
        throttle_alarm = cloudwatch.Alarm(
            self,
            "BedrockThrottleAlarm",
            metric=throttle_metric,
            threshold=5,
            evaluation_periods=1,
            comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
            alarm_description="Bedrock throttling detected -- emergency scale-out",
        )
        # The step policy reacts to the same metric independently of the alarm
        scaling.scale_on_metric(
            "EmergencyScaleOnThrottle",
            metric=throttle_metric,
            scaling_steps=[
                appscaling.ScalingInterval(change=5, lower=5, upper=20),
                appscaling.ScalingInterval(change=15, lower=20),
            ],
            adjustment_type=appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
            cooldown=Duration.seconds(60),
        )
8. Cost Comparison: On-Demand vs Provisioned Throughput
The following table uses estimated pricing to illustrate the breakeven analysis. Actual costs depend on the specific Bedrock model and region.
| Scenario | Avg TPS | Monthly On-Demand | Provisioned Units | Monthly Provisioned | Net Savings | Strategy |
|---|---|---|---|---|---|---|
| Overnight (02:00-06:00 JST) | 1,500 | $6,450 | 3 Sonnet + 2 Haiku | $14,400 | -$7,950 | On-demand only |
| Morning ramp (06:00-12:00 JST) | 5,000 | $21,500 | 5 Sonnet + 3 Haiku | $18,000 | +$3,500 | Provisioned |
| Afternoon steady (12:00-17:00 JST) | 8,000 | $34,400 | 8 Sonnet + 3 Haiku | $25,200 | +$9,200 | Provisioned |
| Evening peak (17:00-23:00 JST) | 20,000 | $86,000 | 15 Sonnet + 5 Haiku | $54,000 | +$32,000 | Provisioned |
| Manga release burst (30 min) | 35,000 | $150,500 | 15 Sonnet + 5 Haiku + overflow | $54,000 + $12,000 overflow | +$84,500 | Hybrid |
| Monthly blended | ~7,000 | $158,750 | Dynamic schedule | $111,600 | +$47,150 (30% savings) | Time-of-day provisioning |
Key takeaway: MangaAssist saves approximately 30% on Bedrock costs by using time-of-day provisioned throughput scheduling instead of pure on-demand. The overnight period is left on-demand, the evening peak is fully provisioned, and manga release bursts use a hybrid approach.
Key Takeaways
- Batching reduces cost but adds latency -- MangaAssist keeps the micro-batch window under 150ms and coalescing window under 1s to stay within the 3-second budget.
- Capacity planning is token-centric, not request-centric -- A recommendation query consumes 40x more tokens than a greeting; planning on request counts alone leads to under-provisioning.
- Auto-scaling must use Bedrock-aware metrics -- CPU utilization on ECS tasks is misleading for I/O-bound LLM workloads; queue depth and TPS are the correct signals.
- Provisioned throughput saves money only at sustained utilization -- The breakeven point for MangaAssist is around 5,000 tokens/s; below that, on-demand is cheaper.
- Event provisioning requires a calendar -- Manga releases and sales events must be pre-provisioned hours in advance because throughput changes are not instantaneous.