AI-Assisted Development Patterns for MangaAssist
MangaAssist context: a Japanese manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M input/output tokens, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, and ElastiCache Redis. Target: a useful answer in under 3 seconds at a scale of 1M messages/day.
Skill Mapping
| Attribute | Detail |
| --- | --- |
| Domain | 2 — Implementation & Integration of GenAI Applications |
| Task | 2.5 — Application Integration Patterns |
| Skill | 2.5.4 — Developer Productivity |
| Focus | Test fixtures for FM responses, mock Bedrock client, benchmark suite |
| MangaAssist Scope | Deterministic testing of non-deterministic AI, performance benchmarks, development velocity |
Mind Map
mindmap
root((AI-Assisted<br/>Development Patterns))
Test Fixtures
FM Response Fixtures
Golden Response Sets
Edge Case Responses
Japanese Content Fixtures
Error Response Fixtures
Fixture Management
Version-Controlled Fixtures
Auto-Generated from Prod
Parameterized Templates
Snapshot Testing
Mock Bedrock Client
Request Interception
Model ID Routing
Token Counting
Latency Simulation
Response Generation
Deterministic Outputs
Configurable Failures
Streaming Simulation
Cost Tracking
Validation
Schema Validation
Token Limit Enforcement
Content Safety Checks
Benchmark Suite
Latency Benchmarks
Cold Start Timing
Warm Path Timing
End-to-End Chain
Throughput Benchmarks
Concurrent Requests
Batch Processing
Rate Limiting
Cost Benchmarks
Token Usage per Query
Model Routing Savings
Cache Hit Impact
Regression Detection
Statistical Comparison
Trend Analysis
Alert Thresholds
Test Fixture Architecture
graph TB
subgraph FixtureStore["Fixture Store"]
GR[Golden Responses<br/>Known-Good Outputs]
EC[Edge Cases<br/>Unicode / Empty / Overflow]
ER[Error Responses<br/>Throttle / Timeout / 500]
JP[Japanese Fixtures<br/>Manga-Specific Content]
end
subgraph FixtureLoader["Fixture Loader"]
FL[File Loader<br/>JSON / YAML]
PL[Parameterized Loader<br/>Template + Variables]
SL[Snapshot Loader<br/>Recorded Prod Responses]
end
subgraph TestTargets["Test Targets"]
UT[Unit Tests<br/>Single Function]
IT[Integration Tests<br/>Service Chain]
CT[Contract Tests<br/>API Schema]
PT[Property Tests<br/>Invariants]
end
GR --> FL
EC --> FL
ER --> FL
JP --> PL
FL --> UT
PL --> IT
SL --> CT
FL --> PT
style GR fill:#51cf66,color:#fff
style EC fill:#ffd43b,color:#333
style ER fill:#ff6b6b,color:#fff
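The Parameterized Loader node above (template plus variables) is the smallest of these pieces, so a brief sketch helps: one Japanese product-lookup template rendered into several concrete fixtures. The template text and the render_product_fixture helper are illustrative only and are not part of the FixtureLibrary defined next.
"""
Sketch of a parameterized fixture template (assumed helper, not part of FixtureLibrary).
"""
PRODUCT_LOOKUP_TEMPLATE = (
    "「{title}」({author})は現在{stock_status}です。価格は¥{price}(税込)です。"
)

def render_product_fixture(title: str, author: str, price: int, in_stock: bool) -> dict:
    """Render one concrete fixture payload from template variables."""
    return {
        "response_text": PRODUCT_LOOKUP_TEMPLATE.format(
            title=title,
            author=author,
            price=price,
            stock_status="在庫あり" if in_stock else "在庫切れ",
        ),
        "variables": {"title": title, "author": author, "price": price},
    }

# One template yields a small matrix of Japanese-content fixtures.
fixtures = [
    render_product_fixture("ワンピース", "尾田栄一郎", 528, True),
    render_product_fixture("ハイキュー!!", "古舘春一", 484, False),
]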
FM Response Fixtures
"""
Test fixtures for Foundation Model responses in MangaAssist.
Provides deterministic, version-controlled response data for testing.
"""
import json
import hashlib
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
@dataclass
class FMResponseFixture:
"""A captured or constructed FM response for testing."""
fixture_id: str
model_id: str
prompt_hash: str
response_text: str
usage: dict[str, int]
latency_ms: float
stop_reason: str = "end_turn"
metadata: dict[str, Any] = field(default_factory=dict)
def to_bedrock_format(self) -> dict[str, Any]:
"""Convert to the format returned by Bedrock invoke_model."""
return {
"content": [{"type": "text", "text": self.response_text}],
"usage": self.usage,
"stop_reason": self.stop_reason,
"model": self.model_id,
}
class FixtureLibrary:
"""Manages a library of FM response fixtures for MangaAssist testing."""
def __init__(self, fixture_dir: str = "tests/fixtures/fm_responses"):
self.fixture_dir = Path(fixture_dir)
self.fixture_dir.mkdir(parents=True, exist_ok=True)
self._cache: dict[str, FMResponseFixture] = {}
def _prompt_hash(self, prompt: str) -> str:
"""Generate a stable hash for a prompt string."""
return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
# --- Golden response fixtures for common MangaAssist queries ---
MANGA_RECOMMENDATION = FMResponseFixture(
fixture_id="manga_recommendation_001",
model_id="anthropic.claude-3-sonnet-20240229-v1:0",
prompt_hash="a1b2c3d4e5f6g7h8",
response_text=(
"Based on your interest in action manga, I recommend these titles:\n\n"
"1. **鬼滅の刃 (Demon Slayer)** by 吾峠呼世晴 — A tale of Tanjiro's quest "
"to save his sister. 23 volumes, completed.\n\n"
"2. **呪術廻戦 (Jujutsu Kaisen)** by 芥見下々 — Modern sorcery battles "
"with compelling characters. Currently at volume 25.\n\n"
"3. **チェンソーマン (Chainsaw Man)** by 藤本タツキ — Dark, fast-paced action "
"with unique art style. Part 2 ongoing.\n\n"
"All three are available in our store. Would you like pricing or "
"availability information for any of these?"
),
usage={"input_tokens": 245, "output_tokens": 187},
latency_ms=1250.0,
)
PRODUCT_LOOKUP = FMResponseFixture(
fixture_id="product_lookup_001",
model_id="anthropic.claude-3-haiku-20240307-v1:0",
prompt_hash="b2c3d4e5f6g7h8i9",
response_text=(
"Here are the details for 進撃の巨人 (Attack on Titan) Volume 1:\n\n"
"- **Author**: 諫山創 (Isayama Hajime)\n"
"- **Publisher**: 講談社 (Kodansha)\n"
"- **Price**: ¥528 (tax included)\n"
"- **ISBN**: 978-4-06-384234-6\n"
"- **Status**: In stock\n"
"- **Format**: Tankōbon (単行本)\n\n"
"Would you like to add this to your cart?"
),
usage={"input_tokens": 120, "output_tokens": 98},
latency_ms=450.0,
)
ORDER_STATUS = FMResponseFixture(
fixture_id="order_status_001",
model_id="anthropic.claude-3-haiku-20240307-v1:0",
prompt_hash="c3d4e5f6g7h8i9j0",
response_text=(
"Your order #MNG-2024-78432 is currently being processed:\n\n"
"- **Status**: Shipped\n"
"- **Items**: 3 manga volumes\n"
"- **Tracking**: JP-1234567890\n"
"- **Estimated delivery**: 2-3 business days\n\n"
"You can track your package at the Japan Post website."
),
usage={"input_tokens": 95, "output_tokens": 72},
latency_ms=380.0,
)
JAPANESE_GREETING = FMResponseFixture(
fixture_id="japanese_greeting_001",
model_id="anthropic.claude-3-haiku-20240307-v1:0",
prompt_hash="d4e5f6g7h8i9j0k1",
response_text=(
"こんにちは!マンガアシストへようこそ。"
"どのようなマンガをお探しですか?"
"ジャンル、作者名、または作品名でお探しいただけます。"
),
usage={"input_tokens": 50, "output_tokens": 45},
latency_ms=280.0,
)
# --- Error response fixtures ---
THROTTLED_RESPONSE = FMResponseFixture(
fixture_id="error_throttle_001",
model_id="anthropic.claude-3-sonnet-20240229-v1:0",
prompt_hash="error_throttle",
response_text="",
usage={"input_tokens": 0, "output_tokens": 0},
latency_ms=0.0,
stop_reason="error",
metadata={"error_code": "ThrottlingException", "retryable": True},
)
TIMEOUT_RESPONSE = FMResponseFixture(
fixture_id="error_timeout_001",
model_id="anthropic.claude-3-sonnet-20240229-v1:0",
prompt_hash="error_timeout",
response_text="",
usage={"input_tokens": 245, "output_tokens": 0},
latency_ms=30000.0,
stop_reason="error",
metadata={"error_code": "ModelTimeoutException", "retryable": True},
)
CONTENT_FILTER_RESPONSE = FMResponseFixture(
fixture_id="error_content_filter_001",
model_id="anthropic.claude-3-sonnet-20240229-v1:0",
prompt_hash="error_content_filter",
response_text="I cannot help with that request.",
usage={"input_tokens": 120, "output_tokens": 8},
latency_ms=200.0,
stop_reason="content_filtered",
metadata={"filter_type": "safety", "category": "harmful_content"},
)
def save_fixture(self, fixture: FMResponseFixture) -> Path:
"""Persist a fixture to disk for version control."""
filepath = self.fixture_dir / f"{fixture.fixture_id}.json"
data = {
"fixture_id": fixture.fixture_id,
"model_id": fixture.model_id,
"prompt_hash": fixture.prompt_hash,
"response_text": fixture.response_text,
"usage": fixture.usage,
"latency_ms": fixture.latency_ms,
"stop_reason": fixture.stop_reason,
"metadata": fixture.metadata,
}
with open(filepath, "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=2)
return filepath
def load_fixture(self, fixture_id: str) -> FMResponseFixture | None:
"""Load a fixture from disk."""
if fixture_id in self._cache:
return self._cache[fixture_id]
filepath = self.fixture_dir / f"{fixture_id}.json"
if not filepath.exists():
return None
with open(filepath, encoding="utf-8") as f:
data = json.load(f)
fixture = FMResponseFixture(**data)
self._cache[fixture_id] = fixture
return fixture
def list_fixtures(self) -> list[str]:
"""List all available fixture IDs."""
return [
p.stem for p in self.fixture_dir.glob("*.json")
]
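A minimal pytest sketch of how the library is consumed, assuming the module above is importable as mangaassist.testing.fixtures (a hypothetical path) and that the golden fixtures are class attributes of FixtureLibrary as laid out here; adjust the import to the real project layout.
import pytest

from mangaassist.testing.fixtures import FixtureLibrary  # hypothetical module path

@pytest.fixture()
def fixture_library(tmp_path) -> FixtureLibrary:
    """Build a throwaway library and persist one golden fixture into it."""
    lib = FixtureLibrary(fixture_dir=str(tmp_path / "fm_responses"))
    lib.save_fixture(FixtureLibrary.MANGA_RECOMMENDATION)
    return lib

def test_recommendation_golden_fixture_round_trips(fixture_library):
    loaded = fixture_library.load_fixture("manga_recommendation_001")
    assert loaded is not None
    # Token counts are part of the golden contract, not just the response text.
    assert loaded.usage == {"input_tokens": 245, "output_tokens": 187}
    # Japanese titles must survive the JSON round trip (ensure_ascii=False).
    assert "鬼滅の刃" in loaded.response_text
    # The Bedrock-shaped payload feeds straight into downstream parsers.
    assert loaded.to_bedrock_format()["stop_reason"] == "end_turn"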
Mock Bedrock Client
graph TB
subgraph MockClient["Mock Bedrock Client"]
RI[Request Interceptor]
RR[Response Router]
LS[Latency Simulator]
TC[Token Counter]
CT[Cost Tracker]
end
subgraph ResponseStrategies["Response Strategies"]
FX[Fixture-Based<br/>Return saved responses]
DT[Deterministic<br/>Hash-based selection]
RN[Random<br/>Controlled randomness]
ER[Error Injection<br/>Fault simulation]
end
subgraph Validation["Request Validation"]
SV[Schema Validator<br/>Anthropic message format]
TL[Token Limit Check<br/>Max tokens enforcement]
ML[Model Validator<br/>Supported model IDs]
end
RI --> SV
RI --> TL
RI --> ML
SV -->|Valid| RR
RR --> FX
RR --> DT
RR --> RN
RR --> ER
FX --> LS
DT --> LS
LS --> TC
TC --> CT
style MockClient fill:#232f3e,color:#ff9900
Mock Bedrock Client Implementation
"""
Mock Bedrock client for deterministic testing of MangaAssist.
Simulates invoke_model, invoke_model_with_response_stream, and error conditions.
"""
import asyncio
import hashlib
import json
import logging
import random
import time
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, AsyncIterator
logger = logging.getLogger(__name__)
@dataclass
class MockModelConfig:
"""Configuration for a mocked Bedrock model."""
model_id: str
avg_latency_ms: float
latency_jitter_ms: float = 50.0
tokens_per_second: float = 80.0
max_tokens: int = 4096
error_rate: float = 0.0
throttle_rate: float = 0.0
MOCK_MODELS = {
"anthropic.claude-3-sonnet-20240229-v1:0": MockModelConfig(
model_id="anthropic.claude-3-sonnet-20240229-v1:0",
avg_latency_ms=1200.0,
latency_jitter_ms=300.0,
tokens_per_second=60.0,
max_tokens=4096,
),
"anthropic.claude-3-haiku-20240307-v1:0": MockModelConfig(
model_id="anthropic.claude-3-haiku-20240307-v1:0",
avg_latency_ms=350.0,
latency_jitter_ms=100.0,
tokens_per_second=120.0,
max_tokens=4096,
),
}
class MockBedrockClient:
"""
A mock AWS Bedrock runtime client for testing MangaAssist.
Provides deterministic responses, latency simulation, error injection,
and usage tracking — all without calling the real Bedrock service.
"""
def __init__(
self,
fixtures: dict[str, Any] | None = None,
default_model: str = "anthropic.claude-3-haiku-20240307-v1:0",
seed: int = 42,
):
self.fixtures = fixtures or {}
self.default_model = default_model
self.rng = random.Random(seed)
self.call_log: list[dict[str, Any]] = []
self.token_usage: defaultdict[str, dict[str, int]] = defaultdict(
lambda: {"input": 0, "output": 0}
)
self.total_cost_usd: float = 0.0
self._forced_errors: list[dict[str, Any]] = []
self._response_overrides: dict[str, str] = {}
def force_error(
self,
error_code: str,
error_message: str = "Simulated error",
count: int = 1,
) -> None:
"""Queue a forced error for the next N invocations."""
for _ in range(count):
self._forced_errors.append({
"code": error_code,
"message": error_message,
})
def set_response_override(self, prompt_hash: str, response: str) -> None:
"""Set a specific response for a given prompt hash."""
self._response_overrides[prompt_hash] = response
def _hash_prompt(self, messages: list[dict]) -> str:
"""Create a deterministic hash of the conversation messages."""
content = json.dumps(messages, sort_keys=True)
return hashlib.sha256(content.encode()).hexdigest()[:16]
def _estimate_tokens(self, text: str) -> int:
"""Rough token estimation: ~4 characters per token for mixed JP/EN."""
return max(1, len(text) // 4)
def _compute_cost(
self, model_id: str, input_tokens: int, output_tokens: int
) -> float:
"""Compute cost in USD based on MangaAssist model pricing."""
pricing = {
"anthropic.claude-3-sonnet-20240229-v1:0": (3.0, 15.0),
"anthropic.claude-3-haiku-20240307-v1:0": (0.25, 1.25),
}
input_rate, output_rate = pricing.get(model_id, (3.0, 15.0))
return (
(input_tokens / 1_000_000) * input_rate
+ (output_tokens / 1_000_000) * output_rate
)
def _generate_response(
self, model_id: str, messages: list[dict], max_tokens: int
) -> str:
"""Generate a deterministic mock response."""
prompt_hash = self._hash_prompt(messages)
# Check for overrides
if prompt_hash in self._response_overrides:
return self._response_overrides[prompt_hash]
# Check for fixture matches
if prompt_hash in self.fixtures:
return self.fixtures[prompt_hash]
# Default: echo-based deterministic response
last_message = messages[-1].get("content", "") if messages else ""
if isinstance(last_message, list):
last_message = " ".join(
block.get("text", "") for block in last_message
if isinstance(block, dict)
)
return (
f"[Mock {model_id.split('.')[-1].split('-')[0]}] "
f"Response to: {last_message[:100]}..."
)
async def invoke(
self,
model_id: str | None = None,
prompt: str = "",
max_tokens: int = 1024,
messages: list[dict] | None = None,
system: str | None = None,
temperature: float = 0.7,
) -> dict[str, Any]:
"""
Mock invoke_model that simulates Bedrock behavior.
Returns a dict with 'text', 'usage', 'latency_ms', and 'cost_usd'.
"""
model_id = model_id or self.default_model
config = MOCK_MODELS.get(model_id, MOCK_MODELS[self.default_model])
# Check for forced errors
if self._forced_errors:
error = self._forced_errors.pop(0)
raise Exception(
f"[{error['code']}] {error['message']}"
)
# Check random error injection
if self.rng.random() < config.error_rate:
raise Exception("[ModelError] Random error injection triggered")
if self.rng.random() < config.throttle_rate:
raise Exception("[ThrottlingException] Rate limit exceeded")
# Build messages if only prompt provided
if messages is None:
messages = [{"role": "user", "content": prompt}]
# Generate response
response_text = self._generate_response(model_id, messages, max_tokens)
# Compute tokens
input_text = json.dumps(messages)
if system:
input_text += system
input_tokens = self._estimate_tokens(input_text)
output_tokens = self._estimate_tokens(response_text)
# Simulate latency
latency = config.avg_latency_ms + self.rng.gauss(0, config.latency_jitter_ms)
latency = max(50.0, latency)
await asyncio.sleep(latency / 1000.0)
# Track usage
self.token_usage[model_id]["input"] += input_tokens
self.token_usage[model_id]["output"] += output_tokens
cost = self._compute_cost(model_id, input_tokens, output_tokens)
self.total_cost_usd += cost
# Log the call
call_record = {
"model_id": model_id,
"input_tokens": input_tokens,
"output_tokens": output_tokens,
"latency_ms": round(latency, 2),
"cost_usd": round(cost, 8),
"timestamp": time.time(),
}
self.call_log.append(call_record)
return {
"text": response_text,
"usage": {
"input_tokens": input_tokens,
"output_tokens": output_tokens,
},
"latency_ms": round(latency, 2),
"cost_usd": round(cost, 8),
"model_id": model_id,
"stop_reason": "end_turn",
}
async def invoke_stream(
self,
model_id: str | None = None,
prompt: str = "",
max_tokens: int = 1024,
messages: list[dict] | None = None,
) -> AsyncIterator[dict[str, Any]]:
"""Mock streaming response that yields chunks."""
model_id = model_id or self.default_model
config = MOCK_MODELS.get(model_id, MOCK_MODELS[self.default_model])
if messages is None:
messages = [{"role": "user", "content": prompt}]
response_text = self._generate_response(model_id, messages, max_tokens)
# Simulate streaming: yield word-by-word
words = response_text.split()
chunk_size = max(1, len(words) // 10)
for i in range(0, len(words), chunk_size):
chunk = " ".join(words[i : i + chunk_size])
await asyncio.sleep(50 / 1000.0) # 50ms between chunks
yield {
"type": "content_block_delta",
"delta": {"type": "text_delta", "text": chunk + " "},
}
yield {
"type": "message_stop",
"amazon-bedrock-invocationMetrics": {
"inputTokenCount": self._estimate_tokens(json.dumps(messages)),
"outputTokenCount": self._estimate_tokens(response_text),
},
}
def get_usage_report(self) -> dict[str, Any]:
"""Get a summary of all mock invocations."""
return {
"total_calls": len(self.call_log),
"total_cost_usd": round(self.total_cost_usd, 6),
"token_usage": dict(self.token_usage),
"avg_latency_ms": (
round(
sum(c["latency_ms"] for c in self.call_log) / len(self.call_log), 2
)
if self.call_log
else 0
),
"calls_by_model": {
model: sum(1 for c in self.call_log if c["model_id"] == model)
for model in set(c["model_id"] for c in self.call_log)
},
}
def reset(self) -> None:
"""Reset all tracking state."""
self.call_log.clear()
self.token_usage.clear()
self.total_cost_usd = 0.0
self._forced_errors.clear()
self._response_overrides.clear()
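A short pytest sketch of the mock in use, assuming pytest-asyncio is installed and the module above is importable as mangaassist.testing.mock_bedrock (both assumptions): one test forces a throttle and checks usage tracking, the other reassembles streamed chunks the way the WebSocket handler would.
import pytest

from mangaassist.testing.mock_bedrock import MockBedrockClient  # hypothetical path

@pytest.mark.asyncio
async def test_forced_throttle_then_success():
    client = MockBedrockClient(seed=42)
    client.force_error("ThrottlingException", "Rate limit exceeded", count=1)
    with pytest.raises(Exception, match="ThrottlingException"):
        await client.invoke(prompt="おすすめのマンガは?")
    # The next call succeeds; only successful calls are recorded with a cost.
    result = await client.invoke(prompt="おすすめのマンガは?")
    assert result["stop_reason"] == "end_turn"
    report = client.get_usage_report()
    assert report["total_calls"] == 1
    assert report["total_cost_usd"] > 0

@pytest.mark.asyncio
async def test_streaming_chunks_reassemble():
    client = MockBedrockClient()
    chunks = []
    async for event in client.invoke_stream(prompt="進撃の巨人はありますか?"):
        if event["type"] == "content_block_delta":
            chunks.append(event["delta"]["text"])
    # The assembled text matches the deterministic mock response format.
    assert "".join(chunks).startswith("[Mock")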
Benchmark Suite
graph TB
subgraph BenchmarkTypes["Benchmark Categories"]
LB[Latency Benchmarks<br/>Response timing]
TB[Throughput Benchmarks<br/>Concurrent load]
CB[Cost Benchmarks<br/>Token economics]
RB[Regression Benchmarks<br/>Historical comparison]
end
subgraph Execution["Benchmark Execution"]
WU[Warm-Up Phase<br/>10 requests]
ME[Measurement Phase<br/>100+ requests]
CD[Cool-Down Phase<br/>Flush metrics]
end
subgraph Analysis["Statistical Analysis"]
DS[Descriptive Stats<br/>Mean / Median / StdDev]
PC[Percentiles<br/>P50 / P95 / P99]
CI[Confidence Intervals<br/>95% CI]
TR[Trend Detection<br/>Regression alerts]
end
subgraph Output["Benchmark Output"]
JR[JSON Report]
HR[HTML Dashboard]
GA[GitHub Actions Annotation]
SL[Slack Notification]
end
LB --> WU
TB --> WU
CB --> WU
RB --> WU
WU --> ME
ME --> CD
CD --> DS
CD --> PC
DS --> CI
PC --> TR
CI --> JR
TR --> GA
JR --> HR
GA --> SL
style ME fill:#339af0,color:#fff
style TR fill:#ff6b6b,color:#fff
Benchmark Suite Implementation
"""
Benchmark suite for MangaAssist chatbot performance testing.
Measures latency, throughput, cost, and detects regressions.
"""
import asyncio
import json
import logging
import statistics
import time
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Any
logger = logging.getLogger(__name__)
class BenchmarkStatus(Enum):
PASS = "pass"
REGRESSION = "regression"
IMPROVEMENT = "improvement"
BASELINE = "baseline"
@dataclass
class BenchmarkResult:
"""Result of a single benchmark run."""
name: str
samples: int
mean_ms: float
median_ms: float
stddev_ms: float
p50_ms: float
p95_ms: float
p99_ms: float
min_ms: float
max_ms: float
throughput_rps: float = 0.0
total_tokens: int = 0
total_cost_usd: float = 0.0
status: BenchmarkStatus = BenchmarkStatus.BASELINE
regression_pct: float = 0.0
@dataclass
class BenchmarkConfig:
"""Configuration for a benchmark run."""
name: str
warmup_iterations: int = 10
measurement_iterations: int = 100
concurrent_workers: int = 1
timeout_seconds: float = 30.0
model_id: str = "anthropic.claude-3-haiku-20240307-v1:0"
regression_threshold_pct: float = 10.0
class LatencyBenchmark:
"""Measures response latency for MangaAssist operations."""
def __init__(self, bedrock_client: Any, config: BenchmarkConfig):
self.client = bedrock_client
self.config = config
self.latencies: list[float] = []
async def _single_request(self, prompt: str) -> dict[str, Any]:
"""Execute a single benchmarked request."""
start = time.monotonic()
result = await self.client.invoke(
model_id=self.config.model_id,
prompt=prompt,
max_tokens=512,
)
elapsed_ms = (time.monotonic() - start) * 1000
return {"latency_ms": elapsed_ms, **result}
async def run(self, prompts: list[str]) -> BenchmarkResult:
"""Run the latency benchmark with warmup and measurement phases."""
# Warm-up phase
logger.info("Warmup: %d iterations", self.config.warmup_iterations)
for i in range(self.config.warmup_iterations):
prompt = prompts[i % len(prompts)]
await self._single_request(prompt)
# Measurement phase
logger.info("Measuring: %d iterations", self.config.measurement_iterations)
self.latencies.clear()
total_tokens = 0
total_cost = 0.0
for i in range(self.config.measurement_iterations):
prompt = prompts[i % len(prompts)]
result = await self._single_request(prompt)
self.latencies.append(result["latency_ms"])
usage = result.get("usage", {})
total_tokens += usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
total_cost += result.get("cost_usd", 0)
sorted_latencies = sorted(self.latencies)
total_time_s = sum(self.latencies) / 1000.0
return BenchmarkResult(
name=self.config.name,
samples=len(self.latencies),
mean_ms=round(statistics.mean(self.latencies), 2),
median_ms=round(statistics.median(self.latencies), 2),
stddev_ms=round(statistics.stdev(self.latencies), 2) if len(self.latencies) > 1 else 0,
p50_ms=round(sorted_latencies[len(sorted_latencies) // 2], 2),
p95_ms=round(sorted_latencies[int(len(sorted_latencies) * 0.95)], 2),
p99_ms=round(sorted_latencies[int(len(sorted_latencies) * 0.99)], 2),
min_ms=round(min(self.latencies), 2),
max_ms=round(max(self.latencies), 2),
throughput_rps=round(len(self.latencies) / total_time_s, 2) if total_time_s > 0 else 0,
total_tokens=total_tokens,
total_cost_usd=round(total_cost, 6),
)
class ThroughputBenchmark:
"""Measures concurrent request throughput for MangaAssist."""
def __init__(self, bedrock_client: Any, config: BenchmarkConfig):
self.client = bedrock_client
self.config = config
async def _worker(
self, worker_id: int, prompts: list[str], results: list[float]
) -> None:
"""A single concurrent worker executing requests."""
iterations = self.config.measurement_iterations // self.config.concurrent_workers
for i in range(iterations):
prompt = prompts[(worker_id * iterations + i) % len(prompts)]
start = time.monotonic()
await self.client.invoke(
model_id=self.config.model_id,
prompt=prompt,
max_tokens=512,
)
results.append((time.monotonic() - start) * 1000)
async def run(self, prompts: list[str]) -> BenchmarkResult:
"""Run throughput benchmark with concurrent workers."""
all_latencies: list[float] = []
wall_start = time.monotonic()
tasks = [
self._worker(i, prompts, all_latencies)
for i in range(self.config.concurrent_workers)
]
await asyncio.gather(*tasks)
wall_time_s = time.monotonic() - wall_start
sorted_lat = sorted(all_latencies)
return BenchmarkResult(
name=f"{self.config.name}_concurrent_{self.config.concurrent_workers}",
samples=len(all_latencies),
mean_ms=round(statistics.mean(all_latencies), 2),
median_ms=round(statistics.median(all_latencies), 2),
stddev_ms=round(statistics.stdev(all_latencies), 2) if len(all_latencies) > 1 else 0,
p50_ms=round(sorted_lat[len(sorted_lat) // 2], 2),
p95_ms=round(sorted_lat[int(len(sorted_lat) * 0.95)], 2),
p99_ms=round(sorted_lat[int(len(sorted_lat) * 0.99)], 2),
min_ms=round(min(all_latencies), 2),
max_ms=round(max(all_latencies), 2),
throughput_rps=round(len(all_latencies) / wall_time_s, 2),
)
class RegressionDetector:
"""Detects performance regressions by comparing against historical baselines."""
def __init__(
self,
baseline_path: str = "benchmarks/baselines",
threshold_pct: float = 10.0,
):
self.baseline_path = Path(baseline_path)
self.baseline_path.mkdir(parents=True, exist_ok=True)
self.threshold_pct = threshold_pct
def save_baseline(self, result: BenchmarkResult) -> None:
"""Save a benchmark result as the new baseline."""
filepath = self.baseline_path / f"{result.name}.json"
data = {
"name": result.name,
"p95_ms": result.p95_ms,
"mean_ms": result.mean_ms,
"throughput_rps": result.throughput_rps,
"timestamp": time.time(),
}
with open(filepath, "w") as f:
json.dump(data, f, indent=2)
def load_baseline(self, name: str) -> dict[str, Any] | None:
"""Load a historical baseline for comparison."""
filepath = self.baseline_path / f"{name}.json"
if not filepath.exists():
return None
with open(filepath) as f:
return json.load(f)
def compare(self, result: BenchmarkResult) -> BenchmarkResult:
"""Compare a result against its baseline and set status."""
baseline = self.load_baseline(result.name)
if not baseline:
result.status = BenchmarkStatus.BASELINE
return result
baseline_p95 = baseline["p95_ms"]
change_pct = ((result.p95_ms - baseline_p95) / baseline_p95) * 100
result.regression_pct = round(change_pct, 2)
if change_pct > self.threshold_pct:
result.status = BenchmarkStatus.REGRESSION
logger.warning(
"REGRESSION: %s P95 increased by %.1f%% (%s -> %s ms)",
result.name, change_pct, baseline_p95, result.p95_ms,
)
elif change_pct < -self.threshold_pct:
result.status = BenchmarkStatus.IMPROVEMENT
logger.info(
"IMPROVEMENT: %s P95 decreased by %.1f%%",
result.name, abs(change_pct),
)
else:
result.status = BenchmarkStatus.PASS
return result
class BenchmarkSuite:
"""Orchestrates the full benchmark suite for MangaAssist."""
# Standard prompts for benchmarking MangaAssist
STANDARD_PROMPTS = [
"おすすめのアクションマンガを教えてください",
"進撃の巨人の1巻はありますか?",
"注文番号MNG-2024-78432の配送状況を教えてください",
"What are the top selling manga this week?",
"鬼滅の刃の全巻セットの価格は?",
"Do you have any manga by 手塚治虫?",
"新刊コミックの入荷予定を教えてください",
"Can you recommend manga similar to One Piece?",
"漫画のギフトラッピングはできますか?",
"What genres of manga do you carry?",
]
def __init__(self, bedrock_client: Any):
self.client = bedrock_client
self.detector = RegressionDetector()
self.results: list[BenchmarkResult] = []
async def run_all(self) -> dict[str, Any]:
"""Run the complete benchmark suite."""
# Latency benchmarks
for model_name, model_id in [
("sonnet", "anthropic.claude-3-sonnet-20240229-v1:0"),
("haiku", "anthropic.claude-3-haiku-20240307-v1:0"),
]:
config = BenchmarkConfig(
name=f"latency_{model_name}",
warmup_iterations=5,
measurement_iterations=50,
model_id=model_id,
)
bench = LatencyBenchmark(self.client, config)
result = await bench.run(self.STANDARD_PROMPTS)
result = self.detector.compare(result)
self.results.append(result)
# Throughput benchmark
config = BenchmarkConfig(
name="throughput_haiku",
warmup_iterations=5,
measurement_iterations=100,
concurrent_workers=10,
model_id="anthropic.claude-3-haiku-20240307-v1:0",
)
bench = ThroughputBenchmark(self.client, config)
result = await bench.run(self.STANDARD_PROMPTS)
result = self.detector.compare(result)
self.results.append(result)
# Build summary
regressions = [r for r in self.results if r.status == BenchmarkStatus.REGRESSION]
return {
"total_benchmarks": len(self.results),
"regressions": len(regressions),
"results": [
{
"name": r.name,
"status": r.status.value,
"p95_ms": r.p95_ms,
"mean_ms": r.mean_ms,
"throughput_rps": r.throughput_rps,
"regression_pct": r.regression_pct,
"total_cost_usd": r.total_cost_usd,
}
for r in self.results
],
"gate_passed": len(regressions) == 0,
}
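One way to wire the suite into CI is sketched below: run it against the mock client and turn gate_passed into the process exit code. The module paths are assumptions; with the simulated latencies in MOCK_MODELS a full run takes a few minutes, so those values can be dialed down for faster pipelines.
"""
CI entry point sketch for the benchmark gate (module paths are assumptions).
"""
import asyncio
import json
import sys

from mangaassist.testing.mock_bedrock import MockBedrockClient  # hypothetical path
from mangaassist.benchmarks.suite import BenchmarkSuite  # hypothetical path

async def main() -> int:
    client = MockBedrockClient(seed=7)
    suite = BenchmarkSuite(client)
    summary = await suite.run_all()
    print(json.dumps(summary, indent=2, ensure_ascii=False))
    # A non-zero exit fails the pipeline when any benchmark regressed past the threshold.
    return 0 if summary["gate_passed"] else 1

if __name__ == "__main__":
    sys.exit(asyncio.run(main()))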
Cost Modeling for MangaAssist
graph LR
subgraph QueryTypes["Query Types"]
SQ[Simple Query<br/>Greeting / FAQ]
PL[Product Lookup<br/>ISBN / Title]
RC[Recommendation<br/>Personalized]
OS[Order Status<br/>Tracking]
CX[Complex Query<br/>Multi-Turn]
end
subgraph ModelRouting["Model Routing"]
HK[Haiku<br/>$0.25 / $1.25]
SN[Sonnet<br/>$3.00 / $15.00]
end
subgraph CostImpact["Daily Cost @ 1M msgs"]
HC[Haiku Simple<br/>~$150/day]
SC[Sonnet Complex<br/>~$4,500/day]
MC[Mixed Routing<br/>~$900/day]
end
SQ --> HK
PL --> HK
OS --> HK
RC --> SN
CX --> SN
HK --> HC
SN --> SC
HK --> MC
SN --> MC
style HK fill:#51cf66,color:#fff
style SN fill:#339af0,color:#fff
style MC fill:#ffd43b,color:#333
Cost Analysis Utility
"""
Cost analysis and model routing optimizer for MangaAssist.
Estimates daily costs at 1M messages/day based on query mix.
"""
from dataclasses import dataclass
from typing import Any
@dataclass
class QueryProfile:
"""Profile for a type of user query in MangaAssist."""
name: str
share_pct: float # Percentage of total traffic
avg_input_tokens: int
avg_output_tokens: int
preferred_model: str # "haiku" or "sonnet"
cache_hit_rate: float # 0.0 to 1.0
MANGA_QUERY_MIX = [
QueryProfile("greeting_faq", 20.0, 80, 60, "haiku", 0.90),
QueryProfile("product_lookup", 30.0, 150, 120, "haiku", 0.40),
QueryProfile("recommendation", 20.0, 300, 250, "sonnet", 0.10),
QueryProfile("order_status", 15.0, 100, 80, "haiku", 0.05),
QueryProfile("complex_multi_turn", 15.0, 500, 400, "sonnet", 0.05),
]
MODEL_COSTS_PER_M = {
"haiku": {"input": 0.25, "output": 1.25},
"sonnet": {"input": 3.00, "output": 15.00},
}
def estimate_daily_cost(
daily_messages: int = 1_000_000,
query_mix: list[QueryProfile] | None = None,
) -> dict[str, Any]:
"""Estimate daily cost breakdown for MangaAssist at scale."""
query_mix = query_mix or MANGA_QUERY_MIX
breakdown = []
total_cost = 0.0
total_cached = 0
for profile in query_mix:
daily_queries = int(daily_messages * profile.share_pct / 100)
cached = int(daily_queries * profile.cache_hit_rate)
uncached = daily_queries - cached
total_cached += cached
pricing = MODEL_COSTS_PER_M[profile.preferred_model]
input_cost = (uncached * profile.avg_input_tokens / 1_000_000) * pricing["input"]
output_cost = (uncached * profile.avg_output_tokens / 1_000_000) * pricing["output"]
query_cost = input_cost + output_cost
total_cost += query_cost
breakdown.append({
"query_type": profile.name,
"model": profile.preferred_model,
"daily_queries": daily_queries,
"cached": cached,
"uncached": uncached,
"daily_cost_usd": round(query_cost, 2),
"cost_per_query_usd": round(query_cost / max(uncached, 1), 6),
})
return {
"daily_messages": daily_messages,
"total_daily_cost_usd": round(total_cost, 2),
"monthly_cost_usd": round(total_cost * 30, 2),
"cache_savings_pct": round(total_cached / daily_messages * 100, 1),
"avg_cost_per_message_usd": round(total_cost / daily_messages, 6),
"breakdown": breakdown,
}
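A usage sketch follows: compare the default mixed routing against an all-Sonnet variant of the same query mix to quantify the routing lever described above. The all_sonnet list is constructed here purely for illustration and is not part of the utility.
# Compare mixed routing against an all-Sonnet mix (illustrative, not part of the utility).
all_sonnet = [
    QueryProfile(p.name, p.share_pct, p.avg_input_tokens,
                 p.avg_output_tokens, "sonnet", p.cache_hit_rate)
    for p in MANGA_QUERY_MIX
]

mixed = estimate_daily_cost()
sonnet_only = estimate_daily_cost(query_mix=all_sonnet)

print(f"Mixed routing: ${mixed['total_daily_cost_usd']:,}/day")
print(f"All Sonnet:    ${sonnet_only['total_daily_cost_usd']:,}/day")
savings = round(sonnet_only["total_daily_cost_usd"] - mixed["total_daily_cost_usd"], 2)
print(f"Routing saves: ${savings:,}/day")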
Key Takeaways
| # | Takeaway | MangaAssist Application |
| --- | --- | --- |
| 1 | Test fixtures make non-deterministic FM testing deterministic. Golden responses, error scenarios, and Japanese content fixtures give repeatable test runs. | Every MangaAssist prompt category (recommendation, product lookup, order status) has a golden fixture with known token counts and quality scores. |
| 2 | Mock Bedrock clients must simulate latency, token counting, and cost — not just return canned text. | The mock client tracks cumulative cost using real Sonnet/Haiku pricing so developers see cost impact during testing. |
| 3 | Streaming mocks are essential for testing the WebSocket delivery path. Word-by-word yielding simulates real Bedrock streaming. | API Gateway WebSocket tests use the mock stream iterator to verify chunk assembly and partial-response rendering. |
| 4 | Benchmark suites need a warm-up phase to keep cold-start noise out of the measurements. | Running 10 warm-up requests before 100 measured requests ensures the ECS Fargate container pool is warm. |
| 5 | Regression detection compares P95 latency against a stored baseline using a relative threshold (>10% change) rather than a fixed absolute limit. | A 12% P95 increase after a prompt template change triggers a warning annotation on the GitHub PR. |
| 6 | Cost modeling at 1M messages/day shows that model routing is the largest lever — Haiku for simple queries saves 80%+ versus all-Sonnet. | Routing greetings and FAQ to Haiku ($0.25/$1.25) instead of Sonnet ($3/$15) saves an estimated $3,600/day. |
| 7 | Cache hit rates vary dramatically by query type — FAQs cache at 90% while recommendations cache at 10%. | ElastiCache Redis hit rates are tracked per query category, and the cost model applies them per query type when estimating daily spend. |