03 — Cost–Performance & Token Efficiency — Answers
Easy
A1. Per-Conversation LLM Cost Calculation
Claude 3.5 Sonnet pricing:
- Input: $3.00 per million tokens
- Output: $15.00 per million tokens

Claude 3 Haiku pricing:
- Input: $0.25 per million tokens
- Output: $1.25 per million tokens
Recommendation intent — Sonnet:
$$\text{Input cost} = 800 \times \frac{\$3.00}{1{,}000{,}000} = \$0.0024$$
$$\text{Output cost} = 400 \times \frac{\$15.00}{1{,}000{,}000} = \$0.0060$$
$$\text{Total per conversation} = \$0.0024 + \$0.0060 = \$0.0084$$
Recommendation intent — Haiku:
$$\text{Input cost} = 800 \times \frac{\$0.25}{1{,}000{,}000} = \$0.0002$$
$$\text{Output cost} = 400 \times \frac{\$1.25}{1{,}000{,}000} = \$0.0005$$
$$\text{Total per conversation} = \$0.0002 + \$0.0005 = \$0.0007$$
Cost ratio:
$$\frac{\text{Sonnet}}{\text{Haiku}} = \frac{\$0.0084}{\$0.0007} = 12\times$$
Daily cost at 40K conversations/day:
- Sonnet: 40,000 × $0.0084 = $336/day ($10,080/month)
- Haiku: 40,000 × $0.0007 = $28/day ($840/month)
- Daily savings from switching to Haiku: $308/day ($9,240/month)
However, these savings are only realizable if Haiku meets the quality threshold for recommendation. Based on evaluation data, Haiku scores below the 0.88 BERTScore threshold for this intent, so the savings cannot be captured without compromising quality.
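These unit costs recur throughout this section, so they are worth encoding once. A minimal sketch, with prices hard-coded from the figures above:

```python
# Per-conversation LLM cost; prices are $ per million tokens (from A1).
PRICING = {
    "sonnet": {"input": 3.00, "output": 15.00},
    "haiku": {"input": 0.25, "output": 1.25},
}

def cost_per_conversation(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

assert round(cost_per_conversation("sonnet", 800, 400), 4) == 0.0084  # matches above
assert round(cost_per_conversation("haiku", 800, 400), 4) == 0.0007
```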
A2. Token Budgets for MangaAssist
Definition: A token budget is a hard or soft limit on the number of input and output tokens consumed per LLM invocation for a specific intent. It controls costs, prevents runaway token consumption, and ensures predictable latency.
FAQ intent token budget:
| Component | Token Budget | Rationale |
|---|---|---|
| System prompt | 200 tokens | Fixed — FAQ instructions and persona |
| Conversation history | 300 tokens (max 2 prior turns) | FAQ rarely needs deep history |
| User query | 100 tokens | FAQ questions are typically concise |
| RAG context | 400 tokens (2 chunks max) | FAQ answers are usually in a single knowledge base article |
| Total input budget | 1,000 tokens | |
| Output budget (max_tokens) | 250 tokens | FAQ answers should be concise and direct |
Enforcement on ECS Fargate:
```python
# In the orchestrator service running on ECS Fargate.
# count_tokens, truncate_to_budget, and get_model_for_intent are orchestrator
# helpers defined elsewhere; bedrock is the Bedrock runtime client.
INTENT_TOKEN_BUDGETS = {
    "faq": {"max_input": 1000, "max_output": 250},
    "recommendation": {"max_input": 2500, "max_output": 500},
    "chitchat": {"max_input": 600, "max_output": 150},
    # ... other intents
}

async def invoke_llm(intent: str, prompt: str, context: str) -> str:
    budget = INTENT_TOKEN_BUDGETS[intent]
    # Token counting using tiktoken (approximate for Claude)
    input_tokens = count_tokens(prompt + context)
    if input_tokens > budget["max_input"]:
        # Strategy: truncate conversation history first, then RAG context
        prompt = truncate_to_budget(prompt, context, budget["max_input"])
    response = await bedrock.invoke_model(
        modelId=get_model_for_intent(intent),
        body={
            "prompt": prompt,
            "max_tokens": budget["max_output"],  # Hard cap enforced by Bedrock
            "stop_sequences": ["\n\nHuman:"],
        },
    )
    return response
```
The max_tokens parameter in the Bedrock API call provides the hard enforcement; input truncation is application-level and is applied before the API call.
A3. Effective Cost with Caching
Given:
- faq: 50K conversations/day, 500 input / 200 output tokens, Haiku, 45% cache hit rate
- order_tracking: 30K conversations/day, 600 input / 300 output tokens, Haiku, 30% cache hit rate
FAQ cost calculation:
Per-conversation Haiku cost:
$$\text{Input} = 500 \times \frac{\$0.25}{1M} = \$0.000125$$
$$\text{Output} = 200 \times \frac{\$1.25}{1M} = \$0.000250$$
$$\text{Total} = \$0.000375$$
With a 45% cache hit rate (cache hits cost ~$0 for the LLM — only the Redis lookup):
- LLM invocations: 50,000 × 0.55 = 27,500
- Cache hits: 50,000 × 0.45 = 22,500 (near-zero LLM cost)
$$\text{FAQ daily cost} = 27{,}500 \times \$0.000375 = \$10.31$$
Without caching: 50,000 × $0.000375 = $18.75 → 45% savings from caching
Order tracking cost calculation:
Per-conversation Haiku cost:
$$\text{Input} = 600 \times \frac{\$0.25}{1M} = \$0.000150$$
$$\text{Output} = 300 \times \frac{\$1.25}{1M} = \$0.000375$$
$$\text{Total} = \$0.000525$$
With a 30% cache hit rate:
- LLM invocations: 30,000 × 0.70 = 21,000
$$\text{Order tracking daily cost} = 21{,}000 \times \$0.000525 = \$11.03$$
Without caching: 30,000 × $0.000525 = $15.75 → 30% savings
Combined daily cost: $10.31 + $11.03 = $21.34/day
Without caching: $18.75 + $15.75 = $34.50/day
Total savings from caching: $13.16/day ($394.80/month)
Note: These calculations exclude ElastiCache Redis infrastructure cost (~$150/month for a cache.r6g.large instance), which is amortized across all cached intents.
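The cache-adjusted math generalizes to any intent. A short sketch reproducing the numbers above:

```python
# Effective daily LLM cost with caching — cache hits are treated as zero LLM cost.
def effective_daily_cost(daily_volume: int, cost_per_conv: float, cache_hit_rate: float) -> float:
    return daily_volume * (1 - cache_hit_rate) * cost_per_conv

faq = effective_daily_cost(50_000, 0.000375, 0.45)       # ≈ $10.31/day
tracking = effective_daily_cost(30_000, 0.000525, 0.30)  # ≈ $11.03/day
print(f"combined: ${faq + tracking:.2f}/day")            # ≈ $21.34/day
```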
Medium
A4. Complete Cost-Per-Conversation Breakdown
Assumptions:
- Sonnet: $3.00/$15.00 per M input/output tokens
- Haiku: $0.25/$1.25 per M input/output tokens
- Average tokens per conversation: 800 input, 350 output (varies by intent — detailed below)
| Intent | Model | Daily Vol. | Avg In Tok | Avg Out Tok | Cost/Conv | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|---|---|
| recommendation | Sonnet | 67K | 1,200 | 450 | $0.01035 | $693 | $20,790 |
| product_question | Sonnet | 50K | 900 | 400 | $0.00870 | $435 | $13,050 |
| faq | Haiku | 100K | 500 | 200 | $0.00038 | $38 | $1,125 |
| order_tracking | Haiku | 83K | 600 | 300 | $0.00053 | $44 | $1,313 |
| return_request | Haiku | 27K | 700 | 350 | $0.00061 | $16 | $494 |
| promotion | Haiku | 40K | 400 | 200 | $0.00035 | $14 | $420 |
| checkout_help | Sonnet | 33K | 800 | 300 | $0.00690 | $228 | $6,831 |
| chitchat | Haiku | 17K | 300 | 150 | $0.00026 | $4 | $133 |
| escalation | Haiku | 7K | 600 | 400 | $0.00065 | $5 | $137 |
| product_discovery | Sonnet | 60K | 1,100 | 400 | $0.00930 | $558 | $16,740 |
| TOTAL | — | 484K | — | — | — | $2,035 | $61,033 |
Top 3 most expensive intents:
1. recommendation — $20,790/month (34% of total spend)
   - Optimization: Implement semantic caching — similar recommendation queries share responses. At a 25% cache hit rate → saves ~$5,200/month.
   - Optimization: Context compression — reduce RAG context from 5 chunks to the top 2 using a re-ranker → cuts input tokens from 1,200 to 700 → saves ~$3,000/month.
2. product_discovery — $16,740/month (27%)
   - Optimization: Complexity-based routing — route simple discovery queries ("popular manga") to Haiku, complex ones to Sonnet. At a 40% simple ratio → saves ~$6,030/month.
   - Optimization: Progressive disclosure — start with Haiku for initial broad results, escalate to Sonnet only if the user refines the search.
3. product_question — $13,050/month (21%)
   - Optimization: Fine-tune Haiku on product_question data → if quality meets the threshold, migrate fully → saves ~$12,000/month.
   - Optimization: Structured data extraction — for questions about specific product attributes (price, page count, dimensions), use a direct DynamoDB lookup instead of the LLM → eliminates ~30% of LLM calls.
Projected savings from all optimizations: ~$26,000/month (43% reduction)
A5. RAG Context Cost Comparison for Product Discovery
Scenario details:
- 5 retrieved chunks × 300 tokens each = 1,500 tokens of RAG context
- Base input tokens (system prompt + user query): 800
- Output tokens: 400
Option A — All 5 chunks to Sonnet:
$$\text{Input} = (800 + 1{,}500) \times \frac{\$3.00}{1M} = 2{,}300 \times \$0.000003 = \$0.00690$$
$$\text{Output} = 400 \times \frac{\$15.00}{1M} = \$0.00600$$
$$\text{Total} = \$0.01290 \text{ per conversation}$$
Option B — All 5 chunks to Haiku:
$$\text{Input} = 2{,}300 \times \frac{\$0.25}{1M} = \$0.000575$$
$$\text{Output} = 400 \times \frac{\$1.25}{1M} = \$0.000500$$
$$\text{Total} = \$0.001075 \text{ per conversation}$$
Option C — Re-ranker selects top 2 chunks, send to Sonnet:
Re-ranker cost: $0.0001 per inference
Input tokens with 2 chunks: 800 + (2 × 300) = 1,400
$$\text{Re-ranker} = \$0.0001$$
$$\text{Input} = 1{,}400 \times \frac{\$3.00}{1M} = \$0.00420$$
$$\text{Output} = 400 \times \frac{\$15.00}{1M} = \$0.00600$$
$$\text{Total} = \$0.0001 + \$0.00420 + \$0.00600 = \$0.01030 \text{ per conversation}$$
Comparison at 60K conversations/day:
| Option | Cost/Conv | Daily Cost | Monthly Cost | Quality |
|---|---|---|---|---|
| A: 5 chunks → Sonnet | $0.01290 | $774 | $23,220 | Highest — most context |
| B: 5 chunks → Haiku | $0.00108 | $65 | $1,935 | Lowest — Haiku may miss nuance |
| C: 2 chunks → Sonnet + re-ranker | $0.01030 | $618 | $18,540 | High — top 2 chunks are usually sufficient |
Recommendation: Option C saves $4,680/month vs. Option A with minimal quality loss. The re-ranker filters out irrelevant chunks (e.g., the One Piece result retrieved when the user asked about dark fantasy), which can actually improve faithfulness by reducing noise in the context window.
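The three options reduce to one formula with an optional fixed per-call cost. A sketch reproducing the table:

```python
# Cost per conversation for each context strategy; prices are $/M tokens,
# `extra` covers fixed per-call costs such as the re-ranker inference.
def option_cost(in_tok: int, out_tok: int, in_price: float, out_price: float, extra: float = 0.0) -> float:
    return extra + (in_tok * in_price + out_tok * out_price) / 1_000_000

options = {
    "A: 5 chunks -> Sonnet": option_cost(2_300, 400, 3.00, 15.00),
    "B: 5 chunks -> Haiku": option_cost(2_300, 400, 0.25, 1.25),
    "C: re-ranker + 2 chunks -> Sonnet": option_cost(1_400, 400, 3.00, 15.00, extra=0.0001),
}
for name, c in options.items():
    print(f"{name}: ${c:.5f}/conv, ${c * 60_000 * 30:,.0f}/month")
```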
A6. Intent-Based Token Budget Table
| Intent | Max Input Tokens | Max Output Tokens | Model | Target $/Conv | Rationale |
|---|---|---|---|---|---|
| recommendation | 2,500 | 500 | Sonnet | $0.0150 | Large RAG context (5 products), detailed multi-product responses |
| product_question | 2,000 | 400 | Sonnet | $0.0120 | Product context + comparison needs |
| product_discovery | 2,200 | 450 | Sonnet | $0.0135 | Iterative discovery needs broad context |
| faq | 1,000 | 250 | Haiku | $0.0006 | Concise answers, limited context needed |
| order_tracking | 1,200 | 300 | Haiku | $0.0007 | Order data is structured, responses are template-like |
| return_request | 1,200 | 350 | Haiku | $0.0007 | Return policy is finite, but edge cases need detail |
| promotion | 800 | 200 | Haiku | $0.0005 | Promotions are pre-defined, short responses |
| checkout_help | 1,500 | 350 | Sonnet | $0.0098 | Step-by-step guidance needs clarity |
| chitchat | 600 | 150 | Haiku | $0.0003 | Casual, short exchanges |
| escalation | 1,500 | 500 | Haiku | $0.0010 | Needs higher output to generate detailed agent summary |
Why escalation gets a higher output budget than order_tracking:
- order_tracking output is a structured status update: "Your order #12345 shipped on March 15 and is expected to arrive by March 20. Tracking number: XYZ." This rarely exceeds 100 tokens.
- escalation output is a conversation summary for the human agent. It must include:
  1. A summary of the customer's issue
  2. What solutions the chatbot attempted
  3. The customer's emotional state / sentiment
  4. Relevant order/product details
  5. Recommended actions for the agent

  Such a summary can reach 300–500 tokens to be useful; generating a truncated summary wastes the human agent's time.
Hard
A7. Pareto-Optimal Configuration Search
Search space definition:
For each of MangaAssist's 10 intents, the configurable parameters are:
| Parameter | # of Options | Values |
|---|---|---|
| Model | 3 | Sonnet, Haiku, Fine-tuned Haiku |
| max_tokens (output) | 4 | 150, 250, 400, 500 |
| temperature | 3 | 0.1, 0.3, 0.7 |
| RAG context chunks | 3 | 0, 2, 5 (if applicable) |
Per intent: 3 × 4 × 3 × 3 = 108 configurations.
Across 10 intents: $108^{10} \approx 2 \times 10^{20}$ combinations — brute force is impossible.
Optimization approach — multi-objective search with NSGA-II:

1. Objective functions, where $\mathbf{x}$ is the configuration vector across all 10 intents:
   - $f_1(\mathbf{x})$ = total monthly LLM cost (minimize)
   - $f_2(\mathbf{x})$ = weighted average quality score (maximize)
2. Constraints:
   - Per-intent quality ≥ minimum threshold (hard constraint)
   - P99 latency ≤ 5,000 ms per intent (hard constraint)
3. Algorithm — NSGA-II (Non-dominated Sorting Genetic Algorithm II):
   - Population size: 100 configurations
   - Evaluate each configuration via offline evaluation (Bedrock evaluation jobs on golden datasets)
   - Evolve over 50 generations with crossover and mutation on intent-level parameters
   - The Pareto front emerges naturally — the configurations not dominated by any other
4. Efficient evaluation:
   - Don't run production traffic; use the golden evaluation datasets (200 samples per intent)
   - Cost is calculated analytically from token counts and model pricing
   - Quality is measured via BERTScore plus judge-model scoring
   - Latency is estimated from historical distributions per model
5. Output — a Pareto front with ~15–20 non-dominated configurations, for example:
Config A: $35K/month, quality=0.92 (all Sonnet, high tokens)
Config B: $28K/month, quality=0.91 (Sonnet for top 3, Haiku for rest)
Config C: $18K/month, quality=0.87 (Sonnet for recommendation only)
Config D: $12K/month, quality=0.83 (all Haiku)
The business chooses from the Pareto front based on budget constraints and quality appetite.
A configuration is Pareto-dominated if some other configuration is at least as good on both objectives and strictly better on one — cheaper at equal-or-higher quality, or higher quality at equal-or-lower cost. For example, if Config X costs $25K with quality 0.88 and Config Y costs $22K with quality 0.90, then X is dominated by Y and should never be chosen.
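A minimal sketch of the non-dominated filter over (cost, quality) pairs; the candidate names and figures mirror the example configurations above:

```python
def pareto_front(configs: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """Keep (name, monthly_cost, quality) tuples not dominated by any other config."""
    front = []
    for name, cost, quality in configs:
        dominated = any(
            c2 <= cost and q2 >= quality and (c2 < cost or q2 > quality)
            for _, c2, q2 in configs
        )
        if not dominated:
            front.append((name, cost, quality))
    return front

candidates = [
    ("A: all Sonnet", 35_000, 0.92),
    ("B: Sonnet top 3", 28_000, 0.91),
    ("C: Sonnet rec only", 18_000, 0.87),
    ("D: all Haiku", 12_000, 0.83),
    ("X: dominated", 25_000, 0.88),
    ("Y: dominates X", 22_000, 0.90),
]
print(pareto_front(candidates))  # A, B, C, D, Y remain; X drops out
```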
A8. Cost-Aware Intent Sub-Routing for Product Question
Complexity classifier design:
Features for routing product_question to Sonnet vs Haiku:
| Feature | Type | Example |
|---|---|---|
| Query token count | Numeric | Short queries (< 20 tokens) → likely simple |
| Number of entities mentioned | Numeric | Multi-product comparison → complex |
| Presence of comparison words | Binary | "compare", "difference between", "versus" → complex |
| Coreference depth | Numeric | "What about the second one?" → needs context → complex |
| Product attribute type | Categorical | Factual (page count, price) → simple; Subjective (quality, art style) → complex |
| Conversation turn number | Numeric | Turn 1 → usually simple; Turn 4+ → often complex |
Training pipeline (SageMaker):
1. Data: label 10K historical product_question conversations as "simple" or "complex" based on:
   - Did Haiku produce an acceptable response? (run both models and compare)
   - Simple = Haiku BERTScore ≥ 0.85 against the reference; complex = Haiku BERTScore < 0.85
2. Model: XGBoost classifier on SageMaker (fast inference, < 10 ms latency)
3. Deploy: SageMaker real-time endpoint, invoked by the ECS Fargate orchestrator before model selection (see the routing sketch below)
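A minimal sketch of the orchestrator-side routing call under these assumptions — the endpoint name `complexity-router-prod`, the `p_complex` response field, and the 0.5 threshold are illustrative, not part of the design above:

```python
import json

import boto3

sagemaker_rt = boto3.client("sagemaker-runtime")

def select_model_for_product_question(features: dict) -> str:
    """Route to Haiku when the complexity classifier predicts 'simple'."""
    response = sagemaker_rt.invoke_endpoint(
        EndpointName="complexity-router-prod",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps(features),  # token count, entity count, comparison flags, ...
    )
    p_complex = float(json.loads(response["Body"].read())["p_complex"])
    if p_complex >= 0.5:  # threshold tuned via the weekly shadow evaluation
        return "anthropic.claude-3-5-sonnet-20241022-v2:0"
    return "anthropic.claude-3-haiku-20240307-v1:0"
```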
Expected cost savings:
| Metric | Current (All Sonnet) | With Sub-Routing |
|---|---|---|
| Daily volume | 50K | 50K |
| Sonnet calls | 50K (100%) | 35K (70% complex) |
| Haiku calls | 0 | 15K (30% simple) |
| Daily LLM cost | $435 | $305 + $8 = $313 |
| Monthly savings | — | $3,660/month |
Additional SageMaker cost for classifier: ~$100/month (ml.c5.large endpoint) → net savings: $3,560/month
Quality guardrail: Run a shadow evaluation weekly — invoke both Sonnet and Haiku for all queries, compare the complexity classifier's routing decision against the actual quality gap. If the classifier's "simple" routing is wrong > 5% of the time, retrain.
A9. Cost Attribution Dashboard Design
Data pipeline architecture:
Bedrock API Response
│ (includes token counts in response metadata)
▼
┌─────────────────────┐
│ ECS Fargate │ Extract: input_tokens, output_tokens,
│ (Orchestrator) │ model_id, intent, session_id,
│ │ cache_hit, timestamp
└──────────┬──────────┘
│ Structured JSON log
▼
┌─────────────────────┐
│ CloudWatch Logs │ Log group: /mangaassist/llm-usage
│ │ Retention: 90 days
└──────────┬──────────┘
│ Subscription filter
▼
┌─────────────────────┐
│ Kinesis Data │ Buffer: 60s / 5MB batches
│ Firehose │
└──────────┬──────────┘
│
┌─────┴──────┐
▼ ▼
┌─────────┐ ┌─────────────┐
│ S3 │ │ DynamoDB │
│ (Raw) │ │ (Hourly │
│ Parquet │ │ Aggregates)│
└────┬────┘ └──────┬──────┘
│ │
▼ ▼
┌─────────────────────────────┐
│ QuickSight Dashboard │
│ │
│ ┌─────────────────────────┐ │
│ │ Cost by Intent (Pie) │ │
│ │ Cost by Model (Stacked) │ │
│ │ Input vs Output Tokens │ │
│ │ Cache Savings (Line) │ │
│ │ Hourly Trend (Heatmap) │ │
│ └─────────────────────────┘ │
└─────────────────────────────┘
Per-request log entry (emitted by ECS Fargate):
```json
{
  "timestamp": "2025-03-15T14:23:01Z",
  "request_id": "req-abc123",
  "session_id": "sess-xyz789",
  "intent": "recommendation",
  "model_id": "anthropic.claude-3-5-sonnet-20241022-v2:0",
  "input_tokens": 1247,
  "output_tokens": 412,
  "cache_hit": false,
  "latency_ms": 2340,
  "input_cost_usd": 0.003741,
  "output_cost_usd": 0.006180,
  "total_cost_usd": 0.009921,
  "region": "us-east-1",
  "hour_of_day": 14
}
```
DynamoDB hourly aggregation schema:
```
PK: INTENT#recommendation
SK: 2025-03-15T14:00:00Z
Attributes:
  model: "sonnet"
  total_conversations: 2834
  total_input_tokens: 3,541,298
  total_output_tokens: 1,168,508
  total_cost_usd: 28.16
  cache_hit_count: 412
  cache_miss_count: 2422
  avg_cost_per_conversation: 0.00993
```
This hourly granularity keeps DynamoDB costs low (~$15/month) while enabling dashboard queries down to hourly resolution. For sub-hourly analysis, query the raw Parquet files in S3 via Athena.
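For the DynamoDB leg, a sketch of the aggregation write, assuming a Lambda consumer on the Firehose stream folds each log record into its hourly item; the table name `mangaassist-cost-hourly` is illustrative:

```python
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("mangaassist-cost-hourly")  # hypothetical name

def upsert_hourly_aggregate(record: dict) -> None:
    """Fold one per-request log record (schema above) into its hourly aggregate.

    ADD creates the attributes on first write, so no init step is needed;
    avg_cost_per_conversation can be derived from the totals at read time.
    """
    hour = record["timestamp"][:13] + ":00:00Z"  # "2025-03-15T14" -> hourly SK
    table.update_item(
        Key={"PK": f"INTENT#{record['intent']}", "SK": hour},
        UpdateExpression=(
            "ADD total_conversations :one, total_input_tokens :it, "
            "total_output_tokens :ot, total_cost_usd :c, "
            "cache_hit_count :h, cache_miss_count :m"
        ),
        ExpressionAttributeValues={
            ":one": 1,
            ":it": record["input_tokens"],
            ":ot": record["output_tokens"],
            ":c": Decimal(str(record["total_cost_usd"])),
            ":h": 1 if record["cache_hit"] else 0,
            ":m": 0 if record["cache_hit"] else 1,
        },
    )
```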
Very Hard
A10. Real-Time Cost Anomaly Detection and Dynamic Routing
Architecture:
┌──────────────────────────────────────────────────────────┐
│ Real-Time Cost Control System │
├──────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ ECS Fargate │───►│ CloudWatch │───►│ Anomaly │ │
│ │ (Per-req │ │ Metrics │ │ Detection │ │
│ │ cost emit) │ │ │ │ (ML-based) │ │
│ └──────┬──────┘ └──────────────┘ └─────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ ElastiCache │◄───│ Cost Budget │◄───│ CloudWatch │ │
│ │ Redis │ │ Lambda │ │ Alarm │ │
│ │ (Running │ │ (Threshold │ │ │ │
│ │ daily cost)│ │ checks) │ │ │ │
│ └─────────────┘ └──────────────┘ └────────────┘ │
└──────────────────────────────────────────────────────────┘
Anomaly definitions:
| Anomaly Type | Detection | Threshold |
|---|---|---|
| Single conversation spike | Per-conversation cost > 10× intent average | recommendation > $0.10/conversation |
| Intent cost velocity spike | Rolling 15-min cost > 3× historical same-hour average | CloudWatch Anomaly Detection (ML band) |
| Daily budget approaching | Cumulative spend reaches % thresholds | 70% / 85% / 95% / 100% of $5,000 |
| Token count explosion | Input tokens > 10,000 on a single request | Direct check in orchestrator |
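The two per-request anomaly types can be guarded inline in the orchestrator, ahead of the budget logic below. A sketch, assuming the shared async Redis client and app-level helpers `TokenBudgetExceeded` and `emit_cloudwatch_metric` (illustrative names):

```python
MAX_INPUT_TOKENS = 10_000  # "token count explosion" threshold from the table

async def check_request_anomalies(intent: str, input_tokens: int, cost: float | None) -> None:
    # Direct check in the orchestrator, before the Bedrock call is made
    if input_tokens > MAX_INPUT_TOKENS:
        raise TokenBudgetExceeded(intent, input_tokens)
    # Per-conversation cost spike is only known after the call, once output
    # tokens are counted; compare against a rolling per-intent average in Redis
    if cost is not None:
        avg = float(await redis.get(f"cost:avg:{intent}") or 0)
        if avg and cost > 10 * avg:
            emit_cloudwatch_metric("PerConversationCostSpike", intent, cost)
```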
Real-time budget tracking (Redis):
```python
from datetime import datetime, timezone

# Executed on every LLM invocation; `redis` is the shared async client.
async def track_and_check_budget(intent: str, cost: float) -> str:
    pipe = redis.pipeline()
    # Increment daily counters (overall and per intent)
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    pipe.incrbyfloat(f"cost:daily:{today}", cost)
    pipe.incrbyfloat(f"cost:daily:{today}:intent:{intent}", cost)
    results = await pipe.execute()
    daily_total = float(results[0])

    DAILY_CAP = 5000.0
    if daily_total >= DAILY_CAP * 0.95:
        # CRITICAL: only serve high-value intents on Sonnet
        return "emergency_mode"
    elif daily_total >= DAILY_CAP * 0.85:
        # WARNING: downgrade low-priority intents
        return "cost_saving_mode"
    elif daily_total >= DAILY_CAP * 0.70:
        # CAUTION: enable aggressive caching
        return "cache_aggressive_mode"
    else:
        return "normal_mode"
```
Automated downgrade behavior:
| Mode | Sonnet Intents | Haiku Intents | Caching |
|---|---|---|---|
| Normal | recommendation, product_question, product_discovery, checkout_help | faq, order_tracking, return_request, promotion, chitchat, escalation | Standard |
| Cache Aggressive | Same | Same | Extend TTLs 2×, fuzzy match cache keys |
| Cost Saving | recommendation, checkout_help only | All others downgraded to Haiku | Aggressive + serve stale cache |
| Emergency | recommendation only | Everything else → Haiku | Max aggressive |
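A minimal sketch of applying the mode returned by track_and_check_budget to model selection; the intent sets are copied directly from the table above:

```python
# Intents that stay on Sonnet in each budget mode; everything else gets Haiku.
SONNET_BY_MODE = {
    "normal_mode": {"recommendation", "product_question", "product_discovery", "checkout_help"},
    "cache_aggressive_mode": {"recommendation", "product_question", "product_discovery", "checkout_help"},
    "cost_saving_mode": {"recommendation", "checkout_help"},
    "emergency_mode": {"recommendation"},
}

def select_model(mode: str, intent: str) -> str:
    if intent in SONNET_BY_MODE[mode]:
        return "anthropic.claude-3-5-sonnet-20241022-v2:0"
    return "anthropic.claude-3-haiku-20240307-v1:0"
```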
Preventing customer experience degradation:
- Transparent degradation: When downgraded from Sonnet to Haiku, the orchestrator adds extra few-shot examples to the Haiku prompt to compensate for quality loss
- Quality floor: Even in emergency mode, if Haiku produces a low-confidence response (below quality threshold), escalate to a human agent rather than serve a bad answer
- Budget replenishment: At midnight UTC, reset the daily budget counter. During the first hour of the new day, gradually restore normal routing (don't blast all accumulated demand through Sonnet)
A11. Cost-Aware Prompt Engineering Framework
Token efficiency metric:
$$\text{Token Efficiency} = \frac{\text{Quality Score}}{\text{Total Tokens Consumed}} \times 1000$$
This gives "quality points per 1,000 tokens." Higher is better — either by improving quality without token increase, or reducing tokens without quality loss.
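A tiny worked example (quality scores and token counts are illustrative) showing how the metric rewards token cuts that preserve quality:

```python
def token_efficiency(quality: float, total_tokens: int) -> float:
    """Quality points per 1,000 tokens consumed — higher is better."""
    return quality / total_tokens * 1000

# A compressed prompt trading 0.01 quality for 450 fewer tokens wins:
print(token_efficiency(0.90, 1_650))  # baseline:   ~0.55
print(token_efficiency(0.89, 1_200))  # compressed: ~0.74
```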
Systematic prompt optimization process:
Phase 1 — Baseline measurement. For each intent, measure the current prompt's:
- Total input tokens (including the system prompt)
- Average output tokens
- Quality score (BERTScore + judge model average)
- Token efficiency ratio
Phase 2 — Optimization techniques:
| Technique | Description | Expected Token Reduction | Quality Risk |
|---|---|---|---|
| System prompt compression | Rewrite verbose instructions into concise directives | 15–30% of system prompt tokens | Low — if meaning preserved |
| Example pruning | Reduce few-shot examples from 5 to 2 (most representative) | 40–60% of example tokens | Medium — edge case handling may degrade |
| Output format constraints | Add "respond in ≤ 3 sentences" or structured JSON schema | 20–40% of output tokens | Low–Medium — may lose nuance |
| Context window pruning | Remove conversation turns older than 3 turns | 30–50% of history tokens | Medium — multi-turn coherence may suffer |
| Instruction consolidation | Merge redundant or overlapping instructions | 10–20% of system prompt | Low |
Phase 3 — A/B test evaluation:
For each prompt variant:
```python
# Evaluation config for the prompt optimization A/B test
test_config = {
    "test_id": "prompt-opt-rec-v2",
    "intent": "recommendation",
    "control_prompt": current_production_prompt,
    "treatment_prompt": optimized_prompt,
    "metrics": {
        "primary": "token_efficiency",
        "guardrails": ["bertscore >= 0.88", "manga_domain_accuracy >= 4.0"],
        "cost": "cost_per_conversation",
    },
    "min_duration_days": 7,
    "traffic_split": 0.10,  # 10% of recommendation traffic
}
```
Phase 4 — Edge case regression detection:
Prompt compression can introduce subtle failures. To catch these:
1. Adversarial evaluation dataset: 50 deliberately tricky queries that test edge cases:
   - Manga with similar names (Naruto vs. Boruto: Naruto Next Generations)
   - Overlapping genres (dark fantasy vs. horror)
   - Out-of-catalog requests
   - Multi-language queries
2. A/B test duration extension: even if the main metrics show success, hold the test for 14 days to accumulate enough edge-case exposure.
3. Automated regression scanner: compare the optimized prompt's responses against the original prompt's for all 200 golden dataset samples; flag any response where quality drops by more than 10% and review those manually before promoting (a sketch follows).
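A sketch of that scanner, assuming `generate_with` and `score` wrap the existing generation and BERTScore/judge pipelines (illustrative names):

```python
# Flag golden-set samples where the optimized prompt loses > 10% quality.
def regression_scan(golden_samples, generate_with, score, original_prompt, optimized_prompt):
    flagged = []
    for sample in golden_samples:  # the 200 golden dataset samples per intent
        q_orig = score(generate_with(original_prompt, sample.query), sample.reference)
        q_new = score(generate_with(optimized_prompt, sample.query), sample.reference)
        if q_orig > 0 and (q_orig - q_new) / q_orig > 0.10:
            flagged.append((sample.id, round(q_orig, 3), round(q_new, 3)))
    return flagged  # review manually before promoting the optimized prompt
```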
A12. Cost-Performance Simulation Framework
Architecture — SageMaker Pipeline:
┌─────────────────────────────────────────────────────────────────┐
│ MangaAssist Cost Simulation Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Data Ingestion (SageMaker Processing) │
│ ├── Historical token distributions per intent (from S3) │
│ ├── Current model assignments and pricing │
│ ├── Cache hit rates per intent (from CloudWatch) │
│ ├── Traffic volume by hour/day (from DynamoDB aggregates) │
│ └── Quality scores per model-intent pair (from eval results) │
│ │
│ Step 2: Scenario Generation (SageMaker Processing) │
│ ├── Parse input scenario parameters │
│ ├── Generate Monte Carlo simulation inputs │
│ └── Create scenario matrix │
│ │
│ Step 3: Monte Carlo Simulation (SageMaker Training job) │
│ ├── 10,000 simulation runs per scenario │
│ ├── Sample from token distribution (not just mean) │
│ ├── Model traffic growth with Poisson variance │
│ ├── Apply cache hit rate changes when model changes │
│ └── Include infrastructure costs (OpenSearch, ECS, DDB, Redis) │
│ │
│ Step 4: Analysis & Output (SageMaker Processing) │
│ ├── Aggregate by scenario: mean, P5, P50, P95 │
│ ├── Generate confidence intervals │
│ └── Write results to S3 + DynamoDB │
│ │
│ Step 5: Visualization (Lambda → QuickSight) │
│ └── Update dashboard with new projections │
└─────────────────────────────────────────────────────────────────┘
Input parameters format:
```json
{
  "scenario_name": "Q2 2025 Projection",
  "base_date": "2025-03-01",
  "projection_months": 6,
  "assumptions": {
    "traffic_growth": {
      "type": "linear",
      "monthly_rate": 0.10
    },
    "price_changes": [
      {
        "model": "claude-3-5-sonnet",
        "effective_date": "2025-05-01",
        "new_input_price_per_m": 2.50,
        "new_output_price_per_m": 12.00
      }
    ],
    "new_models": [
      {
        "model": "claude-4-haiku",
        "available_date": "2025-06-01",
        "input_price_per_m": 0.40,
        "output_price_per_m": 2.00,
        "estimated_quality_vs_sonnet": 0.92
      }
    ],
    "architecture_changes": [
      {
        "type": "prompt_cache",
        "effective_date": "2025-04-01",
        "cache_eligible_intents": ["faq", "promotion", "chitchat"],
        "expected_token_reduction": 0.30
      }
    ],
    "model_reassignments": [
      {
        "intent": "product_question",
        "from_model": "sonnet",
        "to_model": "haiku",
        "effective_date": "2025-04-15"
      }
    ]
  }
}
```
Monte Carlo simulation core:
```python
from __future__ import annotations

import numpy as np

def simulate_monthly_cost(
    intent_configs: dict,
    traffic_model: TrafficModel,
    token_distributions: dict[str, Distribution],
    cache_model: CacheModel,
    n_simulations: int = 10_000,
) -> SimulationResult:
    # TrafficModel, Distribution, CacheModel, SimulationResult, and
    # sample_infrastructure_costs are pipeline types/helpers defined elsewhere.
    monthly_costs = []
    for _ in range(n_simulations):
        total_cost = 0.0
        for intent, config in intent_configs.items():
            # Sample daily traffic from a Poisson distribution
            daily_traffic = np.random.poisson(traffic_model.mean_daily[intent])
            monthly_traffic = daily_traffic * 30
            # Sample token counts from fitted distributions (not just means!)
            input_tokens = token_distributions[intent].input.rvs(monthly_traffic)
            output_tokens = token_distributions[intent].output.rvs(monthly_traffic)
            # Apply cache — cached requests have zero LLM cost
            cache_rate = cache_model.get_rate(intent, config.model)
            llm_requests = int(monthly_traffic * (1 - cache_rate))
            # Calculate LLM cost
            input_cost = np.sum(input_tokens[:llm_requests]) * config.model.input_price / 1e6
            output_cost = np.sum(output_tokens[:llm_requests]) * config.model.output_price / 1e6
            total_cost += input_cost + output_cost
        # Add infrastructure costs (relatively stable; sampled with ±10% noise)
        infra_cost = sample_infrastructure_costs()
        total_cost += infra_cost
        monthly_costs.append(total_cost)

    return SimulationResult(
        mean=np.mean(monthly_costs),
        p5=np.percentile(monthly_costs, 5),
        p50=np.percentile(monthly_costs, 50),
        p95=np.percentile(monthly_costs, 95),
        confidence_interval_95=(
            np.percentile(monthly_costs, 2.5),
            np.percentile(monthly_costs, 97.5),
        ),
    )
```
Output format:
┌───────────────────────────────────────────────────────────────┐
│ Scenario: Q2 2025 Projection (10% monthly growth) │
├───────────┬──────────┬──────────┬──────────┬────────────────┤
│ Month │ P5 │ P50 │ P95 │ 95% CI │
├───────────┼──────────┼──────────┼──────────┼────────────────┤
│ Apr 2025 │ $52,100 │ $58,400 │ $66,200 │ [$50,800-67,900] │
│ May 2025 │ $54,300 │ $61,800 │ $71,500 │ [$52,900-73,200] │
│ Jun 2025 │ $49,100 │ $55,200 │ $63,800 │ [$47,600-65,400] │ ← price drop
│ Jul 2025 │ $52,800 │ $59,900 │ $69,100 │ [$51,200-71,000] │
│ Aug 2025 │ $56,400 │ $64,100 │ $74,200 │ [$54,600-76,100] │
│ Sep 2025 │ $48,200 │ $54,800 │ $63,500 │ [$46,700-65,200] │ ← Claude 4 Haiku
└───────────┴──────────┴──────────┴──────────┴────────────────┘
Key insights:
- Bedrock price reduction in May saves ~$5K/month
- Claude 4 Haiku migration in Sep saves ~$9K/month
- Monthly budget cap of $400K is achievable by Aug with all optimizations