03 — Cost–Performance & Token Efficiency — Answers
Easy
A1. Per-Conversation LLM Cost Calculation
Claude 3.5 Sonnet pricing:
- Input: $3.00 per million tokens
- Output: $15.00 per million tokens

Claude 3 Haiku pricing:
- Input: $0.25 per million tokens
- Output: $1.25 per million tokens
Recommendation intent — Sonnet:
$$\text{Input cost} = 800 \times \frac{\$3.00}{1{,}000{,}000} = \$0.0024$$
$$\text{Output cost} = 400 \times \frac{\$15.00}{1{,}000{,}000} = \$0.0060$$
$$\text{Total per conversation} = \$0.0024 + \$0.0060 = \$0.0084$$
Recommendation intent — Haiku:
$$\text{Input cost} = 800 \times \frac{\$0.25}{1{,}000{,}000} = \$0.0002$$
$$\text{Output cost} = 400 \times \frac{\$1.25}{1{,}000{,}000} = \$0.0005$$
$$\text{Total per conversation} = \$0.0002 + \$0.0005 = \$0.0007$$
Cost ratio:
$$\frac{\text{Sonnet}}{\text{Haiku}} = \frac{\$0.0084}{\$0.0007} = 12\times$$
Daily cost at 40K conversations/day:
- Sonnet: 40,000 × $0.0084 = $336/day ($10,080/month)
- Haiku: 40,000 × $0.0007 = $28/day ($840/month)
- Daily savings from switching to Haiku: $308/day ($9,240/month)
However, these savings are only realizable if Haiku meets the quality threshold for recommendation. Based on evaluation data, Haiku scores below the 0.88 BERTScore threshold for this intent, so the savings cannot be captured without compromising quality.
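These unit costs recur throughout this section, so they are worth encoding once. A minimal sketch, with prices hard-coded from the figures above:

```python
# Per-conversation LLM cost; prices are $ per million tokens (from A1).
PRICING = {
    "sonnet": {"input": 3.00, "output": 15.00},
    "haiku": {"input": 0.25, "output": 1.25},
}

def cost_per_conversation(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

assert round(cost_per_conversation("sonnet", 800, 400), 4) == 0.0084  # matches above
assert round(cost_per_conversation("haiku", 800, 400), 4) == 0.0007
```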
A2. Token Budgets for MangaAssist
Definition: A token budget is a hard or soft limit on the number of input and output tokens consumed per LLM invocation for a specific intent. It controls costs, prevents runaway token consumption, and ensures predictable latency.
FAQ intent token budget:
| Component | Token Budget | Rationale |
|---|---|---|
| System prompt | 200 tokens | Fixed — FAQ instructions and persona |
| Conversation history | 300 tokens (max 2 prior turns) | FAQ rarely needs deep history |
| User query | 100 tokens | FAQ questions are typically concise |
| RAG context | 400 tokens (2 chunks max) | FAQ answers are usually in a single knowledge base article |
| Total input budget | 1,000 tokens | |
| Output budget (max_tokens) | 250 tokens | FAQ answers should be concise and direct |
Enforcement on ECS Fargate:
```python
# In the orchestrator service running on ECS Fargate.
# count_tokens, truncate_to_budget, and get_model_for_intent are orchestrator
# helpers defined elsewhere; bedrock is the Bedrock runtime client.
INTENT_TOKEN_BUDGETS = {
    "faq": {"max_input": 1000, "max_output": 250},
    "recommendation": {"max_input": 2500, "max_output": 500},
    "chitchat": {"max_input": 600, "max_output": 150},
    # ... other intents
}

async def invoke_llm(intent: str, prompt: str, context: str) -> str:
    budget = INTENT_TOKEN_BUDGETS[intent]
    # Token counting using tiktoken (approximate for Claude)
    input_tokens = count_tokens(prompt + context)
    if input_tokens > budget["max_input"]:
        # Strategy: truncate conversation history first, then RAG context
        prompt = truncate_to_budget(prompt, context, budget["max_input"])
    response = await bedrock.invoke_model(
        modelId=get_model_for_intent(intent),
        body={
            "prompt": prompt,
            "max_tokens": budget["max_output"],  # Hard cap enforced by Bedrock
            "stop_sequences": ["\n\nHuman:"],
        },
    )
    return response
```
The max_tokens parameter in the Bedrock API call provides the hard enforcement; input truncation is application-level and is applied before the API call.
A3. Effective Cost with Caching
Given:
- faq: 50K conversations/day, 500 input / 200 output tokens, Haiku, 45% cache hit rate
- order_tracking: 30K conversations/day, 600 input / 300 output tokens, Haiku, 30% cache hit rate
FAQ cost calculation:
Per-conversation Haiku cost:
$$\text{Input} = 500 \times \frac{\$0.25}{1M} = \$0.000125$$
$$\text{Output} = 200 \times \frac{\$1.25}{1M} = \$0.000250$$
$$\text{Total} = \$0.000375$$
With a 45% cache hit rate (cache hits cost ~$0 for the LLM — only the Redis lookup):
- LLM invocations: 50,000 × 0.55 = 27,500
- Cache hits: 50,000 × 0.45 = 22,500 (near-zero LLM cost)
$$\text{FAQ daily cost} = 27{,}500 \times \$0.000375 = \$10.31$$
Without caching: 50,000 × $0.000375 = $18.75 → 45% savings from caching
Order tracking cost calculation:
Per-conversation Haiku cost:
$$\text{Input} = 600 \times \frac{\$0.25}{1M} = \$0.000150$$
$$\text{Output} = 300 \times \frac{\$1.25}{1M} = \$0.000375$$
$$\text{Total} = \$0.000525$$
With a 30% cache hit rate:
- LLM invocations: 30,000 × 0.70 = 21,000
$$\text{Order tracking daily cost} = 21{,}000 \times \$0.000525 = \$11.03$$
Without caching: 30,000 × $0.000525 = $15.75 → 30% savings
Combined daily cost: $10.31 + $11.03 = $21.34/day
Without caching: $18.75 + $15.75 = $34.50/day
Total savings from caching: $13.16/day ($394.80/month)
Note: These calculations exclude ElastiCache Redis infrastructure cost (~$150/month for a cache.r6g.large instance), which is amortized across all cached intents.
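The cache-adjusted math generalizes to any intent. A short sketch reproducing the numbers above:

```python
# Effective daily LLM cost with caching — cache hits are treated as zero LLM cost.
def effective_daily_cost(daily_volume: int, cost_per_conv: float, cache_hit_rate: float) -> float:
    return daily_volume * (1 - cache_hit_rate) * cost_per_conv

faq = effective_daily_cost(50_000, 0.000375, 0.45)       # ≈ $10.31/day
tracking = effective_daily_cost(30_000, 0.000525, 0.30)  # ≈ $11.03/day
print(f"combined: ${faq + tracking:.2f}/day")            # ≈ $21.34/day
```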
Medium
A4. Complete Cost-Per-Conversation Breakdown
Assumptions:
- Sonnet: $3.00/$15.00 per M input/output tokens
- Haiku: $0.25/$1.25 per M input/output tokens
- Average tokens per conversation: 800 input, 350 output (varies by intent — detailed below)
| Intent | Model | Daily Vol. | Avg In Tok | Avg Out Tok | Cost/Conv | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|---|---|
| recommendation | Sonnet | 67K | 1,200 | 450 | $0.01035 | $693 | $20,790 |
| product_question | Sonnet | 50K | 900 | 400 | $0.00870 | $435 | $13,050 |
| faq | Haiku | 100K | 500 | 200 | $0.00038 | $38 | $1,125 |
| order_tracking | Haiku | 83K | 600 | 300 | $0.00053 | $44 | $1,313 |
| return_request | Haiku | 27K | 700 | 350 | $0.00061 | $16 | $494 |
| promotion | Haiku | 40K | 400 | 200 | $0.00035 | $14 | $420 |
| checkout_help | Sonnet | 33K | 800 | 300 | $0.00690 | $228 | $6,831 |
| chitchat | Haiku | 17K | 300 | 150 | $0.00026 | $4 | $133 |
| escalation | Haiku | 7K | 600 | 400 | $0.00065 | $5 | $137 |
| product_discovery | Sonnet | 60K | 1,100 | 400 | $0.00930 | $558 | $16,740 |
| TOTAL | — | 484K | — | — | — | $2,035 | $61,033 |
Top 3 most expensive intents:
1. recommendation — $20,790/month (34% of total spend)
   - Optimization: Implement semantic caching — similar recommendation queries share responses. At a 25% cache hit rate → saves ~$5,200/month.
   - Optimization: Context compression — reduce RAG context from 5 chunks to the top 2 using a re-ranker → cuts input tokens from 1,200 to 700 → saves ~$3,000/month.
2. product_discovery — $16,740/month (27%)
   - Optimization: Complexity-based routing — route simple discovery queries ("popular manga") to Haiku, complex ones to Sonnet. At a 40% simple ratio → saves ~$6,030/month.
   - Optimization: Progressive disclosure — start with Haiku for initial broad results, escalate to Sonnet only if the user refines the search.
3. product_question — $13,050/month (21%)
   - Optimization: Fine-tune Haiku on product_question data → if quality meets the threshold, migrate fully → saves ~$12,000/month.
   - Optimization: Structured data extraction — for questions about specific product attributes (price, page count, dimensions), use a direct DynamoDB lookup instead of the LLM → eliminates ~30% of LLM calls.
Projected savings from all optimizations: ~$26,000/month (43% reduction)
A5. RAG Context Cost Comparison for Product Discovery
Scenario details:
- 5 retrieved chunks × 300 tokens each = 1,500 tokens of RAG context
- Base input tokens (system prompt + user query): 800
- Output tokens: 400
Option A — All 5 chunks to Sonnet:
$$\text{Input} = (800 + 1{,}500) \times \frac{\$3.00}{1M} = 2{,}300 \times \$0.000003 = \$0.00690$$
$$\text{Output} = 400 \times \frac{\$15.00}{1M} = \$0.00600$$
$$\text{Total} = \$0.01290 \text{ per conversation}$$
Option B — All 5 chunks to Haiku:
$$\text{Input} = 2{,}300 \times \frac{\$0.25}{1M} = \$0.000575$$
$$\text{Output} = 400 \times \frac{\$1.25}{1M} = \$0.000500$$
$$\text{Total} = \$0.001075 \text{ per conversation}$$
Option C — Re-ranker selects top 2 chunks, send to Sonnet:
Re-ranker cost: $0.0001 per inference
Input tokens with 2 chunks: 800 + (2 × 300) = 1,400
$$\text{Re-ranker} = \$0.0001$$
$$\text{Input} = 1{,}400 \times \frac{\$3.00}{1M} = \$0.00420$$
$$\text{Output} = 400 \times \frac{\$15.00}{1M} = \$0.00600$$
$$\text{Total} = \$0.0001 + \$0.00420 + \$0.00600 = \$0.01030 \text{ per conversation}$$
Comparison at 60K conversations/day:
| Option | Cost/Conv | Daily Cost | Monthly Cost | Quality |
|---|---|---|---|---|
| A: 5 chunks → Sonnet | $0.01290 | $774 | $23,220 | Highest — most context |
| B: 5 chunks → Haiku | $0.00108 | $65 | $1,935 | Lowest — Haiku may miss nuance |
| C: 2 chunks → Sonnet + re-ranker | $0.01030 | $618 | $18,540 | High — top 2 chunks are usually sufficient |
Recommendation: Option C saves $4,680/month vs. Option A with minimal quality loss. The re-ranker filters out irrelevant chunks (e.g., the One Piece result retrieved when the user asked about dark fantasy), which can actually improve faithfulness by reducing noise in the context window.
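The three options reduce to one formula with an optional fixed per-call cost. A sketch reproducing the table:

```python
# Cost per conversation for each context strategy; prices are $/M tokens,
# `extra` covers fixed per-call costs such as the re-ranker inference.
def option_cost(in_tok: int, out_tok: int, in_price: float, out_price: float, extra: float = 0.0) -> float:
    return extra + (in_tok * in_price + out_tok * out_price) / 1_000_000

options = {
    "A: 5 chunks -> Sonnet": option_cost(2_300, 400, 3.00, 15.00),
    "B: 5 chunks -> Haiku": option_cost(2_300, 400, 0.25, 1.25),
    "C: re-ranker + 2 chunks -> Sonnet": option_cost(1_400, 400, 3.00, 15.00, extra=0.0001),
}
for name, c in options.items():
    print(f"{name}: ${c:.5f}/conv, ${c * 60_000 * 30:,.0f}/month")
```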
A6. Intent-Based Token Budget Table
| Intent | Max Input Tokens | Max Output Tokens | Model | Target $/Conv | Rationale |
|---|---|---|---|---|---|
| recommendation | 2,500 | 500 | Sonnet | $0.0150 | Large RAG context (5 products), detailed multi-product responses |
| product_question | 2,000 | 400 | Sonnet | $0.0120 | Product context + comparison needs |
| product_discovery | 2,200 | 450 | Sonnet | $0.0135 | Iterative discovery needs broad context |
| faq | 1,000 | 250 | Haiku | $0.0006 | Concise answers, limited context needed |
| order_tracking | 1,200 | 300 | Haiku | $0.0007 | Order data is structured, responses are template-like |
| return_request | 1,200 | 350 | Haiku | $0.0007 | Return policy is finite, but edge cases need detail |
| promotion | 800 | 200 | Haiku | $0.0005 | Promotions are pre-defined, short responses |
| checkout_help | 1,500 | 350 | Sonnet | $0.0098 | Step-by-step guidance needs clarity |
| chitchat | 600 | 150 | Haiku | $0.0003 | Casual, short exchanges |
| escalation | 1,500 | 500 | Haiku | $0.0010 | Needs higher output to generate detailed agent summary |
Why escalation gets a higher output budget than order_tracking:
- order_tracking output is a structured status update: "Your order #12345 shipped on March 15 and is expected to arrive by March 20. Tracking number: XYZ." This rarely exceeds 100 tokens.
- escalation output is a conversation summary for the human agent. It must include:
  1. A summary of the customer's issue
  2. What solutions the chatbot attempted
  3. The customer's emotional state / sentiment
  4. Relevant order/product details
  5. Recommended actions for the agent

  Such a summary can reach 300–500 tokens to be useful; generating a truncated summary wastes the human agent's time.
Hard
A7. Pareto-Optimal Configuration Search
Search space definition:
For each of MangaAssist's 10 intents, the configurable parameters are:
| Parameter | # of Options | Values |
|---|---|---|
| Model | 3 | Sonnet, Haiku, Fine-tuned Haiku |
| max_tokens (output) | 4 | 150, 250, 400, 500 |
| temperature | 3 | 0.1, 0.3, 0.7 |
| RAG context chunks | 3 | 0, 2, 5 (if applicable) |
Per intent: 3 × 4 × 3 × 3 = 108 configurations.
Across 10 intents: $108^{10} \approx 2 \times 10^{20}$ combinations — brute force is impossible.
Optimization approach — multi-objective search with NSGA-II:

1. Objective functions, where $\mathbf{x}$ is the configuration vector across all 10 intents:
   - $f_1(\mathbf{x})$ = total monthly LLM cost (minimize)
   - $f_2(\mathbf{x})$ = weighted average quality score (maximize)
2. Constraints:
   - Per-intent quality ≥ minimum threshold (hard constraint)
   - P99 latency ≤ 5,000 ms per intent (hard constraint)
3. Algorithm — NSGA-II (Non-dominated Sorting Genetic Algorithm II):
   - Population size: 100 configurations
   - Evaluate each configuration via offline evaluation (Bedrock evaluation jobs on golden datasets)
   - Evolve over 50 generations with crossover and mutation on intent-level parameters
   - The Pareto front emerges naturally — the configurations not dominated by any other
4. Efficient evaluation:
   - Don't run production traffic; use the golden evaluation datasets (200 samples per intent)
   - Cost is calculated analytically from token counts and model pricing
   - Quality is measured via BERTScore plus judge-model scoring
   - Latency is estimated from historical distributions per model
5. Output — a Pareto front with ~15–20 non-dominated configurations, for example:
Config A: $35K/month, quality=0.92 (all Sonnet, high tokens)
Config B: $28K/month, quality=0.91 (Sonnet for top 3, Haiku for rest)
Config C: $18K/month, quality=0.87 (Sonnet for recommendation only)
Config D: $12K/month, quality=0.83 (all Haiku)
The business chooses from the Pareto front based on budget constraints and quality appetite.
A configuration is Pareto-dominated if some other configuration is at least as good on both objectives and strictly better on one — cheaper at equal-or-higher quality, or higher quality at equal-or-lower cost. For example, if Config X costs $25K with quality 0.88 and Config Y costs $22K with quality 0.90, then X is dominated by Y and should never be chosen.
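A minimal sketch of the non-dominated filter over (cost, quality) pairs; the candidate names and figures mirror the example configurations above:

```python
def pareto_front(configs: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """Keep (name, monthly_cost, quality) tuples not dominated by any other config."""
    front = []
    for name, cost, quality in configs:
        dominated = any(
            c2 <= cost and q2 >= quality and (c2 < cost or q2 > quality)
            for _, c2, q2 in configs
        )
        if not dominated:
            front.append((name, cost, quality))
    return front

candidates = [
    ("A: all Sonnet", 35_000, 0.92),
    ("B: Sonnet top 3", 28_000, 0.91),
    ("C: Sonnet rec only", 18_000, 0.87),
    ("D: all Haiku", 12_000, 0.83),
    ("X: dominated", 25_000, 0.88),
    ("Y: dominates X", 22_000, 0.90),
]
print(pareto_front(candidates))  # A, B, C, D, Y remain; X drops out
```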
A8. Cost-Aware Intent Sub-Routing for Product Question
Complexity classifier design:
Features for routing product_question to Sonnet vs Haiku:
| Feature | Type | Example |
|---|---|---|
| Query token count | Numeric | Short queries (< 20 tokens) → likely simple |
| Number of entities mentioned | Numeric | Multi-product comparison → complex |
| Presence of comparison words | Binary | "compare", "difference between", "versus" → complex |
| Coreference depth | Numeric | "What about the second one?" → needs context → complex |
| Product attribute type | Categorical | Factual (page count, price) → simple; Subjective (quality, art style) → complex |
| Conversation turn number | Numeric | Turn 1 → usually simple; Turn 4+ → often complex |
Training pipeline (SageMaker):
1. Data: label 10K historical product_question conversations as "simple" or "complex" based on:
   - Did Haiku produce an acceptable response? (run both models and compare)
   - Simple = Haiku BERTScore ≥ 0.85 against the reference; complex = Haiku BERTScore < 0.85
2. Model: XGBoost classifier on SageMaker (fast inference, < 10 ms latency)
3. Deploy: SageMaker real-time endpoint, invoked by the ECS Fargate orchestrator before model selection (see the routing sketch below)
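A minimal sketch of the orchestrator-side routing call under these assumptions — the endpoint name `complexity-router-prod`, the `p_complex` response field, and the 0.5 threshold are illustrative, not part of the design above:

```python
import json

import boto3

sagemaker_rt = boto3.client("sagemaker-runtime")

def select_model_for_product_question(features: dict) -> str:
    """Route to Haiku when the complexity classifier predicts 'simple'."""
    response = sagemaker_rt.invoke_endpoint(
        EndpointName="complexity-router-prod",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps(features),  # token count, entity count, comparison flags, ...
    )
    p_complex = float(json.loads(response["Body"].read())["p_complex"])
    if p_complex >= 0.5:  # threshold tuned via the weekly shadow evaluation
        return "anthropic.claude-3-5-sonnet-20241022-v2:0"
    return "anthropic.claude-3-haiku-20240307-v1:0"
```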
Expected cost savings:
| Metric | Current (All Sonnet) | With Sub-Routing |
|---|---|---|
| Daily volume | 50K | 50K |
| Sonnet calls | 50K (100%) | 35K (70% complex) |
| Haiku calls | 0 | 15K (30% simple) |
| Daily LLM cost | $435 | $305 + $8 = $313 |
| Monthly savings | — | $3,660/month |
Additional SageMaker cost for classifier: ~$100/month (ml.c5.large endpoint) → net savings: $3,560/month
Quality guardrail: Run a shadow evaluation weekly — invoke both Sonnet and Haiku for all queries, compare the complexity classifier's routing decision against the actual quality gap. If the classifier's "simple" routing is wrong > 5% of the time, retrain.
A9. Cost Attribution Dashboard Design
Data pipeline architecture:
Bedrock API Response
│ (includes token counts in response metadata)
▼
┌─────────────────────┐
│ ECS Fargate │ Extract: input_tokens, output_tokens,
│ (Orchestrator) │ model_id, intent, session_id,
│ │ cache_hit, timestamp
└──────────┬──────────┘
│ Structured JSON log
▼
┌─────────────────────┐
│ CloudWatch Logs │ Log group: /mangaassist/llm-usage
│ │ Retention: 90 days
└──────────┬──────────┘
│ Subscription filter
▼
┌─────────────────────┐
│ Kinesis Data │ Buffer: 60s / 5MB batches
│ Firehose │
└──────────┬──────────┘
│
┌─────┴──────┐
▼ ▼
┌─────────┐ ┌─────────────┐
│ S3 │ │ DynamoDB │
│ (Raw) │ │ (Hourly │
│ Parquet │ │ Aggregates)│
└────┬────┘ └──────┬──────┘
│ │
▼ ▼
┌─────────────────────────────┐
│ QuickSight Dashboard │
│ │
│ ┌─────────────────────────┐ │
│ │ Cost by Intent (Pie) │ │
│ │ Cost by Model (Stacked) │ │
│ │ Input vs Output Tokens │ │
│ │ Cache Savings (Line) │ │
│ │ Hourly Trend (Heatmap) │ │
│ └─────────────────────────┘ │
└─────────────────────────────┘
Per-request log entry (emitted by ECS Fargate):
```json
{
  "timestamp": "2025-03-15T14:23:01Z",
  "request_id": "req-abc123",
  "session_id": "sess-xyz789",
  "intent": "recommendation",
  "model_id": "anthropic.claude-3-5-sonnet-20241022-v2:0",
  "input_tokens": 1247,
  "output_tokens": 412,
  "cache_hit": false,
  "latency_ms": 2340,
  "input_cost_usd": 0.003741,
  "output_cost_usd": 0.006180,
  "total_cost_usd": 0.009921,
  "region": "us-east-1",
  "hour_of_day": 14
}
```
DynamoDB hourly aggregation schema:
```
PK: INTENT#recommendation
SK: 2025-03-15T14:00:00Z
Attributes:
  model: "sonnet"
  total_conversations: 2834
  total_input_tokens: 3,541,298
  total_output_tokens: 1,168,508
  total_cost_usd: 28.16
  cache_hit_count: 412
  cache_miss_count: 2422
  avg_cost_per_conversation: 0.00993
```
This hourly granularity keeps DynamoDB costs low (~$15/month) while enabling dashboard queries down to hourly resolution. For sub-hourly analysis, query the raw Parquet files in S3 via Athena.
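For the DynamoDB leg, a sketch of the aggregation write, assuming a Lambda consumer on the Firehose stream folds each log record into its hourly item; the table name `mangaassist-cost-hourly` is illustrative:

```python
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("mangaassist-cost-hourly")  # hypothetical name

def upsert_hourly_aggregate(record: dict) -> None:
    """Fold one per-request log record (schema above) into its hourly aggregate.

    ADD creates the attributes on first write, so no init step is needed;
    avg_cost_per_conversation can be derived from the totals at read time.
    """
    hour = record["timestamp"][:13] + ":00:00Z"  # "2025-03-15T14" -> hourly SK
    table.update_item(
        Key={"PK": f"INTENT#{record['intent']}", "SK": hour},
        UpdateExpression=(
            "ADD total_conversations :one, total_input_tokens :it, "
            "total_output_tokens :ot, total_cost_usd :c, "
            "cache_hit_count :h, cache_miss_count :m"
        ),
        ExpressionAttributeValues={
            ":one": 1,
            ":it": record["input_tokens"],
            ":ot": record["output_tokens"],
            ":c": Decimal(str(record["total_cost_usd"])),
            ":h": 1 if record["cache_hit"] else 0,
            ":m": 0 if record["cache_hit"] else 1,
        },
    )
```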
Very Hard
A10. Real-Time Cost Anomaly Detection and Dynamic Routing
Architecture:
┌──────────────────────────────────────────────────────────┐
│ Real-Time Cost Control System │
├──────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ ECS Fargate │───►│ CloudWatch │───►│ Anomaly │ │
│ │ (Per-req │ │ Metrics │ │ Detection │ │
│ │ cost emit) │ │ │ │ (ML-based) │ │
│ └──────┬──────┘ └──────────────┘ └─────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ ElastiCache │◄───│ Cost Budget │◄───│ CloudWatch │ │
│ │ Redis │ │ Lambda │ │ Alarm │ │
│ │ (Running │ │ (Threshold │ │ │ │
│ │ daily cost)│ │ checks) │ │ │ │
│ └─────────────┘ └──────────────┘ └────────────┘ │
└──────────────────────────────────────────────────────────┘
Anomaly definitions:
| Anomaly Type | Detection | Threshold |
|---|---|---|
| Single conversation spike | Per-conversation cost > 10× intent average | recommendation > $0.10/conversation |
| Intent cost velocity spike | Rolling 15-min cost > 3× historical same-hour average | CloudWatch Anomaly Detection (ML band) |
| Daily budget approaching | Cumulative spend reaches % thresholds | 70% / 85% / 95% / 100% of $5,000 |
| Token count explosion | Input tokens > 10,000 on a single request | Direct check in orchestrator |
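The two per-request anomaly types can be guarded inline in the orchestrator, ahead of the budget logic below. A sketch, assuming the shared async Redis client and app-level helpers `TokenBudgetExceeded` and `emit_cloudwatch_metric` (illustrative names):

```python
MAX_INPUT_TOKENS = 10_000  # "token count explosion" threshold from the table

async def check_request_anomalies(intent: str, input_tokens: int, cost: float | None) -> None:
    # Direct check in the orchestrator, before the Bedrock call is made
    if input_tokens > MAX_INPUT_TOKENS:
        raise TokenBudgetExceeded(intent, input_tokens)
    # Per-conversation cost spike is only known after the call, once output
    # tokens are counted; compare against a rolling per-intent average in Redis
    if cost is not None:
        avg = float(await redis.get(f"cost:avg:{intent}") or 0)
        if avg and cost > 10 * avg:
            emit_cloudwatch_metric("PerConversationCostSpike", intent, cost)
```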
Real-time budget tracking (Redis):
```python
from datetime import datetime, timezone

# Executed on every LLM invocation; `redis` is the shared async client.
async def track_and_check_budget(intent: str, cost: float) -> str:
    pipe = redis.pipeline()
    # Increment daily counters (overall and per intent)
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    pipe.incrbyfloat(f"cost:daily:{today}", cost)
    pipe.incrbyfloat(f"cost:daily:{today}:intent:{intent}", cost)
    results = await pipe.execute()
    daily_total = float(results[0])

    DAILY_CAP = 5000.0
    if daily_total >= DAILY_CAP * 0.95:
        # CRITICAL: only serve high-value intents on Sonnet
        return "emergency_mode"
    elif daily_total >= DAILY_CAP * 0.85:
        # WARNING: downgrade low-priority intents
        return "cost_saving_mode"
    elif daily_total >= DAILY_CAP * 0.70:
        # CAUTION: enable aggressive caching
        return "cache_aggressive_mode"
    else:
        return "normal_mode"
```
Automated downgrade behavior:
| Mode | Sonnet Intents | Haiku Intents | Caching |
|---|---|---|---|
| Normal | recommendation, product_question, product_discovery, checkout_help | faq, order_tracking, return_request, promotion, chitchat, escalation | Standard |
| Cache Aggressive | Same | Same | Extend TTLs 2×, fuzzy match cache keys |
| Cost Saving | recommendation, checkout_help only | All others downgraded to Haiku | Aggressive + serve stale cache |
| Emergency | recommendation only | Everything else → Haiku | Max aggressive |
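A minimal sketch of applying the mode returned by track_and_check_budget to model selection; the intent sets are copied directly from the table above:

```python
# Intents that stay on Sonnet in each budget mode; everything else gets Haiku.
SONNET_BY_MODE = {
    "normal_mode": {"recommendation", "product_question", "product_discovery", "checkout_help"},
    "cache_aggressive_mode": {"recommendation", "product_question", "product_discovery", "checkout_help"},
    "cost_saving_mode": {"recommendation", "checkout_help"},
    "emergency_mode": {"recommendation"},
}

def select_model(mode: str, intent: str) -> str:
    if intent in SONNET_BY_MODE[mode]:
        return "anthropic.claude-3-5-sonnet-20241022-v2:0"
    return "anthropic.claude-3-haiku-20240307-v1:0"
```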
Preventing customer experience degradation:
- Transparent degradation: When downgraded from Sonnet to Haiku, the orchestrator adds extra few-shot examples to the Haiku prompt to compensate for quality loss
- Quality floor: Even in emergency mode, if Haiku produces a low-confidence response (below quality threshold), escalate to a human agent rather than serve a bad answer
- Budget replenishment: At midnight UTC, reset the daily budget counter. During the first hour of the new day, gradually restore normal routing (don't blast all accumulated demand through Sonnet)
A11. Cost-Aware Prompt Engineering Framework
Token efficiency metric:
$$\text{Token Efficiency} = \frac{\text{Quality Score}}{\text{Total Tokens Consumed}} \times 1000$$
This gives "quality points per 1,000 tokens." Higher is better — either by improving quality without token increase, or reducing tokens without quality loss.
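A tiny worked example (quality scores and token counts are illustrative) showing how the metric rewards token cuts that preserve quality:

```python
def token_efficiency(quality: float, total_tokens: int) -> float:
    """Quality points per 1,000 tokens consumed — higher is better."""
    return quality / total_tokens * 1000

# A compressed prompt trading 0.01 quality for 450 fewer tokens wins:
print(token_efficiency(0.90, 1_650))  # baseline:   ~0.55
print(token_efficiency(0.89, 1_200))  # compressed: ~0.74
```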
Systematic prompt optimization process:
Phase 1 — Baseline measurement. For each intent, measure the current prompt's:
- Total input tokens (including the system prompt)
- Average output tokens
- Quality score (BERTScore + judge model average)
- Token efficiency ratio
Phase 2 — Optimization techniques:
| Technique | Description | Expected Token Reduction | Quality Risk |
|---|---|---|---|
| System prompt compression | Rewrite verbose instructions into concise directives | 15–30% of system prompt tokens | Low — if meaning preserved |
| Example pruning | Reduce few-shot examples from 5 to 2 (most representative) | 40–60% of example tokens | Medium — edge case handling may degrade |
| Output format constraints | Add "respond in ≤ 3 sentences" or structured JSON schema | 20–40% of output tokens | Low–Medium — may lose nuance |
| Context window pruning | Remove conversation turns older than 3 turns | 30–50% of history tokens | Medium — multi-turn coherence may suffer |
| Instruction consolidation | Merge redundant or overlapping instructions | 10–20% of system prompt | Low |
Phase 3 — A/B test evaluation:
For each prompt variant:
```python
# Evaluation config for the prompt optimization A/B test
test_config = {
    "test_id": "prompt-opt-rec-v2",
    "intent": "recommendation",
    "control_prompt": current_production_prompt,
    "treatment_prompt": optimized_prompt,
    "metrics": {
        "primary": "token_efficiency",
        "guardrails": ["bertscore >= 0.88", "manga_domain_accuracy >= 4.0"],
        "cost": "cost_per_conversation",
    },
    "min_duration_days": 7,
    "traffic_split": 0.10,  # 10% of recommendation traffic
}
```
Phase 4 — Edge case regression detection:
Prompt compression can introduce subtle failures. To catch these:
1. Adversarial evaluation dataset: 50 deliberately tricky queries that test edge cases:
   - Manga with similar names (Naruto vs. Boruto: Naruto Next Generations)
   - Overlapping genres (dark fantasy vs. horror)
   - Out-of-catalog requests
   - Multi-language queries
2. A/B test duration extension: even if the main metrics show success, hold the test for 14 days to accumulate enough edge-case exposure.
3. Automated regression scanner: compare the optimized prompt's responses against the original prompt's for all 200 golden dataset samples; flag any response where quality drops by more than 10% and review those manually before promoting (a sketch follows).
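A sketch of that scanner, assuming `generate_with` and `score` wrap the existing generation and BERTScore/judge pipelines (illustrative names):

```python
# Flag golden-set samples where the optimized prompt loses > 10% quality.
def regression_scan(golden_samples, generate_with, score, original_prompt, optimized_prompt):
    flagged = []
    for sample in golden_samples:  # the 200 golden dataset samples per intent
        q_orig = score(generate_with(original_prompt, sample.query), sample.reference)
        q_new = score(generate_with(optimized_prompt, sample.query), sample.reference)
        if q_orig > 0 and (q_orig - q_new) / q_orig > 0.10:
            flagged.append((sample.id, round(q_orig, 3), round(q_new, 3)))
    return flagged  # review manually before promoting the optimized prompt
```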
A12. Cost-Performance Simulation Framework
Architecture — SageMaker Pipeline:
┌─────────────────────────────────────────────────────────────────┐
│ MangaAssist Cost Simulation Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Data Ingestion (SageMaker Processing) │
│ ├── Historical token distributions per intent (from S3) │
│ ├── Current model assignments and pricing │
│ ├── Cache hit rates per intent (from CloudWatch) │
│ ├── Traffic volume by hour/day (from DynamoDB aggregates) │
│ └── Quality scores per model-intent pair (from eval results) │
│ │
│ Step 2: Scenario Generation (SageMaker Processing) │
│ ├── Parse input scenario parameters │
│ ├── Generate Monte Carlo simulation inputs │
│ └── Create scenario matrix │
│ │
│ Step 3: Monte Carlo Simulation (SageMaker Training job) │
│ ├── 10,000 simulation runs per scenario │
│ ├── Sample from token distribution (not just mean) │
│ ├── Model traffic growth with Poisson variance │
│ ├── Apply cache hit rate changes when model changes │
│ └── Include infrastructure costs (OpenSearch, ECS, DDB, Redis) │
│ │
│ Step 4: Analysis & Output (SageMaker Processing) │
│ ├── Aggregate by scenario: mean, P5, P50, P95 │
│ ├── Generate confidence intervals │
│ └── Write results to S3 + DynamoDB │
│ │
│ Step 5: Visualization (Lambda → QuickSight) │
│ └── Update dashboard with new projections │
└─────────────────────────────────────────────────────────────────┘
Input parameters format:
```json
{
  "scenario_name": "Q2 2025 Projection",
  "base_date": "2025-03-01",
  "projection_months": 6,
  "assumptions": {
    "traffic_growth": {
      "type": "linear",
      "monthly_rate": 0.10
    },
    "price_changes": [
      {
        "model": "claude-3-5-sonnet",
        "effective_date": "2025-05-01",
        "new_input_price_per_m": 2.50,
        "new_output_price_per_m": 12.00
      }
    ],
    "new_models": [
      {
        "model": "claude-4-haiku",
        "available_date": "2025-06-01",
        "input_price_per_m": 0.40,
        "output_price_per_m": 2.00,
        "estimated_quality_vs_sonnet": 0.92
      }
    ],
    "architecture_changes": [
      {
        "type": "prompt_cache",
        "effective_date": "2025-04-01",
        "cache_eligible_intents": ["faq", "promotion", "chitchat"],
        "expected_token_reduction": 0.30
      }
    ],
    "model_reassignments": [
      {
        "intent": "product_question",
        "from_model": "sonnet",
        "to_model": "haiku",
        "effective_date": "2025-04-15"
      }
    ]
  }
}
```
Monte Carlo simulation core:
```python
from __future__ import annotations

import numpy as np

def simulate_monthly_cost(
    intent_configs: dict,
    traffic_model: TrafficModel,
    token_distributions: dict[str, Distribution],
    cache_model: CacheModel,
    n_simulations: int = 10_000,
) -> SimulationResult:
    # TrafficModel, Distribution, CacheModel, SimulationResult, and
    # sample_infrastructure_costs are pipeline types/helpers defined elsewhere.
    monthly_costs = []
    for _ in range(n_simulations):
        total_cost = 0.0
        for intent, config in intent_configs.items():
            # Sample daily traffic from a Poisson distribution
            daily_traffic = np.random.poisson(traffic_model.mean_daily[intent])
            monthly_traffic = daily_traffic * 30
            # Sample token counts from fitted distributions (not just means!)
            input_tokens = token_distributions[intent].input.rvs(monthly_traffic)
            output_tokens = token_distributions[intent].output.rvs(monthly_traffic)
            # Apply cache — cached requests have zero LLM cost
            cache_rate = cache_model.get_rate(intent, config.model)
            llm_requests = int(monthly_traffic * (1 - cache_rate))
            # Calculate LLM cost
            input_cost = np.sum(input_tokens[:llm_requests]) * config.model.input_price / 1e6
            output_cost = np.sum(output_tokens[:llm_requests]) * config.model.output_price / 1e6
            total_cost += input_cost + output_cost
        # Add infrastructure costs (relatively stable; sampled with ±10% noise)
        infra_cost = sample_infrastructure_costs()
        total_cost += infra_cost
        monthly_costs.append(total_cost)

    return SimulationResult(
        mean=np.mean(monthly_costs),
        p5=np.percentile(monthly_costs, 5),
        p50=np.percentile(monthly_costs, 50),
        p95=np.percentile(monthly_costs, 95),
        confidence_interval_95=(
            np.percentile(monthly_costs, 2.5),
            np.percentile(monthly_costs, 97.5),
        ),
    )
```
Output format:
┌───────────────────────────────────────────────────────────────┐
│ Scenario: Q2 2025 Projection (10% monthly growth) │
├───────────┬──────────┬──────────┬──────────┬────────────────┤
│ Month │ P5 │ P50 │ P95 │ 95% CI │
├───────────┼──────────┼──────────┼──────────┼────────────────┤
│ Apr 2025 │ $52,100 │ $58,400 │ $66,200 │ [$50,800-67,900] │
│ May 2025 │ $54,300 │ $61,800 │ $71,500 │ [$52,900-73,200] │
│ Jun 2025 │ $49,100 │ $55,200 │ $63,800 │ [$47,600-65,400] │ ← price drop
│ Jul 2025 │ $52,800 │ $59,900 │ $69,100 │ [$51,200-71,000] │
│ Aug 2025 │ $56,400 │ $64,100 │ $74,200 │ [$54,600-76,100] │
│ Sep 2025 │ $48,200 │ $54,800 │ $63,500 │ [$46,700-65,200] │ ← Claude 4 Haiku
└───────────┴──────────┴──────────┴──────────┴────────────────┘
Key insights:
- Bedrock price reduction in May saves ~$5K/month
- Claude 4 Haiku migration in Sep saves ~$9K/month
- Monthly budget cap of $400K is achievable by Aug with all optimizations