
# 03 — Cost–Performance & Token Efficiency — Answers

## Easy

### A1. Per-Conversation LLM Cost Calculation

Claude 3.5 Sonnet pricing:

  • Input: $3.00 per million tokens
  • Output: $15.00 per million tokens

Claude 3 Haiku pricing:

  • Input: $0.25 per million tokens
  • Output: $1.25 per million tokens

Recommendation intent — Sonnet:

$$\text{Input cost} = 800 \times \frac{\$3.00}{1{,}000{,}000} = \$0.0024$$

$$\text{Output cost} = 400 \times \frac{\$15.00}{1{,}000{,}000} = \$0.0060$$

$$\text{Total per conversation} = \$0.0024 + \$0.0060 = \$0.0084$$

Recommendation intent — Haiku:

$$\text{Input cost} = 800 \times \frac{\$0.25}{1{,}000{,}000} = \$0.0002$$

$$\text{Output cost} = 400 \times \frac{\$1.25}{1{,}000{,}000} = \$0.0005$$

$$\text{Total per conversation} = \$0.0002 + \$0.0005 = \$0.0007$$

Cost ratio:

$$\frac{\text{Sonnet}}{\text{Haiku}} = \frac{\$0.0084}{\$0.0007} = 12\times$$

Daily cost at 40K conversations/day:

  • Sonnet: 40,000 × $0.0084 = $336/day ($10,080/month)
  • Haiku: 40,000 × $0.0007 = $28/day ($840/month)
  • Daily savings from switching to Haiku: $308/day ($9,240/month)

However, these savings are only worthwhile if Haiku meets the quality threshold for recommendation. Based on evaluation data, Haiku scores below the 0.88 BERTScore threshold for this intent, so the savings are not realizable without compromising quality.
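The arithmetic above is easy to reproduce in code. A minimal sketch (the `PRICING` table and helper names are illustrative, not MangaAssist production code):

```python
# Minimal sketch reproducing the per-conversation arithmetic above.
PRICING = {  # $ per million tokens: (input, output)
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

def cost_per_conversation(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1e6

sonnet = cost_per_conversation("sonnet", 800, 400)  # $0.0084
haiku = cost_per_conversation("haiku", 800, 400)    # $0.0007
print(f"ratio: {sonnet / haiku:.0f}x")                      # 12x
print(f"daily savings: ${40_000 * (sonnet - haiku):,.0f}")  # $308
```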


### A2. Token Budgets for MangaAssist

Definition: A token budget is a hard or soft limit on the number of input and output tokens consumed per LLM invocation for a specific intent. It controls costs, prevents runaway token consumption, and ensures predictable latency.

FAQ intent token budget:

| Component | Token Budget | Rationale |
|---|---|---|
| System prompt | 200 tokens | Fixed — FAQ instructions and persona |
| Conversation history | 300 tokens (max 2 prior turns) | FAQ rarely needs deep history |
| User query | 100 tokens | FAQ questions are typically concise |
| RAG context | 400 tokens (2 chunks max) | FAQ answers are usually in a single knowledge base article |
| Total input budget | 1,000 tokens | |
| Output budget (max_tokens) | 250 tokens | FAQ answers should be concise and direct |

Enforcement on ECS Fargate:

```python
import json

# In the orchestrator service running on ECS Fargate.
# `bedrock` is an async Bedrock Runtime client; `count_tokens`,
# `truncate_to_budget`, and `get_model_for_intent` are application helpers.
INTENT_TOKEN_BUDGETS = {
    "faq": {"max_input": 1000, "max_output": 250},
    "recommendation": {"max_input": 2500, "max_output": 500},
    "chitchat": {"max_input": 600, "max_output": 150},
    # ... other intents
}

async def invoke_llm(intent: str, prompt: str, context: str) -> str:
    budget = INTENT_TOKEN_BUDGETS[intent]

    # Token counting using tiktoken (approximate for Claude)
    input_tokens = count_tokens(prompt + context)

    if input_tokens > budget["max_input"]:
        # Strategy: truncate conversation history first, then RAG context
        prompt = truncate_to_budget(prompt, context, budget["max_input"])

    # Claude 3 models on Bedrock take a Messages API request body
    response = await bedrock.invoke_model(
        modelId=get_model_for_intent(intent),
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": budget["max_output"],  # Hard cap enforced by Bedrock
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return response
```

The max_tokens parameter in the Bedrock API call is the hard enforcement. Input truncation is application-level, applied before the API call.
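The truncation helper referenced above is application code. One hypothetical implementation, assuming the prompt's blocks (system prompt, history turns, current query) and the RAG chunks are joined by blank lines:

```python
# Hypothetical implementation of truncate_to_budget: drop oldest history
# turns first, then the lowest-ranked RAG chunks, until the input fits.
# `count_tokens` is the same approximate tokenizer used in invoke_llm.
def truncate_to_budget(prompt: str, context: str, max_input: int) -> str:
    turns = prompt.split("\n\n")    # [system, history..., current query]
    chunks = context.split("\n\n")  # retriever returns chunks ranked

    # Keep the first block (system prompt) and last block (current query);
    # drop the oldest history turn on each pass.
    while count_tokens("\n\n".join(turns + chunks)) > max_input and len(turns) > 2:
        turns.pop(1)
    # Then drop the lowest-ranked RAG chunks.
    while count_tokens("\n\n".join(turns + chunks)) > max_input and chunks:
        chunks.pop()
    return "\n\n".join(turns + chunks)
```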


### A3. Effective Cost with Caching

Given:

  • faq: 50K conversations/day, 500 input / 200 output tokens, Haiku, 45% cache hit rate
  • order_tracking: 30K conversations/day, 600 input / 300 output tokens, Haiku, 30% cache hit rate

FAQ cost calculation:

Per-conversation Haiku cost:

$$\text{Input} = 500 \times \frac{\$0.25}{1M} = \$0.000125$$

$$\text{Output} = 200 \times \frac{\$1.25}{1M} = \$0.000250$$

$$\text{Total} = \$0.000375$$

With 45% cache hit rate (cache hits cost ~$0 for the LLM — only Redis lookup cost):

  • LLM invocations: 50,000 × 0.55 = 27,500
  • Cache hits: 50,000 × 0.45 = 22,500 (near-zero LLM cost)

$$\text{FAQ daily cost} = 27{,}500 \times \$0.000375 = \$10.31$$

Without caching: 50,000 × $0.000375 = $18.75 → 45% savings from caching

Order tracking cost calculation:

Per-conversation Haiku cost:

$$\text{Input} = 600 \times \frac{\$0.25}{1M} = \$0.000150$$

$$\text{Output} = 300 \times \frac{\$1.25}{1M} = \$0.000375$$

$$\text{Total} = \$0.000525$$

With a 30% cache hit rate, LLM invocations: 30,000 × 0.70 = 21,000

$$\text{Order tracking daily cost} = 21{,}000 \times \$0.000525 = \$11.03$$

Without caching: 30,000 × $0.000525 = $15.75 → 30% savings

  • Combined daily cost: $10.31 + $11.03 = $21.34/day
  • Without caching: $18.75 + $15.75 = $34.50/day
  • Total savings from caching: $13.16/day ($394.80/month)

Note: These calculations exclude ElastiCache Redis infrastructure cost (~$150/month for a cache.r6g.large instance), which is amortized across all cached intents.
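The effective-cost formula used above scales LLM cost by the cache miss rate, since hits bypass the model entirely. A minimal sketch (function name and signature are illustrative):

```python
# Sketch: effective daily LLM cost for a cached intent, assuming cache
# hits cost ~$0 at the LLM layer (Redis lookup cost excluded).
def effective_daily_cost(volume: int, in_tok: int, out_tok: int,
                         in_price: float, out_price: float,
                         hit_rate: float) -> float:
    per_conv = (in_tok * in_price + out_tok * out_price) / 1e6
    return volume * (1 - hit_rate) * per_conv

faq = effective_daily_cost(50_000, 500, 200, 0.25, 1.25, 0.45)       # ≈ $10.31
tracking = effective_daily_cost(30_000, 600, 300, 0.25, 1.25, 0.30)  # ≈ $11.03
print(f"combined: ${faq + tracking:.2f}/day")                         # ≈ $21.34
```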


## Medium

### A4. Complete Cost-Per-Conversation Breakdown

Assumptions:

  • Sonnet: $3.00/$15.00 per M input/output tokens
  • Haiku: $0.25/$1.25 per M input/output tokens
  • Average tokens per conversation: 800 input, 350 output overall (varies by intent — detailed below)

| Intent | Model | Daily Vol. | Avg In Tok | Avg Out Tok | Cost/Conv | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|---|---|
| recommendation | Sonnet | 67K | 1,200 | 450 | $0.01035 | $693 | $20,790 |
| product_question | Sonnet | 50K | 900 | 400 | $0.00870 | $435 | $13,050 |
| faq | Haiku | 100K | 500 | 200 | $0.00038 | $38 | $1,125 |
| order_tracking | Haiku | 83K | 600 | 300 | $0.00053 | $44 | $1,313 |
| return_request | Haiku | 27K | 700 | 350 | $0.00061 | $16 | $494 |
| promotion | Haiku | 40K | 400 | 200 | $0.00035 | $14 | $420 |
| checkout_help | Sonnet | 33K | 800 | 300 | $0.00690 | $228 | $6,831 |
| chitchat | Haiku | 17K | 300 | 150 | $0.00026 | $4 | $133 |
| escalation | Haiku | 7K | 600 | 400 | $0.00065 | $5 | $137 |
| product_discovery | Sonnet | 60K | 1,100 | 400 | $0.00930 | $558 | $16,740 |
| TOTAL | | 484K | | | | $2,035 | $61,033 |

Top 3 most expensive intents:

  1. recommendation — $20,790/month (34% of total spend)
     • Optimization: Implement semantic caching — similar recommendation queries share responses. At a 25% cache hit rate → saves ~$5,200/month.
     • Optimization: Context compression — reduce RAG context from 5 chunks to the top 2 using a re-ranker → reduces input tokens from 1,200 to 700 → saves ~$3,900/month.

  2. product_discovery — $16,740/month (27%)
     • Optimization: Complexity-based routing — route simple discovery queries ("popular manga") to Haiku, complex ones to Sonnet. At a 40% simple ratio → saves ~$6,030/month.
     • Optimization: Progressive disclosure — start with Haiku for initial broad results, escalate to Sonnet only if the user refines the search.

  3. product_question — $13,050/month (21%)
     • Optimization: Fine-tune Haiku on product_question data → if quality meets the threshold, migrate fully → saves ~$12,000/month.
     • Optimization: Structured data extraction — for questions about specific product attributes (price, page count, dimensions), use a DynamoDB direct lookup instead of the LLM → eliminates ~30% of LLM calls.

Projected savings from all optimizations: ~$27,000/month (44% reduction)
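For reference, the table's totals can be recomputed from its own assumptions with a short script (values copied from the table; per-row rounding explains the small difference in the monthly total):

```python
# Sketch reproducing the cost table above from its stated assumptions.
PRICES = {"sonnet": (3.00, 15.00), "haiku": (0.25, 1.25)}  # $/M tokens

INTENTS = {  # intent: (model, daily_volume, avg_in_tokens, avg_out_tokens)
    "recommendation": ("sonnet", 67_000, 1_200, 450),
    "product_question": ("sonnet", 50_000, 900, 400),
    "faq": ("haiku", 100_000, 500, 200),
    "order_tracking": ("haiku", 83_000, 600, 300),
    "return_request": ("haiku", 27_000, 700, 350),
    "promotion": ("haiku", 40_000, 400, 200),
    "checkout_help": ("sonnet", 33_000, 800, 300),
    "chitchat": ("haiku", 17_000, 300, 150),
    "escalation": ("haiku", 7_000, 600, 400),
    "product_discovery": ("sonnet", 60_000, 1_100, 400),
}

daily_total = 0.0
for intent, (model, vol, tin, tout) in INTENTS.items():
    in_price, out_price = PRICES[model]
    daily_total += vol * (tin * in_price + tout * out_price) / 1e6
print(f"daily: ${daily_total:,.0f}, monthly: ${daily_total * 30:,.0f}")
# daily: $2,035, monthly: $61,043 — matches the table up to per-row rounding
```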


### A5. RAG Context Cost Comparison for Product Discovery

Scenario details:

  • 5 retrieved chunks × 300 tokens each = 1,500 tokens of RAG context
  • Base input tokens (system prompt + user query): 800
  • Output tokens: 400

Option A — All 5 chunks to Sonnet:

$$\text{Input} = (800 + 1{,}500) \times \frac{\$3.00}{1M} = 2{,}300 \times \$0.000003 = \$0.00690$$

$$\text{Output} = 400 \times \frac{\$15.00}{1M} = \$0.00600$$

$$\text{Total} = \$0.01290 \text{ per conversation}$$

Option B — All 5 chunks to Haiku:

$$\text{Input} = 2{,}300 \times \frac{\$0.25}{1M} = \$0.000575$$

$$\text{Output} = 400 \times \frac{\$1.25}{1M} = \$0.000500$$

$$\text{Total} = \$0.001075 \text{ per conversation}$$

Option C — Re-ranker selects top 2 chunks, send to Sonnet:

Re-ranker cost: $0.0001 per inference. Input tokens with 2 chunks: 800 + (2 × 300) = 1,400.

$$\text{Re-ranker} = \$0.0001$$

$$\text{Input} = 1{,}400 \times \frac{\$3.00}{1M} = \$0.00420$$

$$\text{Output} = 400 \times \frac{\$15.00}{1M} = \$0.00600$$

$$\text{Total} = \$0.0001 + \$0.00420 + \$0.00600 = \$0.01030 \text{ per conversation}$$

Comparison at 60K conversations/day:

| Option | Cost/Conv | Daily Cost | Monthly Cost | Quality |
|---|---|---|---|---|
| A: 5 chunks → Sonnet | $0.01290 | $774 | $23,220 | Highest — most context |
| B: 5 chunks → Haiku | $0.00108 | $65 | $1,935 | Lowest — Haiku may miss nuance |
| C: 2 chunks → Sonnet + re-ranker | $0.01030 | $618 | $18,540 | High — top 2 chunks are usually sufficient |

Recommendation: Option C saves $4,680/month vs Option A with minimal quality loss. The re-ranker filters out irrelevant chunks (such as an off-topic One Piece result retrieved when the user asked about dark fantasy), actually improving faithfulness by reducing noise in the context window.


### A6. Intent-Based Token Budget Table

| Intent | Max Input Tokens | Max Output Tokens | Model | Target $/Conv | Rationale |
|---|---|---|---|---|---|
| recommendation | 2,500 | 500 | Sonnet | $0.0150 | Large RAG context (5 products), detailed multi-product responses |
| product_question | 2,000 | 400 | Sonnet | $0.0120 | Product context + comparison needs |
| product_discovery | 2,200 | 450 | Sonnet | $0.0135 | Iterative discovery needs broad context |
| faq | 1,000 | 250 | Haiku | $0.0006 | Concise answers, limited context needed |
| order_tracking | 1,200 | 300 | Haiku | $0.0007 | Order data is structured, responses are template-like |
| return_request | 1,200 | 350 | Haiku | $0.0007 | Return policy is finite, but edge cases need detail |
| promotion | 800 | 200 | Haiku | $0.0005 | Promotions are pre-defined, short responses |
| checkout_help | 1,500 | 350 | Sonnet | $0.0098 | Step-by-step guidance needs clarity |
| chitchat | 600 | 150 | Haiku | $0.0003 | Casual, short exchanges |
| escalation | 1,500 | 500 | Haiku | $0.0010 | Needs higher output to generate detailed agent summary |

Why escalation gets a higher output budget than order_tracking:

  • order_tracking output is a structured status update: "Your order #12345 shipped on March 15 and is expected to arrive by March 20. Tracking number: XYZ." This rarely exceeds 100 tokens.

  • escalation output is a conversation summary for the human agent. It must include:
    1. Summary of the customer's issue
    2. What solutions were attempted by the chatbot
    3. Customer's emotional state / sentiment
    4. Relevant order/product details
    5. Recommended actions for the agent

This summary can reach 300–500 tokens to be useful. Generating a truncated summary wastes the human agent's time.


## Hard

### A7. Pareto-Optimal Configuration Search

Search space definition:

For each of MangaAssist's 10 intents, the configurable parameters are:

| Parameter | Options | Values |
|---|---|---|
| Model | 3 | Sonnet, Haiku, Fine-tuned Haiku |
| max_tokens (output) | 4 | 150, 250, 400, 500 |
| temperature | 3 | 0.1, 0.3, 0.7 |
| RAG context chunks | 3 | 0, 2, 5 (if applicable) |

Per intent: 3 × 4 × 3 × 3 = 108 configurations. Across 10 intents: 108^10 ≈ 10^20 combinations — brute force is impossible.

Optimization approach — multi-objective evolutionary search (NSGA-II):

  1. Objective functions:
     • $f_1(\mathbf{x})$ = total monthly LLM cost (minimize)
     • $f_2(\mathbf{x})$ = weighted average quality score (maximize)

     where $\mathbf{x}$ is the configuration vector across all 10 intents.

  2. Constraints:
     • Per-intent quality ≥ minimum threshold (hard constraint)
     • P99 latency ≤ 5,000ms per intent (hard constraint)

  3. Algorithm — NSGA-II (Non-dominated Sorting Genetic Algorithm II):
     • Population size: 100 configurations
     • Evaluate each configuration using offline evaluation (Bedrock evaluation jobs on golden datasets)
     • Evolve over 50 generations with crossover and mutation on intent-level parameters
     • The Pareto front emerges naturally — the configurations not dominated by any other

  4. Efficient evaluation:
     • Don't run production traffic. Use the golden evaluation datasets (200 samples per intent).
     • Cost is calculated analytically from token counts and model pricing.
     • Quality is measured via BERTScore + judge model scoring.
     • Latency is estimated from historical distributions per model.

  5. Output: A Pareto front with ~15–20 non-dominated configurations:

```
Config A: $35K/month, quality=0.92  (all Sonnet, high tokens)
Config B: $28K/month, quality=0.91  (Sonnet for top 3, Haiku for rest)
Config C: $18K/month, quality=0.87  (Sonnet for recommendation only)
Config D: $12K/month, quality=0.83  (all Haiku)
```

The business chooses from the Pareto front based on budget constraints and quality appetite.

  6. Pareto dominance: A configuration is Pareto-dominated if there exists another configuration that is both cheaper AND higher quality. For example, if Config X costs $25K with quality 0.88, and Config Y costs $22K with quality 0.90, then X is dominated by Y and should never be chosen. A minimal non-dominated filter is sketched below.
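As an illustration, a brute-force non-dominated filter over a handful of evaluated configurations (scores taken from the examples above; in practice NSGA-II maintains the front incrementally during evolution):

```python
# Sketch of a non-dominated (Pareto) filter over evaluated configurations,
# each scored as (name, monthly_cost, quality). Lower cost and higher
# quality are better.
def pareto_front(configs: list[tuple[str, float, float]]) -> list[str]:
    front = []
    for name, cost, quality in configs:
        dominated = any(
            c2 <= cost and q2 >= quality and (c2 < cost or q2 > quality)
            for _, c2, q2 in configs
        )
        if not dominated:
            front.append(name)
    return front

configs = [("A", 35_000, 0.92), ("B", 28_000, 0.91),
           ("X", 25_000, 0.88), ("Y", 22_000, 0.90)]
print(pareto_front(configs))  # ['A', 'B', 'Y'] — X is dominated by Y
```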

### A8. Cost-Aware Intent Sub-Routing for Product Question

Complexity classifier design:

Features for routing product_question to Sonnet vs Haiku:

| Feature | Type | Example |
|---|---|---|
| Query token count | Numeric | Short queries (< 20 tokens) → likely simple |
| Number of entities mentioned | Numeric | Multi-product comparison → complex |
| Presence of comparison words | Binary | "compare", "difference between", "versus" → complex |
| Coreference depth | Numeric | "What about the second one?" → needs context → complex |
| Product attribute type | Categorical | Factual (page count, price) → simple; subjective (quality, art style) → complex |
| Conversation turn number | Numeric | Turn 1 → usually simple; turn 4+ → often complex |

Training pipeline (SageMaker):

  1. Data: Label 10K historical product_question conversations as "simple" or "complex" based on whether Haiku produced an acceptable response (run both models, compare):
     • Simple = Haiku BERTScore ≥ 0.85 against the reference
     • Complex = Haiku BERTScore < 0.85

  2. Model: XGBoost classifier on SageMaker (fast inference, < 10ms latency)

  3. Deploy: SageMaker real-time endpoint, invoked by the ECS Fargate orchestrator before model selection (see the routing sketch below)
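A hypothetical orchestrator hook for step 3 — the endpoint name, payload layout, and the 0.5 decision threshold are assumptions for illustration, not the production contract:

```python
# Hypothetical sub-routing hook: score the query with the complexity
# classifier's SageMaker endpoint, then pick the model.
import json
import boto3

smr = boto3.client("sagemaker-runtime")

def route_product_question(features: list[float]) -> str:
    resp = smr.invoke_endpoint(
        EndpointName="product-question-complexity",  # hypothetical name
        ContentType="application/json",
        Body=json.dumps({"instances": [features]}),
    )
    p_complex = json.loads(resp["Body"].read())["predictions"][0]
    # Route complex queries to Sonnet, simple ones to Haiku
    return "sonnet" if p_complex >= 0.5 else "haiku"
```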

Expected cost savings:

| Metric | Current (All Sonnet) | With Sub-Routing |
|---|---|---|
| Daily volume | 50K | 50K |
| Sonnet calls | 50K (100%) | 35K (70% complex) |
| Haiku calls | 0 | 15K (30% simple) |
| Daily LLM cost | $435 | $305 + $8 = $313 |
| Monthly savings | — | $3,660/month |

Additional SageMaker cost for classifier: ~$100/month (ml.c5.large endpoint) → net savings: $3,560/month

Quality guardrail: Run a shadow evaluation weekly — invoke both Sonnet and Haiku for all queries, compare the complexity classifier's routing decision against the actual quality gap. If the classifier's "simple" routing is wrong > 5% of the time, retrain.


### A9. Cost Attribution Dashboard Design

Data pipeline architecture:

```
Bedrock API Response
    │ (includes token counts in response metadata)
    ▼
┌─────────────────────┐
│ ECS Fargate         │  Extract: input_tokens, output_tokens,
│ (Orchestrator)      │  model_id, intent, session_id,
│                     │  cache_hit, timestamp
└──────────┬──────────┘
           │ Structured JSON log
           ▼
┌─────────────────────┐
│ CloudWatch Logs     │  Log group: /mangaassist/llm-usage
│                     │  Retention: 90 days
└──────────┬──────────┘
           │ Subscription filter
           ▼
┌─────────────────────┐
│ Kinesis Data        │  Buffer: 60s / 5MB batches
│ Firehose            │
└──────────┬──────────┘
           │
     ┌─────┴──────┐
     ▼            ▼
┌─────────┐  ┌─────────────┐
│ S3      │  │ DynamoDB    │
│ (Raw)   │  │ (Hourly     │
│ Parquet │  │  Aggregates)│
└────┬────┘  └──────┬──────┘
     │               │
     ▼               ▼
┌─────────────────────────────┐
│ QuickSight Dashboard        │
│                             │
│ ┌─────────────────────────┐ │
│ │ Cost by Intent (Pie)    │ │
│ │ Cost by Model (Stacked) │ │
│ │ Input vs Output Tokens  │ │
│ │ Cache Savings (Line)    │ │
│ │ Hourly Trend (Heatmap)  │ │
│ └─────────────────────────┘ │
└─────────────────────────────┘
```

Per-request log entry (emitted by ECS Fargate):

```json
{
  "timestamp": "2025-03-15T14:23:01Z",
  "request_id": "req-abc123",
  "session_id": "sess-xyz789",
  "intent": "recommendation",
  "model_id": "anthropic.claude-3-5-sonnet-20241022-v2:0",
  "input_tokens": 1247,
  "output_tokens": 412,
  "cache_hit": false,
  "latency_ms": 2340,
  "input_cost_usd": 0.003741,
  "output_cost_usd": 0.006180,
  "total_cost_usd": 0.009921,
  "region": "us-east-1",
  "hour_of_day": 14
}
```
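One way the orchestrator might emit this record — a sketch assuming a `PRICING` dict keyed by model_id and a logger whose stream feeds the `/mangaassist/llm-usage` log group:

```python
# Sketch: emit one structured JSON usage record per Bedrock call.
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("llm_usage")

def log_llm_usage(request_id: str, session_id: str, intent: str,
                  model_id: str, input_tokens: int, output_tokens: int,
                  cache_hit: bool, latency_ms: int) -> None:
    in_price, out_price = PRICING[model_id]  # $/M tokens, keyed by model_id
    input_cost = input_tokens * in_price / 1e6
    output_cost = output_tokens * out_price / 1e6
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "session_id": session_id,
        "intent": intent,
        "model_id": model_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cache_hit": cache_hit,
        "latency_ms": latency_ms,
        "input_cost_usd": round(input_cost, 6),
        "output_cost_usd": round(output_cost, 6),
        "total_cost_usd": round(input_cost + output_cost, 6),
    }))
```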

DynamoDB hourly aggregation schema:

```
PK: INTENT#recommendation
SK: 2025-03-15T14:00:00Z
Attributes:
  model: "sonnet"
  total_conversations: 2834
  total_input_tokens: 3,541,298
  total_output_tokens: 1,168,508
  total_cost_usd: 28.16
  cache_hit_count: 412
  cache_miss_count: 2422
  avg_cost_per_conversation: 0.00993
```
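A sketch of how a downstream consumer (e.g., a Lambda on the Firehose stream) could maintain these aggregates with atomic ADD updates; the table name is hypothetical:

```python
# Sketch: upsert an hourly intent aggregate via atomic ADD updates.
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("llm-usage-hourly")  # hypothetical

def upsert_hourly_aggregate(intent: str, hour_iso: str, record: dict) -> None:
    table.update_item(
        Key={"PK": f"INTENT#{intent}", "SK": hour_iso},
        UpdateExpression=(
            "ADD total_conversations :one, "
            "total_input_tokens :it, total_output_tokens :ot, "
            "total_cost_usd :c, cache_hit_count :h, cache_miss_count :m"
        ),
        ExpressionAttributeValues={
            ":one": 1,
            ":it": record["input_tokens"],
            ":ot": record["output_tokens"],
            ":c": Decimal(str(record["total_cost_usd"])),
            ":h": 1 if record["cache_hit"] else 0,
            ":m": 0 if record["cache_hit"] else 1,
        },
    )
```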

This hourly granularity keeps DynamoDB costs low (~$15/month) while enabling dashboard queries down to hourly resolution. For sub-hourly analysis, query the raw Parquet files in S3 via Athena.


## Very Hard

### A10. Real-Time Cost Anomaly Detection and Dynamic Routing

Architecture:

```
┌──────────────────────────────────────────────────────────┐
│                Real-Time Cost Control System              │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌─────────────┐    ┌──────────────┐    ┌────────────┐  │
│  │ ECS Fargate │───►│ CloudWatch   │───►│ Anomaly    │  │
│  │ (Per-req    │    │ Metrics      │    │ Detection  │  │
│  │  cost emit) │    │              │    │ (ML-based) │  │
│  └──────┬──────┘    └──────────────┘    └─────┬──────┘  │
│         │                                      │         │
│         ▼                                      ▼         │
│  ┌─────────────┐    ┌──────────────┐    ┌────────────┐  │
│  │ ElastiCache │◄───│ Cost Budget  │◄───│ CloudWatch │  │
│  │ Redis       │    │ Lambda       │    │ Alarm      │  │
│  │ (Running    │    │ (Threshold   │    │            │  │
│  │  daily cost)│    │  checks)     │    │            │  │
│  └─────────────┘    └──────────────┘    └────────────┘  │
└──────────────────────────────────────────────────────────┘
```

Anomaly definitions:

| Anomaly Type | Detection | Threshold |
|---|---|---|
| Single conversation spike | Per-conversation cost > 10× intent average | recommendation > $0.10/conversation |
| Intent cost velocity spike | Rolling 15-min cost > 3× historical same-hour average | CloudWatch Anomaly Detection (ML band) |
| Daily budget approaching | Cumulative spend reaches % thresholds | 70% / 85% / 95% / 100% of $5,000 |
| Token count explosion | Input tokens > 10,000 on a single request | Direct check in orchestrator |

Real-time budget tracking (Redis):

```python
from datetime import datetime, timezone

import redis.asyncio as aioredis

redis = aioredis.Redis()  # points at the ElastiCache cluster in production

DAILY_CAP = 5000.0  # $/day

# Executed on every LLM invocation
async def track_and_check_budget(intent: str, cost: float) -> str:
    pipe = redis.pipeline()

    # Increment daily counters (global and per-intent)
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    pipe.incrbyfloat(f"cost:daily:{today}", cost)
    pipe.incrbyfloat(f"cost:daily:{today}:intent:{intent}", cost)

    results = await pipe.execute()
    daily_total = float(results[0])

    if daily_total >= DAILY_CAP * 0.95:
        # CRITICAL: Only serve high-value intents on Sonnet
        return "emergency_mode"
    elif daily_total >= DAILY_CAP * 0.85:
        # WARNING: Downgrade low-priority intents
        return "cost_saving_mode"
    elif daily_total >= DAILY_CAP * 0.70:
        # CAUTION: Enable aggressive caching
        return "cache_aggressive_mode"
    else:
        return "normal_mode"
```

Automated downgrade behavior:

| Mode | Sonnet Intents | Haiku Intents | Caching |
|---|---|---|---|
| Normal | recommendation, product_question, product_discovery, checkout_help | faq, order_tracking, return_request, promotion, chitchat, escalation | Standard |
| Cache Aggressive | Same | Same | Extend TTLs 2×, fuzzy-match cache keys |
| Cost Saving | recommendation, checkout_help only | All others downgraded to Haiku | Aggressive + serve stale cache |
| Emergency | recommendation only | Everything else → Haiku | Max aggressive |

Preventing customer experience degradation:

  1. Transparent degradation: When downgraded from Sonnet to Haiku, the orchestrator adds extra few-shot examples to the Haiku prompt to compensate for quality loss
  2. Quality floor: Even in emergency mode, if Haiku produces a low-confidence response (below quality threshold), escalate to a human agent rather than serve a bad answer
  3. Budget replenishment: At midnight UTC, reset the daily budget counter. During the first hour of the new day, gradually restore normal routing (don't blast all accumulated demand through Sonnet)

### A11. Cost-Aware Prompt Engineering Framework

Token efficiency metric:

$$\text{Token Efficiency} = \frac{\text{Quality Score}}{\text{Total Tokens Consumed}} \times 1000$$

This gives "quality points per 1,000 tokens." For example, a response scoring 0.90 quality on 1,650 total tokens yields 0.90 / 1,650 × 1,000 ≈ 0.55. Higher is better — either by improving quality without a token increase, or by reducing tokens without quality loss.

Systematic prompt optimization process:

Phase 1 — Baseline measurement: For each intent, measure the current prompt's:

  • Total input tokens (including system prompt)
  • Average output tokens
  • Quality score (BERTScore + judge model average)
  • Token efficiency ratio

Phase 2 — Optimization techniques:

| Technique | Description | Expected Token Reduction | Quality Risk |
|---|---|---|---|
| System prompt compression | Rewrite verbose instructions into concise directives | 15–30% of system prompt tokens | Low — if meaning preserved |
| Example pruning | Reduce few-shot examples from 5 to 2 (most representative) | 40–60% of example tokens | Medium — edge case handling may degrade |
| Output format constraints | Add "respond in ≤ 3 sentences" or a structured JSON schema | 20–40% of output tokens | Low–Medium — may lose nuance |
| Context window pruning | Remove conversation turns older than 3 turns | 30–50% of history tokens | Medium — multi-turn coherence may suffer |
| Instruction consolidation | Merge redundant or overlapping instructions | 10–20% of system prompt | Low |

Phase 3 — A/B test evaluation:

For each prompt variant:

```python
# Evaluation for prompt optimization A/B test
test_config = {
    "test_id": "prompt-opt-rec-v2",
    "intent": "recommendation",
    "control_prompt": current_production_prompt,
    "treatment_prompt": optimized_prompt,
    "metrics": {
        "primary": "token_efficiency",
        "guardrails": ["bertscore >= 0.88", "manga_domain_accuracy >= 4.0"],
        "cost": "cost_per_conversation"
    },
    "min_duration_days": 7,
    "traffic_split": 0.10  # 10% of recommendation traffic
}
```
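A sketch of the primary-metric computation for this test, assuming each arm's evaluation yields mean quality and mean token counts (the `ArmResult` shape is illustrative):

```python
# Sketch: compute the token-efficiency primary metric per test arm.
from dataclasses import dataclass

@dataclass
class ArmResult:
    mean_quality: float       # BERTScore + judge model average
    mean_input_tokens: float
    mean_output_tokens: float

def token_efficiency(arm: ArmResult) -> float:
    total_tokens = arm.mean_input_tokens + arm.mean_output_tokens
    return arm.mean_quality / total_tokens * 1000

control = ArmResult(0.91, 1_200, 450)    # illustrative numbers
treatment = ArmResult(0.90, 820, 380)
print(token_efficiency(control), token_efficiency(treatment))
# ≈ 0.55 vs ≈ 0.75 — treatment wins if the guardrail metrics also hold
```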

Phase 4 — Edge case regression detection:

Prompt compression can introduce subtle failures. To catch these:

  1. Adversarial evaluation dataset: 50 deliberately tricky queries that test edge cases:
     • Manga with similar names (Naruto vs Boruto: Naruto Next Generations)
     • Genres that overlap (dark fantasy vs horror)
     • Out-of-catalog requests
     • Multi-language queries

  2. A/B test duration extension: Even if the main metrics show success, hold the test for 14 days to accumulate enough edge case exposure.

  3. Automated regression scanner: Compare the optimized prompt's responses against the original prompt for all 200 golden dataset samples. Flag any response where quality drops > 10% — review these manually before promoting.


### A12. Cost-Performance Simulation Framework

Architecture — SageMaker Pipeline:

```
┌─────────────────────────────────────────────────────────────────┐
│              MangaAssist Cost Simulation Pipeline                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Step 1: Data Ingestion (SageMaker Processing)                  │
│  ├── Historical token distributions per intent (from S3)        │
│  ├── Current model assignments and pricing                      │
│  ├── Cache hit rates per intent (from CloudWatch)               │
│  ├── Traffic volume by hour/day (from DynamoDB aggregates)      │
│  └── Quality scores per model-intent pair (from eval results)   │
│                                                                 │
│  Step 2: Scenario Generation (SageMaker Processing)             │
│  ├── Parse input scenario parameters                            │
│  ├── Generate Monte Carlo simulation inputs                     │
│  └── Create scenario matrix                                     │
│                                                                 │
│  Step 3: Monte Carlo Simulation (SageMaker Training job)        │
│  ├── 10,000 simulation runs per scenario                        │
│  ├── Sample from token distribution (not just mean)             │
│  ├── Model traffic growth with Poisson variance                 │
│  ├── Apply cache hit rate changes when model changes            │
│  └── Include infrastructure costs (OpenSearch, ECS, DDB, Redis) │
│                                                                 │
│  Step 4: Analysis & Output (SageMaker Processing)               │
│  ├── Aggregate by scenario: mean, P5, P50, P95                  │
│  ├── Generate confidence intervals                              │
│  └── Write results to S3 + DynamoDB                             │
│                                                                 │
│  Step 5: Visualization (Lambda → QuickSight)                    │
│  └── Update dashboard with new projections                      │
└─────────────────────────────────────────────────────────────────┘
```

Input parameters format:

```json
{
  "scenario_name": "Q2 2025 Projection",
  "base_date": "2025-03-01",
  "projection_months": 6,
  "assumptions": {
    "traffic_growth": {
      "type": "linear",
      "monthly_rate": 0.10
    },
    "price_changes": [
      {
        "model": "claude-3-5-sonnet",
        "effective_date": "2025-05-01",
        "new_input_price_per_m": 2.50,
        "new_output_price_per_m": 12.00
      }
    ],
    "new_models": [
      {
        "model": "claude-4-haiku",
        "available_date": "2025-06-01",
        "input_price_per_m": 0.40,
        "output_price_per_m": 2.00,
        "estimated_quality_vs_sonnet": 0.92
      }
    ],
    "architecture_changes": [
      {
        "type": "prompt_cache",
        "effective_date": "2025-04-01",
        "cache_eligible_intents": ["faq", "promotion", "chitchat"],
        "expected_token_reduction": 0.30
      }
    ],
    "model_reassignments": [
      {
        "intent": "product_question",
        "from_model": "sonnet",
        "to_model": "haiku",
        "effective_date": "2025-04-15"
      }
    ]
  }
}
```

Monte Carlo simulation core:

```python
import numpy as np

# `TrafficModel`, `Distribution`, `CacheModel`, `SimulationResult`, and
# `sample_infrastructure_costs` are defined elsewhere in the pipeline;
# `Distribution.rvs(n)` draws n samples (scipy.stats-style interface).

def simulate_monthly_cost(
    intent_configs: dict,
    traffic_model: "TrafficModel",
    token_distributions: "dict[str, Distribution]",
    cache_model: "CacheModel",
    n_simulations: int = 10000,
) -> "SimulationResult":
    monthly_costs = []

    for _ in range(n_simulations):
        total_cost = 0.0

        for intent, config in intent_configs.items():
            # Sample daily traffic from a Poisson distribution
            daily_traffic = np.random.poisson(traffic_model.mean_daily[intent])
            monthly_traffic = daily_traffic * 30

            # Sample token counts from fitted distributions (not just means!)
            input_tokens = token_distributions[intent].input.rvs(monthly_traffic)
            output_tokens = token_distributions[intent].output.rvs(monthly_traffic)

            # Apply cache — cached requests have zero LLM cost
            cache_rate = cache_model.get_rate(intent, config.model)
            llm_requests = int(monthly_traffic * (1 - cache_rate))

            # Calculate LLM cost over the non-cached requests only
            input_cost = np.sum(input_tokens[:llm_requests]) * config.model.input_price / 1e6
            output_cost = np.sum(output_tokens[:llm_requests]) * config.model.output_price / 1e6

            total_cost += input_cost + output_cost

        # Add infrastructure costs (relatively stable; sampled with ±10% noise)
        total_cost += sample_infrastructure_costs()

        monthly_costs.append(total_cost)

    return SimulationResult(
        mean=np.mean(monthly_costs),
        p5=np.percentile(monthly_costs, 5),
        p50=np.percentile(monthly_costs, 50),
        p95=np.percentile(monthly_costs, 95),
        confidence_interval_95=(
            np.percentile(monthly_costs, 2.5),
            np.percentile(monthly_costs, 97.5),
        ),
    )
```

Output format:

```
┌───────────────────────────────────────────────────────────────┐
│  Scenario: Q2 2025 Projection (10% monthly growth)           │
├───────────┬──────────┬──────────┬──────────┬────────────────┤
│ Month     │ P5       │ P50      │ P95      │ 95% CI         │
├───────────┼──────────┼──────────┼──────────┼────────────────┤
│ Apr 2025  │ $52,100  │ $58,400  │ $66,200  │ [$50,800-67,900] │
│ May 2025  │ $54,300  │ $61,800  │ $71,500  │ [$52,900-73,200] │
│ Jun 2025  │ $49,100  │ $55,200  │ $63,800  │ [$47,600-65,400] │  ← price drop
│ Jul 2025  │ $52,800  │ $59,900  │ $69,100  │ [$51,200-71,000] │
│ Aug 2025  │ $56,400  │ $64,100  │ $74,200  │ [$54,600-76,100] │
│ Sep 2025  │ $48,200  │ $54,800  │ $63,500  │ [$46,700-65,200] │  ← Claude 4 Haiku
└───────────┴──────────┴──────────┴──────────┴────────────────┘
```

Key insights:

  • The Bedrock price reduction in May saves ~$5K/month
  • The Claude 4 Haiku migration in Sep saves ~$9K/month
  • A monthly budget cap of $400K is achievable by Aug with all optimizations
