03 — Cost–Performance & Token Efficiency

Optimizing MangaAssist's token spend across 10 intents while maintaining quality thresholds — Sonnet at $3/$15 per million tokens, Haiku at $0.25/$1.25.

Context

MangaAssist processes ~300K conversations/day across 10 intents. At Claude 3.5 Sonnet pricing ($3/M input tokens, $15/M output tokens), running everything through Sonnet costs approximately $18K–$25K/day. Routing the intents that do not need Sonnet-level quality to Claude 3 Haiku ($0.25/M input, $1.25/M output) and applying per-intent token budgets can reduce this by 60–70% without meaningful quality loss. This scenario covers cost-per-conversation analysis, Pareto-optimal model configurations, and intent-based token budget design.
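The pricing above turns into a per-conversation cost with one line of arithmetic. The sketch below is a starting point for the calculations in this scenario, not a reference solution. The function names and the 1,000/500 token counts in the usage lines are illustrative assumptions; only the four prices come from the Context paragraph.

```python
# Prices in USD per million tokens, as quoted in the Context paragraph.
PRICES = {
    "sonnet": {"input": 3.00, "output": 15.00},
    "haiku": {"input": 0.25, "output": 1.25},
}


def conversation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Token cost in USD for one conversation on the given model."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def daily_cost(model: str, conversations: int, input_tokens: int, output_tokens: int) -> float:
    """Token cost in USD for a day's volume at average token counts."""
    return conversations * conversation_cost(model, input_tokens, output_tokens)


if __name__ == "__main__":
    # Hypothetical token counts, chosen only to illustrate the calculation.
    sonnet = conversation_cost("sonnet", 1_000, 500)
    haiku = conversation_cost("haiku", 1_000, 500)
    print(f"Sonnet: ${sonnet:.6f}  Haiku: ${haiku:.6f}  ratio: {sonnet / haiku:.1f}x")
```

Because both the input and the output price differ by the same factor (3 / 0.25 = 15 / 1.25 = 12), the Sonnet-to-Haiku cost ratio is 12× for any input/output mix at these prices.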

Questions (12)

Easy (1–3)

Q1. Calculate the daily token cost for MangaAssist's recommendation intent given the following assumptions: 40K conversations/day for this intent, average 800 input tokens and 400 output tokens per conversation, using Claude 3.5 Sonnet. Then calculate the same using Claude 3 Haiku. What is the daily savings from switching this intent to Haiku?

Q2. Define what a "token budget" means in the context of MangaAssist. For the faq intent, propose an input token budget (max system prompt + conversation history + user query) and an output token budget (max response length). How do you enforce these budgets at the application level on ECS Fargate?

Q3. MangaAssist uses ElastiCache Redis to cache frequent responses. If the cache hit rate for faq is 45% and for order_tracking is 30%, calculate the effective daily cost for these two intents combined. Assume: faq = 50K conversations/day (avg 500 input / 200 output tokens), order_tracking = 30K conversations/day (avg 600 input / 300 output tokens), using Haiku.
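One way to frame the cache calculation is sketched below, under a simplifying assumption: a cache hit incurs no model token cost, and the cost of running Redis itself is ignored. The function name and the numbers in the usage lines are hypothetical and deliberately differ from the figures in the question.

```python
def effective_daily_cost(
    conversations: int,
    hit_rate: float,
    input_tokens: int,
    output_tokens: int,
    input_price: float,   # USD per million input tokens
    output_price: float,  # USD per million output tokens
) -> float:
    """Daily token cost when only cache misses reach the model.

    Assumption: cache hits are free in token terms.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    misses = conversations * (1.0 - hit_rate)
    cost_per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return misses * cost_per_call


if __name__ == "__main__":
    # Illustrative values only: 10K conversations/day, 50% hit rate, Haiku prices.
    cost = effective_daily_cost(10_000, 0.50, 1_000, 400, 0.25, 1.25)
    print(f"Effective daily cost: ${cost:.2f}")
```

For several intents, the combined cost is the sum of this function applied to each intent with its own volume, hit rate, and token averages.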

Medium (4–6)

Q4. Build a complete cost-per-conversation breakdown for all 10 MangaAssist intents. For each intent, specify: model assignment (Sonnet/Haiku), average input/output tokens, daily volume, daily cost, and cost per conversation. Identify the top 3 most expensive intents and propose optimization strategies for each.

Q5. Explain the concept of Pareto-optimal model configuration for MangaAssist. Given three model options — Claude 3.5 Sonnet, Claude 3 Haiku, and a fine-tuned Haiku (via SageMaker) — plot the hypothetical cost-quality tradeoff for the product_question intent. How do you determine which configuration is Pareto-optimal, and what would make a configuration Pareto-dominated?
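As a reminder of the definition, configuration A dominates configuration B when A is no worse on both cost and quality and strictly better on at least one. The sketch below applies that test to a list of candidates. The configuration names, costs, and quality scores are invented for illustration and are not measurements of any model.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    cost: float     # USD per conversation (lower is better)
    quality: float  # quality score on any consistent scale (higher is better)


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.quality >= b.quality
    strictly_better = a.cost < b.cost or a.quality > b.quality
    return no_worse and strictly_better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(other, c) for other in configs)]


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    candidates = [
        Config("sonnet", 0.0105, 0.92),
        Config("haiku", 0.000875, 0.80),
        Config("haiku-fine-tuned", 0.0012, 0.88),
        Config("haiku-long-prompt", 0.0015, 0.85),  # costs more and scores lower than fine-tuned
    ]
    for config in pareto_front(candidates):
        print(config.name)
```

In this made-up data the long-prompt variant is Pareto-dominated by the fine-tuned one, while the other three each represent a different cost-quality tradeoff and stay on the front.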

Q6. MangaAssist's recommendation intent uses RAG with OpenSearch Serverless, injecting 3–5 product descriptions into the context (~1,200 tokens). This inflates the input token count significantly. Design a context compression strategy that reduces the token count of retrieved documents by 40% without losing critical product information. How does this impact the cost-per-conversation and response quality?

Hard (7–9)

Q7. Design a dynamic model routing system for MangaAssist that selects between Sonnet and Haiku in real-time based on query complexity, not just intent. For example, a simple recommendation query ("suggest a popular manga") should use Haiku, while a complex one ("recommend a seinen manga with philosophical themes similar to Vagabond but with a female protagonist") should use Sonnet. Define the complexity scoring algorithm, the routing thresholds, and the expected cost impact.

Q8. MangaAssist's token costs spike 3× during the holiday season (Black Friday through New Year) due to increased traffic and longer, more complex purchase-decision conversations. Design a cost mitigation strategy that maintains quality during peak periods. Cover: pre-computed response libraries, aggressive caching warmup, prompt compression, and temporary quality tier adjustments. Quantify the expected cost reduction.

Q9. The MangaAssist finance team demands a monthly token budget cap of $400K. Current spend is projected at $550K/month. Design a cost optimization plan that achieves the $400K target. For each optimization, provide: what changes, expected savings, quality impact (measured by BERTScore delta), and implementation complexity. Rank optimizations by ROI (savings per engineering hour).

Very Hard (10–12)

Q10. Design a real-time cost anomaly detection system for MangaAssist. Define what constitutes a token cost anomaly (e.g., a single conversation consuming 50K tokens, a sudden 200% spike in per-conversation cost for an intent). Implement the detection using CloudWatch anomaly detection, set up automated throttling (e.g., switch Sonnet intents to Haiku if cost exceeds budget by 20%), and define the self-healing process. How do you prevent cost anomaly responses from degrading customer experience?

Q11. MangaAssist wants to implement "cost-aware prompt engineering" — systematically rewriting system prompts to reduce token consumption while maintaining output quality. Design the evaluation framework: how do you measure the token efficiency of a prompt (tokens consumed per unit of quality), how do you A/B test prompt variants for cost, and how do you prevent prompt optimization from introducing subtle quality regressions that only appear in edge cases?

Q12. Build a comprehensive cost simulation model for MangaAssist that projects monthly costs under different scenarios: traffic growth (10%/20%/50% increase), model price changes (Bedrock price reduction), new model availability (Claude 4 with different pricing), and architecture changes (adding a prompt cache, switching to batch inference for offline tasks). The model should take input parameters and output a cost projection with confidence intervals. Design the simulation methodology, data inputs, and output format.
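A minimal Monte Carlo sketch of such a model follows, assuming cost scales linearly with traffic and with token price, and that traffic growth is normally distributed around the scenario value. All parameter names and defaults are hypothetical, and the interval reported is an empirical 5th to 95th percentile range rather than a formal confidence interval.

```python
import random
import statistics


def simulate_monthly_cost(
    base_monthly_cost: float,
    traffic_growth: float,      # e.g. 0.20 for a 20% traffic increase
    price_change: float = 0.0,  # e.g. -0.10 for a 10% price reduction
    growth_sd: float = 0.05,    # assumed uncertainty in the growth estimate
    runs: int = 10_000,
    seed: int = 0,
) -> dict[str, float]:
    """Project monthly cost as percentiles over simulated traffic growth."""
    rng = random.Random(seed)  # fixed seed keeps the projection reproducible
    samples = sorted(
        base_monthly_cost * (1 + rng.gauss(traffic_growth, growth_sd)) * (1 + price_change)
        for _ in range(runs)
    )
    return {
        "p5": samples[runs * 5 // 100],
        "p50": statistics.median(samples),
        "p95": samples[runs * 95 // 100],
    }


if __name__ == "__main__":
    # Scenario grid from the question: 10%, 20%, and 50% traffic growth.
    for growth in (0.10, 0.20, 0.50):
        projection = simulate_monthly_cost(550_000, growth)
        print(growth, {k: round(v) for k, v in projection.items()})
```

A fuller model would sample per-intent volumes, token counts, and cache hit rates separately instead of scaling a single monthly total.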
