Interview Q&A — FM Assessment and Selection
Skill 1.2.1 | Task 1.2 — Select and Configure FMs | Domain 1
Scenario 1: Claude 3 Sonnet Used for Intent Classification (Cost Benchmark Failure)
Opening Question
Q: MangaAssist launched with Claude 3 Sonnet routing every user message — including simple intent classification. Three months in, the team notices a monthly Bedrock bill that is 12× higher than projected for the classification tier. Walk me through the root cause, how you diagnose it, and how you fix it without compromising accuracy.
Model Answer
The root cause is absence of a task complexity taxonomy during model selection. The team selected Sonnet as a single model for all generative tasks without benchmarking whether simpler, cheaper models could handle the subset of classification tasks. Sonnet at $3/$15/1M tokens is appropriate for nuanced synthesis and complex reasoning, not for a 5-label intent classifier. Haiku at $0.25/$1.25/1M tokens achieves equivalent accuracy on a narrow, deterministic task like classifying "browse", "search", "recommend", "purchase_intent", "other". Diagnosis: pull Bedrock invocation logs from CloudWatch Logs Insights, group by the use_case tag or prompt prefix pattern, and calculate cost per task type. Confirm that classification tasks represent the majority of Sonnet invocations. Resolution: run an accuracy parity benchmark with a labelled test set (100+ messages per intent), compare Haiku vs. Sonnet. If parity is confirmed (typically within 1–2% on closed-set classification), migrate classification to Haiku. At 1M messages/day with 60% classification, this saves ~$12,000/month.
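A minimal sketch of the cost-attribution diagnosis, assuming each Bedrock call is logged as a structured record with task_type, model_id, input_tokens, and output_tokens fields to a log group such as /mangaassist/bedrock-invocations (field and log group names are hypothetical; prices are illustrative and should be checked against current Bedrock pricing):

```python
import time
import boto3

logs = boto3.client("logs")

# Logs Insights query: token volume and call count per (task_type, model_id).
QUERY = """
fields task_type, model_id, input_tokens, output_tokens
| stats sum(input_tokens) as in_tok, sum(output_tokens) as out_tok, count(*) as calls by task_type, model_id
| sort calls desc
"""

def cost_by_task_type(log_group: str, start: int, end: int) -> list[dict]:
    """Run the query and attach a rough dollar estimate to each (task_type, model) row."""
    prices = {"sonnet": (3.00, 15.00), "haiku": (0.25, 1.25)}  # $ per 1M input/output tokens
    query_id = logs.start_query(logGroupName=log_group, startTime=start,
                                endTime=end, queryString=QUERY)["queryId"]
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(2)
    rows = []
    for result in resp.get("results", []):
        row = {f["field"]: f["value"] for f in result}
        tier = "haiku" if "haiku" in row["model_id"] else "sonnet"
        in_price, out_price = prices[tier]
        row["est_cost_usd"] = (float(row["in_tok"]) * in_price +
                               float(row["out_tok"]) * out_price) / 1_000_000
        rows.append(row)
    return rows
```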
Follow-up 1: How to structure the task complexity taxonomy
Q: What does a task complexity taxonomy look like in practice and how does it drive model selection decisions?
A: Three tiers: (1) High complexity: open-ended multi-turn reasoning, nuanced creative synthesis, long-context analysis → Sonnet. (2) Medium complexity: summarization, multi-label classification with ambiguous boundaries, structured extraction from noisy text → Haiku with optional Sonnet fallback. (3) Low complexity: binary or narrow multi-class classification, short-format intent detection, yes/no safety checks → Haiku. For each Bedrock call site in the codebase, annotate the task_type tag. Before any launch, produce a table: task_type | complexity_tier | recommended_model | monthly_cost_estimate @ 1M msgs/day. Require architecture sign-off on any call site using Sonnet for a low-complexity task. This taxonomy exercise takes 2 hours and prevents the 12× cost overrun.
Follow-up 2: Validating accuracy parity before migrating to Haiku
Q: How do you ensure the classification accuracy on Haiku is sufficient before you cut over?
A: Run the benchmark_model_parity() function against a labelled test set with at least 100 examples per intent class (500 total). Measurement: accuracy per class (not just overall accuracy — watch for Haiku failing specifically on the "purchase_intent" class, which is the highest business-value signal). Threshold: if Haiku achieves ≥ 95% accuracy on every class and the overall accuracy delta vs. Sonnet is ≤ 2 absolute percentage points, migration is approved. Also measure the confidence score distribution: if Haiku returns low-confidence scores (< 0.5) on 10%+ of samples, further prompt engineering is needed before migration. Shadow run: mirror 5% of production traffic to a Haiku path for 24 hours (shadow mode — Haiku results are logged, not served), then compare classification outcomes via the CloudWatch metric IntentClassificationAccuracy with model as a dimension.
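A hedged sketch of what benchmark_model_parity() could look like, assuming a labelled test set of {"message", "label"} rows; the prompt wording is illustrative and the model IDs are the public Claude 3 Bedrock identifiers (verify availability in your region):

```python
import json
import boto3
from collections import defaultdict

bedrock = boto3.client("bedrock-runtime")

MODELS = {
    "haiku": "anthropic.claude-3-haiku-20240307-v1:0",
    "sonnet": "anthropic.claude-3-sonnet-20240229-v1:0",
}
INTENTS = ["browse", "search", "recommend", "purchase_intent", "other"]

def classify(model_id: str, message: str) -> str:
    """Ask the model for a single intent label."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "system": "Classify the user message into exactly one label: "
                  + ", ".join(INTENTS) + ". Respond with the label only.",
        "messages": [{"role": "user", "content": message}],
    }
    resp = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    return json.loads(resp["body"].read())["content"][0]["text"].strip()

def benchmark_model_parity(test_set: list[dict]) -> dict:
    """Return per-class accuracy per model; test_set rows are {"message": ..., "label": ...}."""
    scores = {name: defaultdict(lambda: [0, 0]) for name in MODELS}
    for row in test_set:
        for name, model_id in MODELS.items():
            hit, total = scores[name][row["label"]]
            pred = classify(model_id, row["message"])
            scores[name][row["label"]] = [hit + int(pred == row["label"]), total + 1]
    return {name: {label: hit / total for label, (hit, total) in per_class.items()}
            for name, per_class in scores.items()}
```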
Follow-up 3: What does the tiered routing architecture look like after the fix?
Q: Describe the code architecture after the fix — how does the system route different tasks to different models?
A: A TieredModelRouter class reads task-type-to-model mappings from AppConfig (not hard-coded) with the structure: {"intent_classify": "haiku", "product_recommendation": "sonnet", "moderation": "haiku"}. Each call site passes task_type to the router, which resolves the model ID. This means a cost override (flip all to Haiku during a spike) or a capability upgrade (move recommendation to Sonnet 3.5) is a config change, not a code deployment. Add task_type and model_id as structured log fields on every Bedrock call for per-task cost attribution in CloudWatch. Set separate cost alarms for Haiku and Sonnet spend to catch a tiering failure — if Haiku cost suddenly drops to zero, the routing logic likely broke and everything is flowing through Sonnet.
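A minimal sketch of the router, assuming the mapping is stored as freeform JSON in an AppConfig hosted configuration; the application, environment, and profile identifiers below are hypothetical:

```python
import json
import boto3

# Hypothetical AppConfig identifiers — replace with the real application,
# environment, and configuration profile names.
APP, ENV, PROFILE = "mangaassist", "prod", "model-routing"

MODEL_IDS = {
    "haiku": "anthropic.claude-3-haiku-20240307-v1:0",
    "sonnet": "anthropic.claude-3-sonnet-20240229-v1:0",
}

class TieredModelRouter:
    """Resolves a task_type to a Bedrock model ID from an AppConfig-hosted mapping."""

    def __init__(self):
        self._appconfig = boto3.client("appconfigdata")
        self._token = self._appconfig.start_configuration_session(
            ApplicationIdentifier=APP,
            EnvironmentIdentifier=ENV,
            ConfigurationProfileIdentifier=PROFILE,
        )["InitialConfigurationToken"]
        self._mapping = {}

    def _refresh(self):
        resp = self._appconfig.get_latest_configuration(ConfigurationToken=self._token)
        self._token = resp["NextPollConfigurationToken"]
        payload = resp["Configuration"].read()
        if payload:  # empty payload means "no change since the last poll"
            # e.g. {"intent_classify": "haiku", "product_recommendation": "sonnet", "moderation": "haiku"}
            self._mapping = json.loads(payload)

    def resolve(self, task_type: str) -> str:
        self._refresh()
        tier = self._mapping.get(task_type, "sonnet")  # default to the most capable tier
        return MODEL_IDS[tier]
```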
Follow-up 4: Ongoing governance to prevent drift
Q: How do you prevent a future engineer from adding a new Bedrock call site that bypasses the tiered router and uses Sonnet for a simple task?
A: Two controls: (1) Static analysis: a custom Semgrep or Bandit rule that detects any direct bedrock.invoke_model(modelId=SONNET_MODEL_ID) call outside the TieredModelRouter module — fails CI on detection. (2) Cost gate: a GitHub Actions step that estimates monthly cost for any PR that adds or modifies a Bedrock call site based on the task type declared, and comments the projection on the PR. PRs where the cost estimate for a new call site exceeds a threshold require architecture team review. Both controls together: the first catches technical bypasses; the second creates economic visibility for product engineers who might not think about model cost when building features.
Grill 1: "Haiku is less capable — we can't risk accuracy drop on purchase intent"
Q: PM says: "Purchase intent classification directly affects conversion. If Haiku misclassifies even 1% more, we lose revenue. We should keep Sonnet." How do you respond?
A: This is a testable hypothesis — run the benchmark before deciding. If the data shows Haiku has materially lower accuracy on purchase_intent, keep Sonnet for that class only, and still migrate all other intent classes (browse, search, other) to Haiku. The economic picture: if purchase_intent is 5% of messages, the Sonnet cost for that 5% is $600/month — a justified cost. The other 55% of classification use on Sonnet is pure waste. The correct response to "we can't risk accuracy drop" is "let me show you what the accuracy delta is so we can make an economic decision." If the delta is 0.3%, that's not a business risk — it's noise. If it's 5%, keep Sonnet for that class. Data-driven class-level decisions, not blanket model selection.
Grill 2: The benchmark shows Haiku accuracy is 2% lower on "recommend" intent specifically
Q: Your benchmark shows Haiku gives 2% lower accuracy on "recommend" intent. The product team says that's too risky. What do you do?
A: Two options: (1) Prompt engineering correction: the 2% gap on "recommend" may be due to the prompt not giving Haiku enough context. Add few-shot examples for the "recommend" class specifically and re-run the benchmark. Claude 3 Haiku is a capable model — a 2-percentage-point gap on a closed classification task is almost always closeable with better prompting, not a fundamental capability gap. (2) Hybrid routing: use Haiku for all intents except "recommend" (keep Sonnet). Still saves ~50% of classification cost. Run both options through the product team, present the cost impact of each, and let them decide with data rather than instinct.
Grill 3: The task type tag is missing on legacy call sites
Q: Half the existing Bedrock call sites in the codebase have no task_type tag — they were written before the taxonomy. How do you retroactively classify them?
A: Systematic approach: (1) Pull all distinct prompt templates or function names from CloudWatch Logs (the structured logs capture function name and approximate prompt prefix). (2) For each unique call site, read the prompt — a human engineer can correctly classify 90% of them in 30 minutes based on the output type expected. (3) For ambiguous cases, run 50 real messages through Sonnet and Haiku and compare outputs — if the outputs are functionally identical (same user experience), classify as low complexity. (4) Add the task_type tag annotation as a refactoring ticket with a 2-week deadline; the CI rule has a 1-month grace period for existing files at warning level (not error). Cost attribution also helps discover un-tagged call sites: any Bedrock call without a task_type dimension falls into the unknown bucket in CloudWatch, and that bucket should be driven to zero by the end of the grace period.
Red Flags — Weak Answer Indicators
- Treating this as purely a cost optimization issue, not a model selection methodology gap
- No concrete parity benchmark numbers — just "we'll test and see"
- Missing the task complexity taxonomy concept — proposing ad hoc per-service decisions
- No cost attribution mechanism (can't diagnose which task types are driving cost)
- Treating Haiku as uniformly inferior without testing
Strong Answer Indicators
- Proposes a three-tier complexity taxonomy with specific model assignments per tier
- Designs accuracy parity benchmark with per-class thresholds, not just overall accuracy
- Builds the tiered router with AppConfig backing so model selection is a config change, not a deployment
- Establishes two complementary enforcement controls (static analysis + PR cost gate)
- Acknowledges class-level analysis for the "purchase_intent" sensitivity concern — doesn't blanket reject Haiku
Scenario 2: Wrong Embedding Model — Poor Japanese Recall
Opening Question
Q: MangaAssist's OpenSearch recall for Japanese-language queries drops to 42% at launch, while English queries hit 78%. Both query types search the same index. What is the single most likely root cause and what is the diagnostic path?
Model Answer
The most likely root cause is the wrong embedding model. Amazon Titan Embeddings Lite v1 was trained predominantly on English corpora and produces poor-quality vector representations for Japanese tokens — Japanese queries don't align semantically with Japanese document embeddings in the vector space. The diagnostic path: (1) Pull failing queries from CloudWatch Logs and segment by detected query language — confirm that failures cluster around Japanese queries. (2) Run an offline cosine similarity test: embed the same Japanese query and its expected document with the current model and measure similarity. If similarity scores are below 0.5 for semantically identical Japanese text, the model is confirmed as the issue. (3) Check the embedding model selection ADR — if it shows only English evaluation data, the root cause is confirmed: the benchmark coverage gap. Remediation: migrate to Amazon Titan Embeddings V2 (amazon.titan-embed-text-v2:0) which has explicit multilingual coverage including Japanese, then re-index the knowledge base and run a bilingual recall benchmark to confirm recovery.
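A hedged sketch of the offline cosine similarity diagnostic (step 2), using the Titan embeddings request/response shape ({"inputText": ...} → {"embedding": [...]}); the Japanese query/document pair is hypothetical:

```python
import json
import math
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed(text: str, model_id: str = "amazon.titan-embed-text-v1") -> list[float]:
    """Embed text with the current embedding model."""
    resp = bedrock.invoke_model(modelId=model_id, body=json.dumps({"inputText": text}))
    return json.loads(resp["body"].read())["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical failing pair: a Japanese query and the document chunk it should retrieve.
query = "ダークファンタジーのおすすめ漫画は?"
expected_doc = "ベルセルクはダークファンタジーの代表作で、三浦建太郎による長編漫画です。"
score = cosine(embed(query), embed(expected_doc))
print(f"cosine similarity: {score:.3f}")  # scores well below ~0.5 point to the embedding model
```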
Follow-up 1: What the migration to Titan V2 entails
Q: What does re-indexing with Titan V2 actually involve in terms of operations?
A: Full re-index procedure: (1) Create a new OpenSearch Serverless index sized for the Titan V2 vector dimension (1,024 by default for V2 vs. 1,536 for V1 — and even at matching dimensions, vectors from the two models are not interchangeable because the geometric spaces differ). (2) Re-embed all documents in the knowledge base by invoking bedrock:InvokeModel for each document chunk with Titan V2. At 50,000 chunks and roughly 10 chunks per second of embedding throughput, this takes approximately 90 minutes. (3) Blue/green index swap via an OpenSearch alias: point the alias from the old index to the new one atomically once re-indexing is complete. (4) Update the query path to embed user queries with Titan V2 before searching. (5) Run the bilingual MRR benchmark against the new index before declaring success. No user-facing downtime during re-index because the alias still points to the old index until cutover.
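A minimal sketch of the atomic alias swap with opensearch-py, assuming the collection endpoint supports index aliases; client construction (host, SigV4 auth) is omitted and the index/alias names are placeholders:

```python
from opensearchpy import OpenSearch

# Client construction is abbreviated; assume `client` is configured for the
# MangaAssist collection endpoint with the appropriate auth.
client = OpenSearch(hosts=[{"host": "search-endpoint.example.com", "port": 443}], use_ssl=True)

OLD_INDEX = "manga-chunks-titan-v1"
NEW_INDEX = "manga-chunks-titan-v2"
ALIAS = "manga-chunks"

# Atomic swap: both actions apply in a single request, so queries against the
# alias never see an intermediate state with zero or two backing indexes.
client.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": OLD_INDEX, "alias": ALIAS}},
        {"add": {"index": NEW_INDEX, "alias": ALIAS}},
    ]
})
```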
Follow-up 2: What metrics to monitor ongoing for bilingual recall health
Q: After migrating to Titan V2, what ongoing metrics prevent a future recall regression?
A: Three ongoing metrics: (1) RetrievalMRR_by_language — mean reciprocal rank segmented by query language (English vs. Japanese). Alarm when MRR for Japanese queries falls below 0.6. (2) ZeroResultRate_by_language — the rate of queries that return zero chunks above the similarity threshold, broken down by language. Alarm when Japanese zero-result rate exceeds 5%. (3) RelevanceJudgment_weekly — a weekly automated eval run using LLM-as-judge to score the top-3 retrieved chunks for 100 sampled Japanese queries on a 1–5 relevance scale. Alert when the mean score drops below 3.5. These three metrics together catch embedding quality degradation (MRR), index coverage gaps (zero-result rate), and subjective relevance issues (LLM judge) from different angles.
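A small sketch of the per-language MRR computation and metric emission; the namespace and metric name mirror the proposal above, and the shape of the offline results rows ({relevant_doc_id, retrieved_ids}) is an assumption:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def mean_reciprocal_rank(results: list[dict]) -> float:
    """results rows: {"relevant_doc_id": ..., "retrieved_ids": [...]} for one language."""
    total = 0.0
    for row in results:
        ids = row["retrieved_ids"]
        rank = ids.index(row["relevant_doc_id"]) + 1 if row["relevant_doc_id"] in ids else None
        total += 1.0 / rank if rank else 0.0
    return total / len(results)

def emit_mrr(language: str, results: list[dict]) -> None:
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Retrieval",  # hypothetical namespace
        MetricData=[{
            "MetricName": "RetrievalMRR_by_language",
            "Dimensions": [{"Name": "Language", "Value": language}],
            "Value": mean_reciprocal_rank(results),
        }],
    )
```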
Follow-up 3: What should the embedding model evaluation checklist have included?
Q: Design the embedding model selection checklist that would have caught this before launch.
A: Mandatory evaluation criteria: (1) Supported languages: explicit documentation from the model provider on training language distribution. Titan Lite's documentation does not list Japanese as a supported language — this is the first filter. (2) Bilingual recall benchmark: run MRR evaluation on a balanced test set (100 EN / 100 JA query-document pairs). Gate at MRR ≥ 0.7 for each supported language. (3) Mixed-language query test: queries that mix Japanese and English terms (e.g., "鬼滅の刃 volume recommendations") — these are common in MangaAssist's user base. (4) Manga-domain terminology: include 20 domain-specific terms (manga genre names, Japanese honorifics, title naming patterns) in the test set to verify domain semantic alignment. (5) Cost per 1M tokens: log this alongside the quality scores — a model that passes all quality gates at half the cost wins.
Follow-up 4: Re-indexing takes 90 minutes — what do you serve users during that time?
Q: During the 90-minute re-indexing operation, the old index is serving users with 42% recall. What is the user experience strategy?
A: Blue/green index approach: the old index remains live and serving during the entire re-indexing operation. The new V2 index is built in parallel, invisible to users (different index name, not reachable via the alias). Only when the V2 index is fully populated and passes the offline MRR benchmark is the alias swapped atomically. The swap takes < 1 second and is invisible to users. Zero degradation during re-index. If the V2 index fails the MRR benchmark post-re-index, the alias is not swapped — the old index keeps serving while the issue is investigated. This is exactly why blue/green alias-based deployment is the correct pattern for search index migrations.
Grill 1: "Re-indexing is expensive — the token cost is $200. Do we really need to do this?"
Q: Finance team asks: "Is a $200 re-indexing operation really justified?" How do you make the case?
A: Japanese users are ~30% of MangaAssist's users. A 42% recall rate means 58% of their queries return irrelevant or no products. At MangaAssist's scale, this translates to: (a) ~174,000 failed searches per day for Japanese users; (b) a direct conversion rate impact, estimated at 3–5% of those users abandoning the session; (c) CSAT damage. The $200 one-time re-index cost is insignificant against even a 0.5% improvement in conversion for Japanese users at any meaningful GMV. The business case is trivially positive. Making the argument once also settles future "should we test multilingual models?" debates: validating the model upfront costs far less than what we are paying now — a full re-index plus the business impact of the gap window.
Grill 2: Titan V2 also has lower recall for certain rare manga genres
Q: Post-migration benchmark shows Titan V2 has 85% MRR for mainstream genres but only 65% for rare/niche genres. How do you proceed?
A: 65% MRR for niche genres is still a major improvement over 42% overall. Short-term: ship the migration, accept the 65% for niche genres, and monitor the zero-result rate for those genre queries. Medium-term: two options — (1) Coarse-to-fine retrieval: for niche genre queries, execute a BM25 keyword search in parallel with the vector search and merge results using a reciprocal rank fusion strategy. BM25 handles exact-match and rare-token queries better than embedding similarity. (2) Fine-tuning: if niche genre coverage remains poor after 90 days of data collection, trigger an embedding model fine-tuning job using contrastive loss on niche-genre query-document pairs. This is why you keep the bilingual MRR monitoring by genre sub-category — it guides where to invest next.
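A compact sketch of the reciprocal rank fusion merge for option (1); the document IDs are hypothetical and k=60 is the commonly used smoothing constant:

```python
def reciprocal_rank_fusion(bm25_hits: list[str], vector_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of document IDs using reciprocal rank fusion.

    Each document's score is the sum of 1/(k + rank) over every list it appears in,
    so a rare title surfaced only by BM25 still ranks well in the merged list.
    """
    scores: dict[str, float] = {}
    for hits in (bm25_hits, vector_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for a niche-genre query: BM25 catches the exact-match rare title.
merged = reciprocal_rank_fusion(["MNG-8812", "MNG-0143"], ["MNG-0143", "MNG-5520", "MNG-8812"])
```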
Red Flags — Weak Answer Indicators
- Proposing a re-index without a benchmark comparing old vs. new model first
- Missing the blue/green alias approach — suggesting downtime for re-indexing
- No multilingual test set design — just "run some Japanese queries and see"
- Not segmenting MRR by language in the ongoing monitoring
Strong Answer Indicators
- Immediately establishes cosine similarity diagnostic to confirm wrong embedding model before investing in a fix
- Designs a five-criteria embedding model evaluation checklist with quantitative thresholds
- Proposes blue/green index alias swap for zero-downtime migration
- Creates ongoing MRR monitoring segmented by language with explicit alarm thresholds
- Handles the niche genre gap with a practical coarse-to-fine retrieval hybrid
Scenario 3: Context Window Underestimated — Prompt Truncation in Multi-Turn Sessions
Opening Question
Q: MangaAssist users in long manga recommendation sessions start experiencing nonsensical or context-blind responses after ~25 turns. Bedrock occasionally throws ValidationException: input length exceeds model maximum. The model was benchmarked as having a 200K token context window. Explain the root cause and design a fix.
Model Answer
The context window benchmark was done on single-turn interactions. Multi-turn sessions accumulate: the system prompt (constant overhead), conversation history (grows linearly with turns), and RAG context chunks (injected on each turn and then carried forward inside that history). At 25 turns averaging 500 tokens per assistant response and 200 tokens per user message, the dialogue alone is 17,500 tokens; with ~5,000 tokens of RAG context retained per turn, each turn adds roughly 5,700 tokens, so with the 1,000-token system prompt the assembled prompt is already around 145,000 tokens by turn 25 and reaches 180,000–210,000 tokens for power users deeper into a session. The benchmark didn't simulate multi-turn session growth: a single-turn assessment is 3–5K tokens → well within 200K; in reality the 200K window is exhausted by turn 25–30. The fix requires a token budget manager: partition the 200K window into fixed allocations — system prompt: 2K, RAG context: 8K, recent history: 120K, new message + response reservation: 20K. When accumulated history exceeds the history budget, summarize old turns with a Haiku call (preserving key signals) and replace them with the summary.
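A hedged sketch of the token budget manager's core loop — conservative estimation, a Haiku summarization helper, and the trim decision. The budget values mirror the partition above; helper names, the summarization prompt, and the model ID are illustrative:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

# Fixed allocations from the partition above (tokens). The 20K response
# reservation is never squeezed by growing input.
BUDGETS = {"system": 2_000, "rag": 8_000, "history": 120_000, "response": 20_000}

def estimate_tokens(text: str) -> int:
    """Conservative estimate: ~3 characters per token for mixed JP/EN content."""
    return max(1, len(text) // 3)

def summarize_old_turns(turns: list[dict]) -> str:
    """Compress old turns into a short summary via a cheap Haiku call."""
    transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in turns)
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "system": "Summarize this conversation. Preserve manga titles, stated "
                  "preferences, dislikes, and purchase/wish-list signals. Drop filler.",
        "messages": [{"role": "user", "content": transcript}],
    }
    resp = bedrock.invoke_model(modelId=HAIKU, body=json.dumps(body))
    return json.loads(resp["body"].read())["content"][0]["text"]

def fit_history(history: list[dict]) -> tuple[str, list[dict]]:
    """Return (summary_text, turns_to_keep). The caller folds summary_text into
    the system prompt so the messages list still starts with a user turn."""
    used = sum(estimate_tokens(t["content"]) for t in history)
    if used <= BUDGETS["history"]:
        return "", history
    return summarize_old_turns(history[:-5]), history[-5:]
```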
Follow-up 1: How to estimate tokens without a tokenizer
Q: The token budget manager needs to estimate tokens before calling Bedrock. How do you do this reliably?
A: Conservative estimation rule: 3 characters per token for Japanese text (kanji/kana are denser than English), 4 characters per token for English. The estimate_tokens(text) function uses len(text) // 3 for mixed JP/EN content — the conservative bias intentionally overestimates to ensure you never exceed the budget. For production accuracy: the Bedrock response body includes usage.input_tokens and usage.output_tokens — log these on every call and compare to the estimate. After 5,000 calls, analyze the estimate vs. actual ratio and calibrate the constant. The estimate-then-calibrate loop costs nothing and is far cheaper than running a full tokenizer pass on every prompt assembly step. Key constraint: always reserve 20K tokens for the response — never let the reserved output budget be squeezed by growing input. The model truncating its output mid-sentence is worse for user experience than the manager trimming old history.
Follow-up 2: Quality of summarized history vs. verbatim history
Q: When you summarize old turns, does the model's recommendation quality degrade?
A: It depends on what is preserved in the summary. The summarization prompt must explicitly preserve: (1) manga titles mentioned by name, (2) user preference signals ("I prefer dark fantasy", "I don't like rom-com"), (3) purchase or wish-list signals, (4) explicit dislikes ("I already read One Piece"). General conversational filler ("thanks!", "great suggestion") is safe to lose. A well-designed summarization prompt retains the semantically dense user preference model, not the literal dialogue. Test: run a 30-turn session test with both verbatim history and summarized history, score recommendation quality using LLM-as-judge on the last 5 turns. If the scores are within 0.3 of each other on a 1–5 scale, summarization is acceptable. Run this test before shipping the summary feature. Expected result: summarization wins slightly on recommendation coherence because recent turns dominate history, and compression removes noise.
Follow-up 3: Which part of the token budget is the biggest lever?
Q: If you had to cut 50K tokens from the total budget, where would you cut?
A: Priority order to cut: (1) Old conversation history first — summarize turns 1–20 into a 300-word summary. Saves 10,000–15,000 tokens with minimal quality loss. (2) RAG context chunks — reduce from 5 chunks to 3 chunks per turn. Each chunk is approximately 800 tokens, so dropping 2 chunks saves 1,600 tokens per turn, or 40,000 tokens across a 25-turn session. Test the retrieval quality impact: if MRR drops < 5% with 3 chunks vs. 5, this is a good trade. (3) System prompt — review for verbosity. Rewriting a 2,000-token system prompt down to 800 tokens frees 1,200 tokens on every request. Never cut the response reservation — that's user-facing quality. Never cut the most recent 5 turns of history — that's the active working context.
Follow-up 4: Detecting sessions at risk before they hit the limit
Q: How do you detect a session approaching the token limit and act preemptively?
A: Add a token_utilization custom CloudWatch metric emitted on every Bedrock invocation: current_prompt_tokens / 200000 × 100. Alarm thresholds: (1) 60% utilization triggers proactive history summarization on the next turn — don't wait for 95%. (2) 80% utilization triggers an engineering alert (anomaly, investigate prompt construction). (3) 90% utilization returns a user-visible warning: "Your conversation is getting long — I'll summarize our earlier discussion for clarity." This gives users transparent context about what's happening and prevents the jarring experience of suddenly context-blind responses. The 60% trigger for proactive summarization is the key: it maintains quality by summarizing while context is still rich, not as a last-resort when the window is nearly full.
Grill 1: Summarization uses Haiku — that's more cost
Q: Every summarization call costs tokens in Haiku. That adds per-session cost. Is this acceptable?
A: Quantify the cost: a history spanning 20 turns × 700 tokens/turn = 14,000 tokens to summarize. A 400-token Haiku output summary. Haiku input cost = 14,000 × $0.25/1M = $0.0035. Haiku output = 400 × $1.25/1M = $0.0005. Total summarization cost: $0.0040 per session that hits the window. Contrast: without summarization, the next Bedrock call fails with a ValidationException — you lose the entire session. Or the user gets context-blind responses that damage CSAT and drive churn. $0.004 to preserve a long-session user experience is not a cost problem. It triggers at most once per session (when history crosses 120K tokens) — not on every turn.
Grill 2: The summarization itself exceeds the context window if history is already 190K tokens
Q: If the assembled prompt is already 190K tokens, you can't call Haiku with all that history for summarization — it also has a context limit. How do you handle this?
A: The summarization strategy must use a sliding window of old turns, not the entire history at once. Summarize turns 1–10 in one Haiku call (maximum 10 turns × 700 tokens = 7,000 tokens — trivially fits in Haiku's 200K context), then summarize turns 11–20 in a second call, then concatenate the two summaries into one. This gives a 600-word summary for 20 turns in two inexpensive Haiku calls. The budget manager triggers this proactively at 60% utilization, not at 190K tokens. If the proactive trigger was missed (e.g., due to a bug) and the window is at 190K, the only safe path is to drop the oldest 30% of turns without summarization and log a context_overflow_recovery metric — better to lose some history than to crash the session.
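A small sketch of the sliding-window variant, reusing the summarize_old_turns() helper from the budget-manager sketch earlier in this scenario (window size and helper names are illustrative):

```python
def summarize_in_windows(history: list[dict], window: int = 10) -> str:
    """Summarize a very long history in fixed-size windows of turns, then concatenate.

    Each window (~10 turns × ~700 tokens) fits trivially within a single cheap
    Haiku call, so this works even when the full history is near the 200K limit.
    """
    summaries = []
    for start in range(0, len(history), window):
        summaries.append(summarize_old_turns(history[start:start + window]))
    return " ".join(summaries)
```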
Red Flags — Weak Answer Indicators
- Treating 200K context as "more than enough" without simulating multi-turn growth
- No token budget allocation across components (system prompt, history, RAG, response)
- Missing the proactive summarization trigger — relying on error handling at 100% utilization
- Not handling the summarization-of-large-history edge case
Strong Answer Indicators
- Explicitly partitions the 200K window into fixed component budgets
- Uses conservative 3-char/token estimate for Japanese text
- Triggers summarization at 60% utilization (proactive, not reactive)
- Uses sliding-window multi-call summarization for histories that exceed Haiku's comfortable input range
- Quantifies summarization cost at $0.004 per affected session to justify the approach
Scenario 4: English-Only Benchmark — Japanese Query Blind Spot
Opening Question
Q: MangaAssist's production chatbot performs well in QA testing but shows a 22% drop in CSAT specifically for Japanese-speaking users post-launch. The benchmark evaluation set had 500 questions, all in English. Explain what happened, how you detect it, and redesign the benchmark process.
Model Answer
The benchmark coverage gap: the evaluation set was built from internally authored English documentation, not sampled from the distribution of expected production queries. Japanese users represent ~30% of MangaAssist's user base but 0% of the benchmark. The FM was never tested for Japanese comprehension, kanji/kana handling, mixed-language manga queries (e.g., "鬼滅の刃のおすすめ巻は?"), or bilingual product name queries. The production FM handled English queries well because it was both trained on English and benchmarked on English. Japanese queries hit the model blind — no one knew how well it handled them. Detection: post-launch, segment CSAT scores by detected query language. If Japanese CSAT is statistically significantly lower than English CSAT, the language gap is confirmed. Then run an offline evaluation on a sample of Japanese queries from production logs. Redesign: the benchmark must have language distribution matching production — at minimum 30% Japanese for MangaAssist.
Follow-up 1: How to build a representative multilingual evaluation set
Q: You need to build a bilingual benchmark before the next model version launches. Walk me through the construction process.
A: Construction steps: (1) Sample from the expected production distribution: use query log analysis (even from a beta cohort or internal testing) to identify the realistic distribution of query types and languages. If no logs are available, use the product team's user research — MangaAssist knows its user demographics. (2) Stratify by language AND query type: for example, 150 EN questions per intent class × 5 intent classes = 750 EN, and 50 JP questions per intent class × 5 = 250 JP as a minimum, each question paired with a reference answer. (3) Include domain-specific terms: manga genre names, Japanese character names, bilingual product descriptions. These are where general-purpose LLMs fail most often. (4) Include adversarial cases: mixed-language queries, transliteration ambiguity (e.g., "Ruroni Kenshin" vs. "Rurouni Kenshin"), ambiguous intent in Japanese. (5) Version and store the set in S3 with the ground-truth labels so every future model evaluation uses the same set and deviations are detectable.
Follow-up 2: Minimum passing thresholds per language
Q: What are the minimum quality thresholds a model must pass per language before being approved for production?
A: Per-language gates using three metrics: (1) Relevance (1–5 scale, LLM-as-judge): English ≥ 3.8, Japanese ≥ 3.5 (slightly relaxed for Japanese because graders must be Japanese-speaking, adding noise — use consistent graders). (2) Language fidelity: if the user asks in Japanese, the response must be in Japanese. Score this as a binary 1/0 per question. Gate: language fidelity ≥ 98% for each language. (3) Factual correctness on product-specific questions (title, author, genre): ≥ 90% for each language. Composite gate: all three metrics must pass for all languages. Any single language failing any single metric blocks promotion. Log the per-language scores alongside the overall score in the Model Registry model package metadata so reviewers can see exactly where the language performance stood at promotion time.
Follow-up 3: The 500-question English benchmark passed — but clearly wasn't sufficient. What is the new minimum benchmark size?
Q: Is there a principled way to determine how large the benchmark set should be for each language?
A: Power analysis: you need enough samples to detect a 0.2-point difference on a 5-point scale with 80% statistical power. For relevance scores with typical standard deviation of 0.8–1.0, this requires ~170 samples per language per intent class. For MangaAssist's 5 intent classes (browse, search, recommend, purchase-intent, other) × 2 languages = 10 cells × 170 = 1,700 minimum. In practice, 200 per language per run is a workable approximation — 200 EN + 200 JP = 400. More important than absolute sample size is representativeness: 200 samples drawn from real production query distribution beats 2,000 samples authored in-house in English. The 500-question set failed not because it was too small but because it had zero Japanese — even a small Japanese cell would have exposed the gap.
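A quick way to sanity-check the sample-size claim, assuming a one-sample comparison against a fixed score threshold; the detectable delta and standard deviation are the assumptions stated above (a paired or two-sample comparison would need more samples):

```python
from statsmodels.stats.power import TTestPower

# Smallest difference we care about on the 1-5 relevance scale, and an
# assumed score standard deviation drawn from the 0.8-1.0 range above.
min_detectable_delta = 0.2
score_sd = 0.9

effect_size = min_detectable_delta / score_sd
n = TTestPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.8,
                             alternative="two-sided")
print(f"samples needed per (language, intent) cell: {n:.0f}")  # ~160-170 under these assumptions
```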
Follow-up 4: Continuous monitoring of language performance post-launch
Q: After launch with a properly validated model, how do you detect language-quality degradation over time?
A: Add a language detection step to the query pipeline and emit per-language quality metrics: (1) LLMAsJudgeScore_by_language — a sample of 50 real queries per language per day scored by an automated LLM-as-judge. Alarm when Japanese mean score drops below 3.5 for 3 consecutive days. (2) CSAT_by_language — if explicit user feedback is available, segment by detected query language. (3) LanguageFidelityRate — what fraction of Japanese queries received a Japanese response. Alarm at < 97%. (4) Language shift detection: if the fraction of Japanese queries increases (e.g., due to a promotional campaign targeting Japan), the existing benchmark may no longer reflect the new production distribution. Emit QueryLanguageDistribution daily and alert when any language's share shifts more than 10 percentage points month-over-month — that's a signal to refresh the benchmark.
Grill 1: "We have no Japanese-speaking engineers to write evaluation questions"
Q: The team says they don't have any Japanese-speaking engineers to build the Japanese benchmark. How do you proceed?
A: Three options, in preference order: (1) Customer seed data: run a beta cohort survey — ask 50 Japanese beta users to each submit 10 questions they would ask the manga chatbot. That yields 500 real Japanese questions with user-expected answers, no engineers needed. (2) Professional translation: hire a manga-literate Japanese translator via a localization service to translate the English benchmark into natural Japanese (not machine translation — manga terminology requires domain knowledge). 500 questions translated in 2–3 business days. (3) Synthetic generation with validation: prompt a strong Japanese-language LLM to generate 200 Japanese queries across the intent taxonomy, with a native-speaking reviewer auditing for naturalness. Each option has tradeoffs: option 1 is the most authentic; option 2 is fastest for a versioned dataset; option 3 is cheapest. The wrong answer is "we can't do Japanese evaluation" — that's choosing to ship a product that fails 30% of users without measurement.
Grill 2: The LLM-as-judge scores Japanese responses in English — is the judgment valid?
Q: Your LLM-as-judge evaluator is an English-language Claude 3 Sonnet model. Can it accurately judge the quality of a Japanese response to a Japanese question?
A: This is a valid concern. A judge model that doesn't understand Japanese cannot assess: language naturalness, idiomatic correctness, appropriate politeness level (keigo vs. casual), or whether a translated term is the standard Japanese term for a manga concept. Solution: use a bilingual judge prompt and a bilingual judge model. Claude 3 Sonnet is trained on multilingual data and can evaluate Japanese responses. To verify judge quality: run the judge on 50 pairs where you know the ground truth (a correct Japanese response vs. a Japanese response with a deliberate error). If the judge correctly identifies the better response 90%+ of the time, it's sufficiently reliable. Additionally, add a language_fidelity dimension to the judge prompt: "Did the response use the same language as the question? [Yes/No]" — this doesn't require understanding Japanese content, just language identification.
Red Flags — Weak Answer Indicators
- Accepting a 500-question English-only benchmark as "sufficient" without questioning language coverage
- No language distribution analysis on expected production traffic before benchmark construction
- Missing language fidelity as a separate evaluation dimension
- LLM-as-judge evaluating Japanese without acknowledging the multilingual judge capability requirement
Strong Answer Indicators
- Immediately proposes sampling from production distribution (not internal authoring) for benchmark construction
- Designs per-language quality gates with language-specific thresholds
- Addresses benchmark construction without Japanese engineers via 3 practical alternatives
- Validates LLM-as-judge on a ground-truth set before trusting it for Japanese evaluation
Scenario 5: FM Cannot Reliably Produce Strict JSON — Parser Failures at Scale
Opening Question
Q: MangaAssist's cart-update and recommendation APIs require the chatbot to return strict JSON objects. In production, 8% of these calls raise json.JSONDecodeError because the FM wraps responses in markdown fences, adds prose, or returns malformed objects. What is the root cause, what is the immediate fix, and how do you prevent this from reaching production in the future?
Model Answer
The root cause is that structured output compliance was never included as an FM evaluation dimension during model selection. The model was evaluated for answer quality (relevance, helpfulness) but not for JSON format compliance. Claude 3 models can produce JSON reliably — but only with careful prompt engineering: a system prompt that specifies the exact schema, includes few-shot examples showing the expected output format, and explicitly prohibits markdown fencing or explanatory prose. A loose instruction like "respond with JSON" leaves the model free to wrap the output as it chooses. The immediate fix has two layers: (1) JSON extraction wrapper with retry logic — a function that strips markdown fences (```json ... ```), finds the first JSON object using regex, and attempts json.loads(). If parsing fails after 3 retry attempts with increasingly strict prompt instructions, raise a RuntimeError and fall back to a safe default. (2) Immediate prompt fix — update the system prompt with explicit schema, example, and constraints. Going forward: add a structured output compliance test suite that gates CI — 100 structured prompts, 97% parse success rate required.
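A hedged sketch of the extraction-and-retry wrapper described in layer (1); the regexes, the retry instruction, and the safe default are illustrative:

```python
import json
import re

# Hypothetical safe default returned when all retries fail on a non-critical path.
SAFE_DEFAULT = {"action": "none"}

FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)
OBJECT_RE = re.compile(r"\{.*\}", re.DOTALL)

def extract_json_from_response(text: str) -> dict:
    """Strip markdown fences and surrounding prose, then parse the first JSON object."""
    fenced = FENCE_RE.search(text)
    candidate = fenced.group(1) if fenced else text
    match = OBJECT_RE.search(candidate)
    if not match:
        raise json.JSONDecodeError("no JSON object found", candidate, 0)
    return json.loads(match.group(0))

def invoke_structured(call_model, prompt: str, max_attempts: int = 3) -> dict:
    """call_model(prompt) -> raw text; retries with an increasingly strict instruction."""
    for attempt in range(max_attempts):
        suffix = "" if attempt == 0 else (
            "\nReturn ONLY the raw JSON object. No markdown fences, no prose.")
        try:
            return extract_json_from_response(call_model(prompt + suffix))
        except json.JSONDecodeError:
            continue
    return SAFE_DEFAULT  # or raise RuntimeError on a critical path
```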
Follow-up 1: What the improved system prompt looks like
Q: Write the key elements of a system prompt that ensures consistent JSON output.
A: Four required elements in the system prompt: (1) Role + constraint: "You are a MangaAssist action parser. You MUST respond with ONLY a valid JSON object matching the schema below. No prose, no markdown fences, no explanation before or after the JSON." (2) Explicit schema: { "action": "string — one of: add_to_cart|remove_from_cart|view_product", "product_id": "string — pattern: MNG-\\d{4}", "quantity": "integer — minimum 1" }. (3) Concrete example: Example output: {"action": "add_to_cart", "product_id": "MNG-1234", "quantity": 2}. (4) Negative example: "NOT: ```json { ... } ``` — never use markdown fences. NOT: 'Sure! Here is the JSON: { ... }' — no introductory prose." The combination of schema, positive example, and negative example anchors the model on the expected format and typically achieves 99%+ JSON compliance with Claude 3 Haiku or Sonnet.
Follow-up 2: Should you use Claude's tool-use API instead of free-text prompting?
Q: Claude has a tool-use (function-calling) API that guarantees structured JSON output. Why not just use that?
A: Tool-use (function-calling) is the right default for all action-parsing use cases. It guarantees structured output because Bedrock returns a tool_use block with the arguments as a strongly-typed dictionary — no JSON extraction needed, no parse failures. The free-text-prompting approach requires an extraction wrapper as a reliability measure; tool-use makes the wrapper unnecessary. For MangaAssist's cart-update and recommendation action parsing: define a record_cart_action tool with a strict parameter schema; call invoke_model with tools=[{...}] and tool_choice={"type": "tool", "name": "record_cart_action"}. The FM cannot respond outside the tool schema. Limitation: tool-use does not work with Bedrock streaming (InvokeModelWithResponseStream). For the streaming chat path where you also need to detect intent/action mid-stream, you need to design a two-phase response: stream the natural-language response, then at the end use a non-streaming tool-call to extract the structured action. This design should have been part of the FM capability assessment before launch.
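A minimal sketch of the tool-use call for the cart action, using the Anthropic messages format on Bedrock with a forced tool_choice; the tool name and schema mirror the examples above and are placeholders:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

CART_TOOL = {
    "name": "record_cart_action",
    "description": "Record a cart or recommendation action parsed from the user message.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["add_to_cart", "remove_from_cart", "view_product"]},
            "product_id": {"type": "string", "pattern": "^MNG-\\d{4}$"},
            "quantity": {"type": "integer", "minimum": 1},
        },
        "required": ["action", "product_id", "quantity"],
    },
}

def extract_cart_action(message: str) -> dict:
    """Force the model to answer through the tool; arguments come back as a typed dict."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "tools": [CART_TOOL],
        "tool_choice": {"type": "tool", "name": "record_cart_action"},
        "messages": [{"role": "user", "content": message}],
    }
    resp = json.loads(bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0", body=json.dumps(body))["body"].read())
    tool_block = next(b for b in resp["content"] if b["type"] == "tool_use")
    return tool_block["input"]  # e.g. {"action": "add_to_cart", "product_id": "MNG-1234", "quantity": 2}
```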
Follow-up 3: Quantifying the production impact of 8% parse failure
Q: 8% parse failure rate — what is the actual user impact and business impact?
A: Quantify directly: (1) At 1M messages/day, if 10% of messages trigger a cart or recommendation API call, that is 100,000 structured calls/day. 8% failure = 8,000 failed cart/recommendation operations per day. (2) Each failure either: (a) surfaces as a 500 to the user (cart update fails), or (b) silently falls back to a generic response (recommendation disappears). Cases (a) and (b) each represent a negative user experience event. Assuming 30% conversion rate on recommendations, 8,000 failures × 30% = 2,400 lost recommendation conversions per day. At even a modest GMV per conversion, this is a significant daily revenue leak. (3) 500 error rate from JSON parse failures: alarm threshold should be 0.5% of structured calls, not 8%. At 0.5%, you investigate. At 8%, you have a production incident. The alarm was missing.
Follow-up 4: CI enforcement for structured output compliance
Q: What does a CI test suite for JSON schema compliance look like?
A: Test suite components: (1) Schema compliance test: invoke the FM with 100 varied user messages that should trigger a cart action. For each response, call extract_json_from_response() and jsonschema.validate(). Assert parse success rate ≥ 97% and schema validation pass rate ≥ 97%. (2) Negative case tests: messages that should NOT trigger an action (e.g., "what's the weather?") — assert the FM either returns a non-action indicator or is caught by the system prompt's fallback behavior. (3) Edge case tests: very long product names, unusual Unicode characters, mixed JP/EN product codes. (4) Retry behavior test: mock the FM to return a malformed JSON on the first attempt — assert the retry logic returns a valid JSON on the second attempt. Test runs in < 60 seconds using mocked Bedrock responses for most cases, with 10 live calls for end-to-end validation. Gate: any PR that modifies the system prompt, the extraction wrapper, or the tool schema must pass the full compliance suite.
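A hedged sketch of two of these tests in pytest form; the fixture module, schema constant, and call_model() helper are hypothetical, and extract_json_from_response()/invoke_structured() refer to the wrapper sketch in the model answer above:

```python
import jsonschema

# Hypothetical fixture module: ACTION_SCHEMA mirrors the record_cart_action input
# schema, ACTION_PROMPTS is the 100-message labelled set, and call_model() invokes
# Bedrock (live for the small end-to-end subset, mocked elsewhere in CI).
from compliance_fixtures import ACTION_SCHEMA, ACTION_PROMPTS, call_model
from structured_output import extract_json_from_response, invoke_structured

def test_schema_compliance_rate():
    passed = 0
    for prompt in ACTION_PROMPTS:
        try:
            payload = extract_json_from_response(call_model(prompt))
            jsonschema.validate(payload, ACTION_SCHEMA)
            passed += 1
        except Exception:
            continue
    assert passed / len(ACTION_PROMPTS) >= 0.97

def test_retry_recovers_from_malformed_first_attempt():
    responses = iter([
        'Sure! Here is the JSON: {"action": "add_to_cart"',  # malformed first attempt
        '{"action": "add_to_cart", "product_id": "MNG-1234", "quantity": 1}',
    ])
    result = invoke_structured(lambda prompt: next(responses), "add MNG-1234 to my cart")
    assert result["product_id"] == "MNG-1234"
```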
Grill 1: "97% is good enough — the retry wrapper handles the rest"
Q: Engineering lead says: "97% + retry handles the remaining 3%. Why invest in tool-use migration?" How do you respond?
A: The 97% + retry argument has three hidden costs: (1) Latency: each retry is a full Bedrock round-trip. A 3% retry rate at 100,000 structured calls/day = 3,000 extra Bedrock calls, each adding 1–2 seconds of latency for the affected user. That is 3,000 users per day experiencing latency spikes. (2) Cost: 3,000 extra Sonnet calls ≈ 3,000 × $0.003 = $9/day = $270/month in retry overhead. (3) Compounding failures: after 3 retries, the error surfaces as a 500. The retry wrapper doesn't eliminate failures — it reduces them from 8% to ~0.3% (3 retries succeed on the second or third attempt most of the time, but not always). Tool-use eliminates the entire retry chain: 0% parse failures, 0 retry overhead, 0 extra cost. The migration to tool-use is a one-time 2-day engineering investment to remove a permanent operational burden. The argument for the wrapper is "good enough for now" — the argument for tool-use is "correct architecture."
Grill 2: The tool-use migration breaks the streaming path
Q: After proposing tool-use, the engineer points out that the streaming path for chat responses will break because tool-use doesn't work with InvokeModelWithResponseStream. How do you handle this?
A: Design a two-phase response for messages that might require action extraction: (1) Phase 1 (streaming): stream the natural-language response to the user via InvokeModelWithResponseStream. Include a subtle instruction in the prompt: "If the user requested an action (cart, recommendation), include [ACTION_NEEDED] at the end of your response." Detect [ACTION_NEEDED] in the stream. (2) Phase 2 (non-streaming tool call): after the stream completes, if ACTION_NEEDED was detected, make a single non-streaming InvokeModel call with tool-use to extract the structured action data. This call happens asynchronously while the user reads the streamed response. The user's latency for the natural-language answer is unchanged. The action extraction adds 200–400ms after the last token — invisible if the UI processes the action after completing the text display. This two-phase design is the standard pattern for hybrid streaming + structured-extraction use cases.
Red Flags — Weak Answer Indicators
- Treating the retry wrapper as the final solution rather than a stopgap
- Not mentioning Claude's tool-use (function-calling) API as the architecturally correct solution
- No CI compliance test suite — relying on production observation to discover parse failures
- Missing the streaming + tool-use incompatibility challenge in the architecture discussion
Strong Answer Indicators
- Designs a 4-element system prompt (role+constraint, schema, positive example, negative example)
- Proposes tool-use as the correct architecture for action-parsing use cases
- Correctly identifies the streaming incompatibility and designs a two-phase response
- Creates a CI compliance test suite with 97% gate, schema validation, and edge cases
- Quantifies the 8% failure rate as 8,000 failed operations/day with direct conversion revenue impact