
10: Interview Q&A — Troubleshoot GenAI Applications

AIP-C01 Mapping

Task 5.2 → Skills 5.2.1–5.2.5: 30 interview questions covering all five troubleshooting skills. Mix of technical, behavioral, and system design questions. All answers are grounded in the MangaAssist architecture.


Skill 5.2.1 — Content Handling Troubleshooting

Q1 (Technical — Medium)

How would you detect that a foundation model is silently truncating input context?

Compare the input_tokens count in the Bedrock response metadata against your expected token count from pre-flight estimation. If they diverge, content was truncated. In MangaAssist, we implemented a TokenBudgetManager that estimates tokens per section (system prompt, history, RAG context, user message) before assembly. If the total exceeds the practical limit, we compress or drop low-priority sections and log a TruncationEvent metric. The key insight is that the FM will never raise an error — you must measure proactively.
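
A minimal sketch of this comparison, assuming a simple characters-per-token heuristic; the function names (`estimate_tokens`, `detect_truncation`) are illustrative, not the actual TokenBudgetManager API:

```python
# Hypothetical sketch: pre-flight token estimation vs. the model's reported
# input_tokens. If the reported count is well below the estimate, content
# was silently dropped somewhere in assembly.

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough pre-flight estimate; a real system may use a tokenizer."""
    return max(1, round(len(text) / chars_per_token))

def detect_truncation(sections: dict, reported_input_tokens: int,
                      tolerance: float = 0.10) -> bool:
    """Flag truncation when the reported input token count falls more than
    `tolerance` below our estimate for the assembled prompt sections."""
    estimated = sum(estimate_tokens(s) for s in sections.values())
    return reported_input_tokens < estimated * (1 - tolerance)

sections = {"system": "x" * 400, "history": "y" * 2000, "rag": "z" * 1600}
# Estimated ~1000 tokens; a reported 600 indicates silent truncation.
print(detect_truncation(sections, reported_input_tokens=600))   # True
print(detect_truncation(sections, reported_input_tokens=980))   # False
```

The tolerance absorbs normal estimation error, so only a significant divergence fires the TruncationEvent metric.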

Q2 (Technical — Hard)

A chatbot user reports an incomplete answer about a long product FAQ. Walk through your debugging process.

  1. Pull the trace ID from the session and check the BudgetUtilization metric — was it near 100%?
  2. Check the Bedrock invocation log: compare input_tokens against the expected assembly size.
  3. If truncated, identify which section was dropped. In MangaAssist's FIFO assembly, the last section appended (usually RAG context) gets truncated.
  4. Check if the TruncationDetector fired — if it didn't, that's a monitoring gap.
  5. Root cause: the FAQ content was 2,400 tokens but the budget only had 800 tokens remaining after history.
  6. Fix: set FAQ content as FIXED priority and conversation history as COMPRESSIBLE.

Q3 (Behavioral — Medium)

Tell me about a time you had to choose between truncating conversation history and truncating retrieved context.

In MangaAssist, during a multi-turn support session (20+ turns), both conversation history and RAG context competed for the same budget. I analyzed which content the FM relied on more: for support intents, the first 2 turns (issue description, order number) were critical, while mid-session turns were clarifications. I implemented priority-based compression: first 2 turns as FIXED, clarifications as COMPRESSIBLE, RAG context as VARIABLE with a minimum guaranteed allocation. This preserved the issue context while still providing product information.
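
The priority scheme above can be sketched as a small allocation routine. This is a simplified model, assuming only COMPRESSIBLE sections absorb overflow; the minimum-guaranteed-allocation detail for VARIABLE sections is omitted, and all names are illustrative:

```python
# Priority labels; in this simplified sketch only COMPRESSIBLE sections
# may shrink to fit the budget.
FIXED, VARIABLE, COMPRESSIBLE = 0, 1, 2

def fit_to_budget(sections, budget):
    """sections: list of (name, priority, tokens). FIXED sections are kept
    whole; COMPRESSIBLE sections absorb the overflow, shrunk from the end."""
    overflow = sum(t for _, _, t in sections) - budget
    result = []
    for name, prio, tokens in reversed(sections):
        if overflow > 0 and prio == COMPRESSIBLE:
            cut = min(tokens, overflow)
            tokens -= cut
            overflow -= cut
        result.append((name, tokens))
    return list(reversed(result))

sections = [("turn1_issue", FIXED, 300),
            ("clarifications", COMPRESSIBLE, 900),
            ("rag_context", VARIABLE, 600)]
print(fit_to_budget(sections, budget=1200))
# clarifications shrink from 900 to 300 tokens; FIXED and VARIABLE survive
```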

Q4 (Technical — Medium)

How does Japanese text affect token budget calculations differently from English?

Japanese characters (hiragana, katakana, kanji) typically encode as 1–3 tokens per character, versus English at approximately 1 token per 4 characters. A 1,000-character Japanese sentence takes roughly 2–3× more tokens than a 1,000-character English sentence. In MangaAssist, we detect the Japanese character ratio in the input and adjust the chars_per_token estimate from 4 (English) to 2 (Japanese). This prevents under-estimation that would lead to unexpected truncation on Japanese-heavy requests.
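
A hedged sketch of the language-aware estimate: the Unicode ranges cover hiragana, katakana, and common kanji, and the 4-vs-2 chars-per-token figures follow the heuristic above. The 0.5 ratio cutoff is an assumption for illustration:

```python
def japanese_ratio(text: str) -> float:
    """Fraction of characters in the hiragana/katakana/kanji ranges."""
    jp = sum(1 for ch in text
             if "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff")
    return jp / max(1, len(text))

def estimate_tokens(text: str) -> int:
    # Mostly-Japanese text uses ~2 chars/token, English-like text ~4
    chars_per_token = 2.0 if japanese_ratio(text) > 0.5 else 4.0
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Hello, how are you today?"))  # 25 chars -> ~6 tokens
print(estimate_tokens("こんにちは、元気ですか"))       # 11 chars -> ~6 tokens
```

Both strings land near the same token estimate despite the large character-count gap, which is exactly the under-estimation risk the answer describes.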


Skill 5.2.2 — FM Integration Troubleshooting

Q5 (Technical — Medium)

Explain how you'd implement a circuit breaker for Bedrock API calls.

Three states: CLOSED (normal), OPEN (all calls fail-fast), HALF_OPEN (test one call). Track failures in a sliding window (e.g., 60 seconds). When failures exceed a threshold (e.g., 5 in 60s), transition to OPEN. After a cool-down period (30s), transition to HALF_OPEN and allow one test call. If it succeeds, return to CLOSED. If it fails, back to OPEN. In MangaAssist, the BedrockClientWrapper combines this with retry logic: retry up to 3 times with exponential backoff on transient errors (429, 503), but if the circuit opens, immediately fall back to the Haiku model tier.
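
The three-state machine can be sketched as follows; thresholds match the example numbers above, but the class itself is a minimal illustration, not the actual BedrockClientWrapper:

```python
import time

class CircuitBreaker:
    """Minimal CLOSED / OPEN / HALF_OPEN breaker with a sliding window."""

    def __init__(self, threshold=5, window=60.0, cooldown=30.0):
        self.threshold, self.window, self.cooldown = threshold, window, cooldown
        self.failures = []            # timestamps of recent failures
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "OPEN":
            if now - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"   # permit a single probe call
                return True
            return False                    # fail fast
        return True

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if self.state == "HALF_OPEN" or len(self.failures) >= self.threshold:
            self.state, self.opened_at = "OPEN", now

    def record_success(self):
        self.failures.clear()
        self.state = "CLOSED"

cb = CircuitBreaker()
for t in range(5):                    # 5 failures inside the 60s window
    cb.record_failure(now=float(t))
print(cb.state)                       # OPEN
print(cb.allow_request(now=10.0))     # False: still cooling down
print(cb.allow_request(now=40.0))     # True: HALF_OPEN probe allowed
```

The caller would wrap each Bedrock invocation: skip the call (and go straight to the Haiku fallback) when `allow_request` returns False, and report the outcome via `record_success` / `record_failure`.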

Q6 (Technical — Hard)

During a traffic spike, your Bedrock streaming responses are timing out. What's happening and how do you fix it?

The streaming API returns chunks progressively, but under high concurrency Bedrock delivers chunks at a reduced rate (back-pressure). If a single client timeout applies to the whole stream duration, long responses can exceed it even while chunks are still arriving. In MangaAssist, we discovered this during a promotional event: first-byte latency was fine (2s), but total stream duration exceeded the 15s timeout. Fix: (1) increase the total timeout to 30s for streaming, (2) add a per-chunk timeout of 5s — if no chunk arrives within 5s, return the partial response with a flag, (3) enable the circuit breaker to fall back to synchronous Haiku after 5 stream timeouts.
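
The per-chunk timeout in fix (2) can be sketched with a worker thread feeding a queue; the plain Python generator below stands in for a Bedrock streaming response, and all names are illustrative:

```python
import queue
import threading
import time

def consume_stream(stream, chunk_timeout=5.0):
    """Consume a chunk stream; if no chunk arrives within chunk_timeout
    seconds, return the partial text with a truncation flag."""
    q = queue.Queue()
    done = object()                   # sentinel marking end of stream

    def pump():
        for chunk in stream:
            q.put(chunk)
        q.put(done)

    threading.Thread(target=pump, daemon=True).start()
    parts, partial = [], False
    while True:
        try:
            item = q.get(timeout=chunk_timeout)
        except queue.Empty:
            partial = True            # stream stalled: flag and bail out
            break
        if item is done:
            break
        parts.append(item)
    return "".join(parts), partial

def slow_stream():
    yield "Toky"
    yield "o Ghoul"
    time.sleep(0.5)                   # simulate a stalled chunk
    yield " volume 3"

print(consume_stream(slow_stream(), chunk_timeout=0.1))
```

The consumer returns whatever arrived before the stall, letting the caller serve a partial answer instead of timing out the whole request.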

Q7 (Behavioral — Medium)

Describe how you decide between retrying a failed FM call versus falling back to a cached response.

The decision depends on the error type and user's tolerance. For transient errors (429 throttling), retry with exponential backoff — the request is likely to succeed on retry. For systematic errors (400 bad request, model deprecation), retrying is futile — fall back immediately. For timeouts, attempt one retry, then fall back. In MangaAssist, we use a cached response for product FAQ queries (static, cacheable) but always retry for personalized recommendations (stale cache would degrade the experience). The circuit breaker pattern automates this: when failures accumulate, the system switches to fallback mode automatically.

Q8 (Technical — Medium)

How would you monitor the cost impact of Bedrock retries?

Track RetryCount as a CloudWatch metric per model tier. Each retry doubles the token cost for that request. In MangaAssist, we emit RetryCount × InputTokens as a WastedTokens metric. If WastedTokens exceeds 10% of total token spend, investigate — it usually means the circuit breaker threshold is too high (allowing too many retries before opening) or there's a systematic issue that retries can't fix.
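
The WastedTokens arithmetic is simple enough to show directly; this pure-logic sketch assumes per-request (retry_count, input_tokens) records, with the CloudWatch emission (e.g. via boto3 `put_metric_data`) left out:

```python
def wasted_tokens(requests):
    """requests: iterable of (retry_count, input_tokens). Each retry
    re-sends the full input, so retries * input_tokens is pure waste."""
    return sum(retries * tokens for retries, tokens in requests)

def waste_ratio(requests):
    """Wasted tokens as a fraction of total tokens sent (original + retries)."""
    total = sum((retries + 1) * tokens for retries, tokens in requests)
    return wasted_tokens(requests) / total if total else 0.0

reqs = [(0, 1200), (2, 800), (1, 500)]
print(wasted_tokens(reqs))            # 2100 tokens spent on retries
print(round(waste_ratio(reqs), 3))    # investigate if this exceeds 0.10
```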


Skill 5.2.3 — Prompt Engineering Troubleshooting

Q9 (Technical — Medium)

How do you test prompt changes before deploying to production?

We maintain a GoldenTestSuite of 50+ test cases organized by intent, each with an expected output pattern. The PromptTestRunner runs the candidate prompt against the full suite and scores each response across multiple dimensions: accuracy, format compliance, safety, and completeness. We compare the candidate's scores against the current production prompt using compare_versions(). The CI pipeline blocks deployment if any dimension score drops below the baseline threshold or if the pass rate falls below 95%.
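
A minimal sketch of the CI gate, assuming per-dimension mean scores and a pass rate are already computed; `compare_versions`, the 0.02 regression margin, and the return shape are assumptions modeled on the description above:

```python
def compare_versions(baseline, candidate, pass_rate,
                     min_pass_rate=0.95, max_drop=0.02):
    """baseline/candidate: {dimension: mean score} over the golden suite.
    Returns (ok, reasons); CI blocks deployment when ok is False."""
    reasons = []
    for dim, base_score in baseline.items():
        if candidate.get(dim, 0.0) < base_score - max_drop:
            reasons.append(f"regression in {dim}")
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.0%} below {min_pass_rate:.0%}")
    return (not reasons, reasons)

baseline = {"accuracy": 0.84, "format": 0.97, "safety": 1.0}
candidate = {"accuracy": 0.86, "format": 0.91, "safety": 1.0}
print(compare_versions(baseline, candidate, pass_rate=0.96))
# blocked: format compliance dropped from 0.97 to 0.91
```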

Q10 (Technical — Hard)

After a prompt update, recommendation quality dropped but only for Japanese-language users. How do you investigate?

  1. Segment the QualityScore metric by language dimension — confirm the drop is language-specific.
  2. Pull golden test results: check if the Japanese test cases were included in the suite (if not, that's the gap).
  3. Review the prompt change: instruction ordering matters. If English instructions were prepended, they may push Japanese context out of the model's attention window.
  4. Test the prompt with Japanese-only inputs: check if the model follows Japanese formatting instructions.
  5. In MangaAssist, we discovered that reordering product title placement in the prompt caused the model to favor English romanized titles over Japanese kanji titles. Fix: ensure product names appear in both languages adjacent to each other in the prompt.

Q11 (Behavioral — Hard)

You need to improve prompt quality but can't afford the latency of an LLM-as-judge evaluation. What's your approach?

Use heuristic scoring as an MVP. In MangaAssist, the PromptScorer evaluates six dimensions without any LLM call: accuracy (keyword matching against expected terms), safety (blocklist check), format compliance (JSON schema validation), completeness (required field presence), hallucination signals (check if response references data not in the provided context), and overall coherence (response length and structure checks). This runs in <10ms per response. We plan to upgrade to LLM-as-judge once we've validated the heuristic baseline and can justify the cost of ~$0.003/evaluation.
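
A condensed sketch covering three of the six dimensions (accuracy, safety, format compliance); the keyword lists and scoring rules are illustrative simplifications of the PromptScorer described above:

```python
import json

def heuristic_score(response, expected_terms, blocklist, required_fields):
    """Score a response on a few dimensions without any LLM call."""
    scores = {}
    lower = response.lower()
    # accuracy: fraction of expected terms present in the response
    scores["accuracy"] = (sum(t.lower() in lower for t in expected_terms)
                          / max(1, len(expected_terms)))
    # safety: any blocklisted term fails the dimension outright
    scores["safety"] = 0.0 if any(b.lower() in lower for b in blocklist) else 1.0
    # format compliance: must parse as JSON with all required fields present
    try:
        data = json.loads(response)
        missing = [f for f in required_fields if f not in data]
        scores["format"] = 1.0 if not missing else 0.5
    except (json.JSONDecodeError, TypeError):
        scores["format"] = 0.0
    return scores

resp = '{"title": "One Piece", "reason": "long-running shonen adventure"}'
print(heuristic_score(resp, ["one piece", "shonen"], ["spoiler"],
                      ["title", "reason"]))
```

Everything here is string matching and JSON parsing, which is why the full six-dimension version stays under 10ms per response.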

Q12 (Technical — Medium)

What is a golden test suite and why is it critical for GenAI applications?

A golden test suite is a curated set of input-output pairs representing the expected behavior of your GenAI system. Unlike traditional unit tests with exact matches, golden tests score responses across quality dimensions (accuracy, format, safety) because FM outputs are non-deterministic. In MangaAssist, each golden test case includes: input message, intent, expected output patterns, quality thresholds, and tags for coverage tracking. The suite is critical because FM outputs can regress silently — without automated testing, you discover regressions through user complaints, which is a lagging indicator.


Skill 5.2.4 — Retrieval System Troubleshooting

Q13 (Technical — Medium)

How would you detect embedding drift in a production vector store?

Schedule a daily Lambda that samples 200 documents from OpenSearch, re-embeds their text with the current model, and computes cosine distance between the stored and fresh embeddings. Track EmbeddingDriftP95 in CloudWatch. If P95 exceeds 0.15, the stored embeddings are significantly diverging from what the current model would produce — trigger re-embedding. In MangaAssist, this caught a drift issue after AWS silently updated the Titan Embed model's weights, causing a gradual quality degradation that wouldn't have been noticed for weeks otherwise.
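
The core of that daily job, sketched with pure Python on toy 2-dimensional vectors (a real job would fetch stored vectors from OpenSearch and fresh ones from the embedding model; the function names are illustrative):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def drift_p95(stored, fresh):
    """P95 of cosine distances between stored and freshly computed vectors."""
    dists = sorted(cosine_distance(s, f) for s, f in zip(stored, fresh))
    return dists[min(len(dists) - 1, int(0.95 * len(dists)))]

stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fresh  = [[1.0, 0.1], [0.0, 1.0], [0.9, 1.0]]
p95 = drift_p95(stored, fresh)
print(p95 < 0.15)   # True: below the re-embedding threshold, no action
```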

Q14 (Technical — Hard)

You just upgraded your embedding model from 1536 dimensions to 1024 dimensions. What's your migration strategy?

Never mix dimensions in one index. The migration plan:

  1. Create a new OpenSearch index configured for 1024 dimensions.
  2. Re-embed ALL documents using the new model (batch Lambda, ~2 hours for 100K docs).
  3. Update the query Lambda to use the new model for query encoding.
  4. Update ALL embedding consumers — including any pre-computed similarity caches, which are easy to miss.
  5. Atomic index alias swap: point the alias from the old index to the new index.
  6. Verify with RetrievalDiagnostics (Precision@K, MRR) across all intents.
  7. Keep the old index for 48 hours as rollback.
  8. Maintain an Embedding Consumer Registry so no secondary consumer is missed.

Q15 (Behavioral — Medium)

Tell me about a time you debugged a retrieval quality issue that wasn't caused by the retrieval system itself.

In MangaAssist, Precision@5 dropped for the "similar manga" intent. The embedding drift monitor showed zero drift, and the index was healthy. Investigation revealed the "similar manga" feature read pre-computed embeddings from a DynamoDB cache instead of encoding queries live. During an embedding model migration, the cache was not updated — it contained v1 embeddings while the index held v2 embeddings. The lesson: retrieval quality issues can originate from any component in the data pipeline, not just the vector store or embedding model.

Q16 (Technical — Medium)

How do you measure whether your RAG context is actually helping the model's response?

Use an AlignmentChecker that categorizes each claim in the FM response as "grounded" (supported by the provided RAG context) or "ungrounded" (not present in context). Track context_utilization (fraction of retrieved context actually referenced in the response) and grounding_rate (fraction of response claims supported by context). In MangaAssist, a grounding rate below 70% indicates the RAG context is not relevant enough, and we should investigate the retrieval query or chunk quality.
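
A deliberately simplified sketch of the grounding check: it treats each response sentence as a claim and calls it grounded if enough of its content words appear in the context. A real AlignmentChecker would use entailment rather than word overlap; all names and thresholds here are illustrative:

```python
def grounding_rate(response_sentences, context, overlap_threshold=0.5):
    """Fraction of response sentences whose content words sufficiently
    overlap with the provided RAG context."""
    ctx_words = set(context.lower().split())
    grounded = 0
    for sent in response_sentences:
        words = [w for w in sent.lower().split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in ctx_words for w in words) / len(words)
        if overlap >= overlap_threshold:
            grounded += 1
    return grounded / max(1, len(response_sentences))

context = "berserk is a dark fantasy manga by kentaro miura with 41 volumes"
sentences = ["Berserk is a dark fantasy manga",
             "It was adapted into three award-winning films"]  # ungrounded
print(grounding_rate(sentences, context))  # 0.5: below the 70% bar
```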


Skill 5.2.5 — Prompt Maintenance Troubleshooting

Q17 (Technical — Medium)

How do you detect prompt quality regression in production without manual review?

Use Statistical Process Control (SPC). The PromptHealthChecker maintains a rolling baseline of quality scores per template, computes the mean and standard deviation, and flags anomalies when the current score exceeds a z-score threshold (e.g., |z| > 2.5). This detects regressions within 1 hour of deployment without requiring a human reviewer. In MangaAssist, this caught a quality regression caused by a seasonal shift — anime premiere season changed user query patterns, but the prompt hadn't adapted.
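
The SPC anomaly test reduces to a z-score check; a minimal sketch, assuming a rolling window of recent per-template quality scores (names are illustrative, not the actual PromptHealthChecker):

```python
import statistics

def is_anomalous(baseline_scores, current, z_threshold=2.5):
    """Flag the current score if it sits more than z_threshold standard
    deviations from the rolling baseline mean."""
    mean = statistics.mean(baseline_scores)
    stdev = statistics.stdev(baseline_scores)
    if stdev == 0:
        return current != mean
    return abs((current - mean) / stdev) > z_threshold

baseline = [0.81, 0.83, 0.82, 0.84, 0.80, 0.82, 0.83, 0.81]
print(is_anomalous(baseline, 0.82))  # False: within normal variation
print(is_anomalous(baseline, 0.70))  # True: regression flagged
```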

Q18 (Technical — Hard)

Your prompt templates use variable substitution. How do you prevent a missing variable from silently producing a bad response?

The ProductionSchemaValidator checks all template variables before rendering: each required variable must be non-null and non-empty. For optional variables, provide explicit defaults. In MangaAssist, a template update added {{user_preferences}} as a required variable, but 10% of users had no preferences stored. The template rendered with an empty placeholder, producing generic responses. Fix: validate required variables pre-rendering, and if any are missing, either populate from inferred data or fall back to a template version that doesn't depend on the missing variable.
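
A hedged sketch of the pre-render check for `{{variable}}` templates; the function name and return shape are illustrative, not the actual ProductionSchemaValidator:

```python
import re

def validate_variables(template, variables, required):
    """Return the required template variables that are missing or empty.
    An empty result means the template is safe to render."""
    referenced = set(re.findall(r"\{\{(\w+)\}\}", template))
    return [v for v in required & referenced
            if not variables.get(v)]        # None or empty string fails

template = "Recommend manga for {{user_name}} based on {{user_preferences}}."
vars_ok = {"user_name": "Aiko", "user_preferences": "seinen, mystery"}
vars_bad = {"user_name": "Aiko", "user_preferences": ""}
required = {"user_name", "user_preferences"}

print(validate_variables(template, vars_ok, required))   # []
print(validate_variables(template, vars_bad, required))  # ['user_preferences']
```

On a non-empty result, the caller would populate the variable from inferred data or switch to a template version that doesn't reference it, per the fix above.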

Q19 (Behavioral — Medium)

Describe your approach to managing prompt templates across multiple languages.

In MangaAssist, we maintain separate templates for Japanese and English. Each template is versioned independently because linguistic adjustments affect quality differently. The PromptObservabilityPipeline tags every trace with language and template_version, so we can detect regressions per language. Key lesson: a change that improves English output often degrades Japanese output because instruction ordering and emphasis patterns differ between languages. We learned to always test prompt changes against both language golden test suites before deploying.

Q20 (Technical — Medium)

What is a prompt confusion detector and when would you use it?

A PromptConfusionDetector tracks intent pairs where the FM response for one intent looks like it belongs to another intent. It monitors classification confidence gaps and tracks confusion rates between intent pairs. In MangaAssist, the detector identified that "order_status" and "return_request" intents had a 12% confusion rate — the FM was sometimes answering order status questions with return instructions. This signaled that the prompt templates for those two intents needed clearer differentiation, or the intent classifier needed retraining.
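
The confusion-rate bookkeeping can be sketched in a few lines, assuming each observation pairs the classified intent with the intent the response content actually matched (names are illustrative):

```python
from collections import Counter

def confusion_rates(observations):
    """observations: iterable of (classified_intent, response_intent).
    Returns {(classified, responded): rate} per classified intent."""
    totals, confused = Counter(), Counter()
    for classified, responded in observations:
        totals[classified] += 1
        if classified != responded:
            confused[(classified, responded)] += 1
    return {pair: n / totals[pair[0]] for pair, n in confused.items()}

obs = ([("order_status", "order_status")] * 22
       + [("order_status", "return_request")] * 3)
print(confusion_rates(obs))
# {('order_status', 'return_request'): 0.12}
```

A sustained rate like the 12% above is the trigger to differentiate the two prompt templates or retrain the intent classifier.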


Cross-Skill Questions

Q21 (System Design — Hard)

Design an observability strategy for a GenAI chatbot from scratch. What metrics, traces, and alerts would you set up?

Layer it:

  1. Infrastructure metrics: ECS CPU/memory, DynamoDB read/write capacity, OpenSearch cluster health.
  2. FM integration metrics: invocation latency (P50/P95/P99), error rate, throttle count, circuit breaker state, token usage per model tier.
  3. Content metrics: token budget utilization, truncation count, compression ratio.
  4. Retrieval metrics: embedding drift, document staleness, Precision@K, MRR per intent.
  5. Prompt quality metrics: golden test pass rate (CI), template health score (SPC, production), schema violation rate, hallucination rate.
  6. Business metrics: user satisfaction (thumbs up/down), escalation rate, session completion rate.

Traces: X-Ray with subsegments for intent classification, prompt assembly, FM invocation, response validation, and guardrails. Every trace tagged with session_id, intent, template_version, model_id.

Alerts: severity matrix from file 08 — warning and critical thresholds per metric, routed to Slack (warning) and PagerDuty (critical).

Q22 (System Design — Hard)

A user says "the chatbot gave me wrong information." Walk through your full troubleshooting process.

  1. Get the session ID and trace ID from the support ticket.
  2. Pull the X-Ray trace → identify which span took the longest and whether any failed.
  3. Check BudgetUtilization → was context truncated? (Skill 5.2.1)
  4. Check the Bedrock invocation log → did the FM call succeed? Any retries? (Skill 5.2.2)
  5. Check the retrieval log → what documents were retrieved? Were they relevant? Stale? (Skill 5.2.4)
  6. Check the response against the provided context → was it grounded or hallucinated? (Skill 5.2.3)
  7. Check the template version → was it recently changed? (Skill 5.2.5)
  8. Based on findings, classify the root cause and apply the appropriate fix + add regression test.

Q23 (Behavioral — Hard)

You're on-call and three different alarms fire simultaneously: high Bedrock latency, low template health score, and high budget utilization. How do you prioritize?

Read the pattern, not individual alarms. High Bedrock latency is likely the root cause — when the FM is slow, responses may be truncated (high budget utilization if the system retries with shorter context) and quality drops (low template health score). Start with Bedrock: check the service health page and throttle metrics. If it's a Bedrock issue, the other two alarms are secondary symptoms and will resolve when Bedrock recovers. If Bedrock is healthy, investigate whether a prompt change increased input size (causing both budget pressure and latency).

Q24 (Technical — Medium)

How do you decide between RAG, fine-tuning, and prompt engineering for a new feature?

Use the decision framework:

  - Prompt engineering: when the knowledge is available at query time (RAG context or user input) and the task is instruction-following. Fastest to iterate, lowest cost.
  - RAG: when the knowledge is large, frequently updated, or domain-specific. MangaAssist uses RAG for the product catalog (50K+ items, updated weekly).
  - Fine-tuning: when the task requires a behavioral change the base model can't achieve through prompting (e.g., consistent JSON format in a specific domain). Higher cost, longer iteration cycle, model version coupling.

Start with prompt engineering, add RAG if context is needed, and fine-tune only if prompt + RAG quality plateaus.

Q25 (Technical — Medium)

What's the cost-quality-latency triangle in GenAI systems?

Every design decision trades off between cost, quality, and latency:

  - Model tiering (Sonnet → Haiku fallback): reduces cost and latency, sacrifices quality.
  - Aggressive caching: reduces cost and latency but serves stale responses (quality risk).
  - More RAG context: improves quality but increases latency and cost (more tokens).
  - LLM-as-judge evaluation: improves quality assurance but adds cost and latency.

The architecture should make these tradeoffs explicit and tunable, not baked in.


Behavioral Scenarios (STAR Framework)

Q26

Tell me about a time you reduced the cost of a GenAI application without sacrificing quality.

Situation: MangaAssist's monthly Bedrock bill was growing 15% month-over-month due to increased traffic and longer conversations.

Task: Reduce LLM costs by 30% without a measurable quality drop.

Action: (1) Implemented model tiering: route FAQ and simple intents to Haiku (20× cheaper than Sonnet), keeping Sonnet only for complex recommendations and support. (2) Added ElastiCache for frequently asked product queries (80% cache hit rate). (3) Compressed conversation history to reduce input tokens.

Result: 40% cost reduction ($3,100/month net savings); quality score remained stable at 0.82 (within normal SPC bounds).

Q27

Describe a time you had to debug a GenAI system issue that crossed multiple service boundaries.

Situation: After an embedding model migration, the "similar manga" feature showed random results, but all other features worked.

Task: Find out why only one feature was broken despite the migration appearing complete.

Action: Investigated the full data flow for "similar manga": user clicks an item → Lambda reads a pre-computed embedding from the DynamoDB cache → sends it to OpenSearch KNN. The DynamoDB cache still held v1 embeddings while the index held v2; the dimension mismatch caused random results. Created an Embedding Consumer Registry to track all components that use embeddings.

Result: Regenerated the cache in 30 minutes. The registry prevented similar issues in the next migration.

Q28

Tell me about a time you designed a system to prevent a class of failures rather than fixing individual bugs.

Situation: MangaAssist had three separate incidents in one month caused by prompt changes that weren't properly tested — each requiring manual investigation and rollback.

Task: Eliminate prompt regression as a recurring incident class.

Action: Built the prompt testing pipeline: (1) GoldenTestSuite with 50+ cases covering all intents and languages, (2) PromptTestRunner in CI that blocks deployment on score regression, (3) PromptHealthChecker SPC in production for post-deploy monitoring, (4) a deployment manifest that pairs prompt version + model version.

Result: Zero prompt regression incidents in the following three months. MTTR for prompt-related issues dropped from 4 hours to 30 minutes.


Quick Reference: Question → Skill → File Mapping

Question | Skill | Difficulty | Related File
Q1: Detecting silent truncation | 5.2.1 | Medium | 01
Q2: Debugging incomplete FAQ | 5.2.1 | Hard | 01, 09
Q3: History vs context tradeoff | 5.2.1 | Medium | 01
Q4: Japanese tokenization impact | 5.2.1 | Medium | 01
Q5: Circuit breaker for Bedrock | 5.2.2 | Medium | 02
Q6: Streaming timeout fix | 5.2.2 | Hard | 02, 09
Q7: Retry vs fallback decision | 5.2.2 | Medium | 02
Q8: Monitoring retry cost | 5.2.2 | Medium | 02, 06
Q9: Testing prompt changes | 5.2.3 | Medium | 03
Q10: Language-specific regression | 5.2.3 | Hard | 03, 09
Q11: Heuristic vs LLM-judge | 5.2.3 | Hard | 03, 06
Q12: Golden test suite purpose | 5.2.3 | Medium | 03
Q13: Embedding drift detection | 5.2.4 | Medium | 04, 07
Q14: Embedding migration strategy | 5.2.4 | Hard | 04, 09
Q15: Non-retrieval quality issue | 5.2.4 | Medium | 04, 09
Q16: RAG context alignment | 5.2.4 | Medium | 04
Q17: SPC prompt regression | 5.2.5 | Medium | 05
Q18: Template variable validation | 5.2.5 | Hard | 05, 09
Q19: Multi-language templates | 5.2.5 | Medium | 05
Q20: Confusion detector | 5.2.5 | Medium | 05
Q21: Full observability design | All | Hard | 07, 08
Q22: Wrong info troubleshooting | All | Hard | 08, 09
Q23: Multi-alarm triage | All | Hard | 08
Q24: RAG vs fine-tuning vs prompt | All | Medium | 06
Q25: Cost-quality-latency triangle | All | Medium | 06, 08
Q26: Cost reduction STAR | All | Medium | 06
Q27: Cross-service debugging STAR | 5.2.4 | Medium | 09
Q28: Failure class prevention STAR | 5.2.3 | Medium | 03, 08