
3. Grounded Debugging Scenarios for MangaAssist

This file is written for interview use. Each scenario is grounded in the MangaAssist architecture and follows the same structure so the incidents are easy to explain under pressure.

Use the stories as operating narratives, not as scripts to memorize word-for-word.


Scenario 1: Bedrock Throttling During Prime Day Traffic Spike

Context

During a peak event, MangaAssist traffic moved from normal volume toward the documented Prime Day spike range. Recommendation and FAQ intents were the most affected because they hit the full RAG plus LLM path.

Symptom

First-token latency spiked, and some users received short unavailability fallback messages instead of grounded answers.

What alerted me

CloudWatch showed the LLM timeout rate moving above the 5% threshold and P99 TTFT (time to first token) pushing past the expected range.

What I checked first

I checked whether the slowdown was inside Bedrock or in the service fan-out before Bedrock. The orchestrator traces showed recommendation, catalog, and RAG still finishing close to their normal latency budget.

How I narrowed it down

I opened Bedrock invocation logs and grouped failures by model_id. The spike was concentrated on the more expensive model path, not evenly distributed across requests. Input token size was elevated for some turns, but not enough to explain the entire incident.

Root cause

Bedrock throttling under peak traffic caused queueing on the FM (foundation model) path. This matched the repo's documented pattern: the incident was primarily a load and model-path problem, not a prompt-size problem.

Fix

I enabled stronger model tiering and fallback behavior (a routing sketch follows the list):

  • routed simpler intents to cheaper paths,
  • increased use of template and API-only responses,
  • allowed request queuing with priority,
  • reduced reliance on the most expensive model path until the spike settled.
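
A minimal sketch of that tiering and fallback logic, assuming a boto3 bedrock-runtime client; the intent names, model IDs, and retry budget are illustrative placeholders, not the production routing table.

```python
import time

from botocore.exceptions import ClientError

# Illustrative intent-to-model tiers; the IDs and intent names are placeholders.
INTENT_MODEL_MAP = {
    "faq": "cheap-model-id",
    "product_question": "cheap-model-id",
    "recommendation": "expensive-model-id",
}
FALLBACK_MODEL_ID = "cheap-model-id"


def invoke_with_fallback(bedrock_runtime, intent, request_body, max_attempts=2):
    """Route to the intent's model tier, back off on throttling, then fall back to the cheaper path."""
    model_id = INTENT_MODEL_MAP.get(intent, FALLBACK_MODEL_ID)
    for attempt in range(max_attempts):
        try:
            return bedrock_runtime.invoke_model(modelId=model_id, body=request_body)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise
            time.sleep(2 ** attempt)  # brief backoff before retrying the same tier
    # Still throttled after retries: serve from the cheaper fallback path instead of failing the turn.
    return bedrock_runtime.invoke_model(modelId=FALLBACK_MODEL_ID, body=request_body)
```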

Prevention

I kept a dedicated dashboard for Bedrock throttling, token cost, and model mix, and I treated >5% throttling as an actionable operational threshold.
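
One way to encode that >5% threshold as an alarm, sketched with boto3; the namespace, metric name, and evaluation windows are assumptions about how the orchestrator publishes its throttle-rate metric, not the documented setup.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical custom metric published by the orchestrator as a percentage.
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-bedrock-throttle-rate-above-5pct",
    Namespace="MangaAssist/LLM",
    MetricName="BedrockThrottleRatePercent",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=5.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```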

Services involved

Bedrock, orchestrator, RAG pipeline, API Gateway, ECS or Lambda scaling.

Metric or log signal

Bedrock throttling > 5%, elevated P99 TTFT, invocation failures grouped by model_id.


Scenario 2: RAG Returned Stale Release Information

Context

A user asked about a newly available volume, but the chatbot answered as if the title had not yet released.

Symptom

The response looked coherent, but it was grounded in stale content.

What alerted me

Customer feedback flagged the answer as wrong even though the final response was fluent and safe.

What I checked first

I checked the product catalog response and confirmed the live catalog already showed the new release correctly.

How I narrowed it down

I inspected the retrieval trace for that request and found the selected chunks came from an older FAQ or editorial source. The last_updated metadata on the chosen chunks lagged behind the current catalog state.

Root cause

The knowledge-base refresh path was behind. The retrieval layer served stale chunks even though the live catalog was correct.

Fix

I forced a refresh of the affected content path, tightened freshness checks during retrieval, and made the prompt prefer live product metadata over older editorial text whenever release status or availability was involved.
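
A minimal sketch of the freshness check, assuming each retrieved chunk carries last_updated metadata and the catalog record exposes a current timestamp and release status; the field names and the seven-day budget are illustrative.

```python
from datetime import datetime, timedelta

MAX_STALENESS = timedelta(days=7)  # assumed freshness budget for time-sensitive facts


def filter_fresh_chunks(chunks, catalog_updated_at):
    """Drop retrieved chunks whose last_updated lags too far behind the live catalog."""
    fresh = []
    for chunk in chunks:
        last_updated = datetime.fromisoformat(chunk["metadata"]["last_updated"])
        if catalog_updated_at - last_updated <= MAX_STALENESS:
            fresh.append(chunk)
    return fresh


def build_grounding_context(chunks, catalog_record):
    """Release status always comes from the live catalog; editorial chunks are background only."""
    fresh_chunks = filter_fresh_chunks(chunks, catalog_record["updated_at"])
    return {
        "release_status": catalog_record["release_status"],
        "editorial_context": [chunk["text"] for chunk in fresh_chunks],
    }
```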

Prevention

I added freshness monitoring around last_updated, retrieval audits for recently changed titles, and stronger separation between evergreen editorial chunks and time-sensitive catalog facts.

Services involved

OpenSearch Serverless, embedding pipeline, orchestrator, product catalog, Bedrock.

Metric or log signal

Retrieval trace showing stale chunk IDs and old last_updated values compared with current catalog data.


Scenario 3: Intent Classifier Misrouted Recommendation Queries

Context

Users asked implicit recommendation questions like, "What should I read after Berserk?" but the chatbot answered with attribute-style product facts instead of recommendations.

Symptom

The response was structurally valid but routed to the wrong service path.

What alerted me

I saw a drop in recommendation quality and found more thumbs-down feedback on recommendation-like prompts.

What I checked first

I checked the classifier trace for affected turns and reviewed predicted intent plus confidence.

How I narrowed it down

The model was sending a subset of implicit recommendation requests to product_question. Shadow-mode material in the repo had already documented this exact regression pattern.

Root cause

The classifier had insufficient coverage for implicit recommendation phrasing. The rule-based stage did not catch it, and the ML stage overfit toward product-attribute wording.

Fix

I retrained or rebalanced the classifier with more implicit recommendation examples and tightened route validation for low-confidence cases.
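
A sketch of the low-confidence route validation, with an illustrative threshold and a hypothetical keyword check layered over the ML prediction; none of these names come from the production classifier.

```python
CONFIDENCE_FLOOR = 0.7  # assumed threshold, tuned from the confusion-matrix review
RECO_HINTS = ("what should i read", "similar to", "recommend", "after reading")


def validate_route(user_text, predicted_intent, confidence):
    """Override product_question when a low-confidence prediction reads like a recommendation ask."""
    text = user_text.lower()
    looks_like_reco = any(hint in text for hint in RECO_HINTS)
    if predicted_intent == "product_question" and confidence < CONFIDENCE_FLOOR and looks_like_reco:
        return "recommendation"
    return predicted_intent
```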

Prevention

I monitored confusion-matrix drift weekly and treated recommendation-to-product-question confusion as a first-class quality regression.

Services involved

SageMaker classifier endpoint, orchestrator, recommendation engine, MLflow tracing.

Metric or log signal

Classifier confidence trends, confusion matrix, and intent distribution mismatch for recommendation queries.


Scenario 4: Guardrail False Positives on Manga Terminology

Context

Manga titles and descriptions naturally contain violent terms that are normal in domain context.

Symptom

Users received safe fallbacks for valid product questions involving titles or themes with words like kill, death, demon, or attack.

What alerted me

Guardrail block rate rose above the expected operating target, and the blocks clustered around specific genres and series.

What I checked first

I inspected blocked samples and Bedrock guardrail traces to see which filter categories fired.

How I narrowed it down

The logs showed legitimate manga-related phrasing was being treated as unsafe without enough context awareness. This matched the repo's documented false-positive problem.

Root cause

The guardrail layer was too static and too binary for a manga domain. It lacked context-aware interpretation and confidence-based handling.

Fix

I moved to context-aware guardrails and confidence scoring so valid product content did not trigger immediate hard blocks.
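
A sketch of the confidence-based handling, assuming the guardrail trace exposes a category and a score per hit; the domain-term list, thresholds, and category names are illustrative assumptions.

```python
DOMAIN_TERMS = {"kill", "death", "demon", "attack"}  # routine vocabulary in manga titles and synopses
HARD_BLOCK_SCORE = 0.95  # assumed score above which the turn is always blocked
SOFT_BLOCK_SCORE = 0.70  # assumed score where context decides


def guardrail_decision(text, category, score, intent):
    """Return 'allow', 'soft_block' (rephrase or escalate), or 'block'."""
    if score >= HARD_BLOCK_SCORE:
        return "block"
    if score >= SOFT_BLOCK_SCORE:
        mentions_domain_term = any(term in text.lower() for term in DOMAIN_TERMS)
        # Violence-category hits on product questions about catalog titles are the
        # documented false-positive pattern; route them to a softer path.
        if category == "violence" and intent == "product_question" and mentions_domain_term:
            return "allow"
        return "soft_block"
    return "allow"
```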

Prevention

I kept a blocked-sample audit, segmented guardrail dashboards by intent and genre, and A/B tested threshold changes carefully.

Services involved

Bedrock Guardrails, orchestrator, product catalog, analytics pipeline.

Metric or log signal

Guardrail block rate by intent or series, plus guardrail trace details with category and confidence.


Scenario 5: Conversation Memory Growth Caused Read-Latency Regressions

Context

Long support or recommendation sessions can exceed 15 to 20 turns, especially when the user compares editions or asks follow-up questions.

Symptom

Turn latency increased gradually over long conversations, even when downstream service health remained stable.

What alerted me

I noticed slowdowns clustering by session age rather than by traffic level or intent.

What I checked first

I checked orchestrator traces for memory load duration and then reviewed DynamoDB read patterns for large sessions.

How I narrowed it down

The memory span was growing disproportionately for high-turn sessions, and the prompt assembly step was also receiving more history than intended.

Root cause

Conversation history was not being summarized aggressively enough. DynamoDB remained the source of truth, but large session payloads increased read cost and inflated prompt size.

Fix

I tightened summarization and sliding-window behavior so only the most relevant recent turns plus a compact summary flowed into prompt assembly.
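
A sketch of that sliding-window assembly, assuming a separate summarize() step already exists; the window size and message shape are assumptions for illustration.

```python
WINDOW_TURNS = 6  # assumed: only the most recent turns flow into the prompt verbatim


def assemble_history(turns, summarize):
    """Compact summary of older turns plus a verbatim window of recent turns."""
    recent = turns[-WINDOW_TURNS:]
    older = turns[:-WINDOW_TURNS]
    history = []
    if older:
        history.append({"role": "system", "content": f"Conversation summary: {summarize(older)}"})
    history.extend(recent)
    return history
```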

Prevention

I added alerts on turn-count-driven latency growth and treated memory compression as part of performance engineering, not just token optimization.

Services involved

DynamoDB conversation memory, orchestrator, prompt assembly, Bedrock.

Metric or log signal

Memory span latency, turn-count distribution, and rising input token counts on long sessions.


Scenario 6: The LLM Hallucinated a Discount Percentage

Context

The user asked whether a box set was cheaper than buying volumes individually.

Symptom

The chatbot produced a plausible but incorrect discount comparison.

What alerted me

The answer was flagged by validation or by user feedback because the price comparison did not match live catalog data.

What I checked first

I checked the product catalog response and verified the raw price data first, because prices are not cached in this design.

How I narrowed it down

The catalog data was correct. Bedrock invocation logs showed the model received multiple product prices in one prompt, and the raw output blended them into an invented discount narrative.

Root cause

This was a generation-layer hallucination caused by dense comparative price context and insufficiently constrained output instructions.

Fix

I tightened the prompt to require explicit comparisons from provided price fields only and strengthened post-generation price validation so incorrect math or invented percentages could not pass.
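
A sketch of the post-generation price gate, assuming the validator can see the catalog prices that were injected into the prompt; the function name and tolerance are illustrative.

```python
import re


def validate_discount_claims(answer, box_set_price, individual_total, tolerance_pct=1.0):
    """Reject answers whose stated discount percentage does not match catalog prices."""
    claimed = [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", answer)]
    if not claimed:
        return True  # nothing numeric to check; other validators handle the rest
    actual = (1 - box_set_price / individual_total) * 100
    return all(abs(pct - actual) <= tolerance_pct for pct in claimed)
```

For instance, "about 15% cheaper" against an 85 box-set price and 100 of individual volumes passes, while an invented "30% cheaper" is rejected before it reaches the user.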

Prevention

I kept price and ASIN validation as hard gates and treated any comparative commerce answer as high-risk for hallucination.

Services involved

Product catalog, orchestrator, Bedrock, post-generation validators, guardrails.

Metric or log signal

ASIN or price validation failure, plus prompt and raw model output comparison.


Scenario 7: WebSocket Disconnects Broke Streaming Responses

Context

Users saw partial responses or lost the chat stream entirely during busy periods.

Symptom

The model generated an answer, but the user experienced a stalled or broken stream.

What alerted me

Complaints showed incomplete answers, while backend logs still indicated some successful generation events.

What I checked first

I checked whether the response had been fully generated and approved by guardrails before looking at transport.

How I narrowed it down

The orchestrator trace completed normally, but connection logs showed disconnects or stalled delivery on the WebSocket path. That isolated the issue to transport rather than content generation.

Root cause

Under load, connection stability or stream delivery degraded even though the model path was healthy.

Fix

I stabilized the WebSocket handling path, reviewed load balancer behavior, and ensured the client could reconnect and resume gracefully when possible.

Prevention

I added explicit monitoring for stream completion rate, disconnect reason codes, and the delta between generated responses and successfully delivered responses.
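
A sketch of how the generated-versus-delivered delta could be computed from per-request events; the event names and fields are hypothetical.

```python
def stream_completion_rate(events):
    """Compare responses the backend finished generating with streams the client actually received."""
    generated = {e["request_id"] for e in events if e["type"] == "generation_complete"}
    delivered = {e["request_id"] for e in events if e["type"] == "stream_delivered"}
    if not generated:
        return 1.0, set()
    undelivered = generated - delivered
    return 1 - len(undelivered) / len(generated), undelivered
```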

Services involved

API Gateway or ALB, orchestrator, WebSocket stream layer, frontend widget.

Metric or log signal

Disconnect spikes, incomplete stream counts, and successful generation spans without corresponding delivery completion.


Scenario 8: Prompt Injection Attempt Against the System Prompt

Context

An adversarial user tried to get the chatbot to reveal internal instructions or ignore safety and policy boundaries.

Symptom

The prompt contained phrases like "ignore previous instructions" or asked for hidden system content.

What alerted me

Pre-generation filters or security review surfaced suspicious inputs.

What I checked first

I checked the raw user message, pre-generation safety layer output, and whether the request still reached the full LLM path.

How I narrowed it down

The pattern matched prompt-injection signatures. The important part was confirming whether the request was blocked early, sanitized, or allowed through with containment.

Root cause

This was not a bug in one downstream service. It was a deliberate adversarial input that tested prompt isolation and guardrails.

Fix

I blocked or constrained the request through the injection defense path and verified the response did not leak hidden instructions or unsafe internal state.

Prevention

I fed the pattern back into red-team test sets, updated rule and model-based injection detectors, and kept system prompts isolated from user-controllable context.
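
A sketch of the rule-based layer of that detection; the patterns are illustrative signatures and would sit in front of, not instead of, the model-based detector.

```python
import re

INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"reveal (your|the) (system|hidden) prompt",
    r"you are no longer bound by",
    r"pretend (that )?you have no (rules|guardrails)",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in INJECTION_PATTERNS]


def looks_like_injection(user_text):
    """Flag messages matching known injection phrasings for blocking or containment."""
    return any(pattern.search(user_text) for pattern in _COMPILED)
```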

Services involved

Pre-generation safety filters, Bedrock Guardrails, orchestrator, security review workflow.

Metric or log signal

Injection detection hits, denied-request counts, and audit logs from security-focused traces.


Scenario 9: Lambda Cold Starts After Deployment Pushed P99 Above SLA

Context

After a deployment, latency increased even though business traffic was normal and downstream services appeared healthy.

Symptom

P99 first-token latency rose sharply for a subset of requests shortly after deploy.

What alerted me

Latency alarms fired even though Bedrock throttling and retrieval latencies remained mostly stable.

What I checked first

I compared trace spans before and after deployment and focused on the earliest backend stages.

How I narrowed it down

The extra latency was concentrated before downstream calls, which pointed away from Bedrock and toward the orchestrator runtime. Cold-start indicators or init duration explained the gap.

Root cause

Freshly deployed Lambda capacity introduced startup cost on the hot path.

Fix

I adjusted concurrency strategy and capacity warm-up so the orchestrator stopped paying that cost on user-facing turns.
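
One way to implement the warm-capacity part, sketched with boto3 against a hypothetical function name and alias; the concurrency number is a placeholder, not the tuned value.

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep a baseline of pre-initialized execution environments on the alias that
# serves user traffic so post-deploy requests stop paying init cost.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="mangaassist-orchestrator",  # hypothetical function name
    Qualifier="live",                         # hypothetical traffic alias
    ProvisionedConcurrentExecutions=20,       # placeholder baseline
)
```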

Prevention

I monitored init duration separately from business logic duration and reviewed deployment-time capacity posture before high-traffic windows.

Services involved

Lambda orchestrator, API Gateway, CloudWatch metrics, tracing layer.

Metric or log signal

Init duration or early-span latency increase with otherwise normal downstream timing.


Scenario 10: Shadow Mode Caught a Model Regression Before Production

Context

The team evaluated a newer model version in shadow mode before serving it to users.

Symptom

The candidate model looked superficially better but showed output drift that would have hurt style compliance and increased cost.

What alerted me

Shadow comparison metrics showed longer responses and style changes versus production.

What I checked first

I checked length deltas, guardrail pass rates, and sample side-by-side outputs for the same request IDs.

How I narrowed it down

The candidate produced emoji in a nontrivial share of responses and increased average output length materially. The repo already documents both of these exact shadow-mode findings.
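
A sketch of the shadow comparison over paired outputs joined on the same request IDs; the emoji range and metric names are illustrative, not the production aggregation job.

```python
import re

EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF]")  # rough emoji range, enough for drift detection


def shadow_drift(pairs):
    """pairs: list of (production_text, candidate_text) joined on the same request_id."""
    if not pairs:
        return {}
    length_deltas = [len(candidate) - len(production) for production, candidate in pairs]
    emoji_share = sum(1 for _, candidate in pairs if EMOJI_RE.search(candidate)) / len(pairs)
    return {
        "avg_length_delta": sum(length_deltas) / len(length_deltas),
        "emoji_share": emoji_share,
    }
```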

Root cause

The newer model had a different behavioral prior. It was not wrong in a safety sense, but it violated style expectations and would have increased latency and token cost.

Fix

I updated prompt constraints and held back promotion until the candidate passed style and length requirements.

Prevention

I kept shadow mode as a mandatory evaluation layer for model or prompt transitions, not just offline benchmarks.

Services involved

Bedrock model path, orchestrator, Kinesis shadow logging, nightly aggregation, evaluation dashboards.

Metric or log signal

Shadow-mode comparison metrics: response length delta, guardrail pass rate, style drift in sampled outputs.


Patterns I Learned Across Incidents

Across these incidents, the repeated lesson was that chatbot debugging is rarely about one log line.

The patterns that mattered most were:

  • correlation IDs across every service (see the sketch after this list),
  • clear separation of retrieval bugs from generation bugs,
  • clear separation of generation bugs from delivery bugs,
  • domain-aware guardrails instead of generic blocking,
  • treating prompt and model changes like deployable production changes,
  • using traces to identify where latency accumulated,
  • using validation layers to catch commerce-critical mistakes such as wrong price or wrong ASIN.
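
A sketch of the correlation-ID pattern from the first bullet, assuming structured JSON logs and a downstream call that accepts headers; the logger name, header name, and event shape are illustrative.

```python
import json
import logging
import uuid

logger = logging.getLogger("mangaassist")


def handle_turn(event, call_downstream):
    """Attach one correlation ID to every log line and downstream call for this turn."""
    correlation_id = event.get("correlation_id") or str(uuid.uuid4())
    logger.info(json.dumps({"correlation_id": correlation_id, "stage": "turn_start"}))
    response = call_downstream(
        payload=event["payload"],
        headers={"X-Correlation-Id": correlation_id},  # the same ID flows through every hop
    )
    logger.info(json.dumps({"correlation_id": correlation_id, "stage": "turn_end"}))
    return response
```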

Those patterns are what make the stories credible in interviews.