
09: Debugging Scenarios and Runbooks

AIP-C01 Mapping

Task 5.2 → All Skills (5.2.1–5.2.5): Ten production-style debugging scenarios grounded in the MangaAssist architecture. Each scenario maps to a specific troubleshooting skill and follows the same structure used in Debugging/03-debugging-scenarios.md.


Scenario 1: Context Window Silently Truncated Long Manga Series FAQ (Skill 5.2.1)

Context

A returning user asked about the reading order of a 50-volume manga series. The orchestrator assembled the prompt with system instructions, 12-turn conversation history, RAG context from the series FAQ, and the user's question.

Symptom

The chatbot returned a reading order list that stopped at volume 23. The response sounded confident and complete, but it omitted the second half of the series.

What alerted me

No error was raised. A customer support escalation flagged the incomplete answer. I checked the CloudWatch BudgetUtilization metric and saw a spike to 98% for that session.

What I checked first

I pulled the Bedrock invocation log for the request. The input_tokens field in the response metadata showed 4,850 tokens against a 5,000-token practical budget — nearly full.

How I narrowed it down

I compared the assembled prompt size against each section's contribution. The 12-turn conversation history consumed 2,100 tokens. The series FAQ was 2,400 tokens. System prompt and user message took the rest. The FAQ was the last section assembled, so its trailing content was silently dropped.

Root cause

Prompt assembly used a naive concatenation strategy with no budget-aware ordering. The FAQ content was appended after conversation history, and when the budget filled up, the tail of the FAQ (volumes 24–50) was silently truncated before reaching the FM.

Fix

Integrated the TokenBudgetManager from file 01. Set the FAQ content as FIXED priority (allocated first). Set conversation history as COMPRESSIBLE (compressed via HistoryCompressor when budget is tight). Added a TruncationDetector that logs a warning and emits a TruncationEvent metric whenever content is dropped.
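
A minimal sketch of the budget-aware assembly described above. The class and function names here are illustrative, not the exact TokenBudgetManager API from file 01:

```python
from dataclasses import dataclass
from enum import Enum

class Priority(Enum):
    FIXED = 1         # allocated first, never dropped
    COMPRESSIBLE = 2  # summarized or trimmed when the budget is tight
    DROPPABLE = 3     # allocated last, first to be dropped

@dataclass
class Section:
    name: str
    text: str
    priority: Priority

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)

def assemble_prompt(sections: list[Section], budget: int = 5000) -> tuple[str, list[str]]:
    """Allocate FIXED content first, then the rest; report anything trimmed or dropped."""
    kept, dropped, remaining = [], [], budget
    for section in sorted(sections, key=lambda s: s.priority.value):
        cost = estimate_tokens(section.text)
        if cost <= remaining:
            kept.append(section)
            remaining -= cost
        elif section.priority is Priority.COMPRESSIBLE and remaining > 0:
            # Stand-in for HistoryCompressor: naive head truncation.
            kept.append(Section(section.name, section.text[: remaining * 4], section.priority))
            dropped.append(f"{section.name} (compressed)")
            remaining = 0
        else:
            dropped.append(section.name)  # a TruncationDetector would emit a metric here
    original_order = [s.name for s in sections]
    kept.sort(key=lambda s: original_order.index(s.name))
    return "\n\n".join(s.text for s in kept), dropped
```

With the FAQ marked FIXED and conversation history COMPRESSIBLE, the FAQ tail is no longer the first content to disappear, and the returned dropped list is what feeds the TruncationEvent metric and the user-facing disclaimer.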

Prevention

CloudWatch alarm on BudgetUtilization > 90% with 1-minute evaluation. Added the TruncationDetector to the critical path — if content is dropped, the response includes a disclaimer: "I have partial information about this series."

Services involved

ECS orchestrator, Bedrock, CloudWatch.

Metric or log signal

BudgetUtilization > 95%, TruncationCount > 0, input_tokens close to model limit in Bedrock response metadata.


Scenario 2: Bedrock Streaming Timeout During Peak Traffic (Skill 5.2.2)

Context

During a promotional event, MangaAssist experienced a 3× traffic spike. The orchestrator used invoke_model_with_response_stream for conversational responses.

Symptom

Users saw partial responses — the first sentence appeared, then the stream hung for 15 seconds before the connection timed out. The client displayed "Something went wrong" after the timeout.

What alerted me

CloudWatch alarm: InvocationLatencyP95 > 8s fired, followed by ErrorRate > 5% (client-side timeouts counted as errors).

What I checked first

I checked the Bedrock service health page — no active incidents. I checked the ThrottleCount metric — elevated but not extreme. I checked the circuit breaker state — still CLOSED, meaning failures hadn't hit the threshold yet.

How I narrowed it down

I pulled X-Ray traces for the failing requests. The bedrock_invocation span showed that the initial API response (first byte) arrived within 2 seconds, but subsequent chunks had irregular gaps of 3–5 seconds. The total stream duration exceeded the 15-second client timeout.

Root cause

Under high concurrency, Bedrock was delivering stream chunks at a reduced rate (back-pressure). The client timeout was set to 15 seconds for the complete response, but streaming responses for longer answers needed 20+ seconds. The issue was a mismatch between client timeout and realistic streaming duration under load.

Fix

  1. Increased client stream timeout from 15s to 30s for streaming responses.
  2. Added a per-chunk timeout of 5 seconds (sketched after this list) — if no chunk arrives in 5s, close the stream and return what we have with a "response was interrupted" flag.
  3. Enabled the circuit breaker: after 5 stream timeouts in 60 seconds, the circuit opens and falls back to synchronous invoke_model with Haiku (shorter response, faster).
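
A minimal sketch of the per-chunk timeout from item 2, assuming boto3's bedrock-runtime streaming API. The per-chunk limit is enforced by reading the event stream from a worker thread; the chunk payload keys vary by model family:

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as ChunkTimeout

import boto3

bedrock = boto3.client("bedrock-runtime")

def stream_with_chunk_timeout(model_id: str, body: dict,
                              chunk_timeout: float = 5.0,
                              total_timeout: float = 30.0) -> tuple[str, bool]:
    """Return (text, interrupted). Stop cleanly if a chunk stalls or the total budget runs out."""
    response = bedrock.invoke_model_with_response_stream(
        modelId=model_id, body=json.dumps(body))
    stream = iter(response["body"])
    pool = ThreadPoolExecutor(max_workers=1)
    parts, interrupted = [], False
    deadline = time.monotonic() + total_timeout
    try:
        while time.monotonic() < deadline:
            try:
                # next() blocks on the network; run it in a worker thread so it can be timed out.
                event = pool.submit(next, stream).result(timeout=chunk_timeout)
            except StopIteration:
                break                 # stream completed normally
            except ChunkTimeout:
                interrupted = True    # no chunk within 5s: return the partial response
                break
            if "chunk" in event:
                payload = json.loads(event["chunk"]["bytes"])
                # Payload keys differ by model family; adjust for the model in use.
                parts.append(payload.get("delta", {}).get("text", ""))
        else:
            interrupted = True        # total streaming budget exhausted
    finally:
        # Don't wait for a stalled read; botocore's own read timeout will release it.
        pool.shutdown(wait=False)
    return "".join(parts), interrupted
```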

Prevention

Added StreamChunkLatency metric (time between consecutive chunks). Alarm at P95 > 3s. Load testing now includes a streaming latency scenario at 3× peak traffic.

Services involved

ECS orchestrator, Bedrock (streaming API), CloudWatch, X-Ray.

Metric or log signal

InvocationLatencyP95 > 8s, StreamChunkLatency P95 > 3s, ThrottleCount elevated, client-side timeout errors.


Scenario 3: Prompt Version Rollback Caused Recommendation Quality Drop (Skill 5.2.3)

Context

The team deployed prompt v2.3 to improve manga recommendation formatting. After receiving complaints about output quality, they rolled back to v2.2. Quality did not recover.

Symptom

After the rollback to v2.2, recommendation quality scores (measured by PromptHealthChecker) remained at 0.65, down from the pre-v2.3 baseline of 0.82.

What alerted me

TemplateHealthScore alarm fired 30 minutes after rollback. Expected recovery didn't happen.

What I checked first

I confirmed the active template version was genuinely v2.2 (not a stale config cache). I compared the v2.2 template text against the known-good copy in git — identical.

How I narrowed it down

I ran PromptTestRunner.compare_versions("v2.2", "v2.3") and noticed that v2.2 actually scored worse than expected on a subset of golden tests. I checked those test cases — they were added during the v2.3 development cycle and tested scenarios that v2.2 was never designed for.

Then I checked the Bedrock model version. During the v2.3 deployment, the team had also upgraded from Claude 3.5 Sonnet v1 to v2. The rollback reverted the prompt but NOT the model version. Prompt v2.2 was tuned for Sonnet v1; it performed differently on Sonnet v2.

Root cause

Compound rollback failure: prompt was reverted but the model version change was not. Prompt v2.2 + Sonnet v2 was an untested combination. The golden test suite had been expanded to include new scenarios that only v2.3 could handle.

Fix

  1. Rolled back the model version to Sonnet v1 alongside the prompt rollback — quality recovered to 0.81.
  2. Re-ran the full golden test suite for v2.3 against Sonnet v2, identified 3 failing cases.
  3. Fixed v2.3 for those cases, re-deployed as v2.4 paired with Sonnet v2.

Prevention

Implemented a deployment manifest that pairs prompt version + model version as a single deployable unit. Rollback now reverts both. Added model_version as a dimension in all prompt quality metrics.
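
A minimal sketch of what pairing prompt and model as one deployable unit can look like. The config-store client and field names are assumptions, not the team's actual deployment tooling:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentManifest:
    prompt_version: str
    model_id: str          # the exact Bedrock model identifier the prompt was tested against
    golden_suite_run: str  # CI run that validated this specific pairing

def rollback(previous: DeploymentManifest, config_store) -> None:
    """Revert prompt and model together; never one without the other."""
    config_store.put("active_prompt_version", previous.prompt_version)
    config_store.put("active_model_id", previous.model_id)
```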

Services involved

ECS orchestrator (config service), Bedrock, CloudWatch, CI/CD pipeline.

Metric or log signal

TemplateHealthScore < 0.7 post-rollback, golden test failures on rollback version, model_version dimension mismatch in traces.


Scenario 4: Embedding Model Update Broke Similarity Scores (Skill 5.2.4)

Context

The team upgraded from Titan Embed Text v1 (1536 dimensions) to v2 (1024 dimensions) for cost savings. The ingestion pipeline was updated to use v2, and new documents were re-embedded over the weekend.

Symptom

Monday morning: relevance scores for product queries dropped significantly. Some queries returned completely irrelevant manga recommendations. Precision@5 dropped from 0.78 to 0.31.

What alerted me

The EmbeddingDriftP95 metric spiked to 0.92 (threshold: 0.15). The overnight drift monitor run flagged 187 of 200 sampled documents as drifted.

What I checked first

I checked the OpenSearch index mapping — the vector field was configured for 1536 dimensions. The newly ingested v2 embeddings were 1024 dimensions. OpenSearch was zero-padding the v2 embeddings to 1536 dimensions to match the index schema.

How I narrowed it down

I sampled 10 documents: 5 with the old 1536-dimension embeddings and 5 with the new zero-padded 1024→1536 embeddings. Cosine similarity between a query and old documents was ~0.85. Between the same query and zero-padded documents: ~0.35. The zero-padding destroyed the geometric relationships in the embedding space.

Additionally, the query-time embedding was still using v1 (the Lambda function had not been updated), so queries were encoded in v1 space while some documents were in v2 space.

Root cause

Mixed-dimension embedding space: some documents embedded with v1 (1536d), some with v2 (1024d zero-padded to 1536d), and queries encoded with v1. The three different vector representations were geometrically incompatible.

Fix

  1. Immediately reverted the ingestion pipeline to v1 and re-embedded the affected documents back to v1.
  2. Created a migration plan: (a) create a new OpenSearch index with 1024d configuration, (b) re-embed ALL documents with v2, (c) update query Lambda to v2, (d) atomic index alias swap.

Prevention

Added a pre-deployment check: EmbeddingDimensionValidator verifies that the ingestion model, query model, and index configuration all use the same dimension. The embedding drift monitor now includes a dimension consistency check as its first step.
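
A minimal sketch of the dimension consistency check, assuming a small lookup table of model dimensions and the index mapping read from OpenSearch (the knn_vector mapping exposes a dimension field):

```python
# Dimensions per embedding model; Titan v2 output dimensions are configurable,
# so treat this table as deployment configuration rather than hard-coded truth.
KNOWN_DIMENSIONS = {
    "amazon.titan-embed-text-v1": 1536,
    "amazon.titan-embed-text-v2:0": 1024,
}

def validate_embedding_dimensions(ingest_model: str, query_model: str,
                                  index_mapping: dict, vector_field: str) -> None:
    """Fail the deployment if ingestion model, query model, and index disagree."""
    ingest_dim = KNOWN_DIMENSIONS[ingest_model]
    query_dim = KNOWN_DIMENSIONS[query_model]
    index_dim = index_mapping["properties"][vector_field]["dimension"]
    if not (ingest_dim == query_dim == index_dim):
        raise ValueError(
            f"Embedding dimension mismatch: ingest={ingest_dim}, "
            f"query={query_dim}, index={index_dim}"
        )
```

Running this in CI and as the first step of the drift monitor turns a silent padding or truncation problem into a hard deployment error.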

Services involved

OpenSearch Serverless, Bedrock (Titan Embed), Lambda (ingestion + query), CloudWatch.

Metric or log signal

EmbeddingDriftP95 > 0.15, Precision@5 drop, dimension mismatch in index mapping vs. embedding output.


Scenario 5: Prompt Template Rendering Failures in Production (Skill 5.2.5)

Context

The MangaAssist prompt templates use Jinja2-style variable substitution. A template update added a new variable {{user_preferences}} for personalized recommendations.

Symptom

10% of recommendation requests returned generic, non-personalized responses. The FM produced valid output, but it looked like a first-time user response even for returning users.

What alerted me

SchemaViolationRate did not change (responses were valid). The PromptConfusionDetector flagged an increase in "recommendation" intents producing "general_greeting"-style responses. Customer feedback noted "the chatbot stopped remembering my preferences."

What I checked first

I checked the template rendering logs. For 10% of requests, the {{user_preferences}} variable was rendered as an empty string. No error was logged because the template treated missing variables as empty by default.

How I narrowed it down

I checked the user preferences lookup. The DynamoDB GetItem call for preferences was succeeding, but for users who had never explicitly set preferences, the item didn't exist. The orchestrator passed None to the template, and Jinja2 rendered it as empty.

The old template didn't reference {{user_preferences}}, so the absence was invisible. The new template depended on it, so 10% of users (those without explicit preferences) got degraded personalization.

Root cause

The template update added a dependency on a data field that was not universally populated. No validation checked whether required template variables were present and non-empty before rendering.

Fix

  1. Added a default value for user_preferences: if empty, populate from the user's recent purchase/browse history (inferred preferences).
  2. Added ProductionSchemaValidator pre-flight check: for each template, validate that all required variables are non-empty before rendering. If a required variable is missing, log a warning and use the previous template version that doesn't depend on it.
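
A minimal sketch of the pre-flight check in item 2, using Jinja2's meta API to discover which variables a template references; the fallback policy is simplified to a single previous-version template:

```python
import logging

from jinja2 import Environment, meta

logger = logging.getLogger(__name__)
env = Environment()

def validate_and_render(template_source: str, context: dict, fallback_source: str) -> str:
    """Render only if every variable the template references is present and non-empty;
    otherwise warn and fall back to the previous template version."""
    required = meta.find_undeclared_variables(env.parse(template_source))
    missing = sorted(name for name in required if not context.get(name))
    if missing:
        logger.warning("Empty template variables %s; using fallback template", missing)
        return env.from_string(fallback_source).render(**context)
    return env.from_string(template_source).render(**context)
```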

Prevention

Template changes now require a variable dependency manifest listing all required variables and their data sources. CI pipeline verifies that every required variable has a data source and a default fallback.

Services involved

ECS orchestrator, DynamoDB (user preferences), CloudWatch, CI/CD pipeline.

Metric or log signal

PromptConfusionDetector confusion rate ↑, template rendering logs with empty variables, TemplateFallbackCount > 0.


Scenario 6: RAG Retrieving Outdated Manga Pricing After Catalog Update (Skill 5.2.4)

Context

The MangaAssist product catalog is updated weekly with new pricing from the publisher API. The ingestion pipeline processes the full catalog dump every Sunday night.

Symptom

On Tuesday, a user asked about the price of a newly discounted manga box set. The chatbot quoted the old (higher) price. The user checked the website, saw the correct price, and escalated.

What alerted me

Customer support escalation. The DocumentStaleness metric showed some documents with last_updated timestamps older than expected, but the metric was aggregated and didn't surface the specific document.

What I checked first

I checked the Sunday ingestion Lambda logs. The Lambda completed successfully — 48,000 documents processed, 200 new, 47,800 updated. No errors.

How I narrowed it down

I searched for the specific product ID in the ingestion logs. The document for the box set was processed, but the pricing field in the ingested document showed the old price. I checked the catalog dump file — it also had the old price.

The publisher API had a known issue: price changes for box sets were published on a 48-hour delay relative to individual volumes. The Sunday catalog dump was captured before the Monday price update.

Root cause

The ingestion pipeline assumed the full catalog dump was always current. For a subset of products (box sets), pricing lagged by up to 48 hours in the publisher API, so the weekly ingestion schedule couldn't capture mid-week price updates.

Fix

  1. Added a delta-sync Lambda triggered by S3 event notifications (handler sketched after this list): when the publisher pushes a price update file (daily), re-ingest only the changed documents.
  2. Added a price_last_verified field to each document and surfaced a disclaimer when the verification is older than 24 hours: "Price shown was last verified on [date]. Check the product page for the latest price."
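
A minimal sketch of the delta-sync handler from item 1. The S3 event shape is the standard notification format; the price-file layout and the upsert helper are assumptions:

```python
import json

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Re-ingest only the products listed in the pushed price-update file."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        # Assumed layout: one JSON object per line with product_id and the updated fields.
        for line in filter(None, body.splitlines()):
            change = json.loads(line)
            upsert_product_document(change["product_id"], change)

def upsert_product_document(product_id: str, change: dict) -> None:
    """Hypothetical helper: re-embed and update this one document in OpenSearch."""
    ...
```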

Prevention

DocumentStaleness metric now tracks per-category staleness. Box sets and promotional items have a tighter freshness SLA (12 hours vs. 48 hours for static content). Alarm: BoxSetStaleness > 12h.
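
A minimal sketch of emitting per-category staleness so a BoxSetStaleness-style alarm has a dimension to key on (namespace and metric names are illustrative):

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_staleness(category: str, last_verified: datetime.datetime) -> None:
    """Publish document age in hours, dimensioned by product category."""
    age_hours = (
        datetime.datetime.now(datetime.timezone.utc) - last_verified
    ).total_seconds() / 3600
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/RAG",
        MetricData=[{
            "MetricName": "DocumentStaleness",
            "Dimensions": [{"Name": "Category", "Value": category}],
            "Value": age_hours,
            "Unit": "None",  # value is in hours; CloudWatch has no hours unit
        }],
    )
```

The per-category alarm (e.g. 12 hours for box sets, 48 hours for static content) then filters on the Category dimension.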

Services involved

Lambda (ingestion), S3, OpenSearch Serverless, DynamoDB (product catalog), CloudWatch.

Metric or log signal

DocumentStaleness > 24h for category=box_set, price_last_verified older than SLA, customer escalation count.


Scenario 7: X-Ray Traces Revealed Hidden Latency in Guardrail Chain (Skill 5.2.5)

Context

The MangaAssist orchestrator runs three Bedrock Guardrails sequentially: (1) input toxicity filter, (2) PII detection, (3) output content policy. Each guardrail adds latency.

Symptom

End-to-end response latency increased from P95=3.2s to P95=5.8s over two weeks. No single deployment correlated with the increase.

What alerted me

CloudWatch alarm: TotalExecutionDuration P95 > 5s. The alarm had been firing intermittently for a week before it became persistent.

What I checked first

I checked the bedrock_invocation span — Bedrock latency was stable at ~1.5s. I checked the prompt_assembly span — stable at ~50ms. The total trace was 5.8s, but the FM call accounted for only 1.5s, leaving 4.3s unaccounted for.

How I narrowed it down

I added granular X-Ray subsegments for each guardrail call. Results:

  - Input toxicity filter: 180ms
  - PII detection: 150ms
  - Output content policy: 3,900ms ← anomalous

The output content policy guardrail was processing the full FM response (including RAG citations and metadata), not just the user-visible text. As response length had gradually increased (more detailed recommendations), the guardrail processing time scaled linearly.
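
A minimal sketch of the per-guardrail subsegments, using the X-Ray SDK's in_subsegment context manager. The guardrail invocation itself is stubbed behind a hypothetical check_guardrail wrapper, and the chain is simplified to a single text input:

```python
from dataclasses import dataclass

from aws_xray_sdk.core import xray_recorder

@dataclass
class GuardrailResult:
    blocked: bool
    reason: str = ""

def check_guardrail(name: str, text: str) -> GuardrailResult:
    """Hypothetical wrapper around the real guardrail invocation (stubbed here)."""
    return GuardrailResult(blocked=False)

GUARDRAILS = ["input_toxicity", "pii_detection", "output_content_policy"]

def run_guardrail_chain(text: str) -> bool:
    """Return True if all guardrails pass; each call gets its own timed subsegment.
    Assumes an X-Ray segment is already open (e.g. by the service middleware)."""
    for name in GUARDRAILS:
        with xray_recorder.in_subsegment(f"guardrail_{name}") as subsegment:
            result = check_guardrail(name, text)
            subsegment.put_annotation("guardrail", name)
            subsegment.put_annotation("blocked", result.blocked)
        if result.blocked:
            return False
    return True
```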

Root cause

The output guardrail was receiving the raw FM response including structured metadata, citations, and formatting. The text-to-check had grown from ~500 tokens to ~2,000 tokens over two weeks as the recommendation template was expanded. The guardrail's processing time scaled roughly linearly with input length, causing the 4× latency increase.

Fix

  1. Pre-process the FM response before guardrail check: strip metadata, citations, and formatting. Pass only the user-visible text to the guardrail.
  2. Moved input guardrails to asynchronous (fire-and-forget for logging, block only on high-confidence toxicity).
  3. Set a guardrail timeout of 2 seconds — if exceeded, log and pass through (with downstream content policy check).

Prevention

Added per-guardrail latency metrics: Guardrail_Input_Toxicity, Guardrail_PII, Guardrail_Output_Policy. Alarm when any guardrail exceeds 1 second. Response size is now a tracked metric that correlates with guardrail latency.

Services involved

ECS orchestrator, Bedrock Guardrails, X-Ray, CloudWatch.

Metric or log signal

TotalExecutionDuration P95 > 5s, per-guardrail X-Ray subsegment duration, response size growth trend.


Scenario 8: Schema Validation Caught JSON Format Drift in LLM Responses (Skill 5.2.3)

Context

MangaAssist recommendation responses are expected in a structured JSON format: {"recommendations": [{"title": "...", "reason": "...", "price": "..."}], "summary": "..."}. The frontend parses this JSON to render recommendation cards.

Symptom

The frontend started showing raw JSON text instead of formatted cards for ~5% of recommendation responses.

What alerted me

SchemaViolationRate metric jumped from 0.2% to 5.3% after a model version update. Frontend error logs showed JSON parse failures.

What I checked first

I pulled 10 failing responses from CloudWatch Logs. The JSON structure had changed: the model was now returning "suggestions" instead of "recommendations" as the top-level key, and "price" was sometimes a number instead of a string.

How I narrowed it down

I compared the prompt template — no changes. I compared the model: the team had upgraded to a newer Sonnet point release three days earlier. The new model version interpreted the JSON schema instructions slightly differently, producing valid JSON that didn't match the expected schema.

I ran the golden test suite: 47/50 tests passed, but the 3 that failed all involved recommendation format — the same schema drift issue.

Root cause

FM model point release changed JSON output behavior. The prompt said "return recommendations as JSON" but didn't enforce the exact key names strongly enough. The new model interpreted "recommendations" as "suggestions" in some contexts.

Fix

  1. Hardened the prompt with an explicit schema: the system prompt now states "You MUST use exactly these keys: recommendations, title, reason, price (price as a string)" and includes a concrete example.
  2. Added a ProductionSchemaValidator fallback: if the schema doesn't match, attempt key remapping (suggestions → recommendations, sketched after this list) before failing.
  3. Updated the 3 golden test cases to assert exact key names.
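
A minimal sketch of the key-remapping fallback from item 2; the remap table and required keys are simplified:

```python
import json

KEY_REMAP = {"suggestions": "recommendations"}   # drift observed after the model upgrade
REQUIRED_KEYS = {"recommendations", "summary"}

def parse_recommendations(raw: str) -> dict:
    """Validate the FM's JSON, remapping known drifted keys before failing."""
    data = json.loads(raw)
    for old_key, new_key in KEY_REMAP.items():
        if old_key in data and new_key not in data:
            data[new_key] = data.pop(old_key)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Schema violation: missing keys {sorted(missing)}")
    for item in data["recommendations"]:
        # The schema expects price as a string; coerce numbers the model emits.
        if "price" in item and not isinstance(item["price"], str):
            item["price"] = str(item["price"])
    return data
```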

Prevention

Model version upgrades now require a full golden test suite run before production deployment. Added a "schema stability" dimension to the quality scorer that tests exact JSON key compliance.

Services involved

ECS orchestrator, Bedrock (Sonnet), CloudWatch, CI/CD pipeline.

Metric or log signal

SchemaViolationRate > 5%, golden test failures on schema dimensions, frontend JSON parse error logs.


Scenario 9: Multi-Turn Context Window Exhaustion in Long Support Sessions (Skill 5.2.1)

Context

MangaAssist support sessions for complex order issues (returns, replacements across multiple items) can span 20+ conversational turns. Each turn adds user message + assistant response to the conversation history.

Symptom

At turn 18, the chatbot "forgot" the user's original issue and started asking questions that had already been answered in turns 2–4. The user became frustrated and escalated to a human agent.

What alerted me

The human agent flagged the transcript. I checked CloudWatch: BudgetUtilization for that session had been above 90% since turn 12, and hit 100% at turn 15. After turn 15, the HistoryCompressor was aggressively summarizing — but it was compressing the oldest turns (1–5), which contained the original issue details.

What I checked first

I replayed the token budget allocation for each turn. By turn 12, conversation history consumed 3,200 tokens of the 5,000 budget. By turn 15, history alone was 4,100 tokens. The compressor started dropping turns 1–5 to fit.

How I narrowed it down

The HistoryCompressor used a naive FIFO strategy: compress/drop the oldest turns first. But for support sessions, the oldest turns contain the most critical context (the original complaint, order number, item details). Mid-session turns are often clarifications that are less important.

Root cause

Conversation history compression strategy was not intent-aware. FIFO compression dropped high-value early turns in support sessions, causing the chatbot to lose context about the core issue.

Fix

  1. Implemented a priority-based compression strategy (turn classification sketched after this list):
     - Mark the first 2 turns as FIXED (never compressed — they contain the issue statement)
     - Mark turns containing order numbers, item IDs, or resolution requests as FIXED
     - Mark clarification turns as COMPRESSIBLE
     - Mark chit-chat turns as DROPPABLE
  2. Added a session summary that's regenerated every 5 turns, stored separately from the raw history
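
A minimal sketch of the turn classification behind item 1. The keyword patterns and chit-chat list are illustrative:

```python
import re
from enum import Enum

class TurnPriority(Enum):
    FIXED = "fixed"                # never compressed or dropped
    COMPRESSIBLE = "compressible"  # may be summarized under budget pressure
    DROPPABLE = "droppable"        # first to go

ORDER_PATTERN = re.compile(r"\b(order|item)\s*#?\w{6,}\b", re.IGNORECASE)
RESOLUTION_WORDS = ("refund", "return", "replacement")
CHITCHAT = {"thanks", "thank you", "ok", "cool", "hello"}

def classify_turn(index: int, text: str) -> TurnPriority:
    lowered = text.strip().lower()
    if index < 2:
        return TurnPriority.FIXED          # the issue statement lives in the first turns
    if ORDER_PATTERN.search(text) or any(w in lowered for w in RESOLUTION_WORDS):
        return TurnPriority.FIXED          # order numbers, item IDs, resolution requests
    if lowered in CHITCHAT:
        return TurnPriority.DROPPABLE
    return TurnPriority.COMPRESSIBLE       # clarifications and everything else
```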

Prevention

New metric: OriginalContextRetained — after compression, check if the first 2 turns are still present. Alarm if not. For support sessions, increased the practical token budget to 6,000 tokens (the model limit allows it) to provide more headroom.

Services involved

ECS orchestrator, DynamoDB (conversation history), Bedrock, CloudWatch.

Metric or log signal

BudgetUtilization > 95% sustained across multiple turns, CompressionRatio > 0.5 (aggressive compression), OriginalContextRetained = false.


Scenario 10: Embedding Dimension Mismatch After Model Upgrade (Skill 5.2.4)

Context

Follow-up to Scenario 4 — after the botched v1→v2 upgrade, the team executed the planned migration: new index, full re-embedding, atomic swap. However, one component was missed.

Symptom

After the atomic index swap, 95% of queries worked correctly with the new v2 embeddings. But 5% of queries — specifically those from the "similar manga" feature — returned random, unrelated results.

What alerted me

RetrievalMRR metric for the similar_manga intent dropped from 0.72 to 0.11 while all other intents remained healthy. The EmbeddingDriftMonitor showed zero drift (because all documents were freshly embedded with v2).

What I checked first

I checked the query-side embedding: the main query Lambda was updated to v2. I checked the index: all documents were v2. Metrics said no drift. Everything looked correct at the infrastructure level.

How I narrowed it down

The "similar manga" feature used a different query path. Instead of encoding the user's text query, it took the embedding of an existing document (the manga the user was currently viewing) and found nearest neighbors. This embedding was read from a DynamoDB pre-computed similarity cache — and that cache had NOT been updated during the migration.

The cache contained v1 embeddings (1536d). The new index expected v2 queries (1024d). OpenSearch was silently truncating the 1536d vector to 1024d for the KNN query, producing random results.

Root cause

The migration plan covered the OpenSearch index and the query Lambda but missed the DynamoDB pre-computed embedding cache used by the "similar manga" feature. This was a secondary embedding consumer that wasn't in the migration checklist.

Fix

  1. Regenerated the similarity cache with v2 embeddings by running a one-time batch Lambda.
  2. Updated the "similar manga" Lambda to encode the query via Bedrock (v2) instead of reading from cache, as a temporary fix while the cache rebuilt.

Prevention

Created an Embedding Consumer Registry — a document listing every component that reads or writes embeddings, with the expected model version and dimension. The migration checklist now requires sign-off from each consumer before proceeding. Added a dimension assertion at query time: if the query vector dimension doesn't match the index dimension, fail loudly with a clear error message instead of silently truncating.
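
A minimal sketch of the query-time dimension assertion:

```python
def assert_query_dimension(query_vector: list[float], index_dimension: int,
                           index_name: str) -> None:
    """Refuse to run a KNN query whose vector doesn't match the index dimension."""
    if len(query_vector) != index_dimension:
        raise ValueError(
            f"Query vector has {len(query_vector)} dimensions but index "
            f"'{index_name}' expects {index_dimension}; check the embedding model "
            "version against the Embedding Consumer Registry."
        )
```

Calling this before every KNN query, including the similarity-cache path, turns silent truncation into an immediate, attributable failure.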

Services involved

OpenSearch Serverless, DynamoDB (similarity cache), Bedrock (Titan Embed v2), Lambda (similar manga query), CloudWatch.

Metric or log signal

RetrievalMRR dropped for specific intent, dimension mismatch assertion failures, stale cache last_updated timestamps.


Quick Reference: Scenario → Skill Mapping

| Scenario | Skill | Root Cause Category | Key Takeaway |
|---|---|---|---|
| 1: Series FAQ truncation | 5.2.1 | Silent truncation | Budget-aware assembly prevents silent failures |
| 2: Streaming timeout | 5.2.2 | Timeout mismatch | Streaming needs per-chunk + total timeout |
| 3: Rollback quality drop | 5.2.3 | Compound rollback | Prompt + model version must deploy/rollback as a pair |
| 4: Dimension mismatch | 5.2.4 | Mixed embeddings | Never mix embedding model versions in one index |
| 5: Template variable missing | 5.2.5 | Silent empty render | Validate template variables are non-empty before rendering |
| 6: Stale pricing | 5.2.4 | Ingestion lag | Delta-sync + freshness SLA per category |
| 7: Hidden guardrail latency | 5.2.5 | Unmonitored span | Instrument every processing step with X-Ray |
| 8: JSON key drift | 5.2.3 | Model behavior change | Explicit schema in prompt + model upgrade testing |
| 9: Support context loss | 5.2.1 | Naive compression | Intent-aware compression preserves critical context |
| 10: Similarity cache miss | 5.2.4 | Incomplete migration | Maintain an Embedding Consumer Registry |

Intuition Gained

| Instinct | What These Scenarios Teach |
|---|---|
| Compound root cause awareness | The hardest bugs (Scenarios 3, 4, 10) involve multiple things changing at once. Always check: "what else changed at the same time?" |
| Silent failure suspicion | Scenarios 1, 5, 9 produced no errors — just degraded quality. Build the habit of monitoring quality metrics, not just error metrics |
| Migration checklist discipline | Scenarios 4 and 10 show that missing one consumer during a migration causes partial, confusing failures. Registry-based checklists prevent this |
| Latency archaeology | Scenario 7 demonstrates that latency can creep in gradually through unmonitored components. X-Ray instrumentation on every step is not optional |
| Schema as contract | Scenario 8 shows that FMs do not guarantee output format stability. Treat the response schema as a contract that must be validated and enforced |