4. Debugging Quick Reference

This is the scan-first guide for MangaAssist incidents. Use it to decide where to look before diving into the longer documents.

  • Bedrock-specific analysis: 01-bedrock-logging.md
  • Full application trace path: 02-application-logging.md
  • Interview narratives and root-cause stories: 03-debugging-scenarios.md

Triage Matrix

Each symptom lists the first, second, and third place to check.

  • Wrong answer: classifier output and confidence → retrieval chunks and reranker scores → Bedrock prompt plus raw output
  • Slow answer: orchestrator latency breakdown → Bedrock latency and throttling → Lambda cold start or ECS saturation
  • No answer: API Gateway status and backend exceptions → WebSocket disconnects → Bedrock timeout or fallback rate
  • Blocked answer: guardrail trace and block category → prompt wording for risky phrasing → intent-specific false-positive rate
  • Stale answer: chunk last_updated metadata → catalog freshness → reindex or refresh pipeline
  • Misrouted answer: classifier confusion → rule-based prefilter coverage → shadow-mode regression comparison

First Three Places To Check by Symptom

If the answer is wrong

  1. Check classifier prediction and confidence.
  2. Check retrieved chunks, reranker scores, and source freshness.
  3. Check Bedrock prompt payload, raw output, and post-generation validation.
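
For a single bad answer, pull that request's structured log events and read all three signals side by side. A minimal sketch, assuming the orchestrator writes JSON logs with the field names used by the queries later in this file; the request ID and time window are placeholders.

# Placeholder request ID; --start-time is epoch milliseconds (GNU date shown).
aws logs filter-log-events \
  --log-group-name /aws/lambda/mangaassist-orchestrator \
  --filter-pattern '{ $.request_id = "req-123" }' \
  --start-time $(date -d '-1 hour' +%s000)

The matching events should carry the intent, intent_confidence, retrieved_chunk_ids, and prompt payload for that request.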

If the answer is slow

  1. Check orchestrator trace for the slowest span.
  2. Check Bedrock invocation latency and throttling rate.
  3. Check runtime overhead such as Lambda cold starts, ECS saturation, or open circuit breakers.
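
When the slow span points at the model, the per-model runtime metrics in the AWS/Bedrock namespace confirm it from the CloudWatch side. A sketch with a placeholder model ID; the time window mirrors the CLI examples below.

# InvocationLatency is reported in milliseconds.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name InvocationLatency \
  --dimensions Name=ModelId,Value=anthropic.claude-3-haiku-20240307-v1:0 \
  --start-time 2026-03-22T17:00:00Z --end-time 2026-03-22T18:00:00Z \
  --period 60 --extended-statistics p99

# Throttle counts over the same window; a sustained rise past 5% of invocations is incident-worthy.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name InvocationThrottles \
  --dimensions Name=ModelId,Value=anthropic.claude-3-haiku-20240307-v1:0 \
  --start-time 2026-03-22T17:00:00Z --end-time 2026-03-22T18:00:00Z \
  --period 60 --statistics Sum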

If there is no answer

  1. Check API Gateway request acceptance and response status.
  2. Check orchestrator exceptions or early fallback behavior.
  3. Check WebSocket delivery logs and disconnect reasons.
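
Before reading application logs, confirm whether the gateway is surfacing errors at all. A sketch for a REST API with a placeholder ApiName; the 4XXError and 5XXError pair applies to REST APIs, while a WebSocket API reports a different metric set in the same namespace.

# Repeat with 4XXError to separate client rejections from backend failures.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name 5XXError \
  --dimensions Name=ApiName,Value=mangaassist-api \
  --start-time 2026-03-22T17:00:00Z --end-time 2026-03-22T18:00:00Z \
  --period 60 --statistics Sum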

If the answer is blocked or evasive

  1. Check Bedrock guardrail category and confidence.
  2. Check whether the content is a domain false positive.
  3. Check whether the fallback text replaced an otherwise valid raw answer.
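
Pair the blocked category from the logs with the guardrail configuration itself to see which policy a manga-domain term tripped. A sketch, assuming managed Bedrock guardrails; the identifier and version are placeholders.

# List guardrails to find the identifier, then inspect the configured topic, content, and word filters.
aws bedrock list-guardrails
aws bedrock get-guardrail \
  --guardrail-identifier mangaassist-guardrail-id \
  --guardrail-version DRAFT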

If the answer is stale

  1. Check last_updated on selected chunks.
  2. Check product catalog freshness and event propagation.
  3. Check whether reindexing or cache warming affected retrieval.
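
If retrieval runs on a Bedrock knowledge base, the ingestion-job history shows whether the refresh path is lagging behind the catalog. This is an assumption about the stack, and the knowledge base and data source IDs are placeholders.

# A stale or failed most-recent job means current catalog data never reached retrieval.
aws bedrock-agent list-ingestion-jobs \
  --knowledge-base-id KB1234567890 \
  --data-source-id DS1234567890 \
  --max-results 5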

Dashboard Checklist

Dependency Health

  • API Gateway 4xx and 5xx
  • circuit breaker open events
  • DynamoDB throttle events
  • OpenSearch latency and error spikes

Model Health

  • TTFT P50 and P99
  • Bedrock throttling events
  • input and output token trends
  • timeout and fallback rate

Retrieval Health

  • Recall@3 trend
  • high-frequency chunk dominance
  • stale chunk selection
  • reranker score drift

Guardrail Behavior

  • guardrail block rate
  • false positives by intent or genre
  • ASIN validation failures
  • price validation failures

Business Impact

  • escalation rate
  • thumbs-down trend
  • conversion rate for chat users
  • support deflection trend

Thresholds Worth Memorizing

These are the most useful numbers to remember from the repo:

  • P99 first-token latency: < 1.5s target
  • P99 full response latency: < 3s target
  • Error rate: < 0.5% target
  • Intent accuracy: > 90% target
  • Hallucination rate: < 2% target
  • Guardrail block rate: < 5% target
  • LLM timeout rate: > 5% is incident-worthy
  • Bedrock throttling: > 5% triggers a response action
  • Guardrail pass rate in eval: >= 95%
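
The Bedrock throttling threshold is the easiest of these to wire into an alarm rather than memorize. A hedged sketch using metric math over the AWS/Bedrock runtime metrics; the alarm name, model ID, and five-minute period are placeholders rather than values from the repo.

# Alarms when throttled invocations exceed 5% of total invocations for three consecutive periods.
aws cloudwatch put-metric-alarm \
  --alarm-name mangaassist-bedrock-throttle-rate \
  --comparison-operator GreaterThanThreshold \
  --threshold 5 \
  --evaluation-periods 3 \
  --metrics '[
    {"Id": "throttles", "ReturnData": false,
     "MetricStat": {"Metric": {"Namespace": "AWS/Bedrock", "MetricName": "InvocationThrottles",
      "Dimensions": [{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}]},
      "Period": 300, "Stat": "Sum"}},
    {"Id": "invocations", "ReturnData": false,
     "MetricStat": {"Metric": {"Namespace": "AWS/Bedrock", "MetricName": "Invocations",
      "Dimensions": [{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}]},
      "Period": 300, "Stat": "Sum"}},
    {"Id": "throttle_rate", "Label": "Bedrock throttle rate (%)",
     "Expression": "100 * throttles / invocations", "ReturnData": true}
  ]'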

Reusable CloudWatch Logs Insights Queries

Bedrock failures

fields @timestamp, request_id, model_id, invocation_status, error_code, bedrock_latency_ms
| filter invocation_status in ["timeout", "throttled", "error"]
| sort @timestamp desc
| limit 50

Guardrail block concentration

filter ispresent(guardrail_category)
| stats count(*) as blocked by intent, guardrail_category
| sort blocked desc

Token inflation by intent

stats avg(input_tokens) as avg_in, avg(output_tokens) as avg_out by intent, model_id
| sort avg_out desc

Frequently retrieved chunks

fields retrieved_chunk_ids
| unnest retrieved_chunk_ids as chunk_id
| stats count(*) as retrieval_count by chunk_id
| sort retrieval_count desc
| limit 20

Misrouting investigation

fields @timestamp, request_id, intent, intent_confidence, fallback_used
| filter intent_confidence < 0.70
| sort @timestamp desc
| limit 100
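
These queries can be run without the console: the CLI starts a query asynchronously and returns a query ID to poll. A sketch using the Bedrock failures query above; the log group matches the CLI examples below and the three-hour window is a placeholder (GNU date shown).

QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/bedrock/mangaassist \
  --start-time $(date -d '-3 hours' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, request_id, model_id, invocation_status, error_code, bedrock_latency_ms
| filter invocation_status in ["timeout", "throttled", "error"]
| sort @timestamp desc
| limit 50' \
  --query queryId --output text)

# Poll until the query status is Complete, then read the rows.
aws logs get-query-results --query-id "$QUERY_ID"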

Useful AWS CLI Checks

These are interview-friendly examples, not environment-specific production commands.

Tail recent Lambda orchestrator logs

aws logs tail /aws/lambda/mangaassist-orchestrator --since 15m --follow

Search a Bedrock log group for one request ID

aws logs filter-log-events --log-group-name /aws/bedrock/mangaassist --filter-pattern '"req-123"'

Inspect a SageMaker endpoint's health

# Percentiles go through --extended-statistics; request Average in a separate call with --statistics.
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name ModelLatency \
  --dimensions Name=EndpointName,Value=mangaassist-intent-classifier Name=VariantName,Value=AllTraffic \
  --start-time 2026-03-22T17:00:00Z --end-time 2026-03-22T18:00:00Z \
  --period 60 --extended-statistics p99

Look at recent DynamoDB throttling or latency metrics

# Latency is reported per operation; swap the metric for ThrottledRequests to check throttling instead.
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name SuccessfulRequestLatency \
  --dimensions Name=TableName,Value=MangaAssistSessions Name=Operation,Value=GetItem \
  --start-time 2026-03-22T17:00:00Z --end-time 2026-03-22T18:00:00Z \
  --period 60 --statistics Average Maximum

Interview Cheat Sheet

Bedrock throttle incident

I separated upstream fan-out latency from Bedrock latency, found throttling above the 5% threshold, and shifted more traffic to lighter-weight or non-LLM paths until the spike passed.

Stale RAG incident

I proved the catalog was current, then used retrieval traces to show stale chunks were being selected because the knowledge-base refresh path lagged behind.

Misrouting incident

I traced recommendation-like prompts into the classifier, found they were being labeled as product_question, and corrected it with more training data plus route validation.

Guardrail over-blocking incident

I inspected blocked samples, found manga-domain terms were being treated as unsafe, and tuned guardrails toward context-aware confidence scoring instead of rigid binary blocking.

Price hallucination incident

I confirmed the source catalog data was correct, then showed the model invented a discount narrative from dense comparative context and fixed it with stricter prompting plus hard validation.


Usage Guidance

Use this file when you need to decide where to start.

Use 01-bedrock-logging.md when the issue is likely inside generation, retrieval visibility, or guardrails.

Use 02-application-logging.md when the issue may be anywhere in the request path.

Use 03-debugging-scenarios.md when you want a full interview answer with context, root cause, fix, and prevention.