4. Debugging Quick Reference
This is the scan-first guide for MangaAssist incidents. Use it to decide where to look before diving into the longer documents.
- Bedrock-specific analysis: 01-bedrock-logging.md
- Full application trace path: 02-application-logging.md
- Interview narratives and root-cause stories: 03-debugging-scenarios.md
Triage Matrix
| Symptom | First place to check | Second place to check | Third place to check |
|---|---|---|---|
| Wrong answer | classifier output and confidence | retrieval chunks and reranker scores | Bedrock prompt plus raw output |
| Slow answer | orchestrator latency breakdown | Bedrock latency and throttling | Lambda cold start or ECS saturation |
| No answer | API Gateway status and backend exceptions | WebSocket disconnects | Bedrock timeout or fallback rate |
| Blocked answer | guardrail trace and block category | prompt wording for risky phrasing | intent-specific false-positive rate |
| Stale answer | chunk last_updated metadata | catalog freshness | reindex or refresh pipeline |
| Misrouted answer | classifier confusion | rule-based prefilter coverage | shadow-mode regression comparison |
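The matrix above can be encoded directly in a runbook script so on-call engineers get the ordered checklist for a reported symptom. This is a hypothetical helper, not part of the MangaAssist codebase; the symptom keywords are made up for illustration.

```python
# Hypothetical triage helper: encodes the matrix above so a runbook
# script can print the ordered places to check for a symptom keyword.
TRIAGE = {
    "wrong": ["classifier output and confidence",
              "retrieval chunks and reranker scores",
              "Bedrock prompt plus raw output"],
    "slow": ["orchestrator latency breakdown",
             "Bedrock latency and throttling",
             "Lambda cold start or ECS saturation"],
    "none": ["API Gateway status and backend exceptions",
             "WebSocket disconnects",
             "Bedrock timeout or fallback rate"],
    "blocked": ["guardrail trace and block category",
                "prompt wording for risky phrasing",
                "intent-specific false-positive rate"],
    "stale": ["chunk last_updated metadata",
              "catalog freshness",
              "reindex or refresh pipeline"],
    "misrouted": ["classifier confusion",
                  "rule-based prefilter coverage",
                  "shadow-mode regression comparison"],
}

def checklist(symptom: str) -> list[str]:
    """Return the ordered checks for a symptom keyword, first place first."""
    return TRIAGE[symptom]
```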
First Three Places To Check by Symptom
If the answer is wrong
- Check classifier prediction and confidence.
- Check retrieved chunks, reranker scores, and source freshness.
- Check Bedrock prompt payload, raw output, and post-generation validation.
If the answer is slow
- Check orchestrator trace for the slowest span.
- Check Bedrock invocation latency and throttling rate.
- Check runtime overhead such as Lambda cold starts, ECS saturation, or open circuit breakers.
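The first slow-answer step is finding the slowest span in the orchestrator trace. A minimal sketch, assuming a simple span shape of name plus start and end milliseconds (this is not the actual MangaAssist trace schema):

```python
# Sketch: pick the slowest span out of an orchestrator trace.
# The span shape (name, start_ms, end_ms) is an assumption for illustration.
def slowest_span(spans: list[dict]) -> dict:
    """Return the span with the largest duration."""
    return max(spans, key=lambda s: s["end_ms"] - s["start_ms"])

trace = [
    {"name": "classify_intent", "start_ms": 0, "end_ms": 42},
    {"name": "retrieve_chunks", "start_ms": 42, "end_ms": 180},
    {"name": "bedrock_invoke", "start_ms": 180, "end_ms": 2350},
]
```

In this illustrative trace, the Bedrock invocation dominates, which points the investigation at model latency and throttling rather than retrieval.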
If there is no answer
- Check API Gateway request acceptance and response status.
- Check orchestrator exceptions or early fallback behavior.
- Check WebSocket delivery logs and disconnect reasons.
If the answer is blocked or evasive
- Check Bedrock guardrail category and confidence.
- Check whether the content is a domain false positive.
- Check whether the fallback text replaced an otherwise valid raw answer.
If the answer is stale
- Check last_updated on selected chunks.
- Check product catalog freshness and event propagation.
- Check whether reindexing or cache warming affected retrieval.
Dashboard Checklist
Dependency Health
- API Gateway 4xx and 5xx
- circuit breaker open events
- DynamoDB throttle events
- OpenSearch latency and error spikes
Model Health
- TTFT (time to first token) P50 and P99
- Bedrock throttling events
- input and output token trends
- timeout and fallback rate
Retrieval Health
- Recall@3 trend
- high-frequency chunk dominance
- stale chunk selection
- reranker score drift
Guardrail Behavior
- guardrail block rate
- false positives by intent or genre
- ASIN validation failures
- price validation failures
Business Impact
- escalation rate
- thumbs-down trend
- conversion rate for chat users
- support deflection trend
Thresholds Worth Memorizing
These are the most useful numbers to remember from the repo:
| Signal | Target or threshold |
|---|---|
| P99 first-token latency | < 1.5s target |
| P99 full response latency | < 3s target |
| Error rate | < 0.5% target |
| Intent accuracy | > 90% target |
| Hallucination rate | < 2% target |
| Guardrail block rate | < 5% target |
| LLM timeout rate | > 5% is incident-worthy |
| Bedrock throttling | > 5% triggers response action |
| Guardrail pass rate in eval | >= 95% |
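These thresholds can be wired into a small alert evaluator. A sketch, assuming illustrative metric names and rates expressed as fractions; it is not the production alerting config:

```python
# Sketch: encode the thresholds above and report which observed metrics
# breach them. Metric names and fractional units are assumptions.
THRESHOLDS = {
    "p99_first_token_s": ("lt", 1.5),
    "p99_full_response_s": ("lt", 3.0),
    "error_rate": ("lt", 0.005),
    "intent_accuracy": ("gt", 0.90),
    "hallucination_rate": ("lt", 0.02),
    "guardrail_block_rate": ("lt", 0.05),
    "llm_timeout_rate": ("lt", 0.05),
    "bedrock_throttle_rate": ("lt", 0.05),
}

def breaches(observed: dict) -> list[str]:
    """Return the names of observed metrics that violate their threshold."""
    out = []
    for name, value in observed.items():
        op, limit = THRESHOLDS[name]
        ok = value < limit if op == "lt" else value > limit
        if not ok:
            out.append(name)
    return out
```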
Reusable CloudWatch Logs Insights Queries
Bedrock failures
```
fields @timestamp, request_id, model_id, invocation_status, error_code, bedrock_latency_ms
| filter invocation_status in ["timeout", "throttled", "error"]
| sort @timestamp desc
| limit 50
```
Guardrail block concentration
```
stats count(*) as blocked by intent, guardrail_category
| sort blocked desc
```
Token inflation by intent
```
stats avg(input_tokens) as avg_in, avg(output_tokens) as avg_out by intent, model_id
| sort avg_out desc
```
Frequently retrieved chunks
```
fields retrieved_chunk_ids
# note: unnest is not a core Logs Insights command; if it is unavailable
# in your account, flatten the array with parse or aggregate client-side
| unnest retrieved_chunk_ids as chunk_id
| stats count(*) as retrieval_count by chunk_id
| sort retrieval_count desc
| limit 20
```
Misrouting investigation
```
fields @timestamp, request_id, intent, intent_confidence, fallback_used
| filter intent_confidence < 0.70
| sort @timestamp desc
| limit 100
```
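When the query results are exported for offline analysis, the same low-confidence cut can be computed client-side. A sketch, assuming rows shaped like the query fields above (intent_confidence as a float); the 0.70 cutoff matches the query:

```python
# Sketch: given exported log rows, compute the share of requests routed
# with classifier confidence below a cutoff. Row shape is an assumption.
def low_confidence_rate(rows: list[dict], cutoff: float = 0.70) -> float:
    """Fraction of rows whose intent_confidence is below the cutoff."""
    if not rows:
        return 0.0
    low = sum(1 for r in rows if r["intent_confidence"] < cutoff)
    return low / len(rows)
```

A sustained rise in this rate is an early misrouting signal worth comparing against the shadow-mode regression results.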
Useful AWS CLI Checks
These are interview-friendly examples, not environment-specific production commands.
Tail recent Lambda orchestrator logs
```shell
aws logs tail /aws/lambda/mangaassist-orchestrator --since 15m --follow
```
Search a Bedrock log group for one request ID
```shell
aws logs filter-log-events --log-group-name /aws/bedrock/mangaassist --filter-pattern "req-123"
```
Inspect a SageMaker endpoint's health
```shell
# GetMetricStatistics accepts either --statistics or --extended-statistics,
# not both in one call; percentiles such as p99 go through the latter.
# Run a second call with --statistics Average if you also want the mean.
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker --metric-name ModelLatency \
  --dimensions Name=EndpointName,Value=mangaassist-intent-classifier \
  --start-time 2026-03-22T17:00:00Z --end-time 2026-03-22T18:00:00Z \
  --period 60 --extended-statistics p99
```
Look at recent DynamoDB throttling or latency metrics
```shell
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB --metric-name SuccessfulRequestLatency \
  --dimensions Name=TableName,Value=MangaAssistSessions \
  --start-time 2026-03-22T17:00:00Z --end-time 2026-03-22T18:00:00Z \
  --period 60 --statistics Average Maximum
```
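The get-metric-statistics calls return JSON with a Datapoints array. A small sketch that pulls the worst datapoint out of a response like the DynamoDB latency call; the sample payload is illustrative, not real output:

```python
import json

# Sketch: extract the worst Maximum from a GetMetricStatistics response.
# The sample payload below is fabricated for illustration.
sample = json.loads("""
{"Label": "SuccessfulRequestLatency",
 "Datapoints": [
   {"Timestamp": "2026-03-22T17:00:00Z", "Average": 4.1, "Maximum": 19.0, "Unit": "Milliseconds"},
   {"Timestamp": "2026-03-22T17:01:00Z", "Average": 3.8, "Maximum": 55.2, "Unit": "Milliseconds"}
 ]}
""")

def worst_maximum(response: dict) -> float:
    """Return the largest Maximum value across all datapoints."""
    return max(dp["Maximum"] for dp in response["Datapoints"])
```

Note that Datapoints are not guaranteed to come back in timestamp order, which is why the sketch scans all of them rather than taking the last entry.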
Interview Cheat Sheet
Bedrock throttle incident
I separated upstream fan-out latency from Bedrock latency, found throttling above the 5% threshold, and shifted more traffic to lighter-weight or non-LLM paths until the spike passed.
Stale RAG incident
I proved the catalog was current, then used retrieval traces to show stale chunks were being selected because the knowledge-base refresh path lagged behind.
Misrouting incident
I traced recommendation-like prompts into the classifier, found they were being labeled as product_question, and corrected it with more training data plus route validation.
Guardrail over-blocking incident
I inspected blocked samples, found manga-domain terms were being treated as unsafe, and tuned guardrails toward context-aware confidence scoring instead of rigid binary blocking.
Price hallucination incident
I confirmed the source catalog data was correct, then showed the model invented a discount narrative from dense comparative context and fixed it with stricter prompting plus hard validation.
Usage Guidance
- Use this file when you need to decide where to start.
- Use 01-bedrock-logging.md when the issue is likely inside generation, retrieval visibility, or guardrails.
- Use 02-application-logging.md when the issue may be anywhere in the request path.
- Use 03-debugging-scenarios.md when you want a full interview answer with context, root cause, fix, and prevention.