# 2. Application Logging Across MangaAssist
This document explains how I debug the chatbot throughout the application, not just inside Bedrock. It is the operational map for tracing one user turn across the full MangaAssist architecture.
For Bedrock-specific retrieval, invocation, and guardrail visibility, see `01-bedrock-logging.md`.
## The Core Principle
I never debug a distributed chatbot by jumping between unrelated logs. I debug by following one turn through the system using shared identifiers.
The minimum correlation set is:
- `request_id`
- `session_id`
- `turn_id`
- `customer_id_hash` for authenticated users
- timestamp window
Without those fields, a Prime Day incident becomes guesswork.
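To make that concrete, here is a minimal sketch of pulling one turn's events across services with CloudWatch Logs Insights, assuming every service emits the shared fields as JSON; the log group names are placeholders, not the repo's real ones:

```python
import time

import boto3

logs = boto3.client("logs")

def events_for_turn(request_id: str, start_s: int, end_s: int) -> list:
    """Pull every log event for one turn, correlated on request_id."""
    query = (
        "fields @timestamp, @log, @message "
        f"| filter request_id = '{request_id}' "
        "| sort @timestamp asc"
    )
    started = logs.start_query(
        logGroupNames=[  # hypothetical group names for this architecture
            "/aws/apigateway/mangaassist-access",
            "/aws/lambda/mangaassist-orchestrator",
        ],
        startTime=start_s,
        endTime=end_s,
        queryString=query,
    )
    while True:
        result = logs.get_query_results(queryId=started["queryId"])
        if result["status"] not in ("Scheduled", "Running"):
            return result.get("results", [])
        time.sleep(1)
```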
## Request Path I Trace During Incidents
The repo's documented request path is:
1. Frontend React widget sends the user message and page context.
2. API Gateway validates rate limits and auth.
3. Orchestrator loads memory and classifies intent.
4. Orchestrator fans out in parallel to recommendation, catalog, and RAG.
5. Prompt is assembled and sent to Bedrock.
6. Guardrails validate the response.
7. Response is streamed over WebSocket.
8. Memory and analytics are updated.
When debugging, I follow exactly that sequence. I do not start in the middle.
## Service-by-Service Log Inventory
| Service | AWS runtime | Primary log location | Key fields to inspect | Common failure indicators | Sensitivity note |
|---|---|---|---|---|---|
| Frontend widget | Browser plus CDN | Client telemetry pipeline or frontend logs | session_id, page context, WebSocket events, click events | message not sent, duplicate send, missing page context | avoid raw PII |
| API Gateway | Managed API layer | CloudWatch access and execution logs | request_id, route, auth result, status code, latency | 4xx spikes, 5xx spikes, rate-limit rejections | request metadata only |
| Auth validation | Internal service | CloudWatch service logs | request_id, auth decision, user state | expired session, auth retries, banned or guest misclassification | no raw tokens in logs |
| Orchestrator | Lambda or ECS | `/aws/lambda/...` or `/ecs/...` | request_id, session_id, intent, services called, total latency, fallback flags | timeout, fan-out failure, bad routing, prompt inflation | most important application log |
| Conversation memory | DynamoDB | CloudWatch metrics plus app logs around DDB calls | session_id, turn count, read latency, write latency, item size | throttle, hot partition, large session payload | no raw conversation text in standard logs |
| Intent classifier | SageMaker endpoint | `/aws/sagemaker/Endpoints/...` and app trace spans | predicted intent, confidence, model version, inference latency | misrouting, confidence drop, endpoint latency | keep user text scrubbed |
| Recommendation engine | Service or Lambda | service logs in CloudWatch | seed entity, returned ASINs, latency, cache hit | empty recommendations, wrong seed resolution, slow response | product IDs only |
| Product catalog | Internal service | service logs in CloudWatch | ASINs requested, response status, price freshness, inventory freshness | stale product metadata, partial response, price mismatch | price data is business-sensitive |
| RAG retrieval | OpenSearch plus service layer | OpenSearch logs plus orchestrator trace | query, chunk IDs, rerank scores, last_updated | irrelevant chunk dominance, stale content, cold cache | sanitize user question |
| Bedrock generation | Managed FM service | invocation logs in CloudWatch and S3 | model, token counts, generation latency, status | throttling, timeout, output drift, token inflation | prompt and response need controlled access |
| Guardrails | Bedrock plus app logic | CloudWatch guardrail traces and orchestrator logs | filter results, block reason, fallback used | false positives, over-blocking, unexpected rewrite | sensitive due to user content |
| WebSocket delivery | API Gateway or ALB | access logs plus connection metrics | connection ID, disconnect reason, frame timing | stalled stream, dropped connection, reconnect storm | connection metadata only |
| Analytics pipeline | Kinesis | CloudWatch metrics and consumer logs | event count, lag, failed records | missing feedback data, downstream lag | customer IDs must remain hashed |
## Structured Logging Contract for MangaAssist
Across services, I would normalize around a shared JSON logging shape like this:
```json
{
  "timestamp": "2026-03-22T18:14:55Z",
  "request_id": "req-123",
  "session_id": "sess-456",
  "turn_id": "turn-17",
  "customer_id_hash": "cust-9af...",
  "intent": "recommendation",
  "intent_confidence": 0.94,
  "route": ["memory", "classifier", "recommendation_engine", "product_catalog", "rag", "bedrock", "guardrails"],
  "model_id": "claude-3.5-sonnet",
  "model_version": "candidate",
  "kb_query_id": "kbq-777",
  "asin_list": ["B0AAA", "B0BBB", "B0CCC"],
  "cache_hit": false,
  "fallback_used": false,
  "guardrail_action": "pass",
  "latency_ms": 2341,
  "error_code": null,
  "escalation_flag": false
}
```
This schema supports both production troubleshooting and later offline analysis.
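A small shared helper is one way to keep that shape identical everywhere. This is a sketch assuming Python services and the standard logging module; the helper name is mine, not the repo's:

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("mangaassist")

def log_turn(request_id: str, session_id: str, turn_id: str, **fields) -> None:
    """Emit one structured record in the shared shape above."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "session_id": session_id,
        "turn_id": turn_id,
        # Optional keys are always present so Logs Insights queries never
        # have to special-case a service.
        "intent": fields.get("intent"),
        "latency_ms": fields.get("latency_ms"),
        "fallback_used": fields.get("fallback_used", False),
        "guardrail_action": fields.get("guardrail_action"),
        "error_code": fields.get("error_code"),
    }
    logger.info(json.dumps(record))
```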
## Secure Logging Rules for This Project
MangaAssist cannot log everything to CloudWatch indiscriminately. The repo already implies a split between operational metadata and more sensitive prompt or message content.
My logging rules for this project are:
- No raw PII in routine CloudWatch logs.
- Customer IDs are hashed before reaching analytics or trace views.
- Raw prompts and responses, if retained for deep debugging, go to encrypted S3 with restricted access and auditability.
- Standard application logs store metadata, not full message bodies.
- Message-level debugging access is time-limited and requires explicit approval, because user content may contain addresses, order details, or payment-adjacent context.
This is especially important for support flows like order tracking and return requests.
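A sketch of the metadata/content split these rules imply, assuming an HMAC key injected via the environment and a restricted SSE-KMS bucket; the bucket name and variable name are illustrative:

```python
import hashlib
import hmac
import json
import os

import boto3

s3 = boto3.client("s3")
HASH_KEY = os.environ["CUSTOMER_ID_HMAC_KEY"].encode()  # assumed env var
DEBUG_BUCKET = "mangaassist-debug-payloads"  # hypothetical bucket

def hash_customer_id(customer_id: str) -> str:
    """Keyed hash so analytics can join on a stable ID without raw PII."""
    digest = hmac.new(HASH_KEY, customer_id.encode(), hashlib.sha256)
    return "cust-" + digest.hexdigest()[:12]

def archive_raw_turn(request_id: str, prompt: str, response: str) -> None:
    """Raw message bodies go only to encrypted S3, never to CloudWatch."""
    s3.put_object(
        Bucket=DEBUG_BUCKET,
        Key=f"turns/{request_id}.json",
        Body=json.dumps({"prompt": prompt, "response": response}),
        ServerSideEncryption="aws:kms",
    )
```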
## How I Debug One Turn End to End
### Step 1: Check API Entry
From API Gateway logs, verify:
- the request actually reached the backend,
- auth succeeded,
- rate limiting did not reject the request,
- the route and status code look normal.
If the request never entered cleanly, I stop there. There is no reason to inspect Bedrock yet.
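A minimal version of that entry check, assuming JSON-formatted access logs that carry requestId; the log group name is a placeholder:

```python
import boto3

logs = boto3.client("logs")

def gateway_entry(request_id: str, start_ms: int, end_ms: int) -> list:
    """Return API Gateway access-log events for one request, if any."""
    resp = logs.filter_log_events(
        logGroupName="/aws/apigateway/mangaassist-access",  # hypothetical
        startTime=start_ms,
        endTime=end_ms,
        filterPattern=f'{{ $.requestId = "{request_id}" }}',
    )
    # No events means the request never cleanly entered; stop here rather
    # than jumping ahead to Bedrock.
    return resp.get("events", [])
```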
### Step 2: Check Orchestrator Decision-Making
In orchestrator logs, verify:
- conversation history loaded correctly,
- intent classification was called,
- expected services were invoked,
- total latency matches the latency budget,
- fallback flags were or were not triggered.
This is where I usually isolate whether the incident is routing, data, model, or delivery.
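To make that isolation mechanical, a first-pass check can compare the orchestrator's turn record against the expected fan-out and budget. The expected route comes from the request path above and the 3s budget from the repo's targets; the helper itself is a sketch over the JSON contract:

```python
# Services every healthy turn should touch, per the documented request path.
EXPECTED_ROUTE = {"memory", "classifier", "recommendation_engine",
                  "product_catalog", "rag", "bedrock", "guardrails"}

def check_orchestrator(record: dict) -> list:
    """Flag routing gaps, blown budgets, and fallbacks for one turn."""
    problems = []
    missing = EXPECTED_ROUTE - set(record.get("route", []))
    if missing:
        problems.append(f"services never invoked: {sorted(missing)}")
    if record.get("latency_ms", 0) > 3000:  # repo's P99 full-response target
        problems.append(f"latency {record['latency_ms']}ms over the 3s budget")
    if record.get("fallback_used"):
        problems.append("fallback path triggered")
    return problems
```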
### Step 3: Check Classifier Confidence
For wrong-route incidents, I inspect:
- predicted intent,
- confidence score,
- classifier version,
- whether the request fell through to general LLM handling.
This matters because the repo sets intent accuracy targets above 90%, and shadow mode already documented a regression where recommendation queries were misrouted to product_question.
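A sketch of that inspection over the structured record; the 0.7 fall-through cutoff is my assumption, since the repo only documents the accuracy target:

```python
def check_classifier(record: dict, cutoff: float = 0.7):
    """Flag low-confidence fall-through or a likely misroute for one turn."""
    intent = record.get("intent")
    confidence = record.get("intent_confidence", 0.0)
    if confidence < cutoff:
        return f"confidence {confidence:.2f} below {cutoff}; check general LLM fall-through"
    if intent == "product_question" and record.get("asin_list"):
        # Shadow mode previously caught recommendation queries landing here.
        return "possible recommendation -> product_question misroute"
    return None
```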
### Step 4: Check Parallel Service Fan-Out
The request path calls recommendation, catalog, and RAG in parallel. I verify:
- which service was slowest,
- which service failed or retried,
- whether partial data caused a degraded prompt,
- whether circuit breakers were open.
The repo treats this parallel wall time as roughly 300ms, so if one branch takes significantly longer, that is visible immediately.
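A sketch of the instrumented fan-out that makes those four checks possible, with stub coroutines standing in for the real service clients and an assumed 1s per-branch timeout:

```python
import asyncio
import time

# Stubs standing in for the real service clients.
async def fetch_recommendations(q): await asyncio.sleep(0.12); return ["B0AAA"]
async def fetch_products(q): await asyncio.sleep(0.08); return [{"asin": "B0AAA"}]
async def retrieve_chunks(q): await asyncio.sleep(0.25); return [{"id": "chunk-1"}]

async def timed(name, coro, timeout=1.0):
    """Run one branch, recording its latency and any failure."""
    start = time.perf_counter()
    try:
        result = await asyncio.wait_for(coro, timeout)
        return name, (time.perf_counter() - start) * 1000, result, None
    except Exception as exc:  # timeout, retry exhaustion, open breaker
        return name, (time.perf_counter() - start) * 1000, None, exc

async def fan_out(query: str):
    branches = await asyncio.gather(
        timed("recommendation", fetch_recommendations(query)),
        timed("catalog", fetch_products(query)),
        timed("rag", retrieve_chunks(query)),
    )
    # Per-branch latency makes the slowest branch obvious, and a branch
    # with an error marks the turn as a degraded-prompt candidate.
    for name, ms, _result, err in branches:
        print(f"{name}: {ms:.0f}ms error={err!r}")
    return branches

asyncio.run(fan_out("mecha manga like Gundam"))
```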
### Step 5: Check Bedrock and Guardrails
Only after upstream service behavior looks sane do I inspect Bedrock invocation logs and guardrail traces.
### Step 6: Check Delivery and Persistence
If the response looked healthy but the user never saw it, I check:
- WebSocket stream logs,
- disconnect reasons,
- frontend receipt or render failures,
- whether the turn was still saved to memory,
- whether analytics recorded the event.
This separates "bad answer" from "good answer that failed during delivery."
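For the persistence half of that check, a sketch assuming the memory table is keyed on session_id and turn_id; the table name is hypothetical:

```python
import boto3

ddb = boto3.resource("dynamodb")
table = ddb.Table("mangaassist-conversation-memory")  # hypothetical name

def turn_was_saved(session_id: str, turn_id: str) -> bool:
    """True if the turn persisted even though delivery may have failed."""
    resp = table.get_item(Key={"session_id": session_id, "turn_id": turn_id})
    return "Item" in resp
```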
## MLflow Tracing or X-Ray View
The repo already describes full request trace logging and distributed tracing. In practice, I want a single trace view that includes spans like:
- `gateway.accept_request`
- `orchestrator.load_memory`
- `classifier.predict_intent`
- `recommendation.fetch_candidates`
- `catalog.fetch_products`
- `rag.retrieve_chunks`
- `rag.rerank_chunks`
- `prompt.assemble`
- `bedrock.generate`
- `guardrails.validate`
- `websocket.stream_response`
- `memory.save_turn`
- `analytics.emit_event`
Each span should carry:
- start and end time,
- latency,
- service status,
- model version where relevant,
- request correlation keys,
- selected business metadata such as intent and ASIN list.
That lets me answer both of these questions quickly:
- Where did latency accumulate?
- At what step did the data become wrong?
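A sketch of how one of those spans could carry the correlation keys, here using the X-Ray SDK's subsegment API inside an already-instrumented handler; the retrieval call is a stub:

```python
from aws_xray_sdk.core import xray_recorder

def retrieve_chunks(query: str) -> list:
    return [{"id": "chunk-1"}]  # stand-in for the real OpenSearch call

def retrieve_chunks_traced(query: str, request_id: str, intent: str) -> list:
    subsegment = xray_recorder.begin_subsegment("rag.retrieve_chunks")
    try:
        # Annotations are indexed, so one trace filter answers both "where
        # did latency accumulate" and "where did the data go wrong".
        subsegment.put_annotation("request_id", request_id)
        subsegment.put_annotation("intent", intent)
        chunks = retrieve_chunks(query)
        subsegment.put_metadata("chunk_ids", [c["id"] for c in chunks])
        return chunks
    finally:
        xray_recorder.end_subsegment()
```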
## Logging Patterns by Symptom
### Symptom: Wrong Answer
First places I check:
- classifier prediction and confidence,
- RAG retrieval and reranker outputs,
- Bedrock prompt and raw output,
- guardrail action.
### Symptom: Slow Answer
First places I check:
- orchestrator total latency breakdown,
- slowest parallel branch,
- Bedrock latency and throttling,
- Lambda cold start or ECS saturation.
### Symptom: No Answer Reached the User
First places I check:
- API Gateway entry and response status,
- orchestrator exceptions,
- WebSocket disconnects,
- whether the turn was saved even if delivery failed.
### Symptom: Good Retrieval but Weird Final Answer
First places I check:
- prompt assembly,
- model invocation payload,
- guardrail rewrites,
- prompt version changes.
### Symptom: Stale or Repeated Content
First places I check:
- chunk metadata including `last_updated`,
- OpenSearch cache or reindex timing,
- retrieval frequency of dominant chunks,
- product catalog freshness.
## CloudWatch Signals I Expect Every On-Call Engineer to Know
These are the most useful application-level signals for this project:
- `P99 latency (first token)` with target below 1.5s
- `P99 latency (full response)` with target below 3s
- `Error rate` with target below 0.5%
- `Guardrail block rate` with target below 5%
- `LLM timeout rate` with investigation threshold above 5%
- `DynamoDB throttling`
- `Circuit breaker open` events
- `Intent distribution` drift
- `ASIN validation failures`
- `Escalation rate`
These thresholds come directly from the existing repo documents and should appear consistently in dashboards, alerts, and incident narratives.
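As one example of turning a threshold from that list into an alarm, a sketch with a hypothetical custom namespace and metric name; only the 1.5s target comes from the repo:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-p99-first-token",
    Namespace="MangaAssist",          # hypothetical custom namespace
    MetricName="FirstTokenLatency",   # hypothetical metric, in milliseconds
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1500,                   # the repo's 1.5s first-token target
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```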
## Interview-Ready Answer
If I am asked, "How do you debug the chatbot throughout the application?", my answer is:
- I take the request ID and session ID from the reported incident.
- I verify the request entered through API Gateway cleanly.
- I inspect orchestrator logs to see routing, memory load, and which downstream services were called.
- I compare classifier output, retrieval output, and catalog data to find where the answer first became wrong.
- I inspect Bedrock invocation and guardrail logs only after upstream data looks correct.
- I finish by checking delivery logs and persistence so I know whether the problem was content, latency, or transport.
That shows system debugging discipline, not just LLM familiarity.