2. Application Logging Across MangaAssist

This document explains how I debug the chatbot throughout the application, not just inside Bedrock. It is the operational map for tracing one user turn across the full MangaAssist architecture.

For Bedrock-specific retrieval, invocation, and guardrail visibility, see 01-bedrock-logging.md.


The Core Principle

I never debug a distributed chatbot by jumping between unrelated logs. I debug by following one turn through the system using shared identifiers.

The minimum correlation set is:

  • request_id
  • session_id
  • turn_id
  • customer_id_hash for authenticated users
  • timestamp window

Without those fields, a Prime Day incident becomes guesswork.
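
A minimal sketch of what that looks like in practice, using CloudWatch Logs Insights through boto3 to pull every event for one request across services. The log group names are assumptions for illustration, not confirmed repo values; the field names follow the structured logging contract later in this document.

import time

import boto3

logs = boto3.client("logs")

def trace_one_turn(request_id: str, start_ts: int, end_ts: int) -> list:
    """Fetch every log event carrying one request_id, across services."""
    query = f"""
    fields @timestamp, @log, intent, latency_ms, error_code
    | filter request_id = "{request_id}"
    | sort @timestamp asc
    """
    started = logs.start_query(
        logGroupNames=[
            "/aws/apigateway/mangaassist-access",    # assumed name
            "/aws/lambda/mangaassist-orchestrator",  # assumed name
        ],
        startTime=start_ts,
        endTime=end_ts,
        queryString=query,
    )
    # Poll until the query finishes, then return the matched events.
    while True:
        results = logs.get_query_results(queryId=started["queryId"])
        if results["status"] in ("Complete", "Failed", "Cancelled"):
            return results["results"]
        time.sleep(1)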


Request Path I Trace During Incidents

The repo's documented request path is:

  1. Frontend React widget sends the user message and page context.
  2. API Gateway validates rate limits and auth.
  3. Orchestrator loads memory and classifies intent.
  4. Orchestrator fans out in parallel to recommendation, catalog, and RAG.
  5. Prompt is assembled and sent to Bedrock.
  6. Guardrails validate the response.
  7. Response is streamed over WebSocket.
  8. Memory and analytics are updated.

When debugging, I follow exactly that sequence. I do not start in the middle.


Service-by-Service Log Inventory

| Service | AWS runtime | Primary log location | Key fields to inspect | Common failure indicators | Sensitivity note |
| --- | --- | --- | --- | --- | --- |
| Frontend widget | Browser plus CDN | Client telemetry pipeline or frontend logs | session_id, page context, WebSocket events, click events | message not sent, duplicate send, missing page context | avoid raw PII |
| API Gateway | Managed API layer | CloudWatch access and execution logs | request_id, route, auth result, status code, latency | 4xx spikes, 5xx spikes, rate-limit rejections | request metadata only |
| Auth validation | Internal service | CloudWatch service logs | request_id, auth decision, user state | expired session, auth retries, banned or guest misclassification | no raw tokens in logs |
| Orchestrator | Lambda or ECS | /aws/lambda/... or /ecs/... | request_id, session_id, intent, services called, total latency, fallback flags | timeout, fan-out failure, bad routing, prompt inflation | most important application log |
| Conversation memory | DynamoDB | CloudWatch metrics plus app logs around DDB calls | session_id, turn count, read latency, write latency, item size | throttle, hot partition, large session payload | no raw conversation text in standard logs |
| Intent classifier | SageMaker endpoint | /aws/sagemaker/Endpoints/... and app trace spans | predicted intent, confidence, model version, inference latency | misrouting, confidence drop, endpoint latency | keep user text scrubbed |
| Recommendation engine | Service or Lambda | service logs in CloudWatch | seed entity, returned ASINs, latency, cache hit | empty recommendations, wrong seed resolution, slow response | product IDs only |
| Product catalog | Internal service | service logs in CloudWatch | ASINs requested, response status, price freshness, inventory freshness | stale product metadata, partial response, price mismatch | price data is business-sensitive |
| RAG retrieval | OpenSearch plus service | OpenSearch logs plus orchestrator trace | query, chunk IDs, rerank scores, last_updated | irrelevant chunk dominance, stale content, cold cache | sanitize user question |
| Bedrock generation | Managed FM service | invocation logs in CloudWatch and S3 | model, token counts, generation latency, status | throttling, timeout, output drift, token inflation | prompt and response need controlled access |
| Guardrails | Bedrock plus app logic | CloudWatch guardrail traces and orchestrator logs | filter results, block reason, fallback used | false positives, over-blocking, unexpected rewrite | sensitive due to user content |
| WebSocket delivery | API Gateway or ALB | access logs plus connection metrics | connection ID, disconnect reason, frame timing | stalled stream, dropped connection, reconnect storm | connection metadata only |
| Analytics pipeline | Kinesis | CloudWatch metrics and consumer logs | event count, lag, failed records | missing feedback data, downstream lag | customer IDs must remain hashed |

Structured Logging Contract for MangaAssist

Across services, I would normalize around a shared JSON logging shape like this:

{
  "timestamp": "2026-03-22T18:14:55Z",
  "request_id": "req-123",
  "session_id": "sess-456",
  "turn_id": "turn-17",
  "customer_id_hash": "cust-9af...",
  "intent": "recommendation",
  "intent_confidence": 0.94,
  "route": ["memory", "classifier", "recommendation_engine", "product_catalog", "rag", "bedrock", "guardrails"],
  "model_id": "claude-3.5-sonnet",
  "model_version": "candidate",
  "kb_query_id": "kbq-777",
  "asin_list": ["B0AAA", "B0BBB", "B0CCC"],
  "cache_hit": false,
  "fallback_used": false,
  "guardrail_action": "pass",
  "latency_ms": 2341,
  "error_code": null,
  "escalation_flag": false
}

This schema supports both production troubleshooting and later offline analysis.
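
A minimal sketch of how a service could honor that contract. The helper and its defaults are my assumption of an implementation, not the repo's code; the field names come from the schema above.

import json
import logging
import sys
from datetime import datetime, timezone

logger = logging.getLogger("mangaassist")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

# Correlation keys every service must populate; everything else is optional.
REQUIRED_KEYS = ("request_id", "session_id", "turn_id")

def log_turn(**fields):
    """Emit one JSON log line that follows the shared contract."""
    missing = [k for k in REQUIRED_KEYS if k not in fields]
    if missing:
        raise ValueError(f"missing correlation keys: {missing}")
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_code": None,
        "fallback_used": False,
        **fields,  # caller-supplied fields win over the defaults
    }
    logger.info(json.dumps(record))

# Example: the orchestrator logging a completed recommendation turn.
log_turn(
    request_id="req-123",
    session_id="sess-456",
    turn_id="turn-17",
    intent="recommendation",
    latency_ms=2341,
)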


Secure Logging Rules for This Project

MangaAssist cannot log everything to CloudWatch indiscriminately. The repo already implies a split between operational metadata and more sensitive prompt or message content.

My logging rules for this project are:

  1. No raw PII in routine CloudWatch logs.
  2. Customer IDs are hashed before reaching analytics or trace views.
  3. Raw prompts and responses, if retained for deep debugging, go to encrypted S3 with restricted access and auditability.
  4. Standard application logs store metadata, not full message bodies.
  5. Message-level debugging access is time-limited and requires explicit approval, because user content may contain addresses, order details, or payment-adjacent context.

This is especially important for support flows like order tracking and return requests.
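
A minimal sketch of rules 1, 2, and 4 in code. The salt handling and the order-ID pattern are illustrative assumptions, not repo values.

import hashlib
import re

def hash_customer_id(customer_id: str, salt: str) -> str:
    """Rule 2: hash customer IDs before they reach any log or analytics sink."""
    digest = hashlib.sha256((salt + customer_id).encode()).hexdigest()
    return f"cust-{digest[:12]}"

def loggable_metadata(message: str) -> dict:
    """Rules 1 and 4: log facts about the message, never the message body."""
    return {
        "message_length": len(message),
        # Hypothetical order-ID pattern; flag its presence, never log the value.
        "contains_order_id": bool(re.search(r"\b\d{3}-\d{7}-\d{7}\b", message)),
    }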


How I Debug One Turn End to End

Step 1: Check API Entry

From API Gateway logs, verify:

  • the request actually reached the backend,
  • auth succeeded,
  • rate limiting did not reject the request,
  • the route and status code look normal.

If the request never entered cleanly, I stop there. There is no reason to inspect Bedrock yet.
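
A sketch of the Logs Insights queries I would run here, assuming the access log format maps $context values like requestId and status onto these field names; the real names are whatever the repo configured.

# Did this one request enter cleanly?
API_ENTRY_QUERY = """
fields @timestamp, status, routeKey, responseLatency
| filter requestId = "req-123"
| sort @timestamp asc
"""

# Is rate limiting rejecting traffic in aggregate?
THROTTLE_QUERY = """
filter status = 429
| stats count() as rejected by bin(1m)
"""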

Step 2: Check Orchestrator Decision-Making

In orchestrator logs, verify:

  • conversation history loaded correctly,
  • intent classification was called,
  • expected services were invoked,
  • total latency matches the latency budget,
  • fallback flags were or were not triggered.

This is where I usually isolate whether the incident is routing, data, model, or delivery.
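
Against the structured contract above, a single filter on session_id reconstructs the orchestrator's view of the whole session; a sketch, with field names from the shared schema:

# One line per turn: routing, latency, and fallback state side by side.
SESSION_QUERY = """
fields @timestamp, turn_id, intent, route, latency_ms, fallback_used, error_code
| filter session_id = "sess-456"
| sort @timestamp asc
"""

# Which intents are blowing the 3s full-response budget?
SLOW_TURNS_QUERY = """
filter latency_ms > 3000
| stats count() as slow_turns, avg(latency_ms) by intent
"""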

Step 3: Check Classifier Confidence

For wrong-route incidents, I inspect:

  • predicted intent,
  • confidence score,
  • classifier version,
  • whether the request fell through to general LLM handling.

This matters because the repo sets intent accuracy targets above 90%, and shadow mode already documented a regression where recommendation queries were misrouted to product_question.
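
A sketch of the query that would have surfaced that regression. Field names follow the shared contract; the 0.7 confidence floor is an illustrative threshold, not a repo value.

# Recommendation-style queries landing in product_question with weak confidence.
MISROUTE_QUERY = """
fields @timestamp, request_id, intent, intent_confidence, model_version
| filter intent = "product_question" and intent_confidence < 0.7
| sort intent_confidence asc
| limit 50
"""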

Step 4: Check Parallel Service Fan-Out

The request path calls recommendation, catalog, and RAG in parallel. I verify:

  • which service was slowest,
  • which service failed or retried,
  • whether partial data caused a degraded prompt,
  • whether circuit breakers were open.

The repo treats this parallel wall time as roughly 300ms, so if one branch takes significantly longer, that is visible immediately.
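
A minimal sketch of instrumented fan-out under that budget. The structure is my assumption; the repo's orchestrator may differ, but the point is that each branch reports its own wall time so the slowest or failed branch is visible in one log line.

import asyncio
import time

async def timed(name, coro, timeout_s=0.3):
    """Run one branch under the ~300ms budget and record its outcome."""
    start = time.monotonic()
    try:
        result = await asyncio.wait_for(coro, timeout=timeout_s)
        return name, result, (time.monotonic() - start) * 1000, None
    except Exception as exc:  # timeout or branch failure -> degraded prompt
        return name, None, (time.monotonic() - start) * 1000, repr(exc)

async def fan_out(fetch_recs, fetch_catalog, fetch_rag):
    results = await asyncio.gather(
        timed("recommendation", fetch_recs),
        timed("catalog", fetch_catalog),
        timed("rag", fetch_rag),
    )
    # One structured record answers "which branch was slowest or failed?"
    return {name: {"latency_ms": ms, "error": err} for name, _, ms, err in results}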

Step 5: Check Bedrock and Guardrails

Only after upstream service behavior looks sane do I inspect Bedrock invocation logs and guardrail traces.

Step 6: Check Delivery and Persistence

If the response looked healthy but the user never saw it, I check:

  • WebSocket stream logs,
  • disconnect reasons,
  • frontend receipt or render failures,
  • whether the turn was still saved to memory,
  • whether analytics recorded the event.

This separates "bad answer" from "good answer that failed during delivery."


MLflow Tracing or X-Ray View

The repo already describes full request trace logging and distributed tracing. In practice, I want a single trace view that includes spans like:

  1. gateway.accept_request
  2. orchestrator.load_memory
  3. classifier.predict_intent
  4. recommendation.fetch_candidates
  5. catalog.fetch_products
  6. rag.retrieve_chunks
  7. rag.rerank_chunks
  8. prompt.assemble
  9. bedrock.generate
  10. guardrails.validate
  11. websocket.stream_response
  12. memory.save_turn
  13. analytics.emit_event

Each span should carry:

  • start and end time,
  • latency,
  • service status,
  • model version where relevant,
  • request correlation keys,
  • selected business metadata such as intent and ASIN list.

That lets me answer both of these questions quickly:

  • Where did latency accumulate?
  • At what step did the data become wrong?
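
A sketch of how one step would carry that span contract, using the OpenTelemetry API as a stand-in for whichever tracer (X-Ray, MLflow) the repo actually wires up; run_opensearch_query is a hypothetical helper, and the attribute names follow the shared logging contract.

from opentelemetry import trace

tracer = trace.get_tracer("mangaassist.orchestrator")

def run_opensearch_query(query):
    """Hypothetical retrieval call; stands in for the real OpenSearch client."""
    return []

def retrieve_chunks(request_id: str, session_id: str, query: str):
    # Span name matches step 6 in the trace view above.
    with tracer.start_as_current_span("rag.retrieve_chunks") as span:
        span.set_attribute("request_id", request_id)
        span.set_attribute("session_id", session_id)
        chunks = run_opensearch_query(query)
        span.set_attribute("chunk_count", len(chunks))
        return chunks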

Logging Patterns by Symptom

Symptom: Wrong Answer

First places I check:

  1. classifier prediction and confidence,
  2. RAG retrieval and reranker outputs,
  3. Bedrock prompt and raw output,
  4. guardrail action.

Symptom: Slow Answer

First places I check:

  1. orchestrator total latency breakdown,
  2. slowest parallel branch,
  3. Bedrock latency and throttling,
  4. Lambda cold start or ECS saturation.

Symptom: No Answer Reached the User

First places I check:

  1. API Gateway entry and response status,
  2. orchestrator exceptions,
  3. WebSocket disconnects,
  4. whether the turn was saved even if delivery failed.

Symptom: Good Retrieval but Weird Final Answer

First places I check:

  1. prompt assembly,
  2. model invocation payload,
  3. guardrail rewrites,
  4. prompt version changes.

Symptom: Stale or Repeated Content

First places I check:

  1. chunk metadata including last_updated,
  2. OpenSearch cache or reindex timing,
  3. retrieval frequency of dominant chunks,
  4. product catalog freshness.

CloudWatch Signals I Expect Every On-Call Engineer to Know

These are the most useful application-level signals for this project:

  • P99 latency (first token) with target below 1.5s
  • P99 latency (full response) with target below 3s
  • Error rate with target below 0.5%
  • Guardrail block rate with target below 5%
  • LLM timeout rate with investigation threshold above 5%
  • DynamoDB throttling
  • Circuit breaker open events
  • Intent distribution drift
  • ASIN validation failures
  • Escalation rate

These thresholds come directly from the existing repo documents and should appear consistently in dashboards, alerts, and incident narratives.
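
A minimal sketch of pinning one of those thresholds to an alarm. The namespace and metric name are assumptions; the 1.5s p99 target comes from the thresholds above.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-p99-first-token-latency",
    Namespace="MangaAssist/Chat",          # assumed custom namespace
    MetricName="FirstTokenLatencyMs",      # assumed metric name
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1500,                        # 1.5s first-token target
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)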


Interview-Ready Answer

If I am asked, "How do you debug the chatbot throughout the application?", my answer is:

  1. I take the request ID and session ID from the reported incident.
  2. I verify the request entered through API Gateway cleanly.
  3. I inspect orchestrator logs to see routing, memory load, and which downstream services were called.
  4. I compare classifier output, retrieval output, and catalog data to find where the answer first became wrong.
  5. I inspect Bedrock invocation and guardrail logs only after upstream data looks correct.
  6. I finish by checking delivery logs and persistence so I know whether the problem was content, latency, or transport.

That shows system debugging discipline, not just LLM familiarity.