2. Application Logging Across MangaAssist

This document explains how I debug the chatbot throughout the application, not just inside Bedrock. It is the operational map for tracing one user turn across the full MangaAssist architecture.

For Bedrock-specific retrieval, invocation, and guardrail visibility, see 01-bedrock-logging.md.


The Core Principle

I never debug a distributed chatbot by jumping between unrelated logs. I debug by following one turn through the system using shared identifiers.

The minimum correlation set is:

  • request_id
  • session_id
  • turn_id
  • customer_id_hash for authenticated users
  • timestamp window

Without those fields, a Prime Day incident becomes guesswork.
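
A minimal sketch of what that looks like in practice, using CloudWatch Logs Insights through boto3 to pull every event for one request across services. The log group names are assumptions for illustration, not confirmed repo values; the field names follow the structured logging contract later in this document.

import time

import boto3

logs = boto3.client("logs")

def trace_one_turn(request_id: str, start_ts: int, end_ts: int) -> list:
    """Fetch every log event carrying one request_id, across services."""
    query = f"""
    fields @timestamp, @log, intent, latency_ms, error_code
    | filter request_id = "{request_id}"
    | sort @timestamp asc
    """
    started = logs.start_query(
        logGroupNames=[
            "/aws/apigateway/mangaassist-access",    # assumed name
            "/aws/lambda/mangaassist-orchestrator",  # assumed name
        ],
        startTime=start_ts,
        endTime=end_ts,
        queryString=query,
    )
    # Poll until the query finishes, then return the matched events.
    while True:
        results = logs.get_query_results(queryId=started["queryId"])
        if results["status"] in ("Complete", "Failed", "Cancelled"):
            return results["results"]
        time.sleep(1)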


Request Path I Trace During Incidents

The repo's documented request path is:

  1. Frontend React widget sends the user message and page context.
  2. API Gateway validates rate limits and auth.
  3. Orchestrator loads memory and classifies intent.
  4. Orchestrator fans out in parallel to recommendation, catalog, and RAG.
  5. Prompt is assembled and sent to Bedrock.
  6. Guardrails validate the response.
  7. Response is streamed over WebSocket.
  8. Memory and analytics are updated.

When debugging, I follow exactly that sequence. I do not start in the middle.


Service-by-Service Log Inventory

| Service | AWS runtime | Primary log location | Key fields to inspect | Common failure indicators | Sensitivity note |
| --- | --- | --- | --- | --- | --- |
| Frontend widget | Browser plus CDN | Client telemetry pipeline or frontend logs | session_id, page context, WebSocket events, click events | message not sent, duplicate send, missing page context | avoid raw PII |
| API Gateway | Managed API layer | CloudWatch access and execution logs | request_id, route, auth result, status code, latency | 4xx spikes, 5xx spikes, rate-limit rejections | request metadata only |
| Auth validation | Internal service | CloudWatch service logs | request_id, auth decision, user state | expired session, auth retries, banned or guest misclassification | no raw tokens in logs |
| Orchestrator | Lambda or ECS | /aws/lambda/... or /ecs/... | request_id, session_id, intent, services called, total latency, fallback flags | timeout, fan-out failure, bad routing, prompt inflation | most important application log |
| Conversation memory | DynamoDB | CloudWatch metrics plus app logs around DDB calls | session_id, turn count, read latency, write latency, item size | throttle, hot partition, large session payload | no raw conversation text in standard logs |
| Intent classifier | SageMaker endpoint | /aws/sagemaker/Endpoints/... and app trace spans | predicted intent, confidence, model version, inference latency | misrouting, confidence drop, endpoint latency | keep user text scrubbed |
| Recommendation engine | Service or Lambda | service logs in CloudWatch | seed entity, returned ASINs, latency, cache hit | empty recommendations, wrong seed resolution, slow response | product IDs only |
| Product catalog | Internal service | service logs in CloudWatch | ASINs requested, response status, price freshness, inventory freshness | stale product metadata, partial response, price mismatch | price data is business-sensitive |
| RAG retrieval | OpenSearch plus service | OpenSearch logs plus orchestrator trace | query, chunk IDs, rerank scores, last_updated | irrelevant chunk dominance, stale content, cold cache | sanitize user question |
| Bedrock generation | Managed FM service | invocation logs in CloudWatch and S3 | model, token counts, generation latency, status | throttling, timeout, output drift, token inflation | prompt and response need controlled access |
| Guardrails | Bedrock plus app logic | CloudWatch guardrail traces and orchestrator logs | filter results, block reason, fallback used | false positives, over-blocking, unexpected rewrite | sensitive due to user content |
| WebSocket delivery | API Gateway or ALB | access logs plus connection metrics | connection ID, disconnect reason, frame timing | stalled stream, dropped connection, reconnect storm | connection metadata only |
| Analytics pipeline | Kinesis | CloudWatch metrics and consumer logs | event count, lag, failed records | missing feedback data, downstream lag | customer IDs must remain hashed |

Structured Logging Contract for MangaAssist

Across services, I would normalize around a shared JSON logging shape like this:

{
  "timestamp": "2026-03-22T18:14:55Z",
  "request_id": "req-123",
  "session_id": "sess-456",
  "turn_id": "turn-17",
  "customer_id_hash": "cust-9af...",
  "intent": "recommendation",
  "intent_confidence": 0.94,
  "route": ["memory", "classifier", "recommendation_engine", "product_catalog", "rag", "bedrock", "guardrails"],
  "model_id": "claude-3.5-sonnet",
  "model_version": "candidate",
  "kb_query_id": "kbq-777",
  "asin_list": ["B0AAA", "B0BBB", "B0CCC"],
  "cache_hit": false,
  "fallback_used": false,
  "guardrail_action": "pass",
  "latency_ms": 2341,
  "error_code": null,
  "escalation_flag": false
}

This schema supports both production troubleshooting and later offline analysis.
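
A minimal sketch of how a service could honor that contract. The helper and its defaults are my assumption of an implementation, not the repo's code; the field names come from the schema above.

import json
import logging
import sys
from datetime import datetime, timezone

logger = logging.getLogger("mangaassist")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

# Correlation keys every service must populate; everything else is optional.
REQUIRED_KEYS = ("request_id", "session_id", "turn_id")

def log_turn(**fields):
    """Emit one JSON log line that follows the shared contract."""
    missing = [k for k in REQUIRED_KEYS if k not in fields]
    if missing:
        raise ValueError(f"missing correlation keys: {missing}")
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_code": None,
        "fallback_used": False,
        **fields,  # caller-supplied fields win over the defaults
    }
    logger.info(json.dumps(record))

# Example: the orchestrator logging a completed recommendation turn.
log_turn(
    request_id="req-123",
    session_id="sess-456",
    turn_id="turn-17",
    intent="recommendation",
    latency_ms=2341,
)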


Secure Logging Rules for This Project

MangaAssist cannot log everything to CloudWatch indiscriminately. The repo already implies a split between operational metadata and more sensitive prompt or message content.

My logging rules for this project are:

  1. No raw PII in routine CloudWatch logs.
  2. Customer IDs are hashed before reaching analytics or trace views.
  3. Raw prompts and responses, if retained for deep debugging, go to encrypted S3 with restricted access and auditability.
  4. Standard application logs store metadata, not full message bodies.
  5. Message-level debugging access is time-limited and requires explicit approval, because user content may contain addresses, order details, or payment-adjacent context.

This is especially important for support flows like order tracking and return requests.
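
A minimal sketch of rules 1, 2, and 4 in code. The salt handling and the order-ID pattern are illustrative assumptions, not repo values.

import hashlib
import re

def hash_customer_id(customer_id: str, salt: str) -> str:
    """Rule 2: hash customer IDs before they reach any log or analytics sink."""
    digest = hashlib.sha256((salt + customer_id).encode()).hexdigest()
    return f"cust-{digest[:12]}"

def loggable_metadata(message: str) -> dict:
    """Rules 1 and 4: log facts about the message, never the message body."""
    return {
        "message_length": len(message),
        # Hypothetical order-ID pattern; flag its presence, never log the value.
        "contains_order_id": bool(re.search(r"\b\d{3}-\d{7}-\d{7}\b", message)),
    }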


How I Debug One Turn End to End

Step 1: Check API Entry

From API Gateway logs, verify:

  • the request actually reached the backend,
  • auth succeeded,
  • rate limiting did not reject the request,
  • the route and status code look normal.

If the request never entered cleanly, I stop there. There is no reason to inspect Bedrock yet.
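
A sketch of the Logs Insights queries I would run here, assuming the access log format maps $context values like requestId and status onto these field names; the real names are whatever the repo configured.

# Did this one request enter cleanly?
API_ENTRY_QUERY = """
fields @timestamp, status, routeKey, responseLatency
| filter requestId = "req-123"
| sort @timestamp asc
"""

# Is rate limiting rejecting traffic in aggregate?
THROTTLE_QUERY = """
filter status = 429
| stats count() as rejected by bin(1m)
"""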

Step 2: Check Orchestrator Decision-Making

In orchestrator logs, verify:

  • conversation history loaded correctly,
  • intent classification was called,
  • expected services were invoked,
  • total latency matches the latency budget,
  • fallback flags were or were not triggered.

This is where I usually isolate whether the incident is routing, data, model, or delivery.
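
Against the structured contract above, a single filter on session_id reconstructs the orchestrator's view of the whole session; a sketch, with field names from the shared schema:

# One line per turn: routing, latency, and fallback state side by side.
SESSION_QUERY = """
fields @timestamp, turn_id, intent, route, latency_ms, fallback_used, error_code
| filter session_id = "sess-456"
| sort @timestamp asc
"""

# Which intents are blowing the 3s full-response budget?
SLOW_TURNS_QUERY = """
filter latency_ms > 3000
| stats count() as slow_turns, avg(latency_ms) by intent
"""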

Step 3: Check Classifier Confidence

For wrong-route incidents, I inspect:

  • predicted intent,
  • confidence score,
  • classifier version,
  • whether the request fell through to general LLM handling.

This matters because the repo sets intent accuracy targets above 90%, and shadow mode already documented a regression where recommendation queries were misrouted to product_question.
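
A sketch of the query that would have surfaced that regression. Field names follow the shared contract; the 0.7 confidence floor is an illustrative threshold, not a repo value.

# Recommendation-style queries landing in product_question with weak confidence.
MISROUTE_QUERY = """
fields @timestamp, request_id, intent, intent_confidence, model_version
| filter intent = "product_question" and intent_confidence < 0.7
| sort intent_confidence asc
| limit 50
"""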

Step 4: Check Parallel Service Fan-Out

The request path calls recommendation, catalog, and RAG in parallel. I verify:

  • which service was slowest,
  • which service failed or retried,
  • whether partial data caused a degraded prompt,
  • whether circuit breakers were open.

The repo treats this parallel wall time as roughly 300ms, so if one branch takes significantly longer, that is visible immediately.
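
A minimal sketch of instrumented fan-out under that budget. The structure is my assumption; the repo's orchestrator may differ, but the point is that each branch reports its own wall time so the slowest or failed branch is visible in one log line.

import asyncio
import time

async def timed(name, coro, timeout_s=0.3):
    """Run one branch under the ~300ms budget and record its outcome."""
    start = time.monotonic()
    try:
        result = await asyncio.wait_for(coro, timeout=timeout_s)
        return name, result, (time.monotonic() - start) * 1000, None
    except Exception as exc:  # timeout or branch failure -> degraded prompt
        return name, None, (time.monotonic() - start) * 1000, repr(exc)

async def fan_out(fetch_recs, fetch_catalog, fetch_rag):
    results = await asyncio.gather(
        timed("recommendation", fetch_recs),
        timed("catalog", fetch_catalog),
        timed("rag", fetch_rag),
    )
    # One structured record answers "which branch was slowest or failed?"
    return {name: {"latency_ms": ms, "error": err} for name, _, ms, err in results}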

Step 5: Check Bedrock and Guardrails

Only after upstream service behavior looks sane do I inspect Bedrock invocation logs and guardrail traces.

Step 6: Check Delivery and Persistence

If the response looked healthy but the user never saw it, I check:

  • WebSocket stream logs,
  • disconnect reasons,
  • frontend receipt or render failures,
  • whether the turn was still saved to memory,
  • whether analytics recorded the event.

This separates "bad answer" from "good answer that failed during delivery."


MLflow Tracing or X-Ray View

The repo already describes full request trace logging and distributed tracing. In practice, I want a single trace view that includes spans like:

  1. gateway.accept_request
  2. orchestrator.load_memory
  3. classifier.predict_intent
  4. recommendation.fetch_candidates
  5. catalog.fetch_products
  6. rag.retrieve_chunks
  7. rag.rerank_chunks
  8. prompt.assemble
  9. bedrock.generate
  10. guardrails.validate
  11. websocket.stream_response
  12. memory.save_turn
  13. analytics.emit_event

Each span should carry:

  • start and end time,
  • latency,
  • service status,
  • model version where relevant,
  • request correlation keys,
  • selected business metadata such as intent and ASIN list.

That lets me answer both of these questions quickly:

  • Where did latency accumulate?
  • At what step did the data become wrong?
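
A sketch of how one step would carry that span contract, using the OpenTelemetry API as a stand-in for whichever tracer (X-Ray, MLflow) the repo actually wires up; run_opensearch_query is a hypothetical helper, and the attribute names follow the shared logging contract.

from opentelemetry import trace

tracer = trace.get_tracer("mangaassist.orchestrator")

def run_opensearch_query(query):
    """Hypothetical retrieval call; stands in for the real OpenSearch client."""
    return []

def retrieve_chunks(request_id: str, session_id: str, query: str):
    # Span name matches step 6 in the trace view above.
    with tracer.start_as_current_span("rag.retrieve_chunks") as span:
        span.set_attribute("request_id", request_id)
        span.set_attribute("session_id", session_id)
        chunks = run_opensearch_query(query)
        span.set_attribute("chunk_count", len(chunks))
        return chunks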

Logging Patterns by Symptom

Symptom: Wrong Answer

First places I check:

  1. classifier prediction and confidence,
  2. RAG retrieval and reranker outputs,
  3. Bedrock prompt and raw output,
  4. guardrail action.

Symptom: Slow Answer

First places I check:

  1. orchestrator total latency breakdown,
  2. slowest parallel branch,
  3. Bedrock latency and throttling,
  4. Lambda cold start or ECS saturation.

Symptom: No Answer Reached the User

First places I check:

  1. API Gateway entry and response status,
  2. orchestrator exceptions,
  3. WebSocket disconnects,
  4. whether the turn was saved even if delivery failed.

Symptom: Good Retrieval but Weird Final Answer

First places I check:

  1. prompt assembly,
  2. model invocation payload,
  3. guardrail rewrites,
  4. prompt version changes.

Symptom: Stale or Repeated Content

First places I check:

  1. chunk metadata including last_updated,
  2. OpenSearch cache or reindex timing,
  3. retrieval frequency of dominant chunks,
  4. product catalog freshness.

CloudWatch Signals I Expect Every On-Call Engineer to Know

These are the most useful application-level signals for this project:

  • P99 latency (first token) with target below 1.5s
  • P99 latency (full response) with target below 3s
  • Error rate with target below 0.5%
  • Guardrail block rate with target below 5%
  • LLM timeout rate with investigation threshold above 5%
  • DynamoDB throttling
  • Circuit breaker open events
  • Intent distribution drift
  • ASIN validation failures
  • Escalation rate

These thresholds come directly from the existing repo documents and should appear consistently in dashboards, alerts, and incident narratives.
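
A minimal sketch of pinning one of those thresholds to an alarm. The namespace and metric name are assumptions; the 1.5s p99 target comes from the thresholds above.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-p99-first-token-latency",
    Namespace="MangaAssist/Chat",          # assumed custom namespace
    MetricName="FirstTokenLatencyMs",      # assumed metric name
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1500,                        # 1.5s first-token target
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)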


Interview-Ready Answer

If I am asked, "How do you debug the chatbot throughout the application?", my answer is:

  1. I take the request ID and session ID from the reported incident.
  2. I verify the request entered through API Gateway cleanly.
  3. I inspect orchestrator logs to see routing, memory load, and which downstream services were called.
  4. I compare classifier output, retrieval output, and catalog data to find where the answer first became wrong.
  5. I inspect Bedrock invocation and guardrail logs only after upstream data looks correct.
  6. I finish by checking delivery logs and persistence so I know whether the problem was content, latency, or transport.

That shows system debugging discipline, not just LLM familiarity.