
1. Bedrock Logging for MangaAssist

This document answers the operational question: how do I check RAG chatbot logs in Bedrock, and how do I tell whether the problem is retrieval, generation, or guardrails?

It is intentionally Bedrock-focused. For end-to-end service tracing across API Gateway, Lambda, ECS, SageMaker, DynamoDB, and WebSocket delivery, see 02-application-logging.md.


What MangaAssist Uses Bedrock For

In MangaAssist, Bedrock sits on the generation path after the orchestrator has already:

  1. loaded conversation memory,
  2. classified intent,
  3. fetched downstream data in parallel,
  4. assembled prompt context.

Bedrock is therefore not the first place to look for every incident. It is the first place to look when:

  • the response content looks wrong even though upstream services appear healthy,
  • generation latency spikes,
  • the response is blocked or heavily modified by guardrails,
  • token cost suddenly rises,
  • RAG-backed answers look ungrounded or stale.

Relevant architecture references:

  • 04b-architecture-lld.md for orchestrator, RAG pipeline, memory, and guardrails
  • 06-detailed-workflow.md for request flow and latency budget
  • 13-metrics.md for operational targets
  • Model-Inference/06-model-evaluation-framework.md for throttling and rollback thresholds

What To Enable in Bedrock

For a production RAG chatbot, I would enable three visibility layers around Bedrock:

  • Model invocation logging: reconstruct what the model saw and how long it took. Primary destination: CloudWatch Logs plus encrypted S3.
  • Knowledge-base or retrieval trace logging: verify which chunks were retrieved and why. Primary destination: CloudWatch Logs.
  • Guardrail trace logging: explain why a response was blocked, rewritten, or passed. Primary destination: CloudWatch Logs.

1. Model Invocation Logging

Enable Bedrock model invocation logging so each call captures enough metadata to debug quality and latency issues:

  • request_id or invocation correlation key
  • model_id
  • input and output token counts
  • invocation timestamp
  • latency or time to completion
  • success vs. failure status
  • sanitized prompt payload
  • sanitized model response payload
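Pulling those fields out of a raw invocation record is worth scripting once. A minimal sketch; the nested key names (requestId, inputTokenCount, and so on) follow the general shape of Bedrock invocation logs but are assumptions here and should be checked against the records your logging destination actually receives.

```python
# Normalize a raw Bedrock model-invocation log record into the flat
# fields used for debugging. Key paths are assumptions -- verify them
# against the records your logging destination actually emits.

def normalize_invocation_record(record: dict) -> dict:
    inp = record.get("input", {})
    out = record.get("output", {})
    return {
        "request_id": record.get("requestId"),
        "model_id": record.get("modelId"),
        "timestamp": record.get("timestamp"),
        "input_tokens": inp.get("inputTokenCount"),
        "output_tokens": out.get("outputTokenCount"),
        "status": "success" if record.get("errorCode") is None else "error",
    }

raw = {
    "requestId": "req-123",
    "modelId": "anthropic.claude-3-5-sonnet",
    "timestamp": "2026-03-22T18:12:31Z",
    "input": {"inputTokenCount": 3847},
    "output": {"outputTokenCount": 142},
}
print(normalize_invocation_record(raw))
```

Normalizing at ingest time is what makes the Logs Insights queries later in this doc possible with stable field names.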

For MangaAssist, these logs are most useful when validating:

  • price hallucination incidents,
  • response length inflation,
  • model tiering between Sonnet and Haiku,
  • throttling during traffic spikes.

2. Retrieval Trace Logging

MangaAssist uses a RAG pipeline with chunking, Titan embeddings, OpenSearch Serverless vector search, and reranking. Bedrock-side retrieval traces are useful when the user says the chatbot gave a policy answer based on the wrong manga title, stale FAQ content, or irrelevant editorial text.

The key fields to preserve are:

  • retrieval query text
  • query embedding request reference
  • candidate chunk IDs
  • reranker scores
  • final selected chunks
  • source metadata such as source_type, asin, category, last_updated
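With those fields preserved, the "wrong title / stale FAQ" complaint becomes a mechanical audit. A sketch using the metadata names above; the 90-day staleness window and the dict shape are illustrative assumptions, not documented values.

```python
from datetime import datetime, timedelta, timezone

# Flag retrieved chunks whose metadata looks wrong for the question:
# unexpected source_type, or last_updated older than a staleness window.
# Field names mirror the trace fields above; the window is illustrative.

def audit_chunks(chunks, expected_source_type, max_age_days=90, now=None):
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    findings = []
    for c in chunks:
        if c["source_type"] != expected_source_type:
            findings.append((c["chunk_id"], "wrong_source_type"))
        if datetime.fromisoformat(c["last_updated"]) < cutoff:
            findings.append((c["chunk_id"], "stale"))
    return findings

chunks = [
    {"chunk_id": "chunk-001", "source_type": "faq",
     "last_updated": "2026-03-01T00:00:00+00:00"},
    {"chunk_id": "chunk-042", "source_type": "editorial",
     "last_updated": "2024-01-10T00:00:00+00:00"},
]
now = datetime(2026, 3, 22, tzinfo=timezone.utc)
print(audit_chunks(chunks, expected_source_type="faq", now=now))
```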

3. Guardrail Logging

Guardrail logging is mandatory for this project because manga content can trigger false positives on terms like "kill", "death", "demon", or "attack" even when the conversation is about legitimate product metadata or plot themes.

The key fields are:

  • filter category
  • confidence score
  • action taken: pass, block, redact, rewrite, or fallback
  • original span or phrase that triggered the rule
  • resulting response status

MangaAssist specifically cares about guardrails because the documented target is a guardrail block rate below 5%, while warning thresholds in the reliability doc treat >10% for 15 minutes as a production issue.
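Those two thresholds can be encoded directly in a monitoring check. A sketch: the below-5% target and the greater-than-10%-for-15-minutes warning come from the reliability doc, while the function shape and windowing are hypothetical.

```python
# Classify one 15-minute window of guardrail outcomes against the
# documented thresholds: target block rate < 5%, production issue
# if the rate exceeds 10% for the window.

def guardrail_health(outcomes):
    """outcomes: list of guardrail_action strings for one window."""
    if not outcomes:
        return "no_data"
    block_rate = outcomes.count("block") / len(outcomes)
    if block_rate > 0.10:
        return "production_issue"   # sustained >10% for the window
    if block_rate >= 0.05:
        return "above_target"       # missed the <5% target
    return "healthy"

window = ["pass"] * 92 + ["block"] * 8   # 8% block rate
print(guardrail_health(window))
```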


How I Check a Single Bad RAG Response in Bedrock

When a user says, "the chatbot gave the wrong answer," I do not start by reading random logs. I reconstruct one turn end to end in this order.

Step 1: Start with the Correlation Key

From the app logs or tracing system, identify:

  • request_id
  • session_id
  • turn_id if present
  • intent
  • timestamp window

Without a correlation key, Bedrock logs are too noisy during peak traffic.
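Once the keys are known, narrowing an exported log dump to one turn is trivial to script. A sketch over already-exported records, using the normalized field names assumed throughout this doc.

```python
# Pull the records for one conversational turn out of a pile of
# exported Bedrock log records, filtering on the correlation keys above.

def records_for_turn(records, request_id=None, session_id=None):
    def match(r):
        if request_id and r.get("request_id") != request_id:
            return False
        if session_id and r.get("session_id") != session_id:
            return False
        return True
    return [r for r in records if match(r)]

records = [
    {"request_id": "req-123", "session_id": "sess-456", "intent": "faq"},
    {"request_id": "req-999", "session_id": "sess-456", "intent": "order"},
]
print(records_for_turn(records, request_id="req-123"))
```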

Step 2: Inspect the Invocation Record

From the Bedrock invocation log, confirm:

  • which model handled the turn,
  • whether the model was Sonnet or Haiku,
  • input token count,
  • output token count,
  • latency,
  • whether the invocation succeeded, throttled, or timed out.

Questions I ask immediately:

  • Was the latency spike inside Bedrock or before Bedrock?
  • Did prompt size increase sharply?
  • Did the model return normally and the issue happen later in guardrails?
  • Was a fallback model used?
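Those questions can be collapsed into a first-pass triage function over the normalized invocation record. The thresholds here (2s latency, 6,000-token prompt) are illustrative cutoffs, not documented SLOs.

```python
# First-pass triage of a single invocation record: decide whether the
# suspect is capacity, prompt size, Bedrock latency, or something
# downstream. Thresholds are illustrative, not documented targets.

def triage_invocation(rec):
    if rec["status"] in ("throttled", "timeout"):
        return "capacity"            # throttling / timeout path
    if rec["input_tokens"] > 6000:
        return "prompt_explosion"    # context assembly suspect
    if rec["latency_ms"] > 2000:
        return "model_latency"       # slow inside Bedrock
    return "look_downstream"         # Bedrock looks healthy; check guardrails

rec = {"status": "success", "input_tokens": 3847, "latency_ms": 1490}
print(triage_invocation(rec))
```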

Step 3: Verify What Retrieval Context Was Attached

For a RAG-backed answer, I then inspect retrieval traces:

  • what was the actual retrieval query?
  • which chunks were retrieved first?
  • which chunks survived reranking?
  • did the chosen chunks belong to the correct source_type?
  • were the last_updated timestamps stale?

For MangaAssist, this is critical because one of the documented failure modes is a frequently retrieved editorial chunk skewing recommendations or FAQ-style answers.

Step 4: Compare the Prompt With the Output

If retrieval looks right, I compare:

  • structured product data injected into the prompt,
  • RAG chunks included in the prompt,
  • user message,
  • final raw model output.

This is where I detect problems like:

  • correct order data, wrong LLM interpretation,
  • valid product data, invented discount text,
  • relevant chunks, but the model over-weighted one chunk.
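One cheap automated check for the invented-discount case: every number in the model output should appear somewhere in the prompt context. A deliberately naive sketch; it catches fabricated prices and percentages, but not rephrased or merged facts.

```python
import re

# Naive groundedness check: flag numbers in the model output that do
# not appear anywhere in the prompt context. Catches invented prices
# or discounts; it cannot catch rephrased facts.

def ungrounded_numbers(prompt_context: str, output: str):
    num = re.compile(r"\d+(?:\.\d+)?")
    known = set(num.findall(prompt_context))
    return [n for n in num.findall(output) if n not in known]

context = "Title: One Piece Vol. 42. Price: 9.99 USD. Stock: 17."
answer = "One Piece Vol. 42 costs 9.99 USD, currently 20% off!"
print(ungrounded_numbers(context, answer))  # the 20% discount is invented
```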

Step 5: Check Guardrail Outcome

If the raw output looks good but the delivered answer is blocked, evasive, or missing details, I inspect guardrail logs for:

  • which filter fired,
  • confidence level,
  • whether the response was replaced with a fallback,
  • whether the incident is a domain false positive rather than a real safety issue.

Failure-Type Playbooks

1. Bedrock Timeout or Throttling

Symptoms

  • first-token latency jumps above the normal P99 < 1.5s target,
  • full response latency trends toward or beyond 3s,
  • timeout rate exceeds the documented 5% threshold,
  • fallback responses or brief "service unavailable" messages increase.

What I check in Bedrock

  • throttling-related invocation failures,
  • latency grouped by model_id,
  • token counts to rule out prompt explosion,
  • time window correlation with traffic spikes.

How I interpret it

The repo already notes that timeout spikes are primarily a load and model-path problem, not just a prompt-size problem. If Bedrock throttling is above 5%, the operational play is request queuing with priority, stronger routing to template or API paths, and more aggressive model tiering.
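The "more aggressive model tiering" play can be sketched as a router that shifts new requests to the cheaper tier when the recent throttle rate crosses the 5% threshold. The model identifiers and window handling are placeholders.

```python
# Route new requests to a cheaper/faster model tier when the recent
# Bedrock throttle rate crosses the documented 5% threshold.
# Model identifiers are placeholders, not real Bedrock model IDs.

PRIMARY, FALLBACK = "sonnet", "haiku"

def pick_model(recent_statuses, threshold=0.05):
    if not recent_statuses:
        return PRIMARY
    throttled = sum(s == "throttled" for s in recent_statuses)
    return FALLBACK if throttled / len(recent_statuses) > threshold else PRIMARY

print(pick_model(["success"] * 90 + ["throttled"] * 10))  # 10% throttled
```

In production this sits next to request queuing; the router only needs the sliding window of invocation statuses that invocation logging already provides.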

2. Wrong Answer but Retrieval Looks Correct

Symptoms

  • retrieved chunks are relevant,
  • product or order data is accurate,
  • raw response still misstates facts or merges fields incorrectly.

What I check in Bedrock

  • raw prompt structure,
  • whether field names are ambiguous,
  • whether the model over-indexed on one chunk,
  • whether output temperature or prompt instructions changed.

Typical root cause in MangaAssist

The repo documents one concrete pattern: the LLM confused fulfillment_status with delivery_status. That is not a retrieval bug. It is a prompt-structure or schema-interpretation bug.
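A prompt-structure fix for that class of bug is to inject each status field with an explicit gloss instead of a bare field name, so similar names cannot be conflated. A sketch; the glossary text is a hypothetical example, not the repo's actual prompt template.

```python
# Render order fields into the prompt with explicit labels so the model
# cannot conflate similar names like fulfillment_status and
# delivery_status. The glossary wording is a hypothetical example.

GLOSSARY = {
    "fulfillment_status": "warehouse processing state (picking/packing)",
    "delivery_status": "carrier shipping state (in transit/delivered)",
}

def render_order_fields(order: dict) -> str:
    lines = []
    for field, value in order.items():
        label = GLOSSARY.get(field, field)
        lines.append(f"- {field} ({label}): {value}")
    return "\n".join(lines)

order = {"fulfillment_status": "packed", "delivery_status": "in_transit"}
print(render_order_fields(order))
```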

3. Wrong Retrieval Despite a Good User Question

Symptoms

  • response sounds coherent,
  • cited facts come from irrelevant editorial or stale FAQ chunks,
  • same chunk appears too often across many sessions.

What I check in Bedrock

  • retrieval query text,
  • top candidate chunks,
  • reranker scores,
  • chunk frequency across recent requests,
  • last_updated metadata on selected chunks.

Typical root cause in MangaAssist

The repo explicitly describes an editorial chunk that was retrieved too frequently because its embedding sat too close to many queries. That is a retrieval quality problem, not an LLM reasoning problem.

4. Guardrail Block on Valid Manga Content

Symptoms

  • safe fallback responses rise,
  • valid product answers are blocked,
  • user intent is legitimate but content terms are genre-heavy.

What I check in Bedrock

  • guardrail category and confidence,
  • exact phrase that triggered the rule,
  • whether the block came from toxicity, denied topics, or PII,
  • guardrail block rate by intent.

Typical root cause in MangaAssist

Static safety rules over-block domain terms. The repo already documents this as a false-positive problem that required context-aware guardrails and confidence scoring.
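The context-aware fix can be approximated with per-category confidence thresholds plus a domain allowlist for genre vocabulary. A sketch of the idea, not the repo's actual guardrail configuration; thresholds and the allowlist are illustrative.

```python
# Approximate context-aware guardrails: only act on a filter hit when
# its confidence clears a per-category threshold, and ignore violence
# hits whose trigger phrase is known genre vocabulary.
# Thresholds and the allowlist are illustrative values.

DOMAIN_ALLOWLIST = {"kill", "death", "demon", "attack"}
THRESHOLDS = {"violence": 0.9, "pii": 0.5, "denied_topic": 0.7}

def guardrail_decision(category, confidence, trigger_phrase):
    if category == "violence" and trigger_phrase.lower() in DOMAIN_ALLOWLIST:
        return "pass"  # genre term, not a real safety issue
    if confidence >= THRESHOLDS.get(category, 0.8):
        return "block"
    return "pass"

print(guardrail_decision("violence", 0.95, "demon"))   # genre vocabulary
print(guardrail_decision("pii", 0.60, "card number"))  # real PII hit
```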

5. Token Explosion From Prompt Assembly

Symptoms

  • output is slower and more expensive,
  • response quality becomes less focused,
  • token count spikes for a subset of intents or long sessions.

What I check in Bedrock

  • input token count trend,
  • intent distribution during the spike,
  • whether conversation history was summarized,
  • whether too many chunks or products were injected.

Typical root cause in MangaAssist

Context assembly exceeded the intended ~5,000 token budget or failed to compress history and product data correctly.
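A budget enforcer in the context-assembly step prevents this failure mode. A sketch using a rough four-characters-per-token estimate; the 5,000-token budget is the documented target, while the reserve size and chunk shapes are illustrative and a real tokenizer should replace the estimator.

```python
# Enforce the ~5,000-token context budget during prompt assembly:
# keep the highest-scoring chunks until the estimated budget is spent.
# Uses a rough 4-chars-per-token estimate; swap in a real tokenizer.

TOKEN_BUDGET = 5000

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fit_chunks(scored_chunks, reserved_tokens=1500):
    """scored_chunks: (score, text) pairs; reserve covers history + user msg."""
    budget = TOKEN_BUDGET - reserved_tokens
    kept, used = [], 0
    for score, text in sorted(scored_chunks, reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # skip chunks that would blow the budget
        kept.append(text)
        used += cost
    return kept, used

chunks = [(0.9, "a" * 8000), (0.8, "b" * 8000), (0.7, "c" * 4000)]
kept, used = fit_chunks(chunks)
print(len(kept), used)
```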


Bedrock Log Fields I Would Standardize Around

Even if AWS-managed logs vary by destination, I want these normalized fields available in analysis pipelines for MangaAssist:

{
  "request_id": "req-123",
  "session_id": "sess-456",
  "intent": "recommendation",
  "model_id": "claude-3.5-sonnet",
  "invocation_status": "success",
  "input_tokens": 3847,
  "output_tokens": 142,
  "bedrock_latency_ms": 1490,
  "retrieved_chunk_ids": ["chunk-001", "chunk-042", "chunk-187"],
  "retrieval_scores": [0.94, 0.81, 0.73],
  "guardrail_action": "pass",
  "fallback_used": false,
  "timestamp": "2026-03-22T18:12:31Z"
}

This matches the observability direction already described in Challenges/real-world-challenges.md.


CloudWatch Logs Insights Queries for Bedrock

The exact field names depend on how logs are routed, but these are the analysis patterns I would keep ready.

Query 1: Find Throttled or Timed-Out Bedrock Calls

fields @timestamp, request_id, session_id, model_id, invocation_status, error_code, bedrock_latency_ms
| filter invocation_status in ["throttled", "timeout", "error"]
| sort @timestamp desc
| limit 100

Use this when users report missing or delayed answers during spikes.

Query 2: Compare Latency by Model

stats pct(bedrock_latency_ms, 50) as p50,
      pct(bedrock_latency_ms, 95) as p95,
      pct(bedrock_latency_ms, 99) as p99,
      count(*) as requests
by model_id
| sort p99 desc

Use this to decide whether Sonnet or Haiku is driving the slowdown.

Query 3: Detect Token Inflation

stats avg(input_tokens) as avg_in,
      avg(output_tokens) as avg_out,
      max(input_tokens) as max_in,
      max(output_tokens) as max_out
by intent, model_id
| sort avg_out desc

Use this when cost rises or responses become verbose.

Query 4: Guardrail Block Analysis by Intent

stats count(*) as blocked
by intent, guardrail_action, guardrail_category
| sort blocked desc

Use this to catch domain-specific false positives, especially around manga terminology.

Query 5: Frequently Retrieved Chunks

fields retrieved_chunk_ids
| unnest retrieved_chunk_ids as chunk_id
| stats count(*) as retrieval_count by chunk_id
| sort retrieval_count desc
| limit 20

Use this when one editorial or FAQ chunk appears to dominate responses. Note that unnest is Athena/Presto-style syntax rather than a standard CloudWatch Logs Insights command; if your query destination cannot flatten arrays, export retrieved_chunk_ids and count frequencies client-side instead.
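The same analysis takes a few lines of client-side code over exported records, which also works when the query language cannot flatten arrays. A sketch, assuming the normalized retrieved_chunk_ids field from earlier in this doc.

```python
from collections import Counter

# Client-side version of the chunk-frequency analysis: count how often
# each chunk ID appears across exported Bedrock retrieval records and
# surface the ones that dominate.

def top_chunks(records, n=20):
    counts = Counter()
    for rec in records:
        counts.update(rec.get("retrieved_chunk_ids", []))
    return counts.most_common(n)

records = [
    {"retrieved_chunk_ids": ["chunk-001", "chunk-042"]},
    {"retrieved_chunk_ids": ["chunk-042", "chunk-187"]},
    {"retrieved_chunk_ids": ["chunk-042"]},
]
print(top_chunks(records, n=2))
```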

Query 6: High-Latency RAG Turns

fields @timestamp, request_id, intent, model_id, bedrock_latency_ms, input_tokens, retrieved_chunk_ids
| filter intent in ["faq", "recommendation", "product_question"]
| filter bedrock_latency_ms > 2000
| sort @timestamp desc
| limit 50

Use this to review slow grounded responses and determine whether the problem is prompt size or model saturation.


How Bedrock Logs Connect to the Rest of the App

Bedrock logs should answer these questions only:

  • what model was invoked,
  • what grounded context reached the model,
  • how long the invocation took,
  • whether guardrails modified the answer,
  • whether the issue was retrieval, generation, or blocking.

They should not be your only source of truth. To understand whether upstream services fed bad inputs into Bedrock, pair this file with 02-application-logging.md.


Interview-Ready Answer

If asked, "How do you check RAG chatbot logs in Bedrock?", a strong MangaAssist answer is:

  1. I start with the request or session ID from the app trace.
  2. I open Bedrock invocation logs to verify model, tokens, latency, and failure status.
  3. I inspect retrieval traces to see which chunks were fetched, reranked, and injected.
  4. I compare the raw prompt context with the raw model output to separate retrieval issues from generation issues.
  5. I inspect Bedrock guardrail logs to see whether the answer was blocked or rewritten.
  6. If Bedrock looks healthy, I pivot back to orchestrator, classifier, catalog, or memory logs to find the upstream source of the bad context.

That shows both Bedrock fluency and system-level debugging discipline.