MLflow Deep-Dive Scenarios for MangaAssist
Where MLflow is used in the chatbot, how it helped, and why it became part of the operating model instead of just another dashboard.
How MLflow Fits This Chatbot
MangaAssist is not a single model call. A meaningful response can touch:
- The orchestrator on ECS Fargate
- A SageMaker-hosted intent classifier
- Titan embeddings plus OpenSearch retrieval
- A SageMaker reranker
- Bedrock for response generation
- A custom guardrails pipeline
- AppConfig for prompt and experiment control
MLflow gives that multi-service path one shared control plane:
| MLflow capability | What it covers in MangaAssist |
|---|---|
| Tracing | Step-level visibility for intent, retrieval, prompt build, Bedrock generation, guardrails, and fallbacks |
| Experiments | Prompt comparisons, retriever tuning, golden dataset runs, and shadow evaluations |
| Registry | Versioning for intent classifier, reranker, embedding config, prompt bundle, and release bundle |
| Artifacts | Prompt files, evaluation reports, retrieved chunk snapshots, confusion matrices, and run summaries |
| Tags and lineage | Traceable linkage between request, prompt version, model bundle, deployment stage, and user feedback |
Scenario 1: Debugging a Bad Bedrock Response
Problem
A customer says the recommendation answer was slow and also off-target. Without request-level tracing, the on-call engineer has to check CloudWatch, OpenSearch logs, SageMaker logs, and application logs separately.
Where MLflow Is Used
MLflow traces the full request as one parent trace with child spans (see the sketch after this list):
- load_context
- intent_classification
- retrieve_chunks
- rerank_chunks
- build_prompt
- bedrock_generate
- apply_guardrails
- save_response
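A minimal instrumentation sketch of that parent/child structure, assuming MLflow's fluent tracing API (mlflow >= 2.14); the helper functions, experiment name, and attribute values are hypothetical stand-ins for the real SageMaker, OpenSearch, and Bedrock calls:

```python
import mlflow

# Hypothetical stand-ins for the real service calls.
def classify_intent(query): return {"label": "recommendation", "confidence": 0.93, "version": "12"}
def retrieve_chunks(query): return [{"id": "chunk-1"}, {"id": "chunk-2"}]
def generate_answer(intent, chunks): return "..."

mlflow.set_experiment("mangaassist-chatbot")  # assumed experiment name

@mlflow.trace(name="handle_request")          # parent trace for the whole request
def handle_request(user_query: str) -> str:
    with mlflow.start_span(name="intent_classification") as span:
        intent = classify_intent(user_query)
        span.set_attributes({"intent": intent["label"],
                             "confidence": intent["confidence"],
                             "classifier_version": intent["version"]})

    with mlflow.start_span(name="retrieve_chunks") as span:
        chunks = retrieve_chunks(user_query)
        span.set_attributes({"candidate_count": len(chunks),
                             "top_chunk_ids": [c["id"] for c in chunks]})

    with mlflow.start_span(name="bedrock_generate") as span:
        answer = generate_answer(intent, chunks)
        span.set_attributes({"bedrock_model_id": "anthropic.claude-3-5-sonnet"})

    return answer

handle_request("any slice-of-life series like Yotsuba&!?")
```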
Services Involved
- ECS Fargate orchestrator
- SageMaker intent classifier
- Bedrock Claude generation
- Bedrock Titan embeddings
- OpenSearch Serverless
- SageMaker reranker
- Guardrails service
What We Capture
| Span | Key attributes |
|---|---|
| intent_classification | intent, confidence, classifier_version, classification_path |
| retrieve_chunks | query_embedding_model, candidate_count, top_chunk_ids, source_types |
| rerank_chunks | reranker_version, input_count, output_count, latency_ms |
| build_prompt | prompt_version, system_prompt_hash, retrieved_chunk_count, prompt_chars |
| bedrock_generate | bedrock_model_id, temperature, max_tokens, input_tokens, output_tokens, stop_reason |
| apply_guardrails | guardrail_stage, verdict, rule_id, fallback_triggered |
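As one concrete case, the bedrock_generate row above could be filled straight from a Bedrock Converse response; a hedged sketch assuming boto3's bedrock-runtime client, with the model alias reused from the release bundle naming in this document:

```python
import boto3
import mlflow

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet"  # alias used in this document; the exact Bedrock model ID may differ

def bedrock_generate(prompt: str, temperature: float = 0.2, max_tokens: int = 1024) -> str:
    with mlflow.start_span(name="bedrock_generate") as span:
        response = bedrock.converse(
            modelId=MODEL_ID,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"temperature": temperature, "maxTokens": max_tokens},
        )
        usage = response["usage"]
        # Capture the same fields listed in the table above.
        span.set_attributes({
            "bedrock_model_id": MODEL_ID,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "input_tokens": usage["inputTokens"],
            "output_tokens": usage["outputTokens"],
            "stop_reason": response["stopReason"],
        })
        return response["output"]["message"]["content"][0]["text"]
```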
How It Helped
- It separated Bedrock latency from retrieval latency instead of treating the whole request as one opaque number.
- It exposed cases where the reranker was processing too many documents before the Bedrock call even started.
- It let engineers replay the exact prompt version and retrieved context that led to the wrong answer.
- It reduced time-to-root-cause from a cross-service investigation to a single trace lookup.
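That "single trace lookup" can be as simple as filtering on a tag the orchestrator sets per request; a sketch assuming the deployed MLflow version exposes mlflow.search_traces and that a request_id tag (hypothetical) is set on every trace:

```python
import mlflow

mlflow.set_experiment("mangaassist-chatbot")  # assumed experiment name

# One row back: timings, status, and the full span tree for that request.
traces = mlflow.search_traces(
    filter_string="tags.request_id = 'req-7f3a'",  # hypothetical request ID from the support ticket
    max_results=1,
)
print(traces)
```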
Example Outcome
One trace made it obvious that only 280 ms was spent in Bedrock while 390 ms was spent reranking 50 chunks instead of 20. The fix was not a model swap. It was a retrieval configuration correction.
Scenario 2: Preventing Prompt Regressions Before a Bedrock Rollout
Problem
A small prompt edit can improve one intent and silently degrade another. MangaAssist uses different prompt shapes for recommendation, FAQ, and product Q&A flows, so changes must be evaluated before they go live.
Where MLflow Is Used
Every prompt candidate is run as an MLflow experiment:
- Prompt files are logged as artifacts.
- Golden dataset metrics are logged as run metrics.
- Bedrock model ID, prompt version, retriever version, and guardrail version are logged as tags.
- Comparison reports are stored as artifacts and linked to the PR or change request.
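A hedged sketch of what logging one prompt candidate as a run could look like; the experiment name, tag keys, metric names, and values are illustrative placeholders, not the team's actual schema:

```python
from pathlib import Path
import mlflow

mlflow.set_experiment("prompt-candidates-recommendation")  # assumed experiment name

# Illustrative prompt file; in the real pipeline this comes from the repo.
prompt_path = Path("recommendation_v2.5.txt")
prompt_path.write_text("You are MangaAssist. Recommend titles grounded in the retrieved catalog chunks.")

with mlflow.start_run(run_name="recommendation-2.5-rc1"):
    # The prompt under evaluation is stored as an artifact for exact replay.
    mlflow.log_artifact(str(prompt_path), artifact_path="prompt")

    # Versions of everything else in the path go on tags, so runs are comparable.
    mlflow.set_tags({
        "bedrock_model_id": "anthropic.claude-3-5-sonnet",
        "prompt_version": "recommendation-2.5-rc1",
        "retriever_version": "rag-v5",
        "guardrail_ruleset_version": "gr-17",
        "change_request": "PR-0000",  # hypothetical PR reference
    })

    # Golden dataset results from the (hypothetical) offline evaluator; placeholder values.
    mlflow.log_metrics({
        "intent_accuracy": 0.94,
        "guardrail_pass_rate": 0.991,
        "grounding_failures": 3,
        "avg_output_tokens": 412,
    })

    # Comparison summary attached for reviewers.
    mlflow.log_dict({"candidate": "recommendation-2.5-rc1", "baseline": "recommendation-2.4"},
                    "evaluation/comparison_summary.json")
```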
Services Involved
- GitHub or code review system
- CodeBuild or CI runner
- Bedrock for offline response generation
- AppConfig for active prompt selection
- S3 for evaluation dataset and reports
What We Compare
- Intent accuracy on prompt-sensitive flows
- BERTScore or rubric score against references
- Guardrail pass rate
- Response length distribution
- Hallucination or grounding failure count
- Bedrock token usage and latency
How It Helped
- Prompt changes stopped being subjective because each candidate had a run ID and a reproducible report.
- Teams could compare prompt variants even when the underlying Bedrock model stayed the same.
- AppConfig promotion was gated on an MLflow run that met the release thresholds.
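One way such a gate could be expressed in CI, assuming the run from the sketch above; the run ID, metric names, and thresholds are placeholders:

```python
import mlflow

CANDIDATE_RUN_ID = "abc123"  # hypothetical run ID passed in from the CI job

run = mlflow.get_run(CANDIDATE_RUN_ID)
metrics = run.data.metrics

# Release thresholds (placeholder values) the candidate must meet before
# the AppConfig prompt promotion proceeds.
THRESHOLDS = {
    "intent_accuracy": 0.92,
    "guardrail_pass_rate": 0.99,
}

failures = {name: metrics.get(name)
            for name, floor in THRESHOLDS.items()
            if metrics.get(name, 0.0) < floor}

if failures:
    raise SystemExit(f"Promotion blocked, thresholds not met: {failures}")
print("Thresholds met, safe to promote the prompt via AppConfig.")
```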
Example Outcome
A concise recommendation prompt looked better in ad hoc testing but dropped grounded product mention accuracy on the golden set. MLflow made that visible before the prompt reached production traffic.
Scenario 3: Registry and Release Lineage Across SageMaker and Bedrock
Problem
A single MangaAssist response depends on more than one versioned asset:
- Intent classifier version
- Reranker version
- Embedding model choice and chunk policy
- Prompt version
- Bedrock model alias
- Guardrail ruleset version
If those move independently, rollback and audit become error-prone.
Where MLflow Is Used
MLflow Registry stores individual assets and also a release bundle that points to the exact combination promoted together.
Services Involved
- MLflow tracking and registry server
- S3 artifact store
- RDS metadata store
- SageMaker for hosted models
- AppConfig for active prompt and routing configuration
- EventBridge for promotion events
Bundle Metadata
```json
{
  "release_bundle": "mangaassist-chatbot-prod",
  "version": "2026.03.24-rc2",
  "intent_classifier_version": "12",
  "reranker_version": "8",
  "retriever_policy_version": "rag-v5",
  "prompt_version": "recommendation-2.4",
  "bedrock_model_id": "anthropic.claude-3-5-sonnet",
  "guardrail_ruleset_version": "gr-17"
}
```
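One possible way to record that bundle, sketched under the assumption that the bundle lives as a tagged run artifact and that the component registered-model names shown here are hypothetical:

```python
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# The exact combination being promoted together (values mirror the JSON above).
bundle = {
    "release_bundle": "mangaassist-chatbot-prod",
    "version": "2026.03.24-rc2",
    "intent_classifier_version": "12",
    "reranker_version": "8",
    "retriever_policy_version": "rag-v5",
    "prompt_version": "recommendation-2.4",
    "bedrock_model_id": "anthropic.claude-3-5-sonnet",
    "guardrail_ruleset_version": "gr-17",
}

with mlflow.start_run(run_name=bundle["version"]):
    mlflow.log_dict(bundle, "release_bundle.json")  # auditable artifact for rollback
    mlflow.set_tags(bundle)                         # searchable lineage tags

# Cross-link the component model versions back to the bundle they shipped in.
client.set_model_version_tag("mangaassist-intent-classifier",
                             bundle["intent_classifier_version"],
                             "release_bundle_version", bundle["version"])
client.set_model_version_tag("mangaassist-reranker",
                             bundle["reranker_version"],
                             "release_bundle_version", bundle["version"])
```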
How It Helped
- Rollback became a bundle-level action instead of a scramble across multiple systems.
- On-call could answer "what produced this response?" with one registry lookup.
- Shadow and canary evaluations could compare release bundles instead of isolated components.
Example Outcome
When a Bedrock prompt update and reranker refresh shipped in the same window, MLflow kept the lineage intact and prevented a partial rollback that would have mixed incompatible versions.
Scenario 4: Explaining Thumbs-Down Feedback with Trace Correlation
Problem
User feedback is only useful if it can be tied back to what actually happened in the pipeline. "Bad answer" is not enough to decide whether the issue came from retrieval, prompting, generation, or guardrails.
Where MLflow Is Used
Each chatbot response carries a trace_id and response_id. Feedback events attach to those IDs and are written back to MLflow-linked analytics pipelines.
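A hedged sketch of the feedback write path, assuming a Kinesis stream and that the orchestrator already surfaced the MLflow trace ID in its response metadata; the stream and field names are illustrative:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def record_feedback(trace_id: str, response_id: str, verdict: str) -> None:
    """Attach a thumbs-up / thumbs-down / escalation event to the trace that produced it."""
    event = {
        "trace_id": trace_id,      # same ID MLflow stored for the request trace
        "response_id": response_id,
        "feedback": verdict,       # "thumbs_up", "thumbs_down", or "escalation"
    }
    kinesis.put_record(
        StreamName="chatbot-feedback",            # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=response_id,
    )
```

Downstream, the analytics pipeline joins these events to the trace metadata on trace_id, which is what enables the per-response breakdown listed below.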
Services Involved
- Frontend feedback widget
- Orchestrator response metadata
- MLflow trace store
- Kinesis or event stream
- Redshift or analytics warehouse
What We Join
- Feedback value: thumbs up, thumbs down, escalation
- Intent and route chosen
- Prompt version and Bedrock model ID
- Retrieved chunk IDs and source types
- Guardrail decisions
- Final response latency and token cost
How It Helped
- It separated poor retrieval from poor writing tone.
- It identified which prompt version caused a spike in dissatisfaction for recommendation queries.
- It created labeled examples for future retraining and adversarial testing.
Example Outcome
An increase in thumbs-down on promotion questions looked like a generation problem at first. Trace correlation showed the real issue was stale promo chunks in the retrieval layer, not Bedrock output quality.
Scenario 5: Tuning the Guardrails Pipeline Without Flying Blind
Problem
Guardrails that are too loose create safety risk. Guardrails that are too aggressive block good answers and hurt conversion. Aggregate block counts do not explain which stage is overfiring.
Where MLflow Is Used
Each guardrail stage is its own child span under apply_guardrails, with explicit inputs, outputs, verdict, and latency.
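A minimal sketch of those stage-level spans, assuming the same fluent tracing API as earlier; the checker functions and the grounding threshold are hypothetical placeholders:

```python
import mlflow

# Hypothetical stage checkers; each returns a verdict plus stage-specific details.
def check_prompt_injection(text): return {"score": 0.02, "decision": "pass"}
def check_grounding(text, chunks): return {"grounding_score": 0.81, "unsupported_claims": 0}

def apply_guardrails(response_text: str, chunks: list) -> str:
    with mlflow.start_span(name="apply_guardrails") as parent:
        with mlflow.start_span(name="prompt_injection_detection") as span:
            result = check_prompt_injection(response_text)
            span.set_attributes({"prompt_injection_score": result["score"],
                                 "decision": result["decision"]})

        with mlflow.start_span(name="hallucination_check") as span:
            result = check_grounding(response_text, chunks)
            fallback = result["grounding_score"] < 0.5  # placeholder threshold
            span.set_attributes({"grounding_score": result["grounding_score"],
                                 "unsupported_claim_count": result["unsupported_claims"],
                                 "fallback_triggered": fallback})

        parent.set_attributes({"verdict": "fallback" if fallback else "pass"})
        return response_text
```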
Services Involved
- Bedrock generation
- Guardrails pipeline
- CloudWatch alarms
- Security review workflows
Stage-Level Visibility
| Guardrail stage | Logged fields |
|---|---|
| Prompt injection detection | prompt_injection_score, decision, pattern_id |
| PII detection | entity_count, entity_types, redaction_applied |
| Hallucination check | grounding_score, unsupported_claim_count, fallback_triggered |
| Format validation | schema_passed, missing_fields |
How It Helped
- It showed whether the issue was false positives, latency overhead, or missing detections.
- It made safety tuning measurable instead of anecdotal.
- It allowed reviewers to inspect blocked traces with enough context to improve rules safely.
Example Outcome
The hallucination check flagged too many recommendation answers because it expected direct policy-style grounding. Trace analysis showed it needed intent-aware thresholds instead of one global threshold.
Scenario 6: Cost and Routing Optimization for Bedrock Usage
Problem
MangaAssist mixes cheap deterministic flows with expensive generative flows. If every ambiguous query goes to the largest model with long prompts, token cost rises quickly.
Where MLflow Is Used
MLflow traces and experiments log:
- Model used
- Input and output tokens
- Prompt length
- Retrieved chunk count
- Cache hit or miss
- Latency by intent and route
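A hedged sketch of capturing those fields per request, with a hypothetical per-1K-token price used to derive a cost estimate; the prices, route names, and attribute keys are illustrative:

```python
import mlflow

# Illustrative on-demand prices per 1K tokens; real figures come from billing data.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015

def log_generation_cost(intent: str, route: str, input_tokens: int,
                        output_tokens: int, cache_hit: bool) -> None:
    cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    with mlflow.start_span(name="cost_accounting") as span:
        span.set_attributes({
            "intent": intent,
            "route": route,                  # e.g. "template", "semantic_cache", "bedrock_full"
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cache_hit": cache_hit,
            "estimated_cost_usd": round(cost, 6),
        })

log_generation_cost("clarification", "bedrock_full", 1800, 40, cache_hit=False)
```

Aggregating these attributes by intent and route is what makes cost-per-session comparable across prompt and routing changes.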
Services Involved
- Bedrock
- AppConfig routing flags
- Semantic cache
- CloudWatch cost dashboards
- MLflow experiment reports
How It Helped
- It exposed which intents were overusing Bedrock even when a template or lighter path would work.
- It quantified the impact of prompt compression, retrieval truncation, and cache hit rate.
- It gave product and engineering a shared cost-per-session view tied to quality outcomes.
Example Outcome
By tracing token usage per intent, the team found that low-value clarifications were using the same large prompt wrapper as long-form recommendations. Splitting those routes cut token spend without hurting satisfaction.
Scenario 7: Shadow Testing a New Bedrock Model or Prompt Bundle
Problem
Foundation model behavior can shift even when application code stays fixed. Before promoting a new Bedrock model alias or prompt bundle, the team wants to observe real traffic behavior without exposing users to the candidate.
Where MLflow Is Used
The orchestrator forks requests into:
- Production path: the response shown to the customer
- Candidate path: a shadow run logged to MLflow but not returned to the user
Both runs share the same input, trace tags, and retrieval snapshot so they can be compared directly.
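A hedged sketch of that fork, assuming an AppConfig-style flag already resolved into shadow_enabled and a hypothetical generate helper standing in for the two Bedrock call paths; only the production answer is returned to the user:

```python
import mlflow

def generate(query: str, context: dict, bundle: str) -> str:
    # Hypothetical stand-in for the Bedrock call path selected by the bundle.
    return f"[{bundle}] answer"

def answer_with_shadow(query: str, context: dict, shadow_enabled: bool) -> str:
    # Production path: this is what the customer sees.
    with mlflow.start_span(name="production_generate") as span:
        prod_answer = generate(query, context, bundle="prod")
        span.set_attributes({"bundle": "prod"})

    if shadow_enabled:
        # Candidate path: same input and retrieval snapshot, logged but never returned.
        with mlflow.start_span(name="shadow_generate") as span:
            generate(query, context, bundle="candidate")
            span.set_attributes({"bundle": "candidate", "shadow": True})

    return prod_answer
```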
Services Involved
- Orchestrator
- Bedrock production model
- Bedrock candidate model
- AppConfig experiment flag
- MLflow experiment and trace store
Comparison Metrics
- Response quality deltas
- Guardrail pass rate
- Latency delta
- Token delta
- Escalation proxy metrics
- Retrieval usage consistency
How It Helped
- It let the team compare candidate behavior on live traffic safely.
- It caught regressions that were invisible on a static golden dataset.
- It turned Bedrock upgrades into measured release decisions instead of trust-based upgrades.
Example Outcome
A candidate prompt-model bundle improved long-form FAQ answers but increased competitor mention violations on recommendation flows. Shadow traces surfaced that before production promotion.
Summary: Why MLflow Was Worth It
| Before MLflow | After MLflow |
|---|---|
| Logs were split across services with no shared request narrative | One trace showed the full request path |
| Prompt and model changes were hard to compare objectively | Every candidate had runs, metrics, and artifacts |
| Feedback was disconnected from root cause | Feedback linked back to trace, prompt, model, and retrieval context |
| Rollback required tribal knowledge across systems | Release bundles preserved lineage and rollback context |
| Bedrock cost and quality tradeoffs were hard to quantify | Token, latency, and quality metrics were compared on the same runs |