MLflow Deep-Dive Scenarios for MangaAssist
Where MLflow is used in the chatbot, how it helped, and why it became part of the operating model instead of just another dashboard.
How MLflow Fits This Chatbot
MangaAssist is not a single model call. A meaningful response can touch:
- The orchestrator on ECS Fargate
- A SageMaker-hosted intent classifier
- Titan embeddings plus OpenSearch retrieval
- A SageMaker reranker
- Bedrock for response generation
- A custom guardrails pipeline
- AppConfig for prompt and experiment control
MLflow gives that multi-service path one shared control plane:
| MLflow capability | What it covers in MangaAssist |
|---|---|
| Tracing | Step-level visibility for intent, retrieval, prompt build, Bedrock generation, guardrails, and fallbacks |
| Experiments | Prompt comparisons, retriever tuning, golden dataset runs, and shadow evaluations |
| Registry | Versioning for intent classifier, reranker, embedding config, prompt bundle, and release bundle |
| Artifacts | Prompt files, evaluation reports, retrieved chunk snapshots, confusion matrices, and run summaries |
| Tags and lineage | Traceable linkage between request, prompt version, model bundle, deployment stage, and user feedback |
Scenario 1: Debugging a Bad Bedrock Response
Problem
A customer says the recommendation answer was slow and also off-target. Without request-level tracing, the on-call engineer has to check CloudWatch, OpenSearch logs, SageMaker logs, and application logs separately.
Where MLflow Is Used
MLflow traces the full request as one parent trace with child spans (see the sketch after this list):
- load_context
- intent_classification
- retrieve_chunks
- rerank_chunks
- build_prompt
- bedrock_generate
- apply_guardrails
- save_response
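A minimal instrumentation sketch of that parent/child structure, assuming MLflow's fluent tracing API (mlflow >= 2.14); the helper functions, experiment name, and attribute values are hypothetical stand-ins for the real SageMaker, OpenSearch, and Bedrock calls:

```python
import mlflow

# Hypothetical stand-ins for the real service calls.
def classify_intent(query): return {"label": "recommendation", "confidence": 0.93, "version": "12"}
def retrieve_chunks(query): return [{"id": "chunk-1"}, {"id": "chunk-2"}]
def generate_answer(intent, chunks): return "..."

mlflow.set_experiment("mangaassist-chatbot")  # assumed experiment name

@mlflow.trace(name="handle_request")          # parent trace for the whole request
def handle_request(user_query: str) -> str:
    with mlflow.start_span(name="intent_classification") as span:
        intent = classify_intent(user_query)
        span.set_attributes({"intent": intent["label"],
                             "confidence": intent["confidence"],
                             "classifier_version": intent["version"]})

    with mlflow.start_span(name="retrieve_chunks") as span:
        chunks = retrieve_chunks(user_query)
        span.set_attributes({"candidate_count": len(chunks),
                             "top_chunk_ids": [c["id"] for c in chunks]})

    with mlflow.start_span(name="bedrock_generate") as span:
        answer = generate_answer(intent, chunks)
        span.set_attributes({"bedrock_model_id": "anthropic.claude-3-5-sonnet"})

    return answer

handle_request("any slice-of-life series like Yotsuba&!?")
```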
Services Involved
- ECS Fargate orchestrator
- SageMaker intent classifier
- Bedrock Claude generation
- Bedrock Titan embeddings
- OpenSearch Serverless
- SageMaker reranker
- Guardrails service
What We Capture
| Span | Key attributes |
|---|---|
| intent_classification | intent, confidence, classifier_version, classification_path |
| retrieve_chunks | query_embedding_model, candidate_count, top_chunk_ids, source_types |
| rerank_chunks | reranker_version, input_count, output_count, latency_ms |
| build_prompt | prompt_version, system_prompt_hash, retrieved_chunk_count, prompt_chars |
| bedrock_generate | bedrock_model_id, temperature, max_tokens, input_tokens, output_tokens, stop_reason |
| apply_guardrails | guardrail_stage, verdict, rule_id, fallback_triggered |
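As one concrete case, the bedrock_generate row above could be filled straight from a Bedrock Converse response; a hedged sketch assuming boto3's bedrock-runtime client, with the model alias reused from the release bundle naming in this document:

```python
import boto3
import mlflow

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet"  # alias used in this document; the exact Bedrock model ID may differ

def bedrock_generate(prompt: str, temperature: float = 0.2, max_tokens: int = 1024) -> str:
    with mlflow.start_span(name="bedrock_generate") as span:
        response = bedrock.converse(
            modelId=MODEL_ID,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"temperature": temperature, "maxTokens": max_tokens},
        )
        usage = response["usage"]
        # Capture the same fields listed in the table above.
        span.set_attributes({
            "bedrock_model_id": MODEL_ID,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "input_tokens": usage["inputTokens"],
            "output_tokens": usage["outputTokens"],
            "stop_reason": response["stopReason"],
        })
        return response["output"]["message"]["content"][0]["text"]
```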
How It Helped
- It separated Bedrock latency from retrieval latency instead of treating the whole request as one opaque number.
- It exposed cases where the reranker was processing too many documents before the Bedrock call even started.
- It let engineers replay the exact prompt version and retrieved context that led to the wrong answer.
- It reduced time-to-root-cause from a cross-service investigation to a single trace lookup.
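That "single trace lookup" can be as simple as filtering on a tag the orchestrator sets per request; a sketch assuming the deployed MLflow version exposes mlflow.search_traces and that a request_id tag (hypothetical) is set on every trace:

```python
import mlflow

mlflow.set_experiment("mangaassist-chatbot")  # assumed experiment name

# One row back: timings, status, and the full span tree for that request.
traces = mlflow.search_traces(
    filter_string="tags.request_id = 'req-7f3a'",  # hypothetical request ID from the support ticket
    max_results=1,
)
print(traces)
```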
Example Outcome
One trace made it obvious that only 280 ms was spent in Bedrock while 390 ms was spent reranking 50 chunks instead of 20. The fix was not a model swap. It was a retrieval configuration correction.
Scenario 2: Preventing Prompt Regressions Before a Bedrock Rollout
Problem
A small prompt edit can improve one intent and silently degrade another. MangaAssist uses different prompt shapes for recommendation, FAQ, and product Q&A flows, so changes must be evaluated before they go live.
Where MLflow Is Used
Every prompt candidate is run as an MLflow experiment:
- Prompt files are logged as artifacts.
- Golden dataset metrics are logged as run metrics.
- Bedrock model ID, prompt version, retriever version, and guardrail version are logged as tags.
- Comparison reports are stored as artifacts and linked to the PR or change request.
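A hedged sketch of what logging one prompt candidate as a run could look like; the experiment name, tag keys, metric names, and values are illustrative placeholders, not the team's actual schema:

```python
from pathlib import Path
import mlflow

mlflow.set_experiment("prompt-candidates-recommendation")  # assumed experiment name

# Illustrative prompt file; in the real pipeline this comes from the repo.
prompt_path = Path("recommendation_v2.5.txt")
prompt_path.write_text("You are MangaAssist. Recommend titles grounded in the retrieved catalog chunks.")

with mlflow.start_run(run_name="recommendation-2.5-rc1"):
    # The prompt under evaluation is stored as an artifact for exact replay.
    mlflow.log_artifact(str(prompt_path), artifact_path="prompt")

    # Versions of everything else in the path go on tags, so runs are comparable.
    mlflow.set_tags({
        "bedrock_model_id": "anthropic.claude-3-5-sonnet",
        "prompt_version": "recommendation-2.5-rc1",
        "retriever_version": "rag-v5",
        "guardrail_ruleset_version": "gr-17",
        "change_request": "PR-0000",  # hypothetical PR reference
    })

    # Golden dataset results from the (hypothetical) offline evaluator; placeholder values.
    mlflow.log_metrics({
        "intent_accuracy": 0.94,
        "guardrail_pass_rate": 0.991,
        "grounding_failures": 3,
        "avg_output_tokens": 412,
    })

    # Comparison summary attached for reviewers.
    mlflow.log_dict({"candidate": "recommendation-2.5-rc1", "baseline": "recommendation-2.4"},
                    "evaluation/comparison_summary.json")
```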
Services Involved
- GitHub or code review system
- CodeBuild or CI runner
- Bedrock for offline response generation
- AppConfig for active prompt selection
- S3 for evaluation dataset and reports
What We Compare
- Intent accuracy on prompt-sensitive flows
- BERTScore or rubric score against references
- Guardrail pass rate
- Response length distribution
- Hallucination or grounding failure count
- Bedrock token usage and latency
How It Helped
- Prompt changes stopped being subjective because each candidate had a run ID and a reproducible report.
- Teams could compare prompt variants even when the underlying Bedrock model stayed the same.
- AppConfig promotion was gated on an MLflow run that met the release thresholds.
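One way such a gate could be expressed in CI, assuming the run from the sketch above; the run ID, metric names, and thresholds are placeholders:

```python
import mlflow

CANDIDATE_RUN_ID = "abc123"  # hypothetical run ID passed in from the CI job

run = mlflow.get_run(CANDIDATE_RUN_ID)
metrics = run.data.metrics

# Release thresholds (placeholder values) the candidate must meet before
# the AppConfig prompt promotion proceeds.
THRESHOLDS = {
    "intent_accuracy": 0.92,
    "guardrail_pass_rate": 0.99,
}

failures = {name: metrics.get(name)
            for name, floor in THRESHOLDS.items()
            if metrics.get(name, 0.0) < floor}

if failures:
    raise SystemExit(f"Promotion blocked, thresholds not met: {failures}")
print("Thresholds met, safe to promote the prompt via AppConfig.")
```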
Example Outcome
A concise recommendation prompt looked better in ad hoc testing but dropped grounded product mention accuracy on the golden set. MLflow made that visible before the prompt reached production traffic.
Scenario 3: Registry and Release Lineage Across SageMaker and Bedrock
Problem
A single MangaAssist response depends on more than one versioned asset:
- Intent classifier version
- Reranker version
- Embedding model choice and chunk policy
- Prompt version
- Bedrock model alias
- Guardrail ruleset version
If those move independently, rollback and audit become error-prone.
Where MLflow Is Used
MLflow Registry stores individual assets and also a release bundle that points to the exact combination promoted together.
Services Involved
- MLflow tracking and registry server
- S3 artifact store
- RDS metadata store
- SageMaker for hosted models
- AppConfig for active prompt and routing configuration
- EventBridge for promotion events
Bundle Metadata
```json
{
  "release_bundle": "mangaassist-chatbot-prod",
  "version": "2026.03.24-rc2",
  "intent_classifier_version": "12",
  "reranker_version": "8",
  "retriever_policy_version": "rag-v5",
  "prompt_version": "recommendation-2.4",
  "bedrock_model_id": "anthropic.claude-3-5-sonnet",
  "guardrail_ruleset_version": "gr-17"
}
```
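One possible way to record that bundle, sketched under the assumption that the bundle lives as a tagged run artifact and that the component registered-model names shown here are hypothetical:

```python
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# The exact combination being promoted together (values mirror the JSON above).
bundle = {
    "release_bundle": "mangaassist-chatbot-prod",
    "version": "2026.03.24-rc2",
    "intent_classifier_version": "12",
    "reranker_version": "8",
    "retriever_policy_version": "rag-v5",
    "prompt_version": "recommendation-2.4",
    "bedrock_model_id": "anthropic.claude-3-5-sonnet",
    "guardrail_ruleset_version": "gr-17",
}

with mlflow.start_run(run_name=bundle["version"]):
    mlflow.log_dict(bundle, "release_bundle.json")  # auditable artifact for rollback
    mlflow.set_tags(bundle)                         # searchable lineage tags

# Cross-link the component model versions back to the bundle they shipped in.
client.set_model_version_tag("mangaassist-intent-classifier",
                             bundle["intent_classifier_version"],
                             "release_bundle_version", bundle["version"])
client.set_model_version_tag("mangaassist-reranker",
                             bundle["reranker_version"],
                             "release_bundle_version", bundle["version"])
```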
How It Helped
- Rollback became a bundle-level action instead of a scramble across multiple systems.
- On-call could answer "what produced this response?" with one registry lookup.
- Shadow and canary evaluations could compare release bundles instead of isolated components.
Example Outcome
When a Bedrock prompt update and reranker refresh shipped in the same window, MLflow kept the lineage intact and prevented a partial rollback that would have mixed incompatible versions.
Scenario 4: Explaining Thumbs-Down Feedback with Trace Correlation
Problem
User feedback is only useful if it can be tied back to what actually happened in the pipeline. "Bad answer" is not enough to decide whether the issue came from retrieval, prompting, generation, or guardrails.
Where MLflow Is Used
Each chatbot response carries a trace_id and response_id. Feedback events attach to those IDs and are written back to MLflow-linked analytics pipelines.
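A hedged sketch of the feedback write path, assuming a Kinesis stream and that the orchestrator already surfaced the MLflow trace ID in its response metadata; the stream and field names are illustrative:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def record_feedback(trace_id: str, response_id: str, verdict: str) -> None:
    """Attach a thumbs-up / thumbs-down / escalation event to the trace that produced it."""
    event = {
        "trace_id": trace_id,      # same ID MLflow stored for the request trace
        "response_id": response_id,
        "feedback": verdict,       # "thumbs_up", "thumbs_down", or "escalation"
    }
    kinesis.put_record(
        StreamName="chatbot-feedback",            # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=response_id,
    )
```

Downstream, the analytics pipeline joins these events to the trace metadata on trace_id, which is what enables the per-response breakdown listed below.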
Services Involved
- Frontend feedback widget
- Orchestrator response metadata
- MLflow trace store
- Kinesis or event stream
- Redshift or analytics warehouse
What We Join
- Feedback value: thumbs up, thumbs down, escalation
- Intent and route chosen
- Prompt version and Bedrock model ID
- Retrieved chunk IDs and source types
- Guardrail decisions
- Final response latency and token cost
How It Helped
- It separated poor retrieval from poor writing tone.
- It identified which prompt version caused a spike in dissatisfaction for recommendation queries.
- It created labeled examples for future retraining and adversarial testing.
Example Outcome
An increase in thumbs-down on promotion questions looked like a generation problem at first. Trace correlation showed the real issue was stale promo chunks in the retrieval layer, not Bedrock output quality.
Scenario 5: Tuning the Guardrails Pipeline Without Flying Blind
Problem
Guardrails that are too loose create safety risk. Guardrails that are too aggressive block good answers and hurt conversion. Aggregate block counts do not explain which stage is overfiring.
Where MLflow Is Used
Each guardrail stage is its own child span under apply_guardrails, with explicit inputs, outputs, verdict, and latency.
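A minimal sketch of those stage-level spans, assuming the same fluent tracing API as earlier; the checker functions and the grounding threshold are hypothetical placeholders:

```python
import mlflow

# Hypothetical stage checkers; each returns a verdict plus stage-specific details.
def check_prompt_injection(text): return {"score": 0.02, "decision": "pass"}
def check_grounding(text, chunks): return {"grounding_score": 0.81, "unsupported_claims": 0}

def apply_guardrails(response_text: str, chunks: list) -> str:
    with mlflow.start_span(name="apply_guardrails") as parent:
        with mlflow.start_span(name="prompt_injection_detection") as span:
            result = check_prompt_injection(response_text)
            span.set_attributes({"prompt_injection_score": result["score"],
                                 "decision": result["decision"]})

        with mlflow.start_span(name="hallucination_check") as span:
            result = check_grounding(response_text, chunks)
            fallback = result["grounding_score"] < 0.5  # placeholder threshold
            span.set_attributes({"grounding_score": result["grounding_score"],
                                 "unsupported_claim_count": result["unsupported_claims"],
                                 "fallback_triggered": fallback})

        parent.set_attributes({"verdict": "fallback" if fallback else "pass"})
        return response_text
```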
Services Involved
- Bedrock generation
- Guardrails pipeline
- CloudWatch alarms
- Security review workflows
Stage-Level Visibility
| Guardrail stage | Logged fields |
|---|---|
| Prompt injection detection | prompt_injection_score, decision, pattern_id |
| PII detection | entity_count, entity_types, redaction_applied |
| Hallucination check | grounding_score, unsupported_claim_count, fallback_triggered |
| Format validation | schema_passed, missing_fields |
How It Helped
- It showed whether the issue was false positives, latency overhead, or missing detections.
- It made safety tuning measurable instead of anecdotal.
- It allowed reviewers to inspect blocked traces with enough context to improve rules safely.
Example Outcome
The hallucination check flagged too many recommendation answers because it expected direct policy-style grounding. Trace analysis showed it needed intent-aware thresholds instead of one global threshold.
Scenario 6: Cost and Routing Optimization for Bedrock Usage
Problem
MangaAssist mixes cheap deterministic flows with expensive generative flows. If every ambiguous query goes to the largest model with long prompts, token cost rises quickly.
Where MLflow Is Used
MLflow traces and experiments log:
- Model used
- Input and output tokens
- Prompt length
- Retrieved chunk count
- Cache hit or miss
- Latency by intent and route
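A hedged sketch of capturing those fields per request, with a hypothetical per-1K-token price used to derive a cost estimate; the prices, route names, and attribute keys are illustrative:

```python
import mlflow

# Illustrative on-demand prices per 1K tokens; real figures come from billing data.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015

def log_generation_cost(intent: str, route: str, input_tokens: int,
                        output_tokens: int, cache_hit: bool) -> None:
    cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    with mlflow.start_span(name="cost_accounting") as span:
        span.set_attributes({
            "intent": intent,
            "route": route,                  # e.g. "template", "semantic_cache", "bedrock_full"
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cache_hit": cache_hit,
            "estimated_cost_usd": round(cost, 6),
        })

log_generation_cost("clarification", "bedrock_full", 1800, 40, cache_hit=False)
```

Aggregating these attributes by intent and route is what makes cost-per-session comparable across prompt and routing changes.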
Services Involved
- Bedrock
- AppConfig routing flags
- Semantic cache
- CloudWatch cost dashboards
- MLflow experiment reports
How It Helped
- It exposed which intents were overusing Bedrock even when a template or lighter path would work.
- It quantified the impact of prompt compression, retrieval truncation, and cache hit rate.
- It gave product and engineering a shared cost-per-session view tied to quality outcomes.
Example Outcome
By tracing token usage per intent, the team found that low-value clarifications were using the same large prompt wrapper as long-form recommendations. Splitting those routes cut token spend without hurting satisfaction.
Scenario 7: Shadow Testing a New Bedrock Model or Prompt Bundle
Problem
Foundation model behavior can shift even when application code stays fixed. Before promoting a new Bedrock model alias or prompt bundle, the team wants to observe real traffic behavior without exposing users to the candidate.
Where MLflow Is Used
The orchestrator forks requests into:
- Production path: the response shown to the customer
- Candidate path: a shadow run logged to MLflow but not returned to the user
Both runs share the same input, trace tags, and retrieval snapshot so they can be compared directly.
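A hedged sketch of that fork, assuming an AppConfig-style flag already resolved into shadow_enabled and a hypothetical generate helper standing in for the two Bedrock call paths; only the production answer is returned to the user:

```python
import mlflow

def generate(query: str, context: dict, bundle: str) -> str:
    # Hypothetical stand-in for the Bedrock call path selected by the bundle.
    return f"[{bundle}] answer"

def answer_with_shadow(query: str, context: dict, shadow_enabled: bool) -> str:
    # Production path: this is what the customer sees.
    with mlflow.start_span(name="production_generate") as span:
        prod_answer = generate(query, context, bundle="prod")
        span.set_attributes({"bundle": "prod"})

    if shadow_enabled:
        # Candidate path: same input and retrieval snapshot, logged but never returned.
        with mlflow.start_span(name="shadow_generate") as span:
            generate(query, context, bundle="candidate")
            span.set_attributes({"bundle": "candidate", "shadow": True})

    return prod_answer
```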
Services Involved
- Orchestrator
- Bedrock production model
- Bedrock candidate model
- AppConfig experiment flag
- MLflow experiment and trace store
Comparison Metrics
- Response quality deltas
- Guardrail pass rate
- Latency delta
- Token delta
- Escalation proxy metrics
- Retrieval usage consistency
How It Helped
- It let the team compare candidate behavior on live traffic safely.
- It caught regressions that were invisible on a static golden dataset.
- It turned Bedrock upgrades into measured release decisions instead of trust-based upgrades.
Example Outcome
A candidate prompt-model bundle improved long-form FAQ answers but increased competitor mention violations on recommendation flows. Shadow traces surfaced that before production promotion.
Summary: Why MLflow Was Worth It
| Before MLflow | After MLflow |
|---|---|
| Logs were split across services with no shared request narrative | One trace showed the full request path |
| Prompt and model changes were hard to compare objectively | Every candidate had runs, metrics, and artifacts |
| Feedback was disconnected from root cause | Feedback linked back to trace, prompt, model, and retrieval context |
| Rollback required tribal knowledge across systems | Release bundles preserved lineage and rollback context |
| Bedrock cost and quality tradeoffs were hard to quantify | Token, latency, and quality metrics were compared on the same runs |