
MLflow Deep-Dive Scenarios for MangaAssist

Where MLflow is used in the chatbot, how it helped, and why it became part of the operating model instead of just another dashboard.

How MLflow Fits This Chatbot

MangaAssist is not a single model call. A meaningful response can touch:

  • The orchestrator on ECS Fargate
  • A SageMaker-hosted intent classifier
  • Titan embeddings plus OpenSearch retrieval
  • A SageMaker reranker
  • Bedrock for response generation
  • A custom guardrails pipeline
  • AppConfig for prompt and experiment control

MLflow gives that multi-service path one shared control plane:

  • Tracing: step-level visibility for intent, retrieval, prompt build, Bedrock generation, guardrails, and fallbacks
  • Experiments: prompt comparisons, retriever tuning, golden dataset runs, and shadow evaluations
  • Registry: versioning for the intent classifier, reranker, embedding config, prompt bundle, and release bundle
  • Artifacts: prompt files, evaluation reports, retrieved chunk snapshots, confusion matrices, and run summaries
  • Tags and lineage: traceable linkage between request, prompt version, model bundle, deployment stage, and user feedback
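
Wiring into that control plane is small. A minimal setup sketch, assuming a self-hosted tracking server; the URI and experiment name below are placeholders:

import mlflow

# Every MangaAssist service points at the same tracking server so traces,
# runs, and registry entries land in one place. Placeholder values.
mlflow.set_tracking_uri("http://mlflow.internal.example:5000")
mlflow.set_experiment("mangaassist-chatbot")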

Scenario 1: Debugging a Bad Bedrock Response

Problem

A customer says the recommendation answer was slow and also off-target. Without request-level tracing, the on-call engineer has to check CloudWatch, OpenSearch logs, SageMaker logs, and application logs separately.

Where MLflow Is Used

MLflow traces the full request as one parent trace with child spans for:

  • load_context
  • intent_classification
  • retrieve_chunks
  • rerank_chunks
  • build_prompt
  • bedrock_generate
  • apply_guardrails
  • save_response

Services Involved

  • ECS Fargate orchestrator
  • SageMaker intent classifier
  • Bedrock Claude generation
  • Bedrock Titan embeddings
  • OpenSearch Serverless
  • SageMaker reranker
  • Guardrails service

What We Capture

Key attributes by span:

  • intent_classification: intent, confidence, classifier_version, classification_path
  • retrieve_chunks: query_embedding_model, candidate_count, top_chunk_ids, source_types
  • rerank_chunks: reranker_version, input_count, output_count, latency_ms
  • build_prompt: prompt_version, system_prompt_hash, retrieved_chunk_count, prompt_chars
  • bedrock_generate: bedrock_model_id, temperature, max_tokens, input_tokens, output_tokens, stop_reason
  • apply_guardrails: guardrail_stage, verdict, rule_id, fallback_triggered
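
A sketch of how those spans and attributes could be emitted, assuming the MLflow 2.14+ tracing API (mlflow.trace and mlflow.start_span); the pipeline helpers and attribute values are hypothetical stand-ins for the real stages:

import mlflow

# Hypothetical stand-ins for the real pipeline stages.
def classify_intent(query):
    return "recommendation", 0.93

def retrieve_chunks(query):
    return [{"id": f"chunk-{i}"} for i in range(20)]

def generate_answer(query, chunks):
    return "answer text", {"input_tokens": 850, "output_tokens": 210}

@mlflow.trace(name="handle_chat_request")  # one parent trace per request
def handle_chat_request(user_query):
    with mlflow.start_span(name="intent_classification") as span:
        intent, confidence = classify_intent(user_query)
        span.set_attributes({"intent": intent, "confidence": confidence,
                             "classifier_version": "12"})

    with mlflow.start_span(name="retrieve_chunks") as span:
        chunks = retrieve_chunks(user_query)
        span.set_attributes({"candidate_count": len(chunks),
                             "top_chunk_ids": [c["id"] for c in chunks[:5]]})

    with mlflow.start_span(name="bedrock_generate") as span:
        answer, usage = generate_answer(user_query, chunks)
        span.set_attributes({"bedrock_model_id": "anthropic.claude-3-5-sonnet",
                             "input_tokens": usage["input_tokens"],
                             "output_tokens": usage["output_tokens"]})
    return answer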

How It Helped

  • It separated Bedrock latency from retrieval latency instead of treating the whole request as one opaque number.
  • It exposed cases where the reranker was processing too many documents before the Bedrock call even started.
  • It let engineers replay the exact prompt version and retrieved context that led to the wrong answer.
  • It reduced time-to-root-cause from a cross-service investigation to a single trace lookup.

Example Outcome

One trace made it obvious that only 280 ms was spent in Bedrock while 390 ms was spent reranking 50 chunks instead of 20. The fix was not a model swap. It was a retrieval configuration correction.


Scenario 2: Preventing Prompt Regressions Before a Bedrock Rollout

Problem

A small prompt edit can improve one intent and silently degrade another. MangaAssist uses different prompt shapes for recommendation, FAQ, and product Q&A flows, so changes must be evaluated before they go live.

Where MLflow Is Used

Every prompt candidate is run as an MLflow experiment:

  • Prompt files are logged as artifacts.
  • Golden dataset metrics are logged as run metrics.
  • Bedrock model ID, prompt version, retriever version, and guardrail version are logged as tags.
  • Comparison reports are stored as artifacts and linked to the PR or change request.
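
In code, one candidate evaluation could look like the sketch below; the tag values, file paths, and metric numbers are illustrative:

import mlflow

mlflow.set_experiment("prompt-eval-recommendation")

with mlflow.start_run(run_name="recommendation-2.5-candidate"):
    # Versioned context for the candidate, logged as tags.
    mlflow.set_tags({
        "bedrock_model_id": "anthropic.claude-3-5-sonnet",
        "prompt_version": "recommendation-2.5",
        "retriever_version": "rag-v5",
        "guardrail_version": "gr-17",
    })
    # Prompt file and comparison report as artifacts (hypothetical paths).
    mlflow.log_artifact("prompts/recommendation-2.5.txt")
    mlflow.log_artifact("reports/golden_set_comparison.html")
    # Golden dataset metrics for this candidate (illustrative numbers).
    mlflow.log_metrics({
        "intent_accuracy": 0.94,
        "guardrail_pass_rate": 0.988,
        "grounding_failure_count": 3,
    })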

Services Involved

  • GitHub or code review system
  • CodeBuild or CI runner
  • Bedrock for offline response generation
  • AppConfig for active prompt selection
  • S3 for evaluation dataset and reports

What We Compare

  • Intent accuracy on prompt-sensitive flows
  • BERTScore or rubric score against references
  • Guardrail pass rate
  • Response length distribution
  • Hallucination or grounding failure count
  • Bedrock token usage and latency

How It Helped

  • Prompt changes stopped being subjective because each candidate had a run ID and a reproducible report.
  • Teams could compare prompt variants even when the underlying Bedrock model stayed the same.
  • AppConfig promotion was gated on an MLflow run that met the release thresholds.
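
That gate can be a few lines of glue. A sketch of the promotion check, assuming hypothetical threshold values and a run ID handed over by the CI job:

from mlflow.tracking import MlflowClient

# Hypothetical release thresholds for the recommendation flow.
THRESHOLDS = {"intent_accuracy": 0.93, "guardrail_pass_rate": 0.985}

def passes_release_gate(run_id):
    """True only if the candidate run meets every threshold."""
    metrics = MlflowClient().get_run(run_id).data.metrics
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in THRESHOLDS.items())

if passes_release_gate("candidate-run-id"):  # run ID comes from CI
    print("Safe to promote the prompt via AppConfig")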

Example Outcome

A concise recommendation prompt looked better in ad hoc testing but dropped grounded product mention accuracy on the golden set. MLflow made that visible before the prompt reached production traffic.


Scenario 3: Registry and Release Lineage Across SageMaker and Bedrock

Problem

A single MangaAssist response depends on more than one versioned asset:

  • Intent classifier version
  • Reranker version
  • Embedding model choice and chunk policy
  • Prompt version
  • Bedrock model alias
  • Guardrail ruleset version

If those move independently, rollback and audit become error-prone.

Where MLflow Is Used

MLflow Registry stores individual assets and also a release bundle that points to the exact combination promoted together.

Services Involved

  • MLflow tracking and registry server
  • S3 artifact store
  • RDS metadata store
  • SageMaker for hosted models
  • AppConfig for active prompt and routing configuration
  • EventBridge for promotion events

Bundle Metadata

{
  "release_bundle": "mangaassist-chatbot-prod",
  "version": "2026.03.24-rc2",
  "intent_classifier_version": "12",
  "reranker_version": "8",
  "retriever_policy_version": "rag-v5",
  "prompt_version": "recommendation-2.4",
  "bedrock_model_id": "anthropic.claude-3-5-sonnet",
  "guardrail_ruleset_version": "gr-17"
}
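
A sketch of how that bundle could be recorded: the JSON above is logged as a run artifact and mirrored into tags so one search answers "what shipped together?". This is one workable layout, not the only one:

import mlflow

release_bundle = {
    "release_bundle": "mangaassist-chatbot-prod",
    "version": "2026.03.24-rc2",
    "intent_classifier_version": "12",
    "reranker_version": "8",
    "retriever_policy_version": "rag-v5",
    "prompt_version": "recommendation-2.4",
    "bedrock_model_id": "anthropic.claude-3-5-sonnet",
    "guardrail_ruleset_version": "gr-17",
}

with mlflow.start_run(run_name=release_bundle["version"]) as run:
    # The bundle JSON is the artifact of record for rollback and audit.
    mlflow.log_dict(release_bundle, "release_bundle.json")
    # Tags make every component version searchable from one lookup.
    mlflow.set_tags(release_bundle)
    print(f"release bundle run: {run.info.run_id}")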

How It Helped

  • Rollback became a bundle-level action instead of a scramble across multiple systems.
  • On-call could answer "what produced this response?" with one registry lookup.
  • Shadow and canary evaluations could compare release bundles instead of isolated components.

Example Outcome

When a Bedrock prompt update and reranker refresh shipped in the same window, MLflow kept the lineage intact and prevented a partial rollback that would have mixed incompatible versions.


Scenario 4: Explaining Thumbs-Down Feedback with Trace Correlation

Problem

User feedback is only useful if it can be tied back to what actually happened in the pipeline. "Bad answer" is not enough to decide whether the issue came from retrieval, prompting, generation, or guardrails.

Where MLflow Is Used

Each chatbot response carries a trace_id and response_id. Feedback events attach to those IDs and flow into analytics pipelines that join back to the MLflow trace data.

Services Involved

  • Frontend feedback widget
  • Orchestrator response metadata
  • MLflow trace store
  • Kinesis or event stream
  • Redshift or analytics warehouse

What We Join

  • Feedback value: thumbs up, thumbs down, escalation
  • Intent and route chosen
  • Prompt version and Bedrock model ID
  • Retrieved chunk IDs and source types
  • Guardrail decisions
  • Final response latency and token cost
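
A minimal sketch of the join itself, with illustrative rows standing in for the feedback stream and the exported trace metadata (the export mechanism is assumed here):

import pandas as pd

# Feedback events from the frontend widget (illustrative rows).
feedback = pd.DataFrame([
    {"trace_id": "tr-101", "feedback": "thumbs_down"},
    {"trace_id": "tr-102", "feedback": "thumbs_up"},
])

# Trace metadata exported from MLflow; each row carries the tags
# and attributes logged on the trace (export step assumed).
traces = pd.DataFrame([
    {"trace_id": "tr-101", "intent": "promotion",
     "prompt_version": "faq-1.7", "guardrail_verdict": "pass"},
    {"trace_id": "tr-102", "intent": "recommendation",
     "prompt_version": "recommendation-2.4", "guardrail_verdict": "pass"},
])

# Join feedback to pipeline context, then slice by prompt version to see
# which version drives dissatisfaction for each intent.
joined = feedback.merge(traces, on="trace_id")
down_rate = (joined.assign(is_down=joined["feedback"] == "thumbs_down")
                   .groupby(["intent", "prompt_version"])["is_down"].mean())
print(down_rate)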

How It Helped

  • It separated poor retrieval from poor writing tone.
  • It identified which prompt version caused a spike in dissatisfaction for recommendation queries.
  • It created labeled examples for future retraining and adversarial testing.

Example Outcome

An increase in thumbs-down on promotion questions looked like a generation problem at first. Trace correlation showed the real issue was stale promo chunks in the retrieval layer, not Bedrock output quality.


Scenario 5: Tuning the Guardrails Pipeline Without Flying Blind

Problem

Guardrails that are too loose create safety risk. Guardrails that are too aggressive block good answers and hurt conversion. Aggregate block counts do not explain which stage is overfiring.

Where MLflow Is Used

Each guardrail stage is its own child span under apply_guardrails, with explicit inputs, outputs, verdict, and latency.

Services Involved

  • Bedrock generation
  • Guardrails pipeline
  • CloudWatch alarms
  • Security review workflows

Stage-Level Visibility

Logged fields by guardrail stage:

  • Prompt injection detection: prompt_injection_score, decision, pattern_id
  • PII detection: entity_count, entity_types, redaction_applied
  • Hallucination check: grounding_score, unsupported_claim_count, fallback_triggered
  • Format validation: schema_passed, missing_fields
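
A sketch of the stage-level instrumentation, again assuming the MLflow 2.14+ tracing API; the detector calls and scores are placeholders for the real guardrail logic:

import time
import mlflow

def apply_guardrails(prompt, answer):
    with mlflow.start_span(name="apply_guardrails"):
        with mlflow.start_span(name="prompt_injection_detection") as span:
            start = time.monotonic()
            score = 0.04  # placeholder for the real detector score
            span.set_attributes({
                "prompt_injection_score": score,
                "decision": "allow" if score < 0.5 else "block",
                "latency_ms": round((time.monotonic() - start) * 1000, 2),
            })
        with mlflow.start_span(name="hallucination_check") as span:
            grounding = 0.82  # placeholder grounding score
            span.set_attributes({
                "grounding_score": grounding,
                "fallback_triggered": grounding < 0.6,
            })
    return answer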

How It Helped

  • It showed whether the issue was false positives, latency overhead, or missing detections.
  • It made safety tuning measurable instead of anecdotal.
  • It allowed reviewers to inspect blocked traces with enough context to improve rules safely.

Example Outcome

The hallucination check flagged too many recommendation answers because it expected direct policy-style grounding. Trace analysis showed it needed intent-aware thresholds instead of one global threshold.


Scenario 6: Cost and Routing Optimization for Bedrock Usage

Problem

MangaAssist mixes cheap deterministic flows with expensive generative flows. If every ambiguous query goes to the largest model with long prompts, token cost rises quickly.

Where MLflow Is Used

MLflow traces and experiments log:

  • Model used
  • Input and output tokens
  • Prompt length
  • Retrieved chunk count
  • Cache hit or miss
  • Latency by intent and route
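
Those fields make cost-per-intent a simple aggregation. A sketch over span data, with hypothetical per-1K-token prices (real Bedrock pricing varies by model and region):

from collections import defaultdict

# Illustrative per-1K-token prices; substitute real Bedrock rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

# Token counts as they would be read off the bedrock_generate spans.
spans = [
    {"intent": "clarification", "input_tokens": 2100, "output_tokens": 60},
    {"intent": "recommendation", "input_tokens": 2300, "output_tokens": 540},
]

cost_by_intent = defaultdict(float)
for s in spans:
    cost_by_intent[s["intent"]] += (
        s["input_tokens"] / 1000 * PRICE_PER_1K["input"]
        + s["output_tokens"] / 1000 * PRICE_PER_1K["output"]
    )

for intent, cost in cost_by_intent.items():
    print(f"{intent}: ${cost:.4f} per request")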

Services Involved

  • Bedrock
  • AppConfig routing flags
  • Semantic cache
  • CloudWatch cost dashboards
  • MLflow experiment reports

How It Helped

  • It exposed which intents were overusing Bedrock even when a template or lighter path would work.
  • It quantified the impact of prompt compression, retrieval truncation, and cache hit rate.
  • It gave product and engineering a shared cost-per-session view tied to quality outcomes.

Example Outcome

By tracing token usage per intent, the team found that low-value clarifications were using the same large prompt wrapper as long-form recommendations. Splitting those routes cut token spend without hurting satisfaction.


Scenario 7: Shadow Testing a New Bedrock Model or Prompt Bundle

Problem

Foundation model behavior can shift even when application code stays fixed. Before promoting a new Bedrock model alias or prompt bundle, the team wants to observe real traffic behavior without exposing users to the candidate.

Where MLflow Is Used

The orchestrator forks requests into:

  • Production path: the response shown to the customer
  • Candidate path: a shadow run logged to MLflow but not returned to the user

Both runs share the same input, trace tags, and retrieval snapshot so they can be compared directly.
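
A sketch of the fork, with a stubbed Bedrock call and a background thread so the candidate adds no user-facing latency; the span names, model aliases, and threading approach are all illustrative:

import threading
import mlflow

def generate(model_id, prompt):
    return f"[{model_id}] answer"  # stand-in for the real Bedrock call

def shadow_run(model_id, prompt, trace_tags):
    # Candidate path: logged to MLflow, never returned to the user.
    with mlflow.start_span(name="shadow_generate") as span:
        answer = generate(model_id, prompt)
        span.set_attributes({**trace_tags, "path": "candidate",
                             "model_id": model_id,
                             "answer_chars": len(answer)})

def handle_request(prompt, trace_tags):
    # Production path: this is what the customer sees.
    with mlflow.start_span(name="production_generate") as span:
        answer = generate("anthropic.claude-3-5-sonnet", prompt)
        span.set_attributes({**trace_tags, "path": "production"})
    # Fork the candidate off the request thread.
    threading.Thread(target=shadow_run,
                     args=("candidate-model-alias", prompt, trace_tags),
                     daemon=True).start()
    return answer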

Services Involved

  • Orchestrator
  • Bedrock production model
  • Bedrock candidate model
  • AppConfig experiment flag
  • MLflow experiment and trace store

Comparison Metrics

  • Response quality deltas
  • Guardrail pass rate
  • Latency delta
  • Token delta
  • Escalation proxy metrics
  • Retrieval usage consistency

How It Helped

  • It let the team compare candidate behavior on live traffic safely.
  • It caught regressions that were invisible on a static golden dataset.
  • It turned Bedrock upgrades into measured release decisions instead of trust-based rollouts.

Example Outcome

A candidate prompt-model bundle improved long-form FAQ answers but increased competitor mention violations on recommendation flows. Shadow traces surfaced that before production promotion.


Summary: Why MLflow Was Worth It

Before MLflow → after MLflow:

  • Logs were split across services with no shared request narrative → one trace shows the full request path
  • Prompt and model changes were hard to compare objectively → every candidate has runs, metrics, and artifacts
  • Feedback was disconnected from root cause → feedback links back to trace, prompt, model, and retrieval context
  • Rollback required tribal knowledge across systems → release bundles preserve lineage and rollback context
  • Bedrock cost and quality tradeoffs were hard to quantify → token, latency, and quality metrics are compared on the same runs