MLflow Low-Level Implementation Guide for MangaAssist
A concrete implementation guide for the MLflow scenarios in this folder: tracing, prompt evaluation, release lineage, Bedrock integration, feedback correlation, and rollback-safe rollouts.
1. Scope
This guide implements the scenario set from:
- 01-mlflow-deep-dive-scenarios.md
- 02-mlflow-bedrock-and-service-integration.md
The target outcome is an MLflow-backed control plane for:
- Bedrock request tracing
- SageMaker model tracing
- Retriever and guardrail visibility
- Prompt and model evaluation runs
- Release bundle lineage
- Feedback-linked root cause analysis
2. Target Architecture
```mermaid
graph TD
    A[Chat UI] --> B[API Gateway or ALB]
    B --> C[Chatbot Orchestrator]
    C --> D[Trace Middleware]
    D --> E[Intent Client]
    D --> F[Retriever]
    D --> G[Prompt Builder]
    D --> H[Bedrock Adapter]
    D --> I[Guardrails Pipeline]
    D --> J[Feedback Sink]
    E --> SM[SageMaker]
    F --> BR1[Bedrock Embeddings]
    F --> OS[OpenSearch]
    F --> RR[SageMaker Reranker]
    H --> BR2[Bedrock Generation]
    C --> MLSDK[MLflow SDK]
    MLSDK --> MLF[MLflow Tracking Server]
    MLF --> S3[S3 Artifacts]
    MLF --> RDS[RDS Metadata]
    C --> CW[CloudWatch]
    J --> KIN[Kinesis]
```
3. Suggested Code Layout
If you were implementing this in the application codebase, keep MLflow concerns explicit and local to integration points:
```
app/
  observability/
    mlflow_bootstrap.py
    trace_tags.py
    redaction.py
  middleware/
    request_trace_middleware.py
  clients/
    bedrock_client.py
    sagemaker_intent_client.py
    sagemaker_reranker_client.py
    appconfig_client.py
  retrieval/
    retriever.py
    chunk_snapshot.py
  guardrails/
    pipeline.py
  feedback/
    feedback_sink.py
  eval/
    run_golden_dataset.py
    compare_release_bundle.py
  release/
    registry_bundle.py
    promote_bundle.py
```
4. Infrastructure Setup
MLflow Tracking Plane
- Deploy MLflow Tracking Server on ECS Fargate.
- Use RDS PostgreSQL as backend metadata store.
- Use S3 as artifact store.
- Restrict access to internal networks and IAM-authenticated engineering roles.
Required Environment Variables
```
MLFLOW_TRACKING_URI=http://mlflow.internal.mangaassist.local
MLFLOW_EXPERIMENT_NAME=mangaassist-prod
MLFLOW_ENABLE_ASYNC_LOGGING=true
MLFLOW_TRACE_SAMPLING_RATIO=0.10
MLFLOW_ARTIFACT_MAX_PREVIEW_CHARS=500
MANGAASSIST_ENV=prod
MANGAASSIST_REGION=us-east-1
```
Bootstrap Module
```python
import os

import mlflow


def configure_mlflow() -> None:
    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    mlflow.set_experiment(os.environ["MLFLOW_EXPERIMENT_NAME"])
    mlflow.config.enable_async_logging(
        os.getenv("MLFLOW_ENABLE_ASYNC_LOGGING", "true").lower() == "true"
    )
```
5. Request Trace Contract
Every customer-visible response should create exactly one top-level trace.
Top-Level Trace Tags
| Tag | Required | Notes |
|---|---|---|
| `session_id` | Yes | Stable for multi-turn conversations |
| `request_id` | Yes | Unique per inbound request |
| `intent` | Yes | Set after classification |
| `page_type` | Yes | Product, search, category, cart |
| `prompt_version` | Conditional | Required on LLM-backed routes |
| `release_bundle` | Yes | Bundle under evaluation or in production |
| `bedrock_model_id` | Conditional | Required on Bedrock-backed routes |
| `cache_hit` | Yes | `true` or `false` |
| `user_type` | Optional | Guest, signed-in, Prime |
| `experiment_id` | Optional | A/B or shadow experiment reference |
Standard Span Tree
| Parent | Child spans |
|---|---|
| `handle_message` | `load_context`, `intent_classification`, `route_intent`, `retrieve_chunks`, `build_prompt`, `bedrock_generate`, `apply_guardrails`, `persist_response` |
| `retrieve_chunks` | `embed_query`, `vector_search`, `keyword_search`, `rerank_chunks` |
| `apply_guardrails` | `prompt_injection`, `pii_detection`, `grounding_check`, `format_validation` |
6. Middleware Implementation
Create the trace at the boundary, not inside individual business functions.
```python
import mlflow


def handle_chat_request(request):
    with mlflow.start_span(name="handle_message") as span:
        span.set_attributes({
            "request_id": request.request_id,
            "session_id": request.session_id,
            "page_type": request.page_context.page_type,
            "customer_locale": request.page_context.locale,
        })
        # Orchestrate inside the `with` block so downstream spans
        # nest under handle_message instead of landing at the root.
        return orchestrate_chat(request)
```
Immediately after intent classification, update the active trace:
```python
mlflow.update_current_trace(tags={
    "intent": intent.name,
    "release_bundle": release_bundle.version,
    "prompt_version": prompt_config.version if prompt_config else "none",
    "bedrock_model_id": route.model_id if route.uses_bedrock else "none",
})
```
7. Bedrock Adapter Implementation
The Bedrock adapter owns:
- Prompt hashing
- Token and stop-reason logging
- Model ID logging
- Redacted prompt previews
Required Attributes
`provider`, `bedrock_model_id`, `prompt_hash`, `prompt_chars`, `temperature`, `max_tokens`, `input_tokens`, `output_tokens`, `stop_reason`
Implementation Rule
Do not let business logic call boto3 directly. Route all Bedrock calls through one traced adapter so metadata stays consistent.
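A minimal sketch of such an adapter follows. The `invoke_fn` parameter, the response shape, and the default sampling values are assumptions standing in for the real boto3 `bedrock-runtime` call; in the real adapter the body would run inside an MLflow span, as noted in the comment.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> dict:
    """Hash and length attributes logged for every Bedrock call."""
    return {
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }


def bedrock_generate(prompt: str, model_id: str, invoke_fn,
                     temperature: float = 0.2, max_tokens: int = 512) -> dict:
    """Traced adapter sketch; `invoke_fn` is a placeholder for the
    boto3 bedrock-runtime call so business logic never touches boto3."""
    attributes = {
        "provider": "bedrock",
        "bedrock_model_id": model_id,
        "temperature": temperature,
        "max_tokens": max_tokens,
        **prompt_fingerprint(prompt),
    }
    response = invoke_fn(prompt=prompt, model_id=model_id,
                         temperature=temperature, max_tokens=max_tokens)
    attributes.update({
        "input_tokens": response["usage"]["input_tokens"],
        "output_tokens": response["usage"]["output_tokens"],
        "stop_reason": response["stop_reason"],
    })
    # Real version: wrap this body in
    # `with mlflow.start_span(name="bedrock_generate") as span:` and
    # call span.set_attributes(attributes) before returning.
    return {"text": response["text"], "attributes": attributes}
```

Injecting `invoke_fn` also makes the adapter trivially testable with a stubbed Bedrock response.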
8. Retriever Implementation
Retrieval Contract
For every retrieval-backed answer, log:
- Query text hash
- Embedding model name
- Metadata filters used
- Candidate count
- Final chunk IDs
- Final chunk source types
- Reranker model version
Snapshot Artifact
When a request goes through the LLM path, store a small artifact that contains the exact chunk payloads used to build the prompt:
```json
{
  "trace_id": "8d0d8d39c4f24a31b57f2a31b9d7e112",
  "chunk_ids": ["faq_102", "policy_88", "product_123"],
  "source_types": ["faq", "policy", "product_description"],
  "retriever_policy_version": "rag-v5"
}
```
This is critical for replay, prompt debugging, and shadow comparisons.
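A sketch of the snapshot builder; the chunk dict shape here is an assumption, and the persistence call noted in the comment assumes an active MLflow run context.

```python
def build_chunk_snapshot(trace_id: str, chunks: list[dict],
                         policy_version: str) -> dict:
    """Assemble the replay snapshot for the chunks used in the prompt.

    Assumes each chunk dict carries `id` and `source_type` keys.
    """
    snapshot = {
        "trace_id": trace_id,
        "chunk_ids": [c["id"] for c in chunks],
        "source_types": [c["source_type"] for c in chunks],
        "retriever_policy_version": policy_version,
    }
    # To persist inside an active run:
    # mlflow.log_dict(snapshot, f"snapshots/{trace_id}.json")
    return snapshot
```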
9. Guardrails Implementation
Stage API
Use a single interface for every stage:
```python
class GuardrailStageResult:
    passed: bool
    score: float | None
    reason: str | None
    rule_id: str | None
```
Tracing Rule
Each stage runs inside its own child span and logs:
`passed`, `score`, `reason`, `rule_id`, `latency_ms`
If a stage rewrites or blocks content, log:
- `action=redact` or `action=block`
- `fallback_triggered=true`
10. Feedback Correlation
Response Contract
Every response returned to the UI should include:
```json
{
  "response_id": "resp_01HZZX...",
  "trace_id": "8d0d8d39c4f24a31b57f2a31b9d7e112",
  "prompt_version": "recommendation-2.4",
  "release_bundle": "2026.03.24-rc2"
}
```
Feedback Event Schema
```json
{
  "response_id": "resp_01HZZX...",
  "trace_id": "8d0d8d39c4f24a31b57f2a31b9d7e112",
  "feedback": "thumbs_down",
  "timestamp": "2026-03-24T21:05:00Z",
  "reason_code": "wrong_recommendation"
}
```
Processing Rule
- Send feedback to the event stream.
- Update the related MLflow trace with a `user_feedback` tag.
- Materialize daily aggregates by intent, prompt version, release bundle, and Bedrock model ID.
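One way to implement the trace update, assuming a recent MLflow version that exposes `MlflowClient.set_trace_tag`; the validation helper is illustrative and mirrors the event schema above.

```python
REQUIRED_FIELDS = {"response_id", "trace_id", "feedback", "timestamp"}


def validate_feedback_event(event: dict) -> dict:
    """Reject malformed feedback events before they touch the trace store."""
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"feedback event missing fields: {sorted(missing)}")
    return event


def attach_feedback_to_trace(event: dict) -> None:
    """Tag the originating MLflow trace with the user feedback value."""
    from mlflow import MlflowClient  # deferred import; requires mlflow installed

    validate_feedback_event(event)
    client = MlflowClient()
    client.set_trace_tag(event["trace_id"], "user_feedback", event["feedback"])
```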
11. Evaluation Pipeline
Golden Dataset Run
For every prompt or model change:
- Load the golden dataset subset affected by the change.
- Execute the full inference path.
- Log aggregate metrics and per-case outputs to MLflow.
- Fail the release if thresholds are not met.
Minimum Logged Metrics
`intent_accuracy`, `grounded_answer_rate`, `guardrail_pass_rate`, `thumbs_up_proxy_score`, `avg_input_tokens`, `avg_output_tokens`, `p95_latency_ms`
Run Artifacts
- Prompt files
- Evaluation dataset version
- Per-case result table
- Failure examples
- Confusion matrix
- Cost summary
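The aggregate-and-gate step of the golden dataset run can be sketched as follows. The threshold values are hypothetical and should be tuned per intent; only three of the minimum metrics are gated here for brevity.

```python
# Hypothetical release gates; tune per intent and route.
THRESHOLDS = {
    "intent_accuracy": 0.92,
    "grounded_answer_rate": 0.95,
    "guardrail_pass_rate": 0.98,
}


def aggregate_metrics(case_results: list[dict]) -> dict:
    """Fold per-case booleans into the aggregate metrics logged to MLflow."""
    n = len(case_results)
    return {
        "intent_accuracy": sum(r["intent_correct"] for r in case_results) / n,
        "grounded_answer_rate": sum(r["grounded"] for r in case_results) / n,
        "guardrail_pass_rate": sum(r["guardrails_passed"] for r in case_results) / n,
    }


def gate_release(metrics: dict) -> list[str]:
    """Return the metrics that breached their thresholds; empty means pass."""
    # In the runner, log via mlflow.log_metrics(metrics) inside mlflow.start_run().
    return [k for k, t in THRESHOLDS.items() if metrics[k] < t]
```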
12. Registry and Release Bundle Design
Separate Registry Layers
| Registry object | Example |
|---|---|
| Hosted model | manga-intent-classifier |
| Hosted model | manga-reranker |
| Prompt pack | mangaassist-prompts |
| Retriever policy | mangaassist-rag-policy |
| Release bundle | mangaassist-chatbot-prod |
Promotion Rule
Only promote a release bundle if:
- Offline evaluation passes
- Shadow evaluation passes
- Canary thresholds pass
- Required sign-offs exist for prompt and guardrail changes
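The four conditions above can be encoded as a single gate that the promotion hook checks; the check names here are illustrative, not a fixed schema.

```python
def can_promote(bundle_checks: dict) -> bool:
    """All four promotion conditions must be explicitly True."""
    required = (
        "offline_eval_passed",
        "shadow_eval_passed",
        "canary_thresholds_passed",
        "signoffs_complete",
    )
    return all(bundle_checks.get(k) is True for k in required)
```

Using `is True` rather than truthiness means a missing or pending check blocks promotion by default.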
Bundle Writer
```python
def build_release_bundle(intent_version, reranker_version, prompt_version,
                         model_id, rag_version, guardrail_version):
    return {
        "intent_classifier_version": intent_version,
        "reranker_version": reranker_version,
        "prompt_version": prompt_version,
        "bedrock_model_id": model_id,
        "retriever_policy_version": rag_version,
        "guardrail_ruleset_version": guardrail_version,
    }
```
13. Shadow and Canary Rollout
Shadow Mode
- Fork a copy of the request to the candidate release bundle.
- Use the same retrieval snapshot and request metadata.
- Do not return the candidate response to the user.
- Log comparison metrics under the same experiment.
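A sketch of the comparison record logged for each shadow request; the field names are assumptions about what the production and candidate responses carry.

```python
def compare_shadow(prod_response: dict, candidate_response: dict) -> dict:
    """Build the per-request comparison record for a shadow run."""
    return {
        "same_answer": prod_response["text"] == candidate_response["text"],
        "token_delta": candidate_response["output_tokens"] - prod_response["output_tokens"],
        "latency_delta_ms": candidate_response["latency_ms"] - prod_response["latency_ms"],
    }
```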
Canary Mode
- Move from 1 percent to 10 percent to 50 percent to 100 percent.
- Attach `deployment_stage=canary` or `deployment_stage=prod` to every trace.
- Trigger rollback when quality or latency alarms breach their thresholds.
14. Alerting Strategy
Use CloudWatch for fast detection and MLflow for investigation.
Trigger Metrics
`p99_latency_ms`, `error_rate`, `guardrail_block_rate`, `thumbs_down_rate`, `grounding_failure_rate`, `bedrock_cost_per_session`
Investigation Pivot
Every alert payload should include:
- `release_bundle`
- `prompt_version`
- `bedrock_model_id`
- one or more example `trace_id` values
15. Data Governance and Redaction
Required Controls
- Redact or hash PII before sending data to MLflow.
- Keep long raw artifacts in encrypted S3 with clear retention rules.
- Store only prompt or response previews in searchable metadata.
- Restrict registry promotion rights to the release owners.
Recommended Redaction Helper
```python
def redact_text(text: str) -> str:
    # redact_email / redact_phone / redact_order_id are app-defined scrubbers.
    text = redact_email(text)
    text = redact_phone(text)
    text = redact_order_id(text)
    return text
```
Run redaction before `set_inputs()` and `set_outputs()`.
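One possible regex-based implementation of the three scrubbers. All three patterns are hypothetical, and the order-ID format in particular is application-specific; treat these as starting points, not production-grade PII detection.

```python
import re

# Hypothetical patterns; tune against real traffic before relying on them.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
_ORDER = re.compile(r"\border-\d{6,}\b", re.IGNORECASE)


def redact_text(text: str) -> str:
    """Replace emails, phone numbers, and order IDs with stable placeholders."""
    text = _EMAIL.sub("[EMAIL]", text)
    text = _PHONE.sub("[PHONE]", text)
    text = _ORDER.sub("[ORDER_ID]", text)
    return text
```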
16. Scenario-to-Implementation Mapping
| Scenario | Must-have implementation pieces |
|---|---|
| Bedrock debugging | Request middleware, traced Bedrock adapter, retrieval snapshot artifacts |
| Prompt regression | Golden dataset runner, prompt artifact logging, AppConfig version tags |
| Registry and lineage | Release bundle builder, registry metadata, promotion hooks |
| Feedback root cause | Response trace IDs, feedback event schema, warehouse joins |
| Guardrail tuning | Child spans per guardrail stage, verdict fields, fallback tags |
| Cost optimization | Token logging, route tags, cost dashboards |
| Shadow testing | Dual-path execution, shared trace metadata, comparison runs |
17. Rollout Plan
Phase 1
- Deploy MLflow tracking backend.
- Add request middleware and top-level trace creation.
- Instrument Bedrock, intent classification, and retrieval.
Phase 2
- Instrument guardrails and response persistence.
- Add trace IDs to API responses.
- Start feedback correlation.
Phase 3
- Add golden dataset runs for prompts and release bundles.
- Add registry objects and promotion rules.
- Add shadow mode comparison.
Phase 4
- Add canary-aware lineage and rollback hooks.
- Add dashboards and trace-linked alarms.
- Tune redaction, retention, and storage policies.
18. Definition of Done
The implementation is complete when:
- Any bad response can be traced end-to-end in one place.
- Any release can be tied to a reproducible bundle of prompt, model, and retriever versions.
- Any thumbs-down can be joined to its trace within minutes.
- Bedrock latency and token usage can be analyzed by intent and prompt version.
- Guardrail blocks can be inspected stage by stage.
- A candidate bundle can be evaluated offline, in shadow, and in canary before full promotion.