MLflow Low-Level Implementation Guide for MangaAssist
A concrete implementation guide for the MLflow scenarios in this folder: tracing, prompt evaluation, release lineage, Bedrock integration, feedback correlation, and rollback-safe rollouts.
1. Scope
This guide implements the scenario set from:
- 01-mlflow-deep-dive-scenarios.md
- 02-mlflow-bedrock-and-service-integration.md
The target outcome is an MLflow-backed control plane for:
- Bedrock request tracing
- SageMaker model tracing
- Retriever and guardrail visibility
- Prompt and model evaluation runs
- Release bundle lineage
- Feedback-linked root cause analysis
2. Target Architecture
```mermaid
graph TD
    A[Chat UI] --> B[API Gateway or ALB]
    B --> C[Chatbot Orchestrator]
    C --> D[Trace Middleware]
    D --> E[Intent Client]
    D --> F[Retriever]
    D --> G[Prompt Builder]
    D --> H[Bedrock Adapter]
    D --> I[Guardrails Pipeline]
    D --> J[Feedback Sink]
    E --> SM[SageMaker]
    F --> BR1[Bedrock Embeddings]
    F --> OS[OpenSearch]
    F --> RR[SageMaker Reranker]
    H --> BR2[Bedrock Generation]
    C --> MLSDK[MLflow SDK]
    MLSDK --> MLF[MLflow Tracking Server]
    MLF --> S3[S3 Artifacts]
    MLF --> RDS[RDS Metadata]
    C --> CW[CloudWatch]
    J --> KIN[Kinesis]
```
3. Suggested Code Layout
If you were implementing this in the application codebase, keep MLflow concerns explicit and local to integration points:
```
app/
  observability/
    mlflow_bootstrap.py
    trace_tags.py
    redaction.py
  middleware/
    request_trace_middleware.py
  clients/
    bedrock_client.py
    sagemaker_intent_client.py
    sagemaker_reranker_client.py
    appconfig_client.py
  retrieval/
    retriever.py
    chunk_snapshot.py
  guardrails/
    pipeline.py
  feedback/
    feedback_sink.py
  eval/
    run_golden_dataset.py
    compare_release_bundle.py
  release/
    registry_bundle.py
    promote_bundle.py
```
4. Infrastructure Setup
MLflow Tracking Plane
- Deploy MLflow Tracking Server on ECS Fargate.
- Use RDS PostgreSQL as backend metadata store.
- Use S3 as artifact store.
- Restrict access to internal networks and IAM-authenticated engineering roles.
Required Environment Variables
```
MLFLOW_TRACKING_URI=http://mlflow.internal.mangaassist.local
MLFLOW_EXPERIMENT_NAME=mangaassist-prod
MLFLOW_ENABLE_ASYNC_LOGGING=true
MLFLOW_TRACE_SAMPLING_RATIO=0.10
MLFLOW_ARTIFACT_MAX_PREVIEW_CHARS=500
MANGAASSIST_ENV=prod
MANGAASSIST_REGION=us-east-1
```
Bootstrap Module
```python
import os

import mlflow


def configure_mlflow() -> None:
    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    mlflow.set_experiment(os.environ["MLFLOW_EXPERIMENT_NAME"])
    mlflow.config.enable_async_logging(
        os.getenv("MLFLOW_ENABLE_ASYNC_LOGGING", "true").lower() == "true"
    )
```
5. Request Trace Contract
Every customer-visible response should create exactly one top-level trace.
Top-Level Trace Tags
| Tag | Required | Notes |
|---|---|---|
| `session_id` | Yes | Stable for multi-turn conversations |
| `request_id` | Yes | Unique per inbound request |
| `intent` | Yes | Set after classification |
| `page_type` | Yes | Product, search, category, cart |
| `prompt_version` | Conditional | Required on LLM-backed routes |
| `release_bundle` | Yes | Bundle under evaluation or in production |
| `bedrock_model_id` | Conditional | Required on Bedrock-backed routes |
| `cache_hit` | Yes | `true` or `false` |
| `user_type` | Optional | Guest, signed-in, Prime |
| `experiment_id` | Optional | A/B or shadow experiment reference |
Standard Span Tree
| Parent | Child spans |
|---|---|
| `handle_message` | `load_context`, `intent_classification`, `route_intent`, `retrieve_chunks`, `build_prompt`, `bedrock_generate`, `apply_guardrails`, `persist_response` |
| `retrieve_chunks` | `embed_query`, `vector_search`, `keyword_search`, `rerank_chunks` |
| `apply_guardrails` | `prompt_injection`, `pii_detection`, `grounding_check`, `format_validation` |
6. Middleware Implementation
Create the trace at the boundary, not inside individual business functions.
```python
import mlflow


def handle_chat_request(request):
    with mlflow.start_span(name="handle_message") as span:
        span.set_attributes({
            "request_id": request.request_id,
            "session_id": request.session_id,
            "page_type": request.page_context.page_type,
            "customer_locale": request.page_context.locale,
        })
        # Orchestrate inside the `with` block so downstream spans
        # nest under handle_message instead of landing at the root.
        return orchestrate_chat(request)
```
Immediately after intent classification, update the active trace:
```python
mlflow.update_current_trace(tags={
    "intent": intent.name,
    "release_bundle": release_bundle.version,
    "prompt_version": prompt_config.version if prompt_config else "none",
    "bedrock_model_id": route.model_id if route.uses_bedrock else "none",
})
```
7. Bedrock Adapter Implementation
The Bedrock adapter owns:
- Prompt hashing
- Token and stop-reason logging
- Model ID logging
- Redacted prompt previews
Required Attributes
`provider`, `bedrock_model_id`, `prompt_hash`, `prompt_chars`, `temperature`, `max_tokens`, `input_tokens`, `output_tokens`, `stop_reason`
Implementation Rule
Do not let business logic call boto3 directly. Route all Bedrock calls through one traced adapter so metadata stays consistent.
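A minimal sketch of such an adapter follows. The `invoke_fn` parameter, the response shape, and the default sampling values are assumptions standing in for the real boto3 `bedrock-runtime` call; in the real adapter the body would run inside an MLflow span, as noted in the comment.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> dict:
    """Hash and length attributes logged for every Bedrock call."""
    return {
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }


def bedrock_generate(prompt: str, model_id: str, invoke_fn,
                     temperature: float = 0.2, max_tokens: int = 512) -> dict:
    """Traced adapter sketch; `invoke_fn` is a placeholder for the
    boto3 bedrock-runtime call so business logic never touches boto3."""
    attributes = {
        "provider": "bedrock",
        "bedrock_model_id": model_id,
        "temperature": temperature,
        "max_tokens": max_tokens,
        **prompt_fingerprint(prompt),
    }
    response = invoke_fn(prompt=prompt, model_id=model_id,
                         temperature=temperature, max_tokens=max_tokens)
    attributes.update({
        "input_tokens": response["usage"]["input_tokens"],
        "output_tokens": response["usage"]["output_tokens"],
        "stop_reason": response["stop_reason"],
    })
    # Real version: wrap this body in
    # `with mlflow.start_span(name="bedrock_generate") as span:` and
    # call span.set_attributes(attributes) before returning.
    return {"text": response["text"], "attributes": attributes}
```

Injecting `invoke_fn` also makes the adapter trivially testable with a stubbed Bedrock response.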
8. Retriever Implementation
Retrieval Contract
For every retrieval-backed answer, log:
- Query text hash
- Embedding model name
- Metadata filters used
- Candidate count
- Final chunk IDs
- Final chunk source types
- Reranker model version
Snapshot Artifact
When a request goes through the LLM path, store a small artifact that contains the exact chunk payloads used to build the prompt:
```json
{
  "trace_id": "8d0d8d39c4f24a31b57f2a31b9d7e112",
  "chunk_ids": ["faq_102", "policy_88", "product_123"],
  "source_types": ["faq", "policy", "product_description"],
  "retriever_policy_version": "rag-v5"
}
```
This is critical for replay, prompt debugging, and shadow comparisons.
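A sketch of the snapshot builder; the chunk dict shape here is an assumption, and the persistence call noted in the comment assumes an active MLflow run context.

```python
def build_chunk_snapshot(trace_id: str, chunks: list[dict],
                         policy_version: str) -> dict:
    """Assemble the replay snapshot for the chunks used in the prompt.

    Assumes each chunk dict carries `id` and `source_type` keys.
    """
    snapshot = {
        "trace_id": trace_id,
        "chunk_ids": [c["id"] for c in chunks],
        "source_types": [c["source_type"] for c in chunks],
        "retriever_policy_version": policy_version,
    }
    # To persist inside an active run:
    # mlflow.log_dict(snapshot, f"snapshots/{trace_id}.json")
    return snapshot
```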
9. Guardrails Implementation
Stage API
Use a single interface for every stage:
```python
class GuardrailStageResult:
    passed: bool
    score: float | None
    reason: str | None
    rule_id: str | None
```
Tracing Rule
Each stage runs inside its own child span and logs:
`passed`, `score`, `reason`, `rule_id`, `latency_ms`
If a stage rewrites or blocks content, log:
- `action=redact` or `action=block`
- `fallback_triggered=true`
10. Feedback Correlation
Response Contract
Every response returned to the UI should include:
```json
{
  "response_id": "resp_01HZZX...",
  "trace_id": "8d0d8d39c4f24a31b57f2a31b9d7e112",
  "prompt_version": "recommendation-2.4",
  "release_bundle": "2026.03.24-rc2"
}
```
Feedback Event Schema
```json
{
  "response_id": "resp_01HZZX...",
  "trace_id": "8d0d8d39c4f24a31b57f2a31b9d7e112",
  "feedback": "thumbs_down",
  "timestamp": "2026-03-24T21:05:00Z",
  "reason_code": "wrong_recommendation"
}
```
Processing Rule
- Send feedback to the event stream.
- Update the related MLflow trace with a `user_feedback` tag.
- Materialize daily aggregates by intent, prompt version, release bundle, and Bedrock model ID.
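One way to implement the trace update, assuming a recent MLflow version that exposes `MlflowClient.set_trace_tag`; the validation helper is illustrative and mirrors the event schema above.

```python
REQUIRED_FIELDS = {"response_id", "trace_id", "feedback", "timestamp"}


def validate_feedback_event(event: dict) -> dict:
    """Reject malformed feedback events before they touch the trace store."""
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"feedback event missing fields: {sorted(missing)}")
    return event


def attach_feedback_to_trace(event: dict) -> None:
    """Tag the originating MLflow trace with the user feedback value."""
    from mlflow import MlflowClient  # deferred import; requires mlflow installed

    validate_feedback_event(event)
    client = MlflowClient()
    client.set_trace_tag(event["trace_id"], "user_feedback", event["feedback"])
```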
11. Evaluation Pipeline
Golden Dataset Run
For every prompt or model change:
- Load the golden dataset subset affected by the change.
- Execute the full inference path.
- Log aggregate metrics and per-case outputs to MLflow.
- Fail the release if thresholds are not met.
Minimum Logged Metrics
`intent_accuracy`, `grounded_answer_rate`, `guardrail_pass_rate`, `thumbs_up_proxy_score`, `avg_input_tokens`, `avg_output_tokens`, `p95_latency_ms`
Run Artifacts
- Prompt files
- Evaluation dataset version
- Per-case result table
- Failure examples
- Confusion matrix
- Cost summary
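The aggregate-and-gate step of the golden dataset run can be sketched as follows. The threshold values are hypothetical and should be tuned per intent; only three of the minimum metrics are gated here for brevity.

```python
# Hypothetical release gates; tune per intent and route.
THRESHOLDS = {
    "intent_accuracy": 0.92,
    "grounded_answer_rate": 0.95,
    "guardrail_pass_rate": 0.98,
}


def aggregate_metrics(case_results: list[dict]) -> dict:
    """Fold per-case booleans into the aggregate metrics logged to MLflow."""
    n = len(case_results)
    return {
        "intent_accuracy": sum(r["intent_correct"] for r in case_results) / n,
        "grounded_answer_rate": sum(r["grounded"] for r in case_results) / n,
        "guardrail_pass_rate": sum(r["guardrails_passed"] for r in case_results) / n,
    }


def gate_release(metrics: dict) -> list[str]:
    """Return the metrics that breached their thresholds; empty means pass."""
    # In the runner, log via mlflow.log_metrics(metrics) inside mlflow.start_run().
    return [k for k, t in THRESHOLDS.items() if metrics[k] < t]
```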
12. Registry and Release Bundle Design
Separate Registry Layers
| Registry object | Example |
|---|---|
| Hosted model | manga-intent-classifier |
| Hosted model | manga-reranker |
| Prompt pack | mangaassist-prompts |
| Retriever policy | mangaassist-rag-policy |
| Release bundle | mangaassist-chatbot-prod |
Promotion Rule
Only promote a release bundle if:
- Offline evaluation passes
- Shadow evaluation passes
- Canary thresholds pass
- Required sign-offs exist for prompt and guardrail changes
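The four conditions above can be encoded as a single gate that the promotion hook checks; the check names here are illustrative, not a fixed schema.

```python
def can_promote(bundle_checks: dict) -> bool:
    """All four promotion conditions must be explicitly True."""
    required = (
        "offline_eval_passed",
        "shadow_eval_passed",
        "canary_thresholds_passed",
        "signoffs_complete",
    )
    return all(bundle_checks.get(k) is True for k in required)
```

Using `is True` rather than truthiness means a missing or pending check blocks promotion by default.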
Bundle Writer
```python
def build_release_bundle(intent_version, reranker_version, prompt_version,
                         model_id, rag_version, guardrail_version):
    return {
        "intent_classifier_version": intent_version,
        "reranker_version": reranker_version,
        "prompt_version": prompt_version,
        "bedrock_model_id": model_id,
        "retriever_policy_version": rag_version,
        "guardrail_ruleset_version": guardrail_version,
    }
```
13. Shadow and Canary Rollout
Shadow Mode
- Fork a copy of the request to the candidate release bundle.
- Use the same retrieval snapshot and request metadata.
- Do not return the candidate response to the user.
- Log comparison metrics under the same experiment.
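A sketch of the comparison record logged for each shadow request; the field names are assumptions about what the production and candidate responses carry.

```python
def compare_shadow(prod_response: dict, candidate_response: dict) -> dict:
    """Build the per-request comparison record for a shadow run."""
    return {
        "same_answer": prod_response["text"] == candidate_response["text"],
        "token_delta": candidate_response["output_tokens"] - prod_response["output_tokens"],
        "latency_delta_ms": candidate_response["latency_ms"] - prod_response["latency_ms"],
    }
```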
Canary Mode
- Move from 1 percent to 10 percent to 50 percent to 100 percent.
- Attach `deployment_stage=canary` or `deployment_stage=prod` to every trace.
- Trigger rollback when quality or latency alarms breach their thresholds.
14. Alerting Strategy
Use CloudWatch for fast detection and MLflow for investigation.
Trigger Metrics
`p99_latency_ms`, `error_rate`, `guardrail_block_rate`, `thumbs_down_rate`, `grounding_failure_rate`, `bedrock_cost_per_session`
Investigation Pivot
Every alert payload should include:
- `release_bundle`
- `prompt_version`
- `bedrock_model_id`
- one or more example `trace_id` values
15. Data Governance and Redaction
Required Controls
- Redact or hash PII before sending data to MLflow.
- Keep long raw artifacts in encrypted S3 with clear retention rules.
- Store only prompt or response previews in searchable metadata.
- Restrict registry promotion rights to the release owners.
Recommended Redaction Helper
```python
def redact_text(text: str) -> str:
    # redact_email / redact_phone / redact_order_id are app-defined scrubbers.
    text = redact_email(text)
    text = redact_phone(text)
    text = redact_order_id(text)
    return text
```
Run redaction before `set_inputs()` and `set_outputs()`.
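One possible regex-based implementation of the three scrubbers. All three patterns are hypothetical, and the order-ID format in particular is application-specific; treat these as starting points, not production-grade PII detection.

```python
import re

# Hypothetical patterns; tune against real traffic before relying on them.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
_ORDER = re.compile(r"\border-\d{6,}\b", re.IGNORECASE)


def redact_text(text: str) -> str:
    """Replace emails, phone numbers, and order IDs with stable placeholders."""
    text = _EMAIL.sub("[EMAIL]", text)
    text = _PHONE.sub("[PHONE]", text)
    text = _ORDER.sub("[ORDER_ID]", text)
    return text
```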
16. Scenario-to-Implementation Mapping
| Scenario | Must-have implementation pieces |
|---|---|
| Bedrock debugging | Request middleware, traced Bedrock adapter, retrieval snapshot artifacts |
| Prompt regression | Golden dataset runner, prompt artifact logging, AppConfig version tags |
| Registry and lineage | Release bundle builder, registry metadata, promotion hooks |
| Feedback root cause | Response trace IDs, feedback event schema, warehouse joins |
| Guardrail tuning | Child spans per guardrail stage, verdict fields, fallback tags |
| Cost optimization | Token logging, route tags, cost dashboards |
| Shadow testing | Dual-path execution, shared trace metadata, comparison runs |
17. Rollout Plan
Phase 1
- Deploy MLflow tracking backend.
- Add request middleware and top-level trace creation.
- Instrument Bedrock, intent classification, and retrieval.
Phase 2
- Instrument guardrails and response persistence.
- Add trace IDs to API responses.
- Start feedback correlation.
Phase 3
- Add golden dataset runs for prompts and release bundles.
- Add registry objects and promotion rules.
- Add shadow mode comparison.
Phase 4
- Add canary-aware lineage and rollback hooks.
- Add dashboards and trace-linked alarms.
- Tune redaction, retention, and storage policies.
18. Definition of Done
The implementation is complete when:
- Any bad response can be traced end-to-end in one place.
- Any release can be tied to a reproducible bundle of prompt, model, and retriever versions.
- Any thumbs-down can be joined to its trace within minutes.
- Bedrock latency and token usage can be analyzed by intent and prompt version.
- Guardrail blocks can be inspected stage by stage.
- A candidate bundle can be evaluated offline, in shadow, and in canary before full promotion.