
MLflow Low-Level Implementation Guide for MangaAssist

A concrete implementation guide for the MLflow scenarios in this folder: tracing, prompt evaluation, release lineage, Bedrock integration, feedback correlation, and rollback-safe rollouts.

1. Scope

This guide implements the scenario set from:

  • 01-mlflow-deep-dive-scenarios.md
  • 02-mlflow-bedrock-and-service-integration.md

The target outcome is an MLflow-backed control plane for:

  • Bedrock request tracing
  • SageMaker model tracing
  • Retriever and guardrail visibility
  • Prompt and model evaluation runs
  • Release bundle lineage
  • Feedback-linked root cause analysis

2. Target Architecture

The request path and observability plane, in Mermaid notation:

graph TD
    A[Chat UI] --> B[API Gateway or ALB]
    B --> C[Chatbot Orchestrator]
    C --> D[Trace Middleware]
    D --> E[Intent Client]
    D --> F[Retriever]
    D --> G[Prompt Builder]
    D --> H[Bedrock Adapter]
    D --> I[Guardrails Pipeline]
    D --> J[Feedback Sink]

    E --> SM[SageMaker]
    F --> BR1[Bedrock Embeddings]
    F --> OS[OpenSearch]
    F --> RR[SageMaker Reranker]
    H --> BR2[Bedrock Generation]

    C --> MLSDK[MLflow SDK]
    MLSDK --> MLF[MLflow Tracking Server]
    MLF --> S3[S3 Artifacts]
    MLF --> RDS[RDS Metadata]
    C --> CW[CloudWatch]
    J --> KIN[Kinesis]

3. Suggested Code Layout

When implementing this in the application codebase, keep MLflow concerns explicit and local to the integration points:

app/
  observability/
    mlflow_bootstrap.py
    trace_tags.py
    redaction.py
  middleware/
    request_trace_middleware.py
  clients/
    bedrock_client.py
    sagemaker_intent_client.py
    sagemaker_reranker_client.py
    appconfig_client.py
  retrieval/
    retriever.py
    chunk_snapshot.py
  guardrails/
    pipeline.py
  feedback/
    feedback_sink.py
  eval/
    run_golden_dataset.py
    compare_release_bundle.py
  release/
    registry_bundle.py
    promote_bundle.py

4. Infrastructure Setup

MLflow Tracking Plane

  • Deploy MLflow Tracking Server on ECS Fargate.
  • Use RDS PostgreSQL as backend metadata store.
  • Use S3 as artifact store.
  • Restrict access to internal networks and IAM-authenticated engineering roles.

Required Environment Variables

MLFLOW_TRACKING_URI=http://mlflow.internal.mangaassist.local
MLFLOW_EXPERIMENT_NAME=mangaassist-prod
MLFLOW_ENABLE_ASYNC_LOGGING=true
MLFLOW_TRACE_SAMPLING_RATIO=0.10
MLFLOW_ARTIFACT_MAX_PREVIEW_CHARS=500
MANGAASSIST_ENV=prod
MANGAASSIST_REGION=us-east-1

Bootstrap Module

import os

import mlflow


def configure_mlflow() -> None:
    # Point every service instance at the shared tracking server and experiment.
    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    mlflow.set_experiment(os.environ["MLFLOW_EXPERIMENT_NAME"])
    # Async logging keeps trace export off the request hot path.
    mlflow.config.enable_async_logging(
        os.getenv("MLFLOW_ENABLE_ASYNC_LOGGING", "true").lower() == "true"
    )

5. Request Trace Contract

Every customer-visible response should create exactly one top-level trace.

Top-Level Trace Tags

Tag               Required     Notes
session_id        Yes          Stable for multi-turn conversations
request_id        Yes          Unique per inbound request
intent            Yes          Set after classification
page_type         Yes          Product, search, category, or cart
prompt_version    Conditional  Required on LLM-backed routes
release_bundle    Yes          Bundle under evaluation or in production
bedrock_model_id  Conditional  Required on Bedrock-backed routes
cache_hit         Yes          true or false
user_type         Optional     Guest, signed-in, or Prime
experiment_id     Optional     A/B or shadow experiment reference

Standard Span Tree

Parent            Child spans
handle_message    load_context, intent_classification, route_intent, retrieve_chunks, build_prompt, bedrock_generate, apply_guardrails, persist_response
retrieve_chunks   embed_query, vector_search, keyword_search, rerank_chunks
apply_guardrails  prompt_injection, pii_detection, grounding_check, format_validation

6. Middleware Implementation

Create the trace at the boundary, not inside individual business functions.

import mlflow


def handle_chat_request(request):
    # One top-level span per customer-visible response; everything else
    # nests under it as child spans.
    with mlflow.start_span(name="handle_message") as span:
        span.set_attributes({
            "request_id": request.request_id,
            "session_id": request.session_id,
            "page_type": request.page_context.page_type,
            "customer_locale": request.page_context.locale,
        })
        return orchestrate_chat(request)

Immediately after intent classification, update the active trace:

mlflow.update_current_trace(tags={
    "intent": intent.name,
    "release_bundle": release_bundle.version,
    "prompt_version": prompt_config.version if prompt_config else "none",
    "bedrock_model_id": route.model_id if route.uses_bedrock else "none",
})

7. Bedrock Adapter Implementation

The Bedrock adapter owns:

  • Prompt hashing
  • Token and stop-reason logging
  • Model ID logging
  • Redacted prompt previews

Required Attributes

  • provider
  • bedrock_model_id
  • prompt_hash
  • prompt_chars
  • temperature
  • max_tokens
  • input_tokens
  • output_tokens
  • stop_reason

Implementation Rule

Do not let business logic call boto3 directly. Route all Bedrock calls through one traced adapter so metadata stays consistent.
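
A minimal adapter sketch under these rules, assuming an Anthropic-style message body on bedrock-runtime; the request shape, defaults, and function name are illustrative rather than prescribed:

import hashlib
import json

import boto3
import mlflow

_bedrock = boto3.client("bedrock-runtime")


def generate(prompt: str, model_id: str, temperature: float = 0.2, max_tokens: int = 512) -> str:
    with mlflow.start_span(name="bedrock_generate") as span:
        span.set_attributes({
            "provider": "bedrock",
            "bedrock_model_id": model_id,
            "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
            "prompt_chars": len(prompt),
            "temperature": temperature,
            "max_tokens": max_tokens,
        })
        # Anthropic-style body; adjust for the model family in use.
        # Redact before storing any prompt preview (see section 15).
        response = _bedrock.invoke_model(
            modelId=model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "temperature": temperature,
                "messages": [{"role": "user", "content": prompt}],
            }),
        )
        payload = json.loads(response["body"].read())
        usage = payload.get("usage", {})
        span.set_attributes({
            "input_tokens": usage.get("input_tokens"),
            "output_tokens": usage.get("output_tokens"),
            "stop_reason": payload.get("stop_reason"),
        })
        return payload["content"][0]["text"]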

8. Retriever Implementation

Retrieval Contract

For every retrieval-backed answer, log:

  • Query text hash
  • Embedding model name
  • Metadata filters used
  • Candidate count
  • Final chunk IDs
  • Final chunk source types
  • Reranker model version
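
As a sketch, the retrieve_chunks span can carry these contract fields as attributes. The chunk objects with chunk_id and source_type fields, the helper name, and the embedding model name are assumptions:

import hashlib
import json

import mlflow


def log_retrieval_attributes(span, query: str, filters: dict, candidates, final_chunks, reranker_version: str) -> None:
    span.set_attributes({
        "query_hash": hashlib.sha256(query.encode()).hexdigest(),
        "embedding_model": "amazon.titan-embed-text-v2:0",  # assumed embedding model
        "metadata_filters": json.dumps(filters, sort_keys=True),
        "candidate_count": len(candidates),
        "final_chunk_ids": ",".join(c.chunk_id for c in final_chunks),
        "final_source_types": ",".join(sorted({c.source_type for c in final_chunks})),
        "reranker_version": reranker_version,
    })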

Snapshot Artifact

When a request goes through the LLM path, store a small artifact that contains the exact chunk payloads used to build the prompt:

{
  "trace_id": "8d0d8d39c4f24a31b57f2a31b9d7e112",
  "chunk_ids": ["faq_102", "policy_88", "product_123"],
  "source_types": ["faq", "policy", "product_description"],
  "retriever_policy_version": "rag-v5"
}

This is critical for replay, prompt debugging, and shadow comparisons.
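
A sketch of the snapshot writer; mlflow.log_dict assumes an active run to attach the artifact to, and the chunk object fields are assumptions:

import mlflow


def log_chunk_snapshot(trace_id: str, chunks, retriever_policy_version: str) -> None:
    snapshot = {
        "trace_id": trace_id,
        "chunk_ids": [c.chunk_id for c in chunks],
        "source_types": [c.source_type for c in chunks],
        "retriever_policy_version": retriever_policy_version,
    }
    # One small JSON artifact per request, keyed by trace ID for replay.
    mlflow.log_dict(snapshot, f"chunk_snapshots/{trace_id}.json")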

9. Guardrails Implementation

Stage API

Use a single interface for every stage:

from dataclasses import dataclass


@dataclass
class GuardrailStageResult:
    passed: bool
    score: float | None
    reason: str | None
    rule_id: str | None

Tracing Rule

Each stage runs inside its own child span and logs:

  • passed
  • score
  • reason
  • rule_id
  • latency_ms

If a stage rewrites or blocks content, log:

  • action=redact
  • action=block
  • fallback_triggered=true
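
A minimal stage-runner sketch under these rules; the GuardrailStage interface with a name attribute and an evaluate() method is an assumption:

import time

import mlflow


def run_guardrails(stages, text: str) -> GuardrailStageResult | None:
    for stage in stages:
        with mlflow.start_span(name=stage.name) as span:
            started = time.monotonic()
            result = stage.evaluate(text)
            span.set_attributes({
                "passed": result.passed,
                "score": result.score,
                "reason": result.reason,
                "rule_id": result.rule_id,
                "latency_ms": int((time.monotonic() - started) * 1000),
            })
            if not result.passed:
                # A blocking stage short-circuits the pipeline and triggers the fallback path.
                span.set_attributes({"action": "block", "fallback_triggered": True})
                return result
    return None  # all stages passed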

10. Feedback Correlation

Response Contract

Every response returned to the UI should include:

{
  "response_id": "resp_01HZZX...",
  "trace_id": "8d0d8d39c4f24a31b57f2a31b9d7e112",
  "prompt_version": "recommendation-2.4",
  "release_bundle": "2026.03.24-rc2"
}
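
One way to populate trace_id, sketched below, is to read it from the active span. Note that the identifier attribute has changed name across MLflow versions (request_id in 2.x, trace_id in 3.x); trace_id is assumed here, and the payload builder is a hypothetical helper:

import mlflow


def build_response_payload(response_id: str, answer: str, prompt_version: str, bundle: str) -> dict:
    span = mlflow.get_current_active_span()
    return {
        "response_id": response_id,
        "answer": answer,
        "trace_id": span.trace_id if span else None,
        "prompt_version": prompt_version,
        "release_bundle": bundle,
    }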

Feedback Event Schema

{
  "response_id": "resp_01HZZX...",
  "trace_id": "8d0d8d39c4f24a31b57f2a31b9d7e112",
  "feedback": "thumbs_down",
  "timestamp": "2026-03-24T21:05:00Z",
  "reason_code": "wrong_recommendation"
}

Processing Rule

  • Send feedback to the event stream.
  • Update the related MLflow trace with user_feedback.
  • Materialize daily aggregates by intent, prompt version, release bundle, and Bedrock model ID.
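
A consumer sketch for the second rule; it assumes the event matches the schema above and that MlflowClient.set_trace_tag can address the trace by the ID carried in the event:

from mlflow import MlflowClient

client = MlflowClient()


def process_feedback_event(event: dict) -> None:
    # Tag the originating trace so negative feedback is directly queryable.
    client.set_trace_tag(event["trace_id"], "user_feedback", event["feedback"])
    if event.get("reason_code"):
        client.set_trace_tag(event["trace_id"], "feedback_reason", event["reason_code"])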

11. Evaluation Pipeline

Golden Dataset Run

For every prompt or model change:

  1. Load the golden dataset subset affected by the change.
  2. Execute the full inference path.
  3. Log aggregate metrics and per-case outputs to MLflow.
  4. Fail the release if thresholds are not met.
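
A runner sketch for steps 1 through 4; run_inference, score_cases, and the threshold shape are assumed helpers, not prescribed APIs:

import mlflow


def run_golden_dataset(cases, thresholds: dict) -> None:
    with mlflow.start_run(run_name="golden-dataset-eval"):
        results = [run_inference(case) for case in cases]  # full inference path
        metrics = score_cases(cases, results)  # e.g. {"intent_accuracy": 0.97, ...}
        mlflow.log_metrics(metrics)
        failures = {k: t for k, t in thresholds.items() if metrics.get(k, 0.0) < t}
        if failures:
            raise SystemExit(f"Release gate failed: {failures}")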

Minimum Logged Metrics

  • intent_accuracy
  • grounded_answer_rate
  • guardrail_pass_rate
  • thumbs_up_proxy_score
  • avg_input_tokens
  • avg_output_tokens
  • p95_latency_ms

Run Artifacts

  • Prompt files
  • Evaluation dataset version
  • Per-case result table
  • Failure examples
  • Confusion matrix
  • Cost summary

12. Registry and Release Bundle Design

Separate Registry Layers

Registry object    Example
Hosted model       manga-intent-classifier
Hosted model       manga-reranker
Prompt pack        mangaassist-prompts
Retriever policy   mangaassist-rag-policy
Release bundle     mangaassist-chatbot-prod

Promotion Rule

Only promote a release bundle if:

  • Offline evaluation passes
  • Shadow evaluation passes
  • Canary thresholds pass
  • Required sign-offs exist for prompt and guardrail changes

Bundle Writer

def build_release_bundle(
    intent_version,
    reranker_version,
    prompt_version,
    model_id,
    rag_version,
    guardrail_version,
):
    # One record that pins every moving part of a release.
    return {
        "intent_classifier_version": intent_version,
        "reranker_version": reranker_version,
        "prompt_version": prompt_version,
        "bedrock_model_id": model_id,
        "retriever_policy_version": rag_version,
        "guardrail_ruleset_version": guardrail_version,
    }
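
To make the bundle addressable from traces and eval runs, one option, sketched here, is to log it as a tagged run artifact; the run-name convention is an assumption:

import mlflow


def log_release_bundle(bundle: dict, bundle_version: str) -> None:
    with mlflow.start_run(run_name=f"release-bundle-{bundle_version}"):
        mlflow.set_tag("release_bundle", bundle_version)
        # The immutable lineage record every trace and eval run can point at.
        mlflow.log_dict(bundle, "release_bundle.json")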

13. Shadow and Canary Rollout

Shadow Mode

  • Fork a copy of the request to the candidate release bundle.
  • Use the same retrieval snapshot and request metadata.
  • Do not return the candidate response to the user.
  • Log comparison metrics under the same experiment.
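
A shadow-fork sketch under these rules; run_candidate and the bundle_version key are assumptions:

import mlflow


def run_shadow(request, snapshot, candidate_bundle: dict) -> None:
    with mlflow.start_span(name="shadow_handle_message") as span:
        span.set_attributes({
            "deployment_stage": "shadow",
            "release_bundle": candidate_bundle["bundle_version"],
            "request_id": request.request_id,
        })
        # Same retrieval snapshot, same metadata; the output is logged, never served.
        candidate_response = run_candidate(request, snapshot, candidate_bundle)
        span.set_outputs({"response_preview": candidate_response[:500]})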

Canary Mode

  • Ramp traffic in stages: 1 percent, then 10 percent, 50 percent, and 100 percent.
  • Attach deployment_stage=canary or deployment_stage=prod to every trace.
  • Trigger rollback when quality or latency alarms fire.

14. Alerting Strategy

Use CloudWatch for fast detection and MLflow for investigation.

Trigger Metrics

  • p99_latency_ms
  • error_rate
  • guardrail_block_rate
  • thumbs_down_rate
  • grounding_failure_rate
  • bedrock_cost_per_session

Investigation Pivot

Every alert payload should include:

  • release_bundle
  • prompt_version
  • bedrock_model_id
  • one or more example trace_id values
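
As a sketch of the fast-detection side, a blocked guardrail can publish a dimensioned CloudWatch metric so alarms carry the pivot fields; the namespace and dimension names are assumptions:

import boto3

cloudwatch = boto3.client("cloudwatch")


def emit_guardrail_block(release_bundle: str, prompt_version: str, model_id: str) -> None:
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Chatbot",
        MetricData=[{
            "MetricName": "GuardrailBlocks",
            "Value": 1.0,
            "Unit": "Count",
            "Dimensions": [
                {"Name": "release_bundle", "Value": release_bundle},
                {"Name": "prompt_version", "Value": prompt_version},
                {"Name": "bedrock_model_id", "Value": model_id},
            ],
        }],
    )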

15. Data Governance and Redaction

Required Controls

  • Redact or hash PII before sending data to MLflow.
  • Keep long raw artifacts in encrypted S3 with clear retention rules.
  • Store only prompt or response previews in searchable metadata.
  • Restrict registry promotion rights to the release owners.

Redaction Helper

def redact_text(text: str) -> str:
    # The concrete redactors live in app/observability/redaction.py.
    text = redact_email(text)
    text = redact_phone(text)
    text = redact_order_id(text)
    return text

Run redaction before calling span.set_inputs() and span.set_outputs().

16. Scenario-to-Implementation Mapping

Scenario              Must-have implementation pieces
Bedrock debugging     Request middleware, traced Bedrock adapter, retrieval snapshot artifacts
Prompt regression     Golden dataset runner, prompt artifact logging, AppConfig version tags
Registry and lineage  Release bundle builder, registry metadata, promotion hooks
Feedback root cause   Response trace IDs, feedback event schema, warehouse joins
Guardrail tuning      Child spans per guardrail stage, verdict fields, fallback tags
Cost optimization     Token logging, route tags, cost dashboards
Shadow testing        Dual-path execution, shared trace metadata, comparison runs

17. Rollout Plan

Phase 1

  • Deploy MLflow tracking backend.
  • Add request middleware and top-level trace creation.
  • Instrument Bedrock, intent classification, and retrieval.

Phase 2

  • Instrument guardrails and response persistence.
  • Add trace IDs to API responses.
  • Start feedback correlation.

Phase 3

  • Add golden dataset runs for prompts and release bundles.
  • Add registry objects and promotion rules.
  • Add shadow mode comparison.

Phase 4

  • Add canary-aware lineage and rollback hooks.
  • Add dashboards and trace-linked alarms.
  • Tune redaction, retention, and storage policies.

18. Definition of Done

The implementation is complete when:

  • Any bad response can be traced end-to-end in one place.
  • Any release can be tied to a reproducible bundle of prompt, model, and retriever versions.
  • Any thumbs-down can be joined to its trace within minutes.
  • Bedrock latency and token usage can be analyzed by intent and prompt version.
  • Guardrail blocks can be inspected stage by stage.
  • A candidate bundle can be evaluated offline, in shadow, and in canary before full promotion.