
MLflow, Bedrock, and Service Integration

How MLflow is wired into the AWS services behind MangaAssist so Bedrock generation, SageMaker models, retrieval, prompts, and feedback can be observed as one system.

End-to-End Integration Map

graph TD
    U[User Message] --> ORC[Chatbot Orchestrator on ECS]
    ORC --> IC[SageMaker Intent Classifier]
    ORC --> EMB[Bedrock Titan Embeddings]
    ORC --> OS[OpenSearch Serverless]
    ORC --> RR[SageMaker Reranker]
    ORC --> PC[Prompt Builder + AppConfig]
    ORC --> BR[Bedrock Claude]
    ORC --> GR[Guardrails Pipeline]
    ORC --> FB[Feedback Events]

    ORC --> ML[MLflow Tracing SDK]
    IC --> ML
    RR --> ML
    PC --> ML
    BR --> ML
    GR --> ML
    FB --> ML

    ML --> TS[MLflow Tracking Server]
    TS --> S3[S3 Artifacts]
    TS --> RDS[RDS Metadata]
    ORC --> CW[CloudWatch Metrics and Logs]

1. Bedrock Integration

Where Bedrock Appears

  • Claude for response generation
  • Titan embeddings for query embedding
  • Optional Bedrock prompt cache or routing metadata

Integration Pattern

There are two practical patterns:

  1. If the application calls Claude through an Anthropic Bedrock client, enable MLflow auto-tracing around that SDK.
  2. If the application uses boto3 directly, wrap the Bedrock client in a traced adapter and record prompt, model, token, and stop metadata manually.

Traced Bedrock Wrapper

import json
import hashlib
import mlflow
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


@mlflow.trace(name="bedrock_generate", span_type="LLM")
def generate_with_bedrock(model_id: str, prompt: str, temperature: float, max_tokens: int) -> str:
    span = mlflow.get_current_active_span()
    span.set_attributes({
        "provider": "bedrock",
        "bedrock_model_id": model_id,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    })
    span.set_inputs({
        "prompt_preview": prompt[:500],
        "prompt_chars": len(prompt),
    })

    response = bedrock.invoke_model(
        modelId=model_id,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "temperature": temperature,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )

    payload = json.loads(response["body"].read())
    text = payload["content"][0]["text"]
    usage = payload.get("usage", {})

    span.set_outputs({
        "output_preview": text[:500],
        "output_chars": len(text),
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "stop_reason": payload.get("stop_reason"),
    })
    return text

Why It Matters

  • The Bedrock call becomes a first-class span instead of a hidden SDK call.
  • Prompt shape, token usage, and stop reason become queryable metadata.
  • Bedrock behavior can be compared across prompt versions and release bundles.

2. SageMaker Integration

What Runs on SageMaker

  • DistilBERT intent classifier
  • Cross-encoder reranker
  • Optional PII or sentiment models

Integration Pattern

Wrap each inference client in a traced function. Log model version, endpoint name, payload size, latency, and confidence.

import json
import mlflow
import boto3

runtime = boto3.client("sagemaker-runtime")


@mlflow.trace(name="intent_classification", span_type="CHAIN")
def classify_intent(endpoint_name: str, text: str) -> dict:
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"text": text}),
    )
    payload = json.loads(response["Body"].read())
    span = mlflow.get_current_active_span()
    span.set_attributes({
        "endpoint_name": endpoint_name,
        "model_family": "distilbert",
        "intent": payload["intent"],
        "confidence": payload["confidence"],
    })
    return payload

Why It Matters

  • Trace data makes it obvious whether latency came from Bedrock or upstream ML services.
  • Registry metadata can point back to the exact SageMaker artifact and deployment stage.

3. OpenSearch Integration

What We Trace

  • Query embedding latency
  • Vector search latency
  • Metadata filters
  • Candidate count and final selected chunks

Integration Pattern

Use a parent retrieve_chunks span and child spans for:

  • embed_query
  • vector_search
  • keyword_search
  • rerank_chunks

Each span logs chunk counts, source types, filters, and top chunk IDs.

Why It Matters

  • Retrieval quality issues are no longer conflated with generation quality issues.
  • Engineers can inspect the exact context sent to Bedrock.

4. AppConfig Integration

What AppConfig Controls

  • Active prompt version
  • Shadow mode flags
  • Prompt A/B experiment splits
  • Model routing rules
  • Guardrail thresholds

Integration Pattern

Read AppConfig once per request or once per cached config interval, then attach the resulting config version to the active trace:

mlflow.update_current_trace(tags={
    "prompt_version": prompt_config.version,
    "routing_policy_version": routing_config.version,
    "guardrail_ruleset_version": guardrail_config.version,
    "experiment_id": experiment.id if experiment else "none",
})

Why It Matters

  • Request behavior becomes explainable even when code has not changed.
  • Prompt or threshold rollouts are attributable to a specific config version.

5. CloudWatch and Alerting Integration

CloudWatch Still Matters

MLflow is not a replacement for operational telemetry. CloudWatch remains the best home for near-real-time alarms, platform logs, and service-native metrics.

Split of Responsibility

  • MLflow: request traces, prompt/model lineage, eval runs, release comparison
  • CloudWatch: fast alarms, infrastructure metrics, error counts, service health
  • Grafana or dashboards: real-time visualization across CloudWatch and MLflow-derived aggregates

Integration Pattern

  • Emit latency, error, and guardrail counters to CloudWatch.
  • Store trace IDs in log lines and structured metrics.
  • Put the active trace_id in alarm context so responders can pivot from alarm to trace quickly.
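The second point can be sketched as a structured log-line builder that always embeds the MLflow trace_id, so CloudWatch Logs Insights queries and alarm context can pivot straight to the trace; the helper name and field set are illustrative:

```python
import json
import time


def log_line(event: str, trace_id: str, **fields) -> str:
    """Build one structured JSON log line with the trace_id embedded."""
    record = {
        "ts": round(time.time(), 3),  # epoch seconds for Logs Insights sorting
        "event": event,
        "trace_id": trace_id,
    }
    record.update(fields)
    # sort_keys keeps lines diffable and stable across emitters
    return json.dumps(record, sort_keys=True)
```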

6. S3 and RDS Tracking Backend

Backend Design

  • S3 stores prompt artifacts, evaluation reports, retrieved chunk snapshots, and large run outputs.
  • RDS stores run metadata, tags, metrics, registry entries, and searchable trace metadata.
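With that split, a tracking server launch looks roughly like the following; the database endpoint, credentials, and bucket name are placeholders, and the exact flags depend on the MLflow version in use:

```shell
mlflow server \
  --backend-store-uri "postgresql://mlflow:${MLFLOW_DB_PASSWORD}@mlflow-db.internal:5432/mlflow" \
  --default-artifact-root "s3://mangaassist-mlflow-artifacts/" \
  --host 0.0.0.0 \
  --port 5000
```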

Why It Matters

  • Artifacts remain durable and cheap to store.
  • Queries on run metadata stay fast.
  • The control plane remains internal and auditable.

7. Feedback and Analytics Integration

Event Flow

sequenceDiagram
    participant UI as Chat UI
    participant ORC as Orchestrator
    participant ML as MLflow
    participant K as Kinesis
    participant RS as Redshift

    UI->>ORC: thumbs_down(response_id, trace_id)
    ORC->>K: feedback event with trace metadata
    ORC->>ML: set trace tags: user_feedback=thumbs_down
    K->>RS: batch analytics load
    RS-->>ML: optional aggregated reports linked back by prompt/model version
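The orchestrator side of that flow can be sketched as a small event builder; the field names and Kinesis record schema here are illustrative, not the project's actual contract:

```python
import time
import uuid
from typing import Optional


def feedback_event(response_id: str, trace_id: str, signal: str,
                   session_id: Optional[str] = None) -> dict:
    """Shape one feedback record for the Kinesis stream.

    The trace_id ties the event back to the MLflow trace, which is what
    lets Redshift aggregates join on prompt and model version later.
    """
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": "user_feedback",
        "signal": signal,            # e.g. "thumbs_down"
        "response_id": response_id,
        "trace_id": trace_id,
        "session_id": session_id,
        "ts": int(time.time()),
    }
```

The same trace_id is what the orchestrator uses when it tags the trace itself with `user_feedback=thumbs_down`.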

Why It Matters

  • You can analyze satisfaction by prompt version, Bedrock model, intent, or retrieval policy.
  • You can build retraining datasets from real failure clusters.

8. Security and Redaction Boundaries

Rules

  • Redact PII before logging trace inputs and outputs.
  • Prefer hashes or previews for large prompts and responses.
  • Store raw conversation artifacts only in approved encrypted storage with retention control.
  • Never rely on MLflow as the sole audit system for regulated evidence; pair it with the repo's existing security and logging controls.

Redacted Fields

  • Email
  • Phone
  • Address
  • Full order ID
  • Payment references
  • Free-form customer profile text
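A minimal redaction pass over trace inputs might look like the following; the two regexes only cover email and phone, and are a sketch rather than a substitute for the repo's real PII tooling:

```python
import re

# Deliberately broad patterns: over-masking is safer than leaking.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Mask obvious PII before text is attached to a trace span."""
    text = EMAIL.sub("[email]", text)
    text = PHONE.sub("[phone]", text)
    return text
```

Order numbers, payment references, and free-form profile text need domain-specific rules and belong with the repo's existing security controls, not a generic regex pass.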

9. Common Integration Tags

Use a consistent tag set across every scenario:

  • trace_id: 8d0d8d39c4f24a31b57f2a31b9d7e112
  • session_id: sess_7812f
  • intent: recommendation
  • prompt_version: recommendation-2.4
  • release_bundle: 2026.03.24-rc2
  • bedrock_model_id: anthropic.claude-3-5-sonnet
  • reranker_version: 8
  • guardrail_ruleset_version: gr-17
  • cache_hit: true
  • user_feedback: thumbs_down
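One way to keep the tag set consistent is a single helper that every scenario calls; the helper itself is hypothetical, but the keys mirror the list above. trace_id is omitted because MLflow assigns it, and the result can be attached with `mlflow.update_current_trace(tags=...)`:

```python
from typing import Optional


def standard_trace_tags(session_id: str, intent: str, prompt_version: str,
                        release_bundle: str, bedrock_model_id: str,
                        reranker_version, guardrail_ruleset_version: str,
                        cache_hit: bool,
                        user_feedback: Optional[str] = None) -> dict:
    """Build the shared tag dict; MLflow stores tag values as strings,
    so booleans and numbers are normalized here."""
    tags = {
        "session_id": session_id,
        "intent": intent,
        "prompt_version": prompt_version,
        "release_bundle": release_bundle,
        "bedrock_model_id": bedrock_model_id,
        "reranker_version": str(reranker_version),
        "guardrail_ruleset_version": guardrail_ruleset_version,
        "cache_hit": str(cache_hit).lower(),
    }
    if user_feedback is not None:
        tags["user_feedback"] = user_feedback
    return tags
```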

Final Integration Principle

The point of MLflow in MangaAssist is not to centralize every metric in one tool. The point is to centralize lineage and request context so that Bedrock, SageMaker, OpenSearch, prompts, and feedback can all be reasoned about together.