# MLflow, Bedrock, and Service Integration

How MLflow is wired into the AWS services behind MangaAssist so that Bedrock generation, SageMaker models, retrieval, prompts, and feedback can be observed as one system.
## End-to-End Integration Map

```mermaid
graph TD
    U[User Message] --> ORC[Chatbot Orchestrator on ECS]
    ORC --> IC[SageMaker Intent Classifier]
    ORC --> EMB[Bedrock Titan Embeddings]
    ORC --> OS[OpenSearch Serverless]
    ORC --> RR[SageMaker Reranker]
    ORC --> PC[Prompt Builder + AppConfig]
    ORC --> BR[Bedrock Claude]
    ORC --> GR[Guardrails Pipeline]
    ORC --> FB[Feedback Events]
    ORC --> ML[MLflow Tracing SDK]
    IC --> ML
    RR --> ML
    PC --> ML
    BR --> ML
    GR --> ML
    FB --> ML
    ML --> TS[MLflow Tracking Server]
    TS --> S3[S3 Artifacts]
    TS --> RDS[RDS Metadata]
    ORC --> CW[CloudWatch Metrics and Logs]
```
## 1. Bedrock Integration

### Where Bedrock Appears

- Claude for response generation
- Titan embeddings for query embedding
- Optional Bedrock prompt cache or routing metadata
### Integration Pattern

There are two practical patterns:

- If the application calls Claude through an Anthropic Bedrock client, enable MLflow auto-tracing around that SDK.
- If the application uses `boto3` directly, wrap the Bedrock client in a traced adapter and record prompt, model, token, and stop metadata manually.
### Traced Bedrock Wrapper

```python
import hashlib
import json

import boto3
import mlflow

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


@mlflow.trace(name="bedrock_generate", span_type="LLM")
def generate_with_bedrock(model_id: str, prompt: str, temperature: float, max_tokens: int) -> str:
    # Annotate the active span so the call is queryable by model and prompt shape.
    span = mlflow.get_current_active_span()
    span.set_attributes({
        "provider": "bedrock",
        "bedrock_model_id": model_id,
        "temperature": temperature,
        "max_tokens": max_tokens,
        # Hash instead of raw prompt: searchable without leaking content.
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    })
    span.set_inputs({
        "prompt_preview": prompt[:500],
        "prompt_chars": len(prompt),
    })
    response = bedrock.invoke_model(
        modelId=model_id,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "temperature": temperature,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    payload = json.loads(response["body"].read())
    text = payload["content"][0]["text"]
    usage = payload.get("usage", {})
    # Token usage and stop reason become queryable trace metadata.
    span.set_outputs({
        "output_preview": text[:500],
        "output_chars": len(text),
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "stop_reason": payload.get("stop_reason"),
    })
    return text
```
### Why It Matters

- The Bedrock call becomes a first-class span instead of a hidden SDK call.
- Prompt shape, token usage, and stop reason become queryable metadata.
- Bedrock behavior can be compared across prompt versions and release bundles.
## 2. SageMaker Integration

### What Runs on SageMaker

- DistilBERT intent classifier
- Cross-encoder reranker
- Optional PII or sentiment models

### Integration Pattern

Wrap each inference client in a traced function. Log model version, endpoint name, payload size, latency, and confidence.
```python
import json
import time

import boto3
import mlflow

runtime = boto3.client("sagemaker-runtime")


@mlflow.trace(name="intent_classification", span_type="CHAIN")
def classify_intent(endpoint_name: str, text: str) -> dict:
    body = json.dumps({"text": text})
    started = time.perf_counter()
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=body,
    )
    latency_ms = (time.perf_counter() - started) * 1000
    payload = json.loads(response["Body"].read())
    # Record endpoint, model family, payload size, latency, and prediction on the span.
    span = mlflow.get_current_active_span()
    span.set_attributes({
        "endpoint_name": endpoint_name,
        "model_family": "distilbert",
        "payload_bytes": len(body),
        "latency_ms": round(latency_ms, 1),
        "intent": payload["intent"],
        "confidence": payload["confidence"],
    })
    return payload
```
### Why It Matters

- Trace data makes it obvious whether latency came from Bedrock or from upstream ML services.
- Registry metadata can point back to the exact SageMaker artifact and deployment stage.
## 3. OpenSearch Integration

### What We Trace

- Query embedding latency
- Vector search latency
- Metadata filters
- Candidate count and final selected chunks

### Integration Pattern

Use a parent `retrieve_chunks` span with child spans for `embed_query`, `vector_search`, `keyword_search`, and `rerank_chunks`. Each span logs chunk counts, source types, filters, and top chunk IDs, as in the sketch below.
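A minimal sketch of that hierarchy using MLflow's `mlflow.start_span` context manager. The `embed`, `knn_search`, and `rerank` helpers are hypothetical stand-ins for the real Titan, OpenSearch, and reranker clients, and the keyword branch is elided for brevity:

```python
import mlflow

# Hypothetical stand-ins for the real embedding, search, and reranker clients.
def embed(query: str) -> list[float]:
    return [0.0] * 1024

def knn_search(vector: list[float], filters: dict) -> list[dict]:
    return [{"id": f"chunk-{i}", "score": 1.0 - i * 0.05} for i in range(20)]

def rerank(query: str, candidates: list[dict]) -> list[dict]:
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

def retrieve_chunks(query: str, filters: dict) -> list[dict]:
    # The parent span groups the whole retrieval step; children isolate each stage's latency.
    with mlflow.start_span(name="retrieve_chunks", span_type="RETRIEVER") as parent:
        parent.set_attributes({"filters": str(filters)})

        with mlflow.start_span(name="embed_query", span_type="EMBEDDING"):
            vector = embed(query)

        with mlflow.start_span(name="vector_search", span_type="RETRIEVER") as s:
            candidates = knn_search(vector, filters)
            s.set_attributes({"candidate_count": len(candidates)})

        with mlflow.start_span(name="rerank_chunks", span_type="RERANKER") as s:
            chunks = rerank(query, candidates)[:5]
            s.set_attributes({"top_chunk_ids": [c["id"] for c in chunks]})

        parent.set_attributes({"final_chunk_count": len(chunks)})
        return chunks
```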
### Why It Matters

- Retrieval quality issues stop being confused with generation quality issues.
- Engineers can inspect the exact context sent to Bedrock.
## 4. AppConfig Integration

### What AppConfig Controls

- Active prompt version
- Shadow mode flags
- Prompt A/B experiment splits
- Model routing rules
- Guardrail thresholds

### Integration Pattern

Read AppConfig once per request, or once per cached config interval, then attach the resulting config versions to the active trace:
```python
mlflow.update_current_trace(tags={
    "prompt_version": prompt_config.version,
    "routing_policy_version": routing_config.version,
    "guardrail_ruleset_version": guardrail_config.version,
    "experiment_id": experiment.id if experiment else "none",
})
```
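A minimal sketch of the per-interval read, using the AppConfig data-plane API via boto3; the application, environment, and profile identifiers are hypothetical:

```python
import json

import boto3

appconfig = boto3.client("appconfigdata")

# Hypothetical identifiers; substitute the real MangaAssist values.
session = appconfig.start_configuration_session(
    ApplicationIdentifier="mangaassist",
    EnvironmentIdentifier="prod",
    ConfigurationProfileIdentifier="chatbot-runtime",
)
token = session["InitialConfigurationToken"]

result = appconfig.get_latest_configuration(ConfigurationToken=token)
token = result["NextPollConfigurationToken"]  # carry forward for the next poll
body = result["Configuration"].read()
if body:  # empty when the config is unchanged since the last poll
    config = json.loads(body)
```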
### Why It Matters

- Request behavior becomes explainable even when the code has not changed.
- Prompt or threshold rollouts are attributable to a specific config version.
## 5. CloudWatch and Alerting Integration

### CloudWatch Still Matters

MLflow is not a replacement for operational telemetry. CloudWatch remains the best home for near-real-time alarms, platform logs, and service-native metrics.
### Split of Responsibility
| System | Best at |
|---|---|
| MLflow | Request traces, prompt/model lineage, eval runs, release comparison |
| CloudWatch | Fast alarms, infrastructure metrics, error counts, service health |
| Grafana or dashboards | Real-time visualization across CloudWatch and MLflow-derived aggregates |
### Integration Pattern

- Emit latency, error, and guardrail counters to CloudWatch.
- Store trace IDs in log lines and structured metrics.
- Put the active `trace_id` in alarm context so responders can pivot from alarm to trace quickly, as in the sketch below.
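A minimal sketch of that pattern; the metric namespace and log format are assumptions, and the trace ID is passed in from the active MLflow trace:

```python
import json
import logging

import boto3

cloudwatch = boto3.client("cloudwatch")
logger = logging.getLogger("mangaassist.chatbot")

def emit_request_telemetry(trace_id: str, latency_ms: float, guardrail_triggered: bool) -> None:
    # Counters and latency go to CloudWatch for fast alarming.
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Chatbot",  # hypothetical namespace
        MetricData=[
            {"MetricName": "RequestLatencyMs", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "GuardrailTriggered", "Value": 1.0 if guardrail_triggered else 0.0, "Unit": "Count"},
        ],
    )
    # The trace_id rides along in the structured log line so an alarm
    # can be pivoted straight to the matching MLflow trace.
    logger.info(json.dumps({
        "event": "chat_request_complete",
        "trace_id": trace_id,
        "latency_ms": latency_ms,
        "guardrail_triggered": guardrail_triggered,
    }))
```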
## 6. S3 and RDS Tracking Backend

### Backend Design

- S3 stores prompt artifacts, evaluation reports, retrieved chunk snapshots, and large run outputs.
- RDS stores run metadata, tags, metrics, registry entries, and searchable trace metadata.
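On the client side only the tracking URI matters; the server process is what is wired to RDS and S3. A minimal sketch with hypothetical endpoints:

```python
import mlflow

# The tracking server itself is launched with its backend store pointed at RDS
# and its artifact destination pointed at S3, e.g.:
#   mlflow server \
#     --backend-store-uri postgresql://mlflow@mlflow-db.internal:5432/mlflow \
#     --artifacts-destination s3://mangaassist-mlflow-artifacts
# (hypothetical hostname and bucket)

mlflow.set_tracking_uri("https://mlflow.internal.example.com")  # hypothetical endpoint
mlflow.set_experiment("mangaassist-chatbot")
```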
### Why It Matters

- Artifacts remain durable and cheap to store.
- Queries on run metadata stay fast.
- The control plane remains internal and auditable.
## 7. Feedback and Analytics Integration

### Event Flow

```mermaid
sequenceDiagram
    participant UI as Chat UI
    participant ORC as Orchestrator
    participant ML as MLflow
    participant K as Kinesis
    participant RS as Redshift
    UI->>ORC: thumbs_down(response_id, trace_id)
    ORC->>K: feedback event with trace metadata
    ORC->>ML: set trace tags: user_feedback=thumbs_down
    K->>RS: batch analytics load
    RS-->>ML: optional aggregated reports linked back by prompt/model version
```
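A minimal sketch of the orchestrator side of that flow, assuming MLflow's `set_trace_tag` fluent API and a hypothetical stream name:

```python
import json

import boto3
import mlflow

kinesis = boto3.client("kinesis")

def record_feedback(response_id: str, trace_id: str, feedback: str) -> None:
    # Fan the event out to analytics (Kinesis -> Redshift) with trace metadata attached.
    kinesis.put_record(
        StreamName="mangaassist-feedback",  # hypothetical stream name
        PartitionKey=trace_id,
        Data=json.dumps({
            "response_id": response_id,
            "trace_id": trace_id,
            "feedback": feedback,
        }),
    )
    # Tag the completed trace so feedback is filterable in the MLflow UI and search API.
    mlflow.set_trace_tag(trace_id, "user_feedback", feedback)
```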
### Why It Matters

- You can analyze satisfaction by prompt version, Bedrock model, intent, or retrieval policy.
- You can build retraining datasets from real failure clusters.
## 8. Security and Redaction Boundaries

### Rules

- Redact PII before logging trace inputs and outputs (see the sketch after the field list below).
- Prefer hashes or previews for large prompts and responses.
- Store raw conversation artifacts only in approved encrypted storage with retention control.
- Never rely on MLflow as the sole audit system for regulated evidence; pair it with the repo's existing security and logging controls.

### Recommended Redacted Fields

- Phone
- Address
- Full order ID
- Payment references
- Free-form customer profile text
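A minimal sketch of a pre-logging redaction pass, using assumed regex patterns; production redaction would lean on the PII model from the SageMaker section:

```python
import re

# Hypothetical patterns; a production system would use the dedicated PII model instead.
REDACTION_PATTERNS = {
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "order_id": re.compile(r"\bORD-\d{6,}\b"),  # assumed order ID format
}

def redact_for_trace(text: str, preview_chars: int = 500) -> str:
    """Redact known PII patterns, then truncate to a preview before logging."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text[:preview_chars]
```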
## 9. Common Integration Tags

Use a consistent tag set across every scenario:

| Tag | Example |
|---|---|
| `trace_id` | `8d0d8d39c4f24a31b57f2a31b9d7e112` |
| `session_id` | `sess_7812f` |
| `intent` | `recommendation` |
| `prompt_version` | `recommendation-2.4` |
| `release_bundle` | `2026.03.24-rc2` |
| `bedrock_model_id` | `anthropic.claude-3-5-sonnet` |
| `reranker_version` | `8` |
| `guardrail_ruleset_version` | `gr-17` |
| `cache_hit` | `true` |
| `user_feedback` | `thumbs_down` |
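One way to keep the set consistent is to route every scenario through a single helper; a minimal sketch, assuming the orchestrator assembles a plain `ctx` dict with these fields:

```python
import mlflow

def tag_request_trace(ctx: dict) -> None:
    """Apply the standard tag set to the active trace; keys mirror the table above."""
    mlflow.update_current_trace(tags={
        "session_id": ctx["session_id"],
        "intent": ctx["intent"],
        "prompt_version": ctx["prompt_version"],
        "release_bundle": ctx["release_bundle"],
        "bedrock_model_id": ctx["bedrock_model_id"],
        "reranker_version": str(ctx["reranker_version"]),
        "guardrail_ruleset_version": ctx["guardrail_ruleset_version"],
        "cache_hit": str(ctx["cache_hit"]).lower(),
        # trace_id is assigned by MLflow itself; user_feedback is set later
        # by the feedback path in section 7.
    })
```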
## Final Integration Principle

The point of MLflow in MangaAssist is not to centralize every metric in one tool. The point is to centralize lineage and request context so that Bedrock, SageMaker, OpenSearch, prompts, and feedback can all be reasoned about together.