# MLflow, Bedrock, and Service Integration

How MLflow is wired into the AWS services behind MangaAssist so that Bedrock generation, SageMaker models, retrieval, prompts, and feedback can be observed as one system.
## End-to-End Integration Map

```mermaid
graph TD
    U[User Message] --> ORC[Chatbot Orchestrator on ECS]
    ORC --> IC[SageMaker Intent Classifier]
    ORC --> EMB[Bedrock Titan Embeddings]
    ORC --> OS[OpenSearch Serverless]
    ORC --> RR[SageMaker Reranker]
    ORC --> PC[Prompt Builder + AppConfig]
    ORC --> BR[Bedrock Claude]
    ORC --> GR[Guardrails Pipeline]
    ORC --> FB[Feedback Events]
    ORC --> ML[MLflow Tracing SDK]
    IC --> ML
    RR --> ML
    PC --> ML
    BR --> ML
    GR --> ML
    FB --> ML
    ML --> TS[MLflow Tracking Server]
    TS --> S3[S3 Artifacts]
    TS --> RDS[RDS Metadata]
    ORC --> CW[CloudWatch Metrics and Logs]
```
## 1. Bedrock Integration

### Where Bedrock Appears

- Claude for response generation
- Titan embeddings for query embedding
- Optional Bedrock prompt cache or routing metadata
### Integration Pattern

There are two practical patterns:

- If the application calls Claude through an Anthropic Bedrock client, enable MLflow auto-tracing around that SDK.
- If the application uses `boto3` directly, wrap the Bedrock client in a traced adapter and record prompt, model, token, and stop metadata manually.
### Traced Bedrock Wrapper

```python
import hashlib
import json

import boto3
import mlflow

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


@mlflow.trace(name="bedrock_generate", span_type="LLM")
def generate_with_bedrock(model_id: str, prompt: str, temperature: float, max_tokens: int) -> str:
    # Annotate the active span so the call is queryable by model and prompt shape.
    span = mlflow.get_current_active_span()
    span.set_attributes({
        "provider": "bedrock",
        "bedrock_model_id": model_id,
        "temperature": temperature,
        "max_tokens": max_tokens,
        # Hash instead of raw prompt: searchable without leaking content.
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    })
    span.set_inputs({
        "prompt_preview": prompt[:500],
        "prompt_chars": len(prompt),
    })
    response = bedrock.invoke_model(
        modelId=model_id,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "temperature": temperature,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    payload = json.loads(response["body"].read())
    text = payload["content"][0]["text"]
    usage = payload.get("usage", {})
    # Token usage and stop reason become queryable trace metadata.
    span.set_outputs({
        "output_preview": text[:500],
        "output_chars": len(text),
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "stop_reason": payload.get("stop_reason"),
    })
    return text
```
### Why It Matters

- The Bedrock call becomes a first-class span instead of a hidden SDK call.
- Prompt shape, token usage, and stop reason become queryable metadata.
- Bedrock behavior can be compared across prompt versions and release bundles.
## 2. SageMaker Integration

### What Runs on SageMaker

- DistilBERT intent classifier
- Cross-encoder reranker
- Optional PII or sentiment models

### Integration Pattern

Wrap each inference client in a traced function. Log model version, endpoint name, payload size, latency, and confidence.
```python
import json
import time

import boto3
import mlflow

runtime = boto3.client("sagemaker-runtime")


@mlflow.trace(name="intent_classification", span_type="CHAIN")
def classify_intent(endpoint_name: str, text: str) -> dict:
    body = json.dumps({"text": text})
    started = time.perf_counter()
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=body,
    )
    latency_ms = (time.perf_counter() - started) * 1000
    payload = json.loads(response["Body"].read())
    # Record endpoint, model family, payload size, latency, and prediction on the span.
    span = mlflow.get_current_active_span()
    span.set_attributes({
        "endpoint_name": endpoint_name,
        "model_family": "distilbert",
        "payload_bytes": len(body),
        "latency_ms": round(latency_ms, 1),
        "intent": payload["intent"],
        "confidence": payload["confidence"],
    })
    return payload
```
### Why It Matters

- Trace data makes it obvious whether latency came from Bedrock or from upstream ML services.
- Registry metadata can point back to the exact SageMaker artifact and deployment stage.
## 3. OpenSearch Integration

### What We Trace

- Query embedding latency
- Vector search latency
- Metadata filters
- Candidate count and final selected chunks

### Integration Pattern

Use a parent `retrieve_chunks` span with child spans for `embed_query`, `vector_search`, `keyword_search`, and `rerank_chunks`. Each span logs chunk counts, source types, filters, and top chunk IDs, as in the sketch below.
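A minimal sketch of that hierarchy using MLflow's `mlflow.start_span` context manager. The `embed`, `knn_search`, and `rerank` helpers are hypothetical stand-ins for the real Titan, OpenSearch, and reranker clients, and the keyword branch is elided for brevity:

```python
import mlflow

# Hypothetical stand-ins for the real embedding, search, and reranker clients.
def embed(query: str) -> list[float]:
    return [0.0] * 1024

def knn_search(vector: list[float], filters: dict) -> list[dict]:
    return [{"id": f"chunk-{i}", "score": 1.0 - i * 0.05} for i in range(20)]

def rerank(query: str, candidates: list[dict]) -> list[dict]:
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

def retrieve_chunks(query: str, filters: dict) -> list[dict]:
    # The parent span groups the whole retrieval step; children isolate each stage's latency.
    with mlflow.start_span(name="retrieve_chunks", span_type="RETRIEVER") as parent:
        parent.set_attributes({"filters": str(filters)})

        with mlflow.start_span(name="embed_query", span_type="EMBEDDING"):
            vector = embed(query)

        with mlflow.start_span(name="vector_search", span_type="RETRIEVER") as s:
            candidates = knn_search(vector, filters)
            s.set_attributes({"candidate_count": len(candidates)})

        with mlflow.start_span(name="rerank_chunks", span_type="RERANKER") as s:
            chunks = rerank(query, candidates)[:5]
            s.set_attributes({"top_chunk_ids": [c["id"] for c in chunks]})

        parent.set_attributes({"final_chunk_count": len(chunks)})
        return chunks
```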
### Why It Matters

- Retrieval quality issues stop being confused with generation quality issues.
- Engineers can inspect the exact context sent to Bedrock.
## 4. AppConfig Integration

### What AppConfig Controls

- Active prompt version
- Shadow mode flags
- Prompt A/B experiment splits
- Model routing rules
- Guardrail thresholds

### Integration Pattern

Read AppConfig once per request, or once per cached config interval, then attach the resulting config versions to the active trace:
```python
mlflow.update_current_trace(tags={
    "prompt_version": prompt_config.version,
    "routing_policy_version": routing_config.version,
    "guardrail_ruleset_version": guardrail_config.version,
    "experiment_id": experiment.id if experiment else "none",
})
```
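A minimal sketch of the per-interval read, using the AppConfig data-plane API via boto3; the application, environment, and profile identifiers are hypothetical:

```python
import json

import boto3

appconfig = boto3.client("appconfigdata")

# Hypothetical identifiers; substitute the real MangaAssist values.
session = appconfig.start_configuration_session(
    ApplicationIdentifier="mangaassist",
    EnvironmentIdentifier="prod",
    ConfigurationProfileIdentifier="chatbot-runtime",
)
token = session["InitialConfigurationToken"]

result = appconfig.get_latest_configuration(ConfigurationToken=token)
token = result["NextPollConfigurationToken"]  # carry forward for the next poll
body = result["Configuration"].read()
if body:  # empty when the config is unchanged since the last poll
    config = json.loads(body)
```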
### Why It Matters

- Request behavior becomes explainable even when the code has not changed.
- Prompt or threshold rollouts are attributable to a specific config version.
## 5. CloudWatch and Alerting Integration

### CloudWatch Still Matters

MLflow is not a replacement for operational telemetry. CloudWatch remains the best home for near-real-time alarms, platform logs, and service-native metrics.
### Split of Responsibility
| System | Best at |
|---|---|
| MLflow | Request traces, prompt/model lineage, eval runs, release comparison |
| CloudWatch | Fast alarms, infrastructure metrics, error counts, service health |
| Grafana or dashboards | Real-time visualization across CloudWatch and MLflow-derived aggregates |
### Integration Pattern

- Emit latency, error, and guardrail counters to CloudWatch.
- Store trace IDs in log lines and structured metrics.
- Put the active `trace_id` in alarm context so responders can pivot from alarm to trace quickly, as in the sketch below.
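A minimal sketch of that pattern; the metric namespace and log format are assumptions, and the trace ID is passed in from the active MLflow trace:

```python
import json
import logging

import boto3

cloudwatch = boto3.client("cloudwatch")
logger = logging.getLogger("mangaassist.chatbot")

def emit_request_telemetry(trace_id: str, latency_ms: float, guardrail_triggered: bool) -> None:
    # Counters and latency go to CloudWatch for fast alarming.
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Chatbot",  # hypothetical namespace
        MetricData=[
            {"MetricName": "RequestLatencyMs", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "GuardrailTriggered", "Value": 1.0 if guardrail_triggered else 0.0, "Unit": "Count"},
        ],
    )
    # The trace_id rides along in the structured log line so an alarm
    # can be pivoted straight to the matching MLflow trace.
    logger.info(json.dumps({
        "event": "chat_request_complete",
        "trace_id": trace_id,
        "latency_ms": latency_ms,
        "guardrail_triggered": guardrail_triggered,
    }))
```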
## 6. S3 and RDS Tracking Backend

### Backend Design

- S3 stores prompt artifacts, evaluation reports, retrieved chunk snapshots, and large run outputs.
- RDS stores run metadata, tags, metrics, registry entries, and searchable trace metadata.
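On the client side only the tracking URI matters; the server process is what is wired to RDS and S3. A minimal sketch with hypothetical endpoints:

```python
import mlflow

# The tracking server itself is launched with its backend store pointed at RDS
# and its artifact destination pointed at S3, e.g.:
#   mlflow server \
#     --backend-store-uri postgresql://mlflow@mlflow-db.internal:5432/mlflow \
#     --artifacts-destination s3://mangaassist-mlflow-artifacts
# (hypothetical hostname and bucket)

mlflow.set_tracking_uri("https://mlflow.internal.example.com")  # hypothetical endpoint
mlflow.set_experiment("mangaassist-chatbot")
```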
### Why It Matters

- Artifacts remain durable and cheap to store.
- Queries on run metadata stay fast.
- The control plane remains internal and auditable.
## 7. Feedback and Analytics Integration

### Event Flow

```mermaid
sequenceDiagram
    participant UI as Chat UI
    participant ORC as Orchestrator
    participant ML as MLflow
    participant K as Kinesis
    participant RS as Redshift
    UI->>ORC: thumbs_down(response_id, trace_id)
    ORC->>K: feedback event with trace metadata
    ORC->>ML: set trace tags: user_feedback=thumbs_down
    K->>RS: batch analytics load
    RS-->>ML: optional aggregated reports linked back by prompt/model version
```
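A minimal sketch of the orchestrator side of that flow, assuming MLflow's `set_trace_tag` fluent API and a hypothetical stream name:

```python
import json

import boto3
import mlflow

kinesis = boto3.client("kinesis")

def record_feedback(response_id: str, trace_id: str, feedback: str) -> None:
    # Fan the event out to analytics (Kinesis -> Redshift) with trace metadata attached.
    kinesis.put_record(
        StreamName="mangaassist-feedback",  # hypothetical stream name
        PartitionKey=trace_id,
        Data=json.dumps({
            "response_id": response_id,
            "trace_id": trace_id,
            "feedback": feedback,
        }),
    )
    # Tag the completed trace so feedback is filterable in the MLflow UI and search API.
    mlflow.set_trace_tag(trace_id, "user_feedback", feedback)
```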
### Why It Matters

- You can analyze satisfaction by prompt version, Bedrock model, intent, or retrieval policy.
- You can build retraining datasets from real failure clusters.
## 8. Security and Redaction Boundaries

### Rules

- Redact PII before logging trace inputs and outputs (see the sketch after the field list below).
- Prefer hashes or previews for large prompts and responses.
- Store raw conversation artifacts only in approved encrypted storage with retention control.
- Never rely on MLflow as the sole audit system for regulated evidence; pair it with the repo's existing security and logging controls.

### Recommended Redacted Fields

- Phone
- Address
- Full order ID
- Payment references
- Free-form customer profile text
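A minimal sketch of a pre-logging redaction pass, using assumed regex patterns; production redaction would lean on the PII model from the SageMaker section:

```python
import re

# Hypothetical patterns; a production system would use the dedicated PII model instead.
REDACTION_PATTERNS = {
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "order_id": re.compile(r"\bORD-\d{6,}\b"),  # assumed order ID format
}

def redact_for_trace(text: str, preview_chars: int = 500) -> str:
    """Redact known PII patterns, then truncate to a preview before logging."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text[:preview_chars]
```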
## 9. Common Integration Tags

Use a consistent tag set across every scenario:

| Tag | Example |
|---|---|
| `trace_id` | `8d0d8d39c4f24a31b57f2a31b9d7e112` |
| `session_id` | `sess_7812f` |
| `intent` | `recommendation` |
| `prompt_version` | `recommendation-2.4` |
| `release_bundle` | `2026.03.24-rc2` |
| `bedrock_model_id` | `anthropic.claude-3-5-sonnet` |
| `reranker_version` | `8` |
| `guardrail_ruleset_version` | `gr-17` |
| `cache_hit` | `true` |
| `user_feedback` | `thumbs_down` |
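One way to keep the set consistent is to route every scenario through a single helper; a minimal sketch, assuming the orchestrator assembles a plain `ctx` dict with these fields:

```python
import mlflow

def tag_request_trace(ctx: dict) -> None:
    """Apply the standard tag set to the active trace; keys mirror the table above."""
    mlflow.update_current_trace(tags={
        "session_id": ctx["session_id"],
        "intent": ctx["intent"],
        "prompt_version": ctx["prompt_version"],
        "release_bundle": ctx["release_bundle"],
        "bedrock_model_id": ctx["bedrock_model_id"],
        "reranker_version": str(ctx["reranker_version"]),
        "guardrail_ruleset_version": ctx["guardrail_ruleset_version"],
        "cache_hit": str(ctx["cache_hit"]).lower(),
        # trace_id is assigned by MLflow itself; user_feedback is set later
        # by the feedback path in section 7.
    })
```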
## Final Integration Principle

The point of MLflow in MangaAssist is not to centralize every metric in one tool. The point is to centralize lineage and request context so that Bedrock, SageMaker, OpenSearch, prompts, and feedback can all be reasoned about together.