MLflow Tracing for LLM Observability — How It Transformed Our Debugging and Monitoring
How I introduced MLflow Tracing to MangaAssist, achieving full-pipeline observability for our LLM chatbot and eliminating blind spots in production debugging.
The Problem: LLM Pipelines Are Black Boxes
Before MLflow Tracing, debugging MangaAssist was painful:
User reports: "The chatbot gave me wrong manga recommendations"
Engineer's experience:
1. Check CloudWatch logs → See the final response, but WHY was it wrong?
2. Was the intent classified correctly? → Check SageMaker logs (different service)
3. Were the right documents retrieved? → Check OpenSearch query logs (another service)
4. Was the reranker working? → Check reranker endpoint logs (yet another service)
5. Was the prompt constructed correctly? → Not logged anywhere
6. Did the guardrails modify the response? → Partially logged
Total debugging time: 45-90 minutes per incident.
The fundamental problem: a single chatbot response flows through 4-6 ML models and services, and traditional logging captures inputs/outputs of each service independently — never the full chain.
What Is MLflow Tracing?
MLflow Tracing is an open-source, OpenTelemetry-compatible LLM observability system that captures the inputs, outputs, latency, and metadata of every step in an LLM pipeline as a single, navigable trace.
Core Concepts
graph TB
subgraph Trace["Trace (full request lifecycle)"]
A[Span: Receive Message<br/>12ms] --> B[Span: Load Context<br/>8ms]
B --> C[Span: Classify Intent<br/>15ms]
C --> D[Span: RAG Retrieval<br/>180ms]
D --> D1[Span: Embed Query<br/>25ms]
D --> D2[Span: Vector Search<br/>45ms]
D --> D3[Span: Rerank Results<br/>110ms]
D --> E[Span: Generate Response<br/>650ms]
E --> E1[Span: Build Prompt<br/>5ms]
E --> E2[Span: LLM Call<br/>620ms]
E --> E3[Span: Parse Output<br/>25ms]
E --> F[Span: Apply Guardrails<br/>85ms]
F --> G[Span: Save & Return<br/>12ms]
end
| Concept | Definition |
|---|---|
| Trace | A complete request lifecycle — from user message to final response |
| Span | An individual step within the trace (e.g., "RAG Retrieval", "LLM Call") |
| Parent-Child | Spans are nested hierarchically — "RAG Retrieval" contains child spans for embed, search, rerank |
| Attributes | Key-value metadata attached to spans (model name, token count, temperature, etc.) |
| Events | Timestamped markers within spans (errors, cache hits, fallbacks) |
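In code, these concepts map one-to-one onto the tracing API. The snippet below is a minimal sketch (assuming a recent MLflow version where spans expose set_attributes, set_inputs/set_outputs, and add_event with SpanEvent): nested start_span context managers create the parent-child hierarchy, attributes attach metadata, and an event marks a point-in-time occurrence inside a span.
import mlflow
from mlflow.entities import SpanEvent

# Parent span: nesting context managers builds the hierarchy
# shown in the diagram above.
with mlflow.start_span(name="rag_retrieval") as parent:
    parent.set_attributes({"index": "manga-embeddings", "k": 5})

    # Child span with inputs, outputs, and a timestamped event.
    with mlflow.start_span(name="vector_search") as child:
        child.set_inputs({"query": "ongoing shonen manga"})
        child.add_event(SpanEvent(name="cache_miss"))
        child.set_outputs({"results_count": 50})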
How I Implemented MLflow Tracing in MangaAssist
Step 1: Auto-Tracing for Framework Integrations
MLflow provides one-line auto-tracing for major LLM frameworks. We enabled it for our entire stack:
import mlflow
# Auto-trace all Bedrock (Anthropic) calls
mlflow.anthropic.autolog()
# Auto-trace all LangChain chains
mlflow.langchain.autolog()
# Auto-trace all OpenAI-compatible calls (vLLM uses OpenAI API)
mlflow.openai.autolog()
With three lines of code, every LLM call, chain invocation, and retrieval step was automatically captured — no manual instrumentation needed.
Step 2: Manual Tracing for Custom Components
For components not covered by auto-tracing (our custom intent classifier, caching layer, guardrails), I used the @mlflow.trace decorator:
import mlflow
from mlflow.entities import SpanType

@mlflow.trace(span_type=SpanType.CHAIN, name="intent_classification")
def classify_intent(user_message: str, page_context: dict) -> Intent:
    """Two-stage intent classification: rules → ML model."""
    # Stage 1: Rule-based (fast path)
    rule_result = apply_rules(user_message, page_context)
    if rule_result.confidence > 0.95:
        span = mlflow.get_current_active_span()
        span.set_attributes({"classification_path": "rules", "confidence": rule_result.confidence})
        return rule_result.intent

    # Stage 2: ML model (DistilBERT via vLLM)
    ml_result = call_intent_model(user_message)
    span = mlflow.get_current_active_span()
    span.set_attributes({
        "classification_path": "ml_model",
        "confidence": ml_result.confidence,
        "model_version": "distilbert-manga-v3.2",
        "top_3_intents": str(ml_result.top_k(3)),
    })
    return ml_result.intent
@mlflow.trace(span_type=SpanType.RETRIEVER, name="rag_retrieval")
def retrieve_manga_context(query: str, filters: dict) -> list[Document]:
    """Hybrid retrieval: vector + keyword search with reranking."""
    # Embed query
    query_embedding = embed_query(query)

    # Vector search (OpenSearch HNSW)
    vector_results = opensearch_client.search(
        index="manga-embeddings",
        body=build_knn_query(query_embedding, k=50, filters=filters),
    )

    # Keyword search (BM25)
    keyword_results = opensearch_client.search(
        index="manga-content",
        body=build_bm25_query(query, filters=filters),
    )

    # Merge and rerank
    merged = reciprocal_rank_fusion(vector_results, keyword_results)
    reranked = rerank_with_cross_encoder(query, merged[:50])

    span = mlflow.get_current_active_span()
    span.set_attributes({
        "vector_results_count": len(vector_results),
        "keyword_results_count": len(keyword_results),
        "final_results_count": len(reranked[:5]),
        "top_score": reranked[0].score if reranked else 0,
    })
    return reranked[:5]
@mlflow.trace(span_type=SpanType.CHAIN, name="guardrails_pipeline")
def apply_guardrails(response: str, context: dict) -> GuardrailResult:
"""6-stage guardrails pipeline."""
stages = [
("pii_detection", check_pii),
("prompt_injection", check_injection),
("content_moderation", check_content),
("hallucination_check", check_hallucination),
("response_length", check_length),
("format_validation", check_format),
]
for stage_name, check_fn in stages:
with mlflow.start_span(name=stage_name) as span:
result = check_fn(response, context)
span.set_inputs({"response_length": len(response)})
span.set_outputs({"passed": result.passed, "reason": result.reason})
if not result.passed:
return GuardrailResult(blocked=True, stage=stage_name, reason=result.reason)
return GuardrailResult(blocked=False)
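Because @mlflow.trace attaches each call to the currently active trace, decorated functions that call one another nest automatically. The following is a stripped-down, self-contained sketch with hypothetical stub helpers (classify, retrieve, generate) standing in for the real components, just to show how one top-level handler yields the full parent-child tree:
import mlflow
from mlflow.entities import SpanType

@mlflow.trace(span_type=SpanType.CHAIN, name="classify")
def classify(message: str) -> str:
    return "recommendation"  # stub for the real intent classifier

@mlflow.trace(span_type=SpanType.RETRIEVER, name="retrieve")
def retrieve(message: str) -> list[str]:
    return ["doc-1", "doc-2"]  # stub for the real hybrid retrieval

@mlflow.trace(span_type=SpanType.LLM, name="generate")
def generate(message: str, docs: list[str]) -> str:
    return f"Answer grounded in {len(docs)} documents"  # stub for the LLM call

@mlflow.trace(span_type=SpanType.CHAIN, name="handle_message")
def handle_message(message: str) -> str:
    intent = classify(message)                      # child span
    docs = retrieve(message) if intent else []      # child span
    return generate(message, docs)                  # child span

# One call produces one trace: handle_message is the root span,
# with classify, retrieve, and generate as its children.
handle_message("Recommend something like Vinland Saga")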
Step 3: Trace Tagging for Filtering and Analysis
# Tag every trace with session metadata for filtering
mlflow.update_current_trace(tags={
    "session_id": session.id,
    "user_type": session.user_type,      # "authenticated" | "guest"
    "page_type": context.page_type,      # "product" | "search" | "category"
    "intent": classified_intent.name,
    "cache_hit": str(cache_result is not None),
    "model_used": model_config.name,     # "claude-sonnet" | "claude-haiku"
    "region": "ap-northeast-1",
})
Step 4: Production Deployment
graph LR
A[MangaAssist App<br/>ECS Fargate] -->|async trace export| B[MLflow Tracking Server<br/>ECS Fargate]
B --> C[S3 — Trace Storage]
B --> D[PostgreSQL RDS — Metadata]
B --> E[MLflow UI<br/>Internal Dashboard]
F[Prometheus] -->|scrape| B
F --> G[Grafana Dashboards]
Infrastructure:
- MLflow Tracking Server on ECS Fargate (2 vCPU, 4GB RAM)
- PostgreSQL RDS (db.t3.medium) for trace metadata
- S3 for trace artifact storage (conversation logs, retrieved documents)
- Cost: ~$400/month total
Key configuration for production:
# Use lightweight mlflow-tracing package (95% smaller than full mlflow)
# pip install mlflow-tracing
import mlflow
# Async logging — does not block application response
mlflow.config.enable_async_logging(True)
# Sampling — trace 10% of requests in production, 100% in staging
mlflow.config.set_sampling_ratio(0.10) # Production
# mlflow.config.set_sampling_ratio(1.0) # Staging
# PII redaction — mask sensitive data in traces
mlflow.config.enable_pii_redaction(True)
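Each application container also needs to know where to send traces. A minimal sketch, with placeholder values for the internal tracking server URI and experiment name (artifact storage on S3 and the PostgreSQL backend are configured on the server side):
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal.example:5000")  # placeholder URI
mlflow.set_experiment("mangaassist-chatbot-production")         # placeholder name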
What MLflow Tracing Revealed
Finding 1: The Reranker Was Our Bottleneck
Trace breakdown (P95):
Intent Classification: 15ms ( 2%)
Context Loading: 8ms ( 1%)
Query Embedding: 25ms ( 3%)
Vector Search: 45ms ( 5%)
Reranker: 380ms ( 42%) ← BOTTLENECK
LLM Generation: 320ms ( 36%)
Guardrails: 85ms ( 10%)
Other: 22ms ( 2%)
Total: 900ms (100%)
Action: I discovered the reranker was processing 50 documents instead of the intended 20 due to a config bug. Fixing it cut reranker latency from 380ms to 110ms — a 270ms improvement that brought us under our 800ms P95 target.
Without the trace waterfall visualization, this would have been invisible in aggregate latency metrics.
Finding 2: 38% of LLM Calls Were Redundant
Traces showed that semantically identical questions (e.g., "Is One Piece still running?" vs "Is One Piece ongoing?") were generating full LLM calls. This motivated the semantic caching layer:
Before: 100% of queries → full LLM pipeline (avg 800ms)
After: 42% cache hits (avg 8ms) + 58% full pipeline (avg 800ms)
Blended P50: 340ms (58% reduction)
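The caching layer itself deserves its own write-up, but the core lookup is small. Here is a minimal sketch, assuming embed_query is the same embedding helper used in retrieval, cached embeddings are unit-normalized, and the threshold is illustrative rather than our tuned production value:
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # illustrative, not the production setting

def semantic_cache_lookup(query: str, cache: list[tuple[np.ndarray, str]]) -> str | None:
    """Return a cached response if a semantically similar query was answered before."""
    q = embed_query(query)
    q = q / np.linalg.norm(q)
    for cached_embedding, cached_response in cache:
        # Cosine similarity; both vectors are unit-normalized.
        if float(q @ cached_embedding) >= SIMILARITY_THRESHOLD:
            return cached_response  # ~8ms path, skipping the full LLM pipeline
    return None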
Finding 3: Guardrails Were Overly Aggressive
Traces revealed that our hallucination check was flagging 12% of legitimate responses as "potentially hallucinated" — causing unnecessary fallbacks to templated responses. Analysis of flagged traces showed the confidence threshold was too conservative for manga domain-specific queries.
Action: Adjusted threshold from 0.7 to 0.55 for manga-specific intents, reducing false positive rate from 12% to 3% while maintaining <0.5% actual hallucination rate.
Finding 4: Cold Start Latency Was Hidden
Traces showed first-request latency after scaling events was 4.2s (vs. 800ms steady-state), entirely due to model loading in vLLM. This motivated our model pre-warming strategy on container startup.
MLflow vs. Alternatives — Why I Chose MLflow
| Criteria | MLflow Tracing | Langfuse | LangSmith | Datadog LLM |
|---|---|---|---|---|
| Open Source | Fully OSS (Apache 2.0) | OSS (self-hosted) | Proprietary (SaaS) | Proprietary |
| Self-Hosted | Yes — our data stays in AWS | Yes | No — data leaves your infra | No |
| Cost | ~$400/month (infra only) | ~$300/month (infra) | $400+/month (SaaS) | $2,000+/month |
| OTel Compatible | Native | Partial | No | Yes |
| Auto-Tracing | 8+ frameworks | 5+ frameworks | LangChain only | 4+ frameworks |
| Experiment Tracking | Built-in (same platform) | Limited | Limited | No |
| Model Registry | Built-in | No | No | No |
| PII Redaction | Built-in | Manual | Manual | Built-in |
| Production SDK Size | Small (mlflow-tracing) | Small | N/A (SaaS) | Agent-based |
Why MLflow won:
1. Data sovereignty — Traces contain user conversations; data must stay within our AWS VPC. LangSmith's SaaS model was a non-starter.
2. Ecosystem integration — We already used MLflow for experiment tracking and model registry. Adding tracing was additive, not a new platform.
3. Cost — At our volume (100K+ traces/day), Datadog would cost $24K+/year. MLflow costs $4.8K/year (infrastructure only).
4. OpenTelemetry — MLflow traces integrate with our existing X-Ray/OTel distributed tracing pipeline, giving us end-to-end visibility from API Gateway to LLM response.
Production Dashboards We Built
Trace-Derived Metrics (Grafana)
┌─────────────────────────────────────────────────────────────┐
│ MangaAssist LLM Pipeline — Real-Time Dashboard │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Latency Heatmap] [Trace Volume] │
│ P50: 340ms Traces/min: 1,850 │
│ P95: 780ms Error rate: 0.3% │
│ P99: 1,200ms Cache hit: 42% │
│ │
│ [Per-Span Breakdown] [Model Usage] │
│ Intent: 15ms (2%) Claude Sonnet: 58% │
│ RAG: 180ms (22%) Claude Haiku: 34% │
│ LLM: 420ms (52%) vLLM LoRA: 8% │
│ Guards: 85ms (10%) │
│ Other: 100ms (14%) [Token Costs] │
│ Input: $0.012/req │
│ [Error Traces] Output: $0.008/req │
│ Guardrail blocks: 3.2% Total: $0.020/req │
│ Timeout errors: 0.1% │
│ Model errors: 0.05% │
│ │
└─────────────────────────────────────────────────────────────┘
Alerting Rules (Derived from Traces)
| Alert | Condition | Action |
|---|---|---|
| Latency Spike | P95 > 1.5s for 5 min | Page on-call; investigate slowest span |
| Error Rate | > 1% for 3 min | Page on-call; check trace errors for root cause |
| Guardrail Block Rate | > 10% for 10 min | Notify AI team; likely model drift or prompt issue |
| Cache Hit Rate Drop | < 30% for 15 min | Investigate cache invalidation or traffic pattern change |
| Token Cost Spike | > 150% of baseline | Check for prompt length regression or routing misconfiguration |
Lessons Learned
1. Trace Everything, Sample in Production
- In staging: 100% trace sampling — catch every issue
- In production: 10% sampling — sufficient for pattern detection at our volume
- For errors: 100% sampling — always trace failed requests regardless of sampling rate
2. Async Logging Is Non-Negotiable
- Synchronous trace export adds 5-15ms per request
- Async logging (background thread) adds <1ms overhead
- At 1,800 req/min, synchronous export would have added roughly 9-27 seconds of blocking time every minute (1,800 × 5-15ms)
3. The Lightweight SDK Matters
- Full mlflow package: 200+ dependencies, 500MB+ install
- mlflow-tracing package: minimal dependencies, 95% smaller
- In a containerized production environment, this reduces image size and cold start time significantly
4. Combine Auto + Manual Tracing
- Auto-tracing catches 70% of the pipeline automatically
- Manual @mlflow.trace decorators fill the remaining 30% (custom logic, business rules)
- The result is a unified trace tree: no gaps, no blind spots
5. Traces Are a Debugging Multiplier
- Before MLflow: 45-90 minutes to debug a bad response
- After MLflow: 2-5 minutes — click the trace, see every step, find the problem
- 18x reduction in mean-time-to-diagnosis (MTTD)