04. ML Metrics Taxonomy — Full Reference + Production Application
"Choosing the right metric is as important as choosing the right model. I've seen teams optimize for accuracy when they should have optimized for recall, and measure F1 when the business needed conversion rate. Here's the full taxonomy of ML metrics I used, why each matters, and which ones actually drove production decisions."
Overview: Metric Categories in MangaAssist
```mermaid
graph TD
    subgraph "Classification Metrics"
        C1[Accuracy]
        C2[Precision]
        C3[Recall]
        C4[F1 Score]
        C5[Confusion Matrix]
        C6[AUC-ROC]
        C7[AUC-PR]
        C8[Log Loss]
    end
    subgraph "Ranking & Retrieval Metrics"
        R1["Recall@K"]
        R2["Precision@K"]
        R3[MRR]
        R4[NDCG]
        R5[MAP]
        R6[Hit Rate]
    end
    subgraph "Embedding Quality Metrics"
        E1[Cosine Similarity Distribution]
        E2[Embedding Clustering Quality]
        E3[Nearest Neighbor Accuracy]
        E4[Alignment & Uniformity]
    end
    subgraph "Reranking Metrics"
        RR1[Reranking Lift]
        RR2["NDCG@K Improvement"]
        RR3[Pairwise Accuracy]
    end
    C1 --> Applied1[Intent Classifier]
    R1 --> Applied2[RAG Pipeline]
    E1 --> Applied3[Embedding Model]
    RR1 --> Applied4[Cross-Encoder Reranker]
```
Metric Selection Decision Guide
"Which metric should I use?" — Use this decision tree to pick the right metric for your scenario.
| Your Scenario | Primary Metric | Secondary Metric | Why Not the Other? |
|---|---|---|---|
| Comparing two classifier models | Macro F1 | Per-class AUC-PR | Macro F1 treats all classes equally; AUC-PR catches rare-class weakness |
| Monitoring production classifier | Weighted F1 + confusion matrix | Classification confidence trend | Weighted reflects actual traffic impact |
| Evaluating imbalanced class (e.g., escalation at 5%) | AUC-PR | Recall | AUC-ROC is misleading for imbalanced classes (was 0.97 vs 0.82 AUC-PR) |
| Deciding precision vs recall priority | Intent-specific: see §1.3 table | F1 as tiebreaker | Different intents have different miss-costs |
| Comparing RAG retrieval strategies | MRR | Recall@3, NDCG@3 | MRR rewards getting the best doc first; Recall@3 is the coverage floor |
| Measuring RAG context noise | Precision@K | Effective quality (R×P) | Low precision = irrelevant docs in prompt = wasted tokens + hallucination risk |
| Evaluating embedding quality | Cosine similarity gap | Nearest neighbor accuracy | Gap directly predicts retrieval separation quality |
| Justifying the reranker cost | Reranking lift on NDCG@3 | Pairwise accuracy | Lift must exceed the 50ms latency cost |
| Checking model confidence calibration | Log loss | ECE (Expected Calibration Error) | Overconfident models skip fallback when they should use it |
Common Metric Mistakes to Avoid
| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Using only accuracy for a multi-class classifier | Hides class imbalance; 40% accuracy is achievable by always predicting the majority class | Pair accuracy with per-class F1 and confusion matrix |
| Using AUC-ROC for rare classes | Inflated by easy negative classification; our escalation was 0.97 AUC-ROC but 0.82 AUC-PR | Use AUC-PR for any class < 10% of traffic |
| Using Recall@K without Precision@K | High recall with low precision means you're injecting noise into the LLM prompt | Track effective quality: R@K × P@K |
| Optimizing Recall@10 when you only use 3 chunks | Recall@10 = 96% is meaningless if you inject top 3 only | Optimize for the K you actually use (Recall@3 for us) |
| Comparing models on offline test sets only | Offline accuracy overestimates production performance by 4-6% due to distribution mismatch | Always validate on production traffic sample |
Part 1: Classification Metrics (Intent Classifier)
These metrics evaluated our fine-tuned DistilBERT intent classifier, which categorized every user message into one of 8 intent classes.
1.1 Accuracy
Definition: Percentage of correctly classified messages out of all messages.
$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$
MangaAssist Application: Overall intent classification accuracy across all 8 intents.
| Context | Target | Actual (Production) |
|---|---|---|
| Offline test set (500 samples) | ≥ 90% | 92.1% |
| Production (weekly sample of 200) | ≥ 88% | 89.3% |
Why it matters but isn't enough: Accuracy hides class imbalance. If chitchat is 40% of messages and the model always predicts chitchat, accuracy could be 40% while being completely useless for every other intent. That's why I paired it with per-class metrics.
When I used it: Weekly dashboard metric, gate for model deployment (must be ≥ 90% on test set).
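A minimal sketch of how this gate could be computed, assuming scikit-learn; the label lists are synthetic illustrations, not MangaAssist data:

```python
from sklearn.metrics import accuracy_score

# Synthetic example: true vs. predicted intent labels for a test sample.
y_true = ["order_tracking", "faq", "chitchat", "escalation", "faq"]
y_pred = ["order_tracking", "faq", "chitchat", "faq", "faq"]

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.1%}")

# Deployment gate: block the model update if offline accuracy falls below 90%.
DEPLOYMENT_GATE = 0.90
if accuracy < DEPLOYMENT_GATE:
    raise SystemExit(f"Blocked: accuracy {accuracy:.1%} is below the {DEPLOYMENT_GATE:.0%} gate")
```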
1.2 Precision
Definition: Of all messages predicted as a given intent, what percentage actually were that intent?
$$\text{Precision}_c = \frac{\text{True Positives}_c}{\text{True Positives}_c + \text{False Positives}_c}$$
MangaAssist Application: Per-intent precision tells me how "trustworthy" each intent classification is.
| Intent | Precision | Interpretation |
|---|---|---|
| `order_tracking` | 0.96 | When we route to Order Service, we're almost always right |
| `recommendation` | 0.91 | Occasionally routes product questions to the reco engine |
| `escalation` | 0.89 | Some frustrated but non-escalation messages get escalated |
| `faq` | 0.87 | Overlaps with product_question — some product questions misrouted to FAQ |
Why precision matters for escalation: Every false positive escalation sends a user to a human agent unnecessarily — costing ~$5 per escalation. At 500K messages/day, even 1% false positive escalations = 5,000 unnecessary escalations = $25K/day wasted.
When I used it: Flagged when per-class precision dropped below 0.85. Drove training data augmentation for confused classes.
1.3 Recall
Definition: Of all messages that actually belong to a given intent, what percentage did the model correctly identify?
$$\text{Recall}_c = \frac{\text{True Positives}_c}{\text{True Positives}_c + \text{False Negatives}_c}$$
MangaAssist Application: Per-intent recall tells me how many messages I'm "missing" for each intent.
| Intent | Recall | Interpretation |
|---|---|---|
| `order_tracking` | 0.94 | Misses some indirect order queries ("where's my stuff?") |
| `recommendation` | 0.88 | Misses implicit recommendations ("I'm bored, what should I read?") |
| `escalation` | 0.92 | Catches most frustrated users |
| `return_request` | 0.90 | Misses euphemistic returns ("this wasn't what I expected") |
Why recall matters for return_request: A missed return request means the user doesn't get routed to the returns API. They get a generic LLM response that can't process the return — leading to escalation and frustration.
Precision-Recall tradeoff by intent:
| Intent | Priority | Rationale |
|---|---|---|
| `escalation` | Recall > Precision | Better to escalate unnecessarily than miss a frustrated user |
| `order_tracking` | Recall > Precision | Better to check order status unnecessarily than miss a delivery question |
| `recommendation` | Precision > Recall | A misrouted recommendation just goes to the generic LLM (acceptable degradation) |
| `chitchat` | Precision > Recall | Better to give a full LLM response to a greeting than template-respond to a real question |
1.4 F1 Score
Definition: Harmonic mean of precision and recall. Balances both.
$$F1_c = 2 \times \frac{\text{Precision}_c \times \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$
MangaAssist Application: Used as the single per-class metric for model comparison.
| Intent | Precision | Recall | F1 | Status |
|---|---|---|---|---|
| `recommendation` | 0.91 | 0.88 | 0.89 | ✅ Above 0.85 threshold |
| `product_question` | 0.88 | 0.90 | 0.89 | ✅ |
| `faq` | 0.87 | 0.85 | 0.86 | ✅ Borderline |
| `order_tracking` | 0.96 | 0.94 | 0.95 | ✅ |
| `return_request` | 0.93 | 0.90 | 0.91 | ✅ |
| `escalation` | 0.89 | 0.92 | 0.90 | ✅ |
| `promotion` | 0.85 | 0.82 | 0.83 | ⚠️ Below threshold — needs augmentation |
| `chitchat` | 0.94 | 0.96 | 0.95 | ✅ |
Deployment gate: All per-class F1 scores must be ≥ 0.85. If any class drops below, the model update is blocked until training data for that class is augmented.
Macro vs. Weighted F1:
- Macro F1 (unweighted average across classes): Used for DS model comparison — ensures rare classes aren't ignored.
- Weighted F1 (weighted by class frequency): Used for production monitoring — reflects actual user impact.
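A short sketch of how the per-class and averaged scores could be computed with scikit-learn; the labels below are synthetic and only cover a few of the 8 intents:

```python
from sklearn.metrics import classification_report, f1_score

# Synthetic labels over a few intents (illustration only).
y_true = ["recommendation", "faq", "escalation", "chitchat", "faq", "promotion"]
y_pred = ["recommendation", "faq", "chitchat", "chitchat", "promotion", "promotion"]

# Per-class precision / recall / F1 — what the ≥ 0.85 deployment gate checks.
print(classification_report(y_true, y_pred, zero_division=0))

# Macro F1: unweighted mean over classes — used for DS model comparison.
print("Macro F1:   ", f1_score(y_true, y_pred, average="macro", zero_division=0))

# Weighted F1: weighted by class frequency — used for production monitoring.
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted", zero_division=0))
```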
1.5 Confusion Matrix
Definition: A matrix showing predicted vs. actual class distributions.
MangaAssist Application: I used confusion matrices to identify specific class-pair confusions:
```text
              Predicted
              rec   prod  faq   order return esc   promo chat
Actual rec    264   18    5     0     0      3     8     2
       prod   22    269   8     0     0      1     0     0
       faq    3     12    255   0     5      5     15    5
       order  0     0     0     282   8      5     0     5
       return 0     2     5     5     270    8     5     5
       esc    2     0     3     3     5      276   1     10
       promo  8     0     18    0     3      2     246   23
       chat   0     0     2     3     2      8     10    275
```
Key insight from the confusion matrix: promotion ↔ faq and promotion ↔ chitchat were the most confused pairs. Queries like "Are there any deals?" could be a promotion query or a casual FAQ. This drove us to:
1. Add more promotion-specific training examples.
2. Consider merging promotion into faq (decided against it — promotion queries route to the Promotions Service API, while FAQ queries use RAG).
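A minimal sketch of how the matrix and its most confused off-diagonal pairs could be surfaced; the labels here are randomly generated placeholders standing in for the weekly labeled sample:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

INTENTS = ["recommendation", "product_question", "faq", "order_tracking",
           "return_request", "escalation", "promotion", "chitchat"]

# Synthetic stand-in for labeled production messages (not real data).
rng = np.random.default_rng(0)
y_true = rng.choice(INTENTS, size=500)
y_pred = np.where(rng.random(500) < 0.9, y_true, rng.choice(INTENTS, size=500))

cm = confusion_matrix(y_true, y_pred, labels=INTENTS)

# Rank off-diagonal cells to surface the most confused class pairs
# (e.g., promotion ↔ faq in the matrix above).
off_diag = [(INTENTS[i], INTENTS[j], cm[i, j])
            for i in range(len(INTENTS)) for j in range(len(INTENTS)) if i != j]
for actual, predicted, count in sorted(off_diag, key=lambda x: -x[2])[:5]:
    print(f"actual={actual:18s} predicted={predicted:18s} count={count}")
```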
1.6 AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
Definition: Measures the model's ability to distinguish between classes at all classification thresholds. AUC = 1.0 means perfect separation; AUC = 0.5 means random.
$$\text{AUC-ROC} = \int_0^1 \text{TPR}(t) \, d(\text{FPR}(t))$$
MangaAssist Application: I computed per-class AUC-ROC (one-vs-rest):
| Intent | AUC-ROC | Interpretation |
|---|---|---|
| `order_tracking` | 0.99 | Near-perfect separation — very distinct language patterns |
| `chitchat` | 0.98 | Easy to distinguish greetings |
| `recommendation` | 0.95 | Good separation but some overlap with product_question |
| `promotion` | 0.91 | Harder to separate from faq and chitchat |
When I used it: Primarily for DS model comparison (offline). Not a production monitoring metric — the operating threshold was fixed in production.
1.7 AUC-PR (Area Under the Precision-Recall Curve)
Definition: Like AUC-ROC but focused on the positive class. More informative for imbalanced classes.
MangaAssist Application: escalation intent was only ~5% of messages (imbalanced). AUC-ROC looked great (0.97) because the model could easily identify non-escalation messages. AUC-PR was more honest:
| Intent | % of Traffic | AUC-ROC | AUC-PR | Discrepancy |
|---|---|---|---|---|
| `escalation` | 5% | 0.97 | 0.82 | AUC-ROC was misleadingly high |
| `promotion` | 6% | 0.91 | 0.78 | Same pattern |
| `recommendation` | 35% | 0.95 | 0.93 | Balanced — both metrics agree |
Key insight: For rare intents, AUC-PR was the better metric. It revealed that our escalation classifier was weaker than AUC-ROC suggested.
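To illustrate the discrepancy, here is a sketch of a one-vs-rest evaluation for a rare class (~5% positives, like escalation) on synthetic scores, showing AUC-ROC staying high while AUC-PR drops:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic one-vs-rest setup: ~5% positives, like the escalation intent.
rng = np.random.default_rng(42)
y_true = (rng.random(10_000) < 0.05).astype(int)
# Positives tend to score higher, but with enough overlap to hurt precision.
scores = np.where(y_true == 1,
                  rng.normal(0.70, 0.15, y_true.shape),
                  rng.normal(0.35, 0.15, y_true.shape))

print("AUC-ROC:", roc_auc_score(y_true, scores))            # inflated by easy negatives
print("AUC-PR: ", average_precision_score(y_true, scores))  # sensitive to rare-class precision
```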
1.8 Log Loss (Cross-Entropy Loss)
Definition: Measures the quality of predicted probability distributions, not just the final classification. Penalizes confident wrong predictions more harshly.
$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c})$$
MangaAssist Application: I monitored log loss as a proxy for model confidence calibration:
| Model Version | Accuracy | Log Loss | Interpretation |
|---|---|---|---|
| V1 (initial) | 88.3% | 0.42 | Poorly calibrated — overconfident on wrong predictions |
| V2 (after augmentation) | 92.1% | 0.28 | Better calibrated |
| V3 (with temperature scaling) | 92.1% | 0.23 | Same accuracy, better calibration |
Why calibration matters: I used the classifier's confidence score to decide whether to use the rule-based path or fall back to BERT. If the model was overconfident (high confidence on a wrong prediction), it wouldn't fall back when it should.
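A minimal sketch of log loss plus single-parameter temperature scaling (the V3 approach in the table), assuming scipy and scikit-learn; the logits here are synthetic and use 3 classes standing in for the 8 intents:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import softmax
from sklearn.metrics import log_loss

# Synthetic validation logits (3 classes as a stand-in for the 8 intents).
rng = np.random.default_rng(7)
y_true = rng.integers(0, 3, size=1_000)
logits = rng.normal(0, 2, size=(1_000, 3))
logits[np.arange(1_000), y_true] += 2.0  # make the correct class likelier

def nll_at_temperature(temperature: float) -> float:
    """Log loss of temperature-scaled probabilities on the validation set."""
    probs = softmax(logits / temperature, axis=1)
    return log_loss(y_true, probs, labels=[0, 1, 2])

print("Log loss at T=1.0:", nll_at_temperature(1.0))

# Fit a single temperature on validation data; accuracy is unchanged, only calibration improves.
result = minimize_scalar(nll_at_temperature, bounds=(0.5, 5.0), method="bounded")
print(f"Best temperature: {result.x:.2f}, log loss: {result.fun:.3f}")
```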
Part 2: Ranking & Retrieval Metrics (RAG Pipeline)
These metrics evaluated the RAG retrieval pipeline: given a user query, did we retrieve the right documents from OpenSearch?
2.1 Recall@K
Definition: Percentage of queries where the correct document appears in the top K retrieved results.
$$\text{Recall@K} = \frac{\text{Number of queries with relevant doc in top K}}{\text{Total queries}}$$
MangaAssist Application:
| K Value | Recall | Use Case |
|---|---|---|
| Recall@1 | 62% | Not sufficient — only 62% of the time the top result is relevant |
| Recall@3 | 86% | Primary metric — we inject top 3 chunks into the prompt |
| Recall@5 | 92% | Good — but 5 chunks would blow our token budget |
| Recall@10 | 96% | Retrieval ceiling — nearly always in top 10 |
Target: Recall@3 ≥ 80%. We achieved 86% after embedding fine-tuning (up from 72% baseline).
Why Recall@3: We injected exactly 3 chunks into the LLM prompt (token budget constraint). If the correct document wasn't in those 3, the LLM couldn't generate a grounded answer.
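A minimal sketch of Recall@K over an evaluation set where each query has one gold document; the ranked IDs and gold IDs below are synthetic:

```python
def recall_at_k(ranked_doc_ids: list[list[str]], gold_doc_ids: list[str], k: int) -> float:
    """Fraction of queries whose gold document appears in the top-k retrieved results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_doc_ids, gold_doc_ids))
    return hits / len(gold_doc_ids)

# Synthetic eval set: each inner list is the retriever's ranking for one query.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
gold = ["d1", "d4", "d2"]

print("Recall@1:", recall_at_k(ranked, gold, k=1))
print("Recall@3:", recall_at_k(ranked, gold, k=3))
```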
2.2 Precision@K
Definition: Of the K retrieved documents, what fraction are actually relevant?
$$\text{Precision@K} = \frac{\text{Relevant docs in top K}}{K}$$
MangaAssist Application:
| K Value | Precision | Interpretation |
|---|---|---|
| Precision@1 | 78% | When we retrieve 1 doc, it's relevant 78% of the time |
| Precision@3 | 79% | Of 3 retrieved docs, ~2.4 are relevant on average |
| Precision@5 | 68% | Noise increases with more docs |
Why precision matters for RAG: Low precision means irrelevant documents are injected into the LLM prompt, which:
1. Wastes tokens (each irrelevant chunk costs ~500 tokens × $3/M input tokens = wasted money).
2. Confuses the LLM — irrelevant context can lead to hallucinations or off-topic responses.
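A sketch of Precision@K, plus the R@K × P@K "effective quality" product from the decision guide above; the relevance judgments are synthetic, and the R@3 value is carried over from the previous sketch:

```python
def precision_at_k(ranked_doc_ids: list[list[str]], relevant_sets: list[set[str]], k: int) -> float:
    """Average fraction of the top-k retrieved documents that are judged relevant."""
    per_query = [len(set(ranked[:k]) & relevant) / k
                 for ranked, relevant in zip(ranked_doc_ids, relevant_sets)]
    return sum(per_query) / len(per_query)

# Synthetic judgments: each set lists all documents judged relevant for that query.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = [{"d1", "d3"}, {"d4"}, {"d5", "d8"}]

p_at_3 = precision_at_k(ranked, relevant, k=3)
print("Precision@3:", p_at_3)

# "Effective quality": recall and precision have to be good together.
r_at_3 = 2 / 3  # e.g., the Recall@3 from the previous sketch
print("Effective quality (R@3 × P@3):", r_at_3 * p_at_3)
```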
2.3 MRR (Mean Reciprocal Rank)
Definition: Average of the reciprocal rank of the first relevant result. MRR = 1.0 means the first result is always relevant.
$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$
MangaAssist Application:
| Retrieval Method | MRR | Interpretation |
|---|---|---|
| Vector search only | 0.68 | First relevant chunk is ~1.5th position on average |
| BM25 keyword search only | 0.55 | Keyword alone is worse for semantic queries |
| Hybrid (vector + BM25 + RRF) | 0.75 | Fusion improves ranking |
| Hybrid + cross-encoder reranking | 0.81 | Reranker pushes relevant docs to top |
When I used it: MRR was the go-to metric for comparing retrieval strategies. Higher MRR = the LLM sees the most relevant chunk first, which matters because LLMs tend to weight early context more heavily.
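A minimal sketch of MRR, again assuming one gold document per query and synthetic rankings:

```python
def mean_reciprocal_rank(ranked_doc_ids: list[list[str]], gold_doc_ids: list[str]) -> float:
    """Mean of 1/rank of the first relevant document (contributes 0 if it never appears)."""
    total = 0.0
    for ranked, gold in zip(ranked_doc_ids, gold_doc_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_doc_ids)

ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
gold = ["d1", "d2", "d8"]
print("MRR:", mean_reciprocal_rank(ranked, gold))  # (1/2 + 1/1 + 1/3) / 3 ≈ 0.61
```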
2.4 NDCG (Normalized Discounted Cumulative Gain)
Definition: Measures ranking quality, accounting for the position of relevant results. Higher-ranked relevant results contribute more to the score.
$$\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$
MangaAssist Application: NDCG was particularly useful for multi-relevance queries. For a recommendation query like "dark fantasy manga," multiple documents were relevant but with varying degrees:
Relevance levels:
- 3 = Highly relevant (directly answers the query)
- 2 = Relevant (related but incomplete)
- 1 = Partially relevant (tangentially related)
- 0 = Irrelevant
| Retrieval Method | NDCG@3 | NDCG@5 |
|---|---|---|
| Vector search only | 0.72 | 0.69 |
| Hybrid + reranking | 0.84 | 0.81 |
When I used it: Primarily in the weekly RAG evaluation pipeline. More nuanced than Recall@K because it rewards putting the best documents first.
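A sketch of NDCG@K that follows the exponential-gain DCG formula above and uses the graded 0–3 relevance levels; the relevance vector is synthetic:

```python
import math

def dcg_at_k(relevances: list[int], k: int) -> float:
    """DCG with the exponential gain from the formula above: (2^rel - 1) / log2(position + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 2)  # i is 0-based, so position + 1 = i + 2
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[int], k: int) -> float:
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance (0–3) of the top 5 retrieved chunks, in retrieved order (synthetic).
retrieved_relevances = [2, 3, 0, 1, 2]
print("NDCG@3:", ndcg_at_k(retrieved_relevances, k=3))
print("NDCG@5:", ndcg_at_k(retrieved_relevances, k=5))
```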
2.5 MAP (Mean Average Precision)
Definition: Mean of the average precision scores across all queries. Average precision rewards ranking relevant documents higher.
$$\text{AP} = \frac{1}{|R|} \sum_{k=1}^{K} \text{Precision@k} \times rel(k)$$
MangaAssist Application: MAP was used as a single-number summary of retrieval quality for weekly reporting. Our MAP was 0.78 (hybrid + reranking), up from 0.64 (vector only).
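A short sketch of average precision per query and its mean across queries; the relevance judgments are synthetic:

```python
def average_precision(ranked_doc_ids: list[str], relevant: set[str]) -> float:
    """AP for one query: mean of Precision@k at each rank k that holds a relevant document."""
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(ranked_lists: list[list[str]], relevant_sets: list[set[str]]) -> float:
    return sum(average_precision(r, rel)
               for r, rel in zip(ranked_lists, relevant_sets)) / len(ranked_lists)

ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant = [{"d1", "d3"}, {"d4"}]
print("MAP:", mean_average_precision(ranked, relevant))
```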
2.6 Hit Rate
Definition: Simplest retrieval metric — was there at least one relevant document in the top K?
$$\text{Hit Rate@K} = \frac{\text{Queries with ≥ 1 relevant doc in top K}}{\text{Total queries}}$$
MangaAssist Application: Hit Rate@3 = 89%. This was the "hard floor" metric — if we couldn't even find one relevant document, the RAG pipeline was functionally broken for that query.
Part 3: Embedding Quality Metrics
3.1 Cosine Similarity Distribution
What I measured: The distribution of cosine similarity scores between query embeddings and their top retrieved documents.
Baseline Titan Embeddings:
- Avg similarity for relevant pairs: 0.72
- Avg similarity for irrelevant pairs: 0.45
- Separation gap: 0.27
Fine-tuned Adapter:
- Avg similarity for relevant pairs: 0.84
- Avg similarity for irrelevant pairs: 0.38
- Separation gap: 0.46
A wider separation gap = easier for the retrieval system to distinguish relevant from irrelevant documents. The fine-tuned adapter nearly doubled the gap.
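A minimal sketch of how the separation gap could be measured; the embeddings below are random placeholders, whereas in practice they would come from the base model vs. the fine-tuned adapter for labeled (query, document) pairs:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings (not real model outputs).
rng = np.random.default_rng(3)
dim = 384
queries = rng.normal(size=(100, dim))
relevant_docs = queries + rng.normal(scale=0.8, size=(100, dim))   # correlated with queries
irrelevant_docs = rng.normal(size=(100, dim))                      # unrelated

relevant_sims = [cosine(q, d) for q, d in zip(queries, relevant_docs)]
irrelevant_sims = [cosine(q, d) for q, d in zip(queries, irrelevant_docs)]

gap = np.mean(relevant_sims) - np.mean(irrelevant_sims)
print(f"Relevant mean: {np.mean(relevant_sims):.2f}  "
      f"Irrelevant mean: {np.mean(irrelevant_sims):.2f}  Gap: {gap:.2f}")
```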
3.2 Nearest Neighbor Accuracy
Definition: For a set of query-document pairs, is the correct document the nearest neighbor in embedding space?
MangaAssist Application: Nearest neighbor accuracy improved from 58% to 76% after fine-tuning — particularly for manga-specific vocabulary where the base model struggled.
3.3 Alignment & Uniformity
Two properties of good embedding spaces:
- Alignment: Semantically similar items should be close. Measured as the average distance between positive pairs.
- Uniformity: Embeddings should be spread across the space (not all clustered). Measured as the log of the pairwise Gaussian potential.
These were DS-internal metrics used during embedding adaptation training — not production monitoring metrics.
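For reference, a sketch of the standard alignment and uniformity losses (Wang & Isola, 2020) over L2-normalized embeddings; the positive pairs below are synthetic:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment(x: np.ndarray, y: np.ndarray, alpha: float = 2.0) -> float:
    """Mean distance between positive pairs; lower is better."""
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))

def uniformity(x: np.ndarray, t: float = 2.0) -> float:
    """Log of the mean pairwise Gaussian potential; lower means better spread."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    mask = ~np.eye(len(x), dtype=bool)  # exclude self-pairs
    return float(np.log(np.mean(np.exp(-t * sq_dists[mask]))))

# Synthetic positive pairs (query, relevant doc) projected onto the unit sphere.
rng = np.random.default_rng(5)
q = l2_normalize(rng.normal(size=(200, 64)))
d = l2_normalize(q + 0.3 * rng.normal(size=(200, 64)))

print("Alignment: ", alignment(q, d))
print("Uniformity:", uniformity(q))
```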
Part 4: Reranking Metrics
4.1 Reranking Lift
Definition: Improvement in retrieval quality from the cross-encoder reranker vs. raw vector search results.
| Metric | Before Reranking | After Reranking | Lift |
|---|---|---|---|
| Recall@3 | 78% | 86% | +8% |
| MRR | 0.75 | 0.81 | +0.06 |
| NDCG@3 | 0.72 | 0.84 | +0.12 |
The reranker added ~50ms of latency but significantly improved retrieval quality. The reranking lift justified the latency cost.
4.2 Pairwise Accuracy
Definition: Given two documents (one relevant, one irrelevant), does the reranker rank the relevant one higher?
MangaAssist Application: Pairwise accuracy = 93%. The 7% failure rate occurred mostly when both documents were partially relevant (relevance 1 vs. 2), not when comparing relevant vs. clearly irrelevant.
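A minimal sketch of pairwise accuracy for one query, where the scores are a hypothetical stand-in for the cross-encoder's outputs:

```python
from itertools import product

def pairwise_accuracy(scores: dict[str, float], relevant: set[str], irrelevant: set[str]) -> float:
    """Fraction of (relevant, irrelevant) document pairs where the relevant doc scores higher."""
    pairs = list(product(relevant, irrelevant))
    correct = sum(scores[r] > scores[i] for r, i in pairs)
    return correct / len(pairs) if pairs else 0.0

# Hypothetical reranker scores for one query's candidate documents.
scores = {"d1": 0.91, "d2": 0.34, "d3": 0.77, "d4": 0.62, "d5": 0.15}
relevant = {"d1", "d3"}
irrelevant = {"d2", "d4", "d5"}

print("Pairwise accuracy:", pairwise_accuracy(scores, relevant, irrelevant))
```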
Part 5: Metrics I Actually Tracked in Production (Focused Section)
"Out of the 20+ metrics above, these are the ones on my production dashboard that I checked daily and used to drive decisions."
Tier 1: Daily Dashboard (Alerted On)
| Metric | Target | Alert Threshold | Why It's Tier 1 |
|---|---|---|---|
| Intent classification accuracy (weekly sample) | ≥ 90% | < 88% | Misclassification cascades through the entire pipeline |
| RAG Recall@3 (weekly eval) | ≥ 80% | < 75% | Below this, LLM responses degrade noticeably |
| Classification confidence (avg) | ≥ 0.87 | < 0.82 | Dropping confidence = model uncertainty = drift signal |
| Fallback-to-BERT rate | ≤ 30% | > 40% | High fallback = rule engine coverage declining |
Tier 2: Weekly Review (DS Sync)
| Metric | Target | Review Trigger |
|---|---|---|
| Per-class F1 | ≥ 0.85 all classes | Any class < 0.83 |
| MRR | ≥ 0.78 | Drop > 5% week-over-week |
| Confusion matrix top confusions | <3% per off-diagonal cell | New confusion pair emerges |
| Embedding similarity gap | ≥ 0.40 | Gap narrows (embedding quality degradation) |
Tier 3: Monthly Deep Dive (Model Release Gate)
| Metric | Target | Blocks Release If |
|---|---|---|
| Full accuracy on golden dataset | ≥ 90% | < 88% |
| AUC-PR for rare intents (escalation, promotion) | ≥ 0.75 | < 0.70 |
| NDCG@3 on RAG eval set | ≥ 0.80 | < 0.75 |
| Reranking lift | > +5% on Recall@3 | Negative lift (reranker making things worse) |
Key Takeaways for Interviews
- "Accuracy alone is misleading" — I always pair accuracy with per-class F1 and confusion matrices. A 92% accurate model can have a 0.78 F1 on a critical intent class.
- "Different metrics for different stakeholders" — DS cares about AUC-ROC and F1. Engineering cares about latency and throughput. Business cares about escalation rate and conversion. The shared metrics (Tier 1) bridge these worlds.
- "Recall@3 was our north star retrieval metric" — Because we injected exactly 3 chunks, Recall@3 directly translated to "probability the LLM has good source material."
- "Precision-recall priority depends on the intent" — For `escalation`, recall matters more (never miss a frustrated user). For `recommendation`, precision matters more (don't route real questions to the reco engine).
- "AUC-PR is more honest than AUC-ROC for rare classes" — Our `escalation` AUC-ROC was 0.97 (misleading) vs. AUC-PR of 0.82 (honest).
- "Embedding quality metrics showed the impact of fine-tuning" — The cosine similarity gap nearly doubled after domain-specific fine-tuning, and this translated directly to +14% Recall@3.
Related Documents
- 02-data-scientist-collaboration.md — How DS and I jointly defined metric thresholds
- 03-tradeoffs-decisions.md — How metrics drove tradeoff decisions
- 13-metrics.md — Business, UX, AI quality, and operational metrics framework
- Challenges/real-world-challenges.md §7 — RAG Quality — Retrieval challenges and solutions