04. ML Metrics Taxonomy — Full Reference + Production Application
"Choosing the right metric is as important as choosing the right model. I've seen teams optimize for accuracy when they should have optimized for recall, and measure F1 when the business needed conversion rate. Here's the full taxonomy of ML metrics I used, why each matters, and which ones actually drove production decisions."
Overview: Metric Categories in MangaAssist
```mermaid
graph TD
    subgraph "Classification Metrics"
        C1[Accuracy]
        C2[Precision]
        C3[Recall]
        C4[F1 Score]
        C5[Confusion Matrix]
        C6[AUC-ROC]
        C7[AUC-PR]
        C8[Log Loss]
    end
    subgraph "Ranking & Retrieval Metrics"
        R1["Recall@K"]
        R2["Precision@K"]
        R3[MRR]
        R4[NDCG]
        R5[MAP]
        R6[Hit Rate]
    end
    subgraph "Embedding Quality Metrics"
        E1[Cosine Similarity Distribution]
        E2[Embedding Clustering Quality]
        E3[Nearest Neighbor Accuracy]
        E4[Alignment & Uniformity]
    end
    subgraph "Reranking Metrics"
        RR1[Reranking Lift]
        RR2["NDCG@K Improvement"]
        RR3[Pairwise Accuracy]
    end
    C1 --> Applied1[Intent Classifier]
    R1 --> Applied2[RAG Pipeline]
    E1 --> Applied3[Embedding Model]
    RR1 --> Applied4[Cross-Encoder Reranker]
```
Metric Selection Decision Guide
"Which metric should I use?" — Use this decision tree to pick the right metric for your scenario.
| Your Scenario | Primary Metric | Secondary Metric | Why Not the Other? |
|---|---|---|---|
| Comparing two classifier models | Macro F1 | Per-class AUC-PR | Macro F1 treats all classes equally; AUC-PR catches rare-class weakness |
| Monitoring production classifier | Weighted F1 + confusion matrix | Classification confidence trend | Weighted reflects actual traffic impact |
| Evaluating imbalanced class (e.g., escalation at 5%) | AUC-PR | Recall | AUC-ROC is misleading for imbalanced classes (was 0.97 vs 0.82 AUC-PR) |
| Deciding precision vs recall priority | Intent-specific: see §1.3 table | F1 as tiebreaker | Different intents have different miss-costs |
| Comparing RAG retrieval strategies | MRR | Recall@3, NDCG@3 | MRR rewards getting the best doc first; Recall@3 is the coverage floor |
| Measuring RAG context noise | Precision@K | Effective quality (R×P) | Low precision = irrelevant docs in prompt = wasted tokens + hallucination risk |
| Evaluating embedding quality | Cosine similarity gap | Nearest neighbor accuracy | Gap directly predicts retrieval separation quality |
| Justifying the reranker cost | Reranking lift on NDCG@3 | Pairwise accuracy | Lift must exceed the 50ms latency cost |
| Checking model confidence calibration | Log loss | ECE (Expected Calibration Error) | Overconfident models skip fallback when they should use it |
Common Metric Mistakes to Avoid
| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Using only accuracy for a multi-class classifier | Hides class imbalance; 40% accuracy is achievable by always predicting the majority class | Pair accuracy with per-class F1 and confusion matrix |
| Using AUC-ROC for rare classes | Inflated by easy negative classification; our escalation was 0.97 AUC-ROC but 0.82 AUC-PR | Use AUC-PR for any class < 10% of traffic |
| Using Recall@K without Precision@K | High recall with low precision means you're injecting noise into the LLM prompt | Track effective quality: R@K × P@K |
| Optimizing Recall@10 when you only use 3 chunks | Recall@10 = 96% is meaningless if you inject top 3 only | Optimize for the K you actually use (Recall@3 for us) |
| Comparing models on offline test sets only | Offline accuracy overestimates production performance by 4-6% due to distribution mismatch | Always validate on production traffic sample |
Part 1: Classification Metrics (Intent Classifier)
These metrics evaluated our fine-tuned DistilBERT intent classifier, which categorized every user message into one of 8 intent classes.
1.1 Accuracy
Definition: Percentage of correctly classified messages out of all messages.
$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$
MangaAssist Application: Overall intent classification accuracy across all 8 intents.
| Context | Target | Actual (Production) |
|---|---|---|
| Offline test set (500 samples) | ≥ 90% | 92.1% |
| Production (weekly sample of 200) | ≥ 88% | 89.3% |
Why it matters but isn't enough: Accuracy hides class imbalance. If chitchat is 40% of messages and the model always predicts chitchat, accuracy could be 40% while being completely useless for every other intent. That's why I paired it with per-class metrics.
When I used it: Weekly dashboard metric, gate for model deployment (must be ≥ 90% on test set).
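A minimal sketch of how this gate could be computed, assuming scikit-learn; the label lists are synthetic illustrations, not MangaAssist data:

```python
from sklearn.metrics import accuracy_score

# Synthetic example: true vs. predicted intent labels for a test sample.
y_true = ["order_tracking", "faq", "chitchat", "escalation", "faq"]
y_pred = ["order_tracking", "faq", "chitchat", "faq", "faq"]

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.1%}")

# Deployment gate: block the model update if offline accuracy falls below 90%.
DEPLOYMENT_GATE = 0.90
if accuracy < DEPLOYMENT_GATE:
    raise SystemExit(f"Blocked: accuracy {accuracy:.1%} is below the {DEPLOYMENT_GATE:.0%} gate")
```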
1.2 Precision
Definition: Of all messages predicted as a given intent, what percentage actually were that intent?
$$\text{Precision}_c = \frac{\text{True Positives}_c}{\text{True Positives}_c + \text{False Positives}_c}$$
MangaAssist Application: Per-intent precision tells me how "trustworthy" each intent classification is.
| Intent | Precision | Interpretation |
|---|---|---|
| `order_tracking` | 0.96 | When we route to Order Service, we're almost always right |
| `recommendation` | 0.91 | Occasionally routes product questions to the reco engine |
| `escalation` | 0.89 | Some frustrated but non-escalation messages get escalated |
| `faq` | 0.87 | Overlaps with product_question — some product questions misrouted to FAQ |
Why precision matters for escalation: Every false positive escalation sends a user to a human agent unnecessarily — costing ~$5 per escalation. At 500K messages/day, even 1% false positive escalations = 5,000 unnecessary escalations = $25K/day wasted.
When I used it: Flagged when per-class precision dropped below 0.85. Drove training data augmentation for confused classes.
1.3 Recall
Definition: Of all messages that actually belong to a given intent, what percentage did the model correctly identify?
$$\text{Recall}_c = \frac{\text{True Positives}_c}{\text{True Positives}_c + \text{False Negatives}_c}$$
MangaAssist Application: Per-intent recall tells me how many messages I'm "missing" for each intent.
| Intent | Recall | Interpretation |
|---|---|---|
| `order_tracking` | 0.94 | Misses some indirect order queries ("where's my stuff?") |
| `recommendation` | 0.88 | Misses implicit recommendations ("I'm bored, what should I read?") |
| `escalation` | 0.92 | Catches most frustrated users |
| `return_request` | 0.90 | Misses euphemistic returns ("this wasn't what I expected") |
Why recall matters for return_request: A missed return request means the user doesn't get routed to the returns API. They get a generic LLM response that can't process the return — leading to escalation and frustration.
Precision-Recall tradeoff by intent:
| Intent | Priority | Rationale |
|---|---|---|
| `escalation` | Recall > Precision | Better to escalate unnecessarily than miss a frustrated user |
| `order_tracking` | Recall > Precision | Better to check order status unnecessarily than miss a delivery question |
| `recommendation` | Precision > Recall | A misrouted recommendation just goes to the generic LLM (acceptable degradation) |
| `chitchat` | Precision > Recall | Better to give a full LLM response to a greeting than template-respond to a real question |
1.4 F1 Score
Definition: Harmonic mean of precision and recall. Balances both.
$$F1_c = 2 \times \frac{\text{Precision}_c \times \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$
MangaAssist Application: Used as the single per-class metric for model comparison.
| Intent | Precision | Recall | F1 | Status |
|---|---|---|---|---|
| `recommendation` | 0.91 | 0.88 | 0.89 | ✅ Above 0.85 threshold |
| `product_question` | 0.88 | 0.90 | 0.89 | ✅ |
| `faq` | 0.87 | 0.85 | 0.86 | ✅ Borderline |
| `order_tracking` | 0.96 | 0.94 | 0.95 | ✅ |
| `return_request` | 0.93 | 0.90 | 0.91 | ✅ |
| `escalation` | 0.89 | 0.92 | 0.90 | ✅ |
| `promotion` | 0.85 | 0.82 | 0.83 | ⚠️ Below threshold — needs augmentation |
| `chitchat` | 0.94 | 0.96 | 0.95 | ✅ |
Deployment gate: All per-class F1 scores must be ≥ 0.85. If any class drops below, the model update is blocked until training data for that class is augmented.
Macro vs. Weighted F1:
- Macro F1 (unweighted average across classes): Used for DS model comparison — ensures rare classes aren't ignored.
- Weighted F1 (weighted by class frequency): Used for production monitoring — reflects actual user impact.
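A short sketch of how the per-class and averaged scores could be computed with scikit-learn; the labels below are synthetic and only cover a few of the 8 intents:

```python
from sklearn.metrics import classification_report, f1_score

# Synthetic labels over a few intents (illustration only).
y_true = ["recommendation", "faq", "escalation", "chitchat", "faq", "promotion"]
y_pred = ["recommendation", "faq", "chitchat", "chitchat", "promotion", "promotion"]

# Per-class precision / recall / F1 — what the ≥ 0.85 deployment gate checks.
print(classification_report(y_true, y_pred, zero_division=0))

# Macro F1: unweighted mean over classes — used for DS model comparison.
print("Macro F1:   ", f1_score(y_true, y_pred, average="macro", zero_division=0))

# Weighted F1: weighted by class frequency — used for production monitoring.
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted", zero_division=0))
```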
1.5 Confusion Matrix
Definition: A matrix showing predicted vs. actual class distributions.
MangaAssist Application: I used confusion matrices to identify specific class-pair confusions:
```text
              Predicted
              rec   prod  faq   order return esc   promo chat
Actual rec    264   18    5     0     0      3     8     2
       prod   22    269   8     0     0      1     0     0
       faq    3     12    255   0     5      5     15    5
       order  0     0     0     282   8      5     0     5
       return 0     2     5     5     270    8     5     5
       esc    2     0     3     3     5      276   1     10
       promo  8     0     18    0     3      2     246   23
       chat   0     0     2     3     2      8     10    275
```
Key insight from the confusion matrix: promotion ↔ faq and promotion ↔ chitchat were the most confused pairs. Queries like "Are there any deals?" could be a promotion query or a casual FAQ. This drove us to:
1. Add more promotion-specific training examples.
2. Consider merging promotion into faq (decided against it — promotion queries route to the Promotions Service API, while FAQ queries use RAG).
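A minimal sketch of how the matrix and its most confused off-diagonal pairs could be surfaced; the labels here are randomly generated placeholders standing in for the weekly labeled sample:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

INTENTS = ["recommendation", "product_question", "faq", "order_tracking",
           "return_request", "escalation", "promotion", "chitchat"]

# Synthetic stand-in for labeled production messages (not real data).
rng = np.random.default_rng(0)
y_true = rng.choice(INTENTS, size=500)
y_pred = np.where(rng.random(500) < 0.9, y_true, rng.choice(INTENTS, size=500))

cm = confusion_matrix(y_true, y_pred, labels=INTENTS)

# Rank off-diagonal cells to surface the most confused class pairs
# (e.g., promotion ↔ faq in the matrix above).
off_diag = [(INTENTS[i], INTENTS[j], cm[i, j])
            for i in range(len(INTENTS)) for j in range(len(INTENTS)) if i != j]
for actual, predicted, count in sorted(off_diag, key=lambda x: -x[2])[:5]:
    print(f"actual={actual:18s} predicted={predicted:18s} count={count}")
```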
1.6 AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
Definition: Measures the model's ability to distinguish between classes at all classification thresholds. AUC = 1.0 means perfect separation; AUC = 0.5 means random.
$$\text{AUC-ROC} = \int_0^1 \text{TPR}(t) \, d(\text{FPR}(t))$$
MangaAssist Application: I computed per-class AUC-ROC (one-vs-rest):
| Intent | AUC-ROC | Interpretation |
|---|---|---|
| `order_tracking` | 0.99 | Near-perfect separation — very distinct language patterns |
| `chitchat` | 0.98 | Easy to distinguish greetings |
| `recommendation` | 0.95 | Good separation but some overlap with product_question |
| `promotion` | 0.91 | Harder to separate from faq and chitchat |
When I used it: Primarily for DS model comparison (offline). Not a production monitoring metric — the operating threshold was fixed in production.
1.7 AUC-PR (Area Under the Precision-Recall Curve)
Definition: Like AUC-ROC but focused on the positive class. More informative for imbalanced classes.
MangaAssist Application: escalation intent was only ~5% of messages (imbalanced). AUC-ROC looked great (0.97) because the model could easily identify non-escalation messages. AUC-PR was more honest:
| Intent | % of Traffic | AUC-ROC | AUC-PR | Discrepancy |
|---|---|---|---|---|
| `escalation` | 5% | 0.97 | 0.82 | AUC-ROC was misleadingly high |
| `promotion` | 6% | 0.91 | 0.78 | Same pattern |
| `recommendation` | 35% | 0.95 | 0.93 | Balanced — both metrics agree |
Key insight: For rare intents, AUC-PR was the better metric. It revealed that our escalation classifier was weaker than AUC-ROC suggested.
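To illustrate the discrepancy, here is a sketch of a one-vs-rest evaluation for a rare class (~5% positives, like escalation) on synthetic scores, showing AUC-ROC staying high while AUC-PR drops:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic one-vs-rest setup: ~5% positives, like the escalation intent.
rng = np.random.default_rng(42)
y_true = (rng.random(10_000) < 0.05).astype(int)
# Positives tend to score higher, but with enough overlap to hurt precision.
scores = np.where(y_true == 1,
                  rng.normal(0.70, 0.15, y_true.shape),
                  rng.normal(0.35, 0.15, y_true.shape))

print("AUC-ROC:", roc_auc_score(y_true, scores))            # inflated by easy negatives
print("AUC-PR: ", average_precision_score(y_true, scores))  # sensitive to rare-class precision
```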
1.8 Log Loss (Cross-Entropy Loss)
Definition: Measures the quality of predicted probability distributions, not just the final classification. Penalizes confident wrong predictions more harshly.
$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c})$$
MangaAssist Application: I monitored log loss as a proxy for model confidence calibration:
| Model Version | Accuracy | Log Loss | Interpretation |
|---|---|---|---|
| V1 (initial) | 88.3% | 0.42 | Poorly calibrated — overconfident on wrong predictions |
| V2 (after augmentation) | 92.1% | 0.28 | Better calibrated |
| V3 (with temperature scaling) | 92.1% | 0.23 | Same accuracy, better calibration |
Why calibration matters: I used the classifier's confidence score to decide whether to use the rule-based path or fall back to BERT. If the model was overconfident (high confidence on a wrong prediction), it wouldn't fall back when it should.
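A minimal sketch of log loss plus single-parameter temperature scaling (the V3 approach in the table), assuming scipy and scikit-learn; the logits here are synthetic and use 3 classes standing in for the 8 intents:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import softmax
from sklearn.metrics import log_loss

# Synthetic validation logits (3 classes as a stand-in for the 8 intents).
rng = np.random.default_rng(7)
y_true = rng.integers(0, 3, size=1_000)
logits = rng.normal(0, 2, size=(1_000, 3))
logits[np.arange(1_000), y_true] += 2.0  # make the correct class likelier

def nll_at_temperature(temperature: float) -> float:
    """Log loss of temperature-scaled probabilities on the validation set."""
    probs = softmax(logits / temperature, axis=1)
    return log_loss(y_true, probs, labels=[0, 1, 2])

print("Log loss at T=1.0:", nll_at_temperature(1.0))

# Fit a single temperature on validation data; accuracy is unchanged, only calibration improves.
result = minimize_scalar(nll_at_temperature, bounds=(0.5, 5.0), method="bounded")
print(f"Best temperature: {result.x:.2f}, log loss: {result.fun:.3f}")
```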
Part 2: Ranking & Retrieval Metrics (RAG Pipeline)
These metrics evaluated the RAG retrieval pipeline: given a user query, did we retrieve the right documents from OpenSearch?
2.1 Recall@K
Definition: Percentage of queries where the correct document appears in the top K retrieved results.
$$\text{Recall@K} = \frac{\text{Number of queries with relevant doc in top K}}{\text{Total queries}}$$
MangaAssist Application:
| K Value | Recall | Use Case |
|---|---|---|
| Recall@1 | 62% | Not sufficient — only 62% of the time the top result is relevant |
| Recall@3 | 86% | Primary metric — we inject top 3 chunks into the prompt |
| Recall@5 | 92% | Good — but 5 chunks would blow our token budget |
| Recall@10 | 96% | Retrieval ceiling — nearly always in top 10 |
Target: Recall@3 ≥ 80%. We achieved 86% after embedding fine-tuning (up from 72% baseline).
Why Recall@3: We injected exactly 3 chunks into the LLM prompt (token budget constraint). If the correct document wasn't in those 3, the LLM couldn't generate a grounded answer.
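A minimal sketch of Recall@K over an evaluation set where each query has one gold document; the ranked IDs and gold IDs below are synthetic:

```python
def recall_at_k(ranked_doc_ids: list[list[str]], gold_doc_ids: list[str], k: int) -> float:
    """Fraction of queries whose gold document appears in the top-k retrieved results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_doc_ids, gold_doc_ids))
    return hits / len(gold_doc_ids)

# Synthetic eval set: each inner list is the retriever's ranking for one query.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
gold = ["d1", "d4", "d2"]

print("Recall@1:", recall_at_k(ranked, gold, k=1))
print("Recall@3:", recall_at_k(ranked, gold, k=3))
```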
2.2 Precision@K
Definition: Of the K retrieved documents, what fraction are actually relevant?
$$\text{Precision@K} = \frac{\text{Relevant docs in top K}}{K}$$
MangaAssist Application:
| K Value | Precision | Interpretation |
|---|---|---|
| Precision@1 | 78% | When we retrieve 1 doc, it's relevant 78% of the time |
| Precision@3 | 79% | Of 3 retrieved docs, ~2.4 are relevant on average |
| Precision@5 | 68% | Noise increases with more docs |
Why precision matters for RAG: Low precision means irrelevant documents are injected into the LLM prompt, which:
1. Wastes tokens (each irrelevant chunk costs ~500 tokens × $3/M input tokens = wasted money).
2. Confuses the LLM — irrelevant context can lead to hallucinations or off-topic responses.
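A sketch of Precision@K, plus the R@K × P@K "effective quality" product from the decision guide above; the relevance judgments are synthetic, and the R@3 value is carried over from the previous sketch:

```python
def precision_at_k(ranked_doc_ids: list[list[str]], relevant_sets: list[set[str]], k: int) -> float:
    """Average fraction of the top-k retrieved documents that are judged relevant."""
    per_query = [len(set(ranked[:k]) & relevant) / k
                 for ranked, relevant in zip(ranked_doc_ids, relevant_sets)]
    return sum(per_query) / len(per_query)

# Synthetic judgments: each set lists all documents judged relevant for that query.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = [{"d1", "d3"}, {"d4"}, {"d5", "d8"}]

p_at_3 = precision_at_k(ranked, relevant, k=3)
print("Precision@3:", p_at_3)

# "Effective quality": recall and precision have to be good together.
r_at_3 = 2 / 3  # e.g., the Recall@3 from the previous sketch
print("Effective quality (R@3 × P@3):", r_at_3 * p_at_3)
```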
2.3 MRR (Mean Reciprocal Rank)
Definition: Average of the reciprocal rank of the first relevant result. MRR = 1.0 means the first result is always relevant.
$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$
MangaAssist Application:
| Retrieval Method | MRR | Interpretation |
|---|---|---|
| Vector search only | 0.68 | First relevant chunk is ~1.5th position on average |
| BM25 keyword search only | 0.55 | Keyword alone is worse for semantic queries |
| Hybrid (vector + BM25 + RRF) | 0.75 | Fusion improves ranking |
| Hybrid + cross-encoder reranking | 0.81 | Reranker pushes relevant docs to top |
When I used it: MRR was the go-to metric for comparing retrieval strategies. Higher MRR = the LLM sees the most relevant chunk first, which matters because LLMs tend to weight early context more heavily.
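A minimal sketch of MRR, again assuming one gold document per query and synthetic rankings:

```python
def mean_reciprocal_rank(ranked_doc_ids: list[list[str]], gold_doc_ids: list[str]) -> float:
    """Mean of 1/rank of the first relevant document (contributes 0 if it never appears)."""
    total = 0.0
    for ranked, gold in zip(ranked_doc_ids, gold_doc_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_doc_ids)

ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
gold = ["d1", "d2", "d8"]
print("MRR:", mean_reciprocal_rank(ranked, gold))  # (1/2 + 1/1 + 1/3) / 3 ≈ 0.61
```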
2.4 NDCG (Normalized Discounted Cumulative Gain)
Definition: Measures ranking quality, accounting for the position of relevant results. Higher-ranked relevant results contribute more to the score.
$$\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$
MangaAssist Application: NDCG was particularly useful for multi-relevance queries. For a recommendation query like "dark fantasy manga," multiple documents were relevant but with varying degrees:
Relevance levels:
- 3 = Highly relevant (directly answers the query)
- 2 = Relevant (related but incomplete)
- 1 = Partially relevant (tangentially related)
- 0 = Irrelevant
| Retrieval Method | NDCG@3 | NDCG@5 |
|---|---|---|
| Vector search only | 0.72 | 0.69 |
| Hybrid + reranking | 0.84 | 0.81 |
When I used it: Primarily in the weekly RAG evaluation pipeline. More nuanced than Recall@K because it rewards putting the best documents first.
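A sketch of NDCG@K that follows the exponential-gain DCG formula above and uses the graded 0–3 relevance levels; the relevance vector is synthetic:

```python
import math

def dcg_at_k(relevances: list[int], k: int) -> float:
    """DCG with the exponential gain from the formula above: (2^rel - 1) / log2(position + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 2)  # i is 0-based, so position + 1 = i + 2
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[int], k: int) -> float:
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance (0–3) of the top 5 retrieved chunks, in retrieved order (synthetic).
retrieved_relevances = [2, 3, 0, 1, 2]
print("NDCG@3:", ndcg_at_k(retrieved_relevances, k=3))
print("NDCG@5:", ndcg_at_k(retrieved_relevances, k=5))
```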
2.5 MAP (Mean Average Precision)
Definition: Mean of the average precision scores across all queries. Average precision rewards ranking relevant documents higher.
$$\text{AP} = \frac{1}{|R|} \sum_{k=1}^{K} \text{Precision@k} \times rel(k)$$
MangaAssist Application: MAP was used as a single-number summary of retrieval quality for weekly reporting. Our MAP was 0.78 (hybrid + reranking), up from 0.64 (vector only).
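A short sketch of average precision per query and its mean across queries; the relevance judgments are synthetic:

```python
def average_precision(ranked_doc_ids: list[str], relevant: set[str]) -> float:
    """AP for one query: mean of Precision@k at each rank k that holds a relevant document."""
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(ranked_lists: list[list[str]], relevant_sets: list[set[str]]) -> float:
    return sum(average_precision(r, rel)
               for r, rel in zip(ranked_lists, relevant_sets)) / len(ranked_lists)

ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant = [{"d1", "d3"}, {"d4"}]
print("MAP:", mean_average_precision(ranked, relevant))
```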
2.6 Hit Rate
Definition: Simplest retrieval metric — was there at least one relevant document in the top K?
$$\text{Hit Rate@K} = \frac{\text{Queries with ≥ 1 relevant doc in top K}}{\text{Total queries}}$$
MangaAssist Application: Hit Rate@3 = 89%. This was the "hard floor" metric — if we couldn't even find one relevant document, the RAG pipeline was functionally broken for that query.
Part 3: Embedding Quality Metrics
3.1 Cosine Similarity Distribution
What I measured: The distribution of cosine similarity scores between query embeddings and their top retrieved documents.
Baseline Titan Embeddings:
- Avg similarity for relevant pairs: 0.72
- Avg similarity for irrelevant pairs: 0.45
- Separation gap: 0.27
Fine-tuned Adapter:
- Avg similarity for relevant pairs: 0.84
- Avg similarity for irrelevant pairs: 0.38
- Separation gap: 0.46
A wider separation gap = easier for the retrieval system to distinguish relevant from irrelevant documents. The fine-tuned adapter nearly doubled the gap.
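A minimal sketch of how the separation gap could be measured; the embeddings below are random placeholders, whereas in practice they would come from the base model vs. the fine-tuned adapter for labeled (query, document) pairs:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings (not real model outputs).
rng = np.random.default_rng(3)
dim = 384
queries = rng.normal(size=(100, dim))
relevant_docs = queries + rng.normal(scale=0.8, size=(100, dim))   # correlated with queries
irrelevant_docs = rng.normal(size=(100, dim))                      # unrelated

relevant_sims = [cosine(q, d) for q, d in zip(queries, relevant_docs)]
irrelevant_sims = [cosine(q, d) for q, d in zip(queries, irrelevant_docs)]

gap = np.mean(relevant_sims) - np.mean(irrelevant_sims)
print(f"Relevant mean: {np.mean(relevant_sims):.2f}  "
      f"Irrelevant mean: {np.mean(irrelevant_sims):.2f}  Gap: {gap:.2f}")
```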
3.2 Nearest Neighbor Accuracy
Definition: For a set of query-document pairs, is the correct document the nearest neighbor in embedding space?
MangaAssist Application: Nearest neighbor accuracy improved from 58% to 76% after fine-tuning — particularly for manga-specific vocabulary where the base model struggled.
3.3 Alignment & Uniformity
Two properties of good embedding spaces:
- Alignment: Semantically similar items should be close. Measured as the average distance between positive pairs.
- Uniformity: Embeddings should be spread across the space (not all clustered). Measured as the log of the pairwise Gaussian potential.
These were DS-internal metrics used during embedding adaptation training — not production monitoring metrics.
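For reference, a sketch of the standard alignment and uniformity losses (Wang & Isola, 2020) over L2-normalized embeddings; the positive pairs below are synthetic:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment(x: np.ndarray, y: np.ndarray, alpha: float = 2.0) -> float:
    """Mean distance between positive pairs; lower is better."""
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))

def uniformity(x: np.ndarray, t: float = 2.0) -> float:
    """Log of the mean pairwise Gaussian potential; lower means better spread."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    mask = ~np.eye(len(x), dtype=bool)  # exclude self-pairs
    return float(np.log(np.mean(np.exp(-t * sq_dists[mask]))))

# Synthetic positive pairs (query, relevant doc) projected onto the unit sphere.
rng = np.random.default_rng(5)
q = l2_normalize(rng.normal(size=(200, 64)))
d = l2_normalize(q + 0.3 * rng.normal(size=(200, 64)))

print("Alignment: ", alignment(q, d))
print("Uniformity:", uniformity(q))
```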
Part 4: Reranking Metrics
4.1 Reranking Lift
Definition: Improvement in retrieval quality from the cross-encoder reranker vs. raw vector search results.
| Metric | Before Reranking | After Reranking | Lift |
|---|---|---|---|
| Recall@3 | 78% | 86% | +8% |
| MRR | 0.75 | 0.81 | +0.06 |
| NDCG@3 | 0.72 | 0.84 | +0.12 |
The reranker added ~50ms of latency but significantly improved retrieval quality. The reranking lift justified the latency cost.
4.2 Pairwise Accuracy
Definition: Given two documents (one relevant, one irrelevant), does the reranker rank the relevant one higher?
MangaAssist Application: Pairwise accuracy = 93%. The 7% failure rate occurred mostly when both documents were partially relevant (relevance 1 vs. 2), not when comparing relevant vs. clearly irrelevant.
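A minimal sketch of pairwise accuracy for one query, where the scores are a hypothetical stand-in for the cross-encoder's outputs:

```python
from itertools import product

def pairwise_accuracy(scores: dict[str, float], relevant: set[str], irrelevant: set[str]) -> float:
    """Fraction of (relevant, irrelevant) document pairs where the relevant doc scores higher."""
    pairs = list(product(relevant, irrelevant))
    correct = sum(scores[r] > scores[i] for r, i in pairs)
    return correct / len(pairs) if pairs else 0.0

# Hypothetical reranker scores for one query's candidate documents.
scores = {"d1": 0.91, "d2": 0.34, "d3": 0.77, "d4": 0.62, "d5": 0.15}
relevant = {"d1", "d3"}
irrelevant = {"d2", "d4", "d5"}

print("Pairwise accuracy:", pairwise_accuracy(scores, relevant, irrelevant))
```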
Part 5: Metrics I Actually Tracked in Production (Focused Section)
"Out of the 20+ metrics above, these are the ones on my production dashboard that I checked daily and used to drive decisions."
Tier 1: Daily Dashboard (Alerted On)
| Metric | Target | Alert Threshold | Why It's Tier 1 |
|---|---|---|---|
| Intent classification accuracy (weekly sample) | ≥ 90% | < 88% | Misclassification cascades through the entire pipeline |
| RAG Recall@3 (weekly eval) | ≥ 80% | < 75% | Below this, LLM responses degrade noticeably |
| Classification confidence (avg) | ≥ 0.87 | < 0.82 | Dropping confidence = model uncertainty = drift signal |
| Fallback-to-BERT rate | ≤ 30% | > 40% | High fallback = rule engine coverage declining |
Tier 2: Weekly Review (DS Sync)
| Metric | Target | Review Trigger |
|---|---|---|
| Per-class F1 | ≥ 0.85 all classes | Any class < 0.83 |
| MRR | ≥ 0.78 | Drop > 5% week-over-week |
| Confusion matrix top confusions | <3% per off-diagonal cell | New confusion pair emerges |
| Embedding similarity gap | ≥ 0.40 | Gap narrows (embedding quality degradation) |
Tier 3: Monthly Deep Dive (Model Release Gate)
| Metric | Target | Blocks Release If |
|---|---|---|
| Full accuracy on golden dataset | ≥ 90% | < 88% |
| AUC-PR for rare intents (escalation, promotion) | ≥ 0.75 | < 0.70 |
| NDCG@3 on RAG eval set | ≥ 0.80 | < 0.75 |
| Reranking lift | > +5% on Recall@3 | Negative lift (reranker making things worse) |
Key Takeaways for Interviews
- "Accuracy alone is misleading" — I always pair accuracy with per-class F1 and confusion matrices. A 92% accurate model can have a 0.78 F1 on a critical intent class.
- "Different metrics for different stakeholders" — DS cares about AUC-ROC and F1. Engineering cares about latency and throughput. Business cares about escalation rate and conversion. The shared metrics (Tier 1) bridge these worlds.
- "Recall@3 was our north star retrieval metric" — Because we injected exactly 3 chunks, Recall@3 directly translated to "probability the LLM has good source material."
- "Precision-recall priority depends on the intent" — For `escalation`, recall matters more (never miss a frustrated user). For `recommendation`, precision matters more (don't route real questions to the reco engine).
- "AUC-PR is more honest than AUC-ROC for rare classes" — Our `escalation` AUC-ROC was 0.97 (misleading) vs. AUC-PR of 0.82 (honest).
- "Embedding quality metrics showed the impact of fine-tuning" — The cosine similarity gap nearly doubled after domain-specific fine-tuning, and this translated directly to +14% Recall@3.
Related Documents
- 02-data-scientist-collaboration.md — How DS and I jointly defined metric thresholds
- 03-tradeoffs-decisions.md — How metrics drove tradeoff decisions
- 13-metrics.md — Business, UX, AI quality, and operational metrics framework
- Challenges/real-world-challenges.md §7 — RAG Quality — Retrieval challenges and solutions