
Story 04 — How 0.5 became 0.7 (and why the difference mattered)

One-line: The team set the hallucination block threshold at the framework default (0.5) and started blocking 12% of legitimate responses. I rebuilt the calibration as an ROC analysis with explicit precision/recall trade-offs at alert vs. block tiers — false-positive blocks dropped 3× while we held the same hallucination catch rate.

Situation

Production hardening phase. The hallucination guardrail was live in shadow mode, scoring every LLM response against a fact-verification check (a mix of LLM-as-Judge against retrieved context, plus a structured-data validator for ASINs, prices, and order numbers). Single threshold at 0.5: any response scoring ≥0.5 was marked as a block (logged rather than enforced while in shadow mode).
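For concreteness, the single-threshold gate amounts to the sketch below (a minimal illustration with hypothetical names and fallback text, not the production code):

```python
# Illustrative sketch of the original single-threshold gate; names are assumptions.
BLOCK_THRESHOLD = 0.5  # framework default, never calibrated against labeled data

SAFE_FALLBACK = "I can't confirm that right now. Let me connect you with an associate."

def gate(response_text: str, hallucination_score: float) -> str:
    """Return what the customer would see: block anything scoring >= 0.5."""
    if hallucination_score >= BLOCK_THRESHOLD:
        return SAFE_FALLBACK
    return response_text
```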

Two weeks of shadow data showed the threshold was way too aggressive. ~12% of legitimate responses were being flagged — including ones where the LLM had paraphrased correctly but the verifier was conservative on synonyms. The team's first instinct was to retrain the verifier. Their second instinct was to bump the threshold to 0.6 and call it done.

Both were wrong, and the project couldn't ship until I fixed it.

Task

Calibrate the hallucination threshold so that we caught real hallucinations without blocking legitimate responses — and document the methodology so it could be re-applied to every future safety threshold.

Action

1. Built a 500-question labeled golden set. Each question was paired with the retrieved context, the LLM response, and a binary human label: is this response factually grounded in the retrieved context? This is the ground truth that calibration requires — without it, threshold-setting is just guessing. Stratified across intent types, with an oversample of edge cases (ambiguous attributions, paraphrases, partial facts).
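A golden-set record can be as flat as the sketch below; the field names are illustrative, not the team's actual labeling schema:

```python
# One golden-set record, assuming a simple JSONL-style layout.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GoldenExample:
    question: str                    # the customer question
    retrieved_context: str           # what retrieval handed to the LLM
    llm_response: str                # what the model actually said
    is_grounded: bool                # human label: factually grounded in the context?
    intent_type: str                 # stratification key (order status, product Q&A, ...)
    edge_case: Optional[str] = None  # "paraphrase", "partial_fact", "ambiguous_attribution", ...
```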

2. Ran the verifier against the golden set. Got the verifier's score for every (response, label) pair. This is now an offline scoring problem with known answers — exactly the setup for an ROC analysis.
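The scoring pass itself is a short loop. The sketch below assumes a JSONL golden set with the fields above and substitutes a naive token-overlap scorer for the real verifier (LLM-as-Judge plus structured-data validator) purely so the example runs; the file name and function names are hypothetical:

```python
# Offline scoring over the golden set: one verifier score per labeled example.
import json

def score_response(response: str, context: str) -> float:
    """Stand-in for the real verifier: 1 minus the fraction of response tokens
    that appear in the retrieved context. Only here to make the sketch runnable."""
    resp = set(response.lower().split())
    ctx = set(context.lower().split())
    if not resp:
        return 0.0
    return 1.0 - len(resp & ctx) / len(resp)

scores, labels = [], []
with open("golden_set.jsonl") as f:   # hypothetical file of GoldenExample records
    for line in f:
        ex = json.loads(line)
        scores.append(score_response(ex["llm_response"], ex["retrieved_context"]))
        labels.append(0 if ex["is_grounded"] else 1)   # positive class = hallucination
```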

3. Plotted the ROC curve. TPR (correctly catching hallucinations) on the y-axis, FPR (incorrectly blocking legitimate responses) on the x-axis, parameterized by threshold. Saw the curve's shape: a clear elbow around threshold 0.7, where FPR dropped sharply but TPR stayed high.
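The ROC computation is a few lines with scikit-learn. Synthetic scores stand in for the real verifier output below so the example is self-contained; the operating-point printout is the part that mattered in the calibration discussion:

```python
# ROC analysis over the labeled golden set (synthetic stand-in data so this runs).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)          # 1 = human-labeled hallucination
scores = np.clip(labels * 0.35 + rng.normal(0.35, 0.2, size=500), 0.0, 1.0)

fpr, tpr, thresholds = roc_curve(labels, scores)
print(f"AUC = {roc_auc_score(labels, scores):.3f}")

# The useful question is not "how accurate is the verifier?" but
# "what FPR do we pay for a given TPR at each candidate threshold?"
for t in (0.5, 0.6, 0.7):
    i = np.argmin(np.abs(thresholds - t))
    print(f"threshold≈{thresholds[i]:.2f}  TPR={tpr[i]:.2f}  FPR={fpr[i]:.2f}")
```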

4. Designed a two-tier threshold instead of a single one (sketched in code below):
  • Alert threshold = 0.5 — high recall, accepts false positives. Triggers a logged event, a soft warning to the system, and a feedback signal for the eval team. Doesn't block the customer's response. This is the "we want to know" tier.
  • Block threshold = 0.7 — high precision, only blocks responses we're confident are hallucinated. Returns a templated safe response to the customer instead. This is the "we will not ship this" tier.

The two-tier design separates observability from enforcement. Most teams collapse them into one threshold and have to choose between catching too much and blocking too much. The math doesn't require that trade-off; it just requires you to have two thresholds.
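In code, the two tiers are just two comparisons attached to two different actions. The sketch below is a minimal illustration; the logging helper, fallback text, and function names are assumptions, while the thresholds are the calibrated values from the ROC analysis:

```python
# Two-tier gate: alert (observe) at 0.5, block (enforce) at 0.7.
ALERT_THRESHOLD = 0.5   # high recall: log it, warn, feed the eval loop, don't block
BLOCK_THRESHOLD = 0.7   # high precision: only block what we're confident is hallucinated

SAFE_FALLBACK = "I can't confirm that right now. Let me connect you with an associate."

def log_event(name: str, **fields) -> None:
    """Stand-in for the real telemetry sink."""
    print(name, fields)

def gate(response_text: str, hallucination_score: float) -> str:
    """Return what the customer sees; emit observability signals along the way."""
    if hallucination_score >= ALERT_THRESHOLD:
        log_event("hallucination_alert", score=hallucination_score)
    if hallucination_score >= BLOCK_THRESHOLD:
        log_event("hallucination_block", score=hallucination_score)
        return SAFE_FALLBACK
    return response_text
```

Because the alert branch fires on everything the block branch fires on, enforcement never costs you the observability signal.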

5. Documented the methodology. Wrote it up in AI-Safety-Security-Governance/01-input-output-safety-controls/03-accuracy-verification-hallucination-control.md — risk classification, verification layer, confidence scoring, customer response strategy. This is a template for any binary safety guardrail going forward (toxicity, PII leakage, policy violation), not just hallucination.

The math/algorithmic depth that mattered

The whole story is a 200-level statistics problem dressed up as an engineering problem:

  • ROC curves and the precision-recall trade-off. Every binary classifier has a curve, not a number. Picking a single threshold is picking a single point on the curve — and you can only pick it sensibly if you've drawn the curve. The team's instinct to "bump 0.5 to 0.6" was equivalent to randomly walking along an unseen curve.
  • Cost asymmetry between FP and FN. A false positive (blocking a legitimate response) costs ~1 customer-frustration-event. A false negative (letting a hallucination through) costs ~1 trust-erosion-event, which is qualitatively much worse — wrong order info, wrong price, wrong product attribution all create real customer harm. Different costs → different optimal thresholds → two-tier design (a cost-weighted sketch follows this list).
  • Calibration vs. accuracy. The verifier doesn't need to be more accurate to be more useful — it needs to be better calibrated against a labeled set. The team's "retrain the verifier" instinct would have helped, but only after we'd already done the calibration work. You always calibrate first; you only retrain when calibration shows a gap that calibration can't close.
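One way to make the cost-asymmetry point concrete is to pick a threshold by minimizing expected cost over the ROC operating points. The sketch below reuses synthetic stand-in scores, and the 5:1 cost ratio is an illustrative assumption, not a measured value:

```python
# Cost-weighted threshold selection over ROC operating points (illustrative costs).
import numpy as np
from sklearn.metrics import roc_curve

COST_FP = 1.0   # blocking a legitimate response: one customer-frustration event
COST_FN = 5.0   # shipping a hallucination: trust erosion (assumed 5x worse here)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)          # 1 = hallucination
scores = np.clip(labels * 0.35 + rng.normal(0.35, 0.2, size=500), 0.0, 1.0)

fpr, tpr, thresholds = roc_curve(labels, scores)
p_pos = labels.mean()                          # base rate of hallucinations in the set
expected_cost = COST_FP * fpr * (1 - p_pos) + COST_FN * (1 - tpr) * p_pos
best = thresholds[np.argmin(expected_cost)]
print(f"cost-optimal threshold ≈ {best:.2f}")
```

With symmetric costs and a 50% base rate this reduces to maximizing TPR − FPR (Youden's J); writing the costs down makes it explicit that an alert's false positive is far cheaper than a block's, which is exactly why the two actions deserve two different thresholds.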

The leadership move

This story has two leadership beats. First, I refused the "just bump the threshold" shortcut. Easy to lose that battle when the team is under ship pressure — easy to say "fine, 0.6, ship it." Saying "no, we need 500 labels first" costs time but buys the right answer.

Second, I made the methodology reusable. The two-tier alert/block design is now the default for every binary guardrail in the system. Toxicity, PII detection, policy compliance — all have alert/block tiers calibrated against their own labeled sets. The labor invested once for hallucination is amortized across every future guardrail.

There's also a translation move I had to make for the PM: "we're going to spend two weeks labeling 500 responses" sounds like ML self-indulgence. I framed it as "we're going to invest two weeks now to avoid blocking 1 in 8 customer responses for the next two years." Calibration is product work, not science work.

Result

False-positive blocks dropped 3× against the shadow-mode baseline while the hallucination catch rate held, and the alert/block calibration pattern became the default for every subsequent binary guardrail (toxicity, PII detection, policy compliance).

What I'd want a future ML lead to take away

If a threshold matters to customer experience, calibrate it on a labeled set. If it doesn't matter, you don't need it. There is no in-between. "Default 0.5" is not a calibration; it's an unconsidered guess wearing the clothes of a decision.