
US-07: Guardrail Strictness — Safety vs Latency vs User Experience

User Story

As a product safety lead, I want to calibrate guardrail strictness to block genuinely harmful or incorrect responses without over-blocking legitimate answers, so that users trust MangaAssist without being frustrated by overly cautious or slow responses.

The Debate

graph TD
    subgraph "Inference Team (Safety)"
        I["Block everything suspicious.<br/>One hallucinated price = lawsuit risk.<br/>One PII leak = regulatory action.<br/>False positives are safer<br/>than false negatives."]
    end

    subgraph "Performance Team"
        P["The guardrail pipeline adds<br/>100-500ms per request.<br/>That's 25% of our latency<br/>budget. Can we make some<br/>checks async?"]
    end

    subgraph "Cost Team"
        C["Guardrail compute costs $15K/month.<br/>Plus, every blocked response triggers<br/>a retry or fallback that<br/>costs MORE than the original.<br/>False positive rate of 5%<br/>means $22K in wasted retries."]
    end

    subgraph "Product Team"
        PT["Users hate 'I can't help with<br/>that' responses. Every false block<br/>is a lost engagement. Our CSAT<br/>drops 0.3 points for each<br/>unnecessary refusal."]
    end

    I ---|"Latency<br/>cost"| P
    P ---|"Compute<br/>cost"| C
    C ---|"User<br/>experience"| PT
    PT ---|"Safety<br/>risk"| I

    style I fill:#ff6b6b,stroke:#333,color:#000
    style P fill:#4ecdc4,stroke:#333,color:#000
    style C fill:#f9d71c,stroke:#333,color:#000
    style PT fill:#a29bfe,stroke:#333,color:#000

Acceptance Criteria

  • Zero PII leaks in production (PII check is always sync, never skipped).
  • Price accuracy is 100% (price validation is always sync).
  • False positive rate (legitimate responses blocked) stays under 2%.
  • Guardrail latency stays under 100ms for sync checks.
  • User complaints about "I can't help" responses stay under 0.5% of sessions.

The Guardrail Pipeline and Its Costs

graph TD
    A["LLM Response"] --> B["Sync Guardrails<br/>(must pass before delivery)"]
    B --> C{"Pass?"}
    C -->|"Yes"| D["Deliver to User"]
    C -->|"No"| E["Block → Fallback Response"]

    D --> F["Async Guardrails<br/>(run after delivery)"]
    F --> G{"Pass?"}
    G -->|"Yes"| H["Log clean"]
    G -->|"No"| I["Flag for review<br/>Optional: retract + replace"]

    subgraph "Sync Checks — 100ms budget"
        S1["PII Detection<br/>20ms | Precision: 0.98"]
        S2["Price Validation<br/>30ms | Precision: 1.0"]
        S3["ASIN Existence<br/>30ms | Precision: 1.0"]
        S4["Toxicity Quick Scan<br/>20ms | Precision: 0.94"]
    end

    subgraph "Async Checks — No time budget"
        A1["Hallucination Scoring<br/>300ms | Precision: 0.87"]
        A2["Competitor Mention<br/>50ms | Precision: 0.96"]
        A3["Scope Drift<br/>200ms | Precision: 0.82"]
        A4["Full Quality Score<br/>400ms | Precision: 0.79"]
    end

    style B fill:#eb3b5a,stroke:#333,color:#fff
    style F fill:#fd9644,stroke:#333,color:#000
    style D fill:#2d8659,stroke:#333,color:#fff
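The sync/async split above can be sketched as a small dispatcher: sync checks gate delivery, async checks run after the user has the response. This is a minimal sketch; the check functions are stand-in heuristics, not the real PII/price/toxicity services.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in checks: each returns True when the response passes.
# Real implementations would call the PII, price, and toxicity services.
def pii_check(text):      return "credit card" not in text.lower()
def price_check(text):    return "$0.00" not in text
def toxicity_check(text): return True  # stub

SYNC_CHECKS = [pii_check, price_check, toxicity_check]

flagged = []  # async findings land here for human review

def run_async_guardrails(response):
    # Stand-in for hallucination/scope/quality scoring after delivery.
    if "guaranteed" in response:
        flagged.append(response)

def handle_response(response, pool):
    # Sync gate: any failure blocks delivery and returns a fallback.
    if not all(check(response) for check in SYNC_CHECKS):
        return "Sorry, I can't help with that."
    # Deliver immediately; async checks run after delivery.
    pool.submit(run_async_guardrails, response)
    return response

with ThreadPoolExecutor(max_workers=2) as pool:
    ok = handle_response("Demon Slayer Vol 1 is $9.99", pool)
    blocked = handle_response("Your credit card ending in 1234 was charged", pool)
```

The key design point mirrors the diagram: a blocked response never reaches the user, while async findings only produce review flags (or retractions) after delivery.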

The Precision-Recall Problem for Each Guardrail

graph LR
    subgraph "High Precision, Low Recall (Current)"
        HP["Blocks: 2% of responses<br/>False positives: 0.3%<br/>Missed violations: 1.5%<br/>✅ Users rarely frustrated<br/>⚠️ Some bad responses slip through"]
    end

    subgraph "High Recall, Low Precision (Aggressive)"
        HR["Blocks: 8% of responses<br/>False positives: 5%<br/>Missed violations: 0.2%<br/>❌ Users frequently frustrated<br/>✅ Almost nothing slips through"]
    end

    subgraph "Balanced (Target)"
        BAL["Blocks: 3.5% of responses<br/>False positives: 1%<br/>Missed violations: 0.8%<br/>✅ Acceptable frustration<br/>✅ Acceptable safety"]
    end

    style HP fill:#fd9644,stroke:#333,color:#000
    style HR fill:#eb3b5a,stroke:#333,color:#fff
    style BAL fill:#2d8659,stroke:#333,color:#fff

Guardrail-by-Guardrail Tradeoff Analysis

1. PII Detection

| Strictness | Threshold | False Positive Rate | False Negative Rate | Example FP | Decision |
|---|---|---|---|---|---|
| Strict | Block any 10+ digit number | 3% | 0.1% | ISBN numbers blocked as phone numbers | Too aggressive |
| Moderate | Pattern + context check | 0.5% | 0.3% | Rare: formatted order IDs flagged | Selected |
| Lenient | Only obvious patterns | 0.1% | 2% | Partial credit card numbers slip through | Too risky |

Decision: Moderate. PII is a sync-mandatory check. The 0.5% false positive rate is acceptable because the cost of a PII leak (regulatory, reputational) vastly outweighs the cost of occasionally blocking an ISBN.
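The "moderate" tier can be illustrated with a pattern-plus-context check. This is a toy sketch: the regexes below are illustrative, and a production detector would cover far more patterns and contexts.

```python
import re

# A long digit run alone is ambiguous (ISBNs, order IDs), so the
# pattern only fires when nearby context suggests contact or payment info.
DIGIT_RUN = re.compile(r"\b\d[\d\- ]{8,}\d\b")
CONTACT_CONTEXT = re.compile(r"\b(phone|call|text|mobile|card|ssn)\b", re.I)

def contains_pii(text):
    if not DIGIT_RUN.search(text):
        return False
    # Context check is what keeps the false positive rate near 0.5%:
    # an ISBN matches the digit run but lacks contact context.
    return bool(CONTACT_CONTEXT.search(text))
```

Under this scheme "Call me at 415-555-0199" is flagged, while "ISBN 978-1-9747-0052-3 is in stock" passes.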

2. Price Validation

graph TD
    A["LLM says:<br/>'Demon Slayer Vol 1 is $9.99'"] --> B{"Catalog says<br/>price is $9.99?"}
    B -->|"Match"| C["✅ Pass"]
    B -->|"Mismatch"| D["Replace with catalog price"]
    B -->|"No price in catalog"| E["Remove price from response"]

    style C fill:#2d8659,stroke:#333,color:#fff
    style D fill:#fd9644,stroke:#333,color:#000
    style E fill:#eb3b5a,stroke:#333,color:#fff

Decision: Always strict, always sync. Amazon cannot serve wrong prices. This check is non-negotiable and adds only 30ms (the catalog lookup is typically cached but validated against the live source).
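The three branches in the diagram (pass, replace, remove) can be sketched directly. The catalog dict and regex here are hypothetical placeholders for the cached catalog lookup.

```python
import re

CATALOG = {"Demon Slayer Vol 1": 9.99}  # hypothetical catalog cache

PRICE = re.compile(r"\$(\d+\.\d{2})")

def validate_price(response, title):
    m = PRICE.search(response)
    catalog_price = CATALOG.get(title)
    if m is None:
        return response  # no price claimed: pass
    if catalog_price is None:
        # No price in catalog: remove the claim from the response.
        return PRICE.sub("", response).strip()
    if float(m.group(1)) != catalog_price:
        # Mismatch: replace with the catalog price.
        return PRICE.sub(f"${catalog_price:.2f}", response)
    return response  # match: pass
```

Note that this check repairs rather than blocks: a wrong price is corrected in place, so the user still gets an answer.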

3. Hallucination Scoring — The Hardest Tradeoff

graph TD
    subgraph "The Dilemma"
        D1["Real-time hallucination scoring<br/>takes 300ms and is only<br/>87% precise.<br/><br/>Async scoring catches issues<br/>after delivery but user already<br/>saw the hallucination."]
    end

    subgraph "Option A: Sync Scoring"
        A1["✅ Catches hallucinations before user sees them<br/>❌ Adds 300ms to every response<br/>❌ 13% false positive rate blocks good responses<br/>❌ $8K/month in compute"]
    end

    subgraph "Option B: Async Scoring"
        B1["✅ Zero latency impact<br/>✅ Zero false-positive UX impact<br/>❌ User sees hallucination for 2-5 seconds<br/>❌ Retraction UX is awkward"]
    end

    subgraph "Option C: Hybrid (Decision)"
        C1["Sync: lightweight NLI check (50ms)<br/>blocks obvious hallucinations<br/><br/>Async: full scoring (300ms)<br/>catches subtle hallucinations<br/>and retract if needed"]
    end

    style A1 fill:#eb3b5a,stroke:#333,color:#fff
    style B1 fill:#fd9644,stroke:#333,color:#000
    style C1 fill:#2d8659,stroke:#333,color:#fff

Decision: Hybrid. A lightweight sync NLI check (50ms, 0.90 precision) catches the obvious cases — like claiming a product is "free" or is by a wrong author. Full hallucination scoring runs async and triggers retraction only for serious violations.
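The hybrid split can be sketched as two stages. Keyword heuristics stand in for the real models here: the sync gate would actually be a small NLI model, and the async scorer a full hallucination-scoring pipeline.

```python
# Stage 1 (sync, ~50ms budget): block only obvious contradictions
# of catalog facts. Red flags below are illustrative stand-ins.
OBVIOUS_RED_FLAGS = ("is free", "costs nothing")

def sync_nli_check(response):
    return not any(flag in response.lower() for flag in OBVIOUS_RED_FLAGS)

# Stage 2 (async, ~300ms): full scoring after delivery. Stub score here.
def async_full_score(response):
    return 0.9 if "invented" in response else 0.1

RETRACT_THRESHOLD = 0.8  # retract only for serious violations

def post_delivery(response):
    return "retract" if async_full_score(response) >= RETRACT_THRESHOLD else "keep"
```

The sync stage pays its 50ms on every request; the async stage pays nothing in user-visible latency but can only retract, not prevent.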

4. Competitor Mention Filter

| Approach | What It Catches | FP Rate | Latency | Decision |
|---|---|---|---|---|
| Keyword blocklist | "Barnes & Noble", "Viz Direct", etc. | 0.1% | 2ms | Sync — fast enough |
| Semantic comparison | Mentions that imply competitor comparison | 4% | 150ms | Async only |
| No filter | Nothing | 0% | 0ms | Too risky for Amazon brand |

Decision: Keyword blocklist in sync (2ms, negligible cost). Semantic comparison async for borderline cases.
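The sync half is trivially cheap, which is why it fits in the 100ms budget. A minimal sketch, using the names from the table above:

```python
# Sync keyword blocklist: ~2ms, negligible compute.
BLOCKLIST = ("barnes & noble", "viz direct")

def mentions_competitor(text):
    lowered = text.lower()
    return any(name in lowered for name in BLOCKLIST)
```

Borderline phrasings that a blocklist misses ("available cheaper elsewhere") are exactly what the async semantic comparison is for.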

5. Scope Drift Detection

graph TD
    A["User: 'Tell me about<br/>the history of Japan'"] --> B{"Is this manga/Amazon<br/>related?"}
    B -->|"No"| C["Redirect:<br/>'I specialize in manga.<br/>Can I help you find<br/>a good manga about<br/>Japanese history?'"]
    B -->|"Borderline"| D["Answer briefly,<br/>then redirect"]
    B -->|"Yes"| E["Answer normally"]

    style C fill:#fd9644,stroke:#333,color:#000
    style D fill:#f9d71c,stroke:#333,color:#000
    style E fill:#2d8659,stroke:#333,color:#fff

The tradeoff: Strict scope enforcement frustrates users who ask tangentially related questions. Lenient scope allows the chatbot to ramble off-topic, wasting LLM tokens and potentially generating ungrounded content.

Decision: Moderate scope with graceful redirect. The chatbot acknowledges the question, briefly answers if it can, then steers back to manga. Full scope-drift detection runs async.
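The three-way routing above can be sketched with a toy classifier. Keyword lists here are illustrative stand-ins for a real intent/scope model.

```python
# Toy scope router for the three branches in the diagram.
MANGA_TERMS = ("manga", "volume", "series", "author")
ADJACENT_TERMS = ("japan", "anime")  # tangentially related topics

def route_scope(question):
    q = question.lower()
    if any(term in q for term in MANGA_TERMS):
        return "answer"
    if any(term in q for term in ADJACENT_TERMS):
        return "answer_briefly_then_redirect"
    return "redirect"
```

The middle branch is what makes the enforcement "moderate": tangential questions get a brief answer plus a steer back to manga instead of a flat refusal.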


The False Positive Cost Model

Every false positive has a measurable cost:

graph TD
    A["False Positive<br/>(legitimate response blocked)"] --> B["Fallback Response Generated<br/>(+$0.008 for retry with template)"]
    A --> C["User Frustration<br/>(-0.3 CSAT points)"]
    A --> D["Potential Escalation<br/>($4-8 per human agent minute)"]
    A --> E["Engagement Loss<br/>(15% of blocked users leave session)"]

    style A fill:#eb3b5a,stroke:#333,color:#fff

False Positive Cost Calculation

| FP Rate | Daily FP Count (1M requests) | Retry Cost | CSAT Impact | Escalation Cost | Total Daily Cost |
|---|---|---|---|---|---|
| 1% | 10,000 | $80 | measurable, localized | $400 (1% escalate) | $480 |
| 2% | 20,000 | $160 | noticeable | $1,600 (2% escalate) | $1,760 |
| 5% | 50,000 | $400 | significant | $8,000 (4% escalate) | $8,400 |
| 8% | 80,000 | $640 | severe | $19,200 (6% escalate) | $19,840 |

Target: under 2% FP rate. The jump from 2% to 5% costs $6,640/day extra ($199K/month) in escalations and retries alone.
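The table's rows follow from two inputs: the $0.008 retry cost per false positive and a $4 cost per escalation (the low end of the $4-8/minute range, assuming one minute each). A small sketch of the calculation:

```python
# Daily false-positive cost: retries plus human escalations.
def daily_fp_cost(fp_rate, escalation_rate, requests=1_000_000,
                  retry_cost=0.008, escalation_cost=4.0):
    fp_count = requests * fp_rate
    return fp_count * retry_cost + fp_count * escalation_rate * escalation_cost
```

For example, the 1% row is 10,000 FPs × $0.008 + 100 escalations × $4 = $480/day, and the 2%-to-5% jump works out to the $6,640/day cited above.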


Intent-Specific Guardrail Profiles

| Intent | Sync Checks | Async Checks | Strictness | Rationale |
|---|---|---|---|---|
| recommendation | PII, price, ASIN, toxicity, NLI | Full hallucination, scope, quality | High | Recommendations influence purchases |
| faq | PII, toxicity | Hallucination, scope | Moderate | Factual, lower risk |
| product_question | PII, price, ASIN | Hallucination | High | Price/availability must be accurate |
| order_tracking | PII | None (template response) | Low | Template path, no LLM output |
| chitchat | Toxicity | None | Minimal | Low risk, template response |
| return_request | PII, policy accuracy | Full pipeline | High | Incorrect policy info = support escalation |
| checkout_help | PII, price | Hallucination | High | Purchase-adjacent, accuracy critical |
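In code, these profiles reduce to a lookup table keyed by intent. A partial sketch (profile names abbreviated from the table above; the fallback-to-strictest choice is an assumption, not stated in the table):

```python
# Per-intent guardrail profiles, mirroring a subset of the table above.
PROFILES = {
    "recommendation": {"sync": ["pii", "price", "asin", "toxicity", "nli"],
                       "async": ["hallucination", "scope", "quality"]},
    "faq":            {"sync": ["pii", "toxicity"],
                       "async": ["hallucination", "scope"]},
    "order_tracking": {"sync": ["pii"], "async": []},  # template path
    "chitchat":       {"sync": ["toxicity"], "async": []},
}

def checks_for(intent):
    # Unknown intents fall back to the strictest known profile.
    return PROFILES.get(intent, PROFILES["recommendation"])
```

Keeping the profiles in data rather than code makes the reversal-trigger tuning below a config change instead of a deploy.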

Monitoring: The Guardrail Health Dashboard

graph TD
    A["Guardrail Dashboard"] --> B["Block Rate"]
    A --> C["False Positive Rate"]
    A --> D["False Negative Rate"]
    A --> E["Latency Impact"]

    B --> B1["Block rate by guardrail<br/>Block rate by intent<br/>Block rate trend (7d)"]
    C --> C1["FP rate by guardrail<br/>FP review queue<br/>FP examples for tuning"]
    D --> D1["Escaped violations caught by async<br/>User-reported issues<br/>Escalation root causes"]
    E --> E1["Sync check latency p50/p95<br/>Async check completion time<br/>Retry/fallback rate"]

    style A fill:#54a0ff,stroke:#333,color:#000

2026 Update: Shift Guardrails from Block-Only to Detect, Repair, and Abstain

Treat everything above this section as the baseline guardrail architecture. This update preserves that sync/async design and describes how it evolves toward a more repair-oriented approach.

The most effective recent safety stacks reduce failures by changing the response shape and repair path, not only by blocking after the fact.

  • Use structured outputs or constrained decoding for policy and catalog-facing responses so downstream validation checks typed fields instead of brittle free-form prose.
  • Keep deterministic sync checks for zero-tolerance fields such as PII, price, policy IDs, and ASIN validity, but prefer repair paths for many other failures: rewrite, re-ground, strip unsafe claims, or ask a clarifying question.
  • For high-stakes policy answers, consider Amazon Bedrock Automated Reasoning checks. They validate generated claims against formalized rules, but they are detect-mode, non-streaming, and should be used selectively where the added latency is justified.
  • Split the guardrail budget into input safety, output safety, tool-result validation, and business-rule validation. One blended "guardrail latency" number hides where the real cost is.
  • Measure intervention outcomes by type: blocked, rewritten, clarified, retried, or human-escalated. That gives a more useful optimization signal than block rate alone.

Recent references: Bedrock Automated Reasoning overview, Automated Reasoning concepts, Integrate Automated Reasoning checks, Bedrock Guardrails CloudWatch metrics, vLLM structured outputs.

Reversal Triggers

| Trigger | Action |
|---|---|
| False positive rate exceeds 3% for any guardrail | Loosen threshold; review recent FP examples and tune |
| A PII leak reaches a user (false negative) | Tighten PII detection immediately; add the pattern that was missed |
| Guardrail sync latency exceeds 100ms p95 | Move the slowest sync check to async |
| "Can't help" complaints exceed 1% of sessions | Identify which guardrail is over-blocking; tune or change to redirect instead of block |
| Hallucination reported by user that async check missed | Improve hallucination scoring model; consider promoting lightweight check to sync |
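The triggers lend themselves to automated evaluation against dashboard metrics. A sketch, with hypothetical metric names and the thresholds from the table:

```python
# Evaluate reversal triggers against observed metrics.
# Metric keys are illustrative; thresholds come from the table above.
def reversal_actions(metrics):
    actions = []
    if metrics.get("fp_rate", 0) > 0.03:          # >3% FP rate
        actions.append("loosen_threshold")
    if metrics.get("pii_leaks", 0) > 0:           # any leak
        actions.append("tighten_pii")
    if metrics.get("sync_p95_ms", 0) > 100:       # p95 over budget
        actions.append("demote_slowest_sync_check")
    if metrics.get("cant_help_rate", 0) > 0.01:   # >1% of sessions
        actions.append("tune_overblocking_guardrail")
    return actions
```

Wiring this into the dashboard turns the reversal triggers from a review-meeting checklist into alerts.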

Impact on Trilemma

| Dimension | Minimal Guardrails | Aggressive Guardrails | Balanced (Decision) |
|---|---|---|---|
| Cost | Low compute, but high risk | High compute + high FP retry cost | Moderate (sync cheap, async amortized) |
| Performance | Fast (no checks) | Slow (+300-500ms) | Good (+100ms sync only) |
| Inference Quality | Dangerous (PII leaks, hallucinations) | Over-blocked (poor UX) | Good (targeted checks, low FP) |
| Safety | Unacceptable | Excessive | Appropriate |
| QACPI | High raw score but unsustainable risk | Low (UX damage) | Highest |