
US-07: Guardrail Strictness — Safety vs Latency vs User Experience

User Story

As a product safety lead, I want to calibrate guardrail strictness to block genuinely harmful or incorrect responses without over-blocking legitimate answers, so that users trust MangaAssist without being frustrated by overly cautious or slow responses.

The Debate

graph TD
    subgraph "Inference Team (Safety)"
        I["Block everything suspicious.<br/>One hallucinated price = lawsuit risk.<br/>One PII leak = regulatory action.<br/>False positives are safer<br/>than false negatives."]
    end

    subgraph "Performance Team"
        P["The guardrail pipeline adds<br/>100-500ms per request.<br/>That's 25% of our latency<br/>budget. Can we make some<br/>checks async?"]
    end

    subgraph "Cost Team"
        C["Guardrail compute costs $15K/month.<br/>Plus, every blocked response triggers<br/>a retry or fallback that<br/>costs MORE than the original.<br/>False positive rate of 5%<br/>means $22K in wasted retries."]
    end

    subgraph "Product Team"
        PT["Users hate 'I can't help with<br/>that' responses. Every false block<br/>is a lost engagement. Our CSAT<br/>drops 0.3 points for each<br/>unnecessary refusal."]
    end

    I ---|"Latency<br/>cost"| P
    P ---|"Compute<br/>cost"| C
    C ---|"User<br/>experience"| PT
    PT ---|"Safety<br/>risk"| I

    style I fill:#ff6b6b,stroke:#333,color:#000
    style P fill:#4ecdc4,stroke:#333,color:#000
    style C fill:#f9d71c,stroke:#333,color:#000
    style PT fill:#a29bfe,stroke:#333,color:#000

Acceptance Criteria

  • Zero PII leaks in production (PII check is always sync, never skipped).
  • Price accuracy is 100% (price validation is always sync).
  • False positive rate (legitimate responses blocked) stays under 2%.
  • Guardrail latency stays under 100ms for sync checks.
  • User complaints about "I can't help" responses stay under 0.5% of sessions.

The Guardrail Pipeline and Its Costs

graph TD
    A["LLM Response"] --> B["Sync Guardrails<br/>(must pass before delivery)"]
    B --> C{"Pass?"}
    C -->|"Yes"| D["Deliver to User"]
    C -->|"No"| E["Block → Fallback Response"]

    D --> F["Async Guardrails<br/>(run after delivery)"]
    F --> G{"Pass?"}
    G -->|"Yes"| H["Log clean"]
    G -->|"No"| I["Flag for review<br/>Optional: retract + replace"]

    subgraph "Sync Checks — 100ms budget"
        S1["PII Detection<br/>20ms | Precision: 0.98"]
        S2["Price Validation<br/>30ms | Precision: 1.0"]
        S3["ASIN Existence<br/>30ms | Precision: 1.0"]
        S4["Toxicity Quick Scan<br/>20ms | Precision: 0.94"]
    end

    subgraph "Async Checks — No time budget"
        A1["Hallucination Scoring<br/>300ms | Precision: 0.87"]
        A2["Competitor Mention<br/>50ms | Precision: 0.96"]
        A3["Scope Drift<br/>200ms | Precision: 0.82"]
        A4["Full Quality Score<br/>400ms | Precision: 0.79"]
    end

    style B fill:#eb3b5a,stroke:#333,color:#fff
    style F fill:#fd9644,stroke:#333,color:#000
    style D fill:#2d8659,stroke:#333,color:#fff
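The sync/async split above can be sketched as a small dispatcher: sync checks gate delivery, async checks run after the user has the response. This is a minimal sketch; the check functions are stand-in heuristics, not the real PII/price/toxicity services.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in checks: each returns True when the response passes.
# Real implementations would call the PII, price, and toxicity services.
def pii_check(text):      return "credit card" not in text.lower()
def price_check(text):    return "$0.00" not in text
def toxicity_check(text): return True  # stub

SYNC_CHECKS = [pii_check, price_check, toxicity_check]

flagged = []  # async findings land here for human review

def run_async_guardrails(response):
    # Stand-in for hallucination/scope/quality scoring after delivery.
    if "guaranteed" in response:
        flagged.append(response)

def handle_response(response, pool):
    # Sync gate: any failure blocks delivery and returns a fallback.
    if not all(check(response) for check in SYNC_CHECKS):
        return "Sorry, I can't help with that."
    # Deliver immediately; async checks run after delivery.
    pool.submit(run_async_guardrails, response)
    return response

with ThreadPoolExecutor(max_workers=2) as pool:
    ok = handle_response("Demon Slayer Vol 1 is $9.99", pool)
    blocked = handle_response("Your credit card ending in 1234 was charged", pool)
```

The key design point mirrors the diagram: a blocked response never reaches the user, while async findings only produce review flags (or retractions) after delivery.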

The Precision-Recall Problem for Each Guardrail

graph LR
    subgraph "High Precision, Low Recall (Current)"
        HP["Blocks: 2% of responses<br/>False positives: 0.3%<br/>Missed violations: 1.5%<br/>✅ Users rarely frustrated<br/>⚠️ Some bad responses slip through"]
    end

    subgraph "High Recall, Low Precision (Aggressive)"
        HR["Blocks: 8% of responses<br/>False positives: 5%<br/>Missed violations: 0.2%<br/>❌ Users frequently frustrated<br/>✅ Almost nothing slips through"]
    end

    subgraph "Balanced (Target)"
        BAL["Blocks: 3.5% of responses<br/>False positives: 1%<br/>Missed violations: 0.8%<br/>✅ Acceptable frustration<br/>✅ Acceptable safety"]
    end

    style HP fill:#fd9644,stroke:#333,color:#000
    style HR fill:#eb3b5a,stroke:#333,color:#fff
    style BAL fill:#2d8659,stroke:#333,color:#fff

Guardrail-by-Guardrail Tradeoff Analysis

1. PII Detection

| Strictness | Threshold | False Positive Rate | False Negative Rate | Example FP | Decision |
|---|---|---|---|---|---|
| Strict | Block any 10+ digit number | 3% | 0.1% | ISBN numbers blocked as phone numbers | Too aggressive |
| Moderate | Pattern + context check | 0.5% | 0.3% | Rare: formatted order IDs flagged | Selected |
| Lenient | Only obvious patterns | 0.1% | 2% | Partial credit card numbers slip through | Too risky |

Decision: Moderate. PII is a sync-mandatory check. The 0.5% false positive rate is acceptable because the cost of a PII leak (regulatory, reputational) vastly outweighs the cost of occasionally blocking an ISBN.
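The "moderate" tier can be illustrated with a pattern-plus-context check. This is a toy sketch: the regexes below are illustrative, and a production detector would cover far more patterns and contexts.

```python
import re

# A long digit run alone is ambiguous (ISBNs, order IDs), so the
# pattern only fires when nearby context suggests contact or payment info.
DIGIT_RUN = re.compile(r"\b\d[\d\- ]{8,}\d\b")
CONTACT_CONTEXT = re.compile(r"\b(phone|call|text|mobile|card|ssn)\b", re.I)

def contains_pii(text):
    if not DIGIT_RUN.search(text):
        return False
    # Context check is what keeps the false positive rate near 0.5%:
    # an ISBN matches the digit run but lacks contact context.
    return bool(CONTACT_CONTEXT.search(text))
```

Under this scheme "Call me at 415-555-0199" is flagged, while "ISBN 978-1-9747-0052-3 is in stock" passes.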

2. Price Validation

graph TD
    A["LLM says:<br/>'Demon Slayer Vol 1 is $9.99'"] --> B{"Catalog says<br/>price is $9.99?"}
    B -->|"Match"| C["✅ Pass"]
    B -->|"Mismatch"| D["Replace with catalog price"]
    B -->|"No price in catalog"| E["Remove price from response"]

    style C fill:#2d8659,stroke:#333,color:#fff
    style D fill:#fd9644,stroke:#333,color:#000
    style E fill:#eb3b5a,stroke:#333,color:#fff

Decision: Always strict, always sync. Amazon cannot serve wrong prices. This check is non-negotiable and adds only 30ms (the catalog lookup is typically cached but validated against the live source).
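The three branches in the diagram (pass, replace, remove) can be sketched directly. The catalog dict and regex here are hypothetical placeholders for the cached catalog lookup.

```python
import re

CATALOG = {"Demon Slayer Vol 1": 9.99}  # hypothetical catalog cache

PRICE = re.compile(r"\$(\d+\.\d{2})")

def validate_price(response, title):
    m = PRICE.search(response)
    catalog_price = CATALOG.get(title)
    if m is None:
        return response  # no price claimed: pass
    if catalog_price is None:
        # No price in catalog: remove the claim from the response.
        return PRICE.sub("", response).strip()
    if float(m.group(1)) != catalog_price:
        # Mismatch: replace with the catalog price.
        return PRICE.sub(f"${catalog_price:.2f}", response)
    return response  # match: pass
```

Note that this check repairs rather than blocks: a wrong price is corrected in place, so the user still gets an answer.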

3. Hallucination Scoring — The Hardest Tradeoff

graph TD
    subgraph "The Dilemma"
        D1["Real-time hallucination scoring<br/>takes 300ms and is only<br/>87% precise.<br/><br/>Async scoring catches issues<br/>after delivery but user already<br/>saw the hallucination."]
    end

    subgraph "Option A: Sync Scoring"
        A1["✅ Catches hallucinations before user sees them<br/>❌ Adds 300ms to every response<br/>❌ 13% false positive rate blocks good responses<br/>❌ $8K/month in compute"]
    end

    subgraph "Option B: Async Scoring"
        B1["✅ Zero latency impact<br/>✅ Zero false-positive UX impact<br/>❌ User sees hallucination for 2-5 seconds<br/>❌ Retraction UX is awkward"]
    end

    subgraph "Option C: Hybrid (Decision)"
        C1["Sync: lightweight NLI check (50ms)<br/>blocks obvious hallucinations<br/><br/>Async: full scoring (300ms)<br/>catches subtle hallucinations<br/>and retract if needed"]
    end

    style A1 fill:#eb3b5a,stroke:#333,color:#fff
    style B1 fill:#fd9644,stroke:#333,color:#000
    style C1 fill:#2d8659,stroke:#333,color:#fff

Decision: Hybrid. A lightweight sync NLI check (50ms, 0.90 precision) catches the obvious cases — like claiming a product is "free" or is by a wrong author. Full hallucination scoring runs async and triggers retraction only for serious violations.
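The hybrid split can be sketched as two stages. Keyword heuristics stand in for the real models here: the sync gate would actually be a small NLI model, and the async scorer a full hallucination-scoring pipeline.

```python
# Stage 1 (sync, ~50ms budget): block only obvious contradictions
# of catalog facts. Red flags below are illustrative stand-ins.
OBVIOUS_RED_FLAGS = ("is free", "costs nothing")

def sync_nli_check(response):
    return not any(flag in response.lower() for flag in OBVIOUS_RED_FLAGS)

# Stage 2 (async, ~300ms): full scoring after delivery. Stub score here.
def async_full_score(response):
    return 0.9 if "invented" in response else 0.1

RETRACT_THRESHOLD = 0.8  # retract only for serious violations

def post_delivery(response):
    return "retract" if async_full_score(response) >= RETRACT_THRESHOLD else "keep"
```

The sync stage pays its 50ms on every request; the async stage pays nothing in user-visible latency but can only retract, not prevent.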

4. Competitor Mention Filter

| Approach | What It Catches | FP Rate | Latency | Decision |
|---|---|---|---|---|
| Keyword blocklist | "Barnes & Noble", "Viz Direct", etc. | 0.1% | 2ms | Sync — fast enough |
| Semantic comparison | Mentions that imply competitor comparison | 4% | 150ms | Async only |
| No filter | Nothing | 0% | 0ms | Too risky for Amazon brand |

Decision: Keyword blocklist in sync (2ms, negligible cost). Semantic comparison async for borderline cases.
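The sync half is trivially cheap, which is why it fits in the 100ms budget. A minimal sketch, using the names from the table above:

```python
# Sync keyword blocklist: ~2ms, negligible compute.
BLOCKLIST = ("barnes & noble", "viz direct")

def mentions_competitor(text):
    lowered = text.lower()
    return any(name in lowered for name in BLOCKLIST)
```

Borderline phrasings that a blocklist misses ("available cheaper elsewhere") are exactly what the async semantic comparison is for.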

5. Scope Drift Detection

graph TD
    A["User: 'Tell me about<br/>the history of Japan'"] --> B{"Is this manga/Amazon<br/>related?"}
    B -->|"No"| C["Redirect:<br/>'I specialize in manga.<br/>Can I help you find<br/>a good manga about<br/>Japanese history?'"]
    B -->|"Borderline"| D["Answer briefly,<br/>then redirect"]
    B -->|"Yes"| E["Answer normally"]

    style C fill:#fd9644,stroke:#333,color:#000
    style D fill:#f9d71c,stroke:#333,color:#000
    style E fill:#2d8659,stroke:#333,color:#fff

The tradeoff: Strict scope enforcement frustrates users who ask tangentially related questions. Lenient scope allows the chatbot to ramble off-topic, wasting LLM tokens and potentially generating ungrounded content.

Decision: Moderate scope with graceful redirect. The chatbot acknowledges the question, briefly answers if it can, then steers back to manga. Full scope-drift detection runs async.
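The three-way routing above can be sketched with a toy classifier. Keyword lists here are illustrative stand-ins for a real intent/scope model.

```python
# Toy scope router for the three branches in the diagram.
MANGA_TERMS = ("manga", "volume", "series", "author")
ADJACENT_TERMS = ("japan", "anime")  # tangentially related topics

def route_scope(question):
    q = question.lower()
    if any(term in q for term in MANGA_TERMS):
        return "answer"
    if any(term in q for term in ADJACENT_TERMS):
        return "answer_briefly_then_redirect"
    return "redirect"
```

The middle branch is what makes the enforcement "moderate": tangential questions get a brief answer plus a steer back to manga instead of a flat refusal.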


The False Positive Cost Model

Every false positive has a measurable cost:

graph TD
    A["False Positive<br/>(legitimate response blocked)"] --> B["Fallback Response Generated<br/>(+$0.008 for retry with template)"]
    A --> C["User Frustration<br/>(-0.3 CSAT points)"]
    A --> D["Potential Escalation<br/>($4-8 per human agent minute)"]
    A --> E["Engagement Loss<br/>(15% of blocked users leave session)"]

    style A fill:#eb3b5a,stroke:#333,color:#fff

False Positive Cost Calculation

| FP Rate | Daily FP Count (1M requests) | Retry Cost | CSAT Impact | Escalation Cost | Total Daily Cost |
|---|---|---|---|---|---|
| 1% | 10,000 | $80 | measurable, localized | $400 (1% escalate) | $480 |
| 2% | 20,000 | $160 | noticeable | $1,600 (2% escalate) | $1,760 |
| 5% | 50,000 | $400 | significant | $8,000 (4% escalate) | $8,400 |
| 8% | 80,000 | $640 | severe | $19,200 (6% escalate) | $19,840 |

Target: under 2% FP rate. The jump from 2% to 5% costs $6,640/day extra ($199K/month) in escalations and retries alone.
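The table's rows follow from two inputs: the $0.008 retry cost per false positive and a $4 cost per escalation (the low end of the $4-8/minute range, assuming one minute each). A small sketch of the calculation:

```python
# Daily false-positive cost: retries plus human escalations.
def daily_fp_cost(fp_rate, escalation_rate, requests=1_000_000,
                  retry_cost=0.008, escalation_cost=4.0):
    fp_count = requests * fp_rate
    return fp_count * retry_cost + fp_count * escalation_rate * escalation_cost
```

For example, the 1% row is 10,000 FPs × $0.008 + 100 escalations × $4 = $480/day, and the 2%-to-5% jump works out to the $6,640/day cited above.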


Intent-Specific Guardrail Profiles

| Intent | Sync Checks | Async Checks | Strictness | Rationale |
|---|---|---|---|---|
| recommendation | PII, price, ASIN, toxicity, NLI | Full hallucination, scope, quality | High | Recommendations influence purchases |
| faq | PII, toxicity | Hallucination, scope | Moderate | Factual, lower risk |
| product_question | PII, price, ASIN | Hallucination | High | Price/availability must be accurate |
| order_tracking | PII | None (template response) | Low | Template path, no LLM output |
| chitchat | Toxicity | None | Minimal | Low risk, template response |
| return_request | PII, policy accuracy | Full pipeline | High | Incorrect policy info = support escalation |
| checkout_help | PII, price | Hallucination | High | Purchase-adjacent, accuracy critical |
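In code, these profiles reduce to a lookup table keyed by intent. A partial sketch (profile names abbreviated from the table above; the fallback-to-strictest choice is an assumption, not stated in the table):

```python
# Per-intent guardrail profiles, mirroring a subset of the table above.
PROFILES = {
    "recommendation": {"sync": ["pii", "price", "asin", "toxicity", "nli"],
                       "async": ["hallucination", "scope", "quality"]},
    "faq":            {"sync": ["pii", "toxicity"],
                       "async": ["hallucination", "scope"]},
    "order_tracking": {"sync": ["pii"], "async": []},  # template path
    "chitchat":       {"sync": ["toxicity"], "async": []},
}

def checks_for(intent):
    # Unknown intents fall back to the strictest known profile.
    return PROFILES.get(intent, PROFILES["recommendation"])
```

Keeping the profiles in data rather than code makes the reversal-trigger tuning below a config change instead of a deploy.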

Monitoring: The Guardrail Health Dashboard

graph TD
    A["Guardrail Dashboard"] --> B["Block Rate"]
    A --> C["False Positive Rate"]
    A --> D["False Negative Rate"]
    A --> E["Latency Impact"]

    B --> B1["Block rate by guardrail<br/>Block rate by intent<br/>Block rate trend (7d)"]
    C --> C1["FP rate by guardrail<br/>FP review queue<br/>FP examples for tuning"]
    D --> D1["Escaped violations caught by async<br/>User-reported issues<br/>Escalation root causes"]
    E --> E1["Sync check latency p50/p95<br/>Async check completion time<br/>Retry/fallback rate"]

    style A fill:#54a0ff,stroke:#333,color:#000

2026 Update: Shift Guardrails from Block-Only to Detect, Repair, and Abstain

Treat everything above this section as the baseline guardrail architecture. This update preserves that sync/async design and describes how it evolves toward a more repair-oriented approach.

The most effective recent safety stacks reduce failures by changing the response shape and repair path, not only by blocking after the fact.

  • Use structured outputs or constrained decoding for policy and catalog-facing responses so downstream validation checks typed fields instead of brittle free-form prose.
  • Keep deterministic sync checks for zero-tolerance fields such as PII, price, policy IDs, and ASIN validity, but prefer repair paths for many other failures: rewrite, re-ground, strip unsafe claims, or ask a clarifying question.
  • For high-stakes policy answers, consider Amazon Bedrock Automated Reasoning checks. They validate generated claims against formalized rules, but they are detect-mode, non-streaming, and should be used selectively where the added latency is justified.
  • Split the guardrail budget into input safety, output safety, tool-result validation, and business-rule validation. One blended "guardrail latency" number hides where the real cost is.
  • Measure intervention outcomes by type: blocked, rewritten, clarified, retried, or human-escalated. That gives a more useful optimization signal than block rate alone.

Recent references: Bedrock Automated Reasoning overview, Automated Reasoning concepts, Integrate Automated Reasoning checks, Bedrock Guardrails CloudWatch metrics, vLLM structured outputs.

Reversal Triggers

| Trigger | Action |
|---|---|
| False positive rate exceeds 3% for any guardrail | Loosen threshold; review recent FP examples and tune |
| A PII leak reaches a user (false negative) | Tighten PII detection immediately; add the pattern that was missed |
| Guardrail sync latency exceeds 100ms p95 | Move the slowest sync check to async |
| "Can't help" complaints exceed 1% of sessions | Identify which guardrail is over-blocking; tune or change to redirect instead of block |
| Hallucination reported by user that async check missed | Improve hallucination scoring model; consider promoting lightweight check to sync |
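The triggers lend themselves to automated evaluation against dashboard metrics. A sketch, with hypothetical metric names and the thresholds from the table:

```python
# Evaluate reversal triggers against observed metrics.
# Metric keys are illustrative; thresholds come from the table above.
def reversal_actions(metrics):
    actions = []
    if metrics.get("fp_rate", 0) > 0.03:          # >3% FP rate
        actions.append("loosen_threshold")
    if metrics.get("pii_leaks", 0) > 0:           # any leak
        actions.append("tighten_pii")
    if metrics.get("sync_p95_ms", 0) > 100:       # p95 over budget
        actions.append("demote_slowest_sync_check")
    if metrics.get("cant_help_rate", 0) > 0.01:   # >1% of sessions
        actions.append("tune_overblocking_guardrail")
    return actions
```

Wiring this into the dashboard turns the reversal triggers from a review-meeting checklist into alerts.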

Impact on Trilemma

| Dimension | Minimal Guardrails | Aggressive Guardrails | Balanced (Decision) |
|---|---|---|---|
| Cost | Low compute, but high risk | High compute + high FP retry cost | Moderate (sync cheap, async amortized) |
| Performance | Fast (no checks) | Slow (+300-500ms) | Good (+100ms sync only) |
| Inference Quality | Dangerous (PII leaks, hallucinations) | Over-blocked (poor UX) | Good (targeted checks, low FP) |
| Safety | Unacceptable | Excessive | Appropriate |
| QACPI | High raw score but unsustainable risk | Low (UX damage) | Highest |