US-07: Guardrail Strictness — Safety vs Latency vs User Experience
User Story
As a product safety lead, I want to calibrate guardrail strictness to block genuinely harmful or incorrect responses without over-blocking legitimate answers, so that users trust MangaAssist without being frustrated by overly cautious or slow responses.
The Debate
```mermaid
graph TD
subgraph "Inference Team (Safety)"
I["Block everything suspicious.<br/>One hallucinated price = lawsuit risk.<br/>One PII leak = regulatory action.<br/>False positives are safer<br/>than false negatives."]
end
subgraph "Performance Team"
P["The guardrail pipeline adds<br/>100-500ms per request.<br/>That's 25% of our latency<br/>budget. Can we make some<br/>checks async?"]
end
subgraph "Cost Team"
C["Guardrail compute costs $15K/month.<br/>Plus, every blocked response triggers<br/>a retry or fallback that<br/>costs MORE than the original.<br/>False positive rate of 5%<br/>means $22K in wasted retries."]
end
subgraph "Product Team"
PT["Users hate 'I can't help with<br/>that' responses. Every false block<br/>is a lost engagement. Our CSAT<br/>drops 0.3 points for each<br/>unnecessary refusal."]
end
I ---|"Latency<br/>cost"| P
P ---|"Compute<br/>cost"| C
C ---|"User<br/>experience"| PT
PT ---|"Safety<br/>risk"| I
style I fill:#ff6b6b,stroke:#333,color:#000
style P fill:#4ecdc4,stroke:#333,color:#000
style C fill:#f9d71c,stroke:#333,color:#000
style PT fill:#a29bfe,stroke:#333,color:#000
```
Acceptance Criteria
- Zero PII leaks in production (PII check is always sync, never skipped).
- Price accuracy is 100% (price validation is always sync).
- False positive rate (legitimate responses blocked) stays under 2%.
- Guardrail latency stays under 100ms for sync checks.
- User complaints about "I can't help" responses stay under 0.5% of sessions.
The Guardrail Pipeline and Its Costs
```mermaid
graph TD
A["LLM Response"] --> B["Sync Guardrails<br/>(must pass before delivery)"]
B --> C{"Pass?"}
C -->|"Yes"| D["Deliver to User"]
C -->|"No"| E["Block → Fallback Response"]
D --> F["Async Guardrails<br/>(run after delivery)"]
F --> G{"Pass?"}
G -->|"Yes"| H["Log clean"]
G -->|"No"| I["Flag for review<br/>Optional: retract + replace"]
subgraph "Sync Checks — 100ms budget"
S1["PII Detection<br/>20ms | Precision: 0.98"]
S2["Price Validation<br/>30ms | Precision: 1.0"]
S3["ASIN Existence<br/>30ms | Precision: 1.0"]
S4["Toxicity Quick Scan<br/>20ms | Precision: 0.94"]
end
subgraph "Async Checks — No time budget"
A1["Hallucination Scoring<br/>300ms | Precision: 0.87"]
A2["Competitor Mention<br/>50ms | Precision: 0.96"]
A3["Scope Drift<br/>200ms | Precision: 0.82"]
A4["Full Quality Score<br/>400ms | Precision: 0.79"]
end
style B fill:#eb3b5a,stroke:#333,color:#fff
style F fill:#fd9644,stroke:#333,color:#000
style D fill:#2d8659,stroke:#333,color:#fff
```
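The sync-block / async-flag split above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the check implementations are toy stubs standing in for the real detectors, and every name (`SYNC_CHECKS`, `FALLBACK`, and so on) is an assumption.

```python
import asyncio
import re

FALLBACK = "Sorry, I can't verify that answer right now."

# Toy stand-ins for the real detectors; each returns True on pass.
SYNC_CHECKS = {
    "pii": lambda r: not re.search(r"\b\d{10,}\b", r),        # no long digit runs
    "toxicity": lambda r: "stupid" not in r.lower(),           # toy blocklist
}
ASYNC_CHECKS = {
    "hallucination": lambda r: "guaranteed" not in r.lower(),  # toy heuristic
}

def run_sync_guardrails(response: str) -> str:
    """Sync gate: any failure blocks delivery and returns the fallback."""
    if all(check(response) for check in SYNC_CHECKS.values()):
        return response
    return FALLBACK

async def run_async_guardrails(response: str) -> list[str]:
    """Post-delivery pass: failures are flagged for review, never blocked."""
    return [name for name, check in ASYNC_CHECKS.items() if not check(response)]
```

The key structural point is that sync checks decide what the user sees, while async checks only produce flags (for review, retraction, or model tuning).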
The Precision-Recall Problem for Each Guardrail
```mermaid
graph LR
subgraph "High Precision, Low Recall (Current)"
HP["Blocks: 2% of responses<br/>False positives: 0.3%<br/>Missed violations: 1.5%<br/>✅ Users rarely frustrated<br/>⚠️ Some bad responses slip through"]
end
subgraph "High Recall, Low Precision (Aggressive)"
HR["Blocks: 8% of responses<br/>False positives: 5%<br/>Missed violations: 0.2%<br/>❌ Users frequently frustrated<br/>✅ Almost nothing slips through"]
end
subgraph "Balanced (Target)"
BAL["Blocks: 3.5% of responses<br/>False positives: 1%<br/>Missed violations: 0.8%<br/>✅ Acceptable frustration<br/>✅ Acceptable safety"]
end
style HP fill:#fd9644,stroke:#333,color:#000
style HR fill:#eb3b5a,stroke:#333,color:#fff
style BAL fill:#2d8659,stroke:#333,color:#fff
```
Guardrail-by-Guardrail Tradeoff Analysis
1. PII Detection
| Strictness | Threshold | False Positive Rate | False Negative Rate | Example FP | Decision |
|---|---|---|---|---|---|
| Strict | Block any 10+ digit number | 3% | 0.1% | ISBN numbers blocked as phone numbers | Too aggressive |
| Moderate | Pattern + context check | 0.5% | 0.3% | Rare: formatted order IDs flagged | Selected |
| Lenient | Only obvious patterns | 0.1% | 2% | Partial credit card numbers slip through | Too risky |
Decision: Moderate. PII is a sync-mandatory check. The 0.5% false positive rate is acceptable because the cost of a PII leak (regulatory, reputational) vastly outweighs the cost of occasionally blocking an ISBN.
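A "pattern + context" check of the kind selected above could look like the sketch below. The regexes, window size, and safe-context words are all illustrative assumptions; a production detector would cover many more PII types and formats.

```python
import re

# Hypothetical moderate PII check: a 10+ digit run is treated as PII only
# when no benign identifier keyword appears near it, so ISBNs and order IDs
# pass while bare phone numbers are caught.
DIGIT_RUN = re.compile(r"\b\d{10,}\b")
SAFE_CONTEXT = re.compile(r"\b(isbn|order|asin|sku)\b", re.IGNORECASE)

def contains_pii(text: str) -> bool:
    for match in DIGIT_RUN.finditer(text):
        # Look at a small window around the digit run for benign context.
        window = text[max(0, match.start() - 20):match.end() + 20]
        if not SAFE_CONTEXT.search(window):
            return True  # digit run with no benign context: treat as PII
    return False
```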
2. Price Validation
```mermaid
graph TD
A["LLM says:<br/>'Demon Slayer Vol 1 is $9.99'"] --> B{"Catalog says<br/>price is $9.99?"}
B -->|"Match"| C["✅ Pass"]
B -->|"Mismatch"| D["Replace with catalog price"]
B -->|"No price in catalog"| E["Remove price from response"]
style C fill:#2d8659,stroke:#333,color:#fff
style D fill:#fd9644,stroke:#333,color:#000
style E fill:#eb3b5a,stroke:#333,color:#fff
```
Decision: Always strict, always sync. Amazon cannot serve wrong prices. This check is non-negotiable and adds only 30ms (catalog lookup is typically cached but validated against live source).
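The match/repair/remove branches above reduce to a small function. This is a sketch under stated assumptions: `CATALOG` stands in for the cached-but-validated catalog service, the ASIN and prices are made up, and real price extraction would not rely on simple string replacement.

```python
# Toy catalog stand-in; a real lookup hits a cached, validated price service.
CATALOG = {"B07XYZ123": 9.99}

def validate_price(asin: str, quoted: float, text: str) -> str:
    catalog_price = CATALOG.get(asin)
    if catalog_price is None:
        # No trusted price available: strip the quoted price entirely.
        return text.replace(f"${quoted}", "").rstrip()
    if quoted != catalog_price:
        # Mismatch: repair in place with the catalog price.
        return text.replace(f"${quoted}", f"${catalog_price}")
    return text  # prices match: pass through unchanged
```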
3. Hallucination Scoring — The Hardest Tradeoff
```mermaid
graph TD
subgraph "The Dilemma"
D1["Real-time hallucination scoring<br/>takes 300ms and is only<br/>87% precise.<br/><br/>Async scoring catches issues<br/>after delivery but user already<br/>saw the hallucination."]
end
subgraph "Option A: Sync Scoring"
A1["✅ Catches hallucinations before user sees them<br/>❌ Adds 300ms to every response<br/>❌ 13% false positive rate blocks good responses<br/>❌ $8K/month in compute"]
end
subgraph "Option B: Async Scoring"
B1["✅ Zero latency impact<br/>✅ Zero false-positive UX impact<br/>❌ User sees hallucination for 2-5 seconds<br/>❌ Retraction UX is awkward"]
end
subgraph "Option C: Hybrid (Decision)"
C1["Sync: lightweight NLI check (50ms)<br/>blocks obvious hallucinations<br/><br/>Async: full scoring (300ms)<br/>catches subtle hallucinations<br/>and retracts if needed"]
end
style A1 fill:#eb3b5a,stroke:#333,color:#fff
style B1 fill:#fd9644,stroke:#333,color:#000
style C1 fill:#2d8659,stroke:#333,color:#fff
```
Decision: Hybrid. A lightweight sync NLI check (50ms, 0.90 precision) catches the obvious cases — like claiming a product is "free" or is by a wrong author. Full hallucination scoring runs async and triggers retraction only for serious violations.
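The hybrid decision boils down to two thresholds on an entailment score. The sketch below stubs out the scorers (real ones would call NLI models); the threshold values and function names are assumptions chosen to illustrate the asymmetry: the sync gate blocks only clear contradictions, while the async verdict can trigger retraction for subtler failures.

```python
# Sync gate is deliberately permissive (low threshold) to keep false
# positives rare; async retraction uses a stricter threshold.
SYNC_BLOCK_THRESHOLD = 0.2
ASYNC_RETRACT_THRESHOLD = 0.5

def sync_gate(entailment_score: float) -> bool:
    """Return True if the response may be delivered to the user."""
    return entailment_score >= SYNC_BLOCK_THRESHOLD

def async_verdict(entailment_score: float) -> str:
    """Decide the post-delivery action from the full (slower) score."""
    if entailment_score < ASYNC_RETRACT_THRESHOLD:
        return "retract"
    return "log_clean"
```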
4. Competitor Mention Filter
| Approach | What It Catches | FP Rate | Latency | Decision |
|---|---|---|---|---|
| Keyword blocklist | "Barnes & Noble", "Viz Direct", etc. | 0.1% | 2ms | Sync — fast enough |
| Semantic comparison | Mentions that imply competitor comparison | 4% | 150ms | Async only |
| No filter | Nothing | 0% | 0ms | Too risky for Amazon brand |
Decision: Keyword blocklist in sync (2ms, negligible cost). Semantic comparison async for borderline cases.
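The 2ms sync pass really is just a lowercase substring scan, along the lines of the sketch below. The blocklist entries come from the table above; a production list would be longer and centrally managed.

```python
# Minimal keyword blocklist check; O(list size) per response, no model call.
COMPETITOR_BLOCKLIST = {"barnes & noble", "viz direct"}

def mentions_competitor(text: str) -> bool:
    lowered = text.lower()
    return any(name in lowered for name in COMPETITOR_BLOCKLIST)
```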
5. Scope Drift Detection
```mermaid
graph TD
A["User: 'Tell me about<br/>the history of Japan'"] --> B{"Is this manga/Amazon<br/>related?"}
B -->|"No"| C["Redirect:<br/>'I specialize in manga.<br/>Can I help you find<br/>a good manga about<br/>Japanese history?'"]
B -->|"Borderline"| D["Answer briefly,<br/>then redirect"]
B -->|"Yes"| E["Answer normally"]
style C fill:#fd9644,stroke:#333,color:#000
style D fill:#f9d71c,stroke:#333,color:#000
style E fill:#2d8659,stroke:#333,color:#fff
```
The tradeoff: Strict scope enforcement frustrates users who ask tangentially related questions. Lenient scope allows the chatbot to ramble off-topic, wasting LLM tokens and potentially generating ungrounded content.
Decision: Moderate scope with graceful redirect. The chatbot acknowledges the question, briefly answers if it can, then steers back to manga. Full scope-drift detection runs async.
The False Positive Cost Model
Every false positive has a measurable cost:
```mermaid
graph TD
A["False Positive<br/>(legitimate response blocked)"] --> B["Fallback Response Generated<br/>(+$0.008 for retry with template)"]
A --> C["User Frustration<br/>(-0.3 CSAT points)"]
A --> D["Potential Escalation<br/>($4-8 per human agent minute)"]
A --> E["Engagement Loss<br/>(15% of blocked users leave session)"]
style A fill:#eb3b5a,stroke:#333,color:#fff
```
False Positive Cost Calculation
| FP Rate | Daily FP Count (1M requests) | Retry Cost | CSAT Impact | Escalation Cost | Total Daily Cost |
|---|---|---|---|---|---|
| 1% | 10,000 | $80 | measurable, localized | $400 (1% escalate) | $480 |
| 2% | 20,000 | $160 | noticeable | $1,600 (2% escalate) | $1,760 |
| 5% | 50,000 | $400 | significant | $8,000 (4% escalate) | $8,400 |
| 8% | 80,000 | $640 | severe | $19,200 (6% escalate) | $19,840 |
Target: under 2% FP rate. The jump from 2% to 5% costs $6,640/day extra ($199K/month) in escalations and retries alone.
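The table rows follow directly from the per-FP costs in the diagram, assuming the $4 low end of the agent-minute range. A quick sketch of the arithmetic:

```python
# Reproduces the daily-cost table above at 1M requests/day.
RETRY_COST = 0.008   # fallback/template retry, per false positive
AGENT_COST = 4.0     # assumed low end of the $4-8 agent-minute range
ESCALATION_RATE = {0.01: 0.01, 0.02: 0.02, 0.05: 0.04, 0.08: 0.06}

def daily_fp_cost(fp_rate: float, requests: int = 1_000_000) -> float:
    fp_count = fp_rate * requests
    retry = fp_count * RETRY_COST
    escalation = fp_count * ESCALATION_RATE[fp_rate] * AGENT_COST
    return retry + escalation
```

Running this for 2% and 5% reproduces the $1,760 and $8,400 daily figures, and their difference is the $6,640/day jump cited above.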
Intent-Specific Guardrail Profiles
| Intent | Sync Checks | Async Checks | Strictness | Rationale |
|---|---|---|---|---|
| recommendation | PII, price, ASIN, toxicity, NLI | Full hallucination, scope, quality | High | Recommendations influence purchases |
| faq | PII, toxicity | Hallucination, scope | Moderate | Factual, lower risk |
| product_question | PII, price, ASIN | Hallucination | High | Price/availability must be accurate |
| order_tracking | PII | None (template response) | Low | Template path, no LLM output |
| chitchat | Toxicity | None | Minimal | Low risk, template response |
| return_request | PII, policy accuracy | Full pipeline | High | Incorrect policy info = support escalation |
| checkout_help | PII, price | Hallucination | High | Purchase-adjacent, accuracy critical |
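One way to implement these profiles is as plain data rather than branching code, so tuning a profile never touches pipeline logic. The dictionary below mirrors the table; the check identifiers are illustrative.

```python
# Per-intent guardrail profiles as data; contents mirror the table above.
GUARDRAIL_PROFILES = {
    "recommendation":   {"sync": ["pii", "price", "asin", "toxicity", "nli"],
                         "async": ["hallucination", "scope", "quality"]},
    "faq":              {"sync": ["pii", "toxicity"],
                         "async": ["hallucination", "scope"]},
    "product_question": {"sync": ["pii", "price", "asin"],
                         "async": ["hallucination"]},
    "order_tracking":   {"sync": ["pii"], "async": []},
    "chitchat":         {"sync": ["toxicity"], "async": []},
    "return_request":   {"sync": ["pii", "policy"],
                         "async": ["hallucination", "scope", "quality"]},
    "checkout_help":    {"sync": ["pii", "price"],
                         "async": ["hallucination"]},
}

def checks_for(intent: str) -> dict:
    # Unknown or misclassified intents fall back to the strictest profile.
    return GUARDRAIL_PROFILES.get(intent, GUARDRAIL_PROFILES["recommendation"])
```

Defaulting unknown intents to the strictest profile is a deliberate fail-safe choice: a misrouted request pays a latency cost rather than a safety cost.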
Monitoring: The Guardrail Health Dashboard
```mermaid
graph TD
A["Guardrail Dashboard"] --> B["Block Rate"]
A --> C["False Positive Rate"]
A --> D["False Negative Rate"]
A --> E["Latency Impact"]
B --> B1["Block rate by guardrail<br/>Block rate by intent<br/>Block rate trend (7d)"]
C --> C1["FP rate by guardrail<br/>FP review queue<br/>FP examples for tuning"]
D --> D1["Escaped violations caught by async<br/>User-reported issues<br/>Escalation root causes"]
E --> E1["Sync check latency p50/p95<br/>Async check completion time<br/>Retry/fallback rate"]
style A fill:#54a0ff,stroke:#333,color:#000
```
2026 Update: Shift Guardrails from Block-Only to Detect, Repair, and Abstain
Treat everything above this section as the baseline guardrail architecture. This update leaves that sync/async design in place and describes how it evolves toward a more repair-oriented posture.
The most effective recent safety stacks reduce failures by changing the response shape and repair path, not only by blocking after the fact.
- Use structured outputs or constrained decoding for policy and catalog-facing responses so downstream validation checks typed fields instead of brittle free-form prose.
- Keep deterministic sync checks for zero-tolerance fields such as PII, price, policy IDs, and ASIN validity, but prefer repair paths for many other failures: rewrite, re-ground, strip unsafe claims, or ask a clarifying question.
- For high-stakes policy answers, consider Amazon Bedrock Automated Reasoning checks. They validate generated claims against formalized rules, but they are detect-mode, non-streaming, and should be used selectively where the added latency is justified.
- Split the guardrail budget into input safety, output safety, tool-result validation, and business-rule validation. One blended "guardrail latency" number hides where the real cost is.
- Measure intervention outcomes by type: blocked, rewritten, clarified, retried, or human-escalated. That gives a more useful optimization signal than block rate alone.
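The structured-output point above is worth making concrete: when the model emits typed fields, deterministic validation becomes a field comparison instead of prose parsing. The sketch below is a minimal illustration; the schema and field names are invented for this example, not an actual MangaAssist contract.

```python
from dataclasses import dataclass

# Hypothetical typed response for catalog-facing answers.
@dataclass
class ProductAnswer:
    asin: str
    price_usd: float
    claim: str

def validate_typed(answer: ProductAnswer, catalog: dict[str, float]) -> bool:
    # Deterministic checks run on fields, not on free-form prose.
    return answer.asin in catalog and answer.price_usd == catalog[answer.asin]
```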
Recent references: Bedrock Automated Reasoning overview, Automated Reasoning concepts, Integrate Automated Reasoning checks, Bedrock Guardrails CloudWatch metrics, vLLM structured outputs.
Reversal Triggers
| Trigger | Action |
|---|---|
| False positive rate exceeds 3% for any guardrail | Loosen threshold; review recent FP examples and tune |
| A PII leak reaches a user (false negative) | Tighten PII detection immediately; add the pattern that was missed |
| Guardrail sync latency exceeds 100ms p95 | Move the slowest sync check to async |
| "Can't help" complaints exceed 1% of sessions | Identify which guardrail is over-blocking; tune or change to redirect instead of block |
| Hallucination reported by user that async check missed | Improve hallucination scoring model; consider promoting lightweight check to sync |
Impact on Trilemma
| Dimension | Minimal Guardrails | Aggressive Guardrails | Balanced (Decision) |
|---|---|---|---|
| Cost | Low compute, but high risk | High compute + high FP retry cost | Moderate (sync cheap, async amortized) |
| Performance | Fast (no checks) | Slow (+300-500ms) | Good (+100ms sync only) |
| Inference Quality | Dangerous (PII leaks, hallucinations) | Over-blocked (poor UX) | Good (targeted checks, low FP) |
| Safety | Unacceptable | Excessive | Appropriate |
| QACPI | High raw score but unsustainable risk | Low (UX damage) | Highest |