
3. Guardrails Pipeline Deep Dive

This document expands the original LLD-6 guardrails section into a deeper architecture and implementation reference. It explains where the pipeline sits in the end-to-end system, how each stage works, how policies are activated per intent, and what follow-up questions usually matter in interviews and design reviews.

For MangaAssist, guardrails are not only about moderation. They are a trust boundary between a probabilistic model and a commerce product that must stay correct, privacy-safe, brand-safe, and in scope.


Why This Subsystem Exists

Prompt hardening reduces the chance of bad model behavior. Guardrails deal with the bad behavior that still gets generated.

| Failure Class | Example | Why Prompting Alone Is Not Enough | Primary Guardrail |
|---|---|---|---|
| PII leakage | Model echoes a shipping address or invents an email address | The model can still emit sensitive-looking text even when instructed not to | Stage 1: PII Filter |
| Commerce inaccuracy | Model invents a price, discount, or savings claim | The model can produce fluent but unverifiable math | Stage 2: Price Validator |
| Harmful language | Response contains toxic or abusive wording | Generic prompt rules do not guarantee safe output | Stage 3: Toxicity Filter |
| Brand violation | Response recommends another store | The model may mirror user phrasing or retrieved text | Stage 4: Competitor Filter |
| Grounding failure | Response mentions an ASIN not in the retrieved candidate set | The model can use training memory instead of request context | Stage 5: ASIN Validator |
| Capability drift | Response gives unsupported advice or claims actions it cannot perform | The model can drift outside business scope | Stage 6: Scope Check |

The important distinction is simple:

  • Prompt hardening is probabilistic.
  • Guardrails are deterministic enforcement.

Guardrails In The Full Architecture

The pipeline sits after generation but depends on structured data collected before generation. That is why it must be treated as a real subsystem, not as a final regex wrapper.

flowchart TB
    User[User Chat Widget] --> Gateway[API Gateway and Auth]
    Gateway --> InputDef[Input Sanitization and Injection Defense]
    InputDef --> Orch[Chat Orchestrator]
    Orch --> Intent[Intent Classifier]
    Orch --> Data[Catalog, RAG, Order, Promo Services]
    Intent --> Prompt[Prompt Builder]
    Data --> Prompt
    Prompt --> FM[Foundation Model]

    subgraph Guardrails["Post-Generation Guardrails"]
        GC[Guardrails Coordinator]
        S1[Stage 1 PII Filter]
        S2[Stage 2 Price Validator]
        S3[Stage 3 Toxicity Filter]
        S4[Stage 4 Competitor Filter]
        S5[Stage 5 ASIN Validator]
        S6[Stage 6 Scope Check]
        GC --> S1 --> S2 --> S3 --> S4 --> S5 --> S6
    end

    FM --> GC
    S6 --> Formatter[Response Formatter]
    Formatter --> User

    Policy[Policy Store and Versioned Config] --> GC
    CatalogCache[Catalog Snapshot Cache] --> S2
    CatalogCache --> S5
    Allowlists[Manga Term and Brand Allowlists] --> S1
    Allowlists --> S3
    GC --> Audit[Audit Log Sink]
    GC --> Metrics[Metrics and Alerts]
    Audit --> Review[Human Review Queue]

Architectural Readout

This placement matters for five reasons:

  1. Guardrails are downstream of the model, but upstream of the user.
  2. Stages require structured context from the orchestrator, not just raw text.
  3. Policy is versioned and externalized, so rules can be tuned without redeploying all stage code.
  4. Audit and observability are part of the design, not an afterthought.
  5. The pipeline is a hot-path dependency, so latency and failure handling matter as much as correctness.

End-To-End Dataflow

sequenceDiagram
    participant User
    participant Gateway
    participant Orch as Orchestrator
    participant Data as Domain Services
    participant FM as Foundation Model
    participant Guardrails
    participant Audit

    User->>Gateway: Chat message
    Gateway->>Orch: Authenticated request context
    Orch->>Orch: Sanitize input, load memory, classify intent
    Orch->>Data: Fetch catalog, retrieval, order, promo state
    Data-->>Orch: Structured context
    Orch->>FM: Prompt plus structured context
    FM-->>Orch: Draft response
    Orch->>Guardrails: Draft plus guardrail context
    Guardrails->>Guardrails: Run stages 1 through 6
    Guardrails-->>Orch: pass, modify, block, or redirect
    Orch->>Audit: Async stage findings and metrics
    Orch-->>Gateway: Approved response or fallback
    Gateway-->>User: Render final response

What Flows Into The Guardrails Layer

The pipeline should never receive only a plain string. It needs:

  • user_message
  • draft_response
  • classified_intent
  • authentication_state
  • retrieved_asins
  • catalog_snapshot
  • authorized_pii_fields
  • policy_version
  • locale
  • request_metadata

Without that structure, several stages become unreliable:

  • Price validation becomes text guessing instead of fact validation.
  • ASIN validation cannot distinguish real products from memorized products.
  • PII filtering cannot tell authorized account data from accidental leakage.
  • Scope checking cannot distinguish supported order help from disallowed advice.

Design Principles

  • Deterministic code owns hard business facts such as prices, discounts, ASINs, and entitlements.
  • The model can format and explain, but it should not be the source of truth for commerce facts.
  • Guardrails should be composable and independently observable.
  • Each stage should have a narrow responsibility and a clear action contract.
  • Low-risk violations can be modified; trust or legal violations should block.
  • Policy should be intent-aware, because not every stage applies equally to every request.
  • Guardrail tuning must be evaluated as a whole-system problem, not only stage by stage.

Pipeline Overview

Logical Stage Order

flowchart LR
    A[Draft Response] --> B[1 PII Filter]
    B --> C[2 Price Validator]
    C --> D[3 Toxicity Filter]
    D --> E[4 Competitor Filter]
    E --> F[5 ASIN Validator]
    F --> G[6 Scope Check]
    G --> H[Approved Response]

    B --> X[Fallback or Redacted Response]
    C --> X
    D --> X
    E --> X
    F --> X
    G --> X

Why This Order

  • PII comes first because redaction can preserve an otherwise useful answer.
  • Price comes early because incorrect pricing is a business liability.
  • Toxicity runs after deterministic correctness checks because it is the slowest stage.
  • Competitor and ASIN enforce brand and grounding constraints before delivery.
  • Scope runs last as the final capability boundary.

Stage Summary

| Stage | Main Question | Typical Action | Needs Structured Context? | Target Latency |
|---|---|---|---|---|
| PII Filter | Did the model expose unauthorized PII? | Modify or block | Yes | 3-5 ms |
| Price Validator | Are price claims backed by catalog truth? | Block | Yes | 1-3 ms |
| Toxicity Filter | Is the answer harmful or abusive? | Block or audit | Yes | 8-15 ms |
| Competitor Filter | Did the answer mention or redirect to competitors? | Block or redirect | Sometimes | Under 1 ms |
| ASIN Validator | Are product identifiers real and grounded? | Block | Yes | 1-3 ms |
| Scope Check | Is the answer within supported capability? | Block or redirect | Yes | Under 1 ms |

Intent-Aware Policy Matrix

One of the biggest sources of false positives in real systems is applying every rule to every request. MangaAssist uses intent-aware stage activation.

| Intent | PII | Price | Toxicity | Competitor | ASIN | Scope |
|---|---|---|---|---|---|---|
| recommendation | Required | Skip unless prices shown | Required | Required | Required | Required |
| product_question | Required | Required if price appears | Required | Required | Required | Required |
| price_comparison | Required | Required | Required | Redirect mode | Required | Required |
| order_tracking | Required | Skip | Required | Skip | Skip | Required |
| faq | Required | Conditional | Required | Skip | Skip | Required |
| chitchat | Required | Skip | Required | Skip | Skip | Required |

GUARDRAIL_POLICY = {
    "recommendation": {
        "enabled_stages": [
            "pii_filter",
            "toxicity_filter",
            "competitor_filter",
            "asin_validator",
            "scope_check",
        ],
        "price_validator": {"mode": "conditional"},
    },
    "price_comparison": {
        "enabled_stages": [
            "pii_filter",
            "price_validator",
            "toxicity_filter",
            "competitor_filter",
            "asin_validator",
            "scope_check",
        ],
        "competitor_filter": {"mode": "redirect"},
    },
}

The key idea is not only enable or disable. Some stages have modes:

  • strict: block on violation
  • modify: rewrite or redact
  • redirect: replace with a safe on-platform answer
  • audit_only: log but do not block
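The mode lookup can be sketched as a small helper that falls back to strict when a stage has no explicit override. This is a sketch mirroring the `GUARDRAIL_POLICY` shape above; `resolve_stage_mode` is an illustrative name, not a documented API.

```python
# Sketch: resolve the effective mode for a stage under an intent policy.
# Assumes the GUARDRAIL_POLICY shape shown above; "strict" is the default.

def resolve_stage_mode(policy: dict, stage_name: str) -> str:
    # A per-stage override like {"competitor_filter": {"mode": "redirect"}}
    # wins over the strict default.
    override = policy.get(stage_name, {})
    if isinstance(override, dict) and "mode" in override:
        return override["mode"]
    return "strict"

policy = {
    "enabled_stages": ["pii_filter", "competitor_filter"],
    "competitor_filter": {"mode": "redirect"},
}
print(resolve_stage_mode(policy, "competitor_filter"))  # redirect
print(resolve_stage_mode(policy, "pii_filter"))         # strict
```

Defaulting to strict means a missing or misspelled override fails closed rather than open.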

High-Level Design

| Component | Responsibility | Why It Exists |
|---|---|---|
| Guardrails Coordinator | Owns stage ordering, short-circuiting, and final action selection | Central orchestration point |
| Policy Resolver | Loads config by intent, locale, user type, and version | Prevents hard-coded rule sprawl |
| Stage Registry | Maps stage names to implementation objects | Enables modular rollout |
| Catalog Snapshot Cache | Provides low-latency product truth | Required for deterministic validation |
| Manga Allowlist Service | Supplies title, genre, and domain exceptions | Reduces false positives |
| Audit Logger | Emits structured stage-level decisions | Required for review and tuning |
| Metrics Emitter | Publishes latency, block rate, and false positive proxies | Required for operations |
| Fallback Factory | Builds safe user-facing responses | Avoids generic and frustrating blocks |

HLD Decision

The guardrails layer is implemented as an in-process module inside the chat orchestrator, not as a separate network hop.

That decision gives:

  • no extra service hop on every response
  • easier access to already-fetched request context
  • simpler hot-path error handling

The downside is tighter runtime coupling, but that is acceptable here because guardrails depend heavily on orchestration state anyway.
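One plausible shape for the Policy Resolver is a base default merged with per-intent overrides. This is a sketch under assumed names (`BASE_POLICY`, `INTENT_OVERRIDES`, `resolve`); the production schema may differ.

```python
# Sketch of a PolicyResolver: base defaults merged with per-intent
# overrides. Keys and defaults here are illustrative, not a real schema.

BASE_POLICY = {"enabled_stages": ["pii_filter", "toxicity_filter", "scope_check"]}

INTENT_OVERRIDES = {
    "price_comparison": {
        "enabled_stages": [
            "pii_filter", "price_validator", "toxicity_filter",
            "competitor_filter", "asin_validator", "scope_check",
        ],
        "competitor_filter": {"mode": "redirect"},
    },
}

def resolve(intent: str, locale: str, auth_state: str) -> dict:
    # Shallow merge: intent-level keys replace base keys wholesale,
    # which keeps enabled_stages unambiguous for each intent.
    merged = {**BASE_POLICY, **INTENT_OVERRIDES.get(intent, {})}
    merged["policy_context"] = {"locale": locale, "auth_state": auth_state}
    return merged

print(resolve("price_comparison", "en-US", "authenticated")["competitor_filter"])
```

A wholesale replacement of `enabled_stages` is deliberate: partial list merges make it hard to reason about which stages actually run for an intent.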


Low-Level Design

Core Interfaces

from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class GuardrailContext:
    request_id: str
    session_id: str
    intent: str
    locale: str
    user_message: str
    draft_response: str
    retrieved_asins: list[str]
    catalog_snapshot: dict[str, Any]
    authorized_pii_fields: list[str]
    auth_state: str
    policy_version: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class StageFinding:
    stage: str
    severity: str
    finding_type: str
    message: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class StageDecision:
    action: str
    updated_response: str | None = None
    findings: list[StageFinding] = field(default_factory=list)


@dataclass
class GuardrailOutcome:
    final_action: str
    response_to_deliver: str
    stage_results: list[StageDecision]
    blocked_stage: str | None = None


class GuardrailStage(Protocol):
    name: str

    def applies(self, ctx: GuardrailContext, policy: dict[str, Any]) -> bool:
        ...

    def evaluate(self, ctx: GuardrailContext, policy: dict[str, Any]) -> StageDecision:
        ...

Coordinator Class Diagram

classDiagram
    class GuardrailsCoordinator {
        +evaluate(ctx) GuardrailOutcome
        -resolvePolicy(intent, locale, authState) dict
        -runStage(stage, ctx, policy) StageDecision
        -emitAudit(outcome) None
    }

    class PolicyResolver {
        +resolve(intent, locale, authState) dict
    }

    class StageRegistry {
        +getEnabledStages(policy) list
    }

    class GuardrailStage {
        <<interface>>
        +applies(ctx, policy) bool
        +evaluate(ctx, policy) StageDecision
    }

    class PiiFilterStage
    class PriceValidatorStage
    class ToxicityFilterStage
    class CompetitorFilterStage
    class AsinValidatorStage
    class ScopeCheckStage

    GuardrailsCoordinator --> PolicyResolver
    GuardrailsCoordinator --> StageRegistry
    StageRegistry --> GuardrailStage
    GuardrailStage <|.. PiiFilterStage
    GuardrailStage <|.. PriceValidatorStage
    GuardrailStage <|.. ToxicityFilterStage
    GuardrailStage <|.. CompetitorFilterStage
    GuardrailStage <|.. AsinValidatorStage
    GuardrailStage <|.. ScopeCheckStage

Pipeline Runner

from dataclasses import replace


def evaluate_guardrails(ctx: GuardrailContext) -> GuardrailOutcome:
    policy = policy_resolver.resolve(ctx.intent, ctx.locale, ctx.auth_state)
    current_response = ctx.draft_response
    stage_results = []

    for stage in stage_registry.get_enabled_stages(policy):
        # Re-point the context at the latest (possibly modified) response.
        stage_ctx = replace(ctx, draft_response=current_response)

        if not stage.applies(stage_ctx, policy):
            continue

        result = stage.evaluate(stage_ctx, policy)
        stage_results.append(result)

        if result.action == "modify":
            current_response = result.updated_response or current_response
            continue

        if result.action in {"pass", "audit_only"}:
            continue

        if result.action in {"block", "redirect"}:
            safe_response = fallback_factory.build(stage.name, stage_ctx, result)
            return GuardrailOutcome(
                final_action=result.action,
                response_to_deliver=safe_response,
                stage_results=stage_results,
                blocked_stage=stage.name,
            )

    return GuardrailOutcome(
        final_action="pass",
        response_to_deliver=current_response,
        stage_results=stage_results,
    )

Runtime State Machine

stateDiagram-v2
    [*] --> DraftReady
    DraftReady --> EvaluateStage
    EvaluateStage --> Continue: pass
    EvaluateStage --> Modify: modify
    EvaluateStage --> Audit: audit_only
    EvaluateStage --> Blocked: block
    EvaluateStage --> Redirected: redirect

    Modify --> Continue
    Audit --> Continue
    Continue --> EvaluateStage: next stage
    Continue --> Approved: last stage passed

    Approved --> Delivered
    Blocked --> Delivered
    Redirected --> Delivered
    Delivered --> [*]

Detailed Implementation By Stage

Stage 1: PII Filter

Goal

Stop the model from exposing unauthorized personal data while preserving as much useful answer content as possible.

Why Output PII Filtering Still Matters

Even if input sanitization is strong, the model can still:

  • echo account data from authenticated context
  • hallucinate realistic-looking phone numbers or email addresses
  • combine fragments from retrieved content into a new sensitive-looking string

That is why input-side privacy controls and output-side privacy controls must both exist.

PII Filter Dataflow

flowchart LR
    A[Draft Response] --> B[Regex Detectors]
    A --> C[NER Detector]
    A --> D[Custom Commerce Detectors]
    B --> E[Span Merger]
    C --> E
    D --> E
    E --> F{Authorized By Context}
    F -->|Yes| G[Pass Span]
    F -->|No| H[Typed Redaction]
    G --> I[Updated Response]
    H --> I

Detailed Implementation Notes

  • Regex handles structured patterns such as email, phone, order ID, and postal code.
  • NER handles names and free-form addresses.
  • Custom detectors handle Amazon-specific entities such as tracking numbers and customer identifiers.
  • Character-name allowlists from the manga catalog reduce false positives on fictional names.
  • Authorization is context-sensitive. Order tracking may authorize masked address fragments, but recommendation flows do not.

Pseudocode

def pii_filter_stage(response: str, ctx: GuardrailContext) -> StageDecision:
    findings = scan_regex(response)
    findings += scan_ner(response, locale=ctx.locale)
    findings += scan_custom(response, ctx.metadata)
    merged = merge_overlapping_findings(findings)

    unauthorized = []
    for finding in merged:
        if finding["type"] not in ctx.authorized_pii_fields:
            unauthorized.append(finding)

    if not unauthorized:
        return StageDecision(action="pass")

    redacted = apply_typed_redactions(response, unauthorized)
    return StageDecision(
        action="modify",
        updated_response=redacted,
        findings=[
            StageFinding(
                stage="pii_filter",
                severity="high",
                finding_type=f["type"],
                message="Unauthorized PII redacted",
            )
            for f in unauthorized
        ],
    )

Action Choice

This stage usually modifies instead of blocks. If the answer is mostly useful and only contains one unsafe substring, full blocking creates unnecessary user pain. PII redaction is one of the few cases where inline repair is safer than full rejection.

Main Failure Modes

  • false positives on manga character names
  • false negatives on unusual international address formats
  • over-authorizing account data because the intent metadata was too broad

Stage SLO

  • P50 under 3 ms
  • P99 under 8 ms
  • false positive rate on character names under 3 percent

Stage 2: Price Validator

Goal

Ensure that every price, discount, savings claim, or promotional statement shown to the user is backed by the current catalog snapshot or a deterministic calculation.

Important Design Rule

The model should never invent or calculate commerce facts. The orchestrator should pre-compute them and the model should only explain them.

Preferred Response Contract

The most robust implementation is not free text first. It is structured claims first, natural language second.

{
  "price_claims": [
    {
      "asin": "B0ABC12345",
      "currency": "USD",
      "display_price": 12.99,
      "source": "catalog_snapshot"
    }
  ],
  "derived_claims": [
    {
      "type": "bundle_savings",
      "asin": "B0BOXSET123",
      "individual_total": 207.84,
      "bundle_price": 149.99,
      "savings": 57.85
    }
  ]
}

If the model produces only free text without structured price claims, the validator must infer claims from raw text, and reliability drops.

Validation Flow

flowchart TD
    A[Draft Response and Price Claims] --> B{Any Price Mentioned}
    B -->|No| P[PASS]
    B -->|Yes| C[Map Each Claim To ASIN]
    C --> D{ASIN In Catalog Snapshot}
    D -->|No| X[BLOCK Unknown Product Price]
    D -->|Yes| E[Compare Display Price]
    E --> F{Exact Match Within Tolerance}
    F -->|No| Y[BLOCK Price Mismatch]
    F -->|Yes| G{Derived Claim Present}
    G -->|No| P
    G -->|Yes| H[Recompute Savings Percent Math]
    H --> I{Derived Claim Correct}
    I -->|Yes| P
    I -->|No| Z[BLOCK Invalid Derived Math]

What This Stage Checks

  • raw price values
  • currency
  • promo IDs and discount presence
  • bundle savings arithmetic
  • percent-off calculations
  • "cheaper than" or "save more" comparative claims

Pseudocode

def price_validator_stage(response: str, ctx: GuardrailContext) -> StageDecision:
    claims = extract_price_claims(response, ctx.metadata.get("structured_response"))
    catalog = ctx.catalog_snapshot
    findings = []

    for claim in claims:
        asin = claim["asin"]
        if asin not in catalog:
            findings.append(("unknown_asin", asin))
            continue

        actual = catalog[asin]["price"]
        if abs(claim["display_price"] - actual) > 0.01:
            findings.append(("price_mismatch", asin))

        if "derived" in claim:
            if not validate_derived_claim(claim["derived"], catalog[asin], catalog):
                findings.append(("invalid_derived_math", asin))

    if findings:
        return StageDecision(
            action="block",
            findings=[
                StageFinding(
                    stage="price_validator",
                    severity="critical",
                    finding_type=item[0],
                    message=f"Price validation failed for {item[1]}",
                )
                for item in findings
            ],
        )

    return StageDecision(action="pass")
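The `validate_derived_claim` call above can be sketched for the `bundle_savings` case. This is a minimal version under assumptions: the claim shape follows the structured contract shown earlier, and a hypothetical `member_asins` field on the bundle's catalog entry lists its component volumes.

```python
# Sketch of validate_derived_claim for the bundle_savings case.
# All arithmetic is recomputed from catalog truth, never trusted
# from model text. member_asins is an assumed catalog field.

def validate_derived_claim(derived: dict, bundle_entry: dict, catalog: dict) -> bool:
    if derived["type"] != "bundle_savings":
        return False  # unknown derived claim types fail closed

    # Recompute the sum of individual volume prices from the catalog.
    individual_total = round(
        sum(catalog[a]["price"] for a in bundle_entry["member_asins"]), 2
    )
    expected_savings = round(individual_total - bundle_entry["price"], 2)

    return (
        abs(derived["individual_total"] - individual_total) <= 0.01
        and abs(derived["savings"] - expected_savings) <= 0.01
    )

catalog = {
    "VOL1": {"price": 9.99},
    "VOL2": {"price": 9.99},
    "BOX": {"price": 14.99, "member_asins": ["VOL1", "VOL2"]},
}
claim = {"type": "bundle_savings", "individual_total": 19.98, "savings": 4.99}
print(validate_derived_claim(claim, catalog["BOX"], catalog))  # True
```

Failing closed on unknown claim types matters: a new derived-claim type added to the prompt without a matching validator should block, not slip through.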

Why This Stage Blocks Instead Of Fixing Inline

Inline correction sounds attractive, but it is risky unless the response is fully structured and deterministic. A partially corrected commerce answer can still leave surrounding language inconsistent.

Example:

  • The validator can repair $9.99 to $12.99.
  • But the sentence may still claim "saving you $9.85" when the real savings is $57.85.

That is why a wrong price normally triggers a safe fallback or a controlled regeneration path outside the stage itself.

Main Failure Modes

  • stale catalog snapshot
  • free-text price references not mapped to an ASIN
  • derived claim math that uses correct arithmetic on hallucinated inputs

Stage SLO

  • P50 under 2 ms
  • P99 under 5 ms
  • zero tolerance for shipped incorrect prices

Stage 3: Toxicity Filter

Goal

Block harmful, abusive, sexual, or unsafe responses without breaking legitimate manga discussion.

Why This Stage Is Hard In The Manga Domain

Generic classifiers do not understand that many popular titles contain words which look unsafe out of context:

| Term | Why A Generic Model Flags It | Why MangaAssist Must Often Allow It |
|---|---|---|
| Chainsaw Man | weapon keyword | legitimate series title |
| Attack on Titan | violence keyword | legitimate franchise |
| Demon Slayer | violent and occult keywords | legitimate franchise |
| Death Note | death keyword | legitimate franchise |
| Hell's Paradise | profanity-adjacent term | legitimate title |

Two-Layer Toxicity Architecture

flowchart TD
    A[Draft Response] --> B[Generic Toxicity Classifier]
    B -->|Clean| P[PASS]
    B -->|Flagged| C[Domain Override Layer]
    C --> D{Flagged Terms In Manga Allowlist}
    D -->|No| X[BLOCK]
    D -->|Yes| E{Term Used In Product Context}
    E -->|No| X
    E -->|Yes| F{Overall Severity Above Hard Threshold}
    F -->|Yes| X
    F -->|No| P

Implementation Approach

Layer 1 handles general unsafe language detection.

Layer 2 applies domain-aware logic:

  • title allowlists
  • genre term allowlists
  • context checks against retrieved products
  • higher thresholds when the term clearly refers to a product title

Pseudocode

def toxicity_filter_stage(response: str, ctx: GuardrailContext) -> StageDecision:
    result = generic_toxicity_provider.score(response)
    if result["is_clean"]:
        return StageDecision(action="pass")

    findings = []
    for term in result["flagged_terms"]:
        if is_manga_allowlisted(term) and is_product_context(term, response, ctx.metadata):
            continue

        findings.append(
            StageFinding(
                stage="toxicity_filter",
                severity="critical" if result["score"] >= 0.8 else "high",
                finding_type="toxic_content",
                message=f"Unsafe term detected: {term}",
            )
        )

    if findings:
        return StageDecision(action="block", findings=findings)

    if result["score"] >= 0.65:
        return StageDecision(
            action="audit_only",
            findings=[
                StageFinding(
                    stage="toxicity_filter",
                    severity="medium",
                    finding_type="borderline_toxicity",
                    message="Borderline content passed after domain override",
                )
            ],
        )

    return StageDecision(action="pass")
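The `is_product_context` helper used above can be sketched as a title-containment check. This is a simplified signature (a title set instead of `ctx.metadata`), and the title data is illustrative:

```python
# Sketch of is_product_context: a flagged term counts as product
# context when it sits inside a known catalog title that actually
# appears in the response. Simplified signature; titles illustrative.

def is_product_context(term: str, response: str, title_allowlist: set[str]) -> bool:
    term_lower = term.lower()
    response_lower = response.lower()
    for title in title_allowlist:
        title_lower = title.lower()
        # The term must belong to a title, and that full title must
        # appear in the response, not just the bare term.
        if term_lower in title_lower and title_lower in response_lower:
            return True
    return False

titles = {"Chainsaw Man", "Attack on Titan"}
print(is_product_context("chainsaw", "Try Chainsaw Man volume 1.", titles))  # True
print(is_product_context("chainsaw", "I will chainsaw you.", titles))        # False
```

Requiring the full title to appear, not just the flagged token, is what lets "Chainsaw Man" pass while bare threatening usage still blocks.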

Key Calibration Rules

  • A high generic toxicity score is not enough to block if the flagged tokens map to known product titles and the sentence is clearly catalog-oriented.
  • A low generic score is not enough to pass if the phrasing is directly abusive or threatening.
  • Borderline cases should feed the audit loop so the threshold can be tuned with real data.

Main Failure Modes

  • false positives on title names
  • false negatives on sarcastic or indirect abuse
  • allowlist drift when new series are added

Stage SLO

  • P50 under 10 ms
  • P99 under 20 ms
  • false positive rate on title mentions under 1 percent after override

Stage 4: Competitor Filter

Goal

Prevent the assistant from steering users to competitor stores or making unsupported competitive claims.

What This Stage Should Catch

  • direct competitor names
  • "buy it elsewhere" phrasing
  • unsupported claims like "cheaper at X"
  • comparative calls to action that pull the user off-platform

Design Detail That Usually Gets Missed

This stage validates the model output, not the user message.

If the user asks:

Is it cheaper on another store?

the system can still answer safely by redirecting:

I can help compare editions, formats, and current offers available here on Amazon.

The user is not blocked. The model output is constrained.

Competitor Decision Flow

flowchart LR
    A[Draft Response] --> B[Regex and Entity Match]
    B --> C{Competitor Mentioned}
    C -->|No| P[PASS]
    C -->|Yes| D{Policy Mode}
    D -->|strict| X[BLOCK]
    D -->|redirect| Y[Return Safe Redirect]

Pseudocode

COMPETITOR_PATTERNS = [
    r"\bBarnes\s*&?\s*Noble\b",
    r"\bRight\s*Stuf\b",
    r"\bBookWalker\b",
    r"\bKinokuniya\b",
    r"\bCrunchyroll\s*Store\b",
    r"\bcheaper\s+(?:at|on|from)\b",
    r"\bbuy\s+(?:it|this|them)\s+(?:at|on|from)\s+(?!Amazon)\b",
]


def competitor_filter_stage(response: str, mode: str) -> StageDecision:
    for pattern in COMPETITOR_PATTERNS:
        match = re.search(pattern, response, re.IGNORECASE)
        if not match:
            continue

        if mode == "redirect":
            return StageDecision(
                action="redirect",
                findings=[
                    StageFinding(
                        stage="competitor_filter",
                        severity="high",
                        finding_type="competitor_mention",
                        message=f"Redirected competitor phrasing: {match.group(0)}",
                    )
                ],
            )

        return StageDecision(
            action="block",
            findings=[
                StageFinding(
                    stage="competitor_filter",
                    severity="high",
                    finding_type="competitor_mention",
                    message=f"Blocked competitor phrasing: {match.group(0)}",
                )
            ],
        )

    return StageDecision(action="pass")
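The comparative-phrasing pattern above relies on a negative lookahead to spare on-Amazon phrasing. A quick self-contained check of that behavior:

```python
import re

# Quick check of the comparative-phrasing pattern from
# COMPETITOR_PATTERNS: the (?!Amazon) lookahead lets on-platform
# phrasing through while flagging off-platform redirection.

pattern = r"\bbuy\s+(?:it|this|them)\s+(?:at|on|from)\s+(?!Amazon)\b"

print(bool(re.search(pattern, "You can buy it from BookWalker", re.IGNORECASE)))  # True
print(bool(re.search(pattern, "You can buy it from Amazon", re.IGNORECASE)))      # False
```

Note that under `re.IGNORECASE` the lookahead is also case-insensitive, so "amazon" in any casing is exempted.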

Main Failure Modes

  • patterns that are too broad and accidentally match intra-Amazon comparisons
  • missing competitor aliases
  • retrieved policy text or user text echoed into the answer without normalization

Stage SLO

  • P50 under 1 ms
  • P99 under 2 ms

Stage 5: ASIN Validator

Goal

Ensure that every product identifier or product card shown to the user is real, catalog-valid, and grounded in the candidate set retrieved for the current request.

This Stage Is Really About Grounding

The failure is not only fake ASIN. It is also:

  • real ASIN but not retrieved for this request
  • real ASIN from the wrong locale or edition
  • product title in prose that does not match the attached product card

ASIN Validation Dataflow

flowchart TD
    A[Draft Response] --> B[Extract Mentioned ASINs and Product Titles]
    B --> C[Compare Against Catalog Snapshot]
    C --> D{Exists In Catalog}
    D -->|No| X[BLOCK Hallucinated Product]
    D -->|Yes| E[Compare Against Retrieved Candidate Set]
    E --> F{In Retrieved Set}
    F -->|No| Y[BLOCK Ungrounded Product]
    F -->|Yes| G[Compare Title Format Locale]
    G --> H{Consistent}
    H -->|No| Z[BLOCK Inconsistent Product Metadata]
    H -->|Yes| P[PASS]

Pseudocode

def asin_validator_stage(response: str, ctx: GuardrailContext) -> StageDecision:
    mentioned_asins = extract_asins(response, ctx.metadata.get("structured_response"))
    retrieved = set(ctx.retrieved_asins)
    catalog = ctx.catalog_snapshot
    findings = []

    for asin in mentioned_asins:
        if asin not in catalog:
            findings.append(("hallucinated_asin", asin))
            continue

        if asin not in retrieved:
            findings.append(("ungrounded_asin", asin))
            continue

        if not metadata_consistent(asin, response, catalog[asin]):
            findings.append(("metadata_mismatch", asin))

    if findings:
        return StageDecision(
            action="block",
            findings=[
                StageFinding(
                    stage="asin_validator",
                    severity="critical",
                    finding_type=item[0],
                    message=f"ASIN validation failed for {item[1]}",
                )
                for item in findings
            ],
        )

    return StageDecision(action="pass")

Why Both Catalog And Retrieval Checks Are Needed

  • Catalog check prevents fake products.
  • Retrieval check prevents the model from using training memory instead of the live candidate set.

Without the retrieval check, a model can recommend a real but contextually wrong product and still look credible.

Main Failure Modes

  • model recalls a real product from training data
  • title text refers to one edition, but the structured card points to another
  • locale mismatch between JP and EN catalog variants

Stage SLO

  • P50 under 2 ms
  • P99 under 5 ms

Stage 6: Scope Check

Goal

Ensure the response stays within MangaAssist's allowed capability boundary.

Scope Means More Than Topic

A response can be about manga and still be out of scope if it:

  • gives legal or medical advice
  • claims to have performed an account action it cannot perform
  • offers unsupported guarantees
  • drifts into politics, religion, or personal counseling

What This Stage Checks

  • intent-to-response alignment
  • restricted-domain phrases
  • unsupported action verbs such as "I refunded", "I changed your address", or "I cancelled the order" when no execution token exists
  • capability mismatch for guest users versus authenticated users

Pseudocode

OUT_OF_SCOPE_PATTERNS = [
    r"\bmedical advice\b",
    r"\blegal advice\b",
    r"\binvestment advice\b",
    r"\bhow to hack\b",
    r"\bdownload free manga\b",
]

UNSUPPORTED_ACTION_PATTERNS = [
    r"\bI (?:refunded|cancelled|changed|updated)\b",
]


def scope_check_stage(response: str, ctx: GuardrailContext) -> StageDecision:
    for pattern in OUT_OF_SCOPE_PATTERNS:
        if re.search(pattern, response, re.IGNORECASE):
            return StageDecision(
                action="block",
                findings=[
                    StageFinding(
                        stage="scope_check",
                        severity="critical",
                        finding_type="out_of_scope",
                        message="Response entered a restricted domain",
                    )
                ],
            )

    if ctx.auth_state == "guest":
        for pattern in UNSUPPORTED_ACTION_PATTERNS:
            if re.search(pattern, response, re.IGNORECASE):
                return StageDecision(
                    action="redirect",
                    findings=[
                        StageFinding(
                            stage="scope_check",
                            severity="high",
                            finding_type="unsupported_action_claim",
                            message="Guest response implied an unsupported account action",
                        )
                    ],
                )

    return StageDecision(action="pass")
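The unsupported-action pattern above is worth sanity-checking: it should catch first-person past-tense action claims while leaving neutral guidance alone.

```python
import re

# Sanity check on the UNSUPPORTED_ACTION_PATTERNS entry above:
# first-person action claims match, neutral guidance does not.

pattern = r"\bI (?:refunded|cancelled|changed|updated)\b"

print(bool(re.search(pattern, "I refunded your order just now.", re.IGNORECASE)))
print(bool(re.search(pattern, "You can request a refund from Your Orders.", re.IGNORECASE)))
```

This distinction is the point of the stage: telling a user how to request a refund is in scope, while claiming to have issued one is a false action claim.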

Why This Stage Is Last

This stage benefits from everything earlier in the pipeline:

  • PII has already been redacted if needed.
  • pricing and grounding have already been verified.
  • toxic or competitor content has already been removed from the decision space.

So the final scope decision can focus on capability boundaries, not low-level content hazards.

Main Failure Modes

  • overly broad patterns that block educational manga discussions
  • under-constrained action language that implies system actions were executed
  • guest versus authenticated capability drift

Stage SLO

  • P50 under 1 ms
  • P99 under 2 ms

Action Model: Pass, Modify, Redirect, Block, Audit

Not every violation should produce the same outcome.

```mermaid
flowchart TD
    A[Stage Finding] --> B{Risk Type}
    B -->|Recoverable text issue| C[MODIFY]
    B -->|Brand or capability mismatch| D[REDIRECT]
    B -->|Critical trust or safety risk| E[BLOCK]
    B -->|Borderline low confidence| F[AUDIT ONLY]
    B -->|No issue| G[PASS]
```
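The routing above can be sketched as a precedence rule over per-stage outcomes: the pipeline delivers the most restrictive action any stage requested. `ACTION_PRECEDENCE` and `resolve_final_action` are illustrative names, not the production API.

```python
# Most restrictive first; "pass" wins only when nothing else fired.
ACTION_PRECEDENCE = ["block", "redirect", "modify", "audit_only", "pass"]


def resolve_final_action(stage_actions: list[str]) -> str:
    """Pick the most restrictive action any stage requested."""
    for action in ACTION_PRECEDENCE:
        if action in stage_actions:
            return action
    return "pass"
```

For example, `resolve_final_action(["pass", "modify", "block"])` resolves to `"block"`: one critical finding overrides any number of softer ones.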

Action Mapping

| Action | Typical Use | User Experience |
| --- | --- | --- |
| pass | no issue detected | normal response |
| modify | unauthorized PII or safe text repair | answer preserved with redactions |
| redirect | competitor comparison, unsupported action | safe branded alternative |
| block | wrong price, toxic content, ungrounded product | fallback response |
| audit_only | borderline or sampled review case | normal response plus review signal |

Fallback Strategy

Fallback text should be stage-specific, not generic:

  • price_validator: "Let me pull the exact current pricing from the product page."
  • competitor_filter: "I can help compare editions, formats, and current Amazon offers."
  • scope_check: "I can help with manga products, order support, shipping, returns, and related store questions."

Generic "I can't help with that" responses hide the root cause and reduce usefulness.
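A minimal sketch of stage-specific fallbacks is a lookup keyed by the blocking stage, with a generic message only as a last resort for unmapped stages. The dictionary and helper names here are hypothetical.

```python
# Stage-specific fallback text, keyed by the stage that blocked the answer.
STAGE_FALLBACKS = {
    "price_validator": "Let me pull the exact current pricing from the product page.",
    "competitor_filter": "I can help compare editions, formats, and current Amazon offers.",
    "scope_check": "I can help with manga products, order support, shipping, "
                   "returns, and related store questions.",
}


def fallback_for(blocked_stage: str) -> str:
    # Generic text is only the last resort for stages with no tailored message.
    return STAGE_FALLBACKS.get(
        blocked_stage,
        "I can't help with that, but I can answer store and product questions.",
    )
```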


Detailed Scenarios

Scenario 1: "Chainsaw Man" Triggered Toxicity Blocks

Problem

The generic toxicity provider started blocking recommendation answers that mentioned violent-sounding titles.

Symptom

Users asked for recommendations similar to Chainsaw Man or Attack on Titan and received safe fallbacks instead of valid recommendations.

Scenario Flow

```mermaid
flowchart LR
    A[User asks for dark fantasy recommendations] --> B[Model drafts: Try Chainsaw Man and Hell's Paradise]
    B --> C[Generic toxicity score high]
    C --> D[Domain override checks title allowlist]
    D --> E{Terms are valid catalog titles}
    E -->|Yes| F[Pass response]
    E -->|No| G[Block response]
```

Fix

  1. Added a catalog-backed manga title allowlist.
  2. Added product-context checks so the same term can pass in catalog context and block in abusive context.
  3. Tuned thresholds using a domain-specific evaluation set instead of generic moderation benchmarks.
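The override logic in steps 1 and 2 can be sketched as follows. The `0.8` threshold, the allowlist contents, and the `in_product_context` signal are assumptions for illustration, not the tuned production values.

```python
# Catalog-backed allowlist of legitimate titles (refreshed from the catalog).
TITLE_ALLOWLIST = {"chainsaw man", "attack on titan", "hell's paradise"}


def toxicity_action(flagged_term: str, score: float, in_product_context: bool) -> str:
    """Apply the domain override on top of a generic toxicity score."""
    if score < 0.8:  # below threshold: the generic provider raised no finding
        return "pass"
    # Same term can pass in catalog context and block in abusive context.
    if flagged_term.lower() in TITLE_ALLOWLIST and in_product_context:
        return "pass"
    return "block"
```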

Prevention

  • daily allowlist refresh from the catalog
  • weekly audit of blocked toxicity cases
  • dashboard for false positive rate by intent

Scenario 2: Price Validator Caught Hallucinated Bundle Math

Problem

The model gave confident bundle-savings explanations that were internally consistent but factually wrong because one source value was hallucinated.

Symptom

The response sounded reasonable:

You save $9.85 by buying the box set.

But the model had invented the single-volume price and then performed correct math on the wrong base value.

Scenario Flow

```mermaid
flowchart TD
    A[Catalog truth: volume price 12.99 and box set 149.99] --> B[Model invents volume price 9.99]
    B --> C[Model computes savings from wrong base]
    C --> D[Response looks mathematically coherent]
    D --> E[Price validator recomputes against catalog]
    E --> F[Block invalid answer]
```

Fix

  1. Moved all arithmetic into deterministic orchestrator code.
  2. Added structured price_claims and derived_claims to the response contract.
  3. Kept the validator as a hard gate even after prompt changes.
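A minimal sketch of the deterministic recomputation in step 1, assuming the structured `price_claims` carry a claimed savings figure. Field names and the tolerance are illustrative; `Decimal` avoids float rounding surprises in commerce math.

```python
from decimal import Decimal


def validate_savings_claim(claimed_savings: Decimal,
                           catalog_volume_price: Decimal,
                           catalog_bundle_price: Decimal,
                           volumes: int,
                           tolerance: Decimal = Decimal("0.01")) -> bool:
    """Recompute bundle savings from catalog truth and compare to the claim."""
    true_savings = catalog_volume_price * volumes - catalog_bundle_price
    return abs(true_savings - claimed_savings) <= tolerance
```

With the scenario's catalog truth (12.99 per volume, 149.99 for a 12-volume box set), the real savings are 5.89, so the model's 9.85 claim fails the gate regardless of how coherent the surrounding arithmetic sounds.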

Prevention

  • never ask the model to invent bundle arithmetic
  • golden tests for savings, percentages, and promotion wording
  • alert on any price-related block spike

Scenario 3: ASIN Validator Caught Grounding Violations

Problem

The model occasionally recommended real products it had likely seen during training, even though those ASINs were not in the retrieved candidate set for the current request.

Symptom

Recommendation answers included real but contextually wrong products, which made the response look plausible while quietly breaking grounding.

Scenario Flow

```mermaid
flowchart LR
    A[Candidate set from retrieval] --> B[ASIN 1, ASIN 2, ASIN 3]
    C[Model parametric memory] --> D[Real ASIN 9 from old knowledge]
    B --> E[Draft response]
    D --> E
    E --> F[ASIN validator compares response against candidate set]
    F --> G{All ASINs grounded}
    G -->|No| H[Block response]
    G -->|Yes| I[Pass response]
```

Fix

  1. Enforced candidate-set-only output through prompt constraints.
  2. Validated all ASINs in both text and structured product cards.
  3. Added metadata consistency checks so title, edition, and locale also had to match.
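The core set comparison in step 2 can be sketched like this. The regex covers only the common B0-prefixed 10-character ASIN shape; book ASINs that reuse ISBN-10s, and extraction from structured product cards, would need additional handling.

```python
import re

# Common B0-prefixed 10-character ASIN shape (an assumption for this sketch).
ASIN_PATTERN = re.compile(r"\bB0[A-Z0-9]{8}\b")


def ungrounded_asins(response_text: str, candidate_asins: set[str]) -> set[str]:
    """Return ASINs mentioned in the response that retrieval did not supply."""
    mentioned = set(ASIN_PATTERN.findall(response_text))
    return mentioned - candidate_asins
```

A non-empty result means the model pulled a product from parametric memory instead of the candidate set, which is exactly the block condition in the flow above.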

Prevention

  • replay tests with known memorized-product prompts
  • candidate-set mismatch metric
  • regression tests on multilingual catalog variants

Scenario 4: Cross-Stage Over-Blocking After Policy Tightening

Problem

After one safety incident, the team tightened multiple stages independently. Each change looked reasonable in isolation, but together they caused a large fallback spike.

Symptom

Users received fallbacks for legitimate questions such as:

  • "Is the deluxe edition cheaper than the standard edition?"
  • "Is Death Note appropriate for older teens?"
  • "Compare the paperback and hardcover versions."

Scenario Flow

```mermaid
flowchart TD
    A[Policy tightening deploy] --> B[Broader competitor regex]
    A --> C[Stricter scope rules]
    A --> D[Lower toxicity threshold]
    B --> E[Combined false positive spike]
    C --> E
    D --> E
    E --> F[Fallback rate rises]
    F --> G[Intent-aware policy review]
    G --> H[Stage-specific tuning and combined evaluation]
```

Fix

  1. Added intent-aware stage activation.
  2. Replaced broad competitor phrases with context-dependent patterns.
  3. Added combined pipeline regression tests on representative traffic.
  4. Started measuring fallback rate by intent, not only by stage.
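Intent-aware stage activation (fix 1) can be sketched as a policy table mapping intent to the stages worth running. The stage names match the pipeline; the intent keys and selections here are assumptions.

```python
# Which stages run for which intent; unknown intents fail safe to all stages.
ALL_STAGES = ["pii_filter", "price_validator", "toxicity_filter",
              "competitor_filter", "asin_validator", "scope_check"]

STAGES_BY_INTENT = {
    "recommendation": ["pii_filter", "toxicity_filter", "competitor_filter",
                       "asin_validator", "scope_check"],
    "price_comparison": ["pii_filter", "price_validator", "competitor_filter",
                         "asin_validator", "scope_check"],
    "order_support": ["pii_filter", "toxicity_filter", "scope_check"],
}


def active_stages(intent: str) -> list[str]:
    return STAGES_BY_INTENT.get(intent, ALL_STAGES)
```

The point is that a recommendation answer never hits the price validator, so tightening price rules cannot add false positives to recommendation traffic.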

Prevention

  • every policy change must pass offline combined evaluation
  • canary release for new policies
  • rollback trigger on fallback rate or false positive proxy spikes

Streaming Considerations

Guardrails change how streaming must be implemented.

If the system streams raw model tokens directly to the user before validation completes, several failures become possible:

  • a wrong price is shown before Stage 2 blocks it
  • a competitor mention leaks before Stage 4 blocks it
  • an ungrounded ASIN is visible before Stage 5 catches it

Practical Design Choice

For MangaAssist, the orchestrator buffers the draft response until guardrail validation finishes, then streams the approved response to the client.

This slightly increases time-to-first-token, but it avoids partial unsafe output. In commerce and trust-sensitive flows, that is the correct trade.
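A minimal async sketch of buffer-then-stream, assuming a token iterator from the model, a synchronous `validate()` that returns the approved (possibly modified or fallback) text, and a `send_chunk` callback to the client. All three names are illustrative.

```python
import asyncio


async def buffered_stream(token_source, validate, send_chunk, chunk_size=32):
    # 1. Buffer the full draft instead of forwarding raw model tokens.
    draft = "".join([tok async for tok in token_source])
    # 2. Run the guardrail pipeline on the complete draft.
    approved = validate(draft)
    # 3. Stream the approved text so the client keeps its streaming contract.
    for i in range(0, len(approved), chunk_size):
        await send_chunk(approved[i:i + chunk_size])
```

The client still sees incremental chunks; it just sees them after validation rather than during generation, which is where the extra time-to-first-token comes from.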


Observability And Audit

What Must Be Measured

| Metric | Meaning | Why It Matters |
| --- | --- | --- |
| guardrail_block_rate | overall percentage of blocked answers | product health and safety strictness |
| guardrail_block_rate_by_stage | block rate per stage | tuning and root cause isolation |
| guardrail_modify_rate | percentage of answers redacted or repaired | privacy and repair pressure |
| fallback_rate_by_intent | fallback frequency split by use case | user experience impact |
| guardrail_latency_p95 | high-percentile validation latency | hot-path performance |
| ungrounded_asin_rate | percentage of answers with candidate-set violations | grounding health |
| price_mismatch_rate | price validation failures | commerce trust |
| toxicity_false_positive_rate | audited false positives | domain calibration quality |
| policy_version_traffic_share | request share by active policy version | rollout visibility |

Audit Record Example

```json
{
  "timestamp": "2026-03-24T15:20:41Z",
  "request_id": "req_123",
  "session_id": "sess_456",
  "intent": "price_comparison",
  "policy_version": "guardrails_v17",
  "final_action": "block",
  "blocked_stage": "price_validator",
  "stage_results": [
    {"stage": "pii_filter", "action": "pass", "latency_ms": 3.1},
    {"stage": "price_validator", "action": "block", "latency_ms": 1.8}
  ],
  "response_delivered": false
}
```

Logging Rule

Audit logs should store decision metadata and minimal finding metadata. They should not store raw PII values or entire unsafe response strings by default.
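The logging rule can be enforced with a whitelist-style sanitizer that copies only decision metadata out of a finding. The field names mirror the audit record example; the `matched_text` field being dropped is an assumption about what raw finding payloads contain.

```python
# Whitelist of finding fields safe to persist in audit logs.
AUDIT_SAFE_FIELDS = ("stage", "severity", "finding_type")


def sanitize_finding(finding: dict) -> dict:
    """Keep stage / severity / type; never persist matched raw text or PII."""
    return {k: finding[k] for k in AUDIT_SAFE_FIELDS if k in finding}
```

Whitelisting is deliberately preferred over blacklisting here: a new field added to findings later is excluded from logs by default rather than leaked by default.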


Testing Strategy

Test Layers

  1. Unit tests: each stage is tested with deterministic fixtures.
  2. Contract tests: structured response contracts are tested so validators receive stable fields.
  3. Golden set evaluation: realistic prompts and expected pass or block outcomes are replayed on every policy change.
  4. Combined pipeline tests: multi-stage interactions are tested to catch false-positive cascades.
  5. Shadow mode: new policies run without user impact and compare outcomes to the live policy.
  6. Canary rollout: a small traffic slice uses the new policy version before full rollout.

Rollout Flow

```mermaid
flowchart LR
    A[Policy Change] --> B[Offline Evaluation]
    B --> C{Passes Golden Set}
    C -->|No| X[Reject Change]
    C -->|Yes| D[Shadow Mode]
    D --> E{Metrics Stable}
    E -->|No| X
    E -->|Yes| F[Canary Release]
    F --> G{Fallback and Block Rates Healthy}
    G -->|No| H[Rollback]
    G -->|Yes| I[Full Rollout]
```

What The Golden Set Must Include

  • popular manga titles with violent words
  • box-set and promotion price explanations
  • competitor-baiting prompts
  • guest-user account questions
  • multilingual title and locale variants
  • real and hallucinated ASIN examples
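A golden-set replay harness can be as simple as the sketch below: each case pairs a draft response and context with the expected final action, and any mismatch fails the policy change. `run_pipeline` is a stand-in for the real guardrail entry point; the case schema is an assumption.

```python
def evaluate_golden_set(cases, run_pipeline):
    """Replay every golden case; return (name, expected, got) for mismatches."""
    failures = []
    for case in cases:
        got = run_pipeline(case["response"], case["ctx"])
        if got != case["expected_action"]:
            failures.append((case["name"], case["expected_action"], got))
    return failures
```

In the rollout flow above, a non-empty failure list at the "Passes Golden Set" gate rejects the change before it ever reaches shadow mode.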

Failure Handling And Resilience

What If A Detector Times Out

Not all stage dependencies are equal, so timeout behavior must be stage-specific.

| Stage | Timeout Policy | Reason |
| --- | --- | --- |
| PII Filter | fail closed for high-confidence regex, fail open for NER with audit | regex already catches most structured PII |
| Price Validator | fail closed | wrong prices cannot be shown |
| Toxicity Filter | fail closed on high-risk intents, otherwise fallback | safer to block than ship harmful output |
| Competitor Filter | fail closed | brand risk is cheap to prevent |
| ASIN Validator | fail closed | grounding is required |
| Scope Check | fail closed | unsupported claims are risky |

The important design point is that "availability first" is not always correct. For trust-sensitive commerce answers, failing closed is often the safer product choice.
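A sketch of stage-specific timeout handling, assuming each stage is a plain callable returning an action string. The `FAIL_MODE` table, timeout value, and thread-based mechanics are illustrative, not the production implementation.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Per-stage behavior on timeout; anything unlisted fails closed by default.
FAIL_MODE = {"price_validator": "closed", "pii_filter": "open_with_audit"}


def run_stage_with_timeout(stage_name, stage_fn, response, timeout_s=0.05):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(stage_fn, response)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        # Fail closed blocks the answer; fail open passes it with an audit flag.
        if FAIL_MODE.get(stage_name, "closed") == "closed":
            return "block"
        return "audit_only"
    finally:
        pool.shutdown(wait=False, cancel_futures=True)
```

Note the default in `FAIL_MODE.get`: a newly added stage without an explicit policy fails closed, matching the table above.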


Architecture Decisions And Tradeoffs

| Decision | Chosen Approach | Alternative | Upside | Downside |
| --- | --- | --- | --- | --- |
| Placement | in-process with orchestrator | standalone service | lower latency and easier access to context | tighter runtime coupling |
| Execution model | serial fail-fast | parallel validation | simpler tracing and cheaper on early blocks | latency is cumulative |
| PII action | modify first | block everything | preserves useful content | requires safe redaction logic |
| Price action | hard block | inline correction | avoids accidental wrong commerce claims | can feel less helpful |
| Toxicity design | generic model plus domain override | generic model only | far fewer false positives on manga titles | allowlist maintenance |
| Grounding check | catalog plus retrieval | catalog only | catches memorized products | needs more request context |
| Policy routing | intent-aware | same rules for all traffic | fewer false positives | more configuration complexity |
| Streaming | buffer before render | raw token streaming | no unsafe partial output | slightly slower first token |

Follow-Up Questions And Deep Answers

Follow-Up Question 1: Why not rely on Bedrock Guardrails or one moderation API for everything?

Because MangaAssist has domain-specific business constraints that a generic moderation API does not understand. A moderation API can help with toxicity, but it cannot authoritatively validate catalog prices, ASIN grounding, competitor policy, or capability scope. Those checks require structured business data and deterministic logic.

The deeper lesson is that guardrails in commerce are a mix of safety policy and business correctness. A single generic model can assist, but it cannot be the only enforcement mechanism.

Follow-Up Question 2: Why are guardrails post-generation instead of only pre-generation?

Pre-generation controls reduce risk. They do not prove the final output is safe. The model can still hallucinate or synthesize new content after seeing safe inputs.

Post-generation guardrails inspect what will actually be shown to the user. That is the only place where the system can deterministically validate the final answer against catalog truth, scope rules, and brand policy.

Follow-Up Question 3: Why does PII usually redact while price usually blocks?

The risk profiles are different. If the model says something useful but accidentally includes an email address, the unsafe part is localized and can be removed while preserving value. If the model gives a wrong price, the surrounding explanation may also be wrong, and the user may treat it as a store commitment.

So PII is often a text-repair problem, while pricing is a trust-and-liability problem. That is why the action differs.

Follow-Up Question 4: How do you keep false positives low without making the system permissive?

Three things matter:

  1. use structured context, not raw text-only heuristics
  2. activate stages by intent and user state
  3. calibrate with domain-specific datasets and audit loops

The most common failure is treating threshold tuning as the only lever. In reality, false positives usually fall when context improves, not only when thresholds change.

Follow-Up Question 5: How do you support streaming if the response must be fully validated first?

You do not stream raw model tokens directly. You buffer the model output, run guardrails, then stream the approved final text to the client. That preserves the frontend streaming contract without exposing unvalidated content.

If lower time-to-first-token becomes critical, the safe optimization is to stream deterministic scaffolding such as product cards or loading states, not raw unvalidated model text.

Follow-Up Question 6: What happens when policy changes need to roll out quickly after an incident?

Policy is versioned outside the stage code. That allows a fast configuration change, but the rollout still goes through shadow mode and canary unless the incident is severe enough to justify an emergency fail-closed switch.

The important operational discipline is to separate "ship a rule" from "prove the rule is healthy." Emergency changes are sometimes necessary, but they must be easy to roll back with a policy version change and not require a code deploy.

Follow-Up Question 7: How do you detect false negatives when users do not always report them?

You need proxy signals:

  • sampled human review of passed answers
  • downstream complaint or thumbs-down analysis
  • replay on adversarial datasets
  • consistency checks between structured data and delivered text

False negatives are harder than false positives because they are often silent. That is why audit sampling on passed traffic matters as much as reviewing blocked traffic.

Follow-Up Question 8: How do you make this work across locales like JP and US catalogs?

Locale affects multiple stages:

  • PII patterns differ
  • titles and editions differ
  • pricing currency and format differ
  • competitor policies can differ by market

So locale must be part of policy resolution, allowlist selection, and catalog snapshot selection. A guardrail pipeline that ignores locale usually works in one market and quietly degrades in another.

Follow-Up Question 9: Why not run the stages in parallel to reduce latency?

Parallel execution lowers wall-clock latency but removes some operational advantages of fail-fast serial execution. In this system, the total serial budget is small enough that simpler tracing and cheaper short-circuiting are worth more than saving a few milliseconds.

If traffic or latency pressure grows, the right next step is usually a hybrid model: run cheap deterministic stages first, then parallelize only the expensive ones. Full parallelism is rarely the first optimization worth taking.

Follow-Up Question 10: How do you prevent the guardrails layer from becoming a fragile pile of regexes?

By treating it as a versioned subsystem with contracts, datasets, policies, and ownership. Regex still has a place, especially for fast deterministic checks, but it should sit alongside structured data validation, allowlists, classifier outputs, and offline evaluation.

The real anti-pattern is not regex itself. The anti-pattern is unversioned rules with no tests, no metrics, and no clear action model.


Key Takeaways

  1. Guardrails are a runtime trust boundary, not a documentation checkbox.
  2. Structured context is what turns content filtering into real business validation.
  3. Domain-aware tuning is mandatory for manga titles, genres, and product language.
  4. Cross-stage evaluation matters because the pipeline can fail as a system even when each stage looks fine alone.
  5. Observability, auditability, and policy rollout discipline are part of the architecture.

Cross-References