
3. Guardrails Pipeline Deep Dive

This document expands the original LLD-6 guardrails section into a deeper architecture and implementation reference. It explains where the pipeline sits in the end-to-end system, how each stage works, how policies are activated per intent, and what follow-up questions usually matter in interviews and design reviews.

For MangaAssist, guardrails are not only about moderation. They are a trust boundary between a probabilistic model and a commerce product that must stay correct, privacy-safe, brand-safe, and in scope.


Why This Subsystem Exists

Prompt hardening reduces the chance of bad model behavior. Guardrails deal with the bad behavior that still gets generated.

| Failure Class | Example | Why Prompting Alone Is Not Enough | Primary Guardrail |
|---|---|---|---|
| PII leakage | Model echoes a shipping address or invents an email address | The model can still emit sensitive-looking text even when instructed not to | Stage 1: PII Filter |
| Commerce inaccuracy | Model invents a price, discount, or savings claim | The model can produce fluent but unverifiable math | Stage 2: Price Validator |
| Harmful language | Response contains toxic or abusive wording | Generic prompt rules do not guarantee safe output | Stage 3: Toxicity Filter |
| Brand violation | Response recommends another store | The model may mirror user phrasing or retrieved text | Stage 4: Competitor Filter |
| Grounding failure | Response mentions an ASIN not in the retrieved candidate set | The model can use training memory instead of request context | Stage 5: ASIN Validator |
| Capability drift | Response gives unsupported advice or claims actions it cannot perform | The model can drift outside business scope | Stage 6: Scope Check |

The important distinction is simple:

  • Prompt hardening is probabilistic.
  • Guardrails are deterministic enforcement.

Guardrails In The Full Architecture

The pipeline sits after generation but depends on structured data collected before generation. That is why it must be treated as a real subsystem, not as a final regex wrapper.

flowchart TB
    User[User Chat Widget] --> Gateway[API Gateway and Auth]
    Gateway --> InputDef[Input Sanitization and Injection Defense]
    InputDef --> Orch[Chat Orchestrator]
    Orch --> Intent[Intent Classifier]
    Orch --> Data[Catalog, RAG, Order, Promo Services]
    Intent --> Prompt[Prompt Builder]
    Data --> Prompt
    Prompt --> FM[Foundation Model]

    subgraph Guardrails["Post-Generation Guardrails"]
        GC[Guardrails Coordinator]
        S1[Stage 1 PII Filter]
        S2[Stage 2 Price Validator]
        S3[Stage 3 Toxicity Filter]
        S4[Stage 4 Competitor Filter]
        S5[Stage 5 ASIN Validator]
        S6[Stage 6 Scope Check]
        GC --> S1 --> S2 --> S3 --> S4 --> S5 --> S6
    end

    FM --> GC
    S6 --> Formatter[Response Formatter]
    Formatter --> User

    Policy[Policy Store and Versioned Config] --> GC
    CatalogCache[Catalog Snapshot Cache] --> S2
    CatalogCache --> S5
    Allowlists[Manga Term and Brand Allowlists] --> S1
    Allowlists --> S3
    GC --> Audit[Audit Log Sink]
    GC --> Metrics[Metrics and Alerts]
    Audit --> Review[Human Review Queue]

Architectural Readout

This placement matters for five reasons:

  1. Guardrails are downstream of the model, but upstream of the user.
  2. Stages require structured context from the orchestrator, not just raw text.
  3. Policy is versioned and externalized, so rules can be tuned without redeploying all stage code.
  4. Audit and observability are part of the design, not an afterthought.
  5. The pipeline is a hot-path dependency, so latency and failure handling matter as much as correctness.

End-To-End Dataflow

sequenceDiagram
    participant User
    participant Gateway
    participant Orch as Orchestrator
    participant Data as Domain Services
    participant FM as Foundation Model
    participant Guardrails
    participant Audit

    User->>Gateway: Chat message
    Gateway->>Orch: Authenticated request context
    Orch->>Orch: Sanitize input, load memory, classify intent
    Orch->>Data: Fetch catalog, retrieval, order, promo state
    Data-->>Orch: Structured context
    Orch->>FM: Prompt plus structured context
    FM-->>Orch: Draft response
    Orch->>Guardrails: Draft plus guardrail context
    Guardrails->>Guardrails: Run stages 1 through 6
    Guardrails-->>Orch: pass, modify, block, or redirect
    Orch->>Audit: Async stage findings and metrics
    Orch-->>Gateway: Approved response or fallback
    Gateway-->>User: Render final response

What Flows Into The Guardrails Layer

The pipeline should never receive only a plain string. It needs:

  • user_message
  • draft_response
  • classified_intent
  • authentication_state
  • retrieved_asins
  • catalog_snapshot
  • authorized_pii_fields
  • policy_version
  • locale
  • request_metadata

Without that structure, several stages become unreliable:

  • Price validation becomes text guessing instead of fact validation.
  • ASIN validation cannot distinguish real products from memorized products.
  • PII filtering cannot tell authorized account data from accidental leakage.
  • Scope checking cannot distinguish supported order help from disallowed advice.

Design Principles

  • Deterministic code owns hard business facts such as prices, discounts, ASINs, and entitlements.
  • The model can format and explain, but it should not be the source of truth for commerce facts.
  • Guardrails should be composable and independently observable.
  • Each stage should have a narrow responsibility and a clear action contract.
  • Low-risk violations can be modified; trust or legal violations should block.
  • Policy should be intent-aware, because not every stage applies equally to every request.
  • Guardrail tuning must be evaluated as a whole-system problem, not only stage by stage.

Pipeline Overview

Logical Stage Order

flowchart LR
    A[Draft Response] --> B[1 PII Filter]
    B --> C[2 Price Validator]
    C --> D[3 Toxicity Filter]
    D --> E[4 Competitor Filter]
    E --> F[5 ASIN Validator]
    F --> G[6 Scope Check]
    G --> H[Approved Response]

    B --> X[Fallback or Redacted Response]
    C --> X
    D --> X
    E --> X
    F --> X
    G --> X

Why This Order

  • PII comes first because redaction can preserve an otherwise useful answer.
  • Price comes early because incorrect pricing is a business liability.
  • Toxicity runs after deterministic correctness checks because it is the slowest stage.
  • Competitor and ASIN enforce brand and grounding constraints before delivery.
  • Scope runs last as the final capability boundary.

Stage Summary

| Stage | Main Question | Typical Action | Needs Structured Context? | Target Latency |
|---|---|---|---|---|
| PII Filter | Did the model expose unauthorized PII? | Modify or block | Yes | 3-5 ms |
| Price Validator | Are price claims backed by catalog truth? | Block | Yes | 1-3 ms |
| Toxicity Filter | Is the answer harmful or abusive? | Block or audit | Yes | 8-15 ms |
| Competitor Filter | Did the answer mention or redirect to competitors? | Block or redirect | Sometimes | Under 1 ms |
| ASIN Validator | Are product identifiers real and grounded? | Block | Yes | 1-3 ms |
| Scope Check | Is the answer within supported capability? | Block or redirect | Yes | Under 1 ms |

Intent-Aware Policy Matrix

One of the biggest sources of false positives in real systems is applying every rule to every request. MangaAssist uses intent-aware stage activation.

| Intent | PII | Price | Toxicity | Competitor | ASIN | Scope |
|---|---|---|---|---|---|---|
| recommendation | Required | Skip unless prices shown | Required | Required | Required | Required |
| product_question | Required | Required if price appears | Required | Required | Required | Required |
| price_comparison | Required | Required | Required | Redirect mode | Required | Required |
| order_tracking | Required | Skip | Required | Skip | Skip | Required |
| faq | Required | Conditional | Required | Skip | Skip | Required |
| chitchat | Required | Skip | Required | Skip | Skip | Required |

GUARDRAIL_POLICY = {
    "recommendation": {
        "enabled_stages": [
            "pii_filter",
            "toxicity_filter",
            "competitor_filter",
            "asin_validator",
            "scope_check",
        ],
        "price_validator": {"mode": "conditional"},
    },
    "price_comparison": {
        "enabled_stages": [
            "pii_filter",
            "price_validator",
            "toxicity_filter",
            "competitor_filter",
            "asin_validator",
            "scope_check",
        ],
        "competitor_filter": {"mode": "redirect"},
    },
}

The key idea is not only enable or disable. Some stages have modes:

  • strict: block on violation
  • modify: rewrite or redact
  • redirect: replace with a safe on-platform answer
  • audit_only: log but do not block
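The mode lookup can be sketched as a small helper that falls back to strict when a stage has no explicit override. This is a sketch mirroring the `GUARDRAIL_POLICY` shape above; `resolve_stage_mode` is an illustrative name, not a documented API.

```python
# Sketch: resolve the effective mode for a stage under an intent policy.
# Assumes the GUARDRAIL_POLICY shape shown above; "strict" is the default.

def resolve_stage_mode(policy: dict, stage_name: str) -> str:
    # A per-stage override like {"competitor_filter": {"mode": "redirect"}}
    # wins over the strict default.
    override = policy.get(stage_name, {})
    if isinstance(override, dict) and "mode" in override:
        return override["mode"]
    return "strict"

policy = {
    "enabled_stages": ["pii_filter", "competitor_filter"],
    "competitor_filter": {"mode": "redirect"},
}
print(resolve_stage_mode(policy, "competitor_filter"))  # redirect
print(resolve_stage_mode(policy, "pii_filter"))         # strict
```

Defaulting to strict means a missing or misspelled override fails closed rather than open.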

High-Level Design

| Component | Responsibility | Why It Exists |
|---|---|---|
| Guardrails Coordinator | Owns stage ordering, short-circuiting, and final action selection | Central orchestration point |
| Policy Resolver | Loads config by intent, locale, user type, and version | Prevents hard-coded rule sprawl |
| Stage Registry | Maps stage names to implementation objects | Enables modular rollout |
| Catalog Snapshot Cache | Provides low-latency product truth | Required for deterministic validation |
| Manga Allowlist Service | Supplies title, genre, and domain exceptions | Reduces false positives |
| Audit Logger | Emits structured stage-level decisions | Required for review and tuning |
| Metrics Emitter | Publishes latency, block rate, and false positive proxies | Required for operations |
| Fallback Factory | Builds safe user-facing responses | Avoids generic and frustrating blocks |

HLD Decision

The guardrails layer is implemented as an in-process module inside the chat orchestrator, not as a separate network hop.

That decision gives:

  • no extra service hop on every response
  • easier access to already-fetched request context
  • simpler hot-path error handling

The downside is tighter runtime coupling, but that is acceptable here because guardrails depend heavily on orchestration state anyway.
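One plausible shape for the Policy Resolver is a base default merged with per-intent overrides. This is a sketch under assumed names (`BASE_POLICY`, `INTENT_OVERRIDES`, `resolve`); the production schema may differ.

```python
# Sketch of a PolicyResolver: base defaults merged with per-intent
# overrides. Keys and defaults here are illustrative, not a real schema.

BASE_POLICY = {"enabled_stages": ["pii_filter", "toxicity_filter", "scope_check"]}

INTENT_OVERRIDES = {
    "price_comparison": {
        "enabled_stages": [
            "pii_filter", "price_validator", "toxicity_filter",
            "competitor_filter", "asin_validator", "scope_check",
        ],
        "competitor_filter": {"mode": "redirect"},
    },
}

def resolve(intent: str, locale: str, auth_state: str) -> dict:
    # Shallow merge: intent-level keys replace base keys wholesale,
    # which keeps enabled_stages unambiguous for each intent.
    merged = {**BASE_POLICY, **INTENT_OVERRIDES.get(intent, {})}
    merged["policy_context"] = {"locale": locale, "auth_state": auth_state}
    return merged

print(resolve("price_comparison", "en-US", "authenticated")["competitor_filter"])
```

A wholesale replacement of `enabled_stages` is deliberate: partial list merges make it hard to reason about which stages actually run for an intent.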


Low-Level Design

Core Interfaces

from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class GuardrailContext:
    request_id: str
    session_id: str
    intent: str
    locale: str
    user_message: str
    draft_response: str
    retrieved_asins: list[str]
    catalog_snapshot: dict[str, Any]
    authorized_pii_fields: list[str]
    auth_state: str
    policy_version: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class StageFinding:
    stage: str
    severity: str
    finding_type: str
    message: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class StageDecision:
    action: str
    updated_response: str | None = None
    findings: list[StageFinding] = field(default_factory=list)


@dataclass
class GuardrailOutcome:
    final_action: str
    response_to_deliver: str
    stage_results: list[StageDecision]
    blocked_stage: str | None = None


class GuardrailStage(Protocol):
    name: str

    def applies(self, ctx: GuardrailContext, policy: dict[str, Any]) -> bool:
        ...

    def evaluate(self, ctx: GuardrailContext, policy: dict[str, Any]) -> StageDecision:
        ...

Coordinator Class Diagram

classDiagram
    class GuardrailsCoordinator {
        +evaluate(ctx) GuardrailOutcome
        -resolvePolicy(intent, locale, authState) dict
        -runStage(stage, ctx, policy) StageDecision
        -emitAudit(outcome) None
    }

    class PolicyResolver {
        +resolve(intent, locale, authState) dict
    }

    class StageRegistry {
        +getEnabledStages(policy) list
    }

    class GuardrailStage {
        <<interface>>
        +applies(ctx, policy) bool
        +evaluate(ctx, policy) StageDecision
    }

    class PiiFilterStage
    class PriceValidatorStage
    class ToxicityFilterStage
    class CompetitorFilterStage
    class AsinValidatorStage
    class ScopeCheckStage

    GuardrailsCoordinator --> PolicyResolver
    GuardrailsCoordinator --> StageRegistry
    StageRegistry --> GuardrailStage
    GuardrailStage <|.. PiiFilterStage
    GuardrailStage <|.. PriceValidatorStage
    GuardrailStage <|.. ToxicityFilterStage
    GuardrailStage <|.. CompetitorFilterStage
    GuardrailStage <|.. AsinValidatorStage
    GuardrailStage <|.. ScopeCheckStage

Pipeline Runner

from dataclasses import replace


def evaluate_guardrails(ctx: GuardrailContext) -> GuardrailOutcome:
    policy = policy_resolver.resolve(ctx.intent, ctx.locale, ctx.auth_state)
    current_response = ctx.draft_response
    stage_results = []

    for stage in stage_registry.get_enabled_stages(policy):
        # Re-point the context at the latest (possibly modified) response.
        stage_ctx = replace(ctx, draft_response=current_response)

        if not stage.applies(stage_ctx, policy):
            continue

        result = stage.evaluate(stage_ctx, policy)
        stage_results.append(result)

        if result.action == "modify":
            current_response = result.updated_response or current_response
            continue

        if result.action in {"pass", "audit_only"}:
            continue

        if result.action in {"block", "redirect"}:
            safe_response = fallback_factory.build(stage.name, stage_ctx, result)
            return GuardrailOutcome(
                final_action=result.action,
                response_to_deliver=safe_response,
                stage_results=stage_results,
                blocked_stage=stage.name,
            )

    return GuardrailOutcome(
        final_action="pass",
        response_to_deliver=current_response,
        stage_results=stage_results,
    )

Runtime State Machine

stateDiagram-v2
    [*] --> DraftReady
    DraftReady --> EvaluateStage
    EvaluateStage --> Continue: pass
    EvaluateStage --> Modify: modify
    EvaluateStage --> Audit: audit_only
    EvaluateStage --> Blocked: block
    EvaluateStage --> Redirected: redirect

    Modify --> Continue
    Audit --> Continue
    Continue --> EvaluateStage: next stage
    Continue --> Approved: last stage passed

    Approved --> Delivered
    Blocked --> Delivered
    Redirected --> Delivered
    Delivered --> [*]

Detailed Implementation By Stage

Stage 1: PII Filter

Goal

Stop the model from exposing unauthorized personal data while preserving as much useful answer content as possible.

Why Output PII Filtering Still Matters

Even if input sanitization is strong, the model can still:

  • echo account data from authenticated context
  • hallucinate realistic-looking phone numbers or email addresses
  • combine fragments from retrieved content into a new sensitive-looking string

That is why input-side privacy controls and output-side privacy controls must both exist.

PII Filter Dataflow

flowchart LR
    A[Draft Response] --> B[Regex Detectors]
    A --> C[NER Detector]
    A --> D[Custom Commerce Detectors]
    B --> E[Span Merger]
    C --> E
    D --> E
    E --> F{Authorized By Context}
    F -->|Yes| G[Pass Span]
    F -->|No| H[Typed Redaction]
    G --> I[Updated Response]
    H --> I

Detailed Implementation Notes

  • Regex handles structured patterns such as email, phone, order ID, and postal code.
  • NER handles names and free-form addresses.
  • Custom detectors handle Amazon-specific entities such as tracking numbers and customer identifiers.
  • Character-name allowlists from the manga catalog reduce false positives on fictional names.
  • Authorization is context-sensitive. Order tracking may authorize masked address fragments, but recommendation flows do not.

Pseudocode

def pii_filter_stage(response: str, ctx: GuardrailContext) -> StageDecision:
    findings = scan_regex(response)
    findings += scan_ner(response, locale=ctx.locale)
    findings += scan_custom(response, ctx.metadata)
    merged = merge_overlapping_findings(findings)

    unauthorized = []
    for finding in merged:
        if finding["type"] not in ctx.authorized_pii_fields:
            unauthorized.append(finding)

    if not unauthorized:
        return StageDecision(action="pass")

    redacted = apply_typed_redactions(response, unauthorized)
    return StageDecision(
        action="modify",
        updated_response=redacted,
        findings=[
            StageFinding(
                stage="pii_filter",
                severity="high",
                finding_type=f["type"],
                message="Unauthorized PII redacted",
            )
            for f in unauthorized
        ],
    )

Action Choice

This stage usually modifies instead of blocks. If the answer is mostly useful and only contains one unsafe substring, full blocking creates unnecessary user pain. PII redaction is one of the few cases where inline repair is safer than full rejection.

Main Failure Modes

  • false positives on manga character names
  • false negatives on unusual international address formats
  • over-authorizing account data because the intent metadata was too broad

Stage SLO

  • P50 under 3 ms
  • P99 under 8 ms
  • false positive rate on character names under 3 percent

Stage 2: Price Validator

Goal

Ensure that every price, discount, savings claim, or promotional statement shown to the user is backed by the current catalog snapshot or a deterministic calculation.

Important Design Rule

The model should never invent or calculate commerce facts. The orchestrator should pre-compute them and the model should only explain them.

Preferred Response Contract

The most robust implementation is not free text first. It is structured claims first, natural language second.

{
  "price_claims": [
    {
      "asin": "B0ABC12345",
      "currency": "USD",
      "display_price": 12.99,
      "source": "catalog_snapshot"
    }
  ],
  "derived_claims": [
    {
      "type": "bundle_savings",
      "asin": "B0BOXSET123",
      "individual_total": 207.84,
      "bundle_price": 149.99,
      "savings": 57.85
    }
  ]
}

If the model produces only free text without structured price claims, the validator must infer claims from raw text, and reliability drops.

Validation Flow

flowchart TD
    A[Draft Response and Price Claims] --> B{Any Price Mentioned}
    B -->|No| P[PASS]
    B -->|Yes| C[Map Each Claim To ASIN]
    C --> D{ASIN In Catalog Snapshot}
    D -->|No| X[BLOCK Unknown Product Price]
    D -->|Yes| E[Compare Display Price]
    E --> F{Exact Match Within Tolerance}
    F -->|No| Y[BLOCK Price Mismatch]
    F -->|Yes| G{Derived Claim Present}
    G -->|No| P
    G -->|Yes| H[Recompute Savings Percent Math]
    H --> I{Derived Claim Correct}
    I -->|Yes| P
    I -->|No| Z[BLOCK Invalid Derived Math]

What This Stage Checks

  • raw price values
  • currency
  • promo IDs and discount presence
  • bundle savings arithmetic
  • percent-off calculations
  • "cheaper than" or "save more" comparative claims

Pseudocode

def price_validator_stage(response: str, ctx: GuardrailContext) -> StageDecision:
    claims = extract_price_claims(response, ctx.metadata.get("structured_response"))
    catalog = ctx.catalog_snapshot
    findings = []

    for claim in claims:
        asin = claim["asin"]
        if asin not in catalog:
            findings.append(("unknown_asin", asin))
            continue

        actual = catalog[asin]["price"]
        if abs(claim["display_price"] - actual) > 0.01:
            findings.append(("price_mismatch", asin))

        if "derived" in claim:
            if not validate_derived_claim(claim["derived"], catalog[asin], catalog):
                findings.append(("invalid_derived_math", asin))

    if findings:
        return StageDecision(
            action="block",
            findings=[
                StageFinding(
                    stage="price_validator",
                    severity="critical",
                    finding_type=item[0],
                    message=f"Price validation failed for {item[1]}",
                )
                for item in findings
            ],
        )

    return StageDecision(action="pass")
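The `validate_derived_claim` call above can be sketched for the `bundle_savings` case. This is a minimal version under assumptions: the claim shape follows the structured contract shown earlier, and a hypothetical `member_asins` field on the bundle's catalog entry lists its component volumes.

```python
# Sketch of validate_derived_claim for the bundle_savings case.
# All arithmetic is recomputed from catalog truth, never trusted
# from model text. member_asins is an assumed catalog field.

def validate_derived_claim(derived: dict, bundle_entry: dict, catalog: dict) -> bool:
    if derived["type"] != "bundle_savings":
        return False  # unknown derived claim types fail closed

    # Recompute the sum of individual volume prices from the catalog.
    individual_total = round(
        sum(catalog[a]["price"] for a in bundle_entry["member_asins"]), 2
    )
    expected_savings = round(individual_total - bundle_entry["price"], 2)

    return (
        abs(derived["individual_total"] - individual_total) <= 0.01
        and abs(derived["savings"] - expected_savings) <= 0.01
    )

catalog = {
    "VOL1": {"price": 9.99},
    "VOL2": {"price": 9.99},
    "BOX": {"price": 14.99, "member_asins": ["VOL1", "VOL2"]},
}
claim = {"type": "bundle_savings", "individual_total": 19.98, "savings": 4.99}
print(validate_derived_claim(claim, catalog["BOX"], catalog))  # True
```

Failing closed on unknown claim types matters: a new derived-claim type added to the prompt without a matching validator should block, not slip through.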

Why This Stage Blocks Instead Of Fixing Inline

Inline correction sounds attractive, but it is risky unless the response is fully structured and deterministic. A partially corrected commerce answer can still leave surrounding language inconsistent.

Example:

  • The validator can repair $9.99 to $12.99.
  • But the sentence may still claim "saving you $9.85" when the real savings is $57.85.

That is why a wrong price normally triggers a safe fallback or a controlled regeneration path outside the stage itself.

Main Failure Modes

  • stale catalog snapshot
  • free-text price references not mapped to an ASIN
  • derived claim math that uses correct arithmetic on hallucinated inputs

Stage SLO

  • P50 under 2 ms
  • P99 under 5 ms
  • zero tolerance for shipped incorrect prices

Stage 3: Toxicity Filter

Goal

Block harmful, abusive, sexual, or unsafe responses without breaking legitimate manga discussion.

Why This Stage Is Hard In The Manga Domain

Generic classifiers do not understand that many popular titles contain words which look unsafe out of context:

| Term | Why A Generic Model Flags It | Why MangaAssist Must Often Allow It |
|---|---|---|
| Chainsaw Man | weapon keyword | legitimate series title |
| Attack on Titan | violence keyword | legitimate franchise |
| Demon Slayer | violent and occult keywords | legitimate franchise |
| Death Note | death keyword | legitimate franchise |
| Hell's Paradise | profanity-adjacent term | legitimate title |

Two-Layer Toxicity Architecture

flowchart TD
    A[Draft Response] --> B[Generic Toxicity Classifier]
    B -->|Clean| P[PASS]
    B -->|Flagged| C[Domain Override Layer]
    C --> D{Flagged Terms In Manga Allowlist}
    D -->|No| X[BLOCK]
    D -->|Yes| E{Term Used In Product Context}
    E -->|No| X
    E -->|Yes| F{Overall Severity Above Hard Threshold}
    F -->|Yes| X
    F -->|No| P

Implementation Approach

Layer 1 handles general unsafe language detection.

Layer 2 applies domain-aware logic:

  • title allowlists
  • genre term allowlists
  • context checks against retrieved products
  • higher thresholds when the term clearly refers to a product title

Pseudocode

def toxicity_filter_stage(response: str, ctx: GuardrailContext) -> StageDecision:
    result = generic_toxicity_provider.score(response)
    if result["is_clean"]:
        return StageDecision(action="pass")

    findings = []
    for term in result["flagged_terms"]:
        if is_manga_allowlisted(term) and is_product_context(term, response, ctx.metadata):
            continue

        findings.append(
            StageFinding(
                stage="toxicity_filter",
                severity="critical" if result["score"] >= 0.8 else "high",
                finding_type="toxic_content",
                message=f"Unsafe term detected: {term}",
            )
        )

    if findings:
        return StageDecision(action="block", findings=findings)

    if result["score"] >= 0.65:
        return StageDecision(
            action="audit_only",
            findings=[
                StageFinding(
                    stage="toxicity_filter",
                    severity="medium",
                    finding_type="borderline_toxicity",
                    message="Borderline content passed after domain override",
                )
            ],
        )

    return StageDecision(action="pass")
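The `is_product_context` helper used above can be sketched as a title-containment check. This is a simplified signature (a title set instead of `ctx.metadata`), and the title data is illustrative:

```python
# Sketch of is_product_context: a flagged term counts as product
# context when it sits inside a known catalog title that actually
# appears in the response. Simplified signature; titles illustrative.

def is_product_context(term: str, response: str, title_allowlist: set[str]) -> bool:
    term_lower = term.lower()
    response_lower = response.lower()
    for title in title_allowlist:
        title_lower = title.lower()
        # The term must belong to a title, and that full title must
        # appear in the response, not just the bare term.
        if term_lower in title_lower and title_lower in response_lower:
            return True
    return False

titles = {"Chainsaw Man", "Attack on Titan"}
print(is_product_context("chainsaw", "Try Chainsaw Man volume 1.", titles))  # True
print(is_product_context("chainsaw", "I will chainsaw you.", titles))        # False
```

Requiring the full title to appear, not just the flagged token, is what lets "Chainsaw Man" pass while bare threatening usage still blocks.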

Key Calibration Rules

  • A high generic toxicity score is not enough to block if the flagged tokens map to known product titles and the sentence is clearly catalog-oriented.
  • A low generic score is not enough to pass if the phrasing is directly abusive or threatening.
  • Borderline cases should feed the audit loop so the threshold can be tuned with real data.

Main Failure Modes

  • false positives on title names
  • false negatives on sarcastic or indirect abuse
  • allowlist drift when new series are added

Stage SLO

  • P50 under 10 ms
  • P99 under 20 ms
  • false positive rate on title mentions under 1 percent after override

Stage 4: Competitor Filter

Goal

Prevent the assistant from steering users to competitor stores or making unsupported competitive claims.

What This Stage Should Catch

  • direct competitor names
  • "buy it elsewhere" phrasing
  • unsupported claims like "cheaper at X"
  • comparative calls to action that pull the user off-platform

Design Detail That Usually Gets Missed

This stage validates the model output, not the user message.

If the user asks:

Is it cheaper on another store?

the system can still answer safely by redirecting:

I can help compare editions, formats, and current offers available here on Amazon.

The user is not blocked. The model output is constrained.

Competitor Decision Flow

flowchart LR
    A[Draft Response] --> B[Regex and Entity Match]
    B --> C{Competitor Mentioned}
    C -->|No| P[PASS]
    C -->|Yes| D{Policy Mode}
    D -->|strict| X[BLOCK]
    D -->|redirect| Y[Return Safe Redirect]

Pseudocode

COMPETITOR_PATTERNS = [
    r"\bBarnes\s*&?\s*Noble\b",
    r"\bRight\s*Stuf\b",
    r"\bBookWalker\b",
    r"\bKinokuniya\b",
    r"\bCrunchyroll\s*Store\b",
    r"\bcheaper\s+(?:at|on|from)\b",
    r"\bbuy\s+(?:it|this|them)\s+(?:at|on|from)\s+(?!Amazon)\b",
]


def competitor_filter_stage(response: str, mode: str) -> StageDecision:
    for pattern in COMPETITOR_PATTERNS:
        match = re.search(pattern, response, re.IGNORECASE)
        if not match:
            continue

        if mode == "redirect":
            return StageDecision(
                action="redirect",
                findings=[
                    StageFinding(
                        stage="competitor_filter",
                        severity="high",
                        finding_type="competitor_mention",
                        message=f"Redirected competitor phrasing: {match.group(0)}",
                    )
                ],
            )

        return StageDecision(
            action="block",
            findings=[
                StageFinding(
                    stage="competitor_filter",
                    severity="high",
                    finding_type="competitor_mention",
                    message=f"Blocked competitor phrasing: {match.group(0)}",
                )
            ],
        )

    return StageDecision(action="pass")
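The comparative-phrasing pattern above relies on a negative lookahead to spare on-Amazon phrasing. A quick self-contained check of that behavior:

```python
import re

# Quick check of the comparative-phrasing pattern from
# COMPETITOR_PATTERNS: the (?!Amazon) lookahead lets on-platform
# phrasing through while flagging off-platform redirection.

pattern = r"\bbuy\s+(?:it|this|them)\s+(?:at|on|from)\s+(?!Amazon)\b"

print(bool(re.search(pattern, "You can buy it from BookWalker", re.IGNORECASE)))  # True
print(bool(re.search(pattern, "You can buy it from Amazon", re.IGNORECASE)))      # False
```

Note that under `re.IGNORECASE` the lookahead is also case-insensitive, so "amazon" in any casing is exempted.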

Main Failure Modes

  • patterns that are too broad and accidentally match intra-Amazon comparisons
  • missing competitor aliases
  • retrieved policy text or user text echoed into the answer without normalization

Stage SLO

  • P50 under 1 ms
  • P99 under 2 ms

Stage 5: ASIN Validator

Goal

Ensure that every product identifier or product card shown to the user is real, catalog-valid, and grounded in the candidate set retrieved for the current request.

This Stage Is Really About Grounding

The failure is not only fake ASIN. It is also:

  • real ASIN but not retrieved for this request
  • real ASIN from the wrong locale or edition
  • product title in prose that does not match the attached product card

ASIN Validation Dataflow

flowchart TD
    A[Draft Response] --> B[Extract Mentioned ASINs and Product Titles]
    B --> C[Compare Against Catalog Snapshot]
    C --> D{Exists In Catalog}
    D -->|No| X[BLOCK Hallucinated Product]
    D -->|Yes| E[Compare Against Retrieved Candidate Set]
    E --> F{In Retrieved Set}
    F -->|No| Y[BLOCK Ungrounded Product]
    F -->|Yes| G[Compare Title Format Locale]
    G --> H{Consistent}
    H -->|No| Z[BLOCK Inconsistent Product Metadata]
    H -->|Yes| P[PASS]

Pseudocode

def asin_validator_stage(response: str, ctx: GuardrailContext) -> StageDecision:
    mentioned_asins = extract_asins(response, ctx.metadata.get("structured_response"))
    retrieved = set(ctx.retrieved_asins)
    catalog = ctx.catalog_snapshot
    findings = []

    for asin in mentioned_asins:
        if asin not in catalog:
            findings.append(("hallucinated_asin", asin))
            continue

        if asin not in retrieved:
            findings.append(("ungrounded_asin", asin))
            continue

        if not metadata_consistent(asin, response, catalog[asin]):
            findings.append(("metadata_mismatch", asin))

    if findings:
        return StageDecision(
            action="block",
            findings=[
                StageFinding(
                    stage="asin_validator",
                    severity="critical",
                    finding_type=item[0],
                    message=f"ASIN validation failed for {item[1]}",
                )
                for item in findings
            ],
        )

    return StageDecision(action="pass")

Why Both Catalog And Retrieval Checks Are Needed

  • Catalog check prevents fake products.
  • Retrieval check prevents the model from using training memory instead of the live candidate set.

Without the retrieval check, a model can recommend a real but contextually wrong product and still look credible.

Main Failure Modes

  • model recalls a real product from training data
  • title text refers to one edition, but the structured card points to another
  • locale mismatch between JP and EN catalog variants

Stage SLO

  • P50 under 2 ms
  • P99 under 5 ms

Stage 6: Scope Check

Goal

Ensure the response stays within MangaAssist's allowed capability boundary.

Scope Means More Than Topic

A response can be about manga and still be out of scope if it:

  • gives legal or medical advice
  • claims to have performed an account action it cannot perform
  • offers unsupported guarantees
  • drifts into politics, religion, or personal counseling

What This Stage Checks

  • intent-to-response alignment
  • restricted-domain phrases
  • unsupported action verbs such as "I refunded", "I changed your address", or "I cancelled the order" when no execution token exists
  • capability mismatch for guest users versus authenticated users

Pseudocode

OUT_OF_SCOPE_PATTERNS = [
    r"\bmedical advice\b",
    r"\blegal advice\b",
    r"\binvestment advice\b",
    r"\bhow to hack\b",
    r"\bdownload free manga\b",
]

UNSUPPORTED_ACTION_PATTERNS = [
    r"\bI (?:refunded|cancelled|changed|updated)\b",
]


def scope_check_stage(response: str, ctx: GuardrailContext) -> StageDecision:
    for pattern in OUT_OF_SCOPE_PATTERNS:
        if re.search(pattern, response, re.IGNORECASE):
            return StageDecision(
                action="block",
                findings=[
                    StageFinding(
                        stage="scope_check",
                        severity="critical",
                        finding_type="out_of_scope",
                        message="Response entered a restricted domain",
                    )
                ],
            )

    if ctx.auth_state == "guest":
        for pattern in UNSUPPORTED_ACTION_PATTERNS:
            if re.search(pattern, response, re.IGNORECASE):
                return StageDecision(
                    action="redirect",
                    findings=[
                        StageFinding(
                            stage="scope_check",
                            severity="high",
                            finding_type="unsupported_action_claim",
                            message="Guest response implied an unsupported account action",
                        )
                    ],
                )

    return StageDecision(action="pass")
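The unsupported-action pattern above is worth sanity-checking: it should catch first-person past-tense action claims while leaving neutral guidance alone.

```python
import re

# Sanity check on the UNSUPPORTED_ACTION_PATTERNS entry above:
# first-person action claims match, neutral guidance does not.

pattern = r"\bI (?:refunded|cancelled|changed|updated)\b"

print(bool(re.search(pattern, "I refunded your order just now.", re.IGNORECASE)))
print(bool(re.search(pattern, "You can request a refund from Your Orders.", re.IGNORECASE)))
```

This distinction is the point of the stage: telling a user how to request a refund is in scope, while claiming to have issued one is a false action claim.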

Why This Stage Is Last

This stage benefits from everything earlier in the pipeline:

  • PII has already been redacted if needed.
  • pricing and grounding have already been verified.
  • toxic or competitor content has already been removed from the decision space.

So the final scope decision can focus on capability boundaries, not low-level content hazards.

Main Failure Modes

  • overly broad patterns that block educational manga discussions
  • under-constrained action language that implies system actions were executed
  • guest versus authenticated capability drift

Stage SLO

  • P50 under 1 ms
  • P99 under 2 ms

Action Model: Pass, Modify, Redirect, Block, Audit

Not every violation should produce the same outcome.

```mermaid
flowchart TD
    A[Stage Finding] --> B{Risk Type}
    B -->|Recoverable text issue| C[MODIFY]
    B -->|Brand or capability mismatch| D[REDIRECT]
    B -->|Critical trust or safety risk| E[BLOCK]
    B -->|Borderline low confidence| F[AUDIT ONLY]
    B -->|No issue| G[PASS]
```
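The routing above can be sketched as a precedence rule over per-stage outcomes: the pipeline delivers the most restrictive action any stage requested. `ACTION_PRECEDENCE` and `resolve_final_action` are illustrative names, not the production API.

```python
# Most restrictive first; "pass" wins only when nothing else fired.
ACTION_PRECEDENCE = ["block", "redirect", "modify", "audit_only", "pass"]


def resolve_final_action(stage_actions: list[str]) -> str:
    """Pick the most restrictive action any stage requested."""
    for action in ACTION_PRECEDENCE:
        if action in stage_actions:
            return action
    return "pass"
```

For example, `resolve_final_action(["pass", "modify", "block"])` resolves to `"block"`: one critical finding overrides any number of softer ones.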

Action Mapping

| Action | Typical Use | User Experience |
| --- | --- | --- |
| pass | no issue detected | normal response |
| modify | unauthorized PII or safe text repair | answer preserved with redactions |
| redirect | competitor comparison, unsupported action | safe branded alternative |
| block | wrong price, toxic content, ungrounded product | fallback response |
| audit_only | borderline or sampled review case | normal response plus review signal |

Fallback Strategy

Fallback text should be stage-specific, not generic:

  • price_validator: "Let me pull the exact current pricing from the product page."
  • competitor_filter: "I can help compare editions, formats, and current Amazon offers."
  • scope_check: "I can help with manga products, order support, shipping, returns, and related store questions."

Generic "I can't help with that" responses hide the root cause and reduce usefulness.
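A minimal sketch of stage-specific fallbacks is a lookup keyed by the blocking stage, with a generic message only as a last resort for unmapped stages. The dictionary and helper names here are hypothetical.

```python
# Stage-specific fallback text, keyed by the stage that blocked the answer.
STAGE_FALLBACKS = {
    "price_validator": "Let me pull the exact current pricing from the product page.",
    "competitor_filter": "I can help compare editions, formats, and current Amazon offers.",
    "scope_check": "I can help with manga products, order support, shipping, "
                   "returns, and related store questions.",
}


def fallback_for(blocked_stage: str) -> str:
    # Generic text is only the last resort for stages with no tailored message.
    return STAGE_FALLBACKS.get(
        blocked_stage,
        "I can't help with that, but I can answer store and product questions.",
    )
```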


Detailed Scenarios

Scenario 1: "Chainsaw Man" Triggered Toxicity Blocks

Problem

The generic toxicity provider started blocking recommendation answers that mentioned violent-sounding titles.

Symptom

Users asked for recommendations similar to Chainsaw Man or Attack on Titan and received safe fallbacks instead of valid recommendations.

Scenario Flow

```mermaid
flowchart LR
    A[User asks for dark fantasy recommendations] --> B[Model drafts: Try Chainsaw Man and Hell's Paradise]
    B --> C[Generic toxicity score high]
    C --> D[Domain override checks title allowlist]
    D --> E{Terms are valid catalog titles}
    E -->|Yes| F[Pass response]
    E -->|No| G[Block response]
```

Fix

  1. Added a catalog-backed manga title allowlist.
  2. Added product-context checks so the same term can pass in catalog context and block in abusive context.
  3. Tuned thresholds using a domain-specific evaluation set instead of generic moderation benchmarks.
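The override logic in steps 1 and 2 can be sketched as follows. The `0.8` threshold, the allowlist contents, and the `in_product_context` signal are assumptions for illustration, not the tuned production values.

```python
# Catalog-backed allowlist of legitimate titles (refreshed from the catalog).
TITLE_ALLOWLIST = {"chainsaw man", "attack on titan", "hell's paradise"}


def toxicity_action(flagged_term: str, score: float, in_product_context: bool) -> str:
    """Apply the domain override on top of a generic toxicity score."""
    if score < 0.8:  # below threshold: the generic provider raised no finding
        return "pass"
    # Same term can pass in catalog context and block in abusive context.
    if flagged_term.lower() in TITLE_ALLOWLIST and in_product_context:
        return "pass"
    return "block"
```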

Prevention

  • daily allowlist refresh from the catalog
  • weekly audit of blocked toxicity cases
  • dashboard for false positive rate by intent

Scenario 2: Price Validator Caught Hallucinated Bundle Math

Problem

The model gave confident bundle-savings explanations that were internally consistent but factually wrong because one source value was hallucinated.

Symptom

The response sounded reasonable:

You save $9.85 by buying the box set.

But the model had invented the single-volume price and then performed correct math on the wrong base value.

Scenario Flow

```mermaid
flowchart TD
    A[Catalog truth: volume price 12.99 and box set 149.99] --> B[Model invents volume price 9.99]
    B --> C[Model computes savings from wrong base]
    C --> D[Response looks mathematically coherent]
    D --> E[Price validator recomputes against catalog]
    E --> F[Block invalid answer]
```

Fix

  1. Moved all arithmetic into deterministic orchestrator code.
  2. Added structured price_claims and derived_claims to the response contract.
  3. Kept the validator as a hard gate even after prompt changes.
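A minimal sketch of the deterministic recomputation in step 1, assuming the structured `price_claims` carry a claimed savings figure. Field names and the tolerance are illustrative; `Decimal` avoids float rounding surprises in commerce math.

```python
from decimal import Decimal


def validate_savings_claim(claimed_savings: Decimal,
                           catalog_volume_price: Decimal,
                           catalog_bundle_price: Decimal,
                           volumes: int,
                           tolerance: Decimal = Decimal("0.01")) -> bool:
    """Recompute bundle savings from catalog truth and compare to the claim."""
    true_savings = catalog_volume_price * volumes - catalog_bundle_price
    return abs(true_savings - claimed_savings) <= tolerance
```

With the scenario's catalog truth (12.99 per volume, 149.99 for a 12-volume box set), the real savings are 5.89, so the model's 9.85 claim fails the gate regardless of how coherent the surrounding arithmetic sounds.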

Prevention

  • never ask the model to invent bundle arithmetic
  • golden tests for savings, percentages, and promotion wording
  • alert on any price-related block spike

Scenario 3: ASIN Validator Caught Grounding Violations

Problem

The model occasionally recommended real products it had likely seen during training, even though those ASINs were not in the retrieved candidate set for the current request.

Symptom

Recommendation answers included real but contextually wrong products, which made the response look plausible while quietly breaking grounding.

Scenario Flow

```mermaid
flowchart LR
    A[Candidate set from retrieval] --> B[ASIN 1, ASIN 2, ASIN 3]
    C[Model parametric memory] --> D[Real ASIN 9 from old knowledge]
    B --> E[Draft response]
    D --> E
    E --> F[ASIN validator compares response against candidate set]
    F --> G{All ASINs grounded}
    G -->|No| H[Block response]
    G -->|Yes| I[Pass response]
```

Fix

  1. Enforced candidate-set-only output through prompt constraints.
  2. Validated all ASINs in both text and structured product cards.
  3. Added metadata consistency checks so title, edition, and locale also had to match.
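The core set comparison in step 2 can be sketched like this. The regex covers only the common B0-prefixed 10-character ASIN shape; book ASINs that reuse ISBN-10s, and extraction from structured product cards, would need additional handling.

```python
import re

# Common B0-prefixed 10-character ASIN shape (an assumption for this sketch).
ASIN_PATTERN = re.compile(r"\bB0[A-Z0-9]{8}\b")


def ungrounded_asins(response_text: str, candidate_asins: set[str]) -> set[str]:
    """Return ASINs mentioned in the response that retrieval did not supply."""
    mentioned = set(ASIN_PATTERN.findall(response_text))
    return mentioned - candidate_asins
```

A non-empty result means the model pulled a product from parametric memory instead of the candidate set, which is exactly the block condition in the flow above.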

Prevention

  • replay tests with known memorized-product prompts
  • candidate-set mismatch metric
  • regression tests on multilingual catalog variants

Scenario 4: Cross-Stage Over-Blocking After Policy Tightening

Problem

After one safety incident, the team tightened multiple stages independently. Each change looked reasonable in isolation, but together they caused a large fallback spike.

Symptom

Users received fallbacks for legitimate questions such as:

  • "Is the deluxe edition cheaper than the standard edition?"
  • "Is Death Note appropriate for older teens?"
  • "Compare the paperback and hardcover versions."

Scenario Flow

```mermaid
flowchart TD
    A[Policy tightening deploy] --> B[Broader competitor regex]
    A --> C[Stricter scope rules]
    A --> D[Lower toxicity threshold]
    B --> E[Combined false positive spike]
    C --> E
    D --> E
    E --> F[Fallback rate rises]
    F --> G[Intent-aware policy review]
    G --> H[Stage-specific tuning and combined evaluation]
```

Fix

  1. Added intent-aware stage activation.
  2. Replaced broad competitor phrases with context-dependent patterns.
  3. Added combined pipeline regression tests on representative traffic.
  4. Started measuring fallback rate by intent, not only by stage.
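Intent-aware stage activation (fix 1) can be sketched as a policy table mapping intent to the stages worth running. The stage names match the pipeline; the intent keys and selections here are assumptions.

```python
# Which stages run for which intent; unknown intents fail safe to all stages.
ALL_STAGES = ["pii_filter", "price_validator", "toxicity_filter",
              "competitor_filter", "asin_validator", "scope_check"]

STAGES_BY_INTENT = {
    "recommendation": ["pii_filter", "toxicity_filter", "competitor_filter",
                       "asin_validator", "scope_check"],
    "price_comparison": ["pii_filter", "price_validator", "competitor_filter",
                         "asin_validator", "scope_check"],
    "order_support": ["pii_filter", "toxicity_filter", "scope_check"],
}


def active_stages(intent: str) -> list[str]:
    return STAGES_BY_INTENT.get(intent, ALL_STAGES)
```

The point is that a recommendation answer never hits the price validator, so tightening price rules cannot add false positives to recommendation traffic.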

Prevention

  • every policy change must pass offline combined evaluation
  • canary release for new policies
  • rollback trigger on fallback rate or false positive proxy spikes

Streaming Considerations

Guardrails change how streaming must be implemented.

If the system streams raw model tokens directly to the user before validation completes, several failures become possible:

  • a wrong price is shown before Stage 2 blocks it
  • a competitor mention leaks before Stage 4 blocks it
  • an ungrounded ASIN is visible before Stage 5 catches it

Practical Design Choice

For MangaAssist, the orchestrator buffers the draft response until guardrail validation finishes, then streams the approved response to the client.

This slightly increases time-to-first-token, but it avoids partial unsafe output. In commerce and trust-sensitive flows, that is the correct trade.
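A minimal async sketch of buffer-then-stream, assuming a token iterator from the model, a synchronous `validate()` that returns the approved (possibly modified or fallback) text, and a `send_chunk` callback to the client. All three names are illustrative.

```python
import asyncio


async def buffered_stream(token_source, validate, send_chunk, chunk_size=32):
    # 1. Buffer the full draft instead of forwarding raw model tokens.
    draft = "".join([tok async for tok in token_source])
    # 2. Run the guardrail pipeline on the complete draft.
    approved = validate(draft)
    # 3. Stream the approved text so the client keeps its streaming contract.
    for i in range(0, len(approved), chunk_size):
        await send_chunk(approved[i:i + chunk_size])
```

The client still sees incremental chunks; it just sees them after validation rather than during generation, which is where the extra time-to-first-token comes from.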


Observability And Audit

What Must Be Measured

| Metric | Meaning | Why It Matters |
| --- | --- | --- |
| guardrail_block_rate | overall percentage of blocked answers | product health and safety strictness |
| guardrail_block_rate_by_stage | block rate per stage | tuning and root cause isolation |
| guardrail_modify_rate | percentage of answers redacted or repaired | privacy and repair pressure |
| fallback_rate_by_intent | fallback frequency split by use case | user experience impact |
| guardrail_latency_p95 | high-percentile validation latency | hot-path performance |
| ungrounded_asin_rate | percentage of answers with candidate-set violations | grounding health |
| price_mismatch_rate | price validation failures | commerce trust |
| toxicity_false_positive_rate | audited false positives | domain calibration quality |
| policy_version_traffic_share | request share by active policy version | rollout visibility |

Audit Record Example

```json
{
  "timestamp": "2026-03-24T15:20:41Z",
  "request_id": "req_123",
  "session_id": "sess_456",
  "intent": "price_comparison",
  "policy_version": "guardrails_v17",
  "final_action": "block",
  "blocked_stage": "price_validator",
  "stage_results": [
    {"stage": "pii_filter", "action": "pass", "latency_ms": 3.1},
    {"stage": "price_validator", "action": "block", "latency_ms": 1.8}
  ],
  "response_delivered": false
}
```

Logging Rule

Audit logs should store decision metadata and minimal finding metadata. They should not store raw PII values or entire unsafe response strings by default.
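The logging rule can be enforced with a whitelist-style sanitizer that copies only decision metadata out of a finding. The field names mirror the audit record example; the `matched_text` field being dropped is an assumption about what raw finding payloads contain.

```python
# Whitelist of finding fields safe to persist in audit logs.
AUDIT_SAFE_FIELDS = ("stage", "severity", "finding_type")


def sanitize_finding(finding: dict) -> dict:
    """Keep stage / severity / type; never persist matched raw text or PII."""
    return {k: finding[k] for k in AUDIT_SAFE_FIELDS if k in finding}
```

Whitelisting is deliberately preferred over blacklisting here: a new field added to findings later is excluded from logs by default rather than leaked by default.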


Testing Strategy

Test Layers

  1. Unit tests: each stage is tested with deterministic fixtures.
  2. Contract tests: structured response contracts are tested so validators receive stable fields.
  3. Golden set evaluation: realistic prompts and expected pass or block outcomes are replayed on every policy change.
  4. Combined pipeline tests: multi-stage interactions are tested to catch false-positive cascades.
  5. Shadow mode: new policies run without user impact and compare outcomes to the live policy.
  6. Canary rollout: a small traffic slice uses the new policy version before full rollout.

Rollout Flow

```mermaid
flowchart LR
    A[Policy Change] --> B[Offline Evaluation]
    B --> C{Passes Golden Set}
    C -->|No| X[Reject Change]
    C -->|Yes| D[Shadow Mode]
    D --> E{Metrics Stable}
    E -->|No| X
    E -->|Yes| F[Canary Release]
    F --> G{Fallback and Block Rates Healthy}
    G -->|No| H[Rollback]
    G -->|Yes| I[Full Rollout]
```

What The Golden Set Must Include

  • popular manga titles with violent words
  • box-set and promotion price explanations
  • competitor-baiting prompts
  • guest-user account questions
  • multilingual title and locale variants
  • real and hallucinated ASIN examples
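A golden-set replay harness can be as simple as the sketch below: each case pairs a draft response and context with the expected final action, and any mismatch fails the policy change. `run_pipeline` is a stand-in for the real guardrail entry point; the case schema is an assumption.

```python
def evaluate_golden_set(cases, run_pipeline):
    """Replay every golden case; return (name, expected, got) for mismatches."""
    failures = []
    for case in cases:
        got = run_pipeline(case["response"], case["ctx"])
        if got != case["expected_action"]:
            failures.append((case["name"], case["expected_action"], got))
    return failures
```

In the rollout flow above, a non-empty failure list at the "Passes Golden Set" gate rejects the change before it ever reaches shadow mode.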

Failure Handling And Resilience

What If A Detector Times Out

Not all stage dependencies are equal, so timeout behavior must be stage-specific.

| Stage | Timeout Policy | Reason |
| --- | --- | --- |
| PII Filter | fail closed for high-confidence regex, fail open for NER with audit | regex already catches most structured PII |
| Price Validator | fail closed | wrong prices cannot be shown |
| Toxicity Filter | fail closed on high-risk intents, otherwise fallback | safer to block than ship harmful output |
| Competitor Filter | fail closed | brand risk is cheap to prevent |
| ASIN Validator | fail closed | grounding is required |
| Scope Check | fail closed | unsupported claims are risky |

The important design point is that "availability first" is not always correct. For trust-sensitive commerce answers, failing closed is often the safer product choice.
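A sketch of stage-specific timeout handling, assuming each stage is a plain callable returning an action string. The `FAIL_MODE` table, timeout value, and thread-based mechanics are illustrative, not the production implementation.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Per-stage behavior on timeout; anything unlisted fails closed by default.
FAIL_MODE = {"price_validator": "closed", "pii_filter": "open_with_audit"}


def run_stage_with_timeout(stage_name, stage_fn, response, timeout_s=0.05):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(stage_fn, response)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        # Fail closed blocks the answer; fail open passes it with an audit flag.
        if FAIL_MODE.get(stage_name, "closed") == "closed":
            return "block"
        return "audit_only"
    finally:
        pool.shutdown(wait=False, cancel_futures=True)
```

Note the default in `FAIL_MODE.get`: a newly added stage without an explicit policy fails closed, matching the table above.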


Architecture Decisions And Tradeoffs

| Decision | Chosen Approach | Alternative | Upside | Downside |
| --- | --- | --- | --- | --- |
| Placement | in-process with orchestrator | standalone service | lower latency and easier access to context | tighter runtime coupling |
| Execution model | serial fail-fast | parallel validation | simpler tracing and cheaper on early blocks | latency is cumulative |
| PII action | modify first | block everything | preserves useful content | requires safe redaction logic |
| Price action | hard block | inline correction | avoids accidental wrong commerce claims | can feel less helpful |
| Toxicity design | generic model plus domain override | generic model only | far fewer false positives on manga titles | allowlist maintenance |
| Grounding check | catalog plus retrieval | catalog only | catches memorized products | needs more request context |
| Policy routing | intent-aware | same rules for all traffic | fewer false positives | more configuration complexity |
| Streaming | buffer before render | raw token streaming | no unsafe partial output | slightly slower first token |

Follow-Up Questions And Deep Answers

Follow-Up Question 1: Why not rely on Bedrock Guardrails or one moderation API for everything?

Because MangaAssist has domain-specific business constraints that a generic moderation API does not understand. A moderation API can help with toxicity, but it cannot authoritatively validate catalog prices, ASIN grounding, competitor policy, or capability scope. Those checks require structured business data and deterministic logic.

The deeper lesson is that guardrails in commerce are a mix of safety policy and business correctness. A single generic model can assist, but it cannot be the only enforcement mechanism.

Follow-Up Question 2: Why are guardrails post-generation instead of only pre-generation?

Pre-generation controls reduce risk. They do not prove the final output is safe. The model can still hallucinate or synthesize new content after seeing safe inputs.

Post-generation guardrails inspect what will actually be shown to the user. That is the only place where the system can deterministically validate the final answer against catalog truth, scope rules, and brand policy.

Follow-Up Question 3: Why does PII usually redact while price usually blocks?

The risk profiles are different. If the model says something useful but accidentally includes an email address, the unsafe part is localized and can be removed while preserving value. If the model gives a wrong price, the surrounding explanation may also be wrong, and the user may treat it as a store commitment.

So PII is often a text-repair problem, while pricing is a trust-and-liability problem. That is why the action differs.

Follow-Up Question 4: How do you keep false positives low without making the system permissive?

Three things matter:

  1. use structured context, not raw text-only heuristics
  2. activate stages by intent and user state
  3. calibrate with domain-specific datasets and audit loops

The most common failure is treating threshold tuning as the only lever. In reality, false positives usually fall when context improves, not only when thresholds change.

Follow-Up Question 5: How do you support streaming if the response must be fully validated first?

You do not stream raw model tokens directly. You buffer the model output, run guardrails, then stream the approved final text to the client. That preserves the frontend streaming contract without exposing unvalidated content.

If lower time-to-first-token becomes critical, the safe optimization is to stream deterministic scaffolding such as product cards or loading states, not raw unvalidated model text.

Follow-Up Question 6: What happens when policy changes need to roll out quickly after an incident?

Policy is versioned outside the stage code. That allows a fast configuration change, but the rollout still goes through shadow mode and canary unless the incident is severe enough to justify an emergency fail-closed switch.

The important operational discipline is to separate "ship a rule" from "prove the rule is healthy." Emergency changes are sometimes necessary, but they must be easy to roll back with a policy version change and not require a code deploy.

Follow-Up Question 7: How do you detect false negatives when users do not always report them?

You need proxy signals:

  • sampled human review of passed answers
  • downstream complaint or thumbs-down analysis
  • replay on adversarial datasets
  • consistency checks between structured data and delivered text

False negatives are harder than false positives because they are often silent. That is why audit sampling on passed traffic matters as much as reviewing blocked traffic.

Follow-Up Question 8: How do you make this work across locales like JP and US catalogs?

Locale affects multiple stages:

  • PII patterns differ
  • titles and editions differ
  • pricing currency and format differ
  • competitor policies can differ by market

So locale must be part of policy resolution, allowlist selection, and catalog snapshot selection. A guardrail pipeline that ignores locale usually works in one market and quietly degrades in another.

Follow-Up Question 9: Why not run the stages in parallel to reduce latency?

Parallel execution lowers wall-clock latency but removes some operational advantages of fail-fast serial execution. In this system, the total serial budget is small enough that simpler tracing and cheaper short-circuiting are worth more than saving a few milliseconds.

If traffic or latency pressure grows, the right next step is usually a hybrid model: run cheap deterministic stages first, then parallelize only the expensive ones. Full parallelism is rarely the first optimization worth taking.

Follow-Up Question 10: How do you prevent the guardrails layer from becoming a fragile pile of regexes?

By treating it as a versioned subsystem with contracts, datasets, policies, and ownership. Regex still has a place, especially for fast deterministic checks, but it should sit alongside structured data validation, allowlists, classifier outputs, and offline evaluation.

The real anti-pattern is not regex itself. The anti-pattern is unversioned rules with no tests, no metrics, and no clear action model.


Key Takeaways

  1. Guardrails are a runtime trust boundary, not a documentation checkbox.
  2. Structured context is what turns content filtering into real business validation.
  3. Domain-aware tuning is mandatory for manga titles, genres, and product language.
  4. Cross-stage evaluation matters because the pipeline can fail as a system even when each stage looks fine alone.
  5. Observability, auditability, and policy rollout discipline are part of the architecture.

Cross-References