
GenAI Scenario 02 — Policy / T&C Document Versioning

TL;DR

Legal updates the return policy on the 15th of every month, and 9 of those changes per year materially alter the answer the bot must give. The Support-Policy MCP's golden set was built against a single policy version, regionless. Today there are 4 region variants (US, EU, JP, BR) and 14 active policy versions across them. Old golden answers cite clauses that have been retracted, the retriever happily returns superseded documents because they're indexed alongside the live ones, and the judge keeps approving "looks legally-flavored" answers that are technically wrong. The fix shape is policy ground truth as a triple-keyed object — (question, region, version) — with a versioned corpus, version-aware retrieval, and a judge that fails closed when the cited clause is from a retracted version.

Context & Trigger

  • Axis of change: Requirements (the policy text, the regional split, and the compliance posture all shift on a legal/PM clock, not a data clock).
  • Subsystem affected: RAG-MCP-Integration/05-support-policy-mcp.md and the versioned S3 + OpenSearch corpus described there.
  • Trigger event: A combined incident in Q2 where (a) the EU return window changed from 14 to 30 days under updated consumer-protection rules, and (b) Brazil was added as a new region with its own policy doc. The bot continued telling EU customers "you have 14 days" for three weeks before legal noticed in a customer escalation.

This scenario is about policy ground truth, which is fundamentally different from data ground truth: it is normative, not empirical. Re-labeling against a moved spec is the work; "looking at production" doesn't help because production is what's wrong.

The Old Ground Truth

The original Support-Policy MCP design:

  • One policy doc set in S3 (policies/return-policy.md, policies/refund-policy.md, etc.) covering ~80 distinct policy questions.
  • A golden set of 400 (question, expected_answer, expected_clause_id) triples built by support ops + legal sign-off in Q1.
  • Confidence threshold (already documented): if retriever similarity < 0.78 or the cited clause is missing, defer to a human agent via Zendesk.
  • Reasonable assumptions:
      • Policy doc is canonical — there's one of it.
      • Region is irrelevant — the company shipped to one region (US).
      • "Correct" = "answer matches expected_answer text within 0.85 ROUGE + cites expected_clause_id."

That design was fine for one region, one policy version. The trouble is it bakes regionlessness and versionlessness into the data model.

The New Reality

Three things have moved:

  1. Region became a primary key. "What's your return window?" has different correct answers in US (30 days), EU (30 days under new rule, 14 under old contracts still in force for purchases before 2026-02-01), JP (8 days), BR (7 days). A region-blind golden answer is now actively wrong somewhere.
  2. Version became a primary key. A customer who bought on 2026-01-15 in the EU is governed by the old 14-day window. A customer who buys today is governed by the new 30-day window. The bot must answer based on the purchase-time policy, not the latest one. "Latest version is correct" is wrong.
  3. Retracted clauses are still in the index. When legal updated the EU policy, the new version was added to the corpus. The old version was not removed — quite intentionally, for audit and for in-flight orders. But the retriever is version-blind; it can rank the retracted clause higher than the live one and the bot will cite it confidently.

The schema has expanded from (question → answer) to (question, region, effective_date) → (answer, citation_clause_id, source_version_id). This is a triple-keyed lookup, not a single-keyed one. Treating it as single-keyed is no longer a quality bug — it's a compliance bug.
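The expanded schema can be written down as types — a minimal sketch; the dataclass names are illustrative, not part of the original design:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical types mirroring the (question, region, effective_date)
# -> (answer, citation_clause_id, source_version_id) schema from the text.
@dataclass(frozen=True)
class PolicyQuery:
    question: str
    region: str
    effective_date: date  # purchase-time date, NOT "today"

@dataclass(frozen=True)
class PolicyAnswer:
    answer: str
    citation_clause_id: str
    source_version_id: str
```

Making the key a frozen value object is the point: a region- or date-less lookup simply doesn't type-check.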

Why Naive Approaches Fail

  • "Just re-index when policy changes." Loses the audit trail. In-flight orders are still governed by the old version; the policy MCP must still cite it correctly when asked.
  • "Just route by region in the system prompt." Region routing alone doesn't fix version-by-purchase-date. And if the region detector is wrong (e.g., user is logged into a different store than the one they're shipping to), the answer is silently wrong.
  • "Just always cite the latest." Customer with a 2026-01-15 EU purchase gets the new 30-day window, demands a refund on day 25, finds out their purchase is governed by 14-day rule. Now you owe them — and probably owe a regulator an explanation.
  • "Just have a human review every policy answer." Defeats the purpose of self-service deflection. The CSAT and cost case for the chatbot collapses.
  • "Re-label the golden set to the new policy." This is necessary but insufficient. It fixes the test, not the corpus. And it loses the ability to test old-version correctness for in-flight orders.

Detection — How You Notice the Shift

Online signals.

  • Customer escalations with "the bot said 14 days but I have 30" or vice versa — these are the loudest signals, but they often arrive only after the relevant regulatory window has passed.
  • Refund-rate spike on returns processed beyond the policy window (i.e., agents overriding what the bot told the customer).
  • Region-vs-IP mismatch rate: how often does a chat originate from an IP that doesn't match the customer's billing region? If high, region routing is unreliable.

Offline signals.

  • Re-running the golden set after a policy change shows drops only in the changed region — but the test is run against the new policy doc, so it actually masks failures on in-flight orders.
  • Citation-grounding rate (fraction of bot answers that cite a non-retracted clause) is the leading metric. If it drops below 99.5%, you have retracted clauses leaking.
  • LLM-as-judge disagreement rate spikes on policy questions specifically when versions changed.

Distribution signals.

  • Mean age of cited clause: tracked per (region, intent). A sudden uptick means retracted clauses are winning ranking.
  • Per-region answer entropy: if the bot gives the same answer regardless of region for a question that should differ by region, region-conditioning is broken.

Architecture / Implementation Deep Dive

flowchart TB
    subgraph Authoring["Policy authoring (legal + PM)"]
        EDIT["Policy doc<br/>versioned in Git"]
        APPROVE["Legal sign-off<br/>(commit message)"]
    end

    subgraph Store["Versioned corpus"]
        S3["S3<br/>policies/{region}/v{N}/<br/>policy.md"]
        MAN["Manifest<br/>region · version · effective_dates"]
        OS["OpenSearch index<br/>per-(region, version)<br/>or filter-by-version"]
    end

    subgraph GT["Triple-keyed golden set"]
        GS["(question, region, effective_date)<br/>→ answer, clause_id, version_id"]
        AUDIT["Audit table:<br/>which version was used<br/>for which conversation"]
    end

    subgraph Serve["Serving"]
        ROUTER["Region resolver<br/>+ purchase-date resolver"]
        RAG["RAG with version filter<br/>retrieves only effective<br/>(region, version) clauses"]
        FM["FM with strict prompt:<br/>cite version_id explicitly"]
        JUDGE["Judge (online sample)<br/>fails closed if cited<br/>version != effective version"]
    end

    EDIT --> APPROVE --> S3
    S3 --> MAN --> OS
    MAN --> GS
    ROUTER --> RAG --> OS
    OS --> FM
    FM --> JUDGE
    JUDGE --> AUDIT
    GS -.->|offline eval| RAG

    style GS fill:#fde68a,stroke:#92400e,color:#111
    style ROUTER fill:#dbeafe,stroke:#1e40af,color:#111
    style JUDGE fill:#fee2e2,stroke:#991b1b,color:#111
    style AUDIT fill:#dcfce7,stroke:#166534,color:#111

1. Data layer — triple-keyed ground truth + versioned corpus

Corpus layout in S3:

policies/
├── manifest.json
├── us/
│   ├── v3/policy.md   (effective 2026-02-01 → null)
│   ├── v2/policy.md   (effective 2025-08-01 → 2026-02-01)
│   └── v1/policy.md   (effective 2024-01-01 → 2025-08-01)
├── eu/
│   ├── v4/policy.md   (effective 2026-02-01 → null)
│   ├── v3/policy.md   (effective 2025-04-01 → 2026-02-01)
│   └── ...
├── jp/...
└── br/
    └── v1/policy.md   (effective 2026-03-01 → null)

Manifest.json is the source of truth for "which version applies when in which region":

{
  "regions": {
    "eu": {
      "versions": [
        {"id": "eu-v4", "effective_from": "2026-02-01", "effective_to": null,
         "git_sha": "a1b2c3", "approved_by": "legal-eu@..."},
        {"id": "eu-v3", "effective_from": "2025-04-01", "effective_to": "2026-02-01",
         "git_sha": "9f8e7d"}
      ]
    }
  }
}
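Resolving "which version applies on which date" from the manifest is a small interval lookup. A hedged sketch, assuming half-open [effective_from, effective_to) windows — which is what the adjacent v3/v4 boundary dates imply — and illustrative function naming:

```python
from datetime import date

def resolve_version(manifest: dict, region: str, on: date) -> str:
    """Return the version id in force for `region` on date `on`.

    Assumes half-open windows: a version is effective from `effective_from`
    (inclusive) up to `effective_to` (exclusive); null means open-ended.
    """
    for v in manifest["regions"][region]["versions"]:
        start = date.fromisoformat(v["effective_from"])
        end = date.fromisoformat(v["effective_to"]) if v["effective_to"] else None
        if start <= on and (end is None or on < end):
            return v["id"]
    # Fail loudly: no silent fallback to "latest".
    raise LookupError(f"no {region} policy version effective on {on}")
```

Note the final raise: a date outside all windows is an error surfaced to the caller, never a silent default.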

Triple-keyed golden set:

{
  "id": "gt-eu-return-window-001",
  "question": "What is the return window for my purchase?",
  "region": "eu",
  "effective_date": "2026-03-01",
  "expected_answer_template": "30 calendar days from delivery.",
  "expected_clause_id": "eu-v4#sec-3.2",
  "expected_version_id": "eu-v4",
  "decay_class": "policy_normative"
}

The same question with effective_date: 2026-01-15 resolves to expected_version_id: "eu-v3" and a different answer. The golden set must enumerate both — they are independent test cases.
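Enumerating both dates as independent test cases might look like the following sketch; `run_golden_case` and the injected callables are hypothetical stand-ins for the real eval harness:

```python
def run_golden_case(case: dict, retrieve_policy, generate_answer) -> dict:
    """Run one triple-keyed golden case and check clause + version.

    `retrieve_policy(question, region, effective_date)` and
    `generate_answer(question, chunks) -> (answer_text, cited_clause_id)`
    are injected so the harness can run against the real stack or stubs.
    """
    chunks = retrieve_policy(case["question"], case["region"],
                             case["effective_date"])
    answer_text, cited_clause = generate_answer(case["question"], chunks)
    return {
        "id": case["id"],
        "clause_ok": cited_clause == case["expected_clause_id"],
        # The version prefix of "eu-v4#sec-3.2" must match expected_version_id.
        "version_ok": cited_clause.split("#")[0] == case["expected_version_id"],
    }
```

The same question then appears twice in the suite, once per effective_date, and a regression on either is an independent failure.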

2. Pipeline layer — version-aware RAG

The retriever is version-aware:

from datetime import date

# Assumed module-level singletons: `manifest` is the parsed manifest.json
# wrapper, `os_client` is the OpenSearch client.
def retrieve_policy(question: str, region: str, effective_date: date, top_k: int = 5):
    # Resolve the version in force on the effective date, then hard-filter
    # retrieval to that (region, version) slice so retracted clauses can't rank.
    version_id = manifest.resolve_version(region=region, on=effective_date)
    return os_client.search(
        index="policies",
        query={
            "bool": {
                "must": [{"match": {"text": question}}],
                "filter": [
                    {"term": {"region": region}},
                    {"term": {"version_id": version_id}},
                ],
            }
        },
        size=top_k,
    )

Two non-obvious choices:

  • Filter-by-version, single index — simpler than a per-version index. Trade-off: index size grows with versions retained, but query latency stays flat with proper sharding.
  • effective_date is mandatory. No default to today(). If the caller can't determine the date (anonymous chat, no purchase context), the system explicitly serves "the latest" with a banner "This reflects current policy as of {date}; if your purchase was before {pivot}, please ask a human agent." — and the conversation is flagged for shadow review.
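The "no default to today()" rule can be enforced at the API boundary. A sketch under the assumption that callers must explicitly opt in to the latest-with-banner fallback; the function name and return shape are illustrative:

```python
from datetime import date
from typing import Optional, Tuple

def resolve_effective_date(purchase_date: Optional[date],
                           allow_latest_fallback: bool = False
                           ) -> Tuple[object, Optional[str]]:
    """Return (effective_date_or_"latest", banner_text_or_None).

    Refuses to guess: a missing date is an error unless the caller
    explicitly opted into the banner-and-shadow-review path.
    """
    if purchase_date is not None:
        return purchase_date, None  # normal path, no banner
    if not allow_latest_fallback:
        raise ValueError("effective_date is mandatory for policy retrieval")
    banner = ("This reflects current policy as of today; if your purchase "
              "was earlier, please ask a human agent.")
    return "latest", banner  # caller must also flag for shadow review
```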

3. Serving layer — three layers of containment

Region resolver — pulls from authenticated session first, then billing address, then IP geolocation. Disagreements between session and IP raise a "low confidence" flag that affects the next layer.

Prompt-time hard constraint:

You are answering a policy question. The applicable policy is:
- Region: {region}
- Version: {version_id} (effective {effective_from} to {effective_to})

You MUST cite the specific clause id you used. Format:
"... <answer text> ... [clause: {version_id}#{clause_id}]"

If the retrieved chunks do not contain a relevant clause for the question,
respond exactly: "I can't confirm the answer for your specific situation;
let me hand you to a human agent."

Do NOT use prior knowledge of policy. Use only the provided clauses.

Online judge sample — every Nth (configurable, default 1/100) policy answer is sampled, the cited version_id is parsed from the response, and compared against the resolved-version-for-this-conversation. If they disagree, the answer is held back, the conversation is escalated, and the operator is paged. Mismatch rate on this gate is the strongest leading indicator of trouble.
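The gate's parse-and-compare step might look like this sketch; the citation regex follows the prompt's [clause: {version_id}#{clause_id}] format, and the verdict strings are illustrative:

```python
import re

# Matches the mandated format, e.g. "[clause: eu-v4#sec-3.2]".
CITE_RE = re.compile(r"\[clause:\s*([a-z]{2}-v\d+)#([\w.\-]+)\]")

def judge_citation(answer_text: str, resolved_version: str,
                   known_versions: set) -> str:
    m = CITE_RE.search(answer_text)
    if m is None:
        return "fail_closed_unparseable"      # no citation -> hold the answer
    cited_version = m.group(1)
    if cited_version not in known_versions:
        return "fail_closed_unknown_version"  # hallucinated id -> page operator
    if cited_version != resolved_version:
        return "fail_closed_mismatch"         # wrong/retracted version cited
    return "consistent"
```

Every non-"consistent" verdict holds the answer back; only the unknown-version case additionally pages, since it implies a corpus/manifest mismatch or a hallucinating model.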

4. Governance — audit and rollback

Every policy answer is logged with:

{
  "conversation_id": "...",
  "question": "...",
  "resolved_region": "eu",
  "resolved_version": "eu-v4",
  "purchase_effective_date": "2026-03-01",
  "cited_clause_ids": ["eu-v4#sec-3.2"],
  "answer_text": "...",
  "judge_verdict": "consistent",
  "manifest_sha": "a1b2c3"
}

If a policy is later determined to have been mis-cited, this log lets legal reconstruct exactly what the bot said to whom, under which version, and how to remediate. The manifest_sha makes it reproducible — the policy doc and the manifest at conversation time are pinned.

Rollback path: revert the manifest.json (which is in Git), force the manifest to be re-pulled on the next request. Within ~60 seconds, the system serves the previous version. The corpus doesn't have to be re-indexed because all versions are already there.
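The ~60-second rollback property follows from a TTL-based manifest cache in the serving path — a minimal sketch, assuming a hypothetical fetch_manifest loader:

```python
import time

class ManifestCache:
    """Re-pulls the manifest on a short TTL so a Git revert of
    manifest.json takes effect on the next refresh, without re-indexing.
    `fetch_manifest` is an assumed loader (e.g. S3 GET or Git checkout)."""

    def __init__(self, fetch_manifest, ttl_seconds: float = 60.0):
        self._fetch = fetch_manifest
        self._ttl = ttl_seconds
        self._cached = None
        self._fetched_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._cached is None or now - self._fetched_at > self._ttl:
            self._cached = self._fetch()
            self._fetched_at = now
        return self._cached
```

The TTL is the rollback latency bound: shortening it trades a little load on the manifest store for faster reverts.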

Trade-offs & Alternatives Considered

| Approach | Compliance safety | Latency | Ops complexity | Verdict |
|---|---|---|---|---|
| Single doc, latest only | Low (in-flight orders mis-served) | Best | Lowest | Original — unsafe |
| Per-region routing only (no version) | Medium | Best | Low | Fixes regions, not versions |
| Per-version index, region filter | High | Higher (per-region cluster) | Higher | Over-engineered for this scale |
| Single index + region+version filter, manifest-driven | High | Good | Medium | Chosen |
| Always escalate policy questions to humans | Highest | Worst | Lowest cost | Defeats deflection |
| FM-only, no RAG, retrained on new policy | Low (hallucination risk) | Best | High (retrain cycle) | Unacceptable for legal text |

The chosen approach optimizes for compliance safety while keeping retrieval simple. The manifest is the only piece that legal touches; everything else is engineering.

Production Pitfalls

  1. The manifest becomes a write-bottleneck. Legal wants to edit it directly. They'll use spreadsheets that don't validate. Force changes through a PR template with required fields and CI validation that effective-date ranges don't overlap or leave gaps.
  2. Effective-date ambiguity at midnight. A purchase at 2026-02-01T00:00:00+09:00 (JP) versus 2026-02-01T00:00:00Z (manifest) can resolve to different versions. Pin the manifest to UTC and document this loudly. Edge cases will hit the human-handoff path; that's the right behavior.
  3. The golden set's effective_date field gets stale. A test case dated 2026-01-15 still tests an answer that should refer to eu-v3 — but if you let effective_date mean "today" by default in tests, the test silently re-targets eu-v4 and you've lost coverage of in-flight-order behavior. Pin dates explicitly in golden test fixtures, never use now().
  4. The "older version still in index" assumption needs an expiry. Three years of versions × four regions × monthly updates = 144 versions in your index. Periodically retire versions older than the longest possible in-flight order (e.g., gift cards may have multi-year windows). Replace retired versions with stub clauses that say "this policy is no longer in effect" so retrieval still works for legacy audit queries.
  5. Region detection from IP geolocation is unreliable. ~5% of users are on VPNs, traveling, etc. If IP geolocation alone says "EU" but the account is US, the safer behavior is to serve US (the account's home region) and surface the discrepancy. Defaulting to IP without checking the account is a compliance hazard.

Interview Q&A Drill

Opening question

Legal updates the return policy. The policy doc in your S3 corpus is overwritten with the new version, the index is re-built, and within 24 hours the bot is telling customers the new return window. A few weeks later, a customer who bought before the update demands a refund based on what the bot told them — but the policy at purchase time was different. How would you redesign the Support-Policy MCP to prevent this?

Model answer.

The bug is that the MCP treats policy as single-keyed (question → answer). It's actually triple-keyed (question, region, effective-date → answer, clause, version). I'd redesign in three layers.

  1. Versioned corpus. Keep all historical versions of every policy doc in S3 under policies/{region}/v{N}/. A manifest.json (in Git, legal-approved) maps each version to an effective_from/effective_to window. Index every version into OpenSearch with a version_id field.
  2. Version-aware retrieval. The retriever requires (region, effective_date) from the caller, resolves the manifest to a version_id, and filters retrieval to that version. No defaulting to "latest." If effective_date is unknown, the system explicitly says so and routes to a human.
  3. Judge that fails closed. Online judge samples answers, parses the cited version_id, and compares it to the resolved version for the conversation. Mismatch → hold and escalate. This is the gate that would have caught the regression in the question.

The conceptual move is from "policy is data we serve" to "policy is normative truth keyed on time and region, and the cite is part of the answer."

Follow-up grill 1

You said the manifest is in Git. Won't legal hate that? They want to edit a Word doc and have it apply.

Right, and that's the real organizational lift. The compromise: legal authors in their preferred tool (often Word/Confluence), an ops-eng pair converts the diff into a PR against the policy repo, the PR contains a templated change description (region, effective_from/to, summary, clause-level diff), and the merge requires a "legal approval" check that maps to a real legal sign-off captured in the commit message. This trades editing convenience for an immutable, auditable, reproducible record. From legal's perspective, they're not blocked — they sign off on a PR description, the same way they sign off on a Word doc redline. From engineering's perspective, no policy ships without a sha + an approval. The CI then validates the manifest mechanically (no overlap, no gaps, no missing fields) — that's where you save legal from accidental compliance breaches caused by typos.

Follow-up grill 2

Walk me through what happens at midnight on a policy version cutover. There's a queued conversation that started at 23:58 — which version applies?

Two failure modes here. (1) Conversation-level race — a single conversation should pin to one manifest sha at the moment the relevant policy question is asked, not at session start, because users sometimes ask follow-ups across midnight. The pin happens per-turn, not per-session. (2) Effective-date semantics — effective_from: 2026-02-01 is in UTC by manifest convention, but the customer's purchase effective date is in their local time zone if relevant. The resolver must reconcile: for "what is the policy now," use UTC against now(); for "what is the policy for my purchase," use the purchase timestamp from the order system in its native zone, normalized to UTC against the manifest. The hard cases — purchase at 23:59 local on the 31st — fall to the human handoff path explicitly. The policy answer for ambiguous purchase times must not be confidently rendered. We log the ambiguity and the operator decides.

Follow-up grill 3

The judge fails closed when cited version doesn't match resolved version. What if the FM cites no version, or hallucinates a version_id?

Three layers of defense and then a fallback. (1) Prompt enforces format. The system prompt mandates a specific citation format ([clause: {version_id}#{clause_id}]). (2) Post-generation parser. If the response doesn't parse, the gate treats it as a fail-closed event — the answer doesn't go to the user, the system retries once with a stricter prompt (or a smaller model with a more reliable format), and after one retry the conversation hands off to a human. (3) Closed citation set. The judge checks the cited version_id against the manifest's known set; an unknown id is a hard fail and pages the operator (it indicates either a hallucinating model or a corpus/manifest mismatch — both are critical). The fallback is the human handoff. Never serve the answer with a missing or invalid citation. Customers tolerate a one-in-a-thousand "let me get an agent for you" — they don't tolerate "the bot lied about my refund window."

Follow-up grill 4

Three years from now, the corpus has 200 versions across 6 regions. Index size has tripled. Recall is fine, but latency p95 is up 40%. What do you cut?

Three options ranked by safety. (1) Tier the index by recency — split into "active" (last 12 months) and "archive" (older). Route the common case (questions about recent purchases) to the active tier and only hit the archive when effective_date falls in the older range. Most queries don't need archive, so p95 returns to baseline. (2) Compress the archive — older versions can be stored at coarser granularity (one chunk per clause instead of per paragraph) since they're rarely retrieved and the recall hit is acceptable. (3) Cap the corpus at the longest in-flight order window — if the longest legitimate purchase that could trigger a policy lookup is 5 years (e.g., gift cards with multi-year validity), retire anything older. The hard-no: do not collapse versions into "latest only" to save space; that re-introduces the original bug. Any compression must preserve the version_id and effective-date metadata even if the body shrinks.

Architect-level escalation 1

A new region, India, is launching in three months. India has different consumer-protection rules, a different language requirement (Hindi + English), and the policy is still being drafted. What's your strategy from now until launch so the day-zero failure mode is bounded?

Phase the rollout. (1) Pre-launch (months -3 to -1): Build the empty India entry in the manifest. Stub policy clauses are inserted with the explicit content "policy not yet finalized — escalate to a human." India is in the corpus but every retrieval into India routes to deflection. The point of doing this work before the policy is finalized is that the system is ready and the content is the only blocker. (2) Soft launch (month -1): Final policy is approved. India v1 enters the manifest with effective_from = launch_date. Triple-keyed golden set is built (legal + India ops + an internal Hindi-fluent reviewer) — minimum 50 (question, India, effective_date) cases per language. Run the golden set in shadow mode against India traffic but keep deflecting. (3) Day-zero (launch): Promote India out of deflection only if cohort-level metrics (recall, citation-grounding, judge agreement) clear thresholds that match other launched regions. (4) Post-launch (months +1 to +3): Monitor escalation rate per (region, intent), and feed disagreements into the golden set. The bound on day-zero failure is that a policy answer for India cannot be served without the version-aware retriever finding a non-stub clause in the India v1 corpus — until India v1 is approved, all India questions deflect. The architectural commitment is: "no answer without verified ground truth."

Architect-level escalation 2

A regulator audits you and asks: "show us every customer who was told the wrong policy version because of a bot bug last year, and prove the answer the bot gave them." Can you?

That's the audit-trail question and it's why the manifest sha is logged per turn. The chain is:

  1. Conversation log captures (conversation_id, turn_id, resolved_region, resolved_version_id, manifest_sha, cited_clause_ids, answer_text).
  2. The manifest at sha X is reproducible from Git history; the policy doc at version v3-eu (say) is reproducible from S3 + version pin.
  3. To answer the auditor: query the conversation log for (judge_verdict = 'mismatch') over the audit window — that's the candidate set of bot-acknowledged misstatements. For each, pull the manifest at the conversation's sha, the resolved version, and the cited version, and reconstruct what was said.

The harder case: the regulator asks "are there customers where the bot was wrong but the judge missed it?" That requires re-running an updated judge over the historical log. The architecture supports it because the conversation log is canonical and includes the answer text — replay is offline-feasible. The cost is non-trivial (you're re-running a judge over potentially millions of conversations) which is why the online sampling rate matters: at 1/100 you have a representative sample, at 1/1000 you don't, and the audit re-run cost depends on sampling fidelity. The architectural decision I'd defend is: keep the full conversation log with citation metadata for at least the regulatory audit window (often 7 years for consumer-protection), even if you only judge-sample 1%. Logging is cheap; replay is expensive but feasible. Not logging makes the audit literally impossible.

Red-flag answers

  • "Just route by region in the system prompt." (Doesn't fix version-by-purchase-date.)
  • "Always cite the latest version." (Mis-serves in-flight orders.)
  • "Re-train the FM on the new policy." (Hallucination risk; can't audit.)
  • "Use embeddings to figure out which version applies." (Manifest is the source of truth; embeddings are wrong tool.)
  • "Re-run the golden set against the new policy and ship if it passes." (Loses coverage of old-version correctness.)

Strong-answer indicators

  • Names policy ground truth as triple-keyed (question, region, effective-date).
  • Distinguishes "latest" from "applicable."
  • Explicit about the manifest-as-truth pattern and the judge-as-gate pattern.
  • Has an opinion on the Git-vs-Word organizational lift and is realistic about it.
  • Treats audit/rollback as design-time concerns, not bolt-ons.