Constraint Scenario 05 — New Compliance Mandate Mid-Flight
Trigger: Mid-quarter, EU regulator publishes a new directive: AI-generated recommendations must include explicit disclosure, age-gating must be re-verified per session for adult content, and there's a new "right-to-deletion" SLA for any user data used in agent context. Compliance deadline: 90 days. Pillars stressed: P3 (capability flags + LLD) primarily; P1 (safety + eval) and P2 (state lifecycle) secondary.
TL;DR
Compliance changes are the constraint AI teams underestimate the most. The architecture has to express per-jurisdiction policy as data, not as branches in code. The harness pieces that survive contact with this — capability flags, scoped state, eval framework with new dimensions, audit-ready event log — are the same pieces that earned their keep in earlier scenarios. The cost is mostly engineering attention + ~6% latency budget for added safety steps.
The change
Three independent requirements landed together:
| Requirement | What it demands |
|---|---|
| AI disclosure | Every AI-generated recommendation must include "AI-generated, may contain errors" or equivalent, in user's locale |
| Per-session age re-verification | Adult-tier content gating refreshed every session, not once-per-account |
| Right-to-deletion SLA | Users can demand erasure of all their data used in agent context; 30-day SLA; deletions must be auditable |
Each is small in isolation. Together, they touch every layer.
The cascade — what changes where
```mermaid
flowchart TB
    Reg[Regulator directive] --> Disc[AI disclosure rule]
    Reg --> Age[Per-session age verification]
    Reg --> RTD[Right-to-deletion]
    Disc --> Compose[Compose node: append disclosure]
    Disc --> Eval[Eval rubric: disclosure-present dimension]
    Disc --> Locale[Per-locale phrasing]
    Age --> Capability[Capability flag: adult_session]
    Age --> SafetyPre[Safety pre-check enforces flag]
    Age --> SessionState[Session state schema gains verification timestamp]
    RTD --> StateSchema[State row + S3 retention extended with user-id index]
    RTD --> EventLog[Event log redaction tooling: per-user purge]
    RTD --> Audit[Audit endpoint: prove deletion]
    RTD --> CacheKey[Inference cache key includes user-deletion-epoch]
    Eval --> RegressionTest[New regression test in offline set]
    EventLog --> Glue[ETL adds purge job]
```
The sprawl is real. The harness contains it.
Where each piece lands in the pillars
Pillar 1 — Workflow / safety / eval changes
Compose node appends a per-locale disclosure when the response is AI-generated. The disclosure text is itself a versioned skill output (so legal can update it without a code change).
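A minimal sketch of that compose-node step. The `DISCLOSURE` table and function name are illustrative placeholders for the versioned skill output, and the non-English phrasing is not legal-approved text:

```python
# Hypothetical compose-node step: append the per-locale disclosure to
# AI-generated responses. In production the strings come from the
# versioned disclosure skill, not a hardcoded dict.
DISCLOSURE = {
    "en-GB": "AI-generated, may contain errors.",
    "de-DE": "KI-generiert, kann Fehler enthalten.",  # illustrative, not legal-approved
}

def append_disclosure(text: str, locale: str, is_ai_generated: bool,
                      require_disclosure: bool) -> str:
    # Only AI-generated content in jurisdictions that require it gets the note.
    if is_ai_generated and require_disclosure:
        note = DISCLOSURE.get(locale, DISCLOSURE["en-GB"])
        return f"{text}\n\n{note}"
    return text
```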
Safety pre-check gains a hard rule: if adult_session_verified_at < session_start_time, deny adult-tier content; redirect to age verification flow.
Eval rubric gains two new dimensions:
- disclosure_present (binary, hard fail if missing on AI-gen content).
- age_gate_respected (binary, hard fail if adult content shown without verified session).
Per-locale eval cells gain regression tests for the disclosure phrasing.
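The two new dimensions are conditional checks; a sketch over a flat per-turn record (field names are assumptions, not the real rubric schema):

```python
from dataclasses import dataclass

@dataclass
class TurnRecord:
    is_ai_generated: bool
    has_disclosure: bool
    adult_content_shown: bool
    session_age_verified: bool

def disclosure_present(r: TurnRecord) -> bool:
    # Hard fail: AI-generated content without the disclosure string.
    return r.has_disclosure if r.is_ai_generated else True

def age_gate_respected(r: TurnRecord) -> bool:
    # Hard fail: adult content shown without a session-fresh verification.
    return r.session_age_verified if r.adult_content_shown else True
```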
Pillar 2 — Harness changes
Session state gains an adult_verified_at timestamp, scoped per session, not per account. The pause/resume layer (story 05) already had session-scoped state — this is a field addition, not an architecture change.
Right-to-deletion is the messy one. The state row, S3 context, event log, and inference cache all hold user-derivable bits. We add:
- A user-id index on the state-row table for fast lookup (DDB GSI).
- An event-log purge job in Glue that runs daily and removes records by user_id_hash.
- A cache invalidation epoch per user — the cache key includes user_deletion_epoch; bumping the epoch invalidates all cached entries derived from that user.
- An audit endpoint that, given a user_id, returns "all data purged" with cryptographic proof (Merkle proofs over the deletion log).
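The epoch-bump mechanism can be sketched as follows; the in-memory epoch store and function names are assumptions (production would back the epoch with DDB, and stale entries age out via normal TTL):

```python
import hashlib

_deletion_epoch: dict[str, int] = {}  # user_id -> epoch counter (in-memory stand-in)

def bump_deletion_epoch(user_id: str) -> None:
    # Called on a right-to-deletion request. All prior cache keys for this
    # user become unreachable, so their entries expire without a full flush.
    _deletion_epoch[user_id] = _deletion_epoch.get(user_id, 0) + 1

def make_cache_key(user_id: str, prompt: str) -> str:
    # The epoch is part of the key, so bumping it invalidates every
    # cached entry derived from this user's data.
    epoch = _deletion_epoch.get(user_id, 0)
    raw = f"{user_id}:{epoch}:{prompt}"
    return hashlib.sha256(raw.encode()).hexdigest()
```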
Pillar 3 — LLD changes
Capability flags gain a per-jurisdiction policy:
```yaml
jurisdictions:
  EU:
    require_ai_disclosure: true
    age_verification_per_session: true
    rtd_sla_days: 30
  US:
    require_ai_disclosure: false
    age_verification_per_session: false
    rtd_sla_days: 30 # CCPA equivalent
  JP:
    ...
```
Skills gain a respects_jurisdiction_policy field and declare which policy fields they read.
The agent reads user.jurisdiction at the start of every turn and applies the policy. No code branch on country; only data lookup.
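A sketch of that data lookup, mirroring the YAML above in a plain dict (the fallback defaults for unlisted jurisdictions are an assumption):

```python
# Policy-as-data: the agent fetches policy fields per jurisdiction;
# there is no `if country == ...` branch anywhere in agent code.
JURISDICTION_POLICY = {
    "EU": {"require_ai_disclosure": True,
           "age_verification_per_session": True,
           "rtd_sla_days": 30},
    "US": {"require_ai_disclosure": False,
           "age_verification_per_session": False,
           "rtd_sla_days": 30},  # CCPA equivalent
}

DEFAULTS = {"require_ai_disclosure": False,
            "age_verification_per_session": False,
            "rtd_sla_days": 30}

def policy_for(jurisdiction: str) -> dict:
    # Unlisted jurisdictions fall back to the defaults above.
    return {**DEFAULTS, **JURISDICTION_POLICY.get(jurisdiction, {})}
```

Adding a jurisdiction (or a test jurisdiction, as in Q4 below) is a config change, not a code change.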
The 90-day rollout
```mermaid
gantt
    title 90-day compliance rollout
    dateFormat YYYY-MM-DD
    section Weeks 1-2
    Legal scoping + spec finalization :a1, 2026-04-29, 14d
    section Weeks 3-5
    Eval rubric extension :b1, 2026-05-13, 14d
    Disclosure skill + per-locale text :b2, 2026-05-13, 14d
    Capability flag schema update :b3, 2026-05-13, 7d
    Age-verification flow design :b4, 2026-05-13, 14d
    section Weeks 6-8
    State row schema migration :c1, 2026-06-03, 14d
    User-id index for RTD :c2, 2026-06-03, 14d
    Event-log purge job :c3, 2026-06-03, 14d
    Cache key with deletion epoch :c4, 2026-06-03, 7d
    Age-verification UX :c5, 2026-06-03, 14d
    Audit endpoint :c6, 2026-06-10, 14d
    section Weeks 9-11
    EU canary 5pct -> 100pct :d1, 2026-06-24, 21d
    Other jurisdictions hold on US :d2, 2026-06-24, 21d
    section Week 12
    Final audit + signoff :e1, 2026-07-15, 7d
```
Two-track design: EU-specific changes happen first under EU capability flag; US/JP/etc. unaffected. Other jurisdictions roll in later as their own laws catch up.
What goes wrong
Risk 1 — Disclosure phrasing varies in legal acceptability across EU member states
- Mitigation: country-level granularity (not just "EU"). Disclosure-skill is per-locale-per-country.
Risk 2 — Age-verification UX adds friction; conversion drops
- Mitigation: re-verification triggers only for adult-content sessions; non-adult content is unchanged. The verified status is cached for 24 hours within the session.
Risk 3 — RTD purge job has bugs; user data persists past deadline
- Mitigation: end-to-end test that actually creates a synthetic user, populates data, requests deletion, and verifies absence across all stores. Runs nightly. SLA dashboard shows oldest-pending-RTD; alarm at 25 days.
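A condensed sketch of that end-to-end test, with a `FakeStore` standing in for the real stores (state row, S3 context, event log, cache); names and methods are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class FakeStore:
    # Stand-in for one data store that can hold user-derivable records.
    name: str
    rows: dict = field(default_factory=dict)

    def write(self, user_id: str, payload: str) -> None:
        self.rows[user_id] = payload

    def purge(self, user_id: str) -> None:
        self.rows.pop(user_id, None)

    def contains_user(self, user_id: str) -> bool:
        return user_id in self.rows

def run_rtd_e2e(stores: list[FakeStore]) -> None:
    # Create a synthetic user, populate every store, request deletion,
    # then verify absence everywhere. A single leftover fails the run.
    user_id = "synthetic-rtd-probe"
    for s in stores:
        s.write(user_id, "canary")
    for s in stores:  # stands in for the real purge pipeline
        s.purge(user_id)
    leftovers = [s.name for s in stores if s.contains_user(user_id)]
    assert not leftovers, f"RTD purge incomplete in: {leftovers}"
```

The key property is that the test enumerates the stores explicitly, so a newly added store that isn't wired into the purge pipeline shows up as a failure.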
Risk 4 — Inference cache poisoning during epoch bump
- Mitigation: epoch-bump is online; cache misses gracefully (just slower for that user, ~3 turns until natural re-warm).
Risk 5 — Audit endpoint is queried adversarially (regulator audits 1000 users at once)
- Mitigation: audit endpoint is rate-limited and authenticated; can answer ~10K/hour without affecting prod traffic. We've load-tested it.
What the harness contributed
| Harness piece | Without it, this scenario costs |
|---|---|
| Capability flags (story 02) | 3-4 weeks more work to introduce per-jurisdiction branching |
| Eval rubric extensibility (story 04) | 2 weeks to retrofit dimensions; ongoing fragility |
| Versioned events (story 08) | Audit endpoint impossible to build without per-record provenance |
| Skill contracts (story 02) | Disclosure skill would have been 4 different copies |
| State scoping (story 05) | Per-session vs per-account distinction would have required schema rewrite |
| Inference cache key composition (story 06) | Cache invalidation per user would have required full flush |
The compliance scenario is where the LLD investment shows up most plainly. Teams that under-invested in capability flags and skill contracts re-architect; teams that invested re-configure.
Q&A drill — opening question
*Q: Can't we just add an "if EU, then..." in the prompt? Easier than all this.*
Three reasons not to:
1. Auditability. A regulator asking "show me you applied this rule on date X" needs a per-record trace, not "the prompt said to." Capability flags + event log give that; prompt rules don't.
2. Per-prompt drift. Prompts get edited by many teams. A compliance rule in a prompt is fragile; one rewrite drops it silently.
3. Per-skill enforcement. Some skills (e.g., recommendation explainer) need the rule; others don't. Prompt-level rules can't differentiate.
The compliance rule is data, not prose. The harness treats it as data.
Grilling — Round 1
Q1. Right-to-deletion across the inference cache — couldn't you just not cache anything user-specific?
Two reasons we still cache:
- The cache hit rate of 38% is worth ~$3M/month. Killing it for the 0.1% of users who'll request RTD is overkill.
- The cache holds outputs derived from user inputs, but the storage is hashed. Per-user deletion via epoch bump is the surgical option.
Q2. Age verification per session — does that mean every session asks the user to re-prove age?
Only if they request adult content. Default sessions don't trigger verification. The capability flag adult_content_requested is set only when the planner intends to surface adult content; the safety pre-check then enforces "session has fresh verification."
If user says "show me Berserk" (mature manga), we ask for age verification once. Subsequent requests for adult content within the session are allowed without re-asking. New session restarts the check.
Q3. The audit endpoint — how do you produce cryptographic proof of deletion?
We maintain a Merkle-tree-rooted log of deletion events. Each log entry is (user_id_hash, store_id, deletion_timestamp, witness_hash). The audit endpoint returns a proof path from a deletion entry to the public Merkle root. A regulator can verify the entry was logged at the claimed time and that all stores reported successful deletion.
The proof system is overkill for low-trust environments and necessary for high-trust regulatory environments. We err toward overkill given the EU regulator context.
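A toy sketch of the inclusion-proof mechanics over a deletion log. A real deployment would add domain separation between leaf and interior hashes (RFC 6962-style); this version shows only the proof-path idea:

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    level = [h(l) for l in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node to keep pairs
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def proof_path(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    # Sibling hashes from leaf to root; the bool marks "sibling is on the right".
    level = [h(l) for l in leaves]
    path = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        path.append((level[index ^ 1], index % 2 == 0))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return path

def verify(leaf: bytes, path: list[tuple[bytes, bool]], root: bytes) -> bool:
    # A verifier (e.g., the regulator) recomputes the root from the
    # claimed deletion entry and the proof path.
    node = h(leaf)
    for sibling, sibling_on_right in path:
        node = h(node + sibling) if sibling_on_right else h(sibling + node)
    return node == root
```

The audit endpoint serves `proof_path` for a deletion entry; anyone holding the published root can run `verify` without trusting the endpoint.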
Grilling — Round 2 (architect-level)
Q4. How does compliance interact with the active eval framework? Some adversarial probes (story 04) might violate the new rules in a sandboxed way.
Probes are scoped: they run on a synthetic user with a jurisdiction: TEST flag. The capability-flag system understands this jurisdiction and applies relaxed rules. Probes can't accidentally trigger compliance flagging; production data can't accidentally leak into the probe pipeline.
This is one of those LLD details that pays off — the jurisdiction-as-data design lets us add new "test" jurisdictions trivially.
Q5. A new EU sub-rule requires "user must opt-in to AI-generated content explicitly." How does that flow?
It's a new capability flag: ai_content_consent: bool. The agent's Plan node, on jurisdiction == EU AND !ai_content_consent, routes to a deterministic "consent-prompt" sub-flow. No AI-generated content until consent.
We extend the user-state schema to hold ai_content_consent_at. Audit endpoint adds this field. Disclosure skill adds an "opt-in needed" branch. ~3 days of work because the machinery is there.
Q6. Walk through what happens if a regulator demands a feature be turned off for EU users tomorrow.
Capability flag flip. We've practiced this — there's a runbook. Steps:
1. Legal/policy team decides which capability needs to flip.
2. Engineering pushes a config change (no code change). Flag flips for the EU jurisdiction.
3. The agent's behavior changes within the next request lifetime (~seconds).
4. Audit logs record the flag change with timestamp + author.
5. Postmortem evaluates whether the flag flip needs to become permanent (code-level removal).
Time from regulatory order to live behavior change: <1 hour. This is the harness paying off in regulatory speed.
Intuition gained
- Compliance is data, not branches. Per-jurisdiction policy in capability flags; agent reads it.
- Three pillars all touched, but unevenly. LLD (flags, contracts) does the heavy lifting; eval (rubric extension) and harness (state, audit) follow.
- Audit-ready event log is the regulatory speed advantage; build it before you need it.
- Right-to-deletion is per-store, and the harness must enumerate all stores. Cache, state, events, S3.
- Capability-flag flip = regulatory hot-fix. Practice the runbook.
See also
- 06-tool-count-explodes.md — registry/contract scaling pattern
- 08-new-locale-launch.md — same per-jurisdiction abstraction
- User stories 02, 04, 08