Constraint Scenario 05 — New Compliance Mandate Mid-Flight
Trigger: Mid-quarter, EU regulator publishes a new directive: AI-generated recommendations must include explicit disclosure, age-gating must be re-verified per session for adult content, and there's a new "right-to-deletion" SLA for any user data used in agent context. Compliance deadline: 90 days. Pillars stressed: P3 (capability flags + LLD) primarily; P1 (safety + eval) and P2 (state lifecycle) secondary.
TL;DR
Compliance changes are the constraint AI teams underestimate the most. The architecture has to express per-jurisdiction policy as data, not as branches in code. The harness pieces that survive contact with this — capability flags, scoped state, eval framework with new dimensions, audit-ready event log — are the same pieces that earned their keep in earlier scenarios. The cost is mostly engineering attention + ~6% latency budget for added safety steps.
The change
Three independent requirements landed together:
| Requirement | What it demands |
|---|---|
| AI disclosure | Every AI-generated recommendation must include "AI-generated, may contain errors" or equivalent, in user's locale |
| Per-session age re-verification | Adult-tier content gating refreshed every session, not once-per-account |
| Right-to-deletion SLA | Users can demand erasure of all their data used in agent context; 30-day SLA; deletions must be auditable |
Each is small in isolation. Together, they touch every layer.
The cascade — what changes where
```mermaid
flowchart TB
    Reg[Regulator directive] --> Disc[AI disclosure rule]
    Reg --> Age[Per-session age verification]
    Reg --> RTD[Right-to-deletion]
    Disc --> Compose[Compose node: append disclosure]
    Disc --> Eval[Eval rubric: disclosure-present dimension]
    Disc --> Locale[Per-locale phrasing]
    Age --> Capability[Capability flag: adult_session]
    Age --> SafetyPre[Safety pre-check enforces flag]
    Age --> SessionState[Session state schema gains verification timestamp]
    RTD --> StateSchema[State row + S3 retention extended with user-id index]
    RTD --> EventLog[Event log redaction tooling: per-user purge]
    RTD --> Audit[Audit endpoint: prove deletion]
    RTD --> CacheKey[Inference cache key includes user-deletion-epoch]
    Eval --> RegressionTest[New regression test in offline set]
    EventLog --> Glue[ETL adds purge job]
```
The sprawl is real. The harness contains it.
Where each piece lands in the pillars
Pillar 1 — Workflow / safety / eval changes
Compose node appends a per-locale disclosure when the response is AI-generated. The disclosure text is itself a versioned skill output (so legal can update it without a code change).
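A minimal sketch of that compose-node step. The `DISCLOSURE` table and function name are illustrative placeholders for the versioned skill output, and the non-English phrasing is not legal-approved text:

```python
# Hypothetical compose-node step: append the per-locale disclosure to
# AI-generated responses. In production the strings come from the
# versioned disclosure skill, not a hardcoded dict.
DISCLOSURE = {
    "en-GB": "AI-generated, may contain errors.",
    "de-DE": "KI-generiert, kann Fehler enthalten.",  # illustrative, not legal-approved
}

def append_disclosure(text: str, locale: str, is_ai_generated: bool,
                      require_disclosure: bool) -> str:
    # Only AI-generated content in jurisdictions that require it gets the note.
    if is_ai_generated and require_disclosure:
        note = DISCLOSURE.get(locale, DISCLOSURE["en-GB"])
        return f"{text}\n\n{note}"
    return text
```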
Safety pre-check gains a hard rule: if adult_session_verified_at < session_start_time, deny adult-tier content; redirect to age verification flow.
Eval rubric gains two new dimensions:
- disclosure_present (binary, hard fail if missing on AI-gen content).
- age_gate_respected (binary, hard fail if adult content shown without verified session).
Per-locale eval cells gain regression tests for the disclosure phrasing.
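The two new dimensions are conditional checks; a sketch over a flat per-turn record (field names are assumptions, not the real rubric schema):

```python
from dataclasses import dataclass

@dataclass
class TurnRecord:
    is_ai_generated: bool
    has_disclosure: bool
    adult_content_shown: bool
    session_age_verified: bool

def disclosure_present(r: TurnRecord) -> bool:
    # Hard fail: AI-generated content without the disclosure string.
    return r.has_disclosure if r.is_ai_generated else True

def age_gate_respected(r: TurnRecord) -> bool:
    # Hard fail: adult content shown without a session-fresh verification.
    return r.session_age_verified if r.adult_content_shown else True
```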
Pillar 2 — Harness changes
Session state gains an adult_verified_at timestamp, scoped per session, not per account. The pause/resume layer (story 05) already had session-scoped state — this is a field addition, not an architecture change.
Right-to-deletion is the messy one. The state row, S3 context, event log, and inference cache all hold user-derivable bits. We add:
- A user-id index on the state-row table for fast lookup (DDB GSI).
- An event-log purge job in Glue that runs daily and removes records by user_id_hash.
- A cache invalidation epoch per user — the cache key includes user_deletion_epoch; bumping the epoch invalidates all cached entries derived from that user.
- An audit endpoint that, given a user_id, returns "all data purged" with cryptographic proof (Merkle proofs over the deletion log).
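The epoch-bump mechanism can be sketched as follows; the in-memory epoch store and function names are assumptions (production would back the epoch with DDB, and stale entries age out via normal TTL):

```python
import hashlib

_deletion_epoch: dict[str, int] = {}  # user_id -> epoch counter (in-memory stand-in)

def bump_deletion_epoch(user_id: str) -> None:
    # Called on a right-to-deletion request. All prior cache keys for this
    # user become unreachable, so their entries expire without a full flush.
    _deletion_epoch[user_id] = _deletion_epoch.get(user_id, 0) + 1

def make_cache_key(user_id: str, prompt: str) -> str:
    # The epoch is part of the key, so bumping it invalidates every
    # cached entry derived from this user's data.
    epoch = _deletion_epoch.get(user_id, 0)
    raw = f"{user_id}:{epoch}:{prompt}"
    return hashlib.sha256(raw.encode()).hexdigest()
```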
Pillar 3 — LLD changes
Capability flags gain a per-jurisdiction policy:
```yaml
jurisdictions:
  EU:
    require_ai_disclosure: true
    age_verification_per_session: true
    rtd_sla_days: 30
  US:
    require_ai_disclosure: false
    age_verification_per_session: false
    rtd_sla_days: 30 # CCPA equivalent
  JP:
    ...
```
Skills gain a respects_jurisdiction_policy field and declare which policy fields they read.
The agent reads user.jurisdiction at the start of every turn and applies the policy. No code branch on country; only data lookup.
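A sketch of that data lookup, mirroring the YAML above in a plain dict (the fallback defaults for unlisted jurisdictions are an assumption):

```python
# Policy-as-data: the agent fetches policy fields per jurisdiction;
# there is no `if country == ...` branch anywhere in agent code.
JURISDICTION_POLICY = {
    "EU": {"require_ai_disclosure": True,
           "age_verification_per_session": True,
           "rtd_sla_days": 30},
    "US": {"require_ai_disclosure": False,
           "age_verification_per_session": False,
           "rtd_sla_days": 30},  # CCPA equivalent
}

DEFAULTS = {"require_ai_disclosure": False,
            "age_verification_per_session": False,
            "rtd_sla_days": 30}

def policy_for(jurisdiction: str) -> dict:
    # Unlisted jurisdictions fall back to the defaults above.
    return {**DEFAULTS, **JURISDICTION_POLICY.get(jurisdiction, {})}
```

Adding a jurisdiction (or a test jurisdiction, as in Q4 below) is a config change, not a code change.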
The 90-day rollout
```mermaid
gantt
    title 90-day compliance rollout
    dateFormat YYYY-MM-DD
    section Weeks 1-2
    Legal scoping + spec finalization :a1, 2026-04-29, 14d
    section Weeks 3-5
    Eval rubric extension :b1, 2026-05-13, 14d
    Disclosure skill + per-locale text :b2, 2026-05-13, 14d
    Capability flag schema update :b3, 2026-05-13, 7d
    Age-verification flow design :b4, 2026-05-13, 14d
    section Weeks 6-8
    State row schema migration :c1, 2026-06-03, 14d
    User-id index for RTD :c2, 2026-06-03, 14d
    Event-log purge job :c3, 2026-06-03, 14d
    Cache key with deletion epoch :c4, 2026-06-03, 7d
    Age-verification UX :c5, 2026-06-03, 14d
    Audit endpoint :c6, 2026-06-10, 14d
    section Weeks 9-11
    EU canary 5pct -> 100pct :d1, 2026-06-24, 21d
    Other jurisdictions hold on US :d2, 2026-06-24, 21d
    section Week 12
    Final audit + signoff :e1, 2026-07-15, 7d
```
Two-track design: EU-specific changes happen first under EU capability flag; US/JP/etc. unaffected. Other jurisdictions roll in later as their own laws catch up.
What goes wrong
Risk 1 — Disclosure phrasing varies in legal acceptability across EU member states
- Mitigation: country-level granularity (not just "EU"). Disclosure-skill is per-locale-per-country.
Risk 2 — Age-verification UX adds friction; conversion drops
- Mitigation: re-verification triggers only for adult-content sessions; non-adult content is unchanged. The verified status is cached for 24 hours within the session.
Risk 3 — RTD purge job has bugs; user data persists past deadline
- Mitigation: end-to-end test that actually creates a synthetic user, populates data, requests deletion, and verifies absence across all stores. Runs nightly. SLA dashboard shows oldest-pending-RTD; alarm at 25 days.
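A condensed sketch of that end-to-end test, with a `FakeStore` standing in for the real stores (state row, S3 context, event log, cache); names and methods are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class FakeStore:
    # Stand-in for one data store that can hold user-derivable records.
    name: str
    rows: dict = field(default_factory=dict)

    def write(self, user_id: str, payload: str) -> None:
        self.rows[user_id] = payload

    def purge(self, user_id: str) -> None:
        self.rows.pop(user_id, None)

    def contains_user(self, user_id: str) -> bool:
        return user_id in self.rows

def run_rtd_e2e(stores: list[FakeStore]) -> None:
    # Create a synthetic user, populate every store, request deletion,
    # then verify absence everywhere. A single leftover fails the run.
    user_id = "synthetic-rtd-probe"
    for s in stores:
        s.write(user_id, "canary")
    for s in stores:  # stands in for the real purge pipeline
        s.purge(user_id)
    leftovers = [s.name for s in stores if s.contains_user(user_id)]
    assert not leftovers, f"RTD purge incomplete in: {leftovers}"
```

The key property is that the test enumerates the stores explicitly, so a newly added store that isn't wired into the purge pipeline shows up as a failure.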
Risk 4 — Inference cache poisoning during epoch bump
- Mitigation: epoch-bump is online; cache misses gracefully (just slower for that user, ~3 turns until natural re-warm).
Risk 5 — Audit endpoint is queried adversarially (regulator audits 1000 users at once)
- Mitigation: audit endpoint is rate-limited and authenticated; can answer ~10K/hour without affecting prod traffic. We've load-tested it.
What the harness contributed
| Harness piece | Without it, this scenario costs |
|---|---|
| Capability flags (story 02) | 3-4 weeks more work to introduce per-jurisdiction branching |
| Eval rubric extensibility (story 04) | 2 weeks to retrofit dimensions; ongoing fragility |
| Versioned events (story 08) | Audit endpoint impossible to build without per-record provenance |
| Skill contracts (story 02) | Disclosure skill would have been 4 different copies |
| State scoping (story 05) | Per-session vs per-account distinction would have required schema rewrite |
| Inference cache key composition (story 06) | Cache invalidation per user would have required full flush |
The compliance scenario is where the LLD investment shows up most plainly. Teams that under-invested in capability flags and skill contracts re-architect; teams that invested re-configure.
Q&A drill — opening question
*Q: Can't we just add an "if EU, then..." in the prompt? Easier than all this.*
Three reasons not to:
1. Auditability. A regulator asking "show me you applied this rule on date X" needs a per-record trace, not "the prompt said to." Capability flags + event log give that; prompt rules don't.
2. Per-prompt drift. Prompts get edited by many teams. A compliance rule in a prompt is fragile; one rewrite drops it silently.
3. Per-skill enforcement. Some skills (e.g., recommendation explainer) need the rule; others don't. Prompt-level rules can't differentiate.
The compliance rule is data, not prose. The harness treats it as data.
Grilling — Round 1
Q1. Right-to-deletion across the inference cache — couldn't you just not cache anything user-specific?
Two reasons we still cache:
- The cache hit rate of 38% is worth ~$3M/month. Killing it for the 0.1% of users who'll request RTD is overkill.
- The cache holds outputs derived from user inputs, but the storage is hashed. Per-user deletion via epoch bump is the surgical option.
Q2. Age verification per session — does that mean every session asks the user to re-prove age?
Only if they request adult content. Default sessions don't trigger verification. The capability flag adult_content_requested is set only when the planner intends to surface adult content; the safety pre-check then enforces "session has fresh verification."
If user says "show me Berserk" (mature manga), we ask for age verification once. Subsequent requests for adult content within the session are allowed without re-asking. New session restarts the check.
Q3. The audit endpoint — how do you produce cryptographic proof of deletion?
We maintain a Merkle-tree-rooted log of deletion events. Each log entry is (user_id_hash, store_id, deletion_timestamp, witness_hash). The audit endpoint returns a proof path from a deletion entry to the public Merkle root. A regulator can verify the entry was logged at the claimed time and that all stores reported successful deletion.
The proof system is overkill for low-trust environments and necessary for high-trust regulatory environments. We err toward overkill given the EU regulator context.
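A toy sketch of the inclusion-proof mechanics over a deletion log. A real deployment would add domain separation between leaf and interior hashes (RFC 6962-style); this version shows only the proof-path idea:

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    level = [h(l) for l in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node to keep pairs
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def proof_path(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    # Sibling hashes from leaf to root; the bool marks "sibling is on the right".
    level = [h(l) for l in leaves]
    path = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        path.append((level[index ^ 1], index % 2 == 0))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return path

def verify(leaf: bytes, path: list[tuple[bytes, bool]], root: bytes) -> bool:
    # A verifier (e.g., the regulator) recomputes the root from the
    # claimed deletion entry and the proof path.
    node = h(leaf)
    for sibling, sibling_on_right in path:
        node = h(node + sibling) if sibling_on_right else h(sibling + node)
    return node == root
```

The audit endpoint serves `proof_path` for a deletion entry; anyone holding the published root can run `verify` without trusting the endpoint.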
Grilling — Round 2 (architect-level)
Q4. How does compliance interact with the active eval framework? Some adversarial probes (story 04) might violate the new rules in a sandboxed way.
Probes are scoped: they run on a synthetic user with a jurisdiction: TEST flag. The capability-flag system understands this jurisdiction and applies relaxed rules. Probes can't accidentally trigger compliance flagging; production data can't accidentally leak into the probe pipeline.
This is one of those LLD details that pays off — the jurisdiction-as-data design lets us add new "test" jurisdictions trivially.
Q5. A new EU sub-rule requires "user must opt-in to AI-generated content explicitly." How does that flow?
It's a new capability flag: ai_content_consent: bool. The agent's Plan node, on jurisdiction == EU AND !ai_content_consent, routes to a deterministic "consent-prompt" sub-flow. No AI-generated content until consent.
We extend the user-state schema to hold ai_content_consent_at. Audit endpoint adds this field. Disclosure skill adds an "opt-in needed" branch. ~3 days of work because the machinery is there.
Q6. Walk through what happens if a regulator demands a feature be turned off for EU users tomorrow.
Capability flag flip. We've practiced this — there's a runbook. Steps:
1. Legal/policy team decides which capability needs to flip.
2. Engineering pushes a config change (no code change). Flag flips for the EU jurisdiction.
3. The agent's behavior changes within the next request lifetime (~seconds).
4. Audit logs record the flag change with timestamp + author.
5. Postmortem evaluates whether the flag flip needs to become permanent (code-level removal).
Time from regulatory order to live behavior change: <1 hour. This is the harness paying off in regulatory speed.
Intuition gained
- Compliance is data, not branches. Per-jurisdiction policy in capability flags; agent reads it.
- Three pillars all touched, but unevenly. LLD (flags, contracts) does the heavy lifting; eval (rubric extension) and harness (state, audit) follow.
- Audit-ready event log is the regulatory speed advantage; build it before you need it.
- Right-to-deletion is per-store, and the harness must enumerate all stores. Cache, state, events, S3.
- Capability-flag flip = regulatory hot-fix. Practice the runbook.
See also
- 06-tool-count-explodes.md — registry/contract scaling pattern
- 08-new-locale-launch.md — same per-jurisdiction abstraction
- User stories 02, 04, 08