Constraint Scenario 08 — New Locale Launched in 2 Weeks

Trigger: Brazil (pt-BR) launch greenlit; manga publishing partner inks deal; need MangaAssist live in Portuguese with localized catalog, locale-appropriate recommendations, BR payment flows, and LATAM-region serving in 2 weeks. Pillars stressed: All three. Tests every harness piece simultaneously.


TL;DR

A new locale launch is the integration test for the entire harness. Catalog, language, evals, capability flags, model-output style, region serving, compliance, payment flows — all must turn on together. The 2-week timeline is unforgiving; only teams with pre-built per-locale extensibility can hit it. The harness pieces (skill contracts with per-locale fields, eval rubric per cell, capability flags per jurisdiction, region-aware serving) make this feasible. The risky parts are content and ground-truth, not engineering.


The change

Dimension | Before | After (T+14d)
Supported locales | 18 | 19 (+pt-BR)
Catalog rows for BR | 0 | 80K manga titles + 40K manhwa
Eval set for pt-BR | 0 | 500 hand-curated + 200 adversarial
Region serving | NA, EU, APAC | + LATAM (sa-east-1)
Payment flows | 12 supported | + Brazilian methods (PIX, boleto)
Compliance | LGPD-applicable | LGPD-required
Marketing translations | n/a | UI + agent voice + transactional

The non-engineering parts (content, payment, marketing) are the long pole. Engineering enables them.


The cascade — what has to land

flowchart TB
  Decision[Brazil launch decision] --> CAT[Catalog data load - pt-BR titles]
  Decision --> LANG[Language model behavior in pt-BR]
  Decision --> EVAL[pt-BR eval set]
  Decision --> CFG[Capability flags - pt-BR jurisdiction]
  Decision --> REG[Regional serving capacity]
  Decision --> PAY[Payment integration]
  Decision --> CMP[LGPD compliance]
  Decision --> MKT[Translated UI strings]

  CAT --> Catalog[Catalog-search MCP - pt-BR index]
  LANG --> Prompts[Locale-aware prompts]
  EVAL --> Drift[Drift detection per locale]
  CFG --> Policy[Per-jurisdiction policy]
  REG --> Latency[Sub-1.4s p95 in BR]
  PAY --> Order[Order-inventory MCP]
  CMP --> Audit[Audit endpoint covers BR]

The 14-day plan

gantt
    title 14-day Brazil launch
    dateFormat YYYY-MM-DD

    section Days 1-3
    Catalog data ingestion + index build  :a1, 2026-04-29, 3d
    Capability flag schema + jurisdiction  :a2, 2026-04-29, 2d
    Per-locale prompt scaffolding         :a3, 2026-04-29, 3d
    LATAM region serving capacity         :a4, 2026-04-29, 3d

    section Days 4-7
    pt-BR eval set curation               :b1, 2026-05-02, 5d
    Locale-aware sub-agent prompts        :b2, 2026-05-02, 4d
    Payment integration (PIX, boleto)     :b3, 2026-05-02, 5d
    LGPD compliance checklist             :b4, 2026-05-02, 5d

    section Days 8-11
    Replay against pt-BR eval (offline)   :c1, 2026-05-06, 3d
    Shadow eval (synthetic pt-BR traffic) :c2, 2026-05-07, 4d
    UI translations + agent voice         :c3, 2026-05-06, 5d
    Internal smoke testing                :c4, 2026-05-07, 4d

    section Days 12-14
    5pct canary in BR (real users)        :d1, 2026-05-10, 2d
    25 -> 100pct rollout                   :d2, 2026-05-12, 2d
    Live monitoring                       :d3, 2026-05-13, 2d

What's per-locale and what's shared

Per-locale (must be built/curated for pt-BR)

  • Catalog index (titles in Portuguese, BR-specific availability flags).
  • Eval set (hand-curated questions, adversarial probes, domain-specific names).
  • UI strings (~500 strings).
  • Agent voice samples for compose-style guidance (a few hundred examples).
  • Cultural-sensitivity rules (jurisdiction-specific safety rules).
  • Locale-specific tool behaviors (e.g., trending-MCP weighted to BR releases).

Shared (no change required)

  • Graph topology.
  • Skill contracts (input/output schemas; locale is a parameter — see the sketch after this list).
  • Sub-agent definitions (locale is in the input schema).
  • Eval rubric structure (dimensions are universal; baselines per cell).
  • Observability schema.
  • Pause/resume + checkpoint logic.
  • Quota manager.

The ratio shows the harness paying off — far more is shared than per-locale.
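
As one illustration of that split, a skill contract can treat locale as an ordinary input field, so pt-BR requires no schema change. Below is a minimal Python sketch under that assumption — the field names and types are illustrative, not the actual contract definitions:

from dataclasses import dataclass, field

@dataclass
class CatalogSearchInput:
    # Shared contract: identical for every locale.
    query: str
    locale: str = "en-US"          # pt-BR is just a new value, not a new field
    max_results: int = 10
    filters: dict = field(default_factory=dict)

@dataclass
class CatalogSearchOutput:
    title_ids: list[str]
    # Per-locale behavior (which index, which embedding model) is resolved
    # from config keyed by `locale`, not hard-coded in the contract.
    locale: str = "en-US"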


Per-locale extensibility points (already built)

# what changes for pt-BR
catalog_search:
  config:
    pt-BR:
      index: catalog-pt-br-v1
      embedding_model: multilingual-e5-large
      bm25_field: title_pt + title_original
      reranker: same-as-shared

planner_prompt:
  locale_overrides:
    pt-BR:
      style: "informal but respectful, regional idioms welcome"
      currency: "BRL"
      date_format: "DD/MM/YYYY"
      adult_terminology_handling: "explicit; LGPD-compliant"

capability_flags:
  jurisdictions:
    BR:
      lgpd_compliant: true
      ai_disclosure_required: true
      payment_methods: [PIX, boleto, credit_card]
      currency: BRL

eval_rubric:
  baselines:
    pt-BR:
      aggregate_target: 0.78  # slightly lower at launch; ramps with maturity
      per_dim_targets: { factual_grounding: 0.85, locale_appropriateness: 0.92 }

serving:
  regions:
    LATAM:
      primary: sa-east-1
      fallback: us-east-1
      warm_floor_pct: 200  # higher initial buffer for launch volatility

This is all data. No code changes. The harness was designed for this; the change list is a YAML diff.
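
The shared code consumes this data through a straightforward override merge. A sketch of how the harness might resolve effective config for a request, assuming the YAML above has been parsed into dicts (the function and key names are illustrative):

import copy

def resolve_locale_config(shared: dict, locale_overrides: dict, locale: str) -> dict:
    """Merge per-locale overrides on top of shared defaults (shallow, per key)."""
    effective = copy.deepcopy(shared)
    for key, value in locale_overrides.get(locale, {}).items():
        effective[key] = value
    return effective

# Example: planner prompt settings for a pt-BR request.
shared_prompt = {"style": "neutral", "currency": "USD", "date_format": "YYYY-MM-DD"}
overrides = {"pt-BR": {"style": "informal but respectful, regional idioms welcome",
                       "currency": "BRL",
                       "date_format": "DD/MM/YYYY"}}

print(resolve_locale_config(shared_prompt, overrides, "pt-BR"))
# {'style': 'informal but respectful, ...', 'currency': 'BRL', 'date_format': 'DD/MM/YYYY'}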


The risky parts

Risk 1 — Hallucinations in pt-BR

  • LLM is excellent in pt-BR but imperfect; subtle hallucinations more likely than in en-US.
  • Mitigation: the factual_grounding dimension checks claims against tool lookups (story 04); this is independent of language (see the sketch after this list).
  • Pre-launch: 200 adversarial probes in pt-BR for hallucination detection.
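
A minimal sketch of that language-independent grounding check: claimed facts extracted from a response are scored against the catalog tool rather than against the judge model's memory. The claim-extraction and lookup helpers here are hypothetical placeholders:

def score_factual_grounding(response_claims: list[dict], catalog_lookup) -> float:
    """Score 1.0 if every checkable claim matches the catalog, proportionally otherwise.

    `response_claims` items look like {"title_id": "...", "field": "volume_count", "value": 41}.
    `catalog_lookup(title_id, field)` is a thin wrapper over the catalog-search MCP.
    """
    if not response_claims:
        return 1.0  # nothing checkable; other dimensions handle vagueness
    correct = 0
    for claim in response_claims:
        ground_truth = catalog_lookup(claim["title_id"], claim["field"])
        if ground_truth is not None and ground_truth == claim["value"]:
            correct += 1
    return correct / len(response_claims)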

Risk 2 — Catalog data quality

  • New catalog ingestion is fast but quality varies; some titles have wrong genre tags.
  • Mitigation: per-locale catalog QA pre-launch. The top 10K titles are human-verified; the long tail is accepted with data-quality flags. Recommendations preferentially draw from the verified subset for the first month (see the sketch below).
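
A sketch of that verified-subset preference during the launch window, assuming each catalog row carries a data-quality flag from ingestion QA (field names are illustrative):

def rank_candidates(candidates: list[dict], verified_only_window: bool) -> list[dict]:
    """During the launch window, recommend from human-verified titles first."""
    if not verified_only_window:
        return candidates
    verified = [c for c in candidates if c.get("data_quality") == "verified"]
    # Fall back to the long tail only if the verified pool is too thin to fill a slate.
    return verified if len(verified) >= 5 else candidates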

Risk 3 — Cultural mismatches

  • Recommendation prompts written around en-US examples and style may suggest titles that don't fit the BR cultural context.
  • Mitigation: locale-style examples in prompts; eval includes a "cultural appropriateness" dimension.
  • Long-term: BR-specific user-preference signals build over time; cold-start reliance on global signals decreases.

Risk 4 — Latency from new region

  • LATAM users routed through us-east-1 see +120ms vs. local serving.
  • Mitigation: sa-east-1 capacity pre-warmed; cross-region failover in case of regional issue. Bedrock available in sa-east-1 with same model lineup.

Risk 5 — Eval coldness — too few pt-BR turns to detect drift early

  • Mitigation: stratified sampling (story 04) reserves a quota for pt-BR. Even at a 1% sampling rate on initially small BR traffic, a minimum absolute count is enforced (sketched after this list).
  • Synthetic probes provide eval coverage independent of organic traffic.
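
A sketch of the absolute-count floor layered on top of percentage sampling — the quota numbers are illustrative, not the production values:

import math

def eval_sample_size(locale_daily_turns: int, pct: float = 0.01, floor: int = 150) -> int:
    """Sample pct of a locale's traffic, but never fewer than `floor` turns per day
    (capped at the traffic actually available)."""
    by_pct = math.ceil(locale_daily_turns * pct)
    return min(locale_daily_turns, max(by_pct, floor))

print(eval_sample_size(4_000))    # small launch traffic -> the floor wins: 150
print(eval_sample_size(500_000))  # mature locale -> the 1% rate wins: 5000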

What the harness contributed

Harness piece | Role for new locale
Capability flags / per-jurisdiction (stories 02, 05) | Configuration-only addition
Skill contracts with locale field (story 02) | No code change for tools
Per-cell eval rubric (story 04) | pt-BR baseline declared in config
Stratified sampling (story 04) | Drift detection works from day 1
Provider adapters (story 02) | Bedrock sa-east-1 endpoint is a new adapter config
Region-aware serving (story 06) | LATAM region as a deployment target
Audit endpoint (compliance scenario 05) | LGPD audit ready
Multi-version migration discipline (story 04) | Launch is just another canary

A team without these pieces would be doing 6 weeks of engineering. With them, 2 weeks is feasible because most of the work is content (catalog, eval set, translations) and integrations (payments, LGPD), not code.


Q&A drill — opening question

Q: 2 weeks is aggressive. What's the realistic minimum without cutting corners?

The realistic minimum for full quality parity with en-US is 6-8 weeks (eval set maturation, organic-traffic-driven calibration). What we ship in 2 weeks is a launch-quality experience: functional, compliant, with an eval baseline lower than en-US (0.78 vs 0.81) and an explicit roadmap to close the gap over the next quarter.

The launch quality bar is set by:

  • Above the "embarrassingly bad" threshold for any subset.
  • Compliance is 100% (no compromise).
  • All hard safety checks pass.
  • Soft quality (style, nuance) trails en-US for 1-2 quarters.
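
A sketch of how those gates could be encoded as a single pass/fail check ahead of canary promotion — the thresholds and report fields are illustrative assumptions:

def launch_gate(report: dict, embarrassment_floor: float = 0.60) -> bool:
    """Hard gates: compliance and safety are binary; every eval subset must clear the floor.
    Soft quality (style, nuance) is tracked but does not block launch."""
    if not report["compliance_pass"]:             # LGPD items, AI disclosure, payment flows
        return False
    if not all(report["hard_safety_checks"].values()):
        return False
    return all(score >= embarrassment_floor
               for score in report["per_subset_aggregate"].values())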

This phasing is honest with stakeholders and pragmatic.


Grilling — Round 1

Q1. Eval set of 500 turns is small. How do you trust the launch quality bar?

500 is the minimum for stratified statistical power on the top-level dimensions. We supplement with:

  • Synthetic probes (200) for adversarial coverage.
  • Cross-locale eval — we machine-translate 1,000 en-US eval items to pt-BR and re-judge. Imperfect, but it expands coverage (see the sketch below).
  • First 4 weeks of organic traffic adds ~20K judgments via 1% sampling; the eval set "matures" rapidly post-launch.
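
A sketch of the cross-locale supplement: en-US eval items machine-translated into pt-BR and tagged so their lower trust level is visible downstream. `machine_translate` stands in for whatever translation service is actually used:

def build_translated_eval_items(en_us_items: list[dict], machine_translate) -> list[dict]:
    """Expand pt-BR coverage from the mature en-US set; provenance is tagged so
    judged scores can be weighted below hand-curated items."""
    translated = []
    for item in en_us_items:
        translated.append({
            **item,
            "locale": "pt-BR",
            "prompt": machine_translate(item["prompt"], target="pt-BR"),
            "provenance": "machine_translated",   # hand-curated items omit this tag
        })
    return translated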

Q2. What if the pt-BR launch reveals a critical bug — say, the agent uses an offensive idiom?

Three gates would have to fail:

  • Pre-launch eval (with the cultural-appropriateness dimension) didn't catch it.
  • Synthetic probes didn't cover the input.
  • Active eval in the 5% canary didn't catch it.

If all three fail and the bug ships: capability flag flips off the offending behavior; PR/comms team coordinates; postmortem revises eval rubric to add a new probe.
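
A sketch of that flag-flip path — behaviors consult the per-jurisdiction capability flags at request time, so disabling one is a config change, not a deploy. The flag names here are hypothetical:

CAPABILITY_FLAGS = {
    "BR": {
        "lgpd_compliant": True,
        "ai_disclosure_required": True,
        "regional_idioms_enabled": True,   # the flag we would flip off after an incident
    }
}

def behavior_enabled(jurisdiction: str, flag: str) -> bool:
    """Default to disabled if the flag is missing — fail closed for new jurisdictions."""
    return CAPABILITY_FLAGS.get(jurisdiction, {}).get(flag, False)

if behavior_enabled("BR", "regional_idioms_enabled"):
    pass  # allow idiomatic phrasing in compose-style guidance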

The gates make catastrophic launch failures unlikely but not impossible. We accept some residual risk because perfect pre-launch eval is impossible on a 2-week timeline.

Q3. Catalog data quality varies; some recommendations cite wrong volume counts. How handled?

This is the hallucination-prevention pattern (POC-to-production war story #3). The recommendation flow uses tool lookup for factual grounding (catalog-search MCP). The agent does not free-form generate "Berserk has 41 volumes" — it tool-calls for the count. If the catalog has wrong data, that's a catalog-data bug, not an agent bug. Catalog ingestion has its own QA gate; ongoing data corrections are a steady-state team responsibility.
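
A sketch of that pattern: the count comes from a catalog-search MCP call, and the agent only formats it. `catalog_client` and its method are placeholders for the real MCP interface:

def volume_count_sentence(title_id: str, title_name: str, catalog_client) -> str:
    """Never generate factual counts from the model; look them up."""
    record = catalog_client.get_title(title_id)   # catalog-search MCP call (placeholder)
    if record is None or record.get("volume_count") is None:
        return f"I couldn't confirm the volume count for {title_name}."
    return f"{title_name} has {record['volume_count']} volumes."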


Grilling — Round 2 (architect-level)

Q4. How does this scale to launching 5 locales in parallel (e.g., LATAM expansion)?

The harness pieces are reusable. The scarce resources are:

  • Eval set curation — needs native speakers; roughly 5-7 days of curation work per locale.
  • Catalog data acquisition — varies by partner; some markets take weeks, others months.
  • Compliance review — locale-specific; timing varies.
  • Translation and cultural review — locale-specific.

These are content and legal work, not engineering. From an engineering standpoint we could launch 5 locales simultaneously; the bottleneck would be content-team capacity. Realistic parallelism: 2-3 locales per quarter.

Q5. What's the cost shape of a new locale post-launch?

  • Initial overhead: catalog index storage, eval set hosting, additional region capacity. ~$15K/month per locale.
  • Marginal cost per query: same as en-US.
  • Long-term overhead: per-locale eval refresh, content updates. ~1 FTE per 5 locales.

Locale economics start positive once a locale reaches ~50K MAU. BR is projected to hit that in week 6.

Q6. What architectural debt do you accept by launching at 2 weeks vs 6 weeks?

Three explicit debts:

  • Eval coverage is shallow — it needs to mature over 1-2 quarters before per-locale drift detection is as sensitive as en-US.
  • Cold-start recommendation quality is lower — cross-locale popularity transfer is imperfect, and BR-specific signals build slowly.
  • Cultural sub-rules are based on an initial native-speaker review — expect revisions as real traffic surfaces edge cases.

We track these debts as items on the post-launch roadmap. Each has an owner and a quarterly checkpoint.


Intuition gained

  • A new locale is an integration test for the entire harness. Pieces that aren't extensible per-locale become the long pole.
  • Most of the work is content (catalog, eval set, translations), not code. The harness pays off by minimizing code work.
  • Per-cell eval baselines allow an honest launch with a lower-than-mature quality bar.
  • 2-week launches are content-bottlenecked, not engineering-bottlenecked, when the harness is solid.
  • Architectural debt taken at launch is real and must be tracked explicitly, not hidden.

See also

  • 05-new-compliance-mandate.md — per-jurisdiction policy in capability flags
  • 06-tool-count-explodes.md — registry/contract scaling pattern
  • User stories 02, 04, 08