Constraint Scenario 08 — New Locale Launched in 2 Weeks

Trigger: Brazil (pt-BR) launch greenlit; manga publishing partner inks deal; need MangaAssist live in Portuguese with localized catalog, locale-appropriate recommendations, BR payment flows, and LATAM-region serving in 2 weeks. Pillars stressed: All three. Tests every harness piece simultaneously.


TL;DR

A new locale launch is the integration test for the entire harness. Catalog, language, evals, capability flags, model-output style, region serving, compliance, payment flows — all must turn on together. The 2-week timeline is unforgiving; only teams with pre-built per-locale extensibility can hit it. The harness pieces (skill contracts with per-locale fields, eval rubric per cell, capability flags per jurisdiction, region-aware serving) make this feasible. The risky parts are content and ground-truth, not engineering.


The change

Dimension | Before | After (T+14d)
Supported locales | 18 | 19 (+pt-BR)
Catalog rows for BR | 0 | 80K manga titles + 40K manhwa
Eval set for pt-BR | 0 | 500 hand-curated + 200 adversarial
Region serving | NA, EU, APAC | + LATAM (sa-east-1)
Payment flows | 12 supported | + Brazilian methods (PIX, boleto)
Compliance | LGPD-applicable | LGPD-required
Marketing translations | n/a | UI + agent voice + transactional

The non-engineering parts (content, payment, marketing) are the long pole. Engineering enables them.


The cascade — what has to land

flowchart TB
  Decision[Brazil launch decision] --> CAT[Catalog data load - pt-BR titles]
  Decision --> LANG[Language model behavior in pt-BR]
  Decision --> EVAL[pt-BR eval set]
  Decision --> CFG[Capability flags - pt-BR jurisdiction]
  Decision --> REG[Regional serving capacity]
  Decision --> PAY[Payment integration]
  Decision --> CMP[LGPD compliance]
  Decision --> MKT[Translated UI strings]

  CAT --> Catalog[Catalog-search MCP - pt-BR index]
  LANG --> Prompts[Locale-aware prompts]
  EVAL --> Drift[Drift detection per locale]
  CFG --> Policy[Per-jurisdiction policy]
  REG --> Latency[Sub-1.4s p95 in BR]
  PAY --> Order[Order-inventory MCP]
  CMP --> Audit[Audit endpoint covers BR]

The 14-day plan

gantt
    title 14-day Brazil launch
    dateFormat YYYY-MM-DD

    section Days 1-3
    Catalog data ingestion + index build  :a1, 2026-04-29, 3d
    Capability flag schema + jurisdiction  :a2, 2026-04-29, 2d
    Per-locale prompt scaffolding         :a3, 2026-04-29, 3d
    LATAM region serving capacity         :a4, 2026-04-29, 3d

    section Days 4-7
    pt-BR eval set curation               :b1, 2026-05-02, 5d
    Locale-aware sub-agent prompts        :b2, 2026-05-02, 4d
    Payment integration (PIX, boleto)     :b3, 2026-05-02, 5d
    LGPD compliance checklist             :b4, 2026-05-02, 5d

    section Days 8-11
    Replay against pt-BR eval (offline)   :c1, 2026-05-06, 3d
    Shadow eval (synthetic pt-BR traffic) :c2, 2026-05-07, 4d
    UI translations + agent voice         :c3, 2026-05-06, 5d
    Internal smoke testing                :c4, 2026-05-07, 4d

    section Days 12-14
    5pct canary in BR (real users)        :d1, 2026-05-10, 2d
    25 -> 100pct rollout                   :d2, 2026-05-12, 2d
    Live monitoring                       :d3, 2026-05-13, 2d

What's per-locale and what's shared

Per-locale (must be built/curated for pt-BR)

  • Catalog index (titles in Portuguese, BR-specific availability flags).
  • Eval set (hand-curated questions, adversarial probes, domain-specific names).
  • UI strings (~500 strings).
  • Agent voice samples for compose-style guidance (a few hundred examples).
  • Cultural-sensitivity rules (jurisdiction-specific safety rules).
  • Locale-specific tool behaviors (e.g., trending-MCP weighted to BR releases).

Shared (no change required)

  • Graph topology.
  • Skill contracts (input/output schemas; locale is a parameter — see the sketch after this list).
  • Sub-agent definitions (locale is in the input schema).
  • Eval rubric structure (dimensions are universal; baselines per cell).
  • Observability schema.
  • Pause/resume + checkpoint logic.
  • Quota manager.

The ratio shows the harness paying off — far more is shared than per-locale.
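
As one illustration of that split, a skill contract can treat locale as an ordinary input field, so pt-BR requires no schema change. Below is a minimal Python sketch under that assumption — the field names and types are illustrative, not the actual contract definitions:

from dataclasses import dataclass, field

@dataclass
class CatalogSearchInput:
    # Shared contract: identical for every locale.
    query: str
    locale: str = "en-US"          # pt-BR is just a new value, not a new field
    max_results: int = 10
    filters: dict = field(default_factory=dict)

@dataclass
class CatalogSearchOutput:
    title_ids: list[str]
    # Per-locale behavior (which index, which embedding model) is resolved
    # from config keyed by `locale`, not hard-coded in the contract.
    locale: str = "en-US"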


Per-locale extensibility points (already built)

# what changes for pt-BR
catalog_search:
  config:
    pt-BR:
      index: catalog-pt-br-v1
      embedding_model: multilingual-e5-large
      bm25_field: title_pt + title_original
      reranker: same-as-shared

planner_prompt:
  locale_overrides:
    pt-BR:
      style: "informal but respectful, regional idioms welcome"
      currency: "BRL"
      date_format: "DD/MM/YYYY"
      adult_terminology_handling: "explicit; LGPD-compliant"

capability_flags:
  jurisdictions:
    BR:
      lgpd_compliant: true
      ai_disclosure_required: true
      payment_methods: [PIX, boleto, credit_card]
      currency: BRL

eval_rubric:
  baselines:
    pt-BR:
      aggregate_target: 0.78  # slightly lower at launch; ramps with maturity
      per_dim_targets: { factual_grounding: 0.85, locale_appropriateness: 0.92 }

serving:
  regions:
    LATAM:
      primary: sa-east-1
      fallback: us-east-1
      warm_floor_pct: 200  # higher initial buffer for launch volatility

This is all data. No code changes. The harness was designed for this; the change list is a YAML diff.
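
The shared code consumes this data through a straightforward override merge. A sketch of how the harness might resolve effective config for a request, assuming the YAML above has been parsed into dicts (the function and key names are illustrative):

import copy

def resolve_locale_config(shared: dict, locale_overrides: dict, locale: str) -> dict:
    """Merge per-locale overrides on top of shared defaults (shallow, per key)."""
    effective = copy.deepcopy(shared)
    for key, value in locale_overrides.get(locale, {}).items():
        effective[key] = value
    return effective

# Example: planner prompt settings for a pt-BR request.
shared_prompt = {"style": "neutral", "currency": "USD", "date_format": "YYYY-MM-DD"}
overrides = {"pt-BR": {"style": "informal but respectful, regional idioms welcome",
                       "currency": "BRL",
                       "date_format": "DD/MM/YYYY"}}

print(resolve_locale_config(shared_prompt, overrides, "pt-BR"))
# {'style': 'informal but respectful, ...', 'currency': 'BRL', 'date_format': 'DD/MM/YYYY'}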


The risky parts

Risk 1 — Hallucinations in pt-BR

  • LLM is excellent in pt-BR but imperfect; subtle hallucinations more likely than in en-US.
  • Mitigation: the factual_grounding dimension checks claims against tool lookups (story 04); this is independent of language (see the sketch after this list).
  • Pre-launch: 200 adversarial probes in pt-BR for hallucination detection.
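
A minimal sketch of that language-independent grounding check: claimed facts extracted from a response are scored against the catalog tool rather than against the judge model's memory. The claim-extraction and lookup helpers here are hypothetical placeholders:

def score_factual_grounding(response_claims: list[dict], catalog_lookup) -> float:
    """Score 1.0 if every checkable claim matches the catalog, proportionally otherwise.

    `response_claims` items look like {"title_id": "...", "field": "volume_count", "value": 41}.
    `catalog_lookup(title_id, field)` is a thin wrapper over the catalog-search MCP.
    """
    if not response_claims:
        return 1.0  # nothing checkable; other dimensions handle vagueness
    correct = 0
    for claim in response_claims:
        ground_truth = catalog_lookup(claim["title_id"], claim["field"])
        if ground_truth is not None and ground_truth == claim["value"]:
            correct += 1
    return correct / len(response_claims)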

Risk 2 — Catalog data quality

  • New catalog ingestion is fast but quality varies; some titles have wrong genre tags.
  • Mitigation: per-locale catalog QA pre-launch. The top 10K titles are human-verified; the long tail is accepted with data-quality flags. Recommendations preferentially draw from the verified subset for the first month (see the sketch below).
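
A sketch of that verified-subset preference during the launch window, assuming each catalog row carries a data-quality flag from ingestion QA (field names are illustrative):

def rank_candidates(candidates: list[dict], verified_only_window: bool) -> list[dict]:
    """During the launch window, recommend from human-verified titles first."""
    if not verified_only_window:
        return candidates
    verified = [c for c in candidates if c.get("data_quality") == "verified"]
    # Fall back to the long tail only if the verified pool is too thin to fill a slate.
    return verified if len(verified) >= 5 else candidates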

Risk 3 — Cultural mismatches

  • Recommendation prompts written around en-US examples and style may suggest titles that don't fit the BR cultural context.
  • Mitigation: locale-style examples in prompts; eval includes a "cultural appropriateness" dimension.
  • Long-term: BR-specific user-preference signals build over time; cold-start reliance on global signals decreases.

Risk 4 — Latency from new region

  • LATAM users routed through us-east-1 see +120ms vs. local serving.
  • Mitigation: sa-east-1 capacity pre-warmed; cross-region failover in case of regional issue. Bedrock available in sa-east-1 with same model lineup.

Risk 5 — Eval coldness — too few pt-BR turns to detect drift early

  • Mitigation: stratified sampling (story 04) reserves a quota for pt-BR. Even at a 1% sampling rate on initially small BR traffic, a minimum absolute count is enforced (sketched after this list).
  • Synthetic probes provide eval coverage independent of organic traffic.
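
A sketch of the absolute-count floor layered on top of percentage sampling — the quota numbers are illustrative, not the production values:

import math

def eval_sample_size(locale_daily_turns: int, pct: float = 0.01, floor: int = 150) -> int:
    """Sample pct of a locale's traffic, but never fewer than `floor` turns per day
    (capped at the traffic actually available)."""
    by_pct = math.ceil(locale_daily_turns * pct)
    return min(locale_daily_turns, max(by_pct, floor))

print(eval_sample_size(4_000))    # small launch traffic -> the floor wins: 150
print(eval_sample_size(500_000))  # mature locale -> the 1% rate wins: 5000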

What the harness contributed

Harness piece | Role for new locale
Capability flags / per-jurisdiction (stories 02, 05) | Configuration-only addition
Skill contracts with locale field (story 02) | No code change for tools
Per-cell eval rubric (story 04) | pt-BR baseline declared in config
Stratified sampling (story 04) | Drift detection works from day 1
Provider adapters (story 02) | Bedrock sa-east-1 endpoint is a new adapter config
Region-aware serving (story 06) | LATAM region as a deployment target
Audit endpoint (compliance scenario 05) | LGPD audit ready
Multi-version migration discipline (story 04) | Launch is just another canary

A team without these pieces would be doing 6 weeks of engineering. With them, 2 weeks is feasible because most of the work is content (catalog, eval set, translations) and integrations (payments, LGPD), not code.


Q&A drill — opening question

Q: 2 weeks is aggressive. What's the realistic minimum without cutting corners?

The realistic minimum for full quality parity with en-US is 6-8 weeks (eval set maturation, organic-traffic-driven calibration). What we ship in 2 weeks is a launch-quality experience: functional, compliant, with an eval baseline lower than en-US (0.78 vs 0.81) and an explicit roadmap to close the gap over the next quarter.

The launch quality bar is set by:

  • Above the "embarrassingly bad" threshold for any subset.
  • Compliance is 100% (no compromise).
  • All hard safety checks pass.
  • Soft quality (style, nuance) trails en-US for 1-2 quarters.
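
A sketch of how those gates could be encoded as a single pass/fail check ahead of canary promotion — the thresholds and report fields are illustrative assumptions:

def launch_gate(report: dict, embarrassment_floor: float = 0.60) -> bool:
    """Hard gates: compliance and safety are binary; every eval subset must clear the floor.
    Soft quality (style, nuance) is tracked but does not block launch."""
    if not report["compliance_pass"]:             # LGPD items, AI disclosure, payment flows
        return False
    if not all(report["hard_safety_checks"].values()):
        return False
    return all(score >= embarrassment_floor
               for score in report["per_subset_aggregate"].values())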

This phasing is honest with stakeholders and pragmatic.


Grilling — Round 1

Q1. Eval set of 500 turns is small. How do you trust the launch quality bar?

500 is the minimum for stratified statistical power on the top-level dimensions. We supplement with:

  • Synthetic probes (200) for adversarial coverage.
  • Cross-locale eval — we machine-translate 1,000 en-US eval items to pt-BR and re-judge. Imperfect, but it expands coverage (see the sketch below).
  • First 4 weeks of organic traffic adds ~20K judgments via 1% sampling; the eval set "matures" rapidly post-launch.
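
A sketch of the cross-locale supplement: en-US eval items machine-translated into pt-BR and tagged so their lower trust level is visible downstream. `machine_translate` stands in for whatever translation service is actually used:

def build_translated_eval_items(en_us_items: list[dict], machine_translate) -> list[dict]:
    """Expand pt-BR coverage from the mature en-US set; provenance is tagged so
    judged scores can be weighted below hand-curated items."""
    translated = []
    for item in en_us_items:
        translated.append({
            **item,
            "locale": "pt-BR",
            "prompt": machine_translate(item["prompt"], target="pt-BR"),
            "provenance": "machine_translated",   # hand-curated items omit this tag
        })
    return translated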

Q2. What if the pt-BR launch reveals a critical bug — say, the agent uses an offensive idiom?

Three gates would have to fail:

  • Pre-launch eval (with the cultural-appropriateness dimension) didn't catch it.
  • Synthetic probes didn't cover the input.
  • Active eval in the 5% canary didn't catch it.

If all three fail and the bug ships: capability flag flips off the offending behavior; PR/comms team coordinates; postmortem revises eval rubric to add a new probe.
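
A sketch of that flag-flip path — behaviors consult the per-jurisdiction capability flags at request time, so disabling one is a config change, not a deploy. The flag names here are hypothetical:

CAPABILITY_FLAGS = {
    "BR": {
        "lgpd_compliant": True,
        "ai_disclosure_required": True,
        "regional_idioms_enabled": True,   # the flag we would flip off after an incident
    }
}

def behavior_enabled(jurisdiction: str, flag: str) -> bool:
    """Default to disabled if the flag is missing — fail closed for new jurisdictions."""
    return CAPABILITY_FLAGS.get(jurisdiction, {}).get(flag, False)

if behavior_enabled("BR", "regional_idioms_enabled"):
    pass  # allow idiomatic phrasing in compose-style guidance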

The gates make catastrophic launch failures unlikely but not impossible. We accept some residual risk because perfect pre-launch eval is impossible on a 2-week timeline.

Q3. Catalog data quality varies; some recommendations cite wrong volume counts. How handled?

This is the hallucination-prevention pattern (POC-to-production war story #3). The recommendation flow uses tool lookup for factual grounding (catalog-search MCP). The agent does not free-form generate "Berserk has 41 volumes" — it tool-calls for the count. If the catalog has wrong data, that's a catalog-data bug, not an agent bug. Catalog ingestion has its own QA gate; ongoing data corrections are a steady-state team responsibility.
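
A sketch of that pattern: the count comes from a catalog-search MCP call, and the agent only formats it. `catalog_client` and its method are placeholders for the real MCP interface:

def volume_count_sentence(title_id: str, title_name: str, catalog_client) -> str:
    """Never generate factual counts from the model; look them up."""
    record = catalog_client.get_title(title_id)   # catalog-search MCP call (placeholder)
    if record is None or record.get("volume_count") is None:
        return f"I couldn't confirm the volume count for {title_name}."
    return f"{title_name} has {record['volume_count']} volumes."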


Grilling — Round 2 (architect-level)

Q4. How does this scale to launching 5 locales in parallel (e.g., LATAM expansion)?

The harness pieces are reusable. The scarce resources are:

  • Eval set curation — needs native speakers; roughly 5-7 days of curation work per locale.
  • Catalog data acquisition — varies by partner; some markets take weeks, others months.
  • Compliance review — locale-specific; timing varies.
  • Translation and cultural review — locale-specific.

These are content and legal work, not engineering. From an engineering standpoint we could launch 5 locales simultaneously; the bottleneck would be content-team capacity. Realistic parallelism: 2-3 locales per quarter.

Q5. What's the cost shape of a new locale post-launch?

  • Initial overhead: catalog index storage, eval set hosting, additional region capacity. ~$15K/month per locale.
  • Marginal cost per query: same as en-US.
  • Long-term overhead: per-locale eval refresh, content updates. ~1 FTE per 5 locales.

Locale economics start positive once a locale reaches ~50K MAU. BR is projected to hit that in week 6.

Q6. What architectural debt do you accept by launching at 2 weeks vs 6 weeks?

Three explicit debts:

  • Eval coverage is shallow — it needs to mature over 1-2 quarters before per-locale drift detection is as sensitive as en-US.
  • Cold-start recommendation quality is lower — cross-locale popularity transfer is imperfect, and BR-specific signals build slowly.
  • Cultural sub-rules are based on an initial native-speaker review — expect revisions as real traffic surfaces edge cases.

We track these debts as items on the post-launch roadmap. Each has an owner and a quarterly checkpoint.


Intuition gained

  • A new locale is an integration test for the entire harness. Pieces that aren't extensible per-locale become the long pole.
  • Most of the work is content (catalog, eval set, translations), not code. The harness pays off by minimizing code work.
  • Per-cell eval baselines allow an honest launch with a lower-than-mature quality bar.
  • 2-week launches are content-bottlenecked, not engineering-bottlenecked, when the harness is solid.
  • Architectural debt taken at launch is real and must be tracked explicitly, not hidden.

See also

  • 05-new-compliance-mandate.md — per-jurisdiction policy in capability flags
  • 06-tool-count-explodes.md — registry/contract scaling pattern
  • User stories 02, 04, 08