Constraint Scenario 08 — New Locale Launched in 2 Weeks
Trigger: Brazil (pt-BR) launch greenlit; manga publishing partner inks deal; need MangaAssist live in Portuguese with localized catalog, locale-appropriate recommendations, BR payment flows, and LATAM-region serving in 2 weeks. Pillars stressed: All three. Tests every harness piece simultaneously.
TL;DR
A new locale launch is the integration test for the entire harness. Catalog, language, evals, capability flags, model-output style, region serving, compliance, payment flows — all must turn on together. The 2-week timeline is unforgiving; only teams with pre-built per-locale extensibility can hit it. The harness pieces (skill contracts with per-locale fields, eval rubric per cell, capability flags per jurisdiction, region-aware serving) make this feasible. The risky parts are content and ground-truth, not engineering.
The change
| Dimension | Before | After (T+14d) |
|---|---|---|
| Supported locales | 18 | 19 (+pt-BR) |
| Catalog rows for BR | 0 | 80K manga titles + 40K manhwa |
| Eval set for pt-BR | 0 | 500 hand-curated + 200 adversarial |
| Region serving | NA, EU, APAC | + LATAM (sa-east-1) |
| Payment flows | 12 supported | + Brazilian methods (PIX, boleto) |
| Compliance | LGPD-applicable | LGPD-required |
| Marketing translations | n/a | UI + agent voice + transactional |
The non-engineering parts (content, payment, marketing) are the long pole. Engineering enables them.
The cascade — what has to land
```mermaid
flowchart TB
    Decision[Brazil launch decision] --> CAT[Catalog data load - pt-BR titles]
    Decision --> LANG[Language model behavior in pt-BR]
    Decision --> EVAL[pt-BR eval set]
    Decision --> CFG[Capability flags - pt-BR jurisdiction]
    Decision --> REG[Regional serving capacity]
    Decision --> PAY[Payment integration]
    Decision --> CMP[LGPD compliance]
    Decision --> MKT[Translated UI strings]
    CAT --> Catalog[Catalog-search MCP - pt-BR index]
    LANG --> Prompts[Locale-aware prompts]
    EVAL --> Drift[Drift detection per locale]
    CFG --> Policy[Per-jurisdiction policy]
    REG --> Latency[Sub-1.4s p95 in BR]
    PAY --> Order[Order-inventory MCP]
    CMP --> Audit[Audit endpoint covers BR]
```
The 14-day plan
```mermaid
gantt
    title 14-day Brazil launch
    dateFormat YYYY-MM-DD
    section Days 1-3
    Catalog data ingestion + index build :a1, 2026-04-29, 3d
    Capability flag schema + jurisdiction :a2, 2026-04-29, 2d
    Per-locale prompt scaffolding :a3, 2026-04-29, 3d
    LATAM region serving capacity :a4, 2026-04-29, 3d
    section Days 4-7
    pt-BR eval set curation :b1, 2026-05-02, 5d
    Locale-aware sub-agent prompts :b2, 2026-05-02, 4d
    Payment integration (PIX, boleto) :b3, 2026-05-02, 5d
    LGPD compliance checklist :b4, 2026-05-02, 5d
    section Days 8-11
    Replay against pt-BR eval (offline) :c1, 2026-05-06, 3d
    Shadow eval (synthetic pt-BR traffic) :c2, 2026-05-07, 4d
    UI translations + agent voice :c3, 2026-05-06, 5d
    Internal smoke testing :c4, 2026-05-07, 4d
    section Days 12-14
    5pct canary in BR (real users) :d1, 2026-05-10, 2d
    25 -> 100pct rollout :d2, 2026-05-12, 2d
    Live monitoring :d3, 2026-05-13, 2d
```
What's per-locale and what's shared
Per-locale (must be built/curated for pt-BR)
- Catalog index (titles in Portuguese, BR-specific availability flags).
- Eval set (hand-curated questions, adversarial probes, domain-specific names).
- UI strings (~500).
- Agent voice samples for compose-style guidance (a few hundred examples).
- Cultural-sensitivity rules (jurisdiction-specific safety rules).
- Locale-specific tool behaviors (e.g., trending-MCP weighted to BR releases).
Shared (no change required)
- Graph topology.
- Skill contracts (input/output schemas; locale is a parameter).
- Sub-agent definitions (locale is in the input schema).
- Eval rubric structure (dimensions are universal; baselines per cell).
- Observability schema.
- Pause/resume + checkpoint logic.
- Quota manager.
The ratio shows the harness paying off — far more is shared than per-locale.
Per-locale extensibility points (already built)
```yaml
# what changes for pt-BR
catalog_search:
  config:
    pt-BR:
      index: catalog-pt-br-v1
      embedding_model: multilingual-e5-large
      bm25_field: title_pt + title_original
      reranker: same-as-shared

planner_prompt:
  locale_overrides:
    pt-BR:
      style: "informal but respectful, regional idioms welcome"
      currency: "BRL"
      date_format: "DD/MM/YYYY"
      adult_terminology_handling: "explicit; LGPD-compliant"

capability_flags:
  jurisdictions:
    BR:
      lgpd_compliant: true
      ai_disclosure_required: true
      payment_methods: [PIX, boleto, credit_card]
      currency: BRL

eval_rubric:
  baselines:
    pt-BR:
      aggregate_target: 0.78  # slightly lower at launch; ramps with maturity
      per_dim_targets: { factual_grounding: 0.85, locale_appropriateness: 0.92 }

serving:
  regions:
    LATAM:
      primary: sa-east-1
      fallback: us-east-1
      warm_floor_pct: 200  # higher initial buffer for launch volatility
```
This is all data. No code changes. The harness was designed for this; the change list is a YAML diff.
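As an illustration of the "config-only" claim, here is a minimal sketch of how such a diff might be validated and registered at load time. `LocaleConfig`, `REQUIRED_SECTIONS`, and `register_locale` are hypothetical names for this sketch, not part of the actual harness:

```python
# Sketch: validating and merging a per-locale YAML diff at load time.
# LocaleConfig, REQUIRED_SECTIONS, and register_locale are hypothetical
# illustrations of the extensibility points above, not a real API.
from dataclasses import dataclass

REQUIRED_SECTIONS = {"catalog_search", "planner_prompt", "capability_flags",
                     "eval_rubric", "serving"}

@dataclass(frozen=True)
class LocaleConfig:
    locale: str
    catalog_index: str
    jurisdiction_flags: dict
    eval_baseline: float

def register_locale(locale: str, jurisdiction: str, diff: dict) -> LocaleConfig:
    """Reject incomplete diffs up front; a valid diff needs no code changes."""
    missing = REQUIRED_SECTIONS - diff.keys()
    if missing:
        raise ValueError(f"{locale}: missing config sections {sorted(missing)}")
    baseline = diff["eval_rubric"]["baselines"][locale]["aggregate_target"]
    if not 0.0 < baseline <= 1.0:
        raise ValueError(f"{locale}: implausible eval baseline {baseline}")
    return LocaleConfig(
        locale=locale,
        catalog_index=diff["catalog_search"]["config"][locale]["index"],
        jurisdiction_flags=diff["capability_flags"]["jurisdictions"][jurisdiction],
        eval_baseline=baseline,
    )

# Usage (assuming the YAML above is saved as pt_br.yaml):
#   import yaml
#   cfg = register_locale("pt-BR", "BR", yaml.safe_load(open("pt_br.yaml")))
```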
The risky parts
Risk 1 — Hallucinations in pt-BR
- The LLM is strong in pt-BR but imperfect; subtle hallucinations are more likely than in en-US.
- Mitigation: the factual_grounding dimension uses tool lookup (story 04), which is language-independent.
- Pre-launch: 200 adversarial probes in pt-BR for hallucination detection.
Risk 2 — Catalog data quality
- New catalog ingestion is fast but quality varies; some titles have wrong genre tags.
- Mitigation: per-locale catalog QA pre-launch. The top-10K titles are human-verified; the long tail is accepted with data-quality flags. Recommendations preferentially draw from the verified subset for the first month (sketched below).
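A sketch of the "preferentially draw from the verified subset" rule; `Title`, its fields, and the penalty value are illustrative assumptions, not the production ranking code:

```python
# Sketch: bias launch-month recommendations toward human-verified rows.
# Title, its fields, and the 0.25 penalty are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Title:
    title_id: str
    score: float      # relevance score from the recommender
    verified: bool    # True for the human-QA'd top-10K subset

def rank_for_launch(candidates: list[Title],
                    unverified_penalty: float = 0.25) -> list[Title]:
    """For the first month, down-weight titles with unverified metadata."""
    def adjusted(t: Title) -> float:
        return t.score if t.verified else t.score * (1.0 - unverified_penalty)
    return sorted(candidates, key=adjusted, reverse=True)
```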
Risk 3 — Cultural mismatches
- Recommendation prompts written against en-US style and data may suggest titles that don't fit the BR cultural context.
- Mitigation: locale-style examples in prompts; eval includes a "cultural appropriateness" dimension.
- Long-term: BR-specific user-preference signals build over time; cold-start reliance on global signals decreases.
Risk 4 — Latency from new region
- LATAM users routed through us-east-1 see +120ms vs. local serving.
- Mitigation: sa-east-1 capacity pre-warmed; cross-region failover in case of a regional issue (routing sketched below). Bedrock is available in sa-east-1 with the same model lineup.
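A sketch of that failover path; `healthy` is a stand-in for a real health check, and the routing table mirrors the serving block in the YAML diff above:

```python
# Sketch: LATAM request routing with cross-region fallback.
# healthy() is a stand-in for a real health check; the table mirrors the
# serving block in the YAML diff (primary sa-east-1, fallback us-east-1).
ROUTING = {"LATAM": {"primary": "sa-east-1", "fallback": "us-east-1"}}

def pick_endpoint(region: str, healthy) -> str:
    route = ROUTING[region]
    if healthy(route["primary"]):
        return route["primary"]
    # Falling back adds ~120ms for BR users but keeps the service up.
    return route["fallback"]
```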
Risk 5 — Eval coldness — too few pt-BR turns to detect drift early
- Mitigation: stratified sampling (story 04) reserves a quota for pt-BR. Even while BR launch traffic is small, an absolute per-locale sample count is enforced, not just a percentage (see the sketch below).
- Synthetic probes provide eval coverage independent of organic traffic.
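A sketch of the "absolute count is enforced" rule: the per-locale quota is the larger of a traffic percentage and a fixed floor, so a small locale still produces enough judged turns. The 1% rate and 400-turn floor here are illustrative, not the production values:

```python
# Sketch: per-locale eval sampling quota with an absolute floor.
# The 1% rate and 400-turn floor are illustrative values.
def daily_eval_quota(daily_turns_by_locale: dict[str, int],
                     rate: float = 0.01,
                     floor: int = 400) -> dict[str, int]:
    """Sample rate% of each locale's traffic, but never fewer than `floor`
    turns (capped at the locale's actual volume)."""
    return {
        locale: min(turns, max(int(turns * rate), floor))
        for locale, turns in daily_turns_by_locale.items()
    }

# Example: at 20K turns/day, pt-BR gets 400 sampled turns (the floor binds),
# while en-US at 5M turns/day gets 50K (the rate binds).
quota = daily_eval_quota({"pt-BR": 20_000, "en-US": 5_000_000})
```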
What the harness contributed
| Harness piece | Role for new locale |
|---|---|
| Capability flags / per-jurisdiction (story 02, 05) | Configuration-only addition |
| Skill contracts with locale field (story 02) | No code change for tools |
| Per-cell eval rubric (story 04) | pt-BR baseline declared in config |
| Stratified sampling (story 04) | Drift detection works from day 1 |
| Provider adapters (story 02) | Bedrock sa-east-1 endpoint is a new adapter config |
| Region-aware serving (story 06) | LATAM region as a deployment target |
| Audit endpoint (compliance scenario 5) | LGPD audit ready |
| Multi-version migration discipline (story 04) | Launch is just another canary |
A team without these pieces would be doing 6 weeks of engineering. With them, 2 weeks is feasible because most of the work is content (catalog, eval set, translations) and integrations (payments, LGPD), not code.
Q&A drill — opening question
Q: 2 weeks is aggressive. What's the realistic minimum without cutting corners?
The realistic minimum for full quality parity with en-US is 6-8 weeks (eval-set maturation, organic-traffic-driven calibration). What we ship in 2 weeks is a launch-quality experience: functional, compliant, with an eval baseline lower than en-US (0.78 vs. 0.81) and an explicit roadmap to close the gap over the next quarter.
The launch quality bar:
- Above the "embarrassingly bad" threshold for every subset.
- Compliance at 100% (no compromise).
- All hard safety checks pass.
- Soft quality (style, nuance) trails en-US for 1-2 quarters.
This phasing is honest with stakeholders and pragmatic.
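That bar can be made mechanical. A sketch of the go/no-go check, with hypothetical field names and an assumed 0.50 "embarrassingly bad" floor:

```python
# Sketch: mechanical go/no-go check for the launch quality bar above.
# Field names and the 0.50 "embarrassingly bad" floor are hypothetical.
def launch_gate(per_subset_scores: dict[str, float],
                compliance_pass: bool,
                hard_safety_pass: bool,
                embarrassment_floor: float = 0.50) -> bool:
    if not (compliance_pass and hard_safety_pass):  # no compromise on these
        return False
    # Every subset must clear the floor; soft quality may still trail en-US.
    return all(s >= embarrassment_floor for s in per_subset_scores.values())
```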
Grilling — Round 1
Q1. Eval set of 500 turns is small. How do you trust the launch quality bar?
500 turns is the minimum for stratified statistical power on the top-level dimensions. We supplement with:
- Synthetic probes (200) for adversarial coverage.
- Cross-locale eval — we machine-translate 1,000 en-US eval items to pt-BR and re-judge them. Imperfect, but it expands coverage (sketched below).
- The first 4 weeks of organic traffic add ~20K judgments via 1% sampling; the eval set matures rapidly post-launch.
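A sketch of the machine-translated expansion step; `translate` is a placeholder for whatever MT service is in use, and the item shape is an assumption:

```python
# Sketch: expand pt-BR eval coverage from the mature en-US set.
# translate() is a placeholder MT call; the item fields are assumptions.
def expand_eval_set(en_us_items: list[dict], translate) -> list[dict]:
    """Machine-translate en-US eval prompts to pt-BR, tagging provenance so
    MT items can be weighted below hand-curated ones at scoring time."""
    return [
        {
            "prompt": translate(item["prompt"], src="en-US", dst="pt-BR"),
            "expected": item["expected"],
            "provenance": "mt-from-en-US",
        }
        for item in en_us_items
    ]
```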
Q2. What if pt-BR launch reveals a critical bug — the agent says something offensive in idiom?
Three gates would have to fail:
- Pre-launch eval (with the cultural-appropriateness dimension) didn't catch it.
- Synthetic probes didn't cover the input.
- Active eval didn't catch it during the 5% canary.

If all three fail and the bug ships: a capability flag flips off the offending behavior (sketched below); the PR/comms team coordinates the response; the postmortem revises the eval rubric and adds a new probe.
The gates make catastrophic launch failures unlikely but not impossible. We accept some residual risk because perfect pre-launch eval is impossible on a 2-week timeline.
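What "flips off the offending behavior" might look like at the call site; the flag store and the flag name here are illustrative:

```python
# Sketch: runtime kill switch via per-jurisdiction capability flags.
# FLAGS and "regional_idiom_style" are illustrative; in practice this reads
# from the capability-flag store shown in the YAML diff above.
FLAGS = {"BR": {"regional_idiom_style": True}}

def allow_idiomatic_style(jurisdiction: str) -> bool:
    """Checked before the idiom-styling step; flipping the flag to False
    disables the behavior for every request in the jurisdiction at once,
    with no deploy."""
    return FLAGS.get(jurisdiction, {}).get("regional_idiom_style", False)
```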
Q3. Catalog data quality varies; some recommendations cite wrong volume counts. How handled?
This is the hallucination-prevention pattern (POC-to-production war story #3). The recommendation flow uses tool lookup for factual grounding (catalog-search MCP). The agent does not free-form generate "Berserk has 41 volumes" — it tool-calls for the count. If the catalog has wrong data, that's a catalog-data bug, not an agent bug. Catalog ingestion has its own QA gate; ongoing data corrections are a steady-state team responsibility.
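A sketch of that grounding rule; `catalog_lookup` stands in for the catalog-search MCP call, and its return shape is assumed for illustration:

```python
# Sketch: factual fields come from a tool call, never from generation.
# catalog_lookup stands in for the catalog-search MCP tool; the dict return
# shape is an assumption for illustration.
def volume_count_claim(title: str, catalog_lookup) -> str:
    record = catalog_lookup(title)        # authoritative catalog read
    if record is None or record.get("volumes") is None:
        return f"I couldn't verify the volume count for {title}."
    # The number is copied from tool output, so a wrong answer is a
    # catalog-data bug (with its own QA gate), not a model bug.
    return f"{title} has {record['volumes']} volumes."
```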
Grilling — Round 2 (architect-level)
Q4. How does this scale to launching 5 locales in parallel (e.g., LATAM expansion)?
The harness pieces are reusable. The scarce resources are:
- Eval set curation — needs native speakers; roughly 5-7 days of work per locale.
- Catalog data acquisition — varies by partner; some markets take weeks, others months.
- Compliance review — locale-specific; timelines vary.
- Translation and cultural review — locale-specific.

These are content and legal work, not engineering. Engineering could launch 5 locales simultaneously; the bottleneck is content-team capacity. Realistic parallelism: 2-3 locales per quarter.
Q5. What's the cost shape of a new locale post-launch?
- Initial overhead: catalog index storage, eval set hosting, additional region capacity. ~$15K/month per locale.
- Marginal cost per query: same as en-US.
- Long-term overhead: per-locale eval refresh, content updates. ~1 FTE per 5 locales.
Locale economics turn positive once a locale reaches ~50K MAU; BR is projected to hit that in week 6.
Q6. What architectural debt do you accept by launching at 2 weeks vs 6 weeks?
Three explicit debts:
- Eval coverage is shallow — it needs 1-2 quarters to mature before per-locale drift detection is as sensitive as en-US.
- Cold-start recommendation quality is lower — cross-locale popularity transfer is imperfect, and BR-specific signals build slowly.
- Cultural sub-rules rest on an initial native-speaker review — expect revisions as real traffic surfaces edge cases.
We track these debts as items on the post-launch roadmap. Each has an owner and a quarterly checkpoint.
Intuition gained
- A new locale is an integration test for the entire harness. Pieces that aren't extensible by-locale will be the long pole.
- Most of the work is content (catalog, eval set, translations), not code. The harness pays off by minimizing code work.
- Per-cell eval baselines allow honest launch with lower-than-mature quality bar.
- 2-week launches are content-bottlenecked, not engineering-bottlenecked, when the harness is solid.
- Architectural debt taken at launch is real and must be tracked explicitly, not hidden.
See also
- 05-new-compliance-mandate.md — per-jurisdiction policy in capability flags
- 06-tool-count-explodes.md — registry/contract scaling pattern
- User stories 02, 04, 08