Constraint Scenario 02 — Foundation Model Deprecated Mid-Quarter
Trigger: Anthropic announces Claude Sonnet 4.6 will be deprecated in 90 days. MangaAssist's planner, primary sub-agents, and eval judge all run on 4.6. Pillars stressed: P3 (Provider adapters) primarily; P1 (workflow) and P2 (eval/serving) secondary.
TL;DR
A model deprecation is a forcing function for the migration playbook the eval framework was built for. Without that framework, this is a six-month panicked rewrite. With it, it is a controlled migration that fits inside the 90-day window: replay → calibrate → canary → ramp. The killer detail is that in-flight workflows finish on the model they started with — there's no mid-flight cutover, ever.
The change
| Dimension | Before | After |
|---|---|---|
| Model on planner | Sonnet 4.6 | Sonnet 4.7 (target) |
| Model on sub-agents | Sonnet 4.6 | Sonnet 4.7 |
| Model on judge | Haiku 4.5 (unaffected) | Haiku 4.5 (unaffected) |
| Cost/turn (planner) | $0.0040 | $0.0044 (+10%, list price) |
| Latency p95 (planner) | 350ms | 290ms (faster, per provider) |
| Output style | tuned-for-4.6 prompts | retuned for 4.7 nuances |
| Tool-use behavior | well-characterized | new behaviors at edges |
| Deadline | — | 90 days |
The +10% cost increase is real. The "faster" claim from the provider needs verification against our own traffic. The output style change is the riskiest part — prompts tuned for 4.6 may not produce equivalent outputs on 4.7.
The cascade — what's at risk
flowchart TB
Dep[Sonnet 4.6 deprecation T-90d] --> P[Prompt drift risk]
Dep --> ToolUse[Tool-use behavior change]
Dep --> EvalCalib[Judge calibration valid?]
Dep --> Cost[Cost +10pct under new rates]
Dep --> CacheLoss[Inference cache invalidated]
Dep --> SubAgentReg[Many sub-agents to re-eval]
P --> RegTest[Need regression eval]
ToolUse --> SkillEval[Need per-skill calibration]
EvalCalib --> JudgeCheck[Judge vs human re-cal]
CacheLoss --> Cost2[Cost spike for first 2 weeks post-cutover]
SubAgentReg --> Time[Time-consuming for 8+ sub-agents]
The non-obvious risk: the inference cache is keyed on (prompt_hash, model_id, ...). Cutting over to 4.7 invalidates 100% of the cache. Cache hit rate goes from 38% to 0% on day one; rebuilds to 30%+ over ~2 weeks of natural traffic. Cost during that window is +18% just from cache misses, on top of the +10% list-price increase.
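A minimal sketch of why the cutover invalidates the cache transparently rather than needing a flush: the key is derived from the model id alongside the prompt hash, so 4.7 lookups can never collide with 4.6 entries (names and fields here are illustrative, not the production schema).

```python
import hashlib
import json

def inference_cache_key(prompt: str, model_id: str, params: dict) -> str:
    """Cache key for one inference call.

    Because model_id is part of the key, flipping the adapter from
    sonnet-4.6 to sonnet-4.7 makes every lookup miss; old 4.6 entries
    simply age out, with no explicit flush.
    """
    material = json.dumps(
        {
            "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
            "model_id": model_id,
            "params": params,  # temperature, max_tokens, tool schema version, ...
        },
        sort_keys=True,
    )
    return hashlib.sha256(material.encode()).hexdigest()

# Same prompt, different model -> different key, i.e. a guaranteed miss on day one.
k46 = inference_cache_key("recommend a manga like Vinland Saga", "sonnet-4.6", {"temperature": 0.2})
k47 = inference_cache_key("recommend a manga like Vinland Saga", "sonnet-4.7", {"temperature": 0.2})
assert k46 != k47
```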
The migration playbook (T-90d → T-0)
gantt
title Migration timeline 90 days
dateFormat YYYY-MM-DD
axisFormat Day %j
section Replay
Bulk replay 24h traffic 4.6 vs 4.7 :a1, 2026-04-29, 7d
Per-skill replay drill-down :a2, after a1, 5d
section Calibrate
Judge re-calibration vs human :b1, 2026-05-04, 7d
Prompt re-tuning for 4.7 high-impact :b2, after b1, 14d
section Canary
Shadow 4.7 (production traffic shadow eval) :c1, 2026-05-18, 14d
5pct canary on planner only :c2, 2026-06-01, 7d
25pct canary on planner :c3, after c2, 5d
50pct then 100pct canary planner :c4, after c3, 7d
section Sub-agents
Sub-agent migration in priority order :d1, 2026-06-15, 21d
section Final
Hold + observe :e1, 2026-07-13, 14d
4.6 unloaded :milestone, 2026-07-27, 0d
Replay phase (T-90d → T-78d)
- Replay 24-72 hours of production traffic (read-side; all write-side mocked) through 4.7.
- Cost: ~$30K in token budget; pre-approved, line-itemed.
- Output: per-turn diff, per-skill diff, per-locale diff.
- Decision artifact: spreadsheet with eval delta per dimension per cell. Non-regressing cells are unblocked. Regressing cells need attention.
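A sketch of how the per-cell delta could be aggregated from replay output, assuming each replayed turn already carries judge scores for both models plus skill and locale tags (field names are illustrative).

```python
from collections import defaultdict
from statistics import mean

def delta_table(replayed_turns: list[dict]) -> dict[tuple[str, str], float]:
    """Mean eval delta (4.7 minus 4.6) per (skill, locale) cell.

    Each turn is assumed shaped like:
    {"skill": "recommend", "locale": "ja-JP", "score_46": 0.81, "score_47": 0.79}
    """
    cells: dict[tuple[str, str], list[float]] = defaultdict(list)
    for turn in replayed_turns:
        cells[(turn["skill"], turn["locale"])].append(turn["score_47"] - turn["score_46"])
    return {cell: mean(deltas) for cell, deltas in cells.items()}

def regressing_cells(table: dict[tuple[str, str], float], tolerance: float = 0.02) -> list[tuple[str, str]]:
    """Cells whose mean score drops by more than the tolerance need attention."""
    return [cell for cell, delta in table.items() if delta < -tolerance]
```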
Calibrate phase (T-78d → T-57d)
- Judge re-calibration: 1K human-labeled turns (sourced from the prior month), re-judged by Haiku 4.5 against both 4.6 and 4.7 outputs. Confirm the judge isn't biased: κ ≥ 0.78 with humans on both models' outputs (agreement check sketched after this list).
- Prompt re-tuning: identify the top-N prompt regressions from replay, re-tune prompts. Re-replay to confirm. Old prompt and new prompt both versioned in registry — they ship side by side.
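A minimal sketch of the bias check, assuming paired human and judge labels for each model's outputs; scikit-learn's cohen_kappa_score stands in for whatever agreement statistic the eval framework actually uses.

```python
from sklearn.metrics import cohen_kappa_score

KAPPA_FLOOR = 0.78  # agreement threshold from the calibration gate

def judge_is_calibrated(human_labels: list[int], judge_labels: list[int]) -> bool:
    """True if judge/human agreement clears the floor on this set of outputs."""
    return cohen_kappa_score(human_labels, judge_labels) >= KAPPA_FLOOR

def migration_judge_gate(on_46_outputs: tuple[list, list], on_47_outputs: tuple[list, list]) -> bool:
    """The judge's verdicts drive promotion only if it agrees with humans on BOTH models."""
    return judge_is_calibrated(*on_46_outputs) and judge_is_calibrated(*on_47_outputs)
```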
Canary phase (T-57d → T-29d)
- Shadow run: 4.7 receives a copy of every request 4.6 receives; the response is judged in parallel but never delivered to users. Runs for ~24-72 hours.
- 5% canary: real users see 4.7 responses. The eval framework's online judging compares canary and control cohort means, with cohorts sized for adequate statistical power.
- Promotion gates between ramp steps: aggregate eval score at or above the 4.6 baseline; no per-subset drop greater than 2%; no single eval dimension drop greater than 5%; error rate statistically equivalent (gate logic sketched after this list).
- Rollback button hot at every stage.
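A sketch of that gate as one function, using the thresholds above as relative drops; the real gate also carries significance testing and minimum-sample checks, omitted here.

```python
from dataclasses import dataclass

@dataclass
class CohortMetrics:
    aggregate_eval: float               # overall judge score, 0..1
    subset_evals: dict[str, float]      # e.g. per-locale, per-intent
    dimension_evals: dict[str, float]   # e.g. accuracy, tone, safety
    error_rate: float

def promotion_gate(control: CohortMetrics, canary: CohortMetrics) -> bool:
    """True only if the 4.7 cohort may ramp to the next traffic percentage."""
    if canary.aggregate_eval < control.aggregate_eval:
        return False
    for subset, score in canary.subset_evals.items():
        if score < control.subset_evals[subset] * 0.98:     # no per-subset drop > 2%
            return False
    for dim, score in canary.dimension_evals.items():
        if score < control.dimension_evals[dim] * 0.95:     # no dimension drop > 5%
            return False
    # "error rate equivalent", illustrated here as a simple tolerance band
    return canary.error_rate <= control.error_rate * 1.05
```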
Sub-agent phase (T-29d → T-8d)
- Sub-agents migrate one at a time, in priority order: most-invoked first, lowest-risk first when priorities conflict (ordering sketched after this list).
- Each sub-agent: replay + canary + ramp, just like the planner. Shorter cycles since the machinery is hot.
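One reasonable reading of that ordering as a sort key: invocation volume descending, with risk tier ascending as the tie-break (fields are illustrative).

```python
from dataclasses import dataclass

@dataclass
class SubAgent:
    name: str
    daily_invocations: int
    risk_tier: int   # 1 = low consequence of regression, 3 = high (e.g. safety)

def migration_order(agents: list[SubAgent]) -> list[SubAgent]:
    """Most-invoked first; when priorities conflict, the lower-risk agent goes first."""
    return sorted(agents, key=lambda a: (-a.daily_invocations, a.risk_tier))
```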
Final phase (T-8d → T-0)
- All traffic on 4.7. 4.6 still warm-loadable for emergency revert.
- 14-day stability watch.
- T-0: 4.6 adapter unloaded.
What the harness does for free
| Harness piece | Role in migration |
|---|---|
| Provider adapter pattern (story 02) | Both 4.6 and 4.7 served behind one interface; switch is config |
| Versioned events (story 08) | Per-model dashboards; every metric partitioned cleanly |
| Active eval (story 04) | Drives every promotion decision |
| Replay harness (story 04) | Pre-deploy diff testing |
| Capability flags (story 02) | Per-cohort canary at any granularity |
| In-flight stability (story 06) | Workflows finish on starting model — no mid-flight cutover risk |
| Inference cache key (story 06) | Includes model_id; old cache transparently invalidates |
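To make the adapter and in-flight-stability rows concrete, here is a sketch under assumed names (ModelAdapter, CapabilityFlags, and the pinning field are illustrative, not the story-02/06 code): both models sit behind one interface, the capability flag decides the model for new workflows, and a workflow keeps the model it started with.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

class ModelAdapter(Protocol):
    model_id: str
    def complete(self, prompt: str, **params) -> str: ...

@dataclass
class CapabilityFlags:
    """Stand-in for the flag service: maps a cohort to its planner model."""
    planner_model_for: Callable[[str], str] = lambda cohort: "sonnet-4.6"

@dataclass
class Workflow:
    cohort: str
    pinned_model_id: str | None = None

ADAPTERS: dict[str, ModelAdapter] = {}   # "sonnet-4.6" and "sonnet-4.7" both registered
flags = CapabilityFlags()

def resolve_model(workflow: Workflow) -> str:
    """In-flight stability: pin the model on the workflow's first turn.

    New workflows consult the capability flag (the canary split); workflows
    already in flight keep their pinned model_id until they finish.
    """
    if workflow.pinned_model_id is None:
        workflow.pinned_model_id = flags.planner_model_for(workflow.cohort)
    return workflow.pinned_model_id

def plan_turn(workflow: Workflow, prompt: str) -> str:
    adapter = ADAPTERS[resolve_model(workflow)]   # the cutover is just which id the flag returns
    return adapter.complete(prompt)
```

Rollback is the same move in reverse: point the flag back at 4.6, and only newly started workflows change.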
Without these, the migration is six panicked months. With these, it is roughly eight weeks of methodical, active work inside the 90-day window.
The trade-off no one likes to talk about
The +10% cost is permanent. We've baked it in by reducing cost elsewhere (model fallback for low-quality-tolerant turns; aggressive caching). Net cost change: +3% post-migration.
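A back-of-the-envelope sketch of how a +10% list price can net out around +3%. The fallback share and fallback price below are hypothetical round numbers chosen only to show the mechanism; they are not our actual routing split.

```python
# All figures are relative to the pre-migration planner cost of 1.00 per turn.
list_price_multiplier = 1.10   # Sonnet 4.7 list price, +10%

# HYPOTHETICAL offsets: a slice of low-quality-tolerant turns routes to a cheaper
# fallback model; caching effects are ignored here for simplicity.
fallback_share = 0.10          # assumed, not the real number
fallback_relative_cost = 0.40  # assumed, not the real number

blended = (1 - fallback_share) * list_price_multiplier + fallback_share * fallback_relative_cost
# 0.90 * 1.10 + 0.10 * 0.40 = 0.99 + 0.04 = 1.03, i.e. roughly +3% net
print(f"net cost multiplier ~ {blended:.2f}")
```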
The cache cold period (~2 weeks at +18% effective cost) is unavoidable. We pre-arrange a one-time budget allowance for it.
The biggest hidden cost is engineering attention — 6 weeks of senior engineering time across eval, prompt tuning, sub-agent owners, and ops. That comes off other roadmap items.
What goes wrong (and how to recover)
Risk 1 — Sub-agent X regresses, but only on Korean inputs
- Caught by per-locale subset gate.
- Action: stop ramp on that sub-agent; re-tune prompt for ko-KR specifically; re-canary.
- Worst case: one sub-agent stays on 4.6 a few extra weeks. Capability flag holds.
Risk 2 — Judge calibration fails — judge prefers 4.7 outputs even when humans prefer 4.6
- Caught by κ check against humans.
- Action: re-prompt the judge; or use a different judge model. Do not migrate until judge agrees with humans on both models.
Risk 3 — Tool-use behavior change — 4.7 calls a deprecated tool more often
- Caught by per-skill invocation rate dashboards.
- Action: prompt update to remove the deprecated tool from few-shots; re-canary.
Risk 4 — 4.6 deprecation accelerated by provider (T-90 → T-30)
- Pre-mitigated: the playbook's longest pole is the canary safety period, not the work itself. Compressing to 30 days is feasible if all replay/calibration is already complete. Risk: less stability runway post-cutover.
- We treat the 90 days as the comfortable timeline; 60 days is feasible; 30 days is "drop everything else."
Q&A drill — opening question
Q: Why not just rewrite all prompts for 4.7 and ship in one big release?
Three reasons:
1. You can't measure it. Without a canary, you don't know if your "rewrites" are improvements until production tells you, and production telling you is bad news.
2. Sub-agents have different risk profiles. A safety sub-agent and a recommendation sub-agent should not migrate on the same schedule — the consequence of a regression differs.
3. Operational headroom. Big-bang releases consume the team's entire attention. Incremental releases let the regular roadmap continue.
The "ship it all at once" approach is what teams without an eval framework are forced into. It's not a choice; it's a symptom.
Grilling — Round 1
Q1. Replay costs $30K. Why so much, and is it worth it?
24 hours of MangaAssist traffic = ~57M sync turns × ~$0.0005 average for a planner-only replay = ~$28.5K. Sub-agent replay adds more. Worth it because it (a) catches regressions before they ship, (b) produces the per-cell delta dataset, and (c) is reusable for the next migration. The real comparison is against the cost of a single undetected production regression.
Q2. How do you stop people from "tuning" the prompt for 4.7 in ways that secretly improve metrics on 4.7 but degrade on 4.6 (regression-test smell)?
The eval set is frozen. Prompt changes during migration are reviewed against (a) the frozen test set on 4.6 (must not drop), (b) the frozen test set on 4.7 (target). The change passes only if both are non-degrading. This catches the "Goodhart" pattern of overfitting to the new model.
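A sketch of that dual gate for a single prompt change, assuming a helper that scores a prompt against the frozen eval set for a given model (the helper name is hypothetical).

```python
def prompt_change_gate(score_on_frozen_set, old_prompt: str, new_prompt: str) -> bool:
    """A prompt edit lands only if it degrades neither model on the frozen set.

    score_on_frozen_set(prompt, model_id) -> aggregate eval score on the frozen set.
    """
    for model_id in ("sonnet-4.6", "sonnet-4.7"):
        if score_on_frozen_set(new_prompt, model_id) < score_on_frozen_set(old_prompt, model_id):
            return False   # degrades on at least one model, so reject the change
    return True
```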
Q3. What if a sub-agent really does need a different prompt for 4.7 — not a tweak but a rewrite?
Allowed, but flagged. The PR template asks: "is this a 4.7-specific prompt or a 4.6-and-4.7 universal prompt?" If 4.7-specific, the registry holds two versions; the gateway routes based on the model in use. Adds maintenance cost; sometimes worth it. We aim for universal prompts where possible.
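One way the registry side could look: a lookup keyed by prompt id and model family that falls back to the universal version when no model-specific variant exists (structure is illustrative, not the actual registry schema).

```python
PROMPT_REGISTRY: dict[tuple[str, str], str] = {
    # (prompt_id, model_family) -> prompt text; version pins elided for brevity
    ("planner.system", "universal"):  "You are MangaAssist's planner ...",
    ("planner.system", "sonnet-4.7"): "You are MangaAssist's planner ... (4.7-specific phrasing)",
}

def resolve_prompt(prompt_id: str, model_id: str) -> str:
    """Prefer a model-specific prompt; otherwise fall back to the universal one."""
    return PROMPT_REGISTRY.get((prompt_id, model_id), PROMPT_REGISTRY[(prompt_id, "universal")])
```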
Grilling — Round 2 (architect-level)
Q4. What does the cost dashboard look like during the canary phase, and how do you control budget overrun?
The dashboard partitions cost by model (4.6 vs 4.7) and by cohort (canary vs control). Two budget guardrails:
- Per-model spend ceiling for the canary period. The ramp auto-halts if 4.7 spend exceeds 1.3× the projection.
- Per-cohort cost-per-turn alert if 4.7 turns cost >1.2× 4.6 turns for matched intent; this indicates a token-count regression.
We've had cases where 4.7 produced longer responses than 4.6 on the same input, inflating output tokens. Caught by this alert; resolved by prompt-level "be concise" reinforcement.
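A sketch of both guardrails as checks over the partitioned cost events, using the 1.3x and 1.2x thresholds above; matched-intent pairing is omitted for brevity and the event shape is illustrative.

```python
from collections import defaultdict

def cost_per_turn(events: list[dict]) -> dict[str, float]:
    """Events are assumed shaped like {"model_id": ..., "cohort": ..., "cost": ...}."""
    spend: dict[str, float] = defaultdict(float)
    turns: dict[str, int] = defaultdict(int)
    for e in events:
        spend[e["model_id"]] += e["cost"]
        turns[e["model_id"]] += 1
    return {model: spend[model] / turns[model] for model in spend}

def budget_guardrails(events: list[dict], projected_47_spend: float) -> dict[str, bool]:
    """Canary-phase cost checks; a True halt_ramp auto-pauses the ramp."""
    spend_47 = sum(e["cost"] for e in events if e["model_id"] == "sonnet-4.7")
    cpt = cost_per_turn(events)
    return {
        "halt_ramp": spend_47 > 1.3 * projected_47_spend,   # per-model spend ceiling
        "token_regression_alert": cpt.get("sonnet-4.7", 0.0) > 1.2 * cpt.get("sonnet-4.6", float("inf")),
    }
```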
Q5. How do you handle the case where 4.7 IS qualitatively better but the eval rubric doesn't measure the new dimension it's better at?
This is the rubric-incompleteness trap. We:
- Keep a subjective qualitative review in the canary process — a human reviewer rates a sample of canary outputs on dimensions not in the rubric (creativity, helpfulness nuance).
- Update the rubric and re-calibrate it against humans before the final ramp if the review reveals new dimensions. The rubric is a living artifact.
- Do NOT promote based on subjective review alone; the rubric must be made objective first.
This protects against both rubric overfitting AND rubric staleness.
Q6. Walk through what happens if Anthropic accelerates deprecation by 60 days mid-canary.
We're at, say, 25% canary on the planner, sub-agents not yet started. Two-track response:
- Track 1 — accelerate the planner: ramp 25→100% over 5 days instead of 14, with an eval gate at each step but reduced statistical power (faster ramps mean smaller cohorts at decision time). Risk: a post-cutover regression that would otherwise have been caught.
- Track 2 — parallelize sub-agents: start sub-agent canaries in parallel rather than sequentially. Operationally taxing; needs more engineers in eval review.
- Fallback if either track fails: stay partly on 4.6 past the deprecation date by negotiating a paid extension with the provider. Last-resort, but real.
The compression is feasible because the eval and replay machinery is already running. The compression is brutal because human review and prompt-tuning are not parallelizable beyond their staffing.
Intuition gained
- A model deprecation is the eval framework's prove-itself moment. Replay → calibrate → canary → ramp is the machinery, deployed.
- Provider adapters + capability flags turn migration from a code rewrite into a config change.
- Cache invalidation has a cost; budget for the cold period.
- The eval rubric is a living artifact; migration is when you discover dimensions you forgot.
- Migration timelines are dictated by canary safety duration, not by work volume. Compressing the timeline = compressing the safety, not the work.
See also
- 01-10x-user-surge.md — same adapter machinery, different timeframe
- 06-tool-count-explodes.md — sub-agent scale-out is the same eval discipline
- User stories 02, 04, 06, 08