Constraint Scenario 07 — Provider Quota Revoked
Trigger: Bedrock Sonnet TPM allocation cut from 50M tokens/min to 30M tokens/min with 48-hour notice (capacity reallocation across AWS regions). MangaAssist runs at 38M TPM peak. Pillars stressed: P2 (quota manager) primarily; P3 (provider adapters) secondary.
TL;DR
A unilateral quota cut by a provider is the harness's clearest stress test: there is no negotiation runway, and the cut has to be absorbed without an outage. The quota manager (story 08) does the absorbing: priority classes (P0 sync chat down to P3 batch) are pre-built, so the cut reduces to "shed P3 first, downshift if needed." Combined with the multi-provider adapter pattern, ~25% of P0 load spills to Anthropic direct as relief. Quality dips ~3-5% in the affected window; no outage.
The change
| Metric | Before | After (forced) |
|---|---|---|
| Bedrock Sonnet TPM | 50M | 30M |
| Notice | — | 48 hours |
| Peak demand | 38M TPM | 38M TPM (unchanged) |
| Spill capacity (Anthropic direct) | provisioned at 5M TPM | scalable to 12M TPM via emergency contract |
| Acceptable quality drop | — | <5% aggregate, none on safety-critical |
| Acceptable outage | — | 0 minutes |
We have to find 8-13M TPM of capacity in 48 hours, or shed that much load: 8M is the raw gap between the 38M peak and the new 30M ceiling, and the mitigations below target ~13M so Bedrock never runs right at its cap.
The cascade — why naive responses fail
```mermaid
flowchart TB
  Cut[Bedrock TPM cut] --> Pri[Priority queue under saturation]
  Pri --> NoQM[Without quota manager: 429s leak everywhere]
  NoQM --> Retry[Retries amplify]
  Retry --> Saturate[Provider 100pct saturated]
  Saturate --> Outage[Outage]
  Pri --> WithQM[With quota manager]
  WithQM --> P3Pause[P3 batch pauses, 4M TPM freed]
  WithQM --> P2Drop[P2 eval throttles to 0.4pct, 1M TPM freed]
  WithQM --> Spill[Spill P0/P1 surplus to Anthropic direct, 8M TPM]
  P3Pause --> Stable[Stable]
  P2Drop --> Stable
  Spill --> Stable
```
The pre-existing quota manager + adapter pattern is the difference between outage and graceful absorption.
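A minimal sketch of the reserve-before-invoke pattern, assuming hypothetical names (`QuotaManager`, `Reservation`); the real story-08 component tracks per-window ledgers and per-class budgets, this only shows the shape of the idea:

```python
# Sketch of reserve-before-invoke with priority classes; names are illustrative.
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    P0 = 0  # sync chat
    P1 = 1  # async user-facing
    P2 = 2  # eval / judge
    P3 = 3  # batch enrichment


@dataclass
class Reservation:
    tokens: int
    priority: Priority


class QuotaManager:
    def __init__(self, ceiling_tpm: int):
        self.ceiling_tpm = ceiling_tpm   # provider-imposed ceiling, e.g. 30_000_000
        self.reserved_tpm = 0            # tokens reserved in the current minute window

    def reserve(self, tokens: int, priority: Priority) -> Reservation | None:
        """Reserve capacity before invoking the provider; never over-commit the ceiling."""
        if self.reserved_tpm + tokens > self.ceiling_tpm:
            # Denied instead of leaking a 429: P3 pauses, P2 lowers its sample
            # rate, P0/P1 fall through to the spill provider or queue briefly.
            return None
        self.reserved_tpm += tokens
        return Reservation(tokens=tokens, priority=priority)

    def release(self, reservation: Reservation) -> None:
        """Return completed or expired reservations to the window."""
        self.reserved_tpm -= reservation.tokens


# The quota cut itself is then a single number change, not a redesign:
bedrock_qm = QuotaManager(ceiling_tpm=30_000_000)   # was 50_000_000 before the cut
```

The point of the sketch is the ordering: capacity is reserved before the provider is called, so saturation shows up as a denied reservation inside our system rather than a 429 at the provider boundary.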
The 48-hour playbook
Hour 0 — Receive notice
- Bedrock support announces the cut. Slack/PagerDuty triggered.
- Architect-on-call confirms current peak vs new ceiling. Gap: 8-13M TPM at peak.
Hour 0-4 — Plan + provider engagement
- Open emergency support tickets with both Bedrock (negotiate / clarify) and Anthropic (request emergency direct-API allocation increase).
- Internal review: which features are eligible to drop? P3 batch is the easy one; eval sampling is next.
Hour 4-12 — Spill capacity provisioned
- Anthropic direct API capacity confirmed at 12M TPM (from baseline 5M).
- Provider adapter (story 02) capability flag tested: routing 5% of P0 traffic to Anthropic direct, 24-hour shadow eval.
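What that capability-flag routing can look like, as a sketch: a flag payload plus deterministic session bucketing, so a given user stays on one provider for the whole window. The 5% figure comes from this playbook; the flag structure and function names are illustrative.

```python
# Hypothetical capability-flag spill rollout: route a configured percentage of
# P0 sessions to the Anthropic direct adapter, deterministically by session id.
import hashlib

SPILL_FLAG = {"enabled": True, "provider": "anthropic_direct", "percent": 5}


def pick_provider(session_id: str, flag: dict = SPILL_FLAG) -> str:
    if not flag["enabled"]:
        return "bedrock"
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return flag["provider"] if bucket < flag["percent"] else "bedrock"


# The later ramp (5 -> 25) is a config change, not a deploy.
assert pick_provider("session-42") in {"bedrock", "anthropic_direct"}
```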
Hour 12-36 — Phased rollout of mitigations
- T+12h: P3 batch shifts to off-peak only (overnight US/EU window).
- T+18h: Eval sampling drops 1% → 0.4% during peak hours.
- T+24h: Spill 5% of P0 chat to Anthropic direct, eval green; ramp to 25% over 6 hours.
- T+36h: All mitigations live. Peak load distribution: 30M Bedrock (capped) + 9M Anthropic direct + 2M eliminated by sampling/batch shifts = 41M effective.
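As a sanity check on that T+36h distribution, a quick back-of-envelope using the numbers in the bullets above:

```python
# Worked check of the T+36h load picture (all figures from this playbook).
bedrock_capped   = 30_000_000   # new Bedrock ceiling, tokens/min
anthropic_direct =  9_000_000   # spill actually used at peak, of the 12M emergency cap
demand_removed   =  2_000_000   # eval sampling cut + batch shifted off-peak
peak_demand      = 38_000_000

effective = bedrock_capped + anthropic_direct + demand_removed
assert effective == 41_000_000
assert effective - peak_demand == 3_000_000   # ~3M TPM of headroom at peak
```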
Hour 36-48 — Stabilize + measure
- Peak hour passes; metrics confirm no 429s leaked to users.
- Cost analysis: Anthropic direct is ~12% more expensive per token than Bedrock for our volume; net cost increase ~3% during the affected window.
Day 3+ — Long-term decision
- Negotiate increased Bedrock allocation (multi-week process).
- Decide: keep multi-provider as steady-state (more resilience) or revert to Bedrock-primary once allocation restored.
- We chose to keep ~10% of P0 on Anthropic direct as standing redundancy — paid the small cost premium for the peace of mind.
The architecture during the cut
```mermaid
flowchart LR
  subgraph Callers
    P0[P0 sync chat]
    P1[P1 async user-facing]
    P2[P2 eval / judge]
    P3[P3 batch enrichment]
  end
  subgraph QM[Quota Manager]
    L[Ledger]
    Pol[Priority policy]
    Sp[Spill policy]
  end
  subgraph Bed[Bedrock - 30M TPM ceiling]
    BR[Sonnet/Haiku endpoints]
  end
  subgraph An[Anthropic Direct - 12M TPM emergency]
    AN[Sonnet/Haiku via direct API]
  end
  P0 --> QM
  P1 --> QM
  P2 --> QM
  P3 --> QM
  QM -->|reserve| Bed
  QM -->|spill| An
  QM -.pause.-> P3
```
Spill policy:
- P0 traffic: prefers Bedrock; spills to Anthropic when the Bedrock reservation is denied.
- P1 traffic: same as P0, at lower priority.
- P2: locked to Bedrock; if quota is denied, the sample rate reduces.
- P3: pauses entirely until Bedrock has headroom.
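In code, that policy is a short routing function. This sketch reuses the `Priority` and `QuotaManager` types from the earlier sketch; the return values are illustrative labels, not real adapter names.

```python
# Sketch of the spill policy in the diagram above.
def route(priority: Priority, tokens: int,
          bedrock_qm: QuotaManager, anthropic_qm: QuotaManager) -> str:
    """Decide where a request goes under the cut."""
    if bedrock_qm.reserve(tokens, priority):
        return "bedrock"                 # preferred path for every class
    if priority in (Priority.P0, Priority.P1):
        if anthropic_qm.reserve(tokens, priority):
            return "anthropic_direct"    # spill only user-facing traffic
        return "queue"                   # brief queueing beats a user-visible 429
    if priority is Priority.P2:
        return "drop_sample"             # eval stays Bedrock-only; reduce sample rate
    return "pause"                       # P3 waits for off-peak headroom
```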
What goes wrong
Risk 1 — Anthropic direct produces slightly different outputs than Bedrock-Sonnet
- Detection: per-provider eval partition.
- Reality: same underlying model, near-identical outputs. Tiny variation in tokenization on rare inputs. Eval delta < 0.3%.
- If significant drift is discovered, treat it as a model migration (scenario 02 playbook).
Risk 2 — Anthropic direct rate-limits at the new higher allocation
- Detection: 429s from Anthropic.
- Mitigation: stay below 90% of allocation; monitor headroom continuously.
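A minimal sketch of that headroom guard, with illustrative thresholds (the 12M cap and 90% soft ceiling are the figures used in this scenario):

```python
# Refuse new spill traffic once we approach the emergency Anthropic allocation.
ANTHROPIC_CAP_TPM = 12_000_000
SOFT_CEILING = 0.90 * ANTHROPIC_CAP_TPM   # never plan to use the last 10%


def spill_allowed(current_spill_tpm: int, tokens: int) -> bool:
    return current_spill_tpm + tokens <= SOFT_CEILING
```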
Risk 3 — Multi-provider increases observability noise
- Detection: dashboard partition by provider should show comparable metrics.
- Mitigation: provider adapter normalizes responses (latency reporting, error codes); provider is a tag in events.
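A sketch of what a normalized invocation event might look like: the provider is just a tag, so dashboards partition by it without per-provider schemas. Field names are illustrative, not the real story-08 event schema.

```python
from dataclasses import dataclass


@dataclass
class InvocationEvent:
    schema_version: str      # versioned events (story 08)
    provider: str            # "bedrock" | "anthropic_direct"
    model: str               # e.g. "sonnet"
    priority: str            # "P0".."P3"
    latency_ms: float        # adapter-normalized, measured the same way for both
    error_code: str | None   # provider-specific errors mapped to one taxonomy
    input_tokens: int
    output_tokens: int
```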
Risk 4 — Cost surprise: Anthropic direct is billed per token, not flat-rate like Bedrock provisioned capacity
- Detection: cost dashboard partition by provider.
- Mitigation: pre-budgeted; finance pre-notified.
What the harness contributed
| Harness piece | Role |
|---|---|
| Quota manager (story 08) | The reservation system that prevented 429 leakage |
| Priority classes P0-P3 (story 08) | Pre-built shedding order; no on-the-fly decisions |
| Provider adapters (story 02) | Anthropic direct as fallback was a config flip |
| Capability flags (story 02) | Per-cohort spill rollout |
| Active eval (story 04) | Shadow eval on Anthropic direct; promotion gate |
| Versioned events (story 08) | Per-provider dashboards for diagnosis |
Without these: an outage. With these: controlled absorption with quality dip.
Q&A drill — opening question
*Q: Why don't you just always use Anthropic direct? Why depend on Bedrock at all?*
Bedrock advantages:
- Lower per-token cost at our committed volume (~12% cheaper).
- Better integration with AWS-native auth, IAM, and observability.
- Bundled with our enterprise agreement.
Anthropic direct advantages:
- Faster cap raises during emergencies.
- Often gets new model releases first.
- Less abstraction means more control over rate-limit headers and retries.
The right answer is both, with Bedrock primary + Anthropic standing-redundancy at 10%. We've kept that posture since this incident.
Grilling — Round 1
Q1. Why not provision more Bedrock TPM in advance to never hit this?
Three reasons:
- Cost. Reserved provisioned capacity is 25-40% more expensive than on-demand, and over-provisioning to handle a hypothetical 50% cut would double our spend.
- Quota cuts can reach reserved capacity too. Even reservations can be partly clawed back during regional capacity shortages; no allocation is fully guaranteed.
- A multi-provider strategy is more resilient than over-provisioning one provider: it shifts the dependency from a single contract to a portfolio.
Q2. 48-hour notice is generous. What about a no-notice cut (e.g., regional outage)?
The same playbook compresses to <2 hours. We have:
- A "quota cut" runbook with specific commands (paste into terminal; spill turns on automatically).
- Pre-tested capability flags for emergency spill.
- An on-call rotation that knows the runbook.
A no-notice cut hits us with ~5-15 minutes of degraded latency before the spill activates. The quota manager's reserve-before-invoke model prevents user-visible failures even during this window — saturated requests downshift or queue, they don't 429.
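The runbook step itself can be little more than a pre-tested config payload. A hypothetical sketch, where the keys are illustrative and the values (25% spill, 0.4% eval sampling, 30M ceiling) come from this scenario:

```python
# Hypothetical "quota cut" runbook payload: the incident response is data,
# pushed through the capability-flag system, not code written under pressure.
EMERGENCY_SPILL = {
    "spill.enabled": True,
    "spill.provider": "anthropic_direct",
    "spill.p0_percent": 25,
    "bedrock.ceiling_tpm": 30_000_000,   # tell the quota manager about the new cap
    "p3.schedule": "off_peak_only",
    "p2.sample_rate": 0.004,
}

# apply_config(EMERGENCY_SPILL) would be the single paste-into-terminal step.
```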
Q3. Anthropic direct emergency allocation — how reliable is that promise?
Provider relationships are part of the architecture. We have:
- A signed standing agreement with Anthropic for emergency cap raises with a 1-hour SLA (paid separately).
- A list of pre-approved use cases — "Bedrock cut" is on the list.
- Dedicated relationship managers at both AWS and Anthropic.
The agreement isn't free; we pay a retainer. Worth it because the alternative is "negotiate during incident."
Grilling — Round 2 (architect-level)
Q4. Walk through the queue dynamics during the cut. With reserve-before-invoke, what happens to a P3 batch job mid-execution?
Mid-execution P3 jobs are not killed — they finish under their existing reservations. New P3 reservations are denied; new P3 jobs sit in their queue. When Bedrock has headroom (off-peak hours), P3 reservations resume. Effective queue depth grows during peak; drains overnight.
The pause/resume design (story 05) makes this clean — P3 jobs can pause indefinitely without losing state.
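A sketch of those queue dynamics, reusing the `Priority` and `QuotaManager` types from the earlier sketch (function and queue names are illustrative): running jobs keep their reservations, new reservations are denied at peak, and the queue drains when headroom returns.

```python
from collections import deque

p3_queue: deque = deque()


def submit_p3(job, tokens: int, qm: QuotaManager):
    reservation = qm.reserve(tokens, Priority.P3)
    if reservation is None:
        p3_queue.append((job, tokens))   # paused with state intact (story 05), not killed
        return None
    return reservation                    # job runs to completion under this reservation


def drain_p3(qm: QuotaManager):
    """Called off-peak: re-attempt queued jobs while Bedrock has headroom."""
    while p3_queue:
        job, tokens = p3_queue[0]
        if qm.reserve(tokens, Priority.P3) is None:
            break                         # still no headroom; try again later
        p3_queue.popleft()                # hand the job back to the batch runner
```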
Q5. How do you know the Anthropic direct spill is actually safe before ramping P0 traffic to 25%?
Three gates:
- Shadow eval — 24 hours where Anthropic direct receives a copy of P0 requests and is judged in parallel; per-rubric delta computed.
- 5% canary — real users see Anthropic responses for 24 hours, with online eval and a user-satisfaction proxy (session length, follow-up rate).
- Per-locale check — some locales might respond differently; subset gates per locale.
If any gate fails, ramp halts. We accept slower mitigation timeline (risk: more days at compromised capacity) over fast-but-broken ramp.
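The three gates collapse into a single ramp predicate. A sketch, where the field names and the 5% threshold (borrowed from the acceptable-quality-drop budget) are illustrative:

```python
# Promotion gate before ramping P0 spill past 5%; thresholds are illustrative.
def ramp_allowed(shadow_delta: float,
                 canary_satisfaction_delta: float,
                 locale_deltas: dict[str, float],
                 max_delta: float = 0.05) -> bool:
    if shadow_delta > max_delta:                             # gate 1: 24h shadow eval
        return False
    if canary_satisfaction_delta < -max_delta:               # gate 2: 5% live canary
        return False
    if any(d > max_delta for d in locale_deltas.values()):   # gate 3: per-locale check
        return False
    return True
```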
Q6. What's the long-term architectural lesson?
Multi-provider as steady state is now baseline. We previously treated multi-provider as a contingency; the cost (slightly more complex, ~3-5% cost premium for Anthropic direct slice) is small relative to the resilience.
The deeper lesson: provider quotas are not infrastructure you own. They're a contract. Architectural assumptions like "Bedrock TPM is constant" are wrong; design for the contract being renegotiable downward.
This shaped subsequent decisions: we evaluate every new model integration with a "what's the cap-raise SLA?" question, not just price/quality.
Intuition gained
- Quota cuts are a forcing function for multi-provider design. Standing redundancy beats over-provisioning one provider.
- Reserve-before-invoke quota model prevents user-visible 429s during saturation.
- Priority classes are the shedding order; pre-built, not invented during incident.
- Provider relationships are architecture. SLAs for emergency cap raises are real engineering choices.
- The harness's value compounds. Quota manager + adapters + capability flags + eval framework — each was useful alone, together they made this incident a non-event.
See also
- 01-10x-user-surge.md — internal-driven quota stress
- 02-foundation-model-deprecated.md — same adapter pattern, different timing
- 03-cost-budget-halved.md — voluntary version of the squeeze
- User stories 02, 04, 08