
Constraint Scenario 07 — Provider Quota Revoked

Trigger: Bedrock Sonnet TPM allocation cut from 50M tokens/min to 30M tokens/min with 48-hour notice (capacity reallocation across AWS regions). MangaAssist runs at 38M TPM peak. Pillars stressed: P2 (quota manager) primarily; P3 (provider adapters) secondary.


TL;DR

A unilateral quota cut by a provider is the harness's clearest stress test: there is no negotiation runway, and you must absorb the cut without an outage. The quota manager (story 08) does the absorbing: priority classes (P0 sync chat through P3 batch) are pre-built, so the cut becomes "shed P3 first, downshift if needed." Combined with the multi-provider adapter pattern, ~25% of load spills to the Anthropic direct API as relief. Quality dips ~3-5% during the affected window; there is no outage.


The change

  Metric                               Before                  After (forced)
  Bedrock Sonnet TPM                   50M/min                 30M/min
  Notice                               -                       48 hours
  Peak demand                          38M TPM                 38M TPM (unchanged)
  Spill capacity (Anthropic direct)    provisioned at 5M TPM   scalable to 12M TPM via emergency contract
  Acceptable quality drop              -                       <5% on aggregate, none on safety-critical
  Acceptable outage                    -                       0 minutes

At peak we are 8M TPM over the new ceiling (38M demand vs a 30M cap), and we want headroom on top of that: we have to find roughly 8-13M TPM of capacity in 48 hours, or shed the equivalent load.


The cascade — why naive responses fail

flowchart TB
  Cut[Bedrock TPM cut] --> Pri[Priority queue under saturation]
  Pri --> NoQM[Without quota manager: 429s leak everywhere]
  NoQM --> Retry[Retries amplify]
  Retry --> Saturate[Provider 100pct saturated]
  Saturate --> Outage[Outage]

  Pri --> WithQM[With quota manager]
  WithQM --> P3Pause[P3 batch pauses 4M TPM freed]
  WithQM --> P2Drop[P2 eval throttles to 0.3pct 1M TPM freed]
  WithQM --> Spill[Spill P0/P1 surplus to Anthropic direct 8M TPM]
  P3Pause --> Stable[Stable]
  P2Drop --> Stable
  Spill --> Stable

The pre-existing quota manager + adapter pattern is the difference between outage and graceful absorption.
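The reserve-before-invoke pattern that makes this absorption possible can be sketched as follows. This is an illustrative Python sketch, not the real quota manager's API; the class and method names (`QuotaManager`, `reserve`, `Priority`) and the headroom thresholds are assumptions.

```python
from enum import IntEnum

class Priority(IntEnum):
    P0 = 0  # sync chat
    P1 = 1  # async user-facing
    P2 = 2  # eval / judge
    P3 = 3  # batch enrichment

class QuotaManager:
    """Tokens are reserved *before* the provider call, so saturation
    surfaces as a denied reservation inside our system, never as a
    provider 429 leaking to users."""

    def __init__(self, ceiling_tpm: int):
        self.ceiling = ceiling_tpm
        self.reserved = 0
        # Minimum remaining headroom (as a fraction of the ceiling)
        # required to grant each class; lower priorities shed first.
        # Numbers are illustrative.
        self.min_headroom = {Priority.P3: 0.20, Priority.P2: 0.10,
                             Priority.P1: 0.02, Priority.P0: 0.0}

    def reserve(self, tokens: int, prio: Priority) -> bool:
        headroom = (self.ceiling - self.reserved - tokens) / self.ceiling
        if headroom < self.min_headroom[prio]:
            return False  # caller must spill, downshift, or queue
        self.reserved += tokens
        return True

    def release(self, tokens: int) -> None:
        self.reserved -= tokens
```

Under the new 30M ceiling, a busy minute denies P3 reservations long before P0 is affected, which is exactly the "shed P3 first" behavior in the diagram.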


The 48-hour playbook

Hour 0 — Receive notice

  • Bedrock support announces the cut. Slack/PagerDuty triggered.
  • Architect-on-call confirms current peak vs new ceiling. Gap: 8-13M TPM at peak.

Hour 0-4 — Plan + provider engagement

  • Open emergency support tickets with both Bedrock (negotiate / clarify) and Anthropic (request emergency direct-API allocation increase).
  • Internal review: which features are eligible to drop? P3 batch is the easy one; eval sampling is next.

Hour 4-12 — Spill capacity provisioned

  • Anthropic direct API capacity confirmed at 12M TPM (from baseline 5M).
  • Provider adapter (story 02) capability flag tested: routing 5% of P0 traffic to Anthropic direct, 24-hour shadow eval.

Hour 12-36 — Phased rollout of mitigations

  • T+12h: P3 batch shifts to off-peak only (overnight US/EU window).
  • T+18h: Eval sampling drops 1% → 0.4% during peak hours.
  • T+24h: Spill 5% of P0 chat to Anthropic direct, eval green; ramp to 25% over 6 hours.
  • T+36h: All mitigations live. Peak load distribution: 30M Bedrock (capped) + 9M Anthropic direct + 2M eliminated by sampling/batch shifts = 41M effective.
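The T+36h capacity arithmetic can be checked with a few lines. The figures below are the ones quoted in the timeline; treating shed load as "capacity" is just an accounting convenience.

```python
# Back-of-the-envelope check of the T+36h effective-capacity math.
# Figures are from the playbook above; real numbers come from the ledger.
PEAK_TPM = 38_000_000

capacity = {
    "bedrock_capped": 30_000_000,          # new ceiling
    "anthropic_spill": 9_000_000,          # ~25% of P0/P1 spilled
    "shed_by_sampling_and_batch": 2_000_000,  # eval throttle + P3 shift
}
effective = sum(capacity.values())
headroom = effective - PEAK_TPM
print(f"effective={effective:,} headroom={headroom:,}")
```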

Hour 36-48 — Stabilize + measure

  • Peak hour passes; metrics confirm no 429s leaked to users.
  • Cost analysis: Anthropic direct is ~12% more expensive per token than Bedrock for our volume; net cost increase ~3% during the affected window.

Day 3+ — Long-term decision

  • Negotiate increased Bedrock allocation (multi-week process).
  • Decide: keep multi-provider as steady-state (more resilience) or revert to Bedrock-primary once allocation restored.
  • We chose to keep ~10% of P0 on Anthropic direct as standing redundancy — paid the small cost premium for the peace of mind.

The architecture during the cut

flowchart LR
  subgraph Callers
    P0[P0 sync chat]
    P1[P1 async user-facing]
    P2[P2 eval / judge]
    P3[P3 batch enrichment]
  end

  subgraph QM[Quota Manager]
    L[Ledger]
    Pol[Priority policy]
    Sp[Spill policy]
  end

  subgraph Bed[Bedrock - 30M TPM ceiling]
    BR[Sonnet/Haiku endpoints]
  end

  subgraph An[Anthropic Direct - 12M TPM emergency]
    AN[Sonnet/Haiku via direct API]
  end

  P0 --> QM
  P1 --> QM
  P2 --> QM
  P3 --> QM
  QM -->|reserve| Bed
  QM -->|spill| An
  QM -.pause.-> P3

Spill policy:

  • P0 traffic: prefers Bedrock; spills to Anthropic when the Bedrock reservation is denied.
  • P1 traffic: same as P0, with lower priority.
  • P2: locked to Bedrock; if quota is denied, the sample rate reduces.
  • P3: pauses entirely until Bedrock has headroom.
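The spill policy table reduces to a small routing function. A minimal sketch, assuming the quota manager has already told us whether the Bedrock reservation succeeded; the provider labels and outcome strings are illustrative.

```python
def route(prio: str, bedrock_reserved: bool) -> str:
    """Decide where a request goes given the outcome of the
    Bedrock reservation attempt."""
    if bedrock_reserved:
        return "bedrock"                # all classes prefer Bedrock
    if prio in ("P0", "P1"):
        return "anthropic_direct"       # spill user-facing traffic
    if prio == "P2":
        return "drop_sample"            # eval: reduce sample rate instead
    return "queue"                      # P3: pause until headroom returns
```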


What goes wrong

Risk 1 — Anthropic direct produces slightly different outputs than Bedrock-Sonnet

  • Detection: per-provider eval partition.
  • Reality: same underlying model, near-identical outputs. Tiny variation in tokenization on rare inputs. Eval delta < 0.3%.
  • If significant drift discovered, treat as a model migration (scenario 2 playbook).

Risk 2 — Anthropic direct rate-limits at the new higher allocation

  • Detection: 429s from Anthropic.
  • Mitigation: stay below 90% of allocation; monitor headroom continuously.
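The "stay below 90%" guard is a one-liner worth making explicit. Illustrative sketch; the allocation constant and function name are assumptions, and in practice the current usage figure comes from the quota ledger.

```python
# Soft-ceiling guard for Risk 2: never let sustained Anthropic direct
# usage approach the allocation, so bursts don't trigger 429s.
ALLOCATION_TPM = 12_000_000   # emergency allocation from the playbook
SOFT_CEILING = 0.90           # operate at most at 90% of allocation

def spill_budget(current_anthropic_tpm: int) -> int:
    """Tokens/min still spillable before hitting the soft ceiling."""
    return max(0, int(ALLOCATION_TPM * SOFT_CEILING) - current_anthropic_tpm)
```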

Risk 3 — Multi-provider increases observability noise

  • Detection: dashboard partition by provider should show comparable metrics.
  • Mitigation: provider adapter normalizes responses (latency reporting, error codes); provider is a tag in events.

Risk 4 — Cost surprise — Anthropic direct pricing is per-token, not flat like Bedrock provisioned capacity

  • Detection: cost dashboard partition by provider.
  • Mitigation: pre-budgeted; finance pre-notified.

What the harness contributed

  Harness piece                        Role
  Quota manager (story 08)             The reservation system that prevented 429 leakage
  Priority classes P0-P3 (story 08)    Pre-built shedding order; no on-the-fly decisions
  Provider adapters (story 02)         Anthropic direct as fallback was a config flip
  Capability flags (story 02)          Per-cohort spill rollout
  Active eval (story 04)               Shadow eval on Anthropic direct; promotion gate
  Versioned events (story 08)          Per-provider dashboards for diagnosis

Without these: an outage. With these: controlled absorption with quality dip.


Q&A drill — opening question

Q: Why don't you just always use Anthropic direct? Why depend on Bedrock at all?

Bedrock advantages:

  • Lower per-token cost at our committed volume (~12% cheaper).
  • Better integration with AWS-native auth, IAM, and observability.
  • Bundled with our enterprise agreement.

Anthropic direct advantages:

  • Faster cap raises during emergencies.
  • Often gets new model releases first.
  • Less abstraction means more control over rate-limit headers and retries.

The right answer is both, with Bedrock primary + Anthropic standing-redundancy at 10%. We've kept that posture since this incident.


Grilling — Round 1

Q1. Why not provision more Bedrock TPM in advance to never hit this?

Three reasons:

  • Cost. Reserved provisioned capacity is 25-40% more expensive than on-demand; over-provisioning to handle a hypothetical 50% cut would double our spend.
  • Quota cuts can exceed the reservation. Even reserved capacity can be partly clawed back during regional capacity shortages; no allocation is fully guaranteed.
  • Multi-provider is more resilient than over-provisioning one provider. It shifts the dependency from a single contract to a portfolio.

Q2. 48-hour notice is generous. What about a no-notice cut (e.g., regional outage)?

The same playbook compresses to <2 hours. We have:

  • A "quota cut" runbook with specific commands (paste into terminal; spill turns on automatically).
  • Pre-tested capability flags for emergency spill.
  • An on-call rotation that knows the runbook.

A no-notice cut hits us with ~5-15 minutes of degraded latency before the spill activates. The quota manager's reserve-before-invoke model prevents user-visible failures even during this window — saturated requests downshift or queue, they don't 429.

Q3. Anthropic direct emergency allocation — how reliable is that promise?

Provider relationships are part of the architecture. We have:

  • A signed standing agreement with Anthropic for emergency cap raises with a 1-hour SLA (paid separately).
  • A list of pre-approved use cases; "Bedrock cut" is on the list.
  • A dedicated relationship manager for both AWS and Anthropic.

The agreement isn't free; we pay a retainer. Worth it because the alternative is "negotiate during incident."


Grilling — Round 2 (architect-level)

Q4. Walk through the queue dynamics during the cut. With reserve-before-invoke, what happens to a P3 batch job mid-execution?

Mid-execution P3 jobs are not killed — they finish under their existing reservations. New P3 reservations are denied; new P3 jobs sit in their queue. When Bedrock has headroom (off-peak hours), P3 reservations resume. Effective queue depth grows during peak; drains overnight.

The pause/resume design (story 05) makes this clean — P3 jobs can pause indefinitely without losing state.
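The queue dynamics described above fit in a few lines. A minimal sketch, assuming the quota manager exposes a P3 reservation call (here passed in as `reserve_p3`); the names and the bare `deque` are illustrative, not the story 05 implementation.

```python
from collections import deque

p3_queue: deque = deque()

def submit_p3(job, reserve_p3) -> str:
    """In-flight jobs keep their existing reservations; a denied
    reservation means the job just waits -- no kill, no retry storm."""
    if reserve_p3():
        return "running"        # holds its reservation until completion
    p3_queue.append(job)
    return "queued"

def drain_p3(reserve_p3) -> int:
    """Called when headroom returns (off-peak); resumes queued jobs
    in FIFO order until reservations are denied again."""
    started = 0
    while p3_queue and reserve_p3():
        p3_queue.popleft()
        started += 1
    return started
```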

Q5. How do you know the Anthropic direct spill is actually safe before ramping P0 traffic to 25%?

Three gates:

  • Shadow eval — 24 hours where Anthropic direct receives a copy of P0 requests and is judged in parallel; per-rubric delta computed.
  • 5% canary — real users see Anthropic responses for 24 hours, with online eval and a user-satisfaction proxy (session length, follow-up rate).
  • Per-locale check — some locales might respond differently; subset gates per locale.

If any gate fails, ramp halts. We accept slower mitigation timeline (risk: more days at compromised capacity) over fast-but-broken ramp.
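The "any gate fails, ramp halts" logic is a conjunction over the three gates. An illustrative sketch; the function name, threshold values, and delta sign conventions (positive = regression on eval deltas, negative = worse on satisfaction) are assumptions, not the real promotion-gate config.

```python
def may_ramp(shadow_delta: float,
             canary_satisfaction_delta: float,
             locale_deltas: dict[str, float]) -> bool:
    """True only if every gate passes; any single failure halts the ramp."""
    if shadow_delta > 0.003:                # shadow eval: >0.3% rubric regression
        return False
    if canary_satisfaction_delta < -0.01:   # canary: satisfaction proxy dropped
        return False
    # per-locale gate: no locale may regress past its own threshold
    return all(d <= 0.005 for d in locale_deltas.values())
```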

Q6. What's the long-term architectural lesson?

Multi-provider as steady state is now baseline. We previously treated multi-provider as a contingency; the cost (slightly more complex, ~3-5% cost premium for Anthropic direct slice) is small relative to the resilience.

The deeper lesson: provider quotas are not infrastructure you own. They're a contract. Architectural assumptions like "Bedrock TPM is constant" are wrong; design for the contract being renegotiable downward.

This shaped subsequent decisions: we evaluate every new model integration with a "what's the cap-raise SLA?" question, not just price/quality.


Intuition gained

  • Quota cuts are a forcing function for multi-provider design. Standing redundancy beats over-provisioning one provider.
  • Reserve-before-invoke quota model prevents user-visible 429s during saturation.
  • Priority classes are the shedding order; pre-built, not invented during incident.
  • Provider relationships are architecture. SLAs for emergency cap raises are real engineering choices.
  • The harness's value compounds. Quota manager + adapters + capability flags + eval framework — each was useful alone, together they made this incident a non-event.

See also

  • 01-10x-user-surge.md — internal-driven quota stress
  • 02-foundation-model-deprecated.md — same adapter pattern, different timing
  • 03-cost-budget-halved.md — voluntary version of the squeeze
  • User stories 02, 04, 08