
Constraint Scenario 01 — 10× User Surge Overnight

Trigger: A surprise viral anime adaptation drops at midnight JST. MangaAssist concurrent traffic in JP, KR, US-West goes from 1.2M to 12M in 90 minutes. Pillars stressed: P2 (Harness) primarily; P1 + P3 secondary.


TL;DR

The harness either absorbs the burst or the brand takes a hit. Auto-scaling alone is insufficient — at 12M concurrent the bottleneck shifts from compute to provider quota, queue depth, and cache warmth. The pivot is to shed gracefully, not refuse loudly: downshift models, defer non-critical features, drop the eval sampling rate, and lean hard on the inference cache. Live chat stays up; quality dips ~7% and recovers as we negotiate emergency quota.


The change

| Metric | Steady state | Surge state |
|---|---|---|
| Concurrent sessions | 1.2M | 12M |
| Turns/sec | 27.7K | 280K |
| Bedrock TPM consumed | 65% of ceiling | 980% of ceiling without action |
| p95 first-token latency | 1.4s | 7.8s if untouched |
| Cost run-rate | $14M/day | $140M/day if untouched |

All of these numbers change at once. Fixing one without the others moves the bottleneck.


The cascade — what breaks first

```mermaid
flowchart TB
  Surge[10x traffic surge] --> Q[Queue depth grows]
  Surge --> TPM[Bedrock TPM ceiling hit]
  Surge --> CC[Container cold-starts]
  Surge --> Cache[Cache hit rate collapses - new title hot]

  TPM --> Throt[Provider 429s]
  Throt --> Retry[Retry storm amplifies]
  Retry --> TPM

  Q --> Lat[p95 latency spike]
  CC --> Lat

  Cache --> Cost[Cost per turn rises]
  Lat --> Drop[User abandonment]
  Drop --> Refresh[F5 refresh wave]
  Refresh --> Surge
```

The two feedback loops are the dangerous parts:

  • Retry storm amplifies the original throttling — without quota-aware backoff, retries make the throttle permanent.
  • Refresh wave — users hitting reload generate even more load when they see slow responses.
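
To make the first loop concrete, here is a minimal sketch of quota-aware backoff. It assumes a hypothetical quota manager exposing a `saturation()` reading and an adapter that maps provider 429s to a `ThrottledError`; nothing here is the production client. The point is simply that a retry is only issued while the ceiling still has headroom.

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for whatever the provider adapter raises on a 429."""

class QuotaSaturatedError(Exception):
    """Raised instead of retrying when the TPM ceiling itself is the problem."""

def call_with_quota_aware_backoff(invoke, quota, max_attempts=3, base_delay=0.5):
    """Retry throttled calls only while the account-level quota has headroom."""
    for attempt in range(max_attempts):
        try:
            return invoke()
        except ThrottledError:
            if quota.saturation() > 0.90:
                # Retrying against a saturated ceiling feeds the storm;
                # fail fast so the caller can downshift or shed instead.
                raise QuotaSaturatedError("TPM ceiling saturated; shed, don't retry")
            # Exponential backoff with full jitter keeps retries decorrelated.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
    raise QuotaSaturatedError("retries exhausted")
```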


The pivot — graceful shed (the playbook)

Phase 1 — Detect & contain (T0 to T+5 min)

| Action | Trigger | Effect |
|---|---|---|
| Quota saturation alarm | TPM > 85% of ceiling | Pages on-call; auto-fires next steps |
| Burst fleet pre-warm (aggressive) | Warm utilization > 65% | Scales fleet 3× over 8 minutes |
| P3 batch pause | TPM > 90% | Releases ~10% of TPM for sync |
| P2 eval sampling drop | TPM > 90% | 1% → 0.2%, releases ~5% TPM |

After Phase 1: ~15% TPM headroom recovered, on-call aware.
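
As a sketch of how the triggers in the table could compose, assuming the thresholds above and an illustrative control-plane shape (the field and function names are not the real API):

```python
from dataclasses import dataclass

@dataclass
class Phase1Actions:
    page_oncall: bool = False
    prewarm_burst_fleet: bool = False
    pause_p3_batch: bool = False
    eval_sample_rate: float = 0.01   # steady-state 1%

def phase1_containment(tpm_saturation: float, warm_fleet_utilization: float) -> Phase1Actions:
    """Map the Phase 1 triggers onto concrete containment actions."""
    actions = Phase1Actions()
    if tpm_saturation > 0.85:
        actions.page_oncall = True            # quota saturation alarm
    if warm_fleet_utilization > 0.65:
        actions.prewarm_burst_fleet = True    # scale fleet ~3x over ~8 minutes
    if tpm_saturation > 0.90:
        actions.pause_p3_batch = True         # releases ~10% of TPM for sync
        actions.eval_sample_rate = 0.002      # 1% -> 0.2%, releases ~5% TPM
    return actions
```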

Phase 2 — Downshift quality for capacity (T+5 to T+30 min)

| Action | Effect on quality | Effect on capacity |
|---|---|---|
| Planner: Sonnet 4.7 → Sonnet 4.6 | Small (~1% eval) | +12% throughput |
| Sub-agents: Sonnet → Haiku 4.5 (selective) | ~3% eval | +28% sub-agent throughput |
| Drop trending-MCP from default fan-out | -5% recommendation diversity | +1 tool call saved/turn |
| Race patterns → delegate (story 03) | +200ms latency | -50% sub-agent token cost |

After Phase 2: capacity nearly doubled, quality down ~5-7%, latency stable.
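
A hedged sketch of what the surge-mode capability flag changes. The model identifiers and the plain dict stand in for the real routing config; the values mirror the table above.

```python
def pick_models(surge_mode: bool) -> dict:
    """Model and pattern routing under the surge-mode flag (illustrative names)."""
    if not surge_mode:
        return {
            "planner": "sonnet-4.7",
            "sub_agents": "sonnet-4.7",
            "fan_out": ["catalog-search", "trending-mcp"],
            "sub_agent_pattern": "race",      # parallel, lowest latency
        }
    return {
        "planner": "sonnet-4.6",              # ~1% eval delta, +12% throughput
        "sub_agents": "haiku-4.5",            # selective downshift
        "fan_out": ["catalog-search"],        # trending-MCP dropped from default
        "sub_agent_pattern": "delegate",      # ~-50% sub-agent token cost
    }
```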

Phase 3 — Negotiate & re-balance (T+30 min to T+4 h)

  • Open ticket with Bedrock for emergency TPM increase (relationship-managed; expected response ~1-2 hours).
  • Re-route a portion of Sonnet load to the Anthropic direct API (fallback provider already wired in story 02's adapter pattern; a capability-flag flip activates it; see the sketch after this list).
  • Stagger non-critical surfaces: alerts, weekly summaries delayed by 30-60 min.
  • Communications: status banner "we're seeing extra excitement; some recommendations may be simpler than usual."
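
The re-route in the second bullet leans on the story 02 adapter pattern. The sketch below assumes a shared ProviderAdapter interface and a hypothetical flags.fraction() lookup, so treat the names as illustrative rather than the actual LLD types.

```python
import random
from typing import Protocol

class ProviderAdapter(Protocol):
    """Common surface both providers implement (story 02 pattern, sketched)."""
    def invoke(self, prompt: str, model: str) -> str: ...

def route_sonnet_request(prompt: str, model: str,
                         bedrock: ProviderAdapter,
                         anthropic_direct: ProviderAdapter,
                         flags) -> str:
    """Send a configured slice of Sonnet traffic to the fallback provider."""
    # flags.fraction(...) is a hypothetical capability-flag lookup returning 0.0-1.0.
    if "sonnet" in model and random.random() < flags.fraction("sonnet-fallback"):
        return anthropic_direct.invoke(prompt, model)
    return bedrock.invoke(prompt, model)
```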

Phase 4 — Recover (T+4 h to next day)

  • Provider quota raised; revert downshifts in reverse order.
  • Each revert is canary'd via eval (story 04) before going to 100% (see the sketch after this list).
  • Post-mortem: what was the warning signal we missed? What did the eval say during shed? Which downshifts hurt quality most?
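
A minimal sketch of that revert loop (first two bullets), assuming a hypothetical flags client and an eval_canary callback standing in for the story 04 harness:

```python
def revert_downshifts(downshifts, flags, eval_canary, canary_fraction=0.15):
    """Unwind surge-mode flags in reverse order of application.

    `downshifts` is the ordered list of flags flipped during the surge;
    flags.set_fraction(flag, x) sets the share of traffic still downshifted.
    """
    for flag in reversed(downshifts):
        flags.set_fraction(flag, 1.0 - canary_fraction)  # revert a canary slice first
        if not eval_canary(flag):
            flags.set_fraction(flag, 1.0)                # quality regressed: keep the downshift
            continue
        flags.set_fraction(flag, 0.0)                    # canary clean: full revert
```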

What made the shed possible — the harness pieces that pre-existed

| Pre-existing piece | What it bought during surge |
|---|---|
| Quota manager (story 08) | Reservation-based; saw saturation early; orchestrated downshifts |
| Pause/resume (story 05) | Allowed P3 batch to pause and resume cleanly |
| Provider adapters (story 02) | Anthropic direct as fallback was a config flip, not a code change |
| Capability flags (story 02) | Selective downshifts per-segment without redeploys |
| Active eval (story 04) | Knew which downshifts cost the least quality |
| Inference cache (story 06) | Re-warmed for the new viral title; ~3 hours to reach 30% hit rate on new content |
| Typed graph (story 01) | Fallback edges already drawn; no new code paths needed |

This is the entire point of the harness investment. Without these pre-existing pieces, the surge becomes an outage. With them, it becomes a controlled degradation.


What if we hadn't built the harness?

| Naive system response | Outcome |
|---|---|
| Scale up containers blindly | Hits Bedrock TPM ceiling, 60-70% of requests 429 |
| Retry on 429 with default backoff | Retry storm; provider may temporarily ban our keys |
| No model fallback | Stuck on saturated Sonnet endpoint |
| No queue separation | Batch jobs draining sync chat's quota — chat fails first |
| No eval to guide degradation | Random quality regressions; can't tell what's safe to shed |
| No cache | Every viral request is a cold lookup; cost spirals |

Net effect of "naive" path: ~45-minute outage, $XM in lost shopping conversion, brand hit, postmortem with 23 action items. Net effect of harness path: 7% quality dip for 4 hours, no outage, postmortem with 4 action items.


The architecture in surge mode

```mermaid
flowchart TB
  Edge[ALB + WebSocket] --> RL[Rate limiter chokepoint]
  RL --> QM[Quota manager - saturation aware]
  QM --> AgentFleet[Agent fleet - hot + burst]

  AgentFleet --> CB1[Capability flags: surge-mode active]
  CB1 --> Plan[Planner: Sonnet 4.6 fallback]
  CB1 --> Sub[Sub-agents: Haiku-selective]

  Plan --> SkillReg[Skill registry - tool subset]
  Sub --> SkillReg

  SkillReg --> Cache{Inference cache}
  Cache -->|hit 38pct| Resp[Response]
  Cache -->|miss| Provider1[Bedrock]
  Cache -->|miss| Provider2[Anthropic direct]

  Provider1 --> Resp
  Provider2 --> Resp

  Resp --> Stream[Stream to user]

  AgentFleet -.->|reduced sample| Eval[Eval - 0.2 pct]
  AgentFleet -.->|paused| Batch[Batch jobs]
```

The shape is the same. The knobs are turned.


Q&A drill — opening question

Q: Why not just scale infinitely? Cloud is elastic.

Compute is elastic; provider TPM is not. Bedrock has account-level token-per-minute ceilings that take days/weeks to raise via formal request. During a surprise surge you cannot scale past your TPM ceiling — period. The harness's job is to do something useful with the TPM you have, not to assume more is on the way.

This is why provider adapters and quota management are in the LLD. They are the only levers that work on the timescale of a surge.
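
As an illustration of "do something useful with the TPM you have", here is a toy reservation-based budget split. The priority names and percentages are assumptions for the sketch, not the story 08 implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TpmBudget:
    """Reservation-based split of a fixed provider TPM ceiling (illustrative numbers)."""
    ceiling_tpm: int
    reservations: dict = field(default_factory=lambda: {
        "p1_sync_chat": 0.70,   # live chat is protected first
        "p2_eval": 0.05,
        "p3_batch": 0.25,
    })

    def allowance(self, priority: str) -> int:
        """Tokens per minute this priority may consume right now."""
        return int(self.ceiling_tpm * self.reservations[priority])

    def shed_batch(self):
        """Surge move: hand most of the batch reservation to sync chat."""
        self.reservations["p3_batch"] = 0.05
        self.reservations["p1_sync_chat"] = 0.90
```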


Grilling — Round 1

Q1. How do you avoid the refresh wave from amplifying the surge?

Three tactics:

  • Streaming heartbeat (story 07) — even when slow, users see "thinking..." rather than a frozen page, which reduces F5 propensity (sketched below).
  • Status banner — explicit communication that we're busy, with an ETA. Surprisingly effective at reducing refreshes.
  • Connection-keepalive prioritization at the load balancer — existing connections are served before new ones during saturation, so new refreshes are effectively queued.
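
The heartbeat tactic is roughly this shape; a sketch assuming an async `send` callable and a future that resolves on the first model token (not the real story 07 transport):

```python
import asyncio

async def stream_with_heartbeat(send, first_token_future, heartbeat_every=2.0):
    """Keep the WebSocket visibly alive while the first token is still queued."""
    while not first_token_future.done():
        # Tell the client we are thinking, not frozen.
        await send({"type": "heartbeat", "status": "thinking"})
        try:
            # shield() so the timeout does not cancel the underlying generation.
            await asyncio.wait_for(asyncio.shield(first_token_future), timeout=heartbeat_every)
        except asyncio.TimeoutError:
            continue   # still waiting; emit another heartbeat on the next loop
    await send({"type": "token", "data": first_token_future.result()})
```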

Q2. Capability flags switching surge-mode on globally — what's the rollout shape? You don't want to flip 12M users at once and break things.

The "surge-mode" flag is a gradient, not boolean. Phases 1-2 of the playbook each flip flags for ~15% of traffic at a time, with eval observation, before going to 100%. Even at surge speeds (5-min phases), the canary discipline holds. The one exception is P3 batch pause, which is binary (and safe — pausing batch is non-customer-facing).

Q3. What about the new viral title's data — can we even fetch its metadata fast enough?

This is the catalog-search MCP scenario (story 02). On a surge for a single new title we observe hot-key behavior: the MCP's internal cache amplifies hits ~50× compared to long-tail queries, so catalog-search itself is not the constraint at this scale. The bottleneck is the planner's LLM call reasoning about the query.
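
For intuition on why one viral title barely touches the catalog, a toy TTL cache of the kind the MCP is assumed to keep internally; a single hot key means almost every request is served from memory.

```python
import time

class TtlCache:
    """Tiny in-process TTL cache; one viral title becomes one hot key."""
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store = {}

    def get_or_fetch(self, key, fetch):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry and now - entry[1] < self.ttl:
            return entry[0]              # hot-key hit: no catalog round trip
        value = fetch(key)               # cold or expired: one real lookup
        self._store[key] = (value, now)
        return value
```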


Grilling — Round 2 (architect-level)

Q4. Walk through the trade-off between race-pattern (parallel sub-agents) and delegate-pattern under surge.

Race costs ~2.5× tokens for ~40% latency win. Under surge, tokens are the scarce resource, not latency. Switching race → delegate frees tokens at the cost of slower responses. We accept slower because the alternative is failing requests.

The trade-off is asymmetric: a 1.5s slower response is acceptable; a failure is not. Under steady state, the calculus reverses: tokens are cheap-ish, latency is the user-visible signal, race pattern wins. Different constraint, different policy. Same agent code.
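
A sketch of the two patterns behind one switch, assuming sub-agents are async callables: the race half fires everything and cancels the losers, the delegate half spends a single token budget.

```python
import asyncio

async def run_sub_agents(task, agents, surge_mode: bool):
    """Race pattern in steady state, delegate pattern under surge."""
    if not surge_mode:
        # Race: fire every agent, keep the first result, cancel the rest.
        tasks = [asyncio.create_task(agent(task)) for agent in agents]
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for t in pending:
            t.cancel()
        return next(iter(done)).result()
    # Delegate: one agent, one token budget, slower but never amplifies load.
    return await agents[0](task)
```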

Q5. How does the eval framework cope when sampling drops to 0.2%?

Detection latency suffers. Steady state: drift detection in 30 min. At 0.2% sampling: drift detection in 2-3 hours. We accept this during surge because the alternative is starving live chat of quota. We compensate by:

  • Synthetic probe rate held constant — counterfactual probes from story 04 are not reduced; they're a tiny fixed budget.
  • Per-priority-flag dashboards — we have specific eval coverage on the surge-mode-on cohort, even if absolute volume is low.

We can tell within 2 hours whether the downshifts caused unexpected quality regressions.

This is a deliberate accepted degradation of the eval signal in service of survival.
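
The sampling knob itself is small; a sketch under the assumption that synthetic counterfactual probes are flagged on the turn and always evaluated.

```python
import random

def should_eval(is_synthetic_probe: bool, sample_rate: float) -> bool:
    """Decide whether a turn enters the eval pipeline.

    Organic traffic is sampled at `sample_rate` (1% steady, 0.2% under surge);
    synthetic probes are always evaluated because they are a small fixed budget
    and carry the drift signal.
    """
    if is_synthetic_probe:
        return True
    return random.random() < sample_rate
```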

Q6. What's the post-surge re-warming strategy?

Three layers, in order:

  1. Provider TPM — restore reservations across priorities; un-pause P3 batch carefully (don't drive saturation up while sync is still high).
  2. Cache re-warming — the inference cache for the viral title is now hot; for the longer-tail content that dropped out of cache during surge, no special action — natural traffic re-warms it.
  3. Capability flags — flip surge-mode flags off in reverse order (least quality-impactful first → most). Each flip is canary'd.

Total revert window ~4 hours.

Postmortem must capture: what was the leading indicator of surge? Did the rate-of-traffic-change alarm fire ahead of saturation? If not, that's an action item. The detection latency is the most important number to improve cycle-over-cycle.


Intuition gained

  • The bottleneck under surge is provider TPM, not compute. The harness levers are quota mgmt, fallback adapters, capability flags, and eval-guided downshifts.
  • Graceful shed > loud refusal. Users prefer slightly worse answers to no answer.
  • The harness is the surge response. Without pre-built pieces, the surge becomes an outage; with them, it's a controlled dip.
  • Eval sampling rate is itself a knob. Drop it under pressure; restore on recovery.
  • Refresh waves are amplifiers. Heartbeats, banners, and connection prioritization mitigate them.

See also

  • 02-foundation-model-deprecated.md — different timing, same adapter machinery
  • 04-latency-sla-tightened.md — different lever, same graph
  • 07-provider-quota-revoked.md — surge is temporary; quota cut is permanent
  • User stories 02, 04, 05, 06, 07, 08