
03: Scenarios and Runbooks -- High-Performance FM Systems

MangaAssist is a JP manga store chatbot running on AWS. It uses Bedrock Claude 3 (Sonnet for complex queries, Haiku for simple ones), OpenSearch Serverless for vector retrieval, DynamoDB for session and catalog data, ECS Fargate for orchestration, API Gateway WebSocket for real-time delivery, and ElastiCache Redis for caching. The system handles 1M messages/day with a target of under 3 seconds end-to-end response time.


Skill Mapping

| Field | Value |
| --- | --- |
| Domain | 4 -- Operational Efficiency Optimization |
| Task | 4.1 -- Optimize foundation model cost and performance |
| Skill | 4.1.3 -- Implement strategies for high-performance FM systems including batching, capacity planning, utilization monitoring, auto-scaling, and provisioned throughput optimization |
| Focus | Five production scenarios with full Problem-Detection-Root Cause-Resolution-Prevention runbooks and decision trees |

Scenario 1: New Manga Release Causes 5x Traffic Spike and Bedrock Throttling

Problem Statement

Monday 00:05 JST. Weekly Shonen Jump digital release just went live. MangaAssist traffic surges from 7,000 tokens/s to 35,000 tokens/s within 3 minutes. The on-call engineer receives a PagerDuty alert: BedrockThrottleAlarm ALARM -- InvocationThrottles > 0 for 1 evaluation period. Users report the chatbot returning "I'm sorry, I'm having trouble responding right now" fallback messages. WebSocket connections remain open, but responses are timing out against the 10-second integration timeout configured on API Gateway.

Detection

| Signal | Source | Value | Normal Range |
| --- | --- | --- | --- |
| InvocationThrottles | AWS/Bedrock CloudWatch | 47 in last 60s | 0 |
| TokensPerSecond | MangaAssist/TokensPerSecond | 35,200 | 7,000 (baseline) |
| BedrockQueueDepth | MangaAssist/BedrockQueueDepth | 340 | < 50 |
| RequestLatencyP99 | MangaAssist/RequestLatencyP99 | 9,800ms | < 3,000ms |
| ECS task count | AWS/ECS RunningCount | 30 (scaling in progress) | 15 (pre-event) |
| Provisioned throughput | Bedrock console | 15 Sonnet units (15K TPS) | 15 Sonnet units |
| User-visible errors | WebSocket fallback rate | 12% of responses | < 0.1% |

Root Cause Analysis

flowchart TD
    A[Bedrock throttling at 00:05 JST] --> B{Was event<br>calendar configured?}
    B -->|No| C[ROOT CAUSE 1:<br>Event not in calendar.<br>No pre-provisioning happened.]
    B -->|Yes| D{Did provisioning<br>trigger on time?}
    D -->|No| E[ROOT CAUSE 2:<br>Scheduler Lambda failed<br>or timed out.]
    D -->|Yes| F{Was provisioned<br>capacity sufficient?}
    F -->|No| G[ROOT CAUSE 3:<br>Capacity multiplier<br>underestimated. 3x was<br>configured but 5x occurred.]
    F -->|Yes| H{Did provisioning<br>complete before event?}
    H -->|No| I[ROOT CAUSE 4:<br>Lead time too short.<br>Provisioning was still<br>in progress at 00:00.]
    H -->|Yes| J[Investigate:<br>Traffic exceeded all<br>predictions. Check if<br>new manga title drove<br>abnormal interest.]

In this incident: Root Cause 3. The event calendar had the Monday release configured with a 3x multiplier (21K TPS provisioned), but the actual traffic reached 5x because a highly anticipated manga series had its final chapter release -- an unusual event layered on top of the normal weekly release.

Resolution -- Step-by-Step Runbook

Immediate (0-5 minutes):

  1. Acknowledge the PagerDuty alert.
  2. Open the MangaAssist CloudWatch dashboard. Confirm throttles are active and TPS exceeds provisioned capacity.
  3. Activate emergency scaling in the GenAIAutoScaler:
    # Via the operations API or direct AWS CLI
    aws bedrock update-provisioned-model-throughput \
      --provisioned-model-id <sonnet-provisioned-id> \
      --desired-model-units 30
    
  4. Extend Redis cache TTL from 30s to 120s to reduce Bedrock invocation volume:
    # Via the operations API
    curl -X POST https://ops.mangaassist.internal/cache/ttl \
      -d '{"ttl_seconds": 120, "reason": "throttle_mitigation"}'
    
  5. Activate Sonnet-to-Haiku downgrade for faq and greeting intents (they don't need Sonnet quality):
    curl -X POST https://ops.mangaassist.internal/model-routing/override \
      -d '{"intents": ["faq", "greeting"], "model": "haiku", "reason": "throttle_mitigation"}'
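
The intent downgrade in step 5 amounts to a routing map the orchestrator consults before each invocation. A minimal sketch, assuming the Claude 3 model IDs exposed by Bedrock; the override store and function names are hypothetical, not the actual MangaAssist ops API:

    import json
    import boto3

    bedrock_runtime = boto3.client("bedrock-runtime")

    SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"
    HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

    # Active overrides written by the ops API, e.g. {"faq": HAIKU, "greeting": HAIKU}.
    routing_overrides: dict[str, str] = {}

    def model_for_intent(intent: str, default: str = SONNET) -> str:
        return routing_overrides.get(intent, default)

    def invoke(intent: str, prompt: str, max_tokens: int = 1024) -> str:
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        }
        response = bedrock_runtime.invoke_model(
            modelId=model_for_intent(intent),
            body=json.dumps(body),
        )
        return json.loads(response["body"].read())["content"][0]["text"]

Keeping the override map in a shared store (for example, the existing ElastiCache Redis) would let the ops API flip routing for all ECS tasks at once.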
    

Short-term (5-15 minutes):

  1. Monitor InvocationThrottles -- should drop to 0 within 5-10 minutes after provisioned throughput increase completes.
  2. Verify RequestLatencyP99 is returning below 3,000ms.
  3. Check ECS task count has stabilized (auto-scaler should have added tasks).
  4. Verify Redis cache hit rate has increased (expected: from 15% to 40%+ during mitigation).

Post-incident (within 24 hours):

  1. Revert Sonnet-to-Haiku override for faq and greeting intents.
  2. Restore Redis cache TTL to 30s.
  3. Update the event calendar: add a "final chapter release" event type with 5x multiplier.
  4. Write incident post-mortem.

Prevention

| Prevention Measure | Implementation | Owner |
| --- | --- | --- |
| Add "final chapter" event type to calendar | Update DynamoDB event table with 5x multiplier when publisher announces final chapters | Content ops team |
| Increase default Monday multiplier from 3x to 4x | Update GenAIAutoScaler event config | Platform team |
| Add pre-provisioning lead time buffer | Change from 2h to 4h before Monday releases | Platform team |
| Implement automatic escalation when TPS exceeds 80% of provisioned capacity for 2 minutes | Add CloudWatch alarm + Lambda to request additional units automatically | Platform team |
| Add publisher announcement monitoring | RSS/API feed from Shueisha for release announcements | Content ops team |
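
The automatic-escalation row above can be driven by a small scheduled Lambda. A minimal sketch, assuming the MangaAssist/TokensPerSecond metric from the detection table and a hypothetical SNS escalation topic; the 80%-for-2-minutes rule matches the prevention measure:

    import os
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    sns = boto3.client("sns")

    def lambda_handler(event, context):
        # Assumed configuration: current provisioned capacity and an SNS topic for escalation.
        provisioned_tps = float(os.environ["PROVISIONED_TPS"])   # e.g. 15000 for 15 Sonnet units
        topic_arn = os.environ["ESCALATION_TOPIC_ARN"]

        # Average TokensPerSecond over the last two 60-second periods.
        now = datetime.now(timezone.utc)
        resp = cloudwatch.get_metric_statistics(
            Namespace="MangaAssist",
            MetricName="TokensPerSecond",
            StartTime=now - timedelta(minutes=2),
            EndTime=now,
            Period=60,
            Statistics=["Average"],
        )
        datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
        if len(datapoints) < 2:
            return {"status": "insufficient_data"}

        # Escalate only if both minutes exceeded 80% of provisioned capacity.
        if all(d["Average"] > 0.8 * provisioned_tps for d in datapoints):
            sns.publish(
                TopicArn=topic_arn,
                Subject="Bedrock utilization above 80% of provisioned capacity for 2 minutes",
                Message=f"TokensPerSecond={datapoints[-1]['Average']:.0f}, provisioned={provisioned_tps:.0f}",
            )
            return {"status": "escalated"}
        return {"status": "ok"}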

Scenario 2: Over-Provisioned Throughput During Low-Traffic Hours Wastes Budget

Problem Statement

The monthly AWS bill review reveals that MangaAssist spent $18,400 on Bedrock provisioned throughput during the 02:00-06:00 JST window over the past month, but actual utilization during those hours averaged only 8% of provisioned capacity. The provisioned throughput (15 Sonnet units + 5 Haiku units = 20K TPS capacity) was running 24/7 with no time-of-day adjustment. At 1,500 tokens/s actual overnight traffic, 92% of provisioned capacity was wasted -- paying for 20,000 TPS while using 1,500 TPS.

Detection

| Signal | Source | Value | Expected |
| --- | --- | --- | --- |
| Waste Ratio (02:00-06:00 JST) | Custom metric: 1 - (actual_TPS / provisioned_TPS) | 92% | < 40% |
| Overnight provisioned cost (monthly) | Cost Explorer, filtered to Bedrock provisioned | $18,400 | Should be ~$0 (on-demand cheaper) |
| Overnight on-demand equivalent cost | 1,500 TPS * on-demand rate * 4 hours * 30 days | ~$2,600 | N/A |
| Wasted spend (monthly) | Provisioned cost - on-demand equivalent | $15,800 | $0 |
| Provisioned unit schedule | GenAIAutoScaler config | No overnight reduction configured | Reduce to 0 overnight |
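
The WasteRatio signal in the first row is a custom metric. A minimal sketch of a publisher for it, assuming the MangaAssist namespace used elsewhere in this document; the provisioned capacity is passed in rather than looked up:

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def publish_waste_ratio(provisioned_tps: float) -> float:
        """Compute 1 - (actual_TPS / provisioned_TPS) and publish it as MangaAssist/WasteRatio."""
        now = datetime.now(timezone.utc)
        resp = cloudwatch.get_metric_statistics(
            Namespace="MangaAssist",
            MetricName="TokensPerSecond",
            StartTime=now - timedelta(minutes=5),
            EndTime=now,
            Period=300,
            Statistics=["Average"],
        )
        actual_tps = resp["Datapoints"][0]["Average"] if resp["Datapoints"] else 0.0
        waste_ratio = max(0.0, 1.0 - actual_tps / provisioned_tps) if provisioned_tps else 0.0

        cloudwatch.put_metric_data(
            Namespace="MangaAssist",
            MetricData=[{
                "MetricName": "WasteRatio",
                "Value": waste_ratio,
                "Unit": "None",  # ratio between 0 and 1; multiply by 100 for percentage dashboards
            }],
        )
        return waste_ratio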

Root Cause Analysis

flowchart TD
    A[92% waste ratio overnight] --> B{Is time-of-day<br>scheduling configured?}
    B -->|No| C[ROOT CAUSE 1:<br>Provisioned throughput<br>runs 24/7 at peak capacity.<br>No schedule was ever set up.]
    B -->|Yes| D{Does the schedule<br>include an overnight<br>reduction?}
    D -->|No| E[ROOT CAUSE 2:<br>Schedule exists but only<br>covers peak scale-up, not<br>off-peak scale-down.]
    D -->|Yes| F{Is the scheduled<br>reduction executing?}
    F -->|No| G[ROOT CAUSE 3:<br>Scheduler Lambda has<br>permission or execution<br>errors. Check CloudWatch<br>Logs for the Lambda.]
    F -->|Yes| H{Is the reduction<br>sufficient?}
    H -->|No| I[ROOT CAUSE 4:<br>Overnight minimum is set<br>too high. Review minimum<br>unit configuration.]
    H -->|Yes| J[Not a provisioning issue.<br>Check if on-demand<br>charges are the actual<br>cost driver.]

In this incident: Root Cause 1. The initial deployment configured provisioned throughput for the evening peak and never added a schedule to reduce it overnight. The team assumed provisioned throughput would be needed around the clock to avoid throttling, but overnight traffic is so low that on-demand handles it without any throttling risk.

Resolution -- Step-by-Step Runbook

Immediate (within 1 business day):

  1. Verify overnight traffic patterns over the last 30 days. Confirm that 02:00-06:00 JST consistently runs below 2,000 TPS.
  2. Calculate the breakeven point: at what TPS does provisioned become cheaper than on-demand? For MangaAssist, this is approximately 5,000 TPS.
  3. Design a time-of-day provisioning schedule:
| Time Window (JST) | Sonnet Units | Haiku Units | Rationale |
| --- | --- | --- | --- |
| 02:00 - 06:00 | 0 | 0 | Pure on-demand; traffic well below breakeven |
| 06:00 - 12:00 | 5 | 3 | Morning ramp; above breakeven by 08:00 |
| 12:00 - 17:00 | 8 | 3 | Afternoon steady state |
| 17:00 - 02:00 | 15 | 5 | Evening peak through late night |
  4. Implement the schedule using EventBridge Scheduler + Lambda (a boto3 sketch of these rules follows this list):

    # Pseudo-configuration for EventBridge rules
    Rule: overnight_reduction
      Schedule: cron(0 17 * * ? *)   # 02:00 JST = 17:00 UTC
      Target: Lambda "adjust_provisioned_throughput"
      Payload: {"sonnet_units": 0, "haiku_units": 0}
    
    Rule: morning_ramp
      Schedule: cron(0 21 * * ? *)   # 06:00 JST = 21:00 UTC
      Target: Lambda "adjust_provisioned_throughput"
      Payload: {"sonnet_units": 5, "haiku_units": 3}

    Rule: afternoon_steady
      Schedule: cron(0 3 * * ? *)    # 12:00 JST = 03:00 UTC
      Target: Lambda "adjust_provisioned_throughput"
      Payload: {"sonnet_units": 8, "haiku_units": 3}

    Rule: evening_peak
      Schedule: cron(0 8 * * ? *)    # 17:00 JST = 08:00 UTC
      Target: Lambda "adjust_provisioned_throughput"
      Payload: {"sonnet_units": 15, "haiku_units": 5}
    

  5. Deploy and monitor for 1 week. Verify no throttling during transitions.
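
A boto3 sketch of the schedule setup from step 4, covering all four windows in the table above. The Lambda and role ARNs are placeholders, and the cron expressions stay in UTC as in the pseudo-configuration:

    import json

    import boto3

    scheduler = boto3.client("scheduler")

    LAMBDA_ARN = "arn:aws:lambda:ap-northeast-1:123456789012:function:adjust_provisioned_throughput"
    ROLE_ARN = "arn:aws:iam::123456789012:role/scheduler-invoke-adjuster"   # placeholder

    # Cron expressions are in UTC; JST = UTC+9.
    SCHEDULES = [
        ("overnight_reduction", "cron(0 17 * * ? *)", {"sonnet_units": 0,  "haiku_units": 0}),
        ("morning_ramp",        "cron(0 21 * * ? *)", {"sonnet_units": 5,  "haiku_units": 3}),
        ("afternoon_steady",    "cron(0 3 * * ? *)",  {"sonnet_units": 8,  "haiku_units": 3}),
        ("evening_peak",        "cron(0 8 * * ? *)",  {"sonnet_units": 15, "haiku_units": 5}),
    ]

    def create_schedules() -> None:
        for name, cron, payload in SCHEDULES:
            scheduler.create_schedule(
                Name=name,
                ScheduleExpression=cron,
                FlexibleTimeWindow={"Mode": "OFF"},
                Target={
                    "Arn": LAMBDA_ARN,
                    "RoleArn": ROLE_ARN,
                    "Input": json.dumps(payload),
                },
            )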

Validation:

  1. After 1 week, compare costs. Expected monthly savings:
     - Previous: $18,400/month for overnight window.
     - New: ~$2,600/month (on-demand overnight) + reduced provisioned during morning.
     - Net savings: ~$15,800/month ($189,600/year).
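
A rough sketch of the breakeven calculation behind step 2 of the runbook and the savings comparison above. The per-unit and per-token prices are placeholder assumptions chosen only to make the arithmetic concrete, not published AWS rates:

    # Placeholder prices -- NOT published AWS rates; substitute real ap-northeast-1 pricing.
    PROVISIONED_COST_PER_UNIT_HOUR = 10.0    # USD per Sonnet model unit per hour (assumed)
    ON_DEMAND_COST_PER_1K_TOKENS = 0.008     # USD per 1,000 tokens, blended input/output (assumed)

    def on_demand_cost_per_hour(tps: float) -> float:
        """Hourly cost of serving a sustained token rate on demand."""
        return tps * 3600 / 1000 * ON_DEMAND_COST_PER_1K_TOKENS

    def breakeven_tps(provisioned_units: int) -> float:
        """Sustained TPS above which the provisioned fleet is cheaper than paying on demand."""
        hourly_provisioned = provisioned_units * PROVISIONED_COST_PER_UNIT_HOUR
        return hourly_provisioned / (3.6 * ON_DEMAND_COST_PER_1K_TOKENS)

    # With these placeholder rates, 15 Sonnet units break even around 5,200 TPS,
    # while 1,500 TPS overnight is far cheaper to serve on demand.
    print(f"breakeven ~{breakeven_tps(15):,.0f} TPS")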

Prevention

| Prevention Measure | Implementation | Owner |
| --- | --- | --- |
| Mandatory time-of-day scheduling for all provisioned throughput | Add to infrastructure checklist and CDK template | Platform team |
| Weekly waste ratio report | Automated CloudWatch Insights query emailed every Monday | FinOps team |
| Alarm when waste ratio exceeds 50% for 2 consecutive hours | CloudWatch alarm on custom WasteRatio metric | Platform team |
| Quarterly provisioning review | Calendar invite for platform + FinOps to review utilization data | Engineering manager |

Scenario 3: Micro-Batching Latency Increase During Peak When Batch Window Is Too Large

Problem Statement

Tuesday 20:00 JST, during the evening peak. The P99 latency alarm fires: RequestLatencyP99 > 4000ms for 3 consecutive evaluation periods. Investigation shows that Bedrock invocation latency itself is normal (Sonnet P99 = 2,100ms, Haiku P99 = 350ms), ECS tasks are healthy, and no throttling is occurring. But end-to-end response time has crept above 4 seconds for 15% of requests.

The root cause is in the micro-batcher: the batch window was recently increased from 150ms to 500ms to improve batching efficiency during a cost optimization initiative. At peak traffic, the 500ms window is adding unacceptable latency to the request pipeline.

Detection

| Signal | Source | Value | Normal |
| --- | --- | --- | --- |
| RequestLatencyP99 | MangaAssist custom metric | 4,200ms | < 3,000ms |
| BedrockInvocationLatency (Sonnet P99) | AWS/Bedrock | 2,100ms | 2,000ms (normal) |
| BedrockInvocationLatency (Haiku P99) | AWS/Bedrock | 350ms | 300ms (normal) |
| InvocationThrottles | AWS/Bedrock | 0 | 0 |
| Batch window config | ThroughputManager config | 500ms | 150ms (previous) |
| Average batch fill rate | MangaAssist/BatchFillRate | 7.2 (out of 8 max) | Was 3.1 at 150ms window |
| Queue wait time (pre-Bedrock) | MangaAssist/QueueWaitTime | P99 480ms | Was P99 140ms at 150ms window |
| ECS CPU | AWS/ECS | 28% | Normal |

Root Cause Analysis

flowchart TD
    A[P99 latency > 4000ms<br>but Bedrock latency normal] --> B{Check queue<br>wait time}
    B -->|High| C{Was batch window<br>recently changed?}
    C -->|Yes| D[ROOT CAUSE:<br>Batch window increased<br>from 150ms to 500ms.<br>At peak traffic, requests<br>wait up to 500ms before<br>batching begins.]
    C -->|No| E{Check batch<br>queue depth}
    E -->|High| F[Batching backlog:<br>more requests arriving<br>than batches can flush.]
    E -->|Normal| G[Investigate other<br>pipeline stages:<br>Redis, OpenSearch,<br>DynamoDB latency.]

    B -->|Normal| H{Check Redis<br>latency}
    H -->|High| I[Redis hotkey or<br>connection pool<br>exhaustion.]
    H -->|Normal| J{Check OpenSearch<br>latency}
    J -->|High| K[OpenSearch query<br>latency spike.]
    J -->|Normal| L[Check DynamoDB<br>and API Gateway.]

In this incident: The batch window increase was a well-intentioned cost optimization. At 500ms, batches fill to 7.2 requests on average (vs 3.1 at 150ms), reducing the number of Bedrock invocations by ~57%. But the 500ms wait adds directly to user-perceived latency.

Latency budget breakdown:

| Stage | At 150ms Window | At 500ms Window | Budget |
| --- | --- | --- | --- |
| WebSocket + API Gateway | 50ms | 50ms | -- |
| Intent classification | 30ms | 30ms | -- |
| Micro-batch queue wait | 140ms (P99) | 480ms (P99) | -- |
| OpenSearch retrieval | 200ms | 200ms | -- |
| Bedrock invocation (Sonnet P99) | 2,100ms | 2,100ms | -- |
| Response formatting + delivery | 80ms | 80ms | -- |
| Total P99 | 2,600ms | 2,940ms | < 3,000ms |

The 150ms window kept the total under 3 seconds. With the 500ms window, the stage-by-stage sum is 2,940ms and still nominally under budget -- but the observed P99 is 4,200ms, because batch wait and Bedrock latency both have long tails that compound.

Resolution -- Step-by-Step Runbook

Immediate (0-5 minutes):

  1. Revert the batch window to 150ms:
    curl -X POST https://ops.mangaassist.internal/config/batch-window \
      -d '{"batch_window_ms": 150, "reason": "latency_regression"}'
    
  2. Monitor RequestLatencyP99 -- should drop below 3,000ms within 2-3 minutes.

Short-term (within 1 week):

  1. Implement an adaptive batch window that adjusts based on current traffic and latency:
| Condition | Batch Window | Rationale |
| --- | --- | --- |
| Off-peak (TPS < 5,000) | 500ms | Low traffic; batching saves cost, latency budget has room |
| Normal (5,000 < TPS < 15,000) | 250ms | Balance between cost and latency |
| Peak (TPS > 15,000) | 100ms | Latency-sensitive; minimal batching delay |
| Latency breach (P99 > 2,500ms) | 50ms | Emergency: minimize all non-essential latency |
  2. Add the adaptive logic to the ThroughputManager (a micro-batcher sketch that applies this window follows the list):

    def _adaptive_batch_window(self, current_tps: float, current_p99_ms: float) -> int:
        if current_p99_ms > 2_500:
            return 50
        if current_tps > 15_000:
            return 100
        if current_tps > 5_000:
            return 250
        return 500
    

  3. Add monitoring for QueueWaitTime with an alarm at P99 > 300ms.
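
To show where the adaptive window plugs in, a minimal asyncio micro-batcher sketch: it collects requests until the adaptive window closes or the batch reaches 8 requests (matching the fill-rate metric above). Apart from _adaptive_batch_window, all names are illustrative:

    import asyncio
    import time

    MAX_BATCH_SIZE = 8  # matches the "out of 8 max" fill-rate metric

    class MicroBatcher:
        def __init__(self, throughput_manager, flush_fn):
            self._tm = throughput_manager          # exposes _adaptive_batch_window(...)
            self._flush_fn = flush_fn              # coroutine that sends one batch to Bedrock
            self._queue: asyncio.Queue = asyncio.Queue()

        async def submit(self, request) -> None:
            await self._queue.put(request)

        async def run(self, get_current_tps, get_current_p99_ms) -> None:
            while True:
                first = await self._queue.get()    # block until at least one request arrives
                batch = [first]
                window_ms = self._tm._adaptive_batch_window(get_current_tps(), get_current_p99_ms())
                deadline = time.monotonic() + window_ms / 1000

                # Fill the batch until the window closes or the batch is full.
                while len(batch) < MAX_BATCH_SIZE:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        break
                    try:
                        batch.append(await asyncio.wait_for(self._queue.get(), timeout=remaining))
                    except asyncio.TimeoutError:
                        break

                await self._flush_fn(batch)        # one Bedrock invocation per batch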

Prevention

| Prevention Measure | Implementation | Owner |
| --- | --- | --- |
| Latency impact assessment required for all batch config changes | Add to change management checklist | Platform team |
| Adaptive batch window (auto-adjusts to traffic conditions) | Deploy the adaptive logic above | Platform team |
| QueueWaitTime P99 alarm at 300ms | CloudWatch alarm | Platform team |
| Load test batch window changes at peak traffic before deploying | Add to performance test suite | QA team |

Scenario 4: Auto-Scaling Lag -- ECS Scales But Bedrock Provisioned Throughput Doesn't Match

Problem Statement

Wednesday 17:30 JST. The evening ramp is underway. The ECS auto-scaler correctly detects rising traffic and scales from 15 to 35 tasks over 10 minutes. However, Bedrock provisioned throughput remains at the afternoon level (8 Sonnet units = 8,000 TPS). The 35 ECS tasks generate enough concurrent Bedrock invocations to exceed provisioned capacity, causing throttling. The irony: scaling the orchestrator layer made the Bedrock bottleneck worse by increasing the rate of Bedrock calls.

Detection

| Signal | Source | Value | Expected |
| --- | --- | --- | --- |
| ECS RunningCount | AWS/ECS | 35 (scaled from 15) | Correct; auto-scaler working |
| InvocationThrottles | AWS/Bedrock | 23 in last 60s | 0 |
| TokensPerSecond | MangaAssist custom metric | 12,500 | Below evening peak, but above 8K provisioned |
| Provisioned Sonnet units | Bedrock console | 8 (afternoon level) | Should be 15 (evening level) |
| Scheduled throughput increase | EventBridge Scheduler | Scheduled for 17:00 JST | Should have fired 30 min ago |
| Scheduler Lambda execution | CloudWatch Logs | Last execution: ERROR at 17:00 | Should have succeeded |

Root Cause Analysis

flowchart TD
    A[ECS scaled correctly but<br>Bedrock throttling] --> B{Did provisioned<br>throughput schedule<br>fire at 17:00?}
    B -->|No| C[ROOT CAUSE 1:<br>EventBridge rule disabled<br>or deleted.]
    B -->|Yes| D{Did the Lambda<br>succeed?}
    D -->|No| E{Check Lambda<br>error logs}
    E --> F{Error type?}
    F -->|Permission denied| G[ROOT CAUSE 2:<br>IAM role for Lambda<br>missing bedrock:Update<br>ProvisionedModelThroughput<br>permission.]
    F -->|Resource not found| H[ROOT CAUSE 3:<br>Provisioned model ARN<br>changed after a model<br>upgrade. Lambda has<br>stale ARN.]
    F -->|Timeout| I[ROOT CAUSE 4:<br>Lambda timeout too short<br>for Bedrock API call.]
    F -->|Throttled| J[ROOT CAUSE 5:<br>Bedrock API rate limit<br>on management operations.]
    D -->|Yes| K{Is the new capacity<br>active?}
    K -->|No| L[ROOT CAUSE 6:<br>Provisioning still in<br>progress. Lead time<br>insufficient.]
    K -->|Yes| M[Not a provisioning issue.<br>Investigate traffic pattern<br>exceeding expectations.]

In this incident: Root Cause 2. During a recent security hardening sprint, the Lambda execution role was recreated with a tighter policy. The bedrock:UpdateProvisionedModelThroughput permission was accidentally omitted from the new role. The Lambda ran at 17:00 JST, received an AccessDeniedException, logged the error, and exited. No alarm was configured on Lambda errors for this function.

Resolution -- Step-by-Step Runbook

Immediate (0-5 minutes):

  1. Manually increase provisioned throughput via AWS CLI:
    aws bedrock update-provisioned-model-throughput \
      --provisioned-model-id <sonnet-provisioned-id> \
      --desired-model-units 15
    
  2. While provisioning activates (10-15 min), apply throttle mitigation:
     - Extend Redis cache TTL to 90s.
     - Downgrade faq and greeting intents to Haiku.
     - Activate aggressive request coalescing.

Short-term (0-30 minutes):

  1. Fix the Lambda IAM role:
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:UpdateProvisionedModelThroughput",
        "bedrock:GetProvisionedModelThroughput",
        "bedrock:ListProvisionedModelThroughputs"
      ],
      "Resource": "arn:aws:bedrock:ap-northeast-1:*:provisioned-model/*"
    }
    
  2. Re-run the Lambda manually to confirm it succeeds.
  3. Verify provisioned throughput reaches the expected level.

Post-incident (within 24 hours):

  1. Add a CloudWatch alarm on Lambda errors for all provisioning Lambdas.
  2. Add a "provisioned throughput health check" that runs every 15 minutes and verifies that the current provisioned capacity matches the expected schedule (a sketch follows this list).
  3. Add IAM policy validation to the CI/CD pipeline for infrastructure changes.
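
A minimal sketch of the health check from step 2, using the Bedrock control-plane API to read current model units and comparing them against a schedule table mirroring Scenario 2. The environment variable, metric name, and schedule values are assumptions:

    import os
    from datetime import datetime
    from zoneinfo import ZoneInfo

    import boto3

    bedrock = boto3.client("bedrock")
    cloudwatch = boto3.client("cloudwatch")

    # Hypothetical schedule: hour-of-day (JST) -> expected Sonnet units, from the Scenario 2 plan.
    EXPECTED_SONNET_UNITS = {range(2, 6): 0, range(6, 12): 5, range(12, 17): 8}

    def expected_units_now() -> int:
        hour = datetime.now(ZoneInfo("Asia/Tokyo")).hour
        for hours, units in EXPECTED_SONNET_UNITS.items():
            if hour in hours:
                return units
        return 15  # 17:00-02:00 evening peak default

    def lambda_handler(event, context):
        resp = bedrock.get_provisioned_model_throughput(
            provisionedModelId=os.environ["SONNET_PROVISIONED_ID"]
        )
        actual = resp["modelUnits"]
        expected = expected_units_now()

        # Publish a 0/1 mismatch metric; alarm on any non-zero value.
        cloudwatch.put_metric_data(
            Namespace="MangaAssist",
            MetricData=[{
                "MetricName": "ProvisionedCapacityMismatch",
                "Value": 0 if actual == expected else 1,
                "Unit": "Count",
            }],
        )
        return {"actual_units": actual, "expected_units": expected, "status": resp["status"]}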

Prevention

| Prevention Measure | Implementation | Owner |
| --- | --- | --- |
| Alarm on provisioning Lambda errors | CloudWatch alarm: Lambda Errors > 0 for provisioning functions | Platform team |
| Provisioned throughput health check | Lambda that compares actual vs expected capacity every 15 min | Platform team |
| IAM policy unit tests | CDK/Terraform tests that assert required permissions are present | Security + Platform |
| Coupled scaling: ECS + Bedrock in single workflow | Modify auto-scaler to adjust provisioned throughput whenever ECS scales | Platform team |
| Decouple ECS scaling rate from Bedrock capacity | Rate-limit Bedrock invocations per ECS task to prevent overwhelming provisioned capacity | Platform team |
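
The last prevention row caps how fast each ECS task can call Bedrock. A minimal token-bucket sketch; the per-task rate and burst values are assumed budgets, not figures from this document:

    import asyncio
    import time

    class BedrockRateLimiter:
        """Token bucket limiting Bedrock invocations per ECS task."""

        def __init__(self, invocations_per_second: float, burst: int):
            self._rate = invocations_per_second
            self._capacity = burst
            self._tokens = float(burst)
            self._last = time.monotonic()
            self._lock = asyncio.Lock()

        async def acquire(self) -> None:
            async with self._lock:
                while True:
                    now = time.monotonic()
                    self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
                    self._last = now
                    if self._tokens >= 1:
                        self._tokens -= 1
                        return
                    # Sleep just long enough for one token to accrue.
                    await asyncio.sleep((1 - self._tokens) / self._rate)

    # Example: cap each task at 10 invocations/second with a burst of 20 (assumed per-task budget).
    limiter = BedrockRateLimiter(invocations_per_second=10, burst=20)

Each orchestrator task would call `await limiter.acquire()` before every Bedrock invocation, so adding ECS tasks no longer multiplies the burst rate hitting provisioned capacity.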

Decision Tree: ECS Scaling + Bedrock Capacity Mismatch

flowchart TD
    A[ECS tasks increased but<br>Bedrock throttling detected] --> B{Is provisioned<br>throughput at expected<br>level for this time?}

    B -->|No, it's lower| C{Check scheduler<br>Lambda logs}
    C -->|Lambda error| D[Fix Lambda error<br>and re-run manually]
    C -->|Lambda didn't run| E[Check EventBridge rule<br>Is it enabled?]
    C -->|Lambda succeeded| F[Check Bedrock:<br>is provisioning still<br>in progress?]

    B -->|Yes, at expected level| G{Is actual TPS<br>exceeding provisioned<br>capacity?}
    G -->|Yes| H[Traffic higher than<br>planned. Increase<br>provisioned units.]
    G -->|No, TPS is within capacity| I{Check per-task<br>invocation rate}
    I -->|Too many invocations per task| J[Rate-limit invocations<br>per ECS task to prevent<br>burst amplification]
    I -->|Normal| K[Investigate Bedrock<br>internal throttling.<br>Contact AWS support.]

Scenario 5: Capacity Planning Miss -- Underestimated Token Throughput for New Recommendation Feature

Problem Statement

A new "Manga Taste Profile" feature launches on Thursday. This feature generates a personalized taste analysis by sending the user's reading history (last 50 manga titles, genres, ratings) plus a detailed system prompt to Claude 3 Sonnet, asking it to produce a structured taste profile with explanations. The feature was load-tested using synthetic data with 10-title histories, but real users have much richer histories.

One week after launch, the feature accounts for 35% of total Bedrock token consumption despite being only 8% of total requests. The average token count per "taste profile" request is 3,200 tokens (vs the 800 tokens estimated during planning). Monthly Bedrock cost projections have increased by $42,000.

Detection

| Signal | Source | Value | Planned |
| --- | --- | --- | --- |
| Avg tokens per "taste_profile" request | MangaAssist/TokensPerIntent | 3,200 | 800 (planned) |
| "taste_profile" share of total tokens | CloudWatch Insights query | 35% | 8% (planned) |
| "taste_profile" share of total requests | CloudWatch Insights query | 8% | 8% (matches) |
| Monthly Bedrock cost delta | Cost Explorer | +$42,000 projected | +$10,000 (planned) |
| Provisioned throughput utilization | MangaAssist/WasteRatio (inverted) | 88% during peak | Was 65% before feature |
| P99 latency for "taste_profile" | MangaAssist/LatencyByIntent | 5,800ms | 3,000ms (target) |

Root Cause Analysis

flowchart TD
    A[taste_profile consuming 4x<br>more tokens than planned] --> B{Was the feature<br>load-tested?}
    B -->|No| C[ROOT CAUSE 1:<br>No load test. Deploy<br>without performance<br>validation.]
    B -->|Yes| D{Was the test data<br>representative?}
    D -->|No| E[ROOT CAUSE 2:<br>Synthetic test data had<br>10-title histories.<br>Real users have 50+<br>titles. Prompt size<br>scales with history length.]
    D -->|Yes| F{Was token counting<br>included in the test?}
    F -->|No| G[ROOT CAUSE 3:<br>Load test measured<br>latency and error rate<br>but not token consumption.<br>Cost impact was invisible.]
    F -->|Yes| H{Were results reviewed<br>against capacity plan?}
    H -->|No| I[ROOT CAUSE 4:<br>Results existed but<br>no one compared them<br>to the capacity model.]
    H -->|Yes| J[Investigate: did user<br>behavior change post-launch<br>in unexpected ways?]

In this incident: Root Cause 2 + Root Cause 3. The load test used synthetic users with short histories (10 titles), and the test framework measured latency and HTTP errors but did not aggregate token counts. The 4x token multiplier was invisible until it hit production and showed up in cost reports a week later.

Token breakdown for a taste_profile request:

| Component | Estimated (10-title history) | Actual (50-title history) | Multiplier |
| --- | --- | --- | --- |
| System prompt (taste analysis instructions) | 200 tokens | 200 tokens | 1x |
| User reading history context | 150 tokens (10 titles) | 750 tokens (50 titles) | 5x |
| Genre and rating metadata | 50 tokens | 250 tokens | 5x |
| Output (taste profile analysis) | 400 tokens | 2,000 tokens | 5x (model generates proportionally longer analysis for richer input) |
| Total | 800 tokens | 3,200 tokens | 4x |

Resolution -- Step-by-Step Runbook

Immediate (0-2 hours):

  1. Add token count tracking per intent to the monitoring dashboard (if not already present).
  2. Increase provisioned throughput to accommodate the higher token load:
    # Evening peak needs additional capacity
    # Previous: 15 Sonnet units (15K TPS)
    # New: 20 Sonnet units (20K TPS) to absorb the taste_profile load
    
  3. Implement a history truncation strategy for taste_profile prompts (a sketch follows this list):
     - Instead of sending all 50 titles, send the 20 most recent + top 10 by rating.
     - Summarize the remaining titles as genre counts: "Also read: 15 shonen, 5 seinen, 3 josei."
     - This reduces input context from 750 tokens to ~350 tokens without significant quality loss.
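
A sketch of the truncation from step 3. The history record fields (title, genre, rating, read_at) are assumed, not the actual MangaAssist schema:

    from collections import Counter

    def truncate_history(history: list[dict], recent: int = 20, top_rated: int = 10) -> str:
        """Reduce a full reading history to 20 recent + 10 top-rated titles plus a genre summary."""
        by_recency = sorted(history, key=lambda h: h["read_at"], reverse=True)
        keep = by_recency[:recent]
        kept_titles = {h["title"] for h in keep}

        # Add the top-rated titles not already kept, up to the 30-title cap.
        by_rating = sorted(history, key=lambda h: h["rating"], reverse=True)
        for entry in by_rating:
            if len(keep) >= recent + top_rated:
                break
            if entry["title"] not in kept_titles:
                keep.append(entry)
                kept_titles.add(entry["title"])

        # Summarize whatever was dropped as genre counts, e.g. "Also read: 15 shonen, 5 seinen".
        dropped = [h for h in history if h["title"] not in kept_titles]
        genre_counts = Counter(h["genre"] for h in dropped)
        summary = ", ".join(f"{count} {genre}" for genre, count in genre_counts.most_common())

        lines = [f'- {h["title"]} ({h["genre"]}, rated {h["rating"]}/5)' for h in keep]
        if summary:
            lines.append(f"Also read: {summary}.")
        return "\n".join(lines)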

Short-term (within 1 week):

  1. Redesign the taste_profile prompt to be token-efficient:
| Optimization | Token Savings | Quality Impact |
| --- | --- | --- |
| Truncate history to 30 titles (20 recent + 10 top) | -200 input tokens | Minimal; most taste signal is in recent reads |
| Summarize remaining history as genre counts | -200 input tokens | Low; genre distribution preserved |
| Constrain output to structured JSON instead of prose | -1,200 output tokens | Moderate; less readable but still useful |
| Use max_tokens=800 instead of default 1,024 | -200 output tokens (avg) | None; most profiles complete in 600-800 tokens |
| Total savings | ~1,800 tokens (56% reduction) | Low-to-moderate |
  2. After optimization, the taste_profile request drops from 3,200 to ~1,400 tokens -- much closer to the original 800-token estimate.

  3. Update the capacity model with the corrected per-intent token counts.

Post-incident (within 2 weeks):

  1. Add a token budget per intent to the orchestrator. Any intent exceeding its token budget triggers a warning log and a CloudWatch metric, catching future planning misses early (see the sketch after this list).
  2. Require load tests to use production-representative data (anonymized). Add a pre-launch checklist item: "Load test data has representative distribution of user history lengths."
  3. Add token consumption to load test reports alongside latency and error rates.
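
A minimal sketch of the per-intent token budget check from step 1. The budget values and metric name are illustrative, and the token counts are assumed to come from the invocation's usage metadata:

    import logging

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    logger = logging.getLogger("mangaassist.token_budget")

    # Illustrative budgets (tokens per request), derived from the corrected capacity model.
    TOKEN_BUDGETS = {"faq": 300, "greeting": 100, "recommendation": 1_200, "taste_profile": 1_400}

    def check_token_budget(intent: str, input_tokens: int, output_tokens: int) -> None:
        total = input_tokens + output_tokens
        budget = TOKEN_BUDGETS.get(intent)
        if budget is None or total <= budget:
            return

        logger.warning("intent=%s used %d tokens, budget is %d", intent, total, budget)
        cloudwatch.put_metric_data(
            Namespace="MangaAssist",
            MetricData=[{
                "MetricName": "TokenBudgetExceeded",
                "Dimensions": [{"Name": "Intent", "Value": intent}],
                "Value": total - budget,
                "Unit": "Count",
            }],
        )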

Prevention

| Prevention Measure | Implementation | Owner |
| --- | --- | --- |
| Token budget per intent | Orchestrator config + alarm when exceeded | Platform team |
| Production-representative load test data | Anonymized data pipeline from production to test | Data engineering |
| Token counting in load test reports | Update load test framework to aggregate and report tokens | QA team |
| Pre-launch capacity review gate | New features must present token estimates with supporting data before launch | Engineering manager |
| Prompt token estimation tool | Utility that estimates token count for a prompt template + sample input distribution | Platform team |
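
For the prompt token estimation tool in the last row, a rough sketch based on a characters-per-token heuristic. This is an approximation only; real tokenizer counts differ, and Japanese text typically produces more tokens per character than English:

    import statistics

    CHARS_PER_TOKEN = 4  # rough heuristic for English text; treat the output as an estimate only

    def estimate_tokens(text: str) -> int:
        return max(1, len(text) // CHARS_PER_TOKEN)

    def estimate_prompt_tokens(template: str, sample_inputs: list[dict]) -> dict:
        """Estimate token counts for a prompt template rendered over a sample input distribution."""
        counts = [estimate_tokens(template.format(**sample)) for sample in sample_inputs]
        return {
            "p50": statistics.median(counts),
            "p95": statistics.quantiles(counts, n=20)[18],  # 95th percentile cut point
            "max": max(counts),
        }

Running this against an anonymized sample of real user histories (rather than synthetic 10-title users) would have surfaced the 4x gap before launch.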

Decision Tree: Unexpected Token Consumption After Feature Launch

flowchart TD
    A[New feature consuming more<br>tokens than planned] --> B{Was per-intent<br>token tracking active<br>before launch?}
    B -->|No| C[Enable token tracking<br>per intent immediately.<br>Measure actual vs planned.]
    B -->|Yes| D{How large is<br>the variance?}

    D -->|< 50% over plan| E[Acceptable variance.<br>Update capacity model.<br>Monitor for 1 week.]
    D -->|50-200% over plan| F{Is the feature<br>prompt optimizable?}
    F -->|Yes| G[Optimize prompt:<br>truncate context,<br>constrain output,<br>reduce system prompt.]
    F -->|No| H[Increase provisioned<br>throughput. Update<br>cost projections.<br>Inform stakeholders.]

    D -->|> 200% over plan| I{Is the feature<br>critical?}
    I -->|Yes| J[Emergency: Increase<br>provisioned throughput.<br>Redesign prompt in<br>parallel. Consider<br>Haiku for non-critical<br>portions.]
    I -->|No| K[Consider feature flag<br>rollback until prompt<br>is redesigned.]

Cross-Scenario Summary

| # | Scenario | Core Skill Tested | Key Metric | Resolution Time |
| --- | --- | --- | --- | --- |
| 1 | Manga release 5x spike + throttling | Capacity planning, event provisioning | InvocationThrottles, TPS | 5-15 min (mitigation), 24h (prevention) |
| 2 | Over-provisioned overnight waste | Utilization monitoring, cost optimization | Waste Ratio | 1 day (schedule deploy), 1 week (validation) |
| 3 | Batch window too large at peak | Batching strategy, latency management | QueueWaitTime P99, RequestLatencyP99 | 5 min (revert), 1 week (adaptive batching) |
| 4 | ECS scales but Bedrock doesn't match | Auto-scaling coordination, coupled scaling | InvocationThrottles + ECS RunningCount divergence | 15-30 min (manual fix), 24h (IAM fix + alarms) |
| 5 | Token underestimate for new feature | Capacity planning, token budgeting | TokensPerIntent, cost delta | 2h (increase throughput), 1 week (prompt optimization) |

Runbook Quick Reference

Universal First Response Checklist (Any High-Performance FM Incident)

  1. Open the CloudWatch dashboard. Identify which metrics are breaching.
  2. Check Bedrock throttles. If InvocationThrottles > 0, this is the top priority.
  3. Check provisioned throughput status. Is it at the expected level for this time of day?
  4. Check ECS task count. Is the auto-scaler responding?
  5. Check the event calendar. Is there a scheduled event that should have triggered pre-provisioning?
  6. Check recent deployments. Was anything changed in the last 24 hours (batch window, model routing, prompt templates, IAM policies)?

Escalation Criteria

| Condition | Action |
| --- | --- |
| Throttles > 0 for more than 5 minutes after mitigation started | Page senior on-call + AWS TAM |
| P99 latency > 8 seconds for more than 10 minutes | Page senior on-call |
| Provisioned throughput increase request fails (API error) | Contact AWS Support (Severity 1) |
| Cost anomaly: daily spend > 2x forecast | Alert FinOps team + engineering manager |
| Multiple scenarios occurring simultaneously (e.g., throttling + scaling lag) | Incident commander mode: dedicated war room |