
03: Scenarios and Runbooks -- High-Performance FM Systems

MangaAssist is a JP manga store chatbot running on AWS. It uses Bedrock Claude 3 (Sonnet for complex queries, Haiku for simple ones), OpenSearch Serverless for vector retrieval, DynamoDB for session and catalog data, ECS Fargate for orchestration, API Gateway WebSocket for real-time delivery, and ElastiCache Redis for caching. The system handles 1M messages/day with a target of under 3 seconds end-to-end response time.


Skill Mapping

| Field | Value |
| --- | --- |
| Domain | 4 -- Operational Efficiency Optimization |
| Task | 4.1 -- Optimize foundation model cost and performance |
| Skill | 4.1.3 -- Implement strategies for high-performance FM systems including batching, capacity planning, utilization monitoring, auto-scaling, and provisioned throughput optimization |
| Focus | Five production scenarios with full Problem-Detection-Root Cause-Resolution-Prevention runbooks and decision trees |

Scenario 1: New Manga Release Causes 5x Traffic Spike and Bedrock Throttling

Problem Statement

Monday 00:05 JST. Weekly Shonen Jump digital release just went live. MangaAssist traffic surges from 7,000 tokens/s to 35,000 tokens/s within 3 minutes. The on-call engineer receives a PagerDuty alert: BedrockThrottleAlarm ALARM -- InvocationThrottles > 0 for 1 evaluation period. Users report the chatbot returning "I'm sorry, I'm having trouble responding right now" fallback messages. WebSocket connections remain open, but responses are timing out against the 10-second integration timeout configured on API Gateway.

Detection

| Signal | Source | Value | Normal Range |
| --- | --- | --- | --- |
| InvocationThrottles | AWS/Bedrock CloudWatch | 47 in last 60s | 0 |
| TokensPerSecond | MangaAssist/TokensPerSecond | 35,200 | 7,000 (baseline) |
| BedrockQueueDepth | MangaAssist/BedrockQueueDepth | 340 | < 50 |
| RequestLatencyP99 | MangaAssist/RequestLatencyP99 | 9,800ms | < 3,000ms |
| ECS task count | AWS/ECS RunningCount | 30 (scaling in progress) | 15 (pre-event) |
| Provisioned throughput | Bedrock console | 15 Sonnet units (15K TPS) | 15 Sonnet units |
| User-visible errors | WebSocket fallback rate | 12% of responses | < 0.1% |

Root Cause Analysis

flowchart TD
    A[Bedrock throttling at 00:05 JST] --> B{Was event<br>calendar configured?}
    B -->|No| C[ROOT CAUSE 1:<br>Event not in calendar.<br>No pre-provisioning happened.]
    B -->|Yes| D{Did provisioning<br>trigger on time?}
    D -->|No| E[ROOT CAUSE 2:<br>Scheduler Lambda failed<br>or timed out.]
    D -->|Yes| F{Was provisioned<br>capacity sufficient?}
    F -->|No| G[ROOT CAUSE 3:<br>Capacity multiplier<br>underestimated. 3x was<br>configured but 5x occurred.]
    F -->|Yes| H{Did provisioning<br>complete before event?}
    H -->|No| I[ROOT CAUSE 4:<br>Lead time too short.<br>Provisioning was still<br>in progress at 00:00.]
    H -->|Yes| J[Investigate:<br>Traffic exceeded all<br>predictions. Check if<br>new manga title drove<br>abnormal interest.]

In this incident: Root Cause 3. The event calendar had the Monday release configured with a 3x multiplier (21K TPS provisioned), but the actual traffic reached 5x because a highly anticipated manga series had its final chapter release -- an unusual event layered on top of the normal weekly release.

Resolution -- Step-by-Step Runbook

Immediate (0-5 minutes):

  1. Acknowledge the PagerDuty alert.
  2. Open the MangaAssist CloudWatch dashboard. Confirm throttles are active and TPS exceeds provisioned capacity.
  3. Activate emergency scaling in the GenAIAutoScaler:
    # Via the operations API or direct AWS CLI
    aws bedrock update-provisioned-model-throughput \
      --provisioned-model-id <sonnet-provisioned-id> \
      --desired-model-units 30
    
  4. Extend Redis cache TTL from 30s to 120s to reduce Bedrock invocation volume:
    # Via the operations API
    curl -X POST https://ops.mangaassist.internal/cache/ttl \
      -d '{"ttl_seconds": 120, "reason": "throttle_mitigation"}'
    
  5. Activate Sonnet-to-Haiku downgrade for faq and greeting intents (they don't need Sonnet quality):
    curl -X POST https://ops.mangaassist.internal/model-routing/override \
      -d '{"intents": ["faq", "greeting"], "model": "haiku", "reason": "throttle_mitigation"}'
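
The intent downgrade in step 5 amounts to a routing map the orchestrator consults before each invocation. A minimal sketch, assuming the Claude 3 model IDs exposed by Bedrock; the override store and function names are hypothetical, not the actual MangaAssist ops API:

    import json
    import boto3

    bedrock_runtime = boto3.client("bedrock-runtime")

    SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"
    HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

    # Active overrides written by the ops API, e.g. {"faq": HAIKU, "greeting": HAIKU}.
    routing_overrides: dict[str, str] = {}

    def model_for_intent(intent: str, default: str = SONNET) -> str:
        return routing_overrides.get(intent, default)

    def invoke(intent: str, prompt: str, max_tokens: int = 1024) -> str:
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        }
        response = bedrock_runtime.invoke_model(
            modelId=model_for_intent(intent),
            body=json.dumps(body),
        )
        return json.loads(response["body"].read())["content"][0]["text"]

Keeping the override map in a shared store (for example, the existing ElastiCache Redis) would let the ops API flip routing for all ECS tasks at once.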
    

Short-term (5-15 minutes):

  1. Monitor InvocationThrottles -- should drop to 0 within 5-10 minutes after provisioned throughput increase completes.
  2. Verify RequestLatencyP99 is returning below 3,000ms.
  3. Check ECS task count has stabilized (auto-scaler should have added tasks).
  4. Verify Redis cache hit rate has increased (expected: from 15% to 40%+ during mitigation).

Post-incident (within 24 hours):

  1. Revert Sonnet-to-Haiku override for faq and greeting intents.
  2. Restore Redis cache TTL to 30s.
  3. Update the event calendar: add a "final chapter release" event type with 5x multiplier.
  4. Write incident post-mortem.

Prevention

| Prevention Measure | Implementation | Owner |
| --- | --- | --- |
| Add "final chapter" event type to calendar | Update DynamoDB event table with 5x multiplier when publisher announces final chapters | Content ops team |
| Increase default Monday multiplier from 3x to 4x | Update GenAIAutoScaler event config | Platform team |
| Add pre-provisioning lead time buffer | Change from 2h to 4h before Monday releases | Platform team |
| Implement automatic escalation when TPS exceeds 80% of provisioned capacity for 2 minutes | Add CloudWatch alarm + Lambda to request additional units automatically | Platform team |
| Add publisher announcement monitoring | RSS/API feed from Shueisha for release announcements | Content ops team |
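
The automatic-escalation row above can be driven by a small scheduled Lambda. A minimal sketch, assuming the MangaAssist/TokensPerSecond metric from the detection table and a hypothetical SNS escalation topic; the 80%-for-2-minutes rule matches the prevention measure:

    import os
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    sns = boto3.client("sns")

    def lambda_handler(event, context):
        # Assumed configuration: current provisioned capacity and an SNS topic for escalation.
        provisioned_tps = float(os.environ["PROVISIONED_TPS"])   # e.g. 15000 for 15 Sonnet units
        topic_arn = os.environ["ESCALATION_TOPIC_ARN"]

        # Average TokensPerSecond over the last two 60-second periods.
        now = datetime.now(timezone.utc)
        resp = cloudwatch.get_metric_statistics(
            Namespace="MangaAssist",
            MetricName="TokensPerSecond",
            StartTime=now - timedelta(minutes=2),
            EndTime=now,
            Period=60,
            Statistics=["Average"],
        )
        datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
        if len(datapoints) < 2:
            return {"status": "insufficient_data"}

        # Escalate only if both minutes exceeded 80% of provisioned capacity.
        if all(d["Average"] > 0.8 * provisioned_tps for d in datapoints):
            sns.publish(
                TopicArn=topic_arn,
                Subject="Bedrock utilization above 80% of provisioned capacity for 2 minutes",
                Message=f"TokensPerSecond={datapoints[-1]['Average']:.0f}, provisioned={provisioned_tps:.0f}",
            )
            return {"status": "escalated"}
        return {"status": "ok"}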

Scenario 2: Over-Provisioned Throughput During Low-Traffic Hours Wastes Budget

Problem Statement

The monthly AWS bill review reveals that MangaAssist spent $18,400 on Bedrock provisioned throughput during the 02:00-06:00 JST window over the past month, but actual utilization during those hours averaged only 8% of provisioned capacity. The provisioned throughput (15 Sonnet units + 5 Haiku units = 20K TPS capacity) was running 24/7 with no time-of-day adjustment. At 1,500 tokens/s actual overnight traffic, 92% of provisioned capacity was wasted -- paying for 20,000 TPS while using 1,500 TPS.

Detection

| Signal | Source | Value | Expected |
| --- | --- | --- | --- |
| Waste Ratio (02:00-06:00 JST) | Custom metric: 1 - (actual_TPS / provisioned_TPS) | 92% | < 40% |
| Overnight provisioned cost (monthly) | Cost Explorer, filtered to Bedrock provisioned | $18,400 | Should be ~$0 (on-demand cheaper) |
| Overnight on-demand equivalent cost | 1,500 TPS * on-demand rate * 4 hours * 30 days | ~$2,600 | N/A |
| Wasted spend (monthly) | Provisioned cost - on-demand equivalent | $15,800 | $0 |
| Provisioned unit schedule | GenAIAutoScaler config | No overnight reduction configured | Reduce to 0 overnight |
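
The WasteRatio signal in the first row is a custom metric. A minimal sketch of a publisher for it, assuming the MangaAssist namespace used elsewhere in this document; the provisioned capacity is passed in rather than looked up:

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def publish_waste_ratio(provisioned_tps: float) -> float:
        """Compute 1 - (actual_TPS / provisioned_TPS) and publish it as MangaAssist/WasteRatio."""
        now = datetime.now(timezone.utc)
        resp = cloudwatch.get_metric_statistics(
            Namespace="MangaAssist",
            MetricName="TokensPerSecond",
            StartTime=now - timedelta(minutes=5),
            EndTime=now,
            Period=300,
            Statistics=["Average"],
        )
        actual_tps = resp["Datapoints"][0]["Average"] if resp["Datapoints"] else 0.0
        waste_ratio = max(0.0, 1.0 - actual_tps / provisioned_tps) if provisioned_tps else 0.0

        cloudwatch.put_metric_data(
            Namespace="MangaAssist",
            MetricData=[{
                "MetricName": "WasteRatio",
                "Value": waste_ratio,
                "Unit": "None",  # ratio between 0 and 1; multiply by 100 for percentage dashboards
            }],
        )
        return waste_ratio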

Root Cause Analysis

flowchart TD
    A[92% waste ratio overnight] --> B{Is time-of-day<br>scheduling configured?}
    B -->|No| C[ROOT CAUSE 1:<br>Provisioned throughput<br>runs 24/7 at peak capacity.<br>No schedule was ever set up.]
    B -->|Yes| D{Does the schedule<br>include an overnight<br>reduction?}
    D -->|No| E[ROOT CAUSE 2:<br>Schedule exists but only<br>covers peak scale-up, not<br>off-peak scale-down.]
    D -->|Yes| F{Is the scheduled<br>reduction executing?}
    F -->|No| G[ROOT CAUSE 3:<br>Scheduler Lambda has<br>permission or execution<br>errors. Check CloudWatch<br>Logs for the Lambda.]
    F -->|Yes| H{Is the reduction<br>sufficient?}
    H -->|No| I[ROOT CAUSE 4:<br>Overnight minimum is set<br>too high. Review minimum<br>unit configuration.]
    H -->|Yes| J[Not a provisioning issue.<br>Check if on-demand<br>charges are the actual<br>cost driver.]

In this incident: Root Cause 1. The initial deployment configured provisioned throughput for the evening peak and never added a schedule to reduce it overnight. The team assumed provisioned throughput would be needed around the clock to avoid throttling, but overnight traffic is so low that on-demand handles it without any throttling risk.

Resolution -- Step-by-Step Runbook

Immediate (within 1 business day):

  1. Verify overnight traffic patterns over the last 30 days. Confirm that 02:00-06:00 JST consistently runs below 2,000 TPS.
  2. Calculate the breakeven point: at what TPS does provisioned become cheaper than on-demand? For MangaAssist, this is approximately 5,000 TPS.
  3. Design a time-of-day provisioning schedule:
| Time Window (JST) | Sonnet Units | Haiku Units | Rationale |
| --- | --- | --- | --- |
| 02:00 - 06:00 | 0 | 0 | Pure on-demand; traffic well below breakeven |
| 06:00 - 12:00 | 5 | 3 | Morning ramp; above breakeven by 08:00 |
| 12:00 - 17:00 | 8 | 3 | Afternoon steady state |
| 17:00 - 02:00 | 15 | 5 | Evening peak through late night |
  4. Implement the schedule using EventBridge Scheduler + Lambda (a boto3 sketch of these rules follows this list):

    # Pseudo-configuration for EventBridge rules
    Rule: overnight_reduction
      Schedule: cron(0 17 * * ? *)   # 02:00 JST = 17:00 UTC
      Target: Lambda "adjust_provisioned_throughput"
      Payload: {"sonnet_units": 0, "haiku_units": 0}
    
    Rule: morning_ramp
      Schedule: cron(0 21 * * ? *)   # 06:00 JST = 21:00 UTC
      Target: Lambda "adjust_provisioned_throughput"
      Payload: {"sonnet_units": 5, "haiku_units": 3}

    Rule: afternoon_steady
      Schedule: cron(0 3 * * ? *)    # 12:00 JST = 03:00 UTC
      Target: Lambda "adjust_provisioned_throughput"
      Payload: {"sonnet_units": 8, "haiku_units": 3}

    Rule: evening_peak
      Schedule: cron(0 8 * * ? *)    # 17:00 JST = 08:00 UTC
      Target: Lambda "adjust_provisioned_throughput"
      Payload: {"sonnet_units": 15, "haiku_units": 5}
    

  5. Deploy and monitor for 1 week. Verify no throttling during transitions.
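
A boto3 sketch of the schedule setup from step 4, covering all four windows in the table above. The Lambda and role ARNs are placeholders, and the cron expressions stay in UTC as in the pseudo-configuration:

    import json

    import boto3

    scheduler = boto3.client("scheduler")

    LAMBDA_ARN = "arn:aws:lambda:ap-northeast-1:123456789012:function:adjust_provisioned_throughput"
    ROLE_ARN = "arn:aws:iam::123456789012:role/scheduler-invoke-adjuster"   # placeholder

    # Cron expressions are in UTC; JST = UTC+9.
    SCHEDULES = [
        ("overnight_reduction", "cron(0 17 * * ? *)", {"sonnet_units": 0,  "haiku_units": 0}),
        ("morning_ramp",        "cron(0 21 * * ? *)", {"sonnet_units": 5,  "haiku_units": 3}),
        ("afternoon_steady",    "cron(0 3 * * ? *)",  {"sonnet_units": 8,  "haiku_units": 3}),
        ("evening_peak",        "cron(0 8 * * ? *)",  {"sonnet_units": 15, "haiku_units": 5}),
    ]

    def create_schedules() -> None:
        for name, cron, payload in SCHEDULES:
            scheduler.create_schedule(
                Name=name,
                ScheduleExpression=cron,
                FlexibleTimeWindow={"Mode": "OFF"},
                Target={
                    "Arn": LAMBDA_ARN,
                    "RoleArn": ROLE_ARN,
                    "Input": json.dumps(payload),
                },
            )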

Validation:

  1. After 1 week, compare costs. Expected monthly savings:
     - Previous: $18,400/month for overnight window.
     - New: ~$2,600/month (on-demand overnight) + reduced provisioned during morning.
     - Net savings: ~$15,800/month ($189,600/year).
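
A rough sketch of the breakeven calculation behind step 2 of the runbook and the savings comparison above. The per-unit and per-token prices are placeholder assumptions chosen only to make the arithmetic concrete, not published AWS rates:

    # Placeholder prices -- NOT published AWS rates; substitute real ap-northeast-1 pricing.
    PROVISIONED_COST_PER_UNIT_HOUR = 10.0    # USD per Sonnet model unit per hour (assumed)
    ON_DEMAND_COST_PER_1K_TOKENS = 0.008     # USD per 1,000 tokens, blended input/output (assumed)

    def on_demand_cost_per_hour(tps: float) -> float:
        """Hourly cost of serving a sustained token rate on demand."""
        return tps * 3600 / 1000 * ON_DEMAND_COST_PER_1K_TOKENS

    def breakeven_tps(provisioned_units: int) -> float:
        """Sustained TPS above which the provisioned fleet is cheaper than paying on demand."""
        hourly_provisioned = provisioned_units * PROVISIONED_COST_PER_UNIT_HOUR
        return hourly_provisioned / (3.6 * ON_DEMAND_COST_PER_1K_TOKENS)

    # With these placeholder rates, 15 Sonnet units break even around 5,200 TPS,
    # while 1,500 TPS overnight is far cheaper to serve on demand.
    print(f"breakeven ~{breakeven_tps(15):,.0f} TPS")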

Prevention

| Prevention Measure | Implementation | Owner |
| --- | --- | --- |
| Mandatory time-of-day scheduling for all provisioned throughput | Add to infrastructure checklist and CDK template | Platform team |
| Weekly waste ratio report | Automated CloudWatch Insights query emailed every Monday | FinOps team |
| Alarm when waste ratio exceeds 50% for 2 consecutive hours | CloudWatch alarm on custom WasteRatio metric | Platform team |
| Quarterly provisioning review | Calendar invite for platform + FinOps to review utilization data | Engineering manager |

Scenario 3: Micro-Batching Latency Increase During Peak When Batch Window Is Too Large

Problem Statement

Tuesday 20:00 JST, during the evening peak. The P99 latency alarm fires: RequestLatencyP99 > 4000ms for 3 consecutive evaluation periods. Investigation shows that Bedrock invocation latency itself is normal (Sonnet P99 = 2,100ms, Haiku P99 = 350ms), ECS tasks are healthy, and no throttling is occurring. But end-to-end response time has crept above 4 seconds for 15% of requests.

The root cause is in the micro-batcher: the batch window was recently increased from 150ms to 500ms to improve batching efficiency during a cost optimization initiative. At peak traffic, the 500ms window is adding unacceptable latency to the request pipeline.

Detection

| Signal | Source | Value | Normal |
| --- | --- | --- | --- |
| RequestLatencyP99 | MangaAssist custom metric | 4,200ms | < 3,000ms |
| BedrockInvocationLatency (Sonnet P99) | AWS/Bedrock | 2,100ms | 2,000ms (normal) |
| BedrockInvocationLatency (Haiku P99) | AWS/Bedrock | 350ms | 300ms (normal) |
| InvocationThrottles | AWS/Bedrock | 0 | 0 |
| Batch window config | ThroughputManager config | 500ms | 150ms (previous) |
| Average batch fill rate | MangaAssist/BatchFillRate | 7.2 (out of 8 max) | Was 3.1 at 150ms window |
| Queue wait time (pre-Bedrock) | MangaAssist/QueueWaitTime | P99 480ms | Was P99 140ms at 150ms window |
| ECS CPU | AWS/ECS | 28% | Normal |

Root Cause Analysis

flowchart TD
    A[P99 latency > 4000ms<br>but Bedrock latency normal] --> B{Check queue<br>wait time}
    B -->|High| C{Was batch window<br>recently changed?}
    C -->|Yes| D[ROOT CAUSE:<br>Batch window increased<br>from 150ms to 500ms.<br>At peak traffic, requests<br>wait up to 500ms before<br>batching begins.]
    C -->|No| E{Check batch<br>queue depth}
    E -->|High| F[Batching backlog:<br>more requests arriving<br>than batches can flush.]
    E -->|Normal| G[Investigate other<br>pipeline stages:<br>Redis, OpenSearch,<br>DynamoDB latency.]

    B -->|Normal| H{Check Redis<br>latency}
    H -->|High| I[Redis hotkey or<br>connection pool<br>exhaustion.]
    H -->|Normal| J{Check OpenSearch<br>latency}
    J -->|High| K[OpenSearch query<br>latency spike.]
    J -->|Normal| L[Check DynamoDB<br>and API Gateway.]

In this incident: The batch window increase was a well-intentioned cost optimization. At 500ms, batches fill to 7.2 requests on average (vs 3.1 at 150ms), reducing the number of Bedrock invocations by ~57%. But the 500ms wait adds directly to user-perceived latency.

Latency budget breakdown:

| Stage | At 150ms Window | At 500ms Window | Budget |
| --- | --- | --- | --- |
| WebSocket + API Gateway | 50ms | 50ms | -- |
| Intent classification | 30ms | 30ms | -- |
| Micro-batch queue wait | 140ms (P99) | 480ms (P99) | -- |
| OpenSearch retrieval | 200ms | 200ms | -- |
| Bedrock invocation (Sonnet P99) | 2,100ms | 2,100ms | -- |
| Response formatting + delivery | 80ms | 80ms | -- |
| Total P99 | 2,600ms | 2,940ms | < 3,000ms |

The 150ms window kept the total under 3 seconds. With the 500ms window, the stage-by-stage sum is 2,940ms and still nominally under budget -- but the observed P99 is 4,200ms, because batch wait and Bedrock latency both have long tails that compound.

Resolution -- Step-by-Step Runbook

Immediate (0-5 minutes):

  1. Revert the batch window to 150ms:
    curl -X POST https://ops.mangaassist.internal/config/batch-window \
      -d '{"batch_window_ms": 150, "reason": "latency_regression"}'
    
  2. Monitor RequestLatencyP99 -- should drop below 3,000ms within 2-3 minutes.

Short-term (within 1 week):

  1. Implement an adaptive batch window that adjusts based on current traffic and latency:
| Condition | Batch Window | Rationale |
| --- | --- | --- |
| Off-peak (TPS < 5,000) | 500ms | Low traffic; batching saves cost, latency budget has room |
| Normal (5,000 < TPS < 15,000) | 250ms | Balance between cost and latency |
| Peak (TPS > 15,000) | 100ms | Latency-sensitive; minimal batching delay |
| Latency breach (P99 > 2,500ms) | 50ms | Emergency: minimize all non-essential latency |
  2. Add the adaptive logic to the ThroughputManager (a micro-batcher sketch that applies this window follows the list):

    def _adaptive_batch_window(self, current_tps: float, current_p99_ms: float) -> int:
        if current_p99_ms > 2_500:
            return 50
        if current_tps > 15_000:
            return 100
        if current_tps > 5_000:
            return 250
        return 500
    

  3. Add monitoring for QueueWaitTime with an alarm at P99 > 300ms.
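
To show where the adaptive window plugs in, a minimal asyncio micro-batcher sketch: it collects requests until the adaptive window closes or the batch reaches 8 requests (matching the fill-rate metric above). Apart from _adaptive_batch_window, all names are illustrative:

    import asyncio
    import time

    MAX_BATCH_SIZE = 8  # matches the "out of 8 max" fill-rate metric

    class MicroBatcher:
        def __init__(self, throughput_manager, flush_fn):
            self._tm = throughput_manager          # exposes _adaptive_batch_window(...)
            self._flush_fn = flush_fn              # coroutine that sends one batch to Bedrock
            self._queue: asyncio.Queue = asyncio.Queue()

        async def submit(self, request) -> None:
            await self._queue.put(request)

        async def run(self, get_current_tps, get_current_p99_ms) -> None:
            while True:
                first = await self._queue.get()    # block until at least one request arrives
                batch = [first]
                window_ms = self._tm._adaptive_batch_window(get_current_tps(), get_current_p99_ms())
                deadline = time.monotonic() + window_ms / 1000

                # Fill the batch until the window closes or the batch is full.
                while len(batch) < MAX_BATCH_SIZE:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        break
                    try:
                        batch.append(await asyncio.wait_for(self._queue.get(), timeout=remaining))
                    except asyncio.TimeoutError:
                        break

                await self._flush_fn(batch)        # one Bedrock invocation per batch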

Prevention

| Prevention Measure | Implementation | Owner |
| --- | --- | --- |
| Latency impact assessment required for all batch config changes | Add to change management checklist | Platform team |
| Adaptive batch window (auto-adjusts to traffic conditions) | Deploy the adaptive logic above | Platform team |
| QueueWaitTime P99 alarm at 300ms | CloudWatch alarm | Platform team |
| Load test batch window changes at peak traffic before deploying | Add to performance test suite | QA team |

Scenario 4: Auto-Scaling Lag -- ECS Scales But Bedrock Provisioned Throughput Doesn't Match

Problem Statement

Wednesday 17:30 JST. The evening ramp is underway. The ECS auto-scaler correctly detects rising traffic and scales from 15 to 35 tasks over 10 minutes. However, Bedrock provisioned throughput remains at the afternoon level (8 Sonnet units = 8,000 TPS). The 35 ECS tasks generate enough concurrent Bedrock invocations to exceed provisioned capacity, causing throttling. The irony: scaling the orchestrator layer made the Bedrock bottleneck worse by increasing the rate of Bedrock calls.

Detection

| Signal | Source | Value | Expected |
| --- | --- | --- | --- |
| ECS RunningCount | AWS/ECS | 35 (scaled from 15) | Correct; auto-scaler working |
| InvocationThrottles | AWS/Bedrock | 23 in last 60s | 0 |
| TokensPerSecond | MangaAssist custom metric | 12,500 | Below evening peak, but above 8K provisioned |
| Provisioned Sonnet units | Bedrock console | 8 (afternoon level) | Should be 15 (evening level) |
| Scheduled throughput increase | EventBridge Scheduler | Scheduled for 17:00 JST | Should have fired 30 min ago |
| Scheduler Lambda execution | CloudWatch Logs | Last execution: ERROR at 17:00 | Should have succeeded |

Root Cause Analysis

flowchart TD
    A[ECS scaled correctly but<br>Bedrock throttling] --> B{Did provisioned<br>throughput schedule<br>fire at 17:00?}
    B -->|No| C[ROOT CAUSE 1:<br>EventBridge rule disabled<br>or deleted.]
    B -->|Yes| D{Did the Lambda<br>succeed?}
    D -->|No| E{Check Lambda<br>error logs}
    E --> F{Error type?}
    F -->|Permission denied| G[ROOT CAUSE 2:<br>IAM role for Lambda<br>missing bedrock:Update<br>ProvisionedModelThroughput<br>permission.]
    F -->|Resource not found| H[ROOT CAUSE 3:<br>Provisioned model ARN<br>changed after a model<br>upgrade. Lambda has<br>stale ARN.]
    F -->|Timeout| I[ROOT CAUSE 4:<br>Lambda timeout too short<br>for Bedrock API call.]
    F -->|Throttled| J[ROOT CAUSE 5:<br>Bedrock API rate limit<br>on management operations.]
    D -->|Yes| K{Is the new capacity<br>active?}
    K -->|No| L[ROOT CAUSE 6:<br>Provisioning still in<br>progress. Lead time<br>insufficient.]
    K -->|Yes| M[Not a provisioning issue.<br>Investigate traffic pattern<br>exceeding expectations.]

In this incident: Root Cause 2. During a recent security hardening sprint, the Lambda execution role was recreated with a tighter policy. The bedrock:UpdateProvisionedModelThroughput permission was accidentally omitted from the new role. The Lambda ran at 17:00 JST, received an AccessDeniedException, logged the error, and exited. No alarm was configured on Lambda errors for this function.

Resolution -- Step-by-Step Runbook

Immediate (0-5 minutes):

  1. Manually increase provisioned throughput via AWS CLI:
    aws bedrock update-provisioned-model-throughput \
      --provisioned-model-id <sonnet-provisioned-id> \
      --desired-model-units 15
    
  2. While provisioning activates (10-15 min), apply throttle mitigation:
     - Extend Redis cache TTL to 90s.
     - Downgrade faq and greeting intents to Haiku.
     - Activate aggressive request coalescing.

Short-term (0-30 minutes):

  1. Fix the Lambda IAM role:
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:UpdateProvisionedModelThroughput",
        "bedrock:GetProvisionedModelThroughput",
        "bedrock:ListProvisionedModelThroughputs"
      ],
      "Resource": "arn:aws:bedrock:ap-northeast-1:*:provisioned-model/*"
    }
    
  2. Re-run the Lambda manually to confirm it succeeds.
  3. Verify provisioned throughput reaches the expected level.

Post-incident (within 24 hours):

  1. Add a CloudWatch alarm on Lambda errors for all provisioning Lambdas.
  2. Add a "provisioned throughput health check" that runs every 15 minutes and verifies that the current provisioned capacity matches the expected schedule (a sketch follows this list).
  3. Add IAM policy validation to the CI/CD pipeline for infrastructure changes.
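
A minimal sketch of the health check from step 2, using the Bedrock control-plane API to read current model units and comparing them against a schedule table mirroring Scenario 2. The environment variable, metric name, and schedule values are assumptions:

    import os
    from datetime import datetime
    from zoneinfo import ZoneInfo

    import boto3

    bedrock = boto3.client("bedrock")
    cloudwatch = boto3.client("cloudwatch")

    # Hypothetical schedule: hour-of-day (JST) -> expected Sonnet units, from the Scenario 2 plan.
    EXPECTED_SONNET_UNITS = {range(2, 6): 0, range(6, 12): 5, range(12, 17): 8}

    def expected_units_now() -> int:
        hour = datetime.now(ZoneInfo("Asia/Tokyo")).hour
        for hours, units in EXPECTED_SONNET_UNITS.items():
            if hour in hours:
                return units
        return 15  # 17:00-02:00 evening peak default

    def lambda_handler(event, context):
        resp = bedrock.get_provisioned_model_throughput(
            provisionedModelId=os.environ["SONNET_PROVISIONED_ID"]
        )
        actual = resp["modelUnits"]
        expected = expected_units_now()

        # Publish a 0/1 mismatch metric; alarm on any non-zero value.
        cloudwatch.put_metric_data(
            Namespace="MangaAssist",
            MetricData=[{
                "MetricName": "ProvisionedCapacityMismatch",
                "Value": 0 if actual == expected else 1,
                "Unit": "Count",
            }],
        )
        return {"actual_units": actual, "expected_units": expected, "status": resp["status"]}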

Prevention

| Prevention Measure | Implementation | Owner |
| --- | --- | --- |
| Alarm on provisioning Lambda errors | CloudWatch alarm: Lambda Errors > 0 for provisioning functions | Platform team |
| Provisioned throughput health check | Lambda that compares actual vs expected capacity every 15 min | Platform team |
| IAM policy unit tests | CDK/Terraform tests that assert required permissions are present | Security + Platform |
| Coupled scaling: ECS + Bedrock in single workflow | Modify auto-scaler to adjust provisioned throughput whenever ECS scales | Platform team |
| Decouple ECS scaling rate from Bedrock capacity | Rate-limit Bedrock invocations per ECS task to prevent overwhelming provisioned capacity | Platform team |
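
The last prevention row caps how fast each ECS task can call Bedrock. A minimal token-bucket sketch; the per-task rate and burst values are assumed budgets, not figures from this document:

    import asyncio
    import time

    class BedrockRateLimiter:
        """Token bucket limiting Bedrock invocations per ECS task."""

        def __init__(self, invocations_per_second: float, burst: int):
            self._rate = invocations_per_second
            self._capacity = burst
            self._tokens = float(burst)
            self._last = time.monotonic()
            self._lock = asyncio.Lock()

        async def acquire(self) -> None:
            async with self._lock:
                while True:
                    now = time.monotonic()
                    self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
                    self._last = now
                    if self._tokens >= 1:
                        self._tokens -= 1
                        return
                    # Sleep just long enough for one token to accrue.
                    await asyncio.sleep((1 - self._tokens) / self._rate)

    # Example: cap each task at 10 invocations/second with a burst of 20 (assumed per-task budget).
    limiter = BedrockRateLimiter(invocations_per_second=10, burst=20)

Each orchestrator task would call `await limiter.acquire()` before every Bedrock invocation, so adding ECS tasks no longer multiplies the burst rate hitting provisioned capacity.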

Decision Tree: ECS Scaling + Bedrock Capacity Mismatch

flowchart TD
    A[ECS tasks increased but<br>Bedrock throttling detected] --> B{Is provisioned<br>throughput at expected<br>level for this time?}

    B -->|No, it's lower| C{Check scheduler<br>Lambda logs}
    C -->|Lambda error| D[Fix Lambda error<br>and re-run manually]
    C -->|Lambda didn't run| E[Check EventBridge rule<br>Is it enabled?]
    C -->|Lambda succeeded| F[Check Bedrock:<br>is provisioning still<br>in progress?]

    B -->|Yes, at expected level| G{Is actual TPS<br>exceeding provisioned<br>capacity?}
    G -->|Yes| H[Traffic higher than<br>planned. Increase<br>provisioned units.]
    G -->|No, TPS is within capacity| I{Check per-task<br>invocation rate}
    I -->|Too many invocations per task| J[Rate-limit invocations<br>per ECS task to prevent<br>burst amplification]
    I -->|Normal| K[Investigate Bedrock<br>internal throttling.<br>Contact AWS support.]

Scenario 5: Capacity Planning Miss -- Underestimated Token Throughput for New Recommendation Feature

Problem Statement

A new "Manga Taste Profile" feature launches on Thursday. This feature generates a personalized taste analysis by sending the user's reading history (last 50 manga titles, genres, ratings) plus a detailed system prompt to Claude 3 Sonnet, asking it to produce a structured taste profile with explanations. The feature was load-tested using synthetic data with 10-title histories, but real users have much richer histories.

One week after launch, the feature accounts for 35% of total Bedrock token consumption despite being only 8% of total requests. The average token count per "taste profile" request is 3,200 tokens (vs the 800 tokens estimated during planning). Monthly Bedrock cost projections have increased by $42,000.

Detection

| Signal | Source | Value | Planned |
| --- | --- | --- | --- |
| Avg tokens per "taste_profile" request | MangaAssist/TokensPerIntent | 3,200 | 800 (planned) |
| "taste_profile" share of total tokens | CloudWatch Insights query | 35% | 8% (planned) |
| "taste_profile" share of total requests | CloudWatch Insights query | 8% | 8% (matches) |
| Monthly Bedrock cost delta | Cost Explorer | +$42,000 projected | +$10,000 (planned) |
| Provisioned throughput utilization | MangaAssist/WasteRatio (inverted) | 88% during peak | Was 65% before feature |
| P99 latency for "taste_profile" | MangaAssist/LatencyByIntent | 5,800ms | 3,000ms (target) |

Root Cause Analysis

flowchart TD
    A[taste_profile consuming 4x<br>more tokens than planned] --> B{Was the feature<br>load-tested?}
    B -->|No| C[ROOT CAUSE 1:<br>No load test. Deploy<br>without performance<br>validation.]
    B -->|Yes| D{Was the test data<br>representative?}
    D -->|No| E[ROOT CAUSE 2:<br>Synthetic test data had<br>10-title histories.<br>Real users have 50+<br>titles. Prompt size<br>scales with history length.]
    D -->|Yes| F{Was token counting<br>included in the test?}
    F -->|No| G[ROOT CAUSE 3:<br>Load test measured<br>latency and error rate<br>but not token consumption.<br>Cost impact was invisible.]
    F -->|Yes| H{Were results reviewed<br>against capacity plan?}
    H -->|No| I[ROOT CAUSE 4:<br>Results existed but<br>no one compared them<br>to the capacity model.]
    H -->|Yes| J[Investigate: did user<br>behavior change post-launch<br>in unexpected ways?]

In this incident: Root Cause 2 + Root Cause 3. The load test used synthetic users with short histories (10 titles), and the test framework measured latency and HTTP errors but did not aggregate token counts. The 4x token multiplier was invisible until it hit production and showed up in cost reports a week later.

Token breakdown for a taste_profile request:

| Component | Estimated (10-title history) | Actual (50-title history) | Multiplier |
| --- | --- | --- | --- |
| System prompt (taste analysis instructions) | 200 tokens | 200 tokens | 1x |
| User reading history context | 150 tokens (10 titles) | 750 tokens (50 titles) | 5x |
| Genre and rating metadata | 50 tokens | 250 tokens | 5x |
| Output (taste profile analysis) | 400 tokens | 2,000 tokens | 5x (model generates proportionally longer analysis for richer input) |
| Total | 800 tokens | 3,200 tokens | 4x |

Resolution -- Step-by-Step Runbook

Immediate (0-2 hours):

  1. Add token count tracking per intent to the monitoring dashboard (if not already present).
  2. Increase provisioned throughput to accommodate the higher token load:
    # Evening peak needs additional capacity
    # Previous: 15 Sonnet units (15K TPS)
    # New: 20 Sonnet units (20K TPS) to absorb the taste_profile load
    
  3. Implement a history truncation strategy for taste_profile prompts (a sketch follows this list):
     - Instead of sending all 50 titles, send the 20 most recent + top 10 by rating.
     - Summarize the remaining titles as genre counts: "Also read: 15 shonen, 5 seinen, 3 josei."
     - This reduces input context from 750 tokens to ~350 tokens without significant quality loss.
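
A sketch of the truncation from step 3. The history record fields (title, genre, rating, read_at) are assumed, not the actual MangaAssist schema:

    from collections import Counter

    def truncate_history(history: list[dict], recent: int = 20, top_rated: int = 10) -> str:
        """Reduce a full reading history to 20 recent + 10 top-rated titles plus a genre summary."""
        by_recency = sorted(history, key=lambda h: h["read_at"], reverse=True)
        keep = by_recency[:recent]
        kept_titles = {h["title"] for h in keep}

        # Add the top-rated titles not already kept, up to the 30-title cap.
        by_rating = sorted(history, key=lambda h: h["rating"], reverse=True)
        for entry in by_rating:
            if len(keep) >= recent + top_rated:
                break
            if entry["title"] not in kept_titles:
                keep.append(entry)
                kept_titles.add(entry["title"])

        # Summarize whatever was dropped as genre counts, e.g. "Also read: 15 shonen, 5 seinen".
        dropped = [h for h in history if h["title"] not in kept_titles]
        genre_counts = Counter(h["genre"] for h in dropped)
        summary = ", ".join(f"{count} {genre}" for genre, count in genre_counts.most_common())

        lines = [f'- {h["title"]} ({h["genre"]}, rated {h["rating"]}/5)' for h in keep]
        if summary:
            lines.append(f"Also read: {summary}.")
        return "\n".join(lines)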

Short-term (within 1 week):

  1. Redesign the taste_profile prompt to be token-efficient:
| Optimization | Token Savings | Quality Impact |
| --- | --- | --- |
| Truncate history to 30 titles (20 recent + 10 top) | -200 input tokens | Minimal; most taste signal is in recent reads |
| Summarize remaining history as genre counts | -200 input tokens | Low; genre distribution preserved |
| Constrain output to structured JSON instead of prose | -1,200 output tokens | Moderate; less readable but still useful |
| Use max_tokens=800 instead of default 1,024 | -200 output tokens (avg) | None; most profiles complete in 600-800 tokens |
| Total savings | ~1,800 tokens (56% reduction) | Low-to-moderate |
  2. After optimization, the taste_profile request drops from 3,200 to ~1,400 tokens -- much closer to the original 800-token estimate.

  3. Update the capacity model with the corrected per-intent token counts.

Post-incident (within 2 weeks):

  1. Add a token budget per intent to the orchestrator. Any intent exceeding its token budget triggers a warning log and a CloudWatch metric, catching future planning misses early (see the sketch after this list).
  2. Require load tests to use production-representative data (anonymized). Add a pre-launch checklist item: "Load test data has representative distribution of user history lengths."
  3. Add token consumption to load test reports alongside latency and error rates.
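
A minimal sketch of the per-intent token budget check from step 1. The budget values and metric name are illustrative, and the token counts are assumed to come from the invocation's usage metadata:

    import logging

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    logger = logging.getLogger("mangaassist.token_budget")

    # Illustrative budgets (tokens per request), derived from the corrected capacity model.
    TOKEN_BUDGETS = {"faq": 300, "greeting": 100, "recommendation": 1_200, "taste_profile": 1_400}

    def check_token_budget(intent: str, input_tokens: int, output_tokens: int) -> None:
        total = input_tokens + output_tokens
        budget = TOKEN_BUDGETS.get(intent)
        if budget is None or total <= budget:
            return

        logger.warning("intent=%s used %d tokens, budget is %d", intent, total, budget)
        cloudwatch.put_metric_data(
            Namespace="MangaAssist",
            MetricData=[{
                "MetricName": "TokenBudgetExceeded",
                "Dimensions": [{"Name": "Intent", "Value": intent}],
                "Value": total - budget,
                "Unit": "Count",
            }],
        )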

Prevention

| Prevention Measure | Implementation | Owner |
| --- | --- | --- |
| Token budget per intent | Orchestrator config + alarm when exceeded | Platform team |
| Production-representative load test data | Anonymized data pipeline from production to test | Data engineering |
| Token counting in load test reports | Update load test framework to aggregate and report tokens | QA team |
| Pre-launch capacity review gate | New features must present token estimates with supporting data before launch | Engineering manager |
| Prompt token estimation tool | Utility that estimates token count for a prompt template + sample input distribution | Platform team |
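
For the prompt token estimation tool in the last row, a rough sketch based on a characters-per-token heuristic. This is an approximation only; real tokenizer counts differ, and Japanese text typically produces more tokens per character than English:

    import statistics

    CHARS_PER_TOKEN = 4  # rough heuristic for English text; treat the output as an estimate only

    def estimate_tokens(text: str) -> int:
        return max(1, len(text) // CHARS_PER_TOKEN)

    def estimate_prompt_tokens(template: str, sample_inputs: list[dict]) -> dict:
        """Estimate token counts for a prompt template rendered over a sample input distribution."""
        counts = [estimate_tokens(template.format(**sample)) for sample in sample_inputs]
        return {
            "p50": statistics.median(counts),
            "p95": statistics.quantiles(counts, n=20)[18],  # 95th percentile cut point
            "max": max(counts),
        }

Running this against an anonymized sample of real user histories (rather than synthetic 10-title users) would have surfaced the 4x gap before launch.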

Decision Tree: Unexpected Token Consumption After Feature Launch

flowchart TD
    A[New feature consuming more<br>tokens than planned] --> B{Was per-intent<br>token tracking active<br>before launch?}
    B -->|No| C[Enable token tracking<br>per intent immediately.<br>Measure actual vs planned.]
    B -->|Yes| D{How large is<br>the variance?}

    D -->|< 50% over plan| E[Acceptable variance.<br>Update capacity model.<br>Monitor for 1 week.]
    D -->|50-200% over plan| F{Is the feature<br>prompt optimizable?}
    F -->|Yes| G[Optimize prompt:<br>truncate context,<br>constrain output,<br>reduce system prompt.]
    F -->|No| H[Increase provisioned<br>throughput. Update<br>cost projections.<br>Inform stakeholders.]

    D -->|> 200% over plan| I{Is the feature<br>critical?}
    I -->|Yes| J[Emergency: Increase<br>provisioned throughput.<br>Redesign prompt in<br>parallel. Consider<br>Haiku for non-critical<br>portions.]
    I -->|No| K[Consider feature flag<br>rollback until prompt<br>is redesigned.]

Cross-Scenario Summary

| # | Scenario | Core Skill Tested | Key Metric | Resolution Time |
| --- | --- | --- | --- | --- |
| 1 | Manga release 5x spike + throttling | Capacity planning, event provisioning | InvocationThrottles, TPS | 5-15 min (mitigation), 24h (prevention) |
| 2 | Over-provisioned overnight waste | Utilization monitoring, cost optimization | Waste Ratio | 1 day (schedule deploy), 1 week (validation) |
| 3 | Batch window too large at peak | Batching strategy, latency management | QueueWaitTime P99, RequestLatencyP99 | 5 min (revert), 1 week (adaptive batching) |
| 4 | ECS scales but Bedrock doesn't match | Auto-scaling coordination, coupled scaling | InvocationThrottles + ECS RunningCount divergence | 15-30 min (manual fix), 24h (IAM fix + alarms) |
| 5 | Token underestimate for new feature | Capacity planning, token budgeting | TokensPerIntent, cost delta | 2h (increase throughput), 1 week (prompt optimization) |

Runbook Quick Reference

Universal First Response Checklist (Any High-Performance FM Incident)

  1. Open the CloudWatch dashboard. Identify which metrics are breaching.
  2. Check Bedrock throttles. If InvocationThrottles > 0, this is the top priority.
  3. Check provisioned throughput status. Is it at the expected level for this time of day?
  4. Check ECS task count. Is the auto-scaler responding?
  5. Check the event calendar. Is there a scheduled event that should have triggered pre-provisioning?
  6. Check recent deployments. Was anything changed in the last 24 hours (batch window, model routing, prompt templates, IAM policies)?

Escalation Criteria

| Condition | Action |
| --- | --- |
| Throttles > 0 for more than 5 minutes after mitigation started | Page senior on-call + AWS TAM |
| P99 latency > 8 seconds for more than 10 minutes | Page senior on-call |
| Provisioned throughput increase request fails (API error) | Contact AWS Support (Severity 1) |
| Cost anomaly: daily spend > 2x forecast | Alert FinOps team + engineering manager |
| Multiple scenarios occurring simultaneously (e.g., throttling + scaling lag) | Incident commander mode: dedicated war room |