
Resource Allocation Scenarios & Runbooks

AWS AIP-C01 Task 4.2 — Skill 4.2.5: Right-size resources to optimize FM application throughput

Context: MangaAssist JP manga e-commerce chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. 1M messages/day, peak 20K concurrent sessions.


Skill Mapping

| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Responsible AI & Monitoring | Task 4.2 — Optimize FM application performance | Skill 4.2.5 — Right-size resources to optimize FM application throughput |

Focus: Five real-world resource allocation failure scenarios grounded in MangaAssist operations. Each scenario includes problem statement, detection, root cause analysis, resolution steps, and prevention — with decision tree diagrams.


Scenario 1 — ECS Under-Provisioned During Manga Release Event

Problem

A highly anticipated manga chapter release (Attack on Titan final volume reprint) drives 8x normal traffic to MangaAssist within 15 minutes. The ECS Fargate orchestrator CPU hits 95%, auto-scaling responds but cannot provision new tasks fast enough. Users experience 5-15 second response times (normal: 1-2 seconds). Some WebSocket connections time out entirely.

Detection

| Signal | Source | Value | Threshold |
|---|---|---|---|
| ECS CPUUtilization | CloudWatch AWS/ECS | 95% | Alarm at >85% for 2 min |
| ECS RunningTaskCount | CloudWatch AWS/ECS | 45 (max was 50) | Approaching max capacity |
| Orchestrator P95 latency | Custom metric | 12,400 ms | SLA: < 2,000 ms |
| API Gateway WebSocket errors | CloudWatch AWS/ApiGateway | 4xx errors spike | Alarm at >1% error rate |
| Bedrock invocations queued | Custom metric PendingInvocations | 340 queued | Alarm at >50 |

First alert: CloudWatch composite alarm fires — ECS CPU >85% AND P95 latency >3,000ms simultaneously.
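A minimal sketch of wiring this composite alarm with boto3, assuming the two child metric alarms already exist; the alarm names and SNS topic ARN are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fires only when BOTH child alarms are in ALARM state, so a transient CPU
# spike that never hurts latency (or vice versa) does not page anyone.
cloudwatch.put_composite_alarm(
    AlarmName="mangaassist-ecs-saturation",  # hypothetical
    AlarmRule="ALARM(ecs-cpu-above-85) AND ALARM(orchestrator-p95-above-3000ms)",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:oncall-page"],
    ActionsEnabled=True,
)
```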

Decision Tree

```mermaid
flowchart TD
    A["ALERT: ECS CPU > 85%<br/>P95 Latency > 3,000ms"] --> B{"Is auto-scaling<br/>already responding?"}

    B -->|"Yes — tasks increasing"| C{"Current task count<br/>near max capacity?"}
    B -->|"No — no scaling activity"| D["CHECK: Is scaling policy<br/>attached and active?"]

    C -->|"Yes — at or near max"| E["IMMEDIATE: Increase<br/>max task count limit<br/>via ECS service update"]
    C -->|"No — room to grow"| F{"Are new tasks<br/>starting successfully?"}

    D --> D1["Fix: Verify auto-scaling<br/>target, re-attach policy"]
    D1 --> D2["Manual: Set desired count<br/>to immediate need"]

    F -->|"Yes — tasks pending"| G["WAIT: Fargate<br/>provisioning lag ~60-120s<br/>Monitor recovery"]
    F -->|"No — tasks failing"| H["CHECK: Task definition<br/>errors, ECR pull failures,<br/>health check failures"]

    E --> E1["Increase max to 200"]
    E1 --> E2{"CPU still > 90%<br/>after 5 min?"}
    E2 -->|"Yes"| E3["ESCALATE: Increase task<br/>size (4 vCPU / 8 GB)<br/>OR enable Bedrock throttle<br/>graceful degradation"]
    E2 -->|"No — recovering"| E4["Monitor for 30 min<br/>then review scaling policy"]

    G --> G1{"Latency recovering<br/>within 5 min?"}
    G1 -->|"Yes"| G2["Incident resolved.<br/>Post-mortem: tune scale-out<br/>aggressiveness"]
    G1 -->|"No"| E["Increase max capacity"]

    H --> H1["Fix task definition<br/>or container image issues"]
    H1 --> D2
```

Root Cause

  1. Max task count too low: The service was configured with max_capacity = 50, sufficient for normal 3x peaks but not for an 8x event spike.
  2. Scale-out cooldown too long: The target tracking policy had a 120-second scale-out cooldown, meaning the service could only add tasks every 2 minutes.
  3. No predictive or scheduled scaling: The manga release was a known event (announced 2 weeks prior) but no scheduled scaling action was configured.
  4. Task start time: Fargate cold start is 60-120 seconds. At 8x traffic, the 2-minute scaling + 2-minute start cycle means 4+ minutes before relief.

Resolution

Immediate (during incident):

1. Increase max task count: 50 → 200 (ECS service update)
2. Set desired count manually: 50 → 120 (skip auto-scaling wait; see the boto3 sketch after this list)
3. Enable graceful degradation: route all traffic to Haiku (faster, cheaper)
4. Activate semantic cache aggressive mode: lower similarity threshold 0.92 → 0.85
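Steps 1-2 might look like the following with boto3; the capacity numbers mirror the runbook, while the cluster and service names are hypothetical placeholders:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
ecs = boto3.client("ecs")

# Step 1: raise the Application Auto Scaling ceiling (max task count 50 -> 200)
# so scale-out has headroom.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/mangaassist-prod/orchestrator",  # hypothetical names
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=200,
)

# Step 2: skip the auto-scaling reaction delay by setting desired count directly.
ecs.update_service(
    cluster="mangaassist-prod",
    service="orchestrator",
    desiredCount=120,
)
```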

Within 1 hour:

5. Verify all 120 tasks are healthy and serving
6. Monitor CPU stabilization below 70%
7. Monitor latency recovery to < 2,000ms P95
8. Re-enable Sonnet routing once load stabilizes

Prevention

| Prevention Measure | Implementation | Owner |
|---|---|---|
| Increase default max to 200 | Update ECS service auto-scaling config | Platform team |
| Reduce scale-out cooldown to 30s | Update target tracking policy | Platform team |
| Add scheduled scaling for known events | Create runbook: 30 min before event, set min=100 (see the sketch below) | On-call SRE |
| Predictive scaling based on traffic history | Enable AWS predictive scaling using 4-week data | Platform team |
| Warm pool with pre-pulled containers | Keep 10 stopped tasks ready with image pre-cached | Platform team |
| Graceful degradation playbook | Auto-switch to Haiku-only when CPU > 90% | Application team |
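The scheduled-scaling row could be implemented with Application Auto Scaling scheduled actions rather than a purely manual runbook step; the event times and resource names below are hypothetical, and a paired action reverts the floor after the event window:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# 30 minutes before the announced release, raise the floor so capacity is
# already in place when traffic arrives.
autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="manga-release-prewarm",
    ResourceId="service/mangaassist-prod/orchestrator",  # hypothetical
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="at(2025-03-22T09:30:00)",  # UTC; 30 min before the event
    ScalableTargetAction={"MinCapacity": 100, "MaxCapacity": 200},
)

# After the event window, restore the normal floor automatically.
autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="manga-release-revert",
    ResourceId="service/mangaassist-prod/orchestrator",
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="at(2025-03-22T15:00:00)",
    ScalableTargetAction={"MinCapacity": 10, "MaxCapacity": 200},
)
```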

Scenario 2 — DynamoDB Throttling on Session Reads During Concurrent User Spike

Problem

During a Friday evening flash sale (19:00 JST), 25,000 users simultaneously open MangaAssist to check manga deals. Each session start triggers 3-5 DynamoDB reads (session context, conversation history, user preferences). The session table — configured in provisioned mode with auto-scaling — cannot scale fast enough. ReadThrottleEvents spike to 500/second. Users see "Session loading..." for 10+ seconds or lose conversation history entirely.

Detection

| Signal | Source | Value | Threshold |
|---|---|---|---|
| ReadThrottleEvents | CloudWatch AWS/DynamoDB | 500/sec | Alarm at >0 sustained |
| ConsumedReadCapacityUnits | CloudWatch AWS/DynamoDB | 4,800 RCU | Provisioned: 3,000 RCU |
| SuccessfulRequestLatency (read) | CloudWatch AWS/DynamoDB | 45 ms (normally 5 ms) | Alarm at >20 ms |
| Session load errors | Application logs | "ProvisionedThroughputExceededException" | Any occurrence |
| Orchestrator retry rate | Custom metric | 35% of DynamoDB calls retrying | Alarm at >5% |

Decision Tree

```mermaid
flowchart TD
    A["ALERT: DynamoDB<br/>ReadThrottleEvents > 0"] --> B{"Table mode?"}

    B -->|"Provisioned"| C{"Auto-scaling<br/>active and responding?"}
    B -->|"On-Demand"| D["Check for<br/>hot partition keys"]

    C -->|"Yes — scaling up"| E{"Current RCU near<br/>account/table limit?"}
    C -->|"No — not scaling"| F["CHECK: Auto-scaling<br/>policy configuration"]

    E -->|"Yes — near limit"| G["Request limit increase<br/>via AWS Support"]
    E -->|"No — has headroom"| H{"Throttle events<br/>persisting > 5 min?"}

    H -->|"Yes"| I["Hot partition detected.<br/>Check partition key<br/>distribution."]
    H -->|"No"| J["Scaling caught up.<br/>Monitor recovery."]

    F --> F1["Fix: Re-attach auto-scaling<br/>Set target utilization 70%"]
    F1 --> F2["Manual: Increase provisioned<br/>RCU immediately"]

    D --> D1{"Throttle on specific<br/>partition keys?"}
    D1 -->|"Yes"| D2["Hot partition: session_id<br/>or user_id skew.<br/>Add write sharding suffix."]
    D1 -->|"No"| D3["On-demand table hit<br/>2x previous peak limit.<br/>Gradually increase traffic<br/>or pre-warm."]

    I --> I1["Redesign partition key:<br/>Add shard suffix<br/>or use composite key<br/>(user_region#user_id)"]

    G --> G1["SHORT-TERM: Switch<br/>table to On-Demand mode<br/>(takes effect immediately)"]
```

Root Cause

  1. Provisioned mode with insufficient headroom: The table had 3,000 RCU provisioned with an auto-scaling target of 70%. Auto-scaling takes 1-2 minutes to react, but flash sale traffic arrived in seconds.
  2. Auto-scaling reaction speed: DynamoDB auto-scaling can increase throughput without limit (decreases are capped at 4 per day), but each increase waits on CloudWatch alarm evaluation (typically 2 consecutive 1-minute data points) before triggering.
  3. Hot partition on session_id: The flash sale landing page created new sessions that all mapped to a small number of DynamoDB partitions because session_id is time-sequential (UUIDs with a timestamp prefix), causing temporal hot spots (see the sketch below).
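A sketch of the session ID change implied by root cause 3; both helpers are hypothetical illustrations:

```python
import time
import uuid

def session_id_time_prefixed() -> str:
    # Anti-pattern from root cause 3: sessions created in the same second
    # share a prefix, concentrating traffic into temporal hot spots.
    return f"{int(time.time())}-{uuid.uuid4().hex[:12]}"

def session_id_random() -> str:
    # Fix: a fully random UUID spreads keys evenly across partitions.
    return str(uuid.uuid4())
```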

Resolution

Immediate (during incident):

1. Switch table to On-Demand mode (AWS Console or CLI — takes effect in minutes; see the boto3 sketch after this list)
2. If staying provisioned: manually set RCU to 10,000 (well above current need)
3. Enable application-level retry with exponential backoff (should already exist)
4. If session reads fail after 3 retries: serve with empty context (degraded but functional)
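Steps 1 and 3 sketched with boto3; the table name is hypothetical, and botocore's built-in retry modes stand in for a hand-rolled exponential backoff:

```python
import boto3
from botocore.config import Config

# Step 1: flip the table to on-demand. Note DynamoDB allows one billing-mode
# switch per table per 24 hours.
boto3.client("dynamodb").update_table(
    TableName="mangaassist-sessions",  # hypothetical
    BillingMode="PAY_PER_REQUEST",
)

# Step 3: orchestrator-side client with exponential backoff built in;
# "adaptive" mode also rate-limits client-side when throttling is detected.
dynamodb = boto3.client(
    "dynamodb",
    config=Config(retries={"max_attempts": 3, "mode": "adaptive"}),
)
```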

Within 24 hours:

5. Verify throttle events dropped to zero
6. Analyze partition key distribution using CloudWatch Contributor Insights
7. If hot partitions confirmed: plan partition key redesign
8. Cost-compare on-demand vs provisioned for this traffic pattern

Prevention

| Prevention Measure | Implementation | Owner |
|---|---|---|
| Switch to On-Demand mode permanently | Update table configuration | Platform team |
| Use random UUID session IDs (no timestamp prefix) | Update session ID generation | Application team |
| Add ElastiCache read-through for session data | Cache active sessions in Redis | Application team |
| DynamoDB Contributor Insights enabled | Turn on for session table | Platform team |
| Pre-warm provisioned capacity before events | If staying provisioned: schedule RCU increase 1hr before | On-call SRE |
| Circuit breaker on DynamoDB reads | Return cached/empty session after 3 failures | Application team |

Scenario 3 — OpenSearch OCU Auto-Scale Lag Causing Search Latency Degradation

Problem

During a weekend manga recommendation campaign, search traffic increases 4x over 30 minutes. OpenSearch Serverless auto-scaling detects the need for additional search OCUs but takes 10-15 minutes to provision them. During this window, vector search P95 latency degrades from 40ms to 350ms, causing overall chatbot response time to exceed the 3-second SLA for recommendation queries.

Detection

| Signal | Source | Value | Threshold |
|---|---|---|---|
| SearchLatency P95 | CloudWatch AWS/AOSS | 350 ms | Alarm at >100 ms |
| SearchRate | CloudWatch AWS/AOSS | 450 queries/sec | Baseline: 120 queries/sec |
| Active Search OCU | CloudWatch AWS/AOSS | 4 (same as baseline) | Expected: 8-12 for this load |
| Orchestrator RAG latency | Custom metric | 1,800 ms (includes retries) | Alarm at >500 ms |
| End-to-end response time | Custom metric | 4,200 ms | SLA: < 3,000 ms |

Decision Tree

```mermaid
flowchart TD
    A["ALERT: OpenSearch<br/>P95 Latency > 100ms"] --> B{"Is search traffic<br/>elevated vs baseline?"}

    B -->|"Yes — traffic spike"| C{"Are OCUs<br/>auto-scaling up?"}
    B -->|"No — normal traffic"| D["Check index health:<br/>segment count, merge<br/>activity, shard imbalance"]

    C -->|"Yes — OCUs increasing<br/>(visible in CloudWatch)"| E["WAIT: OCU provisioning<br/>takes 5-15 min.<br/>Enable mitigations."]
    C -->|"No — OCUs static"| F["CHECK: Collection<br/>auto-scaling config.<br/>Is max OCU set too low?"]

    E --> E1["MITIGATE while waiting:"]
    E1 --> E2["1. Reduce k in k-NN<br/>from 10 to 5"]
    E1 --> E3["2. Increase cache<br/>aggressiveness for<br/>search results"]
    E1 --> E4["3. Route simple queries<br/>to keyword search<br/>instead of vector search"]

    E2 --> G{"Latency recovered<br/>after OCU scale?"}
    G -->|"Yes"| H["Resolved. Review base<br/>OCU allocation."]
    G -->|"No"| I["ESCALATE: Check<br/>HNSW index size vs OCU<br/>memory. May need<br/>index optimization."]

    F --> F1["Increase max search OCU<br/>from 4 to 20"]
    F1 --> F2["If still not scaling:<br/>Check if collection<br/>type supports auto-scaling"]

    D --> D1["Run _cat/segments<br/>Check for large<br/>segment merges"]
    D1 --> D2["If merge storms:<br/>Reduce refresh interval<br/>or schedule bulk indexing<br/>during off-peak"]
```

Root Cause

  1. OCU scaling lag: OpenSearch Serverless auto-scaling detects load increases but takes 5-15 minutes to provision additional search OCUs. This is an inherent platform limitation, not a misconfiguration.
  2. Base search OCU too low: The collection was configured with only 2 search OCU as the baseline, leaving no buffer for sudden traffic increases.
  3. HNSW index memory pressure: The manga embedding index (500K vectors x 1536 dimensions) requires ~6 GB of memory. At 2 search OCU, the index fits but leaves little headroom for concurrent search operations.
  4. No search result caching: Identical vector searches (common during campaigns where users search similar terms) hit OpenSearch every time.

Resolution

Immediate (during incident):

1. Reduce k-NN k parameter from 10 to 5 (fewer neighbors = faster search)
2. Enable search result caching in Redis (cache top queries with a 5-minute TTL; see the sketch after this list)
3. Route FAQ-intent queries to keyword search (bypass vector search)
4. If available: increase max search OCU limit in collection settings
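Mitigations 1-2 sketched together, assuming an opensearch-py client and redis-py; the index name, vector field, and cache endpoint are hypothetical:

```python
import hashlib
import json

import redis

r = redis.Redis(host="mangaassist-cache.internal", port=6379)  # hypothetical

def knn_query(embedding: list[float], k: int = 5) -> dict:
    # Mitigation 1: k lowered from 10 to 5 -- fewer HNSW neighbors per search.
    return {
        "size": k,
        "query": {"knn": {"manga_embedding": {"vector": embedding, "k": k}}},
    }

def cached_search(os_client, embedding: list[float]) -> dict:
    # Mitigation 2: memoize identical vector searches for 5 minutes.
    key = "search:" + hashlib.sha256(json.dumps(embedding).encode()).hexdigest()
    hit = r.get(key)
    if hit:
        return json.loads(hit)
    result = os_client.search(index="manga-catalog", body=knn_query(embedding))
    r.setex(key, 300, json.dumps(result))  # 5-minute TTL per the runbook
    return result
```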

Within 24 hours:

5. Increase base search OCU to 4 (prevents cold-start lag for moderate spikes)
6. Implement search result caching layer in orchestrator
7. Analyze query patterns: are users sending identical queries? Pre-compute and cache
8. Review HNSW index parameters (ef_search, m) for efficiency

Prevention

| Prevention Measure | Implementation | Owner |
|---|---|---|
| Increase base search OCU to 4+ | Update collection configuration | Platform team |
| Add Redis-based search result caching | Cache vector search results with 5-min TTL | Application team |
| Implement approximate search fallback | If P95 > 200ms, switch to pre-computed recommendations | Application team |
| Pre-warm before campaigns | Synthetically increase search load 30 min before events | SRE team |
| Monitor OCU utilization trend | Dashboard with search OCU count vs search rate correlation | Platform team |
| Consider OpenSearch provisioned (non-serverless) | For predictable high-traffic workloads, managed clusters give more control | Architecture team |

Scenario 4 — ElastiCache Memory Exhaustion from Growing Semantic Cache

Problem

Over 3 weeks, the MangaAssist semantic cache in ElastiCache Redis grows from 4 GB to 12.5 GB (on a 13 GB r6g.large node). The cache was intended to store frequently asked query embeddings and their responses, but the eviction policy was set to noeviction instead of allkeys-lfu. When memory hits 100%, Redis rejects all new write commands: new cache entries cannot be stored, so the hit rate decays from 35% toward 0% as query traffic shifts, while the remaining entries serve increasingly stale data. Bedrock invocation count spikes 35% as queries bypass the cache.

Detection

| Signal | Source | Value | Threshold |
|---|---|---|---|
| DatabaseMemoryUsagePercentage | CloudWatch AWS/ElastiCache | 96% (then 100%) | Alarm at >80% |
| BytesUsedForCache | CloudWatch AWS/ElastiCache | 12.5 GB | Node capacity: 13 GB |
| CacheHitRate | Custom metric | Dropping from 35% to 0% | Alarm at <20% |
| Evictions | CloudWatch AWS/ElastiCache | 0 (noeviction policy) | Should be >0 when full |
| Bedrock InvocationsPerMinute | CloudWatch AWS/Bedrock | +35% spike | Alarm at >20% increase |
| CommandsFailed (OOM errors) | CloudWatch AWS/ElastiCache | Rising | Alarm at >0 |

Decision Tree

```mermaid
flowchart TD
    A["ALERT: ElastiCache Memory > 80%"] --> B{"Eviction policy?"}

    B -->|"noeviction"| C["CRITICAL: Redis will reject<br/>writes at 100% memory.<br/>Change policy immediately."]
    B -->|"allkeys-lfu or allkeys-lru"| D{"Eviction rate<br/>abnormally high?"}

    C --> C1["IMMEDIATE: Change maxmemory-policy<br/>to allkeys-lfu<br/>(no restart needed)"]
    C1 --> C2{"Memory still > 95%<br/>after policy change?"}

    C2 -->|"Yes — eviction not<br/>fast enough"| C3["Manually flush low-value<br/>cache entries by pattern:<br/>SCAN + DEL old TTL keys"]
    C2 -->|"No — eviction working"| C4["Monitor. Evictions will<br/>keep memory in check."]

    D -->|"Yes — evicting too much"| E{"Cache growing due to<br/>new entry volume or<br/>entry size increase?"}
    D -->|"No — healthy eviction"| F["System healthy.<br/>Adjust alarm threshold."]

    E -->|"New entries (unique queries)"| G["Check: Are low-value<br/>queries being cached?<br/>Add minimum similarity<br/>threshold for caching."]
    E -->|"Entry size growth"| H["Check: Are cached responses<br/>getting larger? Review<br/>response truncation policy."]

    G --> G1["Implement selective caching:<br/>Only cache queries with<br/>>5 occurrences in 24hr"]

    C3 --> C5["PLAN: Upgrade node type<br/>to r6g.xlarge (26 GB)<br/>or enable cluster mode"]
```

Root Cause

  1. Wrong eviction policy: The Redis cluster was configured with maxmemory-policy: noeviction (the default). This means Redis returns errors on writes when memory is full instead of evicting old keys.
  2. No TTL on cache entries: Cached query embeddings and responses were stored without an expiration time, causing unbounded growth.
  3. Caching everything: Every unique query — including one-off misspellings and extremely specific questions — was cached. The cache stored 250K entries when only the top 30K (most-repeated queries) provided meaningful hit rate.
  4. No cache size monitoring alarm: Memory usage was not monitored with an alarm. The team discovered the issue only when Bedrock costs spiked.

Resolution

Immediate (during incident):

1. Change maxmemory-policy to allkeys-lfu via the cluster's parameter group
   (ElastiCache disables the raw CONFIG command; maxmemory-policy changes
   apply immediately, no restart needed)

2. Set a TTL on all existing keys that lack one:
   Use SCAN to iterate + EXPIRE on each key (24hr TTL)

3. If memory is at 100% and write errors are occurring:
   Flush low-frequency keys manually with a SCAN + OBJECT FREQ + DEL pass
   (steps 2-3 are sketched after this list)

4. Monitor: Eviction count should start rising, memory should stabilize
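Steps 2-3 sketched with redis-py; the endpoint is hypothetical, and cursor-based SCAN is used instead of KEYS so the pass does not block Redis:

```python
import redis
from redis.exceptions import ResponseError

r = redis.Redis(host="mangaassist-cache.internal", port=6379)  # hypothetical

# Step 2: backfill a 24-hour TTL onto keys that have none (TTL of -1).
for key in r.scan_iter(count=500):
    if r.ttl(key) == -1:
        r.expire(key, 86_400)

# Step 3: if memory is still pegged, drop the coldest entries. OBJECT FREQ
# only works once an LFU eviction policy is active (step 1).
for key in r.scan_iter(count=500):
    try:
        if r.object("freq", key) <= 1:
            r.delete(key)
    except ResponseError:
        pass  # key expired between SCAN and OBJECT, or policy not LFU yet
```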

Within 48 hours:

5. Implement selective caching logic in orchestrator:
   - Only cache queries with similarity < 0.95 to existing entries
   - Only cache queries seen > 3 times in rolling 24hr window
6. Set default TTL of 24 hours on all new cache entries
7. Add CloudWatch alarm: memory > 75% for 10 min → warning
8. Add CloudWatch alarm: memory > 85% for 5 min → critical (both sketched below)
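The two alarms in steps 7-8 might be created like this with boto3; the alarm names, cache cluster ID, and SNS ARNs are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

for name, threshold, minutes, topic in [
    ("elasticache-memory-warning", 75, 10, "oncall-warning"),
    ("elasticache-memory-critical", 85, 5, "oncall-critical"),
]:
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/ElastiCache",
        MetricName="DatabaseMemoryUsagePercentage",
        Dimensions=[{"Name": "CacheClusterId", "Value": "mangaassist-cache-001"}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=minutes,  # N one-minute periods over threshold
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[f"arn:aws:sns:ap-northeast-1:123456789012:{topic}"],
    )
```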

Prevention

| Prevention Measure | Implementation | Owner |
|---|---|---|
| Set maxmemory-policy to allkeys-lfu | Update parameter group (persists across restarts) | Platform team |
| Default TTL on all cache entries | 24-hour TTL in application caching logic | Application team |
| Selective caching (frequency threshold) | Only cache queries seen 3+ times in 24hr | Application team |
| Memory monitoring alarms | 75% warning, 85% critical in CloudWatch | Platform team |
| Weekly cache efficiency report | Dashboard: entries, hit rate, memory, eviction rate | SRE team |
| Cache size budgeting | Enforce max 70% memory target; resize node if baseline > 60% | Platform team |

Scenario 5 — Cost Spike from Over-Provisioned Resources During Quiet Period

Problem

After scaling up for a major manga release event on Saturday, the MangaAssist infrastructure remains at peak capacity through the following quiet week. ECS is running 100 tasks (normal need: 15-25), Bedrock provisioned throughput remains at event levels, and OpenSearch has 12 search OCUs (normal: 4). The team does not notice because there are no performance issues — everything works perfectly. But the weekly AWS cost report shows costs at 280% of baseline ($42,000 for the week vs the normal $15,000).

Detection

| Signal | Source | Value | Threshold |
|---|---|---|---|
| ECS RunningTaskCount | CloudWatch AWS/ECS | 100 tasks | Normal baseline: 20 tasks |
| ECS CPUUtilization | CloudWatch AWS/ECS | 12% | Under-utilization: <30% |
| Bedrock InvocationsPerMinute | CloudWatch AWS/Bedrock | 15% of provisioned | Under-utilization: <30% for >24hr |
| OpenSearch Active Search OCU | CloudWatch AWS/AOSS | 12 OCU | Baseline need: 4 OCU |
| AWS Cost Explorer daily spend | Cost Explorer | $6,000/day | Budget: $2,200/day |
| Cost anomaly detection | AWS Cost Anomaly Detection | 280% of baseline | Alert at >50% deviation |

Decision Tree

```mermaid
flowchart TD
    A["ALERT: Daily cost<br/>at 280% of budget"] --> B["Identify cost drivers<br/>via Cost Explorer<br/>by service and tag"]

    B --> C{"Which services<br/>are over-budget?"}

    C --> D["ECS: 100 tasks running<br/>at 12% CPU utilization"]
    C --> E["Bedrock: Provisioned throughput<br/>at event levels"]
    C --> F["OpenSearch: 12 OCU<br/>vs 4 needed"]

    D --> D1{"Auto-scaling policy<br/>scale-in working?"}
    D1 -->|"Yes but min capacity<br/>was raised for event"| D2["Reset min capacity<br/>from 100 → 10"]
    D1 -->|"No — scale-in disabled<br/>or cooldown too long"| D3["Fix scale-in policy:<br/>Enable, reduce cooldown"]

    E --> E1["Review provisioned throughput:<br/>Is it auto-managed or manual?"]
    E1 --> E2["Manual: Reduce to normal<br/>levels immediately"]
    E1 --> E3["If using on-demand: costs<br/>scale automatically (no action)"]

    F --> F1["OpenSearch auto-scaling<br/>should reduce OCU.<br/>Check if collection has<br/>min OCU set too high."]
    F1 --> F2["Reset min search OCU<br/>to 2 if manually raised"]

    D2 --> G["VERIFY: All services<br/>returning to baseline<br/>capacity within 1 hour"]
    D3 --> G
    E2 --> G
    F2 --> G

    G --> H["POST-MORTEM: Why wasn't<br/>this caught sooner?<br/>Add cost monitoring."]

    H --> I["Implement: Daily cost<br/>anomaly alert + weekly<br/>utilization efficiency report"]
```

Root Cause

  1. Manual scaling not reverted: During the event, the on-call SRE manually increased ECS min_capacity to 100 and raised Bedrock provisioned throughput. After the event, no one reverted these settings.
  2. No post-event checklist: There was no runbook step for "revert temporary scaling changes after event."
  3. Scale-in too conservative: ECS auto-scaling scale-in cooldown was 300 seconds, and the target tracking policy required 15 minutes below target before scaling in. With min_capacity = 100, it could not scale below 100 regardless.
  4. No cost alerting: AWS Budgets was configured with monthly thresholds only. A weekly 280% spike did not trigger the monthly alarm until week 3.

Resolution

Immediate (within 1 hour):

1. Reset ECS min_capacity: 100 → 10
2. Reset ECS desired count: let auto-scaling determine (remove manual override)
3. Reduce Bedrock provisioned throughput to normal levels
4. Verify OpenSearch search OCU is auto-scaling down
5. Estimated savings: ~$3,800/day immediately

Within 1 week:

6. Implement daily cost anomaly detection (AWS Cost Anomaly Detection)
7. Create post-event scaling revert checklist in runbook
8. Add automated revert: EventBridge rule that resets min_capacity
   48 hours after a manual override event (sketched after this list)
9. Create weekly "efficiency report" dashboard:
   - Service utilization vs capacity
   - Cost per message trend
   - Over-provisioned resource flags
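A sketch of the automated revert from step 8, assuming a Lambda function invoked by a one-shot EventBridge Scheduler schedule; all names, ARNs, and the baseline value are hypothetical:

```python
import json

import boto3

BASELINE_MIN = 10  # hypothetical normal floor for the orchestrator service

def lambda_handler(event, context):
    # Invoked ~48h after a manual override; restores the baseline floor.
    boto3.client("application-autoscaling").register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=event["resource_id"],  # e.g. "service/mangaassist-prod/orchestrator"
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=BASELINE_MIN,
    )
    return {"reverted": event["resource_id"], "min_capacity": BASELINE_MIN}

def schedule_revert(resource_id: str, revert_at_iso: str) -> None:
    # Called when the override is applied; creates a one-time schedule.
    boto3.client("scheduler").create_schedule(
        Name="scaling-revert-" + resource_id.replace("/", "-"),
        ScheduleExpression=f"at({revert_at_iso})",  # e.g. "at(2025-03-24T19:00:00)"
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            "Arn": "arn:aws:lambda:ap-northeast-1:123456789012:function:scaling-revert",
            "RoleArn": "arn:aws:iam::123456789012:role/scheduler-invoke",
            "Input": json.dumps({"resource_id": resource_id}),
        },
    )
```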

Prevention

| Prevention Measure | Implementation | Owner |
|---|---|---|
| Automated scaling revert | EventBridge + Lambda: auto-revert min_capacity 24hr after manual increase | Platform team |
| Daily cost anomaly alerts | AWS Cost Anomaly Detection with Slack notification | Finance / SRE |
| Post-event runbook checklist | "Revert event scaling" step in incident response runbook | SRE team |
| Weekly efficiency dashboard | CloudWatch dashboard: utilization heatmap across all services | Platform team |
| Tag event-driven scaling changes | Tag manual overrides with event:manga-release-20250322 for tracking | SRE team |
| Cost per message metric | Custom metric: total daily cost / total daily messages; alert at >$0.015/msg (see the sketch below) | Platform team |
| Budget alerts at daily granularity | AWS Budgets with daily spend threshold ($3,000/day) | Finance team |
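The cost-per-message metric might be published like this; Cost Explorer data lags roughly a day, so the job would run daily over the prior day. The namespace and metric names are hypothetical:

```python
import boto3

def publish_cost_per_message(day_start: str, day_end: str, total_messages: int):
    # Pull the prior day's unblended spend from Cost Explorer.
    resp = boto3.client("ce").get_cost_and_usage(
        TimePeriod={"Start": day_start, "End": day_end},  # "YYYY-MM-DD" strings
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    cost = float(resp["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])

    # Publish the prevention-table metric; alarm separately at >$0.015/msg.
    boto3.client("cloudwatch").put_metric_data(
        Namespace="MangaAssist/Efficiency",  # hypothetical namespace
        MetricData=[{
            "MetricName": "CostPerMessage",
            "Value": cost / max(total_messages, 1),
            "Unit": "None",
        }],
    )
```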

Cross-Scenario Summary

| Scenario | Service | Failure Mode | Detection Speed | Impact | Key Prevention |
|---|---|---|---|---|---|
| 1. ECS under-provisioned | ECS Fargate | Scale-out too slow for 8x spike | 2 min (alarm) | High latency, timeouts | Predictive scaling + scheduled pre-warm |
| 2. DynamoDB throttling | DynamoDB | Provisioned capacity exceeded | 1 min (throttle alarm) | Session load failures | Switch to on-demand mode |
| 3. OpenSearch OCU lag | OpenSearch Serverless | OCU provisioning takes 10-15 min | 5 min (latency alarm) | Search latency degraded ~10x | Higher base OCU + search result caching |
| 4. ElastiCache memory full | ElastiCache Redis | Wrong eviction policy | 30 min (cost spike noticed) | Cache disabled, Bedrock cost +35% | Set allkeys-lfu + TTL on entries |
| 5. Cost over-provision | All services | Manual scaling not reverted | 7 days (weekly cost report) | Costs at 280% of baseline ($27K waste) | Automated revert + daily cost alerts |

Severity and Urgency Matrix

```mermaid
quadrantChart
    title Scenario Severity vs Detection Urgency
    x-axis Low Urgency --> High Urgency
    y-axis Low Severity --> High Severity
    quadrant-1 Critical — Immediate action
    quadrant-2 Important — Action within hours
    quadrant-3 Monitor — Track and prevent
    quadrant-4 Urgent — Fix fast, low blast radius
    ECS under-provisioned: [0.85, 0.90]
    DynamoDB throttling: [0.80, 0.75]
    OpenSearch OCU lag: [0.65, 0.60]
    ElastiCache memory: [0.45, 0.55]
    Cost over-provision: [0.25, 0.70]
```

Universal Resource Allocation Runbook Template

For any resource allocation incident in MangaAssist, follow this structured approach:

Phase 1 — Detect (0-2 minutes)

  1. Identify which CloudWatch alarm fired
  2. Determine affected service and metric
  3. Check if auto-scaling is already responding
  4. Classify: under-provisioned (performance) or over-provisioned (cost)

Phase 2 — Assess (2-5 minutes)

  1. Check current capacity vs required capacity
  2. Determine if this is a known event or unexpected spike
  3. Evaluate blast radius: which user-facing features are affected?
  4. Decide: auto-scaling will catch up vs manual intervention needed

Phase 3 — Mitigate (5-15 minutes)

  1. If under-provisioned: increase capacity manually (desired count, provisioned units)
  2. If over-provisioned: schedule reduction (do not reduce during business hours without buffer)
  3. Enable graceful degradation if capacity cannot be increased fast enough
  4. Communicate status to stakeholders

Phase 4 — Resolve (15-60 minutes)

  1. Verify metrics returning to healthy range
  2. Confirm auto-scaling policies are stable
  3. Document actions taken and timeline
  4. If manual overrides applied: schedule revert

Phase 5 — Prevent (within 1 week)

  1. Post-mortem: why did auto-scaling not prevent this?
  2. Update scaling policies (thresholds, cooldowns, min/max)
  3. Add or refine monitoring alarms
  4. Update runbook with lessons learned
  5. If recurring: implement automated remediation (Lambda, EventBridge)

Key Takeaways

  1. Auto-scaling is necessary but not sufficient: Every scenario above involved auto-scaling that was either misconfigured, too slow, or not engaged. Manual intervention is still needed for extreme events.

  2. Known events need scheduled scaling: Manga releases and flash sales are predictable. Pre-warming resources 30 minutes before eliminates the most common cause of under-provisioning.

  3. Cost over-provisioning is a silent failure: Unlike under-provisioning (which users notice immediately), over-provisioning can persist for weeks. Daily cost anomaly detection is essential.

  4. Defaults matter: ElastiCache noeviction default, DynamoDB provisioned mode, ECS conservative cooldowns — all of these defaults are wrong for a high-traffic GenAI application. Review and override every default.

  5. Always plan the revert: Every manual scaling action during an incident must have a corresponding revert action. Automate the revert with a time-delayed trigger to ensure it never gets forgotten.