Resource Allocation Scenarios & Runbooks
AWS AIP-C01 Task 4.2 — Skill 4.2.5: Right-size resources to optimize FM application throughput
Context: MangaAssist JP manga e-commerce chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. 1M messages/day, peak 20K concurrent sessions.
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Responsible AI & Monitoring | Task 4.2 — Optimize FM application performance | Skill 4.2.5 — Right-size resources to optimize FM application throughput |
Focus: Five real-world resource allocation failure scenarios grounded in MangaAssist operations. Each scenario includes problem statement, detection, root cause analysis, resolution steps, and prevention — with decision tree diagrams.
Scenario 1 — ECS Under-Provisioned During Manga Release Event
Problem
A highly anticipated manga chapter release (Attack on Titan final volume reprint) drives 8x normal traffic to MangaAssist within 15 minutes. The ECS Fargate orchestrator CPU hits 95%, auto-scaling responds but cannot provision new tasks fast enough. Users experience 5-15 second response times (normal: 1-2 seconds). Some WebSocket connections time out entirely.
Detection
| Signal | Source | Value | Threshold |
|---|---|---|---|
| ECS CPUUtilization | CloudWatch AWS/ECS | 95% | Alarm at >85% for 2 min |
| ECS RunningTaskCount | CloudWatch AWS/ECS | 45 (max was 50) | Approaching max capacity |
| Orchestrator P95 latency | Custom metric | 12,400 ms | SLA: < 2,000 ms |
| API Gateway WebSocket errors | CloudWatch AWS/ApiGateway | 4xx errors spike | Alarm at >1% error rate |
| Bedrock invocations queued | Custom metric PendingInvocations | 340 queued | Alarm at >50 |
First alert: CloudWatch composite alarm fires — ECS CPU >85% AND P95 latency >3,000ms simultaneously.
Decision Tree
flowchart TD
A["ALERT: ECS CPU > 85%<br/>P95 Latency > 3,000ms"] --> B{"Is auto-scaling<br/>already responding?"}
B -->|"Yes — tasks increasing"| C{"Current task count<br/>near max capacity?"}
B -->|"No — no scaling activity"| D["CHECK: Is scaling policy<br/>attached and active?"]
C -->|"Yes — at or near max"| E["IMMEDIATE: Increase<br/>max task count limit<br/>via ECS service update"]
C -->|"No — room to grow"| F{"Are new tasks<br/>starting successfully?"}
D --> D1["Fix: Verify auto-scaling<br/>target, re-attach policy"]
D1 --> D2["Manual: Set desired count<br/>to immediate need"]
F -->|"Yes — tasks pending"| G["WAIT: Fargate<br/>provisioning lag ~60-120s<br/>Monitor recovery"]
F -->|"No — tasks failing"| H["CHECK: Task definition<br/>errors, ECR pull failures,<br/>health check failures"]
E --> E1["Increase max to 200"]
E1 --> E2{"CPU still > 90%<br/>after 5 min?"}
E2 -->|"Yes"| E3["ESCALATE: Increase task<br/>size (4 vCPU / 8 GB)<br/>OR enable Bedrock throttle<br/>graceful degradation"]
E2 -->|"No — recovering"| E4["Monitor for 30 min<br/>then review scaling policy"]
G --> G1{"Latency recovering<br/>within 5 min?"}
G1 -->|"Yes"| G2["Incident resolved.<br/>Post-mortem: tune scale-out<br/>aggressiveness"]
G1 -->|"No"| E["Increase max capacity"]
H --> H1["Fix task definition<br/>or container image issues"]
H1 --> D2
Root Cause
- Max task count too low: The service was configured with `max_capacity = 50`, sufficient for normal 3x peaks but not for an 8x event spike.
- Scale-out cooldown too long: The target tracking policy had a 120-second scale-out cooldown, meaning the service could only add tasks every 2 minutes.
- No predictive or scheduled scaling: The manga release was a known event (announced 2 weeks prior) but no scheduled scaling action was configured.
- Task start time: Fargate cold start is 60-120 seconds. At 8x traffic, the 2-minute scaling cooldown plus the 2-minute start time means 4+ minutes before relief.
Resolution
Immediate (during incident):
1. Increase max task count: 50 → 200 (ECS service update)
2. Set desired count manually: 50 → 120 (skip auto-scaling wait)
3. Enable graceful degradation: route all traffic to Haiku (faster, cheaper)
4. Activate semantic cache aggressive mode: lower similarity threshold 0.92 → 0.85
Within 1 hour:
5. Verify all 120 tasks are healthy and serving
6. Monitor CPU stabilization below 70%
7. Monitor latency recovery to < 2,000ms P95
8. Re-enable Sonnet routing once load stabilizes
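Steps 1 and 2 above can be scripted so the on-call does not click through the console under pressure. A minimal boto3 sketch, assuming hypothetical cluster and service names (mangaassist-prod, chat-orchestrator) already registered with Application Auto Scaling:

```python
import boto3

CLUSTER = "mangaassist-prod"         # hypothetical cluster name
SERVICE = "chat-orchestrator"        # hypothetical ECS service name
RESOURCE_ID = f"service/{CLUSTER}/{SERVICE}"

autoscaling = boto3.client("application-autoscaling")
ecs = boto3.client("ecs")

# Step 1: raise the Application Auto Scaling ceiling so the policy can keep scaling out.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=200,
)

# Step 2: skip the target-tracking wait by setting the desired count directly.
ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=120)
```

Raising the ceiling first matters: a desired count above the scalable target's maximum would be pulled back down by the policy.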
Prevention
| Prevention Measure | Implementation | Owner |
|---|---|---|
| Increase default max to 200 | Update ECS service auto-scaling config | Platform team |
| Reduce scale-out cooldown to 30s | Update target tracking policy | Platform team |
| Add scheduled scaling for known events | Create runbook: 30 min before event, set min=100 | On-call SRE |
| Predictive scaling based on traffic history | Enable AWS predictive scaling using 4-week data | Platform team |
| Warm pool with pre-pulled containers | Keep 10 stopped tasks ready with image pre-cached | Platform team |
| Graceful degradation playbook | Auto-switch to Haiku-only when CPU > 90% | Application team |
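For the "scheduled scaling for known events" measure, Application Auto Scaling supports one-off scheduled actions. A sketch with the same hypothetical resource ID and illustrative times (given in UTC):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
RESOURCE_ID = "service/mangaassist-prod/chat-orchestrator"  # hypothetical identifiers

# Pre-warm 30 minutes before the announced release.
autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="manga-release-prewarm",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="at(2025-03-22T09:30:00)",          # one-off action, 30 min before the event
    ScalableTargetAction={"MinCapacity": 100, "MaxCapacity": 200},
)

# Matching revert a few hours later so capacity does not linger (see Scenario 5).
autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="manga-release-revert",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="at(2025-03-22T15:00:00)",
    ScalableTargetAction={"MinCapacity": 10, "MaxCapacity": 200},
)
```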
Scenario 2 — DynamoDB Throttling on Session Reads During Concurrent User Spike
Problem
During a Friday evening flash sale (19:00 JST), 25,000 users simultaneously open MangaAssist to check manga deals. Each session start triggers 3-5 DynamoDB reads (session context, conversation history, user preferences). The session table — configured in provisioned mode with auto-scaling — cannot scale fast enough. ReadThrottleEvents spike to 500/second. Users see "Session loading..." for 10+ seconds or lose conversation history entirely.
Detection
| Signal | Source | Value | Threshold |
|---|---|---|---|
| ReadThrottleEvents | CloudWatch AWS/DynamoDB | 500/sec | Alarm at >0 sustained |
| ConsumedReadCapacityUnits | CloudWatch AWS/DynamoDB | 4,800 RCU | Provisioned: 3,000 RCU |
| SuccessfulRequestLatency (read) | CloudWatch AWS/DynamoDB | 45 ms (normally 5 ms) | Alarm at >20 ms |
| Session load errors | Application logs | "ProvisionedThroughputExceededException" | Any occurrence |
| Orchestrator retry rate | Custom metric | 35% of DynamoDB calls retrying | Alarm at >5% |
Decision Tree
flowchart TD
A["ALERT: DynamoDB<br/>ReadThrottleEvents > 0"] --> B{"Table mode?"}
B -->|"Provisioned"| C{"Auto-scaling<br/>active and responding?"}
B -->|"On-Demand"| D["Check for<br/>hot partition keys"]
C -->|"Yes — scaling up"| E{"Current RCU near<br/>account/table limit?"}
C -->|"No — not scaling"| F["CHECK: Auto-scaling<br/>policy configuration"]
E -->|"Yes — near limit"| G["Request limit increase<br/>via AWS Support"]
E -->|"No — has headroom"| H{"Throttle events<br/>persisting > 5 min?"}
H -->|"Yes"| I["Hot partition detected.<br/>Check partition key<br/>distribution."]
H -->|"No"| J["Scaling caught up.<br/>Monitor recovery."]
F --> F1["Fix: Re-attach auto-scaling<br/>Set target utilization 70%"]
F1 --> F2["Manual: Increase provisioned<br/>RCU immediately"]
D --> D1{"Throttle on specific<br/>partition keys?"}
D1 -->|"Yes"| D2["Hot partition: session_id<br/>or user_id skew.<br/>Add write sharding suffix."]
D1 -->|"No"| D3["On-demand table hit<br/>2x previous peak limit.<br/>Gradually increase traffic<br/>or pre-warm."]
I --> I1["Redesign partition key:<br/>Add shard suffix<br/>or use composite key<br/>(user_region#user_id)"]
G --> G1["SHORT-TERM: Switch<br/>table to On-Demand mode<br/>(takes effect immediately)"]
Root Cause
- Provisioned mode with insufficient headroom: Table had 3,000 RCU provisioned with auto-scaling target at 70%. Auto-scaling takes 1-2 minutes to react, but flash sale traffic arrived in seconds.
- Auto-scaling reacts too slowly: DynamoDB auto-scaling can increase throughput without a count limit (only decreases are capped, at 4 per day), but each increase is gated by CloudWatch alarm evaluation periods (typically 2 consecutive 1-minute data points), so scale-up lagged a spike that arrived within seconds.
- Hot partition on `session_id`: The flash sale landing page created new sessions that all mapped to a small number of DynamoDB partitions because `session_id` is time-sequential (UUIDs with a timestamp prefix), causing temporal hot spots.
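The key-design fixes implied above (fully random session IDs, or a shard suffix on a skewed natural key, as the decision tree suggests) are small code changes. A sketch with illustrative helper names, not taken from the MangaAssist codebase:

```python
import random
import uuid

def new_session_id() -> str:
    """Fully random UUIDv4, replacing the timestamp-prefixed IDs named in the root cause."""
    return str(uuid.uuid4())

def sharded_key(user_id: str, shards: int = 10) -> str:
    """Alternative from the decision tree: composite key with a shard suffix for skewed keys."""
    return f"{user_id}#{random.randint(0, shards - 1)}"
```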
Resolution
Immediate (during incident):
1. Switch table to On-Demand mode (AWS Console or CLI — takes effect in minutes)
2. If staying provisioned: manually set RCU to 10,000 (well above current need)
3. Enable application-level retry with exponential backoff (should already exist)
4. If session reads fail after 3 retries: serve with empty context (degraded but functional)
Within 24 hours:
5. Verify throttle events dropped to zero
6. Analyze partition key distribution using CloudWatch Contributor Insights
7. If hot partitions confirmed: plan partition key redesign
8. Cost-compare on-demand vs provisioned for this traffic pattern
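A minimal boto3 sketch of steps 1, 3, and 4 above, assuming a placeholder table name (mangaassist-sessions) and adaptive client-side retries in the orchestrator:

```python
import boto3
from botocore.config import Config

# Step 1: flip the session table to on-demand billing.
dynamodb = boto3.client("dynamodb")
dynamodb.update_table(
    TableName="mangaassist-sessions",      # placeholder table name
    BillingMode="PAY_PER_REQUEST",
)

# Step 3: exponential backoff / adaptive retries for the orchestrator's reads.
retry_config = Config(retries={"max_attempts": 3, "mode": "adaptive"})
session_client = boto3.client("dynamodb", config=retry_config)

def load_session(session_id: str) -> dict:
    """Step 4: degrade to an empty context if reads keep failing after retries."""
    try:
        resp = session_client.get_item(
            TableName="mangaassist-sessions",
            Key={"session_id": {"S": session_id}},
        )
        return resp.get("Item", {})
    except session_client.exceptions.ProvisionedThroughputExceededException:
        return {}  # degraded but functional: serve without history
```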
Prevention
| Prevention Measure | Implementation | Owner |
|---|---|---|
| Switch to On-Demand mode permanently | Update table configuration | Platform team |
| Use random UUID session IDs (no timestamp prefix) | Update session ID generation | Application team |
| Add ElastiCache read-through for session data | Cache active sessions in Redis | Application team |
| DynamoDB Contributor Insights enabled | Turn on for session table | Platform team |
| Pre-warm provisioned capacity before events | If staying provisioned: schedule RCU increase 1hr before | On-call SRE |
| Circuit breaker on DynamoDB reads | Return cached/empty session after 3 failures | Application team |
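The "ElastiCache read-through for session data" measure can be layered in front of DynamoDB without touching the table. A redis-py sketch with a placeholder endpoint and key prefix:

```python
import json
import redis  # redis-py

cache = redis.Redis(host="mangaassist-cache.xxxxx.apne1.cache.amazonaws.com", port=6379)  # placeholder endpoint

def get_session(session_id: str, ddb_loader) -> dict:
    """Read-through: serve active sessions from Redis, fall back to DynamoDB on a miss."""
    key = f"session:{session_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    session = ddb_loader(session_id)             # e.g. the load_session() sketch above
    cache.set(key, json.dumps(session), ex=900)  # 15-minute TTL for active sessions
    return session
```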
Scenario 3 — OpenSearch OCU Auto-Scale Lag Causing Search Latency Degradation
Problem
During a weekend manga recommendation campaign, search traffic increases 4x over 30 minutes. OpenSearch Serverless auto-scaling detects the need for additional search OCUs but takes 10-15 minutes to provision them. During this window, vector search P95 latency degrades from 40ms to 350ms, causing overall chatbot response time to exceed the 3-second SLA for recommendation queries.
Detection
| Signal | Source | Value | Threshold |
|---|---|---|---|
| SearchLatency P95 | CloudWatch AWS/AOSS | 350 ms | Alarm at >100 ms |
| SearchRate | CloudWatch AWS/AOSS | 450 queries/sec | Baseline: 120 queries/sec |
| Active Search OCU | CloudWatch AWS/AOSS | 4 (same as baseline) | Expected: 8-12 for this load |
| Orchestrator RAG latency | Custom metric | 1,800 ms (includes retries) | Alarm at >500 ms |
| End-to-end response time | Custom metric | 4,200 ms | SLA: < 3,000 ms |
Decision Tree
flowchart TD
A["ALERT: OpenSearch<br/>P95 Latency > 100ms"] --> B{"Is search traffic<br/>elevated vs baseline?"}
B -->|"Yes — traffic spike"| C{"Are OCUs<br/>auto-scaling up?"}
B -->|"No — normal traffic"| D["Check index health:<br/>segment count, merge<br/>activity, shard imbalance"]
C -->|"Yes — OCUs increasing<br/>(visible in CloudWatch)"| E["WAIT: OCU provisioning<br/>takes 5-15 min.<br/>Enable mitigations."]
C -->|"No — OCUs static"| F["CHECK: Collection<br/>auto-scaling config.<br/>Is max OCU set too low?"]
E --> E1["MITIGATE while waiting:"]
E1 --> E2["1. Reduce k in k-NN<br/>from 10 to 5"]
E1 --> E3["2. Increase cache<br/>aggressiveness for<br/>search results"]
E1 --> E4["3. Route simple queries<br/>to keyword search<br/>instead of vector search"]
E2 --> G{"Latency recovered<br/>after OCU scale?"}
G -->|"Yes"| H["Resolved. Review base<br/>OCU allocation."]
G -->|"No"| I["ESCALATE: Check<br/>HNSW index size vs OCU<br/>memory. May need<br/>index optimization."]
F --> F1["Increase max search OCU<br/>from 4 to 20"]
F1 --> F2["If still not scaling:<br/>Check if collection<br/>type supports auto-scaling"]
D --> D1["Run _cat/segments<br/>Check for large<br/>segment merges"]
D1 --> D2["If merge storms:<br/>Reduce refresh interval<br/>or schedule bulk indexing<br/>during off-peak"]
Root Cause
- OCU scaling lag: OpenSearch Serverless auto-scaling detects load increases but takes 5-15 minutes to provision additional search OCUs. This is an inherent platform limitation, not a misconfiguration.
- Base search OCU too low: The collection was configured with only 2 search OCU as the baseline, leaving no buffer for sudden traffic increases.
- HNSW index memory pressure: The manga embedding index (500K vectors x 1536 dimensions) requires ~6 GB of memory. At 2 search OCU, the index fits but leaves little headroom for concurrent search operations.
- No search result caching: Identical vector searches (common during campaigns where users search similar terms) hit OpenSearch every time.
Resolution
Immediate (during incident):
1. Reduce k-NN k parameter from 10 to 5 (fewer neighbors = faster search)
2. Enable search result caching in Redis (cache top queries for 5 min TTL)
3. Route FAQ-intent queries to keyword search (bypass vector search)
4. If available: increase max search OCU limit in collection settings
Within 24 hours:
5. Increase base search OCU to 4 (prevents cold-start lag for moderate spikes)
6. Implement search result caching layer in orchestrator
7. Analyze query patterns: are users sending identical queries? Pre-compute and cache
8. Review HNSW index parameters (ef_search, m) for efficiency
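Steps 1, 2, and 6 above combine into a small orchestrator-side change. A sketch assuming a placeholder Redis endpoint, a hypothetical k-NN vector field named manga_embedding, and redis-py:

```python
import hashlib
import json
import redis  # redis-py

cache = redis.Redis(host="mangaassist-cache.xxxxx.apne1.cache.amazonaws.com", port=6379)  # placeholder endpoint

def knn_query(query_vector: list[float], k: int = 5) -> dict:
    """Step 1: OpenSearch k-NN body with k reduced from 10 to 5 during the incident."""
    return {
        "size": k,
        "query": {"knn": {"manga_embedding": {"vector": query_vector, "k": k}}},
    }

def cached_search(query_text: str, run_search) -> dict:
    """Steps 2/6: cache vector-search results for 5 minutes, keyed by the normalized query."""
    key = "search:" + hashlib.sha256(query_text.lower().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = run_search(query_text)   # calls OpenSearch with the body above
    cache.set(key, json.dumps(result), ex=300)
    return result
```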
Prevention
| Prevention Measure | Implementation | Owner |
|---|---|---|
| Increase base search OCU to 4+ | Update collection configuration | Platform team |
| Add Redis-based search result caching | Cache vector search results with 5-min TTL | Application team |
| Implement approximate search fallback | If P95 > 200ms, switch to pre-computed recommendations | Application team |
| Pre-warm before campaigns | Synthetically increase search load 30 min before events | SRE team |
| Monitor OCU utilization trend | Dashboard with search OCU count vs search rate correlation | Platform team |
| Consider OpenSearch provisioned (non-serverless) | For predictable high-traffic workloads, managed clusters give more control | Architecture team |
Scenario 4 — ElastiCache Memory Exhaustion from Growing Semantic Cache
Problem
Over 3 weeks, the MangaAssist semantic cache in ElastiCache Redis grows from 4 GB to 12.5 GB (on a 13 GB r6g.large node). The cache was intended to store frequently-asked query embeddings and their responses, but the eviction policy was set to noeviction instead of allkeys-lfu. When memory hits 100%, Redis rejects all new write commands. New cache entries cannot be stored, and the cache hit rate drops from 35% to 0% (no new entries) while existing entries serve stale data. Bedrock invocation count spikes 35% as all queries bypass cache.
Detection
| Signal | Source | Value | Threshold |
|---|---|---|---|
| DatabaseMemoryUsagePercentage | CloudWatch AWS/ElastiCache | 96% (then 100%) | Alarm at >80% |
| BytesUsedForCache | CloudWatch AWS/ElastiCache | 12.5 GB | Node capacity: 13 GB |
| CacheHitRate | Custom metric | Dropping from 35% to 0% | Alarm at <20% |
| Evictions | CloudWatch AWS/ElastiCache | 0 (noeviction policy) | Should be >0 when full |
| Bedrock InvocationsPerMinute | CloudWatch AWS/Bedrock | +35% spike | Alarm at >20% increase |
| CommandsFailed (OOM errors) | CloudWatch AWS/ElastiCache | Rising | Alarm at >0 |
Decision Tree
flowchart TD
A["ALERT: ElastiCache Memory > 80%"] --> B{"Eviction policy?"}
B -->|"noeviction"| C["CRITICAL: Redis will reject<br/>writes at 100% memory.<br/>Change policy immediately."]
B -->|"allkeys-lfu or allkeys-lru"| D{"Eviction rate<br/>abnormally high?"}
C --> C1["IMMEDIATE: Change maxmemory-policy<br/>to allkeys-lfu<br/>(no restart needed)"]
C1 --> C2{"Memory still > 95%<br/>after policy change?"}
C2 -->|"Yes — eviction not<br/>fast enough"| C3["Manually flush low-value<br/>cache entries by pattern:<br/>SCAN + DEL old TTL keys"]
C2 -->|"No — eviction working"| C4["Monitor. Evictions will<br/>keep memory in check."]
D -->|"Yes — evicting too much"| E{"Cache growing due to<br/>new entry volume or<br/>entry size increase?"}
D -->|"No — healthy eviction"| F["System healthy.<br/>Adjust alarm threshold."]
E -->|"New entries (unique queries)"| G["Check: Are low-value<br/>queries being cached?<br/>Add minimum similarity<br/>threshold for caching."]
E -->|"Entry size growth"| H["Check: Are cached responses<br/>getting larger? Review<br/>response truncation policy."]
G --> G1["Implement selective caching:<br/>Only cache queries with<br/>>5 occurrences in 24hr"]
C3 --> C5["PLAN: Upgrade node type<br/>to r6g.xlarge (26 GB)<br/>or enable cluster mode"]
Root Cause
- Wrong eviction policy: The Redis cluster was configured with `maxmemory-policy: noeviction` (the default). This means Redis returns errors on writes when memory is full instead of evicting old keys.
- No TTL on cache entries: Cached query embeddings and responses were stored without an expiration time, causing unbounded growth.
- Caching everything: Every unique query — including one-off misspellings and extremely specific questions — was cached. The cache stored 250K entries when only the top 30K (most-repeated queries) provided meaningful hit rate.
- No cache size monitoring alarm: Memory usage was not monitored with an alarm. The team discovered the issue only when Bedrock costs spiked.
Resolution
Immediate (during incident):
1. Change maxmemory-policy to allkeys-lfu:
Modify the cluster's cache parameter group (ElastiCache restricts the raw Redis CONFIG SET command)
(the parameter applies without a restart)
2. Set TTL on all existing keys without one:
Use SCAN to iterate + EXPIRE on each key (24hr TTL)
3. If memory at 100% and write errors occurring:
Flush low-frequency keys manually:
Use SCAN + OBJECT FREQ + DEL pattern
4. Monitor: Eviction count should start rising, memory should stabilize
Within 48 hours:
5. Implement selective caching logic in orchestrator:
- Only cache queries with similarity < 0.95 to existing entries
- Only cache queries seen > 3 times in rolling 24hr window
6. Set default TTL of 24 hours on all new cache entries
7. Add CloudWatch alarm: memory > 75% for 10 min → warning
8. Add CloudWatch alarm: memory > 85% for 5 min → critical
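A sketch of steps 1 and 2 above, assuming a hypothetical parameter group name and cache key prefix; the parameter group is the supported path because ElastiCache restricts the raw Redis CONFIG command:

```python
import boto3
import redis  # redis-py

# Step 1: flip the eviction policy via the cache parameter group.
elasticache = boto3.client("elasticache")
elasticache.modify_cache_parameter_group(
    CacheParameterGroupName="mangaassist-redis-params",   # hypothetical group name
    ParameterNameValues=[
        {"ParameterName": "maxmemory-policy", "ParameterValue": "allkeys-lfu"},
    ],
)

# Step 2: backfill a 24-hour TTL on keys that have none.
r = redis.Redis(host="mangaassist-cache.xxxxx.apne1.cache.amazonaws.com", port=6379)  # placeholder endpoint
for key in r.scan_iter(match="semcache:*", count=1000):  # placeholder key prefix
    if r.ttl(key) == -1:            # -1 means the key exists but has no expiry
        r.expire(key, 24 * 3600)
```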
Prevention
| Prevention Measure | Implementation | Owner |
|---|---|---|
| Set maxmemory-policy to allkeys-lfu | Update parameter group (persists across restarts) | Platform team |
| Default TTL on all cache entries | 24-hour TTL in application caching logic | Application team |
| Selective caching (frequency threshold) | Only cache queries seen 3+ times in 24hr | Application team |
| Memory monitoring alarms | 75% warning, 85% critical in CloudWatch | Platform team |
| Weekly cache efficiency report | Dashboard: entries, hit rate, memory, eviction rate | SRE team |
| Cache size budgeting | Enforce max 70% memory target; resize node if baseline > 60% | Platform team |
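The memory monitoring alarms from the table above are two put_metric_alarm calls. A boto3 sketch with a placeholder cluster ID and SNS topic:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Warning at 75% for 10 min, critical at 85% for 5 min (thresholds from the table above).
for name, threshold, periods in [("mangaassist-redis-mem-warning", 75, 10),
                                 ("mangaassist-redis-mem-critical", 85, 5)]:
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/ElastiCache",
        MetricName="DatabaseMemoryUsagePercentage",
        Dimensions=[{"Name": "CacheClusterId", "Value": "mangaassist-cache-001"}],  # placeholder ID
        Statistic="Average",
        Period=60,
        EvaluationPeriods=periods,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:mangaassist-alerts"],  # placeholder topic
    )
```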
Scenario 5 — Cost Spike from Over-Provisioned Resources During Quiet Period
Problem
After scaling up for a major manga release event on Saturday, the MangaAssist infrastructure remains at peak capacity through the following quiet week. ECS is running 100 tasks (normal need: 15-25), Bedrock provisioned throughput remains at event levels, and OpenSearch has 12 search OCUs (normal: 4). The team does not notice because there are no performance issues — everything works perfectly. But the weekly AWS cost report shows spend at 280% of baseline ($42,000 for the week vs the normal $15,000).
Detection
| Signal | Source | Value | Threshold |
|---|---|---|---|
| ECS RunningTaskCount | CloudWatch AWS/ECS | 100 tasks | Normal baseline: 20 tasks |
| ECS CPUUtilization | CloudWatch AWS/ECS | 12% | Under-utilization: <30% |
| Bedrock InvocationsPerMinute | CloudWatch AWS/Bedrock | 15% of provisioned | Under-utilization: <30% for >24hr |
| OpenSearch Active Search OCU | CloudWatch AWS/AOSS | 12 OCU | Baseline need: 4 OCU |
| AWS Cost Explorer daily spend | Cost Explorer | $6,000/day | Budget: $2,200/day |
| Cost anomaly detection | AWS Cost Anomaly Detection | 280% over baseline | Alert at >50% deviation |
Decision Tree
flowchart TD
A["ALERT: Daily cost<br/>280% over budget"] --> B["Identify cost drivers<br/>via Cost Explorer<br/>by service and tag"]
B --> C{"Which services<br/>are over-budget?"}
C --> D["ECS: 100 tasks running<br/>at 12% CPU utilization"]
C --> E["Bedrock: Provisioned throughput<br/>at event levels"]
C --> F["OpenSearch: 12 OCU<br/>vs 4 needed"]
D --> D1{"Auto-scaling policy<br/>scale-in working?"}
D1 -->|"Yes but min capacity<br/>was raised for event"| D2["Reset min capacity<br/>from 100 → 10"]
D1 -->|"No — scale-in disabled<br/>or cooldown too long"| D3["Fix scale-in policy:<br/>Enable, reduce cooldown"]
E --> E1["Review provisioned throughput:<br/>Is it auto-managed or manual?"]
E1 --> E2["Manual: Reduce to normal<br/>levels immediately"]
E1 --> E3["If using on-demand: costs<br/>scale automatically (no action)"]
F --> F1["OpenSearch auto-scaling<br/>should reduce OCU.<br/>Check if collection has<br/>min OCU set too high."]
F1 --> F2["Reset min search OCU<br/>to 2 if manually raised"]
D2 --> G["VERIFY: All services<br/>returning to baseline<br/>capacity within 1 hour"]
D3 --> G
E2 --> G
F2 --> G
G --> H["POST-MORTEM: Why wasn't<br/>this caught sooner?<br/>Add cost monitoring."]
H --> I["Implement: Daily cost<br/>anomaly alert + weekly<br/>utilization efficiency report"]
Root Cause
- Manual scaling not reverted: During the event, the on-call SRE manually increased ECS `min_capacity` to 100 and raised Bedrock provisioned throughput. After the event, no one reverted these settings.
- No post-event checklist: There was no runbook step for "revert temporary scaling changes after event."
- Scale-in too conservative: The ECS auto-scaling scale-in cooldown was 300 seconds, and the target tracking policy required 15 minutes below target before scaling in. With `min_capacity = 100`, it could not scale below 100 regardless.
- No cost alerting: AWS Budgets was configured with monthly thresholds only. A weekly 280% spike did not trigger the monthly alarm until week 3.
Resolution
Immediate (within 1 hour):
1. Reset ECS min_capacity: 100 → 10
2. Reset ECS desired count: let auto-scaling determine (remove manual override)
3. Reduce Bedrock provisioned throughput to normal levels
4. Verify OpenSearch search OCU is auto-scaling down
5. Estimated savings: ~$3,800/day immediately
Within 1 week:
6. Implement daily cost anomaly detection (AWS Cost Anomaly Detection)
7. Create post-event scaling revert checklist in runbook
8. Add automated revert: EventBridge rule that resets min_capacity 48 hours after a manual override event
9. Create weekly "efficiency report" dashboard:
- Service utilization vs capacity
- Cost per message trend
- Over-provisioned resource flags
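Steps 1 and 2 of the immediate actions are the mirror image of the Scenario 1 scale-up. A boto3 sketch, assuming the same hypothetical cluster and service names:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
ecs = boto3.client("ecs")

RESOURCE_ID = "service/mangaassist-prod/chat-orchestrator"   # hypothetical identifiers

# Step 1: drop the floor back to the normal baseline.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=200,
)

# Step 2: hand control back to the target-tracking policy by lowering the desired count;
# the policy settles on the right task count over the next few evaluation periods.
ecs.update_service(cluster="mangaassist-prod", service="chat-orchestrator", desiredCount=25)
```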
Prevention
| Prevention Measure | Implementation | Owner |
|---|---|---|
| Automated scaling revert | EventBridge + Lambda: auto-revert min_capacity 24hr after manual increase | Platform team |
| Daily cost anomaly alerts | AWS Cost Anomaly Detection with Slack notification | Finance / SRE |
| Post-event runbook checklist | "Revert event scaling" step in incident response runbook | SRE team |
| Weekly efficiency dashboard | CloudWatch dashboard: utilization heatmap across all services | Platform team |
| Tag event-driven scaling changes | Tag manual overrides with event:manga-release-20250322 for tracking | SRE team |
| Cost per message metric | Custom metric: total daily cost / total daily messages. Alert at >$0.015/msg | Platform team |
| Budget alerts at daily granularity | AWS Budgets with daily spend threshold ($3,000/day) | Finance team |
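The "cost per message" measure can run as a daily scheduled job. A sketch using Cost Explorer and a custom CloudWatch metric; the namespace and metric name are assumptions, and the >$0.015/msg alert would be a separate alarm on this metric:

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")                  # Cost Explorer
cloudwatch = boto3.client("cloudwatch")

def publish_cost_per_message(total_messages: int) -> None:
    """Daily job: yesterday's unblended cost divided by message count, as a custom metric."""
    end = date.today()
    start = end - timedelta(days=1)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    cost = float(resp["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Efficiency",     # assumed namespace
        MetricData=[{
            "MetricName": "CostPerMessage",     # assumed metric name
            "Value": cost / max(total_messages, 1),
            "Unit": "None",
        }],
    )
```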
Cross-Scenario Summary
| Scenario | Service | Failure Mode | Detection Speed | Impact | Key Prevention |
|---|---|---|---|---|---|
| 1. ECS under-provisioned | ECS Fargate | Scale-out too slow for 8x spike | 2 min (alarm) | High latency, timeouts | Predictive scaling + scheduled pre-warm |
| 2. DynamoDB throttling | DynamoDB | Provisioned capacity exceeded | 1 min (throttle alarm) | Session load failures | Switch to on-demand mode |
| 3. OpenSearch OCU lag | OpenSearch Serverless | OCU provision takes 10-15 min | 5 min (latency alarm) | Search latency 10x degraded | Higher base OCU + search result caching |
| 4. ElastiCache memory full | ElastiCache Redis | Wrong eviction policy | 30 min (cost spike noticed) | Cache disabled, Bedrock cost +35% | Set allkeys-lfu + TTL on entries |
| 5. Cost over-provision | All services | Manual scaling not reverted | 7 days (weekly cost report) | 280% cost overrun ($27K waste) | Automated revert + daily cost alerts |
Severity and Urgency Matrix
quadrantChart
title Scenario Severity vs Detection Urgency
x-axis Low Urgency --> High Urgency
y-axis Low Severity --> High Severity
quadrant-1 Critical — Immediate action
quadrant-2 Important — Action within hours
quadrant-3 Monitor — Track and prevent
quadrant-4 Urgent — Fix fast, low blast radius
ECS under-provisioned: [0.85, 0.90]
DynamoDB throttling: [0.80, 0.75]
OpenSearch OCU lag: [0.65, 0.60]
ElastiCache memory: [0.45, 0.55]
Cost over-provision: [0.25, 0.70]
Universal Resource Allocation Runbook Template
For any resource allocation incident in MangaAssist, follow this structured approach:
Phase 1 — Detect (0-2 minutes)
- Identify which CloudWatch alarm fired
- Determine affected service and metric
- Check if auto-scaling is already responding
- Classify: under-provisioned (performance) or over-provisioned (cost)
Phase 2 — Assess (2-5 minutes)
- Check current capacity vs required capacity
- Determine if this is a known event or unexpected spike
- Evaluate blast radius: which user-facing features are affected?
- Decide: auto-scaling will catch up vs manual intervention needed
Phase 3 — Mitigate (5-15 minutes)
- If under-provisioned: increase capacity manually (desired count, provisioned units)
- If over-provisioned: schedule reduction (do not reduce during business hours without buffer)
- Enable graceful degradation if capacity cannot be increased fast enough
- Communicate status to stakeholders
Phase 4 — Resolve (15-60 minutes)
- Verify metrics returning to healthy range
- Confirm auto-scaling policies are stable
- Document actions taken and timeline
- If manual overrides applied: schedule revert
Phase 5 — Prevent (within 1 week)
- Post-mortem: why did auto-scaling not prevent this?
- Update scaling policies (thresholds, cooldowns, min/max)
- Add or refine monitoring alarms
- Update runbook with lessons learned
- If recurring: implement automated remediation (Lambda, EventBridge)
Key Takeaways
- Auto-scaling is necessary but not sufficient: Every scenario above involved auto-scaling that was either misconfigured, too slow, or not engaged. Manual intervention is still needed for extreme events.
- Known events need scheduled scaling: Manga releases and flash sales are predictable. Pre-warming resources 30 minutes before an event eliminates the most common cause of under-provisioning.
- Cost over-provisioning is a silent failure: Unlike under-provisioning (which users notice immediately), over-provisioning can persist for weeks. Daily cost anomaly detection is essential.
- Defaults matter: The ElastiCache `noeviction` default, DynamoDB provisioned mode, and conservative ECS cooldowns are all wrong defaults for a high-traffic GenAI application. Review and override every default.
- Always plan the revert: Every manual scaling action during an incident must have a corresponding revert action. Automate the revert with a time-delayed trigger to ensure it never gets forgotten.