
Resource Allocation Scenarios & Runbooks

AWS AIP-C01 Task 4.2 — Skill 4.2.5: Right-size resources to optimize FM application throughput

Context: MangaAssist JP manga e-commerce chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. 1M messages/day, peak 20K concurrent sessions.


Skill Mapping

| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Responsible AI & Monitoring | Task 4.2 — Optimize FM application performance | Skill 4.2.5 — Right-size resources to optimize FM application throughput |

Focus: Five real-world resource allocation failure scenarios grounded in MangaAssist operations. Each scenario includes problem statement, detection, root cause analysis, resolution steps, and prevention — with decision tree diagrams.


Scenario 1 — ECS Under-Provisioned During Manga Release Event

Problem

A highly anticipated manga chapter release (Attack on Titan final volume reprint) drives 8x normal traffic to MangaAssist within 15 minutes. The ECS Fargate orchestrator CPU hits 95%, auto-scaling responds but cannot provision new tasks fast enough. Users experience 5-15 second response times (normal: 1-2 seconds). Some WebSocket connections time out entirely.

Detection

| Signal | Source | Value | Threshold |
|---|---|---|---|
| ECS CPUUtilization | CloudWatch AWS/ECS | 95% | Alarm at >85% for 2 min |
| ECS RunningTaskCount | CloudWatch AWS/ECS | 45 (max was 50) | Approaching max capacity |
| Orchestrator P95 latency | Custom metric | 12,400 ms | SLA: < 2,000 ms |
| API Gateway WebSocket errors | CloudWatch AWS/ApiGateway | 4xx errors spike | Alarm at >1% error rate |
| Bedrock invocations queued | Custom metric PendingInvocations | 340 queued | Alarm at >50 |

First alert: CloudWatch composite alarm fires — ECS CPU >85% AND P95 latency >3,000ms simultaneously.
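A minimal sketch of wiring this composite alarm with boto3, assuming the two child metric alarms already exist; the alarm names and SNS topic ARN are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fires only when BOTH child alarms are in ALARM state, so a transient CPU
# spike that never hurts latency (or vice versa) does not page anyone.
cloudwatch.put_composite_alarm(
    AlarmName="mangaassist-ecs-saturation",  # hypothetical
    AlarmRule="ALARM(ecs-cpu-above-85) AND ALARM(orchestrator-p95-above-3000ms)",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:oncall-page"],
    ActionsEnabled=True,
)
```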

Decision Tree

```mermaid
flowchart TD
    A["ALERT: ECS CPU > 85%<br/>P95 Latency > 3,000ms"] --> B{"Is auto-scaling<br/>already responding?"}

    B -->|"Yes — tasks increasing"| C{"Current task count<br/>near max capacity?"}
    B -->|"No — no scaling activity"| D["CHECK: Is scaling policy<br/>attached and active?"]

    C -->|"Yes — at or near max"| E["IMMEDIATE: Increase<br/>max task count limit<br/>via ECS service update"]
    C -->|"No — room to grow"| F{"Are new tasks<br/>starting successfully?"}

    D --> D1["Fix: Verify auto-scaling<br/>target, re-attach policy"]
    D1 --> D2["Manual: Set desired count<br/>to immediate need"]

    F -->|"Yes — tasks pending"| G["WAIT: Fargate<br/>provisioning lag ~60-120s<br/>Monitor recovery"]
    F -->|"No — tasks failing"| H["CHECK: Task definition<br/>errors, ECR pull failures,<br/>health check failures"]

    E --> E1["Increase max to 200"]
    E1 --> E2{"CPU still > 90%<br/>after 5 min?"}
    E2 -->|"Yes"| E3["ESCALATE: Increase task<br/>size (4 vCPU / 8 GB)<br/>OR enable Bedrock throttle<br/>graceful degradation"]
    E2 -->|"No — recovering"| E4["Monitor for 30 min<br/>then review scaling policy"]

    G --> G1{"Latency recovering<br/>within 5 min?"}
    G1 -->|"Yes"| G2["Incident resolved.<br/>Post-mortem: tune scale-out<br/>aggressiveness"]
    G1 -->|"No"| E["Increase max capacity"]

    H --> H1["Fix task definition<br/>or container image issues"]
    H1 --> D2
```

Root Cause

  1. Max task count too low: The service was configured with max_capacity = 50, sufficient for normal 3x peaks but not for an 8x event spike.
  2. Scale-out cooldown too long: The target tracking policy had a 120-second scale-out cooldown, meaning the service could only add tasks every 2 minutes.
  3. No predictive or scheduled scaling: The manga release was a known event (announced 2 weeks prior) but no scheduled scaling action was configured.
  4. Task start time: Fargate cold start is 60-120 seconds. At 8x traffic, the 2-minute scaling + 2-minute start cycle means 4+ minutes before relief.

Resolution

Immediate (during incident):

1. Increase max task count: 50 → 200 (ECS service update)
2. Set desired count manually: 50 → 120 (skip auto-scaling wait; see the boto3 sketch after this list)
3. Enable graceful degradation: route all traffic to Haiku (faster, cheaper)
4. Activate semantic cache aggressive mode: lower similarity threshold 0.92 → 0.85
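Steps 1-2 might look like the following with boto3; the capacity numbers mirror the runbook, while the cluster and service names are hypothetical placeholders:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
ecs = boto3.client("ecs")

# Step 1: raise the Application Auto Scaling ceiling (max task count 50 -> 200)
# so scale-out has headroom.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/mangaassist-prod/orchestrator",  # hypothetical names
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=200,
)

# Step 2: skip the auto-scaling reaction delay by setting desired count directly.
ecs.update_service(
    cluster="mangaassist-prod",
    service="orchestrator",
    desiredCount=120,
)
```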

Within 1 hour:

5. Verify all 120 tasks are healthy and serving
6. Monitor CPU stabilization below 70%
7. Monitor latency recovery to < 2,000ms P95
8. Re-enable Sonnet routing once load stabilizes

Prevention

| Prevention Measure | Implementation | Owner |
|---|---|---|
| Increase default max to 200 | Update ECS service auto-scaling config | Platform team |
| Reduce scale-out cooldown to 30s | Update target tracking policy | Platform team |
| Add scheduled scaling for known events | Create runbook: 30 min before event, set min=100 (see the sketch below) | On-call SRE |
| Predictive scaling based on traffic history | Enable AWS predictive scaling using 4-week data | Platform team |
| Warm pool with pre-pulled containers | Keep 10 stopped tasks ready with image pre-cached | Platform team |
| Graceful degradation playbook | Auto-switch to Haiku-only when CPU > 90% | Application team |
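The scheduled-scaling row could be implemented with Application Auto Scaling scheduled actions rather than a purely manual runbook step; the event times and resource names below are hypothetical, and a paired action reverts the floor after the event window:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# 30 minutes before the announced release, raise the floor so capacity is
# already in place when traffic arrives.
autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="manga-release-prewarm",
    ResourceId="service/mangaassist-prod/orchestrator",  # hypothetical
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="at(2025-03-22T09:30:00)",  # UTC; 30 min before the event
    ScalableTargetAction={"MinCapacity": 100, "MaxCapacity": 200},
)

# After the event window, restore the normal floor automatically.
autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="manga-release-revert",
    ResourceId="service/mangaassist-prod/orchestrator",
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="at(2025-03-22T15:00:00)",
    ScalableTargetAction={"MinCapacity": 10, "MaxCapacity": 200},
)
```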

Scenario 2 — DynamoDB Throttling on Session Reads During Concurrent User Spike

Problem

During a Friday evening flash sale (19:00 JST), 25,000 users simultaneously open MangaAssist to check manga deals. Each session start triggers 3-5 DynamoDB reads (session context, conversation history, user preferences). The session table — configured in provisioned mode with auto-scaling — cannot scale fast enough. ReadThrottleEvents spike to 500/second. Users see "Session loading..." for 10+ seconds or lose conversation history entirely.

Detection

| Signal | Source | Value | Threshold |
|---|---|---|---|
| ReadThrottleEvents | CloudWatch AWS/DynamoDB | 500/sec | Alarm at >0 sustained |
| ConsumedReadCapacityUnits | CloudWatch AWS/DynamoDB | 4,800 RCU | Provisioned: 3,000 RCU |
| SuccessfulRequestLatency (read) | CloudWatch AWS/DynamoDB | 45 ms (normally 5 ms) | Alarm at >20 ms |
| Session load errors | Application logs | "ProvisionedThroughputExceededException" | Any occurrence |
| Orchestrator retry rate | Custom metric | 35% of DynamoDB calls retrying | Alarm at >5% |

Decision Tree

```mermaid
flowchart TD
    A["ALERT: DynamoDB<br/>ReadThrottleEvents > 0"] --> B{"Table mode?"}

    B -->|"Provisioned"| C{"Auto-scaling<br/>active and responding?"}
    B -->|"On-Demand"| D["Check for<br/>hot partition keys"]

    C -->|"Yes — scaling up"| E{"Current RCU near<br/>account/table limit?"}
    C -->|"No — not scaling"| F["CHECK: Auto-scaling<br/>policy configuration"]

    E -->|"Yes — near limit"| G["Request limit increase<br/>via AWS Support"]
    E -->|"No — has headroom"| H{"Throttle events<br/>persisting > 5 min?"}

    H -->|"Yes"| I["Hot partition detected.<br/>Check partition key<br/>distribution."]
    H -->|"No"| J["Scaling caught up.<br/>Monitor recovery."]

    F --> F1["Fix: Re-attach auto-scaling<br/>Set target utilization 70%"]
    F1 --> F2["Manual: Increase provisioned<br/>RCU immediately"]

    D --> D1{"Throttle on specific<br/>partition keys?"}
    D1 -->|"Yes"| D2["Hot partition: session_id<br/>or user_id skew.<br/>Add write sharding suffix."]
    D1 -->|"No"| D3["On-demand table hit<br/>2x previous peak limit.<br/>Gradually increase traffic<br/>or pre-warm."]

    I --> I1["Redesign partition key:<br/>Add shard suffix<br/>or use composite key<br/>(user_region#user_id)"]

    G --> G1["SHORT-TERM: Switch<br/>table to On-Demand mode<br/>(takes effect immediately)"]
```

Root Cause

  1. Provisioned mode with insufficient headroom: The table had 3,000 RCU provisioned with an auto-scaling target of 70%. Auto-scaling takes 1-2 minutes to react, but flash sale traffic arrived in seconds.
  2. Auto-scaling reaction speed: DynamoDB auto-scaling can increase throughput without limit (decreases are capped at 4 per day), but each increase waits on CloudWatch alarm evaluation (typically 2 consecutive 1-minute data points) before triggering.
  3. Hot partition on session_id: The flash sale landing page created new sessions that all mapped to a small number of DynamoDB partitions because session_id is time-sequential (UUIDs with a timestamp prefix), causing temporal hot spots (see the sketch below).
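A sketch of the session ID change implied by root cause 3; both helpers are hypothetical illustrations:

```python
import time
import uuid

def session_id_time_prefixed() -> str:
    # Anti-pattern from root cause 3: sessions created in the same second
    # share a prefix, concentrating traffic into temporal hot spots.
    return f"{int(time.time())}-{uuid.uuid4().hex[:12]}"

def session_id_random() -> str:
    # Fix: a fully random UUID spreads keys evenly across partitions.
    return str(uuid.uuid4())
```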

Resolution

Immediate (during incident):

1. Switch table to On-Demand mode (AWS Console or CLI — takes effect in minutes; see the boto3 sketch after this list)
2. If staying provisioned: manually set RCU to 10,000 (well above current need)
3. Enable application-level retry with exponential backoff (should already exist)
4. If session reads fail after 3 retries: serve with empty context (degraded but functional)
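Steps 1 and 3 sketched with boto3; the table name is hypothetical, and botocore's built-in retry modes stand in for a hand-rolled exponential backoff:

```python
import boto3
from botocore.config import Config

# Step 1: flip the table to on-demand. Note DynamoDB allows one billing-mode
# switch per table per 24 hours.
boto3.client("dynamodb").update_table(
    TableName="mangaassist-sessions",  # hypothetical
    BillingMode="PAY_PER_REQUEST",
)

# Step 3: orchestrator-side client with exponential backoff built in;
# "adaptive" mode also rate-limits client-side when throttling is detected.
dynamodb = boto3.client(
    "dynamodb",
    config=Config(retries={"max_attempts": 3, "mode": "adaptive"}),
)
```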

Within 24 hours:

5. Verify throttle events dropped to zero
6. Analyze partition key distribution using CloudWatch Contributor Insights
7. If hot partitions confirmed: plan partition key redesign
8. Cost-compare on-demand vs provisioned for this traffic pattern

Prevention

| Prevention Measure | Implementation | Owner |
|---|---|---|
| Switch to On-Demand mode permanently | Update table configuration | Platform team |
| Use random UUID session IDs (no timestamp prefix) | Update session ID generation | Application team |
| Add ElastiCache read-through for session data | Cache active sessions in Redis | Application team |
| DynamoDB Contributor Insights enabled | Turn on for session table | Platform team |
| Pre-warm provisioned capacity before events | If staying provisioned: schedule RCU increase 1hr before | On-call SRE |
| Circuit breaker on DynamoDB reads | Return cached/empty session after 3 failures | Application team |

Scenario 3 — OpenSearch OCU Auto-Scale Lag Causing Search Latency Degradation

Problem

During a weekend manga recommendation campaign, search traffic increases 4x over 30 minutes. OpenSearch Serverless auto-scaling detects the need for additional search OCUs but takes 10-15 minutes to provision them. During this window, vector search P95 latency degrades from 40ms to 350ms, causing overall chatbot response time to exceed the 3-second SLA for recommendation queries.

Detection

| Signal | Source | Value | Threshold |
|---|---|---|---|
| SearchLatency P95 | CloudWatch AWS/AOSS | 350 ms | Alarm at >100 ms |
| SearchRate | CloudWatch AWS/AOSS | 450 queries/sec | Baseline: 120 queries/sec |
| Active Search OCU | CloudWatch AWS/AOSS | 4 (same as baseline) | Expected: 8-12 for this load |
| Orchestrator RAG latency | Custom metric | 1,800 ms (includes retries) | Alarm at >500 ms |
| End-to-end response time | Custom metric | 4,200 ms | SLA: < 3,000 ms |

Decision Tree

```mermaid
flowchart TD
    A["ALERT: OpenSearch<br/>P95 Latency > 100ms"] --> B{"Is search traffic<br/>elevated vs baseline?"}

    B -->|"Yes — traffic spike"| C{"Are OCUs<br/>auto-scaling up?"}
    B -->|"No — normal traffic"| D["Check index health:<br/>segment count, merge<br/>activity, shard imbalance"]

    C -->|"Yes — OCUs increasing<br/>(visible in CloudWatch)"| E["WAIT: OCU provisioning<br/>takes 5-15 min.<br/>Enable mitigations."]
    C -->|"No — OCUs static"| F["CHECK: Collection<br/>auto-scaling config.<br/>Is max OCU set too low?"]

    E --> E1["MITIGATE while waiting:"]
    E1 --> E2["1. Reduce k in k-NN<br/>from 10 to 5"]
    E1 --> E3["2. Increase cache<br/>aggressiveness for<br/>search results"]
    E1 --> E4["3. Route simple queries<br/>to keyword search<br/>instead of vector search"]

    E2 --> G{"Latency recovered<br/>after OCU scale?"}
    G -->|"Yes"| H["Resolved. Review base<br/>OCU allocation."]
    G -->|"No"| I["ESCALATE: Check<br/>HNSW index size vs OCU<br/>memory. May need<br/>index optimization."]

    F --> F1["Increase max search OCU<br/>from 4 to 20"]
    F1 --> F2["If still not scaling:<br/>Check if collection<br/>type supports auto-scaling"]

    D --> D1["Run _cat/segments<br/>Check for large<br/>segment merges"]
    D1 --> D2["If merge storms:<br/>Reduce refresh interval<br/>or schedule bulk indexing<br/>during off-peak"]
```

Root Cause

  1. OCU scaling lag: OpenSearch Serverless auto-scaling detects load increases but takes 5-15 minutes to provision additional search OCUs. This is an inherent platform limitation, not a misconfiguration.
  2. Base search OCU too low: The collection was configured with only 2 search OCU as the baseline, leaving no buffer for sudden traffic increases.
  3. HNSW index memory pressure: The manga embedding index (500K vectors x 1536 dimensions) requires ~6 GB of memory. At 2 search OCU, the index fits but leaves little headroom for concurrent search operations.
  4. No search result caching: Identical vector searches (common during campaigns where users search similar terms) hit OpenSearch every time.

Resolution

Immediate (during incident):

1. Reduce k-NN k parameter from 10 to 5 (fewer neighbors = faster search)
2. Enable search result caching in Redis (cache top queries with a 5-minute TTL; see the sketch after this list)
3. Route FAQ-intent queries to keyword search (bypass vector search)
4. If available: increase max search OCU limit in collection settings
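Mitigations 1-2 sketched together, assuming an opensearch-py client and redis-py; the index name, vector field, and cache endpoint are hypothetical:

```python
import hashlib
import json

import redis

r = redis.Redis(host="mangaassist-cache.internal", port=6379)  # hypothetical

def knn_query(embedding: list[float], k: int = 5) -> dict:
    # Mitigation 1: k lowered from 10 to 5 -- fewer HNSW neighbors per search.
    return {
        "size": k,
        "query": {"knn": {"manga_embedding": {"vector": embedding, "k": k}}},
    }

def cached_search(os_client, embedding: list[float]) -> dict:
    # Mitigation 2: memoize identical vector searches for 5 minutes.
    key = "search:" + hashlib.sha256(json.dumps(embedding).encode()).hexdigest()
    hit = r.get(key)
    if hit:
        return json.loads(hit)
    result = os_client.search(index="manga-catalog", body=knn_query(embedding))
    r.setex(key, 300, json.dumps(result))  # 5-minute TTL per the runbook
    return result
```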

Within 24 hours:

5. Increase base search OCU to 4 (prevents cold-start lag for moderate spikes)
6. Implement search result caching layer in orchestrator
7. Analyze query patterns: are users sending identical queries? Pre-compute and cache
8. Review HNSW index parameters (ef_search, m) for efficiency

Prevention

| Prevention Measure | Implementation | Owner |
|---|---|---|
| Increase base search OCU to 4+ | Update collection configuration | Platform team |
| Add Redis-based search result caching | Cache vector search results with 5-min TTL | Application team |
| Implement approximate search fallback | If P95 > 200ms, switch to pre-computed recommendations | Application team |
| Pre-warm before campaigns | Synthetically increase search load 30 min before events | SRE team |
| Monitor OCU utilization trend | Dashboard with search OCU count vs search rate correlation | Platform team |
| Consider OpenSearch provisioned (non-serverless) | For predictable high-traffic workloads, managed clusters give more control | Architecture team |

Scenario 4 — ElastiCache Memory Exhaustion from Growing Semantic Cache

Problem

Over 3 weeks, the MangaAssist semantic cache in ElastiCache Redis grows from 4 GB to 12.5 GB (on a 13 GB r6g.large node). The cache was intended to store frequently asked query embeddings and their responses, but the eviction policy was set to noeviction instead of allkeys-lfu. When memory hits 100%, Redis rejects all new write commands: new cache entries cannot be stored, so the hit rate decays from 35% toward 0% as query traffic shifts, while the remaining entries serve increasingly stale data. Bedrock invocation count spikes 35% as queries bypass the cache.

Detection

| Signal | Source | Value | Threshold |
|---|---|---|---|
| DatabaseMemoryUsagePercentage | CloudWatch AWS/ElastiCache | 96% (then 100%) | Alarm at >80% |
| BytesUsedForCache | CloudWatch AWS/ElastiCache | 12.5 GB | Node capacity: 13 GB |
| CacheHitRate | Custom metric | Dropping from 35% to 0% | Alarm at <20% |
| Evictions | CloudWatch AWS/ElastiCache | 0 (noeviction policy) | Should be >0 when full |
| Bedrock InvocationsPerMinute | CloudWatch AWS/Bedrock | +35% spike | Alarm at >20% increase |
| CommandsFailed (OOM errors) | CloudWatch AWS/ElastiCache | Rising | Alarm at >0 |

Decision Tree

```mermaid
flowchart TD
    A["ALERT: ElastiCache Memory > 80%"] --> B{"Eviction policy?"}

    B -->|"noeviction"| C["CRITICAL: Redis will reject<br/>writes at 100% memory.<br/>Change policy immediately."]
    B -->|"allkeys-lfu or allkeys-lru"| D{"Eviction rate<br/>abnormally high?"}

    C --> C1["IMMEDIATE: Change maxmemory-policy<br/>to allkeys-lfu<br/>(no restart needed)"]
    C1 --> C2{"Memory still > 95%<br/>after policy change?"}

    C2 -->|"Yes — eviction not<br/>fast enough"| C3["Manually flush low-value<br/>cache entries by pattern:<br/>SCAN + DEL old TTL keys"]
    C2 -->|"No — eviction working"| C4["Monitor. Evictions will<br/>keep memory in check."]

    D -->|"Yes — evicting too much"| E{"Cache growing due to<br/>new entry volume or<br/>entry size increase?"}
    D -->|"No — healthy eviction"| F["System healthy.<br/>Adjust alarm threshold."]

    E -->|"New entries (unique queries)"| G["Check: Are low-value<br/>queries being cached?<br/>Add minimum similarity<br/>threshold for caching."]
    E -->|"Entry size growth"| H["Check: Are cached responses<br/>getting larger? Review<br/>response truncation policy."]

    G --> G1["Implement selective caching:<br/>Only cache queries with<br/>>5 occurrences in 24hr"]

    C3 --> C5["PLAN: Upgrade node type<br/>to r6g.xlarge (26 GB)<br/>or enable cluster mode"]
```

Root Cause

  1. Wrong eviction policy: The Redis cluster was configured with maxmemory-policy: noeviction (the default). This means Redis returns errors on writes when memory is full instead of evicting old keys.
  2. No TTL on cache entries: Cached query embeddings and responses were stored without an expiration time, causing unbounded growth.
  3. Caching everything: Every unique query — including one-off misspellings and extremely specific questions — was cached. The cache stored 250K entries when only the top 30K (most-repeated queries) provided meaningful hit rate.
  4. No cache size monitoring alarm: Memory usage was not monitored with an alarm. The team discovered the issue only when Bedrock costs spiked.

Resolution

Immediate (during incident):

1. Change maxmemory-policy to allkeys-lfu via the cluster's parameter group
   (ElastiCache disables the raw CONFIG command; maxmemory-policy changes
   apply immediately, no restart needed)

2. Set a TTL on all existing keys that lack one:
   Use SCAN to iterate + EXPIRE on each key (24hr TTL)

3. If memory is at 100% and write errors are occurring:
   Flush low-frequency keys manually with a SCAN + OBJECT FREQ + DEL pass
   (steps 2-3 are sketched after this list)

4. Monitor: Eviction count should start rising, memory should stabilize
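Steps 2-3 sketched with redis-py; the endpoint is hypothetical, and cursor-based SCAN is used instead of KEYS so the pass does not block Redis:

```python
import redis
from redis.exceptions import ResponseError

r = redis.Redis(host="mangaassist-cache.internal", port=6379)  # hypothetical

# Step 2: backfill a 24-hour TTL onto keys that have none (TTL of -1).
for key in r.scan_iter(count=500):
    if r.ttl(key) == -1:
        r.expire(key, 86_400)

# Step 3: if memory is still pegged, drop the coldest entries. OBJECT FREQ
# only works once an LFU eviction policy is active (step 1).
for key in r.scan_iter(count=500):
    try:
        if r.object("freq", key) <= 1:
            r.delete(key)
    except ResponseError:
        pass  # key expired between SCAN and OBJECT, or policy not LFU yet
```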

Within 48 hours:

5. Implement selective caching logic in orchestrator:
   - Only cache queries with similarity < 0.95 to existing entries
   - Only cache queries seen > 3 times in rolling 24hr window
6. Set default TTL of 24 hours on all new cache entries
7. Add CloudWatch alarm: memory > 75% for 10 min → warning
8. Add CloudWatch alarm: memory > 85% for 5 min → critical (both sketched below)
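The two alarms in steps 7-8 might be created like this with boto3; the alarm names, cache cluster ID, and SNS ARNs are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

for name, threshold, minutes, topic in [
    ("elasticache-memory-warning", 75, 10, "oncall-warning"),
    ("elasticache-memory-critical", 85, 5, "oncall-critical"),
]:
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/ElastiCache",
        MetricName="DatabaseMemoryUsagePercentage",
        Dimensions=[{"Name": "CacheClusterId", "Value": "mangaassist-cache-001"}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=minutes,  # N one-minute periods over threshold
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[f"arn:aws:sns:ap-northeast-1:123456789012:{topic}"],
    )
```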

Prevention

| Prevention Measure | Implementation | Owner |
|---|---|---|
| Set maxmemory-policy to allkeys-lfu | Update parameter group (persists across restarts) | Platform team |
| Default TTL on all cache entries | 24-hour TTL in application caching logic | Application team |
| Selective caching (frequency threshold) | Only cache queries seen 3+ times in 24hr | Application team |
| Memory monitoring alarms | 75% warning, 85% critical in CloudWatch | Platform team |
| Weekly cache efficiency report | Dashboard: entries, hit rate, memory, eviction rate | SRE team |
| Cache size budgeting | Enforce max 70% memory target; resize node if baseline > 60% | Platform team |

Scenario 5 — Cost Spike from Over-Provisioned Resources During Quiet Period

Problem

After scaling up for a major manga release event on Saturday, the MangaAssist infrastructure remains at peak capacity through the following quiet week. ECS is running 100 tasks (normal need: 15-25), Bedrock provisioned throughput remains at event levels, and OpenSearch has 12 search OCUs (normal: 4). The team does not notice because there are no performance issues — everything works perfectly. But the weekly AWS cost report shows costs at 280% of baseline ($42,000 for the week vs the normal $15,000).

Detection

| Signal | Source | Value | Threshold |
|---|---|---|---|
| ECS RunningTaskCount | CloudWatch AWS/ECS | 100 tasks | Normal baseline: 20 tasks |
| ECS CPUUtilization | CloudWatch AWS/ECS | 12% | Under-utilization: <30% |
| Bedrock InvocationsPerMinute | CloudWatch AWS/Bedrock | 15% of provisioned | Under-utilization: <30% for >24hr |
| OpenSearch Active Search OCU | CloudWatch AWS/AOSS | 12 OCU | Baseline need: 4 OCU |
| AWS Cost Explorer daily spend | Cost Explorer | $6,000/day | Budget: $2,200/day |
| Cost anomaly detection | AWS Cost Anomaly Detection | 280% of baseline | Alert at >50% deviation |

Decision Tree

```mermaid
flowchart TD
    A["ALERT: Daily cost<br/>at 280% of budget"] --> B["Identify cost drivers<br/>via Cost Explorer<br/>by service and tag"]

    B --> C{"Which services<br/>are over-budget?"}

    C --> D["ECS: 100 tasks running<br/>at 12% CPU utilization"]
    C --> E["Bedrock: Provisioned throughput<br/>at event levels"]
    C --> F["OpenSearch: 12 OCU<br/>vs 4 needed"]

    D --> D1{"Auto-scaling policy<br/>scale-in working?"}
    D1 -->|"Yes but min capacity<br/>was raised for event"| D2["Reset min capacity<br/>from 100 → 10"]
    D1 -->|"No — scale-in disabled<br/>or cooldown too long"| D3["Fix scale-in policy:<br/>Enable, reduce cooldown"]

    E --> E1["Review provisioned throughput:<br/>Is it auto-managed or manual?"]
    E1 --> E2["Manual: Reduce to normal<br/>levels immediately"]
    E1 --> E3["If using on-demand: costs<br/>scale automatically (no action)"]

    F --> F1["OpenSearch auto-scaling<br/>should reduce OCU.<br/>Check if collection has<br/>min OCU set too high."]
    F1 --> F2["Reset min search OCU<br/>to 2 if manually raised"]

    D2 --> G["VERIFY: All services<br/>returning to baseline<br/>capacity within 1 hour"]
    D3 --> G
    E2 --> G
    F2 --> G

    G --> H["POST-MORTEM: Why wasn't<br/>this caught sooner?<br/>Add cost monitoring."]

    H --> I["Implement: Daily cost<br/>anomaly alert + weekly<br/>utilization efficiency report"]
```

Root Cause

  1. Manual scaling not reverted: During the event, the on-call SRE manually increased ECS min_capacity to 100 and raised Bedrock provisioned throughput. After the event, no one reverted these settings.
  2. No post-event checklist: There was no runbook step for "revert temporary scaling changes after event."
  3. Scale-in too conservative: ECS auto-scaling scale-in cooldown was 300 seconds, and the target tracking policy required 15 minutes below target before scaling in. With min_capacity = 100, it could not scale below 100 regardless.
  4. No cost alerting: AWS Budgets was configured with monthly thresholds only. A weekly 280% spike did not trigger the monthly alarm until week 3.

Resolution

Immediate (within 1 hour):

1. Reset ECS min_capacity: 100 → 10
2. Reset ECS desired count: let auto-scaling determine (remove manual override)
3. Reduce Bedrock provisioned throughput to normal levels
4. Verify OpenSearch search OCU is auto-scaling down
5. Estimated savings: ~$3,800/day immediately

Within 1 week:

6. Implement daily cost anomaly detection (AWS Cost Anomaly Detection)
7. Create post-event scaling revert checklist in runbook
8. Add automated revert: EventBridge rule that resets min_capacity
   48 hours after a manual override event (sketched after this list)
9. Create weekly "efficiency report" dashboard:
   - Service utilization vs capacity
   - Cost per message trend
   - Over-provisioned resource flags
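A sketch of the automated revert from step 8, assuming a Lambda function invoked by a one-shot EventBridge Scheduler schedule; all names, ARNs, and the baseline value are hypothetical:

```python
import json

import boto3

BASELINE_MIN = 10  # hypothetical normal floor for the orchestrator service

def lambda_handler(event, context):
    # Invoked ~48h after a manual override; restores the baseline floor.
    boto3.client("application-autoscaling").register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=event["resource_id"],  # e.g. "service/mangaassist-prod/orchestrator"
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=BASELINE_MIN,
    )
    return {"reverted": event["resource_id"], "min_capacity": BASELINE_MIN}

def schedule_revert(resource_id: str, revert_at_iso: str) -> None:
    # Called when the override is applied; creates a one-time schedule.
    boto3.client("scheduler").create_schedule(
        Name="scaling-revert-" + resource_id.replace("/", "-"),
        ScheduleExpression=f"at({revert_at_iso})",  # e.g. "at(2025-03-24T19:00:00)"
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            "Arn": "arn:aws:lambda:ap-northeast-1:123456789012:function:scaling-revert",
            "RoleArn": "arn:aws:iam::123456789012:role/scheduler-invoke",
            "Input": json.dumps({"resource_id": resource_id}),
        },
    )
```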

Prevention

| Prevention Measure | Implementation | Owner |
|---|---|---|
| Automated scaling revert | EventBridge + Lambda: auto-revert min_capacity 24hr after manual increase | Platform team |
| Daily cost anomaly alerts | AWS Cost Anomaly Detection with Slack notification | Finance / SRE |
| Post-event runbook checklist | "Revert event scaling" step in incident response runbook | SRE team |
| Weekly efficiency dashboard | CloudWatch dashboard: utilization heatmap across all services | Platform team |
| Tag event-driven scaling changes | Tag manual overrides with event:manga-release-20250322 for tracking | SRE team |
| Cost per message metric | Custom metric: total daily cost / total daily messages; alert at >$0.015/msg (see the sketch below) | Platform team |
| Budget alerts at daily granularity | AWS Budgets with daily spend threshold ($3,000/day) | Finance team |
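The cost-per-message metric might be published like this; Cost Explorer data lags roughly a day, so the job would run daily over the prior day. The namespace and metric names are hypothetical:

```python
import boto3

def publish_cost_per_message(day_start: str, day_end: str, total_messages: int):
    # Pull the prior day's unblended spend from Cost Explorer.
    resp = boto3.client("ce").get_cost_and_usage(
        TimePeriod={"Start": day_start, "End": day_end},  # "YYYY-MM-DD" strings
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    cost = float(resp["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])

    # Publish the prevention-table metric; alarm separately at >$0.015/msg.
    boto3.client("cloudwatch").put_metric_data(
        Namespace="MangaAssist/Efficiency",  # hypothetical namespace
        MetricData=[{
            "MetricName": "CostPerMessage",
            "Value": cost / max(total_messages, 1),
            "Unit": "None",
        }],
    )
```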

Cross-Scenario Summary

| Scenario | Service | Failure Mode | Detection Speed | Impact | Key Prevention |
|---|---|---|---|---|---|
| 1. ECS under-provisioned | ECS Fargate | Scale-out too slow for 8x spike | 2 min (alarm) | High latency, timeouts | Predictive scaling + scheduled pre-warm |
| 2. DynamoDB throttling | DynamoDB | Provisioned capacity exceeded | 1 min (throttle alarm) | Session load failures | Switch to on-demand mode |
| 3. OpenSearch OCU lag | OpenSearch Serverless | OCU provisioning takes 10-15 min | 5 min (latency alarm) | Search latency degraded ~10x | Higher base OCU + search result caching |
| 4. ElastiCache memory full | ElastiCache Redis | Wrong eviction policy | 30 min (cost spike noticed) | Cache disabled, Bedrock cost +35% | Set allkeys-lfu + TTL on entries |
| 5. Cost over-provision | All services | Manual scaling not reverted | 7 days (weekly cost report) | Costs at 280% of baseline ($27K waste) | Automated revert + daily cost alerts |

Severity and Urgency Matrix

```mermaid
quadrantChart
    title Scenario Severity vs Detection Urgency
    x-axis Low Urgency --> High Urgency
    y-axis Low Severity --> High Severity
    quadrant-1 Critical — Immediate action
    quadrant-2 Important — Action within hours
    quadrant-3 Monitor — Track and prevent
    quadrant-4 Urgent — Fix fast, low blast radius
    ECS under-provisioned: [0.85, 0.90]
    DynamoDB throttling: [0.80, 0.75]
    OpenSearch OCU lag: [0.65, 0.60]
    ElastiCache memory: [0.45, 0.55]
    Cost over-provision: [0.25, 0.70]
```

Universal Resource Allocation Runbook Template

For any resource allocation incident in MangaAssist, follow this structured approach:

Phase 1 — Detect (0-2 minutes)

  1. Identify which CloudWatch alarm fired
  2. Determine affected service and metric
  3. Check if auto-scaling is already responding
  4. Classify: under-provisioned (performance) or over-provisioned (cost)

Phase 2 — Assess (2-5 minutes)

  1. Check current capacity vs required capacity
  2. Determine if this is a known event or unexpected spike
  3. Evaluate blast radius: which user-facing features are affected?
  4. Decide: auto-scaling will catch up vs manual intervention needed

Phase 3 — Mitigate (5-15 minutes)

  1. If under-provisioned: increase capacity manually (desired count, provisioned units)
  2. If over-provisioned: schedule reduction (do not reduce during business hours without buffer)
  3. Enable graceful degradation if capacity cannot be increased fast enough
  4. Communicate status to stakeholders

Phase 4 — Resolve (15-60 minutes)

  1. Verify metrics returning to healthy range
  2. Confirm auto-scaling policies are stable
  3. Document actions taken and timeline
  4. If manual overrides applied: schedule revert

Phase 5 — Prevent (within 1 week)

  1. Post-mortem: why did auto-scaling not prevent this?
  2. Update scaling policies (thresholds, cooldowns, min/max)
  3. Add or refine monitoring alarms
  4. Update runbook with lessons learned
  5. If recurring: implement automated remediation (Lambda, EventBridge)

Key Takeaways

  1. Auto-scaling is necessary but not sufficient: Every scenario above involved auto-scaling that was either misconfigured, too slow, or not engaged. Manual intervention is still needed for extreme events.

  2. Known events need scheduled scaling: Manga releases and flash sales are predictable. Pre-warming resources 30 minutes before eliminates the most common cause of under-provisioning.

  3. Cost over-provisioning is a silent failure: Unlike under-provisioning (which users notice immediately), over-provisioning can persist for weeks. Daily cost anomaly detection is essential.

  4. Defaults matter: ElastiCache noeviction default, DynamoDB provisioned mode, ECS conservative cooldowns — all of these defaults are wrong for a high-traffic GenAI application. Review and override every default.

  5. Always plan the revert: Every manual scaling action during an incident must have a corresponding revert action. Automate the revert with a time-delayed trigger to ensure it never gets forgotten.