US-08: Traffic-Based Cost Optimization
User Story
As a site reliability engineer, I want to implement intelligent traffic management with rate limiting, request prioritization, and graceful degradation, So that total infrastructure costs decrease by 20-35% during normal operations and the system avoids runaway costs during traffic spikes.
Acceptance Criteria
- Rate limiter enforces per-user and global limits to prevent LLM cost spikes.
- Request prioritization ensures authenticated users get higher throughput than guests.
- Graceful degradation reduces costs by bypassing expensive components when they are overloaded.
- Cost circuit breaker halts LLM calls when daily spend exceeds budget.
- Off-peak traffic uses cheaper execution paths automatically.
- Total infrastructure costs decrease by 20-35% through traffic shaping.
High-Level Design
Cost Problem
Without traffic controls, cost scales linearly with traffic — and super-linearly during spikes (when provisioned capacity overflows to more expensive on-demand resources). A single abusive user could generate thousands of LLM calls.
Uncontrolled cost risks:
- Bot/abuse traffic: 5-15% of messages may be low-value or abusive
- Guest discovery browsing: generates LLM calls but low conversion
- Traffic spikes: Black Friday and manga release events can drive 3-5x normal traffic
Traffic Management Architecture
graph TD
A[Incoming Request] --> B[CloudFront<br>WAF + Bot Detection]
B --> C[Rate Limiter<br>Token Bucket]
C --> D{Within<br>Rate Limit?}
D -->|Yes| E[Request Prioritizer]
D -->|No| F[429 Too Many Requests<br>Zero downstream cost]
E --> G{Priority Level?}
G -->|High: Authenticated + Active Cart| H[Full Pipeline<br>LLM + RAG + All Services]
G -->|Medium: Authenticated| I[Standard Pipeline<br>Template-first, then LLM]
G -->|Low: Guest| J[Lite Pipeline<br>Template + Cache only]
K[Cost Circuit Breaker] --> L{Daily Spend<br>> Budget?}
L -->|Yes| M[Degrade to<br>Template-Only Mode]
L -->|No| N[Normal Operation]
style F fill:#2d8,stroke:#333
style J fill:#2d8,stroke:#333
style M fill:#fd2,stroke:#333
Savings Breakdown
| Technique | Reduction | Monthly Savings |
|---|---|---|
| Rate limiting (blocks 5-10% abuse traffic) | 8% of total LLM calls | ~$25,000 |
| Guest lite pipeline (template + cache only) | 15% of LLM calls avoided | ~$15,000 |
| Request prioritization (off-peak degradation) | 10% compute savings | ~$300 |
| Cost circuit breaker (prevents runaway) | Prevents $50K+ overrun/month | Risk avoidance |
| Bot detection (WAF) | 3-5% request rejection | ~$5,000 |
| Total | | ~$45,300/month |
Low-Level Design
1. Tiered Rate Limiter
Different rate limits based on user tier and request type.
graph TD
subgraph "Rate Limit Tiers"
A[Prime Authenticated<br>60 msg/min, 500/hour]
B[Standard Authenticated<br>30 msg/min, 200/hour]
C[Guest<br>10 msg/min, 50/hour]
D[Global<br>50,000 msg/min total]
end
subgraph "Token Bucket Implementation"
E[Request Arrives] --> F[Check User Tier]
F --> G[Redis MULTI:<br>Check + Decrement Token]
G --> H{Tokens > 0?}
H -->|Yes| I[Allow Request]
H -->|No| J[429 + Retry-After Header]
end
Code Example: Tiered Rate Limiter
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import redis
class UserTier(Enum):
PRIME = "prime"
AUTHENTICATED = "authenticated"
GUEST = "guest"
@dataclass
class RateLimitConfig:
requests_per_minute: int
requests_per_hour: int
burst_allowance: int
TIER_LIMITS = {
UserTier.PRIME: RateLimitConfig(
requests_per_minute=60,
requests_per_hour=500,
burst_allowance=10,
),
UserTier.AUTHENTICATED: RateLimitConfig(
requests_per_minute=30,
requests_per_hour=200,
burst_allowance=5,
),
UserTier.GUEST: RateLimitConfig(
requests_per_minute=10,
requests_per_hour=50,
burst_allowance=2,
),
}
GLOBAL_LIMIT_PER_MINUTE = 50_000
@dataclass
class RateLimitResult:
allowed: bool
remaining: int
retry_after_seconds: Optional[int]
tier: UserTier
class TieredRateLimiter:
"""Token-bucket rate limiter with per-tier limits stored in Redis."""
def __init__(self, redis_client: redis.Redis):
self._redis = redis_client
def check_and_consume(
self, user_id: str, tier: UserTier
) -> RateLimitResult:
config = TIER_LIMITS[tier]
now = int(time.time())
minute_key = f"rate:{user_id}:min:{now // 60}"
hour_key = f"rate:{user_id}:hr:{now // 3600}"
global_key = f"rate:global:min:{now // 60}"
pipe = self._redis.pipeline()
# Check minute limit
pipe.incr(minute_key)
pipe.expire(minute_key, 60)
# Check hour limit
pipe.incr(hour_key)
pipe.expire(hour_key, 3600)
# Check global limit
pipe.incr(global_key)
pipe.expire(global_key, 60)
results = pipe.execute()
minute_count = results[0]
hour_count = results[2]
global_count = results[4]
# Check per-user minute limit
if minute_count > config.requests_per_minute + config.burst_allowance:
return RateLimitResult(
allowed=False,
remaining=0,
retry_after_seconds=60 - (now % 60),
tier=tier,
)
# Check per-user hour limit
if hour_count > config.requests_per_hour:
return RateLimitResult(
allowed=False,
remaining=0,
retry_after_seconds=3600 - (now % 3600),
tier=tier,
)
# Check global limit
if global_count > GLOBAL_LIMIT_PER_MINUTE:
return RateLimitResult(
allowed=False,
remaining=0,
retry_after_seconds=60 - (now % 60),
tier=tier,
)
remaining = config.requests_per_minute - minute_count
return RateLimitResult(
allowed=True,
remaining=max(0, remaining),
retry_after_seconds=None,
tier=tier,
)
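A minimal usage sketch of the limiter above (the handler name and Redis connection details are illustrative, not part of the story), showing how a rate-limit decision maps to an HTTP-style response before any downstream cost is incurred:

```python
import redis

def handle_chat_message(
    limiter: TieredRateLimiter, user_id: str, tier: UserTier, message: str
) -> dict:
    """Hypothetical request handler: reject over-limit traffic with 429 + Retry-After."""
    result = limiter.check_and_consume(user_id, tier)
    if not result.allowed:
        # Zero downstream cost: no LLM, RAG, or DynamoDB work happens for this request.
        return {
            "status": 429,
            "headers": {"Retry-After": str(result.retry_after_seconds)},
            "body": {"error": "rate_limited"},
        }
    # ... continue to the priority router / pipeline selection ...
    return {
        "status": 200,
        "headers": {"X-RateLimit-Remaining": str(result.remaining)},
        "body": {"echo": message},
    }

limiter = TieredRateLimiter(redis.Redis(host="localhost", port=6379))
print(handle_chat_message(limiter, "user-123", UserTier.GUEST, "hello"))
```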
2. Request Priority Router
Route requests through different execution paths based on user value.
graph TD
A[Request + User Context] --> B{User Tier<br>+ Cart Status}
B -->|Authenticated + Cart > 0<br>HIGH priority| C[Full Pipeline<br>RAG + LLM + All Services]
B -->|Authenticated + No Cart<br>MEDIUM priority| D[Standard Pipeline<br>Template-first → LLM fallback]
B -->|Guest<br>LOW priority| E{Intent Type?}
E -->|faq, product_discovery| F[Lite Pipeline<br>Template + Cache only<br>No LLM]
E -->|product_question,<br>recommendation| G[Standard Pipeline<br>with Haiku model]
style C fill:#f66,stroke:#333
style D fill:#fd2,stroke:#333
style F fill:#2d8,stroke:#333
Code Example: Priority Router
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class Priority(Enum):
HIGH = "high" # Full pipeline, Sonnet
MEDIUM = "medium" # Template-first, Sonnet fallback
LOW = "low" # Template + cache, Haiku fallback
@dataclass
class PipelineConfig:
priority: Priority
use_rag: bool
use_llm: bool
model_tier: str # "sonnet" or "haiku"
use_template_first: bool
use_cache_only: bool
reason: str
class RequestPriorityRouter:
"""Determine execution pipeline based on user tier and context."""
def route(
self,
user_tier: str,
is_authenticated: bool,
cart_size: int,
intent: str,
is_peak_hour: bool,
) -> PipelineConfig:
# High priority: authenticated user with active cart
if is_authenticated and cart_size > 0:
return PipelineConfig(
priority=Priority.HIGH,
use_rag=True,
use_llm=True,
model_tier="sonnet",
use_template_first=False,
use_cache_only=False,
reason="Authenticated user with active cart — full pipeline",
)
# Medium priority: authenticated user, no cart
if is_authenticated:
return PipelineConfig(
priority=Priority.MEDIUM,
use_rag=intent in ("faq", "product_question", "recommendation"),
use_llm=True,
model_tier="sonnet" if not is_peak_hour else "haiku",
use_template_first=True,
use_cache_only=False,
reason="Authenticated user — template-first with LLM fallback",
)
# Low priority: guest user
if intent in ("faq", "product_discovery", "promotion", "chitchat"):
return PipelineConfig(
priority=Priority.LOW,
use_rag=False,
use_llm=False,
model_tier="none",
use_template_first=True,
use_cache_only=True,
reason="Guest user, cacheable intent — template + cache only",
)
# Guest with complex intent: limited LLM
return PipelineConfig(
priority=Priority.LOW,
use_rag=intent in ("product_question", "recommendation"),
use_llm=True,
model_tier="haiku",
use_template_first=True,
use_cache_only=False,
reason="Guest user, complex intent — lite pipeline with Haiku",
)
3. Cost Circuit Breaker
Prevents runaway LLM costs by switching to template-only mode when daily spend exceeds budget.
stateDiagram-v2
[*] --> NormalOperation
NormalOperation --> WarningState: Daily spend > 80% budget
WarningState --> DegradedMode: Daily spend > 100% budget
DegradedMode --> NormalOperation: New day (midnight UTC reset)
WarningState --> NormalOperation: Spend rate decreases
state NormalOperation {
[*] --> FullPipeline
FullPipeline: All models available
FullPipeline: Full RAG + LLM
}
state WarningState {
[*] --> ReducedPipeline
ReducedPipeline: Only Haiku model
ReducedPipeline: Aggressive caching
ReducedPipeline: Template-first for all
}
state DegradedMode {
[*] --> TemplateOnly
TemplateOnly: No LLM calls
TemplateOnly: Template + cache responses
TemplateOnly: Escalation to human for complex queries
}
Code Example: Cost Circuit Breaker
import logging
import time
from dataclasses import dataclass
from enum import Enum
import redis
logger = logging.getLogger(__name__)
class CostState(Enum):
NORMAL = "normal"
WARNING = "warning"
DEGRADED = "degraded"
@dataclass
class CostBudget:
daily_limit_usd: float
warning_threshold_pct: float = 0.80
degraded_threshold_pct: float = 1.00
@dataclass
class CostStatus:
state: CostState
daily_spend_usd: float
budget_remaining_usd: float
llm_calls_today: int
actions: list[str]
class CostCircuitBreaker:
"""Monitor daily LLM spend and degrade execution path when budget exceeded."""
# Approximate costs per call
SONNET_AVG_COST = 0.008 # $0.008 per call (avg 1200 input + 250 output tokens)
HAIKU_AVG_COST = 0.0005 # $0.0005 per call
def __init__(
self,
redis_client: redis.Redis,
budget: CostBudget,
):
self._redis = redis_client
self._budget = budget
def check_state(self) -> CostStatus:
today = self._today_key()
spend_raw = self._redis.get(f"cost:daily:{today}")
calls_raw = self._redis.get(f"cost:calls:{today}")
daily_spend = float(spend_raw) if spend_raw else 0.0
llm_calls = int(calls_raw) if calls_raw else 0
remaining = self._budget.daily_limit_usd - daily_spend
spend_pct = daily_spend / self._budget.daily_limit_usd
if spend_pct >= self._budget.degraded_threshold_pct:
state = CostState.DEGRADED
actions = [
"BLOCK all LLM calls",
"SERVE template + cache responses only",
"ESCALATE complex queries to human agents",
"ALERT on-call engineer",
]
logger.warning(
f"Cost circuit OPEN — daily spend ${daily_spend:.2f} "
f"exceeds budget ${self._budget.daily_limit_usd:.2f}"
)
elif spend_pct >= self._budget.warning_threshold_pct:
state = CostState.WARNING
actions = [
"SWITCH all LLM calls to Haiku",
"ENABLE aggressive caching",
"FORCE template-first for all intents",
]
logger.info(
f"Cost circuit WARNING — daily spend ${daily_spend:.2f} "
f"at {spend_pct*100:.0f}% of budget"
)
else:
state = CostState.NORMAL
actions = ["Normal operation"]
return CostStatus(
state=state,
daily_spend_usd=daily_spend,
budget_remaining_usd=max(0, remaining),
llm_calls_today=llm_calls,
actions=actions,
)
def record_llm_call(self, model: str, input_tokens: int, output_tokens: int) -> None:
"""Record an LLM call and its estimated cost."""
today = self._today_key()
# Calculate cost based on model
if "sonnet" in model.lower():
cost = (input_tokens / 1_000_000 * 3.0) + (output_tokens / 1_000_000 * 15.0)
elif "haiku" in model.lower():
cost = (input_tokens / 1_000_000 * 0.25) + (output_tokens / 1_000_000 * 1.25)
else:
cost = 0.0
pipe = self._redis.pipeline()
pipe.incrbyfloat(f"cost:daily:{today}", cost)
pipe.incr(f"cost:calls:{today}")
pipe.expire(f"cost:daily:{today}", 86400)
pipe.expire(f"cost:calls:{today}", 86400)
pipe.execute()
def _today_key(self) -> str:
return time.strftime("%Y-%m-%d", time.gmtime())
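A usage sketch of the breaker above (the two stubbed helpers and the model identifiers are illustrative), showing the intended call order: consult `check_state()` before picking a model, then record the call with `record_llm_call()` afterwards:

```python
def render_template_response(message: str) -> str:
    # Stand-in for the template/cache path used in DEGRADED mode.
    return "template-response"

def invoke_model(model_id: str, message: str) -> tuple[str, int, int]:
    # Stand-in for a Bedrock call; returns (reply, input_tokens, output_tokens).
    return "llm-response", 1200, 250

def answer_with_budget_guard(breaker: CostCircuitBreaker, message: str) -> str:
    status = breaker.check_state()
    if status.state is CostState.DEGRADED:
        # Breaker open: no LLM calls for the rest of the day.
        return render_template_response(message)
    # WARNING state forces the cheaper model regardless of query complexity.
    model_id = "haiku" if status.state is CostState.WARNING else "sonnet"
    reply, in_tok, out_tok = invoke_model(model_id, message)
    breaker.record_llm_call(model_id, input_tokens=in_tok, output_tokens=out_tok)
    return reply
```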
4. Graceful Degradation Ladder
When downstream services are overloaded, progressively disable expensive features.
graph TD
subgraph "Level 0: Normal"
A[Full RAG + LLM + All Services]
end
subgraph "Level 1: Pressure"
B[Disable reranker<br>Use top-3 raw KNN results]
end
subgraph "Level 2: High Load"
C[Switch to Haiku model<br>Reduce RAG to top-1 chunk]
end
subgraph "Level 3: Overload"
D[Template-only for guests<br>LLM only for authenticated]
end
subgraph "Level 4: Emergency"
E[Template-only for all<br>Escalate complex to human]
end
A --> B
B --> C
C --> D
D --> E
style A fill:#2d8,stroke:#333
style B fill:#9d2,stroke:#333
style C fill:#fd2,stroke:#333
style D fill:#f92,stroke:#333
style E fill:#f66,stroke:#333
Code Example: Degradation Controller
from dataclasses import dataclass
from enum import IntEnum
import redis
class DegradationLevel(IntEnum):
NORMAL = 0
PRESSURE = 1
HIGH_LOAD = 2
OVERLOAD = 3
EMERGENCY = 4
@dataclass
class DegradationConfig:
level: DegradationLevel
use_reranker: bool
model_tier: str
max_rag_chunks: int
guest_llm_enabled: bool
auth_llm_enabled: bool
DEGRADATION_CONFIGS = {
DegradationLevel.NORMAL: DegradationConfig(
level=DegradationLevel.NORMAL,
use_reranker=True,
model_tier="sonnet",
max_rag_chunks=3,
guest_llm_enabled=True,
auth_llm_enabled=True,
),
DegradationLevel.PRESSURE: DegradationConfig(
level=DegradationLevel.PRESSURE,
use_reranker=False, # Skip reranker
model_tier="sonnet",
max_rag_chunks=3,
guest_llm_enabled=True,
auth_llm_enabled=True,
),
DegradationLevel.HIGH_LOAD: DegradationConfig(
level=DegradationLevel.HIGH_LOAD,
use_reranker=False,
model_tier="haiku", # Cheaper model
max_rag_chunks=1, # Fewer chunks
guest_llm_enabled=True,
auth_llm_enabled=True,
),
DegradationLevel.OVERLOAD: DegradationConfig(
level=DegradationLevel.OVERLOAD,
use_reranker=False,
model_tier="haiku",
max_rag_chunks=1,
guest_llm_enabled=False, # Guests get templates only
auth_llm_enabled=True,
),
DegradationLevel.EMERGENCY: DegradationConfig(
level=DegradationLevel.EMERGENCY,
use_reranker=False,
model_tier="none",
max_rag_chunks=0,
guest_llm_enabled=False,
auth_llm_enabled=False, # Everyone gets templates
),
}
class DegradationController:
"""Determine degradation level based on system health metrics."""
def __init__(self, redis_client: redis.Redis):
self._redis = redis_client
def get_current_config(self) -> DegradationConfig:
level = self._compute_level()
return DEGRADATION_CONFIGS[level]
def _compute_level(self) -> DegradationLevel:
# Read health signals from Redis (set by monitoring system)
p99_latency = float(self._redis.get("health:p99_latency_ms") or 0)
error_rate = float(self._redis.get("health:error_rate_pct") or 0)
cpu_util = float(self._redis.get("health:cpu_util_pct") or 0)
bedrock_throttle_rate = float(
self._redis.get("health:bedrock_throttle_pct") or 0
)
if bedrock_throttle_rate > 20 or error_rate > 10:
return DegradationLevel.EMERGENCY
if cpu_util > 85 or p99_latency > 5000:
return DegradationLevel.OVERLOAD
if cpu_util > 70 or p99_latency > 3000:
return DegradationLevel.HIGH_LOAD
if cpu_util > 55 or p99_latency > 2000:
return DegradationLevel.PRESSURE
return DegradationLevel.NORMAL
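The risks table and the later S2 finding require at least two concurrent health signals before degrading, while `_compute_level` above degrades on any single signal. A sketch of a multi-signal variant follows (the CPU and latency thresholds mirror the ones above; the per-level error and throttle thresholds are assumptions):

```python
def compute_level_multi_signal(redis_client: redis.Redis) -> DegradationLevel:
    """Enter a degradation level only when >= 2 independent signals agree (sketch)."""
    cpu = float(redis_client.get("health:cpu_util_pct") or 0)
    p99 = float(redis_client.get("health:p99_latency_ms") or 0)
    err = float(redis_client.get("health:error_rate_pct") or 0)
    throttle = float(redis_client.get("health:bedrock_throttle_pct") or 0)

    # Per-level thresholds: (cpu %, p99 ms, error %, bedrock throttle %).
    thresholds = [
        (DegradationLevel.EMERGENCY, (95, 8000, 10, 20)),
        (DegradationLevel.OVERLOAD, (85, 5000, 5, 10)),
        (DegradationLevel.HIGH_LOAD, (70, 3000, 3, 5)),
        (DegradationLevel.PRESSURE, (55, 2000, 2, 2)),
    ]
    for level, (cpu_t, p99_t, err_t, thr_t) in thresholds:
        signals = sum([cpu > cpu_t, p99 > p99_t, err > err_t, throttle > thr_t])
        if signals >= 2:
            return level
    return DegradationLevel.NORMAL
```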
Combined Cost Impact Summary
graph LR
subgraph "Before Optimization"
A[Rate Limiting: None<br>All traffic hits LLM]
B[Guest Pipeline: Same as Auth<br>Full LLM for all]
C[Cost Control: Manual monitoring<br>No automatic limits]
end
subgraph "After Optimization"
D[Rate Limiting: Tiered<br>5-10% abuse blocked]
E[Guest Pipeline: Lite<br>Template + Cache priority]
F[Cost Control: Circuit Breaker<br>Auto-degrade at budget]
end
A --> D
B --> E
C --> F
style D fill:#2d8,stroke:#333
style E fill:#2d8,stroke:#333
style F fill:#2d8,stroke:#333
Monitoring and Metrics
| Metric | Target | Alert |
|---|---|---|
| Rate limit rejection rate | 3-8% | > 15% (too aggressive) or < 1% (too lenient) |
| Guest template-only rate | ≥ 50% | < 30% |
| Cost circuit breaker triggers | 0/month | Any trigger |
| Daily Bedrock spend | ≤ $4,200 | > $5,000 |
| Degradation level | 0 (Normal) | ≥ 2 for > 10 min |
| Bot detection block rate | 3-5% | > 10% |
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Rate limits too aggressive for power users | Frustrated customers leave | Prime users get 60 msg/min; monitor feedback from rate-limited users |
| Guest lite pipeline hurts conversion | Fewer guest → authenticated upgrades | A/B test: measure conversion rate with full vs. lite pipeline for guests |
| Cost circuit breaker triggers during legitimate spike | Degraded experience during manga release event | Pre-announce events; temporarily raise daily budget for known spikes |
| Degradation level miscalculated | Degraded mode when system is healthy | Multiple health signals (latency + CPU + error rate + throttle); require 2+ signals to degrade |
Deep Dive: Why This Works on a Manga Chatbot Workload
This story is the outer control loop of the cost-optimization stack. While US-01 through US-07 each reduce the steady-state per-request cost, none of them can prevent a runaway: a botnet, a misconfigured retry loop, a viral event, or a single unbounded conversation can blow through the daily budget no matter how cheap each individual request is. US-08 exists because per-request cost optimization is a necessary-but-not-sufficient cost strategy — you also need a hard ceiling enforced at the demand side. The 20–35% saving target combines three structurally different cost reductions.
Property 1: Chatbot traffic has a long tail of low-value and abusive requests. Public estimates put bot traffic at ~30% of internet traffic, with ~5% explicitly malicious (Imperva annual reports). For a public-facing chatbot, the breakdown is similar: scrapers probing the API, jailbreak attempts, abusive prompts, accidental retry loops from misbehaving clients, and one-message-and-leave guests who never convert. Each of these consumes the full Bedrock + RAG + DDB cost path while producing zero business value. Tiered rate limiting (Prime > Auth > Guest) does not just stop abuse — it implicitly downgrades the cost-per-request multiplier on the lowest-value segment, which is also the largest. The architectural assumption is that the tier-detection signal (Prime status, Auth cookie, IP reputation) is reliable; the failure mode is misclassification (a real Prime user gets Guest treatment) and the mitigation is a feedback signal from CSAT into the rate-limit tuning.
Property 2: The cost circuit breaker is a one-way ratchet against runaways. Cost optimization is fundamentally a probabilistic argument — most requests should be cheap, on average. But averages hide tail risk: at the Sonnet rates used above ($3/M input, $15/M output), a single 30K-token prompt costs roughly $0.09 in input tokens alone — more than ten average requests — and a conversation whose context grows to 100K tokens costs ~$0.30 of input per turn by the end. A scripted attacker sending a thousand such requests per minute can blow through the daily budget in well under an hour. The cost circuit breaker (the state machine and code above) is a last line of defense: when daily Bedrock spend approaches the budget cap, the breaker forces all traffic to Haiku (or template-only) for the rest of the day. This is not "cost optimization" in the steady-state sense — it is risk control that bounds the worst-case daily spend. The architectural assumption is that degraded service for the rest of the day is acceptable in exchange for a hard cost cap; the alternative (uncapped spending) is unacceptable for any production system. The kill-switch granularity matters: per-tier breakers (Prime keeps full service, Guest goes to template-only) preserve revenue while shedding cost.
Property 3: Graceful degradation matches load shedding to user value. Under sustained overload (CPU pressure, Bedrock throttle, RAG latency spikes), the system has three options: (a) accept all requests at degraded latency, (b) reject excess requests with HTTP 429, or (c) serve excess requests with a cheaper, faster pipeline. Option (c) — graceful degradation — preserves the user experience for the maximum number of users at the cost of reduced quality for the marginal users. The 5-level degradation ladder (Normal → Pressure → High Load → Overload → Emergency) implements this with progressively more aggressive cost-shedding: at Pressure, only the reranker is dropped; at Emergency, only template responses are served. The architectural assumption is that the degradation levels are calibrated to real failure cliffs (CPU > 85% is genuinely unhealthy, not just busy); the failure mode is over-reactive degradation (degraded mode during normal busy periods) and the mitigation is requiring 2+ concurrent signals before degrading.
Property 4: Off-peak cheaper paths exploit diurnal cost asymmetry. Late-night JST traffic is dominated by automated agents, scrapers, and overseas users — a different value mix than peak-hour traffic. Routing this traffic to a "lite" pipeline (template + cache priority + Haiku-only) trades quality-on-the-marginal-user for cost. This is invisible during peak (when full pipeline is needed for paying customers) and large during off-peak (when 60% of requests can be served by template+cache without quality regression). The mechanism is just a time-of-day-aware tier ceiling: between 11pm and 8am JST, no request escalates above tier-2 (Haiku).
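A minimal sketch of that time-of-day tier ceiling, reusing the `PipelineConfig` dataclass from the Priority Router example above (the 23:00–08:00 JST window comes from this paragraph; the clamping logic itself is an assumption):

```python
from dataclasses import replace
from datetime import datetime, timedelta, timezone
from typing import Optional

JST = timezone(timedelta(hours=9))

def apply_off_peak_ceiling(
    config: PipelineConfig, now: Optional[datetime] = None
) -> PipelineConfig:
    """Clamp the pipeline to the Haiku tier during the off-peak JST window (sketch)."""
    now_jst = (now or datetime.now(timezone.utc)).astimezone(JST)
    off_peak = now_jst.hour >= 23 or now_jst.hour < 8
    if off_peak and config.model_tier == "sonnet":
        return replace(
            config,
            model_tier="haiku",          # never escalate above Haiku off-peak
            use_template_first=True,     # prefer template/cache paths off-peak
            reason=config.reason + " (off-peak tier ceiling applied)",
        )
    return config
```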
Bottom line: the savings stack non-linearly. Steady-state savings (rate limiter blocking abuse + guest-lite pipeline + off-peak tier ceiling) compound multiplicatively over time — every hour, savings continue. The cost circuit breaker is more like insurance: it has zero benefit until it has enormous benefit. Pricing the breaker by "how often does it trigger" misses the point — its value is the worst-case scenario it prevents, not the average-case savings it produces.
Real-World Validation
Industry Benchmarks & Case Studies
- Stripe engineering: "Scaling your API with rate limiters" — Documents tiered rate limiting with token bucket, Redis pipelines, and per-user + per-global limit composition. The implementation pattern in this story is closely modeled on the Stripe pattern.
- Shopify engineering: "Surviving Black Friday" — Documents traffic-based degradation with per-feature kill switches and graceful-degradation tiers. Validates the 5-level degradation ladder pattern.
- Google SRE Book, Chapter 22 ("Addressing Cascading Failures") and Chapter 21 ("Handling Overload") — Foundational text on load shedding, queueing, and graceful degradation. The "must shed before saturation" principle underpins the requirement that degradation triggers fire before CPU saturation, not at it.
- Hystrix / Polly circuit breaker patterns — Netflix's Hystrix (now in maintenance) and the .NET Polly library codified the circuit breaker pattern; the cost-tracking variant in this story (cost-as-circuit-state instead of error-rate-as-circuit-state) is a direct extension.
- Cisco IOS rate limiting documentation — A canonical networking reference for token bucket traffic policing (the algorithm itself long predates IOS); the per-tier limits in this story (Prime 60/min, Auth 30/min, Guest 10/min) follow the standard token-bucket sizing approach.
- Imperva Bad Bot Report (annual) — Bot traffic accounts for ~30% of internet traffic with ~5% explicitly malicious. The story's "5–10% abuse blocked" target is consistent with this baseline; values much higher would suggest legitimate users being blocked, lower would suggest insufficient enforcement.
- Cloudflare engineering: "How we built rate limiting capable of scaling to millions of domains" — Distributed rate limiting architecture and Redis pipeline patterns. Validates the choice of Redis pipeline (atomic INCR + TTL) for the per-tier counter.
- Internal cross-reference: `POC-to-Production-War-Story/02-seven-production-catastrophes.md` — The "cost explosion" catastrophe was specifically a runaway that this circuit breaker would have caught; its failure was the absence of a hard ceiling. This story is the documented fix.
- Internal cross-reference: `Optimization-Tradeoffs-User-Stories/` — Covers the broader trade-off between cost-control aggressiveness and user-experience preservation; this story is the cost-floor operating point.
Math Validation
- Token bucket math: at Prime tier 60/min, a single user could consume up to 86,400 requests/day (60 × 1440 min) if only the per-minute cap applied; the 500/hour cap bounds this further to 12,000 requests/day. At a Bedrock peak cost of ~$0.012/req on Sonnet, a maxed-out Prime user costs ~$1,037/day under the per-minute cap alone (~$144/day with the hourly cap). With ~10K Prime users, the theoretical max daily cost approaches $10M. Rate limits + the circuit breaker are the only mechanisms preventing this theoretical max from becoming actual (the arithmetic is double-checked in the sketch below).
- Rate limit blocking 5% × 1M req/day × $0.008 average = $400/day = $12K/month saved, just from abuse blocking. ✅
- Guest lite pipeline: ~30% of traffic is guest, of which ~70% can be served by template+cache+Haiku only. Cost per Guest req drops from ~$0.008 (full Sonnet) to ~$0.0005 (Haiku) → 30% × 70% × ($0.008 − $0.0005) × 1M req/day = $1,575/day = ~$47K/month saved. Flag: the story claims "$15K from guest lite" — recheck the conversion-rate impact assumption (some guests upgrade to Auth and pay full pipeline cost, reducing the saving).
- Cost circuit breaker: zero recurring savings; insurance value capped at "daily budget cap × 30 days = monthly worst-case avoided." If daily budget is $5K, the breaker prevents up to $150K/month in runaway scenarios.
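A quick arithmetic check of the figures above (the per-request dollar rates and traffic volumes are the story's own assumptions):

```python
# Prime-tier theoretical maximum.
per_minute_only = 60 * 1440                     # 86,400 req/day if only the 60/min cap applied
print(per_minute_only * 0.012)                  # ~$1,036.8/day per maxed-out Prime user
print(500 * 24 * 0.012)                         # ~$144/day once the 500/hour cap is applied

# Abuse blocking: 5% of 1M req/day at $0.008 average cost per request.
print(0.05 * 1_000_000 * 0.008 * 30)            # ~$12,000/month

# Guest lite pipeline: 30% guest share, 70% of those servable without Sonnet.
print(0.30 * 0.70 * (0.008 - 0.0005) * 1_000_000 * 30)   # ~$47,250/month
```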
Conservative vs Aggressive Savings Bounds
| Bound | Source | Total monthly savings |
|---|---|---|
| Conservative | Rate limit + circuit breaker only | ~10% (~$30K/month) |
| Aggressive | Rate limit + guest-lite + off-peak ceiling + circuit breaker + degradation | ~35% (~$110K/month) |
| Story's projected savings | 20–35% | Aligns with the aggressive bound; insurance value of breaker is uncounted. |
Cross-Story Interactions & Conflicts
This story is the integrating cost-control layer for all other stories. Most edges are authoritative on this side.
- US-04 (Compute) — Authoritative side: this story for the degradation–autoscaling contract. Conflict mode: auto-scaling adds capacity in response to load; degradation sheds load in response to capacity pressure. If both fire on the same trigger, you get oscillation (degradation reduces load → CPU drops → auto-scaler scales in → next traffic burst hits with no headroom → degradation fires again). Resolution: when `degradation_level >= 2`, this story emits a `suspend_scale_in=true` signal that US-04's auto-scaler honors. Scale-out is still allowed (more capacity is always safe); only scale-in is suspended.
- US-01 (LLM Tokens) — Authoritative side: this story for the `model_tier_floor` config. Conflict mode: US-01's complexity classifier may route a query to Sonnet while this story's breaker is forcing Haiku-only. Resolution: the `model_tier_floor` is a Redis-backed shared config (10s TTL on the client; see the sketch after this list). When the breaker trips, `model_tier_floor=haiku` is set; US-01's tier selector reads this and suppresses Sonnet routing regardless of complexity. Shared kill-switch path: SSM Parameter Store mirror for emergency manual override.
- US-06 (RAG) — Authoritative side: this story for the `degradation_level` signal. Conflict mode: under DEGRADED state, RAG bypass should be more aggressive (bypass on lower-confidence intents). Resolution: US-06's bypass gate reads `degradation_level` and applies a threshold matrix: at level 0, bypass on intent-match only; at level 2+, bypass on any non-RAG-shaped intent regardless of confidence; at level 4 (Emergency), bypass everything — RAG is fully off.
- US-07 (Analytics Pipeline) — Authoritative side: US-07. The cost circuit breaker reads cumulative daily Bedrock spend from the cost-tracking event stream. Conflict mode: if event-batching latency exceeds 5 minutes, the breaker decides on stale data. Resolution: cost-tracking events use a dedicated stream with 5-event / 1-second batching, not the default 50-event / 5-second batching. Event lag SLO: ≤ 5 minutes P95.
- US-02 (Intent Classifier) — Authoritative side: US-02 owns the intent label. Conflict mode: the request prioritizer routes by intent + tier. If the classifier is unavailable (cold start, scale-from-zero failure), prioritization defaults to MEDIUM — but the system may actually be overloaded. Resolution: this story emits a `classifier_unavailable=true` signal when the intent-unavailable rate exceeds 1% over 1 minute; under this signal, all traffic without explicit Auth/Prime tier is treated as Guest tier (rate-limited harder).
- US-03 (Caching) — Indirect interaction. Rate limiter state lives on the Redis tier from US-03. Conflict mode: during ElastiCache failover, rate limiter state is unavailable for ~30–90 seconds — a potential burst pass-through. Resolution: local in-process token-bucket fallback during Redis unavailability; counts merged on Redis recovery.
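A sketch of the `model_tier_floor` read path described in the US-01 bullet above (the key name, default, and tier ordering are assumptions; despite the name, the value acts as a cap on the most expensive tier a request may use):

```python
import time
import redis

class ModelTierFloor:
    """Redis-backed shared config with a 10-second client-side cache (sketch)."""

    _ORDER = {"none": 0, "haiku": 1, "sonnet": 2}  # cheapest to most expensive

    def __init__(self, redis_client: redis.Redis, ttl_seconds: float = 10.0):
        self._redis = redis_client
        self._ttl = ttl_seconds
        self._cached_value = "sonnet"   # absent key => no restriction
        self._cached_at = 0.0

    def get(self) -> str:
        # Serve from the local cache to keep the per-request hot path cheap.
        if time.time() - self._cached_at < self._ttl:
            return self._cached_value
        raw = self._redis.get("config:model_tier_floor")
        value = raw.decode() if isinstance(raw, bytes) else raw
        self._cached_value = value or "sonnet"
        self._cached_at = time.time()
        return self._cached_value

    def clamp(self, requested_tier: str) -> str:
        """Return the requested tier unless the breaker has forced a cheaper one."""
        floor = self.get()
        if self._ORDER.get(requested_tier, 0) > self._ORDER.get(floor, 2):
            return floor
        return requested_tier

# When the breaker trips: redis_client.set("config:model_tier_floor", "haiku")
# US-01's tier selector then calls clamp("sonnet") -> "haiku".
```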
Rollback & Experimentation
Shadow-Mode Plan
- Rate limiter: deploy in observe mode for 2 weeks — log "would have rate-limited" decisions but allow all requests through. Compare projected block rates against bot-traffic estimates. Tune per-tier limits based on observed user behaviors before enforcing.
- Cost circuit breaker: deploy in alarm-only mode for 4 weeks — when projected breaker-trip threshold is crossed, fire alerts but do not actually degrade. Measure how often manual intervention would have been needed.
- Degradation controller: deploy with the `degradation_active=false` flag for 2 weeks; all signals computed but no enforcement. Validate that degradation triggers correctly correspond to real overload events.
- Guest-lite pipeline: A/B test against full pipeline on 10% of guest traffic for 4 weeks; measure conversion rate (guest → authenticated upgrade) for both arms.
Canary Thresholds
- Rate limiter: start at 2× the planned per-tier limit (effectively only blocking egregious abusers); halve to planned limit over 4 weeks.
- Cost breaker: start with daily budget at 150% of planned cap (effectively only catches runaway scenarios); ramp down to planned cap over 4 weeks.
- Abort criteria (any one trips): false-positive rate-limit complaints from authenticated users > expected baseline + 50%, conversion rate drop on guest-lite arm > 10%, circuit breaker triggers during normal traffic > 0.
Kill Switches
This story has the most kill switches because it has the most safety-critical control loops:
- rate_limit_enabled — disables all rate limiting.
- cost_circuit_breaker_enabled — disables the cost cap; cost can run unbounded.
- degradation_controller_enabled — disables all graceful-degradation behavior.
- guest_lite_pipeline_enabled — guests get full pipeline.
- off_peak_tier_ceiling_enabled — disables time-of-day cost shaping.
All flags read from SSM Parameter Store with 30-second client cache; rollback < 2 minutes per flag.
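A sketch of that flag read path (boto3 SSM `get_parameter` with a 30-second in-process cache; the `/cost-control/` prefix is borrowed from the S1 findings below, and the fail-safe defaults are assumptions):

```python
import time
import boto3

class KillSwitchFlags:
    """SSM-backed boolean flags with a 30-second client-side cache (sketch)."""

    def __init__(self, prefix: str = "/cost-control/", cache_ttl_seconds: float = 30.0):
        self._ssm = boto3.client("ssm")
        self._prefix = prefix
        self._ttl = cache_ttl_seconds
        self._cache: dict = {}  # flag name -> (fetched_at, value)

    def is_enabled(self, flag_name: str, default: bool = True) -> bool:
        cached = self._cache.get(flag_name)
        if cached and time.time() - cached[0] < self._ttl:
            return cached[1]
        try:
            resp = self._ssm.get_parameter(Name=self._prefix + flag_name)
            value = resp["Parameter"]["Value"].strip().lower() == "true"
        except Exception:
            value = default  # on SSM failure, fall back to the flag's safe default
        self._cache[flag_name] = (time.time(), value)
        return value

# Example: skip rate limiting only if the kill switch was deliberately flipped off.
# if not flags.is_enabled("rate_limit_enabled"): ...
```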
Quality Regression Criteria (story-specific)
- Rate-limit false positive rate (complaints from authenticated users): ≤ baseline + 5%.
- Cost circuit breaker triggers during normal traffic (ground-truth from post-incident review): 0/quarter.
- Conversion rate impact of guest-lite pipeline: ≤ 5% reduction (above this, narrow guest-lite to template-only-when-eligible).
- Degradation controller miscalculation rate (degradation triggered when post-hoc analysis shows system was healthy): ≤ 1 event/quarter.
Multi-Reviewer Validation Findings & Resolutions
The cross-reviewer pass identified the following story-specific findings. This story carries the highest density of S1 issues because it is the outer cost-control loop — failures here have the largest blast radius. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.
S1 (must-fix before production)
Rate-limiter tier signal is spoofable. Tier (Prime / Auth / Guest) is read from request context. If derived from a header, cookie, or unverified JWT claim, an attacker can set X-User-Tier: prime and bypass per-tier limits, achieving 6× cost escalation as a guest. Resolution: tier MUST be derived from user_id lookup in an IAM-protected DynamoDB or Aurora table on every request — never from a request header. JWT-based tier requires HS256/RS256 signature verification with key rotation. Server-side immutable source only. Add tier_signal_origin log field; alarm on any non-DDB origin.
Cost ledger in unauthenticated Redis can be tampered with. The `cost:daily:{today}` counter lives in Redis with no write protection. Compromised service code (or a Redis exposure) can `DECRBY` the counter to keep the breaker from tripping; the attacker keeps the reported cost under the cap while incurring real $50K+ spend. Resolution: the authoritative cost ledger is DynamoDB (a `cost_ledger` table, strongly consistent reads, IAM write-restricted to a single cost-recorder role); Redis is a read-through cache only. The breaker reads from DDB whenever it disagrees with Redis; weekly reconciliation alerts on divergence > 5%. CloudTrail on every write by the cost-recorder role.
Kill-switch privilege escalation. cost_circuit_breaker_enabled and model_tier_floor flags in SSM with default IAM allowing any service role to write. Compromised low-privilege task can flip the flag and disable cost protection. Resolution: SSM Parameter Store IAM policy restricts PutParameter on /cost-control/* to a single finops-lead role (named human IAM role, MFA required). CloudTrail alarm on every parameter change. CloudFormation StackPolicy prevents drift.
Retry storm amplification. Token bucket rejects with 429; client retries with deterministic backoff create synchronized retry waves at the same second boundary, re-rejecting in cascading waves. Resolution: every 429 response carries Retry-After header with server-computed jittered value (1 + uniform_random(0, 10) seconds), spreading retries across a window. Document this in the public API spec so client SDKs respect it.
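A sketch of the jittered Retry-After computation (the 1 + uniform(0, 10) second window is taken from the resolution above):

```python
import random

def jittered_retry_after(base_seconds: int = 1, jitter_max_seconds: int = 10) -> int:
    """Server-computed Retry-After value that spreads synchronized client retries."""
    return base_seconds + random.randint(0, jitter_max_seconds)

# Attach to every 429: headers["Retry-After"] = str(jittered_retry_after())
```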
Control-plane / data-plane mixing in CostCircuitBreaker. `check_state()` runs on every request (data plane), and `record_llm_call()` writes the Redis ledger inline on the same request path (control plane). Per-request latency is now in the cost-write critical path. Resolution: separate concerns:
- Data plane (per-request, fast): read `current_state` from Redis (with DDB fallback), allow/deny.
- Control plane (async, eventual): Bedrock-call-completed events flow through Kinesis (US-07) to a Lambda that updates the DDB cost ledger and recomputes `current_state` on a 60-second tick. Per-request decisions never write the ledger.
This bounds decision lag to under 60 seconds while making the data-plane gate read-only and fast.
Cost circuit breaker as DoS surface. An attacker who knows the breaker trips at $5K daily can intentionally drive spend to $5K, forcing all users (including legitimate Prime) into degraded mode for 24 hours. Resolution: per-tier breakers (Prime stays in NORMAL even when global breaker trips, up to its own per-tier sub-budget); rate-limit suspicious sources hard before they can drive global spend; alarm on "single source consumed > 10% of daily budget" as DoS indicator.
S2 (fix before scale-up)
degradation_level precedence with US-01/US-04/US-06 must be enforced through the central evaluator. Each story currently reads the signal independently; flag-cache divergence (30s SSM cache) can cause inconsistent state across stories. Resolution: mandatory feature-flag evaluator module per README precedence rules; direct SSM reads forbidden in story code.
Global rate-limit key collision at the minute boundary. `rate:global:min:{now // 60}` is a fixed-window counter that resets at each minute boundary, allowing brief burst spikes (up to roughly 2× the limit across the boundary). Resolution: a sliding-window counter (e.g., 10-second buckets aggregated into a 60-second sum) instead of a fixed window; or a randomized key offset per second.
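A sketch of the sliding-window alternative for the global counter (10-second buckets summed over the last minute; the key names follow the pattern above but are assumptions):

```python
import time
import redis

def global_limit_allows(
    r: redis.Redis, limit_per_minute: int, bucket_seconds: int = 10
) -> bool:
    """Sum the current 10-second bucket plus the previous five (sketch)."""
    current_bucket = int(time.time()) // bucket_seconds
    pipe = r.pipeline()
    pipe.incr(f"rate:global:{current_bucket}")
    pipe.expire(f"rate:global:{current_bucket}", bucket_seconds * 7)
    for i in range(1, 6):
        pipe.get(f"rate:global:{current_bucket - i}")   # previous five buckets
    results = pipe.execute()
    window_total = int(results[0]) + sum(int(v) for v in results[2:] if v)
    return window_total <= limit_per_minute
```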
Bedrock throttle as side-channel signal. Degradation reads health:bedrock_throttle_pct. Attacker can artificially trigger Bedrock throttling (high-fanout calls) to force degradation. Resolution: require ≥ 2 concurrent signals (throttle + sustained CPU + sustained error rate) before degrading; do not degrade on single-signal evidence.
Cost circuit breaker stuck-in-DEGRADED scenario. If the steady-state daily run rate already exceeds the budget, the breaker re-trips shortly after every midnight UTC reset and the system effectively lives in DEGRADED. Resolution: an alarm on `cost_state == DEGRADED` for > 4 hours triggers manual budget-review escalation; DEGRADED state has a "do you really want to extend" check at every shift handoff.
Rate-limit logs contain IPs (PII under GDPR). Resolution: hash IPs (SHA-256(IP + monthly-rotated salt)) before logging; log retention ≤ 30 days.
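A sketch of that pseudonymization step (the salt storage and rotation mechanism are out of scope here):

```python
import hashlib

def hash_ip(ip: str, monthly_salt: str) -> str:
    """SHA-256(IP + rotated salt) so rate-limit logs carry no raw IPs."""
    return hashlib.sha256((ip + monthly_salt).encode("utf-8")).hexdigest()

# Example: log_record["client_ip_hash"] = hash_ip("203.0.113.7", current_month_salt)
```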
Tier auth fallback when DDB unavailable. If the DDB tier-lookup fails, do not fall back to "Prime" by default. Resolution: fallback to Guest tier on DDB lookup failure; this is fail-secure (more rate-limiting, never less).
S3 (acknowledged / future work)
- Per-tier sub-budgets (each tier has its own daily spend cap, breaker-tripped independently).
- Anomaly detection on `cost:daily` velocity (alert when the 1-hour spend rate exceeds 1.5× the rolling 7-day same-hour average).
- Multi-region active-active for the rate limiter — out of scope.
- Token-cost validation: monthly compare estimated cost vs Bedrock billing; alarm on > 5% divergence.
Runbook: Cost Circuit Breaker Trips at 3am JST
Symptoms: PagerDuty alert "cost_state == WARNING" or "DEGRADED"; daily Bedrock spend > 80% (WARNING) or > 100% (DEGRADED) of cap.
Triage (in order):
- Confirm the trip is real, not stale data: query the DDB `cost_ledger` table directly (skip the Redis cache) and verify the spend value; if Redis ≠ DDB, suspect tampering and escalate to security.
- Identify the source: query US-07 cost events for the last 4 hours, grouped by `tier_used` + `customer_id_hash` + `model_id`. Look for: (a) a single customer driving > 10% of spend (possible compromised account or runaway client retry); (b) `tier_used` skew (has the Sonnet routing rate doubled?); (c) an intent-distribution shift (a new pattern indicating bot traffic).
- If single-customer runaway: revoke or rate-limit that customer specifically; let the rest of the system continue normally.
- If broad spend increase: keep DEGRADED state for the rest of the day; the breaker is doing its job.
- If false trip (post-hoc analysis shows healthy traffic): tune the daily budget upward at the next FinOps review; do not flip `cost_circuit_breaker_enabled=false` to mask the issue.
- Page the FinOps lead if the trip persists > 4 hours.
Escalation: if the runaway is global and the breaker trip is insufficient (cost still climbing), manually flip `bedrock_invocation_enabled=false` (per the US-01 runbook) — all chat traffic falls back to degraded template responses; LLM cost goes to zero.