US-08: Traffic-Based Cost Optimization
User Story
As a site reliability engineer, I want to implement intelligent traffic management with rate limiting, request prioritization, and graceful degradation, So that total infrastructure costs decrease by 20-35% during normal operations and the system avoids runaway costs during traffic spikes.
Acceptance Criteria
- Rate limiter enforces per-user and global limits to prevent LLM cost spikes.
- Request prioritization ensures authenticated users get higher throughput than guests.
- Graceful degradation reduces costs by bypassing expensive components when they are overloaded.
- Cost circuit breaker halts LLM calls when daily spend exceeds budget.
- Off-peak traffic uses cheaper execution paths automatically.
- Total infrastructure costs decrease by 20-35% through traffic shaping.
High-Level Design
Cost Problem
Without traffic controls, cost scales linearly with traffic — and super-linearly during spikes (when provisioned capacity overflows to more expensive on-demand resources). A single abusive user could generate thousands of LLM calls.
Uncontrolled cost risks:
- Bot/abuse traffic: 5-15% of messages may be low-value or abusive
- Guest discovery browsing: generates LLM calls but low conversion
- Traffic spikes: Black Friday and manga release events can drive 3-5x normal traffic
Traffic Management Architecture
graph TD
A[Incoming Request] --> B[CloudFront<br>WAF + Bot Detection]
B --> C[Rate Limiter<br>Token Bucket]
C --> D{Within<br>Rate Limit?}
D -->|Yes| E[Request Prioritizer]
D -->|No| F[429 Too Many Requests<br>Zero downstream cost]
E --> G{Priority Level?}
G -->|High: Authenticated + Active Cart| H[Full Pipeline<br>LLM + RAG + All Services]
G -->|Medium: Authenticated| I[Standard Pipeline<br>Template-first, then LLM]
G -->|Low: Guest| J[Lite Pipeline<br>Template + Cache only]
K[Cost Circuit Breaker] --> L{Daily Spend<br>> Budget?}
L -->|Yes| M[Degrade to<br>Template-Only Mode]
L -->|No| N[Normal Operation]
style F fill:#2d8,stroke:#333
style J fill:#2d8,stroke:#333
style M fill:#fd2,stroke:#333
Savings Breakdown
| Technique | Reduction | Monthly Savings |
|---|---|---|
| Rate limiting (blocks 5-10% abuse traffic) | 8% of total LLM calls | ~$25,000 |
| Guest lite pipeline (template + cache only) | 15% of LLM calls avoided | ~$15,000 |
| Request prioritization (off-peak degradation) | 10% compute savings | ~$300 |
| Cost circuit breaker (prevents runaway) | Prevents $50K+ overrun/month | Risk avoidance |
| Bot detection (WAF) | 3-5% request rejection | ~$5,000 |
| Total | | ~$45,300/month |
Low-Level Design
1. Tiered Rate Limiter
Different rate limits based on user tier and request type.
graph TD
subgraph "Rate Limit Tiers"
A[Prime Authenticated<br>60 msg/min, 500/hour]
B[Standard Authenticated<br>30 msg/min, 200/hour]
C[Guest<br>10 msg/min, 50/hour]
D[Global<br>50,000 msg/min total]
end
subgraph "Token Bucket Implementation"
E[Request Arrives] --> F[Check User Tier]
F --> G[Redis MULTI:<br>Check + Decrement Token]
G --> H{Tokens > 0?}
H -->|Yes| I[Allow Request]
H -->|No| J[429 + Retry-After Header]
end
Code Example: Tiered Rate Limiter
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import redis
class UserTier(Enum):
PRIME = "prime"
AUTHENTICATED = "authenticated"
GUEST = "guest"
@dataclass
class RateLimitConfig:
requests_per_minute: int
requests_per_hour: int
burst_allowance: int
TIER_LIMITS = {
UserTier.PRIME: RateLimitConfig(
requests_per_minute=60,
requests_per_hour=500,
burst_allowance=10,
),
UserTier.AUTHENTICATED: RateLimitConfig(
requests_per_minute=30,
requests_per_hour=200,
burst_allowance=5,
),
UserTier.GUEST: RateLimitConfig(
requests_per_minute=10,
requests_per_hour=50,
burst_allowance=2,
),
}
GLOBAL_LIMIT_PER_MINUTE = 50_000
@dataclass
class RateLimitResult:
allowed: bool
remaining: int
retry_after_seconds: Optional[int]
tier: UserTier
class TieredRateLimiter:
"""Token-bucket rate limiter with per-tier limits stored in Redis."""
def __init__(self, redis_client: redis.Redis):
self._redis = redis_client
def check_and_consume(
self, user_id: str, tier: UserTier
) -> RateLimitResult:
config = TIER_LIMITS[tier]
now = int(time.time())
minute_key = f"rate:{user_id}:min:{now // 60}"
hour_key = f"rate:{user_id}:hr:{now // 3600}"
global_key = f"rate:global:min:{now // 60}"
pipe = self._redis.pipeline()
# Check minute limit
pipe.incr(minute_key)
pipe.expire(minute_key, 60)
# Check hour limit
pipe.incr(hour_key)
pipe.expire(hour_key, 3600)
# Check global limit
pipe.incr(global_key)
pipe.expire(global_key, 60)
results = pipe.execute()
minute_count = results[0]
hour_count = results[2]
global_count = results[4]
# Check per-user minute limit
if minute_count > config.requests_per_minute + config.burst_allowance:
return RateLimitResult(
allowed=False,
remaining=0,
retry_after_seconds=60 - (now % 60),
tier=tier,
)
# Check per-user hour limit
if hour_count > config.requests_per_hour:
return RateLimitResult(
allowed=False,
remaining=0,
retry_after_seconds=3600 - (now % 3600),
tier=tier,
)
# Check global limit
if global_count > GLOBAL_LIMIT_PER_MINUTE:
return RateLimitResult(
allowed=False,
remaining=0,
retry_after_seconds=60 - (now % 60),
tier=tier,
)
remaining = config.requests_per_minute - minute_count
return RateLimitResult(
allowed=True,
remaining=max(0, remaining),
retry_after_seconds=None,
tier=tier,
)
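A minimal usage sketch of the limiter above (the handler name and Redis connection details are illustrative, not part of the story), showing how a rate-limit decision maps to an HTTP-style response before any downstream cost is incurred:

```python
import redis

def handle_chat_message(
    limiter: TieredRateLimiter, user_id: str, tier: UserTier, message: str
) -> dict:
    """Hypothetical request handler: reject over-limit traffic with 429 + Retry-After."""
    result = limiter.check_and_consume(user_id, tier)
    if not result.allowed:
        # Zero downstream cost: no LLM, RAG, or DynamoDB work happens for this request.
        return {
            "status": 429,
            "headers": {"Retry-After": str(result.retry_after_seconds)},
            "body": {"error": "rate_limited"},
        }
    # ... continue to the priority router / pipeline selection ...
    return {
        "status": 200,
        "headers": {"X-RateLimit-Remaining": str(result.remaining)},
        "body": {"echo": message},
    }

limiter = TieredRateLimiter(redis.Redis(host="localhost", port=6379))
print(handle_chat_message(limiter, "user-123", UserTier.GUEST, "hello"))
```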
2. Request Priority Router
Route requests through different execution paths based on user value.
graph TD
A[Request + User Context] --> B{User Tier<br>+ Cart Status}
B -->|Authenticated + Cart > 0<br>HIGH priority| C[Full Pipeline<br>RAG + LLM + All Services]
B -->|Authenticated + No Cart<br>MEDIUM priority| D[Standard Pipeline<br>Template-first → LLM fallback]
B -->|Guest<br>LOW priority| E{Intent Type?}
E -->|faq, product_discovery| F[Lite Pipeline<br>Template + Cache only<br>No LLM]
E -->|product_question,<br>recommendation| G[Standard Pipeline<br>with Haiku model]
style C fill:#f66,stroke:#333
style D fill:#fd2,stroke:#333
style F fill:#2d8,stroke:#333
Code Example: Priority Router
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class Priority(Enum):
HIGH = "high" # Full pipeline, Sonnet
MEDIUM = "medium" # Template-first, Sonnet fallback
LOW = "low" # Template + cache, Haiku fallback
@dataclass
class PipelineConfig:
priority: Priority
use_rag: bool
use_llm: bool
model_tier: str # "sonnet" or "haiku"
use_template_first: bool
use_cache_only: bool
reason: str
class RequestPriorityRouter:
"""Determine execution pipeline based on user tier and context."""
def route(
self,
user_tier: str,
is_authenticated: bool,
cart_size: int,
intent: str,
is_peak_hour: bool,
) -> PipelineConfig:
# High priority: authenticated user with active cart
if is_authenticated and cart_size > 0:
return PipelineConfig(
priority=Priority.HIGH,
use_rag=True,
use_llm=True,
model_tier="sonnet",
use_template_first=False,
use_cache_only=False,
reason="Authenticated user with active cart — full pipeline",
)
# Medium priority: authenticated user, no cart
if is_authenticated:
return PipelineConfig(
priority=Priority.MEDIUM,
use_rag=intent in ("faq", "product_question", "recommendation"),
use_llm=True,
model_tier="sonnet" if not is_peak_hour else "haiku",
use_template_first=True,
use_cache_only=False,
reason="Authenticated user — template-first with LLM fallback",
)
# Low priority: guest user
if intent in ("faq", "product_discovery", "promotion", "chitchat"):
return PipelineConfig(
priority=Priority.LOW,
use_rag=False,
use_llm=False,
model_tier="none",
use_template_first=True,
use_cache_only=True,
reason="Guest user, cacheable intent — template + cache only",
)
# Guest with complex intent: limited LLM
return PipelineConfig(
priority=Priority.LOW,
use_rag=intent in ("product_question", "recommendation"),
use_llm=True,
model_tier="haiku",
use_template_first=True,
use_cache_only=False,
reason="Guest user, complex intent — lite pipeline with Haiku",
)
3. Cost Circuit Breaker
Prevents runaway LLM costs by switching to template-only mode when daily spend exceeds budget.
stateDiagram-v2
[*] --> NormalOperation
NormalOperation --> WarningState: Daily spend > 80% budget
WarningState --> DegradedMode: Daily spend > 100% budget
DegradedMode --> NormalOperation: New day (midnight UTC reset)
WarningState --> NormalOperation: Spend rate decreases
state NormalOperation {
[*] --> FullPipeline
FullPipeline: All models available
FullPipeline: Full RAG + LLM
}
state WarningState {
[*] --> ReducedPipeline
ReducedPipeline: Only Haiku model
ReducedPipeline: Aggressive caching
ReducedPipeline: Template-first for all
}
state DegradedMode {
[*] --> TemplateOnly
TemplateOnly: No LLM calls
TemplateOnly: Template + cache responses
TemplateOnly: Escalation to human for complex queries
}
Code Example: Cost Circuit Breaker
import logging
import time
from dataclasses import dataclass
from enum import Enum
import redis
logger = logging.getLogger(__name__)
class CostState(Enum):
NORMAL = "normal"
WARNING = "warning"
DEGRADED = "degraded"
@dataclass
class CostBudget:
daily_limit_usd: float
warning_threshold_pct: float = 0.80
degraded_threshold_pct: float = 1.00
@dataclass
class CostStatus:
state: CostState
daily_spend_usd: float
budget_remaining_usd: float
llm_calls_today: int
actions: list[str]
class CostCircuitBreaker:
"""Monitor daily LLM spend and degrade execution path when budget exceeded."""
# Approximate costs per call
SONNET_AVG_COST = 0.008 # $0.008 per call (avg 1200 input + 250 output tokens)
HAIKU_AVG_COST = 0.0005 # $0.0005 per call
def __init__(
self,
redis_client: redis.Redis,
budget: CostBudget,
):
self._redis = redis_client
self._budget = budget
def check_state(self) -> CostStatus:
today = self._today_key()
spend_raw = self._redis.get(f"cost:daily:{today}")
calls_raw = self._redis.get(f"cost:calls:{today}")
daily_spend = float(spend_raw) if spend_raw else 0.0
llm_calls = int(calls_raw) if calls_raw else 0
remaining = self._budget.daily_limit_usd - daily_spend
spend_pct = daily_spend / self._budget.daily_limit_usd
if spend_pct >= self._budget.degraded_threshold_pct:
state = CostState.DEGRADED
actions = [
"BLOCK all LLM calls",
"SERVE template + cache responses only",
"ESCALATE complex queries to human agents",
"ALERT on-call engineer",
]
logger.warning(
f"Cost circuit OPEN — daily spend ${daily_spend:.2f} "
f"exceeds budget ${self._budget.daily_limit_usd:.2f}"
)
elif spend_pct >= self._budget.warning_threshold_pct:
state = CostState.WARNING
actions = [
"SWITCH all LLM calls to Haiku",
"ENABLE aggressive caching",
"FORCE template-first for all intents",
]
logger.info(
f"Cost circuit WARNING — daily spend ${daily_spend:.2f} "
f"at {spend_pct*100:.0f}% of budget"
)
else:
state = CostState.NORMAL
actions = ["Normal operation"]
return CostStatus(
state=state,
daily_spend_usd=daily_spend,
budget_remaining_usd=max(0, remaining),
llm_calls_today=llm_calls,
actions=actions,
)
def record_llm_call(self, model: str, input_tokens: int, output_tokens: int) -> None:
"""Record an LLM call and its estimated cost."""
today = self._today_key()
# Calculate cost based on model
if "sonnet" in model.lower():
cost = (input_tokens / 1_000_000 * 3.0) + (output_tokens / 1_000_000 * 15.0)
elif "haiku" in model.lower():
cost = (input_tokens / 1_000_000 * 0.25) + (output_tokens / 1_000_000 * 1.25)
else:
cost = 0.0
pipe = self._redis.pipeline()
pipe.incrbyfloat(f"cost:daily:{today}", cost)
pipe.incr(f"cost:calls:{today}")
pipe.expire(f"cost:daily:{today}", 86400)
pipe.expire(f"cost:calls:{today}", 86400)
pipe.execute()
def _today_key(self) -> str:
return time.strftime("%Y-%m-%d", time.gmtime())
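A usage sketch of the breaker above (the two stubbed helpers and the model identifiers are illustrative), showing the intended call order: consult `check_state()` before picking a model, then record the call with `record_llm_call()` afterwards:

```python
def render_template_response(message: str) -> str:
    # Stand-in for the template/cache path used in DEGRADED mode.
    return "template-response"

def invoke_model(model_id: str, message: str) -> tuple[str, int, int]:
    # Stand-in for a Bedrock call; returns (reply, input_tokens, output_tokens).
    return "llm-response", 1200, 250

def answer_with_budget_guard(breaker: CostCircuitBreaker, message: str) -> str:
    status = breaker.check_state()
    if status.state is CostState.DEGRADED:
        # Breaker open: no LLM calls for the rest of the day.
        return render_template_response(message)
    # WARNING state forces the cheaper model regardless of query complexity.
    model_id = "haiku" if status.state is CostState.WARNING else "sonnet"
    reply, in_tok, out_tok = invoke_model(model_id, message)
    breaker.record_llm_call(model_id, input_tokens=in_tok, output_tokens=out_tok)
    return reply
```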
4. Graceful Degradation Ladder
When downstream services are overloaded, progressively disable expensive features.
graph TD
subgraph "Level 0: Normal"
A[Full RAG + LLM + All Services]
end
subgraph "Level 1: Pressure"
B[Disable reranker<br>Use top-3 raw KNN results]
end
subgraph "Level 2: High Load"
C[Switch to Haiku model<br>Reduce RAG to top-1 chunk]
end
subgraph "Level 3: Overload"
D[Template-only for guests<br>LLM only for authenticated]
end
subgraph "Level 4: Emergency"
E[Template-only for all<br>Escalate complex to human]
end
A --> B
B --> C
C --> D
D --> E
style A fill:#2d8,stroke:#333
style B fill:#9d2,stroke:#333
style C fill:#fd2,stroke:#333
style D fill:#f92,stroke:#333
style E fill:#f66,stroke:#333
Code Example: Degradation Controller
from dataclasses import dataclass
from enum import IntEnum
import redis
class DegradationLevel(IntEnum):
NORMAL = 0
PRESSURE = 1
HIGH_LOAD = 2
OVERLOAD = 3
EMERGENCY = 4
@dataclass
class DegradationConfig:
level: DegradationLevel
use_reranker: bool
model_tier: str
max_rag_chunks: int
guest_llm_enabled: bool
auth_llm_enabled: bool
DEGRADATION_CONFIGS = {
DegradationLevel.NORMAL: DegradationConfig(
level=DegradationLevel.NORMAL,
use_reranker=True,
model_tier="sonnet",
max_rag_chunks=3,
guest_llm_enabled=True,
auth_llm_enabled=True,
),
DegradationLevel.PRESSURE: DegradationConfig(
level=DegradationLevel.PRESSURE,
use_reranker=False, # Skip reranker
model_tier="sonnet",
max_rag_chunks=3,
guest_llm_enabled=True,
auth_llm_enabled=True,
),
DegradationLevel.HIGH_LOAD: DegradationConfig(
level=DegradationLevel.HIGH_LOAD,
use_reranker=False,
model_tier="haiku", # Cheaper model
max_rag_chunks=1, # Fewer chunks
guest_llm_enabled=True,
auth_llm_enabled=True,
),
DegradationLevel.OVERLOAD: DegradationConfig(
level=DegradationLevel.OVERLOAD,
use_reranker=False,
model_tier="haiku",
max_rag_chunks=1,
guest_llm_enabled=False, # Guests get templates only
auth_llm_enabled=True,
),
DegradationLevel.EMERGENCY: DegradationConfig(
level=DegradationLevel.EMERGENCY,
use_reranker=False,
model_tier="none",
max_rag_chunks=0,
guest_llm_enabled=False,
auth_llm_enabled=False, # Everyone gets templates
),
}
class DegradationController:
"""Determine degradation level based on system health metrics."""
def __init__(self, redis_client: redis.Redis):
self._redis = redis_client
def get_current_config(self) -> DegradationConfig:
level = self._compute_level()
return DEGRADATION_CONFIGS[level]
def _compute_level(self) -> DegradationLevel:
# Read health signals from Redis (set by monitoring system)
p99_latency = float(self._redis.get("health:p99_latency_ms") or 0)
error_rate = float(self._redis.get("health:error_rate_pct") or 0)
cpu_util = float(self._redis.get("health:cpu_util_pct") or 0)
bedrock_throttle_rate = float(
self._redis.get("health:bedrock_throttle_pct") or 0
)
if bedrock_throttle_rate > 20 or error_rate > 10:
return DegradationLevel.EMERGENCY
if cpu_util > 85 or p99_latency > 5000:
return DegradationLevel.OVERLOAD
if cpu_util > 70 or p99_latency > 3000:
return DegradationLevel.HIGH_LOAD
if cpu_util > 55 or p99_latency > 2000:
return DegradationLevel.PRESSURE
return DegradationLevel.NORMAL
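The risks table and the later S2 finding require at least two concurrent health signals before degrading, while `_compute_level` above degrades on any single signal. A sketch of a multi-signal variant follows (the CPU and latency thresholds mirror the ones above; the per-level error and throttle thresholds are assumptions):

```python
def compute_level_multi_signal(redis_client: redis.Redis) -> DegradationLevel:
    """Enter a degradation level only when >= 2 independent signals agree (sketch)."""
    cpu = float(redis_client.get("health:cpu_util_pct") or 0)
    p99 = float(redis_client.get("health:p99_latency_ms") or 0)
    err = float(redis_client.get("health:error_rate_pct") or 0)
    throttle = float(redis_client.get("health:bedrock_throttle_pct") or 0)

    # Per-level thresholds: (cpu %, p99 ms, error %, bedrock throttle %).
    thresholds = [
        (DegradationLevel.EMERGENCY, (95, 8000, 10, 20)),
        (DegradationLevel.OVERLOAD, (85, 5000, 5, 10)),
        (DegradationLevel.HIGH_LOAD, (70, 3000, 3, 5)),
        (DegradationLevel.PRESSURE, (55, 2000, 2, 2)),
    ]
    for level, (cpu_t, p99_t, err_t, thr_t) in thresholds:
        signals = sum([cpu > cpu_t, p99 > p99_t, err > err_t, throttle > thr_t])
        if signals >= 2:
            return level
    return DegradationLevel.NORMAL
```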
Combined Cost Impact Summary
graph LR
subgraph "Before Optimization"
A[Rate Limiting: None<br>All traffic hits LLM]
B[Guest Pipeline: Same as Auth<br>Full LLM for all]
C[Cost Control: Manual monitoring<br>No automatic limits]
end
subgraph "After Optimization"
D[Rate Limiting: Tiered<br>5-10% abuse blocked]
E[Guest Pipeline: Lite<br>Template + Cache priority]
F[Cost Control: Circuit Breaker<br>Auto-degrade at budget]
end
A --> D
B --> E
C --> F
style D fill:#2d8,stroke:#333
style E fill:#2d8,stroke:#333
style F fill:#2d8,stroke:#333
Monitoring and Metrics
| Metric | Target | Alert |
|---|---|---|
| Rate limit rejection rate | 3-8% | > 15% (too aggressive) or < 1% (too lenient) |
| Guest template-only rate | ≥ 50% | < 30% |
| Cost circuit breaker triggers | 0/month | Any trigger |
| Daily Bedrock spend | ≤ $4,200 | > $5,000 |
| Degradation level | 0 (Normal) | ≥ 2 for > 10 min |
| Bot detection block rate | 3-5% | > 10% |
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Rate limits too aggressive for power users | Frustrated customers leave | Prime users get 60 msg/min; monitor feedback from rate-limited users |
| Guest lite pipeline hurts conversion | Fewer guest → authenticated upgrades | A/B test: measure conversion rate with full vs. lite pipeline for guests |
| Cost circuit breaker triggers during legitimate spike | Degraded experience during manga release event | Pre-announce events; temporarily raise daily budget for known spikes |
| Degradation level miscalculated | Degraded mode when system is healthy | Multiple health signals (latency + CPU + error rate + throttle); require 2+ signals to degrade |
Deep Dive: Why This Works on a Manga Chatbot Workload
This story is the outer control loop of the cost-optimization stack. While US-01 through US-07 each reduce the steady-state per-request cost, none of them can prevent a runaway: a botnet, a misconfigured retry loop, a viral event, or a single unbounded conversation can blow through the daily budget no matter how cheap each individual request is. US-08 exists because per-request cost optimization is a necessary-but-not-sufficient cost strategy — you also need a hard ceiling enforced at the demand side. The 20–35% saving target combines three structurally different cost reductions.
Property 1: Chatbot traffic has a long tail of low-value and abusive requests. Public estimates put bot traffic at ~30% of internet traffic, with ~5% explicitly malicious (Imperva annual reports). For a public-facing chatbot, the breakdown is similar: scrapers probing the API, jailbreak attempts, abusive prompts, accidental retry loops from misbehaving clients, and one-message-and-leave guests who never convert. Each of these consumes the full Bedrock + RAG + DDB cost path while producing zero business value. Tiered rate limiting (Prime > Auth > Guest) does not just stop abuse — it implicitly downgrades the cost-per-request multiplier on the lowest-value segment, which is also the largest. The architectural assumption is that the tier-detection signal (Prime status, Auth cookie, IP reputation) is reliable; the failure mode is misclassification (a real Prime user gets Guest treatment) and the mitigation is a feedback signal from CSAT into the rate-limit tuning.
Property 2: The cost circuit breaker is a one-way ratchet against runaways. Cost optimization is fundamentally a probabilistic argument — most requests should be cheap, on average. But averages hide tail risk: at the Sonnet rates used above ($3/M input, $15/M output), a single 30K-token prompt costs roughly $0.09 in input tokens alone — more than ten average requests — and a conversation whose context grows to 100K tokens costs ~$0.30 of input per turn by the end. A scripted attacker sending a thousand such requests per minute can blow through the daily budget in well under an hour. The cost circuit breaker (the state machine and code above) is a last line of defense: when daily Bedrock spend approaches the budget cap, the breaker forces all traffic to Haiku (or template-only) for the rest of the day. This is not "cost optimization" in the steady-state sense — it is risk control that bounds the worst-case daily spend. The architectural assumption is that degraded service for the rest of the day is acceptable in exchange for a hard cost cap; the alternative (uncapped spending) is unacceptable for any production system. The kill-switch granularity matters: per-tier breakers (Prime keeps full service, Guest goes to template-only) preserve revenue while shedding cost.
Property 3: Graceful degradation matches load shedding to user value. Under sustained overload (CPU pressure, Bedrock throttle, RAG latency spikes), the system has three options: (a) accept all requests at degraded latency, (b) reject excess requests with HTTP 429, or (c) serve excess requests with a cheaper, faster pipeline. Option (c) — graceful degradation — preserves the user experience for the maximum number of users at the cost of reduced quality for the marginal users. The 5-level degradation ladder (Normal → Pressure → High Load → Overload → Emergency) implements this with progressively more aggressive cost-shedding: at Pressure, only the reranker is dropped; at Emergency, only template responses are served. The architectural assumption is that the degradation levels are calibrated to real failure cliffs (CPU > 85% is genuinely unhealthy, not just busy); the failure mode is over-reactive degradation (degraded mode during normal busy periods) and the mitigation is requiring 2+ concurrent signals before degrading.
Property 4: Off-peak cheaper paths exploit diurnal cost asymmetry. Late-night JST traffic is dominated by automated agents, scrapers, and overseas users — a different value mix than peak-hour traffic. Routing this traffic to a "lite" pipeline (template + cache priority + Haiku-only) trades quality-on-the-marginal-user for cost. This is invisible during peak (when full pipeline is needed for paying customers) and large during off-peak (when 60% of requests can be served by template+cache without quality regression). The mechanism is just a time-of-day-aware tier ceiling: between 11pm and 8am JST, no request escalates above tier-2 (Haiku).
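A minimal sketch of that time-of-day tier ceiling, reusing the `PipelineConfig` dataclass from the Priority Router example above (the 23:00–08:00 JST window comes from this paragraph; the clamping logic itself is an assumption):

```python
from dataclasses import replace
from datetime import datetime, timedelta, timezone
from typing import Optional

JST = timezone(timedelta(hours=9))

def apply_off_peak_ceiling(
    config: PipelineConfig, now: Optional[datetime] = None
) -> PipelineConfig:
    """Clamp the pipeline to the Haiku tier during the off-peak JST window (sketch)."""
    now_jst = (now or datetime.now(timezone.utc)).astimezone(JST)
    off_peak = now_jst.hour >= 23 or now_jst.hour < 8
    if off_peak and config.model_tier == "sonnet":
        return replace(
            config,
            model_tier="haiku",          # never escalate above Haiku off-peak
            use_template_first=True,     # prefer template/cache paths off-peak
            reason=config.reason + " (off-peak tier ceiling applied)",
        )
    return config
```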
Bottom line: the savings stack non-linearly. Steady-state savings (rate limiter blocking abuse + guest-lite pipeline + off-peak tier ceiling) compound multiplicatively over time — every hour, savings continue. The cost circuit breaker is more like insurance: it has zero benefit until it has enormous benefit. Pricing the breaker by "how often does it trigger" misses the point — its value is the worst-case scenario it prevents, not the average-case savings it produces.
Real-World Validation
Industry Benchmarks & Case Studies
- Stripe engineering: "Scaling your API with rate limiters" — Documents tiered rate limiting with token bucket, Redis pipelines, and per-user + per-global limit composition. The implementation pattern in this story is closely modeled on the Stripe pattern.
- Shopify engineering: "Surviving Black Friday" — Documents traffic-based degradation with per-feature kill switches and graceful-degradation tiers. Validates the 5-level degradation ladder pattern.
- Google SRE Book, Chapter 22 ("Addressing Cascading Failures") and Chapter 21 ("Handling Overload") — Foundational text on load shedding, queueing, and graceful degradation. The "must shed before saturation" principle underpins the requirement that degradation triggers fire before CPU saturation, not at it.
- Hystrix / Polly circuit breaker patterns — Netflix's Hystrix (now in maintenance) and the .NET Polly library codified the circuit breaker pattern; the cost-tracking variant in this story (cost-as-circuit-state instead of error-rate-as-circuit-state) is a direct extension.
- Cisco IOS rate limiting documentation — A canonical networking reference for token bucket traffic policing (the algorithm itself long predates IOS); the per-tier limits in this story (Prime 60/min, Auth 30/min, Guest 10/min) follow the standard token-bucket sizing approach.
- Imperva Bad Bot Report (annual) — Bot traffic accounts for ~30% of internet traffic with ~5% explicitly malicious. The story's "5–10% abuse blocked" target is consistent with this baseline; values much higher would suggest legitimate users being blocked, lower would suggest insufficient enforcement.
- Cloudflare engineering: "How we built rate limiting capable of scaling to millions of domains" — Distributed rate limiting architecture and Redis pipeline patterns. Validates the choice of Redis pipeline (atomic INCR + TTL) for the per-tier counter.
- Internal cross-reference: `POC-to-Production-War-Story/02-seven-production-catastrophes.md` — The "cost explosion" catastrophe was specifically a runaway that this circuit breaker would have caught; its failure was the absence of a hard ceiling. This story is the documented fix.
- Internal cross-reference: `Optimization-Tradeoffs-User-Stories/` — Covers the broader trade-off between cost-control aggressiveness and user-experience preservation; this story is the cost-floor operating point.
Math Validation
- Token bucket math: at Prime tier 60/min, a single user could consume up to 86,400 requests/day (60 × 1440 min) if only the per-minute cap applied; the 500/hour cap bounds this further to 12,000 requests/day. At a Bedrock peak cost of ~$0.012/req on Sonnet, a maxed-out Prime user costs ~$1,037/day under the per-minute cap alone (~$144/day with the hourly cap). With ~10K Prime users, the theoretical max daily cost approaches $10M. Rate limits + the circuit breaker are the only mechanisms preventing this theoretical max from becoming actual (the arithmetic is double-checked in the sketch below).
- Rate limit blocking 5% × 1M req/day × $0.008 average = $400/day = $12K/month saved, just from abuse blocking. ✅
- Guest lite pipeline: ~30% of traffic is guest, of which ~70% can be served by template+cache+Haiku only. Cost per Guest req drops from ~$0.008 (full Sonnet) to ~$0.0005 (Haiku) → 30% × 70% × ($0.008 − $0.0005) × 1M req/day = $1,575/day = ~$47K/month saved. Flag: the story claims "$15K from guest lite" — recheck the conversion-rate impact assumption (some guests upgrade to Auth and pay full pipeline cost, reducing the saving).
- Cost circuit breaker: zero recurring savings; insurance value capped at "daily budget cap × 30 days = monthly worst-case avoided." If daily budget is $5K, the breaker prevents up to $150K/month in runaway scenarios.
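A quick arithmetic check of the figures above (the per-request dollar rates and traffic volumes are the story's own assumptions):

```python
# Prime-tier theoretical maximum.
per_minute_only = 60 * 1440                     # 86,400 req/day if only the 60/min cap applied
print(per_minute_only * 0.012)                  # ~$1,036.8/day per maxed-out Prime user
print(500 * 24 * 0.012)                         # ~$144/day once the 500/hour cap is applied

# Abuse blocking: 5% of 1M req/day at $0.008 average cost per request.
print(0.05 * 1_000_000 * 0.008 * 30)            # ~$12,000/month

# Guest lite pipeline: 30% guest share, 70% of those servable without Sonnet.
print(0.30 * 0.70 * (0.008 - 0.0005) * 1_000_000 * 30)   # ~$47,250/month
```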
Conservative vs Aggressive Savings Bounds
| Bound | Source | Total monthly savings |
|---|---|---|
| Conservative | Rate limit + circuit breaker only | ~10% (~$30K/month) |
| Aggressive | Rate limit + guest-lite + off-peak ceiling + circuit breaker + degradation | ~35% (~$110K/month) |
| Story's projected savings | 20–35% | Aligns with the aggressive bound; insurance value of breaker is uncounted. |
Cross-Story Interactions & Conflicts
This story is the integrating cost-control layer for all other stories. Most edges are authoritative on this side.
- US-04 (Compute) — Authoritative side: this story for the degradation–autoscaling contract. Conflict mode: auto-scaling adds capacity in response to load; degradation sheds load in response to capacity pressure. If both fire on the same trigger, you get oscillation (degradation reduces load → CPU drops → auto-scaler scales in → next traffic burst hits with no headroom → degradation fires again). Resolution: when `degradation_level >= 2`, this story emits a `suspend_scale_in=true` signal that US-04's auto-scaler honors. Scale-out is still allowed (more capacity is always safe); only scale-in is suspended.
- US-01 (LLM Tokens) — Authoritative side: this story for the `model_tier_floor` config. Conflict mode: US-01's complexity classifier may route a query to Sonnet while this story's breaker is forcing Haiku-only. Resolution: the `model_tier_floor` is a Redis-backed shared config (10s TTL on the client; see the sketch after this list). When the breaker trips, `model_tier_floor=haiku` is set; US-01's tier selector reads this and suppresses Sonnet routing regardless of complexity. Shared kill-switch path: SSM Parameter Store mirror for emergency manual override.
- US-06 (RAG) — Authoritative side: this story for the `degradation_level` signal. Conflict mode: under DEGRADED state, RAG bypass should be more aggressive (bypass on lower-confidence intents). Resolution: US-06's bypass gate reads `degradation_level` and applies a threshold matrix: at level 0, bypass on intent-match only; at level 2+, bypass on any non-RAG-shaped intent regardless of confidence; at level 4 (Emergency), bypass everything — RAG is fully off.
- US-07 (Analytics Pipeline) — Authoritative side: US-07. The cost circuit breaker reads cumulative daily Bedrock spend from the cost-tracking event stream. Conflict mode: if event-batching latency exceeds 5 minutes, the breaker decides on stale data. Resolution: cost-tracking events use a dedicated stream with 5-event / 1-second batching, not the default 50-event / 5-second batching. Event lag SLO: ≤ 5 minutes P95.
- US-02 (Intent Classifier) — Authoritative side: US-02 owns the intent label. Conflict mode: the request prioritizer routes by intent + tier. If the classifier is unavailable (cold start, scale-from-zero failure), prioritization defaults to MEDIUM — but the system may actually be overloaded. Resolution: this story emits a `classifier_unavailable=true` signal when the intent-unavailable rate exceeds 1% over 1 minute; under this signal, all traffic without explicit Auth/Prime tier is treated as Guest tier (rate-limited harder).
- US-03 (Caching) — Indirect interaction. Rate limiter state lives on the Redis tier from US-03. Conflict mode: during ElastiCache failover, rate limiter state is unavailable for ~30–90 seconds — a potential burst pass-through. Resolution: local in-process token-bucket fallback during Redis unavailability; counts merged on Redis recovery.
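A sketch of the `model_tier_floor` read path described in the US-01 bullet above (the key name, default, and tier ordering are assumptions; despite the name, the value acts as a cap on the most expensive tier a request may use):

```python
import time
import redis

class ModelTierFloor:
    """Redis-backed shared config with a 10-second client-side cache (sketch)."""

    _ORDER = {"none": 0, "haiku": 1, "sonnet": 2}  # cheapest to most expensive

    def __init__(self, redis_client: redis.Redis, ttl_seconds: float = 10.0):
        self._redis = redis_client
        self._ttl = ttl_seconds
        self._cached_value = "sonnet"   # absent key => no restriction
        self._cached_at = 0.0

    def get(self) -> str:
        # Serve from the local cache to keep the per-request hot path cheap.
        if time.time() - self._cached_at < self._ttl:
            return self._cached_value
        raw = self._redis.get("config:model_tier_floor")
        value = raw.decode() if isinstance(raw, bytes) else raw
        self._cached_value = value or "sonnet"
        self._cached_at = time.time()
        return self._cached_value

    def clamp(self, requested_tier: str) -> str:
        """Return the requested tier unless the breaker has forced a cheaper one."""
        floor = self.get()
        if self._ORDER.get(requested_tier, 0) > self._ORDER.get(floor, 2):
            return floor
        return requested_tier

# When the breaker trips: redis_client.set("config:model_tier_floor", "haiku")
# US-01's tier selector then calls clamp("sonnet") -> "haiku".
```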
Rollback & Experimentation
Shadow-Mode Plan
- Rate limiter: deploy in observe mode for 2 weeks — log "would have rate-limited" decisions but allow all requests through. Compare projected block rates against bot-traffic estimates. Tune per-tier limits based on observed user behaviors before enforcing.
- Cost circuit breaker: deploy in alarm-only mode for 4 weeks — when projected breaker-trip threshold is crossed, fire alerts but do not actually degrade. Measure how often manual intervention would have been needed.
- Degradation controller: deploy with the `degradation_active=false` flag for 2 weeks; all signals computed but no enforcement. Validate that degradation triggers correctly correspond to real overload events.
- Guest-lite pipeline: A/B test against full pipeline on 10% of guest traffic for 4 weeks; measure conversion rate (guest → authenticated upgrade) for both arms.
Canary Thresholds
- Rate limiter: start at 2× the planned per-tier limit (effectively only blocking egregious abusers); halve to planned limit over 4 weeks.
- Cost breaker: start with daily budget at 150% of planned cap (effectively only catches runaway scenarios); ramp down to planned cap over 4 weeks.
- Abort criteria (any one trips): false-positive rate-limit complaints from authenticated users > expected baseline + 50%, conversion rate drop on guest-lite arm > 10%, circuit breaker triggers during normal traffic > 0.
Kill Switches
This story has the most kill switches because it has the most safety-critical control loops:
- rate_limit_enabled — disables all rate limiting.
- cost_circuit_breaker_enabled — disables the cost cap; cost can run unbounded.
- degradation_controller_enabled — disables all graceful-degradation behavior.
- guest_lite_pipeline_enabled — guests get full pipeline.
- off_peak_tier_ceiling_enabled — disables time-of-day cost shaping.
All flags read from SSM Parameter Store with 30-second client cache; rollback < 2 minutes per flag.
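A sketch of that flag read path (boto3 SSM `get_parameter` with a 30-second in-process cache; the `/cost-control/` prefix is borrowed from the S1 findings below, and the fail-safe defaults are assumptions):

```python
import time
import boto3

class KillSwitchFlags:
    """SSM-backed boolean flags with a 30-second client-side cache (sketch)."""

    def __init__(self, prefix: str = "/cost-control/", cache_ttl_seconds: float = 30.0):
        self._ssm = boto3.client("ssm")
        self._prefix = prefix
        self._ttl = cache_ttl_seconds
        self._cache: dict = {}  # flag name -> (fetched_at, value)

    def is_enabled(self, flag_name: str, default: bool = True) -> bool:
        cached = self._cache.get(flag_name)
        if cached and time.time() - cached[0] < self._ttl:
            return cached[1]
        try:
            resp = self._ssm.get_parameter(Name=self._prefix + flag_name)
            value = resp["Parameter"]["Value"].strip().lower() == "true"
        except Exception:
            value = default  # on SSM failure, fall back to the flag's safe default
        self._cache[flag_name] = (time.time(), value)
        return value

# Example: skip rate limiting only if the kill switch was deliberately flipped off.
# if not flags.is_enabled("rate_limit_enabled"): ...
```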
Quality Regression Criteria (story-specific)
- Rate-limit false positive rate (complaints from authenticated users): ≤ baseline + 5%.
- Cost circuit breaker triggers during normal traffic (ground-truth from post-incident review): 0/quarter.
- Conversion rate impact of guest-lite pipeline: ≤ 5% reduction (above this, narrow guest-lite to template-only-when-eligible).
- Degradation controller miscalculation rate (degradation triggered when post-hoc analysis shows system was healthy): ≤ 1 event/quarter.
Multi-Reviewer Validation Findings & Resolutions
The cross-reviewer pass identified the following story-specific findings. This story carries the highest density of S1 issues because it is the outer cost-control loop — failures here have the largest blast radius. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.
S1 (must-fix before production)
Rate-limiter tier signal is spoofable. Tier (Prime / Auth / Guest) is read from request context. If derived from a header, cookie, or unverified JWT claim, an attacker can set X-User-Tier: prime and bypass per-tier limits, achieving 6× cost escalation as a guest. Resolution: tier MUST be derived from user_id lookup in an IAM-protected DynamoDB or Aurora table on every request — never from a request header. JWT-based tier requires HS256/RS256 signature verification with key rotation. Server-side immutable source only. Add tier_signal_origin log field; alarm on any non-DDB origin.
Cost ledger in unauthenticated Redis can be tampered with. The `cost:daily:{today}` counter lives in Redis with no write protection. Compromised service code (or a Redis exposure) can `DECRBY` the counter to keep the breaker from tripping; the attacker keeps the reported cost under the cap while incurring real $50K+ spend. Resolution: the authoritative cost ledger is DynamoDB (a `cost_ledger` table, strongly consistent reads, IAM write-restricted to a single cost-recorder role); Redis is a read-through cache only. The breaker reads from DDB whenever it disagrees with Redis; weekly reconciliation alerts on divergence > 5%. CloudTrail on every write by the cost-recorder role.
Kill-switch privilege escalation. cost_circuit_breaker_enabled and model_tier_floor flags in SSM with default IAM allowing any service role to write. Compromised low-privilege task can flip the flag and disable cost protection. Resolution: SSM Parameter Store IAM policy restricts PutParameter on /cost-control/* to a single finops-lead role (named human IAM role, MFA required). CloudTrail alarm on every parameter change. CloudFormation StackPolicy prevents drift.
Retry storm amplification. Token bucket rejects with 429; client retries with deterministic backoff create synchronized retry waves at the same second boundary, re-rejecting in cascading waves. Resolution: every 429 response carries Retry-After header with server-computed jittered value (1 + uniform_random(0, 10) seconds), spreading retries across a window. Document this in the public API spec so client SDKs respect it.
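A sketch of the jittered Retry-After computation (the 1 + uniform(0, 10) second window is taken from the resolution above):

```python
import random

def jittered_retry_after(base_seconds: int = 1, jitter_max_seconds: int = 10) -> int:
    """Server-computed Retry-After value that spreads synchronized client retries."""
    return base_seconds + random.randint(0, jitter_max_seconds)

# Attach to every 429: headers["Retry-After"] = str(jittered_retry_after())
```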
Control-plane / data-plane mixing in CostCircuitBreaker. `check_state()` runs on every request (data plane), and `record_llm_call()` writes the Redis ledger inline on the same request path (control plane). Per-request latency is now in the cost-write critical path. Resolution: separate concerns:
- Data plane (per-request, fast): read `current_state` from Redis (with DDB fallback), allow/deny.
- Control plane (async, eventual): Bedrock-call-completed events flow through Kinesis (US-07) to a Lambda that updates the DDB cost ledger and recomputes `current_state` on a 60-second tick. Per-request decisions never write the ledger.
This bounds decision lag to under 60 seconds while making the data-plane gate read-only and fast.
Cost circuit breaker as DoS surface. An attacker who knows the breaker trips at $5K daily can intentionally drive spend to $5K, forcing all users (including legitimate Prime) into degraded mode for 24 hours. Resolution: per-tier breakers (Prime stays in NORMAL even when global breaker trips, up to its own per-tier sub-budget); rate-limit suspicious sources hard before they can drive global spend; alarm on "single source consumed > 10% of daily budget" as DoS indicator.
S2 (fix before scale-up)
degradation_level precedence with US-01/US-04/US-06 must be enforced through the central evaluator. Each story currently reads the signal independently; flag-cache divergence (30s SSM cache) can cause inconsistent state across stories. Resolution: mandatory feature-flag evaluator module per README precedence rules; direct SSM reads forbidden in story code.
Global rate-limit key collision at the minute boundary. `rate:global:min:{now // 60}` is a fixed-window counter that resets at each minute boundary, allowing brief burst spikes (up to roughly 2× the limit across the boundary). Resolution: a sliding-window counter (e.g., 10-second buckets aggregated into a 60-second sum) instead of a fixed window; or a randomized key offset per second.
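A sketch of the sliding-window alternative for the global counter (10-second buckets summed over the last minute; the key names follow the pattern above but are assumptions):

```python
import time
import redis

def global_limit_allows(
    r: redis.Redis, limit_per_minute: int, bucket_seconds: int = 10
) -> bool:
    """Sum the current 10-second bucket plus the previous five (sketch)."""
    current_bucket = int(time.time()) // bucket_seconds
    pipe = r.pipeline()
    pipe.incr(f"rate:global:{current_bucket}")
    pipe.expire(f"rate:global:{current_bucket}", bucket_seconds * 7)
    for i in range(1, 6):
        pipe.get(f"rate:global:{current_bucket - i}")   # previous five buckets
    results = pipe.execute()
    window_total = int(results[0]) + sum(int(v) for v in results[2:] if v)
    return window_total <= limit_per_minute
```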
Bedrock throttle as side-channel signal. Degradation reads health:bedrock_throttle_pct. Attacker can artificially trigger Bedrock throttling (high-fanout calls) to force degradation. Resolution: require ≥ 2 concurrent signals (throttle + sustained CPU + sustained error rate) before degrading; do not degrade on single-signal evidence.
Cost circuit breaker stuck-in-DEGRADED scenario. If the steady-state daily run rate already exceeds the budget, the breaker re-trips shortly after every midnight UTC reset and the system effectively lives in DEGRADED. Resolution: an alarm on `cost_state == DEGRADED` for > 4 hours triggers manual budget-review escalation; DEGRADED state has a "do you really want to extend" check at every shift handoff.
Rate-limit logs contain IPs (PII under GDPR). Resolution: hash IPs (SHA-256(IP + monthly-rotated salt)) before logging; log retention ≤ 30 days.
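A sketch of that pseudonymization step (the salt storage and rotation mechanism are out of scope here):

```python
import hashlib

def hash_ip(ip: str, monthly_salt: str) -> str:
    """SHA-256(IP + rotated salt) so rate-limit logs carry no raw IPs."""
    return hashlib.sha256((ip + monthly_salt).encode("utf-8")).hexdigest()

# Example: log_record["client_ip_hash"] = hash_ip("203.0.113.7", current_month_salt)
```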
Tier auth fallback when DDB unavailable. If the DDB tier-lookup fails, do not fall back to "Prime" by default. Resolution: fallback to Guest tier on DDB lookup failure; this is fail-secure (more rate-limiting, never less).
S3 (acknowledged / future work)
- Per-tier sub-budgets (each tier has its own daily spend cap, breaker-tripped independently).
- Anomaly detection on `cost:daily` velocity (alert when the 1-hour spend rate exceeds 1.5× the rolling 7-day same-hour average).
- Multi-region active-active for the rate limiter — out of scope.
- Token-cost validation: monthly compare estimated cost vs Bedrock billing; alarm on > 5% divergence.
Runbook: Cost Circuit Breaker Trips at 3am JST
Symptoms: PagerDuty alert "cost_state == WARNING" or "DEGRADED"; daily Bedrock spend > 80% (WARNING) or > 100% (DEGRADED) of cap.
Triage (in order):
- Confirm the trip is real, not stale data: query the DDB `cost_ledger` table directly (skip the Redis cache) and verify the spend value; if Redis ≠ DDB, suspect tampering and escalate to security.
- Identify the source: query US-07 cost events for the last 4 hours, grouped by `tier_used` + `customer_id_hash` + `model_id`. Look for: (a) a single customer driving > 10% of spend (possible compromised account or runaway client retry); (b) `tier_used` skew (has the Sonnet routing rate doubled?); (c) an intent-distribution shift (a new pattern indicating bot traffic).
- If single-customer runaway: revoke or rate-limit that customer specifically; let the rest of the system continue normally.
- If broad spend increase: keep DEGRADED state for the rest of the day; the breaker is doing its job.
- If false trip (post-hoc analysis shows healthy traffic): tune the daily budget upward at the next FinOps review; do not flip `cost_circuit_breaker_enabled=false` to mask the issue.
- Page the FinOps lead if the trip persists > 4 hours.
Escalation: if the runaway is global and the breaker trip is insufficient (cost still climbing), manually flip `bedrock_invocation_enabled=false` (per the US-01 runbook) — all chat traffic falls back to degraded template responses; LLM cost goes to zero.