
US-08: Traffic-Based Cost Optimization

User Story

As a site reliability engineer, I want to implement intelligent traffic management with rate limiting, request prioritization, and graceful degradation, so that total infrastructure costs decrease by 20-35% during normal operations and the system avoids runaway costs during traffic spikes.

Acceptance Criteria

  • Rate limiter enforces per-user and global limits to prevent LLM cost spikes.
  • Request prioritization ensures authenticated users get higher throughput than guests.
  • Graceful degradation reduces costs by bypassing expensive components when they are overloaded.
  • Cost circuit breaker halts LLM calls when daily spend exceeds budget.
  • Off-peak traffic uses cheaper execution paths automatically.
  • Total infrastructure costs decrease by 20-35% through traffic shaping.

High-Level Design

Cost Problem

Without traffic controls, cost scales linearly with traffic — and super-linearly during spikes (when provisioned capacity overflows to more expensive on-demand resources). A single abusive user could generate thousands of LLM calls.

Uncontrolled cost risks:

  • Bot/abuse traffic: 5-15% of messages may be low-value or abusive.
  • Guest discovery browsing: generates LLM calls but rarely converts.
  • Traffic spikes: Black Friday and manga release events can drive 3-5x normal traffic.

Traffic Management Architecture

graph TD
    A[Incoming Request] --> B[CloudFront<br>WAF + Bot Detection]
    B --> C[Rate Limiter<br>Token Bucket]
    C --> D{Within<br>Rate Limit?}
    D -->|Yes| E[Request Prioritizer]
    D -->|No| F[429 Too Many Requests<br>Zero downstream cost]

    E --> G{Priority Level?}
    G -->|High: Authenticated + Active Cart| H[Full Pipeline<br>LLM + RAG + All Services]
    G -->|Medium: Authenticated| I[Standard Pipeline<br>Template-first, then LLM]
    G -->|Low: Guest| J[Lite Pipeline<br>Template + Cache only]

    K[Cost Circuit Breaker] --> L{Daily Spend<br>> Budget?}
    L -->|Yes| M[Degrade to<br>Template-Only Mode]
    L -->|No| N[Normal Operation]

    style F fill:#2d8,stroke:#333
    style J fill:#2d8,stroke:#333
    style M fill:#fd2,stroke:#333

Savings Breakdown

| Technique | Reduction | Monthly Savings |
| --- | --- | --- |
| Rate limiting (blocks 5-10% abuse traffic) | 8% of total LLM calls | ~$25,000 |
| Guest lite pipeline (template + cache only) | 15% of LLM calls avoided | ~$15,000 |
| Request prioritization (off-peak degradation) | 10% compute savings | ~$300 |
| Cost circuit breaker (prevents runaway) | Prevents $50K+ overrun/month | Risk avoidance |
| Bot detection (WAF) | 3-5% request rejection | ~$5,000 |
| Total | | ~$45,300/month |
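As a quick sanity check (a sketch, not production code), the recurring rows in the table sum to the stated total; the circuit-breaker row is risk avoidance and carries no recurring dollar figure:

```python
# Sanity check for the savings table: recurring rows should sum to the total.
monthly_savings_usd = {
    "rate_limiting": 25_000,
    "guest_lite_pipeline": 15_000,
    "request_prioritization": 300,
    "bot_detection": 5_000,
}
total = sum(monthly_savings_usd.values())
print(total)  # 45300, matching the ~$45,300/month total
```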

Low-Level Design

1. Tiered Rate Limiter

Different rate limits based on user tier and request type.

graph TD
    subgraph "Rate Limit Tiers"
        A[Prime Authenticated<br>60 msg/min, 500/hour]
        B[Standard Authenticated<br>30 msg/min, 200/hour]
        C[Guest<br>10 msg/min, 50/hour]
        D[Global<br>50,000 msg/min total]
    end

    subgraph "Token Bucket Implementation"
        E[Request Arrives] --> F[Check User Tier]
        F --> G[Redis MULTI:<br>Check + Decrement Token]
        G --> H{Tokens > 0?}
        H -->|Yes| I[Allow Request]
        H -->|No| J[429 + Retry-After Header]
    end

Code Example: Tiered Rate Limiter

import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import redis


class UserTier(Enum):
    PRIME = "prime"
    AUTHENTICATED = "authenticated"
    GUEST = "guest"


@dataclass
class RateLimitConfig:
    requests_per_minute: int
    requests_per_hour: int
    burst_allowance: int


TIER_LIMITS = {
    UserTier.PRIME: RateLimitConfig(
        requests_per_minute=60,
        requests_per_hour=500,
        burst_allowance=10,
    ),
    UserTier.AUTHENTICATED: RateLimitConfig(
        requests_per_minute=30,
        requests_per_hour=200,
        burst_allowance=5,
    ),
    UserTier.GUEST: RateLimitConfig(
        requests_per_minute=10,
        requests_per_hour=50,
        burst_allowance=2,
    ),
}

GLOBAL_LIMIT_PER_MINUTE = 50_000


@dataclass
class RateLimitResult:
    allowed: bool
    remaining: int
    retry_after_seconds: Optional[int]
    tier: UserTier


class TieredRateLimiter:
    """Token-bucket rate limiter with per-tier limits stored in Redis."""

    def __init__(self, redis_client: redis.Redis):
        self._redis = redis_client

    def check_and_consume(
        self, user_id: str, tier: UserTier
    ) -> RateLimitResult:
        config = TIER_LIMITS[tier]
        now = int(time.time())
        minute_key = f"rate:{user_id}:min:{now // 60}"
        hour_key = f"rate:{user_id}:hr:{now // 3600}"
        global_key = f"rate:global:min:{now // 60}"

        pipe = self._redis.pipeline()

        # Check minute limit
        pipe.incr(minute_key)
        pipe.expire(minute_key, 60)

        # Check hour limit
        pipe.incr(hour_key)
        pipe.expire(hour_key, 3600)

        # Check global limit
        pipe.incr(global_key)
        pipe.expire(global_key, 60)

        results = pipe.execute()
        # Results interleave INCR and EXPIRE replies; indices 0, 2, 4 are
        # the minute, hour, and global INCR counts respectively.
        minute_count = results[0]
        hour_count = results[2]
        global_count = results[4]

        # Check per-user minute limit
        if minute_count > config.requests_per_minute + config.burst_allowance:
            return RateLimitResult(
                allowed=False,
                remaining=0,
                retry_after_seconds=60 - (now % 60),
                tier=tier,
            )

        # Check per-user hour limit
        if hour_count > config.requests_per_hour:
            return RateLimitResult(
                allowed=False,
                remaining=0,
                retry_after_seconds=3600 - (now % 3600),
                tier=tier,
            )

        # Check global limit
        if global_count > GLOBAL_LIMIT_PER_MINUTE:
            return RateLimitResult(
                allowed=False,
                remaining=0,
                retry_after_seconds=60 - (now % 60),
                tier=tier,
            )

        remaining = config.requests_per_minute - minute_count
        return RateLimitResult(
            allowed=True,
            remaining=max(0, remaining),
            retry_after_seconds=None,
            tier=tier,
        )

2. Request Priority Router

Route requests through different execution paths based on user value.

graph TD
    A[Request + User Context] --> B{User Tier<br>+ Cart Status}
    B -->|Authenticated + Cart > 0<br>HIGH priority| C[Full Pipeline<br>RAG + LLM + All Services]
    B -->|Authenticated + No Cart<br>MEDIUM priority| D[Standard Pipeline<br>Template-first → LLM fallback]
    B -->|Guest<br>LOW priority| E{Intent Type?}
    E -->|faq, product_discovery| F[Lite Pipeline<br>Template + Cache only<br>No LLM]
    E -->|product_question,<br>recommendation| G[Standard Pipeline<br>with Haiku model]

    style C fill:#f66,stroke:#333
    style D fill:#fd2,stroke:#333
    style F fill:#2d8,stroke:#333

Code Example: Priority Router

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Priority(Enum):
    HIGH = "high"        # Full pipeline, Sonnet
    MEDIUM = "medium"    # Template-first, Sonnet fallback
    LOW = "low"          # Template + cache, Haiku fallback


@dataclass
class PipelineConfig:
    priority: Priority
    use_rag: bool
    use_llm: bool
    model_tier: str           # "sonnet" or "haiku"
    use_template_first: bool
    use_cache_only: bool
    reason: str


class RequestPriorityRouter:
    """Determine execution pipeline based on user tier and context."""

    def route(
        self,
        user_tier: str,
        is_authenticated: bool,
        cart_size: int,
        intent: str,
        is_peak_hour: bool,
    ) -> PipelineConfig:
        # High priority: authenticated user with active cart
        if is_authenticated and cart_size > 0:
            return PipelineConfig(
                priority=Priority.HIGH,
                use_rag=True,
                use_llm=True,
                model_tier="sonnet",
                use_template_first=False,
                use_cache_only=False,
                reason="Authenticated user with active cart — full pipeline",
            )

        # Medium priority: authenticated user, no cart
        if is_authenticated:
            return PipelineConfig(
                priority=Priority.MEDIUM,
                use_rag=intent in ("faq", "product_question", "recommendation"),
                use_llm=True,
                model_tier="sonnet" if not is_peak_hour else "haiku",
                use_template_first=True,
                use_cache_only=False,
                reason="Authenticated user — template-first with LLM fallback",
            )

        # Low priority: guest user
        if intent in ("faq", "product_discovery", "promotion", "chitchat"):
            return PipelineConfig(
                priority=Priority.LOW,
                use_rag=False,
                use_llm=False,
                model_tier="none",
                use_template_first=True,
                use_cache_only=True,
                reason="Guest user, cacheable intent — template + cache only",
            )

        # Guest with complex intent: limited LLM
        return PipelineConfig(
            priority=Priority.LOW,
            use_rag=intent in ("product_question", "recommendation"),
            use_llm=True,
            model_tier="haiku",
            use_template_first=True,
            use_cache_only=False,
            reason="Guest user, complex intent — lite pipeline with Haiku",
        )

3. Cost Circuit Breaker

Prevents runaway LLM costs by switching to template-only mode when daily spend exceeds budget.

stateDiagram-v2
    [*] --> NormalOperation
    NormalOperation --> WarningState: Daily spend > 80% budget
    WarningState --> DegradedMode: Daily spend > 100% budget
    DegradedMode --> NormalOperation: New day (midnight UTC reset)
    WarningState --> NormalOperation: Spend rate decreases

    state NormalOperation {
        [*] --> FullPipeline
        FullPipeline: All models available
        FullPipeline: Full RAG + LLM
    }

    state WarningState {
        [*] --> ReducedPipeline
        ReducedPipeline: Only Haiku model
        ReducedPipeline: Aggressive caching
        ReducedPipeline: Template-first for all
    }

    state DegradedMode {
        [*] --> TemplateOnly
        TemplateOnly: No LLM calls
        TemplateOnly: Template + cache responses
        TemplateOnly: Escalation to human for complex queries
    }

Code Example: Cost Circuit Breaker

import logging
import time
from dataclasses import dataclass
from enum import Enum

import redis

logger = logging.getLogger(__name__)


class CostState(Enum):
    NORMAL = "normal"
    WARNING = "warning"
    DEGRADED = "degraded"


@dataclass
class CostBudget:
    daily_limit_usd: float
    warning_threshold_pct: float = 0.80
    degraded_threshold_pct: float = 1.00


@dataclass
class CostStatus:
    state: CostState
    daily_spend_usd: float
    budget_remaining_usd: float
    llm_calls_today: int
    actions: list[str]


class CostCircuitBreaker:
    """Monitor daily LLM spend and degrade execution path when budget exceeded."""

    # Approximate costs per call
    SONNET_AVG_COST = 0.008   # $0.008 per call (avg 1200 input + 250 output tokens)
    HAIKU_AVG_COST = 0.0005   # $0.0005 per call

    def __init__(
        self,
        redis_client: redis.Redis,
        budget: CostBudget,
    ):
        self._redis = redis_client
        self._budget = budget

    def check_state(self) -> CostStatus:
        today = self._today_key()
        spend_raw = self._redis.get(f"cost:daily:{today}")
        calls_raw = self._redis.get(f"cost:calls:{today}")

        daily_spend = float(spend_raw) if spend_raw else 0.0
        llm_calls = int(calls_raw) if calls_raw else 0

        remaining = self._budget.daily_limit_usd - daily_spend
        spend_pct = daily_spend / self._budget.daily_limit_usd

        if spend_pct >= self._budget.degraded_threshold_pct:
            state = CostState.DEGRADED
            actions = [
                "BLOCK all LLM calls",
                "SERVE template + cache responses only",
                "ESCALATE complex queries to human agents",
                "ALERT on-call engineer",
            ]
            logger.warning(
                f"Cost circuit OPEN — daily spend ${daily_spend:.2f} "
                f"exceeds budget ${self._budget.daily_limit_usd:.2f}"
            )
        elif spend_pct >= self._budget.warning_threshold_pct:
            state = CostState.WARNING
            actions = [
                "SWITCH all LLM calls to Haiku",
                "ENABLE aggressive caching",
                "FORCE template-first for all intents",
            ]
            logger.info(
                f"Cost circuit WARNING — daily spend ${daily_spend:.2f} "
                f"at {spend_pct*100:.0f}% of budget"
            )
        else:
            state = CostState.NORMAL
            actions = ["Normal operation"]

        return CostStatus(
            state=state,
            daily_spend_usd=daily_spend,
            budget_remaining_usd=max(0, remaining),
            llm_calls_today=llm_calls,
            actions=actions,
        )

    def record_llm_call(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Record an LLM call and its estimated cost."""
        today = self._today_key()

        # Calculate cost based on model
        if "sonnet" in model.lower():
            cost = (input_tokens / 1_000_000 * 3.0) + (output_tokens / 1_000_000 * 15.0)
        elif "haiku" in model.lower():
            cost = (input_tokens / 1_000_000 * 0.25) + (output_tokens / 1_000_000 * 1.25)
        else:
            cost = 0.0

        pipe = self._redis.pipeline()
        pipe.incrbyfloat(f"cost:daily:{today}", cost)
        pipe.incr(f"cost:calls:{today}")
        pipe.expire(f"cost:daily:{today}", 86400)
        pipe.expire(f"cost:calls:{today}", 86400)
        pipe.execute()

    def _today_key(self) -> str:
        return time.strftime("%Y-%m-%d", time.gmtime())
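The per-call pricing embedded in `record_llm_call` can be factored into a standalone helper for cross-checking the `SONNET_AVG_COST` constant. This is an illustrative sketch using the same per-million-token rates as the code above ($3/$15 for Sonnet, $0.25/$1.25 for Haiku); `estimate_call_cost` is not part of the story's API:

```python
def estimate_call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one call, using the per-million-token list
    prices assumed in record_llm_call above."""
    if "sonnet" in model.lower():
        in_rate, out_rate = 3.0, 15.0
    elif "haiku" in model.lower():
        in_rate, out_rate = 0.25, 1.25
    else:
        return 0.0
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate


# The "average" Sonnet call behind SONNET_AVG_COST (1200 input / 250 output tokens):
print(estimate_call_cost("claude-sonnet", 1200, 250))
```

This comes out near $0.0074 per call, consistent with the ~$0.008 `SONNET_AVG_COST` approximation.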

4. Graceful Degradation Ladder

When downstream services are overloaded, progressively disable expensive features.

graph TD
    subgraph "Level 0: Normal"
        A[Full RAG + LLM + All Services]
    end

    subgraph "Level 1: Pressure"
        B[Disable reranker<br>Use top-3 raw KNN results]
    end

    subgraph "Level 2: High Load"
        C[Switch to Haiku model<br>Reduce RAG to top-1 chunk]
    end

    subgraph "Level 3: Overload"
        D[Template-only for guests<br>LLM only for authenticated]
    end

    subgraph "Level 4: Emergency"
        E[Template-only for all<br>Escalate complex to human]
    end

    A --> B
    B --> C
    C --> D
    D --> E

    style A fill:#2d8,stroke:#333
    style B fill:#9d2,stroke:#333
    style C fill:#fd2,stroke:#333
    style D fill:#f92,stroke:#333
    style E fill:#f66,stroke:#333

Code Example: Degradation Controller

from dataclasses import dataclass
from enum import IntEnum

import redis


class DegradationLevel(IntEnum):
    NORMAL = 0
    PRESSURE = 1
    HIGH_LOAD = 2
    OVERLOAD = 3
    EMERGENCY = 4


@dataclass
class DegradationConfig:
    level: DegradationLevel
    use_reranker: bool
    model_tier: str
    max_rag_chunks: int
    guest_llm_enabled: bool
    auth_llm_enabled: bool


DEGRADATION_CONFIGS = {
    DegradationLevel.NORMAL: DegradationConfig(
        level=DegradationLevel.NORMAL,
        use_reranker=True,
        model_tier="sonnet",
        max_rag_chunks=3,
        guest_llm_enabled=True,
        auth_llm_enabled=True,
    ),
    DegradationLevel.PRESSURE: DegradationConfig(
        level=DegradationLevel.PRESSURE,
        use_reranker=False,       # Skip reranker
        model_tier="sonnet",
        max_rag_chunks=3,
        guest_llm_enabled=True,
        auth_llm_enabled=True,
    ),
    DegradationLevel.HIGH_LOAD: DegradationConfig(
        level=DegradationLevel.HIGH_LOAD,
        use_reranker=False,
        model_tier="haiku",       # Cheaper model
        max_rag_chunks=1,         # Fewer chunks
        guest_llm_enabled=True,
        auth_llm_enabled=True,
    ),
    DegradationLevel.OVERLOAD: DegradationConfig(
        level=DegradationLevel.OVERLOAD,
        use_reranker=False,
        model_tier="haiku",
        max_rag_chunks=1,
        guest_llm_enabled=False,  # Guests get templates only
        auth_llm_enabled=True,
    ),
    DegradationLevel.EMERGENCY: DegradationConfig(
        level=DegradationLevel.EMERGENCY,
        use_reranker=False,
        model_tier="none",
        max_rag_chunks=0,
        guest_llm_enabled=False,
        auth_llm_enabled=False,   # Everyone gets templates
    ),
}


class DegradationController:
    """Determine degradation level based on system health metrics."""

    def __init__(self, redis_client: redis.Redis):
        self._redis = redis_client

    def get_current_config(self) -> DegradationConfig:
        level = self._compute_level()
        return DEGRADATION_CONFIGS[level]

    def _compute_level(self) -> DegradationLevel:
        # Read health signals from Redis (set by monitoring system)
        p99_latency = float(self._redis.get("health:p99_latency_ms") or 0)
        error_rate = float(self._redis.get("health:error_rate_pct") or 0)
        cpu_util = float(self._redis.get("health:cpu_util_pct") or 0)
        bedrock_throttle_rate = float(
            self._redis.get("health:bedrock_throttle_pct") or 0
        )

        if bedrock_throttle_rate > 20 or error_rate > 10:
            return DegradationLevel.EMERGENCY

        if cpu_util > 85 or p99_latency > 5000:
            return DegradationLevel.OVERLOAD

        if cpu_util > 70 or p99_latency > 3000:
            return DegradationLevel.HIGH_LOAD

        if cpu_util > 55 or p99_latency > 2000:
            return DegradationLevel.PRESSURE

        return DegradationLevel.NORMAL
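`_compute_level` above degrades on any single breached signal, while the Risks table calls for requiring 2+ concurrent signals before degrading. A minimal sketch of that quorum check, reusing the illustrative thresholds from `_compute_level` (`quorum_degrade` is a hypothetical helper, not part of the story's API):

```python
def quorum_degrade(p99_latency_ms: float, error_rate_pct: float,
                   cpu_util_pct: float, min_signals: int = 2) -> bool:
    """Degrade only when at least `min_signals` health signals breach their
    thresholds (threshold values reused from _compute_level for illustration)."""
    breaches = [
        p99_latency_ms > 5000,
        error_rate_pct > 10,
        cpu_util_pct > 85,
    ]
    return sum(breaches) >= min_signals


print(quorum_degrade(6000, 2, 60))   # False: latency alone is not enough
print(quorum_degrade(6000, 12, 60))  # True: latency plus error rate
```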

Combined Cost Impact Summary

graph LR
    subgraph "Before Optimization"
        A[Rate Limiting: None<br>All traffic hits LLM]
        B[Guest Pipeline: Same as Auth<br>Full LLM for all]
        C[Cost Control: Manual monitoring<br>No automatic limits]
    end

    subgraph "After Optimization"
        D[Rate Limiting: Tiered<br>5-10% abuse blocked]
        E[Guest Pipeline: Lite<br>Template + Cache priority]
        F[Cost Control: Circuit Breaker<br>Auto-degrade at budget]
    end

    A --> D
    B --> E
    C --> F

    style D fill:#2d8,stroke:#333
    style E fill:#2d8,stroke:#333
    style F fill:#2d8,stroke:#333

Monitoring and Metrics

| Metric | Target | Alert |
| --- | --- | --- |
| Rate limit rejection rate | 3-8% | > 15% (too aggressive) or < 1% (too lenient) |
| Guest template-only rate | ≥ 50% | < 30% |
| Cost circuit breaker triggers | 0/month | Any trigger |
| Daily Bedrock spend | ≤ $4,200 | > $5,000 |
| Degradation level | 0 (Normal) | ≥ 2 for > 10 min |
| Bot detection block rate | 3-5% | > 10% |

Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Rate limits too aggressive for power users | Frustrated customers leave | Prime users get 60 msg/min; monitor feedback from rate-limited users |
| Guest lite pipeline hurts conversion | Fewer guest → authenticated upgrades | A/B test: measure conversion rate with full vs. lite pipeline for guests |
| Cost circuit breaker triggers during legitimate spike | Degraded experience during manga release event | Pre-announce events; temporarily raise daily budget for known spikes |
| Degradation level miscalculated | Degraded mode when system is healthy | Multiple health signals (latency + CPU + error rate + throttle); require 2+ signals to degrade |

Deep Dive: Why This Works on a Manga Chatbot Workload

This story is the outer control loop of the cost-optimization stack. While US-01 through US-07 each reduce the steady-state per-request cost, none of them can prevent a runaway: a botnet, a misconfigured retry loop, a viral event, or a single unbounded conversation can blow through the daily budget no matter how cheap each individual request is. US-08 exists because per-request cost optimization is a necessary-but-not-sufficient cost strategy — you also need a hard ceiling enforced at the demand side. The 20–35% saving target combines three structurally different cost reductions.

Property 1: Chatbot traffic has a long tail of low-value and abusive requests. Public estimates put bot traffic at ~30% of internet traffic, with ~5% explicitly malicious (Imperva annual reports). For a public-facing chatbot, the breakdown is similar: scrapers probing the API, jailbreak attempts, abusive prompts, accidental retry loops from misbehaving clients, and one-message-and-leave guests who never convert. Each of these consumes the full Bedrock + RAG + DDB cost path while producing zero business value. Tiered rate limiting (Prime > Auth > Guest) does not just stop abuse — it implicitly downgrades the cost-per-request multiplier on the lowest-value segment, which is also the largest. The architectural assumption is that the tier-detection signal (Prime status, Auth cookie, IP reputation) is reliable; the failure mode is misclassification (a real Prime user gets Guest treatment) and the mitigation is a feedback signal from CSAT into the rate-limit tuning.

Property 2: The cost circuit breaker is a one-way ratchet against runaways. Cost optimization is fundamentally a probabilistic argument — most requests should be cheap, on average. But averages hide tail-risk: a single 30K-token prompt to Sonnet costs $0.45, the equivalent of 56 average requests. A poisoned conversation that grows to 100K tokens before terminating costs $1.50. A scripted attacker sending 1K such requests per minute can blow the daily budget in hours. The cost circuit breaker is a last line of defense: when daily Bedrock spend approaches the budget cap, the breaker forces all traffic to Haiku (or template-only) for the rest of the day. This is not "cost optimization" in the steady-state sense — it is risk control that bounds the worst-case daily spend. The architectural assumption is that degraded service for the rest of the day is acceptable in exchange for a hard cost cap; the alternative (uncapped spending) is unacceptable for any production system. The kill-switch granularity matters: per-tier breakers (Prime keeps full service, Guest goes to template-only) preserve revenue while shedding cost.

Property 3: Graceful degradation matches load shedding to user value. Under sustained overload (CPU pressure, Bedrock throttle, RAG latency spikes), the system has three options: (a) accept all requests at degraded latency, (b) reject excess requests with HTTP 429, or (c) serve excess requests with a cheaper, faster pipeline. Option (c) — graceful degradation — preserves the user experience for the maximum number of users at the cost of reduced quality on the marginal users. The 5-level degradation ladder (Normal → Pressure → High Load → Overload → Emergency) implements this with progressively more aggressive cost-shedding: at Pressure, only RAG bypass tightens; at Emergency, only template responses are served. The architectural assumption is that the degradation levels are calibrated to real failure cliffs (CPU > 85% is genuinely unhealthy, not just busy); the failure mode is over-reactive degradation (degraded mode during normal busy periods) and the mitigation is requiring 2+ concurrent signals before degrading.

Property 4: Off-peak cheaper paths exploit diurnal cost asymmetry. Late-night JST traffic is dominated by automated agents, scrapers, and overseas users — a different value mix than peak-hour traffic. Routing this traffic to a "lite" pipeline (template + cache priority + Haiku-only) trades quality-on-the-marginal-user for cost. This is invisible during peak (when full pipeline is needed for paying customers) and large during off-peak (when 60% of requests can be served by template+cache without quality regression). The mechanism is just a time-of-day-aware tier ceiling: between 11pm and 8am JST, no request escalates above tier-2 (Haiku).
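The time-of-day tier ceiling described above can be sketched as a small clamp function. The window bounds and tier names follow the text; the function name is illustrative:

```python
from datetime import datetime, timezone, timedelta

JST = timezone(timedelta(hours=9))


def offpeak_tier_ceiling(requested_tier: str, now_jst: datetime) -> str:
    """Clamp the model tier to Haiku during the 23:00-08:00 JST off-peak
    window, so no request escalates to Sonnet at night."""
    off_peak = now_jst.hour >= 23 or now_jst.hour < 8
    if off_peak and requested_tier == "sonnet":
        return "haiku"
    return requested_tier


print(offpeak_tier_ceiling("sonnet", datetime(2024, 1, 1, 2, 0, tzinfo=JST)))   # haiku
print(offpeak_tier_ceiling("sonnet", datetime(2024, 1, 1, 14, 0, tzinfo=JST)))  # sonnet
```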

Bottom line: the savings do not all stack the same way. Steady-state savings (rate limiter blocking abuse + guest-lite pipeline + off-peak tier ceiling) accrue continuously — every hour of operation adds to them. The cost circuit breaker is more like insurance: it has zero benefit until it has enormous benefit. Pricing the breaker by "how often does it trigger" misses the point — its value is the worst-case scenario it prevents, not the average-case savings it produces.


Real-World Validation

Industry Benchmarks & Case Studies

  • Stripe engineering: "Scaling your API with rate limiters" — Documents tiered rate limiting with token bucket, Redis pipelines, and per-user + per-global limit composition. The implementation pattern in this story is closely modeled on the Stripe pattern.
  • Shopify engineering: "Surviving Black Friday" — Documents traffic-based degradation with per-feature kill switches and graceful-degradation tiers. Validates the 5-level degradation ladder pattern.
  • Google SRE Book, Chapter 22 ("Addressing Cascading Failures") and Chapter 21 ("Handling Overload") — Foundational text on load shedding, queueing, and graceful degradation. The "must shed before saturation" principle underpins the requirement that degradation triggers fire before CPU saturation, not at it.
  • Hystrix / Polly circuit breaker patterns — Netflix's Hystrix (now in maintenance) and the .NET Polly library codified the circuit breaker pattern; the cost-tracking variant in this story (cost-as-circuit-state instead of error-rate-as-circuit-state) is a direct extension.
  • Cisco IOS rate limiting documentation — The token bucket algorithm originates here; the per-tier limits in this story (Prime 60/min, Auth 30/min, Guest 10/min) follow the standard token-bucket sizing approach.
  • Imperva Bad Bot Report (annual) — Bot traffic accounts for ~30% of internet traffic with ~5% explicitly malicious. The story's "5–10% abuse blocked" target is consistent with this baseline; values much higher would suggest legitimate users being blocked, lower would suggest insufficient enforcement.
  • Cloudflare engineering: "How we built rate limiting capable of scaling to millions of domains" — Distributed rate limiting architecture and Redis pipeline patterns. Validates the choice of Redis pipeline (atomic INCR + TTL) for the per-tier counter.
  • Internal cross-reference: POC-to-Production-War-Story/02-seven-production-catastrophes.md — The "cost explosion" catastrophe was specifically a runaway that this circuit breaker would have caught; its failure was the absence of a hard ceiling. This story is the documented fix.
  • Internal cross-reference: Optimization-Tradeoffs-User-Stories/ — Covers the broader trade-off between cost-control aggressiveness and user-experience preservation; this story is the cost-floor operating point.

Math Validation

  • Token bucket math: at the Prime tier limit of 60/min, each user can issue up to 86,400 requests/day (60 × 1,440 minutes). At a Bedrock peak cost of ~$0.012/req on Sonnet, a maxed-out Prime user costs ~$1,037/day. With ~10K Prime users, the theoretical max daily cost is ~$10M. The rate limit and circuit breaker are the only mechanisms preventing this theoretical max from becoming actual.
  • Rate limit blocking 5% × 1M req/day × $0.008 average = $400/day = $12K/month saved, just from abuse blocking. ✅
  • Guest lite pipeline: ~30% of traffic is guest, of which ~70% can be served by template+cache+Haiku only. Cost per Guest req drops from ~$0.008 (full Sonnet) to ~$0.0005 (Haiku) → 30% × 70% × ($0.008 − $0.0005) × 1M req/day = $1,575/day = ~$47K/month saved. Flag: the story claims "$15K from guest lite" — recheck the conversion-rate impact assumption (some guests upgrade to Auth and pay full pipeline cost, reducing the saving).
  • Cost circuit breaker: zero recurring savings; insurance value capped at "daily budget cap × 30 days = monthly worst-case avoided." If daily budget is $5K, the breaker prevents up to $150K/month in runaway scenarios.
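The arithmetic in the bullets above can be re-derived in a few lines (figures only; the traffic volumes and per-request prices are the assumptions stated above):

```python
# Re-deriving the figures quoted in the bullets above.
prime_max_daily = 60 * 1440                    # requests/day for one maxed-out Prime user
prime_max_cost = prime_max_daily * 0.012       # USD/day at ~$0.012/req on Sonnet
abuse_saving = 0.05 * 1_000_000 * 0.008        # USD/day from blocking 5% abuse
guest_saving = 0.30 * 0.70 * (0.008 - 0.0005) * 1_000_000  # USD/day from guest lite

print(prime_max_daily)        # 86400
print(round(prime_max_cost))  # 1037
print(round(abuse_saving))    # 400
print(round(guest_saving))    # 1575
```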

Conservative vs Aggressive Savings Bounds

| Bound | Source | Total monthly savings |
| --- | --- | --- |
| Conservative | Rate limit + circuit breaker only | ~10% (~$30K/month) |
| Aggressive | Rate limit + guest-lite + off-peak ceiling + circuit breaker + degradation | ~35% (~$110K/month) |
| Story's projected savings | 20-35% | Aligns with the aggressive bound; insurance value of breaker is uncounted. |

Cross-Story Interactions & Conflicts

This story is the integrating cost-control layer for all other stories. Most edges are authoritative on this side.

  • US-04 (Compute) — Authoritative side: this story for the degradation–autoscaling contract. Conflict mode: auto-scaling adds capacity in response to load; degradation sheds load in response to capacity pressure. If both fire on the same trigger, you get oscillation (degradation reduces load → CPU drops → auto-scaler scales in → next traffic burst hits with no headroom → degradation fires again). Resolution: when degradation_level >= 2, this story emits a suspend_scale_in=true signal that US-04's auto-scaler honors. Scale-out is still allowed (more capacity is always safe); only scale-in is suspended.
  • US-01 (LLM Tokens) — Authoritative side: this story for the model_tier_floor config. Conflict mode: US-01's complexity classifier may route a query to Sonnet while this story's breaker is forcing Haiku-only. Resolution: the model_tier_floor is a Redis-backed shared config (10s TTL on the client). When the breaker trips, model_tier_floor=haiku is set; US-01's tier selector reads this and suppresses Sonnet routing regardless of complexity. Shared kill-switch path: SSM Parameter Store mirror for emergency manual override.
  • US-06 (RAG) — Authoritative side: this story for the degradation_level signal. Conflict mode: under DEGRADED state, RAG bypass should be more aggressive (bypass on lower-confidence intents). Resolution: US-06's bypass gate reads degradation_level and applies a threshold matrix: at level 0, bypass on intent-match only; at level 2+, bypass on any non-RAG-shaped intent regardless of confidence; at level 4 (Emergency), bypass everything — RAG is fully off.
  • US-07 (Analytics Pipeline) — Authoritative side: US-07. The cost circuit breaker reads cumulative daily Bedrock spend from the cost-tracking event stream. Conflict mode: if event-batching latency exceeds 5 minutes, the breaker decides on stale data. Resolution: cost-tracking events use a dedicated stream with 5-event / 1-second batching, not the default 50-event / 5-second batching. Event lag SLO: ≤ 5 minutes P95.
  • US-02 (Intent Classifier) — Authoritative side: US-02 owns the intent label. Conflict mode: the request prioritizer routes by intent + tier. If the classifier is unavailable (cold start, scale-from-zero failure), prioritization defaults to MEDIUM — but the system is actually overloaded. Resolution: this story emits a classifier_unavailable=true signal when intent-unavailable rate > 1% over 1 minute; under this signal, all traffic without explicit Auth/Prime tier is treated as Guest tier (rate-limited harder).
  • US-03 (Caching) — Indirect interaction. Rate limiter state lives on the Redis tier from US-03. Conflict mode: during ElastiCache failover, the rate limiter state is unavailable for ~30–90 seconds — potential burst pass-through. Resolution: local in-process token-bucket fallback during Redis unavailability; counts merged on Redis recovery.
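The local token-bucket fallback mentioned in the US-03 bullet is only named, not specified. A minimal in-process sketch might look like the following (the class and its parameters are illustrative assumptions; the "counts merged on Redis recovery" step is not shown):

```python
import time


class LocalTokenBucket:
    """In-process token-bucket fallback for use while Redis is unavailable.
    Illustrative only; capacity and refill rate would be derived from the
    tier's per-minute limit."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock=time.monotonic):
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._refill)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

During a 30-90 second failover this bounds per-instance burst pass-through to roughly `capacity + elapsed × refill_per_second` requests.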

Rollback & Experimentation

Shadow-Mode Plan

  • Rate limiter: deploy in observe mode for 2 weeks — log "would have rate-limited" decisions but allow all requests through. Compare projected block rates against bot-traffic estimates. Tune per-tier limits based on observed user behaviors before enforcing.
  • Cost circuit breaker: deploy in alarm-only mode for 4 weeks — when projected breaker-trip threshold is crossed, fire alerts but do not actually degrade. Measure how often manual intervention would have been needed.
  • Degradation controller: deploy with degradation_active=false flag for 2 weeks; all signals computed but no enforcement. Validate that degradation triggers correctly correspond to real overload events.
  • Guest-lite pipeline: A/B test against full pipeline on 10% of guest traffic for 4 weeks; measure conversion rate (guest → authenticated upgrade) for both arms.
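The observe-mode behavior described above can be sketched as a thin gate around the limiter decision; `log` is a hypothetical event sink and `allowed` is the raw limiter verdict:

```python
def rate_limit_gate(key: str, allowed: bool, enforce: bool, log) -> bool:
    """Observe-mode wrapper: when enforce is False, log the
    'would have rate-limited' decision but let the request through."""
    if allowed:
        return True
    log({"event": "would_have_rate_limited", "key": key})
    return not enforce  # observe mode passes; enforce mode blocks
```

Running two weeks in observe mode yields exactly the block-rate projections the plan calls for, without any user-facing risk.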

Canary Thresholds

  • Rate limiter: start at 2× the planned per-tier limit (effectively only blocking egregious abusers); halve to the planned limit over 4 weeks.
  • Cost breaker: start with daily budget at 150% of planned cap (effectively only catches runaway scenarios); ramp down to planned cap over 4 weeks.
  • Abort criteria (any one trips): false-positive rate-limit complaints from authenticated users > expected baseline + 50%, conversion rate drop on guest-lite arm > 10%, circuit breaker triggers during normal traffic > 0.

Kill Switches

This story has the most kill switches because it has the most safety-critical control loops:

  • rate_limit_enabled — disables all rate limiting.
  • cost_circuit_breaker_enabled — disables the cost cap; cost can run unbounded.
  • degradation_controller_enabled — disables all graceful-degradation behavior.
  • guest_lite_pipeline_enabled — when off, guests get the full pipeline.
  • off_peak_tier_ceiling_enabled — disables time-of-day cost shaping.

All flags read from SSM Parameter Store with 30-second client cache; rollback < 2 minutes per flag.
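The 30-second client cache can be sketched as a thin TTL wrapper; `fetch` stands in for the real SSM `get_parameter` call and is injected here so the sketch stays offline-testable:

```python
import time

class FlagCache:
    """30-second client-side cache over a parameter fetcher
    (e.g., an SSM get_parameter wrapper). Bounds rollback latency
    to roughly the TTL plus one request."""

    def __init__(self, fetch, ttl: float = 30.0):
        self.fetch = fetch
        self.ttl = ttl
        self._cache = {}  # name -> (value, fetched_at)

    def get(self, name: str):
        hit = self._cache.get(name)
        now = time.monotonic()
        if hit and now - hit[1] < self.ttl:
            return hit[0]
        value = self.fetch(name)
        self._cache[name] = (value, now)
        return value
```

With a 30-second TTL, a flag flip propagates to all clients within ~30 seconds, comfortably inside the stated < 2-minute rollback budget.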

Quality Regression Criteria (story-specific)

  • Rate-limit false positive rate (complaints from authenticated users): ≤ baseline + 5%.
  • Cost circuit breaker triggers during normal traffic (ground-truth from post-incident review): 0/quarter.
  • Conversion rate impact of guest-lite pipeline: ≤ 5% reduction (above this, narrow guest-lite to template-only-when-eligible).
  • Degradation controller miscalculation rate (degradation triggered when post-hoc analysis shows system was healthy): ≤ 1 event/quarter.

Multi-Reviewer Validation Findings & Resolutions

The cross-reviewer pass identified the following story-specific findings. This story carries the highest density of S1 issues because it is the outer cost-control loop — failures here have the largest blast radius. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.

S1 (must-fix before production)

Rate-limiter tier signal is spoofable. Tier (Prime / Auth / Guest) is read from request context. If derived from a header, cookie, or unverified JWT claim, an attacker can set X-User-Tier: prime and bypass per-tier limits, achieving 6× cost escalation as a guest. Resolution: tier MUST be derived from user_id lookup in an IAM-protected DynamoDB or Aurora table on every request — never from a request header. JWT-based tier requires HS256/RS256 signature verification with key rotation. Server-side immutable source only. Add tier_signal_origin log field; alarm on any non-DDB origin.

Cost ledger in unauthenticated Redis can be tampered. cost:daily:{today} in Redis with no write protection. Compromised service code (or Redis exposure) can DECRBY the counter to keep the breaker from tripping. Attacker keeps cost under the cap while incurring real $50K+ spend. Resolution: cost ledger authoritative storage is DynamoDB (cost_ledger table, strongly consistent reads, IAM write-restricted to a single cost-recorder role); Redis is a read-through cache only. Breaker reads from DDB on disagreement-with-Redis; weekly reconciliation alerts on divergence > 5%. CloudTrail on every write to the cost-recorder role.
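The weekly reconciliation check described above reduces to a simple divergence computation, with DynamoDB as the authoritative side (function name is illustrative):

```python
def ledger_divergence_pct(ddb_total: float, redis_total: float) -> float:
    """Report |redis - ddb| as a percentage of the authoritative DDB total.
    Values above the 5% threshold should raise a reconciliation alert."""
    if ddb_total == 0:
        return 0.0 if redis_total == 0 else 100.0
    return abs(redis_total - ddb_total) / ddb_total * 100.0
```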

Kill-switch privilege escalation. cost_circuit_breaker_enabled and model_tier_floor flags in SSM with default IAM allowing any service role to write. Compromised low-privilege task can flip the flag and disable cost protection. Resolution: SSM Parameter Store IAM policy restricts PutParameter on /cost-control/* to a single finops-lead role (named human IAM role, MFA required). CloudTrail alarm on every parameter change. CloudFormation StackPolicy prevents drift.

Retry storm amplification. The token bucket rejects with 429; clients retrying with deterministic backoff synchronize on the same second boundary and are re-rejected in cascading waves. Resolution: every 429 response carries a Retry-After header with a server-computed jittered value (1 + uniform_random(0, 10) seconds), spreading retries across a window. Document this in the public API spec so client SDKs respect it.
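The jittered header computation is a one-liner; the sketch below takes an injectable `rng` so the jitter is testable (the helper name is illustrative):

```python
import random

def retry_after_header(rng=random.uniform) -> dict:
    """Jittered Retry-After for a 429 response: 1 + uniform(0, 10) seconds,
    rounded down to whole seconds for the header value."""
    return {"Retry-After": str(int(1 + rng(0, 10)))}
```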

Control-plane / data-plane mixing in CostCircuitBreaker. check_state() runs on every request (data plane) but also writes to Redis (control plane), putting the cost write on the per-request latency critical path. Resolution: separate the concerns:

  • Data plane (per-request, fast): read current_state from Redis (with DDB fallback), allow/deny.
  • Control plane (async, eventual): Bedrock-call-completed events flow through Kinesis (US-07) to a Lambda that updates the DDB cost ledger and recomputes current_state on a 60-second tick. Per-request decisions never write the ledger.

This bounds per-minute decision lag (< 60s) while making the data-plane gate read-only and fast.
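The read-only data-plane gate can be sketched as below; `redis_get` and `ddb_get` are hypothetical injected lookups standing in for the real clients:

```python
def data_plane_allow(redis_get, ddb_get) -> bool:
    """Read-only per-request gate. Breaker state is recomputed by the
    async control-plane tick; this path never writes the cost ledger."""
    state = redis_get("cost:current_state")
    if state is None:                       # cache miss or Redis outage
        state = ddb_get("cost:current_state") or "NORMAL"
    return state != "DEGRADED"
```

Defaulting to "NORMAL" when both stores are silent is a deliberate availability-over-cost choice for this gate; the per-minute control-plane tick closes the gap quickly.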

Cost circuit breaker as DoS surface. An attacker who knows the breaker trips at $5K daily can intentionally drive spend to $5K, forcing all users (including legitimate Prime) into degraded mode for 24 hours. Resolution: per-tier breakers (Prime stays in NORMAL even when global breaker trips, up to its own per-tier sub-budget); rate-limit suspicious sources hard before they can drive global spend; alarm on "single source consumed > 10% of daily budget" as DoS indicator.

S2 (fix before scale-up)

degradation_level precedence with US-01/US-04/US-06 must be enforced through the central evaluator. Each story currently reads the signal independently; flag-cache divergence (30s SSM cache) can cause inconsistent state across stories. Resolution: mandatory feature-flag evaluator module per README precedence rules; direct SSM reads forbidden in story code.

Global rate-limit key collision at minute boundary. rate:global:min:{now // 60} is a fixed window that resets at each minute boundary; a client can briefly burst up to 2× the limit by straddling the reset. Resolution: use a sliding-window counter (e.g., 10-second buckets aggregated into a 60-second sum) instead of a fixed window, or randomize the window offset per key.
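The 10-second-bucket sliding window can be sketched as below (bucket storage is a plain dict here; in production it would be Redis hash fields with TTLs):

```python
def bucket_key(now: int, bucket_size: int = 10) -> int:
    """Start timestamp of the bucket containing `now`."""
    return (now // bucket_size) * bucket_size

def incr(buckets: dict, now: int) -> None:
    """Record one request in the current 10-second bucket."""
    k = bucket_key(now)
    buckets[k] = buckets.get(k, 0) + 1

def window_sum(buckets: dict, now: int,
               window: int = 60, bucket_size: int = 10) -> int:
    """Sum the buckets covering the trailing 60-second window."""
    oldest = bucket_key(now, bucket_size) - (window - bucket_size)
    return sum(c for k, c in buckets.items() if k >= oldest)
```

Unlike the fixed minute window, the worst-case over-admission here is bounded by one 10-second bucket rather than a full minute's quota.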

Bedrock throttle as side-channel signal. Degradation reads health:bedrock_throttle_pct. Attacker can artificially trigger Bedrock throttling (high-fanout calls) to force degradation. Resolution: require ≥ 2 concurrent signals (throttle + sustained CPU + sustained error rate) before degrading; do not degrade on single-signal evidence.
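The multi-signal requirement reduces to a small corroboration check (signal names below are illustrative):

```python
def should_degrade(signals: dict, min_signals: int = 2) -> bool:
    """Degrade only when at least two independent overload signals
    fire concurrently, so no single signal can be used as a lever."""
    return sum(1 for firing in signals.values() if firing) >= min_signals
```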

Cost circuit breaker stuck-in-DEGRADED scenario. If baseline cost is already above budget at the midnight-UTC reset, the breaker never leaves DEGRADED. Resolution: alarm when cost_state == DEGRADED persists > 4 hours and escalate to a manual budget review; an extended DEGRADED state requires explicit re-confirmation at each shift handoff.

Rate-limit logs contain IPs (PII under GDPR). Resolution: hash IPs (SHA-256(IP + monthly-rotated salt)) before logging; log retention ≤ 30 days.
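The salted-hash scheme can be sketched directly; the salt value shown is a placeholder for the monthly-rotated secret:

```python
import hashlib

def hash_ip(ip: str, monthly_salt: str) -> str:
    """SHA-256 over IP + salt. Rotating the salt monthly means hashes
    cannot be joined across months, limiting re-identification risk."""
    return hashlib.sha256((ip + monthly_salt).encode()).hexdigest()
```

The same client stays correlatable within a month (useful for abuse triage) while the raw IP never reaches the logs.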

Tier auth fallback when DDB unavailable. If the DDB tier-lookup fails, do not fall back to "Prime" by default. Resolution: fallback to Guest tier on DDB lookup failure; this is fail-secure (more rate-limiting, never less).
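The fail-secure fallback is a few lines; `ddb_lookup` is a hypothetical injected lookup standing in for the real table client:

```python
def resolve_tier(user_id: str, ddb_lookup) -> str:
    """Fail-secure tier resolution: any lookup failure or missing record
    maps to GUEST (the most rate-limited tier), never to a higher tier."""
    try:
        tier = ddb_lookup(user_id)
    except Exception:
        return "GUEST"
    return tier or "GUEST"
```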

S3 (acknowledged / future work)

  • Per-tier sub-budgets (each tier has its own daily spend cap, breaker-tripped independently).
  • Anomaly detection on cost:daily velocity (alert when 1-hour spend rate exceeds 1.5× rolling 7-day same-hour average).
  • Multi-region active-active for the rate-limiter — out of scope.
  • Token-cost validation: monthly compare estimated cost vs Bedrock billing; alarm on > 5% divergence.
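The spend-velocity anomaly check from the list above can be sketched as a comparison against the rolling same-hour baseline (function name is illustrative):

```python
def spend_anomaly(current_hour_spend: float,
                  same_hour_history: list,
                  factor: float = 1.5) -> bool:
    """Alert when the current 1-hour spend exceeds `factor` times the
    rolling same-hour average (e.g., the prior 7 days at this hour)."""
    if not same_hour_history:
        return False  # no baseline yet; do not alert
    baseline = sum(same_hour_history) / len(same_hour_history)
    return current_hour_spend > factor * baseline
```

Comparing same-hour-of-day rather than a flat average keeps the alert from firing on ordinary diurnal peaks.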

Runbook: Cost Circuit Breaker Trips at 3am JST

Symptoms: PagerDuty alert "cost_state == WARNING" or "DEGRADED"; daily Bedrock spend > 80% (WARNING) or > 100% (DEGRADED) of cap.

Triage (in order):

  1. Confirm trip is real, not stale data: query DDB cost_ledger directly (skip Redis cache) and verify spend value; if Redis ≠ DDB, suspect tampering and escalate to security.
  2. Identify the source: query US-07 cost events for the last 4 hours, group by tier_used + customer_id_hash + model_id. Look for: (a) a single customer driving > 10% of spend (possible compromised account or runaway client retry); (b) tier_used skew (Sonnet routing rate doubled?); (c) intent-distribution shift (new pattern indicating bot traffic).
  3. If single-customer runaway: revoke or rate-limit that customer specifically; let the rest of the system continue normally.
  4. If broad spend increase: keep DEGRADED state for the rest of the day; the breaker is doing its job.
  5. If false trip (post-hoc analysis shows healthy traffic): tune the daily budget upward at next FinOps review; do not flip cost_circuit_breaker_enabled=false to mask the issue.
  6. Page FinOps lead if the trip persists > 4 hours.

Escalation: if global runaway and trip is insufficient (cost still climbing), manually flip bedrock_invocation_enabled=false (per US-01 runbook) — all chat traffic returns degraded template responses; cost goes to zero.