
Cost Optimization User Stories - MangaAssist Chatbot

Overview

This directory contains detailed user stories for cost optimization across every major service in the MangaAssist chatbot architecture. Each user story includes high-level design, low-level implementation details, Mermaid diagrams, and code examples.

User Stories

| # | User Story | Primary Service | Estimated Savings |
|---|------------|-----------------|-------------------|
| US-01 | LLM Token Cost Optimization | Amazon Bedrock (Claude 3.5 Sonnet) | 40–60% of LLM spend |
| US-02 | Intent Classifier Cost Optimization | SageMaker Endpoint | 50–70% of inference spend |
| US-03 | Caching Strategy for Cost Reduction | ElastiCache Redis | 30–50% of downstream API costs |
| US-04 | Compute Cost Optimization | ECS Fargate + Lambda | 35–55% of compute spend |
| US-05 | DynamoDB Cost Optimization | DynamoDB | 40–60% of storage/throughput costs |
| US-06 | RAG Pipeline Cost Optimization | OpenSearch Serverless + Titan Embeddings | 30–50% of RAG costs |
| US-07 | Analytics Pipeline Cost Optimization | Kinesis + Redshift | 40–60% of analytics spend |
| US-08 | Traffic-Based Cost Optimization | Edge / Rate Limiter / Degraded Modes | 20–35% of total infrastructure |

Cost Distribution (Baseline Estimate)

```mermaid
pie title MangaAssist Monthly Cost Distribution (Before Optimization)
    "Bedrock LLM" : 35
    "ECS Fargate + Lambda" : 20
    "DynamoDB" : 10
    "OpenSearch Serverless" : 10
    "ElastiCache Redis" : 8
    "SageMaker (Intent)" : 7
    "Kinesis + Redshift" : 5
    "CloudFront + ALB" : 3
    "Other" : 2
```

How to Use

  1. Start with US-01 (LLM Token Cost) — it targets the largest cost driver.
  2. Read US-02 and US-03 next — they directly reduce LLM and downstream service calls.
  3. Apply US-04 through US-08 based on your current cost profile.
  4. Each user story is self-contained and can be implemented independently.

Relationship to Architecture

These user stories map directly to the components described in:


Dependency & Sequencing Graph

The 8 stories are not independent. Some are shared infrastructure that other stories depend on; some are outer control loops that wrap the others. Implementing in the wrong order creates either dead optimizations (no cost telemetry to evaluate them) or runaway risk (no circuit breaker to bound them).

```mermaid
graph TB
    US08[US-08 Traffic-Based<br>Outer cost-control loop]
    US07[US-07 Analytics<br>Cost telemetry pipeline]
    US03[US-03 Caching<br>Shared Redis tier]
    US02[US-02 Intent Classifier<br>Intent label provider]
    US01[US-01 LLM Tokens<br>Bedrock optimization]
    US04[US-04 Compute<br>Fargate + Lambda]
    US05[US-05 DynamoDB<br>Session-state lifecycle]
    US06[US-06 RAG<br>OpenSearch + Titan]

    US07 -->|cost events feed| US08
    US08 -->|model_tier_floor| US01
    US08 -->|degradation_level| US06
    US08 -->|suspend_scale_in| US04
    US02 -->|intent label + confidence| US01
    US02 -->|intent label| US06
    US02 -->|intent label| US08
    US03 -->|llmresp: keyspace| US01
    US03 -->|intent:sess: keyspace| US02
    US03 -->|emb: keyspace| US06
    US03 -->|fallback path| US05
    US05 -->|TURN archive| US07

    style US08 fill:#f66,stroke:#333
    style US07 fill:#fd2,stroke:#333
    style US03 fill:#fd2,stroke:#333
    style US02 fill:#fd2,stroke:#333
```
Recommended implementation order:

  1. US-07 (Analytics) first — cost telemetry pipeline must exist before any other story can be evaluated or before US-08 can read spend data.
  2. US-08 (Traffic-Based) second — the cost circuit breaker is the safety net for every aggressive optimization that follows. It must be in production before US-01's tier routing or prompt compression are turned on at scale.
  3. US-03 (Caching) third — shared Redis tier underpins US-01 (response cache), US-02 (session intent cache), and US-06 (embedding cache). Provision Redis with all four keyspaces planned, even if only one is initially populated.
  4. US-02 (Intent Classifier) fourth — provides the intent label that US-01, US-06, and US-08 all key on. Establish the intent-precision floor (≥ 0.92) before downstream stories rely on it.
  5. US-01, US-04, US-05, US-06 in parallel — these are the "leaf" stories that benefit from the foundation above. They can ship independently.
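
The ordering above can be checked mechanically against the dependency graph's edges with Python's standard-library topological sorter. This is a sketch, not project tooling; it treats US-05's TURN-archive edge as data flow rather than a hard prerequisite (US-05 ships late even though its archive feeds US-07).

```python
# Sketch: validate the recommended implementation order against the
# dependency edges in the Mermaid graph (provider -> consumers).
from graphlib import TopologicalSorter

deps = {
    "US-07": ["US-08"],                              # cost events feed the breaker
    "US-08": ["US-01", "US-04", "US-06"],            # control signals to leaf stories
    "US-02": ["US-01", "US-06", "US-08"],            # intent labels
    "US-03": ["US-01", "US-02", "US-05", "US-06"],   # shared Redis keyspaces
}

# TopologicalSorter expects node -> predecessors, so invert the edge map.
preds = {}
for provider, consumers in deps.items():
    preds.setdefault(provider, set())
    for c in consumers:
        preds.setdefault(c, set()).add(provider)

order = list(TopologicalSorter(preds).static_order())
print(order)
```

Any valid topological order places US-07 and US-03 before US-02, US-02 before US-08's consumers are enabled, and the four leaf stories last, matching the numbered list above.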

Why US-08 must be early: US-01's aggressive optimizations (prompt compression, tier routing, semantic cache) all have failure modes that can increase cost (cache poisoning serving expensive long answers, compression breaking and falling back to full prompts, tier classifier misrouting to Sonnet). Without US-08's cost circuit breaker as the backstop, a misconfiguration in US-01 can blow the daily budget before it is detected. The circuit breaker is the safety harness.
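
A minimal sketch of the kind of breaker this paragraph relies on, assuming a soft degradation threshold at 80% of the daily cap and a hard stop at the cap itself. Class and field names are illustrative, not the US-08 implementation.

```python
# Sketch (hypothetical interface): a daily-budget cost circuit breaker.
# Thresholds and names are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class CostCircuitBreaker:
    daily_cap_usd: float              # e.g. the $5K daily Bedrock cap
    warn_fraction: float = 0.8        # start degrading at 80% of cap
    spend_usd: float = 0.0
    degradation_active: bool = False
    tripped: bool = False

    def record(self, cost_usd: float) -> None:
        """Fold one per-request cost event (from the US-07 stream) into today's total."""
        self.spend_usd += cost_usd
        if self.spend_usd >= self.daily_cap_usd:
            self.tripped = True             # hard stop: template-only responses
            self.degradation_active = True
        elif self.spend_usd >= self.warn_fraction * self.daily_cap_usd:
            self.degradation_active = True  # soft stop: floor model tier to Haiku

breaker = CostCircuitBreaker(daily_cap_usd=5000.0)
breaker.record(4100.0)   # crosses the 80% soft threshold, not the cap
print(breaker.degradation_active, breaker.tripped)
```

The point of the sketch is the latency argument: because the breaker folds in per-request cost events as they arrive, a US-01 misconfiguration is bounded within minutes of spend, not at the next billing refresh.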


Owner Mapping

| # | User Story | Suggested Owner Role |
|---|------------|----------------------|
| US-01 | LLM Token Cost Optimization | Platform Engineering Lead (LLM/Bedrock) |
| US-02 | Intent Classifier Cost Optimization | ML Infrastructure Engineer |
| US-03 | Caching Strategy for Cost Reduction | Platform Architect |
| US-04 | Compute Cost Optimization | DevOps / SRE |
| US-05 | DynamoDB Cost Optimization | Backend Engineer |
| US-06 | RAG Pipeline Cost Optimization | ML Platform Engineer |
| US-07 | Analytics Pipeline Cost Optimization | Data Engineer |
| US-08 | Traffic-Based Cost Optimization | SRE / FinOps Lead |

The FinOps Lead (US-08 owner) holds the cross-cutting daily-budget contract; per-story owners hold the per-service KPIs.


Unified KPI Rollup

A FinOps lead should be able to scan this table at a glance and know the headline metric and target for each story. The "Status" column is filled in during quarterly reviews.

| # | Story | Headline Metric | Target | Baseline | Status |
|---|-------|-----------------|--------|----------|--------|
| US-01 | LLM Tokens | Bedrock spend reduction | 40–60% | $315K/mo | Track |
| US-02 | Intent Classifier | Inference cost reduction | 50–70% | $400–600/mo | Track |
| US-03 | Caching | Combined hit rate | ≥ 70% | 0% | Track |
| US-04 | Compute | Fargate spend reduction | 35–55% | $3.5–4.5K/mo | Track |
| US-05 | DynamoDB | DDB spend reduction | 40–60% | $570/mo | Track |
| US-06 | RAG | RAG pipeline spend reduction | 30–50% | $750–900/mo | Track |
| US-07 | Analytics | Analytics spend reduction | 40–60% | $500–700/mo | Track |
| US-08 | Traffic-Based | Daily Bedrock spend cap respected | 100% | $5K daily cap | Track |

For the cross-story interaction matrix (which story coordinates with which on shared signals), see the Cross-Story Interactions & Conflicts section in each individual story file.


Implementation Sequencing Callout

US-08 must be deployed before US-01's aggressive optimizations land in production. The cost circuit breaker, daily budget cap, and per-tier kill switches are the safety net for model-tiering and prompt compression. A misconfiguration in US-01 (e.g., template router falling through to Sonnet on every message) can produce a cost runaway that US-08 caps within minutes; without US-08, the runaway is bounded only by the next manual cost-alarm review (typically hours to a day).

Similarly, US-07 must be deployed before US-08 can function — the cost circuit breaker reads daily Bedrock spend from US-07's event stream. Until US-07 ships, US-08's breaker can only operate on lagging billing data (24–48 hours stale), which is too slow for real-time cost control.

The recommended sequencing — US-07 → US-08 → US-03 → US-02 → others in parallel — reflects this dependency structure.


Per-Story Deep Dives, Real-World Validation, Cross-Story Interactions, and Rollback

Every story file (US-01 through US-08) ends with four deep-dive sections appended after the existing Risks table:

  1. Deep Dive: Why This Works on a Manga Chatbot Workload — architectural intuition specific to manga-chatbot traffic properties.
  2. Real-World Validation — industry benchmarks, named case studies, and math-validation of cost numbers against current AWS pricing.
  3. Cross-Story Interactions & Conflicts — explicit edges between this story and the others, with conflict modes and resolution rules.
  4. Rollback & Experimentation — shadow-mode plan, canary thresholds, kill-switch flag, and quality-regression criteria.

Read all four sections of any one story before implementing it.


Offline Testing & Interview-Loop Prep

For per-scenario offline-testing deep-dives and Amazon-loop grill chains (ML/AI Engineer and MLOps Engineer lenses) covering all 8 stories above, see ../Cost-Optimization-Offline-Testing/. Files 03–07 in that folder apply rigorous offline-test design (counterfactual replay, decision-equivalence, cost-aware golden, stress simulation) to each US story and provide multi-round interview grills with architect-level escalation.


Multi-Reviewer Validation & Cross-Cutting Hardening

These 8 stories were reviewed by five expert lenses (FinOps, Principal Architect, SRE, ML/Data Engineer, Application Security) before publication. The cross-cutting findings consolidated below apply to all 8 stories and supersede individual-story content where they conflict. Each US-XX file also has a Multi-Reviewer Validation Findings & Resolutions section with story-specific S1/S2 items.

Pricing Baseline Reconciliation

The pie chart above shows relative cost share for an illustrative deployment — the slice values are percentages, not absolute dollars, and the individual story baselines are independent estimates against different traffic-mix assumptions. The story baselines do not sum to a single portfolio total. When finance review needs a single number, derive it from production-measured per-service spend, not from this README. Specifically, US-01's $315K/month and US-04's $4K/month baselines are not slices of the same pie — they assume different per-service traffic profiles.

Region & Data Residency

The MangaAssist production deployment runs in ap-northeast-1 (Tokyo) for Japanese-customer data residency. AWS pricing in the per-story files is quoted at us-east-1 published list price for portability; ap-northeast-1 has a regional uplift of approximately:

  • Bedrock: +0–10% (varies by model availability; verify Anthropic model regional availability)
  • Fargate / Lambda: +5–10%
  • DynamoDB on-demand: +10–15%
  • OpenSearch Serverless: +5–10%
  • Kinesis / Firehose: +5–10%
  • Redshift: +5–15%

Cross-region calls are forbidden for any path touching customer data — Bedrock invocation, OpenSearch query, DynamoDB read/write, Kinesis put. Document any exception in the security review. Bedrock Anthropic model availability in ap-northeast-1 must be verified per model and per release; a fallback to a different model version (not a different region) is the documented mitigation.
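
To make the uplift ranges concrete, here is a small sketch that applies them to a us-east-1 list price; the $570/mo DynamoDB baseline from the KPI table is used purely as an example, and the ranges should be replaced with current published pricing before any finance use.

```python
# Sketch: estimate an ap-northeast-1 price band from a us-east-1 list price
# using the (low, high) fractional uplift ranges quoted above.
UPLIFT_RANGE = {
    "bedrock":    (0.00, 0.10),
    "fargate":    (0.05, 0.10),
    "dynamodb":   (0.10, 0.15),
    "opensearch": (0.05, 0.10),
    "kinesis":    (0.05, 0.10),
    "redshift":   (0.05, 0.15),
}

def tokyo_estimate(service: str, us_east_1_usd: float):
    """Return the (low, high) ap-northeast-1 estimate for a us-east-1 monthly price."""
    lo, hi = UPLIFT_RANGE[service]
    return us_east_1_usd * (1 + lo), us_east_1_usd * (1 + hi)

lo, hi = tokyo_estimate("dynamodb", 570.0)   # US-05 baseline from the KPI table
print(f"${lo:.0f}-${hi:.1f}/mo")
```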

Cross-Cutting Concerns Inherited by All Stories

| Concern | Why required | Applies to |
|---------|--------------|------------|
| request_id (UUID) threaded through every call | Distributed tracing, cost attribution, incident forensics | All stories |
| Per-request cost attribution emitted to US-07 event stream | US-08 cost breaker decisions need per-component breakdown | US-01, US-04, US-06 |
| Idempotency keys on all writes | TransactWrite retries can cause duplicate META updates; rate-limiter retries can cause double-billing | US-01, US-02, US-04, US-05, US-07, US-08 |
| Model / classifier version pinning in cache keys | Embedding-model rotation otherwise serves stale vectors silently; intent-classifier rotation breaks template-router contract | US-01, US-02, US-06 |
| Language stratification in metrics (English vs Japanese vs mixed) | Manga store is bilingual; aggregated metrics hide regressions on JP traffic | US-01, US-02, US-06 |
| Drift detection (intent distribution, embedding distribution, language mix) | Cost optimizations calibrated on month-1 traffic break at month-6 | US-01, US-02, US-06 |
| Schema versioning on analytics events | US-08 reads cost events; schema drift breaks the breaker silently | US-07 (producer), US-08 (consumer) |
| Audit trail for cost-control actions | CloudTrail on every kill switch, budget change, breaker state transition | US-01, US-08 |
| PII redaction at boundary (before cache, embed, archive) | GDPR / data residency / breach risk; embeddings are quasi-reversible | US-01, US-05, US-06, US-07 |
| ReDoS protection on regex paths | Rule-based intent classifier and template router process untrusted input | US-02, US-08 |

These are non-negotiable shared infrastructure owned by the Platform / SRE team. Per-story implementations must conform; deviations require explicit security review.
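
As an illustration of the first few rows (request_id threading, per-request cost attribution, schema versioning, model pinning, language stratification), here is a hypothetical cost-event shape. All field names are assumptions; the real schema is owned and versioned by US-07.

```python
# Sketch (illustrative schema): the per-request cost event that US-01/US-04/
# US-06 emit onto the US-07 stream. Not the production schema.
import json
import uuid
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CostEvent:
    schema_version: str   # explicit versioning so US-08 can detect drift
    request_id: str       # the UUID threaded through every call
    component: str        # "bedrock" | "fargate" | "rag" | ...
    cost_usd: float
    language: str         # "en" | "ja" | "mixed" (metric stratification)
    model_version: str    # pinned version, also embedded in cache keys

event = CostEvent(
    schema_version="1.0",
    request_id=str(uuid.uuid4()),
    component="bedrock",
    cost_usd=0.0042,
    language="ja",
    model_version="claude-3-5-sonnet-20240620",
)
payload = json.dumps(asdict(event))   # what would go onto the Kinesis stream
print(payload)
```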

Kill-Switch Precedence (single source of truth)

When multiple kill switches fire simultaneously, this is the precedence order, highest to lowest. A single feature-flag evaluator module owns precedence resolution; direct SSM Parameter Store reads from story code are forbidden.

  1. degradation_active=true (US-08) — overrides every other story's behavior. When set: model_tier_floor=haiku, RAG bypass is aggressive, scale-in is suspended, guest pipeline is template-only. Cost-side safety net always wins over per-story optimizations.
  2. cost_circuit_breaker_enabled=false (US-08) — disables the breaker only. Other stories continue normal behavior. Use only for emergency manual override; CloudTrail-audited; FinOps-lead-only IAM permission to flip.
  3. Per-story *_optimization_enabled=false — reverts that one story to pre-optimization baseline. Honored independently of the others.
  4. Per-technique flags within a story (e.g., compute_spot_enabled within US-04) — finest granularity, honored last.

Default values when SSM is unreachable are safe-by-default per flag: degradation_active defaults to false (do not over-degrade if the signal is missing); *_optimization_enabled defaults to false (revert to pre-optimization; never run an unverified path).
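
A sketch of the single flag-evaluator module described above, showing the precedence order and the safe-by-default fallbacks when SSM is unreachable. The SSM fetch itself is stubbed out (story code calls resolve(), never SSM directly), and all function and flag-suffix names are illustrative.

```python
# Sketch (hypothetical module): kill-switch precedence resolution.
# Precedence 2 (cost_circuit_breaker_enabled) gates only the breaker itself,
# so it does not appear in per-technique resolution below.

SAFE_DEFAULTS = {
    # applied when SSM is unreachable: never over-degrade, never run unverified
    "degradation_active": False,
}

def resolve(flags, story, technique):
    """Return the effective mode for one technique, applying precedence 1, 3, 4.

    flags: dict of flag values fetched from SSM, or None if SSM is unreachable.
    """
    f = dict(SAFE_DEFAULTS) if flags is None else flags
    # 1. Global degradation (US-08) overrides every other story's behavior.
    if f.get("degradation_active", False):
        return "degraded"
    # 3. Per-story kill switch; defaults False (revert to baseline).
    if not f.get(f"{story}_optimization_enabled", False):
        return "baseline"
    # 4. Per-technique flag within the story, finest granularity.
    if not f.get(f"{technique}_enabled", False):
        return "baseline"
    return "optimized"

print(resolve(None, "us04", "compute_spot"))   # SSM down: safe baseline
```

Centralizing this in one module is what makes the precedence order enforceable; scattered direct SSM reads would let a per-technique flag outrank degradation_active.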

Bedrock Provisioned Throughput, Savings Plans & EDP

None of the per-story files evaluate negotiated discounts. These are FinOps-lead responsibility (US-08 owner role), evaluated quarterly:

  • Bedrock Provisioned Throughput (~50% discount on sustained traffic above the per-minute threshold). At MangaAssist's projected peak ~25K tokens/min the 100K tokens/min minimum is over-provisioned but still cheaper at high utilization. Decision deferred until 30 days of post-launch traffic data; revisit at first quarterly cost review.
  • Compute Savings Plans (1-year/3-year commit on Fargate + Lambda; ~15–30% discount). Reduces US-04 effective baseline.
  • DynamoDB Reserved Capacity — only relevant if migrating from on-demand to provisioned; not currently in scope.
  • Native Bedrock prompt caching (Anthropic feature, GA on Bedrock Aug 2024) — not yet exploited; estimated additional 15–25% input-cost reduction on stable system prompts. Backlog item for US-01 v2.
  • Enterprise Discount Program / Private Pricing Agreement — account-wide, applies to all services; out of story scope.

Redis Tier as Multi-Story SPOF

Five stories (US-01, US-02, US-03, US-06, US-08) depend on the same Redis tier. The Architect review flagged this as a distributed-monolith risk. Mitigations applied:

  • Multi-AZ failover with Sentinel — non-negotiable for production.
  • Per-keyspace logical Redis DBs: llmresp: (db=1, noeviction), intent:sess: (db=2, allkeys-lru), product/reco/promo (db=3, allkeys-lru), emb: (db=4, allkeys-lru with int8 quantization), rate: and cost: (db=0, noeviction; never evict cost-critical state).
  • Story-specific fallback when Redis is unavailable, documented in each story's findings appendix.
  • Cost-critical state (rate-limiter counters, cost ledger) replicated to DDB as immutable ledger; Redis acts as read-through cache, not authority.
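
A sketch of a client-side routing table for the logical-DB layout above, so that every story writes into the agreed db number. Prefixes follow the bullet list; the function name is illustrative.

```python
# Sketch: keyspace-prefix -> logical Redis DB routing, mirroring the
# per-keyspace layout above. Eviction policy is configured server-side.
KEYSPACE_DB = {
    "rate:": 0,          # noeviction: cost-critical state
    "cost:": 0,          # noeviction: cost-critical state
    "llmresp:": 1,       # noeviction
    "intent:sess:": 2,   # allkeys-lru
    "product:": 3,       # allkeys-lru (product/reco/promo tier)
    "reco:": 3,
    "promo:": 3,
    "emb:": 4,           # allkeys-lru, int8-quantized vectors
}

def db_for_key(key: str) -> int:
    """Pick the logical Redis DB for a key; unmapped prefixes are rejected."""
    # Match longest prefix first so "intent:sess:" can never be shadowed.
    for prefix, db in sorted(KEYSPACE_DB.items(), key=lambda kv: -len(kv[0])):
        if key.startswith(prefix):
            return db
    raise ValueError(f"unmapped keyspace for key: {key!r}")

print(db_for_key("emb:titan-v2:abc123"))
```

One caveat worth noting: OSS Redis applies maxmemory-policy per node, not per logical DB, so mixing noeviction and allkeys-lru as described above implies separate node groups (or a Redis distribution with per-database policies) rather than logical DBs on a single instance.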

Reviewer Sign-Off Status

| Lens | Sign-off | Outstanding |
|------|----------|-------------|
| FinOps | Conditional | Pricing reconciliation in this section + per-story Math Validation flags |
| Principal Architect | Conditional | Cross-story contracts and SPOF mitigations applied per-story |
| SRE | Conditional | Runbooks added to US-01, US-03, US-04, US-08; kill-switch precedence above |
| ML / Data Engineer | Conditional | Multilingual + drift + reranker calibration applied to US-02, US-06 |
| Application Security | Conditional | Tier auth, cost-ledger immutability, PII redaction applied to US-08, US-05, US-07 |

Per-story details are in each file's "Multi-Reviewer Validation Findings & Resolutions" section.