Cost Optimization User Stories - MangaAssist Chatbot
Overview
This directory contains detailed user stories for cost optimization across every major service in the MangaAssist chatbot architecture. Each user story includes high-level design, low-level implementation details, Mermaid diagrams, and code examples.
User Stories
| # | User Story | Primary Service | Estimated Savings |
|---|---|---|---|
| US-01 | LLM Token Cost Optimization | Amazon Bedrock (Claude 3.5 Sonnet) | 40-60% of LLM spend |
| US-02 | Intent Classifier Cost Optimization | SageMaker Endpoint | 50-70% of inference spend |
| US-03 | Caching Strategy for Cost Reduction | ElastiCache Redis | 30-50% of downstream API costs |
| US-04 | Compute Cost Optimization | ECS Fargate + Lambda | 35-55% of compute spend |
| US-05 | DynamoDB Cost Optimization | DynamoDB | 40-60% of storage/throughput costs |
| US-06 | RAG Pipeline Cost Optimization | OpenSearch Serverless + Titan Embeddings | 30-50% of RAG costs |
| US-07 | Analytics Pipeline Cost Optimization | Kinesis + Redshift | 40-60% of analytics spend |
| US-08 | Traffic-Based Cost Optimization | Edge / Rate Limiter / Degraded Modes | 20-35% of total infrastructure |
Cost Distribution (Baseline Estimate)
pie title MangaAssist Monthly Cost Distribution (Before Optimization)
"Bedrock LLM" : 35
"ECS Fargate + Lambda" : 20
"DynamoDB" : 10
"OpenSearch Serverless" : 10
"ElastiCache Redis" : 8
"SageMaker (Intent)" : 7
"Kinesis + Redshift" : 5
"CloudFront + ALB" : 3
"Other" : 2
How to Use
- Start reading with US-01 (LLM Token Cost): it targets the largest cost driver.
- Read US-02 and US-03 next: they directly reduce LLM and downstream service calls.
- Apply US-04 through US-08 based on your current cost profile.
- Each user story is self-contained as a document, but the stories share infrastructure and should not be implemented in arbitrary order; follow the Dependency & Sequencing Graph section for implementation order.
Relationship to Architecture
These user stories map directly to the components described in:
- 04-architecture-hld.md — High-level architecture
- 04b-architecture-lld.md — Low-level design
Dependency & Sequencing Graph
The 8 stories are not independent. Some are shared infrastructure that other stories depend on; some are outer control loops that wrap the others. Implementing in the wrong order creates either dead optimizations (no cost telemetry to evaluate them) or runaway risk (no circuit breaker to bound them).
graph TB
US08[US-08 Traffic-Based<br>Outer cost-control loop]
US07[US-07 Analytics<br>Cost telemetry pipeline]
US03[US-03 Caching<br>Shared Redis tier]
US02[US-02 Intent Classifier<br>Intent label provider]
US01[US-01 LLM Tokens<br>Bedrock optimization]
US04[US-04 Compute<br>Fargate + Lambda]
US05[US-05 DynamoDB<br>Session-state lifecycle]
US06[US-06 RAG<br>OpenSearch + Titan]
US07 -->|cost events feed| US08
US08 -->|model_tier_floor| US01
US08 -->|degradation_level| US06
US08 -->|suspend_scale_in| US04
US02 -->|intent label + confidence| US01
US02 -->|intent label| US06
US02 -->|intent label| US08
US03 -->|llmresp: keyspace| US01
US03 -->|intent:sess: keyspace| US02
US03 -->|emb: keyspace| US06
US03 -->|fallback path| US05
US05 -->|TURN archive| US07
style US08 fill:#f66,stroke:#333
style US07 fill:#fd2,stroke:#333
style US03 fill:#fd2,stroke:#333
style US02 fill:#fd2,stroke:#333
Recommended implementation order:
- US-07 (Analytics) first: the cost telemetry pipeline must exist before any other story can be evaluated and before US-08 can read spend data.
- US-08 (Traffic-Based) second — the cost circuit breaker is the safety net for every aggressive optimization that follows. It must be in production before US-01's tier routing or prompt compression are turned on at scale.
- US-03 (Caching) third — shared Redis tier underpins US-01 (response cache), US-02 (session intent cache), and US-06 (embedding cache). Provision Redis with all four keyspaces planned, even if only one is initially populated.
- US-02 (Intent Classifier) fourth — provides the intent label that US-01, US-06, and US-08 all key on. Establish the intent-precision floor (≥ 0.92) before downstream stories rely on it.
- US-01, US-04, US-05, US-06 in parallel — these are the "leaf" stories that benefit from the foundation above. They can ship independently.
Why US-08 must be early: US-01's aggressive optimizations (prompt compression, tier routing, semantic cache) all have failure modes that can increase cost (cache poisoning serving expensive long answers, compression breaking and falling back to full prompts, tier classifier misrouting to Sonnet). Without US-08's cost circuit breaker as the backstop, a misconfiguration in US-01 can blow the daily budget before it is detected. The circuit breaker is the safety harness.
Owner Mapping
| # | User Story | Suggested Owner Role |
|---|---|---|
| US-01 | LLM Token Cost Optimization | Platform Engineering Lead (LLM/Bedrock) |
| US-02 | Intent Classifier Cost Optimization | ML Infrastructure Engineer |
| US-03 | Caching Strategy for Cost Reduction | Platform Architect |
| US-04 | Compute Cost Optimization | DevOps / SRE |
| US-05 | DynamoDB Cost Optimization | Backend Engineer |
| US-06 | RAG Pipeline Cost Optimization | ML Platform Engineer |
| US-07 | Analytics Pipeline Cost Optimization | Data Engineer |
| US-08 | Traffic-Based Cost Optimization | SRE / FinOps Lead |
The FinOps Lead (US-08 owner) holds the cross-cutting daily-budget contract; per-story owners hold the per-service KPIs.
Unified KPI Rollup
A FinOps lead should be able to scan this table at a glance and know the headline metric and target for each story. The "Status" column is filled in during quarterly reviews.
| # | Story | Headline Metric | Target | Baseline | Status |
|---|---|---|---|---|---|
| US-01 | LLM Tokens | Bedrock spend reduction | 40–60% | $315K/mo | Track |
| US-02 | Intent Classifier | Inference cost reduction | 50–70% | $400–600/mo | Track |
| US-03 | Caching | Combined hit rate | ≥ 70% | 0% | Track |
| US-04 | Compute | Fargate spend reduction | 35–55% | $3.5–4.5K/mo | Track |
| US-05 | DynamoDB | DDB spend reduction | 40–60% | $570/mo | Track |
| US-06 | RAG | RAG pipeline spend reduction | 30–50% | $750–900/mo | Track |
| US-07 | Analytics | Analytics spend reduction | 40–60% | $500–700/mo | Track |
| US-08 | Traffic-Based | Daily Bedrock spend cap respected | 100% | $5K daily cap | Track |
For the cross-story interaction matrix (which story coordinates with which on shared signals), see the Cross-Story Interactions & Conflicts section in each individual story file.
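To make the targets in the rollup table concrete, the sketch below converts each percentage target into a post-optimization monthly spend range. The baselines and targets are copied from the table; the function name `post_optimization_range` is illustrative. The per-story results are deliberately not summed, because the baselines assume different traffic profiles.

```python
# Baselines (USD per month, low/high) and reduction targets (fractions)
# copied from the KPI rollup table. US-03 and US-08 are omitted because
# their headline metrics are not spend reductions.
BASELINES = {
    "US-01": ((315_000, 315_000), (0.40, 0.60)),
    "US-02": ((400, 600), (0.50, 0.70)),
    "US-04": ((3_500, 4_500), (0.35, 0.55)),
    "US-05": ((570, 570), (0.40, 0.60)),
    "US-06": ((750, 900), (0.30, 0.50)),
    "US-07": ((500, 700), (0.40, 0.60)),
}


def post_optimization_range(baseline, reduction):
    """Return (best_case, worst_case) monthly spend after optimization."""
    low_base, high_base = baseline
    min_cut, max_cut = reduction
    best = low_base * (1 - max_cut)    # lowest baseline, largest reduction
    worst = high_base * (1 - min_cut)  # highest baseline, smallest reduction
    return best, worst


for story, (baseline, reduction) in BASELINES.items():
    best, worst = post_optimization_range(baseline, reduction)
    print(f"{story}: ${best:,.0f} to ${worst:,.0f} per month")
```

A quarterly review could compare measured spend against these ranges to fill in the "Status" column.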
Implementation Sequencing Callout
US-08 must be deployed before US-01's aggressive optimizations land in production. The cost circuit breaker, daily budget cap, and per-tier kill switches are the safety net for model-tiering and prompt compression. A misconfiguration in US-01 (e.g., template router falling through to Sonnet on every message) can produce a cost runaway that US-08 caps within minutes; without US-08, the runaway is bounded only by the next manual cost-alarm review (typically hours to a day).
Similarly, US-07 must be deployed before US-08 can function — the cost circuit breaker reads daily Bedrock spend from US-07's event stream. Until US-07 ships, US-08's breaker can only operate on lagging billing data (24–48 hours stale), which is too slow for real-time cost control.
The recommended sequencing — US-07 → US-08 → US-03 → US-02 → others in parallel — reflects this dependency structure.
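The breaker behavior described above can be sketched as follows. This is a minimal illustration, not the US-08 implementation: the class name `CostCircuitBreaker`, its methods, and the event shape (a day plus a cost in USD) are hypothetical. Only the $5K daily cap comes from the KPI rollup table.

```python
from datetime import date
from typing import Optional


class CostCircuitBreaker:
    """Opens when cumulative spend for the current day reaches the cap."""

    def __init__(self, daily_cap_usd: float = 5_000.0) -> None:
        self.daily_cap_usd = daily_cap_usd
        self._day: Optional[date] = None
        self._spent_usd = 0.0

    def record(self, event_day: date, cost_usd: float) -> None:
        """Consume one cost event from the telemetry stream."""
        if event_day != self._day:
            # A new day starts a fresh budget.
            self._day = event_day
            self._spent_usd = 0.0
        self._spent_usd += cost_usd

    def is_open(self) -> bool:
        """True once the day's spend has reached the cap."""
        return self._spent_usd >= self.daily_cap_usd


breaker = CostCircuitBreaker()
breaker.record(date(2025, 1, 1), 4_900.0)
print(breaker.is_open())  # False: still under the cap
breaker.record(date(2025, 1, 1), 150.0)
print(breaker.is_open())  # True: cap reached, degrade downstream stories
```

The sketch also shows why stale input is a problem: the breaker can only open as quickly as cost events reach `record`, so billing data that lags by a day would open it a day late.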
Per-Story Deep Dives, Real-World Validation, Cross-Story Interactions, and Rollback
Every story file (US-01 through US-08) ends with four deep-dive sections appended after the existing Risks table:
- Deep Dive: Why This Works on a Manga Chatbot Workload — architectural intuition specific to manga-chatbot traffic properties.
- Real-World Validation — industry benchmarks, named case studies, and math-validation of cost numbers against current AWS pricing.
- Cross-Story Interactions & Conflicts — explicit edges between this story and the others, with conflict modes and resolution rules.
- Rollback & Experimentation — shadow-mode plan, canary thresholds, kill-switch flag, and quality-regression criteria.
Read all four sections of any one story before implementing it.
Offline Testing & Interview-Loop Prep
For per-scenario offline-testing deep-dives and Amazon-loop grill chains (ML/AI Engineer and MLOps Engineer lenses) covering all 8 stories above, see ../Cost-Optimization-Offline-Testing/. Files 03–07 in that folder apply rigorous offline-test design (counterfactual replay, decision-equivalence, cost-aware golden, stress simulation) to each US story and provide multi-round interview grills with architect-level escalation.
Multi-Reviewer Validation & Cross-Cutting Hardening
These 8 stories were reviewed by five expert lenses (FinOps, Principal Architect, SRE, ML/Data Engineer, Application Security) before publication. The cross-cutting findings consolidated below apply to all 8 stories and supersede individual-story content where they conflict. Each US-XX file also has a Multi-Reviewer Validation Findings & Resolutions section with story-specific S1/S2 items.
Pricing Baseline Reconciliation
The pie chart above shows relative cost share for an illustrative deployment — the slice values are percentages, not absolute dollars, and the individual story baselines are independent estimates against different traffic-mix assumptions. The story baselines do not sum to a single portfolio total. When finance review needs a single number, derive it from production-measured per-service spend, not from this README. Specifically, US-01's $315K/month and US-04's $4K/month baselines are not slices of the same pie — they assume different per-service traffic profiles.
Region & Data Residency
The MangaAssist production deployment runs in ap-northeast-1 (Tokyo) for Japanese-customer data residency. AWS pricing in the per-story files is quoted at us-east-1 published list price for portability; ap-northeast-1 has a regional uplift of approximately:
- Bedrock: +0–10% (varies by model availability — verify Anthropic model regional availability)
- Fargate / Lambda: +5–10%
- DynamoDB on-demand: +10–15%
- OpenSearch Serverless: +5–10%
- Kinesis / Firehose: +5–10%
- Redshift: +5–15%
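As a rough aid, the uplift ranges above can be applied to a us-east-1 figure like this. The dictionary keys and the function name `tokyo_estimate` are illustrative; the percentages are the approximate ranges from the list, not published prices.

```python
# Approximate ap-northeast-1 uplift ranges (fractions) from the list above.
REGIONAL_UPLIFT = {
    "bedrock": (0.00, 0.10),
    "fargate_lambda": (0.05, 0.10),
    "dynamodb_on_demand": (0.10, 0.15),
    "opensearch_serverless": (0.05, 0.10),
    "kinesis_firehose": (0.05, 0.10),
    "redshift": (0.05, 0.15),
}


def tokyo_estimate(service: str, us_east_1_usd: float) -> tuple[float, float]:
    """Return the (low, high) estimate after applying the regional uplift."""
    low, high = REGIONAL_UPLIFT[service]
    return us_east_1_usd * (1 + low), us_east_1_usd * (1 + high)


# Example: the US-05 DynamoDB baseline of $570/month quoted at us-east-1.
low, high = tokyo_estimate("dynamodb_on_demand", 570.0)
print(f"DynamoDB in ap-northeast-1: ${low:,.2f} to ${high:,.2f} per month")
```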
Cross-region calls are forbidden for any path touching customer data — Bedrock invocation, OpenSearch query, DynamoDB read/write, Kinesis put. Document any exception in the security review. Bedrock Anthropic model availability in ap-northeast-1 must be verified per model and per release; a fallback to a different model version (not a different region) is the documented mitigation.
Cross-Cutting Concerns Inherited by All Stories
| Concern | Why required | Applies to |
|---|---|---|
| request_id (UUID) threaded through every call | Distributed tracing, cost attribution, incident forensics | All stories |
| Per-request cost attribution emitted to US-07 event stream | US-08 cost breaker decisions need per-component breakdown | US-01, US-04, US-06 |
| Idempotency keys on all writes | TransactWrite retries can cause duplicate META updates; rate-limiter retries can cause double-billing | US-01, US-02, US-04, US-05, US-07, US-08 |
| Model / classifier version pinning in cache keys | Embedding-model rotation otherwise serves stale vectors silently; intent-classifier rotation breaks template-router contract | US-01, US-02, US-06 |
| Language stratification in metrics (English vs Japanese vs mixed) | Manga store is bilingual; aggregated metrics hide regressions on JP traffic | US-01, US-02, US-06 |
| Drift detection (intent distribution, embedding distribution, language mix) | Cost optimizations calibrated on month-1 traffic break at month-6 | US-01, US-02, US-06 |
| Schema versioning on analytics events | US-08 reads cost events; schema drift breaks the breaker silently | US-07 (producer), US-08 (consumer) |
| Audit trail for cost-control actions | CloudTrail on every kill switch, budget change, breaker state transition | US-01, US-08 |
| PII redaction at boundary (before cache, embed, archive) | GDPR / data residency / breach risk; embeddings are quasi-reversible | US-01, US-05, US-06, US-07 |
| ReDoS protection on regex paths | Rule-based intent classifier and template router process untrusted input | US-02, US-08 |
These are non-negotiable shared infrastructure owned by the Platform / SRE team. Per-story implementations must conform; deviations require explicit security review.
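One of the concerns in the table, version pinning in cache keys, can be illustrated with a short sketch. The function name, the key layout after the `emb:` prefix, and the model identifiers in the usage lines are assumptions made for illustration; the story files define the real key format.

```python
import hashlib


def embedding_cache_key(text: str, model_id: str, model_version: str) -> str:
    """Build an emb: key that changes whenever the embedding model changes.

    Hashing the text keeps raw user input out of the key itself.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]
    return f"emb:{model_id}:{model_version}:{digest}"


# Hypothetical model identifiers, used only to show the effect of a rotation.
old_key = embedding_cache_key("one piece volume 1", "titan-embed", "v1")
new_key = embedding_cache_key("one piece volume 1", "titan-embed", "v2")
print(old_key != new_key)  # True: a rotated model never reads stale vectors
```

Because the version is part of the key, a model rotation causes cache misses rather than silently serving vectors from the previous model.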
Kill-Switch Precedence (single source of truth)
When multiple kill switches fire simultaneously, this is the precedence order, highest to lowest. A single feature-flag evaluator module owns precedence resolution; direct SSM Parameter Store reads from story code are forbidden.
- degradation_active=true (US-08): overrides every other story's behavior. When set: model_tier_floor=haiku, RAG bypass is aggressive, scale-in is suspended, guest pipeline is template-only. Cost-side safety net always wins over per-story optimizations.
- cost_circuit_breaker_enabled=false (US-08): disables the breaker only. Other stories continue normal behavior. Use only for emergency manual override; CloudTrail-audited; FinOps-lead-only IAM permission to flip.
- Per-story *_optimization_enabled=false: reverts that one story to pre-optimization baseline. Honored independently of the others.
- Per-technique flags within a story (e.g., compute_spot_enabled within US-04): finest granularity, honored last.
Default value when SSM is unreachable: safe-by-default per flag — degradation_active defaults to false (do not over-degrade if signal is missing); *_optimization_enabled defaults to false (revert to pre-optimization, never run unverified path).
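The precedence order and the safe defaults can be sketched as a single evaluator. This is an illustration under stated assumptions: the function names, the returned mode strings, and the `us04_optimization_enabled` flag name are hypothetical, and passing `None` stands in for an unreachable SSM Parameter Store. Defaulting per-technique flags to false is also an assumption; the text above only specifies defaults for degradation_active and *_optimization_enabled.

```python
from typing import Mapping, Optional

Flags = Optional[Mapping[str, bool]]  # None models an unreachable flag store


def read_flag(flags: Flags, name: str) -> bool:
    """Return the flag value, or False when the store or the flag is missing."""
    if flags is None:
        return False
    return flags.get(name, False)


def resolve_mode(flags: Flags, story: str, technique: Optional[str] = None) -> str:
    """Apply kill-switch precedence, highest first, for one story."""
    if read_flag(flags, "degradation_active"):
        return "degraded"          # level 1: overrides everything else
    # Level 2 (cost_circuit_breaker_enabled) only affects the breaker itself,
    # so it does not change the mode of other stories.
    if not read_flag(flags, f"{story}_optimization_enabled"):
        return "baseline"          # level 3: story reverted
    if technique is not None and not read_flag(flags, f"{technique}_enabled"):
        return "technique_off"     # level 4: one technique disabled
    return "optimized"


flags = {"us04_optimization_enabled": True, "compute_spot_enabled": True}
print(resolve_mode(flags, "us04", "compute_spot"))  # optimized
print(resolve_mode(None, "us04", "compute_spot"))   # baseline
```

Keeping this logic in one module means story code never has to know the precedence order, which is the point of forbidding direct Parameter Store reads.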
Bedrock Provisioned Throughput, Savings Plans & EDP
None of the per-story files evaluate negotiated discounts. These are the FinOps lead's responsibility (US-08 owner role) and are evaluated quarterly:
- Bedrock Provisioned Throughput (~50% discount on sustained traffic above the per-minute threshold). At MangaAssist's projected peak of ~25K tokens/min, the 100K tokens/min minimum commitment is over-provisioned; it becomes cheaper than on-demand only at high sustained utilization. Decision deferred until 30 days of post-launch traffic data are available; revisit at the first quarterly cost review.
- Compute Savings Plans (1-year/3-year commit on Fargate + Lambda; ~15–30% discount). Reduces US-04 effective baseline.
- DynamoDB Reserved Capacity — only relevant if migrating from on-demand to provisioned; not currently in scope.
- Native Bedrock prompt caching (an Anthropic feature, introduced on Anthropic's own API in August 2024; confirm current Bedrock availability per model and region): not yet exploited; estimated additional 15–25% input-cost reduction on stable system prompts. Backlog item for US-01 v2.
- Enterprise Discount Program / Private Pricing Agreement — account-wide, applies to all services; out of story scope.
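The Provisioned Throughput trade-off in the first bullet can be checked with simple arithmetic. The sketch assumes a simplified pricing model in which the commitment is billed for its full capacity at (1 − discount) times the on-demand per-token rate; actual Bedrock Provisioned Throughput pricing is structured differently and should be taken from the current price list. The function names are illustrative.

```python
def break_even_utilization(discount: float) -> float:
    """Utilization above which a fixed commitment beats on-demand.

    Assumption: the commitment is billed for full capacity at
    (1 - discount) times the on-demand per-token rate.
    """
    return 1.0 - discount


def provisioned_cost_ratio(actual_tpm: float, committed_tpm: float,
                           discount: float) -> float:
    """Provisioned cost divided by on-demand cost; below 1.0 means cheaper."""
    return committed_tpm * (1.0 - discount) / actual_tpm


# Figures from the bullet above: ~25K tokens/min peak, 100K minimum, ~50% discount.
print(break_even_utilization(0.50))                    # 0.5
print(provisioned_cost_ratio(25_000, 100_000, 0.50))   # 2.0
```

Under this simplified model the commitment costs about twice as much as on-demand at the projected peak. This is consistent with deferring the decision until real traffic data is available.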
Redis Tier as Multi-Story SPOF
Five stories (US-01, US-02, US-03, US-06, US-08) depend on the same Redis tier. The Architect review flagged this as a distributed-monolith risk. Mitigations applied:
- Multi-AZ with automatic failover (managed by ElastiCache itself; Sentinel applies only to self-managed Redis): non-negotiable for production.
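- Per-keyspace logical Redis DBs: llmresp: (db=1, noeviction), intent:sess: (db=2, allkeys-lru), product/reco/promo (db=3, allkeys-lru), emb: (db=4, allkeys-lru with int8 quantization), rate: and cost: (db=0, noeviction; never evict cost-critical state). Redis applies maxmemory-policy per instance rather than per logical DB, so keyspaces with different eviction policies need separate instances or replication groups.
- Story-specific fallback when Redis is unavailable, documented in each story's findings appendix.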
- Cost-critical state (rate-limiter counters, cost ledger) replicated to DDB as immutable ledger; Redis acts as read-through cache, not authority.
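The last mitigation, with the ledger as authority and Redis as a read-through cache, can be sketched with in-memory stand-ins. The class and method names are hypothetical, the list stands in for the DynamoDB ledger, and the dictionary stands in for the Redis cost: keyspace. No AWS or Redis client calls are shown.

```python
class CostLedger:
    """Append-only ledger is the authority; the cache only speeds up reads."""

    def __init__(self) -> None:
        self._ledger = []            # stand-in for the immutable DDB ledger
        self._cache = {}             # stand-in for Redis cost: keys
        self.cache_available = True

    def fail_cache(self) -> None:
        """Simulate losing the Redis tier, including its contents."""
        self._cache.clear()
        self.cache_available = False

    def append(self, day: str, cost_usd: float) -> None:
        self._ledger.append((day, cost_usd))   # authoritative write first
        if self.cache_available:
            self._cache.pop(day, None)         # invalidate the cached total

    def daily_total(self, day: str) -> float:
        if self.cache_available and day in self._cache:
            return self._cache[day]            # cache hit
        total = sum(cost for d, cost in self._ledger if d == day)
        if self.cache_available:
            self._cache[day] = total           # populate on miss
        return total


ledger = CostLedger()
ledger.append("2025-01-01", 120.0)
ledger.append("2025-01-01", 30.0)
print(ledger.daily_total("2025-01-01"))  # 150.0
ledger.fail_cache()
print(ledger.daily_total("2025-01-01"))  # 150.0, served from the ledger
```

Because every total can be recomputed from the ledger, losing the cache slows reads but does not lose cost state.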
Reviewer Sign-Off Status
| Lens | Sign-off | Outstanding |
|---|---|---|
| FinOps | Conditional | Pricing reconciliation in this section + per-story Math Validation flags |
| Principal Architect | Conditional | Cross-story contracts and SPOF mitigations applied per-story |
| SRE | Conditional | Runbooks added to US-01, US-03, US-04, US-08; kill-switch precedence above |
| ML / Data Engineer | Conditional | Multilingual + drift + reranker calibration applied to US-02, US-06 |
| Application Security | Conditional | Tier auth, cost-ledger immutability, PII redaction applied to US-08, US-05, US-07 |
Per-story details are in each file's "Multi-Reviewer Validation Findings & Resolutions" section.