
LLMOps User Stories — MangaAssist

User stories covering the full LLMOps lifecycle for MangaAssist: model deployment, prompt management, evaluation, monitoring, data pipelines, cost governance, guardrails, and continuous improvement.


Story Map Overview

```mermaid
graph TB
    subgraph "1. Model Lifecycle"
        A1[Model Registry]
        A2[Model Deployment]
        A3[Model Versioning]
        A4[Model Rollback]
    end

    subgraph "2. Prompt Management"
        B1[Prompt Versioning]
        B2[Prompt Testing]
        B3[Prompt Review]
        B4[Prompt Rollback]
    end

    subgraph "3. Evaluation & Quality Gates"
        C1[Golden Dataset]
        C2[Offline Eval]
        C3[Shadow Mode]
        C4[Canary Deploy]
    end

    subgraph "4. Observability & Monitoring"
        D1[Tracing]
        D2[Drift Detection]
        D3[Alerting]
        D4[Dashboards]
    end

    subgraph "5. Data & RAG Pipelines"
        E1[Knowledge Base Refresh]
        E2[Embedding Pipeline]
        E3[Chunking Strategy]
        E4[Index Management]
    end

    subgraph "6. Cost & Governance"
        F1[Token Budget]
        F2[Cost Tracking]
        F3[Rate Limiting]
        F4[Quota Management]
    end

    subgraph "7. Guardrails & Safety"
        G1[Content Filters]
        G2[Anti-Hallucination]
        G3[PII Protection]
        G4[Adversarial Defense]
    end

    subgraph "8. Feedback & Improvement"
        H1[User Feedback Loop]
        H2[Error Analysis]
        H3[Retraining Triggers]
        H4[A/B Testing]
    end
```

Epic 1: Model Lifecycle Management

US-1.1 — Model Registry and Versioning

As an ML engineer, I want to register every model artifact (DistilBERT intent classifier, cross-encoder reranker, Titan embedding config, Claude prompt version) in a central model registry with version metadata, So that I can trace any production response back to the exact model versions that generated it.

Acceptance Criteria:
- Every model has a unique version ID, training date, dataset hash, and performance metrics recorded at registration time.
- Registry tracks: model name, version, framework, hosting target (SageMaker / Bedrock), status (staging / production / archived).
- No model reaches production without a registry entry.


US-1.2 — Automated Model Deployment Pipeline

As an ML engineer, I want to deploy a new model version to SageMaker or Bedrock through an automated CI/CD pipeline triggered by a registry promotion, So that deployments are repeatable, auditable, and do not require manual console operations.

Acceptance Criteria:
- Pipeline stages: build artifact → run offline evaluation → deploy to staging endpoint → run integration tests → promote to production.
- SageMaker endpoints (DistilBERT, reranker) are deployed via Infrastructure-as-Code (CloudFormation / CDK).
- Bedrock model version changes (Claude upgrades) are gated by shadow mode evaluation before promotion.
- Deployment logs are stored for audit.


US-1.3 — One-Click Model Rollback

As an on-call engineer, I want to roll back any model to the previous production version within 5 minutes, So that a bad model change can be reverted before it impacts a significant number of users.

Acceptance Criteria:
- SageMaker endpoints support blue-green deployment; rollback shifts traffic back to the old variant.
- Bedrock prompt/model version rollback reverts the orchestrator configuration to the previous version.
- Rollback triggers an automatic re-run of the golden dataset evaluation to confirm the restored version is healthy.
- Rollback event is logged and an alert is sent to the team.


US-1.4 — Multi-Model Endpoint Consolidation

As an infrastructure engineer, I want to host the DistilBERT intent classifier and cross-encoder reranker on a shared SageMaker Multi-Model Endpoint, So that GPU utilization improves from ~30% to ~70% and idle cost is reduced.

Acceptance Criteria:
- Both models are served from the same endpoint with independent scaling policies.
- Latency does not regress: intent classifier P99 ≤ 50ms, reranker P99 ≤ 120ms.
- Model-level metrics (invocation count, latency, error rate) are tracked per model on the shared endpoint.


US-1.5 — Inferentia Migration for Cost Optimization

As an ML engineer, I want to compile the DistilBERT intent classifier for AWS Inferentia (ml.inf1.xlarge) using the Neuron SDK, So that inference cost drops roughly 3x (from $0.736/hr on GPU to $0.228/hr on Inferentia) while latency improves.

Acceptance Criteria:
- Compiled model passes the golden dataset evaluation with intent accuracy ≥ 90%.
- P99 latency ≤ 25ms on Inferentia (vs. 50ms on GPU).
- Neuron compilation artifacts are versioned and stored in the model registry.


Epic 2: Prompt Lifecycle Management

US-2.1 — Prompt Versioning and Source Control

As a prompt engineer, I want to store all system prompts, few-shot examples, and prompt templates in version control with semantic versioning, So that every prompt change is reviewable, auditable, and easy to roll back.

Acceptance Criteria:
- Prompts are stored in a dedicated repository (or directory) with a changelog.
- Each prompt version has: version ID, author, description of change, date, and linked evaluation results.
- Prompt changes go through code review (PR-based workflow).


US-2.2 — Automated Prompt Regression Testing

As a prompt engineer, I want to automatically run the 500-query golden dataset evaluation on every prompt change PR, So that no prompt regression reaches production.

Acceptance Criteria:
- CI pipeline triggers on prompt file changes.
- Evaluation gates: intent accuracy ≥ 90%, BERTScore ≥ 0.80, ROUGE-L drop ≤ 10%, guardrail pass rate ≥ 95%, zero prohibited elements.
- PR is blocked if any gate fails; failure report is posted as a PR comment.
- Pipeline completes in ≤ 25 minutes.
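
The gating logic itself is small; a minimal sketch (the metric names and the `results` payload are illustrative, the thresholds are the ones listed above):

```python
# Hypothetical CI gate check for prompt-change PRs.
GATES = {
    "intent_accuracy":     lambda m: m >= 0.90,
    "bertscore":           lambda m: m >= 0.80,
    "rouge_l_drop":        lambda m: m <= 0.10,   # relative drop vs. baseline
    "guardrail_pass_rate": lambda m: m >= 0.95,
    "prohibited_elements": lambda m: m == 0,
}

def check_gates(results: dict[str, float]) -> list[str]:
    """Return the names of failed gates; an empty list means the PR may merge."""
    return [name for name, ok in GATES.items() if not ok(results[name])]

failed = check_gates({
    "intent_accuracy": 0.92, "bertscore": 0.81, "rouge_l_drop": 0.04,
    "guardrail_pass_rate": 0.97, "prohibited_elements": 0,
})
if failed:
    raise SystemExit(f"Blocked: failed gates {failed}")  # surfaced as a PR comment
```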


US-2.3 — Prompt A/B Testing Framework

As a product manager, I want to A/B test two prompt variants on live traffic with statistical significance tracking, So that I can measure which prompt produces better business outcomes (conversion rate, CSAT, resolution rate).

Acceptance Criteria:
- Traffic split is configurable (e.g., 50/50, 90/10).
- Metrics tracked per variant: thumbs-up rate, escalation rate, conversion rate, avg response length, latency.
- Experiment runs until statistical significance is reached or a manual stop is issued.
- Winning variant can be promoted to 100% with one action.


US-2.4 — Prompt Template Management for Multi-Intent Routing

As a prompt engineer, I want to manage separate prompt templates per intent (recommendation, FAQ, product Q&A, order tracking) with a shared system prompt base, So that I can optimize prompts for each use case independently without unintended side effects.

Acceptance Criteria:
- System prompt (persona, hard rules, output format) is a shared base template.
- Each intent has its own context assembly template and few-shot examples.
- Changes to a single intent prompt only trigger evaluation for that intent's golden dataset subset.


Epic 3: Evaluation and Quality Gates

US-3.1 — Golden Dataset Curation and Maintenance

As an ML engineer, I want to maintain a curated golden dataset of 500+ query-response pairs stratified by intent, with quarterly refresh cycles, So that offline evaluation covers the full spectrum of MangaAssist interactions including edge cases and adversarial inputs.

Acceptance Criteria:
- Dataset composition: recommendations (24%), product questions (20%), FAQ (16%), order tracking (12%), multi-turn (10%), edge cases (8%), returns (6%), chitchat (4%).
- Each entry includes: query, expected intent, reference response, required elements, prohibited elements, quality rubric, and tags.
- Quarterly refresh: remove 50 stale entries (discontinued products, outdated policies), add 50 new entries from production errors.
- Adversarial subset (40+ entries) contributed by the security team.
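
For concreteness, a hypothetical entry with the fields listed above (every value here is an illustrative example, not a fixed schema):

```python
# One illustrative golden-dataset entry.
entry = {
    "query": "Is volume 5 of Frieren in stock?",
    "expected_intent": "product_question",
    "reference_response": "Yes, Frieren: Beyond Journey's End Vol. 5 is in stock ...",
    "required_elements": ["availability", "product title"],
    "prohibited_elements": ["fabricated price", "competitor mention"],
    "quality_rubric": {"factual": 5, "helpful": 4, "tone": 5},
    "tags": ["in-stock", "single-turn", "en-US"],
}
```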


US-3.2 — Offline Evaluation Pipeline (Layer 1)

As an ML engineer, I want to run all 500 golden dataset queries through the full inference pipeline on every model or prompt change, So that quality regressions are caught before any code reaches staging.

Acceptance Criteria:
- Automated scoring for: intent accuracy, BERTScore, ROUGE-L delta, format compliance, guardrail pass rate, response length, prohibited element check, per-class F1.
- Hard thresholds gate PR merge (see US-2.2).
- Pipeline cost per run ≈ $15 (500 LLM calls).
- Results are stored and diffed against the previous baseline.


US-3.3 — Shadow Mode Evaluation (Layer 2)

As an ML engineer, I want to run a new model or prompt version in shadow mode (processing real production traffic in parallel without serving responses to users) for 3-7 days, So that I can compare the candidate version against the current production version on real traffic patterns.

Acceptance Criteria:
- Both old and new versions process every request; only the old version's response is served.
- Comparison metrics: BERTScore delta (≤ 5% drop), response length (±20%), guardrail pass rate (≥ old - 1%), hallucination score (≤ old + 0.02), intent routing changes (< 5%).
- Shadow mode report is generated daily and reviewed by the DS + engineering team.
- Shadow mode doubles LLM cost (~$31.5K/week at current scale); requires budget approval.


US-3.4 — Canary Deployment (Layer 3)

As an ML engineer, I want to serve a new model version to 1% of real traffic for 24 hours with real-time metric monitoring and auto-rollback, So that user-facing regressions are caught at minimal blast radius.

Acceptance Criteria:
- Traffic split: 1% canary, 99% baseline.
- Monitored metrics: escalation rate, thumbs-down rate, error rate, P99 latency.
- Auto-rollback triggers if: escalation rate increases > 2pp, error rate > 1%, or P99 latency > 3s.
- Canary promotion path: 1% → 10% → 50% → 100% with manual approval at each step.
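
A minimal sketch of the auto-rollback decision, assuming canary and baseline metrics are aggregated over a rolling window (the metric dicts are illustrative; the thresholds are the ones above):

```python
# Hypothetical canary auto-rollback check.
def should_rollback(canary: dict, baseline: dict) -> bool:
    escalation_delta = canary["escalation_rate"] - baseline["escalation_rate"]
    return (
        escalation_delta > 0.02           # escalation rate up > 2pp vs. baseline
        or canary["error_rate"] > 0.01    # error rate > 1%
        or canary["p99_latency_s"] > 3.0  # P99 latency > 3s
    )
```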


US-3.5 — Continuous Monitoring and Auto-Rollback (Layer 4)

As an on-call engineer, I want to continuously monitor production model quality metrics with automatic rollback on hard threshold violations, So that slow drift or sudden regressions are caught and remediated without manual intervention.

Acceptance Criteria:
- Real-time dashboards for: intent accuracy, hallucination rate, escalation rate, guardrail block rate, latency.
- Hard auto-rollback thresholds: hallucination rate > 5%, escalation rate > 20%, error rate > 1%.
- Soft alert thresholds (page on-call, no auto-rollback): hallucination rate > 3%, escalation rate > 17%, CSAT < 4.0.
- Weekly quality review meeting uses trend reports from this monitoring layer.


US-3.6 — Human Evaluation Pipeline

As a quality lead, I want to run weekly human evaluation on a sample of 200 production responses scored by trained annotators, So that I can catch quality issues that automated metrics miss (tone, helpfulness, cultural appropriateness).

Acceptance Criteria:
- 200 responses sampled weekly, stratified by intent.
- Annotators score on: factual correctness, helpfulness, tone, format, and whether the response would lead to a purchase.
- Inter-annotator agreement (Cohen's kappa) ≥ 0.7.
- Human eval scores are trended weekly and correlated with automated metrics to calibrate thresholds.


Epic 4: Observability and Monitoring

US-4.1 — End-to-End LLM Pipeline Tracing

As an on-call engineer, I want to view a single trace for any user request that shows every step (intent classification → RAG retrieval → embedding → reranking → LLM generation → guardrails) with inputs, outputs, latency, and token counts, So that I can diagnose a bad response in minutes instead of 45-90 minutes.

Acceptance Criteria:
- MLflow Tracing (or equivalent) captures every span in the pipeline with parent-child hierarchy.
- Each span logs: inputs, outputs, latency, model version, token count, error (if any).
- Traces are searchable by request ID, user ID, intent, and time range.
- Trace retention: 30 days hot storage, 1 year cold storage.
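
Instrumentation could look like the sketch below, assuming MLflow Tracing (mlflow ≥ 2.14); the pipeline helpers are placeholder stubs, not the real services:

```python
import mlflow

def classify_intent(msg: str) -> str:              # stub for the SageMaker call
    return "faq"

def retrieve(msg: str, intent: str) -> list[str]:  # stub for RAG retrieval
    return ["chunk-001", "chunk-002"]

@mlflow.trace(span_type="CHAIN")                   # root span for the request
def handle_message(user_msg: str) -> str:
    with mlflow.start_span(name="intent_classification") as span:
        intent = classify_intent(user_msg)
        span.set_inputs({"user_msg": user_msg})
        span.set_outputs({"intent": intent})
    with mlflow.start_span(name="rag_retrieval") as span:
        chunks = retrieve(user_msg, intent)
        span.set_outputs({"chunk_ids": chunks})
    return f"answer grounded in {len(chunks)} chunks"  # stand-in for the LLM call
```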


US-4.2 — Model Drift Detection

As an ML engineer, I want to detect data drift (input distribution shift) and model drift (output quality degradation) automatically, So that I can trigger retraining or prompt adjustment before users are impacted.

Acceptance Criteria:
- Data drift: monitor intent distribution weekly; alert if any intent's share shifts > 5pp from baseline.
- Model drift: monitor intent classifier accuracy on a rolling 1,000-sample evaluation; alert if accuracy drops below 88%.
- Embedding drift: monitor cosine similarity distribution of user queries vs. training data; alert on distribution shift (KS test p < 0.01).
- Drift alerts include recommended actions (retrain, augment training data, update prompt).
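
The embedding-drift check is a standard two-sample KS test; a rough sketch (scipy's `ks_2samp` is real, the similarity arrays and how they are sampled are assumptions):

```python
import numpy as np
from scipy.stats import ks_2samp

def embedding_drift(train_sims: np.ndarray, live_sims: np.ndarray) -> bool:
    """Compare the cosine-similarity distribution of live queries vs. training data."""
    stat, p_value = ks_2samp(train_sims, live_sims)
    return p_value < 0.01  # alert threshold from the acceptance criteria
```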


US-4.3 — Real-Time Operational Dashboard

As an engineering manager, I want to view a real-time dashboard showing active sessions, messages/second, P99 latency, error rate, intent distribution, LLM token cost, and guardrail block rate, So that I have immediate visibility into system health and can spot anomalies quickly.

Acceptance Criteria:
- Dashboard refreshes every 10 seconds.
- Panels: active sessions, msg/sec, P99 latency, error rate, intent distribution (pie chart), cost/hour, guardrail blocks, escalation rate.
- Historical comparison: current hour vs. same hour last week.
- Drill-down from any metric to individual traces.


US-4.4 — LLM-Specific Metrics Collection

As an ML engineer, I want to capture and dashboard LLM-specific metrics per request: input tokens, output tokens, time-to-first-token (TTFT), tokens-per-second (TPS), model version, temperature, and stop reason, So that I can track LLM performance trends and optimize cost.

Acceptance Criteria:
- Metrics emitted as CloudWatch custom metrics with dimensions: model_name, intent, region.
- P50/P90/P99 percentiles for TTFT and TPS are graphed.
- Token usage is aggregated daily for cost attribution.
- Alerts on: TTFT P99 > 1.5s, TPS < 20, unexpected stop reasons > 1%.
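
Emitting TTFT with the dimensions above might look like this (boto3's `put_metric_data` is the real API; the namespace and helper are illustrative):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_ttft(ttft_ms: float, model_name: str, intent: str, region: str) -> None:
    """Publish one TTFT sample as a CloudWatch custom metric."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/LLM",  # hypothetical namespace
        MetricData=[{
            "MetricName": "TimeToFirstToken",
            "Dimensions": [
                {"Name": "model_name", "Value": model_name},
                {"Name": "intent", "Value": intent},
                {"Name": "region", "Value": region},
            ],
            "Value": ttft_ms,
            "Unit": "Milliseconds",
        }],
    )
```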


US-4.5 — Alerting on Quality Degradation

As an on-call engineer, I want to receive automated alerts when AI quality metrics degrade beyond defined thresholds, So that I can investigate and remediate before business impact accumulates.

Acceptance Criteria:
- Alert channels: PagerDuty (critical), Slack (warning), email (informational).
- Critical alerts (page immediately): hallucination rate > 5%, error rate > 1%, P99 latency > 5s, guardrail pass rate < 90%.
- Warning alerts (Slack): thumbs-down rate increases > 3pp over 1 hour, escalation rate > 17%, response length drifts > 30% from baseline.
- Each alert includes: metric name, current value, threshold, time window, and a link to relevant traces.


Epic 5: RAG and Data Pipeline Operations

US-5.1 — Automated Knowledge Base Refresh Pipeline

As a data engineer, I want to automatically ingest updated product catalog data, FAQ articles, and policy documents into the RAG knowledge base on a defined schedule, So that the chatbot always references current information and does not serve stale answers.

Acceptance Criteria:
- Product catalog: re-indexed every 4 hours (event-driven on catalog changes preferred).
- FAQ and policy documents: re-indexed within 1 hour of a content update.
- Editorial content: re-indexed daily.
- Each ingestion run logs: documents processed, chunks created, embeddings generated, errors encountered.
- Stale chunk detection: flag chunks whose source document was deleted or updated.


US-5.2 — Embedding Model Version Management

As an ML engineer, I want to manage embedding model version upgrades (e.g., Titan Embeddings V2 → V3) with full re-indexing and validation, So that embedding upgrades do not silently degrade retrieval quality.

Acceptance Criteria:
- Embedding model change triggers full re-indexing of all knowledge base chunks.
- Re-indexed corpus is validated against a retrieval evaluation dataset (Recall@3 ≥ 80% on curated query-document pairs).
- Old and new embeddings coexist during validation; cutover happens atomically.
- Embedding model version is recorded per chunk for traceability.


US-5.3 — RAG Retrieval Quality Monitoring

As an ML engineer, I want to continuously monitor RAG retrieval quality in production (relevance of retrieved chunks, recall, MRR), So that I can detect retrieval degradation caused by knowledge base staleness, embedding drift, or query distribution changes.

Acceptance Criteria:
- Weekly automated evaluation: run 100 curated queries against the live index; measure Recall@3, MRR@3, and average reranker score.
- Alert if Recall@3 drops below 75% or MRR@3 drops below 0.6.
- Chunk-level analytics: identify chunks that are never retrieved (dead chunks) or retrieved for irrelevant queries (noisy chunks).
- Monthly review of chunk strategy (size, overlap, metadata) based on retrieval performance data.
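
Recall@3 and MRR@3 over the curated pairs reduce to a few lines; a sketch where `search` is an assumed wrapper around the live index returning ranked document IDs:

```python
def recall_and_mrr_at_3(eval_set, search) -> tuple[float, float]:
    """eval_set: iterable of (query, set_of_relevant_doc_ids) pairs."""
    recall_hits, rr_sum = 0, 0.0
    for query, relevant_ids in eval_set:
        top3 = search(query, k=3)                       # ranked doc IDs
        if any(doc_id in relevant_ids for doc_id in top3):
            recall_hits += 1
        for rank, doc_id in enumerate(top3, start=1):   # reciprocal rank of first hit
            if doc_id in relevant_ids:
                rr_sum += 1.0 / rank
                break
    n = len(eval_set)
    return recall_hits / n, rr_sum / n
```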


US-5.4 — Chunk Strategy Tuning Pipeline

As an ML engineer, I want to experiment with different chunking strategies (chunk size, overlap, metadata enrichment) and measure their impact on retrieval quality, So that I can optimize the RAG pipeline without guesswork.

Acceptance Criteria:
- Experiment framework supports configurable chunk size (128-1024 tokens), overlap (0-100 tokens), and metadata fields.
- Each experiment re-indexes a test corpus and evaluates Recall@3, MRR@3, and downstream response quality (BERTScore) on the golden dataset.
- Results are logged with experiment parameters for comparison.
- Winning configuration is promoted to production via the standard deployment pipeline.


Epic 6: Cost Governance and Token Budget Management

US-6.1 — Per-Session Token Budget Enforcement

As a platform engineer, I want to enforce a per-session token budget (input + output tokens) to prevent runaway LLM costs from long or adversarial sessions, So that no single session can consume disproportionate resources.

Acceptance Criteria:
- Default budget: 5,000 tokens per session (configurable).
- When budget is exceeded: gracefully degrade (shorter context, skip RAG) rather than hard-fail.
- Budget usage is tracked per session and surfaced in the analytics dashboard.
- Alert if > 1% of sessions hit the budget cap (may indicate a systemic issue or abuse).
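
A sketch of the degradation logic; the session accounting, config keys, and degradation steps are assumptions, but the budget value matches the criteria above:

```python
SESSION_BUDGET = 5_000  # tokens per session, configurable

def plan_request(session_tokens_used: int, ctx: dict) -> dict:
    """Degrade the request plan gracefully as the session nears its budget."""
    remaining = SESSION_BUDGET - session_tokens_used
    if remaining <= 0:       # over budget: no RAG, minimal history, short reply
        return {**ctx, "use_rag": False, "history_turns": 1, "max_output_tokens": 150}
    if remaining < 1_000:    # near the cap: trim context before cutting features
        return {**ctx, "history_turns": 2, "max_output_tokens": 300}
    return ctx               # full pipeline
```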


US-6.2 — LLM Cost Attribution and Reporting

As an engineering manager, I want to attribute LLM inference cost to specific intents, user segments, and traffic sources, So that I can identify which use cases are cost-efficient and which need optimization.

Acceptance Criteria:
- Cost breakdown by: intent (recommendation, FAQ, product Q&A, etc.), model (Claude, Titan, DistilBERT), user type (Prime vs. guest), and region.
- Daily cost report with comparison to budget and prior period.
- Alerts if daily LLM cost exceeds 120% of the 7-day rolling average.
- Target: ≤ $0.025 per session average.


US-6.3 — LLM Bypass for Low-Complexity Intents

As a platform engineer, I want to route low-complexity intents (greetings, order tracking, simple FAQ) to template-based or API-based handlers that skip LLM inference entirely, So that I reduce LLM cost by 40-60% and improve latency for those intents.

Acceptance Criteria:
- Intents eligible for LLM bypass: chitchat (template), order_tracking (API + template), simple FAQ (cached response).
- LLM bypass rate is tracked and dashboarded (target: > 40% of all messages skip the LLM).
- Quality of template responses is validated with the golden dataset (relevant subset).
- New bypass routes can be added without code deployment (configuration-driven).


US-6.4 — Prompt Caching for Repeated Context

As a platform engineer, I want to cache the system prompt and frequently used context blocks (persona, policies, common product data) at the LLM provider level, So that input token cost and TTFT are reduced for repeated prompt prefixes.

Acceptance Criteria:
- Bedrock prompt caching is enabled for the system prompt block (~500 tokens) and policy RAG chunks.
- Cache hit rate is tracked; target ≥ 70% for system prompt prefix.
- Cost savings from caching are measured and reported weekly.
- Cache invalidation triggers on system prompt or policy document updates.
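
A sketch of what the cache point placement could look like with the Bedrock Converse API, assuming a model that supports prompt caching (the model ID, prompt text, and user message are illustrative):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

SYSTEM_PROMPT = "You are MangaAssist, a manga shopping assistant. ..."  # ~500 tokens in practice
user_msg = "Recommend a seinen series similar to Vinland Saga."

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID
    system=[
        {"text": SYSTEM_PROMPT},
        {"cachePoint": {"type": "default"}},  # everything above this marker is cached
    ],
    messages=[{"role": "user", "content": [{"text": user_msg}]}],
)
print(response["output"]["message"]["content"][0]["text"])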


Epic 7: Guardrails and Safety

US-7.1 — Multi-Layer Guardrail Pipeline

As a safety engineer, I want to apply pre-generation and post-generation guardrails to every LLM response, So that responses are free from hallucinated data, PII leakage, toxic content, competitor mentions, and fabricated prices.

Acceptance Criteria:
- Pre-generation guardrails: prompt injection detection, input toxicity filter, input length validation.
- Post-generation guardrails: ASIN validation (all mentioned ASINs exist in catalog), price validation (no fabricated prices), PII detection and redaction, toxicity filter, competitor mention filter, URL validation.
- Guardrail pass rate ≥ 95% (i.e., ≤ 5% of responses are blocked/modified by guardrails).
- False positive rate (legitimate responses blocked) ≤ 1%.
- Guardrail latency budget: ≤ 100ms total.


US-7.2 — Prompt Injection Defense

As a security engineer, I want to detect and block prompt injection attempts (jailbreaks, instruction overrides, role-playing attacks) before they reach the LLM, So that adversarial users cannot manipulate the chatbot into generating harmful or off-brand responses.

Acceptance Criteria:
- Rule-based detection layer catches known injection patterns (e.g., "ignore previous instructions", "you are now...", base64-encoded instructions).
- ML-based classifier (trained on adversarial dataset) catches novel injection attempts.
- Blocked attempts return a safe fallback response: "I can only help with manga-related questions."
- All injection attempts are logged for security review and model retraining.
- Adversarial test suite (40+ cases) is run on every guardrail change.
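
The first-layer rule filter is cheap pattern matching; a minimal sketch (the pattern list is illustrative, not an exhaustive ruleset):

```python
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"[A-Za-z0-9+/]{40,}={0,2}"),  # long base64-looking runs
]

SAFE_FALLBACK = "I can only help with manga-related questions."

def screen_input(user_msg: str) -> str | None:
    """Return the fallback response if the input matches a known injection pattern."""
    if any(p.search(user_msg) for p in INJECTION_PATTERNS):
        return SAFE_FALLBACK  # the attempt is also logged for security review
    return None
```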


US-7.3 — Anti-Hallucination Validation

As an ML engineer, I want to validate every LLM response against the provided context (catalog data, RAG chunks) to detect and prevent hallucinated information, So that users never receive fabricated product details, prices, or availability information.

Acceptance Criteria:
- Post-generation checks: all ASINs mentioned exist in the provided product data, all prices match the catalog, all URLs are valid Amazon URLs.
- Faithfulness score (response grounded in provided context) is computed per response; responses below threshold are re-generated with stricter constraints or escalated.
- Hallucination rate target: < 2% of all responses.
- Temperature is set to 0.3 to minimize creative deviation.
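
The ASIN and price checks reduce to string extraction plus catalog lookups; a sketch (the regexes and catalog shape are assumptions, and a real implementation would compare prices per product rather than against all catalog values):

```python
import re

ASIN_RE = re.compile(r"\bB0[A-Z0-9]{8}\b")
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")

def validate_response(text: str, catalog: dict[str, float]) -> list[str]:
    """Return grounding violations; an empty list means the response passes."""
    violations = []
    for asin in ASIN_RE.findall(text):
        if asin not in catalog:                        # ASIN not in provided data
            violations.append(f"unknown ASIN {asin}")
    for price in PRICE_RE.findall(text):
        if float(price) not in catalog.values():       # price not from the catalog
            violations.append(f"price ${price} not in catalog")
    return violations
```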


US-7.4 — Streaming Response Guardrails

As a platform engineer, I want to apply guardrails to streaming (token-by-token) LLM responses without breaking the streaming experience, So that users see tokens as they are generated but harmful content is intercepted mid-stream.

Acceptance Criteria:
- Lightweight keyword/pattern filter runs on each token chunk (detect PII patterns, competitor names, profanity).
- If a violation is detected mid-stream: stop generation, replace the partial response with a safe fallback, log the incident.
- Full post-generation guardrails run on the complete assembled response.
- Streaming guardrail latency overhead: ≤ 5ms per chunk.
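
One way to structure the mid-stream filter is a generator wrapper around the token stream; a sketch with an illustrative blocklist (a real check would be pattern-based and intent-aware):

```python
from typing import Iterator

BLOCKLIST = ("competitor-store.example", "ssn:")  # illustrative patterns
SAFE_FALLBACK = "Sorry, I can't share that. Can I help with something else?"

def guarded_stream(chunks: Iterator[str]) -> Iterator[str]:
    """Yield chunks until a violation, then emit a fallback and stop generation."""
    seen = ""
    for chunk in chunks:
        seen += chunk  # check accumulated text so patterns split across chunks match
        if any(term in seen.lower() for term in BLOCKLIST):
            yield SAFE_FALLBACK  # client replaces the partial response on this signal
            return               # stop generation mid-stream; incident logged upstream
        yield chunk
```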


Epic 8: Feedback Loop and Continuous Improvement

US-8.1 — User Feedback Collection Pipeline

As a product manager, I want to capture thumbs-up/thumbs-down feedback on every chatbot response and correlate it with the full trace (intent, prompt version, model version, retrieved chunks), So that I can identify which response patterns drive satisfaction and which cause dissatisfaction.

Acceptance Criteria:
- Feedback widget on every response (thumbs up / thumbs down / optional free-text).
- Feedback events are stored with: session ID, message ID, trace ID, intent, model version, prompt version, feedback value, timestamp.
- Dashboard: thumbs-up rate by intent, by model version, by prompt version, trended over time.
- Target: thumbs-up rate > 60%.


US-8.2 — Automated Error Analysis and Categorization

As an ML engineer, I want to automatically categorize thumbs-down responses, escalations, and guardrail blocks into root cause categories (wrong intent, bad retrieval, hallucination, unhelpful tone, missing information), So that I can prioritize improvement efforts based on the highest-impact failure modes.

Acceptance Criteria:
- LLM-assisted categorization of negative signals into predefined root cause buckets.
- Weekly error analysis report: top 5 failure categories ranked by frequency and business impact.
- Each failure category links to example traces for investigation.
- Action items from error analysis feed into the sprint backlog.


US-8.3 — Retraining Trigger Automation

As an ML engineer, I want to automatically trigger model retraining when drift detection or quality metrics indicate degradation beyond defined thresholds, So that models are refreshed proactively rather than reactively.

Acceptance Criteria:
- Intent classifier retraining triggers when: accuracy on rolling eval drops below 88%, or intent distribution shifts > 5pp for any class.
- Retraining pipeline: pull latest labeled data → augment with recent production samples → train → evaluate on golden dataset → register in model registry.
- Retraining does not auto-deploy; it produces a candidate that enters the standard evaluation pipeline (shadow → canary → rollout).
- Retraining frequency target: at most quarterly under normal conditions; on-demand when triggered by drift.


US-8.4 — Production Data Flywheel

As an ML engineer, I want to use production conversations (anonymized) to continuously improve training data, golden datasets, and few-shot examples, So that the system learns from real user interactions over time.

Acceptance Criteria:
- High-confidence production conversations (thumbs-up, resolved, no escalation) are candidates for training data augmentation.
- Low-confidence or failed conversations are candidates for golden dataset edge cases.
- All production data is anonymized (PII stripped) before use in training.
- Data flywheel metrics: number of new training examples added per quarter, impact on model accuracy.


US-8.5 — A/B Testing for Model and Prompt Variants

As a product manager, I want to run controlled A/B experiments comparing model versions, prompt variants, or RAG configurations on live traffic, So that improvements are validated with statistical rigor before full rollout.

Acceptance Criteria:
- Experiment framework supports: traffic splitting, metric collection per variant, statistical significance calculation.
- Metrics per variant: conversion rate, CSAT, resolution rate, escalation rate, cost per session.
- Experiments can be stopped early if a variant shows statistically significant harm (sequential testing).
- Experiment results are archived with full configuration for reproducibility.
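
For a rate metric like conversion, the significance check can be a two-proportion z-test; a sketch using statsmodels (the counts are illustrative):

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [412, 389]   # variant A, variant B
sessions    = [5000, 5000]

stat, p_value = proportions_ztest(count=conversions, nobs=sessions)
significant = p_value < 0.05
print(f"z = {stat:.2f}, p = {p_value:.3f}, significant = {significant}")
```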


Epic 9: Infrastructure and Scaling Operations

US-9.1 — Predictive Auto-Scaling for Inference Endpoints

As an infrastructure engineer, I want to auto-scale SageMaker endpoints using a combination of scheduled scaling (for predictable events) and step scaling (for unexpected spikes), So that the system handles traffic spikes without throttling and does not waste compute during low-traffic periods.

Acceptance Criteria:
- Scheduled scaling: pre-scale 2 hours before known events (manga releases, Prime Day).
- Step scaling: if InvocationsPerInstance > 200/min for 1 min → add 2 instances; > 500/min → add 4 instances.
- Minimum instances: 2 (never scale to zero).
- Scale-down: if InvocationsPerInstance < 50/min for 10 min → remove 1 instance (cooldown 300s).
- Zero throttling events during peak traffic.
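
The scale-out steps could be defined via Application Auto Scaling; a sketch (the resource and policy names are illustrative, and the policy assumes a CloudWatch alarm on InvocationsPerInstance with a threshold of 200, since step bounds are offsets from the alarm threshold):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.put_scaling_policy(
    PolicyName="intent-classifier-step-scale-out",  # hypothetical name
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/intent-classifier/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 300,
        "StepAdjustments": [
            # 200-500 invocations/instance/min (offset 0-300): add 2 instances
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 300,
             "ScalingAdjustment": 2},
            # > 500 invocations/instance/min (offset 300+): add 4 instances
            {"MetricIntervalLowerBound": 300, "ScalingAdjustment": 4},
        ],
    },
)
```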


US-9.2 — Bedrock Provisioned Throughput Management

As an infrastructure engineer, I want to manage Bedrock provisioned throughput commitments based on traffic forecasting, So that the LLM generation layer does not throttle during peak traffic and cost is optimized during off-peak.

Acceptance Criteria:
- Provisioned throughput covers P95 traffic levels; on-demand handles spikes above that.
- Throttling rate: 0% during normal operation, < 0.1% during peak events.
- Monthly review of provisioned vs. on-demand cost split to optimize commitment level.
- Automated alerts if on-demand usage exceeds 20% of total (signals under-provisioning).


US-9.3 — Multi-Region Inference Deployment

As an infrastructure engineer, I want to deploy inference endpoints in both US and JP regions with region-local routing, So that JP users experience sub-100ms network latency and the system is resilient to single-region failures.

Acceptance Criteria:
- Intent classifier and reranker endpoints deployed in both us-east-1 and ap-northeast-1.
- Requests are routed to the nearest region by default.
- Cross-region failover: if a region's health check fails, traffic shifts to the other region within 60 seconds.
- Model versions are synchronized across regions via blue-green cross-region deployments.


Epic 10: Compliance, Audit, and Reproducibility

US-10.1 — Full Request Lineage and Reproducibility

As a compliance officer, I want to reproduce any historical chatbot response by replaying the exact inputs (user message, context, prompt version, model version, retrieved chunks) that generated it, So that we can investigate customer complaints and regulatory inquiries with full traceability.

Acceptance Criteria:
- Every response is logged with: trace ID, user message, assembled prompt (or hash), model version, prompt version, retrieved chunk IDs, raw LLM output, post-guardrail output.
- Replay tool: given a trace ID, re-runs the pipeline with the logged inputs and compares the output to the historical response.
- Retention: 90 days for full trace data, 1 year for metadata.


US-10.2 — Model and Prompt Change Audit Trail

As a compliance officer, I want to maintain an immutable audit trail of all model deployments, prompt changes, guardrail updates, and configuration changes, So that we can demonstrate governance and accountability for every change to the AI system.

Acceptance Criteria:
- Audit log entries include: who made the change, what changed, when, why (linked to a ticket/PR), and what evaluation results gated the change.
- Audit log is append-only and stored in a tamper-resistant store.
- Audit reports can be generated on demand for a given time range or model/component.


US-10.3 — PII Handling and Data Retention Compliance

As a privacy engineer, I want to ensure that PII in chat conversations is handled according to Amazon's data retention policies — redacted in logs, excluded from training data, and purged on schedule, So that the system complies with privacy regulations (GDPR, CCPA).

Acceptance Criteria:
- PII detection runs on all stored conversation data; detected PII is redacted before storage.
- Training data pipeline includes a PII stripping step with validation.
- Conversation data retention: raw data purged after 90 days; anonymized analytics retained for 2 years.
- User data deletion requests are honored within 30 days.


Story Priority Matrix

| Priority | Stories | Rationale |
| --- | --- | --- |
| P0 — Must have for launch | US-1.2, US-2.1, US-2.2, US-3.1, US-3.2, US-4.1, US-5.1, US-6.1, US-7.1, US-7.3, US-10.3 | Core safety, quality gates, and compliance |
| P1 — Must have within 30 days of launch | US-1.1, US-1.3, US-3.4, US-3.5, US-4.3, US-4.5, US-6.2, US-6.3, US-7.2, US-8.1, US-9.1 | Production readiness and operational maturity |
| P2 — V2 features | US-2.3, US-3.3, US-3.6, US-4.2, US-4.4, US-5.2, US-5.3, US-6.4, US-7.4, US-8.2, US-8.3, US-8.5, US-9.2, US-10.1, US-10.2 | Optimization, automation, and continuous improvement |
| P3 — V3 features | US-1.4, US-1.5, US-2.4, US-5.4, US-8.4, US-9.3 | Cost optimization and advanced automation |

Summary Metrics

| Category | Story Count |
| --- | --- |
| Model Lifecycle | 5 |
| Prompt Management | 4 |
| Evaluation & Quality Gates | 6 |
| Observability & Monitoring | 5 |
| RAG & Data Pipelines | 4 |
| Cost Governance | 4 |
| Guardrails & Safety | 4 |
| Feedback & Improvement | 5 |
| Infrastructure & Scaling | 3 |
| Compliance & Audit | 3 |
| Total | 43 |