US-06: RAG Pipeline Cost Optimization
User Story
As an ML platform engineer, I want to optimize OpenSearch Serverless OCU consumption and embedding costs in the RAG pipeline, so that vector search and embedding generation costs decrease by 30-50% without impacting retrieval quality.
Acceptance Criteria
- OpenSearch Serverless OCU auto-scaling is configured with appropriate min/max bounds.
- Embedding calls are reduced by caching query embeddings for repeated/similar queries.
- Chunk deduplication during indexing eliminates redundant vectors.
- Reranker calls are batched and cached to avoid redundant computation.
- RAG pipeline is bypassed entirely for intents that don't need retrieval.
- Total RAG pipeline costs decrease by 30-50%.
High-Level Design
Cost Problem
The RAG pipeline (LLD-3) has three cost drivers:
- OpenSearch Serverless: Minimum 2 OCUs for indexing + 2 OCUs for search = 4 OCUs × $0.24/hr ≈ $691/month minimum (720-hour month)
- Titan Embeddings V2: $0.02 per 1M input tokens. At 1M queries/day × ~50 tokens/query = $30/month
- Reranker (if using SageMaker): Endpoint cost similar to intent classifier
Baseline: ~$750-900/month
OpenSearch Serverless is the dominant cost — the 4 OCU minimum runs 24/7 even with zero traffic.
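A quick back-of-envelope sketch of the baseline above (a sketch only, assuming the story's 720-hour month and ~50 tokens/query; not a billing calculator):

OCU_RATE = 0.24             # $/OCU-hour
HOURS_PER_MONTH = 720       # the story's month basis
TITAN_PER_1M_TOKENS = 0.02  # $ per 1M input tokens

opensearch_floor = 4 * OCU_RATE * HOURS_PER_MONTH             # 2 index + 2 search OCUs
titan_monthly = (1_000_000 * 50 / 1_000_000) * TITAN_PER_1M_TOKENS * 30

print(f"OpenSearch floor: ${opensearch_floor:.0f}/month")     # ~$691
print(f"Titan embeddings: ${titan_monthly:.0f}/month")        # ~$30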
Optimization Architecture
graph TD
A[User Query] --> B{Intent needs<br>RAG?}
B -->|No: order_tracking,<br>chitchat, escalation| C[Skip RAG Pipeline<br>Zero RAG cost]
B -->|Yes: faq, product_question,<br>recommendation| D{Embedding<br>Cache?}
D -->|Hit| E[Use Cached Embedding]
D -->|Miss| F[Generate Embedding<br>Titan V2]
F --> G[Cache Embedding]
E --> H[OpenSearch KNN Search]
G --> H
H --> I{Reranker<br>Cache?}
I -->|Hit| J[Use Cached Rerank]
I -->|Miss| K[Cross-Encoder Rerank]
K --> L[Cache Reranked Result]
J --> M[Top 3 Chunks to LLM]
L --> M
style C fill:#2d8,stroke:#333
style E fill:#2d8,stroke:#333
style J fill:#2d8,stroke:#333
Savings Breakdown
| Technique | Reduction | Monthly Savings |
|---|---|---|
| Skip RAG for non-retrieval intents (~40%) | 40% fewer embedding + search calls | ~$280 |
| Embedding cache (20% hit rate) | 20% fewer Titan calls | ~$6 |
| Reduce OCU floor (off-peak) | Scheduled scaling saves ~30% | ~$207 |
| Chunk deduplication in index | 15% smaller index, faster search | ~$20 |
| Total | combined | ~$513/month (57%) |
Low-Level Design
1. RAG Bypass for Non-Retrieval Intents
Not every intent needs vector search. The Orchestrator skips the RAG pipeline entirely for deterministic intents.
graph LR
A[Intent] --> B{In RAG_REQUIRED_INTENTS?}
B -->|Yes| C[faq, product_question,<br>recommendation,<br>product_discovery]
C --> D[Run RAG Pipeline]
B -->|No| E[order_tracking,<br>return_request,<br>checkout_help,<br>chitchat, escalation,<br>promotion]
E --> F[Skip RAG<br>Use service data only]
style F fill:#2d8,stroke:#333
Code Example: RAG Gate
from enum import Enum
class IntentType(Enum):
PRODUCT_DISCOVERY = "product_discovery"
PRODUCT_QUESTION = "product_question"
RECOMMENDATION = "recommendation"
FAQ = "faq"
ORDER_TRACKING = "order_tracking"
RETURN_REQUEST = "return_request"
PROMOTION = "promotion"
CHECKOUT_HELP = "checkout_help"
ESCALATION = "escalation"
CHITCHAT = "chitchat"
RAG_REQUIRED_INTENTS = {
IntentType.FAQ,
IntentType.PRODUCT_QUESTION,
IntentType.RECOMMENDATION,
IntentType.PRODUCT_DISCOVERY,
}
class RAGGate:
"""Decide whether the RAG pipeline should run for a given intent."""
def should_retrieve(
self,
intent: IntentType,
confidence: float,
entities: dict,
) -> bool:
        # Low-confidence intent labels always run RAG as a safety net
        # (S1 finding: bypass requires intent confidence >= 0.88).
        if confidence < self.MIN_BYPASS_CONFIDENCE:
            return True
        # Always skip RAG for non-retrieval intents
        if intent not in RAG_REQUIRED_INTENTS:
            return False
# Skip RAG for product questions where ASIN is known
# and the product catalog has all needed data
if intent == IntentType.PRODUCT_QUESTION:
has_asin = bool(entities.get("asin"))
attribute = entities.get("attribute", "")
catalog_attributes = {"price", "format", "language", "pages",
"availability", "publisher"}
if has_asin and attribute in catalog_attributes:
return False # Catalog API is cheaper than RAG
return True
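Usage sketch (the ASIN is a hypothetical placeholder):

gate = RAGGate()
gate.should_retrieve(IntentType.CHITCHAT, 0.97, {})        # False: never retrieval-shaped
gate.should_retrieve(IntentType.CHITCHAT, 0.55, {})        # True: low confidence, RAG as safety net
gate.should_retrieve(
    IntentType.PRODUCT_QUESTION, 0.92,
    {"asin": "B0XXXXXXXX", "attribute": "price"},
)                                                          # False: catalog API answers it
gate.should_retrieve(IntentType.FAQ, 0.90, {})             # True: run the RAG pipeline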
2. Embedding Cache
Cache query embeddings to avoid redundant Titan calls.
sequenceDiagram
participant Orchestrator
participant EmbCache as Embedding Cache<br>(Redis)
participant Titan as Titan Embeddings V2
Orchestrator->>EmbCache: GET emb:{model_id}:{hash(query+intent)}
alt Hit
EmbCache-->>Orchestrator: Cached 1024-dim vector
else Miss
EmbCache-->>Orchestrator: null
Orchestrator->>Titan: Embed query
Titan-->>Orchestrator: 1024-dim vector
Orchestrator->>EmbCache: SET emb:{model_id}:{hash} (TTL: 1h)
end
Code Example: Embedding Cache
import hashlib
import json
import boto3
import numpy as np
import redis
class CachedEmbeddingClient:
"""Caches Titan embedding results to reduce API calls."""
    CACHE_TTL = 3600  # 1 hour
    MODEL_ID = "amazon.titan-embed-text-v2:0"  # pinned; baked into cache keys
def __init__(self, redis_client: redis.Redis, region: str = "ap-northeast-1"):
self._redis = redis_client
self._bedrock = boto3.client("bedrock-runtime", region_name=region)
def embed(self, text: str, intent: str = "") -> np.ndarray:
cache_key = self._cache_key(text, intent)
# Check cache first
cached = self._redis.get(cache_key)
if cached is not None:
return np.frombuffer(cached, dtype=np.float32)
# Call Titan Embeddings V2
response = self._bedrock.invoke_model(
modelId="amazon.titan-embed-text-v2:0",
contentType="application/json",
accept="application/json",
body=json.dumps({
"inputText": text,
"dimensions": 1536,
"normalize": True,
}),
)
result = json.loads(response["body"].read())
embedding = np.array(result["embedding"], dtype=np.float32)
# Cache the embedding (stored as raw bytes for efficiency)
self._redis.setex(
cache_key,
self.CACHE_TTL,
embedding.tobytes(),
)
return embedding
    def _cache_key(self, text: str, intent: str) -> str:
        # Key embeds the pinned model id so a model rotation orphans the old
        # keyspace (TTL-evicted) instead of serving stale vectors (S1 finding).
        normalized = f"{intent}:{text.lower().strip()}"
        digest = hashlib.sha256(normalized.encode()).hexdigest()[:24]
        return f"emb:{self.MODEL_ID}:{digest}"
3. OpenSearch Serverless OCU Management
OpenSearch Serverless charges per OCU-hour. The collection requires a minimum of 2 search OCUs + 2 index OCUs. Optimize by managing indexing schedules.
graph TD
subgraph "Indexing Schedule"
A[Product Descriptions] --> B[Re-index every 6 hours<br>or on catalog change]
C[FAQ Pages] --> D[Re-index daily<br>2am JST off-peak]
E[Editorial Content] --> F[Re-index weekly<br>Sunday 3am JST]
G[Review Summaries] --> H[Re-index daily<br>3am JST off-peak]
end
subgraph "OCU Optimization"
I[Peak Hours] --> J[Search OCUs: 2-4<br>Index OCUs: 2]
K[Off-Peak Hours] --> L[Search OCUs: 2<br>Index OCUs: 2<br>Batch indexing]
M[Batch Indexing Window<br>2am-4am JST] --> N[Bulk index all sources<br>during single window<br>Avoid spread-out OCU usage]
end
style N fill:#2d8,stroke:#333
Code Example: Batch Indexing Pipeline
import hashlib
import logging

import boto3
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection, helpers

logger = logging.getLogger()

# OpenSearch Serverless client. Requests must be SigV4-signed for the
# "aoss" service; an unsigned connection (http_auth=None) is rejected.
credentials = boto3.Session().get_credentials()
os_client = OpenSearch(
    hosts=[{"host": "xxxxxxxx.aoss.amazonaws.com", "port": 443}],
    http_auth=AWSV4SignerAuth(credentials, "ap-northeast-1", "aoss"),
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)
def batch_index_chunks(
chunks: list[dict],
index_name: str = "manga-knowledge-base",
batch_size: int = 500,
) -> dict:
"""Batch-index chunks to minimize OCU usage. Run during off-peak."""
total_indexed = 0
total_errors = 0
for i in range(0, len(chunks), batch_size):
batch = chunks[i : i + batch_size]
actions = [
{
"_op_type": "index",
"_index": index_name,
"_id": chunk["chunk_id"],
"_source": {
"content": chunk["content"],
"embedding": chunk["embedding"],
"source_type": chunk["source_type"],
"asin": chunk.get("asin"),
"category": chunk.get("category"),
"last_updated": chunk["last_updated"],
},
}
for chunk in batch
]
success, errors = helpers.bulk(
os_client, actions, raise_on_error=False
)
total_indexed += success
total_errors += len(errors)
logger.info(
f"Batch indexing complete: {total_indexed} indexed, "
f"{total_errors} errors"
)
return {"indexed": total_indexed, "errors": total_errors}
def deduplicate_chunks(chunks: list[dict]) -> list[dict]:
"""Remove near-duplicate chunks before indexing to reduce index size."""
seen_signatures: set[str] = set()
unique_chunks: list[dict] = []
    for chunk in chunks:
        # Hash the FULL content + source type; a 200-char prefix collides on
        # chunks that share an opening but diverge later (see S2 finding).
        sig = hashlib.sha256(
            f"{chunk['source_type']}:{chunk['content']}".encode()
        ).hexdigest()
        if sig not in seen_signatures:
            seen_signatures.add(sig)
            unique_chunks.append(chunk)
logger.info(
f"Deduplication: {len(chunks)} -> {len(unique_chunks)} chunks "
f"({len(chunks) - len(unique_chunks)} duplicates removed)"
)
return unique_chunks
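One way to pin the 2am-4am JST batch window is an EventBridge schedule in front of the indexing job. A minimal sketch; the rule name, target id, and Lambda ARN are hypothetical placeholders (02:00 JST is 17:00 UTC the previous day):

import boto3

events = boto3.client("events", region_name="ap-northeast-1")

events.put_rule(
    Name="rag-batch-index-window",              # hypothetical rule name
    ScheduleExpression="cron(0 17 * * ? *)",    # 17:00 UTC == 02:00 JST
    State="ENABLED",
    Description="Bulk re-index during the off-peak OCU window",
)
events.put_targets(
    Rule="rag-batch-index-window",
    Targets=[{
        "Id": "batch-indexer",
        "Arn": "arn:aws:lambda:ap-northeast-1:123456789012:function:rag-batch-indexer",
    }],
)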
4. Retrieval Optimization
Reduce the number of vectors searched and reranker calls.
graph TD
A[User Query + Intent] --> B[Pre-filter by<br>source_type and category]
B --> C[KNN Search<br>Narrower search space]
C --> D{Top result<br>score > 0.9?}
D -->|Yes| E[Skip Reranker<br>Use top 3 directly]
D -->|No| F[Reranker<br>Cross-encoder]
E --> G[Return Top 3 Chunks]
F --> G
style E fill:#2d8,stroke:#333
Code Example: Optimized Retrieval
from dataclasses import dataclass
from typing import Optional
import numpy as np
from opensearchpy import OpenSearch
@dataclass
class RetrievedChunk:
chunk_id: str
content: str
source_type: str
score: float
asin: Optional[str]
class OptimizedRetriever:
"""Cost-optimized RAG retrieval with pre-filtering and conditional reranking."""
HIGH_CONFIDENCE_THRESHOLD = 0.9
DEFAULT_TOP_K = 10
FINAL_CHUNKS = 3
def __init__(self, os_client: OpenSearch, index_name: str):
self._client = os_client
self._index = index_name
def retrieve(
self,
query_embedding: np.ndarray,
intent: str,
entities: dict,
reranker=None,
) -> list[RetrievedChunk]:
# Build pre-filter based on intent and entities
pre_filter = self._build_filter(intent, entities)
# KNN search with pre-filter (narrows search space)
results = self._client.search(
index=self._index,
body={
"size": self.DEFAULT_TOP_K,
"query": {
"knn": {
"embedding": {
"vector": query_embedding.tolist(),
"k": self.DEFAULT_TOP_K,
"filter": pre_filter,
}
}
},
"_source": ["content", "source_type", "asin", "chunk_id"],
},
)
chunks = [
RetrievedChunk(
chunk_id=hit["_source"]["chunk_id"],
content=hit["_source"]["content"],
source_type=hit["_source"]["source_type"],
score=hit["_score"],
asin=hit["_source"].get("asin"),
)
for hit in results["hits"]["hits"]
]
if not chunks:
return []
# Skip reranker if top results are high confidence
if chunks[0].score >= self.HIGH_CONFIDENCE_THRESHOLD:
return chunks[: self.FINAL_CHUNKS]
# Rerank only if needed
if reranker:
chunks = reranker.rerank(chunks)
return chunks[: self.FINAL_CHUNKS]
def _build_filter(self, intent: str, entities: dict) -> dict:
"""Build OpenSearch filter to narrow KNN search space."""
conditions = []
# Filter by source type based on intent
source_type_map = {
"faq": ["faq", "policy"],
"product_question": ["product", "review"],
"recommendation": ["product", "editorial"],
"product_discovery": ["product", "editorial"],
}
source_types = source_type_map.get(intent, [])
if source_types:
conditions.append({
"terms": {"source_type": source_types}
})
# Filter by ASIN if available
asin = entities.get("asin")
if asin:
conditions.append({"term": {"asin": asin}})
# Filter by category
genre = entities.get("genre")
if genre:
conditions.append({"term": {"category": genre}})
if not conditions:
return {"match_all": {}}
return {"bool": {"must": conditions}}
Index Size Optimization
Chunk Lifecycle Management
graph TD
A[New Content Published] --> B[Chunk and Embed]
B --> C[Index in OpenSearch]
C --> D[Chunk is active]
D --> E{Content updated<br>or deleted?}
E -->|Updated| F[Re-chunk and re-embed]
F --> G[Replace old chunks]
E -->|Deleted| H[Delete chunks<br>from index]
D --> I{Chunk age > 90 days<br>AND zero retrievals?}
I -->|Yes| J[Archive to S3<br>Delete from index]
I -->|No| D
style J fill:#2d8,stroke:#333
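A sketch of the archive-then-delete job implied by the lifecycle above. It assumes a retrieval_count usage counter maintained by the serving path and an archive bucket name, both hypothetical; deletes go through the bulk API rather than delete-by-query to stay within Serverless-supported operations:

import json

import boto3
from opensearchpy import OpenSearch, helpers

def archive_stale_chunks(os_client: OpenSearch, index: str,
                         bucket: str = "rag-chunk-archive") -> int:
    """Archive chunks older than 90 days with zero retrievals to S3, then delete."""
    s3 = boto3.client("s3")
    stale_query = {
        "bool": {
            "must": [
                {"range": {"last_updated": {"lt": "now-90d"}}},
                {"term": {"retrieval_count": 0}},  # hypothetical usage counter
            ]
        }
    }
    hits = os_client.search(index=index, body={"query": stale_query, "size": 1000})
    stale = hits["hits"]["hits"]
    for hit in stale:
        s3.put_object(Bucket=bucket, Key=f"chunks/{hit['_id']}.json",
                      Body=json.dumps(hit["_source"]))
    helpers.bulk(
        os_client,
        [{"_op_type": "delete", "_index": index, "_id": h["_id"]} for h in stale],
        raise_on_error=False,
    )
    return len(stale)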
Monitoring and Metrics
| Metric | Target | Alert |
|---|---|---|
| RAG bypass rate | ≥ 40% | < 30% |
| Embedding cache hit rate | ≥ 20% | < 10% |
| Reranker skip rate (high-confidence) | ≥ 30% | < 15% |
| OpenSearch search latency (P99) | < 200ms | > 500ms |
| Index size (vectors) | < 500K vectors | > 750K |
| Monthly RAG pipeline cost | ≤ $400 | > $600 |
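The bypass-rate metric can be emitted per decision so the alert is a plain average over the period. A sketch; namespace and metric names are hypothetical:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def emit_bypass_decision(bypassed: bool) -> None:
    # The average of 0/1 values over a period equals the bypass rate.
    cloudwatch.put_metric_data(
        Namespace="RAGPipeline",
        MetricData=[{
            "MetricName": "BypassRate",
            "Value": 1.0 if bypassed else 0.0,
            "Unit": "None",
        }],
    )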
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Pre-filtering too aggressive | Relevant chunks excluded from results | Monitor retrieval quality scores; loosen filters if recall drops |
| Skipping reranker on false-high scores | Lower quality chunk ordering | A/B test reranker skip vs. always-rerank; threshold tuning |
| Stale embeddings in cache | Query doesn't match updated content | Embedding cache TTL (1h) is shorter than index refresh (6h) |
| Deduplication removes valid variants | Loss of nuanced content | Use content + source_type for dedup signature, not just content |
Deep Dive: Why This Works on a Manga Chatbot Workload
The RAG pipeline is a paradoxical cost target: it is one of the smaller monthly bills (~$750-900) but one of the largest cost-per-request contributors (every retrieval is an embedding call + vector search + reranker pass). Cost optimization here is not about making RAG cheaper — it is about identifying which queries should never have triggered RAG at all, and stopping the work earlier for the rest. The 30–50% saving target is dominated by avoidance, not efficiency.
Property 1: Most chatbot queries are not retrieval-shaped. Greetings, account-status questions, "what time is it", order-tracking requests, and explicit follow-ups ("yes please", "the second one") collectively make up roughly 40–60% of chatbot traffic. None of these benefit from semantic retrieval — the answer is either templated, already in conversation state, or in a dedicated key-value lookup. Sending them through the RAG path costs 1 embedding call (Titan) + 1 OpenSearch query + 1 reranker call per turn, all of which add latency and dollars to deliver no marginal value. The RAG bypass gate (story line ~150s) is the highest-leverage saving in this story because it eliminates 100% of the per-call cost on the bypassed traffic. The architectural assumption is that the intent label (from US-02) is reliable enough to act as the gate; if intent precision drops, the gate either bypasses queries that needed retrieval (recall drops, hallucination rises) or routes non-retrieval queries through RAG (cost rises).
Property 2: Embeddings are deterministic and queries repeat. The same user question, asked twice (by different users or the same user across sessions), produces the same Titan embedding vector. The 1-hour embedding cache exploits this. Manga chatbot queries have a Zipf-shaped distribution (top questions repeat heavily: "is this in English", "when does volume N come out", "more like X") so cache hit rates of 30–50% are achievable on a warm cache. The architectural assumption is that the embedding model version is pinned — if Titan rotates the embedding space, every cached vector becomes stale and must be invalidated. The cache key must include model_id to make this explicit.
Property 3: Cross-encoder rerankers are 100× cost of bi-encoder retrieval, but only useful on ambiguous results. The retrieval pipeline produces a candidate set sorted by bi-encoder cosine score. When the top-1 candidate's score is well above the rest (e.g., > 0.9 absolute or > 1.5× margin over top-2), reranking will not change the ordering — paying for the cross-encoder pass is pure waste. The conditional-rerank pattern (story line ~280s) skips the reranker on high-confidence retrievals; the hard target is to skip on 30%+ of queries. The failure mode is false confidence — bi-encoder gives a high score to a wrong candidate (semantic similarity but wrong fact). The mitigation is sampling: a 1% audit sample always reranks for ground-truth comparison.
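A sketch of that skip decision, using the example thresholds from the paragraph above (both are uncalibrated defaults; see the S1 finding):

def should_skip_rerank(scores: list[float],
                       abs_threshold: float = 0.9,
                       margin: float = 1.5) -> bool:
    """Skip the cross-encoder when the bi-encoder ranking is unambiguous:
    top-1 above an absolute score, or well clear of top-2."""
    if not scores:
        return False
    if scores[0] >= abs_threshold:
        return True
    return len(scores) > 1 and scores[0] >= margin * scores[1]

should_skip_rerank([0.95, 0.80])  # True: absolute test
should_skip_rerank([0.85, 0.50])  # True: 1.7x margin over top-2
should_skip_rerank([0.85, 0.80])  # False: ambiguous, pay for the reranker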
Property 4: OpenSearch Serverless OCU is a step function, not a slope. OCU pricing is discrete: 4 OCU minimum (2 indexing + 2 search) is the floor regardless of utilization. Above the floor, OCUs scale up in 1-OCU increments to match load. This means cost optimization on a sub-floor workload is impossible without architectural change — the floor is paid whether you use 1% or 99% of it. The story's response is to maximize floor utilization: batch indexing (story line ~360s) runs 2am–4am JST when search OCUs are otherwise idle, cramming the day's index updates into a window that doesn't compete with serving traffic. This does not save money directly (the floor was paid anyway); it allows the search OCUs to never need to scale up for indexing work, keeping total OCU consumption within the floor as long as possible.
Bottom line: RAG bypass is the dominant lever (40% of traffic × full RAG cost = ~30% saving alone). Embedding cache adds 5-10% on top. Conditional rerank adds another 5-10%. Index dedup and chunk lifecycle are storage-side savings (smaller absolute numbers but compound over time as the catalog grows). The OCU floor is the hard architectural lower bound — you cannot get below ~$691/month on OpenSearch Serverless without migrating to OpenSearch managed clusters (different cost shape, different ops model).
Real-World Validation
Industry Benchmarks & Case Studies
- AWS OpenSearch Serverless pricing page — 4 OCU minimum (2 indexing + 2 search) at $0.24/OCU-hour = $691.20/month floor for a single collection. ✅ matches the story's ~$691 baseline. Note: the 2024 pricing model added a "search and indexing combined OCU" option for some collections — verify which model the deployed collection uses.
- Amazon Titan Text Embeddings V2 pricing — $0.00002 per 1,000 tokens ($0.02 per 1M tokens). At ~256 tokens per query × 1M queries/day × 30 days × $0.02/1M = ~$153/month before any caching; a 30–50% cache hit rate brings that to ~$77-107/month. The story's "~$30 Titan cost" rests on its ~50 tokens/query assumption (see Cost Problem above); flag the token-length assumption for re-measurement against production traffic.
- Cohere Rerank API benchmarks (public docs) — Cross-encoder rerank latency 100–300 ms per call; cost per million reranks ~$1–2. The "100× cost of bi-encoder retrieval" claim aligns with Cohere's own published comparison.
- Pinecone / Weaviate engineering posts on hybrid retrieval — Both report 30–50% bi-encoder-only sufficiency rates on FAQ-style traffic, validating the conditional-rerank pattern.
- LlamaIndex / LangChain observability case studies — Embedding cache hit rates of 30–50% are reported as typical on customer-support chatbot workloads; manga-store traffic should be similar or higher (more keyword-driven repetition).
- Internal cross-reference: RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md — Documents the broader RAG architecture (RRF fusion, cross-encoder reranker, blue/green re-index); this story is the cost-leaning operating point of that architecture.
- Internal cross-reference: POC-to-Production-War-Story/02-seven-production-catastrophes.md — The "RAG recall collapse" catastrophe was caused by aggressive pre-filtering; informs the recall-monitoring guard in this story.
Math Validation
- OpenSearch Serverless floor: 4 OCU × $0.24/hr × 720 hr = $691.20/month. ✅ matches story.
- Titan Embed V2: 1M queries × 256 tokens = 256M tokens/day × $0.02/1M = $5.12/day = ~$154/month uncached. With 40% cache hit rate: ~$92/month. The story's "~$30/month Titan" holds only under its ~50 tokens/query assumption (or an 80%+ cache hit rate at 256 tokens); flag for re-derivation against actual query length and volume.
- Rerank: at $1/M reranks and 30% skip rate, 700K reranks/day × 30 = 21M reranks/month × $1/M = $21/month. Negligible at this volume.
- Total optimized: ~$700 (OpenSearch) + ~$90 (Titan) + ~$20 (rerank) = ~$810/month. Story's "~$400 monthly RAG cost" target is aggressive; achievable only if (a) Titan cost is genuinely $30 (high cache hit), or (b) OCU floor is reduced via combined-OCU model. Recommend the story make the OCU model assumption explicit.
Conservative vs Aggressive Savings Bounds
| Bound | Source | Total monthly savings |
|---|---|---|
| Conservative | 30% bypass + 25% embedding cache hit + no rerank skip | ~25% (~$200/month) |
| Aggressive | 50% bypass + 50% embedding cache hit + 40% rerank skip + chunk dedup | ~50% (~$400/month) |
| Story's projection | 30–50% target; bypass rate is the dominant lever and depends on US-02 intent precision | ~$225–375/month (realistic) |
Cross-Story Interactions & Conflicts
This story is the authoritative side for several edges.
- US-04 (Compute) — Authoritative side: this story owns the OCU contract. Conflict mode: when US-04's auto-scaler scales Fargate to zero (or near-zero) overnight, OpenSearch Serverless still bills the 4-OCU floor. This is by design — the floor pays for indexing capacity used by the 2am–4am batch window. Resolution: US-04's scale-to-zero must not assume OpenSearch follows; the OCU floor is sunk cost and the batch indexing window justifies it.
- US-08 (Traffic-Based) — Authoritative side: this story owns the bypass logic. Conflict mode: under US-08's DEGRADED state, more queries should bypass RAG to reduce cost and latency. The bypass gate must read US-08's degradation_level and lower the bypass-eligibility threshold accordingly. Resolution: the bypass gate has a degradation_level-aware threshold matrix: at level 0 (Normal), bypass on intent-match; at level 2+ (Degraded), bypass on any non-RAG-shaped intent regardless of confidence.
- US-03 (Caching) — See US-03. The embedding cache lives on the Redis tier under prefix emb: with 1h TTL. Per-key size 1–4 KB (int8 vs. float32 at 1024 dims); storing 50K float32 embeddings is ~200 MB of Redis memory. This is a non-trivial chunk of the 375 MB working-set estimate in US-03 — the embedding cache deserves a dedicated logical Redis DB if memory pressure becomes a concern. Quantize to int8 (~1 KB/key) if the float32 footprint is too large.
- US-02 (Intent Classifier) — Indirect interaction. Conflict mode: RAG bypass eligibility is keyed on intent label. If US-02's intent precision drops, bypass either fires for retrieval-shaped intents (recall drops, hallucination rises) or fails to fire for non-retrieval intents (cost rises). Resolution: bypass rate is a derived metric; track it as bypass_correctness = (correctly_bypassed / total_bypassed) via sampled audit.
Rollback & Experimentation
Shadow-Mode Plan
- RAG bypass: deploy in observe mode for 2 weeks — log bypass decisions but always run the full RAG pipeline; compare LLM responses with-vs-without retrieval context using LLM-as-judge or human review on a 1K sample. Promote to live only if no-retrieval responses are rated equivalent on ≥ 90% of bypass decisions.
- Embedding cache: deploy with cache-aside in shadow — every cache hit triggers a parallel uncached fetch and asserts equality. Promote after 1 week of zero divergence.
- Conditional rerank: shadow with always-rerank in parallel for 1 week; measure top-K result divergence between rerank-skipped and rerank-applied flows.
Canary Thresholds
- Bypass rate ramp: 10% → 25% → 40% over 3 weeks.
- Reranker skip threshold: start at score > 0.95 (very conservative), drop to 0.9 over 4 weeks if quality holds.
- Abort criteria (any one trips): retrieval recall@10 (sampled audit) drops > 3 percentage points, hallucination rate (sampled audit) rises > 2 percentage points, OpenSearch search p99 latency rises > 25%.
Kill Switch
- Three flags: rag_bypass_enabled (disable to force all traffic through RAG), rag_embedding_cache_enabled (disable to force a fresh embedding on every query), rag_conditional_rerank_enabled (disable to force always-rerank). Each flippable via SSM in < 2 minutes.
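A sketch of the flag read; the parameter path is a hypothetical convention, and callers should cache the value for ~30 s rather than hitting SSM on the hot path:

import boto3

ssm = boto3.client("ssm", region_name="ap-northeast-1")

def flag_enabled(name: str) -> bool:
    value = ssm.get_parameter(Name=f"/rag/flags/{name}")["Parameter"]["Value"]
    return value.lower() == "true"

if not flag_enabled("rag_bypass_enabled"):
    pass  # force every query through the full RAG pipeline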
Quality Regression Criteria (story-specific)
- Retrieval recall@10 (on a frozen golden set of 500 manga queries with known relevant chunks): ≥ baseline − 2 percentage points.
- Hallucination rate (sampled audit on RAG responses): ≤ 2%.
- Bypass correctness (sampled): ≥ 95%.
- Reranker-skip-induced top-1 divergence (sampled): ≤ 5%.
Multi-Reviewer Validation Findings & Resolutions
The cross-reviewer pass identified the following story-specific findings. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.
S1 (must-fix before production)
Reranker skip threshold (0.9) is uncalibrated and not a probability. Bi-encoder cosine scores are not calibrated probabilities; "0.9" is meaningless without reference to embedding model + corpus. Hardcoded thresholds drift silently as the corpus evolves. Resolution: calibrate threshold via held-out validation set — measure precision-recall curve of "skip reranker" decisions; pick threshold at ≤ 5% top-1-divergence false-positive rate. Recalibrate monthly via scheduled job; alert if recalibration changes threshold by > 0.05 (corpus drift signal). Until calibration is in place, default to "always rerank" — disable the skip path entirely.
RAG bypass produces silent hallucination on misclassified intents. When the intent classifier wrongly tags a product question as chitchat, RAG is bypassed and the LLM hallucinates manga facts confidently. Failure is invisible — no error, no alert, just a wrong answer. Resolution: (a) require intent confidence ≥ 0.88 for any bypass decision (lower confidence → force RAG as safety net); (b) periodic 1% audit sample where bypass decisions are checked against retrieved-context answers using LLM-as-judge or human review; (c) bypass_hallucination_rate metric, alert at > 2%.
Embedding cache key omits model_version. When Titan rotates V2 → V3 (or any minor revision changes the embedding space), every cached vector becomes stale silently. Resolution: cache key prefix becomes emb:{model_id}:{model_version}:{normalized_query_hash}; on model rotation, the entire prior keyspace becomes unreachable (auto-orphaned for TTL eviction); no manual invalidation required.
PII in embedded queries (GDPR concern). Embeddings are quasi-reversible (Carlini et al. 2021); storing embeddings of "my email is X" creates PII by another name. Resolution: PII redaction before embedding (same redaction used for US-05 TURN writes); for queries tagged customer_sensitive=true (account/order related), bypass the embedding cache entirely.
Pre-filter false-negative risk on intent misclassification. Pre-filtering OpenSearch by intent narrows retrieval to specific source_type partitions. If intent is wrong, the relevant partition is skipped; recall drops to zero on that query. Resolution: pre-filter recall test against a 500-prompt golden set; require recall@10 ≥ 95% before promotion. Add "soft pre-filter" mode — score boost on intended source_type but no exclusion — for low-confidence intents.
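A minimal sketch of the soft pre-filter mode: boost rather than exclude, applied client-side to the retrieved candidates (RetrievedChunk from section 4) so a wrong intent degrades ranking instead of zeroing recall. The helper name and boost factor are hypothetical:

def apply_soft_prefilter(chunks: list[RetrievedChunk],
                         preferred_source_types: set[str],
                         boost: float = 1.2) -> list[RetrievedChunk]:
    # Score boost on the intended partitions; nothing is excluded.
    for c in chunks:
        if c.source_type in preferred_source_types:
            c.score *= boost
    return sorted(chunks, key=lambda c: c.score, reverse=True)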
S2 (fix before scale-up)
OCU model assumption unstated. Pricing assumes 4-OCU floor (2 indexing + 2 search legacy model); 2024 combined-OCU model has different minimum. Resolution: explicitly state which OCU model the deployed collection uses; verify in OpenSearch console; restate baseline cost accordingly.
Titan cost claim ($30/month) requires 80%+ cache hit rate. Story acceptance criteria targets only 20% hit rate, which gives ~$120/month, not $30. Resolution: pick one — either (a) tighten the AC to 50%+ hit rate (achievable with longer TTL + better key normalization), or (b) restate the cost target at $90–120/month. Pricing math must reconcile with the AC.
Chunk dedup signature uses first 200 chars + source_type. Two chunks with identical openings but divergent content collide and one is dropped silently. Resolution: use SHA-256 of full chunk content + source_type as dedup signature. Near-duplicate detection (semantic similarity above threshold) is harder and deferred to S3.
RAG fallback when OpenSearch unavailable. OCU floor is sunk cost but the cluster can still go unhealthy. Without fallback, all retrieval-shaped queries fail. Resolution: maintain a small DDB "FAQ index" (top-1000 most-retrieved chunks, refreshed weekly) as cold fallback; on OpenSearch error rate > 5% over 1 minute, route retrieval to DDB FAQ; user gets "best-effort" experience instead of failure.
S3 (acknowledged / future work)
- Embedding cache int8 quantization (~1 KB/key at 1024 dims) to cut the float32 footprint to a quarter; recommended in US-03's working-set update. A sketch follows this list.
- Cross-encoder reranker conditional skip suspended — see S1 fix; revisit after calibration.
- Drift detection on intent distribution and embedding-space centroid (monthly).
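For the int8 item, a sketch of the quantize/dequantize pair (symmetric max-abs scaling; reduces the ~4 KB float32 payload to ~1 KB plus a 4-byte scale):

import numpy as np

def quantize_int8(vec: np.ndarray) -> tuple[bytes, float]:
    # Symmetric quantization: map [-max|v|, +max|v|] onto [-127, 127].
    scale = float(np.abs(vec).max()) / 127.0 or 1.0
    return (vec / scale).astype(np.int8).tobytes(), scale

def dequantize_int8(raw: bytes, scale: float) -> np.ndarray:
    return np.frombuffer(raw, dtype=np.int8).astype(np.float32) * scale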