Scenario File 2 — Vector Store (RAG Knowledge Base)
Context in the Architecture
The Vector Store is the backbone of the RAG pipeline: the component that retrieves relevant knowledge chunks before the LLM generates an answer.
- Write path (offline): Embeddings are written during index refresh runs (6hr, daily, weekly depending on source type).
- Read path (online): On every FAQ, policy, recommendation, or product Q&A intent, a KNN search is executed against ~100K–500K indexed chunks.
- Latency budget: The vector search must complete in <150ms to stay within the chatbot's 2-second P95 target.
- Precision vs. recall tradeoff: Top-10 candidates are retrieved and reranked to top-3. If the top-10 is wrong, the LLM has bad context and hallucinates.
Current Choice: OpenSearch Serverless (Vector Store)
Why it was chosen: Managed service (no cluster sizing), native HNSW vector indexing, supports hybrid search (vector + BM25 keyword), integrates with Amazon Bedrock Knowledge Bases, and scales automatically with query volume.
Index schema recap:
{
"embedding": { "type": "knn_vector", "dimension": 1536, "method": { "name": "hnsw" } },
"content": { "type": "text" },
"source_type": { "type": "keyword" },
"asin": { "type": "keyword" }
}
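To make the online read path concrete, here is a minimal sketch of the top-10 retrieval against this schema, assuming the opensearch-py client; the collection endpoint, index name, and the SigV4 auth setup (omitted for brevity) are illustrative assumptions, and filtering semantics (pre- vs post-filter) depend on the engine configuration:
from opensearchpy import OpenSearch

# Assumed endpoint and index name; OpenSearch Serverless also requires SigV4 auth, omitted here.
client = OpenSearch(
    hosts=[{"host": "example-collection.ap-northeast-1.aoss.amazonaws.com", "port": 443}],
    use_ssl=True,
)

def retrieve_top10(query_embedding, source_type):
    # k-NN search over the HNSW index, restricted to one source_type.
    body = {
        "size": 10,
        "query": {
            "bool": {
                "filter": [{"term": {"source_type": source_type}}],
                "must": [{"knn": {"embedding": {"vector": query_embedding, "k": 10}}}],
            }
        },
    }
    hits = client.search(index="knowledge-chunks", body=body)["hits"]["hits"]
    return [h["_source"]["content"] for h in hits]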
Alternative 1: Pinecone
What Changes
OpenSearch is replaced by Pinecone, a purpose-built vector database. Embedding writes go to pinecone.upsert(), searches use pinecone.query(vector, top_k=10, filter={"source_type": "faq"}).
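A minimal sketch of both paths, assuming the current Pinecone Python client; the index name and metadata fields are illustrative, and faq_embedding / query_embedding stand in for 1536-dim vectors:
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("mangaassist-knowledge")  # assumed index name

# Write path: batched upsert of (id, embedding, metadata).
index.upsert(vectors=[
    {"id": "faq-0001#0", "values": faq_embedding, "metadata": {"source_type": "faq"}},
])

# Read path: top-10 ANN search restricted to FAQ chunks via a metadata filter.
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"source_type": {"$eq": "faq"}},
    include_metadata=True,
)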
Best Case
- Zero operational overhead: fully managed SaaS, no index tuning required.
- Metadata filtering (e.g., by source_type, category) is natively supported and fast.
- Sub-50ms query latency for 1M-vector indexes with proper pod sizing.
- Ideal if the team has no search infrastructure expertise.
Failure Scenario — The Data Residency Blocklist
What happens: Amazon Japan Legal reviews the architecture for compliance with the APPI (Japan's Act on the Protection of Personal Information). Pinecone's US-based control plane means embedded product descriptions and FAQ content (which may contain seller-identifiable information) flow through Pinecone's SaaS infrastructure outside Japan/AWS.
Amazon's data governance policy prohibits customer-related data leaving AWS infrastructure. Pinecone is blocklisted. The entire RAG indexing pipeline must be re-implemented on OpenSearch within 4 weeks before the product launch.
Cost of the pivot: 3 engineers × 4 weeks = 12 engineer-weeks. The chatbot launches 2 months late.
The lesson: For Amazon-internal products, compliance and data residency constraints will always favor AWS-native services. Third-party SaaS vector databases are a risk.
Failure Scenario 2 — The Namespace Explosion
What happens: The team decides to shard by content type: one Pinecone namespace per source_type (faq, product_description, policy, editorial, review). That's 5 namespaces.
Six months later, the catalog team adds JP-region-specific policies, English-translated product descriptions, and review summaries per genre. Namespaces grow to 27. Cross-namespace queries (e.g., search products AND policy chunks for a checkout help question) require 27 parallel Pinecone queries and a client-side merge, adding 200ms latency.
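To illustrate the fan-out cost, a sketch of the client-side merge this forces, reusing the index handle from the Pinecone sketch above; the namespace names and the merge policy are assumptions, and the exact response shape depends on the client version:
from concurrent.futures import ThreadPoolExecutor

NAMESPACES = ["faq", "product_description", "policy", "editorial", "review"]  # ...eventually 27

def cross_namespace_search(query_embedding, top_k=10):
    # One Pinecone query per namespace, executed in parallel.
    with ThreadPoolExecutor(max_workers=len(NAMESPACES)) as pool:
        responses = pool.map(
            lambda ns: index.query(vector=query_embedding, top_k=top_k,
                                   namespace=ns, include_metadata=True).matches,
            NAMESPACES,
        )
    # Client-side merge: flatten the per-namespace results and re-sort by similarity score.
    merged = [m for matches in responses for m in matches]
    return sorted(merged, key=lambda m: m.score, reverse=True)[:top_k]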
Grilling Questions
- Pinecone bills per pod-hour. Your index has 500K vectors at 1536 dimensions. You burst to 10,000 QPS during a flash sale. How many pods do you need, and have you estimated the cost spike?
- A Pinecone upsert fails mid-batch during a nightly re-index. Pinecone does not have transactions. How do you detect partial index corruption and recover?
- If Pinecone has a regional outage, what is your RAG fallback? Does the chatbot degrade gracefully or does every FAQ question return "I don't know"?
Decision Heuristic
Use Pinecone for startups or proof-of-concepts where speed to production and simplicity matter more than compliance. At Amazon scale, data residency requirements will eliminate it.
Alternative 2: pgvector (PostgreSQL Extension)
What Changes
Aurora PostgreSQL with the pgvector extension stores embeddings as a vector(1536) column in the same database as other relational data.
CREATE TABLE knowledge_chunks (
chunk_id VARCHAR(64) PRIMARY KEY,
content TEXT,
embedding vector(1536),
source_type VARCHAR(32),
asin VARCHAR(16),
last_updated TIMESTAMP
);
CREATE INDEX idx_chunks_embedding ON knowledge_chunks
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
Queries: SELECT content FROM knowledge_chunks ORDER BY embedding <=> $1 LIMIT 10;
Best Case
- If you already run Aurora PostgreSQL for other services, zero additional infrastructure.
- Hybrid queries are natural SQL JOINs: "find chunks for products that are currently in stock" is one query joining the knowledge_chunks and products tables (sketched after this list).
- Strong consistency: vectors and metadata are ACID-compliant.
- Cost: no separate vector DB; Aurora compute covers embedding queries.
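A minimal sketch of that hybrid query from application code, assuming the pgvector Python helper package; the DSN, the products table, and its in_stock column are illustrative assumptions:
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=mangaassist")  # assumed DSN
register_vector(conn)  # lets numpy float32 arrays bind directly to vector(1536) columns

def in_stock_chunks(query_embedding, limit=10):
    # ANN search and relational filter in one statement: nearest chunks whose ASIN is in stock.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT k.content
            FROM knowledge_chunks k
            JOIN products p ON p.asin = k.asin
            WHERE p.in_stock
            ORDER BY k.embedding <=> %s
            LIMIT %s
            """,
            (query_embedding, limit),
        )
        return [row[0] for row in cur.fetchall()]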
Failure Scenario — The IVFFlat Cold Start on Index Rebuild
What happens: The MangaAssist catalog grows to 200,000 product ASINs. Each ASIN has 2 chunks. Plus 50,000 FAQ and policy chunks. Total: 450,000 vectors.
The IVFFlat index now needs lists ≈ sqrt(450,000) ≈ 671 (a common sizing heuristic), so it has to be rebuilt. The index build runs:
CREATE INDEX CONCURRENTLY idx_chunks_embedding ON knowledge_chunks ...
This takes 47 minutes on a db.r6g.xlarge, and because the old index is dropped before the rebuild, vector queries fall back to sequential scans for the duration. A sequential scan over 450K vectors at 1536 dimensions takes 4.2 seconds per query instead of 80ms.
The production incident: The nightly re-index runs at 2am JST. It finishes at 2:47am. During those 47 minutes, every RAG query for the FAQ intent is timing out (>2s SLA broken). The chatbot falls back to LLM-only responses without grounding. Hallucination rate spikes. Price accuracy guardrail fires 3× its baseline rate.
Amplifier: The team triggers an emergency re-index at 6pm (ahead of the flash sale). The CONCURRENTLY keyword is forgotten. A plain CREATE INDEX takes a SHARE lock on knowledge_chunks, blocking all writes for 47 minutes during peak traffic, while reads continue to limp along on 4.2-second sequential scans.
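For reference, a sketch of the non-blocking rebuild the incident called for: build the new IVFFlat index CONCURRENTLY under a temporary name, then swap it in. The index names are assumptions; note that CREATE/DROP INDEX CONCURRENTLY refuse to run inside a transaction block, hence autocommit:
import psycopg2

conn = psycopg2.connect("dbname=mangaassist")
conn.autocommit = True  # required for CONCURRENTLY statements
with conn.cursor() as cur:
    # Build the replacement index online (slower, but reads and writes keep flowing).
    cur.execute("""
        CREATE INDEX CONCURRENTLY idx_chunks_embedding_new ON knowledge_chunks
        USING ivfflat (embedding vector_cosine_ops) WITH (lists = 671)
    """)
    # Swap: drop the old index and take over its name.
    cur.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_chunks_embedding")
    cur.execute("ALTER INDEX idx_chunks_embedding_new RENAME TO idx_chunks_embedding")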
Failure Scenario 2 — The HNSW Memory Requirement
The team switches from IVFFlat to HNSW (better recall at the cost of more RAM):
CREATE INDEX ON knowledge_chunks USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
HNSW requires keeping the entire graph in memory: 450K vectors × 1536 dims × 4 bytes × overhead factor ≈ 4.2GB RAM just for the index. The Aurora instance is a db.r6g.large (16GB RAM), so the HNSW index occupies roughly 26% of RAM, competing with query working memory and the buffer cache. Under load, memory pressure forces index pages out of cache, roughly doubling vector query latency.
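The memory estimate above, worked through; the 1.5× graph-overhead factor for m = 16 is an assumption:
vectors, dims, bytes_per_float, overhead = 450_000, 1536, 4, 1.5
index_bytes = vectors * dims * bytes_per_float * overhead
print(f"{index_bytes / 1e9:.1f} GB")  # ~4.1 GB, roughly a quarter of a 16GB db.r6g.large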
Grilling Questions
- Your embedding model changes from Titan Embeddings (1536-dim) to a new model (3072-dim). You need to re-embed all 450K chunks. How long does this take, and can you do it without taking the RAG pipeline offline?
- pgvector does not support distributed scaling. Your query rate grows to 5,000 vector searches/min, and a single Aurora instance handles ~500 concurrent vector queries before CPU saturates. How do you scale reads?
- The reranker uses a cross-encoder that calls out to a SageMaker endpoint. If SageMaker is slow, the total RAG latency grows. Does using pgvector make this problem better or worse compared to a purpose-built vector DB?
Decision Heuristic
Use pgvector when: (a) corpus size is <200K vectors, (b) you need transactional consistency between vector metadata and relational data, and (c) query rate is <2,000/min. Beyond these thresholds, a purpose-built vector store is necessary.
Alternative 3: Weaviate
What Changes
Weaviate is an open-source vector database, self-hosted on ECS/EKS or used as Weaviate Cloud. It stores objects with properties and vectors, supporting hybrid search (BM25 + vector) natively.
{
"class": "KnowledgeChunk",
"vectorizer": "none",
"properties": [
{ "name": "content", "dataType": ["text"] },
{ "name": "sourceType", "dataType": ["string"] },
{ "name": "asin", "dataType": ["string"] }
]
}
Query: nearVector: { vector: [...], certainty: 0.7 } with where: { sourceType: "faq" }.
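The same query from application code, as a sketch assuming the v3 weaviate-client; the endpoint and certainty threshold are illustrative and query_embedding stands in for a 1536-dim vector:
import weaviate

client = weaviate.Client("http://weaviate.internal:8080")  # assumed self-hosted endpoint

result = (
    client.query.get("KnowledgeChunk", ["content", "sourceType", "asin"])
    .with_near_vector({"vector": query_embedding, "certainty": 0.7})
    .with_where({"path": ["sourceType"], "operator": "Equal", "valueString": "faq"})
    .with_limit(10)
    .do()
)
chunks = result["data"]["Get"]["KnowledgeChunk"]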
Best Case
- Built-in hybrid search combining BM25 keyword and vector similarity in a single query. Critical for manga-specific queries: for "Show me Demon Slayer volume 12 availability", the keyword "Demon Slayer" anchors the search while the embedding finds semantically related chunks.
- GraphQL API: queries are expressive and easy to debug.
- Module system: the text2vec-aws module can call Amazon Bedrock Titan Embeddings at query time, simplifying the embedding pipeline.
Failure Scenario — Self-Hosted Cluster Management During a Traffic Spike
What happens: The team deploys a 3-node Weaviate cluster on ECS. During a flash sale, query rate spikes from 500/min to 8,000/min. Weaviate's HNSW index is in-memory; each node holds the full index.
The cluster runs with a replication factor of 3 for high availability. As query load spikes, CPU saturates on all 3 nodes simultaneously. Auto-scaling adds a 4th node, but that node must receive the full object and vector data over the internal VPC link and rebuild its in-memory HNSW graph before it can serve queries. Bringing it online takes 12 minutes for roughly 4.2GB of data.
During those 12 minutes, the 3 existing nodes handle all queries at 3× normal load. Two nodes go OOM, restart, and are temporarily unavailable. Quorum is lost. Weaviate rejects writes and read consistency degrades.
Result: RAG pipeline returns stale or partial results for 12 minutes. LLM hallucinates on chunks that were in the middle of a batch re-index.
What makes this different from OpenSearch Serverless: OpenSearch Serverless scales transparently without quorum exposure. You don't manage replication or node bootstrapping.
Grilling Questions
- Weaviate stores the HNSW graph entirely in RAM. You need to upgrade the node type to handle 1M vectors. How do you do this without a RAG outage?
- The Weaviate text2vec-aws module calls Bedrock Titan Embeddings at query time. At 8,000 RAG queries/min, how many Bedrock invocations is that, and what is the cost? Is this cheaper or more expensive than pre-computing embeddings?
- Weaviate's multi-tenancy model isolates data per tenant in separate vector spaces. If you add per-region manga catalogs (JP, US, UK) as tenants, how does this affect cross-region query latency?
Decision Heuristic
Use Weaviate when you need hybrid search (keyword + vector in one query), have the operational capacity to manage a self-hosted cluster, and corpus size is under 5M vectors. OpenSearch Serverless is simpler for teams that want managed infrastructure.
Alternative 4: In-Memory FAISS (Local to Each Orchestrator Task)
What Changes
No external vector database. Each Orchestrator ECS task loads the entire FAISS index into memory on startup (~500MB for a quantized index of 450K chunks; a raw float32 flat index of 1536-dim vectors would be ~2.8GB). Vector search is a pure in-process function call: latency <10ms without any network hop.
import faiss

# Loaded once at task startup from the local copy of the index synced from S3.
index = faiss.read_index("knowledge_base.index")
# query_vector is a float32 array of shape (1, 1536); returns distances D and chunk ids I.
D, I = index.search(query_vector, k=10)
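For completeness, a sketch of the offline step that produces knowledge_base.index. The embeddings file, the flat index type, and the assumption that embeddings are L2-normalized (so inner product approximates cosine similarity) are illustrative; a quantized index would be needed to reach the ~500MB in-task footprint mentioned above:
import faiss
import numpy as np

embeddings = np.load("chunk_embeddings.npy").astype("float32")  # assumed shape: (n_chunks, 1536)
index = faiss.IndexFlatIP(1536)  # exact inner-product search; assumes L2-normalized embeddings
index.add(embeddings)
# A product-quantized variant (e.g., IndexIVFPQ) would shrink the footprint at some recall cost.
faiss.write_index(index, "knowledge_base.index")  # the indexing pipeline then publishes this to S3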
Best Case
- Zero network latency for vector search: sub-10ms guaranteed.
- Zero external dependency: RAG works even if the network is partitioned.
- Perfect for very small knowledge bases (<50K chunks) where the index fits in memory cheaply.
Failure Scenario — The Stale Index Problem at Scale
What happens: The FAQ team updates the return policy at 3pm (before the flash sale). The Bedrock indexing pipeline re-computes embeddings and writes a new FAISS index to S3 at 3:15pm.
But 60 ECS tasks are running, each with the old FAISS index loaded in memory. A task only picks up the new index when it restarts. The deployment strategy is rolling restart with a 5-minute grace period per batch of 10 tasks. All 60 tasks are updated in 30 minutes.
For 30 minutes, users asking about the return policy get answers grounded in the old policy. Guardrails do not catch policy-level staleness (only price and ASIN validation). Users are given incorrect return policy information.
Amplifier: The support team runs a daily marketing push ("no-hassle returns!") at 3pm, the same time as the policy update. The mismatch between what the chatbot says and what the marketing email says generates 3,000 support tickets in 4 hours.
Failure Scenario 2 — Memory Ceiling Per Task
Each Orchestrator task is sized at 2GB RAM. The FAISS index takes 500MB; LLM response buffering, context loading, and prompt construction take another 800MB at peak, leaving roughly 700MB of headroom. Under flash sale load that headroom is exhausted and Fargate OOM-kills tasks. Each restarted task spends 8 seconds reloading the index, during which it is unhealthy.
Grilling Questions
- You add an hourly S3 sync that restarts tasks if index freshness exceeds 60 minutes. This means at flash sale peak, a task restart loop happens every hour. How do you coordinate restarts so at least 50 of 60 tasks are healthy at all times?
- FAISS does not support incremental index updates. Every policy change requires re-embedding and rebuilding the full index. At 450K chunks, how long does a full rebuild take? Is that acceptable for a "return policy just changed" scenario?
- FAISS exact search (IndexFlatL2) is O(n); IVF indexes are approximate and roughly O(sqrt(n)) with suitable nlist/nprobe settings. At 450K vectors, what is the latency difference? When does approximate become too approximate for return-policy queries where precision matters?
Decision Heuristic
Use FAISS in-process only for static, small (<50K chunk) knowledge bases where the corpus changes at most weekly. For a production RAG pipeline with frequent catalog updates and 100K+ vectors, an external vector database with incremental indexing is mandatory.
Master Summary Table
| Choice | Query Latency | Index Freshness | Operational Complexity | Key Failure Risk |
|---|---|---|---|---|
| OpenSearch Serverless (current) | 50-150ms | Near-real-time (event-driven) | Low (managed) | Cold start cost on serverless scale-up |
| Pinecone | 30-80ms | Near-real-time | Very low (SaaS) | Data residency compliance block |
| pgvector | 80-4200ms (depends on index state) | Blocking during rebuild | Low (existing infra) | Index rebuild takes 45+ min offline |
| Weaviate | 40-100ms | Near-real-time | High (self-hosted cluster) | OOM during replication on scale-up |
| FAISS in-process | <10ms | Stale until task restart (up to 30 min) | Low (no infra) | Policy staleness window creates incorrect answers |