End-to-End Integration Testing Scenarios
Full pipeline test scenarios that validate the complete flow from user query to final response. Each scenario exercises the interactions among all components — intent classifier, retriever, prompt builder, LLM, guardrails, and response formatter — wired together as one system.
These tests catch the failures that component tests miss: the subtle interactions, boundary mismatches, and cascading errors that only surface when components are connected.
Why Integration Testing Is Different from Component Testing
flowchart LR
subgraph Component["Component Tests<br/>(each passes independently)"]
IC["Intent Classifier<br/>✅ 95% accuracy"]
RT["Retriever<br/>✅ Recall@3 = 0.82"]
PB["Prompt Builder<br/>✅ All sections present"]
LLM["LLM<br/>✅ Coherent responses"]
GR["Guardrails<br/>✅ 98% pass rate"]
end
subgraph Integration["Integration Test<br/>(components wired together)"]
FAIL["❌ FAILURE: Classifier outputs<br/>'product_detail' but retriever<br/>expects 'product_details' (plural)<br/>→ retriever returns no results<br/>→ prompt has empty context<br/>→ LLM hallucinates an answer"]
end
Component -->|"All green ✅"| Integration
Integration -->|"Pipeline broken ❌"| BUG["Bug only visible<br/>when components connect"]
style Component fill:#00b894,color:#fff
style Integration fill:#e17055,color:#fff
style BUG fill:#d63031,color:#fff
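The 'product_detail' vs. 'product_details' failure in the diagram is exactly the kind of bug a cheap contract test at the boundary catches before any pipeline run. A minimal sketch, assuming hypothetical LABELS and ROUTING_KEYS attributes and an illustrative import path; the real components may expose their vocabularies differently:

from chatbot.components import IntentClassifier, RAGRetriever  # illustrative import path

def test_intent_labels_match_retriever_routing_keys():
    """Every label the classifier can emit must be a key the retriever routes on."""
    classifier_labels = set(IntentClassifier.LABELS)   # assumed attr, e.g. {"recommendation", "product_detail", ...}
    retriever_keys = set(RAGRetriever.ROUTING_KEYS)    # assumed attr, e.g. {"recommendation", "product_details", ...}
    missing = classifier_labels - retriever_keys
    assert not missing, f"Classifier emits labels the retriever cannot route: {missing}"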
Test Infrastructure Setup
flowchart TD
subgraph TestEnv["Integration Test Environment"]
DC["Docker Compose"]
DC --> OS["OpenSearch<br/>(local container)"]
DC --> DDB["DynamoDB Local"]
DC --> CACHE["Redis<br/>(local container)"]
DC --> OL["Ollama + Llama 3<br/>(local LLM)"]
end
subgraph Pipeline["Full Pipeline Under Test"]
ORCH["Orchestrator"]
ORCH --> CLS["Intent Classifier"]
ORCH --> RET["RAG Retriever"]
ORCH --> PMT["Prompt Builder"]
ORCH --> GEN["LLM Generator"]
ORCH --> GRL["Guardrails"]
ORCH --> FMT["Response Formatter"]
end
subgraph Data["Test Data"]
GD["50 E2E Test Cases"]
SEED["Pre-loaded product catalog<br/>(100 manga titles)"]
VEC["Pre-built vector index<br/>(embeddings for 100 titles)"]
MEM["Pre-loaded conversation<br/>histories for multi-turn"]
end
TestEnv --> Pipeline
Data --> Pipeline
style TestEnv fill:#0984e3,color:#fff
style Pipeline fill:#6c5ce7,color:#fff
style Data fill:#00b894,color:#fff
Environment Configuration
# docker-compose.test.yml
services:
opensearch:
image: opensearchproject/opensearch:2.11.0
environment:
- discovery.type=single-node
- DISABLE_SECURITY_PLUGIN=true
ports: ["9200:9200"]
dynamodb-local:
image: amazon/dynamodb-local:latest
ports: ["8000:8000"]
redis:
image: redis:7-alpine
ports: ["6379:6379"]
ollama:
image: ollama/ollama:latest
ports: ["11434:11434"]
volumes: ["ollama_data:/root/.ollama"]
    # Pre-pull once after startup: docker exec <container> ollama pull llama3:8b
volumes:
  ollama_data:
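Before seeding data, the harness has to wait for the containers to come up. A minimal readiness poll against the ports published above; the function name mirrors the wait_for_healthy call in the conftest.py below, but this body is an assumption, not the project's actual implementation:

import time
import requests

def wait_for_healthy(timeout: int = 60) -> None:
    """Poll each HTTP service until its port answers, failing after `timeout` seconds."""
    endpoints = {
        "opensearch": "http://localhost:9200/_cluster/health",
        "dynamodb": "http://localhost:8000",          # DynamoDB Local answers (even with a 400) once up
        "ollama": "http://localhost:11434/api/tags",  # lists locally pulled models
    }
    # Redis speaks RESP rather than HTTP; check it separately, e.g. redis.Redis().ping()
    deadline = time.time() + timeout
    pending = dict(endpoints)
    while pending and time.time() < deadline:
        for name, url in list(pending.items()):
            try:
                requests.get(url, timeout=2)
                del pending[name]  # any HTTP response means the port is up
            except requests.ConnectionError:
                pass
        time.sleep(1)
    assert not pending, f"Services never became reachable: {sorted(pending)}"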
Scenario 1: Happy Path — Recommendation Query
Description
A straightforward recommendation request that exercises the full RAG pipeline: classification → retrieval → prompt assembly → generation → safety check → formatting with product cards.
sequenceDiagram
participant U as Test Harness
participant IC as Intent Classifier
participant RT as Retriever
participant PB as Prompt Builder
participant LLM as LLM (Llama 3)
participant GR as Guardrails
participant FM as Formatter
U->>IC: "Can you recommend a manga similar to Attack on Titan?"
Note over IC: Stage 1: Regex → no match<br/>Stage 2: DistilBERT → recommendation (0.93)
IC-->>U: intent=recommendation, confidence=0.93
U->>RT: query + intent=recommendation
Note over RT: Embed query → KNN top-10<br/>→ filter by genre=action/seinen<br/>→ rerank → top-3
RT-->>U: [Vinland Saga, Berserk, Claymore]
U->>PB: query + intent + 3 chunks + empty history
Note over PB: System prompt + safety rules<br/>+ retrieved context + user query<br/>= 1,847 tokens
PB-->>U: assembled prompt
U->>LLM: prompt
Note over LLM: Generate recommendation response<br/>with product details from context
LLM-->>U: raw response text
U->>GR: validate response
Note over GR: ✅ No PII<br/>✅ No competitors<br/>✅ No hallucinated prices<br/>✅ ASINs match catalog
GR-->>U: pass
U->>FM: format response
Note over FM: Extract product cards<br/>Validate ASIN format<br/>Attach metadata
FM-->>U: structured response with 3 product cards
Test Implementation
import re

def test_scenario_1_recommendation_happy_path(pipeline):
"""E2E: Recommendation query → product cards with valid ASINs"""
response = pipeline.process(
query="Can you recommend a manga similar to Attack on Titan?",
user_id="test_user_001",
session_id="sess_001",
conversation_history=[]
)
# --- Boundary 1: Intent Classification ---
assert response.metadata.intent == "recommendation"
assert response.metadata.intent_confidence >= 0.85
assert response.metadata.classification_stage in ["regex", "model"]
# --- Boundary 2: Retrieval ---
assert len(response.metadata.retrieved_chunks) >= 2
assert all(chunk.relevance_score >= 0.5 for chunk in response.metadata.retrieved_chunks)
# Retrieved products should be in action/seinen genre (similar to AoT)
genres = [c.metadata.get("genre", "") for c in response.metadata.retrieved_chunks]
assert any(g in ["action", "seinen", "dark fantasy", "adventure"] for g in genres)
# --- Boundary 3: Prompt Assembly ---
assert response.metadata.prompt_tokens < 4000
assert response.metadata.prompt_tokens > 500 # Not suspiciously short
# --- Boundary 4: Generation ---
assert len(response.text) > 50 # Substantive response
assert len(response.text) < 2000 # Not runaway generation
# --- Boundary 5: Guardrails ---
assert response.metadata.guardrail_passed == True
assert response.metadata.pii_detected == False
assert response.metadata.competitor_mentioned == False
# --- Boundary 6: Formatting ---
assert len(response.products) >= 2
for product in response.products:
assert re.match(r'^B[0-9A-Z]{9}$', product.asin), f"Invalid ASIN: {product.asin}"
assert product.title and len(product.title) > 0
assert product.price > 0 # Price from catalog
assert product.asin in CATALOG_ASINS # ASIN exists in our catalog
# --- Cross-Boundary: Latency ---
assert response.metadata.total_latency_ms < 5000 # E2E under 5s for local model
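The assertions above imply a specific response contract that this section never pins down. The sketch below reconstructs one plausible shape from the fields the tests touch; the field names come from the assertions, while the class names, types, and CATALOG_ASINS (assumed to be a module-level set built from the seeded catalog) are illustrative:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Product:
    asin: str            # validated against ^B[0-9A-Z]{9}$
    title: str
    price: float         # catalog price, never generated by the LLM
    availability: str = "in_stock"

@dataclass
class ResponseMetadata:
    intent: str
    intent_confidence: float
    classification_stage: str                      # "regex" or "model"
    retrieved_chunks: list = field(default_factory=list)
    prompt_tokens: int = 0
    guardrail_passed: bool = False
    pii_detected: bool = False
    competitor_mentioned: bool = False
    total_latency_ms: float = 0.0

@dataclass
class PipelineResponse:
    text: str
    metadata: ResponseMetadata
    products: list = field(default_factory=list)   # list[Product]
    order_info: Optional[dict] = None
    error: Optional[str] = None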
Scenario 2: Multi-Turn Conversation with Memory
Description
A 4-turn conversation where each turn depends on context from previous turns. Tests memory persistence, entity resolution, topic switching, and back-referencing.
stateDiagram-v2
[*] --> Turn1: "I'm looking for a good manga"
Turn1 --> Turn2: "Something darker and more mature"
Turn2 --> Turn3: "Where's my order from last week?"
Turn3 --> Turn4: "Add that Monster hardcover to cart"
Turn4 --> [*]
state Turn1 {
[*] --> Classify1: recommendation
Classify1 --> Retrieve1: general manga
Retrieve1 --> Generate1: 3 products
Generate1 --> [*]
}
state Turn2 {
[*] --> Memory2: recall Turn1 context
Memory2 --> Classify2: recommendation (refined)
Classify2 --> Retrieve2: seinen/horror filter
Retrieve2 --> Generate2: 3 NEW products
Generate2 --> [*]
}
state Turn3 {
[*] --> Classify3: order_tracking
Note right of Classify3: Intent switch!
Classify3 --> API3: fetch order data
API3 --> Generate3: order status
Generate3 --> [*]
}
state Turn4 {
[*] --> Memory4: recall Turn2 products
Memory4 --> Resolve4: "Monster hardcover" → ASIN
Note right of Resolve4: Entity resolution<br/>across topic switch
Resolve4 --> Cart4: add to cart
Cart4 --> [*]
}
Test Implementation
def test_scenario_2_multi_turn_with_memory(pipeline):
"""E2E: 4-turn conversation with memory, topic switch, and back-reference"""
session_id = "sess_multi_001"
history = []
# --- Turn 1: Initial recommendation ---
r1 = pipeline.process(
query="I'm looking for a good manga series",
user_id="test_user_002",
session_id=session_id,
conversation_history=history
)
assert r1.metadata.intent == "recommendation"
assert len(r1.products) >= 3
t1_products = r1.products
history.extend([
{"role": "user", "content": "I'm looking for a good manga series"},
{"role": "assistant", "content": r1.text}
])
# --- Turn 2: Refinement — uses Turn 1 context ---
r2 = pipeline.process(
query="Something darker and more mature",
user_id="test_user_002",
session_id=session_id,
conversation_history=history
)
assert r2.metadata.intent == "recommendation"
# Must use previous context — query alone is ambiguous
assert r2.metadata.memory_context_used == True
# Should NOT repeat Turn 1 products
t2_asins = {p.asin for p in r2.products}
t1_asins = {p.asin for p in t1_products}
assert len(t2_asins & t1_asins) == 0, "Should not repeat products from Turn 1"
# Genre should shift to darker themes
genres = [c.metadata.get("genre", "") for c in r2.metadata.retrieved_chunks]
assert any(g in ["seinen", "horror", "psychological", "dark fantasy"] for g in genres)
history.extend([
{"role": "user", "content": "Something darker and more mature"},
{"role": "assistant", "content": r2.text}
])
# --- Turn 3: Topic switch to order tracking ---
r3 = pipeline.process(
query="Where's my order from last week?",
user_id="test_user_002",
session_id=session_id,
conversation_history=history
)
assert r3.metadata.intent == "order_tracking"
# Clean topic switch — no recommendation content leaking
assert "recommend" not in r3.text.lower()
assert r3.order_info is not None
history.extend([
{"role": "user", "content": "Where's my order from last week?"},
{"role": "assistant", "content": r3.text}
])
# --- Turn 4: Back-reference to Turn 2 product ---
r4 = pipeline.process(
query="Add that Monster hardcover to my cart",
user_id="test_user_002",
session_id=session_id,
conversation_history=history
)
# Must resolve "Monster hardcover" to the correct ASIN from Turn 2
    expected_asin = next(p.asin for p in r2.products if "monster" in p.title.lower())
assert r4.metadata.resolved_entity == expected_asin
assert r4.metadata.cart_action == "add"
assert r4.metadata.intent in ["cart_action", "product_detail"]
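Turn 4's back-reference only works if something maps "that Monster hardcover" onto an ASIN shown earlier. A deliberately naive resolver sketch, assuming session memory records the products shown per turn; production resolution would need fuzzier matching than the substring check here:

from typing import Optional

STOPWORDS = {"add", "that", "the", "to", "my", "cart", "a", "an"}

def resolve_entity(mention: str, session_memory) -> Optional[str]:
    """Map a free-text product mention to an ASIN shown earlier in the session."""
    terms = [t for t in mention.lower().split() if t not in STOPWORDS]
    # Walk turns newest-first so the most recently shown match wins
    for turn in reversed(session_memory.turns):        # assumed memory API
        for product in turn.products_shown:
            title = product.title.lower()
            if any(term in title for term in terms):   # "monster" matches the Monster hardcover
                return product.asin
    return None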
Scenario 3: Intent Handoff Mid-Conversation
Description
User starts with an FAQ question but transitions to a transactional intent (return request) mid-conversation. Tests that the orchestrator correctly re-classifies and routes to a different pipeline path without losing context.
sequenceDiagram
participant U as User
participant O as Orchestrator
participant FAQ as FAQ Pipeline
participant RET as Return Pipeline
participant MEM as Memory
U->>O: "What's your return policy?"
O->>FAQ: Route to FAQ (intent=faq, confidence=0.96)
FAQ-->>O: "You can return within 30 days..."
O->>MEM: Store: user asked about returns
U->>O: "Ok, I want to return order #45678"
Note over O: Re-classify: return_request (0.94)<br/>NOT faq anymore
O->>MEM: Read: user was asking about return policy
O->>RET: Route to Return Pipeline
Note over RET: Context-aware: user already<br/>knows the policy, skip explanation
RET-->>O: "I've initiated a return for order #45678..."
U->>O: "Will I get a full refund?"
Note over O: Classify: faq (0.88) vs return_followup (0.85)
Note over O: Memory indicates active return → route to return_followup
O->>RET: Route to Return Pipeline (context: active return)
RET-->>O: "Yes, you'll receive a full refund within 5-7 business days"
Test Implementation
def test_scenario_3_intent_handoff(pipeline):
"""E2E: FAQ → Return Request → Return Follow-up transitions"""
session_id = "sess_handoff_001"
history = []
# Turn 1: FAQ about return policy
r1 = pipeline.process(
query="What's your return policy?",
user_id="test_user_003",
session_id=session_id,
conversation_history=history
)
assert r1.metadata.intent == "faq"
assert "30 days" in r1.text or "return" in r1.text.lower()
history.extend([
{"role": "user", "content": "What's your return policy?"},
{"role": "assistant", "content": r1.text}
])
# Turn 2: Transition to actual return request
r2 = pipeline.process(
query="Ok, I want to return order #45678",
user_id="test_user_003",
session_id=session_id,
conversation_history=history
)
assert r2.metadata.intent == "return_request"
assert r2.metadata.intent != "faq" # Must re-classify
assert "45678" in r2.text # Referenced the correct order
# Should NOT repeat the return policy (user already knows it)
assert r2.text.count("30 days") <= 1 # At most a brief mention
history.extend([
{"role": "user", "content": "Ok, I want to return order #45678"},
{"role": "assistant", "content": r2.text}
])
# Turn 3: Follow-up within return context
r3 = pipeline.process(
query="Will I get a full refund?",
user_id="test_user_003",
session_id=session_id,
conversation_history=history
)
    # Should be routed as a return follow-up within the active return, not generic FAQ
    assert r3.metadata.intent in ["return_followup", "return_request"]
assert r3.metadata.memory_context_used == True
# Response should be specific to the active return, not generic
assert "refund" in r3.text.lower()
Scenario 4: Guardrail Trigger Mid-Pipeline
Description
The retriever returns a chunk containing competitor information (Barnes & Noble) that has crept into the product catalog's editorial content. The guardrails must catch and filter it before it reaches the user, and the response must degrade gracefully without the contaminated chunk.
flowchart TD
Q["Query: 'Where can I find manga deals?'"]
Q --> IC["Intent: recommendation (0.89)"]
IC --> RT["Retriever returns 3 chunks"]
RT --> C1["Chunk 1: 'MangaAssist has weekly deals<br/>on popular series...'<br/>✅ Safe"]
RT --> C2["Chunk 2: 'Barnes & Noble also offers<br/>competitive manga pricing...'<br/>❌ Competitor mention"]
RT --> C3["Chunk 3: 'Check our seasonal sales<br/>for up to 40% off...'<br/>✅ Safe"]
C1 --> GR["Guardrail Filter"]
C2 --> GR
C3 --> GR
GR -->|"Filter competitor chunk"| PB["Prompt Builder<br/>receives only Chunk 1 + Chunk 3"]
GR -->|"Log blocked chunk"| LOG["Audit Log:<br/>competitor_filter triggered<br/>chunk_id: editorial_042"]
PB --> LLM["LLM generates response<br/>using 2 clean chunks"]
LLM --> GR2["Post-generation guardrail"]
GR2 -->|"✅ No competitor in response"| FM["Formatter"]
FM --> RESP["Final response:<br/>Deals from our catalog only"]
style C2 fill:#e17055,color:#fff
style GR fill:#fdcb6e,color:#333
style RESP fill:#00b894,color:#fff
Test Implementation
def test_scenario_4_guardrail_blocks_competitor_in_retrieval(pipeline):
"""E2E: Competitor content in RAG index → filtered → graceful response"""
# Pre-condition: inject a contaminated chunk into the test index
inject_test_chunk({
"doc_id": "editorial_042",
"text": "For manga deals, Barnes & Noble offers competitive pricing. "
"Their membership program includes 10% off all manga titles.",
"metadata": {"source": "editorial", "category": "deals"}
})
response = pipeline.process(
query="Where can I find manga deals?",
user_id="test_user_004",
session_id="sess_guardrail_001",
conversation_history=[]
)
# Competitor content must NOT appear in response
assert "barnes" not in response.text.lower()
assert "noble" not in response.text.lower()
assert "membership" not in response.text.lower() # Their program, not ours
# Response should still be helpful (not empty or error)
assert len(response.text) > 50
assert response.metadata.intent == "recommendation"
# Guardrail should have logged the filtering
assert response.metadata.guardrail_filters_applied >= 1
assert "competitor_filter" in response.metadata.guardrail_details
# At least 1 chunk should have been used (the clean ones)
assert len(response.metadata.chunks_used) >= 1
# The competitor chunk should NOT be in the used chunks
assert "editorial_042" not in [c.doc_id for c in response.metadata.chunks_used]
Scenario 5: Fallback Cascade
Description
A query where everything goes slightly wrong: the classifier has low confidence, the retriever returns no relevant results, and the system must gracefully cascade through fallback layers until it produces a helpful response.
flowchart TD
Q["Query: 'ugh this is so frustrating<br/>nothing works'"]
Q --> IC["Intent Classifier"]
IC -->|"confidence = 0.42<br/>(below 0.80 threshold)"| FALLBACK1["Fallback: Use DistilBERT<br/>Stage 2 classifier"]
FALLBACK1 -->|"intent = complaint (0.61)<br/>(still below 0.80)"| FALLBACK2["Fallback: Check memory<br/>for active context"]
FALLBACK2 -->|"No active order,<br/>no recent topic"| FALLBACK3["Fallback: Route to<br/>empathetic response template"]
FALLBACK3 --> TEMPLATE["Template Response:<br/>'I'm sorry you're having trouble.<br/>Could you tell me more about<br/>what's not working? I'm here to help.'"]
TEMPLATE --> GR["Guardrails: ✅ Safe"]
GR --> ESCALATION["Set escalation flag<br/>if next message is also frustrated"]
style Q fill:#2d3436,color:#fff
style FALLBACK1 fill:#fdcb6e,color:#333
style FALLBACK2 fill:#f39c12,color:#fff
style FALLBACK3 fill:#e17055,color:#fff
style TEMPLATE fill:#00b894,color:#fff
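Stripped of the diagram boxes, the cascade is three ordered short-circuits. A condensed sketch with the 0.80 threshold and template text from the diagram; all component interfaces here (stage2_classifier, run_full_pipeline, template_response) are assumed:

CONFIDENCE_THRESHOLD = 0.80

def handle_low_confidence(query: str, memory):
    """Cascade through the diagram's three fallbacks until something can respond."""
    intent, confidence = stage2_classifier.predict(query)  # fallback 1: DistilBERT stage 2
    if confidence >= CONFIDENCE_THRESHOLD:
        return run_full_pipeline(query, intent)
    context = memory.active_context()                      # fallback 2: active order / recent topic
    if context is not None:
        return run_full_pipeline(query, context.intent)
    # Fallback 3: canned empathetic reply; no LLM call, so generation cost is zero
    return template_response(
        "I'm sorry you're having trouble. Could you tell me more about "
        "what's not working? I'm here to help.",
        escalation_primed=True,
    )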
Test Implementation
def test_scenario_5_fallback_cascade(pipeline):
"""E2E: Low confidence → no retrieval → template fallback"""
response = pipeline.process(
query="ugh this is so frustrating nothing works",
user_id="test_user_005",
session_id="sess_fallback_001",
conversation_history=[]
)
# Classifier should have low confidence
assert response.metadata.intent_confidence < 0.80
# Pipeline should not have crashed
assert response.text is not None
assert len(response.text) > 20
# Response should be empathetic and ask for clarification
empathy_indicators = ["sorry", "help", "trouble", "understand", "tell me more"]
assert any(word in response.text.lower() for word in empathy_indicators)
# Should NOT hallucinate a specific response (no product, no order info)
assert response.products == []
assert response.order_info is None
# Fallback path should be logged
assert response.metadata.fallback_triggered == True
assert response.metadata.response_source in ["template", "empathetic_fallback"]
# LLM should NOT have been called (template response = $0)
assert response.metadata.llm_called == False
# Escalation readiness should be set for next turn
assert response.metadata.escalation_primed == True
Scenario 6: Real-Time Data Staleness
Description
The chatbot recommends a product, but between retrieval and response delivery the product goes out of stock. The real-time inventory check must catch this and either substitute an available item or warn the user.
sequenceDiagram
participant U as User
participant RT as Retriever
participant INV as Inventory API
participant LLM as LLM
participant FM as Formatter
U->>RT: "Recommend me Attack on Titan volumes"
RT-->>U: [Vol 1 (in-stock), Vol 2 (in-stock), Vol 3 (in-stock)]
Note over U,FM: ⏱️ 200ms passes — inventory changes
U->>INV: Check real-time inventory for 3 ASINs
Note over INV: Vol 2 now OUT OF STOCK<br/>(someone bought the last copy)
INV-->>U: [Vol 1: ✅, Vol 2: ❌, Vol 3: ✅]
U->>LLM: Generate with 2 available + 1 note
LLM-->>U: Response mentioning Vol 2 unavailability
U->>FM: Format with availability badges
FM-->>U: Product cards with ✅/❌ status
Test Implementation
def test_scenario_6_inventory_staleness(pipeline):
"""E2E: Product goes out-of-stock between retrieval and response"""
    # Simulate inventory change mid-pipeline (test ASIN matches the catalog's B + 9 alphanumeric format)
    with mock_inventory_change("B00AOTVOL2", status="out_of_stock", after_ms=100):
response = pipeline.process(
query="I want to buy Attack on Titan volumes 1-3",
user_id="test_user_006",
session_id="sess_stale_001",
conversation_history=[]
)
# Response should acknowledge the out-of-stock item
assert any(
indicator in response.text.lower()
for indicator in ["out of stock", "unavailable", "currently not available", "sold out"]
)
# Should NOT present the out-of-stock item as purchasable
for product in response.products:
        if product.asin == "B00AOTVOL2":
assert product.availability != "in_stock"
assert product.availability in ["out_of_stock", "backordered", "unavailable"]
# Should still recommend the available volumes
available_asins = [p.asin for p in response.products if p.availability == "in_stock"]
assert len(available_asins) >= 2
# Should NOT hallucinate a price for the out-of-stock item
# (or should show last known price with "was" prefix)
assert response.metadata.stale_data_handled == True
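mock_inventory_change is likewise not defined in this section. One way to fake a mid-pipeline stock flip is a context manager that patches the inventory lookup and changes its answer once the configured delay has elapsed; a sketch assuming a module-level inventory_client with a get_status function:

import time
from contextlib import contextmanager
from unittest.mock import patch

@contextmanager
def mock_inventory_change(asin: str, status: str, after_ms: int):
    """Patch the inventory lookup so `asin` reports `status` once `after_ms` have elapsed."""
    start = time.monotonic()
    real_get_status = inventory_client.get_status  # capture before patching

    def delayed_status(query_asin: str) -> str:
        elapsed_ms = (time.monotonic() - start) * 1000
        if query_asin == asin and elapsed_ms >= after_ms:
            return status  # the "someone bought the last copy" moment
        return real_get_status(query_asin)

    with patch.object(inventory_client, "get_status", side_effect=delayed_status):
        yield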
Scenario 7: Token Budget Overflow
Description
A complex multi-turn conversation with large retrieval context that approaches or exceeds the LLM's token budget. The prompt builder must intelligently truncate without losing critical information.
flowchart TD
subgraph Input["Raw Input to Prompt Builder"]
SYS["System Prompt<br/>500 tokens"]
HIST["Conversation History (12 turns)<br/>3,200 tokens"]
CTX["Retrieved Context (5 chunks)<br/>2,800 tokens"]
QUERY["User Query<br/>50 tokens"]
INST["Instructions<br/>200 tokens"]
end
SUM["Total: 6,750 tokens"]
BUDGET["Budget: 4,000 tokens (input)<br/>+ 500 tokens (output)<br/>= 4,500 max"]
Input --> SUM
SUM -->|"OVER BUDGET by 2,250"| TRUNC["Truncation Strategy"]
TRUNC --> S1["Step 1: Summarize old history<br/>12 turns → 3-turn summary<br/>3,200 → 400 tokens<br/>Saved: 2,800"]
S1 --> S2["Step 2: Trim to top-3 chunks<br/>5 chunks → 3 chunks<br/>2,800 → 1,680 tokens<br/>Saved: 1,120"]
S2 --> S3["Step 3: Keep system +<br/>query + instructions intact"]
S3 --> FINAL["Final Prompt: 2,830 tokens<br/>✅ Within budget"]
style SUM fill:#e17055,color:#fff
style FINAL fill:#00b894,color:#fff
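Expressed as code, the three-step strategy reduces to a priority order: summarize history first, then shed the least relevant chunks, and never touch the system prompt, query, or instructions. A sketch in which count_tokens, summarize_history, and assemble are assumed helpers and the budget mirrors the diagram:

INPUT_BUDGET = 4000  # tokens available for the assembled prompt, as in the diagram

def fit_to_budget(system: str, history: list, chunks: list, query: str, instructions: str) -> str:
    """Truncate in priority order; system prompt, query, and instructions are never touched."""
    def total() -> int:
        parts = [system, query, instructions]
        parts += [turn["content"] for turn in history]
        parts += [chunk.text for chunk in chunks]
        return sum(count_tokens(p) for p in parts)

    # Step 1: summarize old turns (3,200 -> ~400 tokens in the diagram's example)
    if total() > INPUT_BUDGET and len(history) > 6:
        history = summarize_history(history, keep_recent_turns=3)
    # Step 2: shed the least relevant chunks, best ones last to go
    chunks = sorted(chunks, key=lambda c: c.relevance_score, reverse=True)
    while total() > INPUT_BUDGET and len(chunks) > 1:
        chunks.pop()
    # Step 3: assemble with the protected sections intact
    return assemble(system, history, chunks, query, instructions)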
Test Implementation
def test_scenario_7_token_budget_overflow(pipeline):
"""E2E: Long conversation + large context → truncation without info loss"""
# Build a 12-turn conversation history
    # GENRES: assumed module-level list of at least 12 genre names from the seeded catalog
    long_history = []
for i in range(12):
long_history.extend([
{"role": "user", "content": f"Turn {i}: Tell me about manga genre {GENRES[i]}"},
{"role": "assistant", "content": f"Here's a detailed explanation of {GENRES[i]} manga " * 20}
])
response = pipeline.process(
query="Based on everything we discussed, what should I read first?",
user_id="test_user_007",
session_id="sess_overflow_001",
conversation_history=long_history
)
# Response should succeed (not crash on token overflow)
assert response.text is not None
assert len(response.text) > 50
# Prompt should be within budget
assert response.metadata.prompt_tokens <= 4000
# Critical information should be preserved
# System prompt: always kept in full
assert response.metadata.system_prompt_truncated == False
# User query: always kept in full
assert response.metadata.query_truncated == False
# Safety instructions: always kept in full
assert response.metadata.safety_instructions_present == True
# History should be summarized, not just chopped
assert response.metadata.history_strategy in ["summarized", "recent_only"]
assert response.metadata.history_strategy != "truncated_raw" # Don't just cut text mid-sentence
# Context should be prioritized by relevance
    if response.metadata.chunks_truncated:
        assert response.metadata.chunks_ordered_by_relevance == True
        assert len(response.metadata.chunks_used) <= len(response.metadata.retrieved_chunks)
# Response quality should not degrade significantly
assert response.metadata.guardrail_passed == True
Scenario 8: Rate Limiting and Bedrock Throttling
Description
Simulates Amazon Bedrock returning a ThrottlingException due to rate limits. The pipeline must gracefully degrade — using cached responses, falling back to a simpler model, or returning a template response — without showing the user an error.
flowchart TD
Q["User Query"]
Q --> IC["Intent Classifier: recommendation"]
IC --> RT["Retriever: 3 chunks"]
RT --> PB["Prompt Builder: assembled"]
PB --> LLM["Call Bedrock<br/>Claude 3.5 Sonnet"]
LLM -->|"ThrottlingException!"| RETRY["Retry with backoff<br/>(max 2 retries)"]
RETRY -->|"Still throttled"| FALLBACK{"Fallback Strategy"}
FALLBACK -->|"Strategy 1"| CACHE["Check semantic cache<br/>for similar query"]
FALLBACK -->|"Strategy 2"| LITE["Try lighter model<br/>(Claude 3 Haiku)"]
FALLBACK -->|"Strategy 3"| TEMPLATE["Return template response<br/>with retrieved products"]
CACHE -->|"Cache hit"| SERVE["Serve cached response<br/>(with freshness flag)"]
CACHE -->|"Cache miss"| LITE
LITE -->|"Success"| SERVE2["Serve lower-quality response<br/>(note: simpler model used)"]
LITE -->|"Also throttled"| TEMPLATE
TEMPLATE --> SERVE3["Serve template:<br/>'Based on your interest, here are<br/>some popular titles: [product list]'"]
style LLM fill:#e17055,color:#fff
style RETRY fill:#fdcb6e,color:#333
style SERVE fill:#00b894,color:#fff
style SERVE2 fill:#00b894,color:#fff
style SERVE3 fill:#00b894,color:#fff
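The retry-then-degrade path condenses to a bounded retry loop followed by an ordered fallback chain. A sketch using botocore's real ClientError shape for ThrottlingException; bedrock_invoke, semantic_cache, and template_with_products are assumed helpers, and the model IDs are illustrative:

import time
from botocore.exceptions import ClientError

def generate_with_degradation(prompt: str, query: str, chunks: list):
    """Retry the primary model with backoff, then walk the fallback chain in order."""
    for attempt in range(3):  # one initial call plus the diagram's two retries
        try:
            return bedrock_invoke(prompt, model="anthropic.claude-3-5-sonnet-20240620-v1:0")
        except ClientError as e:
            if e.response["Error"]["Code"] != "ThrottlingException":
                raise
            if attempt < 2:
                time.sleep(2 ** attempt)  # 1s, then 2s of backoff

    cached = semantic_cache.lookup(query)          # Strategy 1: semantic cache
    if cached is not None:
        return cached.mark_degraded("cache_hit")
    try:                                           # Strategy 2: lighter model
        return bedrock_invoke(prompt, model="anthropic.claude-3-haiku-20240307-v1:0")
    except ClientError:
        return template_with_products(chunks)      # Strategy 3: retrieval-only template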
Test Implementation
def test_scenario_8_bedrock_throttling(pipeline):
"""E2E: Bedrock throttled → graceful degradation"""
# Simulate Bedrock throttling all calls
with mock_bedrock_throttle(exception="ThrottlingException"):
response = pipeline.process(
query="Recommend some popular shonen manga",
user_id="test_user_008",
session_id="sess_throttle_001",
conversation_history=[]
)
# Should NOT return an error to the user
assert response.error is None
assert response.text is not None
assert len(response.text) > 30
# Should indicate degraded mode in metadata (not in user response)
assert response.metadata.degraded_mode == True
assert response.metadata.fallback_reason == "bedrock_throttled"
# Should still return products (from retrieval, even without LLM generation)
assert len(response.products) >= 1
# Products should still have valid data from catalog
for product in response.products:
assert product.asin in CATALOG_ASINS
assert product.price > 0
# Retry count should be logged
assert response.metadata.retry_count <= 2 # Max 2 retries before fallback
# Latency should not be excessive (retries + fallback)
assert response.metadata.total_latency_ms < 10000 # 10s max even with retries
# Monitoring: this should have generated an alert
assert response.metadata.alert_generated == True
assert response.metadata.alert_type == "bedrock_throttling"
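mock_bedrock_throttle can be a thin patch that makes the Bedrock client raise the same ClientError boto3 would. A sketch assuming the pipeline reaches Bedrock through a single bedrock_client.invoke_model call site:

from contextlib import contextmanager
from unittest.mock import patch
from botocore.exceptions import ClientError

@contextmanager
def mock_bedrock_throttle(exception: str = "ThrottlingException"):
    """Make every invoke_model call raise the given Bedrock error code."""
    error = ClientError(
        error_response={"Error": {"Code": exception, "Message": "Rate exceeded"}},
        operation_name="InvokeModel",
    )
    with patch.object(bedrock_client, "invoke_model", side_effect=error):
        yield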
Running the Integration Test Suite
Execution Flow
flowchart TD
START["Developer triggers<br/>integration tests"]
START --> ENV["Start Docker Compose<br/>(OpenSearch + DynamoDB + Redis + Ollama)"]
ENV --> SEED["Seed test data<br/>(100 products, vector index,<br/>conversation histories)"]
SEED --> RUN["Run 50 E2E test cases<br/>(pytest -m integration)"]
RUN --> RESULTS["Collect results"]
RESULTS --> REPORT["Generate report:<br/>- Scenarios passed/failed<br/>- Boundary assertion details<br/>- Latency breakdown<br/>- Coverage by intent"]
REPORT --> GATE{"All critical<br/>scenarios pass?"}
GATE -->|"Yes"| MERGE["Allow PR merge"]
GATE -->|"No"| BLOCK["Block PR +<br/>detailed failure report"]
ENV --> CLEANUP["Teardown Docker<br/>after tests complete"]
style START fill:#2d3436,color:#fff
style MERGE fill:#00b894,color:#fff
style BLOCK fill:#e17055,color:#fff
Test Configuration
# conftest.py
import pytest
@pytest.fixture(scope="session")
def integration_env():
"""Start all local services for integration testing"""
env = IntegrationEnvironment()
env.start_docker_compose("docker-compose.test.yml")
env.wait_for_healthy(timeout=60)
env.seed_product_catalog(count=100)
env.build_vector_index()
env.seed_conversation_histories()
yield env
env.teardown()
@pytest.fixture
def pipeline(integration_env):
"""Fresh pipeline instance for each test"""
return ChatbotPipeline(
classifier=IntentClassifier.load("models/intent_v4"),
retriever=RAGRetriever(endpoint=integration_env.opensearch_url),
prompt_builder=PromptBuilder.load("prompts/v3"),
llm=LocalLLMClient(model="llama3:8b", endpoint=integration_env.ollama_url),
guardrails=GuardrailEngine.load("configs/guardrails_v6"),
formatter=ResponseFormatter(),
memory=ConversationMemory(endpoint=integration_env.dynamodb_url),
)
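For `pytest -m integration` (as shown in the execution flow) to select these tests without unknown-marker warnings, the marker needs to be registered; a minimal addition to the same conftest.py:

def pytest_configure(config):
    """Register the marker used to select this suite via `pytest -m integration`."""
    config.addinivalue_line(
        "markers", "integration: full-pipeline E2E tests that require the Docker services"
    )

Each scenario module then opts in with pytestmark = pytest.mark.integration, or decorates individual tests with @pytest.mark.integration.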
Summary: Integration Test Coverage Matrix
| Scenario | Components Tested | Key Assertion | Failure Mode Caught |
|---|---|---|---|
| 1. Happy Path | All 6 | Products have valid ASINs from catalog | Cross-component data format mismatch |
| 2. Multi-Turn | Classifier + Memory + Retriever | Entity resolution across turns | Memory corruption, context loss |
| 3. Intent Handoff | Classifier + Orchestrator + Memory | Clean intent transition | Routing stickiness, context bleed |
| 4. Guardrail Mid-Pipeline | Retriever + Guardrails + LLM | Competitor content filtered | RAG contamination, filter bypass |
| 5. Fallback Cascade | Classifier + Retriever + Templates | Graceful degradation path | Crash on low confidence, empty response |
| 6. Data Staleness | Retriever + Inventory API + Formatter | Out-of-stock handled | Stale recommendation, false availability |
| 7. Token Overflow | History + Context + Prompt Builder | Intelligent truncation | Prompt exceeds budget, info loss |
| 8. Throttling | LLM + Cache + Fallback + Templates | User never sees error | Unhandled exception, blank response |