
End-to-End Integration Testing Scenarios

Full pipeline test scenarios that validate the complete flow from user query to final response. Each scenario tests the interaction between all components — intent classifier, retriever, prompt builder, LLM, guardrails, and response formatter — wired together as one system.

These tests catch the failures that component tests miss: the subtle interactions, boundary mismatches, and cascading errors that only surface when components are connected.


Why Integration Testing Is Different from Component Testing

flowchart LR
    subgraph Component["Component Tests<br/>(each passes independently)"]
        IC["Intent Classifier<br/>✅ 95% accuracy"]
        RT["Retriever<br/>✅ Recall@3 = 0.82"]
        PB["Prompt Builder<br/>✅ All sections present"]
        LLM["LLM<br/>✅ Coherent responses"]
        GR["Guardrails<br/>✅ 98% pass rate"]
    end

    subgraph Integration["Integration Test<br/>(components wired together)"]
        FAIL["❌ FAILURE: Classifier outputs<br/>'product_detail' but retriever<br/>expects 'product_details' (plural)<br/>→ retriever returns no results<br/>→ prompt has empty context<br/>→ LLM hallucinates an answer"]
    end

    Component -->|"All green ✅"| Integration
    Integration -->|"Pipeline broken ❌"| BUG["Bug only visible<br/>when components connect"]

    style Component fill:#00b894,color:#fff
    style Integration fill:#e17055,color:#fff
    style BUG fill:#d63031,color:#fff
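
The singular/plural mismatch above is exactly the kind of failure a cheap contract test catches long before a full E2E run. A minimal sketch, assuming each component declares its intent vocabulary as a class attribute (IntentClassifier.LABELS and RAGRetriever.SUPPORTED_INTENTS are hypothetical names, not confirmed APIs):

def test_intent_labels_match_retriever_contract():
    """Every label the classifier can emit must be routable by the retriever."""
    classifier_labels = set(IntentClassifier.LABELS)        # e.g. {"recommendation", "product_detail", ...}
    retriever_intents = set(RAGRetriever.SUPPORTED_INTENTS)

    unroutable = classifier_labels - retriever_intents
    assert not unroutable, f"Classifier emits labels the retriever cannot route: {unroutable}"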

Test Infrastructure Setup

flowchart TD
    subgraph TestEnv["Integration Test Environment"]
        DC["Docker Compose"]
        DC --> OS["OpenSearch<br/>(local container)"]
        DC --> DDB["DynamoDB Local"]
        DC --> CACHE["Redis<br/>(local container)"]
        DC --> OL["Ollama + Llama 3<br/>(local LLM)"]
    end

    subgraph Pipeline["Full Pipeline Under Test"]
        ORCH["Orchestrator"]
        ORCH --> CLS["Intent Classifier"]
        ORCH --> RET["RAG Retriever"]
        ORCH --> PMT["Prompt Builder"]
        ORCH --> GEN["LLM Generator"]
        ORCH --> GRL["Guardrails"]
        ORCH --> FMT["Response Formatter"]
    end

    subgraph Data["Test Data"]
        GD["50 E2E Test Cases"]
        SEED["Pre-loaded product catalog<br/>(100 manga titles)"]
        VEC["Pre-built vector index<br/>(embeddings for 100 titles)"]
        MEM["Pre-loaded conversation<br/>histories for multi-turn"]
    end

    TestEnv --> Pipeline
    Data --> Pipeline

    style TestEnv fill:#0984e3,color:#fff
    style Pipeline fill:#6c5ce7,color:#fff
    style Data fill:#00b894,color:#fff

Environment Configuration

# docker-compose.test.yml
services:
  opensearch:
    image: opensearchproject/opensearch:2.11.0
    environment:
      - discovery.type=single-node
      - DISABLE_SECURITY_PLUGIN=true
    ports: ["9200:9200"]

  dynamodb-local:
    image: amazon/dynamodb-local:latest
    ports: ["8000:8000"]

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes: ["ollama_data:/root/.ollama"]
    # Pre-pull: ollama pull llama3:8b
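
Containers report "started" well before they can serve traffic, so the harness should poll each service before seeding data. A minimal sketch of the kind of readiness check that wait_for_healthy in conftest.py (shown later) might perform; the URLs assume the port mappings above, and Redis is omitted because it speaks RESP rather than HTTP:

import time
import requests

SERVICES = {
    "opensearch": "http://localhost:9200/_cluster/health",
    "ollama": "http://localhost:11434/api/tags",
    "dynamodb": "http://localhost:8000",  # DynamoDB Local answers HTTP on its root path
}

def wait_for_services(timeout: int = 60) -> None:
    """Poll each service until it answers HTTP, or fail the suite fast."""
    deadline = time.monotonic() + timeout
    pending = dict(SERVICES)
    while pending and time.monotonic() < deadline:
        for name, url in list(pending.items()):
            try:
                requests.get(url, timeout=2)
                pending.pop(name)  # any HTTP response counts as "up"
            except (requests.ConnectionError, requests.Timeout):
                pass  # not listening yet
        time.sleep(1)
    if pending:
        raise RuntimeError(f"Services never became healthy: {sorted(pending)}")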

Scenario 1: Happy Path — Recommendation Query

Description

A straightforward recommendation request that exercises the full RAG pipeline: classification → retrieval → prompt assembly → generation → safety check → formatting with product cards.

sequenceDiagram
    participant U as Test Harness
    participant IC as Intent Classifier
    participant RT as Retriever
    participant PB as Prompt Builder
    participant LLM as LLM (Llama 3)
    participant GR as Guardrails
    participant FM as Formatter

    U->>IC: "Can you recommend a manga similar to Attack on Titan?"
    Note over IC: Stage 1: Regex → no match<br/>Stage 2: DistilBERT → recommendation (0.93)
    IC-->>U: intent=recommendation, confidence=0.93

    U->>RT: query + intent=recommendation
    Note over RT: Embed query → KNN top-10<br/>→ filter by genre=action/seinen<br/>→ rerank → top-3
    RT-->>U: [Vinland Saga, Berserk, Claymore]

    U->>PB: query + intent + 3 chunks + empty history
    Note over PB: System prompt + safety rules<br/>+ retrieved context + user query<br/>= 1,847 tokens
    PB-->>U: assembled prompt

    U->>LLM: prompt
    Note over LLM: Generate recommendation response<br/>with product details from context
    LLM-->>U: raw response text

    U->>GR: validate response
    Note over GR: ✅ No PII<br/>✅ No competitors<br/>✅ No hallucinated prices<br/>✅ ASINs match catalog
    GR-->>U: pass

    U->>FM: format response
    Note over FM: Extract product cards<br/>Validate ASIN format<br/>Attach metadata
    FM-->>U: structured response with 3 product cards

Test Implementation

import re  # stdlib; used below to validate ASIN format

def test_scenario_1_recommendation_happy_path(pipeline):
    """E2E: Recommendation query → product cards with valid ASINs"""

    response = pipeline.process(
        query="Can you recommend a manga similar to Attack on Titan?",
        user_id="test_user_001",
        session_id="sess_001",
        conversation_history=[]
    )

    # --- Boundary 1: Intent Classification ---
    assert response.metadata.intent == "recommendation"
    assert response.metadata.intent_confidence >= 0.85
    assert response.metadata.classification_stage in ["regex", "model"]

    # --- Boundary 2: Retrieval ---
    assert len(response.metadata.retrieved_chunks) >= 2
    assert all(chunk.relevance_score >= 0.5 for chunk in response.metadata.retrieved_chunks)
    # Retrieved products should be in action/seinen genre (similar to AoT)
    genres = [c.metadata.get("genre", "") for c in response.metadata.retrieved_chunks]
    assert any(g in ["action", "seinen", "dark fantasy", "adventure"] for g in genres)

    # --- Boundary 3: Prompt Assembly ---
    assert response.metadata.prompt_tokens < 4000
    assert response.metadata.prompt_tokens > 500  # Not suspiciously short

    # --- Boundary 4: Generation ---
    assert len(response.text) > 50  # Substantive response
    assert len(response.text) < 2000  # Not runaway generation

    # --- Boundary 5: Guardrails ---
    assert response.metadata.guardrail_passed is True
    assert response.metadata.pii_detected is False
    assert response.metadata.competitor_mentioned is False

    # --- Boundary 6: Formatting ---
    assert len(response.products) >= 2
    for product in response.products:
        assert re.match(r'^B[0-9A-Z]{9}$', product.asin), f"Invalid ASIN: {product.asin}"
        assert product.title and len(product.title) > 0
        assert product.price > 0  # Price from catalog
        assert product.asin in CATALOG_ASINS  # ASIN exists in our catalog

    # --- Cross-Boundary: Latency ---
    assert response.metadata.total_latency_ms < 5000  # E2E under 5s for local model

Scenario 2: Multi-Turn Conversation with Memory

Description

A 4-turn conversation where each turn depends on context from previous turns. Tests memory persistence, entity resolution, topic switching, and back-referencing.

stateDiagram-v2
    [*] --> Turn1: "I'm looking for a good manga"
    Turn1 --> Turn2: "Something darker and more mature"
    Turn2 --> Turn3: "Where's my order from last week?"
    Turn3 --> Turn4: "Add that Monster hardcover to cart"
    Turn4 --> [*]

    state Turn1 {
        [*] --> Classify1: recommendation
        Classify1 --> Retrieve1: general manga
        Retrieve1 --> Generate1: 3 products
        Generate1 --> [*]
    }

    state Turn2 {
        [*] --> Memory2: recall Turn1 context
        Memory2 --> Classify2: recommendation (refined)
        Classify2 --> Retrieve2: seinen/horror filter
        Retrieve2 --> Generate2: 3 NEW products
        Generate2 --> [*]
    }

    state Turn3 {
        [*] --> Classify3: order_tracking
        Note right of Classify3: Intent switch!
        Classify3 --> API3: fetch order data
        API3 --> Generate3: order status
        Generate3 --> [*]
    }

    state Turn4 {
        [*] --> Memory4: recall Turn2 products
        Memory4 --> Resolve4: "Monster hardcover" → ASIN
        Note right of Resolve4: Entity resolution<br/>across topic switch
        Resolve4 --> Cart4: add to cart
        Cart4 --> [*]
    }

Test Implementation

def test_scenario_2_multi_turn_with_memory(pipeline):
    """E2E: 4-turn conversation with memory, topic switch, and back-reference"""

    session_id = "sess_multi_001"
    history = []

    # --- Turn 1: Initial recommendation ---
    r1 = pipeline.process(
        query="I'm looking for a good manga series",
        user_id="test_user_002",
        session_id=session_id,
        conversation_history=history
    )
    assert r1.metadata.intent == "recommendation"
    assert len(r1.products) >= 3
    t1_products = r1.products

    history.extend([
        {"role": "user", "content": "I'm looking for a good manga series"},
        {"role": "assistant", "content": r1.text}
    ])

    # --- Turn 2: Refinement — uses Turn 1 context ---
    r2 = pipeline.process(
        query="Something darker and more mature",
        user_id="test_user_002",
        session_id=session_id,
        conversation_history=history
    )
    assert r2.metadata.intent == "recommendation"
    # Must use previous context — query alone is ambiguous
    assert r2.metadata.memory_context_used is True
    # Should NOT repeat Turn 1 products
    t2_products = r2.products  # saved for the Turn 4 back-reference below
    t2_asins = {p.asin for p in t2_products}
    t1_asins = {p.asin for p in t1_products}
    assert len(t2_asins & t1_asins) == 0, "Should not repeat products from Turn 1"
    # Genre should shift to darker themes
    genres = [c.metadata.get("genre", "") for c in r2.metadata.retrieved_chunks]
    assert any(g in ["seinen", "horror", "psychological", "dark fantasy"] for g in genres)

    history.extend([
        {"role": "user", "content": "Something darker and more mature"},
        {"role": "assistant", "content": r2.text}
    ])

    # --- Turn 3: Topic switch to order tracking ---
    r3 = pipeline.process(
        query="Where's my order from last week?",
        user_id="test_user_002",
        session_id=session_id,
        conversation_history=history
    )
    assert r3.metadata.intent == "order_tracking"
    # Clean topic switch — no recommendation content leaking
    assert "recommend" not in r3.text.lower()
    assert r3.order_info is not None

    history.extend([
        {"role": "user", "content": "Where's my order from last week?"},
        {"role": "assistant", "content": r3.text}
    ])

    # --- Turn 4: Back-reference to Turn 2 product ---
    r4 = pipeline.process(
        query="Add that Monster hardcover to my cart",
        user_id="test_user_002",
        session_id=session_id,
        conversation_history=history
    )
    # Must resolve "Monster hardcover" to the correct ASIN from Turn 2
    expected_asin = next((p.asin for p in t2_products if "monster" in p.title.lower()), None)
    assert expected_asin is not None, "Turn 2 should have surfaced a Monster title"
    assert r4.metadata.resolved_entity == expected_asin
    assert r4.metadata.cart_action == "add"
    assert r4.metadata.intent in ["cart_action", "product_detail"]
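
The Turn 4 assertion depends on the pipeline resolving a free-text back-reference to an ASIN shown earlier in the session. One plausible shape for that resolver, as a sketch (resolve_product_mention, its scoring, and its threshold are illustrative, not the pipeline's actual implementation; products are assumed to expose .title and .asin):

from difflib import SequenceMatcher

def resolve_product_mention(mention: str, recent_products: list) -> str | None:
    """Map a back-reference like "that Monster hardcover" to the
    best-matching product surfaced earlier in the conversation."""
    best_asin, best_score = None, 0.0
    for product in recent_products:
        title = product.title.lower()
        score = SequenceMatcher(None, mention.lower(), title).ratio()
        # A token hit ("monster" appearing in the title) beats pure ratio.
        if any(token in title for token in mention.lower().split()):
            score = max(score, 0.8)
        if score > best_score:
            best_asin, best_score = product.asin, score
    return best_asin if best_score >= 0.6 else None  # threshold is illustrative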

Scenario 3: Intent Handoff Mid-Conversation

Description

User starts with an FAQ question but transitions to a transactional intent (return request) mid-conversation. Tests that the orchestrator correctly re-classifies and routes to a different pipeline path without losing context.

sequenceDiagram
    participant U as User
    participant O as Orchestrator
    participant FAQ as FAQ Pipeline
    participant RET as Return Pipeline
    participant MEM as Memory

    U->>O: "What's your return policy?"
    O->>FAQ: Route to FAQ (intent=faq, confidence=0.96)
    FAQ-->>O: "You can return within 30 days..."
    O->>MEM: Store: user asked about returns

    U->>O: "Ok, I want to return order #45678"
    Note over O: Re-classify: return_request (0.94)<br/>NOT faq anymore
    O->>MEM: Read: user was asking about return policy
    O->>RET: Route to Return Pipeline
    Note over RET: Context-aware: user already<br/>knows the policy, skip explanation
    RET-->>O: "I've initiated a return for order #45678..."

    U->>O: "Will I get a full refund?"
    Note over O: Classify: faq (0.88) vs return_followup (0.85)
    Note over O: Memory indicates active return → route to return_followup
    O->>RET: Route to Return Pipeline (context: active return)
    RET-->>O: "Yes, you'll receive a full refund within 5-7 business days"

Test Implementation

def test_scenario_3_intent_handoff(pipeline):
    """E2E: FAQ → Return Request → Return Follow-up transitions"""

    session_id = "sess_handoff_001"
    history = []

    # Turn 1: FAQ about return policy
    r1 = pipeline.process(
        query="What's your return policy?",
        user_id="test_user_003",
        session_id=session_id,
        conversation_history=history
    )
    assert r1.metadata.intent == "faq"
    assert "30 days" in r1.text or "return" in r1.text.lower()

    history.extend([
        {"role": "user", "content": "What's your return policy?"},
        {"role": "assistant", "content": r1.text}
    ])

    # Turn 2: Transition to actual return request
    r2 = pipeline.process(
        query="Ok, I want to return order #45678",
        user_id="test_user_003",
        session_id=session_id,
        conversation_history=history
    )
    assert r2.metadata.intent == "return_request"
    assert r2.metadata.intent != "faq"  # Must re-classify
    assert "45678" in r2.text  # Referenced the correct order
    # Should NOT repeat the return policy (user already knows it)
    assert r2.text.count("30 days") <= 1  # At most a brief mention

    history.extend([
        {"role": "user", "content": "Ok, I want to return order #45678"},
        {"role": "assistant", "content": r2.text}
    ])

    # Turn 3: Follow-up within return context
    r3 = pipeline.process(
        query="Will I get a full refund?",
        user_id="test_user_003",
        session_id=session_id,
        conversation_history=history
    )
    # Should be routed as a return follow-up, not generic FAQ
    assert r3.metadata.intent in ["return_followup", "return_request"]
    assert r3.metadata.memory_context_used is True
    # Response should be specific to the active return, not generic
    assert "refund" in r3.text.lower()

Scenario 4: Guardrail Trigger Mid-Pipeline

Description

The retriever returns a chunk containing competitor information (Barnes & Noble) that lives in the product catalog's editorial content. The guardrails must catch and filter it before it reaches the user, and the response must degrade gracefully without the contaminated chunk.

flowchart TD
    Q["Query: 'Where can I find manga deals?'"]
    Q --> IC["Intent: recommendation (0.89)"]
    IC --> RT["Retriever returns 3 chunks"]

    RT --> C1["Chunk 1: 'MangaAssist has weekly deals<br/>on popular series...'<br/>✅ Safe"]
    RT --> C2["Chunk 2: 'Barnes & Noble also offers<br/>competitive manga pricing...'<br/>❌ Competitor mention"]
    RT --> C3["Chunk 3: 'Check our seasonal sales<br/>for up to 40% off...'<br/>✅ Safe"]

    C1 --> GR["Guardrail Filter"]
    C2 --> GR
    C3 --> GR

    GR -->|"Filter competitor chunk"| PB["Prompt Builder<br/>receives only Chunk 1 + Chunk 3"]
    GR -->|"Log blocked chunk"| LOG["Audit Log:<br/>competitor_filter triggered<br/>chunk_id: editorial_042"]

    PB --> LLM["LLM generates response<br/>using 2 clean chunks"]
    LLM --> GR2["Post-generation guardrail"]
    GR2 -->|"✅ No competitor in response"| FM["Formatter"]
    FM --> RESP["Final response:<br/>Deals from our catalog only"]

    style C2 fill:#e17055,color:#fff
    style GR fill:#fdcb6e,color:#333
    style RESP fill:#00b894,color:#fff

Test Implementation

def test_scenario_4_guardrail_blocks_competitor_in_retrieval(pipeline):
    """E2E: Competitor content in RAG index → filtered → graceful response"""

    # Pre-condition: inject a contaminated chunk into the test index
    inject_test_chunk({
        "doc_id": "editorial_042",
        "text": "For manga deals, Barnes & Noble offers competitive pricing. "
                "Their membership program includes 10% off all manga titles.",
        "metadata": {"source": "editorial", "category": "deals"}
    })

    response = pipeline.process(
        query="Where can I find manga deals?",
        user_id="test_user_004",
        session_id="sess_guardrail_001",
        conversation_history=[]
    )

    # Competitor content must NOT appear in response
    assert "barnes" not in response.text.lower()
    assert "noble" not in response.text.lower()
    assert "membership" not in response.text.lower()  # Their program, not ours

    # Response should still be helpful (not empty or error)
    assert len(response.text) > 50
    assert response.metadata.intent == "recommendation"

    # Guardrail should have logged the filtering
    assert response.metadata.guardrail_filters_applied >= 1
    assert "competitor_filter" in response.metadata.guardrail_details

    # At least 1 chunk should have been used (the clean ones)
    assert len(response.metadata.chunks_used) >= 1
    # The competitor chunk should NOT be in the used chunks
    assert "editorial_042" not in [c.doc_id for c in response.metadata.chunks_used]

Scenario 5: Fallback Cascade

Description

A query where everything goes slightly wrong: the classifier has low confidence, the retriever returns no relevant results, and the system must gracefully cascade through fallback layers until it produces a helpful response.

flowchart TD
    Q["Query: 'ugh this is so frustrating<br/>nothing works'"]

    Q --> IC["Intent Classifier"]
    IC -->|"confidence = 0.42<br/>(below 0.80 threshold)"| FALLBACK1["Fallback: Use DistilBERT<br/>Stage 2 classifier"]

    FALLBACK1 -->|"intent = complaint (0.61)<br/>(still below 0.80)"| FALLBACK2["Fallback: Check memory<br/>for active context"]

    FALLBACK2 -->|"No active order,<br/>no recent topic"| FALLBACK3["Fallback: Route to<br/>empathetic response template"]

    FALLBACK3 --> TEMPLATE["Template Response:<br/>'I'm sorry you're having trouble.<br/>Could you tell me more about<br/>what's not working? I'm here to help.'"]

    TEMPLATE --> GR["Guardrails: ✅ Safe"]
    GR --> ESCALATION["Set escalation flag<br/>if next message is also frustrated"]

    style Q fill:#2d3436,color:#fff
    style FALLBACK1 fill:#fdcb6e,color:#333
    style FALLBACK2 fill:#f39c12,color:#fff
    style FALLBACK3 fill:#e17055,color:#fff
    style TEMPLATE fill:#00b894,color:#fff
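
As a sketch, the cascade above reduces to a small routing function. The 0.80 threshold matches the diagram; the injected classify_stage2 callable and memory mapping are assumptions standing in for the real components:

CONFIDENCE_THRESHOLD = 0.80
EMPATHY_TEMPLATE = (
    "I'm sorry you're having trouble. Could you tell me more about "
    "what's not working? I'm here to help."
)

def route_with_fallbacks(query: str, classify_stage2, memory: dict) -> dict:
    """Cascade: stage-2 classifier → memory context → empathy template."""
    intent, confidence = classify_stage2(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"route": intent, "source": "model", "llm_allowed": True}

    # Fallback 2: an active order or recent topic can disambiguate.
    context_intent = memory.get("active_intent")
    if context_intent:
        return {"route": context_intent, "source": "memory", "llm_allowed": True}

    # Fallback 3: canned empathetic response, no LLM call, and prime
    # escalation in case the next message is also frustrated.
    return {"route": "template", "source": "empathetic_fallback",
            "llm_allowed": False, "text": EMPATHY_TEMPLATE,
            "escalation_primed": True}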

Test Implementation

def test_scenario_5_fallback_cascade(pipeline):
    """E2E: Low confidence → no retrieval → template fallback"""

    response = pipeline.process(
        query="ugh this is so frustrating nothing works",
        user_id="test_user_005",
        session_id="sess_fallback_001",
        conversation_history=[]
    )

    # Classifier should have low confidence
    assert response.metadata.intent_confidence < 0.80

    # Pipeline should not have crashed
    assert response.text is not None
    assert len(response.text) > 20

    # Response should be empathetic and ask for clarification
    empathy_indicators = ["sorry", "help", "trouble", "understand", "tell me more"]
    assert any(word in response.text.lower() for word in empathy_indicators)

    # Should NOT hallucinate a specific response (no product, no order info)
    assert response.products == []
    assert response.order_info is None

    # Fallback path should be logged
    assert response.metadata.fallback_triggered is True
    assert response.metadata.response_source in ["template", "empathetic_fallback"]

    # LLM should NOT have been called (template response = $0)
    assert response.metadata.llm_called is False

    # Escalation readiness should be set for next turn
    assert response.metadata.escalation_primed is True

Scenario 6: Real-Time Data Staleness

Description

The chatbot recommends a product, but between retrieval and response delivery, the product goes out of stock. The real-time inventory check must catch this and either substitute or warn the user.

sequenceDiagram
    participant U as User
    participant RT as Retriever
    participant INV as Inventory API
    participant LLM as LLM
    participant FM as Formatter

    U->>RT: "Recommend me Attack on Titan volumes"
    RT-->>U: [Vol 1 (in-stock), Vol 2 (in-stock), Vol 3 (in-stock)]

    Note over U,FM: ⏱️ 200ms passes — inventory changes

    U->>INV: Check real-time inventory for 3 ASINs
    Note over INV: Vol 2 now OUT OF STOCK<br/>(someone bought the last copy)
    INV-->>U: [Vol 1: ✅, Vol 2: ❌, Vol 3: ✅]

    U->>LLM: Generate with 2 available + 1 note
    LLM-->>U: Response mentioning Vol 2 unavailability

    U->>FM: Format with availability badges
    FM-->>U: Product cards with ✅/❌ status

Test Implementation

def test_scenario_6_inventory_staleness(pipeline):
    """E2E: Product goes out-of-stock between retrieval and response"""

    # Simulate inventory change mid-pipeline
    with mock_inventory_change("B00AOTVOL2", status="out_of_stock", after_ms=100):
        response = pipeline.process(
            query="I want to buy Attack on Titan volumes 1-3",
            user_id="test_user_006",
            session_id="sess_stale_001",
            conversation_history=[]
        )

    # Response should acknowledge the out-of-stock item
    assert any(
        indicator in response.text.lower()
        for indicator in ["out of stock", "unavailable", "currently not available", "sold out"]
    )

    # Should NOT present the out-of-stock item as purchasable
    for product in response.products:
        if product.asin == "B00AOTVOL2":
            assert product.availability != "in_stock"
            assert product.availability in ["out_of_stock", "backordered", "unavailable"]

    # Should still recommend the available volumes
    available_asins = [p.asin for p in response.products if p.availability == "in_stock"]
    assert len(available_asins) >= 2

    # Should NOT hallucinate a price for the out-of-stock item
    # (or should show last known price with "was" prefix)
    assert response.metadata.stale_data_handled is True
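
mock_inventory_change is assumed harness code. One way to build it is a context manager that patches the inventory lookup so the given ASIN flips status once the delay has elapsed; the patch target pipeline.inventory.InventoryClient.get_status is hypothetical and should point at the real client:

import time
from contextlib import contextmanager
from unittest.mock import patch

@contextmanager
def mock_inventory_change(asin: str, status: str, after_ms: int):
    """Report `status` for `asin` once `after_ms` have passed since the
    test entered the block; report "in_stock" before that."""
    start = time.monotonic()

    def fake_get_status(self, requested_asin: str) -> str:
        elapsed_ms = (time.monotonic() - start) * 1000
        if requested_asin == asin and elapsed_ms >= after_ms:
            return status  # e.g. "out_of_stock"
        return "in_stock"

    with patch("pipeline.inventory.InventoryClient.get_status", fake_get_status):
        yield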

Scenario 7: Token Budget Overflow

Description

A complex multi-turn conversation with large retrieval context that approaches or exceeds the LLM's token budget. The prompt builder must intelligently truncate without losing critical information.

flowchart TD
    subgraph Input["Raw Input to Prompt Builder"]
        SYS["System Prompt<br/>500 tokens"]
        HIST["Conversation History (12 turns)<br/>3,200 tokens"]
        CTX["Retrieved Context (5 chunks)<br/>2,800 tokens"]
        QUERY["User Query<br/>50 tokens"]
        INST["Instructions<br/>200 tokens"]
    end

    SUM["Total: 6,750 tokens"]
    BUDGET["Budget: 4,000 tokens (input)<br/>+ 500 tokens (output)<br/>= 4,500 max"]

    Input --> SUM
    SUM -->|"OVER BUDGET by 2,250"| TRUNC["Truncation Strategy"]

    TRUNC --> S1["Step 1: Summarize old history<br/>12 turns → 3-turn summary<br/>3,200 → 400 tokens<br/>Saved: 2,800"]
    S1 --> S2["Step 2: Trim to top-3 chunks<br/>5 chunks → 3 chunks<br/>2,800 → 1,680 tokens<br/>Saved: 1,120"]
    S2 --> S3["Step 3: Keep system +<br/>query + instructions intact"]

    S3 --> FINAL["Final Prompt: 2,830 tokens<br/>✅ Within budget"]

    style SUM fill:#e17055,color:#fff
    style FINAL fill:#00b894,color:#fff
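
The diagram's strategy maps onto a simple priority scheme: some sections are untouchable, others shrink in a fixed order. A sketch under stated assumptions: estimate_tokens is a crude stand-in for the real tokenizer, summarize is an injected history-summarization callable, and chunks are dicts carrying "text" and "relevance_score":

INPUT_BUDGET = 4000  # tokens reserved for the prompt; output budget is separate

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token."""
    return max(1, len(text) // 4)

def fit_prompt(system, history, chunks, query, instructions, summarize):
    """System prompt, query, and instructions are never touched;
    history is summarized first, then low-relevance chunks are dropped."""
    fixed = sum(estimate_tokens(t) for t in (system, query, instructions))

    def total():
        return (fixed
                + sum(estimate_tokens(m["content"]) for m in history)
                + sum(estimate_tokens(c["text"]) for c in chunks))

    # Step 1: summarize old turns (e.g. 12 turns → a 3-turn summary).
    if total() > INPUT_BUDGET:
        history = summarize(history, keep_recent_turns=3)

    # Step 2: drop the lowest-relevance chunks until the prompt fits.
    chunks = sorted(chunks, key=lambda c: c["relevance_score"], reverse=True)
    while chunks and total() > INPUT_BUDGET:
        chunks.pop()

    return {"system": system, "history": history, "chunks": chunks,
            "query": query, "instructions": instructions}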

Test Implementation

def test_scenario_7_token_budget_overflow(pipeline):
    """E2E: Long conversation + large context → truncation without info loss"""

    # Build a 12-turn conversation history (GENRES is a seeded test-data
    # list with at least 12 genre names)
    long_history = []
    for i in range(12):
        long_history.extend([
            {"role": "user", "content": f"Turn {i}: Tell me about manga genre {GENRES[i]}"},
            {"role": "assistant", "content": f"Here's a detailed explanation of {GENRES[i]} manga " * 20}
        ])

    response = pipeline.process(
        query="Based on everything we discussed, what should I read first?",
        user_id="test_user_007",
        session_id="sess_overflow_001",
        conversation_history=long_history
    )

    # Response should succeed (not crash on token overflow)
    assert response.text is not None
    assert len(response.text) > 50

    # Prompt should be within budget
    assert response.metadata.prompt_tokens <= 4000

    # Critical information should be preserved
    # System prompt: always kept in full
    assert response.metadata.system_prompt_truncated is False
    # User query: always kept in full
    assert response.metadata.query_truncated is False
    # Safety instructions: always kept in full
    assert response.metadata.safety_instructions_present is True

    # History should be summarized, not just chopped
    assert response.metadata.history_strategy in ["summarized", "recent_only"]
    assert response.metadata.history_strategy != "truncated_raw"  # Don't just cut text mid-sentence

    # Context should be prioritized by relevance
    if response.metadata.chunks_truncated:
        assert response.metadata.chunks_ordered_by_relevance is True
        assert len(response.metadata.chunks_used) <= len(response.metadata.retrieved_chunks)

    # Response quality should not degrade significantly
    assert response.metadata.guardrail_passed is True

Scenario 8: Rate Limiting and Bedrock Throttling

Description

Simulates Amazon Bedrock returning a ThrottlingException due to rate limits. The pipeline must gracefully degrade — using cached responses, falling back to a simpler model, or returning a template response — without showing the user an error.

flowchart TD
    Q["User Query"]
    Q --> IC["Intent Classifier: recommendation"]
    IC --> RT["Retriever: 3 chunks"]
    RT --> PB["Prompt Builder: assembled"]
    PB --> LLM["Call Bedrock<br/>Claude 3.5 Sonnet"]

    LLM -->|"ThrottlingException!"| RETRY["Retry with backoff<br/>(max 2 retries)"]

    RETRY -->|"Still throttled"| FALLBACK{"Fallback Strategy"}

    FALLBACK -->|"Strategy 1"| CACHE["Check semantic cache<br/>for similar query"]
    FALLBACK -->|"Strategy 2"| LITE["Try lighter model<br/>(Claude 3 Haiku)"]
    FALLBACK -->|"Strategy 3"| TEMPLATE["Return template response<br/>with retrieved products"]

    CACHE -->|"Cache hit"| SERVE["Serve cached response<br/>(with freshness flag)"]
    CACHE -->|"Cache miss"| LITE

    LITE -->|"Success"| SERVE2["Serve lower-quality response<br/>(note: simpler model used)"]
    LITE -->|"Also throttled"| TEMPLATE

    TEMPLATE --> SERVE3["Serve template:<br/>'Based on your interest, here are<br/>some popular titles: [product list]'"]

    style LLM fill:#e17055,color:#fff
    style RETRY fill:#fdcb6e,color:#333
    style SERVE fill:#00b894,color:#fff
    style SERVE2 fill:#00b894,color:#fff
    style SERVE3 fill:#00b894,color:#fff

Test Implementation

def test_scenario_8_bedrock_throttling(pipeline):
    """E2E: Bedrock throttled → graceful degradation"""

    # Simulate Bedrock throttling all calls
    with mock_bedrock_throttle(exception="ThrottlingException"):
        response = pipeline.process(
            query="Recommend some popular shonen manga",
            user_id="test_user_008",
            session_id="sess_throttle_001",
            conversation_history=[]
        )

    # Should NOT return an error to the user
    assert response.error is None
    assert response.text is not None
    assert len(response.text) > 30

    # Should indicate degraded mode in metadata (not in user response)
    assert response.metadata.degraded_mode is True
    assert response.metadata.fallback_reason == "bedrock_throttled"

    # Should still return products (from retrieval, even without LLM generation)
    assert len(response.products) >= 1

    # Products should still have valid data from catalog
    for product in response.products:
        assert product.asin in CATALOG_ASINS
        assert product.price > 0

    # Retry count should be logged
    assert response.metadata.retry_count <= 2  # Max 2 retries before fallback

    # Latency should not be excessive (retries + fallback)
    assert response.metadata.total_latency_ms < 10000  # 10s max even with retries

    # Monitoring: this should have generated an alert
    assert response.metadata.alert_generated is True
    assert response.metadata.alert_type == "bedrock_throttling"
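
mock_bedrock_throttle is assumed harness code. With boto3 it can be built on botocore.stub.Stubber, which injects client errors without touching the network; how the stubbed client reaches the pipeline (dependency injection, patching) is left to your wiring:

import boto3
from contextlib import contextmanager
from botocore.stub import Stubber

@contextmanager
def mock_bedrock_throttle(exception: str = "ThrottlingException", calls: int = 3):
    """Make every invoke_model call raise the given error code.
    `calls` should cover the initial attempt plus max retries."""
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    stubber = Stubber(client)
    for _ in range(calls):
        stubber.add_client_error("invoke_model", service_error_code=exception)
    with stubber:
        yield client  # hand the stubbed client to the pipeline under test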

Running the Integration Test Suite

Execution Flow

flowchart TD
    START["Developer triggers<br/>integration tests"]
    START --> ENV["Start Docker Compose<br/>(OpenSearch + DynamoDB + Redis + Ollama)"]
    ENV --> SEED["Seed test data<br/>(100 products, vector index,<br/>conversation histories)"]
    SEED --> RUN["Run 50 E2E scenarios<br/>(pytest -m integration)"]

    RUN --> RESULTS["Collect results"]
    RESULTS --> REPORT["Generate report:<br/>- Scenarios passed/failed<br/>- Boundary assertion details<br/>- Latency breakdown<br/>- Coverage by intent"]

    REPORT --> GATE{"All critical<br/>scenarios pass?"}
    GATE -->|"Yes"| MERGE["Allow PR merge"]
    GATE -->|"No"| BLOCK["Block PR +<br/>detailed failure report"]

    ENV --> CLEANUP["Teardown Docker<br/>after tests complete"]

    style START fill:#2d3436,color:#fff
    style MERGE fill:#00b894,color:#fff
    style BLOCK fill:#e17055,color:#fff

Test Configuration

# conftest.py
import pytest

@pytest.fixture(scope="session")
def integration_env():
    """Start all local services for integration testing"""
    env = IntegrationEnvironment()
    env.start_docker_compose("docker-compose.test.yml")
    env.wait_for_healthy(timeout=60)
    env.seed_product_catalog(count=100)
    env.build_vector_index()
    env.seed_conversation_histories()
    yield env
    env.teardown()

@pytest.fixture
def pipeline(integration_env):
    """Fresh pipeline instance for each test"""
    return ChatbotPipeline(
        classifier=IntentClassifier.load("models/intent_v4"),
        retriever=RAGRetriever(endpoint=integration_env.opensearch_url),
        prompt_builder=PromptBuilder.load("prompts/v3"),
        llm=LocalLLMClient(model="llama3:8b", endpoint=integration_env.ollama_url),
        guardrails=GuardrailEngine.load("configs/guardrails_v6"),
        formatter=ResponseFormatter(),
        memory=ConversationMemory(endpoint=integration_env.dynamodb_url),
    )
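
The execution flow selects tests with pytest -m integration, which only works if the marker is registered and applied. A minimal sketch: pytest_configure registers the marker (a standard pytest hook), and a module-level pytestmark applies it to every test in the scenario file:

# conftest.py (continued)
def pytest_configure(config):
    # Register the marker so `pytest -m integration` selects these tests
    # without raising PytestUnknownMarkWarning.
    config.addinivalue_line("markers", "integration: full-pipeline E2E scenarios")

# test_e2e_scenarios.py
import pytest

pytestmark = pytest.mark.integration  # marks every test in this module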

Summary: Integration Test Coverage Matrix

| Scenario | Components Tested | Key Assertion | Failure Mode Caught |
|---|---|---|---|
| 1. Happy Path | All 6 | Products have valid ASINs from catalog | Cross-component data format mismatch |
| 2. Multi-Turn | Classifier + Memory + Retriever | Entity resolution across turns | Memory corruption, context loss |
| 3. Intent Handoff | Classifier + Orchestrator + Memory | Clean intent transition | Routing stickiness, context bleed |
| 4. Guardrail Mid-Pipeline | Retriever + Guardrails + LLM | Competitor content filtered | RAG contamination, filter bypass |
| 5. Fallback Cascade | Classifier + Retriever + Templates | Graceful degradation path | Crash on low confidence, empty response |
| 6. Data Staleness | Retriever + Inventory API + Formatter | Out-of-stock handled | Stale recommendation, false availability |
| 7. Token Overflow | History + Context + Prompt Builder | Intelligent truncation | Prompt exceeds budget, info loss |
| 8. Throttling | LLM + Cache + Fallback + Templates | User never sees error | Unhandled exception, blank response |