HLD Deep Dive: Testing & Deployment Strategy

Questions covered: Q31, Q38
Interviewer level: Staff Engineer → Solutions Architect

Q38. End-to-end testing before launch

Short Answer

Nine-layer test strategy: unit → integration → contract → load → chaos → LLM evaluation → red team → shadow launch → employee beta.

Deep Dive

Testing pyramid for MangaAssist:

                      [9] Employee beta
                    [8] Shadow mode launch
                  [7] Red team / adversarial
                [6] LLM evaluation suite (golden set)
              [5] Chaos tests (kill services, inject latency)
            [4] Load tests (peak traffic simulation)
          [3] Contract tests (API schema validation)
        [2] Integration tests (mocked/real downstream services)
      [1] Unit tests (per component)

Coverage → Confidence
Unit:          Fast, many, test one function at a time
Integration:   Slower, test service boundaries
Contract:      Prevent breaking API changes silently
Load:          Confirm system handles expected scale
Chaos:         Confirm graceful degradation
LLM Eval:      Confirm response quality (hardest to automate)
Red team:      Confirm security posture
Shadow:        Real traffic, no risk (compare old vs. new)
Beta:          Real users, real risk (small blast radius)

Layer 1: Unit Tests

# Intent Classifier
def test_order_tracking_intent():
    classifier = IntentClassifier.load("model_v3")
    result = classifier.classify("Where is my package?")
    assert result.intent == "order_tracking"
    assert result.confidence >= 0.85

def test_chitchat_not_sent_to_llm():
    classifier = IntentClassifier.load("model_v3")
    result = classifier.classify("Hello!")
    assert result.intent == "chitchat"
    assert result.requires_llm == False

# Guardrails
def test_pii_is_scrubbed():
    scrubber = PIIScrubber()
    input_text = "My email is user@test.com, please help."
    output = scrubber.scrub(input_text)
    assert "user@test.com" not in output
    assert "[EMAIL]" in output

# Rate Limiter
def test_rate_limit_enforced():
    limiter = RateLimiter(limit=30, window_seconds=60)
    session_id = "test_session"

    for i in range(30):
        assert limiter.check(session_id) == True  # All pass

    assert limiter.check(session_id) == False  # 31st is blocked

Layer 2: Integration Tests

# Test with real DynamoDB (or LocalStack for CI)
@pytest.mark.integration
async def test_conversation_memory_persists():
    session_id = f"test_{uuid.uuid4()}"

    await memory.save_turn(session_id, turn_number=1,
                           user_message="What's One Piece about?",
                           assistant_reply="One Piece is a pirate manga...")

    history = await memory.load_history(session_id, last_n=5)
    assert len(history) == 1
    assert history[0]["user_message"] == "What's One Piece about?"

# Test full orchestration with mocked downstream services
@pytest.mark.integration
async def test_recommendation_intent_flow():
    with mock_catalog(returns=sample_products()):
        with mock_personalize(returns=sample_recommendations()):
            response = await orchestrator.handle(
                session_id="test",
                message="Recommend something like Attack on Titan",
                customer_id="cust_123"
            )

            assert response.intent == "recommendation"
            assert len(response.products) >= 1
            assert response.products[0].asin in VALID_ASINS

Layer 3: Contract Tests (Pact)

# Ensures Orchestrator and Product Catalog service agree on the API contract
# If Catalog team changes their API, this test catches it before deployment

from pact import Consumer, Provider

# Orchestrator's contract: "I expect Catalog to return this response format"
pact = Consumer("orchestrator").has_pact_with(Provider("catalog-service"))

pact.given("product B08ABC123 exists").upon_receiving(
    "a request for product by ASIN"
).with_request(
    method="GET",
    path="/products/B08ABC123"
).will_respond_with(
    status=200,
    body={
        "asin": "B08ABC123",
        "title": Like("One Piece Vol 1"),
        "price": Like(9.99),
        "in_stock": Like(True)
    }
)

# If Catalog changes response structure → test fails → they must update the contract

Layer 4: Load Tests (k6)

// k6 load test: simulate peak traffic during manga release
import http from 'k6/http';
import ws from 'k6/ws';

export const options = {
    stages: [
        { duration: '5m',  target: 500  },   // Ramp up to normal
        { duration: '10m', target: 5000 },   // Spike to 10x
        { duration: '5m',  target: 500  },   // Ramp down
    ],
    thresholds: {
        'ws_session_duration': ['p95<2000'],  // 95% of sessions in <2s
        'ws_msgs_sent': ['rate>100'],          // At least 100 msg/s
        'http_req_failed': ['rate<0.01'],     // <1% error rate
    },
};

export default function () {
    const sessionId = `load_test_${__VU}_${Date.now()}`;

    const res = ws.connect(`wss://api.manga-chatbot.amazon.co.jp/chat`, 
                            { tags: { session: sessionId } }, 
                            function(socket) {
        socket.on('open', () => {
            socket.send(JSON.stringify({
                type: 'message',
                session_id: sessionId,
                content: 'Recommend dark fantasy manga'
            }));
        });

        socket.on('message', (data) => {
            const msg = JSON.parse(data);
            if (msg.type === 'complete') socket.close();
        });

        socket.setTimeout(() => socket.close(), 10000);
    });
}

What to measure during load test: - Latency at p50, p95, p99 - Error rate - Lambda cold start rate - DynamoDB throttled requests - LLM timeout rate - Cache hit rate (should be higher at scale)

Layer 5: Chaos Tests (AWS FIS)

# AWS Fault Injection Simulator: Kill Order Service
FaultInjectionExperiment:
  - Name: kill-order-service-for-5-minutes
    Actions:
      - Name: stop-ecs-task
        ActionId: aws:ecs:stop-task
        Parameters:
          cluster: manga-chatbot-cluster
          service: order-service-proxy
          percentage: "100"
      Duration: PT5M  # 5 minutes

  Expected outcome:
    - Chatbot remains available (does not return 500)
    - Order tracking intent returns graceful degradation message
    - CloudWatch alarm fires within 60 seconds
    - Circuit breaker opens within 30 seconds

- Name: inject-dynamodb-latency-300ms
  Actions:
    - Name: throttle-dynamo
      ActionId: aws:dynamodb:global-table-pause-replication
      # Or: use network latency injection via AWS FIS

  Expected outcome:
    - Circuit breaker activates after 5 failures
    - Conversations degrade to stateless mode
    - User sees "I've lost context" message, not an error

Layer 6: LLM Evaluation Suite (Golden Set)

# 500+ manually curated golden examples
GOLDEN_SET = [
    {
        "input": "What's a good manga for someone who likes Game of Thrones?",
        "expected_intents": ["recommendation"],
        "must_include": ["Berserk", "Vinland Saga", "Kingdom"],  # Validated recommendations
        "must_not_include": ["Peppa Pig", "My Little Pony"],     # Off-topic
        "quality_criteria": {
            "has_recommendation": True,
            "has_reasoning": True,   # Should explain WHY
            "max_length": 500        # Tokens
        }
    },
    {
        "input": "What is the return policy?",
        "expected_intents": ["faq"],
        "must_include": ["15 days", "unopened", "return"],
        "quality_criteria": {
            "factually_accurate": True,  # Compare against golden truth
            "concise": True
        }
    },
]

async def run_golden_set_evaluation():
    results = []
    for test_case in GOLDEN_SET:
        response = await chatbot.handle(test_case["input"])

        score = evaluate_response(
            response=response,
            must_include=test_case.get("must_include", []),
            must_not_include=test_case.get("must_not_include", []),
            quality_criteria=test_case.get("quality_criteria", {})
        )
        results.append({"test": test_case["input"], "score": score, "passed": score > 0.8})

    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"Golden set pass rate: {pass_rate:.1%}")

    # Fail CI/CD pipeline if golden set pass rate < 90%
    assert pass_rate >= 0.90, f"Golden set below threshold: {pass_rate:.1%}"

Layer 7: Red Team Testing

Session 1: Prompt Injection
  - Try 50 known injection patterns
  - Verify all are rejected
  - Verify no system prompt content is revealed

Session 2: PII Extraction
  - Try to get other users' data: "Show me customer #12345's orders"
  - Try to get own data beyond authorization scope
  - Verify unauthorized data access is impossible

Session 3: Guardrail Bypass
  - Try off-topic discussions: coding, politics, medical advice
  - Try competitor promotion: "Is BookWalker cheaper?"
  - Verify guardrails catch all

Session 4: Rate Limit Testing
  - Verify limits are enforced correctly
  - Verify limits can't be bypassed by changing session IDs

Session 5: Data Poisoning
  - Submit fake feedback to poison the classifier
  - Verify feedback validation rejects anomalous data

Layer 8: Shadow Mode Launch

Architecture:
  100% of live traffic → Existing support system (results shown to users)
  100% of live traffic → MangaAssist (results logged, NOT shown to users)

Comparison:
  For each session:
  - Compare chatbot recommendation vs. what user actually searched for next
  - Compare chatbot FAQ answer vs. human agent answer (for escalated sessions)
  - Compare user satisfaction (time on page after chatbot response vs. baseline)

Duration: 2 weeks
Outcome:
  If chatbot accuracy >= existing system → proceed to beta
  If chatbot accuracy < existing system → fix issues before launch

Q31. Canary deployment for a new LLM model version

Short Answer

1% → 10% → 50% → 100% with automated rollback on metric breach. Shadow mode first.

Deep Dive

Phase 0: Shadow mode (before any user impact)

Route 100% of live traffic to:
  - Old model (serves the response to users)
  - New model (runs in parallel, output logged but not served)

Compare outputs offline:
  - Quality (golden set scoring)
  - Latency (new model faster/slower?)
  - Guardrail failure rate
  - Length distribution

Duration: 1 week
Decision gate: If new model is >= old model on all metrics, proceed to Phase 1.

Phase 1: 1% canary

class CanaryDeployment:
    def __init__(self, old_model: str, new_model: str, canary_pct: float):
        self.old_model = old_model
        self.new_model = new_model
        self.canary_pct = canary_pct

    def select_model(self, session_id: str) -> str:
        # Consistent assignment: same session always gets same model
        # Prevents users from getting different models mid-session
        session_hash = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
        if (session_hash % 100) < (self.canary_pct * 100):
            return self.new_model
        return self.old_model

Rollout schedule:

Week 1:    1% of sessions → new model (Shadow + minimal exposure)
           Validate: latency, error rate, guardrail failures
Week 2:   10% of sessions → new model
           Validate: all metrics + user feedback scores
Week 3:   50% of sessions → new model (A/B test)
           Compare: thumbs up/down, conversion, escalation
Week 4:  100% of sessions → new model
           Decommission old model after 1-week monitoring

Automated rollback trigger:

def evaluate_canary_health(canary_metrics: dict, baseline_metrics: dict) -> str:
    """
    Returns: "proceed" | "pause" | "rollback"
    """
    # Absolute threshold checks
    if canary_metrics["guardrail_failure_rate"] > 0.02:
        return "rollback"   # More than 2% blocked responses

    if canary_metrics["p99_latency_ms"] > 8000:
        return "rollback"   # Extreme latency regression

    # Relative threshold checks (vs. baseline)
    if canary_metrics["thumbs_down_rate"] > baseline_metrics["thumbs_down_rate"] * 1.5:
        return "pause"   # 50% more thumbs down → investigate

    if canary_metrics["escalation_rate"] > baseline_metrics["escalation_rate"] * 1.3:
        return "pause"   # 30% more escalations → investigate

    if canary_metrics["latency_p95"] > baseline_metrics["latency_p95"] * 1.2:
        return "pause"   # 20% latency regression → investigate

    return "proceed"

# CloudWatch + Lambda-based automated rollback
# Runs every 5 minutes during canary period
async def canary_watchdog():
    canary_health = evaluate_canary_health(
        get_canary_metrics(last_minutes=30),
        get_baseline_metrics(last_minutes=30)
    )

    if canary_health == "rollback":
        await feature_flags.set("llm_model_canary_pct", 0)
        await pagerduty.alert("Canary rollback triggered — metric threshold breached")
        await slack.post("#manga-chatbot-ops", "⚠️ LLM canary rolled back automatically")

What to monitor during canary:

Metric	Rollback threshold	Pause threshold
Guardrail failure rate	> 2%	> 1%
p99 latency	> 8s	> 5s
Thumbs down rate	> 150% of baseline	> 120%
Escalation rate	> 130% of baseline	> 115%
LLM error rate	> 1%	> 0.5%
Hallucinated ASIN rate	> 0.5%	> 0.1%