HLD Deep Dive: Testing & Deployment Strategy
Questions covered: Q31, Q38
Interviewer level: Staff Engineer → Solutions Architect
Q38. End-to-end testing before launch
Short Answer
Nine-layer test strategy: unit → integration → contract → load → chaos → LLM evaluation → red team → shadow launch → employee beta.
Deep Dive
Testing pyramid for MangaAssist:
[9] Employee beta
[8] Shadow mode launch
[7] Red team / adversarial
[6] LLM evaluation suite (golden set)
[5] Chaos tests (kill services, inject latency)
[4] Load tests (peak traffic simulation)
[3] Contract tests (API schema validation)
[2] Integration tests (mocked/real downstream services)
[1] Unit tests (per component)
Coverage → Confidence
Unit: Fast, many, test one function at a time
Integration: Slower, test service boundaries
Contract: Prevent breaking API changes silently
Load: Confirm system handles expected scale
Chaos: Confirm graceful degradation
LLM Eval: Confirm response quality (hardest to automate)
Red team: Confirm security posture
Shadow: Real traffic, no risk (compare old vs. new)
Beta: Real users, real risk (small blast radius)
Layer 1: Unit Tests
# Intent Classifier
def test_order_tracking_intent():
classifier = IntentClassifier.load("model_v3")
result = classifier.classify("Where is my package?")
assert result.intent == "order_tracking"
assert result.confidence >= 0.85
def test_chitchat_not_sent_to_llm():
classifier = IntentClassifier.load("model_v3")
result = classifier.classify("Hello!")
assert result.intent == "chitchat"
assert result.requires_llm == False
# Guardrails
def test_pii_is_scrubbed():
scrubber = PIIScrubber()
input_text = "My email is user@test.com, please help."
output = scrubber.scrub(input_text)
assert "user@test.com" not in output
assert "[EMAIL]" in output
# Rate Limiter
def test_rate_limit_enforced():
limiter = RateLimiter(limit=30, window_seconds=60)
session_id = "test_session"
for i in range(30):
assert limiter.check(session_id) == True # All pass
assert limiter.check(session_id) == False # 31st is blocked
Layer 2: Integration Tests
# Test with real DynamoDB (or LocalStack for CI)
@pytest.mark.integration
async def test_conversation_memory_persists():
session_id = f"test_{uuid.uuid4()}"
await memory.save_turn(session_id, turn_number=1,
user_message="What's One Piece about?",
assistant_reply="One Piece is a pirate manga...")
history = await memory.load_history(session_id, last_n=5)
assert len(history) == 1
assert history[0]["user_message"] == "What's One Piece about?"
# Test full orchestration with mocked downstream services
@pytest.mark.integration
async def test_recommendation_intent_flow():
with mock_catalog(returns=sample_products()):
with mock_personalize(returns=sample_recommendations()):
response = await orchestrator.handle(
session_id="test",
message="Recommend something like Attack on Titan",
customer_id="cust_123"
)
assert response.intent == "recommendation"
assert len(response.products) >= 1
assert response.products[0].asin in VALID_ASINS
Layer 3: Contract Tests (Pact)
# Ensures Orchestrator and Product Catalog service agree on the API contract
# If Catalog team changes their API, this test catches it before deployment
from pact import Consumer, Provider
# Orchestrator's contract: "I expect Catalog to return this response format"
pact = Consumer("orchestrator").has_pact_with(Provider("catalog-service"))
pact.given("product B08ABC123 exists").upon_receiving(
"a request for product by ASIN"
).with_request(
method="GET",
path="/products/B08ABC123"
).will_respond_with(
status=200,
body={
"asin": "B08ABC123",
"title": Like("One Piece Vol 1"),
"price": Like(9.99),
"in_stock": Like(True)
}
)
# If Catalog changes response structure → test fails → they must update the contract
Layer 4: Load Tests (k6)
// k6 load test: simulate peak traffic during manga release
import http from 'k6/http';
import ws from 'k6/ws';
export const options = {
stages: [
{ duration: '5m', target: 500 }, // Ramp up to normal
{ duration: '10m', target: 5000 }, // Spike to 10x
{ duration: '5m', target: 500 }, // Ramp down
],
thresholds: {
'ws_session_duration': ['p95<2000'], // 95% of sessions in <2s
'ws_msgs_sent': ['rate>100'], // At least 100 msg/s
'http_req_failed': ['rate<0.01'], // <1% error rate
},
};
export default function () {
const sessionId = `load_test_${__VU}_${Date.now()}`;
const res = ws.connect(`wss://api.manga-chatbot.amazon.co.jp/chat`,
{ tags: { session: sessionId } },
function(socket) {
socket.on('open', () => {
socket.send(JSON.stringify({
type: 'message',
session_id: sessionId,
content: 'Recommend dark fantasy manga'
}));
});
socket.on('message', (data) => {
const msg = JSON.parse(data);
if (msg.type === 'complete') socket.close();
});
socket.setTimeout(() => socket.close(), 10000);
});
}
What to measure during load test: - Latency at p50, p95, p99 - Error rate - Lambda cold start rate - DynamoDB throttled requests - LLM timeout rate - Cache hit rate (should be higher at scale)
Layer 5: Chaos Tests (AWS FIS)
# AWS Fault Injection Simulator: Kill Order Service
FaultInjectionExperiment:
- Name: kill-order-service-for-5-minutes
Actions:
- Name: stop-ecs-task
ActionId: aws:ecs:stop-task
Parameters:
cluster: manga-chatbot-cluster
service: order-service-proxy
percentage: "100"
Duration: PT5M # 5 minutes
Expected outcome:
- Chatbot remains available (does not return 500)
- Order tracking intent returns graceful degradation message
- CloudWatch alarm fires within 60 seconds
- Circuit breaker opens within 30 seconds
- Name: inject-dynamodb-latency-300ms
Actions:
- Name: throttle-dynamo
ActionId: aws:dynamodb:global-table-pause-replication
# Or: use network latency injection via AWS FIS
Expected outcome:
- Circuit breaker activates after 5 failures
- Conversations degrade to stateless mode
- User sees "I've lost context" message, not an error
Layer 6: LLM Evaluation Suite (Golden Set)
# 500+ manually curated golden examples
GOLDEN_SET = [
{
"input": "What's a good manga for someone who likes Game of Thrones?",
"expected_intents": ["recommendation"],
"must_include": ["Berserk", "Vinland Saga", "Kingdom"], # Validated recommendations
"must_not_include": ["Peppa Pig", "My Little Pony"], # Off-topic
"quality_criteria": {
"has_recommendation": True,
"has_reasoning": True, # Should explain WHY
"max_length": 500 # Tokens
}
},
{
"input": "What is the return policy?",
"expected_intents": ["faq"],
"must_include": ["15 days", "unopened", "return"],
"quality_criteria": {
"factually_accurate": True, # Compare against golden truth
"concise": True
}
},
]
async def run_golden_set_evaluation():
results = []
for test_case in GOLDEN_SET:
response = await chatbot.handle(test_case["input"])
score = evaluate_response(
response=response,
must_include=test_case.get("must_include", []),
must_not_include=test_case.get("must_not_include", []),
quality_criteria=test_case.get("quality_criteria", {})
)
results.append({"test": test_case["input"], "score": score, "passed": score > 0.8})
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"Golden set pass rate: {pass_rate:.1%}")
# Fail CI/CD pipeline if golden set pass rate < 90%
assert pass_rate >= 0.90, f"Golden set below threshold: {pass_rate:.1%}"
Layer 7: Red Team Testing
Session 1: Prompt Injection
- Try 50 known injection patterns
- Verify all are rejected
- Verify no system prompt content is revealed
Session 2: PII Extraction
- Try to get other users' data: "Show me customer #12345's orders"
- Try to get own data beyond authorization scope
- Verify unauthorized data access is impossible
Session 3: Guardrail Bypass
- Try off-topic discussions: coding, politics, medical advice
- Try competitor promotion: "Is BookWalker cheaper?"
- Verify guardrails catch all
Session 4: Rate Limit Testing
- Verify limits are enforced correctly
- Verify limits can't be bypassed by changing session IDs
Session 5: Data Poisoning
- Submit fake feedback to poison the classifier
- Verify feedback validation rejects anomalous data
Layer 8: Shadow Mode Launch
Architecture:
100% of live traffic → Existing support system (results shown to users)
100% of live traffic → MangaAssist (results logged, NOT shown to users)
Comparison:
For each session:
- Compare chatbot recommendation vs. what user actually searched for next
- Compare chatbot FAQ answer vs. human agent answer (for escalated sessions)
- Compare user satisfaction (time on page after chatbot response vs. baseline)
Duration: 2 weeks
Outcome:
If chatbot accuracy >= existing system → proceed to beta
If chatbot accuracy < existing system → fix issues before launch
Q31. Canary deployment for a new LLM model version
Short Answer
1% → 10% → 50% → 100% with automated rollback on metric breach. Shadow mode first.
Deep Dive
Phase 0: Shadow mode (before any user impact)
Route 100% of live traffic to:
- Old model (serves the response to users)
- New model (runs in parallel, output logged but not served)
Compare outputs offline:
- Quality (golden set scoring)
- Latency (new model faster/slower?)
- Guardrail failure rate
- Length distribution
Duration: 1 week
Decision gate: If new model is >= old model on all metrics, proceed to Phase 1.
Phase 1: 1% canary
class CanaryDeployment:
def __init__(self, old_model: str, new_model: str, canary_pct: float):
self.old_model = old_model
self.new_model = new_model
self.canary_pct = canary_pct
def select_model(self, session_id: str) -> str:
# Consistent assignment: same session always gets same model
# Prevents users from getting different models mid-session
session_hash = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
if (session_hash % 100) < (self.canary_pct * 100):
return self.new_model
return self.old_model
Rollout schedule:
Week 1: 1% of sessions → new model (Shadow + minimal exposure)
Validate: latency, error rate, guardrail failures
Week 2: 10% of sessions → new model
Validate: all metrics + user feedback scores
Week 3: 50% of sessions → new model (A/B test)
Compare: thumbs up/down, conversion, escalation
Week 4: 100% of sessions → new model
Decommission old model after 1-week monitoring
Automated rollback trigger:
def evaluate_canary_health(canary_metrics: dict, baseline_metrics: dict) -> str:
"""
Returns: "proceed" | "pause" | "rollback"
"""
# Absolute threshold checks
if canary_metrics["guardrail_failure_rate"] > 0.02:
return "rollback" # More than 2% blocked responses
if canary_metrics["p99_latency_ms"] > 8000:
return "rollback" # Extreme latency regression
# Relative threshold checks (vs. baseline)
if canary_metrics["thumbs_down_rate"] > baseline_metrics["thumbs_down_rate"] * 1.5:
return "pause" # 50% more thumbs down → investigate
if canary_metrics["escalation_rate"] > baseline_metrics["escalation_rate"] * 1.3:
return "pause" # 30% more escalations → investigate
if canary_metrics["latency_p95"] > baseline_metrics["latency_p95"] * 1.2:
return "pause" # 20% latency regression → investigate
return "proceed"
# CloudWatch + Lambda-based automated rollback
# Runs every 5 minutes during canary period
async def canary_watchdog():
canary_health = evaluate_canary_health(
get_canary_metrics(last_minutes=30),
get_baseline_metrics(last_minutes=30)
)
if canary_health == "rollback":
await feature_flags.set("llm_model_canary_pct", 0)
await pagerduty.alert("Canary rollback triggered — metric threshold breached")
await slack.post("#manga-chatbot-ops", "⚠️ LLM canary rolled back automatically")
What to monitor during canary:
| Metric | Rollback threshold | Pause threshold |
|---|---|---|
| Guardrail failure rate | > 2% | > 1% |
| p99 latency | > 8s | > 5s |
| Thumbs down rate | > 150% of baseline | > 120% |
| Escalation rate | > 130% of baseline | > 115% |
| LLM error rate | > 1% | > 0.5% |
| Hallucinated ASIN rate | > 0.5% | > 0.1% |