
Scenario 04 — Latency–Quality Ratio Analysis

Parent: 02-model-evaluation-optimal-configuration · AIP-C01 Skill 5.1.2

Context

MangaAssist on Amazon.com has strict latency SLAs to maintain a responsive customer experience:

| Metric | Target |
| --- | --- |
| Total end-to-end latency (P99) | < 5,000 ms |
| Intent classification | < 100 ms |
| RAG retrieval (OpenSearch Serverless) | < 500 ms |
| LLM generation (Bedrock) | < 3,000 ms |
| ElastiCache Redis lookup | < 5 ms |
| DynamoDB read/write | < 10 ms |

The team must balance response quality against latency constraints. Claude 3.5 Sonnet produces higher-quality responses but is slower; Claude 3 Haiku is faster but may sacrifice quality on complex intents. Streaming mode, cold-start behavior on ECS Fargate, and OpenSearch Serverless OCU scaling all affect end-to-end latency.


Questions (12 Total)

Easy (3)

Q1. Break down MangaAssist's total end-to-end latency budget (P99 < 5s) into its constituent components. If RAG retrieval takes 450ms at P99 and intent classification takes 80ms, how much of the budget remains for LLM generation? Why is it important to track P50 vs P99 separately?
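A minimal sketch of the budget arithmetic for Q1, using the figures given in the question plus the cache and DynamoDB targets from the SLA table; treating those overheads as consumed on every request is an assumption.

```python
# Remaining P99 budget for LLM generation, using the Q1 figures.
# Component names and the inclusion of Redis/DynamoDB overhead are assumptions.
SLA_P99_MS = 5_000

measured_p99_ms = {
    "intent_classification": 80,
    "rag_retrieval": 450,
    "redis_lookup": 5,      # SLA-table target, assumed spent on every request
    "dynamodb_write": 10,   # SLA-table target
}

remaining_for_llm = SLA_P99_MS - sum(measured_p99_ms.values())
print(f"Budget left for LLM generation: {remaining_for_llm} ms")  # 4455 ms
```

P50 and P99 should still be tracked separately because the arithmetic above only bounds the worst case; the typical request uses far less of the budget.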

Q2. Explain the difference between streaming and non-streaming mode for Bedrock LLM responses in MangaAssist. What is Time-to-First-Token (TTFT), and why does it matter more than total generation time for the user experience? Which intents benefit most from streaming?
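For context on TTFT, below is a minimal sketch of measuring it against Bedrock's streaming API; the model ID, prompt, and payload shape are illustrative assumptions and error handling is omitted.

```python
import json
import time

import boto3

# Measure TTFT vs total generation time with the Bedrock streaming API (sketch).
bedrock = boto3.client("bedrock-runtime")

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Recommend a manga similar to Vinland Saga."}],
})

start = time.perf_counter()
response = bedrock.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=body,
)

ttft_ms = None
for event in response["body"]:
    if ttft_ms is None:
        # First streamed chunk arrives: this is what the user perceives as "response started".
        ttft_ms = (time.perf_counter() - start) * 1000
total_ms = (time.perf_counter() - start) * 1000
print(f"TTFT: {ttft_ms:.0f} ms, total generation: {total_ms:.0f} ms")
```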

Q3. MangaAssist uses ElastiCache Redis for response caching. When a cache hit occurs, the total latency drops from ~4s to ~50ms. How does the cache hit rate affect the aggregate P50 and P99 latency metrics? If the cache hit rate is 40%, what are the effective P50 and P99?
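One quick way to reason about Q3 is to simulate the hit/miss mixture and read off the combined percentiles; the latency distributions below are assumed shapes, not measured data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed latency distributions (ms): cache hits ~50 ms, misses ~4 s.
hits = rng.normal(loc=50, scale=10, size=40_000)       # 40% hit rate
misses = rng.normal(loc=4_000, scale=600, size=60_000)  # 60% miss rate
combined = np.concatenate([hits, misses])

p50, p99 = np.percentile(combined, [50, 99])
print(f"Effective P50: {p50:.0f} ms, effective P99: {p99:.0f} ms")
# With only a 40% hit rate the median request is still a cache miss,
# so P50 remains near the miss latency and P99 is essentially unchanged.
```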


Medium (3)

Q4. MangaAssist's ECS Fargate tasks experience cold starts when scaling from zero or when new tasks are provisioned during traffic spikes. A cold start adds 2–5 seconds of latency for the first request. Design a strategy to mitigate cold-start impact. Consider: (a) minimum task count configuration, (b) pre-warming with synthetic requests, (c) Application Auto Scaling predictive policies, and (d) how cold starts interact with the P99 latency SLA.
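One concrete lever for (a) is keeping a non-zero warm floor of tasks via Application Auto Scaling; the cluster and service names, capacity values, and cron schedule below are placeholders.

```python
import boto3

# Keep a warm floor of Fargate tasks so traffic spikes never hit a cold start
# (sketch; cluster/service names and capacity values are assumptions).
autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/mangaassist-cluster/mangaassist-orchestrator",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,    # never scale to zero
    MaxCapacity=20,
)

# Raise the floor ahead of a known traffic peak.
autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="pre-warm-evening-peak",
    ResourceId="service/mangaassist-cluster/mangaassist-orchestrator",
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 18 * * ? *)",
    ScalableTargetAction={"MinCapacity": 6, "MaxCapacity": 20},
)
```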

Q5. Compare the latency profile of Sonnet vs Haiku for MangaAssist's product_discovery intent. Sonnet averages 2,200ms generation time (P99: 3,800ms) while Haiku averages 400ms (P99: 900ms). The product_discovery intent also includes RAG retrieval (P99: 500ms) and context re-ranking (P99: 150ms). Does Sonnet fit within the 5s P99 budget? If not, what optimizations would you propose that don't involve switching to Haiku?
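The budget check in Q5 is simple addition over the per-component P99s (a conservative composition, since worst cases rarely align); the intent-classification and cache/DB overheads reuse the SLA-table targets and are otherwise assumptions.

```python
# P99 budget check for product_discovery with Sonnet, using the Q5 figures
# plus assumed overhead from the SLA table.
SLA_P99_MS = 5_000

sonnet_path_ms = {
    "intent_classification": 100,
    "rag_retrieval": 500,
    "context_reranking": 150,
    "llm_generation_sonnet": 3_800,
    "redis_and_dynamodb": 15,
}

total = sum(sonnet_path_ms.values())
print(f"Sonnet P99 path: {total} ms -> {'within' if total <= SLA_P99_MS else 'over'} budget")
# 4565 ms: nominally inside 5 s, but with little headroom for cold starts
# or OCU scale-up spikes.
```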

Q6. OpenSearch Serverless auto-scales OCUs (OpenSearch Compute Units) based on traffic. During scale-up events, retrieval latency spikes from P99 of 500ms to 1,200ms for 2–3 minutes. How does this affect MangaAssist's end-to-end latency SLA? Design a monitoring and mitigation strategy. Should MangaAssist over-provision OCUs during known traffic peaks (e.g., new anime season launches)?
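A starting point for the monitoring half of Q6 is a CloudWatch alarm on retrieval latency. The sketch below alarms on a custom metric emitted by the RAG layer itself; the namespace, metric name, threshold, and SNS topic ARN are all assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when retrieval P99 exceeds the 500 ms target for three consecutive minutes
# (sketch; metric namespace/name and the SNS topic are illustrative assumptions).
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-rag-retrieval-p99",
    Namespace="MangaAssist",
    MetricName="RetrievalLatencyMs",
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:mangaassist-oncall"],
)
```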


Hard (3)

Q7. Design a latency–quality scoring function for MangaAssist that produces a single composite score for any model-intent configuration. The function should: (a) penalize configurations that violate the P99 latency SLA, (b) reward higher quality scores (CSAT, task completion), (c) apply a diminishing returns curve for quality beyond "good enough," and (d) account for the difference between P50 (typical user experience) and P99 (worst-case). Provide the mathematical formulation and explain how you'd calibrate the weights using production data.
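One possible shape for such a composite score, written as a small function so the weights can later be fit against production data; the logistic saturation, exponential penalty, and all numeric parameters are assumptions to be calibrated, not a prescribed formula.

```python
import math

def composite_score(quality: float, p50_ms: float, p99_ms: float,
                    sla_ms: float = 5_000, w_p50: float = 0.3, w_p99: float = 0.7,
                    quality_midpoint: float = 0.7, quality_slope: float = 10.0) -> float:
    """Latency-quality score in [0, 1]; every curve parameter is an assumption.

    - quality: normalized quality metric (e.g. CSAT / task completion) in [0, 1]
    - diminishing returns: logistic saturation around `quality_midpoint`
    - latency: exponential penalty once P99 exceeds the SLA, mild linear penalty on P50
    """
    # Diminishing returns on quality beyond "good enough".
    q = 1.0 / (1.0 + math.exp(-quality_slope * (quality - quality_midpoint)))

    # P99 penalty: 1.0 inside the SLA, decaying exponentially past it.
    p99_pen = 1.0 if p99_ms <= sla_ms else math.exp(-(p99_ms - sla_ms) / 1_000)

    # P50 penalty: linear in how much of the budget the typical request consumes.
    p50_pen = max(0.0, 1.0 - p50_ms / sla_ms)

    return q * (w_p50 * p50_pen + w_p99 * p99_pen)

# Example: a Sonnet-like vs a Haiku-like configuration on the same intent.
print(composite_score(quality=0.90, p50_ms=2_800, p99_ms=4_600))
print(composite_score(quality=0.78, p50_ms=900, p99_ms=1_600))
```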

Q8. MangaAssist wants to implement adaptive timeout and fallback logic. If Sonnet doesn't respond within 3 seconds for a given request, the system should: (a) cancel the Sonnet request, (b) re-issue to Haiku, and (c) still stay within the 5s total budget. Design this fallback architecture. What is the probability of triggering fallback based on Sonnet's latency distribution? What is the quality impact when ~5% of requests are silently served by Haiku instead of Sonnet? How do you track and report on fallback rates?
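A minimal sketch of the timeout-and-fallback control flow using asyncio; the `invoke_sonnet` / `invoke_haiku` coroutines are stand-ins for the real Bedrock calls, and their simulated latencies are assumptions.

```python
import asyncio

SONNET_TIMEOUT_S = 3.0  # cancel Sonnet here to preserve the 5 s total budget

async def invoke_sonnet(prompt: str) -> str:
    await asyncio.sleep(3.8)   # simulated slow generation (placeholder for Bedrock call)
    return "sonnet response"

async def invoke_haiku(prompt: str) -> str:
    await asyncio.sleep(0.9)   # simulated fast generation (placeholder for Bedrock call)
    return "haiku response"

async def generate_with_fallback(prompt: str, metrics: dict) -> str:
    try:
        # Primary path: Sonnet, hard-capped at 3 s; wait_for cancels the task on timeout.
        return await asyncio.wait_for(invoke_sonnet(prompt), timeout=SONNET_TIMEOUT_S)
    except asyncio.TimeoutError:
        # Fallback path: Haiku gets the remaining budget; count the event for reporting.
        metrics["fallback_count"] = metrics.get("fallback_count", 0) + 1
        return await invoke_haiku(prompt)

print(asyncio.run(generate_with_fallback("Recommend a manga", {})))
```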

Q9. MangaAssist's multi-turn conversations have increasing latency per turn because the input context grows. Turn 1 might have 500 input tokens (fast), but Turn 5 might have 4,000 input tokens (slow). Plot the theoretical latency curve as a function of turn number. Design a system that maintains consistent latency across turns. Options include: context window truncation, mid-conversation summarization, or switching models for later turns. Evaluate the tradeoffs of each approach.
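For the truncation option in Q9, a minimal sketch of holding the rolling conversation inside a fixed input-token budget; the 4-characters-per-token heuristic and the 1,500-token budget are assumptions.

```python
# Keep the most recent turns inside a fixed input-token budget so per-turn
# latency stops growing (sketch; heuristic and budget are assumptions).
MAX_CONTEXT_TOKENS = 1_500

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def truncate_history(turns: list[str]) -> list[str]:
    kept: list[str] = []
    budget = MAX_CONTEXT_TOKENS
    for turn in reversed(turns):        # newest turns are most relevant
        cost = approx_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return list(reversed(kept))
```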


Very Hard (3)

Q10. MangaAssist needs to implement end-to-end distributed tracing across its entire request path: API Gateway → ECS Fargate orchestrator → intent classifier (SageMaker endpoint) → ElastiCache Redis → OpenSearch Serverless → Bedrock LLM → response assembly → DynamoDB write. Design the tracing architecture using AWS X-Ray. For each span, define what metadata to capture. How do you correlate traces with quality metrics (i.e., link a slow trace to a low CSAT score)? How do you identify which component is the bottleneck for P99 violations?
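A minimal sketch of instrumenting one hop with the X-Ray SDK for Python, including an annotation that lets slow traces be joined to quality records later. It assumes the SDK middleware has already opened the parent segment; `run_opensearch_query`, the annotation keys, and the session-ID join field are illustrative assumptions.

```python
from aws_xray_sdk.core import xray_recorder

def retrieve_context(query: str, session_id: str) -> list[dict]:
    # One subsegment per hop; annotations are indexed, so they can filter slow
    # traces and join them to CSAT records by session_id (assumed key).
    subsegment = xray_recorder.begin_subsegment("opensearch_retrieval")
    try:
        subsegment.put_annotation("intent", "product_discovery")
        subsegment.put_annotation("session_id", session_id)
        results = run_opensearch_query(query)       # placeholder for the real client call
        subsegment.put_metadata("result_count", len(results))
        return results
    finally:
        xray_recorder.end_subsegment()
```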

Q11. MangaAssist wants to optimize for perceived latency rather than measured latency. Design a system that: (a) uses streaming to show progressive responses while the LLM is still generating, (b) shows RAG-sourced product cards immediately while the LLM generates companion text, (c) pre-fetches likely next-turn context during the user's reading time, and (d) uses optimistic UI patterns (show cached responses instantly, then update if the fresh response differs). Quantify the perceived latency improvement for a recommendation intent request that actually takes 4.2s to complete.
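Rough arithmetic for the quantification part of Q11; only the 4.2 s total comes from the question, and the per-stage timings below are assumptions.

```python
# Perceived vs measured latency for a 4.2 s recommendation request (sketch;
# the stage timings are assumptions, only the 4.2 s total is given).
measured_total_ms = 4_200

first_visible_ms = min(
    600,   # RAG product cards rendered as soon as retrieval returns
    900,   # TTFT of the streamed companion text
    150,   # optimistic render of a cached response, when one exists
)
print(f"Measured completion: {measured_total_ms} ms, first meaningful paint: {first_visible_ms} ms")
```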

Q12. MangaAssist experiences a latency anomaly: the return_request intent shows P99 latency of 7.2s — well above the 5s SLA — but only on Tuesdays and Wednesdays, and only for users with more than 5 previous returns. Design the investigation methodology. What data would you collect? How do you correlate the anomaly with: (a) DynamoDB conditional writes for return eligibility checks, (b) OpenSearch query complexity for users with long return histories, (c) Bedrock throttling during mid-week batch processing, and (d) ElastiCache evictions due to Tuesday promotion campaign cache floods? Walk through the root cause analysis step by step and propose the fix.
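A starting point for the data-collection step: a CloudWatch Logs Insights query that slices return_request latency by day for heavy returners. The log group name and the emitted field names (`intent`, `latency_ms`, `prior_return_count`) are assumptions about what the orchestrator logs.

```python
import time

import boto3

logs = boto3.client("logs")

# Daily P99 for return_request requests from users with >5 prior returns
# (sketch; log group and field names are assumptions).
query = """
fields @timestamp, latency_ms, prior_return_count
| filter intent = "return_request" and prior_return_count > 5
| stats pct(latency_ms, 99) as p99_ms, count(*) as requests by bin(1d)
"""

resp = logs.start_query(
    logGroupName="/ecs/mangaassist-orchestrator",
    startTime=int(time.time()) - 14 * 24 * 3600,   # last two weeks
    endTime=int(time.time()),
    queryString=query,
)
print(resp["queryId"])  # poll get_query_results(queryId=...) for the output
```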