05. Grilling Sessions — Hard Follow-up Drills (MangaAssist)
These aren't softballs. Each question escalates into follow-ups designed to expose shallow answers. If you can survive all the follow-ups, you are genuinely prepared. The "What a strong answer contains" sections show what a strong answer looks like — but derive these yourself before reading them.
Round 1: API Architecture — Streaming vs. Guardrails Tension
Q: Your WebSocket streams tokens to the user in real-time. But your 6-stage guardrails pipeline runs on the COMPLETE response. So the user is seeing unguarded tokens as they stream. How do you prevent a hallucinated price or PII from reaching the user's screen before guardrails finish?
Follow-up 1: "You said you run a lightweight inline check during streaming — regex PII + keyword competitor filter. But price accuracy requires a round-trip to the Pricing API, which you can't do per-token. So what happens when the LLM streams '$12.99' for a product that actually costs $9.99? The user SAW $12.99 already."
Follow-up 2: "You send a 'correction frame' to replace the response. But the user already screenshot it and tweeted 'Amazon showing wrong prices.' How do you justify this architectural decision vs. buffering the entire response before showing anything?"
What a strong answer contains:
- Buffering kills the streaming UX that makes the chatbot feel fast. The correction frame fires in <0.5% of cases; the tradeoff is worth it for the other 99.5% (see the sketch after this list)
- Critical insight: prices in product cards are NEVER streamed from the LLM. Product cards are injected at the END of the stream directly from the live Pricing API, bypassing LLM generation entirely. The LLM only generates descriptive text; structured product data (including price) comes from the catalog
- The only scenario where a hallucinated price reaches the user is an inline text mention like "this costs about $12.99" — rare because the system prompt explicitly instructs the LLM not to state prices inline, since the recommendation data arrives pre-populated
- Legal reviewed the correction frame + audit log pattern and accepted it. The audit log means every price discrepancy is traceable
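To make the streaming tradeoff concrete, here is a minimal sketch of the inline per-chunk check plus the post-stream correction frame. The `send_frame()` and `run_full_guardrails()` callables, the frame shapes, and the regex/keyword lists are hypothetical stand-ins, not the real MangaAssist interfaces.

```python
import re

PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")               # e.g. SSN-shaped strings
COMPETITOR_TERMS = ("competitor-store-a", "competitor-store-b")  # placeholder keywords

def stream_with_inline_checks(token_chunks, send_frame, run_full_guardrails):
    buffered = []
    for chunk in token_chunks:
        # Cheap per-chunk checks only: regex PII + keyword competitor filter.
        if PII_PATTERN.search(chunk) or any(t in chunk.lower() for t in COMPETITOR_TERMS):
            chunk = "[redacted]"
        buffered.append(chunk)
        send_frame({"type": "token", "text": chunk})

    # The full 6-stage pipeline runs once the complete response is assembled.
    verdict = run_full_guardrails("".join(buffered))
    if not verdict["passed"]:
        # Rare path (<0.5% of cases): ask the client to replace what it already rendered.
        send_frame({"type": "correction", "text": verdict["corrected_text"]})
    send_frame({"type": "done"})
```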
Round 2: Testing — Golden Dataset vs. Model Updates
Q: You have a 500-prompt golden dataset for LLM evaluation. One morning you wake up, run your eval, and 40 prompts that passed yesterday now fail. Anthropic pushed a model update overnight that you don't control. What do you do?
Follow-up 1: "You investigate and find the model update improved recommendations quality by 15% but changed response formatting slightly, causing your regex-based ASIN extractor in the eval pipeline to miss ASINs. Is this a model problem or a test problem?"
Follow-up 2: "The eval pipeline now shows a 'hallucination rate regression' from 0.4% to 1.8%. You suspect this is a false positive from the changed formatting, not actual hallucinations. How do you tell the difference, and do you block the model promotion while you investigate?"
What a strong answer contains:
- Block the promotion first, investigate second — always err on the side of caution with hallucination metrics. The cost of a wrong block (a delay) is lower than the cost of promoting a regressed model
- Manually inspect the 40 failing prompts. If the ASIN extractor is breaking on new formatting, that's a test infrastructure bug — fix the extractor, re-run, and confirm the real hallucination rate
- If even 5 of those 40 are genuine new hallucinations, the model update has a quality regression requiring prompt engineering compensation before promotion
- Lesson: eval pipelines need to be robust to format changes. Property-based assertions ("does the response contain a valid ASIN from the catalog?") are more durable than pattern-based assertions ("does the response match this regex?") — see the sketch after this list. The eval framework itself is a codebase with tests
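A minimal sketch of the two assertion styles from the last bullet. The regex, the response shape, and `catalog_asins` are illustrative assumptions, not the real eval framework.

```python
import re

ASIN_REGEX = re.compile(r"\bB0[A-Z0-9]{8}\b")  # illustrative pattern, not Amazon's ASIN spec

def pattern_based_check(response_text: str, expected_asin: str) -> bool:
    # Brittle: silently misses ASINs whenever the model changes its prose formatting.
    return expected_asin in ASIN_REGEX.findall(response_text)

def property_based_check(response_products: list[dict], catalog_asins: set[str]) -> bool:
    # Durable: asserts a property of the structured output ("every recommended ASIN
    # exists in the catalog snapshot"), independent of how the prose is formatted.
    return bool(response_products) and all(
        product["asin"] in catalog_asins for product in response_products
    )
```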
Round 3: Scale & Production — Unexpected Hotspot
Q: It's Tuesday at 2pm. Your dashboard shows DynamoDB read latency on the sessions table suddenly spiked from 2ms to 450ms. It's not Prime Day. No deploy happened. What's your debugging process?
Follow-up 1: "You look at CloudWatch and see the spike is isolated to one partition key range. What does that tell you?"
Follow-up 2: "You find that a popular manga YouTuber just linked to a specific chat session, and 50,000 users clicked it simultaneously. Your DAX cache is warm for most sessions but this specific session wasn't cached. Walk me through the exact failure cascade."
Follow-up 3: "The YouTuber does this regularly — links to your chatbot for different manga recommendations. This will happen again next week. What permanent fix do you implement that you don't already have in place?"
What a strong answer contains:
- Initial debugging: CloudWatch → compare consumed vs. provisioned RCUs, check hot-partition metrics, check X-Ray traces to find which service is calling DynamoDB with which key
- Follow-up 1: one partition key range = hot partition = thundering herd on a specific session ID. Canonical cause: many users accessing the same session simultaneously
- Follow-up 2 cascade: new viral session → not in DAX (never been read before) → 50K concurrent requests hit a DAX miss → 50K route to DynamoDB directly → ProvisionedThroughputExceededException → orchestrator retries immediately → amplifies the spike → circuit breaker eventually opens → users get fallback responses
- Follow-up 3 permanent fix: a "popular session" detection heuristic — if a session receives >100 read requests/second, proactively push it into DAX with an extended TTL AND coalesce read requests at the application layer so only ONE DynamoDB read fires, with the remaining 49,999 requests waiting on that result (sketch after this list). Bonus: a link-preview endpoint that pre-warms DAX when a session URL is shared (use the existing share event as the trigger)
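A minimal sketch of the read-coalescing idea (single-flight per session key), assuming an async orchestrator; `load_session()` is a hypothetical DynamoDB read helper.

```python
import asyncio

_inflight: dict[str, asyncio.Task] = {}

async def coalesced_session_read(session_id: str, load_session):
    """Fire only one DynamoDB read per hot session key; concurrent callers share the result."""
    task = _inflight.get(session_id)
    if task is None:
        task = asyncio.create_task(load_session(session_id))
        _inflight[session_id] = task
        # Clear the slot once the read completes so later requests re-fetch fresh data.
        task.add_done_callback(lambda _t: _inflight.pop(session_id, None))
    return await task
```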
Round 4: Cost & Business Justification
Q: Your manager says "the chatbot costs $150K/month in AWS infrastructure. Justify every dollar or we're cutting the budget by 40%."
Follow-up 1: "The biggest line item is Bedrock at $80K/month. Your Haiku fallback already saves $18K/month. What else can you cut without user-visible quality degradation?"
Follow-up 2: "An engineer proposes caching LLM responses more aggressively — pushing from 15% to 30% cache hit rate. What are the risks?"
What a strong answer contains:
- Justification frame: The chatbot costs $150K/month but deflects 75% of conversations from human agents at $5-10/interaction. At our conversation volume, that's $X million in deflection savings; breakeven is at a 2% deflection rate and we're at 75%. The ROI conversation should start here, not at the line-item level
- Follow-up 1 cost cuts: (1) OpenSearch Serverless has idle costs — evaluate provisioned mode with auto-pause for off-peak hours. (2) Reranker GPU instances run continuously — evaluate spot instances for non-interactive workloads. (3) DynamoDB on-demand is convenient, but provisioned + auto-scaling is cheaper given how predictable our traffic is. (4) Evaluate Titan Embeddings caching — if the same query is embedded multiple times (common FAQ queries), the embedding vector can be cached in Redis keyed by normalized query text
- Follow-up 2 cache risks: Stale product recommendations (a cached list becomes outdated as new releases drop). Reduced personalization — "recommend horror manga" from a user with 50 purchases should NOT return the same cached response as a new user. The semantic cache key must include user context signals or explicitly strip them (sketch after this list). The 15% hit rate was tuned to cache only truly generic, user-independent queries; pushing to 30% means caching semi-personalized responses, which degrades experience quality in ways that are hard to measure in an A/B test
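To make the cache-key risk concrete, a small sketch of a key that deliberately excludes personal history. The `normalize()` helper and the tier values are assumptions for illustration, not the production cache design.

```python
import hashlib
import re

def normalize(query: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", query.lower())).strip()

def semantic_cache_key(query: str, user_tier: str) -> str:
    # Only generic signals go into the key; purchase history and session context are
    # deliberately excluded so personalized responses are never cached or reused.
    basis = f"{normalize(query)}|{user_tier}"
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()

# semantic_cache_key("Recommend horror manga!", "prime")
# == semantic_cache_key("recommend horror manga", "prime")  -> safe to share across users
```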
Round 5: Legal & Trust — Fraudulent Listing
Q: Amazon Legal calls. They found your chatbot recommended a manga title where the listing was fraudulent — the ASIN was real and passed your ASIN validation, but the seller was counterfeit. How do you retroactively find all users who received that recommendation? And how do you prevent this class of issue going forward?
Follow-up: "Your ASIN validation checks that the product exists in the catalog. A fraudulent listing IS in the catalog — it passes your guardrail. What additional signal would you add to the guardrails pipeline?"
What a strong answer contains:
- Retroactive audit: Every response is logged with response_id, session_id, products[] with ASINs, and timestamp. Query Redshift analytics for all responses containing that specific ASIN within the time window. This is O(minutes) to execute if the analytics schema is properly indexed on ASIN
- User notification: For each affected session_id, look up the customer ID and trigger a notification via Amazon's existing customer communications pipeline. This is cross-team coordination, not something MangaAssist owns end-to-end
- Prevention — listing trust score: ASIN existence ≠ listing health. Add a listing_trust_score field from Amazon's Trust & Safety catalog API. Guardrails would filter out products below a configurable trust threshold. This requires: (1) a new Pact contract with the Catalog API, (2) a new guardrails stage (or augmenting the ASIN validation stage), (3) new unit tests, (4) threshold tuning with the Trust & Safety team (a minimal sketch of the stage follows this list)
- Prevention — seller signals: Beyond trust score, consider filtering by seller history (new seller accounts with no feedback), listing age (listed < 30 days for established titles), and price anomalies (price 70%+ below comparable listings — often signals counterfeits). These signals are available from existing Amazon infrastructure; it's an integration question, not an ML question
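A minimal sketch of what the added guardrails stage could look like; `get_listing_trust_score()`, the product dict shape, and the threshold value are hypothetical stand-ins for the Trust & Safety integration.

```python
TRUST_THRESHOLD = 0.7  # illustrative; in practice tuned with the Trust & Safety team

def listing_trust_stage(products: list[dict], get_listing_trust_score) -> list[dict]:
    """Drop products whose listing trust score is missing or below the threshold."""
    kept = []
    for product in products:
        score = get_listing_trust_score(product["asin"])
        if score is not None and score >= TRUST_THRESHOLD:
            kept.append(product)
        # Dropped ASINs would be logged with response_id so the audit trail stays complete.
    return kept
```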
Round 6: System Design Under Pressure (Bonus Round)
Q: A new team wants to build a mobile app that consumes your chatbot API. They say they need WebSocket streaming, but they also need the conversation to survive the app going to the background for up to 30 minutes. How does your current architecture handle that, and what breaks?
Follow-up 1: "API Gateway WebSocket connections time out after 10 minutes of inactivity by default. The user backgrounds the app for 25 minutes. They come back. What should happen, and how do you implement it?"
Follow-up 2: "The mobile client reconnects after 25 minutes. The user says 'so anyway, back to the Junji Ito recommendations you were giving me.' How does the chatbot know what it was talking about 25 minutes ago?"
What a strong answer contains:
- What breaks: API Gateway WebSocket has a 2-hour hard timeout and a 10-minute idle timeout. At 25 minutes idle, the connection is dead. The mobile client must handle reconnection
- Follow-up 1: Reconnection strategy: (1) Mobile client detects connection loss on foregrounding. (2) Client calls a GET /chat/session/{session_id} REST endpoint to verify the session still exists in DynamoDB. (3) If alive, the client establishes a NEW WebSocket connection, sending $connect with the existing session_id as a parameter. (4) Server loads the session from DynamoDB, recognizes the resumption, and restores context (sketch after this list). The DynamoDB session TTL needs to be at least 2 hours to cover this; the existing design already uses 24 hours, so no change is needed
- Follow-up 2: Conversation history is stored in DynamoDB, not in the WebSocket connection itself. The connection is stateless — it's the session ID that carries continuity. When the reconnected client sends a message, the orchestrator loads the conversation history from DynamoDB exactly the same way it does for any multi-turn message. The 25-minute gap is invisible to the orchestrator. The only gap is if the session TTL expired (24 hours), in which case the orchestrator starts fresh and probably should signal that to the user: "Welcome back! It's been a while — feel free to continue where you left off or start fresh."
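A minimal server-side sketch of the resumption path described above; `load_session()`, `attach_connection()`, and the return shape are hypothetical helpers, not the real handler.

```python
def on_connect(connection_id: str, query_params: dict, load_session, attach_connection):
    session_id = query_params.get("session_id")
    session = load_session(session_id) if session_id else None

    if session is None:
        # TTL expired or no session_id supplied: start fresh and say so in the first reply.
        return {"resumed": False,
                "greeting": "Welcome back! It's been a while; feel free to start fresh."}

    # History lives in DynamoDB, not the socket, so resuming is just binding the new
    # connection_id to the existing session record and loading its messages as usual.
    attach_connection(session_id, connection_id)
    return {"resumed": True, "history_turns": len(session.get("messages", []))}
```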
Rapid-Fire Lightning Round
Answer these in 30 seconds each. No long explanations.
- Why use token bucket rate limiting over fixed window?
- Your intent classifier is at P99 = 50ms. Bedrock TTFT is P99 = 1.5s. If you had to cut total latency by 300ms, where do you focus and why?
- You have 9 downstream dependencies. In a given request, how many can fail simultaneously before the user gets a degraded response vs. an error?
- What's the difference between a circuit breaker and a retry with exponential backoff, and when do you use which?
- You're instrumenting a new intent path. What's the minimum set of metrics you emit to be production-ready?
- The reranker circuit breaker is open. What does the user experience look like?
- Why does the guardrails pipeline run sequentially instead of in parallel?
- A junior engineer asks why you don't just mock all downstream services in integration tests. What's your one-sentence answer?
- What's the semantic cache key for "recommend horror manga"?
- Bedrock on-demand vs. provisioned throughput — when does provisioned stop being worth it?
Answers to Lightning Round
- Fixed window allows a burst at the window boundary (30 msgs at 11:59 + 30 at 12:00 = 60 in 2 seconds). Token bucket smooths the rate continuously (sketch after this list)
- Bedrock TTFT — it's 30x the latency of the classifier. Even a 20% improvement on Bedrock saves 300ms. Speculative execution (run RAG in parallel with classification) is the other lever
- Up to 4 can fail if they're non-critical paths AND their circuit breakers return graceful fallbacks. Customer Profile (auth optional), Recommendations, Promotions, Reviews can all degrade gracefully. If Catalog, Pricing, or Session DynamoDB fail, the request fails
- Retry with backoff: recovers from transient failures, assumes the service will be healthy soon. Circuit breaker: stops sending requests when a service is consistently down, preventing resource exhaustion. Use retry for occasional failures; use circuit breaker when failure rate exceeds a threshold
- Request count, error rate, P50/P95/P99 latency, intent classification distribution, Bedrock tokens in/out, cache hit/miss rate, guardrail trigger rate
- Reranker down → orchestrator falls back to raw cosine similarity from vector search (top-3 by KNN score instead of reranker score). Response quality slightly lower but no user-visible error. The product cards are still populated, just potentially less precisely ranked
- Some stages transform the response (PII redaction changes the text). Parallel stages would operate on the original text, not the cleaned text from earlier stages. Order matters for correctness
- "Mocks hide latency — a mock returning in 1ms doesn't tell you the real endpoint takes 55ms at P99 cold start"
- Normalized query text hash + user tier (guest vs. prime) + no personal history: hash of normalize("recommend horror manga") + user_tier
- When your provisioned utilization drops below 50% for sustained periods. Provisioned is cost-effective when you have predictable baseline traffic; at very low or very variable traffic, on-demand wins
Keep adding new scenarios to this file after every mock interview or technical screen. Every question that catches you off-guard is a question to add here.