Interview Q&A — Technical Proof-of-Concept
Skill 1.1.2 | Task 1.1 — Analyze Requirements and Design GenAI Solutions | Domain 1
Scenario 1: PoC and Production Data Distribution Mismatch
Opening Question
Q: Your MangaAssist PoC achieved 94% answer accuracy on a 200-question evaluation set. Six weeks later, production CSAT scores are 61%. Retrieval quality audits show MRR of 0.44 in production vs. 0.91 in the PoC. The code and infrastructure are identical. What failed in the PoC design and how do you retrospectively diagnose and prospectively fix it?
Model Answer
The PoC evaluation set was built by the team that knows the system — they wrote queries using official catalog titles, product SKU terminology, and system-adjacent vocabulary. Real users say "that manga about the samurai who can't die," not "Blade of the Immortal Vol. 1." This distribution mismatch is the most common RAG PoC failure: the golden set leaked domain knowledge from the builders into the test data. In RAG systems, retrieval fails before generation fails — if the embedding can't match colloquial queries to indexed documents, the model never even sees the right context. The diagnostic: compare term overlap between PoC queries and production query logs (collected from API Gateway access logs). High Jaccard similarity with catalog text in the PoC set and low similarity in production confirms the distribution mismatch. The fix has two layers: (1) rebuild the golden set from real production or beta traffic (shadow logs from 1,000 real user sessions, cluster by intent, select 50 per cluster including tail queries, paraphrases, multilingual variants); (2) add a data-representativeness gate to the launch checklist: the PoC set must include ≥ 30% queries written by users outside the dev team.
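A minimal sketch of the term-overlap diagnostic. It measures the fraction of each query's terms found in the catalog vocabulary (a containment variant of the Jaccard check above); the file names and whitespace tokenizer are illustrative assumptions.

```python
# Term-overlap diagnostic: how much of each query's vocabulary comes straight from the catalog?
# File names and the simple regex tokenizer are illustrative assumptions.
import re

def term_set(text: str) -> set:
    """Lowercase and split into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def containment(query_terms: set, catalog_terms: set) -> float:
    """Fraction of query terms that also appear in the catalog vocabulary."""
    return len(query_terms & catalog_terms) / len(query_terms) if query_terms else 0.0

def mean_overlap(queries: list, catalog_text: str) -> float:
    catalog_terms = term_set(catalog_text)
    return sum(containment(term_set(q), catalog_terms) for q in queries) / len(queries)

poc_queries = open("poc_golden_set.txt").read().splitlines()
prod_queries = open("prod_query_sample.txt").read().splitlines()
catalog = open("catalog_dump.txt").read()

print("PoC overlap: ", round(mean_overlap(poc_queries, catalog), 3))
print("Prod overlap:", round(mean_overlap(prod_queries, catalog), 3))
# A large gap (e.g., 0.85 for the PoC set vs. 0.30 for production) confirms the mismatch.
```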
Follow-up 1: Measuring retrieval quality separately from answer quality
Q: The PoC only measured final answer accuracy (94%). What retrieval-specific metrics should have been included? A: At minimum: MRR (Mean Reciprocal Rank) — how high in the retrieved list does the first relevant document appear? NDCG@5 — are the most relevant documents ranked highest? Recall@K — what fraction of relevant documents appear in the top-K? These should be measured independently of generation. For MangaAssist, the golden set should include not just the expected answer but the expected source document(s) — then retrieval quality is measurable without running the FM. A 94% answer accuracy can coexist with 60% recall@5 if the FM is good at hallucinating plausible-sounding answers when retrieval fails. Measuring separately exposes which component is the actual failure point — retrieval or generation.
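A minimal sketch of computing the three retrieval metrics over a golden set that records expected source document IDs; the data layout and toy entries are illustrative assumptions.

```python
# Retrieval-only metrics over a golden set where each entry lists the ranked retrieved
# document IDs and the expected source document IDs. Toy data for illustration only.
import math

def reciprocal_rank(retrieved: list, relevant: set) -> float:
    for i, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved: list, relevant: set, k: int) -> float:
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

golden = [  # (ranked retrieved IDs, expected source doc IDs) -- toy entries
    (["d7", "d2", "d9", "d1", "d5"], {"d2"}),
    (["d3", "d8", "d4", "d6", "d0"], {"d1", "d4"}),
]
print("MRR      :", sum(reciprocal_rank(r, g) for r, g in golden) / len(golden))
print("Recall@5 :", sum(recall_at_k(r, g, 5) for r, g in golden) / len(golden))
print("NDCG@5   :", sum(ndcg_at_k(r, g, 5) for r, g in golden) / len(golden))
```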
Follow-up 2: Representative sampling strategy
Q: You don't have production traffic yet because the system hasn't launched. How do you build a representative PoC dataset pre-launch? A: Three sources: (1) user research sessions — run 5–10 structured user sessions with target users (manga readers unfamiliar with your catalog), record their natural queries, harvest 50–100 real phrasings; (2) competitive product usage — if users interact with a competitor or similar product, sample query patterns from public reviews, forums, or Reddit communities (r/manga with 2M+ subscribers is a goldmine for natural query language); (3) adversarial variants — for each clean query in your dev set, generate 3 paraphrases using Claude ("rephrase this question as a 14-year-old who types informally"), translate to Japanese and back-translate to introduce phrasing variation, add typos and slang. The goal: ≥ 40% of the evaluation set should use phrasing patterns not found anywhere in your system prompt or catalog text.
Follow-up 3: What a launch readiness gate looks like
Q: What would a data-representativeness launch gate look like concretely? What passes, what fails? A: Define three pass criteria: (1) Vocabulary overlap ≤ 0.3 Jaccard between eval set queries and catalog document terms — queries that too closely mirror catalog wording indicate insider-authored eval sets. (2) Language coverage: ≥ 30% Japanese queries, ≥ 10% romanized/mixed queries, ≥ 5% queries with typos or abbreviations. (3) Retrieval recall@5 ≥ 0.8 on the representative set — if recall drops below 0.8 with real-user phrasing, the embedding model or indexing strategy needs work before launch. Any gate failure is a blocker: the PoC passes technically but the evaluation methodology doesn't pass. This reframes launch readiness as "do we have evidence the system works for real users?" not just "does the demo look good?"
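A minimal sketch of how the gate could be codified so it runs in CI against the evaluation harness output; the function signature and language-mix key names are illustrative assumptions.

```python
# Data-representativeness launch gate: returns a list of blocking failures (empty = pass).
# Inputs are assumed to come from the evaluation harness; key names are illustrative.
def representativeness_gate(vocab_jaccard: float, lang_mix: dict, recall_at_5: float) -> list:
    failures = []
    if vocab_jaccard > 0.3:
        failures.append(f"vocabulary overlap {vocab_jaccard:.2f} > 0.3 (insider-authored eval set)")
    if lang_mix.get("ja", 0.0) < 0.30:
        failures.append("Japanese query share below 30%")
    if lang_mix.get("romanized_mixed", 0.0) < 0.10:
        failures.append("romanized/mixed query share below 10%")
    if lang_mix.get("typos_abbrev", 0.0) < 0.05:
        failures.append("typo/abbreviation query share below 5%")
    if recall_at_5 < 0.8:
        failures.append(f"recall@5 {recall_at_5:.2f} < 0.8 on the representative set")
    return failures

# e.g. representativeness_gate(0.24, {"ja": 0.35, "romanized_mixed": 0.12, "typos_abbrev": 0.06}, 0.83)
# -> []  (gate passes); any non-empty result is a launch blocker
```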
Follow-up 4: Post-launch drift detection
Q: The PoC passed all gates and launched. How do you detect distribution drift after launch?
A: Set up a continuous monitoring pipeline: (1) hourly export of production queries from CloudWatch Logs to S3; (2) weekly automated drift analysis — compute embedding centroid of current production queries vs. baseline PoC set, alert if cosine distance exceeds threshold (e.g., > 0.15); (3) sample 100 production queries per day into a human review queue — tag retrieval quality (good/partial/miss), track quality rate trend; (4) add a retrieval_miss_rate CloudWatch metric — if ≥ 3 of top-5 retrieved chunks have cosine similarity < 0.65, count as a miss. Alert if retrieval_miss_rate exceeds 15% for a rolling 1-hour window. Rerun the golden-set evaluation weekly with the rotating production-query sample appended, so the evaluation set evolves with production traffic.
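A sketch of the weekly centroid-drift check; the embedding file names follow the description above and the 0.15 threshold mirrors the example, otherwise everything is an assumption.

```python
# Weekly drift check: cosine distance between the centroid of this week's production
# query embeddings and the PoC baseline centroid. File names are illustrative.
import numpy as np

def centroid(embeddings: np.ndarray) -> np.ndarray:
    return embeddings.mean(axis=0)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

baseline = np.load("poc_query_embeddings.npy")        # shape (n, d), from the PoC golden set
current = np.load("prod_query_embeddings_week.npy")   # this week's production sample

drift = cosine_distance(centroid(baseline), centroid(current))
THRESHOLD = 0.15
print(f"centroid drift = {drift:.3f}")
if drift > THRESHOLD:
    print("ALERT: query distribution drift exceeds threshold")
```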
Grill 1: "94% accuracy in PoC proves the system works"
Q: A stakeholder says "94% accuracy is excellent — this clearly works. Why complicate it with representative sampling concerns?" How do you respond? A: The 94% number is real but it measures the wrong thing. It answers: "Does the system work for questions we already knew how to ask?" Production answers the harder question: "Does the system work for all the questions users will actually ask?" These are not the same distribution. The 33-point CSAT drop from 94% PoC accuracy to 61% production quality is the empirical proof that the PoC metric was misleading. The accurate framing for the stakeholder: the PoC answered "is this technically possible?" — that's a valid question. But "will this work for users?" requires a different evaluation methodology. A 94% score on a non-representative set tells us the model and embedding are capable; it doesn't tell us whether the system is ready for production users.
Grill 2: Fixing after launch — no new full rebuild
Q: You've launched. You can't rebuild the index. What immediate actions improve production quality without an index rebuild? A: Four tactics requiring no index rebuild: (1) Query expansion: add a Haiku pre-processing step that expands colloquial user queries into multiple canonical phrasings before embedding. "That manga with the assassin who becomes good" → ["Vinland Saga", "redemption arc shonen", "assassination manga character development"]. Improves recall without touching the index. (2) Alias injection: build a lookup table of known slang, abbreviations, and common phrasing for top-200 manga titles. Pre-process queries to inject canonical title variants. (3) Hybrid search: if the index supports BM25 (OpenSearch does), enable hybrid search — lexical BM25 matching catches exact title strings that dense embeddings might miss. (4) Logging-driven fast golden-set update: within 48 hours of launch, harvest 500 real queries from logs, identify the top-50 retrieval failure patterns, add them to the evaluation set — now CI/CD has a better regression signal for future prompt and retrieval changes.
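A sketch of the query-expansion pre-processing step (tactic 1), assuming the Anthropic messages schema on Bedrock; the model ID, prompt wording, and output parsing are illustrative, not a fixed recipe.

```python
# Query-expansion pre-processing: ask Haiku for canonical rephrasings of a colloquial
# query before embedding. Model ID and prompt are assumptions; verify against your region.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

def expand_query(user_query: str) -> list:
    prompt = (
        "Rewrite the following manga search query as 3 short canonical search phrasings, "
        f"one per line, no commentary:\n{user_query}"
    )
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "messages": [{"role": "user", "content": prompt}],
    })
    resp = bedrock.invoke_model(modelId=HAIKU, body=body)
    text = json.loads(resp["body"].read())["content"][0]["text"]
    phrasings = [line.strip("-• ").strip() for line in text.splitlines() if line.strip()]
    return [user_query] + phrasings[:3]   # always keep the original query as a candidate

# e.g. expand_query("that manga with the assassin who becomes good")
```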
Grill 3: Evaluation set staleness
Q: You refreshed the evaluation set with production traffic. Six months later, new manga titles have been added and the golden set refers to old titles. How do you manage evaluation set lifecycle? A: Treat the golden set as a living artifact with the same lifecycle as the catalog. Three practices: (1) Version the golden set in Git alongside model and prompt versions — every CI evaluation run records which golden set version was used. (2) Automated expiry scanning: weekly job cross-references golden set expected documents against the current index. If > 10% of expected documents are no longer in the index (discontinued titles, catalog cleanup), flag those questions for human review and replacement. (3) Rolling addition policy: add 20 new production-sampled queries per week, retire questions where the expected answer has changed (price updates, edition changes). The golden set stays at ~500 questions total with quarterly full reviews. Never let the golden set become the "permanent test set from the PoC" — that's how you end up measuring an outdated system against an outdated benchmark.
Red Flags — Weak Answer Indicators
- Accepting 94% PoC accuracy as sufficient evidence for launch without questioning the evaluation methodology
- Measuring only answer quality and not retrieval quality (MRR, recall@K)
- No concrete plan for how to build a representative evaluation set pre-launch
- Post-launch monitoring is only infrastructure metrics (latency, errors) not quality metrics
- No mention of evaluation set lifecycle management
Strong Answer Indicators
- Immediately identifies Jaccard overlap between PoC queries and catalog vocabulary as the diagnostic
- Separates retrieval quality metrics (MRR, recall@K, NDCG) from generation quality metrics
- Proposes three specific pre-launch sources for representative query sampling
- Designs a concrete launch gate with pass/fail criteria (vocabulary overlap, language coverage, recall threshold)
- Defines a post-launch drift detection pipeline with specific metric thresholds
Scenario 2: Single-User Load Assumption
Opening Question
Q: The MangaAssist PoC demo ran beautifully — 1.8 seconds, clean answers. You set a production timeout of 2 seconds based on that demo. Day-1 production sees 500 concurrent users and 70% of requests timeout. Traces show OpenSearch, Redis, and DynamoDB each taking 800ms+ instead of the 80ms seen in the PoC. What happened and how do you retrospectively fix the timeout policy?
Model Answer
The PoC was a single-engineer, sequential-request test. Each request found all dependency resources idle: no OpenSearch search thread queue, no Redis connection contention, no DynamoDB read capacity competing with other keys. At 500 concurrent users, the system enters a contention regime: OpenSearch search threads queue up (8 threads shared, 500 requests means queuing), Redis connection pool saturates (default pool size often 10–20 connections), DynamoDB read capacity units consumed by competing requests. The 10× latency increase from 80ms to 800ms per dependency is not unexpected — it reflects real-world load behavior that the single-user PoC was structurally incapable of revealing. The 2-second timeout is set against a single-user P50, not a concurrent P99. Fix in two parts: (1) immediately increase timeouts to P99-under-load values derived from load test data; (2) structurally: run a concurrency test at 2× expected peak before any timeout policy is set. The rule is: timeout = P99 of the dependency under target concurrency + 20% buffer, not P50 of a demo request.
Follow-up 1: Running a concurrency load test in PoC phase
Q: How do you structure a concurrency test during the PoC phase before you have real users? A: Use a synthetic load generator: Apache JMeter, Locust, or AWS Distributed Load Testing. Define three profiles: (1) steady-state: 100 concurrent users, ramp over 5 minutes, sustain 15 minutes; (2) peak-traffic: 500 concurrent users, ramp over 2 minutes, sustain 5 minutes; (3) burst: 0 → 1,000 users over 30 seconds (flash sale simulation). Measure P50, P95, P99 for each dependency (OpenSearch, Redis, DynamoDB, Bedrock) under each profile. Set timeouts from P99-burst. The PoC passes the load test if P99 < 2.5 seconds under 500 concurrent users. This adds 2–3 days to a PoC but prevents the class of failure where the architecture is only valid for sequential single-user patterns.
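A minimal Locust profile for these runs; the endpoint path, payload, and spawn rates are illustrative assumptions.

```python
# Minimal Locust profile for the steady-state / peak / burst runs described above.
# Endpoint path and payload are illustrative assumptions.
from locust import HttpUser, task, between

class MangaAssistUser(HttpUser):
    wait_time = between(1, 3)  # think time between queries per simulated user

    @task
    def ask(self):
        self.client.post(
            "/chat",
            json={"session_id": "load-test", "query": "recommend a short fantasy manga"},
            name="POST /chat",
        )

# Run each profile from the CLI and record per-dependency P50/P95/P99 from traces:
#   locust -f loadtest.py --host https://staging.example.com \
#          --users 500 --spawn-rate 4 --run-time 15m --headless
# Burst profile: --users 1000 --spawn-rate 34 (0 -> 1,000 users in roughly 30 s)
```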
Follow-up 2: Connection pool tuning
Q: The Redis connection pool saturating is a fixable configuration issue. How do you size the pool appropriately?
A: Pool size = (requests per second × expected hold time in seconds) + 20% buffer (Little's law). For MangaAssist: roughly 500 requests/second at peak with an average Redis operation hold of 5ms → 500 × 0.005 ≈ 2.5 concurrent connections needed across the fleet, plus buffer. But Lambda scales to N concurrent invocations, each execution environment holding its own pool — with 10 Lambda workers at pool=10, that's 100 open connections against ElastiCache. ElastiCache connection limits vary by node type (cache.t3.medium: ~65K connections). The right approach for Lambda is connection pooling via a proxy: use ElastiCache Serverless or a Redis proxy layer so Lambda instances share a pool rather than each maintaining their own. This reduces total connection count while maintaining throughput. Set the pool size based on load test evidence, not defaults from the Redis client library.
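A small sizing sketch of the Little's-law arithmetic above, using the scenario's illustrative numbers rather than measured values.

```python
# Little's law sizing sketch: needed concurrent connections = arrival rate x hold time,
# plus a buffer. Figures are the scenario's illustrative numbers, not measurements.
import math

def needed_connections(requests_per_second: float, hold_time_s: float, buffer: float = 0.2) -> int:
    return math.ceil(requests_per_second * hold_time_s * (1 + buffer))

fleet_need = needed_connections(requests_per_second=500, hold_time_s=0.005)  # ~3 connections
# Contrast with what actually opens against ElastiCache when every warm Lambda keeps
# its own client pool (10 workers x pool of 10, as in the answer above):
fleet_open = 10 * 10
print(fleet_need, fleet_open)  # 3 vs. 100 -- the gap is why a shared/proxied pool helps
```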
Follow-up 3: Circuit breakers for dependency saturation
Q: During the day-1 overload, retry logic made the situation worse — failed requests retried immediately and doubled the load. What is the correct retry architecture?
A: Exponential backoff with jitter + circuit breaker. For individual retries: base_delay=100ms, max_delay=2s, jitter=random(0, base_delay), max_retries=3. This spreads retry pressure across a 1-2 second window instead of instantly. The circuit breaker layer: track error rate over a 10-second rolling window per dependency. If > 30% of requests fail, open the circuit — reject requests immediately with a 503 rather than adding retry load to an already-saturated dependency. After 30 seconds (half-open state), allow 1 probe request — if it succeeds, close the circuit. This prevents retry storms from amplifying a saturated dependency into a complete outage. Use an exponential-backoff + circuit-breaker library (resilience4j pattern) rather than implementing ad hoc try/except retry loops.
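A simplified single-process sketch of the backoff-with-jitter and circuit-breaker behavior described above; a production implementation would share breaker state across workers (e.g., via a library following the resilience4j pattern).

```python
# Backoff-with-jitter plus a per-dependency circuit breaker, following the thresholds
# in the answer. State handling is deliberately simplified (single process, in memory).
import random
import time

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Delay before retry N: capped exponential backoff plus jitter."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, base)

class CircuitBreaker:
    def __init__(self, error_threshold=0.30, window_s=10, cooldown_s=30):
        self.error_threshold, self.window_s, self.cooldown_s = error_threshold, window_s, cooldown_s
        self.events = []        # (timestamp, ok) tuples inside the rolling window
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True         # closed: let traffic through
        # half-open: allow a single probe once the cooldown has elapsed
        return time.time() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        now = time.time()
        self.events = [(t, o) for t, o in self.events if now - t <= self.window_s]
        self.events.append((now, ok))
        errors = sum(1 for _, o in self.events if not o)
        if ok and self.opened_at is not None:
            self.opened_at = None        # probe succeeded: close the circuit
        elif len(self.events) >= 10 and errors / len(self.events) > self.error_threshold:
            self.opened_at = now         # open: caller should fail fast with a 503
```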
Follow-up 4: Documenting safe operating envelope
Q: What does a "safe operating envelope" for launch planning look like?
A: A documented table capturing: max_concurrent_users, P50/P95/P99 latency per component, error rate per component, recommended timeout value, circuit-breaker threshold, and scaling headroom (at what concurrency level does the next alarm fire). Example row: OpenSearch: max_tps=800, P99=420ms@500cu, error_rate=0.1%, recommended_timeout=700ms, circuit_breaker_threshold=30%, scale_alert=650cu. This table becomes the launch gate: if load testing doesn't produce a clean completed row for every dependency, launch is blocked. Post-launch, the table is updated from production metrics and reviewed at each major traffic milestone.
Grill 1: Timeout was increased to 10 seconds — now users wait 10s on failure
Q: After the incident, someone increased all timeouts to 10 seconds to "stop the timeouts." What's wrong with this? A: It trades one failure mode (fast timeout) for a worse one (slow failure + resource exhaustion). A 10-second timeout means every slow request holds its Lambda execution environment, connection pool slot, and API Gateway connection for 10 seconds before failing. At 500 concurrent users with 20% failure rate, that's 100 simultaneous 10-second hangs — Lambda concurrency exhausted, connection pools exhausted, API Gateway hitting connection limits. Users who could get a fast failure and retry now wait 10 seconds before the same failure. The correct approach: timeout = P99-under-load + 20% buffer (from load test data). If P99 = 420ms under full load, timeout = 500ms. Failures are fast, resources are returned quickly, retries don't compound. Timeout is a correctness and resource management decision, not a "give more time = better reliability" decision.
Grill 2: Load test passes but Day-2 shows new failure pattern
Q: Load test passed at 500 users. But Day-2 shows the same saturation at 300 users. What changed? A: Session accumulation. The load test ran 500 concurrent users for 15 minutes with fresh sessions. Production Day-2 has 300 concurrent new requests PLUS a growing backlog of long-running sessions with conversation history in DynamoDB and Redis. Each retained session consumes additional DynamoDB read capacity (history lookup per turn) and Redis memory (cached preferences). The effective "load" is higher than new request count alone. Fix: the load test must model accumulated state — prime the environment with 5,000 existing sessions in DynamoDB before running the concurrency test. Also check Redis memory utilization — a full Redis evicts keys under LRU, causing cache misses that cascade to DynamoDB, which hits its RCU limit. Session TTL policy (expire sessions after 60 minutes of inactivity) is a required safeguard against state accumulation.
Grill 3: The PoC had no degradation mode — all-or-nothing
Q: When OpenSearch was saturated, the chatbot returned HTTP 500 to all users. Why is "fail hard" worse than degraded responses in this scenario?
A: Hard failure at every layer compounds in cascade: the frontend shows an error, the user retries immediately (doubling load on an already saturated backend), and the retry behavior becomes a self-inflicted DDoS. A degraded response mode breaks the cascade. When OpenSearch is at circuit-breaker threshold: answer from the model's general knowledge with a caveat tag ("I'm unable to look up specific titles right now — here's what I know generally"). Return HTTP 200 with a retrieval_mode=fallback field in the response. The user gets a useful (if less grounded) answer, does not retry, and the OpenSearch recovery window is shorter because traffic is being absorbed, not amplified. Hard failures are only appropriate for security-critical paths (auth, PII exposure risk). For answer quality degradation, a graceful fallback always beats a hard error.
Red Flags — Weak Answer Indicators
- Setting timeouts from single-user demo latency without load testing
- No mention of connection pool sizing under concurrent load
- Treating retry logic as purely a configuration knob without circuit breaker awareness
- Load testing at exactly the expected traffic volume without adding a burst profile (2×)
- Not designing a degraded response mode (answering without retrieval when OpenSearch is saturated)
Strong Answer Indicators
- Immediately identifies the contention regime difference between single-user and concurrent load
- Derives timeout values from P99-under-load, not P50-demo
- Proposes three load test profiles (steady-state, peak, burst)
- Sizes connection pools from first principles (TPS × hold_time × Lambda_concurrency)
- Designs circuit breakers with explicit thresholds and half-open probe logic
- Shows awareness of state accumulation (session backlog) as a hidden load amplifier
Scenario 3: Late Streaming Pivot
Opening Question
Q: Your MangaAssist PoC was built with synchronous invoke_model calls returning complete responses. Three weeks before launch, the product team pivots to a streaming typewriter UI. The backend team says "it's just a different Bedrock API call." Six hours into the streaming release, users are seeing frozen responses and broken JSON on 30% of requests. What did the "just a different API call" assumption miss?
Model Answer
Streaming is not a backend implementation detail — it is a protocol contract change that requires renegotiating every layer of the stack. The invoke_model_with_response_stream API returns an EventStream of typed event chunks; the application must: (1) forward each chunk to the WebSocket client before the next chunk arrives (chunk buffering breaks the streaming UX); (2) handle mid-stream errors — a StreamingError event can arrive at any point, unlike a synchronous call that either succeeds or fails atomically; (3) emit a stream-termination signal the client can detect; (4) handle token budget overruns that manifest as a mid-stream StopReason change. The synchronous PoC had none of these requirements. "Just a different API call" missed: stream-specific error event handling, WebSocket frame design (chunk, completion, error frame types), mid-stream failure recovery, frontend partial-state rendering, and first-token-latency observability. The 30% failure rate is not a Bedrock issue — it's a missing stream contract that causes the client to crash when it receives unexpected frame shapes.
Follow-up 1: Define the streaming message contract
Q: What does a complete stream contract look like for a WebSocket-based chat UI?
A: Define 4 message types with a fixed JSON envelope: (1) {"type":"chunk","session_id":"...","sequence":N,"text":"..."} — token chunk, the client appends text and increments sequence; (2) {"type":"stream_end","session_id":"...","total_tokens":N,"stop_reason":"end_turn"} — stream complete, client finalizes rendering; (3) {"type":"error","session_id":"...","error_code":"...","retryable":true/false,"message":"..."} — stream or service error, client shows appropriate UX; (4) {"type":"heartbeat","session_id":"..."} — keepalive if no chunks for > 2s, prevents WebSocket idle timeout. The sequence field gives the client ordering guarantees. The retryable flag in errors tells the client whether to auto-retry or show a permanent error. This contract must be co-authored with the frontend engineer before a single streaming line is written.
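A server-side sketch that maps Bedrock stream events onto these four frame types and pushes them over the API Gateway WebSocket management API; the model ID, endpoint URL, and exact event field names should be verified against the current Bedrock streaming response format and are otherwise assumptions.

```python
# Translate Bedrock stream events into the chunk / stream_end / error WebSocket frames.
# Endpoint URL and model ID are placeholders; heartbeat frames are omitted for brevity.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://{api-id}.execute-api.{region}.amazonaws.com/{stage}",  # placeholder
)

def stream_answer(connection_id: str, session_id: str, request_body: dict) -> None:
    def send(frame: dict) -> None:
        apigw.post_to_connection(ConnectionId=connection_id, Data=json.dumps(frame).encode())

    try:
        resp = bedrock.invoke_model_with_response_stream(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=json.dumps(request_body))
        seq = 0
        for event in resp["body"]:                 # EventStream of typed events
            chunk = event.get("chunk")
            if chunk is None:
                continue                           # non-chunk events; errors raise below
            payload = json.loads(chunk["bytes"])
            if payload.get("type") == "content_block_delta":
                seq += 1
                send({"type": "chunk", "session_id": session_id,
                      "sequence": seq, "text": payload["delta"]["text"]})
            elif payload.get("type") == "message_stop":
                send({"type": "stream_end", "session_id": session_id,
                      "total_tokens": None,        # fill from invocation metrics if available
                      "stop_reason": "end_turn"})
    except Exception as exc:                       # incl. mid-stream throttling
        send({"type": "error", "session_id": session_id,
              "error_code": type(exc).__name__, "retryable": True,
              "message": "stream interrupted"})
```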
Follow-up 2: First-token latency as a metric
Q: For streaming UX, why is P99 end-to-end response time the wrong primary metric?
A: With streaming, the user experience is driven by time-to-first-token — the latency from request to the appearance of the first word on screen. A response that starts in 400ms and takes 6 seconds total feels fast. A response that takes 4 seconds before the first token and then finishes in 5 seconds total feels broken, even though its total time is actually a second shorter. Instrument TimeToFirstTokenMs: timestamp when Bedrock returns the first content_block_delta event; subtract the request timestamp. Target: P95 < 800ms for MangaAssist. This is a separate SLA from TotalStreamDurationMs (target: P95 < 6s for a complete answer). Alert separately on both. The PoC had only total-response-time instrumentation, which obscured first-token latency entirely.
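A small instrumentation sketch publishing the two SLAs as separate CloudWatch metrics; the namespace and dimension names are assumptions.

```python
# Publish time-to-first-token and total stream duration as distinct CloudWatch metrics.
# Namespace and dimension names are illustrative assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_stream_timings(first_token_ms: float, total_ms: float, model_id: str) -> None:
    dims = [{"Name": "model_id", "Value": model_id}]
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Streaming",
        MetricData=[
            {"MetricName": "TimeToFirstTokenMs", "Value": first_token_ms,
             "Unit": "Milliseconds", "Dimensions": dims},
            {"MetricName": "TotalStreamDurationMs", "Value": total_ms,
             "Unit": "Milliseconds", "Dimensions": dims},
        ],
    )

# In the stream loop: stamp t0 before the Bedrock call, t_first on the first
# content_block_delta, t_end on message_stop, then:
#   record_stream_timings((t_first - t0) * 1000, (t_end - t0) * 1000, model_id)
```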
Follow-up 3: Handling mid-stream errors
Q: Bedrock throws a ThrottlingException after the stream has already started delivering chunks to the user. What do the client and server each need to do?
A: Server side: catch the ThrottlingException in the stream loop; emit an {"type":"error","error_code":"throttle","retryable":true} frame on the WebSocket; close the WebSocket cleanly (don't let it hang). Do not retry the entire request automatically — the user already has partial output and a retry would duplicate the streamed text. Client side: on receiving retryable:true error after partial output, display the partial answer with a "I was interrupted — ask me to continue?" affordance. Do not auto-retry (duplicated text problem). Do not clear the partial answer (user loses what they already read). The sequence number in each chunk enables the client to detect if chunks were missed and request a partial retry from a specific sequence point — a more advanced resumable streaming design. For a V1, the explicit error frame + partial answer preservation is sufficient.
Follow-up 4: Testing streaming under failure conditions
Q: How do you test streaming reliability if the PoC only had a happy-path demo environment? A: Five streaming-specific test cases that must pass before launch: (1) Dependency stall mid-stream: inject a 5-second delay in the enrichment mock after 20% of chunks are sent — verify the client handles it gracefully with no frozen screen; (2) WebSocket disconnect during stream: kill the server connection at chunk N — verify the client shows a partial answer rather than a blank screen; (3) ThrottlingException at chunk 1: trigger throttle immediately — verify the client shows an error, no crash; (4) ThrottlingException at chunk N/2: trigger mid-stream throttle — verify partial answer preserved + error affordance; (5) Sequence number gap: skip sequence 7 in a 10-chunk stream — verify the client detects the gap instead of silently rendering out-of-order text. These 5 tests replace the "it works" happy-path demo as the streaming readiness gate.
Grill 1: "We can add streaming in a hotfix after launch"
Q: The team proposes launching with synchronous responses and adding streaming in a hotfix the following week. Why might that strategy backfire? A: Streaming is a protocol change — it requires changes to WebSocket frame structure, client rendering logic, error-handling paths, and observability. A hotfix deploys these changes under production pressure, with no testing time for the 5 failure modes listed above. WebSocket protocol changes are particularly dangerous as hotfixes because a mismatch between server and client frame format (e.g., API Gateway WebSocket route configuration) is immediately visible to 100% of users. Additionally, the product launch announcement has already set user expectations for the typewriter UI — launching without it is a user experience miss, but launching a broken streaming implementation is worse. The correct decision is: delay launch by one sprint to properly implement and test streaming, not hotfix it under time pressure.
Grill 2: What if the frontend team changes the stream contract without coordinating?
Q: After launch, the frontend team adds a new required field to the chunk frame that the backend doesn't send. How do you prevent this class of breakage?
A: Treat the stream message schema as a versioned, shared API contract. Publish it as a JSON Schema artifact in a shared repository. Both frontend and backend CI pipelines validate their message shape against the schema. Changes to the schema require a PR with both teams as required reviewers — this is not a backend or frontend change unilaterally. Version the contract: {"schema_version":"1.2","type":"chunk",...}. The client checks schema_version on each message and alerts when it receives an unexpected version. Consider a message broker (SNS/SQS) or API Gateway WebSocket message validation built around the schema. The cultural rule: stream message schema is an API contract, not an implementation detail.
Grill 3: The streaming implementation works, but costs 3× more
Q: After fixing streaming, you notice Bedrock token costs tripled. Why would switching from invoke_model to streaming change costs?
A: Streaming itself doesn't change token cost — input and output tokens are billed identically. The cost increase likely has three causes: (1) Retry amplification — early streaming failures caused full-request retries, so some messages were invoking Bedrock 2–3 times. A retry of a streaming call that delivered 50% of chunks before failing means 50% of output tokens were billed on the partial call plus 100% on the retry. Fix: implement idempotent retry keys and a resume-from-sequence strategy so retries don't regenerate already-delivered content. (2) Prompt changes — the streaming implementation may have added new system prompt instructions for stream format, increasing input tokens by 200–300 per request. (3) Error handling prompts — new error-case prompts (e.g., "continue the response where you left off") added token overhead not in the cost model. Re-baseline the cost model for the streaming version using per-component token instrumentation.
Red Flags — Weak Answer Indicators
- Treating streaming as only a backend API change with no frontend or contract implications
- No WebSocket message contract definition (chunk/end/error/heartbeat frame types)
- Missing first-token latency as a distinct metric from total response time
- Happy-path-only streaming test plan
- No mention of mid-stream ThrottlingException handling strategy
Strong Answer Indicators
- Immediately identifies the 4 new protocol requirements streaming introduces
- Defines all 4 WebSocket message types with specific JSON structure
- Separates TimeToFirstTokenMs and TotalStreamDurationMs as distinct SLAs
- Names 5 specific streaming failure test cases required before launch
- Addresses retry-cost amplification in streaming vs. synchronous retry patterns
Scenario 4: Token Cost Overrun
Opening Question
Q: The MangaAssist finance model projected $8K/month in Bedrock costs based on the PoC. Two weeks into production, the actual spend is $47K/month — a 5.9× overrun. The team says they "didn't change anything." Walk me through how a 5.9× cost overrun can happen with no code changes and how you diagnose it.
Model Answer
"We didn't change anything" almost always means "we didn't change application code." But token cost overruns have four independent drivers, each of which can drift after PoC: (1) System prompt growth — during prompt iteration, the system prompt grew from the PoC's 800 tokens to 2,400 tokens; at 1M requests/day this alone adds $2.88M/day in input token cost (if using Sonnet). (2) RAG context expansion — the contextual chunks retrieved grew from 2 × 500-token chunks to 5 × 800-token chunks as the retrieval pipeline was tuned for higher recall. (3) Conversation history accumulation — multi-turn users send full history on every turn; a 10-turn session's prompt includes 9 prior turns (~1,800 tokens) that weren't in the PoC's single-turn eval. (4) Production query distribution — real user queries are longer and more complex than PoC test queries (averaging 180 tokens vs. 40 tokens in the PoC). The diagnostic is per-component token instrumentation: break every Bedrock call into system_tokens, history_tokens, rag_tokens, user_input_tokens, output_tokens and publish them as CloudWatch metrics with dimensions. The cost overrun investigation then shows exactly which component drifted.
Follow-up 1: Setting per-request token budgets
Q: What is a concrete per-request token budget for MangaAssist, and how do you enforce it?
A: Define a budget allocation per request type. For a standard recommendation query: system_prompt=1,500 tokens (reserved, trimmed if longer), user_input=500 tokens max (truncate if longer, warn in logs), conversation_history=2,000 tokens (token-aware assembly), RAG_context=2,500 tokens (top-K chunks within budget), output_budget=1,500 tokens = total 8,000 tokens. Enforcement: before every FM call, count tokens for each component. If any component exceeds its allocation, apply trimming rules (compress history, reduce top-K, truncate user input at word boundary). If the total would exceed 8,000 tokens, refuse to invoke the FM and return a structured error with reason=context_budget_exceeded. Log every budget enforcement event as a metric. Hard cap prevents runaway costs from a single malicious or inadvertent oversized input.
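A minimal enforcement sketch of the budget table above; the count_tokens helper and the crude trimming rule are placeholders for the real tokenizer and history-compression logic.

```python
# Hard per-request token budget enforcement following the allocation above.
# count_tokens is a placeholder for whatever tokenizer or estimator is in use.
BUDGET = {"system": 1500, "user_input": 500, "history": 2000, "rag": 2500, "output": 1500}
TOTAL_CAP = 8000

class ContextBudgetExceeded(Exception):
    pass

def assemble_prompt(components: dict, count_tokens) -> dict:
    """components: {'system': str, 'user_input': str, 'history': str, 'rag': str}"""
    trimmed = {}
    for name, text in components.items():
        limit = BUDGET[name]
        while count_tokens(text) > limit:
            text = text[: int(len(text) * 0.9)]   # crude trim; real code compresses history
        trimmed[name] = text
    total = sum(count_tokens(t) for t in trimmed.values()) + BUDGET["output"]
    if total > TOTAL_CAP:                         # safety net behind the per-component trims
        raise ContextBudgetExceeded("reason=context_budget_exceeded")
    return trimmed
```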
Follow-up 2: Cost-aware routing
Q: Some requests are unavoidably large. How do you prevent large requests from automatically routing to the most expensive model?
A: Build cost-aware routing into the model selection policy. Tiers by estimated token count: ≤ 3,000 tokens → Haiku; 3,001–8,000 tokens → Sonnet; > 8,000 tokens → Sonnet forced + ContentTooLarge warning logged. Additionally, add a request-type override: if task_type=intent_classification, always Haiku regardless of token count. In practice, the vast majority of oversized requests are conversation history accumulation (solvable by compression), not intrinsically complex queries. Routing a compressed 6,000-token synthesis request through Sonnet is appropriate. Routing a 6,000-token request because the team forgot to compress history is an architecture failure that should be caught by the token-aware assembly step, not the routing policy.
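A sketch of the routing policy as a pure function; the model IDs and thresholds mirror the tiers above and are otherwise assumptions.

```python
# Token-count-based routing policy matching the tiers described above.
# Model IDs and the logging hook are illustrative assumptions.
import logging

HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"

def select_model(estimated_tokens: int, task_type: str) -> str:
    if task_type == "intent_classification":
        return HAIKU                      # cheap task: always the small model
    if estimated_tokens <= 3000:
        return HAIKU
    if estimated_tokens > 8000:
        logging.warning("ContentTooLarge: %s tokens routed to Sonnet", estimated_tokens)
    return SONNET
```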
Follow-up 3: Finance model update process
Q: What process ensures the finance model is updated when the prompt or retrieval configuration changes?
A: Make token-cost impact analysis a required part of prompt change reviews. Concretely: the PR template for prompt changes includes a mandatory field: "before_token_count / after_token_count / delta_per_request / projected_daily_cost_delta." Enforce this via a CI check that measures the token count of every prompt template against the baseline and fails the build if the delta exceeds 10% without a cost impact sign-off comment. Similarly, when RAG top-K changes (e.g., from k=3 to k=5), the pipeline controller should emit a ConfigChangeEvent that triggers a recalculation of the projected cost. Finance reviews any config change projecting > $1,000/day delta before it ships to production. This institutionalizes cost as an engineering constraint, not an afterthought.
Follow-up 4: Token cost per feature, not just per request
Q: Leadership wants to know which MangaAssist features are the most expensive. How do you break down costs by feature?
A: Tag every Bedrock invocation with a feature_name dimension (recommendation, search, faq, checkout_assist) in addition to model_id and task_type. Publish custom CloudWatch metrics BedrockInputTokens and BedrockOutputTokens with those dimensions. In CloudWatch Metrics Insights, query SUM(BedrockInputTokens) GROUP BY feature_name to get daily token volume by feature. For absolute dollar attribution: input_cost = (input_tokens / 1,000,000) × model_input_price. Build a Cost Explorer dashboard that shows feature-level spend trend. This lets engineering prioritize optimization work: if the recommendation feature represents 60% of token spend, a 20% token reduction there saves more than a 50% reduction in faq at 5% of spend.
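One way to run the per-feature query programmatically is GetMetricData with a Metrics Insights expression; the namespace, metric, and dimension names are assumptions matching the instrumentation described above.

```python
# Daily input-token totals by feature via a Metrics Insights expression.
# Namespace, metric, and dimension names are illustrative assumptions.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        "Id": "tokens_by_feature",
        "Expression": 'SELECT SUM(BedrockInputTokens) FROM "MangaAssist/Tokens" GROUP BY feature_name',
        "Period": 86400,   # daily totals
    }],
    StartTime=now - timedelta(days=7),
    EndTime=now,
)
for series in resp["MetricDataResults"]:
    print(series["Label"], series["Values"])
```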
Grill 1: "The PoC was accurate — this is a usage growth problem"
Q: Engineering says the cost overrun is simply because production traffic was underestimated. Is that a valid defense? A: Partially — if the PoC's cost-per-request estimate was accurate but volume was underestimated, the cost model worked correctly and the business planning was wrong. But the 5.9× overrun is not just a volume issue — at even 2× the PoC's projected volume, the expected cost would be $16K, not $47K. The 3× remaining gap is a cost-per-request failure, not a volume failure. The per-request token instrumentation will show which component drifted. More broadly, "we didn't model enough traffic" is a planning failure; "tokens per request grew 3× after PoC" is an architectural governance failure. Both need to be addressed. The PoC's role is to establish a reliable per-request cost baseline that is explicitly linked to specific prompt sizes, RAG configurations, and model choices — not a vague estimate from demo runs.
Grill 2: Prompt compression to reduce cost — quality tradeoff
Q: You trim the system prompt from 2,400 to 800 tokens to reduce cost. Three safety behaviors stop working. How do you approach this tradeoff? A: Never trim prompt content by character/word deletion — trim by consolidation. A 2,400-token system prompt usually has redundancy, verbose examples, and repetitive instructions. Run a structured analysis: categorize each instruction block by safety vs. quality vs. style. Safety blocks (persona constraints, content policy, citation requirements, hallucination guards) are non-negotiable — they stay. Quality blocks (output format, response length, vocabulary tone) are trimmed to minimum effective phrasing. Style blocks (elaborate persona description, multi-line examples) are compressed to concise directives. After compression to 1,000 tokens, run the full golden-set evaluation and compare safety metric pass rates. If any safety metric drops, the trimmed block must be restored. Cost optimization never justifies safety regression — the right order is: trim style → trim quality → never trim safety.
Grill 3: Usage patterns shift — cost model was wrong for JP users
Q: JP users write longer queries on average (Japanese characters encode differently) and the cost model was based on English-only PoC queries. How does this affect token cost?
A: Significantly. Japanese text is compact per character but tokenizes far less efficiently than English in Anthropic's tokenizer: roughly 1.5–2 tokens per character for kanji-heavy text, versus roughly one token per 4 characters of English. A 100-character Japanese query may consume 150–200 tokens vs. a 100-character English query at 25–30 tokens. The PoC underestimated input token volume for JP queries because it was calibrated on English text. The fix: re-baseline the cost model with a representative JP-language query sample. Add a language dimension to the Bedrock token metrics to measure per-language token cost separately. For JP-heavy workloads, use Haiku aggressively for classification (where the quality gap is small) to offset the higher per-query token count.
Red Flags — Weak Answer Indicators
- Attributing the entire overrun to volume growth without investigating per-request token drift
- No per-component token instrumentation (system, history, RAG, user, output)
- Missing the connection between prompt iteration and token cost drift
- No process for requiring cost impact analysis on prompt changes
- Ignoring JP language tokenization differences in a bilingual system
Strong Answer Indicators
- Immediately identifies four independent cost drivers (system prompt growth, RAG expansion, history accumulation, query distribution shift)
- Proposes per-component CloudWatch metrics with strict naming convention
- Designs a concrete token budget allocation per request type with hard enforcement
- Creates a PR gate requiring cost delta sign-off for prompt changes > 10% token delta
- Addresses Japanese tokenization as a non-trivial cost multiplier
Scenario 5: Raw User Input Robustness Gap
Opening Question
Q: Internal PoC testing showed 91% retrieval accuracy. Beta launch to 200 real manga fans showed 52% retrieval accuracy. Users are using abbreviations like "DB", "FMA", "HxH", slang like "that op mc isekai one", Katakana mixed with Romaji, and emoji. None of these appeared in the PoC dataset. Diagnose the failure and propose a pipeline fix.
Model Answer
The PoC evaluation set was built by developers who know the catalog. Real users don't say "Dragon Ball Vol. 1" — they say "DB" or "Goku's story." This is a normalization gap: the system was never validated against natural user vocabulary. In a RAG pipeline, the failure hits retrieval before generation: the embedding of "HxH" and the embedding of "Hunter x Hunter" land in distant regions of the embedding space because the model hasn't seen "HxH" as a canonical alias for that title. The fix requires a pre-processing normalization layer between the user's raw input and the embedding call. Components: (1) an alias expansion lookup — a DynamoDB table mapping common abbreviations and slang to canonical titles (built from community sources like MyAnimeList, Reddit, community wikis); (2) a language normalization step — Katakana/Romaji alignment using Amazon Comprehend or a lightweight custom normalizer; (3) a query augmentation step — Haiku rewrites the normalized query into 2-3 canonical phrasings before embedding, improving recall by broadening the semantic search. These steps add < 100ms latency and recover the majority of the retrieval gap.
Follow-up 1: Building and maintaining the alias table
Q: How do you build and keep an alias lookup table current for a manga catalog with 50,000 titles?
A: Three-tier approach: (1) Seed from public sources — community databases (AniList API, MyAnimeList API) provide canonical title mappings with alternate titles, abbreviations, and romanizations for the top 10,000 titles. Ingest these via a weekly Glue ETL job into a DynamoDB alias table with alias_hash as partition key, canonical_title as attribute. (2) Production query mining — weekly offline job clusters production queries that returned 0 results into Comprehend topic groups. A human reviewer annotates 50 new aliases per week from the "no results" cluster. (3) User feedback signal — when users rephrase a query (e.g., after a no-results response, user types "I meant Naruto" after asking "the orange ninja one"), capture the (original, intent) pair as a candidate alias for human review. Store aliases with a confidence score and source tag; low-confidence aliases are applied only to the query augmentation step, not the direct alias expansion.
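A lookup sketch against the alias table described in tier 1; the table name, alias hashing scheme, and attribute names are illustrative assumptions.

```python
# Alias lookup against the DynamoDB alias table (partition key alias_hash).
# Table name, hashing scheme, and attribute names are illustrative assumptions.
import hashlib
import boto3

table = boto3.resource("dynamodb").Table("manga_alias_table")

def lookup_aliases(raw_query: str) -> list:
    """Return canonical title candidates for any alias tokens in the query (additive)."""
    candidates = []
    for token in raw_query.lower().split():
        key = hashlib.sha1(token.encode()).hexdigest()
        item = table.get_item(Key={"alias_hash": key}).get("Item")
        if item and item.get("confidence", 0) >= 0.8:   # low-confidence aliases feed augmentation only
            candidates.append(item["canonical_title"])
    return candidates   # e.g. "hxh" -> ["Hunter x Hunter"]; kept alongside the raw query
```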
Follow-up 2: Language normalization for JP/Romaji/Katakana queries
Q: A user types "鬼滅 demon slayer" — mixed Japanese and English in the same query. How does the normalization pipeline handle this?
A: The normalization pipeline has three passes in sequence: (1) Language detection: call Amazon Comprehend detect_dominant_language — returns ja for Japanese-dominant, en for English-dominant, or a mixed signal. (2) Script normalization: for JP segments, run through a Katakana-to-Hiragana converter and a kanji normalizer (map old-form kyūjitai to modern shinjitai variants). Apply Amazon Translate to get an English parallel for the JP segment. For "鬼滅 demon slayer" → ["Kimetsu no Yaiba", "demon slayer"]. (3) Query augmentation: use Haiku to generate 2 additional search phrasings from the normalized query: ["Demon Slayer manga", "Kimetsu no Yaiba manga volume"]. Embed all 3 phrasings, retrieve top-3 per phrasing, deduplicate, re-rank. This multi-phrasing recall strategy recovers mixed-language queries where a single embedding would miss one of the segments.
Follow-up 3: "OP mc isekai" — no canonical alias exists
Q: A user types "that op mc isekai one where he's a slime." There's no alias for this. The catalog has Tensura (Slime Isekai). What does the pipeline do?
A: Alias lookup misses → query augmentation step is the fallback. Haiku receives the query with a few-shot prompt: User query: "that op mc isekai where he's a slime". Generate 3 search queries for a manga catalog that would find this title. Haiku produces: ["isekai manga main character reincarnated as slime", "That Time I Got Reincarnated as a Slime manga", "Tensei Shitara Slime Datta Ken"]. These go through standard embedding and retrieval. The Haiku augmentation step is the "smart fallback" for queries that are descriptive rather than title-based — a form of query-by-description retrieval. Monitor the augmentation hit rate: if > 40% of queries require augmentation to get a positive retrieval, that's a signal to expand the alias table's coverage of popular descriptive patterns.
Follow-up 4: Emoji and special character handling
Q: A user sends "🍥 manga recommendations" or "manga 🔥🔥🔥." How do you normalize emoji input without stripping semantic content? A: Emoji carry semantic content in this domain (🍥 = Naruto's fishcake symbol, well-known in the manga community). Two-step handling: (1) Emoji-to-text mapping: apply a curated lookup table of fandom-relevant emoji to text equivalents (🍥 → "Naruto", 🔥 → "action hot popular", ⚔️ → "action sword combat"). These are domain-specific mappings that a generic emoji library won't have — build them from the top-50 emoji that appear in user queries (from query-log analysis). (2) Strip unknown emoji: emoji that don't have mappings are stripped before embedding, preventing tokenizer confusion. Log stripped emoji for weekly review — if a new emoji pattern emerges frequently (e.g., 🌸 for romance manga), add it to the mapping table. Finish with standard Unicode normalization (NFC/NFKC) to collapse combining characters and width variants, and explicitly strip zero-width characters and variation selectors before embedding.
Grill 1: "Won't Anthropic's model handle this natively?"
Q: An engineer says Claude 3 understands slang and abbreviations natively — why build a normalization layer at all? A: Claude 3, the generation model, understands slang in text — but the problem is in the retrieval step, not the generation step. The embedding model (Amazon Titan Embeddings) converts "HxH" to a vector. That vector's similarity to "Hunter x Hunter" in the index depends on whether the embedding model was trained on enough text pairing those two strings. If the pair is rare in training data, the cosine similarity is low, the document is not retrieved, and Claude 3 never sees the right context. Claude 3's generation quality is irrelevant when retrieval fails — it will generate a plausible-sounding but wrong answer from general knowledge. The normalization layer ensures that the embedding call, for retrieval purposes, receives a canonical form that the embedding model reliably maps to the correct vector neighborhood. These are two different model problems.
Grill 2: Normalization pipeline adds 100ms+ — you said < 100ms
Q: In production, the normalization pipeline latency is 180ms (Comprehend call + Haiku call + alias lookup). How do you fix this?
A: Decompose and parallelize. Alias lookup (DynamoDB GetItem) is < 5ms and language detection (Comprehend) is ~25ms; the two are independent of each other, so run them concurrently: asyncio.gather(alias_lookup_call, comprehend_call) ≈ 30ms for the fast path. Haiku augmentation = 150ms (the dominant cost). Optimization: run Haiku augmentation only when the alias lookup returns no result. If alias lookup hits, skip augmentation — most popular title queries resolve via the alias table. For "HxH" → "Hunter x Hunter" (alias table hit), no augmentation needed. For "that slime isekai" (alias miss), augmentation fires. Expected latency distribution: 80% of queries → alias hit → 35ms total; 20% → alias miss + augmentation → 185ms total. An acceptable tradeoff for a 3-second end-to-end SLA.
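A sketch of the conditional fast/slow path; lookup_aliases is the sketch from Follow-up 1 above, while detect_language and augment_with_haiku are hypothetical helpers standing in for the Comprehend and Haiku calls.

```python
# Conditional-augmentation fast path: alias lookup and language detection run in parallel;
# the Haiku augmentation fires only on an alias miss. boto3 calls are synchronous, so they
# are wrapped in to_thread. detect_language and augment_with_haiku are hypothetical helpers.
import asyncio

async def normalize(raw_query: str) -> list:
    aliases, language = await asyncio.gather(
        asyncio.to_thread(lookup_aliases, raw_query),     # DynamoDB GetItem, ~5 ms
        asyncio.to_thread(detect_language, raw_query),    # Comprehend, ~25 ms
    )
    phrasings = [raw_query] + aliases
    if not aliases:                                       # alias miss -> slow path
        phrasings += await asyncio.to_thread(augment_with_haiku, raw_query, language)
    return phrasings                                      # every phrasing is embedded + retrieved
```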
Grill 3: Normalization introduces false positives — "DB" matches "Dragon Ball" but user meant "Dungeon Born"
Q: The alias table maps "DB" to "Dragon Ball." A user asking about "Dungeon Born" (a light novel abbreviated "DB") gets Dragon Ball results. How do you handle this ambiguity? A: Alias expansion should be additive, not substitutive. Instead of replacing "DB" with "Dragon Ball," expand the query to include both: query = "DB" → search candidates: ["DB", "Dragon Ball", "Dungeon Born"]. Retrieve top-2 results per candidate phrase, deduplicate, re-rank by semantic similarity to the original query context. If the user's full query is "DB where the guy fights demons in a dungeon," the context provides enough signal to rank "Dungeon Born" results above "Dragon Ball" results even if "Dragon Ball" has more coverage. Additionally, expose the alias disambiguation step: if the top-2 results disagree in domain (manga vs. light novel), return a clarification response: "Did you mean Dragon Ball or Dungeon Born? Both are commonly abbreviated as DB." The disambiguation response costs one turn but prevents a confidently wrong recommendation.
Red Flags — Weak Answer Indicators
- Saying "Claude 3 handles this natively" without recognizing the retrieval-layer problem
- No alias lookup or query normalization step in the pipeline architecture
- Treating emoji as noise to strip entirely without semantic value assessment
- No monitoring of normalization effectiveness (alias hit rate, augmentation trigger rate)
- Forgetting that multilingual queries require normalization in the embedding step, not just the generation step
Strong Answer Indicators
- Immediately distinguishes retrieval-layer failure from generation-layer capability
- Proposes three-tier alias building (public sources, production mining, user feedback)
- Handles mixed-language queries with multi-phrasing parallel retrieval
- Designs conditional augmentation (Haiku fires only on alias-miss) to control latency
- Addresses alias ambiguity with additive expansion and contextual re-ranking rather than substitution