04. Interview Q&A Deep Dive — API Design, Testing & Scale (MangaAssist)

These questions go beyond surface-level descriptions. They probe why you made specific decisions, force trade-off articulation, expose failure reasoning, and test whether you can think on your feet when the interviewer pushes back. Organized by topic with Easy → Medium → Hard → Killer Follow-Up escalation.

Section 1: API Architecture & Design Decisions

Q1: You have 6 different API types. Why not consolidate into a single REST gateway?

Easy: Each API type serves a fundamentally different communication pattern. WebSocket is for real-time streaming, REST is for request-response, gRPC is for high-throughput internal service calls, ML inference has its own invocation model on SageMaker/Bedrock, and vector search has a specialized query protocol. Forcing all of these through a single REST gateway would be architecturally inappropriate
Medium: The key reason is latency budget allocation. A single gateway creates a serialization bottleneck — every request pays the cost of the gateway's processing pipeline even if it doesn't need it. WebSocket connections are long-lived (minutes to hours) while REST calls are millisecond-level ephemeral. Forcing WebSocket through a REST proxy would add 50-100ms per token of overhead, which is devastating when you're streaming 200+ tokens. Similarly, internal gRPC calls between services benefit from HTTP/2 multiplexing and binary protobuf serialization — wrapping them in REST JSON adds unnecessary serialization cost
Hard: Consolidation also creates a blast radius problem. If I put all 6 API types behind one gateway and that gateway has a deployment issue or hits a concurrency ceiling, everything dies simultaneously. By keeping them separate, I get independent scaling and independent failure domains. When Bedrock throttled during Prime Day load testing, it only affected the ML inference path — the WebSocket connection stayed alive, and the orchestrator could return a cached or degraded response. A single gateway would have propagated the Bedrock backpressure to ALL connection types

Killer Follow-up: "But now you have 6 different things to monitor, 6 different deployment pipelines, 6 different security surfaces. How do you manage that operational complexity?"

Each API surface deploys independently through its own CodePipeline. But monitoring is unified — all APIs emit structured metrics to CloudWatch and traces to X-Ray with a common request_id that threads through the entire call chain. The operational tradeoff is real: 6 pipelines means 6 potential points of deployment failure. But the alternative — a monolithic gateway — has a worse failure mode: one bad deploy takes down everything. I chose correlated observability with independent deployability. The X-Ray service map gives me a single-pane view across all 6 API types in one dashboard, so operational overhead is manageable

Easy: The message arrives via WebSocket, gets classified as a recommendation intent, triggers a RAG retrieval for horror manga context, fans out to the recommendation engine and product catalog, feeds everything into Claude for generation, runs through guardrails, and streams back to the user
Medium: Here's the precise call chain with timings: 1. WebSocket $default handler receives the message (0ms — already connected and authenticated) 2. Input validator checks for PII, prompt injection, payload size (5ms) 3. Stage 1 intent classification — regex pre-filter doesn't match with high confidence, so... 4. Stage 2 DistilBERT classifier on SageMaker Inferentia classifies as recommendation with 0.92 confidence (15ms) 5. Parallel fan-out (all start simultaneously):
- Titan Embeddings V2 embeds "horror manga like Junji Ito" into a 1024-dim vector (30ms)
- Recommendation Engine (Amazon Personalize) returns ranked ASINs for horror manga (200ms)
- Customer Profile fetch (if authenticated) for personalization context (50ms) 6. OpenSearch KNN search with the embedding vector, filtered to horror genre — returns top-10 chunks (40ms, starts after embedding completes at T+30ms) 7. Cross-Encoder Reranker scores the 10 chunks against the query, returns top-3 (50ms, starts at T+70ms) 8. Prompt assembly — system prompt + conversation history + top-3 RAG chunks + recommendation engine ASINs + product catalog details (5ms) 9. Claude 3.5 Sonnet generation begins — first token at ~500ms TTFT (starts at T+125ms) 10. Token streaming — each token goes through lightweight guardrail checks and streams to the user via WebSocket (500-1000ms for full generation) 11. Post-generation guardrails — full 6-stage pipeline on the complete response (80ms) 12. Final frame with product cards, action buttons, and metadata sent to frontend
Total wall clock: ~1.8-2.5s for first token to appear (steps 1-9), ~2.5-3.5s for complete response
Hard: The critical optimization here is speculative execution. I don't wait for intent classification to finish before starting RAG. Since ~70% of intents require RAG anyway, I start embedding + vector search in parallel with intent classification. If the intent turns out to be something that doesn't need RAG (like "where is my order?"), I throw away the RAG results — wasting ~30ms of compute. But for the 70% majority case, I save 15ms off the critical path. The recommendation engine fan-out also runs speculatively. The net effect: 150-300ms saved on the critical path for the majority of requests

Killer Follow-up: "You said the reranker processes top-10 from OpenSearch but earlier docs say top-50. Which is it, and why did it change?"

Initially it was top-50. During the multi-model P99 latency spike at 10K concurrent (Scenario 7 in scale testing), the reranker became the bottleneck — GPU saturated with 50 candidates per request. I ran an offline evaluation: top-20 covered 95% of the same relevant results that top-50 found. Then in production, after further tuning, I tightened it to top-10 post-KNN with the reranker scoring all 10. The vector search quality with HNSW was good enough that the long tail beyond top-10 rarely contained anything the top-3 reranked results missed. Each reduction was data-driven, not arbitrary

Q3: Why gRPC for the Product Catalog but REST for everything else internally?

Easy: The Product Catalog is the highest-throughput internal API — nearly every intent needs product data. gRPC with protobuf serialization gives better performance than JSON REST for high-volume, structured data
Medium: Product Catalog responses are large, structured, and schema-stable (ASIN, title, author, format, language, page count, images). Protobuf serialization is 3-5x smaller than JSON for this payload shape and parsing is faster. At 30K+ catalog lookups per second during peak, this saves meaningful network bandwidth and CPU cycles. The other services (Orders, Returns, Shipping) have lower call volume and more dynamic schemas that change frequently. REST + JSON is more operationally friendly for those teams — they can evolve their APIs without the protobuf schema compilation step
Hard: The deeper reason is team dynamics. The Product Catalog team had an existing gRPC service with a well-defined proto schema that multiple consumers already used. I was their 7^th consumer — adopting their existing contract was easier than asking them to build a REST facade for me. The Orders, Returns, and Shipping teams all had REST APIs already instrumented with their own monitoring, rate limiting, and contract testing. Switching them to gRPC would have required cross-team migration effort with zero business value. Technical decisions at Amazon don't happen in isolation — they happen in the context of existing team contracts and migration costs

Killer Follow-up: "gRPC requires HTTP/2 which means you can't use standard API Gateway. How did the catalog calls route through your infrastructure?"

The gRPC calls from the orchestrator to the Catalog service went through internal AWS networking (VPC peering), not through API Gateway. API Gateway handled external-facing WebSocket and REST traffic only. Internally, the ECS-based orchestrator made direct gRPC calls to the Catalog service's internal NLB (Network Load Balancer), which supports HTTP/2 natively. Service discovery was via AWS Cloud Map. This is a common pattern at Amazon — API Gateway for edge traffic, direct service-to-service calls internally

Q4: Your WebSocket streaming sends individual JSON frames per token. Why not batch tokens?

Easy: Individual frames give the best perceived latency — each token appears on screen the moment it's generated, creating a smooth typing effect
Medium: I considered three approaches: (1) single-token frames (what I chose), (2) batched frames every 50ms, (3) newline-delimited streaming. Single-token frames have higher network overhead (each frame is a full WebSocket message) but minimize perceived latency. Batching at 50ms would reduce message count by 5-10x but introduces 50ms of jitter in the visual typing speed. At Amazon's scale and CDN infrastructure, the overhead of per-token frames was negligible. The frontend rendered them identically to how ChatGPT streams — users expect this pattern now
Hard: There's a subtlety: I'm not actually sending every single token as its own frame. Bedrock's streaming API returns chunks of 1-5 tokens at a time. I forward those chunks as-is rather than buffering or splitting. So the frame boundary aligns with Bedrock's natural chunk boundaries, not individual token boundaries. This was a practical optimization — splitting Bedrock chunks into individual tokens would require token boundary detection (not trivial for multi-byte UTF-8 manga titles in Japanese), while batching them into larger frames would add buffering latency. Forwarding Bedrock's natural chunks was the simplest and most latency-efficient approach

Killer Follow-up: "What happens if a WebSocket frame is lost? TCP guarantees ordering but what about API Gateway's WebSocket implementation — is it truly lossless?"

API Gateway's WebSocket is built on TCP, so in-order delivery is guaranteed at the transport layer. But there's a subtlety: if the API Gateway management API call to post a message back to the client fails (e.g., the connection was closed between sending two tokens), I get a GoneException. The orchestrator handles this by detecting the stale connection, stopping LLM generation (why waste tokens nobody will see?), and cleaning up the session. For the product cards and action buttons that come at the end of the stream, those are critical — I included a sequence_number in each frame so the frontend could detect gaps and request a resend of the final structured data if needed

Q5: You enforce rate limiting at 30 msg/min for authenticated users. How did you arrive at that number?

Easy: It was based on observed user behavior. During beta testing, the 99^th percentile chatbot user sent about 15 messages per minute during active conversations. 30 messages/min gives 2x headroom above the heaviest real usage
Medium: The rate limit serves three purposes: (1) protect Bedrock quota from abuse, (2) prevent single-user resource hogging, (3) mitigate automated scraping attempts. The number was calibrated by analyzing 4 weeks of beta traffic data. Median was 3 msg/min, P95 was 8 msg/min, P99 was 15 msg/min. Setting the limit at 30 msg/min means only abusive or automated traffic gets throttled — no legitimate user hits it. Guest users get 10 msg/min because we can't tie their usage to an identity, so the blast radius of abuse is lower
Hard: The implementation is a token bucket algorithm per customer ID (authenticated) or per session ID (guest), stored in ElastiCache Redis with a sliding window. I chose token bucket over fixed window because fixed window has a boundary problem — a user could send 30 messages at 11:59:59 and 30 more at 12:00:01, effectively 60 messages in 2 seconds. Token bucket smooths this out. The Redis key TTL auto-expires after the window period, so the rate limiter self-cleans. I also added a global LLM path rate limit — a configurable cap across ALL users — as a safety valve to protect Bedrock quota if a coordinated attack hit many different user accounts simultaneously

Q6: Your guardrails pipeline has 6 stages. What's the order and why does order matter?

Easy: PII Redaction → Price Accuracy → Toxicity Filter → Competitor Mention → ASIN Validation → Scope Check. The order is deliberate — PII redaction runs first because it's the highest-severity safety concern
Medium: The ordering follows two principles: severity (most dangerous failures first) and dependency (some stages need clean input from earlier stages). PII redaction is first because if any subsequent stage logs or side-effects the response, PII is already removed. Price accuracy is second because it makes an external API call to the Pricing Service — if I ran it later and the response was already modified by other stages, the price-matching logic would need to account for those modifications. Toxicity and competitor mention are pure filters that don't modify the response (they block and replace), so their ordering relative to each other is flexible. ASIN validation is near the end because it makes a batch API call and is the most latency-expensive guardrail (~30ms). Scope check is last because it's a catch-all
Hard: There's a tension between running guardrails sequentially (100ms total for 6 stages) and running them in parallel (40ms but with coordination complexity). I chose sequential because some stages are stateful with respect to the response text. If PII redaction replaces "card ending in 4111" with "[REDACTED]" and the toxicity filter runs in parallel on the original text, the toxicity filter's output won't have the PII redaction applied. Sequential guarantees each stage operates on the output of the previous stage. The 100ms total cost was acceptable within the 3-second latency budget because guardrails run after gen while the user is already seeing streamed tokens

Killer Follow-up: "You said guardrails run after generation, but you're streaming tokens to the user in real-time. Doesn't that mean the user sees unguarded tokens?"

Correct — there's a two-tier approach. During streaming, I run a lightweight inline check on each token chunk: regex-based PII detection and keyword-based competitor mention filter. These are the two checks that can be applied at the token level with near-zero latency. The full 6-stage pipeline runs on the completed response. If the full pipeline catches something the inline check missed (e.g., the price accuracy check or ASIN validation which require the complete response), I send a correction frame that tells the frontend to replace the streamed response with the corrected version. In practice, the correction frame fires in < 0.5% of responses. The inline filters catch the obvious stuff during streaming; the full pipeline catches edge cases after completion

Section 2: Testing Strategy Deep Dive

Q7: You test LLM outputs with an "LLM-as-a-judge" framework. Isn't that circular — using an LLM to evaluate an LLM?

Easy: It sounds circular, but it works because the judge LLM has the ground truth. I provide the judge with the user query, the chatbot's response, AND the expected behavior (correct ASIN, correct price, expected genre). The judge isn't generating answers — it's scoring answers against known criteria
Medium: The judge uses a separate Claude instance with a structured scoring rubric. It's not asking "is this a good answer?" — it's asking specific verifiable questions: "Does the response mention the correct ASIN B08X1YRSTR?" "Is the stated price within $0.01 of the reference price $9.99?" "Does the response recommend manga titles in the horror genre?" These are factual checks, not subjective quality judgments. For subjective dimensions (tone, helpfulness), I validated the LLM-as-a-judge scores against human annotations on a 100-prompt calibration set and found 92% agreement at the scoring level
Hard: The real risk of using an LLM judge is that it shares the same blind spots as the generation model. If Claude consistently misidentifies a manga genre, the judge Claude might not catch it. I mitigated this in three ways: (1) factual checks use structured data from the catalog API, not the judge's own knowledge; (2) the hallucination metric uses a custom validator that cross-references ASINs and prices against the real database — no LLM involvement; (3) I periodically ran a human evaluation on a 50-prompt random sample to check for judge drift. The LLM-as-a-judge is a scalability tool, not a replacement for human evaluation — it runs 500 prompts in 20 minutes where human eval would take days

Killer Follow-up: "Your golden dataset has 500 prompts. How do you know 500 is enough? What if there are failure modes your dataset doesn't cover?"

500 is not a magic number — it's the minimum that gives statistical stability across 8 intent categories and 6 evaluation dimensions. With 500 prompts, the smallest category (edge cases) has 80 prompts, giving a ±5% confidence interval at 95% confidence for binary pass/fail metrics. Is that enough to catch every failure mode? No. That's why the golden dataset is a living set — every production incident where the LLM generated a bad response gets added as a new test case. When the price hallucination issue was discovered, I added 30 new prompts specifically testing price accuracy across different product types. The dataset grew from 350 to 500 over 4 months of iteration. It's never "done" — it's a regression suite that grows with every failure

Q8: Contract testing caught a field rename from `format` to `product_format`. But how does that work practically with 6 provider teams?

Easy: I wrote Pact consumer contracts specifying what fields I need from each provider. Each provider team's CI pipeline verifies their service against our contracts. If they make a breaking change, their deploy is blocked
Medium: The practical challenge is adoption. At Amazon, you can't force another team to run your tests. I sold contract testing to each provider team by framing it as their protection too — if they know exactly what fields their consumers depend on, they can freely change everything else without fear. I published the Pact contracts to a shared Pact Broker, and each provider team added a pact:verify step to their CodePipeline. The verification runs in < 30 seconds per provider, so it didn't slow their CI. I also set up Slack notifications: when a contract verification fails, both teams get alerted simultaneously
Hard: The hardest part was schema evolution. When the Returns team needed to change eligible from a boolean to an object (to include a reason field), they couldn't just change it — my contract would fail. The process was: (1) they add the new eligibility_v2 field alongside the old eligible field, (2) I update my consumer to read from eligibility_v2, (3) I update my contract, (4) they run verification — passes on both old and new fields, (5) after my deploy, they deprecate the old field. This dual-write/dual-read pattern adds migration cost, but it's the price of safe evolution in a microservices environment. Without contracts, they would have just changed it and I'd have found out in production

Killer Follow-up: "What if a provider team is non-responsive? They push a breaking change and their CI is green because they removed your pact verification step."

This happened once. The Promotions team reorganized their CI and accidentally dropped the Pact verification step. Their next deploy renamed a field and broke our promotions intent. We caught it in our integration tests (which run against staging, not mocks) during our next deploy cycle. After that, I added a "contract freshness check" — a nightly job that verifies each provider's Pact verification was executed within the last 7 days. If a provider goes stale, it generates a warning in our operational dashboard and I reach out proactively. Prevention is better than detection, but you need both

Q9: You have 400 unit tests, 80 integration tests, 50 contracts, 30 E2E tests, 500 eval prompts, and 60 security tests. How do you keep that test suite maintainable?

Easy: The key is test ownership. Unit tests are owned by the component developer. Integration tests are co-owned by the integration points. Contract tests are shared with provider teams. LLM eval tests are managed as a dataset, not code. Security tests are maintained by a security rotation
Medium: The biggest maintenance burden is the LLM eval dataset. LLM behavior changes with model updates — when Claude 3.5 Sonnet gets a point release, golden dataset responses shift. I handle this with threshold-based evaluation, not exact-match: as long as scores stay above the threshold, model updates pass automatically. When a model update causes a regression (below threshold), the eval framework produces a comparison report showing which prompts regressed and by how much, so the human review is targeted, not exhaustive. For unit and integration tests, I follow the "test should read like documentation" principle — if a test needs comments to explain what it's testing, it's too complex
Hard: Test suite rot is the real enemy. Over 6 months, I had to delete ~50 tests that were testing behavior that no longer existed (removed intents, changed schema formats). I added a test coverage mapping: each test is tagged with the feature it covers, and when a feature changes, the CI flags tests whose tagged feature was modified. Tests for deleted features get auto-flagged for removal rather than silently turning into dead code that still runs and wastes CI time. The rule: every test must justify its existence by catching a real category of bugs. If a test hasn't failed (legitimately, not flaky) in 3 months, it's a candidate for review

Q10: You said you never mock SageMaker in integration tests. But staging endpoints cost money and can be flaky. How do you justify that?

Easy: Staging SageMaker endpoints cost ~$50/month to keep warm. That's trivial compared to the production bugs they catch. Mocking hides latency issues that only show up under real network conditions
Medium: The specific bug that justified this policy: the DistilBERT intent classifier had a cold-start issue where the first inference request after an idle period took 55 seconds. With a mock returning in 1ms, this was invisible. With a real staging endpoint, the integration test failed with a timeout after 30 seconds, surfacing the cold-start problem before production. The fix (warmup requests after scale-up events) would have been a production incident if discovered in production instead of staging
Hard: The flakiness concern is valid. Staging ML endpoints have higher variance than production (smaller instances, shared infrastructure). I handle this with two strategies: (1) retry-aware assertions — the test retries once on timeout before failing, matching the production orchestrator's retry behavior; (2) latency bounds, not exact values — I assert that the classifier responds in < 500ms (5x the expected P99) rather than < 50ms, so staging variance doesn't cause flaky failures. The test is checking "does it work and respond in a reasonable time?" not "does it meet production SLA?"

Killer Follow-up: "What about CI provider outages? If SageMaker staging is down, does your entire CI pipeline block?"

Yes, and that happened twice in 6 months. Both times SageMaker staging had degraded performance that caused integration tests to time out. My mitigation: I tag SageMaker-dependent integration tests separately. When they fail, the CI pipeline marks them as unstable rather than failed, logs the SageMaker health status, and allows the build to proceed if all other tests pass. The SageMaker tests then run in a deferred pipeline that retries every 30 minutes until they pass. This prevents a SageMaker outage from blocking developer deploys while still ensuring the tests eventually pass before merge

Section 3: Scale, Performance & Production Operations

Q11: You said you used Provisioned Throughput for Bedrock Sonnet. How did you size it? What if you over-provision?

Easy: I analyzed 4 weeks of production traffic to find the sustained baseline — ~3K LLM calls/second during normal hours. I provisioned for 80% of that baseline and used on-demand for overflow
Medium: Provisioned Throughput pricing is commit-based — you pay for the reserved capacity whether you use it or not. Overprovisioning wastes money; underprovisioning means you still hit on-demand throttling. My sizing: I provisioned for the P75 of daily traffic (not peak, not average). During low-traffic hours (2am-6am), I'm paying for unused capacity, but during business hours the provisioned capacity handles 75% of requests at a lower per-token cost than on-demand. The math: provisioned at $X/hour for baseline + on-demand for 25% overflow was 35% cheaper than pure on-demand at the same total volume
Hard: The tricky part is Prime Day and flash sales. I couldn't provision year-round for 10x traffic — that would be astronomical cost. Instead, I used a two-tier approach: (1) permanent Provisioned Throughput for daily baseline, (2) scheduled capacity increases 48 hours before known events (Prime Day, major manga release dates like a new One Piece volume). The scheduled increase was managed through CloudFormation stack updates triggered by an EventBridge schedule. After the event, capacity scaled back down within 24 hours. For truly unpredicted spikes, the Haiku fallback handled overflow

Killer Follow-up: "Bedrock Provisioned Throughput has a minimum commitment period. What if your traffic patterns change dramatically and the provisioned capacity is wrong?"

The minimum commitment for Provisioned Throughput is 1 month. If traffic drops (e.g., seasonal decline), you're paying for unused capacity until the commitment expires. I mitigated this by provisioning conservatively (P75 not P90) and relying on the on-demand + Haiku fallback for bursts. I also built a weekly cost review that compares provisioned utilization against actual usage. If utilization drops below 50% for 2 consecutive weeks, it triggers an alert to review the provisioning level for the next commitment period. It's not perfect — you're betting on future traffic when you commit — but the 35% cost saving on the provisioned portion makes the bet worthwhile at our scale

Q12: You described 7 scale testing scenarios. Which one would you handle differently if you did it again?

Easy: The WebSocket connection limit (Scenario 5). I should have requested the limit increase before the first load test, not after hitting 90% of the default limit. It was predictable — we knew the concurrent session target was 500K
Medium: The cache stampede (Scenario 6) fix was correct but my initial approach was wrong. I first tried a simple mutex lock ("only one request refreshes the cache, others wait"). This caused a different problem — hundreds of requests blocking on the lock, holding WebSocket connections open, and consuming Lambda concurrency. The Probabilistic Early Reexpiration approach was the right answer, but I wasted a sprint on the lock-based approach first. If I did it again, I'd go straight to PER because it's a well-known pattern for this exact problem
Hard: The DynamoDB thundering herd (Scenario 4) taught me that my partition key design was fundamentally wrong. Using session ID as the partition key meant popular shared sessions created hot partitions. My fix (random suffix + DAX + request coalescing) worked, but the root cause was a data model issue. If I redesigned from scratch, I'd use a write-sharding pattern from day one — distributing writes across multiple partition keys and collecting reads via scatter-gather. The DAX cache masked the problem rather than solving it at the data model layer. It works, but it's a band-aid that requires DAX to always be available

Killer Follow-up: "You mentioned DAX is a band-aid. If DAX goes down, does the thundering herd return?"

Yes, and that's the downside of caching as a solution to a data model problem. I mitigated DAX failure with the other fixes: exponential backoff + jitter and request coalescing are implemented in the application layer and don't depend on DAX. If DAX goes down, the thundering herd wouldn't be as bad as the original incident because the jitter and coalescing still spread the load. But it would be worse than normal because all reads go directly to DynamoDB. I monitor DAX health and have a runbook: if DAX is down for > 5 minutes, the orchestrator switches the sessions table to on-demand capacity mode (if not already) and activates more aggressive client-side caching with 30-second staleness tolerance. Not ideal, but defense in depth

Q13: Your system handles 50K messages/second at peak. What's the cost per message?

Easy: The blended cost is approximately $0.003-0.005 per message, varying by intent complexity. Simple intents (chitchat, FAQ) cost ~$0.001. Complex intents requiring full Sonnet inference cost ~$0.008-0.012
Medium: Breaking it down:
Intent classification (DistilBERT on Inferentia): ~$0.0002/msg
Embedding (Titan V2): ~$0.0001/msg
Vector search (OpenSearch Serverless): ~$0.0003/msg
Reranker (SageMaker GPU): ~$0.0004/msg
LLM generation (Claude 3.5 Sonnet, avg 200 tokens out): ~$0.004-0.008/msg (biggest cost by far)
Downstream API calls: ~$0.0001/msg (internal, compute-only cost)
Guardrails pipeline: ~$0.0002/msg
Infrastructure (DynamoDB, ElastiCache, API Gateway): ~$0.0005/msg amortized
The intelligent model routing (template responses for chitchat, Haiku for simple intents) brings the blended average down significantly because 40-50% of messages never hit Sonnet
Hard: Cost per message is a misleading metric in isolation. What matters is cost per resolved conversation. An average conversation is 5 messages over 3 minutes. If the chatbot resolves the user's question (no escalation to human), the cost is ~$0.02 per resolved conversation. A human agent costs ~$5-10 per interaction. The chatbot resolves ~75% of conversations without escalation, so the ROI is massive. The interesting cost question isn't "how much per message" but "at what deflection rate does the chatbot break even?" — and that breakeven point is at ~2% deflection. We're at 75%

Q14: How do you handle a scenario where Bedrock is completely down — not throttled, but unavailable?

Easy: The orchestrator detects Bedrock failure via circuit breaker (3 consecutive failures in 10 seconds). It switches to a graceful degradation mode with pre-cached responses for common queries and a "I'm experiencing some issues, let me connect you with a human agent" escalation for anything else
Medium: The degradation has three tiers: 1. Tier 1 (Sonnet down, Haiku available): Route all traffic to Claude Haiku. Quality drops for complex queries but most users don't notice for simple interactions 2. Tier 2 (All Bedrock LLMs down): Activate the cached response store — 200 pre-generated responses covering the top FAQ questions and product queries. Responses are tagged with "This answer was generated from our knowledge base" to set expectations. Non-covered queries get an escalation offer 3. Tier 3 (extended outage > 15 minutes): Push notification to the frontend to display a banner: "Our AI assistant is temporarily limited. You can browse our FAQ or chat with a human agent." Disable the chat widget for new sessions to prevent a backlog of unanswerable queries
Hard: The hardest part of Bedrock outage handling was the partial outage. Bedrock doesn't go fully down — it returns intermittent errors, increased latency, or partial responses. The circuit breaker had to distinguish between "Bedrock is slow but working" (wait longer) vs. "Bedrock is degraded and getting worse" (failover). I used a sliding window error rate: if > 20% of Bedrock calls fail or exceed 10-second timeout in a 60-second window, the circuit opens. The window-based approach prevents a single slow call from triggering failover while still reacting quickly to genuine degradation. Recovery is gradual — when the circuit half-opens, I send 10% of traffic to Bedrock and watch the error rate before fully reopening

Q15: You mentioned the guardrail that catches hallucinated prices has a ~2% trigger rate in early development, brought down to <0.5%. How does the price validator actually work?

Easy: The price validator extracts any dollar amounts from the LLM response, looks up the corresponding ASIN in the pricing service, and replaces any incorrect prices with the real price
Medium: The extraction is regex-based — it finds patterns like "$9.99", "9.99 USD", "priced at 9.99" and maps them to the nearest ASIN mention in the same sentence or paragraph. The mapping is key: the LLM might mention 3 products with 3 prices, and the validator needs to match each price to the correct product. I use proximity-based matching: each extracted price is associated with the closest ASIN reference (by character position in the response text). Then each (ASIN, price) pair is validated against the live Pricing API. If the price is wrong, it's replaced in-place
Hard: The edge cases are brutal. The LLM sometimes generates comparative prices ("usually $12.99 but on sale for $9.99") — both are valid amounts but only the sale price is the current price. The validator checks BOTH against the pricing API: the current price and the list price. If the LLM invents a discount that doesn't exist ("50% off!" when there's no active promotion), the validator removes the discount claim and states only the current price. Another edge case: multi-currency. Japanese manga volumes sometimes show yen prices from the JP market. The validator only validates USD prices for the US store experience, but flags yen amounts for manual review. Getting the regex + proximity matching + multi-price validation right required ~30 unit tests covering these edge cases alone

Killer Follow-up: "What if the Pricing API is slow or down during validation? Do you serve the response with potentially wrong prices?"

Never. Prices are safety-critical — showing a wrong price is a legal issue. If the Pricing API doesn't respond within 200ms, the price validator removes ALL price mentions from the response and adds a note: "Check the product page for current pricing." This is aggressive but correct. I'd rather show no price than a wrong price. The Pricing API has 99.99% availability and P99 < 50ms, so this fallback fires extremely rarely. But when it does, the response still makes sense — the product recommendations and descriptions are intact, just without inline prices

Section 4: Cross-Cutting Concerns & System Thinking

Q16: How do you version your APIs? What happens when the frontend needs a breaking change?

Easy: External REST APIs use URL path versioning (/v1/chat/message). WebSocket protocol version is negotiated during $connect via a query parameter. Internal APIs use contract testing to prevent unannounced breaking changes
Medium: For the WebSocket API, the frontend sends a protocol_version parameter during the $connect handshake. The backend supports protocol versions N and N-1 simultaneously. When I needed to add product card schema changes (adding a ratings field to the product JSON), I introduced protocol version 2. Old frontends on v1 got the old schema; new frontends on v2 got the enriched schema. After 4 weeks when analytics showed 99% of sessions were on v2, I deprecated v1. This rolling upgrade approach meant zero downtime for frontend deployments
Hard: The hardest versioning challenge was the LLM response format. When I changed the guardrails pipeline to add the Scope Check stage, responses that previously passed guardrails now occasionally got redirected. This wasn't an API schema change — the schema was identical — but the behavior changed. Dependent frontend logic that assumed "if I get a response, it fully answers the question" broke for scope-redirected responses. I learned to version behavior changes, not just schema changes. The metadata.guardrail_action field now explicitly indicates if the response was modified by guardrails so the frontend can handle it appropriately

Q17: Your system generates analytics data from every conversation. How do you ensure PII never reaches your analytics pipeline?

Easy: PII redaction runs at the input stage before the message reaches the LLM, so PII never enters model logs. The analytics pipeline receives the redacted version of all messages
Medium: There are three PII boundaries: (1) Input redaction — PII is replaced with tokens before the message reaches the orchestrator. The original message with PII is NEVER persisted. (2) Output redaction — the guardrails PII stage scrubs any PII that the LLM generates (e.g., if it reconstructs a credit card number from conversation context). (3) Analytics pipeline scrubbing — a secondary Comprehend-based PII scan runs on the Kinesis stream before data reaches Redshift, as a defense-in-depth layer
Hard: The hardest PII challenge was conversation history. DynamoDB stores conversation turns for multi-turn context. Those turns contain user messages that have been PII-redacted, but the redaction is lossy — "[REDACTED]" tokens don't preserve the original semantic meaning. When the LLM sees "My email is [EMAIL]" in conversation history, it can still infer that the user shared contact information. I addressed this by separating PII metadata from content: the redacted message is stored for LLM context, but a separate encrypted PII vault (stored in a KMS-encrypted DynamoDB column) holds the original PII fields for compliance retrieval. This way, if customer service needs to see "what email did the user mention?", they can access it through an authorized PII retrieval API, but the analytics pipeline and LLM context never see it

Killer Follow-up: "GDPR right to be forgotten — if a user requests deletion, how do you purge their data from everywhere including conversation history, analytics, and model training data?"

Conversation history in DynamoDB has a 24-hour TTL by default, so it auto-expires. For explicit deletion requests, I delete the session from DynamoDB immediately + issue a delete event to the Kinesis stream that propagates to Redshift (analytics) and triggers a data pipeline that scrubs the user's data from the Redshift analytics tables. For the LLM — MangaAssist uses Bedrock on-demand inference, not fine-tuning on user data. User conversations are not used for model training. This was a deliberate architectural decision: using only inference (not training) on user data simplifies GDPR compliance enormously because there's no "unlearning" problem

Q18: If you had to redesign this system from scratch knowing everything you know now, what would you change?

Easy: I'd use Provisioned Throughput for Bedrock from day one instead of discovering we needed it during load testing. I'd also implement the Probabilistic Early Reexpiration cache pattern from the start rather than hitting the cache stampede first
Medium: I'd rethink the reranker. The Cross-Encoder reranker on SageMaker was the biggest latency bottleneck and operational burden (GPU scaling, cold starts, batch inference complexity). Today, I'd evaluate using Bedrock's built-in reranking capabilities or a retrieval model like Cohere Rerank that integrates more natively with the Bedrock ecosystem. The dedicated SageMaker endpoint was the right call at the time, but the operational overhead of managing a GPU-based real-time endpoint was disproportionate to its value
Hard: The deepest architectural change: I'd move from a synchronous orchestration pattern to an event-driven architecture for the non-streaming path. Currently, the orchestrator calls each service synchronously (albeit in parallel). If any call is slow, the orchestrator holds resources waiting. An event-driven approach using Step Functions or EventBridge would let me decouple the orchestration from individual service latencies. The streaming response path would remain synchronous (users need real-time tokens), but the data gathering phase (catalog lookups, recommendation engine, customer profile) could be fully async with results assembled as they arrive. This would improve resilience and reduce the need for aggressive timeouts and circuit breakers

Section 5: Scenario-Based "What Would You Do?" Questions

Q19: It's Prime Day. Your dashboard shows Bedrock P99 climbing from 1.5s to 4s over the last 30 minutes. LLM error rate is at 3% and rising. What do you do, step by step?

Expected Answer Structure:

Immediate (0-2 minutes): Check if the issue is MangaAssist-specific or a Bedrock regional issue. Look at AWS Health Dashboard and internal Bedrock metrics. If it's regional degradation, nothing I can do about Bedrock itself — focus on mitigating impact
Short-term (2-10 minutes): Activate the Haiku fallback for simple intents immediately (this is automated via the orchestrator's 5% threshold, but I'd lower it to 3% given it's Prime Day). Enable aggressive semantic cache for common queries. This reduces Bedrock load by 25-30%
Escalation (5-15 minutes): If P99 keeps climbing, activate Tier 2 degradation — cached responses for top-200 queries, escalation offer for everything else. Page the on-call team. File a Bedrock support case
Communication: Update the status page. Push a frontend banner if degradation is user-visible
Post-incident: Analyze trace data to understand if our traffic pattern contributed. Review provisioned throughput sizing. Update the runbook

Follow-up: "The Haiku fallback is active but now Haiku is also slow. What's your next move?"

If both Sonnet and Haiku are degraded, it's a Bedrock regional issue. I'd activate multi-region failover (Route 53 routes to us-west-2 where we have a secondary deployment). If the secondary region is also affected, go to Tier 2 (cached responses + escalation). The absolute worst case: Tier 3 — disable the chat widget, direct users to existing self-service tools (FAQ pages, order tracking page). The chatbot is a value-add, not a critical path for purchases. Losing it during Prime Day hurts but doesn't block revenue

Q20: A new team wants to consume your chatbot's API to build a mobile app experience. What do you need to discuss with them?