Interview Q&A — Comprehensive Architectural Design
Skill 1.1.1 | Task 1.1 — Analyze Requirements and Design GenAI Solutions | Domain 1
Scenario 1: Wrong FM for Intent Routing
Opening Question
Q: Your MangaAssist chatbot routes every incoming user message through Claude 3 Sonnet for intent classification before passing it to a downstream recommender. Latency is 1.4 seconds and your Bedrock bill is $40K/month. How do you diagnose what's wrong and what's the architectural fix?
Model Answer
The root problem is task-model mismatch. Intent classification is a deterministic, low-complexity labeling task — output is a small vocabulary (browse, search, recommend, purchase-intent, other). Claude 3 Sonnet at $3/$15 per 1M tokens is priced for deep reasoning and synthesis; Claude 3 Haiku at $0.25/$1.25 handles structured classification just as well. On a 400-token prompt returning a 5-token label, Sonnet costs roughly 12× more than Haiku. At 1M messages/day that's roughly $38K/month spent on classification alone versus about $3K on Haiku, roughly $35K/month wasted. The architectural fix is a model-tier policy: introduce a central policy map (task_type → model_id) externalized in AppConfig or Parameter Store. Classification, moderation pre-checks, and slot extraction bind to Haiku; multi-turn answer synthesis escalates to Sonnet. Every call site emits task_type, model_id, latency_ms, and tokens as CloudWatch dimensions so the tier policy is observable and its SLA is measurable.
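A minimal sketch of what that policy map can look like in code, assuming a Claude 3 Messages-style payload; the task names and model IDs are illustrative, and in production the map would be fetched from AppConfig rather than hardcoded:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# In production this map is loaded from AppConfig / Parameter Store and
# refreshed periodically; it is hardcoded here only to keep the sketch short.
MODEL_POLICY = {
    "intent_classification": "anthropic.claude-3-haiku-20240307-v1:0",
    "moderation_precheck": "anthropic.claude-3-haiku-20240307-v1:0",
    "slot_extraction": "anthropic.claude-3-haiku-20240307-v1:0",
    "answer_synthesis": "anthropic.claude-3-sonnet-20240229-v1:0",
}

def invoke_for_task(task_type: str, prompt: str, max_tokens: int = 20) -> str:
    """Resolve the model from the policy map and call the Bedrock Messages API."""
    model_id = MODEL_POLICY[task_type]  # KeyError = unknown task type, fail loudly
    response = bedrock.invoke_model(
        modelId=model_id,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```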
Follow-up 1: Benchmarking the cheaper model
Q: How do you prove Haiku is "good enough" for intent classification before you switch 100% of traffic?
A: Run a labeled evaluation set of 500–1,000 real user queries — sampled from production or beta logs, not hand-crafted. Measure classification accuracy, precision per intent class, and false-positive rate for the other class on both models. If Haiku meets a pre-agreed quality gate (e.g., ≥ 97% accuracy, ≤ 1% misrouting to wrong intent tier), it passes. Run the same set with Sonnet as a baseline to establish the delta. If the gap is less than 2% accuracy, the cost savings justify the switch. Instrument a shadow-traffic period where both models run in parallel and log disagreements before fully cutting over.
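A sketch of that quality gate, assuming a labeled set of {query, intent} rows and a classify(model_id, query) wrapper around the Bedrock call; the thresholds mirror the numbers above:

```python
def evaluate(model_id: str, labeled_set: list[dict], classify) -> float:
    """Accuracy of classify(model_id, query) against the labeled intents."""
    correct = sum(
        1 for row in labeled_set if classify(model_id, row["query"]) == row["intent"]
    )
    return correct / len(labeled_set)

# labeled_set = [{"query": "...", "intent": "browse"}, ...]  # 500-1,000 real queries
# haiku_acc  = evaluate(HAIKU_ID, labeled_set, classify)
# sonnet_acc = evaluate(SONNET_ID, labeled_set, classify)
# Gate: switch only if haiku_acc >= 0.97 and (sonnet_acc - haiku_acc) < 0.02
```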
Follow-up 2: Policy externalization
Q: Why should the model ID live in AppConfig rather than in a Lambda environment variable or in the application code?
A: Three reasons. First, changeability without redeployment — AppConfig lets you push a new routing policy and roll it back in seconds; a Lambda env-var change forces a function configuration update and typically a deployment pipeline run. Second, A/B testing — AppConfig supports feature flags and percentage-based rollouts, so you can route 10% of classification traffic to a newer, cheaper model and compare quality metrics in real time. Third, auditability — AppConfig records every configuration change with timestamps and authors; if a cost spike reappears you can correlate it exactly to a policy change. Hardcoded model IDs in source code are the worst option because they tie a business routing decision to a code commit and review cycle.
Follow-up 3: Observability for tier policy
Q: What CloudWatch metrics would you define to prove the tiered model policy is being respected?
A: Publish a custom CloudWatch metric BedrockInvocation with dimensions: task_type, model_id, environment, service_name. Define alarms on: (1) model_id=claude-3-sonnet AND task_type=intent_classification — any non-zero count triggers a P2 alert (premium model used for cheap task). (2) task_type=answer_synthesis AND model_id=claude-3-haiku — if the policy ever accidentally routes synthesis to Haiku, you catch quality regressions immediately. (3) Total token spend per task type per hour with an anomaly detection alarm. This gives you both cost governance and quality safety nets simultaneously.
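A sketch of the per-invocation emission with those dimensions via CloudWatch put_metric_data; the namespace and the second metric name are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_invocation_metric(task_type, model_id, latency_ms,
                           environment="prod", service_name="manga-assist"):
    """Record one Bedrock invocation with the dimensions the tier policy needs."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Bedrock",
        MetricData=[
            {
                "MetricName": "BedrockInvocation",
                "Dimensions": [
                    {"Name": "task_type", "Value": task_type},
                    {"Name": "model_id", "Value": model_id},
                    {"Name": "environment", "Value": environment},
                    {"Name": "service_name", "Value": service_name},
                ],
                "Value": 1,
                "Unit": "Count",
            },
            {
                "MetricName": "BedrockLatencyMs",
                "Dimensions": [
                    {"Name": "task_type", "Value": task_type},
                    {"Name": "model_id", "Value": model_id},
                ],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            },
        ],
    )
```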
Follow-up 4: Degradation when Haiku is throttled
Q: Haiku has a lower Bedrock concurrency quota than Sonnet. What happens to your routing if Haiku gets throttled during a flash sale?
A: The policy needs a fallback tier. When a ThrottlingException is caught on Haiku, the policy can either (a) escalate to Sonnet for that request (quality preserved, cost spike bounded to the throttle window), or (b) use a local deterministic classifier (a small fine-tuned model or keyword router) as a graceful degradation tier. Option (a) is simpler; option (b) eliminates the cost spike entirely. The key is that the fallback must be explicit in the policy, not an ad hoc try/catch in the application layer. Log every fallback invocation as model_id=haiku-fallback-sonnet with a reason dimension so the operations team can detect pathological throttling and request a quota increase.
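A sketch of that explicit fallback tier, assuming an invoke(model_id, prompt) wrapper and a metric emitter passed in by the caller; only ThrottlingException triggers escalation:

```python
from botocore.exceptions import ClientError

HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"

def classify_with_fallback(prompt: str, invoke, emit_metric) -> str:
    """Try the cheap tier first; escalate to Sonnet only on throttling."""
    try:
        return invoke(HAIKU, prompt)
    except ClientError as err:
        if err.response["Error"]["Code"] != "ThrottlingException":
            raise  # other errors are not part of the fallback contract
        # Escalate to the premium tier, and make the fallback observable so the
        # ops team can detect pathological throttling and request a quota increase.
        emit_metric(model_id="haiku-fallback-sonnet", reason="throttled")
        return invoke(SONNET, prompt)
```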
Grill 1: "Sonnet everywhere is safer"
Q: Your product director says: "Just use Sonnet everywhere. We can't risk misrouting customer requests." How do you push back without dismissing the concern?
A: The concern is valid but the solution is wrong. The risk of misrouting is a quality risk that belongs in the benchmark, not a reason to skip benchmarking. The correct response is: "Let's measure whether Haiku misroutes before assuming it does." Present the evaluation results side by side. If Haiku does misroute, the data will show it and Sonnet stays for that task — but we need to see the evidence. Beyond that, point out that a roughly $35K/month waste on classification compounds: if we can't optimize the cheapest task we lose runway for the tasks that actually need Sonnet. And "safest" conflates model capability with model cost — a routing error from Haiku is recoverable at the next retrieval step; it is not the same risk class as a synthesis failure.
Grill 2: Multi-model versioning complexity
Q: Now you have 3 models in the policy for 6 task types. How does your team manage model-version upgrades without silent behavior regressions?
A: Every model ID in the policy should be pinned to a full versioned model string (e.g., anthropic.claude-3-haiku-20240307-v1:0). When a new Haiku or Sonnet version is released on Bedrock, promote it through a canary: run 5% of production traffic on the new version, compare quality metrics against the baseline using an automated LLM-as-judge or golden-set evaluator, gate promotion to 100% on a quality threshold. The policy registry records which version is active per task type. Never use latest or an unpinned alias in production — that creates silent behavior changes whenever AWS rotates the model pointer. Version changes are treated as a deployment event with the same rollback capability as a code change.
Grill 3: The classification task grows
Q: Intent classification expands: instead of 5 labels you now need 25 fine-grained sub-intents with conditional routing. Does your argument for Haiku still hold?
A: It depends on whether the complexity stays in the label space or enters the reasoning space. If it's still a single-turn structured classification task — "given this user message, return one of 25 labels" — Haiku with few-shot examples still works well; benchmark it. If the task now requires multi-hop reasoning ("is this a complaint about a recommendation or about delivery, and if delivery, which carrier?"), then you may need Sonnet for that classification path only. The policy handles this: bind the 20 simple sub-intents to Haiku and the 5 complex conditional intents to Sonnet. But if Haiku still passes quality thresholds on your 25-label set, you keep the cost savings — don't assume complexity always requires a bigger model; let the benchmark decide.
Red Flags — Weak Answer Indicators
- "We'll just use Sonnet everywhere to be safe" without citing quality benchmarks
- No mention of a labeled evaluation set before switching models
- Treating model selection as a one-time decision with no ongoing observability
- Hardcoding model IDs in application logic without acknowledging the change-management risk
- Confusing model capability tier with model reliability — Haiku and Sonnet have the same Bedrock SLA
Strong Answer Indicators
- Immediately quantifies the cost delta (12× per request, roughly $35K/month at 1M msgs/day)
- Proposes a labeled benchmark with representative traffic, not clean examples
- Externalizes routing policy (AppConfig/Parameter Store) with rollback capability
- Defines specific CloudWatch dimensions and alarm conditions for policy enforcement
- Includes a Haiku throttle fallback strategy in the architecture, not just the happy path
Scenario 2: Sync/Async Mismatch in Streaming Chat
Opening Question
Q: Users are reporting that the MangaAssist chatbot frequently freezes mid-response and disconnects. Traces show the Bedrock token stream starts successfully but the WebSocket session drops 4–8 seconds in. You find a product-metadata lookup inside the streaming loop. Walk me through what's happening and how you fix it.
Model Answer
The streaming loop is mixing two incompatible latency profiles. The Bedrock invoke_model_with_response_stream call begins forwarding tokens immediately — latency target is sub-second for first token. But the loop is pausing to call a synchronous product-metadata enrichment service inside the chunk-forwarding iteration. When that service takes 4–8 seconds (P95 enrichment latency), the stream stalls, WebSocket idle-timeout fires, and the client disconnects before the answer lands. The fix is workload separation: move mandatory enrichment lookups to before the stream starts, or make them asynchronous and optional during streaming. Specifically: (1) pre-fetch product metadata as a preflight call before invoking Bedrock; cache it in memory for the duration of that session turn; then start the stream with all enrichment data already available. (2) For data that is "nice to have" in the answer, make it a post-stream append signal — emit it as a separate non-blocking message after the main stream completes. The WebSocket session then has a guaranteed smooth token path independent of enrichment latency.
Follow-up 1: What to prefetch vs. defer
Q: How do you decide which enrichment operations must be pre-fetched before the stream starts vs. which can be deferred?
A: Classify enrichment by answer dependency: if the user's query cannot be correctly answered without the data (price, stock status, series ID), it must be pre-fetched. If it enriches or links the answer (cover art URL, review blurb, related titles), it is deferrable. Pre-fetch everything in the mandatory bucket with a strict timeout (e.g., 800ms budget for all pre-flight calls). Run them in parallel using asyncio.gather or concurrent Lambda fan-out. If a mandatory call exceeds the budget, fail fast and answer without grounding rather than stalling the stream. Deferrable data is fetched in a separate async task and emitted as a structured append frame after the main stream closes — the client renders it progressively.
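A sketch of the pre-flight phase under that 800ms budget, assuming a catalog client that exposes async get_price/get_stock coroutines:

```python
import asyncio

PREFLIGHT_BUDGET_S = 0.8  # hard ceiling for all mandatory pre-flight calls

async def preflight(product_id: str, catalog) -> dict:
    """Fetch mandatory enrichment in parallel before the Bedrock stream starts."""
    names = ["price", "stock"]
    calls = [catalog.get_price(product_id), catalog.get_stock(product_id)]
    try:
        results = await asyncio.wait_for(
            asyncio.gather(*calls), timeout=PREFLIGHT_BUDGET_S
        )
        return dict(zip(names, results))
    except asyncio.TimeoutError:
        # Fail fast: start the stream without grounding instead of stalling it.
        return {}
```

Deferrable lookups (cover art, related titles) would be started with asyncio.create_task and emitted as a separate append frame after the stream closes, never awaited inside the chunk-forwarding loop.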
Follow-up 2: Testing stream resilience
Q: A dev says "it works fine in local testing." Why wouldn't this bug appear in local or single-user tests?
A: Two reasons. First, latency masking: in local testing, the enrichment service runs on the same machine or same LAN — round-trip is <10ms, well below any WebSocket idle timeout. In production, you have cross-AZ or cross-region network calls, retries, and dependency queueing under load. Second, concurrency hiding: when one engineer sends sequential requests, there's no queue depth on the enrichment service. Under 500 concurrent users, each enrichment call competes for connection pool slots, thread capacity, and downstream database locks — P95 latency balloons. The fix is to add a load-characterization test: inject an artificial 3–5 second delay into the enrichment mock during integration tests to prove the streaming path doesn't block. This becomes a regression gate in CI.
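A sketch of that regression gate, assuming pytest with the pytest-asyncio plugin and a hypothetical stream_answer() entry point that yields chunks:

```python
import asyncio
import time

import pytest

from manga_assist.streaming import stream_answer  # hypothetical module under test

class SlowEnrichment:
    """Test double that reproduces the third-party P95 latency."""
    async def lookup(self, product_id: str) -> dict:
        await asyncio.sleep(4)
        return {"price": 1200}

@pytest.mark.asyncio
async def test_stream_not_blocked_by_slow_enrichment():
    gaps, last = [], time.monotonic()
    async for _chunk in stream_answer("recommend a shonen series",
                                      enrichment=SlowEnrichment()):
        now = time.monotonic()
        gaps.append(now - last)
        last = now
    # No inter-chunk gap after the first token may exceed 500 ms.
    assert all(gap < 0.5 for gap in gaps[1:])
```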
Follow-up 3: WebSocket timeout configuration
Q: Can you just increase the WebSocket idle timeout to 30 seconds as a quick fix?
A: That's a symptom fix, not a root fix. Extending the timeout patches the disconnection but doesn't fix the user experience — the user still stares at a frozen screen for 8 seconds. Worse, it increases server-side resource consumption: each stalled WebSocket connection holds memory, file descriptors, and potentially a thread. At 10K concurrent users with 8-second stalls, you have 80K connection-seconds of unnecessary hold. The correct fix makes the streaming path independent of slow dependencies so idle time drops to near-zero. Timeout tuning is appropriate only as a temporary mitigation while the architecture change is deployed — not as a permanent solution.
Follow-up 4: Observability for streaming health
Q: What metrics would you add to detect this class of problem proactively?
A: Publish three stream-specific metrics: (1) StreamChunkGapMs — histogram of time between consecutive chunks forwarded to the client. P95 > 500ms in the middle of a stream (not the first token) is an anomaly alarm. (2) StreamDisconnectCount by disconnect_reason dimension — distinguish client-initiated vs. server-side vs. timeout disconnects. (3) StreamEnrichmentLatencyMs — track the enrichment pre-fetch latency separately. Correlate StreamChunkGapMs spikes with StreamEnrichmentLatencyMs spikes to prove causality in post-incident analysis. These three metrics give you both an early warning alarm and a root-cause shortcut when the next incident occurs.
Grill 1: Product demands enrichment in the stream body
Q: Product insists users must see enriched product data (prices, stock) woven into the streamed answer — not appended after. How do you architect that?
A: If the data must appear inline with streamed tokens, the data must be available before the first relevant token is generated. That means the enrichment must complete in the pre-flight phase and be injected into the prompt context, not into the stream itself. Concretely: fetch mandatory enrichment data in parallel with the user intent analysis step. Compose the full prompt context (system prompt + RAG results + enrichment data + conversation history) before calling Bedrock. The model then streams tokens that naturally include the enriched data because it was in the context. The key insight is that "enrichment in the stream" is a generation-time concern solved at context assembly time, not at chunk-forwarding time.
Grill 2: Third-party enrichment SLA is unpredictable
Q: The enrichment service is a third-party catalog API with a P99 of 4 seconds. What do you do when you can't fix the upstream latency?
A: Three mitigations, applied in layers. (1) Cache aggressively: product metadata changes infrequently (minutes to hours). A Redis cache with a 5-minute TTL covers the vast majority of queries at <1ms lookup. Cache hit rate > 95% eliminates the P99 problem for most traffic. (2) Stale-while-revalidate: serve the cached value immediately, kick off a background revalidation, update the cache for the next request. (3) Graceful omission: if the cache is cold and the live call would exceed a 500ms budget, answer without the enriched data and include a CTA ("for the latest price, tap the product card") rather than stalling the stream. The architecture should never allow an uncontrolled third-party SLA to hold a Bedrock stream hostage.
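A sketch of that cache layer using redis-py's asyncio client, assuming an async catalog_client.get(); key names, TTL, and the live-call budget are illustrative:

```python
import asyncio
import json

import redis.asyncio as redis

CACHE_TTL_S = 300          # 5-minute freshness window
LIVE_CALL_BUDGET_S = 0.5   # never let the third party hold the stream hostage

r = redis.Redis()

async def get_product_metadata(product_id: str, catalog_client) -> dict | None:
    cached = await r.get(f"product:{product_id}")
    if cached is not None:
        # Serve stale immediately; refresh in the background for the next caller.
        asyncio.create_task(_revalidate(product_id, catalog_client))
        return json.loads(cached)
    try:
        fresh = await asyncio.wait_for(
            catalog_client.get(product_id), timeout=LIVE_CALL_BUDGET_S
        )
    except asyncio.TimeoutError:
        return None  # graceful omission: caller answers without the price
    await r.set(f"product:{product_id}", json.dumps(fresh), ex=CACHE_TTL_S)
    return fresh

async def _revalidate(product_id: str, catalog_client) -> None:
    fresh = await catalog_client.get(product_id)
    await r.set(f"product:{product_id}", json.dumps(fresh), ex=CACHE_TTL_S)
```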
Grill 3: What if the enrichment failure causes incorrect answers?
Q: If you skip enrichment and answer anyway, the model might quote a wrong price from its training data. Isn't that worse than a 4-second delay?
A: Yes, and the solution is explicit fallback framing, not silent omission. If enrichment is unavailable, the prompt context should include a constraint: "Price data is not available for this request. Do not state a specific price. Direct the user to the product page for current pricing." The model then frames the answer correctly without confabulating. This is safer than either silently serving a stale price or stalling the stream. The architectural principle is: every dependency failure mode should have a defined prompt-level response strategy, not just an infrastructure-level retry. The model's behavior at enrichment failure should be a designed state, not an accidental one.
Red Flags — Weak Answer Indicators
- Suggesting "increase WebSocket timeout" as the primary fix
- Not distinguishing mandatory enrichment from optional enrichment
- No mention of testing stream resilience under dependency latency injection
- Treating this as a pure infrastructure problem (scaling the enrichment service) vs. an architecture problem (where enrichment lives in the call sequence)
- Missing the first-token-latency vs. stream-continuation distinction
Strong Answer Indicators
- Classifies enrichment by answer-dependency type before prescribing a fix
- Recommends pre-flight parallel enrichment with a strict latency budget
- Designs a graceful omission path with prompt-level fallback framing
- Proposes stream-specific observability metrics (chunk gap, disconnect reason, enrichment latency)
- Addresses the third-party SLA problem with a caching + stale-while-revalidate layer
Scenario 3: Shared Read/Write Path in OpenSearch
Opening Question
Q: During every nightly catalog ingest job, MangaAssist users start getting slow or empty search results from the chatbot. Bedrock responses look fine in traces, but retrieval quality drops to near zero for 2–3 hours. What's the architectural failure and how do you fix it?
Model Answer
The root failure is workload contention: the same OpenSearch Serverless collection is absorbing both the live retrieval traffic and the heavy background ingest writes. During a bulk indexing job, write amplification consumes the collection's OCU allocation, search thread pools compete with indexing threads, and query latency climbs until timeouts exceed the RAG pipeline's threshold — at which point the orchestrator proceeds without retrieved context and the model answers from training data alone (hallucinating). The answer still returns HTTP 200, so the problem is invisible to infrastructure health checks. Architecturally, the fix requires workload isolation: separate the write path from the read path. Options include: (1) a blue/green index strategy — ingest into a shadow index, validate quality, then atomically swap an index alias from the live index to the newly built one; (2) a dedicated OCU allocation for ingest vs. serving with separate collections; (3) scheduled ingest windows aligned to off-peak hours (1–5 AM JST for a Japanese manga audience). The key design principle: retrieval is the SLA-bearing workload and must be protected from batch-write pressure.
Follow-up 1: Detecting silent quality degradation
Q: The system was returning HTTP 200 throughout. How do you instrument this so you detect the degradation before users report it?
A: Standard infrastructure metrics (HTTP 5xx, P99 latency) are insufficient here because the failure mode is quality degradation, not hard failure. Add three quality-adjacent metrics: (1) RetrievedChunkCount per request — if a request that should retrieve 5 chunks returns 0, that's a retrieval failure even if the HTTP call succeeded. Alert when P50(RetrievedChunkCount) < 2 for more than 2 consecutive minutes. (2) SearchLatencyMs from OpenSearch directly — correlate spikes with ingest job schedules. (3) RAGGroundingRate — the percentage of FM responses that have at least one cited retrieved chunk. A drop from 95% to 40% during ingest hours is a leading indicator of this failure. These three together give an alarm that fires on quality degradation, not just infrastructure failure.
Follow-up 2: Blue/green index swap mechanics
Q: Walk me through the alias-swap approach. What validation happens before you cut over?
A: The ingest pipeline writes to a shadow index (e.g., manga-catalog-v2). Before alias promotion: (1) run a chunk-count check — verify the new index has ≥ 99% of the document count of the live index, ruling out a truncated ingest; (2) run a golden-set retrieval test — execute 50 known-good queries against the shadow index and confirm recall ≥ baseline within a 5% tolerance; (3) check index health status is green (no relocating or unassigned shards). If all gates pass, execute an atomic alias update: a single POST /_aliases request removes manga-catalog-current from manga-catalog-v1 and adds it to manga-catalog-v2 atomically, with zero downtime for read traffic. If validation fails, the shadow index is discarded and the ingest job retries without ever touching the live serving path.
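A sketch of the promotion gate and atomic swap using opensearch-py; index names follow the example above, the endpoint and auth are elided, and the golden-set check is left as a placeholder:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "search-endpoint", "port": 443}], use_ssl=True)

def promote(shadow="manga-catalog-v2", live="manga-catalog-v1",
            alias="manga-catalog-current") -> bool:
    """Validate the shadow index, then atomically move the serving alias."""
    live_count = client.count(index=live)["count"]
    shadow_count = client.count(index=shadow)["count"]
    if shadow_count < 0.99 * live_count:
        return False                      # truncated ingest: never promote
    if client.cluster.health(index=shadow)["status"] != "green":
        return False
    # golden-set recall check (50 known-good queries) would run here
    client.indices.update_aliases(body={"actions": [
        {"remove": {"index": live, "alias": alias}},
        {"add": {"index": shadow, "alias": alias}},
    ]})                                   # both actions apply atomically
    return True
```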
Follow-up 3: What if ingest must be real-time?
Q: The catalog team says they need near-real-time updates — product prices can change every 15 minutes. Blue/green won't work. What's your design?
A: For near-real-time updates the design shifts from batch-rebuild to incremental streaming: use DynamoDB Streams on the product catalog table to trigger a Lambda that updates only the changed documents in OpenSearch. This keeps the ingest volume low (thousands of documents per hour vs. millions in a bulk job) and avoids the write amplification that causes contention. The write impact per document is a single _index call, not a full-index merge storm. For the capacity concern: keep indexing capacity isolated from search capacity. OpenSearch Serverless allocates indexing OCUs and search OCUs separately, so size the OCU maximum with headroom for both; on a provisioned domain, put ingest on its own domain or dedicated capacity. Additionally, use an ingest throttle (_bulk request rate limit) to prevent DynamoDB Streams lag from turning into a burst write storm on OpenSearch.
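A sketch of the stream-triggered Lambda, assuming a NEW_IMAGE stream view and an illustrative document shape and index name:

```python
import os

from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": os.environ["OS_ENDPOINT"], "port": 443}],
                    use_ssl=True)
INDEX = "manga-catalog-current"

def handler(event, context):
    """Translate one DynamoDB Streams batch into a single bounded _bulk call."""
    actions = []
    for record in event["Records"]:
        product_id = record["dynamodb"]["Keys"]["product_id"]["S"]
        if record["eventName"] == "REMOVE":
            actions.append({"_op_type": "delete", "_index": INDEX, "_id": product_id})
        else:
            img = record["dynamodb"]["NewImage"]
            doc = {"title": img["title"]["S"], "price": int(img["price"]["N"])}
            actions.append({"_op_type": "index", "_index": INDEX,
                            "_id": product_id, "_source": doc})
    # One bulk request per batch keeps write pressure low and easy to throttle.
    helpers.bulk(client, actions)
```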
Follow-up 4: Fallback behavior when retrieval is degraded
Q: Even with good architecture, retrieval can fail (cluster restarting, network blip). What should the chatbot do when OpenSearch returns 0 results?
A: Design an explicit degradation mode. When RetrievedChunkCount == 0 AND the upstream had no network error (which would trigger a retry), the orchestrator should route to a fallback prompt template that: (1) acknowledges limited context, (2) answers only from the model's general knowledge with a confidence caveat, (3) offers to retry or link to browse. Do not silently generate an answer as if retrieval succeeded — that produces confidently wrong answers. Log the degradation mode as retrieval_mode=fallback so every degraded response is measurable and the degradation rate can be tracked per user cohort, time window, and platform event (like an ingest job).
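A sketch of the template switch, with illustrative template text and the retrieval_mode logged as described:

```python
import logging

logger = logging.getLogger("manga-assist.orchestrator")

GROUNDED_TEMPLATE = (
    "Answer using ONLY the catalog excerpts below.\n\n{chunks}\n\nUser: {query}"
)
FALLBACK_TEMPLATE = (
    "No catalog data is available for this request. Answer from general "
    "knowledge, state clearly that details may be out of date, and offer to "
    "retry or link to the browse page.\n\nUser: {query}"
)

def build_prompt(query: str, chunks: list[str]) -> tuple[str, str]:
    """Return (prompt, retrieval_mode); empty retrieval routes to the fallback."""
    if not chunks:
        logger.info("retrieval_mode=fallback")
        return FALLBACK_TEMPLATE.format(query=query), "fallback"
    return (GROUNDED_TEMPLATE.format(chunks="\n---\n".join(chunks), query=query),
            "grounded")
```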
Grill 1: "Why not just over-provision OCUs?"
Q: The team proposes doubling OCU capacity so reads and writes never compete. Is that the right call?
A: It addresses the symptom but not the design problem. Over-provisioning buys headroom until the next scale event — a larger catalog, a bigger ingest job, a flash sale. You've moved the failure threshold, not eliminated the architecture flaw. More importantly, it conflates two different reliability requirements: online retrieval needs low-latency, high-availability OCU; batch ingest tolerates higher latency and is failure-tolerant (it can retry). Putting them in the same capacity envelope means a batch failure mode can always degrade online SLA. The correct fix is structural isolation. OCU over-provisioning may be a valid short-term mitigation while the alias-swap architecture is built, but it should not be the permanent design.
Grill 2: What if the golden-set validation passes but production quality still degrades?
Q: Your pre-promotion validation shows recall ≥ baseline on the golden set, but after alias swap, production users still see worse results. What went wrong?
A: This is a distribution mismatch between the golden set and production traffic — the classic PoC/evaluation gap. The golden set only covered known-good query patterns. Production may have new queries, seasonal phrasing, or language variants not represented. Resolution: (1) expand the golden set using recent production query logs — sample the last 7 days, cluster by intent type, add tail queries not in the golden set; (2) run A/B shadow scoring: send the same queries to both old and new index simultaneously, compare top-k recall scores per query, flag any cohort where the new index underperforms by >10%; (3) implement a gradual alias promotion — route 5% of traffic to the new index alias while keeping 95% on the old one, monitor RAGGroundingRate per cohort, promote only when quality convergence is confirmed.
Grill 3: The ingest job stalls mid-run
Q: The ingest job fails at 60% completion and the shadow index is partially populated. What happens to users?
A: Nothing — if the architecture is correct, users are unaffected. The alias still points to the fully populated live index. The shadow index is discarded (or resumed from a checkpoint if the ingest pipeline supports incremental completion). This is the exact reason the blue/green pattern exists: partial failures never contaminate the serving path. The validation gate — document-count check before alias swap — ensures a 60%-populated shadow index can never receive the alias. Log the failure with the ingest job's checkpoint offset and retry from that position. Monitor IngestJobCompletionRate and alert if any job completes at < 95% of expected document count.
Red Flags — Weak Answer Indicators
- Describing this as a pure scaling problem ("add more OCUs") without addressing workload separation
- Not recognizing that HTTP 200 responses can represent quality failures in RAG systems
- No mention of alias-swap or index isolation strategy for the write path
- Missing the RAG grounding rate metric — only monitoring infrastructure (HTTP errors, OCU %)
- No fallback mode when retrieval returns 0 results
Strong Answer Indicators
- Immediately labels the failure as quality degradation hidden behind HTTP 200
- Proposes workload isolation (separate ingest vs. serving path) as the structural fix
- Explains the blue/green alias-swap with explicit validation gates
- Designs a near-real-time alternative (DynamoDB Streams + incremental updates) when batch doesn't meet latency requirements
- Creates a RAGGroundingRate metric to detect quality degradation independently of infrastructure health
Scenario 4: Cold-Start Latency Cascade
Opening Question
Q: Your MangaAssist chatbot hits a P99 latency of 12 seconds during the first 2 minutes of a flash sale burst — users are abandoning. Traces show: API Gateway + Lambda authorizer + Step Functions + 3 Lambda functions + ECS Fargate task all start simultaneously. What's the architecture problem and what's your remediation plan?
Model Answer
This is a cold-start cascade: every component on the synchronous user-facing path has its own initialization cost, and under a burst those initializations all land at once instead of being staggered. API Gateway authorizer Lambda (500ms cold start) → Step Functions start execution (200ms) → Lambda 1 orchestration step (500ms) → Lambda 2 retrieval (500ms) → Lambda 3 formatting (300ms) → ECS Fargate task start (8–12 seconds for the first container) = 10–14 seconds before the first byte of answer. The pattern mistake is chaining too many cold-start-prone services on the synchronous hot path. The remediation is twofold: (1) Flatten the synchronous chain — move non-critical post-processing steps (formatting, analytics tagging) to asynchronous side channels and keep only the critical path synchronous. (2) Keep cold-start-prone components warm — ECS Fargate with a minimum desiredCount > 0 (a warm task floor), and the Lambda authorizer on provisioned concurrency during known burst windows (flash sales). The goal: every component on the critical path is already warm when the first sale-hour request arrives.
Follow-up 1: ECS vs. Lambda for the orchestrator
Q: Why is ECS Fargate on the hot path worse than Lambda in terms of cold-start management?
A: Lambda provisioned concurrency is a well-understood, per-function knob — you pay for idle execution environments but eliminate cold starts entirely. ECS Fargate cold starts are structurally longer: pulling the container image (30–120 seconds for large images), starting the runtime, and waiting for health checks. Even with a small, optimized image, a Fargate task launch rarely completes in under 8–15 seconds. For latency-sensitive hot paths, Lambda with provisioned concurrency is the better choice for orchestration steps that need a sub-2-second budget. ECS Fargate is appropriate for long-running, stateful, or container-native workloads — not for handling the first token of a streaming chat session. The recommendation: move orchestration to Lambda with provisioned concurrency, keep ECS for background workers (ingest, batch embedding, async work).
Follow-up 2: Warming strategy for flash sales
Q: Flash sales are scheduled in advance. How do you automate warm capacity preparation?
A: Use an EventBridge scheduled rule to trigger a pre-warm Lambda 15 minutes before each flash sale start time. The pre-warm Lambda: (1) increases Lambda authorizer provisioned concurrency to the expected concurrency level (derived from previous flash sale traffic × 1.2 safety margin); (2) issues an ECS update_service call to increase desired_count on the hot-path Fargate service; (3) sends a CloudWatch metric WarmCapacityReady=1 when both signals are confirmed. At the end of the sale window, another EventBridge rule scales back down. This pre-warming cost is known and bounded — typically < 5% of the variable cost savings from the flash sale revenue. The alternative (reactive auto-scaling) always lags burst traffic by 2–5 minutes.
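A sketch of that pre-warm handler, with hypothetical function, alias, cluster, and service names; the scale-down handler at the end of the sale window would mirror it:

```python
import boto3

lambda_client = boto3.client("lambda")
ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    """Invoked by an EventBridge schedule ~15 minutes before the flash sale."""
    expected = int(event.get("expected_concurrency", 100) * 1.2)  # safety margin

    # 1. Warm the authorizer: provisioned concurrency on an alias/version.
    lambda_client.put_provisioned_concurrency_config(
        FunctionName="manga-assist-authorizer",
        Qualifier="live",  # must be an alias or published version
        ProvisionedConcurrentExecutions=expected,
    )

    # 2. Raise the Fargate task floor on the hot-path service.
    ecs.update_service(
        cluster="manga-assist",
        service="chat-orchestrator",
        desiredCount=int(event.get("desired_tasks", 20)),
    )

    # 3. Signal readiness so dashboards and the on-call can confirm the warm-up.
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Ops",
        MetricData=[{"MetricName": "WarmCapacityReady", "Value": 1, "Unit": "Count"}],
    )
```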
Follow-up 3: Identifying which hops are on the critical path
Q: How do you distinguish synchronous critical-path hops from asynchronous background work in your trace data?
A: Instrument with AWS X-Ray and add a path_type annotation (values: critical_sync, async_background) to each subsegment. In the trace waterfall: any span that must complete before the first token reaches the user is critical_sync. Spans that run after the user has received a response, or in parallel without blocking the response, are async_background. Specifically, look at the critical path duration in X-Ray Service Map — the longest chain from API entry to Bedrock response. Any Lambda or ECS span on that chain is eligible for provisioned concurrency analysis. Background spans (analytics, audit logging, cache warm-up) should be moved to EventBridge Pipes or an SQS queue and processed asynchronously. The waterfall makes invisible serialization visible.
Follow-up 4: What is an acceptable p99 budget per hop?
Q: You have a 3-second end-to-end SLA. How do you allocate the budget across components?
A: Use a latency budget breakdown. Example allocation for a 3-second target: API Gateway + auth: 100ms; intent classification (Haiku): 200ms; RAG retrieval (OpenSearch): 300ms; prompt assembly: 50ms; Bedrock first-token (Sonnet): 1,500ms; response formatting + emit first chunk: 150ms; network overhead: 200ms. Total: ~2,500ms, leaving 500ms buffer for p99 variance. Each component owns its budget. Exceeding budget triggers an alarm, not a silent degradation. For components with high variance (Bedrock first-token P99 can be 2–3× P50), reserve additional buffer or add a latency SLA to the Bedrock model assessment criteria. The budget is a living contract — reviewed after each major architectural change.
Grill 1: "Step Functions adds resilience"
Q: The team says they need Step Functions on the hot path for retry logic and error handling. How do you respond?
A: Step Functions is excellent for retry orchestration and state machines — but even the synchronous Express Workflow variant adds 50–200ms of invocation overhead per step, and Standard Workflows add noticeably more per-step latency (and are billed per state transition). For a sub-3-second chat experience, this overhead is material. The alternative: implement retry and error handling directly in the Lambda orchestrator with exponential backoff (botocore's built-in retry config) for Bedrock calls. Step Functions is appropriate for long-running background workflows (ingest pipelines, fine-tuning orchestration, batch evaluation) where latency tolerance is seconds to minutes. Moving it off the synchronous user path reduces cold-start exposure AND latency overhead. If the retry complexity is truly high (multi-step compensating transactions), use Step Functions Express Workflows with provisioned concurrency on the trigger Lambda.
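A sketch of the in-orchestrator alternative: botocore's retry configuration on the Bedrock runtime client, so throttling and transient errors back off without an extra service hop:

```python
import boto3
from botocore.config import Config

# Adaptive mode combines exponential backoff with client-side rate limiting,
# which is usually enough retry logic for a single Bedrock call on the hot path.
bedrock = boto3.client(
    "bedrock-runtime",
    config=Config(
        retries={"max_attempts": 3, "mode": "adaptive"},
        connect_timeout=2,
        read_timeout=10,
    ),
)

# Every invoke_model / invoke_model_with_response_stream call through this
# client now retries throttling and transient failures automatically.
```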
Grill 2: Auto-scaling "will handle it"
Q: The ops team says ECS Auto Scaling is configured and will handle the burst. Why is that insufficient?
A: ECS Auto Scaling reacts to a metric breach — it takes a minimum of 3–5 minutes from "metric breach detected" to "new Fargate task is healthy." The flash sale burst peaks in the first 30–120 seconds (highest conversion intent). Auto Scaling arrives 5 minutes late. During those 5 minutes, the existing task count is overwhelmed: queuing increases, latency spikes, users abandon. Auto Scaling is the right tool for sustained load growth, not for instantaneous burst spikes. For burst events, the answer is pre-provisioned capacity, not reactive scaling. Auto Scaling handles the long tail (hour 2 onward); pre-provisioning handles the spike. Design for both: pre-warm headroom + auto-scale for sustained growth.
Grill 3: What if the business can't afford always-warm capacity?
Q: Pre-warming Lambda and ECS runs 24/7 even when there's no traffic. Leadership asks: is the cost justified?
A: Quantify the cost both ways. Provisioned concurrency for a small Lambda authorizer at 100 concurrent executions costs roughly $50–150/month. A failed flash sale (1M messages in 2 hours, 40% abandonment due to 12-second latency) costs potentially $200K–$500K in lost GMV. The ROI is not close. Beyond flash sales, if always-warm capacity is a cost concern, restrict it to peak windows: use scheduled scaling to warm up 30 minutes before the known peak window (business hours: 11 AM–9 PM JST) and scale down overnight. That keeps capacity warm for the hours that carry the traffic at well under half the always-on cost. Present the analysis to leadership as a cost-of-reliability decision, not a pure engineering preference.
Red Flags — Weak Answer Indicators
- Proposing only auto-scaling without addressing the pre-warm gap for burst events
- Missing the ECS cold start duration (~8–15s for Fargate) vs. Lambda cold start (~100–500ms)
- Not identifying which hops are synchronous vs. asynchronous in the critical path
- No latency budget allocation — treating the 3-second SLA as a single number without per-component breakdown
- Treating Step Functions as always-appropriate without acknowledging synchronous hot-path overhead
Strong Answer Indicators
- Calculates the cascade: sums per-hop cold start times and arrives at 10–14 seconds total
- Distinguishes ECS Fargate cold start characteristics from Lambda
- Proposes EventBridge-triggered pre-warming for planned burst events
- Provides a latency budget breakdown across components (total 2,500ms with buffer)
- Justifies provisioned-concurrency cost against revenue risk from abandonment
Scenario 5: Context Window Silent Truncation
Opening Question
Q: MangaAssist's multi-turn chatbot works well for the first 8–10 turns, but in sessions exceeding 20 turns, users report that the chatbot "forgets" earlier preferences and starts contradicting itself. There are no errors in logs. What is the architectural failure and how do you design a proper fix?
Model Answer
This is silent context truncation. The application is appending all conversation history and all retrieved chunks to every Bedrock invocation without a token budget. In a 20-turn session with an average of 200 tokens per turn, plus the system prompt (1,500 tokens) and RAG context (2,000 tokens), the input easily reaches 8,000–12,000 tokens. Instead of surfacing an error, the oldest context is silently dropped before the model sees it, whether by a history-trimming middleware or by whatever truncation the application applies, so the user's early preferences, stated constraints, and prior context are removed without any signal in the response or logs. The model answers confidently from truncated context and contradicts what the user said 15 turns ago. The architectural fix requires treating context as a budgeted resource with explicit allocation: reserve fixed token slots for (1) system instructions, (2) RAG context, (3) current user request, (4) recent turns, and (5) summarized older history. Implement token-aware history assembly: estimate token count before each FM call, apply a compression strategy (summarize vs. drop), log every compression event, and emit metrics on context budget utilization.
Follow-up 1: Token estimation before assembly
Q: How do you count tokens accurately before sending to Bedrock if the tokenizer differs per model?
A: Use the model's native tokenizer where available. For Claude models on Bedrock, Anthropic's anthropic Python package provides count_tokens() which gives the exact count the model will see. For a production pipeline: initialize a tokenizer client as a module-level singleton (not per-request) to avoid initialization overhead; estimate turn-by-turn as history is assembled; stop appending when the running total exceeds the budget allocation. A practical alternative for speed: use a character-count proxy (Claude averages ~3.5 characters per token) as a fast pre-check before the exact count. The key is that token estimation happens before each FM call, not as a post-hoc audit. Budget overruns are caught and handled, not silently passed to the model.
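A sketch of the character-proxy pre-check and budget-bounded history selection; the budget figure is illustrative, and an exact tokenizer count would replace estimate_tokens() where precision matters:

```python
CHARS_PER_TOKEN = 3.5          # rough proxy for English text on Claude-class models
HISTORY_BUDGET_TOKENS = 3000   # slot reserved for recent verbatim turns only

def estimate_tokens(text: str) -> int:
    """Fast, conservative token estimate used before the exact count."""
    return int(len(text) / CHARS_PER_TOKEN) + 1

def assemble_recent_history(turns: list[str]) -> list[str]:
    """Walk backwards from the newest turn; stop before the budget overflows."""
    selected, used = [], 0
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if used + cost > HISTORY_BUDGET_TOKENS:
            break  # older turns go to the summarizer instead of being dropped silently
        selected.append(turn)
        used += cost
    return list(reversed(selected))
```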
Follow-up 2: Summarization vs. truncation strategy
Q: Should you silently drop the oldest turns or summarize them? When do you choose each?
A: Summarization is almost always preferred over silent dropping — it preserves semantic signal while reducing token cost. The decision tree: (1) when older turns contain user preferences, constraints, or names ("I don't want anything with more than 4 volumes" — critical to preserve), summarize: "User has expressed a preference for short series under 4 volumes." (2) When older turns are purely greeting/navigational ("Hi", "Yes", "Show me more"), dropping is acceptable — zero semantic value lost. (3) When the session is very long (>50 turns), two-tier compression: summarize turns 1–40 into a 200-token session summary ("User is browsing shonen manga under ¥1,500 for a 10-year-old, dislikes horror themes"), keep turns 41–49 verbatim, keep turn 50 (current) verbatim. The session summary is stored in DynamoDB alongside the full transcript so it's always available for re-assembly.
Follow-up 3: Observability for context budget
Q: What metrics and alarms do you add to make context truncation visible?
A: Three metrics: (1) ContextBudgetUtilization (0.0–1.0) — what fraction of the input token budget was used this request. P95 > 0.95 is a warning alarm (nearly full budget, risk of truncation on the next message). P95 = 1.0 is a critical alarm (budget hit, compression fired). (2) TurnsDroppedCount per request — number of turns removed or summarized. Any non-zero count should be logged with session_id for post-hoc analysis. (3) CompressionEventRate — percentage of requests that triggered compression. If this exceeds 20% of sessions, the context budget allocation needs re-tuning (either increase budget, trim system prompt, or reduce RAG chunk count). Surface these in a CloudWatch dashboard alongside CSAT metrics — quality drops in long sessions correlate with compression rate.
Follow-up 4: Separate storage from context assembly
Q: Should conversation history storage in DynamoDB be the same structure as what gets sent to Bedrock?
A: No — these are two separate concerns with different requirements. DynamoDB is the durable transcript store: full verbatim history, including dropped or compressed turns, with timestamps, user IDs, session metadata, and safety annotations. It's write-heavy, append-only, and privacy-governed (subject to GDPR/retention policies). The context assembly layer reads from DynamoDB and selects a ranked subset: recent verbatim turns + summarized older session, within the token budget. This means the application layer has a context assembly function that is independently testable, tunable, and versioned. Never send the full DynamoDB history directly to Bedrock — that entangles storage schema with model API contract and makes both harder to evolve.
Grill 1: "Just use a bigger context window model"
Q: A teammate suggests switching to Claude 3 Opus with a 200K token context window so we never truncate. What's your counterargument?
A: Four problems. (1) Cost: at 200K tokens per request, Claude 3 Opus at $15/$75 per 1M tokens costs ~$3 per request for input alone. At 1M messages/day that's $3M/day — economically catastrophic. (2) Latency: larger context windows increase time-to-first-token significantly. A 200K-token prompt takes 8–15 seconds to process before the first word streams. (3) Quality: counter-intuitively, very long contexts degrade generation quality — the "lost in the middle" phenomenon, where information in the middle of the context is systematically underweighted by the model. A well-designed 8K-token context with good compression often outperforms a naïve 200K dump. (4) Memory management is still necessary: even with a 200K window, long-lived sessions eventually exceed it. You're buying time, not eliminating the architectural problem.
Grill 2: A user invokes a preference from turn 3 in turn 25
Q: Your compression trimmed turn 3 ("I hate horror manga"). In turn 25, the user asks "no horror please, like we discussed." The model has no memory of turn 3. How do you handle this?
A: Two layers of defense. First, entity extraction at ingest time: when processing each turn, run a lightweight extraction pass (Haiku-class model with a structured prompt) to pull out stated user preferences, constraints, and named entities and store them as a structured preference object in DynamoDB alongside the transcript. This preference object is always included in the assembled context regardless of compression decisions. Second, preference preservation in compression: the summarizer is instructed to always surface user preferences in the session summary: "User stated they do not want horror themes (turn 3)." The preference is preserved even when the raw turn is dropped. The context assembly layer then has three inputs: session summary + recent verbatim turns + user preference struct — the preference struct is a fixed-size (200-token) reserved allocation that is never compressed away.
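A sketch of the assembly with a reserved preference slot, assuming the preference struct has already been extracted and stored; slot size and block labels are illustrative:

```python
PREFERENCE_SLOT_TOKENS = 200  # fixed allocation, excluded from every compression pass

def assemble_context(system_prompt: str, rag_chunks: list[str],
                     session_summary: str, recent_turns: list[str],
                     preference_struct: list[str]) -> str:
    """Compose the prompt so preferences survive regardless of history trimming."""
    preference_block = "Known user preferences (always honor these):\n" + "\n".join(
        f"- {p}" for p in preference_struct  # e.g. "No horror themes (stated turn 3)"
    )
    return "\n\n".join([
        system_prompt,
        preference_block,  # reserved slot: never summarized or dropped
        "Session summary:\n" + session_summary,
        "Catalog context:\n" + "\n".join(rag_chunks),
        "Recent conversation:\n" + "\n".join(recent_turns),
    ])
```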
Grill 3: Privacy implications of long-lived session memory
Q: You're storing session summaries and preferences in DynamoDB. A user asks you to delete their conversation history. What are the GDPR/compliance implications of your memory architecture?
A: Session data is personal data under GDPR if it's linked to a user identity. The right-to-erasure obligation applies. Implementation: (1) all session records, preference objects, and summaries in DynamoDB are keyed by user_id partition key — a single DeleteItem scan or GSI query erases all records for that user. (2) summaries must be erased along with raw turns — a lingering summary is still personal data even if the raw turns are deleted. (3) if session summaries are cached in Redis for active sessions, the cache must be invalidated on deletion. (4) retain a deletion audit log (separate from personal data) recording that erasure occurred, the timestamp, and the data categories deleted — without retaining the content itself. Design the TTL separately: sessions expire after 90 days of inactivity automatically via DynamoDB TTL, reducing the long-term erasure surface.
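A sketch of the erasure path, assuming all session items (turns, summaries, preference objects) share the user_id partition key and an illustrative sk sort key:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("manga-assist-sessions")

def erase_user(user_id: str) -> int:
    """Delete every item under the user's partition key; returns the count erased."""
    deleted = 0
    kwargs = {"KeyConditionExpression": Key("user_id").eq(user_id)}
    while True:
        page = table.query(**kwargs)
        with table.batch_writer() as batch:
            for item in page["Items"]:
                batch.delete_item(Key={"user_id": user_id, "sk": item["sk"]})
                deleted += 1
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    # Also invalidate any Redis-cached summaries and write the deletion audit
    # record (timestamp + data categories, never the content) at this point.
    return deleted
```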
Red Flags — Weak Answer Indicators
- Suggesting "use a bigger context window" as the primary solution without addressing cost and latency
- Not distinguishing durable conversation storage from context assembly
- Implementing silent dropping without logging dropped turns or emitting compression metrics
- Missing the user-preference preservation problem under compression
- No mention of token estimation before FM call — assuming context truncation won't happen
Strong Answer Indicators
- Defines explicit token budget allocation across 5 categories (system, RAG, current turn, recent history, summary)
- Separates DynamoDB transcript storage from context assembly as independent concerns
- Proposes entity extraction for preference preservation orthogonal to compression
- Addresses GDPR/right-to-erasure for multi-tier session storage
- Defines ContextBudgetUtilization, TurnsDroppedCount, and CompressionEventRate as observable metrics