Interview Q&A — Dynamic Model Selection Architecture
Skill 1.2.2 | Task 1.2 — Select and Configure FMs | Domain 1
Scenario 1: Model ID Hard-Coded in Lambda — No Runtime Switching
Opening Question
Q: During a Bedrock pricing spike, the on-call engineer needs to switch MangaAssist from Claude 3 Sonnet to Haiku across all ECS services within 5 minutes to prevent a cost overrun. The model ID is hard-coded as a constant in four Lambda functions. What is the operational impact, what is the immediate workaround, and how do you redesign to prevent this permanently?
Model Answer
The immediate operational impact is that the engineer cannot switch models in under 5 minutes. A hard-coded model ID requires: code change → PR → CI → Lambda deployment — a 15–30 minute process at best, with deployment risk on top. The workaround in the moment: push an SSM Parameter Store update to a parameter the Lambda already reads (if one happens to exist), or set a Lambda environment variable override. But if nothing is externalised, the engineer is stuck waiting for a deployment. The permanent redesign: replace hard-coded model IDs with AWS AppConfig. The Lambda uses the appconfigdata SDK with a session-token pattern: on first invocation it calls start_configuration_session(), then polls get_latest_configuration() on each Lambda invocation but only when POLL_INTERVAL_SECONDS (default 60) has elapsed. The AppConfig document holds the model mapping keyed by use case and environment. Add a cost_override_active boolean flag: when true, all call sites automatically drop to Haiku regardless of the normal per-use-case routing. A cost-spike response goes from a 20-minute deployment to a 30-second AppConfig console toggle.
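A minimal sketch of the document shape and the flag check, assuming hypothetical field names (cost_override_active, cost_override_model_id, routing) and illustrative model IDs — not a confirmed schema:

```python
# Hypothetical shape of the AppConfig model-routing document after json.loads().
MODEL_ROUTING_DOC = {
    "cost_override_active": False,  # flip to True to force the cheap model everywhere
    "cost_override_model_id": "anthropic.claude-3-haiku-20240307-v1:0",
    "routing": {
        "prod": {
            "intent_classify": "anthropic.claude-3-haiku-20240307-v1:0",
            "product_recommendation": "anthropic.claude-3-sonnet-20240229-v1:0",
            "complex_answer": "anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
}

def select_model(doc: dict, env: str, use_case: str) -> str:
    """Resolve the model ID, honoring the emergency cost-override flag."""
    if doc.get("cost_override_active"):
        return doc["cost_override_model_id"]
    return doc["routing"][env][use_case]
```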
Follow-up 1: AppConfig session-token pattern in detail
Q: Walk me through the AppConfig appconfigdata session-token polling pattern in Lambda.
A: The appconfigdata pattern avoids the deprecated GetConfiguration API and uses tokenized sessions: (1) On module load (Lambda cold start), call appconfig.start_configuration_session(ApplicationIdentifier, EnvironmentIdentifier, ConfigurationProfileIdentifier, RequiredMinimumPollIntervalInSeconds=15). Store the returned InitialConfigurationToken. (2) Store _config_session_token, _config_cache, and _last_poll_time as module-level globals (persisted across Lambda invocations within the same execution environment). (3) On each Lambda invocation, check if time.time() - _last_poll_time >= POLL_INTERVAL_SECONDS. If yes, call appconfig.get_latest_configuration(ConfigurationToken=_config_session_token). The response contains either a new token + new config (config changed) or just a new token + empty body (no change). Always update _config_session_token with the returned token. (4) If config body is non-empty, parse and replace _config_cache. Return _config_cache. This pattern minimizes AppConfig API calls — only once per poll interval, not once per request — and is compatible with Lambda's execution model.
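A hedged Python sketch of the full pattern, assuming illustrative application/environment/profile identifiers:

```python
import json
import time

import boto3

APPCONFIG_APP = "mangaassist"        # hypothetical identifiers
APPCONFIG_ENV = "prod"
APPCONFIG_PROFILE = "model-routing"
POLL_INTERVAL_SECONDS = 60

_appconfig = boto3.client("appconfigdata")

# Module-level globals persist across warm invocations in the same
# Lambda execution environment.
_config_session_token = None
_config_cache = {}
_last_poll_time = 0.0

def get_model_config() -> dict:
    global _config_session_token, _config_cache, _last_poll_time

    if _config_session_token is None:
        # Cold start: open a tokenized session (replaces deprecated GetConfiguration).
        session = _appconfig.start_configuration_session(
            ApplicationIdentifier=APPCONFIG_APP,
            EnvironmentIdentifier=APPCONFIG_ENV,
            ConfigurationProfileIdentifier=APPCONFIG_PROFILE,
            RequiredMinimumPollIntervalInSeconds=15,
        )
        _config_session_token = session["InitialConfigurationToken"]

    if time.time() - _last_poll_time >= POLL_INTERVAL_SECONDS:
        resp = _appconfig.get_latest_configuration(
            ConfigurationToken=_config_session_token
        )
        # Always carry the token forward, even when the body is empty.
        _config_session_token = resp["NextPollConfigurationToken"]
        body = resp["Configuration"].read()
        if body:  # an empty body means "config unchanged"
            _config_cache = json.loads(body)
        _last_poll_time = time.time()

    return _config_cache
```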
Follow-up 2: Pre-commit hook to block hard-coded model IDs
Q: How do you enforce that no future developer introduces a hard-coded model ID into the codebase?
A: A custom pre-commit hook using regex scanning: the hook scans staged Python/JS files for the pattern r'anthropic\.claude[-.]3[-.](?:sonnet|haiku|opus)' and any variant that resembles a Bedrock model ID string literal (ARN-like patterns with hardcoded version strings). If the pattern is found outside the designated constants/model_ids.py module (which itself is immutable except by the platform team), the commit is blocked with a message: [BLOCKED] Hard-coded Bedrock model ID detected in {file}:{line}. Use TieredModelRouter.get_model_id(task_type) instead. Add the hook to .pre-commit-config.yaml and enforce it in CI with pre-commit run --all-files as a required GitHub Actions check. The hook takes < 1 second and catches violations before review. Test the hook itself: add a test file with a hard-coded model ID and assert the hook blocks it.
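A sketch of the hook script, assuming the TieredModelRouter guidance and the constants/model_ids.py allow-list from above; the exact regex would be extended as new model families appear:

```python
#!/usr/bin/env python3
"""Pre-commit hook: block hard-coded Bedrock model IDs outside the allow-list."""
import re
import subprocess
import sys

# Illustrative pattern; extend for other model families as needed.
MODEL_ID_RE = re.compile(
    r"(anthropic\.claude[-.]3[-.](?:sonnet|haiku|opus)|amazon\.titan-text)"
)
ALLOWED = {"constants/model_ids.py"}  # immutable except by the platform team

def staged_files():
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f.endswith((".py", ".js", ".ts"))]

def main():
    blocked = False
    for path in staged_files():
        if path in ALLOWED:
            continue
        with open(path, encoding="utf-8") as fh:
            for lineno, line in enumerate(fh, start=1):
                if MODEL_ID_RE.search(line):
                    print(
                        f"[BLOCKED] Hard-coded Bedrock model ID detected in "
                        f"{path}:{lineno}. Use TieredModelRouter.get_model_id(task_type) instead."
                    )
                    blocked = True
    sys.exit(1 if blocked else 0)

if __name__ == "__main__":
    main()
```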
Follow-up 3: Testing the cost-override flag without a production incident
Q: How do you validate that the cost_override_active flag actually switches all services to Haiku before you need it in a real incident?
A: Run a quarterly cost-override drill: (1) Create a "cost override" AppConfig deployment in the staging environment — set cost_override_active: true. (2) Deploy it to staging and run the integration test suite. Assert that model_id in all Bedrock invocation logs is amazon.titan-text-lite-v1 or anthropic.claude-3-haiku-* (no Sonnet calls). (3) Time the drill from "engineer sets AppConfig flag" to "all services are using Haiku" — target < 90 seconds (poll interval + AppConfig propagation). (4) Document the drill result in the runbook with pass/fail. Also run a shadow cost estimation — estimated_cost_with_haiku_only / estimated_cost_normal_routing — to quantify the cost reduction the flag provides. If the flag doesn't achieve at least a 70% cost reduction, the routing taxonomy needs review — too many non-switchable Sonnet calls.
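A sketch of the drill's core assertion, assuming a hypothetical staged_invocation_model_ids fixture that collects the model_id of every Bedrock invocation logged during the staging run:

```python
ALLOWED_OVERRIDE_MODELS = ("anthropic.claude-3-haiku", "amazon.titan-text-lite-v1")

def test_cost_override_forces_cheap_models(staged_invocation_model_ids):
    # staged_invocation_model_ids: hypothetical fixture listing every model_id
    # observed in Bedrock invocation logs during the staging drill run.
    offenders = [
        m for m in staged_invocation_model_ids
        if not m.startswith(ALLOWED_OVERRIDE_MODELS)
    ]
    assert not offenders, f"Non-Haiku calls leaked during override drill: {offenders}"
```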
Follow-up 4: Rollback if the cost override causes quality degradation
Q: The cost override is active and CSAT drops sharply — some use cases clearly need Sonnet. How do you handle a partial rollback?
A: The AppConfig document supports per-use-case model overrides, not just a global flag. Instead of a binary cost_override_active, the schema supports: { "cost_override_use_cases": ["intent_classify", "moderation"] } — only these use cases switch to Haiku; premium use cases (product_recommendation, complex_answer) stay on Sonnet. Partial rollback procedure: (1) Change cost_override_active: true → add specific use cases to cost_override_use_cases. (2) Monitor CSAT per use-case by logging use_case as a tag alongside CSAT feedback. (3) Add use cases to the override list one at a time, monitoring CSAT after each addition. This way you maximise cost reduction (Haiku on all safe use cases) while keeping Sonnet on the use cases that demonstrably need it. The AppConfig change propagates in < 60 seconds with no deployment required.
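A sketch of the per-use-case resolution, reusing the hypothetical document fields from the Scenario 1 sketch:

```python
def select_model_partial(doc: dict, env: str, use_case: str) -> str:
    """Per-use-case override: only listed use cases drop to the cheap model."""
    if use_case in doc.get("cost_override_use_cases", []):
        return doc["cost_override_model_id"]
    return doc["routing"][env][use_case]
```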
Grill 1: "SSM Parameter Store is simpler — why AppConfig?"
Q: A senior engineer argues: "SSM Parameter Store is simpler and faster to implement than AppConfig. We should just read model IDs from SSM." How do you respond?
A: SSM Parameter Store is fine for a single value consumed by a single service. AppConfig is the right choice for a structured routing configuration consumed by multiple services because it provides: (1) Versioning: AppConfig maintains a full history of configuration deployments — you can roll back a bad config change with one click. SSM has no native deployment history. (2) Deployment strategy: AppConfig supports canary + linear deployment of config changes — you can validate a new model routing config on 10% of Lambda invocations before rolling it to 100%. SSM has no equivalent. (3) Polling with caching: the appconfigdata session-token pattern explicitly supports interval-based polling and caching. SSM GetParameter on every Lambda invocation is an API call that accumulates cost and throttling at high request rates. (4) Validation: AppConfig supports a JSON schema validator on the configuration profile — a badly structured config update is rejected before deployment. SSM has no such gate. The right answer is: use SSM for simple key-value parameters (API keys, feature flags), use AppConfig for structured, multi-field configuration that benefits from versioned deployment.
Grill 2: AppConfig adds a network call — does it hurt latency?
Q: Every Lambda invocation could poll AppConfig. What is the latency impact?
A: The module-level caching strategy ensures the AppConfig network call happens at most once every 60 seconds per Lambda execution environment, not once per invocation. In steady state, _last_poll_time is recent enough that every invocation hits the in-memory cache (nanosecond read). The only invocation with AppConfig latency is the first one after the 60-second poll interval elapses, and even then get_latest_configuration() returns in ~5ms when the config is unchanged (empty-body response). The cold-start invocation (first invocation in a new execution environment) calls start_configuration_session(), which takes ~20ms. This adds 20ms to cold-start latency — negligible given that Lambda cold start is already 200–500ms. X-Ray annotations on cache hits are the right monitoring: if appconfig_cache_miss=true is annotated more than once per 60 seconds per execution environment, there's a bug in the caching logic. In practice, AppConfig adds zero measurable p99 latency for standard invocation patterns.
Red Flags — Weak Answer Indicators
- Proposing SSM Parameter Store without acknowledging AppConfig's deployment strategy and versioning advantages
- No pre-commit hook or static analysis for hard-coded model ID enforcement
- Missing the appconfigdata session-token pattern — falling back to the deprecated GetConfiguration API or per-request API calls
- No cost-override validation drill — treating the operational requirement as a design-only concern
Strong Answer Indicators
- Correctly describes the AppConfig appconfigdata session-token caching pattern with module-level globals
- Designs the pre-commit hook with a specific regex targeting Bedrock model ID string literals
- Proposes a quarterly cost-override drill with a < 90-second switch-time assertion
- Handles partial rollback via per-use-case cost_override_use_cases rather than a binary global flag
Scenario 2: AppConfig Not Cached — 200ms Per Request Latency Spike
Opening Question
Q: After migrating from hard-coded model IDs to AppConfig, the team observes p99 latency increasing from 1.8s to 2.4s on the recommendation API. X-Ray traces show a 200ms segment labeled GetConfiguration on every request. During high-traffic periods this spike grows to 800ms and AppConfig starts throwing ThrottlingException. What happened and how do you fix it?
Model Answer
The team implemented AppConfig reads without the session-token caching pattern — they used the deprecated GetConfiguration API or called get_latest_configuration() on every Lambda invocation without checking POLL_INTERVAL_SECONDS. This turns a one-per-60-seconds operation into a one-per-request operation. At 1,000 requests/second, that is 1,000 AppConfig API calls/second — well above the default GetLatestConfiguration throttle limit of ~100 calls/second per session. The fix: implement the module-level caching pattern correctly. Add three module-level globals in the Lambda handler: _config_session_token = None, _config_cache = {}, _last_poll_time = 0.0. In the get_model_config() function, compare time.time() - _last_poll_time >= POLL_INTERVAL_SECONDS before making any AppConfig call. Cache the config in _config_cache and update _last_poll_time when a new token is received. Every invocation after the first one within the poll interval reads from the in-memory dict — zero network calls, zero latency, zero throttling.
Follow-up 1: How module-level globals survive across Lambda invocations
Q: You're depending on module-level globals persisting between invocations. Is this reliable in Lambda?
A: Lambda execution environments are reused ("warm" invocations) as long as the function continues to receive traffic. Module-level globals are initialized once at cold start and persist for the lifetime of the execution environment. For a high-traffic function like MangaAssist's recommendation API (1,000+ req/s), execution environments stay warm for extended periods — the globals persist across hundreds or thousands of invocations. The tradeoff is per-execution-environment stale config: when a config change is deployed to AppConfig, each execution environment picks it up independently within one poll interval (60 seconds). During that window some environments have the new config and some the old one — expected and acceptable, because AppConfig config changes are non-instantaneous by design. If faster propagation is needed, reduce POLL_INTERVAL_SECONDS to 15 (the minimum). Idle environments do not serve indefinitely stale config: at very low traffic, _last_poll_time may be 5 minutes old, so the next invocation polls immediately because time.time() - _last_poll_time already exceeds the interval.
Follow-up 2: X-Ray annotation strategy for cache health monitoring
Q: How do you verify in production that the caching is working correctly?
A: Add X-Ray annotations to get_model_config(): xray.put_annotation("appconfig_cache_hit", True/False) and xray.put_annotation("appconfig_latency_ms", elapsed_ms). Create a CloudWatch X-Ray Analytics metrics query: count(appconfig_cache_hit == False) / count(all) = cache miss rate. The expected cache miss rate for a 1,000 req/s function with POLL_INTERVAL_SECONDS=60 is approximately: 1 miss per 60 seconds per execution environment. If there are 50 concurrent execution environments, that is ~50 cache misses/60 seconds ≈ 0.8 misses/second. At 1,000 req/s, the miss rate is 0.08% — correct behavior. If the miss rate is 2%+, the caching logic is broken. Set a CloudWatch alarm: if appconfig_cache_miss_rate > 1% sustained for 5 minutes, page the on-call. Use both the miss rate and the AppConfig ThrottlingException metric — if throttles appear alongside a high miss rate, the caching logic is the root cause.
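A sketch of the instrumentation, assuming the get_model_config(), _last_poll_time, and POLL_INTERVAL_SECONDS globals from the earlier polling sketch:

```python
import time

from aws_xray_sdk.core import xray_recorder

def get_model_config_traced() -> dict:
    # A poll is due (cache miss) once the interval has elapsed.
    cache_hit = (time.time() - _last_poll_time) < POLL_INTERVAL_SECONDS
    start = time.time()
    # Annotate on a subsegment: the Lambda-managed facade segment is immutable.
    with xray_recorder.in_subsegment("get_model_config") as subsegment:
        config = get_model_config()
        subsegment.put_annotation("appconfig_cache_hit", cache_hit)
        subsegment.put_annotation(
            "appconfig_latency_ms", int((time.time() - start) * 1000)
        )
    return config
```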
Follow-up 3: What happens if AppConfig is down?
Q: The Lambda is trying to poll AppConfig, but AppConfig is experiencing an outage. What does the cached pattern do?
A: The module-level cache provides resilience: if get_latest_configuration() throws any exception (including an AppConfig outage), the code catches it, emits an appconfig_fetch_failed CloudWatch metric, and returns _config_cache — the last-known-good configuration. The Lambda does not hard-fail on the AppConfig outage — it continues serving with the cached config until AppConfig recovers. Two edge cases: (1) Cold start during an AppConfig outage: _config_cache is empty and _config_session_token is None, so the start_configuration_session() call fails. Catch this exception and fall back to hard-coded defaults (the DEFAULT_CONFIG constant defined in the module). The defaults are the last-resort safe values — typically the lowest-cost, safest model for each use case. (2) Stale config during an extended outage: if AppConfig is down for 30 minutes, the Lambda serves config that is up to 30 minutes old. This is usually acceptable. Emit an appconfig_stale_config_duration_seconds metric (current time minus _last_poll_time) and alarm when it exceeds 300 seconds during normal operations.
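A sketch of the fallback wrapper, assuming the earlier polling sketch's get_model_config() and _config_cache; the metric namespace and default model IDs are illustrative:

```python
import boto3

_cloudwatch = boto3.client("cloudwatch")

# Last-resort safe values for a cold start during an AppConfig outage.
DEFAULT_CONFIG = {
    "routing": {"prod": {"intent_classify": "anthropic.claude-3-haiku-20240307-v1:0"}}
}

def get_model_config_resilient() -> dict:
    try:
        return get_model_config()
    except Exception:  # any AppConfig failure, per the fallback policy above
        _cloudwatch.put_metric_data(
            Namespace="MangaAssist/Config",  # hypothetical namespace
            MetricData=[
                {"MetricName": "appconfig_fetch_failed", "Value": 1.0, "Unit": "Count"}
            ],
        )
        # Serve last-known-good; hard-coded defaults cover the cold-start case.
        return _config_cache or DEFAULT_CONFIG
```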
Follow-up 4: Handling configuration schema changes without downtime
Q: You need to add a new field to the AppConfig document (a new use-case). How do you deploy this without Lambda returning KeyError on the missing field?
A: Three defensive practices: (1) Additive-only changes: always add new fields, never rename or delete existing ones without a migration plan. New fields are unknown to old Lambda code, so read them with an explicit config.get("new_field", default_value) rather than config["new_field"]. (2) Schema validation on write: the AppConfig configuration profile has a JSON Schema validator attached — the new field is added to the schema before the Lambda code change, and the AppConfig validator rejects missing or malformed documents at deployment time. (3) Phased rollout: use the AppConfig linear deployment strategy — deploy the new config to 10% of invocations first, confirm no KeyError alarms, then 100%. If the Lambda cold-starts during the 10% phase and receives the new config, it must handle the field gracefully. Lambda code must be backward-compatible with both old and new AppConfig documents for the duration of the deployment window.
Grill 1: "Just use ElastiCache Redis for the config — it's already in the stack"
Q: An engineer proposes replacing AppConfig with ElastiCache Redis for the model config cache. "Redis is faster, we already use it for the circuit breaker." How do you evaluate this?
A: Redis is an excellent cache, but it does not replace AppConfig — they serve different functions. AppConfig is a configuration management service with: versioned history, deployment strategies (canary/linear), rollback, schema validation, IAM-controlled access to config changes. Redis is a data cache with none of these features. Using Redis as a config store means: (1) no deployment history — you can't see what the config was last week; (2) no rollback UI — a bad config deployed to Redis stays there until someone manually overwrites it in panic; (3) no access control — any Lambda with Redis access can overwrite the routing config. The correct architecture: AppConfig is the source of truth for model routing configuration. Redis can optionally be a distributed cache layer in front of AppConfig if you need cluster-wide instant propagation (especially in ECS where AppConfig polling is per-task, not per-cluster). But the management interface remains AppConfig. The answer is "Redis can be an optimization layer; it cannot replace AppConfig's governance features."
Grill 2: The 60-second poll interval means a config rollback takes 60 seconds to propagate to all environments
Q: During an incident, the SRE pushes a config rollback in AppConfig but some Lambda execution environments are still serving the bad config 45 seconds later. Is this acceptable?
A: For most model routing config changes, a 60-second propagation window is acceptable — the previous configuration was presumably serving traffic successfully, and returning to it within 60 seconds is far better than a code deployment rollback (which is 15–30 minutes). For scenarios where even 60 seconds matters (e.g., a safety-compromising config change), reduce RequiredMinimumPollIntervalInSeconds to 15. The tradeoff: 15-second polling at 100 concurrent execution environments = 100/15 ≈ 7 AppConfig calls/second, well within service limits. The correct escalation path: for critical config changes (e.g., disabling a model that's producing harmful output), use an application-level circuit breaker that can be toggled via Redis in < 1 second in parallel with the AppConfig rollback. Combine both levers: Redis for instant emergency stop, AppConfig as the authoritative versioned config store.
Red Flags — Weak Answer Indicators
- Not identifying the missing module-level caching as the root cause
- Proposing to cache in a local variable inside the Lambda handler function (resets on every invocation)
- No X-Ray annotation strategy for validating cache hit rate
- Missing the AppConfig outage fallback — no last-known-good handling
Strong Answer Indicators
- Immediately identifies module-level globals as the correct persistence scope for Lambda execution environments
- Correctly explains warm invocation reuse and the expected cache miss rate calculation
- Designs an AppConfig outage fallback to DEFAULT_CONFIG constants for cold-start failure
- Proposes Redis as an optional distributed cache layer in ECS contexts (not a replacement for AppConfig)
Scenario 3: New Bedrock Model Registered Without Schema Validation — Bad Config Crashes All Inference Workers
Opening Question
Q: The ML team adds a new Claude 3.5 Sonnet model to the AppConfig routing document. They copy a Haiku config block and forget to add the required anthropic_version field and the correct max_tokens for Sonnet (8,192 vs. 4,096). All inference ECS tasks begin throwing ValidationException and the chatbot is completely down. The incident lasts 22 minutes. Explain the chain of failures and design a system that prevents this from happening again.
Model Answer
Chain of failures: (1) The AppConfig document lacks a JSON Schema validator on the configuration profile — any JSON document, regardless of correctness, can be deployed. (2) The team deploys the bad document and AppConfig accepts it. (3) Lambda execution environments poll and receive the new config — no client-side validation of the fetched config. (4) Lambda constructs Bedrock InvokeModel body missing anthropic_version → Bedrock returns ValidationException. (5) No Lambda-side fallback to the previous known-good config → all requests fail. Prevention architecture: Three-layer validation: (1) CI schema validation: a jsonschema.validate(config, MODEL_ROUTING_SCHEMA) step in the CI/CD pipeline for any PR that modifies the AppConfig document — fails the PR if any registered model block is missing required fields. (2) AppConfig profile validator: attach the same JSON schema as an AppConfig managed validator to the configuration profile — AppConfig rejects at deployment time. (3) Lambda-side defensive validation: the get_model_config() function validates the model block for the requested use case before using it. If the block is missing required fields, serve from LAST_KNOWN_GOOD_CONFIG (the previous validated config snapshot) and emit InvalidConfigRejected CloudWatch metric.
Follow-up 1: What the schema validation covers
Q: Design the JSON schema for the model routing configuration that would catch the missing anthropic_version field.
A: Required validation rules per model block: model_id (string, required, pattern ^(anthropic|amazon)\.[a-z0-9.-]+), max_tokens (integer, required, minimum 1, maximum 200000), temperature (number, optional, minimum 0, maximum 1), top_p (number, optional, minimum 0, maximum 1), and a conditional requirement: if model_id contains "anthropic" then anthropic_version is required (string, enum: ["bedrock-2023-05-31"]). The conditional requirement on anthropic_version is the exact field that was missing in this incident. The jsonschema library supports if/then/else for conditional requirements. Add a max_tokens_limits cross-field check: if model_id == sonnet, max_tokens <= 8192; if model_id == haiku, max_tokens <= 4096. This prevents misconfigured max_tokens values that cause silent truncation or API errors. The schema is stored in the repository under schemas/model_routing_config.json and loaded in both CI and the Lambda handler.
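A hedged sketch of the schema as a Python dict consumable by jsonschema (draft-07 for if/then support); the pattern and enum values mirror the rules above:

```python
import jsonschema

MODEL_BLOCK_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "model_id": {"type": "string", "pattern": r"^(anthropic|amazon)\.[a-z0-9.:-]+$"},
        "max_tokens": {"type": "integer", "minimum": 1, "maximum": 200000},
        "temperature": {"type": "number", "minimum": 0, "maximum": 1},
        "top_p": {"type": "number", "minimum": 0, "maximum": 1},
        "anthropic_version": {"type": "string", "enum": ["bedrock-2023-05-31"]},
    },
    "required": ["model_id", "max_tokens"],
    # Conditional: Anthropic models must carry anthropic_version.
    "if": {"properties": {"model_id": {"pattern": r"^anthropic\."}}},
    "then": {"required": ["anthropic_version"]},
}

# The incident's bad block fails validation: anthropic_version is missing.
bad_block = {"model_id": "anthropic.claude-3-5-sonnet-20240620-v1:0", "max_tokens": 4096}
jsonschema.validate(bad_block, MODEL_BLOCK_SCHEMA)  # raises ValidationError
```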
Follow-up 2: How the Lambda-side fallback to last-known-good works
Q: In the Lambda, after fetching a new config from AppConfig, how does the last-known-good fallback work mechanically?
A: Implementation: (1) Maintain a second module-level global _last_known_good_config = {} that is updated only when the fetched config passes schema validation. (2) In get_model_config(): fetch new config from AppConfig → run jsonschema.validate(new_config, schema). If validation passes: update both _config_cache and _last_known_good_config with the new config, update _last_poll_time. If validation fails: emit InvalidConfigRejected CloudWatch metric with the validation error message as a metric dimension; keep _config_cache pointing to the last validated config (do NOT update it with the bad config); log the validation error at ERROR level with the raw config document. The _last_known_good_config is a separate variable from _config_cache specifically to handle the case where a validation failure happens at cold start (when _config_cache is empty) — fall back to module-level DEFAULT_CONFIG constants in that case. This three-level fallback (new config → last known good → hardcoded defaults) ensures zero hard crashes from config changes.
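A sketch of the validation gate, assuming the schema from the previous follow-up and a hypothetical emit_metric() CloudWatch helper:

```python
import logging

import jsonschema

logger = logging.getLogger(__name__)

_last_known_good_config: dict = {}

def accept_new_config(new_config: dict) -> dict:
    """Gate a freshly fetched AppConfig document; never cache a bad one."""
    global _config_cache, _last_known_good_config
    try:
        jsonschema.validate(new_config, MODEL_BLOCK_SCHEMA)
    except jsonschema.ValidationError as err:
        emit_metric("InvalidConfigRejected", reason=err.message)  # hypothetical helper
        logger.error("Rejected AppConfig document: %s | raw=%r", err.message, new_config)
        # Cold-start edge: _config_cache may still be empty.
        return _config_cache or _last_known_good_config or DEFAULT_CONFIG
    _config_cache = new_config
    _last_known_good_config = new_config
    return _config_cache
```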
Follow-up 3: AppConfig canary deployment for model config changes
Q: Besides schema validation, how can AppConfig's deployment strategy prevent this type of incident from reaching all instances simultaneously?
A: Use AppConfig's linear deployment strategy for model routing config changes: deploy to 10% of Lambda execution environments, wait 2 minutes, check CloudWatch alarms, then advance to 100%. The alarm to watch: Bedrock ValidationException rate > 0.1% in the first 2 minutes. Because the bad config only reaches 10% of environments in the first wave, 90% of traffic remains unaffected. A failing canary wave triggers an automatic AppConfig rollback: attach the ValidationException alarm as a CloudWatch monitor on the AppConfig environment, and AppConfig rolls the deployment back automatically when the alarm fires mid-deployment (sketch below). The automatic rollback executes within 30 seconds. Contrast with the incident: the bad config deployed to 100% simultaneously → 100% of traffic failed for 22 minutes. With canary deployment: 10% of traffic affected for < 3 minutes before automatic rollback. The total user impact is reduced by ~10× in scope and ~7× in duration.
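A hedged boto3 sketch — AppConfig's automatic rollback is driven by CloudWatch alarm monitors attached to the environment; all IDs and ARNs are illustrative:

```python
import boto3

appconfig = boto3.client("appconfig")

# Linear strategy: +10% of targets every 2 minutes over a 20-minute window,
# then a 5-minute bake before the deployment is considered complete.
appconfig.create_deployment_strategy(
    Name="model-routing-linear-10pct",
    DeploymentDurationInMinutes=20,
    GrowthFactor=10.0,
    GrowthType="LINEAR",
    FinalBakeTimeInMinutes=5,
    ReplicateTo="NONE",
)

# Attaching the ValidationException alarm as an environment monitor makes
# AppConfig roll back automatically if the alarm fires mid-deployment.
appconfig.update_environment(
    ApplicationId="abc1234",   # hypothetical IDs and ARNs
    EnvironmentId="def5678",
    Monitors=[{
        "AlarmArn": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:BedrockValidationExceptionRate",
        "AlarmRoleArn": "arn:aws:iam::111122223333:role/AppConfigMonitorRole",
    }],
)
```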
Follow-up 4: Adding a new model to AppConfig correctly — what the runbook looks like
Q: Write the correct step-by-step runbook for adding a new Bedrock model to the AppConfig routing configuration.
A: Runbook: Step 1 — Copy the model config block template from templates/model_config_block.json. The template has all required fields with placeholder values and comments for acceptable ranges. Step 2 — Fill in the specific model values: model_id from the Bedrock console, anthropic_version from Bedrock's API reference (always bedrock-2023-05-31 for Claude models), max_tokens from the model page. Step 3 — Run CI validation locally: python -m pytest tests/test_model_config_schema.py --new-config path/to/draft.json. Assert exit code 0. Step 4 — Create a PR; the pipeline runs jsonschema.validate(config, schema) and comments the validation result. Step 5 — Merge and deploy using AppConfig linear strategy (10% → 100%). Step 6 — Monitor InvalidConfigRejected and ValidationException metrics for 10 minutes post-deployment. Step 7 — Run the automated model smoke test: one synthetic query per use case that uses the new model. Assert all responses valid. Close the deployment task. The template in Step 1 is the key: it turns "copy and modify a valid example" into the default workflow, eliminating the "copy Haiku config and forget a field" class of errors.
Grill 1: Schema validation didn't run because the config was pushed directly via console
Q: Post-incident analysis reveals the engineer bypassed CI by deploying the config directly through the AWS AppConfig console. The CI schema validation check was never run. How do you close this bypass?
A: Two controls: (1) IAM policy restricting CreateHostedConfigurationVersion: add an SCP or IAM condition that only allows appconfig:CreateHostedConfigurationVersion from the CI/CD IAM role (identified by the condition aws:PrincipalARN matching the pipeline role ARN). Engineers can read AppConfig configs via console but cannot deploy them directly. All deployments must go through the pipeline. (2) AppConfig validator: the managed JSON schema validator on the configuration profile is service-side — it runs on every CreateHostedConfigurationVersion call regardless of who makes it (console, CLI, pipeline). This means even if the IAM control is misconfigured, the schema validation still catches malformed documents. The combination of IAM restriction + service-side schema validation means the bypass via console is (a) prevented by policy and (b) still validated even if the policy is ever misconfigured. Defense in depth.
Grill 2: The schema validator has a bug — it accepts a malformed anthropic_version
Q: A subtle schema bug means anthropic_version: "bedrock-2024-01-01" (an invalid value) passes the schema validator. This bad value reaches Lambda and causes subtle errors. How do you catch this?
A: The JSON schema validator is a first layer — it catches structural issues (missing required fields, wrong types). It cannot catch API-level semantic errors (an anthropic_version value that is syntactically valid but unknown to Bedrock). Second layer: model smoke test in CI: after the config passes schema validation, run a synthetic InvokeModel call with the new model config in a sandbox Bedrock account (or a local mock). If Bedrock returns ValidationException, the smoke test fails and blocks deployment. This directly tests the Bedrock API contract, not just the schema. Third layer: enum constraint in the schema: "anthropic_version": { "type": "string", "enum": ["bedrock-2023-05-31"] }. The enum is sourced from the Bedrock API documentation and updated when a new version is released. The schema + smoke test combination catches both structural and semantic validation failures.
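A sketch of the CI smoke test, assuming a hypothetical new_model_block fixture that loads the PR's draft config block and credentials pointing at a sandbox Bedrock account:

```python
import json

import boto3
import pytest

bedrock = boto3.client("bedrock-runtime")

def test_new_model_block_smoke(new_model_block):
    # Exercise the real Bedrock API contract, not just the JSON schema.
    body = {
        "anthropic_version": new_model_block["anthropic_version"],
        "max_tokens": 32,
        "messages": [{"role": "user", "content": "ping"}],
    }
    try:
        bedrock.invoke_model(
            modelId=new_model_block["model_id"], body=json.dumps(body)
        )
    except bedrock.exceptions.ValidationException as err:
        pytest.fail(f"Bedrock rejected the new model config: {err}")
```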
Red Flags — Weak Answer Indicators
- Only proposing schema validation in CI without AppConfig service-side validation OR Lambda-side fallback
- Missing the three-level fallback hierarchy (new config → last known good → hardcoded defaults)
- No canary deployment strategy for config changes
- Not addressing the IAM bypass through which CI validation was sidestepped
Strong Answer Indicators
- Designs three-layer validation (CI + AppConfig managed validator + Lambda defensive validation)
- Implements _last_known_good_config separate from _config_cache for cold-start safety
- Proposes AppConfig linear deployment with automatic alarm-based rollback tied to the ValidationException alarm
- Closes the console bypass with an IAM PrincipalARN condition + service-side validator as defense-in-depth
Scenario 4: Lambda Env Var Model Switch — In-Flight Version Skew
Opening Question
Q: During a rolling Lambda update, the team changes an environment variable FM_MODEL_ID from claude-3-haiku to claude-3-sonnet. For approximately 8 minutes, 30% of Lambda execution environments have the old env var and 70% have the new one. Users' multi-turn sessions route different turns to different models, causing inconsistent response style and context handling. Explain the problem, what you should have done, and how you detect version skew in production.
Model Answer
The root cause is using Lambda environment variables as the mechanism for model selection. Environment variables are baked into the Lambda deployment/update — they are not a runtime-switchable per-request control. During a rolling update (which Lambda uses to limit concurrent invocation interruption), old and new execution environments coexist. The only way to avoid this skew with env vars is a complete replacement (all-at-once) deployment — which risks a much higher blast radius on failure. The architectural mistake: environment variables should never be used to control model selection because they cannot be changed without a deployment, and their propagation during rolling updates is inherently racy. The correct replacement: AppConfig as described in Scenario 1. AppConfig polling converges all execution environments within one poll interval (15–60 seconds) without a deployment. For the claude-3-haiku→claude-3-sonnet change: update AppConfig document → all Lambda environments see the new model within 60 seconds, no deployment required, no version skew.
Follow-up 1: Detecting version skew via CloudWatch Logs Insights
Q: During the incident, how would you detect that version skew is causing multi-model sessions in production?
A: Run a CloudWatch Logs Insights query across the Lambda log group:
fields @timestamp, session_id, model_id
| stats count_distinct(model_id) as model_variety by session_id
| filter model_variety > 1
| sort model_variety desc
count_distinct(model_id) > 1 means the session contacted two different model IDs across its turns — definitive evidence of version skew. Make this persistent: metric filters match individual log events and cannot aggregate distinct values across a session, so schedule the query instead (e.g., a Lambda calling StartQuery every 5 minutes) and emit a MultiModelSession CloudWatch metric for each offending session in the window. Alarm: MultiModelSession > 5 per 5 minutes → page on-call. The detection latency is bounded by the query schedule. For real-time detection, emit a session_model_id field in X-Ray traces and use X-Ray Analytics to find segments with mixed model annotations within the same trace.
Follow-up 2: Policy enforcement against future env var model switching
Q: How do you prevent the team from trying to use env vars for model selection again?
A: Three controls: (1) Documentation and ADR: write an ADR (Architecture Decision Record) stating "Model IDs MUST be sourced from AppConfig. Lambda environment variables MUST NOT be used for runtime model selection." Include the version skew incident as the motivation. (2) Pre-commit hook: scan Lambda configuration files (serverless.yml, template.yaml, CDK Stack files, Dockerfile ENV lines) for keys matching *MODEL*, *FM_ID*, *BEDROCK* — flag them for review. Block if the value is a Bedrock model ID string. (3) Pull request template checklist: add a mandatory item "I have confirmed that no model IDs are set via Lambda environment variables" to the PR description template. The combination of the ADR (cultural), the pre-commit hook (technical), and the PR checklist (process) creates three independent barriers. Any single one catching the issue is sufficient to prevent the incident.
Follow-up 3: If you must use env vars (legacy constraint), how do you minimize skew window?
Q: A legacy Lambda function vendor says they can only support env var-based model selection. What is the minimum-skew migration strategy?
A: If AppConfig is truly impossible (vendor constraint), use atomic blue/green deployment instead of rolling: create a new Lambda function version with the new env var → update the Lambda alias to point exclusively to the new version in a single atomic API call. There is no rolling update window — the alias switch is instant. Users mid-session during the alias switch may land on either version for their next request, but the switch duration is milliseconds (not 8 minutes). Additional controls: (1) Session affinity via ElastiCache: write session_id → model_id to Redis on first session request; subsequent requests read from Redis and use the same model regardless of which Lambda version serves them. This provides session-level model consistency even if the Lambda version varies. (2) Emit model_id in every response and detect inconsistency in the client layer. (3) Minimize env-var update frequency — batch model changes across a maintenance window when traffic is lowest.
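A sketch of the atomic cutover, assuming illustrative function, alias, and model ID values:

```python
import boto3

lambda_client = boto3.client("lambda")
FUNCTION = "mangaassist-recommend"  # hypothetical function and alias names
ALIAS = "live"

# 1. Bake the new env var into $LATEST (note: Variables replaces the whole map).
lambda_client.update_function_configuration(
    FunctionName=FUNCTION,
    Environment={"Variables": {"FM_MODEL_ID": "anthropic.claude-3-sonnet-20240229-v1:0"}},
)
lambda_client.get_waiter("function_updated_v2").wait(FunctionName=FUNCTION)

# 2. Snapshot the configuration as an immutable version.
version = lambda_client.publish_version(FunctionName=FUNCTION)["Version"]

# 3. Atomic cutover: repoint the alias in a single API call — no rolling window.
lambda_client.update_alias(FunctionName=FUNCTION, Name=ALIAS, FunctionVersion=version)
```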
Follow-up 4: What fields should be logged to support version skew investigation
Q: If the incident happens again, what structured log fields give you the fastest root-cause diagnosis?
A: Every Bedrock invocation log must include: session_id (for grouping turns), model_id (exact model used), lambda_version (the specific Lambda function version ARN, available via context.function_version), appconfig_config_version (the AppConfig ConfigurationVersion returned in the get_latest_configuration() response), timestamp_utc (ISO-8601, not log timestamp which can be skewed), and request_id. With these fields, a CloudWatch Logs Insights query can produce: for every session with more than one lambda_version or model_id, show the turn-by-turn model/version history. This query runs in < 10 seconds on a 24-hour log range and produces a clear timeline showing when the version skew started and ended. The appconfig_config_version field specifically distinguishes "env var skew" (same config version, different Lambda version) from "AppConfig rollout race" (different config versions, same Lambda version) — enabling precise root cause attribution.
Grill 1: "Session affinity via Redis for model consistency adds latency"
Q: The team pushes back on Redis session affinity: "A Redis GET on every request adds 2ms of latency. We can't afford that." How do you respond?
A: First, 2ms is the maximum Redis latency in the same VPC — in practice ElastiCache GET operations complete in 0.3–0.8ms. Second, MangaAssist already uses ElastiCache Redis for the circuit breaker state — the connection pool is already established, and there is no additional cold-connection overhead. Third, the cost of session model inconsistency is not 2ms — it is an incoherent user experience where the chatbot's response style and memory change mid-conversation. The CSAT impact of a context-confused response is far larger than 2ms. Fourth, the Redis GET is only necessary if the fundamental architecture (env vars + rolling deploy) is unavoidable — the real answer is "use AppConfig and eliminate both the version skew problem and the Redis workaround." The 2ms argument is a distraction from the correct architectural choice.
Grill 2: CloudWatch Logs Insights query has 5-minute latency — by the time alert fires, the skew is over
Q: The version skew during a rolling deploy lasts 8 minutes. A CloudWatch Logs Insights query on a 5-minute window means the alert fires at minute 5 — but the rolling deploy may have completed by minute 8. Is the alert still useful?
A: The alert serves two purposes: (1) Real-time intervention: if the skew is still in progress at minute 5, the alert enables an on-call engineer to pause the deployment, force an alias rollback, or reduce the rolling update percentage. Even a 3-minute window of action time (minutes 5–8) is better than discovering the incident retroactively. (2) Post-incident detection: if the skew is already resolved by the time the alert fires, the alert is still a Post-Incident Review trigger. The skew happened — the alert confirms it — the retrospective happens. Without the alert, a 30% model-skew incident that lasted 8 minutes might never be noticed or investigated. For truly real-time detection: push model_id to an X-Ray trace annotation and use the X-Ray Analytics Service Map, which updates near-real-time (< 10 seconds). The CloudWatch Logs Insights query is the retroactive audit; X-Ray is the real-time detection layer.
Red Flags — Weak Answer Indicators
- Not understanding why rolling Lambda updates create a multi-model window with env vars
- Proposing atomic all-at-once deployment without acknowledging blast radius tradeoff
- Missing the count_distinct(model_id) by session_id Logs Insights query as the diagnostic tool
- Not identifying AppConfig as the correct replacement for env var model selection
Strong Answer Indicators
- Correctly diagnoses rolling update window as the fundamental cause (not a "timing issue")
- Designs the MultiModelSession CloudWatch metric (emitted by a scheduled Logs Insights query) as an ongoing detection mechanism
- Proposes session affinity via Redis only as a workaround for legacy constraints
- Provides the exact Logs Insights query with all relevant structured log fields
Scenario 5: Routing Rule Conflict — Cost and Latency Routers Both Match
Opening Question
Q: MangaAssist has two routing rules in AppConfig — one optimising for cost (use Haiku) and one optimising for latency (use a local-region fine-tuned model). Both rules have a condition that matches messages tagged type=recommendation. For 14 days nobody noticed that 50% of recommendation requests were going to Haiku and 50% to the fine-tuned model, producing inconsistent quality. How does this happen, how do you detect it, and how do you redesign routing to prevent conflicts?
Model Answer
The routing engine evaluated rules in list order without explicit priority integers, and both rules matched type=recommendation. Without deterministic priority resolution, the engine may have been implementing last-match, first-match, or non-deterministic set iteration (depending on implementation language and collection type). In Python, iterating a dict is insertion-order deterministic since Python 3.7, but if rules were stored in a set or un-ordered JSON object, ordering was non-deterministic. Either way, two rules matching the same input without a tiebreaker produces undefined behavior. Detection: segment model_id in every Bedrock invocation log by task_type == recommendation. If count_distinct(model_id) where task_type=recommendation > 1, there is a conflict. Redesign: add a mandatory priority integer field to every routing rule schema (lower integer = higher priority, like iptables). Validation: the CI schema check verifies that no two rules targeting the same task_type have the same priority integer. The routing engine selects the rule with the lowest priority integer among all matched rules — deterministic and explicit.
Follow-up 1: RoutingRule dataclass design with conflict detection
Q: Show how you would redesign the RoutingRule model and the routing evaluation logic to be explicitly conflict-free.
A: RoutingRule as a Python dataclass with fields: rule_id: str, priority: int, conditions: Dict[str, str] (e.g., {"task_type": "recommendation", "user_segment": "premium"}), model_id: str, description: str. The routing function: (1) collect all rules where every condition key-value matches the request context; (2) sort matched rules by priority ascending; (3) if the top two matched rules have the same priority — emit RoutingRuleConflict CloudWatch metric with dimensions rule_a, rule_b, task_type; log a warning; use the rule with the lower lexicographic rule_id as a deterministic tiebreaker (never silent non-determinism); (4) return the model_id from the highest-priority (lowest priority integer) rule. The RoutingRuleConflict metric is the signal that the configuration has a conflict; the deterministic tiebreaker ensures the system never behaves randomly while the conflict is pending resolution. Zero silent failures.
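A sketch of the dataclass and evaluation logic described above; emit_metric() is a hypothetical CloudWatch helper:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class RoutingRule:
    rule_id: str
    priority: int            # lower integer = higher priority (iptables-style)
    conditions: Dict[str, str]
    model_id: str
    description: str = ""

def route(rules: List[RoutingRule], context: Dict[str, str]) -> str:
    # 1. Collect every rule whose conditions all match the request context.
    matched = [
        r for r in rules
        if all(context.get(k) == v for k, v in r.conditions.items())
    ]
    if not matched:
        raise LookupError("no routing rule matched")  # caller falls back to defaults
    # 2. Deterministic order: priority first, lexicographic rule_id as tiebreaker.
    matched.sort(key=lambda r: (r.priority, r.rule_id))
    top = matched[0]
    # 3. Equal top priorities = configuration conflict: alert, never fail silently.
    if len(matched) > 1 and matched[1].priority == top.priority:
        emit_metric(  # hypothetical CloudWatch helper
            "RoutingRuleConflict",
            rule_a=top.rule_id,
            rule_b=matched[1].rule_id,
            task_type=context.get("task_type", "unknown"),
        )
    return top.model_id
```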
Follow-up 2: CI validation for routing rule conflicts
Q: How do you catch a routing rule conflict in CI before it reaches AppConfig?
A: A test_routing_rules_no_conflict() test that loads the AppConfig document from the PR branch and evaluates all pairwise combinations of routing rules: for any pair (rule_A, rule_B) with the same priority and overlapping condition sets, fail the test with a message: [CONFLICT] Rule '{rule_A.rule_id}' and '{rule_B.rule_id}' both match task_type=recommendation with priority=10. Assign distinct priority values. The test uses itertools.combinations(rules, 2) to check all pairs — O(n²) but the number of rules is small (< 50) in any real system and the test completes in < 5ms. Add this as a required CI check for any PR that modifies the routing configuration. The test is deterministic, fast, and catches exactly the class of conflict that caused the 14-day incident.
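A sketch of the pairwise test, assuming a hypothetical load_routing_rules() loader that parses the PR branch's AppConfig document into RoutingRule objects:

```python
from itertools import combinations

def conditions_can_overlap(a: RoutingRule, b: RoutingRule) -> bool:
    # Two rules can match the same request unless some shared condition key
    # pins them to different values.
    shared = set(a.conditions) & set(b.conditions)
    return all(a.conditions[k] == b.conditions[k] for k in shared)

def test_routing_rules_no_conflict():
    rules = load_routing_rules()  # hypothetical loader for the PR's AppConfig document
    for rule_a, rule_b in combinations(rules, 2):
        conflict = (
            rule_a.priority == rule_b.priority
            and conditions_can_overlap(rule_a, rule_b)
        )
        assert not conflict, (
            f"[CONFLICT] Rule '{rule_a.rule_id}' and '{rule_b.rule_id}' can match "
            f"the same request with priority={rule_a.priority}. "
            "Assign distinct priority values."
        )
```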
Follow-up 3: What the 14-day blind spot tells you about monitoring
Q: The conflict went undetected for 14 days. What monitoring would have caught it within 1 hour?
A: Three monitoring gaps to close: (1) Per-task-type model distribution metric: emit RoutingModelId as a CloudWatch metric dimension alongside TaskType. Create an anomaly detection alarm: if the cardinality of model_id for TaskType=recommendation in a 5-minute window is > 1, alarm. This catches the conflict within the first 5 minutes after the first conflicting requests arrive. (2) Conflict metric alert: the RoutingRuleConflict metric (emitted by the conflict-detecting router) should have an alarm with threshold 0 — any single conflict event pages on-call. (3) Quality distribution by model: a daily automated eval that scores 50 recommendation responses per model. If the recommendation use-case shows bi-modal quality (some responses averaging 4.2/5, others 3.1/5), the quality distribution histogram is a lagging indicator that something is routing inconsistently. The per-task-type model cardinality alarm is the fastest detection; the quality histogram is the business-impact confirmation.
Follow-up 4: Handling rule precedence in a multi-environment configuration
Q: The routing rule priority system works in dev, but in production, different teams own different rule files that are merged at deployment time. How do you prevent priority conflicts between teams?
A: Global namespace partitioning for priority ranges: each team owns a dedicated 100-unit priority range, ordered by business precedence. Safety/Moderation rules must always win, so that team owns the highest-priority range (1–100); the Core ML team owns 101–200; the Product Recommendation team owns 201–300. Each team's CI validates only within its own range. A merge-time validation step in the shared CI/CD pipeline checks all combined rules for cross-team priority conflicts. Because each team has its own non-overlapping range, cross-team conflicts are structurally impossible. Within a team's range, strict priority discipline is the team's responsibility. Additionally: create a routing_rule_ownership.yaml that maps rule_id patterns to team names — if a PR modifies a rule owned by a different team, require approval from the owning team.
Grill 1: "We should just use the first-match rule — it's simpler"
Q: An engineer proposes: "First-match wins, in the order rules appear in the config file. No priority needed." Is this acceptable?
A: First-match with implicit list order is fragile for three reasons: (1) Config file order is not semantically meaningful — it depends on which engineer wrote the rule first, not which rule has business priority. "Cost optimisation beats latency" vs. "latency beats cost" is a business decision that should be explicit, not an accident of edit history. (2) Merge conflicts corrupt order: if two PRs both add rules and the merge resolves a conflict in the rules array, the resulting order is determined by git's merge algorithm, not by any deliberate business logic. (3) Invisible priority assumption: a new engineer adding a rule in the "middle" of the list unintentionally changes the priority of all subsequent rules without understanding the impact. Explicit priority integers make business intent clear, version-controlled, and auditable. The argument that "priority integers are more complex" is outweighed by the fact that the current incident cost 14 days of inconsistent quality — the complexity is the right kind (explicit and visible) vs. the wrong kind (implicit and dangerous).
Grill 2: The priority tiebreaker (lexicographic rule_id) also seems arbitrary
Q: The lexicographic-rule_id tiebreaker for equal priorities is arbitrary. Why not just raise an error and fail the request?
A: Fail-fast (raise an error) is appropriate in CI — where the conflict is caught and blocked before deployment. In production, raising an error on every request when a conflict exists would trigger a 100% failure rate for all recommendation requests until the conflict is resolved — worse than the original 50/50 inconsistency. The deterministic tiebreaker is a production safety valve, not the intended resolution. Its job is to: (1) keep the system serving requests with a predictable (even if not perfect) behavior; (2) emit the RoutingRuleConflict metric that pages on-call; (3) log which rule "won" via the tiebreaker so the investigation can confirm the behavior. The correct response to a RoutingRuleConflict alarm is to fix the conflict in AppConfig (assign distinct priorities) within the next deployment cycle — typically < 30 minutes. The system is degraded (serving with arbitrary tiebreaker) but not down. The priority is to auto-detect + alert + degrade-gracefully, not fail-hard in production.
Red Flags — Weak Answer Indicators
- Not identifying the absence of a priority field as the architectural root cause
- Proposing first-match with implicit list order as an acceptable resolution
- Missing the CI pairwise conflict detection test
- No RoutingRuleConflict metric emitted or monitored
Strong Answer Indicators
- Designs a RoutingRule dataclass with a mandatory priority: int field
- Implements a deterministic tiebreaker (lexicographic rule_id) with RoutingRuleConflict metric emission for transparent degradation
- Creates an O(n²) pairwise-combination CI test that catches conflicts before deployment
- Designs team-owned priority namespace ranges to prevent cross-team conflicts in a multi-team config