Scenarios and Runbooks — Dynamic Model Selection Architecture
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Dimension | Detail |
|---|---|
| Certification | AWS AIF-C01 — AI Practitioner |
| Domain | 1 — Foundation Model Integration, Data Management, and Compliance |
| Task | 1.2 — Select and configure FMs |
| Skill | 1.2.2 — Create flexible architecture patterns to enable dynamic model selection and provider switching without requiring code modifications |
| This File | Five production scenarios with detection flowcharts, root cause analysis, resolution code, and prevention strategies |
Skill Scope Statement
This file presents five real-world failure scenarios that MangaAssist has encountered (or would encounter) in production when adopting dynamic model selection and provider switching. The scenarios cover hard-coded model IDs forcing redeployments, AppConfig cache misses adding latency, missing schema validation on new model registrations, in-flight Lambda version skew after environment-variable-based switches, and conflicting routing rules producing non-deterministic model choices. Each scenario includes: a problem statement, a mermaid detection flowchart, root cause analysis, Python resolution code, and prevention measures. These runbooks are designed for on-call engineers and platform architects responsible for the MangaAssist inference layer.
Mind Map — Dynamic Model Selection Failure Modes
mindmap
root((Dynamic Model<br/>Selection Failures))
Hard-Coded Model ID
Lambda Redeploy Required
No Runtime Switching
Uncontrolled Cost Spikes
AppConfig Cache Miss
200ms Config Latency
Per-Request API Calls
Throttling Risk
Missing Schema Validation
Bad Config Crashes Workers
Unvalidated Model Entries
Inference Layer Outage
Env Var Provider Switch
In-Flight Version Skew
Multiple Models Serving
Non-Deterministic Responses
Routing Rule Conflict
Cost vs Latency Router
Non-Deterministic Selection
Audit Trail Gap
Scenario Overview
| # | Scenario | Severity | Blast Radius | Typical Detection Time |
|---|---|---|---|---|
| 1 | Model ID hard-coded in Lambda — redeploy required to switch models during cost spike | P2 — High | All inference traffic routed through affected Lambda | 10-30 minutes (manual discovery, not alarmed) |
| 2 | AppConfig model routing config not cached — every request fetches config adding 200 ms latency | P2 — High | Every inference call, all users | 5-10 minutes via p99 latency alarm |
| 3 | New Bedrock model registered without schema validation — bad config crashes all inference workers | P1 — Critical | Complete inference layer outage | 1-3 minutes via error rate alarm |
| 4 | Provider switch via Lambda env var — concurrent Lambda versions serve different models simultaneously | P2 — High | Subset of in-flight requests during rollout | 15-30 minutes via response inconsistency reports |
| 5 | Routing rule conflict: cost-based and latency-based routers both match same message type | P3 — Medium | Non-deterministic model selection for affected intent class | Post-hoc via model-usage audit |
Scenario 1: Model ID Hard-Coded in Lambda — Redeploy Required During Cost Spike
Problem
MangaAssist's inference Lambda was originally built with the Bedrock model ID (`anthropic.claude-3-sonnet-20240229-v1:0`) embedded directly in source code. During a sudden traffic surge (typically late Friday evening, when manga chapter releases drive a 5× spike), token costs rise proportionally. The on-call engineer wants to downgrade from Sonnet to Haiku immediately to cut costs by roughly 90%, but the only mechanism is a full Lambda redeployment, which takes 8-12 minutes and introduces deployment risk at peak hours.
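To put the switch in concrete terms, the back-of-the-envelope math below compares daily spend at the stated Bedrock prices; the per-message token counts are illustrative assumptions, not measured MangaAssist figures.

# Rough daily-cost comparison at 1M messages/day using the stated Bedrock prices.
# The per-message token averages below are assumptions for illustration only.
MESSAGES_PER_DAY = 1_000_000
TOKENS_IN, TOKENS_OUT = 500, 300  # assumed average tokens per message

PRICES_PER_1M = {  # USD per 1M tokens: (input, output)
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

for model, (p_in, p_out) in PRICES_PER_1M.items():
    daily_usd = MESSAGES_PER_DAY * (TOKENS_IN * p_in + TOKENS_OUT * p_out) / 1_000_000
    print(f"{model}: ~${daily_usd:,.0f}/day")
# sonnet: ~$6,000/day vs haiku: ~$500/day (a ~92% reduction under these assumptions)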
Detection
flowchart TD
A["CloudWatch Alarm:<br/>BedRock token cost<br/>anomaly detector fires"] --> B{"On-call opens<br/>Lambda console"}
B --> C["Searches for model ID<br/>in environment variables"]
C -->|"Not found in env vars"| D["Searches source code<br/>in Lambda function code"]
D -->|"Model ID found<br/>in source code"| E["ROOT CAUSE:<br/>Hard-coded model ID —<br/>no runtime switching available"]
D -->|"Not found in source"| F["Check AppConfig /<br/>Parameter Store for routing config"]
E --> G{"Assess alternatives"}
G -->|"No config switch available"| H["Must redeploy Lambda<br/>with updated model ID"]
G -->|"AppConfig exists but unused"| I["Enable AppConfig-backed<br/>routing — use Runbook 1"]
H --> J["Track deployment duration<br/>vs cost burn — escalate if > 15 min"]
Root Cause
The inference Lambda was scaffolded with the model ID as a Python constant (MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"). No abstraction layer was built to read the model ID from an external configuration source. When Bedrock Haiku became available and cost targets tightened, the team never migrated configuration to AppConfig or Parameter Store, because the switch was always done "just once." Over time, hard-coded IDs spread to three Lambda functions (inference, retry-handler, eval-harness), each needing independent redeployment.
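For reference, the anti-pattern looked roughly like this (a simplified sketch, not the actual MangaAssist source):

import json

# Anti-pattern: the model ID is a compile-time constant; switching requires a redeploy.
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

def invoke(bedrock_client, user_message: str) -> dict:
    # Every invocation uses the baked-in constant; there is no runtime override path.
    payload = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": user_message}],
    }
    return bedrock_client.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(payload),
        contentType="application/json",
        accept="application/json",
    )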
Resolution
"""
Runbook 1: Migrate hard-coded model ID to AppConfig-backed dynamic routing.
Provides zero-redeploy model switching for MangaAssist inference Lambda.
Prerequisites:
- AWS AppConfig application: "manga-assist"
- AppConfig environment: "production"
- AppConfig configuration profile: "model-routing"
- Initial hosted config document (JSON) deployed before Lambda update
"""
import json
import logging
import os
import time
from typing import Any
import boto3
logger = logging.getLogger("manga_model_router")
logger.setLevel(logging.INFO)
# ---------------------------------------------------------------------------
# AppConfig SDK cache — avoids per-request HTTP calls (see Scenario 2 fix)
# ---------------------------------------------------------------------------
_appconfig_client = boto3.client("appconfigdata", region_name=os.environ.get("AWS_REGION", "us-east-1"))
# Module-level state — persists across warm invocations
_config_session_token: str | None = None
_config_cache: dict | None = None
_cache_loaded_at: float = 0.0
# AppConfig poll interval — Lambda will not call GetLatestConfiguration
# more often than this across warm invocations (max 1 req per interval per instance)
APPCONFIG_POLL_SECONDS = int(os.environ.get("APPCONFIG_POLL_SECONDS", "60"))
APPCONFIG_APP = os.environ["APPCONFIG_APP"] # e.g. "manga-assist"
APPCONFIG_ENV = os.environ["APPCONFIG_ENV"] # e.g. "production"
APPCONFIG_PROFILE = os.environ["APPCONFIG_PROFILE"] # e.g. "model-routing"
def _start_config_session() -> str:
"""Start an AppConfig data session and return the initial token."""
response = _appconfig_client.start_configuration_session(
ApplicationIdentifier=APPCONFIG_APP,
EnvironmentIdentifier=APPCONFIG_ENV,
ConfigurationProfileIdentifier=APPCONFIG_PROFILE,
RequiredMinimumPollIntervalInSeconds=APPCONFIG_POLL_SECONDS,
)
return response["InitialConfigurationToken"]
def get_model_routing_config() -> dict:
"""
Fetch model routing config from AppConfig with Lambda-level caching.
Returns the cached config if last fetch was within APPCONFIG_POLL_SECONDS.
Only calls GetLatestConfiguration when the poll interval has elapsed.
Config shape:
{
"default_model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"intent_overrides": {
"product_recommendation": "anthropic.claude-3-sonnet-20240229-v1:0",
"content_moderation": "anthropic.claude-3-haiku-20240307-v1:0"
},
"cost_override_active": false
}
"""
global _config_session_token, _config_cache, _cache_loaded_at
now = time.monotonic()
# Return cached config if within poll interval
if _config_cache is not None and (now - _cache_loaded_at) < APPCONFIG_POLL_SECONDS:
return _config_cache
# Start session on cold start or token expiry
if _config_session_token is None:
_config_session_token = _start_config_session()
try:
response = _appconfig_client.get_latest_configuration(
ConfigurationToken=_config_session_token
)
# Token is rotated on every call — must always update
_config_session_token = response["NextPollConfigurationToken"]
raw = response["Configuration"].read()
if raw:
# Non-empty response means config changed (or first fetch)
_config_cache = json.loads(raw)
_cache_loaded_at = now
logger.info(
"AppConfig model routing config refreshed: default_model=%s",
_config_cache.get("default_model_id"),
)
        else:
            # Empty body = no change since last poll; keep existing cache and
            # reset the poll timer so the minimum poll interval is respected
            _cache_loaded_at = now
            logger.debug("AppConfig returned empty body — no config change")
except Exception as exc:
logger.error("AppConfig fetch failed: %s — using last known config", exc)
if _config_cache is None:
raise RuntimeError(
"AppConfig unreachable and no cached config available"
) from exc
return _config_cache
def resolve_model_id(intent: str) -> str:
"""
Resolve the Bedrock model ID for a given intent using AppConfig routing rules.
Cost override (set in AppConfig when cost alarm fires) forces all traffic
to Haiku regardless of intent-level overrides.
"""
config = get_model_routing_config()
if config.get("cost_override_active", False):
model_id = "anthropic.claude-3-haiku-20240307-v1:0"
logger.info("Cost override active — routing all traffic to Haiku")
return model_id
model_id = config.get("intent_overrides", {}).get(
intent, config["default_model_id"]
)
logger.info("Resolved model_id=%s for intent=%s", model_id, intent)
return model_id
def lambda_handler(event: dict, context: Any) -> dict:
"""
MangaAssist inference Lambda — model ID read from AppConfig, not source code.
"""
bedrock = boto3.client("bedrock-runtime", region_name=os.environ.get("AWS_REGION", "us-east-1"))
session_id = event["session_id"]
user_message = event["message"]
intent = event.get("intent", "general_qa")
model_id = resolve_model_id(intent)
payload = {
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"messages": [{"role": "user", "content": user_message}],
}
response = bedrock.invoke_model(
modelId=model_id,
body=json.dumps(payload),
contentType="application/json",
accept="application/json",
)
result = json.loads(response["body"].read())
answer = result["content"][0]["text"]
logger.info(
"Inference complete: session=%s intent=%s model=%s tokens_in=%d tokens_out=%d",
session_id,
intent,
model_id,
result["usage"]["input_tokens"],
result["usage"]["output_tokens"],
)
return {"session_id": session_id, "answer": answer, "model_id": model_id}
Prevention
- Never embed Bedrock model IDs as Python constants or Lambda environment variables. All model IDs live in AppConfig hosted configuration from day one.
- Add a `pre-commit` hook that scans for string literals matching the Bedrock model ID pattern (`r"anthropic\.claude-[a-z0-9\-]+:\d+"`) in Lambda source files and fails the commit (a sketch of such a check follows this list).
- Create a CloudWatch alarm on Bedrock token costs (`AWS/Bedrock` namespace, `InputTokenCount` + `OutputTokenCount`). When the alarm fires, the on-call runbook instruction is to flip `cost_override_active: true` in AppConfig, not to redeploy anything.
- Include the `resolve_model_id` abstraction in the organization's standard Lambda scaffold so new functions inherit dynamic routing from the start.
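A minimal sketch of the pre-commit check referenced above; the file name, the exact regex, and the Python-only filter are assumptions to adapt to the repository layout.

# check_no_hardcoded_model_ids.py (hypothetical file name, invoked by pre-commit with
# the staged file paths as arguments). Fails the commit if a Bedrock model ID literal
# appears in Lambda source files.
import re
import sys

MODEL_ID_PATTERN = re.compile(r"anthropic\.claude-[a-z0-9\-]+:\d+")

def main(paths: list[str]) -> int:
    violations = []
    for path in paths:
        if not path.endswith(".py"):
            continue
        with open(path, encoding="utf-8") as handle:
            for lineno, line in enumerate(handle, start=1):
                if MODEL_ID_PATTERN.search(line):
                    violations.append(f"{path}:{lineno}: hard-coded Bedrock model ID")
    for message in violations:
        print(message)
    return 1 if violations else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))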
Scenario 2: AppConfig Model Routing Config Not Cached — 200 ms Added Per Request
Problem
MangaAssist updated the inference Lambda to read model routing from AppConfig (fixing Scenario 1), but the developer used the older GetConfiguration API directly on every invocation instead of setting up a session-token-based polling loop. Every inference call makes a synchronous AppConfig HTTP call before invoking Bedrock, adding ~200 ms to every request and sometimes triggering AppConfig API throttling (429s) under load. The p99 latency for the chatbot rises from 1.8 s to 2.4 s and periodically spikes to 5 s when throttled.
Detection
flowchart TD
A["CloudWatch Alarm:<br/>MangaAssist p99 latency > 2.5s"] --> B{"Check X-Ray<br/>service map"}
B -->|"AppConfig segment<br/>> 180ms present on<br/>every trace"| C["AppConfig call<br/>on every invocation confirmed"]
B -->|"Bedrock alone<br/>accounts for latency"| D["Model performance issue —<br/>Different Runbook"]
C --> E{"Check Lambda<br/>CloudWatch Logs"}
E -->|"'AppConfig model routing<br/>config refreshed' on<br/>EVERY invocation"| F["ROOT CAUSE:<br/>No module-level caching —<br/>fresh HTTP call per request"]
E -->|"Throttle errors<br/>(ThrottlingException)"| G["AppConfig throttling<br/>compounding latency"]
F --> H["Apply Runbook 2:<br/>Session-token polling loop"]
G --> H
Root Cause
The Lambda used the legacy appconfig (not appconfigdata) client, calling GetConfiguration with the same ClientId on every invocation. The appconfigdata client's session-token pattern was designed specifically to eliminate per-request API calls — the token acts as a cursor; AppConfig only sends a new config body when something has changed. Without the session token and module-level caching, every warm invocation made a fresh HTTPS call to AppConfig, serializing it with the Bedrock call. Under high concurrency, the AppConfig endpoint throttled at its default burst limit, causing intermittent 5-second spikes as the Lambda retried with exponential backoff.
Resolution
"""
Runbook 2: Fix AppConfig per-request fetches with session-token caching.
The fix is additive — replace the legacy GetConfiguration pattern with
the appconfigdata session pattern shown in Runbook 1, ensuring the
module-level globals (_config_session_token, _config_cache) persist
across Lambda warm invocations.
This snippet shows the broken pattern and the corrected replacement.
"""
import json
import logging
import os
import time
import boto3
from botocore.exceptions import ClientError
logger = logging.getLogger("manga_appconfig_fix")
logger.setLevel(logging.INFO)
# ── BROKEN PATTERN (do NOT use) ─────────────────────────────────────────────
def _broken_get_config_every_call() -> dict:
"""
Anti-pattern: calls appconfig.GetConfiguration on every Lambda invocation.
Adds 150-250ms and risks throttling at scale.
"""
client = boto3.client("appconfig") # Legacy client
response = client.get_configuration(
Application="manga-assist",
Environment="production",
Configuration="model-routing",
ClientId="manga-lambda", # Same client ID = no cache benefit
)
return json.loads(response["Content"].read() or b"{}")
# ── CORRECT PATTERN ──────────────────────────────────────────────────────────
_appconfig_data_client = boto3.client(
"appconfigdata",
region_name=os.environ.get("AWS_REGION", "us-east-1"),
)
# These persist across warm invocations within the same Lambda execution environment
_session_token: str | None = None
_cached_config: dict | None = None
_last_poll_time: float = 0.0
POLL_INTERVAL_SECONDS = int(os.environ.get("APPCONFIG_POLL_SECONDS", "60"))
def _ensure_session_token() -> str:
"""Start a new AppConfig data session if one does not exist."""
global _session_token
if _session_token is None:
resp = _appconfig_data_client.start_configuration_session(
ApplicationIdentifier=os.environ["APPCONFIG_APP"],
EnvironmentIdentifier=os.environ["APPCONFIG_ENV"],
ConfigurationProfileIdentifier=os.environ["APPCONFIG_PROFILE"],
RequiredMinimumPollIntervalInSeconds=POLL_INTERVAL_SECONDS,
)
_session_token = resp["InitialConfigurationToken"]
logger.info("AppConfig session started (cold start or token expiry)")
return _session_token
def get_model_config_cached() -> dict:
"""
Return model routing config, fetching from AppConfig at most once per
POLL_INTERVAL_SECONDS across warm invocations.
Benchmark improvement:
Before fix — AppConfig call on every invocation: ~200ms added
After fix — AppConfig call every 60s (once per poll interval): ~0ms added
on warm invocations; session token call on cold start: ~40ms
"""
global _session_token, _cached_config, _last_poll_time
now = time.monotonic()
if _cached_config is not None and (now - _last_poll_time) < POLL_INTERVAL_SECONDS:
# Cache is fresh — return immediately, no HTTP call
return _cached_config
token = _ensure_session_token()
try:
start = time.monotonic()
resp = _appconfig_data_client.get_latest_configuration(
ConfigurationToken=token
)
elapsed_ms = (time.monotonic() - start) * 1000
# MUST always update the token — it is rotated on every API call
_session_token = resp["NextPollConfigurationToken"]
body = resp["Configuration"].read()
if body:
_cached_config = json.loads(body)
_last_poll_time = now
logger.info(
"AppConfig config updated in %.1f ms: %s",
elapsed_ms,
json.dumps(_cached_config),
)
else:
# Empty body = no config change, keep existing cache
_last_poll_time = now # Reset timer so we don't poll again immediately
logger.debug("AppConfig: no config change (%.1f ms)", elapsed_ms)
except ClientError as exc:
error_code = exc.response["Error"]["Code"]
if error_code == "BadRequestException" and _cached_config is not None:
# Token may have expired (e.g. Lambda was paused > 24h)
# Reset session and fall back to cached data
logger.warning(
"AppConfig token invalid, resetting session — using cached config"
)
_session_token = None
else:
logger.error("AppConfig GetLatestConfiguration failed: %s", exc)
if _cached_config is None:
raise
return _cached_config
def lambda_handler(event: dict, context) -> dict:
"""
Demonstrates: AppConfig config retrieved without HTTP overhead on warm invocations.
Only the FIRST call per poll interval (or cold start) hits AppConfig.
"""
config = get_model_config_cached() # Zero HTTP cost on warm invoke within poll window
intent = event.get("intent", "general_qa")
model_id = (
config.get("intent_overrides", {}).get(intent)
or config["default_model_id"]
)
logger.info("Model selected: %s for intent: %s", model_id, intent)
return {"model_id": model_id, "intent": intent}
Prevention
- Establish a Lambda scaffolding standard that includes the session-token polling loop from day one. Document the anti-pattern (`GetConfiguration` without caching) in the team's internal wiki with a "do not use" warning.
- Add an X-Ray annotation `appconfig_cache_hit: true/false` to every Lambda invocation (see the sketch after this list). Create a CloudWatch Logs Insights query that alerts when `appconfig_cache_hit = false` appears on more than 5% of invocations.
- Set `RequiredMinimumPollIntervalInSeconds` to at least 60 seconds. AppConfig enforces a minimum of 15 seconds; choosing 60 seconds cuts AppConfig request volume (and the associated bill) by 4× while still allowing fast config propagation.
- In load tests (before prod deploy), assert that p50 and p99 AppConfig segment latency is < 5 ms; a segment latency > 100 ms indicates the cache is not working.
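A minimal sketch of the cache-hit annotation, assuming the Lambda is already instrumented with the aws_xray_sdk package and active tracing; record_appconfig_cache_result is a hypothetical helper, called with True on the cached early-return path and False just before GetLatestConfiguration is actually invoked.

# Sketch: annotate each trace with whether the AppConfig cache was hit.
# Assumes active X-Ray tracing and the aws_xray_sdk package in the deployment bundle.
from aws_xray_sdk.core import xray_recorder

def record_appconfig_cache_result(cache_hit: bool) -> None:
    # Annotations are indexed by X-Ray, so a trace filter or dashboard can alert
    # when cache_hit=false shows up on more than ~5% of invocations.
    xray_recorder.put_annotation("appconfig_cache_hit", cache_hit)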
Scenario 3: New Bedrock Model Registered Without Schema Validation — Bad Config Crashes Inference Workers
Problem
A junior engineer adds a newly announced Bedrock model (`anthropic.claude-3-5-sonnet-20241022-v2:0`) to MangaAssist's AppConfig model routing document. The new registry entry is missing the required `anthropic_version` key that the inference code expects when building the Bedrock request payload. The bad config is deployed to AppConfig production. On the next poll cycle (60 seconds later), all warm Lambda instances fetch the config update, attempt to invoke Bedrock with a malformed payload, and start returning 5xx errors. The chatbot becomes completely unavailable for all users.
Detection
flowchart TD
A["PagerDuty P1 Alert:<br/>MangaAssist error rate > 10%"] --> B{"Check Lambda<br/>CloudWatch Logs"}
B -->|"KeyError: 'anthropic_version'<br/>in build_payload()"| C["Config schema<br/>violation detected"]
B -->|"ValidationException<br/>from Bedrock"| D["Also config-related —<br/>check payload construction"]
C --> E{"When did errors start?"}
D --> E
E -->|"~60s ago<br/>(AppConfig poll cycle)"| F["Check AppConfig<br/>deployment history"]
F -->|"New deployment<br/>found at T-60s"| G["ROOT CAUSE:<br/>Bad config deployed —<br/>no schema validation<br/>on AppConfig document"]
G --> H{"Is AppConfig rollback<br/>available?"}
H -->|"Yes"| I["Roll back to previous<br/>AppConfig deployment version"]
H -->|"No"| J["Manually push corrected<br/>config document via Runbook 3"]
I --> K["Monitor error rate<br/>drops to baseline"]
J --> K
Root Cause
AppConfig accepts arbitrary JSON documents with no enforced schema unless a JSON Schema validator is explicitly attached to the configuration profile. The routing config document grew organically without a formal schema definition. When an engineer added the new model entry, they omitted the anthropic_version field that the inference Lambda's build_payload() function requires. Since there was no CI validation of the AppConfig document against a schema before deployment, the malformed config was pushed to production. All Lambda instances that polled during the next cycle immediately began failing on every inference call.
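For illustration, the failure looks roughly like this; build_payload below is a simplified stand-in for the real helper, not the production code.

# Simplified stand-in for the inference Lambda's payload builder. A model_registry
# entry without "anthropic_version" raises KeyError on every request that uses it.
def build_payload(registry_entry: dict, user_message: str) -> dict:
    return {
        "anthropic_version": registry_entry["anthropic_version"],  # KeyError if missing
        "max_tokens": registry_entry["max_tokens_default"],
        "messages": [{"role": "user", "content": user_message}],
    }

bad_entry = {  # the deployed entry: no "anthropic_version" key
    "model_id": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "max_tokens_default": 2048,
    "cost_tier": "high",
}
# build_payload(bad_entry, "Recommend a shonen series")  ->  KeyError: 'anthropic_version'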
Resolution
"""
Runbook 3: Schema validation for AppConfig model routing documents.
Two parts:
(a) CI validation script — runs in GitHub Actions before AppConfig deployment
(b) Lambda-side defensive validation — rejects bad config at fetch time,
falls back to last known good config, and fires a CloudWatch alarm
"""
import json
import logging
import os
import time
import boto3
import jsonschema
from jsonschema import ValidationError
logger = logging.getLogger("manga_config_validator")
logger.setLevel(logging.INFO)
# ── CONFIG SCHEMA ────────────────────────────────────────────────────────────
MODEL_ROUTING_SCHEMA = {
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "MangaAssist Model Routing Config",
"type": "object",
"required": ["default_model_id", "intent_overrides", "cost_override_active"],
"additionalProperties": False,
"properties": {
"default_model_id": {
"type": "string",
"pattern": r"^anthropic\.",
"description": "Bedrock model ID used when no intent override matches",
},
"intent_overrides": {
"type": "object",
"additionalProperties": {
"type": "string",
"pattern": r"^anthropic\.",
},
"description": "Per-intent model ID overrides",
},
"cost_override_active": {
"type": "boolean",
"description": "When true, routes all traffic to Haiku regardless of intent",
},
"model_registry": {
"type": "array",
"items": {
"type": "object",
"required": [
"model_id",
"anthropic_version",
"max_tokens_default",
"cost_tier",
],
"additionalProperties": False,
"properties": {
"model_id": {"type": "string", "pattern": r"^anthropic\."},
"anthropic_version": {
"type": "string",
"enum": ["bedrock-2023-05-31"],
},
"max_tokens_default": {"type": "integer", "minimum": 1, "maximum": 8192},
"cost_tier": {"type": "string", "enum": ["low", "medium", "high"]},
},
},
},
},
}
def validate_model_routing_config(config_doc: dict) -> None:
"""
Validate a model routing config document against MODEL_ROUTING_SCHEMA.
Raises jsonschema.ValidationError on failure.
Use this function in:
1. CI pipeline (pre-AppConfig-deploy) — raises and fails the pipeline
2. Lambda fetch handler — catches and falls back to last known good config
"""
jsonschema.validate(instance=config_doc, schema=MODEL_ROUTING_SCHEMA)
logger.info("Config schema validation passed")
# ── CI SCRIPT (run as: python validate_appconfig.py <config_file.json>) ──────
def ci_validate_config_file(file_path: str) -> None:
"""Entry point for CI pipeline schema check before AppConfig deployment."""
with open(file_path, "r", encoding="utf-8") as f:
doc = json.load(f)
try:
validate_model_routing_config(doc)
print(f"[PASS] {file_path} passes schema validation")
except ValidationError as exc:
print(f"[FAIL] {file_path} schema violation: {exc.message}")
raise SystemExit(1) from exc
# ── LAMBDA-SIDE DEFENSIVE FETCH ──────────────────────────────────────────────
_appconfig_client = boto3.client(
"appconfigdata", region_name=os.environ.get("AWS_REGION", "us-east-1")
)
_cloudwatch = boto3.client("cloudwatch", region_name=os.environ.get("AWS_REGION", "us-east-1"))
_session_token: str | None = None
_last_known_good_config: dict | None = None
_last_poll_time: float = 0.0
POLL_INTERVAL_SECONDS = int(os.environ.get("APPCONFIG_POLL_SECONDS", "60"))
def _emit_invalid_config_alarm() -> None:
"""Push a custom CloudWatch metric so the ops alarm fires within 60 seconds."""
try:
_cloudwatch.put_metric_data(
Namespace="MangaAssist/ModelRouting",
MetricData=[
{
"MetricName": "InvalidConfigRejected",
"Value": 1.0,
"Unit": "Count",
}
],
)
except Exception as cw_exc:
logger.error("Failed to emit InvalidConfigRejected metric: %s", cw_exc)
def get_validated_model_config() -> dict:
"""
Fetch AppConfig model routing with schema validation.
Rejects and discards invalid documents; falls back to last known good config
and emits a CloudWatch alarm metric.
"""
global _session_token, _last_known_good_config, _last_poll_time
now = time.monotonic()
if _last_known_good_config and (now - _last_poll_time) < POLL_INTERVAL_SECONDS:
return _last_known_good_config
if _session_token is None:
resp = _appconfig_client.start_configuration_session(
ApplicationIdentifier=os.environ["APPCONFIG_APP"],
EnvironmentIdentifier=os.environ["APPCONFIG_ENV"],
ConfigurationProfileIdentifier=os.environ["APPCONFIG_PROFILE"],
RequiredMinimumPollIntervalInSeconds=POLL_INTERVAL_SECONDS,
)
_session_token = resp["InitialConfigurationToken"]
resp = _appconfig_client.get_latest_configuration(
ConfigurationToken=_session_token
)
_session_token = resp["NextPollConfigurationToken"]
body = resp["Configuration"].read()
if body:
try:
candidate = json.loads(body)
validate_model_routing_config(candidate) # Raises if invalid
_last_known_good_config = candidate
_last_poll_time = now
logger.info("New valid config accepted: %s", json.dumps(candidate))
except (json.JSONDecodeError, ValidationError) as exc:
logger.error(
"INVALID AppConfig document rejected — keeping last known good: %s", exc
)
_emit_invalid_config_alarm()
# Do NOT update _last_known_good_config — the bad doc is discarded
else:
_last_poll_time = now
if _last_known_good_config is None:
raise RuntimeError("No valid model routing config available and no fallback")
return _last_known_good_config
Prevention
- Attach an AWS AppConfig JSON Schema validator to the `model-routing` configuration profile (a sketch of attaching the validator follows this list). AppConfig will then reject documents that fail schema validation at deploy time, before they ever reach Lambda.
- Add the `ci_validate_config_file()` step to the GitHub Actions workflow that deploys AppConfig changes. The pipeline fails (and deployment is blocked) if schema validation fails.
- In Lambda, always use `get_validated_model_config()` instead of parsing AppConfig JSON directly. The schema check on every config update acts as a defence-in-depth layer in case the AppConfig validator is ever bypassed.
- Create a CloudWatch alarm on `MangaAssist/ModelRouting → InvalidConfigRejected > 0` to page on-call the moment a bad document is detected, even if Lambda itself has not failed (because it fell back gracefully).
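A sketch of attaching the validator when the configuration profile is created; the application ID is a placeholder, MODEL_ROUTING_SCHEMA is the schema dict from Runbook 3, and an existing profile can be updated with update_configuration_profile instead.

# Sketch: attach MODEL_ROUTING_SCHEMA (defined in Runbook 3) as a JSON_SCHEMA validator
# so AppConfig rejects non-conforming documents at deployment time.
# ApplicationId is a placeholder for the real MangaAssist application.
import json
import boto3

appconfig = boto3.client("appconfig")

appconfig.create_configuration_profile(
    ApplicationId="abc1234",   # placeholder
    Name="model-routing",
    LocationUri="hosted",
    Validators=[
        {"Type": "JSON_SCHEMA", "Content": json.dumps(MODEL_ROUTING_SCHEMA)},
    ],
)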
Scenario 4: Provider Switch via Lambda Env Var — In-Flight Version Skew Serving Different Models
Problem
The platform team decides to switch MangaAssist's default model from Sonnet to Haiku across all Lambdas. Instead of updating the AppConfig document (which would propagate to all warm instances within one poll cycle), an engineer updates the DEFAULT_MODEL_ID environment variable on the Lambda function. AWS Lambda performs a rolling update: new invocations run the new version (Haiku), but in-flight requests on existing execution environments finish on the old version (Sonnet). During the ~10-minute rollout window, a session's first turn is answered by Haiku and the second turn by Sonnet, producing noticeably inconsistent response styles, persona, and reasoning depth. Some users submit quality complaints; the support team cannot reproduce the issue because by then all instances have updated.
Detection
flowchart TD
A["Support tickets:<br/>'Chatbot changed personality<br/>mid-conversation'"] --> B{"Check CloudWatch Logs<br/>for session_id across invocations"}
B -->|"Same session_id,<br/>different model_id<br/>values in logs"| C["Model skew confirmed:<br/>multiple models serving<br/>same session"]
B -->|"Consistent model_id<br/>across session"| D["Not a model skew issue —<br/>check prompt or context"]
C --> E{"When did mixed<br/>model_id start appearing?"}
E -->|"~T-10 minutes:<br/>Lambda env var update"| F["ROOT CAUSE:<br/>Rolling Lambda update<br/>via env var caused<br/>in-flight version skew"]
F --> G{"All instances<br/>updated now?"}
G -->|"Yes — skew resolved"| H["Incident closed —<br/>implement prevention"]
G -->|"No — skew still active"| I["Force all execution\nenvironments to update:\nDeploy dummy code change\nor increase Lambda concurrency\nto drain old environments"]
H --> J["Migrate to AppConfig-based<br/>switching — Runbook 4"]
I --> J
Root Cause
Lambda environment variable updates trigger a rolling deployment: AWS gradually replaces execution environments with new ones carrying the updated env var. Existing warm environments continue serving requests until they are recycled. During this window (typically 5-15 minutes under moderate load), the Lambda function effectively has two "versions" of its configuration running simultaneously. Because DEFAULT_MODEL_ID was read at module import time, it was frozen into each execution environment at cold start; and since an environment's process environment never changes after creation, even re-reading os.environ at request time would not pick up the update. Only environment recycling propagates an env var change, so the mechanism is simply not designed for seamless, zero-skew config propagation.
Resolution
"""
Runbook 4: Replace Lambda env var model switching with AppConfig atomic flag.
Demonstrates:
(a) Why env var switching causes skew (the broken pattern)
(b) AppConfig-based switching that propagates atomically to all warm instances
within one poll cycle (~60 seconds) with no rolling deployment
"""
import json
import logging
import os
import time
import boto3
logger = logging.getLogger("manga_model_switch")
logger.setLevel(logging.INFO)
# ── BROKEN PATTERN: env var read at module import time ───────────────────────
# This value is frozen at cold start for the lifetime of the execution env.
# When an env var update is deployed, old execution envs retain the old value
# until they are recycled.
_BROKEN_MODEL_ID = os.environ.get(
"DEFAULT_MODEL_ID",
"anthropic.claude-3-sonnet-20240229-v1:0",
)
def _broken_get_model_id_from_env() -> str:
"""Anti-pattern: returns the value captured at import time."""
return _BROKEN_MODEL_ID # Stale for old execution environments during rollout
# ── CORRECT PATTERN: AppConfig-backed switching ───────────────────────────────
_appconfig = boto3.client("appconfigdata", region_name=os.environ.get("AWS_REGION", "us-east-1"))
_session_token: str | None = None
_config_cache: dict | None = None
_last_poll: float = 0.0
POLL_SECONDS = int(os.environ.get("APPCONFIG_POLL_SECONDS", "60"))
def _get_appconfig_routing() -> dict:
"""Session-token-cached AppConfig fetch (same pattern as Runbooks 1–3)."""
global _session_token, _config_cache, _last_poll
now = time.monotonic()
if _config_cache and (now - _last_poll) < POLL_SECONDS:
return _config_cache
if _session_token is None:
resp = _appconfig.start_configuration_session(
ApplicationIdentifier=os.environ["APPCONFIG_APP"],
EnvironmentIdentifier=os.environ["APPCONFIG_ENV"],
ConfigurationProfileIdentifier=os.environ["APPCONFIG_PROFILE"],
RequiredMinimumPollIntervalInSeconds=POLL_SECONDS,
)
_session_token = resp["InitialConfigurationToken"]
resp = _appconfig.get_latest_configuration(ConfigurationToken=_session_token)
_session_token = resp["NextPollConfigurationToken"]
body = resp["Configuration"].read()
if body:
_config_cache = json.loads(body)
_last_poll = now
logger.info("AppConfig refreshed: default_model=%s", _config_cache.get("default_model_id"))
return _config_cache
def get_model_id_for_session(session_id: str, intent: str) -> str:
"""
Return the model ID from AppConfig, read at request time (not import time).
When an operator flips `default_model_id` in AppConfig, ALL warm Lambda
instances pick up the change within POLL_SECONDS — no rolling deployment,
no version skew, no in-flight inconsistency.
"""
config = _get_appconfig_routing()
if config.get("cost_override_active", False):
model_id = "anthropic.claude-3-haiku-20240307-v1:0"
else:
model_id = config.get("intent_overrides", {}).get(
intent, config["default_model_id"]
)
logger.info(
"session=%s intent=%s resolved_model=%s", session_id, intent, model_id
)
return model_id
def lambda_handler(event: dict, context) -> dict:
"""Inference handler: model ID resolved at request time from AppConfig."""
bedrock = boto3.client("bedrock-runtime", region_name=os.environ.get("AWS_REGION", "us-east-1"))
session_id = event["session_id"]
intent = event.get("intent", "general_qa")
user_message = event["message"]
model_id = get_model_id_for_session(session_id, intent)
payload = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"messages": [{"role": "user", "content": user_message}],
})
response = bedrock.invoke_model(
modelId=model_id,
body=payload,
contentType="application/json",
accept="application/json",
)
answer = json.loads(response["body"].read())["content"][0]["text"]
# Emit model_id to CloudWatch Logs for skew detection queries
logger.info(
json.dumps({
"event": "inference_complete",
"session_id": session_id,
"model_id": model_id,
"intent": intent,
})
)
return {"session_id": session_id, "answer": answer, "model_id": model_id}
Prevention
- Establish a team policy: model selection changes are made exclusively via AppConfig, never via Lambda environment variables. Environment variables are reserved for infrastructure configuration (region, concurrency limits) that genuinely requires a redeployment boundary.
- Add a CloudWatch Logs Insights query that detects model skew within a session: `fields session_id, model_id | stats count_distinct(model_id) as unique_models by session_id | filter unique_models > 1 | sort unique_models desc`. Run this query on a schedule and alarm if any session with `unique_models > 1` appears in the last 5 minutes.
- When a model switch is needed urgently, have the operator update the AppConfig document through a self-service Slack slash command (`/manga set-model haiku`) backed by an API Gateway → Lambda integration that writes to AppConfig (a sketch of the backing Lambda follows this list), eliminating ad-hoc console changes that bypass change tracking.
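A minimal sketch of the Lambda behind the slash command; the AppConfig IDs, the alias map, and how the current document is fetched are placeholders, and a real handler would also verify the Slack request signature before acting.

# Sketch: /manga set-model <alias> rewrites default_model_id and deploys it via AppConfig.
# ApplicationId / ConfigurationProfileId / EnvironmentId are placeholders.
import json
import boto3

appconfig = boto3.client("appconfig")

MODEL_ALIASES = {
    "haiku": "anthropic.claude-3-haiku-20240307-v1:0",
    "sonnet": "anthropic.claude-3-sonnet-20240229-v1:0",
}

def set_model_handler(event: dict, context) -> dict:
    alias = event["alias"]                              # parsed from the slash command text
    document = json.loads(event["current_config"])      # current routing doc, fetched upstream
    document["default_model_id"] = MODEL_ALIASES[alias]

    version = appconfig.create_hosted_configuration_version(
        ApplicationId="abc1234",
        ConfigurationProfileId="def5678",
        Content=json.dumps(document).encode("utf-8"),
        ContentType="application/json",
    )["VersionNumber"]

    appconfig.start_deployment(
        ApplicationId="abc1234",
        EnvironmentId="ghi9012",
        DeploymentStrategyId="AppConfig.AllAtOnce",      # predefined strategy
        ConfigurationProfileId="def5678",
        ConfigurationVersion=str(version),
    )
    return {"status": "deployment started", "default_model_id": document["default_model_id"]}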
Scenario 5: Routing Rule Conflict — Cost-Based and Latency-Based Routers Both Apply to Same Message Type
Problem
MangaAssist's routing layer was extended with two independent routing strategies: a cost-based router (routes product_recommendation intents to Haiku to save cost) and a latency-based router (routes product_recommendation intents to Sonnet because its response quality reduces follow-up messages, lowering total latency). Both rules match the product_recommendation intent. Depending on which router runs last in the chain, the model selection is non-deterministic. Over a two-week period, roughly half the product recommendation queries are answered by Haiku and half by Sonnet with no consistent logic. A/B test metrics are meaningless. The engineering team cannot attribute quality regressions to a specific model.
Detection
flowchart TD
A["Model usage audit:<br/>product_recommendation intent<br/>split ~50/50 Haiku vs Sonnet"] --> B{"Check routing rule<br/>evaluation order in AppConfig"}
B -->|"Both cost-router AND<br/>latency-router rules<br/>match same intent"| C["Routing rule conflict<br/>confirmed"]
B -->|"Single rule matches"| D["Audit logging gap —<br/>check which router<br/>is executing"]
C --> E{"Check rule<br/>priority field"}
E -->|"Both rules have<br/>same priority value<br/>or no priority field"| F["ROOT CAUSE:<br/>Ambiguous priority —<br/>last rule wins<br/>(non-deterministic under<br/>concurrent updates)"]
E -->|"Priority field present<br/>but both rules set<br/>to same value"| F
F --> G["Apply Runbook 5:<br/>Priority-ordered rule<br/>evaluation with<br/>conflict detection"]
G --> H["Audit historical logs<br/>to assess which sessions<br/>were affected — file<br/>data quality incident"]
Root Cause
The routing configuration stored in AppConfig included two separate rule arrays — cost_rules and latency_rules — evaluated by separate router classes that were composed in a chain. When a product_recommendation message arrived, both routers independently returned a model selection. The composer used a simple last_write_wins merge with no priority field or conflict detection. The order of evaluation was determined by Python dict iteration order, which is insertion-ordered but not semantically meaningful. A recent config change silently swapped the insertion order, flipping which router won. The result was silent non-determinism that persisted for 14 days before anyone noticed via an A/B metric anomaly.
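The broken composition looked roughly like this (a simplified reconstruction; the real router classes are larger, and select() stands in for their matching logic):

# Simplified reconstruction: whichever router happens to be iterated last silently
# overwrites the other's choice for the same intent.
def broken_compose(routers: dict, intent: str, default_model_id: str) -> str:
    selection = default_model_id
    for router in routers.values():    # dict insertion order, not explicit priority
        choice = router.select(intent)  # each router returns a model ID or None
        if choice is not None:
            selection = choice          # last write wins
    return selection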
Resolution
"""
Runbook 5: Priority-ordered routing rule evaluation with conflict detection.
Replaces independent cost/latency router composition with a single unified
rule engine that:
1. Evaluates all rules and collects all matches
2. Detects conflicts (multiple matches for same intent)
3. Resolves via explicit priority field
4. Emits a CloudWatch alarm if a conflict is detected (so rules are fixed)
"""
import json
import logging
import os
from dataclasses import dataclass
from typing import Any
import boto3
logger = logging.getLogger("manga_routing_engine")
logger.setLevel(logging.INFO)
_cloudwatch = boto3.client("cloudwatch", region_name=os.environ.get("AWS_REGION", "us-east-1"))
@dataclass(frozen=True)
class RoutingRule:
"""A single model routing rule with an explicit priority."""
rule_id: str
strategy: str # "cost" | "latency" | "quality"
intent_pattern: str # Exact intent string this rule matches
model_id: str # Bedrock model ID to use
priority: int # Lower number = higher priority (1 = highest)
enabled: bool
def parse_routing_rules(config_doc: dict) -> list[RoutingRule]:
"""
Parse AppConfig document into RoutingRule objects.
Expected config shape:
{
"default_model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"cost_override_active": false,
"routing_rules": [
{
"rule_id": "cost-haiku-product-rec",
"strategy": "cost",
"intent_pattern": "product_recommendation",
"model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"priority": 2,
"enabled": true
},
{
"rule_id": "latency-sonnet-product-rec",
"strategy": "latency",
"intent_pattern": "product_recommendation",
"model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
"priority": 1,
"enabled": true
}
]
}
"""
rules = []
for raw in config_doc.get("routing_rules", []):
rules.append(
RoutingRule(
rule_id=raw["rule_id"],
strategy=raw["strategy"],
intent_pattern=raw["intent_pattern"],
model_id=raw["model_id"],
priority=raw["priority"],
enabled=raw.get("enabled", True),
)
)
return rules
def _emit_conflict_metric(intent: str, conflicting_rule_ids: list[str]) -> None:
"""Emit a CloudWatch metric when a routing conflict is detected."""
try:
_cloudwatch.put_metric_data(
Namespace="MangaAssist/ModelRouting",
MetricData=[
{
"MetricName": "RoutingRuleConflict",
"Dimensions": [{"Name": "Intent", "Value": intent}],
"Value": 1.0,
"Unit": "Count",
}
],
)
logger.warning(
"Routing conflict detected for intent=%s rules=%s",
intent,
conflicting_rule_ids,
)
except Exception as exc:
logger.error("Failed to emit RoutingRuleConflict metric: %s", exc)
def resolve_model_for_intent(
intent: str,
routing_rules: list[RoutingRule],
default_model_id: str,
cost_override_active: bool = False,
) -> tuple[str, str]:
"""
Deterministically resolve a model ID for a given intent.
Returns: (model_id, winning_rule_id_or_'default')
Resolution algorithm:
1. If cost_override_active → always return Haiku
2. Filter to enabled rules matching intent
3. If zero matches → return default_model_id
4. If one match → return that rule's model_id
5. If multiple matches → emit conflict alarm, use lowest priority number (highest priority)
"""
if cost_override_active:
logger.info("Cost override active — routing intent=%s to Haiku", intent)
return "anthropic.claude-3-haiku-20240307-v1:0", "cost_override"
matching_rules = [
r for r in routing_rules if r.enabled and r.intent_pattern == intent
]
if not matching_rules:
logger.info(
"No routing rule matched intent=%s — using default=%s",
intent,
default_model_id,
)
return default_model_id, "default"
if len(matching_rules) > 1:
# Conflict detected — emit alarm, then resolve deterministically by priority
rule_ids = [r.rule_id for r in matching_rules]
_emit_conflict_metric(intent, rule_ids)
# Sort ascending by priority — lowest number wins
matching_rules = sorted(matching_rules, key=lambda r: r.priority)
logger.warning(
"Conflict resolved via priority: winning_rule=%s over %s",
matching_rules[0].rule_id,
rule_ids,
)
winner = matching_rules[0]
logger.info(
"Routing rule applied: intent=%s rule=%s strategy=%s model=%s",
intent,
winner.rule_id,
winner.strategy,
winner.model_id,
)
return winner.model_id, winner.rule_id
def lambda_handler(event: dict, context: Any) -> dict:
"""
MangaAssist inference Lambda using unified priority-ordered routing engine.
"""
# (AppConfig fetch omitted for brevity — use get_validated_model_config from Runbook 3)
config = json.loads(os.environ.get("_TEST_CONFIG_OVERRIDE", "{}")) # test hook
rules = parse_routing_rules(config)
intent = event.get("intent", "general_qa")
default_model = config.get("default_model_id", "anthropic.claude-3-haiku-20240307-v1:0")
cost_override = config.get("cost_override_active", False)
model_id, rule_id = resolve_model_for_intent(
intent, rules, default_model, cost_override
)
# Structured log for downstream audit queries
logger.info(
json.dumps({
"event": "model_resolved",
"session_id": event.get("session_id"),
"intent": intent,
"model_id": model_id,
"winning_rule": rule_id,
})
)
return {"model_id": model_id, "winning_rule": rule_id}
Prevention
- Every routing rule document must have a `priority` field (integer, unique within intent scope). AppConfig JSON Schema validation (from Scenario 3 prevention) should enforce `"required": ["priority"]` on every rule object.
- Add a `validate_routing_rules.py` CI script (sketched after this list) that, after schema validation, checks for intent-scope conflicts and fails the pipeline if any two enabled rules share the same `intent_pattern` without distinct `priority` values.
- Create a CloudWatch alarm on `MangaAssist/ModelRouting → RoutingRuleConflict > 0`. This fires even when the conflict is resolved gracefully, ensuring the team is notified to fix the rule duplication rather than silently accepting the priority fallback.
- Use the `winning_rule` structured log field and a CloudWatch Logs Insights dashboard to continuously monitor rule usage distribution. An unexpected 50/50 split across two rules for the same intent is a leading indicator of a conflict that will surface on the next config update.
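A minimal sketch of the CI conflict check referenced above; it assumes the routing_rules document shape shown in Runbook 5.

# validate_routing_rules.py: run in CI after schema validation; exits non-zero if two
# enabled rules target the same intent with the same priority.
import json
import sys
from collections import defaultdict

def find_conflicts(config_doc: dict) -> list[str]:
    grouped: dict[tuple[str, int], list[str]] = defaultdict(list)
    for rule in config_doc.get("routing_rules", []):
        if rule.get("enabled", True):
            grouped[(rule["intent_pattern"], rule["priority"])].append(rule["rule_id"])
    return [
        f"intent={intent} priority={priority}: {rule_ids}"
        for (intent, priority), rule_ids in grouped.items()
        if len(rule_ids) > 1
    ]

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as handle:
        conflicts = find_conflicts(json.load(handle))
    for conflict in conflicts:
        print(f"[FAIL] ambiguous routing rules: {conflict}")
    sys.exit(1 if conflicts else 0)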