US-02: Intent Classifier Cost Optimization
User Story
As an ML infrastructure engineer, I want to reduce SageMaker inference costs for intent classification by maximizing rule-based pre-filtering, so that only ambiguous messages require the ML model, cutting inference spend by 50-70%.
Acceptance Criteria
- Rule-based pre-filter handles ≥ 60% of messages without invoking SageMaker.
- SageMaker endpoint uses auto-scaling with scale-to-zero during off-peak hours.
- Batch inference is used for non-real-time classification tasks (analytics re-classification).
- Intent classification results are cached per session to avoid re-classifying follow-up messages with obvious intents.
- Overall intent classifier cost decreases by 50-70%.
High-Level Design
Cost Problem
SageMaker real-time endpoints bill per instance-hour whether or not they serve traffic. An ml.m5.xlarge instance costs ~$0.27/hour (us-east-1; validate per region), roughly $196/month per instance. With 2 instances for HA and auto-scaling peaks, the baseline is $400-600/month, even during zero-traffic hours.
The two-stage intent classifier (LLD-2) already describes a rule-based pre-filter, but cost optimization requires pushing its coverage higher and right-sizing the ML fallback.
Optimization Architecture
```mermaid
graph TD
    A[User Message] --> B[Rule-Based<br>Pre-filter v2]
    B -->|confidence >= 0.85<br>~65% of traffic| C[Return Intent<br>Zero ML cost]
    B -->|confidence < 0.85| D{Session<br>Cache?}
    D -->|Same intent context<br>in last 3 turns| E[Reuse Cached<br>Intent]
    D -->|New context| F[SageMaker<br>BERT Classifier]
    F --> G[Cache Result<br>in Session Memory]
    G --> H[Return Intent]
    E --> H
    C --> H
    subgraph "Off-Peak"
        F -.->|scale to zero| I[Serverless Inference<br>or cold start]
    end
    style C fill:#2d8,stroke:#333
    style E fill:#2d8,stroke:#333
    style I fill:#fd2,stroke:#333
```
Cost Breakdown
| Scenario | Monthly Cost | Notes |
|---|---|---|
| Baseline (always-on 2x ml.m5.xlarge) | ~$500 | 24/7 regardless of traffic |
| With auto-scaling (min 0, max 4) | ~$200 | Scale to zero off-peak |
| With rule-based pre-filter (65% bypass) | ~$70 | Only 35% of traffic hits ML |
| With session intent cache (additional 10%) | ~$55 | 25% of traffic hits ML |
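The table's figures can be reproduced with a back-of-envelope model. A minimal sketch, assuming ~$0.27/hr per ml.m5.xlarge, a 14-hour peak window, and the bypass rates above (all planning assumptions, not price quotes):

```python
# Back-of-envelope cost model for the scenarios in the table above.
# All rates are planning assumptions; validate against current AWS pricing.
HOURLY_RATE = 0.269      # ml.m5.xlarge, us-east-1 (assumed)
HOURS_PER_MONTH = 730

def always_on(instances: int) -> float:
    return instances * HOURLY_RATE * HOURS_PER_MONTH

def with_autoscaling(peak_instances: int, peak_hours_per_day: int) -> float:
    # Scale-to-zero off-peak: pay only for peak instance-hours.
    return peak_instances * HOURLY_RATE * peak_hours_per_day * 30

def with_bypass(base: float, ml_fraction: float) -> float:
    # Instance-hours scale roughly with the fraction of traffic hitting ML.
    return base * ml_fraction

baseline = always_on(2)                      # ~$393; table's ~$500 adds auto-scaling/multi-AZ buffer
autoscaled = with_autoscaling(2, 14)         # ~$226 (9am-11pm JST); table rounds to ~$200
prefiltered = with_bypass(autoscaled, 0.35)  # ~$79 with 65% rule bypass
cached = with_bypass(autoscaled, 0.25)       # ~$56 with session cache on top
print(f"{baseline:.0f} {autoscaled:.0f} {prefiltered:.0f} {cached:.0f}")
```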
Low-Level Design
1. Enhanced Rule-Based Pre-filter
Expand pattern coverage beyond the original LLD-2 design to catch more intents deterministically.
```mermaid
graph TD
    A[User Message] --> B[Normalize Text<br>lowercase, strip punctuation]
    B --> C[Keyword Matcher<br>Aho-Corasick multi-pattern]
    C --> D{Match found?}
    D -->|Yes| E[Apply Rule<br>Confidence Scoring]
    E --> F{Confidence >= 0.85?}
    F -->|Yes| G[Extract Entities<br>Regex NER]
    G --> H[Return Intent]
    F -->|No| I[Fall through to ML]
    D -->|No| I
    style H fill:#2d8,stroke:#333
    style I fill:#fd2,stroke:#333
```
Expanded Rule Patterns
```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleMatch:
    intent: str
    confidence: float
    entities: dict
    rule_id: str


class IntentRuleEngine:
    """Deterministic intent classification using pattern matching."""

    def __init__(self):
        self.rules = self._build_rules()

    def classify(self, message: str) -> Optional[RuleMatch]:
        normalized = message.lower().strip()
        # Rules are evaluated in declaration order; the first match wins.
        for rule in self.rules:
            match = rule["pattern"].search(normalized)
            if match:
                entities = rule["entity_extractor"](normalized, match)
                return RuleMatch(
                    intent=rule["intent"],
                    confidence=rule["confidence"],
                    entities=entities,
                    rule_id=rule["id"],
                )
        return None

    def _build_rules(self) -> list[dict]:
        # \b guards prevent substring hits ("ty" in "party", "deal" in "ideal").
        return [
            # Order tracking — high confidence patterns
            {
                "id": "R001",
                "intent": "order_tracking",
                "confidence": 0.95,
                "pattern": re.compile(
                    r"(where\s+is\s+my\s+order|track\s*(my)?\s*order|order\s+status"
                    r"|delivery\s+status|when\s+will\s+(it|my\s+order)\s+arrive"
                    r"|shipping\s+update|has\s+my\s+order\s+shipped)"
                ),
                "entity_extractor": self._extract_order_entities,
            },
            # Return request
            {
                "id": "R002",
                "intent": "return_request",
                "confidence": 0.92,
                "pattern": re.compile(
                    r"\b(i\s+want\s+to\s+return|return\s+(this|my|the)"
                    r"|damaged|defective|wrong\s+item|refund"
                    r"|send\s*(it)?\s*back|exchange\s+this)\b"
                ),
                "entity_extractor": self._extract_return_entities,
            },
            # Escalation
            {
                "id": "R003",
                "intent": "escalation",
                "confidence": 0.98,
                "pattern": re.compile(
                    r"(talk\s+to\s+(a\s+)?(human|agent|person|representative|someone)"
                    r"|speak\s+to\s+(a\s+)?manager|customer\s+service"
                    r"|real\s+person|live\s+agent)"
                ),
                "entity_extractor": lambda msg, m: {},
            },
            # Chitchat — greetings
            {
                "id": "R004",
                "intent": "chitchat",
                "confidence": 0.95,
                "pattern": re.compile(
                    r"^(hi|hello|hey|good\s+(morning|afternoon|evening)"
                    r"|howdy|yo|what'?s?\s+up)\b"
                ),
                "entity_extractor": lambda msg, m: {"sub_intent": "greeting"},
            },
            # Chitchat — thanks
            {
                "id": "R005",
                "intent": "chitchat",
                "confidence": 0.95,
                "pattern": re.compile(
                    r"\b(thanks?(\s+you)?|thank\s+you|thx|ty|appreciate\s+it|great\s+help)\b"
                ),
                "entity_extractor": lambda msg, m: {"sub_intent": "thanks"},
            },
            # Chitchat — goodbye
            {
                "id": "R006",
                "intent": "chitchat",
                "confidence": 0.95,
                "pattern": re.compile(r"\b(bye|goodbye|see\s+you|later|that'?s\s+all)\b"),
                "entity_extractor": lambda msg, m: {"sub_intent": "goodbye"},
            },
            # Promotion inquiry
            {
                "id": "R007",
                "intent": "promotion",
                "confidence": 0.90,
                "pattern": re.compile(
                    r"\b(deal|discount|sale|promo|coupon|offer|special\s+price"
                    r"|on\s+sale|price\s+drop|bargain)\b"
                ),
                "entity_extractor": lambda msg, m: {},
            },
            # FAQ — policies
            {
                "id": "R008",
                "intent": "faq",
                "confidence": 0.88,
                "pattern": re.compile(
                    r"(return\s+policy|shipping\s+policy|how\s+long\s+to\s+ship"
                    r"|free\s+shipping|payment\s+method|gift\s+wrap)"
                ),
                "entity_extractor": lambda msg, m: {"topic": "policy"},
            },
            # Product discovery
            {
                "id": "R009",
                "intent": "product_discovery",
                "confidence": 0.85,
                "pattern": re.compile(
                    r"\b(show\s+me|browse|what'?s?\s+(popular|trending|new|hot)"
                    r"|best\s+sell(er|ing)|top\s+manga|new\s+release)\b"
                ),
                "entity_extractor": self._extract_discovery_entities,
            },
        ]

    def _extract_order_entities(self, msg: str, match) -> dict:
        # Amazon-style 3-7-7 order IDs, or any bare run of 10+ digits.
        order_match = re.search(r"#?(\d{3}-\d{7}-\d{7}|\d{10,})", msg)
        return {"order_id": order_match.group(1) if order_match else None}

    def _extract_return_entities(self, msg: str, match) -> dict:
        asin_match = re.search(r"\b(B[0-9A-Z]{9})\b", msg.upper())
        return {
            "asin": asin_match.group(1) if asin_match else None,
            "reason": "damaged" if "damage" in msg else "other",
        }

    def _extract_discovery_entities(self, msg: str, match) -> dict:
        genres = [
            "horror", "action", "romance", "comedy", "fantasy", "sci-fi",
            "shonen", "shojo", "seinen", "josei", "isekai", "slice of life",
        ]
        found = [g for g in genres if g in msg]
        return {"genre": found[0] if found else None}
```
2. Session Intent Cache
Avoid re-classifying when the user is clearly continuing the same topic.
```mermaid
sequenceDiagram
    participant User
    participant Orchestrator
    participant IntentCache
    participant RuleEngine
    participant SageMaker
    User->>Orchestrator: "Show me action manga"
    Orchestrator->>RuleEngine: Classify
    RuleEngine-->>Orchestrator: product_discovery (0.85)
    Orchestrator->>IntentCache: Store (session, product_discovery)
    User->>Orchestrator: "What about horror ones?"
    Orchestrator->>IntentCache: Check session context
    IntentCache-->>Orchestrator: Last intent = product_discovery
    Note over Orchestrator: "What about X ones" is a<br>continuation pattern
    Orchestrator->>Orchestrator: Reuse intent = product_discovery<br>Update entity: genre=horror
    Note over Orchestrator: Skipped both RuleEngine<br>and SageMaker
```
Code Example: Session Intent Cache
```python
import re
import time
from dataclasses import dataclass
from typing import Optional

# Patterns signalling that the user is continuing the previous topic.
CONTINUATION_PATTERNS = [
    r"^(what\s+about|how\s+about|and|also|any)\b",
    r"^(show\s+me\s+more|more\s+like|similar)",
    r"^(the\s+(first|second|third|last)\s+one)",
    r"^(yes|yeah|sure|ok|that\s+one)",
]


@dataclass
class CachedIntent:
    intent: str
    confidence: float
    entities: dict
    timestamp: float
    turn_index: int


class SessionIntentCache:
    """Cache intent classification within a session to avoid redundant ML calls."""

    MAX_AGE_SECONDS = 90  # TTL aligned with chat-turn cadence (see TTL alignment with US-03/US-05)
    MAX_TURN_GAP = 3      # Reuse only if cached within the last 3 turns

    def __init__(self):
        self._cache: dict[str, CachedIntent] = {}

    def get_cached_intent(
        self,
        session_id: str,
        message: str,
        current_turn: int,
    ) -> Optional[CachedIntent]:
        cached = self._cache.get(session_id)
        if cached is None:
            return None
        # Check staleness: elapsed time first, then turn gap.
        if time.time() - cached.timestamp > self.MAX_AGE_SECONDS:
            del self._cache[session_id]
            return None
        if current_turn - cached.turn_index > self.MAX_TURN_GAP:
            return None
        # Reuse only when the message is an explicit continuation.
        if not self._is_continuation(message):
            return None
        return cached

    def store(
        self,
        session_id: str,
        intent: str,
        confidence: float,
        entities: dict,
        turn_index: int,
    ) -> None:
        self._cache[session_id] = CachedIntent(
            intent=intent,
            confidence=confidence,
            entities=entities,
            timestamp=time.time(),
            turn_index=turn_index,
        )

    def _is_continuation(self, message: str) -> bool:
        normalized = message.lower().strip()
        return any(re.search(p, normalized) for p in CONTINUATION_PATTERNS)
```
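A usage sketch of the cache path in the orchestrator (session ID and turn numbers are illustrative):

```python
cache = SessionIntentCache()

# Turn 1: the rule engine (or SageMaker) classified the message.
cache.store("sess-42", intent="product_discovery", confidence=0.85,
            entities={"genre": "action"}, turn_index=1)

# Turn 2: "what about horror ones?" matches a continuation pattern,
# so the cached intent is reused and no classifier is invoked.
hit = cache.get_cached_intent("sess-42", "what about horror ones?", current_turn=2)
if hit is not None:
    print(hit.intent)  # product_discovery
```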
3. SageMaker Auto-Scaling with Scale-to-Zero
```mermaid
graph TD
    subgraph "Traffic Pattern"
        A[Peak Hours<br>9am-11pm JST] -->|high traffic| B[2-4 instances<br>ml.m5.xlarge]
        C[Off-Peak Hours<br>11pm-9am JST] -->|low traffic| D[0-1 instances<br>Scale to zero]
    end
    subgraph "Scaling Policy"
        E[CloudWatch Alarm<br>InvocationsPerInstance] --> F{> 100/min?}
        F -->|Yes| G[Scale Out<br>+1 instance]
        F -->|No| H{< 5/min for 15 min?}
        H -->|Yes| I[Scale In<br>-1 instance]
        H -->|No| J[Hold]
    end
    subgraph "Serverless Fallback"
        D -.->|cold start ~30s| K[Serverless Inference<br>Endpoint]
        K -->|warm| L[Fast inference]
    end
```
Code Example: SageMaker Auto-Scaling Configuration
```python
import boto3


def configure_intent_classifier_autoscaling(
    endpoint_name: str = "manga-intent-classifier",
    min_instances: int = 0,
    max_instances: int = 4,
) -> None:
    """Configure auto-scaling for the SageMaker intent classifier endpoint."""
    aas_client = boto3.client("application-autoscaling")
    resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

    # Register scalable target with min=0 (scale to zero).
    # NOTE: MinCapacity=0 is not accepted for every endpoint type; scale-to-zero
    # support varies by inference option. Validate in the target account/region.
    aas_client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=min_instances,
        MaxCapacity=max_instances,
    )

    # Scale-out policy: add instances when invocations are high.
    aas_client.put_scaling_policy(
        PolicyName="intent-classifier-scale-out",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 100.0,  # invocations per instance per minute
            "CustomizedMetricSpecification": {
                "MetricName": "InvocationsPerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": "AllTraffic"},
                ],
                "Statistic": "Average",
            },
            "ScaleInCooldown": 300,  # wait 5 min before scaling in
            "ScaleOutCooldown": 60,  # wait 1 min before scaling out
        },
    )


def configure_serverless_fallback(
    model_name: str = "manga-intent-classifier",
    memory_size_mb: int = 2048,
    max_concurrency: int = 10,
) -> dict:
    """Create a serverless inference endpoint config as cold-start fallback."""
    sm_client = boto3.client("sagemaker")
    config_name = f"{model_name}-serverless-config"
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "ServerlessConfig": {
                    "MemorySizeInMB": memory_size_mb,
                    "MaxConcurrency": max_concurrency,
                },
            }
        ],
    )
    return {"endpoint_config": config_name, "type": "serverless"}
```
Monitoring and Metrics
| Metric | Target | Alert |
|---|---|---|
| Rule-based classification rate | ≥ 65% | < 55% |
| Session intent cache hit rate | ≥ 10% | < 5% |
| SageMaker active instances (off-peak) | 0 | > 1 for 30+ min |
| SageMaker p99 latency | < 50ms | > 100ms |
| Monthly SageMaker cost | ≤ $70 | > $150 |
| Classification accuracy | ≥ 92% | < 88% |
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Rule-based false positives | Wrong intent → wrong response | Log all rule matches and sample audit weekly; tune confidence thresholds |
| Scale-to-zero cold start latency | 20-30s delay on first request after scale-in | Use serverless inference as warm fallback; pre-warm at 8:30am JST (see the scheduled pre-warm sketch after this table) |
| Session cache returns stale intent | User switches topic but cache serves old intent | Only reuse on explicit continuation patterns; clear cache on topic-switch signals |
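One way to implement the pre-warm mitigation is a scheduled scaling action that raises minimum capacity just before peak. A sketch, reusing the endpoint/variant names from the auto-scaling example (cron hours are UTC: 8:30am JST = 23:30 UTC, 11pm JST = 14:00 UTC):

```python
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/manga-intent-classifier/variant/AllTraffic"

# Pre-warm: raise min capacity to 1 at 8:30am JST (23:30 UTC) before peak.
aas.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ScheduledActionName="prewarm-before-peak",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    Schedule="cron(30 23 * * ? *)",
    ScalableTargetAction={"MinCapacity": 1, "MaxCapacity": 4},
)

# Allow scale-to-zero again after peak (11pm JST = 14:00 UTC).
aas.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ScheduledActionName="allow-scale-to-zero-offpeak",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    Schedule="cron(0 14 * * ? *)",
    ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": 4},
)
```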
Deep Dive: Why This Works on a Manga Chatbot Workload
The intent classifier is conceptually small ($400–600/month is rounding error against Bedrock spend) but architecturally load-bearing: every other cost optimization in this collection depends on its output. US-01's template router, US-03's response cache keying, US-06's RAG bypass gate, and US-08's request prioritizer all branch on the intent label. Treating intent classification as a pure cost play misses the point — it is the precision floor for the entire cost-optimization stack.
Property 1: Manga-store intents are surface-form-distinguishable, not semantically subtle. Unlike open-domain assistants where "schedule a meeting" and "book a flight" might require fine semantic distinction, manga-chatbot intents have stable keyword skeletons: "where is volume N", "when does X come out", "is Y in English", "recommend me", "track my order". The reason 65% rule-based coverage is achievable is that the head of the intent distribution is keyword-driven; only the long tail (10–20% of traffic) needs ML. Sending 100% of traffic through SageMaker is paying for ML on queries a re.match would have classified with 0.98 confidence. The architectural assumption is that the regex set covers head-distribution intents at precision ≥ 0.95; the failure signal is rule-rate decline (a sudden drop from 65% → 50% means new query patterns have appeared and the rule set is stale).
Property 2: Conversation continuation is structurally signalled. Phrases like "what about", "more like that", "and the price", "any others" are explicit linguistic markers that the user is not changing intent. The session intent cache exploits this by reusing the prior intent label without re-running the classifier — the cache is keyed by session_id, gated by continuation regex, and TTL'd to 90 seconds (a typical chat-turn cadence). This is qualitatively different from response caching: it caches a 16-byte intent label, not a response. The cost saving is small (~10% of inference calls avoided) but the latency saving is significant (~50ms saved per cache hit on every continuation turn). The failure mode is topic-switch confusion — if the user pivots without a continuation marker, the cache must miss; this is why topic-switch heuristics (entity change, question-word change) explicitly invalidate the cache.
Property 3: Traffic is heavily diurnal and SageMaker pricing is hour-aligned. The MangaAssist load curve is concentrated 9am–11pm JST (~14 hours) and near-zero overnight. SageMaker realtime endpoints bill per instance-hour regardless of traffic; running 2 always-on ml.m5.xlarge instances during the 10 idle hours is the largest single waste in the original architecture. Scale-to-zero with auto-scaling reclaims this directly. The 30–60 second cold-start penalty is acceptable because (a) overnight traffic is mostly automated agents/test traffic, not paying customers, and (b) the rule-based path covers 65%+ of overnight traffic without ever touching SageMaker. Serverless inference is the warm-fallback for traffic that arrives during scale-from-zero — it has its own cold-start (~10s) but no idle billing.
Bottom line: the savings come from layering a cheap-and-precise filter (regex) in front of an expensive-and-flexible classifier (ML). The 25% of traffic that still needs ML is exactly the queries where ML is differentiated; the other 75% gets served by mechanisms whose cost is dominated by Lambda invocation overhead, not inference compute.
Real-World Validation
Industry Benchmarks & Case Studies
- Rasa / Dialogflow architecture pattern — Both major open-source NLU frameworks ship with a "rule policy" that fires before the ML policy. Rasa's published benchmarks show 60–75% intent coverage from rules alone on closed-domain assistants (booking, ordering, FAQ). The 65% rule-based target in this story sits in the middle of the published band.
- AWS SageMaker Serverless Inference docs — Cold-start latency is 30–60 seconds on first request after scale-to-zero (varies by container size), slightly above the story's "20-30s cold start" estimate; budget for up to 60s. Serverless billing model: pay per inference duration + memory, not idle time. Public AWS case studies (Cisco, Mueller) report 60–80% cost reduction vs realtime endpoints for spiky/diurnal workloads — consistent with this story's 50–70% target.
- AWS re:Invent 2023 session AIM304 (multi-tier inference) — Documents the rule-pre-filter + ML-fallback pattern with worked cost examples on a customer-support chatbot, showing 65–80% inference cost reduction.
- Google Dialogflow CX confidence-threshold guidance — Recommends 0.85 as the production confidence floor for intent labels; below this, fall through to a fallback intent or re-prompt. The 0.85 threshold in this story aligns with industry default.
- Internal cross-reference: Domain1-FM-Integration-Data-Compliance/ task files on intent routing — document the FM-vs-classifier trade-off; this story is the cheap-classifier operating point.
- Internal cross-reference: Optimization-Tradeoffs-User-Stories/ — covers the latency-vs-cost trade-off for inference; the cold-start penalty here is the known tax.
Math Validation
- ml.m5.xlarge: $0.269/hr (us-east-1, validate per region) × 2 instances × 730 hrs/month = ~$393/month always-on. Story baseline of "$400–600/month for 2× ml.m5.xlarge always-on" is within range, especially with auto-scaling buffer + multi-AZ overhead. ✅
- Optimized path: serverless inference at roughly $2 per 1M inferences (order-of-magnitude; validate against current serverless pricing) + Lambda overhead. At 250K SageMaker calls/day (25% of 1M) × 30 days = 7.5M calls/month → ~$15 at serverless inference rates + $40–60 in Lambda + $0 idle = ~$55–75/month. Story claim of "~$55/month optimized" is at the optimistic end; realistic band is $55–90 once observability tooling is included.
- Crossover analysis: scale-to-zero wins over always-on whenever idle hours/day > 6 (worked check below). MangaAssist's 10 idle hours puts it firmly in the scale-to-zero region.
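A worked check of the crossover, treating the daily cold-start/serverless-fallback overhead as a fixed cost (the ~$1.60/day overhead figure is an assumption chosen to match the 6-hour crossover, not a measured value):

```python
HOURLY_RATE = 0.269  # ml.m5.xlarge, us-east-1 (assumed)

def scale_to_zero_wins(idle_hours_per_day: float,
                       overhead_per_day: float = 1.6) -> bool:
    """Break-even check: savings from not running during idle hours must
    exceed the daily cold-start/fallback overhead (~$1.60 assumed, i.e.
    about 6 instance-hours at the rate above)."""
    savings = idle_hours_per_day * HOURLY_RATE
    return savings > overhead_per_day

print(scale_to_zero_wins(10))  # True: MangaAssist's 10 idle hours
print(scale_to_zero_wins(5))   # False: below the ~6-hour crossover
```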
Conservative vs Aggressive Savings Bounds
| Bound | Source | Total monthly savings |
|---|---|---|
| Conservative | Rasa-style 50% rule coverage + 5% session cache + scale-to-zero overnight only | ~45% (~$220/month) |
| Aggressive | 75% rule coverage + 15% session cache + full serverless migration | ~80% (~$480/month) |
| Story (projected) | Hand-curated 9-rule set + session cache + scale-to-zero | 50–70% (~$300–420/month), a realistic mid-band |
Cross-Story Interactions & Conflicts
This story is the authoritative side for several intent-precision contracts that other stories depend on.
- US-01 (LLM Tokens) — Authoritative side: this story. US-01's template router consumes the intent label and confidence score produced here. Conflict mode: if rule-based false positives slip past the 0.85 confidence threshold, US-01 will template-render wrong answers cheaply. Resolution: the intent precision floor is enforced here (≥ 0.92 weekly average); US-01's monitoring also tracks intent-driven misroute rate as a downstream signal. A weekly precision-recall audit on a labeled sample feeds back into rule tuning.
- US-03 (Caching) — See US-03. The session intent cache shares the Redis tier and uses the keyspace prefix intent:sess:. TTL alignment: the session intent cache TTL (90s) must be < the session-state TTL in US-05 (24h) so a stale intent can never outlive its session.
- US-08 (Traffic-Based) — See US-08. The request prioritizer in US-08 routes by intent + tier; if this classifier is unavailable (cold-start or scale-from-zero failure), US-08 must fall back to a "default to MEDIUM priority" policy rather than blocking. The shared availability signal: a CloudWatch alarm on intent-classifier-unavailable rate > 1% triggers US-08's degradation logic.
Rollback & Experimentation
Shadow-Mode Plan
- Run rule-based classifier in shadow for 2 weeks: every message goes to both the rule engine and SageMaker; rule decisions are logged but SageMaker output is served. Compare on a daily basis; promote each rule to "live" only if its agreement-with-ML rate is > 95% on at least 1,000 samples (see the agreement-check sketch after this list).
- Run scale-to-zero in shadow with one of the two endpoints for 1 week: leave one endpoint always-on as fallback; route 50% traffic to the scaled-to-zero endpoint and measure cold-start incidence and p99 latency.
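A minimal sketch of the shadow-mode agreement bookkeeping (the log schema and storage are assumed; production would aggregate from logged events rather than in-process counters):

```python
from collections import defaultdict

# Per-rule tallies accumulated from shadow-mode logs: each event records
# which rule fired and what SageMaker returned for the same message.
agreement = defaultdict(lambda: {"total": 0, "agree": 0})

def record_shadow_event(rule_id: str, rule_intent: str, ml_intent: str) -> None:
    agreement[rule_id]["total"] += 1
    agreement[rule_id]["agree"] += int(rule_intent == ml_intent)

def promotable_rules(min_samples: int = 1000, min_agreement: float = 0.95) -> list[str]:
    # Promote a rule to "live" only after >= 1,000 samples and > 95% agreement.
    return [
        rule_id for rule_id, t in agreement.items()
        if t["total"] >= min_samples and t["agree"] / t["total"] > min_agreement
    ]
```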
Canary Thresholds
- Rules: enable in 5%-traffic blocks per rule (R001 first, then R002, …). Each rule promoted to 100% only after 72h with false-positive rate < 2% on audit sample.
- Scale-to-zero: 10% of traffic for 48h, full after 1 week of stable cold-start metrics.
- Abort criteria (any one trips): rule-based misclassification rate > 5%, scale-from-zero cold-start exceeding 90s, session-cache stale-intent complaints from CSAT comments.
Kill Switch
- Single feature flag: intent_classifier_optimization_enabled. When false, all traffic skips rules and the session cache, hits SageMaker directly, and SageMaker reverts to always-on (2 instances). Rollback within ~3 minutes via SSM.
Quality Regression Criteria (story-specific)
- Rule-based intent precision: ≥ 0.92 (weekly audit on 500-sample stratified set).
- Session cache stale-intent rate: ≤ 1% (sampled audit).
- SageMaker scale-from-zero cold-start P95: ≤ 60s.
- Overall classification accuracy on a frozen validation set: must not regress > 1.5 percentage points vs always-on baseline.
Multi-Reviewer Validation Findings & Resolutions
The cross-reviewer pass identified the following story-specific findings. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.
S1 (must-fix before production)
English-only regex on a bilingual workload. Rules R001–R009 are English regex patterns. Manga-store traffic is JP/EN bilingual (the architecture is JST-aligned). On Japanese inputs ("注文はどこですか", "在庫ありますか"), every rule misses; rule-based coverage on JP traffic drops to ~5–15% versus the claimed 65% on aggregated traffic. The 65% number is true only when stratified entirely on English. Resolution: ship parallel rule sets per language — IntentRuleEngine.classify_en(), classify_ja(), plus a romaji handler for mixed-script queries ("Berserk 何巻まで"). Detect language by Unicode-block ratio before routing to the rule set. Republish rule-coverage metric stratified by language; the 65% acceptance criterion must be met per language, not aggregated.
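A minimal sketch of the Unicode-block language gate from the resolution (the 20% kana/kanji threshold and the classify_en/classify_ja method names are illustrative):

```python
import unicodedata

def detect_rule_language(message: str, ja_threshold: float = 0.2) -> str:
    """Route to the JA rule set when enough alphabetic characters are
    kana/kanji. Mixed-script queries ("Berserk 何巻まで") still route to JA,
    where a romaji-aware rule set handles the Latin fragments."""
    letters = [ch for ch in message if ch.isalpha()]
    if not letters:
        return "en"
    ja = sum(
        1 for ch in letters
        if "CJK" in unicodedata.name(ch, "")
        or "HIRAGANA" in unicodedata.name(ch, "")
        or "KATAKANA" in unicodedata.name(ch, "")
    )
    return "ja" if ja / len(letters) >= ja_threshold else "en"

# Dispatch sketch (classify_en / classify_ja are the per-language engines
# proposed in the resolution; hypothetical names):
# result = (engine.classify_ja(msg) if detect_rule_language(msg) == "ja"
#           else engine.classify_en(msg))
```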
ReDoS surface on rule-based classifier. Untrusted user input flows directly into re.search(). A long repetitive string ("is is is is …" × 1K) against a backtracking pattern can cause catastrophic exponential time, blocking the classifier. Resolution: (a) cap input length at 2,000 characters before regex matching; (b) use the regex library with timeout=100ms and circuit-break on timeout; (c) audit each rule's worst-case complexity (no nested unbounded quantifiers).
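A sketch of the hardened matcher from (a) and (b), using the third-party regex package's per-call timeout (the circuit-breaker wiring is assumed):

```python
import regex  # third-party package; supports per-call timeouts, unlike stdlib re

MAX_INPUT_CHARS = 2000
REGEX_TIMEOUT_S = 0.1  # 100ms budget per pattern

def safe_search(pattern: str, message: str):
    """Length-capped, time-bounded pattern match. Returns None on timeout
    so the caller can circuit-break to the ML path instead of blocking."""
    truncated = message[:MAX_INPUT_CHARS]
    try:
        return regex.search(pattern, truncated, timeout=REGEX_TIMEOUT_S)
    except TimeoutError:
        # Count timeouts; trip the rule-engine circuit breaker on a spike.
        return None
```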
intent_precision (the contract US-01, US-06, US-08 depend on) conflated with accuracy. The acceptance criterion "Classification accuracy ≥ 92%" is a balanced-set metric. US-01's template router needs precision (TP / (TP+FP)) on production traffic, which can be much lower than accuracy on a balanced set. Resolution: rename the metric to intent_precision, redefine as TP/(TP+FP) on a labeled production sample, raise floor to ≥ 0.94. Publish weekly to a CloudWatch metric that US-01 and US-06 read; US-01 raises template-router threshold automatically when precision < 0.90.
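A sketch of the weekly precision publication (the CloudWatch namespace and metric name are assumptions; US-01 and US-06 would read the same names):

```python
import boto3

def publish_intent_precision(tp: int, fp: int) -> float:
    """Compute TP/(TP+FP) on the labeled production sample and publish it
    to CloudWatch for US-01/US-06 to consume. Namespace and metric name
    are illustrative."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    boto3.client("cloudwatch").put_metric_data(
        Namespace="MangaAssist/IntentClassifier",
        MetricData=[{
            "MetricName": "intent_precision",
            "Value": precision,
            "Unit": "None",
        }],
    )
    return precision  # floor is >= 0.94; US-01 reacts below 0.90
```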
S2 (fix before scale-up)
Session intent cache TTL based on turn count, not elapsed time. MAX_TURN_GAP = 3 doesn't catch users who pause 2+ minutes between turns. After 100s silence, the 90s TTL has expired but the turn-gap check still considers the cached intent valid. Resolution: store last_cached_at timestamp in the cached entry; reject reuse if now - last_cached_at > 90s regardless of turn gap. Also invalidate on entity-set change (different ASIN or genre referenced).
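The in-memory sketch earlier already enforces an elapsed-time cutoff; the missing piece is entity-change invalidation. A sketch of that addition (the choice of identity-bearing keys is an assumption; a genre change on a continuation phrase is an in-intent refinement per the sequence diagram above, so it should not invalidate):

```python
from typing import Optional

# Assumes SessionIntentCache from the LLD above is importable.

class StrictSessionIntentCache(SessionIntentCache):
    """Adds the S2 fix: invalidate when identity-bearing entities change,
    even if the message reads like a continuation."""

    # Assumed: keys whose change implies a topic switch. Genre is excluded
    # because "what about horror ones?" refines the same intent.
    TOPIC_IDENTITY_KEYS = ("asin", "order_id")

    def get_cached_intent(self, session_id: str, message: str,
                          current_turn: int,
                          current_entities: Optional[dict] = None):
        cached = super().get_cached_intent(session_id, message, current_turn)
        if cached is None:
            return None
        for key in self.TOPIC_IDENTITY_KEYS:
            old = cached.entities.get(key)
            new = (current_entities or {}).get(key)
            if old and new and old != new:
                # Same phrasing shape, different subject: topic switch.
                del self._cache[session_id]
                return None
        return cached
```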
Scale-from-zero cold-start risk. The first request after off-peak scale-to-zero takes 30–60s. During this window users wait or sessions time out. Serverless inference fallback latency is not specified; if it is also cold, both paths suffer the same penalty. Resolution: keep the serverless fallback warm (e.g., provisioned concurrency; confirm pricing before assuming it beats always-on). If always-warm pricing exceeds the savings from main-endpoint scale-to-zero, keep one realtime instance always-on as the warm floor (~$196/month) and accept lower headline savings.
No deployment versioning for SageMaker model. A model deployment during scale-to-zero is invisible until next scale-from-zero — users in the cold-start window may run on stale model. Resolution: record model_version on every classification event; alert on multiple model_version values active in the same hour.
Kill switch absent for the rule engine itself. Story has intent_classifier_optimization_enabled (whole story off/on) but not a per-component switch to disable just the rules while keeping session cache + SageMaker. Resolution: add intent_rules_enabled flag; when off, all traffic skips rules and goes directly to session-cache → SageMaker fallback.
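A sketch of the two-flag routing (intent_classifier_optimization_enabled and intent_rules_enabled come from the story; the flag store and classifier handles are illustrative):

```python
def classify_message(message, session_id, turn, flags, rules, cache, sagemaker):
    """Routing that honours both kill switches; the wiring is illustrative."""
    if not flags["intent_classifier_optimization_enabled"]:
        # Whole-story kill switch: straight to always-on SageMaker.
        return sagemaker.classify(message)
    if flags["intent_rules_enabled"]:
        match = rules.classify(message)
        if match is not None and match.confidence >= 0.85:
            return match
    # Rules disabled or no confident match: session cache, then ML.
    cached = cache.get_cached_intent(session_id, message, turn)
    if cached is not None:
        return cached
    result = sagemaker.classify(message)
    cache.store(session_id, result.intent, result.confidence, result.entities, turn)
    return result
```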
S3 (acknowledged / future work)
- Drift detection on intent distribution (any intent-share change > 5% week-over-week triggers retraining review).
- Multilingual rules tested against a 500-prompt JP/EN/mixed golden set, refreshed quarterly.
- Cost attribution: per-classification cost (rule = ~$0; session-cache = ~$0; SageMaker = ~$0.0001) emitted to US-07.