FM Performance Enhancement — Scenarios and Runbooks
AWS AIP-C01 Task 4.2 — Skill 4.2.4: Diagnose and resolve FM parameter misconfiguration and A/B testing failures
Context: MangaAssist JP manga store chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate
Intents: product_search, order_status, recommendation, manga_qa, chitchat, shipping_info
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.2 — Optimize FM performance | Skill 4.2.4 — Troubleshoot FM parameter misconfigurations and A/B testing failures through systematic root cause analysis |
Skill scope: Identify, diagnose, and resolve production issues related to FM parameter configuration, A/B testing methodology, and performance regression through structured runbooks with decision trees.
Scenario 1 — Temperature Too High for Order Status Queries
Problem Statement
MangaAssist users are reporting fabricated delivery dates in order status responses. A customer asks "Where is my order #MGA-78291?" and receives: "Your order is scheduled for delivery on March 15th and is currently in transit via Japan Post." But the actual delivery date is March 22nd, and the carrier is Sagawa Express. The chatbot is confidently hallucinating order details.
Detection
flowchart TD
A[User Complaint:<br/>'Wrong delivery date'] --> B[Support Team<br/>Logs Ticket]
B --> C[Check CloudWatch Metric:<br/>fm.hallucination_rate]
C --> D{Hallucination Rate<br/>> 5%?}
D -- Yes --> E[Check Intent Breakdown]
E --> F{order_status<br/>Hallucination > 15%?}
F -- Yes --> G[Confirmed: Parameter<br/>Issue for order_status]
F -- No --> H[Investigate Other<br/>Root Causes]
D -- No --> I[Check Specific<br/>Session Logs]
I --> J{Response Contains<br/>Data Not in Context?}
J -- Yes --> G
J -- No --> K[Check RAG Retrieval<br/>Quality]
Detection signals:
- CloudWatch alarm: fm.hallucination_rate for intent=order_status spikes from 1.2% to 18.7%
- Customer support tickets mentioning "wrong delivery date" or "wrong tracking number" increase 4x
- Automated claim verification shows 17% of order_status responses contain dates not present in the DynamoDB order record
Root Cause Analysis
The root cause is a parameter profile misconfiguration. During a recent optimization push, the temperature for order_status was accidentally changed from 0.15 to 0.75.
Investigation steps:
1. Check the current parameter profile:

   ```python
   profile_manager = ParameterProfileManager()
   profile = profile_manager.get_profile("order_status")
   print(f"Temperature: {profile.temperature}")  # Expected: 0.15, Actual: 0.75
   ```

2. Check DynamoDB profile history: Query the MangaAssist-ParameterProfiles table for the order_status item. The updated_at timestamp reveals the change was made during a batch profile update that incorrectly applied recommendation-tier parameters to all intents.

3. Trace the change to the deployment: The batch update script did not filter by intent group and applied the recommendation profile's temperature (0.75) to order_status.
Why temperature=0.75 causes hallucination for order_status:
- At T=0.75, the model's probability distribution over tokens is flattened. Low-probability tokens (e.g., random dates, incorrect carrier names) become viable candidates.
- Order status responses must contain exact values from the database. Even a small amount of creative token selection introduces fabricated data.
- The model "fills in" plausible-sounding details when the actual data is not forcefully constrained by low temperature.
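To see how the profile value reaches the model at request time, here is a minimal sketch assuming the ParameterProfile fields used throughout this runbook and the Bedrock Converse API; the invoke_with_profile helper, region, and system-prompt wording are illustrative rather than MangaAssist's actual code.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke_with_profile(profile, user_message: str, order_context: str) -> str:
    """Call Claude 3 Haiku on Bedrock with a per-intent parameter profile.

    Illustrative sketch: the profile fields mirror the ParameterProfile shown
    in this runbook; the system-prompt wording is a placeholder.
    """
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        system=[{"text": "Answer using only the order data provided.\n\n" + order_context}],
        messages=[{"role": "user", "content": [{"text": user_message}]}],
        inferenceConfig={
            "temperature": profile.temperature,   # 0.15 keeps token choice near-deterministic
            "topP": profile.top_p,
            "maxTokens": profile.max_tokens,
            "stopSequences": profile.stop_sequences,
        },
        # top_k is Anthropic-specific, so it is passed outside inferenceConfig
        additionalModelRequestFields={"top_k": profile.top_k},
    )
    return response["output"]["message"]["content"][0]["text"]
```

With the same call but temperature=0.75, the model samples from a much flatter token distribution, which is exactly how fabricated dates and carriers slip into order_status answers.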
Resolution
flowchart TD
A[Confirm: order_status<br/>temperature = 0.75] --> B[Immediate: Revert to<br/>temperature = 0.15]
B --> C[Update DynamoDB<br/>ParameterProfiles Table]
C --> D[Clear Profile Cache<br/>on ECS Instances]
D --> E[Monitor Hallucination<br/>Rate for 1 Hour]
E --> F{Rate < 2%?}
F -- Yes --> G[Incident Resolved]
F -- No --> H[Check if Cache<br/>Is Actually Cleared]
H --> I[Force ECS Task Restart<br/>to Guarantee Fresh Config]
I --> E
Step-by-step resolution:
1. Immediate revert — Update the DynamoDB profile to temperature=0.15:

   ```python
   profile_manager.update_profile(ParameterProfile(
       intent="order_status",
       temperature=0.15,
       top_k=40,
       top_p=0.70,
       max_tokens=256,
       stop_sequences=["\n\nUser:", "\n\nHuman:"],
       frequency_penalty=0.0,
       system_prompt_key="order_status_v3",
       few_shot_count=2,
       quality_score_target=0.95,
       version="hotfix-v1",
       updated_at="",  # Will be set by update_profile
   ))
   ```

2. Clear cache — The ParameterProfileManager caches profiles for 5 minutes. Either wait for TTL expiry or trigger a cache invalidation across all ECS tasks (see the sketch after this list).

3. Monitor — Watch fm.hallucination_rate for intent=order_status in CloudWatch. It should drop below 2% within 15 minutes of the fix propagating.

4. User communication — Notify affected users whose orders showed incorrect delivery information.
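If waiting out the 5-minute TTL is unacceptable mid-incident, forcing a new ECS deployment guarantees every task restarts with an empty profile cache. A minimal sketch, assuming placeholder cluster and service names:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

def force_profile_refresh(cluster: str = "mangaassist-prod",
                          service: str = "mangaassist-chatbot") -> None:
    """Roll every task in the service so each one starts with an empty profile
    cache. Cluster and service names are placeholders."""
    ecs.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,  # rolls tasks without changing the task definition
    )
    # Wait for the rolling deployment to finish before re-checking the metrics.
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])
```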
Prevention
| Prevention Measure | Implementation |
|---|---|
| Intent-group validation | Parameter update scripts must validate that factual intents (order_status, shipping_info) never exceed temperature=0.35 |
| Pre-deployment guardrail | CI/CD pipeline checks parameter profiles against allowed ranges per intent group before deployment |
| Automated anomaly detection | CloudWatch anomaly detection on fm.hallucination_rate per intent with 5-minute evaluation period |
| Separate update paths | Factual intents and creative intents have separate DynamoDB update pipelines — never batch-updated together |
| Canary on parameter changes | Any parameter profile change routes through a 5% canary for 1 hour before full deployment |
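A minimal sketch of the intent-group validation and pre-deployment guardrail measures above; the group boundaries and temperature caps are illustrative assumptions, not production values.

```python
import sys

# Allowed temperature ceilings per intent group — groupings and caps are assumptions.
INTENT_GROUPS = {
    "factual": {"intents": {"order_status", "shipping_info"}, "max_temperature": 0.35},
    "creative": {"intents": {"recommendation", "chitchat", "manga_qa", "product_search"},
                 "max_temperature": 0.90},
}

def validate_profiles(profiles) -> list:
    """Return human-readable violations; an empty list means the batch is safe to deploy."""
    violations = []
    for profile in profiles:
        for group_name, group in INTENT_GROUPS.items():
            if profile.intent in group["intents"] and profile.temperature > group["max_temperature"]:
                violations.append(
                    f"{profile.intent}: temperature {profile.temperature} exceeds the "
                    f"{group_name} cap of {group['max_temperature']}"
                )
    return violations

def enforce_guardrail(profiles) -> None:
    """CI/CD entry point: abort the deployment on any violation."""
    violations = validate_profiles(profiles)
    if violations:
        sys.exit("Blocked parameter deployment:\n" + "\n".join(violations))
```

A batch update that sets order_status to temperature=0.75, as in this scenario, would be rejected before it ever reached DynamoDB.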
Scenario 2 — Statistical Significance Without Practical Significance
Problem Statement
An A/B test for product_search temperature (control=0.45 vs variant=0.50) has been running for 2 weeks. The StatisticalSignificanceCalculator reports p-value = 0.031 — statistically significant. The team is preparing to promote the variant. However, closer inspection reveals the quality improvement is 0.001 (from 0.883 to 0.884), while the variant uses 22% more tokens due to slightly longer responses, increasing cost per request by $0.003.
Detection
flowchart TD
A[A/B Test Completes<br/>p-value = 0.031] --> B[Team Prepares<br/>to Promote Variant]
B --> C{Review Full<br/>Analysis Report}
C --> D[Quality Improvement:<br/>+0.001 — 0.1%]
C --> E[Token Increase:<br/>+22%]
C --> F[Cost Increase:<br/>+$0.003/request]
D --> G{Practical<br/>Significance?}
G -- "0.1% improvement<br/>is negligible" --> H[Flag: Statistical<br/>Without Practical]
E --> I{Cost Justified?}
I -- "+22% tokens for<br/>0.1% quality" --> J[Flag: Poor<br/>Cost-Benefit Ratio]
H --> K[Reject Variant<br/>Despite p < 0.05]
J --> K
Detection signals:
- The analysis report shows a quality improvement smaller than the minimum detectable effect specified in the experiment design (MDE was 0.03, actual improvement is 0.001)
- Cohen's d effect size is 0.014 — classified as "negligible" (< 0.2 is negligible by convention)
- The 95% confidence interval for the difference is (0.0001, 0.0019) — the lower bound barely exceeds zero
Root Cause Analysis
This is not a bug — it is a methodological trap. The experiment was overpowered: it ran long enough to detect a trivially small difference as "significant."
Why it happened:
1. The experiment was designed with min_samples=1,950 based on MDE=0.03. But traffic was higher than expected, so it accumulated 12,000 samples per variant.
2. With 12,000 samples, even a 0.001 difference produces a small p-value because the standard error of each variant's mean shrinks with sample size: SE = std / sqrt(n). (See the sketch after this list.)
3. The team did not define a minimum practical improvement threshold alongside the statistical significance threshold.
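To make point 2 concrete, the sketch below computes a two-sided p-value for a fixed +0.001 quality difference at growing sample sizes; the per-sample standard deviation of 0.03 is an illustrative assumption, not the experiment's measured value.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p_value(mean_diff: float, std: float, n_per_variant: int) -> float:
    """Two-sided z-test p-value for a difference in means (equal n and std per variant)."""
    se = std * sqrt(2.0 / n_per_variant)  # the standard error shrinks as n grows
    z = mean_diff / se
    return 2.0 * (1.0 - NormalDist().cdf(z))

# The +0.001 quality difference never changes; only the sample size does.
for n in (1_950, 5_000, 12_000, 50_000):
    print(f"n={n:>6}  p={two_sample_p_value(0.001, 0.03, n):.4f}")
# The printed p-values fall toward zero: an overpowered experiment eventually
# labels any nonzero difference "statistically significant".
```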
The cost analysis:

| Metric | Control (T=0.45) | Variant (T=0.50) | Difference |
|---|---|---|---|
| Mean quality score | 0.883 | 0.884 | +0.001 (0.1%) |
| Mean tokens per response | 385 | 470 | +85 (22%) |
| Cost per request | $0.0135 | $0.0165 | +$0.003 (22%) |
| Monthly cost (100k requests) | $1,350 | $1,650 | +$300/month |
| Annual cost impact | — | — | +$3,600/year |
The verdict: Spending $3,600/year for a 0.1% quality improvement that no user can perceive is not justified.
Resolution
flowchart TD
A[Review Analysis Report] --> B{Effect Size<br/>Cohen d > 0.2?}
B -- "d=0.014<br/>Negligible" --> C[Check Practical<br/>Threshold]
C --> D{Quality Improvement<br/>> MDE?}
D -- "+0.001 < 0.03 MDE" --> E[Reject Variant]
B -- "d > 0.2" --> F[Proceed to<br/>Cost-Benefit Analysis]
F --> G{Annual Cost Increase<br/>< 10x Quality Value?}
G -- Yes --> H[Promote Variant]
G -- No --> E
E --> I[Conclude Experiment<br/>Document Learnings]
I --> J[Keep Control Parameters]
J --> K[Design Next Experiment<br/>with Larger Parameter Delta]
Decision framework:
1. Statistical significance alone is insufficient. Always check:
   - Effect size (Cohen's d > 0.2 for a "small" practical effect)
   - Confidence interval (does the interval include the MDE?)
   - Cost-benefit ratio (quality improvement value vs cost increase)
   A minimal promotion-gate sketch follows this list.

2. Reject the variant. Keep control parameters for product_search.

3. Document the learning for future experiments.
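A minimal sketch of such a promotion gate, assuming illustrative thresholds (minimum Cohen's d of 0.2, the 0.03 MDE, and a cost-per-quality-point budget) rather than MangaAssist's actual policy values:

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    p_value: float
    cohens_d: float             # (variant_mean - control_mean) / pooled_std
    quality_improvement: float  # absolute difference in mean quality score
    annual_cost_delta: float    # USD per year at current traffic

def should_promote(result: ExperimentResult,
                   mde: float = 0.03,
                   min_cohens_d: float = 0.2,
                   max_cost_per_quality_point: float = 1_000.0) -> bool:
    """Promotion gate: every check must pass, not just the p-value."""
    if result.p_value >= 0.05:
        return False            # not statistically significant
    if result.cohens_d < min_cohens_d:
        return False            # effect size is negligible
    if result.quality_improvement < mde:
        return False            # below the planned practical threshold
    if result.annual_cost_delta > 0:
        cost_per_point = result.annual_cost_delta / result.quality_improvement
        if cost_per_point > max_cost_per_quality_point:
            return False        # the cost outweighs the gain
    return True

# Scenario 2's numbers fail the practical checks despite p = 0.031:
print(should_promote(ExperimentResult(0.031, 0.014, 0.001, 3_600.0)))  # False
```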
Prevention
| Prevention Measure | Implementation |
|---|---|
| Mandatory practical significance check | Analysis report must include Cohen's d and MDE comparison before recommending promotion |
| Cost-benefit ratio in analysis | Every experiment analysis must calculate annual cost impact alongside quality metrics |
| Auto-cap sample size | Stop collecting data at 2x the planned sample size to avoid detecting trivially small effects |
| Define MDE and minimum Cohen's d upfront | Experiment creation requires minimum_practical_improvement (e.g., d > 0.2) as a promotion gate |
| Human review for promotion | Automated systems can recommend but not execute promotions — a human must sign off |
Scenario 3 — Top-k Too Restrictive for Manga Recommendations
Problem Statement
Users are complaining that MangaAssist recommendations are repetitive. The chatbot keeps suggesting the same 10-15 mainstream titles (One Piece, Naruto, Attack on Titan, Demon Slayer) regardless of the user's stated preferences. A user who asks for "dark psychological seinen manga" receives mainstream shonen titles instead of niche seinen recommendations like Homunculus, Ichi the Killer, or Oyasumi Punpun.
Detection
flowchart TD
A[User Feedback:<br/>'Same recommendations<br/>every time'] --> B[Analyze Recommendation<br/>Diversity Metric]
B --> C[Query: Unique Titles<br/>per 1000 Recommendations]
C --> D{Unique Titles<br/>> 50?}
D -- "Only 14 unique<br/>titles in last 1000" --> E[Confirmed:<br/>Diversity Problem]
D -- Yes --> F[Check Session-Level<br/>Deduplication]
E --> G[Check Parameter<br/>Profile]
G --> H{top_k = ?}
H -- "top_k = 30" --> I[Root Cause:<br/>top_k Too Restrictive]
H -- "top_k >= 100" --> J[Check RAG<br/>Index Coverage]
Detection signals:
- Recommendation diversity metric: only 14 unique titles across the last 1,000 recommendation responses (should be 50+)
- Repeat recommendation rate: 87% of recommendations are titles that appear in the top-15 most-recommended list
- User satisfaction for the recommendation intent drops from 4.1/5 to 3.2/5 over two weeks
- "Similar to what I already read" complaints in post-chat surveys increase 3x
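A minimal sketch of how the diversity signal could be computed and published as a custom CloudWatch metric; the namespace, metric name, and response-record fields are assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def publish_recommendation_diversity(recent_responses: list) -> int:
    """Count unique recommended titles in the most recent responses and emit
    the value as a custom metric. Each record is assumed to carry a
    'recommended_titles' list."""
    unique_titles = {
        title
        for response in recent_responses
        for title in response.get("recommended_titles", [])
    }
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/FM",
        MetricData=[{
            "MetricName": "recommendation_unique_titles_per_1000",
            "Value": len(unique_titles),
            "Unit": "Count",
            "Dimensions": [{"Name": "intent", "Value": "recommendation"}],
        }],
    )
    return len(unique_titles)
```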
Root Cause Analysis
The root cause is top_k=30 combined with a skewed embedding space in OpenSearch.
The interaction between top_k and RAG:
- OpenSearch kNN search retrieves the top-10 most similar manga titles based on the user's query embedding
- These 10 titles are injected into the prompt as context
- The model generates a response choosing from its vocabulary, constrained by top_k=30
With top_k=30, the model can only select from the 30 most probable tokens at each generation step. When the prompt context contains 10 manga titles, the model's probability distribution heavily favors tokens that form the names of well-known titles (because they appeared more frequently in training data). Niche titles like "Oyasumi Punpun" have lower token probabilities and are excluded by the top_k=30 ceiling.
The compounding effect:
- OpenSearch returns mainstream titles more often (they have more embeddings and reviews in the index)
- The model's training data over-represents popular manga
- top_k=30 eliminates the "long tail" of less-probable token sequences that form niche title names
- Result: a feedback loop where only the most popular titles survive the generation process
Resolution
flowchart TD
A[Confirm: top_k=30 for<br/>recommendation intent] --> B[Increase top_k<br/>from 30 to 250]
B --> C[Also Increase top_p<br/>from 0.80 to 0.95]
C --> D[A/B Test: 20% Traffic<br/>for 3 Days]
D --> E[Measure Diversity:<br/>Unique Titles per 1000]
E --> F{Diversity > 50<br/>Unique Titles?}
F -- Yes --> G[Check Quality Score<br/>Not Degraded]
G --> H{Quality > 0.78?}
H -- Yes --> I[Promote New<br/>Parameters]
H -- No --> J[Reduce top_k to 200<br/>Re-test]
F -- No --> K[Also Check OpenSearch<br/>Index Diversity]
K --> L[Add Niche Titles<br/>to Embedding Index]
L --> D
Step-by-step resolution:
1. Update the recommendation parameter profile:
   - top_k: 30 → 250
   - top_p: 0.80 → 0.95
   - temperature: verify it is at 0.80 (it was correct)

2. A/B test the change — do not deploy directly because the jump from top_k=30 to top_k=250 is large:
   - Control: top_k=30 (current)
   - Variant: top_k=250 (proposed)
   - Traffic: 20% variant for 3 days

3. Measure recommendation diversity as the primary metric:
   - Unique titles per 1,000 recommendations
   - Genre diversity (number of distinct genres recommended)
   - "Long tail" ratio (percentage of recommendations from titles outside the top-50 most popular)

4. Check the OpenSearch index — Ensure the embedding index contains niche titles. If the index only has 200 mainstream titles, no parameter tuning can fix the diversity problem.
Prevention
| Prevention Measure | Implementation |
|---|---|
| Recommendation diversity metric | CloudWatch custom metric tracking unique titles per 1,000 recommendations with alarm at <30 |
| Parameter range validation | Recommendation intent top_k must be >= 100; block deployments that set it lower |
| Quarterly index diversity audit | Automated check that OpenSearch contains titles from all genre categories, not just mainstream |
| Prompt-level diversity enforcement | System prompt explicitly instructs: "Include at least one lesser-known title in every recommendation set" |
| Session-level deduplication | Track recommended titles per session in DynamoDB; exclude already-recommended titles from future responses |
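A minimal sketch of the session-level deduplication measure above, assuming a hypothetical DynamoDB table named MangaAssist-SessionRecommendations keyed by session_id:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
# Hypothetical table keyed by session_id; the name and schema are assumptions.
table = dynamodb.Table("MangaAssist-SessionRecommendations")

def get_already_recommended(session_id: str) -> set:
    """Titles already suggested in this session, to be excluded from the next prompt."""
    item = table.get_item(Key={"session_id": session_id}).get("Item", {})
    return set(item.get("titles", []))

def record_recommendations(session_id: str, titles: list) -> None:
    """Append newly recommended titles to the session's exclusion list."""
    if not titles:
        return  # DynamoDB string sets cannot be empty
    table.update_item(
        Key={"session_id": session_id},
        UpdateExpression="ADD titles :new",
        ExpressionAttributeValues={":new": set(titles)},  # serialized as a string set
    )
```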
Scenario 4 — Parameter Profile Mismatch After New Intent Deployment
Problem Statement
The MangaAssist team deploys a new intent type: gift_suggestion — helping users pick manga gifts for friends. The new intent classifier correctly identifies gift_suggestion queries, but the ParameterProfileManager has no profile configured for this intent. The system falls through to the safety fallback (order_status defaults: temperature=0.15, top_k=40), producing responses that are factual but utterly uncreative: "Here is a manga: One Piece Volume 1. Price: $9.99." — no personalization, no enthusiasm, no gift-giving context.
Detection
flowchart TD
A[New Intent Deployed:<br/>gift_suggestion] --> B[Users Start Asking<br/>Gift Questions]
B --> C[Responses Are<br/>Dry and Uncreative]
C --> D[User Satisfaction for<br/>gift_suggestion = 2.1/5]
D --> E[Check Logs:<br/>ParameterProfileManager]
E --> F{Log Message:<br/>'Unknown intent,<br/>using order_status<br/>defaults'?}
F -- Yes --> G[Confirmed: No Profile<br/>for gift_suggestion]
F -- No --> H[Check Intent<br/>Classifier Accuracy]
Detection signals:
- ParameterProfileManager logs: "Unknown intent=gift_suggestion, using order_status defaults as safest option" appearing 400+ times/hour
- User satisfaction score for the new intent: 2.1/5 (target was 4.0/5)
- CloudWatch metric: fm.temperature grouped by intent shows gift_suggestion at 0.15 instead of the expected 0.70-0.80 range
- Qualitative review: gift suggestion responses lack personality, creativity, and gift-giving context
Root Cause Analysis
The deployment process for new intents has no coupling to the parameter configuration pipeline. The gap:
- The ML team deploys the updated intent classifier with the new gift_suggestion label
- The backend team updates the routing logic to handle the new intent
- Nobody creates a parameter profile in DynamoDB for gift_suggestion
- The ParameterProfileManager falls through to the safety default (order_status)
The safety fallback is working as designed — it prevents crashes. But the fallback parameters are completely wrong for a creative intent.
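A minimal sketch of making that fallback observable (anticipating the fallback-alerting measure in the Prevention table below); the try_get_profile lookup and the metric namespace are assumptions about the ParameterProfileManager rather than its documented API.

```python
import logging
import boto3

logger = logging.getLogger("parameter_profiles")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

FALLBACK_INTENT = "order_status"  # safest defaults: factual, low temperature

def get_profile_with_alerting(profile_manager, intent: str):
    """Never fall back silently: log a WARNING and emit a metric an alarm can watch.
    Hypothetical wrapper — try_get_profile is an assumed lookup that returns None
    when no profile exists for the intent."""
    profile = profile_manager.try_get_profile(intent)
    if profile is not None:
        return profile

    logger.warning("Unknown intent=%s, using %s defaults as safest option",
                   intent, FALLBACK_INTENT)
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/FM",
        MetricData=[{
            "MetricName": "parameter_profile_fallback",
            "Value": 1,
            "Unit": "Count",
            "Dimensions": [{"Name": "intent", "Value": intent}],
        }],
    )
    return profile_manager.get_profile(FALLBACK_INTENT)
```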
Resolution
flowchart TD
A[Identify Missing Profile] --> B[Determine Intent Category]
B --> C{gift_suggestion<br/>is Creative?}
C -- Yes --> D[Clone 'recommendation'<br/>Profile as Starting Point]
D --> E[Customize for<br/>Gift Context]
E --> F[Add System Prompt<br/>for Gift Suggestions]
F --> G[Deploy Profile<br/>to DynamoDB]
G --> H[Monitor Quality<br/>for 2 Hours]
H --> I{Satisfaction > 3.5?}
I -- Yes --> J[Schedule A/B Test<br/>for Fine-Tuning]
I -- No --> K[Adjust Parameters<br/>and Re-Monitor]
Step-by-step resolution:
1. Create the profile by cloning the closest existing intent:

   ```python
   gift_profile = ParameterProfile(
       intent="gift_suggestion",
       temperature=0.80,          # Creative — same as recommendation
       top_k=250,                 # Wide vocabulary for diverse suggestions
       top_p=0.95,                # Maximum diversity
       max_tokens=768,            # Room for personalized explanations
       stop_sequences=["\n\nUser:", "\n\nHuman:"],
       frequency_penalty=0.3,     # Reduce repetitive phrasing
       system_prompt_key="gift_suggestion_v1",
       few_shot_count=3,
       quality_score_target=0.82,
       version="initial-v1",
       updated_at="",
   )
   profile_manager.update_profile(gift_profile)
   ```

2. Create the system prompt:

   ```python
   SYSTEM_PROMPTS["gift_suggestion"] = (
       "You are MangaAssist, a thoughtful gift recommendation specialist for a Japanese manga store. "
       "RULES:\n"
       "1. Ask about the recipient's age, interests, and reading experience.\n"
       "2. Suggest 3-5 titles with gift-appropriate descriptions (no spoilers).\n"
       "3. Include price range and whether a box set or single volume is better for gifting.\n"
       "4. Mention gift wrapping availability if applicable.\n"
       "5. Be warm and enthusiastic — gift giving is exciting."
   )
   ```

3. Monitor — Watch user satisfaction for gift_suggestion over the next 2 hours. Target: >3.5/5 immediately, >4.0/5 after A/B tuning.
Prevention
| Prevention Measure | Implementation |
|---|---|
| Intent deployment checklist | Every new intent requires a parameter profile before the intent classifier update goes live |
| CI/CD validation | Deployment pipeline checks that every intent label in the classifier model has a corresponding DynamoDB profile |
| Fallback alerting | When ParameterProfileManager uses the fallback, it emits a WARNING-severity CloudWatch metric. Alarm triggers if fallback is used >10 times in 5 minutes |
| Intent onboarding template | Standard template for new intents includes: parameter profile, system prompt, few-shot examples, quality evaluation criteria |
| Automated profile scaffolding | When the intent classifier is updated, a Lambda function detects new labels and creates draft profiles by cloning the nearest existing intent's profile |
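A minimal sketch of the CI/CD validation measure above; the table name matches the scenario, but the label manifest and scan-based lookup are assumptions.

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
profiles_table = dynamodb.Table("MangaAssist-ParameterProfiles")

def find_intents_without_profiles(classifier_labels: set) -> set:
    """Return classifier labels with no parameter profile in DynamoDB.
    classifier_labels is assumed to come from the intent model's label manifest."""
    existing = set()
    scan_kwargs = {
        "ProjectionExpression": "#i",
        "ExpressionAttributeNames": {"#i": "intent"},
    }
    while True:
        page = profiles_table.scan(**scan_kwargs)
        existing.update(item["intent"] for item in page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return classifier_labels - existing

# In the deployment pipeline: block the release if any intent is unconfigured.
missing = find_intents_without_profiles({"order_status", "recommendation", "gift_suggestion"})
if missing:
    raise SystemExit(f"Missing parameter profiles for intents: {sorted(missing)}")
```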
Scenario 5 — A/B Test Contamination from Request-Level Assignment
Problem Statement
An A/B test for manga_qa temperature (control=0.50 vs variant=0.65) shows no statistically significant difference after 3 weeks and 15,000 samples per variant. The team is puzzled because internal testing clearly showed the variant produced better responses. Upon investigation, they discover that the same user can receive control parameters for one message and variant parameters for the next message within the same conversation session. This cross-contamination corrupts the experiment data.
Detection
flowchart TD
A[A/B Test Inconclusive<br/>After 3 Weeks] --> B[Review Assignment<br/>Logic]
B --> C{Assignment Based On<br/>request_id or session_id?}
C -- request_id --> D[Confirmed: Per-Request<br/>Assignment]
D --> E[Check Session Logs<br/>for Cross-Variant]
E --> F[Query: Sessions Where<br/>Both Variants Served]
F --> G{Contaminated<br/>Sessions > 5%?}
G -- "42% of sessions<br/>saw both variants" --> H[Confirmed:<br/>Severe Contamination]
G -- "<5%" --> I[Contamination Not<br/>The Primary Issue]
Detection signals:
- 42% of multi-turn sessions received both control and variant parameters across different turns
- Within-session quality variance is higher than between-session variance (the opposite of what a clean experiment produces)
- User satisfaction surveys show no difference because each user experienced a blend of both configurations
- The experiment's statistical power is effectively halved because contaminated sessions add noise
Root Cause Analysis
The ABTestingFramework's assign_variant method was implemented with request-level hashing instead of session-level hashing. The hash input used request_id (which changes every message) instead of session_id (which is stable for the entire conversation).
The contamination mechanism:
Session ABC-123:
Turn 1 → hash(experiment_id + request_001) → bucket 0.32 → Control (T=0.50)
Turn 2 → hash(experiment_id + request_002) → bucket 0.71 → Variant (T=0.65)
Turn 3 → hash(experiment_id + request_003) → bucket 0.45 → Control (T=0.50)
Turn 4 → hash(experiment_id + request_004) → bucket 0.83 → Variant (T=0.65)
The user experiences an inconsistent conversation where the chatbot's "personality" changes mid-session. Quality evaluation becomes meaningless because the quality of turn 3 depends on the context established by turns 1 and 2, which used different parameters.
Why this makes the experiment invalid:
1. User experience is session-level, not request-level. A user rates the overall conversation, not individual messages.
2. Context dependency: In multi-turn conversations, each response builds on previous ones. Mixing parameters creates hybrid responses that neither control nor variant would produce in isolation.
3. Statistical noise: Contaminated sessions add variance that obscures the true difference between configurations.
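A minimal sketch contrasting the two hashing strategies; the MD5 bucketing helper mirrors the mechanism described above but is illustrative rather than the ABTestingFramework's actual implementation.

```python
import hashlib

def bucket(experiment_id: str, key: str) -> float:
    """Deterministically map a key to [0, 1) for traffic splitting."""
    digest = hashlib.md5(f"{experiment_id}:{key}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 2**32

def assign_variant(experiment_id: str, key: str, variant_split: float = 0.5) -> str:
    """Return 'variant' or 'control' depending on where the key's bucket falls."""
    return "variant" if bucket(experiment_id, key) >= variant_split else "control"

session_id = "ABC-123"

# Request-level key: the assignment can flip on every turn of the same session.
for request_id in ("request_001", "request_002", "request_003"):
    print(request_id, assign_variant("manga_qa_temp_v1", request_id))

# Session-level key: every turn in the session gets the same assignment.
for _ in range(3):
    print(session_id, assign_variant("manga_qa_temp_v1", session_id))
```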
Resolution
flowchart TD
A[Identify: request_id<br/>Used for Assignment] --> B[Fix: Change to<br/>session_id Hashing]
B --> C[Discard Contaminated<br/>Data — 3 Weeks Lost]
C --> D[Restart Experiment<br/>with Clean Assignment]
D --> E[Verify: Same Session<br/>Always Gets Same Variant]
E --> F[Run Validation Query:<br/>Sessions with Mixed Variants]
F --> G{Mixed Variant<br/>Sessions = 0?}
G -- Yes --> H[Experiment Running<br/>Cleanly]
G -- No --> I[Debug Hashing Logic<br/>Check Session ID Stability]
Step-by-step resolution:
1. Fix the assignment logic — Ensure assign_variant uses session_id:

   ```python
   # WRONG: Using request_id — changes every message
   hash_input = f"{experiment_id}:{request_id}".encode("utf-8")

   # CORRECT: Using session_id — stable for the entire conversation
   hash_input = f"{experiment_id}:{session_id}".encode("utf-8")
   ```

2. Discard all contaminated data — The 3 weeks of data collected with request-level assignment cannot be salvaged. Mark the experiment as invalidated and create a new one.

3. Restart the experiment with session-level assignment:
   - Same control and variant configurations
   - Reset sample counters to zero
   - New experiment_id to clearly separate it from the invalid data

4. Validate — After 24 hours, run a diagnostic query:

   ```sql
   -- Count sessions that received both variants (should be 0)
   SELECT COUNT(*) AS contaminated_sessions
   FROM (
       SELECT session_id
       FROM ab_results
       WHERE experiment_id = 'manga_qa_temp_v2'
       GROUP BY session_id
       HAVING COUNT(DISTINCT variant_id) > 1
   ) mixed_sessions;
   ```

5. Verify session_id stability — Ensure that session_id does not change during a conversation. Common causes of unstable session IDs:
   - WebSocket reconnections generating new session IDs
   - Load balancer session affinity misconfiguration
   - Client-side session rotation
Prevention
| Prevention Measure | Implementation |
|---|---|
| Session-level assignment enforced | ABTestingFramework constructor validates that the assignment key is session_id, not request_id |
| Contamination check on launch | Every experiment runs a 1-hour canary with an automated contamination check before full launch |
| Unit test for assignment consistency | Test that assign_variant(experiment_id, session_id) returns the same variant for 1,000 consecutive calls with the same session_id |
| Session stability monitoring | CloudWatch metric tracking session_id changes per WebSocket connection. Alert if >0.1% of connections generate multiple session IDs |
| Code review checklist | A/B testing code reviews must include: "Is assignment session-level? Is the session_id stable?" |
| Contamination dashboard | Real-time dashboard showing percentage of sessions with mixed variant assignments per active experiment |
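A minimal pytest-style sketch of the assignment-consistency check in the table above, reusing the illustrative assign_variant helper sketched earlier in this scenario:

```python
def test_assignment_is_stable_per_session():
    """The same (experiment_id, session_id) pair must always map to one variant."""
    first = assign_variant("manga_qa_temp_v2", "ABC-123")
    for _ in range(1_000):
        assert assign_variant("manga_qa_temp_v2", "ABC-123") == first

def test_traffic_split_is_roughly_honored():
    """Across many distinct sessions, the split should approximate 50/50."""
    assignments = [
        assign_variant("manga_qa_temp_v2", f"session-{i}") for i in range(10_000)
    ]
    variant_share = assignments.count("variant") / len(assignments)
    assert 0.45 < variant_share < 0.55
```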
Cross-Scenario Decision Tree
When an FM performance issue is detected, use this decision tree to route to the correct scenario:
flowchart TD
A[FM Performance<br/>Issue Detected] --> B{What Type?}
B -- "Hallucinated/Fabricated<br/>Content" --> C{Which Intent?}
C -- "Factual Intent<br/>(order_status, shipping_info)" --> D["Scenario 1<br/>Temperature Too High"]
C -- "Creative Intent<br/>(recommendation, chitchat)" --> E[Check RAG<br/>Retrieval Quality]
B -- "A/B Test<br/>Result Questionable" --> F{What's Wrong?}
F -- "Significant p-value<br/>but tiny improvement" --> G["Scenario 2<br/>Statistical vs Practical"]
F -- "Inconclusive after<br/>long run" --> H{Check Assignment<br/>Logic}
H -- "Request-level<br/>assignment" --> I["Scenario 5<br/>Contamination"]
H -- "Session-level<br/>correct" --> J[Check Sample Size<br/>and Power]
B -- "Repetitive/<br/>Boring Responses" --> K{Which Intent?}
K -- "Recommendations" --> L["Scenario 3<br/>Top-k Too Restrictive"]
K -- "New Intent" --> M{Profile<br/>Exists?}
M -- No --> N["Scenario 4<br/>Missing Profile"]
M -- Yes --> O[Check Temperature<br/>and System Prompt]
B -- "Wrong Personality/<br/>Tone" --> P{New Intent<br/>Recently Deployed?}
P -- Yes --> N
P -- No --> Q[Check System Prompt<br/>Version and Profile]
style D fill:#f8d7da
style G fill:#fff3cd
style I fill:#f8d7da
style L fill:#fff3cd
style N fill:#e1f5fe
Summary — Quick Reference
| Scenario | Symptom | Root Cause | Time to Detect | Time to Resolve |
|---|---|---|---|---|
| 1. Temperature too high | Hallucinated order details | Temperature=0.75 for factual intent | Minutes (if hallucination metric exists) | 15 minutes (profile revert + cache clear) |
| 2. Statistical vs practical | Promoting a 0.1% improvement at 22% more cost | No practical significance gate in analysis | Days (at experiment completion) | Immediate (reject variant) |
| 3. Top-k too restrictive | Same 15 titles recommended | top_k=30 eliminates long-tail tokens | Days (diversity metric needed) | Hours (profile update + A/B test) |
| 4. Missing profile | Dry, uncreative responses for new intent | No parameter profile created during deployment | Minutes (fallback log alerts) | 30 minutes (create profile) |
| 5. A/B contamination | Inconclusive experiment after weeks | Request-level instead of session-level assignment | Weeks (after experiment fails to converge) | Days (fix code, discard data, restart) |
Key Takeaways
- Factual intents demand low temperature — There is no safe "medium" temperature for order_status or shipping_info. Hallucinated data destroys user trust instantly.
- Statistical significance is necessary but not sufficient — Always pair p-values with effect size (Cohen's d) and cost-benefit analysis before promoting any variant.
- Recommendation diversity requires wide top_k — Restrictive top_k interacts with biased embeddings to create a feedback loop of only mainstream titles.
- New intent deployment must include parameter profiles — The safest fallback in the world cannot produce good responses if it uses the wrong personality and constraints.
- Session-level assignment is non-negotiable — Request-level A/B assignment in a multi-turn chatbot invalidates the entire experiment. Validate assignment logic before launching any test.