
FM Performance Enhancement — Scenarios and Runbooks

AWS AIP-C01 Task 4.2 — Skill 4.2.4: Diagnose and resolve FM parameter misconfiguration and A/B testing failures

Context: MangaAssist JP manga store chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate

Intents: product_search, order_status, recommendation, manga_qa, chitchat, shipping_info


Skill Mapping

| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.2 — Optimize FM performance | Skill 4.2.4 — Troubleshoot FM parameter misconfigurations and A/B testing failures through systematic root cause analysis |

Skill scope: Identify, diagnose, and resolve production issues related to FM parameter configuration, A/B testing methodology, and performance regression through structured runbooks with decision trees.


Scenario 1 — Temperature Too High for Order Status Queries

Problem Statement

MangaAssist users are reporting fabricated delivery dates in order status responses. A customer asks "Where is my order #MGA-78291?" and receives: "Your order is scheduled for delivery on March 15th and is currently in transit via Japan Post." But the actual delivery date is March 22nd, and the carrier is Sagawa Express. The chatbot is confidently hallucinating order details.

Detection

flowchart TD
    A[User Complaint:<br/>'Wrong delivery date'] --> B[Support Team<br/>Logs Ticket]
    B --> C[Check CloudWatch Metric:<br/>fm.hallucination_rate]
    C --> D{Hallucination Rate<br/>> 5%?}
    D -- Yes --> E[Check Intent Breakdown]
    E --> F{order_status<br/>Hallucination > 15%?}
    F -- Yes --> G[Confirmed: Parameter<br/>Issue for order_status]
    F -- No --> H[Investigate Other<br/>Root Causes]
    D -- No --> I[Check Specific<br/>Session Logs]
    I --> J{Response Contains<br/>Data Not in Context?}
    J -- Yes --> G
    J -- No --> K[Check RAG Retrieval<br/>Quality]

Detection signals:

- CloudWatch alarm: fm.hallucination_rate for intent=order_status spikes from 1.2% to 18.7%
- Customer support tickets mentioning "wrong delivery date" or "wrong tracking number" increase 4x
- Automated claim verification shows 17% of order_status responses contain dates not present in the DynamoDB order record

Root Cause Analysis

The root cause is a parameter profile misconfiguration. During a recent optimization push, the temperature for order_status was accidentally changed from 0.15 to 0.75.

Investigation steps:

  1. Check the current parameter profile:

    profile_manager = ParameterProfileManager()
    profile = profile_manager.get_profile("order_status")
    print(f"Temperature: {profile.temperature}")  # Expected: 0.15, Actual: 0.75
    

  2. Check DynamoDB profile history: Query the MangaAssist-ParameterProfiles table for the order_status item. The updated_at timestamp reveals the change was made during a batch profile update that incorrectly applied recommendation-tier parameters to all intents.

  3. Trace the change to the deployment: The batch update script did not filter by intent group and applied the recommendation profile's temperature (0.75) to order_status.

Why temperature=0.75 causes hallucination for order_status:

- At T=0.75, the model's probability distribution over tokens is flattened. Low-probability tokens (e.g., random dates, incorrect carrier names) become viable candidates.
- Order status responses must contain exact values from the database. Even a small amount of creative token selection introduces fabricated data.
- The model "fills in" plausible-sounding details when the actual data is not forcefully constrained by low temperature.
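The flattening effect is easy to see directly: sampling divides logits by the temperature before the softmax, so a low temperature sharpens the distribution around the top token while a high temperature spreads probability mass to the tail. A minimal sketch with toy logits (the three values are illustrative, not real model output):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to a probability distribution, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits: the correct delivery-date token vs. two plausible-but-wrong dates
logits = [5.0, 2.0, 1.5]

low = softmax_with_temperature(logits, 0.15)   # order_status setting
high = softmax_with_temperature(logits, 0.75)  # misconfigured setting

print(f"T=0.15: correct token p={low[0]:.4f}, wrong tokens p={1 - low[0]:.4f}")
print(f"T=0.75: correct token p={high[0]:.4f}, wrong tokens p={1 - high[0]:.4f}")
```

At T=0.15 the wrong-date tokens get essentially zero probability; at T=0.75 they pick up a few percent per step, which compounds across a 256-token response.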

Resolution

flowchart TD
    A[Confirm: order_status<br/>temperature = 0.75] --> B[Immediate: Revert to<br/>temperature = 0.15]
    B --> C[Update DynamoDB<br/>ParameterProfiles Table]
    C --> D[Clear Profile Cache<br/>on ECS Instances]
    D --> E[Monitor Hallucination<br/>Rate for 1 Hour]
    E --> F{Rate < 2%?}
    F -- Yes --> G[Incident Resolved]
    F -- No --> H[Check if Cache<br/>Is Actually Cleared]
    H --> I[Force ECS Task Restart<br/>to Guarantee Fresh Config]
    I --> E

Step-by-step resolution:

  1. Immediate revert — Update DynamoDB profile to temperature=0.15:

    profile_manager.update_profile(ParameterProfile(
        intent="order_status",
        temperature=0.15,
        top_k=40,
        top_p=0.70,
        max_tokens=256,
        stop_sequences=["\n\nUser:", "\n\nHuman:"],
        frequency_penalty=0.0,
        system_prompt_key="order_status_v3",
        few_shot_count=2,
        quality_score_target=0.95,
        version="hotfix-v1",
        updated_at="",  # Will be set by update_profile
    ))
    

  2. Clear cache — The ParameterProfileManager caches profiles for 5 minutes. Either wait for TTL expiry or trigger a cache invalidation across all ECS tasks.

  3. Monitor — Watch fm.hallucination_rate for intent=order_status in CloudWatch. It should drop below 2% within 15 minutes of the fix propagating.

  4. User communication — Notify affected users whose orders showed incorrect delivery information.
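The 5-minute cache in step 2 is why the hotfix can lag behind the DynamoDB write. A sketch of that caching behavior, assuming a TTL cache keyed by intent (the class name and `fetch_fn` hook are illustrative, not the real ParameterProfileManager internals):

```python
import time

class CachedProfileStore:
    """Sketch of a profile cache with a 5-minute TTL and explicit invalidation.

    `fetch_fn` stands in for the DynamoDB read; in the real service this
    would be the ParameterProfileManager's table query.
    """

    def __init__(self, fetch_fn, ttl_seconds=300):
        self._fetch = fetch_fn
        self._ttl = ttl_seconds
        self._cache = {}  # intent -> (profile, fetched_at)

    def get_profile(self, intent):
        entry = self._cache.get(intent)
        if entry is not None:
            profile, fetched_at = entry
            if time.monotonic() - fetched_at < self._ttl:
                return profile  # still "fresh": this is why a hotfix can lag
        profile = self._fetch(intent)
        self._cache[intent] = (profile, time.monotonic())
        return profile

    def invalidate(self, intent=None):
        """Drop one intent's entry, or the whole cache, forcing a re-fetch."""
        if intent is None:
            self._cache.clear()
        else:
            self._cache.pop(intent, None)

# After the DynamoDB hotfix, invalidate() makes the next call re-read the table:
store = CachedProfileStore(lambda intent: {"intent": intent, "temperature": 0.75})
assert store.get_profile("order_status")["temperature"] == 0.75
store._fetch = lambda intent: {"intent": intent, "temperature": 0.15}  # table updated
assert store.get_profile("order_status")["temperature"] == 0.75       # stale cache
store.invalidate("order_status")
assert store.get_profile("order_status")["temperature"] == 0.15       # fresh read
```

An ECS task restart achieves the same effect as `invalidate()` by discarding the in-memory cache entirely, which is why the flowchart falls back to it.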

Prevention

| Prevention Measure | Implementation |
|---|---|
| Intent-group validation | Parameter update scripts must validate that factual intents (order_status, shipping_info) never exceed temperature=0.35 |
| Pre-deployment guardrail | CI/CD pipeline checks parameter profiles against allowed ranges per intent group before deployment |
| Automated anomaly detection | CloudWatch anomaly detection on fm.hallucination_rate per intent with a 5-minute evaluation period |
| Separate update paths | Factual intents and creative intents have separate DynamoDB update pipelines — never batch-updated together |
| Canary on parameter changes | Any parameter profile change routes through a 5% canary for 1 hour before full deployment |

Scenario 2 — Statistical Significance Without Practical Significance

Problem Statement

An A/B test for product_search temperature (control=0.45 vs variant=0.50) has been running for 2 weeks. The StatisticalSignificanceCalculator reports p-value = 0.031 — statistically significant. The team is preparing to promote the variant. However, closer inspection reveals the quality improvement is 0.001 (from 0.883 to 0.884), while the variant uses 22% more tokens due to slightly longer responses, increasing cost per request by $0.003.

Detection

flowchart TD
    A[A/B Test Completes<br/>p-value = 0.031] --> B[Team Prepares<br/>to Promote Variant]
    B --> C{Review Full<br/>Analysis Report}
    C --> D[Quality Improvement:<br/>+0.001 — 0.1%]
    C --> E[Token Increase:<br/>+22%]
    C --> F[Cost Increase:<br/>+$0.003/request]
    D --> G{Practical<br/>Significance?}
    G -- "0.1% improvement<br/>is negligible" --> H[Flag: Statistical<br/>Without Practical]
    E --> I{Cost Justified?}
    I -- "+22% tokens for<br/>0.1% quality" --> J[Flag: Poor<br/>Cost-Benefit Ratio]
    H --> K[Reject Variant<br/>Despite p < 0.05]
    J --> K

Detection signals:

- The analysis report shows a quality improvement smaller than the minimum detectable effect specified in the experiment design (the MDE was 0.03; the actual improvement is 0.001)
- Cohen's d effect size is 0.014 — classified as "negligible" (< 0.2 is negligible by convention)
- The 95% confidence interval for the difference is (0.0001, 0.0019) — the lower bound barely clears zero, and the entire interval sits far below the MDE

Root Cause Analysis

This is not a bug — it is a methodological trap. The experiment was overpowered: it ran long enough to detect a trivially small difference as "significant."

Why it happened:

  1. The experiment was designed with min_samples=1,950 based on MDE=0.03. But traffic was higher than expected, so it accumulated 12,000 samples per variant.
  2. With 12,000 samples, even a 0.001 difference produces a small p-value because the standard error shrinks with sample size: SE = std / sqrt(n).
  3. The team did not define a minimum practical improvement threshold alongside the statistical significance threshold.
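The shrinking standard error is easy to demonstrate with a two-sample z-test. The sketch below uses the scenario's 0.001 mean difference and an assumed quality-score standard deviation of 0.035 (illustrative; the real variance is not given in the report):

```python
import math

def two_sample_p_value(mean_diff, std, n_per_group):
    """Two-sided p-value for a two-sample z-test with equal n and equal std."""
    se = std * math.sqrt(2.0 / n_per_group)    # SE shrinks as n grows
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))

MEAN_DIFF = 0.001  # 0.883 -> 0.884, far below the 0.03 MDE
STD = 0.035        # assumed quality-score std, for illustration only

p_planned = two_sample_p_value(MEAN_DIFF, STD, 1_950)   # planned sample size
p_actual = two_sample_p_value(MEAN_DIFF, STD, 12_000)   # what actually accrued

print(f"n=1,950  -> p={p_planned:.3f}")
print(f"n=12,000 -> p={p_actual:.3f}")
```

Same effect, same variance: at the planned sample size the difference is nowhere near significant, but letting the experiment accumulate 6x the samples pushes p below 0.05 on its own.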

The cost analysis:

| Metric | Control (T=0.45) | Variant (T=0.50) | Difference |
|---|---|---|---|
| Mean quality score | 0.883 | 0.884 | +0.001 (0.1%) |
| Mean tokens per response | 385 | 470 | +85 (22%) |
| Cost per request | $0.0135 | $0.0165 | +$0.003 (22%) |
| Monthly cost (100k requests) | $1,350 | $1,650 | +$300/month |
| Annual cost impact | — | — | +$3,600/year |

The verdict: Spending $3,600/year for a 0.1% quality improvement that no user can perceive is not justified.

Resolution

flowchart TD
    A[Review Analysis Report] --> B{Effect Size<br/>Cohen d > 0.2?}
    B -- "d=0.014<br/>Negligible" --> C[Check Practical<br/>Threshold]
    C --> D{Quality Improvement<br/>> MDE?}
    D -- "+0.001 < 0.03 MDE" --> E[Reject Variant]
    B -- "d > 0.2" --> F[Proceed to<br/>Cost-Benefit Analysis]
    F --> G{Annual Cost Increase<br/>< 10x Quality Value?}
    G -- Yes --> H[Promote Variant]
    G -- No --> E
    E --> I[Conclude Experiment<br/>Document Learnings]
    I --> J[Keep Control Parameters]
    J --> K[Design Next Experiment<br/>with Larger Parameter Delta]

Decision framework:

  1. Statistical significance alone is insufficient. Always check:
     - Effect size (Cohen's d > 0.2 for at least a "small" practical effect)
     - Confidence interval (does the interval include the MDE?)
     - Cost-benefit ratio (value of the quality improvement vs the cost increase)

  2. Reject the variant. Keep control parameters for product_search.

  3. Document the learning for future experiments.
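The decision framework above can be collapsed into a single promotion gate. A sketch, with thresholds mirroring this scenario; the function name, argument names, and default thresholds are assumptions, not an existing API:

```python
def should_promote(p_value, cohens_d, quality_improvement, mde,
                   annual_cost_increase, annual_quality_value,
                   alpha=0.05, min_effect_size=0.2):
    """Promotion gate: statistical AND practical significance AND cost-benefit.

    Returns (decision, reason). All thresholds are illustrative defaults.
    """
    if p_value >= alpha:
        return False, "not statistically significant"
    if cohens_d < min_effect_size:
        return False, f"effect size d={cohens_d} is negligible (< {min_effect_size})"
    if quality_improvement < mde:
        return False, f"improvement {quality_improvement} is below the MDE {mde}"
    if annual_cost_increase > annual_quality_value:
        return False, "cost increase outweighs the estimated quality value"
    return True, "promote"

# Scenario 2's numbers: the p-value passes, everything else fails
decision, reason = should_promote(
    p_value=0.031, cohens_d=0.014, quality_improvement=0.001, mde=0.03,
    annual_cost_increase=3600, annual_quality_value=0,  # no perceivable user value
)
print(decision, "-", reason)
```

Making the gate a function (rather than a checklist) is what lets CI or the analysis report enforce it mechanically instead of relying on reviewer discipline.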

Prevention

| Prevention Measure | Implementation |
|---|---|
| Mandatory practical significance check | Analysis report must include Cohen's d and an MDE comparison before recommending promotion |
| Cost-benefit ratio in analysis | Every experiment analysis must calculate annual cost impact alongside quality metrics |
| Auto-cap sample size | Stop collecting data at 2x the planned sample size to avoid detecting trivially small effects |
| Define MDE and minimum Cohen's d upfront | Experiment creation requires minimum_practical_improvement (e.g., d > 0.2) as a promotion gate |
| Human review for promotion | Automated systems can recommend but not execute promotions — a human must sign off |

Scenario 3 — Top-k Too Restrictive for Manga Recommendations

Problem Statement

Users are complaining that MangaAssist recommendations are repetitive. The chatbot keeps suggesting the same 10-15 mainstream titles (One Piece, Naruto, Attack on Titan, Demon Slayer) regardless of the user's stated preferences. A user who asks for "dark psychological seinen manga" receives mainstream shonen titles instead of niche seinen recommendations like Homunculus, Ichi the Killer, or Oyasumi Punpun.

Detection

flowchart TD
    A[User Feedback:<br/>'Same recommendations<br/>every time'] --> B[Analyze Recommendation<br/>Diversity Metric]
    B --> C[Query: Unique Titles<br/>per 1000 Recommendations]
    C --> D{Unique Titles<br/>> 50?}
    D -- "Only 14 unique<br/>titles in last 1000" --> E[Confirmed:<br/>Diversity Problem]
    D -- Yes --> F[Check Session-Level<br/>Deduplication]
    E --> G[Check Parameter<br/>Profile]
    G --> H{top_k = ?}
    H -- "top_k = 30" --> I[Root Cause:<br/>top_k Too Restrictive]
    H -- "top_k >= 100" --> J[Check RAG<br/>Index Coverage]

Detection signals:

- Recommendation diversity metric: only 14 unique titles across the last 1,000 recommendation responses (should be 50+)
- Repeat recommendation rate: 87% of recommendations are titles that appear in the top-15 most-recommended list
- User satisfaction for the recommendation intent drops from 4.1/5 to 3.2/5 over two weeks
- "Similar to what I already read" complaints in post-chat surveys increase 3x

Root Cause Analysis

The root cause is top_k=30 combined with a skewed embedding space in OpenSearch.

The interaction between top_k and RAG:

  1. OpenSearch kNN search retrieves the top-10 most similar manga titles based on the user's query embedding
  2. These 10 titles are injected into the prompt as context
  3. The model generates a response choosing from its vocabulary, constrained by top_k=30

With top_k=30, the model can only select from the 30 most probable tokens at each generation step. When the prompt context contains 10 manga titles, the model's probability distribution heavily favors tokens that form the names of well-known titles (because they appeared more frequently in training data). Niche titles like "Oyasumi Punpun" have lower token probabilities and are excluded by the top_k=30 ceiling.

The compounding effect:

- OpenSearch returns mainstream titles more often (they have more embeddings and reviews in the index)
- The model's training data over-represents popular manga
- top_k=30 eliminates the "long tail" of less-probable token sequences that form niche title names
- Result: a feedback loop where only the most popular titles survive the generation process
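Top-k's effect on the long tail can be sketched on toy token scores. The titles and scores below are hypothetical (unnormalized; only the ranking matters for top-k), but the mechanism is the one described above:

```python
def top_k_candidates(token_scores, k):
    """Keep only the k highest-scoring tokens; everything else gets zero chance."""
    ranked = sorted(token_scores.items(), key=lambda kv: kv[1], reverse=True)
    return {token for token, _ in ranked[:k]}

# Toy next-token scores: 40 mainstream titles dominate the head of the
# distribution, 200 niche titles sit in the long tail.
token_scores = {f"mainstream_{i}": 0.9 / 40 * (40 - i) for i in range(40)}
token_scores.update({f"niche_{i}": 0.001 for i in range(200)})

survivors_k30 = top_k_candidates(token_scores, 30)
survivors_k250 = top_k_candidates(token_scores, 250)

niche_at_30 = sum(1 for t in survivors_k30 if t.startswith("niche"))
niche_at_250 = sum(1 for t in survivors_k250 if t.startswith("niche"))
print(f"k=30:  {niche_at_30} niche tokens survive")    # the tail is cut entirely
print(f"k=250: {niche_at_250} niche tokens survive")   # the tail stays samplable
```

With k=30 the cutoff lands inside the mainstream head, so no niche token can ever be sampled regardless of temperature; widening k merely makes the tail reachable, it does not force it to be chosen.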

Resolution

flowchart TD
    A[Confirm: top_k=30 for<br/>recommendation intent] --> B[Increase top_k<br/>from 30 to 250]
    B --> C[Also Increase top_p<br/>from 0.80 to 0.95]
    C --> D[A/B Test: 20% Traffic<br/>for 3 Days]
    D --> E[Measure Diversity:<br/>Unique Titles per 1000]
    E --> F{Diversity > 50<br/>Unique Titles?}
    F -- Yes --> G[Check Quality Score<br/>Not Degraded]
    G --> H{Quality > 0.78?}
    H -- Yes --> I[Promote New<br/>Parameters]
    H -- No --> J[Reduce top_k to 200<br/>Re-test]
    F -- No --> K[Also Check OpenSearch<br/>Index Diversity]
    K --> L[Add Niche Titles<br/>to Embedding Index]
    L --> D

Step-by-step resolution:

  1. Update the recommendation parameter profile:
     - top_k: 30 → 250
     - top_p: 0.80 → 0.95
     - temperature: verify it is still 0.80 (it was already correct)

  2. A/B test the change — do not deploy directly, because the jump from top_k=30 to top_k=250 is large:
     - Control: top_k=30 (current)
     - Variant: top_k=250 (proposed)
     - Traffic: 20% variant for 3 days

  3. Measure recommendation diversity as the primary metric:
     - Unique titles per 1,000 recommendations
     - Genre diversity (number of distinct genres recommended)
     - "Long tail" ratio (percentage of recommendations from titles outside the top-50 most popular)

  4. Check OpenSearch index — Ensure the embedding index contains niche titles. If the index only has 200 mainstream titles, no parameter tuning can fix the diversity problem.
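Two of the step 3 metrics reduce to simple aggregations over a window of response logs. A sketch, assuming the log has been flattened into a list of recommended titles (the field names and toy data are assumptions):

```python
from collections import Counter

def diversity_metrics(recommended_titles, top_popular):
    """Compute diversity metrics over a window of recommendations.

    recommended_titles: flat list of titles, one entry per recommendation
    top_popular: set of the 50 most popular titles in the catalog
    """
    counts = Counter(recommended_titles)
    total = len(recommended_titles)
    long_tail = sum(1 for t in recommended_titles if t not in top_popular)
    return {
        "unique_titles": len(counts),
        "long_tail_ratio": long_tail / total if total else 0.0,
    }

# A degenerate window like the incident: 1,000 recommendations cycling 14 titles
window = [f"title_{i % 14}" for i in range(1000)]
metrics = diversity_metrics(window, top_popular={f"title_{i}" for i in range(50)})
print(metrics)  # {'unique_titles': 14, 'long_tail_ratio': 0.0}
```

Published as a CloudWatch custom metric, this is what the prevention table's "<30 unique titles" alarm would watch.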

Prevention

| Prevention Measure | Implementation |
|---|---|
| Recommendation diversity metric | CloudWatch custom metric tracking unique titles per 1,000 recommendations, with an alarm at <30 |
| Parameter range validation | Recommendation intent top_k must be >= 100; block deployments that set it lower |
| Quarterly index diversity audit | Automated check that OpenSearch contains titles from all genre categories, not just mainstream ones |
| Prompt-level diversity enforcement | System prompt explicitly instructs: "Include at least one lesser-known title in every recommendation set" |
| Session-level deduplication | Track recommended titles per session in DynamoDB; exclude already-recommended titles from future responses |

Scenario 4 — Parameter Profile Mismatch After New Intent Deployment

Problem Statement

The MangaAssist team deploys a new intent type: gift_suggestion — helping users pick manga gifts for friends. The new intent classifier correctly identifies gift_suggestion queries, but the ParameterProfileManager has no profile configured for this intent. The system falls through to the safety fallback (order_status defaults: temperature=0.15, top_k=40), producing responses that are factual but utterly uncreative: "Here is a manga: One Piece Volume 1. Price: $9.99." — no personalization, no enthusiasm, no gift-giving context.

Detection

flowchart TD
    A[New Intent Deployed:<br/>gift_suggestion] --> B[Users Start Asking<br/>Gift Questions]
    B --> C[Responses Are<br/>Dry and Uncreative]
    C --> D[User Satisfaction for<br/>gift_suggestion = 2.1/5]
    D --> E[Check Logs:<br/>ParameterProfileManager]
    E --> F{Log Message:<br/>'Unknown intent,<br/>using order_status<br/>defaults'?}
    F -- Yes --> G[Confirmed: No Profile<br/>for gift_suggestion]
    F -- No --> H[Check Intent<br/>Classifier Accuracy]

Detection signals:

- ParameterProfileManager logs: "Unknown intent=gift_suggestion, using order_status defaults as safest option" appearing 400+ times/hour
- User satisfaction score for the new intent: 2.1/5 (target was 4.0/5)
- CloudWatch metric: fm.temperature grouped by intent shows gift_suggestion at 0.15 instead of the expected 0.70-0.80 range
- Qualitative review: gift suggestion responses lack personality, creativity, and gift-giving context

Root Cause Analysis

The deployment process for new intents has no coupling to the parameter configuration pipeline. The gap:

  1. The ML team deploys the updated intent classifier with the new gift_suggestion label
  2. The backend team updates the routing logic to handle the new intent
  3. Nobody creates a parameter profile in DynamoDB for gift_suggestion
  4. The ParameterProfileManager falls through to the safety default (order_status)

The safety fallback is working as designed — it prevents crashes. But the fallback parameters are completely wrong for a creative intent.
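A fallback that fails loudly instead of silently is the core improvement. A sketch of the lookup, assuming profiles are keyed by intent (the function shape is illustrative; in production the WARNING would also emit the CloudWatch metric used for alerting):

```python
import logging

logger = logging.getLogger("parameter_profiles")

# Safe defaults for a FACTUAL intent; wrong for anything creative
FALLBACK_INTENT = "order_status"

def get_profile_with_fallback(profiles, intent):
    """Return the profile for `intent`, falling back to order_status defaults.

    The WARNING line is what the detection flowchart greps for in the logs.
    """
    profile = profiles.get(intent)
    if profile is not None:
        return profile
    logger.warning(
        "Unknown intent=%s, using %s defaults as safest option",
        intent, FALLBACK_INTENT,
    )
    return profiles[FALLBACK_INTENT]

profiles = {"order_status": {"temperature": 0.15, "top_k": 40}}
p = get_profile_with_fallback(profiles, "gift_suggestion")
print(p)  # {'temperature': 0.15, 'top_k': 40}, factual defaults for a creative intent
```

The fallback keeps the service up, but every hit on it is an unconfigured intent in production, which is exactly why the prevention table alarms on fallback frequency.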

Resolution

flowchart TD
    A[Identify Missing Profile] --> B[Determine Intent Category]
    B --> C{gift_suggestion<br/>is Creative?}
    C -- Yes --> D[Clone 'recommendation'<br/>Profile as Starting Point]
    D --> E[Customize for<br/>Gift Context]
    E --> F[Add System Prompt<br/>for Gift Suggestions]
    F --> G[Deploy Profile<br/>to DynamoDB]
    G --> H[Monitor Quality<br/>for 2 Hours]
    H --> I{Satisfaction > 3.5?}
    I -- Yes --> J[Schedule A/B Test<br/>for Fine-Tuning]
    I -- No --> K[Adjust Parameters<br/>and Re-Monitor]

Step-by-step resolution:

  1. Create the profile by cloning the closest existing intent:

    gift_profile = ParameterProfile(
        intent="gift_suggestion",
        temperature=0.80,       # Creative — same as recommendation
        top_k=250,              # Wide vocabulary for diverse suggestions
        top_p=0.95,             # Maximum diversity
        max_tokens=768,         # Room for personalized explanations
        stop_sequences=["\n\nUser:", "\n\nHuman:"],
        frequency_penalty=0.3,  # Reduce repetitive phrasing
        system_prompt_key="gift_suggestion_v1",
        few_shot_count=3,
        quality_score_target=0.82,
        version="initial-v1",
        updated_at="",
    )
    profile_manager.update_profile(gift_profile)
    

  2. Create the system prompt:

    SYSTEM_PROMPTS["gift_suggestion"] = (
        "You are MangaAssist, a thoughtful gift recommendation specialist for a Japanese manga store. "
        "RULES:\n"
        "1. Ask about the recipient's age, interests, and reading experience.\n"
        "2. Suggest 3-5 titles with gift-appropriate descriptions (no spoilers).\n"
        "3. Include price range and whether a box set or single volume is better for gifting.\n"
        "4. Mention gift wrapping availability if applicable.\n"
        "5. Be warm and enthusiastic — gift giving is exciting."
    )
    

  3. Monitor — Watch user satisfaction for gift_suggestion over the next 2 hours. Target: >3.5/5 immediately, >4.0/5 after A/B tuning.

Prevention

| Prevention Measure | Implementation |
|---|---|
| Intent deployment checklist | Every new intent requires a parameter profile before the intent classifier update goes live |
| CI/CD validation | Deployment pipeline checks that every intent label in the classifier model has a corresponding DynamoDB profile |
| Fallback alerting | When ParameterProfileManager uses the fallback, it emits a WARNING-severity CloudWatch metric. Alarm triggers if the fallback is used >10 times in 5 minutes |
| Intent onboarding template | Standard template for new intents includes: parameter profile, system prompt, few-shot examples, quality evaluation criteria |
| Automated profile scaffolding | When the intent classifier is updated, a Lambda function detects new labels and creates draft profiles by cloning the nearest existing intent's profile |
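The CI/CD validation measure above reduces to a set comparison. A sketch; the label and profile lists are stand-ins (in the pipeline they would come from the classifier artifact and a DynamoDB scan):

```python
def find_unprofiled_intents(classifier_labels, profile_intents):
    """Return classifier labels that have no parameter profile.

    A CI/CD gate would fail the deployment unless this list is empty.
    """
    return sorted(set(classifier_labels) - set(profile_intents))

classifier_labels = [
    "product_search", "order_status", "recommendation",
    "manga_qa", "chitchat", "shipping_info", "gift_suggestion",
]
profile_intents = [
    "product_search", "order_status", "recommendation",
    "manga_qa", "chitchat", "shipping_info",
]

missing = find_unprofiled_intents(classifier_labels, profile_intents)
print(f"Intents without profiles: {missing}")  # ['gift_suggestion'] blocks the deploy
```

Run as a pipeline step, a non-empty result fails the build, so the Scenario 4 gap is caught before the classifier update ever reaches production.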

Scenario 5 — A/B Test Contamination from Request-Level Assignment

Problem Statement

An A/B test for manga_qa temperature (control=0.50 vs variant=0.65) shows no statistically significant difference after 3 weeks and 15,000 samples per variant. The team is puzzled because internal testing clearly showed the variant produced better responses. Upon investigation, they discover that the same user can receive control parameters for one message and variant parameters for the next message within the same conversation session. This cross-contamination corrupts the experiment data.

Detection

flowchart TD
    A[A/B Test Inconclusive<br/>After 3 Weeks] --> B[Review Assignment<br/>Logic]
    B --> C{Assignment Based On<br/>request_id or session_id?}
    C -- request_id --> D[Confirmed: Per-Request<br/>Assignment]
    D --> E[Check Session Logs<br/>for Cross-Variant]
    E --> F[Query: Sessions Where<br/>Both Variants Served]
    F --> G{Contaminated<br/>Sessions > 5%?}
    G -- "42% of sessions<br/>saw both variants" --> H[Confirmed:<br/>Severe Contamination]
    G -- "<5%" --> I[Contamination Not<br/>The Primary Issue]

Detection signals:

- 42% of multi-turn sessions received both control and variant parameters across different turns
- Within-session quality variance is higher than between-session variance (the opposite of what a clean experiment produces)
- User satisfaction surveys show no difference because each user experienced a blend of both configurations
- The experiment's statistical power is effectively halved because contaminated sessions add noise

Root Cause Analysis

The ABTestingFramework's assign_variant method was implemented with request-level hashing instead of session-level hashing. The hash input used request_id (which changes every message) instead of session_id (which is stable for the entire conversation).

The contamination mechanism:

Session ABC-123:
  Turn 1 → hash(experiment_id + request_001) → bucket 0.32 → Control (T=0.50)
  Turn 2 → hash(experiment_id + request_002) → bucket 0.71 → Variant (T=0.65)
  Turn 3 → hash(experiment_id + request_003) → bucket 0.45 → Control (T=0.50)
  Turn 4 → hash(experiment_id + request_004) → bucket 0.83 → Variant (T=0.65)

The user experiences an inconsistent conversation where the chatbot's "personality" changes mid-session. Quality evaluation becomes meaningless because the quality of turn 3 depends on the context established by turns 1 and 2, which used different parameters.

Why this makes the experiment invalid:

  1. User experience is session-level, not request-level. A user rates the overall conversation, not individual messages.
  2. Context dependency: in multi-turn conversations, each response builds on previous ones. Mixing parameters creates hybrid responses that neither control nor variant would produce in isolation.
  3. Statistical noise: contaminated sessions add variance that obscures the true difference between configurations.
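Correct session-level assignment hashes the stable session ID into a bucket in [0, 1], exactly like the trace above but with an input that never changes mid-conversation. A sketch using MD5 (any stable hash works; the 50/50 split and function shape are assumptions):

```python
import hashlib

def assign_variant(experiment_id, session_id, variant_split=0.5):
    """Deterministically map a session to a variant for one experiment.

    Hashing session_id (not request_id) guarantees every turn in a
    conversation sees the same parameters.
    """
    digest = hashlib.md5(f"{experiment_id}:{session_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # stable value in [0, 1]
    return "control" if bucket < variant_split else "variant"

# Every turn in session ABC-123 gets the same assignment, unlike the buggy trace:
turns = [assign_variant("manga_qa_temp_v2", "ABC-123") for _ in range(4)]
print(turns)
```

Because the bucket depends only on (experiment_id, session_id), the assignment is reproducible across ECS tasks with no shared state, which is what makes the unit test in the prevention table trivial to write.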

Resolution

flowchart TD
    A[Identify: request_id<br/>Used for Assignment] --> B[Fix: Change to<br/>session_id Hashing]
    B --> C[Discard Contaminated<br/>Data — 3 Weeks Lost]
    C --> D[Restart Experiment<br/>with Clean Assignment]
    D --> E[Verify: Same Session<br/>Always Gets Same Variant]
    E --> F[Run Validation Query:<br/>Sessions with Mixed Variants]
    F --> G{Mixed Variant<br/>Sessions = 0?}
    G -- Yes --> H[Experiment Running<br/>Cleanly]
    G -- No --> I[Debug Hashing Logic<br/>Check Session ID Stability]

Step-by-step resolution:

  1. Fix the assignment logic — Ensure assign_variant uses session_id:

    # WRONG: Using request_id — changes every message
    hash_input = f"{experiment_id}:{request_id}".encode("utf-8")
    
    # CORRECT: Using session_id — stable for entire conversation
    hash_input = f"{experiment_id}:{session_id}".encode("utf-8")
    

  2. Discard all contaminated data — The 3 weeks of data collected with request-level assignment cannot be salvaged. Mark the experiment as invalidated and create a new one.

  3. Restart the experiment with session-level assignment: - Same control and variant configurations - Reset sample counters to zero - New experiment_id to clearly separate from invalid data

  4. Validate — After 24 hours, run a diagnostic query:

    -- Count sessions that received both variants (should be 0)
    SELECT COUNT(*) AS contaminated_sessions
    FROM (
        SELECT session_id
        FROM ab_results
        WHERE experiment_id = 'manga_qa_temp_v2'
        GROUP BY session_id
        HAVING COUNT(DISTINCT variant_id) > 1
    ) AS mixed_sessions
    

  5. Verify session_id stability — Ensure that session_id does not change during a conversation. Common causes of unstable session IDs: - WebSocket reconnections generating new session IDs - Load balancer session affinity misconfiguration - Client-side session rotation

Prevention

| Prevention Measure | Implementation |
|---|---|
| Session-level assignment enforced | ABTestingFramework constructor validates that the assignment key is session_id, not request_id |
| Contamination check on launch | Every experiment runs a 1-hour canary with an automated contamination check before full launch |
| Unit test for assignment consistency | Test that assign_variant(experiment_id, session_id) returns the same variant for 1,000 consecutive calls with the same session_id |
| Session stability monitoring | CloudWatch metric tracking session_id changes per WebSocket connection. Alert if >0.1% of connections generate multiple session IDs |
| Code review checklist | A/B testing code reviews must include: "Is assignment session-level? Is the session_id stable?" |
| Contamination dashboard | Real-time dashboard showing the percentage of sessions with mixed variant assignments per active experiment |

Cross-Scenario Decision Tree

When an FM performance issue is detected, use this decision tree to route to the correct scenario:

flowchart TD
    A[FM Performance<br/>Issue Detected] --> B{What Type?}

    B -- "Hallucinated/Fabricated<br/>Content" --> C{Which Intent?}
    C -- "Factual Intent<br/>(order_status, shipping_info)" --> D["Scenario 1<br/>Temperature Too High"]
    C -- "Creative Intent<br/>(recommendation, chitchat)" --> E[Check RAG<br/>Retrieval Quality]

    B -- "A/B Test<br/>Result Questionable" --> F{What's Wrong?}
    F -- "Significant p-value<br/>but tiny improvement" --> G["Scenario 2<br/>Statistical vs Practical"]
    F -- "Inconclusive after<br/>long run" --> H{Check Assignment<br/>Logic}
    H -- "Request-level<br/>assignment" --> I["Scenario 5<br/>Contamination"]
    H -- "Session-level<br/>correct" --> J[Check Sample Size<br/>and Power]

    B -- "Repetitive/<br/>Boring Responses" --> K{Which Intent?}
    K -- "Recommendations" --> L["Scenario 3<br/>Top-k Too Restrictive"]
    K -- "New Intent" --> M{Profile<br/>Exists?}
    M -- No --> N["Scenario 4<br/>Missing Profile"]
    M -- Yes --> O[Check Temperature<br/>and System Prompt]

    B -- "Wrong Personality/<br/>Tone" --> P{New Intent<br/>Recently Deployed?}
    P -- Yes --> N
    P -- No --> Q[Check System Prompt<br/>Version and Profile]

    style D fill:#f8d7da
    style G fill:#fff3cd
    style I fill:#f8d7da
    style L fill:#fff3cd
    style N fill:#e1f5fe

Summary — Quick Reference

| Scenario | Symptom | Root Cause | Time to Detect | Time to Resolve |
|---|---|---|---|---|
| 1. Temperature too high | Hallucinated order details | Temperature=0.75 for a factual intent | Minutes (if a hallucination metric exists) | 15 minutes (profile revert + cache clear) |
| 2. Statistical vs practical | Promoting a 0.1% improvement at 22% more cost | No practical significance gate in analysis | Days (at experiment completion) | Immediate (reject variant) |
| 3. Top-k too restrictive | Same 15 titles recommended | top_k=30 eliminates long-tail tokens | Days (diversity metric needed) | Hours (profile update + A/B test) |
| 4. Missing profile | Dry, uncreative responses for new intent | No parameter profile created during deployment | Minutes (fallback log alerts) | 30 minutes (create profile) |
| 5. A/B contamination | Inconclusive experiment after weeks | Request-level instead of session-level assignment | Weeks (after experiment fails to converge) | Days (fix code, discard data, restart) |

Key Takeaways

  1. Factual intents demand low temperature — There is no safe "medium" temperature for order_status or shipping_info. Hallucinated data destroys user trust instantly.
  2. Statistical significance is necessary but not sufficient — Always pair p-values with effect size (Cohen's d) and cost-benefit analysis before promoting any variant.
  3. Recommendation diversity requires wide top_k — Restrictive top_k interacts with biased embeddings to create a feedback loop of only mainstream titles.
  4. New intent deployment must include parameter profiles — The safest fallback in the world cannot produce good responses if it uses the wrong personality and constraints.
  5. Session-level assignment is non-negotiable — Request-level A/B assignment in a multi-turn chatbot invalidates the entire experiment. Validate assignment logic before launching any test.