06. Failure Scenarios and Workarounds

Why This Document Exists

Many production failures look like prompt failures. Some are prompt failures. Many are not.

This document focuses on cases where prompt optimization alone did not reach the goal, and the team had to change another part of the system while still using foundation models (FMs) effectively.

Case Study Template

Each case includes:

  1. goal
  2. initial prompt strategy
  3. why it failed
  4. observable symptom
  5. root cause
  6. workaround that reached the goal
  7. tradeoff introduced

Case 1: Hallucinated Prices

Goal

Show product recommendations with helpful natural language that mentions price.

Initial Prompt Strategy

Add the instruction: "Never invent prices. Use only the provided product data."

Why It Failed

The model still occasionally mentioned stale or misplaced prices when multiple products were present.

Observable Symptom

  • wrong price attached to the wrong product
  • a price stated for a candidate whose data contained no price at all

Root Cause

The FM was still being asked to generate a business-critical fact in free-form text.

Workaround

Remove price generation from the FM responsibility. Let the FM explain product fit only. Let the UI bind price from structured catalog data after validation.
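
A minimal sketch of this split, assuming a catalog lookup keyed by SKU (`CATALOG`, `render_recommendation`, and the field names are illustrative, not a real API):

```python
# Trusted structured catalog data; in production this comes from the product service.
CATALOG = {
    "sku-123": {"title": "Trail Runner 2", "price_cents": 8999},
}

def render_recommendation(sku: str, fm_explanation: str) -> dict:
    """Bind the price from catalog data after validation; the FM only explains fit."""
    item = CATALOG.get(sku)
    if item is None:
        raise KeyError(f"unknown sku: {sku}")
    return {
        "title": item["title"],
        "price_cents": item["price_cents"],  # bound from the catalog, never from FM text
        "why": fm_explanation,               # free-form prose is the FM's only job
    }
```

The FM never sees or emits a price, so a wrong price can only come from the catalog itself.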

Tradeoff

Slightly less natural prose, much higher trust.

Case 2: Few-Shot Examples Caused Latency Regression

Goal

Improve recommendation explanation quality and tone consistency.

Initial Prompt Strategy

Add 6 rich examples of good recommendation answers.

Why It Failed

Latency and cost increased materially, especially on mobile sessions with long history.

Observable Symptom

  • slower first token
  • higher token cost
  • occasional context-window pressure

Root Cause

The examples were high-quality but too expensive for a high-volume commerce flow.

Workaround

Use 1 or 2 compact exemplars per intent, plus better upstream retrieval and candidate ranking.
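
One way to keep the exemplar budget small is a per-intent lookup; the table contents below are hypothetical placeholders:

```python
# One compact exemplar per intent instead of six rich ones in every prompt.
EXEMPLARS = {
    "movies": 'Q: Something like Heat? A: "Tense, character-driven crime with standout set pieces."',
    "default": "A: Explain fit in one sentence using only the provided facts.",
}

def few_shot_block(intent: str) -> str:
    """Pick a single compact exemplar for the detected intent."""
    return EXEMPLARS.get(intent, EXEMPLARS["default"])
```

Token cost per request now scales with one exemplar, not with the full style guide.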

Tradeoff

Less stylistic richness than the full few-shot version, but far better operational efficiency.

Case 3: Stronger JSON Instructions Did Not Stop Schema Drift

Goal

Return stable JSON for UI rendering.

Initial Prompt Strategy

Add stricter wording such as "Return valid JSON only" and "Do not include extra keys."

Why It Failed

Nested arrays and optional fields still drifted under complex recommendation prompts.

Observable Symptom

  • extra fields
  • code fences
  • trailing commentary

Root Cause

Prompt instructions alone were handling both semantics and syntax enforcement.

Workaround

Keep the prompt contract, but add parser validation, repair logic, and a fallback response template for malformed outputs.
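
A sketch of that validation-and-repair layer, assuming a UI contract with an `items` list and a `note` string (the schema and `FALLBACK` payload are illustrative):

```python
import json
import re

# Fallback template returned whenever repair fails, so the UI always gets valid shape.
FALLBACK = {"items": [], "note": "recommendations temporarily unavailable"}

def parse_fm_json(raw: str) -> dict:
    """Validate FM output against the UI contract; repair fences, else fall back."""
    # Repair step: strip markdown code fences the model sometimes adds.
    cleaned = re.sub(r"^```(?:json)?|```$", "", raw.strip(), flags=re.MULTILINE).strip()
    # Repair step: ignore trailing commentary outside the outermost JSON object.
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        return FALLBACK
    try:
        data = json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return FALLBACK
    # Schema enforcement: keep only the keys the UI contract allows.
    return {"items": data.get("items", []), "note": str(data.get("note", ""))}
```

The prompt still asks for clean JSON; this layer just guarantees the UI never sees drift.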

Tradeoff

More orchestration complexity, far better UI safety.

Case 4: More Retrieved Chunks Made Answers Worse

Goal

Reduce FAQ misses by adding more context.

Initial Prompt Strategy

Increase top-k retrieval from 3 chunks to 8 chunks.

Why It Failed

Answers became longer, less direct, and occasionally contradictory.

Observable Symptom

  • vague answers
  • mixed policy language
  • higher latency

Root Cause

Recall went up, but precision and prompt focus went down.

Workaround

Use stricter metadata filters, reranking, and a smaller chunk budget per intent.
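
The selection logic can be sketched as follows; the chunk fields and the score-sort stand-in for a real reranker are assumptions:

```python
def select_chunks(chunks: list[dict], intent: str, budget: int = 3) -> list[dict]:
    """Filter by intent metadata, rerank, and keep a small per-intent chunk budget."""
    eligible = [c for c in chunks if c["intent"] == intent]   # strict metadata filter
    eligible.sort(key=lambda c: c["score"], reverse=True)     # stand-in for a reranking model
    return eligible[:budget]                                  # small budget keeps the prompt focused
```

Fewer, better-targeted chunks beat a larger top-k that dilutes the answer.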

Tradeoff

Lower raw context volume, higher effective grounding quality.

Case 5: History Summaries Lost User Preference Detail

Goal

Preserve multi-turn recommendation quality during long conversations.

Initial Prompt Strategy

Replace old turns with one generic history summary.

Why It Failed

Important recurring preferences such as genre tolerance, language preference, and budget sensitivity were dropped.

Observable Symptom

  • later recommendations stopped matching earlier preferences

Root Cause

The summary optimized for brevity but not task relevance.

Workaround

Store durable preference slots separately from the natural-language summary, so they survive history compression.
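
A minimal sketch of that separation, with hypothetical slot names:

```python
from dataclasses import dataclass, field

@dataclass
class SessionMemory:
    """Durable preference slots kept outside the lossy history summary."""
    summary: str = ""
    slots: dict = field(default_factory=dict)  # e.g. genre, language, budget

    def update_slot(self, name: str, value: str) -> None:
        self.slots[name] = value  # slots are never summarized away

    def prompt_context(self) -> str:
        """Inject both the summary and the stable preferences into the prompt."""
        prefs = "; ".join(f"{k}={v}" for k, v in sorted(self.slots.items()))
        return f"Summary: {self.summary}\nStable preferences: {prefs}"
```

The summarizer can compress turns freely because preferences live in their own structure.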

Tradeoff

Slightly more memory management complexity, much better long-session personalization.

Case 6: Lower Temperature Hurt Recommendation Quality

Goal

Reduce hallucination in recommendation flows.

Initial Prompt Strategy

Set temperature very low globally.

Why It Failed

The model became repetitive and generic in recommendation wording.

Observable Symptom

  • repeated phrasing
  • weaker contrast between titles

Root Cause

The system tried to solve factual trust issues through generation randomness control alone.

Workaround

Use deterministic upstream ranking and keep the FM responsible only for explanation. Then use a slightly richer generation setting for recommendation copy while retaining hard factual boundaries.
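
Per-task generation profiles make this split explicit; the parameter values below are illustrative assumptions, not tuned settings:

```python
# Deterministic settings where facts matter; slightly richer for recommendation copy.
GENERATION_PROFILES = {
    "factual_answer": {"temperature": 0.1, "top_p": 0.9},
    "recommendation_copy": {"temperature": 0.7, "top_p": 0.95},
}

def settings_for(task: str) -> dict:
    """Route each task to its generation profile; default to the strict one."""
    return GENERATION_PROFILES.get(task, GENERATION_PROFILES["factual_answer"])
```

Ranking stays deterministic upstream; only the wording gets the looser profile.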

Tradeoff

More routing logic, better balance between trust and UX.

Case 7: Prompt Hardening Alone Did Not Stop Injection

Goal

Prevent user attempts to override system instructions.

Initial Prompt Strategy

Add strong wording about ignoring malicious requests.

Why It Failed

Indirect attacks delivered through retrieved or pasted content still slipped through in edge cases.

Observable Symptom

  • occasional off-scope answer
  • policy leakage attempts

Root Cause

Untrusted content was still entering the prompt context too freely.

Workaround

Combine prompt hardening with content sanitization, source filtering, trusted vs untrusted separation, and output validation.
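
A sketch of the layered approach, with an intentionally tiny sanitization pattern (real filter lists are far broader and maintained over time):

```python
import re

def sanitize_untrusted(text: str) -> str:
    """Strip one common instruction-override phrasing from untrusted content."""
    # Illustrative single pattern; production filters cover many more variants.
    return re.sub(r"(?i)ignore (all )?(previous|prior) instructions[^.]*\.?", "[removed]", text)

def build_prompt(system: str, untrusted_context: str, user_msg: str) -> str:
    """Keep trusted instructions and untrusted content in clearly delimited sections."""
    return (
        f"{system}\n\n"
        f"<untrusted_context>\n{sanitize_untrusted(untrusted_context)}\n</untrusted_context>\n\n"
        f"User: {user_msg}"
    )
```

Output validation would then run on the response before it reaches the user, closing the loop.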

Tradeoff

Higher implementation complexity, lower attack surface.

Case 8: FM Could Not Reliably Resolve Ambiguous Multi-Intent Requests

Goal

Handle messages that combine support and shopping intents.

Initial Prompt Strategy

Ask the FM to infer priorities and answer all parts naturally.

Why It Failed

The model often blended operational facts with recommendation language or skipped one of the requests.

Observable Symptom

  • one intent ignored
  • mixed sections

Root Cause

The prompt delegated workflow orchestration to generation.

Workaround

Split intents upstream and either answer in two distinct sections or generate two separate sub-responses.
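
The upstream split can be sketched with a naive keyword router; a production system would use a trained classifier, and the keyword sets here are placeholders:

```python
import re

SUPPORT_KEYWORDS = {"refund", "order", "delivery", "return", "cancel"}
SHOPPING_KEYWORDS = {"recommend", "suggest", "similar", "buy"}

def split_intents(message: str) -> list[str]:
    """Detect which intents a message contains so each gets its own section."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    intents = []
    if words & SUPPORT_KEYWORDS:
        intents.append("support")
    if words & SHOPPING_KEYWORDS:
        intents.append("shopping")
    return intents or ["general"]
```

Each detected intent is then answered by its own sub-prompt, so neither request gets dropped.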

Tradeoff

Less conversational elegance, much better completeness and safety.

Case 9: Bedrock Timeout Could Not Be Solved by Prompt Trimming Alone

Goal

Reduce timeout rate during spikes.

Initial Prompt Strategy

Trim prompt size and lower output length.

Why It Failed

Timeouts were driven by load and model path choice, not only prompt size.

Observable Symptom

  • degraded response rate during peak periods

Root Cause

Traffic and model selection dominated the problem.

Workaround

Use model tiering, stronger routing to template/API paths, and graceful fallback when the FM path degrades.
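
The tiering-and-fallback control flow can be sketched as below; the tier ordering and `TimeoutError` as the degradation signal are simplifying assumptions:

```python
def answer_with_fallback(query: str, fm_paths: list, template_answer) -> str:
    """Try FM tiers in order; fall back to a template path when all degrade."""
    for call in fm_paths:          # e.g. [call_large_model, call_small_model]
        try:
            return call(query)
        except TimeoutError:
            continue               # degrade to the next tier instead of failing the request
    return template_answer(query)  # graceful fallback keeps availability up
```

During spikes, the router can also shrink `fm_paths` up front so more traffic lands on the template path directly.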

Tradeoff

Slightly less FM coverage during peak traffic, better availability.

Case 10: Guardrail Over-Blocking Reduced Useful Answers

Goal

Keep responses safe without making the chatbot feel broken.

Initial Prompt Strategy

Make the system prompt extremely restrictive.

Why It Failed

Many harmless answers became evasive or empty.

Observable Symptom

  • high safe-fallback rate
  • low user satisfaction for otherwise simple questions

Root Cause

The prompt collapsed nuance into blanket caution.

Workaround

Move business-critical caution into narrower intent-specific prompts and keep downstream guardrails focused on actual risk categories.
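
One way to manage the narrower prompts is a per-intent caution table; the intent names and clauses below are hypothetical:

```python
# Narrow caution clauses attached per intent, instead of one blanket-restrictive prompt.
CAUTION_BY_INTENT = {
    "payments": "Never state or confirm payment details.",
    "health_adjacent": "Do not give medical advice; point to official product documentation.",
}

def system_prompt(base: str, intent: str) -> str:
    """Append a caution clause only when the intent actually carries that risk."""
    clause = CAUTION_BY_INTENT.get(intent)
    return f"{base}\n{clause}" if clause else base
```

Low-risk intents keep the plain base prompt, so simple questions stop hitting the safe-fallback path.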

Tradeoff

More prompt variants to manage, much better user utility.