06. Failure Scenarios and Workarounds
Why This Document Exists
Many production failures look like prompt failures. Some are prompt failures. Many are not.
This document focuses on cases where prompt optimization alone did not reach the goal, and the team had to change another part of the system while still using foundational models effectively.
Case Study Template
Each case includes:
- goal
- initial prompt strategy
- why it failed
- observable symptom
- root cause
- workaround that reached the goal
- tradeoff introduced
Case 1: Hallucinated Prices
Goal
Show product recommendations with helpful natural language that mentions price.
Initial Prompt Strategy
Add the instruction: "Never invent prices. Use only the provided product data."
Why It Failed
The model still occasionally mentioned stale or misplaced prices when multiple products were present.
Observable Symptom
- wrong price attached to the wrong product
- price restated even when missing from one candidate
Root Cause
The FM was still being asked to generate a business-critical fact in free-form text.
Workaround
Remove price generation from the FM responsibility. Let the FM explain product fit only. Let the UI bind price from structured catalog data after validation.
Tradeoff
Slightly less natural prose, much higher trust.
Case 2: Few-Shot Examples Caused Latency Regression
Goal
Improve recommendation explanation quality and tone consistency.
Initial Prompt Strategy
Add 6 rich examples of good recommendation answers.
Why It Failed
Latency and cost increased materially, especially on mobile sessions with long history.
Observable Symptom
- slower first token
- higher token cost
- occasional context-window pressure
Root Cause
The examples were high-quality but too expensive for a high-volume commerce flow.
Workaround
Use 1 or 2 compact exemplars per intent, plus better upstream retrieval and candidate ranking.
Tradeoff
Less stylistic richness than the full few-shot version, but far better operational efficiency.
Case 3: Stronger JSON Instructions Did Not Stop Schema Drift
Goal
Return stable JSON for UI rendering.
Initial Prompt Strategy
Add stricter wording such as "Return valid JSON only" and "Do not include extra keys."
Why It Failed
Nested arrays and optional fields still drifted under complex recommendation prompts.
Observable Symptom
- extra fields
- code fences
- trailing commentary
Root Cause
Prompt instructions alone were handling both semantics and syntax enforcement.
Workaround
Keep the prompt contract, but add parser validation, repair logic, and a fallback response template for malformed outputs.
Tradeoff
More orchestration complexity, far better UI safety.
Case 4: More Retrieved Chunks Made Answers Worse
Goal
Reduce FAQ misses by adding more context.
Initial Prompt Strategy
Increase top-k retrieval from 3 chunks to 8 chunks.
Why It Failed
Answers became longer, less direct, and occasionally contradictory.
Observable Symptom
- vague answers
- mixed policy language
- higher latency
Root Cause
Recall went up, but precision and prompt focus went down.
Workaround
Use stricter metadata filters, reranking, and a smaller chunk budget per intent.
Tradeoff
Lower raw context volume, higher effective grounding quality.
Case 5: History Summaries Lost User Preference Detail
Goal
Preserve multi-turn recommendation quality during long conversations.
Initial Prompt Strategy
Replace old turns with one generic history summary.
Why It Failed
Important recurring preferences such as genre tolerance, language preference, and budget sensitivity were dropped.
Observable Symptom
- later recommendations stopped matching earlier preferences
Root Cause
The summary optimized for brevity but not task relevance.
Workaround
Store preference slots separately from natural-language summary.
Tradeoff
Slightly more memory management complexity, much better long-session personalization.
Case 6: Lower Temperature Hurt Recommendation Quality
Goal
Reduce hallucination in recommendation flows.
Initial Prompt Strategy
Set temperature very low globally.
Why It Failed
The model became repetitive and generic in recommendation wording.
Observable Symptom
- repeated phrasing
- weaker contrast between titles
Root Cause
The system tried to solve factual trust issues through generation randomness control alone.
Workaround
Use deterministic upstream ranking and keep the FM responsible only for explanation. Then use a slightly richer generation setting for recommendation copy while retaining hard factual boundaries.
Tradeoff
More routing logic, better balance between trust and UX.
Case 7: Prompt Hardening Alone Did Not Stop Injection
Goal
Prevent user attempts to override system instructions.
Initial Prompt Strategy
Add strong wording about ignoring malicious requests.
Why It Failed
Indirect attacks through retrieved or pasted content still leaked through edge cases.
Observable Symptom
- occasional off-scope answer
- policy leakage attempts
Root Cause
Untrusted content was still entering the prompt context too freely.
Workaround
Combine prompt hardening with content sanitization, source filtering, trusted vs untrusted separation, and output validation.
Tradeoff
Higher implementation complexity, lower attack surface.
Case 8: FM Could Not Reliably Resolve Ambiguous Multi-Intent Requests
Goal
Handle messages that combine support and shopping intents.
Initial Prompt Strategy
Ask the FM to infer priorities and answer all parts naturally.
Why It Failed
The model often blended operational facts with recommendation language or skipped one of the requests.
Observable Symptom
- one intent ignored
- mixed sections
Root Cause
The prompt delegated workflow orchestration to generation.
Workaround
Split intents upstream and either answer in two distinct sections or generate two separate sub-responses.
Tradeoff
Less conversational elegance, much better completeness and safety.
Case 9: Bedrock Timeout Could Not Be Solved by Prompt Trimming Alone
Goal
Reduce timeout rate during spikes.
Initial Prompt Strategy
Trim prompt size and lower output length.
Why It Failed
Timeouts were driven by load and model path choice, not only prompt size.
Observable Symptom
- degraded response rate during peak periods
Root Cause
Traffic and model selection dominated the problem.
Workaround
Use model tiering, stronger routing to template/API paths, and graceful fallback when the FM path degrades.
Tradeoff
Slightly less FM coverage during peak traffic, better availability.
Case 10: Guardrail Over-Blocking Reduced Useful Answers
Goal
Keep responses safe without making the chatbot feel broken.
Initial Prompt Strategy
Make the system prompt extremely restrictive.
Why It Failed
Many harmless answers became evasive or empty.
Observable Symptom
- high safe-fallback rate
- low user satisfaction for otherwise simple questions
Root Cause
The prompt collapsed nuance into blanket caution.
Workaround
Move business-critical caution into narrower intent-specific prompts and keep downstream guardrails focused on actual risk categories.
Tradeoff
More prompt variants to manage, much better user utility.