Prompt Optimization Offline Workflow
A systematic approach to improving LLM prompts without burning through production API budgets. This document covers the complete offline workflow — from baseline measurement through variant testing to production deployment.
The Prompt Optimization Lifecycle
flowchart TD
START["Identify Prompt Issue<br/>(metric regression, user complaint, or new feature)"]
START --> BASELINE["Step 1: Establish Baseline<br/>Run current prompt against golden dataset"]
BASELINE --> ANALYZE["Step 2: Failure Analysis<br/>Categorize where/why prompt fails"]
ANALYZE --> HYPOTHESIZE["Step 3: Form Hypotheses<br/>Design 2-4 prompt variants"]
HYPOTHESIZE --> LOCAL["Step 4: Local Smoke Test<br/>Run variants on Ollama (free)"]
LOCAL --> EVAL["Step 5: Automated Evaluation<br/>Score with BERTScore, heuristics"]
EVAL --> SELECT["Step 6: Select Top Variant<br/>Statistical comparison"]
SELECT --> SHADOW["Step 7: Shadow Deployment<br/>Run winner alongside production"]
SHADOW --> DEPLOY["Step 8: Canary Deployment<br/>1% → 10% → 50% → 100%"]
style START fill:#6c5ce7,color:#fff
style DEPLOY fill:#00b894,color:#fff
Step 1: Establish Baseline
Before changing anything, measure how the current prompt performs. This becomes the benchmark every variant must beat.
Baseline Measurement Setup
flowchart LR
GP["Golden Dataset<br/>(200-500 cases)"]
GP --> RUN["Run Current Prompt"]
RUN --> SCORE["Score Outputs"]
SCORE --> REPORT["Baseline Report"]
REPORT --> M1["BERTScore F1: 0.84"]
REPORT --> M2["Halluc Rate: 3.2%"]
REPORT --> M3["Format Pass: 91%"]
REPORT --> M4["Avg Latency: 1.4s"]
REPORT --> M5["Avg Tokens: 287"]
Baseline Collection Script
import json
import time
from datetime import datetime
from statistics import mean
from typing import Dict, List

from numpy import percentile

# Assumed project helpers: invoke_bedrock, evaluate_response, aggregate_by_intent
def collect_baseline(
golden_dataset: List[Dict],
prompt_template: str,
model_id: str = "anthropic.claude-3-5-sonnet",
output_path: str = "baselines/"
) -> Dict:
"""Run current prompt against entire golden dataset and record results."""
results = []
for case in golden_dataset:
# Build the full prompt from template + case context
full_prompt = prompt_template.format(
intent=case["intent"],
query=case["query"],
context=case.get("retrieved_context", ""),
conversation_history=case.get("history", [])
)
start_time = time.time()
response = invoke_bedrock(
model_id=model_id,
prompt=full_prompt,
max_tokens=500,
temperature=0.3
)
latency = time.time() - start_time
# Score against expected output
scores = evaluate_response(
generated=response.text,
expected=case["expected_response"],
intent=case["intent"],
context=case.get("retrieved_context", "")
)
results.append({
"case_id": case["id"],
"intent": case["intent"],
"scores": scores,
"latency": latency,
"tokens_in": response.usage.input_tokens,
"tokens_out": response.usage.output_tokens,
"raw_response": response.text,
})
# Aggregate metrics
baseline = {
"timestamp": datetime.utcnow().isoformat(),
"model_id": model_id,
"prompt_version": hash(prompt_template),
"dataset_size": len(golden_dataset),
"metrics": {
"bert_score_f1_mean": mean([r["scores"]["bert_score_f1"] for r in results]),
"bert_score_f1_p25": percentile([r["scores"]["bert_score_f1"] for r in results], 25),
"hallucination_rate": mean([r["scores"]["hallucinated"] for r in results]),
"format_pass_rate": mean([r["scores"]["format_correct"] for r in results]),
"avg_latency_sec": mean([r["latency"] for r in results]),
"avg_tokens_out": mean([r["tokens_out"] for r in results]),
"total_cost_usd": sum([
r["tokens_in"] * 0.003 / 1000 + r["tokens_out"] * 0.015 / 1000
for r in results
]),
},
"per_intent": aggregate_by_intent(results),
"worst_cases": sorted(results, key=lambda x: x["scores"]["bert_score_f1"])[:10],
}
# Save baseline for future comparison
output_file = f"{output_path}/baseline_{datetime.utcnow().strftime('%Y%m%d_%H%M')}.json"
with open(output_file, "w") as f:
json.dump(baseline, f, indent=2)
return baseline
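The `aggregate_by_intent` helper referenced above is not shown. A minimal sketch, assuming each result dict carries the `intent`, `scores`, and `tokens_out` fields built in the loop:

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_intent(results: list) -> dict:
    """Group per-case results by intent and average the key metrics.

    Sketch of the helper used in collect_baseline; field names match
    the result dicts assembled there.
    """
    by_intent = defaultdict(list)
    for r in results:
        by_intent[r["intent"]].append(r)
    return {
        intent: {
            "count": len(cases),
            "bert_score_f1_mean": mean(c["scores"]["bert_score_f1"] for c in cases),
            "format_pass_rate": mean(c["scores"]["format_correct"] for c in cases),
            "avg_tokens_out": mean(c["tokens_out"] for c in cases),
        }
        for intent, cases in by_intent.items()
    }
```

Per-intent slices like these are what later catch a regression on one intent hiding inside a flat average.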
Step 2: Failure Analysis
Understand why the current prompt fails before attempting to fix it. Blindly editing prompts without diagnosis leads to oscillating between broken states.
Failure Categories
pie title Typical Prompt Failure Breakdown
"Hallucinated Facts" : 28
"Wrong Format" : 22
"Missed Context" : 18
"Too Verbose" : 14
"Off-Topic" : 10
"Contradicted History" : 8
Systematic Analysis
def analyze_failures(baseline_results: List[Dict]) -> Dict:
"""Categorize per-case failures by root cause to guide prompt edits."""
failures = [r for r in baseline_results if r["scores"]["overall"] < 0.7]
categories = {
"hallucinated_facts": [],
"wrong_format": [],
"missed_context": [],
"too_verbose": [],
"off_topic": [],
"contradicted_history": [],
"wrong_intent_handling": [],
}
for f in failures:
# Check for hallucination
if f["scores"]["hallucinated"]:
categories["hallucinated_facts"].append(f)
# Check format compliance
if not f["scores"]["format_correct"]:
categories["wrong_format"].append(f)
# Check if retrieved context was actually used
if f["scores"]["context_utilization"] < 0.3:
categories["missed_context"].append(f)
# Check verbosity
if f["tokens_out"] > 400 and f["scores"]["bert_score_f1"] < 0.7:
categories["too_verbose"].append(f)
# Check topic alignment
if f["scores"]["intent_match"] < 0.5:
categories["off_topic"].append(f)
# contradicted_history and wrong_intent_handling need history-aware
# scores not computed here; populate them the same way once available
return {
category: {
"count": len(cases),
"pct": len(cases) / len(failures) * 100 if failures else 0,
"examples": cases[:3], # Top 3 examples for inspection
}
for category, cases in categories.items()
}
Reading Failure Analysis Results
| Failure Category | Count | Percentage | Likely Prompt Fix |
|---|---|---|---|
| Hallucinated Facts | 14 | 28% | Add explicit instruction: "Only use information from the provided context" |
| Wrong Format | 11 | 22% | Add format examples with exact JSON schema in prompt |
| Missed Context | 9 | 18% | Move context earlier in prompt, add "You MUST reference the context below" |
| Too Verbose | 7 | 14% | Add word limit: "Keep responses under 100 words" |
| Off-Topic | 5 | 10% | Add scope restriction: "Only answer questions about manga and orders" |
| Contradicted History | 4 | 8% | Add: "If the conversation history contradicts the current context, prefer the current context" |
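The table above is just the analysis output sorted by failure share. A sketch of that sort (`prioritize_fixes` is a hypothetical name; `analysis` is the dict returned by `analyze_failures`):

```python
def prioritize_fixes(analysis: dict) -> list:
    """Order failure categories by count, largest first, dropping empty ones."""
    ranked = sorted(analysis.items(), key=lambda kv: kv[1]["count"], reverse=True)
    return [
        {"category": cat, "count": info["count"], "pct": round(info["pct"], 1)}
        for cat, info in ranked
        if info["count"] > 0
    ]
```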
Step 3: Design Prompt Variants
Based on the failure analysis, design targeted variants. The key principle: change one thing at a time so you know what caused improvement.
Variant Design Strategy
flowchart TD
BASE["Baseline Prompt (v2.1)"]
BASE --> VA["Variant A:<br/>Add grounding instruction<br/>'Only use provided context'"]
BASE --> VB["Variant B:<br/>Add format examples<br/>with JSON schema"]
BASE --> VC["Variant C:<br/>Restructure sections<br/>(context before instructions)"]
BASE --> VD["Variant D:<br/>Add conciseness constraint<br/>'Under 100 words'"]
VA --> COMBINED["After individual testing:<br/>Combine winning elements<br/>into Variant E"]
VB --> COMBINED
VC --> COMBINED
VD --> COMBINED
style COMBINED fill:#00b894,color:#fff
Prompt Template Registry
PROMPT_VARIANTS = {
"v2.1_baseline": {
"template": """You are MangaAssist, a helpful assistant for manga e-commerce.
Context: {context}
Conversation History: {history}
User Query: {query}
Provide a helpful response.""",
"hypothesis": "Current production prompt",
"changed_from_baseline": None,
},
"v2.2_grounded": {
"template": """You are MangaAssist, a helpful assistant for manga e-commerce.
IMPORTANT: Only use information from the context below. If the context does not contain
enough information to answer, say "I don't have that information" rather than guessing.
Context: {context}
Conversation History: {history}
User Query: {query}
Provide a helpful response based ONLY on the context above.""",
"hypothesis": "Grounding instruction reduces hallucination",
"changed_from_baseline": "Added grounding instructions at top and bottom",
},
"v2.2_structured": {
"template": """You are MangaAssist, a helpful assistant for manga e-commerce.
Context: {context}
Conversation History: {history}
User Query: {query}
Respond in the following format:
- If the query is about a product: name, price, availability, then a 1-2 sentence description
- If the query is about an order: status, expected date, next steps
- If the query is a general question: direct answer in under 80 words
Response:""",
"hypothesis": "Explicit format instructions reduce wrong-format failures",
"changed_from_baseline": "Added response format specification",
},
"v2.2_context_first": {
"template": """Context about relevant manga products and orders:
{context}
Previous conversation:
{history}
You are MangaAssist, a helpful assistant for manga e-commerce.
You MUST reference the context above when answering. Do not make up information.
User: {query}
MangaAssist:""",
"hypothesis": "Context placed before instructions improves context utilization",
"changed_from_baseline": "Moved context section above system instruction",
},
"v2.2_concise": {
"template": """You are MangaAssist, a helpful assistant for manga e-commerce.
Keep responses under 80 words. Be direct and specific.
Context: {context}
Conversation History: {history}
User Query: {query}
Provide a concise response (under 80 words):""",
"hypothesis": "Conciseness constraint reduces verbosity without hurting quality",
"changed_from_baseline": "Added 80-word limit instruction",
},
}
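Rendering a registered variant against one golden-dataset case is a single `format` call. A sketch using a one-entry demo registry that mirrors the `PROMPT_VARIANTS` shape (the case values are made up for illustration):

```python
# Demo registry mirroring the PROMPT_VARIANTS shape above
DEMO_VARIANTS = {
    "v2.2_grounded": {
        "template": "Context: {context}\nHistory: {history}\nUser Query: {query}\nAnswer from the context only.",
        "hypothesis": "Grounding instruction reduces hallucination",
    },
}

def render_variant(variants: dict, name: str, case: dict) -> str:
    """Fill a registered template with one golden-dataset case."""
    return variants[name]["template"].format(
        context=case.get("retrieved_context", ""),
        history=case.get("history", ""),
        query=case["query"],
    )

case = {
    "query": "Is One Piece vol. 1 in stock?",
    "retrieved_context": "One Piece vol. 1: $9.99, in stock.",
}
prompt = render_variant(DEMO_VARIANTS, "v2.2_grounded", case)
```

Keeping rendering in one helper means every evaluation stage builds prompts identically, so a variant never wins because of a formatting difference in the harness.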
Step 4: Local Smoke Test (Free)
Before spending a cent on Bedrock, validate variants locally. Ollama with an open model catches ~60% of structural and behavioral prompt issues.
flowchart LR
VARIANT["Prompt<br/>Variant"] --> OLLAMA["Ollama<br/>Llama 3 8B"]
OLLAMA --> CHECK["Structural Checks"]
CHECK --> P1["Format OK?"]
CHECK --> P2["Uses context?"]
CHECK --> P3["Under token limit?"]
CHECK --> P4["Stays in persona?"]
P1 -->|"Fail"| DISCARD["Discard variant"]
P4 -->|"Fail"| DISCARD
P1 -->|"Pass"| PROCEED["Advance to<br/>Bedrock evaluation"]
P2 -->|"Pass"| PROCEED
P3 -->|"Pass"| PROCEED
P4 -->|"Pass"| PROCEED
style DISCARD fill:#e17055,color:#fff
style PROCEED fill:#00b894,color:#fff
Local Testing Configuration
import ollama
def local_smoke_test(
variant_name: str,
prompt_template: str,
test_cases: list, # Subset: 20-30 cases from golden dataset
) -> dict:
"""Fast, free validation using local Ollama before committing Bedrock spend."""
results = {"passed": 0, "failed": 0, "issues": []}
for case in test_cases:
full_prompt = prompt_template.format(
context=case.get("retrieved_context", ""),
history=case.get("history", ""),
query=case["query"]
)
response = ollama.chat(
model="llama3:8b",
messages=[{"role": "user", "content": full_prompt}],
options={"num_predict": 500, "temperature": 0.3}
)
text = response["message"]["content"]
# Structural checks (free, deterministic)
checks = {
"not_empty": len(text.strip()) > 0,
"not_too_long": len(text.split()) < 200,
"stays_in_persona": "as an AI" not in text.lower(),
"no_system_leak": "SYSTEM:" not in text and "INSTRUCTIONS:" not in text,
"uses_context": any(
keyword in text.lower()
for keyword in extract_keywords(case.get("retrieved_context", ""))
) if case.get("retrieved_context") else True,
"correct_format": check_format(text, case["intent"]),
}
if all(checks.values()):
results["passed"] += 1
else:
results["failed"] += 1
results["issues"].append({
"case_id": case["id"],
"failed_checks": [k for k, v in checks.items() if not v],
"response_preview": text[:200]
})
results["pass_rate"] = results["passed"] / len(test_cases)
# Gate: must pass >= 80% structural checks to proceed
results["advance_to_bedrock"] = results["pass_rate"] >= 0.80
print(f"[{variant_name}] Pass rate: {results['pass_rate']:.0%} "
f"→ {'ADVANCE' if results['advance_to_bedrock'] else 'DISCARD'}")
return results
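The structural checks above lean on two helpers, `extract_keywords` and `check_format`, that are not shown. Minimal sketches, with the intent names and format rules as assumptions (a real version might use TF-IDF keywords and stricter schemas):

```python
import re

def extract_keywords(context: str, min_len: int = 5) -> set:
    """Crude keyword extraction: distinctive lowercase tokens from the context."""
    tokens = re.findall(r"[a-zA-Z]{%d,}" % min_len, context.lower())
    stopwords = {"about", "there", "which", "their", "would", "these"}
    return set(tokens) - stopwords

def check_format(text: str, intent: str) -> bool:
    """Minimal per-intent format check; intent names here are assumptions."""
    if intent == "order_tracking":
        return bool(re.search(r"status", text, re.IGNORECASE))
    if intent == "product_info":
        return bool(re.search(r"\$\d", text))  # expects a price somewhere
    return len(text.split()) <= 120  # general answers stay short
```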
What Local Testing Catches vs. Misses
| Catches | Misses |
|---|---|
| Prompt causes format violations | Subtle quality differences between Llama and Claude |
| Prompt causes system prompt leakage | Hallucination rates (different training data) |
| Prompt causes persona breaks | Exact tone and style matching |
| Prompt causes infinite loops | Model-specific instruction following quirks |
| Response length issues | Exact BERTScore metrics |
| Context not being referenced | Fine-grained semantic accuracy |
Step 5: Automated Evaluation on Bedrock
For variants that pass local smoke tests, run full evaluation on Bedrock against the golden dataset. This is where you spend money — so make it count.
Cost-Controlled Evaluation
flowchart TD
VARIANTS["3-4 Surviving Variants"]
VARIANTS --> SAMPLE["Phase 1: Sample Evaluation<br/>50 cases per variant<br/>Cost: ~$1.50"]
SAMPLE --> GATE{"Any variant<br/>beats baseline<br/>by > 3%?"}
GATE -->|"No"| STOP["Stop. Current prompt is fine.<br/>Optimize elsewhere."]
GATE -->|"Yes"| FULL["Phase 2: Full Evaluation<br/>200+ cases for top 2 variants<br/>Cost: ~$3.00"]
FULL --> STATS["Statistical Comparison<br/>Paired t-test per metric"]
STATS --> WINNER{"Winner with<br/>p < 0.05?"}
WINNER -->|"No"| STOP2["Difference not significant.<br/>Keep baseline."]
WINNER -->|"Yes"| PROMOTE["Promote to Shadow Mode"]
style STOP fill:#e17055,color:#fff
style STOP2 fill:#e17055,color:#fff
style PROMOTE fill:#00b894,color:#fff
Evaluation Script
import numpy as np
from scipy import stats
def evaluate_variants_on_bedrock(
variants: dict,
golden_dataset: list,
baseline_results: dict,
phase: str = "sample", # "sample" (50 cases) or "full" (all cases)
) -> dict:
"""Evaluate prompt variants against golden dataset on actual Bedrock model."""
if phase == "sample":
# Stratified sample: ensure each intent is represented
dataset = stratified_sample(golden_dataset, n=50)
else:
dataset = golden_dataset
all_results = {}
for name, variant in variants.items():
if name.endswith("_baseline"):
continue # Already have baseline
results = []
for case in dataset:
full_prompt = variant["template"].format(
context=case.get("retrieved_context", ""),
history=case.get("history", ""),
query=case["query"]
)
response = invoke_bedrock(
model_id="anthropic.claude-3-5-sonnet",
prompt=full_prompt,
max_tokens=500,
temperature=0.3
)
scores = evaluate_response(
generated=response.text,
expected=case["expected_response"],
intent=case["intent"],
context=case.get("retrieved_context", "")
)
results.append({
"case_id": case["id"],
"intent": case["intent"],
"scores": scores,
"tokens_out": response.usage.output_tokens,
})
all_results[name] = {
"results": results,
"metrics": aggregate_metrics(results),
}
# Statistical comparison against baseline, paired by case_id so the
# t-test compares scores on the same cases (not equal-length slices)
baseline_by_case = {
r["case_id"]: r["scores"]["bert_score_f1"]
for r in baseline_results["results"]
}
comparison = {}
for name, data in all_results.items():
variant_scores = [r["scores"]["bert_score_f1"] for r in data["results"]]
baseline_scores = [baseline_by_case[r["case_id"]] for r in data["results"]]
t_stat, p_value = stats.ttest_rel(variant_scores, baseline_scores)
comparison[name] = {
"bert_score_delta": np.mean(variant_scores) - np.mean(baseline_scores),
"p_value": p_value,
"significant": p_value < 0.05,
"better": np.mean(variant_scores) > np.mean(baseline_scores),
"hallucination_rate": data["metrics"]["hallucination_rate"],
"avg_tokens": data["metrics"]["avg_tokens_out"],
}
return comparison
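The `stratified_sample` helper used in the sample phase is not shown. A minimal sketch, assuming each case carries an "intent" key; the fixed seed keeps the sample reproducible across variants:

```python
import random
from collections import defaultdict

def stratified_sample(dataset: list, n: int, seed: int = 42) -> list:
    """Sample n cases so each intent keeps roughly its share of the dataset."""
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for case in dataset:
        by_intent[case["intent"]].append(case)
    sample = []
    for intent, cases in by_intent.items():
        k = max(1, round(n * len(cases) / len(dataset)))
        sample.extend(rng.sample(cases, min(k, len(cases))))
    return sample[:n]
```

Stratifying matters because a plain random 50-case sample can under-represent a rare intent and hide a regression there entirely.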
Reading Evaluation Results
Example output after running 4 variants:
| Variant | BERTScore Δ | p-value | Halluc Rate | Avg Tokens | Verdict |
|---|---|---|---|---|---|
| v2.2_grounded | +0.06 | 0.003 | 1.2% (↓62%) | 265 | Winner (hallucination fix) |
| v2.2_structured | +0.03 | 0.041 | 2.8% | 198 (↓31%) | Winner (format fix) |
| v2.2_context_first | +0.01 | 0.312 | 3.0% | 290 | Not significant — discard |
| v2.2_concise | −0.02 | 0.089 | 3.5% | 142 (↓50%) | Quality dropped — discard |
Reading this table:
- v2.2_grounded significantly improved quality AND reduced hallucinations. This is the primary winner.
- v2.2_structured helped with format compliance but the quality improvement was smaller.
- v2.2_context_first didn't help — moving context position alone wasn't enough.
- v2.2_concise reduced tokens but hurt quality. The 80-word limit was too aggressive.
Step 6: Combine Winning Elements
After identifying which individual changes help, combine them into a final candidate.
flowchart LR
VA["Grounding<br/>(from v2.2_grounded)"]
VB["Format spec<br/>(from v2.2_structured)"]
VA --> COMBINED["v2.3_combined"]
VB --> COMBINED
COMBINED --> EVAL["Evaluate combined<br/>against baseline<br/>AND individual winners"]
EVAL --> BETTER{"Combined better<br/>than individuals?"}
BETTER -->|"Yes"| SHIP["Ship v2.3_combined"]
BETTER -->|"No"| SHIP2["Ship best individual<br/>(v2.2_grounded)"]
style SHIP fill:#00b894,color:#fff
style SHIP2 fill:#fdcb6e,color:#333
PROMPT_VARIANTS["v2.3_combined"] = {
"template": """You are MangaAssist, a helpful assistant for manga e-commerce.
IMPORTANT: Only use information from the context below. If the context does not contain
enough information to answer, say "I don't have that information" rather than guessing.
Context: {context}
Conversation History: {history}
User Query: {query}
Respond in the following format:
- If the query is about a product: name, price, availability, then a 1-2 sentence description
- If the query is about an order: status, expected date, next steps
- If the query is a general question: direct answer in under 80 words
Based ONLY on the context above, provide your response:""",
"hypothesis": "Combined grounding + format > either alone",
"changed_from_baseline": "Grounding instruction + format specification",
}
Step 7: Shadow Deployment
Run the winning variant in parallel with production. Both prompts process every request, but only the production prompt's response is shown to the user. Comparing the paired outputs validates the new prompt against the real-world request distribution.
sequenceDiagram
participant U as User
participant GW as API Gateway
participant P as Production (v2.1)
participant S as Shadow (v2.3)
participant EVAL as Evaluator
U->>GW: "Recommend horror manga"
par Parallel execution
GW->>P: Process with v2.1
P-->>GW: Response A
and
GW->>S: Process with v2.3
S-->>GW: Response B (not served)
end
GW-->>U: Response A (production)
GW->>EVAL: Log both responses
EVAL->>EVAL: Compare quality metrics
EVAL->>EVAL: Compare latency
EVAL->>EVAL: Compare token usage
Note over EVAL: After 500+ request pairs:<br/>Statistical comparison report
Shadow Mode Decision Logic
def shadow_mode_evaluation(shadow_logs: list) -> dict:
"""Analyze shadow deployment results after N requests."""
comparisons = {
"shadow_better": 0,
"production_better": 0,
"tie": 0,
}
for log in shadow_logs:
prod_score = log["production"]["quality_score"]
shadow_score = log["shadow"]["quality_score"]
if shadow_score - prod_score > 0.05:
comparisons["shadow_better"] += 1
elif prod_score - shadow_score > 0.05:
comparisons["production_better"] += 1
else:
comparisons["tie"] += 1
n = len(shadow_logs)
# Decision criteria
decision = {
"total_requests": n,
"shadow_win_rate": comparisons["shadow_better"] / n,
"production_win_rate": comparisons["production_better"] / n,
"tie_rate": comparisons["tie"] / n,
"shadow_hallucination_rate": mean([l["shadow"]["hallucinated"] for l in shadow_logs]),
"production_hallucination_rate": mean([l["production"]["hallucinated"] for l in shadow_logs]),
"shadow_avg_latency": mean([l["shadow"]["latency"] for l in shadow_logs]),
"production_avg_latency": mean([l["production"]["latency"] for l in shadow_logs]),
}
# Go/no-go
decision["promote_to_canary"] = (
decision["shadow_win_rate"] > decision["production_win_rate"]
and decision["shadow_hallucination_rate"] <= decision["production_hallucination_rate"]
and decision["shadow_avg_latency"] < decision["production_avg_latency"] * 1.2 # Max 20% slower
)
return decision
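For `shadow_mode_evaluation` to work, the gateway has to log each request pair in a consistent shape. A sketch of that logging side; the field names match what the function above reads, while the JSONL path and the idea that quality scores are attached by the offline evaluator are assumptions:

```python
import json
import time

def log_shadow_pair(prod: dict, shadow: dict, path: str = "shadow_logs.jsonl") -> dict:
    """Append one production/shadow request pair in the record shape
    that shadow_mode_evaluation consumes."""
    record = {
        "timestamp": time.time(),
        "production": {k: prod[k] for k in ("quality_score", "hallucinated", "latency")},
        "shadow": {k: shadow[k] for k in ("quality_score", "hallucinated", "latency")},
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```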
Step 8: Canary Deployment
Gradually shift traffic to the new prompt version with automatic rollback triggers.
stateDiagram-v2
[*] --> Canary_1pct: Deploy to 1% traffic
Canary_1pct --> Monitor_1: Observe 30 min
Monitor_1 --> Rollback: Metrics degrade
Monitor_1 --> Canary_10pct: Metrics stable
Canary_10pct --> Monitor_10: Observe 2 hours
Monitor_10 --> Rollback: Metrics degrade
Monitor_10 --> Canary_50pct: Metrics stable
Canary_50pct --> Monitor_50: Observe 6 hours
Monitor_50 --> Rollback: Metrics degrade
Monitor_50 --> Full_100pct: Metrics stable
Full_100pct --> [*]: New prompt is now production
Rollback --> [*]: Revert to previous prompt
Automatic Rollback Triggers
| Metric | Threshold | Action |
|---|---|---|
| Hallucination rate | > baseline × 1.5 | Immediate rollback |
| BERTScore F1 | < baseline − 0.05 | Rollback after 50 requests |
| P99 latency | > 5 seconds | Pause canary, investigate |
| Error rate | > 2% | Immediate rollback |
| Token usage | > baseline × 2.0 | Pause canary, investigate cost |
| User thumbs-down rate | > baseline × 1.3 | Rollback after 100 requests |
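The trigger table translates directly into a guard function run during each monitoring window. A sketch; the metric key names are assumptions, but the thresholds mirror the table:

```python
from typing import Optional

def check_rollback(live: dict, baseline: dict) -> Optional[str]:
    """Evaluate the canary rollback triggers from the table above.

    Returns an action string, or None when all metrics are healthy.
    Checks are ordered so the most severe trigger wins.
    """
    if live["hallucination_rate"] > baseline["hallucination_rate"] * 1.5:
        return "immediate_rollback"
    if live["error_rate"] > 0.02:
        return "immediate_rollback"
    if live["bert_score_f1"] < baseline["bert_score_f1"] - 0.05:
        return "rollback_after_50_requests"
    if live["thumbs_down_rate"] > baseline["thumbs_down_rate"] * 1.3:
        return "rollback_after_100_requests"
    if live["p99_latency_sec"] > 5.0:
        return "pause_and_investigate"
    if live["avg_tokens_out"] > baseline["avg_tokens_out"] * 2.0:
        return "pause_and_investigate"
    return None
```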
Cost Comparison: Naive vs. Systematic
flowchart LR
subgraph Naive["Naive Approach: $45+"]
N1["Run 500 cases × 5 variants<br/>= 2,500 Bedrock calls<br/>~$37.50"]
N1 --> N2["No statistical rigor<br/>Pick 'best looking' variant"]
N2 --> N3["Deploy directly<br/>Hope it works"]
N3 --> N4["Discover regression<br/>in production"]
N4 --> N5["Rollback + repeat<br/>another $37.50"]
end
subgraph Systematic["Systematic Approach: $4.50"]
S1["Local smoke test<br/>$0"]
S1 --> S2["Sample eval (50 cases × 4)<br/>~$1.50"]
S2 --> S3["Full eval (200 cases × 2)<br/>~$3.00"]
S3 --> S4["Shadow deployment<br/>(uses prod traffic, no extra cost)"]
S4 --> S5["Canary with auto-rollback<br/>(safe at each step)"]
end
style Naive fill:#e17055,color:#fff
style Systematic fill:#00b894,color:#fff
| Aspect | Naive | Systematic |
|---|---|---|
| Bedrock cost per optimization cycle | $45-75 | $4.50 |
| Time to confident result | 1-2 weeks (trial and error) | 2-3 days (structured) |
| Risk of production regression | High (no shadow/canary) | Low (3 safety gates) |
| Know WHY improvement happened | No (changed multiple things) | Yes (one variable at a time) |
| Reproducible | No | Yes (versioned prompts + datasets) |
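The phase costs above come from simple token arithmetic, using the same per-token prices as the Step 1 script ($0.003/1K in, $0.015/1K out). A back-of-envelope helper; the token averages are assumptions you should replace with your own prompt sizes:

```python
def estimate_eval_cost(
    n_cases: int,
    n_variants: int,
    avg_tokens_in: int = 1200,   # assumed average prompt size
    avg_tokens_out: int = 300,   # assumed average completion size
    price_in_per_1k: float = 0.003,
    price_out_per_1k: float = 0.015,
) -> float:
    """Rough Bedrock spend in USD for one evaluation phase."""
    per_call = (avg_tokens_in * price_in_per_1k + avg_tokens_out * price_out_per_1k) / 1000
    return round(n_cases * n_variants * per_call, 2)
```

With these defaults the sample phase (50 cases x 4 variants) and full phase (200 cases x 2 variants) land near the ~$1.50 and ~$3.00 figures quoted above.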
Prompt Version Control
Every prompt is versioned and tracked, just like code.
gitGraph
commit id: "v2.0 - Initial prompt"
commit id: "v2.1 - Add product format"
branch grounding-experiment
commit id: "v2.2-grounded - Add grounding"
branch format-experiment
commit id: "v2.2-structured - Add format spec"
checkout grounding-experiment
commit id: "Eval: +6% BERTScore"
checkout format-experiment
commit id: "Eval: +3% BERTScore"
checkout main
merge grounding-experiment id: "v2.3 - Merge grounding"
commit id: "v2.3 - Combined variant"
commit id: "Shadow: validated"
commit id: "Canary: 1% → 100%"
commit id: "v2.3 is now production"
Version Registry
PROMPT_REGISTRY = {
"v2.0": {
"hash": "abc123",
"deployed": "2024-06-01",
"retired": "2024-08-15",
"baseline_bert_score": 0.78,
"notes": "Initial production prompt"
},
"v2.1": {
"hash": "def456",
"deployed": "2024-08-15",
"retired": "2024-10-10",
"baseline_bert_score": 0.84,
"notes": "Added product format template"
},
"v2.3": {
"hash": "ghi789",
"deployed": "2024-10-10",
"retired": None,
"baseline_bert_score": 0.90,
"notes": "Grounding + format combined (from optimization cycle)"
},
}
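One convenience the registry enables is resolving the live production version by its `retired` field. A minimal sketch against a demo registry mirroring the shape above (`current_production_version` is a hypothetical helper):

```python
# Demo registry mirroring the PROMPT_REGISTRY shape above
DEMO_REGISTRY = {
    "v2.1": {"hash": "def456", "deployed": "2024-08-15", "retired": "2024-10-10"},
    "v2.3": {"hash": "ghi789", "deployed": "2024-10-10", "retired": None},
}

def current_production_version(registry: dict) -> str:
    """Return the single registry entry that has not been retired."""
    live = [name for name, meta in registry.items() if meta["retired"] is None]
    if len(live) != 1:
        raise ValueError(f"expected exactly one live prompt, found: {live}")
    return live[0]
```

Raising on zero or multiple live entries turns a corrupted registry into a loud deploy-time failure instead of a silent wrong-prompt rollout.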
Common Pitfalls
| Pitfall | Why It Happens | How to Avoid |
|---|---|---|
| Changing 5 things at once | Impatience | One prompt element per variant |
| Evaluating on <20 cases | Cost anxiety | Use local testing for cheap validation; 50+ for Bedrock |
| Optimizing the wrong metric | Chasing numbers | Map metric to user experience (hallucination > BERTScore) |
| No baseline measurement | Assuming current is bad | Always measure before changing |
| Deploying without shadow | Confidence from offline eval | Shadow catches distribution mismatches |
| Ignoring per-intent slicing | Only looking at averages | A 5% average gain can mask 10% regression on order_tracking |
| Not versioning prompts | "I'll remember what I changed" | Version control every variant |
| Over-optimizing for benchmark | Golden dataset overfitting | Rotate 20% of golden dataset quarterly |
Workflow Checklist
Use this checklist for every prompt optimization cycle:
- Baseline collected — current prompt scored on full golden dataset
- Failure analysis done — know exactly which categories of failures to fix
- Variants designed — each variant changes exactly one thing
- Local smoke test passed — all variants pass structural checks on Ollama
- Sample evaluation complete — 50 cases per variant on Bedrock
- Statistical comparison done — paired t-test, p < 0.05 for winner
- Combined variant tested — if applicable
- Shadow deployment validated — winner runs alongside production for 500+ requests
- Canary deployed — 1% → 10% → 50% → 100% with rollback triggers
- Prompt registry updated — new version documented with metrics and notes
- Golden dataset refreshed — add 5-10 new cases from this cycle's failures