
Prompt Optimization Offline Workflow

A systematic approach to improving LLM prompts without burning through production API budgets. This document covers the complete offline workflow — from baseline measurement through variant testing to production deployment.


The Prompt Optimization Lifecycle

flowchart TD
    START["Identify Prompt Issue<br/>(metric regression, user complaint, or new feature)"]
    START --> BASELINE["Step 1: Establish Baseline<br/>Run current prompt against golden dataset"]
    BASELINE --> ANALYZE["Step 2: Failure Analysis<br/>Categorize where/why prompt fails"]
    ANALYZE --> HYPOTHESIZE["Step 3: Form Hypotheses<br/>Design 2-4 prompt variants"]
    HYPOTHESIZE --> LOCAL["Step 4: Local Smoke Test<br/>Run variants on Ollama (free)"]
    LOCAL --> EVAL["Step 5: Automated Evaluation<br/>Score with BERTScore, heuristics"]
    EVAL --> SELECT["Step 6: Select Top Variant<br/>Statistical comparison"]
    SELECT --> SHADOW["Step 7: Shadow Deployment<br/>Run winner alongside production"]
    SHADOW --> DEPLOY["Step 8: Canary Deployment<br/>1% → 10% → 50% → 100%"]

    style START fill:#6c5ce7,color:#fff
    style DEPLOY fill:#00b894,color:#fff

Step 1: Establish Baseline

Before changing anything, measure how the current prompt performs. This becomes the benchmark every variant must beat.

Baseline Measurement Setup

flowchart LR
    GP["Golden Dataset<br/>(200-500 cases)"]
    GP --> RUN["Run Current Prompt"]
    RUN --> SCORE["Score Outputs"]
    SCORE --> REPORT["Baseline Report"]

    REPORT --> M1["BERTScore F1: 0.84"]
    REPORT --> M2["Halluc Rate: 3.2%"]
    REPORT --> M3["Format Pass: 91%"]
    REPORT --> M4["Avg Latency: 1.4s"]
    REPORT --> M5["Avg Tokens: 287"]

Baseline Collection Script

import hashlib
import json
import time
from datetime import datetime
from statistics import mean
from typing import Dict, List

from numpy import percentile

# Assumes invoke_bedrock, evaluate_response, and aggregate_by_intent are
# defined elsewhere in the evaluation harness.

def collect_baseline(
    golden_dataset: List[Dict],
    prompt_template: str,
    model_id: str = "anthropic.claude-3-5-sonnet",
    output_path: str = "baselines/"
) -> Dict:
    """Run current prompt against entire golden dataset and record results."""

    results = []

    for case in golden_dataset:
        # Build the full prompt from template + case context
        full_prompt = prompt_template.format(
            intent=case["intent"],
            query=case["query"],
            context=case.get("retrieved_context", ""),
            history=case.get("history", [])  # templates reference {history}
        )

        start_time = time.time()
        response = invoke_bedrock(
            model_id=model_id,
            prompt=full_prompt,
            max_tokens=500,
            temperature=0.3
        )
        latency = time.time() - start_time

        # Score against expected output
        scores = evaluate_response(
            generated=response.text,
            expected=case["expected_response"],
            intent=case["intent"],
            context=case.get("retrieved_context", "")
        )

        results.append({
            "case_id": case["id"],
            "intent": case["intent"],
            "scores": scores,
            "latency": latency,
            "tokens_in": response.usage.input_tokens,
            "tokens_out": response.usage.output_tokens,
            "raw_response": response.text,
        })

    # Aggregate metrics
    baseline = {
        "timestamp": datetime.utcnow().isoformat(),
        "model_id": model_id,
        "prompt_version": hash(prompt_template),
        "dataset_size": len(golden_dataset),
        "metrics": {
            "bert_score_f1_mean": mean([r["scores"]["bert_score_f1"] for r in results]),
            "bert_score_f1_p25": percentile([r["scores"]["bert_score_f1"] for r in results], 25),
            "hallucination_rate": mean([r["scores"]["hallucinated"] for r in results]),
            "format_pass_rate": mean([r["scores"]["format_correct"] for r in results]),
            "avg_latency_sec": mean([r["latency"] for r in results]),
            "avg_tokens_out": mean([r["tokens_out"] for r in results]),
            "total_cost_usd": sum([
                r["tokens_in"] * 0.003 / 1000 + r["tokens_out"] * 0.015 / 1000
                for r in results
            ]),
        },
        "per_intent": aggregate_by_intent(results),
        "worst_cases": sorted(results, key=lambda x: x["scores"]["bert_score_f1"])[:10],
    }

    # Save baseline for future comparison
    output_file = f"{output_path}/baseline_{datetime.utcnow().strftime('%Y%m%d_%H%M')}.json"
    with open(output_file, "w") as f:
        json.dump(baseline, f, indent=2)

    return baseline
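
The baseline script leans on helpers (invoke_bedrock, evaluate_response, aggregate_by_intent) that live in the shared evaluation harness and are not shown here. As an illustration of the per-intent slicing, here is a minimal sketch of what aggregate_by_intent could look like, assuming only the per-case record shape built above; treat it as a sketch rather than the project's actual implementation.

from collections import defaultdict
from statistics import mean
from typing import Dict, List

def aggregate_by_intent(results: List[Dict]) -> Dict[str, Dict]:
    """Group per-case results by intent and average the key metrics for each group."""
    grouped: Dict[str, List[Dict]] = defaultdict(list)
    for r in results:
        grouped[r["intent"]].append(r)

    return {
        intent: {
            "count": len(cases),
            "bert_score_f1_mean": mean(c["scores"]["bert_score_f1"] for c in cases),
            "hallucination_rate": mean(float(c["scores"]["hallucinated"]) for c in cases),
            "avg_latency_sec": mean(c["latency"] for c in cases),
        }
        for intent, cases in grouped.items()
    }

Per-intent aggregates matter because an overall average can hide a regression in a single intent (see the pitfalls table at the end of this document).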

Step 2: Failure Analysis

Understand why the current prompt fails before attempting to fix it. Blindly editing prompts without diagnosis leads to oscillating between broken states.

Failure Categories

pie title Typical Prompt Failure Breakdown
    "Hallucinated Facts" : 28
    "Wrong Format" : 22
    "Missed Context" : 18
    "Too Verbose" : 14
    "Off-Topic" : 10
    "Contradicted History" : 8

Systematic Analysis

def analyze_failures(baseline_results: List[Dict]) -> Dict:
    """Categorize per-case failures by root cause to guide prompt edits.

    Expects the per-case result records produced by collect_baseline (baseline["results"]).
    """

    failures = [r for r in baseline_results if r["scores"]["overall"] < 0.7]

    categories = {
        "hallucinated_facts": [],
        "wrong_format": [],
        "missed_context": [],
        "too_verbose": [],
        "off_topic": [],
        "contradicted_history": [],
        "wrong_intent_handling": [],
    }

    for f in failures:
        # Check for hallucination
        if f["scores"]["hallucinated"]:
            categories["hallucinated_facts"].append(f)

        # Check format compliance
        if not f["scores"]["format_correct"]:
            categories["wrong_format"].append(f)

        # Check if retrieved context was actually used
        if f["scores"]["context_utilization"] < 0.3:
            categories["missed_context"].append(f)

        # Check verbosity
        if f["tokens_out"] > 400 and f["scores"]["bert_score_f1"] < 0.7:
            categories["too_verbose"].append(f)

        # Check topic alignment
        if f["scores"]["intent_match"] < 0.5:
            categories["off_topic"].append(f)

    return {
        category: {
            "count": len(cases),
            "pct": len(cases) / len(failures) * 100 if failures else 0,
            "examples": cases[:3],  # Top 3 examples for inspection
        }
        for category, cases in categories.items()
    }

Reading Failure Analysis Results

| Failure Category | Count | Percentage | Likely Prompt Fix |
| --- | --- | --- | --- |
| Hallucinated Facts | 14 | 28% | Add explicit instruction: "Only use information from the provided context" |
| Wrong Format | 11 | 22% | Add format examples with exact JSON schema in prompt |
| Missed Context | 9 | 18% | Move context earlier in prompt, add "You MUST reference the context below" |
| Too Verbose | 7 | 14% | Add word limit: "Keep responses under 100 words" |
| Off-Topic | 5 | 10% | Add scope restriction: "Only answer questions about manga and orders" |
| Contradicted History | 4 | 8% | Add: "If the conversation history contradicts the current context, prefer the current context" |

Step 3: Design Prompt Variants

Based on the failure analysis, design targeted variants. The key principle: change one thing at a time so you know what caused improvement.

Variant Design Strategy

flowchart TD
    BASE["Baseline Prompt (v2.1)"]
    BASE --> VA["Variant A:<br/>Add grounding instruction<br/>'Only use provided context'"]
    BASE --> VB["Variant B:<br/>Add format examples<br/>with JSON schema"]
    BASE --> VC["Variant C:<br/>Restructure sections<br/>(context before instructions)"]
    BASE --> VD["Variant D:<br/>Add conciseness constraint<br/>'Under 100 words'"]

    VA --> COMBINED["After individual testing:<br/>Combine winning elements<br/>into Variant E"]
    VB --> COMBINED
    VC --> COMBINED
    VD --> COMBINED

    style COMBINED fill:#00b894,color:#fff

Prompt Template Registry

PROMPT_VARIANTS = {
    "v2.1_baseline": {
        "template": """You are MangaAssist, a helpful assistant for manga e-commerce.

Context: {context}

Conversation History: {history}

User Query: {query}

Provide a helpful response.""",
        "hypothesis": "Current production prompt",
        "changed_from_baseline": None,
    },

    "v2.2_grounded": {
        "template": """You are MangaAssist, a helpful assistant for manga e-commerce.

IMPORTANT: Only use information from the context below. If the context does not contain
enough information to answer, say "I don't have that information" rather than guessing.

Context: {context}

Conversation History: {history}

User Query: {query}

Provide a helpful response based ONLY on the context above.""",
        "hypothesis": "Grounding instruction reduces hallucination",
        "changed_from_baseline": "Added grounding instructions at top and bottom",
    },

    "v2.2_structured": {
        "template": """You are MangaAssist, a helpful assistant for manga e-commerce.

Context: {context}

Conversation History: {history}

User Query: {query}

Respond in the following format:
- If the query is about a product: name, price, availability, then a 1-2 sentence description
- If the query is about an order: status, expected date, next steps
- If the query is a general question: direct answer in under 80 words

Response:""",
        "hypothesis": "Explicit format instructions reduce wrong-format failures",
        "changed_from_baseline": "Added response format specification",
    },

    "v2.2_context_first": {
        "template": """Context about relevant manga products and orders:
{context}

Previous conversation:
{history}

You are MangaAssist, a helpful assistant for manga e-commerce.
You MUST reference the context above when answering. Do not make up information.

User: {query}

MangaAssist:""",
        "hypothesis": "Context placed before instructions improves context utilization",
        "changed_from_baseline": "Moved context section above system instruction",
    },

    "v2.2_concise": {
        "template": """You are MangaAssist, a helpful assistant for manga e-commerce.
Keep responses under 80 words. Be direct and specific.

Context: {context}

Conversation History: {history}

User Query: {query}

Provide a concise response (under 80 words):""",
        "hypothesis": "Conciseness constraint reduces verbosity without hurting quality",
        "changed_from_baseline": "Added 80-word limit instruction",
    },
}
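
For reference, rendering any registered variant for a golden-dataset case is a single format call. The case values below are made up purely for illustration:

case = {
    "query": "Is volume 3 of Berserk in stock?",
    "retrieved_context": "Berserk Vol. 3 | $14.99 | in stock (7 copies)",
    "history": [],
}

prompt = PROMPT_VARIANTS["v2.2_grounded"]["template"].format(
    context=case["retrieved_context"],
    history=case["history"],
    query=case["query"],
)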

Step 4: Local Smoke Test (Free)

Before spending a cent on Bedrock, validate variants locally. Ollama with an open model catches ~60% of structural and behavioral prompt issues.

flowchart LR
    VARIANT["Prompt<br/>Variant"] --> OLLAMA["Ollama<br/>Llama 3 8B"]
    OLLAMA --> CHECK["Structural Checks"]
    CHECK --> P1["Format OK?"]
    CHECK --> P2["Uses context?"]
    CHECK --> P3["Under token limit?"]
    CHECK --> P4["Stays in persona?"]

    P1 -->|"Fail"| DISCARD["Discard variant"]
    P4 -->|"Fail"| DISCARD

    P1 -->|"Pass"| PROCEED["Advance to<br/>Bedrock evaluation"]
    P2 -->|"Pass"| PROCEED
    P3 -->|"Pass"| PROCEED
    P4 -->|"Pass"| PROCEED

    style DISCARD fill:#e17055,color:#fff
    style PROCEED fill:#00b894,color:#fff

Local Testing Configuration

import ollama

def local_smoke_test(
    variant_name: str,
    prompt_template: str,
    test_cases: list,  # Subset: 20-30 cases from golden dataset
) -> dict:
    """Fast, free validation using local Ollama before committing Bedrock spend."""

    results = {"passed": 0, "failed": 0, "issues": []}

    for case in test_cases:
        full_prompt = prompt_template.format(
            context=case.get("retrieved_context", ""),
            history=case.get("history", ""),
            query=case["query"]
        )

        response = ollama.chat(
            model="llama3:8b",
            messages=[{"role": "user", "content": full_prompt}],
            options={"num_predict": 500, "temperature": 0.3}
        )

        text = response["message"]["content"]

        # Structural checks (free, deterministic)
        checks = {
            "not_empty": len(text.strip()) > 0,
            "not_too_long": len(text.split()) < 200,
            "stays_in_persona": "as an AI" not in text.lower(),
            "no_system_leak": "SYSTEM:" not in text and "INSTRUCTIONS:" not in text,
            "uses_context": any(
                keyword in text.lower()
                for keyword in extract_keywords(case.get("retrieved_context", ""))
            ) if case.get("retrieved_context") else True,
            "correct_format": check_format(text, case["intent"]),
        }

        if all(checks.values()):
            results["passed"] += 1
        else:
            results["failed"] += 1
            results["issues"].append({
                "case_id": case["id"],
                "failed_checks": [k for k, v in checks.items() if not v],
                "response_preview": text[:200]
            })

    results["pass_rate"] = results["passed"] / len(test_cases)

    # Gate: must pass >= 80% structural checks to proceed
    results["advance_to_bedrock"] = results["pass_rate"] >= 0.80

    print(f"[{variant_name}] Pass rate: {results['pass_rate']:.0%} "
          f"→ {'ADVANCE' if results['advance_to_bedrock'] else 'DISCARD'}")

    return results
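
The smoke test calls two helpers, extract_keywords and check_format, that are not defined in this document. One plausible sketch, with hypothetical intent names (product_info, order_status) standing in for whatever the real intent taxonomy uses:

import re

def extract_keywords(context: str, min_len: int = 5) -> list:
    """Pick distinctive lowercase tokens from the retrieved context to test for reuse."""
    tokens = re.findall(rf"[a-z]{{{min_len},}}", context.lower())
    return sorted(set(tokens))[:20]

def check_format(text: str, intent: str) -> bool:
    """Very rough structural check per intent; tune to the real response contracts."""
    lowered = text.lower()
    if intent == "order_status":
        return "status" in lowered
    if intent == "product_info":
        return any(marker in lowered for marker in ("price", "$", "available"))
    return True  # no structural contract for general questions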

What Local Testing Catches vs. Misses

| Catches | Misses |
| --- | --- |
| Prompt causes format violations | Subtle quality differences between Llama and Claude |
| Prompt causes system prompt leakage | Hallucination rates (different training data) |
| Prompt causes persona breaks | Exact tone and style matching |
| Prompt causes infinite loops | Model-specific instruction following quirks |
| Response length issues | Exact BERTScore metrics |
| Context not being referenced | Fine-grained semantic accuracy |

Step 5: Automated Evaluation on Bedrock

For variants that pass local smoke tests, run full evaluation on Bedrock against the golden dataset. This is where you spend money — so make it count.

Cost-Controlled Evaluation

flowchart TD
    VARIANTS["3-4 Surviving Variants"]
    VARIANTS --> SAMPLE["Phase 1: Sample Evaluation<br/>50 cases per variant<br/>Cost: ~$1.50"]
    SAMPLE --> GATE{"Any variant<br/>beats baseline<br/>by > 3%?"}
    GATE -->|"No"| STOP["Stop. Current prompt is fine.<br/>Optimize elsewhere."]
    GATE -->|"Yes"| FULL["Phase 2: Full Evaluation<br/>200+ cases for top 2 variants<br/>Cost: ~$3.00"]
    FULL --> STATS["Statistical Comparison<br/>Paired t-test per metric"]
    STATS --> WINNER{"Winner with<br/>p < 0.05?"}
    WINNER -->|"No"| STOP2["Difference not significant.<br/>Keep baseline."]
    WINNER -->|"Yes"| PROMOTE["Promote to Shadow Mode"]

    style STOP fill:#e17055,color:#fff
    style STOP2 fill:#e17055,color:#fff
    style PROMOTE fill:#00b894,color:#fff
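
The ">3%" gate in the diagram is easy to make explicit. A small sketch, where sample_metrics maps variant names to the aggregate metrics produced by the evaluation script below (the function name and metric keys are illustrative):

def phase_gate(sample_metrics: dict, baseline_f1: float, min_lift: float = 0.03) -> list:
    """Return the variants worth a full evaluation: sample-phase BERTScore lift above min_lift."""
    return [
        name for name, metrics in sample_metrics.items()
        if metrics["bert_score_f1_mean"] - baseline_f1 > min_lift
    ]

An empty list means stop: the current prompt already holds up, and the budget is better spent elsewhere.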

Evaluation Script

import numpy as np
from scipy import stats

# Reuses invoke_bedrock, evaluate_response, stratified_sample, and
# aggregate_metrics from the shared evaluation harness.

def evaluate_variants_on_bedrock(
    variants: dict,
    golden_dataset: list,
    baseline_results: dict,
    phase: str = "sample",  # "sample" (50 cases) or "full" (all cases)
) -> dict:
    """Evaluate prompt variants against golden dataset on actual Bedrock model."""

    if phase == "sample":
        # Stratified sample: ensure each intent is represented
        dataset = stratified_sample(golden_dataset, n=50)
    else:
        dataset = golden_dataset

    all_results = {}

    for name, variant in variants.items():
        if name.endswith("_baseline"):
            continue  # Already have baseline

        results = []
        for case in dataset:
            full_prompt = variant["template"].format(
                context=case.get("retrieved_context", ""),
                history=case.get("history", ""),
                query=case["query"]
            )

            response = invoke_bedrock(
                model_id="anthropic.claude-3-5-sonnet",
                prompt=full_prompt,
                max_tokens=500,
                temperature=0.3
            )

            scores = evaluate_response(
                generated=response.text,
                expected=case["expected_response"],
                intent=case["intent"],
                context=case.get("retrieved_context", "")
            )

            results.append({
                "case_id": case["id"],
                "intent": case["intent"],
                "scores": scores,
                "tokens_out": response.usage.output_tokens,
            })

        all_results[name] = {
            "results": results,
            "metrics": aggregate_metrics(results),
        }

    # Statistical comparison against baseline, paired by case_id so the t-test is truly paired
    comparison = {}
    baseline_by_case = {
        r["case_id"]: r["scores"]["bert_score_f1"] for r in baseline_results["results"]
    }
    for name, data in all_results.items():
        paired = [
            (r["scores"]["bert_score_f1"], baseline_by_case[r["case_id"]])
            for r in data["results"]
            if r["case_id"] in baseline_by_case
        ]
        variant_scores = [s for s, _ in paired]
        baseline_scores = [s for _, s in paired]

        t_stat, p_value = stats.ttest_rel(variant_scores, baseline_scores)

        comparison[name] = {
            "bert_score_delta": np.mean(variant_scores) - np.mean(baseline_scores),
            "p_value": p_value,
            "significant": p_value < 0.05,
            "better": np.mean(variant_scores) > np.mean(baseline_scores),
            "hallucination_rate": data["metrics"]["hallucination_rate"],
            "avg_tokens": data["metrics"]["avg_tokens_out"],
        }

    return comparison
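
stratified_sample is referenced above but not shown. A minimal sketch, seeded for reproducibility, that keeps every intent represented in the Phase 1 sample:

import random
from collections import defaultdict

def stratified_sample(golden_dataset: list, n: int = 50, seed: int = 42) -> list:
    """Sample roughly n cases while keeping every intent represented proportionally."""
    random.seed(seed)
    by_intent = defaultdict(list)
    for case in golden_dataset:
        by_intent[case["intent"]].append(case)

    sample = []
    for intent, cases in by_intent.items():
        # At least one case per intent, otherwise proportional to its share of the dataset
        k = max(1, round(n * len(cases) / len(golden_dataset)))
        sample.extend(random.sample(cases, min(k, len(cases))))
    return sample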

Reading Evaluation Results

Example output after running 4 variants:

| Variant | BERTScore Δ | p-value | Halluc Rate | Avg Tokens | Verdict |
| --- | --- | --- | --- | --- | --- |
| v2.2_grounded | +0.06 | 0.003 | 1.2% (↓62%) | 265 | Winner (hallucination fix) |
| v2.2_structured | +0.03 | 0.041 | 2.8% | 198 (↓31%) | Winner (format fix) |
| v2.2_context_first | +0.01 | 0.312 | 3.0% | 290 | Not significant — discard |
| v2.2_concise | −0.02 | 0.089 | 3.5% | 142 (↓50%) | Quality dropped — discard |

Reading this table:

- v2.2_grounded significantly improved quality AND reduced hallucinations. This is the primary winner.
- v2.2_structured helped with format compliance, but the quality improvement was smaller.
- v2.2_context_first didn't help — moving the context position alone wasn't enough.
- v2.2_concise reduced tokens but hurt quality. The 80-word limit was too aggressive.


Step 6: Combine Winning Elements

After identifying which individual changes help, combine them into a final candidate.

flowchart LR
    VA["Grounding<br/>(from v2.2_grounded)"]
    VB["Format spec<br/>(from v2.2_structured)"]

    VA --> COMBINED["v2.3_combined"]
    VB --> COMBINED

    COMBINED --> EVAL["Evaluate combined<br/>against baseline<br/>AND individual winners"]

    EVAL --> BETTER{"Combined better<br/>than individuals?"}
    BETTER -->|"Yes"| SHIP["Ship v2.3_combined"]
    BETTER -->|"No"| SHIP2["Ship best individual<br/>(v2.2_grounded)"]

    style SHIP fill:#00b894,color:#fff
    style SHIP2 fill:#fdcb6e,color:#333

Combined Variant Definition

PROMPT_VARIANTS["v2.3_combined"] = {
    "template": """You are MangaAssist, a helpful assistant for manga e-commerce.

IMPORTANT: Only use information from the context below. If the context does not contain
enough information to answer, say "I don't have that information" rather than guessing.

Context: {context}

Conversation History: {history}

User Query: {query}

Respond in the following format:
- If the query is about a product: name, price, availability, then a 1-2 sentence description
- If the query is about an order: status, expected date, next steps
- If the query is a general question: direct answer in under 80 words

Based ONLY on the context above, provide your response:""",
    "hypothesis": "Combined grounding + format > either alone",
    "changed_from_baseline": "Grounding instruction + format specification",
}

Step 7: Shadow Deployment

Run the winning variant in parallel with production. Both prompts process every request; only the production prompt's response is shown to the user. Comparing the logged pairs validates the new prompt against the real-world traffic distribution.

sequenceDiagram
    participant U as User
    participant GW as API Gateway
    participant P as Production (v2.1)
    participant S as Shadow (v2.3)
    participant EVAL as Evaluator

    U->>GW: "Recommend horror manga"

    par Parallel execution
        GW->>P: Process with v2.1
        P-->>GW: Response A
    and
        GW->>S: Process with v2.3
        S-->>GW: Response B (not served)
    end

    GW-->>U: Response A (production)

    GW->>EVAL: Log both responses
    EVAL->>EVAL: Compare quality metrics
    EVAL->>EVAL: Compare latency
    EVAL->>EVAL: Compare token usage

    Note over EVAL: After 500+ request pairs:<br/>Statistical comparison report
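
A minimal sketch of the dispatch behind this diagram, assuming hypothetical helpers invoke_prompt (renders a registered prompt version and calls the model) and log_shadow_pair (writes the comparison record analyzed below):

import asyncio

async def handle_request(query: str, context: str, history: list) -> str:
    """Serve the production prompt; run the shadow prompt in parallel for logging only."""
    prod_task = asyncio.create_task(invoke_prompt("v2.1_baseline", query, context, history))
    shadow_task = asyncio.create_task(invoke_prompt("v2.3_combined", query, context, history))

    prod_response = await prod_task

    # Finish the shadow call in the background so it never delays the user
    asyncio.create_task(_log_when_done(query, prod_response, shadow_task))
    return prod_response  # only the production answer is served

async def _log_when_done(query: str, prod_response: str, shadow_task) -> None:
    try:
        shadow_response = await asyncio.wait_for(shadow_task, timeout=10)
        log_shadow_pair(query, prod_response, shadow_response)  # hypothetical logger
    except Exception:
        pass  # a failed shadow call must never affect the production path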

Shadow Mode Decision Logic

from statistics import mean


def shadow_mode_evaluation(shadow_logs: list) -> dict:
    """Analyze shadow deployment results after N requests."""

    comparisons = {
        "shadow_better": 0,
        "production_better": 0,
        "tie": 0,
    }

    for log in shadow_logs:
        prod_score = log["production"]["quality_score"]
        shadow_score = log["shadow"]["quality_score"]

        if shadow_score - prod_score > 0.05:
            comparisons["shadow_better"] += 1
        elif prod_score - shadow_score > 0.05:
            comparisons["production_better"] += 1
        else:
            comparisons["tie"] += 1

    n = len(shadow_logs)

    # Decision criteria
    decision = {
        "total_requests": n,
        "shadow_win_rate": comparisons["shadow_better"] / n,
        "production_win_rate": comparisons["production_better"] / n,
        "tie_rate": comparisons["tie"] / n,
        "shadow_hallucination_rate": mean([l["shadow"]["hallucinated"] for l in shadow_logs]),
        "production_hallucination_rate": mean([l["production"]["hallucinated"] for l in shadow_logs]),
        "shadow_avg_latency": mean([l["shadow"]["latency"] for l in shadow_logs]),
        "production_avg_latency": mean([l["production"]["latency"] for l in shadow_logs]),
    }

    # Go/no-go
    decision["promote_to_canary"] = (
        decision["shadow_win_rate"] > decision["production_win_rate"]
        and decision["shadow_hallucination_rate"] <= decision["production_hallucination_rate"]
        and decision["shadow_avg_latency"] < decision["production_avg_latency"] * 1.2  # Max 20% slower
    )

    return decision

Step 8: Canary Deployment

Gradually shift traffic to the new prompt version with automatic rollback triggers.

stateDiagram-v2
    [*] --> Canary_1pct: Deploy to 1% traffic

    Canary_1pct --> Monitor_1: Observe 30 min
    Monitor_1 --> Rollback: Metrics degrade
    Monitor_1 --> Canary_10pct: Metrics stable

    Canary_10pct --> Monitor_10: Observe 2 hours
    Monitor_10 --> Rollback: Metrics degrade
    Monitor_10 --> Canary_50pct: Metrics stable

    Canary_50pct --> Monitor_50: Observe 6 hours
    Monitor_50 --> Rollback: Metrics degrade
    Monitor_50 --> Full_100pct: Metrics stable

    Full_100pct --> [*]: New prompt is now production
    Rollback --> [*]: Revert to previous prompt

Automatic Rollback Triggers

| Metric | Threshold | Action |
| --- | --- | --- |
| Hallucination rate | > baseline × 1.5 | Immediate rollback |
| BERTScore F1 | < baseline − 0.05 | Rollback after 50 requests |
| P99 latency | > 5 seconds | Pause canary, investigate |
| Error rate | > 2% | Immediate rollback |
| Token usage | > baseline × 2.0 | Pause canary, investigate cost |
| User thumbs-down rate | > baseline × 1.3 | Rollback after 100 requests |
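
The trigger table translates almost directly into a check that runs on each observation window. A sketch with illustrative metric keys; the window/baseline dictionaries are an assumed shape, not an existing API:

def check_rollback_triggers(window: dict, baseline: dict) -> str:
    """Return 'rollback', 'pause', or 'continue' for the current canary observation window."""
    if window["hallucination_rate"] > baseline["hallucination_rate"] * 1.5:
        return "rollback"  # immediate
    if window["error_rate"] > 0.02:
        return "rollback"  # immediate
    if window["requests"] >= 50 and window["bert_score_f1"] < baseline["bert_score_f1"] - 0.05:
        return "rollback"
    if window["p99_latency_sec"] > 5:
        return "pause"  # investigate latency
    if window["avg_tokens_out"] > baseline["avg_tokens_out"] * 2.0:
        return "pause"  # investigate cost
    if window["requests"] >= 100 and window["thumbs_down_rate"] > baseline["thumbs_down_rate"] * 1.3:
        return "rollback"
    return "continue"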

Cost Comparison: Naive vs. Systematic

flowchart LR
    subgraph Naive["Naive Approach: $45+"]
        N1["Run 500 cases × 5 variants<br/>= 2,500 Bedrock calls<br/>~$37.50"]
        N1 --> N2["No statistical rigor<br/>Pick 'best looking' variant"]
        N2 --> N3["Deploy directly<br/>Hope it works"]
        N3 --> N4["Discover regression<br/>in production"]
        N4 --> N5["Rollback + repeat<br/>another $37.50"]
    end

    subgraph Systematic["Systematic Approach: $4.50"]
        S1["Local smoke test<br/>$0"]
        S1 --> S2["Sample eval (50 cases × 4)<br/>~$1.50"]
        S2 --> S3["Full eval (200 cases × 2)<br/>~$3.00"]
        S3 --> S4["Shadow deployment<br/>(uses prod traffic, no extra cost)"]
        S4 --> S5["Canary with auto-rollback<br/>(safe at each step)"]
    end

    style Naive fill:#e17055,color:#fff
    style Systematic fill:#00b894,color:#fff

| Aspect | Naive | Systematic |
| --- | --- | --- |
| Bedrock cost per optimization cycle | $45-75 | $4.50 |
| Time to confident result | 1-2 weeks (trial and error) | 2-3 days (structured) |
| Risk of production regression | High (no shadow/canary) | Low (3 safety gates) |
| Know WHY improvement happened | No (changed multiple things) | Yes (one variable at a time) |
| Reproducible | No | Yes (versioned prompts + datasets) |

Prompt Version Control

Every prompt is versioned and tracked, just like code.

gitGraph
    commit id: "v2.0 - Initial prompt"
    commit id: "v2.1 - Add product format"
    branch grounding-experiment
    commit id: "v2.2-grounded - Add grounding"
    branch format-experiment
    commit id: "v2.2-structured - Add format spec"
    checkout grounding-experiment
    commit id: "Eval: +6% BERTScore"
    checkout format-experiment
    commit id: "Eval: +3% BERTScore"
    checkout main
    merge grounding-experiment id: "v2.3 - Merge grounding"
    commit id: "v2.3 - Combined variant"
    commit id: "Shadow: validated"
    commit id: "Canary: 1% → 100%"
    commit id: "v2.3 is now production"

Version Registry

PROMPT_REGISTRY = {
    "v2.0": {
        "hash": "abc123",
        "deployed": "2024-06-01",
        "retired": "2024-08-15",
        "baseline_bert_score": 0.78,
        "notes": "Initial production prompt"
    },
    "v2.1": {
        "hash": "def456",
        "deployed": "2024-08-15",
        "retired": "2024-10-10",
        "baseline_bert_score": 0.84,
        "notes": "Added product format template"
    },
    "v2.3": {
        "hash": "ghi789",
        "deployed": "2024-10-10",
        "retired": None,
        "baseline_bert_score": 0.90,
        "notes": "Grounding + format combined (from optimization cycle)"
    },
}
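
Promotion after a successful canary can be a small, deterministic registry update. A sketch assuming the registry structure above; promote_prompt_version is an illustrative helper, not an existing function:

import hashlib
from datetime import date

def promote_prompt_version(registry: dict, version: str, template: str,
                           bert_score: float, notes: str) -> None:
    """Retire the currently live version and record the newly promoted one."""
    today = date.today().isoformat()
    for entry in registry.values():
        if entry["retired"] is None:
            entry["retired"] = today  # the previous production version is retired

    registry[version] = {
        "hash": hashlib.sha256(template.encode()).hexdigest()[:12],
        "deployed": today,
        "retired": None,
        "baseline_bert_score": bert_score,
        "notes": notes,
    }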

Common Pitfalls

| Pitfall | Why It Happens | How to Avoid |
| --- | --- | --- |
| Changing 5 things at once | Impatience | One prompt element per variant |
| Evaluating on <20 cases | Cost anxiety | Use local testing for cheap validation; 50+ cases for Bedrock |
| Optimizing the wrong metric | Chasing numbers | Map metrics to user experience (hallucination matters more than BERTScore) |
| No baseline measurement | Assuming the current prompt is bad | Always measure before changing |
| Deploying without shadow | Overconfidence in offline eval | Shadow catches distribution mismatches |
| Ignoring per-intent slicing | Only looking at averages | A 5% average gain can mask a 10% regression on order_tracking |
| Not versioning prompts | "I'll remember what I changed" | Version control every variant |
| Over-optimizing for the benchmark | Golden dataset overfitting | Rotate 20% of the golden dataset quarterly |

Workflow Checklist

Use this checklist for every prompt optimization cycle:

  • Baseline collected — current prompt scored on full golden dataset
  • Failure analysis done — know exactly which categories of failures to fix
  • Variants designed — each variant changes exactly one thing
  • Local smoke test passed — all variants pass structural checks on Ollama
  • Sample evaluation complete — 50 cases per variant on Bedrock
  • Statistical comparison done — paired t-test, p < 0.05 for winner
  • Combined variant tested — if applicable
  • Shadow deployment validated — winner runs alongside production for 500+ requests
  • Canary deployed — 1% → 10% → 50% → 100% with rollback triggers
  • Prompt registry updated — new version documented with metrics and notes
  • Golden dataset refreshed — add 5-10 new cases from this cycle's failures