
07. Prompt Evaluation, Versioning, and Regression

Why Prompt Evaluation Needs Its Own Process

Prompt engineering quickly becomes unstable if changes are tested by intuition alone.

MangaAssist needs a release process for prompts just as much as it needs one for APIs and models.

Evaluation Layers

| Layer | Question |
| --- | --- |
| offline functional | did the response follow the intended behavior? |
| offline quality | was the answer useful, grounded, and concise? |
| adversarial | did the prompt resist manipulation and edge cases? |
| online business | did the change improve outcomes without hurting trust? |
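The offline functional layer is the cheapest one to automate. A minimal sketch, assuming responses arrive as plain strings; the function name and rule lists are illustrative, not a fixed API:

```python
def functional_check(
    response: str,
    must_include: list[str],
    must_not_include: list[str],
) -> bool:
    """Pass if the response mentions required phrases and avoids banned ones."""
    text = response.lower()
    has_required = all(phrase.lower() in text for phrase in must_include)
    has_banned = any(phrase.lower() in text for phrase in must_not_include)
    return has_required and not has_banned

# Example: a returns-policy answer must cite the policy and the window,
# and must not promise something the policy does not say.
assert functional_check(
    "Per our returns policy, damaged items can be returned within 30 days.",
    must_include=["returns policy", "30 days"],
    must_not_include=["guaranteed refund"],
)
```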

Test Set Categories

  1. recommendation scenarios
  2. product attribute questions
  3. FAQ and policy questions
  4. order and return support
  5. structured-output scenarios
  6. low-confidence and ambiguous requests
  7. adversarial prompt-injection attempts
  8. multilingual requests
  9. long-history sessions
  10. degraded-service scenarios

Golden Set Design

Each golden case should include:

  • user message
  • context bundle
  • retrieved chunks if applicable
  • expected behavior notes
  • unacceptable behavior notes
  • expected schema if structured output is required

The goal is not to match one exact answer string. The goal is a behavior envelope: any answer that satisfies the expected-behavior notes passes, and any answer that matches the unacceptable-behavior notes fails.
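A minimal sketch of one golden case as a plain record, assuming a simple dict-based format; the field names mirror the list above but the exact schema is an assumption:

```python
golden_case = {
    "id": "faq-policy-007",
    "user_message": "Can I return a damaged volume after 30 days?",
    "context_bundle": {"locale": "en-US", "membership_tier": "standard"},
    "retrieved_chunks": ["returns-policy#damaged-items"],
    "expected_behavior": [
        "cites the returns policy chunk",
        "states the 30-day window and the damaged-item exception",
    ],
    "unacceptable_behavior": [
        "invents a policy not present in the retrieved chunks",
        "promises a specific refund amount",
    ],
    # Set to a JSON Schema for structured-output cases; None otherwise.
    "expected_schema": None,
}
```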

Example Scorecard

| Metric | Description |
| --- | --- |
| grounding pass rate | answer used only approved sources |
| factual accuracy | business facts match source of truth |
| schema validity | output parses and matches contract |
| concision score | answer stays within response target |
| recommendation usefulness | human raters judge fit quality |
| safe refusal quality | refusal is correct and still helpful |
| escalation precision | system escalates when it should, not when it should not |
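Per-case pass/fail results roll up into the scorecard as simple rates. A sketch, assuming each case records a boolean per metric; the result format is illustrative:

```python
# Hypothetical per-case results from an offline run.
results = [
    {"grounding": True, "schema_valid": True},
    {"grounding": False, "schema_valid": True},
    {"grounding": True, "schema_valid": True},
]

def pass_rate(metric: str) -> float:
    """Fraction of cases that passed the given metric."""
    outcomes = [case[metric] for case in results if metric in case]
    return sum(outcomes) / len(outcomes)

print(f"grounding pass rate: {pass_rate('grounding'):.0%}")    # 67%
print(f"schema validity:     {pass_rate('schema_valid'):.0%}")  # 100%
```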

Prompt Versioning Strategy

Use an explicit version id for each prompt family.

Example:

  • recommendation.v1
  • recommendation.v2
  • faq-policy.v3
  • escalation-summary.v2

Version prompts by intent family rather than maintaining one giant global prompt.
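A minimal sketch of a version-aware prompt registry keyed by intent family; the registry structure, prompt text, and `get_prompt` helper are assumptions for illustration:

```python
PROMPTS: dict[str, dict[int, str]] = {
    "recommendation": {
        1: "You are MangaAssist. Recommend titles using only the catalog context...",
        2: "You are MangaAssist. Recommend up to 3 titles with one-line reasons...",
    },
    "faq-policy": {
        3: "Answer strictly from the provided policy chunks...",
    },
}

def get_prompt(family: str, version: int) -> str:
    """Resolve an explicit version id like recommendation.v2."""
    return PROMPTS[family][version]

system_prompt = get_prompt("recommendation", 2)
```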

Change Log Template

| Field | Example |
| --- | --- |
| version | faq-policy.v3 |
| change summary | reduced chunk count from 3 to 2 and clarified incomplete-evidence behavior |
| hypothesis | higher grounding precision and shorter latency |
| risk | more false insufficiency responses |
| evaluation status | offline passed, online canary pending |
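Keeping the entry next to the prompt itself makes reviews easier. A sketch of the same template as a plain record; the storage format is an assumption:

```python
change_log_entry = {
    "version": "faq-policy.v3",
    "change_summary": "reduced chunk count from 3 to 2; clarified incomplete-evidence behavior",
    "hypothesis": "higher grounding precision and shorter latency",
    "risk": "more false insufficiency responses",
    "evaluation_status": "offline passed, online canary pending",
}
```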

Regression Triggers

Run regression when any of the following changes:

  • system rules
  • output contract wording
  • chunk count
  • retrieval filter logic
  • history summarization logic
  • FM choice or decoding settings
  • guardrail policy interaction
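One way to enforce this mechanically is to fingerprint everything on the trigger list and re-run the suites whenever the fingerprint changes. A sketch; the config field names are illustrative assumptions:

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Stable hash of prompt-adjacent settings; any change should force regression."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

current = config_fingerprint({
    "system_rules": "faq-policy.v3",
    "output_contract": "answer-plus-citations JSON",
    "chunk_count": 2,
    "retrieval_filter": "locale=en",
    "history_summarizer": "rollup.v1",
    "model": "fm-large",
    "decoding": {"temperature": 0.2, "top_p": 0.9},
})
# Compare `current` against the fingerprint recorded at the last passing run;
# a mismatch means the regression suites must run again.
```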

A/B Testing Guidance

Good A/B Targets

  • shorter vs longer recommendation explanation prompt
  • one-shot vs zero-shot prompt for JSON behavior
  • different follow-up suggestion instructions

Bad A/B Targets

  • changing safety and UX variables together
  • comparing prompt changes while also changing retrieval logic, so the effects cannot be isolated
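For good targets, a deterministic split keeps each user on one variant for the whole experiment. A minimal bucketing sketch; the experiment name, salting scheme, and 50/50 split are assumptions:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Hash the user into [0, 1) and assign 'treatment' below the share cutoff."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if bucket < treatment_share else "control"

variant = assign_variant("user-123", "recommendation-v1-vs-v2")
```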

Release Flow

  1. update prompt variant
  2. run offline golden set
  3. run adversarial suite
  4. run structured-output validation suite
  5. canary in production
  6. monitor operational and business metrics
  7. promote or roll back
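A minimal sketch of the gate between steps 2-4 and the canary, assuming a harness records one pass/fail result per suite; `SUITE_RESULTS` and the suite names are illustrative:

```python
# Stand-in for whatever harness actually runs the golden, adversarial,
# and structured-output suites and records their outcomes.
SUITE_RESULTS = {"golden": True, "adversarial": True, "schema": True}

def ready_for_canary() -> bool:
    """All offline suites must pass before any production traffic."""
    return all(SUITE_RESULTS[s] for s in ("golden", "adversarial", "schema"))

if ready_for_canary():
    print("promote prompt variant to canary")
else:
    print("block release; fix offline failures first")
```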

Metrics to Watch in Production

| Metric | Why |
| --- | --- |
| thumbs-down rate | quick user dissatisfaction signal |
| escalation rate | can reveal answer quality or trust issues |
| guardrail trigger rate | may show prompt drift or model drift |
| parse failure rate | structured-output health |
| first-token latency | user-perceived responsiveness |
| total token usage | cost control |
| recommendation click-through | business signal for discovery quality |
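Most of these reduce to simple rates over a response event log. A sketch computing two of them, assuming events carry a feedback signal and a parse outcome; the event field names are assumptions:

```python
# Hypothetical per-response events from production logs.
events = [
    {"feedback": "down", "parsed": True},
    {"feedback": None, "parsed": False},
    {"feedback": "up", "parsed": True},
]

thumbs_down_rate = sum(e["feedback"] == "down" for e in events) / len(events)
parse_failure_rate = sum(not e["parsed"] for e in events) / len(events)

print(f"thumbs-down rate:   {thumbs_down_rate:.0%}")
print(f"parse failure rate: {parse_failure_rate:.0%}")
```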

Failure Analysis Checklist

When a new prompt underperforms, ask in this order:

  1. did the routing send the request to the wrong path?
  2. was the truth source incomplete or wrong?
  3. was retrieval poor?
  4. was the prompt too large or too vague?
  5. did the FM violate the contract?
  6. did downstream validation hide a deeper generation issue?

Exit Rule

Do not ship a prompt just because it sounds better in a demo.

Ship it only if it improves the target behavior without breaking grounding, latency, schema, or safety.