07. Prompt Evaluation, Versioning, and Regression
Why Prompt Evaluation Needs Its Own Process
Prompt behavior becomes unstable quickly when changes are tested by intuition alone.
MangaAssist needs a release process for prompts just as much as it needs one for APIs and models.
Evaluation Layers
| Layer | Question |
|---|---|
| offline functional | did the response follow the intended behavior? |
| offline quality | was the answer useful, grounded, and concise? |
| adversarial | did the prompt resist manipulation and handle edge cases? |
| online business | did the change improve outcomes without hurting trust? |
Test Set Categories
- recommendation scenarios
- product attribute questions
- FAQ and policy questions
- order and return support
- structured-output scenarios
- low-confidence and ambiguous requests
- adversarial prompt-injection attempts
- multilingual requests
- long-history sessions
- degraded-service scenarios
Golden Set Design
Each golden case should include:
- user message
- context bundle
- retrieved chunks if applicable
- expected behavior notes
- unacceptable behavior notes
- expected schema if structured output is required
The goal is not one exact answer string. The goal is a behavior envelope.
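In code, each golden case can be a small record. A minimal sketch in Python, assuming a homegrown harness; the class and field names (`GoldenCase`, `context_bundle`, `expected_schema`) are illustrative, not a fixed API:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenCase:
    """One golden test case: a behavior envelope, not an exact answer string."""
    case_id: str
    category: str                        # e.g. "faq-policy", "adversarial"
    user_message: str
    context_bundle: dict                 # session/user context given to the prompt
    retrieved_chunks: list[str] = field(default_factory=list)
    expected_behaviors: list[str] = field(default_factory=list)      # must hold
    unacceptable_behaviors: list[str] = field(default_factory=list)  # must not occur
    expected_schema: dict | None = None  # JSON Schema when structured output is required

# Illustrative case: a policy question that must stay grounded in retrieved text.
case = GoldenCase(
    case_id="faq-042",
    category="faq-policy",
    user_message="Can I return an opened box set?",
    context_bundle={"locale": "en-US", "tier": "standard"},
    retrieved_chunks=["Returns accepted within 30 days if items are unopened..."],
    expected_behaviors=["cites the returns policy", "states the unopened condition"],
    unacceptable_behaviors=["invents a return window", "promises a refund"],
)
```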
Example Scorecard
| Metric | Description |
|---|---|
| grounding pass rate | answer used only approved sources |
| factual accuracy | business facts match source of truth |
| schema validity | output parses and matches contract |
| concision score | answer stays within the response-length target |
| recommendation usefulness | human raters judge fit quality |
| safe refusal quality | refusal is correct and still helpful |
| escalation precision | system escalates when it should, not when it should not |
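Once each case has been judged pass or fail per metric, the scorecard itself is a simple aggregation. A sketch, assuming upstream checks or human raters emit one boolean per metric per case; `scorecard` and the metric keys are hypothetical names:

```python
from collections import defaultdict

def scorecard(results: list[dict]) -> dict[str, float]:
    """Aggregate per-metric pass rates from judged golden-set runs.

    Each result is a dict like {"grounding": True, "schema_valid": True, ...}
    produced by automated checks or human raters upstream.
    """
    passes: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for result in results:
        for metric, passed in result.items():
            totals[metric] += 1
            passes[metric] += int(passed)
    return {m: passes[m] / totals[m] for m in totals}

judged = [
    {"grounding": True, "schema_valid": True, "safe_refusal": True},
    {"grounding": False, "schema_valid": True, "safe_refusal": True},
]
print(scorecard(judged))  # {'grounding': 0.5, 'schema_valid': 1.0, 'safe_refusal': 1.0}
```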
Prompt Versioning Strategy
Use an explicit version id for each prompt family.
Example:
- recommendation.v1
- recommendation.v2
- faq-policy.v3
- escalation-summary.v2
Version prompts by intent family, not as one giant global prompt.
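A minimal sketch of what that can look like in code, assuming prompts live in an in-memory registry; `PROMPTS`, `ACTIVE`, and `get_prompt` are illustrative names, and a real system would likely load versions from storage:

```python
# Hypothetical registry: each intent family carries its own version history.
PROMPTS = {
    "recommendation": {
        "v1": "You are MangaAssist. Recommend titles using only the catalog...",
        "v2": "You are MangaAssist. Recommend at most 3 titles, each with...",
    },
    "faq-policy": {
        "v3": "Answer using only the retrieved policy chunks. If evidence is...",
    },
}

ACTIVE = {"recommendation": "v2", "faq-policy": "v3"}

def get_prompt(family: str, version: str | None = None) -> tuple[str, str]:
    """Resolve a prompt by family, pinned or active; returns (version_id, text)."""
    v = version or ACTIVE[family]
    return f"{family}.{v}", PROMPTS[family][v]

version_id, text = get_prompt("faq-policy")
print(version_id)  # faq-policy.v3
```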
Change Log Template
| Field | Example |
|---|---|
| version | faq-policy.v3 |
| change summary | reduced chunk count from 3 to 2 and clarified incomplete-evidence behavior |
| hypothesis | higher grounding precision and shorter latency |
| risk | more false insufficiency responses |
| evaluation status | offline passed, online canary pending |
Regression Triggers
Run the regression suite when any of the following changes:
- system rules
- output contract wording
- chunk count
- retrieval filter logic
- history summarization logic
- FM choice or decoding settings
- guardrail policy interaction
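One way to make these triggers mechanical is to fingerprint every watched input and rerun regression whenever the fingerprint changes. A sketch under that assumption; the config keys are hypothetical:

```python
import hashlib
import json

def config_fingerprint(cfg: dict) -> str:
    """Hash every input whose change should trigger a regression run."""
    watched = {
        "system_rules": cfg["system_rules"],
        "output_contract": cfg["output_contract"],
        "chunk_count": cfg["chunk_count"],
        "retrieval_filters": cfg["retrieval_filters"],
        "history_summarizer": cfg["history_summarizer"],
        "model_id": cfg["model_id"],
        "decoding": cfg["decoding"],          # temperature, top_p, max tokens
        "guardrail_policy": cfg["guardrail_policy"],
    }
    blob = json.dumps(watched, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

cfg = {
    "system_rules": "rules-v5", "output_contract": "json-v2", "chunk_count": 2,
    "retrieval_filters": ["in_stock"], "history_summarizer": "summ-v1",
    "model_id": "model-x", "decoding": {"temperature": 0.2}, "guardrail_policy": "g3",
}
print(config_fingerprint(cfg)[:12])  # changes whenever any watched input changes
```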
A/B Testing Guidance
Good A/B Targets
- shorter vs longer recommendation explanation prompt
- one-shot vs zero-shot prompt for JSON behavior
- different follow-up suggestion instructions
Bad A/B Targets
- changing safety and UX variables together
- changing prompts and retrieval logic in the same experiment, so effects cannot be isolated
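For the good targets, deterministic bucketing keeps assignment stable across sessions and independent across experiments, so only one variable changes per test. A sketch, assuming a stable user id is available; `assign_variant` is a hypothetical helper:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically bucket a user into one variant of one experiment.

    Hashing (experiment, user_id) keeps assignment stable for a given user
    and uncorrelated across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

print(assign_variant("user-123", "rec-explanation-length", ["short.v2", "long.v2"]))
```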
Release Flow
- update prompt variant
- run offline golden set
- run adversarial suite
- run structured-output validation suite
- canary in production
- monitor operational and business metrics
- promote or roll back
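The offline portion of this flow can be expressed as a single gate that blocks the canary on any suite failure. A sketch with stubbed suites; the real suites are the golden-set, adversarial, and structured-output checks above:

```python
from typing import Callable

def release_gate(prompt_version: str,
                 suites: dict[str, Callable[[str], bool]]) -> bool:
    """Run every offline suite in order; any failure blocks the canary."""
    for name, suite in suites.items():
        if not suite(prompt_version):
            print(f"{prompt_version}: {name} suite failed, blocking canary")
            return False
    print(f"{prompt_version}: offline gates passed, proceeding to canary")
    return True

# Stub suites for illustration only; real ones run the checks described above.
release_gate("faq-policy.v3", {
    "golden set": lambda v: True,
    "adversarial": lambda v: True,
    "structured output": lambda v: True,
})
```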
Metrics to Watch in Production
| Metric | Why |
|---|---|
| thumbs down rate | quick user dissatisfaction signal |
| escalation rate | can reveal answer quality or trust issues |
| guardrail trigger rate | may show prompt drift or model drift |
| parse failure rate | structured-output health |
| first-token latency | user-perceived responsiveness |
| total token usage | cost control |
| recommendation click-through | business signal for discovery quality |
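Several of these, such as parse failure rate, are simple rolling rates over recent traffic. A minimal sketch, assuming an in-process counter; the window size and the 2% alert threshold are illustrative, not recommendations:

```python
import json
from collections import deque

class RollingRate:
    """Failure rate over the last N events, e.g. structured-output parse failures."""

    def __init__(self, window: int = 1000):
        self.events: deque[bool] = deque(maxlen=window)

    def record(self, failed: bool) -> None:
        self.events.append(failed)

    @property
    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

parse_failures = RollingRate(window=500)
for raw in ['{"ok": true}', "not json", '{"ok": true}']:
    try:
        json.loads(raw)
        parse_failures.record(False)
    except json.JSONDecodeError:
        parse_failures.record(True)

if parse_failures.rate > 0.02:  # illustrative alert threshold
    print(f"parse failure rate {parse_failures.rate:.1%} is above threshold")
```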
Failure Analysis Checklist
When a new prompt underperforms, ask in this order:
- did the routing send the request to the wrong path?
- was the source of truth incomplete or wrong?
- was retrieval poor?
- was the prompt too large or too vague?
- did the FM violate the contract?
- did downstream validation hide a deeper generation issue?
Exit Rule
Do not ship a prompt just because it sounds better in a demo.
Ship it only if it improves the target behavior without breaking grounding, latency, schema, or safety.