
07. Prompt Evaluation, Versioning, and Regression

Why Prompt Evaluation Needs Its Own Process

Prompt engineering quickly becomes unstable if changes are tested by intuition alone.

MangaAssist needs a release process for prompts just as much as it needs one for APIs and models.

Evaluation Layers

| Layer | Question |
| --- | --- |
| offline functional | did the response follow the intended behavior? |
| offline quality | was the answer useful, grounded, and concise? |
| adversarial | did the prompt resist manipulation and edge cases? |
| online business | did the change improve outcomes without hurting trust? |
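The offline functional layer is the cheapest one to automate. A minimal sketch, assuming responses arrive as plain strings; the function name and rule lists are illustrative, not a fixed API:

```python
def functional_check(
    response: str,
    must_include: list[str],
    must_not_include: list[str],
) -> bool:
    """Pass if the response mentions required phrases and avoids banned ones."""
    text = response.lower()
    has_required = all(phrase.lower() in text for phrase in must_include)
    has_banned = any(phrase.lower() in text for phrase in must_not_include)
    return has_required and not has_banned

# Example: a returns-policy answer must cite the policy and the window,
# and must not promise something the policy does not say.
assert functional_check(
    "Per our returns policy, damaged items can be returned within 30 days.",
    must_include=["returns policy", "30 days"],
    must_not_include=["guaranteed refund"],
)
```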

Test Set Categories

  1. recommendation scenarios
  2. product attribute questions
  3. FAQ and policy questions
  4. order and return support
  5. structured-output scenarios
  6. low-confidence and ambiguous requests
  7. adversarial prompt-injection attempts
  8. multilingual requests
  9. long-history sessions
  10. degraded-service scenarios

Golden Set Design

Each golden case should include:

  • user message
  • context bundle
  • retrieved chunks if applicable
  • expected behavior notes
  • unacceptable behavior notes
  • expected schema if structured output is required

The goal is not to match one exact answer string. The goal is a behavior envelope: any answer that satisfies the expected-behavior notes passes, and any answer that matches the unacceptable-behavior notes fails.
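A minimal sketch of one golden case as a plain record, assuming a simple dict-based format; the field names mirror the list above but the exact schema is an assumption:

```python
golden_case = {
    "id": "faq-policy-007",
    "user_message": "Can I return a damaged volume after 30 days?",
    "context_bundle": {"locale": "en-US", "membership_tier": "standard"},
    "retrieved_chunks": ["returns-policy#damaged-items"],
    "expected_behavior": [
        "cites the returns policy chunk",
        "states the 30-day window and the damaged-item exception",
    ],
    "unacceptable_behavior": [
        "invents a policy not present in the retrieved chunks",
        "promises a specific refund amount",
    ],
    # Set to a JSON Schema for structured-output cases; None otherwise.
    "expected_schema": None,
}
```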

Example Scorecard

| Metric | Description |
| --- | --- |
| grounding pass rate | answer used only approved sources |
| factual accuracy | business facts match source of truth |
| schema validity | output parses and matches contract |
| concision score | answer stays within response target |
| recommendation usefulness | human raters judge fit quality |
| safe refusal quality | refusal is correct and still helpful |
| escalation precision | system escalates when it should, not when it should not |
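Per-case pass/fail results roll up into the scorecard as simple rates. A sketch, assuming each case records a boolean per metric; the result format is illustrative:

```python
# Hypothetical per-case results from an offline run.
results = [
    {"grounding": True, "schema_valid": True},
    {"grounding": False, "schema_valid": True},
    {"grounding": True, "schema_valid": True},
]

def pass_rate(metric: str) -> float:
    """Fraction of cases that passed the given metric."""
    outcomes = [case[metric] for case in results if metric in case]
    return sum(outcomes) / len(outcomes)

print(f"grounding pass rate: {pass_rate('grounding'):.0%}")    # 67%
print(f"schema validity:     {pass_rate('schema_valid'):.0%}")  # 100%
```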

Prompt Versioning Strategy

Use an explicit version id for each prompt family.

Example:

  • recommendation.v1
  • recommendation.v2
  • faq-policy.v3
  • escalation-summary.v2

Version prompts by intent family rather than maintaining one giant global prompt.
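A minimal sketch of a version-aware prompt registry keyed by intent family; the registry structure, prompt text, and `get_prompt` helper are assumptions for illustration:

```python
PROMPTS: dict[str, dict[int, str]] = {
    "recommendation": {
        1: "You are MangaAssist. Recommend titles using only the catalog context...",
        2: "You are MangaAssist. Recommend up to 3 titles with one-line reasons...",
    },
    "faq-policy": {
        3: "Answer strictly from the provided policy chunks...",
    },
}

def get_prompt(family: str, version: int) -> str:
    """Resolve an explicit version id like recommendation.v2."""
    return PROMPTS[family][version]

system_prompt = get_prompt("recommendation", 2)
```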

Change Log Template

| Field | Example |
| --- | --- |
| version | faq-policy.v3 |
| change summary | reduced chunk count from 3 to 2 and clarified incomplete-evidence behavior |
| hypothesis | higher grounding precision and shorter latency |
| risk | more false insufficiency responses |
| evaluation status | offline passed, online canary pending |
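Keeping the entry next to the prompt itself makes reviews easier. A sketch of the same template as a plain record; the storage format is an assumption:

```python
change_log_entry = {
    "version": "faq-policy.v3",
    "change_summary": "reduced chunk count from 3 to 2; clarified incomplete-evidence behavior",
    "hypothesis": "higher grounding precision and shorter latency",
    "risk": "more false insufficiency responses",
    "evaluation_status": "offline passed, online canary pending",
}
```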

Regression Triggers

Run regression when any of the following changes:

  • system rules
  • output contract wording
  • chunk count
  • retrieval filter logic
  • history summarization logic
  • FM choice or decoding settings
  • guardrail policy interaction
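One way to enforce this mechanically is to fingerprint everything on the trigger list and re-run the suites whenever the fingerprint changes. A sketch; the config field names are illustrative assumptions:

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Stable hash of prompt-adjacent settings; any change should force regression."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

current = config_fingerprint({
    "system_rules": "faq-policy.v3",
    "output_contract": "answer-plus-citations JSON",
    "chunk_count": 2,
    "retrieval_filter": "locale=en",
    "history_summarizer": "rollup.v1",
    "model": "fm-large",
    "decoding": {"temperature": 0.2, "top_p": 0.9},
})
# Compare `current` against the fingerprint recorded at the last passing run;
# a mismatch means the regression suites must run again.
```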

A/B Testing Guidance

Good A/B Targets

  • shorter vs longer recommendation explanation prompt
  • one-shot vs zero-shot prompt for JSON behavior
  • different follow-up suggestion instructions

Bad A/B Targets

  • changing safety and UX variables together
  • comparing prompt changes while also changing retrieval logic, so the effects cannot be isolated
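For good targets, a deterministic split keeps each user on one variant for the whole experiment. A minimal bucketing sketch; the experiment name, salting scheme, and 50/50 split are assumptions:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Hash the user into [0, 1) and assign 'treatment' below the share cutoff."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if bucket < treatment_share else "control"

variant = assign_variant("user-123", "recommendation-v1-vs-v2")
```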

Release Flow

  1. update prompt variant
  2. run offline golden set
  3. run adversarial suite
  4. run structured-output validation suite
  5. canary in production
  6. monitor operational and business metrics
  7. promote or roll back
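A minimal sketch of the gate between steps 2-4 and the canary, assuming a harness records one pass/fail result per suite; `SUITE_RESULTS` and the suite names are illustrative:

```python
# Stand-in for whatever harness actually runs the golden, adversarial,
# and structured-output suites and records their outcomes.
SUITE_RESULTS = {"golden": True, "adversarial": True, "schema": True}

def ready_for_canary() -> bool:
    """All offline suites must pass before any production traffic."""
    return all(SUITE_RESULTS[s] for s in ("golden", "adversarial", "schema"))

if ready_for_canary():
    print("promote prompt variant to canary")
else:
    print("block release; fix offline failures first")
```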

Metrics to Watch in Production

| Metric | Why |
| --- | --- |
| thumbs-down rate | quick user dissatisfaction signal |
| escalation rate | can reveal answer quality or trust issues |
| guardrail trigger rate | may show prompt drift or model drift |
| parse failure rate | structured-output health |
| first-token latency | user-perceived responsiveness |
| total token usage | cost control |
| recommendation click-through | business signal for discovery quality |
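Most of these reduce to simple rates over a response event log. A sketch computing two of them, assuming events carry a feedback signal and a parse outcome; the event field names are assumptions:

```python
# Hypothetical per-response events from production logs.
events = [
    {"feedback": "down", "parsed": True},
    {"feedback": None, "parsed": False},
    {"feedback": "up", "parsed": True},
]

thumbs_down_rate = sum(e["feedback"] == "down" for e in events) / len(events)
parse_failure_rate = sum(not e["parsed"] for e in events) / len(events)

print(f"thumbs-down rate:   {thumbs_down_rate:.0%}")
print(f"parse failure rate: {parse_failure_rate:.0%}")
```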

Failure Analysis Checklist

When a new prompt underperforms, ask in this order:

  1. did the routing send the request to the wrong path?
  2. was the truth source incomplete or wrong?
  3. was retrieval poor?
  4. was the prompt too large or too vague?
  5. did the FM violate the contract?
  6. did downstream validation hide a deeper generation issue?

Exit Rule

Do not ship a prompt just because it sounds better in a demo.

Ship it only if it improves the target behavior without breaking grounding, latency, schema, or safety.