02. Offline Testing Scenarios With Answers
This file goes deep on realistic MangaAssist scenarios where teams are tempted to spend money on repeated LLM calls, but should first solve the problem offline.
Scenario 1: A prompt template changed for recommendations
Question
The team changed the recommendation system prompt to make answers sound more conversational. How do you test that change without running hundreds of Bedrock calls?
Short answer
Test the prompt as a compiled artifact first, not as a paid generation experiment.
Deep-dive answer
For MangaAssist, the recommendation prompt is built from:
- system rules
- user message
- recommendation-engine output
- product metadata
- optional retrieved editorial content
- conversation history
Most failures in a prompt edit are structural:
- important product JSON got dropped
- history got duplicated
- prohibited instructions disappeared
- token count exploded
- the model is no longer told to avoid inventing prices
So the first step is offline prompt compilation:
- Freeze 30 to 50 recommendation fixtures.
- For each fixture, compile the old prompt and the new prompt.
- Diff them.
- Assert required sections exist in the right order.
- Measure prompt token growth.
- Fail if the prompt grows past an agreed budget.
What I would assert
- prompt contains current product context when page ASIN exists
- recommendation engine outputs are present
- "never invent prices" rule is still present
- conversation history stays within max-turn budget
- prompt size growth is below a threshold such as 10 to 15 percent
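To make these assertions concrete, here is a minimal Python sketch that runs against already-compiled prompt strings. The section markers (CURRENT_PRODUCT, RECOMMENDATIONS), the whitespace token proxy, and the growth budget are illustrative assumptions, not MangaAssist internals.

```python
# Structural checks on compiled prompts; no model call is needed.

def approx_tokens(text: str) -> int:
    # Rough whitespace proxy; a real suite would use the target model's tokenizer.
    return len(text.split())

def check_prompt(old_prompt: str, new_prompt: str, has_page_asin: bool,
                 max_growth: float = 0.15) -> list[str]:
    failures = []
    if has_page_asin and "CURRENT_PRODUCT" not in new_prompt:
        failures.append("missing current product context")
    if "RECOMMENDATIONS" not in new_prompt:
        failures.append("missing recommendation engine output")
    if "never invent prices" not in new_prompt.lower():
        failures.append("price rule dropped")
    growth = approx_tokens(new_prompt) / max(approx_tokens(old_prompt), 1) - 1
    if growth > max_growth:
        failures.append(f"prompt grew {growth:.0%}, budget is {max_growth:.0%}")
    return failures
```

Running this over 30 to 50 frozen fixtures catches most structural regressions before any token is paid for.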
Cheap second step
If the prompt compiles cleanly, run a local-model smoke test with a small open-source model. The goal is not to judge final quality. The goal is to catch catastrophic failures such as:
- empty answer
- malformed JSON
- ignoring product list
- extremely verbose completions
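A sketch of those catastrophic-failure checks, applied to whatever completion the small local model returned. The expected JSON shape (an "items" list with "asin" fields) and the length cap are assumptions for illustration.

```python
import json

def smoke_check(completion: str, expected_asins: list[str],
                max_chars: int = 4000) -> list[str]:
    # Catch only catastrophic failures; this is not a quality judgment.
    if not completion.strip():
        return ["empty answer"]
    try:
        payload = json.loads(completion)
    except json.JSONDecodeError:
        return ["malformed JSON"]
    if not isinstance(payload, dict):
        return ["malformed JSON"]
    failures = []
    mentioned = {item.get("asin") for item in payload.get("items", [])}
    if not mentioned & set(expected_asins):
        failures.append("ignored the product list")
    if len(completion) > max_chars:
        failures.append("extremely verbose completion")
    return failures
```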
When I would pay for Bedrock
Only after the prompt passes offline checks would I run a small stratified paid sample, maybe 20 recommendation cases. The target model is only needed for final semantic confidence.
Why this is cost-optimized
Because the majority of prompt failures are discovered before any paid inference happens.
Scenario 2: The retriever chunking strategy changed
Question
We changed FAQ chunking from 512 tokens to 256 tokens with overlap. How do we know whether the chatbot got better without paying for generation on every test?
Short answer
Evaluate retrieval directly. Do not use generation as a proxy for retrieval quality.
Deep-dive answer
In MangaAssist, policy and FAQ quality depends heavily on the retriever finding the right chunks. If the retriever fails, the model either hallucinates or answers vaguely. Testing this through full generation is slow and expensive because it confounds two systems:
- the retriever
- the generator
Instead, build a retrieval gold set:
- query
- gold relevant document IDs
- optional distractor documents
- freshness requirement
Then compare old vs new retrieval on:
- Recall@3
- Recall@5
- MRR
- stale chunk rate
- average prompt context size
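Recall@k and MRR can be computed with a few lines of Python once the gold set exists; the ranked-ID and gold-ID formats below are placeholders.

```python
# Direct retrieval metrics over the gold set; no generation involved.

def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int) -> float:
    return len(set(ranked_ids[:k]) & gold_ids) / len(gold_ids) if gold_ids else 0.0

def mrr(ranked_ids: list[str], gold_ids: set[str]) -> float:
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]]) -> dict[str, float]:
    # runs: one (ranked retrieval, gold IDs) pair per gold-set query.
    n = len(runs)
    return {
        "recall@3": sum(recall_at_k(r, g, 3) for r, g in runs) / n,
        "recall@5": sum(recall_at_k(r, g, 5) for r, g in runs) / n,
        "mrr": sum(mrr(r, g) for r, g in runs) / n,
    }
```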
What matters most
If 256-token chunks increase Recall@3 but double the number of chunks needed in the prompt, you may improve answer quality while also increasing cost. That tradeoff must be measured explicitly.
Example decision rule
Promote the new chunking only if:
- Recall@3 improves by at least 2 points, or
- MRR improves without increasing average injected context by more than 10 percent
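Expressed as a promotion gate, assuming the metrics are fractions (so 0.02 means 2 points) and that average injected context size was also recorded per run:

```python
def promote_new_chunking(old: dict, new: dict) -> bool:
    # Keys match the evaluate() output above plus an assumed avg_context_tokens.
    recall_gain = new["recall@3"] - old["recall@3"]
    mrr_gain = new["mrr"] - old["mrr"]
    context_growth = new["avg_context_tokens"] / old["avg_context_tokens"] - 1
    return recall_gain >= 0.02 or (mrr_gain > 0 and context_growth <= 0.10)
```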
When I would pay for Bedrock
After retrieval wins offline, I would sample a small set of policy questions with the target model to verify answer fluency and citation behavior. The retrieval decision itself should already be made offline.
Why this is cost-optimized
Because generation is the expensive layer, and retrieval can be validated with direct metrics at zero GenAI cost.
Scenario 3: The intent classifier was retrained
Question
A new classifier model improves recommendation intent accuracy, but some order_tracking and promotion messages now fall into general_query. How should this be tested offline, and why is that also a cost question?
Short answer
Intent testing is not only a quality problem. It is a routing-cost problem.
Deep-dive answer
In MangaAssist, routing decides whether a message goes to:
- template path
- API path
- retrieval path
- full generation path
If the classifier sends structured traffic into a general LLM path, cost rises even if the user still gets an acceptable answer.
So the offline test suite should not stop at accuracy. It should also measure route economics.
Offline evaluation plan
- Replay a labeled intent dataset.
- Produce a confusion matrix.
- Add route class labels: template, api, rag, llm.
- Compute the percentage of traffic in each route class.
- Estimate token spend caused by the routing distribution.
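A minimal sketch of that route-economics step. The intent-to-route map and per-call costs are illustrative assumptions, not real MangaAssist figures.

```python
from collections import Counter

ROUTE_FOR_INTENT = {
    "order_tracking": "api",
    "promotion": "template",
    "faq": "rag",
    "recommendation": "llm",
    "general_query": "llm",
}
COST_PER_CALL_USD = {"template": 0.0, "api": 0.0, "rag": 0.004, "llm": 0.02}

def route_economics(predicted_intents: list[str]) -> dict:
    # Map the classifier's predictions to routes, then price the distribution.
    routes = [ROUTE_FOR_INTENT.get(i, "llm") for i in predicted_intents]
    share = {r: c / len(routes) for r, c in Counter(routes).items()}
    cost_per_1k = 1000 * sum(COST_PER_CALL_USD[r] * s for r, s in share.items())
    return {"route_share": share, "est_cost_per_1k_messages": cost_per_1k}
```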
What I would monitor
- per-intent F1
- false movement into general_query
- percent of traffic routed to the expensive LLM path
- estimated average token cost per session
Why a seemingly better model can be worse
A classifier can improve on headline accuracy while still increasing cost. Example:
- it gets more recommendation queries right
- but it misroutes many order or promotion queries into the LLM path
That hurts cost and may hurt latency as well.
Correct answer
Do not approve the classifier based only on accuracy. Approve it only if both quality and route-cost metrics are acceptable.
Why this is cost-optimized
Because routing is one of the strongest cost levers in the system, and it can be tested entirely offline.
Scenario 4: Guardrails were tightened after a price hallucination incident
Question
After a production incident where the bot stated the wrong manga price, the team tightened the price and competitor guardrails. How do you test the fix deeply without repeatedly calling the model?
Short answer
Use synthetic bad outputs and deterministic validators. You do not need Bedrock to test a price checker.
Deep-dive answer
The price incident can be decomposed into separate checks:
- did the output mention a price?
- was the price present in the catalog response?
- was the product ASIN valid?
- did the output introduce competitor text?
None of those require live generation to validate.
Offline test design
Create a fixture bank of bad outputs such as:
- correct product, wrong price
- correct price, wrong currency symbol
- non-existent ASIN
- competitor mention hidden in a long paragraph
- multiple products with one incorrect price
Then run the output safety pipeline against them.
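A deterministic validator of this kind can be exercised end to end without any model call. The sketch below assumes a toy catalog, a competitor list, and a simple price regex; the real pipeline would plug in the actual catalog response.

```python
import re

CATALOG = {"B000MANGA1": {"price": "9.99"}}          # placeholder catalog lookup
COMPETITORS = ["OtherBookstore", "RivalShop"]        # placeholder blocklist

def validate_output(text: str, asin: str) -> list[str]:
    violations = []
    entry = CATALOG.get(asin)
    if entry is None:
        violations.append("unknown ASIN")
    for price in re.findall(r"[$]\s?(\d+(?:\.\d{2})?)", text):
        if entry is None or price != entry["price"]:
            violations.append(f"price {price} not backed by catalog")
    for name in COMPETITORS:
        if name.lower() in text.lower():
            violations.append(f"competitor mention: {name}")
    return violations

# Each fixture pairs a synthetic output with the expected verdict.
FIXTURES = [
    ("Volume 1 is $12.99 right now.", "B000MANGA1", True),   # wrong price
    ("Volume 1 is $9.99 right now.", "B000MANGA1", False),   # clean output
    ("Cheaper at OtherBookstore!", "B000MANGA1", True),      # competitor mention
]

for text, asin, expect_block in FIXTURES:
    blocked = bool(validate_output(text, asin))
    print("PASS" if blocked == expect_block else "FAIL", repr(text))
```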
What I would measure
- false negatives: unsafe outputs that slip through
- false positives: valid outputs blocked incorrectly
- auto-correction success rate for price replacement
- latency added by the guardrail pipeline
Important nuance
Do not only test adversarial outputs. Also test normal outputs. If guardrails overfire, the system may block safe answers and escalate more often, which also increases cost.
When I would pay for Bedrock
Only for a small final sample to ensure the target model plus the new guardrails still produce natural answers. The correctness of the guardrail rules themselves should be proven offline first.
Why this is cost-optimized
Because the expensive model is not the thing being tested. The validator is.
Scenario 5: The memory summarizer changed and multi-turn quality regressed
Question
Users say the bot no longer understands "tell me more about the second one" after long chats. How do you investigate and test this offline?
Short answer
Replay archived multi-turn sessions and test entity preservation through summarization.
Deep-dive answer
This issue often looks like an LLM problem, but it is usually a memory problem. In MangaAssist, once a conversation grows large, older turns may be summarized. If the summary drops product references, later follow-ups fail.
Offline replay plan
- Build a dataset of long sessions where follow-up references are important.
- Mark the required preserved entities:
  - ASIN
  - title
  - ordinal position such as first or second
  - unresolved user need
- Run the old summarizer and the new summarizer.
- Compare what information survives.
- Run a follow-up resolution check on the summarized history.
Good offline assertions
- the summary preserves recommended items
- the summary preserves which item was selected
- the summary preserves whether the user asked to add to cart or compare
- the summary preserves unresolved issues for escalation
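These assertions are plain string and field checks, so they cost nothing to run on every summarizer change. A sketch, assuming the replay dataset records the required entities per session:

```python
def missing_entities(summary: str, required: dict) -> list[str]:
    # Report which required entities the summary failed to preserve.
    missing = []
    for asin in required.get("recommended_asins", []):
        if asin not in summary:
            missing.append(f"dropped recommended item {asin}")
    selected = required.get("selected_asin")
    if selected and selected not in summary:
        missing.append(f"dropped selected item {selected}")
    for need in required.get("unresolved_needs", []):
        if need.lower() not in summary.lower():
            missing.append(f"dropped unresolved need: {need}")
    return missing

# Compare what the old and new summarizers preserve for one session.
required = {"recommended_asins": ["B000MANGA1", "B000MANGA2"],
            "selected_asin": "B000MANGA2",
            "unresolved_needs": ["add to cart"]}
old_summary = "Recommended B000MANGA1 and B000MANGA2; user picked B000MANGA2 and wants to add to cart."
new_summary = "Recommended two manga volumes; user liked the second one."
print(missing_entities(old_summary, required))  # []
print(missing_entities(new_summary, required))  # everything dropped
```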
Cheap local-model use
If the summarizer itself is LLM-based, replace most regression runs with:
- stored gold summaries for comparison
- deterministic extraction checks
- optional local-model smoke tests
Do not spend paid-model budget on every memory replay.
Correct answer
Treat summarization as a separate component with its own dataset and regression tests. Do not wait until full-chat paid evaluations reveal the failure.
Why this is cost-optimized
Because long multi-turn chats are some of the most expensive conversations to replay on a paid model, yet the bug often sits in a cheaper layer.
Scenario 6: The team wants to add semantic caching
Question
How do you prove semantic caching will save money for MangaAssist before rolling it out?
Short answer
Simulate cache behavior on historical traffic and estimate avoided LLM calls offline.
Deep-dive answer
Semantic caching is a classic cost optimization, but it must be validated carefully. A bad cache saves money while hurting relevance. The right process is offline replay plus quality guardrails.
Offline experiment
- Take a representative traffic slice.
- Embed user queries with a local embedding model.
- Cluster or nearest-neighbor match semantically similar questions.
- Simulate a cache with different similarity thresholds.
- Estimate:
  - hit rate
  - avoided LLM calls
  - token savings
  - wrong-cache risk from mismatched intents
Example formula
Monthly savings = cache_hits_per_month * average_cost_per_uncached_response
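A small replay simulation makes both the hit rate and the savings estimate concrete. In the sketch below a bag-of-words vector stands in for a real local embedding model, and the per-response cost is a placeholder.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a local embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def simulate_cache(queries: list[str], threshold: float,
                   cost_per_uncached_response: float = 0.02) -> dict:
    cache: list[Counter] = []
    hits = 0
    for q in queries:
        vec = embed(q)
        if any(cosine(vec, cached) >= threshold for cached in cache):
            hits += 1          # served from cache, LLM call avoided
        else:
            cache.append(vec)  # miss: pay for generation, then store
    return {"hit_rate": hits / len(queries),
            "estimated_savings": hits * cost_per_uncached_response}
```

Sweeping the threshold over the same traffic slice shows the tradeoff directly: a lower threshold raises the hit rate but also raises the wrong-cache risk.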
What I would not do
I would not deploy semantic caching because a vendor demo says the hit rate is high. MangaAssist traffic includes product-specific, user-specific, and policy-specific messages. A cache hit is only valid if those contexts still match.
Required offline slices
- generic FAQ questions
- product-specific questions
- recommendation requests
- authenticated order questions
The expected result is that FAQ traffic caches well, but personalized or order-related traffic should have a much lower safe hit rate.
Correct answer
Approve semantic caching only where offline replay shows both strong hit rate and low semantic mismatch risk. Then scope it to eligible intents, not all traffic.
Why this is cost-optimized
Because it estimates savings before rollout and avoids expensive bad cache hits in production.
Scenario 7: Product wants more FAQ answers to bypass the LLM entirely
Question
Product asks, "Can we answer more FAQ and promotion questions with templates or retrieval snippets so we save LLM cost?" How do you test that idea offline?
Short answer
Replay traffic and compare resolution quality against estimated LLM avoidance.
Deep-dive answer
This is one of the highest-leverage cost optimizations in MangaAssist. If more FAQ or promotion traffic can be served from:
- templates
- retrieval snippets
- API-backed structured responses
then LLM cost drops directly.
Offline experiment design
- Take a labeled FAQ and promotion dataset.
- Define which questions are eligible for deterministic answering.
- Create the template or retrieval-only answer path.
- Compare old vs new on:
  - correctness
  - completeness
  - escalation risk
  - estimated LLM bypass rate
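A replay sketch of the bypass decision, assuming the labeled dataset records the intent, a retrieval confidence score, and whether the deterministic answer was judged correct during annotation:

```python
ELIGIBLE_INTENTS = {"faq", "promotion"}

def replay_bypass(dataset: list[dict], confidence_threshold: float = 0.8) -> dict:
    # Count how much traffic the deterministic path could absorb, and how well.
    bypassed = correct = 0
    for item in dataset:
        eligible = item["intent"] in ELIGIBLE_INTENTS
        confident = item["retrieval_confidence"] >= confidence_threshold
        if eligible and confident:
            bypassed += 1
            correct += int(item["deterministic_answer_correct"])
    return {
        "llm_bypass_rate": bypassed / len(dataset),
        "bypass_accuracy": correct / bypassed if bypassed else 1.0,
    }
```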
What I would require
- no meaningful quality regression on high-volume FAQ questions
- clear routing confidence threshold for bypass
- fallback to LLM when retrieval confidence is low
Example tradeoff
If deterministic FAQ answers bypass 35 percent of all FAQ LLM calls with only a 1 percent rise in escalations, that may be a very strong business decision. If escalations rise 8 percent, the cost savings may be false savings because human support cost goes up.
Correct answer
Measure the change as a total-system cost problem:
- Bedrock savings
- latency improvement
- any extra human support cost
The cheapest answer is not the best answer if it shifts cost to agents.
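A worked sketch of that total-system framing. Every unit cost below is an assumed placeholder, so the break-even point will differ with real numbers, but the shape of the calculation is the point.

```python
def net_monthly_saving(calls: int, bypass_rate: float, llm_cost: float,
                       extra_escalation_rate: float, human_cost: float) -> float:
    bedrock_saving = calls * bypass_rate * llm_cost
    extra_support_cost = calls * extra_escalation_rate * human_cost
    return bedrock_saving - extra_support_cost

def break_even_escalation_rate(bypass_rate: float, llm_cost: float,
                               human_cost: float) -> float:
    # Extra escalation rate at which the Bedrock saving is fully consumed.
    return bypass_rate * llm_cost / human_cost

# With these placeholder costs, even a small rise in escalations can erase the
# saving, which is exactly why the tradeoff must be measured, not assumed.
print(break_even_escalation_rate(bypass_rate=0.35, llm_cost=0.02, human_cost=2.0))
```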
Why this is cost-optimized
Because it targets the strongest root lever: reducing unnecessary LLM invocations altogether.
Scenario 8: The team asks, "When do we still need paid GenAI offline testing?"
Question
If offline testing is so powerful, when do we still have to spend money on the real target model?
Short answer
Only for the parts where target-model behavior is the thing under test.
Deep-dive answer
There are limits to offline approximation. Local models, synthetic fixtures, and deterministic checks cannot fully predict:
- final tone and helpfulness of the target model
- prompt sensitivity unique to the target model
- jailbreak behavior unique to the target model
- latency and output length of the target model
I would still use a paid offline gate for
- major prompt-template rewrites
- model version upgrades
- high-risk policy or safety changes
- new multi-turn reasoning flows
- release candidates before shadow or canary
But I would keep it narrow
The correct pattern is:
- cheap offline suite on broad coverage
- small stratified paid gate on narrow coverage
Not:
- broad paid evaluation on every change
Recommended answer
Use the real target model only when the answer depends on target-model semantics. For routing, retrieval, guardrails, memory preservation, schema, and cost simulation, offline testing should do most of the work.
Why this is cost-optimized
Because it preserves confidence where the target model matters while refusing to spend money where it does not.
Final Takeaway
For MangaAssist, offline testing is not a compromise. It is the correct engineering design for a costly GenAI system.
The practical rule is:
- test routing offline
- test retrieval offline
- test guardrails offline
- test memory offline
- test prompt structure offline
- test spend offline
- use the paid model only for the narrow semantic layer that cannot be approximated cheaply
That is how you control GenAI cost without turning quality assurance into guesswork.