02. Offline Testing Scenarios With Answers
This file goes deep on realistic MangaAssist scenarios where teams are tempted to spend money on repeated LLM calls, but should first solve the problem offline.
Scenario 1: A prompt template changed for recommendations
Question
The team changed the recommendation system prompt to make answers sound more conversational. How do you test that change without running hundreds of Bedrock calls?
Short answer
Test the prompt as a compiled artifact first, not as a paid generation experiment.
Deep-dive answer
For MangaAssist, the recommendation prompt is built from:
- system rules
- user message
- recommendation-engine output
- product metadata
- optional retrieved editorial content
- conversation history
Most failures in a prompt edit are structural:
- important product JSON got dropped
- history got duplicated
- prohibited instructions disappeared
- token count exploded
- the model is no longer told to avoid inventing prices
So the first step is offline prompt compilation:
- Freeze 30 to 50 recommendation fixtures.
- For each fixture, compile the old prompt and the new prompt.
- Diff them.
- Assert required sections exist in the right order.
- Measure prompt token growth.
- Fail if the prompt grows past an agreed budget.
What I would assert
- prompt contains current product context when page ASIN exists
- recommendation engine outputs are present
- "never invent prices" rule is still present
- conversation history stays within max-turn budget
- prompt size growth is below a threshold such as 10 to 15 percent
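To make these assertions concrete, here is a minimal Python sketch that runs against already-compiled prompt strings. The section markers (CURRENT_PRODUCT, RECOMMENDATIONS), the whitespace token proxy, and the growth budget are illustrative assumptions, not MangaAssist internals.

```python
# Structural checks on compiled prompts; no model call is needed.

def approx_tokens(text: str) -> int:
    # Rough whitespace proxy; a real suite would use the target model's tokenizer.
    return len(text.split())

def check_prompt(old_prompt: str, new_prompt: str, has_page_asin: bool,
                 max_growth: float = 0.15) -> list[str]:
    failures = []
    if has_page_asin and "CURRENT_PRODUCT" not in new_prompt:
        failures.append("missing current product context")
    if "RECOMMENDATIONS" not in new_prompt:
        failures.append("missing recommendation engine output")
    if "never invent prices" not in new_prompt.lower():
        failures.append("price rule dropped")
    growth = approx_tokens(new_prompt) / max(approx_tokens(old_prompt), 1) - 1
    if growth > max_growth:
        failures.append(f"prompt grew {growth:.0%}, budget is {max_growth:.0%}")
    return failures
```

Running this over 30 to 50 frozen fixtures catches most structural regressions before any token is paid for.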
Cheap second step
If the prompt compiles cleanly, run a local-model smoke test with a small open-source model. The goal is not to judge final quality. The goal is to catch catastrophic failures such as:
- empty answer
- malformed JSON
- ignoring product list
- extremely verbose completions
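A sketch of those catastrophic-failure checks, applied to whatever completion the small local model returned. The expected JSON shape (an "items" list with "asin" fields) and the length cap are assumptions for illustration.

```python
import json

def smoke_check(completion: str, expected_asins: list[str],
                max_chars: int = 4000) -> list[str]:
    # Catch only catastrophic failures; this is not a quality judgment.
    if not completion.strip():
        return ["empty answer"]
    try:
        payload = json.loads(completion)
    except json.JSONDecodeError:
        return ["malformed JSON"]
    if not isinstance(payload, dict):
        return ["malformed JSON"]
    failures = []
    mentioned = {item.get("asin") for item in payload.get("items", [])}
    if not mentioned & set(expected_asins):
        failures.append("ignored the product list")
    if len(completion) > max_chars:
        failures.append("extremely verbose completion")
    return failures
```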
When I would pay for Bedrock
Only after the prompt passes offline checks would I run a small stratified paid sample, maybe 20 recommendation cases. The target model is only needed for final semantic confidence.
Why this is cost-optimized
Because the majority of prompt failures are discovered before any paid inference happens.
Scenario 2: The retriever chunking strategy changed
Question
We changed FAQ chunking from 512 tokens to 256 tokens with overlap. How do we know whether the chatbot got better without paying for generation on every test?
Short answer
Evaluate retrieval directly. Do not use generation as a proxy for retrieval quality.
Deep-dive answer
In MangaAssist, policy and FAQ quality depends heavily on the retriever finding the right chunks. If the retriever fails, the model either hallucinates or answers vaguely. Testing this through full generation is slow and expensive because it confounds two systems:
- the retriever
- the generator
Instead, build a retrieval gold set:
- query
- gold relevant document IDs
- optional distractor documents
- freshness requirement
Then compare old vs new retrieval on:
- Recall@3
- Recall@5
- MRR
- stale chunk rate
- average prompt context size
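Recall@k and MRR can be computed with a few lines of Python once the gold set exists; the ranked-ID and gold-ID formats below are placeholders.

```python
# Direct retrieval metrics over the gold set; no generation involved.

def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int) -> float:
    return len(set(ranked_ids[:k]) & gold_ids) / len(gold_ids) if gold_ids else 0.0

def mrr(ranked_ids: list[str], gold_ids: set[str]) -> float:
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]]) -> dict[str, float]:
    # runs: one (ranked retrieval, gold IDs) pair per gold-set query.
    n = len(runs)
    return {
        "recall@3": sum(recall_at_k(r, g, 3) for r, g in runs) / n,
        "recall@5": sum(recall_at_k(r, g, 5) for r, g in runs) / n,
        "mrr": sum(mrr(r, g) for r, g in runs) / n,
    }
```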
What matters most
If 256-token chunks increase Recall@3 but double the number of chunks needed in the prompt, you may improve answer quality while also increasing cost. That tradeoff must be measured explicitly.
Example decision rule
Promote the new chunking only if:
- Recall@3 improves by at least 2 points, or
- MRR improves without increasing average injected context by more than 10 percent
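Expressed as a promotion gate, assuming the metrics are fractions (so 0.02 means 2 points) and that average injected context size was also recorded per run:

```python
def promote_new_chunking(old: dict, new: dict) -> bool:
    # Keys match the evaluate() output above plus an assumed avg_context_tokens.
    recall_gain = new["recall@3"] - old["recall@3"]
    mrr_gain = new["mrr"] - old["mrr"]
    context_growth = new["avg_context_tokens"] / old["avg_context_tokens"] - 1
    return recall_gain >= 0.02 or (mrr_gain > 0 and context_growth <= 0.10)
```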
When I would pay for Bedrock
After retrieval wins offline, I would sample a small set of policy questions with the target model to verify answer fluency and citation behavior. The retrieval decision itself should already be made offline.
Why this is cost-optimized
Because generation is the expensive layer, and retrieval can be validated with direct metrics at zero GenAI cost.
Scenario 3: The intent classifier was retrained
Question
A new classifier model improves recommendation intent accuracy, but some order_tracking and promotion messages now fall into general_query. How should this be tested offline, and why is that also a cost question?
Short answer
Intent testing is not only a quality problem. It is a routing-cost problem.
Deep-dive answer
In MangaAssist, routing decides whether a message goes to:
- template path
- API path
- retrieval path
- full generation path
If the classifier sends structured traffic into a general LLM path, cost rises even if the user still gets an acceptable answer.
So the offline test suite should not stop at accuracy. It should also measure route economics.
Offline evaluation plan
- Replay a labeled intent dataset.
- Produce a confusion matrix.
- Add route class labels: template, api, rag, llm.
- Compute the percentage of traffic in each route class.
- Estimate token spend caused by the routing distribution.
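A minimal sketch of that route-economics step. The intent-to-route map and per-call costs are illustrative assumptions, not real MangaAssist figures.

```python
from collections import Counter

ROUTE_FOR_INTENT = {
    "order_tracking": "api",
    "promotion": "template",
    "faq": "rag",
    "recommendation": "llm",
    "general_query": "llm",
}
COST_PER_CALL_USD = {"template": 0.0, "api": 0.0, "rag": 0.004, "llm": 0.02}

def route_economics(predicted_intents: list[str]) -> dict:
    # Map the classifier's predictions to routes, then price the distribution.
    routes = [ROUTE_FOR_INTENT.get(i, "llm") for i in predicted_intents]
    share = {r: c / len(routes) for r, c in Counter(routes).items()}
    cost_per_1k = 1000 * sum(COST_PER_CALL_USD[r] * s for r, s in share.items())
    return {"route_share": share, "est_cost_per_1k_messages": cost_per_1k}
```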
What I would monitor
- per-intent F1
- false movement into general_query
- percent of traffic routed to the expensive LLM path
- estimated average token cost per session
Why a seemingly better model can be worse
A classifier can improve on headline accuracy while still increasing cost. Example:
- it gets more recommendation queries right
- but it misroutes many order or promotion queries into the LLM path
That hurts cost and may hurt latency as well.
Correct answer
Do not approve the classifier based only on accuracy. Approve it only if both quality and route-cost metrics are acceptable.
Why this is cost-optimized
Because routing is one of the strongest cost levers in the system, and it can be tested entirely offline.
Scenario 4: Guardrails were tightened after a price hallucination incident
Question
After a production incident where the bot stated the wrong manga price, the team tightened the price and competitor guardrails. How do you test the fix deeply without repeatedly calling the model?
Short answer
Use synthetic bad outputs and deterministic validators. You do not need Bedrock to test a price checker.
Deep-dive answer
The price incident can be decomposed into separate checks:
- did the output mention a price?
- was the price present in the catalog response?
- was the product ASIN valid?
- did the output introduce competitor text?
None of those require live generation to validate.
Offline test design
Create a fixture bank of bad outputs such as:
- correct product, wrong price
- correct price, wrong currency symbol
- non-existent ASIN
- competitor mention hidden in a long paragraph
- multiple products with one incorrect price
Then run the output safety pipeline against them.
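A deterministic validator of this kind can be exercised end to end without any model call. The sketch below assumes a toy catalog, a competitor list, and a simple price regex; the real pipeline would plug in the actual catalog response.

```python
import re

CATALOG = {"B000MANGA1": {"price": "9.99"}}          # placeholder catalog lookup
COMPETITORS = ["OtherBookstore", "RivalShop"]        # placeholder blocklist

def validate_output(text: str, asin: str) -> list[str]:
    violations = []
    entry = CATALOG.get(asin)
    if entry is None:
        violations.append("unknown ASIN")
    for price in re.findall(r"[$]\s?(\d+(?:\.\d{2})?)", text):
        if entry is None or price != entry["price"]:
            violations.append(f"price {price} not backed by catalog")
    for name in COMPETITORS:
        if name.lower() in text.lower():
            violations.append(f"competitor mention: {name}")
    return violations

# Each fixture pairs a synthetic output with the expected verdict.
FIXTURES = [
    ("Volume 1 is $12.99 right now.", "B000MANGA1", True),   # wrong price
    ("Volume 1 is $9.99 right now.", "B000MANGA1", False),   # clean output
    ("Cheaper at OtherBookstore!", "B000MANGA1", True),      # competitor mention
]

for text, asin, expect_block in FIXTURES:
    blocked = bool(validate_output(text, asin))
    print("PASS" if blocked == expect_block else "FAIL", repr(text))
```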
What I would measure
- false negatives: unsafe outputs that slip through
- false positives: valid outputs blocked incorrectly
- auto-correction success rate for price replacement
- latency added by the guardrail pipeline
Important nuance
Do not only test adversarial outputs. Also test normal outputs. If guardrails overfire, the system may block safe answers and escalate more often, which also increases cost.
When I would pay for Bedrock
Only for a small final sample to ensure the target model plus the new guardrails still produce natural answers. The correctness of the guardrail rules themselves should be proven offline first.
Why this is cost-optimized
Because the expensive model is not the thing being tested. The validator is.
Scenario 5: The memory summarizer changed and multi-turn quality regressed
Question
Users say the bot no longer understands "tell me more about the second one" after long chats. How do you investigate and test this offline?
Short answer
Replay archived multi-turn sessions and test entity preservation through summarization.
Deep-dive answer
This issue often looks like an LLM problem, but it is usually a memory problem. In MangaAssist, once a conversation grows large, older turns may be summarized. If the summary drops product references, later follow-ups fail.
Offline replay plan
- Build a dataset of long sessions where follow-up references are important.
- Mark the required preserved entities:
  - ASIN
  - title
  - ordinal position such as first or second
  - unresolved user need
- Run the old summarizer and the new summarizer.
- Compare what information survives.
- Run a follow-up resolution check on the summarized history.
Good offline assertions
- the summary preserves recommended items
- the summary preserves which item was selected
- the summary preserves whether the user asked to add to cart or compare
- the summary preserves unresolved issues for escalation
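These assertions are plain string and field checks, so they cost nothing to run on every summarizer change. A sketch, assuming the replay dataset records the required entities per session:

```python
def missing_entities(summary: str, required: dict) -> list[str]:
    # Report which required entities the summary failed to preserve.
    missing = []
    for asin in required.get("recommended_asins", []):
        if asin not in summary:
            missing.append(f"dropped recommended item {asin}")
    selected = required.get("selected_asin")
    if selected and selected not in summary:
        missing.append(f"dropped selected item {selected}")
    for need in required.get("unresolved_needs", []):
        if need.lower() not in summary.lower():
            missing.append(f"dropped unresolved need: {need}")
    return missing

# Compare what the old and new summarizers preserve for one session.
required = {"recommended_asins": ["B000MANGA1", "B000MANGA2"],
            "selected_asin": "B000MANGA2",
            "unresolved_needs": ["add to cart"]}
old_summary = "Recommended B000MANGA1 and B000MANGA2; user picked B000MANGA2 and wants to add to cart."
new_summary = "Recommended two manga volumes; user liked the second one."
print(missing_entities(old_summary, required))  # []
print(missing_entities(new_summary, required))  # everything dropped
```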
Cheap local-model use
If the summarizer itself is LLM-based, replace most regression runs with:
- stored gold summaries for comparison
- deterministic extraction checks
- optional local-model smoke tests
Do not spend paid-model budget on every memory replay.
Correct answer
Treat summarization as a separate component with its own dataset and regression tests. Do not wait until full-chat paid evaluations reveal the failure.
Why this is cost-optimized
Because long multi-turn chats are some of the most expensive conversations to replay on a paid model, yet the bug often sits in a cheaper layer.
Scenario 6: The team wants to add semantic caching
Question
How do you prove semantic caching will save money for MangaAssist before rolling it out?
Short answer
Simulate cache behavior on historical traffic and estimate avoided LLM calls offline.
Deep-dive answer
Semantic caching is a classic cost optimization, but it must be validated carefully. A bad cache saves money while hurting relevance. The right process is offline replay plus quality guardrails.
Offline experiment
- Take a representative traffic slice.
- Embed user queries with a local embedding model.
- Cluster or nearest-neighbor match semantically similar questions.
- Simulate a cache with different similarity thresholds.
- Estimate:
  - hit rate
  - avoided LLM calls
  - token savings
  - wrong-cache risk from mismatched intents
Example formula
Monthly savings = cache_hits_per_month * average_cost_per_uncached_response
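A small replay simulation makes both the hit rate and the savings estimate concrete. In the sketch below a bag-of-words vector stands in for a real local embedding model, and the per-response cost is a placeholder.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a local embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def simulate_cache(queries: list[str], threshold: float,
                   cost_per_uncached_response: float = 0.02) -> dict:
    cache: list[Counter] = []
    hits = 0
    for q in queries:
        vec = embed(q)
        if any(cosine(vec, cached) >= threshold for cached in cache):
            hits += 1          # served from cache, LLM call avoided
        else:
            cache.append(vec)  # miss: pay for generation, then store
    return {"hit_rate": hits / len(queries),
            "estimated_savings": hits * cost_per_uncached_response}
```

Sweeping the threshold over the same traffic slice shows the tradeoff directly: a lower threshold raises the hit rate but also raises the wrong-cache risk.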
What I would not do
I would not deploy semantic caching because a vendor demo says the hit rate is high. MangaAssist traffic includes product-specific, user-specific, and policy-specific messages. A cache hit is only valid if those contexts still match.
Required offline slices
- generic FAQ questions
- product-specific questions
- recommendation requests
- authenticated order questions
The expected result is that FAQ traffic caches well, but personalized or order-related traffic should have a much lower safe hit rate.
Correct answer
Approve semantic caching only where offline replay shows both strong hit rate and low semantic mismatch risk. Then scope it to eligible intents, not all traffic.
Why this is cost-optimized
Because it estimates savings before rollout and avoids expensive bad cache hits in production.
Scenario 7: Product wants more FAQ answers to bypass the LLM entirely
Question
Product asks, "Can we answer more FAQ and promotion questions with templates or retrieval snippets so we save LLM cost?" How do you test that idea offline?
Short answer
Replay traffic and compare resolution quality against estimated LLM avoidance.
Deep-dive answer
This is one of the highest-leverage cost optimizations in MangaAssist. If more FAQ or promotion traffic can be served from:
- templates
- retrieval snippets
- API-backed structured responses
then LLM cost drops directly.
Offline experiment design
- Take a labeled FAQ and promotion dataset.
- Define which questions are eligible for deterministic answering.
- Create the template or retrieval-only answer path.
- Compare old vs new on:
  - correctness
  - completeness
  - escalation risk
  - estimated LLM bypass rate
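A replay sketch of the bypass decision, assuming the labeled dataset records the intent, a retrieval confidence score, and whether the deterministic answer was judged correct during annotation:

```python
ELIGIBLE_INTENTS = {"faq", "promotion"}

def replay_bypass(dataset: list[dict], confidence_threshold: float = 0.8) -> dict:
    # Count how much traffic the deterministic path could absorb, and how well.
    bypassed = correct = 0
    for item in dataset:
        eligible = item["intent"] in ELIGIBLE_INTENTS
        confident = item["retrieval_confidence"] >= confidence_threshold
        if eligible and confident:
            bypassed += 1
            correct += int(item["deterministic_answer_correct"])
    return {
        "llm_bypass_rate": bypassed / len(dataset),
        "bypass_accuracy": correct / bypassed if bypassed else 1.0,
    }
```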
What I would require
- no meaningful quality regression on high-volume FAQ questions
- clear routing confidence threshold for bypass
- fallback to LLM when retrieval confidence is low
Example tradeoff
If deterministic FAQ answers bypass 35 percent of all FAQ LLM calls with only a 1 percent rise in escalations, that may be a very strong business decision. If escalations rise 8 percent, the cost savings may be false savings because human support cost goes up.
Correct answer
Measure the change as a total-system cost problem:
- Bedrock savings
- latency improvement
- any extra human support cost
The cheapest answer is not the best answer if it shifts cost to agents.
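A worked sketch of that total-system framing. Every unit cost below is an assumed placeholder, so the break-even point will differ with real numbers, but the shape of the calculation is the point.

```python
def net_monthly_saving(calls: int, bypass_rate: float, llm_cost: float,
                       extra_escalation_rate: float, human_cost: float) -> float:
    bedrock_saving = calls * bypass_rate * llm_cost
    extra_support_cost = calls * extra_escalation_rate * human_cost
    return bedrock_saving - extra_support_cost

def break_even_escalation_rate(bypass_rate: float, llm_cost: float,
                               human_cost: float) -> float:
    # Extra escalation rate at which the Bedrock saving is fully consumed.
    return bypass_rate * llm_cost / human_cost

# With these placeholder costs, even a small rise in escalations can erase the
# saving, which is exactly why the tradeoff must be measured, not assumed.
print(break_even_escalation_rate(bypass_rate=0.35, llm_cost=0.02, human_cost=2.0))
```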
Why this is cost-optimized
Because it targets the strongest root lever: reducing unnecessary LLM invocations altogether.
Scenario 8: The team asks, "When do we still need paid GenAI offline testing?"
Question
If offline testing is so powerful, when do we still have to spend money on the real target model?
Short answer
Only for the parts where target-model behavior is the thing under test.
Deep-dive answer
There are limits to offline approximation. Local models, synthetic fixtures, and deterministic checks cannot fully predict:
- final tone and helpfulness of the target model
- prompt sensitivity unique to the target model
- jailbreak behavior unique to the target model
- latency and output length of the target model
I would still use a paid offline gate for
- major prompt-template rewrites
- model version upgrades
- high-risk policy or safety changes
- new multi-turn reasoning flows
- release candidates before shadow or canary
But I would keep it narrow
The correct pattern is:
- cheap offline suite on broad coverage
- small stratified paid gate on narrow coverage
Not:
- broad paid evaluation on every change
Recommended answer
Use the real target model only when the answer depends on target-model semantics. For routing, retrieval, guardrails, memory preservation, schema, and cost simulation, offline testing should do most of the work.
Why this is cost-optimized
Because it preserves confidence where the target model matters while refusing to spend money where it does not.
Final Takeaway
For MangaAssist, offline testing is not a compromise. It is the correct engineering design for a costly GenAI system.
The practical rule is:
- test routing offline
- test retrieval offline
- test guardrails offline
- test memory offline
- test prompt structure offline
- test spend offline
- use the paid model only for the narrow semantic layer that cannot be approximated cheaply
That is how you control GenAI cost without turning quality assurance into guesswork.