01. Offline Testing Strategy - How I Keep MangaAssist Testing Cheap
The Problem
If every chatbot test invokes Bedrock, cost grows in the worst possible way:
- every prompt tweak becomes a paid experiment
- every PR runs hundreds of model calls
- failures are discovered late, after money is already spent
- engineers stop testing aggressively because the loop is expensive
For MangaAssist, that would be the wrong operating model. The system is intentionally hybrid:
- chitchat, acknowledgements, and many confirmations are template-only
- order_tracking, return_request, and price lookups are API-first
- faq and policy flows are retrieval-heavy
- only ambiguous recommendation and explanation flows need full generation
So the test strategy should match the architecture. Most changes should be validated offline before we pay for any managed GenAI calls.
What "Offline Testing" Means Here
In this repo, offline testing means:
- no production user traffic
- local or mocked downstream services
- replaying saved sessions or labeled prompts
- evaluating classifier, retriever, guardrails, memory, and prompt assembly without requiring Bedrock
- using a small local open-source model only when we need generation behavior for smoke testing
Offline does not mean "no intelligence at all." It means the expensive model is not on the hot path for routine validation.
The Offline-First Test Ladder
| Level | Test Lane | What Runs | Bedrock Cost | When To Run |
|---|---|---|---|---|
| L0 | Static and unit tests | Regex routing, prompt builder, schema validators, price checks, guardrails | $0 | Every commit |
| L1 | Component replay | Intent classifier eval, retrieval eval, memory replay, formatter tests | $0 | Every PR |
| L2 | Service integration in a lab | LocalStack, Redis/OpenSearch test containers, mocked catalog/orders services | $0 | Every PR |
| L3 | Local-model smoke tests | Optional vLLM or Ollama for prompt and response-shape smoke tests | Fixed infra only | Prompt or multi-turn changes |
| L4 | Small paid promotion gate | Stratified sample on the real target model | Capped and approved | Pre-merge or pre-release only |
The key rule is that each higher level sees fewer cases and runs less often.
Why This Works For MangaAssist
MangaAssist already decomposes the request path into separable layers:
- intent classification
- routing
- service aggregation
- retrieval
- prompt assembly
- generation
- guardrails
- response formatting
That decomposition is valuable for cost optimization because only the generation step truly requires the expensive model. Everything else can be tested offline.
```
Replay Dataset
      |
      v
Offline Test Harness
      |
      +--> Intent classifier eval
      +--> Retriever eval
      +--> Prompt compiler checks
      +--> Local fixture services
      +--> Guardrail validators
      +--> Response schema checks
      +--> Cost estimator
      |
      +--> Optional local model adapter
      |
      +--> Small capped Bedrock gate only if all cheaper checks pass
```
What I Test Offline By Component
1. Intent Classifier
This should almost never require a paid LLM.
Offline method
- keep a labeled dataset of user messages and expected intents
- replay messages through the rule engine and classifier
- compute accuracy, per-class precision/recall/F1, and confusion matrix
- track a cost-focused metric: percent of traffic routed to deterministic paths vs full generation
Why it matters for cost
Bad routing is a hidden GenAI tax. If order_tracking or promotion traffic falls into general_query, the system pays for generation when it should have used API-first logic.
Pass criteria
- no regression in high-volume intents
- general_query or fallback rate does not grow unexpectedly
- deterministic-path coverage stays stable or improves
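A minimal sketch of that replay evaluation, assuming a JSONL file of labeled cases with text and expected_intent fields and a classify(text) callable exposed by the routing layer (both names are illustrative, not the repo's actual interface):

```python
import json
from collections import Counter, defaultdict

def evaluate_intents(replay_path, classify):
    """Replay labeled messages through the classifier and report per-intent metrics."""
    confusion = defaultdict(Counter)   # confusion[expected][predicted] -> count
    deterministic = {"order_tracking", "return_request", "faq", "chitchat"}  # illustrative set
    routed_cheap = total = correct = 0

    with open(replay_path) as f:
        for line in f:
            case = json.loads(line)
            predicted = classify(case["text"])
            confusion[case["expected_intent"]][predicted] += 1
            total += 1
            correct += predicted == case["expected_intent"]
            routed_cheap += predicted in deterministic

    for intent, row in confusion.items():
        tp = row[intent]
        fn = sum(row.values()) - tp
        fp = sum(confusion[other][intent] for other in confusion if other != intent)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        print(f"{intent:20s} precision={precision:.2f} recall={recall:.2f}")

    print(f"accuracy={correct / total:.2%}  deterministic-path share={routed_cheap / total:.2%}")
```

The deterministic-path share is the cost metric: it should stay flat or improve on every routing change.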
2. Retrieval and RAG
Retriever changes should be tested mostly without generation.
Offline method
- maintain query-to-gold-document pairs
- evaluate Recall@k, Precision@k, MRR, stale-chunk rate, and retrieval latency
- compare chunking, filters, reranking, and freshness policies directly
Why it matters for cost
If retrieval gets worse, prompts get longer, retries become more frequent, and guardrail failures increase. Better retrieval usually reduces both hallucination risk and token waste.
Pass criteria
- Recall@3 improves or stays flat
- irrelevant chunk rate does not increase
- prompt context size stays within token budget
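A sketch of the retrieval scoring loop under similar assumptions: gold_cases holds query-to-gold-document pairs, and retrieve(query, k) is a stand-in for the real retriever interface:

```python
def evaluate_retrieval(gold_cases, retrieve, k=3):
    """Compute Recall@k and MRR over query-to-gold-document pairs, no generation involved."""
    recall_hits, rr_sum = 0, 0.0
    for case in gold_cases:
        ranked = retrieve(case["query"], k=10)          # ranked list of document ids
        gold = set(case["gold_doc_ids"])
        recall_hits += any(doc_id in gold for doc_id in ranked[:k])
        # Reciprocal rank of the first relevant document, 0 if none retrieved.
        rank = next((i + 1 for i, doc_id in enumerate(ranked) if doc_id in gold), None)
        rr_sum += 1.0 / rank if rank else 0.0
    n = len(gold_cases)
    return {f"recall@{k}": recall_hits / n, "mrr": rr_sum / n}
```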
3. Prompt Builder
Prompt changes are where teams often overspend because they immediately jump to full model evaluation.
Offline method
- freeze fixture inputs: intent, retrieved chunks, product JSON, promotions, and conversation history
- compile prompts without calling the model
- assert section order, forbidden strings, token growth, and presence of required context
- diff prompt versions like source code
What this catches
- missing catalog facts
- duplicated conversation history
- prompt bloat
- broken system rules
- missing anti-hallucination instructions
What it does not catch
- subtle tone changes
- semantic helpfulness of the final answer
That is why prompt validation starts offline, then uses only a small paid sample later.
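As an illustration of those prompt-compiler checks, here is the kind of assertion helper that can run on every compiled prompt. The section names, forbidden strings, and token budget are placeholders, and the token count is a rough character-based estimate rather than the target model's real tokenizer:

```python
TOKEN_BUDGET = 6000          # illustrative budget, not the repo's real number
FORBIDDEN = ["lorem ipsum", "TODO", "As an AI language model"]
REQUIRED_SECTIONS = ["SYSTEM RULES", "PRODUCT CONTEXT", "RETRIEVED POLICY", "CONVERSATION"]

def check_compiled_prompt(prompt: str) -> list[str]:
    """Static checks on a compiled prompt; no model call needed."""
    problems = []

    # Required sections must all appear, and in the expected order.
    positions = [prompt.find(section) for section in REQUIRED_SECTIONS]
    if any(p < 0 for p in positions):
        problems.append("missing required section")
    elif positions != sorted(positions):
        problems.append("sections out of order")

    # Forbidden strings must never survive assembly.
    problems += [f"forbidden string: {s}" for s in FORBIDDEN if s in prompt]

    # Rough token estimate (characters / 4); swap in the real tokenizer if available.
    if len(prompt) / 4 > TOKEN_BUDGET:
        problems.append("prompt exceeds token budget")

    return problems
```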
4. Guardrails
Guardrails are largely deterministic and should be heavily tested offline.
Offline method
- maintain adversarial input fixtures and synthetic bad-output fixtures
- run PII scrubbing, competitor blocking, price validation, ASIN checks, and scope checks
- measure both false negatives and false positives
Why it matters for cost
A weak guardrail causes rework, escalations, and human reviews. An overly aggressive guardrail blocks safe responses and hurts resolution rate. Both are expensive, and neither problem requires Bedrock to detect.
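A minimal sketch of that offline guardrail scoring, assuming fixtures labeled with a should_block flag and a should_block_fn(text) wrapper around the guardrail pipeline (both hypothetical names):

```python
def evaluate_guardrails(fixtures, should_block_fn):
    """Measure false negatives (bad content allowed) and false positives (safe content blocked)."""
    false_neg = false_pos = 0
    for case in fixtures:
        blocked = should_block_fn(case["text"])
        if case["should_block"] and not blocked:
            false_neg += 1          # unsafe content slipped through
        if not case["should_block"] and blocked:
            false_pos += 1          # safe content was blocked
    n = len(fixtures)
    return {"false_negative_rate": false_neg / n, "false_positive_rate": false_pos / n}
```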
5. Memory and Multi-turn Context
Multi-turn testing is expensive if every conversation replay uses a paid model. It should start offline.
Offline method
- keep archived multi-turn transcripts with expected follow-up resolution
- test summary generation separately from answer generation
- verify entity preservation: ASINs, titles, unresolved issue, user preference, last recommended item
- run reference resolution tests on turns such as "tell me more about the first one"
Why it matters for cost
Broken memory often causes repeated LLM calls, longer prompts, and frustrated users who re-ask the same question. Fixing memory quality is a cost optimization, not only a UX improvement.
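A sketch of a multi-turn replay check, assuming archived cases carry turns, expected_entities, and expected_resolution fields, and that summarize and resolve_reference are thin wrappers around the memory components (all illustrative names):

```python
def replay_multi_turn(case, summarize, resolve_reference):
    """Replay an archived conversation and check memory behavior without any generation."""
    turns = case["turns"]

    # 1. The rolling summary must preserve the entities later turns rely on
    #    (ASINs, titles, unresolved issue, last recommended item).
    summary = summarize(turns[:-1])
    missing = [entity for entity in case["expected_entities"] if entity not in summary]

    # 2. References like "tell me more about the first one" must resolve correctly.
    resolved = resolve_reference(turns[-1]["text"], summary)

    return {
        "missing_entities": missing,
        "reference_resolved": resolved == case["expected_resolution"],
    }
```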
6. Response Formatting and Frontend Contract
The chatbot can fail even when the model answer is good.
Offline method
- feed formatter with canned model outputs
- validate JSON schema, action button payloads, product cards, and follow-up suggestions
- verify no invalid ASINs, prices, or URLs survive post-processing
Why it matters for cost
If schema failures are only caught after paid generation, the team is paying to discover a serialization bug.
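One way to enforce the frontend contract offline is a JSON Schema check over formatter output built from canned model answers. The schema below is a trimmed, illustrative version of such a contract, not the repo's actual one:

```python
from jsonschema import Draft7Validator

RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["message", "product_cards", "suggested_followups"],
    "properties": {
        "message": {"type": "string", "minLength": 1},
        "product_cards": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["asin", "title", "price"],
                "properties": {
                    "asin": {"type": "string", "pattern": "^B0[A-Z0-9]{8}$"},  # illustrative ASIN pattern
                    "price": {"type": "number", "minimum": 0},
                },
            },
        },
        "suggested_followups": {"type": "array", "items": {"type": "string"}},
    },
}

def validate_formatted_response(payload: dict) -> list[str]:
    """Run the frontend contract check on formatter output built from a canned model answer."""
    return [err.message for err in Draft7Validator(RESPONSE_SCHEMA).iter_errors(payload)]
```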
7. Cost Estimation
Cost itself should be tested offline.
Offline method
Replay a representative traffic sample and estimate:
- LLM-invocation rate
- average prompt tokens per routed intent
- average completion tokens
- cache hit rate
- fallback rate
- cost per session
Simple formula
```
estimated_monthly_llm_cost
  = sessions_per_month
  * llm_invocation_rate
  * (avg_prompt_tokens * input_token_price
     + avg_completion_tokens * output_token_price)
  * (1 - cache_hit_rate)
```
This lets the team compare candidate designs before release instead of learning cost from the invoice.
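The same formula, written as a small helper so candidate designs can be compared inside a replay script. All inputs come from the replay metrics above; prices are per token, so divide published per-1K prices by 1000 before passing them in:

```python
def estimate_monthly_llm_cost(
    sessions_per_month: float,
    llm_invocation_rate: float,      # fraction of sessions that reach generation
    avg_prompt_tokens: float,
    avg_completion_tokens: float,
    input_token_price: float,        # dollars per input token for the target model
    output_token_price: float,       # dollars per output token
    cache_hit_rate: float,
) -> float:
    """Direct implementation of the formula above."""
    cost_per_call = (avg_prompt_tokens * input_token_price
                     + avg_completion_tokens * output_token_price)
    return sessions_per_month * llm_invocation_rate * cost_per_call * (1 - cache_hit_rate)
```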
The Dataset Design I Would Use
Offline testing only works if the datasets are engineered well.
Dataset types
| Dataset | Purpose | Source |
|---|---|---|
| Golden intent set | Routing accuracy | Labeled production samples plus synthetic edge cases |
| Retrieval gold set | RAG quality | Query-doc relevance pairs |
| Multi-turn replay set | Memory and context resolution | Anonymized chat transcripts |
| Guardrail adversarial set | Safety and policy defense | Security cases, red-team prompts, malformed outputs |
| Cost replay set | Spend simulation | Traffic slices with token and route metadata |
Required fields per example
For most scenarios, each test case should carry:
- user message
- optional conversation history
- page context
- expected intent
- expected route type: template, API, retrieval, or generation
- expected documents or product entities
- required elements
- prohibited elements
- tags such as recommendation, policy, multi_turn, high_cost, guardrail
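Put together, a single replay case might look like the record below. Field names mirror the list above but are illustrative, and the ASIN is a placeholder:

```python
EXAMPLE_CASE = {
    "user_message": "when will my One Piece box set arrive?",
    "conversation_history": [],
    "page_context": {"page": "order_history"},
    "expected_intent": "order_tracking",
    "expected_route": "api",                       # template | api | retrieval | generation
    "expected_entities": ["B0XXXXXXXX"],           # placeholder ASIN, not a real product
    "required_elements": ["delivery estimate"],
    "prohibited_elements": ["competitor mention", "invented tracking number"],
    "tags": ["order_tracking", "high_volume"],
}
```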
Refresh policy
- weekly: add incidents and recent regressions
- monthly: rebalance by traffic distribution
- quarterly: retire stale products and policy text
The dataset must evolve with the product, or offline testing becomes a false signal.
Practical CI/CD Gating For Cheap Testing
Change-type matrix
| Change Type | Primary Gate | Paid Model Needed? | Notes |
|---|---|---|---|
| Regex or routing rules | Unit + intent replay | No | Deterministic change |
| Guardrail rules | Unit + adversarial replay | No | Deterministic change |
| Retrieval/chunking/reranker | Retrieval gold set | Usually no | Small paid sample only if answer style changed |
| Prompt template | Prompt checks + local smoke | Yes, but sampled | Use small stratified paid gate |
| Local business logic | Unit + integration | No | Only paid if downstream prompt behavior changed |
| Model version swap | Full offline suite + sample gate | Yes | This is one of the few changes that truly needs paid evaluation |
Promotion policy
- Fail fast at L0-L2.
- Run local-model smoke tests only on prompt-sensitive changes.
- Run the paid gate only after everything else passes.
- Cap the paid gate with a fixed prompt budget and a fixed dollar budget.
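A compact sketch of how that matrix and promotion policy could be encoded so CI can decide mechanically whether the paid gate is even eligible to run. Lane names and budget numbers are illustrative, not an existing config in this repo:

```python
# Which lanes each change type must pass, cheapest first.
GATES_BY_CHANGE = {
    "routing_rules":   ["unit", "intent_replay"],
    "guardrail_rules": ["unit", "adversarial_replay"],
    "retrieval":       ["retrieval_gold_set"],
    "prompt_template": ["prompt_checks", "local_smoke", "paid_gate"],
    "business_logic":  ["unit", "integration"],
    "model_swap":      ["full_offline_suite", "paid_gate"],
}

MAX_PAID_PROMPTS = 60          # fixed prompt budget for the paid gate
MAX_PAID_DOLLARS = 25.0        # fixed dollar budget, illustrative number

def allowed_to_run_paid_gate(change_type: str, offline_results: dict) -> bool:
    """The paid gate runs only if this change type requires it and every cheaper lane passed."""
    lanes = GATES_BY_CHANGE.get(change_type, ["full_offline_suite"])
    if "paid_gate" not in lanes:
        return False
    cheaper_lanes = [lane for lane in lanes if lane != "paid_gate"]
    return all(offline_results.get(lane) == "pass" for lane in cheaper_lanes)
```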
How I Keep The Paid Gate Small
The expensive step should be narrow and intentional.
Example gate policy
- 20 recommendation prompts
- 20 policy/FAQ prompts
- 10 multi-turn prompts
- 10 adversarial prompts
That is 60 carefully chosen prompts, not 500 random prompts on every PR.
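A sketch of how that stratified sample could be drawn from the tagged replay set, assuming cases carry the tags field described earlier; the quotas match the example gate policy above:

```python
import random

GATE_QUOTAS = {"recommendation": 20, "policy": 20, "multi_turn": 10, "adversarial": 10}

def build_paid_gate_sample(cases, quotas=GATE_QUOTAS, seed=42):
    """Draw a fixed, stratified sample of replay cases for the capped paid gate."""
    rng = random.Random(seed)          # fixed seed keeps the gate reproducible across runs
    sample = []
    for tag, quota in quotas.items():
        pool = [case for case in cases if tag in case.get("tags", [])]
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```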
Why stratified sampling matters
If the paid gate is tiny but representative, it still catches:
- recommendation tone regressions
- policy phrasing issues
- memory behavior with the actual target model
- jailbreak regressions that depend on target-model behavior
What it avoids is paying for low-signal duplication.
Local Models As A Middle Layer
I would use local open-source models for smoke testing, not for final truth.
Good uses
- verify prompt formatting does not break generation completely
- test response schema compliance
- test long-context prompt assembly
- smoke-test multi-turn flows in development
Bad uses
- deciding final production quality of Claude or another target managed model
- replacing production safety evaluation entirely
A local model is a cheap approximation layer, not a release authority.
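As a concrete example of the smoke-test layer, here is a minimal call against a local Ollama instance on its default port. The endpoint and payload follow Ollama's generate API, but the model name and the JSON-output expectation are assumptions about the prompt under test:

```python
import json
import requests

def local_smoke_test(prompt: str, model: str = "llama3") -> dict:
    """Cheap generation smoke test against a local Ollama instance.
    This checks prompt and response shape, not final production quality."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    text = resp.json()["response"]

    # Shape checks only: did the prompt break generation, and is the output parseable?
    problems = []
    if not text.strip():
        problems.append("empty completion")
    try:
        json.loads(text)               # only meaningful if the prompt asks for JSON output
    except ValueError:
        problems.append("completion is not valid JSON")
    return {"text": text, "problems": problems}
```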
Spend Governance Rules
Cost optimization fails when there is no policy around evaluation.
Rules I would enforce
- No paid-model evaluation on every commit.
- No paid-model evaluation for changes that only affect deterministic logic.
- Every paid gate must have a scenario budget and an owner.
- Every experiment must log estimated and actual token spend.
- Shadow or canary experiments must be time-boxed and auto-stopped.
- If offline gates fail, the paid gate is skipped automatically.
These rules stop the common failure mode where Bedrock becomes the test harness for avoidable bugs.
What Still Cannot Be Proven Offline
Offline testing is necessary, but it is not enough.
The following still require selective live validation:
- final target-model tone and helpfulness
- interaction between live traffic distribution and model behavior
- real latency under concurrent load
- production cache effectiveness
- business metrics such as conversion and escalation rate
That is why the correct flow is:
offline checks -> capped paid evaluation -> shadow mode -> canary -> continuous monitoring
Offline testing reduces waste. It does not eliminate the need for production safety stages.
My Recommended Operating Model For MangaAssist
If I were running this chatbot in production, I would set the process up like this:
- Most PRs run only deterministic and replay-based tests.
- Prompt changes run prompt-compiler checks plus local-model smoke tests.
- Only release candidates run a small stratified paid-model gate.
- Only model upgrades and major prompt revisions enter shadow mode.
- Monthly reviews compare estimated offline cost to actual Bedrock spend to tune routing and cache strategy.
That gives the team fast iteration loops, controlled cost, and far better signal than "just call the model and see what happens."
Short Answer
Offline testing for MangaAssist is done by separating the chatbot into layers and validating each layer with the cheapest possible mechanism:
- deterministic logic with unit tests
- routing and retrieval with replay datasets
- prompt assembly with prompt diffs and token checks
- memory and guardrails with archived transcripts and adversarial fixtures
- local open-source models for smoke tests
- a very small paid-model sample only at the end
That is how you keep GenAI testing cost under control without lowering engineering quality.