01. Offline Testing Strategy - How I Keep MangaAssist Testing Cheap
The Problem
If every chatbot test invokes Bedrock, cost grows in the worst possible way:
- every prompt tweak becomes a paid experiment
- every PR runs hundreds of model calls
- failures are discovered late, after money is already spent
- engineers stop testing aggressively because the loop is expensive
For MangaAssist, that would be the wrong operating model. The system is intentionally hybrid:
- chitchat, acknowledgements, and many confirmations are template-only
- order_tracking, return_request, and price lookups are API-first
- faq and policy flows are retrieval-heavy
- only ambiguous recommendation and explanation flows need full generation
So the test strategy should match the architecture. Most changes should be validated offline before we pay for any managed GenAI calls.
What "Offline Testing" Means Here
In this repo, offline testing means:
- no production user traffic
- local or mocked downstream services
- replaying saved sessions or labeled prompts
- evaluating classifier, retriever, guardrails, memory, and prompt assembly without requiring Bedrock
- using a small local open-source model only when we need generation behavior for smoke testing
Offline does not mean "no intelligence at all." It means the expensive model is not on the hot path for routine validation.
The Offline-First Test Ladder
| Level | Test Lane | What Runs | Bedrock Cost | When To Run |
|---|---|---|---|---|
| L0 | Static and unit tests | Regex routing, prompt builder, schema validators, price checks, guardrails | $0 | Every commit |
| L1 | Component replay | Intent classifier eval, retrieval eval, memory replay, formatter tests | $0 | Every PR |
| L2 | Service integration in a lab | LocalStack, Redis/OpenSearch test containers, mocked catalog/orders services | $0 | Every PR |
| L3 | Local-model smoke tests | Optional vLLM or Ollama for prompt and response-shape smoke tests | Fixed infra only | Prompt or multi-turn changes |
| L4 | Small paid promotion gate | Stratified sample on the real target model | Capped and approved | Pre-merge or pre-release only |
The key rule is that each higher level sees fewer cases and runs less often.
Why This Works For MangaAssist
MangaAssist already decomposes the request path into separable layers:
- intent classification
- routing
- service aggregation
- retrieval
- prompt assembly
- generation
- guardrails
- response formatting
That decomposition is valuable for cost optimization because only the generation step truly requires the expensive model. Everything else can be tested offline.
```
Replay Dataset
      |
      v
Offline Test Harness
      |
      +--> Intent classifier eval
      +--> Retriever eval
      +--> Prompt compiler checks
      +--> Local fixture services
      +--> Guardrail validators
      +--> Response schema checks
      +--> Cost estimator
      |
      +--> Optional local model adapter
      |
      +--> Small capped Bedrock gate only if all cheaper checks pass
```
What I Test Offline By Component
1. Intent Classifier
This should almost never require a paid LLM.
Offline method
- keep a labeled dataset of user messages and expected intents
- replay messages through the rule engine and classifier
- compute accuracy, per-class precision/recall/F1, and confusion matrix
- track a cost-focused metric: percent of traffic routed to deterministic paths vs full generation
Why it matters for cost
Bad routing is a hidden GenAI tax. If order_tracking or promotion traffic falls into general_query, the system pays for generation when it should have used API-first logic.
Pass criteria
- no regression in high-volume intents
- general_query or fallback rate does not grow unexpectedly
- deterministic-path coverage stays stable or improves
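A minimal sketch of that replay evaluation, assuming a JSONL file of labeled cases with text and expected_intent fields and a classify(text) callable exposed by the routing layer (both names are illustrative, not the repo's actual interface):

```python
import json
from collections import Counter, defaultdict

def evaluate_intents(replay_path, classify):
    """Replay labeled messages through the classifier and report per-intent metrics."""
    confusion = defaultdict(Counter)   # confusion[expected][predicted] -> count
    deterministic = {"order_tracking", "return_request", "faq", "chitchat"}  # illustrative set
    routed_cheap = total = correct = 0

    with open(replay_path) as f:
        for line in f:
            case = json.loads(line)
            predicted = classify(case["text"])
            confusion[case["expected_intent"]][predicted] += 1
            total += 1
            correct += predicted == case["expected_intent"]
            routed_cheap += predicted in deterministic

    for intent, row in confusion.items():
        tp = row[intent]
        fn = sum(row.values()) - tp
        fp = sum(confusion[other][intent] for other in confusion if other != intent)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        print(f"{intent:20s} precision={precision:.2f} recall={recall:.2f}")

    print(f"accuracy={correct / total:.2%}  deterministic-path share={routed_cheap / total:.2%}")
```

The deterministic-path share is the cost metric: it should stay flat or improve on every routing change.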
2. Retrieval and RAG
Retriever changes should be tested mostly without generation.
Offline method
- maintain query-to-gold-document pairs
- evaluate Recall@k, Precision@k, MRR, stale-chunk rate, and retrieval latency
- compare chunking, filters, reranking, and freshness policies directly
Why it matters for cost
If retrieval gets worse, prompts get longer, retries become more frequent, and guardrail failures increase. Better retrieval usually reduces both hallucination risk and token waste.
Pass criteria
- Recall@3 improves or stays flat
- irrelevant chunk rate does not increase
- prompt context size stays within token budget
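A sketch of the retrieval scoring loop under similar assumptions: gold_cases holds query-to-gold-document pairs, and retrieve(query, k) is a stand-in for the real retriever interface:

```python
def evaluate_retrieval(gold_cases, retrieve, k=3):
    """Compute Recall@k and MRR over query-to-gold-document pairs, no generation involved."""
    recall_hits, rr_sum = 0, 0.0
    for case in gold_cases:
        ranked = retrieve(case["query"], k=10)          # ranked list of document ids
        gold = set(case["gold_doc_ids"])
        recall_hits += any(doc_id in gold for doc_id in ranked[:k])
        # Reciprocal rank of the first relevant document, 0 if none retrieved.
        rank = next((i + 1 for i, doc_id in enumerate(ranked) if doc_id in gold), None)
        rr_sum += 1.0 / rank if rank else 0.0
    n = len(gold_cases)
    return {f"recall@{k}": recall_hits / n, "mrr": rr_sum / n}
```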
3. Prompt Builder
Prompt changes are where teams often overspend because they immediately jump to full model evaluation.
Offline method
- freeze fixture inputs: intent, retrieved chunks, product JSON, promotions, and conversation history
- compile prompts without calling the model
- assert section order, forbidden strings, token growth, and presence of required context
- diff prompt versions like source code
What this catches
- missing catalog facts
- duplicated conversation history
- prompt bloat
- broken system rules
- missing anti-hallucination instructions
What it does not catch
- subtle tone changes
- semantic helpfulness of the final answer
That is why prompt validation starts offline, then uses only a small paid sample later.
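As an illustration of those prompt-compiler checks, here is the kind of assertion helper that can run on every compiled prompt. The section names, forbidden strings, and token budget are placeholders, and the token count is a rough character-based estimate rather than the target model's real tokenizer:

```python
TOKEN_BUDGET = 6000          # illustrative budget, not the repo's real number
FORBIDDEN = ["lorem ipsum", "TODO", "As an AI language model"]
REQUIRED_SECTIONS = ["SYSTEM RULES", "PRODUCT CONTEXT", "RETRIEVED POLICY", "CONVERSATION"]

def check_compiled_prompt(prompt: str) -> list[str]:
    """Static checks on a compiled prompt; no model call needed."""
    problems = []

    # Required sections must all appear, and in the expected order.
    positions = [prompt.find(section) for section in REQUIRED_SECTIONS]
    if any(p < 0 for p in positions):
        problems.append("missing required section")
    elif positions != sorted(positions):
        problems.append("sections out of order")

    # Forbidden strings must never survive assembly.
    problems += [f"forbidden string: {s}" for s in FORBIDDEN if s in prompt]

    # Rough token estimate (characters / 4); swap in the real tokenizer if available.
    if len(prompt) / 4 > TOKEN_BUDGET:
        problems.append("prompt exceeds token budget")

    return problems
```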
4. Guardrails
Guardrails are largely deterministic and should be heavily tested offline.
Offline method
- maintain adversarial input fixtures and synthetic bad-output fixtures
- run PII scrubbing, competitor blocking, price validation, ASIN checks, and scope checks
- measure both false negatives and false positives
Why it matters for cost
A weak guardrail causes rework, escalations, and human reviews. An overly aggressive guardrail blocks safe responses and hurts resolution rate. Both are expensive, and neither problem requires Bedrock to detect.
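A minimal sketch of that offline guardrail scoring, assuming fixtures labeled with a should_block flag and a should_block_fn(text) wrapper around the guardrail pipeline (both hypothetical names):

```python
def evaluate_guardrails(fixtures, should_block_fn):
    """Measure false negatives (bad content allowed) and false positives (safe content blocked)."""
    false_neg = false_pos = 0
    for case in fixtures:
        blocked = should_block_fn(case["text"])
        if case["should_block"] and not blocked:
            false_neg += 1          # unsafe content slipped through
        if not case["should_block"] and blocked:
            false_pos += 1          # safe content was blocked
    n = len(fixtures)
    return {"false_negative_rate": false_neg / n, "false_positive_rate": false_pos / n}
```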
5. Memory and Multi-turn Context
Multi-turn testing is expensive if every conversation replay uses a paid model. It should start offline.
Offline method
- keep archived multi-turn transcripts with expected follow-up resolution
- test summary generation separately from answer generation
- verify entity preservation: ASINs, titles, unresolved issue, user preference, last recommended item
- run reference resolution tests on turns such as "tell me more about the first one"
Why it matters for cost
Broken memory often causes repeated LLM calls, longer prompts, and frustrated users who re-ask the same question. Fixing memory quality is a cost optimization, not only a UX improvement.
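A sketch of a multi-turn replay check, assuming archived cases carry turns, expected_entities, and expected_resolution fields, and that summarize and resolve_reference are thin wrappers around the memory components (all illustrative names):

```python
def replay_multi_turn(case, summarize, resolve_reference):
    """Replay an archived conversation and check memory behavior without any generation."""
    turns = case["turns"]

    # 1. The rolling summary must preserve the entities later turns rely on
    #    (ASINs, titles, unresolved issue, last recommended item).
    summary = summarize(turns[:-1])
    missing = [entity for entity in case["expected_entities"] if entity not in summary]

    # 2. References like "tell me more about the first one" must resolve correctly.
    resolved = resolve_reference(turns[-1]["text"], summary)

    return {
        "missing_entities": missing,
        "reference_resolved": resolved == case["expected_resolution"],
    }
```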
6. Response Formatting and Frontend Contract
The chatbot can fail even when the model answer is good.
Offline method
- feed formatter with canned model outputs
- validate JSON schema, action button payloads, product cards, and follow-up suggestions
- verify no invalid ASINs, prices, or URLs survive post-processing
Why it matters for cost
If schema failures are only caught after paid generation, the team is paying to discover a serialization bug.
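One way to enforce the frontend contract offline is a JSON Schema check over formatter output built from canned model answers. The schema below is a trimmed, illustrative version of such a contract, not the repo's actual one:

```python
from jsonschema import Draft7Validator

RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["message", "product_cards", "suggested_followups"],
    "properties": {
        "message": {"type": "string", "minLength": 1},
        "product_cards": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["asin", "title", "price"],
                "properties": {
                    "asin": {"type": "string", "pattern": "^B0[A-Z0-9]{8}$"},  # illustrative ASIN pattern
                    "price": {"type": "number", "minimum": 0},
                },
            },
        },
        "suggested_followups": {"type": "array", "items": {"type": "string"}},
    },
}

def validate_formatted_response(payload: dict) -> list[str]:
    """Run the frontend contract check on formatter output built from a canned model answer."""
    return [err.message for err in Draft7Validator(RESPONSE_SCHEMA).iter_errors(payload)]
```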
7. Cost Estimation
Cost itself should be tested offline.
Offline method
Replay a representative traffic sample and estimate:
- LLM-invocation rate
- average prompt tokens per routed intent
- average completion tokens
- cache hit rate
- fallback rate
- cost per session
Simple formula
```
estimated_monthly_llm_cost
  = sessions_per_month
  * llm_invocation_rate
  * (avg_prompt_tokens * input_token_price
     + avg_completion_tokens * output_token_price)
  * (1 - cache_hit_rate)
```
This lets the team compare candidate designs before release instead of learning cost from the invoice.
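The same formula, written as a small helper so candidate designs can be compared inside a replay script. All inputs come from the replay metrics above; prices are per token, so divide published per-1K prices by 1000 before passing them in:

```python
def estimate_monthly_llm_cost(
    sessions_per_month: float,
    llm_invocation_rate: float,      # fraction of sessions that reach generation
    avg_prompt_tokens: float,
    avg_completion_tokens: float,
    input_token_price: float,        # dollars per input token for the target model
    output_token_price: float,       # dollars per output token
    cache_hit_rate: float,
) -> float:
    """Direct implementation of the formula above."""
    cost_per_call = (avg_prompt_tokens * input_token_price
                     + avg_completion_tokens * output_token_price)
    return sessions_per_month * llm_invocation_rate * cost_per_call * (1 - cache_hit_rate)
```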
The Dataset Design I Would Use
Offline testing only works if the datasets are engineered well.
Dataset types
| Dataset | Purpose | Source |
|---|---|---|
| Golden intent set | Routing accuracy | Labeled production samples plus synthetic edge cases |
| Retrieval gold set | RAG quality | Query-doc relevance pairs |
| Multi-turn replay set | Memory and context resolution | Anonymized chat transcripts |
| Guardrail adversarial set | Safety and policy defense | Security cases, red-team prompts, malformed outputs |
| Cost replay set | Spend simulation | Traffic slices with token and route metadata |
Required fields per example
For most scenarios, each test case should carry:
- user message
- optional conversation history
- page context
- expected intent
- expected route type: template, API, retrieval, or generation
- expected documents or product entities
- required elements
- prohibited elements
- tags such as recommendation, policy, multi_turn, high_cost, guardrail
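Put together, a single replay case might look like the record below. Field names mirror the list above but are illustrative, and the ASIN is a placeholder:

```python
EXAMPLE_CASE = {
    "user_message": "when will my One Piece box set arrive?",
    "conversation_history": [],
    "page_context": {"page": "order_history"},
    "expected_intent": "order_tracking",
    "expected_route": "api",                       # template | api | retrieval | generation
    "expected_entities": ["B0XXXXXXXX"],           # placeholder ASIN, not a real product
    "required_elements": ["delivery estimate"],
    "prohibited_elements": ["competitor mention", "invented tracking number"],
    "tags": ["order_tracking", "high_volume"],
}
```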
Refresh policy
- weekly: add incidents and recent regressions
- monthly: rebalance by traffic distribution
- quarterly: retire stale products and policy text
The dataset must evolve with the product, or offline testing becomes a false signal.
Practical CI/CD Gating For Cheap Testing
Change-type matrix
| Change Type | Primary Gate | Paid Model Needed? | Notes |
|---|---|---|---|
| Regex or routing rules | Unit + intent replay | No | Deterministic change |
| Guardrail rules | Unit + adversarial replay | No | Deterministic change |
| Retrieval/chunking/reranker | Retrieval gold set | Usually no | Small paid sample only if answer style changed |
| Prompt template | Prompt checks + local smoke | Yes, but sampled | Use small stratified paid gate |
| Local business logic | Unit + integration | No | Only paid if downstream prompt behavior changed |
| Model version swap | Full offline suite + sample gate | Yes | This is one of the few changes that truly needs paid evaluation |
Promotion policy
- Fail fast at L0-L2.
- Run local-model smoke tests only on prompt-sensitive changes.
- Run the paid gate only after everything else passes.
- Cap the paid gate with a fixed prompt budget and a fixed dollar budget.
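A compact sketch of how that matrix and promotion policy could be encoded so CI can decide mechanically whether the paid gate is even eligible to run. Lane names and budget numbers are illustrative, not an existing config in this repo:

```python
# Which lanes each change type must pass, cheapest first.
GATES_BY_CHANGE = {
    "routing_rules":   ["unit", "intent_replay"],
    "guardrail_rules": ["unit", "adversarial_replay"],
    "retrieval":       ["retrieval_gold_set"],
    "prompt_template": ["prompt_checks", "local_smoke", "paid_gate"],
    "business_logic":  ["unit", "integration"],
    "model_swap":      ["full_offline_suite", "paid_gate"],
}

MAX_PAID_PROMPTS = 60          # fixed prompt budget for the paid gate
MAX_PAID_DOLLARS = 25.0        # fixed dollar budget, illustrative number

def allowed_to_run_paid_gate(change_type: str, offline_results: dict) -> bool:
    """The paid gate runs only if this change type requires it and every cheaper lane passed."""
    lanes = GATES_BY_CHANGE.get(change_type, ["full_offline_suite"])
    if "paid_gate" not in lanes:
        return False
    cheaper_lanes = [lane for lane in lanes if lane != "paid_gate"]
    return all(offline_results.get(lane) == "pass" for lane in cheaper_lanes)
```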
How I Keep The Paid Gate Small
The expensive step should be narrow and intentional.
Example gate policy
- 20 recommendation prompts
- 20 policy/FAQ prompts
- 10 multi-turn prompts
- 10 adversarial prompts
That is 60 carefully chosen prompts, not 500 random prompts on every PR.
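A sketch of how that stratified sample could be drawn from the tagged replay set, assuming cases carry the tags field described earlier; the quotas match the example gate policy above:

```python
import random

GATE_QUOTAS = {"recommendation": 20, "policy": 20, "multi_turn": 10, "adversarial": 10}

def build_paid_gate_sample(cases, quotas=GATE_QUOTAS, seed=42):
    """Draw a fixed, stratified sample of replay cases for the capped paid gate."""
    rng = random.Random(seed)          # fixed seed keeps the gate reproducible across runs
    sample = []
    for tag, quota in quotas.items():
        pool = [case for case in cases if tag in case.get("tags", [])]
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```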
Why stratified sampling matters
If the paid gate is tiny but representative, it still catches:
- recommendation tone regressions
- policy phrasing issues
- memory behavior with the actual target model
- jailbreak regressions that depend on target-model behavior
What it avoids is paying for low-signal duplication.
Local Models As A Middle Layer
I would use local open-source models for smoke testing, not for final truth.
Good uses
- verify prompt formatting does not break generation completely
- test response schema compliance
- test long-context prompt assembly
- smoke-test multi-turn flows in development
Bad uses
- deciding final production quality of Claude or another target managed model
- replacing production safety evaluation entirely
A local model is a cheap approximation layer, not a release authority.
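As a concrete example of the smoke-test layer, here is a minimal call against a local Ollama instance on its default port. The endpoint and payload follow Ollama's generate API, but the model name and the JSON-output expectation are assumptions about the prompt under test:

```python
import json
import requests

def local_smoke_test(prompt: str, model: str = "llama3") -> dict:
    """Cheap generation smoke test against a local Ollama instance.
    This checks prompt and response shape, not final production quality."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    text = resp.json()["response"]

    # Shape checks only: did the prompt break generation, and is the output parseable?
    problems = []
    if not text.strip():
        problems.append("empty completion")
    try:
        json.loads(text)               # only meaningful if the prompt asks for JSON output
    except ValueError:
        problems.append("completion is not valid JSON")
    return {"text": text, "problems": problems}
```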
Spend Governance Rules
Cost optimization fails when there is no policy around evaluation.
Rules I would enforce
- No paid-model evaluation on every commit.
- No paid-model evaluation for changes that only affect deterministic logic.
- Every paid gate must have a scenario budget and an owner.
- Every experiment must log estimated and actual token spend.
- Shadow or canary experiments must be time-boxed and auto-stopped.
- If offline gates fail, the paid gate is skipped automatically.
These rules stop the common failure mode where Bedrock becomes the test harness for avoidable bugs.
What Still Cannot Be Proven Offline
Offline testing is necessary, but it is not enough.
The following still require selective live validation:
- final target-model tone and helpfulness
- interaction between live traffic distribution and model behavior
- real latency under concurrent load
- production cache effectiveness
- business metrics such as conversion and escalation rate
That is why the correct flow is:
offline checks -> capped paid evaluation -> shadow mode -> canary -> continuous monitoring
Offline testing reduces waste. It does not eliminate the need for production safety stages.
My Recommended Operating Model For MangaAssist
If I were running this chatbot in production, I would set the process up like this:
- Most PRs run only deterministic and replay-based tests.
- Prompt changes run prompt-compiler checks plus local-model smoke tests.
- Only release candidates run a small stratified paid-model gate.
- Only model upgrades and major prompt revisions enter shadow mode.
- Monthly reviews compare estimated offline cost to actual Bedrock spend to tune routing and cache strategy.
That gives the team fast iteration loops, controlled cost, and far better signal than "just call the model and see what happens."
Short Answer
Offline testing for MangaAssist is done by separating the chatbot into layers and validating each layer with the cheapest possible mechanism:
- deterministic logic with unit tests
- routing and retrieval with replay datasets
- prompt assembly with prompt diffs and token checks
- memory and guardrails with archived transcripts and adversarial fixtures
- local open-source models for smoke tests
- a very small paid-model sample only at the end
That is how you keep GenAI testing cost under control without lowering engineering quality.