# Cost Optimization and Offline Testing - MangaAssist
This folder explains how I would test MangaAssist cheaply without sending every code change to Bedrock. The core idea is simple: most chatbot regressions are not "LLM quality" problems. They are routing, retrieval, memory, guardrail, schema, latency, or prompt-shape problems, and those can be tested offline at little or no GenAI cost.
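As a concrete illustration of that claim, a routing regression can be caught with an ordinary unit test and zero model calls. Below is a minimal sketch, assuming a hypothetical `route_intent(text)` function and illustrative route labels; neither is MangaAssist's actual API.

```python
# Hypothetical offline routing test: no Bedrock call, no network, no GenAI cost.
# route_intent() and the route labels below are illustrative placeholders.
import pytest

from manga_assist.router import route_intent  # hypothetical module path

# Each case: user utterance -> expected deterministic route.
ROUTING_CASES = [
    ("what's the price of One Piece vol. 100", "catalog_lookup"),  # never hits the LLM
    ("what's your return policy", "policy_retrieval"),             # answered from retrieval
    ("recommend me something like Vinland Saga", "llm_generate"),  # the one route that pays
]

@pytest.mark.parametrize("utterance,expected_route", ROUTING_CASES)
def test_router_sends_traffic_to_the_cheap_path(utterance, expected_route):
    assert route_intent(utterance) == expected_route
```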
## Files
### Foundation (general offline testing for the chatbot)
| File | What It Covers |
|---|---|
| 01-offline-testing-strategy.md | The offline-first test architecture, dataset design, CI/CD gates, and spend controls for MangaAssist |
| 02-offline-testing-scenarios-with-answers.md | Deep-dive scenarios showing exactly how to validate prompt, retrieval, memory, guardrails, and routing changes while keeping GenAI cost low |
### Cost-Optimization Deep-Dive (per-scenario for the 8 user stories)
The files below apply offline testing specifically to the 8 cost-optimization user stories in ../Cost-Optimization-User-Stories/. Each story (US-01 through US-08) is dissected to show what its offline test should look like and what an Amazon ML/AI Engineer or MLOps Engineer would be asked about it in an interview loop.
| File | What It Covers |
|---|---|
| 03-foundations-and-primitives-for-cost-optimization-testing.md | Why cost-opt offline testing is structurally different from quality testing; the 4 primitives (counterfactual replay, decision-equivalence, cost-aware golden, stress/saturation; decision-equivalence is sketched just after this table); paired-metric pattern; unified test-pipeline shape |
| 04-scenario-deep-dives-per-cost-story.md | One section per US story (US-01 through US-08): cost lever, quality contract, offline test design, mermaid pipeline diagram, real-incident sketches, threshold reconciliation |
| 05-ml-ai-engineer-grill-chains.md | Full interview grill chains (Opening + 4 follow-ups + 3 architect-level + intuition) per scenario, framed for the Amazon ML/AI Engineer loop — model behavior, calibration, distribution shift, statistical rigor |
| 06-mlops-engineer-grill-chains.md | Same scenarios from the MLOps Engineer lens — telemetry, deployment, observability, kill switches, CI gates, runbooks, on-call burden |
| 07-cross-cutting-system-grill.md | 6 system-level questions spanning multiple scenarios: compounding savings/risk, offline-online correlation, cost vs. quality SLO breaches, CI gate design, auditability, the ratchet problem; closing scoring rubric for both roles |
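To make the decision-equivalence primitive from 03 concrete: replay logged requests through the baseline and candidate configurations and require that the cheap-path/expensive-path decision matches on (nearly) every case, entirely offline. This is a minimal sketch; the `Decision` shape, the decider callables, and the 1% flip tolerance are assumptions for illustration, not values taken from the files above.

```python
# Hypothetical decision-equivalence replay: compare the routing decisions of
# the baseline and candidate configs over logged traffic, with no model calls.
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    route: str       # e.g. "catalog_lookup" vs "llm_generate"
    model_tier: str  # e.g. "none", "small", "large"

def decision_equivalence(requests, decide_baseline, decide_candidate,
                         max_flip_rate: float = 0.01):
    """Return flipped cases; fail if more than max_flip_rate of decisions change."""
    flips = []
    for req in requests:
        old, new = decide_baseline(req), decide_candidate(req)
        if old != new:
            flips.append((req, old, new))
    flip_rate = len(flips) / max(len(requests), 1)
    assert flip_rate <= max_flip_rate, (
        f"{flip_rate:.2%} of replayed requests changed decision "
        f"(allowed {max_flip_rate:.2%}); inspect flips before any paid eval."
    )
    return flips

# Usage with logged traffic and two pure, config-driven deciders:
# flips = decision_equivalence(logged_requests, old_decider, new_decider)
```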
## Reading Paths
| You are... | Read in this order |
|---|---|
| New to MangaAssist offline testing | 01 → 02 → 03 → 04 |
| Designing an offline test for one specific cost story | 03 (primitives) → 04 (find your scenario) |
| Preparing for an Amazon ML/AI Engineer loop | 03 → 04 → 05 → 07 |
| Preparing for an Amazon MLOps Engineer loop | 03 → 04 → 06 → 07 |
| Preparing for a system-design / staff loop | 03 → 04 → 07 |
## Core Principle
For MangaAssist, the testing ladder should always move from cheapest to most expensive:
1. Deterministic tests with no LLM calls
2. Replay tests on labeled datasets with mocked services
3. Local open-source model smoke tests when prompt behavior must be exercised
4. Small, capped paid-model evaluation only for the final promotion gate
If a change fails in step 1 or step 2, it should never reach Bedrock evaluation.
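One way to encode that ladder in CI is a gate script that runs each rung in order and stops at the first failure, so paid evaluation is only ever reached by changes that survived the free rungs. The test paths, eval script, and budget flag below are placeholders, a sketch rather than this repo's actual pipeline.

```python
# Hypothetical CI gate: run the ladder cheapest-first and short-circuit,
# so a failure in a free rung never spends Bedrock budget.
import subprocess
import sys

LADDER = [
    ("deterministic", ["pytest", "tests/unit", "-q"]),               # rung 1: no LLM calls
    ("replay",        ["pytest", "tests/replay", "-q"]),             # rung 2: mocked services
    ("local-smoke",   ["pytest", "tests/smoke_local_model", "-q"]),  # rung 3: local OSS model
    ("paid-eval",     ["python", "eval/run_capped_bedrock_eval.py",  # rung 4: capped spend
                       "--budget-usd", "5"]),
]

def main() -> int:
    for name, cmd in LADDER:
        print(f"== ladder rung: {name} ==")
        if subprocess.run(cmd).returncode != 0:
            print(f"rung '{name}' failed; not advancing to more expensive rungs.")
            return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```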
## Why This Matters
MangaAssist is a hybrid chatbot:
- many intents should never hit the LLM
- product facts must come from catalog data
- policy answers must come from retrieval
- prices and ASINs must be validated after generation
That architecture is good for production cost and also good for testing cost. It lets us validate most of the system offline, then reserve paid GenAI usage for the narrow set of questions that only the target model can answer.
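The last bullet above, validating prices and ASINs after generation, is itself a deterministic check that can be unit-tested against a fixture catalog with no model in the loop. A minimal sketch, assuming a simplified ASIN/price regex and a dict-shaped catalog fixture, both hypothetical:

```python
# Hypothetical post-generation validator: every ASIN and price the model
# mentions must match catalog data, so the check is deterministic and free.
import re

# Simplified pattern: an ASIN-like token followed by a dollar price.
ASIN_PRICE = re.compile(r"(?P<asin>B0[A-Z0-9]{8})[^$]*\$(?P<price>\d+\.\d{2})")

def validate_claims(reply: str, catalog: dict) -> list[str]:
    """Return a list of violations; an empty list means the reply is consistent."""
    violations = []
    for m in ASIN_PRICE.finditer(reply):
        asin, price = m["asin"], float(m["price"])
        if asin not in catalog:
            violations.append(f"unknown ASIN {asin}")
        elif abs(catalog[asin]["price"] - price) > 0.005:
            violations.append(
                f"{asin}: said ${price:.2f}, catalog ${catalog[asin]['price']:.2f}"
            )
    return violations

# Unit-test style usage against a fixture catalog (no LLM, no network):
catalog = {"B0ABCD1234": {"title": "One Piece Vol. 1", "price": 9.99}}
assert validate_claims("One Piece Vol. 1 (B0ABCD1234) is $9.99.", catalog) == []
assert validate_claims("Grab B0ABCD1234 for $4.99!", catalog) != []
```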