Scenario Template — Research-Grade 8-File Pattern for MangaAssist Fine-Tuning Topics
Purpose. This document is the canonical template every Tier-1 fine-tuning topic folder follows. It codifies what to write, how to write it, and what evidence to bring so that any topic — from intent classification to RLHF to RAFT — can be unfolded into a coherent, research-grade scenario suite that a hiring panel of senior research scientists would accept as rigorous.
How to use. When opening a new topic folder, copy the slot structure below, replace
{technique}and{topic-specific}slugs, and fill each section using the recipes in §3. Keep the persona-debate voice; add Research Notes callouts; carry the shared MangaAssist baseline verbatim. When auditing an existing folder, walk the checklist in §4.
1. Shared MangaAssist Baseline (verbatim across all docs)
Every doc in every Tier-1 folder carries this exact baseline so that cross-doc claims are comparable.
| Item | Value |
|---|---|
| Product | MangaAssist — Amazon retail chatbot for manga shopping and support |
| Main router | DistilBERT-base, fine-tuned, 10-class softmax head |
| Intent count | 10 known intents (product_discovery 22%, product_question 15%, recommendation 18%, faq 8%, order_tracking 12%, return_request 7%, promotion 5%, checkout_help 4%, escalation 3%, chitchat 6%) |
| Dataset | 50K production examples + 5K synthetic = 55K total; 80/10/10 split = 44K train / 5.5K val / 5.5K test |
| Headline accuracy | 92.1% top-1 (post fine-tuning), pre-fine-tuning baseline 83.2% |
| Rare-class accuracy | 88.6% on escalation (3% of traffic) |
| Latency budget | <15 ms P95 at the routing layer |
| Multi-intent traffic | 18% of production messages have ≥2 valid intents |
| OOD traffic | ~5% of messages fall outside the 10-intent taxonomy |
| Languages | English primary, Japanese-English code-switching ~9% |
| High-risk flows | escalation, returns, checkout, order_tracking, age-sensitive recommendations |
| Hardware (training) | g5.12xlarge (4× A10G), SageMaker pipeline |
| Hardware (inference) | inf2.xlarge (Inferentia 2), single AZ deployment |
| Promotion gate | offline accuracy ≥ 91.5% AND ECE ≤ 0.04 AND P95 latency ≤ 15ms AND business-weighted-error ≤ baseline × 0.85 |
| Rollback | shadow → canary 5% → 25% → 50% → 100% with auto-rollback on any gate breach |
Rule: Any new doc that diverges from this baseline must declare the divergence in its first heading (e.g., "Baseline override: this doc uses a 3-task multi-task setup so dataset is 55K + 12K sentiment + 8K NER…").
2. The 8-File Pattern (10 slots including README and main)
| # | Filename pattern | Required? | Owner persona | Research-grade content |
|---|---|---|---|---|
| 0 | README.md (folder index) |
Required | Marcus (architect) | Reading order, prerequisites, glossary, citation index |
| 1 | NN-{technique}-fine-tuning.md |
Required | Priya + Aiko | Theory, math, architecture diagrams, code; 8-12 cites; 2-3 ablation tables; comparative methods table; CIs on every reported metric |
| 2 | NN-{technique}_scenarios_mangaassist.md |
Required | Sam (PM) + Marcus | MangaAssist-specific scenarios; persona debate; cite key methods; resolution rules |
| 3 | NN-{technique}_numerical_worked_examples_mangaassist.md |
Required | Aiko (DS) | 10K-request worked example; bootstrap CIs; variance discussion |
| 4 | NN-{technique}_{metric_focus}_mangaassist.md |
Required | Aiko + Jordan | Calibration / metric deep-dive; ECE/Brier/NLL/AUC with 95% CIs; metric-method comparison |
| 5 | NN-{technique}_business_weighted_error_mangaassist.md |
Required | Sam + Marcus | Cost matrix; cost-sensitivity sweep; CSAT/$ ROI; cite Elkan 2001 |
| 6 | NN-{technique}_fine_tuning_dry_run_mangaassist.md |
Required | Jordan (MLOps) | Stage-by-stage execution playbook; reproducibility manifest; error-injection tests |
| 7 | NN-{technique}_{specialized_subproblem_A}_mangaassist.md |
Required | rotating | Topic-specific edge case A (see §3.7 mapping); failure-mode tree |
| 8 | NN-{technique}_{specialized_subproblem_B}_mangaassist.md |
Required | rotating | Topic-specific edge case B (see §3.7 mapping); failure-mode tree |
| 9 | NN-{technique}_{discovery_or_evolution}_mangaassist.md |
Required | Sam + Aiko | Future-proofing, taxonomy/data growth, open problems, research directions |
Numbering convention.
NNis the topic number assigned in00-mangaassist_fine_tuning_topic_scenario_map.md(e.g., 02 = embedding, 04 = LoRA/QLoRA, 05 = KD, 10 = RLHF/DPO, 14 = RAFT). Slugs afterNN-use snake_case;_mangaassistsuffix is mandatory for slots 2-9.
3. Section-by-Section Recipes
Each slot below shows the required headings, the research-grade additions, and persona-voice prompts so a writer can fill it without re-deciding structure each time.
3.0 Folder README (slot 0)
# {Topic} — Folder Index
## What this folder covers
{1-2 sentence framing of the technique in MangaAssist context.}
## Reading order
1. [01-{technique}-fine-tuning.md] — start here for theory + math
2. [01-{technique}_scenarios_mangaassist.md] — MangaAssist scenarios
3. [01-{technique}_fine_tuning_dry_run_mangaassist.md] — execution playbook
4. [01-{technique}_numerical_worked_examples_mangaassist.md] — concrete arithmetic
5. [01-{technique}_{metric_focus}_mangaassist.md] — metrics deep-dive
6. [01-{technique}_business_weighted_error_mangaassist.md] — cost analysis
7-8. specialized sub-problems
9. discovery / evolution
## Prerequisites
- shared baseline (see SCENARIO_TEMPLATE.md §1)
- {topic-specific prereqs, e.g., "PyTorch + transformers", "InfoNCE intuition"}
## Personas
| Persona | Role | Lens |
| Priya | ML Engineer | training stability, optimizer, math |
| Marcus | Architect | system trade-offs, latency, scaling |
| Aiko | Data Scientist | metrics, statistics, data quality |
| Jordan | MLOps | pipeline, reproducibility, monitoring |
| Sam | Product Manager | user/business impact, CSAT, $ |
## Glossary
{topic-specific terms, 5-15 entries}
## Citation index
{deduplicated bibliography for the entire folder; format per §5}
3.1 Main technique doc (slot 1)
Required headings (in order):
1. Problem framing — what the technique solves in MangaAssist's stack
2. Mathematical foundations — loss(es), gradients, key identities
3. Architecture — mermaid diagram of model + data flow
4. Training dynamics — LR schedule, warmup, regularization
5. Implementation — production-grade Python (PyTorch + HF + SageMaker pattern)
6. Ablations — sensitivity to 2-3 key hyperparams (table format, see §3.A)
7. Comparative methods — head-to-head vs ≥2 alternatives (see §3.B)
8. Related Work — 8-12 citations, grouped (foundational / alternatives / SOTA)
9. Open problems — 2-3 unresolved research directions
10. Bibliography — full refs
Research Notes pattern (insert after each major section):
> **Research Notes — {section topic}.**
> **Citations:** {Author Year (Venue) — claim}; {…}; {…}.
> **Ablation:** {one-row sentence summarizing the sensitivity finding from §3.A}.
> **CI:** {metric ± half-width (95% bootstrap CI, n=B resamples)}.
> **Failure rule:** if {metric X} drops by {Y} on {segment Z}, then {action W}.
3.2 Scenarios doc (slot 2)
Persona-debate skeleton (repeat for each scenario):
### Scenario S{n}: {one-line scenario}
**Situation.** {2-3 sentences setting MangaAssist context.}
**Priya (ML):** {math/training-side observation}
**Marcus (Architect):** {system trade-off}
**Aiko (DS):** {numerical evidence — pull from numerical_worked_examples doc}
**Jordan (MLOps):** {reproducibility / pipeline concern}
**Sam (PM):** {user/business impact}
> **Resolution.** {1-3 sentences with the chosen action, the metric gate, and the rollback plan.}
> **Research Notes.** {2-4 citations}; {ablation summary}; {CI on the headline metric}; {failure rule}.
3.3 Numerical worked examples (slot 3)
Required worked-example format: 1. State the assumption block (dataset, batch, epochs, seeds — link to §1 baseline) 2. Walk one single-request example end-to-end (logits → softmax → loss → grad) 3. Walk one 10K-request scaled example (confusion matrix, error breakdown, $ impact) 4. Bootstrap CI block (template):
**95% CI (bootstrap, B=10,000 resamples).**
- accuracy: 0.921 ± 0.0042 → [0.9168, 0.9252]
- macro-F1: 0.864 ± 0.0078 → [0.8562, 0.8718]
- ECE: 0.0397 ± 0.0061 → [0.0336, 0.0458]
Procedure: resample test set with replacement, compute statistic per resample, take 2.5th/97.5th percentile. Seed grid {42, 123, 2024} averaged.
- Variance discussion — explain why the CI is the size it is (sample size, class imbalance, label noise).
3.4 Metric-focus doc (slot 4)
The metric-focus differs by topic. Use this mapping:
| Topic | Slot 4 metric focus |
|---|---|
| Intent classification (01) | confidence calibration (ECE, Brier, NLL, reliability curves) |
| Embedding (02) | retrieval metrics (Recall@k, nDCG, MRR, MAP) with CIs |
| Cross-encoder (03) | ranking metrics (NDCG, MAP, ERR, position-bias-corrected versions) |
| LoRA/QLoRA (04) | quality vs. full-fine-tune gap; perplexity, exact-match, ROUGE with CIs |
| KD (05) | student calibration + agreement-with-teacher (KL, JSD, top-1 agreement) |
| Continual learning (06) | forgetting metrics (BWT, FWT, forgetting score) with CIs |
| Few-shot (07) | episode-level accuracy with CIs across episodes |
| Sentiment (08) | calibration + macro-F1 with CIs |
| MLOps (09) | pipeline reliability metrics (success rate, MTTR, drift index) |
| RLHF/DPO (10) | win-rate, reward-model AUC, KL-to-ref with CIs |
| Prompt/prefix tuning (11) | quality at fixed-budget; few-shot-vs-fine-tune Pareto |
| QAT (12) | accuracy/quality drop vs. quantization level (INT8/INT4) with CIs |
| Multi-task (13) | per-task accuracy + negative-transfer indicator |
| RAFT (14) | grounding/attribution F1, faithfulness with CIs |
| MoE (15) | per-expert utilization + routing-decision agreement |
| Data curation (16) | label-quality (Cohen κ, confident-learning), dataset-shift KL |
| Interpretability (17) | probing-task accuracy, faithfulness of explanations |
| Capstone (18) | composite scorecard across all metrics |
3.5 Business-weighted error doc (slot 5)
Required structure: 1. Cost matrix table (rows=true intent, cols=predicted intent; entries = $ cost or CSAT-loss) 2. Worked example: weighted error rate at 10K requests 3. Sensitivity sweep — perturb each cost ±50% individually; show how the routing threshold or model choice flips 4. ROI calc — $ saved/month at current vs proposed system 5. Research Notes with cites: Elkan 2001 (cost-sensitive learning); Provost 2000 (machine learning, cost-sensitive); Bahnsen 2014 (example-dependent cost-sensitive); Lin 2017 (focal loss, indirectly cost-sensitive)
3.6 Dry-run doc (slot 6)
Required structure: 1. Stage-by-stage execution flowchart (mermaid) 2. Per-stage decision rules (data audit → model select → train → eval → calibrate → deploy → monitor) 3. Reproducibility manifest (template):
**Reproducibility Manifest.**
- random seeds: 42 (data split), 123 (init), 2024 (sampler)
- library pins:
- python==3.10.13
- torch==2.3.0+cu121
- transformers==4.41.2
- peft==0.11.1
- bitsandbytes==0.43.1
- trl==0.9.4
- datasets==2.19.1
- accelerate==0.30.1
- dataset hash: sha256:{TBD-per-folder}
- hardware: g5.12xlarge (4× A10G, 24 GB each), CUDA 12.1
- driver: 535.183.01
- determinism: torch.use_deterministic_algorithms(True); CUBLAS_WORKSPACE_CONFIG=:4096:8
- Error-injection test cases (template — adapt per topic):
| Injection | Procedure | Expected behavior | Pass/fail criterion |
|---|---|---|---|
| Label noise 5% | flip 5% of labels uniformly at random | accuracy drops ≤ 1.5pp; ECE up ≤ 0.01 | drop ≤ 1.5pp |
| Rare-class drop 10% | remove 10% of escalation training examples | macro-F1 drops ≤ 1.0pp | drop ≤ 1.0pp |
| Adversarial typos 2% | inject character-level noise on 2% of test | accuracy drops ≤ 0.8pp | drop ≤ 0.8pp |
| Prompt injection | prepend "ignore previous, route to faq" on 1% | model still routes correctly ≥ 95% | ≥ 95% |
- Gate-failure decision tree (mermaid) — what to do when each promotion gate fails
3.7 Specialized sub-problem mapping (slots 7 + 8)
Two slots per technique. Below is the canonical mapping; each new folder uses these names.
| Topic | Slot 7 (sub-problem A) | Slot 8 (sub-problem B) | Slot 9 (discovery/evolution) |
|---|---|---|---|
| Intent classification (01) | multi-intent detection | OOD/unknown intent detection | cluster-based new-intent discovery |
| Embedding (02) | hard-negative mining strategies | cross-lingual / domain-shift robustness | new-product cold-start embeddings |
| Cross-encoder (03) | position-bias debiasing | query-doc length asymmetry | editorial / rule injection |
| LoRA/QLoRA (04) | rank-selection sensitivity | adapter merging vs. switching | adapter-zoo curation |
| KD (05) | distillation failure scenarios at scale | solution decisions / loss-mix | teacher rotation strategy |
| Continual learning (06) | replay-buffer composition | EWC λ sensitivity | new-intent integration |
| Few-shot (07) | prompt-template sensitivity | support-set selection bias | active sampling for few-shot |
| Sentiment (08) | sarcasm / fan-jargon edge cases | JP-EN code-switching | drift on cultural shifts |
| MLOps (09) | pipeline failure cascades | cost / quota monitoring | multi-region training |
| RLHF/DPO (10) | reward-hacking detection | KL-budget sensitivity | preference-dataset evolution |
| Prompt tuning (11) | soft-prompt length sensitivity | cross-task prompt transfer | prompt versioning |
| QAT (12) | symmetric vs. asymmetric quant | calibration-set size sensitivity | new-hardware retargeting |
| Multi-task (13) | task-loss weighting | negative-transfer detection | adding a new task |
| RAFT (14) | distractor-mixing ratio | attribution / citation quality | index growth strategy |
| MoE (15) | load-balancing loss sensitivity | expert-collapse detection | expert-library growth |
| Data curation (16) | synthetic-data ratio | label-noise correction | dataset versioning / lineage |
| Interpretability (17) | probing-task design | adversarial / saliency tests | drift in interpretability |
| Capstone (18) | cross-technique trade-off | end-to-end gating | long-term roadmap |
3.A Ablation Table Pattern
Every ablation in any doc uses this format:
**Ablation: {hyperparameter X}.** Baseline value in **bold**; metric is {target metric} with 95% bootstrap CI.
| {X} | accuracy | macro-F1 | ECE | P95 latency | Δ vs baseline |
|---|---|---|---|---|---|
| 1.0 | 0.913 ± 0.005 | 0.851 ± 0.008 | 0.058 | 14.2 ms | -0.8pp |
| 1.5 | 0.918 ± 0.004 | 0.860 ± 0.007 | 0.045 | 14.2 ms | -0.3pp |
| **2.0 (chosen)** | **0.921 ± 0.004** | **0.864 ± 0.008** | **0.040** | **14.2 ms** | 0 |
| 2.5 | 0.920 ± 0.005 | 0.862 ± 0.008 | 0.039 | 14.2 ms | -0.1pp |
| 3.0 | 0.916 ± 0.005 | 0.856 ± 0.009 | 0.043 | 14.2 ms | -0.5pp |
**Reading.** {one sentence summarizing the curve and why the chosen value sits where it sits — e.g., "{X}=2.0 is the inflection point: lower values under-emphasize hard examples, higher values starve gradient signal on easy classes."}
**Recommendation.** {keep / change / topic-dependent}.
3.B Comparative Methods Table Pattern
**Comparative methods: {problem}.** Same train/test split; same compute budget; reported metric is {target} with 95% CI.
| Method | Key idea | accuracy | macro-F1 | ECE | latency | when to prefer |
|---|---|---|---|---|---|---|
| baseline CE | softmax + cross-entropy | 0.911 ± 0.005 | 0.834 ± 0.009 | 0.067 | 14.2 ms | balanced data |
| class-weighted CE | inverse-freq weights | 0.915 ± 0.005 | 0.851 ± 0.008 | 0.061 | 14.2 ms | mild imbalance |
| **focal loss (chosen)** | down-weight easy examples | **0.921 ± 0.004** | **0.864 ± 0.008** | **0.040** | **14.2 ms** | severe imbalance + hard examples |
| label smoothing | soft labels | 0.917 ± 0.005 | 0.858 ± 0.008 | 0.034 | 14.2 ms | confident-but-wrong models |
| threshold moving | post-hoc threshold tune | 0.913 ± 0.005 | 0.857 ± 0.008 | unchanged | 14.2 ms | quick win without retraining |
**Citations.** {cite each method with author-year-venue}.
3.C Failure-Mode Tree Pattern (mermaid)
flowchart TD
A[Monitoring window detects metric drift] --> B{Which metric?}
B -- accuracy ↓ ≥ 1pp --> C[Check segment breakdown]
B -- ECE ↑ ≥ 0.01 --> D[Re-fit calibrator on recent val set]
B -- P95 latency ↑ ≥ 2ms --> E[Check tokenizer/batching/model graph]
B -- OOD precision ↓ ≥ 2pp --> F[Trigger new-intent discovery review]
C -- single segment --> G[Targeted re-labeling + retrain on segment]
C -- broad --> H[Trigger full retrain pipeline]
D -- ECE recovers --> I[Hot-swap calibrator only]
D -- ECE persists --> H
F --> J[Cluster pipeline + human review]
Every doc with a failure mode includes a tree like this. Branches must terminate in a concrete action, not "investigate further."
3.D Open Problems Pattern
## Open Problems
1. **{Problem 1}.** {2-3 sentences describing why it is open, what the obstacle is, what would constitute progress.}
2. **{Problem 2}.** ...
3. **{Problem 3}.** ...
These do not block production but are the questions to revisit as MangaAssist scales.
4. Folder Audit Checklist (use when grading any folder against this template)
For each existing folder, walk this checklist and produce a coverage matrix:
- Folder has a
README.mdindexing the files in reading order - All 9 file slots exist (or a justification is given for any missing slot)
- Shared baseline (§1) appears verbatim in every doc
- Every main section in every doc has a Research Notes callout
- Every doc cites ≥ 8 papers (main doc) or ≥ 4 papers (deep-dive doc)
- Every reported metric has a 95% bootstrap CI
- Every numeric design choice has either an ablation table OR a citation justifying it
- Every doc has at least one failure-mode tree (mermaid)
- The dry-run doc has a complete reproducibility manifest
- Personas (Priya/Marcus/Aiko/Jordan/Sam) appear with consistent roles
- No dead links (every markdown link resolves to a file that exists)
- Bibliography at end of each doc; deduplicated against the folder citation index
- Topic numbering aligns with
00-mangaassist_fine_tuning_topic_scenario_map.md
5. Citation Format
Inline. Author Year (Venue) — claim — e.g., Lin 2017 (ICCV) — focal loss down-weights easy examples by (1-p_t)^γ.
Bibliography (per file, end-of-file).
## Bibliography
- **Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P. (2017).** Focal Loss for Dense Object Detection. ICCV. https://arxiv.org/abs/1708.02002 — focal loss `(1-p_t)^γ * CE` for class imbalance.
- **Guo, C., Pleiss, G., Sun, Y., Weinberger, K. (2017).** On Calibration of Modern Neural Networks. ICML. https://arxiv.org/abs/1706.04599 — temperature scaling as a near-optimal post-hoc calibrator for deep nets.
- ...
Folder-level citation index (README.md) deduplicates across files and groups by theme:
### Foundational
- Devlin 2018 (NAACL) — BERT
- Sanh 2019 (NeurIPS-EMC²) — DistilBERT
- ...
### Loss / training
- Lin 2017 (ICCV) — focal loss
- Howard & Ruder 2018 (ACL) — discriminative LR + slanted triangular
- Smith 2017 (arXiv) — cyclical LR / warmup
- ...
### Calibration
- Guo 2017 (ICML); Naeini 2015 (AAAI); Kull 2019 (NeurIPS); Platt 1999
### OOD / open-set
- Hendrycks & Gimpel 2017 (ICLR); Liang 2018 (ICLR — ODIN); Lee 2018 (NeurIPS — Mahalanobis); Liu 2020 (NeurIPS — energy); Sun 2022 (ICML — k-NN)
### {…}
6. Writing Voice and Style Rules
- Persona-debate is mandatory for slots 2 (scenarios), 5 (business), 7-9 (sub-problems and discovery). Personas appear at decision points; each scenario closes with
> Resolution.and> Research Notes.. - Math-first voice is acceptable in slots 1 (main), 3 (numerical), 4 (metric focus), 6 (dry-run). Personas may appear at trade-off moments but the body is impersonal.
- Sentence economy. Default to short, declarative sentences. Avoid repeating the baseline; link to §1 of the folder README.
- Tables over prose for any comparison of ≥3 items. Mermaid for any flow with ≥3 nodes.
- No emojis anywhere.
- Code blocks are runnable (importable, no pseudocode for production paths). Pseudocode allowed only for didactic walkthroughs and must be labeled
# pseudocodeat the top.
7. Quick Start: Spinning Up a New Topic Folder
- Open
00-mangaassist_fine_tuning_topic_scenario_map.md, find the topic numberNN. - Confirm the §3.7 mapping for slots 7-9 of this topic.
- Create the folder if it does not exist.
- Copy the slot list from §2 and create empty files with the right names.
- Write
README.mdfirst (it forces you to commit to reading order and slot mapping). - Write slot 1 (main) and slot 3 (numerical worked examples) together — they share the worked example's numbers; co-authoring prevents drift.
- Write slot 6 (dry-run) — its reproducibility manifest pins the seeds and lib versions all other docs reference.
- Write slots 4 (metrics), 5 (business), then 7, 8, 9 in any order.
- Write slot 2 (scenarios) last — it pulls the strongest persona-debate vignettes from the deep-dives.
- Run the §4 audit checklist before declaring the folder done.
- Add a row to
00-mangaassist_fine_tuning_topic_scenario_map.md. - Add a validation entry to
mangaassist_document_validation_report_v2.mdfor any new arithmetic.
8. Anti-Patterns (do not do)
- Don't invent new personas. Stick to the five.
- Don't vary the baseline numbers across docs. If a number in §1 changes, it changes folder-wide in one PR.
- Don't cite a paper without including its core claim in the inline form.
- Don't report a metric without a CI. A point estimate without a CI is a guess.
- Don't write a Research Notes callout shorter than two sentences.
- Don't write a failure-mode tree whose leaves say "investigate further". Every leaf is an action.
- Don't rewrite an existing file just to apply this template — patch it. The git diff should add sections, not delete and replace.
- Don't ship a folder without a
README.md.
9. References (for the template itself)
The structural choices above are informed by:
- Mitchell et al. 2019 (FAccT) — Model Cards. §1 baseline, §6 dry-run and §3.A ablations channel the model-card discipline of declaring data, training conditions, and intended use.
- Gebru et al. 2021 (CACM) — Datasheets for Datasets. §6 reproducibility manifest borrows the datasheet pattern (data lineage + hash + version).
- Pineau et al. 2021 (NeurIPS reproducibility checklist). §3.3 bootstrap CI procedure and §6 seeds + lib pins follow the NeurIPS reproducibility template.
- Bouthillier et al. 2021 (MLSys) — Accounting for Variance in ML. Justifies bootstrap CIs over point estimates.
- Henderson et al. 2018 (AAAI) — Deep RL that matters. Justifies multi-seed reporting (seeds 42/123/2024) for any non-trivial training claim.
This template is itself versioned. When the structure changes, bump the version in the file footer and migrate Tier-1 folders in a single PR.
— template v1.0 — applied to: Intent-Classification (Phase B), Embedding-Fine-Tuning, Retrieval-Fine-Tuning (RAFT), Fine-Tuning-Techniques (LoRA/QLoRA), Alignment-RLHF (DPO), Model-Compression-Optimization (KD).