# Training MLOps Scenarios - MangaAssist
This companion document turns the training infrastructure topic into a MangaAssist operating playbook. The goal is repeatable fine-tuning, validation, release, monitoring, and rollback across the full chatbot model stack.
## Models In Scope
| Model | Cadence | Main gate |
|---|---|---|
| intent classifier | weekly or drift-triggered | accuracy, rare-class accuracy, business-weighted harm, P95 latency |
| embedding adapter | monthly | Recall@3, NDCG@10 |
| cross-encoder reranker | monthly | MRR@10, latency |
| sentiment detector | weekly | escalation recall |
| LoRA/DPO response model | quarterly or quality-triggered | human preference and factuality |
| RAFT answer model | policy/catalog-triggered | grounded accuracy |
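The cadence/trigger matrix above can live in code so that retrain scheduling is auditable rather than tribal knowledge. A minimal sketch; the model keys, trigger names, and week conversions are illustrative assumptions, not part of the MangaAssist stack:

```python
# Hypothetical release-policy table mirroring the cadence/gate matrix above;
# model keys and trigger names are illustrative.
RELEASE_POLICY = {
    "intent_classifier":  {"cadence": "weekly",    "triggers": {"drift"}},
    "embedding_adapter":  {"cadence": "monthly",   "triggers": set()},
    "reranker":           {"cadence": "monthly",   "triggers": set()},
    "sentiment_detector": {"cadence": "weekly",    "triggers": set()},
    "response_model":     {"cadence": "quarterly", "triggers": {"quality"}},
    "raft_answer_model":  {"cadence": None,        "triggers": {"policy", "catalog"}},
}

CADENCE_WEEKS = {"weekly": 1, "monthly": 4, "quarterly": 13}

def release_due(model_name: str, weeks_since_release: int,
                fired_triggers: set) -> bool:
    """A retrain is due when the cadence has elapsed or a trigger fired."""
    policy = RELEASE_POLICY[model_name]
    due_by_time = (policy["cadence"] is not None
                   and weeks_since_release >= CADENCE_WEEKS[policy["cadence"]])
    return due_by_time or bool(policy["triggers"] & fired_triggers)
```

Keeping cadence `None` for the RAFT model expresses that it ships only on policy or catalog events, never on the clock.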
## Scenario 1 - Weekly Router Release
Pipeline:
```mermaid
flowchart TD
    A[Collect labeled logs] --> B[Data validation]
    B --> C[Train candidate]
    C --> D[Offline eval]
    D --> E[Latency test]
    E --> F[Shadow deploy]
    F --> G{Promotion gates pass?}
    G -- yes --> H[Blue-green release]
    G -- no --> I[Keep champion and open error review]
```
Required artifacts:
- dataset version
- tokenizer hash
- training config
- model artifact
- eval report
- confusion matrix
- latency report
- rollback pointer
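The artifact checklist above is easy to enforce mechanically: treat a release record as incomplete until every field is filled. A minimal sketch, assuming string-valued artifact references; the field names follow the list, but the class itself is hypothetical:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class ReleaseArtifacts:
    """One record per candidate release; field names follow the list above."""
    dataset_version: Optional[str] = None
    tokenizer_hash: Optional[str] = None
    training_config: Optional[str] = None
    model_artifact: Optional[str] = None
    eval_report: Optional[str] = None
    confusion_matrix: Optional[str] = None
    latency_report: Optional[str] = None
    rollback_pointer: Optional[str] = None

def missing_artifacts(artifacts: ReleaseArtifacts) -> list:
    """Names of required artifacts still absent; empty list means releasable."""
    return [f.name for f in fields(artifacts) if getattr(artifacts, f.name) is None]
```

A CI step can then fail the release whenever `missing_artifacts` returns a non-empty list.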
Promotion gate:
| Gate | Rule |
|---|---|
| overall accuracy | candidate >= champion - 0.2 points |
| rare-class accuracy | no critical regression |
| business-weighted harm | improves or remains within threshold |
| escalation miss rate | no increase |
| P95 latency | under 15 ms |
| shadow disagreement | reviewed if over threshold |
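The gate table translates directly into a pass/fail check. A simplified sketch: the metric keys, the 0.05 harm threshold, and the 0.10 disagreement threshold are assumptions, "0.2 points" is taken as 0.002 in fractional accuracy, and the rare-class and shadow gates (which the table leaves to human judgment) are reduced to simple comparisons here:

```python
def promotion_failures(candidate: dict, champion: dict,
                       max_accuracy_drop: float = 0.002,
                       p95_budget_ms: float = 15.0,
                       harm_threshold: float = 0.05,
                       disagreement_threshold: float = 0.10) -> list:
    """Return the names of failed gates; an empty list means promote."""
    failures = []
    if candidate["accuracy"] < champion["accuracy"] - max_accuracy_drop:
        failures.append("overall accuracy")
    if candidate["rare_accuracy"] < champion["rare_accuracy"]:
        failures.append("rare-class accuracy")
    # Harm must improve on the champion or stay within the absolute threshold.
    if candidate["business_harm"] > max(champion["business_harm"], harm_threshold):
        failures.append("business-weighted harm")
    if candidate["escalation_miss_rate"] > champion["escalation_miss_rate"]:
        failures.append("escalation miss rate")
    if candidate["p95_latency_ms"] >= p95_budget_ms:
        failures.append("P95 latency")
    # High shadow disagreement blocks auto-promotion pending human review.
    if candidate.get("shadow_disagreement", 0.0) > disagreement_threshold:
        failures.append("shadow disagreement (needs review)")
    return failures
```

Returning the failed gate names, rather than a bare boolean, gives the error-review step in the pipeline something concrete to file against.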
## Scenario 2 - Cross-Model Release Coordination
The embedding adapter and reranker should not be released independently if their metrics are tightly coupled.
Example:
- the embedding adapter changes the candidate distribution
- the reranker was trained on the old candidate distribution
- final top-3 quality drops even though each model improved in its separate offline eval
Solution:
- evaluate retrieval and reranking as one pipeline
- version compatible model pairs together
- run end-to-end catalog-search validation
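One way to make the "version compatible pairs" rule enforceable is a compatibility set that only joint end-to-end evaluation can extend. A minimal sketch; the version strings and function name are hypothetical:

```python
# Hypothetical compatibility registry: only retriever/reranker pairs that
# passed a joint end-to-end catalog-search eval may ship together.
COMPATIBLE_PAIRS = {
    ("embed-adapter-v7", "reranker-v4"),
    ("embed-adapter-v8", "reranker-v5"),
}

def can_release_together(embedding_version: str, reranker_version: str) -> bool:
    """Block independent releases of tightly coupled models."""
    return (embedding_version, reranker_version) in COMPATIBLE_PAIRS
```

Under this scheme, shipping `embed-adapter-v8` against the older `reranker-v4` is refused even if each model cleared its own offline gates.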
## Scenario 3 - Model Registry Discipline
Every MangaAssist model should have a champion/challenger state.
Registry fields:
```json
{
  "model_name": "intent-distilbert",
  "version": "v15",
  "dataset_version": "intent-data-2026-04-20",
  "training_code_sha": "abc123",
  "metrics": {
    "accuracy": 0.922,
    "rare_accuracy": 0.889,
    "p95_latency_ms": 12.1
  },
  "status": "shadow"
}
```
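The champion/challenger discipline amounts to a small state machine over the `status` field. A minimal sketch; the state names and transitions are illustrative and not tied to any specific registry product:

```python
# Legal status transitions for a registry record: a model must earn
# shadow evidence before champion, and champions only ever retire.
ALLOWED_TRANSITIONS = {
    "registered": {"shadow"},
    "shadow": {"champion", "archived"},   # promote only with shadow evidence
    "champion": {"archived"},             # demotion means rolling back to a prior champion
    "archived": set(),
}

def transition(record: dict, new_status: str) -> dict:
    """Return an updated registry record, refusing illegal status jumps."""
    current = record["status"]
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new_status}")
    return {**record, "status": new_status}
```

Because `archived` has no outgoing transitions, rollback is expressed as promoting the previous champion's record again from the registry history, not by mutating an archived entry.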
## Failure Modes
| Failure | Detection | Fix |
|---|---|---|
| train/serve skew | offline pass, live fail | tokenizer and preprocessing hash checks |
| silent data leakage | validation too good | group split by conversation/user |
| untracked manual model | no reproducibility | registry required for deployment |
| shadow ignored | bad model promoted | promotion checklist enforced |
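The train/serve-skew fix in the table, hashing the tokenizer and preprocessing config, can be sketched in a few lines. The function name and config fields are assumptions; the idea is simply that training and serving compute the same fingerprint over the same spec and refuse to start if they differ:

```python
import hashlib
import json

def preprocessing_fingerprint(tokenizer_config: dict, normalizers: list) -> str:
    """Hash the full preprocessing spec; the training job and the serving
    path compare fingerprints at startup to catch train/serve skew early."""
    payload = json.dumps({"tokenizer": tokenizer_config,
                          "normalizers": normalizers}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Serializing with `sort_keys=True` keeps the hash stable across dict orderings, so only a real config change (say, flipping lowercasing) produces a new fingerprint.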
## Final Decision
For MangaAssist, MLOps is part of model quality. A model is not done when it trains; it is done when it has a traceable dataset, repeatable pipeline, measurable gates, shadow evidence, and rollback safety.