Quantization-Aware Training Scenarios - MangaAssist
Quantization-aware training, or QAT, prepares a model for low-precision serving by simulating quantization effects, rounding and clamping to the target integer grid, in the forward pass during training. For MangaAssist, QAT matters when a model is accurate offline but too slow or expensive in production.
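A minimal sketch of the core idea in PyTorch: a fake-quantize op rounds and clamps values to the INT8 grid in the forward pass, while a straight-through estimator lets gradients flow as if quantization were identity. The function name and parameters below are illustrative, not a specific library API.

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, zero_point: int,
                  qmin: int = -128, qmax: int = 127) -> torch.Tensor:
    """Simulate an INT8 round-trip in the forward pass (hypothetical helper).

    The detach trick is a straight-through estimator: the backward pass
    treats the round/clamp as identity, so training can adapt weights
    to the quantization grid it will face at serving time.
    """
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    dq = (q - zero_point) * scale
    return x + (dq - x).detach()
```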
When This Topic Matters
Use QAT when:
- the intent classifier must stay under the 15 ms P95 routing budget,
- reranking or sentiment inference adds too much latency,
- post-training quantization causes rare-class regressions,
- edge or CPU serving is required for cost control.
Scenario 1 - INT8 Intent Classifier
The existing DistilBERT intent classifier is high value and latency-sensitive. A naive post-training INT8 conversion may preserve overall accuracy while degrading rare classes such as escalation or checkout_help.
QAT setup:
| Setting | Value |
|---|---|
| Base model | fine-tuned DistilBERT |
| Quantization target | INT8 weights and activations |
| Calibration data | 5,500 validation examples |
| QAT data | 44,000 train examples |
| Epochs | 1 additional epoch |
| Loss | focal loss plus class weights |
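A sketch of the loss term from the table above, assuming PyTorch; `gamma` and the per-class weights are hypothetical values to be tuned on the 44,000-example QAT set.

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                        class_weights: torch.Tensor,
                        gamma: float = 2.0) -> torch.Tensor:
    """Focal loss with per-class weights: down-weights easy examples so
    rare intents like escalation keep gradient signal during the extra
    QAT epoch."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Class-weighted cross entropy, one value per example.
    ce = F.nll_loss(log_probs, targets, weight=class_weights, reduction="none")
    # Probability the model assigns to the true class.
    pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - pt) ** gamma * ce).mean()
```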
Promotion gate:
| Metric | FP32 champion | INT8 gate |
|---|---|---|
| accuracy | 92.1% | >= 91.8% |
| rare-class accuracy | 88.6% | >= 88.0% |
| escalation recall | monitored | <= 0.5 points regression |
| P95 latency | 12 ms | <= 8 ms |
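The gate is mechanical enough to encode directly. A sketch, assuming metrics are collected into plain dicts; the thresholds mirror the table and the function and field names are illustrative.

```python
def int8_promotion_gate(fp32: dict, int8: dict) -> dict:
    """Compare an INT8 candidate against the FP32 champion.

    Expects keys: accuracy, rare_class_accuracy, escalation_recall
    (all in percentage points) and p95_latency_ms.
    """
    checks = {
        "accuracy": int8["accuracy"] >= 91.8,
        "rare_class_accuracy": int8["rare_class_accuracy"] >= 88.0,
        # No more than 0.5 points of escalation-recall regression.
        "escalation_recall":
            fp32["escalation_recall"] - int8["escalation_recall"] <= 0.5,
        "p95_latency_ms": int8["p95_latency_ms"] <= 8.0,
    }
    checks["promote"] = all(checks.values())
    return checks
```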
Scenario 2 - Reranker Quantization
Cross-encoders are slower than bi-encoders because they encode each query-candidate pair jointly, so inference cost scales with the number of candidates. INT8 QAT can make top-20 reranking affordable.
Key check: gate on ranking metrics, not classification accuracy.
| Metric | Gate |
|---|---|
| NDCG@10 drop | <= 0.01 |
| MRR@10 drop | <= 0.01 |
| P95 rerank latency improvement | >= 30% |
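A sketch of the ranking-side gate, assuming binary relevance labels for the reranked top-10 per query; the metric helper and argument names are illustrative.

```python
def mrr_at_10(ranked_relevance: list[list[int]]) -> float:
    """Mean reciprocal rank over queries; each inner list holds binary
    relevance labels for the top-10 reranked candidates, in rank order."""
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels[:10], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def reranker_gate(fp32_ndcg: float, int8_ndcg: float,
                  fp32_mrr: float, int8_mrr: float,
                  fp32_p95_ms: float, int8_p95_ms: float) -> bool:
    """Apply the ranking-quality and latency gates from the table above."""
    return (fp32_ndcg - int8_ndcg <= 0.01
            and fp32_mrr - int8_mrr <= 0.01
            and (fp32_p95_ms - int8_p95_ms) / fp32_p95_ms >= 0.30)
```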
Scenario 3 - Sentiment/Escalation Detector on CPU
If MangaAssist runs a small frustration detector on every message, CPU INT8 serving can reduce cost.
Do not accept a quantized model that misses angry or frustrated users. For escalation risk, false positives are cheaper than false negatives, so the gate should hold recall on the frustrated class, as sketched below.
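One way to encode that asymmetry: sweep the decision threshold on held-out scores and keep the highest threshold that still meets a recall floor for the frustrated class. The 0.95 floor here is a hypothetical value, not a MangaAssist requirement.

```python
import numpy as np

def pick_threshold_for_recall(scores: np.ndarray, labels: np.ndarray,
                              min_recall: float = 0.95) -> float:
    """Return the highest threshold whose recall on the positive
    (frustrated) class still meets min_recall, biasing the INT8 model
    toward false positives rather than missed escalations."""
    positives = scores[labels == 1]
    for t in np.sort(np.unique(scores))[::-1]:
        recall = float((positives >= t).mean())
        if recall >= min_recall:
            return float(t)
    return float(scores.min())  # accept everything rather than miss users
```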
QAT Workflow
```mermaid
flowchart TD
    A[Fine-tuned FP32 model] --> B[Insert fake quantization ops]
    B --> C[Train for 1-2 short epochs]
    C --> D[Export INT8 model]
    D --> E[Run parity evaluation]
    E --> F{Quality and latency pass?}
    F -- yes --> G[Shadow deploy]
    F -- no --> H[Keep FP32 or adjust quantization]
```
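A minimal eager-mode sketch of steps B through D, using PyTorch's torch.ao.quantization API on a toy classifier head. A real DistilBERT conversion would typically go through an FX-graph-mode or ONNX path instead; the `TinyHead` module and its sizes are stand-ins.

```python
import torch
import torch.nn as nn
from torch.ao import quantization as tq

class TinyHead(nn.Module):
    """Stand-in classifier head; quant/dequant stubs mark the INT8 region."""
    def __init__(self, dim: int = 128, n_classes: int = 12):
        super().__init__()
        self.quant = tq.QuantStub()
        self.fc = nn.Linear(dim, n_classes)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyHead()
model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)       # B: insert fake-quant observers

# C: short QAT fine-tune (a single batch shown for brevity)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = nn.functional.cross_entropy(model(torch.randn(32, 128)),
                                   torch.randint(0, 12, (32,)))
loss.backward()
opt.step()

# D: convert to a real INT8 model for parity evaluation (step E)
model.eval()
int8_model = tq.convert(model)
```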
Failure Modes
| Failure | Detection | Fix |
|---|---|---|
| rare-class collapse | macro F1 drops while accuracy holds | class-weighted QAT and rare-class calibration |
| confidence shift | ECE worsens | recalibrate after quantization |
| operator mismatch | passes offline, fails in serving | test in the exact serving runtime |
| latency gain too small | P95 barely moves | quantize more ops or choose a smaller model |
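For the confidence-shift row, a common recalibration is temperature scaling on held-out data. A sketch assuming logits and labels are already collected as tensors; names and hyperparameters are illustrative.

```python
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor,
                    steps: int = 200, lr: float = 0.01) -> float:
    """Fit a single temperature so the quantized model's confidences
    match observed accuracy again (reduces ECE without retraining)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize in log space: T > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return float(log_t.exp())
```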
Production Log
```json
{
  "event": "quantized_inference",
  "model": "intent-distilbert-int8-v02",
  "intent": "return_request",
  "confidence": 0.91,
  "latency_ms": 6.7,
  "runtime": "onnxruntime"
}
```
Final Decision
For MangaAssist, QAT is a production optimization tool. It should be evaluated with business-sensitive metrics, because an INT8 model that is faster but misses escalation or checkout_help traffic is not actually cheaper.