
Quantization-Aware Training Scenarios - MangaAssist

Quantization-aware training, or QAT, prepares a model for low-precision serving by simulating quantization during training. For MangaAssist, QAT matters when a model is accurate offline but too slow or expensive in production.
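The core mechanism can be sketched as a fake-quantize round trip inserted into the forward pass (a minimal illustration assuming symmetric per-tensor INT8 scaling, not MangaAssist's actual training code):

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Simulate an INT8 round trip: scale, round, clip, then dequantize.

    During QAT this op sits in the forward pass so the model learns weights
    that survive quantization; on the backward pass, gradients typically flow
    through it unchanged (the straight-through estimator).
    """
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    scale = np.max(np.abs(w)) / qmax        # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                        # back to FP32 values

w = np.array([0.5, -1.27, 0.003, 1.27])
w_q = fake_quantize(w)
# round-trip error is bounded by half a quantization step (scale / 2)
assert np.max(np.abs(w - w_q)) <= (np.max(np.abs(w)) / 127) / 2 + 1e-9
```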

When This Topic Matters

Use QAT when:

  • the intent classifier must stay under the 15 ms P95 routing budget,
  • reranking or sentiment inference adds too much latency,
  • post-training quantization causes rare-class regressions,
  • edge or CPU serving is required for cost control.

Scenario 1 - INT8 Intent Classifier

The existing DistilBERT intent classifier is high value and latency-sensitive. A naive INT8 conversion may preserve overall accuracy but hurt rare classes such as escalation or checkout_help.

QAT setup:

| Setting | Value |
| --- | --- |
| Base model | fine-tuned DistilBERT |
| Quantization target | INT8 weights and activations |
| Calibration data | 5,500 validation examples |
| QAT data | 44,000 train examples |
| Epochs | 1 additional epoch |
| Loss | focal loss plus class weights |
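The focal-loss-plus-class-weights choice keeps rare intents contributing gradient during the short QAT pass. A minimal sketch (gamma and the toy inputs are illustrative, not tuned values):

```python
import numpy as np

def focal_loss(probs, labels, class_weights, gamma=2.0):
    """Class-weighted focal loss: down-weights easy, confident examples so
    rare classes such as escalation are not drowned out.

    probs: (N, C) softmax outputs; labels: (N,) integer class ids.
    """
    p_true = probs[np.arange(len(labels)), labels]   # p of the correct class
    w = class_weights[labels]                        # per-class weight
    return np.mean(-w * (1.0 - p_true) ** gamma * np.log(p_true))

# a confident correct prediction contributes far less than an uncertain one
easy = focal_loss(np.array([[0.95, 0.05]]), np.array([0]), np.ones(2))
hard = focal_loss(np.array([[0.55, 0.45]]), np.array([0]), np.ones(2))
assert easy < hard
```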

Promotion gate:

| Metric | FP32 champion | INT8 gate |
| --- | --- | --- |
| accuracy | 92.1% | >= 91.8% |
| rare-class accuracy | 88.6% | >= 88.0% |
| escalation recall | monitored | drop <= 0.5 points |
| P95 latency | 12 ms | <= 8 ms |
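A gate like this is easiest to enforce as an explicit check before promotion. A sketch (the metric dictionary keys are hypothetical, not MangaAssist's actual evaluation schema):

```python
def passes_intent_gate(m):
    """Return True only if the INT8 candidate clears every gate above."""
    return (
        m["accuracy"] >= 0.918
        and m["rare_class_accuracy"] >= 0.880
        # escalation recall: allow at most a 0.5-point drop vs the champion
        and m["escalation_recall_champion"] - m["escalation_recall"] <= 0.005
        and m["p95_latency_ms"] <= 8.0
    )

candidate = {
    "accuracy": 0.921, "rare_class_accuracy": 0.883,
    "escalation_recall_champion": 0.900, "escalation_recall": 0.897,
    "p95_latency_ms": 6.7,
}
assert passes_intent_gate(candidate)

# a fast model that regresses escalation recall must still be rejected
assert not passes_intent_gate({**candidate, "escalation_recall": 0.88})
```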

Scenario 2 - Reranker Quantization

Cross-encoders are slower than bi-encoders because they score each query-candidate pair jointly in a full forward pass rather than comparing precomputed embeddings. INT8 QAT can make top-20 reranking affordable.

Key check: ranking metrics are more important than classification accuracy.

| Metric | Gate |
| --- | --- |
| NDCG@10 drop | <= 0.01 |
| MRR@10 drop | <= 0.01 |
| P95 rerank latency improvement | >= 30% |
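NDCG@10 over the reranker's output ordering can be computed with a short helper (the standard formulation, not a project-specific one; the relevance lists below are toy data):

```python
import math

def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of results in ranked order."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))  # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# a perfect ordering scores 1.0; swapping relevant items costs a fraction
perfect = ndcg_at_k([3, 3, 2, 1, 0])
swapped = ndcg_at_k([3, 2, 3, 0, 1])
assert perfect == 1.0
assert 0.0 < swapped < 1.0
```

The gate then compares the FP32 and INT8 scores averaged over evaluation queries and requires the drop to stay within 0.01.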

Scenario 3 - Sentiment/Escalation Detector on CPU

If MangaAssist runs a small frustration detector on every message, CPU INT8 serving can reduce cost.

Do not accept a quantized model that misses angry or frustrated users. For escalation risk, a false positive (needlessly routing a calm user to a human) is far cheaper than a false negative (leaving a frustrated user unhandled).
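One way to enforce that preference is to retune the quantized model's decision threshold so recall on frustrated users stays high, accepting the extra false positives. A sketch with hypothetical scores and a hypothetical helper name:

```python
import math

def pick_threshold(scores, labels, min_recall=0.95):
    """Lowest threshold cost that still keeps recall on positives >= min_recall.

    scores: predicted frustration probabilities; labels: 1 = actually frustrated.
    Lowering the threshold trades false positives for fewer false negatives.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        return 0.5
    # number of positives we are required to catch at this recall level
    keep = max(1, math.ceil(min_recall * len(positives)))
    return positives[keep - 1]  # threshold at that positive's score

scores = [0.9, 0.7, 0.4, 0.2, 0.8, 0.3]
labels = [1, 1, 1, 1, 0, 0]
assert pick_threshold(scores, labels, min_recall=0.75) == 0.4
```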

QAT Workflow

flowchart TD
    A[Fine-tuned FP32 model] --> B[Insert fake quantization ops]
    B --> C[Train for 1-2 short epochs]
    C --> D[Export INT8 model]
    D --> E[Run parity evaluation]
    E --> F{Quality and latency pass?}
    F -- yes --> G[Shadow deploy]
    F -- no --> H[Keep FP32 or adjust quantization]
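The parity-evaluation step compares FP32 and INT8 outputs on the same evaluation set. A minimal agreement check might look like this (toy predictions; confidence shift feeds the calibration checks below):

```python
def parity_report(fp32_preds, int8_preds, fp32_conf, int8_conf):
    """Prediction agreement plus mean confidence shift between the two models."""
    n = len(fp32_preds)
    agreement = sum(a == b for a, b in zip(fp32_preds, int8_preds)) / n
    conf_shift = sum(b - a for a, b in zip(fp32_conf, int8_conf)) / n
    return {"agreement": agreement, "mean_confidence_shift": conf_shift}

report = parity_report(
    ["return_request", "escalation", "faq"],
    ["return_request", "escalation", "checkout_help"],
    [0.93, 0.88, 0.61],
    [0.91, 0.84, 0.55],
)
# disagreements should be inspected per class, not only counted in aggregate
```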

Failure Modes

| Failure | Detection | Fix |
| --- | --- | --- |
| rare-class collapse | macro F1 drops while accuracy holds | class-weighted QAT and rare-class calibration |
| confidence shift | ECE worsens | recalibrate after quantization |
| operator mismatch | offline pass, online fail | test the exact serving runtime |
| latency gain too small | P95 barely moves | quantize more ops or choose a smaller model |
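The confidence-shift row refers to expected calibration error. A small binned-ECE helper (the standard equal-width-bin formulation) can track it before and after quantization:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - acc)
    return ece

# toy example: confident bin slightly under-confident, mid bin over-confident
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
assert abs(ece - 0.05) < 1e-9
```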

Production Log

{
  "event": "quantized_inference",
  "model": "intent-distilbert-int8-v02",
  "intent": "return_request",
  "confidence": 0.91,
  "latency_ms": 6.7,
  "runtime": "onnxruntime"
}
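Events like this can be aggregated to verify the latency gate online. A sketch of a P95 computation over parsed log lines (field names match the example event; the nearest-rank percentile is one common convention):

```python
import json
import math

def p95_latency(log_lines):
    """P95 of latency_ms across quantized_inference events (nearest-rank)."""
    lats = sorted(
        e["latency_ms"]
        for e in map(json.loads, log_lines)
        if e.get("event") == "quantized_inference"
    )
    if not lats:
        return None
    rank = max(1, math.ceil(0.95 * len(lats)))  # nearest-rank percentile
    return lats[rank - 1]

lines = [
    json.dumps({"event": "quantized_inference", "latency_ms": i})
    for i in range(1, 21)
]
assert p95_latency(lines) == 19
```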

Final Decision

For MangaAssist, QAT is a production optimization tool. It should be evaluated with business-sensitive metrics, because an INT8 model that is faster but misses escalation or checkout-help traffic is not actually cheaper.