
MangaAssist Knowledge Distillation — Production Dry Run, Library Comparison, and Hardware Playbook

This document expands the original MangaAssist distillation pipeline into a production-focused runbook with:

  • numerical dry runs,
  • epoch-by-epoch training evolution,
  • early stopping logic,
  • production log examples,
  • hardware-specific strategies,
  • library ecosystem comparison,
  • OpenAI-based distillation options,
  • Mermaid diagrams for engineering reviews.

0. Scope and Reading Guide

This document covers four practical distillation tracks for MangaAssist:

  1. Intent classifier distillation
    DistilBERT teacher → TinyBERT student.

  2. Response model distillation (managed-model path)
    OpenAI gpt-4.1 teacher → fine-tuned gpt-4.1-mini student.

  3. Response model distillation (self-hosted path)
    Teacher outputs from a stronger managed LLM → LoRA-tuned Llama 3 8B student.

  4. Reranker distillation
    Large cross-encoder teacher → compact student reranker.

Where exact throughput or latency numbers are not guaranteed by vendors, they are marked as dry-run engineering estimates and should be treated as planning numbers, not contractual benchmarks.


1. What Distillation Means in MangaAssist

Knowledge distillation trains a smaller student model to imitate a stronger teacher model.

For MangaAssist, distillation matters when the teacher is too expensive or too slow to use on every request:

  • a managed LLM gives the best final answer quality,
  • a large reranker lifts ranking quality but breaks latency budgets,
  • an intent classifier needs to run on CPU or Lambda,
  • unlabeled production logs need weak supervision from a stronger model.

1.1 MangaAssist production targets

| Component | Teacher | Student | Why distill |
|---|---|---|---|
| Intent classification | DistilBERT (66M) | TinyBERT (14.5M) | lower Lambda latency, smaller cold-start memory |
| Response model | gpt-4.1 / strong managed LLM | gpt-4.1-mini or Llama 3 8B | lower serving cost, lower p95 latency |
| Reranker | large cross-encoder | compact 4-layer ONNX reranker | fit within 15–20 ms inline ranking budget |
| Ambiguous intent resolution | ensemble + rules + adjudicator | single distilled classifier | simpler deployment path |

1.2 Success criteria

A student is useful only if the total system gets better on business constraints, not just model loss:

  • quality stays inside promotion gates,
  • latency meets SLO,
  • cost drops enough to matter,
  • operational simplicity improves,
  • safety/refusal behavior does not regress.

2. End-to-End Distillation Flow

flowchart TD
    A[Production traffic logs] --> B[Filter PII / deduplicate / stratify]
    B --> C[Build distillation dataset]
    C --> D[Teacher labeling]
    D --> E[Hard labels + soft labels + metadata]
    E --> F[Student training]
    F --> G[Offline evaluation]
    G --> H[Shadow evaluation]
    H --> I[Canary deploy]
    I --> J[Promotion or rollback]

    C --> C1[Human corrected set]
    C --> C2[Refusal and escalation set]
    C --> C3[Rare intents / long-tail queries]
    D --> D1[Teacher scores]
    D --> D2[Teacher responses]
    D --> D3[Teacher uncertainty]
    G --> G1[Accuracy / win rate]
    G --> G2[Hallucination / refusal precision]
    G --> G3[Latency / cost]

3. Dataset Design for MangaAssist

We start from the production-oriented scenario and make it concrete.

3.1 Response distillation dataset

| Source | Count | Purpose |
|---|---|---|
| production prompts | 25,000 | realistic traffic distribution |
| teacher responses | 25,000 | target behavior |
| human-corrected responses | 5,000 | fix teacher mistakes |
| refusal / escalation examples | 2,000 | preserve support behavior |
| total supervised rows | 32,000 | final response training set |

3.2 Suggested split

| Split | Count | Notes |
|---|---|---|
| train | 25,600 | 80% |
| validation | 3,200 | 10% |
| test / gate | 3,200 | 10%, frozen |
| golden human review set | 500 | never used for training |
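
A minimal sketch of this split, assuming each example carries an intent label for stratification so rare intents appear in every split (scikit-learn usage; the function and field names are illustrative):

from sklearn.model_selection import train_test_split

def make_splits(rows, labels, seed=13):
    # 80% train / 20% holdout, stratified by label
    train, hold, y_train, y_hold = train_test_split(
        rows, labels, test_size=0.20, stratify=labels, random_state=seed
    )
    # split the holdout evenly: 10% validation, 10% frozen test / gate set
    val, gate, _, _ = train_test_split(
        hold, y_hold, test_size=0.50, stratify=y_hold, random_state=seed
    )
    return train, val, gate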

3.3 Intent classifier dataset

For classifier distillation, use a larger but lighter dataset:

| Source | Count |
|---|---|
| labeled intent examples | 12,000 |
| paraphrased augmentations | 12,000 |
| recent unlabeled queries + teacher soft labels | 20,000 |
| rare-escalation and high-risk phrases | 4,000 |
| total | 48,000 |

3.4 Why production logs matter

Production logs tell you things a clean benchmark usually hides:

  • where users are ambiguous,
  • where teacher confidence is low,
  • which intents are rare but high-risk,
  • which prompts are long and expensive,
  • where refusals are triggered,
  • where the teacher hallucinates catalog or shipping facts.

A good distillation run does not only log the final answer. It logs the decision process around that answer.


4. What to Log During Distillation

4.1 Training-time logs

These logs are the minimum useful set:

| Log field | Why it matters |
|---|---|
| epoch | locate training stage |
| step | trend within epoch |
| loss_total | overall optimization |
| loss_hard | fit to labels |
| loss_kd | fit to teacher distribution |
| loss_feature | hidden-state matching quality |
| loss_attention | attention transfer quality |
| grad_norm | detect instability |
| lr | tie behavior to schedule |
| tokens_per_sec or samples_per_sec | training efficiency |
| gpu_mem_gb | capacity planning |
| teacher_entropy_mean | teacher softness / ambiguity level |
| student_entropy_mean | whether student is overconfident |
| rare_class_recall | long-tail protection |
| refusal_precision | policy safety |
| hallucination_rate | factuality check |

4.2 Evaluation-time logs

| Metric | Meaning |
|---|---|
| teacher preference match | how often student matches teacher choice |
| human win rate vs base student | student quality lift vs pre-distilled baseline |
| catalog hallucination rate | factual risk on catalog facts |
| escalation precision | whether support escalation is triggered correctly |
| refusal precision / recall | whether unsafe/out-of-scope queries are handled correctly |
| cost per 1K responses | business reason for distillation |
| p50 / p95 / p99 latency | deployment readiness |

4.3 Example production log schema

{
  "event": "distill_epoch_end",
  "run_id": "kd_resp_v12",
  "student": "gpt-4.1-mini-ft-v12",
  "teacher": "gpt-4.1-2025-04-14",
  "epoch": 3,
  "loss_total": 0.812,
  "loss_ce": 0.471,
  "loss_kd": 0.958,
  "teacher_pref_match": 0.872,
  "human_win_rate_vs_base": 0.681,
  "catalog_hallucination_rate": 0.036,
  "refusal_precision": 0.942,
  "cost_per_1k_responses_usd": 4.80,
  "cost_reduction_vs_teacher": 0.61,
  "p95_latency_ms": 420
}

5. Distillation Math Refresher

5.1 Hard-label loss

[ \mathcal{L}_{hard} = -\sum_{c=1}^{C} y_c \log p_S(c|x) ]

This learns the correct class, but it throws away the teacher's view of near-miss classes.

5.2 Soft-label KD loss

[ \mathcal{L}_{KD} = T^2 \cdot D_{KL}(p_T^{(T)} \Vert p_S^{(T)}) ]

with

[ p_i^{(T)} = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} ]

and combined loss:

[ \mathcal{L} = (1-\alpha)\mathcal{L}_{hard} + \alpha \mathcal{L}_{KD} ]

For hidden-state matching:

[ \mathcal{L}_{feature} = \sum_l \text{MSE}(W_l h_S^{(l)}, h_T^{(m(l))}) ]

For attention transfer:

[ \mathcal{L}_{attn} = \sum_l \text{MSE}(A_S^{(l)}, A_T^{(m(l))}) ]

Final total for TinyBERT-style two-stage training:

[ \mathcal{L}_{total} = \lambda_{KD}\mathcal{L}_{KD} + \lambda_{CE}\mathcal{L}_{hard} + \lambda_{feat}\mathcal{L}_{feature} + \lambda_{attn}\mathcal{L}_{attn} ]
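
A minimal PyTorch sketch of this four-term objective, assuming per-layer hidden states and attention maps have already been collected, layer_map pairs each student layer l with its teacher layer m(l), and proj holds the learned projections W_l (all names are illustrative):

import torch.nn.functional as F

def tinybert_total_loss(s_logits, t_logits, labels,
                        s_hidden, t_hidden, s_attn, t_attn,
                        proj, layer_map, T=4.0,
                        lam_kd=0.7, lam_ce=0.3, lam_feat=1.0, lam_attn=1.0):
    ce = F.cross_entropy(s_logits, labels)
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # hidden-state matching through the learned projection W_l
    feat = sum(F.mse_loss(proj[l](s_hidden[l]), t_hidden[m]) for l, m in layer_map)
    # attention map transfer on the same layer mapping
    attn = sum(F.mse_loss(s_attn[l], t_attn[m]) for l, m in layer_map)
    return lam_kd * kd + lam_ce * ce + lam_feat * feat + lam_attn * attn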


6. Dry Run A — DistilBERT → TinyBERT Intent Distillation

This is the best first production dry run because it is cheap, fast, and easy to evaluate.

6.1 Setup

  • Teacher: fine-tuned DistilBERT, 66M params
  • Student: TinyBERT 4L, 14.5M params
  • Classes: 10 intent classes
  • Temperature: 4
  • Alpha: 0.7
  • Max sequence length: 64
  • Training mode:
    • Stage 1: feature matching
    • Stage 2: output KD + CE
  • Training dataset: 48,000 rows
  • Validation set: 4,800
  • Gate set: 4,800

6.2 Stage 0 baseline

| Model | Accuracy | Rare-class recall | Refusal precision | Warm latency |
|---|---|---|---|---|
| Teacher DistilBERT | 92.1% | 88.4% | 96.8% | 15 ms |
| Base TinyBERT, no KD | 84.6% | 73.9% | 90.7% | 5 ms |

The raw gap to close is:

  • accuracy gap = 92.1 - 84.6 = 7.5 points
  • rare-class recall gap = 88.4 - 73.9 = 14.5 points

6.3 Stage 1 — feature matching dry run

Assume:

  • A100 80 GB
  • bf16
  • batch size 128
  • seq len 64
  • teacher and student in memory at once

Steps per epoch:

[ \text{steps/epoch} = \lceil 48,000 / 128 \rceil = 375 ]

If effective throughput is ~320 sequences/sec including teacher forward + hidden-state extraction, then:

[ 48,000 / 320 \approx 150 \text{ sec} \approx 2.5 \text{ min / epoch} ]

For 6 epochs, estimated wall-clock:

[ 6 \times 2.5 = 15 \text{ min} ]
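
The same arithmetic as a tiny planning helper; the throughput inputs are dry-run estimates, not guarantees:

def epoch_minutes(rows, seq_per_sec):
    # wall-clock estimate for one full pass over the dataset
    return rows / seq_per_sec / 60

print(epoch_minutes(48_000, 320))  # ~2.5 min/epoch, Stage 1 feature KD
print(epoch_minutes(48_000, 650))  # ~1.2 min/epoch, Stage 2 output KD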

6.4 Stage 1 epoch table

| Epoch | Feature loss | Attention loss | Val acc (probe head) | Rare recall | Notes |
|---|---|---|---|---|---|
| 1 | 1.842 | 0.913 | 78.4% | 66.2% | student still unstable |
| 2 | 1.221 | 0.642 | 81.0% | 69.4% | large gain, hidden states aligning |
| 3 | 0.944 | 0.501 | 82.6% | 71.3% | gradients smooth |
| 4 | 0.791 | 0.438 | 83.2% | 72.0% | diminishing returns start |
| 5 | 0.714 | 0.401 | 83.5% | 72.4% | small gain |
| 6 | 0.701 | 0.392 | 83.5% | 72.5% | plateau |

6.5 Stage 1 stopping rule

Stop Stage 1 when all are true for 2 consecutive epochs:

  • feature loss improvement < 2%
  • attention loss improvement < 2%
  • probe-head val accuracy gain < 0.2 points

That happens at epoch 5–6, so we stop after epoch 6.

6.6 Stage 2 — output KD + CE dry run

Start from the Stage 1 checkpoint.

Assume:

  • batch size 128
  • same sequence length
  • output-only KD is cheaper than feature KD
  • effective throughput ~650 seq/sec

Epoch time:

[ 48,000 / 650 \approx 74 \text{ sec} \approx 1.2 \text{ min / epoch} ]

For 8 epochs, estimated time:

[ 8 \times 1.2 \approx 9.6 \text{ min} ]

6.7 Stage 2 epoch table

| Epoch | CE loss | KD loss | Total loss | Val acc | Rare recall | Refusal precision | Notes |
|---|---|---|---|---|---|---|---|
| 1 | 0.921 | 1.404 | 1.259 | 86.8% | 77.4% | 92.1% | big jump from Stage 1 |
| 2 | 0.772 | 1.115 | 1.012 | 88.1% | 79.9% | 93.4% | teacher structure learned |
| 3 | 0.681 | 0.982 | 0.892 | 88.8% | 81.1% | 94.2% | stable |
| 4 | 0.622 | 0.914 | 0.826 | 89.2% | 82.2% | 94.8% | near best |
| 5 | 0.593 | 0.878 | 0.792 | 89.3% | 82.6% | 95.1% | best val accuracy |
| 6 | 0.582 | 0.871 | 0.785 | 89.3% | 82.8% | 95.0% | no real gain |
| 7 | 0.579 | 0.870 | 0.784 | 89.2% | 82.7% | 94.8% | slight overfit signs |
| 8 | 0.567 | 0.873 | 0.781 | 89.1% | 82.3% | 94.5% | overfit |

6.8 Where to stop

We stop at epoch 5 for deployment candidate selection because:

  • best validation accuracy is first reached at epoch 5,
  • rare recall keeps improving slightly after that, but overall val accuracy does not,
  • refusal precision starts flattening,
  • epoch 7–8 shows classic memorization: training loss still falls, validation stops improving.

6.9 Classifier gate outcome

| Gate | Threshold | Epoch 5 result | Pass |
|---|---|---|---|
| accuracy | >= 89.0% | 89.3% | yes |
| rare-class recall | >= 80.0% | 82.6% | yes |
| refusal precision | >= 94.0% | 95.1% | yes |
| Lambda warm latency | <= 6 ms | 5 ms | yes |
| Lambda cold p95 | <= 150 ms | 122 ms | yes |

6.10 Final classifier business result

| Metric | Teacher | Distilled student | Delta |
|---|---|---|---|
| Accuracy | 92.1% | 89.3% | -2.8 |
| Rare recall | 88.4% | 82.6% | -5.8 |
| Warm latency | 15 ms | 5 ms | 3.0× faster |
| Model size | 264 MB | 58 MB | 4.6× smaller |
| Cost per 1M requests (CPU/Lambda dry run) | $38 | $17 | 55% lower |

7. Dry Run B — Response Distillation with OpenAI Models

This is the managed-model path.

7.1 Why include OpenAI here

If MangaAssist wants a smaller managed student instead of self-hosting, a practical path is:

  • teacher: gpt-4.1
  • student base: gpt-4.1-mini
  • fine-tuning method: supervised fine-tuning using teacher outputs + human corrections + refusal data

This is not pure logit-level KD. It is response distillation through SFT.

7.2 Data recipe

We reuse the 32,000 row response dataset:

  • 25,000 production prompts
  • 25,000 teacher answers
  • 5,000 human-corrected answers
  • 2,000 refusal/escalation examples

We transform each row into a chat-style training example:

{
  "messages": [
    {"role": "system", "content": "You are MangaAssist. Follow catalog-safe, retrieval-grounded answer rules."},
    {"role": "user", "content": "I want romance manga with adult characters."},
    {"role": "assistant", "content": "Here are three good options..."}
  ]
}

7.3 Label weighting strategy

Not all rows should count equally; a weighting sketch follows the table below.

| Row type | Weight | Why |
|---|---|---|
| human-corrected | 2.0 | best ground truth |
| teacher response, clean | 1.0 | useful target |
| refusal / escalation | 2.5 | safety critical |
| low-confidence teacher rows | 0.5 | avoid copying ambiguity too strongly |
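
Managed fine-tuning APIs generally do not accept fractional per-row weights, so one practical approximation is repetition when building the JSONL. A hedged sketch, assuming rows are dicts with hypothetical prompt, answer, and row_type fields:

import json
import random

WEIGHTS = {"human_corrected": 2.0, "teacher_clean": 1.0,
           "refusal": 2.5, "low_confidence_teacher": 0.5}

SYSTEM = "You are MangaAssist. Follow catalog-safe, retrieval-grounded answer rules."

def copies_for(weight):
    # whole part -> guaranteed copies; fractional part -> sampling probability
    # (so 2.5 emits 2 or 3 copies, 0.5 emits 0 or 1)
    base = int(weight)
    return base + (1 if random.random() < weight - base else 0)

def write_weighted_jsonl(rows, path="mangaassist_distill_train.jsonl"):
    with open(path, "w") as f:
        for row in rows:
            example = {"messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": row["prompt"]},
                {"role": "assistant", "content": row["answer"]},
            ]}
            for _ in range(copies_for(WEIGHTS[row["row_type"]])):
                f.write(json.dumps(example) + "\n")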

7.4 Managed-model dry run assumptions

  • training rows: 25,600
  • validation rows: 3,200
  • average prompt + answer length: 420 tokens
  • total tokens / epoch:

[ 25,600 \times 420 = 10,752,000 \text{ tokens} ]

If 4 epochs are trained, the total training token volume is:

[ 4 \times 10.752M = 43.008M \text{ tokens} ]

7.5 Offline evaluation rubric

Each answer is scored on 5 axes:

| Axis | Weight |
|---|---|
| factuality / no catalog hallucination | 0.30 |
| recommendation relevance | 0.25 |
| policy / escalation correctness | 0.20 |
| answer format quality | 0.15 |
| conciseness | 0.10 |

Rubric score:

[ \text{rubric} = 0.30F + 0.25R + 0.20P + 0.15Q + 0.10C ]
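
The rubric as a direct function, with each axis scored 0–5:

def rubric(f, r, p, q, c):
    # f=factuality, r=relevance, p=policy, q=format quality, c=conciseness
    return 0.30 * f + 0.25 * r + 0.20 * p + 0.15 * q + 0.10 * c

print(rubric(4.5, 4.2, 4.8, 4.0, 3.9))  # 4.35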

7.6 Epoch-by-epoch dry run

| Epoch | Val rubric | Teacher preference match | Human win rate vs base student | Hallucination rate | Refusal precision | Notes |
|---|---|---|---|---|---|---|
| 1 | 4.08 / 5 | 81.2% | 59.0% | 5.4% | 91.8% | clear improvement, still weak on catalog facts |
| 2 | 4.23 / 5 | 84.9% | 64.1% | 4.2% | 93.7% | close to gate |
| 3 | 4.31 / 5 | 87.2% | 68.1% | 3.6% | 94.4% | best balanced checkpoint |
| 4 | 4.30 / 5 | 87.4% | 68.4% | 4.1% | 93.8% | slight overfit / style memorization |

7.7 Promotion gate

| Metric | Gate | Epoch 3 |
|---|---|---|
| teacher preference match | >= 85% | 87.2% |
| human win rate vs base student | >= 65% | 68.1% |
| catalog hallucination rate | <= 4% | 3.6% |
| refusal precision | >= 94% | 94.4% |
| cost per 1K responses | >= 50% lower than teacher | 61% lower |

7.8 Why epoch 3 wins over epoch 4

Epoch 4 slightly beats epoch 3 on teacher preference match, but not enough to justify:

  • hallucination rate rises from 3.6% to 4.1%,
  • refusal precision drops,
  • style starts becoming too rigid,
  • support escalation wording becomes overly templated.

This is a classic production tradeoff: a tiny gain in imitation quality is not worth a measurable regression in safety and factuality.

7.9 Example OpenAI-style distillation workflow

from openai import OpenAI

client = OpenAI()

# 1) Upload training file
train_file = client.files.create(
    file=open("mangaassist_distill_train.jsonl", "rb"),
    purpose="fine-tune"
)

# 2) Start fine-tuning job on a smaller student
job = client.fine_tuning.jobs.create(
    model="gpt-4.1-mini-2025-04-14",
    training_file=train_file.id,
    method={"type": "supervised"}
)

print(job.id)

7.10 Example offline teacher-generation step

from openai import OpenAI

client = OpenAI()

def label_with_teacher(prompt: str) -> str:
    resp = client.responses.create(
        model="gpt-4.1-2025-04-14",
        input=[
            {
                "role": "system",
                "content": "You are MangaAssist. Be retrieval-grounded, catalog-safe, and follow escalation policy."
            },
            {"role": "user", "content": prompt},
        ]
    )
    return resp.output_text

7.11 Example response-distillation log

{
  "event": "distilled_model_eval",
  "student": "gpt-4.1-mini-ft-manga-v03",
  "teacher": "gpt-4.1-2025-04-14",
  "epoch": 3,
  "teacher_preference_match": 0.872,
  "human_win_rate_vs_base": 0.681,
  "hallucination_rate": 0.036,
  "refusal_precision": 0.944,
  "cost_reduction": 0.61
}

8. Dry Run C — Self-Hosted Response Distillation into Llama 3 8B

This path is used when MangaAssist wants a self-hosted fallback or cost-controlled serving layer.

8.1 Distillation style

For a hosted teacher such as OpenAI or another managed LLM, we normally do:

  1. teacher inference offline,
  2. save teacher outputs,
  3. fine-tune the student with SFT / LoRA.

That means the teacher is not on GPU during student training.

8.2 Training setup

  • student: Llama 3 8B
  • fine-tuning: LoRA
  • precision: bf16
  • seq length: 512
  • micro-batch: 8
  • gradient accumulation: 8
  • effective batch size: 64
  • dataset: 32,000 examples
  • tokens / example: 512 average padded

Tokens per epoch:

[ 32,000 \times 512 = 16,384,000 ]

If effective training throughput is 5,800 tok/s on a single A100 80 GB with LoRA, then:

[ 16,384,000 / 5,800 \approx 2,825 \text{ sec} \approx 47 \text{ min / epoch} ]

Add eval + checkpoint overhead:

  • training: ~47 min / epoch
  • eval/checkpoint: ~8 min / epoch
  • total: ~55 min / epoch

For 4 epochs:

[ 4 \times 55 \approx 220 \text{ min} \approx 3.7 \text{ hours} ]

8.3 Epoch table

| Epoch | Train loss | Val rubric | Human score (1-5) | Hallucination rate | p95 latency | Notes |
|---|---|---|---|---|---|---|
| 1 | 1.92 | 3.61 | 3.5 | 7.4% | 158 ms | learns format first |
| 2 | 1.41 | 3.83 | 3.8 | 5.5% | 146 ms | content relevance improves |
| 3 | 1.19 | 3.92 | 3.9 | 4.0% | 137 ms | best balanced checkpoint |
| 4 | 1.08 | 3.91 | 3.9 | 4.6% | 135 ms | slight overfit, hallucinations rise |

8.4 Stop rule

Choose the first checkpoint that satisfies all of:

  • rubric score improvement < 0.02 on next epoch,
  • hallucination rate not increasing,
  • support escalation precision not decreasing,
  • no meaningful p95 latency benefit from further tuning.

That selects epoch 3.

8.5 Why not train longer

For distillation, longer training can make the student memorize teacher style more than teacher behavior. In MangaAssist that often shows up as:

  • repeating stock phrases,
  • too much certainty on weak retrieval,
  • nicer formatting but worse factuality,
  • lower diversity on recommendation prompts.

8.6 Example TRL SFTTrainer snippet

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype="bfloat16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token

config = SFTConfig(
    output_dir="./manga_llama_student",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    max_seq_length=512,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    processing_class=tokenizer,
)

trainer.train()

9. Dry Run D — Distilled Reranker

9.1 Why this matters

Rerankers are often the hidden latency offender in RAG systems.

If the teacher cross-encoder improves NDCG@10 but adds 40–60 ms, you often need a smaller student to stay in SLO.

9.2 Student objective

[ \mathcal{L} = \alpha \cdot \mathcal{L}_{sup} + (1-\alpha)\cdot \text{MSE}(s_{student}, s_{teacher}) ]
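
A minimal sketch, with a margin ranking loss standing in for the supervised term (an illustrative choice; the objective above does not prescribe it):

import torch
import torch.nn.functional as F

def reranker_kd_loss(s_scores, t_scores, labels, alpha=0.5):
    # supervised pairwise term over all positive/negative document pairs
    pos = s_scores[labels == 1]
    neg = s_scores[labels == 0]
    pairs_pos = pos.unsqueeze(1).expand(-1, neg.numel()).reshape(-1)
    pairs_neg = neg.unsqueeze(0).expand(pos.numel(), -1).reshape(-1)
    target = torch.ones_like(pairs_pos)
    sup = F.margin_ranking_loss(pairs_pos, pairs_neg, target, margin=1.0)
    # regression onto the teacher's raw relevance scores
    mse = F.mse_loss(s_scores, t_scores)
    return alpha * sup + (1 - alpha) * mse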

9.3 Example dry run numbers

  • teacher NDCG@10: 0.842
  • base student NDCG@10: 0.751
  • distilled student NDCG@10: 0.791
  • teacher p95 latency: 48 ms
  • distilled ONNX student p95 latency: 14 ms

9.4 Epoch table

| Epoch | Pairwise loss | MSE score loss | NDCG@10 | p95 latency | Notes |
|---|---|---|---|---|---|
| 1 | 0.491 | 0.228 | 0.771 | 14 ms | learns teacher ordering quickly |
| 2 | 0.444 | 0.192 | 0.784 | 14 ms | strong gain |
| 3 | 0.437 | 0.183 | 0.791 | 14 ms | best checkpoint |
| 4 | 0.433 | 0.181 | 0.790 | 14 ms | no useful gain |

9.5 Stop condition

Stop at epoch 3 because NDCG gain from epoch 3 to 4 is negative.


10. Early Stopping and Tradeoffs

10.1 Do not stop on training loss alone

Training loss almost always keeps falling even after the student has started to overfit.
For distillation, better stop signals are:

  • validation KD loss,
  • gate-set human score,
  • hallucination rate,
  • refusal precision,
  • rare-class recall.

10.2 Practical stop rules

Classifier stop rules

Stop if 2 of 3 happen for 2 consecutive epochs (see the sketch after this list):

  • validation accuracy improves by < 0.15 points,
  • rare-class recall improves by < 0.25 points,
  • KD loss improves by < 1%.
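
A minimal sketch of this rule, assuming metrics are stored as fractions (so 0.15 points = 0.0015) in per-epoch dicts:

def should_stop_classifier(history, patience=2):
    # history: per-epoch dicts with val_acc, rare_recall, kd_loss
    if len(history) < patience + 1:
        return False
    window = history[-(patience + 1):]
    for prev, cur in zip(window[:-1], window[1:]):
        stalled = [
            cur["val_acc"] - prev["val_acc"] < 0.0015,
            cur["rare_recall"] - prev["rare_recall"] < 0.0025,
            (prev["kd_loss"] - cur["kd_loss"]) / prev["kd_loss"] < 0.01,
        ]
        if sum(stalled) < 2:  # need 2 of 3 signals in every recent epoch
            return False
    return True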

Response model stop rules

Stop if any of these happen after the minimum epoch count:

  • hallucination rate rises by > 0.3 points,
  • human score gain < 0.03,
  • refusal precision drops,
  • teacher preference match gain < 0.2 points.

Reranker stop rules

Stop if:

  • NDCG gain < 0.002 across one epoch,
  • latency target already met,
  • top-3 swap rate on validation set becomes unstable.

10.3 Tradeoff table

| Choice | Benefit | Risk |
|---|---|---|
| higher temperature | more dark knowledge | signal becomes too flat |
| higher alpha | stronger teacher imitation | copies teacher mistakes |
| more epochs | better teacher imitation | hallucination and memorization |
| smaller student | lower latency and cost | capacity floor too low |
| more unlabeled logs | broader coverage | shift/noise if logs are stale |
| more refusal weighting | safer behavior | over-refusal on borderline prompts |

11. Production Logs You Actually Need

11.1 Teacher-label generation logs

{
  "event": "teacher_label_generation",
  "run_id": "labeling_2026_04_21",
  "teacher_model": "gpt-4.1-2025-04-14",
  "rows_processed": 25000,
  "avg_prompt_tokens": 138,
  "avg_response_tokens": 212,
  "teacher_refusal_rate": 0.061,
  "teacher_entropy_mean": 1.74,
  "low_confidence_fraction": 0.084,
  "estimated_cost_usd": 148.20
}

11.2 Classifier epoch logs

{
  "event": "distill_epoch_end",
  "run_id": "tinybert_kd_v05",
  "stage": "output_kd",
  "epoch": 5,
  "loss_total": 0.792,
  "loss_ce": 0.593,
  "loss_kd": 0.878,
  "val_accuracy": 0.893,
  "rare_class_recall": 0.826,
  "refusal_precision": 0.951,
  "throughput_seq_per_sec": 648,
  "gpu_mem_gb": 11.8
}

11.3 Response-model gate logs

{
  "event": "gate_eval",
  "run_id": "resp_openai_kd_v03",
  "student": "gpt-4.1-mini-ft-v03",
  "teacher": "gpt-4.1-2025-04-14",
  "teacher_preference_match": 0.872,
  "human_win_rate_vs_base": 0.681,
  "catalog_hallucination_rate": 0.036,
  "escalation_precision": 0.947,
  "cost_reduction": 0.61,
  "decision": "promote_shadow"
}

11.4 Canary logs

{
  "event": "online_shadow_compare",
  "shadow_student": "gpt-4.1-mini-ft-v03",
  "control_teacher": "gpt-4.1-2025-04-14",
  "sample_size": 5000,
  "student_accept_rate": 0.944,
  "student_escalation_rate": 0.063,
  "teacher_escalation_rate": 0.059,
  "student_catalog_hallucination_rate": 0.039,
  "teacher_catalog_hallucination_rate": 0.021,
  "student_p95_latency_ms": 418,
  "teacher_p95_latency_ms": 922
}

12. Library Ecosystem Comparison

This section answers: Which tool should I use, and when?

12.1 Comparison table

| Library / stack | Ease of use | Output KD | Feature KD | Attention KD | Hardware target | Maintenance view | Best use |
|---|---|---|---|---|---|---|---|
| raw PyTorch | medium-low | yes | yes | yes | GPU / CPU | always viable | max flexibility |
| transformers + custom Trainer | high | yes | yes (manual) | yes (manual) | GPU / CPU | active | best general NLP default |
| TextBrewer | medium | yes | yes | yes | GPU mainly | limited / older | classic NLP KD experiments |
| Knowledge-Distillation-Zoo patterns | low-medium | yes | yes | yes | GPU mainly | limited / reference-only | research baselines |
| optimum | high | no training, yes export | n/a | n/a | CPU / ONNX / edge | active | post-distillation optimization |
| trl SFTTrainer | high | response distillation | no | no | GPU | active | LLM response distillation |
| lightning | medium-high | yes | yes | yes | GPU / multi-GPU | active | clean engineering loops |
| Optimum Intel / OpenVINO | high | n/a | n/a | n/a | Intel CPU / GPU / edge | active | CPU-first inference |
| llama.cpp / llama-cpp-python | high for serving | n/a | n/a | n/a | CPU / Apple Silicon / edge | very active | quantized local serving |

How to read “maintenance view”

  • active: frequent docs, releases, or official ecosystem support
  • limited / older: still usable, but not where you should expect the latest production integrations
  • reference-only: best as a pattern source, not as your primary production framework

12.2 Hugging Face transformers + custom Trainer

When to use it

Use this when you want:

  • standard HF model loading,
  • easy evaluation hooks,
  • mixed precision,
  • multi-GPU support,
  • enough flexibility to add KD losses.

Minimal snippet

import torch
import torch.nn.functional as F
from transformers import Trainer

class KDTrainer(Trainer):
    def __init__(self, teacher_model=None, temperature=4.0, alpha=0.7, **kwargs):
        super().__init__(**kwargs)
        self.teacher = teacher_model.eval()
        self.temperature = temperature
        self.alpha = alpha
        for p in self.teacher.parameters():
            p.requires_grad = False

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs["labels"]
        student_outputs = model(**inputs)
        with torch.no_grad():
            teacher_outputs = self.teacher(**inputs)

        s_logits = student_outputs.logits
        t_logits = teacher_outputs.logits

        ce = F.cross_entropy(s_logits, labels)
        kd = F.kl_div(
            F.log_softmax(s_logits / self.temperature, dim=-1),
            F.softmax(t_logits / self.temperature, dim=-1),
            reduction="batchmean"
        ) * (self.temperature ** 2)

        loss = (1 - self.alpha) * ce + self.alpha * kd
        return (loss, student_outputs) if return_outputs else loss

Strengths vs raw PyTorch

  • less boilerplate,
  • built-in metrics/checkpoints,
  • integrates with accelerate,
  • easy for reproducible training jobs.

Weaknesses vs raw PyTorch

  • feature/attention KD still requires manual plumbing,
  • custom distributed teacher logic can get messy,
  • callback/event model is helpful but not always enough for unusual pipelines.

12.3 TextBrewer

When to use it

Use it when you want a KD-focused NLP library with explicit support for:

  • soft-label distillation,
  • intermediate feature matching,
  • dynamic loss schedules.

Minimal snippet

from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

train_config = TrainingConfig(
    output_dir="./tb_out",
    device="cuda"
)

distill_config = DistillationConfig(
    temperature=4,
    hard_label_weight=0.3,
    kd_loss_weight=0.7
)

distiller = GeneralDistiller(
    train_config=train_config,
    distill_config=distill_config,
    model_T=teacher,
    model_S=student,
    adaptor_T=teacher_adaptor,
    adaptor_S=student_adaptor
)

distiller.train(
    optimizer=optimizer,
    dataloader=train_loader,
    num_epochs=4
)

Strengths vs raw PyTorch

  • KD concepts are first-class,
  • easier feature and attention matching setup,
  • useful for classic BERT/TinyBERT style experiments.

Weaknesses vs raw PyTorch

  • ecosystem feels older,
  • fewer modern production examples,
  • less aligned with current HF + PEFT + LLM workflows.

12.4 Knowledge-Distillation-Zoo patterns

This is better thought of as a pattern repository than a production library.

When to use it

Use it when you want:

  • a fast starting point for many KD losses,
  • paper-reproduction style experimentation,
  • baseline implementations for losses like FitNet, AT, PKT, RKD, etc.

Minimal snippet pattern

# pattern example, not a pip-stable API
logits_loss = kd_criterion(student_logits, teacher_logits)
feat_loss = hint_criterion(student_feat, teacher_feat)

loss = 0.7 * logits_loss + 0.3 * feat_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()

Strengths vs raw PyTorch

  • fast way to inspect many loss designs,
  • good learning/reference resource.

Weaknesses vs raw PyTorch

  • not a full training framework,
  • fewer production ergonomics,
  • best used as inspiration, not as final platform code.

12.5 optimum for ONNX export + quantization

This is usually the next step after distillation.

When to use it

Use it when the student is already good enough and you now want:

  • ONNX export,
  • INT8 quantization,
  • easier CPU deployment.

Minimal snippet

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model = ORTModelForSequenceClassification.from_pretrained("./tinybert_student", export=True)
quantizer = ORTQuantizer.from_pretrained(model)

qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False)
quantizer.quantize(save_dir="./tinybert_student_int8", quantization_config=qconfig)

Strengths vs raw PyTorch

  • much easier ONNX path,
  • simpler CPU optimization workflow,
  • good post-training deployment step.

Weaknesses vs raw PyTorch

  • not a KD trainer by itself,
  • export/quantization edge cases still exist for some architectures.

12.6 trl SFTTrainer for LLM-to-LLM response distillation

When to use it

Use it for:

  • teacher-response imitation,
  • LLM instruction tuning,
  • distillation where logits from the teacher are unavailable.

Minimal snippet

from trl import SFTTrainer, SFTConfig

cfg = SFTConfig(
    output_dir="./student_out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    max_seq_length=512
)

trainer = SFTTrainer(
    model=student_model,
    args=cfg,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    processing_class=tokenizer,
)

trainer.train()

Strengths vs raw PyTorch

  • very small amount of code,
  • great fit for response distillation,
  • integrates well with HF/PEFT stacks.

Weaknesses vs raw PyTorch

  • not intended for feature KD,
  • less natural for classical classifier KD.

12.7 lightning for clean distillation loops

When to use it

Use it when you want:

  • clean engineering separation,
  • callbacks,
  • multi-GPU structure,
  • long-lived training codebase.

Minimal snippet

import torch
import lightning as L
import torch.nn.functional as F

class LitKD(L.LightningModule):
    def __init__(self, teacher, student, temperature=4.0, alpha=0.7):
        super().__init__()
        self.teacher = teacher.eval()
        self.student = student
        self.temperature = temperature
        self.alpha = alpha
        for p in self.teacher.parameters():
            p.requires_grad = False

    def training_step(self, batch, batch_idx):
        labels = batch["labels"]
        s = self.student(**batch).logits
        with torch.no_grad():
            t = self.teacher(**batch).logits

        ce = F.cross_entropy(s, labels)
        kd = F.kl_div(
            F.log_softmax(s / self.temperature, dim=-1),
            F.softmax(t / self.temperature, dim=-1),
            reduction="batchmean"
        ) * (self.temperature ** 2)

        loss = (1 - self.alpha) * ce + self.alpha * kd
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # only the student's parameters are trainable
        return torch.optim.AdamW(self.student.parameters(), lr=3e-5)

Strengths vs raw PyTorch

  • easier large-project organization,
  • strong callback/logging pattern,
  • good for repeatable MLOps pipelines.

Weaknesses vs raw PyTorch

  • adds another abstraction layer,
  • sometimes slower to debug unusual distributed issues.

12.8 Optimum Intel / OpenVINO

When to use it

Use it when your distilled student must run mainly on:

  • Intel CPUs,
  • Intel iGPU / accelerator environments,
  • edge or desktop inference.

Minimal snippet

from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer

model = OVModelForSequenceClassification.from_pretrained(
    "./tinybert_student",  # export the trained checkpoint; the INT8 ONNX dir is for ONNX Runtime
    export=True
)
tokenizer = AutoTokenizer.from_pretrained("./tinybert_student")

inputs = tokenizer("romance manga with adult cast", return_tensors="pt")
outputs = model(**inputs)

Strengths vs raw PyTorch

  • strong CPU-first deployment path,
  • useful OpenVINO export and runtime integration,
  • practical for low-latency CPU serving.

Weaknesses vs raw PyTorch

  • mainly inference-focused,
  • best after model training is finished.

12.9 llama.cpp / llama-cpp-python

When to use it

Use it after distillation when you want to serve a quantized student locally or on low-cost hardware.

Minimal snippet

from llama_cpp import Llama

llm = Llama(
    model_path="./manga-student-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=20
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Suggest mature romance manga"}]
)
print(resp["choices"][0]["message"]["content"])

Strengths vs raw PyTorch

  • simple local serving,
  • excellent quantized inference story,
  • strong Apple Silicon and CPU usability.

Weaknesses vs raw PyTorch

  • serving stack, not training stack,
  • you usually convert/export into it after training elsewhere.

13. Hardware-Specific Distillation Strategies

13.1 A. Single A100 80GB

Best use cases

  • DistilBERT → TinyBERT full KD
  • Llama 8B LoRA response distillation
  • large batch experiments
  • frequent checkpoint sweeps

Precision

  • bf16 preferred for training on A100
  • fallback: fp16 if library path requires it
  • keep optimizer states in standard mixed-precision defaults

Realistic batch sizes

| Workload | Batch size | Notes |
|---|---|---|
| TinyBERT classifier KD | 128 | teacher + student + features fit comfortably |
| TinyBERT feature KD with seq 128 | 96 | if attention tensors are kept |
| Llama 3 8B LoRA SFT | 8 | good starting micro-batch |
| Llama 3 8B full finetune | usually not recommended here | LoRA/QLoRA preferred |

Throughput dry-run numbers

| Workload | Throughput | Epoch time |
|---|---|---|
| TinyBERT feature KD | ~320 seq/s | ~2.5 min / epoch on 48K rows |
| TinyBERT output KD | ~650 seq/s | ~1.2 min / epoch on 48K rows |
| Llama 3 8B LoRA, seq 512 | ~5,800 tok/s | ~47 min / epoch on 32K rows |

Bottlenecks to watch

  • hidden-state extraction cost in feature KD,
  • dataloader underfeeding GPU,
  • sequence padding waste,
  • checkpoint save stalls,
  • teacher forward pass doubling compute in online KD.

Important engineering note

For managed teachers like OpenAI, do not keep the teacher in the training loop.
Generate teacher outputs first, then train only the student.


13.2 B. Multi-GPU — 2×A10G or 4×A100

DDP vs FSDP for distillation

Use DDP when:

  • both teacher and student fit per GPU,
  • you want simpler debugging,
  • the student is moderate size.

Use FSDP when:

  • the student does not fit comfortably as a full replica,
  • optimizer memory is the problem,
  • you want bigger effective context or batch.

Distillation-specific problem

The teacher must be available on every rank if teacher inference happens online.

You have three patterns:

  1. Replicate frozen teacher on every rank
    easiest, best for small teachers.

  2. Precompute teacher outputs
    best for managed teachers or large teachers.

  3. FSDP student, replicated teacher
    best hybrid for medium teacher + larger student.

DDP pattern with a replicated frozen teacher

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

teacher = TeacherModel().to(local_rank)
student = StudentModel().to(local_rank)

# freeze the teacher and leave it unwrapped: DDP refuses modules with no
# trainable parameters, and a frozen replica needs no gradient sync anyway
for p in teacher.parameters():
    p.requires_grad = False
teacher.eval()

student = DDP(student, device_ids=[local_rank])

for batch in train_loader:
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    loss = kd_loss(s_logits, t_logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

When to prefer accelerate

Use accelerate when:

  • you want one codepath across 1 GPU / many GPUs,
  • you may switch between DDP and FSDP,
  • you want HF integration without hand-writing launch logic.

Minimal pattern:

import torch
from accelerate import Accelerator

accelerator = Accelerator()
student, optimizer, train_loader = accelerator.prepare(student, optimizer, train_loader)

teacher.to(accelerator.device)
teacher.eval()

for batch in train_loader:
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    loss = kd_loss(s_logits, t_logits, batch["labels"])
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

Expected scaling dry-run numbers

| Hardware | Workload | Aggregate throughput | Practical note |
|---|---|---|---|
| 2×A10G | TinyBERT KD | ~430 seq/s | ~1.35× to 1.5× over single A10G |
| 4×A100 | TinyBERT KD | ~1,150 seq/s | near-linear if input pipeline is healthy |
| 2×A10G | Llama 8B LoRA | ~1,700 tok/s | use gradient accumulation heavily |
| 4×A100 | Llama 8B LoRA/FSDP | ~17,000 tok/s | strong fit for fast checkpoint sweeps |

DDP vs FSDP tradeoff summary

| Choice | Good | Bad |
|---|---|---|
| DDP | simpler, stable, easy teacher replication | duplicates model memory |
| FSDP | lower memory, bigger models | more complex debugging and checkpointing |
| precomputed teacher cache | fastest train loop, cheapest runtime | more storage and preprocessing |

13.3 C. AWS Lambda / CPU-only inference of the distilled student

This is where distillation usually pays real business value.

Export flow

flowchart LR
    A[PyTorch student checkpoint] --> B[ONNX export with Optimum]
    B --> C[INT8 quantization]
    C --> D[Package model artifact]
    D --> E[Lambda container or zip]
    E --> F[Provisioned Concurrency optional]

ONNX export with optimum

from optimum.onnxruntime import ORTModelForSequenceClassification

model = ORTModelForSequenceClassification.from_pretrained(
    "./tinybert_student",
    export=True
)
model.save_pretrained("./tinybert_student_onnx")

INT8 question: does accuracy degrade further from 89.3%?

In the dry run, yes, but only slightly.

| Variant | Accuracy | Delta vs FP32 distilled |
|---|---|---|
| TinyBERT distilled fp32 | 89.3% | baseline |
| ONNX dynamic INT8 | 89.0% | -0.3 |
| ONNX static INT8 | 88.8% | -0.5 |

This is usually acceptable if the latency gain is material.

Why memory allocation matters

Lambda allocates CPU power proportional to memory.
That means 1 GB is not just “more memory”; it is also more CPU.

Expected Lambda latency dry run

| Memory | Model format | p50 warm | p95 warm | Cold start p95 | Notes |
|---|---|---|---|---|---|
| 512 MB | PyTorch fp32 | 31 ms | 44 ms | 240 ms | misses strict target |
| 512 MB | ONNX INT8 | 18 ms | 29 ms | 148 ms | usable but tight |
| 1024 MB | ONNX INT8 | 11 ms | 18 ms | 103 ms | recommended |
| 1536 MB | ONNX INT8 | 9 ms | 15 ms | 95 ms | diminishing returns |
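
A rough local sanity check of warm latency with onnxruntime. The quantized model filename is an assumption about how the artifact was saved, and real p50/p95 must still be measured on Lambda itself, since CPU share scales with the memory setting:

import time

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

sess = ort.InferenceSession("./tinybert_student_int8/model_quantized.onnx",
                            providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained("./tinybert_student")
enc = tok("where is my order", return_tensors="np",
          padding="max_length", max_length=64)
feed = {i.name: enc[i.name] for i in sess.get_inputs()}

samples = []
for _ in range(200):
    t0 = time.perf_counter()
    sess.run(None, feed)
    samples.append((time.perf_counter() - t0) * 1000)
print(f"p50={np.percentile(samples, 50):.1f} ms  p95={np.percentile(samples, 95):.1f} ms")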

Practical recommendation

For MangaAssist intent classification:

  • deploy ONNX INT8
  • start with 1024 MB
  • move to 512 MB only if traffic cost pressure is high and p95 still meets SLO

13.4 D. Apple Silicon (M2 / M3) for local development

What it is good for

  • correctness testing,
  • small dry runs,
  • prompt formatting validation,
  • tiny classifier distillation experiments,
  • local quantized student inference.

PyTorch MPS basics

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)
batch = {k: v.to(device) for k, v in batch.items()}

Practical guidance

  • prefer fp16 or fp32-style workflows for local experimentation,
  • keep batch sizes conservative,
  • expect some operator gaps or performance differences versus CUDA,
  • use Apple Silicon mainly for development, not final throughput claims.

Realistic local batch sizes

| Workload | M2/M3 batch size |
|---|---|
| TinyBERT KD, seq 64 | 8–16 |
| TinyBERT inference | 32–64 |
| 7B/8B LoRA toy run | 1–2 |
| quantized local inference via llama.cpp | depends on RAM and quant level |

Dry-run throughput

| Workload | Throughput |
|---|---|
| TinyBERT training on MPS | ~110 seq/s |
| TinyBERT inference on MPS | ~260 seq/s |
| 8B quantized local generation | ~18–45 tok/s depending on quant + memory |

Main limitations vs CUDA

  • less mature distributed story,
  • smaller effective memory ceiling,
  • weaker training throughput,
  • local results are useful for debugging but should not be treated as production capacity numbers.

14. Example Fine-Tuning Evolution Narratives

These are the kinds of summaries that should be written into experiment notes after every run.

14.1 TinyBERT run summary

  • Epoch 1–2: student learns coarse teacher structure; rare-class recall jumps most here.
  • Epoch 3–4: student begins separating confusable intents cleanly.
  • Epoch 5: best overall tradeoff between accuracy, rare recall, and refusal precision.
  • Epoch 6+: training loss still falls, but user-facing metrics flatten.

14.2 OpenAI managed-student run summary

  • Epoch 1: student learns answer format and common recommendation phrasing.
  • Epoch 2: factuality improves because teacher response structure is internalized.
  • Epoch 3: best checkpoint; hallucinations lowest while format quality stays strong.
  • Epoch 4: style becomes more rigid and catalog hallucinations creep up.

14.3 Llama 8B run summary

  • Epoch 1: formatting and response skeleton improve.
  • Epoch 2: answer relevance rises sharply.
  • Epoch 3: best human score / hallucination balance.
  • Epoch 4: no meaningful answer quality gain, more memorized phrasing.

15. Deployment Decision Framework

15.1 Promotion checklist

A distilled student is promoted only if all pass:

| Category | Gate |
|---|---|
| quality | teacher preference match meets target |
| safety | hallucination/refusal metrics meet target |
| latency | p95 meets SLO |
| cost | at least 50% lower if this is a cost-driven project |
| rollback | last-known-good model available |
| observability | dashboards and alerts are live |

15.2 Shadow deployment plan

  1. Route 1–5% of traffic to the student in shadow mode.
  2. Compare answer category, refusal rate, and escalation rate.
  3. Human review only the disagreement bucket.
  4. Promote to 10%, then 25%, then 50%, then 100%.

15.3 Rollback triggers

Rollback immediately if any of these cross threshold (a check sketch follows the list):

  • hallucination rate +1.0 point above baseline,
  • refusal precision -2.0 points below baseline,
  • rare-class recall -3.0 points,
  • p95 latency +20% vs approved benchmark,
  • support tickets on recommendation quality spike.
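
The same triggers as a sketch over live vs baseline metric dicts (rates as fractions, latency in ms; the ticket-spike trigger is left to a separate alerting system):

def should_rollback(live, baseline):
    return any([
        live["hallucination_rate"] - baseline["hallucination_rate"] > 0.010,
        baseline["refusal_precision"] - live["refusal_precision"] > 0.020,
        baseline["rare_class_recall"] - live["rare_class_recall"] > 0.030,
        live["p95_latency_ms"] > 1.20 * baseline["p95_latency_ms"],
    ])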

16. Recommended Stacks by Scenario

| Scenario | Recommended stack |
|---|---|
| classic NLP classifier KD | transformers + custom Trainer |
| paper-like feature KD experiment | TextBrewer or raw PyTorch |
| response distillation into open model | trl + PEFT + HF |
| managed-model distillation | OpenAI teacher generation + OpenAI SFT |
| CPU deployment | optimum + ONNX + optionally Optimum Intel |
| local quantized serving | llama.cpp / llama-cpp-python |
| long-lived training codebase | lightning or HF + accelerate |

17. Final Recommendations for MangaAssist

17.1 What to run first

Run these in order:

  1. DistilBERT → TinyBERT
    Fastest proof of value. Cheap. Clear metrics.

  2. Managed response distillation
    gpt-4.1 teacher → fine-tuned gpt-4.1-mini.

  3. Self-hosted fallback distillation
    same teacher outputs → Llama 3 8B student.

  4. Reranker distillation + ONNX INT8
    only if reranking latency is still a bottleneck.

17.2 Best stopping points from the dry runs

| Run | Best checkpoint |
|---|---|
| TinyBERT classifier | Stage 2, epoch 5 |
| OpenAI managed student | epoch 3 |
| Llama 3 8B student | epoch 3 |
| Reranker | epoch 3 |

17.3 Core lesson

The correct question is not “Did the loss go down?”
The correct question is:

“At which checkpoint did the student become cheap and fast enough, while still preserving safety, factuality, and user-perceived quality?”

That checkpoint is the one to deploy.


18. Appendix — Extra Useful Metrics

18.1 Teacher confidence spread

[ \text{confidence spread} = p_{top1} - p_{top2} ]

Low spread means ambiguous teacher signal.
These examples are good for:

  • soft labels,
  • human review prioritization,
  • rare-class calibration.
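
A minimal sketch, assuming a batch of teacher logits is available; the 0.15 threshold is an illustrative choice:

import torch.nn.functional as F

def confidence_spread(logits):
    # p_top1 - p_top2 per example; low spread marks ambiguous teacher signal
    top2 = F.softmax(logits, dim=-1).topk(2, dim=-1).values
    return top2[:, 0] - top2[:, 1]

spread = confidence_spread(teacher_logits)  # teacher_logits: (batch, num_classes)
ambiguous = spread < 0.15                   # route these rows to human review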

18.2 Example ranking metrics

[ NDCG@k = \frac{DCG@k}{IDCG@k} ]

with

[ DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)} ]
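
The same formulas as a direct sketch, with relevance grades listed in ranked order:

import math

def dcg_at_k(rels, k):
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)  # IDCG: best possible ordering
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))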

18.3 Cost reduction formula

[ \text{cost reduction} = \frac{\text{teacher cost} - \text{student cost}}{\text{teacher cost}} ]

Example:

  • teacher cost per 1K responses = 12.4
  • student cost per 1K responses = 4.8

[ (12.4 - 4.8) / 12.4 = 0.6129 \approx 61.3\% ]

18.4 Hallucination rate

[ \text{hallucination rate} = \frac{\text{responses with unsupported factual claims}}{\text{all evaluated responses}} ]

Example:

  • unsupported claims in sample = 29
  • evaluated responses = 800

[ 29/800 = 0.03625 = 3.6\% ]


19. Appendix — Mermaid Diagram for Distillation Decisions

flowchart TD
    A[Need lower cost or latency?] -->|No| B[Keep teacher in production]
    A -->|Yes| C[Can smaller zero-shot model already pass?]
    C -->|Yes| D[Use smaller base model directly]
    C -->|No| E[Can teacher outputs be collected offline?]
    E -->|Yes| F[Run distillation]
    E -->|No| G[Use online KD only if teacher cost is acceptable]
    F --> H[Offline eval]
    H --> I{Pass quality + safety + latency gates?}
    I -->|No| J[Revise data / temperature / student size]
    I -->|Yes| K[Shadow deploy]
    K --> L{Shadow stable?}
    L -->|No| M[Rollback]
    L -->|Yes| N[Promote]

20. Appendix — Source-aware Implementation Notes

This document expands the original MangaAssist distillation write-up with:

  • the original DistilBERT → TinyBERT and Claude/LLM-style examples,
  • concrete OpenAI-based distillation options,
  • additional hardware and deployment planning,
  • production-log-first dry-run analysis.

Keep the original baseline metrics and diagrams as the “teacher” document, and use this one as the operations + implementation expansion.


21. Official Docs and Repositories to Check While Implementing

These are the main docs/repositories worth checking while turning the dry run into a real pipeline:

  • Hugging Face transformers Trainer documentation
  • Hugging Face trl SFTTrainer documentation
  • Hugging Face optimum ONNX Runtime quantization documentation
  • Hugging Face optimum-intel / OpenVINO documentation
  • Hugging Face accelerate documentation
  • PyTorch DistributedDataParallel documentation
  • PyTorch FullyShardedDataParallel documentation
  • PyTorch MPS backend documentation
  • AWS Lambda memory/CPU allocation documentation
  • OpenAI supervised fine-tuning guide
  • OpenAI model optimization guide
  • OpenAI distillation cookbook example
  • TextBrewer GitHub repository and docs
  • Knowledge-Distillation-Zoo GitHub repository
  • Lightning-AI repository/docs
  • llama.cpp repository
  • llama-cpp-python documentation

For production use, always re-check:

  • supported model versions,
  • fine-tuning availability,
  • export/quantization compatibility,
  • hardware backend support,
  • recent release notes before locking the pipeline.