
MangaAssist Knowledge Distillation — Production Dry Run, Library Comparison, and Hardware Playbook

This document expands the original MangaAssist distillation pipeline into a production-focused runbook with:

  • numerical dry runs,
  • epoch-by-epoch training evolution,
  • early stopping logic,
  • production log examples,
  • hardware-specific strategies,
  • library ecosystem comparison,
  • OpenAI-based distillation options,
  • Mermaid diagrams for engineering reviews.

0. Scope and Reading Guide

This document covers four practical distillation tracks for MangaAssist:

  1. Intent classifier distillation
    DistilBERT teacher → TinyBERT student.

  2. Response model distillation (managed-model path)
    OpenAI gpt-4.1 teacher → fine-tuned gpt-4.1-mini student.

  3. Response model distillation (self-hosted path)
    Teacher outputs from a stronger managed LLM → LoRA-tuned Llama 3 8B student.

  4. Reranker distillation
    Large cross-encoder teacher → compact student reranker.

Where exact throughput or latency numbers are not guaranteed by vendors, they are marked as dry-run engineering estimates and should be treated as planning numbers, not contractual benchmarks.


1. What Distillation Means in MangaAssist

Knowledge distillation trains a smaller student model to imitate a stronger teacher model.

For MangaAssist, distillation matters when the teacher is too expensive or too slow to use on every request:

  • a managed LLM gives the best final answer quality,
  • a large reranker lifts ranking quality but breaks latency budgets,
  • an intent classifier needs to run on CPU or Lambda,
  • unlabeled production logs need weak supervision from a stronger model.

1.1 MangaAssist production targets

| Component | Teacher | Student | Why distill |
|---|---|---|---|
| Intent classification | DistilBERT (66M) | TinyBERT (14.5M) | lower Lambda latency, smaller cold-start memory |
| Response model | gpt-4.1 / strong managed LLM | gpt-4.1-mini or Llama 3 8B | lower serving cost, lower p95 latency |
| Reranker | large cross-encoder | compact 4-layer ONNX reranker | fit within 15–20 ms inline ranking budget |
| Ambiguous intent resolution | ensemble + rules + adjudicator | single distilled classifier | simpler deployment path |

1.2 Success criteria

A student is useful only if the total system gets better on business constraints, not just model loss:

  • quality stays inside promotion gates,
  • latency meets SLO,
  • cost drops enough to matter,
  • operational simplicity improves,
  • safety/refusal behavior does not regress.

2. End-to-End Distillation Flow

flowchart TD
    A[Production traffic logs] --> B[Filter PII / deduplicate / stratify]
    B --> C[Build distillation dataset]
    C --> D[Teacher labeling]
    D --> E[Hard labels + soft labels + metadata]
    E --> F[Student training]
    F --> G[Offline evaluation]
    G --> H[Shadow evaluation]
    H --> I[Canary deploy]
    I --> J[Promotion or rollback]

    C --> C1[Human corrected set]
    C --> C2[Refusal and escalation set]
    C --> C3[Rare intents / long-tail queries]
    D --> D1[Teacher scores]
    D --> D2[Teacher responses]
    D --> D3[Teacher uncertainty]
    G --> G1[Accuracy / win rate]
    G --> G2[Hallucination / refusal precision]
    G --> G3[Latency / cost]

3. Dataset Design for MangaAssist

We start from the production-oriented scenario and make it concrete.

3.1 Response distillation dataset

| Source | Count | Purpose |
|---|---|---|
| production prompts | 25,000 | realistic traffic distribution |
| teacher responses | 25,000 | target behavior |
| human-corrected responses | 5,000 | fix teacher mistakes |
| refusal / escalation examples | 2,000 | preserve support behavior |
| total supervised rows | 32,000 | final response training set |

3.2 Suggested split

| Split | Count | Notes |
|---|---|---|
| train | 25,600 | 80% |
| validation | 3,200 | 10% |
| test / gate | 3,200 | 10%, frozen |
| golden human review set | 500 | never used for training |
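
A minimal sketch of this split, assuming each example carries an intent label for stratification so rare intents appear in every split (scikit-learn usage; the function and field names are illustrative):

from sklearn.model_selection import train_test_split

def make_splits(rows, labels, seed=13):
    # 80% train / 20% holdout, stratified by label
    train, hold, y_train, y_hold = train_test_split(
        rows, labels, test_size=0.20, stratify=labels, random_state=seed
    )
    # split the holdout evenly: 10% validation, 10% frozen test / gate set
    val, gate, _, _ = train_test_split(
        hold, y_hold, test_size=0.50, stratify=y_hold, random_state=seed
    )
    return train, val, gate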

3.3 Intent classifier dataset

For classifier distillation, use a larger but lighter dataset:

| Source | Count |
|---|---|
| labeled intent examples | 12,000 |
| paraphrased augmentations | 12,000 |
| recent unlabeled queries + teacher soft labels | 20,000 |
| rare-escalation and high-risk phrases | 4,000 |
| total | 48,000 |

3.4 Why production logs matter

Production logs tell you things a clean benchmark usually hides:

  • where users are ambiguous,
  • where teacher confidence is low,
  • which intents are rare but high-risk,
  • which prompts are long and expensive,
  • where refusals are triggered,
  • where the teacher hallucinates catalog or shipping facts.

A good distillation run does not only log the final answer. It logs the decision process around that answer.


4. What to Log During Distillation

4.1 Training-time logs

These logs are the minimum useful set:

| Log field | Why it matters |
|---|---|
| epoch | locate training stage |
| step | trend within epoch |
| loss_total | overall optimization |
| loss_hard | fit to labels |
| loss_kd | fit to teacher distribution |
| loss_feature | hidden-state matching quality |
| loss_attention | attention transfer quality |
| grad_norm | detect instability |
| lr | tie behavior to schedule |
| tokens_per_sec or samples_per_sec | training efficiency |
| gpu_mem_gb | capacity planning |
| teacher_entropy_mean | teacher softness / ambiguity level |
| student_entropy_mean | whether student is overconfident |
| rare_class_recall | long-tail protection |
| refusal_precision | policy safety |
| hallucination_rate | factuality check |

4.2 Evaluation-time logs

| Metric | Meaning |
|---|---|
| teacher preference match | how often student matches teacher choice |
| human win rate vs base student | student quality lift vs pre-distilled baseline |
| catalog hallucination rate | factual risk on catalog facts |
| escalation precision | whether support escalation is triggered correctly |
| refusal precision / recall | whether unsafe/out-of-scope queries are handled correctly |
| cost per 1K responses | business reason for distillation |
| p50 / p95 / p99 latency | deployment readiness |

4.3 Example production log schema

{
  "event": "distill_epoch_end",
  "run_id": "kd_resp_v12",
  "student": "gpt-4.1-mini-ft-v12",
  "teacher": "gpt-4.1-2025-04-14",
  "epoch": 3,
  "loss_total": 0.812,
  "loss_ce": 0.471,
  "loss_kd": 0.958,
  "teacher_pref_match": 0.872,
  "human_win_rate_vs_base": 0.681,
  "catalog_hallucination_rate": 0.036,
  "refusal_precision": 0.942,
  "cost_per_1k_responses_usd": 4.80,
  "cost_reduction_vs_teacher": 0.61,
  "p95_latency_ms": 420
}

5. Distillation Math Refresher

5.1 Hard-label loss

[ \mathcal{L}_{hard} = -\sum_{c=1}^{C} y_c \log p_S(c|x) ]

This learns the correct class, but it throws away the teacher's view of near-miss classes.

5.2 Soft-label KD loss

[ \mathcal{L}_{KD} = T^2 \cdot D_{KL}(p_T^{(T)} \Vert p_S^{(T)}) ]

with

[ p_i^{(T)} = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} ]

and combined loss:

[ \mathcal{L} = (1-\alpha)\mathcal{L}_{hard} + \alpha \mathcal{L}_{KD} ]

For hidden-state matching:

[ \mathcal{L}_{feature} = \sum_l \text{MSE}(W_l h_S^{(l)}, h_T^{(m(l))}) ]

For attention transfer:

[ \mathcal{L}_{attn} = \sum_l \text{MSE}(A_S^{(l)}, A_T^{(m(l))}) ]

Final total for TinyBERT-style two-stage training:

[ \mathcal{L}_{total} = \lambda_{KD}\mathcal{L}_{KD} + \lambda_{CE}\mathcal{L}_{hard} + \lambda_{feat}\mathcal{L}_{feature} + \lambda_{attn}\mathcal{L}_{attn} ]
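
A minimal PyTorch sketch of this four-term objective, assuming per-layer hidden states and attention maps have already been collected, layer_map pairs each student layer l with its teacher layer m(l), and proj holds the learned projections W_l (all names are illustrative):

import torch.nn.functional as F

def tinybert_total_loss(s_logits, t_logits, labels,
                        s_hidden, t_hidden, s_attn, t_attn,
                        proj, layer_map, T=4.0,
                        lam_kd=0.7, lam_ce=0.3, lam_feat=1.0, lam_attn=1.0):
    ce = F.cross_entropy(s_logits, labels)
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # hidden-state matching through the learned projection W_l
    feat = sum(F.mse_loss(proj[l](s_hidden[l]), t_hidden[m]) for l, m in layer_map)
    # attention map transfer on the same layer mapping
    attn = sum(F.mse_loss(s_attn[l], t_attn[m]) for l, m in layer_map)
    return lam_kd * kd + lam_ce * ce + lam_feat * feat + lam_attn * attn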


6. Dry Run A — DistilBERT → TinyBERT Intent Distillation

This is the best first production dry run because it is cheap, fast, and easy to evaluate.

6.1 Setup

  • Teacher: fine-tuned DistilBERT, 66M params
  • Student: TinyBERT 4L, 14.5M params
  • Classes: 10 intent classes
  • Temperature: 4
  • Alpha: 0.7
  • Max sequence length: 64
  • Training mode:
    • Stage 1: feature matching
    • Stage 2: output KD + CE
  • Training dataset: 48,000 rows
  • Validation set: 4,800
  • Gate set: 4,800

6.2 Stage 0 baseline

| Model | Accuracy | Rare-class recall | Refusal precision | Warm latency |
|---|---|---|---|---|
| Teacher DistilBERT | 92.1% | 88.4% | 96.8% | 15 ms |
| Base TinyBERT, no KD | 84.6% | 73.9% | 90.7% | 5 ms |

The raw gap to close is:

  • accuracy gap = 92.1 - 84.6 = 7.5 points
  • rare-class recall gap = 88.4 - 73.9 = 14.5 points

6.3 Stage 1 — feature matching dry run

Assume:

  • A100 80 GB
  • bf16
  • batch size 128
  • seq len 64
  • teacher and student in memory at once

Steps per epoch:

[ \text{steps/epoch} = \lceil 48,000 / 128 \rceil = 375 ]

If effective throughput is ~320 sequences/sec including teacher forward + hidden-state extraction, then:

[ 48,000 / 320 \approx 150 \text{ sec} \approx 2.5 \text{ min / epoch} ]

For 6 epochs, estimated wall-clock:

[ 6 \times 2.5 = 15 \text{ min} ]
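
The same arithmetic as a tiny planning helper; the throughput inputs are dry-run estimates, not guarantees:

def epoch_minutes(rows, seq_per_sec):
    # wall-clock estimate for one full pass over the dataset
    return rows / seq_per_sec / 60

print(epoch_minutes(48_000, 320))  # ~2.5 min/epoch, Stage 1 feature KD
print(epoch_minutes(48_000, 650))  # ~1.2 min/epoch, Stage 2 output KD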

6.4 Stage 1 epoch table

| Epoch | Feature loss | Attention loss | Val acc (probe head) | Rare recall | Notes |
|---|---|---|---|---|---|
| 1 | 1.842 | 0.913 | 78.4% | 66.2% | student still unstable |
| 2 | 1.221 | 0.642 | 81.0% | 69.4% | large gain, hidden states aligning |
| 3 | 0.944 | 0.501 | 82.6% | 71.3% | gradients smooth |
| 4 | 0.791 | 0.438 | 83.2% | 72.0% | diminishing returns start |
| 5 | 0.714 | 0.401 | 83.5% | 72.4% | small gain |
| 6 | 0.701 | 0.392 | 83.5% | 72.5% | plateau |

6.5 Stage 1 stopping rule

Stop Stage 1 when all are true for 2 consecutive epochs:

  • feature loss improvement < 2%
  • attention loss improvement < 2%
  • probe-head val accuracy gain < 0.2 points

That happens at epoch 5–6, so we stop after epoch 6.

6.6 Stage 2 — output KD + CE dry run

Start from the Stage 1 checkpoint.

Assume:

  • batch size 128
  • same sequence length
  • output-only KD is cheaper than feature KD
  • effective throughput ~650 seq/sec

Epoch time:

[ 48,000 / 650 \approx 74 \text{ sec} \approx 1.2 \text{ min / epoch} ]

For 8 epochs, estimated time:

[ 8 \times 1.2 \approx 9.6 \text{ min} ]

6.7 Stage 2 epoch table

| Epoch | CE loss | KD loss | Total loss | Val acc | Rare recall | Refusal precision | Notes |
|---|---|---|---|---|---|---|---|
| 1 | 0.921 | 1.404 | 1.259 | 86.8% | 77.4% | 92.1% | big jump from Stage 1 |
| 2 | 0.772 | 1.115 | 1.012 | 88.1% | 79.9% | 93.4% | teacher structure learned |
| 3 | 0.681 | 0.982 | 0.892 | 88.8% | 81.1% | 94.2% | stable |
| 4 | 0.622 | 0.914 | 0.826 | 89.2% | 82.2% | 94.8% | near best |
| 5 | 0.593 | 0.878 | 0.792 | 89.3% | 82.6% | 95.1% | best val accuracy |
| 6 | 0.582 | 0.871 | 0.785 | 89.3% | 82.8% | 95.0% | no real gain |
| 7 | 0.579 | 0.870 | 0.784 | 89.2% | 82.7% | 94.8% | slight overfit signs |
| 8 | 0.567 | 0.873 | 0.781 | 89.1% | 82.3% | 94.5% | overfit |

6.8 Where to stop

We stop at epoch 5 for deployment candidate selection because:

  • best validation accuracy is first reached at epoch 5,
  • rare recall keeps improving slightly after that, but overall val accuracy does not,
  • refusal precision starts flattening,
  • epoch 7–8 shows classic memorization: training loss still falls, validation stops improving.

6.9 Classifier gate outcome

| Gate | Threshold | Epoch 5 result | Pass |
|---|---|---|---|
| accuracy | >= 89.0% | 89.3% | yes |
| rare-class recall | >= 80.0% | 82.6% | yes |
| refusal precision | >= 94.0% | 95.1% | yes |
| Lambda warm latency | <= 6 ms | 5 ms | yes |
| Lambda cold p95 | <= 150 ms | 122 ms | yes |

6.10 Final classifier business result

| Metric | Teacher | Distilled student | Delta |
|---|---|---|---|
| Accuracy | 92.1% | 89.3% | -2.8 |
| Rare recall | 88.4% | 82.6% | -5.8 |
| Warm latency | 15 ms | 5 ms | 3.0× faster |
| Model size | 264 MB | 58 MB | 4.6× smaller |
| Cost per 1M requests (CPU/Lambda dry run) | $38 | $17 | 55% lower |

7. Dry Run B — Response Distillation with OpenAI Models

This is the managed-model path.

7.1 Why include OpenAI here

If MangaAssist wants a smaller managed student instead of self-hosting, a practical path is:

  • teacher: gpt-4.1
  • student base: gpt-4.1-mini
  • fine-tuning method: supervised fine-tuning using teacher outputs + human corrections + refusal data

This is not pure logit-level KD. It is response distillation through SFT.

7.2 Data recipe

We reuse the 32,000 row response dataset:

  • 25,000 production prompts
  • 25,000 teacher answers
  • 5,000 human-corrected answers
  • 2,000 refusal/escalation examples

We transform each row into a chat-style training example:

{
  "messages": [
    {"role": "system", "content": "You are MangaAssist. Follow catalog-safe, retrieval-grounded answer rules."},
    {"role": "user", "content": "I want romance manga with adult characters."},
    {"role": "assistant", "content": "Here are three good options..."}
  ]
}

7.3 Label weighting strategy

Not all rows should count equally; a weighting sketch follows the table below.

| Row type | Weight | Why |
|---|---|---|
| human-corrected | 2.0 | best ground truth |
| teacher response, clean | 1.0 | useful target |
| refusal / escalation | 2.5 | safety critical |
| low-confidence teacher rows | 0.5 | avoid copying ambiguity too strongly |
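
Managed fine-tuning APIs generally do not accept fractional per-row weights, so one practical approximation is repetition when building the JSONL. A hedged sketch, assuming rows are dicts with hypothetical prompt, answer, and row_type fields:

import json
import random

WEIGHTS = {"human_corrected": 2.0, "teacher_clean": 1.0,
           "refusal": 2.5, "low_confidence_teacher": 0.5}

SYSTEM = "You are MangaAssist. Follow catalog-safe, retrieval-grounded answer rules."

def copies_for(weight):
    # whole part -> guaranteed copies; fractional part -> sampling probability
    # (so 2.5 emits 2 or 3 copies, 0.5 emits 0 or 1)
    base = int(weight)
    return base + (1 if random.random() < weight - base else 0)

def write_weighted_jsonl(rows, path="mangaassist_distill_train.jsonl"):
    with open(path, "w") as f:
        for row in rows:
            example = {"messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": row["prompt"]},
                {"role": "assistant", "content": row["answer"]},
            ]}
            for _ in range(copies_for(WEIGHTS[row["row_type"]])):
                f.write(json.dumps(example) + "\n")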

7.4 Managed-model dry run assumptions

  • training rows: 25,600
  • validation rows: 3,200
  • average prompt + answer length: 420 tokens
  • total tokens / epoch:

[ 25,600 \times 420 = 10,752,000 \text{ tokens} ]

If 4 epochs are trained, the total training token volume is:

[ 4 \times 10.752M = 43.008M \text{ tokens} ]

7.5 Offline evaluation rubric

Each answer is scored on 5 axes:

| Axis | Weight |
|---|---|
| factuality / no catalog hallucination | 0.30 |
| recommendation relevance | 0.25 |
| policy / escalation correctness | 0.20 |
| answer format quality | 0.15 |
| conciseness | 0.10 |

Rubric score:

[ \text{rubric} = 0.30F + 0.25R + 0.20P + 0.15Q + 0.10C ]
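
The rubric as a direct function, with each axis scored 0–5:

def rubric(f, r, p, q, c):
    # f=factuality, r=relevance, p=policy, q=format quality, c=conciseness
    return 0.30 * f + 0.25 * r + 0.20 * p + 0.15 * q + 0.10 * c

print(rubric(4.5, 4.2, 4.8, 4.0, 3.9))  # 4.35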

7.6 Epoch-by-epoch dry run

| Epoch | Val rubric | Teacher preference match | Human win rate vs base student | Hallucination rate | Refusal precision | Notes |
|---|---|---|---|---|---|---|
| 1 | 4.08 / 5 | 81.2% | 59.0% | 5.4% | 91.8% | clear improvement, still weak on catalog facts |
| 2 | 4.23 / 5 | 84.9% | 64.1% | 4.2% | 93.7% | close to gate |
| 3 | 4.31 / 5 | 87.2% | 68.1% | 3.6% | 94.4% | best balanced checkpoint |
| 4 | 4.30 / 5 | 87.4% | 68.4% | 4.1% | 93.8% | slight overfit / style memorization |

7.7 Promotion gate

| Metric | Gate | Epoch 3 |
|---|---|---|
| teacher preference match | >= 85% | 87.2% |
| human win rate vs base student | >= 65% | 68.1% |
| catalog hallucination rate | <= 4% | 3.6% |
| refusal precision | >= 94% | 94.4% |
| cost per 1K responses | >= 50% lower than teacher | 61% lower |

7.8 Why epoch 3 wins over epoch 4

Epoch 4 slightly beats epoch 3 on teacher preference match, but not enough to justify:

  • hallucination rate rises from 3.6% to 4.1%,
  • refusal precision drops,
  • style starts becoming too rigid,
  • support escalation wording becomes overly templated.

This is a classic production tradeoff: a tiny gain in imitation quality is not worth a measurable regression in safety and factuality.

7.9 Example OpenAI-style distillation workflow

from openai import OpenAI

client = OpenAI()

# 1) Upload training file
train_file = client.files.create(
    file=open("mangaassist_distill_train.jsonl", "rb"),
    purpose="fine-tune"
)

# 2) Start fine-tuning job on a smaller student
job = client.fine_tuning.jobs.create(
    model="gpt-4.1-mini-2025-04-14",
    training_file=train_file.id,
    method={"type": "supervised"}
)

print(job.id)

7.10 Example offline teacher-generation step

from openai import OpenAI

client = OpenAI()

def label_with_teacher(prompt: str) -> str:
    resp = client.responses.create(
        model="gpt-4.1-2025-04-14",
        input=[
            {
                "role": "system",
                "content": "You are MangaAssist. Be retrieval-grounded, catalog-safe, and follow escalation policy."
            },
            {"role": "user", "content": prompt},
        ]
    )
    return resp.output_text

7.11 Example response-distillation log

{
  "event": "distilled_model_eval",
  "student": "gpt-4.1-mini-ft-manga-v03",
  "teacher": "gpt-4.1-2025-04-14",
  "epoch": 3,
  "teacher_preference_match": 0.872,
  "human_win_rate_vs_base": 0.681,
  "hallucination_rate": 0.036,
  "refusal_precision": 0.944,
  "cost_reduction": 0.61
}

8. Dry Run C — Self-Hosted Response Distillation into Llama 3 8B

This path is used when MangaAssist wants a self-hosted fallback or cost-controlled serving layer.

8.1 Distillation style

For a hosted teacher such as OpenAI or another managed LLM, we normally do:

  1. teacher inference offline,
  2. save teacher outputs,
  3. fine-tune the student with SFT / LoRA.

That means the teacher is not on GPU during student training.

8.2 Training setup

  • student: Llama 3 8B
  • fine-tuning: LoRA
  • precision: bf16
  • seq length: 512
  • micro-batch: 8
  • gradient accumulation: 8
  • effective batch size: 64
  • dataset: 32,000 examples
  • tokens / example: 512 average padded

Tokens per epoch:

[ 32,000 \times 512 = 16,384,000 ]

If effective training throughput is 5,800 tok/s on a single A100 80 GB with LoRA, then:

[ 16,384,000 / 5,800 \approx 2,825 \text{ sec} \approx 47 \text{ min / epoch} ]

Add eval + checkpoint overhead:

  • training: ~47 min / epoch
  • eval/checkpoint: ~8 min / epoch
  • total: ~55 min / epoch

For 4 epochs:

[ 4 \times 55 \approx 220 \text{ min} \approx 3.7 \text{ hours} ]

8.3 Epoch table

| Epoch | Train loss | Val rubric | Human score (1-5) | Hallucination rate | p95 latency | Notes |
|---|---|---|---|---|---|---|
| 1 | 1.92 | 3.61 | 3.5 | 7.4% | 158 ms | learns format first |
| 2 | 1.41 | 3.83 | 3.8 | 5.5% | 146 ms | content relevance improves |
| 3 | 1.19 | 3.92 | 3.9 | 4.0% | 137 ms | best balanced checkpoint |
| 4 | 1.08 | 3.91 | 3.9 | 4.6% | 135 ms | slight overfit, hallucinations rise |

8.4 Stop rule

Choose the first checkpoint that satisfies all of:

  • rubric score improvement < 0.02 on next epoch,
  • hallucination rate not increasing,
  • support escalation precision not decreasing,
  • no meaningful p95 latency benefit from further tuning.

That selects epoch 3.

8.5 Why not train longer

For distillation, longer training can make the student memorize teacher style more than teacher behavior. In MangaAssist that often shows up as:

  • repeating stock phrases,
  • too much certainty on weak retrieval,
  • nicer formatting but worse factuality,
  • lower diversity on recommendation prompts.

8.6 Example TRL SFTTrainer snippet

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype="bfloat16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token

config = SFTConfig(
    output_dir="./manga_llama_student",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    max_seq_length=512,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    processing_class=tokenizer,
)

trainer.train()

9. Dry Run D — Distilled Reranker

9.1 Why this matters

Rerankers are often the hidden latency offender in RAG systems.

If the teacher cross-encoder improves NDCG@10 but adds 40–60 ms, you often need a smaller student to stay in SLO.

9.2 Student objective

[ \mathcal{L} = \alpha \cdot \mathcal{L}_{sup} + (1-\alpha)\cdot \text{MSE}(s_{student}, s_{teacher}) ]
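
A minimal sketch, with a margin ranking loss standing in for the supervised term (an illustrative choice; the objective above does not prescribe it):

import torch
import torch.nn.functional as F

def reranker_kd_loss(s_scores, t_scores, labels, alpha=0.5):
    # supervised pairwise term over all positive/negative document pairs
    pos = s_scores[labels == 1]
    neg = s_scores[labels == 0]
    pairs_pos = pos.unsqueeze(1).expand(-1, neg.numel()).reshape(-1)
    pairs_neg = neg.unsqueeze(0).expand(pos.numel(), -1).reshape(-1)
    target = torch.ones_like(pairs_pos)
    sup = F.margin_ranking_loss(pairs_pos, pairs_neg, target, margin=1.0)
    # regression onto the teacher's raw relevance scores
    mse = F.mse_loss(s_scores, t_scores)
    return alpha * sup + (1 - alpha) * mse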

9.3 Example dry run numbers

  • teacher NDCG@10: 0.842
  • base student NDCG@10: 0.751
  • distilled student NDCG@10: 0.791
  • teacher p95 latency: 48 ms
  • distilled ONNX student p95 latency: 14 ms

9.4 Epoch table

| Epoch | Pairwise loss | MSE score loss | NDCG@10 | p95 latency | Notes |
|---|---|---|---|---|---|
| 1 | 0.491 | 0.228 | 0.771 | 14 ms | learns teacher ordering quickly |
| 2 | 0.444 | 0.192 | 0.784 | 14 ms | strong gain |
| 3 | 0.437 | 0.183 | 0.791 | 14 ms | best checkpoint |
| 4 | 0.433 | 0.181 | 0.790 | 14 ms | no useful gain |

9.5 Stop condition

Stop at epoch 3 because NDCG gain from epoch 3 to 4 is negative.


10. Early Stopping and Tradeoffs

10.1 Do not stop on training loss alone

Training loss almost always keeps falling even after the student has started to overfit.
For distillation, better stop signals are:

  • validation KD loss,
  • gate-set human score,
  • hallucination rate,
  • refusal precision,
  • rare-class recall.

10.2 Practical stop rules

Classifier stop rules

Stop if 2 of 3 happen for 2 consecutive epochs (see the sketch after this list):

  • validation accuracy improves by < 0.15 points,
  • rare-class recall improves by < 0.25 points,
  • KD loss improves by < 1%.
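
A minimal sketch of this rule, assuming metrics are stored as fractions (so 0.15 points = 0.0015) in per-epoch dicts:

def should_stop_classifier(history, patience=2):
    # history: per-epoch dicts with val_acc, rare_recall, kd_loss
    if len(history) < patience + 1:
        return False
    window = history[-(patience + 1):]
    for prev, cur in zip(window[:-1], window[1:]):
        stalled = [
            cur["val_acc"] - prev["val_acc"] < 0.0015,
            cur["rare_recall"] - prev["rare_recall"] < 0.0025,
            (prev["kd_loss"] - cur["kd_loss"]) / prev["kd_loss"] < 0.01,
        ]
        if sum(stalled) < 2:  # need 2 of 3 signals in every recent epoch
            return False
    return True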

Response model stop rules

Stop if any of these happen after the minimum epoch count:

  • hallucination rate rises by > 0.3 points,
  • human score gain < 0.03,
  • refusal precision drops,
  • teacher preference match gain < 0.2 points.

Reranker stop rules

Stop if:

  • NDCG gain < 0.002 across one epoch,
  • latency target already met,
  • top-3 swap rate on validation set becomes unstable.

10.3 Tradeoff table

| Choice | Benefit | Risk |
|---|---|---|
| higher temperature | more dark knowledge | signal becomes too flat |
| higher alpha | stronger teacher imitation | copies teacher mistakes |
| more epochs | better teacher imitation | hallucination and memorization |
| smaller student | lower latency and cost | capacity floor too low |
| more unlabeled logs | broader coverage | shift/noise if logs are stale |
| more refusal weighting | safer behavior | over-refusal on borderline prompts |

11. Production Logs You Actually Need

11.1 Teacher-label generation logs

{
  "event": "teacher_label_generation",
  "run_id": "labeling_2026_04_21",
  "teacher_model": "gpt-4.1-2025-04-14",
  "rows_processed": 25000,
  "avg_prompt_tokens": 138,
  "avg_response_tokens": 212,
  "teacher_refusal_rate": 0.061,
  "teacher_entropy_mean": 1.74,
  "low_confidence_fraction": 0.084,
  "estimated_cost_usd": 148.20
}

11.2 Classifier epoch logs

{
  "event": "distill_epoch_end",
  "run_id": "tinybert_kd_v05",
  "stage": "output_kd",
  "epoch": 5,
  "loss_total": 0.792,
  "loss_ce": 0.593,
  "loss_kd": 0.878,
  "val_accuracy": 0.893,
  "rare_class_recall": 0.826,
  "refusal_precision": 0.951,
  "throughput_seq_per_sec": 648,
  "gpu_mem_gb": 11.8
}

11.3 Response-model gate logs

{
  "event": "gate_eval",
  "run_id": "resp_openai_kd_v03",
  "student": "gpt-4.1-mini-ft-v03",
  "teacher": "gpt-4.1-2025-04-14",
  "teacher_preference_match": 0.872,
  "human_win_rate_vs_base": 0.681,
  "catalog_hallucination_rate": 0.036,
  "escalation_precision": 0.947,
  "cost_reduction": 0.61,
  "decision": "promote_shadow"
}

11.4 Canary logs

{
  "event": "online_shadow_compare",
  "shadow_student": "gpt-4.1-mini-ft-v03",
  "control_teacher": "gpt-4.1-2025-04-14",
  "sample_size": 5000,
  "student_accept_rate": 0.944,
  "student_escalation_rate": 0.063,
  "teacher_escalation_rate": 0.059,
  "student_catalog_hallucination_rate": 0.039,
  "teacher_catalog_hallucination_rate": 0.021,
  "student_p95_latency_ms": 418,
  "teacher_p95_latency_ms": 922
}

12. Library Ecosystem Comparison

This section answers: Which tool should I use, and when?

12.1 Comparison table

| Library / stack | Ease of use | Output KD | Feature KD | Attention KD | Hardware target | Maintenance view | Best use |
|---|---|---|---|---|---|---|---|
| raw PyTorch | medium-low | yes | yes | yes | GPU / CPU | always viable | max flexibility |
| transformers + custom Trainer | high | yes | yes (manual) | yes (manual) | GPU / CPU | active | best general NLP default |
| TextBrewer | medium | yes | yes | yes | GPU mainly | limited / older | classic NLP KD experiments |
| Knowledge-Distillation-Zoo patterns | low-medium | yes | yes | yes | GPU mainly | limited / reference-only | research baselines |
| optimum | high | no training, yes export | n/a | n/a | CPU / ONNX / edge | active | post-distillation optimization |
| trl SFTTrainer | high | response distillation | no | no | GPU | active | LLM response distillation |
| lightning | medium-high | yes | yes | yes | GPU / multi-GPU | active | clean engineering loops |
| Optimum Intel / OpenVINO | high | n/a | n/a | n/a | Intel CPU / GPU / edge | active | CPU-first inference |
| llama.cpp / llama-cpp-python | high for serving | n/a | n/a | n/a | CPU / Apple Silicon / edge | very active | quantized local serving |

How to read “maintenance view”

  • active: frequent docs, releases, or official ecosystem support
  • limited / older: still usable, but not where you should expect the latest production integrations
  • reference-only: best as a pattern source, not as your primary production framework

12.2 Hugging Face transformers + custom Trainer

When to use it

Use this when you want:

  • standard HF model loading,
  • easy evaluation hooks,
  • mixed precision,
  • multi-GPU support,
  • enough flexibility to add KD losses.

Minimal snippet

import torch
import torch.nn.functional as F
from transformers import Trainer

class KDTrainer(Trainer):
    def __init__(self, teacher_model=None, temperature=4.0, alpha=0.7, **kwargs):
        super().__init__(**kwargs)
        self.teacher = teacher_model.eval()
        self.temperature = temperature
        self.alpha = alpha
        for p in self.teacher.parameters():
            p.requires_grad = False

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs["labels"]
        student_outputs = model(**inputs)
        with torch.no_grad():
            teacher_outputs = self.teacher(**inputs)

        s_logits = student_outputs.logits
        t_logits = teacher_outputs.logits

        ce = F.cross_entropy(s_logits, labels)
        kd = F.kl_div(
            F.log_softmax(s_logits / self.temperature, dim=-1),
            F.softmax(t_logits / self.temperature, dim=-1),
            reduction="batchmean"
        ) * (self.temperature ** 2)

        loss = (1 - self.alpha) * ce + self.alpha * kd
        return (loss, student_outputs) if return_outputs else loss

Strengths vs raw PyTorch

  • less boilerplate,
  • built-in metrics/checkpoints,
  • integrates with accelerate,
  • easy for reproducible training jobs.

Weaknesses vs raw PyTorch

  • feature/attention KD still requires manual plumbing,
  • custom distributed teacher logic can get messy,
  • callback/event model is helpful but not always enough for unusual pipelines.

12.3 TextBrewer

When to use it

Use it when you want a KD-focused NLP library with explicit support for:

  • soft-label distillation,
  • intermediate feature matching,
  • dynamic loss schedules.

Minimal snippet

from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

train_config = TrainingConfig(
    output_dir="./tb_out",
    device="cuda"
)

distill_config = DistillationConfig(
    temperature=4,
    hard_label_weight=0.3,
    kd_loss_weight=0.7
)

distiller = GeneralDistiller(
    train_config=train_config,
    distill_config=distill_config,
    model_T=teacher,
    model_S=student,
    adaptor_T=teacher_adaptor,
    adaptor_S=student_adaptor
)

distiller.train(
    optimizer=optimizer,
    dataloader=train_loader,
    num_epochs=4
)

Strengths vs raw PyTorch

  • KD concepts are first-class,
  • easier feature and attention matching setup,
  • useful for classic BERT/TinyBERT style experiments.

Weaknesses vs raw PyTorch

  • ecosystem feels older,
  • fewer modern production examples,
  • less aligned with current HF + PEFT + LLM workflows.

12.4 Knowledge-Distillation-Zoo patterns

This is better thought of as a pattern repository than a production library.

When to use it

Use it when you want:

  • a fast starting point for many KD losses,
  • paper-reproduction style experimentation,
  • baseline implementations for losses like FitNet, AT, PKT, RKD, etc.

Minimal snippet pattern

# pattern example, not a pip-stable API
logits_loss = kd_criterion(student_logits, teacher_logits)
feat_loss = hint_criterion(student_feat, teacher_feat)

loss = 0.7 * logits_loss + 0.3 * feat_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()

Strengths vs raw PyTorch

  • fast way to inspect many loss designs,
  • good learning/reference resource.

Weaknesses vs raw PyTorch

  • not a full training framework,
  • fewer production ergonomics,
  • best used as inspiration, not as final platform code.

12.5 optimum for ONNX export + quantization

This is usually the next step after distillation.

When to use it

Use it when the student is already good enough and you now want:

  • ONNX export,
  • INT8 quantization,
  • easier CPU deployment.

Minimal snippet

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model = ORTModelForSequenceClassification.from_pretrained("./tinybert_student", export=True)
quantizer = ORTQuantizer.from_pretrained(model)

qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False)
quantizer.quantize(save_dir="./tinybert_student_int8", quantization_config=qconfig)

Strengths vs raw PyTorch

  • much easier ONNX path,
  • simpler CPU optimization workflow,
  • good post-training deployment step.

Weaknesses vs raw PyTorch

  • not a KD trainer by itself,
  • export/quantization edge cases still exist for some architectures.

12.6 trl SFTTrainer for LLM-to-LLM response distillation

When to use it

Use it for:

  • teacher-response imitation,
  • LLM instruction tuning,
  • distillation where logits from the teacher are unavailable.

Minimal snippet

from trl import SFTTrainer, SFTConfig

cfg = SFTConfig(
    output_dir="./student_out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    max_seq_length=512
)

trainer = SFTTrainer(
    model=student_model,
    args=cfg,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    processing_class=tokenizer,
)

trainer.train()

Strengths vs raw PyTorch

  • very small amount of code,
  • great fit for response distillation,
  • integrates well with HF/PEFT stacks.

Weaknesses vs raw PyTorch

  • not intended for feature KD,
  • less natural for classical classifier KD.

12.7 lightning for clean distillation loops

When to use it

Use it when you want:

  • clean engineering separation,
  • callbacks,
  • multi-GPU structure,
  • long-lived training codebase.

Minimal snippet

import torch
import lightning as L
import torch.nn.functional as F

class LitKD(L.LightningModule):
    def __init__(self, teacher, student, temperature=4.0, alpha=0.7):
        super().__init__()
        self.teacher = teacher.eval()
        self.student = student
        self.temperature = temperature
        self.alpha = alpha
        for p in self.teacher.parameters():
            p.requires_grad = False

    def training_step(self, batch, batch_idx):
        labels = batch["labels"]
        s = self.student(**batch).logits
        with torch.no_grad():
            t = self.teacher(**batch).logits

        ce = F.cross_entropy(s, labels)
        kd = F.kl_div(
            F.log_softmax(s / self.temperature, dim=-1),
            F.softmax(t / self.temperature, dim=-1),
            reduction="batchmean"
        ) * (self.temperature ** 2)

        loss = (1 - self.alpha) * ce + self.alpha * kd
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # only the student's parameters are trainable
        return torch.optim.AdamW(self.student.parameters(), lr=3e-5)

Strengths vs raw PyTorch

  • easier large-project organization,
  • strong callback/logging pattern,
  • good for repeatable MLOps pipelines.

Weaknesses vs raw PyTorch

  • adds another abstraction layer,
  • sometimes slower to debug unusual distributed issues.

12.8 Optimum Intel / OpenVINO

When to use it

Use it when your distilled student must run mainly on:

  • Intel CPUs,
  • Intel iGPU / accelerator environments,
  • edge or desktop inference.

Minimal snippet

from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer

model = OVModelForSequenceClassification.from_pretrained(
    "./tinybert_student",  # export the trained checkpoint; the INT8 ONNX dir is for ONNX Runtime
    export=True
)
tokenizer = AutoTokenizer.from_pretrained("./tinybert_student")

inputs = tokenizer("romance manga with adult cast", return_tensors="pt")
outputs = model(**inputs)

Strengths vs raw PyTorch

  • strong CPU-first deployment path,
  • useful OpenVINO export and runtime integration,
  • practical for low-latency CPU serving.

Weaknesses vs raw PyTorch

  • mainly inference-focused,
  • best after model training is finished.

12.9 llama.cpp / llama-cpp-python

When to use it

Use it after distillation when you want to serve a quantized student locally or on low-cost hardware.

Minimal snippet

from llama_cpp import Llama

llm = Llama(
    model_path="./manga-student-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=20
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Suggest mature romance manga"}]
)
print(resp["choices"][0]["message"]["content"])

Strengths vs raw PyTorch

  • simple local serving,
  • excellent quantized inference story,
  • strong Apple Silicon and CPU usability.

Weaknesses vs raw PyTorch

  • serving stack, not training stack,
  • you usually convert/export into it after training elsewhere.

13. Hardware-Specific Distillation Strategies

13.1 A. Single A100 80GB

Best use cases

  • DistilBERT → TinyBERT full KD
  • Llama 8B LoRA response distillation
  • large batch experiments
  • frequent checkpoint sweeps

Precision

  • bf16 preferred for training on A100
  • fallback: fp16 if library path requires it
  • keep optimizer states in standard mixed-precision defaults

Realistic batch sizes

| Workload | Batch size | Notes |
|---|---|---|
| TinyBERT classifier KD | 128 | teacher + student + features fit comfortably |
| TinyBERT feature KD with seq 128 | 96 | if attention tensors are kept |
| Llama 3 8B LoRA SFT | 8 | good starting micro-batch |
| Llama 3 8B full finetune | usually not recommended here | LoRA/QLoRA preferred |

Throughput dry-run numbers

| Workload | Throughput | Epoch time |
|---|---|---|
| TinyBERT feature KD | ~320 seq/s | ~2.5 min / epoch on 48K rows |
| TinyBERT output KD | ~650 seq/s | ~1.2 min / epoch on 48K rows |
| Llama 3 8B LoRA, seq 512 | ~5,800 tok/s | ~47 min / epoch on 32K rows |

Bottlenecks to watch

  • hidden-state extraction cost in feature KD,
  • dataloader underfeeding GPU,
  • sequence padding waste,
  • checkpoint save stalls,
  • teacher forward pass doubling compute in online KD.

Important engineering note

For managed teachers like OpenAI, do not keep the teacher in the training loop.
Generate teacher outputs first, then train only the student.


13.2 B. Multi-GPU — 2×A10G or 4×A100

DDP vs FSDP for distillation

Use DDP when:

  • both teacher and student fit per GPU,
  • you want simpler debugging,
  • the student is moderate size.

Use FSDP when:

  • the student does not fit comfortably as a full replica,
  • optimizer memory is the problem,
  • you want bigger effective context or batch.

Distillation-specific problem

The teacher must be available on every rank if teacher inference happens online.

You have three patterns:

  1. Replicate frozen teacher on every rank
    easiest, best for small teachers.

  2. Precompute teacher outputs
    best for managed teachers or large teachers.

  3. FSDP student, replicated teacher
    best hybrid for medium teacher + larger student.

DDP pattern with a replicated frozen teacher

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

teacher = TeacherModel().to(local_rank)
student = StudentModel().to(local_rank)

# freeze the teacher and leave it unwrapped: DDP refuses modules with no
# trainable parameters, and a frozen replica needs no gradient sync anyway
for p in teacher.parameters():
    p.requires_grad = False
teacher.eval()

student = DDP(student, device_ids=[local_rank])

for batch in train_loader:
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    loss = kd_loss(s_logits, t_logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

When to prefer accelerate

Use accelerate when:

  • you want one codepath across 1 GPU / many GPUs,
  • you may switch between DDP and FSDP,
  • you want HF integration without hand-writing launch logic.

Minimal pattern:

import torch
from accelerate import Accelerator

accelerator = Accelerator()
student, optimizer, train_loader = accelerator.prepare(student, optimizer, train_loader)

teacher.to(accelerator.device)
teacher.eval()

for batch in train_loader:
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    loss = kd_loss(s_logits, t_logits, batch["labels"])
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

Expected scaling dry-run numbers

| Hardware | Workload | Aggregate throughput | Practical note |
|---|---|---|---|
| 2×A10G | TinyBERT KD | ~430 seq/s | ~1.35× to 1.5× over single A10G |
| 4×A100 | TinyBERT KD | ~1,150 seq/s | near-linear if input pipeline is healthy |
| 2×A10G | Llama 8B LoRA | ~1,700 tok/s | use gradient accumulation heavily |
| 4×A100 | Llama 8B LoRA/FSDP | ~17,000 tok/s | strong fit for fast checkpoint sweeps |

DDP vs FSDP tradeoff summary

| Choice | Good | Bad |
|---|---|---|
| DDP | simpler, stable, easy teacher replication | duplicates model memory |
| FSDP | lower memory, bigger models | more complex debugging and checkpointing |
| precomputed teacher cache | fastest train loop, cheapest runtime | more storage and preprocessing |

13.3 C. AWS Lambda / CPU-only inference of the distilled student

This is where distillation usually pays real business value.

Export flow

flowchart LR
    A[PyTorch student checkpoint] --> B[ONNX export with Optimum]
    B --> C[INT8 quantization]
    C --> D[Package model artifact]
    D --> E[Lambda container or zip]
    E --> F[Provisioned Concurrency optional]

ONNX export with optimum

from optimum.onnxruntime import ORTModelForSequenceClassification

model = ORTModelForSequenceClassification.from_pretrained(
    "./tinybert_student",
    export=True
)
model.save_pretrained("./tinybert_student_onnx")

INT8 question: does accuracy degrade further from 89.3%?

In the dry run, yes, but only slightly.

| Variant | Accuracy | Delta vs FP32 distilled |
|---|---|---|
| TinyBERT distilled fp32 | 89.3% | baseline |
| ONNX dynamic INT8 | 89.0% | -0.3 |
| ONNX static INT8 | 88.8% | -0.5 |

This is usually acceptable if the latency gain is material.

Why memory allocation matters

Lambda allocates CPU power proportional to memory.
That means 1 GB is not just “more memory”; it is also more CPU.

Expected Lambda latency dry run

| Memory | Model format | p50 warm | p95 warm | Cold start p95 | Notes |
|---|---|---|---|---|---|
| 512 MB | PyTorch fp32 | 31 ms | 44 ms | 240 ms | misses strict target |
| 512 MB | ONNX INT8 | 18 ms | 29 ms | 148 ms | usable but tight |
| 1024 MB | ONNX INT8 | 11 ms | 18 ms | 103 ms | recommended |
| 1536 MB | ONNX INT8 | 9 ms | 15 ms | 95 ms | diminishing returns |
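
A rough local sanity check of warm latency with onnxruntime. The quantized model filename is an assumption about how the artifact was saved, and real p50/p95 must still be measured on Lambda itself, since CPU share scales with the memory setting:

import time

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

sess = ort.InferenceSession("./tinybert_student_int8/model_quantized.onnx",
                            providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained("./tinybert_student")
enc = tok("where is my order", return_tensors="np",
          padding="max_length", max_length=64)
feed = {i.name: enc[i.name] for i in sess.get_inputs()}

samples = []
for _ in range(200):
    t0 = time.perf_counter()
    sess.run(None, feed)
    samples.append((time.perf_counter() - t0) * 1000)
print(f"p50={np.percentile(samples, 50):.1f} ms  p95={np.percentile(samples, 95):.1f} ms")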

Practical recommendation

For MangaAssist intent classification:

  • deploy ONNX INT8
  • start with 1024 MB
  • move to 512 MB only if traffic cost pressure is high and p95 still meets SLO

13.4 D. Apple Silicon (M2 / M3) for local development

What it is good for

  • correctness testing,
  • small dry runs,
  • prompt formatting validation,
  • tiny classifier distillation experiments,
  • local quantized student inference.

PyTorch MPS basics

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)
batch = {k: v.to(device) for k, v in batch.items()}

Practical guidance

  • prefer fp16 or fp32-style workflows for local experimentation,
  • keep batch sizes conservative,
  • expect some operator gaps or performance differences versus CUDA,
  • use Apple Silicon mainly for development, not final throughput claims.

Realistic local batch sizes

| Workload | M2/M3 batch size |
|---|---|
| TinyBERT KD, seq 64 | 8–16 |
| TinyBERT inference | 32–64 |
| 7B/8B LoRA toy run | 1–2 |
| quantized local inference via llama.cpp | depends on RAM and quant level |

Dry-run throughput

| Workload | Throughput |
|---|---|
| TinyBERT training on MPS | ~110 seq/s |
| TinyBERT inference on MPS | ~260 seq/s |
| 8B quantized local generation | ~18–45 tok/s depending on quant + memory |

Main limitations vs CUDA

  • less mature distributed story,
  • smaller effective memory ceiling,
  • weaker training throughput,
  • local results are useful for debugging but should not be treated as production capacity numbers.

14. Example Fine-Tuning Evolution Narratives

These are the kinds of summaries that should be written into experiment notes after every run.

14.1 TinyBERT run summary

  • Epoch 1–2: student learns coarse teacher structure; rare-class recall jumps most here.
  • Epoch 3–4: student begins separating confusable intents cleanly.
  • Epoch 5: best overall tradeoff between accuracy, rare recall, and refusal precision.
  • Epoch 6+: training loss still falls, but user-facing metrics flatten.

14.2 OpenAI managed-student run summary

  • Epoch 1: student learns answer format and common recommendation phrasing.
  • Epoch 2: factuality improves because teacher response structure is internalized.
  • Epoch 3: best checkpoint; hallucinations lowest while format quality stays strong.
  • Epoch 4: style becomes more rigid and catalog hallucinations creep up.

14.3 Llama 8B run summary

  • Epoch 1: formatting and response skeleton improve.
  • Epoch 2: answer relevance rises sharply.
  • Epoch 3: best human score / hallucination balance.
  • Epoch 4: no meaningful answer quality gain, more memorized phrasing.

15. Deployment Decision Framework

15.1 Promotion checklist

A distilled student is promoted only if all pass:

| Category | Gate |
|---|---|
| quality | teacher preference match meets target |
| safety | hallucination/refusal metrics meet target |
| latency | p95 meets SLO |
| cost | at least 50% lower if this is a cost-driven project |
| rollback | last-known-good model available |
| observability | dashboards and alerts are live |

15.2 Shadow deployment plan

  1. Route 1–5% of traffic to the student in shadow mode.
  2. Compare answer category, refusal rate, and escalation rate.
  3. Human review only the disagreement bucket.
  4. Promote to 10%, then 25%, then 50%, then 100%.

15.3 Rollback triggers

Rollback immediately if any of these cross threshold (a check sketch follows the list):

  • hallucination rate +1.0 point above baseline,
  • refusal precision -2.0 points below baseline,
  • rare-class recall -3.0 points,
  • p95 latency +20% vs approved benchmark,
  • support tickets on recommendation quality spike.
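
The same triggers as a sketch over live vs baseline metric dicts (rates as fractions, latency in ms; the ticket-spike trigger is left to a separate alerting system):

def should_rollback(live, baseline):
    return any([
        live["hallucination_rate"] - baseline["hallucination_rate"] > 0.010,
        baseline["refusal_precision"] - live["refusal_precision"] > 0.020,
        baseline["rare_class_recall"] - live["rare_class_recall"] > 0.030,
        live["p95_latency_ms"] > 1.20 * baseline["p95_latency_ms"],
    ])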

16. Recommended Stacks by Scenario

| Scenario | Recommended stack |
|---|---|
| classic NLP classifier KD | transformers + custom Trainer |
| paper-like feature KD experiment | TextBrewer or raw PyTorch |
| response distillation into open model | trl + PEFT + HF |
| managed-model distillation | OpenAI teacher generation + OpenAI SFT |
| CPU deployment | optimum + ONNX + optionally Optimum Intel |
| local quantized serving | llama.cpp / llama-cpp-python |
| long-lived training codebase | lightning or HF + accelerate |

17. Final Recommendations for MangaAssist

17.1 What to run first

Run these in order:

  1. DistilBERT → TinyBERT
    Fastest proof of value. Cheap. Clear metrics.

  2. Managed response distillation
    gpt-4.1 teacher → fine-tuned gpt-4.1-mini.

  3. Self-hosted fallback distillation
    same teacher outputs → Llama 3 8B student.

  4. Reranker distillation + ONNX INT8
    only if reranking latency is still a bottleneck.

17.2 Best stopping points from the dry runs

| Run | Best checkpoint |
|---|---|
| TinyBERT classifier | Stage 2, epoch 5 |
| OpenAI managed student | epoch 3 |
| Llama 3 8B student | epoch 3 |
| Reranker | epoch 3 |

17.3 Core lesson

The correct question is not “Did the loss go down?”
The correct question is:

“At which checkpoint did the student become cheap and fast enough, while still preserving safety, factuality, and user-perceived quality?”

That checkpoint is the one to deploy.


18. Appendix — Extra Useful Metrics

18.1 Teacher confidence spread

[ \text{confidence spread} = p_{top1} - p_{top2} ]

Low spread means ambiguous teacher signal.
These examples are good for:

  • soft labels,
  • human review prioritization,
  • rare-class calibration.
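
A minimal sketch, assuming a batch of teacher logits is available; the 0.15 threshold is an illustrative choice:

import torch.nn.functional as F

def confidence_spread(logits):
    # p_top1 - p_top2 per example; low spread marks ambiguous teacher signal
    top2 = F.softmax(logits, dim=-1).topk(2, dim=-1).values
    return top2[:, 0] - top2[:, 1]

spread = confidence_spread(teacher_logits)  # teacher_logits: (batch, num_classes)
ambiguous = spread < 0.15                   # route these rows to human review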

18.2 Example ranking metrics

[ NDCG@k = \frac{DCG@k}{IDCG@k} ]

with

[ DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)} ]
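
The same formulas as a direct sketch, with relevance grades listed in ranked order:

import math

def dcg_at_k(rels, k):
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)  # IDCG: best possible ordering
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))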

18.3 Cost reduction formula

[ \text{cost reduction} = \frac{\text{teacher cost} - \text{student cost}}{\text{teacher cost}} ]

Example:

  • teacher cost per 1K responses = 12.4
  • student cost per 1K responses = 4.8

[ (12.4 - 4.8) / 12.4 = 0.6129 \approx 61.3\% ]

18.4 Hallucination rate

[ \text{hallucination rate} = \frac{\text{responses with unsupported factual claims}}{\text{all evaluated responses}} ]

Example:

  • unsupported claims in sample = 29
  • evaluated responses = 800

[ 29/800 = 0.03625 = 3.6\% ]


19. Appendix — Mermaid Diagram for Distillation Decisions

flowchart TD
    A[Need lower cost or latency?] -->|No| B[Keep teacher in production]
    A -->|Yes| C[Can smaller zero-shot model already pass?]
    C -->|Yes| D[Use smaller base model directly]
    C -->|No| E[Can teacher outputs be collected offline?]
    E -->|Yes| F[Run distillation]
    E -->|No| G[Use online KD only if teacher cost is acceptable]
    F --> H[Offline eval]
    H --> I{Pass quality + safety + latency gates?}
    I -->|No| J[Revise data / temperature / student size]
    I -->|Yes| K[Shadow deploy]
    K --> L{Shadow stable?}
    L -->|No| M[Rollback]
    L -->|Yes| N[Promote]

20. Appendix — Source-aware Implementation Notes

This document expands the original MangaAssist distillation write-up with:

  • the original DistilBERT → TinyBERT and Claude/LLM-style examples,
  • concrete OpenAI-based distillation options,
  • additional hardware and deployment planning,
  • production-log-first dry-run analysis.

Keep the original baseline metrics and diagrams as the “teacher” document, and use this one as the operations + implementation expansion.


21. Official Docs and Repositories to Check While Implementing

These are the main docs/repositories worth checking while turning the dry run into a real pipeline:

  • Hugging Face transformers Trainer documentation
  • Hugging Face trl SFTTrainer documentation
  • Hugging Face optimum ONNX Runtime quantization documentation
  • Hugging Face optimum-intel / OpenVINO documentation
  • Hugging Face accelerate documentation
  • PyTorch DistributedDataParallel documentation
  • PyTorch FullyShardedDataParallel documentation
  • PyTorch MPS backend documentation
  • AWS Lambda memory/CPU allocation documentation
  • OpenAI supervised fine-tuning guide
  • OpenAI model optimization guide
  • OpenAI distillation cookbook example
  • TextBrewer GitHub repository and docs
  • Knowledge-Distillation-Zoo GitHub repository
  • Lightning-AI repository/docs
  • llama.cpp repository
  • llama-cpp-python documentation

For production use, always re-check:

  • supported model versions,
  • fine-tuning availability,
  • export/quantization compatibility,
  • hardware backend support,
  • recent release notes before locking the pipeline.