MangaAssist Knowledge Distillation — Improved Prompt + Expanded Library and Hardware Guide

This document does two things:

  1. Improves the original prompt so the writing task is clearer, less repetitive, and more production-ready.
  2. Implements the requested expansion for the MangaAssist knowledge distillation pipeline with concrete, numerical, engineering-focused detail.

The content continues the existing MangaAssist distillation document, which already covers:

- KL-divergence loss math
- temperature scaling
- DistilBERT → TinyBERT distillation
- managed-teacher → Llama 3 8B response distillation
- TinyBERT two-stage feature matching
- deployment results


0. Improved Prompt

Use this improved prompt when generating the next version of the document.

I have a production knowledge distillation document for the MangaAssist chatbot.

The existing document already covers:
- KL-divergence loss math
- temperature scaling
- DistilBERT → TinyBERT intent-classifier distillation
- managed-teacher → Llama 3 8B response-model distillation
- TinyBERT two-stage feature matching
- deployment and serving results

Please EXPAND and IMPROVE the document by adding the sections below.

Requirements:
- Write in markdown.
- Be concrete, numerical, and production-focused throughout.
- Ground examples in the MangaAssist system, not generic toy examples.
- Use realistic assumptions and state them explicitly before giving throughput, latency, or cost numbers.
- Prefer short equations, tables, logs, and worked examples over broad descriptions.
- Include Mermaid diagrams where they clarify architecture or execution flow.
- When a number is an estimate rather than a measured benchmark, label it clearly as a planning estimate.
- Show tradeoffs, not just recommendations.
- Compare options against raw PyTorch where requested.
- Use OpenAI-managed teacher examples where helpful, but keep the student side deployable in self-hosted or CPU-friendly setups when relevant.

Add the following new sections:

## 1. Library ecosystem comparison

For each library/tool below:
- show a minimal working code snippet for knowledge distillation or the closest production-relevant equivalent,
- explain where it fits in the MangaAssist pipeline,
- list strengths and weaknesses relative to raw PyTorch,
- state whether it supports:
  - output distillation,
  - feature distillation,
  - attention distillation,
  - post-distillation optimization,
  - LLM response distillation,
- state the best-fit hardware target.

Libraries/tools to cover:
- Hugging Face `transformers` + custom `Trainer`
- `TextBrewer`
- `Knowledge-Distillation-Zoo` patterns
- `optimum` for ONNX export + post-training quantization
- `trl` `SFTTrainer` for response distillation
- `trl` distillation trainer (mention if useful in addition to SFTTrainer)
- `lightning` / PyTorch Lightning callbacks
- `Optimum Intel` / `OpenVINO`
- `llama.cpp` / `llama-cpp-python`

Then provide a comparison table with these columns:
- tool
- primary use
- ease of use
- output KD
- feature KD
- attention KD
- post-distillation optimization
- GPU / CPU / edge fit
- maintenance status
- recommended role in MangaAssist

## 2. Hardware-specific distillation strategies

For each hardware target below, explain:
- exact setup assumptions,
- precision,
- realistic batch size,
- memory considerations,
- main bottlenecks,
- expected throughput,
- expected wall-clock training time,
- failure modes,
- when that setup is good enough vs when to move to another setup.

### A. Single A100 80GB-class GPU
Cover both:
- DistilBERT → TinyBERT classifier distillation
- managed-teacher or OpenAI-generated labels → Llama 3 8B student response distillation

Use practical assumptions like:
- batch_size = 128 for TinyBERT at seq_len = 128
- batch_size = 8 for Llama 3 8B at seq_len = 512
- bf16 where appropriate

Include:
- step-time estimates
- epoch-time estimates
- total-job estimates
- why live teacher inference on the same GPU may or may not be a good idea

### B. Multi-GPU distillation (2×A10G or 4×A100)
Compare DDP vs FSDP specifically for distillation:
- how to place the teacher,
- whether teacher weights are replicated or sharded,
- communication cost,
- when the student should use FSDP,
- when the teacher should stay frozen under DDP or plain model replication,
- when `accelerate` is the cleanest choice.

Include a minimal code pattern using `DistributedDataParallel` for the teacher.

### C. AWS Lambda / CPU-only inference for the distilled student
Focus on post-distillation serving of the TinyBERT student:
- ONNX export with `optimum`
- INT8 quantization
- expected quality drop from 89.3%
- expected latency at 512MB vs 1024MB Lambda memory
- cold start vs warm start behavior
- whether Lambda is enough or whether ECS/Fargate should be used instead

### D. Apple Silicon (M2/M3) for development and local dry runs
Explain:
- `mps` training support,
- practical precision choices,
- realistic batch sizes,
- limitations vs CUDA,
- what can be validated locally,
- what should not be trusted until re-run on NVIDIA GPUs.

## 3. Final recommendation section

End with:
- the best library stack for MangaAssist classifier distillation,
- the best stack for MangaAssist LLM response distillation,
- the best CPU-serving path,
- the minimum setup for a solo engineer,
- the recommended setup for a production team.

Output format:
- markdown only
- include tables
- include code blocks
- include Mermaid diagrams
- keep the tone like a senior ML platform engineer writing an internal engineering design note

Why this prompt is better

The improved prompt fixes five common problems in the original request:

  1. It removes duplication.
    The original repeated the same requirement block twice.

  2. It resolves the truncated Apple Silicon section.
    The improved version completes the ask and makes the Apple section testable.

  3. It forces assumptions before numbers.
    That prevents fake precision.

  4. It separates training-time distillation from serving-time optimization.
    This matters because optimum, OpenVINO, and llama.cpp are mostly deployment tools, not KD trainers.

  5. It clarifies where OpenAI-managed teachers fit.
    Managed-teacher output generation and student fine-tuning are different stages and should not be mixed.


1. Assumptions Used in This Expansion

To keep the numbers consistent, this document uses the following MangaAssist planning assumptions.

1.1 Classifier Distillation Workload

| Item | Value |
| --- | --- |
| task | 10-class intent classification |
| teacher | DistilBERT, 66M params |
| student | TinyBERT 4L-312D, 14.5M params |
| train examples | 25,000 |
| validation examples | 3,000 |
| sequence length | 128 tokens |
| epochs | 10 |
| KD temperature | 4 |
| alpha | 0.7 |
| stage-1 feature KD corpus | 200,000 unlabeled utterances |
| stage-2 output KD corpus | 25,000 labeled utterances |

1.2 LLM Response Distillation Workload

| Item | Value |
| --- | --- |
| task | grounded manga shopping / FAQ / support responses |
| teacher | OpenAI-managed teacher or equivalent managed high-quality teacher |
| student | Llama 3 8B |
| train examples | 12,000 prompt-response pairs |
| validation examples | 1,500 |
| avg prompt length | 180 tokens |
| avg target length | 320 tokens |
| train sequence length cap | 512 tokens |
| epochs | 3 |
| batch size | 8 |
| gradient accumulation | 4 |
| effective batch size | 32 |
1.3 Serving Targets

| Path | Latency target |
| --- | --- |
| intent classifier | <= 15 ms warm |
| reranker | <= 50 ms |
| fallback response model | <= 200 ms local/self-hosted p50 |
| Lambda intent cold start | <= 250 ms preferred |
| Lambda intent warm | <= 20 ms preferred |

1.4 Important Scope Boundary

For LLM response distillation, there are two very different production modes:

  1. Offline teacher labeling
    Teacher outputs are generated first and stored. Training later uses those outputs.
    This is the normal production choice.

  2. Online teacher-student co-training
    Teacher runs during student training.
    This is expensive and rarely the best production default for LLM response distillation.

Most of the recommendations below assume offline teacher labeling, because that is the simpler and cheaper path for MangaAssist.
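A minimal sketch of the offline labeling step, assuming the OpenAI Python client, a placeholder teacher model name, and hypothetical prompts.jsonl / teacher_outputs.jsonl files:

import json
from openai import OpenAI

client = OpenAI()

# Placeholder teacher model name; substitute the managed teacher actually approved.
TEACHER_MODEL = "gpt-4o"

with open("prompts.jsonl") as fin, open("teacher_outputs.jsonl", "w") as fout:
    for line in fin:
        prompt = json.loads(line)["prompt"]
        resp = client.chat.completions.create(
            model=TEACHER_MODEL,
            messages=[
                {"role": "system", "content": "You are MangaAssist."},
                {"role": "user", "content": prompt},
            ],
            temperature=0.3,
        )
        # Store prompt/response pairs once; training never calls the teacher again.
        fout.write(json.dumps({
            "prompt": prompt,
            "response": resp.choices[0].message.content,
        }) + "\n")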


2. Library Ecosystem Comparison

2.1 Where Each Library Fits in the MangaAssist Pipeline

flowchart LR
    A[Raw PyTorch loop] --> B[transformers Trainer custom loss]
    A --> C[TextBrewer / KD utilities]
    A --> D[Knowledge-Distillation-Zoo patterns]

    B --> E[Student checkpoint]
    C --> E
    D --> E

    E --> F[optimum ONNX export]
    E --> G[Optimum Intel / OpenVINO]
    E --> H[llama.cpp GGUF path]

    I[Teacher outputs from OpenAI-managed teacher] --> J[TRL SFTTrainer or DistillationTrainer]
    J --> K[LLM student checkpoint]
    K --> H

2.2 Comparison Table

| Tool | Primary use | Ease of use | Output KD | Feature KD | Attention KD | Post-distillation optimization | GPU / CPU / edge fit | Maintenance status | Recommended role in MangaAssist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| raw PyTorch | full-control training loop | low | yes | yes | yes | no | GPU | always viable | baseline for custom research or odd losses |
| transformers + custom Trainer | production-friendly KD for HF models | high | yes | yes, with custom code | yes, with custom code | indirect | GPU | active | best default for classifier KD |
| TextBrewer | NLP KD recipes | medium | yes | yes | yes | no | GPU | older but usable | good for fast NLP KD experiments |
| Knowledge-Distillation-Zoo patterns | reference loss implementations | medium-low | yes | yes | yes | no | GPU | reference-style / older | borrow losses, not full production stack |
| optimum | ONNX export + ORT quantization | high | no | no | no | yes | CPU / edge | active | best post-KD path for Lambda TinyBERT |
| trl SFTTrainer | response distillation by teacher outputs | high | response-level only | no | no | indirect | GPU | active | best simple path for LLM teacher-output imitation |
| trl distillation trainer | sequence-model KD | medium | yes | no | no | indirect | GPU | active and growing | use when true teacher-student LM KD is needed |
| lightning | modular training + callbacks | medium | yes | yes | yes | indirect | GPU / local dev | active | useful for team codebases with reusable hooks |
| Optimum Intel / OpenVINO | CPU-optimized inference | medium | no | no | no | yes | CPU / Intel edge | active | best x86 CPU-serving path |
| llama.cpp / llama-cpp-python | quantized local LLM serving | high for serving | no | no | no | yes | CPU / edge / Apple | active | best self-hosted small-footprint LLM serving |

2.3 Hugging Face transformers + Custom Trainer

Minimal KD snippet

import torch
import torch.nn.functional as F
from transformers import Trainer

class KDTrainer(Trainer):
    def __init__(self, teacher_model, temperature=4.0, alpha=0.7, **kwargs):
        super().__init__(**kwargs)
        self.teacher = teacher_model.eval()
        self.temperature = temperature
        self.alpha = alpha
        for p in self.teacher.parameters():
            p.requires_grad = False
        # Trainer moves the student to args.device automatically, but not the
        # teacher, so move it explicitly to avoid a CPU/GPU device mismatch.
        self.teacher.to(self.args.device)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs["labels"]

        student_outputs = model(**inputs)
        student_logits = student_outputs.logits

        with torch.no_grad():
            teacher_logits = self.teacher(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
            ).logits

        hard_loss = F.cross_entropy(student_logits, labels)

        student_logp = F.log_softmax(student_logits / self.temperature, dim=-1)
        teacher_p = F.softmax(teacher_logits / self.temperature, dim=-1)

        kd_loss = F.kl_div(student_logp, teacher_p, reduction="batchmean") * (self.temperature ** 2)
        loss = (1 - self.alpha) * hard_loss + self.alpha * kd_loss

        return (loss, student_outputs) if return_outputs else loss
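A minimal sketch of wiring KDTrainer into a run, assuming a hypothetical fine-tuned DistilBERT teacher checkpoint path and pre-tokenized train_ds / val_ds datasets:

from transformers import AutoModelForSequenceClassification, TrainingArguments

teacher = AutoModelForSequenceClassification.from_pretrained("./distilbert_intent")  # hypothetical path
student = AutoModelForSequenceClassification.from_pretrained(
    "huawei-noah/TinyBERT_General_4L_312D", num_labels=10
)

args = TrainingArguments(
    output_dir="./tinybert_distilled",
    per_device_train_batch_size=128,
    num_train_epochs=10,
    learning_rate=5e-5,
    bf16=True,
)

trainer = KDTrainer(
    teacher_model=teacher,
    temperature=4.0,
    alpha=0.7,
    model=student,
    args=args,
    train_dataset=train_ds,   # pre-tokenized datasets assumed
    eval_dataset=val_ds,
)
trainer.train()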

Why it fits MangaAssist

This is the best default when: - the teacher and student are Hugging Face models, - the student is a standard text classifier, - the team wants fast experimentation without building a full custom loop.

Strengths vs raw PyTorch

  • Faster to stand up.
  • Built-in checkpointing, evaluation, logging, mixed precision, distributed support.
  • Easier to integrate with Hugging Face tokenizers, datasets, and schedulers.

Weaknesses vs raw PyTorch

  • Feature matching and attention matching still require custom plumbing.
  • Less transparent than a handwritten loop during debugging.
  • Easy to accidentally hide extra teacher forward-pass cost inside compute_loss.

MangaAssist recommendation

Use this as the default classifier distillation stack.
Add custom hooks only when intermediate feature loss is required.


2.4 TextBrewer

TextBrewer is purpose-built for NLP distillation and includes output KD plus intermediate feature matching.

Minimal KD snippet

from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

train_config = TrainingConfig(
    output_dir="./tb_out",
    gradient_accumulation_steps=1,
    device="cuda"
)

distill_config = DistillationConfig(
    temperature=4.0,
    kd_loss_type="ce",
    hard_label_weight=0.3,
    kd_loss_weight=0.7,
)

distiller = GeneralDistiller(
    train_config=train_config,
    distill_config=distill_config,
    model_T=teacher_model,
    model_S=student_model,
    adaptor_T=teacher_adaptor,
    adaptor_S=student_adaptor,
)

with distiller:
    distiller.train(
        optimizer=optimizer,
        dataloader=train_loader,
        num_epochs=10,
    )
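The snippet above references teacher_adaptor and student_adaptor without defining them. A minimal sketch of what they can look like, assuming both models are Hugging Face classifiers configured with output_hidden_states=True:

def teacher_adaptor(batch, model_outputs):
    # TextBrewer adaptors map raw model outputs to named fields
    # (logits, hidden, attention, losses) that the distiller consumes.
    return {"logits": model_outputs.logits,
            "hidden": model_outputs.hidden_states}

def student_adaptor(batch, model_outputs):
    return {"logits": model_outputs.logits,
            "hidden": model_outputs.hidden_states}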

Why it fits MangaAssist

Good when you want: - output KD, - feature KD, - attention KD, - a cleaner abstraction than raw PyTorch.

Strengths vs raw PyTorch

  • Faster setup for classic NLP KD methods.
  • Cleaner config model for loss weighting and teacher-student adaptors.
  • Good for TinyBERT-style intermediate matching experiments.

Weaknesses vs raw PyTorch

  • Smaller ecosystem than transformers.
  • Less common in modern production ML stacks.
  • Harder to align with current HF-first platform tooling.

MangaAssist recommendation

Use it for rapid KD experimentation if the team wants built-in feature distillation abstractions.
Do not make it the long-term platform default unless the team is already comfortable with it.


2.5 Knowledge-Distillation-Zoo Patterns

This is best thought of as a reference repo of KD losses, not as a modern end-to-end production training stack.

Minimal KD-style snippet inspired by KD-Zoo patterns

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    s_logp = F.log_softmax(student_logits / T, dim=-1)
    t_prob = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * (T * T)

def fitnet_hint_loss(student_feat, teacher_feat, proj):
    return F.mse_loss(proj(student_feat), teacher_feat)

loss = 0.7 * kd_loss(student_logits, teacher_logits) \
     + 0.3 * F.cross_entropy(student_logits, labels) \
     + 1.0 * fitnet_hint_loss(student_hidden, teacher_hidden, proj_layer)

Why it fits MangaAssist

It is useful when you want to borrow a loss design:

- KL output KD
- FitNet hint loss
- relation-based or feature-based loss
- attention transfer ideas

Strengths vs raw PyTorch

  • Gives known KD formulas quickly.
  • Good source of ablation ideas.

Weaknesses vs raw PyTorch

  • Not a production framework.
  • You still need to build your own training loop, logging, evaluation, and deployment story.
  • Better as a cookbook than a platform dependency.

MangaAssist recommendation

Use it as a design reference, not as the main training framework.


2.6 optimum for ONNX Export + Quantization

This sits after training, not during KD.

Minimal export + quantization snippet

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer

model_id = "./tinybert_distilled"
tokenizer = AutoTokenizer.from_pretrained(model_id)

ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id,
    export=True
)
ort_model.save_pretrained("./tinybert_onnx")

quantizer = ORTQuantizer.from_pretrained("./tinybert_onnx")
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(
    save_dir="./tinybert_onnx_int8",
    quantization_config=qconfig,
)
tokenizer.save_pretrained("./tinybert_onnx_int8")
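A quick smoke test of the quantized artifact, assuming optimum's default quantized file name model_quantized.onnx:

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

int8_model = ORTModelForSequenceClassification.from_pretrained(
    "./tinybert_onnx_int8", file_name="model_quantized.onnx"
)
int8_tokenizer = AutoTokenizer.from_pretrained("./tinybert_onnx_int8")

# Run one sample utterance through the INT8 graph before shipping it.
clf = pipeline("text-classification", model=int8_model, tokenizer=int8_tokenizer)
print(clf("where is my manga order"))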

Why it fits MangaAssist

This is the main path for: - Lambda CPU serving, - lower cold start artifact size, - lower RAM usage, - lower inference latency for the TinyBERT student.

Strengths vs raw PyTorch

  • Easier CPU deployment.
  • Better runtime options than eager PyTorch on Lambda.
  • Smaller artifacts and faster startup.

Weaknesses vs raw PyTorch

  • Not a KD training library.
  • Quantization can shift accuracy slightly.
  • Calibration and runtime testing are still required.

MangaAssist recommendation

Use it as the default post-distillation deployment step for the intent student.


2.7 trl SFTTrainer for LLM Response Distillation

This is the simple path for response-level distillation: teacher outputs become the supervised targets.

Minimal response-distillation snippet

from datasets import Dataset
from trl import SFTTrainer, SFTConfig

train_data = Dataset.from_list([
    {
        "text": (
            "<|system|>You are MangaAssist.\n"
            "<|user|>I want a romance manga with adult characters.\n"
            "<|assistant|>Try Nana, Paradise Kiss, or Wotakoi..."
        )
    }
])

config = SFTConfig(
    output_dir="./llama8b_student",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
    max_seq_length=512,
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",
    args=config,
    train_dataset=train_data,
)

trainer.train()
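In practice the inline train_data above is built from the stored teacher outputs. A minimal conversion sketch, assuming the teacher_outputs.jsonl format from section 1.4 and the same illustrative chat markers as above:

from datasets import load_dataset

raw = load_dataset("json", data_files="teacher_outputs.jsonl", split="train")

def to_text(example):
    # Collapse each prompt/response pair into the single "text" field SFTTrainer expects.
    return {
        "text": (
            "<|system|>You are MangaAssist.\n"
            f"<|user|>{example['prompt']}\n"
            f"<|assistant|>{example['response']}"
        )
    }

train_data = raw.map(to_text, remove_columns=raw.column_names)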

Why it fits MangaAssist

Best when: - teacher outputs have already been generated, - you want a stable and simple student-training path, - you do not have teacher logits.

Strengths vs raw PyTorch

  • Very fast to stand up.
  • Great fit for teacher-output imitation.
  • Built on familiar HF training abstractions.

Weaknesses vs raw PyTorch

  • This is not true logit-level KD.
  • Student learns from teacher text, not teacher probability distributions.
  • Sensitive to teacher verbosity and formatting mistakes.

MangaAssist recommendation

For the LLM fallback student, this is the cleanest first production path.


2.8 trl DistillationTrainer

For sequence-model distillation, TRL now has a dedicated distillation path.

Minimal pattern

from trl.experimental.distillation import DistillationTrainer, DistillationConfig

args = DistillationConfig(
    output_dir="./distilled_chat_model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    lmbda=0.0,   # off-policy only
)

trainer = DistillationTrainer(
    model=student_model,
    teacher_model=teacher_model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)

trainer.train()

Why it fits MangaAssist

Use this when response distillation needs to be more faithful than plain SFT: - on-policy or mixed-policy distillation, - direct student-teacher sequence alignment, - larger-scale LLM KD experiments.

Strengths vs raw PyTorch

  • Less custom code than building sequence KD manually.
  • More direct KD semantics than plain SFT.
  • Compatible with modern HF ecosystem.

Weaknesses vs raw PyTorch

  • Newer and more specialized.
  • Not needed for a simple teacher-output imitation pipeline.
  • Operationally more complex than offline-label SFT.

MangaAssist recommendation

Use only when plain SFT stops being enough.


2.9 lightning / PyTorch Lightning

Good for teams that want structure, callbacks, and reusable hooks.

Minimal LightningModule KD snippet

import lightning as L
import torch
import torch.nn.functional as F

class DistillModule(L.LightningModule):
    def __init__(self, teacher, student, T=4.0, alpha=0.7):
        super().__init__()
        self.teacher = teacher.eval()
        self.student = student
        self.T = T
        self.alpha = alpha
        for p in self.teacher.parameters():
            p.requires_grad = False

    def training_step(self, batch, batch_idx):
        labels = batch["labels"]
        s_logits = self.student(**batch).logits
        with torch.no_grad():
            t_logits = self.teacher(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
            ).logits

        hard = F.cross_entropy(s_logits, labels)
        kd = F.kl_div(
            F.log_softmax(s_logits / self.T, dim=-1),
            F.softmax(t_logits / self.T, dim=-1),
            reduction="batchmean",
        ) * (self.T ** 2)

        loss = (1 - self.alpha) * hard + self.alpha * kd
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # Only the student is trained; the frozen teacher has no optimizer state.
        return torch.optim.AdamW(self.student.parameters(), lr=5e-5)
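A minimal fit call for this module, assuming teacher_model, student_model, and train_loader from the earlier sections:

trainer = L.Trainer(max_epochs=10, accelerator="auto", devices=1)
trainer.fit(DistillModule(teacher_model, student_model), train_dataloaders=train_loader)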

Why it fits MangaAssist

Useful when: - the team wants stronger engineering structure, - training logic, logging, and early stopping should be reusable, - multiple KD jobs will share patterns.

Strengths vs raw PyTorch

  • Cleaner code organization.
  • Good callback and early stopping ecosystem.
  • Easy to standardize metrics and checkpoints.

Weaknesses vs raw PyTorch

  • Another abstraction layer to debug.
  • Less directly aligned with HF ecosystem unless wrapped carefully.

MangaAssist recommendation

Strong choice for an internal ML platform team, but not the shortest path for a single experiment.


2.10 Optimum Intel / OpenVINO

This is for CPU-optimized inference, especially on Intel hardware.

Minimal snippet

from optimum.intel.openvino import OVModelForSequenceClassification
from transformers import AutoTokenizer

model = OVModelForSequenceClassification.from_pretrained(
    "./tinybert_distilled",
    export=True,
    compile=True,
)
tokenizer = AutoTokenizer.from_pretrained("./tinybert_distilled")

inputs = tokenizer("where is my manga order", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

Why it fits MangaAssist

Best when: - the student must run on Intel CPU fleets, - ECS or edge x86 is preferred over Lambda, - per-core efficiency matters more than GPU flexibility.

Strengths vs raw PyTorch

  • Better optimized CPU execution.
  • Supports quantization and compression.
  • Strong choice for real CPU production deployments.

Weaknesses vs raw PyTorch

  • Intel-oriented path.
  • More deployment engineering than plain Python inference.
  • Not the main training framework.

MangaAssist recommendation

Best CPU-serving path when you control x86 servers.


2.11 llama.cpp / llama-cpp-python

Best for quantized local serving of the distilled LLM student.

Minimal snippet

from llama_cpp import Llama

llm = Llama(
    model_path="./mangaassist-llama8b-q4_k_m.gguf",
    n_ctx=4096,
    n_threads=8,
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are MangaAssist."},
        {"role": "user", "content": "Suggest romance manga with adult characters."}
    ]
)
print(resp["choices"][0]["message"]["content"])

Why it fits MangaAssist

Strong when: - you need a cheap local/self-hosted fallback, - GPU is not guaranteed, - you want OpenAI-compatible local serving via llama-cpp-python.

Strengths vs raw PyTorch

  • Much easier lightweight deployment for quantized LLMs.
  • Good CPU and Apple Silicon story.
  • Strong practical choice for fallback serving.

Weaknesses vs raw PyTorch

  • Not for KD training.
  • Less flexible for advanced training-time experimentation.
  • Quality/latency tradeoff depends heavily on GGUF quantization level.

MangaAssist recommendation

Use this for student LLM serving, not for student training.


3. Hardware-Specific Distillation Strategies

3.1 A. Single A100 80GB-Class GPU

This section assumes a single A100 80GB-class GPU. The goal is not perfect benchmark precision; it is to make planning decisions concrete.

A1. DistilBERT → TinyBERT Classifier Distillation

Setup

| Item | Value |
| --- | --- |
| teacher | DistilBERT 66M, frozen |
| student | TinyBERT 14.5M |
| seq_len | 128 |
| batch size | 128 |
| precision | bf16 |
| teacher placement | same GPU |
| examples | 25,000 |
| steps/epoch | ceil(25,000 / 128) = 196 |

Memory intuition

Approximate model weight memory:

- DistilBERT bf16 weights: ~132 MB
- TinyBERT bf16 weights: ~29 MB
- student gradients + optimizer state: still small relative to 80 GB
- activations dominate during training, but even then this setup is nowhere near memory-bound

Planning estimate

| Metric | Planning estimate |
| --- | --- |
| step time | 0.18 to 0.30 s |
| samples/s | 425 to 710 |
| epoch time | 35 to 60 s |
| 10-epoch train loop | 6 to 10 min |
| with eval + checkpoints | 10 to 18 min |

Why it is fast

The teacher is small, frozen, and classification sequence length is short.
This workload is compute-light compared with LLM fine-tuning.

Bottlenecks to watch

  1. Python data loader overhead
  2. teacher forward inside compute_loss hiding extra cost
  3. CPU tokenization on the fly
  4. evaluation every epoch causing extra sync and save time

Practical recommendation

Pre-tokenize the dataset and pin memory.
Otherwise the GPU will wait on the CPU more than expected.
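A minimal pre-tokenization sketch, assuming a hypothetical train_intents.jsonl with text and label columns and the tokenizer saved alongside the student checkpoint:

from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

tokenizer = AutoTokenizer.from_pretrained("./tinybert_distilled")

ds = load_dataset("json", data_files="train_intents.jsonl", split="train")
# Tokenize once up front so the GPU never waits on per-step CPU tokenization.
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
    remove_columns=["text"],
)
ds = ds.rename_column("label", "labels")
ds.set_format("torch")

loader = DataLoader(ds, batch_size=128, shuffle=True,
                    num_workers=4, pin_memory=True)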

A2. Llama 3 8B Student Response Distillation

This section assumes offline teacher labels already exist.
That means the A100 only trains the student.

Setup

| Item | Value |
| --- | --- |
| student | Llama 3 8B |
| data source | teacher-generated responses already materialized |
| seq_len | 512 |
| batch size | 8 |
| grad accumulation | 4 |
| effective batch | 32 |
| precision | bf16 |
| examples | 12,000 |
| steps/epoch | 12,000 / 8 = 1,500 micro-steps |

Planning estimate

| Metric | Planning estimate |
| --- | --- |
| micro-step time | 0.9 to 1.5 s |
| optimizer-step time | 3.6 to 6.0 s |
| tokens/s | 2,700 to 4,500 train tokens/s |
| epoch time | 25 to 40 min |
| 3 epochs | 1.3 to 2.0 hours |
| with eval/checkpoints | 1.8 to 3.0 hours |

Why live teacher inference is usually a bad idea here

If the teacher is an external managed teacher: - the training loop becomes network-bound, - teacher latency leaks into the training job, - reproducibility gets worse, - cost becomes hard to control.

If the teacher is another local LLM on the same GPU: - memory headroom disappears, - throughput collapses, - orchestration gets much harder.

Final A100 rule

For MangaAssist:

- classifier KD: same-GPU teacher + student is fine
- LLM response KD: pre-generate teacher outputs first, then train student offline


3.2 B. Multi-GPU: 2×A10G or 4×A100

The key question in multi-GPU distillation is:

should the teacher be replicated, sharded, or moved out of the critical training loop?

B1. DDP vs FSDP Intuition

Method What gets replicated Best when Main downside
DDP full model on each rank model fits easily, simple training high memory duplication
FSDP params/grad/optimizer sharded student is large more communication and config complexity

B2. Teacher Placement Strategy

Small teacher, small student

Example: DistilBERT → TinyBERT

Use plain replication or DDP-style teacher copies on each rank.

Why: - teacher is tiny, - communication overhead of sharding is not worth it, - code stays simple.

Large student, frozen teacher

Example: 8B student, smaller 1B-3B teacher

Use:

- teacher replicated and frozen
- student under FSDP

Why:

- you save memory where it matters most: the student
- frozen teacher does not need optimizer state
- teacher forward still happens locally on each rank

A minimal placement sketch follows.
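This sketch reuses local_rank and the process-group setup from the B3 pattern below; the bf16 mixed-precision policy is an illustrative choice.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Small frozen teacher: plain replicated module on every rank.
teacher = teacher_model.to(local_rank).eval()
for p in teacher.parameters():
    p.requires_grad = False

# Large student: parameters, gradients, and optimizer state are sharded.
student = FSDP(
    student_model.to(local_rank),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)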

Very large teacher and very large student

Usually do not run both live together unless you truly need online KD.

Instead: - generate teacher labels first, - train student separately.

B3. Minimal DDP teacher wrapper pattern

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

local_rank = setup()

teacher = teacher_model.to(local_rank).eval()
student = student_model.to(local_rank)

for p in teacher.parameters():
    p.requires_grad = False

# Teacher can stay as a plain module on each rank because it is frozen.
# Student is wrapped in DDP for gradient sync.
student = DDP(student, device_ids=[local_rank])

# kd_loss is assumed to combine the KL term with hard-label cross-entropy,
# e.g. the weighted sum from the KD-Zoo pattern in section 2.5.
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

for batch in train_loader:
    batch = {k: v.to(local_rank) for k, v in batch.items()}

    with torch.no_grad():
        teacher_logits = teacher(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        ).logits

    student_logits = student(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    ).logits

    loss = kd_loss(student_logits, teacher_logits, batch["labels"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Why the teacher is often not wrapped in DDP

The teacher is frozen, so there are: - no gradients, - no parameter updates, - no all-reduce need.

Wrapping it in DDP usually adds complexity without giving much back.

B4. When accelerate is the cleanest choice

Use accelerate when: - the team wants one codepath for 1 GPU and multi-GPU, - you may switch between DDP and FSDP, - you want to avoid hand-rolling launch logic.

Minimal accelerate pattern

import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")

teacher = teacher_model.eval()
student = student_model
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

student, optimizer, train_loader = accelerator.prepare(student, optimizer, train_loader)
teacher.to(accelerator.device)

for batch in train_loader:
    with torch.no_grad():
        teacher_logits = teacher(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        ).logits
    student_logits = student(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    ).logits
    loss = kd_loss(student_logits, teacher_logits, batch["labels"])
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

B5. Throughput planning estimates

2×A10G for classifier KD

| Metric | Estimate |
| --- | --- |
| global batch | 256 |
| samples/s | 550 to 900 |
| epoch time on 25K samples | 28 to 45 s |

This is only moderately better than single A100 because the workload is small enough that scaling efficiency is not perfect.

4×A100 for 8B student SFT-style response distillation

| Metric | Estimate |
| --- | --- |
| global batch | 32 to 64 |
| effective tokens/s | 8,000 to 14,000 |
| epoch time on 12K examples | 10 to 18 min |
| 3 epochs + eval | 40 to 80 min |

Final DDP/FSDP rule

  • Use DDP when both teacher and student fit comfortably.
  • Use FSDP on the student when the student becomes memory-heavy.
  • Avoid FSDP for a small frozen teacher unless there is a very unusual reason.

3.3 C. AWS Lambda / CPU-Only Inference for Distilled TinyBERT

This section is about serving, not training.

C1. Why ONNX + INT8 is the right target

Starting point after KD:

- TinyBERT accuracy: 89.3%
- FP32 PyTorch artifact: ~58 MB
- warm CPU inference is acceptable, but cold start and memory pressure still matter

After ONNX + INT8:

- artifact often drops toward ~15–20 MB depending on export details
- CPU inference becomes meaningfully faster
- cold start improves because model load is smaller
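A minimal handler sketch showing where the cold-start cost lives, assuming the INT8 artifact and tokenizer are bundled at a hypothetical /var/task/tinybert_onnx_int8 path and the request body carries an utterance field:

import json
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Loaded once per execution environment: this is the cold-start cost.
# Warm invocations reuse the session and tokenizer below.
MODEL_DIR = "/var/task/tinybert_onnx_int8"          # hypothetical bundle path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
session = ort.InferenceSession(
    f"{MODEL_DIR}/model_quantized.onnx",            # assumed optimum output name
    providers=["CPUExecutionProvider"],
)

def handler(event, context):
    text = json.loads(event["body"])["utterance"]
    enc = tokenizer(text, truncation=True, max_length=128, return_tensors="np")
    # Feed every tokenizer output the exported graph expects (BERT-family
    # students usually need input_ids, attention_mask, token_type_ids).
    feeds = {k: v.astype(np.int64) for k, v in enc.items()}
    logits = session.run(None, feeds)[0]
    intent_id = int(np.argmax(logits, axis=-1)[0])
    return {"statusCode": 200, "body": json.dumps({"intent_id": intent_id})}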

C2. Expected quality drop after quantization

Quantization is another compression step after KD, so yes, quality can drop slightly.

Planning estimate for MangaAssist intent accuracy

| Variant | Accuracy |
| --- | --- |
| TinyBERT fp32 | 89.3% |
| TinyBERT ONNX fp32 | 89.2% to 89.3% |
| TinyBERT ONNX dynamic INT8 | 88.7% to 89.1% |
| TinyBERT ONNX static INT8 | 88.4% to 89.0% |

Practical interpretation

A good INT8 path usually costs 0.2 to 0.6 accuracy points.
That is acceptable if the latency savings materially improve end-to-end routing.
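A minimal regression-check sketch comparing the fp32 and INT8 exports on the validation set, assuming a hypothetical val_intents.jsonl with text and label columns and the default LABEL_N id naming:

import numpy as np
from datasets import load_dataset
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

val = load_dataset("json", data_files="val_intents.jsonl", split="train")

def accuracy(model_dir, file_name=None):
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = ORTModelForSequenceClassification.from_pretrained(model_dir, file_name=file_name)
    clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
    preds = clf(list(val["text"]))
    # Assumes default "LABEL_<id>" names; adjust if id2label is configured.
    pred_ids = [int(p["label"].split("_")[-1]) for p in preds]
    return float(np.mean(np.array(pred_ids) == np.array(val["label"])))

fp32_acc = accuracy("./tinybert_onnx")
int8_acc = accuracy("./tinybert_onnx_int8", file_name="model_quantized.onnx")
print(f"fp32 {fp32_acc:.4f}  int8 {int8_acc:.4f}  delta {fp32_acc - int8_acc:.4f}")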

C3. Lambda latency estimates

These are planning estimates for:

- batch size 1
- seq_len 64 to 128
- model already loaded in memory for warm path

AWS documents that Lambda CPU scales with configured memory, so the 1024 MB configuration gets meaningfully more CPU than 512 MB.

Warm latency estimate

| Memory | Warm latency |
| --- | --- |
| 512 MB | 18 to 35 ms |
| 1024 MB | 10 to 20 ms |

Cold start estimate

| Memory | Cold start with ONNX INT8 |
| --- | --- |
| 512 MB | 140 to 260 ms |
| 1024 MB | 110 to 220 ms |

Main reasons 1024 MB often wins

Even though 1024 MB costs more per ms: - CPU allocation is higher, - inference time drops, - cost per request may be closer than expected, - p95 latency looks much better.

C4. When Lambda is enough

Use Lambda when: - traffic is bursty, - classifier load is light to moderate, - batch size stays 1, - cold starts are acceptable, - the model is <= ~20 MB after optimization.

C5. When Lambda stops being the best fit

Move to ECS/Fargate or always-warm service when: - p95 cold-start variance becomes a product problem, - traffic is constant enough that containerized serving is cheaper, - you want batching or multiple models in the same process, - you need consistent low-latency under load.

Final Lambda rule

For MangaAssist intent routing:

- Lambda + ONNX INT8 TinyBERT is enough for light-to-moderate classifier traffic.
- Move to ECS/Fargate when traffic stabilizes and p95 matters more than operational simplicity.


3.4 D. Apple Silicon (M2/M3) for Development and Local Dry Runs

Apple Silicon is good for development, but it is not a substitute for final CUDA validation.

D1. What MPS is good for

Use mps for: - verifying the loss function runs, - checking that teacher-student wiring is correct, - smoke-testing a small data subset, - sanity-checking epoch curves, - small-scale local SFT experiments.

D2. Practical precision choices

Recommended local choices:

- fp32 for maximum safety
- fp16 where stable
- do not plan around bf16 as a default assumption on MPS
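A minimal device and precision selection sketch for local dry runs:

import torch

# Prefer MPS when available; fall back to CPU otherwise.
# Setting PYTORCH_ENABLE_MPS_FALLBACK=1 lets unsupported ops fall back to CPU.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
dtype = torch.float32  # safest default; try fp16 only where it proves stable

student = student_model.to(device=device, dtype=dtype)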

D3. Realistic local batch sizes

DistilBERT → TinyBERT classifier KD on M2/M3

| RAM / unified memory | seq_len | Likely batch size |
| --- | --- | --- |
| 16 GB | 128 | 8 to 16 |
| 24 GB | 128 | 16 to 32 |
| 36+ GB | 128 | 32 to 64 |

Llama 3 8B response distillation

Full bf16/fp16 fine-tuning is usually not the right local choice.
Local runs are better for:

- LoRA dry runs
- data formatting checks
- short functional tests

D4. Limitations vs CUDA

| Area | Apple Silicon / MPS | CUDA |
| --- | --- | --- |
| bf16 planning confidence | weak | strong |
| multi-GPU | no practical multi-device path | standard |
| kernel coverage | improving but uneven | mature |
| distributed training | not the target path | standard |
| high-throughput training | limited | strong |
| final benchmark trust | low-medium | high |

D5. What should be re-run on NVIDIA before sign-off

Always re-run these on CUDA before production decisions:

- final throughput numbers
- final memory envelope
- final mixed precision stability
- final eval accuracy after longer training
- final quantization regression testing

Final Apple rule

Apple Silicon is excellent for:

- local dry runs
- debugging
- pipeline wiring
- small-scale KD smoke tests

It is not the final truth source for: - throughput, - memory headroom, - or production quality sign-off.


4. Numerical Tradeoff Summary

4.1 Best Stack by Use Case

| Use case | Best stack | Why |
| --- | --- | --- |
| classifier KD training | transformers + custom Trainer | fastest path, HF-native, enough flexibility |
| classifier feature KD experiments | TextBrewer or custom PyTorch | easier intermediate matching experiments |
| response distillation for 8B student | trl SFTTrainer | simplest and most production-ready for teacher-output imitation |
| advanced LM KD | trl DistillationTrainer | better if true sequence KD is needed |
| CPU serving of TinyBERT | optimum ONNX + INT8 | best fit for Lambda or x86 CPU |
| x86 optimized CPU serving | Optimum Intel / OpenVINO | strongest Intel CPU path |
| local quantized LLM serving | llama.cpp / llama-cpp-python | easiest cheap fallback deployment |

4.2 Minimum Setup for a Solo Engineer

| Stage | Recommended tool |
| --- | --- |
| teacher-output generation | OpenAI-managed teacher or other managed teacher |
| classifier KD | transformers + custom Trainer |
| LLM student training | trl SFTTrainer |
| classifier deployment | optimum ONNX INT8 |
| fallback LLM serving | llama-cpp-python if CPU-only, otherwise standard GPU runtime |

This is the highest-leverage path with the fewest moving parts.

4.3 Recommended Setup for a Production Team

| Area | Recommended team stack |
| --- | --- |
| experiment tracking | HF-compatible training + central metrics store |
| classifier KD | transformers or Lightning-based training wrapper |
| LLM student distillation | trl |
| distributed scaling | accelerate with DDP / FSDP |
| CPU deployment | optimum + OpenVINO where Intel fleets justify it |
| edge / local fallback | GGUF + llama.cpp serving |

5. Final Recommendations for MangaAssist

5.1 Best Library Stack for MangaAssist Classifier Distillation

Use transformers + custom Trainer for the main path.

Reason: - the teacher and student are transformer classifiers, - the ecosystem around datasets, tokenizers, evaluation, and export is strongest, - the team gets fast experimentation and clean integration with optimum.

5.2 Best Stack for MangaAssist LLM Response Distillation

Use offline teacher labeling + trl SFTTrainer first.

Reason: - simplest pipeline, - lowest operational complexity, - easy to scale once teacher outputs are in storage, - enough for response imitation before moving to more advanced sequence KD.

5.3 Best CPU-Serving Path

Use optimum ONNX INT8 for the TinyBERT classifier.
Use OpenVINO if the production fleet is Intel-heavy and sustained CPU throughput matters.

5.4 Minimum Hardware That Is “Enough”

  • for classifier KD: a single modern NVIDIA GPU is enough
  • for classifier CPU serving: Lambda or a tiny x86 service is enough
  • for LLM response distillation: a single A100 80GB-class GPU is enough when teacher outputs are pre-generated

5.5 Final Architecture Decision

For MangaAssist, the most practical production path is:

  1. generate high-quality teacher outputs offline,
  2. distill TinyBERT for intent on HF Trainer,
  3. train the 8B fallback student with trl using teacher outputs,
  4. export classifier to ONNX INT8,
  5. serve classifier on Lambda or x86 CPU,
  6. serve LLM fallback through a standard GPU stack or a quantized llama.cpp path depending on latency and cost targets.

That combination minimizes: - engineering complexity, - teacher-in-the-loop training cost, - deployment friction, - and CPU serving waste.


6. Source Grounding Notes

This expansion aligns with current official documentation patterns for:

- OpenAI supervised fine-tuning and distillation workflows
- Hugging Face Trainer, TRL, Accelerate, Optimum, and Optimum Intel
- PyTorch DDP, FSDP, and MPS
- AWS Lambda CPU scaling with memory

Use the official docs for exact API syntax at implementation time, because library signatures evolve faster than design choices.