vLLM Model Preparation And Quantization For MangaAssist
End-to-end guide for preparing, quantizing, evaluating, and promoting models and LoRA adapters for the vLLM-powered self-hosted generation path. Covers AWQ calibration, LoRA adapter management, artifact pipelines, and quality gates.
1. Scope
This document answers:
- How do we prepare the base model for vLLM serving?
- How is AWQ INT4 quantization performed and validated?
- How are LoRA adapters trained, versioned, and promoted?
- What quality gates prevent a bad model from reaching production?
- What is the CI/CD pipeline for model and adapter updates?
2. Base Model Selection And Preparation
Why Llama-3-8B-Instruct
| Requirement | Llama-3-8B-Instruct | Llama-3-70B | Mistral-7B-Instruct |
|---|---|---|---|
| Fits on single A10G (24 GB) after INT4 | Yes (4.5 GB) | No (35+ GB) | Yes (~4 GB) |
| Instruction following quality | Strong | Excellent | Good |
| Japanese language support | Adequate (fine-tuned via LoRA) | Better baseline | Weak |
| Community and ecosystem | Largest | Large | Growing |
| License for commercial use | Llama 3 Community License | Same | Apache 2.0 |
| vLLM support maturity | Full (PagedAttention, prefix caching, Multi-LoRA) | Full but needs multi-GPU | Full |
Decision: Llama-3-8B-Instruct. It fits a single A10G after AWQ quantization, leaving ~14 GB for KV cache. The 70B model would require tensor parallelism across multiple GPUs, multiplying cost and complexity, while the 8B model with domain LoRA adapters meets our quality targets.
Model Artifact Download
```python
"""
Download and convert model artifacts to the format vLLM expects.
Run this on a machine with sufficient disk space (50 GB+).
"""
from pathlib import Path

from huggingface_hub import snapshot_download

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
OUTPUT_DIR = Path("/models/llama-3-8b-instruct-base")


def download_base_model() -> Path:
    """Download model weights in safetensors format."""
    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=str(OUTPUT_DIR),
        allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
        ignore_patterns=["*.bin", "*.gguf", "*.ggml"],  # Only safetensors
    )
    return OUTPUT_DIR


def verify_download(model_dir: Path) -> None:
    """Verify all expected files are present."""
    required = [
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
    ]
    for f in required:
        assert (model_dir / f).exists(), f"Missing: {f}"
    safetensors = list(model_dir.glob("*.safetensors"))
    assert len(safetensors) > 0, "No safetensors files found"
    total_size = sum(f.stat().st_size for f in safetensors)
    assert total_size > 10_000_000_000, f"Model too small: {total_size / 1e9:.1f} GB"
    print(f"Verified: {len(safetensors)} safetensor files, {total_size / 1e9:.1f} GB total")
```
3. AWQ INT4 Quantization
What Is AWQ
AWQ (Activation-aware Weight Quantization) is a post-training quantization method that:
1. Identifies which weight channels are most important by analyzing activation distributions
2. Applies per-channel scaling to protect important channels from quantization error
3. Quantizes weights to 4-bit integers with group-size granularity
Why AWQ over other methods:
- GPTQ: Higher quality at the same bit-width but much slower calibration (hours vs minutes)
- bitsandbytes NF4: Dynamic quantization, no calibration needed, but no vLLM kernel support at our version
- FP8: Requires Hopper GPUs (H100); we use Ampere (A10G)
- AWQ: Fast calibration, vLLM has optimized GEMM kernels, battle-tested in production
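The group-quantization step can be sketched numerically. The snippet below is a minimal illustration of INT4 quantization over a 128-value group with a zero point, mirroring `w_bit=4`, `q_group_size=128`, `zero_point=True`; it is not AWQ itself, since AWQ's activation-aware scale search (which needs the calibration data) happens before this step and is omitted here.

```python
# Illustrative sketch only: plain INT4 group quantization with a zero point,
# the inner packing step; AWQ's per-channel scale search is not shown.
import numpy as np


def quantize_group(w: np.ndarray, n_bits: int = 4) -> tuple[np.ndarray, float, int]:
    """Asymmetric quantization of one weight group (e.g. 128 values)."""
    qmax = 2**n_bits - 1                       # 15 for INT4
    scale = (w.max() - w.min()) / qmax         # one FP scale per group
    zero_point = int(round(-w.min() / scale))  # maps w.min() to integer 0
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point


def dequantize_group(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=128).astype(np.float32)  # one group of 128 weights
q, s, z = quantize_group(w)
err = float(np.abs(dequantize_group(q, s, z) - w).max())
print(f"max reconstruction error: {err:.5f} (scale={s:.5f}, zero_point={z})")
```

The per-group reconstruction error is bounded by roughly half the scale, which is why smaller `q_group_size` improves quality (tighter min/max per group) at the cost of more stored scales.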
Calibration Dataset Preparation
The calibration dataset critically affects quantization quality. It must represent the actual inference distribution.
```python
"""
Prepare calibration dataset from MangaAssist production traffic.
Uses real request/response pairs to calibrate AWQ quantization.
"""
import json
from pathlib import Path

import boto3


def build_calibration_dataset(output_path: Path) -> Path:
    """
    Build calibration dataset from production inference traces.

    Design decisions:
    - 512 samples: Enough for stable calibration, not too many to slow it down
    - Distribution must match production: ~30% short factual, ~40% recommendation,
      ~20% detailed comparison, ~10% support queries
    - Include multi-turn conversations to calibrate attention patterns
    - Include both JP and EN content to calibrate tokenizer coverage
    """
    # Pull the curated calibration corpus from S3.
    # This corpus is refreshed monthly from production traces.
    s3 = boto3.client("s3")
    s3.download_file(
        "mangaassist-model-artifacts",
        "calibration/manga_calibration_v3.jsonl",
        str(output_path),
    )

    # Validate the adapter distribution of the downloaded corpus
    with open(output_path) as f:
        samples = [json.loads(line) for line in f]
    adapter_dist: dict[str, int] = {}
    for s in samples:
        adapter = s.get("adapter_id", "unknown")
        adapter_dist[adapter] = adapter_dist.get(adapter, 0) + 1

    print(f"Calibration dataset: {len(samples)} samples")
    for adapter, count in sorted(adapter_dist.items()):
        print(f"  {adapter}: {count} ({count / len(samples) * 100:.1f}%)")
    return output_path
```
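A distribution guard can make the "must match production" requirement enforceable rather than eyeballed. The sketch below checks observed category shares against the target mix from the docstring (~30/40/20/10); the category names and the 5-point tolerance are illustrative assumptions, not the exact production labels.

```python
# Sketch of a distribution gate for the calibration corpus. The target shares
# mirror the docstring; category names and tolerance are assumptions.
TARGET_MIX = {
    "short_factual": 0.30,
    "recommendation": 0.40,
    "detailed_comparison": 0.20,
    "support": 0.10,
}


def check_distribution(counts: dict[str, int], tolerance: float = 0.05) -> list[str]:
    """Return the categories whose observed share drifts more than `tolerance`
    from the target mix. An empty list means the corpus is usable."""
    total = sum(counts.values())
    drifted = []
    for category, target_share in TARGET_MIX.items():
        observed = counts.get(category, 0) / total
        if abs(observed - target_share) > tolerance:
            drifted.append(category)
    return drifted


# Example: a 512-sample corpus that over-samples recommendations
violations = check_distribution(
    {"short_factual": 120, "recommendation": 260, "detailed_comparison": 90, "support": 42}
)
print("drifted categories:", violations)
```

Wiring this into `build_calibration_dataset` would turn a distribution mismatch into a hard failure before quantization starts, instead of a quality regression discovered at eval time.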
Quantization Script
```python
"""
AWQ quantization for Llama-3-8B-Instruct.
Produces a vLLM-compatible quantized model.
"""
from pathlib import Path

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

BASE_MODEL_DIR = Path("/models/llama-3-8b-instruct-base")
CALIB_DATASET = Path("/models/calibration/manga_calibration_v3.jsonl")
OUTPUT_DIR = Path("/models/llama-3-8b-instruct-awq-int4")


def quantize_model() -> Path:
    """
    Run AWQ quantization.

    Key parameters:
    - w_bit=4: INT4 quantization (4.5 GB model size)
    - q_group_size=128: Standard group size, good balance of quality vs speed
    - zero_point=True: Includes zero-point for better asymmetric quantization
    - version="GEMM": GEMM-packed weights, the format vLLM's AWQ kernels load.
      (GEMV packing targets batch-size-1 decoding and is not suitable for vLLM serving)

    Expected duration: ~15-30 minutes on a single A10G
    Expected output size: ~4.5 GB (from ~16 GB FP16)
    """
    tokenizer = AutoTokenizer.from_pretrained(str(BASE_MODEL_DIR))
    model = AutoAWQForCausalLM.from_pretrained(str(BASE_MODEL_DIR))

    quant_config = {
        "w_bit": 4,
        "q_group_size": 128,
        "zero_point": True,
        "version": "GEMM",
    }
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=str(CALIB_DATASET),
    )
    model.save_quantized(str(OUTPUT_DIR))
    tokenizer.save_pretrained(str(OUTPUT_DIR))
    print(f"Quantized model saved to {OUTPUT_DIR}")
    return OUTPUT_DIR
```
Post-Quantization Validation
```python
"""
Validate quantized model quality against the unquantized baseline.
This runs a subset of the offline eval suite to detect quality regression.
"""
import json
from dataclasses import dataclass
from pathlib import Path

from vllm import LLM, SamplingParams


@dataclass
class QualityGate:
    metric: str
    threshold: float
    direction: str  # "higher_is_better" or "lower_is_better"


QUALITY_GATES = [
    QualityGate("bleu_score", 0.85, "higher_is_better"),
    QualityGate("factual_accuracy", 0.90, "higher_is_better"),
    QualityGate("language_quality", 0.88, "higher_is_better"),
    QualityGate("safety_score", 0.99, "higher_is_better"),
    QualityGate("hallucination_rate", 0.05, "lower_is_better"),
]


def validate_quantized_model(
    model_dir: Path,
    eval_dataset: Path,
    quality_gates: list[QualityGate] = QUALITY_GATES,
) -> dict:
    """
    Run quality validation on the quantized model.

    Steps:
    1. Load quantized model in vLLM
    2. Run eval dataset (200 test cases from the offline eval suite)
    3. Score against quality gates
    4. Return pass/fail with per-metric results

    Failing any quality gate blocks promotion to production.
    """
    llm = LLM(
        model=str(model_dir),
        quantization="awq",
        max_model_len=4096,
        gpu_memory_utilization=0.90,
    )
    sampling_params = SamplingParams(
        temperature=0.0,  # Deterministic for evaluation
        max_tokens=512,
    )

    with open(eval_dataset) as f:
        test_cases = [json.loads(line) for line in f]

    # Run inference on all test cases
    prompts = [tc["prompt"] for tc in test_cases]
    outputs = llm.generate(prompts, sampling_params)

    # Score each output against expected answers
    results = {}
    for gate in quality_gates:
        score = _compute_metric(gate.metric, test_cases, outputs)
        passed = (
            score >= gate.threshold
            if gate.direction == "higher_is_better"
            else score <= gate.threshold
        )
        results[gate.metric] = {
            "score": score,
            "threshold": gate.threshold,
            "passed": passed,
        }
        status = "PASS" if passed else "FAIL"
        print(f"  {gate.metric}: {score:.3f} (threshold: {gate.threshold}) [{status}]")

    all_passed = all(r["passed"] for r in results.values())
    print(f"\nOverall: {'PASS' if all_passed else 'FAIL'}")
    return {"passed": all_passed, "metrics": results}


def _compute_metric(metric_name: str, test_cases: list, outputs: list) -> float:
    """Compute a specific quality metric. Implementation depends on metric type."""
    # Placeholder — actual implementations use domain-specific scoring
    # See eval/metrics/ for full implementations
    raise NotImplementedError(f"Metric {metric_name} scorer not shown here")
```
VRAM Budget After Quantization
```
┌─────────────────────────────────────────────┐
│ A10G 24 GB VRAM Budget (AWQ INT4)           │
├─────────────────────────────────────────────┤
│ Model weights (AWQ INT4):          4.50 GB  │
│ 3× LoRA adapters (rank 16):        0.12 GB  │
│ CUDA workspace:                    1.50 GB  │
│ CUDA graphs:                       1.00 GB  │
│ OS / driver overhead:              0.96 GB  │
│ Safety headroom (8%):              1.92 GB  │
│ ─────────────────────────────────────────── │
│ Available for KV cache:           14.00 GB  │
│ ═══════════════════════════════════════════ │
│ Total:                            24.00 GB  │
└─────────────────────────────────────────────┘
```
KV cache capacity at block_size=16, max_model_len=4096:
→ ~128 concurrent sequences at average context length
→ This is why max_num_seqs=128 is the ceiling
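The KV-cache arithmetic behind the 128-sequence ceiling can be checked back-of-envelope. The sketch below uses Llama-3-8B's published attention shape (32 layers, 8 KV heads via GQA, head_dim 128, FP16 cache); the 896-token average context is an assumption chosen to illustrate how the ceiling falls out of the 14 GB budget.

```python
# Back-of-envelope check of the KV cache numbers above.
# Llama-3-8B config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

# K and V vectors per token, across all layers
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # 131,072 B = 128 KiB

kv_budget_gb = 14.0
total_tokens = kv_budget_gb * 1024**3 / bytes_per_token  # cacheable tokens

block_size = 16
total_blocks = int(total_tokens // block_size)  # PagedAttention blocks

avg_context = 896  # assumed average tokens per sequence in production
concurrent_seqs = int(total_tokens // avg_context)
print(f"{bytes_per_token / 1024:.0f} KiB/token, {total_tokens:,.0f} cacheable tokens, "
      f"{total_blocks:,} blocks, ~{concurrent_seqs} sequences at {avg_context}-token average")
```

Note that GQA is doing the heavy lifting here: with 32 full KV heads instead of 8, the per-token cost would quadruple and the same budget would hold only ~32 sequences.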
4. LoRA Adapter Management
Adapter Inventory
| Adapter ID | Purpose | Training data | Rank | Size | Status |
|---|---|---|---|---|---|
| manga_domain_v3 | Manga-specific knowledge (titles, editions, culture) | 15K manga Q&A pairs | 16 | ~40 MB | Production |
| general_support_v2 | Order tracking, returns, general CS | 10K support conversations | 16 | ~40 MB | Production |
| jp_style_v1 | Japanese language style and honorifics | 5K JP conversations | 16 | ~40 MB | Production |
Adapter Training Pipeline
```mermaid
graph LR
    subgraph "Data Preparation"
        D1["Curate training data"] --> D2["Quality filter"]
        D2 --> D3["Format for fine-tuning"]
    end
    subgraph "Training"
        D3 --> T1["LoRA fine-tuning\n(rank 16, alpha 32)"]
        T1 --> T2["Checkpoint selection"]
    end
    subgraph "Validation"
        T2 --> V1["Offline eval suite"]
        V1 --> V2["Quality gates"]
        V2 --> V3{Pass?}
    end
    subgraph "Promotion"
        V3 -->|Yes| P1["Register in artifact store"]
        P1 --> P2["Canary deployment"]
        P2 --> P3["Production rollout"]
        V3 -->|No| R1["Reject, iterate on training data"]
    end
```
LoRA Configuration For vLLM
```python
"""
LoRA adapter registration for vLLM Multi-LoRA serving.
"""
from vllm.lora.request import LoRARequest


def build_lora_requests() -> dict[str, LoRARequest]:
    """
    Build LoRA request objects for each adapter.

    Key settings:
    - lora_int_id: Unique integer per adapter, used internally by vLLM
    - lora_name: String ID used in API requests
    - lora_local_path: Path inside the container where adapter weights live

    Critical decision: All adapters must have the same rank and target modules.
    vLLM allocates a fixed LoRA slot size based on the first adapter loaded.
    Mixing ranks would waste memory or fail to load.
    """
    return {
        "manga_domain_v3": LoRARequest(
            lora_int_id=1,
            lora_name="manga_domain_v3",
            lora_local_path="/models/adapters/manga_domain_v3",
        ),
        "general_support_v2": LoRARequest(
            lora_int_id=2,
            lora_name="general_support_v2",
            lora_local_path="/models/adapters/general_support_v2",
        ),
        "jp_style_v1": LoRARequest(
            lora_int_id=3,
            lora_name="jp_style_v1",
            lora_local_path="/models/adapters/jp_style_v1",
        ),
    }
```
Adapter Versioning Strategy
Adapter naming: {domain}_{purpose}_v{major}
Artifact path: s3://mangaassist-model-artifacts/adapters/{adapter_id}/
Version policy:
- Major version (v1 → v2 → v3): New training data or architecture change
- Minor updates: Retrain on refreshed data with the same config, published under the same major version as a new hashed artifact (the CURRENT pointer's sha256 selects which one is live)
- All versions are immutable in S3; "promotion" updates a pointer file
Pointer file: s3://mangaassist-model-artifacts/adapters/{adapter_id}/CURRENT
→ Contains: {version}/{sha256_hash}
→ Example: v3/a1b2c3d4e5f6...
Rollback: Update CURRENT pointer to previous version, redeploy endpoint
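The pointer mechanics above are simple enough to sketch in pure Python. These helpers are hypothetical (the real pipeline reads and writes the CURRENT file via boto3); only the parsing and previous-version selection logic is shown.

```python
# Hypothetical helpers sketching the CURRENT-pointer mechanics described above.
# In the real pipeline, the pointer file contents come from S3 via boto3.
def parse_pointer(content: str) -> tuple[str, str]:
    """Split a CURRENT pointer ('{version}/{sha256_hash}') into its parts."""
    version, sha256 = content.strip().split("/", 1)
    return version, sha256


def previous_version(current: str, known_versions: list[str]) -> str:
    """Pick the rollback target: the version immediately before `current`.
    Assumes the simple 'v{major}' naming from the version policy above."""
    ordered = sorted(known_versions, key=lambda v: int(v.lstrip("v")))
    idx = ordered.index(current)
    if idx == 0:
        raise ValueError(f"{current} is the oldest version; nothing to roll back to")
    return ordered[idx - 1]


version, sha = parse_pointer("v3/a1b2c3d4e5f6")
target = previous_version(version, ["v1", "v2", "v3"])
print(f"current={version} (hash {sha[:8]}), rollback target={target}")
```

Because every versioned artifact is immutable, rollback is just rewriting the CURRENT file to the previous `{version}/{sha256}` value and redeploying; no artifact is ever modified.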
Adapter Training Script (Reference)
```python
"""
LoRA fine-tuning script for MangaAssist adapters.
Uses PEFT + Transformers on a single A10G.
"""
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def train_adapter(
    base_model_path: str,
    training_data_path: str,
    output_path: str,
    adapter_name: str,
) -> str:
    """
    Fine-tune a LoRA adapter.

    Critical parameters:
    - r=16: Rank 16 balances quality vs memory.
      We tested r=8 (slightly worse quality) and r=32 (marginal improvement, 2× memory)
    - lora_alpha=32: Standard 2× rank scaling factor
    - target_modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
      We target all linear layers for best quality
    - lora_dropout=0.05: Mild dropout to prevent overfitting on small datasets
    """
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        torch_dtype="auto",
        device_map="auto",
    )
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_dropout=0.05,
        bias="none",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    training_args = TrainingArguments(
        output_dir=output_path,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        warmup_ratio=0.05,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
        report_to="mlflow",
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=_load_training_data(training_data_path, tokenizer),
        tokenizer=tokenizer,
    )
    trainer.train()
    trainer.save_model(output_path)
    return output_path


def _load_training_data(path: str, tokenizer) -> object:
    """Load and tokenize training data. Implementation in data/prepare.py."""
    raise NotImplementedError("See data/prepare.py for full implementation")
```
5. CI/CD Pipeline For Model Updates
Pipeline Architecture
```mermaid
graph TD
    subgraph "Trigger"
        A1["New model artifact\npushed to S3"]
        A2["Scheduled monthly\nretraining"]
        A3["Manual promotion\nrequest"]
    end
    subgraph "Stage 1: Validation"
        B1["Format validation\n(safetensors, config)"]
        B2["Load test in vLLM\n(can it serve?)"]
        B3["Offline eval suite\n(200 test cases)"]
    end
    subgraph "Stage 2: Quality Gates"
        C1{"All quality\ngates pass?"}
    end
    subgraph "Stage 3: Canary Deploy"
        D1["Deploy to canary\nvariant (10% traffic)"]
        D2["Monitor for 1 hour"]
        D3{"SLO maintained?"}
    end
    subgraph "Stage 4: Production"
        E1["Shift to 100% traffic"]
        E2["Monitor for 24 hours"]
        E3["Archive old model"]
    end
    subgraph "Rollback"
        R1["Revert to previous\nmodel version"]
    end
    A1 --> B1
    A2 --> B1
    A3 --> B1
    B1 --> B2
    B2 --> B3
    B3 --> C1
    C1 -->|Yes| D1
    C1 -->|No| R1
    D1 --> D2
    D2 --> D3
    D3 -->|Yes| E1
    D3 -->|No| R1
    E1 --> E2
    E2 --> E3
```
Quality Gate Details
| Gate | Stage | Criteria | Blocks promotion? |
|---|---|---|---|
| Format validation | 1 | safetensors files present, config.json valid, tokenizer loads | Yes |
| Load test | 1 | vLLM can load model and serve 10 requests without error | Yes |
| BLEU score | 1 | ≥ 0.85 on eval dataset (vs reference answers) | Yes |
| Factual accuracy | 1 | ≥ 90% on fact-check subset (manga titles, dates, editions) | Yes |
| Language quality | 1 | ≥ 88% on fluency scoring (both EN and JP) | Yes |
| Safety score | 1 | ≥ 99% on safety eval (no harmful outputs) | Yes |
| Hallucination rate | 1 | ≤ 5% on grounded QA subset | Yes |
| Canary TTFT P95 | 3 | < 500 ms during 1-hour canary period | Yes |
| Canary error rate | 3 | < 0.5% during 1-hour canary period | Yes |
| Canary user satisfaction | 3 | No statistically significant drop (p < 0.05) | Warning only |
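The "no statistically significant drop (p < 0.05)" satisfaction gate can be implemented as a two-sided two-proportion z-test on thumbs-up rates. The sketch below is one reasonable stdlib-only implementation under that assumption; the production gate may use a different test.

```python
# One plausible implementation of the canary satisfaction gate: a two-sided
# two-proportion z-test on satisfaction rates, stdlib only (a sketch, not
# necessarily the exact production test).
import math


def satisfaction_drop_significant(
    control_ok: int, control_n: int,
    canary_ok: int, canary_n: int,
    alpha: float = 0.05,
) -> bool:
    """True if canary satisfaction differs significantly from control."""
    p1, p2 = control_ok / control_n, canary_ok / canary_n
    pooled = (control_ok + canary_ok) / (control_n + canary_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / canary_n))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value via erfc
    return p_value < alpha


# A 5-point drop on 1,000 requests per arm is significant; 0.5 points is not.
print(satisfaction_drop_significant(900, 1000, 850, 1000))  # → True
print(satisfaction_drop_significant(900, 1000, 895, 1000))  # → False
```

Since this gate is "Warning only", a significant result raises an alert for human review rather than blocking the rollout automatically.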
Rollback Procedure
Model rollback (< 5 minutes with warm pool):
1. Identify the previous known-good model version
→ Check: s3://mangaassist-model-artifacts/adapters/{id}/versions/
→ Or check the Git tag for the last successful promotion
2. Update the model pointer
→ Update CURRENT file to point to previous version
→ Or update the SageMaker model artifact path
3. Deploy new endpoint configuration
→ If using model baking: Deploy the previous Docker image tag
→ If using runtime download: Update the S3 path and restart
4. Verify rollback
→ Check /v1/models endpoint returns expected version
→ Check health endpoint returns 200
→ Run 5 sample requests and verify quality
5. Post-mortem
→ Document what went wrong with the new model
→ Update quality gates if the failure was a gap in coverage
6. Model Baking vs. Runtime Download
Decision Analysis
| Factor | Model baking (weights in Docker image) | Runtime download (weights from S3 at startup) |
|---|---|---|
| Cold start time | ~67s (model already on disk) | ~5-8 min (download 4.5 GB from S3) |
| Image size | ~8.2 GB (large, but build once) | ~3.5 GB (smaller image, separate weights) |
| Update frequency | Rebuild image for each model update | Just update S3 path, redeploy endpoint |
| Warm pool effectiveness | Excellent (model ready immediately) | Good (model cached on warm pool disk) |
| Rollback speed | Switch Docker tag (fast with pre-built images) | Update S3 pointer (fast if warm pool has previous version) |
Decision: Model baking for the base model. The ~67-second cold start (versus 5-8 minutes with runtime download) is critical for scaling responsiveness. The image build cost is paid once per model version, and model versions change infrequently (monthly). LoRA adapters are also baked because they are small (~40 MB each) and switching adapter versions requires a code change anyway.
7. Known Issues And Lessons Learned
AWQ Calibration Gotchas
- Calibration dataset too small: Using < 256 samples gives unstable quantization. Some weight groups get poor calibration, causing random hallucinations on specific topics. We saw this when an early calibration used only 128 samples and the model hallucinated manga publication dates.
- Calibration dataset distribution mismatch: If the calibration dataset is all English but production traffic is 30% Japanese, Japanese performance degrades significantly. The calibration dataset must reflect the actual production language mix.
- AWQ version compatibility: AWQ 0.1.7 produces different artifacts than 0.1.6. Pin the AWQ library version in the Docker image and version the calibration artifacts.
LoRA Adapter Pitfalls
- Rank mismatch across adapters: vLLM allocates a fixed-size LoRA slot. If adapter A has rank 16 and adapter B has rank 32, loading both fails silently (B gets truncated). All adapters must have the same rank.
- Target module mismatch: If an adapter was trained with target_modules=["q_proj", "v_proj"] but another uses all 7 modules, the one with fewer modules loads fine but wastes the pre-allocated memory for unused modules.
- Base model version drift: If the base model is re-quantized with a new calibration dataset, existing LoRA adapters may need retraining. The weight space shifts slightly, and adapters trained on the old quantization may produce slightly different outputs.
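The first two pitfalls are cheap to catch before serving. PEFT records both fields (`r` and `target_modules`) in each adapter's adapter_config.json, so a preflight check can refuse to start if adapters disagree. The sketch below operates on the parsed config dicts; in production they would come from json.load-ing each adapter's config file.

```python
# Preflight check for the rank and target-module pitfalls above.
# `configs` maps adapter name -> parsed adapter_config.json contents.
def check_adapter_compatibility(configs: dict[str, dict]) -> int:
    """Raise before serving if adapters disagree on rank or target modules.
    Returns the shared rank on success."""
    reference_name, reference_sig = None, None
    for name, cfg in configs.items():
        # Sort modules so declaration order does not matter
        sig = (cfg["r"], tuple(sorted(cfg["target_modules"])))
        if reference_sig is None:
            reference_name, reference_sig = name, sig
        elif sig != reference_sig:
            raise ValueError(
                f"{name} has rank={sig[0]}, modules={list(sig[1])}; "
                f"{reference_name} has rank={reference_sig[0]}, modules={list(reference_sig[1])}"
            )
    return reference_sig[0]


rank = check_adapter_compatibility({
    "manga_domain_v3": {"r": 16, "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]},
    "general_support_v2": {"r": 16, "target_modules": ["o_proj", "v_proj", "k_proj", "q_proj"]},
})
print(f"adapters compatible, shared rank={rank}")
```

Running this in the container entrypoint turns a silent truncation at load time into a loud startup failure.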
vLLM Version-Specific Notes
| Version transition | What changed | Impact |
|---|---|---|
| 0.4.1 → 0.4.2 | Prefix cache eviction policy changed | 15% hit rate drop; workaround: increase cache capacity |
| 0.4.2 → 0.4.3 | Eviction fixed, Multi-LoRA loading semantics changed | Need to re-register adapters; test adapter switching in staging |
| 0.4.x → 0.5.x | Block allocator rewritten | May affect block_size tuning; rerun memory validation |
8. Cross-References
- Scenario narratives: 01-vllm-game-changer-scenarios.md — Scenario 5 covers AWQ + OOM defense
- Low-level implementation: 02-vllm-low-level-implementation-and-critical-decisions.md — LoRA routing patterns
- Deployment and infrastructure: 04-vllm-deployment-and-infrastructure.md — Docker image build, model baking
- Monitoring and troubleshooting: 05-vllm-monitoring-and-troubleshooting.md — Post-deployment quality signals
- Interview prep: 03-vllm-interview-prep-deep-dive.md
- GPU architecture context: ../Model-Inference/01-inference-pipeline-challenges.md