vLLM Model Preparation And Quantization For MangaAssist
End-to-end guide for preparing, quantizing, evaluating, and promoting models and LoRA adapters for the vLLM-powered self-hosted generation path. Covers AWQ calibration, LoRA adapter management, artifact pipelines, and quality gates.
1. Scope
This document answers:
- How do we prepare the base model for vLLM serving?
- How is AWQ INT4 quantization performed and validated?
- How are LoRA adapters trained, versioned, and promoted?
- What quality gates prevent a bad model from reaching production?
- What is the CI/CD pipeline for model and adapter updates?
2. Base Model Selection And Preparation
Why Llama-3-8B-Instruct
| Requirement | Llama-3-8B-Instruct | Llama-3-70B | Mistral-7B-Instruct |
|---|---|---|---|
| Fits on single A10G (24 GB) after INT4 | Yes (4.5 GB) | No (35+ GB) | Yes (~4 GB) |
| Instruction following quality | Strong | Excellent | Good |
| Japanese language support | Adequate (fine-tuned via LoRA) | Better baseline | Weak |
| Community and ecosystem | Largest | Large | Growing |
| License for commercial use | Llama 3 Community License | Same | Apache 2.0 |
| vLLM support maturity | Full (PagedAttention, prefix caching, Multi-LoRA) | Full but needs multi-GPU | Full |
Decision: Llama-3-8B-Instruct. It fits a single A10G after AWQ quantization, leaving ~14 GB for KV cache. The 70B model would require tensor parallelism across multiple GPUs, multiplying cost and complexity, while the 8B model with domain LoRA adapters meets our quality targets.
Model Artifact Download
```python
"""
Download and convert model artifacts to the format vLLM expects.
Run this on a machine with sufficient disk space (50 GB+).
"""
from pathlib import Path

from huggingface_hub import snapshot_download

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
OUTPUT_DIR = Path("/models/llama-3-8b-instruct-base")


def download_base_model() -> Path:
    """Download model weights in safetensors format."""
    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=str(OUTPUT_DIR),
        allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
        ignore_patterns=["*.bin", "*.gguf", "*.ggml"],  # Only safetensors
    )
    return OUTPUT_DIR


def verify_download(model_dir: Path) -> None:
    """Verify all expected files are present."""
    required = [
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
    ]
    for f in required:
        assert (model_dir / f).exists(), f"Missing: {f}"
    safetensors = list(model_dir.glob("*.safetensors"))
    assert len(safetensors) > 0, "No safetensors files found"
    total_size = sum(f.stat().st_size for f in safetensors)
    assert total_size > 10_000_000_000, f"Model too small: {total_size / 1e9:.1f} GB"
    print(f"Verified: {len(safetensors)} safetensor files, {total_size / 1e9:.1f} GB total")
```
3. AWQ INT4 Quantization
What Is AWQ
AWQ (Activation-aware Weight Quantization) is a post-training quantization method that:
1. Identifies which weight channels are most important by analyzing activation distributions
2. Applies per-channel scaling to protect important channels from quantization error
3. Quantizes weights to 4-bit integers with group-size granularity
Why AWQ over other methods:
- GPTQ: Higher quality at the same bit-width but much slower calibration (hours vs minutes)
- bitsandbytes NF4: Dynamic quantization, no calibration needed, but no vLLM kernel support at our version
- FP8: Requires Hopper GPUs (H100); we use Ampere (A10G)
- AWQ: Fast calibration, vLLM has optimized GEMM kernels, battle-tested in production
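The group-quantization step can be sketched numerically. The snippet below is a minimal illustration of INT4 quantization over a 128-value group with a zero point, mirroring `w_bit=4`, `q_group_size=128`, `zero_point=True`; it is not AWQ itself, since AWQ's activation-aware scale search (which needs the calibration data) happens before this step and is omitted here.

```python
# Illustrative sketch only: plain INT4 group quantization with a zero point,
# the inner packing step; AWQ's per-channel scale search is not shown.
import numpy as np


def quantize_group(w: np.ndarray, n_bits: int = 4) -> tuple[np.ndarray, float, int]:
    """Asymmetric quantization of one weight group (e.g. 128 values)."""
    qmax = 2**n_bits - 1                       # 15 for INT4
    scale = (w.max() - w.min()) / qmax         # one FP scale per group
    zero_point = int(round(-w.min() / scale))  # maps w.min() to integer 0
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point


def dequantize_group(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=128).astype(np.float32)  # one group of 128 weights
q, s, z = quantize_group(w)
err = float(np.abs(dequantize_group(q, s, z) - w).max())
print(f"max reconstruction error: {err:.5f} (scale={s:.5f}, zero_point={z})")
```

The per-group reconstruction error is bounded by roughly half the scale, which is why smaller `q_group_size` improves quality (tighter min/max per group) at the cost of more stored scales.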
Calibration Dataset Preparation
The calibration dataset critically affects quantization quality. It must represent the actual inference distribution.
```python
"""
Prepare calibration dataset from MangaAssist production traffic.
Uses real request/response pairs to calibrate AWQ quantization.
"""
import json
from pathlib import Path

import boto3


def build_calibration_dataset(output_path: Path) -> Path:
    """
    Build calibration dataset from production inference traces.

    Design decisions:
    - 512 samples: Enough for stable calibration, not too many to slow it down
    - Distribution must match production: ~30% short factual, ~40% recommendation,
      ~20% detailed comparison, ~10% support queries
    - Include multi-turn conversations to calibrate attention patterns
    - Include both JP and EN content to calibrate tokenizer coverage
    """
    # Pull the curated calibration corpus from S3.
    # This corpus is refreshed monthly from production traces.
    s3 = boto3.client("s3")
    s3.download_file(
        "mangaassist-model-artifacts",
        "calibration/manga_calibration_v3.jsonl",
        str(output_path),
    )

    # Validate the adapter distribution of the downloaded corpus
    with open(output_path) as f:
        samples = [json.loads(line) for line in f]
    adapter_dist: dict[str, int] = {}
    for s in samples:
        adapter = s.get("adapter_id", "unknown")
        adapter_dist[adapter] = adapter_dist.get(adapter, 0) + 1

    print(f"Calibration dataset: {len(samples)} samples")
    for adapter, count in sorted(adapter_dist.items()):
        print(f"  {adapter}: {count} ({count / len(samples) * 100:.1f}%)")
    return output_path
```
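A distribution guard can make the "must match production" requirement enforceable rather than eyeballed. The sketch below checks observed category shares against the target mix from the docstring (~30/40/20/10); the category names and the 5-point tolerance are illustrative assumptions, not the exact production labels.

```python
# Sketch of a distribution gate for the calibration corpus. The target shares
# mirror the docstring; category names and tolerance are assumptions.
TARGET_MIX = {
    "short_factual": 0.30,
    "recommendation": 0.40,
    "detailed_comparison": 0.20,
    "support": 0.10,
}


def check_distribution(counts: dict[str, int], tolerance: float = 0.05) -> list[str]:
    """Return the categories whose observed share drifts more than `tolerance`
    from the target mix. An empty list means the corpus is usable."""
    total = sum(counts.values())
    drifted = []
    for category, target_share in TARGET_MIX.items():
        observed = counts.get(category, 0) / total
        if abs(observed - target_share) > tolerance:
            drifted.append(category)
    return drifted


# Example: a 512-sample corpus that over-samples recommendations
violations = check_distribution(
    {"short_factual": 120, "recommendation": 260, "detailed_comparison": 90, "support": 42}
)
print("drifted categories:", violations)
```

Wiring this into `build_calibration_dataset` would turn a distribution mismatch into a hard failure before quantization starts, instead of a quality regression discovered at eval time.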
Quantization Script
```python
"""
AWQ quantization for Llama-3-8B-Instruct.
Produces a vLLM-compatible quantized model.
"""
from pathlib import Path

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

BASE_MODEL_DIR = Path("/models/llama-3-8b-instruct-base")
CALIB_DATASET = Path("/models/calibration/manga_calibration_v3.jsonl")
OUTPUT_DIR = Path("/models/llama-3-8b-instruct-awq-int4")


def quantize_model() -> Path:
    """
    Run AWQ quantization.

    Key parameters:
    - w_bit=4: INT4 quantization (4.5 GB model size)
    - q_group_size=128: Standard group size, good balance of quality vs speed
    - zero_point=True: Includes zero-point for better asymmetric quantization
    - version="GEMM": GEMM-packed weights, the format vLLM's AWQ kernels load.
      (GEMV packing targets batch-size-1 decoding and is not suitable for vLLM serving)

    Expected duration: ~15-30 minutes on a single A10G
    Expected output size: ~4.5 GB (from ~16 GB FP16)
    """
    tokenizer = AutoTokenizer.from_pretrained(str(BASE_MODEL_DIR))
    model = AutoAWQForCausalLM.from_pretrained(str(BASE_MODEL_DIR))

    quant_config = {
        "w_bit": 4,
        "q_group_size": 128,
        "zero_point": True,
        "version": "GEMM",
    }
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=str(CALIB_DATASET),
    )
    model.save_quantized(str(OUTPUT_DIR))
    tokenizer.save_pretrained(str(OUTPUT_DIR))
    print(f"Quantized model saved to {OUTPUT_DIR}")
    return OUTPUT_DIR
```
Post-Quantization Validation
```python
"""
Validate quantized model quality against the unquantized baseline.
This runs a subset of the offline eval suite to detect quality regression.
"""
import json
from dataclasses import dataclass
from pathlib import Path

from vllm import LLM, SamplingParams


@dataclass
class QualityGate:
    metric: str
    threshold: float
    direction: str  # "higher_is_better" or "lower_is_better"


QUALITY_GATES = [
    QualityGate("bleu_score", 0.85, "higher_is_better"),
    QualityGate("factual_accuracy", 0.90, "higher_is_better"),
    QualityGate("language_quality", 0.88, "higher_is_better"),
    QualityGate("safety_score", 0.99, "higher_is_better"),
    QualityGate("hallucination_rate", 0.05, "lower_is_better"),
]


def validate_quantized_model(
    model_dir: Path,
    eval_dataset: Path,
    quality_gates: list[QualityGate] = QUALITY_GATES,
) -> dict:
    """
    Run quality validation on the quantized model.

    Steps:
    1. Load quantized model in vLLM
    2. Run eval dataset (200 test cases from the offline eval suite)
    3. Score against quality gates
    4. Return pass/fail with per-metric results

    Failing any quality gate blocks promotion to production.
    """
    llm = LLM(
        model=str(model_dir),
        quantization="awq",
        max_model_len=4096,
        gpu_memory_utilization=0.90,
    )
    sampling_params = SamplingParams(
        temperature=0.0,  # Deterministic for evaluation
        max_tokens=512,
    )

    with open(eval_dataset) as f:
        test_cases = [json.loads(line) for line in f]

    # Run inference on all test cases
    prompts = [tc["prompt"] for tc in test_cases]
    outputs = llm.generate(prompts, sampling_params)

    # Score each output against expected answers
    results = {}
    for gate in quality_gates:
        score = _compute_metric(gate.metric, test_cases, outputs)
        passed = (
            score >= gate.threshold
            if gate.direction == "higher_is_better"
            else score <= gate.threshold
        )
        results[gate.metric] = {
            "score": score,
            "threshold": gate.threshold,
            "passed": passed,
        }
        status = "PASS" if passed else "FAIL"
        print(f"  {gate.metric}: {score:.3f} (threshold: {gate.threshold}) [{status}]")

    all_passed = all(r["passed"] for r in results.values())
    print(f"\nOverall: {'PASS' if all_passed else 'FAIL'}")
    return {"passed": all_passed, "metrics": results}


def _compute_metric(metric_name: str, test_cases: list, outputs: list) -> float:
    """Compute a specific quality metric. Implementation depends on metric type."""
    # Placeholder — actual implementations use domain-specific scoring
    # See eval/metrics/ for full implementations
    raise NotImplementedError(f"Metric {metric_name} scorer not shown here")
```
VRAM Budget After Quantization
```
┌─────────────────────────────────────────────┐
│ A10G 24 GB VRAM Budget (AWQ INT4)           │
├─────────────────────────────────────────────┤
│ Model weights (AWQ INT4):          4.50 GB  │
│ 3× LoRA adapters (rank 16):        0.12 GB  │
│ CUDA workspace:                    1.50 GB  │
│ CUDA graphs:                       1.00 GB  │
│ OS / driver overhead:              0.96 GB  │
│ Safety headroom (8%):              1.92 GB  │
│ ─────────────────────────────────────────── │
│ Available for KV cache:           14.00 GB  │
│ ═══════════════════════════════════════════ │
│ Total:                            24.00 GB  │
└─────────────────────────────────────────────┘
```
KV cache capacity at block_size=16, max_model_len=4096:
→ ~128 concurrent sequences at average context length
→ This is why max_num_seqs=128 is the ceiling
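The KV-cache arithmetic behind the 128-sequence ceiling can be checked back-of-envelope. The sketch below uses Llama-3-8B's published attention shape (32 layers, 8 KV heads via GQA, head_dim 128, FP16 cache); the 896-token average context is an assumption chosen to illustrate how the ceiling falls out of the 14 GB budget.

```python
# Back-of-envelope check of the KV cache numbers above.
# Llama-3-8B config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

# K and V vectors per token, across all layers
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # 131,072 B = 128 KiB

kv_budget_gb = 14.0
total_tokens = kv_budget_gb * 1024**3 / bytes_per_token  # cacheable tokens

block_size = 16
total_blocks = int(total_tokens // block_size)  # PagedAttention blocks

avg_context = 896  # assumed average tokens per sequence in production
concurrent_seqs = int(total_tokens // avg_context)
print(f"{bytes_per_token / 1024:.0f} KiB/token, {total_tokens:,.0f} cacheable tokens, "
      f"{total_blocks:,} blocks, ~{concurrent_seqs} sequences at {avg_context}-token average")
```

Note that GQA is doing the heavy lifting here: with 32 full KV heads instead of 8, the per-token cost would quadruple and the same budget would hold only ~32 sequences.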
4. LoRA Adapter Management
Adapter Inventory
| Adapter ID | Purpose | Training data | Rank | Size | Status |
|---|---|---|---|---|---|
| manga_domain_v3 | Manga-specific knowledge (titles, editions, culture) | 15K manga Q&A pairs | 16 | ~40 MB | Production |
| general_support_v2 | Order tracking, returns, general CS | 10K support conversations | 16 | ~40 MB | Production |
| jp_style_v1 | Japanese language style and honorifics | 5K JP conversations | 16 | ~40 MB | Production |
Adapter Training Pipeline
```mermaid
graph LR
    subgraph "Data Preparation"
        D1["Curate training data"] --> D2["Quality filter"]
        D2 --> D3["Format for fine-tuning"]
    end
    subgraph "Training"
        D3 --> T1["LoRA fine-tuning\n(rank 16, alpha 32)"]
        T1 --> T2["Checkpoint selection"]
    end
    subgraph "Validation"
        T2 --> V1["Offline eval suite"]
        V1 --> V2["Quality gates"]
        V2 --> V3{Pass?}
    end
    subgraph "Promotion"
        V3 -->|Yes| P1["Register in artifact store"]
        P1 --> P2["Canary deployment"]
        P2 --> P3["Production rollout"]
        V3 -->|No| R1["Reject, iterate on training data"]
    end
```
LoRA Configuration For vLLM
```python
"""
LoRA adapter registration for vLLM Multi-LoRA serving.
"""
from vllm.lora.request import LoRARequest


def build_lora_requests() -> dict[str, LoRARequest]:
    """
    Build LoRA request objects for each adapter.

    Key settings:
    - lora_int_id: Unique integer per adapter, used internally by vLLM
    - lora_name: String ID used in API requests
    - lora_local_path: Path inside the container where adapter weights live

    Critical decision: All adapters must have the same rank and target modules.
    vLLM allocates a fixed LoRA slot size based on the first adapter loaded.
    Mixing ranks would waste memory or fail to load.
    """
    return {
        "manga_domain_v3": LoRARequest(
            lora_int_id=1,
            lora_name="manga_domain_v3",
            lora_local_path="/models/adapters/manga_domain_v3",
        ),
        "general_support_v2": LoRARequest(
            lora_int_id=2,
            lora_name="general_support_v2",
            lora_local_path="/models/adapters/general_support_v2",
        ),
        "jp_style_v1": LoRARequest(
            lora_int_id=3,
            lora_name="jp_style_v1",
            lora_local_path="/models/adapters/jp_style_v1",
        ),
    }
```
Adapter Versioning Strategy
Adapter naming: {domain}_{purpose}_v{major}
Artifact path: s3://mangaassist-model-artifacts/adapters/{adapter_id}/
Version policy:
- Major version (v1 → v2 → v3): New training data or architecture change
- Minor updates: Retrain on refreshed data with the same config, published under the same major version as a new hashed artifact (the CURRENT pointer's sha256 selects which one is live)
- All versions are immutable in S3; "promotion" updates a pointer file
Pointer file: s3://mangaassist-model-artifacts/adapters/{adapter_id}/CURRENT
→ Contains: {version}/{sha256_hash}
→ Example: v3/a1b2c3d4e5f6...
Rollback: Update CURRENT pointer to previous version, redeploy endpoint
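The pointer mechanics above are simple enough to sketch in pure Python. These helpers are hypothetical (the real pipeline reads and writes the CURRENT file via boto3); only the parsing and previous-version selection logic is shown.

```python
# Hypothetical helpers sketching the CURRENT-pointer mechanics described above.
# In the real pipeline, the pointer file contents come from S3 via boto3.
def parse_pointer(content: str) -> tuple[str, str]:
    """Split a CURRENT pointer ('{version}/{sha256_hash}') into its parts."""
    version, sha256 = content.strip().split("/", 1)
    return version, sha256


def previous_version(current: str, known_versions: list[str]) -> str:
    """Pick the rollback target: the version immediately before `current`.
    Assumes the simple 'v{major}' naming from the version policy above."""
    ordered = sorted(known_versions, key=lambda v: int(v.lstrip("v")))
    idx = ordered.index(current)
    if idx == 0:
        raise ValueError(f"{current} is the oldest version; nothing to roll back to")
    return ordered[idx - 1]


version, sha = parse_pointer("v3/a1b2c3d4e5f6")
target = previous_version(version, ["v1", "v2", "v3"])
print(f"current={version} (hash {sha[:8]}), rollback target={target}")
```

Because every versioned artifact is immutable, rollback is just rewriting the CURRENT file to the previous `{version}/{sha256}` value and redeploying; no artifact is ever modified.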
Adapter Training Script (Reference)
```python
"""
LoRA fine-tuning script for MangaAssist adapters.
Uses PEFT + Transformers on a single A10G.
"""
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def train_adapter(
    base_model_path: str,
    training_data_path: str,
    output_path: str,
    adapter_name: str,
) -> str:
    """
    Fine-tune a LoRA adapter.

    Critical parameters:
    - r=16: Rank 16 balances quality vs memory.
      We tested r=8 (slightly worse quality) and r=32 (marginal improvement, 2× memory)
    - lora_alpha=32: Standard 2× rank scaling factor
    - target_modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
      We target all linear layers for best quality
    - lora_dropout=0.05: Mild dropout to prevent overfitting on small datasets
    """
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        torch_dtype="auto",
        device_map="auto",
    )
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_dropout=0.05,
        bias="none",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    training_args = TrainingArguments(
        output_dir=output_path,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        warmup_ratio=0.05,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
        report_to="mlflow",
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=_load_training_data(training_data_path, tokenizer),
        tokenizer=tokenizer,
    )
    trainer.train()
    trainer.save_model(output_path)
    return output_path


def _load_training_data(path: str, tokenizer) -> object:
    """Load and tokenize training data. Implementation in data/prepare.py."""
    raise NotImplementedError("See data/prepare.py for full implementation")
```
5. CI/CD Pipeline For Model Updates
Pipeline Architecture
```mermaid
graph TD
    subgraph "Trigger"
        A1["New model artifact\npushed to S3"]
        A2["Scheduled monthly\nretraining"]
        A3["Manual promotion\nrequest"]
    end
    subgraph "Stage 1: Validation"
        B1["Format validation\n(safetensors, config)"]
        B2["Load test in vLLM\n(can it serve?)"]
        B3["Offline eval suite\n(200 test cases)"]
    end
    subgraph "Stage 2: Quality Gates"
        C1{"All quality\ngates pass?"}
    end
    subgraph "Stage 3: Canary Deploy"
        D1["Deploy to canary\nvariant (10% traffic)"]
        D2["Monitor for 1 hour"]
        D3{"SLO maintained?"}
    end
    subgraph "Stage 4: Production"
        E1["Shift to 100% traffic"]
        E2["Monitor for 24 hours"]
        E3["Archive old model"]
    end
    subgraph "Rollback"
        R1["Revert to previous\nmodel version"]
    end
    A1 --> B1
    A2 --> B1
    A3 --> B1
    B1 --> B2
    B2 --> B3
    B3 --> C1
    C1 -->|Yes| D1
    C1 -->|No| R1
    D1 --> D2
    D2 --> D3
    D3 -->|Yes| E1
    D3 -->|No| R1
    E1 --> E2
    E2 --> E3
```
Quality Gate Details
| Gate | Stage | Criteria | Blocks promotion? |
|---|---|---|---|
| Format validation | 1 | safetensors files present, config.json valid, tokenizer loads | Yes |
| Load test | 1 | vLLM can load model and serve 10 requests without error | Yes |
| BLEU score | 1 | ≥ 0.85 on eval dataset (vs reference answers) | Yes |
| Factual accuracy | 1 | ≥ 90% on fact-check subset (manga titles, dates, editions) | Yes |
| Language quality | 1 | ≥ 88% on fluency scoring (both EN and JP) | Yes |
| Safety score | 1 | ≥ 99% on safety eval (no harmful outputs) | Yes |
| Hallucination rate | 1 | ≤ 5% on grounded QA subset | Yes |
| Canary TTFT P95 | 3 | < 500 ms during 1-hour canary period | Yes |
| Canary error rate | 3 | < 0.5% during 1-hour canary period | Yes |
| Canary user satisfaction | 3 | No statistically significant drop (p < 0.05) | Warning only |
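The "no statistically significant drop (p < 0.05)" satisfaction gate can be implemented as a two-sided two-proportion z-test on thumbs-up rates. The sketch below is one reasonable stdlib-only implementation under that assumption; the production gate may use a different test.

```python
# One plausible implementation of the canary satisfaction gate: a two-sided
# two-proportion z-test on satisfaction rates, stdlib only (a sketch, not
# necessarily the exact production test).
import math


def satisfaction_drop_significant(
    control_ok: int, control_n: int,
    canary_ok: int, canary_n: int,
    alpha: float = 0.05,
) -> bool:
    """True if canary satisfaction differs significantly from control."""
    p1, p2 = control_ok / control_n, canary_ok / canary_n
    pooled = (control_ok + canary_ok) / (control_n + canary_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / canary_n))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value via erfc
    return p_value < alpha


# A 5-point drop on 1,000 requests per arm is significant; 0.5 points is not.
print(satisfaction_drop_significant(900, 1000, 850, 1000))  # → True
print(satisfaction_drop_significant(900, 1000, 895, 1000))  # → False
```

Since this gate is "Warning only", a significant result raises an alert for human review rather than blocking the rollout automatically.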
Rollback Procedure
Model rollback (< 5 minutes with warm pool):
1. Identify the previous known-good model version
→ Check: s3://mangaassist-model-artifacts/adapters/{id}/versions/
→ Or check the Git tag for the last successful promotion
2. Update the model pointer
→ Update CURRENT file to point to previous version
→ Or update the SageMaker model artifact path
3. Deploy new endpoint configuration
→ If using model baking: Deploy the previous Docker image tag
→ If using runtime download: Update the S3 path and restart
4. Verify rollback
→ Check /v1/models endpoint returns expected version
→ Check health endpoint returns 200
→ Run 5 sample requests and verify quality
5. Post-mortem
→ Document what went wrong with the new model
→ Update quality gates if the failure was a gap in coverage
6. Model Baking vs. Runtime Download
Decision Analysis
| Factor | Model baking (weights in Docker image) | Runtime download (weights from S3 at startup) |
|---|---|---|
| Cold start time | ~67s (model already on disk) | ~5-8 min (download 4.5 GB from S3) |
| Image size | ~8.2 GB (large, but build once) | ~3.5 GB (smaller image, separate weights) |
| Update frequency | Rebuild image for each model update | Just update S3 path, redeploy endpoint |
| Warm pool effectiveness | Excellent (model ready immediately) | Good (model cached on warm pool disk) |
| Rollback speed | Switch Docker tag (fast with pre-built images) | Update S3 pointer (fast if warm pool has previous version) |
Decision: Model baking for the base model. The ~67-second cold start (versus 5-8 minutes with runtime download) is critical for scaling responsiveness. The image build cost is paid once per model version, and model versions change infrequently (monthly). LoRA adapters are also baked because they are small (~40 MB each) and switching adapter versions requires a code change anyway.
7. Known Issues And Lessons Learned
AWQ Calibration Gotchas
- Calibration dataset too small: Using < 256 samples gives unstable quantization. Some weight groups get poor calibration, causing random hallucinations on specific topics. We saw this when an early calibration used only 128 samples and the model hallucinated manga publication dates.
- Calibration dataset distribution mismatch: If the calibration dataset is all English but production traffic is 30% Japanese, Japanese performance degrades significantly. The calibration dataset must reflect the actual production language mix.
- AWQ version compatibility: AWQ 0.1.7 produces different artifacts than 0.1.6. Pin the AWQ library version in the Docker image and version the calibration artifacts.
LoRA Adapter Pitfalls
- Rank mismatch across adapters: vLLM allocates a fixed-size LoRA slot. If adapter A has rank 16 and adapter B has rank 32, loading both fails silently (B gets truncated). All adapters must have the same rank.
- Target module mismatch: If an adapter was trained with target_modules=["q_proj", "v_proj"] but another uses all 7 modules, the one with fewer modules loads fine but wastes the pre-allocated memory for unused modules.
- Base model version drift: If the base model is re-quantized with a new calibration dataset, existing LoRA adapters may need retraining. The weight space shifts slightly, and adapters trained on the old quantization may produce slightly different outputs.
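The first two pitfalls are cheap to catch before serving. PEFT records both fields (`r` and `target_modules`) in each adapter's adapter_config.json, so a preflight check can refuse to start if adapters disagree. The sketch below operates on the parsed config dicts; in production they would come from json.load-ing each adapter's config file.

```python
# Preflight check for the rank and target-module pitfalls above.
# `configs` maps adapter name -> parsed adapter_config.json contents.
def check_adapter_compatibility(configs: dict[str, dict]) -> int:
    """Raise before serving if adapters disagree on rank or target modules.
    Returns the shared rank on success."""
    reference_name, reference_sig = None, None
    for name, cfg in configs.items():
        # Sort modules so declaration order does not matter
        sig = (cfg["r"], tuple(sorted(cfg["target_modules"])))
        if reference_sig is None:
            reference_name, reference_sig = name, sig
        elif sig != reference_sig:
            raise ValueError(
                f"{name} has rank={sig[0]}, modules={list(sig[1])}; "
                f"{reference_name} has rank={reference_sig[0]}, modules={list(reference_sig[1])}"
            )
    return reference_sig[0]


rank = check_adapter_compatibility({
    "manga_domain_v3": {"r": 16, "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]},
    "general_support_v2": {"r": 16, "target_modules": ["o_proj", "v_proj", "k_proj", "q_proj"]},
})
print(f"adapters compatible, shared rank={rank}")
```

Running this in the container entrypoint turns a silent truncation at load time into a loud startup failure.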
vLLM Version-Specific Notes
| Version transition | What changed | Impact |
|---|---|---|
| 0.4.1 → 0.4.2 | Prefix cache eviction policy changed | 15% hit rate drop; workaround: increase cache capacity |
| 0.4.2 → 0.4.3 | Eviction fixed, Multi-LoRA loading semantics changed | Need to re-register adapters; test adapter switching in staging |
| 0.4.x → 0.5.x | Block allocator rewritten | May affect block_size tuning; rerun memory validation |
8. Cross-References
- Scenario narratives: 01-vllm-game-changer-scenarios.md — Scenario 5 covers AWQ + OOM defense
- Low-level implementation: 02-vllm-low-level-implementation-and-critical-decisions.md — LoRA routing patterns
- Deployment and infrastructure: 04-vllm-deployment-and-infrastructure.md — Docker image build, model baking
- Monitoring and troubleshooting: 05-vllm-monitoring-and-troubleshooting.md — Post-deployment quality signals
- Interview prep: 03-vllm-interview-prep-deep-dive.md
- GPU architecture context: ../Model-Inference/01-inference-pipeline-challenges.md