
Open-Source High-Performance Libraries — How I Championed OSS for MangaAssist

How I identified, evaluated, and deployed open-source high-performance inference and ML libraries that transformed our chatbot's cost, latency, and scalability.


Philosophy: Why Open Source First?

When building MangaAssist, I established a principle early on: default to open-source, justify managed services. This wasn't anti-AWS — it was engineering pragmatism:

  1. Transparency — OSS lets you read the source, understand failure modes, and fix issues without waiting for a vendor ticket
  2. Performance Innovation — The open-source LLM ecosystem moves faster than any single vendor; libraries like vLLM and FlashAttention shipped months before equivalent managed features
  3. Cost Efficiency — Self-hosted OSS on owned infrastructure eliminates per-request markups at scale
  4. No Lock-In — Every OSS component has a migration path; managed services often don't
  5. Team Growth — Working with cutting-edge OSS keeps the team sharp and attractive for hiring

1. vLLM — The Game-Changer for LLM Inference

What Is vLLM?

vLLM is an open-source, high-throughput LLM serving engine from UC Berkeley's Sky Computing Lab. It is the most widely adopted inference engine in the industry — 73,500+ GitHub stars, 2,300+ contributors, used by Microsoft, and deployed in production at scale by LMSYS (Chatbot Arena).

The Core Innovation: PagedAttention

Traditional LLM inference wastes 60-80% of GPU memory due to how KV (key-value) cache is managed. Every token generated during autoregressive decoding requires keeping previous KV tensors in GPU memory, and since sequence lengths are unpredictable, systems pre-allocate the maximum possible memory — wasting most of it.
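A back-of-envelope sizing shows why this pre-allocation hurts. The sketch below uses illustrative Llama-13B-class dimensions (40 layers, 40 KV heads, head dim 128, fp16), not MangaAssist's actual serving config:

```python
# Back-of-envelope KV-cache sizing (illustrative Llama-13B-class shapes).
def kv_cache_bytes(seq_len, n_layers=40, n_heads=40, head_dim=128, dtype_bytes=2):
    # 2x for the key AND value tensors, stored per layer for every token.
    return 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes

per_token = kv_cache_bytes(1)             # bytes of KV cache per generated token
full_ctx = kv_cache_bytes(4096) / 2**30   # GiB if a 4096-token context is pre-allocated
print(f"{per_token / 2**20:.2f} MiB/token, {full_ctx:.2f} GiB per reserved sequence")
```

At ~3 GiB reserved per sequence, a batch of 32 requests would need ~100 GiB of KV cache alone, which is why naive pre-allocation caps concurrency on an 80 GB A100.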

vLLM's PagedAttention borrows from operating system virtual memory:

┌─────────────────────────────────────────────────────┐
│           OS Virtual Memory vs PagedAttention       │
├──────────────────┬──────────────────────────────────┤
│   OS Concept     │   PagedAttention Analog          │
├──────────────────┼──────────────────────────────────┤
│   Pages          │   KV cache blocks                │
│   Bytes          │   Tokens                         │
│   Processes      │   Sequences (requests)           │
│   Page table     │   Block table                    │
│   Copy-on-Write  │   Shared blocks for beam search  │
└──────────────────┴──────────────────────────────────┘

Physical GPU memory blocks are allocated on-demand as tokens are generated, mapped through a block table from logical to physical addresses. This eliminates pre-allocation waste entirely.

Result: Memory waste drops from 60-80% to under 4%.
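The block-table mechanism can be sketched in a few lines. This is a toy illustration of the idea, not vLLM's implementation; `BlockAllocator` is invented for the example (16 tokens per block mirrors vLLM's default block size):

```python
# Toy sketch of PagedAttention-style on-demand block allocation.
BLOCK_SIZE = 16  # tokens per physical KV-cache block

class BlockAllocator:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids (the "page table")
        self.lengths = {}        # seq_id -> tokens generated so far

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full: map one more physical block
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def blocks_used(self, seq_id):
        return len(self.block_tables.get(seq_id, []))

alloc = BlockAllocator(num_physical_blocks=64)
for _ in range(20):              # generate 20 tokens for one request
    alloc.append_token("req-1")
# 20 tokens occupy ceil(20/16) = 2 blocks; nothing was reserved up front.
print(alloc.blocks_used("req-1"))  # -> 2
```

Because blocks are mapped only as tokens arrive, the only waste is the unfilled tail of each sequence's last block, which is where the "under 4%" figure comes from.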

How I Deployed vLLM at MangaAssist

We used vLLM for serving our fine-tuned models — specifically the intent classifier (DistilBERT), the reranker (ms-marco-MiniLM), and a fine-tuned Llama model for Japanese manga domain knowledge that Bedrock didn't support.

Deployment Architecture:

graph LR
    subgraph SageMaker["SageMaker Endpoint"]
        A[vLLM Serving Container]
        B[PagedAttention Engine]
        C[Continuous Batching]
        D[Prefix Caching]
    end

    E[API Gateway] --> F[Load Balancer]
    F --> SageMaker
    SageMaker --> G[OpenAI-Compatible API]

    H[Bedrock — Claude] -.->|"Primary LLM"| E
    SageMaker -.->|"Fine-tuned models"| E

Key vLLM Features We Leveraged:

| Feature | How We Used It | Impact |
| --- | --- | --- |
| PagedAttention | Near-zero memory waste on GPU | Fit 2x more concurrent requests per GPU |
| Continuous Batching | Dynamic request batching instead of fixed windows | 40% latency reduction during traffic spikes |
| Automatic Prefix Caching | System prompt + conversation history reuse | 35% reduction in redundant computation for multi-turn chats |
| Streaming | Token-by-token streaming to chat widget | Users see first token in <500ms instead of waiting 2-3s for the full response |
| OpenAI-Compatible API | Drop-in replacement for OpenAI client SDK | Zero application code changes when switching inference backends |
| Multi-LoRA | Serve manga-domain and general LoRA adapters from one base model | 50% fewer GPU instances (one endpoint serves both) |
| Quantization (AWQ) | INT4 quantization of Llama model | 3x memory reduction with <2% quality loss |
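Most of these features are enabled at launch time. A sketch of a single launch command with placeholder model and adapter paths (flags as in recent vLLM CLI releases; AWQ serving assumes a pre-quantized checkpoint):

```shell
# One OpenAI-compatible endpoint serving an AWQ-quantized base model plus
# both LoRA adapters. Model ID and adapter paths are placeholders.
vllm serve my-org/llama-13b-manga-awq \
  --quantization awq \
  --enable-prefix-caching \
  --enable-lora \
  --lora-modules manga-domain=/models/lora/manga general=/models/lora/general \
  --max-model-len 4096
```

Clients then talk to it with the standard OpenAI SDK by pointing `base_url` at the endpoint, which is what made backend switches a zero-code-change operation.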

Benchmark Results: vLLM vs. Alternatives

I ran comparative benchmarks before adoption, using our actual MangaAssist workload (Japanese manga Q&A, avg input 200 tokens, avg output 150 tokens):

┌──────────────────────────────────────────────────────────────────┐
│  Throughput (requests/sec) — Llama-13B on A100 (80GB)           │
├────────────────────┬────────────┬────────────┬──────────────────┤
│  Engine            │  Batch=1   │  Batch=32  │  Max Concurrent  │
├────────────────────┼────────────┼────────────┼──────────────────┤
│  HF Transformers   │  1.2       │  8.4       │  12              │
│  HF TGI            │  3.8       │  22.1      │  35              │
│  vLLM              │  4.1       │  38.6      │  85              │
│  TensorRT-LLM      │  4.5       │  41.2      │  90              │
└────────────────────┴────────────┴────────────┴──────────────────┘

Improvement: at Batch=32, vLLM delivers ~4.6x the throughput of HF Transformers and ~1.7x that of TGI, with ~7x HF Transformers' concurrent capacity.
TensorRT-LLM was ~7% faster but required NVIDIA-specific compilation.

┌──────────────────────────────────────────────────────────────────┐
│  Time to First Token (TTFT) — P95 latency in milliseconds       │
├────────────────────┬────────────┬────────────┬──────────────────┤
│  Engine            │  Short     │  Medium    │  Long Context    │
│                    │  (64 tok)  │  (512 tok) │  (2048 tok)      │
├────────────────────┼────────────┼────────────┼──────────────────┤
│  HF Transformers   │  180ms     │  920ms     │  3,800ms         │
│  HF TGI            │  95ms      │  380ms     │  1,400ms         │
│  vLLM              │  82ms      │  290ms     │  1,100ms         │
│  TensorRT-LLM      │  65ms      │  240ms     │  950ms           │
└────────────────────┴────────────┴────────────┴──────────────────┘

Why vLLM Won Over TensorRT-LLM

Despite TensorRT-LLM being 5-10% faster, I chose vLLM because:

| Criteria | vLLM | TensorRT-LLM |
| --- | --- | --- |
| Hardware Lock-In | Runs on NVIDIA, AMD, Intel, TPU | NVIDIA-only |
| Setup Complexity | pip install vllm + Docker | Multi-step compilation per model + GPU arch |
| Model Support | 50+ architectures out of the box | Requires manual engine build per model |
| LoRA Hot-Swap | Built-in multi-LoRA | Limited support |
| Community | 73K+ stars, rapid fixes | Smaller community, NVIDIA-gated |
| Operational Burden | Low — just a Docker container | High — requires CUDA version management |
| Migration Risk | Can switch to any hardware | Locked to NVIDIA ecosystem |

Decision: The 5-10% raw performance gap didn't justify the operational complexity and vendor lock-in of TensorRT-LLM. vLLM's broader hardware support also gave us a migration path to AWS Inferentia/Trainium in the future.

Business Impact

Before vLLM (HF Transformers + custom serving):
  - 8x A100 GPUs for fine-tuned model serving
  - $32,000/month compute cost
  - P95 latency: 920ms (medium context)

After vLLM deployment:
  - 4x A100 GPUs (50% reduction)
  - $16,000/month compute cost
  - P95 latency: 290ms (medium context)

Net savings: $16,000/month ($192K/year)
Latency improvement: 68% reduction

2. FlashAttention — Faster Attention, Less Memory

What Is FlashAttention?

FlashAttention (by Tri Dao et al.) is an IO-aware, exact attention algorithm that reduces the memory footprint of self-attention from O(N^2) to O(N) and achieves 2-4x speedup by minimizing GPU memory reads/writes (HBM accesses).
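The memory argument is easy to quantify: standard attention materializes the full N x N score matrix in HBM per head, while FlashAttention computes it in SRAM-sized tiles and never writes it out. Illustrative fp16 numbers:

```python
# HBM footprint of the attention score matrix that standard attention
# materializes (per head, per layer); FlashAttention avoids this entirely.
def score_matrix_bytes(n_tokens, dtype_bytes=2):
    return n_tokens * n_tokens * dtype_bytes  # fp16 -> 2 bytes per score

for n in (512, 2048, 8192):
    mib = score_matrix_bytes(n) / 2**20
    print(f"N={n:5d}: {mib:8.1f} MiB per head")  # grows quadratically in N
```

Multiplied across heads and layers, that quadratic term dominates HBM traffic at long contexts, which is where the 2-4x speedup comes from.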

How We Used It

FlashAttention is integrated directly into vLLM's serving engine, but we also used it independently for:

  1. Fine-tuning DistilBERT — FlashAttention-2 reduced fine-tuning time by 2x on our manga intent classification dataset
  2. Embedding model training — When fine-tuning Titan embeddings wasn't an option, we trained custom embeddings with FlashAttention for memory efficiency
  3. Cross-encoder reranker — FlashAttention enabled us to run the ms-marco-MiniLM reranker on smaller GPU instances (ml.g4dn.xlarge instead of ml.g5.xlarge)

Savings: ~$3,200/month in training and inference costs by running on smaller instances.


3. Hugging Face Transformers & Ecosystem

What We Used

| Library | Purpose | Why |
| --- | --- | --- |
| transformers | Model loading, tokenization, fine-tuning | Industry standard; every model we evaluated was available |
| datasets | Training data management | Streaming, lazy loading, Apache Arrow backend |
| tokenizers | Fast tokenization (Rust-based) | 10-100x faster than Python tokenizers |
| peft (Parameter-Efficient Fine-Tuning) | LoRA adapters for domain fine-tuning | Train adapters in hours instead of days; <1% of parameters |
| accelerate | Multi-GPU training orchestration | Zero-code distributed training |
| evaluate | Model evaluation metrics | Standardized metrics (BLEU, ROUGE, F1, accuracy) |

LoRA Fine-Tuning Pipeline

graph LR
    A[Manga Q&A Dataset] --> B[tokenizers — fast tokenization]
    B --> C[peft — LoRA config]
    C --> D[accelerate — multi-GPU training]
    D --> E[evaluate — quality metrics]
    E --> F{Quality Gate}
    F -->|Pass| G[vLLM — serve with LoRA adapter]
    F -->|Fail| H[Iterate: adjust rank, learning rate]
    H --> C

LoRA Configuration That Worked Best:

from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    r=16,              # Rank — sweet spot for our task
    lora_alpha=32,     # Scaling factor
    lora_dropout=0.05, # Light dropout
    target_modules=["q_proj", "v_proj"],  # Attention projections only
    task_type=TaskType.CAUSAL_LM,
)
# 0.1% of total parameters trained
# Training time: 4 hours on 2x A100
# Quality: 94% of full fine-tune at 1% of cost
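The "0.1% of parameters" figure checks out arithmetically. A sketch using illustrative Llama-13B-class shapes (hidden size 5120, 40 layers; q_proj and v_proj each a square 5120 x 5120 matrix):

```python
# LoRA adds two small matrices per target weight: A (d x r) and B (r x d).
d, r, layers, targets = 5120, 16, 40, 2            # q_proj + v_proj per layer
lora_params = layers * targets * (d * r + r * d)   # A and B for every target matrix
base_params = 13_000_000_000                       # ~13B base model

print(f"{lora_params:,} trainable params "
      f"({100 * lora_params / base_params:.2f}% of base)")
```

About 13M trainable parameters against a 13B base, i.e. roughly 0.1%, which is why adapter training fits in hours on 2x A100.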


4. LangChain — Orchestration Framework

What Is It?

LangChain is an open-source framework for building LLM-powered applications, providing abstractions for chains, agents, retrieval, and memory management.

How We Used It

| Component | LangChain Abstraction | Our Usage |
| --- | --- | --- |
| RAG Pipeline | RetrievalQA chain | Product Q&A with OpenSearch vector retrieval |
| Conversation Memory | ConversationBufferWindowMemory | Last-K turns with DynamoDB backing store |
| Prompt Templates | ChatPromptTemplate | Structured prompts with dynamic context injection |
| Output Parsing | PydanticOutputParser | Enforce structured JSON responses for product recommendations |
| Agent Routing | RouterChain | Route to different chains based on intent classification |

Why LangChain (and Its Limitations)

Why we adopted it:

  - Rapid prototyping — MVP in weeks instead of months
  - Community-contributed integrations (Bedrock, OpenSearch, DynamoDB)
  - Standardized abstractions that survived our first 3 architecture pivots

Limitations we hit:

  - Abstraction overhead added ~15ms per chain call
  - Debugging was difficult — stack traces went through 10+ layers
  - Version churn — breaking changes every few weeks in early 2024

Mitigation: We used LangChain for orchestration but wrote performance-critical paths (intent classification, caching, guardrails) as raw code — no framework overhead where latency mattered.
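On the hot path, "raw code" routing can be as simple as a dict dispatch. A sketch with hypothetical handler names (in production the intent label comes from the fine-tuned DistilBERT classifier):

```python
# Framework-free intent routing: no chain abstraction, no per-call overhead.
def handle_product_qa(query): return f"[product-qa] {query}"
def handle_recommendation(query): return f"[recommend] {query}"
def handle_smalltalk(query): return f"[smalltalk] {query}"

ROUTES = {
    "product_qa": handle_product_qa,
    "recommendation": handle_recommendation,
    "smalltalk": handle_smalltalk,
}

def route(intent, query, fallback=handle_smalltalk):
    # Unknown or low-confidence intents fall through to a safe default handler.
    return ROUTES.get(intent, fallback)(query)

print(route("product_qa", "When does One Piece vol. 100 ship?"))
```

A plain dispatch like this is trivially profilable and keeps stack traces one frame deep, which was the point of pulling latency-critical paths out of the framework.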


5. Sentence-Transformers — Embedding Quality

What Is It?

Sentence-Transformers is an open-source library for computing dense vector embeddings for text, built on HuggingFace Transformers.

How We Used It

While our primary embedding model was Amazon Titan Embeddings V2 (managed), we used Sentence-Transformers for:

  1. Embedding evaluation — Comparing Titan vs. open-source alternatives (e5-large-v2, bge-large-en-v1.5, multilingual-e5-large) on our manga corpus
  2. Custom embedding fine-tuning — Fine-tuned multilingual-e5-large on manga-specific queries for Japanese text
  3. Semantic caching — Used lightweight all-MiniLM-L6-v2 to detect semantically similar queries and serve cached responses

Semantic Caching Impact:

Cache hit rate:     42% of all queries (semantically similar to previous queries)
Latency for hits:   8ms (vs. 800ms for full LLM call)
Cost savings:       ~$12,000/month in avoided Bedrock invocations
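The caching idea can be sketched with a stdlib cosine check. `SemanticCache` here is illustrative; in production the embeddings come from all-MiniLM-L6-v2 rather than the toy 3-dimensional vectors below:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.entries = []          # list of (embedding, cached_response)
        self.threshold = threshold

    def get(self, query_emb):
        best = max(self.entries, key=lambda e: cosine(e[0], query_emb), default=None)
        if best and cosine(best[0], query_emb) >= self.threshold:
            return best[1]         # hit: skip the LLM call entirely
        return None

    def put(self, query_emb, response):
        self.entries.append((query_emb, response))

cache = SemanticCache()
cache.put([0.9, 0.1, 0.0], "One Piece has 100+ volumes.")
print(cache.get([0.88, 0.12, 0.01]))   # near-duplicate query -> cache hit
```

A real deployment would add TTLs and an approximate-nearest-neighbor index instead of the linear scan, but the hit/miss logic is exactly this threshold check.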


6. OpenSearch — Why We Chose It Over Dedicated Vector DBs

| Factor | OpenSearch Serverless | Pinecone | Weaviate |
| --- | --- | --- | --- |
| Managed within AWS | Native | External API | Self-hosted or Cloud |
| Data egress cost | $0 (same VPC) | $0.09/GB | Varies |
| Hybrid search (BM25 + vector) | Built-in | Vector-only | Built-in |
| IAM integration | Native | API keys | Custom |
| Serverless option | Yes | Yes | No (cloud or self-hosted) |
| HNSW performance | Comparable | Comparable | Comparable |

Decision: OpenSearch's hybrid search (BM25 keyword + vector similarity) was critical for manga titles where exact matching matters (e.g., "One Piece" should match exactly, not just semantically).
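A hybrid query of the kind described can be sketched as an OpenSearch DSL body; the index name, field names, and 384-dimension vector are placeholders:

```python
# BM25 keyword clause + k-NN vector clause in one bool query: either can
# contribute to the final score, so exact title matches are not drowned out.
query_vector = [0.01] * 384        # would come from the embedding model

hybrid_query = {
    "size": 10,
    "query": {
        "bool": {
            "should": [
                {"match": {"title": {"query": "One Piece", "boost": 1.5}}},
                {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
            ]
        }
    },
}
# client.search(index="manga-products", body=hybrid_query)  # via opensearch-py
```

Boosting the keyword clause is one way to make "One Piece" rank on the exact match first while still letting semantically related titles surface below it.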


7. ONNX Runtime — Cross-Platform Model Optimization

How We Used It

We converted our DistilBERT intent classifier to ONNX format for:

  1. Inference speed — ONNX Runtime with graph optimization: 40% faster than PyTorch inference
  2. Hardware portability — Same ONNX model runs on GPU, CPU, and Inferentia (via Neuron)
  3. Smaller deployment — No PyTorch dependency in production container (image size: 800MB → 200MB)

PyTorch (eager):     12ms per inference (GPU)
ONNX Runtime:        7ms per inference (GPU)
ONNX on Inferentia:  4ms per inference (AWS Inferentia chip)
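The export itself is a one-time step. A sketch using Hugging Face Optimum's CLI, with a placeholder path standing in for our fine-tuned checkpoint:

```shell
# Export the fine-tuned classifier to ONNX; the resulting model.onnx can then
# be served by ONNX Runtime without any PyTorch dependency in the container.
optimum-cli export onnx \
  --model ./distilbert-intent-classifier \
  --task text-classification \
  onnx_out/
```

ONNX Runtime's graph-level optimizations (operator fusion, constant folding) are applied at session load, which is where most of the 40% speedup over eager PyTorch comes from.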

8. Weights & Biases (W&B) — Experiment Tracking

How We Used It

W&B was our primary experiment tracking platform during model development (before transitioning to MLflow for production tracing):

| Feature | Usage |
| --- | --- |
| Experiment Tracking | Every fine-tuning run logged with hyperparameters, metrics, and artifacts |
| Sweeps (Hyperparameter Tuning) | Bayesian optimization for LoRA rank, learning rate, batch size |
| Artifact Versioning | Model checkpoints versioned and linked to training runs |
| Tables | Evaluation results visualized across model versions |
| Reports | Shared with data scientists for collaborative model selection |
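A sweep definition of the kind described looks like the following; the metric name and parameter ranges are illustrative, and the dict would be handed to `wandb.sweep(...)` with agents then pulling trials via `wandb.agent(...)`:

```python
# Bayesian sweep over the LoRA hyperparameters mentioned above.
sweep_config = {
    "method": "bayes",                                  # Bayesian optimization
    "metric": {"name": "eval/f1", "goal": "maximize"},  # what the sweep optimizes
    "parameters": {
        "lora_rank": {"values": [4, 8, 16, 32]},
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 5e-4,
        },
        "batch_size": {"values": [8, 16, 32]},
    },
}
# sweep_id = wandb.sweep(sweep_config, project="mangaassist-lora")
# wandb.agent(sweep_id, function=train)
```

The r=16 "sweet spot" in the LoRA section was the outcome of exactly this kind of sweep rather than a hand-picked value.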

Why we later added MLflow: W&B is excellent for R&D but lacks production tracing capabilities (per-request trace visualization, latency breakdown, streaming support). MLflow Tracing filled this gap — see 03-mlflow-llm-observability.md.


Summary: Open-Source Impact on MangaAssist

| Library | Where Used | Key Metric Improvement |
| --- | --- | --- |
| vLLM | Self-hosted model serving | 50% GPU reduction, 68% latency improvement |
| FlashAttention | Training + inference | 2x training speedup, smaller instance sizes |
| HuggingFace (transformers, peft) | Fine-tuning pipeline | 94% quality at 1% of full fine-tuning cost |
| LangChain | RAG orchestration | MVP delivery in 3 weeks |
| Sentence-Transformers | Semantic caching, embedding eval | 42% cache hit rate, $12K/month savings |
| OpenSearch | Vector + keyword hybrid search | 15% higher recall than vector-only |
| ONNX Runtime | Intent classification | 40% faster inference, 75% smaller container |
| W&B | Experiment tracking | Systematic model selection; reproducible results |

Total annual savings from OSS adoption: ~$370K+

The key insight: open-source doesn't mean "cheap and risky." It means transparent, benchmarkable, and community-hardened. Every library above has thousands of production deployments validating its reliability before we ever ran it on MangaAssist traffic.