06. GPU Architecture Challenges — MangaAssist
"Every GPU dollar we burned on wasted memory, idle capacity, or poor batching was a dollar we couldn't spend on better models, more features, or lower latency. GPU architecture isn't an ops problem — it's a product problem."
Overview
MangaAssist ran five ML models in production simultaneously — a DistilBERT intent classifier, Titan Embeddings V2, a cross-encoder reranker, a fine-tuned Llama 3 8B, and Claude 3.5 Sonnet via Bedrock. The self-hosted models (DistilBERT, the reranker, and the fine-tuned Llama) ran on GPU-backed SageMaker endpoints. This document covers every GPU-level architecture challenge we hit, the user stories that drove each fix, and the exact implementation approach.
GPU Challenges at a Glance
| # | Challenge | Root Cause | Solution | Impact |
|---|---|---|---|---|
| 1 | KV Cache Memory Fragmentation | Static KV cache pre-allocation wastes 60-80% GPU VRAM | PagedAttention (vLLM) | ~15x concurrent request capacity per GPU |
| 2 | Low GPU Utilization from Fixed Batching | Requests served one-at-a-time or in fixed windows | Continuous Batching | 40% latency reduction during spikes |
| 3 | GPU Cold Start Latency | Model weights loaded on first request (45-60s) | Warmup + min-instance pinning | Eliminated 99% of cold starts |
| 4 | Idle GPU Waste in Off-Peak Hours | Always-on GPU instances with no work to do | Predictive scaling + Inferentia migration | $2.2K/month saved |
| 5 | GPU OOM on Long Conversation Turns | Multi-turn context grows KV cache unboundedly | Context windowing + quantization | Zero OOM kills in production |
| 6 | Multi-Model GPU Contention | Each model claimed an isolated GPU, sharing was impossible | Multi-LoRA on shared base model | 50% fewer GPU instances |
| 7 | Slow GPU Provisioning During Traffic Spikes | SageMaker scale-up: 5-8 minutes | Predictive scaling + step scaling | Spike absorption in <90 seconds |
| 8 | INT8/INT4 Quality Degradation | Naive quantization hurt Japanese-language quality | AWQ calibration on manga corpus | <2% quality loss with 3x memory reduction |
User Story 1 — KV Cache Memory Fragmentation
User Story
AS A platform engineer responsible for GPU cost efficiency,
I WANT KV cache memory to be allocated dynamically instead of pre-allocated per request,
SO THAT I can fit more concurrent user sessions on each GPU without provisioning additional instances.
Background & Problem
When MangaAssist first launched with naive HuggingFace Transformers inference, our profiling revealed a shocking number: 72% of allocated GPU VRAM was sitting unused on the KV cache alone.
Every autoregressive generation step requires keeping key-value tensors for all previous tokens in GPU memory. The naive approach pre-allocates the maximum possible sequence length (2048 tokens in our case) for every request — even if the actual response is 50 tokens long.
Request A: actual 180 tokens → allocated 2048 tokens worth of KV memory (1868 wasted)
Request B: actual 220 tokens → allocated 2048 tokens worth of KV memory (1828 wasted)
Request C: actual 95 tokens → allocated 2048 tokens worth of KV memory (1953 wasted)
Total waste: ~72% of KV cache VRAM
GPU: ml.g4dn.xlarge (16GB VRAM)
Max concurrent requests with naive approach: 4-6
This directly translated to user-facing latency. During traffic spikes, new requests queued waiting for GPU memory to free up from one request before the next could be loaded.
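For intuition on why static allocation is so expensive, the per-token KV footprint follows directly from the model shape. A back-of-envelope sketch using Llama-3-8B's published dimensions (32 layers, 8 KV heads via grouped-query attention, head dim 128, BF16) — an estimate, not a measured number:

```python
# Back-of-envelope KV cache footprint for a Llama-3-8B-class model.
N_LAYERS = 32
N_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # BF16

def kv_bytes_per_token() -> int:
    # Keys + values, for every layer
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

def kv_bytes_per_request(max_seq_len: int = 2048) -> int:
    # Static allocation reserves the full max sequence length up front
    return kv_bytes_per_token() * max_seq_len

print(kv_bytes_per_token())             # 131072 bytes = 128 KB per token
print(kv_bytes_per_request() // 2**20)  # 256 MB reserved per request
```

At 256 MB reserved per request regardless of actual output length, a 16GB card fills up after a handful of concurrent sessions — which matches the 4-6 request ceiling we observed.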
Implementation Detail
Step 1 — Profiled the Baseline
import torch

def profile_kv_cache_utilization(model, tokenizer, prompts):
    """Measure actual vs. allocated KV cache memory."""
    results = []
    for prompt in prompts:
        torch.cuda.reset_peak_memory_stats()
        mem_before = torch.cuda.memory_allocated()
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                return_dict_in_generate=True,
            )
        actual_tokens = output.sequences.shape[-1]
        peak_mem = torch.cuda.max_memory_allocated()
        baseline_mem_per_token = (peak_mem - mem_before) / 2048  # pre-allocated max
        actual_mem_needed = baseline_mem_per_token * actual_tokens
        utilization = actual_mem_needed / (peak_mem - mem_before)
        results.append({
            "prompt_len": inputs["input_ids"].shape[-1],
            "output_len": actual_tokens,
            "peak_vram_mb": peak_mem / 1e6,
            "kv_utilization": utilization,
        })
    return results

# Result: avg utilization = 0.28 (72% waste)
Step 2 — Adopted vLLM with PagedAttention
PagedAttention maps the KV cache using virtual pages (analogous to OS virtual memory), allocating physical GPU memory blocks only as tokens are generated:
┌────────────────────────────────────────────────────────┐
│ PagedAttention Block Layout │
├────────────────────────┬───────────────────────────────┤
│ OS Concept │ PagedAttention Analog │
├────────────────────────┼───────────────────────────────┤
│ Memory pages │ KV cache blocks (16 tokens) │
│ Physical memory │ GPU VRAM blocks │
│ Virtual address space │ Logical KV sequence │
│ Page table │ Block table per request │
│ Copy-on-Write │ Shared blocks (beam search) │
└────────────────────────┴───────────────────────────────┘
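The analogy can be made concrete with a small sketch (illustrative only, not vLLM internals): physical blocks come out of a shared free pool one block at a time, so a 20-token response holds 2 blocks instead of the 128 that a static 2048-token reservation would claim.

```python
# Minimal sketch of PagedAttention-style block allocation: physical blocks
# are grabbed from a shared free pool only when a sequence generates into them.
BLOCK_SIZE = 16  # tokens per KV cache block

class BlockTable:
    def __init__(self, free_blocks: list[int]):
        self.free = free_blocks      # shared physical block pool
        self.blocks: list[int] = []  # logical -> physical mapping for one sequence
        self.num_tokens = 0

    def append_token(self) -> tuple[int, int]:
        """Return (physical_block, offset) for the next token."""
        offset = self.num_tokens % BLOCK_SIZE
        if offset == 0:              # current block full (or first token)
            self.blocks.append(self.free.pop())
        self.num_tokens += 1
        return self.blocks[-1], offset

pool = list(range(100))
seq = BlockTable(pool)
for _ in range(20):                  # generate a 20-token response
    seq.append_token()
print(len(seq.blocks))               # 2 blocks held, not 128
```

The rest of the pool stays available for other concurrent sequences, which is where the capacity gain comes from.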
Step 3 — SageMaker Deployment Config
# vllm_serving_container/serve.py
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams

def build_engine():
    engine_args = AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # fine-tuned checkpoint
        tensor_parallel_size=1,        # single A10G on ml.g5.xlarge
        gpu_memory_utilization=0.92,   # leave 8% headroom for CUDA ops
        max_num_seqs=128,              # max concurrent sequences
        max_model_len=4096,            # context window
        block_size=16,                 # tokens per KV cache block
        enable_prefix_caching=True,    # reuse system prompt KV blocks
        quantization="awq",            # INT4 AWQ quantization
        enforce_eager=False,           # use CUDA graphs for speed
    )
    return AsyncLLMEngine.from_engine_args(engine_args)

engine = build_engine()
Step 4 — Measure Before/After
Before (HF Transformers, static KV allocation):
VRAM utilization: 28% (72% waste)
Max concurrent requests: 4-6 per GPU
P50 latency at 30 RPS: 1,820ms
P99 latency at 30 RPS: 4,200ms
After (vLLM PagedAttention):
VRAM utilization: 96% (4% waste)
Max concurrent requests: 85-90 per GPU
P50 latency at 30 RPS: 620ms
P99 latency at 30 RPS: 1,400ms
GPU instances required: Cut from 8 → 4 (50% reduction)
Monthly GPU spend: $18,400 → $9,200
Acceptance Criteria
- KV cache memory waste < 10% under normal load
- Max concurrent requests per GPU >= 60
- P99 latency < 2s at 50 concurrent requests
- No memory-related request failures in 7-day soak test
User Story 2 — Low GPU Utilization from Fixed Batching
User Story
AS A senior ML engineer optimizing inference throughput,
I WANT the inference engine to continuously accept and batch new requests as generation slots free up,
SO THAT GPU utilization stays above 80% even under variable traffic patterns
without artificially increasing latency by holding requests for a fixed batch window.
Background & Problem
Our initial deployment used static batching: collect N requests, wait up to T ms, send the batch to the GPU. This caused two failure modes:
- Under-loaded batches: At 3 AM, with 50 concurrent users, batches of size 2-3 were being sent. The GPU processed these quickly but sat idle 40-60% of the time waiting for the next batch window.
- Over-loaded queue at peak: At peak (2,000 concurrent users), some requests in a batch finished generation after 3 tokens while others needed 300 tokens. The long responses held the entire batch hostage: slots freed by early finishers could not accept new requests until the whole batch completed.
Fixed Batching Timeline:
Time 0ms: Request A starts (needs 300 tokens)
Time 0ms: Request B starts (needs 12 tokens)
Time 0ms: Request C starts (needs 8 tokens)
Time 35ms: C finishes → GPU slot free BUT waits for batch
Time 45ms: B finishes → GPU slot free BUT waits for batch
Time 890ms: A finishes → entire batch done
Time 890ms: D, E, F finally start (waited 890ms for a slot!)
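The effect can be quantified with a toy model of the timeline above (hypothetical ~3 ms/token decode rate; numbers are illustrative):

```python
# Toy simulation: three requests with different output lengths. Under fixed
# batching a new request waits for the whole batch; under continuous batching
# it takes the first freed slot.
MS_PER_TOKEN = 3
outputs = {"A": 300, "B": 12, "C": 8}  # tokens needed per request

finish = {name: tokens * MS_PER_TOKEN for name, tokens in outputs.items()}

# Fixed batching: request D waits until every member of the batch is done
fixed_wait_for_D = max(finish.values())
# Continuous batching: D is admitted as soon as the first sequence retires
continuous_wait_for_D = min(finish.values())

print(fixed_wait_for_D)       # 900 ms (the 300-token request holds the batch)
print(continuous_wait_for_D)  # 24 ms (C retires after 8 tokens)
```

The gap widens as output-length variance grows, which is exactly the multi-turn chatbot traffic pattern.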
Implementation Detail
Continuous Batching with vLLM's Iteration-Level Scheduling
vLLM processes requests at the iteration level, not the batch level. At every generation step, if any sequence has finished (hit EOS or max tokens), a new request can immediately fill that sequence slot:
# continuous_batching_demo.py — conceptual illustration
# (Sequence and Request are stand-ins for vLLM's internal types)
from collections import deque

class IterationLevelScheduler:
    """
    At each decode step, retire finished sequences and
    immediately admit new requests from the waiting queue.
    Unlike fixed batching, the batch composition changes every token step.
    """
    def __init__(self, max_seqs: int = 128):
        self.running: list[Sequence] = []
        self.waiting: deque[Request] = deque()
        self.max_seqs = max_seqs

    def schedule_step(self) -> list[Sequence]:
        # 1. Remove finished sequences
        self.running = [s for s in self.running if not s.is_finished]
        # 2. Fill open slots from waiting queue immediately
        while len(self.running) < self.max_seqs and self.waiting:
            req = self.waiting.popleft()
            self.running.append(Sequence(req))
        # 3. Execute one decode step for all running sequences
        return self.running
Deployment Config — Tuned for MangaAssist Traffic
# sagemaker_endpoint/model_handler.py
engine_args = AsyncEngineArgs(
    model=MODEL_PATH,
    # Continuous batching is the default in vLLM — no flag needed.
    # Tune these for our traffic profile:
    max_num_seqs=128,              # max sequences in flight simultaneously
    max_num_batched_tokens=8192,   # max total tokens across all seqs per step
    scheduler_delay_factor=0.0,    # no artificial delay — admit requests immediately
)
Monitoring GPU Utilization
# metrics/gpu_utilization_tracker.py
import boto3
import pynvml

# `engine` is the AsyncLLMEngine instance created in model_handler.py

pynvml.nvmlInit()

def emit_gpu_metrics():
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    cw = boto3.client("cloudwatch", region_name="ap-northeast-1")
    cw.put_metric_data(
        Namespace="MangaAssist/GPU",
        MetricData=[
            {"MetricName": "GPUUtilizationPercent",
             "Value": utilization.gpu, "Unit": "Percent"},
            {"MetricName": "GPUMemoryUsedMB",
             "Value": mem_info.used / 1_048_576, "Unit": "Megabytes"},
            {"MetricName": "ActiveSequences",
             "Value": engine.get_num_running_requests(), "Unit": "Count"},
        ]
    )
Before / After Results
Fixed Batching (batch_size=16, wait_time=50ms):
GPU compute utilization: 42% avg, 68% peak
Throughput at 100 RPS: 38 req/sec actual
P99 queue wait time: 650ms
Continuous Batching (vLLM):
GPU compute utilization: 81% avg, 95% peak
Throughput at 100 RPS: 94 req/sec actual
P99 queue wait time: 85ms
Latency improvement: 40% reduction at P99 (spike conditions)
Acceptance Criteria
- GPU utilization > 75% on average during business hours
- Queue wait time P99 < 150ms under 100 RPS load
- No artificial batching delay introduced
- Metrics emitted and alarmed in CloudWatch
User Story 3 — GPU Cold Start Latency
User Story
AS A customer using MangaAssist for the first time after a scale-up event,
I WANT my first message to receive a response within the same 3-second SLA as all other requests,
SO THAT I am not penalized with a 45-60 second wait while the model loads onto the GPU.
Background & Problem
When SageMaker provisioned a new ml.g5.xlarge instance to handle a traffic spike, the process looked like this:
Instance provisioned: t + 0s
Docker container start: t + 45s
Model weights download: t + 90s (Llama-8B = ~16GB from S3)
CUDA graph capture: t + 115s
First request served: t + 120s (2 minutes after provisioning started!)
The first 3-5 real user requests that hit a freshly provisioned instance waited over a minute for a response. This was a 40x latency regression vs. the 3s SLA.
Implementation Detail
Step 1 — Warmup Request Script on Container Start
# inference/warmup.py
import time

import requests

WARMUP_PROMPTS = [
    # Cover the most common inference shapes at launch
    {"role": "user", "content": "What manga series are similar to Naruto?"},          # medium output
    {"role": "user", "content": "Is One Piece volume 1 available in English?"},       # short output
    {"role": "user", "content": "Recommend 5 manga for someone who likes fantasy."},  # long output
]

def run_warmup(endpoint_url: str):
    """
    Send warmup requests through the full inference path so that:
    1. Model weights are in GPU HBM (not system RAM)
    2. CUDA graphs are captured for common sequence lengths
    3. PagedAttention block pool is initialized
    """
    start = time.time()
    for idx, prompt in enumerate(WARMUP_PROMPTS):
        try:
            resp = requests.post(
                f"{endpoint_url}/generate",
                json={"messages": [prompt], "max_tokens": 150, "stream": False},
                timeout=60,
            )
            resp.raise_for_status()
            elapsed = time.time() - start
            print(f"Warmup {idx + 1}/{len(WARMUP_PROMPTS)} complete in {elapsed:.1f}s")
        except Exception as e:
            print(f"Warmup {idx + 1} failed: {e} — proceeding anyway")
    total_elapsed = time.time() - start
    print(f"Warmup complete in {total_elapsed:.1f}s. Instance is hot.")
Step 2 — Fail the Health Check Until Warmed Up
# inference/health.py
from fastapi import FastAPI, Response

from warmup import run_warmup

app = FastAPI()
_warmed_up = False

@app.get("/ping")
def ping(response: Response):
    """
    SageMaker calls /ping to determine instance health.
    Return 503 until warmup completes so the load balancer
    never routes live traffic to a cold instance.
    """
    if not _warmed_up:
        response.status_code = 503
        return {"status": "warming_up"}
    return {"status": "healthy"}

@app.on_event("startup")
async def startup_event():
    global _warmed_up
    # Run warmup synchronously before marking healthy.
    # The vLLM server on :8080 is a separate process and is already
    # listening by the time this health app starts.
    run_warmup("http://localhost:8080")
    _warmed_up = True
Step 3 — Minimum Instance Count = 2 (Never Scale to Zero)
# infrastructure/sagemaker_autoscaling.py
import boto3

def configure_endpoint_autoscaling(endpoint_name: str):
    aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")
    # Register scalable target
    aas.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=2,   # NEVER go below 2 — cold start prevention
        MaxCapacity=20,
    )
    # Scale out fast, scale in slow
    aas.put_scaling_policy(
        PolicyName=f"{endpoint_name}-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 200.0,   # invocations/minute per instance
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,    # scale out aggressive: 1 minute cooldown
            "ScaleInCooldown": 600,    # scale in conservative: 10 minute cooldown
        },
    )
Step 4 — Fast Model Loading with Safetensors and S3 Express
# Switch from PyTorch bin format to safetensors for faster deserialization
# safetensors skips Python deserialization — direct memory mapping
# Upload model to S3 Express One Zone (up to 10x lower request latency than
# standard S3; requires a directory bucket)
aws s3 cp ./model-weights/ s3://manga-assist-models-express/llama-8b/ \
--recursive \
--storage-class EXPRESS_ONEZONE
# Mount S3 Express as a local filesystem via Mountpoint-S3
# During container startup: weights are streamed directly into GPU memory
Cold Start Time Comparison
Before (PyTorch .bin, standard S3):
Full cold start: 118 seconds
First request latency: ~120 seconds
SLA violations: 100% of first requests on new instances
After (safetensors, S3 Express, warmup, min=2):
Cold start eliminated: 99.7% of requests (min=2 keeps instances hot)
Remaining 0.3%: Scale-out edge case, ~22 second warmup
SLA violations: Reduced to 0.1% of requests
Acceptance Criteria
- No live traffic routed to un-warmed instances (health check gate)
- Model loading time < 30s from instance started to first request served
- Minimum 2 instances always warm in production
- P99.9 first-response latency < 5s including worst-case warmup scenario
User Story 4 — Idle GPU Waste in Off-Peak Hours
User Story
AS AN engineering manager owning the MangaAssist infrastructure budget,
I WANT GPU instances to scale down to the minimum footprint during off-peak hours (midnight–6 AM JST),
SO THAT we do not pay for idle GPU compute while maintaining readiness for any traffic.
Background & Problem
MangaAssist operates on Japan Standard Time (JST) traffic patterns. At 3 AM JST, traffic dropped to ~500 requests/minute. One ml.g5.xlarge instance could serve this load comfortably. But our auto-scaling config conservatively kept 4 instances running to absorb potential bursts.
Off-peak reality (2 AM – 6 AM JST):
Actual traffic: ~500 requests/minute
One instance capacity: ~1,200 requests/minute
Running instances: 4 (over-provisioned 3x)
Idle instances: 3
Cost per idle instance: ml.g5.xlarge = $1.41/hr
Daily waste: 3 instances × 4 hours × $1.41 = $16.92/day
Monthly waste: ~$507/month (just from off-peak hours)
Additionally, the DistilBERT intent classifier running on ml.g4dn.xlarge had the same problem. The combined idle GPU waste across both models was $1,400/month.
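The waste arithmetic above generalizes to a one-line helper (rates and hours taken from the figures above):

```python
# Idle-cost arithmetic from the numbers above, as a reusable helper.
def idle_waste_per_day(idle_instances: int, idle_hours: float, hourly_rate: float) -> float:
    return idle_instances * idle_hours * hourly_rate

# 3 idle ml.g5.xlarge instances, 4 off-peak hours, $1.41/hr
llm_daily = idle_waste_per_day(idle_instances=3, idle_hours=4, hourly_rate=1.41)
print(round(llm_daily, 2))       # 16.92 (dollars/day)
print(round(llm_daily * 30, 1))  # 507.6 (dollars/month, off-peak only)
```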
Implementation Detail
Step 1 — Migrate DistilBERT to AWS Inferentia (ml.inf1.xlarge)
The DistilBERT classifier (66M parameters) did not need a full GPU. AWS Inferentia chips are purpose-built for inference and cost roughly 70% less than comparable GPU instances.
# Compile DistilBERT for Inferentia using the AWS Neuron SDK
import torch
import torch_neuron  # registers Neuron ops (Inferentia1)
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "distilbert-base-uncased-finetuned-manga-intent-v2"

def compile_for_inferentia():
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model.eval()
    # Trace with representative input shapes for Neuron compilation.
    # Must cover all shapes that will be seen in production.
    example_inputs = tokenizer(
        "What manga is similar to Attack on Titan?",
        max_length=128,
        padding="max_length",
        return_tensors="pt",
    )
    # Neuron compilation — converts PyTorch ops to a Neuron graph
    model_neuron = torch_neuron.trace(
        model,
        example_inputs=(
            example_inputs["input_ids"],
            example_inputs["attention_mask"],
        ),
        compiler_args=["--neuroncore-pipeline-cores", "2"],  # pipeline across 2 NeuronCores
    )
    # Save compiled model
    torch.jit.save(model_neuron, "distilbert_neuron_compiled.pt")
    print("Neuron compilation complete. Model saved.")
    return model_neuron

# Deploy to SageMaker on ml.inf1.xlarge (4 NeuronCores, $0.228/hr vs $0.736/hr for ml.g4dn.xlarge)
# Deploy to SageMaker on ml.inf1.xlarge (4 NeuronCores, $0.228/hr vs $0.736/hr for g4dn.xlarge)
Step 2 — Scheduled Scaling for Predictable Traffic Patterns
# infrastructure/scheduled_scaling.py
import boto3

def setup_scheduled_scaling(endpoint_name: str):
    aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")
    # Scale DOWN: 1 AM JST (16:00 UTC) → min capacity 2
    aas.put_scheduled_action(
        ServiceNamespace="sagemaker",
        ScheduledActionName=f"{endpoint_name}-scale-down-off-peak",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        Schedule="cron(0 16 * * ? *)",  # 1 AM JST daily
        ScalableTargetAction={"MinCapacity": 2, "MaxCapacity": 6},
    )
    # Scale UP: 8 AM JST (23:00 UTC) → restore full capacity
    aas.put_scheduled_action(
        ServiceNamespace="sagemaker",
        ScheduledActionName=f"{endpoint_name}-scale-up-business-hours",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        Schedule="cron(0 23 * * ? *)",  # 8 AM JST daily
        ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 20},
    )
    # MANGA RELEASE DAY scaling: major chapters release Fridays at 6 PM JST —
    # pre-scale 2 hours before, i.e. 4 PM JST (7:00 UTC)
    aas.put_scheduled_action(
        ServiceNamespace="sagemaker",
        ScheduledActionName=f"{endpoint_name}-manga-release-day",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        Schedule="cron(0 7 ? * FRI *)",  # Fridays 4 PM JST
        ScalableTargetAction={"MinCapacity": 8, "MaxCapacity": 20},
    )
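The scheduled actions above hand-convert JST to UTC. A small helper makes the conversion explicit (JST = UTC+9; scheduled-action crons are evaluated in UTC). One caveat: conversions that cross midnight UTC also shift the day-of-week field — 1 AM JST Monday is 16:00 UTC Sunday.

```python
# JST hour -> UTC hour for scheduled-action cron expressions.
def jst_hour_to_utc(hour_jst: int) -> int:
    return (hour_jst - 9) % 24

print(jst_hour_to_utc(1))   # 16 -> cron(0 16 * * ? *) for 1 AM JST
print(jst_hour_to_utc(8))   # 23 -> cron(0 23 * * ? *) for 8 AM JST
print(jst_hour_to_utc(16))  # 7  -> cron(0 7 ? * FRI *) for 4 PM JST Friday
```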
Step 3 — Savings Summary
DistilBERT — GPU → Inferentia Migration:
Before: ml.g4dn.xlarge → $0.736/hr × 4 instances = $2.944/hr
After: ml.inf1.xlarge → $0.228/hr × 2 instances = $0.456/hr (Inferentia is faster)
Saving: $2.49/hr → $1,790/month
LLM Inference — Scheduled Scaling (~5 effective off-peak hours/night after scale-in/out lag):
Before: 4 instances always on → $1.41/hr × 4 = $5.64/hr
After: 2 instances off-peak → $1.41/hr × 2 = $2.82/hr (save $2.82/hr × 5hrs × 30days)
Saving: $423/month
Total GPU Cost Savings: $2,213/month
Acceptance Criteria
- DistilBERT latency on Inferentia within 20% of GPU baseline (<18ms)
- Off-peak: maximum 2 LLM GPU instances running at 2 AM JST
- Pre-scale event fires 2 hours before Friday manga release window
- Zero oncall alerts caused by scheduled scale actions
User Story 5 — GPU OOM on Long Conversation Turns
User Story
AS A customer having a long multi-turn conversation with MangaAssist (20+ turns),
I WANT my session to continue normally without errors,
SO THAT I am not dropped mid-conversation with an unintelligible error because
the chatbot ran out of GPU memory trying to fit my entire conversation history.
Background & Problem
KV cache size grows linearly with conversation length. In multi-turn conversations, the KV cache for a 20-turn conversation (averaging 100 tokens/turn) consumed ~2GB of VRAM on the LLM endpoint's A10G. Under concurrent load, we hit OOM crashes:
Error observed in production logs:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 1.98 GiB (GPU 0; 22.20 GiB total capacity;
19.41 GiB already allocated; 892.00 MiB free; 20.81 GiB reserved)
This killed the serving worker process, and SageMaker's health check triggered a container restart — a 45-second outage for that instance's share of traffic.
Implementation Detail
Step 1 — Context Window Policy with Sliding Window
# conversation/context_manager.py
from dataclasses import dataclass

import tiktoken

@dataclass
class ConversationTurn:
    role: str  # "user" | "assistant"
    content: str
    token_count: int
    turn_index: int

class SlidingWindowContextManager:
    """
    Implements a sliding window over conversation history to cap
    the total token count injected into the LLM prompt, preventing
    KV cache OOM while preserving the most recent and most relevant context.
    """
    SYSTEM_PROMPT_TOKENS = 512   # reserved for system prompt
    MAX_OUTPUT_TOKENS = 512      # max generation length
    MAX_CONTEXT_WINDOW = 4096    # model's total context window

    def __init__(self):
        # cl100k_base approximates the Llama tokenizer closely enough
        # for budget accounting
        self.enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(self, text: str) -> int:
        return len(self.enc.encode(text))

    def build_context_window(
        self,
        turns: list[ConversationTurn],
        system_prompt: str,
        rag_context: str,
        current_query: str,
    ) -> list[dict]:
        """
        Always include: system prompt + current query + RAG context.
        Fill remaining budget with most recent history turns (LIFO).
        If a turn would exceed budget, insert a truncation marker instead.
        """
        budget = (
            self.MAX_CONTEXT_WINDOW
            - self.SYSTEM_PROMPT_TOKENS
            - self.count_tokens(rag_context)
            - self.count_tokens(current_query)
            - self.MAX_OUTPUT_TOKENS
            - 64  # overhead / formatting tokens
        )
        selected_turns = []
        tokens_used = 0
        # Walk history in reverse (most recent first)
        for turn in reversed(turns):
            if tokens_used + turn.token_count > budget:
                # Inject a truncation marker instead of dropping silently
                selected_turns.append({
                    "role": "system",
                    "content": f"[Earlier conversation truncated — {len(turns) - len(selected_turns)} turns not shown]",
                })
                break
            selected_turns.append({"role": turn.role, "content": turn.content})
            tokens_used += turn.token_count
        # Reconstruct in chronological order
        messages = [{"role": "system", "content": system_prompt}]
        messages += list(reversed(selected_turns))
        messages.append({"role": "user", "content": f"Context:\n{rag_context}\n\nQuestion: {current_query}"})
        return messages
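The budgeting rule at the heart of build_context_window — walk history newest-first, keep turns while they fit — can be isolated in a self-contained sketch (plain token counts stand in for tiktoken):

```python
# Keep the most recent turns that fit within a token budget.
def select_recent_turns(turn_tokens: list[int], budget: int) -> list[int]:
    """Return indices of turns kept, oldest-to-newest."""
    kept = []
    used = 0
    for idx in range(len(turn_tokens) - 1, -1, -1):  # newest first
        if used + turn_tokens[idx] > budget:
            break
        kept.append(idx)
        used += turn_tokens[idx]
    return list(reversed(kept))

# 6 turns of ~100 tokens each, budget for 350 tokens of history:
print(select_recent_turns([100, 100, 100, 100, 100, 100], budget=350))  # [3, 4, 5]
```

Because the walk is newest-first, trimming always discards the oldest turns — exactly the behavior a user in a long conversation expects.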
Step 2 — AWQ INT4 Quantization to Reclaim VRAM
# quantization/awq_calibration.py
from awq import AutoAWQForCausalLM
from datasets import load_dataset
from transformers import AutoTokenizer

def quantize_with_manga_corpus(model_path: str, output_path: str):
    """
    AWQ (Activation-aware Weight Quantization) calibrates quantization
    using real activation statistics from our manga domain corpus.
    This is critical — generic calibration data hurts Japanese-language
    quality significantly more than domain-matched calibration.
    """
    model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Load manga-domain calibration data (critical for Japanese quality)
    calibration_data = load_dataset(
        "json",
        data_files="s3://manga-assist-data/calibration/manga_qa_500.jsonl",
        split="train",
    )
    quant_config = {
        "zero_point": True,    # zero-point quantization for better accuracy
        "q_group_size": 128,   # groups of 128 weights share a scale factor
        "w_bit": 4,            # 4-bit weights (INT4)
        "version": "GEMM",     # GEMM kernel — best for batch inference
    }
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        # AutoAWQ expects a list of calibration strings
        calib_data=[row["question"] for row in calibration_data],
        max_calib_samples=512,   # 512 representative prompts from manga QA
        max_calib_seq_len=512,
    )
    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)
    print(f"AWQ INT4 model saved to {output_path}")

# Original: 16GB (BF16) → Quantized: 5.2GB (INT4 + metadata)
# Quality on manga eval set: 4.21/5.0 → 4.17/5.0 (-0.9% regression)
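To see what q_group_size buys, here is a minimal symmetric INT4 group quantizer — a sketch of the one-scale-per-group idea only, not the AWQ algorithm (which additionally rescales salient weights using activation statistics). Group size 4 here for readability, vs 128 above:

```python
# Symmetric INT4 quantization of one weight group: every weight in the
# group shares a single scale factor.
def quantize_group(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 7  # INT4 symmetric range: -7..7
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

group = [0.12, -0.40, 0.33, 0.07]
q, scale = quantize_group(group)
restored = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(q)                        # integer codes in [-7, 7]
print(max_err <= scale / 2)     # True: error bounded by half a quantization step
```

Smaller groups mean tighter scales (less error) but more scale-factor overhead — the q_group_size=128 setting above is the usual balance point.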
Step 3 — OOM Circuit Breaker
# inference/oom_guard.py
import functools
import logging

import torch

from metrics import emit_metric  # thin wrapper around CloudWatch put_metric_data

logger = logging.getLogger(__name__)

def oom_circuit_breaker(fallback_response: str):
    """
    Decorator: catches GPU OOM errors, frees cache, and returns a
    graceful degradation response rather than crashing the worker.
    """
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                return await func(*args, **kwargs)
            except torch.cuda.OutOfMemoryError as e:
                logger.error(f"GPU OOM in {func.__name__}: {e}")
                torch.cuda.empty_cache()   # release unreferenced tensors
                torch.cuda.synchronize()   # wait for CUDA ops to complete
                # Emit a metric so we know this is happening
                emit_metric("gpu_oom_count", 1)
                return fallback_response
        return wrapper
    return decorator

# Usage on the LLM generation endpoint:
@oom_circuit_breaker(
    fallback_response="I'm having trouble with that right now. Let me connect you with our support team."
)
async def generate_response(messages: list, sampling_params: SamplingParams) -> str:
    return await engine.generate(messages, sampling_params)
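The same pattern can be exercised without a GPU by swapping in a stand-in exception — a library-agnostic sketch mirroring the shape of oom_guard (catch, clean up, return a canned degradation response):

```python
import asyncio
import functools

class FakeOOM(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this GPU-free sketch."""

def circuit_breaker(fallback: str):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                return await func(*args, **kwargs)
            except FakeOOM:
                # In production: torch.cuda.empty_cache() + metric emission here
                return fallback
        return wrapper
    return decorator

@circuit_breaker(fallback="degraded-response")
async def generate(fail: bool) -> str:
    if fail:
        raise FakeOOM("out of memory")
    return "ok"

print(asyncio.run(generate(False)))  # ok
print(asyncio.run(generate(True)))   # degraded-response
```

Keeping the fallback path synchronous and allocation-free matters: the worker is, by definition, out of memory when it runs.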
OOM Incidents After Fixes
Before (no context windowing, no quantization, no OOM guard):
OOM incidents per week: ~14
Container restarts/week: ~14 (each = ~45s mini-outage for one instance)
Longest conversation (P99): 12 turns before OOM risk
After (sliding window + AWQ + OOM guard):
OOM incidents per week: 0 (in 8 weeks post-deployment)
VRAM used by 20-turn conv: 3.8GB (INT4) vs 11.2GB (BF16)
Longest conversation (P99): 35+ turns handled without degradation
Acceptance Criteria
- Zero GPU OOM container restarts in 30-day production monitoring window
- Conversations of 30+ turns served within latency SLA
- AWQ quantization quality regression < 2% on manga evaluation set
- Sliding window truncation marker shown to users when history is trimmed
User Story 6 — Multi-Model GPU Contention (Multi-LoRA)
User Story
AS A cost-conscious senior engineer deploying multiple model variants,
I WANT multiple fine-tuned LoRA adapters to share a single base model instance on one GPU,
SO THAT I do not need separate SageMaker endpoints and separate GPU instances
for every domain-specific or task-specific model variant.
Background & Problem
We had two fine-tuned model variants during development:
1. Manga-domain adapter — fine-tuned on 50K manga QA pairs for product Q&A
2. General assistant adapter — fine-tuned on customer service data for support flows
A third adapter, for Japanese translation, joined the registry once Multi-LoRA serving made adding variants cheap.
Initially deployed on separate endpoints:
manga-domain endpoint: ml.g5.xlarge = $1.41/hr
general-assist endpoint: ml.g5.xlarge = $1.41/hr
Total: $2.82/hr = $2,030/month
Each instance was ~40% utilized independently. Combining them would cut costs in half.
Implementation Detail
Multi-LoRA Serving with vLLM
# multi_lora_config.py
import uuid

from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.lora.request import LoRARequest

engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # base model (shared)
    enable_lora=True,
    max_loras=4,        # simultaneously loaded LoRA adapters
    max_lora_rank=64,
    max_cpu_loras=8,    # keep up to 8 adapters in CPU memory for hot swapping
    gpu_memory_utilization=0.90,
    max_num_seqs=96,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

# LoRA adapter registry
LORA_ADAPTERS = {
    "manga_domain": LoRARequest("manga", 1, "/opt/ml/lora/manga_domain_r64/"),
    "general_support": LoRARequest("support", 2, "/opt/ml/lora/general_support_r64/"),
    "jp_translation": LoRARequest("jp", 3, "/opt/ml/lora/jp_translation_r32/"),
}

async def generate_with_adapter(
    prompt: str,  # messages already flattened via the model's chat template
    adapter_name: str,
    sampling_params: SamplingParams,
) -> str:
    lora_request = LORA_ADAPTERS.get(adapter_name)  # None = base model only
    request_id = f"req-{uuid.uuid4()}"
    results = engine.generate(
        prompt,
        sampling_params,
        request_id=request_id,
        lora_request=lora_request,  # inject the appropriate adapter
    )
    output = ""
    async for result in results:
        output = result.outputs[0].text
    return output
Routing Logic — Which Adapter to Use
# orchestrator/adapter_router.py
INTENT_TO_ADAPTER = {
    "product_question": "manga_domain",
    "recommendation": "manga_domain",
    "discovery": "manga_domain",
    "order_tracking": "general_support",
    "return_request": "general_support",
    "escalation": "general_support",
    "chitchat": None,   # base model sufficient
    "faq": None,        # RAG context handles it
}

def select_adapter(intent: str, locale: str) -> str | None:
    base_adapter = INTENT_TO_ADAPTER.get(intent)
    # Japanese locale swaps in the JP translation adapter
    # (vLLM applies one LoRA adapter per request — no stacking)
    if locale == "ja-JP" and base_adapter == "manga_domain":
        return "jp_translation"
    return base_adapter
Cost Comparison
Before (separate endpoints per adapter):
2 × ml.g5.xlarge: $2.82/hr = $2,030/month
After (Multi-LoRA, single endpoint):
1 × ml.g5.xlarge: $1.41/hr = $1,015/month
Saving: $1,015/month (50% reduction)
GPU utilization: From 40% per instance → 76% on combined instance
Quality delta: No observable change (adapters isolated by LoRA)
Acceptance Criteria
- All 3 LoRA adapters served from a single GPU instance
- Latency overhead from LoRA routing < 5ms additional
- No cross-contamination between adapters (verified via eval benchmarks)
- 50% GPU instance reduction confirmed in AWS Cost Explorer
User Story 7 — Slow GPU Provisioning During Traffic Spikes
User Story
AS AN SRE managing MangaAssist reliability during major manga release events,
I WANT new GPU instances to be available within 90 seconds of a traffic spike,
SO THAT we never experience the 5-8 minute provisioning lag that caused
429 ThrottlingExceptions during the One Piece chapter release traffic spike.
Background & Problem
During the One Piece chapter 1,100 release, traffic spiked 3x in 15 minutes. SageMaker's reactive auto-scaling triggered but couldn't provision instances fast enough:
Timeline of the One Piece Incident:
8:00 PM JST: Chapter releases, traffic starts climbing
8:07 PM JST: Auto-scaling trigger fires (InvocationsPerInstance > 300)
8:07 PM JST: SageMaker begins provisioning 2 new instances
8:15 PM JST: First new instance becomes healthy
8:07–8:15 PM: 12,000 requests queued, 20% failed with 429
8:20 PM JST: All new instances healthy, 0% error rate
8:07–8:20 PM: 13 minutes of degraded service
Implementation Detail
Step 1 — Predictive Scaling via Manga Release Calendar
# scaling/manga_release_predictor.py
import boto3

# Known high-traffic events with historical multipliers
KNOWN_EVENTS = [
    # Weekly manga releases (Shonen Jump: Mondays, Shonen Magazine: Wednesdays)
    {"day_of_week": "MON", "hour_jst": 18, "multiplier": 2.0, "reason": "Shonen Jump release"},
    {"day_of_week": "WED", "hour_jst": 18, "multiplier": 1.8, "reason": "Shonen Magazine release"},
    # Seasonal: Spring/Fall anime announcements drive catalog searches
    {"month": 4, "day": 1, "hour_jst": 18, "multiplier": 2.5, "reason": "Spring anime announcement season"},
    {"month": 10, "day": 1, "hour_jst": 18, "multiplier": 2.5, "reason": "Fall anime announcement season"},
]

def calculate_required_instances(base_rps: float, multiplier: float,
                                 instance_capacity_rps: float = 200) -> int:
    """Instances needed for base_rps * multiplier, with a 30% headroom buffer."""
    required_rps = base_rps * multiplier
    return max(4, int((required_rps / instance_capacity_rps) * 1.3))

def schedule_predictive_scale_events(endpoint_name: str):
    """Pre-scale 2 hours before every known high-traffic event."""
    aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")
    for event in KNOWN_EVENTS:
        required = calculate_required_instances(
            base_rps=350,
            multiplier=event["multiplier"],
        )
        # Schedule 2 hours before the event
        pre_scale_hour = event["hour_jst"] - 2
        if "day_of_week" in event:
            # Weekly event: cron(minute hour day-of-month month day-of-week year)
            cron = f"cron(0 {pre_scale_hour} ? * {event['day_of_week']} *)"
        else:
            # Annual event on a fixed calendar date
            cron = f"cron(0 {pre_scale_hour} {event['day']} {event['month']} ? *)"
        aas.put_scheduled_action(
            ServiceNamespace="sagemaker",
            ScheduledActionName=f"{endpoint_name}-pre-scale-{event['reason'].replace(' ', '-')}",
            ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
            ScalableDimension="sagemaker:variant:DesiredInstanceCount",
            Schedule=cron,
            Timezone="Asia/Tokyo",  # interpret the cron in JST instead of UTC
            ScalableTargetAction={"MinCapacity": required, "MaxCapacity": required + 4},
        )
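Plugging the calendar's multipliers into the sizing formula makes the numbers concrete. A standalone restatement of `calculate_required_instances`, evaluated at the 350 RPS baseline used above:

```python
# instances = max(4, int(base_rps * multiplier / capacity_rps * 1.3))
def required_instances(base_rps: float, multiplier: float,
                       capacity_rps: float = 200) -> int:
    return max(4, int((base_rps * multiplier / capacity_rps) * 1.3))

# Monday Shonen Jump release (2.0x on the 350 RPS baseline):
print(required_instances(350, 2.0))  # 4 -- the floor of 4 instances dominates
# Seasonal announcement spike (2.5x):
print(required_instances(350, 2.5))  # 5
```

The floor of 4 means weekly releases are already covered by steady-state capacity; only the seasonal 2.5x events actually force extra instances.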
Step 2 — Aggressive Step Scaling (Scale Out Fast, Scale In Slow)
# infrastructure/step_scaling.py
import boto3

def configure_step_scaling(endpoint_name: str):
    aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")
    cw = boto3.client("cloudwatch", region_name="ap-northeast-1")

    # Step scaling: +2 when > 200/min, +4 when > 400/min, +8 when > 800/min
    # (step bounds are offsets from the 200-invocation alarm threshold)
    policy = aas.put_scaling_policy(
        PolicyName=f"{endpoint_name}-step-scale-out",
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="StepScaling",
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "Cooldown": 60,
            "StepAdjustments": [
                {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 200, "ScalingAdjustment": 2},
                {"MetricIntervalLowerBound": 200, "MetricIntervalUpperBound": 600, "ScalingAdjustment": 4},
                # Last step is unbounded: omit the upper-bound key entirely
                {"MetricIntervalLowerBound": 600, "ScalingAdjustment": 8},
            ],
        },
    )

    # CloudWatch alarm: high invocations, wired to the policy created above
    cw.put_metric_alarm(
        AlarmName=f"{endpoint_name}-high-invocations",
        MetricName="InvocationsPerInstance",
        Namespace="AWS/SageMaker",
        Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
        Period=60,
        EvaluationPeriods=1,  # trigger after just 1 minute (aggressive)
        Threshold=200,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        Statistic="Sum",
        AlarmActions=[policy["PolicyARN"]],
    )
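To make the step table concrete, here is a small sketch (plain Python, not AWS code) of how Application Auto Scaling resolves a metric value against those intervals. Each bound is an offset from the 200-invocation alarm threshold:

```python
THRESHOLD = 200  # alarm threshold on InvocationsPerInstance (per minute)
# (lower_offset, upper_offset, instances_to_add), offsets relative to THRESHOLD
STEPS = [(0, 200, 2), (200, 600, 4), (600, None, 8)]

def adjustment_for(metric_value: float) -> int:
    """Return the capacity adjustment for a given metric reading."""
    breach = metric_value - THRESHOLD
    if breach < 0:
        return 0  # alarm not breached, no scale-out
    for lower, upper, adj in STEPS:
        if breach >= lower and (upper is None or breach < upper):
            return adj
    return 0

print(adjustment_for(250))  # +2 instances
print(adjustment_for(500))  # +4 instances
print(adjustment_for(900))  # +8 instances
```

Scale-in is deliberately not handled here: per the acceptance criteria, scale-in is delayed until at least 10 minutes after traffic normalizes.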
Before/After the One Piece Incident Pattern
Before (reactive target tracking only):
Trigger to capacity available: 8-13 minutes
Requests dropped during spike: ~20%
Error rate peak: 20% for ~13 minutes
After (predictive + step scaling + event calendar):
Pre-scale fires 2 hours early: instances already provisioned
Reactive step scale (fallback): trigger to capacity in <90 seconds
Requests dropped: 0% for predicted events
Error rate during unpredicted spikes: <1% for <90 seconds
Acceptance Criteria
- Zero 429 errors during scheduled manga release windows
- Unpredicted spike absorption: additional capacity online within 90 seconds
- Scaling event calendar covers 100% of known weekly release patterns
- Post-spike scale-in happens no sooner than 10 minutes after traffic normalizes
User Story 8 — INT4 Quantization Quality Degradation on Japanese Text
User Story
AS A product manager measuring MangaAssist recommendation quality,
I WANT INT4-quantized models to maintain the same response quality
as the full-precision BF16 model on Japanese manga Q&A,
SO THAT we can achieve 3x VRAM savings without sacrificing the product experience
for our Japanese-language users.
Background & Problem
Naive INT8/INT4 quantization (round each weight to the nearest representable integer) caused a significant quality drop on Japanese text. Japanese is token-dense — a single kanji often carries a full morpheme — and morphologically complex, so small weight perturbations were enough to produce mistranslations and hallucinated manga titles:
BF16 Response:
"鬼滅の刃(Demon Slayer)に似たマンガとして、進撃の巨人(Attack on Titan)、
呪術廻戦(Jujutsu Kaisen)、ワンピース(One Piece)をお勧めします。"
(Translation: "As manga similar to Demon Slayer, I recommend Attack on Titan, Jujutsu Kaisen, and One Piece.")
Quality score: 4.21/5.0
Naive INT8 Response:
"鬼滅の刃に似たマンガとして、東京喰種(Tokyo Ghoul)、嘘喰いをお勧めします。
また、聖おにいさん(Saint Young Men)も良いかもしれません。"
(Translation: "As manga similar to Demon Slayer, I recommend Tokyo Ghoul and Usogui. Saint Young Men might also be good." — real titles, but much weaker matches)
Quality score: 3.40/5.0 ← acceptable threshold is 4.0
Naive INT4 Response:
"お勧めのマンガは九鬼スピリット、魔術バックです。これらのタイトルは..."
(Translation: "The manga I recommend are 'Kuki Spirit' and 'Majutsu Bakku'. These titles are..." — neither title exists)
Quality score: 2.10/5.0 ← hallucinated fictional titles!
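The failure mode is easy to reproduce in isolation. A toy sketch (illustrative, not the production quantizer): with one scale per tensor, a handful of outlier weights force a large quantization step, and at 4 bits the small weights that encode fine-grained distinctions get rounded away — exactly the weights AWQ protects.

```python
import random

random.seed(0)
# Toy weight tensor: mostly small values plus a few salient outliers,
# roughly the shape of a real LLM weight distribution
weights = [random.gauss(0, 0.02) for _ in range(1024)]
for i in range(0, 1024, 128):
    weights[i] *= 10  # outlier channels

def naive_symmetric_quant(w, bits):
    """Round-to-nearest with ONE scale for the whole tensor (the naive scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax  # outliers blow up the scale...
    return [round(x / scale) * scale for x in w]  # ...crushing the small weights

def mean_abs_err(w, wq):
    return sum(abs(a - b) for a, b in zip(w, wq)) / len(w)

for bits in (8, 4):
    err = mean_abs_err(weights, naive_symmetric_quant(weights, bits))
    print(f"INT{bits} mean abs reconstruction error: {err:.5f}")
```

Dropping from 8 to 4 bits multiplies the quantization step by ~18x ((2^7−1)/(2^3−1)), so the reconstruction error on the small weights grows by roughly the same factor.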
Implementation Detail
AWQ (Activation-Aware Weight Quantization) with In-Domain Calibration
Standard quantization uses generic text for calibration. AWQ instead runs calibration data through the model, identifies the small fraction of weight channels that see the largest activation magnitudes, and rescales those channels before quantizing so the information they carry survives the 4-bit rounding. The critical insight is that calibrating on domain-matched data (manga Q&A) protects exactly the channels tied to Japanese manga vocabulary.
# quantization/awq_manga_calibration.py
import json

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

def build_calibration_dataset(path: str) -> list[str]:
    """
    500 representative manga Q&A pairs covering:
    - Japanese-character title mentions
    - Mixed Japanese/English text (common in manga descriptions)
    - Numeric content (volume numbers, chapter references)
    - Long-form recommendations
    """
    with open(path) as f:
        data = [json.loads(line) for line in f]
    calibration_texts = []
    for item in data[:500]:
        # Format as the exact prompt template used in production
        # (system prompt: "You are a manga shopping assistant.")
        text = (
            "<|system|>あなたはマンガショッピングアシスタントです。\n"
            f"<|user|>{item['question']}\n"
            f"<|assistant|>{item['answer']}"
        )
        calibration_texts.append(text)
    return calibration_texts

def run_awq_quantization(model_path: str, output_path: str, calib_path: str):
    model = AutoAWQForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    calibration_data = build_calibration_dataset(calib_path)
    # AWQ with conservative hyperparameters for Japanese quality preservation
    quant_config = {
        "zero_point": True,
        "q_group_size": 64,  # smaller group = more precision, less compression
        "w_bit": 4,
        "version": "GEMM",
    }
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calibration_data,
        max_calib_samples=500,
        max_calib_seq_len=768,  # cover full manga recommendation responses
    )
    model.save_quantized(output_path, safetensors=True)
    tokenizer.save_pretrained(output_path)  # ship the tokenizer with the weights
    print(f"AWQ INT4 model saved: {output_path}")
Quality Evaluation Pipeline
# evaluation/quantization_quality_eval.py
import statistics
from typing import Callable

EVAL_PROMPTS = [
    {"prompt": "鬼滅の刃に似たマンガを5つ教えてください",  # "Name 5 manga similar to Demon Slayer"
     "language": "ja",
     "required_titles": ["進撃の巨人", "呪術廻戦", "ワンピース", "僕のヒーローアカデミア"]},
    {"prompt": "What are the differences between the standard and deluxe edition of One Piece?",
     "language": "en",
     "required_terms": ["color pages", "hardcover", "larger format"]},
    # ... 50 total eval prompts
]

def evaluate_model(generate_fn: Callable, eval_prompts: list) -> float:
    scores = []
    for item in eval_prompts:
        response = generate_fn(item["prompt"])
        scores.append(score_response(response, item))
    return statistics.mean(scores)

def score_response(response: str, item: dict) -> float:
    score = 0.0
    # Check required titles/terms are present
    required = item.get("required_titles", []) + item.get("required_terms", [])
    for req in required:
        if req.lower() in response.lower():
            score += 1.0 / len(required)
    # Penalize hallucinated titles; count_hallucinated_titles (defined elsewhere
    # in this module) checks each extracted title against the known-manga DB
    hallucinated = count_hallucinated_titles(response)
    score -= hallucinated * 0.25
    return max(0.0, min(1.0, score)) * 5.0  # scale to /5.0
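A quick standalone check of the scoring rubric. This re-implements the same logic with the hallucination count passed in directly, so it runs without the manga title DB (the example responses are made up for illustration):

```python
def score_response(response: str, item: dict, count_hallucinated: int = 0) -> float:
    # Simplified copy of the scorer: coverage of required names, minus a
    # 0.25 penalty per hallucinated title, clamped to [0, 1] and scaled to /5.0
    required = item.get("required_titles", []) + item.get("required_terms", [])
    score = sum(1.0 / len(required) for req in required if req.lower() in response.lower())
    score -= count_hallucinated * 0.25
    return max(0.0, min(1.0, score)) * 5.0

item = {"required_titles": ["進撃の巨人", "呪術廻戦", "ワンピース", "僕のヒーローアカデミア"]}
full = "進撃の巨人、呪術廻戦、ワンピース、僕のヒーローアカデミアをお勧めします。"
partial = "進撃の巨人と呪術廻戦をお勧めします。"

print(score_response(full, item))                           # 5.0 (all 4 present)
print(score_response(partial, item))                        # 2.5 (2 of 4 present)
print(score_response(partial, item, count_hallucinated=1))  # 1.25 (penalty applied)
```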
Quantization Quality Comparison
| Method | VRAM | Manga Eval | EN Eval | JP Eval | Accept? |
|---|---|---|---|---|---|
| BF16 (baseline) | 16.1 GB | 4.21/5.0 | 4.30 | 4.12 | ✓ |
| Naive INT8 | 8.2 GB | 3.89/5.0 | 4.01 | 3.77 | Marginal |
| Naive INT4 | 4.8 GB | 2.83/5.0 | 3.44 | 2.22 | ✗ |
| AWQ INT4 (generic calib) | 5.1 GB | 3.72/5.0 | 4.05 | 3.38 | ✗ |
| AWQ INT4 (manga calib) | 5.2 GB | 4.17/5.0 | 4.24 | 4.10 | ✓ |
Acceptance threshold: 4.0/5.0 on all three dimensions
Only AWQ INT4 with manga-domain calibration passes all thresholds.
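The VRAM column converts to reduction percentages as follows (same figures as the table, relative to the BF16 baseline):

```python
vram_gb = {
    "BF16 (baseline)": 16.1,
    "Naive INT8": 8.2,
    "Naive INT4": 4.8,
    "AWQ INT4 (manga calib)": 5.2,
}
baseline = vram_gb["BF16 (baseline)"]
for method, gb in vram_gb.items():
    print(f"{method}: {(1 - gb / baseline) * 100:.0f}% VRAM reduction")
# AWQ INT4 (manga calib) lands at 68%, clearing the >=60% acceptance bar below
```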
Acceptance Criteria
- AWQ INT4 manga eval score >= 4.0/5.0
- AWQ INT4 Japanese-specific eval >= 4.0/5.0 (critical for JP store)
- Zero hallucinated manga titles in 500-prompt eval set
- VRAM reduction >= 60% vs BF16 (achieved: 5.2GB vs 16.1GB = 68% reduction)
GPU Architecture Summary — ROI Dashboard
| Challenge Solved | Monthly Saving | Latency Gain | Reliability Gain |
|---|---|---|---|
| PagedAttention (KV cache) | $9,200 (50% fewer GPU instances) | 40% P99 improvement | 0 memory-related failures |
| Continuous Batching | — | 40% latency at peak | Consistent throughput |
| Cold Start Elimination | — | 99.7% of cold starts removed | 0 SLA violations on warmup |
| Inferentia Migration (DistilBERT) | $1,790 | within 20% of GPU baseline | Zero regressions |
| Off-Peak Scheduled Scaling | $507 | — | Zero over-provisioning |
| Multi-LoRA Consolidation | $1,015 | < 5ms adapter overhead | No cross-contamination |
| Predictive Spike Scaling | — | — | 0 errors on predicted events |
| AWQ Quantization | Enabled all above savings | — | 0 OOM incidents post-fix |
| Total | ~$12,500/month | 40–65% P99 improvement | Near-zero GPU-related incidents |
References
- 01-inference-pipeline-challenges.md — Multi-model pipeline design
- 03-tradeoffs-decisions.md — Decision framework and reversal triggers
- ../Tech-Stack/02-open-source-libraries.md — vLLM, PagedAttention deep dive
- ../Tech-Stack/01-detailed-tech-stack.md — Hardware choices (Inferentia, g5, g4dn)