
06. GPU Architecture Challenges — MangaAssist

"Every GPU dollar we burned on wasted memory, idle capacity, or poor batching was a dollar we couldn't spend on better models, more features, or lower latency. GPU architecture isn't an ops problem — it's a product problem."


Overview

MangaAssist ran five ML models in production simultaneously: a DistilBERT intent classifier, Titan Embeddings V2, a cross-encoder reranker, a fine-tuned Llama-3-8B, and Claude 3.5 Sonnet via Bedrock. The self-hosted models (DistilBERT, the reranker, and the fine-tuned Llama) ran on GPU-backed SageMaker endpoints. This document covers every GPU-level architecture challenge we hit, the user stories that drove each fix, and the exact implementation approach.


GPU Challenges at a Glance

#  | Challenge                                   | Root Cause                                                 | Solution                                  | Impact
───┼─────────────────────────────────────────────┼────────────────────────────────────────────────────────────┼───────────────────────────────────────────┼──────────────────────────────────────────
1  | KV Cache Memory Fragmentation               | Static KV cache pre-allocation wastes 60-80% of GPU VRAM   | PagedAttention (vLLM)                     | ~15x concurrent request capacity per GPU
2  | Low GPU Utilization from Fixed Batching     | Requests served one-at-a-time or in fixed windows          | Continuous batching                       | 40% latency reduction during spikes
3  | GPU Cold Start Latency                      | Model weights loaded on first request (45-60s)             | Warmup + min-instance pinning             | Eliminated 99% of cold starts
4  | Idle GPU Waste in Off-Peak Hours            | Always-on GPU instances with no work to do                 | Predictive scaling + Inferentia migration | ~$2.2K/month saved
5  | GPU OOM on Long Conversation Turns          | Multi-turn context grows the KV cache unboundedly          | Context windowing + quantization          | Zero OOM kills in production
6  | Multi-Model GPU Contention                  | Each model claimed an isolated GPU; sharing was impossible | Multi-LoRA on a shared base model         | 50% fewer GPU instances
7  | Slow GPU Provisioning During Traffic Spikes | SageMaker scale-up takes 5-8 minutes                       | Predictive scaling + step scaling         | Spike absorption in <90 seconds
8  | INT8/INT4 Quality Degradation               | Naive quantization hurt Japanese-language quality          | AWQ calibration on manga corpus           | <2% quality loss with 3x memory reduction

User Story 1 — KV Cache Memory Fragmentation

User Story

AS A platform engineer responsible for GPU cost efficiency,
I WANT KV cache memory to be allocated dynamically instead of pre-allocated per request,
SO THAT I can fit more concurrent user sessions on each GPU without provisioning additional instances.

Background & Problem

When MangaAssist first launched with naive HuggingFace Transformers inference, our profiling revealed a shocking number: 72% of the GPU VRAM allocated to the KV cache was sitting unused.

Every autoregressive generation step requires keeping key-value tensors for all previous tokens in GPU memory. The naive approach pre-allocates the maximum possible sequence length (2048 tokens in our case) for every request — even if the actual response is 50 tokens long.

Request A: actual 180 tokens → allocated 2048 tokens worth of KV memory (1868 wasted)
Request B: actual 220 tokens → allocated 2048 tokens worth of KV memory (1828 wasted)
Request C: actual 95 tokens  → allocated 2048 tokens worth of KV memory (1953 wasted)

Total waste: ~72% of KV cache VRAM
GPU: ml.g4dn.xlarge (16GB VRAM)
Max concurrent requests with naive approach: 4-6

This waste translated directly into user-facing latency: during traffic spikes, new requests queued until a running request released its KV memory before they could be admitted.

Implementation Detail

Step 1 — Profiled the Baseline

import torch

def profile_kv_cache_utilization(model, tokenizer, prompts):
    """Measure actual vs. allocated KV cache memory."""
    results = []
    for prompt in prompts:
        torch.cuda.reset_peak_memory_stats()
        mem_before = torch.cuda.memory_allocated()

        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                return_dict_in_generate=True,
                output_scores=True
            )

        actual_tokens = output.sequences.shape[-1]
        peak_mem = torch.cuda.max_memory_allocated()
        baseline_mem_per_token = (peak_mem - mem_before) / 2048  # pre-allocated max
        actual_mem_needed = baseline_mem_per_token * actual_tokens
        utilization = actual_mem_needed / (peak_mem - mem_before)

        results.append({
            "prompt_len": inputs["input_ids"].shape[-1],
            "output_len": actual_tokens,
            "peak_vram_mb": peak_mem / 1e6,
            "kv_utilization": utilization
        })

    return results

# Result: avg utilization = 0.28 (72% waste)

Step 2 — Adopted vLLM with PagedAttention

PagedAttention maps the KV cache using virtual pages (analogous to OS virtual memory), allocating physical GPU memory blocks only as tokens are generated:

┌────────────────────────────────────────────────────────┐
│               PagedAttention Block Layout               │
├────────────────────────┬───────────────────────────────┤
│  OS Concept            │  PagedAttention Analog         │
├────────────────────────┼───────────────────────────────┤
│  Memory pages          │  KV cache blocks (16 tokens)  │
│  Physical memory       │  GPU VRAM blocks               │
│  Virtual address space │  Logical KV sequence           │
│  Page table            │  Block table per request       │
│  Copy-on-Write         │  Shared blocks (beam search)   │
└────────────────────────┴───────────────────────────────┘
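A minimal sketch of the block-table bookkeeping (names like BlockTable and BLOCK_SIZE are ours for illustration, not vLLM internals): physical blocks come out of a single shared free pool one block at a time, and return to it the moment a request finishes.

# paged_attention_sketch.py: illustrative bookkeeping, not vLLM source
BLOCK_SIZE = 16  # tokens per physical KV block

class BlockTable:
    """Maps one request's logical token positions to physical VRAM blocks."""
    def __init__(self, free_pool: list[int]):
        self.free_pool = free_pool       # shared pool of physical block ids
        self.blocks: list[int] = []      # this request's blocks, in order
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a new physical block only when the current one fills up.
        # (A real scheduler would preempt or queue if the pool ran dry.)
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_pool.pop())
        self.num_tokens += 1

    def release(self) -> None:
        # Return every block to the pool as soon as the request finishes
        self.free_pool.extend(self.blocks)
        self.blocks.clear()

# A 95-token request pins ceil(95/16) = 6 blocks instead of the
# 2048/16 = 128 blocks that static pre-allocation would reserve.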

Step 3 — SageMaker Deployment Config

# vllm_serving_container/serve.py
from vllm import AsyncLLMEngine, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs

def build_engine():
    engine_args = AsyncEngineArgs(
        model="meta-llama/Llama-3-8b-instruct",   # fine-tuned checkpoint
        tensor_parallel_size=1,                     # single A10G on ml.g5.xlarge
        gpu_memory_utilization=0.92,                # leave 8% headroom for CUDA ops
        max_num_seqs=128,                           # max concurrent sequences
        max_model_len=4096,                         # context window
        block_size=16,                              # tokens per KV cache block
        enable_prefix_caching=True,                 # reuse system prompt KV blocks
        quantization="awq",                         # INT4 AWQ quantization
        enforce_eager=False,                        # use CUDA graphs for speed
    )
    return AsyncLLMEngine.from_engine_args(engine_args)

engine = build_engine()

Step 4 — Measure Before/After

Before (HF Transformers, static KV allocation):
  VRAM utilization:        28% (72% waste)
  Max concurrent requests: 4-6 per GPU
  P50 latency at 30 RPS:   1,820ms
  P99 latency at 30 RPS:   4,200ms

After (vLLM PagedAttention):
  VRAM utilization:        96% (4% waste)
  Max concurrent requests: 85-90 per GPU
  P50 latency at 30 RPS:   620ms
  P99 latency at 30 RPS:   1,400ms

GPU instances required:    Cut from 8 → 4 (50% reduction)
Monthly GPU spend:         $18,400 → $9,200

Acceptance Criteria

  • KV cache memory waste < 10% under normal load
  • Max concurrent requests per GPU >= 60
  • P99 latency < 2s at 50 concurrent requests
  • No memory-related request failures in 7-day soak test

User Story 2 — Low GPU Utilization from Fixed Batching

User Story

AS A senior ML engineer optimizing inference throughput,
I WANT the inference engine to continuously accept and batch new requests as generation slots free up,
SO THAT GPU utilization stays above 80% even under variable traffic patterns
without artificially increasing latency by holding requests for a fixed batch window.

Background & Problem

Our initial deployment used static batching: collect N requests, wait up to T ms, send the batch to the GPU. This caused two failure modes:

  1. Under-loaded batches: At 3 AM, with 50 concurrent users, batches of size 2-3 were being sent. The GPU processed these quickly but sat idle 40-60% of the time waiting for the next batch window.
  2. Over-loaded queue at peak: At peak (2,000 concurrent users), some requests in a batch finished generation after 3 tokens while others needed 300 tokens. The long responses held the entire batch hostage: slots freed by early finishers sat idle, blocking new requests from filling them (see the timeline below).
Fixed Batching Timeline:
Time 0ms:  Request A starts (needs 300 tokens)
Time 0ms:  Request B starts (needs 12 tokens)
Time 0ms:  Request C starts (needs 8 tokens)

Time 35ms: C finishes → GPU slot free BUT waits for batch
Time 45ms: B finishes → GPU slot free BUT waits for batch
Time 890ms: A finishes → entire batch done
Time 890ms: D, E, F finally start (waited 890ms for a slot!)

Implementation Detail

Continuous Batching with vLLM's Iteration-Level Scheduling

vLLM processes requests at the iteration level, not the batch level. At every generation step, if any sequence has finished (hit EOS or max tokens), a new request can immediately fill that sequence slot:

# continuous_batching_demo.py — conceptual illustration
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str

@dataclass
class Sequence:
    request: Request
    is_finished: bool = False   # set on EOS or max_tokens

class IterationLevelScheduler:
    """
    At each decode step, retire finished sequences and
    immediately admit new requests from the waiting queue.
    Unlike fixed batching, the batch composition changes every token step.
    """
    def __init__(self, max_seqs: int = 128):
        self.running: list[Sequence] = []
        self.waiting: deque[Request] = deque()
        self.max_seqs = max_seqs

    def schedule_step(self) -> list[Sequence]:
        # 1. Remove finished sequences
        self.running = [s for s in self.running if not s.is_finished]

        # 2. Fill open slots from waiting queue immediately
        while len(self.running) < self.max_seqs and self.waiting:
            req = self.waiting.popleft()
            self.running.append(Sequence(req))

        # 3. Execute one decode step for all running sequences
        return self.running

Deployment Config — Tuned for MangaAssist Traffic

# sagemaker_endpoint/model_handler.py
engine_args = AsyncEngineArgs(
    model=MODEL_PATH,
    # Continuous batching is the default in vLLM — no flag needed.
    # Tune these for our traffic profile:
    max_num_seqs=128,           # max sequences in flight simultaneously
    max_num_batched_tokens=8192, # max total tokens across all seqs per step
    scheduler_delay_factor=0.0,  # no artificial delay — admit requests immediately
)

Monitoring GPU Utilization

# metrics/gpu_utilization_tracker.py
import boto3
import pynvml

pynvml.nvmlInit()

def emit_gpu_metrics(active_sequences: int):
    # the serving loop passes in the engine's count of running sequences
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)

    cw = boto3.client("cloudwatch", region_name="ap-northeast-1")
    cw.put_metric_data(
        Namespace="MangaAssist/GPU",
        MetricData=[
            {"MetricName": "GPUUtilizationPercent",
             "Value": utilization.gpu, "Unit": "Percent"},
            {"MetricName": "GPUMemoryUsedMB",
             "Value": mem_info.used / 1_048_576, "Unit": "Megabytes"},
            {"MetricName": "ActiveSequences",
             "Value": engine.get_num_running_requests(), "Unit": "Count"},
        ]
    )

Before / After Results

Fixed Batching (batch_size=16, wait_time=50ms):
  GPU compute utilization:   42% avg, 68% peak
  Throughput at 100 RPS:     38 req/sec actual
  P99 queue wait time:       650ms

Continuous Batching (vLLM):
  GPU compute utilization:   81% avg, 95% peak
  Throughput at 100 RPS:     94 req/sec actual
  P99 queue wait time:       85ms

Latency improvement:         40% reduction at P99 (spike conditions)

Acceptance Criteria

  • GPU utilization > 75% on average during business hours
  • Queue wait time P99 < 150ms under 100 RPS load
  • No artificial batching delay introduced
  • Metrics emitted and alarmed in CloudWatch

User Story 3 — GPU Cold Start Latency

User Story

AS A customer using MangaAssist for the first time after a scale-up event,
I WANT my first message to receive a response within the same 3-second SLA as all other requests,
SO THAT I am not penalized with a 45-60 second wait while the model loads onto the GPU.

Background & Problem

When SageMaker provisioned a new ml.g5.xlarge instance to handle a traffic spike, the process looked like this:

Instance provisioned:    t + 0s
Docker container start:  t + 45s
Model weights download:  t + 90s   (Llama-8B = ~16GB from S3)
CUDA graph capture:      t + 115s
First request served:    t + 120s  (2 minutes after provisioning started!)

The first 3-5 real user requests that hit a freshly provisioned instance waited over a minute for a response. This was a 40x latency regression vs. the 3s SLA.

Implementation Detail

Step 1 — Warmup Request Script on Container Start

# inference/warmup.py
import requests
import time

WARMUP_PROMPTS = [
    # Cover the most common inference shapes at launch
    {"role": "user", "content": "What manga series are similar to Naruto?"},           # medium output
    {"role": "user", "content": "Is One Piece volume 1 available in English?"},        # short output
    {"role": "user", "content": "Recommend 5 manga for someone who likes fantasy."},   # long output
]

def run_warmup(endpoint_url: str) -> None:
    """
    Send warmup requests through the full inference path so that:
    1. Model weights are in GPU HBM (not system RAM)
    2. CUDA graphs are captured for common sequence lengths
    3. PagedAttention block pool is initialized
    """
    start = time.time()
    for idx, prompt in enumerate(WARMUP_PROMPTS):
        try:
            resp = requests.post(
                f"{endpoint_url}/generate",
                json={"messages": [prompt], "max_tokens": 150, "stream": False},
                timeout=60
            )
            resp.raise_for_status()
            elapsed = time.time() - start
            print(f"Warmup {idx+1}/3 complete in {elapsed:.1f}s")
        except Exception as e:
            print(f"Warmup {idx+1} failed: {e} — proceeding anyway")

    total_elapsed = time.time() - start
    print(f"Warmup complete in {total_elapsed:.1f}s. Instance is hot.")

Step 2 — Fail the Health Check Until Warmed Up

# inference/health.py
from fastapi import FastAPI, Response

from inference.warmup import run_warmup

app = FastAPI()
_warmed_up = False

@app.get("/ping")
def ping(response: Response):
    """
    SageMaker calls /ping to determine instance health.
    Return 503 until warmup completes so the load balancer
    never routes live traffic to a cold instance.
    """
    if not _warmed_up:
        response.status_code = 503
        return {"status": "warming_up"}
    return {"status": "healthy"}

@app.on_event("startup")
async def startup_event():
    global _warmed_up
    # Run warmup synchronously before marking healthy
    run_warmup("http://localhost:8080")
    _warmed_up = True

Step 3 — Minimum Instance Count = 2 (Never Scale to Zero)

# infrastructure/sagemaker_autoscaling.py
import boto3

def configure_endpoint_autoscaling(endpoint_name: str):
    aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")

    # Register scalable target
    aas.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=2,   # NEVER go below 2 — cold start prevention
        MaxCapacity=20,
    )

    # Scale out fast, scale in slow
    aas.put_scaling_policy(
        PolicyName=f"{endpoint_name}-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 200.0,       # invocations/minute per instance
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,     # scale out aggressive: 1 minute cooldown
            "ScaleInCooldown": 600,     # scale in conservative: 10 minute cooldown
        }
    )

Step 4 — Fast Model Loading with Safetensors and S3 Express

# Switch from PyTorch bin format to safetensors for faster deserialization
# safetensors skips Python deserialization — direct memory mapping

# Upload model to S3 Express One Zone (10x faster than standard S3)
aws s3 cp ./model-weights/ s3://manga-assist-models-express/llama-8b/ \
  --recursive \
  --storage-class EXPRESS_ONEZONE

# Mount S3 Express as a local filesystem via Mountpoint-S3
# During container startup: weights are streamed directly into GPU memory
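For reference, the safetensors fast path looks roughly like this: load_file memory-maps the file and materializes tensors directly on the target device, skipping Python unpickling entirely (the shard path below is hypothetical).

# model_loading_sketch.py: safetensors zero-copy load (illustrative)
from safetensors.torch import load_file

# Memory-maps the checkpoint and copies tensors straight to the GPU;
# no torch.load pickle pass, no intermediate CPU tensor copies.
state_dict = load_file(
    "/opt/ml/model/model-00001-of-00004.safetensors",  # hypothetical shard path
    device="cuda:0",
)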

Cold Start Time Comparison

Before (PyTorch .bin, standard S3):
  Full cold start:         118 seconds
  First request latency:   ~120 seconds
  SLA violations:          100% of first requests on new instances

After (safetensors, S3 Express, warmup, min=2):
  Cold start eliminated:   99.7% of requests (min=2 keeps instances hot)
  Remaining 0.3%:          Scale-out edge case, ~22 second warmup
  SLA violations:          Reduced to 0.1% of requests

Acceptance Criteria

  • No live traffic routed to un-warmed instances (health check gate)
  • Model loading time < 30s from instance started to first request served
  • Minimum 2 instances always warm in production
  • P99.9 first-response latency < 5s including worst-case warmup scenario

User Story 4 — Idle GPU Waste in Off-Peak Hours

User Story

AS AN engineering manager owning the MangaAssist infrastructure budget,
I WANT GPU instances to scale down to the minimum footprint during off-peak hours (midnight–6 AM JST),
SO THAT we do not pay for idle GPU compute while maintaining readiness for any traffic.

Background & Problem

MangaAssist operates on Japan Standard Time (JST) traffic patterns. At 3 AM JST, traffic dropped to ~500 requests/minute. One ml.g5.xlarge instance could serve this load comfortably. But our auto-scaling config conservatively kept 4 instances running to absorb potential bursts.

Off-peak reality (2 AM – 6 AM JST):
  Actual traffic:          ~500 requests/minute
  One instance capacity:   ~1,200 requests/minute
  Running instances:       4 (over-provisioned 3x)
  Idle instances:          3
  Cost per idle instance:  ml.g5.xlarge = $1.41/hr
  Daily waste:             3 instances × 4 hours × $1.41 = $16.92/day
  Monthly waste:           ~$507/month (just from off-peak hours)

Additionally, the DistilBERT intent classifier running on ml.g4dn.xlarge had the same problem. The combined idle GPU waste across both models was $1,400/month.

Implementation Detail

Step 1 — Migrate DistilBERT to AWS Inferentia (ml.inf1.xlarge)

The DistilBERT intent classifier (66M parameters) did not need a GPU. AWS Inferentia chips are purpose-built for inference and cost 70% less than equivalent GPU instances.

# Compile DistilBERT for Inferentia using AWS Neuron SDK
import torch
import torch_neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "distilbert-base-uncased-finetuned-manga-intent-v2"

def compile_for_inferentia():
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model.eval()

    # Trace with representative input shapes for Neuron compilation
    # Must cover all shapes that will be seen in production
    example_inputs = tokenizer(
        "What manga is similar to Attack on Titan?",
        max_length=128,
        padding="max_length",
        return_tensors="pt"
    )

    # Neuron compilation — converts PyTorch ops to Neuron graph
    model_neuron = torch_neuron.trace(
        model,
        example_inputs=(
            example_inputs["input_ids"],
            example_inputs["attention_mask"]
        ),
        compiler_args=["--neuroncore-pipeline-cores", "2"],  # pipeline across 2 NeuronCores
    )

    # Save compiled model
    torch.jit.save(model_neuron, "distilbert_neuron_compiled.pt")
    print("Neuron compilation complete. Model saved.")
    return model_neuron

# Deploy to SageMaker on ml.inf1.xlarge (4 NeuronCores, $0.228/hr vs $0.736/hr for g4dn.xlarge)

Step 2 — Scheduled Scaling for Predictable Traffic Patterns

# infrastructure/scheduled_scaling.py
import boto3
from datetime import datetime

def setup_scheduled_scaling(endpoint_name: str):
    aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")

    # Scale DOWN: 1 AM JST (16:00 UTC) → min capacity 2
    aas.put_scheduled_action(
        ServiceNamespace="sagemaker",
        ScheduledActionName=f"{endpoint_name}-scale-down-off-peak",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        Schedule="cron(0 16 * * ? *)",  # 1 AM JST daily
        ScalableTargetAction={"MinCapacity": 2, "MaxCapacity": 6},
    )

    # Scale UP: 8 AM JST (23:00 UTC) → restore full capacity
    aas.put_scheduled_action(
        ServiceNamespace="sagemaker",
        ScheduledActionName=f"{endpoint_name}-scale-up-business-hours",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        Schedule="cron(0 23 * * ? *)",  # 8 AM JST daily
        ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 20},
    )

    # MANGA RELEASE DAY scaling: Fridays at 6 PM JST (9:00 UTC)
    # Major manga chapters release on Fridays — pre-scale 2 hours before
    aas.put_scheduled_action(
        ServiceNamespace="sagemaker",
        ScheduledActionName=f"{endpoint_name}-manga-release-day",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        Schedule="cron(0 9 ? * FRI *)",  # Fridays 6 PM JST
        ScalableTargetAction={"MinCapacity": 8, "MaxCapacity": 20},
    )

Step 3 — Savings Summary

DistilBERT — GPU → Inferentia Migration:
  Before: ml.g4dn.xlarge  → $0.736/hr × 4 instances = $2.944/hr
  After:  ml.inf1.xlarge  → $0.228/hr × 2 instances = $0.456/hr (Inferentia is faster)
  Saving: $2.49/hr → $1,790/month

LLM Inference — Scheduled Scaling (off-peak 5 hours/night):
  Before: 4 instances always on → $1.41/hr × 4 = $5.64/hr
  After:  2 instances off-peak  → $1.41/hr × 2 = $2.82/hr (save $2.82/hr × 5hrs × 30days)
  Saving: $423/month

Total GPU Cost Savings: $2,213/month

Acceptance Criteria

  • DistilBERT latency on Inferentia within 20% of GPU baseline (<18ms)
  • Off-peak: maximum 2 LLM GPU instances running at 2 AM JST
  • Pre-scale event fires 2 hours before Friday manga release window
  • Zero oncall alerts caused by scheduled scale actions

User Story 5 — GPU OOM on Long Conversation Turns

User Story

AS A customer having a long multi-turn conversation with MangaAssist (20+ turns),
I WANT my session to continue normally without errors,
SO THAT I am not dropped mid-conversation with an unintelligible error because
the chatbot ran out of GPU memory trying to fit my entire conversation history.

Background & Problem

KV cache size grows linearly with conversation length. In multi-turn conversations, the KV cache for a 20-turn conversation (averaging 100 tokens/turn) consumed ~2GB of VRAM on the LLM endpoint's A10G. Under concurrent load, we hit OOM crashes:

Error observed in production logs:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 1.98 GiB (GPU 0; 22.20 GiB total capacity;
19.41 GiB already allocated; 892.00 MiB free; 20.81 GiB reserved)

This killed the serving worker process, and SageMaker's health check triggered a container restart — a 45-second outage for that instance's share of traffic.
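The growth is easy to put numbers on. A back-of-the-envelope sketch, assuming Llama-3-8B's published dimensions (32 layers, 8 KV heads under GQA, head dim 128, BF16); the measured footprint above is larger because every turn's prompt also carries the system prompt and RAG context, and concurrent sequences multiply the total:

# kv_cache_sizing.py: rough per-sequence KV cache estimate
def kv_cache_bytes(num_tokens: int,
                   n_layers: int = 32,      # Llama-3-8B
                   n_kv_heads: int = 8,     # GQA
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:   # BF16
    # Two tensors (K and V) per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * num_tokens

# 20 turns x ~100 tokens = 2,000 history tokens per sequence:
print(f"{kv_cache_bytes(2_000) / 1e9:.2f} GB")   # ~0.26 GB per sequence
# Dozens of concurrent sequences plus per-turn prompt overhead fill
# the A10G's ~22 GiB quickly, which is exactly the OOM shown above.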

Implementation Detail

Step 1 — Context Window Policy with Sliding Window

# conversation/context_manager.py
from dataclasses import dataclass
import tiktoken

@dataclass
class ConversationTurn:
    role: str       # "user" | "assistant"
    content: str
    token_count: int
    turn_index: int

class SlidingWindowContextManager:
    """
    Implements a sliding window over conversation history to cap
    the total token count injected into the LLM prompt, preventing
    KV cache OOM while preserving the most recent and most relevant context.
    """

    SYSTEM_PROMPT_TOKENS = 512      # reserved for system prompt
    CONTEXT_TOKENS = 1024           # reserved for RAG chunks
    MAX_HISTORY_TOKENS = 2048       # budget for conversation history
    MAX_OUTPUT_TOKENS = 512         # max generation length
    MAX_CONTEXT_WINDOW = 4096       # model's total context window

    def __init__(self):
        self.enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(self, text: str) -> int:
        return len(self.enc.encode(text))

    def build_context_window(
        self,
        turns: list[ConversationTurn],
        system_prompt: str,
        rag_context: str,
        current_query: str
    ) -> list[dict]:
        """
        Always include: system prompt + current query + RAG context.
        Fill remaining budget with most recent history turns (LIFO).
        If a turn would exceed budget, summarize older turns instead.
        """
        budget = (
            self.MAX_CONTEXT_WINDOW
            - self.SYSTEM_PROMPT_TOKENS
            - self.count_tokens(rag_context)
            - self.count_tokens(current_query)
            - self.MAX_OUTPUT_TOKENS
            - 64   # overhead / formatting tokens
        )

        selected_turns = []
        tokens_used = 0

        # Walk history in reverse (most recent first)
        for turn in reversed(turns):
            if tokens_used + turn.token_count > budget:
                # Inject a truncation marker instead of dropping silently
                selected_turns.append({
                    "role": "system",
                    "content": f"[Earlier conversation truncated — {len(turns) - len(selected_turns)} turns not shown]"
                })
                break
            selected_turns.append({"role": turn.role, "content": turn.content})
            tokens_used += turn.token_count

        # Reconstruct in chronological order
        messages = [{"role": "system", "content": system_prompt}]
        messages += list(reversed(selected_turns))
        messages.append({"role": "user", "content": f"Context:\n{rag_context}\n\nQuestion: {current_query}"})
        return messages
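Usage at request time, with hypothetical stand-ins for the session store and system prompt:

# usage sketch: names below are illustrative, not production identifiers
manager = SlidingWindowContextManager()
messages = manager.build_context_window(
    turns=session.turns,                 # hypothetical session-store object
    system_prompt=MANGA_SYSTEM_PROMPT,   # hypothetical constant
    rag_context=retrieved_chunks_text,
    current_query=user_message,
)
# By construction the assembled prompt plus the generation budget stays
# within MAX_CONTEXT_WINDOW, so per-request KV cache growth is bounded.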

Step 2 — AWQ INT4 Quantization to Reclaim VRAM

# quantization/awq_calibration.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

def quantize_with_manga_corpus(model_path: str, output_path: str):
    """
    AWQ (Activation-aware Weight Quantization) calibrates quantization
    using real activation statistics from our manga domain corpus.
    This is critical — generic calibration data hurts Japanese-language
    quality significantly more than domain-matched calibration.
    """
    model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Load manga-domain calibration data (critical for Japanese quality)
    calibration_data = load_dataset(
        "json",
        data_files="s3://manga-assist-data/calibration/manga_qa_500.jsonl",
        split="train"
    )
    calib_texts = [row["question"] for row in calibration_data]

    quant_config = {
        "zero_point": True,      # Zero-point quantization for better accuracy
        "q_group_size": 128,     # Groups of 128 weights share a scale factor
        "w_bit": 4,              # 4-bit weights (INT4)
        "version": "GEMM",       # GEMM kernel — best for batch inference
    }

    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calib_texts,  # AutoAWQ accepts a list of raw text samples
        max_calib_samples=512,   # 512 representative prompts from manga QA
        max_calib_seq_len=512,
    )

    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)
    print(f"AWQ INT4 model saved to {output_path}")
    # Original: 16GB (BF16) → Quantized: 5.2GB (INT4 + metadata)
    # Quality on manga eval set: 4.21/5.0 → 4.17/5.0 (-0.9% regression)

Step 3 — OOM Circuit Breaker

# inference/oom_guard.py
import torch
import functools
import logging

from metrics.emitter import emit_metric  # assumed helper wrapping CloudWatch put_metric_data

logger = logging.getLogger(__name__)

def oom_circuit_breaker(fallback_response: str):
    """
    Decorator: catches GPU OOM errors, frees cache, and returns a
    graceful degradation response rather than crashing the worker.
    """
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                return await func(*args, **kwargs)
            except torch.cuda.OutOfMemoryError as e:
                logger.error(f"GPU OOM in {func.__name__}: {e}")
                torch.cuda.empty_cache()    # release unreferenced tensors
                torch.cuda.synchronize()    # wait for CUDA ops to complete

                # Emit a metric so we know this is happening
                emit_metric("gpu_oom_count", 1)

                return fallback_response
        return wrapper
    return decorator

# Usage on the LLM generation endpoint:
@oom_circuit_breaker(
    fallback_response="I'm having trouble with that right now. Let me connect you with our support team."
)
async def generate_response(messages: list, sampling_params: SamplingParams) -> str:
    return await engine.generate(messages, sampling_params)

OOM Incidents After Fixes

Before (no context windowing, no quantization, no OOM guard):
  OOM incidents per week:    ~14
  Container restarts/week:   ~14 (each = ~45s mini-outage for one instance)
  Longest conversation (P99): 12 turns before OOM risk

After (sliding window + AWQ + OOM guard):
  OOM incidents per week:    0 (in 8 weeks post-deployment)
  VRAM used by 20-turn conv: 3.8GB (INT4) vs 11.2GB (BF16)
  Longest conversation (P99): 35+ turns handled without degradation

Acceptance Criteria

  • Zero GPU OOM container restarts in 30-day production monitoring window
  • Conversations of 30+ turns served within latency SLA
  • AWQ quantization quality regression < 2% on manga evaluation set
  • Sliding window truncation marker shown to users when history is trimmed

User Story 6 — Multi-Model GPU Contention (Multi-LoRA)

User Story

AS A cost-conscious senior engineer deploying multiple model variants,
I WANT multiple fine-tuned LoRA adapters to share a single base model instance on one GPU,
SO THAT I do not need separate SageMaker endpoints and separate GPU instances
for every domain-specific or task-specific model variant.

Background & Problem

We had two fine-tuned model variants during development (a third, a Japanese-translation adapter, was added later; see the registry below):

  1. Manga-domain adapter — fine-tuned on 50K manga QA pairs for product Q&A
  2. General assistant adapter — fine-tuned on customer service data for support flows

Initially deployed on separate endpoints:

manga-domain endpoint:     ml.g5.xlarge = $1.41/hr
general-assist endpoint:   ml.g5.xlarge = $1.41/hr
Total:                     $2.82/hr = $2,030/month

Each instance was ~40% utilized independently. Combining them would cut costs in half.
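Why co-hosting is cheap: a LoRA adapter is just a pair of low-rank matrices added to each frozen base weight, so the expensive tensor is shared. A sketch of the forward pass (dimensions are illustrative: rank 64 on a 4096-wide projection):

# lora_forward_sketch.py: why adapters are cheap to co-host
import torch

d_model, rank = 4096, 64

W = torch.randn(d_model, d_model)       # frozen base weight, shared by ALL adapters
A = torch.randn(rank, d_model) * 0.01   # per-adapter down-projection (r x d)
B = torch.zeros(d_model, rank)          # per-adapter up-projection (d x r)
scaling = 1.0                           # lora_alpha / rank in real configs

x = torch.randn(1, d_model)

# Base path plus the low-rank delta: y = x W^T + scaling * (x A^T) B^T
y = x @ W.T + scaling * ((x @ A.T) @ B.T)

# Per-adapter cost: 2 * d_model * rank ≈ 0.5M params vs ~16.8M for W alone,
# so several adapters ride along on one shared base model per GPU.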

Implementation Detail

Multi-LoRA Serving with vLLM

# multi_lora_config.py
import uuid

from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.lora.request import LoRARequest

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-8b-instruct",  # base model (shared)
    enable_lora=True,
    max_loras=4,                 # simultaneously loaded LoRA adapters
    max_lora_rank=64,
    max_cpu_loras=8,             # keep up to 8 adapters in CPU memory for hot swapping
    gpu_memory_utilization=0.90,
    max_num_seqs=96,
)

engine = AsyncLLMEngine.from_engine_args(engine_args)

# LoRA adapter registry
LORA_ADAPTERS = {
    "manga_domain":    LoRARequest("manga",   1, "/opt/ml/lora/manga_domain_r64/"),
    "general_support": LoRARequest("support", 2, "/opt/ml/lora/general_support_r64/"),
    "jp_translation":  LoRARequest("jp",      3, "/opt/ml/lora/jp_translation_r32/"),
}

async def generate_with_adapter(
    prompt: str,                # chat template applied upstream of this call
    adapter_name: str,
    sampling_params: SamplingParams
) -> str:
    lora_request = LORA_ADAPTERS.get(adapter_name)  # None = base model only

    request_id = f"req-{uuid.uuid4()}"
    results = engine.generate(
        prompt,
        sampling_params,
        request_id=request_id,
        lora_request=lora_request   # inject the appropriate adapter
    )

    output = ""
    async for result in results:
        output = result.outputs[0].text
    return output

Routing Logic — Which Adapter to Use

# orchestrator/adapter_router.py

INTENT_TO_ADAPTER = {
    "product_question":    "manga_domain",
    "recommendation":      "manga_domain",
    "discovery":           "manga_domain",
    "order_tracking":      "general_support",
    "return_request":      "general_support",
    "escalation":          "general_support",
    "chitchat":            None,              # base model sufficient
    "faq":                 None,              # RAG context handles it
}

def select_adapter(intent: str, locale: str) -> str | None:
    base_adapter = INTENT_TO_ADAPTER.get(intent)
    # Japanese locale swaps in the JP translation adapter for manga queries.
    # vLLM applies one adapter per request, so this substitutes rather than stacks.
    if locale == "ja-JP" and base_adapter == "manga_domain":
        return "jp_translation"
    return base_adapter
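A few routing examples:

select_adapter("recommendation", "ja-JP")   # -> "jp_translation"
select_adapter("recommendation", "en-US")   # -> "manga_domain"
select_adapter("chitchat", "ja-JP")         # -> None (base model)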

Cost Comparison

Before (separate endpoints per adapter):
  2 × ml.g5.xlarge:    $2.82/hr = $2,030/month

After (Multi-LoRA, single endpoint):
  1 × ml.g5.xlarge:    $1.41/hr = $1,015/month

Saving:                 $1,015/month (50% reduction)
GPU utilization:        From 40% per instance → 76% on combined instance
Quality delta:          No observable change (adapters isolated by LoRA)

Acceptance Criteria

  • All 3 LoRA adapters served from a single GPU instance
  • Latency overhead from LoRA routing < 5ms additional
  • No cross-contamination between adapters (verified via eval benchmarks)
  • 50% GPU instance reduction confirmed in AWS Cost Explorer

User Story 7 — Slow GPU Provisioning During Traffic Spikes

User Story

AS AN SRE managing MangaAssist reliability during major manga release events,
I WANT new GPU instances to be available within 90 seconds of a traffic spike,
SO THAT we never experience the 5-8 minute provisioning lag that caused
429 ThrottlingExceptions during the One Piece chapter release traffic spike.

Background & Problem

During the One Piece chapter 1,100 release, traffic spiked 3x in 15 minutes. SageMaker's reactive auto-scaling triggered but couldn't provision instances fast enough:

Timeline of the One Piece Incident:
  8:00 PM JST:  Chapter releases, traffic starts climbing
  8:07 PM JST:  Auto-scaling trigger fires (InvocationsPerInstance > 300)
  8:07 PM JST:  SageMaker begins provisioning 2 new instances
  8:15 PM JST:  First new instance becomes healthy
  8:07–8:15 PM: 12,000 requests queued, 20% failed with 429
  8:20 PM JST:  All new instances healthy, 0% error rate
  8:07–8:20 PM: 13 minutes of degraded service

Implementation Detail

Step 1 — Predictive Scaling via Manga Release Calendar

# scaling/manga_release_predictor.py
import boto3

# Known high-traffic events with historical multipliers
KNOWN_EVENTS = [
    # Weekly manga releases (Shonen Jump: Mondays, Shonen Magazine: Wednesdays)
    {"day_of_week": "MON", "hour_jst": 18, "multiplier": 2.0, "reason": "Shonen Jump release"},
    {"day_of_week": "WED", "hour_jst": 18, "multiplier": 1.8, "reason": "Shonen Magazine release"},
    # Seasonal: Spring/Summer anime announcements drive catalog searches
    {"month": 4, "day": 1, "multiplier": 2.5, "reason": "Spring anime announcement season"},
    {"month": 10, "day": 1, "multiplier": 2.5, "reason": "Fall anime announcement season"},
]

def calculate_required_instances(base_rps: float, multiplier: float, instance_capacity_rps: float = 200) -> int:
    required_rps = base_rps * multiplier
    # Add 30% headroom buffer
    return max(4, int((required_rps / instance_capacity_rps) * 1.3))

def schedule_predictive_scale_events(endpoint_name: str):
    """Pre-scale 2 hours before every known weekly high-traffic event.
    Date-based seasonal events are registered separately as one-off actions."""
    aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")

    for event in KNOWN_EVENTS:
        if "day_of_week" not in event:
            continue  # seasonal date-based entries are handled out of band

        required = calculate_required_instances(
            base_rps=350,
            multiplier=event["multiplier"]
        )

        # Schedule 2 hours before the event, converting JST to the UTC
        # hours that scheduled-action cron expressions expect (JST = UTC+9)
        hour_utc = (event["hour_jst"] - 2 - 9) % 24
        cron = f"cron(0 {hour_utc} ? * {event['day_of_week']} *)"

        aas.put_scheduled_action(
            ServiceNamespace="sagemaker",
            ScheduledActionName=f"{endpoint_name}-pre-scale-{event['reason'].replace(' ', '-')}",
            ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
            ScalableDimension="sagemaker:variant:DesiredInstanceCount",
            Schedule=cron,
            ScalableTargetAction={"MinCapacity": required, "MaxCapacity": required + 4},
        )

Step 2 — Aggressive Step Scaling (Scale Out Fast, Scale In Slow)

# infrastructure/step_scaling.py
import boto3

def configure_step_scaling(endpoint_name: str):
    aas = boto3.client("application-autoscaling", region_name="ap-northeast-1")
    cw = boto3.client("cloudwatch", region_name="ap-northeast-1")

    # CloudWatch alarm: high invocations
    cw.put_metric_alarm(
        AlarmName=f"{endpoint_name}-high-invocations",
        MetricName="InvocationsPerInstance",
        Namespace="AWS/SageMaker",
        Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
        Period=60,
        EvaluationPeriods=1,   # trigger after just 1 minute (aggressive)
        Threshold=200,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        Statistic="Sum",
        AlarmActions=[f"arn:aws:autoscaling:ap-northeast-1:ACCOUNT:scalingPolicy:..."],
    )

    # Step scaling: +2 when > 200/min, +4 when > 400/min, +8 when > 800/min
    aas.put_scaling_policy(
        PolicyName=f"{endpoint_name}-step-scale-out",
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="StepScaling",
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "Cooldown": 60,
            "StepAdjustments": [
                {"MetricIntervalLowerBound": 0,   "MetricIntervalUpperBound": 200, "ScalingAdjustment": 2},
                {"MetricIntervalLowerBound": 200, "MetricIntervalUpperBound": 600, "ScalingAdjustment": 4},
                {"MetricIntervalLowerBound": 600, "MetricIntervalUpperBound": None, "ScalingAdjustment": 8},
            ]
        }
    )

Before/After the One Piece Incident Pattern

Before (reactive target tracking only):
  Trigger to capacity available: 8-13 minutes
  Requests dropped during spike: ~20%
  Error rate peak:               20% for ~13 minutes

After (predictive + step scaling + event calendar):
  Pre-scale fires 2 hours early: instances already provisioned
  Reactive step scale (fallback): trigger to capacity in <90 seconds
  Requests dropped:              0% for predicted events
  Error rate during unpredicted spikes: <1% for <90 seconds

Acceptance Criteria

  • Zero 429 errors during scheduled manga release windows
  • Unpredicted spike absorption: additional capacity online within 90 seconds
  • Scaling event calendar covers 100% of known weekly release patterns
  • Post-spike scale-in happens no sooner than 10 minutes after traffic normalizes

User Story 8 — INT4 Quantization Quality Degradation on Japanese Text

User Story

AS A product manager measuring MangaAssist recommendation quality,
I WANT INT4-quantized models to maintain the same response quality
as the full-precision BF16 model on Japanese manga Q&A,
SO THAT we can achieve 3x VRAM savings without sacrificing the product experience
for our Japanese-language users.

Background & Problem

Naive INT8/INT4 quantization (round weights to nearest integer) caused a significant quality drop on Japanese text. Japanese characters have high token density and the language is morphologically complex — small weight perturbations caused mistranslations and hallucinated manga titles:

BF16 Response:
"鬼滅の刃(Demon Slayer)に似たマンガとして、進撃の巨人(Attack on Titan)、
呪術廻戦(Jujutsu Kaisen)、ワンピース(One Piece)をお勧めします。"
Quality score: 4.21/5.0

Naive INT8 Response:
"鬼滅の刃に似たマンガとして、東京喰種(Tokyo Ghoul)、嘘喰いをお勧めします。
また、聖おにいさん(Saint Young Men)も良いかもしれません。"
Quality score: 3.40/5.0   ← acceptable threshold is 4.0

Naive INT4 Response:
"お勧めのマンガは九鬼スピリット、魔術バックです。これらのタイトルは..."
Quality score: 2.10/5.0   ← hallucinated fictional titles!
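For concreteness, "naive" here means round-to-nearest (RTN): each weight group is snapped to the closest point on a uniform grid, treating every weight as equally important. A minimal sketch:

# rtn_sketch.py: naive round-to-nearest weight quantization
import torch

def rtn_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Quantize then dequantize, returning the (lossy) reconstructed weights."""
    qmax = 2 ** (n_bits - 1) - 1                       # 7 for INT4
    groups = w.reshape(-1, group_size)                 # one scale per group
    scale = groups.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)

w = torch.randn(4096, 4096)
print((rtn_quantize(w) - w).abs().mean())   # uniform error across all weights
# RTN perturbs every channel equally, including the few salient channels whose
# activations dominate Japanese token predictions; AWQ (below) rescales those
# channels before rounding so they survive 4-bit precision.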

Implementation Detail

AWQ (Activation-Aware Weight Quantization) with In-Domain Calibration

Standard quantization uses generic text for calibration. AWQ instead observes the actual activation magnitudes for weights that matter most and protects those weights with higher precision. The critical insight is that calibrating on domain-matched data (manga QA) preserves the activations specific to Japanese manga vocabulary.

# quantization/awq_manga_calibration.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import json

def build_calibration_dataset(path: str) -> list[str]:
    """
    500 representative manga Q&A pairs covering:
    - Japanese-character title mentions
    - Mixed Japanese/English text (common in manga descriptions)
    - Numeric content (volume numbers, chapter references)
    - Long-form recommendations
    """
    with open(path) as f:
        data = [json.loads(line) for line in f]

    calibration_texts = []
    for item in data[:500]:
        # Format as the exact prompt template used in production
        text = f"<|system|>あなたはマンガショッピングアシスタントです。\n<|user|>{item['question']}\n<|assistant|>{item['answer']}"
        calibration_texts.append(text)

    return calibration_texts

def run_awq_quantization(model_path: str, output_path: str, calib_path: str):
    model = AutoAWQForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    calibration_data = build_calibration_dataset(calib_path)

    # AWQ with conservative hyperparameters for Japanese quality preservation
    quant_config = {
        "zero_point": True,
        "q_group_size": 64,    # smaller group = more precision, less compression
        "w_bit": 4,
        "version": "GEMM",
    }

    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calibration_data,
        max_calib_samples=500,
        max_calib_seq_len=768,   # cover full manga recommendation responses
    )

    model.save_quantized(output_path, safetensors=True)
    print(f"AWQ INT4 model saved: {output_path}")

Quality Evaluation Pipeline

# evaluation/quantization_quality_eval.py
from typing import Callable
import statistics

EVAL_PROMPTS = [
    {"prompt": "鬼滅の刃に似たマンガを5つ教えてください", "language": "ja",
     "required_titles": ["進撃の巨人", "呪術廻戦", "ワンピース", "僕のヒーローアカデミア"]},
    {"prompt": "What are the differences between the standard and deluxe edition of One Piece?",
     "language": "en", "required_terms": ["color pages", "hardcover", "larger format"]},
    # ... 50 total eval prompts
]

def evaluate_model(generate_fn: Callable, eval_prompts: list) -> float:
    scores = []
    for item in eval_prompts:
        response = generate_fn(item["prompt"])
        score = score_response(response, item)
        scores.append(score)
    return statistics.mean(scores)

def score_response(response: str, item: dict) -> float:
    score = 0.0
    # Check required titles/terms are present
    required = item.get("required_titles", []) + item.get("required_terms", [])
    for req in required:
        if req.lower() in response.lower():
            score += 1.0 / len(required)
    # Penalize hallucinated titles (checked against known manga DB)
    hallucinated = count_hallucinated_titles(response)  # helper defined elsewhere in the eval package
    score -= hallucinated * 0.25
    return max(0.0, min(1.0, score)) * 5.0  # scale to /5.0

Quantization Quality Comparison

Method                    | VRAM   | Manga Eval | EN Eval | JP Eval | Accept?
──────────────────────────┼────────┼────────────┼─────────┼─────────┼────────
BF16 (baseline)           | 16.1GB | 4.21/5.0   | 4.30    | 4.12    | ✓
Naive INT8                | 8.2GB  | 3.89/5.0   | 4.01    | 3.77    | Marginal
Naive INT4                | 4.8GB  | 2.83/5.0   | 3.44    | 2.22    | ✗
AWQ INT4 (generic calib)  | 5.1GB  | 3.72/5.0   | 4.05    | 3.38    | ✗
AWQ INT4 (manga calib) ✓  | 5.2GB  | 4.17/5.0   | 4.24    | 4.10    | ✓

Acceptance threshold: 4.0/5.0 on all three dimensions
Only AWQ INT4 with manga-domain calibration passes all thresholds.

Acceptance Criteria

  • AWQ INT4 manga eval score >= 4.0/5.0
  • AWQ INT4 Japanese-specific eval >= 4.0/5.0 (critical for JP store)
  • Zero hallucinated manga titles in 500-prompt eval set
  • VRAM reduction >= 60% vs BF16 (achieved: 5.2GB vs 16.1GB = 68% reduction)

GPU Architecture Summary — ROI Dashboard

Challenge Solved                  | Monthly Saving                   | Latency Gain                 | Reliability Gain
──────────────────────────────────┼──────────────────────────────────┼──────────────────────────────┼─────────────────────────────────
PagedAttention (KV cache)         | $9,200 (50% fewer GPU instances) | 40% P99 improvement          | 0 memory-related failures
Continuous Batching               | —                                | 40% latency at peak          | Consistent throughput
Cold Start Elimination            | —                                | 99.7% of cold starts removed | 0 SLA violations on warmup
Inferentia Migration (DistilBERT) | $1,790                           | Within 20% of GPU baseline   | Zero regressions
Off-Peak Scheduled Scaling        | $423                             | —                            | Zero over-provisioning
Multi-LoRA Consolidation          | $1,015                           | <5ms adapter overhead        | No cross-contamination
Predictive Spike Scaling          | —                                | —                            | 0 errors on predicted events
AWQ Quantization                  | Enabled the consolidations above | —                            | 0 OOM incidents post-fix
Total                             | ~$12,400/month                   | 40–65% P99 improvement       | Near-zero GPU-related incidents
