
Scenario 4 — Container Restarts From GPU OOM And How We Stopped Them

User Story

As the on-call engineer for MangaAssist, I wanted long multi-turn manga reading sessions to stop crashing GPU workers and triggering container restarts, because every restart created a mini-outage for that instance's share of active user sessions — all of whom lost their conversation state mid-chat.


The Production Problem

The docs explicitly describe a failure chain:

  1. A user starts a long manga exploration session — 20+ turns asking about character arcs, plot comparisons, sequel recommendations
  2. Each turn appends to the conversation context
  3. Context growth increases KV-cache memory pressure
  4. VRAM is exhausted
  5. The serving worker process crashes
  6. The health check fails
  7. The platform restarts the container
  8. All users routed to that container lose their session

The user-facing symptom was not slow inference. It was an availability issue caused by container churn triggered by application-level memory behavior.


What We Actually Did

  • Introduced a sliding-window context policy so prompts stayed inside a clear token budget.
  • Quantized the large model path with AWQ INT4 to reclaim VRAM headroom.
  • Added an OOM circuit breaker so the worker degraded gracefully (returned a controlled fallback + emitted a metric) instead of crashing the process.

Deep-Dive Questions And Answers

Q1. Why did long manga conversations lead to container restarts? Because context growth increased KV-cache memory pressure. Each turn added tokens to the input, and the KV-cache for those tokens consumed VRAM. When VRAM was exhausted, the serving worker crashed. The container platform then interpreted that as an unhealthy container and triggered a restart cycle. The manga user's conversation was lost, along with every other session routed to that container.
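To make that memory pressure concrete, here is a rough KV-cache sizing sketch. The layer count, head dimensions, and per-turn token counts are illustrative assumptions for a 7B-class decoder, not measurements from MangaAssist.

```python
# Back-of-envelope KV-cache sizing. The model dimensions and per-turn token
# counts below are illustrative placeholders, not the MangaAssist model.

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    # Each token stores one key and one value vector in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

BYTES_FP16 = 2
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32,
                                     head_dim=128, bytes_per_elem=BYTES_FP16)

tokens_in_session = 20 * 400          # ~20 turns, ~400 prompt+reply tokens each
session_gib = per_token * tokens_in_session / 1024**3
print(f"~{per_token // 1024} KiB of KV-cache per token")
print(f"~{session_gib:.1f} GiB of KV-cache for one 20-turn session")
# Multiply by the number of concurrent sessions on a worker, add the model
# weights themselves, and a 24-40 GiB GPU runs out of headroom quickly.
```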

Q2. Why was sliding-window context better than blunt truncation? Blunt truncation throws away context without control — it just drops the oldest tokens, which might include the user's stated genre preferences, favorite characters, or reading history context that the model needs. Sliding-window budgeting keeps the most recent and most relevant turns, preserves a predictable token ceiling, and inserts an explicit truncation marker so the model knows earlier context was trimmed rather than silently missing.
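For reference, a minimal sketch of sliding-window budgeting with an explicit marker. The helper names, the whitespace-based token counter, and the marker wording are assumptions for illustration, not the production implementation.

```python
# Sliding-window context budget sketch. Swap count_tokens for the real
# tokenizer; the marker text and helper names are illustrative.
from typing import Dict, List

TRUNCATION_MARKER = {
    "role": "system",
    "content": "[Earlier turns were trimmed to fit the context budget.]",
}

def count_tokens(message: Dict[str, str]) -> int:
    # Placeholder: use the serving stack's tokenizer in production.
    return len(message["content"].split())

def build_prompt(system: Dict[str, str],
                 history: List[Dict[str, str]],
                 max_tokens: int) -> List[Dict[str, str]]:
    """Keep the system message plus the most recent turns that fit the budget."""
    budget = max_tokens - count_tokens(system) - count_tokens(TRUNCATION_MARKER)
    kept: List[Dict[str, str]] = []
    for turn in reversed(history):        # walk from the newest turn backwards
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()                        # restore chronological order
    trimmed = len(kept) < len(history)
    # The explicit marker tells the model that earlier context was dropped.
    return [system] + ([TRUNCATION_MARKER] if trimmed else []) + kept
```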

Q3. Why was quantization a container optimization, not only a model optimization? Reducing model VRAM footprint directly improved container stability and effective instance density. Smaller VRAM usage means: fewer OOMs per instance, more room for concurrent KV-cache from multiple users, less container churn under load. AWQ INT4 was the mechanism; container stability was the outcome.
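The scenario docs do not name the serving engine. Assuming a vLLM-style engine, loading an AWQ INT4 checkpoint looks roughly like this; the model name, memory fraction, and context length are placeholders, not the production values.

```python
# Hedged sketch: serving an AWQ INT4 checkpoint with vLLM. All values below
# are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ checkpoint
    quantization="awq",             # load INT4 AWQ weights instead of FP16
    gpu_memory_utilization=0.85,    # leave headroom for KV-cache growth
    max_model_len=8192,             # hard ceiling, matching the app-layer budget
)

outputs = llm.generate(
    ["Recommend a sequel series similar to Fullmetal Alchemist."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```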

Q4. Why keep an OOM circuit breaker if quantization already helped? Because prevention is not containment. Even with INT4 quantization and a context budget, a rare traffic pattern or unusually long prompt could still push toward OOM. The circuit breaker ensures that rare case returns a controlled response ("I need to start a fresh context for this conversation") and emits a metric — rather than crashing the worker and restarting the container, taking down all other concurrent sessions with it.
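A minimal sketch of such a breaker wrapped around the inference call; the metric client, fallback wording, and generate_fn hook are hypothetical stand-ins rather than the actual worker code.

```python
# OOM circuit breaker sketch: contain the failure instead of letting the
# worker process die. metrics and generate_fn are hypothetical stand-ins.
import torch

FALLBACK = ("I need to start a fresh context for this conversation. "
            "Could you restate your question?")

def guarded_generate(generate_fn, prompt, metrics) -> str:
    try:
        return generate_fn(prompt)
    except torch.cuda.OutOfMemoryError:
        # Free cached blocks, record the event, and answer gracefully so the
        # container stays healthy and other sessions are unaffected.
        torch.cuda.empty_cache()
        metrics.increment("inference.oom_circuit_breaker_trips")
        return FALLBACK
```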

Q5. What result should you quote? OOM incidents went from ~14 per week to zero over the monitored post-fix window. After the combined sliding-window + AWQ changes, 20-turn manga conversations stayed inside a fixed token budget, so their memory footprint no longer grew without bound.

Q6. What is the best senior-level takeaway? Treat GPU memory as a capacity budget, not an incidental detail. In long-context chat systems, prompt construction policy is just as important as the serving engine when you care about container stability. The serving engine manages the GPU; the application layer controls how much work the GPU is asked to do.


Optimizations We Can Credibly Claim

  • Explicit context-window budgeting with sliding-window policy
  • AWQ INT4 quantization to reclaim VRAM headroom
  • Graceful OOM handling instead of hard worker crashes
  • Fewer container restarts → better session stability for long manga reading sessions
  • Metrics emitted on OOM circuit breaker trips for proactive monitoring

Better-Than-Naive Explanation

The naive answer is "we needed a bigger GPU." Bigger hardware only delays the problem — a user with a long enough manga session would still eventually OOM a larger GPU if context growth is unbounded. We fixed the real issue by: controlling prompt growth with a sliding-window budget, shrinking the model memory footprint with quantization, and preventing one OOM from taking down the worker and all concurrent sessions with it.


Decision Table

Root cause: Unbounded context growth → KV-cache VRAM exhaustion → worker crash → container restart → session loss
Why not "just add GPU": Larger GPU delays the problem; doesn't fix unbounded context growth
Sliding-window vs blunt truncation: Blunt = silent loss of preference context. Sliding-window = predictable budget + explicit marker
Quantization (AWQ INT4) tradeoff: Minor quality delta on long inference vs significant VRAM reduction + container stability
OOM circuit breaker rationale: Prevention (budgeting + quantization) is necessary but not sufficient; containment (graceful fallback) handles the tail case
Who owns the fix: App layer owns the context budget (not the container); serving layer owns graceful OOM handling
Scale mechanism: Context budget = VRAM budget per request; enforced in the application before the GPU is stressed
Key metric: OOM incidents ~14/week → 0 post-fix

Tradeoffs Discussed

Each option considered, and why it was rejected or scoped:

Larger GPU instance: Delays OOM, doesn't prevent it; adds significant cost
Blunt oldest-turn truncation: Silently loses genre preferences and character context; bad for manga recommendation quality
No quantization (full precision): Max quality, but VRAM footprint too large for cost-efficient container density
Crash-and-restart without guard: Platform naturally handles it, but all concurrent sessions on that container are disrupted
Per-request GPU isolation: Each request in its own container; total isolation, but impractical cost and startup overhead

Scale Planned

Each constraint and how it is enforced:

Max input tokens per request: Sliding-window budget enforced in the application layer before the FM call
VRAM per container: AWQ INT4 reduces the model footprint; more headroom for concurrent KV-cache
Concurrent sessions per instance: Increased after quantization; more users per GPU without OOM risk
OOM tail case handling: Circuit breaker returns a controlled response and emits a metric; does not crash the worker

Failure Modes And Containment

Without fixes:
  Long conversation → VRAM exhausted → worker crashes → container restarts → ALL sessions on instance disrupted

With fixes:
  Long conversation → budget policy trims oldest turns → VRAM stays bounded
  Quantization → model needs less VRAM per inference → more headroom for all sessions
  Circuit breaker → rare edge case → graceful fallback → metric emitted → worker stays alive

Intuition From This Scenario

In a chat system, the application layer and the serving layer share responsibility for GPU memory. The serving engine manages VRAM efficiently during inference. But the application layer controls how large the input ever gets — and if the application sends unbounded context growth, no serving engine optimization can save it. The fix was split across both layers deliberately: sliding-window budget in the app layer to control what's sent, quantization in the serving layer to reduce the baseline footprint, and a circuit breaker at the boundary for the cases that still slip through. That three-layer defense is the senior-engineer answer.