
Scenario 4 — Container Restarts From GPU OOM And How We Stopped Them

User Story

As the on-call engineer for MangaAssist, I wanted long multi-turn manga reading sessions to stop crashing GPU workers and triggering container restarts, because every restart created a mini-outage for that instance's share of active user sessions — all of whom lost their conversation state mid-chat.


The Production Problem

The docs explicitly describe a failure chain:

  1. A user starts a long manga exploration session — 20+ turns asking about character arcs, plot comparisons, sequel recommendations
  2. Each turn appends to the conversation context
  3. Context growth increases KV-cache memory pressure
  4. VRAM is exhausted
  5. The serving worker process crashes
  6. The health check fails
  7. The platform restarts the container
  8. All users routed to that container lose their session

The user-facing symptom was not slow inference. It was an availability issue caused by container churn triggered by application-level memory behavior.


What We Actually Did

  • Introduced a sliding-window context policy so prompts stayed inside a clear token budget.
  • Quantized the large model path with AWQ INT4 to reclaim VRAM headroom.
  • Added an OOM circuit breaker so the worker degraded gracefully (returned a controlled fallback + emitted a metric) instead of crashing the process.

Deep-Dive Questions And Answers

Q1. Why did long manga conversations lead to container restarts? Because context growth increased KV-cache memory pressure. Each turn added tokens to the input, and the KV-cache for those tokens consumed VRAM. When VRAM was exhausted, the serving worker crashed. The container platform then interpreted that as an unhealthy container and triggered a restart cycle. The manga user's conversation was lost, along with every other session routed to that container.
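To make that memory pressure concrete, here is a rough KV-cache sizing sketch. The layer count, head dimensions, and per-turn token counts are illustrative assumptions for a 7B-class decoder, not measurements from MangaAssist.

```python
# Back-of-envelope KV-cache sizing. The model dimensions and per-turn token
# counts below are illustrative placeholders, not the MangaAssist model.

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    # Each token stores one key and one value vector in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

BYTES_FP16 = 2
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32,
                                     head_dim=128, bytes_per_elem=BYTES_FP16)

tokens_in_session = 20 * 400          # ~20 turns, ~400 prompt+reply tokens each
session_gib = per_token * tokens_in_session / 1024**3
print(f"~{per_token // 1024} KiB of KV-cache per token")
print(f"~{session_gib:.1f} GiB of KV-cache for one 20-turn session")
# Multiply by the number of concurrent sessions on a worker, add the model
# weights themselves, and a 24-40 GiB GPU runs out of headroom quickly.
```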

Q2. Why was sliding-window context better than blunt truncation? Blunt truncation throws away context without control — it just drops the oldest tokens, which might include the user's stated genre preferences, favorite characters, or reading history context that the model needs. Sliding-window budgeting keeps the most recent and most relevant turns, preserves a predictable token ceiling, and inserts an explicit truncation marker so the model knows earlier context was trimmed rather than silently missing.
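For reference, a minimal sketch of sliding-window budgeting with an explicit marker. The helper names, the whitespace-based token counter, and the marker wording are assumptions for illustration, not the production implementation.

```python
# Sliding-window context budget sketch. Swap count_tokens for the real
# tokenizer; the marker text and helper names are illustrative.
from typing import Dict, List

TRUNCATION_MARKER = {
    "role": "system",
    "content": "[Earlier turns were trimmed to fit the context budget.]",
}

def count_tokens(message: Dict[str, str]) -> int:
    # Placeholder: use the serving stack's tokenizer in production.
    return len(message["content"].split())

def build_prompt(system: Dict[str, str],
                 history: List[Dict[str, str]],
                 max_tokens: int) -> List[Dict[str, str]]:
    """Keep the system message plus the most recent turns that fit the budget."""
    budget = max_tokens - count_tokens(system) - count_tokens(TRUNCATION_MARKER)
    kept: List[Dict[str, str]] = []
    for turn in reversed(history):        # walk from the newest turn backwards
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()                        # restore chronological order
    trimmed = len(kept) < len(history)
    # The explicit marker tells the model that earlier context was dropped.
    return [system] + ([TRUNCATION_MARKER] if trimmed else []) + kept
```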

Q3. Why was quantization a container optimization, not only a model optimization? Reducing model VRAM footprint directly improved container stability and effective instance density. Smaller VRAM usage means: fewer OOMs per instance, more room for concurrent KV-cache from multiple users, less container churn under load. AWQ INT4 was the mechanism; container stability was the outcome.
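The scenario docs do not name the serving engine. Assuming a vLLM-style engine, loading an AWQ INT4 checkpoint looks roughly like this; the model name, memory fraction, and context length are placeholders, not the production values.

```python
# Hedged sketch: serving an AWQ INT4 checkpoint with vLLM. All values below
# are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ checkpoint
    quantization="awq",             # load INT4 AWQ weights instead of FP16
    gpu_memory_utilization=0.85,    # leave headroom for KV-cache growth
    max_model_len=8192,             # hard ceiling, matching the app-layer budget
)

outputs = llm.generate(
    ["Recommend a sequel series similar to Fullmetal Alchemist."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```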

Q4. Why keep an OOM circuit breaker if quantization already helped? Because prevention is not containment. Even with INT4 quantization and a context budget, a rare traffic pattern or unusually long prompt could still push toward OOM. The circuit breaker ensures that rare case returns a controlled response ("I need to start a fresh context for this conversation") and emits a metric — rather than crashing the worker and restarting the container, taking down all other concurrent sessions with it.
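A minimal sketch of such a breaker wrapped around the inference call; the metric client, fallback wording, and generate_fn hook are hypothetical stand-ins rather than the actual worker code.

```python
# OOM circuit breaker sketch: contain the failure instead of letting the
# worker process die. metrics and generate_fn are hypothetical stand-ins.
import torch

FALLBACK = ("I need to start a fresh context for this conversation. "
            "Could you restate your question?")

def guarded_generate(generate_fn, prompt, metrics) -> str:
    try:
        return generate_fn(prompt)
    except torch.cuda.OutOfMemoryError:
        # Free cached blocks, record the event, and answer gracefully so the
        # container stays healthy and other sessions are unaffected.
        torch.cuda.empty_cache()
        metrics.increment("inference.oom_circuit_breaker_trips")
        return FALLBACK
```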

Q5. What result should you quote? OOM incidents went from ~14 per week to zero over the monitored post-fix window. After the combined sliding-window + AWQ changes, 20-turn manga conversations stayed inside a fixed token budget, so their memory footprint no longer grew without bound.

Q6. What is the best senior-level takeaway? Treat GPU memory as a capacity budget, not an incidental detail. In long-context chat systems, prompt construction policy is just as important as the serving engine when you care about container stability. The serving engine manages the GPU; the application layer controls how much work the GPU is asked to do.


Optimizations We Can Credibly Claim

  • Explicit context-window budgeting with sliding-window policy
  • AWQ INT4 quantization to reclaim VRAM headroom
  • Graceful OOM handling instead of hard worker crashes
  • Fewer container restarts → better session stability for long manga reading sessions
  • Metrics emitted on OOM circuit breaker trips for proactive monitoring

Better-Than-Naive Explanation

The naive answer is "we needed a bigger GPU." Bigger hardware only delays the problem — a user with a long enough manga session would still eventually OOM a larger GPU if context growth is unbounded. We fixed the real issue by: controlling prompt growth with a sliding-window budget, shrinking the model memory footprint with quantization, and preventing one OOM from taking down the worker and all concurrent sessions with it.


Decision Table

Root cause: Unbounded context growth → KV-cache VRAM exhaustion → worker crash → container restart → session loss
Why not "just add GPU": Larger GPU delays the problem; doesn't fix unbounded context growth
Sliding-window vs blunt truncation: Blunt = silent loss of preference context. Sliding-window = predictable budget + explicit marker
Quantization (AWQ INT4) tradeoff: Minor quality delta on long inference vs significant VRAM reduction + container stability
OOM circuit breaker rationale: Prevention (budgeting + quantization) is necessary but not sufficient; containment (graceful fallback) handles the tail case
Who owns the fix: App layer owns the context budget (not the container); serving layer owns graceful OOM handling
Scale mechanism: Context budget = VRAM budget per request; enforced in the application before the GPU is stressed
Key metric: OOM incidents ~14/week → 0 post-fix

Tradeoffs Discussed

Each option considered, and why it was rejected or scoped:

Larger GPU instance: Delays OOM, doesn't prevent it; adds significant cost
Blunt oldest-turn truncation: Silently loses genre preferences and character context; bad for manga recommendation quality
No quantization (full precision): Max quality, but VRAM footprint too large for cost-efficient container density
Crash-and-restart without guard: Platform naturally handles it, but all concurrent sessions on that container are disrupted
Per-request GPU isolation: Each request in its own container; total isolation, but impractical cost and startup overhead

Scale Planned

Each constraint and how it is enforced:

Max input tokens per request: Sliding-window budget enforced in the application layer before the FM call
VRAM per container: AWQ INT4 reduces the model footprint; more headroom for concurrent KV-cache
Concurrent sessions per instance: Increased after quantization; more users per GPU without OOM risk
OOM tail case handling: Circuit breaker returns a controlled response and emits a metric; does not crash the worker

Failure Modes And Containment

Without fixes:
  Long conversation → VRAM exhausted → worker crashes → container restarts → ALL sessions on instance disrupted

With fixes:
  Long conversation → budget policy trims oldest turns → VRAM stays bounded
  Quantization → model needs less VRAM per inference → more headroom for all sessions
  Circuit breaker → rare edge case → graceful fallback → metric emitted → worker stays alive

Intuition From This Scenario

In a chat system, the application layer and the serving layer share responsibility for GPU memory. The serving engine manages VRAM efficiently during inference. But the application layer controls how large the input ever gets — and if the application sends unbounded context growth, no serving engine optimization can save it. The fix was split across both layers deliberately: sliding-window budget in the app layer to control what's sent, quantization in the serving layer to reduce the baseline footprint, and a circuit breaker at the boundary for the cases that still slip through. That three-layer defense is the senior-engineer answer.