Scenario 4 — Container Restarts From GPU OOM And How We Stopped Them
User Story
As the on-call engineer for MangaAssist, I wanted long multi-turn manga reading sessions to stop crashing GPU workers and triggering container restarts, because every restart created a mini-outage for that instance's share of active user sessions, all of which lost their conversation state mid-chat.
The Production Problem
The docs explicitly describe a failure chain:
- A user starts a long manga exploration session — 20+ turns asking about character arcs, plot comparisons, sequel recommendations
- Each turn appends to the conversation context
- Context growth increases KV-cache memory pressure
- VRAM is exhausted
- The serving worker process crashes
- The health check fails
- The platform restarts the container
- All users routed to that container lose their session
The user-facing symptom was not slow inference. It was an availability issue caused by container churn triggered by application-level memory behavior.
What We Actually Did
- Introduced a sliding-window context policy so prompts stayed inside a clear token budget.
- Quantized the large model path with AWQ INT4 to reclaim VRAM headroom.
- Added an OOM circuit breaker so the worker degraded gracefully (returned a controlled fallback + emitted a metric) instead of crashing the process.
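As a sketch, the sliding-window policy above can be expressed as a token-budget trimmer. Everything here is illustrative: `count_tokens` approximates the model tokenizer with whitespace splitting, and the marker text is a placeholder, not the exact string we shipped.

```python
TRUNCATION_MARKER = "[earlier turns omitted to fit the context budget]"

def count_tokens(text: str) -> int:
    # Rough stand-in for the model tokenizer: whitespace word count.
    return len(text.split())

def apply_sliding_window(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose combined token count fits the budget.

    If anything is dropped, prepend an explicit truncation marker so the
    model knows older context was removed, not silently absent.
    """
    kept: list[str] = []
    used = count_tokens(TRUNCATION_MARKER)  # reserve room for the marker
    for turn in reversed(turns):            # walk newest turns first
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    if len(kept) < len(turns):
        kept.insert(0, TRUNCATION_MARKER)
    return kept
```

Because the budget is enforced before the FM call, the per-request KV-cache cost is bounded no matter how long the conversation runs.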
Deep-Dive Questions And Answers
Q1. Why did long manga conversations lead to container restarts? Because context growth increased KV-cache memory pressure. Each turn added tokens to the input, and the KV-cache for those tokens consumed VRAM. When VRAM was exhausted, the serving worker crashed, the platform's health check marked the container unhealthy, and a restart cycle began. The manga user's conversation was lost.
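The arithmetic behind Q1 is worth making concrete. The dimensions below are illustrative for a 7B-class dense model (32 layers, 32 KV heads, head dim 128, FP16 values); the real model's numbers will differ, especially with grouped-query attention, but the compounding effect is the same.

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # K and V each store layers * kv_heads * head_dim values per token,
    # hence the leading factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

per_token = kv_cache_bytes(1)            # 524288 bytes: 0.5 MiB per token
session = kv_cache_bytes(8000) / 2**30   # ~3.9 GiB for an 8k-token session
```

At roughly 0.5 MiB per cached token, a handful of concurrent 20-turn sessions can consume multiple GiB of VRAM on top of the model weights, which is exactly the pressure that crashed the worker.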
Q2. Why was sliding-window context better than blunt truncation? Blunt truncation throws away context without control — it just drops the oldest tokens, which might include the user's stated genre preferences, favorite characters, or reading history context that the model needs. Sliding-window budgeting keeps the most recent and most relevant turns, preserves a predictable token ceiling, and inserts an explicit truncation marker so the model knows context was summarized — not just silently missing.
Q3. Why was quantization a container optimization, not only a model optimization? Reducing model VRAM footprint directly improved container stability and effective instance density. Smaller VRAM usage means: fewer OOMs per instance, more room for concurrent KV-cache from multiple users, less container churn under load. AWQ INT4 was the mechanism; container stability was the outcome.
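A back-of-envelope sketch of why INT4 reclaims headroom, assuming an illustrative 7B-parameter model. Real AWQ checkpoints keep some tensors at higher precision and store per-group scales, so actual savings are somewhat smaller than this weight-only estimate.

```python
def model_footprint_gib(params_b: float, bits: int) -> float:
    # Weight storage only; ignores activation memory and quantization scales.
    return params_b * 1e9 * bits / 8 / 2**30

fp16 = model_footprint_gib(7, 16)   # ~13.0 GiB of weights
int4 = model_footprint_gib(7, 4)    # ~3.3 GiB of weights
headroom = fp16 - int4              # ~9.8 GiB freed for concurrent KV-cache
```

Those freed gigabytes are what turn quantization into a container-level win: they become KV-cache room for more concurrent sessions per instance.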
Q4. Why keep an OOM circuit breaker if quantization already helped? Because prevention is not containment. Even with INT4 quantization and a context budget, a rare traffic pattern or unusually long prompt could still push toward OOM. The circuit breaker ensures that rare case returns a controlled response ("I need to start a fresh context for this conversation") and emits a metric — rather than crashing the worker and restarting the container, taking down all other concurrent sessions with it.
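A minimal sketch of the breaker's shape. The exception class and metric helper are stand-ins (a real worker would catch `torch.cuda.OutOfMemoryError` and call its actual metrics client); the fallback string mirrors the controlled response described above.

```python
class GpuOom(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this sketch."""

FALLBACK_REPLY = "I need to start a fresh context for this conversation."

def emit_metric(name: str, value: int = 1) -> None:
    # Stand-in for a real metrics client (StatsD, CloudWatch, etc.).
    print(f"metric {name}={value}")

def generate_with_breaker(generate, prompt: str) -> str:
    """Run inference; on OOM, emit a metric and return a controlled
    fallback instead of letting the worker process die."""
    try:
        return generate(prompt)
    except GpuOom:
        # In a real worker: also free the session's cached context and
        # call torch.cuda.empty_cache() before answering.
        emit_metric("inference.oom_breaker_trips")
        return FALLBACK_REPLY
```

The key property is that one bad request degrades to one apologetic reply plus one metric, while every other session on the container keeps running.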
Q5. What result should you quote? OOM incidents went from ~14 per week to zero over the monitored post-fix window. 20-turn manga conversations became much less memory-intensive after the combined sliding-window + AWQ changes.
Q6. What is the best senior-level takeaway? Treat GPU memory as a capacity budget, not an incidental detail. In long-context chat systems, prompt construction policy is just as important as the serving engine when you care about container stability. The serving engine manages the GPU; the application layer controls how much work the GPU is asked to do.
Optimizations We Can Credibly Claim
- Explicit context-window budgeting with sliding-window policy
- AWQ INT4 quantization to reclaim VRAM headroom
- Graceful OOM handling instead of hard worker crashes
- Fewer container restarts → better session stability for long manga reading sessions
- Metrics emitted on OOM circuit breaker trips for proactive monitoring
Better-Than-Naive Explanation
The naive answer is "we needed a bigger GPU." Bigger hardware only delays the problem — a user with a long enough manga session would still eventually OOM a larger GPU if context growth is unbounded. We fixed the real issue by: controlling prompt growth with a sliding-window budget, shrinking the model memory footprint with quantization, and preventing one OOM from taking down the worker and all concurrent sessions with it.
Decision Table
| Dimension | Details |
|---|---|
| Root cause | Unbounded context growth → KV-cache VRAM exhaustion → worker crash → container restart → session loss |
| Why not "just add GPU" | Larger GPU delays the problem; doesn't fix unbounded context growth |
| Sliding-window vs blunt truncation | Blunt = silent loss of preference context. Sliding-window = predictable budget + explicit marker |
| Quantization (AWQ INT4) tradeoff | Minor quality delta on long inference vs significant VRAM reduction + container stability |
| OOM circuit breaker rationale | Prevention (budgeting + quantization) is necessary but not sufficient — containment (graceful fallback) handles the tail case |
| Who owns the fix | App layer owns context budget (not the container), serving layer owns graceful OOM handling |
| Scale mechanism | Context budget = VRAM budget per request; enforced in application before GPU is stressed |
| Key metric | OOM incidents: ~14/week → 0 post-fix |
Tradeoffs Discussed
| Option considered | Why rejected or scoped |
|---|---|
| Larger GPU instance | Delays OOM, doesn't prevent it; adds significant cost |
| Blunt oldest-turn truncation | Loses genre preferences, character context silently; bad for manga recommendation quality |
| No quantization (full precision) | Max quality but VRAM footprint too large for cost-efficient container density |
| Crash-and-restart without guard | Platform naturally handles it, but all concurrent sessions on that container are disrupted |
| Per-request GPU isolation | Each request in its own container — total isolation but impractical cost and startup overhead |
Scale Planned
| Constraint | Enforcement |
|---|---|
| Max input tokens per request | Sliding-window budget enforced in application layer before FM call |
| VRAM per container | AWQ INT4 reduces model footprint; more headroom for concurrent KV-cache |
| Concurrent sessions per instance | Increased after quantization: more users per GPU with far lower OOM risk |
| OOM tail case handling | Circuit breaker returns controlled response + emits metric; does not crash worker |
Failure Modes And Containment
Without fixes:
Long conversation → VRAM exhausted → worker crashes → container restarts → ALL sessions on instance disrupted
With fixes:
Long conversation → budget policy trims oldest turns → VRAM stays bounded
Quantization → model needs less VRAM per inference → more headroom for all sessions
Circuit breaker → rare edge case → graceful fallback → metric emitted → worker stays alive
Intuition From This Scenario
In a chat system, the application layer and the serving layer share responsibility for GPU memory. The serving engine manages VRAM efficiently during inference. But the application layer controls how large the input ever gets — and if the application sends unbounded context growth, no serving engine optimization can save it. The fix was split across both layers deliberately: sliding-window budget in the app layer to control what's sent, quantization in the serving layer to reduce the baseline footprint, and a circuit breaker at the boundary for the cases that still slip through. That three-layer defense is the senior-engineer answer.