
Docker Scenarios With Deep-Dive Answers

This pack converts the MangaAssist architecture into Docker-specific interview stories. The goal is not generic Docker trivia. The goal is to explain how containers actually showed up in this chatbot system and why the chosen approach was better than the obvious alternatives.

Framing Note

The repo shows two compute views:

  • The HLD introduces orchestration in a serverless style with Lambda and Step Functions.
  • The deeper stack and scalability docs show the production operating model as hybrid compute: ECS Fargate for baseline traffic and Lambda for burst overflow.

For Docker interviews, anchor on the production operating model: Fargate, ECR, SageMaker containers, vLLM serving containers, and Dockerized CI dependencies.

Where Docker Appears In MangaAssist

| Area | Docker role in this project | Why it mattered |
| --- | --- | --- |
| Application services | Multi-stage images deployed to ECS Fargate | Predictable baseline compute without EC2 management |
| Registry | ECR | Standard AWS registry with vulnerability scanning |
| ML serving | SageMaker inference containers and vLLM serving containers | Custom models, GPU efficiency, reproducible runtime |
| Startup path | Container warmup and health-check gating | Prevent cold containers from receiving live traffic |
| CI and integration tests | LocalStack, Redis, and OpenSearch containers | More realism than mocks without depending on shared staging for everything |
| Security | Image evidence, SBOM, signatures, promotion gates | Reduced supply-chain risk and faster rollback |

Scenario 1 - Multi-Stage Docker Builds For The Orchestrator

User story

As a platform engineer, I wanted the orchestrator and supporting services to run in lean, reproducible containers on ECS Fargate so that deployments stayed fast, secure, and operationally simple.

What we actually did

  • Used Docker multi-stage builds for production services.
  • Pushed images to ECR.
  • Ran steady baseline traffic on ECS Fargate.
  • Used Lambda for sudden overflow instead of making the container platform absorb every burst alone.

Why this was the right Docker story for MangaAssist

MangaAssist was not just a batch API. It had WebSocket traffic, per-container L1 caching, streaming responses, and a predictable daily baseline. That made long-lived containers a better steady-state home than Lambda-only, while still letting Lambda absorb sharp spikes.

Deep-dive questions and answers

Q1. Why did you use Docker on ECS Fargate instead of EC2 or EKS?
Because Fargate gave us the benefits of containers without cluster management. The repo explicitly positions EKS as overkill for our service count, and EC2 would have added patching, AMI management, and capacity management without giving us a product advantage.

Q2. Why did multi-stage Docker builds matter here?
They let us keep build tooling out of the runtime image. That reduced image size, reduced pull time, reduced attack surface, and made startup faster. In a service that may autoscale during traffic events, smaller runtime images directly improve deployment and recovery time.

Q3. What belongs in the runtime stage and what does not?
Runtime stage should contain only the app code, runtime dependencies, entrypoint, and minimal OS packages required to start. It should not contain compilers, test tools, lint tools, source caches, or training-only libraries. That separation is the entire point of the multi-stage pattern.
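A minimal sketch of that separation, assuming a Python-based service; the base image, paths, and entrypoint here are illustrative, not taken from the repo:

```dockerfile
# Build stage: compilers, dev headers, and pip caches stay here.
FROM python:3.11-slim AS build
WORKDIR /app
COPY requirements.txt .
# Install into an isolated prefix so only the results get copied forward.
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Runtime stage: app code, runtime dependencies, entrypoint, nothing else.
FROM python:3.11-slim AS runtime
WORKDIR /app
COPY --from=build /install /usr/local
COPY src/ ./src/
# Run as a non-root user to shrink the blast radius of a compromise.
RUN useradd --create-home appuser
USER appuser
ENTRYPOINT ["python", "-m", "src.orchestrator"]
```

Nothing in the build stage reaches the final image, which is exactly what shrinks pull time and attack surface.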

Q4. Why not run the orchestrator only on Lambda if Lambda was already in the design?
Because the deeper design evolved to hybrid compute. Fargate handled the predictable baseline more efficiently, while Lambda absorbed bursts. That let us keep steady-state cost and runtime behavior predictable, but still scale suddenly when traffic spiked.

Q5. What is the best interview answer for how Docker helped the application layer, not just infra?
Docker gave us deployment consistency across the orchestrator, observability components, and test environments. It also enabled per-container memory caches and stable runtime behavior for streaming chat workloads, which are less comfortable in a pure function-only architecture.

Optimizations we can credibly claim

  • Multi-stage images instead of single-stage builds.
  • Fargate for steady traffic, Lambda for overflow.
  • ECR as the standard registry with vulnerability scanning.
  • Per-container L1 cache on the application path, backed by Redis and DAX for shared caching (see the sketch after this list).
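A minimal sketch of the two-tier lookup, assuming a redis-py client; the connection details, TTLs, and the load_from_source loader are hypothetical stand-ins for the real data path:

```python
import time
import redis  # redis-py; host and port here are illustrative

r = redis.Redis(host="cache.internal", port=6379)

# L1: per-container in-process cache. Hits cost no network call,
# but entries are private to the container and die with it.
_l1: dict[str, tuple[float, str]] = {}
L1_TTL_SECONDS = 30

def load_from_source(key: str) -> str:
    # Hypothetical origin fetch; stands in for the real data path.
    return f"value-for-{key}"

def get_cached(key: str) -> str:
    now = time.time()
    entry = _l1.get(key)
    if entry and now - entry[0] < L1_TTL_SECONDS:
        return entry[1]                   # L1 hit: no network round trip
    raw = r.get(key)                      # L2: shared Redis cache
    if raw is None:
        value = load_from_source(key)
        r.setex(key, 300, value)          # populate L2 with its own TTL
    else:
        value = raw.decode()
    _l1[key] = (now, value)               # repopulate this container's L1
    return value
```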

Better-than-naive explanation

The naive answer is "we used Docker because everybody uses containers." The stronger answer is: we used Docker because it fit the hybrid compute model. Containers handled the predictable baseline well, and Lambda protected us from burst spikes. That gave us simpler operations than EKS and more stable runtime behavior than Lambda-only.


Scenario 2 - SageMaker Inference Containers And Cold-Start Elimination

User story

As a customer hitting the system right after a scale-out event, I wanted my first message to stay within the normal SLA instead of waiting 45 to 120 seconds for a cold model container to load.

What we actually did

  • Added a warmup script that sent synthetic requests during container startup.
  • Kept /ping unhealthy until warmup completed, so live traffic never hit a cold instance.
  • Set minimum instance count to 2 so the endpoint never scaled to zero.
  • Reduced load time with smaller or faster model artifacts, including quantization, ONNX or TorchScript where applicable, and safetensors plus faster storage for large-model startup paths.

The production problem

The docs show a real container startup chain:

  • Docker container start around 45 seconds
  • Model weight download and initialization after that
  • First live request potentially around 120 seconds after provisioning

That is not a small regression. That is a broken user experience if you let live traffic land on the instance too early.

Deep-dive questions and answers

Q1. Why was the problem a Docker container problem and not just a model problem?
Because the delay included container startup, dependency initialization, model loading, and first-request framework warmup. The model was part of the cost, but the user-facing symptom came from the full container lifecycle.

Q2. Why did you fail health checks until warmup completed?
Because a container can be "process alive" but still not be "traffic ready." Returning 503 until warmup finished prevented the load balancer from routing real users to an instance that had not yet completed model initialization and CUDA graph capture.

Q3. Why did warmup use multiple synthetic prompts instead of a single dummy request?
Because one request shape is not enough. We wanted to warm common inference shapes: short answers, medium responses, and longer recommendation-style prompts. That primes caches and kernel paths closer to real traffic.
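A minimal sketch of warmup plus readiness gating, assuming a FastAPI wrapper around the model server; the prompt set and the run_inference call are illustrative stand-ins:

```python
import threading
from fastapi import FastAPI, Response

app = FastAPI()
_ready = threading.Event()

# Illustrative prompt shapes; the real warmup set is workload-specific.
WARMUP_PROMPTS = [
    "Hi",                                                  # short answer path
    "Summarize this series in three sentences: ...",       # medium response
    "Recommend ten titles similar to ... with reasons.",   # long recommendation
]

def run_inference(prompt: str) -> str:
    # Stand-in for the real model call.
    return prompt

def warm_up() -> None:
    # Push synthetic requests through the real inference path so weights,
    # framework kernels, and caches are primed before live traffic.
    for prompt in WARMUP_PROMPTS:
        run_inference(prompt)
    _ready.set()

@app.on_event("startup")
def start_warmup() -> None:
    threading.Thread(target=warm_up, daemon=True).start()

@app.get("/ping")
def ping() -> Response:
    # Liveness is not readiness: stay 503 until warmup completes, so the
    # load balancer never routes real users to a cold instance.
    return Response(status_code=200 if _ready.is_set() else 503)
```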

Q4. Why was min=2 a better choice than scale-to-zero?
Because scale-to-zero is cheaper only if your SLA tolerates cold starts. Ours did not. Keeping two hot instances meant one instance could absorb traffic while another restarted or warmed up.

Q5. What metrics proved the fix worked?
The repo gives concrete outcomes: cold-start exposure dropped from roughly 118-120 seconds on first-hit scenarios to roughly 22 seconds in the remaining edge case, and SLA violations dropped to a very small fraction of traffic once warmup plus minimum-capacity controls were in place.

Q6. What is the best short answer if an interviewer asks for the core lesson?
Treat readiness and liveness differently. A container should not be considered healthy just because the process is up. It should be considered healthy only after the model and runtime are actually ready for production traffic.

Optimizations we can credibly claim

  • Startup warmup requests
  • Readiness gating via health check behavior
  • Minimum hot capacity
  • Smaller or faster-loading model artifacts
  • Faster storage and serialization choices for large models

Better-than-naive explanation

The naive answer is "autoscaling fixed it." It did not. Autoscaling only provisions capacity. We fixed the user experience by making container readiness depend on successful warmup, then by keeping a minimum hot pool so the platform rarely exposed a cold path at all.


Scenario 3 - vLLM Serving Containers For Throughput And Cost

User story

As an infrastructure engineer, I wanted our custom-model serving path to handle higher concurrency at lower GPU cost, without creating a brittle platform that only one specialist could operate.

What we actually did

  • Chose vLLM as the serving container for fine-tuned models on SageMaker.
  • Used PagedAttention to reduce KV-cache waste.
  • Used continuous batching to improve throughput during spikes.
  • Used automatic prefix caching to avoid repeated work in multi-turn chat.
  • Used streaming, Multi-LoRA, and AWQ quantization where appropriate.

Why this was a Docker story, not just an ML story

The interview value here is that vLLM was not only an algorithm choice. It was an operational container choice. The repo explicitly contrasts it against TGI and TensorRT-LLM in terms of setup complexity, Docker ergonomics, hardware flexibility, and operational burden.

Deep-dive questions and answers

Q1. Why vLLM over Hugging Face TGI or TensorRT-LLM?
Because it had the best balance of throughput, latency, setup simplicity, and hardware flexibility. TensorRT-LLM was slightly faster, but it imposed more compilation complexity and stronger NVIDIA lock-in. TGI was easier than raw Transformers but still behind vLLM on our workload.

Q2. What specific container-level gains did vLLM give you?
It let us package a single high-performance serving runtime with continuous batching, prefix caching, and OpenAI-compatible APIs. That lowered operational complexity and let the app layer switch serving backends with minimal changes.
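A minimal sketch of those knobs using vLLM's offline engine API; the model identifier is hypothetical, and in production the same options are passed to the serving container rather than constructed in application code:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="our-org/mangaassist-ft",   # hypothetical fine-tuned model id
    quantization="awq",               # AWQ weights shrink the VRAM footprint
    enable_prefix_caching=True,       # reuse shared system-prompt prefix work
    gpu_memory_utilization=0.90,      # bound KV-cache growth under PagedAttention
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Recommend a manga like Vinland Saga"], params)
print(outputs[0].outputs[0].text)
```

Continuous batching is the engine's default scheduling behavior, so it needs no flag here; the point is that one container image carries the whole optimization stack.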

Q3. How did prefix caching help a chatbot specifically?
Chatbots repeat a lot of prefix tokens: the system prompt, conversation framing, and recurring context structure. vLLM reused that shared prefix work, which the repo calls out as a meaningful reduction in redundant compute for multi-turn chats.

Q4. How did continuous batching change the traffic profile?
Instead of fixed batching windows that hurt latency, requests were dynamically batched as they arrived. That improved throughput while still reducing latency during spikes because the engine kept the GPU busy without waiting for a rigid batch boundary.

Q5. Why was Multi-LoRA important in container terms?
Because it let multiple adapters share one base model container instead of forcing separate endpoints and separate GPU fleets for each domain variant. That is a clean story for reducing both cost and operational sprawl.

Q6. What hard metric should you quote in the interview?
The repo gives strong numbers: about 50 percent GPU reduction and about 68 percent latency improvement after moving from raw Transformers to vLLM for the fine-tuned serving path.

Optimizations we can credibly claim

  • PagedAttention for memory efficiency
  • Continuous batching for spike handling
  • Prefix caching for repeated prompt structure
  • Streaming for better first-token experience
  • Multi-LoRA to consolidate variants
  • AWQ quantization to shrink memory footprint

Better-than-naive explanation

The naive answer is "we picked the fastest engine." The better answer is: we picked the best performance-to-operability point. vLLM was fast enough to win economically, simple enough to run as a production container, and flexible enough to avoid locking us into one hardware path.

If the interviewer pushes on future alternatives

Say that TensorRT-LLM is worth re-evaluating if its build pipeline becomes materially simpler or if a very large throughput gap opens. Also say that AWS Neuron-based paths are worth revisiting as the SDK matures for larger models. Both points are already consistent with the repo's tradeoff documents.


Scenario 4 - Container Restarts From GPU OOM And How We Stopped Them

User story

As the on-call engineer, I wanted long multi-turn conversations to stop crashing GPU workers and triggering container restarts, because every restart created a mini-outage for that instance's share of traffic.

What we actually did

  • Introduced a sliding-window context policy so prompts stayed inside a clear token budget.
  • Quantized the large model path with AWQ INT4 to reclaim VRAM.
  • Added an OOM guard so the worker degraded gracefully instead of crashing the process.

The production problem

The docs explicitly describe GPU OOM killing the worker process, which then caused the health check to restart the container. That means the user-facing symptom was not just slow inference. It was an availability issue caused by container churn.

Deep-dive questions and answers

Q1. Why did long conversations lead to container restarts?
Because context growth increased KV-cache memory pressure. When VRAM was exhausted, the serving worker crashed. The platform then interpreted that as an unhealthy container and restarted it.

Q2. Why was sliding-window context better than blunt truncation?
Because blunt truncation throws away context without control. Sliding-window budgeting keeps the most recent and most relevant turns, preserves a predictable token ceiling, and can insert an explicit truncation marker instead of silently losing context.
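A minimal sketch of the budgeting idea, assuming a tokenizer-backed count_tokens callable and a turns list whose first entry is the system prompt; the real policy in the repo is more nuanced:

```python
def apply_sliding_window(turns: list[dict], budget: int, count_tokens) -> list[dict]:
    """Keep the system prompt plus the most recent turns that fit the
    token budget, and mark the cut so truncation is never silent."""
    system, history = turns[0], turns[1:]
    kept: list[dict] = []
    used = count_tokens(system["content"])
    for turn in reversed(history):            # walk newest turns first
        cost = count_tokens(turn["content"])
        if used + cost > budget:
            break                             # budget exhausted: stop keeping
        kept.append(turn)
        used += cost
    kept.reverse()
    if len(kept) < len(history):
        # Explicit marker instead of silently dropping context.
        kept.insert(0, {"role": "system", "content": "[earlier turns truncated]"})
    return [system] + kept
```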

Q3. Why was quantization a container optimization, not only a model optimization?
Because reducing model memory footprint directly improved container stability and effective density. Smaller VRAM usage means fewer OOMs, more room for concurrent requests, and less container churn under load.

Q4. Why keep an OOM circuit breaker if quantization already helped?
Because prevention is not enough. You still need containment. If a rare request shape or traffic pattern triggers OOM, the service should return a controlled fallback and emit a metric rather than crash the worker.
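A minimal sketch of that containment layer, assuming a PyTorch-based worker; the metrics hook and fallback message are illustrative:

```python
import torch

def record_metric(name: str) -> None:
    # Stand-in for the real metrics emitter (CloudWatch, StatsD, etc.).
    print(f"metric: {name}")

def guarded_generate(model_generate, prompt: str) -> str:
    # Containment, not prevention: if a rare request shape still OOMs,
    # degrade gracefully instead of letting the worker process die.
    try:
        return model_generate(prompt)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()           # reclaim what we can
        record_metric("gpu_oom_fallback")  # make the event visible
        return "Sorry, that request was too large. Please try a shorter message."
```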

Q5. What result should you quote?
The repo gives a clean before-and-after story: OOM incidents went from about 14 per week to zero over the monitored post-fix window, and 20-turn conversations became much less memory-intensive after the combined sliding-window and AWQ changes.

Q6. What is the best senior-level takeaway?
Treat GPU memory as a capacity budget, not an incidental detail. In long-context chat systems, prompt construction policy is just as important as the serving engine when you care about container stability.

Optimizations we can credibly claim

  • Explicit context-window budgeting
  • INT4 quantization
  • Graceful OOM handling instead of hard crashes
  • Fewer restarts and better long-conversation stability

Better-than-naive explanation

The naive answer is "we needed a bigger GPU." The stronger answer is: bigger hardware only delays the problem. We fixed the real issue by controlling prompt growth, shrinking model memory footprint, and preventing one OOM from taking down the worker.


Scenario 5 - Dockerized Integration Testing Instead Of Mocking Everything

User story

As the API owner, I wanted integration tests that behaved like the real system without making every CI run depend on every shared staging service.

What we actually did

  • Ran LocalStack in Docker during CI for DynamoDB, S3, SQS, and Kinesis style integrations.
  • Used TestContainers to spin up Redis and OpenSearch per test run.
  • Used WireMock for downstream REST APIs when behavior control mattered more than runtime realism.
  • Still used real staging SageMaker endpoints where latency and actual model behavior mattered.

Why this is an important Docker interview story

This is a strong non-production Docker story. It shows that containers were part of engineering quality, not just deployment. The repo is explicit that mocks alone would have missed real latency and startup issues.

Deep-dive questions and answers

Q1. Why not mock everything in tests?
Because mocks are too optimistic. They do not show real serialization behavior, TTL behavior, index behavior, or container startup cost. They are useful at the unit layer, but insufficient for integration confidence.

Q2. Why combine LocalStack, TestContainers, and WireMock instead of standardizing on one tool?
Because each solves a different problem. LocalStack is good for AWS service emulation. TestContainers is good for real dependency instances like Redis and OpenSearch. WireMock is best when you want precise control over downstream API failures, delays, and retry sequences.
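A minimal sketch of the per-run dependency pattern, assuming the testcontainers-python library and boto3; table names and keys are illustrative:

```python
import boto3
from testcontainers.localstack import LocalStackContainer
from testcontainers.redis import RedisContainer

# Each test run gets disposable, realistic dependencies instead of mocks.
with LocalStackContainer().with_services("dynamodb", "s3") as localstack, \
     RedisContainer() as redis_box:

    dynamodb = boto3.client(
        "dynamodb",
        endpoint_url=localstack.get_url(),  # point the SDK at the container
        region_name="us-east-1",
        aws_access_key_id="test",
        aws_secret_access_key="test",
    )
    # Exercise real serialization and schema behavior, not a mock's view of it.
    dynamodb.create_table(
        TableName="sessions",
        KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )

    redis_client = redis_box.get_client()   # a real Redis, torn down after the run
    redis_client.setex("session:1", 60, "state")
```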

Q3. Where did you refuse to use Docker-based emulation and instead call a real service?
For SageMaker staging endpoints. The repo explicitly says real latency characteristics mattered there. A local fake would not have revealed cold-start behavior or real tail-latency issues.

Q4. What bug did this testing strategy catch that mocks could have hidden?
It caught real retry and circuit-breaker behavior, plus model cold-start issues that only appeared when talking to actual staged inference endpoints. That is exactly why the project used layered testing rather than a mocks-only strategy.

Q5. What is the best way to explain the value of Docker here?
Docker gave us reproducible, disposable dependency environments in CI. That raised realism while keeping tests isolated and automatable.

Optimizations we can credibly claim

  • Three-tier testing model: mocked, local-emulated, and real staging
  • Faster feedback than full shared-environment testing
  • Better behavioral coverage than mocks-only testing
  • Reproducible CI infrastructure via disposable containers

Better-than-naive explanation

The naive answer is "we used Docker in CI because it was convenient." The stronger answer is: we used Docker selectively to move more tests into the middle of the test pyramid, where they were realistic enough to catch integration failures but still cheap enough to run on every PR.


Scenario 6 - Container Supply-Chain Security And Release Gating

User story

As a platform and security engineer, I wanted only trusted container artifacts to reach production, because an LLM system can fail through its build and deployment chain even when application code looks unchanged.

What we actually did

  • Used ECR as the standard registry and scanned images for vulnerabilities.
  • Treated base images, build containers, and deployment artifacts as supply-chain inputs.
  • Required evidence such as lockfiles, dependency diffs, SBOMs, signatures, and provenance before promotion.
  • Used rollback-to-previous-signed-artifact as the containment path when a release was no longer trusted.

Why this matters specifically for Docker

A Docker image is not just a packaging format. It is a release artifact that bundles OS libraries, language runtimes, transitive dependencies, and sometimes model-serving components. If you cannot prove what is inside the image and how it was built, you do not really control production.

Deep-dive questions and answers

Q1. Why is image scanning alone not enough?
Because scanning tells you about known issues, but not provenance. You still need exact dependency versions, SBOMs, and signatures so you can answer what was built, from which inputs, and whether the artifact was tampered with.

Q2. What release evidence would you require for a production container?
Lockfile digest, dependency diff from the previous release, SBOM, build provenance attestation, artifact signature, security scan results, evaluation results, and deployment approval metadata. The supply-chain document in the repo is very clear about that evidence trail.
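An illustrative fail-closed promotion gate; the evidence names mirror the list above, and the verification helpers are hypothetical stand-ins for real tooling such as an out-of-process signature verifier:

```python
REQUIRED_EVIDENCE = {
    "lockfile_digest", "dependency_diff", "sbom", "provenance_attestation",
    "signature", "scan_results", "eval_results", "approval",
}

def verify_signature(image_ref: str, sig: str) -> bool:
    # Stand-in for a real verifier (e.g. cosign invoked out of process).
    return bool(sig)

def tag_for_production(image_ref: str) -> None:
    # Hypothetical registry promotion step.
    print(f"promoted {image_ref}")

def promote(image_ref: str, evidence: dict) -> None:
    missing = REQUIRED_EVIDENCE - evidence.keys()
    if missing:
        # Fail closed: absent evidence is treated the same as bad evidence.
        raise RuntimeError(f"refusing {image_ref}: missing {sorted(missing)}")
    if not verify_signature(image_ref, evidence["signature"]):
        raise RuntimeError(f"refusing {image_ref}: signature check failed")
    tag_for_production(image_ref)
```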

Q3. How would you respond if a critical base-image CVE appeared after release?
Use the SBOM to identify affected releases, rebuild from a patched base or lockfile, promote a new signed artifact, and be prepared to roll back any still-unhealthy deployment to the previous trusted version. The important part is that SBOM-driven triage starts with facts instead of guesswork.

Q4. How do you explain this without sounding like generic security theater?
Tie it back to customer impact. In MangaAssist, a bad artifact could affect guardrails, prompt handling, or runtime dependencies. We cared about rapid containment and explainability, not just compliance checkboxes.

Q5. What is the best "senior engineer" sentence here?
I treat container images as governed release artifacts, not build by-products. If an image is unsigned, missing an SBOM, or missing promotion evidence, it should fail closed and never reach production.

Optimizations we can credibly claim

  • ECR scanning
  • Promotion policy tied to evidence
  • SBOM and provenance-based incident response
  • Fast rollback to a previously trusted artifact

Better-than-naive explanation

The naive answer is "we scan Docker images." The stronger answer is: we made image trust enforceable at promotion time and auditable after release. That is how you manage supply-chain risk in a production chatbot, especially one whose runtime behavior also depends on model and prompt artifacts.


Quick-Fire Interview Questions

Use these when the interviewer wants shorter follow-ups after the stories above.

Why was Docker still useful even though the system also used Lambda?
Because the operating model was hybrid. Containers handled steady-state traffic and container-friendly services, while Lambda handled burst overflow and short-lived event-driven paths.

What was your strongest Docker optimization in the ML path?
Warmup plus readiness gating for startup, and vLLM plus quantization for steady-state efficiency.

What was your strongest Docker optimization in the app path?
Multi-stage builds and using Fargate only for the steady baseline instead of overcommitting the entire burst profile to containers.

What was the biggest Docker-related incident risk?
Cold containers receiving traffic too early and GPU OOM crashes causing unhealthy worker restarts.

What is one thing you would improve next?
I would tighten runtime enforcement so only signed, policy-approved artifacts can start, not just be stored in the registry. That direction is consistent with the supply-chain controls already documented in the repo.


Best Story Order For Interviews

If the interviewer is infra-heavy:

  1. Scenario 2 - cold starts and readiness gating
  2. Scenario 3 - vLLM container optimization
  3. Scenario 6 - supply-chain controls

If the interviewer is backend-heavy:

  1. Scenario 1 - Fargate plus Lambda hybrid
  2. Scenario 5 - Dockerized integration testing
  3. Scenario 2 - inference container warmup

If the interviewer is ML platform-heavy:

  1. Scenario 3 - vLLM choice
  2. Scenario 4 - OOM and restart containment
  3. Scenario 2 - cold-start elimination

One Strong Summary Answer

"In MangaAssist, Docker was not just a packaging decision. It shaped three critical parts of the system: steady-state compute on Fargate, custom inference on SageMaker containers, and realistic CI integration testing. The biggest wins came from multi-stage images, warmup and readiness gating for inference containers, and vLLM-based serving that cut GPU cost while improving latency. The more senior answer is that we treated containers as both a performance tool and a governed release artifact."