Docker Scenarios LLD Deep Dive
This document expands the interview stories in 01-docker-scenarios-with-answers.md into low-level design notes.
Each section is written as a decision log:
- what the first design looked like
- what broke or proved insufficient
- what decision replaced it
- how the newer decision improved the previous one
How To Use This Document
- Read Docker-LLD-1 first for the application runtime and image strategy.
- Read Docker-LLD-2 through Docker-LLD-4 next for inference-container behavior.
- Use Docker-LLD-5 and Docker-LLD-6 for CI realism and release governance.
Docker-LLD-1: Multi-Stage Builds And Hybrid Runtime For The Orchestrator
Objective
Run the orchestrator in a container platform that preserves WebSocket streaming behavior, benefits from per-container L1 caching, and avoids paying cluster-management tax for a relatively small service count.
Architecture Slice
flowchart LR
User["Web / Mobile Chat Client"] --> Edge["CloudFront + ALB / API Gateway"]
Edge --> Fargate["ECS Fargate Orchestrator Task"]
Edge --> Lambda["Lambda Burst Worker"]
Fargate --> L1["L1 In-Memory Cache"]
Fargate --> Redis["Redis L2 Cache"]
Fargate --> DAX["DAX / DynamoDB Conversation Path"]
Fargate --> ECR["Image from ECR"]
Decision Evolution
| Step | New decision | Why the previous decision was not enough | Improvement introduced |
|---|---|---|---|
| D0 | Evaluate Lambda-only, EC2, and EKS as the primary runtime | Lambda-only is weak for long-lived streaming and per-container cache locality. EC2 and EKS add host or cluster management that the service count did not justify. | Forced the team to separate "baseline traffic" from "burst traffic" instead of trying to make one platform fit both. |
| D1 | Put baseline traffic on ECS Fargate | The earlier choices either created operational drag or did not fit sticky WebSocket-style sessions well. | Fargate kept container benefits while removing EC2 patching, AMI rotation, and Kubernetes control-plane work. |
| D2 | Use multi-stage Docker builds for application images | A single-stage image would carry compilers, test tools, and build caches into production. | Smaller runtime images, faster pulls, lower attack surface, and shorter replacement time during scale-out or rollback. |
| D3 | Keep Lambda only for overflow paths | Asking Fargate alone to absorb the full burst profile would over-provision baseline capacity. | Baseline stayed predictable on Fargate while Lambda handled sudden concurrency spikes. |
Runtime Design
- The client opens a WebSocket or HTTPS fallback connection through the chat edge.
- Sticky routing sends steady traffic to a Fargate task so streaming and connection state remain stable.
- The task uses L1 in-memory cache first, then Redis and DAX-backed paths for shared or durable state.
- Burst-only work, lightweight fan-out, or overflow traffic can spill to Lambda.
- Both runtime types consume versioned artifacts from ECR-backed release pipelines.
Image Design
| Layer | Contents | Why it exists |
|---|---|---|
| Build stage | Package manager metadata, compiler toolchain, dependency install, tests, asset compilation | Keeps heavyweight tooling out of the production image. |
| Runtime stage | App code, runtime dependencies, entrypoint, health-check endpoint, minimal OS packages | Produces a lean image optimized for Fargate task replacement and rollback. |
| Registry layer | ECR repository, vulnerability scan metadata, immutable image tag or digest | Makes the container a governed release artifact, not an opaque tarball. |
Low-Level Decisions
| Concern | Decision | Why it improved the prior option |
|---|---|---|
| Baseline compute | Fargate tasks for 10-50 normal tasks | Better than EC2 because it removed host management; better than EKS because there was no cluster overhead to justify. |
| Burst compute | Lambda for overflow up to very high concurrency | Better than over-scaling Fargate for rare spikes because cost stayed aligned to steady-state load. |
| Runtime locality | Per-container L1 cache plus Redis L2 | Better than Redis-only for ultra-hot reads because local memory stays sub-millisecond. |
| Registry | ECR with scanning | Better than ad hoc image storage because release metadata and vulnerability evidence stay attached to the artifact. |
| Build pattern | Multi-stage Dockerfile | Better than single-stage because build-only dependencies never reach production. |
Failure Modes And Controls
| Failure mode | User-facing risk | Control |
|---|---|---|
| Large runtime image | Slow deployment or slow task replacement | Multi-stage build strips toolchains and caches. |
| Scale event during traffic spike | Queue growth or slow first token | Lambda absorbs overflow rather than forcing Fargate to cover the entire spike envelope. |
| Container-local cache loss on restart | Minor latency regression on first reads | Redis and DAX remain the shared backstop layers. |
| Host or cluster drift | Operational toil, patch lag, brittle runbooks | Fargate removes EC2 and Kubernetes fleet management. |
Improvement Evidence
- Fargate became the normal path for predictable traffic instead of forcing EC2 or EKS operations into the design.
- Lambda remained available for fast elasticity instead of bloating container baseline capacity.
- The three-layer cache model gave the container runtime a concrete advantage over a function-only design.
Deep Dive: Group Discussion — Why Not Just All-Lambda Or All-EKS?
Imagine five engineers debating this decision at a whiteboard:
Engineer A (Lambda advocate): "Let's just do Lambda. Auto-scaling is free, no containers to manage, pay per invocation."
Engineer B (Systems thinker): "Lambda cold starts are 500ms-3s for our Python runtime. Our chat users expect first-token in under 200ms. Also, Lambda has no persistent memory — every invocation starts empty. Our L1 cache strategy is dead on arrival."
Engineer C (Kubernetes enthusiast): "EKS gives us everything — pods, services, autoscaling, service mesh. Full control."
Engineer D (Operations lead): "We have 3-5 services. EKS means managing a control plane ($73/month just for the API server), node groups, RBAC policies, Helm charts, ingress controllers, service mesh, certificate rotation, etcd backups. That's a full-time SRE role for a small service fleet."
Engineer E (Pragmatist): "Fargate gives us container isolation, task definitions, and auto-scaling without any of the EKS control-plane work OR the Lambda cold-start problem. We keep WebSocket stickiness through ALB target groups, and L1 cache lives in the container's memory space. Lambda catches the overflow."
The insight that wins: The right abstraction level is not "most powerful" or "least operational" — it is "best fit for the traffic shape and team size." Fargate sits exactly at that intersection for a small service count with streaming requirements.
Why Multi-Stage Builds Matter More Than People Think
Most candidates mention multi-stage builds as "it makes images smaller." That is only the surface. Here is what actually happens in production:
# What a SINGLE-STAGE image looks like internally
├── gcc 12.2 (~150 MB) ← Never used at runtime
├── python3-dev (~45 MB) ← Only needed for C extensions during pip install
├── pip cache (~200 MB) ← Leftover .whl files
├── test frameworks (~30 MB) ← pytest, coverage, moto
├── .git directory (~80 MB) ← Accidentally copied
├── node_modules/ (~180 MB) ← Build tool artifacts
├── YOUR ACTUAL APP (~25 MB) ← What you actually need
Total: ~710 MB. Only ~25 MB is useful at runtime.
With multi-stage:
# Stage 1: Build
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt
COPY . .
RUN python setup.py bdist_wheel   # wheel lands in /app/dist
# Stage 2: Runtime
FROM python:3.11-slim AS runtime
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY --from=builder /app/dist /app
ENV PATH="/root/.local/bin:${PATH}"
EXPOSE 8080
CMD ["python", "-m", "app.main"]
Production impact chain:
1. Image size: 710 MB → 85 MB
2. ECR pull time during scale-out: 18s → 3s (directly affects how fast new Fargate tasks join)
3. Attack surface: gcc, pip, .git are gone — CVE scan noise drops 60%
4. Rollback speed: smaller image = faster pull = faster rollback = shorter outage window
5. ECR storage cost: at 50 deploys/month, the savings are real
Interview Questions And Answers
Q1: "Why did you pick ECS Fargate over EKS for the orchestrator?"
Strong answer: "Three reasons. First, traffic shape — we had 3-5 services, not 50. EKS control-plane cost and operational overhead were not justified at that service count. Second, streaming requirements — our chat service used WebSockets with sticky sessions. Fargate tasks behind ALB target groups handled this cleanly. Third, cache locality — each Fargate task maintained an L1 in-memory cache for ultra-hot reads under 1ms. Lambda could not offer that because it has no persistent memory across invocations. We kept Lambda only for burst overflow where cache misses were acceptable."
What makes this strong: Links compute choice to traffic reality (service count, streaming, cache), not just feature comparison.
Q2: "What happens when Fargate can't scale fast enough?"
Strong answer: "We designed for this explicitly. Fargate scaling has a 30-60 second lag for new task provisioning. For sudden spikes, Lambda acts as an overflow valve with near-instant scaling to thousands of concurrent executions. The routing layer detects queue depth and latency thresholds — when Fargate tasks hit 80% CPU or response times exceed our SLA, new requests spill to Lambda. Lambda handles these without cache locality, so responses may be slightly slower, but the system stays available. Once Fargate catches up, traffic shifts back."
Q3: "Explain multi-stage builds. Why not just use a .dockerignore?"
Strong answer:
".dockerignore prevents files from being COPIED into the build context, but it cannot remove packages INSTALLED during the build. If I install gcc, python3-dev, and build dependencies to compile C extensions during pip install, those packages remain in the final image even with a perfect .dockerignore. Multi-stage builds solve this by using one stage for building (with all the heavy tooling) and copying ONLY the compiled output into a clean runtime stage. The build-stage layers never ship to production."
Follow-up trap: "Can you have more than two stages?" "Yes. In practice I've used three — a dependency stage that installs and caches pip packages, a test stage that runs the test suite (if tests fail, the build stops here), and a runtime stage that copies only the final application. The test stage acts as a quality gate inside the Dockerfile itself."
Q4: "How did you handle the L1/L2 cache interaction?"
Strong answer: "Read path: check L1 (in-memory dict or local LRU cache in the container) → on miss, check L2 (Redis) → on miss, query DynamoDB through DAX. Write path: write to DynamoDB, then invalidate or update Redis, then let the local L1 expire naturally via TTL. L1 TTL was short (10-30 seconds) because it is not shared across tasks — stale data risk increases with longer TTLs. Redis TTL was longer (5-10 minutes) because it is shared and we tolerated slightly stale data for non-critical reads. The key insight is that L1 absorbs 60-70% of reads at sub-millisecond latency, which dramatically reduces Redis traffic and DynamoDB read units."
Q5 (Basics): "What is a Docker container vs a Docker image?"
Answer: "An image is a read-only template — a snapshot of a filesystem plus metadata (entrypoint, env vars, exposed ports). A container is a running instance of that image with its own writable layer, process namespace, network namespace, and mount namespace. You can run multiple containers from the same image, and each gets its own isolated writable layer. Think of it like: image = class, container = object instance."
Q6 (Basics): "What is the difference between CMD and ENTRYPOINT?"
Answer:
"ENTRYPOINT defines the executable that always runs. CMD provides default arguments to that executable. If a user passes arguments to docker run, CMD is replaced but ENTRYPOINT stays. Use ENTRYPOINT when the container IS the command (like a web server), and CMD for default flags. Example: ENTRYPOINT ["python", "server.py"] with CMD ["--port", "8080"] means running docker run myapp --port 9090 overrides only the port."
Q7 (Follow-up): "What would you change if your service count grew to 30+?"
Answer: "At 30+ services, EKS becomes the right choice. The control-plane overhead is amortized across many services, and you gain service mesh (Istio/Linkerd), fine-grained RBAC, namespace isolation, Helm-based deployment patterns, and unified observability. The migration path is clean because our Fargate tasks already use containerized workloads — the Dockerfiles, health checks, and image pipelines transfer directly. The main work would be writing Kubernetes manifests and setting up the EKS cluster infrastructure."
Q8 (Follow-up): "How do you decide between Fargate spot and on-demand?"
Answer: "For our chatbot orchestrator, we used on-demand Fargate exclusively because the service is stateful (WebSocket connections, L1 cache). Spot interruptions would kill active user sessions. However, for batch processing tasks, CI runners, or async workers where interruption is tolerable, Fargate Spot saves 50-70%. The key decision factor is: can this workload handle a 2-minute interruption notice gracefully?"
Group Follow-Up Panel: Rapid-Fire Probing Questions
A panel of 5 interviewers fires follow-ups one after another. This simulates a real Amazon/FAANG loop where each person probes a different angle.
Interviewer 1 (Infra Architect): "You said L1 cache absorbs 60-70% of reads. How did you measure that? What happens when a Fargate task restarts — does L1 cold-start cause a Redis thundering herd?"
Strong answer: "We instrumented every cache layer with CloudWatch custom metrics — L1_HIT, L1_MISS, L2_HIT, L2_MISS, DB_READ. The 60-70% figure comes from a 7-day rolling average. On task restart, yes — L1 is empty so all reads hit Redis briefly. But we mitigated thundering herd two ways: (1) jittered TTLs on L1 — each key's TTL is base_ttl ± random(0, 5s), so keys don't all expire together; (2) single-flight pattern — if multiple goroutines/threads request the same key while L1 is cold, only ONE request goes to Redis and the others wait for that result."
# Single-flight pattern for cache reads
import asyncio
import random

class SingleFlightCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.l1 = {}
        self.in_flight = {}  # key -> asyncio.Future

    async def get(self, key):
        # L1 check
        if key in self.l1:
            return self.l1[key]
        # Single-flight: if someone else is already fetching, wait for them
        if key in self.in_flight:
            return await self.in_flight[key]
        # I'm the first — create a flight
        future = asyncio.get_running_loop().create_future()
        self.in_flight[key] = future
        try:
            value = await self.redis.get(key)
            if value is None:
                value = await self.fetch_from_dynamodb(key)
                await self.redis.setex(key, 300, value)  # 5 min TTL
            # Populate L1 with jittered TTL so keys don't all expire together
            ttl = 15 + random.randint(0, 10)  # 15-25 seconds
            self.l1[key] = value
            asyncio.get_running_loop().call_later(ttl, self.l1.pop, key, None)
            future.set_result(value)
            return value
        except Exception as exc:
            future.set_exception(exc)  # wake waiters instead of hanging them
            raise
        finally:
            del self.in_flight[key]

    async def fetch_from_dynamodb(self, key):
        ...  # L3 lookup through DAX/DynamoDB, elided here
Interviewer 2 (Security): "Your multi-stage Dockerfile copies from the builder. How do you prevent secrets like API keys or internal registry tokens from leaking into the final stage?"
Strong answer: "Three rules. (1) Never put secrets in ENV or ARG — they persist in image layers. Use Docker BuildKit --mount=type=secret which mounts a secret file during a RUN step without it ever becoming a layer. (2) The COPY from builder is selective — we copy only /root/.local (installed packages) and /app/dist (compiled app), not the entire filesystem. (3) We run docker history and dive (a layer inspector tool) in CI to detect accidental secret leakage."
# WRONG — secret leaks via build metadata
ARG NPM_TOKEN
RUN echo "//registry.npmjs.org/:_authToken=${NPM_TOKEN}" > .npmrc && \
    npm install && \
    rm .npmrc  # ← rm does not help: the ARG value shows in 'docker history',
               #   and splitting these across RUN steps would leave .npmrc in a layer
# RIGHT — BuildKit secret mount, never persists
# syntax=docker/dockerfile:1
RUN --mount=type=secret,id=npm_token \
NPM_TOKEN=$(cat /run/secrets/npm_token) \
echo "//registry.npmjs.org/:_authToken=${NPM_TOKEN}" > .npmrc && \
npm install && \
rm .npmrc
# Secret is mounted during the RUN step only — never part of any layer
Interviewer 3 (Cost): "You're running Fargate + Lambda + Redis + DAX. How do you know this hybrid isn't MORE expensive than just running everything on EC2?"
Strong answer: "We tracked total cost of ownership, not just compute cost. EC2 looks cheaper per-hour, but add: (1) AMI patching and rotation — 4 hours/month of engineer time at $150/hr = $600; (2) capacity planning and right-sizing — 8 hours/quarter = $1,200/quarter; (3) OS-level security patches — 2 hours/month = $300; (4) on-call alerting for host failures — priceless stress. Fargate cost was ~20% more in raw compute but saved ~40 hours/quarter of engineering time. At our scale, the Fargate premium paid for itself within the first month."
Interviewer 4 (Reliability): "What's your rollback plan when a bad Fargate deployment goes out? Walk me through second by second."
Strong answer:
Second 0: New task definition deployed via CodeDeploy Blue/Green
Second 0-60: Canary phase — 5% traffic routed to new tasks
Second 45: CloudWatch alarm triggers: P99 latency > 500ms on new tasks
Second 46: CodeDeploy automatic rollback initiated
Second 46-90: Old task set receives 100% traffic (blue/green makes this instant)
Second 90: New tasks drained and stopped
Second 91: Rollback complete. Alert sent to Slack with deployment ID and alarm details.
Total user impact: ~45 seconds of 5% traffic seeing degraded latency.
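A hedged sketch of how that alarm-driven rollback could be wired with boto3. The alarm name, target group dimension, application, and deployment group names are placeholders, not the actual stack:

# Illustrative wiring: alarm on new-task-set latency, auto-rollback on alarm.
import boto3

cloudwatch = boto3.client("cloudwatch")
codedeploy = boto3.client("codedeploy")

# Alarm: P99 latency above 500 ms on the green target group
cloudwatch.put_metric_alarm(
    AlarmName="orchestrator-p99-latency",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "TargetGroup", "Value": "targetgroup/green/abc123"}],
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0.5,                      # seconds
    ComparisonOperator="GreaterThanThreshold",
)

# Deployment group rolls back automatically when the alarm fires
codedeploy.update_deployment_group(
    applicationName="chatbot-orchestrator",
    currentDeploymentGroupName="production",
    alarmConfiguration={
        "enabled": True,
        "alarms": [{"name": "orchestrator-p99-latency"}],
    },
    autoRollbackConfiguration={
        "enabled": True,
        "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
    },
)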
Interviewer 5 (Deep Dive): "You mentioned WebSocket stickiness via ALB target groups. What happens when a target Fargate task becomes unhealthy mid-conversation? Does the user's WebSocket drop?"
Strong answer: "Yes, the WebSocket drops. The client has reconnection logic with exponential backoff (100ms, 200ms, 400ms, max 5s). On reconnect, the client sends a resume message with the conversation ID. The new Fargate task pulls conversation state from DynamoDB (not L1 — L1 is task-local), resumes the session, and sends a reconnected event to the UI. The user sees a brief 'Reconnecting...' spinner for 1-2 seconds. The key design is that conversation state is ALWAYS persisted to DynamoDB — L1 and Redis are read caches only, never the source of truth."
Code Examples: Complete Production Dockerfile
# ============================================
# PRODUCTION MULTI-STAGE DOCKERFILE
# Orchestrator Service for Chat Platform
# ============================================
# ---------- Stage 1: Dependency Installation ----------
FROM python:3.11.7-slim-bookworm AS deps
WORKDIR /app
# Install system dependencies needed for C extensions
RUN apt-get update && \
apt-get install -y --no-install-recommends gcc libpq-dev && \
rm -rf /var/lib/apt/lists/*
# Copy ONLY dependency files first — layer caching optimization
# If requirements.txt hasn't changed, this layer is cached
COPY requirements.txt requirements-lock.txt ./
# Install Python dependencies into a user directory (easy to copy later)
RUN pip install --no-cache-dir --user \
-r requirements-lock.txt \
--require-hashes
# ---------- Stage 2: Test Runner (CI quality gate) ----------
FROM deps AS test
COPY requirements-test.txt ./
RUN pip install --no-cache-dir --user -r requirements-test.txt
COPY . .
RUN python -m pytest tests/unit \
--tb=short \
--no-header \
-q
# ---------- Stage 3: Production Runtime ----------
FROM python:3.11.7-slim-bookworm AS runtime
# Security: run as non-root. The home dir must match the .local copy below
# so Python's user site-packages resolve correctly.
RUN groupadd -r appuser && \
    useradd -r -g appuser -d /home/appuser -s /sbin/nologin appuser
WORKDIR /app
# Copy ONLY the installed packages from deps stage (not gcc, not pip cache)
COPY --from=deps /root/.local /home/appuser/.local
# Copy application code
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser config/ ./config/
# Set PATH for user-installed packages
ENV PATH="/home/appuser/.local/bin:${PATH}" \
PYTHONPATH="/app/src" \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1
USER appuser
# Health check — ALB and ECS use this
HEALTHCHECK --interval=15s --timeout=5s --start-period=30s --retries=3 \
CMD ["python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"]
EXPOSE 8080
ENTRYPOINT ["python", "-m", "uvicorn"]
CMD ["src.main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
# ============================================
# ECS Fargate Task Definition (CloudFormation snippet)
# ============================================
OrchestratorTaskDefinition:
Type: AWS::ECS::TaskDefinition
Properties:
Family: chatbot-orchestrator
Cpu: "1024" # 1 vCPU
Memory: "2048" # 2 GB
NetworkMode: awsvpc
RequiresCompatibilities: [FARGATE]
ExecutionRoleArn: !GetAtt ExecutionRole.Arn
TaskRoleArn: !GetAtt TaskRole.Arn
ContainerDefinitions:
- Name: orchestrator
Image: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/chatbot-orchestrator:${ImageTag}"
PortMappings:
- ContainerPort: 8080
Protocol: tcp
LogConfiguration:
LogDriver: awslogs
Options:
awslogs-group: /ecs/chatbot-orchestrator
awslogs-region: !Ref AWS::Region
awslogs-stream-prefix: ecs
HealthCheck:
Command: ["CMD-SHELL", "python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8080/health')\" || exit 1"]
Interval: 15
Timeout: 5
Retries: 3
StartPeriod: 30
Environment:
- Name: REDIS_ENDPOINT
Value: !GetAtt RedisCluster.PrimaryEndPoint.Address
- Name: DYNAMODB_TABLE
Value: !Ref ConversationTable
- Name: LOG_LEVEL
Value: INFO
Critical Points To Remember — Docker-LLD-1
╔══════════════════════════════════════════════════════════════════════╗
║ CRITICAL POINTS — LLD-1 ║
╠══════════════════════════════════════════════════════════════════════╣
║ ║
║ 1. COMPUTE CHOICE = f(service_count, traffic_shape, team_size) ║
║ • <10 services → Fargate wins over EKS ║
║ • Streaming/WebSocket → Fargate wins over Lambda ║
║ • Cache locality needed → Fargate wins over Lambda ║
║ ║
║ 2. MULTI-STAGE IS NOT JUST ABOUT SIZE ║
║ • Size → faster pull → faster scale-out → faster rollback ║
║ • Attack surface → fewer packages → fewer CVEs → cleaner scans ║
║ • Build reproducibility → deterministic runtime image ║
║ ║
║ 3. CACHE HIERARCHY IS A LATENCY PYRAMID ║
║ • L1 (container memory): <1ms, 60-70% hit rate, per-task ║
║ • L2 (Redis): 1-5ms, shared across tasks, 5-10 min TTL ║
║ • L3 (DAX/DynamoDB): 5-10ms, persistent, source of truth ║
║ • NEVER let L1 be the source of truth — it dies with the task ║
║ ║
║ 4. HYBRID COMPUTE = BASELINE + BURST ║
║ • Fargate for baseline: predictable, warm, cached ║
║ • Lambda for burst: instant scale, no cache, acceptable delay ║
║ • NEVER run stateful workloads on Lambda ║
║ ║
║ 5. DOCKERFILE LAYER ORDER MATTERS ║
║ • Least-changing layers FIRST (OS packages, dependencies) ║
║ • Most-changing layers LAST (application code) ║
║ • One changed layer invalidates ALL layers below it ║
║ ║
║ 6. SECRETS NEVER IN IMAGES ║
║ • Not in ENV (persists in layers) ║
║ • Not in ARG (visible in docker history) ║
║ • Use BuildKit --mount=type=secret or runtime injection ║
║ ║
╚══════════════════════════════════════════════════════════════════════╝
Docker-LLD-2: SageMaker Inference Containers And Cold-Start Removal
Objective
Stop live traffic from landing on model containers that are technically running but not actually ready to serve within SLA.
Startup Flow
sequenceDiagram
participant SM as SageMaker
participant C as Inference Container
participant W as Warmup Script
participant H as /ping Health Check
participant LB as Endpoint Load Balancer
SM->>C: Start container
C->>W: Run startup warmup
W->>C: Load weights, trigger JIT, cache CUDA kernels
SM->>H: Probe /ping
H-->>SM: 503 warming_up
W->>C: Warmup complete
SM->>H: Probe /ping
H-->>SM: 200 healthy
SM->>LB: Add instance to live traffic pool
Decision Evolution
| Step | New decision | Why the previous decision was not enough | Improvement introduced |
|---|---|---|---|
| D0 | Use standard SageMaker container startup behavior | Process-alive health was too weak. The endpoint could route traffic before model load, JIT, and CUDA-path warmup finished. | Made the real problem visible: startup readiness and traffic readiness are different states. |
| D1 | Add synthetic warmup requests during startup | Merely loading the process still left the first real request 3-5x slower. | Warmed weights, kernel caches, and common prompt shapes before customer traffic arrived. |
| D2 | Keep /ping unhealthy until warmup completes | Warmup by itself still fails if the load balancer can see the instance too early. | Turned health checks into a true readiness gate. |
| D3 | Pin minimum instance count to 2 | Even a correct readiness gate does not help if every new scale-out request still starts from cold. | Removed almost all cold-path exposure by keeping a hot pool available. |
| D4 | Optimize artifact format and load path | Warm containers solved the common case but not the residual scale-out edge case. | Safetensors, smaller artifacts, and faster storage cut the remaining startup penalty. |
Runtime Design
- Container boots and the startup hook invokes warmup.
- Warmup sends representative short, medium, and long prompts through the real inference path.
- /ping returns 503 until warmup completes.
- SageMaker keeps the instance out of service until readiness flips to healthy.
- Autoscaling never goes below two hot instances.
- Artifact-level optimizations reduce the rare scale-out path that still requires fresh initialization.
Readiness Contract
| Concern | Decision | Why it improved the prior option |
|---|---|---|
| Warmup prompt coverage | Use multiple synthetic prompt shapes | Better than a single dummy prompt because it primes more realistic kernel and cache paths. |
| Health semantics | Return 503 while warming | Better than process-only liveness because traffic does not see a half-ready instance. |
| Capacity floor | MinCapacity = 2 | Better than scale-to-zero because one healthy instance can absorb traffic while another is replaced or warming. |
| Artifact loading | Quantization, ONNX or TorchScript where applicable, safetensors, faster storage | Better than raw PyTorch .bin artifacts because load and deserialization time drop materially. |
| Cooldowns | Fast scale-out, slow scale-in | Better than symmetric cooldowns because the endpoint protects itself from rapid shrink-expand churn. |
Failure Modes And Controls
| Failure mode | User-facing risk | Control |
|---|---|---|
| Container marked healthy too early | First request waits on model startup | /ping readiness gate blocks traffic until warmup is done. |
| Endpoint scales to zero or one | First customer after idle period sees cold latency | Minimum hot capacity of two instances. |
| Large model artifact | New instance misses SLA during scale-out | Artifact shrinking and faster storage reduce unavoidable load time. |
| Startup replacement event | Brief capacity dip | One already-hot instance remains available while the second instance recovers. |
Improvement Evidence
| Metric | Before | After |
|---|---|---|
| Full cold start | ~118 seconds | Residual edge case ~22 seconds |
| Cold-path exposure | Frequent on new instances | Removed for ~99.7% of requests |
| SLA violations on first-hit path | 100% on cold new instances | ~0.1% of requests |
Design Lesson
The winning change was not "autoscaling." The winning change was splitting container lifecycle into three separate states: process started, model warmed, and traffic ready.
Deep Dive: Group Discussion — What Does "Ready" Really Mean For An ML Container?
Engineer A (Backend): "The container is running. Docker says it's healthy. Why are users still seeing 3-second response times?"
Engineer B (ML Infra): "Because 'process running' and 'model ready' are completely different things. When the container starts, the Python process is alive, the HTTP server is listening, and Docker's HEALTHCHECK passes. But the model weights haven't been loaded into GPU memory yet. The CUDA context hasn't been initialized. The JIT compiler hasn't compiled any kernels. The first inference request triggers ALL of this — that's your 3-second delay."
Engineer C (SRE): "This is exactly like a Java app with a cold JVM. The process is alive, but the JIT hasn't compiled hot paths. In the JVM world we solved this with class data sharing and warmup requests. Same principle here."
Engineer D (Platform): "The real danger is the load balancer. SageMaker's endpoint sees the container responding to health checks and starts routing traffic. The container is technically healthy but functionally cold. This is the 'liveness vs readiness' distinction that Kubernetes solved years ago with separate probes."
Engineer E (Senior ML): "Here's what people miss — even loading the model weights isn't enough. The first forward pass through the model triggers CUDA kernel compilation for the specific input shapes. If your first real request has a prompt of 500 tokens but you only tested with an empty request, you'll still be slow. The warmup must exercise realistic input shapes."
The Three States Nobody Teaches You
Container Lifecycle for ML Inference:
State 1: PROCESS ALIVE (seconds 0-5)
├── Python process started
├── HTTP server listening on port 8080
├── Docker HEALTHCHECK: PASS ✓
├── Can serve inference: NO ✗
└── What's missing: model weights, CUDA context, JIT kernels
State 2: MODEL LOADED (seconds 5-60)
├── Weights loaded into GPU VRAM
├── CUDA context initialized
├── JIT kernels NOT yet compiled
├── First real request: SLOW (3-5x normal latency)
└── What's missing: warmed kernel cache, primed memory allocator
State 3: TRAFFIC READY (seconds 60-120)
├── Synthetic warmup requests completed
├── Short/medium/long prompt shapes all exercised
├── CUDA kernel cache hot
├── Memory allocator has seen realistic allocation patterns
├── /ping returns 200
└── Load balancer routes traffic: NOW SAFE
Why MinCapacity = 2 Is Not Wasteful
Common pushback: "You're paying for an idle instance."
The math tells a different story:
Scenario: MinCapacity = 1
- Instance fails or is replaced by SageMaker
- Time to new instance ready: ~120 seconds
- During those 120 seconds: 100% of traffic hits cold path
- If traffic = 50 requests/sec → 6,000 requests see degraded latency
- Each degraded request adds ~3 seconds → user-visible SLA violation
Scenario: MinCapacity = 2
- Instance fails or is replaced by SageMaker
- Surviving instance handles ALL traffic immediately
- New instance warms up in background
- Zero requests see cold-path latency
- Extra cost: ~$2-4/hour for one additional GPU instance
- SLA violations prevented: priceless (or rather, easily worth $2-4/hour)
The breakeven: if avoiding cold-start SLA violations saves even one customer escalation per month, MinCapacity = 2 pays for itself many times over.
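A sketch of pinning that capacity floor with Application Auto Scaling; the endpoint and variant names are placeholders:

# Illustrative capacity-floor configuration for a SageMaker endpoint variant.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/manga-inference/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,          # the hot-pool floor: never one, never zero
    MaxCapacity=8,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,     # scale out fast
        "ScaleInCooldown": 600,     # scale in slowly to avoid churn
    },
)

The asymmetric cooldowns mirror the "fast scale-out, slow scale-in" control from the readiness-contract table above.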
Interview Questions And Answers
Q1: "How did you eliminate cold starts in your inference containers?"
Strong answer:
"We attacked it at four layers. First, warmup scripts — during container startup, we send synthetic inference requests covering short (10 tokens), medium (200 tokens), and long (1000 tokens) prompt shapes. This primes CUDA kernels, activates the JIT compiler, and pre-allocates GPU memory for realistic workloads. Second, readiness gating — /ping returns 503 until warmup completes, so SageMaker's load balancer never routes traffic to an unready instance. Third, minimum capacity — we pinned MinCapacity to 2, ensuring at least one hot instance is always available even during replacements. Fourth, artifact optimization — we switched from PyTorch .bin files to safetensors format, which reduced model loading from ~45 seconds to ~12 seconds."
Q2: "What is the difference between liveness and readiness probes? Why does it matter for ML?"
Strong answer:
"A liveness probe answers: 'Is the process alive and not deadlocked?' If it fails, the orchestrator restarts the container. A readiness probe answers: 'Can this instance serve real traffic right now?' If it fails, the load balancer removes it from the pool but does NOT restart it. For ML containers, this distinction is critical. The process can be alive for 90 seconds while loading a 7B parameter model into GPU memory. During that window, liveness should pass (don't restart!), but readiness must fail (don't send traffic!). SageMaker's /ping acts as both — so we had to implement the readiness logic ourselves by returning 503 during warmup and 200 only after warmup completes."
Q3: "Why safetensors over PyTorch .bin?"
Strong answer:
"Three reasons. Speed — safetensors uses memory-mapped file I/O, meaning the OS can load tensors directly into memory without deserializing through Python's pickle. This is 2-5x faster for large models. Safety — PyTorch's .bin format uses pickle, which can execute arbitrary Python code during deserialization. This is a supply-chain attack vector. Safetensors is a pure data format with no code execution. Efficiency — safetensors supports lazy loading, so you can load specific tensors without reading the entire file. For our 7B model, loading dropped from ~45 seconds to ~12 seconds."
Q4 (Basics): "What is a Docker health check?"
Answer:
"A HEALTHCHECK instruction in the Dockerfile tells Docker how to test if the container is working. Docker runs the specified command at intervals and tracks the result. If it fails consecutively, the container is marked 'unhealthy.' Example: HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD curl -f http://localhost:8080/health || exit 1. Orchestrators like ECS and Kubernetes use health status to decide whether to route traffic or restart the container."
Q5 (Basics): "What happens when a Docker container runs out of memory?"
Answer:
"The Linux kernel's OOM killer terminates the process that is using the most memory inside the container's cgroup. Docker detects this and marks the container with exit code 137 (128 + signal 9 SIGKILL). With --memory flag, you set the hard limit. With --memory-reservation, you set a soft limit. For GPU containers, this is more nuanced — GPU OOM is a CUDA error, not a system OOM, so Docker doesn't see it. The application must catch torch.cuda.OutOfMemoryError and handle it gracefully."
Q6 (Follow-up): "How would you warm up a container that serves multiple models?"
Answer:
"You need a warmup strategy per model. I'd maintain a warmup manifest — a config file listing each model with its representative input shapes. During startup, the warmup script iterates through the manifest, loads each model, and sends synthetic requests through each. The /ping endpoint tracks warmup completion per model and only returns 200 when ALL models are ready. For multi-LoRA setups where adapters share a base model, warming the base model plus the most frequently used adapter covers 80%+ of traffic patterns."
Q7 (Follow-up): "What if warmup takes too long and SageMaker times out?"
Answer:
"SageMaker has a default container start timeout of 5 minutes (configurable up to 60 minutes via ModelDataDownloadTimeoutInSeconds and ContainerStartupHealthCheckTimeoutInSeconds). If warmup exceeds this, SageMaker kills the container and marks the deployment as failed. Solutions: (1) Increase the timeout — for large models, 10-15 minutes is reasonable. (2) Optimize load time — safetensors, quantized weights, faster storage (FSx for Lustre instead of S3). (3) Parallelize warmup — warm multiple models concurrently if GPU memory allows. (4) Tiered readiness — serve traffic for already-warm models while others are still loading."
Q8 (Follow-up): "How do you test your warmup logic in CI?"
Answer:
"We can't warm a real GPU model in CI (no GPU available). Instead, we test the warmup orchestration logic with a mock model that simulates the three lifecycle states. The test verifies: (1) /ping returns 503 before warmup completes. (2) Warmup sends requests matching all shapes in the manifest. (3) /ping returns 200 only after all shapes pass. (4) A failed warmup correctly blocks readiness. For actual GPU warmup validation, we run a staging integration test against a real SageMaker endpoint after the image is built."
Group Follow-Up Panel: Rapid-Fire Probing Questions
Interviewer 1 (Principal Engineer): "You said warmup sends short, medium, and long prompts. How did you choose those specific shapes? What if production traffic has a shape you didn't warm?"
Strong answer: "We sampled 30 days of production traffic logs and bucketed prompts by token count. Three distinct clusters emerged: short (5-50 tokens, 40% of traffic), medium (100-300 tokens, 45%), and long (500-1500 tokens, 15%). Our warmup sends one prompt from each cluster. For shapes we didn't warm — CUDA kernels are compiled for specific tensor dimensions, but shapes within a similar range reuse compiled kernels. A 200-token warmup prompt primes kernels that work for 150-300 tokens. The 'gap' is only for dramatically different shapes, which we address with a catch-all 'max length' warmup pass."
Interviewer 2 (SRE): "Your /ping returns 503 during warmup. What if warmup itself gets stuck — model file corrupted, CUDA driver mismatch, infinite loop in warmup script? The container never becomes ready but never crashes either."
Strong answer: "We add a warmup timeout. If warmup doesn't complete within a configured deadline (e.g., 180 seconds), the startup script exits with code 1, which triggers SageMaker to kill and replace the container. We also emit a metric warmup_timeout so we can alarm on repeated failures — if 3 containers in a row fail warmup, that signals a systemic issue (bad model artifact, driver mismatch) rather than a transient failure."
# Warmup with timeout and failure handling
import asyncio
import sys
import logging
from aiohttp import web

logger = logging.getLogger("warmup")
WARMUP_TIMEOUT_SECONDS = 180
WARMUP_SHAPES = [
{"prompt": "Hello", "max_tokens": 10, "label": "short"},
{"prompt": "Explain the plot of this manga chapter in detail, covering all character interactions and themes discussed",
"max_tokens": 200, "label": "medium"},
{"prompt": "You are a manga expert assistant..." + " context " * 200,
"max_tokens": 512, "label": "long"},
]
class WarmupState:
ready = False
started_at = None
completed_shapes = set()
warmup_state = WarmupState()
async def run_warmup(model):
"""Execute warmup with timeout protection."""
warmup_state.started_at = asyncio.get_event_loop().time()
try:
        async with asyncio.timeout(WARMUP_TIMEOUT_SECONDS):  # requires Python 3.11+
for shape in WARMUP_SHAPES:
logger.info(f"Warming shape: {shape['label']}")
try:
await model.generate(
prompt=shape["prompt"],
max_tokens=shape["max_tokens"]
)
warmup_state.completed_shapes.add(shape["label"])
logger.info(f"Shape {shape['label']} warmed successfully")
except Exception as e:
logger.error(f"Warmup failed for {shape['label']}: {e}")
# Fail the entire warmup if any shape fails
sys.exit(1)
warmup_state.ready = True
elapsed = asyncio.get_event_loop().time() - warmup_state.started_at
logger.info(f"All shapes warmed in {elapsed:.1f}s")
except asyncio.TimeoutError:
logger.critical(
f"Warmup timed out after {WARMUP_TIMEOUT_SECONDS}s. "
f"Completed: {warmup_state.completed_shapes}. Exiting."
)
sys.exit(1) # SageMaker will replace this container
# Health check endpoint
async def ping_handler(request):
"""SageMaker probes this. 503 = not ready. 200 = serve traffic."""
if warmup_state.ready:
return web.json_response({"status": "healthy"}, status=200)
else:
return web.json_response(
{"status": "warming_up",
"completed": list(warmup_state.completed_shapes)},
status=503
)
Interviewer 3 (Cost): "MinCapacity=2 on GPU instances is expensive. Have you explored SageMaker serverless inference or async inference to avoid this fixed cost?"
Strong answer: "Yes. SageMaker Serverless scales to zero but has cold starts of 30-60 seconds — unacceptable for real-time chat. Async inference is for batch-style workloads where the caller polls for results — wrong pattern for streaming chat. The real cost comparison isn't 'MinCapacity=2 vs zero' — it's 'MinCapacity=2 vs the cost of SLA violations + customer churn.' Two p3.2xlarge instances cost ~$150/day. One major customer escalation from repeated cold-start timeouts costs far more in relationship damage and engineering fire-drill time."
Interviewer 4 (ML Platform): "When you do a model update — new weights — how do you deploy without cold-starting all instances simultaneously?"
Strong answer: "Rolling deployment. We deploy to a new endpoint configuration while keeping the old one live. SageMaker provisions new instances with the new model, warms them up, and only shifts traffic after they pass the readiness gate. The old instances continue serving until the new fleet is fully healthy. This is a blue/green deployment at the SageMaker level. Zero downtime, zero cold-start exposure to users."
# SageMaker blue/green model deployment
import boto3
sm = boto3.client("sagemaker")
def deploy_new_model(endpoint_name, new_model_name, new_variant_weight=0.05):
"""
Step 1: Add new model variant at 5% weight (canary)
Step 2: Monitor for 15 minutes
Step 3: Shift to 100% or rollback
"""
# Create new model
sm.create_model(
ModelName=new_model_name,
PrimaryContainer={
"Image": "123456789.dkr.ecr.us-east-1.amazonaws.com/inference:v2.1",
"ModelDataUrl": "s3://models/v2.1/model.tar.gz",
},
        ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole"
)
# Create new endpoint config with canary traffic split
sm.create_endpoint_config(
EndpointConfigName=f"{endpoint_name}-canary",
ProductionVariants=[
{
"VariantName": "current",
"ModelName": "model-v2.0",
"InstanceType": "ml.g5.2xlarge",
"InitialInstanceCount": 2,
"InitialVariantWeight": 1.0 - new_variant_weight,
},
{
"VariantName": "canary",
"ModelName": new_model_name,
"InstanceType": "ml.g5.2xlarge",
"InitialInstanceCount": 2, # Both instances warm before traffic
"InitialVariantWeight": new_variant_weight,
}
]
)
# Update endpoint — SageMaker handles rolling replacement
sm.update_endpoint(
EndpointName=endpoint_name,
EndpointConfigName=f"{endpoint_name}-canary",
)
# New instances boot → warmup runs → /ping returns 200 → traffic shifts
Interviewer 5 (Edge Cases): "What if your safetensors model file is corrupted in S3? The container starts, tries to load, and gets a corrupted model. What happens?"
Strong answer: "Safetensors has built-in integrity checking — it validates tensor metadata on load. A corrupted file throws a SafetensorError during model loading (before warmup even starts). Our startup script catches this, logs the error with the S3 artifact path, emits a model_load_failure metric, and exits with code 1. SageMaker replaces the container. If the SAME artifact fails 3 times, our alarm fires and the on-call engineer investigates — likely a corrupt S3 object, which we fix by re-uploading from the build artifact. We also store model checksums in DynamoDB and verify the S3 object checksum before container startup in the latest iteration."
Code Examples: SageMaker Container Entrypoint
#!/usr/bin/env python3
"""
SageMaker inference container entrypoint.
Handles model loading, warmup, health checks, and inference serving.
"""
import time
import logging
from pathlib import Path
from aiohttp import web
from safetensors.torch import load_file
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("inference-container")
# ============================================
# MODEL LOADING WITH INTEGRITY CHECK
# ============================================
class ModelServer:
def __init__(self):
self.model = None
self.tokenizer = None
self.ready = False
self.load_time = None
self.warmup_time = None
def load_model(self, model_dir="/opt/ml/model"):
"""Load model with safetensors integrity check."""
start = time.monotonic()
model_path = Path(model_dir)
safetensor_files = list(model_path.glob("*.safetensors"))
if not safetensor_files:
raise FileNotFoundError(f"No .safetensors files in {model_dir}")
logger.info(f"Loading {len(safetensor_files)} safetensor shards...")
# safetensors validates integrity on load — corrupted files throw here
try:
state_dict = {}
for f in safetensor_files:
state_dict.update(load_file(str(f), device="cuda"))
            self.model = self._build_model(state_dict)  # model-specific assembly, elided here
self.load_time = time.monotonic() - start
logger.info(f"Model loaded in {self.load_time:.1f}s")
except Exception as e:
logger.critical(f"Model load failed: {e}")
raise
    async def generate(self, prompt, max_tokens):
        ...  # model-specific inference path, elided in this sketch

    async def warmup(self):
        """Warmup with representative prompt shapes."""
start = time.monotonic()
shapes = [
("short", "Hello", 10),
("medium", "Explain this chapter thoroughly " * 10, 200),
("long", "System: You are an expert... " * 50, 512),
]
for label, prompt, max_tokens in shapes:
logger.info(f"Warmup: {label} ({len(prompt)} chars, {max_tokens} max)")
await self.generate(prompt, max_tokens)
logger.info(f"Warmup: {label} complete")
self.warmup_time = time.monotonic() - start
self.ready = True
logger.info(f"Warmup done in {self.warmup_time:.1f}s. READY FOR TRAFFIC.")
# ============================================
# HEALTH CHECK — THE READINESS GATE
# ============================================
async def health_check(request):
server = request.app["model_server"]
if server.ready:
return web.json_response({
"status": "healthy",
"load_time_s": server.load_time,
"warmup_time_s": server.warmup_time,
}, status=200)
return web.json_response({"status": "warming_up"}, status=503)
async def invocations(request):
"""SageMaker sends inference requests to /invocations."""
server = request.app["model_server"]
if not server.ready:
return web.json_response({"error": "Model not ready"}, status=503)
body = await request.json()
result = await server.generate(body["prompt"], body.get("max_tokens", 256))
return web.json_response({"generated_text": result})
# ============================================
# STARTUP SEQUENCE
# ============================================
async def startup(app):
server = ModelServer()
app["model_server"] = server
server.load_model() # Blocks until weights are in GPU
await server.warmup() # Blocks until all shapes are warmed
app = web.Application()
app.on_startup.append(startup)
app.router.add_get("/ping", health_check)
app.router.add_post("/invocations", invocations)
if __name__ == "__main__":
web.run_app(app, host="0.0.0.0", port=8080)
Critical Points To Remember — Docker-LLD-2
╔══════════════════════════════════════════════════════════════════════╗
║ CRITICAL POINTS — LLD-2 ║
╠══════════════════════════════════════════════════════════════════════╣
║ ║
║ 1. THREE CONTAINER STATES (memorize these) ║
║ • PROCESS ALIVE ≠ MODEL LOADED ≠ TRAFFIC READY ║
║ • Docker sees alive. LB needs ready. Users need warm. ║
║ • If you conflate these, users hit cold containers. ║
║ ║
║ 2. READINESS GATE = /ping RETURNS 503 UNTIL WARM ║
║ • The health check IS the admission control ║
║ • 503 during warmup is CORRECT behavior, not an error ║
║ • Never return 200 just because the process started ║
║ ║
║ 3. MinCapacity = 2 IS INSURANCE, NOT WASTE ║
║ • One instance fails → other absorbs 100% immediately ║
║ • Cost of extra instance << cost of cold-start SLA violations ║
║ • The math always favors MinCapacity ≥ 2 for production ║
║ ║
║ 4. WARMUP MUST MATCH PRODUCTION SHAPES ║
║ • Short + medium + long prompt shapes ║
║ • CUDA kernel compilation is shape-dependent ║
║ • Warming with "hello" does NOT warm 500-token prompts ║
║ ║
║ 5. SAFETENSORS > PYTORCH .BIN (always) ║
║ • 2-5x faster loading (memory-mapped I/O) ║
║ • No arbitrary code execution (pickle is dangerous) ║
║ • Built-in integrity validation ║
║ ║
║ 6. ADD WARMUP TIMEOUT — DEADLOCKED WARMUP IS SILENT DEATH ║
║ • Set deadline (e.g., 180 seconds) ║
║ • Exit with code 1 on timeout → container replaced ║
║ • Alarm on repeated timeouts → systemic issue ║
║ ║
║ 7. ROLLING DEPLOYMENT FOR MODEL UPDATES ║
║ • Never update all instances simultaneously ║
║ • Blue/green at SageMaker level ║
║ • New fleet warms → passes readiness → receives traffic ║
║ ║
╚══════════════════════════════════════════════════════════════════════╝
Docker-LLD-3: vLLM Serving Containers For Throughput And Cost
Objective
Replace a custom serving stack that wasted GPU memory and concurrency with a production-ready container that improves throughput without introducing specialized operational fragility.
Container Architecture
flowchart LR
Req["Chat Request"] --> API["OpenAI-Compatible API Layer"]
API --> Sched["Continuous Batching Scheduler"]
Sched --> Prefix["Prefix Cache"]
Prefix --> KV["PagedAttention KV Block Manager"]
KV --> Engine["vLLM Engine"]
Engine --> GPU["GPU"]
Engine --> LoRA["Optional Multi-LoRA Adapter"]
Decision Evolution
| Step | New decision | Why the previous decision was not enough | Improvement introduced |
|---|---|---|---|
| D0 | Start with HF Transformers plus custom serving | Static KV allocation wasted VRAM and concurrency collapsed under load. | Established the baseline problem in measurable terms: high latency and poor GPU utilization. |
| D1 | Move to vLLM with PagedAttention | The custom stack could not use GPU memory efficiently enough for multi-turn chat. | VRAM waste dropped sharply and concurrency per GPU increased dramatically. |
| D2 | Rely on vLLM continuous batching instead of fixed batching windows | Fixed batching improved throughput only by hurting latency with rigid wait windows. | GPU stayed busy while requests could enter the batch at iteration boundaries. |
| D3 | Enable automatic prefix caching | Even with better batching, multi-turn chat kept recomputing system and framing tokens. | Reused repeated prompt prefixes and reduced redundant compute. |
| D4 | Use Multi-LoRA and AWQ where appropriate | One base model per adapter or full-precision weights would still create cost and memory sprawl. | Multiple adapters shared one base container and quantization reduced memory footprint further. |
| D5 | Choose vLLM over TensorRT-LLM | TensorRT-LLM offered slightly higher raw speed but at much higher build and hardware complexity. | Best performance-to-operability point without locking the team into one specialized serving path. |
Serving Configuration
| Setting | Value | Why it matters |
|---|---|---|
| gpu_memory_utilization | 0.92 | Leaves a small headroom buffer for CUDA operations rather than consuming all VRAM. |
| max_num_seqs | 128 | Caps concurrency explicitly so the scheduler stays predictable. |
| max_model_len | 4096 | Aligns container memory policy with the context window budget. |
| block_size | 16 | Matches the PagedAttention block model for efficient KV allocation. |
| enable_prefix_caching | true | Makes repeated chat prefixes a cost-saving optimization instead of repeated waste. |
| quantization | awq | Shrinks the runtime memory footprint while preserving acceptable quality. |
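The same settings expressed through vLLM's offline Python entrypoint (the OpenAI-compatible server takes equivalent flags); the model path is a placeholder:

# vLLM engine configuration matching the table above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/chat-7b-awq",
    quantization="awq",
    gpu_memory_utilization=0.92,    # headroom for CUDA overhead
    max_num_seqs=128,               # explicit concurrency cap
    max_model_len=4096,             # context window budget
    block_size=16,                  # PagedAttention block size
    enable_prefix_caching=True,     # reuse repeated chat prefixes
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)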
Why This Improved The Previous Design
| Area | Previous design | New design | Practical improvement |
|---|---|---|---|
| KV cache | Static allocation | PagedAttention blocks | Memory waste dropped from roughly 72% to roughly 4%. |
| Scheduling | Fixed or manual batching | Continuous batching | Higher throughput without the same latency penalty. |
| Prefix reuse | None | Automatic prefix caching | Multi-turn chat stopped recomputing stable prompt prefixes. |
| Variant management | Separate model copies or custom wiring | Multi-LoRA on one base container | Fewer endpoints and less GPU fleet sprawl. |
| Ops burden | Custom serving code | Standard vLLM container | Easier deployment and simpler day-2 operations. |
Failure Modes And Controls
| Failure mode | User-facing risk | Control |
|---|---|---|
| Static KV fragmentation | Latency spikes and low concurrency | PagedAttention block allocation. |
| Rigid batching window | Higher tail latency during spikes | Iteration-level continuous batching. |
| Adapter sprawl | Too many model endpoints and idle GPU pools | Multi-LoRA consolidation. |
| Hardware lock-in | Expensive future migration | vLLM preserved broader hardware flexibility than TensorRT-LLM. |
Improvement Evidence
| Metric | Before | After |
|---|---|---|
| GPU fleet | 8 A100 | 4 A100 |
| Monthly GPU cost | $32,000 | $16,000 |
| P95 latency | 920 ms | 290 ms |
| Concurrency per GPU | 4-6 requests | 85-90 requests |
Design Lesson
The critical decision was not "pick the fastest engine." It was "pick the best runtime envelope for both performance and operability."
Deep Dive: Group Discussion — Why PagedAttention Changed Everything
Engineer A (ML Researcher): "I keep hearing PagedAttention mentioned. Can someone explain why it matters so much?"
Engineer B (GPU Systems): "Okay, think about how a normal Transformer serves requests. For every token generated, the model needs to read the Key and Value vectors from ALL previous tokens — that's the KV cache. In a naive implementation, when a request comes in for max 2048 tokens, the system pre-allocates a contiguous memory block of 2048 × num_layers × hidden_size × 2 (K and V) × bytes_per_element. For a 7B model, that's roughly 1 GB per concurrent request."
Engineer C (Backend): "So if I have 80 GB of VRAM and each request reserves 1 GB, I can only serve 80 concurrent requests?"
Engineer B: "Worse. Most requests don't actually USE 2048 tokens. A typical chat request might use 300 tokens, but the system reserved 2048 tokens worth of memory. That's 85% waste. Now your 80 GB GPU effectively serves only 6-12 concurrent requests because most of the VRAM is holding empty reserved space."
Engineer D (OS/Systems): "This is literally the virtual memory problem from operating systems. Physical RAM was wasted because processes allocated large contiguous blocks but only used a fraction. The solution was paging — break memory into small fixed-size blocks and allocate them on demand."
Engineer B: "Exactly. PagedAttention does the same thing for KV cache. Instead of one contiguous block per request, it breaks the KV cache into small pages (blocks of 16 tokens). Pages are allocated on demand as the sequence grows. Pages can be non-contiguous in physical GPU memory. When a request finishes, its pages are immediately freed."
Traditional KV Cache Allocation:
Request A: [████████████████████░░░░░░░░░░░░░░░░░░░░] ← 50% wasted
Request B: [██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░] ← 65% wasted
Request C: [████████████████████████████░░░░░░░░░░░░░] ← 30% wasted
^^^^^ USED ^^^^^ ^^^^^^^^ EMPTY/WASTED ^^^^^^^^
PagedAttention KV Cache:
Request A: [page][page][page][page][page] ← 5 pages, fits exactly
Request B: [page][page][page] ← 3 pages, fits exactly
Request C: [page][page][page][page][page][page][page] ← 7 pages, fits exactly
Free pool: [page][page][page][page][page]... ← available for new requests
^^^^^ Only allocates what's needed ^^^^^
Engineer A: "That's why memory waste went from 72% to 4%?"
Engineer B: "Yes. And the concurrency improvement is even more dramatic — from 4-6 requests per GPU to 85-90 requests per GPU. Not because the GPU is faster, but because you're not wasting 70% of its memory."
Why Continuous Batching Beats Static Batching
Static Batching (old way):
Time ──────────────────────────────────────────►
Batch 1: [Req A ████████] [Req B ██████████████] [Req C ████]
^--- All three start together, all must finish before next batch ---^
Req A finishes early but GPU slot is WASTED until Req B finishes
New requests WAIT until entire batch completes
Continuous Batching (vLLM):
Time ──────────────────────────────────────────►
Slot 1: [Req A ████████] [Req D ██████] [Req G ████████████]
Slot 2: [Req B ██████████████] [Req E ████████████]
Slot 3: [Req C ████] [Req F ██████████] [Req H ██████]
^--- Requests enter/exit at iteration boundaries ---^
When Req A finishes, Req D immediately starts in that slot
GPU never idles waiting for the longest request in a batch
Impact: Static batching forces GPU idle time proportional to the variance in request lengths. For chat workloads where some requests are 50 tokens and others are 500 tokens, static batching wastes 40-60% of GPU cycles. Continuous batching keeps utilization at 85-95%.
Why vLLM Over TensorRT-LLM?
| Factor | vLLM | TensorRT-LLM |
|---|---|---|
| Raw throughput | Very good | 10-20% higher in benchmarks |
| Build complexity | pip install vllm + config | Engine compilation per model, per GPU arch, per batch size |
| Hardware flexibility | NVIDIA, AMD (ROCm), CPU fallback | NVIDIA only |
| Model support | 100+ architectures on day one | Requires engine build per model |
| Update cycle | New model support in days | Weeks to months for engine update |
| Team expertise needed | Python + basic GPU knowledge | CUDA, TensorRT, engine optimization |
| Recovery from failure | Restart process, reload model | Rebuild engine, redistribute |
The decision: TensorRT-LLM's 10-20% speed advantage did not justify the 3-5x increase in operational complexity for a team without deep CUDA expertise.
Interview Questions And Answers
Q1: "Explain PagedAttention in simple terms."
Strong answer: "PagedAttention solves a memory waste problem in GPU inference. Traditionally, when a model serves a request, it pre-allocates a large contiguous block of GPU memory for the KV cache — enough for the maximum possible sequence length. But most requests use much less than the maximum, so 60-80% of allocated memory sits empty. PagedAttention breaks the KV cache into small fixed-size pages, allocated on demand as the sequence grows — exactly like how OS virtual memory uses paging instead of contiguous allocation. This reduced our memory waste from 72% to 4%, increasing concurrent requests per GPU from 4-6 to 85-90."
Q2: "Why is continuous batching important for chat workloads specifically?"
Strong answer: "Chat messages have highly variable lengths — a 'hello' response might be 5 tokens while a detailed explanation is 500 tokens. With static batching, all requests in a batch must wait for the slowest one to finish before new requests can enter. A 5-token response sits in a GPU slot doing nothing while the 500-token response completes. Continuous batching allows requests to enter and exit the batch at iteration boundaries — the moment one request finishes, another takes its slot. For our chat workload with 10x variance in response lengths, this improved GPU utilization from ~40% to ~90%."
Q3: "How does prefix caching help in multi-turn chat?"
Strong answer: "In multi-turn chat, every new message re-sends the entire conversation history plus the system prompt. The system prompt and earlier turns are identical across requests. Without prefix caching, the model recomputes attention for these identical tokens every time — pure waste. vLLM's automatic prefix caching detects when the beginning of a new request matches a previously computed prefix and reuses the KV cache from that computation. For our chatbot with a 500-token system prompt, this saved ~500 × num_layers forward pass computations on every single request. At 50 requests/second, that's 25,000 tokens of redundant computation eliminated per second."
Q4 (Basics): "What is GPU VRAM and why does it limit inference?"
Answer: "VRAM (Video RAM) is the memory directly attached to the GPU chip. An A100 has 80 GB of VRAM. During inference, VRAM must hold: (1) the model weights (~14 GB for a 7B model in FP16), (2) the KV cache for all concurrent requests (varies with concurrency and sequence length), (3) activation memory for the forward pass, and (4) CUDA overhead. Because all of these compete for the same 80 GB pool, inefficient KV cache allocation directly reduces how many requests you can serve concurrently."
Q5 (Basics): "What is model quantization?"
Answer: "Quantization reduces the numerical precision of model weights — for example, from 16-bit floating point (FP16) to 4-bit integer (INT4). A 7B parameter model in FP16 uses ~14 GB of VRAM. In INT4 (AWQ quantization), the same model uses ~3.5 GB — a 4x reduction. The tradeoff is a small quality loss, typically 1-3% on benchmarks. AWQ (Activation-Aware Weight Quantization) is smarter than naive quantization — it identifies which weights are most sensitive to quality and preserves their precision while aggressively quantizing less important weights."
Q6 (Follow-up): "How do you benchmark vLLM to validate these improvements?"
Answer:
"We used three benchmarks. (1) Throughput test — send 1000 requests of varied lengths and measure total tokens/second. This validates batch efficiency. (2) Latency profile — measure P50, P95, P99 time-to-first-token and inter-token latency under varying concurrency (1, 10, 50, 100 concurrent). This catches latency regression under load. (3) Memory utilization — monitor VRAM usage via nvidia-smi during the throughput test to verify memory efficiency. We compared against our previous HuggingFace Transformers baseline. All three benchmarks had to improve for us to proceed with the migration."
Q7 (Follow-up): "What happens when all KV cache pages are exhausted?"
Answer:
"vLLM implements a preemption policy. When new requests arrive and there are no free pages, the engine can preempt (pause) lower-priority or older requests by swapping their KV cache to CPU memory, freeing GPU pages for new requests. When pages become available again, the preempted requests resume by reloading their KV cache from CPU. This is better than rejection (dropping requests) because no work is lost. We monitor the preemption rate — if it exceeds 5%, it signals we need to add GPU capacity or reduce max_num_seqs."
Q8 (Follow-up): "If you had to serve both a 7B model and a 70B model, how would you architect this?"
Answer: "Separate GPU pools. The 7B model fits on a single A100 and can serve high concurrency. The 70B model needs tensor parallelism across 4-8 GPUs. I'd deploy them as separate SageMaker endpoints with independent autoscaling policies. The orchestrator routes requests based on complexity — simple queries go to the 7B model (faster, cheaper), complex queries go to the 70B model (higher quality). This avoids the worst-case scenario of a 70B model deployment wasting expensive multi-GPU resources on simple questions."
Group Follow-Up Panel: Rapid-Fire Probing Questions
Interviewer 1 (GPU Specialist): "You set gpu_memory_utilization to 0.92. What happens if you set it to 1.0? And why not 0.80 to be safe?"
Strong answer: "At 1.0, the vLLM engine tries to use ALL VRAM for the KV cache, leaving zero headroom for CUDA's internal operations — memory allocator, kernel launch overhead, cuBLAS workspace. This causes sporadic CUDA OOM on edge-case requests. At 0.80, you're leaving 16 GB unused on an 80 GB A100 — that's enough KV pages for ~20 additional concurrent requests you're throwing away. 0.92 is our sweet spot — we profiled under peak load and found CUDA needs roughly 5-8% headroom. We validated this by running a 24-hour soak test at 0.92 with zero OOM events."
# vLLM server launch configuration
# This is the actual Docker CMD / entrypoint for our inference container
"""
vLLM Serving Container Launch Script
Run inside: FROM vllm/vllm-openai:v0.4.0
"""
import subprocess
import os
VLLM_ARGS = [
"python", "-m", "vllm.entrypoints.openai.api_server",
# Model identity
"--model", os.environ.get("MODEL_PATH", "/opt/ml/model"),
"--served-model-name", os.environ.get("MODEL_NAME", "chatbot-v2"),
# Memory management
"--gpu-memory-utilization", "0.92", # Leave 8% CUDA headroom
"--max-model-len", "4096", # Context window cap
"--block-size", "16", # PagedAttention block size
# Concurrency control
"--max-num-seqs", "128", # Max concurrent sequences
"--max-num-batched-tokens", "8192", # Max tokens per batch iteration
# Performance optimizations
"--enable-prefix-caching", # Reuse common prompt prefixes
"--quantization", "awq", # INT4 quantization
"--dtype", "half", # FP16 for non-quantized layers
# Serving config
"--host", "0.0.0.0",
"--port", "8080",
"--trust-remote-code",
# Tensor parallelism (for multi-GPU setups)
"--tensor-parallel-size", os.environ.get("TP_SIZE", "1"),
]
# Optional: Multi-LoRA support
if os.environ.get("ENABLE_LORA", "false") == "true":
VLLM_ARGS.extend([
"--enable-lora",
"--max-loras", os.environ.get("MAX_LORAS", "4"),
"--max-lora-rank", os.environ.get("MAX_LORA_RANK", "16"),
])
print(f"Starting vLLM with args: {' '.join(VLLM_ARGS)}")
subprocess.run(VLLM_ARGS, check=True)
Interviewer 2 (Architect): "You mentioned Multi-LoRA. In production, how does the routing work? How does the container know which adapter to use for which request?"
Strong answer: "The request includes a model field in the OpenAI-compatible API format. Each LoRA adapter is registered with a name — e.g., chatbot-ja for Japanese, chatbot-en for English. The client sends {\"model\": \"chatbot-ja\", \"messages\": [...]}. vLLM's scheduler routes the request to the correct adapter while sharing the same base model weights. The base model weights stay in VRAM once, and each LoRA adapter adds only 0.1-1% overhead. So 4 adapters on one GPU is barely more expensive than 1."
# Multi-LoRA request routing — client-side
import openai
# All adapters are served from the SAME container on the SAME GPU
client = openai.OpenAI(base_url="http://inference:8080/v1", api_key="dummy")
# Japanese adapter
response_ja = client.chat.completions.create(
model="chatbot-ja", # ← Routes to Japanese LoRA adapter
messages=[{"role": "user", "content": "この漫画の感想を教えて"}],
max_tokens=256,
)
# English adapter — same container, same GPU, different adapter
response_en = client.chat.completions.create(
model="chatbot-en", # ← Routes to English LoRA adapter
messages=[{"role": "user", "content": "Tell me about this manga"}],
max_tokens=256,
)
# Both use the same base model weights — only adapter layers differ
Interviewer 3 (Performance): "Your P95 latency dropped from 920ms to 290ms. Can you decompose where each millisecond was saved?"
Strong answer: "Three main contributors. (1) PagedAttention eliminated memory fragmentation that was causing the GPU to stall during allocation — saved ~200ms on memory management overhead. (2) Continuous batching eliminated inter-batch waiting — under static batching, requests waited up to 300ms for the batch window; with continuous batching, the wait is one iteration step (~5-15ms). Saved ~250ms average. (3) Prefix caching removed redundant computation on the 500-token system prompt — at ~1ms per token in the prefill phase, that's ~500ms saved on every multi-turn request that reuses the prefix. Not all savings stack linearly, but the net effect was 920ms → 290ms."
Interviewer 4 (Operations): "How do you monitor vLLM in production? What metrics are you looking at?"
Strong answer: "Five key metrics, all exposed via vLLM's Prometheus endpoint. (1) vllm:num_requests_running — current in-flight requests, alerts if sustained above max_num_seqs * 0.85. (2) vllm:gpu_cache_usage_perc — KV cache utilization, alerts above 95% (preemption risk). (3) vllm:avg_prompt_throughput_tps — prefill throughput, detects degradation. (4) vllm:avg_generation_throughput_tps — decode throughput. (5) vllm:num_preemptions_total — counts preemptions, any non-zero rate signals capacity pressure. We also track nvidia_smi_gpu_utilization and nvidia_smi_memory_used."
# Prometheus scrape config for vLLM monitoring
# Added to the inference container sidecar
scrape_configs:
- job_name: 'vllm-inference'
metrics_path: '/metrics'
static_configs:
- targets: ['localhost:8080']
scrape_interval: 15s
# CloudWatch alarm definitions
alarms:
kv_cache_pressure:
metric: vllm_gpu_cache_usage_perc
threshold: 0.95
period: 300 # 5 minutes
evaluation_periods: 2
action: scale_out_gpu_fleet
preemption_rate:
metric: vllm_num_preemptions_total
threshold: 10 # 10 preemptions in 5 minutes
period: 300
action: page_on_call
request_queue_depth:
metric: vllm_num_requests_waiting
threshold: 50
period: 60
action: scale_out_gpu_fleet
Interviewer 5 (Disaster Recovery): "Production is serving with 4 A100 GPUs. One dies. What happens?"
Strong answer: "SageMaker endpoint has 4 instances behind a load balancer. One going unhealthy triggers: (1) Health check fails → instance removed from load balancer pool in ~30 seconds. (2) Remaining 3 instances absorb 100% of traffic — we capacity-plan so that N-1 instances can handle peak load (we call this N+1 redundancy). (3) SageMaker's autoscaler detects the missing instance and provisions a replacement. (4) The replacement goes through the full warmup cycle (~2 minutes) before receiving traffic. (5) Alert fires to Slack. Total user impact: slight latency increase during the 2-minute replacement window as 3 GPUs handle the load of 4. No requests are dropped."
Code Examples: Dockerfile For vLLM Inference Container
# ============================================
# vLLM INFERENCE CONTAINER
# Production-ready GPU serving container
# ============================================
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04 AS base
# Avoid timezone prompts
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
apt-get install -y --no-install-recommends \
python3.11 python3-pip curl && \
rm -rf /var/lib/apt/lists/*
# ---------- Stage: Dependencies ----------
FROM base AS deps
# Install vLLM and serving dependencies.
# Use "python3.11 -m pip" so packages land in the 3.11 site-packages;
# the base image's default python3 is the distro 3.10 interpreter.
COPY requirements-inference.txt /tmp/
RUN python3.11 -m pip install --no-cache-dir -r /tmp/requirements-inference.txt
# ---------- Stage: Runtime ----------
FROM base AS runtime
# Copy installed packages
COPY --from=deps /usr/local/lib/python3.11 /usr/local/lib/python3.11
COPY --from=deps /usr/local/bin /usr/local/bin
# Copy serving scripts
WORKDIR /app
COPY serve.py warmup.py health.py ./
COPY config/ ./config/
# Non-root user (as much as possible — some GPU ops need root)
RUN useradd -m -s /bin/bash vllm
USER vllm
# vLLM metrics endpoint + inference API
EXPOSE 8080
# Health check for SageMaker
HEALTHCHECK --interval=10s --timeout=5s --start-period=300s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
ENV NVIDIA_VISIBLE_DEVICES=all \
NVIDIA_DRIVER_CAPABILITIES=compute,utility \
VLLM_USAGE_STATS=0
ENTRYPOINT ["python3", "serve.py"]
Critical Points To Remember — Docker-LLD-3
╔══════════════════════════════════════════════════════════════════════╗
║ CRITICAL POINTS — LLD-3 ║
╠══════════════════════════════════════════════════════════════════════╣
║ ║
║ 1. PagedAttention = OS VIRTUAL MEMORY FOR GPU ║
║ • Static KV = contiguous allocation → 72% waste ║
║ • Paged KV = on-demand blocks → 4% waste ║
║ • This single change increased concurrency 15x ║
║ ║
║ 2. CONTINUOUS BATCHING = NO WAITING FOR SLOWEST REQUEST ║
║ • Static batch: all wait for longest → GPU idle 40-60% ║
║ • Continuous: enter/exit at iteration boundary → 85-95% util ║
║ • Critical for chat workloads with variable response lengths ║
║ ║
║ 3. PREFIX CACHING = FREE COMPUTE SAVINGS FOR CHAT ║
║ • System prompt is identical across ALL requests ║
║ • Without caching: recompute 500 tokens every time ║
║ • With caching: reuse computed KV cache → saves 500ms+ ║
║ ║
║ 4. gpu_memory_utilization = 0.92 (NOT 1.0, NOT 0.80) ║
║ • 1.0 → sporadic CUDA OOM (no headroom for CUDA internals) ║
║ • 0.80 → wasting 16 GB of usable KV cache space ║
║ • Profile under peak load, validate with soak test ║
║ ║
║ 5. vLLM vs TensorRT-LLM = OPERABILITY vs RAW SPEED ║
║ • TensorRT-LLM: 10-20% faster, 3-5x harder to operate ║
║ • vLLM: near-best speed, pip install, broad hardware support ║
║ • Choose based on team skills, not benchmarks ║
║ ║
║ 6. MULTI-LoRA = MULTIPLE MODELS ON ONE GPU ║
║ • Base model loaded once → shared across all adapters ║
║ • Each adapter: 0.1-1% memory overhead ║
║ • Route by model name in OpenAI-compatible API format ║
║ ║
║ 7. MONITOR THESE 5 vLLM METRICS ║
║ • num_requests_running (concurrency) ║
║ • gpu_cache_usage_perc (KV pressure) ║
║ • num_preemptions_total (capacity overflow) ║
║ • avg_prompt_throughput (prefill speed) ║
║ • avg_generation_throughput (decode speed) ║
║ ║
╚══════════════════════════════════════════════════════════════════════╝
Docker-LLD-4: GPU OOM Prevention And Container Stability
Objective
Prevent long conversations from turning into GPU OOM crashes that restart the inference container and create availability loss for that instance's share of traffic.
Memory-Control Flow
flowchart TD
Msg["Conversation History"] --> Budget["Context Budget Manager"]
Budget --> Window["Sliding Window Selection"]
Window --> Prompt["Prompt Assembly"]
Prompt --> Infer["vLLM Inference"]
Infer --> OOM{"OOM?"}
OOM -->|No| Resp["Return Response"]
OOM -->|Yes| Guard["OOM Circuit Breaker"]
Guard --> Fallback["Graceful Fallback + Metric"]
Decision Evolution
| Step | New decision | Why the previous decision was not enough | Improvement introduced |
|---|---|---|---|
| D0 | Send the full conversation history to the model | Memory use grew linearly with turn count and eventually crashed the worker. | Made the true root cause explicit: prompt policy was driving container instability. |
| D1 | Add sliding-window context budgeting | Blunt truncation would have hidden memory pressure at the cost of random context loss. | Turn selection became deterministic and recent turns were preserved first. |
| D2 | Add explicit truncation markers when history is dropped | Silent truncation makes debugging and conversation continuity harder. | The model and the user path gained an explicit signal that some history was intentionally trimmed. |
| D3 | Apply AWQ INT4 quantization with manga-domain calibration | Context budgeting alone still left too little VRAM headroom under concurrency. | Reclaimed VRAM while keeping quality loss below the accepted threshold. |
| D4 | Add an OOM circuit breaker at inference time | Preventive controls cannot guarantee zero OOM events in every traffic shape. | One bad request degrades gracefully instead of crashing the worker and triggering a restart. |
Token Budget Strategy
| Budget area | Policy |
|---|---|
| System prompt | Reserve fixed headroom for the system instructions. |
| RAG context | Reserve space for retrieval payloads; do not let history consume it all. |
| Current query | Always include the live user turn. |
| Output budget | Reserve generation tokens ahead of time. |
| History window | Fill only the remaining budget with the most recent turns. |
| Overflow behavior | Insert an explicit truncation marker instead of silently dropping context. |
Why This Improved The Previous Design
| Area | Previous design | New design | Practical improvement |
|---|---|---|---|
| Prompt assembly | Full history or ad hoc truncation | Sliding window with explicit budget | Predictable memory footprint and clearer debugging. |
| Model footprint | BF16 or larger-weight path | AWQ INT4 with in-domain calibration | Significant VRAM reduction with acceptable quality retention. |
| Error handling | Worker crash on OOM | Circuit breaker + fallback | OOM no longer turns into container churn. |
| Availability | Restart after memory blow-up | Response degradation on extreme edge cases | User sees a controlled fallback instead of an instance outage. |
Failure Modes And Controls
| Failure mode | User-facing risk | Control |
|---|---|---|
| Long multi-turn session | Container restart and brief traffic loss | Sliding-window budgeting prevents runaway context growth. |
| Quantization with poor calibration | Lower answer quality, especially for Japanese content | Use manga-domain calibration data for AWQ. |
| Rare OOM despite prevention | Instance restart | OOM circuit breaker clears cache, emits a metric, and returns a safe fallback. |
| Silent history loss | Confusing conversation behavior | Truncation marker makes loss explicit. |
Improvement Evidence
| Metric | Before | After |
|---|---|---|
| OOM incidents per week | ~14 | 0 in 8 weeks post-deployment |
| Container restarts per week | ~14 | 0 from OOM path |
| VRAM for 20-turn conversation | 11.2 GB | 3.8 GB |
| P99 safe conversation length | 12 turns | 35+ turns |
Design Lesson
GPU stability is not just a model-choice problem. It is a prompt-budgeting problem, a weight-footprint problem, and an error-containment problem at the same time.
Deep Dive: Group Discussion — Why GPU OOM Is A Software Bug, Not A Hardware Limitation
Engineer A (Frontend): "Users are complaining — after chatting for 20 minutes, the bot stops responding for 30 seconds. Then it comes back but forgets everything."
Engineer B (SRE): "Container restarts spike every time this happens. Exit code 137 sometimes, but also CUDA OOM errors in the logs."
Engineer C (ML Infra): "Here's the root cause. Every new message in a conversation sends the ENTIRE history — system prompt + all previous messages + current query. Turn 1 = 500 tokens. Turn 10 = 3,000 tokens. Turn 30 = 12,000 tokens. The KV cache grows linearly with each turn. At turn 25-30, the KV cache exceeds available VRAM and the CUDA runtime throws an OOM."
Engineer D (Senior): "So the GPU didn't run out of memory. The APPLICATION let the prompt grow unbounded until the GPU ran out of memory. This is a prompt engineering and context management problem, not a hardware scaling problem."
Engineer E (ML): "Exactly. Throwing more VRAM at this is band-aid thinking. An A100 80GB just means users crash at turn 50 instead of turn 30. The real fix is making sure the prompt can NEVER exceed the memory budget, regardless of conversation length."
Visualizing The Problem
Turn 1:  [SYS:500][Q1:50][A1:200] = 750 tokens ✓ OK
Turn 5:  [SYS:500][Q1:50][A1:200]...[Q5:50][A5:200] = 1,750 tokens ✓ OK
Turn 10: [SYS:500][Q1:50][A1:200]...[Q10:50][A10:200] = 3,000 tokens ⚠ Getting tight
Turn 20: [SYS:500][Q1:50][A1:200]...[Q20:50][A20:200] = 5,500 tokens ⚠ VRAM pressure
Turn 30: [SYS:500][Q1:50][A1:200]...[Q30:50][A30:200] = 8,000 tokens ✗ GPU OOM!
The Token Budget Manager — How It Actually Works
# Simplified but accurate representation of the budget logic.
# count_tokens and build_prompt are assumed helpers (tokenizer wrapper, prompt template joiner).
MAX_CONTEXT = 4096        # Model's max context window
MAX_OUTPUT_TOKENS = 512   # Reserved generation budget (matches the -512 below)
def assemble_prompt(system_prompt, rag_context, history, current_query):
budget = MAX_CONTEXT
# 1. System prompt: ALWAYS included, non-negotiable
budget -= count_tokens(system_prompt) # -500 tokens → 3596 left
# 2. Output reservation: MUST leave room for the response
budget -= MAX_OUTPUT_TOKENS # -512 tokens → 3084 left
# 3. Current query: ALWAYS included
budget -= count_tokens(current_query) # -50 tokens → 3034 left
# 4. RAG context: Include retrieval results
budget -= count_tokens(rag_context) # -300 tokens → 2734 left
# 5. History: Fill remaining budget with MOST RECENT turns
history_tokens = 0
included_turns = []
for turn in reversed(history): # Start from most recent
turn_tokens = count_tokens(turn)
if history_tokens + turn_tokens > budget:
break # Stop, budget exhausted
included_turns.insert(0, turn)
history_tokens += turn_tokens
# 6. If any history was dropped, add a truncation marker
if len(included_turns) < len(history):
included_turns.insert(0,
"[Earlier messages were summarized to fit context window]")
return build_prompt(system_prompt, included_turns,
rag_context, current_query)
Key design principle: The budget is allocated top-down by priority. System prompt and output reservation are non-negotiable. History is the flexible part — it fills whatever space remains.
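A toy invocation of assemble_prompt from the snippet above. The count_tokens and build_prompt stand-ins are placeholders for the assumed helpers, not the production implementations:
# Toy usage of assemble_prompt with stand-in helpers
def count_tokens(text):
    return int(len(str(text).split()) * 1.3)   # rough word-based estimate

def build_prompt(system_prompt, turns, rag_context, current_query):
    return "\n".join([system_prompt, *map(str, turns), rag_context, current_query])

history = [f"user: question {i}\nassistant: answer {i} " + "word " * 200 for i in range(30)]
prompt = assemble_prompt(
    system_prompt="You are a manga-savvy assistant.",
    rag_context="Retrieved passage about chapter 42.",
    history=history,
    current_query="What happened to the protagonist?",
)
# Only the most recent turns fit the budget; older ones are replaced by the marker
print(prompt.splitlines()[1])  # "[Earlier messages were summarized to fit context window]"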
Why AWQ Over GPTQ Or BitsAndBytes?
| Method | Quality | Speed | Calibration | Use case |
|---|---|---|---|---|
| AWQ | High (activation-aware) | Fast inference | Requires in-domain calibration data | Production serving where quality matters |
| GPTQ | Good | Fast inference | Requires calibration data | Good alternative to AWQ |
| BitsAndBytes (NF4) | Good | Slower (dynamic dequant) | No calibration needed | Quick experimentation, fine-tuning |
| GGUF/llama.cpp | Variable | CPU-friendly | Pre-quantized models available | Edge/CPU deployment |
Why we chose AWQ: The calibration data matters. We calibrated AWQ with manga-domain conversation data (Japanese + English mixed content), so the quantization preserved quality specifically for our use case. Generic calibration on Wikipedia text would have degraded Japanese content quality by 3-5%, but domain-specific calibration kept degradation under 1%.
Interview Questions And Answers
Q1: "Tell me about a time you debugged a container stability issue."
Strong answer (STAR format): "Situation: Our inference containers were restarting ~14 times per week during peak hours. Each restart caused 30+ seconds of downtime for users routed to that instance. Task: Identify the root cause and eliminate the restarts. Action: I traced the restarts to GPU OOM errors triggered by long multi-turn conversations. The prompt assembly was sending full conversation history — turn 30+ conversations exceeded the KV cache budget. I implemented a four-part fix: (1) a token budget manager that allocates context space by priority — system prompt, output reservation, current query, RAG context, then fills remaining space with recent history; (2) explicit truncation markers so users know when history was trimmed; (3) AWQ INT4 quantization with in-domain calibration to reclaim VRAM; (4) an OOM circuit breaker that catches CUDA OOM, clears the KV cache, and returns a graceful fallback instead of crashing. Result: Zero OOM restarts in 8 weeks post-deployment. VRAM usage for a 20-turn conversation dropped from 11.2 GB to 3.8 GB."
Q2: "What is a circuit breaker pattern?"
Strong answer: "A circuit breaker monitors calls to a downstream dependency or operation. It has three states: Closed (normal — requests flow through), Open (fault detected — requests immediately fail-fast without attempting the operation), and Half-Open (testing — a few requests are allowed through to check if the dependency recovered). For our GPU OOM case, the 'dependency' was the inference operation itself. If it throws a CUDA OOM, the circuit breaker trips: it clears the KV cache, logs a metric, returns a graceful fallback response, and waits before allowing full-size requests again. This prevents one bad request from cascading into a container restart that affects all other requests on that instance."
Q3: "How did you choose the token budget allocations?"
Strong answer: "Data-driven. I analyzed 10,000 production conversations: median system prompt = 480 tokens, median RAG context = 280 tokens, median user query = 45 tokens, median response = 350 tokens. I set reserves with headroom: 500 for system prompt, 512 for output, 300 for RAG. That left ~2,784 tokens for history on a 4096-token model. At ~250 tokens per turn (question + answer), that's ~11 most recent turns. For the 95th percentile use case, this was more than enough context. For the rare power user with 50+ turns, only the most recent 11 turns are kept — but the truncation marker and optional summarization preserve conversational coherence."
Q4 (Basics): "What is a Docker volume and when do you use it?"
Answer:
"A Docker volume is a persistent storage mechanism that outlives the container. Containers have a writable layer, but it's deleted when the container is removed. Volumes solve this by mounting a directory from the host (or a named volume managed by Docker) into the container. Use cases: (1) database storage — you don't want to lose your database when a container restarts, (2) shared data between containers, (3) model weights — mount a volume with pre-downloaded weights so every container doesn't re-download them. Types: named volumes (docker volume create), bind mounts (host path directly), and tmpfs mounts (memory only, for sensitive data)."
Q5 (Basics): "What is the difference between COPY and ADD in a Dockerfile?"
Answer:
"Both copy files from the build context into the image. COPY is straightforward — it copies files as-is. ADD has two extra features: (1) it can extract tar archives automatically, and (2) it can fetch URLs. Best practice: always use COPY unless you specifically need tar extraction. ADD's URL feature is discouraged because it creates non-reproducible builds (the URL content can change). If you need to download files, use RUN curl or RUN wget instead, because those can be cached in a separate layer."
Q6 (Follow-up): "What if the user NEEDS full conversation history for their use case?"
Answer: "Three approaches. (1) Summarization — instead of dropping old turns completely, periodically summarize older turns into a condensed paragraph. This preserves semantic meaning in fewer tokens. (2) RAG over history — index the full conversation in a vector store, and retrieve relevant past turns based on the current query instead of including everything. (3) Larger context model — move to a model with 32K or 128K context. But even 128K has a limit — a conversation with 500+ turns will still exceed it. The sliding window with summarization is the most robust long-term solution because it works regardless of context length."
Q7 (Follow-up): "How does the OOM circuit breaker know to recover?"
Answer: "After tripping, the circuit breaker enters a cooldown period (30 seconds in our case). During cooldown, all incoming requests to that instance are served with reduced context — maximum 1024 tokens instead of 4096 — which is guaranteed to fit in available VRAM. After the cooldown, the breaker enters half-open state and allows one full-context request through. If it succeeds, the breaker closes and normal operation resumes. If it OOMs again, the breaker re-opens and we emit an alert. Two consecutive trips within 5 minutes triggers an automatic scale-out event to add instance capacity."
Group Follow-Up Panel: Rapid-Fire Probing Questions
Interviewer 1 (Data Scientist): "Your sliding window keeps the most recent turns. But what if the CRITICAL information — like the user's name, their order number, or the topic — was mentioned in turn 1 and that turn gets dropped?"
Strong answer: "Great catch — this is why we don't just do blind sliding window. We have a priority hierarchy within the history window. Certain turns are 'pinned': the first turn (often contains user intent and key details) and any turn the system explicitly marked as containing entities (names, IDs, product references). Pinned turns are always included, and the sliding window operates over the remaining turns. If even pinned turns don't fit, we fall back to entity extraction — we pull key entities from dropped turns into a compact 'conversation context' block that takes ~50 tokens instead of the full turn."
# Enhanced sliding window with pinned turns and entity extraction
from dataclasses import dataclass
from typing import Optional
@dataclass
class ConversationTurn:
role: str
content: str
token_count: int
is_pinned: bool = False # First turn, entity-rich turns
    entities: Optional[list] = None  # Extracted entities for compaction
class ContextBudgetManager:
def __init__(self, max_context=4096, output_reserve=512,
system_reserve=500, rag_reserve=300):
self.max_context = max_context
self.output_reserve = output_reserve
self.system_reserve = system_reserve
self.rag_reserve = rag_reserve
def assemble(self, system_prompt: str, rag_context: str,
current_query: str, history: list[ConversationTurn]) -> dict:
"""Build prompt within budget. Never exceeds max_context."""
budget = self.max_context
budget -= self.system_reserve
budget -= self.output_reserve
budget -= self._count_tokens(current_query)
budget -= self._count_tokens(rag_context)
# Phase 1: Always include pinned turns
pinned = [t for t in history if t.is_pinned]
unpinned = [t for t in history if not t.is_pinned]
pinned_cost = sum(t.token_count for t in pinned)
remaining_budget = budget - pinned_cost
if remaining_budget < 0:
# Even pinned turns don't fit — compact them to entities
entity_summary = self._compact_to_entities(pinned)
pinned = [ConversationTurn(
role="system",
content=f"[Context from earlier: {entity_summary}]",
token_count=self._count_tokens(entity_summary),
is_pinned=True
)]
pinned_cost = pinned[0].token_count
remaining_budget = budget - pinned_cost
# Phase 2: Fill remaining budget with most recent unpinned turns
included_unpinned = []
used = 0
for turn in reversed(unpinned):
if used + turn.token_count > remaining_budget:
break
included_unpinned.insert(0, turn)
used += turn.token_count
# Phase 3: Build final prompt with truncation marker if needed
total_history = len(pinned) + len(unpinned)
included_history = len(pinned) + len(included_unpinned)
was_truncated = included_history < total_history
result = {
"system_prompt": system_prompt,
"history": pinned + (
[ConversationTurn(
role="system",
content="[Some earlier messages were omitted to fit context window]",
token_count=15
)] if was_truncated else []
) + included_unpinned,
"rag_context": rag_context,
"current_query": current_query,
"was_truncated": was_truncated,
"turns_dropped": total_history - included_history,
"total_tokens": self.max_context - remaining_budget + used,
}
return result
def _compact_to_entities(self, turns):
"""Extract key entities from turns for compact representation."""
entities = []
for turn in turns:
if turn.entities:
entities.extend(turn.entities)
return ", ".join(entities) if entities else "No key entities extracted"
    def _count_tokens(self, text):
        # Simplified; use a real tokenizer (e.g., tiktoken) in production
        return int(len(text.split()) * 1.3)  # rough word-count approximation
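A quick demonstration with toy data that pinned turns survive truncation:
# Demo: the pinned first turn survives even when filler history is dropped
turns = [ConversationTurn("user", "My order ID is 99812.", token_count=8, is_pinned=True)]
turns += [ConversationTurn("user", f"filler question {i}", token_count=300) for i in range(20)]

mgr = ContextBudgetManager(max_context=4096)
out = mgr.assemble(
    system_prompt="You are a support assistant.",
    rag_context="",
    current_query="What was my order ID again?",
    history=turns,
)
assert any(t.is_pinned for t in out["history"])    # pinned turn is always included
print(out["was_truncated"], out["turns_dropped"])  # True, with several filler turns dropped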
Interviewer 2 (SRE): "You said the circuit breaker catches CUDA OOM. But CUDA OOM is raised during the forward pass — mid-computation. How do you actually catch it without corrupting the engine state?"
Strong answer: "Critical detail. CUDA OOM during a vLLM forward pass leaves the engine in an undefined state — partially allocated memory, incomplete batch state. Simply catching the Python exception isn't enough. Our circuit breaker does three things: (1) Catches torch.cuda.OutOfMemoryError at the request handler level. (2) Calls torch.cuda.empty_cache() to release all cached GPU memory back to the allocator. (3) Triggers a vLLM engine reset — this flushes the KV block manager, clears the scheduler queue, and reinitializes the memory pool. Active requests in the batch are returned with a 503 error. The engine is ready for new requests after ~2-3 seconds of cleanup."
# OOM Circuit Breaker implementation
import torch
import time
import logging
from enum import Enum
logger = logging.getLogger("oom-breaker")
class BreakerState(Enum):
CLOSED = "closed" # Normal operation
OPEN = "open" # Blocking full-context requests
HALF_OPEN = "half_open" # Testing recovery
class OOMCircuitBreaker:
def __init__(self, cooldown_seconds=30, reduced_context=1024,
max_consecutive_trips=2, trip_window_seconds=300):
self.state = BreakerState.CLOSED
self.cooldown_seconds = cooldown_seconds
self.reduced_context = reduced_context
self.max_consecutive_trips = max_consecutive_trips
self.trip_window_seconds = trip_window_seconds
self.last_trip_time = 0
self.trip_count = 0
self.metrics = {"trips": 0, "reduced_requests": 0, "recovered": 0}
def get_allowed_context_length(self, requested_length: int) -> int:
"""Returns the maximum context length allowed given breaker state."""
if self.state == BreakerState.CLOSED:
return requested_length
elif self.state == BreakerState.OPEN:
self.metrics["reduced_requests"] += 1
return min(requested_length, self.reduced_context)
elif self.state == BreakerState.HALF_OPEN:
return requested_length # Let one full request through to test
def record_success(self):
"""Called after successful inference."""
if self.state == BreakerState.HALF_OPEN:
logger.info("Half-open request succeeded. Closing breaker.")
self.state = BreakerState.CLOSED
self.trip_count = 0
self.metrics["recovered"] += 1
def record_oom(self, engine):
"""Called when CUDA OOM is caught."""
now = time.monotonic()
self.metrics["trips"] += 1
# Track consecutive trips
if now - self.last_trip_time < self.trip_window_seconds:
self.trip_count += 1
else:
self.trip_count = 1
self.last_trip_time = now
logger.error(
f"OOM Circuit Breaker TRIPPED (trip #{self.trip_count}). "
f"Entering OPEN state for {self.cooldown_seconds}s."
)
# Emergency GPU cleanup
torch.cuda.empty_cache()
        engine.reset()  # Assumed wrapper API: flush KV blocks, clear the scheduler queue
self.state = BreakerState.OPEN
# Schedule transition to half-open
# In production, use asyncio.call_later or similar
self._schedule_half_open()
# Escalation: too many trips → scale out
if self.trip_count >= self.max_consecutive_trips:
logger.critical(
f"{self.trip_count} OOM trips in {self.trip_window_seconds}s. "
"Triggering scale-out alarm."
)
self._emit_scale_out_alarm()
def _schedule_half_open(self):
"""After cooldown, allow one test request through."""
import threading
def _transition():
self.state = BreakerState.HALF_OPEN
logger.info("Breaker entering HALF_OPEN state. Next request is a test.")
threading.Timer(self.cooldown_seconds, _transition).start()
def _emit_scale_out_alarm(self):
"""Send CloudWatch metric that triggers autoscaling."""
import boto3
cw = boto3.client("cloudwatch")
cw.put_metric_data(
Namespace="InferenceContainer",
MetricData=[{
"MetricName": "OOMEscalation",
"Value": 1,
"Unit": "Count"
}]
)
Interviewer 3 (Product): "The truncation marker says 'earlier messages were omitted.' Doesn't that confuse users? Have you tested this?"
Strong answer: "We tested three approaches. (1) Silent truncation — model suddenly 'forgets' context with no explanation. Users rated this worst — confusing and frustrating. (2) Visible truncation marker shown to user — 'This conversation is getting long. Some earlier messages may not be referenced.' Users appreciated the transparency. (3) Invisible truncation marker in the model prompt only — the model sees '[Earlier context omitted]' but the user sees nothing; the model learns to avoid referencing information it might not have. We chose option 3 for most cases with option 2 for conversations exceeding 30 turns."
Interviewer 4 (ML Engineer): "You calibrated AWQ with manga-domain data. How much calibration data did you need? What happens if the domain shifts — say you add a new content category?"
Strong answer: "AWQ calibration needs surprisingly little data — 128-512 examples from your target domain is sufficient. We used 256 multi-turn conversations (manga discussion in Japanese and English). The calibration identifies which weight channels are 'salient' — high impact on output quality — and preserves their precision. For domain shift: moderate shifts (new manga genres) don't require recalibration because the salient channels are largely shared across similar tasks. Major shifts (adding code generation or medical Q&A) would require recalibration and A/B testing to verify quality."
Interviewer 5 (Scale): "What happens at 100 concurrent conversations, each 30 turns long? Does your budget manager become a bottleneck?"
Strong answer: "The budget manager is pure CPU computation — tokenization and list iteration. Even with 100 concurrent conversations × 30 turns, the budget calculation takes <1ms per request. It's not on the critical path of GPU inference, which takes 50-300ms per request. The real bottleneck is GPU VRAM — 100 concurrent requests × 4096 tokens × KV cache per token. With PagedAttention and AWQ quantization, we can fit ~90 concurrent sequences per A100. If we hit 100 concurrent conversations, autoscaling kicks in to add a second GPU. The budget manager scales horizontally by default because it runs inside each container."
Code Examples: Complete OOM-Safe Inference Handler
# Full request handler with budget management + OOM protection
import logging

import torch
from aiohttp import web

logger = logging.getLogger("inference-handler")

class InferenceHandler:
def __init__(self, engine, budget_manager, circuit_breaker):
self.engine = engine
self.budget = budget_manager
self.breaker = circuit_breaker
async def handle_request(self, request):
body = await request.json()
# Step 1: Assemble prompt within budget
assembled = self.budget.assemble(
system_prompt=body.get("system_prompt", "You are a helpful assistant."),
rag_context=body.get("rag_context", ""),
current_query=body["query"],
history=body.get("history", []),
)
# Step 2: Check circuit breaker — may reduce context
allowed_len = self.breaker.get_allowed_context_length(
assembled["total_tokens"]
)
# Step 3: Inference with OOM protection
try:
result = await self.engine.generate(
prompt=assembled,
max_tokens=min(body.get("max_tokens", 256), 512),
max_context=allowed_len,
)
self.breaker.record_success()
return web.json_response({
"response": result.text,
"tokens_used": result.tokens_used,
"was_truncated": assembled["was_truncated"],
"turns_dropped": assembled["turns_dropped"],
})
except torch.cuda.OutOfMemoryError:
# OOM caught — breaker handles cleanup and state transition
self.breaker.record_oom(self.engine)
return web.json_response({
"response": "I'm experiencing high load. Please try again "
"or start a new conversation for best results.",
"error": "capacity_limit",
"was_truncated": True,
}, status=503)
except Exception as e:
logger.exception(f"Inference failed: {e}")
return web.json_response(
{"error": "internal_error"}, status=500
)
Critical Points To Remember — Docker-LLD-4
╔══════════════════════════════════════════════════════════════════════╗
║ CRITICAL POINTS — LLD-4 ║
╠══════════════════════════════════════════════════════════════════════╣
║ ║
║ 1. GPU OOM IS A PROMPT-SIZE BUG, NOT A HARDWARE BUG ║
║ • Unbounded history → linear VRAM growth → guaranteed crash ║
║ • More VRAM just delays the crash, doesn't fix it ║
║ • Fix: budget manager that CAPS total tokens ║
║ ║
║ 2. TOKEN BUDGET PRIORITY (memorize this order) ║
║ 1. System prompt (always included) ║
║ 2. Output reservation (always reserved) ║
║ 3. Current query (always included) ║
║ 4. RAG context (always included) ║
║ 5. Pinned turns (first turn, entity-rich turns) ║
║ 6. Recent history (fills remaining budget) ║
║ → History is the FLEXIBLE part. Everything else is fixed. ║
║ ║
║ 3. CIRCUIT BREAKER STATES ║
║ • CLOSED → normal operation, full context allowed ║
║ • OPEN → reduced context (1024 max), cooldown active ║
║ • HALF_OPEN → one full request allowed to test recovery ║
║ • 2 trips in 5 minutes → auto-scale-out alarm ║
║ ║
║ 4. CUDA OOM ≠ SYSTEM OOM ║
║ • System OOM → kernel kills process → exit code 137 ║
║ • CUDA OOM → Python exception → catchable but GPU is dirty ║
║ • After CUDA OOM: empty_cache() + engine reset + KV flush ║
║ ║
║ 5. TRUNCATION MUST BE EXPLICIT, NEVER SILENT ║
║ • Silent drop → model hallucinates with missing context ║
║ • Explicit marker → model knows to avoid referencing dropped ║
║ • User-facing marker → transparency for long conversations ║
║ ║
║ 6. AWQ CALIBRATION IS DOMAIN-SPECIFIC ║
║ • 256 in-domain examples is sufficient ║
║ • Wrong calibration data → 3-5% quality degradation ║
║ • Right calibration data → <1% quality degradation ║
║ ║
║ 7. THE FORMULA: Container Stability = ║
║ Budget Management + Weight Quantization + Error Containment ║
║ Remove any one → instability returns ║
║ ║
╚══════════════════════════════════════════════════════════════════════╝
Docker-LLD-5: Dockerized Integration Testing Instead Of Mocking Everything
Objective
Raise integration-test realism enough to catch serialization, retry, startup, and latency issues without forcing every PR through a fully shared staging environment.
CI Test Topology
flowchart LR
Suite["CI Test Job"] --> Unit["Unit Tests + Pure Mocks"]
Suite --> Local["Docker Network"]
Local --> LS["LocalStack"]
Local --> Redis["Redis Test Container"]
Local --> OS["OpenSearch Test Container"]
Suite --> WM["WireMock"]
Suite --> SM["Staging SageMaker Endpoint"]
Decision Evolution
| Step | New decision | Why the previous decision was not enough | Improvement introduced |
|---|---|---|---|
| D0 | Start with mocks for dependency isolation | Mocks were too optimistic about latency, serialization, TTL behavior, and startup effects. | Established that unit confidence was not the same as integration confidence. |
| D1 | Use LocalStack for AWS primitives in CI | Shared staging for every AWS interaction would be slower and harder to isolate per PR. | Moved DynamoDB, S3, SQS, and Kinesis-like tests into disposable Dockerized environments. |
| D2 | Use TestContainers for Redis and OpenSearch | Emulators alone do not cover real dependency semantics for cache expiry or search index behavior. | Real service containers made cache and search behavior far closer to production. |
| D3 | Keep WireMock for deterministic downstream REST failure paths | Real services are poor at reproducing exact timeout-retry-recover sequences on demand. | Failure choreography became deterministic and repeatable in CI. |
| D4 | Keep real staging SageMaker endpoints for ML-sensitive paths | Dockerized fakes cannot reproduce real model latency and cold-start characteristics. | Preserved realism exactly where latency behavior mattered most. |
Test Tier Design
| Tier | Dependencies | Best for | Why it improved the previous tier |
|---|---|---|---|
| Unit | Mocks only | Business logic, pure branching | Fastest layer, but intentionally unrealistic. |
| Local integration | LocalStack, Redis, OpenSearch, WireMock | Serialization, cache TTL, index behavior, retry logic | Better than unit mocks because dependencies behave like running systems. |
| Staging integration | Real SageMaker and selected cloud dependencies | Latency and startup-sensitive ML paths | Better than local-only tests because model behavior and tail latency stay real. |
| Full E2E | Shared staging | Final release confidence | Broader system validation after cheaper layers already filtered most issues. |
Execution Model
- Unit tests run first and fail fast.
- CI spins up disposable Dockerized dependencies for local integration coverage.
- WireMock scripts failure sequences for retry and circuit-breaker scenarios.
- ML-sensitive integration cases call staging SageMaker instead of a fake endpoint.
- Only a smaller E2E set needs the full shared environment.
Failure Modes And Controls
| Failure mode | User-facing risk | Control |
|---|---|---|
| Mock-only confidence | Hidden runtime bugs reach staging or prod | Add Dockerized dependencies for realistic integration behavior. |
| Unstable shared staging dependency | Flaky PR builds | Keep most integrations local and disposable in CI. |
| Retry bug only visible under orchestrated failures | Circuit breaker or fallback logic misbehaves | Use WireMock scenario support to script exact failure order. |
| ML latency drift hidden by emulators | SLA surprises in production | Hit staging SageMaker for inference-sensitive paths. |
Improvement Evidence
| Metric | Value |
|---|---|
| Unit test runtime | Less than 30 seconds |
| Integration test runtime | Roughly 5 minutes |
| End-to-end runtime | Roughly 10 minutes |
| Bugs caught by this design | Cold-start issue in intent classifier, circuit-breaker bug under failure choreography |
Design Lesson
The best design was not "replace mocks." It was "place each dependency on the cheapest test tier that still preserves the behavior you actually care about."
Deep Dive: Group Discussion — Why Mocks Lie And How Docker Fixes It
Engineer A (junior dev): "Our unit tests all pass. 100% coverage. Why did this break in staging?"
Engineer B (senior): "What broke?"
Engineer A: "The DynamoDB query. Testing locally it returns items in insertion order. In production it returns them in hash-key order. Our code assumed sorted results."
Engineer C (QA): "This is exactly why I keep saying mocks are dangerous. Your mock returned [item1, item2, item3] because that's how you wrote the mock. Real DynamoDB returns them differently."
Engineer D (Platform): "Mocks test your CODE. They don't test the INTERACTION between your code and the dependency. Every mock is an assumption about how the dependency behaves — and assumptions can be wrong."
Engineer E (Staff): "The solution isn't 'no mocks.' It's 'right tool at the right layer.' Use mocks for business logic and branching. Use real dependencies (in Docker containers) for serialization, ordering, TTL, connection handling, error codes, and protocol behavior."
The Testing Pyramid With Docker Integration
/\
/ \ E2E Tests (staging env)
/ E2E\ - 10 minutes
/------\ - Full shared environment
/ Docker \ - Run selectively on release branches
/ Integr. \
/ Tests \ Docker Integration Tests
/--------------\ - 5 minutes
/ LocalStack \ - LocalStack, Redis, OpenSearch containers
/ TestContainers \ - Per-PR, disposable, isolated
/ WireMock \
/----------------------\
/ Unit Tests \ Unit Tests (mocks)
/ (Mocks Only) \ - 30 seconds
/____________________________\ - Every commit
- Pure business logic
What Each Docker Tool Actually Does
LocalStack — Emulates AWS services locally in a Docker container.
# docker-compose.test.yml
services:
localstack:
image: localstack/localstack
ports:
- "4566:4566" # All AWS services on one port
environment:
- SERVICES=dynamodb,s3,sqs,kinesis
- DEFAULT_REGION=us-east-1
Tests point the AWS SDK at http://localhost:4566 instead of real AWS. They create tables, put items, and run queries with real DynamoDB semantics (ordering, consistent reads, conditional writes) at zero AWS cost and with zero shared state between test runs.
TestContainers — Spins up real service containers programmatically from test code.
# In your test file
import time

from testcontainers.redis import RedisContainer
def test_cache_ttl_behavior():
with RedisContainer("redis:7.2") as redis:
client = redis.get_client()
client.set("key", "value", ex=2) # 2-second TTL
assert client.get("key") == b"value"
time.sleep(3)
assert client.get("key") is None # TTL expired — this is REAL Redis behavior
WireMock — Programmable HTTP stub server for deterministic failure testing.
{
"mappings": [
{
"scenarioName": "retry-test",
"requiredScenarioState": "Started",
"newScenarioState": "First-Failure",
"request": { "method": "POST", "url": "/api/inference" },
"response": { "status": 503, "fixedDelayMilliseconds": 5000 }
},
{
"scenarioName": "retry-test",
"requiredScenarioState": "First-Failure",
"newScenarioState": "Success",
"request": { "method": "POST", "url": "/api/inference" },
"response": { "status": 200, "body": "{\"result\": \"ok\"}" }
}
]
}
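A pytest sketch that exercises the scenario above. It assumes the wiremock service from docker-compose.test.yml is running with port 8089 mapped and the mapping files loaded:
# Retry choreography test against the WireMock scenario defined above
import requests

def test_retry_recovers_after_first_failure():
    url = "http://localhost:8089/api/inference"
    # Scenario state "Started": respond 503 after a 5-second stall
    first = requests.post(url, json={"prompt": "hi"}, timeout=10)
    assert first.status_code == 503
    # Scenario advanced to "First-Failure": the retry succeeds
    second = requests.post(url, json={"prompt": "hi"}, timeout=10)
    assert second.status_code == 200
    assert second.json() == {"result": "ok"}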
Real Bugs Docker Integration Tests Caught
| Bug | Why mocks missed it | How Docker caught it |
|---|---|---|
| DynamoDB item ordering | Mock returned items in insertion order | LocalStack returned items in hash-key order, exposing sort assumption |
| Redis TTL race condition | Mock get always returned value | Real Redis expired the key between set and get under test timing |
| OpenSearch mapping conflict | Mock accepted any document structure | Real OpenSearch rejected a field type change, exposing a migration gap |
| Circuit breaker not tripping | Mock returned errors instantly | WireMock added realistic 5-second timeout, revealing circuit breaker timeout was set too low |
| SageMaker cold-start SLA miss | Mock returned in 50ms | Real staging endpoint returned in 3 seconds on cold path, exposing missing retry logic |
Interview Questions And Answers
Q1: "How do you decide what to mock vs what to test with real dependencies?"
Strong answer: "I use a behavior-based rule. If I'm testing my code's branching logic — what happens when the input is null, what happens when the amount is negative — mocks are perfect. But if I'm testing interaction behavior — serialization format, ordering guarantees, TTL behavior, connection pooling, error response codes — I need a real dependency. Mocks encode my ASSUMPTIONS about the dependency. Docker containers encode the dependency's ACTUAL behavior. The risk of mocks is that your assumptions are wrong and you don't find out until production."
Q2: "Isn't running Docker in CI slow and flaky?"
Strong answer: "Two separate concerns. Speed: LocalStack container starts in 3-5 seconds, Redis in 1-2 seconds, OpenSearch in 8-10 seconds. Total CI overhead is 15-20 seconds for container startup. Our integration suite runs in ~5 minutes including startup. That's fast enough for per-PR execution. Flakiness: The key is that these containers are disposable — each test run gets fresh containers with no leftover state. This is actually LESS flaky than shared staging environments where concurrent PR tests interfere with each other. The only shared dependency in our pipeline is the staging SageMaker endpoint, which we protect with test isolation via unique request IDs."
Q3: "What is docker-compose and how does it help testing?"
Strong answer:
"docker-compose lets you define and run multi-container applications with a single YAML file. For testing, it's invaluable because you can spin up your entire test dependency stack — LocalStack, Redis, OpenSearch, WireMock — with one command. Each service is on a Docker network so they can communicate. The test runner connects to these services by container name. After tests complete, docker-compose down destroys everything — clean slate for the next run. In CI, this means no test pollution between runs and no shared infrastructure to maintain."
Q4 (Basics): "What is Docker Compose?"
Answer:
"Docker Compose is a tool for defining multi-container applications. You write a docker-compose.yml file that describes all your services, their images, ports, volumes, environment variables, and network connections. Then docker-compose up starts everything together. Key features: service dependency ordering (depends_on), shared networks (services can talk by name), volume mounts, and environment variable injection. It's the bridge between running a single container and running an orchestrated application."
Q5 (Basics): "What is a Docker network?"
Answer:
"Docker networks provide isolated communication channels between containers. By default, Docker creates a bridge network where containers can communicate via IP addresses. With docker-compose or user-defined networks, containers can communicate by service NAME (DNS resolution). Network types: bridge (default, single host), host (container shares host's network stack), overlay (multi-host, used in Swarm/Kubernetes), none (no networking). For testing, the bridge network lets LocalStack, Redis, and your app container all talk to each other in isolation."
Q6 (Follow-up): "How do you handle test data setup and teardown in Docker integration tests?"
Answer:
"Three strategies. (1) Container-per-test-suite — start fresh containers before the suite, destroy after. Clean state guaranteed but slower if you have many suites. (2) Namespace isolation — use unique table names, key prefixes, or index names per test. Containers persist across tests but data doesn't collide. This is faster for large test suites. (3) Truncate between tests — clear data between tests but keep containers running. Fastest but requires careful cleanup code. We used strategy 2 for most tests — each test function generates a unique DynamoDB table name like test_users_{uuid}, so tests run in parallel without interference."
Q7 (Follow-up): "Why not just use staging for everything?"
Answer: "Three problems with staging-for-everything. (1) Speed — staging tests require network calls to remote AWS services. Our DynamoDB integration test takes 50ms locally via LocalStack but 200-400ms against real DynamoDB. Across 200 tests, that's 10 seconds vs 80 seconds just for network overhead. (2) Isolation — if 10 developers push PRs simultaneously, their staging tests can interfere. One dev's test creates data that another dev's test reads unexpectedly. (3) Cost — running real AWS services 24/7 for testing costs real money. LocalStack costs zero. We reserve staging for the things that MUST be tested against real AWS: SageMaker inference latency and production-specific service behaviors."
Q8 (Follow-up): "How do you ensure LocalStack behaves like real AWS?"
Answer: "You don't fully — and that's the key insight. LocalStack is 90-95% compatible for common operations. It covers DynamoDB CRUD, S3 operations, SQS messaging, and basic Kinesis streams. But it doesn't perfectly replicate throttling behavior, IAM policy evaluation, or eventual consistency timing. Our strategy: use LocalStack for functional correctness (does my query return the right data?), and use real AWS services in staging for operational correctness (does my retry handle throttling?). The LocalStack tests catch 80% of bugs at 1% of the cost. The staging tests catch the remaining 20%."
Group Follow-Up Panel: Rapid-Fire Probing Questions
Interviewer 1 (Lead QA): "How do you handle test ordering and parallelism with Docker containers? If two test suites run in parallel, do they stomp on each other's data?"
Strong answer: "Two strategies. For LocalStack/DynamoDB, each test suite creates tables with a UUID prefix — test_abc123_conversations vs test_def456_conversations. Suites run in parallel against the same LocalStack container but with completely isolated namespaces. For Redis, we use different database numbers (Redis supports 16 DBs: 0-15) or key prefixes. The test harness assigns a unique prefix per suite at startup. Teardown is optional since the containers are disposable — when CI finishes, docker-compose down destroys everything."
# Test isolation with unique namespaces per test suite
import pytest
import uuid
import boto3
@pytest.fixture(scope="session")
def dynamodb_table():
"""Create an isolated DynamoDB table for this test session."""
session_id = uuid.uuid4().hex[:8]
table_name = f"test_{session_id}_conversations"
# Connect to LocalStack
dynamodb = boto3.resource(
"dynamodb",
endpoint_url="http://localhost:4566",
region_name="us-east-1",
aws_access_key_id="test",
aws_secret_access_key="test",
)
# Create isolated table
table = dynamodb.create_table(
TableName=table_name,
KeySchema=[
{"AttributeName": "conversation_id", "KeyType": "HASH"},
{"AttributeName": "turn_id", "KeyType": "RANGE"},
],
AttributeDefinitions=[
{"AttributeName": "conversation_id", "AttributeType": "S"},
{"AttributeName": "turn_id", "AttributeType": "N"},
],
BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()
yield table
# Cleanup (optional — container dies anyway)
table.delete()
@pytest.fixture(scope="session")
def redis_client():
"""Isolated Redis client with unique key prefix."""
import redis
session_id = uuid.uuid4().hex[:8]
client = redis.Redis(host="localhost", port=6379, db=0)
class PrefixedRedis:
"""Wraps Redis client to add session prefix to all keys."""
def __init__(self, client, prefix):
self._client = client
self._prefix = f"test:{prefix}:"
def get(self, key):
return self._client.get(f"{self._prefix}{key}")
def set(self, key, value, **kwargs):
return self._client.set(f"{self._prefix}{key}", value, **kwargs)
def setex(self, key, ttl, value):
return self._client.setex(f"{self._prefix}{key}", ttl, value)
def delete(self, *keys):
return self._client.delete(*[f"{self._prefix}{k}" for k in keys])
yield PrefixedRedis(client, session_id)
Interviewer 2 (DevOps): "Your CI runs docker-compose. What about GitHub Actions or CI systems that don't easily support Docker-in-Docker?"
Strong answer: "GitHub Actions runners DO support Docker natively — they run on Azure VMs with Docker pre-installed. No Docker-in-Docker needed. The CI job just runs docker-compose up -d, waits for health checks, runs tests, then docker-compose down. For environments that truly lack Docker (some corporate CI systems), we use TestContainers, which manages container lifecycle from test code itself — it automatically pulls images, starts containers, waits for readiness, and cleans up. TestContainers works anywhere a Docker daemon is reachable, even remotely."
# .github/workflows/integration-tests.yml
name: Integration Tests
on: [pull_request]
jobs:
unit-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- run: pip install -r requirements-test.txt
- run: pytest tests/unit --tb=short -q # 30 seconds
integration-tests:
runs-on: ubuntu-latest
needs: unit-tests # Only run if unit tests pass
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
# Start all test dependencies
- name: Start test infrastructure
run: |
docker-compose -f docker-compose.test.yml up -d
          # Poll until services are ready: a single exec right after `up -d`
          # can race the containers' healthchecks
          for i in $(seq 1 30); do
            docker-compose -f docker-compose.test.yml exec -T redis redis-cli ping \
              && docker-compose -f docker-compose.test.yml exec -T localstack \
                   awslocal dynamodb list-tables \
              && break
            sleep 2
          done
- name: Run integration tests
run: |
pip install -r requirements-test.txt
pytest tests/integration \
--tb=short -q \
--timeout=60 \
-x # Stop on first failure
- name: Teardown
if: always()
run: docker-compose -f docker-compose.test.yml down -v
# docker-compose.test.yml — CI test infrastructure
version: "3.8"
services:
localstack:
image: localstack/localstack:3.0
ports:
- "4566:4566"
environment:
- SERVICES=dynamodb,s3,sqs,kinesis
- DEFAULT_REGION=us-east-1
- EAGER_SERVICE_LOADING=1 # Start all services immediately
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:4566/_localstack/health"]
interval: 5s
timeout: 3s
retries: 10
redis:
image: redis:7.2-alpine
ports:
- "6379:6379"
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 5
opensearch:
image: opensearchproject/opensearch:2.11.0
ports:
- "9200:9200"
environment:
- discovery.type=single-node
- DISABLE_SECURITY_PLUGIN=true
- OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9200/_cluster/health"]
interval: 10s
timeout: 5s
retries: 10
wiremock:
image: wiremock/wiremock:3.3.1
ports:
- "8089:8080"
volumes:
- ./tests/wiremock:/home/wiremock # Scenario JSON files
command: ["--verbose"]
Interviewer 3 (Architect): "WireMock for failure choreography is clever. Can you give me a concrete example of a bug WireMock caught that mocks would have missed?"
Strong answer: "Our circuit breaker was configured with a 5-second timeout. We had a test: 'if downstream returns 503, retry 3 times then open the breaker.' With mocks, the 503 returned instantly — no real delay. The test passed. But in production, the downstream took 4.8 seconds before returning 503. Three retries × 4.8 seconds = 14.4 seconds total before the breaker opened. Our SLA was 10 seconds. WireMock let us script: 'delay 4800ms, then return 503.' The test immediately revealed that our retry timeout was too generous — we needed per-attempt timeouts, not just a global retry count."
// tests/wiremock/mappings/circuit-breaker-test.json
// Simulates a slow-then-failing downstream service
{
"mappings": [
{
"scenarioName": "slow-503-circuit-breaker",
"requiredScenarioState": "Started",
"newScenarioState": "Attempt-2",
"request": {
"method": "POST",
"url": "/api/v1/inference"
},
"response": {
"status": 503,
"fixedDelayMilliseconds": 4800,
"headers": { "Content-Type": "application/json" },
"body": "{\"error\": \"Service Unavailable\"}"
}
},
{
"scenarioName": "slow-503-circuit-breaker",
"requiredScenarioState": "Attempt-2",
"newScenarioState": "Attempt-3",
"request": {
"method": "POST",
"url": "/api/v1/inference"
},
"response": {
"status": 503,
"fixedDelayMilliseconds": 4800,
"body": "{\"error\": \"Service Unavailable\"}"
}
},
{
"scenarioName": "slow-503-circuit-breaker",
"requiredScenarioState": "Attempt-3",
"newScenarioState": "Recovered",
"request": {
"method": "POST",
"url": "/api/v1/inference"
},
"response": {
"status": 200,
"body": "{\"result\": \"recovered\"}"
}
}
]
}
# Test that caught the circuit breaker timing bug
import pytest
import time
import httpx
@pytest.mark.asyncio  # async test functions need pytest-asyncio (or equivalent)
async def test_circuit_breaker_respects_per_attempt_timeout():
"""
Bug this caught: Circuit breaker had 3 retries with NO per-attempt timeout.
Downstream took 4.8s per 503 → total 14.4s before breaker opened.
SLA was 10 seconds. This test enforces per-attempt timeout of 3 seconds.
"""
start = time.monotonic()
    async with httpx.AsyncClient(timeout=15.0) as client:  # above the 10s SLA so the assert below reports the breach
response = await client.post(
"http://localhost:8080/api/chat",
json={"query": "test", "history": []},
)
elapsed = time.monotonic() - start
# Circuit breaker should open within 10 seconds (our SLA)
assert elapsed < 10.0, (
f"Circuit breaker took {elapsed:.1f}s — exceeds 10s SLA. "
"Check per-attempt timeout configuration."
)
# Should get a graceful degradation response, NOT a timeout error
assert response.status_code in (200, 503)
if response.status_code == 503:
assert "capacity" in response.json().get("error", "")
Interviewer 4 (Security): "Your Docker test containers use aws_access_key_id='test'. What if someone accidentally commits real AWS credentials in a test file?"
Strong answer: "Three safeguards. (1) git-secrets pre-commit hook scans for patterns matching AWS key IDs (AKIA...) and secret keys (40-char base64 strings). Blocks the commit before it reaches the repo. (2) Our test conftest.py explicitly sets AWS_DEFAULT_REGION=us-east-1 and AWS_ENDPOINT_URL=http://localhost:4566 in the process environment — even if real credentials leak, all boto3 calls go to LocalStack, not real AWS. (3) CI runner IAM role has zero permissions to production accounts. The role can only access the CI account's ECR and CloudWatch."
Interviewer 5 (Performance): "5 minutes for integration tests seems fast. What happens as you add more test cases? How do you keep it from growing to 30 minutes?"
Strong answer: "Three strategies to keep CI fast as tests grow. (1) Parallel execution — pytest-xdist runs test files in parallel across multiple processes. Since each test uses namespaced tables/keys, there's no collision. (2) Container reuse — we start containers once at the beginning of the suite, not per test. Container startup is ~15 seconds total; amortized across 200 tests, it's negligible. (3) Test categorization — we tag tests as @pytest.mark.fast (under 1s) and @pytest.mark.slow (over 5s). On PR pushes, only fast integration tests run. The full suite runs on merge to main. This keeps PR feedback under 3 minutes."
Critical Points To Remember — Docker-LLD-5
╔══════════════════════════════════════════════════════════════════════╗
║ CRITICAL POINTS — LLD-5 ║
╠══════════════════════════════════════════════════════════════════════╣
║ ║
║ 1. MOCKS TEST YOUR CODE. DOCKER TESTS THE INTERACTION. ║
║ • Mock = your assumption about dependency behavior ║
║ • Docker = dependency's ACTUAL behavior ║
║ • Mocks are blind to: ordering, TTL, serialization, latency ║
║ ║
║ 2. THE TEST TIER RULE ║
║ • Unit (mocks): business logic → 30 seconds ║
║ • Docker integration: serialization/TTL/retry → 5 minutes ║
║ • Staging: ML latency/cold-start → 10 minutes ║
║ • Place each test at the CHEAPEST tier that catches the bug ║
║ ║
║ 3. LOCALSTACK IS 90-95% COMPATIBLE, NOT 100% ║
║ • Covers: CRUD, basic queries, S3 operations ║
║ • Misses: throttling, IAM policy eval, eventual consistency ║
║ • Use for functional correctness, not operational correctness ║
║ ║
║ 4. TEST ISOLATION = NAMESPACE PER TEST, NOT CONTAINER PER TEST ║
║ • UUID-prefixed table names, key prefixes, DB numbers ║
║ • Enables parallelism without data collision ║
║ • Container startup is expensive; namespace creation is free ║
║ ║
║ 5. WIREMOCK FOR FAILURE CHOREOGRAPHY ║
║ • Script exact failure sequences: 503 → 503 → 200 ║
║ • Add realistic delays (4.8s timeout, not instant) ║
║ • Catches timing bugs that instant mocks ALWAYS miss ║
║ ║
║ 6. CI MUST BE FAST OR NOBODY RUNS IT ║
║ • Container reuse (start once, test many) ║
║ • Parallel test execution (pytest-xdist) ║
║ • Test categorization (fast on PR, full on merge) ║
║ • Target: <5 minutes for PR feedback loop ║
║ ║
║ 7. KEEP STAGING FOR WHAT DOCKER CAN'T FAKE ║
║ • Real SageMaker inference latency ║
║ • Real cold-start behavior ║
║ • Real throttling and rate limiting ║
║ • Everything else → Docker containers in CI ║
║ ║
╚══════════════════════════════════════════════════════════════════════╝
Docker-LLD-6: Container Supply-Chain Security And Promotion Gating
Objective
Treat container images as governed release artifacts so production only runs artifacts whose contents, provenance, and promotion evidence are all verifiable.
Promotion Pipeline
flowchart LR
Src["Reviewed source"] --> Lock["Lockfile + hashes"]
Lock --> Build["Deterministic build"]
Build --> Scan["SCA / CVE scan"]
Scan --> SBOM["SBOM + dependency diff"]
SBOM --> Sign["Signature + provenance attestation"]
Sign --> ECR["Publish immutable image"]
ECR --> Policy{"Promotion policy passed?"}
Policy -->|No| Block["Block release"]
Policy -->|Yes| Canary["Canary rollout"]
Canary --> Healthy{"Healthy?"}
Healthy -->|No| Rollback["Rollback to previous signed artifact"]
Healthy -->|Yes| Prod["Promote to production"]
Decision Evolution
| Step | New decision | Why the previous decision was not enough | Improvement introduced |
|---|---|---|---|
| D0 | Scan images in the registry | Scanning alone can say "this image has a CVE" but not "exactly what inputs produced it" or "what changed from the last release." | Identified that image trust required more than vulnerability detection. |
| D1 | Pin dependencies with hashes and install from a private mirror | Public-internet installs and floating dependencies make builds nondeterministic. | Build inputs became reproducible and easier to audit. |
| D2 | Generate SBOM, dependency diff, provenance, and signature for each build | A lockfile alone does not give release-level traceability after the image is published. | Every artifact gained evidence that can be checked at promotion time and during incident response. |
| D3 | Enforce fail-closed promotion policy | Evidence that exists but is not enforced still allows risky releases through. | Missing signature, missing SBOM, open policy-violating CVEs, or missing evals block release. |
| D4 | Roll back only to the previous signed artifact | Ad hoc rollback can reintroduce untrusted or ambiguous artifacts. | Containment became deterministic and fast. |
Build-Time Contract
| Control | Decision | Why it improved the prior option |
|---|---|---|
| Dependency source | Private mirror only | Better than public registry installs because approved bytes are controlled and cached. |
| Dependency lock | Exact versions plus hashes | Better than version pinning alone because the approved artifact bytes are also fixed. |
| SBOM generation | Mandatory in CI | Better than post-hoc inventory because the release artifact and evidence are created together. |
| Provenance | Signed build attestation | Better than unauthenticated build logs because artifact origin becomes machine-verifiable. |
| Promotion policy | Fail closed | Better than manual review because missing evidence blocks automatically. |
Runtime And Incident Design
- CI resolves dependencies from approved inputs only.
- The build produces a container image plus attached evidence.
- ECR stores the immutable image digest and scan metadata.
- Promotion policy checks signature, SBOM, dependency diff, CVE status, evaluation results, and approvals.
- Canary rollout validates health before production promotion.
- If trust or health is lost, deployment rolls back to the previous signed artifact.
Failure Modes And Controls
| Failure mode | User-facing risk | Control |
|---|---|---|
| Floating dependency or tampered package | Runtime compromise or silent regression | Lockfile with hashes plus private mirror. |
| Registry scan passes but provenance is unknown | No trustworthy answer to "what is running?" | Mandatory SBOM, provenance, and signature. |
| Missing evidence but manual pressure to ship | Risky production deployment | Fail-closed promotion policy. |
| Critical CVE discovered after release | Slow and guess-based response | SBOM-based triage identifies affected releases quickly, then rollback or rebuild follows. |
Improvement Evidence
| Capability | Scan-only pipeline | Governed artifact pipeline |
|---|---|---|
| Vulnerability detection | Yes | Yes |
| Exact dependency inventory | Weak | Strong via SBOM and lockfile |
| Artifact authenticity | Weak | Strong via signature and provenance |
| Promotion enforcement | Mostly manual | Automatic fail-closed policy |
| Rollback confidence | Variable | Deterministic to previous signed digest |
Design Lesson
The stronger design is not "we scan Docker images." It is "we can prove what is inside the image, how it was built, why it was promoted, and exactly what trusted artifact we will roll back to."
Deep Dive: Group Discussion — Why Image Scanning Alone Is A False Sense Of Security
Engineer A (Security): "Our pipeline scans every image with Trivy before pushing to ECR. We're secure, right?"
Engineer B (Staff): "What happens when Trivy says the image is clean, but someone injected a malicious package that isn't in any CVE database yet?"
Engineer A: "Well... that wouldn't be caught by scanning."
Engineer C (Supply Chain): "This is the fundamental problem. Vulnerability scanning answers ONE question: 'Does this image contain packages with KNOWN vulnerabilities?' It does NOT answer: (1) How was this image built? (2) What exact source code and dependencies went into it? (3) Has anyone tampered with it between build and deployment? (4) Can we reproduce it? These are provenance and integrity questions — scanning doesn't touch them."
Engineer D (Incident Response): "And during an incident, the first question is 'what is running in production RIGHT NOW?' If all you have is a scan report, you know the image was clean when scanned. But you don't know if the image in production is the same one you scanned. You don't know exactly which versions of every transitive dependency are inside. Without an SBOM, you're doing forensics in the dark."
Engineer E (Compliance): "We also need to answer auditor questions like: 'Show me the chain of evidence from source code commit to production deployment for this artifact.' Scanning gives you one link. We need the full chain."
The Kill Chain A Scanning-Only Pipeline Misses
Supply Chain Attack Vector:
1. Attacker compromises a popular PyPI package (e.g., typosquatting)
2. Your CI pulls the latest version during `pip install`
3. The malicious package has no known CVE yet → scanner says CLEAN ✓
4. Image is pushed to ECR → scanner says CLEAN ✓
5. Image is deployed to production → scanner says CLEAN ✓
6. Malicious code executes → data exfiltration begins
What WOULD have caught this with a governed pipeline:
Step 2: Private mirror doesn't have the package → build FAILS ✗
OR: Lockfile hash doesn't match → build FAILS ✗
OR: SBOM diff shows unexpected new dependency → human reviews
OR: Dependency diff flags the new package → promotion BLOCKED
Each layer catches what the previous one might miss.
Anatomy Of A Governed Container Release
Source ──► Build ──► Verify ──► Publish ──► Gate ──► Deploy
Source:
├── Code reviewed and merged to main
├── requirements.txt with exact versions AND hashes
│ Example: flask==3.0.0 --hash=sha256:abc123...
└── All deps available in private mirror
Build:
├── Deterministic: same inputs → same image bytes
├── pip install from private mirror ONLY (no pypi.org)
├── Multi-stage build: build deps never reach runtime image
└── Image tagged with git SHA (immutable)
Verify:
├── CVE scan (Trivy, Grype, or Snyk)
├── SBOM generation (Syft → CycloneDX format)
├── Dependency diff vs previous release
│ "These 3 packages changed versions: X 1.2→1.3, Y 2.0→2.1, Z 3.0 (NEW)"
├── Provenance attestation (Sigstore/cosign)
│ "Built by GitHub Actions run #4521 from commit abc123"
└── Digital signature (cosign sign)
Gate (all must pass — fail-closed):
├── ✓ No critical/high CVEs
├── ✓ SBOM present and valid
├── ✓ Signature verified
├── ✓ Provenance attestation present
├── ✓ Dependency diff reviewed (if new deps)
├── ✓ ML eval results present and passing
└── ✗ ANY missing evidence → release BLOCKED
Deploy:
├── Canary: 5% traffic for 15 minutes
├── Monitor error rate, latency, GPU metrics
├── If healthy → promote to 100%
└── If unhealthy → automatic rollback to PREVIOUS SIGNED DIGEST
Why Lockfiles With Hashes Matter
# WITHOUT hashes (dangerous):
flask==3.0.0
# pip downloads flask 3.0.0 from PyPI
# But how do you know the bytes haven't been tampered with?
# Answer: You don't.
# WITH hashes (safe):
flask==3.0.0 \
--hash=sha256:21c0527d5fce083e3dc580fa2b28db8c6f2add8a3e8... \
--hash=sha256:7e8b2cdc7e7f5f2...
# pip downloads flask 3.0.0 AND verifies the SHA256 hash
# If the bytes don't match → install FAILS
# Tampered package → caught immediately
The hash locks the exact bytes, not just the version number. If a PyPI maintainer's account is compromised and a modified flask-3.0.0 is uploaded, the hash check fails and your build breaks — which is exactly what you want. A broken build is infinitely better than a compromised production deployment.
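One common way to produce and enforce such a lockfile is pip-tools; a sketch, where requirements.in is the conventional input file name:
# requirements.in lists only direct dependencies; pip-compile resolves the
# full tree and records a SHA256 hash for every artifact it pins
pip install pip-tools
pip-compile --generate-hashes requirements.in -o requirements.txt

# --require-hashes makes pip refuse any package without a matching hash
pip install --require-hashes -r requirements.txt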
Interview Questions And Answers
Q1: "What is an SBOM and why does it matter?"
Strong answer: "SBOM — Software Bill of Materials — is a complete inventory of every component inside a software artifact. For a Docker image, it lists every OS package, Python package, their exact versions, licenses, and source locations. It matters for three reasons: (1) Vulnerability response — when a new CVE drops (like Log4Shell), you can instantly query 'which of our production images contain this package?' without manually inspecting each image. Response time drops from hours to seconds. (2) Compliance — regulations like Executive Order 14028 require SBOMs for software sold to the US government. (3) Incident forensics — during a security incident, the SBOM tells you exactly what was running, not what you think was running."
Q2: "Explain fail-closed vs fail-open promotion."
Strong answer: "Fail-closed means: if ANY required evidence is missing or invalid, the release is BLOCKED by default. The release needs positive proof to proceed. Fail-open means: the release proceeds unless something actively blocks it. Fail-open is dangerous because it means a broken scanner, a missing attestation, or a misconfigured policy silently allows an unverified artifact into production. We chose fail-closed because the cost of a delayed release (minutes to hours while you fix the evidence) is far, far lower than the cost of a compromised production deployment."
Q3: "How do you handle a critical CVE discovered after a release is already in production?"
Strong answer: "Three-step process. Step 1: Triage — query the SBOM database for all production images containing the affected package. This tells us exactly which services are impacted and at what version. Without SBOM, this step alone could take hours. Step 2: Decision — if the CVE is exploitable in our context, we rebuild the affected images with the patched dependency. If it's not exploitable (e.g., a vulnerability in a CLI tool we don't invoke), we schedule the fix for the next release. Step 3: Deploy — the rebuild goes through the same pipeline: scan, SBOM, sign, gate, canary. If urgency demands it, we can fast-track the canary window from 15 minutes to 5 minutes, but we never skip the pipeline entirely."
Q4 (Basics): "What is Docker image layering?"
Answer: "A Docker image is built from layers. Each instruction in the Dockerfile (FROM, RUN, COPY, ADD) creates a new read-only layer. Layers are stacked — each layer contains only the diff from the layer below it. Docker caches layers and reuses them across builds. If you change line 10 of your Dockerfile, Docker reuses cached layers 1-9 and only rebuilds from line 10 onward. This is why you should put things that change infrequently (OS packages, dependencies) at the top and things that change frequently (your application code) at the bottom. Layer caching can reduce build times from 5 minutes to 30 seconds."
Q5 (Basics): "What is a Docker registry?"
Answer: "A Docker registry is a storage and distribution service for Docker images. Docker Hub is the public default registry. AWS ECR (Elastic Container Registry), GCR, and Azure ACR are cloud-provider registries. You push images to a registry and pull them from it. Key features of ECR: private repositories, IAM-based access control, integrated vulnerability scanning, immutable image tags (prevent overwriting), and cross-region replication. In production, you never pull from Docker Hub directly — you use a private registry to control what images are available."
Q6 (Follow-up): "How do you ensure builds are deterministic?"
Answer: "Five practices. (1) Pin base images by digest, not tag — FROM python:3.11@sha256:abc123... instead of FROM python:3.11 because tags can be overwritten. (2) Lockfile with hashes — exact dependency versions plus byte-level verification. (3) Private mirror — all packages come from a controlled source, not the live public internet. (4) No network access during build (after dependency install) — prevents any dynamic fetching. (5) Reproducible build flags — set SOURCE_DATE_EPOCH for timestamp normalization. The goal: running the same build twice or on different machines produces bit-for-bit identical images."
Q7 (Follow-up): "What is image signing and cosign?"
Answer: "Image signing cryptographically proves that an image was built by your CI system and hasn't been tampered with since. Cosign (from the Sigstore project) is the standard tool. It generates a digital signature over the image digest and stores it alongside the image in the registry. At deployment time, a policy engine (like Kyverno or OPA Gatekeeper) verifies the signature before allowing the image to run. If someone pushes a modified image directly to ECR bypassing the pipeline, it won't have a valid signature and deployment will be blocked. This prevents both insider threats and compromised registry accounts."
Q8 (Follow-up): "Walk me through what happens during a rollback."
Answer: "Our rollback is deterministic because every production image is signed and stored by digest. Step 1: The canary detects degradation (error rate spike, latency increase, or GPU metric anomaly). Step 2: The deployment controller queries the promotion database for the previous deployment's image digest — not a mutable tag, the actual SHA256 digest. Step 3: It verifies the previous image's signature is still valid (hasn't been revoked). Step 4: It redeploys the previous image. Step 5: It runs the same canary health check to confirm the rollback is healthy. Total rollback time: under 3 minutes for Fargate, under 5 minutes for SageMaker endpoints. The critical insight: we never roll back to 'latest' or 'previous tag' — we roll back to a specific verified artifact."
Group Follow-Up Panel: Rapid-Fire Probing Questions
Interviewer 1 (Security Architect): "You sign images with cosign. Where is the private key stored? If someone compromises the key, can they push malicious signed images?"
Strong answer: "We use keyless signing via Sigstore's Fulcio CA — there IS no persistent private key. During CI, the build job authenticates via OIDC (GitHub Actions identity token), and Fulcio issues an ephemeral signing certificate tied to that identity. The signature is recorded in the Rekor transparency log, creating a tamper-evident audit trail. Even if someone compromises our CI workflows, the Rekor log creates a permanent public record of every signed artifact. You can verify not just 'was this signed' but 'was this signed by GitHub Actions run #4521 on repo X at commit Y.' An attacker would have to compromise both our CI and the public transparency log — which is append-only."
# Keyless signing workflow with cosign + Sigstore
# This runs in GitHub Actions CI after image build
# Step 1: Build and push image
IMAGE="123456789.dkr.ecr.us-east-1.amazonaws.com/chatbot-orchestrator"
TAG="${GITHUB_SHA}"
docker build -t "${IMAGE}:${TAG}" .
docker push "${IMAGE}:${TAG}"
# Step 2: Get the image digest (immutable reference)
DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' "${IMAGE}:${TAG}")
echo "Image digest: ${DIGEST}"
# Step 3: Sign with keyless cosign. In GitHub Actions, cosign picks up the
# ambient OIDC identity token automatically; no key or issuer flag is needed.
cosign sign --yes "${DIGEST}"
# Step 4: Generate the SBOM and attach it as an attestation
syft "${DIGEST}" -o cyclonedx-json > sbom.json
cosign attest --yes --predicate sbom.json --type cyclonedx "${DIGEST}"
# Step 5: Verify (this is what the promotion gate runs)
cosign verify \
--certificate-oidc-issuer=https://token.actions.githubusercontent.com \
--certificate-identity-regexp="github.com/our-org/our-repo" \
"${DIGEST}"
echo "Signature verified. SBOM attached. Ready for promotion gate."
Interviewer 2 (Compliance): "An auditor asks: 'Show me evidence that image X in production was built from source commit Y, passed all quality gates, and no one tampered with it between build and deployment.' How do you answer?"
Strong answer: "I run a single query chain. (1) From the production deployment, I get the running image DIGEST (sha256:abc123...). (2) From ECR, I pull the SBOM attestation attached to that digest — it lists every dependency, version, and license. (3) From the Rekor transparency log, I retrieve the signing certificate — it shows the GitHub Actions run ID, repository, and commit SHA that produced the image. (4) From the promotion database, I pull the gate record — it shows: CVE scan passed, SBOM present, signature verified, ML eval passed, approval granted, canary healthy. Each piece is cryptographically linked: the SBOM is attested to the digest, the signature binds the digest to the build identity, and the promotion record gates the deployment. The entire chain is machine-verifiable."
# Automated audit trail query — used by compliance dashboard
import base64
import json
import subprocess
import boto3
def audit_image(image_digest: str) -> dict:
"""
Given a production image digest, reconstruct the full
evidence chain from source to deployment.
"""
audit = {"digest": image_digest, "evidence": {}}
    # 1. Verify signature and get build identity.
    # cosign verify prints the verified signature payload as JSON on stdout,
    # so it can be parsed directly; check the return code before parsing.
    result = subprocess.run([
        "cosign", "verify",
        "--certificate-oidc-issuer=https://token.actions.githubusercontent.com",
        "--certificate-identity-regexp=github.com/our-org/.*",
        image_digest,
    ], capture_output=True, text=True)
    signature = {"verified": result.returncode == 0}
    if signature["verified"]:
        sig_info = json.loads(result.stdout)
        # Keyless identity details live under "optional"; exact keys can vary
        # across cosign versions
        optional = sig_info[0].get("optional", {})
        signature["build_identity"] = optional.get("Subject")
        signature["issuer"] = optional.get("Issuer")
    audit["evidence"]["signature"] = signature
    # 2. Retrieve SBOM attestation.
    # verify-attestation emits a DSSE envelope; the in-toto statement (with
    # the CycloneDX document under "predicate") is base64-encoded in "payload".
    result = subprocess.run([
        "cosign", "verify-attestation",
        "--type", "cyclonedx",
        "--certificate-oidc-issuer=https://token.actions.githubusercontent.com",
        "--certificate-identity-regexp=github.com/our-org/.*",
        image_digest,
    ], capture_output=True, text=True)
    component_count = 0
    if result.returncode == 0:
        envelope = json.loads(result.stdout.splitlines()[0])
        statement = json.loads(base64.b64decode(envelope["payload"]))
        component_count = len(statement["predicate"].get("components", []))
    audit["evidence"]["sbom"] = {
        "present": result.returncode == 0,
        "component_count": component_count,
        "format": "CycloneDX",
    }
# 3. ECR vulnerability scan results
ecr = boto3.client("ecr")
scan = ecr.describe_image_scan_findings(
repositoryName="chatbot-orchestrator",
imageId={"imageDigest": image_digest.split("@")[1]}
)
findings = scan["imageScanFindings"]["findingSeverityCounts"]
audit["evidence"]["vulnerability_scan"] = {
"critical": findings.get("CRITICAL", 0),
"high": findings.get("HIGH", 0),
"medium": findings.get("MEDIUM", 0),
"scan_completed": scan["imageScanStatus"]["status"] == "COMPLETE",
}
# 4. Promotion gate record (from DynamoDB)
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("deployment-promotions")
gate_record = table.get_item(Key={"image_digest": image_digest})
if "Item" in gate_record:
audit["evidence"]["promotion_gate"] = {
"all_gates_passed": gate_record["Item"]["all_passed"],
"gates": gate_record["Item"]["gate_results"],
"promoted_at": gate_record["Item"]["promoted_at"],
"promoted_by": gate_record["Item"]["approved_by"],
}
return audit
Interviewer 3 (Incident Commander): "It's 2 AM. A critical CVE is published for a package in your inference container. Walk me through your response, step by step."
Strong answer:
02:00 PagerDuty alert fires from CVE feed integration
02:05 On-call opens SBOM dashboard, queries: "Which production images
contain package X at version Y?"
→ Results in 8 seconds: image chatbot-inference:sha256:abc123
deployed on production endpoint 'inference-prod'
02:10 Assess exploitability: Is the vulnerable code path reachable?
→ Check: Is this an HTTP parsing CVE? A compression library CVE?
→ If NOT reachable in our context → schedule patch for next release
→ If reachable → proceed to emergency patch
02:15 Branch from current release commit (from Rekor: commit sha def456)
Update dependency version in requirements-lock.txt
Regenerate lockfile hashes
02:20 CI builds new image → scan → SBOM → sign → all gates pass
02:25 Deploy to canary (5% traffic)
02:30 Canary healthy for 5 minutes → promote to 100%
02:35 Verify new image SBOM no longer contains vulnerable version
02:40 Update incident ticket. Stand down.
Total response: ~40 minutes from alert to patched production.
Without SBOM: Step 02:05 alone takes 2-4 hours (manual image inspection).
Interviewer 4 (DevOps Lead): "You use a private mirror for dependencies. How do you keep it updated? What happens when a developer needs a new package that isn't in the mirror?"
Strong answer: "The mirror syncs from PyPI on a daily schedule (automated cron job). The sync is not blind — it runs through an approved-packages list maintained in a Git repository. To add a new package: the developer submits a PR to add the package name and version to the approved list. The PR triggers an automated check: (1) Is the package on PyPI? (2) Does it have any critical CVEs? (3) What is its license? (Compatible with our project?) The mirror syncs only approved packages. This process takes 10-15 minutes, which is a minor friction that prevents supply-chain attacks."
Interviewer 5 (Platform): "Immutable tags in ECR — what exactly prevents someone from pushing a different image with the same tag? And what's the difference between a tag and a digest?"
Strong answer: "ECR's immutable tag feature is a registry policy that prevents overwriting an existing tag. Once v2.1.0 is pushed, any attempt to push a different image with the same tag returns an error. But tags are still just human-readable labels. The digest (sha256:abc123...) is the actual content hash — it's computed from the image manifest and is mathematically unique. Two identical images always have the same digest. Two different images cannot share a digest. Our pipeline ALWAYS references images by digest internally, even when we also apply tags for human readability. The promotion gate checks the digest, not the tag."
Code Examples: CI/CD Pipeline With Full Supply-Chain Security
# .github/workflows/secure-release.yml
name: Secure Container Release
on:
push:
branches: [main]
permissions:
id-token: write # Needed for keyless cosign signing
contents: read
packages: write
jobs:
build-and-verify:
runs-on: ubuntu-latest
outputs:
image-digest: ${{ steps.push.outputs.digest }}
steps:
- uses: actions/checkout@v4
# Build image
- name: Build container image
run: |
docker build \
--build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
--build-arg GIT_SHA=${{ github.sha }} \
-t chatbot-orchestrator:${{ github.sha }} .
      # Push to ECR (assumes earlier aws-actions/configure-aws-credentials and
      # amazon-ecr-login steps have authenticated the runner; not shown here)
      - name: Push to ECR
id: push
run: |
IMAGE="123456789.dkr.ecr.us-east-1.amazonaws.com/chatbot-orchestrator"
docker tag chatbot-orchestrator:${{ github.sha }} ${IMAGE}:${{ github.sha }}
docker push ${IMAGE}:${{ github.sha }}
DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' ${IMAGE}:${{ github.sha }})
echo "digest=${DIGEST}" >> $GITHUB_OUTPUT
# Vulnerability scan
- name: Scan for CVEs
uses: aquasecurity/trivy-action@master
with:
image-ref: ${{ steps.push.outputs.digest }}
severity: CRITICAL,HIGH
exit-code: 1 # Fail build on critical/high CVEs
      # Generate SBOM (assumes syft is available on the runner, e.g. via
      # anchore/sbom-action or a curl-based install step; not shown here)
      - name: Generate SBOM
run: |
syft ${{ steps.push.outputs.digest }} -o cyclonedx-json > sbom.json
echo "Components found: $(jq '.components | length' sbom.json)"
# Generate dependency diff vs previous release
- name: Dependency diff
run: |
# Fetch previous release SBOM from S3
aws s3 cp s3://release-artifacts/previous-sbom.json previous-sbom.json || true
if [ -f previous-sbom.json ]; then
python scripts/diff_sbom.py previous-sbom.json sbom.json > dep-diff.json
echo "Dependency changes:"
cat dep-diff.json
fi
      # Sign image (keyless via Sigstore)
      - name: Install cosign
        uses: sigstore/cosign-installer@v3
      - name: Sign image and attach SBOM
        run: |
cosign sign --yes ${{ steps.push.outputs.digest }}
cosign attest --yes --predicate sbom.json --type cyclonedx \
${{ steps.push.outputs.digest }}
# Promotion gate — fail-closed
promotion-gate:
needs: build-and-verify
runs-on: ubuntu-latest
    steps:
      - uses: sigstore/cosign-installer@v3 # this job runs cosign too
      - name: Verify all evidence exists
run: |
DIGEST="${{ needs.build-and-verify.outputs.image-digest }}"
echo "=== Checking signature ==="
cosign verify \
--certificate-oidc-issuer=https://token.actions.githubusercontent.com \
--certificate-identity-regexp="github.com/our-org/.*" \
"${DIGEST}" || { echo "SIGNATURE MISSING — BLOCKING RELEASE"; exit 1; }
echo "=== Checking SBOM attestation ==="
cosign verify-attestation --type cyclonedx \
--certificate-oidc-issuer=https://token.actions.githubusercontent.com \
"${DIGEST}" || { echo "SBOM MISSING — BLOCKING RELEASE"; exit 1; }
echo "=== All gates passed ==="
# Canary deployment
canary-deploy:
needs: [build-and-verify, promotion-gate]
runs-on: ubuntu-latest
steps:
      - name: Deploy canary (5% traffic)
        run: |
          # Task-definition revisions are numeric, so register a new revision
          # for the new image first (taskdef.json is assumed to be rendered
          # earlier with the verified digest), then point the service at it
          TASK_DEF_ARN=$(aws ecs register-task-definition \
            --cli-input-json file://taskdef.json \
            --query 'taskDefinition.taskDefinitionArn' --output text)
          aws ecs update-service \
            --cluster production \
            --service chatbot-orchestrator \
            --task-definition "${TASK_DEF_ARN}" \
            --deployment-configuration "maximumPercent=200,minimumHealthyPercent=100"
      # CodeDeploy handles canary traffic splitting
Critical Points To Remember — Docker-LLD-6
╔══════════════════════════════════════════════════════════════════════╗
║ CRITICAL POINTS — LLD-6 ║
╠══════════════════════════════════════════════════════════════════════╣
║ ║
║ 1. SCANNING ALONE IS NOT SECURITY ║
║ • Scanning detects KNOWN CVEs only ║
║ • It does NOT prove: provenance, integrity, or reproducibility ║
║ • You need: scan + SBOM + signature + provenance + policy ║
║ ║
║ 2. FAIL-CLOSED PROMOTION = DEFAULT DENY ║
║ • Missing signature → BLOCKED ║
║ • Missing SBOM → BLOCKED ║
║ • Open critical CVE → BLOCKED ║
║ • The absence of evidence IS the blocker ║
║ ║
║ 3. DIGEST > TAG (always) ║
║ • Tag = mutable label ("v2.1" can be overwritten) ║
║ • Digest = content hash (sha256:abc123 is immutable) ║
║ • All internal references use digest ║
║ • Tags are for human readability only ║
║ ║
║ 4. LOCKFILE + HASHES = BYTE-LEVEL REPRODUCIBILITY ║
║ • Version pin: "flask==3.0.0" (locks version) ║
║ • Hash pin: "--hash=sha256:abc..." (locks exact bytes) ║
║ • Tampered package → hash mismatch → build FAILS ║
║ • Without hashes, version pin alone is insufficient ║
║ ║
║ 5. SBOM IS YOUR INCIDENT RESPONSE SUPERPOWER ║
║ • New CVE announced → query SBOM DB → impacted images in <10s ║
║ • Without SBOM → manual inspection → hours of response time ║
║ • SBOM format: CycloneDX or SPDX (both machine-readable) ║
║ ║
║ 6. KEYLESS SIGNING (Sigstore/cosign) ║
║ • No persistent private key to steal ║
║ • OIDC identity (CI system identity) = signing identity ║
║ • Rekor transparency log = tamper-evident audit trail ║
║ • Verifiable: "This was signed by GH Actions run #X on repo Y" ║
║ ║
║ 7. ROLLBACK TO SIGNED DIGEST, NEVER TO TAG ║
║ • Promotion DB stores: digest + signature + gate evidence ║
║ • Rollback query: "previous deployment's verified digest" ║
║ • Verify signature on rollback target (hasn't been revoked) ║
║ • Total rollback: <3 min Fargate, <5 min SageMaker ║
║ ║
║ 8. PRIVATE MIRROR = SUPPLY CHAIN FIREWALL ║
║ • All packages from approved mirror, never live PyPI ║
║ • Approved packages list maintained in Git (PR-based process) ║
║ • Prevents typosquatting, dependency confusion, hijacking ║
║ ║
╚══════════════════════════════════════════════════════════════════════╝
Quick Reference: Docker Fundamentals For Interview Warm-Up
These are baseline questions interviewers use to gauge your Docker fluency before diving into scenarios.
Docker Basics Rapid Fire
| Question | Concise Answer |
|---|---|
| What is a container? | An isolated process with its own filesystem, network, and process namespace, sharing the host kernel |
| Container vs VM? | Containers share the host OS kernel (lightweight, seconds to start). VMs have their own kernel (heavier, minutes to start) |
| What is a Dockerfile? | A text file with instructions to build a Docker image, layer by layer |
| What is an image? | A read-only template (layered filesystem + metadata) used to create containers |
| What is a container registry? | A storage service for Docker images (Docker Hub, ECR, GCR) |
| What does `docker run` do? | Creates a new container from an image and starts it |
| What does `docker exec` do? | Runs a command inside an already-running container |
| EXPOSE vs port publishing? | EXPOSE documents which port the app uses. `-p 8080:80` actually maps a host port to a container port |
| ENV vs ARG? | ENV is available at runtime. ARG is only available during build time |
| .dockerignore purpose? | Excludes files from the build context (like .gitignore for Docker). Reduces build context size and prevents sensitive files from being copied |
Docker Networking Quick Reference
| Network Type | Use Case | Example |
|---|---|---|
| bridge | Default. Single-host container-to-container | Development, testing |
| host | Container shares host network stack | Performance-sensitive apps (no network overhead) |
| overlay | Multi-host networking | Docker Swarm, Kubernetes services |
| none | No networking | Security-sensitive workloads |
Docker Storage Quick Reference
| Storage Type | Persistence | Use Case |
|---|---|---|
| Container layer | Deleted with container | Temporary files, logs |
| Named volume | Survives container removal | Database storage, model weights |
| Bind mount | Host filesystem directly | Development (hot reload), config files |
| tmpfs | Memory only, never on disk | Sensitive data (secrets, tokens) |
Dockerfile Best Practices Checklist
- Use specific base image tags — `python:3.11.7-slim`, not `python:latest`
- Multi-stage builds — separate build and runtime stages
- Layer ordering — install dependencies before copying code (maximize cache hits)
- Minimize layers — combine related RUN commands with `&&`
- Non-root user — `RUN adduser --system app && USER app`
- COPY over ADD — unless you specifically need tar extraction
- Health checks — always include a HEALTHCHECK instruction
- .dockerignore — exclude `.git`, `node_modules`, `__pycache__`, `.env`
- No secrets in images — use build args for build-time secrets, or mount secrets at runtime
- Pin dependency versions — exact versions with hashes in lockfiles
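Pulling several checklist items together, a compact sketch; paths and the health endpoint are placeholders, and --require-hashes assumes a hash-pinned requirements.txt as described in Docker-LLD-6:
# Dockerfile sketch applying the checklist
# Build stage: toolchain and caches never reach the runtime image
FROM python:3.11.7-slim AS build
COPY requirements.txt .
RUN pip install --require-hashes --prefix=/install -r requirements.txt

# Runtime stage: lean, non-root, health-checked
FROM python:3.11.7-slim
COPY --from=build /install /usr/local
COPY src/ /app/src/
RUN adduser --system app
USER app
WORKDIR /app
HEALTHCHECK --interval=30s --timeout=3s \
  CMD ["python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"]
CMD ["python", "-m", "src.app"]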