Scenario 5 — Dockerized Integration Testing Instead Of Mocking Everything
User Story
As the MangaAssist API owner, I wanted integration tests that behaved like the real recommendation and chat system without making every CI run depend on every shared staging service — especially since staging SageMaker endpoints are expensive to keep always-on and slow to provision per PR.
The CI Challenge For MangaAssist
MangaAssist had multiple real infrastructure dependencies:
- DynamoDB for user preferences and reading history
- S3 for manga metadata and cover assets
- OpenSearch for semantic manga search and RAG retrieval
- Redis for preference caching and session state
- SageMaker inference endpoints for the fine-tuned recommendation model
- External downstream APIs for manga metadata enrichment
Mocking all of these would have been fast but hollow — real serialization behavior, TTL behavior, index behavior, and container startup costs would have been invisible. Depending on shared staging for everything would have made CI slow, expensive, and flaky.
What We Actually Did
- Ran LocalStack in Docker during CI for DynamoDB, S3, SQS, and Kinesis-style integrations.
- Used TestContainers to spin up real Redis and OpenSearch instances per test run (fixture wiring is sketched after this list).
- Used WireMock for downstream REST APIs when behavior control mattered more than runtime realism (retry sequences, timeout simulation, failure injection).
- Kept real staging SageMaker endpoints for tests where actual model latency and cold-start behavior had to be real.
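A minimal sketch of how these per-run dependencies could be wired as pytest fixtures with testcontainers-python and boto3 follows. The fixture names, table name, and image tags are illustrative assumptions, not the actual MangaAssist harness, and the testcontainers calls shown reflect the library's commonly documented API rather than project-specific code.

```python
import boto3
import pytest
from testcontainers.localstack import LocalStackContainer
from testcontainers.redis import RedisContainer


@pytest.fixture(scope="session")
def localstack():
    # One LocalStack container per test session; discarded automatically at teardown.
    with LocalStackContainer(image="localstack/localstack:3") as ls:
        yield ls


@pytest.fixture(scope="session")
def dynamodb(localstack):
    # boto3 pointed at the emulated edge endpoint instead of real AWS.
    client = boto3.client(
        "dynamodb",
        endpoint_url=localstack.get_url(),
        region_name="us-east-1",
        aws_access_key_id="test",
        aws_secret_access_key="test",
    )
    client.create_table(
        TableName="user_preferences",  # hypothetical table name
        KeySchema=[{"AttributeName": "user_id", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "user_id", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
    return client


@pytest.fixture(scope="session")
def redis_client():
    # A real Redis instance per run, so TTL and eviction semantics are real.
    with RedisContainer("redis:7") as redis:
        yield redis.get_client()
```

Because the containers live inside fixtures, teardown is automatic: when the session ends, the containers are discarded, which is what keeps PR runs isolated from each other.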
Why This Is An Important Docker Interview Story
This is the strongest non-production Docker story for MangaAssist. It shows that containers were part of engineering quality, not just deployment. The repo is explicit: mocks alone would have missed real latency characteristics, container startup costs, and retry behavior that only appears against real dependencies.
Deep-Dive Questions And Answers
Q1. Why not mock everything in tests? Because mocks are too optimistic. They don't show real serialization behavior, TTL expiry behavior, OpenSearch index behavior, or container startup cost. They're useful at the unit layer for fast feedback, but insufficient for integration confidence. MangaAssist's recommendation pipeline had several subtle behaviors — like Redis TTL expiry colliding with a recommendation request — that only appeared against real dependency instances.
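A sketch of the kind of TTL-collision test that only makes sense against a real Redis instance, reusing the hypothetical redis_client fixture from the earlier sketch; the cache key layout and payload are illustrative assumptions.

```python
import time


def test_recommendation_falls_back_when_preference_cache_expires(redis_client):
    # Seed a cached preference entry with a short TTL.
    redis_client.set("prefs:user-123", '{"genres": ["seinen"]}', ex=1)

    # Wait past the TTL; with an in-memory mock this expiry would never happen.
    time.sleep(1.5)

    # The real Redis has evicted the key. In the full suite, the recommendation
    # call would sit here and must fall back to DynamoDB instead of failing.
    assert redis_client.get("prefs:user-123") is None
```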
Q2. Why combine LocalStack, TestContainers, and WireMock instead of standardizing on one tool? Because each solves a different problem:
- LocalStack is good for AWS service emulation (DynamoDB table behavior, S3 events, SQS queues).
- TestContainers is good for real dependency instances like Redis and OpenSearch — actual index behavior, not simulated.
- WireMock is best when you want precise control over downstream API failures, timeouts, and retry sequences — programming real HTTP behavior scenarios.
Standardizing on one tool would have meant using the wrong abstraction for two of the three problem types.
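As an example of the "precise control" case, the sketch below programs a fail-then-recover sequence into WireMock through its admin API (a stateful scenario plus an injected delay), assuming WireMock is reachable on localhost:8080 in CI. The endpoint path, scenario name, and response body are made up for illustration.

```python
import requests

WIREMOCK_ADMIN = "http://localhost:8080/__admin/mappings"


def stub_metadata_api_flaky():
    # First call: a slow 503, which should trigger the client's retry policy.
    requests.post(WIREMOCK_ADMIN, json={
        "scenarioName": "metadata-flaky",
        "requiredScenarioState": "Started",
        "newScenarioState": "recovered",
        "request": {"method": "GET", "urlPath": "/manga/42"},
        "response": {"status": 503, "fixedDelayMilliseconds": 2000},
    }).raise_for_status()

    # Second call: a healthy response, so a correct retry policy succeeds.
    requests.post(WIREMOCK_ADMIN, json={
        "scenarioName": "metadata-flaky",
        "requiredScenarioState": "recovered",
        "request": {"method": "GET", "urlPath": "/manga/42"},
        "response": {"status": 200, "jsonBody": {"title": "Example"}},
    }).raise_for_status()
```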
Q3. Where did you refuse to use Docker-based emulation and instead call a real service? For SageMaker staging endpoints. Real latency characteristics mattered there — a local fake would not have revealed cold-start behavior, real tail-latency under concurrent requests, or actual model response quality. The emulation cost savings weren't worth the reduction in test fidelity for the most important path in the system.
Q4. What bug did this testing strategy catch that mocks would have hidden? It exposed gaps in retry and circuit-breaker behavior under simulated dependency delays, plus model cold-start issues that only appeared against actual staged inference endpoints. A mock SageMaker would have returned answers at microsecond speed; the real issue was that under concurrent load, the staging endpoints showed latency characteristics our retry policy had not been tuned for.
Q5. What is the best way to explain the value of Docker here? Docker gave us reproducible, disposable dependency environments in CI. Each PR spun up a fresh set of dependencies, ran the tests, and discarded the containers. That raised test realism while keeping tests isolated and automatable — no shared state pollution between test runs.
Optimizations We Can Credibly Claim
- Three-tier testing model: mocked (unit) → Docker-emulated (integration) → real staging (system)
- Faster feedback than full shared-environment testing for most test types
- Better behavioral coverage than mocks-only strategy
- Reproducible CI infrastructure via disposable containers
- Isolated test runs — no shared state between PRs
Better-Than-Naive Explanation
The naive answer is "we used Docker in CI because it was convenient." The stronger answer: we used Docker selectively to move more tests into the middle of the test pyramid, where they were realistic enough to catch integration failures but still cheap enough to run on every PR. The decision about which tier each test belongs in was deliberate — unit for speed, Docker for realism on cheap dependencies, real staging for the path that actually mattered.
Three-Tier Testing Architecture
Tier 1 — Unit (mocks)
- Fast, isolated, cheap
- Covers: pure business logic, prompt construction, routing decisions
- Tools: pytest mocks, in-memory fakes
Tier 2 — Integration (Docker containers)
- Realistic behavior, disposable, per-PR
- Covers: DynamoDB ops, Redis TTL, OpenSearch indexing, downstream API retry behavior
- Tools: LocalStack, TestContainers, WireMock
Tier 3 — System (real staging)
- Production-fidelity, expensive, gated
- Covers: SageMaker inference latency, model cold-start, real recommendation quality
- Tools: real AWS staging endpoints
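One way to keep the three tiers individually selectable in CI is pytest markers, sketched below. The marker names and per-stage commands are assumptions about how this could be organized, not the exact MangaAssist configuration.

```python
# Markers would be registered in pytest.ini / pyproject.toml to avoid
# "unknown marker" warnings; the names here are illustrative.
import pytest


@pytest.mark.unit
def test_prompt_construction_includes_reading_history():
    ...  # pure in-process logic with mocks; runs on every commit


@pytest.mark.integration
def test_preferences_roundtrip_through_dynamodb(dynamodb):
    ...  # runs against the LocalStack-backed fixture; runs on every PR


@pytest.mark.system
def test_recommendation_latency_against_staging_endpoint():
    ...  # hits the real SageMaker staging endpoint; release-gated


# Per commit:            pytest -m unit
# Per PR:                pytest -m "unit or integration"
# Per release candidate: pytest -m system
```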
Decision Table
| Dimension | Details |
|---|---|
| LocalStack rationale | AWS service behavior (DynamoDB, S3, SQS) without shared staging dependency |
| TestContainers rationale | Real Redis + OpenSearch instances per test run — actual TTL, actual index behavior |
| WireMock rationale | Precise HTTP failure/delay programming for retry and circuit-breaker validation |
| Why real SageMaker for some tests | Cold-start and tail-latency behavior only visible against real inference endpoints |
| Tradeoff: realism vs cost | LocalStack/TestContainers: high realism, low cost. Real staging: highest realism, highest cost |
| Scale mechanism | Each PR spins disposable containers; no shared staging contention |
| Key outcome | Catches retry/circuit-breaker bugs and cold-start issues that mocks-only strategy misses |
Tradeoffs Discussed
| Option considered | Why rejected or scoped |
|---|---|
| Mocks-only for all dependencies | Fast, cheap — but misses serialization, TTL, index behavior, real timing |
| Always use shared staging | Slow to provision, expensive, creates flaky inter-PR test pollution |
| Real AWS for all integration tests | Maximum fidelity but prohibitive cost per PR; cold-start wait times unacceptable in CI |
| Standardize on LocalStack only | Can't run real Redis or OpenSearch behavior — index and cache semantics differ from emulation |
| Standardize on TestContainers only | Can't emulate full AWS service semantics like SQS visibility timeout or DynamoDB streams |
Scale Planned
| Test type | Volume | Infra |
|---|---|---|
| Unit tests | Per commit | No containers; pure in-process mocks |
| Integration tests | Per PR | Disposable Docker containers (LocalStack + TestContainers + WireMock) |
| System tests | Per release candidate | Real staging SageMaker endpoints; gated and scheduled |
Intuition From This Scenario
The test pyramid has a middle tier that most teams skip, and that's where most integration bugs live. Mocks are fast but lie. Production is truthful but expensive to hit constantly. Docker containers in CI are the middle — they tell the truth about real dependency behavior (real Redis TTL, real OpenSearch index delays, real HTTP retry sequences) at a cost that fits every PR. The discipline is knowing which tier each test belongs in and resisting the urge to mock everything for speed or to hit staging for everything for fidelity.