Scenario 5 — Dockerized Integration Testing Instead Of Mocking Everything
User Story
As the MangaAssist API owner, I wanted integration tests that behaved like the real recommendation and chat system without making every CI run depend on every shared staging service — especially since staging SageMaker endpoints are expensive to keep always-on and slow to provision per PR.
The CI Challenge For MangaAssist
MangaAssist had multiple real infrastructure dependencies:
- DynamoDB for user preferences and reading history
- S3 for manga metadata and cover assets
- OpenSearch for semantic manga search and RAG retrieval
- Redis for preference caching and session state
- SageMaker inference endpoints for the fine-tuned recommendation model
- External downstream APIs for manga metadata enrichment
Mocking all of these would have been fast but hollow — real serialization behavior, TTL behavior, index behavior, and container startup costs would have been invisible. Depending on shared staging for everything would have made CI slow, expensive, and flaky.
What We Actually Did
- Ran LocalStack in Docker during CI for DynamoDB, S3, SQS, and Kinesis-style integrations.
- Used TestContainers to spin up real Redis and OpenSearch instances per test run (fixture wiring is sketched after this list).
- Used WireMock for downstream REST APIs when behavior control mattered more than runtime realism (retry sequences, timeout simulation, failure injection).
- Kept real staging SageMaker endpoints for tests where actual model latency and cold-start behavior had to be real.
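A minimal sketch of how these per-run dependencies could be wired as pytest fixtures with testcontainers-python and boto3 follows. The fixture names, table name, and image tags are illustrative assumptions, not the actual MangaAssist harness, and the testcontainers calls shown reflect the library's commonly documented API rather than project-specific code.

```python
import boto3
import pytest
from testcontainers.localstack import LocalStackContainer
from testcontainers.redis import RedisContainer


@pytest.fixture(scope="session")
def localstack():
    # One LocalStack container per test session; discarded automatically at teardown.
    with LocalStackContainer(image="localstack/localstack:3") as ls:
        yield ls


@pytest.fixture(scope="session")
def dynamodb(localstack):
    # boto3 pointed at the emulated edge endpoint instead of real AWS.
    client = boto3.client(
        "dynamodb",
        endpoint_url=localstack.get_url(),
        region_name="us-east-1",
        aws_access_key_id="test",
        aws_secret_access_key="test",
    )
    client.create_table(
        TableName="user_preferences",  # hypothetical table name
        KeySchema=[{"AttributeName": "user_id", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "user_id", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
    return client


@pytest.fixture(scope="session")
def redis_client():
    # A real Redis instance per run, so TTL and eviction semantics are real.
    with RedisContainer("redis:7") as redis:
        yield redis.get_client()
```

Because the containers live inside fixtures, teardown is automatic: when the session ends, the containers are discarded, which is what keeps PR runs isolated from each other.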
Why This Is An Important Docker Interview Story
This is the strongest non-production Docker story for MangaAssist. It shows that containers were part of engineering quality, not just deployment. The repo is explicit: mocks alone would have missed real latency characteristics, container startup costs, and retry behavior that only appears against real dependencies.
Deep-Dive Questions And Answers
Q1. Why not mock everything in tests? Because mocks are too optimistic. They don't show real serialization behavior, TTL expiry behavior, OpenSearch index behavior, or container startup cost. They're useful at the unit layer for fast feedback, but insufficient for integration confidence. MangaAssist's recommendation pipeline had several subtle behaviors — like Redis TTL expiry colliding with a recommendation request — that only appeared against real dependency instances.
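A sketch of the kind of TTL-collision test that only makes sense against a real Redis instance, reusing the hypothetical redis_client fixture from the earlier sketch; the cache key layout and payload are illustrative assumptions.

```python
import time


def test_recommendation_falls_back_when_preference_cache_expires(redis_client):
    # Seed a cached preference entry with a short TTL.
    redis_client.set("prefs:user-123", '{"genres": ["seinen"]}', ex=1)

    # Wait past the TTL; with an in-memory mock this expiry would never happen.
    time.sleep(1.5)

    # The real Redis has evicted the key. In the full suite, the recommendation
    # call would sit here and must fall back to DynamoDB instead of failing.
    assert redis_client.get("prefs:user-123") is None
```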
Q2. Why combine LocalStack, TestContainers, and WireMock instead of standardizing on one tool? Because each solves a different problem:
- LocalStack is good for AWS service emulation (DynamoDB table behavior, S3 events, SQS queues).
- TestContainers is good for real dependency instances like Redis and OpenSearch — actual index behavior, not simulated.
- WireMock is best when you want precise control over downstream API failures, timeouts, and retry sequences — programming real HTTP behavior scenarios.
Standardizing on one tool would have meant using the wrong abstraction for two of the three problem types.
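As an example of the "precise control" case, the sketch below programs a fail-then-recover sequence into WireMock through its admin API (a stateful scenario plus an injected delay), assuming WireMock is reachable on localhost:8080 in CI. The endpoint path, scenario name, and response body are made up for illustration.

```python
import requests

WIREMOCK_ADMIN = "http://localhost:8080/__admin/mappings"


def stub_metadata_api_flaky():
    # First call: a slow 503, which should trigger the client's retry policy.
    requests.post(WIREMOCK_ADMIN, json={
        "scenarioName": "metadata-flaky",
        "requiredScenarioState": "Started",
        "newScenarioState": "recovered",
        "request": {"method": "GET", "urlPath": "/manga/42"},
        "response": {"status": 503, "fixedDelayMilliseconds": 2000},
    }).raise_for_status()

    # Second call: a healthy response, so a correct retry policy succeeds.
    requests.post(WIREMOCK_ADMIN, json={
        "scenarioName": "metadata-flaky",
        "requiredScenarioState": "recovered",
        "request": {"method": "GET", "urlPath": "/manga/42"},
        "response": {"status": 200, "jsonBody": {"title": "Example"}},
    }).raise_for_status()
```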
Q3. Where did you refuse to use Docker-based emulation and instead call a real service? For SageMaker staging endpoints. Real latency characteristics mattered there — a local fake would not have revealed cold-start behavior, real tail-latency under concurrent requests, or actual model response quality. The emulation cost savings weren't worth the reduction in test fidelity for the most important path in the system.
Q4. What bug did this testing strategy catch that mocks would have hidden? It exposed gaps in retry and circuit-breaker behavior under simulated dependency delays, plus model cold-start issues that only appeared against actual staged inference endpoints. A mock SageMaker would have returned answers at microsecond speed; the real issue was that under concurrent load, the staging endpoints showed latency characteristics our retry policy had not been tuned for.
Q5. What is the best way to explain the value of Docker here? Docker gave us reproducible, disposable dependency environments in CI. Each PR spun up a fresh set of dependencies, ran the tests, and discarded the containers. That raised test realism while keeping tests isolated and automatable — no shared state pollution between test runs.
Optimizations We Can Credibly Claim
- Three-tier testing model: mocked (unit) → Docker-emulated (integration) → real staging (system)
- Faster feedback than full shared-environment testing for most test types
- Better behavioral coverage than mocks-only strategy
- Reproducible CI infrastructure via disposable containers
- Isolated test runs — no shared state between PRs
Better-Than-Naive Explanation
The naive answer is "we used Docker in CI because it was convenient." The stronger answer: we used Docker selectively to move more tests into the middle of the test pyramid, where they were realistic enough to catch integration failures but still cheap enough to run on every PR. The decision about which tier each test belongs in was deliberate — unit for speed, Docker for realism on cheap dependencies, real staging for the path that actually mattered.
Three-Tier Testing Architecture
Tier 1 — Unit (mocks)
- Fast, isolated, cheap
- Covers: pure business logic, prompt construction, routing decisions
- Tools: pytest mocks, in-memory fakes
Tier 2 — Integration (Docker containers)
- Realistic behavior, disposable, per-PR
- Covers: DynamoDB ops, Redis TTL, OpenSearch indexing, downstream API retry behavior
- Tools: LocalStack, TestContainers, WireMock
Tier 3 — System (real staging)
- Production-fidelity, expensive, gated
- Covers: SageMaker inference latency, model cold-start, real recommendation quality
- Tools: real AWS staging endpoints
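One way to keep the three tiers individually selectable in CI is pytest markers, sketched below. The marker names and per-stage commands are assumptions about how this could be organized, not the exact MangaAssist configuration.

```python
# Markers would be registered in pytest.ini / pyproject.toml to avoid
# "unknown marker" warnings; the names here are illustrative.
import pytest


@pytest.mark.unit
def test_prompt_construction_includes_reading_history():
    ...  # pure in-process logic with mocks; runs on every commit


@pytest.mark.integration
def test_preferences_roundtrip_through_dynamodb(dynamodb):
    ...  # runs against the LocalStack-backed fixture; runs on every PR


@pytest.mark.system
def test_recommendation_latency_against_staging_endpoint():
    ...  # hits the real SageMaker staging endpoint; release-gated


# Per commit:            pytest -m unit
# Per PR:                pytest -m "unit or integration"
# Per release candidate: pytest -m system
```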
Decision Table
| Dimension | Details |
|---|---|
| LocalStack rationale | AWS service behavior (DynamoDB, S3, SQS) without shared staging dependency |
| TestContainers rationale | Real Redis + OpenSearch instances per test run — actual TTL, actual index behavior |
| WireMock rationale | Precise HTTP failure/delay programming for retry and circuit-breaker validation |
| Why real SageMaker for some tests | Cold-start and tail-latency behavior only visible against real inference endpoints |
| Tradeoff: realism vs cost | LocalStack/TestContainers: high realism, low cost. Real staging: highest realism, highest cost |
| Scale mechanism | Each PR spins disposable containers; no shared staging contention |
| Key outcome | Catches retry/circuit-breaker bugs and cold-start issues that mocks-only strategy misses |
Tradeoffs Discussed
| Option considered | Why rejected or scoped |
|---|---|
| Mocks-only for all dependencies | Fast, cheap — but misses serialization, TTL, index behavior, real timing |
| Always use shared staging | Slow to provision, expensive, creates flaky inter-PR test pollution |
| Real AWS for all integration tests | Maximum fidelity but prohibitive cost per PR; cold-start wait times unacceptable in CI |
| Standardize on LocalStack only | Can't run real Redis or OpenSearch behavior — index and cache semantics differ from emulation |
| Standardize on TestContainers only | Can't emulate full AWS service semantics like SQS visibility timeout or DynamoDB streams |
Scale Planned
| Test type | Volume | Infra |
|---|---|---|
| Unit tests | Per commit | No containers; pure in-process mocks |
| Integration tests | Per PR | Disposable Docker containers (LocalStack + TestContainers + WireMock) |
| System tests | Per release candidate | Real staging SageMaker endpoints; gated and scheduled |
Intuition From This Scenario
The test pyramid has a middle tier that most teams skip, and that's where most integration bugs live. Mocks are fast but lie. Production is truthful but expensive to hit constantly. Docker containers in CI are the middle — they tell the truth about real dependency behavior (real Redis TTL, real OpenSearch index delays, real HTTP retry sequences) at a cost that fits every PR. The discipline is knowing which tier each test belongs in and resisting the urge to mock everything for speed or to hit staging for everything for fidelity.