US-04: Real-Time vs Pre-Computed Inference
User Story
As an ML engineering lead, I want to decide what to compute in real time versus pre-compute in batch, so that we minimize real-time latency and cost without serving stale or irrelevant results.
The Debate
graph TD
subgraph "Performance Team"
P["Pre-compute everything!<br/>Real-time inference is slow.<br/>If we pre-generate top 100<br/>recommendations per user,<br/>response time drops to<br/>a cache lookup."]
end
subgraph "Inference Team"
I["Pre-computed results go stale.<br/>The user's current page context<br/>and conversation turn matter.<br/>You can't pre-compute a response to<br/>a question nobody asked yet."]
end
subgraph "Cost Team"
C["Batch inference is 3-5x cheaper<br/>per unit than real-time.<br/>But storing and refreshing<br/>pre-computed results costs money too.<br/>What's the break-even point?"]
end
P ---|"Staleness<br/>concern"| I
I ---|"Batch cost vs<br/>real-time cost"| C
C ---|"Latency<br/>SLA"| P
style P fill:#4ecdc4,stroke:#333,color:#000
style I fill:#ff6b6b,stroke:#333,color:#000
style C fill:#f9d71c,stroke:#333,color:#000
Acceptance Criteria
- A clear classification exists for every data/inference type: real-time, near-real-time, or batch.
- Pre-computed data has defined freshness SLAs and staleness alerts.
- Real-time inference is reserved for tasks that genuinely need conversation context.
- Batch pipelines run during off-peak hours to minimize compute cost.
- Hybrid paths exist: pre-computed base + real-time refinement.
The Spectrum: Not Binary
graph LR
subgraph "Fully Pre-Computed"
A["Generated hours/days ahead<br/>Stored in cache/DB<br/>Nearest-neighbor lookup<br/>at request time"]
end
subgraph "Near-Real-Time"
B["Pre-computed base<br/>+ lightweight real-time<br/>refinement at request time"]
end
subgraph "Fully Real-Time"
C["Computed from scratch<br/>at request time<br/>using full conversation<br/>context"]
end
A -->|"Freshness<br/>decreases"| B
B -->|"Latency<br/>increases"| C
style A fill:#2d8659,stroke:#333,color:#fff
style B fill:#fd9644,stroke:#333,color:#000
style C fill:#eb3b5a,stroke:#333,color:#fff
Classification of Every Inference Task
| Task | Mode | Refresh Cycle | Rationale |
|---|---|---|---|
| User preference embeddings | Batch | Every 6 hours | Change slowly; based on purchase/browse history |
| Top-N recommendations per user | Batch | Every 6 hours | Pre-compute top 50 candidates per user |
| Recommendation reranking | Near-RT | At request time | Rerank pre-computed candidates using current context |
| Product embeddings for RAG | Batch | Every 6 hours or on catalog event | Product data changes infrequently |
| Query embedding | Real-time | Every request | Each query is unique |
| RAG retrieval (KNN search) | Real-time | Every request | Depends on query embedding |
| Intent classification | Real-time | Every request | Must classify the actual user message |
| LLM response generation | Real-time | Every request | Needs conversation context + retrieved data |
| FAQ answer compilation | Near-RT | Every 6 hours (batch pre-compile of top 50) | Many FAQ questions are repeated; cached answers are served or refined at request time |
| Promotion matching | Batch | Every 15 min or on promo event | Promotions change frequently; the event trigger catches mid-cycle changes |
| Guardrail checks | Real-time | Every request | Must validate the actual response |
| Review summaries | Batch | Daily | Reviews change slowly |
| Trending/popular rankings | Batch | Hourly | Based on aggregate signals |
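The classification above can also be encoded as data so that staleness alerts and routing logic key off a single source of truth. A minimal sketch, assuming names like `InferenceMode` and `TASK_REGISTRY` that are illustrative rather than from an existing codebase:

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class InferenceMode(Enum):
    BATCH = "batch"          # pre-computed on a schedule
    NEAR_RT = "near_rt"      # pre-computed base + request-time refinement
    REAL_TIME = "real_time"  # computed from scratch per request


@dataclass(frozen=True)
class TaskPolicy:
    mode: InferenceMode
    refresh_every: timedelta | None  # None for real-time tasks
    rationale: str


# Illustrative registry mirroring part of the table above.
TASK_REGISTRY: dict[str, TaskPolicy] = {
    "user_preference_embeddings": TaskPolicy(InferenceMode.BATCH, timedelta(hours=6), "change slowly"),
    "topn_recommendations": TaskPolicy(InferenceMode.BATCH, timedelta(hours=6), "pre-compute top 50 candidates"),
    "recommendation_reranking": TaskPolicy(InferenceMode.NEAR_RT, None, "rerank candidates with current context"),
    "query_embedding": TaskPolicy(InferenceMode.REAL_TIME, None, "each query is unique"),
    "llm_response_generation": TaskPolicy(InferenceMode.REAL_TIME, None, "needs conversation context"),
    "review_summaries": TaskPolicy(InferenceMode.BATCH, timedelta(days=1), "reviews change slowly"),
}


def is_stale(task: str, age: timedelta) -> bool:
    """Flag a pre-computed artifact whose age exceeds its refresh SLA."""
    policy = TASK_REGISTRY[task]
    return policy.refresh_every is not None and age > policy.refresh_every
```

A registry like this is what the freshness SLAs and staleness alerts in the acceptance criteria would hang off.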
Deep Dive: Recommendation Pipeline (Hybrid Approach)
The recommendation flow is the best example of the hybrid pattern: batch pre-computation + real-time refinement.
graph TD
subgraph "Batch Pipeline (runs every 6 hours)"
B1["User purchase history"] --> B2["Collaborative filtering<br/>model inference"]
B3["Browsing signals"] --> B2
B2 --> B4["Top 50 candidate ASINs<br/>per user"]
B4 --> B5["Store in DynamoDB<br/>key: user_id"]
end
subgraph "Real-Time Pipeline (at request time)"
R1["Load pre-computed<br/>50 candidates"] --> R2["Filter by<br/>current context"]
R3["Current page ASIN<br/>+ conversation history"] --> R2
R2 --> R4["Rerank with<br/>lightweight model"]
R4 --> R5["Top 5 final<br/>recommendations"]
R5 --> R6["Feed to LLM<br/>for natural language"]
end
B5 -.->|"cache lookup<br/>5ms"| R1
style B2 fill:#2d8659,stroke:#333,color:#fff
style R4 fill:#fd9644,stroke:#333,color:#000
style R6 fill:#eb3b5a,stroke:#333,color:#fff
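A minimal sketch of the request-time half of this pipeline, assuming the batch job has already written candidates keyed by `user_id`; the store, scoring function, and field names are stand-ins, not a real implementation:

```python
from typing import Any


def recommend(user_id: str,
              current_asin: str,
              candidate_store: dict[str, list[dict[str, Any]]],
              rerank_score,
              top_k: int = 5) -> list[dict[str, Any]]:
    """Serve recommendations from pre-computed candidates with a light real-time rerank."""
    # 1. Cache lookup: top ~50 candidates written by the 6-hourly batch job.
    candidates = candidate_store.get(user_id, [])

    # 2. Filter by current context (e.g. drop the item the user is already viewing).
    candidates = [c for c in candidates if c["asin"] != current_asin]

    # 3. Lightweight rerank against the current page/conversation context.
    scored = sorted(candidates, key=lambda c: rerank_score(c, current_asin), reverse=True)

    # 4. Top-K final recommendations, ready to feed to the LLM for phrasing.
    return scored[:top_k]


# Toy usage with a stand-in scoring function.
store = {"u1": [{"asin": f"A{i}", "batch_score": 1.0 - i * 0.01} for i in range(50)]}
recs = recommend("u1", current_asin="A0", candidate_store=store,
                 rerank_score=lambda c, ctx: c["batch_score"])
print([r["asin"] for r in recs])
```

The expensive collaborative-filtering inference never runs on the request path; only the cheap filter-and-rerank step does.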
Cost Comparison: Fully Real-Time vs Hybrid
| Approach | Compute Per Request | Monthly Cost (1M req/day) | Latency Added | Quality |
|---|---|---|---|---|
| Full real-time recommendation | Run collaborative filtering + reranking + LLM | $180,000 | +800ms | 0.92 |
| Hybrid (batch + rerank + LLM) | Cache lookup + lightweight rerank + LLM | $52,000 | +150ms for rerank | 0.89 |
| Fully pre-computed (incl. LLM response) | Cache lookup only | $8,000 | +5ms | 0.71 |
Decision: Hybrid. The $128K/month savings vs fully real-time is worth a 0.03 quality drop. The fully pre-computed approach saves more money but quality drops sharply because the LLM response can't incorporate conversation context.
Deep Dive: FAQ Pre-Compilation
graph TD
subgraph "Batch: Pre-compile top 50 FAQs"
F1["Analyze query logs<br/>→ top 50 FAQ questions"] --> F2["Generate Haiku answers<br/>for each variation"]
F2 --> F3["Store in ElastiCache<br/>key: faq_question_hash"]
end
subgraph "Real-Time: Serve or generate"
R1["User FAQ question"] --> R2{"Semantic similarity<br/>to pre-compiled FAQ<br/>> 0.92?"}
R2 -->|"Yes"| R3["Return pre-compiled<br/>answer (5ms)"]
R2 -->|"No"| R4["Full RAG + LLM<br/>pipeline (1,200ms)"]
end
F3 -.-> R2
style R3 fill:#2d8659,stroke:#333,color:#fff
style R4 fill:#eb3b5a,stroke:#333,color:#fff
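A sketch of the serve-or-generate decision, assuming pre-compiled answers are stored alongside an embedding of their canonical question; the embedding function and RAG pipeline are passed in as stand-ins:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def answer_faq(question: str,
               embed,                    # real-time query embedding (always computed)
               precompiled: list[dict],  # [{"embedding": [...], "answer": "..."}]
               full_rag_pipeline,
               threshold: float = 0.92) -> str:
    """Serve a pre-compiled answer on a close semantic match, else fall through to full RAG."""
    q_vec = embed(question)
    if precompiled:
        best = max(precompiled, key=lambda f: cosine(q_vec, f["embedding"]))
        if cosine(q_vec, best["embedding"]) >= threshold:
            return best["answer"]        # ~5 ms cache path
    return full_rag_pipeline(question)   # ~1,200 ms full pipeline
```

Note that the query embedding itself is still computed per request, consistent with the classification table: only the answer is pre-compiled.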
The Staleness Problem
graph TD
A["Pre-computed answer:<br/>'Return window is 30 days'"] --> B{"Policy changed<br/>to 14 days"}
B --> C["Stale answer served<br/>for up to 6 hours"]
C --> D["Customer returns after<br/>day 15, gets denied"]
D --> E["Support escalation<br/>+ trust damage"]
style C fill:#eb3b5a,stroke:#333,color:#fff
style E fill:#ff6b6b,stroke:#333,color:#000
Staleness Mitigation Strategy
| Mitigation | How | Tradeoff |
|---|---|---|
| Event-driven invalidation | Policy change event → purge FAQ cache | Requires event infrastructure; adds complexity |
| Version tagging | Each pre-compiled answer carries a source version; compare at serve time | Adds 10ms per request for version check |
| Confidence degradation | Reduce confidence score as answer ages; fall through to real-time below threshold | Increases real-time load as cache ages |
| Content hash comparison | Hash source document; if hash changed, invalidate | Catches all changes but requires source access |
Decision: Event-driven invalidation for policy changes (high-stakes), plus confidence degradation for editorial content (lower-stakes). This catches the dangerous staleness (wrong policy) cheaply while tolerating mild staleness in recommendation-style content.
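A sketch combining the two chosen mitigations: event-driven purges for policy-derived content plus age-based confidence decay for everything else. The class name, event names, and decay constants are illustrative assumptions:

```python
import time


class PrecomputedCache:
    """In-memory stand-in for an ElastiCache/DynamoDB-backed pre-computed answer store."""

    def __init__(self, confidence_floor: float = 0.6, half_life_s: float = 6 * 3600):
        self._entries: dict[str, dict] = {}   # key -> {"value", "created", "source"}
        self.confidence_floor = confidence_floor
        self.half_life_s = half_life_s

    def put(self, key: str, value: str, source: str) -> None:
        self._entries[key] = {"value": value, "created": time.time(), "source": source}

    def on_source_event(self, source: str) -> None:
        """Event-driven invalidation: purge everything derived from a changed source."""
        self._entries = {k: v for k, v in self._entries.items() if v["source"] != source}

    def get(self, key: str) -> str | None:
        """Confidence degradation: serve only while age-decayed confidence stays above the floor."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        age = time.time() - entry["created"]
        confidence = 0.5 ** (age / self.half_life_s)   # halves every half-life
        return entry["value"] if confidence >= self.confidence_floor else None


# Usage: a returns-policy change purges policy-derived answers immediately;
# everything else ages out and falls through to the real-time path.
cache = PrecomputedCache()
cache.put("faq:return-window", "Return window is 30 days", source="returns_policy")
cache.on_source_event("returns_policy")
assert cache.get("faq:return-window") is None  # caller now regenerates in real time
```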
The Break-Even Analysis
When is pre-computation cheaper than real-time?
graph LR
subgraph "Pre-Computation Cost"
PC1["Batch compute: run model<br/>on N items"] --> PC2["Storage: store results<br/>in cache/DB"]
PC2 --> PC3["Refresh: re-run periodically"]
PC3 --> PC4["Total = batch_cost +<br/>storage + refresh_cost"]
end
subgraph "Real-Time Cost"
RT1["Per-request compute:<br/>model inference"] --> RT2["Scales linearly<br/>with traffic"]
RT2 --> RT3["Total = cost_per_request<br/>× request_volume"]
end
PC4 --> BEP{"Break Even<br/>Point"}
RT3 --> BEP
style BEP fill:#54a0ff,stroke:#333,color:#000
Break-Even Formula
$$\text{Break-even volume} = \frac{\text{Batch Cost} + \text{Storage Cost}}{\text{Real-time Cost Per Request} - \text{Serve Cost Per Request}}$$
Concrete Example: Recommendation Candidates
| Factor | Value |
|---|---|
| Batch compute cost (50 candidates × 5M users, every 6h) | $2,400/day |
| Storage cost (DynamoDB) | $200/day |
| Refresh cost (4 runs/day) | Included in batch |
| Real-time cost per recommendation request | $0.012 |
| Cache-serve cost per recommendation request | $0.0001 |
| Daily recommendation requests | 400,000 |
$$\text{Real-time daily cost} = 400{,}000 \times \$0.012 = \$4{,}800$$
$$\text{Pre-compute daily cost} = \$2{,}400 + \$200 + (400{,}000 \times \$0.0001) = \$2{,}640$$
Pre-computation saves $2,160/day ($64,800/month) for recommendations alone.
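The same arithmetic as a small sketch, plugging in the numbers from the table (all figures per day; function and variable names are illustrative):

```python
def precompute_economics(batch_cost: float, storage_cost: float,
                         rt_cost_per_req: float, serve_cost_per_req: float,
                         daily_requests: int) -> tuple[float, float]:
    """Return (break-even daily volume, daily savings from pre-computation)."""
    break_even = (batch_cost + storage_cost) / (rt_cost_per_req - serve_cost_per_req)
    real_time = daily_requests * rt_cost_per_req
    pre_compute = batch_cost + storage_cost + daily_requests * serve_cost_per_req
    return break_even, real_time - pre_compute


break_even, savings = precompute_economics(2_400, 200, 0.012, 0.0001, 400_000)
print(f"break-even at ~{break_even:,.0f} requests/day")  # ~218,487
print(f"daily savings: ${savings:,.0f}")                  # $2,160
```

At 400,000 requests/day the system sits well above the ~218K break-even volume, which is why pre-computation wins for this task.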
What Should NEVER Be Pre-Computed
| Task | Why Not |
|---|---|
| LLM response generation | Requires conversation context that doesn't exist until the user types |
| Intent classification | Must classify the actual message |
| Guardrail validation | Must validate the actual response |
| Price lookups | Prices change too frequently; must always be live |
| Cart-dependent operations | Cart state changes between queries |
2026 Update: Prefer Event-Driven Hybrid Inference over Giant Batch Jobs
Treat everything above this section as the baseline hybrid architecture. This update preserves that prior model and shows how the current architecture shifts toward more event-driven refinement and invalidation.
The strongest current pattern is a three-lane architecture: offline batch for slow-changing artifacts, event-driven refresh for freshness-critical derived artifacts, and on-demand refinement for live session context.
- Pre-compute candidate sets, embeddings, summaries, and reusable evidence blocks, but keep final wording, personalization, price/inventory validation, and session reasoning real-time.
- Use event-driven invalidation and refresh for purchases, catalog changes, policy updates, and promo changes. Batch-only refresh windows are usually too coarse for customer-facing experiences.
- Treat prompt or prefix caching as part of the hybrid design. A stable base prefix plus live delta is often cheaper and fresher than regenerating everything or serving a fully stale artifact.
- Add an explicit staleness contract to every precomputed artifact: `freshness_sla`, `invalidated_by`, `fallback_if_stale`, and `owner` (see the sketch after this list).
- Revisit break-even calculations using the current serving stack. Prefix caching, prompt caching, and cheaper rerank/refine passes have improved the economics of near-real-time refinement.
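A minimal sketch of such a staleness contract attached to a precomputed artifact; the field names follow the bullet above, everything else is an illustrative assumption:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StalenessContract:
    freshness_sla: timedelta           # max acceptable age before the artifact counts as stale
    invalidated_by: tuple[str, ...]    # event types that purge it immediately
    fallback_if_stale: str             # e.g. "real_time_pipeline" or "serve_with_warning"
    owner: str                         # team accountable for the SLA


@dataclass
class PrecomputedArtifact:
    key: str
    value: str
    contract: StalenessContract
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def is_stale(self, now: datetime | None = None) -> bool:
        now = now or datetime.now(timezone.utc)
        return now - self.created_at > self.contract.freshness_sla


faq_contract = StalenessContract(
    freshness_sla=timedelta(hours=6),
    invalidated_by=("policy_changed", "catalog_updated"),
    fallback_if_stale="real_time_pipeline",
    owner="inference-team",
)
```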
Recent references: AWS Bedrock prompt caching, Anthropic prompt caching, vLLM automatic prefix caching, Anthropic contextual retrieval.
Reversal Triggers
| Trigger | Action |
|---|---|
| Pre-computed recommendation hit rate drops below 60% | Users' real-time queries diverge from pre-computed candidates; increase candidate pool or refresh more often |
| Stale answer causes a customer complaint | Tighten invalidation; potentially move that content type to real-time |
| Batch compute cost exceeds real-time cost for any task | Switch that task to real-time |
| New model makes real-time inference 5x cheaper | Re-evaluate all pre-computed tasks |
| Off-peak batch jobs start impacting peak traffic (resource contention) | Move batch to dedicated compute or different region |
Impact on Trilemma
| Dimension | Fully Real-Time | Fully Pre-Computed | Hybrid (Decision) |
|---|---|---|---|
| Cost | $$$$ | $ | $$ (batch is cheap) |
| Performance | Slow (800ms+ for reco) | Fast (5ms cache hit) | Fast-ish (150ms rerank) |
| Quality | Best (full context) | Poor (stale, no context) | Good (context-aware rerank) |
| QACPI | Low (high cost, high latency) | Medium (low quality) | Highest (balanced) |