US-04: Real-Time vs Pre-Computed Inference
User Story
As an ML engineering lead, I want to decide what to compute in real time versus pre-compute in batch, so that we minimize real-time latency and cost without serving stale or irrelevant results.
The Debate
graph TD
subgraph "Performance Team"
P["Pre-compute everything!<br/>Real-time inference is slow.<br/>If we pre-generate top 100<br/>recommendations per user,<br/>response time drops to<br/>a cache lookup."]
end
subgraph "Inference Team"
I["Pre-computed results go stale.<br/>The user's current page context<br/>and conversation turn matter.<br/>You can't pre-compute a response to<br/>a question nobody asked yet."]
end
subgraph "Cost Team"
C["Batch inference is 3-5x cheaper<br/>per unit than real-time.<br/>But storing and refreshing<br/>pre-computed results costs money too.<br/>What's the break-even point?"]
end
P ---|"Staleness<br/>concern"| I
I ---|"Batch cost vs<br/>real-time cost"| C
C ---|"Latency<br/>SLA"| P
style P fill:#4ecdc4,stroke:#333,color:#000
style I fill:#ff6b6b,stroke:#333,color:#000
style C fill:#f9d71c,stroke:#333,color:#000
Acceptance Criteria
- A clear classification exists for every data/inference type: real-time, near-real-time, or batch.
- Pre-computed data has defined freshness SLAs and staleness alerts.
- Real-time inference is reserved for tasks that genuinely need conversation context.
- Batch pipelines run during off-peak hours to minimize compute cost.
- Hybrid paths exist: pre-computed base + real-time refinement.
The Spectrum: Not Binary
graph LR
subgraph "Fully Pre-Computed"
A["Generated hours/days ahead<br/>Stored in cache/DB<br/>Nearest-neighbor lookup<br/>at request time"]
end
subgraph "Near-Real-Time"
B["Pre-computed base<br/>+ lightweight real-time<br/>refinement at request time"]
end
subgraph "Fully Real-Time"
C["Computed from scratch<br/>at request time<br/>using full conversation<br/>context"]
end
A -->|"Freshness<br/>decreases"| B
B -->|"Latency<br/>increases"| C
style A fill:#2d8659,stroke:#333,color:#fff
style B fill:#fd9644,stroke:#333,color:#000
style C fill:#eb3b5a,stroke:#333,color:#fff
Classification of Every Inference Task
| Task | Mode | Refresh Cycle | Rationale |
|---|---|---|---|
| User preference embeddings | Batch | Every 6 hours | Change slowly; based on purchase/browse history |
| Top-N recommendations per user | Batch | Every 6 hours | Pre-compute top 50 candidates per user |
| Recommendation reranking | Near-RT | At request time | Rerank pre-computed candidates using current context |
| Product embeddings for RAG | Batch | Every 6 hours or on catalog event | Product data changes infrequently |
| Query embedding | Real-time | Every request | Each query is unique |
| RAG retrieval (KNN search) | Real-time | Every request | Depends on query embedding |
| Intent classification | Real-time | Every request | Must classify the actual user message |
| LLM response generation | Real-time | Every request | Needs conversation context + retrieved data |
| FAQ answer compilation | Near-RT | Every 6 hours (batch pre-compile of top 50) | Many FAQ questions are repeated; cached answers are served or refined at request time |
| Promotion matching | Batch | Every 15 min or on promo event | Promotions change frequently; the event trigger catches mid-cycle changes |
| Guardrail checks | Real-time | Every request | Must validate the actual response |
| Review summaries | Batch | Daily | Reviews change slowly |
| Trending/popular rankings | Batch | Hourly | Based on aggregate signals |
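The classification above can also be encoded as data so that staleness alerts and routing logic key off a single source of truth. A minimal sketch, assuming names like `InferenceMode` and `TASK_REGISTRY` that are illustrative rather than from an existing codebase:

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class InferenceMode(Enum):
    BATCH = "batch"          # pre-computed on a schedule
    NEAR_RT = "near_rt"      # pre-computed base + request-time refinement
    REAL_TIME = "real_time"  # computed from scratch per request


@dataclass(frozen=True)
class TaskPolicy:
    mode: InferenceMode
    refresh_every: timedelta | None  # None for real-time tasks
    rationale: str


# Illustrative registry mirroring part of the table above.
TASK_REGISTRY: dict[str, TaskPolicy] = {
    "user_preference_embeddings": TaskPolicy(InferenceMode.BATCH, timedelta(hours=6), "change slowly"),
    "topn_recommendations": TaskPolicy(InferenceMode.BATCH, timedelta(hours=6), "pre-compute top 50 candidates"),
    "recommendation_reranking": TaskPolicy(InferenceMode.NEAR_RT, None, "rerank candidates with current context"),
    "query_embedding": TaskPolicy(InferenceMode.REAL_TIME, None, "each query is unique"),
    "llm_response_generation": TaskPolicy(InferenceMode.REAL_TIME, None, "needs conversation context"),
    "review_summaries": TaskPolicy(InferenceMode.BATCH, timedelta(days=1), "reviews change slowly"),
}


def is_stale(task: str, age: timedelta) -> bool:
    """Flag a pre-computed artifact whose age exceeds its refresh SLA."""
    policy = TASK_REGISTRY[task]
    return policy.refresh_every is not None and age > policy.refresh_every
```

A registry like this is what the freshness SLAs and staleness alerts in the acceptance criteria would hang off.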
Deep Dive: Recommendation Pipeline (Hybrid Approach)
The recommendation flow is the best example of the hybrid pattern: batch pre-computation + real-time refinement.
graph TD
subgraph "Batch Pipeline (runs every 6 hours)"
B1["User purchase history"] --> B2["Collaborative filtering<br/>model inference"]
B3["Browsing signals"] --> B2
B2 --> B4["Top 50 candidate ASINs<br/>per user"]
B4 --> B5["Store in DynamoDB<br/>key: user_id"]
end
subgraph "Real-Time Pipeline (at request time)"
R1["Load pre-computed<br/>50 candidates"] --> R2["Filter by<br/>current context"]
R3["Current page ASIN<br/>+ conversation history"] --> R2
R2 --> R4["Rerank with<br/>lightweight model"]
R4 --> R5["Top 5 final<br/>recommendations"]
R5 --> R6["Feed to LLM<br/>for natural language"]
end
B5 -.->|"cache lookup<br/>5ms"| R1
style B2 fill:#2d8659,stroke:#333,color:#fff
style R4 fill:#fd9644,stroke:#333,color:#000
style R6 fill:#eb3b5a,stroke:#333,color:#fff
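A minimal sketch of the request-time half of this pipeline, assuming the batch job has already written candidates keyed by `user_id`; the store, scoring function, and field names are stand-ins, not a real implementation:

```python
from typing import Any


def recommend(user_id: str,
              current_asin: str,
              candidate_store: dict[str, list[dict[str, Any]]],
              rerank_score,
              top_k: int = 5) -> list[dict[str, Any]]:
    """Serve recommendations from pre-computed candidates with a light real-time rerank."""
    # 1. Cache lookup: top ~50 candidates written by the 6-hourly batch job.
    candidates = candidate_store.get(user_id, [])

    # 2. Filter by current context (e.g. drop the item the user is already viewing).
    candidates = [c for c in candidates if c["asin"] != current_asin]

    # 3. Lightweight rerank against the current page/conversation context.
    scored = sorted(candidates, key=lambda c: rerank_score(c, current_asin), reverse=True)

    # 4. Top-K final recommendations, ready to feed to the LLM for phrasing.
    return scored[:top_k]


# Toy usage with a stand-in scoring function.
store = {"u1": [{"asin": f"A{i}", "batch_score": 1.0 - i * 0.01} for i in range(50)]}
recs = recommend("u1", current_asin="A0", candidate_store=store,
                 rerank_score=lambda c, ctx: c["batch_score"])
print([r["asin"] for r in recs])
```

The expensive collaborative-filtering inference never runs on the request path; only the cheap filter-and-rerank step does.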
Cost Comparison: Fully Real-Time vs Hybrid
| Approach | Compute Per Request | Monthly Cost (1M req/day) | Latency Added | Quality |
|---|---|---|---|---|
| Full real-time recommendation | Run collaborative filtering + reranking + LLM | $180,000 | +800ms | 0.92 |
| Hybrid (batch + rerank + LLM) | Cache lookup + lightweight rerank + LLM | $52,000 | +150ms for rerank | 0.89 |
| Fully pre-computed (incl. LLM response) | Cache lookup only | $8,000 | +5ms | 0.71 |
Decision: Hybrid. The $128K/month savings vs fully real-time is worth a 0.03 quality drop. The fully pre-computed approach saves more money but quality drops sharply because the LLM response can't incorporate conversation context.
Deep Dive: FAQ Pre-Compilation
graph TD
subgraph "Batch: Pre-compile top 50 FAQs"
F1["Analyze query logs<br/>→ top 50 FAQ questions"] --> F2["Generate Haiku answers<br/>for each variation"]
F2 --> F3["Store in ElastiCache<br/>key: faq_question_hash"]
end
subgraph "Real-Time: Serve or generate"
R1["User FAQ question"] --> R2{"Semantic similarity<br/>to pre-compiled FAQ<br/>> 0.92?"}
R2 -->|"Yes"| R3["Return pre-compiled<br/>answer (5ms)"]
R2 -->|"No"| R4["Full RAG + LLM<br/>pipeline (1,200ms)"]
end
F3 -.-> R2
style R3 fill:#2d8659,stroke:#333,color:#fff
style R4 fill:#eb3b5a,stroke:#333,color:#fff
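A sketch of the serve-or-generate decision, assuming pre-compiled answers are stored alongside an embedding of their canonical question; the embedding function and RAG pipeline are passed in as stand-ins:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def answer_faq(question: str,
               embed,                    # real-time query embedding (always computed)
               precompiled: list[dict],  # [{"embedding": [...], "answer": "..."}]
               full_rag_pipeline,
               threshold: float = 0.92) -> str:
    """Serve a pre-compiled answer on a close semantic match, else fall through to full RAG."""
    q_vec = embed(question)
    if precompiled:
        best = max(precompiled, key=lambda f: cosine(q_vec, f["embedding"]))
        if cosine(q_vec, best["embedding"]) >= threshold:
            return best["answer"]        # ~5 ms cache path
    return full_rag_pipeline(question)   # ~1,200 ms full pipeline
```

Note that the query embedding itself is still computed per request, consistent with the classification table: only the answer is pre-compiled.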
The Staleness Problem
graph TD
A["Pre-computed answer:<br/>'Return window is 30 days'"] --> B{"Policy changed<br/>to 14 days"}
B --> C["Stale answer served<br/>for up to 6 hours"]
C --> D["Customer returns after<br/>day 15, gets denied"]
D --> E["Support escalation<br/>+ trust damage"]
style C fill:#eb3b5a,stroke:#333,color:#fff
style E fill:#ff6b6b,stroke:#333,color:#000
Staleness Mitigation Strategy
| Mitigation | How | Tradeoff |
|---|---|---|
| Event-driven invalidation | Policy change event → purge FAQ cache | Requires event infrastructure; adds complexity |
| Version tagging | Each pre-compiled answer carries a source version; compare at serve time | Adds 10ms per request for version check |
| Confidence degradation | Reduce confidence score as answer ages; fall through to real-time below threshold | Increases real-time load as cache ages |
| Content hash comparison | Hash source document; if hash changed, invalidate | Catches all changes but requires source access |
Decision: Event-driven invalidation for policy changes (high-stakes), plus confidence degradation for editorial content (lower-stakes). This catches the dangerous staleness (wrong policy) cheaply while tolerating mild staleness in recommendation-style content.
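A sketch combining the two chosen mitigations: event-driven purges for policy-derived content plus age-based confidence decay for everything else. The class name, event names, and decay constants are illustrative assumptions:

```python
import time


class PrecomputedCache:
    """In-memory stand-in for an ElastiCache/DynamoDB-backed pre-computed answer store."""

    def __init__(self, confidence_floor: float = 0.6, half_life_s: float = 6 * 3600):
        self._entries: dict[str, dict] = {}   # key -> {"value", "created", "source"}
        self.confidence_floor = confidence_floor
        self.half_life_s = half_life_s

    def put(self, key: str, value: str, source: str) -> None:
        self._entries[key] = {"value": value, "created": time.time(), "source": source}

    def on_source_event(self, source: str) -> None:
        """Event-driven invalidation: purge everything derived from a changed source."""
        self._entries = {k: v for k, v in self._entries.items() if v["source"] != source}

    def get(self, key: str) -> str | None:
        """Confidence degradation: serve only while age-decayed confidence stays above the floor."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        age = time.time() - entry["created"]
        confidence = 0.5 ** (age / self.half_life_s)   # halves every half-life
        return entry["value"] if confidence >= self.confidence_floor else None


# Usage: a returns-policy change purges policy-derived answers immediately;
# everything else ages out and falls through to the real-time path.
cache = PrecomputedCache()
cache.put("faq:return-window", "Return window is 30 days", source="returns_policy")
cache.on_source_event("returns_policy")
assert cache.get("faq:return-window") is None  # caller now regenerates in real time
```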
The Break-Even Analysis
When is pre-computation cheaper than real-time?
graph LR
subgraph "Pre-Computation Cost"
PC1["Batch compute: run model<br/>on N items"] --> PC2["Storage: store results<br/>in cache/DB"]
PC2 --> PC3["Refresh: re-run periodically"]
PC3 --> PC4["Total = batch_cost +<br/>storage + refresh_cost"]
end
subgraph "Real-Time Cost"
RT1["Per-request compute:<br/>model inference"] --> RT2["Scales linearly<br/>with traffic"]
RT2 --> RT3["Total = cost_per_request<br/>× request_volume"]
end
PC4 --> BEP{"Break Even<br/>Point"}
RT3 --> BEP
style BEP fill:#54a0ff,stroke:#333,color:#000
Break-Even Formula
$$\text{Break-even volume} = \frac{\text{Batch Cost} + \text{Storage Cost}}{\text{Real-time Cost Per Request} - \text{Serve Cost Per Request}}$$
Concrete Example: Recommendation Candidates
| Factor | Value |
|---|---|
| Batch compute cost (50 candidates × 5M users, every 6h) | $2,400/day |
| Storage cost (DynamoDB) | $200/day |
| Refresh cost (4 runs/day) | Included in batch |
| Real-time cost per recommendation request | $0.012 |
| Cache-serve cost per recommendation request | $0.0001 |
| Daily recommendation requests | 400,000 |
$$\text{Real-time daily cost} = 400{,}000 \times \$0.012 = \$4{,}800$$
$$\text{Pre-compute daily cost} = \$2{,}400 + \$200 + (400{,}000 \times \$0.0001) = \$2{,}640$$
Pre-computation saves $2,160/day ($64,800/month) for recommendations alone.
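The same arithmetic as a small sketch, plugging in the numbers from the table (all figures per day; function and variable names are illustrative):

```python
def precompute_economics(batch_cost: float, storage_cost: float,
                         rt_cost_per_req: float, serve_cost_per_req: float,
                         daily_requests: int) -> tuple[float, float]:
    """Return (break-even daily volume, daily savings from pre-computation)."""
    break_even = (batch_cost + storage_cost) / (rt_cost_per_req - serve_cost_per_req)
    real_time = daily_requests * rt_cost_per_req
    pre_compute = batch_cost + storage_cost + daily_requests * serve_cost_per_req
    return break_even, real_time - pre_compute


break_even, savings = precompute_economics(2_400, 200, 0.012, 0.0001, 400_000)
print(f"break-even at ~{break_even:,.0f} requests/day")  # ~218,487
print(f"daily savings: ${savings:,.0f}")                  # $2,160
```

At 400,000 requests/day the system sits well above the ~218K break-even volume, which is why pre-computation wins for this task.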
What Should NEVER Be Pre-Computed
| Task | Why Not |
|---|---|
| LLM response generation | Requires conversation context that doesn't exist until the user types |
| Intent classification | Must classify the actual message |
| Guardrail validation | Must validate the actual response |
| Price lookups | Prices change too frequently; must always be live |
| Cart-dependent operations | Cart state changes between queries |
2026 Update: Prefer Event-Driven Hybrid Inference over Giant Batch Jobs
Treat everything above this section as the baseline hybrid architecture. This update preserves that prior model and shows how the current architecture shifts toward more event-driven refinement and invalidation.
The strongest current pattern is a three-lane architecture: offline batch for slow-changing artifacts, event-driven refresh for freshness-critical derived artifacts, and on-demand refinement for live session context.
- Pre-compute candidate sets, embeddings, summaries, and reusable evidence blocks, but keep final wording, personalization, price/inventory validation, and session reasoning real-time.
- Use event-driven invalidation and refresh for purchases, catalog changes, policy updates, and promo changes. Batch-only refresh windows are usually too coarse for customer-facing experiences.
- Treat prompt or prefix caching as part of the hybrid design. A stable base prefix plus live delta is often cheaper and fresher than regenerating everything or serving a fully stale artifact.
- Add an explicit staleness contract to every precomputed artifact: `freshness_sla`, `invalidated_by`, `fallback_if_stale`, and `owner` (see the sketch after this list).
- Revisit break-even calculations using the current serving stack. Prefix caching, prompt caching, and cheaper rerank/refine passes have improved the economics of near-real-time refinement.
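A minimal sketch of such a staleness contract attached to a precomputed artifact; the field names follow the bullet above, everything else is an illustrative assumption:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StalenessContract:
    freshness_sla: timedelta           # max acceptable age before the artifact counts as stale
    invalidated_by: tuple[str, ...]    # event types that purge it immediately
    fallback_if_stale: str             # e.g. "real_time_pipeline" or "serve_with_warning"
    owner: str                         # team accountable for the SLA


@dataclass
class PrecomputedArtifact:
    key: str
    value: str
    contract: StalenessContract
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def is_stale(self, now: datetime | None = None) -> bool:
        now = now or datetime.now(timezone.utc)
        return now - self.created_at > self.contract.freshness_sla


faq_contract = StalenessContract(
    freshness_sla=timedelta(hours=6),
    invalidated_by=("policy_changed", "catalog_updated"),
    fallback_if_stale="real_time_pipeline",
    owner="inference-team",
)
```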
Recent references: AWS Bedrock prompt caching, Anthropic prompt caching, vLLM automatic prefix caching, Anthropic contextual retrieval.
Reversal Triggers
| Trigger | Action |
|---|---|
| Pre-computed recommendation hit rate drops below 60% | Users' real-time queries diverge from pre-computed candidates; increase candidate pool or refresh more often |
| Stale answer causes a customer complaint | Tighten invalidation; potentially move that content type to real-time |
| Batch compute cost exceeds real-time cost for any task | Switch that task to real-time |
| New model makes real-time inference 5x cheaper | Re-evaluate all pre-computed tasks |
| Off-peak batch jobs start impacting peak traffic (resource contention) | Move batch to dedicated compute or different region |
Impact on Trilemma
| Dimension | Fully Real-Time | Fully Pre-Computed | Hybrid (Decision) |
|---|---|---|---|
| Cost | $$$$ | $ | $$ (batch is cheap) |
| Performance | Slow (800ms+ for reco) | Fast (5ms cache hit) | Fast-ish (150ms rerank) |
| Quality | Best (full context) | Poor (stale, no context) | Good (context-aware rerank) |
| QACPI | Low (high cost, high latency) | Medium (low quality) | Highest (balanced) |