# US-04: Real-Time vs Pre-Computed Inference

## User Story

As an ML engineering lead, I want to decide what to compute in real time versus pre-compute in batch, so that we minimize real-time latency and cost without serving stale or irrelevant results.

## The Debate

```mermaid
graph TD
    subgraph "Performance Team"
        P["Pre-compute everything!<br/>Real-time inference is slow.<br/>If we pre-generate top 100<br/>recommendations per user,<br/>response time drops to<br/>a cache lookup."]
    end

    subgraph "Inference Team"
        I["Pre-computed results go stale.<br/>The user's current page context<br/>and conversation turn matter.<br/>You can't pre-compute a response to<br/>a question nobody asked yet."]
    end

    subgraph "Cost Team"
        C["Batch inference is 3-5x cheaper<br/>per unit than real-time.<br/>But storing and refreshing<br/>pre-computed results costs money too.<br/>What's the break-even point?"]
    end

    P ---|"Staleness<br/>concern"| I
    I ---|"Batch cost vs<br/>real-time cost"| C
    C ---|"Latency<br/>SLA"| P

    style P fill:#4ecdc4,stroke:#333,color:#000
    style I fill:#ff6b6b,stroke:#333,color:#000
    style C fill:#f9d71c,stroke:#333,color:#000
```

## Acceptance Criteria

- A clear classification exists for every data/inference type: real-time, near-real-time, or batch.
- Pre-computed data has defined freshness SLAs and staleness alerts.
- Real-time inference is reserved for tasks that genuinely need conversation context.
- Batch pipelines run during off-peak hours to minimize compute cost.
- Hybrid paths exist: pre-computed base + real-time refinement.

## The Spectrum: Not Binary

```mermaid
graph LR
    subgraph "Fully Pre-Computed"
        A["Generated hours/days ahead<br/>Stored in cache/DB<br/>Nearest-neighbor lookup<br/>at request time"]
    end

    subgraph "Near-Real-Time"
        B["Pre-computed base<br/>+ lightweight real-time<br/>refinement at request time"]
    end

    subgraph "Fully Real-Time"
        C["Computed from scratch<br/>at request time<br/>using full conversation<br/>context"]
    end

    A -->|"Freshness<br/>increases"| B
    B -->|"Latency<br/>increases"| C

    style A fill:#2d8659,stroke:#333,color:#fff
    style B fill:#fd9644,stroke:#333,color:#000
    style C fill:#eb3b5a,stroke:#333,color:#fff
```

## Classification of Every Inference Task

| Task | Mode | Refresh Cycle | Rationale |
|---|---|---|---|
| User preference embeddings | Batch | Every 6 hours | Change slowly; based on purchase/browse history |
| Top-N recommendations per user | Batch | Every 6 hours | Pre-compute top 50 candidates per user |
| Recommendation reranking | Near-RT | At request time | Rerank pre-computed candidates using current context |
| Product embeddings for RAG | Batch | Every 6 hours or on catalog event | Product data changes infrequently |
| Query embedding | Real-time | Every request | Each query is unique |
| RAG retrieval (KNN search) | Real-time | Every request | Depends on query embedding |
| Intent classification | Real-time | Every request | Must classify the actual user message |
| LLM response generation | Real-time | Every request | Needs conversation context + retrieved data |
| FAQ answer compilation | Near-RT | Pre-fetch top 50 FAQ answers; refine at request time | Many FAQ questions are repeated |
| Promotion matching | Batch | Every 15 min or on promo event | Promotions change infrequently |
| Guardrail checks | Real-time | Every request | Must validate the actual response |
| Review summaries | Batch | Daily | Reviews change slowly |
| Trending/popular rankings | Batch | Hourly | Based on aggregate signals |
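
One way to keep this classification enforceable rather than aspirational is a central registry that pipelines consult before scheduling work. A minimal Python sketch, with hypothetical task names and fields (abridged to a few rows of the table):

```python
from enum import Enum

class Mode(Enum):
    BATCH = "batch"             # pre-computed on a schedule or on events
    NEAR_RT = "near_real_time"  # pre-computed base + request-time refinement
    REAL_TIME = "real_time"     # computed from scratch at request time

# Hypothetical registry mirroring the table above (abridged).
INFERENCE_REGISTRY = {
    "user_preference_embeddings": {"mode": Mode.BATCH,     "refresh": "6h"},
    "recommendation_reranking":   {"mode": Mode.NEAR_RT,   "refresh": "per request"},
    "query_embedding":            {"mode": Mode.REAL_TIME, "refresh": "per request"},
    "promotion_matching":         {"mode": Mode.BATCH,     "refresh": "15m or promo event"},
    "llm_response_generation":    {"mode": Mode.REAL_TIME, "refresh": "per request"},
}

def needs_request_time_compute(task: str) -> bool:
    """True if any part of the task must run at request time."""
    return INFERENCE_REGISTRY[task]["mode"] is not Mode.BATCH
```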

## Deep Dive: Recommendation Pipeline (Hybrid Approach)

The recommendation flow is the best example of the hybrid pattern: batch pre-computation + real-time refinement.

```mermaid
graph TD
    subgraph "Batch Pipeline (runs every 6 hours)"
        B1["User purchase history"] --> B2["Collaborative filtering<br/>model inference"]
        B3["Browsing signals"] --> B2
        B2 --> B4["Top 50 candidate ASINs<br/>per user"]
        B4 --> B5["Store in DynamoDB<br/>key: user_id"]
    end

    subgraph "Real-Time Pipeline (at request time)"
        R1["Load pre-computed<br/>50 candidates"] --> R2["Filter by<br/>current context"]
        R3["Current page ASIN<br/>+ conversation history"] --> R2
        R2 --> R4["Rerank with<br/>lightweight model"]
        R4 --> R5["Top 5 final<br/>recommendations"]
        R5 --> R6["Feed to LLM<br/>for natural language"]
    end

    B5 -.->|"cache lookup<br/>5ms"| R1

    style B2 fill:#2d8659,stroke:#333,color:#fff
    style R4 fill:#fd9644,stroke:#333,color:#000
    style R6 fill:#eb3b5a,stroke:#333,color:#fff
```
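
In code, the request-time half collapses to a cache read plus a cheap rerank. A minimal sketch, assuming a `candidate_cache` handle over the DynamoDB table and a `rerank_model` exposing a `score` method (both interfaces are hypothetical):

```python
def recommend(user_id: str, current_asin: str, candidate_cache, rerank_model,
              top_k: int = 5) -> list[dict]:
    """Request-time half of the hybrid pipeline (sketch)."""
    # ~5ms lookup: top 50 candidates pre-computed by the 6-hour batch job
    candidates = candidate_cache.get(user_id) or []

    # Filter by current context, e.g. drop the product already on screen
    candidates = [c for c in candidates if c["asin"] != current_asin]

    # Lightweight real-time rerank using the current page as context
    scored = [(rerank_model.score(c, context_asin=current_asin), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Top 5 are handed to the LLM for natural-language presentation
    return [c for _, c in scored[:top_k]]
```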

## Cost Comparison: Fully Real-Time vs Hybrid

| Approach | Compute Per Request | Monthly Cost (1M req/day) | Latency Added | Quality |
|---|---|---|---|---|
| Full real-time recommendation | Run collaborative filtering + reranking + LLM | $180,000 | +800ms | 0.92 |
| Hybrid (batch + rerank + LLM) | Cache lookup + lightweight rerank + LLM | $52,000 | +150ms for rerank | 0.89 |
| Fully pre-computed (incl. LLM response) | Cache lookup only | $8,000 | +5ms | 0.71 |

**Decision: Hybrid.** The $128K/month savings versus fully real-time is worth the 0.03 quality drop. Fully pre-computed saves more, but quality falls sharply because the LLM response can't incorporate conversation context.


## Deep Dive: FAQ Pre-Compilation

```mermaid
graph TD
    subgraph "Batch: Pre-compile top 50 FAQs"
        F1["Analyze query logs<br/>→ top 50 FAQ questions"] --> F2["Generate Haiku answers<br/>for each variation"]
        F2 --> F3["Store in ElastiCache<br/>key: faq_question_hash"]
    end

    subgraph "Real-Time: Serve or generate"
        R1["User FAQ question"] --> R2{"Semantic similarity<br/>to pre-compiled FAQ<br/>> 0.92?"}
        R2 -->|"Yes"| R3["Return pre-compiled<br/>answer (5ms)"]
        R2 -->|"No"| R4["Full RAG + LLM<br/>pipeline (1,200ms)"]
    end

    F3 -.-> R2

    style R3 fill:#2d8659,stroke:#333,color:#fff
    style R4 fill:#eb3b5a,stroke:#333,color:#fff
```
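
The serve-or-generate gate is just a similarity check against the pre-compiled index. A minimal sketch using cosine similarity over stored question embeddings (`faq_index` and `full_pipeline` are stand-ins for the ElastiCache index and the RAG + LLM path):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # the decision diamond above

def answer_faq(query_embedding: np.ndarray, faq_index: list[dict], full_pipeline):
    """Serve the pre-compiled answer when similarity clears the threshold,
    otherwise fall through to the full RAG + LLM pipeline (sketch)."""
    best_answer, best_sim = None, -1.0
    for faq in faq_index:  # each entry: {"embedding": ndarray, "answer": str}
        sim = float(np.dot(query_embedding, faq["embedding"])
                    / (np.linalg.norm(query_embedding)
                       * np.linalg.norm(faq["embedding"])))
        if sim > best_sim:
            best_answer, best_sim = faq["answer"], sim
    if best_sim > SIMILARITY_THRESHOLD:
        return best_answer                   # ~5ms path
    return full_pipeline(query_embedding)    # ~1,200ms path
```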

## The Staleness Problem

```mermaid
graph TD
    A["Pre-computed answer:<br/>'Return window is 30 days'"] --> B{"Policy changed<br/>to 14 days"}
    B --> C["Stale answer served<br/>for up to 6 hours"]
    C --> D["Customer returns after<br/>day 15, gets denied"]
    D --> E["Support escalation<br/>+ trust damage"]

    style C fill:#eb3b5a,stroke:#333,color:#fff
    style E fill:#ff6b6b,stroke:#333,color:#000
```

### Staleness Mitigation Strategy

| Mitigation | How | Tradeoff |
|---|---|---|
| Event-driven invalidation | Policy change event → purge FAQ cache | Requires event infrastructure; adds complexity |
| Version tagging | Each pre-compiled answer carries a source version; compare at serve time | Adds 10ms per request for version check |
| Confidence degradation | Reduce confidence score as answer ages; fall through to real-time below threshold | Increases real-time load as cache ages |
| Content hash comparison | Hash source document; if hash changed, invalidate | Catches all changes but requires source access |

**Decision:** Event-driven invalidation for policy changes (high stakes), plus confidence degradation for editorial content (lower stakes). This catches the dangerous staleness (a wrong policy) cheaply while tolerating mild staleness in recommendation-style content.
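
Of the two chosen mitigations, confidence degradation is the easier to sketch: decay each cached artifact's confidence with age and fall through to real-time inference below a threshold. The half-life and threshold here are illustrative, not tuned values:

```python
import time

def effective_confidence(base_confidence: float, computed_at: float,
                         half_life_hours: float = 6.0) -> float:
    """Exponentially decay a cached answer's confidence as it ages."""
    age_hours = (time.time() - computed_at) / 3600
    return base_confidence * 0.5 ** (age_hours / half_life_hours)

def serve_or_fall_through(entry: dict, threshold: float = 0.75):
    """Return the cached answer while confidence holds; None routes the
    caller to the real-time pipeline."""
    if effective_confidence(entry["confidence"], entry["computed_at"]) >= threshold:
        return entry["answer"]
    return None
```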


## The Break-Even Analysis

When is pre-computation cheaper than real-time?

```mermaid
graph LR
    subgraph "Pre-Computation Cost"
        PC1["Batch compute: run model<br/>on N items"] --> PC2["Storage: store results<br/>in cache/DB"]
        PC2 --> PC3["Refresh: re-run periodically"]
        PC3 --> PC4["Total = batch_cost +<br/>storage + refresh_cost"]
    end

    subgraph "Real-Time Cost"
        RT1["Per-request compute:<br/>model inference"] --> RT2["Scales linearly<br/>with traffic"]
        RT2 --> RT3["Total = cost_per_request<br/>× request_volume"]
    end

    PC4 --> BEP{"Break Even<br/>Point"}
    RT3 --> BEP

    style BEP fill:#54a0ff,stroke:#333,color:#000
```

### Break-Even Formula

$$\text{Break-even volume} = \frac{\text{Batch Cost} + \text{Storage Cost}}{\text{Real-time Cost Per Request} - \text{Serve Cost Per Request}}$$

### Concrete Example: Recommendation Candidates

| Factor | Value |
|---|---|
| Batch compute cost (50 candidates × 5M users, every 6h) | $2,400/day |
| Storage cost (DynamoDB) | $200/day |
| Refresh cost (4 runs/day) | Included in batch |
| Real-time cost per recommendation request | $0.012 |
| Cache-serve cost per recommendation request | $0.0001 |
| Daily recommendation requests | 400,000 |

$$\text{Real-time daily cost} = 400{,}000 \times \$0.012 = \$4{,}800$$

$$\text{Pre-compute daily cost} = \$2{,}400 + \$200 + (400{,}000 \times \$0.0001) = \$2{,}640$$

**Pre-computation saves $2,160/day ($64,800/month) for recommendations alone.**
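
The same numbers drop straight into the break-even formula and show the crossover sits well below current traffic; a small sketch to check the arithmetic:

```python
def break_even_volume(batch_cost: float, storage_cost: float,
                      rt_cost_per_req: float, serve_cost_per_req: float) -> float:
    """Daily request volume above which pre-computation is cheaper."""
    return (batch_cost + storage_cost) / (rt_cost_per_req - serve_cost_per_req)

print(f"{break_even_volume(2_400, 200, 0.012, 0.0001):,.0f} req/day")  # ~218,487

rt_daily = 400_000 * 0.012                  # $4,800 fully real-time
pre_daily = 2_400 + 200 + 400_000 * 0.0001  # $2,640 pre-computed
print(f"daily savings: ${rt_daily - pre_daily:,.0f}")  # $2,160
```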


## What Should NEVER Be Pre-Computed

| Task | Why Not |
|---|---|
| LLM response generation | Requires conversation context that doesn't exist until the user types |
| Intent classification | Must classify the actual message |
| Guardrail validation | Must validate the actual response |
| Price lookups | Prices change too frequently; must always be live |
| Cart-dependent operations | Cart state changes between queries |

## 2026 Update: Prefer Event-Driven Hybrid Inference over Giant Batch Jobs

Treat everything above this section as the baseline hybrid architecture. This update preserves that model and describes how the architecture has since shifted toward more event-driven refinement and invalidation.

The strongest current pattern is a three-lane architecture: offline batch for slow-changing artifacts, event-driven refresh for freshness-critical derived artifacts, and on-demand refinement for live session context.

- Pre-compute candidate sets, embeddings, summaries, and reusable evidence blocks, but keep final wording, personalization, price/inventory validation, and session reasoning real-time.
- Use event-driven invalidation and refresh for purchases, catalog changes, policy updates, and promo changes. Batch-only refresh windows are usually too coarse for customer-facing experiences.
- Treat prompt or prefix caching as part of the hybrid design. A stable base prefix plus live delta is often cheaper and fresher than regenerating everything or serving a fully stale artifact (see the prompt-assembly sketch below).
- Add an explicit staleness contract to every precomputed artifact: freshness_sla, invalidated_by, fallback_if_stale, and owner (a minimal shape follows this list).
- Revisit break-even calculations using the current serving stack. Prefix caching, prompt caching, and cheaper rerank/refine passes have improved the economics of near-real-time refinement.
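
A minimal shape for that staleness contract, assuming a Python dataclass carried alongside each pre-computed artifact (field values are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class StalenessContract:
    freshness_sla: str                     # e.g. "6h", or "event" for event-driven refresh
    invalidated_by: list = field(default_factory=list)  # events that purge the artifact
    fallback_if_stale: str = "real_time"   # route to take when the SLA is blown
    owner: str = ""                        # team accountable for refresh and alerts

faq_contract = StalenessContract(
    freshness_sla="event",
    invalidated_by=["policy_change"],
    fallback_if_stale="full_rag_pipeline",
    owner="support-ml",
)
```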

Recent references: AWS Bedrock prompt caching, Anthropic prompt caching, vLLM automatic prefix caching, Anthropic contextual retrieval.
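
To make the prefix-caching bullet concrete: keep everything pre-computed byte-stable at the front of the prompt so the serving stack's prefix/prompt cache can reuse it, and append only the live delta. A provider-agnostic sketch of the pattern (not any specific caching API):

```python
def assemble_prompt(system_rules: str, evidence_blocks: list[str],
                    conversation: list[str], live_facts: str) -> str:
    """Stable-prefix + live-delta prompt assembly (sketch).

    The prefix never varies between requests, so cached KV state can be
    reused; only the delta (conversation turn, live price/inventory) changes.
    """
    prefix = system_rules + "\n\n" + "\n\n".join(evidence_blocks)  # cacheable
    delta = "\n".join(conversation) + "\n" + live_facts            # per-request
    return prefix + "\n\n" + delta
```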

## Reversal Triggers

| Trigger | Action |
|---|---|
| Pre-computed recommendation hit rate drops below 60% | Users' real-time queries diverge from pre-computed candidates; increase candidate pool or refresh more often |
| Stale answer causes a customer complaint | Tighten invalidation; potentially move that content type to real-time |
| Batch compute cost exceeds real-time cost for any task | Switch that task to real-time |
| New model makes real-time inference 5x cheaper | Re-evaluate all pre-computed tasks |
| Off-peak batch jobs start impacting peak traffic (resource contention) | Move batch to dedicated compute or a different region |

## Impact on Trilemma

| Dimension | Fully Real-Time | Fully Pre-Computed | Hybrid (Decision) |
|---|---|---|---|
| Cost | $$$$ | $$ | $$ (batch is cheap) |
| Performance | Slow (800ms+ for reco) | Fast (5ms cache hit) | Fast-ish (150ms rerank) |
| Quality | Best (full context) | Poor (stale, no context) | Good (context-aware rerank) |
| QACPI | Low (high cost, high latency) | Medium (low quality) | Highest (balanced) |