
Scenario 5: Context Window Silent Truncation

Scenario Summary

The application keeps appending chat history and retrieved context until the request nears the model's context limit. Older turns are silently dropped, the model loses key user preferences, and the answers become confidently wrong without obvious platform errors.

Why It Matters

Long-context support does not remove the need for memory design. This scenario tests whether the architecture treats context as a budgeted resource instead of an unlimited transcript dump.

Failure Pattern

Design area      | Weak choice                              | Better choice
Memory strategy  | Fixed number of turns regardless of size | Token-aware history assembly
Compression      | Drop oldest turns silently               | Summarize and log compression decisions
Observability    | Only HTTP status and latency             | Metrics for dropped turns and compression rate

Deep Dive

Conversation systems fail here because the architecture confuses "store everything" with "send everything." Durable storage can keep the full transcript, but the model invocation should receive only the most relevant and affordable context. Good designs reserve budget for:

  • system instructions,
  • retrieved evidence,
  • current user request,
  • recent conversation turns,
  • summarized history when needed.
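The budget reservation above can be sketched as a small assembly routine. This is a minimal, illustrative sketch: the 4-characters-per-token estimate is a crude assumption (use your model provider's tokenizer in practice), and all function and parameter names here are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Replace with the provider's tokenizer for real budgeting.
    return max(1, len(text) // 4)

def assemble_context(system_prompt, evidence, request, turns, summary,
                     context_limit=8000, output_reserve=1000):
    """Fill the prompt in priority order, keeping the newest turns,
    and stop before the input budget is exhausted."""
    budget = context_limit - output_reserve
    parts = []
    # Highest-priority items are always included and charged first.
    for required in (system_prompt, evidence, request):
        parts.append(required)
        budget -= estimate_tokens(required)
    # Add recent turns, newest first, while the budget allows.
    kept = []
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    dropped = len(turns) - len(kept)
    # Fall back to a running summary when older turns were cut,
    # rather than losing them silently.
    if dropped and summary and estimate_tokens(summary) <= budget:
        parts.append(summary)
    parts.extend(reversed(kept))
    return parts, dropped
```

Returning the `dropped` count alongside the assembled parts is what makes truncation observable instead of silent.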

Detection Signals

  • Quality drops mainly in long sessions
  • The assistant contradicts user preferences stated earlier in the conversation
  • Token counts rise steadily while answer relevance falls

Runbook

  1. Add token estimation before every foundation model (FM) call.
  2. Reserve explicit budget for system prompt, RAG context, and output.
  3. Summarize older context instead of silently trimming it.
  4. Emit metrics when turns are dropped or compressed.
  5. Review memory retention rules for privacy, cost, and relevance together.
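Runbook steps 3 and 4 can be sketched together: summarize older turns instead of trimming them, and emit an observable signal each time. This is an illustrative sketch under assumptions; the `summarize` callable stands in for whatever summarization call your stack uses, and the logging call is a placeholder for your real metrics sink.

```python
import logging

logger = logging.getLogger("context")

def compress_history(turns, max_turns, summarize):
    """Keep the newest max_turns verbatim; fold older turns into a
    summary and log the compression decision instead of dropping silently."""
    if len(turns) <= max_turns:
        return turns, None
    older, recent = turns[:-max_turns], turns[-max_turns:]
    summary = summarize(older)  # placeholder for any summarization call
    # Runbook step 4: emit a metric/log line whenever compression happens.
    logger.info("compressed_turns=%d summary_chars=%d",
                len(older), len(summary))
    return recent, summary
```

Wiring this log line into a metric (dropped-turn count, compression rate) gives you the detection signals listed above without changing the request path.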

Questions To Ask

  • What is our safe input budget after reserving space for output and grounding?
  • Which parts of the session must be preserved verbatim versus summarized?
  • How will we know when compression starts affecting quality?
  • Should durable state and conversational history be stored differently?

Interview Drill

How would you explain the difference between conversation storage and model context assembly to a product stakeholder?

Good Outcome

The system treats context as a governed budget, logs compression behavior, and preserves the most valuable user state instead of relying on silent truncation.