Scenario 5: Context Window Silent Truncation
Scenario Summary
The application keeps appending chat history and retrieved context until the request nears the model's context limit. Older turns are silently dropped, the model loses key user preferences, and the answers become confidently wrong without obvious platform errors.
Why It Matters
Long-context support does not remove the need for memory design. This scenario tests whether the architecture treats context as a budgeted resource instead of an unlimited transcript dump.
Failure Pattern
| Design area | Weak choice | Better choice |
|---|---|---|
| Memory strategy | Fixed number of turns regardless of size | Token-aware history assembly |
| Compression | Drop oldest turns silently | Summarize and log compression decisions |
| Observability | Only HTTP status and latency | Metrics for dropped turns and compression rate |
Deep Dive
Conversation systems fail here because the architecture confuses "store everything" with "send everything." Durable storage can keep the full transcript, but the model invocation should receive only the most relevant and affordable context. Good designs reserve budget for:
- system instructions,
- retrieved evidence,
- current user request,
- recent conversation turns,
- summarized history when needed.
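The budgeting above can be sketched as a small assembly function. This is a minimal illustration, not a production implementation: `estimate_tokens` is a crude character-based stand-in for a real tokenizer, and the budget number is illustrative.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real system would use the model provider's tokenizer instead.
    return max(1, len(text) // 4)

def assemble_context(system: str, evidence: str, request: str,
                     turns: list[str], summary: str,
                     input_budget: int = 6000) -> list[str]:
    """Reserve budget for the must-have parts first, then add recent
    turns newest-first until the remaining budget runs out. If any
    turns were cut, try to include the summary in their place."""
    parts = [system, evidence, request]
    remaining = input_budget - sum(estimate_tokens(p) for p in parts)

    kept: list[str] = []
    for turn in reversed(turns):          # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > remaining:
            break                         # budget exhausted: stop, don't drop silently
        kept.append(turn)
        remaining -= cost

    if len(kept) < len(turns) and estimate_tokens(summary) <= remaining:
        kept.append(summary)              # summary stands in for cut turns

    return [system, evidence] + list(reversed(kept)) + [request]
```

Note the priority order: system instructions, retrieved evidence, and the current request are funded first; conversation history only spends what is left.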
Detection Signals
- Quality drops mainly in long sessions
- The assistant contradicts user preferences stated earlier in the conversation
- Token counts rise steadily while answer relevance falls
Runbook
- Add token estimation before every foundation model (FM) call.
- Reserve explicit budget for system prompt, RAG context, and output.
- Summarize older context instead of silently trimming it.
- Emit metrics when turns are dropped or compressed.
- Review memory retention rules for privacy, cost, and relevance together.
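The "emit metrics" step might look like the following. Field names and the logger name are illustrative; in practice these records would feed a metrics pipeline rather than plain logs.

```python
import logging

log = logging.getLogger("context_budget")

def record_compression(session_id: str, turns_total: int,
                       turns_dropped: int, turns_summarized: int) -> dict:
    """Emit a structured event whenever history is trimmed, so
    dashboards can track dropped-turn and compression rates and
    reviewers can see exactly what the model did not receive."""
    event = {
        "session_id": session_id,
        "turns_total": turns_total,
        "turns_dropped": turns_dropped,
        "turns_summarized": turns_summarized,
        "compression_rate": (turns_dropped + turns_summarized) / max(turns_total, 1),
    }
    log.info("context_compression %s", event)
    return event
```

The point is that compression becomes an observable, reviewable decision instead of an invisible side effect of hitting the context limit.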
Questions To Ask
- What is our safe input budget after reserving space for output and grounding?
- Which parts of the session must be preserved verbatim versus summarized?
- How will we know when compression starts affecting quality?
- Should durable state and conversational history be stored differently?
Interview Drill
How would you explain the difference between conversation storage and model context assembly to a product stakeholder?
Good Outcome
The system treats context as a governed budget, logs compression behavior, and preserves the most valuable user state instead of relying on silent truncation.