03. Foundational Model Optimization Techniques
Optimization Goals
Prompt optimization in MangaAssist is not only about answer quality.
It must balance four objectives at the same time:
- accuracy
- latency
- cost
- output stability
Optimization Matrix
| Technique | Helps Most With | Main Tradeoff |
|---|---|---|
| Zero-shot | low cost and low latency | weaker style control |
| Few-shot | better behavioral consistency | token bloat |
| Role prompting | tone and boundary control | can become verbose |
| Structured-output prompting | schema compliance | brittle if model drifts |
| Context compression | token control | can remove nuance |
| Prompt caching | repeated-call cost | less flexible if context changes often |
| Model tiering | cost and latency | routing complexity |
| Output-length constraints | latency and cost | may cut off nuance |
| Temperature tuning | hallucination control (low) or stylistic range (high) | low values can flatten recommendation variety |
Zero-Shot vs Few-Shot
Zero-Shot Use Cases
- order tracking formatting
- promotion display
- simple FAQ explanation
- escalation summary generation
Few-Shot Use Cases
- recommendation explanation tone
- contrastive product comparison
- JSON-shaped output with nuanced fields
- refusal style that still feels helpful
Rule
Only add examples when the examples solve a repeatable failure mode.
If the examples only make the answer sound nicer in demos, they are usually not worth the token cost.
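As a sketch, a few-shot block targeting one repeatable failure mode (a refusal that still feels helpful) might be assembled like this; the example content and function names are invented for illustration:

```python
# Hypothetical few-shot examples targeting one repeatable failure mode:
# refusing a price question while still offering a next step.
FEW_SHOT_REFUSAL = [
    {"role": "user",
     "content": "What will volume 12 cost when it comes out?"},
    {"role": "assistant",
     "content": "I can't quote a price before it is listed. I can show the "
                "current series page or notify you when pre-orders open."},
]

def build_messages(system_prompt: str, user_turn: str) -> list[dict]:
    # Examples sit between the system rules and the live user turn.
    return ([{"role": "system", "content": system_prompt}]
            + FEW_SHOT_REFUSAL
            + [{"role": "user", "content": user_turn}])
```

The examples are only included for intents where the refusal failure mode actually recurs, consistent with the rule above.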
Role Prompting
Use role prompting to lock the model to a narrow operating mode:
- shopping assistant
- policy explainer
- support summarizer
- product comparison assistant
Avoid vague roles such as "expert manga advisor" in high-risk flows because they encourage speculation.
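A narrow role block for the support-summarizer mode might read as follows; the wording is an illustrative sketch, not project copy:

```
You are MangaAssist's support summarizer.
You only condense the conversation for a human agent.
You do not recommend products, quote prices, or speculate about policy.
```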
Structured-Output Prompting
When the UI expects structured data, prompt for a strict contract.
```
Return valid JSON with exactly these top-level keys:
- response_text
- products
- actions
- follow_up_suggestions
Do not wrap the JSON in markdown.
If a field has no values, return an empty array.
```
Best Practice
Treat schema prompting as one layer only. Add parser validation and repair outside the model.
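A minimal sketch of that outer layer, assuming the key names from the contract above; the fence-stripping and empty-array repair heuristics are illustrative assumptions, not a fixed spec:

```python
import json

REQUIRED_LIST_KEYS = {"products", "actions", "follow_up_suggestions"}

def parse_and_repair(raw: str) -> dict:
    """Parse model output, stripping markdown fences and filling missing keys."""
    text = raw.strip()
    # Models sometimes wrap JSON in ```json fences despite instructions.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    data = json.loads(text)
    # Missing or malformed list fields become empty arrays, per the contract.
    for key in REQUIRED_LIST_KEYS:
        if not isinstance(data.get(key), list):
            data[key] = []
    data.setdefault("response_text", "")
    return data
```

A production version would also handle `json.JSONDecodeError` with a retry or a fallback response; the point is that the contract is enforced outside the model, not only in the prompt.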
Token Budget Control
Budget Example
| Prompt Part | Budget |
|---|---|
| system rules | 250 to 400 tokens |
| workflow context | 100 to 200 tokens |
| retrieved chunks | 800 to 1500 tokens |
| conversation summary and recent turns | 300 to 800 tokens |
| output reservation | 300 to 700 tokens |
Budget Rules
- reserve output tokens first
- compress history before removing grounding
- shrink retrieved context before shrinking hard constraints
- keep per-intent variants smaller than one global prompt
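The rules above can be sketched as a packing order: reserve output tokens first, then keep grounding ahead of history when space runs out. The window size, reserve, and 4-characters-per-token estimate are illustrative stand-ins for a real tokenizer and model limits:

```python
CONTEXT_WINDOW = 4096   # assumed model limit for the sketch
OUTPUT_RESERVE = 700    # reserve output tokens first

def estimate_tokens(text: str) -> int:
    # Rough stand-in for a real tokenizer: ~4 characters per token.
    return max(1, len(text) // 4)

def pack_prompt(system_rules: str, chunks: list[str], history: list[str]) -> list[str]:
    """Always keep rules, then grounding, then as much recent history as fits."""
    budget = CONTEXT_WINDOW - OUTPUT_RESERVE - estimate_tokens(system_rules)
    kept_chunks, kept_history = [], []
    for chunk in chunks:  # grounding survives before history is considered
        if estimate_tokens(chunk) <= budget:
            kept_chunks.append(chunk)
            budget -= estimate_tokens(chunk)
    for turn in reversed(history):  # most recent turns are kept first
        if estimate_tokens(turn) <= budget:
            kept_history.append(turn)
            budget -= estimate_tokens(turn)
    return [system_rules] + list(reversed(kept_history)) + kept_chunks
```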
Context Compression Techniques
| Technique | Good For | Risk |
|---|---|---|
| summary turn | long conversations | may lose specific preferences |
| slot memory | repeated preferences | loses narrative detail |
| ranked context inclusion | crowded inputs | ranking error hides important facts |
| intent-specific context packing | narrow workflows | extra orchestration logic |
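The slot-memory row can be sketched as a small extractor that keeps repeated preferences as structured slots instead of replaying full turns. The slot names and matching patterns are assumptions for illustration:

```python
def update_slots(slots: dict, turn: str) -> dict:
    """Extract a few assumed preference patterns from a user turn."""
    lowered = turn.lower()
    if "no spoilers" in lowered:
        slots["spoilers"] = "avoid"
    for genre in ("romance", "horror", "sports"):
        if genre in lowered:
            slots.setdefault("genres", []).append(genre)
    return slots

def slots_to_context(slots: dict) -> str:
    # Compact one-line rendering for the prompt; narrative detail is lost,
    # which is exactly the risk noted in the table above.
    return "; ".join(f"{k}={v}" for k, v in slots.items())
```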
Temperature and Sampling
Lower Temperature
Use for:
- FAQ
- order support
- policy
- structured output
Higher Temperature
Use carefully for:
- recommendation wording
- follow-up suggestions
Practical Rule
Do not use a high temperature to compensate for weak retrieval or weak recommendation ranking.
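The guidance above maps naturally to a per-intent sampling table; the values here are illustrative defaults, not tuned numbers:

```python
# Illustrative per-intent temperatures: deterministic flows stay low,
# wording-oriented flows get modest range.
TEMPERATURE_BY_INTENT = {
    "faq": 0.1,
    "order_support": 0.1,
    "policy": 0.0,
    "structured_output": 0.0,
    "recommendation_wording": 0.7,
    "follow_up_suggestions": 0.7,
}

def sampling_params(intent: str) -> dict:
    # Unknown intents fall back to the conservative end.
    return {"temperature": TEMPERATURE_BY_INTENT.get(intent, 0.2)}
```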
Model Tiering
Example Routing
| Task | Preferred Model Behavior |
|---|---|
| policy paraphrase | fast, literal, low-variance |
| recommendation explanation | richer language, better comparative reasoning |
| chunk embedding | strong retrieval quality |
| summarization | concise and stable |
Why Tiering Helps
One FM is rarely optimal for every part of the pipeline.
Use cheaper or faster paths for deterministic formatting and reserve stronger models for open-ended comparison or multi-step responses.
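A minimal routing sketch for the table above; the model names are placeholders, not real endpoints:

```python
# Deterministic, low-variance tasks go to the small tier; open-ended
# comparison and explanation go to the large tier.
ROUTES = {
    "policy_paraphrase": "fast-small-model",
    "summarization": "fast-small-model",
    "recommendation_explanation": "strong-large-model",
}
DEFAULT_MODEL = "strong-large-model"

def route(task: str) -> str:
    """Unknown tasks default to the stronger tier rather than risk quality."""
    return ROUTES.get(task, DEFAULT_MODEL)
```

Defaulting unknown tasks to the stronger tier trades some cost for safety; the opposite default would be cheaper but riskier.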
Prompt Prefix Reuse and Caching
Reusable pieces:
- store persona
- hard rules
- output schema
- common locale instructions
Caching these prompt prefixes reduces repeated token cost in high-volume scenarios.
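One way to keep those pieces cacheable is to hold them in a byte-identical prefix and append volatile content after it, since prefix caches only hit when the leading text is unchanged. The section contents below are placeholders:

```python
# Stable prefix: persona, hard rules, schema, locale instructions.
# Volatile content (dates, user data, retrieved chunks) stays out of it.
STABLE_PREFIX = "\n\n".join([
    "PERSONA: MangaAssist store assistant.",
    "HARD RULES: never invent prices; cite policy text.",
    "OUTPUT SCHEMA: response_text, products, actions, follow_up_suggestions.",
    "LOCALE: reply in the customer's language.",
])

def build_prompt(dynamic_context: str, user_turn: str) -> str:
    # Cached prefix first, then per-request context and the live turn.
    return f"{STABLE_PREFIX}\n\n{dynamic_context}\n\nUSER: {user_turn}"
```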
FM-Specific Behavior Tendencies
This project should assume model families differ in predictable ways.
| Behavior Dimension | Practical Impact | Compensation Pattern |
|---|---|---|
| long-context obedience | model may ignore late instructions | keep hard rules early and short |
| JSON stability | some models drift on nested structures | use tighter contracts and parser repair |
| tone richness | some models write better recommendation copy | separate candidate selection from explanation |
| latency under large context | long prompts delay first token | compress history and filter retrieval earlier |
| refusal behavior | some models over-refuse | separate safety constraints from business uncertainty logic |
Optimization Decision Table
| Symptom | Try First | Do Not Try First |
|---|---|---|
| hallucinated prices | remove price generation responsibility | higher few-shot count |
| long latency | reduce prompt size or route to a faster model | blindly lowering max tokens |
| schema drift | tighten contract and add parser | add more narrative examples |
| weak recommendation explanation | better editorial grounding | higher temperature alone |
| generic FAQ answer | improve retrieval quality | more persona language |