
03. Foundational Model Optimization Techniques

Optimization Goals

Prompt optimization in MangaAssist is not only about answer quality.

It must balance four objectives at the same time:

  1. accuracy
  2. latency
  3. cost
  4. output stability

Optimization Matrix

| Technique | Helps Most With | Main Tradeoff |
| --- | --- | --- |
| Zero-shot | low cost and low latency | weaker style control |
| Few-shot | better behavioral consistency | token bloat |
| Role prompting | tone and boundary control | can become verbose |
| Structured-output prompting | schema compliance | brittle if model drifts |
| Context compression | token control | can remove nuance |
| Prompt caching | repeated-call cost | less flexible if context changes often |
| Model tiering | cost and latency | routing complexity |
| Output-length constraints | latency and cost | may cut off nuance |
| Temperature tuning | hallucination control or stylistic range | can flatten recommendations |

Zero-Shot vs Few-Shot

Zero-Shot Use Cases

  • order tracking formatting
  • promotion display
  • simple FAQ explanation
  • escalation summary generation

Few-Shot Use Cases

  • recommendation explanation tone
  • contrastive product comparison
  • JSON-shaped output with nuanced fields
  • refusal style that still feels helpful

Rule

Only add examples when they solve a repeatable failure mode.

If the examples only make the answer sound nicer in demos, they are usually not worth the token cost.
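
A minimal sketch of how this rule can be enforced in code, assuming a hypothetical FAILURE_MODE_EXAMPLES store keyed by documented failure modes; the names, keys, and example text are illustrative, not part of the MangaAssist codebase:

```python
# Hypothetical gated few-shot assembly; FAILURE_MODE_EXAMPLES and its
# contents are illustrative assumptions, not project constants.
FAILURE_MODE_EXAMPLES = {
    # Each entry exists because evals showed a repeatable failure,
    # e.g. comparisons that drifted into flat or generic tone.
    "comparison_tone": [
        {"user": "Is Vinland Saga darker than Vagabond?",
         "assistant": "Both are mature seinen titles. Vinland Saga leans "
                      "historical and reflective; Vagabond is more visceral."},
    ],
}

def build_prompt(system_rules, user_message, failure_mode=None):
    """Zero-shot by default; spend example tokens only on known failures."""
    messages = [{"role": "system", "content": system_rules}]
    for ex in FAILURE_MODE_EXAMPLES.get(failure_mode, []):
        messages.append({"role": "user", "content": ex["user"]})
        messages.append({"role": "assistant", "content": ex["assistant"]})
    messages.append({"role": "user", "content": user_message})
    return messages
```

The default path stays zero-shot, so example tokens are only spent when a documented failure mode is in play.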

Role Prompting

Use role prompting to lock the model to a narrow operating mode:

  • shopping assistant
  • policy explainer
  • support summarizer
  • product comparison assistant

Avoid vague roles such as "expert manga advisor" in high-risk flows because they encourage speculation.
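
One way to keep roles narrow is a per-flow lookup that fails closed; the flow keys and prompt wording below are illustrative assumptions, not project constants:

```python
# Illustrative per-flow role prompts mirroring the modes listed above.
ROLE_PROMPTS = {
    "shopping": ("You are a shopping assistant for a manga storefront. "
                 "Discuss only products present in the provided catalog context."),
    "policy": ("You are a policy explainer. Paraphrase only the policy text "
               "you are given; if the policy does not cover a case, say so."),
    "support_summary": ("You are a support summarizer. Produce a neutral, "
                        "factual summary of the conversation for a human agent."),
    "comparison": ("You are a product comparison assistant. Compare only the "
                   "candidate products supplied in context."),
}

def system_prompt(flow):
    # Fail closed: an unknown flow gets the most constrained role.
    return ROLE_PROMPTS.get(flow, ROLE_PROMPTS["policy"])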

Structured-Output Prompting

When the UI expects structured data, prompt for a strict contract.

```
Return valid JSON with exactly these top-level keys:
- response_text
- products
- actions
- follow_up_suggestions
Do not wrap the JSON in markdown.
If a field has no values, return an empty array.
```

Best Practice

Treat schema prompting as one layer of defense only. Add parser validation and repair outside the model.
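
A sketch of that outer layer, assuming the JSON contract above; parse_assistant_json and its repair steps are illustrative, not a fixed API:

```python
import json

REQUIRED_KEYS = {"response_text", "products", "actions", "follow_up_suggestions"}
LIST_KEYS = {"products", "actions", "follow_up_suggestions"}

def parse_assistant_json(raw):
    """Validate model output against the contract and repair common drift."""
    text = raw.strip()
    # Repair: some models wrap JSON in markdown fences despite instructions.
    if text.startswith("```"):
        lines = text.splitlines()[1:]          # drop opening fence
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]                 # drop closing fence
        text = "\n".join(lines)
    data = json.loads(text)  # JSONDecodeError -> caller retries or falls back
    # Repair: missing list fields become empty arrays, per the contract.
    for key in LIST_KEYS:
        data.setdefault(key, [])
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"contract violation, missing keys: {missing}")
    return data
```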

Token Budget Control

Budget Example

| Prompt Part | Budget |
| --- | --- |
| system rules | 250 to 400 tokens |
| workflow context | 100 to 200 tokens |
| retrieved chunks | 800 to 1500 tokens |
| conversation summary and recent turns | 300 to 800 tokens |
| output reservation | 300 to 700 tokens |

Budget Rules

  1. reserve output tokens first
  2. compress history before removing grounding
  3. shrink retrieved context before shrinking hard constraints
  4. keep per-intent variants smaller than one global prompt
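
The sketch below applies these rules in order; count_tokens is a crude stand-in for a real tokenizer, and the numeric ceilings echo the budget table above rather than any project constant:

```python
def count_tokens(text):
    return max(1, len(text) // 4)  # rough heuristic, an assumption only

def pack_prompt(system_rules, workflow_ctx, chunks, history_turns,
                context_window=8000, output_reservation=700):
    budget = context_window - output_reservation  # rule 1: reserve output first
    used = count_tokens(system_rules) + count_tokens(workflow_ctx)
    kept_chunks = []
    # Rule 3: retrieved context shrinks before hard constraints ever do,
    # but grounding still outranks conversation history.
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget - 300:  # hold ~300 tokens back for history
            break
        kept_chunks.append(chunk)
        used += cost
    kept_history = []
    # Rule 2: keep the newest turns; older history is dropped first.
    for turn in reversed(history_turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept_history.insert(0, turn)
        used += cost
    return "\n\n".join([system_rules, workflow_ctx, *kept_chunks, *kept_history])
```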

Context Compression Techniques

| Technique | Good For | Risk |
| --- | --- | --- |
| summary turn | long conversations | may lose specific preferences |
| slot memory | repeated preferences | loses narrative detail |
| ranked context inclusion | crowded inputs | ranking error hides important facts |
| intent-specific context packing | narrow workflows | extra orchestration logic |
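
As one concrete instance, slot memory from the table can be sketched as a small structured store; the PreferenceSlots fields are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceSlots:
    """Structured slots replace raw turns; narrative detail is lost by design."""
    genres: set = field(default_factory=set)
    avoid: set = field(default_factory=set)
    preferred_format: str = ""  # e.g. "digital" or "paperback"

    def to_context(self):
        # A few dozen tokens stand in for many conversation turns.
        return (
            f"User prefers: {', '.join(sorted(self.genres)) or 'unknown'}. "
            f"Avoid: {', '.join(sorted(self.avoid)) or 'none'}. "
            f"Format: {self.preferred_format or 'unspecified'}."
        )

slots = PreferenceSlots()
slots.genres.add("seinen")
slots.avoid.add("horror")
print(slots.to_context())
# -> User prefers: seinen. Avoid: horror. Format: unspecified.
```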

Temperature and Sampling

Lower Temperature

Use for:

  • FAQ
  • order support
  • policy
  • structured output

Higher Temperature

Use carefully for:

  • recommendation wording
  • follow-up suggestions

Practical Rule

Do not use a high temperature to compensate for weak retrieval or weak recommendation ranking.
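
One way to hold this line is a per-intent sampling table, so no handler can quietly raise temperature to paper over upstream problems; the values below are illustrative starting points, not tuned settings:

```python
# Hypothetical per-intent sampling settings, following the guidance above.
SAMPLING = {
    "faq":            {"temperature": 0.1, "top_p": 0.9},
    "order_support":  {"temperature": 0.0, "top_p": 1.0},
    "policy":         {"temperature": 0.0, "top_p": 1.0},
    "structured":     {"temperature": 0.0, "top_p": 1.0},
    "recommendation": {"temperature": 0.6, "top_p": 0.95},  # wording only
    "follow_ups":     {"temperature": 0.7, "top_p": 0.95},
}

def sampling_for(intent):
    # Default to the most conservative settings for unknown intents.
    return SAMPLING.get(intent, {"temperature": 0.0, "top_p": 1.0})
```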

Model Tiering

Example Routing

| Task | Preferred Model Behavior |
| --- | --- |
| policy paraphrase | fast, literal, low-variance |
| recommendation explanation | richer language, better comparative reasoning |
| chunk embedding | strong retrieval quality |
| summarization | concise and stable |

Why Tiering Helps

One FM is rarely optimal for every part of the pipeline.

Use cheaper or faster paths for deterministic formatting and reserve stronger models for open-ended comparison or multi-step responses.
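
A minimal routing sketch under those assumptions; the task keys and model names are placeholders, not real model IDs:

```python
# Hypothetical routing table mirroring the example above.
MODEL_TIERS = {
    "policy_paraphrase": "small-fast-model",       # literal, low variance
    "recommendation":    "large-reasoning-model",  # richer comparisons
    "summarization":     "small-fast-model",       # concise and stable
}

def route(task):
    # Unknown tasks go to the cheapest tier; escalate only on measured need.
    return MODEL_TIERS.get(task, "small-fast-model")
```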

Prompt Prefix Reuse and Caching

Reusable pieces:

  • store persona
  • hard rules
  • output schema
  • common locale instructions

Caching these prompt prefixes reduces repeated token cost in high-volume scenarios.
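
A sketch of how the stable prefix might be assembled, assuming provider-side caching that keys on an exact prefix match; every constant here is an illustrative placeholder:

```python
# Reusable pieces must stay byte-identical across calls and come before
# anything request-specific, or the cached prefix will never match.
STORE_PERSONA = "You are MangaAssist, the shopping assistant for ..."
HARD_RULES = "Never state a price that is not present in the provided context."
OUTPUT_SCHEMA = ("Return valid JSON with keys: response_text, products, "
                 "actions, follow_up_suggestions.")
LOCALE_INSTRUCTIONS = "Respond in the user's locale; default to en-US."

STABLE_PREFIX = "\n\n".join(
    [STORE_PERSONA, HARD_RULES, OUTPUT_SCHEMA, LOCALE_INSTRUCTIONS]
)

def build_messages(request_context, user_message):
    # Volatile content goes after the cached prefix, never inside it.
    return [
        {"role": "system", "content": STABLE_PREFIX},
        {"role": "user", "content": f"{request_context}\n\n{user_message}"},
    ]
```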

FM-Specific Behavior Tendencies

This project should assume model families differ in predictable ways.

| Behavior Dimension | Practical Impact | Compensation Pattern |
| --- | --- | --- |
| long-context obedience | model may ignore late instructions | keep hard rules early and short |
| JSON stability | some models drift on nested structures | use tighter contracts and parser repair |
| tone richness | some models write better recommendation copy | separate candidate selection from explanation |
| latency under large context | long prompts delay first token | compress history and filter retrieval earlier |
| refusal behavior | some models over-refuse | separate safety constraints from business uncertainty logic |

Optimization Decision Table

| Symptom | Try First | Do Not Try First |
| --- | --- | --- |
| hallucinated prices | remove price generation responsibility | higher few-shot count |
| long latency | reduce prompt size and model path | just lowering max tokens blindly |
| schema drift | tighten contract and add parser | more narrative examples |
| weak recommendation explanation | better editorial grounding | higher temperature alone |
| generic FAQ answer | improve retrieval quality | more persona language |