
03. Foundational Model Optimization Techniques

Optimization Goals

Prompt optimization in MangaAssist is not only about answer quality.

It must balance four objectives at the same time:

  1. accuracy
  2. latency
  3. cost
  4. output stability

Optimization Matrix

| Technique | Helps Most With | Main Tradeoff |
| --- | --- | --- |
| Zero-shot | low cost and low latency | weaker style control |
| Few-shot | better behavioral consistency | token bloat |
| Role prompting | tone and boundary control | can become verbose |
| Structured-output prompting | schema compliance | brittle if model drifts |
| Context compression | token control | can remove nuance |
| Prompt caching | repeated-call cost | less flexible if context changes often |
| Model tiering | cost and latency | routing complexity |
| Output-length constraints | latency and cost | may cut off nuance |
| Temperature tuning | hallucination control or stylistic range | can flatten recommendations |

Zero-Shot vs Few-Shot

Zero-Shot Use Cases

  • order tracking formatting
  • promotion display
  • simple FAQ explanation
  • escalation summary generation

Few-Shot Use Cases

  • recommendation explanation tone
  • contrastive product comparison
  • JSON-shaped output with nuanced fields
  • refusal style that still feels helpful

Rule

Only add examples when they solve a repeatable failure mode.

If the examples only make the answer sound nicer in demos, they are usually not worth the token cost.
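
A minimal sketch of how this rule can be enforced in code, assuming a hypothetical FAILURE_MODE_EXAMPLES store keyed by documented failure modes; the names, keys, and example text are illustrative, not part of the MangaAssist codebase:

```python
# Hypothetical gated few-shot assembly; FAILURE_MODE_EXAMPLES and its
# contents are illustrative assumptions, not project constants.
FAILURE_MODE_EXAMPLES = {
    # Each entry exists because evals showed a repeatable failure,
    # e.g. comparisons that drifted into flat or generic tone.
    "comparison_tone": [
        {"user": "Is Vinland Saga darker than Vagabond?",
         "assistant": "Both are mature seinen titles. Vinland Saga leans "
                      "historical and reflective; Vagabond is more visceral."},
    ],
}

def build_prompt(system_rules, user_message, failure_mode=None):
    """Zero-shot by default; spend example tokens only on known failures."""
    messages = [{"role": "system", "content": system_rules}]
    for ex in FAILURE_MODE_EXAMPLES.get(failure_mode, []):
        messages.append({"role": "user", "content": ex["user"]})
        messages.append({"role": "assistant", "content": ex["assistant"]})
    messages.append({"role": "user", "content": user_message})
    return messages
```

The default path stays zero-shot, so example tokens are only spent when a documented failure mode is in play.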

Role Prompting

Use role prompting to lock the model to a narrow operating mode:

  • shopping assistant
  • policy explainer
  • support summarizer
  • product comparison assistant

Avoid vague roles such as "expert manga advisor" in high-risk flows because they encourage speculation.
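
One way to keep roles narrow is a per-flow lookup that fails closed; the flow keys and prompt wording below are illustrative assumptions, not project constants:

```python
# Illustrative per-flow role prompts mirroring the modes listed above.
ROLE_PROMPTS = {
    "shopping": ("You are a shopping assistant for a manga storefront. "
                 "Discuss only products present in the provided catalog context."),
    "policy": ("You are a policy explainer. Paraphrase only the policy text "
               "you are given; if the policy does not cover a case, say so."),
    "support_summary": ("You are a support summarizer. Produce a neutral, "
                        "factual summary of the conversation for a human agent."),
    "comparison": ("You are a product comparison assistant. Compare only the "
                   "candidate products supplied in context."),
}

def system_prompt(flow):
    # Fail closed: an unknown flow gets the most constrained role.
    return ROLE_PROMPTS.get(flow, ROLE_PROMPTS["policy"])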

Structured-Output Prompting

When the UI expects structured data, prompt for a strict contract.

```
Return valid JSON with exactly these top-level keys:
- response_text
- products
- actions
- follow_up_suggestions
Do not wrap the JSON in markdown.
If a field has no values, return an empty array.
```

Best Practice

Treat schema prompting as one layer of defense only. Add parser validation and repair outside the model.
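
A sketch of that outer layer, assuming the JSON contract above; parse_assistant_json and its repair steps are illustrative, not a fixed API:

```python
import json

REQUIRED_KEYS = {"response_text", "products", "actions", "follow_up_suggestions"}
LIST_KEYS = {"products", "actions", "follow_up_suggestions"}

def parse_assistant_json(raw):
    """Validate model output against the contract and repair common drift."""
    text = raw.strip()
    # Repair: some models wrap JSON in markdown fences despite instructions.
    if text.startswith("```"):
        lines = text.splitlines()[1:]          # drop opening fence
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]                 # drop closing fence
        text = "\n".join(lines)
    data = json.loads(text)  # JSONDecodeError -> caller retries or falls back
    # Repair: missing list fields become empty arrays, per the contract.
    for key in LIST_KEYS:
        data.setdefault(key, [])
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"contract violation, missing keys: {missing}")
    return data
```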

Token Budget Control

Budget Example

| Prompt Part | Budget |
| --- | --- |
| system rules | 250 to 400 tokens |
| workflow context | 100 to 200 tokens |
| retrieved chunks | 800 to 1500 tokens |
| conversation summary and recent turns | 300 to 800 tokens |
| output reservation | 300 to 700 tokens |

Budget Rules

  1. reserve output tokens first
  2. compress history before removing grounding
  3. shrink retrieved context before shrinking hard constraints
  4. keep per-intent variants smaller than one global prompt
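
The sketch below applies these rules in order; count_tokens is a crude stand-in for a real tokenizer, and the numeric ceilings echo the budget table above rather than any project constant:

```python
def count_tokens(text):
    return max(1, len(text) // 4)  # rough heuristic, an assumption only

def pack_prompt(system_rules, workflow_ctx, chunks, history_turns,
                context_window=8000, output_reservation=700):
    budget = context_window - output_reservation  # rule 1: reserve output first
    used = count_tokens(system_rules) + count_tokens(workflow_ctx)
    kept_chunks = []
    # Rule 3: retrieved context shrinks before hard constraints ever do,
    # but grounding still outranks conversation history.
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget - 300:  # hold ~300 tokens back for history
            break
        kept_chunks.append(chunk)
        used += cost
    kept_history = []
    # Rule 2: keep the newest turns; older history is dropped first.
    for turn in reversed(history_turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept_history.insert(0, turn)
        used += cost
    return "\n\n".join([system_rules, workflow_ctx, *kept_chunks, *kept_history])
```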

Context Compression Techniques

| Technique | Good For | Risk |
| --- | --- | --- |
| summary turn | long conversations | may lose specific preferences |
| slot memory | repeated preferences | loses narrative detail |
| ranked context inclusion | crowded inputs | ranking error hides important facts |
| intent-specific context packing | narrow workflows | extra orchestration logic |
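
As one concrete instance, slot memory from the table can be sketched as a small structured store; the PreferenceSlots fields are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceSlots:
    """Structured slots replace raw turns; narrative detail is lost by design."""
    genres: set = field(default_factory=set)
    avoid: set = field(default_factory=set)
    preferred_format: str = ""  # e.g. "digital" or "paperback"

    def to_context(self):
        # A few dozen tokens stand in for many conversation turns.
        return (
            f"User prefers: {', '.join(sorted(self.genres)) or 'unknown'}. "
            f"Avoid: {', '.join(sorted(self.avoid)) or 'none'}. "
            f"Format: {self.preferred_format or 'unspecified'}."
        )

slots = PreferenceSlots()
slots.genres.add("seinen")
slots.avoid.add("horror")
print(slots.to_context())
# -> User prefers: seinen. Avoid: horror. Format: unspecified.
```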

Temperature and Sampling

Lower Temperature

Use for:

  • FAQ
  • order support
  • policy
  • structured output

Higher Temperature

Use carefully for:

  • recommendation wording
  • follow-up suggestions

Practical Rule

Do not use a high temperature to compensate for weak retrieval or weak recommendation ranking.
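
One way to hold this line is a per-intent sampling table, so no handler can quietly raise temperature to paper over upstream problems; the values below are illustrative starting points, not tuned settings:

```python
# Hypothetical per-intent sampling settings, following the guidance above.
SAMPLING = {
    "faq":            {"temperature": 0.1, "top_p": 0.9},
    "order_support":  {"temperature": 0.0, "top_p": 1.0},
    "policy":         {"temperature": 0.0, "top_p": 1.0},
    "structured":     {"temperature": 0.0, "top_p": 1.0},
    "recommendation": {"temperature": 0.6, "top_p": 0.95},  # wording only
    "follow_ups":     {"temperature": 0.7, "top_p": 0.95},
}

def sampling_for(intent):
    # Default to the most conservative settings for unknown intents.
    return SAMPLING.get(intent, {"temperature": 0.0, "top_p": 1.0})
```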

Model Tiering

Example Routing

| Task | Preferred Model Behavior |
| --- | --- |
| policy paraphrase | fast, literal, low-variance |
| recommendation explanation | richer language, better comparative reasoning |
| chunk embedding | strong retrieval quality |
| summarization | concise and stable |

Why Tiering Helps

One FM is rarely optimal for every part of the pipeline.

Use cheaper or faster paths for deterministic formatting and reserve stronger models for open-ended comparison or multi-step responses.
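
A minimal routing sketch under those assumptions; the task keys and model names are placeholders, not real model IDs:

```python
# Hypothetical routing table mirroring the example above.
MODEL_TIERS = {
    "policy_paraphrase": "small-fast-model",       # literal, low variance
    "recommendation":    "large-reasoning-model",  # richer comparisons
    "summarization":     "small-fast-model",       # concise and stable
}

def route(task):
    # Unknown tasks go to the cheapest tier; escalate only on measured need.
    return MODEL_TIERS.get(task, "small-fast-model")
```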

Prompt Prefix Reuse and Caching

Reusable pieces:

  • store persona
  • hard rules
  • output schema
  • common locale instructions

Caching these prompt prefixes reduces repeated token cost in high-volume scenarios.
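
A sketch of how the stable prefix might be assembled, assuming provider-side caching that keys on an exact prefix match; every constant here is an illustrative placeholder:

```python
# Reusable pieces must stay byte-identical across calls and come before
# anything request-specific, or the cached prefix will never match.
STORE_PERSONA = "You are MangaAssist, the shopping assistant for ..."
HARD_RULES = "Never state a price that is not present in the provided context."
OUTPUT_SCHEMA = ("Return valid JSON with keys: response_text, products, "
                 "actions, follow_up_suggestions.")
LOCALE_INSTRUCTIONS = "Respond in the user's locale; default to en-US."

STABLE_PREFIX = "\n\n".join(
    [STORE_PERSONA, HARD_RULES, OUTPUT_SCHEMA, LOCALE_INSTRUCTIONS]
)

def build_messages(request_context, user_message):
    # Volatile content goes after the cached prefix, never inside it.
    return [
        {"role": "system", "content": STABLE_PREFIX},
        {"role": "user", "content": f"{request_context}\n\n{user_message}"},
    ]
```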

FM-Specific Behavior Tendencies

This project should assume model families differ in predictable ways.

| Behavior Dimension | Practical Impact | Compensation Pattern |
| --- | --- | --- |
| long-context obedience | model may ignore late instructions | keep hard rules early and short |
| JSON stability | some models drift on nested structures | use tighter contracts and parser repair |
| tone richness | some models write better recommendation copy | separate candidate selection from explanation |
| latency under large context | long prompts delay first token | compress history and filter retrieval earlier |
| refusal behavior | some models over-refuse | separate safety constraints from business uncertainty logic |

Optimization Decision Table

| Symptom | Try First | Do Not Try First |
| --- | --- | --- |
| hallucinated prices | remove price generation responsibility | higher few-shot count |
| long latency | reduce prompt size and model path | just lowering max tokens blindly |
| schema drift | tighten contract and add parser | more narrative examples |
| weak recommendation explanation | better editorial grounding | higher temperature alone |
| generic FAQ answer | improve retrieval quality | more persona language |