
06: Technical Decisions and Tradeoffs

AIP-C01 Mapping

Task 5.2 → All Skills (5.2.1–5.2.5): Cross-cutting technical decisions required to build a production-ready GenAI troubleshooting capability. Each decision references the specific skill it supports.


Decision Format

Every decision follows this structure:

| Attribute | Description |
| --- | --- |
| Decision | What choice needs to be made |
| Options | The viable alternatives |
| Choice | What MangaAssist chose |
| Rationale | Why — grounded in our specific context |
| Risk | What could go wrong with this choice |
| Mitigation | How we manage that risk |
| Skill | Which 5.2.x skill this supports |

Skill 5.2.1: Content Handling Decisions

Decision 1: Context Window Overflow Strategy

| Aspect | Option A: Hard Truncation | Option B: Summarization | Option C: Dynamic Chunking (Chosen) |
| --- | --- | --- | --- |
| Approach | Cut content at token limit | LLM-summarize long content | Allocate budget per section, compress adaptively |
| Latency impact | None | +500-1500ms per summary call | +10-50ms for budget calculation |
| Quality impact | Can lose critical info mid-sentence | Loses specific details (ASINs, prices) | Preserves high-priority content, compresses low-priority |
| Cost | None | Additional LLM call per overflow | Negligible compute |
| Complexity | Trivial | Medium (need summary prompt + quality check) | High (budget manager, priority system, compressor) |

Choice: Dynamic Chunking (Option C)

Rationale: In MangaAssist, losing an ASIN or price mid-truncation is worse than the engineering cost of dynamic budgeting. Summarization adds latency and cost per request, and summaries lose the specific details our users need. Dynamic chunking is complex to build but cheap to run and preserves the highest-value content.

Risk: Budget allocation logic becomes brittle as new content sections are added.

Mitigation: Budget priorities are externalized in configuration (not hardcoded). New content sections default to DROPPABLE priority until promoted. CloudWatch alarms fire if budget utilization exceeds 95% consistently.
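The budget manager's core pass can be sketched as a priority sweep: grant tokens to the highest-priority sections first, and let lower tiers absorb whatever remains. This is a hypothetical illustration, not the production code — the section names, the three-level priority scheme, and `allocate_budget` are assumptions made for the example.

```python
from dataclasses import dataclass

# Highest priority first; new sections default to DROPPABLE until promoted
PRIORITY_ORDER = ["CRITICAL", "IMPORTANT", "DROPPABLE"]

@dataclass
class Section:
    name: str
    priority: str
    tokens: int  # tokens this section needs at full size

def allocate_budget(sections: list[Section], total_budget: int) -> dict[str, int]:
    """Grant budget to high-priority sections first; lower-priority
    sections are compressed (or dropped to 0) to fit what remains."""
    remaining = total_budget
    allocation: dict[str, int] = {}
    for prio in PRIORITY_ORDER:
        for s in (x for x in sections if x.priority == prio):
            granted = min(s.tokens, remaining)
            allocation[s.name] = granted
            remaining -= granted
    return allocation
```

With a 600-token budget, a 300-token CRITICAL order section is preserved whole, while an IMPORTANT history section is compressed and a DROPPABLE FAQ section is dropped entirely.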


Decision 2: Token Counting Strategy

| Aspect | Option A: Character Heuristic (len/4) | Option B: Model-Specific Tokenizer | Option C: Hybrid (Chosen) |
| --- | --- | --- | --- |
| Accuracy | ±15% for English, ±40% for Japanese | Exact | Exact for critical paths, heuristic for estimates |
| Latency | < 1ms | 5-20ms (load tokenizer + encode) | Mixed |
| Dependency | None | tiktoken or model-specific library | Optional dependency |
| Cost | None | CPU time per request | Negligible |

Choice: Hybrid (Option C)

Rationale: Use character heuristic for initial budget estimation and pre-flight checks. Use exact tokenizer for final validation before FM submission. For MangaAssist's Japanese content, the character heuristic underestimates token count by ~40%, which means we would send over-budget requests if we relied on it alone.

Risk: Two counting methods can disagree, causing confusing debugging logs.

Mitigation: Log both estimates in structured logs. If heuristic and exact differ by > 20%, log a warning.
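A minimal sketch of the hybrid counter, assuming the exact tokenizer is injected as a callable (in production this would wrap a tiktoken or model-specific encoder); `validated_token_count` and the 20% divergence warning mirror the mitigation above and are illustrative names, not the real module.

```python
import logging

logger = logging.getLogger("token_count")

def heuristic_tokens(text: str) -> int:
    """Cheap len/4 estimate — undercounts Japanese text by ~40%."""
    return max(1, len(text) // 4)

def validated_token_count(text: str, exact_counter) -> int:
    """Return the exact count for final validation, logging a warning
    when the cheap heuristic diverges from it by more than 20%."""
    estimate = heuristic_tokens(text)
    exact = exact_counter(text)
    if abs(exact - estimate) / exact > 0.20:
        logger.warning("token estimate off: heuristic=%d exact=%d", estimate, exact)
    return exact
```

The heuristic is used alone for pre-flight budget estimation; the exact path runs only once, immediately before FM submission.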


Skill 5.2.2: FM Integration Decisions

Decision 3: Retry vs. Circuit Breaker Strategy

| Aspect | Option A: Retry Only | Option B: Circuit Breaker Only | Option C: Retry + Circuit Breaker (Chosen) |
| --- | --- | --- | --- |
| Transient failure handling | Good (retries help) | Bad (rejects on first failure) | Good |
| Cascade prevention | Bad (retries add load to failing service) | Good (stops traffic) | Good |
| Recovery speed | Slow (each request retries independently) | Fast (single probe) | Fast |
| Complexity | Low | Medium | High |

Choice: Retry + Circuit Breaker (Option C)

Rationale: Bedrock has two distinct failure modes that need different treatments:

- Transient (occasional 429, network blip): retries with backoff solve it
- Systemic (throttling cascade, service degradation): circuit breaker prevents making it worse

Using retries alone during a Bedrock degradation would amplify the problem (N users × 3 retries = 3N Bedrock calls). The circuit breaker stops the bleeding after 5 consecutive failures.

Risk: Circuit breaker opens too aggressively during minor throttling episodes.

Mitigation: Failure threshold set to 5 (not 1 or 2). HALF_OPEN state sends a single probe every 30 seconds. Recovery timeout is tunable per-environment.
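The layering can be sketched roughly as follows. Names are illustrative, and a real implementation would catch Bedrock-specific exceptions (throttling, service errors) rather than bare `Exception`; the threshold-of-5 and 30-second probe values come from the mitigation above.

```python
import random
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching Bedrock."""

class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures; after
    `recovery_timeout` seconds, a single HALF_OPEN probe is let through."""
    def __init__(self, threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; rejecting call")
            # recovery timeout elapsed: HALF_OPEN, allow this probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result

def invoke_with_retry(breaker, fn, attempts=3, base_delay=0.5):
    """Retry transient failures with jittered exponential backoff,
    but never retry against an open circuit."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Retries handle the transient mode; the breaker handles the systemic mode, and the `CircuitOpenError` short-circuit is what prevents the N users × 3 retries amplification.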


Decision 4: Single FM vs. Model Tiering

| Aspect | Option A: Single Model (Sonnet Only) | Option B: Model Tiering (Chosen) |
| --- | --- | --- |
| Prompt maintenance | One system prompt to maintain | Multiple prompts per model |
| Cost | Higher (Sonnet for everything) | Lower (Haiku for simple intents) |
| Latency | Consistent | Lower for simple intents |
| Reliability | Single point of failure | Graceful degradation possible |
| Quality | Consistent (highest for all) | Variable (simpler model for simple tasks) |

Choice: Model Tiering (Option B)

Rationale: MangaAssist handles intents of wildly different complexity. Order status queries ("Where is my order?") are template-fillable — Haiku handles them at 5x lower cost and 3x lower latency. Product recommendations need Sonnet's reasoning ability. Tiering also provides a reliability benefit: if Sonnet is throttled, simple intents still work on Haiku.

Risk: Prompt maintenance doubles (one prompt per model per intent).

Mitigation: Use a shared prompt template with model-specific adaptation layers. The core instructions are identical; only the output formatting and few-shot examples differ per model.
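The routing side of tiering reduces to a small lookup plus the degradation path. The intent names and model-ID constants below are illustrative placeholders for the example, not exact Bedrock model IDs.

```python
# Illustrative placeholders, not real Bedrock model identifiers
HAIKU = "anthropic.claude-3-haiku"
SONNET = "anthropic.claude-3-sonnet"

# Template-fillable intents that Haiku handles well (assumed set)
SIMPLE_INTENTS = {"order_status", "shipping_info", "return_policy"}

def select_model(intent: str, sonnet_available: bool = True) -> str:
    """Route simple intents to Haiku for cost/latency; route complex
    intents to Sonnet, degrading to Haiku when Sonnet is throttled."""
    if intent in SIMPLE_INTENTS or not sonnet_available:
        return HAIKU
    return SONNET
```

The `sonnet_available` flag is where the reliability benefit lives: a circuit-breaker-open signal on Sonnet can flip it, keeping simple intents fully functional and complex intents degraded rather than dead.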


Decision 5: Synchronous vs. Streaming FM Invocation

| Aspect | Option A: Synchronous | Option B: Streaming (Chosen) |
| --- | --- | --- |
| Time to first token | 1-3s (full generation) | 200-500ms |
| UX | Loading spinner → full response | Progressive rendering |
| Error handling | Simple (success/fail) | Complex (stream interruption) |
| Client complexity | Simple HTTP response | WebSocket management |
| Mobile reliability | Robust | Fragile on poor connections |

Choice: Streaming (Option B) with synchronous fallback

Rationale: Users expect chatbot responses to appear progressively. A 3-second blank screen feels broken even if the response is excellent. Streaming reduces perceived latency by 80%. However, mobile users on poor connections need a fallback to fetch the full response after stream interruption.

Risk: Partial stream delivery creates visible half-responses in the UI.

Mitigation: Buffer complete sentences server-side before streaming. Add GET /chat/message/{id} endpoint for post-reconnection recovery.
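Server-side sentence buffering can be approximated with a regex over the token stream. This is a sketch, not the production streamer: it assumes Western and Japanese sentence-ending punctuation cover our content.

```python
import re

# Split after sentence-ending punctuation (Western + Japanese), consuming
# any trailing whitespace
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")

def buffer_sentences(token_stream):
    """Accumulate streamed tokens and yield only complete sentences,
    so the client never renders a half-finished clause."""
    buf = ""
    for token in token_stream:
        buf += token
        parts = _SENTENCE_END.split(buf)
        # everything except the last fragment is a complete sentence
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buf = parts[-1]
    if buf:  # flush whatever remains when the stream ends
        yield buf
```

On stream interruption, the client has only whole sentences, and the recovery endpoint supplies the rest of the message.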


Skill 5.2.3: Prompt Engineering Decisions

Decision 6: Prompt Testing — Manual vs. Automated

| Aspect | Option A: Manual Review | Option B: Automated Golden Set (Chosen) |
| --- | --- | --- |
| Coverage | 5-10 cases per review | 20+ cases per intent, every time |
| Consistency | Reviewer-dependent | Deterministic scoring |
| Speed | Hours per prompt change | Minutes |
| Regression detection | Depends on reviewer memory | Automatic per-dimension comparison |
| Subtlety detection | Good (humans catch tone) | Moderate (heuristic scores) |

Choice: Automated Golden Set (Option B) with manual review for borderline cases

Rationale: The golden test suite catches 90% of regressions automatically. The remaining 10% (tone shifts, subtle instruction following changes) are caught by manual review of flagged cases where scores are within ±5% of thresholds. This gives speed without sacrificing nuance.

Risk: Golden set becomes stale as user behavior evolves.

Mitigation: Refresh golden set quarterly. Add every production incident as a new test case. Track test case age and flag cases older than 6 months.
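A golden-set runner with the ±5% borderline band might look like this sketch; `run_golden_set`, the case shape, and the injected scoring callable are assumptions for illustration, not the real suite.

```python
def run_golden_set(cases, score_fn, thresholds, borderline=0.05):
    """Score each case per dimension. Scores below a threshold are
    failures; scores within `borderline` above it are flagged for
    manual review (the tone/subtlety cases heuristics may miss)."""
    failures, borderline_cases = [], []
    for case in cases:
        scores = score_fn(case)
        for dim, threshold in thresholds.items():
            s = scores[dim]
            if s < threshold:
                failures.append((case["id"], dim, s))
            elif s < threshold + borderline:
                borderline_cases.append((case["id"], dim, s))
    return failures, borderline_cases
```

Clear passes and clear failures resolve automatically; only the borderline list goes to a human, which is how the 90/10 split in the rationale is operationalized.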


Decision 7: Prompt Scoring — Heuristic vs. LLM-as-Judge

| Aspect | Option A: Heuristic Scoring (Chosen for MVP) | Option B: LLM-as-Judge |
| --- | --- | --- |
| Speed | < 10ms per case | 1-3s per case (FM call) |
| Cost | Free | $5-20 per evaluation suite run |
| Determinism | Fully deterministic | Stochastic (varies per run) |
| Subtlety | Low (keyword matching, format checks) | High (nuanced quality assessment) |
| Scalability | Unlimited | Limited by FM throughput and cost |

Choice: Heuristic for MVP, with LLM-as-Judge planned for Phase 2

Rationale: Heuristic scoring catches the failures that matter most in production: missing required content, wrong format, forbidden content, hallucination signals. These are binary checks. For nuanced quality assessment (tone, helpfulness, completeness), LLM-as-Judge will be added as a secondary scoring layer — but only after the golden test suite proves its value with deterministic checks.

Risk: Heuristic scoring misses quality degradation that users notice but heuristics don't.

Mitigation: Sample 5% of production responses for human review weekly. Use feedback signals (thumbs up/down) to identify cases where heuristic scores were misleading.
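The binary checks reduce to a few membership and bounds tests. This is a minimal sketch — the production scorer also covers format and hallucination-signal checks not shown here, and `heuristic_score` is an illustrative name.

```python
def heuristic_score(response: str, required=(), forbidden=(), max_len=2000):
    """Deterministic binary checks: required content present, forbidden
    content absent, length within bounds. Each check is independently
    reported so regressions can be localized per dimension."""
    checks = {
        "required_present": all(term in response for term in required),
        "forbidden_absent": not any(term in response for term in forbidden),
        "length_ok": len(response) <= max_len,
    }
    checks["pass"] = all(checks.values())
    return checks
```

Because every check is deterministic, two runs on the same response always agree — the property that makes this suitable for CI gating, unlike LLM-as-Judge.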


Skill 5.2.4: Retrieval System Decisions

Decision 8: Exact k-NN vs. Approximate (HNSW) Search

| Aspect | Option A: Exact k-NN | Option B: Approximate (HNSW) (Chosen) |
| --- | --- | --- |
| Accuracy | 100% recall | 95-99% recall (tunable) |
| Latency | O(n) — scales linearly | O(log n) — sub-linear |
| Memory | Lower | Higher (HNSW graph) |
| Scale ceiling | ~100K vectors practical | Millions |
| Tuning | None needed | m, ef_construction, ef_search |

Choice: Approximate (HNSW) via OpenSearch Serverless

Rationale: MangaAssist's product catalog has 500K+ items and grows daily. Exact search at this scale would have P95 latency > 1s, busting our 100ms retrieval budget. HNSW achieves 97% recall at P95 < 50ms with our index configuration (m=16, ef_construction=256, ef_search=256).

Risk: Edge cases where the 3% recall gap causes a relevant product to be missed.

Mitigation: Use hybrid search — k-NN first, then keyword fallback for queries with low vector similarity scores. The keyword fallback catches exact-match cases (specific ASIN queries, exact title matches) that ANN sometimes misses.
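The fallback logic itself is small once the search backends are injected as callables (OpenSearch clients in production). The 0.6 similarity cutoff below is an illustrative assumption, not our tuned value.

```python
def hybrid_search(query, knn_search, keyword_search, min_score=0.6):
    """Run vector k-NN first; if the best hit's similarity is weak,
    fall back to keyword search, which catches exact-match queries
    (specific ASINs, exact titles) that ANN sometimes misses."""
    hits = knn_search(query)
    if hits and hits[0]["score"] >= min_score:
        return hits
    return keyword_search(query) or hits  # keep weak k-NN hits as last resort
```

Returning the weak k-NN hits when keyword search is also empty means the user always gets the best available candidates rather than nothing.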


Decision 9: Embedding Refresh Frequency

| Aspect | Option A: Real-time (on write) | Option B: Batch Daily (Chosen) | Option C: Batch Weekly |
| --- | --- | --- | --- |
| Freshness | Immediate | Up to 24h stale | Up to 7 days stale |
| Cost | High (Bedrock embed per write) | Medium (batch embed) | Low |
| Complexity | High (streaming pipeline) | Medium (scheduled Lambda) | Low |
| Error blast radius | Per-item | Per-batch (100-2000 items) | Per-batch (large) |

Choice: Batch Daily (Option B) with event-driven ingestion for critical updates

Rationale: Most product catalog changes (new listings, price updates) are not time-critical — a 24h delay is acceptable. However, major launches (new manga series, trending items) need same-day visibility. We use daily batch as the baseline plus an SQS-triggered event pipeline for items flagged as "priority" by the product team.

Risk: Priority items missed if the event pipeline fails silently.

Mitigation: CloudWatch alarm on event pipeline DLQ depth > 0. Daily reconciliation job compares OpenSearch document count against DynamoDB catalog count.


Decision 10: Chunk Size Strategy

| Aspect | Option A: Fixed 500 chars | Option B: Fixed 1000 chars | Option C: Content-Aware (Chosen) |
| --- | --- | --- | --- |
| FAQ items | Good (short, focused) | Too large (diluted) | Matched to FAQ boundary |
| Product descriptions | Too small (split mid-paragraph) | Good | Matched to description structure |
| Series narratives | Too small (100+ chunks) | Still fragmented | Paragraph-level with overlap |
| Metadata preservation | May split from content | May split from content | Attached per chunk |

Choice: Content-Aware Chunking (Option C)

Rationale: MangaAssist has three very different content types in a single index. A fixed chunk size is wrong for at least one of them. Content-aware chunking splits FAQs at the question boundary, product descriptions as whole units (with metadata appended), and narratives at paragraph boundaries with 100-char overlap.

Risk: Chunking logic becomes complex and content-type dependent.

Mitigation: Each content type has a separate chunking function. New content types default to the 800-char fixed chunker until a custom chunker is written and tested.
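The dispatch pattern might look like the sketch below. The `Q:` FAQ delimiter, the chunker registry, and the function names are assumptions made for the example; the production chunkers are more careful about metadata attachment.

```python
def chunk_paragraphs(text: str, overlap: int = 100) -> list[str]:
    """Split narratives at paragraph boundaries, prefixing each chunk
    with the tail of the previous paragraph for context continuity."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for i, p in enumerate(paras):
        prefix = paras[i - 1][-overlap:] + " " if i > 0 else ""
        chunks.append(prefix + p)
    return chunks

# One chunking function per content type (FAQ delimiter is assumed)
CHUNKERS = {
    "faq": lambda text: [q.strip() for q in text.split("Q:") if q.strip()],
    "product": lambda text: [text],  # whole unit; metadata attached upstream
    "narrative": chunk_paragraphs,
}

def chunk(content_type: str, text: str, default_size: int = 800) -> list[str]:
    """Dispatch to the content type's chunker; unknown types fall back
    to the fixed 800-char chunker until a custom one is written."""
    fn = CHUNKERS.get(content_type)
    if fn is None:
        return [text[i:i + default_size] for i in range(0, len(text), default_size)]
    return fn(text)
```

New content types get correct-if-unremarkable behavior from the fallback on day one, which is what keeps the per-type complexity from blocking ingestion.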


Skill 5.2.5: Prompt Maintenance Decisions

Decision 11: Observability Tool — CloudWatch vs. X-Ray vs. Custom

| Aspect | CloudWatch Metrics + Logs | X-Ray Tracing | Custom (Chosen: Both) |
| --- | --- | --- | --- |
| Aggregation | Excellent (dashboards, alarms) | Poor (trace-level only) | Both |
| Request-level detail | Possible (Logs Insights) | Excellent (waterfall view) | Both |
| Cost | Low-medium | Low | Combined |
| Setup | Easy | Moderate (SDK instrumentation) | High |
| Correlation | Log-based (correlation_id) | Native (trace_id) | Both available |

Choice: CloudWatch Metrics + Logs for aggregation, X-Ray for request-level diagnosis

Rationale: You need both zoom levels. CloudWatch answers "Is the system healthy?" (aggregate metrics, alarms, dashboards). X-Ray answers "Why did this specific request fail?" (request waterfall, span timings). Neither alone is sufficient for GenAI troubleshooting.

Risk: Dual instrumentation doubles the observability code.

Mitigation: Single PromptObservabilityPipeline class wraps both CloudWatch and X-Ray. Spans are annotated once and emitted to both systems. Correlation IDs bridge the two.
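A minimal sketch of the dual-emission idea, with the CloudWatch and X-Ray emitters injected as callables; the real class wires in the AWS SDKs, which are omitted here, and this shape is an assumption about its interface.

```python
import contextlib
import time
import uuid

class PromptObservabilityPipeline:
    """One span annotation fans out to both backends: a metrics emitter
    (CloudWatch aggregate view) and a trace emitter (X-Ray request view),
    bridged by a shared correlation ID."""
    def __init__(self, emit_metric, emit_span):
        self.emit_metric = emit_metric
        self.emit_span = emit_span

    @contextlib.contextmanager
    def span(self, name, correlation_id=None):
        correlation_id = correlation_id or str(uuid.uuid4())
        start = time.monotonic()
        try:
            yield correlation_id
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            self.emit_metric(name, elapsed_ms)                # aggregate view
            self.emit_span(name, correlation_id, elapsed_ms)  # request view
```

Instrumented code annotates once (`with pipeline.span("invoke_fm"):`) and both telemetry systems stay in sync by construction.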


Decision 12: Schema Validation — Strict vs. Lenient

| Aspect | Option A: Strict (Block) | Option B: Lenient + Log (Chosen) |
| --- | --- | --- |
| User impact | Bad response blocked → fallback | Schema-drifted response delivered + logged |
| Detection speed | Immediate (user sees error) | After log analysis (minutes to hours) |
| False positive impact | High (blocks good responses with unexpected fields) | None (logs only) |
| Data quality | Higher (only valid responses reach user) | Lower (some schema-variant responses reach user) |

Choice: Lenient + Log (Option B) for production, Strict for testing

Rationale: In production, a response with an extra field is better than no response. The user does not care about an extra confidence field the FM added — but they care about seeing an error message. In testing, strict validation ensures prompts produce the exact expected schema.

Risk: Schema drift accumulates undetected if logs are not monitored.

Mitigation: Nightly schema compliance report with CloudWatch Logs Insights. Alert if compliance drops below 95%.
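Both modes can share a single drift check, differing only in what they do with the problems. This sketch assumes a flat field-to-type schema, which is simpler than the production schema; names are illustrative.

```python
import logging

logger = logging.getLogger("schema_validation")

class SchemaError(Exception):
    """Raised only in strict (test) mode."""

def check_fields(response: dict, required_fields: dict) -> list[str]:
    """List drift problems: missing or mistyped required fields.
    Extra fields the FM added are deliberately ignored."""
    return [
        f"{field}: expected {expected.__name__}"
        for field, expected in required_fields.items()
        if not isinstance(response.get(field), expected)
    ]

def validate_lenient(response: dict, required_fields: dict) -> dict:
    """Production mode: deliver the response regardless, logging drift
    for the nightly compliance report."""
    problems = check_fields(response, required_fields)
    if problems:
        logger.warning("schema drift: %s", "; ".join(problems))
    return response

def validate_strict(response: dict, required_fields: dict) -> dict:
    """Test mode: any drift raises, failing the golden suite in CI."""
    problems = check_fields(response, required_fields)
    if problems:
        raise SchemaError("; ".join(problems))
    return response
```

An FM response carrying an unexpected extra field passes both modes untouched, which is exactly the "extra field is better than no response" behavior the rationale calls for.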


Cross-Cutting Decision Summary

```mermaid
graph TD
    A[GenAI Troubleshooting<br>Decision Space] --> B[Content Handling<br>5.2.1]
    A --> C[FM Integration<br>5.2.2]
    A --> D[Prompt Engineering<br>5.2.3]
    A --> E[Retrieval System<br>5.2.4]
    A --> F[Prompt Maintenance<br>5.2.5]

    B --> B1[Dynamic chunking<br>over truncation]
    B --> B2[Hybrid token counting<br>heuristic + exact]

    C --> C1[Retry + circuit breaker<br>layered resilience]
    C --> C2[Model tiering<br>Sonnet + Haiku]
    C --> C3[Streaming with<br>sync fallback]

    D --> D1[Automated golden set<br>+ manual review]
    D --> D2[Heuristic scoring MVP<br>LLM-judge later]

    E --> E1[HNSW approximate<br>+ keyword fallback]
    E --> E2[Daily batch +<br>event-driven priority]
    E --> E3[Content-aware chunking<br>per content type]

    F --> F1[CloudWatch + X-Ray<br>dual observability]
    F --> F2[Lenient production<br>strict testing]
```
Cost Impact Summary

| Decision | Monthly Cost Impact | Justification |
| --- | --- | --- |
| Dynamic chunking over summarization | -$800 | Avoids extra LLM call per overflow event (~50K events/month) |
| Model tiering (Haiku for simple) | -$2,500 | 40% of requests use Haiku at 1/10th cost |
| Streaming with sync fallback | +$200 | WebSocket infrastructure + reconnection endpoint |
| Automated golden set testing | -$150 | Prevents costly production rollbacks (1 incident = $500-2000) |
| HNSW over exact search | +$300 | Higher memory for HNSW graph, but saves compute |
| Daily batch + event embed | -$400 | Vs real-time: 90% fewer Bedrock embed API calls |
| CloudWatch + X-Ray dual | +$250 | Combined telemetry costs |
| **Net monthly impact** | **-$3,100** | Cost-positive due to model tiering and LLM call avoidance |