
06: Technical Decisions and Tradeoffs

AIP-C01 Mapping

Task 5.2 → All Skills (5.2.1–5.2.5): Cross-cutting technical decisions required to build a production-ready GenAI troubleshooting capability. Each decision references the specific skill it supports.


Decision Format

Every decision follows this structure:

| Attribute | Description |
| --- | --- |
| Decision | What choice needs to be made |
| Options | The viable alternatives |
| Choice | What MangaAssist chose |
| Rationale | Why — grounded in our specific context |
| Risk | What could go wrong with this choice |
| Mitigation | How we manage that risk |
| Skill | Which 5.2.x skill this supports |

Skill 5.2.1: Content Handling Decisions

Decision 1: Context Window Overflow Strategy

| Aspect | Option A: Hard Truncation | Option B: Summarization | Option C: Dynamic Chunking (Chosen) |
| --- | --- | --- | --- |
| Approach | Cut content at token limit | LLM-summarize long content | Allocate budget per section, compress adaptively |
| Latency impact | None | +500-1500ms per summary call | +10-50ms for budget calculation |
| Quality impact | Can lose critical info mid-sentence | Loses specific details (ASINs, prices) | Preserves high-priority content, compresses low-priority |
| Cost | None | Additional LLM call per overflow | Negligible compute |
| Complexity | Trivial | Medium (need summary prompt + quality check) | High (budget manager, priority system, compressor) |

Choice: Dynamic Chunking (Option C)

Rationale: In MangaAssist, losing an ASIN or price mid-truncation is worse than the engineering cost of dynamic budgeting. Summarization adds latency and cost per request, and summaries lose the specific details our users need. Dynamic chunking is complex to build but cheap to run and preserves the highest-value content.

Risk: Budget allocation logic becomes brittle as new content sections are added.

Mitigation: Budget priorities are externalized in configuration (not hardcoded). New content sections default to DROPPABLE priority until promoted. CloudWatch alarms fire if budget utilization exceeds 95% consistently.
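The budget manager's core pass can be sketched as a priority sweep: grant tokens to the highest-priority sections first, and let lower tiers absorb whatever remains. This is a hypothetical illustration, not the production code — the section names, the three-level priority scheme, and `allocate_budget` are assumptions made for the example.

```python
from dataclasses import dataclass

# Highest priority first; new sections default to DROPPABLE until promoted
PRIORITY_ORDER = ["CRITICAL", "IMPORTANT", "DROPPABLE"]

@dataclass
class Section:
    name: str
    priority: str
    tokens: int  # tokens this section needs at full size

def allocate_budget(sections: list[Section], total_budget: int) -> dict[str, int]:
    """Grant budget to high-priority sections first; lower-priority
    sections are compressed (or dropped to 0) to fit what remains."""
    remaining = total_budget
    allocation: dict[str, int] = {}
    for prio in PRIORITY_ORDER:
        for s in (x for x in sections if x.priority == prio):
            granted = min(s.tokens, remaining)
            allocation[s.name] = granted
            remaining -= granted
    return allocation
```

With a 600-token budget, a 300-token CRITICAL order section is preserved whole, while an IMPORTANT history section is compressed and a DROPPABLE FAQ section is dropped entirely.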


Decision 2: Token Counting Strategy

| Aspect | Option A: Character Heuristic (len/4) | Option B: Model-Specific Tokenizer | Option C: Hybrid (Chosen) |
| --- | --- | --- | --- |
| Accuracy | ±15% for English, ±40% for Japanese | Exact | Exact for critical paths, heuristic for estimates |
| Latency | < 1ms | 5-20ms (load tokenizer + encode) | Mixed |
| Dependency | None | tiktoken or model-specific library | Optional dependency |
| Cost | None | CPU time per request | Negligible |

Choice: Hybrid (Option C)

Rationale: Use character heuristic for initial budget estimation and pre-flight checks. Use exact tokenizer for final validation before FM submission. For MangaAssist's Japanese content, the character heuristic underestimates token count by ~40%, which means we would send over-budget requests if we relied on it alone.

Risk: Two counting methods can disagree, causing confusing debugging logs.

Mitigation: Log both estimates in structured logs. If heuristic and exact differ by > 20%, log a warning.
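A minimal sketch of the hybrid counter, assuming the exact tokenizer is injected as a callable (in production this would wrap a tiktoken or model-specific encoder); `validated_token_count` and the 20% divergence warning mirror the mitigation above and are illustrative names, not the real module.

```python
import logging

logger = logging.getLogger("token_count")

def heuristic_tokens(text: str) -> int:
    """Cheap len/4 estimate — undercounts Japanese text by ~40%."""
    return max(1, len(text) // 4)

def validated_token_count(text: str, exact_counter) -> int:
    """Return the exact count for final validation, logging a warning
    when the cheap heuristic diverges from it by more than 20%."""
    estimate = heuristic_tokens(text)
    exact = exact_counter(text)
    if abs(exact - estimate) / exact > 0.20:
        logger.warning("token estimate off: heuristic=%d exact=%d", estimate, exact)
    return exact
```

The heuristic is used alone for pre-flight budget estimation; the exact path runs only once, immediately before FM submission.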


Skill 5.2.2: FM Integration Decisions

Decision 3: Retry vs. Circuit Breaker Strategy

| Aspect | Option A: Retry Only | Option B: Circuit Breaker Only | Option C: Retry + Circuit Breaker (Chosen) |
| --- | --- | --- | --- |
| Transient failure handling | Good (retries help) | Bad (rejects on first failure) | Good |
| Cascade prevention | Bad (retries add load to failing service) | Good (stops traffic) | Good |
| Recovery speed | Slow (each request retries independently) | Fast (single probe) | Fast |
| Complexity | Low | Medium | High |

Choice: Retry + Circuit Breaker (Option C)

Rationale: Bedrock has two distinct failure modes that need different treatments:

- Transient (occasional 429, network blip): retries with backoff solve it
- Systemic (throttling cascade, service degradation): circuit breaker prevents making it worse

Using retries alone during a Bedrock degradation would amplify the problem (N users × 3 retries = 3N Bedrock calls). The circuit breaker stops the bleeding after 5 consecutive failures.

Risk: Circuit breaker opens too aggressively during minor throttling episodes.

Mitigation: Failure threshold set to 5 (not 1 or 2). HALF_OPEN state sends a single probe every 30 seconds. Recovery timeout is tunable per-environment.
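The layering can be sketched roughly as follows. Names are illustrative, and a real implementation would catch Bedrock-specific exceptions (throttling, service errors) rather than bare `Exception`; the threshold-of-5 and 30-second probe values come from the mitigation above.

```python
import random
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching Bedrock."""

class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures; after
    `recovery_timeout` seconds, a single HALF_OPEN probe is let through."""
    def __init__(self, threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; rejecting call")
            # recovery timeout elapsed: HALF_OPEN, allow this probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result

def invoke_with_retry(breaker, fn, attempts=3, base_delay=0.5):
    """Retry transient failures with jittered exponential backoff,
    but never retry against an open circuit."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Retries handle the transient mode; the breaker handles the systemic mode, and the `CircuitOpenError` short-circuit is what prevents the N users × 3 retries amplification.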


Decision 4: Single FM vs. Model Tiering

| Aspect | Option A: Single Model (Sonnet Only) | Option B: Model Tiering (Chosen) |
| --- | --- | --- |
| Prompt maintenance | One system prompt to maintain | Multiple prompts per model |
| Cost | Higher (Sonnet for everything) | Lower (Haiku for simple intents) |
| Latency | Consistent | Lower for simple intents |
| Reliability | Single point of failure | Graceful degradation possible |
| Quality | Consistent (highest for all) | Variable (simpler model for simple tasks) |

Choice: Model Tiering (Option B)

Rationale: MangaAssist handles intents of wildly different complexity. Order status queries ("Where is my order?") are template-fillable — Haiku handles them at 5x lower cost and 3x lower latency. Product recommendations need Sonnet's reasoning ability. Tiering also provides a reliability benefit: if Sonnet is throttled, simple intents still work on Haiku.

Risk: Prompt maintenance doubles (one prompt per model per intent).

Mitigation: Use a shared prompt template with model-specific adaptation layers. The core instructions are identical; only the output formatting and few-shot examples differ per model.
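The routing side of tiering reduces to a small lookup plus the degradation path. The intent names and model-ID constants below are illustrative placeholders for the example, not exact Bedrock model IDs.

```python
# Illustrative placeholders, not real Bedrock model identifiers
HAIKU = "anthropic.claude-3-haiku"
SONNET = "anthropic.claude-3-sonnet"

# Template-fillable intents that Haiku handles well (assumed set)
SIMPLE_INTENTS = {"order_status", "shipping_info", "return_policy"}

def select_model(intent: str, sonnet_available: bool = True) -> str:
    """Route simple intents to Haiku for cost/latency; route complex
    intents to Sonnet, degrading to Haiku when Sonnet is throttled."""
    if intent in SIMPLE_INTENTS or not sonnet_available:
        return HAIKU
    return SONNET
```

The `sonnet_available` flag is where the reliability benefit lives: a circuit-breaker-open signal on Sonnet can flip it, keeping simple intents fully functional and complex intents degraded rather than dead.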


Decision 5: Synchronous vs. Streaming FM Invocation

| Aspect | Option A: Synchronous | Option B: Streaming (Chosen) |
| --- | --- | --- |
| Time to first token | 1-3s (full generation) | 200-500ms |
| UX | Loading spinner → full response | Progressive rendering |
| Error handling | Simple (success/fail) | Complex (stream interruption) |
| Client complexity | Simple HTTP response | WebSocket management |
| Mobile reliability | Robust | Fragile on poor connections |

Choice: Streaming (Option B) with synchronous fallback

Rationale: Users expect chatbot responses to appear progressively. A 3-second blank screen feels broken even if the response is excellent. Streaming reduces perceived latency by 80%. However, mobile users on poor connections need a fallback to fetch the full response after stream interruption.

Risk: Partial stream delivery creates visible half-responses in the UI.

Mitigation: Buffer complete sentences server-side before streaming. Add GET /chat/message/{id} endpoint for post-reconnection recovery.
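Server-side sentence buffering can be approximated with a regex over the token stream. This is a sketch, not the production streamer: it assumes Western and Japanese sentence-ending punctuation cover our content.

```python
import re

# Split after sentence-ending punctuation (Western + Japanese), consuming
# any trailing whitespace
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")

def buffer_sentences(token_stream):
    """Accumulate streamed tokens and yield only complete sentences,
    so the client never renders a half-finished clause."""
    buf = ""
    for token in token_stream:
        buf += token
        parts = _SENTENCE_END.split(buf)
        # everything except the last fragment is a complete sentence
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buf = parts[-1]
    if buf:  # flush whatever remains when the stream ends
        yield buf
```

On stream interruption, the client has only whole sentences, and the recovery endpoint supplies the rest of the message.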


Skill 5.2.3: Prompt Engineering Decisions

Decision 6: Prompt Testing — Manual vs. Automated

| Aspect | Option A: Manual Review | Option B: Automated Golden Set (Chosen) |
| --- | --- | --- |
| Coverage | 5-10 cases per review | 20+ cases per intent, every time |
| Consistency | Reviewer-dependent | Deterministic scoring |
| Speed | Hours per prompt change | Minutes |
| Regression detection | Depends on reviewer memory | Automatic per-dimension comparison |
| Subtlety detection | Good (humans catch tone) | Moderate (heuristic scores) |

Choice: Automated Golden Set (Option B) with manual review for borderline cases

Rationale: The golden test suite catches 90% of regressions automatically. The remaining 10% (tone shifts, subtle instruction following changes) are caught by manual review of flagged cases where scores are within ±5% of thresholds. This gives speed without sacrificing nuance.

Risk: Golden set becomes stale as user behavior evolves.

Mitigation: Refresh golden set quarterly. Add every production incident as a new test case. Track test case age and flag cases older than 6 months.
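A golden-set runner with the ±5% borderline band might look like this sketch; `run_golden_set`, the case shape, and the injected scoring callable are assumptions for illustration, not the real suite.

```python
def run_golden_set(cases, score_fn, thresholds, borderline=0.05):
    """Score each case per dimension. Scores below a threshold are
    failures; scores within `borderline` above it are flagged for
    manual review (the tone/subtlety cases heuristics may miss)."""
    failures, borderline_cases = [], []
    for case in cases:
        scores = score_fn(case)
        for dim, threshold in thresholds.items():
            s = scores[dim]
            if s < threshold:
                failures.append((case["id"], dim, s))
            elif s < threshold + borderline:
                borderline_cases.append((case["id"], dim, s))
    return failures, borderline_cases
```

Clear passes and clear failures resolve automatically; only the borderline list goes to a human, which is how the 90/10 split in the rationale is operationalized.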


Decision 7: Prompt Scoring — Heuristic vs. LLM-as-Judge

| Aspect | Option A: Heuristic Scoring (Chosen for MVP) | Option B: LLM-as-Judge |
| --- | --- | --- |
| Speed | < 10ms per case | 1-3s per case (FM call) |
| Cost | Free | $5-20 per evaluation suite run |
| Determinism | Fully deterministic | Stochastic (varies per run) |
| Subtlety | Low (keyword matching, format checks) | High (nuanced quality assessment) |
| Scalability | Unlimited | Limited by FM throughput and cost |

Choice: Heuristic for MVP, with LLM-as-Judge planned for Phase 2

Rationale: Heuristic scoring catches the failures that matter most in production: missing required content, wrong format, forbidden content, hallucination signals. These are binary checks. For nuanced quality assessment (tone, helpfulness, completeness), LLM-as-Judge will be added as a secondary scoring layer — but only after the golden test suite proves its value with deterministic checks.

Risk: Heuristic scoring misses quality degradation that users notice but heuristics don't.

Mitigation: Sample 5% of production responses for human review weekly. Use feedback signals (thumbs up/down) to identify cases where heuristic scores were misleading.
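The binary checks reduce to a few membership and bounds tests. This is a minimal sketch — the production scorer also covers format and hallucination-signal checks not shown here, and `heuristic_score` is an illustrative name.

```python
def heuristic_score(response: str, required=(), forbidden=(), max_len=2000):
    """Deterministic binary checks: required content present, forbidden
    content absent, length within bounds. Each check is independently
    reported so regressions can be localized per dimension."""
    checks = {
        "required_present": all(term in response for term in required),
        "forbidden_absent": not any(term in response for term in forbidden),
        "length_ok": len(response) <= max_len,
    }
    checks["pass"] = all(checks.values())
    return checks
```

Because every check is deterministic, two runs on the same response always agree — the property that makes this suitable for CI gating, unlike LLM-as-Judge.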


Skill 5.2.4: Retrieval System Decisions

Decision 8: Exact k-NN vs. Approximate (HNSW) Search

| Aspect | Option A: Exact k-NN | Option B: Approximate (HNSW) (Chosen) |
| --- | --- | --- |
| Accuracy | 100% recall | 95-99% recall (tunable) |
| Latency | O(n) — scales linearly | O(log n) — sub-linear |
| Memory | Lower | Higher (HNSW graph) |
| Scale ceiling | ~100K vectors practical | Millions |
| Tuning | None needed | m, ef_construction, ef_search |

Choice: Approximate (HNSW) via OpenSearch Serverless

Rationale: MangaAssist's product catalog has 500K+ items and grows daily. Exact search at this scale would have P95 latency > 1s, busting our 100ms retrieval budget. HNSW achieves 97% recall at P95 < 50ms with our index configuration (m=16, ef_construction=256, ef_search=256).

Risk: Edge cases where the 3% recall gap causes a relevant product to be missed.

Mitigation: Use hybrid search — k-NN first, then keyword fallback for queries with low vector similarity scores. The keyword fallback catches exact-match cases (specific ASIN queries, exact title matches) that ANN sometimes misses.
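The fallback logic itself is small once the search backends are injected as callables (OpenSearch clients in production). The 0.6 similarity cutoff below is an illustrative assumption, not our tuned value.

```python
def hybrid_search(query, knn_search, keyword_search, min_score=0.6):
    """Run vector k-NN first; if the best hit's similarity is weak,
    fall back to keyword search, which catches exact-match queries
    (specific ASINs, exact titles) that ANN sometimes misses."""
    hits = knn_search(query)
    if hits and hits[0]["score"] >= min_score:
        return hits
    return keyword_search(query) or hits  # keep weak k-NN hits as last resort
```

Returning the weak k-NN hits when keyword search is also empty means the user always gets the best available candidates rather than nothing.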


Decision 9: Embedding Refresh Frequency

| Aspect | Option A: Real-time (on write) | Option B: Batch Daily (Chosen) | Option C: Batch Weekly |
| --- | --- | --- | --- |
| Freshness | Immediate | Up to 24h stale | Up to 7 days stale |
| Cost | High (Bedrock embed per write) | Medium (batch embed) | Low |
| Complexity | High (streaming pipeline) | Medium (scheduled Lambda) | Low |
| Error blast radius | Per-item | Per-batch (100-2000 items) | Per-batch (large) |

Choice: Batch Daily (Option B) with event-driven ingestion for critical updates

Rationale: Most product catalog changes (new listings, price updates) are not time-critical — a 24h delay is acceptable. However, major launches (new manga series, trending items) need same-day visibility. We use daily batch as the baseline plus an SQS-triggered event pipeline for items flagged as "priority" by the product team.

Risk: Priority items missed if the event pipeline fails silently.

Mitigation: CloudWatch alarm on event pipeline DLQ depth > 0. Daily reconciliation job compares OpenSearch document count against DynamoDB catalog count.


Decision 10: Chunk Size Strategy

| Aspect | Option A: Fixed 500 chars | Option B: Fixed 1000 chars | Option C: Content-Aware (Chosen) |
| --- | --- | --- | --- |
| FAQ items | Good (short, focused) | Too large (diluted) | Matched to FAQ boundary |
| Product descriptions | Too small (split mid-paragraph) | Good | Matched to description structure |
| Series narratives | Too small (100+ chunks) | Still fragmented | Paragraph-level with overlap |
| Metadata preservation | May split from content | May split from content | Attached per chunk |

Choice: Content-Aware Chunking (Option C)

Rationale: MangaAssist has three very different content types in a single index. A fixed chunk size is wrong for at least one of them. Content-aware chunking splits FAQs at the question boundary, product descriptions as whole units (with metadata appended), and narratives at paragraph boundaries with 100-char overlap.

Risk: Chunking logic becomes complex and content-type dependent.

Mitigation: Each content type has a separate chunking function. New content types default to the 800-char fixed chunker until a custom chunker is written and tested.
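The dispatch pattern might look like the sketch below. The `Q:` FAQ delimiter, the chunker registry, and the function names are assumptions made for the example; the production chunkers are more careful about metadata attachment.

```python
def chunk_paragraphs(text: str, overlap: int = 100) -> list[str]:
    """Split narratives at paragraph boundaries, prefixing each chunk
    with the tail of the previous paragraph for context continuity."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for i, p in enumerate(paras):
        prefix = paras[i - 1][-overlap:] + " " if i > 0 else ""
        chunks.append(prefix + p)
    return chunks

# One chunking function per content type (FAQ delimiter is assumed)
CHUNKERS = {
    "faq": lambda text: [q.strip() for q in text.split("Q:") if q.strip()],
    "product": lambda text: [text],  # whole unit; metadata attached upstream
    "narrative": chunk_paragraphs,
}

def chunk(content_type: str, text: str, default_size: int = 800) -> list[str]:
    """Dispatch to the content type's chunker; unknown types fall back
    to the fixed 800-char chunker until a custom one is written."""
    fn = CHUNKERS.get(content_type)
    if fn is None:
        return [text[i:i + default_size] for i in range(0, len(text), default_size)]
    return fn(text)
```

New content types get correct-if-unremarkable behavior from the fallback on day one, which is what keeps the per-type complexity from blocking ingestion.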


Skill 5.2.5: Prompt Maintenance Decisions

Decision 11: Observability Tool — CloudWatch vs. X-Ray vs. Custom

| Aspect | CloudWatch Metrics + Logs | X-Ray Tracing | Custom (Chosen: Both) |
| --- | --- | --- | --- |
| Aggregation | Excellent (dashboards, alarms) | Poor (trace-level only) | Both |
| Request-level detail | Possible (Logs Insights) | Excellent (waterfall view) | Both |
| Cost | Low-medium | Low | Combined |
| Setup | Easy | Moderate (SDK instrumentation) | High |
| Correlation | Log-based (correlation_id) | Native (trace_id) | Both available |

Choice: CloudWatch Metrics + Logs for aggregation, X-Ray for request-level diagnosis

Rationale: You need both zoom levels. CloudWatch answers "Is the system healthy?" (aggregate metrics, alarms, dashboards). X-Ray answers "Why did this specific request fail?" (request waterfall, span timings). Neither alone is sufficient for GenAI troubleshooting.

Risk: Dual instrumentation doubles the observability code.

Mitigation: Single PromptObservabilityPipeline class wraps both CloudWatch and X-Ray. Spans are annotated once and emitted to both systems. Correlation IDs bridge the two.
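A minimal sketch of the dual-emission idea, with the CloudWatch and X-Ray emitters injected as callables; the real class wires in the AWS SDKs, which are omitted here, and this shape is an assumption about its interface.

```python
import contextlib
import time
import uuid

class PromptObservabilityPipeline:
    """One span annotation fans out to both backends: a metrics emitter
    (CloudWatch aggregate view) and a trace emitter (X-Ray request view),
    bridged by a shared correlation ID."""
    def __init__(self, emit_metric, emit_span):
        self.emit_metric = emit_metric
        self.emit_span = emit_span

    @contextlib.contextmanager
    def span(self, name, correlation_id=None):
        correlation_id = correlation_id or str(uuid.uuid4())
        start = time.monotonic()
        try:
            yield correlation_id
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            self.emit_metric(name, elapsed_ms)                # aggregate view
            self.emit_span(name, correlation_id, elapsed_ms)  # request view
```

Instrumented code annotates once (`with pipeline.span("invoke_fm"):`) and both telemetry systems stay in sync by construction.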


Decision 12: Schema Validation — Strict vs. Lenient

| Aspect | Option A: Strict (Block) | Option B: Lenient + Log (Chosen) |
| --- | --- | --- |
| User impact | Bad response blocked → fallback | Schema-drifted response delivered + logged |
| Detection speed | Immediate (user sees error) | After log analysis (minutes to hours) |
| False positive impact | High (blocks good responses with unexpected fields) | None (logs only) |
| Data quality | Higher (only valid responses reach user) | Lower (some schema-variant responses reach user) |

Choice: Lenient + Log (Option B) for production, Strict for testing

Rationale: In production, a response with an extra field is better than no response. The user does not care about an extra confidence field the FM added — but they care about seeing an error message. In testing, strict validation ensures prompts produce the exact expected schema.

Risk: Schema drift accumulates undetected if logs are not monitored.

Mitigation: Nightly schema compliance report with CloudWatch Logs Insights. Alert if compliance drops below 95%.
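Both modes can share a single drift check, differing only in what they do with the problems. This sketch assumes a flat field-to-type schema, which is simpler than the production schema; names are illustrative.

```python
import logging

logger = logging.getLogger("schema_validation")

class SchemaError(Exception):
    """Raised only in strict (test) mode."""

def check_fields(response: dict, required_fields: dict) -> list[str]:
    """List drift problems: missing or mistyped required fields.
    Extra fields the FM added are deliberately ignored."""
    return [
        f"{field}: expected {expected.__name__}"
        for field, expected in required_fields.items()
        if not isinstance(response.get(field), expected)
    ]

def validate_lenient(response: dict, required_fields: dict) -> dict:
    """Production mode: deliver the response regardless, logging drift
    for the nightly compliance report."""
    problems = check_fields(response, required_fields)
    if problems:
        logger.warning("schema drift: %s", "; ".join(problems))
    return response

def validate_strict(response: dict, required_fields: dict) -> dict:
    """Test mode: any drift raises, failing the golden suite in CI."""
    problems = check_fields(response, required_fields)
    if problems:
        raise SchemaError("; ".join(problems))
    return response
```

An FM response carrying an unexpected extra field passes both modes untouched, which is exactly the "extra field is better than no response" behavior the rationale calls for.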


Cross-Cutting Decision Summary

```mermaid
graph TD
    A[GenAI Troubleshooting<br>Decision Space] --> B[Content Handling<br>5.2.1]
    A --> C[FM Integration<br>5.2.2]
    A --> D[Prompt Engineering<br>5.2.3]
    A --> E[Retrieval System<br>5.2.4]
    A --> F[Prompt Maintenance<br>5.2.5]

    B --> B1[Dynamic chunking<br>over truncation]
    B --> B2[Hybrid token counting<br>heuristic + exact]

    C --> C1[Retry + circuit breaker<br>layered resilience]
    C --> C2[Model tiering<br>Sonnet + Haiku]
    C --> C3[Streaming with<br>sync fallback]

    D --> D1[Automated golden set<br>+ manual review]
    D --> D2[Heuristic scoring MVP<br>LLM-judge later]

    E --> E1[HNSW approximate<br>+ keyword fallback]
    E --> E2[Daily batch +<br>event-driven priority]
    E --> E3[Content-aware chunking<br>per content type]

    F --> F1[CloudWatch + X-Ray<br>dual observability]
    F --> F2[Lenient production<br>strict testing]
```
Cost Impact Summary

| Decision | Monthly Cost Impact | Justification |
| --- | --- | --- |
| Dynamic chunking over summarization | -$800 | Avoids extra LLM call per overflow event (~50K events/month) |
| Model tiering (Haiku for simple) | -$2,500 | 40% of requests use Haiku at 1/10th cost |
| Streaming with sync fallback | +$200 | WebSocket infrastructure + reconnection endpoint |
| Automated golden set testing | -$150 | Prevents costly production rollbacks (1 incident = $500-2000) |
| HNSW over exact search | +$300 | Higher memory for HNSW graph, but saves compute |
| Daily batch + event embed | -$400 | Vs real-time: 90% fewer Bedrock embed API calls |
| CloudWatch + X-Ray dual | +$250 | Combined telemetry costs |
| **Net monthly impact** | **-$3,100** | Cost-positive due to model tiering and LLM call avoidance |