06: Technical Decisions and Tradeoffs
AIP-C01 Mapping
Task 5.2 → All Skills (5.2.1–5.2.5): Cross-cutting technical decisions required to build a production-ready GenAI troubleshooting capability. Each decision references the specific skill it supports.
Decision Format
Every decision follows this structure:
| Attribute | Description |
|---|---|
| Decision | What choice needs to be made |
| Options | The viable alternatives |
| Choice | What MangaAssist chose |
| Rationale | Why — grounded in our specific context |
| Risk | What could go wrong with this choice |
| Mitigation | How we manage that risk |
| Skill | Which 5.2.x skill this supports |
Skill 5.2.1: Content Handling Decisions
Decision 1: Context Window Overflow Strategy
| Aspect | Option A: Hard Truncation | Option B: Summarization | Option C: Dynamic Chunking (Chosen) |
|---|---|---|---|
| Approach | Cut content at token limit | LLM-summarize long content | Allocate budget per section, compress adaptively |
| Latency impact | None | +500-1500ms per summary call | +10-50ms for budget calculation |
| Quality impact | Can lose critical info mid-sentence | Loses specific details (ASINs, prices) | Preserves high-priority content, compresses low-priority |
| Cost | None | Additional LLM call per overflow | Negligible compute |
| Complexity | Trivial | Medium (need summary prompt + quality check) | High (budget manager, priority system, compressor) |
Choice: Dynamic Chunking (Option C)
Rationale: In MangaAssist, losing an ASIN or price mid-truncation is worse than the engineering cost of dynamic budgeting. Summarization adds latency and cost per request, and summaries lose the specific details our users need. Dynamic chunking is complex to build but cheap to run and preserves the highest-value content.
Risk: Budget allocation logic becomes brittle as new content sections are added.
Mitigation: Budget priorities are externalized in configuration (not hardcoded). New content sections default to DROPPABLE priority until promoted. CloudWatch alarms fire if budget utilization exceeds 95% consistently.
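A minimal sketch of the budget-allocation idea, assuming an illustrative three-tier priority scheme; the `Priority`, `Section`, and `plan_budget` names are ours, not the production API:

```python
# Budget manager sketch: allocate per-section token budgets, compress or
# drop low-priority content first. Priorities would live in config, per the
# mitigation above; they are hardcoded here only for illustration.
from dataclasses import dataclass
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0      # ASINs, prices: never compressed or dropped
    COMPRESSIBLE = 1  # long descriptions: trimmed to fit
    DROPPABLE = 2     # new or low-value sections: dropped first

@dataclass
class Section:
    name: str
    text: str
    priority: Priority

def _estimate(text: str) -> int:
    return len(text) // 4  # rough pre-flight count (see Decision 2)

def plan_budget(sections: list[Section], limit_tokens: int) -> list[Section]:
    """Fit sections into limit_tokens, highest priority first."""
    kept, used = [], 0
    for sec in sorted(sections, key=lambda s: s.priority):
        cost = _estimate(sec.text)
        if used + cost <= limit_tokens:
            kept.append(sec)
            used += cost
        elif sec.priority is Priority.CRITICAL:
            raise ValueError(f"critical section {sec.name!r} exceeds budget")
        elif sec.priority is Priority.COMPRESSIBLE:
            # Trim to the remaining budget at ~4 chars per token.
            remaining_chars = (limit_tokens - used) * 4
            if remaining_chars > 0:
                kept.append(Section(sec.name, sec.text[:remaining_chars],
                                    sec.priority))
                used = limit_tokens
        # DROPPABLE sections that do not fit are simply dropped.
    return kept
```

The key property: CRITICAL content never degrades silently, and the budget flexes by compressing or dropping lower tiers instead.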
Decision 2: Token Counting Strategy
| Aspect | Option A: Character Heuristic (len/4) | Option B: Model-Specific Tokenizer | Option C: Hybrid (Chosen) |
|---|---|---|---|
| Accuracy | ±15% for English, ±40% for Japanese | Exact | Exact for critical paths, heuristic for estimates |
| Latency | < 1ms | 5-20ms (load tokenizer + encode) | Mixed |
| Dependency | None | tiktoken or model-specific library | Optional dependency |
| Cost | None | CPU time per request | Negligible |
Choice: Hybrid (Option C)
Rationale: Use character heuristic for initial budget estimation and pre-flight checks. Use exact tokenizer for final validation before FM submission. For MangaAssist's Japanese content, the character heuristic underestimates token count by ~40%, which means we would send over-budget requests if we relied on it alone.
Risk: Two counting methods can disagree, causing confusing debugging logs.
Mitigation: Log both estimates in structured logs. If heuristic and exact differ by > 20%, log a warning.
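A sketch of the hybrid counter, using tiktoken's `cl100k_base` encoding as a stand-in for a model-specific tokenizer (substitute whatever tokenizer matches your FM); the 20% drift warning mirrors the mitigation above:

```python
import logging

import tiktoken

logger = logging.getLogger("token_budget")
_ENC = tiktoken.get_encoding("cl100k_base")  # loaded once at module import

def estimate_tokens(text: str) -> int:
    """Fast heuristic for pre-flight checks; ~40% low on Japanese text."""
    return len(text) // 4

def exact_tokens(text: str) -> int:
    """Exact count for final validation before FM submission."""
    return len(_ENC.encode(text))

def validated_count(text: str) -> int:
    est, exact = estimate_tokens(text), exact_tokens(text)
    if est and abs(exact - est) / est > 0.20:
        # Both numbers land in structured logs, per the mitigation above.
        logger.warning("token drift: heuristic=%d exact=%d", est, exact)
    return exact
```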
Skill 5.2.2: FM Integration Decisions
Decision 3: Retry vs. Circuit Breaker Strategy
| Aspect | Option A: Retry Only | Option B: Circuit Breaker Only | Option C: Retry + Circuit Breaker (Chosen) |
|---|---|---|---|
| Transient failure handling | Good (retries help) | Bad (rejects on first failure) | Good |
| Cascade prevention | Bad (retries add load to failing service) | Good (stops traffic) | Good |
| Recovery speed | Slow (each request retries independently) | Fast (single probe) | Fast |
| Complexity | Low | Medium | High |
Choice: Retry + Circuit Breaker (Option C)
Rationale: Bedrock has two distinct failure modes that need different treatments:
- Transient (occasional 429, network blip): retries with backoff solve it
- Systemic (throttling cascade, service degradation): circuit breaker prevents making it worse
Using retries alone during a Bedrock degradation would amplify the problem (N users × 3 retries = 3N Bedrock calls). The circuit breaker stops the bleeding after 5 consecutive failures.
Risk: Circuit breaker opens too aggressively during minor throttling episodes.
Mitigation: Failure threshold set to 5 (not 1 or 2). HALF_OPEN state sends a single probe every 30 seconds. Recovery timeout is tunable per-environment.
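A condensed sketch of the layered approach, with illustrative class names; thresholds match the mitigation above (5 consecutive failures, 30-second half-open probe):

```python
# Layered resilience: backoff retries inside, circuit breaker outside, so
# the breaker counts a failure only after retries are exhausted.
import random
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures, self.opened_at, self.state = 0, 0.0, "CLOSED"

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("breaker open; failing fast")
            self.state = "HALF_OPEN"  # let a single probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "OPEN", time.monotonic()
            raise
        self.failures, self.state = 0, "CLOSED"
        return result

def with_retries(fn, attempts=3, base_delay=0.5):
    """Exponential backoff with jitter for transient 429s / network blips."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt * (0.5 + random.random()))

# Usage: breaker.call(with_retries, lambda: invoke_bedrock(request))
```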
Decision 4: Single FM vs. Model Tiering
| Aspect | Option A: Single Model (Sonnet Only) | Option B: Model Tiering (Chosen) |
|---|---|---|
| Prompt maintenance | One system prompt to maintain | One prompt per model per intent |
| Cost | Higher (Sonnet for everything) | Lower (Haiku for simple intents) |
| Latency | Consistent | Lower for simple intents |
| Reliability | Single point of failure | Graceful degradation possible |
| Quality | Consistent (highest for all) | Variable (simpler model for simple tasks) |
Choice: Model Tiering (Option B)
Rationale: MangaAssist handles intents of wildly different complexity. Order status queries ("Where is my order?") are template-fillable: Haiku handles them at roughly a tenth of the cost and a third of the latency. Product recommendations need Sonnet's reasoning ability. Tiering also provides a reliability benefit: if Sonnet is throttled, simple intents still work on Haiku.
Risk: Prompt maintenance doubles (one prompt per model per intent).
Mitigation: Use a shared prompt template with model-specific adaptation layers. The core instructions are identical; only the output formatting and few-shot examples differ per model.
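A sketch of the routing and shared-template idea. The intent taxonomy and prompt text are illustrative; the Bedrock model IDs are examples of the Haiku/Sonnet tiers:

```python
# Illustrative Bedrock model IDs for the two tiers.
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"

# Illustrative intent taxonomy: template-fillable intents route to Haiku.
SIMPLE_INTENTS = {"order_status", "shipping_eta", "return_policy"}

CORE_INSTRUCTIONS = "You are MangaAssist, a manga shopping assistant..."
MODEL_ADAPTATIONS = {  # only formatting / few-shot guidance differs per model
    HAIKU: "Answer in at most two sentences using the provided template.",
    SONNET: "Reason about the user's tastes before recommending titles.",
}

def route(intent: str) -> str:
    """Simple intents go to Haiku; everything else needs Sonnet."""
    return HAIKU if intent in SIMPLE_INTENTS else SONNET

def build_prompt(intent: str) -> tuple[str, str]:
    """Return (model_id, system_prompt): shared core plus per-model layer."""
    model = route(intent)
    return model, f"{CORE_INSTRUCTIONS}\n\n{MODEL_ADAPTATIONS[model]}"
```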
Decision 5: Synchronous vs. Streaming FM Invocation
| Aspect | Option A: Synchronous | Option B: Streaming (Chosen) |
|---|---|---|
| Time to first token | 1-3s (full generation) | 200-500ms |
| UX | Loading spinner → full response | Progressive rendering |
| Error handling | Simple (success/fail) | Complex (stream interruption) |
| Client complexity | Simple HTTP response | WebSocket management |
| Mobile reliability | Robust | Fragile on poor connections |
Choice: Streaming (Option B) with synchronous fallback
Rationale: Users expect chatbot responses to appear progressively. A 3-second blank screen feels broken even if the response is excellent. Streaming reduces perceived latency by 80%. However, mobile users on poor connections need a fallback to fetch the full response after stream interruption.
Risk: Partial stream delivery creates visible half-responses in the UI.
Mitigation: Buffer complete sentences server-side before streaming. Add GET /chat/message/{id} endpoint for post-reconnection recovery.
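A simplified sketch against the Bedrock runtime API, assuming Anthropic's messages-format payloads; production code additionally buffers complete sentences and hands mid-stream recovery to the GET /chat/message/{id} endpoint:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def generate(model_id: str, body: dict):
    """Yield text deltas; on any failure, fall back to one synchronous call."""
    try:
        resp = bedrock.invoke_model_with_response_stream(
            modelId=model_id, body=json.dumps(body))
        for event in resp["body"]:
            chunk = json.loads(event["chunk"]["bytes"])
            # Assumes Anthropic messages-API stream events; other model
            # families use different payload shapes.
            delta = chunk.get("delta", {}).get("text", "")
            if delta:
                yield delta
    except Exception:
        # Stream setup or mid-stream failure: fall back to a synchronous
        # call. (Production mid-stream recovery uses GET /chat/message/{id}
        # instead, so already-delivered text is not re-sent.)
        resp = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
        payload = json.loads(resp["body"].read())
        yield payload["content"][0]["text"]
```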
Skill 5.2.3: Prompt Engineering Decisions
Decision 6: Prompt Testing — Manual vs. Automated
| Aspect | Option A: Manual Review | Option B: Automated Golden Set (Chosen) |
|---|---|---|
| Coverage | 5-10 cases per review | 20+ cases per intent, every time |
| Consistency | Reviewer-dependent | Deterministic scoring |
| Speed | Hours per prompt change | Minutes |
| Regression detection | Depends on reviewer memory | Automatic per-dimension comparison |
| Subtlety detection | Good (humans catch tone) | Moderate (heuristic scores) |
Choice: Automated Golden Set (Option B) with manual review for borderline cases
Rationale: The golden test suite catches 90% of regressions automatically. The remaining 10% (tone shifts, subtle instruction following changes) are caught by manual review of flagged cases where scores are within ±5% of thresholds. This gives speed without sacrificing nuance.
Risk: Golden set becomes stale as user behavior evolves.
Mitigation: Refresh golden set quarterly. Add every production incident as a new test case. Track test case age and flag cases older than 6 months.
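A minimal runner sketch, assuming an illustrative JSONL case format, pluggable per-dimension scorers, and an illustrative 0.8 pass threshold; the borderline-review flag mirrors the rationale above:

```python
import json
from pathlib import Path

def run_golden_suite(cases_path: str, generate, scorers: dict) -> list[dict]:
    """Score every golden case on every dimension; flag borderline cases."""
    results = []
    for line in Path(cases_path).read_text().splitlines():
        case = json.loads(line)  # one JSONL case: {"id", "input", "expected"}
        response = generate(case["input"])
        # Each scorer returns a float in [0, 1] for one quality dimension.
        scores = {dim: fn(response, case["expected"])
                  for dim, fn in scorers.items()}
        results.append({
            "case_id": case["id"],
            "scores": scores,
            # Within ±0.05 of the (illustrative) 0.8 pass threshold:
            # route to manual review instead of auto pass/fail.
            "needs_review": any(abs(s - 0.8) <= 0.05
                                for s in scores.values()),
        })
    return results
```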
Decision 7: Prompt Scoring — Heuristic vs. LLM-as-Judge
| Aspect | Option A: Heuristic Scoring (Chosen for MVP) | Option B: LLM-as-Judge |
|---|---|---|
| Speed | < 10ms per case | 1-3s per case (FM call) |
| Cost | Free | $5-20 per evaluation suite run |
| Determinism | Fully deterministic | Stochastic (varies per run) |
| Subtlety | Low (keyword matching, format checks) | High (nuanced quality assessment) |
| Scalability | Unlimited | Limited by FM throughput and cost |
Choice: Heuristic for MVP, with LLM-as-Judge planned for Phase 2
Rationale: Heuristic scoring catches the failures that matter most in production: missing required content, wrong format, forbidden content, hallucination signals. These are binary checks. For nuanced quality assessment (tone, helpfulness, completeness), LLM-as-Judge will be added as a secondary scoring layer — but only after the golden test suite proves its value with deterministic checks.
Risk: Heuristic scoring misses quality degradation that users notice but heuristics don't.
Mitigation: Sample 5% of production responses for human review weekly. Use feedback signals (thumbs up/down) to identify cases where heuristic scores were misleading.
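A sketch of the binary checks, with an illustrative check list; because every check is a deterministic pass/fail, suite scores never vary between runs:

```python
import json

def score_response(text: str, required: list[str],
                   forbidden: list[str]) -> dict[str, bool]:
    lowered = text.lower()
    checks = {
        # Required content present (e.g., the ASIN or price the answer must cite).
        "required_content": all(t.lower() in lowered for t in required),
        # Forbidden content absent (competitor names, unsupported claims).
        "forbidden_absent": not any(t.lower() in lowered for t in forbidden),
        # Format check: the response must parse as JSON when a schema is expected.
        "valid_json": _is_json(text),
    }
    checks["overall"] = all(checks.values())
    return checks

def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False
```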
Skill 5.2.4: Retrieval System Decisions
Decision 8: Exact vs. Approximate Nearest Neighbor Search
| Aspect | Option A: Exact k-NN | Option B: Approximate (HNSW) (Chosen) |
|---|---|---|
| Accuracy | 100% recall | 95-99% recall (tunable) |
| Latency | O(n) — scales linearly | O(log n) — sub-linear |
| Memory | Lower | Higher (HNSW graph) |
| Scale ceiling | ~100K vectors practical | Millions |
| Tuning | None needed | m, ef_construction, ef_search |
Choice: Approximate (HNSW) via OpenSearch Serverless
Rationale: MangaAssist's product catalog has 500K+ items and grows daily. Exact search at this scale would have P95 latency > 1s, busting our 100ms retrieval budget. HNSW achieves 97% recall at P95 < 50ms with our index configuration (m=16, ef_construction=256, ef_search=256).
Risk: Edge cases where the 3% recall gap causes a relevant product to be missed.
Mitigation: Use hybrid search — k-NN first, then keyword fallback for queries with low vector similarity scores. The keyword fallback catches exact-match cases (specific ASIN queries, exact title matches) that ANN sometimes misses.
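A sketch of the two-stage hybrid query using opensearch-py; auth/signing for OpenSearch Serverless is omitted, and the index name, field names, and fallback score threshold are illustrative:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "collection-endpoint", "port": 443}],
                    use_ssl=True)

def hybrid_search(query_text: str, query_vec: list[float], k: int = 10,
                  min_score: float = 0.6) -> list[dict]:
    # Stage 1: approximate k-NN over the HNSW index.
    knn = client.search(index="products", body={
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_vec, "k": k}}},
    })
    hits = knn["hits"]["hits"]
    if hits and hits[0]["_score"] >= min_score:
        return hits
    # Stage 2: keyword fallback for exact-match cases (specific ASINs,
    # exact titles) where vector similarity came back weak.
    bm25 = client.search(index="products", body={
        "size": k,
        "query": {"multi_match": {"query": query_text,
                                  "fields": ["title", "asin"]}},
    })
    return bm25["hits"]["hits"]
```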
Decision 9: Embedding Refresh Frequency
| Aspect | Option A: Real-time (on write) | Option B: Batch Daily (Chosen) | Option C: Batch Weekly |
|---|---|---|---|
| Freshness | Immediate | Up to 24h stale | Up to 7 days stale |
| Cost | High (Bedrock embed per write) | Medium (batch embed) | Low |
| Complexity | High (streaming pipeline) | Medium (scheduled Lambda) | Low |
| Error blast radius | Per-item | Per-batch (100-2000 items) | Per-batch (large) |
Choice: Batch Daily (Option B) with event-driven ingestion for critical updates
Rationale: Most product catalog changes (new listings, price updates) are not time-critical — a 24h delay is acceptable. However, major launches (new manga series, trending items) need same-day visibility. We use daily batch as the baseline plus an SQS-triggered event pipeline for items flagged as "priority" by the product team.
Risk: Priority items missed if the event pipeline fails silently.
Mitigation: CloudWatch alarm on event pipeline DLQ depth > 0. Daily reconciliation job compares OpenSearch document count against DynamoDB catalog count.
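A sketch of the daily reconciliation check, with illustrative table, index, and metric names; note that DynamoDB's `item_count` is approximate (refreshed roughly every six hours), so the alarm should tolerate small drift:

```python
import boto3
from opensearchpy import OpenSearch

def reconcile(table_name: str, os_client: OpenSearch, index: str) -> int:
    """Compare catalog size in DynamoDB against documents in OpenSearch."""
    table = boto3.resource("dynamodb").Table(table_name)
    catalog_count = table.item_count  # approximate; refreshed ~every 6 hours
    indexed_count = os_client.count(index=index)["count"]
    drift = catalog_count - indexed_count
    # Emit drift as a metric so an alarm, not a human, watches it.
    boto3.client("cloudwatch").put_metric_data(
        Namespace="MangaAssist/Retrieval",
        MetricData=[{"MetricName": "EmbeddingDrift", "Value": drift}])
    return drift
```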
Decision 10: Chunk Size Strategy
| Aspect | Option A: Fixed 500 chars | Option B: Fixed 1000 chars | Option C: Content-Aware (Chosen) |
|---|---|---|---|
| FAQ items | Good (short, focused) | Too large (diluted) | Matched to FAQ boundary |
| Product descriptions | Too small (split mid-paragraph) | Good | Matched to description structure |
| Series narratives | Too small (100+ chunks) | Still fragmented | Paragraph-level with overlap |
| Metadata preservation | May split from content | May split from content | Attached per chunk |
Choice: Content-Aware Chunking (Option C)
Rationale: MangaAssist has three very different content types in a single index. A fixed chunk size is wrong for at least one of them. Content-aware chunking splits FAQs at the question boundary, product descriptions as whole units (with metadata appended), and narratives at paragraph boundaries with 100-char overlap.
Risk: Chunking logic becomes complex and content-type dependent.
Mitigation: Each content type has a separate chunking function. New content types default to the 800-char fixed chunker until a custom chunker is written and tested.
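A dispatch sketch under illustrative assumptions about the source formats ("Q:" markers for FAQs, blank-line paragraphs for narratives); the 800-char fallback matches the mitigation above:

```python
def chunk_faq(text: str) -> list[str]:
    # One chunk per Q/A pair, split at the question boundary.
    parts = [p.strip() for p in text.split("Q:") if p.strip()]
    return ["Q:" + p for p in parts]

def chunk_narrative(text: str, overlap: int = 100) -> list[str]:
    # Paragraph-level chunks, each prefixed with a 100-char tail of the
    # previous paragraph for continuity.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, tail = [], ""
    for p in paragraphs:
        chunks.append(f"{tail}\n{p}".strip() if tail else p)
        tail = p[-overlap:]
    return chunks

def chunk_fixed(text: str, size: int = 800) -> list[str]:
    # Default for content types that lack a tested custom chunker yet.
    return [text[i:i + size] for i in range(0, len(text), size)]

# Product descriptions are indexed whole with metadata appended, so they
# are handled upstream; everything else dispatches here.
CHUNKERS = {"faq": chunk_faq, "series_narrative": chunk_narrative}

def chunk(content_type: str, text: str) -> list[str]:
    return CHUNKERS.get(content_type, chunk_fixed)(text)
```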
Skill 5.2.5: Prompt Maintenance Decisions
Decision 11: Observability Tooling — CloudWatch vs. X-Ray vs. Both
| Aspect | Option A: CloudWatch Metrics + Logs | Option B: X-Ray Tracing | Option C: Both (Chosen) |
|---|---|---|---|
| Aggregation | Excellent (dashboards, alarms) | Poor (trace-level only) | Excellent (CloudWatch side) |
| Request-level detail | Possible (Logs Insights) | Excellent (waterfall view) | Excellent (X-Ray side) |
| Cost | Low-medium | Low | Combined (both bills) |
| Setup | Easy | Moderate (SDK instrumentation) | High |
| Correlation | Log-based (correlation_id) | Native (trace_id) | Both available |
Choice: CloudWatch Metrics + Logs for aggregation, X-Ray for request-level diagnosis
Rationale: You need both zoom levels. CloudWatch answers "Is the system healthy?" (aggregate metrics, alarms, dashboards). X-Ray answers "Why did this specific request fail?" (request waterfall, span timings). Neither alone is sufficient for GenAI troubleshooting.
Risk: Dual instrumentation doubles the observability code.
Mitigation: Single PromptObservabilityPipeline class wraps both CloudWatch and X-Ray. Spans are annotated once and emitted to both systems. Correlation IDs bridge the two.
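A sketch of the single-wrapper idea using the X-Ray SDK for Python; it assumes an active X-Ray segment (as inside an instrumented Lambda), and the namespace and metric names are illustrative:

```python
import time

import boto3
from aws_xray_sdk.core import xray_recorder

cloudwatch = boto3.client("cloudwatch")

class PromptObservabilityPipeline:
    """Annotate once; emit to both X-Ray and CloudWatch."""

    def observe(self, name: str, correlation_id: str, fn):
        # The correlation ID lands in the trace as an annotation and can
        # be echoed into structured logs to bridge the two systems.
        subsegment = xray_recorder.begin_subsegment(name)
        subsegment.put_annotation("correlation_id", correlation_id)
        start = time.monotonic()
        try:
            return fn()
        finally:
            latency_ms = (time.monotonic() - start) * 1000
            xray_recorder.end_subsegment()
            cloudwatch.put_metric_data(
                Namespace="MangaAssist/Prompts",
                MetricData=[{"MetricName": f"{name}Latency",
                             "Unit": "Milliseconds",
                             "Value": latency_ms}])
```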
Decision 12: Schema Validation — Strict vs. Lenient
| Aspect | Option A: Strict (Block) | Option B: Lenient + Log (Chosen) |
|---|---|---|
| User impact | Bad response blocked → fallback | Schema-drifted response delivered + logged |
| Detection speed | Immediate (user sees error) | After log analysis (minutes to hours) |
| False positive impact | High (blocks good responses with unexpected fields) | None (logs only) |
| Data quality | Higher (only valid responses reach user) | Lower (some schema-variant responses reach user) |
Choice: Lenient + Log (Option B) for production, Strict for testing
Rationale: In production, a response with an extra field is better than no response. The user does not care about an extra confidence field the FM added — but they care about seeing an error message. In testing, strict validation ensures prompts produce the exact expected schema.
Risk: Schema drift accumulates undetected if logs are not monitored.
Mitigation: Nightly schema compliance report with CloudWatch Logs Insights. Alert if compliance drops below 95%.
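A sketch of the lenient/strict split using the jsonschema library; the schema fragment and the way the `strict` flag is wired are illustrative:

```python
import logging

from jsonschema import ValidationError, validate

logger = logging.getLogger("schema_compliance")

RESPONSE_SCHEMA = {  # illustrative fragment of the response contract
    "type": "object",
    "required": ["answer", "intent"],
    # additionalProperties left permissive: an extra FM field is tolerated.
}

def check_response(payload: dict, strict: bool = False) -> dict:
    """Strict (testing): block on drift. Lenient (production): deliver + log."""
    try:
        validate(instance=payload, schema=RESPONSE_SCHEMA)
    except ValidationError as err:
        if strict:
            raise  # testing: fail the suite on any schema deviation
        # Production: the response still ships; the warning feeds the
        # nightly Logs Insights compliance report.
        logger.warning("schema drift: %s", err.message)
    return payload
```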
Cross-Cutting Decision Summary
graph TD
A[GenAI Troubleshooting<br>Decision Space] --> B[Content Handling<br>5.2.1]
A --> C[FM Integration<br>5.2.2]
A --> D[Prompt Engineering<br>5.2.3]
A --> E[Retrieval System<br>5.2.4]
A --> F[Prompt Maintenance<br>5.2.5]
B --> B1[Dynamic chunking<br>over truncation]
B --> B2[Hybrid token counting<br>heuristic + exact]
C --> C1[Retry + circuit breaker<br>layered resilience]
C --> C2[Model tiering<br>Sonnet + Haiku]
C --> C3[Streaming with<br>sync fallback]
D --> D1[Automated golden set<br>+ manual review]
D --> D2[Heuristic scoring MVP<br>LLM-judge later]
E --> E1[HNSW approximate<br>+ keyword fallback]
E --> E2[Daily batch +<br>event-driven priority]
E --> E3[Content-aware chunking<br>per content type]
F --> F1[CloudWatch + X-Ray<br>dual observability]
F --> F2[Lenient production<br>strict testing]
Cost Impact Summary
| Decision | Monthly Cost Impact | Justification |
|---|---|---|
| Dynamic chunking over summarization | -$800 | Avoids extra LLM call per overflow event (~50K events/month) |
| Model tiering (Haiku for simple) | -$2,500 | 40% of requests use Haiku at 1/10th cost |
| Streaming with sync fallback | +$200 | WebSocket infrastructure + reconnection endpoint |
| Automated golden set testing | -$150 | Prevents costly production rollbacks (1 incident = $500-2000) |
| HNSW over exact search | +$300 | Higher memory for HNSW graph, but saves compute |
| Daily batch + event embed | -$400 | Vs real-time: 90% fewer Bedrock embed API calls |
| CloudWatch + X-Ray dual | +$250 | Combined telemetry costs |
| Net monthly impact | -$3,100 | Net savings, driven mainly by model tiering and the summarization calls avoided by dynamic chunking |