
06. Decision Records — Multi-Stakeholder Perspectives

"Every architecture decision in MangaAssist had to clear a room of people who don't share the same priorities. The Backend engineer wants clean code; SRE wants a runbook; Security wants a threat model; ML wants A/B test capability; Product wants user delight; Frontend wants edge-case clarity; Cost wants the unit economics; Legal wants the audit trail. This document is a record of how the hardest decisions cleared that room — what the alternatives were, who pushed back, and what we wrote in the margins."


How These Are Different from the Lens Decisions in 00-hld-lld-architecture.md

The lens decisions in 00-hld-lld-architecture.md (D-1 through D-10) cover the architectural skeleton — the structural choices baked in at design time.

This document covers the operationally-driven decisions — the ones forced by an incident, an outage, a quality regression, or a stakeholder challenge that landed after initial design. Each ADR here has:

  1. Context — what was happening when the decision was made (often: an incident)
  2. Forces — competing pressures from each lens
  3. Decision — what we picked
  4. Alternatives — what we considered and rejected
  5. Consequences — what got worse, what got better, what we still owe
  6. Revisit triggers — what would force us to revisit this

ADR-001: Price Hallucination — Override at Output, Then Move to Placeholders

Context

Two weeks after MVP launch, customer support flagged a complaint: "Chatbot told me a manga was $9.99 — I added it to cart and it was $14.99. Why is Amazon advertising wrong prices?"

Investigation showed the LLM was generating prices from training memory ~6% of the time, even with PRICE_DATA explicitly in the prompt. Pure prompt engineering reduced it to ~3% but never lower.

Forces

Lens Force
⚖️ Legal "Wrong prices are a regulated representation. We could be required to honor any price stated. Zero tolerance."
📊 PM "The chatbot's value is product confidence. If users don't trust prices, they don't use the chatbot."
🤖 ML "Prompt engineering has diminishing returns past 3%. We need a runtime defense, not just a prompt fix."
🔧 Backend "A post-LLM regex extractor + override is straightforward. ~30ms latency add per response."
⚙️ SRE "What happens if the Pricing API is slow? We can't stream the response while waiting for price validation."
💰 Cost "Pricing API call per response = +~$120/month. Negligible."
🛡️ Security "Wrong prices are a customer-trust attack vector — adversarial input could try to elicit them."
🎨 Frontend "If the user sees a wrong price even briefly during streaming, they screenshot it. The UX problem is real."

Decision

Two-phase fix, deployed in sequence:

Phase 1 (week 1): Runtime price validation in guardrails.

  • Regex-extract every dollar amount in the response.
  • Map to nearest ASIN by character proximity.
  • Cross-check against live Pricing API.
  • Replace mismatches in-place, emit correction frame if streaming had already occurred.
  • If Pricing API doesn't respond in 200ms, strip all price mentions and add "Check the product page for current pricing."
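For illustration, a minimal sketch of the Phase 1 check (validate_prices, asin_offsets, and pricing_client.get_price are assumed names, not the production interface):

```python
import re

PRICE_RE = re.compile(r"\$\d{1,5}(?:\.\d{2})?")
FALLBACK_NOTE = "Check the product page for current pricing."

def validate_prices(response_text: str, asin_offsets: dict, pricing_client) -> str:
    """Cross-check every dollar amount in the response against the live Pricing API.

    asin_offsets: {asin: character offset of its mention}, used to map each
    price to the nearest ASIN by character proximity.
    """
    matches = list(PRICE_RE.finditer(response_text))
    if not matches or not asin_offsets:
        return response_text
    corrected = response_text
    try:
        # Walk right-to-left so earlier offsets stay valid after each replacement.
        for m in reversed(matches):
            asin = min(asin_offsets, key=lambda a: abs(asin_offsets[a] - m.start()))
            live = pricing_client.get_price(asin, timeout=0.2)  # assumed client; 200ms budget
            if live is not None and f"${live:.2f}" != m.group(0):
                corrected = corrected[: m.start()] + f"${live:.2f}" + corrected[m.end():]
    except TimeoutError:
        # Pricing API too slow: strip all prices rather than risk showing a wrong one.
        return PRICE_RE.sub("", response_text).rstrip() + " " + FALLBACK_NOTE
    return corrected
```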

Phase 2 (month 2): Prompt-level redesign — prices never come from the LLM at all.

  • LLM is prompted to emit price placeholders: {{PRICE:B08X1YRSTR}}.
  • A post-generation resolver service substitutes real prices.
  • LLM never generates a numeric price token.
  • Inline-text price mentions ("around $10") are explicitly forbidden in the system prompt.
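A minimal sketch of the Phase 2 resolver, assuming a pricing_client with a get_price method (the helper names are illustrative):

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\{PRICE:([A-Z0-9]{10})\}\}")
FALLBACK_TEXT = "see the product page for current pricing"

def resolve_placeholders(text: str, pricing_client) -> str:
    """Replace {{PRICE:ASIN}} placeholders with live prices after generation."""
    def _sub(match: re.Match) -> str:
        asin = match.group(1)
        price = pricing_client.get_price(asin)   # assumed client method
        return f"${price:.2f}" if price is not None else FALLBACK_TEXT
    return PLACEHOLDER_RE.sub(_sub, text)
```

Because the LLM never emits a numeric price token, a resolver miss degrades to the fallback text rather than to a fabricated number.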

Alternatives rejected

Alternative Reason rejected
Stronger system prompt only Tried, plateaued at 3%. Diminishing returns.
Block any response containing a price Too aggressive — many valid responses mention prices correctly.
Buffer the entire response and validate before streaming Kills streaming UX (~1500ms blocking).
Train a custom model that doesn't hallucinate prices Months of work; over-engineered for the problem.

Consequences

Positive:

  • Hallucinated prices in production: 6.2% → 0.05% (Phase 1) → 0.0% in LLM tokens (Phase 2; placeholder resolver guarantees).
  • Legal sign-off achieved.
  • The placeholder pattern was reused for other high-stakes fields: ASIN cards, in-stock status.

Negative:

  • Inline price mentions in prose are now stripped by the guardrail when the LLM forgets the placeholder rule, which occasionally yields slightly awkward responses ("This is great manga and is currently priced at — for current pricing, please check the product page.").
  • Pricing API became a required dependency. If it's down, prices don't appear at all — better than wrong, worse than approximate.

Revisit triggers

  • LLM family change (e.g., Claude 4) — re-measure fabrication rate; placeholder rule may be enforced more reliably and inline stripping logic could be relaxed.
  • Pricing API SLA degrades — consider local price cache with explicit staleness indicator.

ADR-002: DAX as a Fix for Hot Sessions vs. Write-Sharding the Table

Context

Scenario 4 in 03-scale-testing-scenarios.md: a viral manga YouTuber linked to a chat session, 50K users hit the same partition key in minutes, DynamoDB throttled, orchestrator retries amplified the herd.

Forces

Lens Force
⚙️ SRE "I want to stop being paged. Whatever fix gets there fastest is the right fix."
🔧 Backend "DAX is a 1-week change. Write-sharding the partition key is a 6-week migration with backwards compatibility."
🤖 ML "Indifferent — doesn't affect model behavior."
📊 PM "The next viral moment is in days, not weeks. Take the fast fix."
💰 Cost "DAX is +$200/month. Write-sharding is engineering time but no infra cost. DAX is cheaper."
🛡️ Security "DAX is in our VPC. No new exposure."
🎨 Frontend "Doesn't affect me."
⚖️ Legal "PII vault items must NOT be in DAX (they require explicit auditable access)."

Decision

Take the DAX fix immediately, and log a write-sharding follow-up (still not done at the time of writing).

Specific configuration:

  • DAX cluster in front of the sessions table.
  • PII vault items use ConsistentRead=True and bypass DAX.
  • Application-layer request coalescing: when 50 requests arrive for the same session_id within 100ms, only one DynamoDB read is issued; others wait on a future.
  • Exponential backoff with jitter on retry — independent of DAX.
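A minimal sketch of the request-coalescing behavior described above, using asyncio (the fetch function and in-flight map are illustrative; the 100ms window is approximated here by the lifetime of the in-flight read):

```python
import asyncio

_inflight: dict[str, asyncio.Future] = {}

async def get_session(session_id: str, fetch_from_dynamodb) -> dict:
    """Coalesce concurrent reads for the same session_id into one DynamoDB call."""
    existing = _inflight.get(session_id)
    if existing is not None:
        return await existing              # piggyback on the in-flight read

    loop = asyncio.get_running_loop()
    future = loop.create_future()
    _inflight[session_id] = future
    try:
        item = await fetch_from_dynamodb(session_id)   # single read for the herd
        future.set_result(item)
        return item
    except Exception as exc:
        future.set_exception(exc)
        raise
    finally:
        _inflight.pop(session_id, None)
```

The coalescing map lives in the orchestrator process; the exponential backoff with jitter on retries sits outside this path, as noted above.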

Alternatives rejected

Alternative Reason rejected
Write-sharding only Too slow to ship; viral risk in the meantime.
Application-layer cache (Redis) for sessions Duplicates state; harder to invalidate; DAX is purpose-built.
Move sessions to DAX-style store entirely (e.g., MemoryDB) Over-engineered; DynamoDB durability is needed for audit/PII.

Consequences

Positive:

  • Throttle events: tens per week → zero, sustained for 6+ months.
  • P99 read latency on hot sessions: 450ms → <1ms (DAX cache hit).

Negative:

  • DAX is now a load-bearing component. If it fails, the original partition-key issue resurfaces.
  • We accumulated technical debt — the "write-sharding for sessions" follow-up is still open. SRE flagged this as a "band-aid" decision in the original review.

Revisit triggers

  • DAX availability incident (would force the fall-through path to fire and likely cause the original symptom to recur briefly).
  • Session traffic doubles (~100K msg/sec) — DAX cluster sizing may not scale linearly without a re-architecture.
  • Move to a multi-tenant deployment (e.g., other domains beyond manga share the orchestrator) — partition key design needs to be revisited anyway.

ADR-003: Shadow Mode for Prompt Changes — Required, Even at $31.5K/week

Context

Three months in, the team wanted to ship a small prompt wording change: replace "Recommend" with "I recommend" to make responses feel more conversational. The engineering manager pushed back: "this is a one-word change, do we really need a week of shadow mode at $31.5K?"

Two months earlier, a different "small" change (adding a single sentence: "You may use friendly emojis when appropriate") had caused Claude to add emojis to every response, including order tracking and return flows. It was caught in shadow mode. The brand impact would have been significant if it had shipped to 100%.

Forces

Lens Force
💰 Cost "$31.5K/week × 52 = $1.6M/year if every change goes through shadow mode. We need to be selective."
🤖 ML "Prompt changes are non-linear in their effects. A one-word change can shift the response distribution."
📊 PM "We can't ship surprises to customers. The emoji incident was a near-miss."
⚖️ Legal "Brand consistency is part of compliance — voice guidelines exist for a reason."
⚙️ SRE "Without shadow mode, my only signal is post-deploy metrics. By then it's already affecting users."
🔧 Backend "Shadow mode infra is built. The marginal cost is just the LLM calls — $31.5K is the LLM bill, not engineering effort."
🛡️ Security "Shadow mode also catches prompt injection regressions in adversarial test."
🎨 Frontend "If response length distribution shifts, our chat panel layout breaks. Shadow mode catches that early."

Decision

Shadow mode is required for any change that touches the LLM inference path. A risk taxonomy gates duration:

Change Type Shadow Required? Duration
Pure infra change (no prompt/model touch) No
Typo fix in prompt (no semantic change) No, golden eval only
Adding a sentence to system prompt Yes 2 days
Changing instruction verbs ("use" → "always use" → "only use") Yes 1 week
Adding a new RAG injection field Yes 3 days
Model version change Yes 1 week + per-class regression review
New guardrail stage Yes 1 week (high false-positive risk)

Cost is justified by direct comparison: one production incident (emoji-style) costs more in customer trust and engineering hotfix time than ~3 weeks of shadow mode. We run shadow mode fewer than 20 times per quarter.
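As a sketch, the taxonomy can be encoded directly as a deploy-gate policy (the change-type keys and helper function are hypothetical, not an existing internal tool):

```python
# Shadow-mode risk taxonomy, mirroring the table above.
SHADOW_POLICY = {
    "infra_only":              {"shadow": False, "duration_days": 0},
    "prompt_typo":             {"shadow": False, "duration_days": 0},  # golden eval only
    "prompt_add_sentence":     {"shadow": True,  "duration_days": 2},
    "prompt_instruction_verb": {"shadow": True,  "duration_days": 7},
    "rag_injection_field":     {"shadow": True,  "duration_days": 3},
    "model_version":           {"shadow": True,  "duration_days": 7},  # plus per-class regression review
    "new_guardrail_stage":     {"shadow": True,  "duration_days": 7},  # high false-positive risk
}

def required_shadow_days(change_type: str) -> int:
    """Return the shadow duration a change must clear before deploy."""
    policy = SHADOW_POLICY.get(change_type)
    if policy is None:
        return 7   # unknown change types default to the strictest treatment
    return policy["duration_days"] if policy["shadow"] else 0
```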

Alternatives rejected

Alternative Reason rejected
Skip shadow for "minor" prompt changes (engineer's judgment) "Minor" is the exact category that bit us with emojis. Engineer judgment isn't reliable on prompt non-linearity.
Shadow on 1% canary instead of full shadow Smaller signal — many regressions only show at scale.
Replace shadow with offline-only golden eval Golden eval missed the emoji issue (the dataset didn't include "emoji presence" as a metric until after the incident).

Consequences

Positive:

  • Prevented at least 3 user-facing regressions in the next 9 months (emoji-style, response inflation, intent routing drift).
  • Made the shadow mode cost a budget line item rather than a per-change argument — easier to justify because everyone sees it as fixed cost.

Negative:

  • Slows down prompt iteration — "ship Friday → analyze Monday" cadence.
  • Engineering team grumbled in early months. Mitigated by showing the $31.5K vs. incident-cost comparison repeatedly.

Revisit triggers

  • Shadow mode cost exceeds 5% of total chatbot budget (it's currently ~3%).
  • Two consecutive quarters with zero shadow-mode-caught regressions (would suggest the system has stabilized — could relax for low-risk changes).

ADR-004: Inline Guardrails Are Regex-Only, Not LLM-Based

Context

During design of the two-tier guardrails (see D-2 in 00-hld-lld-architecture.md), the question came up: should the inline (per-token) guardrail be a small LLM (e.g., a fine-tuned classifier), or stay as regex?

Forces

Lens Force
🛡️ Security "Regex misses things. An LLM-based guardrail catches semantic violations a regex doesn't."
🤖 ML "An inference-per-token guardrail = +50ms per token. Streaming becomes useless."
🔧 Backend "Regex is microseconds. LLM is milliseconds. At 50K msg/sec, the difference is unbridgeable."
⚙️ SRE "Fewer moving parts. Regex doesn't have cold starts or GPU saturation."
📊 PM "User-perceived latency is everything. Whatever preserves streaming wins."
💰 Cost "An ML guardrail at every token = millions of inference calls/day. Cost-prohibitive."
🎨 Frontend "Doesn't affect me directly."
⚖️ Legal "I'd accept regex for inline IF the post-stream pipeline is comprehensive (it is)."

Decision

Inline guardrails are regex + simple keyword lookup only. Anything semantic moves to the post-stream pipeline.

Specifically, the inline checks are:

  • PII patterns: SSN, credit card, phone, email — well-defined regex.
  • Competitor names: closed list (Barnes & Noble, Books-A-Million, etc.) — exact-match keyword.
  • Profanity: blocked-word list — exact-match.

What is not inline (and goes to post-stream):

  • Price accuracy (requires API call).
  • ASIN validation (requires API call).
  • Toxicity (requires LLM-based classifier).
  • Scope check (requires LLM-based classifier).
  • Cross-title attribute mixing (requires domain reasoning).
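A minimal sketch of the inline checks listed above, regex and exact-match keyword only; the specific patterns and word lists are illustrative, not the production rule set:

```python
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                 # SSN
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),                # credit card (loose)
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),     # US phone
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),           # email
]
COMPETITOR_NAMES = {"barnes & noble", "books-a-million"}  # closed list, exact match
BLOCKED_WORDS = {"exampleprofanity"}                      # placeholder entries

def inline_check(chunk: str) -> bool:
    """Return True if the streamed chunk is safe to emit; microseconds per call."""
    lowered = chunk.lower()
    if any(p.search(chunk) for p in PII_PATTERNS):
        return False
    if any(name in lowered for name in COMPETITOR_NAMES):
        return False
    if any(word in lowered for word in BLOCKED_WORDS):
        return False
    return True
```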

Alternatives rejected

Alternative Reason rejected
Lightweight ML classifier (DistilBERT-style) inline +15ms per token = streaming dies.
Bedrock Guardrails inline Bedrock Guardrails has its own latency profile; we measured 30-100ms per call.
Buffer 5 tokens then check 5-token buffer is too short for semantic checks, too long to feel instant. Worst of both.

Consequences

Positive:

  • Inline checks add < 1ms per chunk.
  • Streaming feels instant.
  • Regex rules are easily testable, auditable, and language-portable.

Negative:

  • A hallucinated price like "$12.99" doesn't get caught inline — it goes through unchecked, and the post-stream pipeline then catches it and emits a correction frame. The user sees the wrong number for 1-3 seconds.
  • Mitigated by the placeholder rewrite (ADR-001) — prices are now never generated by the LLM in the first place.

Revisit triggers

  • A faster ML guardrail becomes viable (sub-1ms per token at our throughput).
  • A different class of inline harm emerges that regex genuinely cannot detect (e.g., subtle tone violations).

ADR-005: Active-Active Multi-Region — Approved Despite 1.7x Cost

Context

After designing single-region multi-AZ, the SRE lead pushed for active-active multi-region. The first cost estimate came back at +70% infrastructure cost. Initial leadership reaction: "single-region with multi-AZ is fine, 99.9% is good enough."

Then us-east-1 had a 4-hour partial outage in month 5. MangaAssist degraded heavily. The post-mortem reopened the multi-region question.

Forces

Lens Force
⚙️ SRE "Single-region was fine until it wasn't. 4 hours of degradation during the next Prime Day = dead."
💰 Cost "+70% infra cost is hard to swallow. We're paying for resilience, not features."
📊 PM "Customers don't accept 'AWS was down.' If they can't get their order tracked, they call the call center. That's the actual cost."
🤖 ML "Bedrock model availability differs per region. We have to verify same models in both regions and accept the constraint."
🛡️ Security "Two attack surfaces. Identical IAM/WAF needed. Adds an audit obligation."
🔧 Backend "Stateless orchestrator works in both regions. Eventual consistency on Global Tables is the main concern."
🎨 Frontend "Single hostname via Route 53. No client work."
⚖️ Legal "Multi-region is preferred for DR posture. Some customers ask about it explicitly."

Decision

Active-active across us-east-1 and us-west-2. Justified not on unit cost but on incident-cost-avoidance:

  • A 4-hour regional event during Prime Day = ~$X million in deflection-loss + customer-trust impact.
  • The +70% infra cost amortizes to ~$Y/month.
  • Break-even: one regional event every 2 years pays for the entire multi-region investment.
  • Historical AWS regional events: ~1-2 per year of meaningful duration.

Conclusion: the resilience investment is justified by the actuarial expected cost of regional events, not the per-request unit economics.
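Stated symbolically (a back-of-the-envelope formalization of the reasoning above; the symbols stand in for the elided $X and $Y figures, they are not measured values):

```latex
% Adopt multi-region when the expected annual cost of regional events
% exceeds the annualized extra infrastructure spend.
\[
\underbrace{p_{\text{events/yr}} \cdot C_{\text{event}}}_{\text{expected annual outage cost}}
\;>\;
\underbrace{12 \cdot \Delta C_{\text{infra/month}}}_{\text{annualized cost of the extra 70\%}}
\]
% With the stated break-even (one event every two years covers the investment),
% this holds whenever C_event > 24 * Delta C_infra/month.
```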

Alternatives rejected

Alternative Reason rejected
Single-region multi-AZ Insufficient for region-level events.
Active-passive (warm standby) Failover is brittle at the moment it's needed most (during an outage).
Multi-region but only for the orchestrator (not data) Data is the bottleneck — Global Tables required.
Multi-region but only for compute (Bedrock single-region) Bedrock outage = chatbot down even with multi-region orchestrator. Required Bedrock parity.

Consequences

Positive:

  • Survived the next regional event with no user-visible impact (Route 53 shifted ~80% of new connections to the healthy region within 2 minutes).
  • Cleaner DR posture for compliance reviews.
  • Latency improvements for west-coast users (~30ms reduction on average).

Negative:

  • 1.7x infrastructure cost.
  • Two-region testing complexity for new features (e.g., a deploy must roll through both regions; coordination overhead).
  • DynamoDB Global Tables eventual consistency caused two minor incidents (cross-region reads of fresh writes during a region-pinned session) before we tightened the routing.

Revisit triggers

  • Cost pressure (CFO challenge).
  • AWS regional reliability dramatically improves and stays improved for 12+ months.
  • A third region becomes desirable (e.g., EU expansion).

ADR-006: Memory Summarization Triggers at 20 Turns, Not Earlier or Later

Context

After the memory summarizer bug (ASINs being stripped; see the memory question in the interview docs), we re-examined the threshold. Why 20 turns?

Forces

Lens Force
💰 Cost "Earlier summarization = smaller average prompt = lower LLM cost. We modeled 10 vs. 15 vs. 20 vs. 25."
🤖 ML "Quality degrades when summary is invoked. Summaries lose nuance. Want as late as possible."
🔧 Backend "Summarization is itself an LLM call. More frequent = more calls = more code paths to handle errors."
📊 PM "Most sessions are < 10 turns. Summarization at 20 means it rarely fires. Acceptable."
⚙️ SRE "Summarization in the request path is a risk. We made it async fire-and-forget — but if it doesn't complete by turn 25, it's synchronous fallback."
🎨 Frontend "Doesn't affect me."
🛡️ Security "PII redacted before summarization, so summarization can't leak PII."
⚖️ Legal "Summary retention follows session retention. Same TTL."

Decision

Summarize at turn 20 (not 10, not 25). Reasoning:

  • At turn 10: ML evaluation showed quality degraded measurably — users who naturally referenced "the third one you mentioned" were getting wrong references because summary lost specificity. Cost saving wasn't worth quality loss.
  • At turn 20: Quality drop in evaluation was within noise. Cost saved per long-session was meaningful (~30% prompt size reduction at turn 30).
  • At turn 25: Prompt size was already large; cost amortization on summarization was lower; some sessions hit 30+ turns and prompt got expensive.

Specific implementation:

  • Summarizer is a separate Bedrock call (Haiku — cheap).
  • Async after turn 20: turn 21 doesn't wait, summary completes in background.
  • Synchronous fallback at turn 25 if async hasn't finished — guarantees the prompt doesn't grow unbounded.
  • Entities (ASINs, series names, user-stated preferences) preserved as structured metadata in the summary item, NOT just in summary prose. This is the bug-fix from the original incident.
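A minimal sketch of the trigger logic under these rules (function names, the session dict shape, and the entity fields are assumptions):

```python
import asyncio

SUMMARIZE_AT_TURN = 20
SYNC_FALLBACK_TURN = 25

async def maybe_summarize(session: dict, summarizer) -> None:
    """Fire-and-forget at turn 20; block at turn 25 if the summary never landed."""
    turn = session["turn_count"]
    if turn == SUMMARIZE_AT_TURN and "summary" not in session:
        # Async path: turn 21 does not wait on this call. (A production version
        # would keep a reference to the task and handle its errors.)
        asyncio.create_task(run_summarizer(session, summarizer))
    elif turn >= SYNC_FALLBACK_TURN and "summary" not in session:
        # Synchronous fallback: guarantees the prompt stops growing unbounded.
        await run_summarizer(session, summarizer)

async def run_summarizer(session: dict, summarizer) -> None:
    summary_text = await summarizer.summarize(session["history"])   # assumed Haiku call
    session["summary"] = {
        "text": summary_text,
        # Entities are kept as structured metadata, not only in the prose:
        # this is the fix from the original ASIN-stripping incident.
        "entities": {
            "asins": session.get("mentioned_asins", []),
            "series": session.get("mentioned_series", []),
            "preferences": session.get("user_preferences", []),
        },
    }
```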

Alternatives rejected

Alternative Reason rejected
Sliding window (last N turns only) Loses early-conversation context users frequently reference.
No summarization (full history forever) Prompt cost grows linearly; some sessions hit 100+ turns.
ML-driven dynamic threshold Over-engineered; static threshold + entity preservation is sufficient.
Summarize at turn 10 Quality regression measured.

Consequences

Positive:

  • Average prompt size at turn 30: ~40% smaller than no-summarization.
  • Cost reduction at the long-session tail: ~$8K/month.
  • Entity preservation eliminates the "tell me about the third one" failure mode.

Negative:

  • Summarizer is a load-bearing component with its own failure modes (Haiku outage, summary content drift).
  • Quality on long sessions is slightly lower than infinite context — accepted because users rarely hit it.

Revisit triggers

  • LLM context windows expand 10x (Claude 4 et al.) — summarization may not be needed.
  • Cost economics of long-context LLM calls drop dramatically.
  • New entity types emerge (e.g., user accessibility preferences) and need preservation.

ADR-007: Listing Trust Score Added to Guardrails After Counterfeit Incident

Context

Documented in 05-grilling-sessions.md, Round 5: a fraudulent counterfeit listing passed our ASIN validation (the ASIN existed in the catalog, even though the seller was counterfeit). The chatbot recommended it. Legal called.

Forces

Lens Force
⚖️ Legal "We can't recommend known-bad listings. Add a trust signal to guardrails."
🛡️ Security "Trust & Safety has signals we should consume. New integration needed."
🤖 ML "Trust score is a numeric field — straightforward to filter on. Doesn't change the model."
🔧 Backend "New Pact contract with Catalog API for the trust score field. New guardrail stage or augmented ASIN stage."
📊 PM "Threshold tuning is the risk — too aggressive and we miss good listings; too loose and we recommend bad ones."
⚙️ SRE "New dependency in the critical path. Need timeout + fallback. If trust score isn't available, default to safe (filter out)."
💰 Cost "Catalog API call is already happening; trust score is just an additional field. No extra call."
🎨 Frontend "Trust signals can be surfaced in product cards as a badge — opportunity, not just risk."

Decision

Augment the existing ASIN validation guardrail with a trust score check.

  • Trust & Safety publishes a listing_trust_score (0.0-1.0) per ASIN.
  • Guardrail filters out products below threshold (initial: 0.7, tunable).
  • If trust score is missing (Catalog API doesn't return it), default to 0.5 (treat as suspicious).
  • Trust signals exposed in the ProductCard.trust_signals field for frontend rendering (badges for verified sellers).
  • New seller listings (< 30 days) are flagged regardless of score for human review on a sample.
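A minimal sketch of the filter described above (listing_trust_score and trust_signals follow the field names in this ADR; the product dict shape and seller_since field are assumptions):

```python
from datetime import datetime, timedelta, timezone

TRUST_THRESHOLD = 0.7        # initial value, tunable
MISSING_SCORE_DEFAULT = 0.5  # missing score is treated as suspicious
NEW_SELLER_WINDOW = timedelta(days=30)

def filter_recommendations(products: list[dict]) -> list[dict]:
    """Drop low-trust listings; flag new-seller listings for sampled human review."""
    kept = []
    now = datetime.now(timezone.utc)
    for product in products:
        score = product.get("listing_trust_score", MISSING_SCORE_DEFAULT)
        if score < TRUST_THRESHOLD:
            continue   # filtered out; logged upstream for Trust & Safety analysis
        seller_since = product.get("seller_since")
        if seller_since and now - seller_since < NEW_SELLER_WINDOW:
            product["flag_for_sample_review"] = True
        kept.append(product)
    return kept
```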

Alternatives rejected

Alternative Reason rejected
Block all listings from new sellers Too aggressive — penalizes legitimate new sellers.
ML model trained to detect counterfeit Overlaps with Trust & Safety's existing models; redundant.
Manual moderation queue for chatbot recommendations Doesn't scale.

Consequences

Positive:

  • Counterfeit recommendations dropped to near-zero in production within a week.
  • Audit trail: every filtered listing is logged for Trust & Safety analysis.
  • The integration enabled future signals: price-anomaly detection, seller-history filters.

Negative:

  • Some legitimate listings are filtered (false positives). Threshold tuning is ongoing.
  • Catalog API calls became more critical — a Catalog outage now degrades trust filtering, not just product details.

Revisit triggers

  • Trust score model changes (Trust & Safety re-trains) — re-tune threshold.
  • New counterfeit pattern emerges that current trust signals don't capture.
  • False-positive complaints from sellers exceed a threshold.

ADR-008: Hallucination Rate Promoted to Primary Canary Metric

Context

After 6 months of production data, the offline-online correlation analysis (in 04-offline-testing-quality-strategies.md) showed:

  • Hallucination rate (offline) ↔ escalation rate (online): r = +0.73 (strongest correlation of any metric).
  • BLEU-4 ↔ thumbs-up rate: r = +0.15 (weak).

We had been treating BLEU as a primary quality gate and hallucination rate as a secondary check. The data said we had it backwards.

Forces

Lens Force
🤖 ML "BLEU is a familiar metric but it doesn't predict our outcomes. Hallucination rate does. Switch the primary."
📊 PM "Escalation rate IS our north-star quality metric. Whatever predicts it should be the gate."
⚙️ SRE "Canary metrics drive rollback decisions. Picking the right one is operationally critical."
🔧 Backend "Hallucination rate evaluation is heavier (catalog lookups + LLM judging) — slower CI."
💰 Cost "Hallucination eval is more expensive per run. ~$5 vs. ~$1 for BLEU. Trivial at deploy frequency."
🛡️ Security "Hallucinations include security-relevant ones (fabricated prices, fake URLs). Aligns with my priorities."
🎨 Frontend "Doesn't affect me directly."
⚖️ Legal "Reduced hallucination = reduced legal risk. Strong endorsement."

Decision

Hallucination rate becomes the primary deploy gate. BLEU is removed entirely. BERTScore is a secondary gate.

Specific gates:

  • Pre-deploy: hallucination rate < 2% (strict).
  • Model promotion: hallucination rate < 1%, no regression > 5% vs. current production.
  • Production canary: hallucination rate sampled hourly, alarm at 1.5%.
  • BLEU: removed from the gate; tracked for trend purposes only.
  • BERTScore: secondary gate at ≥ 0.80.
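An illustrative encoding of the promotion gate from these thresholds (the metric inputs and function name are hypothetical):

```python
PROMOTION_MAX = 0.01             # hallucination rate < 1% for model promotion
PROMOTION_MAX_REGRESSION = 0.05  # no regression > 5% vs. current production
BERTSCORE_MIN = 0.80             # secondary gate

def promotion_gate(candidate_halluc: float, prod_halluc: float, bertscore: float) -> bool:
    """Return True if a candidate model clears the promotion gates."""
    if candidate_halluc >= PROMOTION_MAX:
        return False
    if prod_halluc > 0 and (candidate_halluc - prod_halluc) / prod_halluc > PROMOTION_MAX_REGRESSION:
        return False
    return bertscore >= BERTSCORE_MIN
```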

Alternatives rejected

Alternative Reason rejected
Keep BLEU as primary Data shows it doesn't predict outcomes.
Replace BLEU with BERTScore BERTScore correlates better but still weaker than hallucination rate. Made it secondary, not primary.
Use multiple metrics with weighted score Composite metrics obscure which dimension is regressing. Single primary + secondary is clearer for operations.

Consequences

Positive:

  • Two model promotions were blocked by the hallucination gate that BLEU would have passed. Both turned out to have hidden quality regressions on launch — gate worked.
  • Team mental model shifted: "did hallucination rate move?" is now the first question on any prompt change.

Negative:

  • Slower CI — hallucination eval takes longer than BLEU.
  • Initial transition was painful — engineers used to tuning for BLEU had to learn what hallucination patterns looked like.

Revisit triggers

  • Correlation data shows another metric becomes more predictive (re-run the analysis annually).
  • Hallucination rate plateaus — if we hit a floor, the marginal value of further investment decreases.

Cross-ADR Themes

Reading across these 8 ADRs, three themes emerge:

Theme 1: "Cheap fixes accumulate as debt; document the trade-off, don't pretend it's a real fix."

ADR-002 (DAX) and ADR-006 (summarizer) both took the fast fix and explicitly logged the deeper architectural change as future work. The discipline is in the documentation — making the band-aid visible so it's not forgotten.

Theme 2: "Multi-stakeholder review changes the answer."

ADR-001 (price hallucination), ADR-005 (multi-region), and ADR-007 (trust score) all started with a narrow technical view and changed once Legal, SRE, or Trust & Safety added their lens. The lesson: invite the full room early.

Theme 3: "Data should override conventional metrics."

ADR-008 (hallucination as primary canary) is the clearest example — we had defaulted to BLEU because it's industry-standard for text generation. The correlation data forced us to re-base the entire eval framework. Most teams don't do this analysis; the ones that do are the ones who learn what their system actually depends on.


How to Add a New ADR

When a new operational decision arises:

  1. Write the Context first. What is happening that requires this decision? Be specific (cite the incident, the metric, the stakeholder ask).
  2. Walk all 8 lenses. Even if a lens is "neutral," write it down — the absence is informative.
  3. Document at least 3 alternatives rejected. If you can't think of 3, you haven't explored the space.
  4. Capture revisit triggers. Decisions are not forever. Name the conditions under which you'd re-open the ADR.
  5. Tag the ADR for cross-reference. Link from related sections in 00-hld-lld-architecture.md, 03-scale-testing-scenarios.md, and other documents that depend on this decision.