US-05: DynamoDB Cost Optimization
User Story
As a backend engineer, I want to optimize DynamoDB capacity mode, TTL policies, and read/write patterns for conversation memory, So that storage and throughput costs decrease by 40-60% without impacting session performance.
Acceptance Criteria
- On-demand capacity is used instead of provisioned capacity (eliminates over-provisioning).
- TTL aggressively expires stale sessions — 24 hours for turns, 72 hours for summaries.
- Batch writes reduce WCU consumption by grouping multiple turn writes.
- GSI usage is minimized; unnecessary GSI reads are eliminated.
- Write amplification is reduced by coalescing session metadata updates.
- Total DynamoDB costs decrease by 40-60%.
High-Level Design
Cost Problem
Conversation memory (LLD-4) stores:
- META items: 1 per session
- TURN items: up to 20+ per session
- SUMMARY items: 1 per 10-turn window
At 1M sessions/day with an average of 8 turns each:
- Writes: 1M META + 8M TURN + ~400K SUMMARY = ~9.4M writes/day
- Reads: ~16M reads/day (2 reads per message: META + recent turns)
- Storage: accumulated sessions grow storage if TTL is not aggressive
DynamoDB on-demand pricing (us-east-1):
- Write: $1.25 per million WRUs
- Read: $0.25 per million RRUs
- Storage: $0.25/GB/month
Baseline: ~$15/day writes + ~$4/day reads = ~$570/month
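The per-day arithmetic can be sanity-checked against the unit prices. A small sketch (note the story's ~$15/day write figure sits above the raw base-table number of ~$11.75, consistent with GSI projection writes adding overhead):

```python
WRU_PRICE = 1.25 / 1_000_000  # $ per write request unit (on-demand)
RRU_PRICE = 0.25 / 1_000_000  # $ per read request unit (strongly consistent)

def daily_write_cost(writes_per_day: int) -> float:
    """Raw base-table write cost, excluding GSI projection writes."""
    return writes_per_day * WRU_PRICE

def daily_read_cost(reads_per_day: int) -> float:
    """Read cost assuming <= 4 KB strongly consistent reads."""
    return reads_per_day * RRU_PRICE

# ~9.4M writes/day and ~16M reads/day from the workload above
```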
Optimization Architecture
graph TD
A[New Message] --> B{Write Strategy}
B --> C[Coalesce META Update<br>with TURN Write<br>Single TransactWrite]
B --> D[Batch Writes<br>for Summaries]
E[Session Lifecycle] --> F{TTL Policy}
F --> G[TURN items: 24h TTL]
F --> H[SUMMARY items: 72h TTL]
F --> I[META items: 24h TTL]
J[Read Strategy] --> K[Eventually Consistent Reads<br>50% cheaper]
J --> L[ProjectionExpression<br>Read only needed fields]
J --> M[Query with Limit<br>Fetch last N turns only]
style C fill:#2d8,stroke:#333
style K fill:#2d8,stroke:#333
Savings Breakdown
| Technique | Reduction | Monthly Savings |
|---|---|---|
| Eventually consistent reads | 50% read cost | ~$60 |
| Coalesced writes (META + TURN) | 30% fewer write operations | ~$52 |
| Aggressive TTL (smaller table) | 60% storage reduction | ~$30 |
| ProjectionExpression (smaller reads) | 20% fewer RCU | ~$24 |
| Batch summary writes | 40% fewer write ops for summaries | ~$15 |
| Total | all techniques combined | ~$181/month |
Low-Level Design
1. Optimized Read Pattern
sequenceDiagram
participant Orchestrator
participant DynamoDB
Note over Orchestrator: Use eventually consistent reads<br>(0.5 RCU per 4KB vs 1 RCU)
Orchestrator->>DynamoDB: Query(pk=SESSION#abc,<br>sk = 'META',<br>ConsistentRead=false,<br>ProjectionExpression='session_id,turn_count,last_intent,page_context')
DynamoDB-->>Orchestrator: META item (partial)
Orchestrator->>DynamoDB: Query(pk=SESSION#abc,<br>sk BEGINS_WITH 'TURN#',<br>ScanIndexForward=false,<br>Limit=6,<br>ConsistentRead=false,<br>ProjectionExpression='role,content,intent')
DynamoDB-->>Orchestrator: Last 6 turns (3 user + 3 assistant)
Code Example: Optimized DynamoDB Client
import time
from decimal import Decimal
from typing import Optional

import boto3
from boto3.dynamodb.conditions import Key


class ConversationMemoryClient:
    """Optimized DynamoDB client for conversation memory."""

    TABLE_NAME = "manga_chatbot_memory"
    SESSION_TTL_SECONDS = 86400  # 24 hours
    SUMMARY_TTL_SECONDS = 259200  # 72 hours

    def __init__(self):
        self._table = boto3.resource("dynamodb").Table(self.TABLE_NAME)

    def load_context(
        self, session_id: str, max_recent_turns: int = 6
    ) -> dict:
        """Load session metadata + recent turns in 2 queries (eventually consistent)."""
        pk = f"SESSION#{session_id}"

        # Query META — eventually consistent, projection
        meta_resp = self._table.query(
            KeyConditionExpression=Key("pk").eq(pk) & Key("sk").eq("META"),
            ConsistentRead=False,
            ProjectionExpression="session_id, customer_id, turn_count, "
            "last_intent, page_context, updated_at",
        )
        meta = meta_resp["Items"][0] if meta_resp["Items"] else None

        # Query recent turns — reverse order, limited, eventually consistent
        turns_resp = self._table.query(
            KeyConditionExpression=Key("pk").eq(pk)
            & Key("sk").begins_with("TURN#"),
            ScanIndexForward=False,
            Limit=max_recent_turns,
            ConsistentRead=False,
            ProjectionExpression="#r, content, intent",
            ExpressionAttributeNames={"#r": "role"},  # "role" is reserved
        )
        turns = list(reversed(turns_resp["Items"]))

        # Check if we need summaries (long conversation)
        summaries = []
        if meta and meta.get("turn_count", 0) > max_recent_turns:
            sum_resp = self._table.query(
                KeyConditionExpression=Key("pk").eq(pk)
                & Key("sk").begins_with("SUMMARY#"),
                ScanIndexForward=False,
                Limit=2,
                ConsistentRead=False,
                ProjectionExpression="content",
            )
            summaries = sum_resp["Items"]

        return {
            "meta": meta,
            "turns": turns,
            "summaries": summaries,
        }

    def save_turn_with_meta_update(
        self,
        session_id: str,
        role: str,
        content: str,
        intent: str,
        response_id: Optional[str] = None,
    ) -> None:
        """Coalesce TURN write + META update into a single TransactWrite."""
        pk = f"SESSION#{session_id}"
        now = int(time.time())
        # NOTE: second resolution — rapid turns in the same second collide;
        # consider ms precision plus a role suffix in production
        turn_sk = f"TURN#{now}"

        self._table.meta.client.transact_write_items(
            TransactItems=[
                # Write new TURN item
                {
                    "Put": {
                        "TableName": self.TABLE_NAME,
                        "Item": {
                            "pk": {"S": pk},
                            "sk": {"S": turn_sk},
                            "role": {"S": role},
                            "content": {"S": content},
                            "intent": {"S": intent},
                            "response_id": {"S": response_id or ""},
                            "created_at": {"N": str(now)},
                            "ttl": {"N": str(now + self.SESSION_TTL_SECONDS)},
                        },
                    }
                },
                # Update META item atomically
                {
                    "Update": {
                        "TableName": self.TABLE_NAME,
                        "Key": {
                            "pk": {"S": pk},
                            "sk": {"S": "META"},
                        },
                        "UpdateExpression": "SET updated_at = :now, "
                        "last_intent = :intent, "
                        "turn_count = turn_count + :one, "
                        "#ttl = :ttl",
                        # "ttl" is a DynamoDB reserved word — alias it
                        "ExpressionAttributeNames": {"#ttl": "ttl"},
                        "ExpressionAttributeValues": {
                            ":now": {"N": str(now)},
                            ":intent": {"S": intent},
                            ":one": {"N": "1"},
                            ":ttl": {"N": str(now + self.SESSION_TTL_SECONDS)},
                        },
                    }
                },
            ]
        )

    def batch_write_summaries(
        self, session_id: str, summaries: list[dict]
    ) -> None:
        """Batch write summary items to reduce WCU consumption."""
        pk = f"SESSION#{session_id}"
        now = int(time.time())

        with self._table.batch_writer() as batch:
            for summary in summaries:
                batch.put_item(
                    Item={
                        "pk": pk,
                        "sk": f"SUMMARY#{summary['window_id']}",
                        "content": summary["content"],
                        "covered_turns": summary["covered_turns"],
                        "created_at": Decimal(str(now)),
                        "ttl": Decimal(str(now + self.SUMMARY_TTL_SECONDS)),
                    }
                )

    def create_session(
        self,
        session_id: str,
        customer_id: Optional[str],
        page_context: dict,
    ) -> None:
        """Create initial META item for a new session."""
        now = int(time.time())
        item = {
            "pk": f"SESSION#{session_id}",
            "sk": "META",
            "session_id": session_id,
            "page_context": page_context,
            "created_at": Decimal(str(now)),
            "updated_at": Decimal(str(now)),
            "ttl": Decimal(str(now + self.SESSION_TTL_SECONDS)),
            "turn_count": Decimal("0"),
            "last_intent": "none",
        }
        # Omit customer_id entirely for anonymous sessions: it is the sparse
        # GSI's partition key, and a NULL value would be rejected by DynamoDB.
        if customer_id is not None:
            item["customer_id"] = customer_id
        self._table.put_item(Item=item)
2. TTL Strategy
graph TD
subgraph "Item Lifecycle"
A[Session Created] --> B[TURN items written<br>TTL: created_at + 24h]
B --> C[After 20 turns:<br>Summarize window]
C --> D[SUMMARY items<br>TTL: created_at + 72h]
B --> E[Session Idle > 5 min]
E --> F[WebSocket closed]
F --> G[TTL expiry<br>DynamoDB auto-deletes]
end
subgraph "Cost Impact"
H[Without TTL:<br>Storage grows unbounded<br>$0.25/GB/month] --> I[With aggressive TTL:<br>Table stays < 10 GB<br>~$2.50/month storage]
end
style I fill:#2d8,stroke:#333
3. GSI Optimization
The current schema has a GSI on customer_id / updated_at. Minimize GSI writes:
graph LR
A[Current: GSI on every item] --> B[Optimized: GSI only on META items]
B --> C[TURN and SUMMARY items<br>excluded from GSI projection]
C --> D[GSI WCU reduced by 80%]
style D fill:#2d8,stroke:#333
Code Example: Optimized Table with Sparse GSI
import boto3


def create_optimized_table() -> None:
    """Create DynamoDB table with sparse GSI (only META items projected)."""
    dynamodb = boto3.client("dynamodb")
    dynamodb.create_table(
        TableName="manga_chatbot_memory",
        KeySchema=[
            {"AttributeName": "pk", "KeyType": "HASH"},
            {"AttributeName": "sk", "KeyType": "RANGE"},
        ],
        AttributeDefinitions=[
            {"AttributeName": "pk", "AttributeType": "S"},
            {"AttributeName": "sk", "AttributeType": "S"},
            {"AttributeName": "customer_id", "AttributeType": "S"},
            {"AttributeName": "updated_at", "AttributeType": "N"},
        ],
        GlobalSecondaryIndexes=[
            {
                "IndexName": "gsi1-customer-sessions",
                "KeySchema": [
                    {"AttributeName": "customer_id", "KeyType": "HASH"},
                    {"AttributeName": "updated_at", "KeyType": "RANGE"},
                ],
                "Projection": {
                    "ProjectionType": "INCLUDE",
                    "NonKeyAttributes": [
                        "session_id", "last_intent", "turn_count"
                    ],
                },
                # Only META items have customer_id — GSI is naturally sparse.
                # TURN and SUMMARY items lack customer_id, so they're excluded.
            }
        ],
        BillingMode="PAY_PER_REQUEST",  # On-demand — no over-provisioning
    )
    # TTL cannot be set in create_table; enable it once the table is ACTIVE.
    dynamodb.get_waiter("table_exists").wait(TableName="manga_chatbot_memory")
    dynamodb.update_time_to_live(
        TableName="manga_chatbot_memory",
        TimeToLiveSpecification={
            "Enabled": True,
            "AttributeName": "ttl",
        },
    )
4. Write Coalescing for High-Throughput Scenarios
During traffic spikes, buffer multiple writes and flush in batches:
sequenceDiagram
participant Orchestrator1 as Orchestrator Task 1
participant Buffer as Write Buffer<br>(In-Process Queue)
participant Flusher as Flush Timer<br>(Every 100ms)
participant DynamoDB
Orchestrator1->>Buffer: Enqueue TURN write
Note over Buffer: Buffer accumulates writes
Orchestrator1->>Buffer: Enqueue another TURN write
Flusher->>Buffer: Flush (100ms elapsed)
Buffer->>DynamoDB: BatchWriteItem (2 items)
DynamoDB-->>Buffer: Success
Code Example: Write Buffer
import threading
import time
from collections import deque

import boto3


class DynamoDBWriteBuffer:
    """Buffers DynamoDB writes and flushes in batches to reduce WCU."""

    MAX_BATCH_SIZE = 25  # DynamoDB BatchWriteItem limit
    FLUSH_INTERVAL_MS = 100  # Flush every 100ms

    def __init__(self, table_name: str):
        self._table_name = table_name
        self._buffer: deque[dict] = deque()
        self._lock = threading.Lock()
        self._table = boto3.resource("dynamodb").Table(table_name)
        self._running = True
        self._flush_thread = threading.Thread(target=self._flush_loop, daemon=True)
        self._flush_thread.start()

    def enqueue(self, item: dict) -> None:
        with self._lock:
            self._buffer.append(item)
            if len(self._buffer) >= self.MAX_BATCH_SIZE:
                self._flush()  # size-triggered flush; lock already held

    def _flush_loop(self) -> None:
        while self._running:
            time.sleep(self.FLUSH_INTERVAL_MS / 1000)
            with self._lock:
                if self._buffer:
                    self._flush()

    def _flush(self) -> None:
        """Drain up to MAX_BATCH_SIZE items; caller must hold the lock."""
        batch = []
        while self._buffer and len(batch) < self.MAX_BATCH_SIZE:
            batch.append(self._buffer.popleft())
        if batch:
            # batch_writer retries any unprocessed items automatically
            with self._table.batch_writer() as writer:
                for item in batch:
                    writer.put_item(Item=item)

    def shutdown(self) -> None:
        self._running = False
        with self._lock:
            while self._buffer:  # drain fully, not just one batch
                self._flush()
Capacity Mode Decision
graph TD
A{Traffic Pattern?} --> B{Predictable<br>steady load?}
B -->|Yes| C[Provisioned +<br>Auto-scaling<br>Cheapest for steady workloads]
B -->|No| D{Spiky or<br>unpredictable?}
D -->|Yes| E[On-Demand<br>Pay per request<br>No over-provisioning]
D -->|New service,<br>unknown pattern| E
F[MangaAssist Pattern] --> G[Spiky: peak 9am-11pm,<br>low 11pm-9am]
G --> E
style E fill:#2d8,stroke:#333
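The decision above can be grounded in arithmetic. One fully utilized WCU sustains 3,600 writes per hour; comparing that against the provisioned WCU-hour price (the us-east-1 figure of $0.00065/WCU-hour is assumed here) yields the utilization below which on-demand is cheaper. This sketch gives roughly 14%, in the same ballpark as the ~18% crossover cited elsewhere in this story; the exact figure depends on region pricing and auto-scaling headroom:

```python
ON_DEMAND_WRU = 1.25 / 1_000_000  # $ per write request (on-demand)
PROVISIONED_WCU_HOUR = 0.00065    # $ per WCU-hour (us-east-1, assumed)

def on_demand_breakeven_utilization() -> float:
    """Sustained utilization below which on-demand beats provisioned."""
    # Cost of one WCU-hour's worth of writes (3600 writes) at on-demand rates
    on_demand_equivalent = 3600 * ON_DEMAND_WRU  # $0.0045
    return PROVISIONED_WCU_HOUR / on_demand_equivalent
```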
Monitoring and Metrics
| Metric | Target | Alert |
|---|---|---|
| Read latency (P99) | < 10ms | > 25ms |
| Write latency (P99) | < 20ms | > 50ms |
| Throttled requests | 0 | Any throttle event |
| Table size | < 10 GB | > 15 GB |
| TTL deletion rate | ~9M items/day | < 5M (TTL not working) |
| Monthly DynamoDB cost | ≤ $250 | > $400 |
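The "any throttle event" alert can be enforced with a CloudWatch alarm on the AWS/DynamoDB ThrottledRequests metric. A sketch that builds the put_metric_alarm parameters (alarm name, period, and missing-data policy are illustrative choices):

```python
def throttle_alarm_params(table_name: str) -> dict:
    """Kwargs for boto3 cloudwatch put_metric_alarm: fire on any throttle."""
    return {
        "AlarmName": f"{table_name}-throttled-requests",
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ThrottledRequests",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no datapoints means no throttles
    }

# boto3.client("cloudwatch").put_metric_alarm(**throttle_alarm_params("manga_chatbot_memory"))
```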
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Eventually consistent reads return stale turn | User sees outdated conversation | Acceptable for chat — turn was just written by same session; worst case 1s stale |
| TransactWrite failures | Turn saved but META not updated (or vice versa) | Transaction ensures atomicity; retry with backoff on TransactionConflict |
| Write buffer data loss on task crash | Turns lost for in-flight writes | Buffer is < 100ms of writes; acceptable loss; DLQ catches missed writes |
| Aggressive TTL deletes ongoing session | User loses context mid-conversation | TTL is 24h; sessions rarely last > 1h; META update refreshes TTL on every message |
Deep Dive: Why This Works on a Manga Chatbot Workload
DynamoDB cost optimization is fundamentally a schema and access-pattern problem, not a knob-tuning problem. The 40–60% savings target in this story does not come from running DDB more cheaply; it comes from removing accidental cost entirely — writes that didn't need to happen, GSI projections that nothing reads, item TTLs longer than the data is useful. A chatbot's session-state schema is one of the worst offenders in production because it is conventionally treated as a long-lived OLTP record when it is actually short-lived ephemeral state.
Property 1: Chat session traffic is fundamentally spiky and bursty. Each user generates a tight burst of writes during an active conversation (8–15 turns over 5–15 minutes), then nothing for hours or days. Provisioned capacity must be sized for the peak burst (or auto-scale up, paying for the lag); on-demand mode pays only for the actual write request units consumed. AWS's published guidance on the on-demand-vs-provisioned crossover is that on-demand wins below ~18% sustained provisioned utilization. This chatbot's traffic sits well below that threshold (concentrated daily peaks separated by long quiescence). The architectural assumption is that sustained utilization stays below the crossover; the failure signal is monthly utilization climbing above ~50% on the equivalent-provisioned curve, at which point provisioned + auto-scaling becomes cheaper.
Property 2: TURN items are write-heavy, read-rare. Every conversation turn is written to DDB, but only the last 6 turns of the current session are ever read. The 24-hour TTL is not an "eventual cleanup" — it is the primary cost lever, because it caps the hot dataset at roughly one day of writes: ~9.4M items/day × ~1 KB/item × ~0.7 TURN share ≈ 6 GB, with older data landing in the S3 archive (queried via Redshift Spectrum) before TTL deletes it. The TTL also bounds storage growth at zero net cost (TTL deletes are free in DDB). Without TTL, storage grows linearly forever and so does the GSI projection cost. The story's 24h-for-TURN / 72h-for-SUMMARY split exploits the fact that summary items are read for cross-session memory ("did we talk about this manga before?") and need a longer window than raw turns.
Property 3: Sparse GSI is the largest single saving and the most overlooked. A standard DDB GSI projects every base-table write into the index, doubling write cost. The sparse-GSI pattern (see the create_optimized_table example above) projects only items that carry the GSI's key attribute — here customer_id, present only on META items. Since META items are 1 per session and TURN/SUMMARY items are many per session, projecting only META reduces GSI write volume by ~80% — a direct WCU saving. The architectural assumption is that the GSI's query patterns (find recent sessions for user X, find sessions in state Y) only need META items, never TURN or SUMMARY. Adding a query that needs to scan TURN items would invalidate this design — flag it during design review.
Property 4: TransactWriteItems is 2× WCU but eliminates entire failure modes. Saving a turn (TURN item) and updating session state (META item) atomically requires either (a) 2 separate writes with risk of partial failure, or (b) TransactWriteItems at 2× WCU per write. The story chooses (b). The cost is a 2× write multiplier on the META+TURN pair; the benefit is that "session state advanced but turn history is missing" becomes architecturally impossible. This is the right trade for a chatbot where partial-write inconsistency surfaces as user-visible weirdness ("you said this earlier" referring to a turn that doesn't exist).
Bottom line: the wins are unevenly distributed — sparse GSI alone is roughly 30–40% of the saving, TTL is another 20–25%, on-demand mode is another 10–15%, and TransactWriteItems is a cost adder rather than a saver (it doubles WCU on the transaction pairs) but is non-negotiable for correctness. The story should never trade away TransactWrite for cost; the other three are the cost levers.
Real-World Validation
Industry Benchmarks & Case Studies
- AWS DynamoDB official documentation (capacity mode comparison) — On-demand wins below ~18% sustained provisioned utilization for spiky traffic patterns. Provisioned + auto-scaling wins above ~30% sustained utilization with predictable patterns. Chatbot session traffic falls firmly in the on-demand region.
- AWS re:Invent DAT303 (DynamoDB best practices) — Documents the sparse-GSI pattern as "best practice for state-machine-shaped data" with worked cost examples showing 70–85% GSI write cost reduction.
- AWS DynamoDB pricing (current) — On-demand WCU: $1.25 per million write request units (1 KB). On-demand RCU: $0.25 per million read request units (4 KB strongly consistent / 8 KB eventually consistent). TransactWriteItems: 2× WCU per item in the transaction. ✅ matches story implicit pricing.
- AWS DynamoDB official documentation (TTL) — TTL deletes are free; only the storage cost differential matters. Items typically deleted within 48 hours of TTL timestamp (eventually-consistent, not exact). The story's 24h TTL with reset-on-write is the correct pattern for active sessions.
- Discord engineering blog: "How Discord stores billions of messages" — Documents archival patterns (hot/warm/cold tiers) for chat-style data; informs the Redshift Spectrum offload pattern in this story.
- Internal cross-reference: POC-to-Production-War-Story/02-seven-production-catastrophes.md — The "context window overflow" catastrophe was caused in part by unbounded session-state growth; TTL is the structural fix.
- Internal cross-reference: Database-Tradeoffs/ — Covers DDB-vs-RDS-vs-OpenSearch trade-offs; this story is the DDB operating point for ephemeral state.
Math Validation
- 9.4M writes/day × 30 = 282M writes/month × $1.25 / 1M = $352.50/month for base writes (on-demand, 1 KB items).
- Without sparse GSI: same 282M writes × 2 (base + GSI) = 564M WRU × $1.25 = $705/month.
- With sparse GSI (only 1M META writes/day project to GSI): 282M base + 30M GSI = 312M WRU = $390/month — saves $315/month, ~45% of write cost.
- Eventually-consistent reads: $0.25/million; reading 6 turns × ~1 KB at 1M sessions/day × 30 days = 180M RCU × $0.25/M = $45/month.
- Storage: ~6 GB hot × $0.25/GB/month = $1.50/month + Spectrum cold ($0.023/GB/month) — negligible at this scale.
- Sum: ~$390 (writes) + $45 (reads) + ~$2 (storage) = ~$437/month optimized vs ~$750+ without sparse GSI. Story baseline of $570/month sits between unoptimized-and-optimized; story claim of "40–60% savings" maps to going from baseline to ~$437 which is ~24% — flag: the $570 baseline appears to already include some optimization. Recommend the story explicitly state pre-optimization (~$750-900) and post-optimization (~$400-450) numbers.
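The sparse-GSI line items above reduce to a few multiplications; a quick check using the figures from these bullets:

```python
WRU_PRICE_PER_M = 1.25  # $ per million write request units (on-demand)
BASE_WRITES_M = 282     # base-table writes per month, in millions
META_WRITES_M = 30      # META-only writes projected to the sparse GSI

full_projection = BASE_WRITES_M * 2 * WRU_PRICE_PER_M            # every write projected
sparse_projection = (BASE_WRITES_M + META_WRITES_M) * WRU_PRICE_PER_M
savings = full_projection - sparse_projection
```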
Conservative vs Aggressive Savings Bounds
| Bound | Source | Total monthly savings |
|---|---|---|
| Conservative | TTL only, no sparse GSI, on-demand baseline | ~20% (~$120/month) |
| Aggressive | Full sparse GSI + 12h TTL + write coalescing + Spectrum cold tier | ~55% (~$330/month) |
| Story's projection | 40–60% savings target | ~$230–340/month (realistic; sparse GSI is the dominant lever) |
Cross-Story Interactions & Conflicts
This story is the authoritative side for the session-state lifecycle contract.
- US-07 (Analytics Pipeline) — Authoritative side: this story owns the lifecycle. Conflict mode: TURN items have 24h DDB TTL; if Redshift archive ingest runs less frequently than every 24h, items can be TTL-deleted before being archived. Resolution: the Spectrum offload runs as a DDB Stream → Lambda → Firehose → S3 pipeline that captures every TURN item before TTL deletes it. US-07's hot-tier retention (30 days in Redshift) reads the S3 archive via Spectrum; the contract is "every TURN item appears in S3 within 60 minutes of write."
- US-03 (Caching) — See US-03. Read coalescing in this story (eventually-consistent reads, ProjectionExpression, Limit 6) makes per-session reads cheap, but the cache from US-03 is the primary defense against re-reading the same session state. Cache miss → DDB read; ElastiCache failover → all reads miss → RCU spike. Resolution: keep on-demand mode (auto-absorbs spikes) rather than provisioned; the cost spike during failover is bounded by the cache failover window (~30–90 seconds × peak RPS).
- US-08 (Traffic-Based) — Indirect interaction. Conflict mode: if US-08's cost circuit breaker tries to trim DDB cost during budget pressure, the only safe lever is to lengthen the write-buffer flush interval (more batching, slightly more loss tolerance) — the data itself cannot be cheapened without losing session continuity. Resolution: US-08's degradation policy explicitly does not touch DDB writes; conversation memory is treated as protected.
- US-04 (Compute) — Indirect interaction. The DDB write buffer (100 ms flush interval) lives in-process on Fargate workers. Spot interruption can drop up to 100 ms of buffered writes; combined with Spot interruption rate (~5%), measure the loss-tolerance budget end-to-end.
Rollback & Experimentation
Shadow-Mode Plan
- Sparse GSI: deploy in shadow by creating the optimized GSI alongside the existing full-projection GSI; compare query results from both for 1 week. Promote read traffic to the sparse GSI only after 100% query equivalence.
- TTL changes: deploy with longer TTL (48h instead of 24h) for the first 2 weeks; observe deletion rate and any "session resurrected after TTL" complaints.
- Eventually-consistent reads: shadow with strongly-consistent reads in parallel; measure how often the EC read returns stale data on the same session within 1 second.
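The third shadow check can be implemented as a sampled probe: issue the same META query twice (ConsistentRead=True, then False, as in load_context) and compare the updated_at attribute. The comparison itself is pure; the helper name is illustrative, and the result feeds the ec_read_staleness_ms metric named later in this story:

```python
from typing import Optional

def ec_staleness_ms(strong_meta: Optional[dict], eventual_meta: Optional[dict]) -> float:
    """How far the eventually consistent read lags the strong read, in ms,
    based on the updated_at epoch-seconds attribute on the META item."""
    if not strong_meta or not eventual_meta:
        return 0.0  # item missing on either side: not comparable
    lag = float(strong_meta["updated_at"]) - float(eventual_meta["updated_at"])
    return max(0.0, lag * 1000.0)
```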
Canary Thresholds
- Sparse GSI: 10% of traffic queries the new GSI for 1 week; abort if any query returns fewer results than the full-projection GSI.
- TTL ramp: start at 48h, drop to 36h, drop to 24h over 4 weeks; abort if "lost session context" reports rise > 5%.
- Abort criteria (any one trips): write throttling > 0, query result divergence between old and new GSI > 0%, session-context-loss user reports > expected baseline + 10%.
Kill Switch
- Two flags: ddb_sparse_gsi_enabled (reverts queries to the full-projection GSI; requires keeping both indexes during the transition) and ddb_aggressive_ttl_enabled (reverts TURN TTL to 7 days). A third flag, ddb_capacity_mode, toggles between on-demand and provisioned capacity with a 2× capacity buffer.
Quality Regression Criteria (story-specific)
- Throttled requests: 0 (on-demand should never throttle below table-level limits).
- TransactWriteItems failure rate: ≤ 0.1% (above this, investigate hotspots in META items).
- Lost-session-context rate (user reports referencing missing turns): ≤ existing baseline.
- DDB Stream lag (DDB write → S3 archive landed): ≤ 60 minutes.
Multi-Reviewer Validation Findings & Resolutions
The cross-reviewer pass identified the following story-specific findings. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.
S1 (must-fix before production)
PII in TURN message_text without redaction. TURN items store the raw user message; users routinely include phone numbers, addresses, email, partial card digits, order numbers. Stored unredacted in DDB + replicated via DDB Streams to S3 archive (US-07) → archived for years in Parquet → readable without re-encryption boundary. Resolution: PII-redaction pass before TURN write — strip phone (regex), email, full street addresses, card-pattern digits. Replace with [PHONE], [EMAIL], etc. tokens. Audit-sample 0.1% post-write to validate redaction accuracy; alert if any unredacted PII is detected.
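A minimal sketch of the redaction pass (patterns are deliberately simplified placeholders; production patterns need locale-aware tuning plus the 0.1% audit sampling described above — the card pattern runs before the phone pattern so long digit runs are not mislabeled):

```python
import re

# Order matters: card numbers (13-16 digits) must match before phones.
_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
    (re.compile(r"\b\+?\d[\d -]{8,13}\d\b"), "[PHONE]"),
]

def redact_pii(text: str) -> str:
    """Replace PII spans with tokens before the TURN item is written."""
    for pattern, token in _PATTERNS:
        text = pattern.sub(token, text)
    return text
```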
TTL-based deletion does not satisfy GDPR/CCPA right-to-deletion SLA. DDB TTL deletes are eventually consistent and lag up to 48 hours per AWS docs. Combined with the 30-day Redshift hot retention and S3 archive, a single user's data can persist 30+ days after their deletion request. Resolution: explicit delete_user_data(customer_id) flow that (a) hard-deletes all DDB items for that customer (Query GSI on customer_id, batch DeleteItem); (b) flags the customer_id in a deletion-audit DDB table; (c) the S3 archive is rebuilt monthly excluding deletion-audited customers (Glue job filtering on customer_id allowlist); (d) Redshift hot table is filtered on every query by an exclusion list materialized from the deletion-audit table. Document the 30-day deletion-completion SLA in the Data Processing Addendum.
No idempotency key on TransactWrite. TransactWriteItems can fail at the network layer; client retries send the same write again. Without idempotency keys, the META update happens twice but TURN write happens once → session-state inconsistency. Resolution: every write carries request_id (UUID, threaded from the orchestrator per README cross-cutting concerns); TransactWrite includes a ConditionExpression on attribute_not_exists(request_id) for the TURN item, making retries naturally idempotent.
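A sketch of the idempotent TURN Put element (a hypothetical builder; it assumes the sort key is derived from request_id so a retry targets the same item — in practice you would combine a timestamp with the request_id to preserve the TURN# sort order):

```python
def idempotent_turn_put(table_name: str, pk: str, request_id: str,
                        fields: dict) -> dict:
    """Build the Put element of a TransactWriteItems call. If a retry
    re-sends the same request_id, the condition fails and the whole
    transaction, including the paired META increment, is rejected."""
    item = {
        "pk": {"S": pk},
        # Deterministic sk: a retry of the same request writes the same item
        "sk": {"S": f"TURN#{request_id}"},
        "request_id": {"S": request_id},
        **fields,
    }
    return {
        "Put": {
            "TableName": table_name,
            "Item": item,
            "ConditionExpression": "attribute_not_exists(request_id)",
        }
    }
```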
S2 (fix before scale-up)
Kill switch absent. Other stories have a single *_optimization_enabled flag; this story has none. Reverting requires code redeploy. Resolution: add ddb_cost_optimization_enabled flag in the central feature-flag evaluator. When false: revert to no write coalescing, no aggressive TTL (30-day default), strongly-consistent reads, full-projection GSI in parallel during transition.
Eventually-consistent read race window during failover. The code reads META with ConsistentRead=false and then queries TURN items the same way. Immediately after a write to the same partition, eventually consistent reads can lag by ~100ms; this is mostly fine for chat but must remain a documented assumption. Resolution: explicit assertion in a code comment ("EC reads acceptable because META and TURN share pk; lag bounded ~100ms in practice; a chat session tolerates this"). Add metric ec_read_staleness_ms (sampled comparison of strong vs eventual reads); alert if P99 > 500ms.
Sparse GSI rollback path requires keeping both indexes during transition. Story notes "deploy old + new in parallel" but doesn't name the rollback duration. Resolution: explicit 4-week parallel-projection window during which both full-projection and sparse-projection GSI exist; promote sparse-only after 100% query equivalence sampled daily for 4 weeks; 8-week budget total before deletion of full-projection GSI.
Baseline / optimized cost not separated cleanly. The story baseline ($570/mo) appears to already include partial optimization. Resolution: state an explicit pre-optimization baseline ($750–900/mo with full-projection GSI, 7-day TTL, strong reads) and a post-optimization target ($400–450/mo). The "40–60% savings" claim is measured against the pre-optimization baseline, not the published $570 figure.
S3 (acknowledged / future work)
- DDB Reserved Capacity not evaluated; only relevant if migrating from on-demand (out of scope).
- Cross-region replication via Global Tables for DR — out of scope.
- Per-customer cost attribution via DDB tags — deferred until cost shapes stabilize post-launch.