Scenario File 4 — Analytics Warehouse
Context in the Architecture
The analytics pipeline captures every event in the chatbot's lifecycle:
- What is captured: message_sent, intent_classified, response_generated, product_shown, feedback_given, escalation_triggered, guardrail_fired
- Pipeline: Orchestrator → Kinesis Data Stream → Kinesis Firehose → S3 / Redshift
- Volume: At steady-state peak, 500 concurrent sessions × 3 events/message × 5 messages/session/min = 7,500 events/min (~125 events/sec); the flash-sale scenarios later in this file assume bursts up to 500 events/sec
- Purpose: Business intelligence (which intents resolve successfully?), ML feedback (which responses got thumbs-down?), guardrail review (which outputs were blocked?), and cost attribution (which flows consume the most Bedrock tokens?)
The Analytics Latency Spectrum
| Use Case | Required Freshness | Example Query |
|---|---|---|
| Real-time alerting (guardrail spike) | < 60 seconds | "Are guardrail blocks > 5% in the last 5 min?" |
| Operational monitoring | < 10 minutes | "What is P95 response latency right now?" |
| Daily product reporting | Same day | "Top 10 recommended ASINs yesterday" |
| ML feedback loops | Next day | "Which intent had the highest thumbs-down rate this week?" |
| Business quarterly review | T+1 day | "Monthly active chatbot users by store section" |
Current Choice: Kinesis Data Stream → Redshift
Why it was chosen: Redshift handles columnar analytics at petabyte scale, integrates with Kinesis Firehose for near-real-time loading, supports standard SQL for the BI and data science teams, and reuses existing Amazon data infrastructure.
Event schema recap:
CREATE TABLE chatbot_events (
event_id VARCHAR(64) PRIMARY KEY,
session_id VARCHAR(64),
customer_id VARCHAR(64),
event_type VARCHAR(32),
intent VARCHAR(32),
message_text VARCHAR(2000), -- PII-scrubbed
response_text VARCHAR(4000),
products_shown VARCHAR(500), -- comma-separated ASINs
latency_ms INTEGER,
model_id VARCHAR(64),
feedback VARCHAR(16),
created_at TIMESTAMP
);
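The "<3 seconds" dashboard answers cited later in this file rest on materialized views over this table. A minimal sketch, assuming the schema above (the view name and per-minute grouping are illustrative, not part of the current design):
-- Hypothetical per-minute rollup; AUTO REFRESH keeps it current as
-- Firehose loads new rows into chatbot_events.
CREATE MATERIALIZED VIEW mv_intent_minute
AUTO REFRESH YES
AS
SELECT DATE_TRUNC('minute', created_at) AS minute,
       intent,
       event_type,
       COUNT(*)        AS events,
       AVG(latency_ms) AS avg_latency_ms
FROM chatbot_events
GROUP BY 1, 2, 3;
Dashboards and alerting queries then read mv_intent_minute rather than scanning the raw table.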
Alternative 1: Amazon Athena + S3 (Serverless Query on Raw Data)
What Changes
Events are written to S3 as Parquet (via Kinesis Firehose). No Redshift cluster. Queries run via Athena against raw S3 files.
-- Athena query on partitioned S3 data
SELECT intent, COUNT(*) as count, AVG(latency_ms) as avg_latency
FROM chatbot_events
WHERE year='2025' AND month='12' AND day='01'
GROUP BY intent
ORDER BY count DESC;
Best Case
- Zero infrastructure cost at rest: S3 stores events with no compute overhead. You pay only per query ($5/TB scanned).
- Perfect for exploratory or infrequent queries run by data scientists.
- At low event volume (<100K events/day), Athena is dramatically cheaper than an always-on Redshift cluster (~$3.26/hr per ra3.4xlarge node on-demand, roughly $2,400/node-month).
- Parquet + partition pruning (`year/month/day/hour`) makes daily reporting fast (DDL sketch below).
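What that pruning looks like in practice: a hedged sketch of the external table using hour-level partition projection, so new partitions need no MSCK REPAIR or Glue updates. The bucket, the trimmed column list, and the single dt column (instead of the year/month/day split used elsewhere in this file) are illustrative assumptions:
-- Hypothetical DDL: one dt partition per hour via partition projection
CREATE EXTERNAL TABLE chatbot_events (
    event_id     STRING,
    session_id   STRING,
    intent       STRING,
    message_text STRING,
    latency_ms   INT,
    created_at   TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://chatbot-analytics/events/'          -- assumed bucket
TBLPROPERTIES (
    'projection.enabled'          = 'true',
    'projection.dt.type'          = 'date',
    'projection.dt.format'        = 'yyyy/MM/dd/HH',
    'projection.dt.range'         = '2025/01/01/00,NOW',
    'projection.dt.interval'      = '1',
    'projection.dt.interval.unit' = 'HOURS',
    'storage.location.template'   = 's3://chatbot-analytics/events/${dt}/'
);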
Failure Scenario — The Unpartitioned Flash Sale Query
What happens: On launch day, the product manager asks: "How many users asked about the new Chainsaw Man volume in the last 2 hours?"
The Athena query is:
SELECT COUNT(*) FROM chatbot_events
WHERE message_text LIKE '%Chainsaw Man%'
AND created_at > NOW() - INTERVAL '2 hours';
The table has no partition on created_at (only year/month/day). Athena scans the entire current-day Parquet file (3M events, 4GB). Query cost: $0.02. Query duration: 47 seconds.
The PM gets the answer, but each run costs 47 seconds of waiting on data that is already up to 5 minutes stale from Firehose buffering. They rerun it every 5 minutes throughout launch day: another $0.02 and another 47-second wait per run, compounding to a few dollars of Athena spend and over an hour of cumulative waiting. By contrast, a Redshift materialized view would answer this in <3 seconds.
The deeper failure: The data engineer never set up fine-grained hourly partitions because "Athena is serverless, it'll handle it." At 500 events/sec peak, a single hour of data is 1.8M events. Without hourly partitioning, the two-hour scan always reads the full day.
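With the hour-grained projection sketched earlier, the same question prunes to two partitions instead of the whole day. A sketch, with partition values written literally:
-- Scans ~2 hours of Parquet (~330MB at this volume), not the full 4GB day
SELECT COUNT(*)
FROM chatbot_events
WHERE dt IN ('2025/12/01/09', '2025/12/01/10')
  AND message_text LIKE '%Chainsaw Man%';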
Failure Scenario 2 — Concurrency Limits Block the Real-Time Alert
What happens: An on-call engineer sets up a CloudWatch alarm that triggers a Lambda every 60 seconds to run an Athena query: "Is guardrail block rate > 5% in the last minute?"
Athena's default quota is 20 concurrent queries per account (raisable via service quota). During the launch, 15 data scientists, PMs, and engineers are running ad-hoc Athena queries, several in parallel each, saturating the quota. The CloudWatch Lambda's query is queued.
The guardrail alert fires 3 minutes late because the query was queued behind ad-hoc exploratory queries. The guardrail issue (a prompt injection campaign) ran unchecked for 3 extra minutes.
Grilling Questions
- Kinesis Firehose buffers events and writes to S3 when either the buffer interval elapses (default 300 seconds, configurable down to 60) or the buffer fills (up to 128MB). What does this mean for the freshness of Athena queries? Is 5-minute-stale data acceptable for guardrail incident response?
- Your S3 partitions are `year=/month=/day=/`. You want to query the last 30 minutes of data. How do you write the partition filter? What happens if you forget it?
- Athena does not support updates or deletes on standard external tables. If a GDPR data deletion request comes in ("delete all events associated with customer_id X"), how do you comply? What is the operational cost of rewriting objects that have already tiered into S3 Glacier?
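On the GDPR question, one hedged option is to keep events in an Apache Iceberg table, since Athena engine v3 supports row-level deletes there. The table and location names are illustrative:
-- One-time migration; afterwards compliance deletes are a SQL statement
CREATE TABLE chatbot_events_iceberg
WITH (table_type = 'ICEBERG',
      location = 's3://chatbot-analytics/events-iceberg/',   -- assumed bucket
      is_external = false)
AS SELECT * FROM chatbot_events;

DELETE FROM chatbot_events_iceberg
WHERE customer_id = 'X';   -- the customer in the deletion request
Iceberg deletes write small delete files rather than rewriting the underlying Parquet, which sidesteps the object-rewrite problem for data still in standard storage.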
Decision Heuristic
Use Athena + S3 when: (a) queries are infrequent (<50/day), (b) freshness of 5+ minutes is acceptable, and (c) you want to minimize infrastructure cost at the expense of query latency. Redshift is better when query SLAs are under 10 seconds or real-time dashboards are required.
Alternative 2: ClickHouse (Self-Hosted or AWS-Managed)
What Changes
Events are written to ClickHouse, a columnar OLAP database optimized for high-ingestion, high-cardinality analytics. Kinesis events reach ClickHouse through a consumer: either a Lambda sink issuing batched inserts, or ClickHouse's Kafka table engine fed by a Kinesis-to-Kafka bridge.
-- ClickHouse query (MergeTree engine)
SELECT intent, count() as count, avg(latency_ms) as avg_latency
FROM chatbot_events
WHERE created_at >= now() - INTERVAL 1 MINUTE
GROUP BY intent
ORDER BY count DESC;
Best Case
- Sub-second query latency even on billions of rows for time-series aggregations. ClickHouse is purpose-built for this pattern.
- Very high ingestion throughput (millions of rows/sec on modest hardware).
- Ideal for real-time dashboards: a Grafana + ClickHouse integration shows latency and guardrail metrics updating every 5 seconds.
- Cost: commonly benchmarked at roughly 3× cheaper than Redshift for equivalent query performance on time-series analytics.
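A sketch of the table definition the query above assumes; the ORDER BY prefix on created_at is what makes last-minute windows cheap. The column subset is illustrative:
-- MergeTree sorted by time: a 60-second window touches few granules
CREATE TABLE chatbot_events
(
    event_id   String,
    session_id String,
    intent     LowCardinality(String),
    latency_ms UInt32,
    created_at DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(created_at)
ORDER BY (created_at, intent);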
Failure Scenario — Self-Hosted Cluster Operational Burden
What happens: The team deploys a 3-node ClickHouse cluster on EC2. Six months in:
- ZooKeeper (required for ClickHouse distributed coordination in older versions) has a disk failure. The ClickHouse cluster is degraded. The team has never operated ZooKeeper before. Recovery takes 6 hours.
- ClickHouse schema changes (adding a new column `guardrail_rule_id`) require an `ALTER TABLE ... ADD COLUMN` that runs on all 3 shards with eventual consistency. During the migration, new events fill the new column, but old events have `NULL`. A BI report that groups by `guardrail_rule_id` produces misleading results for historical data.
- ClickHouse replication is ZooKeeper-based. A cluster upgrade to ClickHouse 24.x (moving coordination from ZooKeeper to ClickHouse Keeper) requires a full cluster migration. The team has no documented runbook for this.
What makes this different from Redshift: Redshift is Amazon-managed. Schema migration, replication, upgrades, and backups are handled by AWS. ClickHouse self-hosted means you manage all of this.
Failure Scenario 2 — Hot Shard on session_id
What happens: ClickHouse's Distributed engine routes rows to shards by a sharding key. The team uses `toYYYYMMDD(created_at)`, so on a flash sale day every event routes to today's shard, which is also the shard being actively written.
Read and write I/O contend on that one shard. Query latency for real-time dashboards spikes from 50ms to 3.4 seconds during peak write load.
The fix (sharding by `sipHash64(session_id)` to spread sessions across all 3 shards, as sketched below) requires re-inserting the data through a new Distributed table; ClickHouse does not support changing the sharding key in place.
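A sketch of the re-sharded layout; the cluster, database, and table names are assumed:
-- Writes routed through this table hash session_id across all 3 shards
CREATE TABLE chatbot_events_dist AS chatbot_events
ENGINE = Distributed('analytics_cluster', 'default', 'chatbot_events',
                     sipHash64(session_id));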
Grilling Questions
- ClickHouse's `MergeTree` engine merges data parts in the background. During a merge, read queries may scan pre-merge parts, returning slightly inconsistent results (eventual read consistency). For a "guardrail blocks in the last 60 seconds" alert, is this acceptable?
- You want to run ClickHouse on AWS. AWS does not offer a managed ClickHouse service (unlike Redshift). Altinity Cloud and ClickHouse Cloud are third-party. Does this carry the same data residency risk as Pinecone?
- ClickHouse does not support transactions. If the Kinesis consumer crashes mid-batch, some events may be written twice to ClickHouse. Duplicates inflate `count()` metrics. How do you design for deduplication? (One sketch follows.)
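One common deduplication sketch: a ReplacingMergeTree keyed so duplicate event_ids collapse at merge time, with alert queries counting distinct IDs rather than trusting raw count(). The table name and column subset are illustrative:
-- Rows sharing the same ORDER BY key collapse on background merges
CREATE TABLE chatbot_events_dedup
(
    event_id   String,
    intent     LowCardinality(String),
    latency_ms UInt32,
    created_at DateTime
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMMDD(created_at)
ORDER BY (created_at, event_id);

-- Merges are eventual, so count duplicates out explicitly
SELECT uniqExact(event_id) AS events_last_minute
FROM chatbot_events_dedup
WHERE created_at >= now() - INTERVAL 1 MINUTE;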
Decision Heuristic
Use ClickHouse when: (a) query SLA is <1 second on billions of rows, (b) your team has operational capacity to manage the cluster, and (c) cost efficiency at large data volumes outweighs managed service simplicity. For teams without dedicated data infrastructure engineers, Redshift's managed nature is more valuable than ClickHouse's performance advantage.
Alternative 3: Real-Time Streaming with OpenSearch (Kibana Dashboards)
What Changes
Events are streamed directly into OpenSearch (not the vector search index — a separate analytics index). Kibana is used for real-time dashboards. No Kinesis Firehose buffering.
PUT /chatbot-events-2025-12-01/_doc/event_abc123
{
"session_id": "sess_abc123",
"intent": "product_discovery",
"latency_ms": 1842,
"created_at": "2025-12-01T10:05:00Z"
}
Best Case
- Real-time: events are visible in Kibana within 1-2 seconds of occurring.
- No additional analytics infrastructure if you already run OpenSearch for the vector store.
- Kibana is a powerful dashboard tool that the operations team already knows.
- Excellent for anomaly detection: "intent distribution shifted 30% in the last 5 minutes" via Kibana Alerting.
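The anomaly check in the last bullet can also be expressed through OpenSearch's SQL plugin (`POST _plugins/_sql`) instead of query DSL. A hedged sketch, assuming the plugin is enabled on the domain; the time bound is written as a literal:
-- Intent distribution over the last 5 minutes, via _plugins/_sql
SELECT intent, COUNT(*) AS events
FROM `chatbot-events-2025-12-01`
WHERE created_at > '2025-12-01 10:00:00'
GROUP BY intent;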
Failure Scenario — Index Explosion from High Cardinality Fields
What happens: The team decides to store message_text (PII-scrubbed) in the analytics OpenSearch index for debugging. OpenSearch builds an inverted index with an entry for every unique term across all messages.
After 30 days of operation: 500 events/sec × 86400 sec/day × 30 days = 1.3B events. The message_text field has 200,000 unique tokens. The inverted index for message_text grows to 450GB across the OpenSearch cluster.
Kibana queries on message_text (e.g., "find all sessions where user said 'damaged'") take 45 seconds because the inverted index is too large for memory and OpenSearch is paging from disk.
The mitigation (and new failure): The team disables the message_text field in the mapping ("index": false). This means you can store the field but not search or aggregate on it. The debug use case (finding sessions where a user asked about a specific product) is gone. The data science team relied on this for labeling the golden dataset for intent classifier retraining.
Failure Scenario 2 — Write-Heavy Index Competes with Read-Heavy RAG Queries
What happens: The analytics index and the RAG knowledge base index share the same OpenSearch Serverless collection (to save cost). During peak flash sale traffic:
- RAG queries (vector KNN search): 8,000 requests/min
- Analytics writes: 500 events/sec = 30,000 writes/min
OpenSearch Serverless OCU (OpenSearch Compute Units) are shared. Write-heavy analytics workloads consume OCUs, leaving fewer for RAG queries. RAG query latency spikes from 80ms to 900ms. The chatbot's first-token latency breaches the 2-second SLA.
The fix: separate OpenSearch Serverless collections for analytics and RAG. This doubles the minimum OCU cost.
Grilling Questions
- OpenSearch field mappings are fixed at index creation (a field's type cannot be changed after documents are indexed). You realize `latency_ms` should be `latency_ms_float` (some values are decimal for sub-ms precision). How do you migrate without losing historical data?
- Your analytics OpenSearch index grows to 50B events over a year. Storage on OpenSearch Serverless is ~$0.024/GB-month; assuming 500 bytes per event, 50B × 500 bytes = 25TB × $0.024 = $600/month just for storage. S3 Parquet at $0.023/GB (~$575/month) with Athena on demand is nearly identical for storage, so it is the OCU compute charges that multiply the OpenSearch bill. Is the real-time capability worth the premium?
- Kibana alerting runs queries on OpenSearch every 60 seconds. If you have 20 active Kibana alerts (latency, guardrail rate, intent drift, escalation rate), that's 20 queries/min against the analytics index. How do you size the OCU allocation to handle both analytics writes AND Kibana alert queries without impacting RAG queries?
Decision Heuristic
Use OpenSearch for analytics when real-time alerting (<5 seconds) is genuinely required (security incident response, live launch monitoring) and you already run OpenSearch. Use S3 + Athena or Redshift for all other analytics use cases — OpenSearch's storage cost and indexing overhead make it expensive for bulk historical analysis.
Alternative 4: DynamoDB Streams + Lambda for Real-Time Micro-Analytics
What Changes
Instead of Kinesis, each event write to DynamoDB (conversation memory) triggers a DynamoDB Stream consumed by a Lambda function, which performs micro-analytics (intent counting, latency tracking) and writes aggregates to a separate DynamoDB table.
import boto3
from datetime import datetime, timezone

dynamodb = boto3.client("dynamodb")

def process_event(record):  # called once per record in the stream batch
    if record["eventName"] != "INSERT":
        return  # ignore MODIFY/REMOVE stream records
    image = record["dynamodb"]["NewImage"]
    intent = image["intent"]["S"]
    latency = int(image["latency_ms"]["N"])
    current_hour = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H")
    dynamodb.update_item(
        TableName="chatbot_aggregates",
        Key={"pk": {"S": f"INTENT#{intent}#HOUR#{current_hour}"}},
        # "count" is a DynamoDB reserved word, so alias it; ADD is atomic
        UpdateExpression="ADD #cnt :one, total_latency :lat",
        ExpressionAttributeNames={"#cnt": "count"},
        ExpressionAttributeValues={":one": {"N": "1"}, ":lat": {"N": str(latency)}},
    )
Best Case
- True real-time aggregation: intent counts updated within 200ms of each event.
- No Kinesis, no Firehose, no Redshift cluster — just DynamoDB Streams (included in DynamoDB cost) and Lambda.
- Simple for read-time alerting: check `chatbot_aggregates` for `INTENT#guardrail_blocked#HOUR#now` every 30 seconds (PartiQL sketch below).
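A hedged sketch of that check as DynamoDB PartiQL, run via ExecuteStatement; the hour key is written as a literal, and count is double-quoted because it is a reserved word:
-- Executed every 30 seconds by the alerting job
SELECT "count", total_latency
FROM chatbot_aggregates
WHERE pk = 'INTENT#guardrail_blocked#HOUR#2025-12-01T10';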
Failure Scenario — The Lambda Fan-Out Amplification
What happens: DynamoDB Streams delivers events in shards. At 500 events/sec, DynamoDB automatically partitions the stream into shards. Lambda scales with one concurrent invocation per shard.
But each Lambda invocation processes a batch of records (default batch size: 100). At 500 events/sec:
- 5 shards × 1 Lambda/shard = 5 concurrent Lambda invocations
- Each Lambda processes 100 events before the next batch becomes available
- Processing time per batch: 800ms (one DynamoDB update per event = 8ms × 100 events)
- Buffer builds up: events in the stream are 800ms old before being processed
An unexpected traffic spike to 5,000 events/sec:
- DynamoDB suddenly requires 50 shards
- Lambda scales from 5 → 50 concurrent invocations (cold starts add 1.5s)
- During the cold-start phase, the stream buffer grows to 4,000 unprocessed events
- Aggregate tables are 8 minutes stale during the spike
- The on-call dashboard shows flat intent counts for 8 minutes during the busiest period of the year
Grilling Questions
- DynamoDB Streams retains records for 24 hours. If the aggregation Lambda is down for 6 hours (due to a bug), the stream replays 6 hours of events. What happens to your hourly aggregates? Are they double-counted?
- You want to run a SQL query across all events in the last 7 days (for a weekly business review). DynamoDB Streams does not support retrospective queries, and your aggregate table only has per-hour counts, not individual events. How do you answer "show me all sessions where the user was frustrated (negative sentiment) and escalated"?
- The aggregate DynamoDB table uses `pk = INTENT#{intent}#HOUR#{hour}`. 10 intent types × 24 hours = 240 keys per day, but writes concentrate on the ~10 current-hour keys. Multiple Lambda invocations updating the same key simultaneously contend. How does DynamoDB's atomic `UpdateExpression` handle this, and what is the DynamoDB cost of 500 updates/sec against those hot keys?
Decision Heuristic
DynamoDB Streams + Lambda is appropriate for simple, pre-defined micro-aggregations (counts, sums, averages by known dimensions) that must be real-time. It fails when you need ad-hoc queries across raw event data or when aggregation logic is complex. Always pair it with a separate analytical store (Redshift, Athena/S3) for historical and exploratory queries.
Master Summary Table
| Choice | Query Latency | Freshness | Cost (500 events/sec) | Key Failure Risk |
|---|---|---|---|---|
| Kinesis → Redshift (current) | 1-10s (standard), <1s (materialized views) | 5-10 min (Firehose buffer) | ~$2,400/month (1 × ra3.4xlarge) | Redshift cluster scaling takes minutes; ad-hoc queries compete with dashboards |
| Athena + S3 | 10-60s (no cache) | 5 min (Firehose buffer) | ~$50/month (Athena per-query) | Unpartitioned scans blow query cost; concurrent query limits block alerts |
| ClickHouse | <1s | <1s (streaming ingestion) | ~$150/month (self-hosted) | Self-hosted operational burden; sharding-key changes require a full table rewrite |
| OpenSearch (Kibana) | <2s | <2s (real-time) | ~$800/month (storage + OCU) | High-cardinality field index explosion; competes with RAG OCU budget |
| DynamoDB Streams + Lambda | <200ms (aggregates only) | Real-time | ~$30/month (Lambda + DDB) | No ad-hoc queries; stream lag during spikes; replay causes double-counting |