Scenarios and Runbooks — Collaborative AI Systems
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Field | Value |
|---|---|
| Certification | AWS AIP-C01 (AI Practitioner) |
| Domain | 2 — Fundamentals of Developing AI Solutions |
| Task | 2.1 — Develop AI solutions using AWS services |
| Skill | 2.1.5 — Develop collaborative AI systems to enhance FM capabilities with human expertise |
| Focus | Failure scenarios, operational runbooks, incident response for human-in-the-loop systems |
Scenario Index
| # | Scenario | Severity | Blast Radius |
|---|---|---|---|
| 1 | Human review queue backlog causing SLA breaches | P1 — Critical | User-facing latency, trust erosion |
| 2 | Feedback loop creating adversarial prompt patterns | P2 — High | Model behavior degradation |
| 3 | Escalation threshold too low overwhelming human reviewers | P2 — High | Reviewer burnout, cost spike |
| 4 | Approval workflow timeout leaving user hanging | P1 — Critical | User abandonment, revenue loss |
| 5 | Feedback data poisoning skewing model behavior | P1 — Critical | Systemic response quality decline |
Scenario 1: Human Review Queue Backlog Causing SLA Breaches
Problem Statement
The SQS-backed human review queue has grown to 200+ pending tasks, far exceeding the 50-task operational threshold. Priority 1 tasks (VIP customers, financial decisions) that should be reviewed within 60 seconds are sitting for 8+ minutes. Users are receiving "Your question is being reviewed" acknowledgments but no follow-up responses. The chatbot's perceived reliability has dropped, and customer support tickets about "the bot not answering" are spiking.
This scenario is particularly damaging for MangaAssist because Japanese customers have high service expectations (omotenashi culture), and unanswered messages during peak evening hours (19:00-23:00 JST) coincide with the highest purchase intent window.
Detection
Primary CloudWatch Alarms:
| Alarm | Metric | Threshold | Current Value |
|---|---|---|---|
| `ReviewQueueDepth-Critical` | `ApproximateNumberOfMessagesVisible` (SQS) | > 50 | 217 |
| `ReviewSLABreachRate-P1` | Custom metric: P1 tasks exceeding 60s | > 5% | 43% |
| `ReviewSLABreachRate-P2` | Custom metric: P2 tasks exceeding 120s | > 10% | 67% |
| `OldestMessageAge` | `ApproximateAgeOfOldestMessage` (SQS) | > 300s | 512s |
Secondary Signals:
- Step Functions executions in EnqueueForReview state growing linearly
- DynamoDB manga-review-tasks table shows 200+ items with status = pending
- WebSocket connection count dropping (users disconnecting before review completes)
- Customer support ticket volume up 300% in the last hour
Detection Query (CloudWatch Logs Insights):
```
fields @timestamp, @message
| filter @message like /review_acknowledgment/
| stats count() as ack_count by bin(5m) as time_window
| sort time_window desc
| limit 20
```
Root Cause Analysis
Immediate Cause: Three of five on-duty reviewers went offline simultaneously (shift handoff gap between 20:00-20:30 JST), reducing review capacity by 60% during peak traffic.
Contributing Factors:
- Shift scheduling gap — The reviewer shift schedule did not account for the 30-minute overlap needed during handoff. The outgoing team logged off at 20:00, but the incoming team was not fully online until 20:30.
- No auto-scaling for reviewer capacity — The system had no mechanism to detect declining reviewer availability and compensate (e.g., by relaxing thresholds or activating backup reviewers).
- Escalation threshold too aggressive for peak hours — The confidence threshold for auto-respond was set at 0.85 for all hours. During peak hours, this sent 15% of traffic to review. At 1M messages/day, peak hour traffic of ~80K messages means ~12K review tasks per hour, requiring sustained reviewer throughput of ~3.3 reviews per second.
- No backpressure mechanism — When the queue grew, the system continued enqueuing at the same rate instead of temporarily lowering the auto-respond confidence threshold so that more low-risk messages were answered automatically.
Timeline:
- 19:45 JST — Queue depth at normal level (12 tasks)
- 20:00 JST — Outgoing reviewer shift logs off (3 of 5 reviewers)
- 20:05 JST — Queue depth reaches 30, P1 SLA alarm fires
- 20:10 JST — Queue depth reaches 75, P2 SLA alarm fires
- 20:15 JST — Queue depth reaches 120, oldest message at 300s
- 20:20 JST — Step Functions timeout handler starts delivering responses with disclaimers
- 20:25 JST — Incoming reviewer shift partially online (2 of 5)
- 20:30 JST — All incoming reviewers online, queue processing resumes
- 20:45 JST — Queue depth back to 40 (still elevated)
- 21:15 JST — Queue depth normalized to 15
Resolution
Immediate Actions (first 15 minutes):
- Activate emergency threshold relaxation:

```python
# Lambda function triggered by CloudWatch alarm
from datetime import datetime, timedelta, timezone
from decimal import Decimal

import boto3


def emergency_threshold_relaxation(event, context):
    """Temporarily lower the auto-respond threshold to reduce review volume."""
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("manga-escalation-thresholds")

    # Lower the auto-respond minimum across all categories
    categories = ["general", "product_info", "availability", "rare_manga"]
    for category in categories:
        table.update_item(
            Key={"topic_category": category},
            UpdateExpression=(
                "SET auto_respond_min = :val, emergency_override = :true_val, "
                "override_expires = :exp"
            ),
            ExpressionAttributeValues={
                ":val": Decimal("0.70"),  # Lowered from 0.85
                ":true_val": True,
                ":exp": (datetime.now(timezone.utc) + timedelta(hours=2)).isoformat(),
            },
        )
    return {"status": "thresholds_relaxed", "categories": categories}
```

- Page backup reviewers via PagerDuty/SNS integration. The on-call rotation includes 3 backup reviewers who can be activated within 10 minutes.

- Prioritize the queue — Re-sort pending tasks so P1 items are processed first. For P3/P4 tasks older than 5 minutes, auto-deliver with a disclaimer:

  > "Here's what I found — please note this response was generated automatically and may not be fully verified. Let us know if you need further help."
  >
  > 「こちらが見つかった情報です。自動生成された回答のため、完全に検証されていない場合があります。さらにサポートが必要な場合はお知らせください。」

- Drain timed-out Step Functions executions — For executions that hit the 300s timeout, the `HandleReviewTimeout` state delivers the AI response with a disclaimer rather than leaving the user hanging.
Short-term Fixes (within 48 hours):
- Implement shift overlap — Modify the reviewer schedule to require a 30-minute overlap between shifts. The outgoing team does not log off until the incoming team confirms they are active.

- Add queue-depth-triggered threshold adjustment — When `ApproximateNumberOfMessagesVisible` exceeds 50, automatically lower the auto-respond threshold for low-risk categories:

```python
def auto_adjust_on_queue_depth(queue_depth: int, current_thresholds: dict) -> dict:
    """Relax auto-respond thresholds as the review queue grows.

    High-risk categories keep their current thresholds and continue to be reviewed.
    """
    high_risk = ("refund", "complaint", "legal")
    if queue_depth > 100:
        # Aggressive relaxation: only high-risk categories still go to review
        return {cat: t if cat in high_risk
                else {**t, "auto_respond_min": max(0.60, t["auto_respond_min"] - 0.15)}
                for cat, t in current_thresholds.items()}
    elif queue_depth > 50:
        # Moderate relaxation
        return {cat: t if cat in high_risk
                else {**t, "auto_respond_min": max(0.70, t["auto_respond_min"] - 0.10)}
                for cat, t in current_thresholds.items()}
    return current_thresholds
```

- Deploy reviewer availability dashboard — Real-time view of active reviewers, their current load, and queue depth per priority level.
Long-term Fixes (within 2 weeks):
- Implement time-of-day threshold profiles — Different thresholds for peak vs. off-peak hours. During peak hours (19:00-23:00 JST), the auto-respond threshold drops to 0.75 for general queries to reduce review volume (see the sketch after this list).
- Build predictive queue depth model — Use historical data to predict queue depth 30 minutes ahead and proactively adjust thresholds or page additional reviewers.
- Add dead-letter queue for chronically unprocessed reviews — After 3 reassignment attempts, route to a senior reviewer pool or auto-deliver with an enhanced disclaimer.
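A minimal sketch of how time-of-day profiles could be looked up alongside the existing per-category thresholds. The `peak_auto_respond_min` attribute and the item layout are assumptions for illustration, not the production schema:

```python
# Sketch only — assumes each item in manga-escalation-thresholds optionally
# carries a "peak_auto_respond_min" attribute; field names are illustrative.
from datetime import datetime, timedelta, timezone

PEAK_HOURS_JST = range(19, 23)  # 19:00-22:59 JST, the high purchase-intent window


def effective_auto_respond_min(threshold_item: dict) -> float:
    """Pick the peak or off-peak auto-respond threshold for the current hour."""
    now_jst = datetime.now(timezone(timedelta(hours=9)))
    if now_jst.hour in PEAK_HOURS_JST and "peak_auto_respond_min" in threshold_item:
        return float(threshold_item["peak_auto_respond_min"])  # e.g. 0.75 for "general"
    return float(threshold_item["auto_respond_min"])           # e.g. 0.85 off-peak
```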
Prevention
- Automated reviewer availability monitoring — Alert if active reviewer count drops below minimum staffing level (3 for peak, 2 for off-peak)
- Queue depth circuit breaker — Automatic threshold relaxation when queue exceeds capacity
- Shift handoff verification — Incoming reviewers must check in via a dashboard; shift change does not complete until minimum coverage is confirmed
- Weekly capacity planning review — Compare review volume trends against staffing levels
- Load testing — Monthly simulation of peak traffic with reduced reviewer capacity to validate backpressure mechanisms
Scenario 2: Feedback Loop Creating Adversarial Prompt Patterns
Problem Statement
Over a 3-week period, the RLHF-lite prompt improvement cycle has incorporated feedback that subtly degrades the system prompt. A pattern of negative feedback was submitted by a small group of users (later identified as competitors running a coordinated campaign) who consistently thumbs-downed accurate product recommendations and thumbs-upped responses that steered users toward competitor products. The weekly prompt adjustment cycle picked up these signals and modified the system prompt to reduce product recommendation confidence, effectively making the chatbot less helpful for its core use case.
The attack is subtle — it does not trigger obvious anomaly detection because the feedback volume (200-300 signals/day) is within normal bounds, and the changes to the prompt are incremental.
Detection
Primary Signals:
| Indicator | Normal Baseline | Current Value | Delta |
|---|---|---|---|
| Product recommendation positive rate | 82% | 61% | -21% |
| "Add to cart" conversion from recommendations | 12% | 5.8% | -6.2% |
| Average session duration after recommendation | 4.2 min | 2.1 min | -2.1 min |
| System prompt version changes (3 weeks) | 2-3 | 8 | +5-6 |
Secondary Signals:
- Negative feedback concentrated in 15 user accounts (0.001% of users generating 8% of negative feedback)
- Feedback timestamps clustered in patterns suggesting automation (uniform 30-45 second intervals)
- Corrected responses in feedback submissions contain competitor product URLs
- Prompt changelog shows progressive weakening of product recommendation instructions
Detection Query (Athena on feedback archive):
```sql
SELECT
    session_id,
    COUNT(*) as total_feedback,
    SUM(CASE WHEN signal = 'down' THEN 1 ELSE 0 END) as negative_count,
    ROUND(SUM(CASE WHEN signal = 'down' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) as negative_pct,
    COUNT(DISTINCT DATE(timestamp)) as active_days,
    MIN(timestamp) as first_feedback,
    MAX(timestamp) as last_feedback
FROM manga_feedback
WHERE feedback_type = 'binary'
  AND timestamp >= DATE_ADD('day', -21, NOW())
GROUP BY session_id
-- Athena (Presto) does not allow SELECT aliases in HAVING, so repeat the expressions
HAVING SUM(CASE WHEN signal = 'down' THEN 1 ELSE 0 END) > 20
   AND SUM(CASE WHEN signal = 'down' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) > 70
ORDER BY negative_count DESC
LIMIT 50;
```
Root Cause Analysis
Immediate Cause: The RLHF-lite prompt improvement pipeline had no adversarial detection or feedback provenance validation. It treated all feedback signals equally regardless of source.
Contributing Factors:
- No per-user feedback rate limiting — A single session could submit unlimited feedback signals. The system counted volume, not quality.
- No feedback provenance scoring — Feedback from new accounts, accounts with no purchase history, or accounts exhibiting abnormal patterns was weighted the same as feedback from verified purchasers.
- Automated prompt updates without human approval — The weekly prompt improvement job automatically deployed revised prompts to production without a human reviewing the changes.
- No A/B testing gate — Prompt changes were applied to 100% of traffic immediately rather than being validated on a small cohort first.
- Correction submissions not sanitized — User-submitted "corrections" containing competitor URLs were fed directly into the prompt improvement pipeline.
Resolution
Immediate Actions (first 2 hours):
- Roll back the system prompt to the version from 3 weeks ago (pre-attack):

```python
from datetime import datetime, timezone

import boto3


def rollback_system_prompt(target_version: str):
    """Restore a previous system prompt version from S3 versioning."""
    s3 = boto3.client("s3")

    # Fetch the known-good version
    response = s3.get_object(
        Bucket="manga-system-prompts",
        Key="production/system-prompt.txt",
        VersionId=target_version,
    )
    prompt_content = response["Body"].read().decode("utf-8")

    # Deploy to ElastiCache for immediate effect. This is a data-plane write, so it
    # needs a Redis client (e.g., redis-py) connected to the cluster endpoint, not
    # the boto3 "elasticache" control-plane client.
    # ... update the cached system prompt

    # Also update DynamoDB config table
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("manga-config")
    table.update_item(
        Key={"config_key": "system_prompt"},
        UpdateExpression="SET prompt_text = :text, version_id = :ver, rolled_back_at = :now",
        ExpressionAttributeValues={
            ":text": prompt_content,
            ":ver": target_version,
            ":now": datetime.now(timezone.utc).isoformat(),
        },
    )
    return {"status": "rolled_back", "version": target_version}
```

- Freeze the RLHF-lite pipeline — Disable the automated prompt improvement job until safeguards are in place.

- Quarantine suspicious feedback — Flag all feedback from the 15 identified accounts and exclude it from aggregate metrics:

```python
import boto3


def quarantine_feedback(suspicious_session_ids: list):
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("manga-feedback")

    for session_id in suspicious_session_ids:
        # Query all feedback from this session
        response = table.query(
            IndexName="session-feedback-index",
            KeyConditionExpression="session_id = :sid",
            ExpressionAttributeValues={":sid": session_id},
        )
        for item in response.get("Items", []):
            table.update_item(
                Key={"feedback_id": item["feedback_id"]},
                UpdateExpression="SET quarantined = :true_val, quarantine_reason = :reason",
                ExpressionAttributeValues={
                    ":true_val": True,
                    ":reason": "adversarial_pattern_detected",
                },
            )
```
Short-term Fixes (within 1 week):
- Implement feedback provenance scoring:
| Signal | Score Modifier |
|---|---|
| Account has purchase history | +0.3 |
| Account age > 30 days | +0.2 |
| Feedback rate < 5/hour | +0.1 |
| Feedback pattern is organic (variable timing) | +0.2 |
| Account has no purchase history | -0.3 |
| Account created in last 7 days | -0.2 |
| Feedback rate > 20/hour | -0.5 |
| Uniform timing between feedback signals | -0.4 |
Only feedback with a provenance score above 0.3 is included in prompt improvement analysis.
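A minimal sketch of how those modifiers could be folded into a single score. The account and statistics field names and the 0.5 neutral starting point are illustrative assumptions:

```python
# Sketch only — field names (purchase_count, account_age_days, feedback_per_hour,
# timing_is_uniform) and the 0.5 base score are assumptions for illustration.
def provenance_score(account: dict, feedback_stats: dict) -> float:
    """Combine the table's modifiers into a 0.0-1.0 provenance score."""
    score = 0.5  # neutral starting point
    score += 0.3 if account.get("purchase_count", 0) > 0 else -0.3
    score += 0.2 if account.get("account_age_days", 0) > 30 else 0.0
    score -= 0.2 if account.get("account_age_days", 0) <= 7 else 0.0
    rate = feedback_stats.get("feedback_per_hour", 0)
    if rate < 5:
        score += 0.1
    elif rate > 20:
        score -= 0.5
    score += -0.4 if feedback_stats.get("timing_is_uniform") else 0.2
    return max(0.0, min(1.0, score))


# Only feedback with provenance_score(...) > 0.3 feeds the prompt improvement job.
```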
- Add rate limiting to the feedback API:
  - Max 10 binary feedback signals per session per hour
  - Max 3 correction submissions per session per day
  - Max 5 preference pair signals per session per day

- Require human approval for prompt changes — The weekly prompt improvement job generates a diff and sends it to the ML team for review via a Step Functions approval workflow before deployment.
Long-term Fixes (within 1 month):
- Deploy anomaly detection on feedback patterns — Use CloudWatch Anomaly Detection on feedback signal distributions. Alert when the negative feedback rate from any user cohort deviates by more than 2 standard deviations from the cohort mean.
- Implement A/B testing gate for prompt changes — New prompts are deployed to 5% of traffic for 48 hours and promoted to 100% only if the A/B cohort shows equal or better feedback scores.
- Sanitize correction submissions — Strip URLs, remove competitor brand names, and validate that corrections are semantically related to the original query using embedding similarity (see the sketch after this list).
- Build feedback reputation system — Users who consistently provide feedback aligned with majority signals earn higher provenance scores over time.
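A minimal sketch of the sanitization step, assuming Amazon Titan Text Embeddings on Bedrock for the similarity check; the model ID, the 0.5 cosine cutoff, and the blocklist contents are illustrative assumptions:

```python
# Sketch only — the Bedrock model ID, similarity cutoff, and blocklist are assumptions.
import json
import re

import boto3

BLOCKED_DOMAINS = ("competitor-store.example",)  # hypothetical blocklist
bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")


def _embed(text: str) -> list[float]:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]


def sanitize_correction(original_query: str, correction: str) -> str | None:
    """Strip URLs, then reject corrections unrelated to the original query."""
    if any(domain in correction for domain in BLOCKED_DOMAINS):
        return None
    cleaned = re.sub(r"https?://\S+", "", correction).strip()
    q, c = _embed(original_query), _embed(cleaned)
    dot = sum(a * b for a, b in zip(q, c))
    norm = (sum(a * a for a in q) ** 0.5) * (sum(b * b for b in c) ** 0.5)
    if norm == 0 or dot / norm < 0.5:  # cosine similarity gate
        return None
    return cleaned
```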
Prevention
- Feedback provenance scoring on all signals — Never treat anonymous or unverified feedback equally with verified customer feedback
- Rate limiting at the API Gateway level — WAF rules to throttle feedback submission rates
- Prompt change approval workflow — No automated prompt deployment to production
- Weekly adversarial feedback audit — Scheduled Athena query to detect feedback clustering patterns
- A/B testing mandate — All prompt changes must pass an A/B test before full deployment
- Competitor URL blocklist — Maintain a blocklist of competitor domains to filter from correction submissions
Scenario 3: Escalation Threshold Too Low Overwhelming Human Reviewers
Problem Statement
After a well-intentioned but poorly calibrated threshold change, the auto-respond confidence threshold was raised from 0.85 to 0.95 across all topic categories. The intent was to improve response quality after a customer complaint about an incorrect product recommendation. However, this change caused the escalation rate to jump from 12% to 45% of all messages, tripling the review workload overnight.
Human reviewers went from handling ~120K reviews/day to a projected ~450K reviews/day. The reviewer pool of 20 agents can handle approximately 150K reviews/day at maximum capacity (each reviewer handles ~500 reviews in an 8-hour shift). The queue is growing by ~300K messages/day with no capacity to drain it.
Detection
Primary Alarms:
| Alarm | Metric | Threshold | Current Value |
|---|---|---|---|
| `EscalationRateAnomaly` | `MangaAssist/Escalation/Rate` | > 25% | 45% |
| `ReviewerUtilization` | `MangaAssist/Review/ReviewerLoad` | > 80% | 100% (capped) |
| `ReviewQueueGrowthRate` | Queue messages added minus messages processed per 5 min | > 100 | 1,042 |
| `CostAnomaly` | Step Functions execution cost (daily projected) | > $100 | $312 |
Secondary Signals:
- Reviewer dashboard shows all 20 agents at maximum concurrent review capacity (10 each)
- Average review time increasing (reviewer fatigue): from 18 seconds to 34 seconds per review
- 92% of reviewed messages are being approved without changes (indicating unnecessary escalation)
- Step Functions execution count up 275% day-over-day
Root Cause Analysis
Immediate Cause: A senior engineer changed the auto_respond_min threshold in the manga-escalation-thresholds DynamoDB table directly, setting all categories to 0.95 without impact analysis.
Contributing Factors:
- No change control for threshold updates — DynamoDB table writes are not gated by an approval process. Any team member with table write access can change thresholds.
- No impact simulation — There was no tool to predict the effect of a threshold change on escalation volume before deploying it.
- Single threshold applied to all categories — A blanket change to all categories (including low-risk ones like "general product info") dramatically increased volume from categories that rarely need review.
- No gradual rollout — The change was applied to 100% of traffic immediately.
- Confidence score distribution not analyzed — The engineer did not check the distribution of confidence scores. If 45% of messages score between 0.85 and 0.95, raising the threshold to 0.95 escalates all of them.
Resolution
Immediate Actions (first 30 minutes):
- Revert the threshold change — Restore all category thresholds to their previous values:

```python
from datetime import datetime, timezone
from decimal import Decimal

import boto3


def revert_thresholds(previous_values: dict):
    """Restore thresholds from the last known-good configuration."""
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("manga-escalation-thresholds")

    for category, thresholds in previous_values.items():
        table.put_item(Item={
            "topic_category": category,
            "auto_respond_min": Decimal(str(thresholds["auto_respond_min"])),
            "review_min": Decimal(str(thresholds["review_min"])),
            "handoff_below": Decimal(str(thresholds["handoff_below"])),
            "last_adjusted": datetime.now(timezone.utc).isoformat(),
            "adjusted_by": "incident_revert",
        })

    # Clear the EscalationManager cache to force reload
    # This happens automatically within 5 minutes (cache TTL)
    return {"status": "reverted", "categories": list(previous_values.keys())}
```

- Bulk-approve pending low-risk reviews — For messages in the queue with confidence scores above 0.85 and topic categories that are not high-risk, auto-approve and deliver:

```python
import json
from decimal import Decimal

import boto3
from boto3.dynamodb.conditions import Attr
from botocore.exceptions import ClientError


def bulk_approve_low_risk(confidence_floor: float = 0.85):
    """Auto-approve queued reviews that should not have been escalated."""
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("manga-review-tasks")
    sfn = boto3.client("stepfunctions")

    safe_categories = ["general", "product_info", "availability", "rare_manga"]

    # Attr-based conditions avoid hand-building an IN clause and sidestep the
    # reserved word "status"
    response = table.scan(
        FilterExpression=(
            Attr("status").eq("pending")
            & Attr("confidence_score").gte(Decimal(str(confidence_floor)))
            & Attr("topic_category").is_in(safe_categories)
        ),
    )

    approved_count = 0
    for task in response.get("Items", []):
        token = task.get("step_function_token")
        if not token:
            continue
        try:
            sfn.send_task_success(
                taskToken=token,
                output=json.dumps({
                    "decision": "approve",
                    "reviewerId": "system_bulk_approve",
                    "reviewDurationMs": 0,
                    "editedResponse": None,
                    "reviewNotes": "Bulk approved during threshold revert incident",
                }),
            )
            approved_count += 1
        except ClientError:
            pass  # Task may have already timed out
    return {"approved": approved_count}
```

- Notify reviewers — Send a message to the reviewer dashboard indicating the threshold has been reverted and the queue should normalize within 30 minutes.
Short-term Fixes (within 1 week):
- Implement threshold change control — Threshold updates must go through a Step Functions approval workflow:
  - Engineer proposes a change
  - Impact simulation runs (predicts escalation volume based on historical confidence distributions)
  - ML team lead approves or rejects
  - Change deployed with a 5% canary first

- Build an impact simulator:

```python
def simulate_threshold_impact(
    category: str,
    proposed_auto_respond_min: float,
    lookback_hours: int = 168,
) -> dict:
    """Predict escalation volume for a proposed threshold change."""
    # Query historical confidence scores for this category
    # Calculate what percentage would be escalated under the new threshold
    # Compare to the current escalation rate
    # Return projected review volume and cost impact
    pass
```

- Add threshold change audit trail — Enable DynamoDB Streams on the threshold table, processed by a Lambda that writes an audit log entry for every change and sends an SNS notification (a sketch follows this list).
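A minimal sketch of such a Streams-triggered audit Lambda. The SNS topic ARN, environment variable, and the assumption that the stream is configured with NEW_AND_OLD_IMAGES are illustrative, not taken from the production stack:

```python
# Sketch only — assumes the table's stream uses NEW_AND_OLD_IMAGES and that
# THRESHOLD_AUDIT_TOPIC_ARN is provided by the deployment.
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ.get("THRESHOLD_AUDIT_TOPIC_ARN", "")


def handler(event, context):
    """Publish an audit notification for every threshold table change."""
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"].get("NewImage", {})
        old_image = record["dynamodb"].get("OldImage", {})
        message = {
            "category": new_image.get("topic_category", {}).get("S"),
            "old_auto_respond_min": old_image.get("auto_respond_min", {}).get("N"),
            "new_auto_respond_min": new_image.get("auto_respond_min", {}).get("N"),
            "adjusted_by": new_image.get("adjusted_by", {}).get("S"),
        }
        print(json.dumps({"threshold_change": message}))  # lands in CloudWatch Logs
        if TOPIC_ARN:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject="Escalation threshold changed",
                Message=json.dumps(message),
            )
    return {"processed": len(event.get("Records", []))}
```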
Long-term Fixes (within 2 weeks):
- Implement per-category threshold governance — Each category has an owner who must approve changes to that category's thresholds.
- Deploy canary threshold testing — When a threshold change is proposed, it is first applied to 5% of traffic for 4 hours. If the escalation rate stays within acceptable bounds, it rolls out to 100%.
- Confidence score distribution dashboards — Histogram of confidence scores per category, updated hourly, visible to all team members. Makes it obvious what percentage of traffic falls in each confidence band.
Prevention
- Change control on threshold DynamoDB table — IAM policy restricts direct writes; changes must go through the approval workflow Lambda
- Impact simulation required — No threshold change can be deployed without an impact simulation showing projected escalation volume
- Canary rollout for all threshold changes — 5% traffic for 4 hours before full deployment
- Escalation rate circuit breaker — If escalation rate exceeds 30% for more than 10 minutes, auto-revert to previous thresholds
- Weekly threshold review meeting — ML and ops teams review threshold settings, escalation rates, and reviewer capacity
Scenario 4: Approval Workflow Timeout Leaving User Hanging
Problem Statement
Users are sending messages that trigger the human review workflow, receiving the "Your question is being reviewed" acknowledgment, and then hearing nothing. The Step Functions execution hits its 300-second timeout, but the HandleReviewTimeout Lambda function is failing silently due to a permissions error. The user's WebSocket connection remains open, but no response is ever delivered. From the user's perspective, the chatbot simply stopped responding.
This is particularly severe because it affects the most important messages — those the system deemed uncertain enough to require human review. These are the messages where user trust is most at stake.
Detection
Primary Alarms:
| Alarm | Metric | Threshold | Current Value |
|---|---|---|---|
| `StepFunctionsTimeoutRate` | Failed executions with `States.Timeout` | > 5% | 23% |
| `TimeoutHandlerErrors` | Lambda `manga-handle-timeout` error count | > 0 | 147 in last hour |
| `UndeliveredResponses` | Messages with review_ack but no final response | > 1% | 18% |
| `WebSocketIdleConnections` | Connections open > 5 min with no server message | > 100 | 342 |
Secondary Signals:
- CloudWatch Logs for manga-handle-timeout Lambda show AccessDeniedException on execute-api:ManageConnections
- Step Functions execution history shows executions entering HandleReviewTimeout state and then failing
- User session data in DynamoDB shows last_bot_message_type = review_acknowledgment with no subsequent messages
- WebSocket disconnect rate spiking (users closing the app after waiting)
Detection Query (CloudWatch Logs Insights for the Lambda):
```
fields @timestamp, @message
| filter @message like /AccessDeniedException/
| stats count() as error_count by bin(5m) as time_window
| sort time_window desc
| limit 20
```
Root Cause Analysis
Immediate Cause: A recent IAM policy update removed the execute-api:ManageConnections permission from the manga-handle-timeout Lambda's execution role. This permission is required to post messages back to WebSocket connections via the API Gateway Management API.
Contributing Factors:
- IAM policy change without impact analysis — A security team member tightened Lambda permissions as part of a least-privilege audit but did not identify the WebSocket callback dependency.
- No integration test for timeout path — The timeout handler was never tested end-to-end in staging because triggering a timeout requires waiting 300 seconds. The deployment pipeline skipped this path.
- Silent Lambda failure — The Lambda caught the `AccessDeniedException` in a broad `except` block and logged a warning instead of raising an alarm. The CloudWatch alarm was configured on Lambda invocation errors (which count unhandled exceptions), not on logged error patterns.
- No fallback delivery mechanism — If the WebSocket callback fails, there is no secondary delivery mechanism (email, push notification) to reach the user.
- Acknowledgment without guaranteed delivery — The system promised the user a response ("Your question is being reviewed") without a mechanism to ensure that promise is fulfilled.
Resolution
Immediate Actions (first 15 minutes):
- Restore the IAM permission — Add `execute-api:ManageConnections` back to the Lambda execution role:

```json
{
  "Effect": "Allow",
  "Action": "execute-api:ManageConnections",
  "Resource": "arn:aws:execute-api:ap-northeast-1:ACCOUNT:WS_API_ID/prod/POST/@connections/*"
}
```

- Retry failed timeout deliveries — Query DynamoDB for messages stuck in the `review_acknowledgment` state and attempt to deliver responses:

```python
import json
from datetime import datetime, timedelta, timezone

import boto3
from botocore.exceptions import ClientError


def retry_stuck_messages():
    """Find and deliver responses for messages stuck after timeout handler failure."""
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("manga-review-tasks")
    apigw = boto3.client(
        "apigatewaymanagementapi",
        endpoint_url="https://ws-api.manga-assist.example.com/prod",
    )

    # Find tasks that timed out in the last 2 hours
    two_hours_ago = (datetime.now(timezone.utc) - timedelta(hours=2)).isoformat()
    response = table.scan(
        FilterExpression="#s = :status AND created_at >= :cutoff",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={
            ":status": "timed_out",
            ":cutoff": two_hours_ago,
        },
    )

    delivered = 0
    failed = 0
    for task in response.get("Items", []):
        connection_id = task.get("connection_id")
        ai_response = task.get("ai_response")
        message_id = task.get("message_id")
        try:
            apigw.post_to_connection(
                ConnectionId=connection_id,
                Data=json.dumps({
                    "type": "delayed_response",
                    "messageId": message_id,
                    "response": ai_response,
                    "disclaimer": "We apologize for the delay. Here is our response.",
                    "disclaimerJa": "お待たせして申し訳ございません。以下が回答です。",
                }).encode("utf-8"),
            )
            delivered += 1
        except ClientError as e:
            if e.response["Error"]["Code"] == "GoneException":
                # Connection closed, need alternative delivery
                _send_fallback_notification(task)
            failed += 1
    return {"delivered": delivered, "failed_connection_gone": failed}


def _send_fallback_notification(task: dict):
    """Send the response via email/push if the WebSocket is closed."""
    ses = boto3.client("ses", region_name="ap-northeast-1")
    # Look up the user's email from session data
    # Send an email with the response
    pass
```

- Send apology messages to users whose connections are still open but who never received a response.
Short-term Fixes (within 48 hours):
- Fix the Lambda error handling — Replace the broad `except` with specific error handling that raises alarms:

```python
# BEFORE (bad): swallows every failure, so the invocation-error alarm never fires
try:
    apigw.post_to_connection(ConnectionId=conn_id, Data=data)
except Exception as e:
    logger.warning("Failed to deliver: %s", e)

# AFTER (correct): handle the expected case, re-raise everything else
try:
    apigw.post_to_connection(ConnectionId=conn_id, Data=data)
except apigw.exceptions.GoneException:
    logger.info("Connection %s is gone, using fallback delivery", conn_id)
    _send_fallback_notification(context)
except ClientError as e:
    logger.error("CRITICAL: Failed to deliver response: %s", e)
    raise  # Let Lambda report an invocation error -> triggers the alarm
```

- Add integration test for timeout path — Create a test that forces a timeout by using a short timeout value (10 seconds) and verifying the timeout handler delivers a response.

- Implement fallback delivery — When WebSocket delivery fails (connection gone), fall back to:
  - Push notification (if mobile app)
  - Email (if email address on file)
  - In-app notification (visible on next login)
Long-term Fixes (within 2 weeks):
- IAM change impact analysis tool — Before applying IAM policy changes, run a tool that identifies all resources and services that depend on the permissions being modified.
- Delivery guarantee system — Every acknowledgment creates a "delivery promise" record in DynamoDB. A watchdog Lambda runs every 60 seconds, finds promises older than 5 minutes without a corresponding delivery confirmation, and triggers the fallback delivery mechanism (see the sketch after this list).
- WebSocket heartbeat for pending reviews — While a message is under review, send periodic heartbeat messages to the user every 60 seconds: "Still working on your question... (1 minute elapsed)". This keeps the user informed and prevents them from assuming the bot crashed.
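A minimal sketch of the watchdog, assuming a hypothetical `manga-delivery-promises` table with `promised_at` and `delivered` attributes, swept on an EventBridge schedule; the table, its fields, and the `trigger_fallback_delivery` helper are assumptions:

```python
# Sketch only — the manga-delivery-promises table, its attributes, and the
# trigger_fallback_delivery helper are assumptions, not the production schema.
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
promises = dynamodb.Table("manga-delivery-promises")


def trigger_fallback_delivery(promise: dict) -> None:
    """Placeholder: push notification, email, or in-app notification."""
    print(f"fallback delivery for message {promise.get('message_id')}")


def watchdog_handler(event, context):
    """Find acknowledgments older than 5 minutes with no delivered response."""
    cutoff = (datetime.now(timezone.utc) - timedelta(minutes=5)).isoformat()
    response = promises.scan(
        FilterExpression=Attr("delivered").eq(False) & Attr("promised_at").lt(cutoff),
    )
    for promise in response.get("Items", []):
        trigger_fallback_delivery(promise)
        promises.update_item(
            Key={"message_id": promise["message_id"]},
            UpdateExpression="SET delivered = :t, delivery_channel = :ch",
            ExpressionAttributeValues={":t": True, ":ch": "fallback"},
        )
    return {"fallbacks_triggered": len(response.get("Items", []))}
```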
Prevention
- IAM policy change reviews — All IAM changes require a peer review that includes checking dependent service permissions
- End-to-end timeout path testing — Integration tests that exercise the full timeout flow in staging before every deployment
- Delivery guarantee watchdog — Automated monitoring of unfulfilled delivery promises
- Specific exception handling — No broad `except` blocks in critical delivery paths
- WebSocket heartbeats during long operations — Keep users informed during any operation that takes longer than 10 seconds
- Fallback delivery channels — Never rely solely on WebSocket for critical message delivery
Scenario 5: Feedback Data Poisoning Skewing Model Behavior
Problem Statement
Over a 6-week period, the feedback data pipeline has been ingesting corrupted data that is systematically biasing the chatbot's behavior. Unlike Scenario 2 (which involved external adversaries), this is an internal data quality issue. A bug in the feedback collection Lambda is misattributing feedback signals: thumbs-up signals on messages about product availability are being recorded with the topic category "refund" due to an incorrect DynamoDB GSI query. This means the prompt improvement pipeline sees the "refund" category as having unusually high satisfaction, while "availability" appears to have unusually low satisfaction.
The result: the system prompt has been progressively weakened for availability queries (because the pipeline thinks users are unhappy) and strengthened for refund queries (because it thinks users are delighted). In reality, availability response quality has degraded by 30% while refund handling has not actually improved.
Detection
Primary Signals:
| Indicator | Expected | Actual | Issue |
|---|---|---|---|
| Availability query positive rate | 78% (historical) | 48% (reported) | Artificially low — misattributed |
| Refund query positive rate | 65% (historical) | 89% (reported) | Artificially high — misattributed |
| Availability prompt version changes | 1-2 per month | 6 in 6 weeks | Excessive iteration on false signal |
| Customer complaints about availability answers | ~20/week | 85/week | Real degradation from bad prompt changes |
| Refund resolution time | 4.2 min | 4.1 min | No real improvement despite "positive" signal |
Secondary Signals:
- A/B test on availability prompt changes shows worse performance, contradicting the feedback data
- Manual spot-check of feedback records reveals category mismatches
- DynamoDB GSI scan shows topic_category values that do not match the original message's topic
Detection Query (cross-reference feedback with original messages):
```sql
SELECT
    f.feedback_id,
    f.message_id,
    f.topic_category AS feedback_topic,
    m.topic_category AS original_topic,
    CASE WHEN f.topic_category != m.topic_category THEN 'MISMATCH' ELSE 'OK' END AS status
FROM manga_feedback f
JOIN manga_messages m ON f.message_id = m.message_id
WHERE f.timestamp >= DATE_ADD('week', -6, NOW())
  AND f.feedback_type = 'binary'
ORDER BY f.timestamp DESC
LIMIT 1000;
```
Root Cause Analysis
Immediate Cause: A bug in the feedback collection Lambda's DynamoDB query. When looking up the original message to enrich the feedback record with topic category, the Lambda was using a GSI that returned results sorted by timestamp descending. Under high concurrency, the query occasionally returned the wrong message (a recent message from a different session) instead of the message being rated.
The Bug:
```python
# BUGGY CODE — uses a GSI that can return wrong results under race conditions
def _enrich_feedback_with_topic(self, message_id: str) -> str:
    """Look up the topic category for the original message."""
    response = self.messages_table.query(
        IndexName="message-timestamp-index",  # WRONG INDEX
        KeyConditionExpression="message_id = :mid",
        ExpressionAttributeValues={":mid": message_id},
        ScanIndexForward=False,
        Limit=1,
    )
    items = response.get("Items", [])
    if items:
        return items[0].get("topic_category", "general")
    return "general"
```
The message-timestamp-index GSI has message_id as its partition key but timestamp as its sort key. Under high write concurrency, eventually consistent reads sometimes return a stale or incorrect item. The correct approach is to use a GetItem on the base table with the exact primary key.
Contributing Factors:
- Eventually consistent read on a GSI — GSIs in DynamoDB only support eventually consistent reads. Under the 1M messages/day load, the window of inconsistency is small but frequent enough to corrupt ~5% of feedback records.
- No data validation in the feedback pipeline — The pipeline did not verify that the topic category in the feedback record matched the original message's topic.
- No automated data quality checks — No scheduled job compared feedback topic distributions against message topic distributions to detect drift.
- Prompt changes applied without cross-validation — When the prompt improvement pipeline detected "low satisfaction" in a category, it changed the prompt without verifying the underlying data quality.
- Six-week accumulation — The bug was introduced in a Lambda deployment 6 weeks ago but was not detected because it manifested as gradual drift rather than a sudden break.
Resolution
Immediate Actions (first 2 hours):
- Fix the Lambda bug — Replace the GSI query with a direct `GetItem`:

```python
# FIXED CODE — uses a direct GetItem on the base table for strong consistency
def _enrich_feedback_with_topic(self, message_id: str) -> str:
    """Look up the topic category for the original message."""
    response = self.messages_table.get_item(
        Key={"message_id": message_id},
        ConsistentRead=True,
        ProjectionExpression="topic_category",
    )
    item = response.get("Item")
    if item:
        return item.get("topic_category", "general")
    return "general"
```

- Roll back prompt changes — Restore the system prompts for "availability" and "refund" categories to the versions from 6 weeks ago.

- Freeze the prompt improvement pipeline — Pause automated prompt updates until data quality is verified.
Short-term Fixes (within 1 week):
- Backfill corrected topic categories — Run a batch job to re-enrich all feedback records from the last 6 weeks:

```python
from datetime import datetime, timedelta, timezone

import boto3


def backfill_feedback_topics():
    """Re-enrich all feedback records with correct topic categories."""
    dynamodb = boto3.resource("dynamodb")
    feedback_table = dynamodb.Table("manga-feedback")
    messages_table = dynamodb.Table("manga-messages")

    six_weeks_ago = (datetime.now(timezone.utc) - timedelta(weeks=6)).isoformat()

    # Scan feedback records from the affected period
    response = feedback_table.scan(
        FilterExpression="#ts >= :cutoff",
        ExpressionAttributeNames={"#ts": "timestamp"},
        ExpressionAttributeValues={":cutoff": six_weeks_ago},
    )

    corrected = 0
    for item in response.get("Items", []):
        message_id = item.get("message_id")
        msg_response = messages_table.get_item(
            Key={"message_id": message_id},
            ConsistentRead=True,
            ProjectionExpression="topic_category",
        )
        msg_item = msg_response.get("Item")
        if msg_item:
            correct_topic = msg_item.get("topic_category", "general")
            if correct_topic != item.get("topic_category"):
                feedback_table.update_item(
                    Key={"feedback_id": item["feedback_id"]},
                    UpdateExpression=(
                        "SET topic_category = :correct, "
                        "original_incorrect_topic = :incorrect, "
                        "backfill_corrected = :true_val"
                    ),
                    ExpressionAttributeValues={
                        ":correct": correct_topic,
                        ":incorrect": item.get("topic_category"),
                        ":true_val": True,
                    },
                )
                corrected += 1
    return {"records_scanned": len(response.get("Items", [])), "corrected": corrected}
```

- Regenerate feedback analytics — Re-run Athena queries on the corrected data to produce accurate satisfaction metrics per category.

- Re-run prompt improvement with corrected data — With clean data, generate new prompt improvement recommendations and apply them through the approval workflow.
Long-term Fixes (within 1 month):
- Add data quality validation to the feedback pipeline:

```python
def validate_feedback_enrichment(feedback_record: dict, original_message: dict) -> bool:
    """Validate that feedback enrichment matches the original message."""
    checks = [
        feedback_record.get("topic_category") == original_message.get("topic_category"),
        feedback_record.get("message_id") == original_message.get("message_id"),
        feedback_record.get("session_id") == original_message.get("session_id"),
    ]
    return all(checks)
```

- Implement automated data quality monitoring:
| Check | Frequency | Alert Condition |
|---|---|---|
| Topic distribution drift | Hourly | Feedback topic distribution diverges from message topic distribution by > 10% |
| Feedback-message join rate | Hourly | < 95% of feedback records successfully join to their original message |
| Category satisfaction anomaly | Daily | Any category's satisfaction rate changes by > 15% in 7 days |
| Cross-table consistency | Daily | Random sample of 1000 feedback records checked for topic match |
- Add schema validation at write time — Before writing a feedback record to DynamoDB, validate that the topic_category exists in the allowed list and matches the original message.
- Deploy data lineage tracking — Track the provenance of every data point that feeds into the prompt improvement pipeline, making it possible to trace a prompt change back to the specific feedback records that motivated it.
Prevention
- Strong consistency reads for enrichment lookups — Always use `ConsistentRead=True` or a direct `GetItem` on the base table when enriching records with data from other tables
- Data quality validation at write time — Validate enriched fields before persisting
- Automated data quality monitoring — Scheduled checks for distribution drift and cross-table consistency
- Prompt improvement pipeline data quality gate — Pipeline refuses to process data that fails quality checks
- Unit tests for DynamoDB query correctness — Test that the correct index and query pattern is used for each lookup
- Canary feedback records — Inject known-correct feedback records periodically and verify they are processed correctly end-to-end
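A minimal sketch of the canary check, assuming a known canary message planted in `manga-messages` with topic "availability"; the IDs, the `message-feedback-index` GSI, and the custom metric are illustrative assumptions:

```python
# Sketch only — the canary IDs, GSI name, and metric namespace are assumptions.
import boto3

dynamodb = boto3.resource("dynamodb")
cloudwatch = boto3.client("cloudwatch")

CANARY_MESSAGE_ID = "canary-availability-0001"  # planted with topic "availability"
EXPECTED_TOPIC = "availability"


def canary_check_handler(event, context):
    """Verify the feedback pipeline attributed the canary record to the right topic."""
    feedback_table = dynamodb.Table("manga-feedback")
    response = feedback_table.query(
        IndexName="message-feedback-index",  # hypothetical GSI keyed by message_id
        KeyConditionExpression="message_id = :mid",
        ExpressionAttributeValues={":mid": CANARY_MESSAGE_ID},
    )
    mismatches = sum(
        1 for item in response.get("Items", [])
        if item.get("topic_category") != EXPECTED_TOPIC
    )
    # Emit a metric an alarm can watch; any nonzero value means misattribution
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/DataQuality",
        MetricData=[{"MetricName": "CanaryTopicMismatch", "Value": mismatches}],
    )
    return {"canary_feedback_records": len(response.get("Items", [])), "mismatches": mismatches}
```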
Cross-Scenario Summary
Common Themes
| Theme | Scenarios | Key Lesson |
|---|---|---|
| Change control | 1, 3, 4 | All configuration changes (thresholds, IAM, prompts) need approval workflows |
| Data quality | 2, 5 | Feedback data pipelines need validation, provenance scoring, and anomaly detection |
| Fallback mechanisms | 1, 4 | Every human-dependent path needs an automated fallback when humans are unavailable |
| Gradual rollout | 2, 3 | Threshold and prompt changes must be canary-tested before full deployment |
| Monitoring depth | All | Surface-level metrics (error counts) are insufficient; need semantic validation of data flows |
Escalation Checklist for On-Call Engineers
When a collaborative AI system alarm fires:
1. [ ] Identify which component is affected (review queue, feedback pipeline, threshold config, approval workflow, data quality)
2. [ ] Check queue depth and SLA compliance
3. [ ] Verify reviewer availability and capacity
4. [ ] Check for recent configuration changes (thresholds, IAM, deployments)
5. [ ] Verify feedback pipeline data quality (topic distribution, join rates)
6. [ ] Check Step Functions execution status for stuck or timed-out executions
7. [ ] Validate WebSocket delivery (are responses reaching users?)
8. [ ] If queue backlog: activate emergency threshold relaxation
9. [ ] If data quality issue: freeze prompt improvement pipeline
10. [ ] If delivery failure: check Lambda permissions and run retry job
11. [ ] Document timeline and root cause for post-incident review
Exam Relevance — AIP-C01 Skill 2.1.5
These scenarios test understanding of:
- Step Functions task token lifecycle — What happens when tokens expire, how heartbeats prevent premature timeout, and how `SendTaskSuccess`/`SendTaskFailure` resume execution (see the sketch after this list).
- Human-in-the-loop failure modes — The exam expects you to know that human availability is not guaranteed and systems must have fallback mechanisms.
- Feedback loop integrity — Understanding that feedback data can be corrupted, poisoned, or misattributed, and that validation is required.
- Threshold calibration — The tradeoff between too much escalation (reviewer overload, cost) and too little (poor-quality responses reaching users).
- Data pipeline quality — DynamoDB consistency models, GSI limitations, and the importance of strong consistency for enrichment lookups.
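A minimal sketch of how a reviewer-facing worker could keep a `waitForTaskToken` task alive and then complete it. How the token reaches the worker (for example via SQS) is assumed, not taken from the MangaAssist codebase:

```python
# Sketch only — assumes the Step Functions state was started with .waitForTaskToken
# and that the task token was handed to this worker out of band (e.g., via SQS).
import json

import boto3

sfn = boto3.client("stepfunctions")


def keep_review_alive(task_token: str) -> None:
    """Reset the state's HeartbeatSeconds timer so the review is not timed out."""
    sfn.send_task_heartbeat(taskToken=task_token)


def complete_review(task_token: str, approved: bool, edited_response: str | None = None) -> None:
    """Resume the paused execution with the reviewer's decision."""
    if approved:
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"decision": "approve", "editedResponse": edited_response}),
        )
    else:
        sfn.send_task_failure(
            taskToken=task_token,
            error="ReviewRejected",
            cause="Reviewer rejected the AI response",
        )
```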