03 — OrderStatusAgent
Real-time order tracking, inventory checks, and return initiation. Backed by RDS as source of truth, ElastiCache for hot reads.
The OrderStatusAgent handles every message that touches the customer's transactions: "Where's my package?", "Is volume 5 in stock?", "I want to return this." It's the most operationally sensitive sub-agent — wrong answers here cost real money or burn customer trust.
What it is
A logical sub-agent that exposes order- and inventory-shaped tools to the Orchestrator. It owns:
- Order status retrieval — order ID → current state, tracking, ETA
- Inventory checks — ASIN → in stock / low stock / out of stock + restock ETA
- Return eligibility — order ID + ASIN → can be returned? policy applied?
- Return initiation — generates RMA, creates a return shipping label
Backed by the Order & Inventory MCP (../RAG-MCP-Integration/03-order-inventory-mcp.md), which fronts a PostgreSQL RDS cluster (orders + inventory transactional data) with an ElastiCache Redis read-through cache.
Tools exposed to the Orchestrator
| Tool | Purpose | Mutation? |
|---|---|---|
| get_order_status(order_id) | Current status, tracking, ETA | Read-only |
| check_stock(asin, region) | Current inventory level | Read-only |
| get_recent_orders(user_id, limit) | Last N orders for context | Read-only |
| check_refund_eligibility(order_id, asin) | Can this be returned? | Read-only |
| initiate_return(order_id, asin, reason) | Create RMA + shipping label | Write |
The single write operation (initiate_return) is the only place in the entire chatbot that mutates customer data. It is therefore protected by an additional confirmation flow — see Failure handling below.
Data freshness model
Order data has different freshness requirements by field:
| Field | Source of truth | Cache TTL | Why |
|---|---|---|---|
| Order placed | RDS | 1 hour | Immutable once placed |
| Order status (e.g. "shipped") | RDS | 5 minutes | Updated by fulfillment events |
| Tracking events | Carrier API → RDS | 15 minutes | Carriers update slowly; over-polling is rate-limited |
| Inventory level | RDS | 60 seconds | Sales drop stock fast |
| Return eligibility | Computed on read | 0 (no cache) | Depends on current date vs. order date |
Each field is cached independently. A request for order status returns partially fresh data, with each field tagged by its as-of timestamp.
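The per-field caching model above can be sketched as follows. The TTL table mirrors the freshness table; the FieldCache class is illustrative, not the real ElastiCache client:

```python
import time

# Hypothetical per-field TTL table in seconds, mirroring the freshness model above.
FIELD_TTL = {
    "order_placed": 3600,
    "order_status": 300,
    "tracking_events": 900,
    "inventory_level": 60,
}

class FieldCache:
    """Caches each field independently and tags reads with an as-of timestamp."""

    def __init__(self):
        self._store = {}  # (key, field) -> (value, stored_at)

    def get(self, key, field):
        entry = self._store.get((key, field))
        if entry is None:
            return None  # never cached
        value, stored_at = entry
        if time.time() - stored_at > FIELD_TTL[field]:
            return None  # expired: caller falls through to RDS
        return {"value": value, "as_of": stored_at}

    def set(self, key, field, value):
        self._store[(key, field)] = (value, time.time())

cache = FieldCache()
cache.set("ORD-12345", "order_status", "shipped")
hit = cache.get("ORD-12345", "order_status")     # fresh: value + as-of timestamp
miss = cache.get("ORD-99999", "order_status")    # never cached: None
```

Because each (key, field) pair expires independently, one response can legitimately carry a 5-minute-old status next to a 60-second-old stock level.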
Read path
Orchestrator: get_order_status("ORD-12345")
│
▼
ElastiCache lookup: order:ORD-12345
│
┌────┴────┐
▼ ▼
HIT (5ms) MISS
│ │
│ ▼
│ RDS read replica query (~50ms)
│ │
│ ▼
│ Write to cache, TTL by field
│ │
▼ ▼
Return structured order with as-of timestamps per field
Cache hits dominate by ~10:1 in steady state. The numbers above assume a warm cache.
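The read path above reduces to a read-through lookup. A minimal sketch, with plain callables standing in for the ElastiCache and RDS clients:

```python
def get_order_status(order_id, cache, rds_query):
    """Read-through lookup: try the cache first, fall back to the RDS read
    replica (here a plain callable), then populate the cache for the next caller."""
    key = f"order:{order_id}"
    record = cache.get(key)
    if record is None:           # MISS: pay the ~50ms replica query
        record = rds_query(order_id)
        cache[key] = record      # SET (with per-field TTLs in the real path)
    return record

# Stand-ins for the replica and ElastiCache:
fake_rds = {"ORD-12345": {"status": "shipped", "eta": "2 days"}}
store = {}
first = get_order_status("ORD-12345", store, fake_rds.__getitem__)   # miss path
second = get_order_status("ORD-12345", store, fake_rds.__getitem__)  # hit path
```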
Write path: return initiation
The only write operation. Multi-step:
- Validate — refund eligibility, ownership, return window
- Generate — RMA number, shipping label via UPS API
- Persist — write to RDS in a transaction (return record + order status update)
- Invalidate — drop affected cache entries
- Emit event — publish to SNS for downstream (refund processor, inventory restock)
This is wrapped in a saga pattern: each step has a compensating action. If UPS label generation fails after RDS write, we publish a return_initiation_failed event that triggers cleanup.
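The saga shape described above can be sketched as a list of (action, compensation) pairs. The runner below is an illustration of the pattern, not the production code, which drives compensation through the return_initiation_failed event:

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order; on any failure, run the
    compensations of already-completed steps in reverse, then re-raise.
    Each compensation must be idempotent."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        raise

log = []

def rds_write():           # stand-in for the RDS transaction step
    log.append("rds_write")

def rds_rollback():        # its compensating action
    log.append("rds_rollback")

def ups_label():           # stand-in for label generation, failing for the demo
    raise RuntimeError("UPS timeout")

try:
    run_saga([(rds_write, rds_rollback), (ups_label, lambda: None)])
except RuntimeError:
    log.append("error_surfaced")
```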
The Orchestrator confirms with the user before calling initiate_return:
User: "I want to return Berserk volume 42."
Orchestrator: [calls check_refund_eligibility]
Orchestrator → User: "I can process a return. You'll get a $19.99 refund
to the original card. Shall I proceed?"
User: "Yes."
Orchestrator: [now calls initiate_return]
The two-step gate is a deliberate friction point. Past incidents showed that Claude sometimes interprets "I'm returning this" (a status remark) as a return request; without the confirmation step, that misread would trigger an unwanted RMA.
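The gate can be sketched as precondition checks on the write tool. Function and parameter names here are hypothetical, not the real Orchestrator API:

```python
def guarded_initiate_return(eligibility_checked, user_confirmed, do_return):
    """Refuse the write unless eligibility was checked this conversation AND
    the user gave an explicit confirmation."""
    if not eligibility_checked:
        raise PermissionError("call check_refund_eligibility first")
    if not user_confirmed:
        raise PermissionError("explicit user confirmation required")
    return do_return()

# Happy path: both preconditions met.
result = guarded_initiate_return(True, True, lambda: {"rma": "RMA-0001"})

# "I'm returning this" without a confirmed "yes" is blocked:
blocked = False
try:
    guarded_initiate_return(True, False, lambda: {"rma": "RMA-0002"})
except PermissionError:
    blocked = True
```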
State management
The agent itself is stateless. State lives in:
- RDS — source of truth (orders, returns, inventory)
- ElastiCache — hot read cache, TTL by field
- DynamoDB — session-scoped order context (recent order IDs the user mentioned, so re-mentions resolve)
The session-scoped piece matters: if the user says "Yes, return that one" two turns later, the Orchestrator needs to know which order ID was active. That entity-pinning is in DynamoDB session state (08-memory-architecture.md).
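The entity-pinning idea can be sketched with a plain dict standing in for the DynamoDB session record (field names are illustrative):

```python
def mention_order(session, order_id):
    """Pin the order the user just referenced (persisted to DynamoDB in practice)."""
    session["active_order_id"] = order_id

def resolve_reference(session):
    """Resolve 'that one' to the pinned order, or signal that clarification is needed."""
    if session.get("active_order_id") is None:
        raise LookupError("no active order in session; ask the user to clarify")
    return session["active_order_id"]

session = {"session_id": "sess-42", "active_order_id": None}
mention_order(session, "ORD-12345")
# two turns later: "Yes, return that one"
resolved = resolve_reference(session)
```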
Failure handling
| Failure | Detection | Recovery |
|---|---|---|
| Cache miss + RDS slow (>500ms) | Latency probe | Return cached if exists with stale: true flag, else fail |
| RDS read replica down | Connection error | Failover to primary; if primary down, return last-known with disclaimer |
| UPS label API timeout | 5s timeout on write path | Saga compensation: rollback RDS, return error to user, retry async |
| Invalid order ID | Schema validation | 400 with reason, no retry |
| Eligibility check fails | Business rule violation | Return structured not_eligible with reason (past window, damaged, etc.) |
| Concurrent return on same order | DB unique constraint | Detect 23505 error code, return idempotent success if RMA already exists |
| Stock check during flash sale | Cache invalidation lag | Return cached + warning "stock changes rapidly" |
The flash sale case is the worst failure mode — telling a customer something is in stock when it's already sold out. Mitigation: during declared flash-sale windows, drop stock cache TTL to 5s and add a note in the response.
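The mitigation reduces to TTL selection plus the user-facing note. A minimal sketch; how the flash_sale_active flag actually gets set is left open here, as it is in the architecture:

```python
def stock_response(level, flash_sale_active):
    """Build a stock answer with a dynamic cache TTL: 60s normally, 5s during
    a declared flash-sale window, plus the staleness warning for the user."""
    resp = {
        "in_stock": level > 0,
        "level": level,
        "cache_ttl_s": 5 if flash_sale_active else 60,
    }
    if flash_sale_active:
        resp["note"] = "stock changes rapidly"
    return resp

normal = stock_response(3, flash_sale_active=False)
sale = stock_response(3, flash_sale_active=True)
```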
Latency budget
Target: P99 < 300ms for read tools, P99 < 2s for initiate_return.
Read path (cache hit):
Cache GET 5ms
Format 10ms
─────
Total 15ms (P50)
50ms (P99 with network jitter)
Read path (cache miss):
Cache GET 5ms
RDS query 50ms
Cache SET 5ms
Format 10ms
─────
Total 70ms (P50)
200ms (P99)
Write path (initiate_return):
Eligibility (RDS) 50ms
RMA generation 20ms
UPS label API 800ms (external; this dominates)
RDS transaction 100ms
SNS publish 30ms
Cache invalidate 10ms
─────
Total ~1010ms (P50)
~2500ms (P99)
The write-path P99 of 2.5s exceeds the chatbot's overall 3s budget. This is why returns are usually a multi-turn flow — the initial "shall I proceed?" question buys back latency budget on the actual write turn.
Why this shape
| Alternative | Why we rejected it |
|---|---|
| DynamoDB instead of RDS | Order data is relational (orders → line items → returns); ACID requirements for financial data |
| No cache layer (always RDS) | RDS read replica P99 ~80ms; under chatbot QPS this stacks badly. Cache hit at 5ms is necessary. |
| LLM-generated SQL queries | Catastrophic risk (PII leak, data corruption). All queries are templated. |
| Auto-confirm returns from natural language | Past incidents: ambiguous user intent → unintended RMA. Two-step confirmation is non-negotiable. |
| Cache stock indefinitely | Stock changes too fast during sales; 60s TTL is the loosest acceptable. |
Validation: Constraint Sanity Check
| Claimed metric | Verdict | Why |
|---|---|---|
| "Real-time freshness" stated requirement | Marketing language, not a metric | Real-time means different things in different contexts. With 60s inventory TTL and 5-min order TTL, this is near-real-time at best. The doc should drop "real-time" and use "≤ 60s freshness for inventory, ≤ 5min for order status." |
| Inventory cache TTL 60s | Acceptable except during flash sales | During a flash sale, 60s of stale "in stock" can mean hundreds of orders for unavailable items. Dynamic TTL adjustment (5s during flash sale windows) is mentioned but the trigger mechanism isn't defined. Who declares "flash sale on"? Until that's specified, the protection is theoretical. |
| Read P99 < 300ms | Realistic at warm cache | Holds for cache hits (P99 ~50ms). Cache-miss P99 (~200ms) requires RDS read replica that doesn't degrade under chatbot QPS — needs sustained-load testing data, not just steady-state numbers. |
| Write P99 < 2s for initiate_return | Dependent on UPS API | UPS label generation is external and not under our SLA control. UPS API P99 in practice is 600ms–3s. We can hit 2s most of the time but not contractually. Restate as best-effort. |
| Cache hit ratio "10:1" | Unsourced claim | Where does this come from? At Amazon scale with cold long-tail orders, the actual ratio depends on access pattern. Needs measurement, not assertion. |
| RDS for orders at scale | Architectural risk at growth | Amazon's actual order volume globally is hundreds of millions per day. A single RDS Postgres won't scale to that. Plausible at manga-store scope; would need partitioning (by user_id or order_id range) at higher volume. |
| Two-step confirmation for writes | Sound, but adds latency | The pattern adds a full conversation turn (~3s) before any return can be initiated. That's correct for safety, but eats into "fast support" goals. Acceptable tradeoff; should be acknowledged. |
| Saga compensation on write failures | Correct pattern, hard to verify | Saga correctness depends on every compensating action being idempotent and tested. No mention of how this is verified — chaos testing? Integration tests? Without a test plan this is paper protection. |
The flash sale problem
The architecture says "during declared flash-sale windows, drop stock cache TTL to 5s." This is a configuration-driven mitigation that depends on:
1. Someone (or something) declaring the flash sale window
2. The OrderStatusAgent reading that flag on every request
3. The dynamic TTL actually being applied
None of (1)–(3) is specified. In practice, flash sales often start when an item unexpectedly goes viral — there's no human in the loop to flip a flag. The robust solution is event-driven: subscribe to inventory-change events from the fulfillment service and invalidate cache on each change. That's significant engineering and isn't in the architecture as documented.
The current architecture has a soft target ("near real-time") and a hard data freshness limit (60s). During flash sales the 60s window is too long; during normal operation it's fine. Either accept the flash-sale risk explicitly (and tell users "stock may be stale during sales") or build the event-driven invalidation. Don't claim "real-time" while doing neither.
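The event-driven alternative described above can be sketched as a small invalidation handler; the event shape and cache key format are assumptions:

```python
def on_inventory_change(event, cache):
    """Drop the cached stock entry the moment the fulfillment service reports
    a change, so the next read falls through to RDS instead of waiting out a TTL."""
    key = f"stock:{event['asin']}:{event['region']}"
    cache.pop(key, None)  # no-op if the entry was never cached

# Stand-in cache with one warm entry; an inventory-change event evicts it.
store = {"stock:B0EXAMPLE:us": {"level": 3}}
on_inventory_change({"asin": "B0EXAMPLE", "region": "us"}, store)
```

This trades TTL tuning for an event subscription, which is why the text calls it significant engineering: correctness now depends on the fulfillment service reliably emitting every change.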
The ACID claim and the cache
RDS is chosen for ACID guarantees. But the moment we read through a cache, ACID is broken from the chatbot's perspective — we may serve stale data that contradicts what RDS would return. That's fine for read-only paths (the user sees a slightly stale order status), but it must not feed into write decisions. Specifically: check_refund_eligibility should always hit RDS directly, never the cache. The doc doesn't make this explicit.
The "no LLM-generated SQL" rule is right and load-bearing
Worth highlighting because it's easy to lose: every SQL query in this agent is templated. The LLM never gets to write SQL. This is one of the few non-negotiable safety rails in the system; if it ever drifts (e.g., someone adds a "natural language reporting" feature later), the security posture changes materially.
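The templated-query rule can be sketched as an allowlist: the LLM may only name a template and supply values for driver-side binding. Template text and names below are illustrative, not the real schema:

```python
# Allowlisted query templates; the LLM picks a name and supplies values only.
TEMPLATES = {
    "order_status": "SELECT status, tracking FROM orders WHERE order_id = %s",
    "stock_level": "SELECT level FROM inventory WHERE asin = %s AND region = %s",
}

def build_query(name, params):
    """Return (sql, params) for driver-side parameter binding; anything outside
    the allowlist is rejected, so free-form SQL can never reach the database."""
    if name not in TEMPLATES:
        raise KeyError(f"no such template: {name}")
    return TEMPLATES[name], tuple(params)

sql, params = build_query("order_status", ["ORD-12345"])

rejected = False
try:
    build_query("SELECT * FROM users", [])  # free-form SQL is not a template name
except KeyError:
    rejected = True
```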
Related documents
- 01-orchestrator-agent.md — Calling pattern from the supervisor
- 07-failure-handling.md — Saga compensation pattern in detail
- ../RAG-MCP-Integration/03-order-inventory-mcp.md — MCP internals
- ../12-security-privacy.md — PII handling for order data