
03 — OrderStatusAgent

Real-time order tracking, inventory checks, and return initiation. Backed by RDS as source of truth, ElastiCache for hot reads.

The OrderStatusAgent handles every message that touches the customer's transactions: "Where's my package?", "Is volume 5 in stock?", "I want to return this." It's the most operationally sensitive sub-agent — wrong answers here cost real money or burn customer trust.


What it is

A logical sub-agent that exposes order- and inventory-shaped tools to the Orchestrator. It owns:

  1. Order status retrieval — order ID → current state, tracking, ETA
  2. Inventory checks — ASIN → in stock / low stock / out of stock + restock ETA
  3. Return eligibility — order ID + ASIN → can be returned? policy applied?
  4. Return initiation — generates RMA, creates a return shipping label

Backed by the Order & Inventory MCP (../RAG-MCP-Integration/03-order-inventory-mcp.md), which fronts a PostgreSQL RDS cluster (orders + inventory transactional data) with an ElastiCache Redis read-through cache.


Tools exposed to the Orchestrator

| Tool | Purpose | Mutation? |
|------|---------|-----------|
| get_order_status(order_id) | Current status, tracking, ETA | Read-only |
| check_stock(asin, region) | Current inventory level | Read-only |
| get_recent_orders(user_id, limit) | Last N orders for context | Read-only |
| check_refund_eligibility(order_id, asin) | Can this be returned? | Read-only |
| initiate_return(order_id, asin, reason) | Create RMA + shipping label | Write |

The single write operation (initiate_return) is the only place in the entire chatbot that mutates customer data. It is therefore protected by an additional confirmation flow; see the write path below.
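A sketch of the tool surface as it might be declared. The names and signatures come from the table above; the registry decorator and the mutating flag are illustrative, not the actual MCP SDK:

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical registry sketch; the real tools are wired through the Order &
# Inventory MCP server's own SDK. The point is the shape: four reads, one
# gated write.
TOOLS: Dict[str, "Tool"] = {}

@dataclass
class Tool:
    fn: Callable
    mutating: bool = False  # write tools get the extra confirmation flow

def tool(mutating: bool = False):
    def register(fn: Callable) -> Callable:
        TOOLS[fn.__name__] = Tool(fn, mutating)
        return fn
    return register

@tool()
def get_order_status(order_id: str) -> dict: ...

@tool()
def check_stock(asin: str, region: str) -> dict: ...

@tool()
def get_recent_orders(user_id: str, limit: int = 5) -> list: ...

@tool()
def check_refund_eligibility(order_id: str, asin: str) -> dict: ...

@tool(mutating=True)  # the only write in the entire chatbot
def initiate_return(order_id: str, asin: str, reason: str) -> dict: ...
```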


Data freshness model

Order data has different freshness requirements by field:

| Field | Source of truth | Cache TTL | Why |
|-------|-----------------|-----------|-----|
| Order placed | RDS | 1 hour | Immutable once placed |
| Order status (e.g. "shipped") | RDS | 5 minutes | Updated by fulfillment events |
| Tracking events | Carrier API → RDS | 15 minutes | Carriers update slowly; over-polling is rate-limited |
| Inventory level | RDS | 60 seconds | Sales drop stock fast |
| Return eligibility | Computed on read | 0 (no cache) | Depends on current date vs. order date |

Each field is cached independently. A request for order status returns partially fresh data, with each field tagged by its as-of timestamp.
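Illustratively, a structured response might look like the following. The doc specifies only that each field carries an as-of timestamp, so the exact shape and values here are assumptions:

```python
# Hypothetical response shape: each field carries its own as-of timestamp,
# so the Orchestrator can see exactly how stale every piece is. Field names
# and values are illustrative.
order_status = {
    "order_id": "ORD-12345",
    "status":   {"value": "shipped",    "as_of": "2025-01-07T18:42:10Z"},  # 5 min TTL
    "tracking": {"value": "in_transit", "as_of": "2025-01-07T18:30:00Z"},  # 15 min TTL
    "placed":   {"value": "2025-01-05", "as_of": "2025-01-05T09:14:00Z"},  # 1 h TTL
}
```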


Read path

Orchestrator: get_order_status("ORD-12345")
        │
        ▼
ElastiCache lookup: order:ORD-12345
        │
   ┌────┴────┐
   ▼         ▼
HIT (5ms)   MISS
   │         │
   │         ▼
   │     RDS read replica query (~50ms)
   │         │
   │         ▼
   │     Write to cache, TTL by field
   │         │
   ▼         ▼
Return structured order with as-of timestamps per field

Cache hits dominate by ~10:1 in steady state. The numbers above assume a warm cache.
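A minimal read-through sketch, assuming a redis-py client, a hypothetical cache endpoint, and a fetch_from_rds placeholder standing in for the templated replica query (production would also store the as-of timestamp alongside each value):

```python
import json
import redis

# Per-field TTLs from the freshness table above, in seconds.
FIELD_TTL = {"placed": 3600, "status": 300, "tracking": 900}

r = redis.Redis(host="orders-cache.example.internal")  # hypothetical endpoint

def get_order_status(order_id: str, fetch_from_rds) -> dict:
    """Read-through: serve each field from cache, fall back to the replica."""
    out = {}
    missing = []
    for fld in FIELD_TTL:
        cached = r.get(f"order:{order_id}:{fld}")
        if cached is not None:
            out[fld] = json.loads(cached)        # cache hit, ~5ms
        else:
            missing.append(fld)
    if missing:
        row = fetch_from_rds(order_id, missing)  # templated query, ~50ms
        for fld in missing:
            out[fld] = row[fld]
            r.setex(f"order:{order_id}:{fld}", FIELD_TTL[fld], json.dumps(row[fld]))
    return out
```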


Write path: return initiation

The only write operation. Multi-step:

  1. Validate — refund eligibility, ownership, return window
  2. Generate — RMA number, shipping label via UPS API
  3. Persist — write to RDS in a transaction (return record + order status update)
  4. Invalidate — drop affected cache entries
  5. Emit event — publish to SNS for downstream (refund processor, inventory restock)

This is wrapped in a saga pattern: each step has a compensating action. If UPS label generation fails after RDS write, we publish a return_initiation_failed event that triggers cleanup.
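A minimal sketch of the saga runner, following the step order listed above; the step wiring, names, and event helper are placeholders rather than the production implementation:

```python
from typing import Callable, Dict, List, Tuple

def publish_event(name: str, ctx: Dict) -> None:
    """Placeholder for the SNS publish in step 5 / the failure event."""
    ...

# Each step pairs a forward action with a compensating action.
Step = Tuple[str, Callable[[Dict], None], Callable[[Dict], None]]

def run_saga(steps: List[Step], ctx: Dict) -> None:
    done: List[Step] = []
    for step in steps:
        _name, do, _undo = step
        try:
            do(ctx)
            done.append(step)
        except Exception:
            # Unwind committed steps in reverse. Compensations must be
            # idempotent for this to be safe under retries (see the
            # validation table below).
            for _, _, undo in reversed(done):
                undo(ctx)
            publish_event("return_initiation_failed", ctx)
            raise

# Illustrative wiring for initiate_return (all callables are placeholders):
# run_saga([
#     ("generate_rma", create_rma,       void_rma),
#     ("ups_label",    create_ups_label, void_ups_label),
#     ("persist",      write_rds_txn,    rollback_return_rows),
#     ("invalidate",   drop_cache_keys,  noop),
#     ("emit",         publish_sns,      noop),
# ], {"order_id": "ORD-12345", "asin": "B00EXAMPLE"})
```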

The Orchestrator confirms with the user before calling initiate_return:

User: "I want to return Berserk volume 42."
Orchestrator: [calls check_refund_eligibility]
Orchestrator → User: "I can process a return. You'll get a $19.99 refund
                      to the original card. Shall I proceed?"
User: "Yes."
Orchestrator: [now calls initiate_return]

The two-step gate is a deliberate friction point. In past incidents, Claude sometimes read "I'm returning this" as a request to initiate a return when the user meant it as a statement of fact; without the confirmation step, that misreading triggers an unwanted RMA.
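One way to enforce the gate mechanically (an assumption; the doc specifies the behavior, not the mechanism) is for check_refund_eligibility to issue a short-lived confirmation token that initiate_return must present:

```python
import secrets
import time
from typing import Dict, Tuple

# Hypothetical in-memory token store; production would keep this in the
# DynamoDB session state described under "State management".
_pending: Dict[str, Tuple[str, str, float]] = {}

def issue_confirmation(order_id: str, asin: str, ttl_s: float = 300.0) -> str:
    """Issued alongside the eligibility check; rides along with the
    'Shall I proceed?' turn."""
    token = secrets.token_urlsafe(16)
    _pending[token] = (order_id, asin, time.time() + ttl_s)
    return token

def require_confirmation(token: str, order_id: str, asin: str) -> None:
    """initiate_return calls this first: no valid, matching token, no write."""
    entry = _pending.pop(token, None)
    if entry is None or entry[:2] != (order_id, asin) or time.time() > entry[2]:
        raise PermissionError("return not confirmed by user")
```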


State management

The agent itself is stateless. State lives in:

  • RDS — source of truth (orders, returns, inventory)
  • ElastiCache — hot read cache, TTL by field
  • DynamoDB — session-scoped order context (recent order IDs the user mentioned, so re-mentions resolve)

The session-scoped piece matters: if the user says "Yes, return that one" two turns later, the Orchestrator needs to know which order ID was active. That entity-pinning is in DynamoDB session state (08-memory-architecture.md).
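A sketch of the pinning write, assuming boto3 and a hypothetical table name; the real schema lives in 08-memory-architecture.md:

```python
import time
import boto3

# Hypothetical table name; the real schema is in 08-memory-architecture.md.
sessions = boto3.resource("dynamodb").Table("chatbot-session-state")

def pin_active_order(session_id: str, order_id: str, ttl_s: int = 1800) -> None:
    """Record which order the user is talking about, so 'return that one'
    two turns later resolves without re-asking."""
    sessions.put_item(Item={
        "session_id": session_id,
        "active_order_id": order_id,
        "ttl": int(time.time()) + ttl_s,  # DynamoDB TTL attribute
    })
```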


Failure handling

| Failure | Detection | Recovery |
|---------|-----------|----------|
| Cache miss + RDS slow (>500ms) | Latency probe | Return cached value with a stale: true flag if one exists; otherwise fail |
| RDS read replica down | Connection error | Fail over to the primary; if the primary is also down, return last-known data with a disclaimer |
| UPS label API timeout | 5s timeout on the write path | Saga compensation: roll back RDS, return an error to the user, retry async |
| Invalid order ID | Schema validation | 400 with a reason; no retry |
| Eligibility check fails | Business rule violation | Return a structured not_eligible with a reason (past window, damaged, etc.) |
| Concurrent return on the same order | DB unique constraint | Detect Postgres error 23505; return idempotent success if the RMA already exists |
| Stock check during a flash sale | Cache invalidation lag | Return the cached value plus a "stock changes rapidly" warning |

The flash sale case is the worst failure mode — telling a customer something is in stock when it's already sold out. Mitigation: during declared flash-sale windows, drop stock cache TTL to 5s and add a note in the response.
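A sketch of that mitigation, assuming some flash_sale_active signal the agent can consult per request; the signal itself is the undefined piece, as the validation below points out:

```python
from typing import Callable

STOCK_TTL_NORMAL_S = 60
STOCK_TTL_FLASH_S = 5

def stock_cache_ttl(asin: str, flash_sale_active: Callable[[str], bool]) -> int:
    # flash_sale_active is the unspecified piece: a config flag, a feature
    # gate, or (more robustly) a subscription to inventory-change events.
    return STOCK_TTL_FLASH_S if flash_sale_active(asin) else STOCK_TTL_NORMAL_S

def stock_response(level: int, flash: bool) -> dict:
    out = {"in_stock": level > 0, "level": level}
    if flash:
        out["warning"] = "stock changes rapidly"  # surfaced to the user
    return out
```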


Latency budget

Target: P99 < 300ms for read tools, P99 < 2s for initiate_return.

Read path (cache hit):
  Cache GET     5ms
  Format       10ms
  ─────
  Total       15ms (P50)
              50ms (P99 with network jitter)

Read path (cache miss):
  Cache GET      5ms
  RDS query     50ms
  Cache SET      5ms
  Format        10ms
  ─────
  Total        70ms (P50)
              200ms (P99)

Write path (initiate_return):
  Eligibility (RDS)    50ms
  RMA generation       20ms
  UPS label API       800ms (external; this dominates)
  RDS transaction     100ms
  SNS publish          30ms
  Cache invalidate     10ms
  ─────
  Total             ~1010ms (P50)
                    ~2500ms (P99)

The write-path P99 of 2.5s exceeds the chatbot's overall 3s budget. This is why returns are usually a multi-turn flow — the initial "shall I proceed?" question buys back latency budget on the actual write turn.


Why this shape

| Alternative | Why we rejected it |
|-------------|--------------------|
| DynamoDB instead of RDS | Order data is relational (orders → line items → returns); financial data needs ACID |
| No cache layer (always RDS) | RDS read replica P99 is ~80ms; under chatbot QPS this stacks badly. The 5ms cache hit is necessary. |
| LLM-generated SQL queries | Catastrophic risk (PII leak, data corruption). All queries are templated. |
| Auto-confirm returns from natural language | Past incidents: ambiguous user intent → unintended RMA. Two-step confirmation is non-negotiable. |
| Cache stock indefinitely | Stock changes too fast during sales; 60s TTL is the loosest acceptable. |

Validation: Constraint Sanity Check

| Claimed metric | Verdict | Why |
|----------------|---------|-----|
| "Real-time freshness" as a stated requirement | Marketing language, not a metric | "Real-time" means different things in different contexts. With a 60s inventory TTL and a 5-minute order-status TTL, this is near-real-time at best. The doc should drop "real-time" and say "≤ 60s freshness for inventory, ≤ 5 min for order status." |
| Inventory cache TTL of 60s | Acceptable, except during flash sales | During a flash sale, 60s of stale "in stock" can mean hundreds of orders for unavailable items. Dynamic TTL adjustment (5s during flash-sale windows) is mentioned, but the trigger mechanism isn't defined. Who declares "flash sale on"? Until that's specified, the protection is theoretical. |
| Read P99 < 300ms | Realistic with a warm cache | Holds for cache hits (P99 ~50ms). Cache-miss P99 (~200ms) requires an RDS read replica that doesn't degrade under chatbot QPS; that needs sustained-load testing data, not just steady-state numbers. |
| Write P99 < 2s for initiate_return | Dependent on the UPS API | UPS label generation is external and not under our SLA control; its P99 in practice is 600ms–3s. We can hit 2s most of the time, but not contractually. Restate as best-effort. |
| Cache hit ratio of "10:1" | Unsourced claim | Where does this number come from? At Amazon scale, with a cold long tail of old orders, the actual ratio depends on access pattern. It needs measurement, not assertion. |
| RDS for orders at scale | Architectural risk at growth | Amazon's actual global order volume is hundreds of millions per day; a single RDS Postgres won't scale to that. Plausible at manga-store scope, but it would need partitioning (by user_id or order_id range) at higher volume. |
| Two-step confirmation for writes | Sound, but adds latency | The pattern adds a full conversation turn (~3s) before any return can be initiated. That's correct for safety, but it eats into "fast support" goals. An acceptable tradeoff; it should be acknowledged. |
| Saga compensation on write failures | Correct pattern, hard to verify | Saga correctness depends on every compensating action being idempotent and tested. There's no mention of how that's verified: chaos testing? Integration tests? Without a test plan, this is paper protection. |

The flash sale problem

The architecture says "during declared flash-sale windows, drop stock cache TTL to 5s." This is a configuration-driven mitigation that depends on:

  1. Someone (or something) declaring the flash sale window
  2. The OrderStatusAgent reading that flag on every request
  3. The dynamic TTL actually being applied

None of (1)–(3) is specified. In practice, flash sales often start when an item unexpectedly goes viral — there's no human in the loop to flip a flag. The robust solution is event-driven: subscribe to inventory-change events from the fulfillment service and invalidate cache on each change. That's significant engineering and isn't in the architecture as documented.
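For scale, a sketch of what the event-driven alternative would look like, assuming inventory-change events land on an SQS queue (raw message delivery) subscribed to the fulfillment service's SNS topic; the queue URL and message shape are assumptions:

```python
import json
import boto3
import redis

sqs = boto3.client("sqs")
cache = redis.Redis(host="orders-cache.example.internal")  # hypothetical
QUEUE_URL = "https://sqs.example/inventory-events"         # hypothetical

def drain_inventory_events() -> None:
    """Invalidate the stock cache on every inventory change instead of
    waiting out a TTL. Long-polls the queue; run as a daemon."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   WaitTimeSeconds=20,
                                   MaxNumberOfMessages=10)
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])         # assumes raw SNS delivery
            cache.delete(f"stock:{event['asin']}")  # and an {"asin": ...} shape
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```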

The current architecture has a soft target ("near real-time") and a hard data freshness limit (60s). During flash sales the 60s window is too long; during normal operation it's fine. Either accept the flash-sale risk explicitly (and tell users "stock may be stale during sales") or build the event-driven invalidation. Don't claim "real-time" while doing neither.

The ACID claim and the cache

RDS is chosen for ACID guarantees. But the moment we read through a cache, ACID is broken from the chatbot's perspective — we may serve stale data that contradicts what RDS would return. That's fine for read-only paths (the user sees a slightly stale order status), but it must not feed into write decisions. Specifically: check_refund_eligibility should always hit RDS directly, never the cache. The doc doesn't make this explicit.
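Concretely, the rule the doc should state, sketched here with rds_query standing in for the templated query layer and the column names assumed:

```python
from datetime import date, timedelta
from typing import Callable, Optional

RETURN_WINDOW = timedelta(days=30)  # assumed policy window

def check_refund_eligibility(order_id: str, asin: str,
                             rds_query: Callable[..., Optional[dict]]) -> dict:
    # Deliberately no cache lookup: this result feeds a write decision, so
    # it must reflect RDS (the source of truth), never a stale cached copy.
    row = rds_query("order_line_by_id_and_asin", (order_id, asin))
    if row is None:
        return {"eligible": False, "reason": "not_found"}
    if date.today() - row["placed_on"] > RETURN_WINDOW:
        return {"eligible": False, "reason": "past_window"}
    return {"eligible": True, "refund_amount": row["price"]}
```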

The "no LLM-generated SQL" rule is right and load-bearing

Worth highlighting because it's easy to lose: every SQL query in this agent is templated. The LLM never gets to write SQL. This is one of the few non-negotiable safety rails in the system; if it ever drifts (e.g., someone adds a "natural language reporting" feature later), the security posture changes materially.
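A minimal sketch of that rail: the LLM may only name a template and supply parameters, and the SQL strings are fixed at deploy time (template names and columns are illustrative):

```python
# The only SQL that can ever run: fixed strings with bound parameters.
QUERY_TEMPLATES = {
    "order_status_by_id":
        "SELECT status, tracking, eta FROM orders WHERE order_id = %s",
    "stock_by_asin_region":
        "SELECT level FROM inventory WHERE asin = %s AND region = %s",
}

def run_template(cur, name: str, params: tuple) -> list:
    sql = QUERY_TEMPLATES[name]   # KeyError if the name isn't whitelisted
    cur.execute(sql, params)      # driver-level binding; no string interpolation
    return cur.fetchall()
```

Any future "natural language reporting" feature would have to punch through this table, which is exactly the drift the paragraph above warns about.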