
03 — OrderStatusAgent

Real-time order tracking, inventory checks, and return initiation. Backed by RDS as source of truth, ElastiCache for hot reads.

The OrderStatusAgent handles every message that touches the customer's transactions: "Where's my package?", "Is volume 5 in stock?", "I want to return this." It's the most operationally sensitive sub-agent — wrong answers here cost real money or burn customer trust.


What it is

A logical sub-agent that exposes order- and inventory-shaped tools to the Orchestrator. It owns:

  1. Order status retrieval — order ID → current state, tracking, ETA
  2. Inventory checks — ASIN → in stock / low stock / out of stock + restock ETA
  3. Return eligibility — order ID + ASIN → can be returned? policy applied?
  4. Return initiation — generates RMA, creates a return shipping label

Backed by the Order & Inventory MCP (../RAG-MCP-Integration/03-order-inventory-mcp.md), which fronts a PostgreSQL RDS cluster (orders + inventory transactional data) with an ElastiCache Redis read-through cache.


Tools exposed to the Orchestrator

| Tool | Purpose | Mutation? |
|------|---------|-----------|
| get_order_status(order_id) | Current status, tracking, ETA | Read-only |
| check_stock(asin, region) | Current inventory level | Read-only |
| get_recent_orders(user_id, limit) | Last N orders for context | Read-only |
| check_refund_eligibility(order_id, asin) | Can this be returned? | Read-only |
| initiate_return(order_id, asin, reason) | Create RMA + shipping label | Write |

The single write operation (initiate_return) is the only place in the entire chatbot that mutates customer data. It is therefore protected by an additional confirmation flow; see the write path below.
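A sketch of the tool surface as it might be declared. The names and signatures come from the table above; the registry decorator and the mutating flag are illustrative, not the actual MCP SDK:

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical registry sketch; the real tools are wired through the Order &
# Inventory MCP server's own SDK. The point is the shape: four reads, one
# gated write.
TOOLS: Dict[str, "Tool"] = {}

@dataclass
class Tool:
    fn: Callable
    mutating: bool = False  # write tools get the extra confirmation flow

def tool(mutating: bool = False):
    def register(fn: Callable) -> Callable:
        TOOLS[fn.__name__] = Tool(fn, mutating)
        return fn
    return register

@tool()
def get_order_status(order_id: str) -> dict: ...

@tool()
def check_stock(asin: str, region: str) -> dict: ...

@tool()
def get_recent_orders(user_id: str, limit: int = 5) -> list: ...

@tool()
def check_refund_eligibility(order_id: str, asin: str) -> dict: ...

@tool(mutating=True)  # the only write in the entire chatbot
def initiate_return(order_id: str, asin: str, reason: str) -> dict: ...
```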


Data freshness model

Order data has different freshness requirements by field:

| Field | Source of truth | Cache TTL | Why |
|-------|-----------------|-----------|-----|
| Order placed | RDS | 1 hour | Immutable once placed |
| Order status (e.g. "shipped") | RDS | 5 minutes | Updated by fulfillment events |
| Tracking events | Carrier API → RDS | 15 minutes | Carriers update slowly; over-polling is rate-limited |
| Inventory level | RDS | 60 seconds | Sales drop stock fast |
| Return eligibility | Computed on read | 0 (no cache) | Depends on current date vs. order date |

Each field is cached independently. A request for order status returns partially fresh data, with each field tagged by its as-of timestamp.
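Illustratively, a structured response might look like the following. The doc specifies only that each field carries an as-of timestamp, so the exact shape and values here are assumptions:

```python
# Hypothetical response shape: each field carries its own as-of timestamp,
# so the Orchestrator can see exactly how stale every piece is. Field names
# and values are illustrative.
order_status = {
    "order_id": "ORD-12345",
    "status":   {"value": "shipped",    "as_of": "2025-01-07T18:42:10Z"},  # 5 min TTL
    "tracking": {"value": "in_transit", "as_of": "2025-01-07T18:30:00Z"},  # 15 min TTL
    "placed":   {"value": "2025-01-05", "as_of": "2025-01-05T09:14:00Z"},  # 1 h TTL
}
```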


Read path

Orchestrator: get_order_status("ORD-12345")
        │
        ▼
ElastiCache lookup: order:ORD-12345
        │
   ┌────┴────┐
   ▼         ▼
HIT (5ms)   MISS
   │         │
   │         ▼
   │     RDS read replica query (~50ms)
   │         │
   │         ▼
   │     Write to cache, TTL by field
   │         │
   ▼         ▼
Return structured order with as-of timestamps per field

Cache hits dominate by ~10:1 in steady state. The numbers above assume a warm cache.
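A minimal read-through sketch, assuming a redis-py client, a hypothetical cache endpoint, and a fetch_from_rds placeholder standing in for the templated replica query (production would also store the as-of timestamp alongside each value):

```python
import json
import redis

# Per-field TTLs from the freshness table above, in seconds.
FIELD_TTL = {"placed": 3600, "status": 300, "tracking": 900}

r = redis.Redis(host="orders-cache.example.internal")  # hypothetical endpoint

def get_order_status(order_id: str, fetch_from_rds) -> dict:
    """Read-through: serve each field from cache, fall back to the replica."""
    out = {}
    missing = []
    for fld in FIELD_TTL:
        cached = r.get(f"order:{order_id}:{fld}")
        if cached is not None:
            out[fld] = json.loads(cached)        # cache hit, ~5ms
        else:
            missing.append(fld)
    if missing:
        row = fetch_from_rds(order_id, missing)  # templated query, ~50ms
        for fld in missing:
            out[fld] = row[fld]
            r.setex(f"order:{order_id}:{fld}", FIELD_TTL[fld], json.dumps(row[fld]))
    return out
```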


Write path: return initiation

The only write operation. Multi-step:

  1. Validate — refund eligibility, ownership, return window
  2. Generate — RMA number, shipping label via UPS API
  3. Persist — write to RDS in a transaction (return record + order status update)
  4. Invalidate — drop affected cache entries
  5. Emit event — publish to SNS for downstream (refund processor, inventory restock)

This is wrapped in a saga pattern: each step has a compensating action. If UPS label generation fails after RDS write, we publish a return_initiation_failed event that triggers cleanup.
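A minimal sketch of the saga runner, following the step order listed above; the step wiring, names, and event helper are placeholders rather than the production implementation:

```python
from typing import Callable, Dict, List, Tuple

def publish_event(name: str, ctx: Dict) -> None:
    """Placeholder for the SNS publish in step 5 / the failure event."""
    ...

# Each step pairs a forward action with a compensating action.
Step = Tuple[str, Callable[[Dict], None], Callable[[Dict], None]]

def run_saga(steps: List[Step], ctx: Dict) -> None:
    done: List[Step] = []
    for step in steps:
        _name, do, _undo = step
        try:
            do(ctx)
            done.append(step)
        except Exception:
            # Unwind committed steps in reverse. Compensations must be
            # idempotent for this to be safe under retries (see the
            # validation table below).
            for _, _, undo in reversed(done):
                undo(ctx)
            publish_event("return_initiation_failed", ctx)
            raise

# Illustrative wiring for initiate_return (all callables are placeholders):
# run_saga([
#     ("generate_rma", create_rma,       void_rma),
#     ("ups_label",    create_ups_label, void_ups_label),
#     ("persist",      write_rds_txn,    rollback_return_rows),
#     ("invalidate",   drop_cache_keys,  noop),
#     ("emit",         publish_sns,      noop),
# ], {"order_id": "ORD-12345", "asin": "B00EXAMPLE"})
```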

The Orchestrator confirms with the user before calling initiate_return:

User: "I want to return Berserk volume 42."
Orchestrator: [calls check_refund_eligibility]
Orchestrator → User: "I can process a return. You'll get a $19.99 refund
                      to the original card. Shall I proceed?"
User: "Yes."
Orchestrator: [now calls initiate_return]

The two-step gate is a deliberate friction point. In past incidents, Claude sometimes read "I'm returning this" as a request to initiate a return when the user meant it as a statement of fact; without the confirmation step, that misreading triggers an unwanted RMA.
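One way to enforce the gate mechanically (an assumption; the doc specifies the behavior, not the mechanism) is for check_refund_eligibility to issue a short-lived confirmation token that initiate_return must present:

```python
import secrets
import time
from typing import Dict, Tuple

# Hypothetical in-memory token store; production would keep this in the
# DynamoDB session state described under "State management".
_pending: Dict[str, Tuple[str, str, float]] = {}

def issue_confirmation(order_id: str, asin: str, ttl_s: float = 300.0) -> str:
    """Issued alongside the eligibility check; rides along with the
    'Shall I proceed?' turn."""
    token = secrets.token_urlsafe(16)
    _pending[token] = (order_id, asin, time.time() + ttl_s)
    return token

def require_confirmation(token: str, order_id: str, asin: str) -> None:
    """initiate_return calls this first: no valid, matching token, no write."""
    entry = _pending.pop(token, None)
    if entry is None or entry[:2] != (order_id, asin) or time.time() > entry[2]:
        raise PermissionError("return not confirmed by user")
```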


State management

The agent itself is stateless. State lives in:

  • RDS — source of truth (orders, returns, inventory)
  • ElastiCache — hot read cache, TTL by field
  • DynamoDB — session-scoped order context (recent order IDs the user mentioned, so re-mentions resolve)

The session-scoped piece matters: if the user says "Yes, return that one" two turns later, the Orchestrator needs to know which order ID was active. That entity-pinning is in DynamoDB session state (08-memory-architecture.md).
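A sketch of the pinning write, assuming boto3 and a hypothetical table name; the real schema lives in 08-memory-architecture.md:

```python
import time
import boto3

# Hypothetical table name; the real schema is in 08-memory-architecture.md.
sessions = boto3.resource("dynamodb").Table("chatbot-session-state")

def pin_active_order(session_id: str, order_id: str, ttl_s: int = 1800) -> None:
    """Record which order the user is talking about, so 'return that one'
    two turns later resolves without re-asking."""
    sessions.put_item(Item={
        "session_id": session_id,
        "active_order_id": order_id,
        "ttl": int(time.time()) + ttl_s,  # DynamoDB TTL attribute
    })
```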


Failure handling

| Failure | Detection | Recovery |
|---------|-----------|----------|
| Cache miss + RDS slow (>500ms) | Latency probe | Return cached value with a stale: true flag if one exists; otherwise fail |
| RDS read replica down | Connection error | Fail over to the primary; if the primary is also down, return last-known data with a disclaimer |
| UPS label API timeout | 5s timeout on the write path | Saga compensation: roll back RDS, return an error to the user, retry async |
| Invalid order ID | Schema validation | 400 with a reason; no retry |
| Eligibility check fails | Business rule violation | Return a structured not_eligible with a reason (past window, damaged, etc.) |
| Concurrent return on the same order | DB unique constraint | Detect Postgres error 23505; return idempotent success if the RMA already exists |
| Stock check during a flash sale | Cache invalidation lag | Return the cached value plus a "stock changes rapidly" warning |

The flash sale case is the worst failure mode — telling a customer something is in stock when it's already sold out. Mitigation: during declared flash-sale windows, drop stock cache TTL to 5s and add a note in the response.
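A sketch of that mitigation, assuming some flash_sale_active signal the agent can consult per request; the signal itself is the undefined piece, as the validation below points out:

```python
from typing import Callable

STOCK_TTL_NORMAL_S = 60
STOCK_TTL_FLASH_S = 5

def stock_cache_ttl(asin: str, flash_sale_active: Callable[[str], bool]) -> int:
    # flash_sale_active is the unspecified piece: a config flag, a feature
    # gate, or (more robustly) a subscription to inventory-change events.
    return STOCK_TTL_FLASH_S if flash_sale_active(asin) else STOCK_TTL_NORMAL_S

def stock_response(level: int, flash: bool) -> dict:
    out = {"in_stock": level > 0, "level": level}
    if flash:
        out["warning"] = "stock changes rapidly"  # surfaced to the user
    return out
```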


Latency budget

Target: P99 < 300ms for read tools, P99 < 2s for initiate_return.

Read path (cache hit):
  Cache GET     5ms
  Format       10ms
  ─────
  Total       15ms (P50)
              50ms (P99 with network jitter)

Read path (cache miss):
  Cache GET      5ms
  RDS query     50ms
  Cache SET      5ms
  Format        10ms
  ─────
  Total        70ms (P50)
              200ms (P99)

Write path (initiate_return):
  Eligibility (RDS)    50ms
  RMA generation       20ms
  UPS label API       800ms (external; this dominates)
  RDS transaction     100ms
  SNS publish          30ms
  Cache invalidate     10ms
  ─────
  Total             ~1010ms (P50)
                    ~2500ms (P99)

The write-path P99 of 2.5s exceeds the chatbot's overall 3s budget. This is why returns are usually a multi-turn flow — the initial "shall I proceed?" question buys back latency budget on the actual write turn.


Why this shape

| Alternative | Why we rejected it |
|-------------|--------------------|
| DynamoDB instead of RDS | Order data is relational (orders → line items → returns); financial data needs ACID |
| No cache layer (always RDS) | RDS read replica P99 is ~80ms; under chatbot QPS this stacks badly. The 5ms cache hit is necessary. |
| LLM-generated SQL queries | Catastrophic risk (PII leak, data corruption). All queries are templated. |
| Auto-confirm returns from natural language | Past incidents: ambiguous user intent → unintended RMA. Two-step confirmation is non-negotiable. |
| Cache stock indefinitely | Stock changes too fast during sales; 60s TTL is the loosest acceptable. |

Validation: Constraint Sanity Check

| Claimed metric | Verdict | Why |
|----------------|---------|-----|
| "Real-time freshness" as a stated requirement | Marketing language, not a metric | "Real-time" means different things in different contexts. With a 60s inventory TTL and a 5-minute order-status TTL, this is near-real-time at best. The doc should drop "real-time" and say "≤ 60s freshness for inventory, ≤ 5 min for order status." |
| Inventory cache TTL of 60s | Acceptable, except during flash sales | During a flash sale, 60s of stale "in stock" can mean hundreds of orders for unavailable items. Dynamic TTL adjustment (5s during flash-sale windows) is mentioned, but the trigger mechanism isn't defined. Who declares "flash sale on"? Until that's specified, the protection is theoretical. |
| Read P99 < 300ms | Realistic with a warm cache | Holds for cache hits (P99 ~50ms). Cache-miss P99 (~200ms) requires an RDS read replica that doesn't degrade under chatbot QPS; that needs sustained-load testing data, not just steady-state numbers. |
| Write P99 < 2s for initiate_return | Dependent on the UPS API | UPS label generation is external and not under our SLA control; its P99 in practice is 600ms–3s. We can hit 2s most of the time, but not contractually. Restate as best-effort. |
| Cache hit ratio of "10:1" | Unsourced claim | Where does this number come from? At Amazon scale, with a cold long tail of old orders, the actual ratio depends on access pattern. It needs measurement, not assertion. |
| RDS for orders at scale | Architectural risk at growth | Amazon's actual global order volume is hundreds of millions per day; a single RDS Postgres won't scale to that. Plausible at manga-store scope, but it would need partitioning (by user_id or order_id range) at higher volume. |
| Two-step confirmation for writes | Sound, but adds latency | The pattern adds a full conversation turn (~3s) before any return can be initiated. That's correct for safety, but it eats into "fast support" goals. An acceptable tradeoff; it should be acknowledged. |
| Saga compensation on write failures | Correct pattern, hard to verify | Saga correctness depends on every compensating action being idempotent and tested. There's no mention of how that's verified: chaos testing? Integration tests? Without a test plan, this is paper protection. |

The flash sale problem

The architecture says "during declared flash-sale windows, drop stock cache TTL to 5s." This is a configuration-driven mitigation that depends on:

  1. Someone (or something) declaring the flash sale window
  2. The OrderStatusAgent reading that flag on every request
  3. The dynamic TTL actually being applied

None of (1)–(3) is specified. In practice, flash sales often start when an item unexpectedly goes viral — there's no human in the loop to flip a flag. The robust solution is event-driven: subscribe to inventory-change events from the fulfillment service and invalidate cache on each change. That's significant engineering and isn't in the architecture as documented.
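For scale, a sketch of what the event-driven alternative would look like, assuming inventory-change events land on an SQS queue (raw message delivery) subscribed to the fulfillment service's SNS topic; the queue URL and message shape are assumptions:

```python
import json
import boto3
import redis

sqs = boto3.client("sqs")
cache = redis.Redis(host="orders-cache.example.internal")  # hypothetical
QUEUE_URL = "https://sqs.example/inventory-events"         # hypothetical

def drain_inventory_events() -> None:
    """Invalidate the stock cache on every inventory change instead of
    waiting out a TTL. Long-polls the queue; run as a daemon."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   WaitTimeSeconds=20,
                                   MaxNumberOfMessages=10)
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])         # assumes raw SNS delivery
            cache.delete(f"stock:{event['asin']}")  # and an {"asin": ...} shape
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```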

The current architecture has a soft target ("near real-time") and a hard data freshness limit (60s). During flash sales the 60s window is too long; during normal operation it's fine. Either accept the flash-sale risk explicitly (and tell users "stock may be stale during sales") or build the event-driven invalidation. Don't claim "real-time" while doing neither.

The ACID claim and the cache

RDS is chosen for ACID guarantees. But the moment we read through a cache, ACID is broken from the chatbot's perspective — we may serve stale data that contradicts what RDS would return. That's fine for read-only paths (the user sees a slightly stale order status), but it must not feed into write decisions. Specifically: check_refund_eligibility should always hit RDS directly, never the cache. The doc doesn't make this explicit.
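Concretely, the rule the doc should state, sketched here with rds_query standing in for the templated query layer and the column names assumed:

```python
from datetime import date, timedelta
from typing import Callable, Optional

RETURN_WINDOW = timedelta(days=30)  # assumed policy window

def check_refund_eligibility(order_id: str, asin: str,
                             rds_query: Callable[..., Optional[dict]]) -> dict:
    # Deliberately no cache lookup: this result feeds a write decision, so
    # it must reflect RDS (the source of truth), never a stale cached copy.
    row = rds_query("order_line_by_id_and_asin", (order_id, asin))
    if row is None:
        return {"eligible": False, "reason": "not_found"}
    if date.today() - row["placed_on"] > RETURN_WINDOW:
        return {"eligible": False, "reason": "past_window"}
    return {"eligible": True, "refund_amount": row["price"]}
```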

The "no LLM-generated SQL" rule is right and load-bearing

Worth highlighting because it's easy to lose: every SQL query in this agent is templated. The LLM never gets to write SQL. This is one of the few non-negotiable safety rails in the system; if it ever drifts (e.g., someone adds a "natural language reporting" feature later), the security posture changes materially.
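A minimal sketch of that rail: the LLM may only name a template and supply parameters, and the SQL strings are fixed at deploy time (template names and columns are illustrative):

```python
# The only SQL that can ever run: fixed strings with bound parameters.
QUERY_TEMPLATES = {
    "order_status_by_id":
        "SELECT status, tracking, eta FROM orders WHERE order_id = %s",
    "stock_by_asin_region":
        "SELECT level FROM inventory WHERE asin = %s AND region = %s",
}

def run_template(cur, name: str, params: tuple) -> list:
    sql = QUERY_TEMPLATES[name]   # KeyError if the name isn't whitelisted
    cur.execute(sql, params)      # driver-level binding; no string interpolation
    return cur.fetchall()
```

Any future "natural language reporting" feature would have to punch through this table, which is exactly the drift the paragraph above warns about.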