User Story 08 — Observability, Cost Tracking, Versioning, Rate Limiting, Quota Management
Pillar: P2 (Harness) + P3 (LLD) · Stage unlocked: 3 → 4 · Reading time: ~14 min
TL;DR
This is the cross-cutting harness story. Five concerns — observability, cost, versioning, rate limits, quotas — that must be unified, not built five times. They share the same primitives: a structured event stream, a versioned identifier, a quota ledger, a dashboard. Build them as one system or pay 5× the operational tax.
The User Story
As the platform engineer running MangaAssist's harness, I want unified observability + cost + versioning + rate limit + quota infrastructure that emits structured events from every node, skill, and sub-agent, so that any question — "what did this turn cost?", "is this canary safe to ramp?", "are we approaching Bedrock TPM?", "which prompt version was this?" — has one source of truth.
Acceptance criteria
- Every LLM/skill invocation emits a structured event with: trace_id, span_id, parent_span_id, model_id, prompt_version, graph_version, tool_version, tokens_in/out, cost_usd, latency_ms, eval_score (when available), capability_flags.
- Cost dashboards drill from total → per-feature → per-user → per-turn → per-node within 30 seconds.
- Rate limits and quotas are enforced at a single chokepoint with predictable backpressure (not random 429s in 5 places).
- Every artifact (prompt, tool, sub-agent, graph, model) has a version that participates in deploy/rollback.
- The same telemetry feeds eval (story 04), debugging, capacity planning, and finance reporting.
The unified event schema
```yaml
# Every span emits this
event:
  schema_version: 3
  trace_id: 7f3...a2
  span_id: b81...44
  parent_span_id: c20...19
  workflow_id: a4f...ee
  user_id_hash: redacted-sha256
  session_id: s_abc123
  ts_start: 2026-04-29T14:02:13.418Z
  ts_end: 2026-04-29T14:02:13.802Z
  duration_ms: 384
  node:
    graph_version: v17.4
    node_id: Plan
    invocation_mode: sync_streaming    # or async_queue, batch, scheduled
  artifact:
    artifact_kind: llm_call            # llm_call, tool_call, sub_agent_call, eval, internal
    model_id: claude-sonnet-4-6
    prompt_version: planner-prompt-v23
    tool_id: null
    tool_version: null
    sub_agent_id: null
    sub_agent_version: null
  resource:
    tokens_in: 4218
    tokens_out: 312
    cost_usd: 0.00451
    cache_hit: prefix_partial          # full / prefix_partial / miss
    capability_flags_active: [adult_ok, recommend_v2, ko_locale]
  outcome:
    status: ok                         # ok, fallback, retry_succeeded, failed
    error_code: null
    fallback_used: null
    eval_score: null                   # filled in async by judge
  attribution:
    feature: discovery_recommend
    surface: chat
    locale: ko-KR
    user_tier: prime
```
This schema is the load-bearing artifact for everything below.
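As a concrete illustration, here is a minimal sketch of how a node-side SDK might populate and emit one such span. The SpanEvent dataclass and EventSink transport are assumptions for the example; in production the sink would wrap the Kinesis producer shown below, but the field names follow the schema above.

```python
# Minimal sketch (assumed names): building one span event that matches the schema above.
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _now_iso() -> str:
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z")


@dataclass
class SpanEvent:
    schema_version: int = 3
    trace_id: str = ""
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    parent_span_id: str | None = None
    user_id_hash: str = ""
    ts_start: str = field(default_factory=_now_iso)
    ts_end: str = ""
    node: dict = field(default_factory=dict)         # graph_version, node_id, invocation_mode
    artifact: dict = field(default_factory=dict)     # artifact_kind, model_id, prompt_version, ...
    resource: dict = field(default_factory=dict)     # tokens_in/out, cost_usd, cache_hit, flags
    outcome: dict = field(default_factory=dict)      # status, error_code, fallback_used, eval_score
    attribution: dict = field(default_factory=dict)  # feature, surface, locale, user_tier


class EventSink:
    """Stand-in transport; the real one would write to the Kinesis stream."""

    def emit(self, event: SpanEvent) -> None:
        print(json.dumps(asdict(event)))


# Usage: wrap one planner LLM call and emit a single span.
ev = SpanEvent(
    trace_id=uuid.uuid4().hex,
    user_id_hash=hashlib.sha256(b"u_123").hexdigest(),
    node={"graph_version": "v17.4", "node_id": "Plan", "invocation_mode": "sync_streaming"},
    artifact={"artifact_kind": "llm_call", "model_id": "claude-sonnet-4-6",
              "prompt_version": "planner-prompt-v23"},
    resource={"tokens_in": 4218, "tokens_out": 312, "cost_usd": 0.00451, "cache_hit": "prefix_partial"},
    outcome={"status": "ok"},
    attribution={"feature": "discovery_recommend", "surface": "chat", "locale": "ko-KR"},
)
ev.ts_end = _now_iso()
EventSink().emit(ev)
```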
Observability — the dashboards that matter
```mermaid
flowchart TB
    Events[Structured event stream] --> Kinesis[Kinesis Data Stream]
    Kinesis --> RT[Real-time aggregator]
    Kinesis --> S3L[S3 long-term lake]
    RT --> CW[CloudWatch metrics]
    RT --> Dash1[Dashboard - latency / errors / saturation]
    RT --> Alarms[Alarms]
    S3L --> Athena[Athena ad-hoc queries]
    S3L --> Glue[Glue ETL]
    Glue --> Dash2[Dashboard - cost per feature / locale]
    Glue --> Eval[Eval framework joins]
    Glue --> Finance[Finance reporting]
```
The four dashboards
- Live ops — p50/p95/p99 latency, error rate, eval score trend, queue depth, fleet saturation. Refresh: 1 min.
- Cost — $/turn, $/feature, $/user-segment, $/locale. Refresh: 15 min.
- Quality — eval scores partitioned by graph version, model version, prompt version, locale. Refresh: 5 min.
- Capacity — TPM consumed vs limit, queue trends, projected exhaustion time. Refresh: 1 min.
Same events power all four. Build once.
Cost tracking — the discipline
Cost is everywhere: tokens × $/token on every model call, plus DynamoDB writes, S3 reads, judge calls, eval runs, and compute. The trap is to track only LLM tokens — that's <60% of the real bill.
Cost decomposition for a single sync turn
| Component | Typical cost | % of turn |
|---|---|---|
| Planner LLM call | $0.0040 | 22% |
| Sub-agent LLM calls (1.4 avg) | $0.0058 | 32% |
| Tool calls (3.2 avg) | $0.0024 | 13% |
| Embedding calls | $0.0006 | 3% |
| Safety pre + post check | $0.0008 | 4% |
| Judge (sampled at 1%, amortized) | $0.0002 | 1% |
| DDB writes (state, idempotency) | $0.0009 | 5% |
| S3 reads/writes (context blobs) | $0.0007 | 4% |
| Compute (Fargate amortized) | $0.0028 | 15% |
| Total | $0.0182 | 100% |
That's $0.018/turn, roughly double the LLM-only naive estimate. At 2.4B turns/day that would be $43.7M/day if every turn were full sync. Most aren't, so the actual figure is closer to $12-15M/day. Still: get this wrong by 2× and you have a $5B/year accounting hole.
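A sketch of the per-turn roll-up that produces a breakdown like the table above from the event stream. The component labels and the amortized non-LLM figure are illustrative, not production constants:

```python
# Hypothetical roll-up: sum per-component cost for one turn (all spans sharing a trace_id).
from collections import defaultdict


def turn_cost_breakdown(spans: list[dict]) -> dict[str, float]:
    by_component: dict[str, float] = defaultdict(float)
    for span in spans:
        kind = span["artifact"]["artifact_kind"]         # llm_call, sub_agent_call, tool_call, ...
        by_component[kind] += span["resource"]["cost_usd"]
    # DDB, S3 and compute don't show up as LLM spans; omit them and the turn looks ~2x cheaper.
    by_component["ddb_s3_compute_amortized"] += 0.0044   # illustrative amortized figure
    return dict(by_component)


spans = [
    {"artifact": {"artifact_kind": "llm_call"}, "resource": {"cost_usd": 0.0040}},        # planner
    {"artifact": {"artifact_kind": "sub_agent_call"}, "resource": {"cost_usd": 0.0058}},  # sub-agents
    {"artifact": {"artifact_kind": "tool_call"}, "resource": {"cost_usd": 0.0024}},       # tools
]
print(turn_cost_breakdown(spans))
```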
Where cost dashboards earn their keep
- Anomalous user — top-100 cost users per day; flag those exceeding the $0.18/month/user ceiling. Often abuse or bug.
- Anomalous turn — turns >5σ from median cost. Often runaway tool loops.
- Anomalous feature — feature whose $/use is climbing without a usage increase. Often a prompt regression that increased tokens.
Versioning — the artifact ledger
Five things have versions:
| Artifact | Version source | Bump cadence |
|---|---|---|
| Graph topology | Git tag | Per architecture change (~weekly) |
| Prompt | Per-prompt SHA | Per prompt edit (daily) |
| Skill / Tool | Semver in registry | Per skill change |
| Sub-agent | Semver in registry | Per sub-agent change |
| Foundation model | Provider model_id | Per provider rotation |
Every event carries the version of every artifact it touched. The dashboard must be able to filter on every version. "Show me eval score for graph v17.4 + planner-prompt-v23 + Sonnet 4.7 + cross-title-link v1.0" must be one query, not five.
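A sketch of that one query, with pandas standing in for the Athena lake. The flattened column layout and the exact version strings are illustrative assumptions:

```python
# Hypothetical: eval score for one exact combination of artifact versions, in a single filter.
import pandas as pd

events = pd.DataFrame([
    {"graph_version": "v17.4", "prompt_version": "planner-prompt-v23",
     "model_id": "claude-sonnet-4-7", "tool_version": "cross-title-link-v1.0", "eval_score": 0.91},
    {"graph_version": "v17.3", "prompt_version": "planner-prompt-v22",
     "model_id": "claude-sonnet-4-6", "tool_version": "cross-title-link-v1.0", "eval_score": 0.88},
])

mask = (
    (events["graph_version"] == "v17.4")
    & (events["prompt_version"] == "planner-prompt-v23")
    & (events["model_id"] == "claude-sonnet-4-7")
    & (events["tool_version"] == "cross-title-link-v1.0")
)
print(events[mask]["eval_score"].mean())   # one query, because every event carries every version
```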
Promotion gates
- Prompt change: offline eval green + 1-hour canary green.
- Skill version bump: contract-test green + 24-hour shadow + canary green.
- Sub-agent version: shadow eval + canary evaluated at the parent level.
- Graph version: replay-harness green + canary.
- Model rotation: see story 06.
The gates compose. A graph version bump that also bumps a prompt fires both gates.
Rate limiting
Two kinds of rate limit. Don't conflate them.
Per-user rate limits (abuse prevention)
- 30 turns/min per user (generous).
- 200 turns/hour per user.
- Token bucket; refilled continuously; soft fail with a retry-after when exceeded (sketched after this list).
- Computed at the API layer, before reaching the agent.
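A minimal token-bucket sketch for the 30 turns/min limit. The in-memory dict is an assumption for illustration; production would back the buckets with a shared store so every entry point drains the same bucket (see the chokepoint principle below):

```python
# Hypothetical per-user token bucket: 30 turns/min, continuous refill, soft fail with retry-after.
import time


class TokenBucket:
    def __init__(self, capacity: float = 30.0, refill_per_sec: float = 30.0 / 60.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets: dict[str, tuple[float, float]] = {}   # user_id -> (tokens, last_checked)

    def try_consume(self, user_id: str) -> tuple[bool, float]:
        now = time.monotonic()
        tokens, last = self.buckets.get(user_id, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1.0:
            self.buckets[user_id] = (tokens - 1.0, now)
            return True, 0.0
        self.buckets[user_id] = (tokens, now)
        return False, (1.0 - tokens) / self.refill_per_sec   # seconds until one token refills


limiter = TokenBucket()
allowed, retry_after = limiter.try_consume("user_42")
if not allowed:
    print(f"429 with Retry-After: {retry_after:.1f}s")       # soft fail, never a silent drop
```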
Per-system rate limits (quota protection)
- Aggregate Bedrock TPM, Anthropic TPM, internal model TPM.
- Per-locale fairness — a Korean traffic spike can't starve French users.
- These are quotas, not user-facing limits. See next section.
The chokepoint principle
There is one rate-limiter component. Every entry point (chat, voice, batch, alerts) goes through it. Not 4 implementations, not "we'll add it later to batch."
Quota management — the most under-built piece
Quotas are a shared resource across modes (story 07): Bedrock has a per-account TPM. Sync chat, async personalization, batch enrichment, and judging all draw from the same pool. Without a quota manager, batch can drain the budget and live chat fails.
The quota ledger
```mermaid
flowchart LR
    subgraph Callers
        Sync[Sync chat]
        Async[Async queue]
        Batch[Batch driver]
        Judge[Eval judge]
    end
    subgraph QM[Quota Manager]
        L[(Quota ledger)]
        P[Priority policy]
        A[Allocator]
    end
    subgraph Providers
        BR[Bedrock TPM]
        AN[Anthropic TPM]
        IH[In-house FM TPM]
    end
    Sync -->|reserve| QM
    Async -->|reserve| QM
    Batch -->|reserve| QM
    Judge -->|reserve| QM
    QM -->|approve / wait| Sync
    QM -->|approve / wait| Async
    QM -->|approve / wait| Batch
    QM -->|approve / wait| Judge
    QM -.consume.-> BR
    QM -.consume.-> AN
    QM -.consume.-> IH
```
- Each caller requests N tokens before invoking; the manager grants or queues (see the sketch after this list).
- Manager tracks rolling consumption against provider TPM ceilings minus a safety margin (we use 85%).
- Priority classes: P0 sync chat, P1 async user-facing, P2 internal eval/judge, P3 batch enrichment.
- When TPM saturates, P3 pauses first; P2 next; P1 next; P0 never (it would rather degrade to a faster, cheaper model than fail).
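A sketch of the reserve-before-invoke protocol with priority shedding. The shed thresholds, the single-minute window, and the class names are assumptions for illustration; the 85% safety margin follows the text above:

```python
# Hypothetical quota manager: callers reserve tokens before invoking; lower priority
# classes are shed first as consumption approaches the provider ceiling minus margin.
import time
from enum import IntEnum


class Priority(IntEnum):
    P0_SYNC_CHAT = 0
    P1_ASYNC_USER = 1
    P2_EVAL_JUDGE = 2
    P3_BATCH = 3


class QuotaManager:
    # Saturation level at which each class stops getting reservations (P0 never stops).
    SHED_AT = {Priority.P3_BATCH: 0.70, Priority.P2_EVAL_JUDGE: 0.85,
               Priority.P1_ASYNC_USER: 0.95, Priority.P0_SYNC_CHAT: 1.00}

    def __init__(self, provider_tpm: int, safety_margin: float = 0.85):
        self.usable_tpm = provider_tpm * safety_margin
        self.window_start = time.monotonic()
        self.consumed = 0.0

    def saturation(self) -> float:
        if time.monotonic() - self.window_start >= 60:      # roll the one-minute window
            self.window_start, self.consumed = time.monotonic(), 0.0
        return self.consumed / self.usable_tpm

    def reserve(self, tokens: int, prio: Priority) -> bool:
        """Grant a reservation, or tell the caller to queue / downshift."""
        if self.saturation() >= self.SHED_AT[prio]:
            return False
        self.consumed += tokens
        return True


qm = QuotaManager(provider_tpm=2_000_000)
if not qm.reserve(tokens=4_500, prio=Priority.P3_BATCH):
    pass   # batch driver backs off; P0 sync-chat reservations still succeed
```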
The flywheel between quota and graph
The Plan node (story 01) has a fallback edge for "primary model quota saturated" → use Haiku instead of Sonnet. The quota manager exposes "current saturation" as state the graph reads. This is the pillar interaction — system design (quota) feeds workflow design (fallback).
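The graph-side half of that interaction might look like the sketch below: the Plan node reads the published saturation figure and picks the fallback model when the primary pool is constrained. The model identifiers and the 0.9 threshold are illustrative:

```python
# Hypothetical fallback edge: the Plan node downshifts when the quota manager
# reports the primary model's pool as saturated.
def choose_planner_model(quota_saturation: float) -> str:
    primary, fallback = "claude-sonnet-4-6", "claude-haiku"   # illustrative identifiers
    return fallback if quota_saturation >= 0.9 else primary


model_id = choose_planner_model(quota_saturation=0.93)
# The emitted span records outcome.fallback_used, so dashboards can track the downshift rate.
```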
Pitfalls
Pitfall 1 — Logs but no spans
Each component logs in its own format. Reconstructing a turn means joining 6 log streams. On-call gives up.
Fix: OpenTelemetry-style spans with the schema above. One trace, one timeline, every component speaks the same shape.
Pitfall 2 — Token cost only
"We track LLM cost." DDB and S3 don't show up. Compute fixed-cost is amortized into "platform overhead" and ignored. Real cost-per-turn is double the reported number.
Fix: the cost dashboard's per-turn total includes every component captured in the schema's resource block, not just LLM tokens. Finance reconciles weekly.
Pitfall 3 — Two rate limiters
The chat API has its rate limit. The batch driver has its own. They don't share state. A user spamming via API gets blocked there but the batch still processes their backfill.
Fix: one chokepoint with a unified token-bucket implementation. All entry points consume from the same bucket per (user_id, scope).
Pitfall 4 — Quota state by counting after the fact
Quota tracker reads provider TPM headers and reacts after exceeding. By then 5K requests are in flight that will fail.
Fix: reserve before invoking the model. The quota manager grants reservations; if it can't, the caller queues or downshifts. Only granted reservations may issue calls, so the provider never sees the requests that would have failed.
Q&A drill — opening question
Q: This is a lot of harness for a chatbot. Aren't most of these problems solved by an APM (Datadog, New Relic) plus AWS budgets?
APMs solve general observability for HTTP-shaped traffic. They don't natively understand:
- Per-prompt-version score deltas.
- Token-cost decomposition by sub-agent.
- Cross-mode quota arbitration.
- LLM cache hit/miss as a first-class metric.
You'd build all of those on top of the APM. The schema above is the contract that makes all of them queryable in one place. We use Datadog for hosting; the schema is ours.
Grilling — Round 1
Q1. What's the volume of events at scale?
2.4B turns × ~14 spans/turn = ~34B events/day = ~390K events/sec. Kinesis handles this with sharding (~256 shards, assuming producer-side record aggregation). The S3 lake grows ~80GB/day compressed. Athena queries typically scan 1-50GB depending on time range and filters; sub-minute response on most queries.
Q2. Cost of the cost-tracking infra itself?
Roughly 0.4% of total spend. Largest pieces: Kinesis ingestion, Glue ETL, dashboard query infra. Inexpensive relative to the optimization signal it provides — first month of cost analysis usually finds 8-12% savings, paying for the infra many times over.
Q3. How is PII kept out of the event stream?
Three layers:
- At emit: the SDK normalizes user_id → hashed user_id; redacts email/phone/payment patterns from prompt_excerpt fields.
- At ingest: a Kinesis processor enforces field-level allowlist; rejects events with non-allowlisted fields.
- At query: access controls — ad-hoc Athena access requires per-team approval; production dashboards use pre-aggregated views without raw text.
We periodically audit by injecting probe PII; if it shows up downstream, we have a finding.
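A sketch of the emit-time layer: user-id hashing plus regex redaction of obvious email/phone patterns. The patterns and the prompt_excerpt field name are illustrative, not the production rule set:

```python
# Hypothetical emit-time PII scrubbing: hash the user id and redact obvious
# email/phone patterns from free-text excerpts before the event leaves the SDK.
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(event: dict) -> dict:
    event["user_id_hash"] = hashlib.sha256(event.pop("user_id").encode()).hexdigest()
    for key in ("prompt_excerpt", "response_excerpt"):       # illustrative free-text fields
        if key in event:
            event[key] = PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", event[key]))
    return event


print(scrub({"user_id": "u_123", "prompt_excerpt": "mail me at a@b.com or call +82 10-1234-5678"}))
```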
Grilling — Round 2 (architect-level)
Q4. How does the quota manager handle a sudden Bedrock TPM cut by Amazon (constraint scenario 7)?
Real scenario. The manager:
1. Receives the new TPM ceiling (manual config push).
2. Recomputes per-priority allocations.
3. Updates the "current saturation" reading published to the graph.
4. The graph's fallback edges activate: P0 sync chat downshifts to Haiku where it would hit the ceiling; P3 batch pauses; P2 judge sampling drops from 1% to 0.2%.
5. On-call sees a saturated-fallback alarm; ops decides whether to buy more TPM (provider negotiation) or accept the degradation.
The customer impact during the cut: slightly lower-quality answers (Haiku vs Sonnet) for ~5-15% of turns, no failures. Graceful degradation, not outage.
Q5. Versioning is great until you need to query by user. "Was this user on graph v17 or v18 last Tuesday?"
The event log is the source. Every event carries graph_version. A query by user_id in a time window joins the user's events to their graph_version distribution. We maintain a derived "user-version exposure" table updated nightly for fast lookups.
This becomes important for postmortems and for measuring per-user impact of canary cohorts. Without versioned events, this question has no answer.
Q6. What's the reliability story when the event pipeline itself is down?
Three rings of resilience:
- Local buffering — each container has a 200MB ring buffer; loses events only if down >2 minutes.
- Backup direct-to-S3 — if Kinesis is throttled, events go to S3 raw and are reprocessed when Kinesis recovers.
- Sampled critical events — eval and cost events have a must_persist flag; the SDK acks the local write before the critical action proceeds.
Worst-case (Kinesis fully down): the harness keeps running; dashboards lag; eval and quota run partially blind for the duration. We've measured tolerance at ~30 minutes before customer impact (eval can't catch regressions; quota can over-reserve).
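A sketch of the degraded emit path: try the stream first, fall back to a bounded local buffer, replay on recovery. The injected send_to_stream callable and the buffer size stand in for the real Kinesis client and the 200MB ring buffer:

```python
# Hypothetical resilient emitter: primary stream first, bounded local ring buffer
# when the stream is throttled or down; buffered events are replayed on recovery.
from collections import deque
from typing import Callable


class ResilientEmitter:
    def __init__(self, send_to_stream: Callable[[dict], None], max_buffered: int = 100_000):
        self.send_to_stream = send_to_stream
        self.buffer: deque[dict] = deque(maxlen=max_buffered)   # stands in for the 200MB ring buffer

    def emit(self, event: dict) -> None:
        try:
            self.send_to_stream(event)
        except Exception:
            self.buffer.append(event)          # never block the turn on telemetry

    def drain(self) -> None:
        """Replay buffered events once the stream recovers."""
        while self.buffer:
            event = self.buffer.popleft()
            try:
                self.send_to_stream(event)
            except Exception:
                self.buffer.appendleft(event)  # still down: keep the event and stop
                break
```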
Intuition gained
- One event schema, five use cases. Don't build five telemetry systems.
- Cost is bigger than tokens. DDB, S3, compute, and infrastructure dominate at scale; the dashboard must include them.
- Versioning is per-artifact and joins at the event. Five things have versions; every event carries them all.
- Quota is a reservation system, not an after-the-fact counter. Reserve before invoke.
- The chokepoint principle: rate limiting and quota live in one place; entry points all flow through.
- Per-priority degradation (P0/P1/P2/P3) is what gives the system graceful behavior under quota cuts.
See also
- 01-execution-flow-design.md — events emitted at every node
- 02-skill-composition-and-invocation.md — skill versioning + idempotency
- 04-active-passive-evals.md — eval consumes the same events
- Changing-Constraints-Scenarios/03-cost-budget-halved.md — cost controls under squeeze
- Changing-Constraints-Scenarios/07-provider-quota-revoked.md — quota manager under pressure