User Story 08 — Observability, Cost Tracking, Versioning, Rate Limiting, Quota Management
Pillar: P2 (Harness) + P3 (LLD) · Stage unlocked: 3 → 4 · Reading time: ~14 min
TL;DR
This is the cross-cutting harness story. Five concerns — observability, cost, versioning, rate limits, quotas — that must be unified, not built five times. They share the same primitives: a structured event stream, a versioned identifier, a quota ledger, a dashboard. Build them as one system or pay 5× the operational tax.
The User Story
As the platform engineer running MangaAssist's harness, I want unified observability + cost + versioning + rate limit + quota infrastructure that emits structured events from every node, skill, and sub-agent, so that any question — "what did this turn cost?", "is this canary safe to ramp?", "are we approaching Bedrock TPM?", "which prompt version was this?" — has one source of truth.
Acceptance criteria
- Every LLM/skill invocation emits a structured event with: trace_id, span_id, parent_span_id, model_id, prompt_version, graph_version, tool_version, tokens_in/out, cost_usd, latency_ms, eval_score (when available), capability_flags.
- Cost dashboards drill from total → per-feature → per-user → per-turn → per-node within 30 seconds.
- Rate limits and quotas are enforced at a single chokepoint with predictable backpressure (not random 429s in 5 places).
- Every artifact (prompt, tool, sub-agent, graph, model) has a version that participates in deploy/rollback.
- The same telemetry feeds eval (story 04), debugging, capacity planning, and finance reporting.
The unified event schema
```yaml
# Every span emits this
event:
  schema_version: 3
  trace_id: 7f3...a2
  span_id: b81...44
  parent_span_id: c20...19
  workflow_id: a4f...ee
  user_id_hash: redacted-sha256
  session_id: s_abc123
  ts_start: 2026-04-29T14:02:13.418Z
  ts_end: 2026-04-29T14:02:13.802Z
  duration_ms: 384
  node:
    graph_version: v17.4
    node_id: Plan
    invocation_mode: sync_streaming    # or async_queue, batch, scheduled
  artifact:
    artifact_kind: llm_call            # llm_call, tool_call, sub_agent_call, eval, internal
    model_id: claude-sonnet-4-6
    prompt_version: planner-prompt-v23
    tool_id: null
    tool_version: null
    sub_agent_id: null
    sub_agent_version: null
  resource:
    tokens_in: 4218
    tokens_out: 312
    cost_usd: 0.00451
    cache_hit: prefix_partial          # full / prefix_partial / miss
    capability_flags_active: [adult_ok, recommend_v2, ko_locale]
  outcome:
    status: ok                         # ok, fallback, retry_succeeded, failed
    error_code: null
    fallback_used: null
    eval_score: null                   # filled in async by judge
  attribution:
    feature: discovery_recommend
    surface: chat
    locale: ko-KR
    user_tier: prime
```
This schema is the load-bearing artifact for everything below.
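As a concrete illustration, here is a minimal sketch of how a node-side SDK might populate and emit one such span. The SpanEvent dataclass and EventSink transport are assumptions for the example; in production the sink would wrap the Kinesis producer shown below, but the field names follow the schema above.

```python
# Minimal sketch (assumed names): building one span event that matches the schema above.
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _now_iso() -> str:
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z")


@dataclass
class SpanEvent:
    schema_version: int = 3
    trace_id: str = ""
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    parent_span_id: str | None = None
    user_id_hash: str = ""
    ts_start: str = field(default_factory=_now_iso)
    ts_end: str = ""
    node: dict = field(default_factory=dict)         # graph_version, node_id, invocation_mode
    artifact: dict = field(default_factory=dict)     # artifact_kind, model_id, prompt_version, ...
    resource: dict = field(default_factory=dict)     # tokens_in/out, cost_usd, cache_hit, flags
    outcome: dict = field(default_factory=dict)      # status, error_code, fallback_used, eval_score
    attribution: dict = field(default_factory=dict)  # feature, surface, locale, user_tier


class EventSink:
    """Stand-in transport; the real one would write to the Kinesis stream."""

    def emit(self, event: SpanEvent) -> None:
        print(json.dumps(asdict(event)))


# Usage: wrap one planner LLM call and emit a single span.
ev = SpanEvent(
    trace_id=uuid.uuid4().hex,
    user_id_hash=hashlib.sha256(b"u_123").hexdigest(),
    node={"graph_version": "v17.4", "node_id": "Plan", "invocation_mode": "sync_streaming"},
    artifact={"artifact_kind": "llm_call", "model_id": "claude-sonnet-4-6",
              "prompt_version": "planner-prompt-v23"},
    resource={"tokens_in": 4218, "tokens_out": 312, "cost_usd": 0.00451, "cache_hit": "prefix_partial"},
    outcome={"status": "ok"},
    attribution={"feature": "discovery_recommend", "surface": "chat", "locale": "ko-KR"},
)
ev.ts_end = _now_iso()
EventSink().emit(ev)
```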
Observability — the dashboards that matter
```mermaid
flowchart TB
    Events[Structured event stream] --> Kinesis[Kinesis Data Stream]
    Kinesis --> RT[Real-time aggregator]
    Kinesis --> S3L[S3 long-term lake]
    RT --> CW[CloudWatch metrics]
    RT --> Dash1[Dashboard - latency / errors / saturation]
    RT --> Alarms[Alarms]
    S3L --> Athena[Athena ad-hoc queries]
    S3L --> Glue[Glue ETL]
    Glue --> Dash2[Dashboard - cost per feature / locale]
    Glue --> Eval[Eval framework joins]
    Glue --> Finance[Finance reporting]
```
The four dashboards
- Live ops — p50/p95/p99 latency, error rate, eval score trend, queue depth, fleet saturation. Refresh: 1 min.
- Cost — $/turn, $/feature, $/user-segment, $/locale. Refresh: 15 min.
- Quality — eval scores partitioned by graph version, model version, prompt version, locale. Refresh: 5 min.
- Capacity — TPM consumed vs limit, queue trends, projected exhaustion time. Refresh: 1 min.
Same events power all four. Build once.
Cost tracking — the discipline
Cost is everywhere: tokens × $/token on every model call, plus DynamoDB writes, S3 reads, judge calls, eval runs, and compute. The trap is to track only LLM tokens — that's <60% of the real bill.
Cost decomposition for a single sync turn
| Component | Typical cost | % of turn |
|---|---|---|
| Planner LLM call | $0.0040 | 22% |
| Sub-agent LLM calls (1.4 avg) | $0.0058 | 32% |
| Tool calls (3.2 avg) | $0.0024 | 13% |
| Embedding calls | $0.0006 | 3% |
| Safety pre + post check | $0.0008 | 4% |
| Judge (sampled at 1%, amortized) | $0.0002 | 1% |
| DDB writes (state, idempotency) | $0.0009 | 5% |
| S3 reads/writes (context blobs) | $0.0007 | 4% |
| Compute (Fargate amortized) | $0.0028 | 15% |
| Total | $0.0182 | 100% |
That's $0.018/turn, roughly double the LLM-only naive estimate. At 2.4B turns/day that would be $43.7M/day if every turn were full sync. Most aren't, so the actual figure is closer to $12-15M/day. Still: get this wrong by 2× and you have a $5B/year accounting hole.
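A sketch of the per-turn roll-up that produces a breakdown like the table above from the event stream. The component labels and the amortized non-LLM figure are illustrative, not production constants:

```python
# Hypothetical roll-up: sum per-component cost for one turn (all spans sharing a trace_id).
from collections import defaultdict


def turn_cost_breakdown(spans: list[dict]) -> dict[str, float]:
    by_component: dict[str, float] = defaultdict(float)
    for span in spans:
        kind = span["artifact"]["artifact_kind"]         # llm_call, sub_agent_call, tool_call, ...
        by_component[kind] += span["resource"]["cost_usd"]
    # DDB, S3 and compute don't show up as LLM spans; omit them and the turn looks ~2x cheaper.
    by_component["ddb_s3_compute_amortized"] += 0.0044   # illustrative amortized figure
    return dict(by_component)


spans = [
    {"artifact": {"artifact_kind": "llm_call"}, "resource": {"cost_usd": 0.0040}},        # planner
    {"artifact": {"artifact_kind": "sub_agent_call"}, "resource": {"cost_usd": 0.0058}},  # sub-agents
    {"artifact": {"artifact_kind": "tool_call"}, "resource": {"cost_usd": 0.0024}},       # tools
]
print(turn_cost_breakdown(spans))
```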
Where cost dashboards earn their keep
- Anomalous user — top-100 cost users per day; flag those exceeding the $0.18/month/user ceiling. Often abuse or bug.
- Anomalous turn — turns >5σ from median cost. Often runaway tool loops.
- Anomalous feature — feature whose $/use is climbing without a usage increase. Often a prompt regression that increased tokens.
Versioning — the artifact ledger
Five things have versions:
| Artifact | Version source | Bump cadence |
|---|---|---|
| Graph topology | Git tag | Per architecture change (~weekly) |
| Prompt | Per-prompt SHA | Per prompt edit (daily) |
| Skill / Tool | Semver in registry | Per skill change |
| Sub-agent | Semver in registry | Per sub-agent change |
| Foundation model | Provider model_id | Per provider rotation |
Every event carries the version of every artifact it touched. The dashboard must be able to filter on every version. "Show me eval score for graph v17.4 + planner-prompt-v23 + Sonnet 4.7 + cross-title-link v1.0" must be one query, not five.
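A sketch of that one query, with pandas standing in for the Athena lake. The flattened column layout and the exact version strings are illustrative assumptions:

```python
# Hypothetical: eval score for one exact combination of artifact versions, in a single filter.
import pandas as pd

events = pd.DataFrame([
    {"graph_version": "v17.4", "prompt_version": "planner-prompt-v23",
     "model_id": "claude-sonnet-4-7", "tool_version": "cross-title-link-v1.0", "eval_score": 0.91},
    {"graph_version": "v17.3", "prompt_version": "planner-prompt-v22",
     "model_id": "claude-sonnet-4-6", "tool_version": "cross-title-link-v1.0", "eval_score": 0.88},
])

mask = (
    (events["graph_version"] == "v17.4")
    & (events["prompt_version"] == "planner-prompt-v23")
    & (events["model_id"] == "claude-sonnet-4-7")
    & (events["tool_version"] == "cross-title-link-v1.0")
)
print(events[mask]["eval_score"].mean())   # one query, because every event carries every version
```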
Promotion gates
- Prompt change: offline eval green + 1-hour canary green.
- Skill version bump: contract-test green + 24-hour shadow + canary green.
- Sub-agent version: shadow eval + canary evaluated at the parent level.
- Graph version: replay-harness green + canary.
- Model rotation: see story 06.
The gates compose. A graph version bump that also bumps a prompt fires both gates.
Rate limiting
Two kinds of rate limit. Don't conflate them.
Per-user rate limits (abuse prevention)
- 30 turns/min per user (generous).
- 200 turns/hour per user.
- Token bucket; refilled continuously; soft fail with a retry-after when exceeded (sketched after this list).
- Computed at the API layer, before reaching the agent.
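A minimal token-bucket sketch for the 30 turns/min limit. The in-memory dict is an assumption for illustration; production would back the buckets with a shared store so every entry point drains the same bucket (see the chokepoint principle below):

```python
# Hypothetical per-user token bucket: 30 turns/min, continuous refill, soft fail with retry-after.
import time


class TokenBucket:
    def __init__(self, capacity: float = 30.0, refill_per_sec: float = 30.0 / 60.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets: dict[str, tuple[float, float]] = {}   # user_id -> (tokens, last_checked)

    def try_consume(self, user_id: str) -> tuple[bool, float]:
        now = time.monotonic()
        tokens, last = self.buckets.get(user_id, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1.0:
            self.buckets[user_id] = (tokens - 1.0, now)
            return True, 0.0
        self.buckets[user_id] = (tokens, now)
        return False, (1.0 - tokens) / self.refill_per_sec   # seconds until one token refills


limiter = TokenBucket()
allowed, retry_after = limiter.try_consume("user_42")
if not allowed:
    print(f"429 with Retry-After: {retry_after:.1f}s")       # soft fail, never a silent drop
```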
Per-system rate limits (quota protection)
- Aggregate Bedrock TPM, Anthropic TPM, internal model TPM.
- Per-locale fairness — a Korean traffic spike can't starve French users.
- These are quotas, not user-facing limits. See next section.
The chokepoint principle
There is one rate-limiter component. Every entry point (chat, voice, batch, alerts) goes through it. Not 4 implementations, not "we'll add it later to batch."
Quota management — the most under-built piece
Quotas are a shared resource across modes (story 07): Bedrock has a per-account TPM. Sync chat, async personalization, batch enrichment, and judging all draw from the same pool. Without a quota manager, batch can drain the budget and live chat fails.
The quota ledger
```mermaid
flowchart LR
    subgraph Callers
        Sync[Sync chat]
        Async[Async queue]
        Batch[Batch driver]
        Judge[Eval judge]
    end
    subgraph QM[Quota Manager]
        L[(Quota ledger)]
        P[Priority policy]
        A[Allocator]
    end
    subgraph Providers
        BR[Bedrock TPM]
        AN[Anthropic TPM]
        IH[In-house FM TPM]
    end
    Sync -->|reserve| QM
    Async -->|reserve| QM
    Batch -->|reserve| QM
    Judge -->|reserve| QM
    QM -->|approve / wait| Sync
    QM -->|approve / wait| Async
    QM -->|approve / wait| Batch
    QM -->|approve / wait| Judge
    QM -.consume.-> BR
    QM -.consume.-> AN
    QM -.consume.-> IH
```
- Each caller requests N tokens before invoking; the manager grants or queues (see the sketch after this list).
- Manager tracks rolling consumption against provider TPM ceilings minus a safety margin (we use 85%).
- Priority classes: P0 sync chat, P1 async user-facing, P2 internal eval/judge, P3 batch enrichment.
- When TPM saturates, P3 pauses first; P2 next; P1 next; P0 never (it would rather degrade to a faster, cheaper model than fail).
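A sketch of the reserve-before-invoke protocol with priority shedding. The shed thresholds, the single-minute window, and the class names are assumptions for illustration; the 85% safety margin follows the text above:

```python
# Hypothetical quota manager: callers reserve tokens before invoking; lower priority
# classes are shed first as consumption approaches the provider ceiling minus margin.
import time
from enum import IntEnum


class Priority(IntEnum):
    P0_SYNC_CHAT = 0
    P1_ASYNC_USER = 1
    P2_EVAL_JUDGE = 2
    P3_BATCH = 3


class QuotaManager:
    # Saturation level at which each class stops getting reservations (P0 never stops).
    SHED_AT = {Priority.P3_BATCH: 0.70, Priority.P2_EVAL_JUDGE: 0.85,
               Priority.P1_ASYNC_USER: 0.95, Priority.P0_SYNC_CHAT: 1.00}

    def __init__(self, provider_tpm: int, safety_margin: float = 0.85):
        self.usable_tpm = provider_tpm * safety_margin
        self.window_start = time.monotonic()
        self.consumed = 0.0

    def saturation(self) -> float:
        if time.monotonic() - self.window_start >= 60:      # roll the one-minute window
            self.window_start, self.consumed = time.monotonic(), 0.0
        return self.consumed / self.usable_tpm

    def reserve(self, tokens: int, prio: Priority) -> bool:
        """Grant a reservation, or tell the caller to queue / downshift."""
        if self.saturation() >= self.SHED_AT[prio]:
            return False
        self.consumed += tokens
        return True


qm = QuotaManager(provider_tpm=2_000_000)
if not qm.reserve(tokens=4_500, prio=Priority.P3_BATCH):
    pass   # batch driver backs off; P0 sync-chat reservations still succeed
```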
The flywheel between quota and graph
The Plan node (story 01) has a fallback edge for "primary model quota saturated" → use Haiku instead of Sonnet. The quota manager exposes "current saturation" as state the graph reads. This is the pillar interaction — system design (quota) feeds workflow design (fallback).
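The graph-side half of that interaction might look like the sketch below: the Plan node reads the published saturation figure and picks the fallback model when the primary pool is constrained. The model identifiers and the 0.9 threshold are illustrative:

```python
# Hypothetical fallback edge: the Plan node downshifts when the quota manager
# reports the primary model's pool as saturated.
def choose_planner_model(quota_saturation: float) -> str:
    primary, fallback = "claude-sonnet-4-6", "claude-haiku"   # illustrative identifiers
    return fallback if quota_saturation >= 0.9 else primary


model_id = choose_planner_model(quota_saturation=0.93)
# The emitted span records outcome.fallback_used, so dashboards can track the downshift rate.
```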
Pitfalls
Pitfall 1 — Logs but no spans
Each component logs in its own format. Reconstructing a turn means joining 6 log streams. On-call gives up.
Fix: OpenTelemetry-style spans with the schema above. One trace, one timeline, every component speaks the same shape.
Pitfall 2 — Token cost only
"We track LLM cost." DDB and S3 don't show up. Compute fixed-cost is amortized into "platform overhead" and ignored. Real cost-per-turn is double the reported number.
Fix: the cost dashboard's per-turn total includes every component captured in the schema's resource block, not just LLM tokens. Finance reconciles weekly.
Pitfall 3 — Two rate limiters
The chat API has its rate limit. The batch driver has its own. They don't share state. A user spamming via API gets blocked there but the batch still processes their backfill.
Fix: one chokepoint with a unified token-bucket implementation. All entry points consume from the same bucket per (user_id, scope).
Pitfall 4 — Quota state by counting after the fact
Quota tracker reads provider TPM headers and reacts after exceeding. By then 5K requests are in flight that will fail.
Fix: reserve before invoking the model. The quota manager grants reservations; if it can't, the caller queues or downshifts. Only granted reservations may issue calls, so the provider never sees the requests that would have failed.
Q&A drill — opening question
Q: This is a lot of harness for a chatbot. Aren't most of these problems solved by an APM (Datadog, New Relic) plus AWS budgets?
APMs solve general observability for HTTP-shaped traffic. They don't natively understand:
- Per-prompt-version score deltas.
- Token-cost decomposition by sub-agent.
- Cross-mode quota arbitration.
- LLM cache hit/miss as a first-class metric.
You'd build all of those on top of the APM. The schema above is the contract that makes all of them queryable in one place. We use Datadog for hosting; the schema is ours.
Grilling — Round 1
Q1. What's the volume of events at scale?
2.4B turns × ~14 spans/turn = ~34B events/day = ~390K events/sec. Kinesis handles this with sharding (~256 shards, assuming producer-side record aggregation). The S3 lake grows ~80GB/day compressed. Athena queries typically scan 1-50GB depending on time range and filters; sub-minute response on most queries.
Q2. Cost of the cost-tracking infra itself?
Roughly 0.4% of total spend. Largest pieces: Kinesis ingestion, Glue ETL, dashboard query infra. Inexpensive relative to the optimization signal it provides — first month of cost analysis usually finds 8-12% savings, paying for the infra many times over.
Q3. How is PII kept out of the event stream?
Three layers:
- At emit: the SDK normalizes user_id → hashed user_id; redacts email/phone/payment patterns from prompt_excerpt fields.
- At ingest: a Kinesis processor enforces field-level allowlist; rejects events with non-allowlisted fields.
- At query: access controls — ad-hoc Athena access requires per-team approval; production dashboards use pre-aggregated views without raw text.
We periodically audit by injecting probe PII; if it shows up downstream, we have a finding.
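A sketch of the emit-time layer: user-id hashing plus regex redaction of obvious email/phone patterns. The patterns and the prompt_excerpt field name are illustrative, not the production rule set:

```python
# Hypothetical emit-time PII scrubbing: hash the user id and redact obvious
# email/phone patterns from free-text excerpts before the event leaves the SDK.
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(event: dict) -> dict:
    event["user_id_hash"] = hashlib.sha256(event.pop("user_id").encode()).hexdigest()
    for key in ("prompt_excerpt", "response_excerpt"):       # illustrative free-text fields
        if key in event:
            event[key] = PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", event[key]))
    return event


print(scrub({"user_id": "u_123", "prompt_excerpt": "mail me at a@b.com or call +82 10-1234-5678"}))
```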
Grilling — Round 2 (architect-level)
Q4. How does the quota manager handle a sudden Bedrock TPM cut by Amazon (constraint scenario 7)?
Real scenario. The manager:
1. Receives the new TPM ceiling (manual config push).
2. Recomputes per-priority allocations.
3. Updates the "current saturation" reading published to the graph.
4. The graph's fallback edges activate: P0 sync chat downshifts to Haiku where it would hit the ceiling; P3 batch pauses; P2 judge sampling drops from 1% to 0.2%.
5. On-call sees a saturated-fallback alarm; ops decides whether to buy more TPM (provider negotiation) or accept the degradation.
The customer impact during the cut: slightly lower-quality answers (Haiku vs Sonnet) for ~5-15% of turns, no failures. Graceful degradation, not outage.
Q5. Versioning is great until you need to query by user. "Was this user on graph v17 or v18 last Tuesday?"
The event log is the source. Every event carries graph_version. A query by user_id in a time window joins the user's events to their graph_version distribution. We maintain a derived "user-version exposure" table updated nightly for fast lookups.
This becomes important for postmortems and for measuring per-user impact of canary cohorts. Without versioned events, this question has no answer.
Q6. What's the reliability story when the event pipeline itself is down?
Three rings of resilience:
- Local buffering — each container has a 200MB ring buffer; loses events only if down >2 minutes.
- Backup direct-to-S3 — if Kinesis is throttled, events go to S3 raw and are reprocessed when Kinesis recovers.
- Sampled critical events — eval and cost events have a must_persist flag; the SDK acks the local write before the critical action proceeds.
Worst-case (Kinesis fully down): the harness keeps running; dashboards lag; eval and quota run partially blind for the duration. We've measured tolerance at ~30 minutes before customer impact (eval can't catch regressions; quota can over-reserve).
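A sketch of the degraded emit path: try the stream first, fall back to a bounded local buffer, replay on recovery. The injected send_to_stream callable and the buffer size stand in for the real Kinesis client and the 200MB ring buffer:

```python
# Hypothetical resilient emitter: primary stream first, bounded local ring buffer
# when the stream is throttled or down; buffered events are replayed on recovery.
from collections import deque
from typing import Callable


class ResilientEmitter:
    def __init__(self, send_to_stream: Callable[[dict], None], max_buffered: int = 100_000):
        self.send_to_stream = send_to_stream
        self.buffer: deque[dict] = deque(maxlen=max_buffered)   # stands in for the 200MB ring buffer

    def emit(self, event: dict) -> None:
        try:
            self.send_to_stream(event)
        except Exception:
            self.buffer.append(event)          # never block the turn on telemetry

    def drain(self) -> None:
        """Replay buffered events once the stream recovers."""
        while self.buffer:
            event = self.buffer.popleft()
            try:
                self.send_to_stream(event)
            except Exception:
                self.buffer.appendleft(event)  # still down: keep the event and stop
                break
```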
Intuition gained
- One event schema, five use cases. Don't build five telemetry systems.
- Cost is bigger than tokens. DDB, S3, compute, and infrastructure dominate at scale; the dashboard must include them.
- Versioning is per-artifact and joins at the event. Five things have versions; every event carries them all.
- Quota is a reservation system, not an after-the-fact counter. Reserve before invoke.
- The chokepoint principle: rate limiting and quota live in one place; entry points all flow through.
- Per-priority degradation (P0/P1/P2/P3) is what gives the system graceful behavior under quota cuts.
See also
- 01-execution-flow-design.md — events emitted at every node
- 02-skill-composition-and-invocation.md — skill versioning + idempotency
- 04-active-passive-evals.md — eval consumes the same events
- Changing-Constraints-Scenarios/03-cost-budget-halved.md — cost controls under squeeze
- Changing-Constraints-Scenarios/07-provider-quota-revoked.md — quota manager under pressure