
Skill 3.3.2: Data Source Traceability

Task: Task 3.3
Goal: Maintain end-to-end traceability of which data sources influenced each FM response.

User Story

As a Knowledge Governance Lead, I want MangaAssist to record which source documents, catalog records, and tool outputs contributed to each answer so that the team can explain, audit, and correct responses without guesswork.

Grounded Scenarios

  • The chatbot answers a returns-policy question and the user disputes the policy. Why it matters: we need to show exactly which policy source was used.
  • A recommendation answer combines catalog metadata, editorial notes, and popularity signals. Why it matters: multi-source answers need per-source attribution, not a black-box explanation.
  • A stale KB article causes an outdated answer. Why it matters: traceability identifies which source needs correction or deprecation.

Deep-Dive Design

1. Register Sources Centrally

Use the AWS Glue Data Catalog or a similar metadata registry to track:

  • source ID
  • source type: policy, catalog, editorial, tool API
  • owner
  • freshness SLA
  • approval status
  • region or language scope

This makes source governance queryable.
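As a sketch of what registration could look like, the snippet below stores governance metadata as Glue table parameters, which keeps it queryable through the catalog API. The database name, table name, and parameter keys are illustrative assumptions, not a prescribed schema.

```python
# Sketch: registering a retrieval source in the AWS Glue Data Catalog.
# Database name, table name, and parameter keys are assumptions; adapt
# them to your own registry conventions.
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="mangaassist_source_registry",  # hypothetical registry database
    TableInput={
        "Name": "jp_manga_returns_policy",       # source ID
        "Description": "Returns policy KB article (JP)",
        "Parameters": {                          # governance metadata, queryable via get_table
            "source_type": "policy",             # policy | catalog | editorial | tool_api
            "owner": "knowledge-governance@example.com",
            "freshness_sla_days": "30",
            "approval_status": "approved",
            "scope": "region=JP;lang=ja",
        },
    },
)
```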

2. Propagate Source Metadata Through Retrieval

When retrieval happens, keep source IDs attached to:

  • retrieved chunks
  • catalog records
  • tool responses
  • reranked context objects

Do not strip metadata during prompt assembly. The orchestration layer must preserve provenance all the way to the answer object.
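A minimal sketch of provenance-preserving prompt assembly, assuming a simple in-house orchestration layer. The RetrievedChunk and AssembledContext types are hypothetical, not part of any AWS SDK; the point is that source IDs travel alongside the prompt text instead of being dropped here.

```python
# Minimal sketch of provenance-preserving prompt assembly. All names here
# (RetrievedChunk, AssembledContext) are illustrative, not a real API.
from dataclasses import dataclass, field

@dataclass
class RetrievedChunk:
    source_id: str   # registry key from the source catalog
    text: str
    score: float

@dataclass
class AssembledContext:
    prompt: str
    source_ids: list[str] = field(default_factory=list)  # provenance travels with the prompt

def assemble(chunks: list[RetrievedChunk]) -> AssembledContext:
    # Concatenate chunk text for the prompt but keep the source IDs attached,
    # so the answer object can cite them later instead of losing provenance here.
    prompt = "\n\n".join(c.text for c in chunks)
    return AssembledContext(prompt=prompt, source_ids=[c.source_id for c in chunks])
```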

3. Record Answer-Level Provenance

For each high-value response, capture:

  • request ID
  • cited source IDs
  • non-cited supporting source IDs
  • model and prompt version
  • whether the final answer was tool-derived or FM-composed

This is useful for both user-visible citations and internal audits.
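One possible shape for that record is sketched below; the field names and values are assumptions about the logging schema rather than a fixed format.

```python
# One way to shape the per-response provenance record described above.
# Field names are assumptions; the point is that every high-value answer
# carries this record alongside the answer text.
from dataclasses import dataclass

@dataclass(frozen=True)
class AnswerProvenance:
    request_id: str
    cited_source_ids: tuple[str, ...]        # shown to the user
    supporting_source_ids: tuple[str, ...]   # retrieved but not cited
    model_version: str
    prompt_version: str
    answer_origin: str                       # "tool_derived" or "fm_composed"

record = AnswerProvenance(
    request_id="req-8c41",                   # hypothetical request ID
    cited_source_ids=("jp_manga_returns_policy",),
    supporting_source_ids=("returns_faq_v2",),
    model_version="model-2025-01",
    prompt_version="returns-prompt-v7",
    answer_origin="fm_composed",
)
```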

4. User-Facing Attribution

When appropriate, the response should show:

  • source title
  • source type
  • freshness cue
  • link or reference label

For example, a returns answer might cite "JP Manga Returns Policy" while a shipping answer cites an order-status tool result plus a shipping policy page.
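A toy renderer for such a citation follows; the label format and the freshness cue are UX assumptions, not requirements.

```python
# Illustrative citation rendering; the reference-label style and freshness
# cue format are assumptions about the MangaAssist UX.
from datetime import date

def render_citation(title: str, source_type: str, last_reviewed: date, ref: str) -> str:
    # Show a human-readable freshness cue rather than raw metadata.
    age_days = (date.today() - last_reviewed).days
    return f"[{ref}] {title} ({source_type}, reviewed {age_days} days ago)"

print(render_citation("JP Manga Returns Policy", "policy", date(2025, 1, 10), "1"))
```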

5. Audit Logging

Use CloudTrail and structured service logs to record who changed:

  • source approval status
  • metadata tags
  • KB ingestion jobs
  • retrieval pipeline configuration

Traceability is incomplete if we know which source answered the user but not who changed that source.
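CloudTrail records the AWS API calls themselves (for example, a glue:UpdateTable against the source registry). The sketch below shows a complementary application-level event for changes CloudTrail cannot see; the event fields are hypothetical.

```python
# Sketch: emitting a structured source-change event from the application
# layer, to complement CloudTrail's record of the underlying AWS API calls.
# Field names are assumptions about the audit-log schema.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("source_audit")

def log_source_change(actor: str, source_id: str, change: str, old: str, new: str) -> None:
    logger.info(json.dumps({
        "event": "source_change",
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,           # who changed it
        "source_id": source_id,
        "change": change,         # e.g. "approval_status"
        "old_value": old,
        "new_value": new,
    }))

log_source_change("alice@example.com", "jp_manga_returns_policy",
                  "approval_status", "approved", "deprecated")
```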

Acceptance Criteria

  • Approved sources are registered with owner, freshness, and scope metadata.
  • Retrieval and prompt assembly preserve source provenance through the whole answer path.
  • High-risk answers can be mapped back to exact source IDs after the fact.
  • User-visible citations are available for source-backed answers where appropriate.
  • Source-change events are auditable.

Signals and Metrics

  • percentage of source-backed answers with usable provenance (see the sketch after this list)
  • citation coverage rate on policy and catalog answers
  • time to trace a disputed answer back to a source
  • stale-source incident rate
  • number of production sources missing owner or freshness metadata
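A toy calculation of the first metric, assuming each answer record is a dict with "source_backed" and "source_ids" fields; that schema is an assumption, not a given.

```python
# Toy provenance-coverage calculation over answer records. The record
# schema ("source_backed", "source_ids") is an assumed log format.
def provenance_coverage(records: list[dict]) -> float:
    backed = [r for r in records if r.get("source_backed")]
    if not backed:
        return 0.0
    with_provenance = sum(1 for r in backed if r.get("source_ids"))
    return with_provenance / len(backed)

sample = [
    {"source_backed": True, "source_ids": ["jp_manga_returns_policy"]},
    {"source_backed": True, "source_ids": []},   # provenance dropped somewhere
    {"source_backed": False},                    # chit-chat answer, out of scope
]
print(f"{provenance_coverage(sample):.0%}")      # 50%
```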

Failure Modes and Tradeoffs

  • Source metadata can get dropped during orchestration. Mitigation: make provenance part of the response schema (see the sketch after this list).
  • Too many citations can clutter the UX. Mitigation: separate internal provenance from user-facing evidence.
  • Unregistered shadow sources can appear in rushed projects. Mitigation: block production retrieval from unapproved datasets.
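One way to realize the first mitigation, sketched under the assumption of a simple dataclass-based response schema: make provenance a required field and fail fast if it is missing, rather than shipping an untraceable answer.

```python
# Sketch of the first mitigation above: provenance is a required part of
# the response schema, so orchestration cannot silently drop it. The
# Answer type is illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Answer:
    text: str
    source_ids: tuple[str, ...]  # required, never optional

    def __post_init__(self):
        # Fail fast at the schema boundary instead of returning an
        # answer with no provenance attached.
        if self.text and not self.source_ids:
            raise ValueError("source-backed answer is missing provenance")
```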

Interview Takeaway

Traceability is the backbone of trustworthy GenAI operations. If you cannot answer "which source produced this answer and who approved it," governance is weak.