
Skill 3.3.2: Data Source Traceability

Task: Task 3.3
Goal: Maintain end-to-end traceability of which data sources influenced each FM response.

User Story

As a Knowledge Governance Lead, I want MangaAssist to record which source documents, catalog records, and tool outputs contributed to each answer so that the team can explain, audit, and correct responses without guesswork.

Grounded Scenarios

  • The chatbot answers a returns-policy question and the user disputes the policy. Why it matters: we need to show exactly which policy source was used.
  • A recommendation answer combines catalog metadata, editorial notes, and popularity signals. Why it matters: multi-source answers need per-source attribution, not a black-box explanation.
  • A stale KB article causes an outdated answer. Why it matters: traceability identifies which source needs correction or deprecation.

Deep-Dive Design

1. Register Sources Centrally

Use the AWS Glue Data Catalog or a similar metadata registry to track:

  • source ID
  • source type: policy, catalog, editorial, tool API
  • owner
  • freshness SLA
  • approval status
  • region or language scope

This makes source governance queryable.
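As a sketch of what registration could look like, the snippet below stores governance metadata as Glue table parameters, which keeps it queryable through the catalog API. The database name, table name, and parameter keys are illustrative assumptions, not a prescribed schema.

```python
# Sketch: registering a retrieval source in the AWS Glue Data Catalog.
# Database name, table name, and parameter keys are assumptions; adapt
# them to your own registry conventions.
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="mangaassist_source_registry",  # hypothetical registry database
    TableInput={
        "Name": "jp_manga_returns_policy",       # source ID
        "Description": "Returns policy KB article (JP)",
        "Parameters": {                          # governance metadata, queryable via get_table
            "source_type": "policy",             # policy | catalog | editorial | tool_api
            "owner": "knowledge-governance@example.com",
            "freshness_sla_days": "30",
            "approval_status": "approved",
            "scope": "region=JP;lang=ja",
        },
    },
)
```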

2. Propagate Source Metadata Through Retrieval

When retrieval happens, keep source IDs attached to:

  • retrieved chunks
  • catalog records
  • tool responses
  • reranked context objects

Do not strip metadata during prompt assembly. The orchestration layer must preserve provenance all the way to the answer object.
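A minimal sketch of provenance-preserving prompt assembly, assuming a simple in-house orchestration layer. The RetrievedChunk and AssembledContext types are hypothetical, not part of any AWS SDK; the point is that source IDs travel alongside the prompt text instead of being dropped here.

```python
# Minimal sketch of provenance-preserving prompt assembly. All names here
# (RetrievedChunk, AssembledContext) are illustrative, not a real API.
from dataclasses import dataclass, field

@dataclass
class RetrievedChunk:
    source_id: str   # registry key from the source catalog
    text: str
    score: float

@dataclass
class AssembledContext:
    prompt: str
    source_ids: list[str] = field(default_factory=list)  # provenance travels with the prompt

def assemble(chunks: list[RetrievedChunk]) -> AssembledContext:
    # Concatenate chunk text for the prompt but keep the source IDs attached,
    # so the answer object can cite them later instead of losing provenance here.
    prompt = "\n\n".join(c.text for c in chunks)
    return AssembledContext(prompt=prompt, source_ids=[c.source_id for c in chunks])
```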

3. Record Answer-Level Provenance

For each high-value response, capture:

  • request ID
  • cited source IDs
  • non-cited supporting source IDs
  • model and prompt version
  • whether the final answer was tool-derived or FM-composed

This is useful for both user-visible citations and internal audits.
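One possible shape for that record is sketched below; the field names and values are assumptions about the logging schema rather than a fixed format.

```python
# One way to shape the per-response provenance record described above.
# Field names are assumptions; the point is that every high-value answer
# carries this record alongside the answer text.
from dataclasses import dataclass

@dataclass(frozen=True)
class AnswerProvenance:
    request_id: str
    cited_source_ids: tuple[str, ...]        # shown to the user
    supporting_source_ids: tuple[str, ...]   # retrieved but not cited
    model_version: str
    prompt_version: str
    answer_origin: str                       # "tool_derived" or "fm_composed"

record = AnswerProvenance(
    request_id="req-8c41",                   # hypothetical request ID
    cited_source_ids=("jp_manga_returns_policy",),
    supporting_source_ids=("returns_faq_v2",),
    model_version="model-2025-01",
    prompt_version="returns-prompt-v7",
    answer_origin="fm_composed",
)
```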

4. User-Facing Attribution

When appropriate, the response should show:

  • source title
  • source type
  • freshness cue
  • link or reference label

For example, a returns answer might cite "JP Manga Returns Policy" while a shipping answer cites an order-status tool result plus a shipping policy page.
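A toy renderer for such a citation follows; the label format and the freshness cue are UX assumptions, not requirements.

```python
# Illustrative citation rendering; the reference-label style and freshness
# cue format are assumptions about the MangaAssist UX.
from datetime import date

def render_citation(title: str, source_type: str, last_reviewed: date, ref: str) -> str:
    # Show a human-readable freshness cue rather than raw metadata.
    age_days = (date.today() - last_reviewed).days
    return f"[{ref}] {title} ({source_type}, reviewed {age_days} days ago)"

print(render_citation("JP Manga Returns Policy", "policy", date(2025, 1, 10), "1"))
```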

5. Audit Logging

Use CloudTrail and structured service logs to record who changed:

  • source approval status
  • metadata tags
  • KB ingestion jobs
  • retrieval pipeline configuration

Traceability is incomplete if we know which source answered the user but not who changed that source.
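CloudTrail records the AWS API calls themselves (for example, a glue:UpdateTable against the source registry). The sketch below shows a complementary application-level event for changes CloudTrail cannot see; the event fields are hypothetical.

```python
# Sketch: emitting a structured source-change event from the application
# layer, to complement CloudTrail's record of the underlying AWS API calls.
# Field names are assumptions about the audit-log schema.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("source_audit")

def log_source_change(actor: str, source_id: str, change: str, old: str, new: str) -> None:
    logger.info(json.dumps({
        "event": "source_change",
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,           # who changed it
        "source_id": source_id,
        "change": change,         # e.g. "approval_status"
        "old_value": old,
        "new_value": new,
    }))

log_source_change("alice@example.com", "jp_manga_returns_policy",
                  "approval_status", "approved", "deprecated")
```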

Acceptance Criteria

  • Approved sources are registered with owner, freshness, and scope metadata.
  • Retrieval and prompt assembly preserve source provenance through the whole answer path.
  • High-risk answers can be mapped back to exact source IDs after the fact.
  • User-visible citations are available for source-backed answers where appropriate.
  • Source-change events are auditable.

Signals and Metrics

  • percentage of source-backed answers with usable provenance (see the sketch after this list)
  • citation coverage rate on policy and catalog answers
  • time to trace a disputed answer back to a source
  • stale-source incident rate
  • number of production sources missing owner or freshness metadata
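A toy calculation of the first metric, assuming each answer record is a dict with "source_backed" and "source_ids" fields; that schema is an assumption, not a given.

```python
# Toy provenance-coverage calculation over answer records. The record
# schema ("source_backed", "source_ids") is an assumed log format.
def provenance_coverage(records: list[dict]) -> float:
    backed = [r for r in records if r.get("source_backed")]
    if not backed:
        return 0.0
    with_provenance = sum(1 for r in backed if r.get("source_ids"))
    return with_provenance / len(backed)

sample = [
    {"source_backed": True, "source_ids": ["jp_manga_returns_policy"]},
    {"source_backed": True, "source_ids": []},   # provenance dropped somewhere
    {"source_backed": False},                    # chit-chat answer, out of scope
]
print(f"{provenance_coverage(sample):.0%}")      # 50%
```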

Failure Modes and Tradeoffs

  • Source metadata can get dropped during orchestration. Mitigation: make provenance part of the response schema (see the sketch after this list).
  • Too many citations can clutter the UX. Mitigation: separate internal provenance from user-facing evidence.
  • Unregistered shadow sources can appear in rushed projects. Mitigation: block production retrieval from unapproved datasets.
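One way to realize the first mitigation, sketched under the assumption of a simple dataclass-based response schema: make provenance a required field and fail fast if it is missing, rather than shipping an untraceable answer.

```python
# Sketch of the first mitigation above: provenance is a required part of
# the response schema, so orchestration cannot silently drop it. The
# Answer type is illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Answer:
    text: str
    source_ids: tuple[str, ...]  # required, never optional

    def __post_init__(self):
        # Fail fast at the schema boundary instead of returning an
        # answer with no provenance attached.
        if self.text and not self.source_ids:
            raise ValueError("source-backed answer is missing provenance")
```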

Interview Takeaway

Traceability is the backbone of trustworthy GenAI operations. If you cannot answer "which source produced this answer and who approved it," governance is weak.