Skill 3.3.2: Data Source Traceability
Task: Task 3.3
Goal: Maintain end-to-end traceability of which data sources influenced each FM response.
User Story
As a Knowledge Governance Lead, I want MangaAssist to record which source documents, catalog records, and tool outputs contributed to each answer so that the team can explain, audit, and correct responses without guesswork.
Grounded Scenarios
| Scenario | Why It Matters |
|---|---|
| The chatbot answers a returns-policy question and the user disputes the policy | We need to show exactly which policy source was used |
| A recommendation answer combines catalog metadata, editorial notes, and popularity signals | Multi-source answers need per-source attribution, not a black-box explanation |
| A stale KB article causes an outdated answer | Traceability helps identify which source needs correction or deprecation |
Deep-Dive Design
1. Register Sources Centrally
Use AWS Glue Data Catalog or a similar metadata registry to track:
- source ID
- source type (policy, catalog, editorial, or tool API)
- owner
- freshness SLA
- approval status
- region or language scope
This makes source governance queryable.
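A minimal sketch of what a registry record could look like, assuming the fields listed above; the `SourceRecord` class and its field names are illustrative, not a fixed Glue Data Catalog schema (in Glue these would typically live as table properties or parameters):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceRecord:
    """One registered data source; field names are illustrative."""
    source_id: str
    source_type: str        # "policy" | "catalog" | "editorial" | "tool_api"
    owner: str
    freshness_sla_hours: int
    approval_status: str    # "approved" | "pending" | "deprecated"
    scope: str              # region or language, e.g. "JP" or "en-US"

# With records in one place, governance questions become simple filters:
registry = [
    SourceRecord("policy-returns-jp", "policy", "cx-team", 168, "approved", "JP"),
    SourceRecord("catalog-main", "catalog", "catalog-team", 24, "pending", "global"),
]
unapproved = [s.source_id for s in registry if s.approval_status != "approved"]
```

The same filter pattern answers "which production sources are missing an owner" or "which sources are past their freshness SLA" without bespoke tooling.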
2. Propagate Source Metadata Through Retrieval
When retrieval happens, keep source IDs attached to:
- retrieved chunks
- catalog records
- tool responses
- reranked context objects
Do not strip metadata during prompt assembly. The orchestration layer must preserve provenance all the way to the answer object.
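One way to keep provenance attached during prompt assembly is to build the context string and a parallel provenance list in the same pass, so neither can drift from the other. The chunk shape (`text` plus `source_id`) is an assumption for illustration, not a specific retriever's schema:

```python
def assemble_prompt(chunks):
    """Build prompt context while recording which source fed each segment.

    `chunks` is a list of dicts like {"text": ..., "source_id": ...};
    this shape is an assumption, not a fixed framework schema.
    """
    context_parts, provenance = [], []
    for i, chunk in enumerate(chunks, start=1):
        context_parts.append(f"[{i}] {chunk['text']}")
        provenance.append({"index": i, "source_id": chunk["source_id"]})
    prompt = "Context:\n" + "\n".join(context_parts)
    return prompt, provenance
```

Returning the provenance list alongside the prompt forces the orchestration layer to carry it forward; stripping it would require an explicit, reviewable change.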
3. Store Answer-to-Source Links
For each high-value response, capture:
- request ID
- cited source IDs
- non-cited supporting source IDs
- model and prompt version
- whether the final answer was tool-derived or FM-composed
This is useful for both user-visible citations and internal audits.
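The per-answer link record above can be sketched as a small builder; the field names are illustrative placeholders for whatever the audit pipeline actually stores:

```python
def build_answer_trace(request_id, cited_ids, supporting_ids,
                       model_version, prompt_version, tool_derived):
    """Assemble the answer-to-source record described above.

    All field names are illustrative, not a required schema.
    """
    return {
        "request_id": request_id,
        "cited_source_ids": list(cited_ids),
        "supporting_source_ids": list(supporting_ids),
        "model_version": model_version,
        "prompt_version": prompt_version,
        "answer_origin": "tool" if tool_derived else "fm",
    }
```

Keeping cited and non-cited supporting sources in separate fields lets the UI show only the former while audits can still query the latter.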
4. User-Facing Attribution
When appropriate, the response should show:
- source title
- source type
- freshness cue
- link or reference label
For example, a returns answer might cite "JP Manga Returns Policy" while a shipping answer cites an order-status tool result plus a shipping policy page.
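A citation line combining the four elements above might be rendered like this; the `source` keys mirror the registry sketch and are assumptions, and the exact display format is a product decision:

```python
def render_citation(source):
    """Format one user-facing citation from source metadata.

    Expected keys (illustrative): title, type, last_updated, ref.
    """
    return (f"{source['title']} ({source['type']}, "
            f"updated {source['last_updated']}) [{source['ref']}]")
```

For the returns example, this would produce something like "JP Manga Returns Policy (policy, updated 2024-05-01) [KB-1042]".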
5. Audit Logging
Use CloudTrail and structured service logs to record who changed:
- source approval status
- metadata tags
- KB ingestion jobs
- retrieval pipeline configuration
Traceability is incomplete if we know which source answered the user but not who changed that source.
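CloudTrail captures the AWS API call itself; an application-level structured log can add the governance context. A minimal sketch of such a record, with an illustrative schema:

```python
import json
from datetime import datetime, timezone

def source_change_event(actor, action, source_id, detail):
    """Emit one structured audit record for a source-governance change.

    Complements CloudTrail rather than replacing it; the schema
    here is an assumption for illustration.
    """
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,          # e.g. "approval_status_changed"
        "source_id": source_id,
        "detail": detail,
    })
```

Emitting these as JSON lines makes "who changed source X, and when" a log query rather than an archaeology exercise.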
Acceptance Criteria
- Approved sources are registered with owner, freshness, and scope metadata.
- Retrieval and prompt assembly preserve source provenance through the whole answer path.
- High-risk answers can be mapped back to exact source IDs after the fact.
- User-visible citations are available for source-backed answers where appropriate.
- Source-change events are auditable.
Signals and Metrics
- percentage of source-backed answers with usable provenance
- citation coverage rate on policy and catalog answers
- time to trace a disputed answer back to a source
- stale-source incident rate
- number of production sources missing owner or freshness metadata
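The first metric above can be computed directly from the answer traces; the input shape (a `source_backed` flag plus a `source_ids` list per answer) is an assumption for illustration:

```python
def provenance_coverage(answers):
    """Share of source-backed answers whose trace lists at least one source ID.

    `answers` is a list of dicts with a boolean `source_backed` and a
    `source_ids` list; this shape is illustrative.
    """
    backed = [a for a in answers if a["source_backed"]]
    if not backed:
        return 1.0  # vacuously covered when nothing is source-backed
    traced = sum(1 for a in backed if a["source_ids"])
    return traced / len(backed)
```

Tracking this ratio over time surfaces provenance regressions (e.g. a pipeline change that silently drops metadata) before a disputed answer does.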
Failure Modes and Tradeoffs
- Source metadata can get dropped during orchestration. Mitigation: make provenance part of the response schema.
- Too many citations can clutter UX. Mitigation: separate internal provenance from user-facing evidence.
- Unregistered shadow sources can appear in rushed projects. Mitigation: block production retrieval from unapproved datasets.
Interview Takeaway
Traceability is the backbone of trustworthy GenAI operations. If you cannot answer "which source produced this answer and who approved it," governance is weak.