# Evaluation Systems for GenAI
Golden-set design, LLM-as-judge with cross-family bias mitigation, drift detection, and the offline → shadow → live promotion ladder. Includes the agent-performance framework with explicit failure-mode taxonomy.
## Interview talking points
- How do you evaluate an LLM-backed agent? Lead with: heuristic recall + cross-family LLM judge + human spot-checks; never same-family judge.
- What's in your golden set? Tiered (silver / gold), expected + forbidden keywords, an explicit notes column for intent. Refresh cadence matters.
- How do you detect drift? Acceptance-rate over a rolling window, paired against eval recall on the golden set; alarms when they diverge.
- Failure-mode taxonomy. Intent / Tool / Reasoning / Escalation; see 07-agent-performance-framework.md for the diagram.
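The "never same-family judge" rule can be enforced mechanically at eval time. A minimal sketch, assuming a hypothetical judge pool keyed by model family (names and IDs here are illustrative, not real endpoints):

```python
# Pick an LLM judge from a different model family than the model under
# evaluation, to avoid same-family preference bias. Pool is a placeholder.
JUDGE_POOL = {
    "openai": ["gpt-judge"],
    "anthropic": ["claude-judge"],
    "google": ["gemini-judge"],
}

def pick_cross_family_judge(candidate_family: str) -> str:
    """Return a judge model from any family other than the candidate's."""
    for family, judges in JUDGE_POOL.items():
        if family != candidate_family and judges:
            return judges[0]
    raise ValueError(f"no cross-family judge available for {candidate_family!r}")
```

Making the rule a hard failure (rather than a convention) keeps a misconfigured pipeline from silently scoring a model with its own sibling.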
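The golden-set row shape described above (tier, expected and forbidden keywords, a notes column for intent) might look like the following sketch; the field names and the heuristic-recall check are assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    tier: str                 # "silver" (auto-mined) or "gold" (human-curated)
    prompt: str
    expected_keywords: list = field(default_factory=list)
    forbidden_keywords: list = field(default_factory=list)
    notes: str = ""           # explicit intent: why this example exists

def heuristic_check(example: GoldenExample, response: str) -> bool:
    """Pass if every expected keyword appears and no forbidden keyword does."""
    text = response.lower()
    return (all(k.lower() in text for k in example.expected_keywords)
            and not any(k.lower() in text for k in example.forbidden_keywords))
```

The notes column matters on refresh: without recorded intent, nobody can tell whether a stale example should be updated or retired.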
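The drift signal pairs a rolling live acceptance rate with golden-set eval recall and alarms when they diverge. A minimal sketch, with the window size and tolerated gap as assumed parameters:

```python
from collections import deque

class DriftMonitor:
    """Alarm when live acceptance rate diverges from golden-set eval recall."""

    def __init__(self, window: int = 500, max_gap: float = 0.15):
        self.accepts = deque(maxlen=window)  # rolling window of live outcomes
        self.max_gap = max_gap               # tolerated divergence (assumed)

    def record(self, accepted: bool) -> None:
        self.accepts.append(1 if accepted else 0)

    def acceptance_rate(self) -> float:
        return sum(self.accepts) / len(self.accepts) if self.accepts else 0.0

    def diverged(self, eval_recall: float) -> bool:
        # Alarm: offline recall says quality held, but users disagree (or
        # vice versa) by more than the tolerated gap.
        return abs(self.acceptance_rate() - eval_recall) > self.max_gap
```

Pairing the two metrics is the point: acceptance rate alone confounds model drift with traffic-mix drift, while golden-set recall alone cannot see new traffic at all.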
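The four-way taxonomy from 07-agent-performance-framework.md can be pinned down as an enum so that every logged failure lands in exactly one bucket; the value strings and comments are illustrative glosses, not quotes from that doc:

```python
from enum import Enum

class FailureMode(Enum):
    INTENT = "intent"          # misread what the user actually asked for
    TOOL = "tool"              # wrong tool call, or right tool with bad arguments
    REASONING = "reasoning"    # right inputs, flawed chain of reasoning
    ESCALATION = "escalation"  # failed to hand off to a human when it should have
```

An enum (rather than free-text tags) keeps failure counts aggregatable across runs and makes an unclassified failure a visible error instead of a silent new category.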
## Files in this folder
| File | Title |
|---|---|
| 01-fm-output-quality-assessment.md | 01: FM Output Quality Assessment Framework |
| 02-model-evaluation-optimal-configuration.md | 02: Model Evaluation and Optimal Configuration |
| 03-user-centered-evaluation.md | 03: User-Centered Evaluation Systems |
| 04-quality-assurance-processes.md | 04: Quality Assurance Processes |
| 05-comprehensive-assessment-multi-perspective.md | 05: Comprehensive Assessment from Multiple Perspectives |
| 06-retrieval-quality-testing.md | 06: Retrieval Quality Testing |
| 07-agent-performance-framework.md | 07: Agent Performance Framework |
| 08-reporting-visualization-systems.md | 08: Reporting and Visualization Systems |
| 09-deployment-validation-systems.md | 09: Deployment Validation Systems |
| README.md | Evaluation Systems for GenAI — MangaAssist |
Back to the home page.