# Evaluation Systems for GenAI
Golden-set design, LLM-as-judge with cross-family bias mitigation, drift detection, and the offline → shadow → live promotion ladder. Includes the agent-performance framework with explicit failure-mode taxonomy.
## Interview talking points
- How do you evaluate an LLM-backed agent? Lead with: heuristic recall + cross-family LLM judge + human spot-checks; never same-family judge.
- What's in your golden set? Tiered (silver / gold), expected + forbidden keywords, an explicit notes column for intent. Refresh cadence matters.
- How do you detect drift? Acceptance-rate over a rolling window, paired against eval recall on the golden set; alarms when they diverge.
- Failure-mode taxonomy. Intent / Tool / Reasoning / Escalation; see 07-agent-performance-framework.md for the diagram.
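The "never same-family judge" rule can be enforced mechanically at eval time. A minimal sketch, assuming a hypothetical judge pool keyed by model family (names and IDs here are illustrative, not real endpoints):

```python
# Pick an LLM judge from a different model family than the model under
# evaluation, to avoid same-family preference bias. Pool is a placeholder.
JUDGE_POOL = {
    "openai": ["gpt-judge"],
    "anthropic": ["claude-judge"],
    "google": ["gemini-judge"],
}

def pick_cross_family_judge(candidate_family: str) -> str:
    """Return a judge model from any family other than the candidate's."""
    for family, judges in JUDGE_POOL.items():
        if family != candidate_family and judges:
            return judges[0]
    raise ValueError(f"no cross-family judge available for {candidate_family!r}")
```

Making the rule a hard failure (rather than a convention) keeps a misconfigured pipeline from silently scoring a model with its own sibling.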
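The golden-set row shape described above (tier, expected and forbidden keywords, a notes column for intent) might look like the following sketch; the field names and the heuristic-recall check are assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    tier: str                 # "silver" (auto-mined) or "gold" (human-curated)
    prompt: str
    expected_keywords: list = field(default_factory=list)
    forbidden_keywords: list = field(default_factory=list)
    notes: str = ""           # explicit intent: why this example exists

def heuristic_check(example: GoldenExample, response: str) -> bool:
    """Pass if every expected keyword appears and no forbidden keyword does."""
    text = response.lower()
    return (all(k.lower() in text for k in example.expected_keywords)
            and not any(k.lower() in text for k in example.forbidden_keywords))
```

The notes column matters on refresh: without recorded intent, nobody can tell whether a stale example should be updated or retired.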
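The drift signal pairs a rolling live acceptance rate with golden-set eval recall and alarms when they diverge. A minimal sketch, with the window size and tolerated gap as assumed parameters:

```python
from collections import deque

class DriftMonitor:
    """Alarm when live acceptance rate diverges from golden-set eval recall."""

    def __init__(self, window: int = 500, max_gap: float = 0.15):
        self.accepts = deque(maxlen=window)  # rolling window of live outcomes
        self.max_gap = max_gap               # tolerated divergence (assumed)

    def record(self, accepted: bool) -> None:
        self.accepts.append(1 if accepted else 0)

    def acceptance_rate(self) -> float:
        return sum(self.accepts) / len(self.accepts) if self.accepts else 0.0

    def diverged(self, eval_recall: float) -> bool:
        # Alarm: offline recall says quality held, but users disagree (or
        # vice versa) by more than the tolerated gap.
        return abs(self.acceptance_rate() - eval_recall) > self.max_gap
```

Pairing the two metrics is the point: acceptance rate alone confounds model drift with traffic-mix drift, while golden-set recall alone cannot see new traffic at all.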
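The four-way taxonomy from 07-agent-performance-framework.md can be pinned down as an enum so that every logged failure lands in exactly one bucket; the value strings and comments are illustrative glosses, not quotes from that doc:

```python
from enum import Enum

class FailureMode(Enum):
    INTENT = "intent"          # misread what the user actually asked for
    TOOL = "tool"              # wrong tool call, or right tool with bad arguments
    REASONING = "reasoning"    # right inputs, flawed chain of reasoning
    ESCALATION = "escalation"  # failed to hand off to a human when it should have
```

An enum (rather than free-text tags) keeps failure counts aggregatable across runs and makes an unclassified failure a visible error instead of a silent new category.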
## Files in this folder
| File | Title |
|---|---|
| 01-fm-output-quality-assessment.md | 01: FM Output Quality Assessment Framework |
| 02-model-evaluation-optimal-configuration.md | 02: Model Evaluation and Optimal Configuration |
| 03-user-centered-evaluation.md | 03: User-Centered Evaluation Systems |
| 04-quality-assurance-processes.md | 04: Quality Assurance Processes |
| 05-comprehensive-assessment-multi-perspective.md | 05: Comprehensive Assessment from Multiple Perspectives |
| 06-retrieval-quality-testing.md | 06: Retrieval Quality Testing |
| 07-agent-performance-framework.md | 07: Agent Performance Framework |
| 08-reporting-visualization-systems.md | 08: Reporting and Visualization Systems |
| 09-deployment-validation-systems.md | 09: Deployment Validation Systems |
| README.md | Evaluation Systems for GenAI — MangaAssist |
Back to the home page.