
Evaluation Systems for GenAI

Golden-set design, LLM-as-judge with cross-family bias mitigation, drift detection, and the offline → shadow → live promotion ladder. Includes the agent-performance framework with an explicit failure-mode taxonomy.
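To make the promotion ladder concrete, here is a minimal sketch of the gating logic: clear the offline gate on golden-set recall, then the shadow gate on judge agreement, before becoming eligible for live traffic. The function names, thresholds, and signals are illustrative assumptions, not this repo's implementation.

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    stage: str      # "offline", "shadow", or "live"
    passed: bool
    reason: str

def promote(offline_recall: float, shadow_agreement: float,
            min_recall: float = 0.90, min_agreement: float = 0.95) -> list[GateResult]:
    """Walk the offline -> shadow -> live ladder, stopping at the first failed gate."""
    results = [GateResult("offline", offline_recall >= min_recall,
                          f"golden-set recall {offline_recall:.2f} vs gate {min_recall:.2f}")]
    if not results[-1].passed:
        return results
    results.append(GateResult("shadow", shadow_agreement >= min_agreement,
                              f"shadow agreement {shadow_agreement:.2f} vs gate {min_agreement:.2f}"))
    if results[-1].passed:
        results.append(GateResult("live", True, "all gates cleared; eligible for live canary"))
    return results
```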

Interview talking points

  • How do you evaluate an LLM-backed agent? Lead with: heuristic recall + cross-family LLM judge + human spot-checks; never a same-family judge (judge-selection sketch after this list).
  • What's in your golden set? Tiered (silver / gold), expected + forbidden keywords, and an explicit notes column for intent. Refresh cadence matters. (Schema sketch below.)
  • How do you detect drift? Acceptance rate over a rolling window, paired against eval recall on the golden set; alarm when they diverge (drift sketch below).
  • Failure-mode taxonomy. Intent / Tool / Reasoning / Escalation — see 07-agent-performance-framework.md for the diagram; a minimal enum follows below.
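For the cross-family judge point, a minimal guard might look like the following. The model names and family map are hypothetical stand-ins for whatever routing config the eval harness actually uses; the point is only that the judge must never share a family with the model under test.

```python
# Hypothetical model -> provider-family map; real routing would live in config.
MODEL_FAMILY = {
    "gpt-4o": "openai",
    "claude-3-5-sonnet": "anthropic",
    "gemini-1.5-pro": "google",
}

def pick_judge(model_under_test: str, candidates: list[str]) -> str:
    """Return the first candidate judge from a different family than the tested model."""
    family = MODEL_FAMILY[model_under_test]
    for judge in candidates:
        if MODEL_FAMILY.get(judge) != family:
            return judge
    raise ValueError(f"no cross-family judge available for {model_under_test!r}")
```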
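A sketch of one golden-set row and the keyword heuristic it drives. Field names are illustrative assumptions; what matters is the tiering, the expected/forbidden keyword pair, and the notes column that records why each case exists.

```python
from dataclasses import dataclass, field

@dataclass
class GoldenCase:
    """One golden-set row; field names are illustrative."""
    prompt: str
    tier: str                     # "silver" (broad coverage) or "gold" (curated, high-stakes)
    expected_keywords: list[str]  # all must appear in a passing answer
    forbidden_keywords: list[str] = field(default_factory=list)  # none may appear
    notes: str = ""               # explicit intent: why this case is in the set

def heuristic_pass(answer: str, case: GoldenCase) -> bool:
    """Keyword heuristic: all expected terms present, no forbidden term present."""
    text = answer.lower()
    has_expected = all(k.lower() in text for k in case.expected_keywords)
    has_forbidden = any(k.lower() in text for k in case.forbidden_keywords)
    return has_expected and not has_forbidden

def golden_recall(answers: dict[str, str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases whose answer clears the keyword heuristic."""
    passed = sum(heuristic_pass(answers.get(c.prompt, ""), c) for c in cases)
    return passed / len(cases)
```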
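A rolling-window sketch of the drift check. Treating "diverge" as an absolute gap between live acceptance rate and golden-set recall is one simple reading; the window size and threshold are assumed values, not tuned ones.

```python
from collections import deque

class DriftMonitor:
    """Pairs a rolling live acceptance rate against golden-set eval recall."""

    def __init__(self, window: int = 500, max_gap: float = 0.10):
        self.accepted: deque[bool] = deque(maxlen=window)
        self.max_gap = max_gap

    def record(self, accepted: bool) -> None:
        """Log one live interaction: did the user accept the agent's output?"""
        self.accepted.append(accepted)

    def acceptance_rate(self) -> float:
        return sum(self.accepted) / len(self.accepted) if self.accepted else 0.0

    def diverged(self, golden_recall: float) -> bool:
        """Alarm when live acceptance and golden recall drift apart."""
        return abs(self.acceptance_rate() - golden_recall) > self.max_gap
```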
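And the four-way taxonomy as a minimal enum, with a helper for per-mode failure shares. The comments paraphrase the usual meaning of each mode; they are assumptions, not quotes from 07-agent-performance-framework.md.

```python
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    INTENT = "intent"          # misread what the user actually asked for
    TOOL = "tool"              # wrong tool, bad arguments, or a tool error mishandled
    REASONING = "reasoning"    # right intent and tools, but a flawed chain of reasoning
    ESCALATION = "escalation"  # failed to hand off to a human when it should have

def failure_breakdown(labels: list[FailureMode]) -> dict[str, float]:
    """Share of labelled failures falling into each mode."""
    counts = Counter(labels)
    total = len(labels) or 1
    return {mode.value: counts[mode] / total for mode in FailureMode}
```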

Files in this folder

| File | Title |
| --- | --- |
| 01-fm-output-quality-assessment.md | 01: FM Output Quality Assessment Framework |
| 02-model-evaluation-optimal-configuration.md | 02: Model Evaluation and Optimal Configuration |
| 03-user-centered-evaluation.md | 03: User-Centered Evaluation Systems |
| 04-quality-assurance-processes.md | 04: Quality Assurance Processes |
| 05-comprehensive-assessment-multi-perspective.md | 05: Comprehensive Assessment from Multiple Perspectives |
| 06-retrieval-quality-testing.md | 06: Retrieval Quality Testing |
| 07-agent-performance-framework.md | 07: Agent Performance Framework |
| 08-reporting-visualization-systems.md | 08: Reporting and Visualization Systems |
| 09-deployment-validation-systems.md | 09: Deployment Validation Systems |
| README.md | Evaluation Systems for GenAI — MangaAssist |

Back to the home page.