Obsidian Metadata
| field | value |
|---|---|
| channel | Peter Yang |
| url | https://www.youtube.com/watch?v=uiza7wp1KrE |
| published | 2025-09-28 |
| categories | Youtube |
| author | Hamel Husain |
1. The Foundational Step: Data Accounting and Error Analysis
The most valuable step in any AI evaluation process is a manual review of your system’s interaction logs, known as traces 04:40. This process, often called Data Accounting or Open Coding, provides product-centric insights that generic metrics cannot 09:04.
Key Steps for Initial Analysis
- Review Traces: Manually inspect about 100 traces (conversations/logs) from your system 04:55.
- Write Notes: For each trace, write a simple note detailing what went wrong from the user’s perspective 05:00. The goal is to observe the problem, not to perform a full root cause analysis or write a solution 05:42.
- Identify Themes (Axial Coding): Group the free-form notes into a small set of recurring error categories so that each trace carries a category label.
- Prioritize Problems: Use a pivot table to count the frequency of each error category 14:18 (see the sketch after this list). This immediately highlights the most painful and frequent issues that need fixing, such as a “human handoff transfer issue” or “conversational flow issues” 14:31.
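A minimal sketch of the counting step, assuming the open-coding notes live in a CSV with hypothetical `trace_id` and `error_category` columns:

```python
# Count how often each labeled error category appears so the most frequent
# failures rise to the top. File name and column names are assumptions.
import pandas as pd

notes = pd.read_csv("trace_notes.csv")  # one row per reviewed trace

counts = (
    notes.pivot_table(index="error_category", values="trace_id", aggfunc="count")
    .rename(columns={"trace_id": "num_traces"})
    .sort_values("num_traces", ascending=False)
)
print(counts)
```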
2. Pre-Launch Evaluations: Synthetic Data Generation
To generate test cases before launch, or to simulate adversarial inputs, synthetic data is key.
Nuance: Generating Diverse Queries
Simply asking an LLM to “come up with plausible questions” results in homogeneous inputs that don’t explore the full problem space 18:10.
- Define Dimensions: Brainstorm critical dimensions (like different user personas, customer types, or apartment classes for a rental assistant) that are relevant to your product 17:31.
- Generate Queries from Dimensions: Feed the LLM a combination of these dimensions (e.g., “Luxury Resident,” “Standard Manager”) and ask it to generate plausible user queries for each combination 17:50. This helps uncover potential edge cases and weaknesses in the system (see the sketch after this list).
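A minimal sketch of dimension-driven generation, assuming the OpenAI Python SDK; the dimension values, model name, and prompt wording are illustrative assumptions rather than the speaker’s exact setup:

```python
# Generate one plausible query per combination of dimension values.
import itertools
from openai import OpenAI

client = OpenAI()

dimensions = {
    "persona": ["prospective renter", "current resident", "property manager"],
    "apartment_class": ["luxury", "standard"],
}

queries = []
for persona, apt_class in itertools.product(*dimensions.values()):
    prompt = (
        f"You are helping test a leasing assistant. Write one plausible, "
        f"specific question a {persona} dealing with a {apt_class} apartment might ask."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    queries.append(resp.choices[0].message.content)
```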
3. Designing Evals: Code-Based vs. LLM as a Judge
Once critical failure categories are identified, a decision must be made on the evaluation type.
| Eval Type | Use Case | Cost/Complexity | Example |
|---|---|---|---|
| Code-Based Eval | Objective, easily asserted failures 20:29. | Cheap 20:29. | Checking if a date is returned correctly or if an output has a specific structure 20:42. |
| LLM as a Judge | Subjective or complex failures where the LLM struggles with current rules 21:05. | More expensive, but high-value for iteration 21:24. | Evaluating if a human handoff was necessary and executed correctly 21:11. |
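A minimal sketch of the code-based style, assuming (purely for illustration) that the output should be JSON containing an ISO-formatted move-in date:

```python
# Cheap, objective assertions over an output string; no LLM required.
import json
import re

def eval_structured_output(output: str) -> bool:
    """Pass if the output is valid JSON with an ISO-formatted move_in_date."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    date = payload.get("move_in_date", "")
    return isinstance(date, str) and bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", date))

assert eval_structured_output('{"move_in_date": "2025-10-01"}')
assert not eval_structured_output("Sure! You can move in next Tuesday.")
```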
4. LLM as a Judge: Critical Gotchas (Measuring the Judge)
The primary challenge with LLM judges is ensuring they are trustworthy. This requires a meta-evaluation step: measuring the judge itself 22:04.
Gotcha 1: Avoid Likert Scales (1-5 Scores)
- Recommendation: Use a binary score (True/False or Pass/Fail) as the judge’s output 99% of the time 24:38.
- Why: Continuous scores (like 1-5) are often unclear and non-actionable 25:02. Seeing an average score of “3.2 versus 3.7” provides little utility and is often referred to as “fake science” or false precision 25:55. The goal is to make a decision: “Is this feature good enough to ship? Yes or no?” 25:37. A minimal binary-judge sketch follows this list.
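A minimal sketch of a binary judge, again assuming the OpenAI SDK; the handoff failure mode, prompt wording, and model name are assumptions:

```python
# A judge that answers one narrow question with PASS or FAIL.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are reviewing a customer-support conversation.
Question: did the assistant hand off to a human when the user asked for one?
Answer with exactly one word: PASS or FAIL.

Conversation:
{trace}
"""

def judge(trace: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(trace=trace)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```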
Gotcha 2: The “Agreement” Fallacy
- Recommendation: Never use ‘Agreement’ (the percentage of times the judge matches the human label) as the main metric 28:50.
- Why: If your system’s actual error rate is low (e.g., 10%), an LLM judge can achieve a 90% “agreement” score by simply saying the system never fails 29:45. That high score is misleading.
- The Correct Metrics: You must use the True Positive Rate (TPR) and True Negative Rate (TNR), also known as sensitivity and specificity (TPR is the same quantity as recall) 30:30.
- Visualization: Analyze these metrics using a Confusion Matrix 31:06. This table breaks down where the judge is wrong (e.g., the LLM thinks an error exists when a human says there is none, a false positive) 33:14, allowing you to iterate on the judge’s prompt 34:02. A sketch for computing TPR and TNR from a confusion matrix follows this list.
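A minimal sketch of measuring the judge against human labels with scikit-learn, assuming two parallel lists of binary labels where True means “an error is present”:

```python
# Compare the LLM judge's labels to human ground truth on the same traces.
from sklearn.metrics import confusion_matrix

human = [True, False, False, True, False, False, False, True]   # ground truth
judged = [True, False, False, False, False, True, False, True]  # LLM judge

tn, fp, fn, tp = confusion_matrix(human, judged, labels=[False, True]).ravel()
tpr = tp / (tp + fn)  # of real errors, how many did the judge catch?
tnr = tn / (tn + fp)  # of clean traces, how many did the judge correctly pass?
print(f"TPR={tpr:.2f}  TNR={tnr:.2f}")
```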
5. Production and Ongoing Maintenance
Continuous Use of Evals
- CI Integration: Place your trusted LLM judges in your Continuous Integration (CI) pipeline to test against a set of held-aside data every time a code change is made 36:15 (see the pytest sketch after this list).
- Production Monitoring: Run your judges on a sample or large portion of production traces to continuously monitor for known problems 36:26.
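A minimal pytest sketch of the CI step, assuming a hypothetical `run_assistant()` entry point for the system under test, the `judge()` helper sketched earlier, and a JSONL file of held-aside queries:

```python
# Re-run the system on held-aside queries and fail the build if the trusted
# judge flags a regression. Module and file names are assumptions.
import json
import pytest

from my_app import run_assistant   # hypothetical system under test
from my_evals import judge          # the binary judge sketched earlier

def load_heldout(path="heldout_queries.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f]

@pytest.mark.parametrize("example", load_heldout())
def test_handoff_behavior(example):
    trace = run_assistant(example["query"])
    assert judge(trace), f"Judge flagged a handoff failure for: {example['query']}"
```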
The Role of Human Labor
- Regular Audits: Continue to perform manual trace reviews and human labeling on a regular cadence (e.g., weekly or monthly) to ensure the system hasn’t drifted 39:39.
- Saturation Principle: Keep reviewing traces until you reach theoretical saturation, the point where you are no longer learning anything new about how the system is breaking 42:53.
- Tooling: Teams can build simple annotation tools in a few hours to make the trace-review and labeling process faster and more integrated 40:53. A minimal example follows this list.
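A minimal sketch of such a tool as a command-line labeling loop; the file names and record fields are assumptions:

```python
# Show each sampled trace, record a binary pass/fail label plus a free-form
# note, and append the result to a JSONL file for later pivot-table analysis.
import json

def annotate(in_path="sampled_traces.jsonl", out_path="labels.jsonl"):
    with open(in_path) as src, open(out_path, "a") as dst:
        for line in src:
            trace = json.loads(line)
            print("\n" + "=" * 60)
            print(trace["conversation"])
            passed = input("Pass? [y/n]: ").strip().lower() == "y"
            note = input("What went wrong (blank if nothing): ").strip()
            record = {"id": trace["id"], "pass": passed, "note": note}
            dst.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    annotate()
```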
Final Gotcha: Misusing Generic Benchmarks
- Avoid: Do not start your evaluation process by focusing on generic metrics like “helpfulness,” “toxicity,” or “hallucination score” 44:02.
- Why: Generic prompts often don’t align with the most important, application-specific problems you’ve found in your initial trace analysis, leading you to waste time 44:29.
- Advanced Use: Only use generic scores as a meta-tool or sampling mechanism 45:20. For example, you can sort all your production traces by a generic hallucination score and then manually inspect the top 1% to see if they reveal a new, interesting error category 45:13 (see the sketch after this list).
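A minimal sketch of that sampling workflow with pandas, assuming a hypothetical `hallucination_score` field on each production trace:

```python
# Use the generic score only to pick which traces to review by hand.
import pandas as pd

traces = pd.read_json("production_traces.jsonl", lines=True)
ranked = traces.sort_values("hallucination_score", ascending=False)
top_1pct = ranked.head(max(1, len(ranked) // 100))

# Manually open-code these the same way as the initial trace review.
top_1pct[["trace_id", "hallucination_score"]].to_csv("review_queue.csv", index=False)
```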

