Obsidian Metadata
| Field | Value |
| --- | --- |
| channel | Hamel Husain |
| url | https://www.youtube.com/watch?v=GL0XhAj5LPE |
| published | 2025-05-07 |
Description
LLM Evals Course for Engineers (35% Discount): http://bit.ly/eval-discount
Shreya Shankar and Hamel Husain discuss common mistakes people make when creating domain-specific evals.
Chapters:
- 00:51 Foundation model benchmarks are not the same as your application evals
- 03:00 Generic Evals Are Useless
- 04:00 Do not outsource labeling & prompting to non-domain experts
- 09:28 You should make your own data annotation app
- 12:40 Your LLM prompts should be specific and grounded in error analysis
- 15:25 Use binary labels
- 18:57 Look at your data
- 23:41 Be careful of overfitting to test data
- 25:40 Do online tests
Summary
Shreya Shankar and Hamel Husain discuss common pitfalls in creating effective domain-specific LLM evaluations. They emphasize that generic foundation model benchmarks are inadequate for application-specific needs, and that outsourcing labeling or prompting to non-domain experts leads to poor results. Key recommendations include building custom data annotation tools, writing specific LLM judge prompts grounded in thorough error analysis, using binary labels for clarity, diligently reviewing raw data, guarding against overfitting to test sets, and running online tests for real-world validation.
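To make the advice about specific, binary judge prompts concrete, here is a minimal sketch of an LLM-as-judge check. This is not the exact approach from the video: the failure modes and prompt wording are illustrative placeholders you would replace with findings from your own error analysis, and `call_llm` stands for whatever function your stack uses to send a prompt to a model and return its text.

```python
# Hypothetical sketch of a binary LLM-as-judge; the failure modes and prompt
# wording are placeholders standing in for real error-analysis findings, and
# `call_llm` is any function that takes a prompt string and returns model text.
from typing import Callable

# Failure modes observed during error analysis (illustrative examples only).
FAILURE_MODES = [
    "Response invents order details not present in the customer record.",
    "Response ignores the customer's stated refund deadline.",
]

JUDGE_PROMPT = """You are evaluating a customer-support assistant.

Known failure modes to check for:
{failure_modes}

Conversation:
{conversation}

Assistant response:
{response}

Answer with exactly one word, PASS or FAIL.
A response FAILS if it exhibits any failure mode above."""


def judge(conversation: str, response: str, call_llm: Callable[[str], str]) -> bool:
    """Return a binary label: True for pass, False for fail (no 1-5 scores)."""
    prompt = JUDGE_PROMPT.format(
        failure_modes="\n".join(f"- {m}" for m in FAILURE_MODES),
        conversation=conversation,
        response=response,
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("PASS")
```

A binary pass/fail verdict like this is easier to aggregate and audit than a 1-5 score, which is the point of the "use binary labels" recommendation.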
Key Takeaways
- Foundation model benchmarks are not suitable for application-specific evaluations.
- Generic evaluations provide little practical value.
- Labeling and prompting for evals should not be outsourced to non-domain experts.
- Develop your own data annotation application tailored to your needs (a minimal sketch follows this list).
- LLM prompts for evaluation should be specific and grounded in error analysis.
- Employ binary labels for clearer evaluation outcomes.
- Thoroughly inspect and understand your data.
- Guard against overfitting to your test data.
- Implement online tests for real-world performance validation.
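A custom annotation app does not have to be elaborate. Below is a minimal command-line sketch, assuming traces are logged to a hypothetical `traces.jsonl` file with `input` and `output` fields and that labels are appended to `labels.jsonl`; it is not the tool shown in the video, just one way a domain expert could label traces with a binary verdict and a note.

```python
# Hypothetical minimal annotation loop (not the app from the video).
# Reads logged traces from traces.jsonl, shows each one to the reviewer,
# collects a binary pass/fail label plus a free-form note, and appends
# the result to labels.jsonl for later error analysis.
import json
from pathlib import Path

TRACES_PATH = Path("traces.jsonl")   # assumed input file of logged LLM traces
LABELS_PATH = Path("labels.jsonl")   # output file of expert labels


def annotate() -> None:
    with TRACES_PATH.open() as traces, LABELS_PATH.open("a") as labels:
        for line in traces:
            trace = json.loads(line)
            print("\n--- INPUT ---\n" + trace["input"])
            print("--- OUTPUT ---\n" + trace["output"])
            verdict = ""
            while verdict not in ("p", "f"):
                verdict = input("Label [p]ass / [f]ail: ").strip().lower()
            note = input("Note (why it failed, or blank): ").strip()
            labels.write(json.dumps({
                "input": trace["input"],
                "output": trace["output"],
                "pass": verdict == "p",
                "note": note,
            }) + "\n")


if __name__ == "__main__":
    annotate()
```

Keeping the tool this close to your own data is also what makes "look at your data" cheap enough to actually do.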
Mindmap
    graph TD
      A[LLM Evals: Common Mistakes] --> B{Evaluation Strategy}
      A --> C{Data & Labeling}
      A --> D{Prompting & Analysis}
      A --> E{Testing & Validation}
      B --> B1[Avoid Generic Benchmarks]
      B --> B2[Generic Evals Are Useless]
      C --> C1[Don't Outsource Labeling & Prompting]
      C --> C2[Make Your Own Data Annotation App]
      C --> C3[Look at Your Data]
      C --> C4[Use Binary Labels]
      D --> D1[LLM Prompts Specific & Grounded in Error Analysis]
      E --> E1[Be Careful of Overfitting to Test Data]
      E --> E2[Do Online Tests]
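The Testing & Validation branch of the mindmap (overfitting to test data) can be illustrated with a small sketch: split the labeled traces into a dev set you iterate against and a held-out test set you check only occasionally. The `labels.jsonl` file name and the split ratio below are assumptions carried over from the annotation sketch, not details from the video.

```python
# Hypothetical sketch of keeping a held-out test split of labeled traces,
# so that prompt iteration on the dev set does not quietly overfit the test set.
import json
import random
from pathlib import Path

LABELS_PATH = Path("labels.jsonl")  # expert labels from the annotation step (assumed)


def split_labels(test_fraction: float = 0.3, seed: int = 0):
    """Shuffle labeled traces deterministically and carve off a held-out test set."""
    labeled = [json.loads(line) for line in LABELS_PATH.open()]
    random.Random(seed).shuffle(labeled)
    n_test = int(len(labeled) * test_fraction)
    return labeled[n_test:], labeled[:n_test]  # (dev_set, test_set)


def pass_rate(examples, judge_fn) -> float:
    """Fraction of examples the binary judge marks as passing."""
    verdicts = [judge_fn(ex["input"], ex["output"]) for ex in examples]
    return sum(verdicts) / max(len(verdicts), 1)
```

Iterate on prompts against the dev set and look at the test-set pass rate only rarely; if you tune against it every run, it stops telling you anything.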
Notable Quotes
- 00:51: Foundation model benchmarks are not the same as your application evals
- 03:00: Generic Evals Are Useless
- 04:00: Do not outsource labeling & prompting to non-domain experts
- 09:28: You should make your own data annotation app
- 12:40: Your LLM prompts should be specific and grounded in error analysis
- 15:25: Use binary labels
- 18:57: Look at your data
- 23:41: Be careful of overfitting to test data
- 25:40: Do online tests
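For the 25:40 point on online tests, here is one hedged sketch of a lightweight online check: sample a small fraction of live traffic and score it with the same binary judge used offline. The sampling rate, log file name, and `judge_fn` signature are illustrative assumptions, not details from the discussion.

```python
# Hypothetical sketch of an online spot-check: judge a small random sample of
# live traffic and append the verdicts to a log for monitoring over time.
import json
import random
import time

SAMPLE_RATE = 0.05  # judge roughly 5% of live traffic (assumed value)


def maybe_evaluate_online(user_input: str, model_output: str, judge_fn) -> None:
    """With probability SAMPLE_RATE, run the binary judge and log the verdict."""
    if random.random() > SAMPLE_RATE:
        return
    passed = judge_fn(user_input, model_output)
    record = {
        "ts": time.time(),
        "input": user_input,
        "output": model_output,
        "pass": passed,
    }
    with open("online_evals.jsonl", "a") as log:  # stand-in for a real metrics sink
        log.write(json.dumps(record) + "\n")
```

Offline evals on held-out labels and online spot-checks like this answer different questions, which is why the recommendation to run online tests complements the offline suite rather than replacing it.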