Obsidian Metadata
| description | Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product. |
| channel | Lenny's Podcast |
| url | https://www.youtube.com/watch?v=BsWxPI9UM4c |
| duration | 1:46:32 |
| published | 2025-09-25 |
| genre | People & Blogs |
| watched | true |
Description
Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.
What you’ll learn:
- WTF evals are
- Why they’ve become the most important new skill for AI product builders
- A step-by-step walkthrough of how to create an effective eval
- A deep dive into error analysis, open coding, and axial coding
- Code-based evals vs. LLM-as-judge
- The most common pitfalls and how to avoid them
- Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)
- Insight into the debate between “vibes” and systematic evals
Brought to you by:
- Fin—The #1 AI agent for customer service: https://fin.ai/lenny
- Dscout—The UX platform to capture insights at every stage, from ideation to production: https://www.dscout.com/
- Mercury—The art of simplified finances: https://mercury.com/
Transcript: https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill
My biggest takeaways (for paid newsletter subscribers): https://www.lennysnewsletter.com/i/173871171/my-biggest-takeaways-from-this-conversation
Where to find Shreya Shankar:
- X: https://x.com/sh_reya
- LinkedIn: https://www.linkedin.com/in/shrshnk/
- Website: https://www.sh-reya.com/
- Maven course: https://bit.ly/4myp27m

Where to find Hamel Husain:
- X: https://x.com/HamelHusain
- LinkedIn: https://www.linkedin.com/in/hamelhusain/
- Website: https://hamel.dev/
- Maven course: https://bit.ly/4myp27m

Where to find Lenny:
- Newsletter: https://www.lennysnewsletter.com
- X: https://twitter.com/lennysan
- LinkedIn: https://www.linkedin.com/in/lennyrachitsky/
In this episode, we cover:
- (00:00) Introduction to Hamel and Shreya
- (04:57) What are evals?
- (09:56) Demo: Examining real traces from a property management AI assistant
- (16:51) Writing notes on errors
- (23:54) Why LLMs can’t replace humans in the initial error analysis
- (25:16) The concept of a “benevolent dictator” in the eval process
- (28:07) Theoretical saturation: when to stop
- (31:39) Using axial codes to help categorize and synthesize error notes
- (44:39) The results
- (46:06) Building an LLM-as-judge to evaluate specific failure modes
- (48:31) The difference between code-based evals and LLM-as-judge
- (52:10) Example: LLM-as-judge
- (54:45) Testing your LLM judge against human judgment
- (01:00:51) Why evals are the new PRDs for AI products
- (01:05:09) How many evals you actually need
- (01:07:41) What comes after evals
- (01:09:57) The great evals debate
- (01:15:15) Why dogfooding isn’t enough for most AI products
- (01:18:23) OpenAI’s Statsig acquisition
- (01:23:02) The Claude Code controversy and the importance of context
- (01:24:13) Common misconceptions around evals
- (01:22:28) Tips and tricks for implementing evals effectively
- (01:30:37) The time investment
- (01:33:38) Overview of their comprehensive evals course
- (01:37:57) Lightning round and final thoughts
LLM Log Open Codes Analysis Prompt: _Please analyze the following CSV file. There is a metadata field which has a nested field called z_note that contains open codes for analysis of LLM logs that we are conducting. Please extract all of the different open codes. From the z_note field, propose 5-6 categories that we can create axial codes from._
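For anyone who wants to reproduce this step outside a chat-based analysis tool like Julius, here is a minimal sketch of what the prompt above automates. It assumes a pandas-readable CSV whose metadata column holds a JSON blob with the nested z_note field; the file name, column layout, and semicolon-delimited code format are all assumptions, not the episode's actual export format:

```python
import json
from collections import Counter

import pandas as pd

# Load the exported traces. "traces.csv" and the "metadata" column name
# are assumptions -- adjust to your own log export.
df = pd.read_csv("traces.csv")

def extract_note(raw: str) -> str:
    """Pull the nested z_note open-code annotation out of a metadata blob."""
    try:
        return json.loads(raw).get("z_note", "")
    except (TypeError, json.JSONDecodeError):
        return ""

notes = df["metadata"].map(extract_note)
notes = notes[notes.str.len() > 0]

# Assume open codes are short, semicolon-separated labels, e.g.
# "handoff failure; rescheduling not confirmed". Tally how often each appears.
codes = Counter(
    code.strip().lower()
    for note in notes
    for code in note.split(";")
    if code.strip()
)

# The most frequent open codes are the raw material for axial coding:
# cluster related codes into 5-6 higher-level failure categories.
for code, count in codes.most_common(20):
    print(f"{count:4d}  {code}")
```

The frequency table is only a starting point: grouping related codes into axial categories remains a human judgment call, which is the episode's point about not outsourcing the initial error analysis to an LLM.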
Referenced:
- Building eval systems that improve your AI product: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve
- Mercor: https://mercor.com/
- Brendan Foody on LinkedIn: https://www.linkedin.com/in/brendan-foody-2995ab10b
- Nurture Boss: https://nurtureboss.io/
- Braintrust: https://www.braintrust.dev/
- Andrew Ng on X: https://x.com/andrewyng
- Carrying Out Error Analysis: https://www.youtube.com/watch?v=JoAxZsdw_3w
- Julius AI: https://julius.ai/
- Brendan Foody on X—“evals are the new PRDs”: https://x.com/BrendanFoody/status/1939764763485171948
- References continued at: https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill
Recommended books:
- Pachinko: https://www.amazon.com/Pachinko-National-Book-Award-Finalist/dp/1455563935
- Apple in China: The Capture of the World’s Greatest Company: https://www.amazon.com/Apple-China-Capture-Greatest-Company/dp/1668053373/
- Machine Learning: https://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/1259096955
- Artificial Intelligence: A Modern Approach: https://www.amazon.com/Artificial-Intelligence-Modern-Approach-Global/dp/1292401133/
Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.
Lenny may be an investor in the companies discussed.
Summary
Hamel Husain and Shreya Shankar, instructors of a popular AI evals course, demystify the process of developing effective AI evaluation systems for product builders. They explain what evals are, why they’re crucial for AI products, and provide a step-by-step walkthrough for creating them. The discussion covers practical techniques such as error analysis and open/axial coding, compares code-based evals with LLM-as-judge approaches, and calls out common pitfalls, emphasizing systematic evals over mere ‘vibes’.
Key Takeaways
- Understanding AI Evals: What evaluation systems are and their fundamental role in AI product development.
- Evals as a Core Skill: Why developing effective evals has become the most important new skill for AI product managers and engineers.
- Step-by-Step Creation: A practical guide on how to build an effective eval, including examining real AI assistant traces.
- Error Analysis Techniques: Deep dive into qualitative methods like open coding and axial coding for categorizing and synthesizing errors.
- Human vs. LLM in Evals: Why humans are indispensable for initial error analysis despite the rise of LLM-as-judge approaches.
- Code-based vs. LLM-as-Judge Evals: Distinguishing between different evaluation methodologies and when to use each (a minimal judge sketch follows this list).
- Practical Implementation: Tips for integrating evals with minimal time investment (e.g., 30 minutes per week after initial setup).
- Strategic Importance: Evals are evolving into the new Product Requirements Documents (PRDs) for AI products.
- Avoiding Pitfalls: Common misconceptions and how to circumvent them, emphasizing systematic evaluation over intuition (‘vibes’).
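To make the LLM-as-judge and judge-validation takeaways concrete, here is a minimal sketch in Python. It assumes the OpenAI client library and a gpt-4o model name; the failure mode, prompt wording, and helper names are illustrative stand-ins, not the workflow from the episode's property-management demo:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A judge targets ONE specific failure mode surfaced by error analysis --
# here, a hypothetical "promised a follow-up that never got scheduled"
# failure. The prompt wording below is illustrative only.
JUDGE_PROMPT = """You are reviewing a customer-service conversation.
Answer PASS if the assistant either scheduled the follow-up it promised
or made no such promise. Answer FAIL if it promised a follow-up that
was never scheduled. Reply with exactly one word: PASS or FAIL.

Conversation:
{trace}
"""

def judge(trace: str) -> bool:
    """Return True if the trace passes this single failure-mode check."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable model; the name is an assumption
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(trace=trace)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper() == "PASS"

def agreement(labeled_traces):
    """Score the judge against human labels on a held-out set.

    labeled_traces is a list of (trace_text, human_pass) pairs, where
    human_pass is the human reviewer's True/False verdict.
    """
    tp = tn = fp = fn = 0
    for trace, human_pass in labeled_traces:
        pred = judge(trace)
        if pred and human_pass:
            tp += 1
        elif not pred and not human_pass:
            tn += 1
        elif pred and not human_pass:
            fp += 1  # judge missed a real failure
        else:
            fn += 1  # judge flagged a non-failure
    tpr = tp / max(tp + fn, 1)  # agreement rate on human-PASS traces
    tnr = tn / max(tn + fp, 1)  # agreement rate on human-FAIL traces
    return tpr, tnr
```

Only once the judge's true positive and true negative rates against the human-labeled set are acceptably high does it make sense to trust it across all production traces.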
Mindmap
```mermaid
graph TD
    A[AI Evals: The Hottest New Skill]
    B[What are Evals?]
    C[Why Evals are Important]
    D[How to Create Effective Evals]
    E[Error Analysis & Coding]
    F[Eval Methodologies]
    G[Practical Implementation]
    H[Strategic Significance & Debates]
    A --> B
    A --> C
    A --> D
    A --> E
    A --> F
    A --> G
    A --> H
    B --> B1[Definition & Purpose]
    B --> B2["Real Traces Demo (09:56)"]
    C --> C1[New Skill for AI Product Builders]
    C --> C2["Not just vibes vs. systematic (01:09:57)"]
    C --> C3["Why dogfooding isn't enough (01:15:15)"]
    D --> D1["Step-by-Step Walkthrough"]
    D --> D2["Writing notes on errors (16:51)"]
    D --> D3["Benevolent Dictator Concept (25:16)"]
    D --> D4["When to stop: Theoretical Saturation (28:07)"]
    D --> D5["Results (44:39)"]
    E --> E1[Open Coding]
    E --> E2["Axial Coding (31:39)"]
    E --> E3[LLM Log Open Codes Analysis Prompt]
    E --> E4["Humans vs. LLMs for Initial Analysis (23:54)"]
    F --> F1["Code-based Evals (48:31)"]
    F --> F2["LLM-as-Judge (46:06)"]
    F --> F3["LLM-as-Judge Example (52:10)"]
    F --> F4["Testing LLM Judge vs. Human (54:45)"]
    G --> G1["Minimal Time Investment (01:30:37)"]
    G --> G2["Tips & Tricks (01:22:28)"]
    G --> G3["Common Misconceptions (01:24:13)"]
    H --> H1["Evals are the new PRDs (01:00:51)"]
    H --> H2["How many evals needed (01:05:09)"]
    H --> H3["What comes after evals (01:07:41)"]
    H --> H4["OpenAI's Statsig acquisition (01:18:23)"]
    H --> H5["Claude Code controversy (01:23:02)"]
    H --> H6["Overview of Evals Course (01:33:38)"]
```
Notable Quotes
Note: No transcript was provided in the input, so actual quotes could not be extracted. The topics below are listed with their timestamps for easy navigation.
- 04:57: What are evals?
- 09:56: Demo: Examining real traces from a property management AI assistant
- 23:54: Why LLMs can’t replace humans in the initial error analysis
- 25:16: The concept of a “benevolent dictator” in the eval process
- 28:07: Theoretical saturation: when to stop
- 31:39: Using axial codes to help categorize and synthesize error notes
- 48:31: The difference between code-based evals and LLM-as-judge
- 01:00:51: Why evals are the new PRDs for AI products
- 01:09:57: The great evals debate
- 01:15:15: Why dogfooding isn’t enough for most AI products
- 01:22:28: Tips and tricks for implementing evals effectively