AI Evaluations Crash Course in 50 Minutes (Real Example) | Hamel Husain

Learn the exact 4-step process used by PMs and engineers at companies like OpenAI, Anthropic, and Google to build reliable AI evaluations

Sep 28, 2025


Dear subscribers,

Today, I want to share a new episode with Hamel Husain.

Hamel has trained 2,000+ PMs and engineers from companies like OpenAI, Anthropic, and Google on how to run AI evals. In my new episode, he shares a free master class on how to build evals for a real AI agent in just 50 minutes using a simple spreadsheet. I learned a lot from Hamel and I think you will too!

Watch now on YouTube, Apple, and Spotify.

If you enjoyed this tutorial, Hamel is also offering $1,330 off his AI evaluations course to readers of this newsletter. Sign up for his last cohort of the year by 10/6.

Hamel and I talked about:

(00:00) What the most valuable part of evals is
(01:25) Live walkthrough: Analyzing 100 real production traces
(09:50) Creating the eval criteria using a simple spreadsheet
(24:44) Why binary pass/fail ratings beat 1-5 scores every time
(28:52) The agreement metric trap that fools most PMs
(30:08) True positive and negative rates explained
(36:00) How to set up continuous evals in production

Top 10 takeaways I learned from this episode

Skip generic eval criteria and evaluate specific product problems instead. “Generic evals don’t measure the most important problems with your AI product.” Instead of “helpfulness” or “correctness,” create evals for specific product issues like “human handoff failure” or “tour scheduling issue.”
Follow Hamel’s 4-step eval process.
1. Start by manually labeling 100+ AI conversations (traces):
  Paste your manual labels into a spreadsheet and categorize them with AI
2. Use data analysis to identify and count the most common issues:
  Create a simple pivot table to count issues by category
3. Create LLM judges with binary pass/fail labels. Validate judges using true positive and true negative rates instead of only alignment (overall agreement with your manual labels). More below.
4. Deploy LLM judges to production and do manual data labeling periodically.
You can do all of the above using a simple spreadsheet. Hamel links to his sheet for evaluating a real AI agent using the steps above.
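To make steps 2 and 3 concrete, here is a minimal Python sketch of the same arithmetic a spreadsheet would do. It is not Hamel's sheet or code: the function names `count_issues` and `judge_rates` and the sample data are hypothetical, and it assumes that "positive" means a pass label. The first function is the pivot-table count of issues by category. The second compares an LLM judge's pass/fail verdicts with your manual labels and reports overall agreement next to the true positive and true negative rates.

```python
from collections import Counter


def count_issues(labels):
    """Pivot-table equivalent: count labeled traces per issue category,
    most common category first."""
    return Counter(labels).most_common()


def judge_rates(human, judge):
    """Compare binary judge verdicts with human labels (True = pass).

    Assumption for this sketch: "positive" means pass, so
    TPR = share of human-passed traces the judge also passed, and
    TNR = share of human-failed traces the judge also failed.
    """
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)              # both say pass
    tn = sum((not h) and (not j) for h, j in pairs)  # both say fail
    n_pass = sum(human)
    n_fail = len(human) - n_pass
    return {
        "agreement": (tp + tn) / len(pairs),
        "tpr": tp / n_pass if n_pass else None,
        "tnr": tn / n_fail if n_fail else None,
    }


# Hypothetical manual labels, using the issue names mentioned above.
issue_labels = [
    "human handoff failure",
    "tour scheduling issue",
    "human handoff failure",
]
issue_counts = count_issues(issue_labels)

# Hypothetical imbalanced sample: 90 traces pass, 10 fail,
# and a lazy judge that passes everything.
human_labels = [True] * 90 + [False] * 10
judge_labels = [True] * 100
rates = judge_rates(human_labels, judge_labels)
print(issue_counts)
print(rates)
```

In this made-up sample the judge agrees with the manual labels on 90% of traces yet never catches a single failure, so its true negative rate is 0. That is why agreement alone can look healthy when most traces pass, and why checking both rates is the safer validation.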

