AI Evaluations Crash Course in 50 Minutes (Real Example) | Hamel Husain
Learn the exact 4-step process used by PMs and engineers at companies like OpenAI, Anthropic, and Google to build reliable AI evaluations
Dear subscribers,
Today, I want to share a new episode with Hamel Husain.
Hamel has trained 2,000+ PMs and engineers from companies like OpenAI, Anthropic, and Google on how to run AI evals. In my new episode, he shares a free master class on how to build evals for a real AI agent in just 50 minutes using a simple spreadsheet. I learned a lot from Hamel and I think you will too!
Watch now on YouTube, Apple, and Spotify.
If you enjoyed this tutorial, Hamel is also offering $1,330 off his AI evaluations course to readers of this newsletter. Sign up for his last cohort of the year by 10/6.
Hamel and I talked about:
(00:00) What the most valuable part of evals is
(01:25) Live walkthrough: Analyzing 100 real production traces
(09:50) Creating the eval criteria using a simple spreadsheet
(24:44) Why binary pass/fail ratings beat 1-5 scores every time
(28:52) The agreement metric trap that fools most PMs
(30:08) True positive and negative rates explained
(36:00) How to set up continuous evals in production
Top 10 takeaways I learned from this episode
Skip generic eval criteria and evaluate specific product problems instead. “Generic evals don’t measure the most important problems with your AI product.” Instead of “helpfulness” or “correctness,” create evals for specific product issues like “human handoff failure” or “tour scheduling issue.”
Follow Hamel’s 4-step eval process.
Start by manually labeling 100+ AI conversations (traces).
Use data analysis to identify and count the most common issues.
Create LLM judges with binary pass/fail labels. Validate the judges using true positive and true negative rates instead of relying only on overall agreement (alignment) with human labels. More below.
Deploy the LLM judges to production and keep doing manual data labeling periodically.
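To make steps 2 and 3 concrete, here is a minimal Python sketch of the same arithmetic you would do in a spreadsheet. Everything in it is an assumption for illustration: the ten traces are made up (they are not from Hamel's sheet), the field names `human`, `issue`, and `judge` are hypothetical, and "pass" is treated as the positive class.

```python
from collections import Counter

# Hypothetical hand-labeled traces (made-up data, not from Hamel's sheet).
# "human" is the manual pass/fail label, "issue" is the failure category
# noted during review, and "judge" is the LLM judge's pass/fail verdict.
TRACES = (
    [{"human": "pass", "issue": None, "judge": "pass"}] * 7
    + [
        {"human": "fail", "issue": "human handoff failure", "judge": "pass"},
        {"human": "fail", "issue": "human handoff failure", "judge": "fail"},
        {"human": "fail", "issue": "tour scheduling issue", "judge": "pass"},
    ]
)


def count_issues(traces):
    """Step 2: count each failure category, most common first."""
    failed = [t["issue"] for t in traces if t["human"] == "fail"]
    return Counter(failed).most_common()


def judge_rates(traces):
    """Step 3: compare the judge's verdicts against the human labels.

    Assumption: "pass" is the positive class, so the true positive rate
    is the share of human-passed traces the judge also passed, and the
    true negative rate is the share of human-failed traces it also failed.
    """
    total = len(traces)
    tp = sum(t["human"] == "pass" and t["judge"] == "pass" for t in traces)
    tn = sum(t["human"] == "fail" and t["judge"] == "fail" for t in traces)
    human_pass = sum(t["human"] == "pass" for t in traces)
    human_fail = total - human_pass
    return {
        "agreement": (tp + tn) / total if total else None,
        "tpr": tp / human_pass if human_pass else None,
        "tnr": tn / human_fail if human_fail else None,
    }


if __name__ == "__main__":
    print(count_issues(TRACES))
    print(judge_rates(TRACES))
```

With this made-up data, the judge agrees with the human labels on 8 of 10 traces, which looks fine on its own. But it catches only 1 of the 3 real failures, so the true negative rate is about 33%. That gap is why agreement alone can be misleading when most traces pass.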
You can do all of the above using a simple spreadsheet. Here’s a link to Hamel’s sheet to evaluate a real AI agent using the steps above: