Complete Beginner's Course on AI Evaluations: Step by Step | Aman Khan
Watch 2 PMs build AI evals for a customer support agent from scratch - from labeling a golden dataset to aligning LLM judges
Dear subscribers,
Today, I want to share a new episode with Aman Khan (Arize).
The best way to learn about AI evaluations is to watch 2 PMs build them live from scratch. In my new episode, Aman and I walk through creating evals for an AI customer support agent — from labeling a golden dataset to aligning LLM judges. This is the complete beginner’s AI eval course you've been waiting for.
Watch now on YouTube, Apple, and Spotify.
Aman and I talked about:
(00:00) What are AI evals and how to get good at them
(02:52) The 4 types of AI evaluations everyone should know
(06:08) Live demo: Building evals for a customer support agent
(10:29) Using Anthropic's console to generate great prompts
(15:13) Creating the evaluation criteria
(17:40) Adding human labels to the golden dataset
(31:05) Scaling evals with LLM-judge prompts
(38:21) How to align LLM judges with human judgment
Aman also teaches a live AI prototype-to-production course for PMs (get $100 off).
🎙️ Coming up next on Behind the Craft
Ranking 15 PM Skills: What AI Can't Touch vs Will Disrupt | Nan Yu
Claude Code Beginner Tutorial: Build a Movie Discovery App in 15 Minutes
Top 10 takeaways I learned from this episode
Use four types of AI evaluations. Code-based evals handle simple string checks, human evals label a golden dataset, LLM judges scale those labels, and user evals capture real-world metrics. Each serves a specific purpose in your evaluation pipeline.
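To make the first and third types concrete, here is a minimal sketch of a code-based eval (a simple string check) and an LLM-judge prompt. The criteria, function names, model choice, and use of the OpenAI Python client are my own assumptions for illustration, not the setup from the episode:

```python
# A minimal sketch, not the episode's actual setup.
# Assumes the OpenAI Python client (`pip install openai`) and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

def code_based_eval(response: str) -> bool:
    """Code-based eval: a cheap string check, e.g. the support agent
    must never promise a refund outright."""
    banned_phrases = ["guaranteed refund", "100% refund"]
    return not any(phrase in response.lower() for phrase in banned_phrases)

JUDGE_PROMPT = """You are grading a customer support reply.
Question: {question}
Reply: {reply}
Does the reply resolve the customer's question politely and accurately?
Answer with exactly one word: "pass" or "fail"."""

def llm_judge_eval(question: str, reply: str) -> str:
    """LLM-judge eval: scale up human-style grading with a judge prompt."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whatever judge model you prefer
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, reply=reply)}],
    )
    return result.choices[0].message.content.strip().lower()
```

In practice you would run the judge over the same examples a human already labeled and compare the two sets of verdicts, which is the alignment step covered at 38:21.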
PMs must do manual labeling themselves. "I never found it useful to just completely outsource the human evals to contractors. The PM has to be in the spreadsheet doing the stuff themselves to start." Doing that labeling work yourself is how you maintain product judgment.
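For a sense of what "being in the spreadsheet" can look like, here is one possible shape for a hand-labeled golden dataset and a judge-agreement check. The column names and example rows are illustrative assumptions, not data from the episode:

```python
# Illustrative only: one possible shape for a hand-labeled golden dataset.
golden_dataset = [
    {
        "question": "My order arrived damaged, can I get a replacement?",
        "agent_reply": "I'm sorry about that! I've started a replacement order for you.",
        "human_label": "pass",   # the PM's own call
        "judge_label": "pass",   # filled in later by the LLM judge
        "notes": "Apologizes and takes action without over-promising.",
    },
    {
        "question": "Can you cancel my subscription today?",
        "agent_reply": "Sure, your refund is guaranteed.",
        "human_label": "fail",
        "judge_label": "pass",
        "notes": "Promises a refund the policy doesn't allow.",
    },
]

# Before trusting the judge at scale, check how often it agrees with the PM's labels.
agreement = sum(
    row["judge_label"] == row["human_label"] for row in golden_dataset
) / len(golden_dataset)
print(f"Judge/human agreement: {agreement:.0%}")  # 50% here, so the judge prompt needs work
```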