Complete Beginner's Course on AI Evaluations: Step by Step | Aman Khan
Watch 2 PMs build AI evals for a customer support agent from scratch - from labeling a golden dataset to aligning LLM judges
Dear subscribers,
Today, I want to share a new episode with Aman Khan (Arize).
The best way to learn about AI evaluations is to watch 2 PMs build them live from scratch. In my new episode, Aman and I walk through creating evals for an AI customer support agent — from labeling a golden dataset to aligning LLM judges. This is the complete beginner’s AI eval course you've been waiting for.
Watch now on YouTube, Apple, and Spotify.
Aman and I talked about:
(00:00) What are AI evals and how to get good at them
(02:52) The 4 types of AI evaluations everyone should know
(06:08) Live demo: Building evals for a customer support agent
(10:29) Using Anthropic's console to generate great prompts
(15:13) Creating the evaluation criteria
(17:40) Adding human labels to the golden dataset
(31:05) Scaling evals with LLM-judge prompts
(38:21) How to align LLM judges with human judgment
Aman also teaches a live AI prototype-to-production course for PMs (get $100 off).
🎙️ Coming up next on Behind the Craft
Ranking 15 PM Skills: What AI Can't Touch vs Will Disrupt | Nan Yu
Claude Code Beginner Tutorial: Build a Movie Discovery App in 15 Minutes
Top 10 takeaways I learned from this episode
Use four types of AI evaluations. Code-based evals handle simple string checks, human evals label a golden dataset, LLM judges scale those labels, and user evals capture real-world metrics. Each serves a specific purpose in your evaluation pipeline.
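To make the first and third types concrete, here is a minimal sketch of a code-based eval (a simple string check) and an LLM-judge prompt. The criteria, function names, model choice, and use of the OpenAI Python client are my own assumptions for illustration, not the setup from the episode:

```python
# A minimal sketch, not the episode's actual setup.
# Assumes the OpenAI Python client (`pip install openai`) and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

def code_based_eval(response: str) -> bool:
    """Code-based eval: a cheap string check, e.g. the support agent
    must never promise a refund outright."""
    banned_phrases = ["guaranteed refund", "100% refund"]
    return not any(phrase in response.lower() for phrase in banned_phrases)

JUDGE_PROMPT = """You are grading a customer support reply.
Question: {question}
Reply: {reply}
Does the reply resolve the customer's question politely and accurately?
Answer with exactly one word: "pass" or "fail"."""

def llm_judge_eval(question: str, reply: str) -> str:
    """LLM-judge eval: scale up human-style grading with a judge prompt."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whatever judge model you prefer
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, reply=reply)}],
    )
    return result.choices[0].message.content.strip().lower()
```

In practice you would run the judge over the same examples a human already labeled and compare the two sets of verdicts, which is the alignment step covered at 38:21.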
PMs must do manual labeling themselves. "I never found it useful to just completely outsource the human evals to contractors. The PM has to be in the spreadsheet doing the stuff themselves to start." Doing that labeling work yourself is how you maintain product judgment.
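For a sense of what "being in the spreadsheet" can look like, here is one possible shape for a hand-labeled golden dataset and a judge-agreement check. The column names and example rows are illustrative assumptions, not data from the episode:

```python
# Illustrative only: one possible shape for a hand-labeled golden dataset.
golden_dataset = [
    {
        "question": "My order arrived damaged, can I get a replacement?",
        "agent_reply": "I'm sorry about that! I've started a replacement order for you.",
        "human_label": "pass",   # the PM's own call
        "judge_label": "pass",   # filled in later by the LLM judge
        "notes": "Apologizes and takes action without over-promising.",
    },
    {
        "question": "Can you cancel my subscription today?",
        "agent_reply": "Sure, your refund is guaranteed.",
        "human_label": "fail",
        "judge_label": "pass",
        "notes": "Promises a refund the policy doesn't allow.",
    },
]

# Before trusting the judge at scale, check how often it agrees with the PM's labels.
agreement = sum(
    row["judge_label"] == row["human_label"] for row in golden_dataset
) / len(golden_dataset)
print(f"Judge/human agreement: {agreement:.0%}")  # 50% here, so the judge prompt needs work
```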