Behind the Craft by Peter Yang

Complete Beginner's Course on AI Evaluations: Step by Step | Aman Khan

Watch 2 PMs build AI evals for a customer support agent from scratch - from labeling a golden dataset to aligning LLM judges

Peter Yang
Aug 24, 2025

Dear subscribers,

Today, I want to share a new episode with Aman Khan (Arize).

The best way to learn about AI evaluations is to watch two PMs build them live from scratch. In my new episode, Aman and I walk through creating evals for an AI customer support agent, from labeling a golden dataset to aligning LLM judges. This is the complete beginner's AI eval course you've been waiting for.

Watch now on YouTube, Apple, and Spotify.

Aman and I talked about:

  • (00:00) What are AI evals and how to get good at them

  • (02:52) The 4 types of AI evaluations everyone should know

  • (06:08) Live demo: Building evals for a customer support agent

  • (10:29) Using Anthropic's console to generate great prompts

  • (15:13) Creating the evaluation criteria

  • (17:40) Adding human labels to the golden dataset

  • (31:05) Scaling evals with LLM-judge prompts

  • (38:21) How to align LLM judges with human judgment

Aman also teaches a live AI prototype-to-production course for PMs (get $100 off).


🎙️ Coming up next on Behind the Craft

  1. Ranking 15 PM Skills: What AI Can't Touch vs Will Disrupt | Nan Yu

  2. Claude Code Beginner Tutorial: Build a Movie Discovery App in 15 Minutes



Top 10 takeaways I learned from this episode

  1. Use four types of AI evaluations: code-based evals for simple string checking, human evals to label a golden dataset, LLM judges to scale, and user evals for real-world metrics. Each serves a specific purpose in your evaluation pipeline.

  2. PMs must do the manual labeling themselves to maintain product judgment: "I never found it useful to just completely outsource the human evals to contractors. The PM has to be in the spreadsheet doing the stuff themselves to start."
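
To make the first takeaway concrete, here is a minimal Python sketch of two pieces of that pipeline: a code-based eval that does simple string checking, and an agreement score between human labels and LLM-judge labels. This is not the code from the episode. The function names, the banned-phrase list, and the "good"/"bad" labels are all hypothetical, and the judge labels are hard-coded where a real setup would get them from a model.

```python
# Illustrative sketch only: names, labels, and example data are assumptions,
# not taken from the episode.


def code_based_eval(response: str, banned_phrases: list[str]) -> bool:
    """Code-based eval: pass if the response is non-empty and
    contains none of the banned phrases (case-insensitive)."""
    text = response.strip().lower()
    if not text:
        return False
    return not any(phrase.lower() in text for phrase in banned_phrases)


def agreement_rate(human_labels: list[str], judge_labels: list[str]) -> float:
    """Fraction of golden-dataset rows where the LLM judge's label
    matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("human and judge label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled row")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


if __name__ == "__main__":
    # Hypothetical check for a customer support agent's reply.
    reply = "You can reset your password from the Settings page."
    print("code-based eval passed:", code_based_eval(reply, ["as an ai"]))

    # Hypothetical labels: the PM's golden dataset vs. an LLM judge's output.
    human = ["good", "bad", "good", "good"]
    judge = ["good", "bad", "bad", "good"]
    print(f"judge/human agreement: {agreement_rate(human, judge):.0%}")
```

A low agreement score is a signal to go back to the rows where the judge and the human disagree and revise the judge prompt, which is one simple way to think about aligning a judge with human judgment.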
