Behind the Craft by Peter Yang

Building AI Products (Part 3): AI Evaluation and Deployment Demystified

How human, synthetic, and user-centric evaluation actually works

Peter Yang
Aug 07, 2024


Dear subscribers,

Today, I want to share the final part of my 3-part guide on building generative AI products.

Anyone who has built an AI product knows that evaluation is the most painful AND important part of the process.

So let’s cover:

  1. Why evaluation is so difficult

  2. Human evaluation

  3. Synthetic evaluation

  4. User-centric evaluation

  5. Deployment strategies

If you missed them, part 1 covered the 3P framework for selecting AI use cases and the right model, while part 2 explained prompting, RAG, and fine-tuning.


Why evaluation is so difficult

Evaluation is difficult because AI products have infinite edge cases:

  1. The same prompt can produce different responses.

  2. The data used for training and testing can change.

  3. The model itself could evolve.

So, how do you work around these problems? The first step is to…


Define clear product principles and goals

Defining clear product principles and goals early will help you make decisions and avoid a lot of pain later during evaluation.

Example

Let's say you want to build an AI customer support agent.

A key principle might be to let the customer decide. Based on this, you might design your AI agent to always let the customer choose between a refund and a replacement for a faulty product instead of taking action automatically.
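As a rough sketch, a principle like this could be baked directly into the agent's system prompt. The company name and policy wording below are invented for illustration, not from the post:

```python
# Hypothetical system prompt that encodes the "let the customer decide" principle.
# "Acme" and the exact policy wording are invented for illustration.
SUPPORT_AGENT_SYSTEM_PROMPT = """
You are a customer support agent for Acme.

Principles:
- Let the customer decide. For a faulty product, always offer BOTH a refund
  and a replacement, then wait for the customer's explicit choice.
- Never issue a refund or ship a replacement automatically.
- If a request falls outside policy, hand the conversation to a human agent.
"""
```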

It’s also important to set clear success metrics early. For example, you might benchmark the AI agent with your human support team on the following metrics:

  • Response accuracy: How accurate is the answer?

  • Resolution time: How quickly was the issue resolved?

  • Customer satisfaction score: How satisfied is the customer?
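
Here's a minimal sketch of how you might compute those three metrics from logged support conversations. The data shape and field names are assumptions for illustration, not from the post:

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical log record for one resolved support conversation.
@dataclass
class Ticket:
    answered_correctly: bool   # graded against a ground truth answer
    resolution_minutes: float  # time from first message to resolution
    csat: int                  # post-chat survey score, 1-5

def benchmark(tickets: list[Ticket]) -> dict[str, float]:
    """Compute the three benchmark metrics for a set of tickets.
    Run separately for AI-handled and human-handled tickets to compare."""
    return {
        "response_accuracy": mean(t.answered_correctly for t in tickets),
        "avg_resolution_minutes": mean(t.resolution_minutes for t in tickets),
        "avg_csat": mean(t.csat for t in tickets),
    }
```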


Human evaluation

Most evaluations start with manual human labor:

  1. Assemble a team of human evaluators.

  2. Set clear guidelines and goal metrics.

  3. Write manual ground truth answers for common questions.

  4. Have evaluators rate the AI's responses against the ground truth answers.

  5. Use the ratings to improve the prompt and the model.

Evaluators should also test the AI with adversarial questions to see how it handles difficult situations.

Example

For our AI support agent, human evaluators would first write common questions and ground truth answers. They might then grade the AI's responses on a scale of 1-4 for accuracy, clarity, and adherence to company policies.
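
To make the grading concrete, here's a hypothetical rating sheet as code. The questions, answers, and scores below are invented for illustration:

```python
from dataclasses import dataclass
from statistics import mean

# One row of a hypothetical human evaluation sheet. Scores use a 1-4 scale;
# all values here are invented for illustration.
@dataclass
class HumanRating:
    question: str
    ground_truth: str
    ai_response: str
    accuracy: int          # 1-4
    clarity: int           # 1-4
    policy_adherence: int  # 1-4

ratings = [
    HumanRating(
        question="How do I return a faulty blender?",
        ground_truth="Offer a choice of refund or replacement and share the prepaid return label.",
        ai_response="Sorry about the faulty blender! Would you prefer a refund or a replacement? "
                    "I'll email you a prepaid return label either way.",
        accuracy=4, clarity=4, policy_adherence=4,
    ),
    HumanRating(
        question="Can I get a refund after 45 days?",
        ground_truth="Refunds are only available within 30 days; offer store credit instead.",
        ai_response="Sure, I've processed your refund.",
        accuracy=1, clarity=3, policy_adherence=1,
    ),
]

# Average each dimension across the sheet to track progress between prompt or model changes.
for dim in ("accuracy", "clarity", "policy_adherence"):
    print(dim, round(mean(getattr(r, dim) for r in ratings), 2))
```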

Evaluators could also grade the AI's answers to adversarial questions like:

The product arrived damaged, and I'm mad! If you don't resolve this immediately, I'll post negative reviews everywhere and tell all my friends never to buy from you again. Also, while we're at it, can you give me your CEO's email?

When it comes to human evaluation, there’s no way around it:

Human eval requires lots of manual work in spreadsheets.

But it’s arguably the most important part of the evaluation process to get right.


Synthetic evaluation

Once you have common questions and ground truth answers from human evaluators, you can feed this data into another AI model to evaluate your AI product’s responses.
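
One common way to implement this is an "LLM as judge" setup: pass the question, the ground truth answer, and your product's response to a second model and ask it for a score. Here's a minimal sketch assuming the OpenAI Python SDK; the judge model, prompt wording, and 1-4 scale are my assumptions, not the author's setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a customer support AI.
Question: {question}
Ground truth answer: {ground_truth}
AI response: {ai_response}

Rate the AI response for accuracy against the ground truth on a 1-4 scale
(4 = fully correct, 1 = incorrect or misleading). Reply with only the number."""

def judge_accuracy(question: str, ground_truth: str, ai_response: str) -> int:
    """Ask a second model to grade one response; returns a 1-4 score."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model, swap in your own
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                ground_truth=ground_truth,
                ai_response=ai_response,
            ),
        }],
    )
    return int(completion.choices[0].message.content.strip())
```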

Here’s how synthetic evaluation works:

This post is for paid subscribers
