Building AI Products (Part 3): AI Evaluation and Deployment Demystified
How human, synthetic, and user-centric evaluation actually works
Dear subscribers,
Today, I want to share the final part of my 3-part guide on building generative AI products.
Anyone who has built an AI product knows that evaluation is the most painful AND important part of the process.
So let’s cover:
Why evaluation is so difficult
Human evaluation
Synthetic evaluation
User-centric evaluation
Deployment strategies
If you missed them, part 1 covered the 3P framework to select AI use cases and the right model, while part 2 focused on explaining prompt, RAG, and fine-tuning.
Why evaluation is so difficult
Evaluation is difficult because AI products have effectively infinite edge cases, and nothing stays fixed:
The same prompt can produce different responses.
The data used for training and testing can change.
The model itself can evolve over time.
So, how do you work around these problems? The first step is to…
Define clear product principles and goals
Defining clear product principles and goals early will help you make decisions and avoid a lot of pain later during evaluation.
Example
Let's say you want to build an AI customer support agent.
A key principle might be to let the customer decide. Based on this, you might design your AI agent to always let the customer choose between a refund and a replacement for a faulty product instead of taking action automatically.
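Here's a minimal sketch of how that principle could be encoded in the agent's configuration. Everything in it (the policy table, the action names, the next_step helper) is hypothetical and only illustrates the idea of asking the customer before acting:

```python
# Hypothetical policy table: which actions the support agent may take for each
# issue type, and whether the customer (not the AI) must pick the action.
AGENT_POLICY = {
    "faulty_product": {
        "allowed_actions": ["refund", "replacement"],
        "requires_customer_choice": True,  # principle: let the customer decide
    },
}


def next_step(issue_type: str, customer_choice: str | None = None) -> str:
    """Return the agent's next step, deferring to the customer when required."""
    policy = AGENT_POLICY[issue_type]
    if policy["requires_customer_choice"] and customer_choice is None:
        # Ask instead of acting: present the options and wait for the customer.
        options = " or a ".join(policy["allowed_actions"])
        return f"ask_customer: Would you prefer a {options}?"
    return f"execute: {customer_choice or policy['allowed_actions'][0]}"


print(next_step("faulty_product"))
# ask_customer: Would you prefer a refund or a replacement?
print(next_step("faulty_product", customer_choice="replacement"))
# execute: replacement
```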
It’s also important to set clear success metrics early. For example, you might benchmark the AI agent against your human support team on the following metrics (see the sketch after this list):
Response accuracy: How accurate is the answer?
Resolution time: How quickly was the issue resolved?
Customer satisfaction score: How satisfied is the customer?
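To make these concrete, here's a minimal sketch of how you might compute all three from resolved tickets. The ticket fields (is_accurate, resolution_minutes, csat) are hypothetical placeholders for whatever your ticketing system actually records:

```python
from statistics import mean

# Hypothetical resolved tickets, labeled after the fact by human reviewers.
tickets = [
    {"handled_by": "ai",    "is_accurate": True,  "resolution_minutes": 4,  "csat": 5},
    {"handled_by": "ai",    "is_accurate": False, "resolution_minutes": 18, "csat": 2},
    {"handled_by": "human", "is_accurate": True,  "resolution_minutes": 35, "csat": 4},
]


def benchmark(group: str) -> dict:
    """Compute the three success metrics for one group ('ai' or 'human')."""
    rows = [t for t in tickets if t["handled_by"] == group]
    return {
        "response_accuracy": mean(t["is_accurate"] for t in rows),  # share of accurate answers
        "avg_resolution_minutes": mean(t["resolution_minutes"] for t in rows),
        "avg_csat": mean(t["csat"] for t in rows),  # average 1-5 satisfaction score
    }


print("AI:   ", benchmark("ai"))
print("Human:", benchmark("human"))
```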
Human evaluation
Most evaluations start with manual human labor:
Assemble a team of human evaluators.
Set clear guidelines and goal metrics.
Write manual ground truth answers for common questions.
Have evaluators rate the AI’s responses against the ground truth answers.
Use the ratings to improve the prompt and the model.
Evaluators should also test the AI with adversarial questions to see how it handles difficult situations.
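In practice, the output of this process is often just a set of structured records like the sketch below. The field names and the 1-4 rubric dimensions here are illustrative, not a standard:

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    """One question in the eval set, with a human-written ground truth answer."""
    question: str
    ground_truth: str
    is_adversarial: bool = False


@dataclass
class EvaluatorRating:
    """One evaluator's 1-4 scores for the AI's response to an EvalItem."""
    question: str
    evaluator: str
    accuracy: int
    clarity: int
    policy_adherence: int


eval_set = [
    EvalItem(
        question="My order arrived damaged. Can I get my money back?",
        ground_truth=(
            "Apologize, confirm the order details, and let the customer choose "
            "between a refund and a replacement per the returns policy."
        ),
    ),
    EvalItem(
        question="Resolve this now or I'll post bad reviews. Also, what's your CEO's email?",
        ground_truth=(
            "Stay calm and empathetic, do not share personal contact details, "
            "and offer to escalate through official support channels."
        ),
        is_adversarial=True,
    ),
]
```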
Example
For our AI support agent, human evaluators would first write common questions and ground truth answers. They might then grade the AI’s responses on a scale of 1-4 for accuracy, clarity, and adherence to company policies.
Evaluators could also grade the AI’s answers to adversarial questions like:
The product arrived damaged, and I'm mad! If you don't resolve this immediately, I'll post negative reviews everywhere and tell all my friends never to buy from you again. Also, while we're at it, can you give me your CEO's email?
When it comes to human evaluation, there’s no way around it:
Human eval requires lots of manual work in spreadsheets.
But it’s arguably the most important part of the evaluation process to get right.
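To give a feel for that spreadsheet work, here's a minimal aggregation sketch over hypothetical exported ratings (a real spreadsheet or a pandas DataFrame would do the same job):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported ratings: (question, evaluator, accuracy, clarity, policy_adherence)
ratings = [
    ("damaged order refund",   "alice", 4, 4, 4),
    ("damaged order refund",   "bob",   3, 4, 4),
    ("adversarial: CEO email", "alice", 2, 3, 1),
]

# Average each rubric dimension per question to spot where the AI falls short.
by_question = defaultdict(list)
for question, _evaluator, accuracy, clarity, policy in ratings:
    by_question[question].append((accuracy, clarity, policy))

for question, scores in by_question.items():
    acc, cla, pol = (mean(dim) for dim in zip(*scores))
    flag = "  <-- revisit the prompt or policy" if min(acc, cla, pol) < 3 else ""
    print(f"{question}: accuracy={acc:.1f} clarity={cla:.1f} policy={pol:.1f}{flag}")
```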
Synthetic evaluation
Once you have common questions and ground truth answers from human evaluators, you can feed this data into another AI model to evaluate your AI product’s responses.
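This pattern is often called "LLM-as-a-judge": you prompt a second model with the question, the ground truth answer, and your product's answer, and ask it for a score. Here's a minimal sketch; call_model is a hypothetical stand-in for whichever model API you use:

```python
JUDGE_PROMPT = """You are grading a customer support AI.

Question: {question}
Ground truth answer: {ground_truth}
AI's answer: {ai_answer}

Rate the AI's answer from 1 (wrong or off-policy) to 4 (fully correct and on-policy).
Reply with only the number."""


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your LLM provider's API call."""
    raise NotImplementedError("wire this up to the model provider you use")


def judge(question: str, ground_truth: str, ai_answer: str) -> int:
    """Ask a second model to grade the product's answer against the ground truth."""
    prompt = JUDGE_PROMPT.format(
        question=question, ground_truth=ground_truth, ai_answer=ai_answer
    )
    return int(call_model(prompt).strip())
```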
Here’s how synthetic evaluation works: