Barak Turovsky (Exec In Residence, Scale Venture Partners): How to Evaluate Generative AI Use Cases
Why search might be a red herring and who will capture the most value in the AI stack
Today, I want to share a great framework for evaluating generative AI use cases.
Barak Turovsky is Executive in Residence at Scale Venture Partners and ex-head of product for Google Languages AI. I worked with Barak a decade ago, so naturally I had to chat with him about AI.
In the interview below, we talk about:
How to evaluate generative AI use cases
Why search might be a red herring vs. other use cases
Which companies will capture the most value in the AI stack
How to evaluate generative AI use cases
Welcome Barak! What’s your framework for evaluating generative AI use cases?
I like to evaluate use cases across two axes:
Fluency: How natural-sounding the output is.
Accuracy: How correct the output is.
Here’s the breakdown:
High fluency, low accuracy: Great fit for generative AI. Examples include creator and productivity use cases like writing blogs, children's books, and poems. In these cases, output accuracy is often subjective.
Low fluency, high accuracy: Not a great fit for generative AI. Examples include search queries with clear answers (e.g., “When was Barack Obama born?”). Generative AI models don’t add a lot of value here.
High fluency, high accuracy: Seems like a good fit for generative AI, but cannot be trusted blindly. Examples include travel recommendations and business emails. You need a person to check the answer.
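The quadrant logic above can be sketched as a simple decision helper. The labels and phrasing are illustrative, not from the interview:

```python
def evaluate_use_case(fluency: str, accuracy: str) -> str:
    """Map a use case's demands on the two axes ('high' or 'low')
    to a rough fit for generative AI."""
    if fluency == "high" and accuracy == "low":
        return "great fit: creative output, correctness is subjective"
    if fluency == "low" and accuracy == "high":
        return "poor fit: a factual lookup beats generation"
    if fluency == "high" and accuracy == "high":
        return "use with human review: fluent but must be verified"
    return "little value: generation adds nothing here"

# Writing a children's book demands fluency, not objective accuracy:
evaluate_use_case("high", "low")
```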
Makes sense. Just the other day I was using ChatGPT to clean up some data and it started making up numbers halfway through! Are there scenarios where generative AI can still help with high accuracy use cases?
Yes, they can still help if you have a person manually checking the AI’s output.
For example, ChatGPT can write the first draft of your business email for you to edit. But manual review breaks down in high volume use cases like search where it becomes impossible to have a trained person validate every result.
Why search might be a red herring vs. other use cases for large language models
Both Microsoft and Google are racing to use generative AI and large language models (LLMs) to improve search. Can you describe the types of search results that might be a good fit for these AI models?
Yes, let’s use the same framework above to break down search:
High fluency, low accuracy: e.g., “Generate 5 titles for this blog post.” This is more of a creator use case than traditional search.
Low fluency, high accuracy: e.g., “Where is Cape Town?” These queries often have a single factual answer. Google search already does a good job here.
High fluency, high accuracy: e.g., “Help me plan a vacation with kids.” LLMs can provide a fluent travel itinerary but you have to check the answers for accuracy.
That matches my personal experience with Bing AI. Over time, I found myself using it for creator use cases (e.g., “Make this content more clear and concise.”)
Yes, a third axis of the framework above is how high the stakes are.
If you’re using an LLM to explain a complex topic (e.g., “How do AI transformers work?”), the stakes are usually low. You may be fine with 80-90% accuracy.
If you’re using an LLM to make a personal or financial decision, the stakes are high. You’ll want close to 100% accuracy.
Imagine the LLM gave you a Disneyland itinerary that recommended a subpar hotel or a restaurant that’s actually closed. You would be pretty mad.
Yeah, despite the hype I don’t think anyone will blindly trust LLMs to book travel or restaurants right now. How fast will LLMs fix the hallucination problem?
I think it might be hard to improve accuracy. LLMs like GPT-4 are already trained on trillions of tokens, so there are diminishing returns to building bigger models.
The main value of LLMs is fluency with “good enough” accuracy. Even at 80-90% accuracy, LLMs could disrupt a wide variety of industries.
Which markets do you think LLMs will disrupt first?
So we discussed how LLMs are ideal for creator and productivity use cases. Here are three other markets that are ripe for disruption:
Customer support: LLMs will greatly improve customer interactions (e.g., support calls and emails) across a wide variety of industries.
Software development: Early studies show a 50%+ increase in productivity and developer satisfaction with AI-assisted coding.
Education: LLMs have already passed nursing, law, and other exams. They can become everyone’s personal tutor.
Which companies might capture the most value in the AI stack
Can you describe the generative AI stack and which layer might capture value?
At a high level, there are three layers:
Applications that are built on top of foundational AI models (e.g., Bing, Jasper).
Models that are used for training and deployment (e.g., GPT, Stable Diffusion).
Infrastructure from big cloud providers like Microsoft Azure, AWS, and Google Cloud, and from AI chip makers like Nvidia.
In terms of which layer will capture the most value, my bet is on the infra layer:
Cloud providers will acquire or build models. AWS and Google Cloud already have models, and Microsoft could still end up acquiring OpenAI. Startups might be able to stand out by training models on proprietary data in verticals like healthcare.
AI chip demand will continue to grow. Given the explosion of AI apps, chip makers like Nvidia and AMD will continue to see demand.
Applications, on the other hand, are risky.
There’s a joke that many AI apps are just wrappers around OpenAI’s APIs. How do you build a moat in the application layer?
I think AI apps can build moats in a few ways:
Fine-tune models for a specific vertical. I think this might be tough as foundational models are already good enough for many verticals.
Use private data or exclusive contracts. For example, an app could have a long-term contract with an insurance provider or government agency.
Focus on non-AI UX and product differentiation. For example, productivity apps like Office and Google Docs can auto-include useful prompts in the UX instead of relying on users to construct prompts themselves.
Any closing thoughts on generative AI?
We’re still in the early stages of the AI revolution but I think one thing is clear:
Companies need to think about how they can use generative AI to enhance their product or they’ll risk getting left behind.
As with every new tech, productizing generative AI is both exciting and scary. I’m excited to see more companies use the framework we discussed to cross the chasm.
Thank you Barak! If you enjoyed this conversation, please follow Barak on LinkedIn.
A lot of the issues with high fluency/high accuracy use cases can be mitigated by retrieval-augmented generation (RAG), which grounds the model’s output in retrieved source documents: https://arxiv.org/abs/2005.11401
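The core RAG idea can be sketched in a few lines: retrieve the most relevant documents for a query, then hand them to the model as context it must answer from. The token-overlap retriever below is a toy stand-in (real systems use vector embeddings), and the prompt wording is illustrative:

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most tokens with the query (toy scorer)."""
    q_tokens = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_tokens & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble a grounded prompt: the model answers only from retrieved context."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The Grand Hotel is closed for renovation until June.",
    "Disneyland opens at 8am daily.",
]
print(build_prompt("When does Disneyland open?", docs))
```

Grounding the answer in retrieved text is what shifts the burden from the model’s memorized (and possibly hallucinated) knowledge to a checkable source, which is why it helps the high accuracy cases discussed above.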