Aparna Dhinakaran in conversation with Boris Power of OpenAI. Power talks about his background and his work on GPT-3 and GPT-4, as well as his earlier work as a data scientist, software engineer, and machine learning consultant. Power is part of the Partnerships Research team, which engages with real-world use cases. He discusses OpenAI's recent launches, including ChatGPT, the Whisper API, and plugins for GPT-4, among others. Power also shares examples of production use cases, including Be My Eyes, an app for blind and visually impaired users that uses GPT-4 with visual capabilities.
Aparna: Hey everyone, welcome back. I hope you liked the keynote. We have a really exciting session right after: OpenAI is here to talk all things ChatGPT and GPT-4. Boris, thank you so much for being here today.
Boris: Thank you so much for inviting me. Very nice to meet you.
Aparna: Absolutely. Can you introduce yourself and tell us a little bit more about your background and what you do at OpenAI?
Boris: Sure. So I've been fascinated with games and AI from a very early age. I played chess when I was younger and spent a lot of time with the computer analyzing various lines. That led me to do a lot of programming in high school and then study AI at university. I also started a PhD on computer Go, which is another board game I liked, and my approach was very similar to AlphaGo's. But I stopped my PhD because I found out that I really enjoyed the practical side of AI a lot more. I really liked applications, and I was very eager to get into the workplace. So I worked as a data scientist, software engineer, and machine learning consultant, and I was really fascinated with GPT-2 and GPT-3.
That was the time when I really felt like this is something new that businesses need to start thinking about adopting. So I was an early user of GPT-3 and was interested in figuring out how financial industries could be rethought with the addition of this new capability. And then I joined OpenAI just over two years ago.

Within OpenAI, I'm on the applied research team, and my team, which is called Partnerships Research, is the team that's most practically engaged with real-world use cases. We want to understand where the boundary is between what these models can do and what they're not quite good enough at yet.
Aparna: Awesome. Well, we are super excited to have you here at Observe. For folks in the audience, feel free to drop questions for Boris in the comments; I'll try my best to throw a couple at him. Boris, it's been a crazy six to eight months for OpenAI. Can you walk us through the highlights of what's been launched, and whether there are any themes behind these launches?
Boris: Absolutely. So I think OpenAI really came to the world stage when ChatGPT launched just five months ago, to the point that many people don't really know OpenAI; they only know ChatGPT, and we'll often be referred to as people who work at ChatGPT. But yes, you're right, it's been a crazy six to eight months.
So I think we started with DALL-E access without a waitlist, maybe just over six months ago. We introduced the new embeddings, which were orders of magnitude cheaper, I think in December last year. We released the API for Whisper, which is our speech-to-text model, the API for DALL-E, and the API for ChatGPT, which also reduced the price of inference. I think we reduced the price by three times maybe eight months ago, and then another ten times with the chat API launch just a month and a half ago. Then we also launched GPT-4 and plugins about a month ago. And most recently we announced some mostly smaller things: the classifier for indicating AI-written text, our approach to AI safety, and the bug bounty program as well.
Aparna: Wow, that's a really impressive list. It feels like the initial use cases, at least the ones we started off with, were a little more ad hoc; in the case of GPT and ChatGPT, it's users typing in ad hoc questions. How much production usage of GPT-3 or GPT-4 have you really started to see?
Boris: Yeah, so I guess I have a slightly different view, because I've been on the inside of OpenAI and I worked with the very early partners, from two years ago now. There are real companies that are incorporating this technology into their businesses, and there are also companies that are fully built on this technology: from established companies like Morgan Stanley, who are organizing their vast knowledge base, to completely new companies that are taking off. And I think we'll probably have some of those on stage today as well.

Aparna: Got it. Any examples of production use cases that you can share?
Boris: Sure. So probably my favorite use case is the use of GPT-4 with visual capabilities, which has only been released to this one customer: Be My Eyes. They're an app that's been around for about 10 years, built for blind and visually impaired people, who can take a photo with their phone and ask a question based on that photo. In the past, you'd have to wait for a volunteer, which might take five minutes, and people really didn't want to ask questions that were maybe too trivial. Now that GPT-4 with visual capabilities is serving the app, they can get responses within 10 seconds.
And I've seen these amazing examples of people using it to read through fashion magazines, and a lot more frivolous things that you just wouldn't want to ask a volunteer to help you out with. So that's probably my favorite use case. There are a number of others; I can keep going. Stripe is fighting fraud.
That one's very close to my heart, because I really care about fighting financial crime. The government of Iceland is preserving their language, and I think they're doing a great job of open-sourcing their large Gigacorpus and also a large number of NLP tasks in a format that any language model can learn from.
And they have made GPT-4 better at Icelandic. Then we have Khan Academy, a great use case where I think there's massive potential for improving education by having a personalized tutor that anyone can access, and Sal Khan himself spent 20 hours adding examples of how you teach students well.
And I think that's really showing now in terms of GPT-4 improving its abilities there. There are others; I think Harvey is applying it to law, which I think is very interesting. Law is one of those fields that's almost like computational language. But yeah, I could keep going; it's a really fruitful space.
Aparna: Wow, you got me with the government of Iceland using it to preserve language. That is really cool. Any patterns, or industries you feel can best take advantage of production use cases of GPT-4? Do you think it's going to be widespread in a couple of years?
Boris: So one of the biggest challenges with these models is that they're very general. When people ask what the use cases are, well, it's kind of everything out there. And my belief is that because ChatGPT is so popular, I keep seeing friends from high school and primary school using it within very different industries that we'd never think of, for very small use cases. So I think it'll be almost a bottom-up, worker-led revolution, where people become more efficient at their jobs by using something like ChatGPT, where they need to copy things in and out. That's not as efficient, but then hopefully those businesses will notice and implement it in a much more scalable way, as a standardized process.
In terms of industries specifically, I believe it's the kinds of places where there is a lot of unstructured data, which are often industries where you might traditionally pay a lot of money for services. So law, medicine, and accounting are all there. And I think the reason it's quite useful there is because you're likely to keep a human in the loop, an expert who will check the outputs.
It's almost like you give everyone a little promotion. Suddenly, people who previously did the job of collecting which cases are relevant to the current case in law will now get the AI to do that step, and the humans will just review whether those references make sense. It's almost like everyone becomes a manager, and you have 10 AI employees who are helping you be more efficient.
Aparna: Wow. Okay, that's a really great way to paint the future. Okay, I have a really spicy, hot question for you; I'll put you in the hot seat. Do you see LLMs eating up traditional ML use cases? Do data scientists today become prompt engineers? Where do you see that going?
Boris: So, yes, in terms of eating the existing ML use cases: if your ML use cases are things like sentiment analysis, or things to do with structured data, surely there will be some of those where LLMs will just be better, either LLMs with a little bit of fine-tuning or just LLMs by themselves.
But I think it also opens up plenty more use cases. It opens up areas where previously you just wouldn't be able to apply ML easily. I think that happens with unstructured data, for example, where embeddings are still massively underused. Being able to embed a large knowledge base and then use it for retrieval, for clustering, for finding insights, fraud detection, anomaly detection, spotting new patterns emerging: I think all of those are possible in the embedding space, and people are just starting to explore what can be done there. This is exactly where data scientists are very good: understanding high-dimensional spaces and what can be done with all those techniques. All you have now is better access to a much better semantic representation of unstructured data.
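As a rough illustration of the retrieval idea Boris describes, here is a minimal sketch in plain Python. The four-dimensional vectors and document names are made up for illustration; in practice the vectors would come from an embeddings API and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for vectors returned by an
# embeddings API; the document names are hypothetical.
knowledge_base = {
    "refund policy":  [0.9, 0.1, 0.0, 0.1],
    "shipping times": [0.1, 0.9, 0.1, 0.0],
    "fraud reports":  [0.0, 0.1, 0.9, 0.2],
}

def retrieve(query_vec, top_k=1):
    # Rank knowledge-base entries by similarity to the query embedding.
    ranked = sorted(
        knowledge_base,
        key=lambda name: cosine_similarity(query_vec, knowledge_base[name]),
        reverse=True,
    )
    return ranked[:top_k]

query = [0.85, 0.15, 0.05, 0.1]  # pretend embedding of a refund question
print(retrieve(query))  # -> ['refund policy']
```

The same similarity machinery supports the clustering and anomaly-detection uses he mentions: points that sit far from every cluster in the embedding space are the anomalies.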
Aparna: Got it. So you feel like sentiment classification and those types of use cases, LLMs can probably do better today, but some of the more complex ones will stay around, and new ones will emerge. I think this will just unlock more and more work. Can you give me some examples of how you're staying on top of all the emergent properties? And can you give me some examples of emergence that you're personally really excited about?
Boris: So I think my team is primarily here to figure out this question, and what we've done is try to open source as much as we can. We've open-sourced OpenAI Evals, where we're asking people a challenging question: come up with things that GPT-4 is not able to solve one hundred percent of the time.
That is, I think, a very good way of seeing the boundary of what these models can do. The other thing we do is also open-source the OpenAI Cookbook, which is a large repository of the best ways of using these large language models, in combination, to solve interesting use cases.
And then, how do we stay on top? I guess by working directly with developers, trying to be in a position where we offer our advice and see what works and what doesn't, and by working through those use cases ourselves: doing a bit of fine-tuning, and really trying to understand where these models are good and where they're not.
We just want to build a very good understanding of where the boundary of what the current model can do lies. That helps our research push those boundaries, but it also informs people, by setting a good example of the level of capability of these models, so that they start off exploring at the right complexity.
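The evals idea can be sketched as a tiny harness: run a model over cases with known ideal answers and report the failures, which mark the boundary of what it can do. The `model` stub and the cases below are invented for illustration, not taken from OpenAI Evals itself.

```python
# A hypothetical model stub; in a real eval this would call an actual model.
def model(question: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(question, "unknown")

# Each case pairs an input with the ideal answer to grade against.
eval_cases = [
    {"input": "2+2", "ideal": "4"},
    {"input": "capital of France", "ideal": "Paris"},
    {"input": "17*23", "ideal": "391"},  # the stub fails this one
]

failures = [c["input"] for c in eval_cases if model(c["input"]) != c["ideal"]]
accuracy = 1 - len(failures) / len(eval_cases)
print(failures)  # -> ['17*23']
```

The failing cases are exactly the "things GPT-4 can't solve" that Boris says his team asks the community to contribute.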
Aparna: Yep. I've got to say, the OpenAI Evals library is amazing; I've been watching it grow in popularity over the last few months. Okay, I'm going to switch it up a little bit here. Any ideas about the biggest challenges that users are facing today, or will face, when they put GPT-4 in production? I hear concerns about accuracy or hallucinations being top of the list for our customers. What are you hearing out there? Will this eventually get solved with fine-tuning or prompt engineering? What are the challenges that you think we'll see?

Boris: Yeah, I think that's a great set of things that are maybe a challenge, especially with all large language models.
So, accuracy and hallucinations: that's been around, but it's reducing with every new version of the models. I'm not sure we need to do something specifically different; scale and the current approaches just seem to improve it. We've seen a drop in hallucinations from 40% to 2% on some examples in the transition from 3.5 to 4. And I think language models are sort of jacks of all trades: they're really good at so many different things, but they're not that great at any one specialty when compared to a human expert. So this is where I really think having a human in the loop is probably what people will want to do for a long time.
The other thing is, it's likely to be the case, as with self-driving cars, that we'll just hold LLMs to a much higher standard than we hold humans to; with humans, we don't necessarily measure how often they make mistakes. And I think that's good. It's good that we're holding these models to a very high standard, and hopefully they'll keep improving and we'll get to a much better place where all models will be better.
But the other big challenge is that this is usually a question that needs to be thought about at the executive level, because the business often needs a different structure. The previous functions might not all make sense. It's almost like you need to rethink from first principles: what makes sense in this industry? What are the new products we can launch? And then, how can we actually scale and change the way that processes work?
Aparna: Got it. I guess, how do you all think about observability internally? Do you look at metrics like accuracy? How do you think about observability?
Boris: It's a great question. I think what we've done by releasing Evals really shows what can be done with large models by themselves. One eval we have is the auto-eval, where the model itself evaluates the outputs. If you know what the input is, what the desired behavior is, and what the correct gold-standard answer is, then you can assess how good the answer you got is. In a similar way, you can approach observability: you can get the large language models to look through conversations, or through inputs and outputs, and ask the 10 questions you care about and get those answers. Those can either be classifications or something more open-ended, like: what are some of the potential issues? Then you can start tracking those. The other thing you can do is use embeddings, where you embed all the inputs and all the outputs, and then you can track across time.
How do those change? Is there a new pattern that just emerged? If there's one error, how about you look at all the other things close to it in the embedding space: is this a segment where things are not working quite well? And if you make a change to the prompt, or a change to the model, how does that change the output? How does that change all of these observed metrics? I think we'll see a lot more LLMs being used to analyze some subset of the traffic, or maybe even all of it, to get more and more insights. I think it'll just unlock more potential for observing what's happening.
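The embedding-based tracking Boris describes could be sketched, very roughly, like this: compare the centroid of a baseline window of traffic embeddings with the centroid of a recent window, and flag a jump. The two-dimensional vectors here are toy stand-ins for real embeddings.

```python
import math

def centroid(vectors):
    # Element-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def drift_score(baseline_vecs, recent_vecs):
    """Cosine distance between the centroids of a baseline window and a
    recent window of traffic embeddings. Near 0 means traffic looks like
    before; a jump flags a new pattern worth inspecting."""
    a, b = centroid(baseline_vecs), centroid(recent_vecs)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

baseline = [[1.0, 0.0], [0.9, 0.1]]   # embeddings of last week's inputs
similar  = [[0.95, 0.05]]             # today's traffic, same topic mix
shifted  = [[0.0, 1.0], [0.1, 0.9]]   # today's traffic, new pattern

print(drift_score(baseline, similar) < drift_score(baseline, shifted))  # True
```

A production system would do this per segment and per time window, but the core signal is the same distance comparison.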
Aparna: Totally, I'm really excited about that. All right, let me change topics a little bit. Agents have been a really hot topic in the last two or three weeks. Where do you think agents will be useful, maybe especially in production use cases?
Boris: Yeah, so this is an interesting one for me, because despite Auto-GPT being the most trending GitHub repository, with a hundred thousand stars, and despite me trying quite hard to look at all of those demos and use cases and even trying it out myself, I feel like it's one of those things that is overhyped at the moment, though I do think there is something there.
I think there's a really good idea there. So even if the models are not quite there today, they might be there in a year or two. And I would probably say people are somewhat discovering GPT-4 through agents. GPT-4 is really capable, and a lot of the things they're showing that supposedly Auto-GPT or BabyAGI can do, well, GPT-4 can do them as well if you give it a good prompt.
So that is one surprise to me. It's almost like either it's something GPT-4 can already do, or it gets really brittle and starts breaking and doesn't quite work properly. If you look at some of the more complex demos and drill down into each of the stages, there are a few mistakes happening at each stage.
So I guess that's where I'm at at the moment: I think there will be use cases in the future, but right now it's mostly a fun, interesting idea. People are getting excited both about the capabilities of GPT-4 and about trying to build a system that could actually work as a more autonomous agent that can start solving real-world problems.
I think it's a little bit early to say what the use cases are, but having agentic behavior just unlocks things. I think it really depends on businesses wanting to give it more control. Personally, I'm more excited about the plugins we've released. I think plugins are a first step that just gives extra information to the model.
If models need to start acting in the world, I think we need a lot more monitoring and observability before we can start actually trusting them and measuring their success.
Aparna: Got it. Wow, you heard it here first, a spicy hot take from Boris: agents are overhyped. There are a lot of really cool demos happening on Twitter right now, wrapping calls, agents for evaluation, and so on. But I'm also really curious what the production use cases for agents are. So I love that take. Okay, one more question for you, and then I'll move into some audience questions.
There's been a lot of discussion around private, closed-source foundation models versus open-source foundation models. Any take on where that's going, and what might become more common in the long run? I know it might be a biased question to put in front of you, but we'd love to hear your take.
Boris: Yeah, I usually like to look at trends and then start projecting based on them. I think neither of those is going away, and I think they both help each other. GPT-3 has probably done more for open-source language models than any other thing out there. And equally, the innovations that happen within the open-source community are of a type that might not be quite as easy to do with closed models.
So I think it's really good to have both. I also think it's important that the most capable models remain closed source, or at least behind an API where we can really monitor and understand what's happening. Because the more the capabilities of these models increase, the more scary potential there is for people to seriously misuse this technology to achieve bad outcomes.
And this is where OpenAI as a company spends a lot of effort on aligning the models, and also on observing what's happening within the API traffic, so that we can understand if any anomalies come up that haven't been seen previously and that might be disruptive to the world economy and to humans as a whole.
So really, the mission of OpenAI is to benefit all of humanity, and we try really hard to achieve that. When it makes sense to open source, we do open source: we open-sourced the Whisper model, and we'll keep doing so. But we just think open-sourcing GPT-4 is probably not the best idea right now, given its capabilities.
Aparna: Got it. That's a great point. All right, let me jump into some audience questions here. Okay, I got a great question from Anika Patel: how do we test for bias in LLMs?

Boris: So I think that's one area of academic research that is growing a lot; there are a lot of really good papers in this space. It really depends what kind of bias you mean, because there are a lot of different types of bias. I think one way to do it is by checking in the embedding space whether these language models are encoding certain characteristics: does, for example, changing the name of a person change the response in some way? So yeah, I think it's a very interesting area of study, and I think large models make it easier. You can also ask the model itself to explain its reasoning, and even though that's a post-hoc explanation, it's still, I think, a useful way of seeing where the thoughts are coming from.
Aparna: Hmm, that's a really good point. Okay, a question from earlier in the talk: how are you thinking about solving the token limit problem?
Boris: Great. So maybe just to give a little bit of context for those who might not be aware of what the token limit is. Large language models read text token by token (you can think of a token roughly as a word), and they produce the next word given all the previous words. Historically, GPT-2 had, I think, 512 tokens, which is about half a page to a page of text. Then GPT-3 had 2,000 tokens, which is maybe three to four pages of text.
GPT-3.5 increased that limit, and now with GPT-4 the limit is at 8,000 tokens, but we also have an experimental version with 32,000 tokens that we're testing out. 32,000 tokens gets us to around 50 pages of text. I think one way forward is to keep improving the underlying architecture so that more and more tokens can be in scope.
But the other way is to think about the token space itself. Fifty pages of text is a lot of text; that's almost like a human's working memory. You need to put within those 50 pages the context that's needed to produce the next output. But then you can actually extend that by having retrieval.
This is where you can use embeddings, and you can embed a much bigger knowledge base. You can embed all of the books in the library, and then, for whatever question you need to answer (say you need to write an essay on the differences between one philosopher's view on free will and another's), it'll be able to retrieve the relevant books and the relevant chapters and populate those within the 50 pages of context. Then hopefully, based on that text, it'll do a much better job at producing the output than it would otherwise. So you can almost think of context as being infinite, if you trust your retrieval process to be very good at some point.
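A rough sketch of that packing step, under the assumption that chunks have already been scored for similarity to the question: greedily add the best-scoring chunks to the prompt until a token budget is spent. The word-count token estimate and the example chunks are illustrative only; a real system would use a proper tokenizer and API embeddings.

```python
def build_prompt(question, chunks, scores, token_budget=25):
    """Greedily pack the highest-scoring knowledge-base chunks into the
    prompt until the (rough, word-count-based) token budget is spent."""
    ranked = [c for _, c in sorted(zip(scores, chunks), reverse=True)]
    context, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())  # crude estimate: 1 word ~ 1 token
        if used + cost > token_budget:
            break
        context.append(chunk)
        used += cost
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"

chunks = ["Spinoza argued free will is an illusion born of ignorance of causes.",
          "Shipping usually takes three to five business days.",
          "James defended free will as a live, forced, momentous option."]
scores = [0.92, 0.11, 0.88]  # hypothetical retrieval similarity scores

prompt = build_prompt("Compare Spinoza and James on free will.", chunks, scores)
print("Shipping" in prompt)  # -> False: the irrelevant chunk is left out
```

The token budget plays the role of the "50 pages": retrieval decides what deserves to occupy that finite working memory.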
Aparna: This might be a great question for you: there's a lot you can actually do with in-context learning, by changing the prompts to add more context from whatever source. Maybe this is more of a trend question: will we see more people investing in prompt engineering and prompt-template workflows over fine-tuning the LLM? When will people choose to fine-tune, and will most people actually just start with prompt engineering?
Boris: Yeah, so I think there's a lot to be explored with fine-tuning.
However, empirically what we've observed is that prompting just leads to more success, faster. I believe it's largely a function of how much faster it is to iterate with prompts; iteration speed is probably the key to succeeding. With fine-tuning, you need to think about everything ahead of time.
And if you want to make a change, it's really painful to change all of the data. I think fine-tuning makes sense if you have a use case where there's a very well-defined metric that the company has been optimizing for years, and where moving that metric by 0.5% would make a massive impact on the business.
That's where I would go for fine-tuning: you know the exact format of inputs and outputs, and you have a lot of extremely high-quality historic data that you can spend expert time analyzing and improving. Then fine-tuning a model on that data, even a small model, can actually result in cheaper, faster, and better performance than a big model.
But for all the new use cases, fine-tuning just adds too much cost. And fine-tuning is not as intuitive as people sometimes think. Fine-tuning works very well when you have: given this input, this is the output you need to produce, and you give it, let's say, a hundred or a thousand examples of that.
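Those input/output pairs are typically supplied as JSONL, one example per line. A hypothetical sketch follows; the exact field names and schema vary by API version, so treat them as illustrative rather than authoritative.

```python
import json

# Invented prompt/completion pairs in the one-object-per-line JSONL shape
# commonly used for fine-tuning uploads.
examples = [
    {"prompt": "Classify sentiment: 'Great product!' ->", "completion": " positive"},
    {"prompt": "Classify sentiment: 'Arrived broken.' ->", "completion": " negative"},
    {"prompt": "Classify sentiment: 'Does what it says.' ->", "completion": " positive"},
]

# Serialize one JSON object per line, ready to write to a training file.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
print(len(jsonl.splitlines()))  # -> 3
```

Note how each line is a complete, self-contained input/output pair: this is exactly the "given this input, produce this output" setting where fine-tuning shines.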
What it doesn't do well is when you have a knowledge base and you want to fine-tune the model to understand that knowledge base, talk well about it, and remember all the facts from it. That's actually a case where fine-tuning doesn't do well: it learns the style really well, but it doesn't learn the facts; it can hallucinate those facts.
A much better approach is to embed the knowledge base and then use in-context learning: basically retrieving from that knowledge base, putting that as part of the context of the prompt, and then having the model answer based on that context.
Aparna: Well, we're at time. We didn't get to all the questions in the chat, but Boris will be available in our community Slack; feel free to ask him questions there and he'll answer them. Boris, thank you so much for an amazing session; I walked away learning a lot. Any last thoughts or advice for people in this space?
Boris: Thank you so much. I'm just very humbled that you're all here and I think we have an amazing future ahead of us. So thank you so much for participating and building it.