In the keynote to kick off the third annual Arize:Observe, Aparna Dhinakaran, Chief Product Officer and Co-Founder, and Jason Lopatecki, CEO and Co-Founder, discuss the current state of MLOps and what generative AI means for ML engineers and data scientists--are all ML engineers and data scientists now prompt engineers? They also introduce two new Arize products: observability for LLMs, and Phoenix: ML observability in a notebook, and Aparna does a quick demo of each.
Aparna: Welcome to Arize:Observe. Hey, everyone! This is the largest LLM conference of the year, and we're so excited to have you here. There are over 5,000 attendees, 70-plus sessions, and two days of all things generative. My name is Aparna, one of the founders here at Arize.
Jason Lopatecki: I'm Jason Lopatecki, the other founder of Arize. Incredibly excited and what an amazing event. We have Hugging Face. We have Weights & Biases. We have OpenAI. We have LabelBox and Scale and Spotify, and Etsy.
Aparna: We have all of the up-and-coming LLM startups. We’ve got Jasper, LangChain, LlamaIndex, PromptLayer. BabyAGI is speaking. This is gonna be a really fun next two days.
Jason: And what a year it's been. I mean, it's been the year of AI. It's an amazing year. Think about it. 12 months ago, when we did our last Observe, prompts weren't even a thing in the lexicon. DALL-E 2 wasn't out, Midjourney wasn't out. Toolformer, or the whole idea that models would use tools, was just a dream in people's heads. GPT-4 didn't exist. I mean, it's just amazing, the momentum in this space, and it's changing our lives. It's changing the lives of our families. It's changing the business lives of the people we know as well.
OpenAI and the University of Pennsylvania put out this study on how AI, and GPT specifically, might affect the jobs of people in the US, with seriously interesting results. 80% of US workers likely affected, 300 million people. You know: mathematicians, survey writers, tax preparers, artists. It’s amazing how many people this is going to affect. What was surprising to me and Aparna, though, was that the one role that's very near and dear to our hearts was not on this list.
And that is the ML engineer and data scientist. If you were in a hole or under a rock and didn't see this Reddit post go around, let me tell you a little bit about it, because I think it's channeling what a lot of us feel. What's happening in LLMs is changing everything. This is a tech worker at a large tech company who's been building models for a long time, and they're realizing GPT-4 is kind of making what they're doing obsolete, that the NLP models they were building are just kind of no longer relevant. And this is happening in a lot of areas of ML right now.
And let me just dive in and give you an example. What's meant by this? Well, text classification is a model today. It's a model that someone trained and built. They collect data for it, they manage this model in production, and do all this work to make it work. And now it can be a prompt. It can be a piece of text: you feed GPT-4 a prompt asking for a classification.
And it's amazing how many models are being turned into this, really, a model becoming a task or a prompt.
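To make the "model becomes a prompt" idea concrete, here is a minimal sketch of text classification done through a prompt rather than a trained classifier. The template wording, the function names, and the `fake_llm` stand-in are all assumptions for illustration; a real application would pass in a function that calls an actual LLM.

```python
from typing import Callable, Sequence

def build_classification_prompt(text: str, labels: Sequence[str]) -> str:
    """Render a zero-shot classification prompt (hypothetical template)."""
    label_list = ", ".join(labels)
    return (
        f"Classify the text into exactly one of these labels: {label_list}.\n"
        f"Text: {text}\n"
        "Answer with the label only."
    )

def classify(text: str, labels: Sequence[str], complete: Callable[[str], str]) -> str:
    """Send the prompt to an LLM via `complete` and map the reply to a label."""
    reply = complete(build_classification_prompt(text, labels)).strip().lower()
    for label in labels:
        if label.lower() == reply:
            return label
    return "unknown"  # the model answered outside the allowed label set

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    text_line = next(line for line in prompt.splitlines() if line.startswith("Text: "))
    return "Negative" if "refund" in text_line.lower() else "Positive"

label = classify("I want a refund, this is broken.", ["positive", "negative"], fake_llm)
```

The data collection, training, and serving work described above collapses into the template string; what remains is checking that the reply actually falls inside the label set.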
Aparna: Hey, Jason. I have a question for you. Are you saying data scientists will become prompt engineers?
Jason: I mean, it's crazy. You might have spent five years getting a PhD, and now you're a prompt engineer. I mean… that is the reality. I don't think that's the case for everyone, but let's be honest: we're all prompt engineers now. We are all going to be doing this, and it's taking over the space in a very, very big way. And it's not just text classification. All these models that you've been building for years and years, niche models, are now being done by a single model, and being done many times better than what you're building individually.
And so for a lot of us, you see this momentum in this direction. Aparna and I are talking a lot about: is it one model to rule them all? Is it one model that does everything? And I think what’s been amazing to us is not just that these individual tasks that used to be models are being done, but all the other skills coming out of these models that we didn't think about.
Now, there's this concept of emergence, which is really, I think, one of the most amazing things happening in the space: these LLMs are gaining skills that the small models didn't have, and they're just showing up. The ability to self-debug code it has written, to create video games. AutoGPT and BabyAGI, you know, creating these controllers. It's absolutely awe-inspiring. Exciting, but I would say at the same time it's also got us slightly nervous. We're watching, you know, the first time in data science history where we just don't know all the skills the technology we've created has. And it might have millions. Aparna and I have been in this space for a long time, and when we look at it, there's never been a time where observability is more important.
You need observability to watch and understand what these models are doing. And the question you might have, as these models get so big and so good, is: do they do everything? Do they take over everything?
And I think the answer there is no. There's a whole set of things models do today at high scale, like transaction fraud and ranking, that are just many, many orders of magnitude away from what these LLMs can produce in terms of volume and response rates. But if you're doing lower-scale NLP, if you're doing something at a lower scale, it's likely to be replaced.
Aparna: Hey, Jason, I have another question for you. Can you give me an example of something that LLMs don't replace today? Maybe not in six months, but today.
Jason: So if you're doing, say, high-scale transaction fraud, or you're doing something Internet-scale, say, e-commerce rankings, those are just many orders of magnitude away. The other case, I would say, is if you're doing something that requires a lot of personalized information that doesn't fit in the context window, a lot of information that is relevant to your business; that also doesn't make sense. But on the other side of this, you've got these lower-volume applications, like the B2B copilot stuff, which never could have been done before and which are almost whole new companies. So you have a copilot for lawyers, like Harvey. You have copilots for medicine. You have all these new skills to apply, in ways that we've never thought of before. And 56% of the customers that we surveyed, these are enterprise and mid-market companies, are planning to put GPT-4 into production in the next 12 months. So this will be widespread use. And Aparna, maybe you can talk to us a little bit about what LLMs look like in production.
Aparna: Yeah. So what do LLMs look like in production today? Well, an application calls out to OpenAI or Anthropic. It gets back responses, and then it puts those responses into the application. It actually works pretty great when the responses are good, but LLMs can go wrong. LLMs can hallucinate. They can make up answers. We actually ran a massive survey where we asked hundreds of data scientists: what are you worried about with putting LLMs into production? Over 45% of them said that LLMs hallucinating and giving inaccurate responses was one of the biggest reasons why they were nervous about putting them into production.
Observability will be needed. And there's a whole new stack of technologies growing up around putting LLMs into production. It's called LLMOps, and this new stack contains prompts, agents, fine-tuning, and evaluations. All of these will require some component of observability. And today we're launching LLM observability across the Arize platform.
Let me tell you what that looks like. So first off prompts and responses are logged to the Arize platform. These prompts and responses can be one off. They can span multiple conversations. They can be many tasks within an agent. The key thing here, though, is that it's not just the text. It's not just the prompt and the response. It's the embeddings. And Arize can generate the embeddings, or we can take in user embeddings. But these embeddings are going to be core to observability in the LLM space.
So prompts and responses are logged to the platform. What next? Well, troubleshooting really works when you can find groups or patterns of issues and pinpoint where the problems are. Arize finds these clusters of bad responses. That could mean frustrated customers. It could mean language problems. But it finds these clusters where the responses were just not great.
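As a rough illustration of surfacing the clusters with the worst responses, the sketch below ranks clusters by their mean score. It assumes each logged response already carries a cluster label and a numeric score (here 1.0 for a thumbs up, 0.0 for a thumbs down); the records and names are made up, and this is not a description of how Arize computes it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (cluster_id, score), 1.0 = thumbs up, 0.0 = thumbs down
records = [
    ("frustrated", 0.0), ("frustrated", 0.0), ("frustrated", 1.0),
    ("spanish", 0.0), ("spanish", 1.0),
    ("math", 1.0), ("math", 1.0),
]

def rank_clusters(records):
    """Return (cluster_id, mean_score, count) tuples, worst mean score first."""
    by_cluster = defaultdict(list)
    for cluster_id, score in records:
        by_cluster[cluster_id].append(score)
    summary = [(cid, mean(scores), len(scores)) for cid, scores in by_cluster.items()]
    return sorted(summary, key=lambda row: row[1])

ranking = rank_clusters(records)
worst_cluster = ranking[0][0]
```

Keeping the count alongside the mean helps avoid chasing a cluster that looks bad only because it has a handful of responses.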
Jason: What's, like, a very specific example that you've seen go wrong, and how did people fix it?
Aparna: Yeah, we've seen language actually be a pretty big issue. This is where the prompt might be in one language and the response might be in a different language. If the language that the user wanted the response in wasn't indicated or called out in the prompt template, we've noticed that the response can just be off. So iterating on the prompt template and getting it to call out the language is a way to fix that. But even knowing that's a problem is really hard. So being able to group together these common patterns and issues is important, and you also want to sort the clusters by some type of evaluation score to see which ones really have the worst responses. And you might be asking: well, what does evaluation even look like for LLMs? You can't just throw your typical accuracy type of metrics at it. So there's a whole spectrum. On one end of the spectrum, which we'll get into, it can be really complex: you use AI to evaluate AI. It's called LLM-assisted evaluation. There's a simple middle ground where you can get user-provided feedback. So this might be a thumbs up or a thumbs down. How was the response that I gave? Did the user accept it and put it into their application, or did they reject it?
There's also, on the other end, more of these task-based metrics. So if you're doing summarization tasks or translation tasks, you can pull from the library of metrics that were commonly used in the NLP space and use them here.
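One example of such a task-based metric is unigram overlap between a generated text and a reference, in the style of ROUGE-1. The sketch below is a simplified version (whitespace tokenization, no stemming), so its numbers will not match a full ROUGE implementation exactly.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: unigram overlap between a candidate and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 5 of 6 tokens are shared, so precision = recall = F1 = 5/6
score = unigram_f1("the cat sat on the mat", "the cat lay on the mat")
```

Metrics like this need a reference answer for every response, which is part of why they fit summarization and translation better than open-ended chat.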
Jason: So which of these do you see the most in our customer base today?
Aparna: The thing that is the most common right now is user-provided feedback. It's super simple. You can put it in your application. You can ask: good response or bad response? I'm sure many of you have done this on ChatGPT as well. The problem, the con, is that users might not always give you that feedback, and any data scientist who depends on just user-provided feedback will probably complain to you that you get a really small sample set. So the thing I'm really bullish on is LLM-assisted evaluation. So let me deep dive into what that looks like.
So what is LLM-assisted evaluation? The point here is, it's really AI evaluating AI, and the reasoning is that some tasks are just so complex that it requires equal intelligence to evaluate them. For example, you're in college and you write a paper. You might have your professor, an actual human, grade your paper. Similar concept here: as the tasks get more complex, AI is going to grade AI. And the way it looks in our world is that you log these prompts and responses to Arize. As you log, there are different templates that you can use to generate the score. So you can just ask: here's a question, here is the answer, rate the answer. And that score gets logged.
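The "here's a question, here is the answer, rate the answer" pattern can be sketched as follows. The template text, the 1 to 5 scale, the `eval_score` field, and the `fake_evaluator` stand-in are assumptions for illustration, not Arize's actual templates; in practice `complete` would call a real evaluator model.

```python
import re
from typing import Callable, Optional

EVAL_TEMPLATE = (
    "Here is a question:\n{question}\n\n"
    "Here is the answer:\n{answer}\n\n"
    "Rate the answer from 1 (useless) to 5 (excellent). Reply with the number only."
)

def parse_score(reply: str) -> Optional[int]:
    """Pull the first standalone digit 1-5 out of the evaluator's reply, if any."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None

def evaluate(question: str, answer: str, complete: Callable[[str], str]) -> Optional[int]:
    """Ask an evaluator LLM (passed in as `complete`) to grade one response."""
    prompt = EVAL_TEMPLATE.format(question=question, answer=answer)
    return parse_score(complete(prompt))

def fake_evaluator(prompt: str) -> str:
    """Offline stand-in for a real evaluator model."""
    return "Score: 2" if "I don't know" in prompt else "5"

log = [
    {"prompt": "How do I reset my password?", "response": "I don't know."},
    {"prompt": "How do I reset my password?", "response": "Use the 'Forgot password' link."},
]
for row in log:
    # Store the score next to the prompt/response pair it grades
    row["eval_score"] = evaluate(row["prompt"], row["response"], fake_evaluator)
```

Parsing defensively matters here: evaluator models do not always reply with a bare number, so an unparseable reply is recorded as `None` rather than guessed.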
The score comes from an LLM. It gets logged to the platform, and now you have scores for all of your responses that you can use when you're trying to understand which clusters really have horrible responses. So, zooming back out, we now know where the problems are. We know which clusters to go focus on: that's the set that has the worst scores. What do you actually do to fix these responses? At a really simple level, you can either ask better questions, or you can train the answers to be better. In the world of asking better questions, you can ask the questions with all the context: you can prompt engineer, so that the templates have a lot of context to give you better answers. Or, if you want your answers to be hyper-personalized, you need to fine-tune on your own data. You can fine-tune the LLM to give you better responses regardless of the type of prompts that are sent in.
Let's dive into each one of them.
So in the world of prompt engineering, what does that look like? You have various templates. You get feedback, and you're trying to understand which templates are giving you the better responses and have, let's say, higher approval rates in this example. The hard part is, how do you go improve those templates? You can go fix the user input, add more to the instructions, add more to the context, but it's really about building better prompt templates to get better responses. Prompt engineering might only get you so far, though. If you need to really build on your own data, if you need to hyper-personalize it, you might need to fine-tune.
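Comparing templates by approval rate can be as simple as the sketch below. It assumes each logged response records which template produced it and whether the user accepted it; the template ids and the feedback rows are invented for illustration.

```python
from collections import Counter

# Hypothetical feedback log: (template_id, accepted)
feedback = [
    ("v1_bare", False), ("v1_bare", False), ("v1_bare", True), ("v1_bare", False),
    ("v2_with_context", True), ("v2_with_context", True), ("v2_with_context", False),
]

def approval_rates(feedback):
    """Map each template id to the fraction of its responses users accepted."""
    shown, accepted = Counter(), Counter()
    for template_id, was_accepted in feedback:
        shown[template_id] += 1
        accepted[template_id] += int(was_accepted)
    return {tid: accepted[tid] / shown[tid] for tid in shown}

rates = approval_rates(feedback)
best_template = max(rates, key=rates.get)
```

With samples this small the difference would not be statistically meaningful; the point is only the shape of the comparison.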
What does that look like? Well, you find these clusters of problems again, so it could be your language cluster. You go find as many examples as possible, and then you go fix the responses: you write what the ideal response would have been, and fine-tune the LLM to give you those better, ideal responses.
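A sketch of turning a problem cluster into fine-tuning data might look like this. The `prompt`/`completion` field names and the JSONL layout are assumptions; each fine-tuning service defines its own format, so check the provider's documentation before using anything like this.

```python
import json

# Hypothetical export of one problem cluster: Spanish prompts answered in English
cluster = [
    {"prompt": "Necesito ayuda con mi cuenta", "response": "Sure, I can help with your account."},
    {"prompt": "Quiero cancelar mi pedido", "response": "You can cancel from 'My orders'."},
    {"prompt": "Tengo otra pregunta", "response": "Go ahead."},
]

# Hand-written ideal responses, keyed by prompt (the third example is not fixed yet)
ideal = {
    "Necesito ayuda con mi cuenta": "Claro, puedo ayudarte con tu cuenta.",
    "Quiero cancelar mi pedido": "Puedes cancelar tu pedido desde 'Mis pedidos'.",
}

def to_jsonl(cluster, ideal):
    """Emit one JSON line per example that has a corrected response."""
    lines = []
    for row in cluster:
        fixed = ideal.get(row["prompt"])
        if fixed is None:
            continue  # skip examples nobody has corrected yet
        lines.append(json.dumps({"prompt": row["prompt"], "completion": fixed}))
    return "\n".join(lines)

training_data = to_jsonl(cluster, ideal)
```

Only the corrected pairs are written out, so the original bad responses never end up in the training set.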
Jason: I feel like there's so much going on in LLMOps here. I've also been hearing about agents. I was wondering if you can tell us about these automatic agents: what are they, and what do they do?
Aparna: There is no generative deck that can happen right now without talking about agents. So let's get into it. Agents really are about autonomous AI: the agent figures out how to do tasks for you based on some type of overall bigger goal. So if you have a complex goal like, go to the grocery store, figure out what ingredients are needed, and go buy them for me, it can break down the complex task into these smaller components, figure out how to order them, and then go execute them.
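The plan-then-execute loop described here can be sketched in a few lines. This is a toy under stated assumptions: `fake_plan` and `fake_execute` are stand-ins for an LLM planner and for real tool calls, and the trace format is invented; it is not how AutoGPT, BabyAGI, or LangChain implement agents.

```python
from typing import Callable, List

def run_agent(goal: str,
              plan: Callable[[str], List[str]],
              execute: Callable[[str], str]) -> List[dict]:
    """Break a goal into ordered subtasks, run each one, and record a trace."""
    trace = []
    for step, task in enumerate(plan(goal), start=1):
        result = execute(task)
        # Each record is something an observability tool could log
        trace.append({"goal": goal, "step": step, "task": task, "result": result})
    return trace

def fake_plan(goal: str) -> List[str]:
    """Stand-in for an LLM planner; a real agent would prompt a model here."""
    return [
        "list ingredients for the recipe",
        "check which ingredients are missing",
        "buy the missing ingredients",
    ]

def fake_execute(task: str) -> str:
    """Stand-in for tool calls (search, checkout, and so on)."""
    return f"done: {task}"

trace = run_agent("cook dinner tonight", fake_plan, fake_execute)
```

Recording one entry per step is what makes it possible to see which subtask went wrong when the overall goal fails.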
And every single layer of these tasks is going to need some type of observability. There are two great talks about agents at Observe. Harrison Chase from LangChain is giving a talk. BabyAGI is giving a talk. We'll be going a lot deeper in those sessions, so definitely check them out.
The big picture is, agents have a lot of autonomy. If there's anywhere observability is going to matter, it's going to be agents, and they're massively growing in popularity. AutoGPT has over 100k stars on GitHub already.
Jason: I mean, it's 100,000 GitHub stars in like four weeks. I'm a little suspicious there isn't a BabyAGI set up with the goal of getting AutoGPT stars. I mean, incredibly insane. One other thing I would add on this, and the point to me is, if you look at, say, a code base with 300 lines of code, 250 of them are probably prompts, which is a little bit of a note on the future. Pretty amazing.
Well, thank you for going through all that, LLMOps and everything going on in the space. Now let's hop in and show you what we built.
Observability needs to go across everything. It needs to go across your LLMs. It needs to go across all your model types. It's the same software you need to find problems and catch issues in production, to get to the data problems that are causing them, and then iterate. You need one platform for that.
Aparna: Today we’re launching observability for LLMs. If you want to try it out, head over to the Arize docs; you’ll see a new model type, LLMs, and there you can check out a tutorial to get started. Let’s jump into a demo. In this demo I have a customer support chatbot where users can ask questions and the chatbot will give answers. We’re logging the inputs that the users ask as prompts, and then the responses that the chatbot gave. The metrics that we’re measuring are thumbs up or thumbs down. We want users to give a thumbs up, and we’re really trying to avoid thumbs down. But it looks like we’re getting a lot of thumbs down, so we want to jump in and see what’s going on.
So I jump in and I take a look at the response embeddings. I take a look at the period where we saw a spike in thumbs down, and what it’s doing behind the scenes is generating a UMAP, and what we’re really trying to do is group together these prompts and responses to find problems. All of these are different clusters of prompts and responses. This cluster over here is one where it does look like users are saying things like "I’m so frustrated," etc. So this cluster looks like users are frustrated. I have a cluster where users are typing in math problems, and I have a cluster where users are typing questions in Spanish. What I want is to find a group where there are problems, and it really looked like the first one had that. Users are really frustrated in their prompts, and in the responses it’s clear the model isn’t all that helpful. So what I can do from here: I have options to now go and fix it. I can download this cluster of problematic responses, and I can decide if I want to change the template so I’m getting better responses, or I can fine-tune my LLM to get better responses. These are workflows you can kick off from the Arize platform. What this platform now enables is finding groups of problematic responses, with workflows and tools to let you try and fix them. We’re super excited to launch this and we hope you give it a try.
Jason: Well that was amazing. Well, there's one more thing. Observability needs to go from notebook to platform. You need observability in your notebook and platform to work together.
So we are announcing today open source LLM observability for your notebook.
Embeddings are at the core of what we built. They're at the core of every new model: large language models, latent structures. And we built troubleshooting workflows that start with embeddings, that help you understand what these models' decisions are, where the problems are, and where the manifolds and concepts are and the problems inside them.
They help you get down to that issue, even ask GPT-4 what it thinks the problem is and what the solution is, and give you the ability to compare these clusters A/B as well.
We're incredibly excited. Not only does it work on image and text, but it also works on tabular data. Excited to give you a demo of Phoenix.
Aparna: Phoenix is an open-source ML observability library. It’s designed for the notebook. Data scientists and ML engineers can ingest their model inference data into Phoenix for LLMs, CV, NLP, even tabular data sets, and really quickly figure out problems or insights about their models. They can then really quickly use this to export the issues, fine-tune the model, improve the model based on the issues they found.
Today I'm going to walk you through two examples, one with computer vision and one with a generative LLM model. If you want to follow along or try it yourself, there are a number of awesome tutorials in the Phoenix docs that you can check out today.
Let’s jump into the computer vision example. In this example, I’ve already run the tutorial. The model predicts user actions, so here it has predicted users drinking a beverage, and they really are drinking a beverage. We’re going to go ahead and use this data in Phoenix itself. You can launch it in a notebook or a browser. I’ve just launched it in a browser. I can see the model’s schema: it has the image embeddings, the predicted class, etc. Let’s jump in and take a look at the embeddings.
The first thing that comes up is the embedding drift. This is really comparing the embeddings from a primary dataset against the embeddings in a reference dataset. What that really means is it’s checking whether the underlying data has moved or changed between the datasets. You can click on any one of the time periods in this drift graph and it creates a visualization for me. This is UMAP, but to be clear, Phoenix isn’t just a UMAP visualization tool. Phoenix is the first platform to visualize embeddings, provide clusters, and really give you a tool to troubleshoot clusters in a single platform. And the reason you need this is because models learn surfaces or manifolds of your data, and in order to troubleshoot the surfaces, you need to understand them yourself. So it really learns to cluster by groups or concepts that are similar. Here’s a cluster that is grainy, here’s one that is blurry, and it's the way the model thinks of similar concepts within the data. What we saw was that there were clusters where the model only saw this kind of data in the primary dataset and didn't see it in my reference dataset, so there are automatic workflows to export it and send it to my team so they can use it to retrain the model with more types of data.
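One common way to put a number on embedding drift is the Euclidean distance between the centroid of the primary embeddings and the centroid of the reference embeddings. The sketch below shows that calculation on toy two-dimensional vectors; it is an assumption for illustration, and Phoenix may compute its drift graph differently.

```python
import math
from typing import List, Sequence

def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the vectors dimension by dimension."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def embedding_drift(primary, reference) -> float:
    """Euclidean distance between the centroids of two embedding sets."""
    return math.dist(centroid(primary), centroid(reference))

reference = [[0.0, 0.0], [2.0, 0.0]]   # centroid (1, 0)
primary = [[4.0, 4.0], [4.0, 4.0]]     # centroid (4, 4)
drift = embedding_drift(primary, reference)  # distance from (1, 0) to (4, 4)
```

A single distance says that the data moved but not where, which is why the clustering view described above is needed to find which kinds of inputs are new.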
Let me jump into an example with LLMs. In this example we’re looking at text data: prompts and responses being sent from an application. We went ahead and logged this data to Phoenix, so let’s go ahead and launch Phoenix for this model. Here I have the prompt and response vectors, and I also have additional dimensions like prompt length and API call duration, and there’s really a lot you can add to visualize your data. Let’s jump in and take a look at the response vector here. In this case I’m going to take a look at a couple of different clusters, and I can see there is a cluster where the model is really annoyed. If I want, I can colorize this by different categories so I can easily see the different groups. Here the model is annoyed; there are a lot of Spanish responses, so maybe it’s not doing well with this language. A cluster of chemistry concepts, a cluster of travel-related concepts. As the person behind an application, I really want to dig into this and see what my customers found annoying and what they were frustrated with. I can export this dataset, load it back directly into my notebook, and I can even just ask ChatGPT to summarize the data that it found in that cluster and explain what the cluster represents. Now that I know what the cluster means, I can use this data to improve my prompt, take that into consideration when I generate my responses, and give a better experience to my users. There’s this and more in Phoenix. Try it out today; there’s a lot you’ll discover with your own models.
Jason: If you have any questions for Aparna or myself, please drop them in the Slack community. Please try Phoenix. If you love it or like it please star it, that’s how we get feedback. OpenAI is next, enjoy the event.