Arize:Observe 2023

PromptLayer for Prompt Engineering in the Real World

This talk explores what prompt engineering is, how to get started doing it, and shares insights learned from our users. Jared Zoneraich, Founder at PromptLayer, will discuss best practices, real-world challenges, and show how PromptLayer enables data-driven prompt engineering. There is a big difference between theoretical prompting and real-world prompt engineering, and this talk will contrast the two.

Jared Zoneraich: Hey, my name is Jared Zoneraich from PromptLayer, and I'll be giving a talk today on prompt engineering in the real world.

So prompt engineering is a really, really new thing, and I think it's very important to know how people are doing it today, what it is, how we can make it better, and how it's gonna change in the future. So let's get started. Our company's logo is the cake emoji.

So we actually got a cake at our last happy hour. So come through to the next one, whether we do it in New York City or in the Bay Area, and we'll probably have more cakes like that. You can follow me on Twitter here, I'm Jared Z. And yeah, PromptLayer. You can go and otherwise let's get started.

I guess the only other thing is my background. I'll add that I live in New York City, originally from Jersey, and was in California a bit for school. Go Bears. Otherwise, if you're in New York City, hit me up and come say hi. Always happy to chat. You can reach me, jared at

So this talk will be divided into four different sections. The first section is prompt engineering: what is prompt engineering, and how does it work? The second section is how it's done today: like the name of the talk, what is prompt engineering in the real world? The third section is the challenges: now that we know how it's done today, what are the problems with it?

What are we seeing people doing? What is hard for people? As part of PromptLayer, we've spent the last three months trying to talk to everybody we can find who's running LLMs in production and doing prompt engineering, just trying to understand how this is being done and what the problems are, and to help people do it.

And for the fourth section, I'll give you a little demo of PromptLayer, where we try to solve a lot of these real-world challenges. Some of the challenges I'll be talking about are just general challenges, and for some of them I can show you how we think about solving them, but there's always more to learn there.

So, yeah, let's get started. First things first: definitions. What is prompt engineering? It's kind of buzzwordy, I'll admit. A lot of people don't like this term, but I like it. I think the big questions are: will prompt engineering exist in the future?

Will there be certain jobs called prompt engineers? I'll tell you my opinions on it, but let's start with what prompt engineering is. Prompt engineering is the practice of putting in a prompt and getting an output. I'm assuming most people in this talk are familiar with GPT and with how LLMs work, but in short, in case you're not: a large language model is an AI model that takes in text and completes the text.

As you've seen with ChatGPT, it starts off the conversation by saying, "You are an AI assistant. Here's what the user said. Your response is," and then it completes what the response is. So that's what a large language model is, and prompt engineering is just how you create that prompt to get what you want.

A common question people ask, and one we get asked too, is, "Hey, you guys are building PromptLayer, this prompt engineering platform. What if prompt engineering goes away? What if language models just get better?" At the core, you can think of prompt engineering as just communicating with AI.

Humans, you would say, are really good at communicating and really good at understanding language. But even humans need to be communicated with, and even humans have communication problems from time to time. So at the very core, prompt engineering is just telling the AI what you want it to do, plus the process of tinkering, the process of changing.

I'll give you a good example. People love to talk about how OpenAI's GPT was trained on a lot of online data, and a lot of that data was questions, these Stack Overflow type questions. Some people say there's some correlation in that data between how polite someone was and how good the answer was.

So people say, oh, if you say please and thank you to GPT, you'll get a better result. That might be true, but that's part of this whole tinkering process, and it might change over time. The whole process is things like where in the prompt you ask for something. So here's this one: write a tagline for an ice cream shop.
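To make the tagline example concrete, here is a minimal sketch of how such a prompt might be assembled in code. The `build_prompt` helper, its parameters, and the sample taglines are illustrative assumptions, not something shown in the talk; the point is only that style, examples, and example placement are separate knobs to tinker with.

```python
def build_prompt(task, style=None, examples=(), examples_last=False):
    """Assemble a plain-text prompt from a task, an optional style hint, and optional examples.

    Hypothetical helper for illustration only.
    """
    instruction = task if style is None else f"{task} Write it as {style}."
    if not examples:
        return instruction
    example_block = "\n".join(f"Example: {example}" for example in examples)
    # Placement is one of the things worth experimenting with.
    parts = [instruction, example_block] if examples_last else [example_block, instruction]
    return "\n\n".join(parts)


zero_shot = build_prompt("Write a tagline for an ice cream shop.")
few_shot = build_prompt(
    "Write a tagline for an ice cream shop.",
    style="an allegory",
    examples=["Scoops of joy in every cone.", "Life is short, eat dessert first."],
    examples_last=True,
)
print(few_shot)
```

Each variant is just a different string, which is why trying several of them in a playground is cheap.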

Maybe we want it to be an allegory. Maybe I'll give examples in the prompt, which is called few-shot learning. And then maybe I'll put the examples at the end. So yeah, enough about the high level of prompt engineering. Let's dive into how to get started. So go to the playground.

That's the best way to get started doing prompt engineering. Just start playing around. Honestly, prompt engineering is a new field. It's a new skillset. It's not just engineering; there are a lot of non-technical people who are really good at prompt engineering. To me, it's like hacking.

It's just tinkering. You're not gonna get very far, in my opinion, writing down rules for how to best prompt engineer. The best way to do it is to just start trying stuff out. So, yeah, let's go to the next slide.

So what is prompt engineering today? Now that we understand what prompt engineering is, again this whole tinkering process of giving AI a task, how are people doing it today? What exists today, and how is this gonna change in the future?

So the most interesting thing today is that with prompt engineering we're kind of at day one. Right here it says, "Test a few prompts and ship it to production." That's how people are doing it today. If you're an engineer and you see this, there should be sirens going off in your head.

I won't name names, I won't call them out, but I've spoken with almost all the top teams building with LLMs. For a vast majority of them, if not all of them, how they test prompts is that they basically have 10 test examples on their computer.

They write a prompt and test it against those 10 examples. If it works, they say, "Oh, looks good to me," and they just ship it to prod. If you're an engineer, again, sirens should be going off, because at your current company you're probably used to having a thousand-step deployment process where you're testing everything. It should be reliable.
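As a rough sketch of turning those "10 examples on a laptop" into something repeatable, the snippet below runs a prompt template against a list of cases and reports a pass rate instead of an eyeballed "looks good to me". Everything here is an assumption for illustration: `fake_model` stands in for a real LLM call, and the checks are deliberately trivial.

```python
from typing import Callable, List, Tuple


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): upper-cases the last line of the prompt."""
    return prompt.splitlines()[-1].upper()


def pass_rate(
    template: str,
    cases: List[Tuple[str, Callable[[str], bool]]],
    model: Callable[[str], str],
) -> float:
    """Run every (input, check) case through the model and return the fraction that pass."""
    passed = 0
    for user_input, check in cases:
        output = model(template.format(input=user_input))
        if check(output):
            passed += 1
    return passed / len(cases)


cases = [
    ("hello", lambda out: out == "HELLO"),
    ("ship it", lambda out: "SHIP" in out),
    ("ok", lambda out: len(out) > 5),  # deliberately failing case
]
rate = pass_rate("Shout the text below.\n{input}", cases, fake_model)
print(f"pass rate: {rate:.0%}")
```

With a real model the checks would be fuzzier, but even a small harness like this makes a prompt change comparable from one run to the next.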

You should be confident that if you ship it to production, it'll be good. Testing in the playground that I showed you before is good enough 70% of the time, but you need it to work a hundred percent of the time. So this is how it's done today. People have these 10 pet test cases, and they'll run the prompt against those.

OpenAI is asking for help evaluating, and all of that. We're in the very early stages of this; that's basically my point. That's how it's done today. So if you're doing prompt engineering, don't worry so much.

If you're building version one, some of the things you're doing might be sketchy, because all the great companies are doing it sketchily today, but that'll change, hopefully. So let's talk about some concepts that exist today. These concepts are very interesting because they are not LLM concepts.

I would think about the prompt as machine code. Just as Python is a few abstraction layers above machine code, prompts are the first version, and memory and the other concepts I'm about to talk about are new abstractions we've built on top of prompts and prompt engineering in general to make it better.

So chains are the first concept, a very important concept right now. You've probably heard of LangChain, an amazing library that's become the de facto prototyping library. So how do you build version one? LangChain is built on this concept of helping people chain prompts together. As you see here, I wrote that the more complicated the problem is, the more prompts you need, honestly.

So I'll give you an example. I made a small hack a few months ago. I never released it; I probably should clean it up and release it. It gives you a daily digest of what happened, written as two or three tweets so people can be updated every day. To do that, it's not just one prompt, although maybe today it could be with GPT-4.

I had to chunk up the summary because it's too long to put into one prompt; prompts have a limited length. I had to chunk it up, I had to summarize each summary to make it shorter so I could put it into a smaller context window, and then I had to convert it to tweets. That's like three different prompts per summary.
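The chunk, summarize, then convert-to-tweets flow described above can be sketched as a small chain. This is not the speaker's code and does not use LangChain; `fake_llm`, the prompt wording, and the 200-character chunk size are assumptions chosen so the example runs without a model.

```python
from typing import Callable, List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split text on whitespace into pieces of roughly max_chars.

    A single word longer than max_chars is kept whole rather than cut.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def run_chain(text: str, llm: Callable[[str], str], max_chars: int = 200) -> str:
    """Three prompt stages: summarize each chunk, merge the summaries, rewrite as tweets."""
    summaries = [llm(f"Summarize:\n{chunk}") for chunk in chunk_text(text, max_chars)]
    digest = llm("Combine these notes into one short digest:\n" + " ".join(summaries))
    return llm(f"Rewrite this digest as two or three tweets:\n{digest}")


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model (assumption): records the prompt and truncates its last line."""
    calls.append(prompt)
    return prompt.splitlines()[-1][:40]


tweets = run_chain("word " * 100, fake_llm)
print(len(calls), "prompts were needed for one digest")
```

Even this toy input needs several model calls, which is the reason the prompt count climbs so quickly for longer documents.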

That could be like 20 different prompts. So this is what a chain is. Chains also have a concept of agents and stuff like that, and that's what I should be talking about next. There we go, I forgot what came next. So agents and plugins are the second concept. A chain is multiple prompts together, but it could also have agents, and plugins are the ChatGPT version.

So you've probably seen ChatGPT and its plugins, like the Wolfram plugin. These are ways to get LLMs to do tasks that LLMs are not good at. So I'll tell you my mental model. My mental model is that LLMs have kind of solved natural language. What I mean by that is, when I'm prompt engineering, I'm assuming that GPT, or the LLM, can understand my text and can write text. It knows how to talk, period. But you can't really trust it with information. You can't ask it to define things. You definitely can't ask it to do computation, because it only knows how to talk. It's like a child that can't do math but can talk. Think about it that way, and you'll have great results.

So writing code you can consider talking, because it's just talking in code. Asking an LLM to write a poem, it's really good at that because it's good at talking. But it's not good at telling you population statistics for countries. It's not good at sorting a list or figuring out how many unique things are in a list.
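One way to picture handing computation to a tool instead of the model is a tiny dispatch table. This is a hand-written sketch, not LangChain's agent API or the ChatGPT plugin protocol; the tool names and the shape of the `call` dictionary are assumptions. In a real agent the model would produce the structured call, and ordinary code would execute it.

```python
from typing import Any, Callable, Dict, List

# Deterministic tools for the jobs the talk says a model should not be trusted with.
TOOLS: Dict[str, Callable[[List[Any]], Any]] = {
    "sort": lambda items: sorted(items),
    "count_unique": lambda items: len(set(items)),
}


def run_tool(call: Dict[str, Any]) -> Any:
    """Execute a tool call shaped like {"tool": name, "items": [...]}."""
    name = call["tool"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](call["items"])


# Hard-coded here; an agent would have the model emit this structure.
call = {"tool": "count_unique", "items": ["a", "b", "a", "c"]}
result = run_tool(call)
print(result)
```

The model's job shrinks to choosing the tool and phrasing the answer, which is the "talking" it is good at.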

That's where agents and plugins come in. Agents is a LangChain term, a LangChain concept. Plugins are the alternative. These are about bringing in different tooling, bringing in Wolfram, that sort of thing. Third concept: user segmentation. This is very interesting.

Basically, if you're using LLMs in your product today, you'll probably find that you need to use different prompts for different users, and possibly different models for different users. The main reason for different models is that some models are better at some things, some are cheaper, and some are faster.

Today we have three main ones from OpenAI that people use, but soon you're gonna have Cohere and Anthropic. You're gonna have fine-tuned models, and you're gonna have open source models. Shout out to GPT4All, doing some really cool stuff. And you're gonna have to segment your users into different groups.

We actually had some people using our platform telling us this. I'll shout out Feder ai, a really cool app. The creator told us that, basically, different users need to get different prompts to have the best results. This user segmentation came up with so many teams we've spoken to: basically, the matching of prompts, users, and models.
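A minimal sketch of that matching of prompts, users, and models might look like a routing table keyed by user segment. The segment names, model names, and templates below are all hypothetical placeholders, not real model identifiers or anything from the talk.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Route:
    model: str
    prompt_template: str


# Hypothetical segments and model names, for illustration only.
ROUTES: Dict[str, Route] = {
    "free": Route("small-fast-model", "Answer briefly: {question}"),
    "pro": Route("large-accurate-model", "Answer in detail and explain your reasoning: {question}"),
}
DEFAULT_SEGMENT = "free"


def route_for(user: dict) -> Route:
    """Pick the prompt/model pair for a user, falling back to the default segment."""
    return ROUTES.get(user.get("plan"), ROUTES[DEFAULT_SEGMENT])


route = route_for({"id": "u1", "plan": "pro"})
prompt = route.prompt_template.format(question="What is a context window?")
print(route.model, "|", prompt)
```

Keeping the mapping in data rather than in scattered `if` statements makes it easier to add a segment or swap a model later.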

Cool. So now we've reached the next section, which is real-world challenges. Now that we know how people use this today, what are the problems with it? What's going wrong? First problem: breakages. How do you know when things fail? The core of building a reliable system is knowing when it breaks.

And this is a non-trivial problem with LLMs, because what does breaking mean? What counts as a failure mode? Take Bing: I'm sure everybody heard that people were able to make it act very rude and angry toward them. Is that a failure? How do you detect that?

In my example earlier, if the tweet I'm generating is over 250 characters, 262, whatever the number is, if my tweet is too long, is that a failure? Failure is not just "did the LLM crash," it's "did the LLM give a bad output that doesn't work?" I've heard of people using GPT to generate JSON.

Is the JSON malformed? That's another failure. So this is a key part: how do you detect when things are wrong? There are a lot of ways people do it today. I'll call out three different ways. One, you could hire people on MTurk to sort through your data. That's really boring.

That's not the future. Second way: end users. You can ask your end user to give you a thumbs up or a thumbs down, or you could interpret their actions. If a user clicks out of the chatbot, you probably gave them a bad result. If they share or star it, it's probably a good result.
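The two mechanical failures mentioned earlier, a tweet that is too long and malformed JSON, are the easy ones to catch in code. The sketch below is one possible shape for such checks; the `find_failures` helper and its `kind` argument are assumptions, and the 280-character limit is an assumed value standing in for "whatever the number is".

```python
import json
from typing import List

MAX_TWEET_CHARS = 280  # assumed limit; adjust to whatever the platform enforces


def find_failures(output: str, kind: str) -> List[str]:
    """Return a list of human-readable problems found in a model output."""
    failures: List[str] = []
    if kind == "tweet" and len(output) > MAX_TWEET_CHARS:
        failures.append(f"tweet too long: {len(output)} > {MAX_TWEET_CHARS} characters")
    if kind == "json":
        try:
            json.loads(output)
        except json.JSONDecodeError as exc:
            failures.append(f"malformed JSON: {exc.msg}")
    return failures


print(find_failures("x" * 300, "tweet"))
print(find_failures('{"title": "ok"', "json"))
```

Checks like these cover format failures only; tone problems such as a rude chatbot need human feedback or a model-based evaluator.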

So that's kind of a way to understand when things are breaking. Comparisons are the next big challenge. At the end of the day, everything boils down to: is prompt A better than prompt B? Is version three better than version four? And are my new prompts safe to ship? This goes back to that earlier thing I said, where if you're an engineer, there are sirens going off in your head.

If you're just building a prompt that you think is better and shipping it to production, "you think" is not good enough in production systems. You must know, like you must testably know. And for the same reason that it's hard to know when LLMs fail, it's hard to know when prompt A is better than prompt B.

Our theory is that production data is how you know, but in general this is a big problem today. How do you compare different prompts? Cost is obviously the first comparison. Optimization is the other problem here: how do you make your LLM app better over time? This encompasses the other problems, but how do you choose which model to use for which task?

Today that's not as big of an issue, but think back a little bit, to when people were using GPT-3 and then 3.5 Turbo came out, which was 10 times cheaper and 10 times faster. How did they figure out that it was good enough to use? Today we know it's good enough for most things, but how do you actually test and choose which model to use when? Again, you can imagine a not-too-far-away future where you have a lot of models to choose from. And A/B testing is obviously a huge part of this. This goes back to user segmentation: how do you provide different prompts to different people with different models?

And maybe you have a chain of five different prompts, all with different models and all determined at runtime based on the user. So the big thing here is: how do you get the confidence to deploy your LLM app? How do you get the confidence to deploy new prompts?
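A bare-bones version of A/B testing prompts on production traffic needs two pieces: a stable way to assign each user to a variant, and a way to compare feedback per variant. The sketch below shows both under stated assumptions: users are identified by a string id, feedback is 1 for thumbs up and 0 for thumbs down, and the event data is made up.

```python
import hashlib
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Sequence, Tuple


def assign_variant(user_id: str, variants: Sequence[str] = ("A", "B")) -> str:
    """Deterministically map a user to a variant so they always see the same prompt."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def mean_score_by_variant(events: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Average the feedback scores recorded for each variant."""
    scores = defaultdict(list)
    for variant, score in events:
        scores[variant].append(score)
    return {variant: mean(values) for variant, values in scores.items()}


# Made-up feedback events: (variant, 1 = thumbs up / 0 = thumbs down).
events = [("A", 1), ("A", 0), ("B", 1), ("B", 1)]
summary = mean_score_by_variant(events)
print(assign_variant("user-42"), summary)
```

Four events prove nothing, of course; with real traffic you would also want enough samples and a significance test before calling one prompt better.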

If you're building a system for thousands or millions of people, you better be damn sure that your prompt is gonna work. If you're building a hack on the side and you're gonna launch it on Twitter, it's fine. You probably don't need to worry about all this stuff; just do some prompt engineering.

But when you're ready to build something reliable, when you wanna start charging end users, you need to make a good system, and confidence is an important part of that. So I guess the next part is PromptLayer. PromptLayer is the tool we're building, and it tries to solve a lot of these real-world problems.

Our goal is to be the platform you come to to make your LLM reliable. Again, that's the reason we've been talking to all these teams and trying to figure out what they're doing, and why I'm giving you these examples of how people are doing it without calling out names specifically. But let me just show you how PromptLayer works.

So I will zoom in once here. PromptLayer is super simple to use. pip install, create an API key, and it's just one line of code. Instead of importing OpenAI, you do promptlayer.openai, and the rest of your code stays the same. Now you can track your usage, so you can see what's going on in your system.
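To illustrate the general idea of wrapping a model call so every request is recorded, here is a library-agnostic sketch. It is not PromptLayer's API or implementation: `tracked`, `REQUEST_LOG`, and the per-character cost estimate are assumptions made up for this example (real pricing is usually per token).

```python
import time
from typing import Callable, Dict, List

REQUEST_LOG: List[Dict] = []  # in-memory stand-in for a tracking backend


def tracked(llm: Callable[[str], str], model_name: str, usd_per_1k_chars: float) -> Callable[[str], str]:
    """Wrap an LLM callable so each request's prompt, response, latency, and rough cost are logged."""

    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        response = llm(prompt)
        REQUEST_LOG.append(
            {
                "model": model_name,
                "prompt": prompt,
                "response": response,
                "latency_s": time.perf_counter() - start,
                # Crude proxy: characters instead of tokens.
                "est_cost_usd": (len(prompt) + len(response)) / 1000 * usd_per_1k_chars,
            }
        )
        return response

    return wrapper


# A fake model keeps the example runnable without network access.
fake = tracked(lambda prompt: "ok", model_name="demo-model", usd_per_1k_chars=0.5)
fake("hello")
print(REQUEST_LOG[0]["model"], REQUEST_LOG[0]["est_cost_usd"])
```

The calling code keeps using `fake(...)` exactly as it would use the unwrapped function, which is the same "rest of your code stays the same" property the talk describes.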

This is the first step to solving any of these problems. You can see what the prompt is. You can go into the JSON, you can see how much it's charging you for the prompt, you can favorite requests, you can share a link with your coworkers, all that fun stuff. And this is very important: really, the first step to building a reliable system is knowing what's going on in the system.

So the prompt registry is the second part of the product that I wanna call your attention to. This is built on top of LangChain's prompt template spec. It lets you create a template visually: just type in your prompt template here. It's very important to organize these prompts.

Like I was talking about, this whole process is tinkering, so you should have tens or hundreds of prompts that you're building new versions of. Build them in the prompt registry, and then use our programmatic API to pull them down. And if we click into one of these, this goes into the problems I was talking about in terms of prompt comparison, checking for breakages, and stuff like that.
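The idea of keeping named, versioned prompt templates and pulling them down from code can be sketched with a small in-memory class. This is a toy stand-in, not PromptLayer's registry or its programmatic API; the `PromptRegistry` class and its `publish` and `get` methods are hypothetical names.

```python
from typing import Dict, List, Optional


class PromptRegistry:
    """Toy in-memory registry: each name maps to an append-only list of template versions."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}

    def publish(self, name: str, template: str) -> int:
        """Store a new version of a template and return its 1-based version number."""
        self._versions.setdefault(name, []).append(template)
        return len(self._versions[name])

    def get(self, name: str, version: Optional[int] = None) -> str:
        """Fetch a specific version, or the latest one when no version is given."""
        versions = self._versions[name]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]


registry = PromptRegistry()
registry.publish("summarize", "Summarize the text below.\n{text}")
v2 = registry.publish("summarize", "Summarize the following text using big words and jargon.\n{text}")
prompt = registry.get("summarize").format(text="We hold these truths to be self-evident.")
print(v2, prompt)
```

Because old versions are never overwritten, every logged request can be tied back to the exact template that produced it.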

So we have version three and version four. This is a stupid example I built: summarize the following text using big words and jargon. I don't know, maybe using jargon gives us a better summary. I can see all the times it's been used. I can see all the times version three's been used.

I can see the cost, I can compare, and I can see the score. The score is a really interesting part. Where does the score come from? This is that real-world problem of whether prompt A is better than prompt B. So let's open one of them. You could do the score visually, but most people do it programmatically.

Basically, this could be an end user giving you a thumbs up or thumbs down. Or, and this is way more interesting, and I think I talked about it earlier, it could be LLM synthetic evaluation. So I have a version for my summary example. I wrote: "You're a human helping to train an AI system. The AI system is designed to summarize pieces of text. We need to label the training data, and you're the human that will help with this." Obviously this is a prompt engineering process; you can see I went through two versions here. You have the original, which is a variable where we throw in the original, and you have the summary, which we throw in too.

"How would you rate the summary from 0 to 10? A good summary does not leave out any information but just cuts down on the number of words and complexity of the sentence. The higher the rating, the simpler the summary is to read. The answer should be a single number." This is called few-shot learning, giving examples in your prompt.

It would be even better if I actually gave examples of the summary, but this was kind of a quick hack. But yeah: example 5, 4, 8, whatever, this is to show it that I want a number, and then "your answer." You can see here the times I've run it, so we can open one of them. This is the original: "We hold these truths to be self-evident."

Do you know where that's from? Here's the summary, and it gave it a seven. So we'll see how good seven is. Again, this is a toy example, but the whole point of evaluating synthetically is that you're gonna have so much data, as you can see on the left, that you need a way to bubble up when things are not working.

That's the better way, I think, to think about it. Think about it in reverse: instead of trying to figure out how good your prompts are, try to figure out how bad they are. We actually have some features that'll come out soon for this, but flag when it's not working. Give it a zero when something's bad.
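The synthetic evaluation flow described above, filling an evaluation template with the original and the summary, asking for a single number, and flagging the bad ones, can be sketched as follows. The template text paraphrases the prompt read out in the talk; `parse_score`, `flag_low_scores`, the threshold of 5, and the `fake_evaluator` stand-in are assumptions for illustration.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

EVAL_TEMPLATE = (
    "You are a human helping to train an AI system that summarizes pieces of text.\n"
    "Original: {original}\n"
    "Summary: {summary}\n"
    "How would you rate the summary from 0 to 10? The answer should be a single number.\n"
    "Answer:"
)


def parse_score(reply: str) -> Optional[int]:
    """Pull the first integer out of the evaluator's reply; None if missing or outside 0-10."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    value = int(match.group())
    return value if 0 <= value <= 10 else None


def flag_low_scores(
    records: List[Dict[str, str]],
    evaluator: Callable[[str], str],
    threshold: int = 5,
) -> List[Tuple[Dict[str, str], Optional[int]]]:
    """Return the records whose score is below the threshold or could not be parsed."""
    flagged = []
    for record in records:
        score = parse_score(evaluator(EVAL_TEMPLATE.format(**record)))
        if score is None or score < threshold:
            flagged.append((record, score))
    return flagged


def fake_evaluator(prompt: str) -> str:
    """Stand-in for the evaluating LLM (assumption): likes one summary, is unsure otherwise."""
    return "7" if "Summary: All people" in prompt else "not sure"


records = [
    {"original": "We hold these truths to be self-evident.", "summary": "All people are equal."},
    {"original": "We hold these truths to be self-evident.", "summary": ""},
]
flagged = flag_low_scores(records, fake_evaluator)
print(len(flagged), "record(s) need a look")
```

Treating an unparseable reply as a flag, rather than silently skipping it, matches the "figure out how bad they are" framing: anything the evaluator cannot score cleanly gets surfaced for review.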

You could imagine, if I had a chatbot, I could have an LLM that asks, "Is my chatbot being rude?" The only other thing I'd show you here is that we have an analytics dashboard. You can see which models you're using. We can look at how many requests there are, and latency. We can see that my evaluation actually cost me a lot, like a dollar.

But yeah, so this is PromptLayer. I encourage you to sign up. It's free to use. You can contact us. We're trying to solve a lot of the problems that exist in real-world prompt engineering. I think prompt engineering's amazing. It's gonna be a huge part of the future of probably every single company, but making it reliable is a non-trivial problem.

So that's what we're doing. You can try it out for free; it's free today. And send me a message if you have any questions. I hope you enjoyed this talk. I'm very excited for the discussions that everybody's gonna be having around prompt engineering in the future.

Honestly, I think it's just fun, and there's a lot to learn about it. There are a lot of places it's gonna go, but it's definitely gonna be important. I'll be in the Arize Slack community, so if you have any questions, I'll be answering them there. But yeah, thank you for tuning in.
