
RAG vs Fine-Tuning

Sarah Welsh

Contributor

Introduction

This week we discussed “RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture,” a paper that explores pipelines for fine-tuning and RAG and presents the tradeoffs of both across multiple popular LLMs, including Llama 2-13B, GPT-3.5, and GPT-4. The authors propose a pipeline that consists of multiple stages: extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 to evaluate the results. Overall, the results point to how systems built using LLMs can be adapted to respond to and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.


Transcript

Amber Roberts: Alright. Before we get started on the paper, I just want to briefly go through RAG. I notice that when I talk about search and retrieval methods and when I talk about RAG, I sometimes use them interchangeably, where search and retrieval is just the traditional ML version of RAG that doesn’t use an LLM.

Search and retrieval covers your traditional recommendation systems and search engines. RAG and search and retrieval have the same use cases; essentially, RAG just adds that augmented generation aspect.

If you didn’t check out Mikyo’s and my workshop, that was one week ago. Sarah, if you have the recording of that, I think folks attending this paper reading would be interested: we showed how to use RAG, with a full code-along notebook, and went into the details of evaluations with it. But essentially, if you’re using retrieval augmented generation, you have an offline step where you have a knowledge base of articles and you figure out how to store and index them. Then you have a user query (or any query that’s available), you embed that query with the same embedding method you used to store all those documents, you look up which documents are most similar to that query, you pull those documents, and you figure out which ones you want to give the LLM. You tell the LLM: this is the query, these are the relevant documents, can you generate a response for this user? And then you send that back to the user. You might have been seeing RAG a lot, and I’m sure that’s why so many people attending this paper reading have attended our various RAG workshops and events: RAG is what we’re seeing used most commonly in industry for LLM use cases, even though the most common use cases in industry still aren’t large language model-related. SallyAnn and I will get into that, especially because SallyAnn has been working directly with customers and seeing where their thoughts are around large language models and RAG systems.
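
For readers following along, here is a minimal sketch of that retrieval-plus-generation loop in Python. The embedding model, the documents, and the prompt wording are illustrative placeholders (not taken from the workshop or the paper), and in practice the document vectors would live in a vector database rather than an in-memory array:

```python
import numpy as np
from openai import OpenAI  # assumes the OpenAI Python client (>=1.0) is installed

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts with the same model used to index the knowledge base."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Offline step: store and index the knowledge base (a vector database in practice).
documents = [
    "Crop rotation guidance ...",
    "Soil pH recommendations ...",
    "Irrigation scheduling notes ...",
]
doc_vectors = embed(documents)

def answer(query: str, k: int = 2) -> str:
    # 1. Embed the query with the same embedding method used for the documents.
    q_vec = embed([query])[0]
    # 2. Look up the most similar documents (cosine similarity here).
    scores = doc_vectors @ q_vec / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n\n".join(documents[i] for i in np.argsort(scores)[::-1][:k])
    # 3. Tell the LLM: here is the query and the relevant documents, generate a response.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Answer the question using the documents below.\n\n"
                       f"Documents:\n{context}\n\nQuestion: {query}",
        }],
    )
    return resp.choices[0].message.content
```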

But what we’re seeing more with RAG is the number of advanced capabilities and the number of complications that come up when implementing them in production. I show you this workflow and it sounds pretty straightforward: okay, these are the steps, makes sense. But each one of these steps can have a ton of problems and require a ton of evaluations: whichever indexing method and embedding model you’re using to store this information, whatever lookup, similarity score, or distance metric you’re using to get the relevant documents, how you’re ranking those relevant documents, the number of documents you’re selecting, how you’re scoring their relevance. All of that factors into what you’re giving the LLM to generate that response. And of course there are all the response evaluations as well.

We want to answer a lot of the questions people are thinking about with search and retrieval methods and with leveraging RAG, because it can be as complicated as you want it to be, and you can add as many additional steps as you’d like.

Really, this paper talks about fine-tuning and retrieval augmented generation, and we’ll get into it because SallyAnn and I do have some thoughts. But in terms of fine-tuning versus RAG, what Arize has talked about previously is this: if you are using a search and retrieval system, you’re using your own internal documents. If you’re using a RAG system, you have to find a way to leverage your proprietary data while usually relying on a third-party large language model. So if you don’t have your own small language model, or your own open source large language model that you’ve trained, you’re probably using a RAG system to give that LLM relevant information.

This can be very helpful to increase performance for things like chatbots and QA systems. But it is very costly. These are large systems, they are expensive, and there’s more and more coming out every day on the ways RAG fails in production. These are all being figured out right now; there’s a lot of work going into debugging these systems, which is good because this is cutting edge. But there aren’t a lot of teams that have implemented this and have it running smoothly in production. We hadn’t seen it come up as much because six months ago people were saying: oh, is there a way to fine-tune these really large systems? There weren’t great ways of doing that, but now there are smaller models to fine-tune, which SallyAnn and I will get into when looking at the piece.

So, SallyAnn, what were your first thoughts when you saw the title: “RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture”?

SallyAnn DeLucia: Yeah, when I first saw this title, I thought the focus of the group at Microsoft was going to be on really comparing the performance of a RAG system versus fine-tuning on the same task. So I thought it was going to kind of answer that question that comes up a lot: should I use RAG, or should I use fine-tuning? Because I think there’s been a big discussion of, okay, you should use RAG when you need in-context learning, meaning there’s a task, QA being the classic one, where you need to supplement the user’s question with some context for the LLM to answer appropriately. And then fine-tuning has been said to be only for formatting or style, things like that, or a downstream task. So I thought we were maybe going to take one task with these two approaches and then compare the performance.

But that is not exactly what they did here. So my real takeaway from this: it’s a very interesting approach that this group has taken. They are looking at an agricultural use case, which I think is definitely an interesting one. They list a bunch of reasons why they’ve chosen this use case, and I totally agree with them. What they did was take what they considered reliable sources from different agencies, a lot of them government agencies, for this agricultural information. They extracted all of that information from the PDFs and used it to create QA pairs. They then used those for RAG and fine-tuning. So that’s a high level of what is going on here.

Any questions on that, Amber? Any follow ups?

Amber Roberts: Yeah. And folks, feel free to use the chat and Q&A if you have additional questions. So SallyAnn, can you tell us about the way they did the extraction? Because we’ve talked about structured data extraction previously, using LLMs for structured data extraction. Did they do that? Did they do it in a different way?

SallyAnn DeLucia: Yeah, what’s really interesting is they didn’t choose to use LLMs to do the structured extraction from the PDFs. What everybody may or may not know here is that PDFs aren’t your usual text documents. They have a weird structure to them, so extracting information can be difficult because you have to not only get the actual text but also keep the structure of the PDF intact.

So they talk a lot about needing to do that, which is absolutely right. But they went with some more traditional methods, like PDF-to-text. That’s, I feel like, one of those old-school open source Python packages that you can use for things like OCR tasks. They mentioned another one which I’m not super familiar with, GROBID (GeneRation Of BIbliographic Data), which is, I guess, another one for scientific literature. So my first thought reading that was that it was interesting they chose to use traditional methods. I almost would have loved to see how an LLM did that, with a bit more focus on the fine-tuning side, to see how well they could extract information from these kind of old scientific PDFs. So that’s how they did it.

Amber Roberts: So this is the general workflow they used. You mentioned how they created and generated the QA pairs: a real-world use case, but not real-world data, with the data and information generated by these large language models. Then they broke it up into the two areas of fine-tuning and RAG, connected that to GPT-4, and then ran GPT-4-based evals.

Can you dive into this process here? Because it’s interesting what they chose to do first.

SallyAnn DeLucia: Yeah. So what’s interesting is that it’s a visual, right? It’s supposed to simplify things a little bit. But there are some intermediate steps. For the QA generation, the first step is question generation, where they’re feeding in those JSON files from the extracted text.

So I think if you scroll down a little bit, maybe a page or two, they show you what the final output of extracting all the data into the JSON looks like.

Okay, so they’re using this JSON that they’ve created from those PDFs, and it keeps not only the information but all the structural information, as well as anything related. You can think, if you’re reading a PDF, maybe it’s talking about something and then it references an image; those relationships are all kept in this JSON. And so they feed this to the LLM and basically prompt it to create questions for them. But then they use RAG to generate the answers, which is kind of left out of that initial visual of the pipeline we were looking at. So to create the question-answer pairs, they first create questions and then they create the answers with the RAG pipeline.

So this isn’t exactly what they use to generate the questions, but it is how they verify the supporting context. Yeah, okay, cool.

So they prompt it: you’re an expert in agriculture, and you need to formulate questions from documents to assess students. They’re almost making a test. Then they give it the metadata from the documents and ask it to give an answer.
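
A rough sketch of that two-stage QA-pair generation, with the prompt wording paraphrased from the discussion rather than copied from the paper, and with hypothetical JSON field names:

```python
from openai import OpenAI  # assumes the OpenAI Python client

client = OpenAI()

# Paraphrased from the discussion; the paper's exact prompt wording differs.
QUESTION_PROMPT = (
    "You are an expert in agriculture. Using the document below, write exam-style "
    "questions that assess a student's understanding of its content.\n\n"
    "Title: {title}\nSection: {section}\nText: {text}"
)

def generate_questions(doc: dict) -> str:
    """Step 1: prompt the LLM to create questions from an extracted JSON record
    (the 'title'/'section'/'text' keys are hypothetical field names)."""
    prompt = QUESTION_PROMPT.format(
        title=doc["title"], section=doc["section"], text=doc["text"]
    )
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# Step 2, per the discussion: each generated question is then answered by the RAG
# pipeline, which retrieves supporting context before generating the answer.
```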

Amber Roberts: Okay. And then we have the fine-tuning aspect here.

SallyAnn DeLucia: Yeah. So they’re taking those questions and answers from the datasets they’ve created, and they’re fine-tuning some Llama models; I believe it’s Open Llama 3B, 7B, and 13B. They did that with fully sharded data parallelism. What’s interesting is that they handled GPT-4 a little bit differently.

Because it’s larger and more expensive, something everybody is probably familiar with for these models, they had to take a different approach. They actually used LoRA.
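
For reference, LoRA fine-tuning with Hugging Face's peft library looks roughly like the sketch below. The model ID and hyperparameters are placeholders, not the paper's actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder model: a small open Llama variant stands in for the larger models discussed.
model_id = "openlm-research/open_llama_3b"
base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA freezes the base weights and trains small low-rank adapter matrices instead,
# which is why it is attractive for the largest, most expensive models.
config = LoraConfig(
    r=8,                                   # rank of the adapter matrices
    lora_alpha=16,                         # scaling factor for the adapters
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base parameters
```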

I would have liked to see them be a little more consistent with their approaches. We’re trying to get a comparison of how models do with fine-tuning, but fine-tuning one set of models one way and another model a different way adds an extra variable. It makes it a little harder to feel confident in how to interpret the results. That’s something I noticed while reading.

Amber Roberts: And we did a paper reading on the LoRA paper as well, if folks are interested in seeing how that method is used.

And let me just go back, because this is the kind of agricultural textbook you would open up, and then those levels of extraction to get to the QA pairs here.

And then, in terms of these next steps for evaluation: SallyAnn, our company does specialize in doing evaluations in production. So were there some obvious things you were expecting to see? What did you end up seeing? Did you have feedback on what would make a good next step for this paper’s evals?

SallyAnn DeLucia: Yeah, what was really interesting to me is that they went outside of what I would consider the norm for evals. There are a lot of opinions surfacing, and of course at Arize we have our own, but there’s a lot of groundwork being done for evals. And what really surprised me, especially for a group coming out of Microsoft, was that there were new evals I hadn’t really heard of. Relevance is an eval we hear of a lot, especially with RAG, right? There’s always that argument that the way RAG works is to fetch context based on a similarity metric, which looks at embedding similarity, but that doesn’t always mean the retrieved context is going to be the most relevant. So that one was expected. Actually, if you scroll, there are two places.

They evaluate their questions and answers with different criteria, and also the QA pairs. So this right here is the QA pairs. They measure the relevance by basically prompting ChatGPT: hey, rate this question on a scale of 1 to 5, is it a question that would be asked by a farmer or not? So the relevance I was expecting was not actually the relevance they ended up using. This is relevance to the use case, because this is an agricultural use case; they were asking whether the question was actually relevant to agriculture, which I thought was interesting. I expected this to be more of the traditional relevance. And then they follow that up by making sure the question is relevant without considering any context that was given. So it’s a really interesting way to think about it.
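
A minimal sketch of that kind of LLM-as-judge scoring, assuming an OpenAI-style client; the judge prompt here is paraphrased from the discussion, not the paper's exact wording:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical judge prompt; the paper's exact wording and criteria differ.
JUDGE_PROMPT = (
    "You are evaluating questions for an agriculture application.\n"
    "On a scale of 1 to 5, how likely is it that the following question would be "
    "asked by a farmer? Reply with a single number.\n\n"
    "Question: {question}"
)

def relevance_score(question: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question)}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    # Assumes the model complies and replies with just the number.
    return int(resp.choices[0].message.content.strip())
```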

I do think, if you’re going to build an application that’s super specific to an industry like they are here, it is worth measuring whether something is relevant to the use case or not. But it was just a different kind of eval than I’m used to.

Amber Roberts: Especially with so much research being done on relevance and relevance scoring, and even with the out-of-the-box metrics available that support relevance, when you have to redefine relevance so people understand what you mean by it, it’s a little different.

And someone asked how they were converting PDFs to JSON.

SallyAnn DeLucia: Yeah, so they used some traditional ML techniques there. I believe they ended up with this; I’m going to just type it in the chat. It’s GROBID, which stands for GeneRation Of BIbliographic Data. It’s a library specifically tailored for extracting data from scientific literature. They did talk about PDF-to-text and some of those other traditional ones, but they didn’t end up going with those, so I’ll throw GROBID in the chat. That was something I had not heard of.
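
For anyone curious, GROBID runs as a local service with a REST API, so the extraction step looks roughly like the sketch below (the URL and usage reflect the library's standard setup, but treat the details as an assumption and check the GROBID docs):

```python
import requests

# Assumes a GROBID server is running locally on its default port.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def extract_fulltext(pdf_path: str) -> str:
    """Send a PDF to GROBID and get back structured TEI XML (sections, references,
    figure links), which a pipeline like the paper's can then convert to JSON."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=120)
    resp.raise_for_status()
    return resp.text
```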

Amber Roberts: All I knew was that they didn’t use LLMs for structured data extraction. Very surprised by that. Great question.

SallyAnn DeLucia: Okay. So we were talking about the evals. What’s interesting, to your point, is that relevance might mean something different to different people. They use Coverage for what I think you and I would think of as relevance, which is: can the answer be extracted from the context from RAG? So coverage to them is maybe what relevance is to us and to a lot of the industry-standard evals. The other thing I thought was kind of interesting, well, there were two that were actually really interesting.

Overlap was an eval that I thought was super interesting, and I recognized that metric: KL divergence is a drift metric, because that’s how I usually think about it. What they did here was compare the overlap between the questions and the context, which I thought was an interesting application of KL divergence. There are other metrics we saw very early on with generative AI, especially with summarization, like BLEU score and ROUGE, that measure how similar two texts are, so I thought it was a really interesting metric. The other was Details. The way they assess the level of detail is actually to count the number of tokens, which I thought was interesting because, I don’t know, you and I have seen a number of times where an LLM generates an incredibly lengthy response that just regurgitates the same information over and over again. So it’s interesting to me that they used the count of tokens to determine how detailed something was, because I think that could be potentially misleading.
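
As a rough illustration (a generic approximation, not necessarily the paper's exact formulation), an overlap-style score can be computed as the KL divergence between the token distributions of two texts, and the "details" metric described here is just a token count:

```python
import math
from collections import Counter

def kl_divergence(p_tokens: list[str], q_tokens: list[str], eps: float = 1e-9) -> float:
    """KL divergence between the (crudely smoothed) token distributions of two texts.
    Values near 0 mean the two texts use words in very similar proportions."""
    vocab = set(p_tokens) | set(q_tokens)
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    p_total, q_total = len(p_tokens), len(q_tokens)
    kl = 0.0
    for tok in vocab:
        p = p_counts[tok] / p_total if p_counts[tok] else eps
        q = q_counts[tok] / q_total if q_counts[tok] else eps
        kl += p * math.log(p / q)
    return kl

def details(answer_tokens: list[str]) -> int:
    """'Details' as described in the discussion: simply the number of tokens."""
    return len(answer_tokens)

question = "what is the best planting window for winter wheat".split()
context = "winter wheat is typically planted in the early fall planting window".split()
print(kl_divergence(question, context), details(context))
```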

Amber Roberts: Yeah. I have no idea how these metrics were selected. Going down to page 17 here: coherence, relevance, and groundedness are what we’re hearing a bit more commonly when we’re talking about RAG systems, but not even a precision metric was used. They used a lot of 1-through-5 scales, and the thing I don’t like about 1-through-5 scales is that I don’t know how they were normalized and I don’t know the distribution. And then you have KL divergence where, from my understanding, an overlap of 0 is the good outcome here: in this case it basically means the answer almost restates the question. So that’s interesting.

If people are using RAG systems, I’m curious what metrics you’re most commonly using. Just as a brief aside for Arize: we tend to evaluate responses with hallucination and correctness evals, but for anything involved in RAG, especially when those documents are being returned, we still use traditional search and retrieval metrics, so precision, NDCG, traditional ranking metrics. And I think commonly, instead of using a scale of 1 through 5, it’s just: is this relevant, or is this not relevant? I’m not sure how to compare it, because essentially, right, SallyAnn, they’re just using GPT-4 to give the feedback for that relevance and providing it a prompt for that ranking.
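
For comparison, the traditional ranking metrics mentioned here are easy to compute once each retrieved document has a binary relevance label (from GPT-4 or a human annotator); this is a generic sketch, not Arize's or the paper's implementation:

```python
import math

def precision_at_k(relevance: list[int], k: int) -> float:
    """Fraction of the top-k retrieved documents labeled relevant (binary labels)."""
    return sum(relevance[:k]) / k

def ndcg_at_k(relevance: list[int], k: int) -> float:
    """Normalized discounted cumulative gain: rewards putting relevant docs near the top."""
    def dcg(scores: list[int]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(scores))
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0

# Each retrieved document gets a binary label (1 = relevant, 0 = not), e.g. from GPT-4.
retrieved_labels = [1, 0, 1, 1, 0]
print(precision_at_k(retrieved_labels, k=3))  # 0.67
print(ndcg_at_k(retrieved_labels, k=3))
```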

SallyAnn DeLucia: Yeah, exactly. And I totally agree. I do like the binary approach, just because it’s operating in a world I’m more comfortable with; I’ve used those metrics for years and I understand how to interpret them. But when we’re talking about this new world of, what is it, eight different new metrics essentially that they’re looking at here, to your point, how do I interpret those? How do you feel confident in them? I feel like there’s just more room for error when you give the model a scale to evaluate something on.

Amber Roberts: And for the point you mentioned about Details being a count of the number of tokens, it almost contradicts what we think about high token counts. In current LLM research, a higher token count means higher latency and slower models, and then there’s the paper reading we’ve done on Lost in the Middle, about the loss of contextual information the more words you have. So more words aren’t directly related to relevance, and sometimes they’re inversely related to how relevant something is; it’s just a very long-winded answer.

SallyAnn DeLucia: Yeah, totally agree, that one definitely caught me by surprise. So, going through the rest: they do some answer evaluations with their own set of criteria. They look at coherence, relevance, and groundedness, and they evaluate the model against those evaluations with guidelines, so they prompt using guidelines. They also did some experiments to see the difference in quality of QA pairs from GPT-3, GPT-3.5, and GPT-4 using different context setups; I believe it was no context, context, and then external context. So it was interesting to see them do all those variations.

But again, going back to where we started our conversation: it wasn’t the experiments I expected to see. There were a lot of separate experiments covered in one paper, so there was a lot to go through, and in the end the conclusion didn’t really give me any new information I didn’t already know. It essentially said: look, RAG improves accuracy, it’s effective for tasks where data is contextually relevant, and there’s that low initial embedding cost, but really it’s for in-context information. And then for fine-tuning they say it gives precise output, which goes back to what I said in the beginning about fine-tuning being for the style of an LLM’s output, and it’s effective for specific tasks but has a high cost. So all the stuff at the end, I don’t feel like any of it was new; it was a reiteration of what we already know about these approaches. But that’s my take on the paper.

Amber Roberts: Yeah, I thought it was interesting, because they’re using some different prompts, but they’re not experimenting with the prompt engineering aspect, especially when we’ve seen how much that really improves the performance of your RAG system, and of your fine-tuned model if you have one for that capability.

SallyAnn DeLucia: I’d be curious what your thoughts are on this. After reading this paper, I feel like it was very focused on perhaps a synthetic data generation use case. My first thought was: oh, we’re going to make a useful copilot or something for people in agriculture, farmers in a specific region who have a question that might not be easily found, where they might have to search through textbook after textbook to find it, something like that. And what it really feels like is that they’re experimenting with RAG and fine-tuning for generating these question-answer pairs.

So I thought that was kind of interesting. We’re seeing synthetic data generation come up more and more. But I would have liked it to focus a little bit more on the quality of the dataset. There’s another paper out of Microsoft that we covered about two weeks ago, “Textbooks Are All You Need,” and that concept of using super high quality data for fine-tuning is super important. I would have loved to see them dive into that. They talk about how they use specific data from agencies in different countries and why they chose them, but I would have loved them to drill into that a little bit more and actually see whether those agencies gave high quality information.

Amber Roberts: More of what I’m seeing now are ways of trying to leverage agents, trying to use smaller models, trying to essentially troubleshoot RAG on the different pain points, like this Towards Data Science piece that came out recently, actually implementing it in production and seeing what goes wrong there, rather than the more traditional research question of whether fine-tuning would be better than search and retrieval techniques. It is very interesting to see how different a RAG piece from Microsoft can be, because they have so many research teams covering it, and to see the different approaches being taken. I feel like the team that wrote this paper is maybe more of a machine learning team that wanted to dive in and take a different view on how to evaluate these models, as opposed to what we’ve been seeing for fine-tuning and RAG. I don’t think I’ve seen a fine-tuning and RAG comparison where they don’t talk about the prompt engineering aspects or the quality of the data. I also think, to your point about the conclusion, I didn’t really get a conclusion to the agricultural use case they put in here. If I were someone in agricultural studies trying to see how to apply large language models or RAG in my work, I don’t think this would have been the right paper for me if I were working in AI and agriculture.

SallyAnn DeLucia: Yeah, they kind of just said the basic thing we see all the time: “it’s promising.” So we don’t really get anything conclusive about that. And yeah, it does seem like the premise works, right? You can use it to integrate these high quality, or somewhat high quality, QA pairs. But it really wasn’t that conclusive on the specific use case; it just said there are promising results there. And to your point about it being different groups, I am actually surprised they didn’t borrow from those other Microsoft groups and what they’re working on, like the fact that they didn’t even include Phi-2 or Phi-1.5 in the model evaluations. I was kind of sad to see that they only used the truly large language models, because from what I’m seeing, these small language models seem like they can pack a punch and need so much less data. So it would have been even cooler for them to look into how RAG works with those smaller models, and fine-tuning with smaller datasets.

I just think there’s a step further they could have taken. And they focused a lot on the data here, even though that wasn’t the whole premise.

Amber Roberts: Yeah. And I’m surprised, with the evaluation metrics Microsoft is putting out, that this wasn’t their standard for evaluations. But it’s interesting; maybe we can get in contact with the authors and ask some questions about it. I also think, whenever there are big topics coming out, and RAG has been such a large topic over the past few months, what we were talking about right when this discussion happened was how customers and teams are saying they want to use large language models but don’t know how or where they would fit, and RAG is a good use case for them because they just have to provide a certain amount of information, a certain amount of data, and then they can use these systems. So I think part of it is a bit of the hype around RAG systems.

And maybe now, for the last bit of time, we can have a discussion on the future of RAG and of these systems, and how things are starting to evolve, because the amount of RAG coming out, and all the RAG we’re seeing, is almost overwhelming.

SallyAnn DeLucia: It is crazy. I feel like at first, LLMs came out and everybody was just so caught up on the tech and all the new things it could do. Then it came to the point where, okay, we need to be able to adjust these LLMs in some way, and I feel like RAG just took over; nobody cared about fine-tuning. And now I feel like fine-tuning is slowly creeping up again to become maybe the dominant choice for these tasks.

So it’s really interesting. My personal take, with all the research going into these small language models, kind of bringing back the whole garbage in, garbage out concept of using only high quality datasets, is that fine-tuning could really be the future. Because, and I think you mentioned this earlier too, RAG is really only for those QA use cases. For a majority of other tasks, it’s not going to work in the way people might expect it to.

Amber Roberts: Yeah. Because when you think, oh, you want to implement a RAG system, the first thing is that you don’t have your own language model for your various tasks. You’re going to need whichever large language model you want to use, and you’re going to need the budget around that, because RAG is for performance optimization, not for when you have budget and computational constraints. So it has to fit that use case, fit that budget, fit a lot of these aspects.

SallyAnn DeLucia: Yeah, you brought up an interesting point there. They mention in the paper that there’s a low initial cost for embeddings, but the long-term cost of RAG is not small. I think that’s something that’s deceiving a lot of the time when RAG is being discussed; people think it’s the cost-effective option and fine-tuning is so expensive. But again, going back to the small language models and the efficient ways to fine-tune, like LoRA and quantization, I feel like there are a million ways to do efficient fine-tuning these days. So I think it’s deceiving to say that RAG is the cheaper option.

Amber Roberts: And when I go to conferences and I ask people what their large language model use case is, it’s still a lot of internal use cases: doing structured data extraction, doing anything where OCR is involved, doing entity recognition. So the use cases haven’t really changed for what people need language models to do, and a lot of them have just been building up BERT and some of the other traditional machine learning methods they’ve used. I think we’ll see more potential use cases for LLMs and maybe less of RAG. I mean, you’re talking about these smaller models, and we’re seeing better development around open source language models. You only need RAG because you don’t have access to these models and can’t train them yourself, but that can change moving forward, and then you wouldn’t need a RAG-based system.

And then the other part that I think is really interesting is the storage of this data, because, as we’ve talked about, there is an expense with RAG that a lot of people don’t focus on, because they focus on the performance aspect.

But the storage and computation of embeddings, when all these teams’ models are tabular, and have been tabular for a long time, that’s a huge undertaking. A lot of teams are starting to do that; they’re starting to incorporate vector databases. But you work with a lot of customers, SallyAnn: does it feel overwhelming for them? They want the results, but there’s everything that’s required in terms of infrastructure to get there.

SallyAnn DeLucia: Yeah, it’s interesting. There’s a lot more to it than teams initially think when they’re starting on RAG. But another thing worth noting is that a lot of the cases you do see are those traditional natural language processing tasks that were traditionally done with smaller foundation models like BERT, and now they’re being transitioned over to these larger language models. So I think that’s worth noting. And then there’s a lot of infrastructure required, and you have to choose, right? Okay, which database are you going to go with? There’s a lot that goes into setting up a RAG system; it’s not magic. There’s a lot of tweaking and getting it just right to even perform in the way that you want it to. So that, I think, is the downfall of RAG, too.

Amber Roberts: Yeah. And then we got a question from Johnny from our community. Does fine-tuning involve more initial upfront costs compared to RAG?

SallyAnn DeLucia: It depends on a lot of things. What model are you fine-tuning? How much data? What resources are you going to use? I would say, if you really want a very simple answer: yes, it’s going to be a little bit more expensive upfront, but it doesn’t have to be.

Amber Roberts: And that’s why we typically say, once you have a pipeline, if you are using a large language model, experiment with prompt templates. We have a prompt playground in Arize for experimenting with the wording and the messaging you’re giving your large language model; your large language model is like an assistant. So if you are actually returning relevant content and your content metrics look good, and if you have any questions on how to tell the difference or how to evaluate it, those are questions SallyAnn and I answer all day long, and we’re happy to answer them in community. But essentially, if you’re getting bad responses while the retrievals are decent, the documents are there and you’re supplying decent documents, then just playing around with the prompt template and doing prompt engineering is going to be the most cost-effective solution.
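
A minimal sketch of what that kind of prompt-template experimentation looks like done programmatically (the Prompt Playground does this interactively); the templates, model choice, and wording below are hypothetical:

```python
from openai import OpenAI

client = OpenAI()

# Two hypothetical templates for the same RAG generation step; only the wording differs.
TEMPLATES = {
    "terse": "Answer the question using only the documents.\n\nDocs:\n{context}\n\nQ: {query}",
    "guided": (
        "You are a helpful assistant. Read the documents carefully and answer the question "
        "using only information found in them. If the answer is not in the documents, say so."
        "\n\nDocuments:\n{context}\n\nQuestion: {query}"
    ),
}

def compare_templates(query: str, context: str) -> dict[str, str]:
    """Run the same query and retrieved context through each template so the
    responses can be evaluated side by side."""
    responses = {}
    for name, template in TEMPLATES.items():
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": template.format(context=context, query=query)}],
        )
        responses[name] = resp.choices[0].message.content
    return responses
```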

Yeah, we tend to say fine-tuning is very expensive. And, going back to RAG, it does depend on whether you already have that data ready, because obviously a huge cost is going to be around creating that data, making it available, and annotating it if you have to. So there’s a lot there. And then you have LLMs evaluating LLMs, and so on and so forth, with a lot of cost around that. But I haven’t really seen a paper come out on the most cost-effective solutions, on what’s the cheapest way to do it, only on the best performing, based on leaderboards and everything.

SallyAnn DeLucia: Yeah, I don’t think anybody’s ever said, this is the way to do it for cost effectiveness. But when you read the papers, if they are talking about fine-tuning, they will always highlight the fact that it will save you some money, which is good to see and helpful. But yeah, definitely make sure that you actually need to fine-tune, which I think is what you’re getting at there, Amber.

Amber Roberts: Awesome. Well, thanks everyone for attending. I think that was the rest of the community questions. Again, if you have any questions on RAG, fine-tuning, prompt engineering, or evaluations, we are in the Arize community Slack, and Sarah just sent out the link to that. So if you haven’t already joined, please join. And this recording will also be available. Thanks, everyone.

SallyAnn DeLucia: Thanks, everyone.