Amber Roberts and Sally-Ann DeLucia

Large Content And Behavior Models to Understand, Simulate, and Optimize Content and Behavior.

Sarah Welsh

Contributor

Introduction

Amber Roberts and Sally-Ann DeLucia discuss “Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior.” This paper highlights that while LLMs have great generalization capabilities, they struggle to effectively predict and optimize communication to get the desired receiver behavior. We’ll explore whether this might be because of a lack of “behavior tokens” in LLM training corpora and how Large Content and Behavior Models (LCBMs) might help to solve this issue.

Main Takeaways

  • Explores the parallels and differences between how communication is structured in information theory vs. in generative AI
  • Shows the value of including behavior data in pre-training for LLMs
  • Highlights LLMs’ ability to analyze how content affects viewer behavior, offering potential in content generation but also raising ethical and social concerns

Transcript

Amber Roberts, ML Growth Lead, Arize AI: Alright, well, we’ll go ahead and get started. I’m guessing more people will be trickling in. And, I’ll plug events at the end. I’ll also plug them now so we can just get that link in the chat, cause there’ll be more paper readings, more upcoming events. So, if you are interested, please sign up for those. We do a lot of events here at Arize. 

So, welcome everyone. Today, Sally-Ann and I will be talking about “Large Content and Behavior Models to Understand, Simulate, and Optimize Content and Behavior.” It’s a little bit of a mouthful, but you might see that behavior is mentioned a few times in that sentence, and this is really about content, behavior, and understanding for large language models. So I’ll get into the initial premise of the paper.

So, Claude Shannon and his seminal papers on information theory. Claude Shannon, the founder of information theory and a cryptographer, laid down a lot of the foundations for AI and the development of models and communication systems. He divided communication into three essential levels: the technical, the semantic, and the effectiveness level. The technical level is essentially the technical barrier, just being able to receive transmissions. Now, this was in the time before the internet, when he proposed these three levels. So by having the web and, you know, various forms of communication, we do get the technical level of communication through.

The semantic and effectiveness levels: for the folks in this paper who are using this theory to try to evaluate, essentially, the effectiveness of LLMs, they’re saying the semantic part is largely solved by the understanding that LLMs can produce. But the effectiveness level means the impact you’re having as a user, how you’re feeling about the information. For example, if you’re engaging with a chatbot, are you enjoying your experience? Is it effective? Are you getting the answers that you need? They’re saying the effectiveness is a little muddled; it’s hard to measure the effectiveness of these LLMs and what they’re doing. And what they propose might be a reason the effectiveness is at least hard to measure is a lack of what they’re calling “behavior tokens.” So if there’s underperformance for your LLMs, they’re saying it could be a lack of these behavior tokens.

Now, they’re defining these behavior tokens as shares, likes, clicks, purchases, etc. But I think for folks who come from industry, as data scientists and machine learning engineers, we call these behavior tokens feedback. Like highly sought-after user feedback: are you giving a product five stars? Are you liking an ad you’re seeing? Are you clicking on that ad? Are you adding an item to cart? Are you clicking to purchase that item?

So these behavior tokens: reading this paper coming from AI training, for us, this is user feedback that’s really highly sought after, but is often sparse and hard to come by. So it would make sense for a lack of behavior tokens to be a reason why LLMs might not perform optimally. Like, if they don’t have the feedback for a given task, it can be hard to do any personalization or produce any kind of customized result for users.

Now, what they claim, though, is there’s this issue of noise. So in information theory, you have the information source, you have the transmitter, you have a signal that goes out, and then you receive the signal. So you’ll hear a “receiver” referred to a few times in this, and between the transmitter and receiver you have noise coming into the system. And you essentially always want to reduce noise. Now, the paper is calling these behavior tokens noise that gets filtered out in the training process, and it would be great to have someone from this team offer feedback on exactly why they’re treating behavior tokens as noise.

For Sally-Ann and me, who work with a lot of clients in the space, in these large organizations, and with teams that work on e-commerce sites, for example, it’s hard to understand why behavior tokens, why user feedback, would be considered noise. But they’re saying it’s considered noise, and that’s why behavior tokens just aren’t present in these pre-trained LLMs. So essentially, LLMs are very generalizable. They don’t have this user feedback of likes, shares, purchases, and so the authors create what’s called Large Content and Behavior Models, LCBMs. So kind of compact, specific LLMs for a given use case; they create these to show particular behavior and better effectiveness, or what they’re calling effectiveness, and the four benchmarks that they’re using are behavior simulation, content simulation, content understanding, and behavior understanding.

These are areas where teams in industry pay a lot of money to collect this kind of feedback from users. For behavior simulation, for example, the question they have here is: I have two ad variants, A and B; which one will the user like?

And it’s hard if you just ask ChatGPT: I have these two variants, which one will the user like? It’s going to be difficult for it to give that information if it didn’t have any of these behavior tokens or any user feedback to give an indication of, oh, variant A is better than variant B.

Then you have content simulation and content understanding. For content understanding, if you ask, what is the color of the wall behind the man in a video clip, for example, it needs to be able to understand the content that it’s seen, and then it needs to be able to understand behavior as well.

So for example, what they have here in the behavior understanding part: if 100,000 people viewed this food ad, why did only 500 people click on it? So being able to take in a clip, understand what it’s seen, and then infer behavior from it, to try to get a better understanding of: well, you showed this ad at night, and generally people prefer to eat healthier in the morning, so showing it at a different time of day might have done better. So different behavioral situations. Those are the four main benchmarks.
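To make those four benchmark tasks a bit more concrete, here is a minimal sketch of how each one might be framed as a prompt. The wording and structure below are illustrative assumptions, not the paper’s actual benchmark format.

```python
# Illustrative prompt framings for the four LCBM benchmark tasks.
# These are assumptions for the sake of example, not the paper's exact wording.
benchmark_prompts = {
    "behavior_simulation": (
        "Here are two ad variants, A and B, described below. "
        "Which variant is a typical user more likely to click on, and why?"
    ),
    "content_simulation": (
        "We want viewers to click through at roughly a 5% rate. "
        "Write a short ad description likely to produce that behavior."
    ),
    "content_understanding": (
        "In the attached video clip, what is the color of the wall behind the man?"
    ),
    "behavior_understanding": (
        "100,000 people viewed this food ad, but only 500 clicked on it. "
        "Based on the ad content and when it was shown, why might that be?"
    ),
}

for task, prompt in benchmark_prompts.items():
    print(f"{task}:\n  {prompt}\n")
```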

Sally-Ann, are there additional charts or graphics that you think would be helpful for folks to further understand the concepts they’re trying to talk about in this paper?

Sally-Ann DeLucia, ML Solutions Engineer, Arize AI: Yes, there are a few visuals and some points that I think really add color to the process they went through to observe the effects of adding in this behavior data.

One thing to note is in the abstract, and in some of the things that Amber was mentioning: the first part is just kind of expressing why LLMs don’t work now for these kinds of behavior-oriented tasks.

And so what they’re trying to do now is observe what happens when you add in this behavior data that’s often considered noise when we’re pre-training these models to do, you know, their generalization tasks. And so this first figure here just highlights that the performance of this new LCBM model is better than our traditional models like GPT-3.5 or GPT-4. So just highlighting there that they did find results that were better.

But let’s go down to this visual here, because I think this really highlights how that performance comes into play. And so here we have an example of a frame. We’ll get into how they’re actually passing this information to their LCBM in a little while, but this is just a high-level overview of the performance. So, you can see here we have this frame. We have some comments that were given with this frame from the video.

As well as some metadata, and then we’re asking the question: would the average sentiment of the audience comments be positive, neutral, or negative, and explain why.

And so you can see the various responses we got from different LLMs. Their model here, you can see, was able to identify the negative sentiment attached to the audience comments, and it was able to give a really good explanation for why. It was really able to hit on the fact that they feel sympathy toward the man’s situation, and if you read through the comments, that’s exactly what they’re commenting on, right? They feel bad, they feel sorry for him, they think that he was desperate.

So it really is able to pick up on that using this new model architecture. And when we look at Vicuna and GPT-3.5, you can see that maybe they get some of it: Vicuna was able to pick up on the sympathy, but it wasn’t able to pinpoint that the comments were negative rather than both positive and negative. And kind of similar to that, GPT-3.5 really didn’t give a very good answer. It didn’t really pick up on whether the sentiment was positive or negative; it thought the video was exciting or thrilling. So it was just overall wrong. And so we can see how this LCBM is coming out on top in this situation.

Very similarly here, except with a little bit better results by the others. Again, it’s the same information, and now we’re asking basically the same question. And again, LCBM was able to get the positive sentiment and the reason why, and the others struggled. It looks like GPT-3.5 did better here, so that’s positive news, but still not as good as the LCBM model.

So, those are the results. We can see that it’s coming out on top with their new process, so let’s get into a little bit of how they’re solving this effectiveness problem.

I want to point out the fact that we’re doing this step of verbalization, which is how we’re actually passing this information to the LLM. So we’re taking videos, and for every video V, we’re taking a hundred retention values and spreading them over the duration of the video. So it doesn’t really matter how long the video is, it’s always going to have a hundred evenly spaced retention values, and each of these has a replay value associated with it, which signifies how often that part was replayed. So that’s important too for understanding the behavior of the video that the LLM is going to be analyzing.
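To picture what that retention verbalization might look like in practice, here is a minimal sketch, assuming the raw audience-retention curve arrives as a list of percentages of arbitrary length. The resampling approach and the sentence format are assumptions for illustration, not the paper’s actual code.

```python
import numpy as np

def verbalize_retention(raw_retention, n_points=100):
    """Resample an audience-retention curve to n_points evenly spaced values
    and turn it into a short text snippet an LLM can read.

    raw_retention: retention percentages sampled over the video,
    in chronological order (any length).
    """
    raw = np.asarray(raw_retention, dtype=float)
    # Interpolate onto a fixed grid so every video is described with the same
    # number of values, no matter how long it is.
    target_positions = np.linspace(0.0, 1.0, n_points)
    source_positions = np.linspace(0.0, 1.0, len(raw))
    resampled = np.interp(target_positions, source_positions, raw)
    values = ", ".join(f"{v:.0f}" for v in resampled)
    return (f"Audience retention at {n_points} evenly spaced points "
            f"of the video (percent): {values}.")

# Example with made-up numbers:
print(verbalize_retention([100, 92, 85, 80, 76, 70, 66, 64, 60, 58]))
```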

They also then select two random frames to send, and they caption these with BLIP; they also use Whisper to transcribe the audio, and then finally they encode the video frames with a CLIP model, which is really important. We need those embeddings; that’s the information these LLMs can really make sense of, so that’s a super important one.
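For readers who want to see roughly what that preprocessing could look like, here is a minimal sketch using off-the-shelf BLIP, Whisper, and CLIP checkpoints. The file names, checkpoint choices, and frame selection are illustrative assumptions; the paper’s actual pipeline, including how the frame embeddings are fed into the LLM, may differ.

```python
# Rough sketch of the preprocessing described above, using off-the-shelf
# checkpoints. File names and model choices are illustrative assumptions.
import torch
import whisper                                   # pip install openai-whisper
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)

frame = Image.open("frame_042.jpg")              # one randomly sampled frame

# 1) Caption the frame with BLIP.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
caption_ids = blip_model.generate(**blip_processor(images=frame, return_tensors="pt"))
caption = blip_processor.decode(caption_ids[0], skip_special_tokens=True)

# 2) Transcribe the video's audio track with Whisper.
asr = whisper.load_model("base")
transcript = asr.transcribe("video_audio.mp3")["text"]

# 3) Encode the frame with CLIP to get an image embedding.
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    frame_embedding = clip_model.get_image_features(
        **clip_processor(images=frame, return_tensors="pt"))

print(caption)
print(transcript[:200])
print(frame_embedding.shape)   # torch.Size([1, 512]) for this checkpoint
```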

And then they’re going to verbalize that behavior data, and we’ll go down to the example so you can see exactly what that looks like. So here is where we can see some of that behavior data. This is a channel: they’re using videos from YouTube, so they take in the channel information as well. We have a group of writers from Adobe, so you’re going to see Adobe in here; that’s the channel they’ve chosen to use. And so they’ll send the title and how many subscribers, but they also show how many times the video was viewed and liked. So that’s adding that behavior data to the model itself.

And so kind of scrolling down to what this all looks like in the architecture…

So this looks a little alarming. You think this is a huge model, right? But it’s actually kind of a set of smaller models, and the LCBM itself is smaller. We don’t know the exact architecture or what they’ve done there based on the paper, but we do know that they claim it’s much, much smaller than GPT-3.5 and GPT-4. But the thing I want to really point out here is that all these beginning steps are essentially just to format and verbalize our data so that we can send it to our LLM. We can’t just send in the frame and the video; we need this all to be verbalized. So it’s kind of an elaborate prompt engineering workflow, if you can think about it that way, where we’re taking those different types of input, putting them in our prompt template, in a way, and then passing it to our LLM to ask a question like before: why is the wall blue, or why did they make that comment?
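Putting those pieces together, that “elaborate prompt engineering workflow” can be pictured as filling a text template with the verbalized channel metadata, behavior numbers, frame captions, transcript, and retention values before asking the question. Here is a minimal sketch; the field names, example values, and wording are illustrative assumptions rather than the paper’s exact template.

```python
def build_lcbm_style_prompt(channel, subscribers, title, views, likes,
                            frame_captions, transcript_snippet,
                            retention_text, question):
    """Assemble verbalized content and behavior information into one prompt.
    The template is an illustrative assumption, not the paper's exact format."""
    caption_lines = "\n".join(
        f"- Frame {i + 1}: {c}" for i, c in enumerate(frame_captions))
    return (
        f"The channel '{channel}' has {subscribers:,} subscribers.\n"
        f"It posted a video titled '{title}', which received {views:,} views "
        f"and {likes:,} likes.\n"
        f"Captions of sampled frames:\n{caption_lines}\n"
        f"Transcript excerpt: {transcript_snippet}\n"
        f"{retention_text}\n\n"
        f"Question: {question}"
    )

# Example usage with made-up values:
print(build_lcbm_style_prompt(
    channel="Adobe",
    subscribers=100_000,
    title="Example product walkthrough",
    views=250_000,
    likes=4_000,
    frame_captions=["a person editing a photo on a laptop",
                    "a close-up of a software interface"],
    transcript_snippet="In this video we show how to...",
    retention_text="Audience retention at 100 evenly spaced points of the "
                   "video (percent): ...",
    question="Would the average sentiment of the audience comments be "
             "positive, neutral, or negative, and why?",
))
```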

So that, I think, is the overall process of what they’ve done. And then we observed the results, and essentially what we found, or what they found, was that this change in process, this change in how we’re sending the data to this content-specific model, has better results than if you were just to ask those questions of our standard foundation LLMs.

So that’s kind of the whole premise. I think it would be good to maybe circle back now that we have the understanding of what they did. Let’s chat about this diagram here in comparison to the Claude Shannon diagram. So I think it’s a really interesting research perspective. 

We see a lot of these papers coming out that are very focused on how to improve performance, and this one here is a little bit more abstract, in my opinion. They’re saying: let’s study LLMs under the umbrella of communication and information science and see how that relates.

And so, as it notes at the end, there are many traditional ML applications that are doing this kind of behavior prediction. You think of your recommender systems, a lot of things in e-commerce; that is all essentially predicting behavior for a desired outcome, except here they’re trying to apply it to LLMs.

And so do you want to talk through how this diagram relates to the other one?

Amber Roberts: Yeah, yeah. And in my mind, too, this is almost information theory come full circle. When it was initially proposed, obviously none of this existed; it was the foundational theory used to build all of it. And now the folks who laid out the groundwork for this paper decided to use it almost as a way to evaluate effective communication for these large language models, which has obviously been a very big topic. How do we know that these models are performing well when they are stochastic models? Each time you run them, you might get something slightly different. How do you make sure that the user is satisfied with them? And so they have a communicator and a receiver in this image. It looks like they don’t put in the noise aspect that they do in the Claude Shannon diagram, but you still have the initial information going out, the place where the information is received, and then almost the interpretation of that information that was received.

And for this example, I think it makes a lot of sense. And probably what we’re gonna chat more about is how, most of the time, when models like this are created, all that noise they’re referring to is not the same kind of noise as in information theory. It’s incredibly important, and it’s almost the key thing that folks want from those models.

So when you’re looking at likes, views, scene replays, all of those do have predictions, or tend to have predictions, from historic data and from those behavior tokens, or what we would call “performance metrics” for that user feedback. So I think it’s really interesting to draw the parallels to information theory, because those parallels helped create the technology we’re using today, which has had, you know, an exponential impact on communication globally and put education into the hands of people all around the world.

So, I do really like the parallels, and almost using information theory as a way to evaluate the effectiveness of LLM performance.

Sally-Ann DeLucia: Yeah, it’s really an interesting view, like I mentioned before, just from the research perspective. For me, when I read through this, I boiled it down to what the goal was, and I think the goal was to identify whether or not the behavior data had meaning in it for LLM-specific behavior. And I think what they proved is: yes, there is meaning in there.

And I think what’s interesting is what this means for the future. It’s almost like a POC for these behavior-oriented ML cases: can we actually apply LLMs here instead of the traditional methods we’re using? And so I think now it’s going to need to be optimized.

You and I read that paper a few weeks ago on long context, and I look at this prompt here that we’re sending, and that’s a lot of information.

And I wonder like we saw good results, right? Like they saw that there was improved performance here, but then I start to wonder should we maybe experiment with how we’re sending in this context? Like is it too much? Is some of it getting lost in the middle? Like we saw in that paper.

And so I would be interested to see, if they introduce some of these other optimization techniques rather than just focusing on this one new model, what other potential benefits we would see. And I’ll also be interested to see if future pre-trained LLMs will start including this behavior data as part of the generalization pre-training task.

Amber Roberts: So do you feel that this model is almost just a use-case-specific model, like the use-case-specific models that other teams in industry are also building with this kind of prompt engineering?

Sally-Ann DeLucia: I do think that this is probably very use case specific. I would love to see what’s under the hood with this LCBM model.

But I think it will be very specific to behavior prediction tasks. Like if you want to be able to predict how a user will react or behave with an email, which was an example they gave. Like if you were in marketing and you’re sending out an email, you want to use the one that’s maybe going to lead to the most clicks on the URL to then get them to your website.

And so that could be an application where I could see you wanting to use this. But I can’t see this model having much better performance, if it’s better at all, on more of the generalized tasks: your generative, your chatbot, those kinds of use cases. I think there’s probably too much information in there that just isn’t relevant, if it requires all that behavior data and it’s going to be fine-tuned specifically through the lens of predicting behavior. So I think it is very use case specific.

Amber Roberts: And it’s interesting, when we look at the final thing we’re seeing here, when the large content and behavior model spits out its answer, it says, you know, the replay values, this video would have this many views. And so it’s giving predictions based on the inputs and outputs that it’s previously seen, all that ground truth data coming in from their behavior tokens.

When we initially were reading this paper, we noticed that there’s kind of no talk about RAG, which is being used a lot by teams. There was a paper reading on RAG, retrieval augmented generation. Do you think, and I know I’d be interested, do you think that RAG would have similar performance? The way teams can take certain relevant content, use that relevant content, and then use an LLM to kind of generate that response. What are your thoughts there?

Sally-Ann DeLucia: Yeah, I mean, as soon as I read this paper, I was thinking: is there an easier way? To me it seems like a lot of work, right? We find the videos, we have to encode them, we have to train on them, we then have to, you know, verbalize all of the behavior data. So I was thinking the same thing: could this be a different kind of use case? I’m not convinced RAG would be appropriate here, and I’ll tell you why: it’s because of the format of this behavior data that they’re looking for, because they’re attaching one specific instance to its very specific behavior, whereas with a RAG system, you know, we would need to use similarity, right, to retrieve those relevant pieces of information. So it comes down to how we’re formatting that in our vector database. I think it’s possible; we’ve seen some amazing work be done with RAG. What I would do, or what I would be curious to do, is take this dataset of all this behavioral data and use it to generalize, see if it can come up with some patterns around that, and then maybe we vectorize that and use it to power your RAG. But I think there would need to be some generalization and figuring out what the patterns are with this behavior data before it could be used in that kind of system.
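As a very rough illustration of the direction Sally-Ann describes, one could imagine verbalizing historical behavior records, embedding them, and retrieving the most similar ones to ground a prompt. Everything below, including the record format and the embedding model choice, is an assumption for the sake of the sketch, not something the paper proposes.

```python
# Hypothetical sketch: retrieve similar, already-verbalized behavior records
# by embedding similarity. Record format and model choice are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

records = [
    "Ad shown in the morning, healthy breakfast product, 100,000 views, 4,000 clicks.",
    "Ad shown at night, dessert product, 80,000 views, 6,500 clicks.",
    "Ad shown in the morning, coffee product, 120,000 views, 9,000 clicks.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
record_vectors = encoder.encode(records, normalize_embeddings=True)

def retrieve_similar(query, k=2):
    """Return the k behavior records most similar to the query (cosine similarity)."""
    query_vector = encoder.encode([query], normalize_embeddings=True)[0]
    scores = record_vectors @ query_vector  # cosine, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [records[i] for i in top]

# The retrieved records could then be pasted into an LLM prompt as grounding context.
print(retrieve_similar("A food ad shown at night: how many of 100,000 viewers will click?"))
```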

Amber Roberts: Okay, yeah, very, very interesting. Because with the behavior tokens, when they’re talking about likes, purchases, everything that, you know, companies are essentially trying to get and trying to monitor in production. You said: is there a simpler way to do this? Or, you know, just wanting to ask the writers of this paper if they’ve considered other methods. What would you like to ask them, in terms of, you know, did they consider this option, or what would you like to see in the future of the LCBM model?

Sally-Ann DeLucia: Yeah, I think the big lingering question I had while reading this is kind of that “so what?” kind of question. I know their motivation was basically to observe the impact of including this behavior data with an LLM, and I just would love to understand what’s beyond that: where do they see applications for this, and what their real vision for this kind of technology would be. Because they even acknowledge themselves that we’re still really far away from solving the effectiveness problem at the various communication levels. So that would be my number one question: what do you envision, and also, is it purely experimental then, or are we really trying to kind of reinvent the wheel?

I’d love to ask you, Amber: we know that these systems exist and work really well in traditional ML applications. Do you think it’s “worth,” and that’s kind of a strong word, but do you think it’s worth bringing LLMs into this? Because the data itself is innately tabular, right? Like all this behavior data is tabular. So I’d be interested to hear your thoughts on that too.

Amber Roberts: Yeah, I mean, it’s interesting, especially because they’re talking about the size comparison as well. They’re comparing against the largest-parameter models here, the 100-billion-parameter models. I was kind of surprised that they didn’t use Claude, because this is Claude Shannon’s theory, and I’m curious about constitutional AI; I think it’s actually better at understanding behavior. So I would be curious to see that in a future comparison. But it’s almost like comparing apples to oranges when you ask, you know, would this fare well against a generalization task for language, which GPT-4 has been shown to be dominant for.

The thing we always say to customers is: it depends. It depends what you’re really looking to get out of your model, and it does depend on whether you have the set of behavior tokens. They’re using YouTube videos; I think the one they showed in the very first example had hundreds of thousands of views. So these are very well-seen videos and content with tangible feedback and engagement metrics they can already get from the video. Some teams are starting from scratch and don’t have that. Other teams have all that understanding, so they can essentially say, these are a lot of behavior tokens and we can use them, and they’ve already been using them to do this kind of prompt engineering to really focus on their use case.

So, is it fully worth it to implement it as an LCBM? I don’t know. But if anything, this paper does prove the effectiveness of prompt engineering for these large language models, in the same way that, you know, we reformat how we might talk about a given topic just because of the way words now play into the effectiveness and performance of these models. We’re saying, you know, the next language that everyone is learning for AI is English, and no longer just Python and frameworks.

Sally-Ann DeLucia: For sure. And I think it’s really interesting to see how this goes. I do think they also proved that, you know, there is value in including that behavior data in pre-training. I think that’s again the overall thesis: they’re saying, you know, these foundation models strip out the behavioral data because it’s noise to them, and we say it’s not noise, it’s important for predicting behavior. But again, it’s all through that behavior lens.

So it’ll be really, really interesting to see where this goes next. 

I was kind of thinking about this when you mentioned the datasets they use, the YouTube videos.

They use YouTube, and emails from essentially the Enron email dataset. So I would love to see some different datasets. They mentioned that unfortunately you have to rely a lot on simulated data in this kind of behavior prediction, which has its own challenges, but I’d love to see this across a variety of tasks, all underneath the behavior umbrella. But that’s a really good point. And on that kind of thought, I’d love to hear what you think a potential use case is. Say we figured this out, we unlocked an LLM that is able to predict behavior. Where do you see customers wanting to use this?

Amber Roberts: Yeah, I think if you could really develop that and it works really well, you would be able to make a lot of money, because many customers would want that. Essentially, in terms of user experience and user experience research, there’s a lot of money that goes into just trying to get feedback from user behavior. And every user is going to be a bit different, and there would be a lot of applications for this: every team that’s doing e-commerce, any team that really wants feedback. You know, there’s been a lot of work on chatbots recently, and I think that would only be the first stage of it. If you could predict whether a user is going to be happy with the response of a chatbot, I think that would be kind of the baseline use case. But if you’re Spotify and you could predict with high certainty whether a user would play a song, like a song, or skip a song, there are so many models being developed for that kind of user behavior. And if this is better at understanding effectiveness, and if we can actually use these models to communicate better with our end users, that would be huge.

And maybe it is tied into the effectiveness here. But yeah, just going back to what we’ve talked about: is it really noise? That’s the question we’ve been asking. And, you know, obviously those behavior tokens aren’t going to be present for specific use cases in these very large models like GPT-4, which is expected. But I would be curious to see how this performs against use-case-specific models. If they did one for a specific use case, like predicting whether or not a user will click on an ad, add an item to cart, or skip a song, I’d be really interested to see how that would compare.

Sally-Ann DeLucia: Yeah, and they mentioned a few times even the idea of eliciting a desired behavior, and that’s another interesting use case for this. Because what you were just talking about is more like: okay, given this data, what do you think the user is gonna do, right? And there’s also a side of this where it could potentially be more of a generative thing: we want to elicit this type of behavior, what content should we put forward? And that’s something that’s interesting, but I think it kind of starts approaching the social and ethical implications of this technology.

And so I definitely think it’s something that we should be mindful of as we continue forward in this area of research. We’ve seen AI be harnessed in that type of malicious context, changing people’s minds without them even realizing. So I think it’s something to definitely keep eyes on, but I’d love to get your thoughts on that. Do you think that’s something we have to pay attention to, or do you think that this, based on the paper of course, is more focused on, you know, predicting the probability of a behavior?

Amber Roberts: That’s really interesting, Sally-Ann, and that’s a good point. Because say you have a user’s every behavior and you know that they’re more likely to shop online at 3am, so you send them these coupons at odd hours. You know, the ethics behind it is a really good point, because we’re calling them performance metrics or behavior tokens, but these are how people act. These are their personal preferences and behaviors that we’re using to make predictions on.

And, you know, we’re measuring how accurate our predictions are, and it’s very easy to just extract the information we need and not think about the consumer as a person who, you know, maybe doesn’t want all their behavior correctly predicted.

I think we both read a lot of sci-fi, and just thinking of where that actually leads if someone could do all those predictions. If you watch Rick and Morty, it’s like the heist episode, just all these predictions. And maybe I’m getting a little on the tangent side, but I think, you know, just being able to predict users’ behavior and then kind of backtracking things so that everything’s set up for that particular user. There’s a difference between personalization, just trying to get the user what they actually want out of your site, and trying to optimize everything to just keep selling. And yeah, that’s a huge point on the ethics. I think those are things that honestly need to be debated at higher levels: what companies are allowed to do with all this user behavior and how to filter it into individual models. And I think Europe is a little bit ahead of us on some of those aspects.

Sally-Ann DeLucia: Just a little bit. I would totally agree that it’s interesting. I think there are a lot of people thinking about this, and I totally agree with you about that trade-off between personalization and really augmenting the way people behave. I think there’s a clear distinction, and it’ll be interesting to see. To be clear, nobody’s doing this currently, but with this kind of research going on, it was something my brain immediately thought about: okay, how does this interact with our social structures and the ethics of it all? So, something I’m personally really interested in is how we balance this, because I do believe there’s obviously so much good in this kind of technology, but I think it’s important that we always have that kind of angel and devil on the shoulder, being like, okay, this is really cool, but how do we make sure we’re not harming real people, you know?

Amber Roberts: So I had a friend that worked at Walmart Labs and she gave an example of showing that their prediction algorithms worked really well, but it wasn’t the desired effect.

So basically what happened, and this is a pretty well-known story within the AI team: a man called customer service, and it got back to the AI team. He was asking, why are you sending my daughter all these baby ads, all these ads for infant products and baby formula? Is it just because she’s a woman? Why are you sending all these?

And then he ended up calling back later and saying, actually, she was pregnant, we didn’t know, so sorry, I guess it was helpful. But based on her pattern of purchases, buying tests, prenatal vitamins, things like that, they had predicted that she was pregnant, and then all of a sudden these coupons go to the family’s house. And because the predictions are so good, it’s like, oh, she has all these things, she’s going to need these. It’s just one unforeseen consequence of correctly predicting behavior so well.

And that always stuck with me because you’re trying to help the user, but you also have to think about the ethics and that what the user needs and what they want might be two different things.

Sally-Ann DeLucia: For sure. I know we only have a few minutes left. There was one other thing I thought would be interesting for us to chat about, switching gears drastically here. This is just a little anecdotal thought that I had, and it’s about the fact that we do not 100% understand the human brain, how all these interactions happen, how we actually make meaning from information. Do you think that we need to fully, or at least better, understand that before we can expect ML to actually replicate communication here? Because that’s essentially what they’re doing.

Amber Roberts: Right. I mean, when you tell people you work in AI, they’re like, how close are we to AI taking over? And then we’re like, we can’t even figure out why someone is rage-clicking on a site, you know. So, my opinion is yes, we need a lot more understanding, but I can’t see us ever fully getting there. That’s my opinion on it. What are your thoughts, Sally-Ann?

Sally-Ann DeLucia: Yeah, totally, not in our lifetime. I don’t think we’ll ever fully understand it. I almost see a path where we’re pivoting a little bit. Like, we’ve talked about the attention mechanism in LLMs and how that’s causing issues. Before, it was really great because we were able to get that real NLP, that NLG, going, but now we’re seeing, okay, for these LLM applications it has some issues associated with the size and things like that. And so we’re rethinking the way that we do that. But that was all based on the way that humans pay attention, right? So it’s interesting. I wonder if, now that we have this kind of solid ML foundation under us, we start pivoting away from trying to replicate the human brain and go back to maybe a more mathematical or statistical basis for our models.

So it’ll be really interesting. I don’t have a clear cut answer, but I’m excited to see how these two fields grow in parallel to one another.

Amber Roberts: Yeah, I think the pivot is a good point, especially when you’re making a purchase online. This has happened to me before: I buy a main product, let’s say on Amazon, and don’t realize I need batteries or some extension, right? Another part of it that almost needs to be bundled. So yeah, I don’t think it needs to mirror the human brain; it almost just needs to jump a few steps ahead of that need versus want. Like, instead of just thinking about the next thing I want to buy in general, what might I need to go with this product, and what have previous users used? Less trying to focus just on me, and more on what general data tells us. Because sometimes it’s even a little too strange now how much they know my preferences.

Sally-Ann DeLucia: Yes, it can definitely be strange. I think that brings me to a point where the generalization of these models might have the edge over those traditional ML applications, where they’re predicting that specific, you know, next step of what they think you’ll need, whereas the generalization might address that problem you’re having of, like, okay, I’m buying something that needs batteries and I don’t have batteries, so therefore I should probably add batteries. That’s more of a general approach, instead of looking specifically at what you’ve purchased and what you might then want from there. So that’s a really interesting point.

Amber Roberts: Awesome. Well, maybe we can just close with our final thoughts on this. I think we both really liked the angle it took of going all the way back to information theory and the fundamentals of communication, AI, and technology, and then using that as a system to say, maybe we can learn from this and evaluate against it.

But we did have just a few thoughts on, you know, is this a completely novel method? And are these behavior tokens actually noise, or are they, you know, highly sought-after performance metrics?

I’ll plug Arize events one more time, because I know our marketing team wants to drop that link in here. We have these paper readings, and we have workshops coming up. Sally-Ann, any closing thoughts before we end the session?

Sally-Ann DeLucia: Not really. If you haven’t read this paper, I definitely encourage you to check it out. It’s a change of pace from some of the other LLM papers that have been coming out. It’s a different lens, different writing style. So I think it’s definitely worth the read and I’m super excited to see where they take this idea.