Why You Need To Understand These Five Papers Reshaping AI’s Future

In this session, Brian Burns — founder and writer of AI__Pub, a popular Twitter account and talent network covering technical topics and AI news — dives into the top five trends and technical papers published in 2023 that are reshaping the field.

This talk was originally delivered at Arize:Observe 2023, a conference on the intersection of large language models, generative AI, and machine learning observability in an era of LLMOps.

Brian Burns: Hi everyone. I'm here to present on three trends in AI research that I think are going to shape how products and startups evolve, and that I find pretty interesting. The three trends I have in mind are generative self-supervision, language models as base models and encoders in domains very different from language, and alternative language model architectures.

As an introduction, I'm Brian. I very recently dropped out of a machine learning PhD at the University of Washington in Seattle. I run a Twitter account called AI Pub, where we cover technical AI research topics.

I also run a podcast called Deep Papers with the founders of Arize, where we interview AI researchers. I also run a recruiting business where I recruit for AI startups; right now I'm mostly recruiting for a really cool legal tech company called Harvey. So, presentation overview…

I'm going to talk about three trends that I think are really interesting in AI research. For each trend, I'll cover one specific paper in that area, and then a couple of follow-up papers that I think are interesting. The last thing I want to say up front is a couple of caveats. One: I'm just some guy who reads machine learning papers. I don't have a comprehensive view of the landscape, and it's not even clear anyone has a comprehensive view of the landscape; there are just lots and lots of machine learning papers coming out, and a lot of different trends. These are just three trends that I find particularly interesting.

The other thing to say is that some of these trends, I think, are consensus among researchers. If you're reading papers all the time, I don't know that these are extremely novel insights. But for people who are building companies, or engineers who aren't exactly on the forefront of research, I think this talk might be insightful.

So, I appreciate you all joining. I'm just going to check my phone and the notifications in the chat to make sure everything is right and everyone can hear me. Okay, great. I'm not getting any comments in the private chat, so I think everything's all right.

So the first trend I want to talk about (I don't know if this is the official term, but it's what I call it) is "generative self-supervision." This is related to how you train these large machine learning models. Historically there are a couple of ways to do this. The first, especially in computer vision, is to train on a large human-labeled dataset: think something like ImageNet. With the advent of the foundation model paradigm and these really huge datasets, especially text datasets, human labeling at that scale is effectively impossible, so you've seen the rise of self-supervised training methods.

More recently, language models have gotten so good that you can actually use the language model to generate data to train itself. In a way, this allows you to really bootstrap language model performance. I think this is a super interesting trend and I want to talk about it for a bit.

The paper that I think exemplifies this trend really well came out just a couple of months ago; it's called Toolformer. The objective of the paper is to train language models to use external tools, like calculators or web search. There are a lot of really cool ideas in this paper.

The thing I want to focus on, which I think is actually the key idea, is collecting the dataset. How do you actually train these language models to use tools? The way you do it is by taking a plain text dataset, like Common Crawl, and enriching it with API annotations.

You can see this right here on this slide. What they've done is add annotations to a text dataset that tell the language model when it should call an external API or use an external tool, and then list the result of that API call. And here, I guess, is a clearer view of those API annotations.

Once you have a dataset of this sort, it's not too hard to train a language model on it and teach it to use tools. There are some engineering intricacies that I won't get into here, but the main obstacle is actually generating this dataset. So the question is: how do you generate this API-annotated dataset?

If you have a huge text corpus like Wikipedia, it's going to be far too expensive, both in time and money, to hire human annotators. So how do you do this? The really interesting idea from the paper is that you can actually use language models to generate the annotations themselves.

So you can take a huge dataset, like WebText or The Pile, and use language models to automatically annotate it, and then go with that. It's just a little more complicated than that, so I'll explain how it happens on the next slide.

So how do you do this? Step one: they used language models to sample a massive number of API calls and insert them at various points in the text. You can see this on the right. Basically, they used a prompt that specified how to use a question-answering API, and then had the language model insert API calls at various points in the text, just using few-shot prompting. What this does is annotate the text, but almost all of the annotations are bad: they produce garbage results, or they're just not in the right place. So the really interesting thing is that once they generated all of these API annotations, the authors filtered out something like 90 to 95% of them based on perplexity.

What they did is ask, according to the language model's own understanding, whether including the API call better explains the text that ensues, relative to leaving it out, relative to not including the API call. Using that filter, which again is internal to the language model (it's just based on the log probabilities of the language model itself), they're able to filter out the vast majority of these crappy API annotations. Once they've done that, they can fine-tune the language model on this enriched, annotated dataset.

And it turns out that doing that was sufficient to train a language model to use a variety of tools. I'm just going to check my email to make sure nothing's going wrong. Sweet, I think we're good. Just a few notes on this that I find very interesting. This method is extremely scalable: no humans needed.

It just uses the language model itself. It's also dataset-agnostic, so you can use it on any large text dataset, which is really interesting. So that's one trend that I find really cool: using language models to generate their own training data, bootstrapping language models off of themselves.
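To make the filtering step concrete, here's a minimal sketch of that kind of perplexity-based filter. All names here are illustrative, not the paper's actual code: `loss_fn` stands in for any callable that returns the language model's per-token negative log-likelihood of a continuation given a prefix, and the bracket notation for API calls is just a placeholder format.

```python
def filter_api_calls(candidates, loss_fn, tau=1.0):
    """Toolformer-style filtering sketch (hypothetical helper names).

    candidates: list of (prefix, api_call, api_result, continuation).
    loss_fn(prefix, continuation) -> the model's loss on `continuation`
    given `prefix`; lower means the prefix better explains what follows.

    Keep an API call only if inserting the call *and its result* lowers
    the loss on the ensuing text by at least `tau`, compared to the
    better of "no call at all" and "call without its result".
    """
    kept = []
    for prefix, call, result, continuation in candidates:
        loss_plain = loss_fn(prefix, continuation)
        loss_call_only = loss_fn(prefix + " [" + call + "]", continuation)
        loss_with_result = loss_fn(
            prefix + " [" + call + " -> " + result + "]", continuation
        )
        if min(loss_plain, loss_call_only) - loss_with_result >= tau:
            kept.append((prefix, call, result, continuation))
    return kept
```

The point of the three-way comparison is that a call is only useful if its *result* helps the model predict the following text; a call whose result doesn't help gets filtered out, which is how the vast majority of sampled annotations are discarded.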

Two other papers that I think are really cool in this domain touch on similar topics. There's one paper called Language Models Can Teach Themselves to Program Better. How that works is you give the model a set of about 150 programming puzzles, and then have it generate more programming puzzles for itself, along with solutions to those puzzles.

And then, kind of similar to the Toolformer paper, you use a Python interpreter to filter out the bad solutions to the programming puzzles, and you fine-tune on a dataset of solved programming puzzles that have been generated by the language model. In the interest of time, I'll make this really brief.
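The interpreter-as-filter idea can be sketched in a few lines. This is a simplification of the paper's setup, with assumed conventions: a "puzzle" is source code defining a checker `f(x) -> bool`, and a candidate solution is source code defining `g()` whose return value should satisfy `f`. A real pipeline would sandbox and time-limit the execution.

```python
def verify_solution(puzzle_src, solution_src):
    """Run a puzzle checker and a model-generated candidate solution in a
    scratch namespace; return True only if f(g()) holds."""
    ns = {}
    try:
        exec(puzzle_src, ns)    # defines the puzzle checker f(x) -> bool
        exec(solution_src, ns)  # defines the candidate solution g()
        return bool(ns["f"](ns["g"]()))
    except Exception:
        return False  # crashing or invalid solutions are filtered out


def filter_solved(pairs):
    """Keep only the (puzzle, solution) pairs the interpreter verifies;
    these become the fine-tuning dataset."""
    return [(p, s) for p, s in pairs if verify_solution(p, s)]
```

The interpreter plays the same role here that the perplexity filter plays in Toolformer: a cheap, automatic judge that separates good model-generated data from garbage, with no human in the loop.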

Another one, from Anthropic, is called Constitutional AI. It's kind of a variant of OpenAI's reinforcement learning from human feedback, but where almost all of the steps, instead of using human labelers, are done by language models themselves (there are also a couple of additional steps).

So it ends up being much more scalable and much less costly than doing all of the human labeling that's required for reinforcement learning from human feedback training of language models. Here, I'm actually going to pop out real quick to see if there are any questions before I go to trend two.


Right, just going to check my emails and see if there's anything here. Great. Okay, so the second trend I want to talk about (I don't know if this is the official term, but it's the term I use in my head) is using language models as base models or encoders in domains that are really, really different from language.

One way I think about language models is as feature extractors. Think about computer vision five or seven years ago, back when convolutional neural nets were a really big deal, before all the transformer stuff.

Think about the early layers of these convolutional neural nets. What are they picking up? They're basically feature extractors: they detect things like edges or dark spots. And then there are higher-level features; here I have an image of eyes, ears, and noses.

One way to think about language models is that they're basically the richest and most flexible feature extractors that we have. They've essentially done feature extraction on the entire internet. So some of these features, instead of being edges or ears or eyes, are facts about the world, low-level text features like grammar, or basic reasoning capabilities. We have these huge pools of features that are probably useful in domains other than language. So I want to talk quickly about some interesting applications in biology and chemistry, robotics, and AI agents.

The example paper I'm going to use for this trend, which I think is just super cool, is a paper that came out from Meta AI called ESMFold. The problem at hand is metagenomics: understanding the protein structures of microbes, meaning bacteria, viruses, and fungi.

There's a huge space of proteins, and from what I understand, people who study genetics understand the proteins that come from flora and fauna pretty well. But the term these genomics people use is that there's this huge "dark matter" of the protein universe, related to bacteria, viruses, and fungi, that's relatively unmapped and is much larger than the space of proteins we understand so far. The metagenomic proteins vastly outnumber the better-understood proteins that come from plants and animals. And the setup for this ESMFold paper is that you just need way faster techniques to predict protein structures from these enormous metagenomic databases.

So this is the setup for ESMFold, which, at least in terms of speed, is a state-of-the-art protein folding model that came out from Meta AI a few months ago. The really cool thing about ESMFold is the way they trained it: they actually just started with a language model.

What they did is train the language model on a masked sequence modeling task, where the sequences are text protein sequences. You can see right here, this is a string of letters, GSMDKKY…SIGLA. It's just a text string that describes the chemical structure of a protein via its amino acid sequence.

There's no more data than this; there's not a 3D structure involved. How they trained the language model was by blocking out some of the letters, that is, some of the amino acids in the sequence. So here they blocked out the H and the Y, and the task was masked sequence modeling: train the language model to fill in the blanks that were masked out.

This can be done at scale, because you can do it on an enormous database of protein sequences. So they trained a language model to do this very simple prediction; it's only predicting single letters at a time. But the really interesting thing is that in doing this training on this huge text database, the intermediate features in the language model, the intermediate layers, learned very substantial facts about chemistry and protein folding.

Because in order to succeed at this prediction task, you actually have to understand a lot about how these proteins relate to each other. So in this very clever way, they set up a chemical and protein feature extractor by training it on a pure language task.
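The data preparation described above can be sketched in a few lines. This is only an illustration of masked sequence modeling on a protein string; the mask token and masking fraction here are illustrative conventions, not ESM's actual vocabulary handling.

```python
import random

MASK = "<mask>"

def mask_sequence(seq, mask_frac=0.15, rng=None):
    """Hide a fraction of positions in a protein string behind a mask
    token; the model's training task is to recover the hidden letters.

    Returns (tokens, targets) where tokens is the masked sequence and
    targets maps each masked position back to its original amino acid."""
    rng = rng or random.Random(0)
    n_mask = max(1, int(len(seq) * mask_frac))
    idxs = set(rng.sample(range(len(seq)), n_mask))
    tokens = [MASK if i in idxs else aa for i, aa in enumerate(seq)]
    targets = {i: seq[i] for i in idxs}  # what the model must predict
    return tokens, targets
```

No structural labels appear anywhere in this setup; the supervision signal is manufactured from the raw sequence itself, which is what makes it scale to enormous protein databases.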

Then, after this pre-training, they took the language model, which is right over here, and stuck it into an architecture that was public from the AlphaFold work. Once they did that, they fine-tuned the whole thing on a dataset of 3D protein structures. Just to be clear on what the input and output are: the input is a single protein sequence, just a piece of text describing the amino acid sequence, and the output is a 3D model of how the protein actually folds in space.

Once they did that fine-tuning, the model performed very similarly to AlphaFold2 in terms of accuracy, but ran about 60 times faster. And they were able to predict the protein structures of something like 600 million metagenomic proteins.

And this is now all public; if you google ESMFold, there are really cool blog posts about it on Meta AI's website. I'm going to try to go fast in the interest of time, especially since I want to take questions at the end. So this is just one application. There are tons of really cool applications of using language models as encoders or base models in other domains.

Another that's thematic in robotics is this paper called PaLM-E. I'm going to make this really short. They took ViT and PaLM and made them output into a joint embedding space, and then used the outputs of this multimodal language model to do robotic planning.

So there are actually robots that are able to use these kinds of multimodal models to do basic planning and execution of tasks. Another example that just came out this month is kind of interesting; it's not using a language model as an encoder, but rather as an element in a large ensemble of other language models that are used to execute tasks. The paper is called Emergent Autonomous Scientific Research Capabilities of Large Language Models.

Basically, you set up this organization of language models that, given an input prompt from a user, is able to do chemical tasks. So you can tell it, "Hey, synthesize ibuprofen for me," and it has a web searcher, a planner, and an automation component that's actually hooked up to a cloud-based chemistry lab that runs via API.

You can actually say, "Hey, synthesize aspirin for me," and from what I remember, it will successfully synthesize basic chemicals for you. So that's an instance of using a language model not just as an encoder, but as a component in a much larger system, to do something very different from language.

All right, I'm going to run through this really fast because I want to have time for questions. I'm just going to check and see if there's anything else here. Okay, I'll get to these questions at the end. So the third trend that I think is really interesting is alternative language model architectures. Since 2017, or maybe a year or two after that, we've been living in this weird transformer hegemony, where everything's being done by transformers.

But there's a huge bottleneck on transformer LLM context, which is that a single pass through a self-attention layer requires n-squared compute, where n is the context length. This prevents you from dealing with really, really large contexts. And there are some interesting tasks that require large context.

One is video processing and video understanding. Another that people have told me about is genetics. And then there's really long-form text: I recruit for this startup called Harvey, and they're dealing with, say, a 2,000-page corpus that describes a corporate merger.

You can't feed all of that into the context of a language model. The same goes for code bases: there are all these search and information-retrieval hacks you have to do to be able to do document Q&A over a code base. It would be really cool if instead you just had a language model that takes the entire code base as context, but currently that's not possible because of this n-squared compute bottleneck. I'm going to try to go really fast, through this in about two minutes, so we can have five minutes for questions. So there have been a variety of approaches to attack this problem.

One that is really cool is this approach from Stanford NLP, where they propose the state space model architecture for language models. These state space models do their computation via very long convolutions with a large sequence of matrices. And the key idea comes from Fourier analysis: it turns out that the Fourier transform turns convolution into multiplication.

So if you want to do a really long convolution, one way to do it is to take the Fourier transform of the two things you want to convolve, multiply them in Fourier space, and then take the inverse Fourier transform to get the result of the convolution back. The cool thing is that there's a fast Fourier transform algorithm that computes this in n log n time.

So the super-high-level idea is that you use the fast Fourier transform to get n log n compute requirements rather than n-squared, and this allows you to deal with much, much longer context. Again, I'll zoom through this in one minute. These authors from Stanford NLP came out with a language modeling architecture they called Hungry Hungry Hippos that basically does this.
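The FFT trick described above is easy to see in miniature. A minimal sketch, using NumPy for the transforms: the direct causal convolution below costs O(n²), while the FFT route (transform, pointwise multiply, inverse transform) costs O(n log n), and the two agree.

```python
import numpy as np

def conv_direct(u, k):
    """Causal convolution by the definition: y[i] = sum_{j<=i} u[j] * k[i-j].
    Quadratic in the sequence length."""
    n = len(u)
    return np.array(
        [sum(u[j] * k[i - j] for j in range(i + 1)) for i in range(n)]
    )

def conv_fft(u, k):
    """Same convolution via FFT -> pointwise multiply -> inverse FFT.
    Zero-padding to length 2n makes the FFT's circular convolution
    equal the linear convolution on the first n outputs."""
    n = len(u)
    U = np.fft.rfft(u, 2 * n)
    K = np.fft.rfft(k, 2 * n)
    return np.fft.irfft(U * K, 2 * n)[:n]
```

At n in the millions, the n-squared version is hopeless while the FFT version remains cheap, which is exactly why this trick opens the door to very long context lengths.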

They had a hybrid model with just two attention layers that ended up beating transformers, up to 2.7 billion parameters, on The Pile. Then two months later they followed up with another paper called Hyena Hierarchy, where they improved the architecture such that they could get rid of attention entirely. So they have a convolution-based language modeling architecture that doesn't use attention at all and has purely n log n scaling in context length. They've only scaled this one up to 335 million parameters, but it matches transformer perplexity on The Pile up to that scale.

It'll be really interesting to see how that scales up. One more minute, really fast. That's not the only alternative LLM architecture work going on. Two other cool papers, or projects, in the space: one is RWKV, an open-source project.

There's a Discord you can join. It's a parallelizable RNN, currently scaled up to 14 billion parameters. You can check it out; there's a GitHub, just google it. The last one I actually don't know too much about, but I feel compelled to comment on: it's this thing that blew up on Twitter over the last two days, the Recurrent Memory Transformer.

I don't really know the details, but they've somehow augmented the transformer with external memory. The work has been going on for months, but they just came out with a paper where they scaled the context length to something like 1 million tokens, which is really interesting. Some of the people I respect in the language modeling space have been commenting that this is just hype and it's not actually going to perform well at scale, but it's worth looking up. It kind of blew up on Twitter.
Thank you. I'm going to answer questions now; let me check my email and see if there are any other questions.

So: what is the Toolformer external API?

There's not a unique external API. Rather, they train the language model to use each tool separately, which is kind of a pain, and it's one of the things the authors acknowledge as a limitation. There are other methods coming out where you don't have to train these language models to use tools separately, one at a time.

I know there are startups doing this, and from what I understand, this is just what OpenAI does with plugins: you can feed in the documentation of an API, and the language model will understand how to use it. So from what I understand, this is how OpenAI's ChatGPT plugins work.

And then there's also this really cool startup called Lindy AI, where you basically just give Lindy, which is an AI assistant, access to the documentation for an API, and it will learn how to use it.

Yeah, so there's another question: "I would be curious to hear more about the Constitutional AI paper if there's time." So let's go back. I'm actually not a huge expert on this paper, but I can comment on it, and then I'll cut myself off in a minute, because I know we're at time and there's a big panel after this.

Yeah, so from what I understand, the purpose of the Constitutional AI paper, one way I think about it, is as a response to reinforcement learning from human feedback, which is OpenAI's thing. The problem is trying to align AI models, especially a base model like the raw GPT-4 that's just been trained on the internet, to human preferences.

OpenAI's response to this problem is reinforcement learning from human feedback. There are three steps, but the main thing is you have the AI model generate responses to questions, and then you have human labelers rank the responses. Then you train a reward model on those manual rankings and use reinforcement learning to teach the language model those human preferences.

This Constitutional AI approach from Anthropic is meant to get around that bottleneck of human labeling for two reasons, from what I understand. I'm echoing Anthropic here, so I might not be completely faithful. The first reason is that human labelers are really expensive, both in literal dollar cost and in time.

They also make the claim that as language models get better and better, it will actually be harder to judge their output, especially on advanced tasks like programming; it's just going to be hard for humans to judge the output. So their approach is: what if we used language models as labelers instead in this process?

So, I think we're at time, so I won't go into details, but they basically take the RLHF approach and add some additional steps. Where the "constitutional" part comes from is that instead of all these human labelers, you provide the language model with a constitution, which is a short set of principles it should follow, and the language models then use it to generate the responses and the rankings.
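The supervised phase of that loop can be sketched schematically. This is a simplified illustration, not Anthropic's actual prompts or pipeline: `llm` is a hypothetical stand-in for any language model call, and the two-principle constitution is made up for the example.

```python
# Illustrative constitution: a short set of principles the model
# critiques itself against (the real paper uses a longer list).
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could assist with dangerous activities.",
]

def critique_and_revise(llm, prompt, n_principles=2):
    """Self-critique loop: the model answers, critiques its own answer
    against a constitutional principle, then revises. The final revision
    becomes fine-tuning data, with no human labeler in the loop.

    llm(text) -> str is any language model completion function."""
    response = llm("Human: " + prompt + "\nAssistant:")
    for principle in CONSTITUTION[:n_principles]:
        critique = llm(
            "Response: " + response
            + "\nCritique this response according to: " + principle
        )
        response = llm(
            "Response: " + response
            + "\nCritique: " + critique
            + "\nRewrite the response to address the critique:"
        )
    return response  # training pair: (prompt, final revision)
```

The preference-labeling side works analogously: instead of a human ranking two responses, a language model is asked which response better follows the constitution, and those AI-generated rankings feed the reinforcement learning step.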

This has been a lot of fun. I hope you found some of these trends interesting, and I hope you enjoy the rest of the Observe conference. Thanks for coming.
