Live | Every Other Wednesday
10:15am PT | 45 minutes
Join us every other Wednesday for an engaging discussion session where we delve into the latest technical papers, covering a range of topics including large language models (LLMs), generative models, ChatGPT, and more. This recurring event offers an opportunity to collectively analyze and exchange insights on cutting-edge research in these areas and their broader implications.
Join us for a deep dive into the “Agent-as-a-Judge” framework, a new paradigm for evaluating agentic systems. Whereas typical evaluation methods either focus solely on outcomes or demand extensive manual work, this approach uses agentic systems to evaluate agentic systems, offering intermediate feedback throughout the task-solving process. We’ll discuss how Agent-as-a-Judge enables scalable self-improvement.
Paper: https://arxiv.org/abs/2410.10934
Recording: https://www.youtube.com/watch?v=YhT6PhG_05U
Join us as we break down OpenAI’s real-time API. Learn how to seamlessly integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you’re building chatbots, dynamic content tools, or enhancing real-time collaboration, we’ll walk through the API’s capabilities, potential use cases, and best practices for implementation. Don’t miss this opportunity to dive into the next level of AI-driven interactions!
Recording: https://www.youtube.com/watch?v=OjAgZsS9J7E
This week, we’re diving into OpenAI’s Swarm – an experimental lightweight multi-agent framework.
Swarm enables the creation of multi-agent systems, where each agent has a defined focus and limited actions. At any given time, only one agent is in control, but it can seamlessly pass control to another.
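To make the handoff mechanism concrete, here’s a minimal sketch modeled on the examples in the Swarm repository (the agent names, instructions, and handoff function are our own illustration):

```python
from swarm import Swarm, Agent

# Hypothetical agents for illustration: a triage agent that can hand off to a refunds agent.
refunds_agent = Agent(
    name="Refunds Agent",
    instructions="Handle refund requests politely and concisely.",
)

def transfer_to_refunds():
    """Returning another Agent from a function hands control to that agent."""
    return refunds_agent

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route the user to the right specialist agent.",
    functions=[transfer_to_refunds],
)

client = Swarm()
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "I'd like a refund for my last order."}],
)
print(response.messages[-1]["content"])
```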
Recording: https://www.youtube.com/watch?v=Oyk_ifMqruQ
This week we’re taking a look at Google’s NotebookLM, a personalized AI research assistant powered by Gemini 1.5 Pro.
Recording: https://www.youtube.com/watch?v=aPPZsU3ie3U
Blog, Transcript, Podcast: https://arize.com/blog/exploring-google-notebook-lm/
This week, we’re diving into OpenAI’s latest series of models: o1-preview and o1-mini. We’ll also share insights from our recent research, where we put these models to the test! Want to know how o1-preview performs qualitatively against Claude 3.5 Sonnet? Join us!
Recording: https://www.youtube.com/watch?v=QCSn7W_w0Rg
Blog, Transcript & Podcast: https://arize.com/blog/exploring-openai-o1-preview-and-o1-mini/
A recent announcement on X boasted a fine-tuned model, Reflection 70B, with outstanding performance, claiming the results were achieved through Reflection Tuning. However, others were unable to reproduce those results. We use this recent drama in the AI community as a jumping-off point to discuss Reflection 70B and the 2023 paper on Reflection Tuning whose concepts the model draws on.
Recording: https://www.youtube.com/watch?v=noBNz_Uxqqs
Paper: https://arxiv.org/abs/2310.11716
Blog, Transcript & Podcast: https://arize.com/blog/breaking-down-reflection-tuning-enhancing-llm-performance-with-self-learning/
This week, we’re excited to be joined by Kyle O’Brien, Applied Scientist at Microsoft, to discuss his most recent paper, Composable Interventions for Language Models. Kyle and his team present a new framework, composable interventions, that allows for the study of multiple interventions applied sequentially to the same language model. The discussion will cover their key findings from extensive experiments, revealing how different interventions—such as knowledge editing, model compression, and machine unlearning—interact with each other.
Recording: https://www.youtube.com/watch?v=9wzHBVQUaos&t=1s
Paper: https://arxiv.org/pdf/2407.06483
Blog, Transcript & Podcast: https://arize.com/blog/composable-interventions-for-language-models/
This week’s paper, Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, presents a comprehensive study of the performance of various LLMs acting as judges. The researchers leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations, which they find to have high inter-annotator agreement. The study includes nine judge models and nine exam-taker models, both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge prompts to answer questions about the strengths and weaknesses of this paradigm and the potential biases it may hold.
Recording: https://www.youtube.com/watch?v=shHgMRB5Eu0
Paper: https://arxiv.org/pdf/2406.12624
Blog, Transcript & Podcast: https://arize.com/blog/judging-the-judges-llm-as-a-judge/
Meta just released Llama 3.1 405B. According to Meta, it’s “the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation.” Will the latest Llama herd ignite new applications and modeling paradigms like synthetic data generation? Will it enable the improvement and training of smaller models, as well as model distillation? Meta thinks so. We’ll take a look at what they did here, talk about open source, and decide whether we want to believe the hype.
Recording: https://www.youtube.com/watch?v=uXt6rYXnV8U
Paper: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
Blog, Transcript & Podcast: https://arize.com/blog/breaking-down-meta-llama-3/
Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints still requires heuristic “prompt engineering.” This week’s paper introduces LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. The researchers integrate their constructs into the recent DSPy programming model for LMs and present new strategies that allow DSPy to compile programs with LM Assertions into more reliable and accurate systems. They also propose strategies for using assertions at inference time for automatic self-refinement. Across four diverse case studies in text generation, they find that LM Assertions improve not only compliance with imposed rules but also downstream task performance, passing constraints up to 164% more often and generating up to 37% more high-quality responses.
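As a rough illustration of what an LM Assertion looks like in practice, here’s a minimal DSPy-style sketch (the module, signature, and length constraint are our own example; the API names follow the DSPy assertions documentation around the time of the paper and may have since changed):

```python
import dspy

class TweetWriter(dspy.Module):
    """Generates a tweet about a topic, with an LM Assertion on length."""

    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("topic -> tweet")

    def forward(self, topic):
        result = self.generate(topic=topic)
        # Soft constraint: if violated, DSPy can backtrack and retry generation
        # with the failure message added to the prompt.
        dspy.Suggest(
            len(result.tweet) <= 280,
            "The tweet must be at most 280 characters.",
        )
        return result
```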
Recording: https://www.youtube.com/watch?v=Hf6u4SDSFcg
Paper: https://arxiv.org/pdf/2312.13382
Blog, Transcript & Podcast: https://arize.com/blog/dspy-assertions-computational-constraints/
We’re excited to host Sai Kolasani, researcher at UC Berkeley’s RISE Lab, to talk about his work on RAFT: Adapting Language Model to Domain Specific RAG. RAFT is a training recipe that improves an LLM’s ability to answer questions in an “open-book,” in-domain setting. Given a question and a set of retrieved documents, the model is trained to ignore documents that don’t help answer the question (aka distractor documents). This, coupled with RAFT’s chain-of-thought-style responses, helps improve the model’s ability to reason. In domain-specific RAG, RAFT consistently improves performance across the PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe for adapting pre-trained LLMs to in-domain RAG.
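For intuition, here’s a sketch of how a single RAFT-style training example might be assembled: the oracle document that answers the question is mixed with distractor documents, and the target output is a chain-of-thought answer that cites the oracle. The field names, sampling ratio, and toy documents are illustrative, not the paper’s exact configuration:

```python
import random

def build_raft_example(question, oracle_doc, distractor_docs, cot_answer, p_oracle=0.8):
    """Assemble one RAFT-style fine-tuning example.

    With probability p_oracle the oracle document is kept in the context;
    otherwise the model must learn to answer (or abstain) from distractors alone.
    """
    context_docs = list(distractor_docs)
    if random.random() < p_oracle:
        context_docs.append(oracle_doc)
    random.shuffle(context_docs)

    prompt = "\n\n".join(
        [f"Document [{i + 1}]: {doc}" for i, doc in enumerate(context_docs)]
        + [f"Question: {question}"]
    )
    # The target is a chain-of-thought-style answer that quotes the oracle document.
    return {"prompt": prompt, "completion": cot_answer}

example = build_raft_example(
    question="What enzyme do ACE inhibitors block?",
    oracle_doc="ACE inhibitors lower blood pressure by blocking angiotensin-converting enzyme.",
    distractor_docs=["Beta blockers reduce heart rate.", "Statins lower LDL cholesterol."],
    cot_answer="The context states ACE inhibitors block angiotensin-converting enzyme, so the answer is: angiotensin-converting enzyme.",
)
```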
Recording: https://www.youtube.com/watch?v=cbQ5rm1jOuU
Blog, Transcript & Podcast: https://arize.com/blog/raft-adapting-language-model-to-domain-specific-rag/
It’s been an exciting couple weeks for GenAI! Join us as we discuss the latest research from OpenAI and Anthropic. We’re excited to chat about this significant step forward in understanding how LLMs work and the implications it has for deeper understanding of the neural activity of language models. Hope you can join the conversation!
OpenAI’s Paper: https://openai.com/index/extracting-concepts-from-gpt-4/
Anthropic’s Paper: https://transformer-circuits.pub/2024/scaling-monosemanticity/
Recording: https://www.youtube.com/watch?v=fkW0bGnbDkQ
Blog, Transcript & Podcast: https://arize.com/blog/llm-interpretability-and-sparse-autoencoders-openai-anthropic/
Ensuring alignment (aka: making models behave in accordance with human intentions) has become a critical task before deploying LLMs in real-world applications. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness.
The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.
Paper: https://arxiv.org/abs/2308.05374
Recording: https://www.youtube.com/watch?v=yKN1f4Gkjro
Blog, Transcript & Podcast: https://arize.com/blog/trustworthy-llms-a-survey-and-guideline-for-evaluating-large-language-models-alignment/
Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators often inherit the problems of the LLMs they evaluate, requiring further human validation.
This week’s paper explores EvalGen, a mixed-initiative approach to aligning LLM-generated evaluation functions with human preferences. EvalGen assists users both in developing criteria for acceptable LLM outputs and in developing functions that check those criteria, ensuring evaluations reflect the users’ own grading standards.
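As a simplified illustration of the alignment step, here’s a sketch that scores candidate assertion functions by how well they agree with a user’s thumbs-up/thumbs-down grades on sample outputs. The candidate assertions and grades are invented for illustration; EvalGen’s actual workflow also covers criteria elicitation and LLM-generated assertions:

```python
def contains_citation(output: str) -> bool:
    return "[" in output and "]" in output

def is_concise(output: str) -> bool:
    return len(output.split()) <= 100

candidate_assertions = {"contains_citation": contains_citation, "is_concise": is_concise}

# Hypothetical human grades: True = thumbs up, False = thumbs down.
graded_outputs = [
    ("Short answer with a source [1].", True),
    ("Rambling answer with no source ...", False),
]

# Keep the assertions whose pass/fail decisions best match the human grades.
alignment = {
    name: sum(fn(out) == grade for out, grade in graded_outputs) / len(graded_outputs)
    for name, fn in candidate_assertions.items()
}
selected = [name for name, score in alignment.items() if score >= 0.75]
print(alignment, selected)
```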
Paper: https://arxiv.org/abs/2404.12272
Recording: https://www.youtube.com/watch?v=kco7kA4qO-0
Blog, Transcript & Podcast: https://arize.com/blog/breaking-down-evalgen-who-validates-the-validators/
This week, we’re covering ReAct, a prompting paradigm that interleaves reasoning traces with task-specific actions, letting an LLM plan, act, and incorporate observations from tools or environments.
Paper: https://arxiv.org/pdf/2210.03629.pdf
Recording: https://www.youtube.com/watch?v=QX-p-vsDoiQ
Blog, Transcript & Podcast: https://arize.com/blog/keys-to-understanding-react/
This week, we’re covering Amazon’s time series model: Chronos. Developing accurate machine-learning-based forecasting models has traditionally required substantial dataset-specific tuning and model customization. Chronos, however, is built on a language model architecture and trained on billions of tokenized time series observations, enabling it to provide accurate zero-shot forecasts that match or exceed purpose-built models.
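Here’s a rough sketch of what zero-shot forecasting with Chronos looks like, based on the chronos-forecasting package; the model checkpoint and toy series are placeholders:

```python
import numpy as np
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# A toy series; in practice this would be your own history.
context = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])

# Zero-shot: no fine-tuning on this series. Output shape is
# [num_series, num_samples, prediction_length]; take the median across samples.
forecast = pipeline.predict(context, prediction_length=4)
median_forecast = np.quantile(forecast[0].numpy(), 0.5, axis=0)
print(median_forecast)
```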
Paper: https://arxiv.org/abs/2403.07815
Recording: https://www.youtube.com/watch?v=yKKWCqABspw
Blog, Transcript & Podcast: https://arize.com/blog/demystifying-chronos-learning-the-language-of-time-series/
Join us for this week’s Arize Community Paper Reading where we’ll dive into the latest buzz in the AI world – the arrival of Claude-3, the newest model in the LLM space, challenging the likes of GPT-4.
We will explore Anthropic’s recent paper, and walk through Arize’s latest research comparing Claude-3 to GPT-4. Whether you’re a researcher, practitioner, or simply curious about the future of AI, we hope you’ll join the conversation.
Recording: https://www.youtube.com/watch?v=mU6Ob-7eAhY
Blog, Transcript & Podcast: https://arize.com/blog/anthropic-claude-3/
We’re exploring Reinforcement Learning in the Era of LLMs this week with Claire Longo, Arize’s Head of Customer Success. Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). This week’s paper aims to link research in conventional RL to the RL techniques used in LLM research, demystifying the technique by discussing why, when, and how RL excels.
Paper: https://arxiv.org/abs/2310.06147
Recording: https://www.youtube.com/watch?v=g2x1A2SzyU0
Blog, Transcript & Podcast: https://arize.com/blog/reinforcement-learning-in-the-era-of-llms/
This week, we’re delighted to be joined by community member & AI Engineer Vibhu Sapra to discuss OpenAI’s technical report on their Text-To-Video Generation Model: Sora. We’ll also explore recent research done on EvalCrafter: Benchmarking and Evaluating Large Video Generation Models.
Recording: https://www.youtube.com/watch?v=dUv9GoQMDb0&t=3s
This week, we’re discussing RAG vs Fine-tuning, a paper that explores a pipeline for fine-tuning and RAG and presents the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. The authors propose a pipeline that consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 to evaluate the results. Overall, the results point to how systems built using LLMs can be adapted to respond to and incorporate knowledge along a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.
Link to paper: https://arxiv.org/abs/2401.08406
Recording: https://www.youtube.com/watch?v=EbEPHOABgSY
With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.
Link to paper: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
Recording: https://www.youtube.com/watch?v=qtBk7wmwCA0
We’re back with Arize CPO & Co-Founder, Aparna Dhinakaran, for a continued exploration of the new kids on the block: Gemini and Mixtral-8x7B. Catch up on Part 1 here.
In this virtual discussion, Arize CPO & Co-Founder Aparna Dhinakaran will be joined by a couple of members of her team for an exploration of the new kids on the block: Gemini and Mixtral-8x7B.
Recording: https://www.youtube.com/watch?v=B_-syBNYWzU
We’re thrilled to be joined by Shuaichen Chang, LLM researcher and the author of this week’s paper to discuss his findings. Shuaichen’s research investigates the impact of prompt constructions on the performance of large language models (LLMs) in the text-to-SQL task, particularly focusing on zero-shot, single-domain, and cross-domain settings. Shuaichen and his team explore various strategies for prompt construction, evaluating the influence of database schema, content representation, and prompt length on LLMs’ effectiveness. The findings emphasize the importance of careful consideration in constructing prompts, highlighting the crucial role of table relationships and content, the effectiveness of in-domain demonstration examples, and the significance of prompt length in cross-domain scenarios.
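To ground the discussion, here’s a sketch of a zero-shot text-to-SQL prompt that includes the database schema as CREATE TABLE statements plus a few sample rows, one style of content representation in this line of work; the schema and question are our own toy example, not from the paper:

```python
schema = """CREATE TABLE singer (singer_id INT PRIMARY KEY, name TEXT, country TEXT, age INT);
CREATE TABLE concert (concert_id INT PRIMARY KEY, singer_id INT, year INT,
                      FOREIGN KEY (singer_id) REFERENCES singer(singer_id));"""

sample_rows = """-- singer: (1, 'Joe', 'France', 52), (2, 'Mei', 'Japan', 31)
-- concert: (10, 1, 2019), (11, 2, 2021)"""

question = "How many concerts were held after 2020 by singers younger than 40?"

prompt = f"""Given the database schema and sample rows below, write a SQLite query
that answers the question. Return only SQL.

{schema}

{sample_rows}

Question: {question}
SQL:"""
print(prompt)
```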
Link to paper: https://arxiv.org/pdf/2305.11853.pdf
Recording: https://youtu.be/8ZU6WpDRnis
We’re excited to be joined by Samuel Marks, Postdoctoral Research Associate at Northeastern University, to discuss his paper, “The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets”. Samuel and his team curated high-quality datasets of true/false statements and used them to study in detail the structure of LLM representations of truth. Overall, they present evidence that language models linearly represent the truth or falsehood of factual statements and also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
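In simplified form, mass-mean probing takes the difference between the mean activation on true statements and the mean activation on false statements as a “truth direction,” then scores new statements by their projection onto it. A minimal sketch with random stand-in activations (the paper also studies a covariance-adjusted variant that this omits):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # hidden size of the probed layer

# Stand-ins for residual-stream activations at a chosen layer/token position.
acts_true = rng.normal(loc=0.2, size=(100, d))    # activations on true statements
acts_false = rng.normal(loc=-0.2, size=(100, d))  # activations on false statements

# Mass-mean direction: difference of class means.
theta = acts_true.mean(axis=0) - acts_false.mean(axis=0)

def truth_score(activation: np.ndarray) -> float:
    """Project an activation onto the truth direction; larger = more 'true'."""
    return float(activation @ theta)

new_activation = rng.normal(loc=0.2, size=d)
print(truth_score(new_activation) > 0)  # expected: True for a 'true-like' activation
```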
Link to paper: https://arxiv.org/abs/2310.06824
Recording: https://youtu.be/7XNqsFA0Znw
This week, we’re discussing “Decomposing Language Models Into Understandable Components,” which addresses the challenge of understanding the inner workings of neural networks, drawing parallels with the complexity of human brain function. It explores the concept of “features” (patterns of neuron activations), providing a more interpretable way to dissect neural networks. By decomposing a layer of neurons into thousands of features, this approach uncovers hidden model properties that are not evident when examining individual neurons. These features are demonstrated to be more interpretable and consistent, offering the potential to steer model behavior and improve AI safety.
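The decomposition in the paper comes from dictionary learning with a sparse autoencoder trained on a layer’s activations. Here’s a minimal PyTorch sketch of that setup; the sizes and sparsity penalty are illustrative rather than the paper’s settings:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes activations into many sparsely-active 'features'."""

    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(features)             # reconstruction of the original activations
        return recon, features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)  # stand-in for a batch of MLP-layer activations
recon, features = sae(acts)

# Reconstruction loss plus an L1 penalty that pushes most features to zero.
l1_coeff = 1e-3
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
```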
Link to paper: https://transformer-circuits.pub/2023/monosemantic-features/index.html
Recording: https://www.youtube.com/watch?v=hlCxSqWS6Rw
In this paper reading, we’ll be discussing RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting. While researchers have successfully applied LLMs such as ChatGPT to reranking in an information retrieval context, such work has mostly been built on proprietary models hidden behind opaque API endpoints. This approach yields experimental results that are not reproducible and non-deterministic, threatening the veracity of outcomes that build on such shaky foundations. RankVicuna provides access to a fully open-source LLM and associated code infrastructure capable of performing high-quality reranking.
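For context, listwise reranking prompts the model with the query and a numbered list of candidate passages and asks it to return a permutation of the identifiers. Here’s a sketch of what such a prompt looks like; RankVicuna’s exact template differs, and the query and passages are our own toy example:

```python
def listwise_rerank_prompt(query: str, passages: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"I will provide you with {len(passages)} passages, each labeled with an identifier.\n"
        f"Rank the passages by their relevance to the query: {query}\n\n"
        f"{numbered}\n\n"
        "Output the identifiers in descending order of relevance, e.g. [2] > [1] > [3]."
    )

print(listwise_rerank_prompt(
    "what causes aurora borealis",
    ["Solar wind particles excite atmospheric gases.",
     "Auroras were named after the Roman goddess of dawn."],
))
```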
Link to paper: https://arxiv.org/abs/2309.15088v1
Recording: https://youtu.be/fAVHx89aRHU
Join Arize Co-Founder & CEO Jason Lopatecki, and ML Solutions Engineer, SallyAnn DeLucia, as they discuss “Explaining Grokking Through Circuit Efficiency”. This paper explores novel predictions about grokking, providing significant evidence in favour of its explanation. Most strikingly, the research conducted in this paper demonstrates two novel and surprising behaviors: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.
Link to paper: https://arxiv.org/abs/2309.02390
Recording: https://youtu.be/n-hkcgd7SBw
Join Arize’s Amber Roberts and SallyAnn DeLucia as they discuss “Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior”. This paper highlights that while LLMs have great generalization capabilities, they struggle to effectively predict and optimize communication to get the desired receiver behavior. We’ll explore whether this might be because of a lack of “behavior tokens” in LLM training corpora and how Large Content Behavior Models (LCBMs) might help to solve this issue.
Link to paper: https://arxiv.org/abs/2309.00359
Recording: https://www.youtube.com/watch?v=KY76SCEjEIo
Join us for an exploration of the ‘Skeleton-of-Thought’ (SoT) approach, aimed at reducing large language model latency while enhancing answer quality, with two of the paper’s authors, Xuefei Ning and Zinan Lin, in attendance. SoT’s methodology guides LLMs to construct an answer skeleton first and then elaborate on each point in parallel, achieving impressive speed-ups of up to 2.39x across 11 models. Don’t miss the opportunity to delve into this human-inspired optimization strategy and its implications for efficient and high-quality language generation.
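In outline, SoT makes two kinds of calls: one to draft a skeleton of short bullet points, then one per point, issued in parallel, to expand each point. A minimal sketch using a hypothetical `llm()` completion function (the prompts and point parsing are our own simplification):

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    """Placeholder for a call to your chat-completion endpoint of choice."""
    raise NotImplementedError

def skeleton_of_thought(question: str) -> str:
    # Stage 1: ask for a short skeleton of 3-5 concise bullet points.
    skeleton = llm(f"Give a skeleton of 3-5 short bullet points (a few words each) "
                   f"for answering: {question}")
    points = [line.strip("-• ").strip() for line in skeleton.splitlines() if line.strip()]

    # Stage 2: expand every point independently, so the calls can run in parallel.
    def expand(point: str) -> str:
        return llm(f"Question: {question}\nExpand this point into 1-2 sentences: {point}")

    with ThreadPoolExecutor() as pool:
        expansions = list(pool.map(expand, points))

    return "\n".join(f"{p}: {e}" for p, e in zip(points, expansions))
```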
Link to paper: https://arxiv.org/abs/2307.15337
During this week’s paper reading event, we are thrilled to be joined by Frank Liu of Zilliz, who will be sharing valuable insights with us. This paper examines Position Interpolation (PI), a method for extending the context window sizes of LLaMA models up to 32,768 positions with minimal fine-tuning. The extended models showed strong results on tasks requiring long context and retained their quality within the original context window. PI avoids catastrophic attention score issues by linearly down-scaling input position indices. The method’s stability was demonstrated, and existing optimization and infrastructure could be reused in the extended models. During the event, we will also discuss the write-up “Extending Context is Hard… But Not Impossible” available at https://kaiokendev.github.io/context.
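The core of PI is essentially a one-line change: instead of extrapolating to unseen position indices, the indices are linearly scaled down so an extended context still maps into the originally trained range before computing rotary embeddings. A sketch, assuming standard RoPE with the usual base of 10000 (the lengths are illustrative):

```python
import torch

def rope_angles(seq_len, dim, trained_len=2048, extended_len=8192, base=10000.0):
    """Rotary-embedding angles with Position Interpolation applied."""
    positions = torch.arange(seq_len, dtype=torch.float32)
    # Position Interpolation: squeeze extended positions back into the trained range.
    positions = positions * (trained_len / extended_len)  # e.g. position 8191 -> ~2047.75
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions, inv_freq)  # [seq_len, dim/2] angles fed into sin/cos

angles = rope_angles(seq_len=8192, dim=128)
```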
Link to Paper: https://arxiv.org/pdf/2306.15595.pdf
Recording: https://youtu.be/HDm9YjlLE60
In this paper reading, we explore the paper “Llama 2: Open Foundation and Fine-Tuned Chat Models.” The paper introduces Llama 2, a collection of pretrained and fine-tuned large language models ranging from 7 billion to 70 billion parameters. The fine-tuned model, Llama 2-Chat, is specifically designed for dialogue use cases and showcases superior performance on various benchmarks. Through human evaluations for helpfulness and safety, Llama 2-Chat emerges as a promising alternative to closed-source models. Discover the authors’ approach to fine-tuning and safety improvements, which aims to foster responsible development in this rapidly evolving field.
Link to Paper: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
Recording: https://www.youtube.com/watch?v=HyppoCyOwfY
This paper examines how well language models utilize longer input contexts. The study focuses on multi-document question answering and key-value retrieval tasks. The researchers find that performance is highest when relevant information is at the beginning or end of the context. Accessing information in the middle of long contexts leads to significant performance degradation. Even explicitly long-context models experience decreased performance as the context length increases. The analysis enhances our understanding and offers new evaluation protocols for future long-context models.
Link to paper: https://arxiv.org/abs/2307.03172
Link to recording:
Recent research focuses on improving smaller models through imitation learning using outputs from large foundation models (LFMs). Challenges include limited imitation signals, homogeneous training data, and a lack of rigorous evaluation, leading to overestimation of small model capabilities. To address this, the authors introduce Orca, a 13-billion parameter model that learns to imitate LFMs’ reasoning process. Orca leverages rich signals from GPT-4, surpassing state-of-the-art models by over 100% on complex zero-shot reasoning benchmarks. It also shows competitive performance in professional and academic exams without chain-of-thought (CoT) prompting. Learning from step-by-step explanations, generated by humans or advanced AI models, enhances model capabilities and skills.
Link to Paper: https://arxiv.org/abs/2306.02707
Link to Recording: https://www.youtube.com/watch?v=BswvaWZdWw4
Introducing GLoRA: a universal, parameter-efficient fine-tuning approach for diverse tasks. GLoRA enhances LoRA with a generalized prompt module, optimizing pre-trained model weights and activations. Its scalable, layer-wise structure search enables efficient parameter adaptation. GLoRA excels in transfer learning, few-shot learning, and domain generalization, outperforming previous methods on various datasets. With fewer parameters and no extra inference cost, GLoRA is a practical solution for resource-limited applications. Join us to explore GLoRA’s capabilities in this interactive community paper reading!
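For readers newer to this family of methods, GLoRA generalizes the low-rank adapter idea. As a reference point, here’s a minimal LoRA-style adapter in PyTorch; GLoRA additionally learns scaling/shift terms and searches the per-layer structure, which this baseline sketch does not capture:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # the pre-trained weights stay frozen
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))  # only A and B receive gradients
```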
Link to Paper: https://arxiv.org/abs/2306.07967
Explore HyDE, a thrilling zero-shot learning technique that combines GPT-3’s language understanding with contrastive text encoders. HyDE revolutionizes information retrieval and grounding in real-world data by generating hypothetical documents from queries and retrieving similar real-world documents. It outperforms traditional unsupervised retrievers, rivaling fine-tuned retrievers across diverse tasks and languages.
This leap in zero-shot learning efficiently retrieves relevant real-world information without task-specific fine-tuning, broadening AI model applicability and effectiveness. Join us for a paper reading on how HyDE works!
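A minimal sketch of the HyDE retrieval loop using the OpenAI Python client; the model names, toy corpus, and prompt are placeholders (the original paper used InstructGPT with a Contriever encoder), so treat this as an illustration of the flow rather than the paper’s setup:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def hyde_search(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # 1. Generate a hypothetical document that *answers* the query.
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a short passage answering: {query}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical document (not the query) and retrieve real documents near it.
    q_vec = embed(hypo)
    doc_vecs = np.stack([embed(d) for d in corpus])
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    return [corpus[i] for i in np.argsort(-sims)[:k]]
```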
Link to Paper: https://arxiv.org/abs/2212.10496
Recording: https://youtu.be/PvT8ntmm1Xs
VOYAGER, the first LLM-powered embodied lifelong learning agent in Minecraft, autonomously explores the world, acquires skills, and makes discoveries without human intervention. It outperforms previous approaches, achieving exceptional proficiency in playing Minecraft and successfully applies its learned skills to solve novel tasks in different Minecraft worlds, surpassing techniques that struggle with generalization.
Link to Paper: https://arxiv.org/pdf/2305.16291.pdf
Link to Recording: https://www.youtube.com/watch?v=BU3w_AbCEbA
This week we’re diving into the world of Retrieval-Augmented Generation (RAG)!
We know GPT-like LLMs are great at soaking up knowledge during pre-training and fine-tuning them can lead to some pretty great, specific results. But when it comes to tasks that really demand heavy knowledge lifting, they still fall short. Plus, it’s not exactly easy to figure out where their answers come from or how to update their knowledge.
Enter RAG models, a hybrid beast that combines the best of both worlds: the learning power of pre-trained models (the parametric part), and an explicit, non-parametric memory — imagine a searchable index of all of Wikipedia.
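That non-parametric memory is essentially a dense vector index over a document collection. Here’s a minimal sketch with FAISS; the embedding function is a stand-in (a real RAG system uses a trained dense retriever such as DPR), and the documents are toy examples:

```python
import faiss
import numpy as np

d = 384  # embedding dimension (stand-in)

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder encoder; in practice a dense retriever produces these vectors."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), d)).astype("float32")

documents = ["The Eiffel Tower is in Paris.", "Photosynthesis occurs in chloroplasts."]
index = faiss.IndexFlatIP(d)   # inner-product index = the non-parametric memory
index.add(embed(documents))

query_vec = embed(["Where is the Eiffel Tower?"])
scores, ids = index.search(query_vec, k=1)
retrieved = documents[ids[0][0]]

# The retrieved passage is then given to the parametric model (the LLM) as context.
prompt = f"Context: {retrieved}\n\nQuestion: Where is the Eiffel Tower?\nAnswer:"
```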
Link to paper: https://arxiv.org/abs/2005.11401
This paper introduces a novel approach, DragGAN, for achieving precise control over the pose, shape, expression, and layout of objects generated by GANs. It allows users to “drag” any points of an image to specific target points — in other words, it enables the deformation of images with better control over where pixels end up to produce ultra-realistic outputs.
Paper: https://arxiv.org/abs/2305.10973
View Recording: https://youtu.be/DxzsgV8rTOw