Community Papers Reading

  Live | Every Other Wednesday

  10:15am PT | 45 minutes

Join us every other Wednesday for an engaging discussion session where we delve into the latest technical papers, covering a range of topics including large language models (LLMs), generative models, ChatGPT, and more. This recurring event offers an opportunity to collectively analyze and exchange insights on cutting-edge research in these areas and their broader implications.

On-Demand | Chronos: Learning the Language of Time Series

This week, we’re covering Amazon’s time series model: Chronos. Developing accurate machine-learning-based forecasting models has traditionally required substantial dataset-specific tuning and model customization. Chronos, however, is built on a language model architecture and trained with billions of tokenized time series observations, enabling it to provide accurate zero-shot forecasts that match or exceed purpose-built models.
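To make "tokenized time series observations" concrete, here is a minimal sketch of one way to turn real-valued observations into token ids: scale the series by its mean absolute value, then quantize into uniform bins. The function name `tokenize_series` and the values of `num_bins`, `low`, and `high` are illustrative assumptions for this example, not the settings used by Chronos.

```python
from typing import List


def tokenize_series(values: List[float], num_bins: int = 10,
                    low: float = -5.0, high: float = 5.0) -> List[int]:
    """Map real-valued observations to integer token ids (illustrative only).

    Step 1: scale by the mean absolute value so that series with different
            magnitudes share one vocabulary.
    Step 2: quantize each scaled value into one of `num_bins` uniform bins
            spanning [low, high]; out-of-range values are clipped.
    """
    mean_abs = sum(abs(v) for v in values) / len(values)
    scale = mean_abs if mean_abs > 0 else 1.0  # avoid division by zero
    bin_width = (high - low) / num_bins
    tokens = []
    for v in values:
        scaled = min(max(v / scale, low), high)
        # The top edge of the range belongs to the last bin.
        token = min(int((scaled - low) / bin_width), num_bins - 1)
        tokens.append(token)
    return tokens


series = [10.0, 20.0, 30.0, 20.0]
tokens = tokenize_series(series)
print(tokens)
```

Because of the scaling step, a series with the same shape at a different magnitude (for example `[1.0, 2.0, 3.0, 2.0]`) maps to the same tokens, which is what lets one vocabulary cover many datasets.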

Paper: https://arxiv.org/abs/2403.07815

Recording: https://www.youtube.com/watch?v=yKKWCqABspw

Blog, Transcript & Podcast: https://arize.com/blog/demystifying-chronos-learning-the-language-of-time-series/

On Demand | Opus Claude-3: A GPT-4 Competitor Has Arrived

Join us for this week’s Arize Community Paper Reading where we’ll dive into the latest buzz in the AI world – the arrival of Claude-3, the newest model in the LLM space, challenging the likes of GPT-4.
We will explore Anthropic’s recent paper, and walk through Arize’s latest research comparing Claude-3 to GPT-4. Whether you’re a researcher, practitioner, or simply curious about the future of AI, we hope you’ll join the conversation.

Recording: https://www.youtube.com/watch?v=mU6Ob-7eAhY

Blog, Transcript & Podcast: https://arize.com/blog/anthropic-claude-3/

On Demand | Reinforcement Learning in the Era of LLMs

We’re exploring Reinforcement Learning in the Era of LLMs this week with Claire Longo, Arize’s Head of Customer Success. Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). This week’s paper aims to link the research in conventional RL to RL techniques used in LLM research and demystify this technique by discussing why, when, and how RL excels.

Paper: https://arxiv.org/abs/2310.06147

Recording: https://www.youtube.com/watch?v=g2x1A2SzyU0

Blog, Transcript & Podcast: https://arize.com/blog/reinforcement-learning-in-the-era-of-llms/

On Demand | Exploring Sora & Evaluating Large Video Generation Models

This week, we’re delighted to be joined by community member & AI Engineer Vibhu Sapra to discuss OpenAI’s technical report on their Text-To-Video Generation Model: Sora. We’ll also explore recent research done on EvalCrafter: Benchmarking and Evaluating Large Video Generation Models.

Recording: https://www.youtube.com/watch?v=dUv9GoQMDb0&t=3s

On Demand | RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture

This week, we’re discussing RAG vs Fine-tuning, a paper that explores a pipeline for fine-tuning and RAG and presents the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. The authors propose a pipeline that consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 for evaluating the results. Overall, the results point to how systems built using LLMs can be adapted to respond and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.

Link to paper: https://arxiv.org/abs/2401.08406

Recording: https://www.youtube.com/watch?v=EbEPHOABgSY

On Demand | SLMs (Small Language Models) vs LLMs: Phi-2

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

Link to paper: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

Recording: https://www.youtube.com/watch?v=qtBk7wmwCA0

On Demand | A Deep Dive Into Generative's Newest Models: Gemini vs Mistral (Mixtral-8x7B) – Part 2

We’re back with Arize CPO & Co-Founder, Aparna Dhinakaran, for a continued exploration of the new kids on the block: Gemini and Mixtral-8x7B. Catch up on Part 1 here.

On Demand | A Deep Dive Into Generative's Newest Models: Gemini vs Mistral (Mixtral-8x7B) – Part 1

In this virtual discussion, Arize CPO & Co-Founder, Aparna Dhinakaran, will be joined by a couple of members of her team for an exploration of the new kids on the block: Gemini and Mixtral-8x7B.

Recording: https://www.youtube.com/watch?v=B_-syBNYWzU

On Demand | How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings

We’re thrilled to be joined by Shuaichen Chang, LLM researcher and the author of this week’s paper to discuss his findings. Shuaichen’s research investigates the impact of prompt constructions on the performance of large language models (LLMs) in the text-to-SQL task, particularly focusing on zero-shot, single-domain, and cross-domain settings. Shuaichen and his team explore various strategies for prompt construction, evaluating the influence of database schema, content representation, and prompt length on LLMs’ effectiveness. The findings emphasize the importance of careful consideration in constructing prompts, highlighting the crucial role of table relationships and content, the effectiveness of in-domain demonstration examples, and the significance of prompt length in cross-domain scenarios.
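As a rough illustration of what "prompt construction" covers here, the sketch below assembles a zero-shot text-to-SQL prompt from a database's schema (its `CREATE TABLE` statements, which carry the table relationships) and a few sample rows per table (the content). The helper name `build_text_to_sql_prompt`, the comment-style layout, and the toy `singer`/`concert` tables are assumptions made for this example, not the exact format studied in the paper.

```python
import sqlite3


def build_text_to_sql_prompt(conn: sqlite3.Connection, question: str,
                             rows_per_table: int = 3) -> str:
    """Assemble a zero-shot prompt from schema, sample rows, and a question."""
    parts = []
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for name, create_sql in tables:
        # Schema: the CREATE TABLE statement keeps keys and relationships.
        parts.append(create_sql + ";")
        # Content: a few example rows show the model what values look like.
        cursor = conn.execute(
            f'SELECT * FROM "{name}" LIMIT {int(rows_per_table)}'
        )
        columns = [d[0] for d in cursor.description]
        parts.append("/* " + ", ".join(columns) + " */")
        for row in cursor.fetchall():
            parts.append("/* " + ", ".join(str(v) for v in row) + " */")
    parts.append(f"-- Question: {question}")
    parts.append("SELECT")  # the model is expected to continue the query
    return "\n".join(parts)


# Toy database (hypothetical tables, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE concert (id INTEGER PRIMARY KEY,
                          singer_id INTEGER REFERENCES singer(id));
    INSERT INTO singer VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO concert VALUES (10, 1);
""")
prompt = build_text_to_sql_prompt(conn, "How many concerts did Ada give?")
print(prompt)
```

Raising `rows_per_table` or adding more tables makes the prompt longer, which is the kind of schema, content, and length tradeoff the paper evaluates.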

Link to paper: https://arxiv.org/pdf/2305.11853.pdf

Recording: https://youtu.be/8ZU6WpDRnis

On Demand | The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets

We’re excited to be joined by Samuel Marks, Postdoctoral Research Associate at Northeastern University, to discuss his paper, “The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets”. Samuel and his team curated high-quality datasets of true/false statements and used them to study in detail the structure of LLM representations of truth. Overall, they present evidence that language models linearly represent the truth or falsehood of factual statements and also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
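For a feel of how mass-mean probing works, here is a minimal sketch of its basic form: the probe direction is the difference between the mean activation of true statements and the mean activation of false statements. The two-dimensional activations are made-up toy values, and centering on the midpoint between the two class means is a simplification chosen for this example; the paper's full method has further details not reproduced here.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def mass_mean_direction(true_acts: List[Vector],
                        false_acts: List[Vector]) -> Vector:
    """Probe direction: mean of true activations minus mean of false ones."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    return [t - f for t, f in zip(mu_true, mu_false)]


def predict_true(direction: Vector, midpoint: Vector,
                 activation: Vector) -> bool:
    """Classify by which side of the midpoint the activation projects onto."""
    score = sum(d * (a - m)
                for d, a, m in zip(direction, activation, midpoint))
    return score > 0


# Toy activations (hypothetical values, not from a real model).
true_acts = [[2.0, 0.1], [1.8, -0.2]]
false_acts = [[-2.0, 0.0], [-2.2, 0.3]]

direction = mass_mean_direction(true_acts, false_acts)
midpoint = mean_vector([mean_vector(true_acts), mean_vector(false_acts)])

print(predict_true(direction, midpoint, [1.5, 0.0]))
print(predict_true(direction, midpoint, [-1.0, 0.2]))
```

Unlike a trained linear probe, nothing is optimized here: the direction comes directly from the two class means.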

Link to paper: https://arxiv.org/abs/2310.06824

Recording: https://youtu.be/7XNqsFA0Znw

On-Demand | Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

This week, we’re discussing “Decomposing Language Models Into Understandable Components”, which addresses the challenge of understanding the inner workings of neural networks, drawing parallels with the complexity of human brain function. It explores the concept of “features” (patterns of neuron activations), which provide a more interpretable way to dissect neural networks. By decomposing a layer of neurons into thousands of features, this approach uncovers hidden model properties that are not evident when examining individual neurons. These features are demonstrated to be more interpretable and consistent, offering the potential to steer model behavior and improve AI safety.

Link to paper: https://transformer-circuits.pub/2023/monosemantic-features/index.html

Recording: https://www.youtube.com/watch?v=hlCxSqWS6Rw

On Demand | RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models

In this paper reading, we’ll be discussing RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting. While researchers have successfully applied LLMs such as ChatGPT to reranking in an information retrieval context, such work has mostly been built on proprietary models hidden behind opaque API endpoints. This approach yields experimental results that are not reproducible and non-deterministic, threatening the veracity of outcomes that build on such shaky foundations. RankVicuna provides access to a fully open-source LLM and associated code infrastructure capable of performing high-quality reranking.

Link to paper: https://arxiv.org/abs/2309.15088v1

Recording: https://youtu.be/fAVHx89aRHU

On Demand | Explaining Grokking Through Circuit Efficiency

Join Arize Co-Founder & CEO Jason Lopatecki, and ML Solutions Engineer, SallyAnn DeLucia, as they discuss “Explaining Grokking Through Circuit Efficiency”. This paper explores novel predictions about grokking, providing significant evidence in favour of its explanation. Most strikingly, the research conducted in this paper demonstrates two novel and surprising behaviors: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.

Link to paper: https://arxiv.org/abs/2309.02390

Recording: https://youtu.be/n-hkcgd7SBw

On Demand | Large Content And Behavior Models

Join Arize’s Amber Roberts and SallyAnn DeLucia as they discuss “Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior”. This paper highlights that while LLMs have great generalization capabilities, they struggle to effectively predict and optimize communication to get the desired receiver behavior. We’ll explore whether this might be because of a lack of “behavior tokens” in LLM training corpora and how Large Content Behavior Models (LCBMs) might help to solve this issue.

Link to paper: https://arxiv.org/abs/2309.00359

Recording: https://www.youtube.com/watch?v=KY76SCEjEIo

On Demand | Skeleton of Thought: LLMs Can Do Parallel Decoding

Join us for an exploration of the ‘Skeleton-of-Thought’ (SoT) approach, aimed at reducing large language model latency while enhancing answer quality. We’ll be joined by two of the paper’s authors, Xuefei Ning and Zinan Lin. SoT guides LLMs to construct an answer skeleton first and then elaborate each point in parallel, achieving speed-ups of up to 2.39x across 11 models. Don’t miss the opportunity to delve into this human-inspired optimization strategy and its implications for efficient, high-quality language generation.
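The two-stage idea can be sketched in a few lines: one call produces a short skeleton, then each skeleton point is expanded in a separate call, and those calls run concurrently. Everything named here is an assumption for illustration: `skeleton_of_thought`, the prompt wording, and `fake_llm`, which stands in for a real model call so the sketch runs offline. It is not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def skeleton_of_thought(question: str, llm: Callable[[str], str]) -> str:
    # Stage 1: a single call asks for a short outline of the answer.
    skeleton = llm(
        f"List 3-5 short bullet points that outline an answer to: {question}"
    )
    points: List[str] = [
        line.strip("- ").strip()
        for line in skeleton.splitlines()
        if line.strip()
    ]

    # Stage 2: expand every point independently, so the calls can overlap.
    def expand(point: str) -> str:
        return llm(
            f"Question: {question}\n"
            f"Expand this point in 1-2 sentences: {point}"
        )

    with ThreadPoolExecutor(max_workers=max(1, len(points))) as pool:
        expansions = list(pool.map(expand, points))  # map preserves order
    return "\n".join(expansions)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns canned text."""
    if prompt.startswith("List"):
        return "- Define the term\n- Give an example"
    return "Expanded: " + prompt.rsplit(": ", 1)[-1]


answer = skeleton_of_thought("What is RAG?", fake_llm)
print(answer)
```

With a real model behind `llm`, the latency saving comes from stage 2: the expansions are issued at the same time rather than generated one after another.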

Link to paper: https://arxiv.org/abs/2307.15337


On-Demand | Extending the Context Window of LLaMA Models

This week, we’re joined by Frank Liu of Zilliz. The paper examines Position Interpolation (PI), a method that extends the context window of LLaMA models to up to 32,768 positions with minimal fine-tuning. The extended models showed strong results on tasks requiring long context and retained their quality within the original context window. PI avoids catastrophic attention score issues by linearly down-scaling input position indices to fit the original context window. The method’s stability was demonstrated, and existing optimization and infrastructure could be reused in the extended models. We will also discuss the write-up “Extending Context is Hard… But Not Impossible”, available at https://kaiokendev.github.io/context.
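The down-scaling step is simple enough to show directly. In this minimal sketch, each position index is multiplied by the ratio of the original context length to the extended one, so every index stays inside the range the model saw during pre-training. The function name `interpolated_positions` and the example lengths (2,048 and 8,192) are assumptions chosen for illustration.

```python
from typing import List


def interpolated_positions(seq_len: int, original_ctx: int,
                           extended_ctx: int) -> List[float]:
    """Linearly down-scale position indices into the original range.

    Position i becomes i * (original_ctx / extended_ctx), so the largest
    index is always below original_ctx. The resulting fractional positions
    are what would be passed to the positional encoding.
    """
    if seq_len > extended_ctx:
        raise ValueError("sequence is longer than the extended context")
    scale = original_ctx / extended_ctx
    return [i * scale for i in range(seq_len)]


positions = interpolated_positions(seq_len=8192, original_ctx=2048,
                                   extended_ctx=8192)
print(positions[:5], max(positions))
```

The alternative, feeding indices beyond the trained range, is extrapolation; interpolation trades that for finer-grained positions inside a range the model already handles.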

Link to Paper: https://arxiv.org/pdf/2306.15595.pdf


On Demand | Llama 2

In this paper reading, we explore the paper “Llama 2: Open Foundation and Fine-Tuned Chat Models.” The paper introduces Llama 2, a collection of pretrained and fine-tuned large language models ranging from 7 billion to 70 billion parameters. Their fine-tuned model, Llama 2-Chat, is specifically designed for dialogue use cases and showcases superior performance on various benchmarks. Through human evaluations for helpfulness and safety, Llama 2-Chat emerges as a promising alternative to closed-source models. Discover the approach to fine-tuning and safety improvements, allowing us to foster responsible development and contribute to this rapidly evolving field.

Link to Paper: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Recording: https://www.youtube.com/watch?v=HyppoCyOwfY

On-Demand | Lost in the Middle

This paper examines how well language models utilize longer input contexts. The study focuses on multi-document question answering and key-value retrieval tasks. The researchers find that performance is highest when relevant information is at the beginning or end of the context. Accessing information in the middle of long contexts leads to significant performance degradation. Even explicitly long-context models experience decreased performance as the context length increases. The analysis enhances our understanding and offers new evaluation protocols for future long-context models.
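A minimal sketch of the kind of position sweep this evaluation implies: keep the same set of documents, move the one containing the answer to each possible slot, and compare accuracy across slots. The helper names `build_context` and `position_sweep` and the `Document [i]` layout are assumptions for this example, not the paper's exact prompt format.

```python
from typing import List


def build_context(gold_doc: str, distractors: List[str],
                  gold_position: int) -> str:
    """Place the answer-bearing document at a chosen index among distractors."""
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    return "\n".join(f"Document [{i + 1}] {doc}" for i, doc in enumerate(docs))


def position_sweep(gold_doc: str, distractors: List[str]) -> List[str]:
    """One context per possible position of the gold document."""
    return [
        build_context(gold_doc, distractors, position)
        for position in range(len(distractors) + 1)
    ]


contexts = position_sweep(
    "The capital of Australia is Canberra.",
    ["Kangaroos are marsupials.", "Sydney hosts a famous opera house."],
)
for context in contexts:
    print(context, end="\n\n")
```

Each context would be sent to the model with the same question; plotting accuracy against the gold document's position is what exposes a dip in the middle.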

Link to paper: https://arxiv.org/abs/2307.03172

Link to recording:

On-Demand | Orca

Recent research focuses on improving smaller models through imitation learning using outputs from large foundation models (LFMs). Challenges include limited imitation signals, homogeneous training data, and a lack of rigorous evaluation, leading to overestimation of small model capabilities. To address this, the authors introduce Orca, a 13-billion-parameter model that learns to imitate LFMs’ reasoning process. Orca leverages rich signals from GPT-4, surpassing state-of-the-art models by over 100% in complex zero-shot reasoning benchmarks. It also shows competitive performance in professional and academic exams without CoT. Learning from step-by-step explanations, generated by humans or advanced AI models, enhances model capabilities and skills.

Link to Paper: https://arxiv.org/abs/2306.02707

Link to Recording: https://www.youtube.com/watch?v=BswvaWZdWw4

On-Demand | Generalized LoRA (GLoRA)

Introducing GLoRA: a universal, parameter-efficient fine-tuning approach for diverse tasks. GLoRA enhances LoRA with a generalized prompt module, optimizing pre-trained model weights and activations. Its scalable, layer-wise structure search enables efficient parameter adaptation. GLoRA excels in transfer learning, few-shot learning, and domain generalization, outperforming previous methods on various datasets. With fewer parameters and no extra inference cost, GLoRA is a practical solution for resource-limited applications. Join us to explore GLoRA’s capabilities in this interactive community paper reading!

Link to Paper: https://arxiv.org/abs/2306.07967

Recording: https://www.youtube.com/watch?v=GCh2HWOKiaU&t=5s

On-Demand | HyDE

Explore HyDE, a thrilling zero-shot learning technique that combines GPT-3’s language understanding with contrastive text encoders. HyDE revolutionizes information retrieval and grounding in real-world data by generating hypothetical documents from queries and retrieving similar real-world documents. It outperforms traditional unsupervised retrievers, rivaling fine-tuned retrievers across diverse tasks and languages.

This leap in zero-shot learning efficiently retrieves relevant real-world information without task-specific fine-tuning, broadening AI model applicability and effectiveness. Join us for a paper reading on how HyDE works!
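The retrieval flow described above can be sketched end to end with toy components: generate a hypothetical answer document for the query, embed that document instead of the query, and return the real documents closest to it. The bag-of-words `embed` function and the canned `fake_generate` are stand-ins invented for this example so that it runs offline; HyDE itself uses an instruction-following LLM and a contrastive text encoder.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding standing in for a contrastive encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hyde_retrieve(query: str, corpus: List[str],
                  generate: Callable[[str], str], k: int = 1) -> List[str]:
    # 1. Write a hypothetical document that answers the query.
    #    It may contain factual errors; only its embedding is used.
    hypothetical = generate(query)
    # 2. Embed the hypothetical document rather than the query itself.
    query_vec = embed(hypothetical)
    # 3. Return the real documents most similar to it.
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)),
                    reverse=True)
    return ranked[:k]


def fake_generate(query: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    return "the eiffel tower in paris is about 300 metres tall"


corpus = [
    "The Eiffel Tower in Paris stands roughly 330 metres tall",
    "Bananas are rich in potassium",
]
top = hyde_retrieve("how high is the eiffel tower", corpus, fake_generate)
print(top)
```

Note that the generated text gets the height wrong, yet still lands next to the correct real document; grounding comes from the retrieved document, not from the hypothetical one.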

Link to Paper: https://arxiv.org/abs/2212.10496

Recording: https://youtu.be/PvT8ntmm1Xs

On-Demand | VOYAGER

VOYAGER, the first LLM-powered embodied lifelong learning agent in Minecraft, autonomously explores the world, acquires skills, and makes discoveries without human intervention. It outperforms previous approaches, achieves exceptional proficiency in playing Minecraft, and successfully applies its learned skills to solve novel tasks in different Minecraft worlds, surpassing techniques that struggle with generalization.

Link to Paper: https://arxiv.org/pdf/2305.16291.pdf

Link to Recording: https://www.youtube.com/watch?v=BU3w_AbCEbA

On-Demand | Retrieval-Augmented Generation (RAG)

This week we’re diving into the world of Retrieval-Augmented Generation (RAG)!

We know GPT-like LLMs are great at soaking up knowledge during pre-training, and fine-tuning them can lead to some pretty great, specific results. But when it comes to tasks that really demand heavy knowledge lifting, they still fall short. Plus, it’s not exactly easy to figure out where their answers come from or how to update their knowledge.

Enter RAG models, a hybrid beast that combines the best of both worlds: the learning power of pre-trained models (the parametric part), and an explicit, non-parametric memory — imagine a searchable index of all of Wikipedia.
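To show the two parts side by side, here is a deliberately simplified sketch: a tiny searchable index plays the role of the non-parametric memory, and the retrieved passage is placed in a prompt for a generator (the parametric part, not included here). The word-overlap scoring, the function names `retrieve` and `build_prompt`, and the two-passage index are assumptions for illustration; the paper's models use a learned dense retriever and train retriever and generator together.

```python
from typing import List


def retrieve(query: str, index: List[str], k: int = 1) -> List[str]:
    """Return the k passages sharing the most words with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        index,
        key=lambda passage: len(q_terms & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Combine retrieved passages and the question into one generator input."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the passages below.\n"
        f"{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Toy non-parametric memory (hypothetical passages).
index = [
    "Canberra is the capital of Australia",
    "The kangaroo is a marsupial",
]
question = "what is the capital of australia"
passages = retrieve(question, index)
prompt = build_prompt(question, passages)
print(prompt)
```

Two of the pain points mentioned above map directly onto this structure: the numbered passages show where an answer came from, and updating knowledge means editing the index rather than retraining the model.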

Link to paper: https://arxiv.org/abs/2005.11401

On-Demand | LIMA: Less Is More for Alignment

On-Demand | Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold

This paper introduces a novel approach, DragGAN, for achieving precise control over the pose, shape, expression, and layout of objects generated by GANs. It allows users to “drag” any points of an image to specific target points; in other words, it enables the deformation of images with better control over where pixels end up to produce ultra-realistic outputs.

Paper: https://arxiv.org/abs/2305.10973

View Recording: https://youtu.be/DxzsgV8rTOw



Aparna Dhinakaran
Co-founder & Chief Product Officer

Aparna Dhinakaran is the Co-Founder and Chief Product Officer at Arize AI, a pioneer and early leader in machine learning (ML) observability. A frequent speaker at top conferences and thought leader in the space, Dhinakaran was recently named to the Forbes 30 Under 30. Before Arize, Dhinakaran was an ML engineer and leader at Uber, Apple, and TubeMogul (acquired by Adobe). During her time at Uber, she built several core ML infrastructure platforms, including Michelangelo. She has a bachelor’s from Berkeley's Electrical Engineering and Computer Science program, where she published research with Berkeley's AI Research group. She is on a leave of absence from the Computer Vision Ph.D. program at Cornell University.