Live | Every Other Wednesday
10:15am PT | 45 minutes
Join us every other Wednesday for an engaging discussion session where we delve into the latest technical papers, covering a range of topics including large language models (LLMs), generative models, ChatGPT, and more. This recurring event offers an opportunity to collectively analyze and exchange insights on cutting-edge research in these areas and their broader implications.
Join us for a deep dive into the “Agent-as-a-Judge” framework, a new paradigm for evaluating agentic systems. Whereas typical evaluation methods either focus solely on outcomes or demand extensive manual work, this approach uses agentic systems to evaluate agentic systems, offering intermediate feedback throughout the task-solving process. We’ll discuss how Agent-as-a-Judge enables scalable self-improvement.
Paper: https://arxiv.org/abs/2410.10934
Recording: https://www.youtube.com/watch?v=YhT6PhG_05U
Join us as we break down OpenAI’s real-time API. Learn how to seamlessly integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you’re building chatbots, dynamic content tools, or enhancing real-time collaboration, we’ll walk through the API’s capabilities, potential use cases, and best practices for implementation. Don’t miss this opportunity to dive into the next level of AI-driven interactions!
Recording: https://www.youtube.com/watch?v=OjAgZsS9J7E
This week, we’re diving into OpenAI’s Swarm – an experimental lightweight multi-agent framework.
Swarm enables the creation of multi-agent systems, where each agent has a defined focus and limited actions. At any given time, only one agent is in control, but it can seamlessly pass control to another.
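To make the handoff mechanism concrete, here’s a minimal sketch modeled on the examples in the Swarm repository (the agent names, instructions, and handoff function are our own illustration):

```python
from swarm import Swarm, Agent

# Hypothetical agents for illustration: a triage agent that can hand off to a refunds agent.
refunds_agent = Agent(
    name="Refunds Agent",
    instructions="Handle refund requests politely and concisely.",
)

def transfer_to_refunds():
    """Returning another Agent from a function hands control to that agent."""
    return refunds_agent

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route the user to the right specialist agent.",
    functions=[transfer_to_refunds],
)

client = Swarm()
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "I'd like a refund for my last order."}],
)
print(response.messages[-1]["content"])
```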
Recording: https://www.youtube.com/watch?v=Oyk_ifMqruQ
This week we’re taking a look at Google’s NotebookLM, a personalized AI research assistant powered by Gemini 1.5 Pro.
Recording: https://www.youtube.com/watch?v=aPPZsU3ie3U
Blog, Transcript, Podcast: https://arize.com/blog/exploring-google-notebook-lm/
This week, we’re diving into OpenAI’s latest series of models: o1-preview and o1-mini. We’ll also share insights from our recent research, where we put these models to the test! Want to know how o1-preview performs qualitatively against Claude 3.5 Sonnet? Join us!
Recording: https://www.youtube.com/watch?v=QCSn7W_w0Rg
Blog, Transcript & Podcast: https://arize.com/blog/exploring-openai-o1-preview-and-o1-mini/
A recent announcement on X boasted a fine-tuned model, Reflection 70B, with outstanding performance, claiming the results were achieved through Reflection Tuning. However, others were unable to reproduce those results. We use this recent drama in the AI community as a jumping-off point to discuss Reflection 70B and the 2023 paper on Reflection Tuning whose concepts the model draws on.
Recording: https://www.youtube.com/watch?v=noBNz_Uxqqs
Paper: https://arxiv.org/abs/2310.11716
Blog, Transcript & Podcast: https://arize.com/blog/breaking-down-reflection-tuning-enhancing-llm-performance-with-self-learning/
This week, we’re excited to be joined by Kyle O’Brien, Applied Scientist at Microsoft, to discuss his most recent paper, Composable Interventions for Language Models. Kyle and his team present a new framework, composable interventions, that allows for the study of multiple interventions applied sequentially to the same language model. The discussion will cover their key findings from extensive experiments, revealing how different interventions—such as knowledge editing, model compression, and machine unlearning—interact with each other.
Recording: https://www.youtube.com/watch?v=9wzHBVQUaos&t=1s
Paper: https://arxiv.org/pdf/2407.06483
Blog, Transcript & Podcast: https://arize.com/blog/composable-interventions-for-language-models/
This week’s paper, Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, presents a comprehensive study of the performance of various LLMs acting as judges. The researchers leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations, which they find to have high inter-annotator agreement. The study includes nine judge models and nine exam-taker models, both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge prompts to answer questions about the strengths and weaknesses of this paradigm and the potential biases it may hold.
Recording: https://www.youtube.com/watch?v=shHgMRB5Eu0
Paper: https://arxiv.org/pdf/2406.12624
Blog, Transcript & Podcast: https://arize.com/blog/judging-the-judges-llm-as-a-judge/
Meta just released Llama 3.1 405B. According to Meta, it’s “the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation.” Will the latest Llama herd ignite new applications and modeling paradigms like synthetic data generation? Will it enable the improvement and training of smaller models, as well as model distillation? Meta thinks so. We’ll take a look at what they did here, talk about open source, and decide whether we want to believe the hype.
Recording: https://www.youtube.com/watch?v=uXt6rYXnV8U
Paper: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
Blog, Transcript & Podcast: https://arize.com/blog/breaking-down-meta-llama-3/
Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints still requires heuristic “prompt engineering.” This week’s paper introduces LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. The researchers integrate their constructs into the recent DSPy programming model for LMs and present new strategies that allow DSPy to compile programs with LM Assertions into more reliable and accurate systems. They also propose strategies for using assertions at inference time for automatic self-refinement. Across four diverse case studies in text generation, they find that LM Assertions improve not only compliance with imposed rules but also downstream task performance, passing constraints up to 164% more often and generating up to 37% more high-quality responses.
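As a rough illustration of what an LM Assertion looks like in practice, here’s a minimal DSPy-style sketch (the module, signature, and length constraint are our own example; the API names follow the DSPy assertions documentation around the time of the paper and may have since changed):

```python
import dspy

class TweetWriter(dspy.Module):
    """Generates a tweet about a topic, with an LM Assertion on length."""

    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("topic -> tweet")

    def forward(self, topic):
        result = self.generate(topic=topic)
        # Soft constraint: if violated, DSPy can backtrack and retry generation
        # with the failure message added to the prompt.
        dspy.Suggest(
            len(result.tweet) <= 280,
            "The tweet must be at most 280 characters.",
        )
        return result
```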
Recording: https://www.youtube.com/watch?v=Hf6u4SDSFcg
Paper: https://arxiv.org/pdf/2312.13382
Blog, Transcript & Podcast: https://arize.com/blog/dspy-assertions-computational-constraints/
We’re excited to host Sai Kolasani, researcher at UC Berkeley’s RISE Lab, to talk about his work on RAFT: Adapting Language Model to Domain Specific RAG. RAFT is a training recipe that improves an LLM’s ability to answer questions in an “open-book,” in-domain setting. Given a question and a set of retrieved documents, the model is trained to ignore documents that don’t help answer the question (aka distractor documents). This, coupled with RAFT’s chain-of-thought-style responses, helps improve the model’s ability to reason. In domain-specific RAG, RAFT consistently improves performance across the PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe for adapting pre-trained LLMs to in-domain RAG.
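For intuition, here’s a sketch of how a single RAFT-style training example might be assembled: the oracle document that answers the question is mixed with distractor documents, and the target output is a chain-of-thought answer that cites the oracle. The field names, sampling ratio, and toy documents are illustrative, not the paper’s exact configuration:

```python
import random

def build_raft_example(question, oracle_doc, distractor_docs, cot_answer, p_oracle=0.8):
    """Assemble one RAFT-style fine-tuning example.

    With probability p_oracle the oracle document is kept in the context;
    otherwise the model must learn to answer (or abstain) from distractors alone.
    """
    context_docs = list(distractor_docs)
    if random.random() < p_oracle:
        context_docs.append(oracle_doc)
    random.shuffle(context_docs)

    prompt = "\n\n".join(
        [f"Document [{i + 1}]: {doc}" for i, doc in enumerate(context_docs)]
        + [f"Question: {question}"]
    )
    # The target is a chain-of-thought-style answer that quotes the oracle document.
    return {"prompt": prompt, "completion": cot_answer}

example = build_raft_example(
    question="What enzyme do ACE inhibitors block?",
    oracle_doc="ACE inhibitors lower blood pressure by blocking angiotensin-converting enzyme.",
    distractor_docs=["Beta blockers reduce heart rate.", "Statins lower LDL cholesterol."],
    cot_answer="The context states ACE inhibitors block angiotensin-converting enzyme, so the answer is: angiotensin-converting enzyme.",
)
```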
Recording: https://www.youtube.com/watch?v=cbQ5rm1jOuU
Blog, Transcript & Podcast: https://arize.com/blog/raft-adapting-language-model-to-domain-specific-rag/
It’s been an exciting couple weeks for GenAI! Join us as we discuss the latest research from OpenAI and Anthropic. We’re excited to chat about this significant step forward in understanding how LLMs work and the implications it has for deeper understanding of the neural activity of language models. Hope you can join the conversation!
OpenAI’s Paper: https://openai.com/index/extracting-concepts-from-gpt-4/
Anthropic’s Paper: https://transformer-circuits.pub/2024/scaling-monosemanticity/
Recording: https://www.youtube.com/watch?v=fkW0bGnbDkQ
Blog, Transcript & Podcast: https://arize.com/blog/llm-interpretability-and-sparse-autoencoders-openai-anthropic/
Ensuring alignment (aka: making models behave in accordance with human intentions) has become a critical task before deploying LLMs in real-world applications. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness.
The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.
Paper: https://arxiv.org/abs/2308.05374
Recording: https://www.youtube.com/watch?v=yKN1f4Gkjro
Blog, Transcript & Podcast: https://arize.com/blog/trustworthy-llms-a-survey-and-guideline-for-evaluating-large-language-models-alignment/
Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators often inherit the problems of the LLMs they evaluate, requiring further human validation.
This week’s paper explores EvalGen, a mixed-initiative approach to aligning LLM-generated evaluation functions with human preferences. EvalGen assists users both in developing criteria for acceptable LLM outputs and in developing functions that check those criteria, ensuring evaluations reflect the users’ own grading standards.
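As a simplified illustration of the alignment step, here’s a sketch that scores candidate assertion functions by how well they agree with a user’s thumbs-up/thumbs-down grades on sample outputs. The candidate assertions and grades are invented for illustration; EvalGen’s actual workflow also covers criteria elicitation and LLM-generated assertions:

```python
def contains_citation(output: str) -> bool:
    return "[" in output and "]" in output

def is_concise(output: str) -> bool:
    return len(output.split()) <= 100

candidate_assertions = {"contains_citation": contains_citation, "is_concise": is_concise}

# Hypothetical human grades: True = thumbs up, False = thumbs down.
graded_outputs = [
    ("Short answer with a source [1].", True),
    ("Rambling answer with no source ...", False),
]

# Keep the assertions whose pass/fail decisions best match the human grades.
alignment = {
    name: sum(fn(out) == grade for out, grade in graded_outputs) / len(graded_outputs)
    for name, fn in candidate_assertions.items()
}
selected = [name for name, score in alignment.items() if score >= 0.75]
print(alignment, selected)
```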
Paper: https://arxiv.org/abs/2404.12272
Recording: https://www.youtube.com/watch?v=kco7kA4qO-0
Blog, Transcript & Podcast: https://arize.com/blog/breaking-down-evalgen-who-validates-the-validators/
This week, we’re covering ReAct, a prompting paradigm that interleaves reasoning traces with task-specific actions, letting an LLM plan, act, and incorporate observations from tools or environments.
Paper: https://arxiv.org/pdf/2210.03629.pdf
Recording: https://www.youtube.com/watch?v=QX-p-vsDoiQ
Blog, Transcript & Podcast: https://arize.com/blog/keys-to-understanding-react/
This week, we’re covering Amazon’s time series model: Chronos. Developing accurate machine-learning-based forecasting models has traditionally required substantial dataset-specific tuning and model customization. Chronos, however, is built on a language model architecture and trained on billions of tokenized time series observations, enabling it to provide accurate zero-shot forecasts that match or exceed purpose-built models.
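Here’s a rough sketch of what zero-shot forecasting with Chronos looks like, based on the chronos-forecasting package; the model checkpoint and toy series are placeholders:

```python
import numpy as np
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# A toy series; in practice this would be your own history.
context = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])

# Zero-shot: no fine-tuning on this series. Output shape is
# [num_series, num_samples, prediction_length]; take the median across samples.
forecast = pipeline.predict(context, prediction_length=4)
median_forecast = np.quantile(forecast[0].numpy(), 0.5, axis=0)
print(median_forecast)
```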
Paper: https://arxiv.org/abs/2403.07815
Recording: https://www.youtube.com/watch?v=yKKWCqABspw
Blog, Transcript & Podcast: https://arize.com/blog/demystifying-chronos-learning-the-language-of-time-series/
Join us for this week’s Arize Community Paper Reading where we’ll dive into the latest buzz in the AI world – the arrival of Claude-3, the newest model in the LLM space, challenging the likes of GPT-4.
We will explore Anthropic’s recent paper, and walk through Arize’s latest research comparing Claude-3 to GPT-4. Whether you’re a researcher, practitioner, or simply curious about the future of AI, we hope you’ll join the conversation.
Recording: https://www.youtube.com/watch?v=mU6Ob-7eAhY
Blog, Transcript & Podcast: https://arize.com/blog/anthropic-claude-3/
We’re exploring Reinforcement Learning in the Era of LLMs this week with Claire Longo, Arize’s Head of Customer Success. Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). This week’s paper aims to link research in conventional RL to the RL techniques used in LLM research, demystifying the technique by discussing why, when, and how RL excels.
Paper: https://arxiv.org/abs/2310.06147
Recording: https://www.youtube.com/watch?v=g2x1A2SzyU0
Blog, Transcript & Podcast: https://arize.com/blog/reinforcement-learning-in-the-era-of-llms/
This week, we’re delighted to be joined by community member & AI Engineer Vibhu Sapra to discuss OpenAI’s technical report on their Text-To-Video Generation Model: Sora. We’ll also explore recent research done on EvalCrafter: Benchmarking and Evaluating Large Video Generation Models.
Recording: https://www.youtube.com/watch?v=dUv9GoQMDb0&t=3s
This week, we’re discussing RAG vs Fine-tuning, a paper that explores a pipeline for fine-tuning and RAG and presents the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. The authors propose a pipeline that consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 to evaluate the results. Overall, the results point to how systems built using LLMs can be adapted to respond to and incorporate knowledge along a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.
Link to paper: https://arxiv.org/abs/2401.08406
Recording: https://www.youtube.com/watch?v=EbEPHOABgSY
With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.
Link to paper: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
Recording: https://www.youtube.com/watch?v=qtBk7wmwCA0
We’re back with Arize CPO & Co-Founder, Aparna Dhinakaran, for a continued exploration of the new kids on the block: Gemini and Mixtral-8x7B. Catch up on Part 1 here.
In this virtual discussion, Arize CPO & Co-Founder Aparna Dhinakaran will be joined by a couple of members of her team for an exploration of the new kids on the block: Gemini and Mixtral-8x7B.
Recording: https://www.youtube.com/watch?v=B_-syBNYWzU
We’re thrilled to be joined by Shuaichen Chang, LLM researcher and the author of this week’s paper to discuss his findings. Shuaichen’s research investigates the impact of prompt constructions on the performance of large language models (LLMs) in the text-to-SQL task, particularly focusing on zero-shot, single-domain, and cross-domain settings. Shuaichen and his team explore various strategies for prompt construction, evaluating the influence of database schema, content representation, and prompt length on LLMs’ effectiveness. The findings emphasize the importance of careful consideration in constructing prompts, highlighting the crucial role of table relationships and content, the effectiveness of in-domain demonstration examples, and the significance of prompt length in cross-domain scenarios.
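To ground the discussion, here’s a sketch of a zero-shot text-to-SQL prompt that includes the database schema as CREATE TABLE statements plus a few sample rows, one style of content representation in this line of work; the schema and question are our own toy example, not from the paper:

```python
schema = """CREATE TABLE singer (singer_id INT PRIMARY KEY, name TEXT, country TEXT, age INT);
CREATE TABLE concert (concert_id INT PRIMARY KEY, singer_id INT, year INT,
                      FOREIGN KEY (singer_id) REFERENCES singer(singer_id));"""

sample_rows = """-- singer: (1, 'Joe', 'France', 52), (2, 'Mei', 'Japan', 31)
-- concert: (10, 1, 2019), (11, 2, 2021)"""

question = "How many concerts were held after 2020 by singers younger than 40?"

prompt = f"""Given the database schema and sample rows below, write a SQLite query
that answers the question. Return only SQL.

{schema}

{sample_rows}

Question: {question}
SQL:"""
print(prompt)
```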
Link to paper: https://arxiv.org/pdf/2305.11853.pdf
Recording: https://youtu.be/8ZU6WpDRnis
We’re excited to be joined by Samuel Marks, Postdoctoral Research Associate at Northeastern University, to discuss his paper, “The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets”. Samuel and his team curated high-quality datasets of true/false statements and used them to study in detail the structure of LLM representations of truth. Overall, they present evidence that language models linearly represent the truth or falsehood of factual statements and also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
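In simplified form, mass-mean probing takes the difference between the mean activation on true statements and the mean activation on false statements as a “truth direction,” then scores new statements by their projection onto it. A minimal sketch with random stand-in activations (the paper also studies a covariance-adjusted variant that this omits):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # hidden size of the probed layer

# Stand-ins for residual-stream activations at a chosen layer/token position.
acts_true = rng.normal(loc=0.2, size=(100, d))    # activations on true statements
acts_false = rng.normal(loc=-0.2, size=(100, d))  # activations on false statements

# Mass-mean direction: difference of class means.
theta = acts_true.mean(axis=0) - acts_false.mean(axis=0)

def truth_score(activation: np.ndarray) -> float:
    """Project an activation onto the truth direction; larger = more 'true'."""
    return float(activation @ theta)

new_activation = rng.normal(loc=0.2, size=d)
print(truth_score(new_activation) > 0)  # expected: True for a 'true-like' activation
```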
Link to paper: https://arxiv.org/abs/2310.06824
Recording: https://youtu.be/7XNqsFA0Znw
This week, we’re discussing “Decomposing Language Models Into Understandable Components,” which addresses the challenge of understanding the inner workings of neural networks, drawing parallels with the complexity of human brain function. It explores the concept of “features” (patterns of neuron activations), providing a more interpretable way to dissect neural networks. By decomposing a layer of neurons into thousands of features, this approach uncovers hidden model properties that are not evident when examining individual neurons. These features are demonstrated to be more interpretable and consistent, offering the potential to steer model behavior and improve AI safety.
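The decomposition in the paper comes from dictionary learning with a sparse autoencoder trained on a layer’s activations. Here’s a minimal PyTorch sketch of that setup; the sizes and sparsity penalty are illustrative rather than the paper’s settings:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes activations into many sparsely-active 'features'."""

    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(features)             # reconstruction of the original activations
        return recon, features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)  # stand-in for a batch of MLP-layer activations
recon, features = sae(acts)

# Reconstruction loss plus an L1 penalty that pushes most features to zero.
l1_coeff = 1e-3
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
```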
Link to paper: https://transformer-circuits.pub/2023/monosemantic-features/index.html
Recording: https://www.youtube.com/watch?v=hlCxSqWS6Rw
In this paper reading, we’ll be discussing RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting. While researchers have successfully applied LLMs such as ChatGPT to reranking in an information retrieval context, such work has mostly been built on proprietary models hidden behind opaque API endpoints. This approach yields experimental results that are not reproducible and non-deterministic, threatening the veracity of outcomes that build on such shaky foundations. RankVicuna provides access to a fully open-source LLM and associated code infrastructure capable of performing high-quality reranking.
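For context, listwise reranking prompts the model with the query and a numbered list of candidate passages and asks it to return a permutation of the identifiers. Here’s a sketch of what such a prompt looks like; RankVicuna’s exact template differs, and the query and passages are our own toy example:

```python
def listwise_rerank_prompt(query: str, passages: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"I will provide you with {len(passages)} passages, each labeled with an identifier.\n"
        f"Rank the passages by their relevance to the query: {query}\n\n"
        f"{numbered}\n\n"
        "Output the identifiers in descending order of relevance, e.g. [2] > [1] > [3]."
    )

print(listwise_rerank_prompt(
    "what causes aurora borealis",
    ["Solar wind particles excite atmospheric gases.",
     "Auroras were named after the Roman goddess of dawn."],
))
```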
Link to paper: https://arxiv.org/abs/2309.15088v1
Recording: https://youtu.be/fAVHx89aRHU
Join Arize Co-Founder & CEO Jason Lopatecki, and ML Solutions Engineer, SallyAnn DeLucia, as they discuss “Explaining Grokking Through Circuit Efficiency”. This paper explores novel predictions about grokking, providing significant evidence in favour of its explanation. Most strikingly, the research conducted in this paper demonstrates two novel and surprising behaviors: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.
Link to paper: https://arxiv.org/abs/2309.02390
Recording: https://youtu.be/n-hkcgd7SBw
Join Arize’s Amber Roberts and SallyAnn DeLucia as they discuss “Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior”. This paper highlights that while LLMs have great generalization capabilities, they struggle to effectively predict and optimize communication to get the desired receiver behavior. We’ll explore whether this might be because of a lack of “behavior tokens” in LLM training corpora and how Large Content Behavior Models (LCBMs) might help to solve this issue.
Link to paper: https://arxiv.org/abs/2309.00359
Recording: https://www.youtube.com/watch?v=KY76SCEjEIo
Join us for an exploration of the ‘Skeleton-of-Thought’ (SoT) approach, aimed at reducing large language model latency while enhancing answer quality, with two of the paper’s authors, Xuefei Ning and Zinan Lin, in attendance. SoT’s methodology guides LLMs to construct an answer skeleton first and then elaborate on each point in parallel, achieving impressive speed-ups of up to 2.39x across 11 models. Don’t miss the opportunity to delve into this human-inspired optimization strategy and its implications for efficient and high-quality language generation.
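In outline, SoT makes two kinds of calls: one to draft a skeleton of short bullet points, then one per point, issued in parallel, to expand each point. A minimal sketch using a hypothetical `llm()` completion function (the prompts and point parsing are our own simplification):

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    """Placeholder for a call to your chat-completion endpoint of choice."""
    raise NotImplementedError

def skeleton_of_thought(question: str) -> str:
    # Stage 1: ask for a short skeleton of 3-5 concise bullet points.
    skeleton = llm(f"Give a skeleton of 3-5 short bullet points (a few words each) "
                   f"for answering: {question}")
    points = [line.strip("-• ").strip() for line in skeleton.splitlines() if line.strip()]

    # Stage 2: expand every point independently, so the calls can run in parallel.
    def expand(point: str) -> str:
        return llm(f"Question: {question}\nExpand this point into 1-2 sentences: {point}")

    with ThreadPoolExecutor() as pool:
        expansions = list(pool.map(expand, points))

    return "\n".join(f"{p}: {e}" for p, e in zip(points, expansions))
```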
Link to paper: https://arxiv.org/abs/2307.15337
During this week’s paper reading event, we are thrilled to be joined by Frank Liu of Zilliz, who will be sharing valuable insights with us. This paper examines Position Interpolation (PI), a method for extending the context window sizes of LLaMA models up to 32,768 positions with minimal fine-tuning. The extended models showed strong results on tasks requiring long context and retained their quality within the original context window. PI avoids catastrophic attention score issues by linearly down-scaling input position indices. The method’s stability was demonstrated, and existing optimization and infrastructure could be reused in the extended models. During the event, we will also discuss the write-up “Extending Context is Hard… But Not Impossible” available at https://kaiokendev.github.io/context.
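The core of PI is essentially a one-line change: instead of extrapolating to unseen position indices, the indices are linearly scaled down so an extended context still maps into the originally trained range before computing rotary embeddings. A sketch, assuming standard RoPE with the usual base of 10000 (the lengths are illustrative):

```python
import torch

def rope_angles(seq_len, dim, trained_len=2048, extended_len=8192, base=10000.0):
    """Rotary-embedding angles with Position Interpolation applied."""
    positions = torch.arange(seq_len, dtype=torch.float32)
    # Position Interpolation: squeeze extended positions back into the trained range.
    positions = positions * (trained_len / extended_len)  # e.g. position 8191 -> ~2047.75
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions, inv_freq)  # [seq_len, dim/2] angles fed into sin/cos

angles = rope_angles(seq_len=8192, dim=128)
```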
Link to Paper: https://arxiv.org/pdf/2306.15595.pdf
Recording: https://youtu.be/HDm9YjlLE60
In this paper reading, we explore the paper “Llama 2: Open Foundation and Fine-Tuned Chat Models.” The paper introduces Llama 2, a collection of pretrained and fine-tuned large language models ranging from 7 billion to 70 billion parameters. The fine-tuned model, Llama 2-Chat, is specifically designed for dialogue use cases and showcases superior performance on various benchmarks. Through human evaluations for helpfulness and safety, Llama 2-Chat emerges as a promising alternative to closed-source models. Discover the authors’ approach to fine-tuning and safety improvements, which aims to foster responsible development in this rapidly evolving field.
Link to Paper: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
Recording: https://www.youtube.com/watch?v=HyppoCyOwfY
This paper examines how well language models utilize longer input contexts. The study focuses on multi-document question answering and key-value retrieval tasks. The researchers find that performance is highest when relevant information is at the beginning or end of the context. Accessing information in the middle of long contexts leads to significant performance degradation. Even explicitly long-context models experience decreased performance as the context length increases. The analysis enhances our understanding and offers new evaluation protocols for future long-context models.
Link to paper: https://arxiv.org/abs/2307.03172
Link to recording:
Recent research focuses on improving smaller models through imitation learning using outputs from large foundation models (LFMs). Challenges include limited imitation signals, homogeneous training data, and a lack of rigorous evaluation, leading to overestimation of small model capabilities. To address this, the authors introduce Orca, a 13-billion parameter model that learns to imitate LFMs’ reasoning process. Orca leverages rich signals from GPT-4, surpassing state-of-the-art models by over 100% on complex zero-shot reasoning benchmarks. It also shows competitive performance in professional and academic exams without chain-of-thought (CoT) prompting. Learning from step-by-step explanations, generated by humans or advanced AI models, enhances model capabilities and skills.
Link to Paper: https://arxiv.org/abs/2306.02707
Link to Recording: https://www.youtube.com/watch?v=BswvaWZdWw4
Introducing GLoRA: a universal, parameter-efficient fine-tuning approach for diverse tasks. GLoRA enhances LoRA with a generalized prompt module, optimizing pre-trained model weights and activations. Its scalable, layer-wise structure search enables efficient parameter adaptation. GLoRA excels in transfer learning, few-shot learning, and domain generalization, outperforming previous methods on various datasets. With fewer parameters and no extra inference cost, GLoRA is a practical solution for resource-limited applications. Join us to explore GLoRA’s capabilities in this interactive community paper reading!
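For readers newer to this family of methods, GLoRA generalizes the low-rank adapter idea. As a reference point, here’s a minimal LoRA-style adapter in PyTorch; GLoRA additionally learns scaling/shift terms and searches the per-layer structure, which this baseline sketch does not capture:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # the pre-trained weights stay frozen
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))  # only A and B receive gradients
```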
Link to Paper: https://arxiv.org/abs/2306.07967
Explore HyDE, a thrilling zero-shot learning technique that combines GPT-3’s language understanding with contrastive text encoders. HyDE revolutionizes information retrieval and grounding in real-world data by generating hypothetical documents from queries and retrieving similar real-world documents. It outperforms traditional unsupervised retrievers, rivaling fine-tuned retrievers across diverse tasks and languages.
This leap in zero-shot learning efficiently retrieves relevant real-world information without task-specific fine-tuning, broadening AI model applicability and effectiveness. Join us for a paper reading on how HyDE works!
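A minimal sketch of the HyDE retrieval loop using the OpenAI Python client; the model names, toy corpus, and prompt are placeholders (the original paper used InstructGPT with a Contriever encoder), so treat this as an illustration of the flow rather than the paper’s setup:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def hyde_search(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # 1. Generate a hypothetical document that *answers* the query.
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a short passage answering: {query}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical document (not the query) and retrieve real documents near it.
    q_vec = embed(hypo)
    doc_vecs = np.stack([embed(d) for d in corpus])
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    return [corpus[i] for i in np.argsort(-sims)[:k]]
```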
Link to Paper: https://arxiv.org/abs/2212.10496
Recording: https://youtu.be/PvT8ntmm1Xs
VOYAGER, the first LLM-powered embodied lifelong learning agent in Minecraft, autonomously explores the world, acquires skills, and makes discoveries without human intervention. It outperforms previous approaches, achieving exceptional proficiency in playing Minecraft and successfully applies its learned skills to solve novel tasks in different Minecraft worlds, surpassing techniques that struggle with generalization.
Link to Paper: https://arxiv.org/pdf/2305.16291.pdf
Link to Recording: https://www.youtube.com/watch?v=BU3w_AbCEbA
This week we’re diving into the world of Retrieval-Augmented Generation (RAG)!
We know GPT-like LLMs are great at soaking up knowledge during pre-training and fine-tuning them can lead to some pretty great, specific results. But when it comes to tasks that really demand heavy knowledge lifting, they still fall short. Plus, it’s not exactly easy to figure out where their answers come from or how to update their knowledge.
Enter RAG models, a hybrid beast that combines the best of both worlds: the learning power of pre-trained models (the parametric part), and an explicit, non-parametric memory — imagine a searchable index of all of Wikipedia.
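That non-parametric memory is essentially a dense vector index over a document collection. Here’s a minimal sketch with FAISS; the embedding function is a stand-in (a real RAG system uses a trained dense retriever such as DPR), and the documents are toy examples:

```python
import faiss
import numpy as np

d = 384  # embedding dimension (stand-in)

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder encoder; in practice a dense retriever produces these vectors."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), d)).astype("float32")

documents = ["The Eiffel Tower is in Paris.", "Photosynthesis occurs in chloroplasts."]
index = faiss.IndexFlatIP(d)   # inner-product index = the non-parametric memory
index.add(embed(documents))

query_vec = embed(["Where is the Eiffel Tower?"])
scores, ids = index.search(query_vec, k=1)
retrieved = documents[ids[0][0]]

# The retrieved passage is then given to the parametric model (the LLM) as context.
prompt = f"Context: {retrieved}\n\nQuestion: Where is the Eiffel Tower?\nAnswer:"
```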
Link to paper: https://arxiv.org/abs/2005.11401
This paper introduces a novel approach, DragGAN, for achieving precise control over the pose, shape, expression, and layout of objects generated by GANs. It allows users to “drag” any points of an image to specific target points — in other words, it enables the deformation of images with better control over where pixels end up to produce ultra-realistic outputs.
Paper: https://arxiv.org/abs/2305.10973
View Recording: https://youtu.be/DxzsgV8rTOw