Time Series Evals with OpenAI o1-preview

We benchmarked o1-preview on our hardest eval task: time series trend evaluations. This post compares its performance against GPT-4o-mini, Claude 3.5 Sonnet, and GPT-4o.

o1-preview Time Series Evaluations (Arize AI)

Prompt Caching Benchmarking

We compare the performance and cost savings of prompt caching on Anthropic vs OpenAI.

How to Make Your AI App Feel Magical: Prompt Caching (Arize AI)

Multi-Agent Systems: Swarm

We compare and contrast OpenAI's experimental Swarm repo against other popular multi-agent frameworks: Autogen and CrewAI.

Comparing OpenAI Swarm with Other Multi-Agent Frameworks (Arize AI)

Instrumenting LLMs with OTel

Lessons learned from our journey to one million downloads of our OpenTelemetry wrapper, OpenInference.

Zero to a Million: Instrumenting LLMs with OTEL (Arize AI)

Comparing Agent Frameworks

We built the same agent in LangGraph, LlamaIndex Workflows, CrewAI, Autogen, and pure code. See how each framework compares.

Comparing Agent Frameworks (Arize AI)

Testing Generation in RAG

We test the generation stage of RAG across GPT-4 and Claude 2.1.

Evaluating the Generation Stage in RAG (Arize AI)