AI Research

A collection of advanced experiments and benchmarks in LLM evaluation, instrumentation, and agent systems.

Time Series Evals with OpenAI o1-preview

We benchmarked o1-preview on our hardest eval task: time series trend evaluations. This post compares its performance against GPT-4o-mini, Claude 3.5 Sonnet, and GPT-4o.


Prompt Caching Benchmarking

We compare the performance and cost savings of prompt caching on Anthropic vs OpenAI.


Multi-Agent Systems: Swarm

We compare and contrast OpenAI's experimental Swarm repo with other popular multi-agent frameworks, AutoGen and CrewAI.


Instrumenting LLMs with OTel

Lessons learned from our journey to one million downloads of our OpenTelemetry wrapper, OpenInference.


Comparing Agent Frameworks

We built the same agent in LangGraph, LlamaIndex Workflows, CrewAI, AutoGen, and pure code. See how each framework compares.


Testing Generation in RAG

We test the generation stage of RAG across GPT-4 and Claude 2.1.
