AI Research

A collection of advanced experiments and benchmarks in LLM evaluation, instrumentation, and agent systems.

Time Series Evals with OpenAI o1-preview

We benchmarked o1-preview on our hardest eval task: time series trend evaluations. This post compares its performance against GPT-4o-mini, Claude 3.5 Sonnet, and GPT-4o.


Prompt Caching Benchmarking

We compare the performance and cost savings of prompt caching on Anthropic vs OpenAI.


Multi-Agent Systems: Swarm

We compare and contrast OpenAI's experimental Swarm repo with other popular multi-agent frameworks, AutoGen and CrewAI.


Instrumenting LLMs with OTel

Lessons learned from our journey to one million downloads of our OpenTelemetry wrapper, OpenInference.


Comparing Agent Frameworks

We built the same agent in LangGraph, LlamaIndex Workflows, CrewAI, AutoGen, and pure code. See how each framework compares.


Testing Generation in RAG

We test the generation stage of RAG across GPT-4 and Claude 2.1.
