AI Research
Collection of advanced experiments and benchmarks in LLM evaluation, instrumentation, and agent systems
Time Series Evals with OpenAI o1-preview
We benchmarked o1-preview on our hardest eval task: time series trend evaluations. This post compares its performance against GPT-4o-mini, Claude 3.5 Sonnet, and GPT-4o.
Prompt Caching Benchmarking
We compare the performance and cost savings of prompt caching on Anthropic vs OpenAI.
Multi-Agent Systems: Swarm
We compare and contrast OpenAI's experimental Swarm repo with other popular multi-agent frameworks, Autogen and CrewAI.
Instrumenting LLMs with OTel
Lessons learned from our journey to one million downloads of our OpenTelemetry wrapper, OpenInference.
Comparing Agent Frameworks
We built the same agent in LangGraph, LlamaIndex Workflows, CrewAI, Autogen, and pure code. See how each framework compares.
Testing Generation in RAG
We test the generation stage of RAG across GPT-4 and Claude 2.1.