Zero to a Million: Instrumenting LLMs with OTEL

Aparna Dhinakaran

Co-founder & Chief Product Officer

Thanks to Roger Yang, Xander Song, and John Gilhuly for their contributions to this piece.

A few months ago, we hit a significant milestone: our OTEL LLM instrumentation surpassed one million monthly downloads. This journey has been challenging but rewarding. We, along with other key players in the industry, are paving the way for observability in AI using OpenTelemetry (OTEL).

OTEL is crucial for LLM applications. It provides a standardized way to collect data, which is key to building effective evaluation pipelines in both pre-production and production. With OTEL, you can evaluate AI models consistently across different languages and settings. Arize’s OpenInference instrumentation extends OTEL to the world of large language models (LLMs). Achieving this hasn’t been easy. Here are some of the challenges we faced.

One of our first challenges was managing latent data: information, such as evaluation results, that arrives only after the initial event has been recorded. OTEL spans are immutable once ended, yet evaluating LLMs often requires attributing evaluation metrics to spans that already exist.

We considered two approaches:

  • Option 1: Augmentation and Materialization – In Arize, we augment spans by adding new columns for evaluation metrics. This approach is technically challenging but provides efficient handling of large-scale data.
  • Option 2: Metadata Table Joins – In Phoenix, our open-source tool, we create a separate metadata table that is joined with spans on the fly (see the sketch below). This works well for smaller datasets but becomes impractical at larger volumes due to the join’s computational overhead.
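To make the join approach concrete, here is a minimal sketch, using pandas and hypothetical column names purely for illustration:

```python
import pandas as pd

# Spans are written once and never mutated.
spans = pd.DataFrame(
    {
        "context.span_id": ["span-a", "span-b"],
        "name": ["llm.completion", "retriever.query"],
    }
)

# Evaluations arrive later, keyed on the span they describe.
evals = pd.DataFrame(
    {
        "context.span_id": ["span-a"],
        "eval.hallucination.label": ["factual"],
    }
)

# The join happens at read time, so the spans themselves stay immutable.
annotated = spans.merge(evals, on="context.span_id", how="left")
print(annotated)
```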

OTEL span attributes can hold primitive values and arrays of primitives, but not lists of objects. That is a problem for LLM workloads, which deal heavily in structured lists such as chat messages, embeddings, and retrieved documents. Our workaround is to encode each list element as an indexed key-value pair and reconstruct the list later in the collector, as sketched after the list below.

This workaround is necessary for:

  • Lists of Message Objects
  • Embeddings
  • Tool Calls and Parameters
  • Lists of Retrieved Documents

It’s not perfect, but it gets the job done.
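Here is a minimal sketch of that indexed key-value encoding; the attribute names are illustrative rather than the exact OpenInference conventions:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is OpenTelemetry?"},
]

# Flatten the list of objects into indexed, primitive-valued attributes
# that an OTEL span can carry.
attributes = {}
for i, message in enumerate(messages):
    for key, value in message.items():
        attributes[f"llm.input_messages.{i}.message.{key}"] = value

# e.g. {"llm.input_messages.0.message.role": "system", ...}
# The collector can rebuild the original list by parsing the indices.
print(attributes)
```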


The OTEL SDK enforces a default limit of 128 attributes per span. Once that limit is hit, the earliest attributes set on the span are evicted first, in First In, First Out (FIFO) order. Dropping important attributes like the span kind can break evaluation logic, so we prioritize crucial fields and attach them last, ensuring they survive eviction.
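Another mitigation, shown here for the Python SDK, is to raise the per-span attribute limit when configuring the tracer provider (the same limit can be set with the OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT environment variable):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import SpanLimits, TracerProvider

# Raise the attribute limit so flattened lists (messages, documents,
# embeddings) are not silently evicted from the span.
provider = TracerProvider(span_limits=SpanLimits(max_attributes=10_000))
trace.set_tracer_provider(provider)
```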

Handling asynchronous operations like futures and promises is complex. Key challenges include:

  • When should a child span start? At future creation, invocation, or resolution?
  • When should it end? If the future is never resolved, how do you end the span?
  • How do you ensure the parent span ends after the child?

These issues impact trace continuity and accuracy. Errors here lead to incomplete traces, making it difficult to evaluate model performance or identify problems.
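There is no single right answer, but an explicit policy keeps traces consistent. As a minimal sketch (assuming Python coroutines and the standard OTEL tracing API, not our actual instrumentation code), one option is to start the child span when the work is awaited and end it when the result or an exception arrives; a future that is never awaited then simply never produces a span:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def traced_completion(coro):
    # Policy: the child span covers await-to-resolution. The context manager
    # ends the span (and records any exception) even if the awaited work fails,
    # so the parent span can always end after its children.
    with tracer.start_as_current_span("llm.completion") as span:
        result = await coro
        span.set_attribute("llm.output", str(result))
        return result

# Usage (client.chat(...) is a hypothetical async LLM call):
#     result = await traced_completion(client.chat(messages))
```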

Streaming responses add another layer of complexity. OTEL spans are immutable, meaning they cannot be modified once ended. For LLMs, you need the span to include all streamed content, which means delaying the span’s closure until the stream ends. This often requires complex wrappers around internal objects from the LLM library or framework. When parent spans depend on these child spans, ensuring proper recording becomes even more challenging.
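As a rough sketch of the wrapper idea (assuming a synchronous Python iterator; the names are illustrative, not the actual OpenInference wrappers), a generator can hold the span open until the stream is exhausted and only then attach the accumulated output:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced_stream(stream):
    # Keep the span open for the lifetime of the stream; end it only when the
    # consumer has drained (or abandoned) the iterator.
    span = tracer.start_span("llm.stream")
    chunks = []
    try:
        for chunk in stream:
            chunks.append(str(chunk))
            yield chunk
    except Exception as exc:
        span.record_exception(exc)
        raise
    finally:
        span.set_attribute("llm.output", "".join(chunks))
        span.end()
```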

We deal with synchronous and asynchronous calls, streaming and non-streaming responses, chat completions, and agent invocations. Normalizing attributes such as tool calls and LLM responses across frameworks is a daunting task: the LLM landscape evolves rapidly, and each framework presents its own quirks. Building a standardized approach that covers them all is demanding work.
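A toy example of what that normalization can look like; the provider formats and attribute names here are hypothetical stand-ins, not any framework’s real schema:

```python
def normalize_tool_calls(provider: str, payload: dict) -> dict:
    """Map provider-specific tool-call shapes onto one flat attribute scheme."""
    attributes = {}
    if provider == "provider_a":      # e.g. a chat-completions-style payload
        for i, call in enumerate(payload.get("tool_calls", [])):
            attributes[f"llm.tools.{i}.name"] = call["function"]["name"]
            attributes[f"llm.tools.{i}.arguments"] = call["function"]["arguments"]
    elif provider == "provider_b":    # e.g. an agent-framework-style payload
        for i, call in enumerate(payload.get("actions", [])):
            attributes[f"llm.tools.{i}.name"] = call["tool"]
            attributes[f"llm.tools.{i}.arguments"] = call["input"]
    return attributes
```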


Reaching a million downloads was a journey full of technical puzzles. We are proud to be part of the evolving OTEL and LLM observability landscape. Standardization is on the horizon, and we are committed to shaping that future. The work we do today will help define how LLMs are instrumented, evaluated, and improved.

Stay tuned—we will keep sharing our insights as we tackle the next milestone.