Zero to a Million: Instrumenting LLMs with OTEL

Aparna Dhinakaran

Co-founder & Chief Product Officer

Thanks to Roger Yang, Xander Song, and John Gilhuly for their contributions to this piece.

A few months ago, we hit a significant milestone: our OTEL LLM instrumentation surpassed one million monthly downloads. This journey has been challenging but rewarding. We, along with other key players in the industry, are paving the way for observability in AI using OpenTelemetry (OTEL).

OTEL is crucial for LLM applications. It provides a standardized way to collect data, which is key to building effective evaluation pipelines in both pre-production and production. With OTEL, you can evaluate AI models consistently across different languages and settings. Arize’s OpenInference instrumentation extends OTEL to the world of large language models (LLMs). Achieving this hasn’t been easy. Here are some of the challenges we faced.

One of our first challenges was managing latent data: information, such as evaluation results, that arrives only after the initial event has been recorded. OTEL spans are immutable once ended, yet evaluating LLMs often requires attributing evaluation metrics to spans that already exist.

We considered two approaches:

  • Option 1: Augmentation and Materialization – In Arize, we augment spans by adding new columns for evaluation metrics. This approach is technically challenging but provides efficient handling of large-scale data.
  • Option 2: Metadata Table Joins – In Phoenix, our open-source tool, we create a separate metadata table that is joined with spans on the fly (see the sketch below). This works well for smaller datasets but becomes impractical at larger volumes due to the join’s computational overhead.
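To make the join approach concrete, here is a minimal sketch, using pandas and hypothetical column names purely for illustration:

```python
import pandas as pd

# Spans are written once and never mutated.
spans = pd.DataFrame(
    {
        "context.span_id": ["span-a", "span-b"],
        "name": ["llm.completion", "retriever.query"],
    }
)

# Evaluations arrive later, keyed on the span they describe.
evals = pd.DataFrame(
    {
        "context.span_id": ["span-a"],
        "eval.hallucination.label": ["factual"],
    }
)

# The join happens at read time, so the spans themselves stay immutable.
annotated = spans.merge(evals, on="context.span_id", how="left")
print(annotated)
```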

OTEL span attributes can hold primitive values and arrays of primitives, but not lists of objects. That is a problem for LLM workloads, which deal heavily in structured lists such as chat messages, embeddings, and retrieved documents. Our workaround is to encode each list element as an indexed key-value pair and reconstruct the list later in the collector, as sketched after the list below.

This workaround is necessary for:

  • Lists of Message Objects
  • Embeddings
  • Tool Calls and Parameters
  • Lists of Retrieved Documents

It’s not perfect, but it gets the job done.
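Here is a minimal sketch of that indexed key-value encoding; the attribute names are illustrative rather than the exact OpenInference conventions:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is OpenTelemetry?"},
]

# Flatten the list of objects into indexed, primitive-valued attributes
# that an OTEL span can carry.
attributes = {}
for i, message in enumerate(messages):
    for key, value in message.items():
        attributes[f"llm.input_messages.{i}.message.{key}"] = value

# e.g. {"llm.input_messages.0.message.role": "system", ...}
# The collector can rebuild the original list by parsing the indices.
print(attributes)
```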


The OTEL SDK enforces a default limit of 128 attributes per span. Once that limit is hit, the earliest attributes set on the span are evicted first, in First In, First Out (FIFO) order. Dropping important attributes like the span kind can break evaluation logic, so we prioritize crucial fields and attach them last, ensuring they survive eviction.
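Another mitigation, shown here for the Python SDK, is to raise the per-span attribute limit when configuring the tracer provider (the same limit can be set with the OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT environment variable):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import SpanLimits, TracerProvider

# Raise the attribute limit so flattened lists (messages, documents,
# embeddings) are not silently evicted from the span.
provider = TracerProvider(span_limits=SpanLimits(max_attributes=10_000))
trace.set_tracer_provider(provider)
```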

Handling asynchronous operations like futures and promises is complex. Key challenges include:

  • When should a child span start? At future creation, invocation, or resolution?
  • When should it end? If the future is never resolved, how do you end the span?
  • How do you ensure the parent span ends after the child?

These issues impact trace continuity and accuracy. Errors here lead to incomplete traces, making it difficult to evaluate model performance or identify problems.
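There is no single right answer, but an explicit policy keeps traces consistent. As a minimal sketch (assuming Python coroutines and the standard OTEL tracing API, not our actual instrumentation code), one option is to start the child span when the work is awaited and end it when the result or an exception arrives; a future that is never awaited then simply never produces a span:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def traced_completion(coro):
    # Policy: the child span covers await-to-resolution. The context manager
    # ends the span (and records any exception) even if the awaited work fails,
    # so the parent span can always end after its children.
    with tracer.start_as_current_span("llm.completion") as span:
        result = await coro
        span.set_attribute("llm.output", str(result))
        return result

# Usage (client.chat(...) is a hypothetical async LLM call):
#     result = await traced_completion(client.chat(messages))
```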

Streaming responses add another layer of complexity. OTEL spans are immutable, meaning they cannot be modified once ended. For LLMs, you need the span to include all streamed content, which means delaying the span’s closure until the stream ends. This often requires complex wrappers around internal objects from the LLM library or framework. When parent spans depend on these child spans, ensuring proper recording becomes even more challenging.
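As a rough sketch of the wrapper idea (assuming a synchronous Python iterator; the names are illustrative, not the actual OpenInference wrappers), a generator can hold the span open until the stream is exhausted and only then attach the accumulated output:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced_stream(stream):
    # Keep the span open for the lifetime of the stream; end it only when the
    # consumer has drained (or abandoned) the iterator.
    span = tracer.start_span("llm.stream")
    chunks = []
    try:
        for chunk in stream:
            chunks.append(str(chunk))
            yield chunk
    except Exception as exc:
        span.record_exception(exc)
        raise
    finally:
        span.set_attribute("llm.output", "".join(chunks))
        span.end()
```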

We deal with synchronous and asynchronous calls, streaming and non-streaming responses, chat completions, and agent invocations. Normalizing attributes such as tool calls and LLM responses across frameworks is a daunting task: the LLM landscape evolves rapidly, and each framework presents its own quirks. Building a standardized approach that covers them all is demanding work.
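A toy example of what that normalization can look like; the provider formats and attribute names here are hypothetical stand-ins, not any framework’s real schema:

```python
def normalize_tool_calls(provider: str, payload: dict) -> dict:
    """Map provider-specific tool-call shapes onto one flat attribute scheme."""
    attributes = {}
    if provider == "provider_a":      # e.g. a chat-completions-style payload
        for i, call in enumerate(payload.get("tool_calls", [])):
            attributes[f"llm.tools.{i}.name"] = call["function"]["name"]
            attributes[f"llm.tools.{i}.arguments"] = call["function"]["arguments"]
    elif provider == "provider_b":    # e.g. an agent-framework-style payload
        for i, call in enumerate(payload.get("actions", [])):
            attributes[f"llm.tools.{i}.name"] = call["tool"]
            attributes[f"llm.tools.{i}.arguments"] = call["input"]
    return attributes
```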


Reaching a million downloads was a journey full of technical puzzles. We are proud to be part of the evolving OTEL and LLM observability landscape. Standardization is on the horizon, and we are committed to shaping that future. The work we do today will help define how LLMs are instrumented, evaluated, and improved.

Stay tuned—we will keep sharing our insights as we tackle the next milestone.