This guide must be run locally on a machine with a microphone — audio capture requires physical hardware that Colab cannot provide. The companion notebook is at Arize-ai/tutorials. Download it and run it in Jupyter, or follow the steps below to assemble the same code in a regular .py file.
This guide is a hands-on walkthrough for instrumenting an OpenAI Realtime voice agent with Arize AX, using the OpenAI Agents SDK and openinference-instrumentation-openai-agents. You will learn:
  • How a single OpenAIAgentsInstrumentor().instrument(...) call auto-traces an agents.realtime.RealtimeSession — capturing audio, transcripts, token counts, and tool calls without any manual span code
  • The span tree the instrumentor produces (one AUDIO parent per conversational turn, with USER, LLM, and TOOL children)
  • The Agents-SDK event stream — RealtimeAudio, RealtimeAudioInterrupted, RealtimeAgentEndEvent, RealtimeToolStart — and how to drive a live microphone + speaker loop on top of it
  • How to evaluate the captured audio with an audio-aware OpenAI model and log the results back to Arize

Initial setup

You will need an Arize AX account to run this guide. Sign up now for free if you don’t have one. You also need an OpenAI API key with access to the Realtime API.

Create a project directory and virtual environment

Create a new directory for your script and a Python virtual environment inside it.

Install libraries

Install the dependencies you’ll use across the rest of the guide — the OpenAI Agents SDK, the OpenInference instrumentor, the arize-otel helper, the arize client for exporting traces and logging evaluations, and sounddevice + numpy + pandas for audio I/O and dataframe work:
pip install openai openai-agents openinference-instrumentation-openai-agents \
    arize-otel arize sounddevice numpy pandas
On Linux you also need the system PortAudio library that sounddevice links against. macOS and Windows wheels bundle PortAudio already.
# Linux only:
sudo apt-get install -y libportaudio2   # apt-based distros
sudo dnf install -y portaudio           # dnf-based distros

Set environment variables

The script reads three secrets from environment variables. Find your Arize Space ID and API Key on your Space Settings page.
[Figure: Arize AX Space Settings page showing the Space ID and API Key fields]
Export them in the same shell session you will run the code from:
export ARIZE_SPACE_ID="your-arize-space-id"
export ARIZE_API_KEY="your-arize-api-key"
export OPENAI_API_KEY="your-openai-api-key"
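Because the rest of the guide reads these variables with os.environ[...], a missing one surfaces as a bare KeyError partway through the script. A minimal pre-flight check, sketched below, reports all missing names at once. The helper names missing_env_vars and require_env are hypothetical and not part of any Arize or OpenAI library.

```python
import os

# The three variables this guide exports above.
REQUIRED_ENV_VARS = ("ARIZE_SPACE_ID", "ARIZE_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(env=os.environ, required=REQUIRED_ENV_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


def require_env():
    """Exit with one readable message if any required variable is missing."""
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

Calling require_env() as the first line of your script keeps the failure close to its cause.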

What you’ll see in Arize

The OpenAI Agents instrumentor turns each turn of a Realtime conversation into a complete span tree, automatically. You write no tracing code yourself — the instrumentor patches RealtimeSession and emits spans as audio, tool calls, and responses flow through the WebSocket. For each conversational turn, the tree looks like this:
AUDIO  "conversation.turn"     ← parent; aggregated transcripts, llm.model_name, llm.invocation_parameters
├─ USER "user"                 ← input.audio.url, input.audio.transcript
├─ LLM  "assistant"            ← output.audio.url, output.audio.transcript, token counts, time_to_first_token_ms
│  └─ TOOL "<tool_name>"        ← one per function call within the turn
└─ ...                          ← additional siblings for split input or tool round-trips
A few details worth knowing:
  • The parent span uses the AUDIO span kind — an instrumentor-introduced extension used for the conversational-turn parent. Arize AX renders it with audio-aware UI: a play button, waveform, and the input/output transcripts inline.
  • input.audio.url and output.audio.url carry the captured audio as inline data:audio/wav;base64,… URIs — the bytes ride on the span itself, so no separate audio storage is needed.
  • time_to_first_token_ms on the assistant span is the latency from the user’s audio commit to the first byte of the model’s audio response — the metric that matters most for perceived voice-agent responsiveness.
  • Token counts include audio-specific breakdowns llm.token_count.prompt_details.audio and llm.token_count.completion_details.audio, so you can see audio vs text token cost separately.
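To make the inline audio format concrete, here is a small standard-library sketch that turns a data:audio/wav;base64,… URI back into WAV bytes and reads the clip duration from the WAV header. It assumes the payload was not truncated by the inline length cap. The function names are illustrative only.

```python
import base64
import io
import wave


def decode_audio_data_uri(uri: str) -> bytes:
    """Return the raw bytes carried by a `data:<mime>;base64,<payload>` URI."""
    header, sep, payload = uri.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Read the WAV header and return the clip duration in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Writing the decoded bytes to a .wav file is enough to play the clip in any audio player.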

Defining the voice agent

Three pieces make up the agent we’ll wire to the Realtime API: the tools (Python functions the model can call), the agent definition (instructions and a tool list), and the audio plumbing that pipes the microphone in and the speaker out. None of this code is tracing-specific — it’s the same shape you’d write without observability.

Tools

Two tools: get_weather and get_current_time. They’re trivial dummies, enough to exercise a tool round-trip in the Realtime API. The @function_tool decorator from the Agents SDK introspects the function’s type hints and docstring to generate the JSON schema the model receives — no manual tool-spec dict needed.
from datetime import datetime
from agents import function_tool


@function_tool
def get_weather(location: str, unit: str = "fahrenheit") -> str:
    """Get the current weather for a location."""
    if unit == "celsius":
        return f"The weather in {location} is 22 °C and sunny."
    return f"The weather in {location} is 72 °F and sunny."


@function_tool
def get_current_time(timezone: str = "UTC") -> str:
    """Get the current time for a timezone."""
    # Dummy implementation: formats this machine's local time and only echoes `timezone`.
    now = datetime.now().strftime("%I:%M %p")
    return f"The current time in {timezone} is {now}."

The agent

A RealtimeAgent pairs a system prompt with a tool list for the OpenAI Realtime API. It’s consumed by RealtimeRunner, which negotiates the WebSocket connection and handles the bidirectional audio stream.
from agents.realtime import RealtimeAgent

agent = RealtimeAgent(
    name="Assistant",
    instructions=(
        "You are a helpful voice assistant. "
        "You have tools to look up weather and the current time — use them when asked. "
        "Keep responses short and conversational."
    ),
    tools=[get_weather, get_current_time],
)

Audio plumbing

The OpenAI Realtime API speaks 24 kHz PCM16 mono on the wire. We use sounddevice to bridge the system’s microphone and speaker to that format. sounddevice runs the input stream’s callback on a background thread; the callback enqueues mic bytes into an asyncio.Queue that the async send loop drains. For playback we use a non-callback OutputStream and make blocking writes to it through loop.run_in_executor(...), letting PortAudio own the buffering.
make_mic_callback also accepts an optional recording list. When supplied, every chunk is appended to it with a time.time_ns() timestamp. We use this local copy in the evaluation step at the end of the guide; see that section for why it’s needed and when you can skip it.
import time

import numpy as np


def make_mic_callback(mic_queue, loop, recording=None):
    """Build a sounddevice InputStream callback that forwards mic chunks into an asyncio queue."""
    def cb(indata, frames, _time, status):
        if status:
            print(f"Mic: {status}")
        # `indata` is a view onto a buffer sounddevice reuses on the next callback,
        # so we copy before handing the bytes to a different thread.
        chunk = indata.copy().tobytes()
        if recording is not None:
            recording.append((time.time_ns(), chunk))
        loop.call_soon_threadsafe(mic_queue.put_nowait, chunk)
    return cb
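The thread-to-event-loop handoff in the callback can be tried without any audio hardware. The sketch below replaces the PortAudio callback thread with an ordinary threading.Thread (an assumption made purely for illustration) and shows that chunks scheduled with loop.call_soon_threadsafe arrive on the asyncio side in order.

```python
import asyncio
import threading


def producer(loop, queue, chunks):
    """Stand-in for the audio callback thread: hand each chunk to the event loop."""
    for chunk in chunks:
        # asyncio.Queue is not thread-safe, so the put is scheduled to run
        # on the event loop's own thread instead of being called directly.
        loop.call_soon_threadsafe(queue.put_nowait, chunk)


async def collect(chunks):
    """Start the producer thread and await every chunk it sends."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()
    thread = threading.Thread(target=producer, args=(loop, queue, chunks))
    thread.start()
    received = [await queue.get() for _ in chunks]
    thread.join()
    return received


if __name__ == "__main__":
    print(asyncio.run(collect([b"chunk-1", b"chunk-2", b"chunk-3"])))
```

Note that put_nowait raises asyncio.QueueFull on a bounded queue that has filled up, so a consumer that stalls for long enough will surface errors from the scheduled callbacks.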

Setting up tracing

This is the only tracing code you need to write for the whole guide — one register(...) call and one OpenAIAgentsInstrumentor().instrument(...) call. The instrumentor patches agents.realtime.RealtimeSession to emit the span tree shown above; all spans flow through the registered TracerProvider to Arize AX. By default, captured audio rides inline on the span attributes as data:audio/wav;base64,… URIs — no separate storage needed. The inline payload is capped by OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH (default 32000 characters ≈ 0.5 s of 24 kHz mono PCM16). We raise it to 2000000 (~30 s) below, which is plenty for typical voice turns. Set this env var before calling instrument(...) — the instrumentor reads it at patch time.
import os
from arize.otel import register
from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor

# Raise the inline base64 cap to ~30 s of audio (default is ~0.5 s).
# Must be set BEFORE .instrument() — the instrumentor reads it at patch time.
os.environ["OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH"] = "2000000"

PROJECT_NAME = "openai-realtime-voice"

tracer_provider = register(
    space_id=os.environ["ARIZE_SPACE_ID"],
    api_key=os.environ["ARIZE_API_KEY"],
    project_name=PROJECT_NAME,
)

OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)
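The two durations quoted for the cap follow from simple arithmetic: base64 encodes 3 bytes as 4 characters, and 24 kHz mono PCM16 is 48,000 bytes per second. The sketch below reproduces those numbers. It ignores the 44-byte WAV header and the data: prefix, and it assumes the cap counts base64 characters, so treat the results as estimates.

```python
import math

SAMPLE_RATE = 24_000    # Hz, the Realtime API wire format
BYTES_PER_SAMPLE = 2    # PCM16
CHANNELS = 1            # mono
BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS


def seconds_for_base64_chars(n_chars: int) -> float:
    """Approximate seconds of audio that fit in n_chars of base64."""
    raw_bytes = n_chars * 3 / 4  # 4 base64 characters carry 3 bytes
    return raw_bytes / BYTES_PER_SECOND


def base64_chars_for_seconds(seconds: float) -> int:
    """Approximate base64 length needed to hold `seconds` of audio."""
    raw_bytes = math.ceil(seconds * BYTES_PER_SECOND)
    return 4 * math.ceil(raw_bytes / 3)


print(seconds_for_base64_chars(32_000))      # 0.5
print(seconds_for_base64_chars(2_000_000))   # 31.25
print(base64_chars_for_seconds(30))          # 1920000
```

If your turns run longer than about 30 seconds, base64_chars_for_seconds gives a starting point for a larger cap.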
If you need durable storage or longer turns than the inline cap allows, swap the audio URL for any HTTPS endpoint Arize AX can fetch, such as an S3 presigned URL, a GCS object, or your own CDN. Attach a custom SpanProcessor or wrap the SpanExporter to substitute input.audio.url / output.audio.url with your chosen URL before the span leaves the process. Arize plays back whatever URL the attribute carries. Using an externally fetchable URL also removes the need for the local-recording workaround in the evaluation step below; see the eval section for the details.

Running a voice session

This is the live mic/speaker loop. It opens an agents.realtime.RealtimeSession via RealtimeRunner, pumps mic audio into it, plays back the assistant’s audio, and exits cleanly after one full conversational turn — including any tool calls and the assistant’s follow-up response. Three async tasks run concurrently inside the session:
  • send_mic — drains the mic queue and forwards chunks to the session via session.send_audio(...)
  • handle_events — consumes the session’s event stream and dispatches on event type
  • hard_timer — sets stop after MAX_SESSION_SECONDS as a safety net
The exit logic is deliberately deferred. RealtimeAgentEndEvent fires after every response.done from the API — including the response where the model decides to call a tool, before the tool runs and the follow-up response. So we schedule the exit after EXIT_GRACE_PERIOD seconds and cancel it if a RealtimeToolStart or a new RealtimeAgentStartEvent follows within the window. For tool-free turns the grace window simply elapses and the session exits.
import asyncio

import sounddevice as sd
from agents.realtime import RealtimeRunner
from agents.realtime.events import (
    RealtimeAgentEndEvent,
    RealtimeAgentStartEvent,
    RealtimeAudio,
    RealtimeAudioInterrupted,
    RealtimeToolStart,
)

SAMPLE_RATE = 24_000
CHANNELS = 1
MIC_CHUNK_FRAMES = 1_200
MAX_SESSION_SECONDS = 60
EXIT_GRACE_PERIOD = 1.5

# Captured mic chunks with ns-precision timestamps. Populated by the mic
# callback during the session, then sliced by the eval step to extract
# per-USER-span audio.
mic_recording: list[tuple[int, bytes]] = []


async def run_session():
    loop = asyncio.get_running_loop()
    mic_queue: asyncio.Queue = asyncio.Queue(maxsize=100)
    stop = asyncio.Event()
    mic_recording.clear()

    async def hard_timer():
        await asyncio.sleep(MAX_SESSION_SECONDS)
        stop.set()

    runner = RealtimeRunner(agent)
    async with await runner.run() as session:
        print("Connected. Speak now — session will end once the assistant finishes its turn.\n")

        async def send_mic():
            while not stop.is_set():
                try:
                    chunk = await asyncio.wait_for(mic_queue.get(), timeout=0.1)
                    await session.send_audio(chunk)
                except asyncio.TimeoutError:
                    continue

        timer_task = asyncio.create_task(hard_timer())
        send_task = asyncio.create_task(send_mic())
        events_task = None

        try:
            with sd.InputStream(
                samplerate=SAMPLE_RATE,
                channels=CHANNELS,
                dtype="int16",
                blocksize=MIC_CHUNK_FRAMES,
                callback=make_mic_callback(mic_queue, loop, recording=mic_recording),
            ), sd.OutputStream(
                samplerate=SAMPLE_RATE,
                channels=CHANNELS,
                dtype="int16",
                blocksize=MIC_CHUNK_FRAMES,
            ) as out_stream:

                async def handle_events():
                    exit_task = None

                    async def maybe_exit():
                        try:
                            await asyncio.sleep(EXIT_GRACE_PERIOD)
                            await loop.run_in_executor(None, out_stream.stop)
                            stop.set()
                        except asyncio.CancelledError:
                            pass

                    def cancel_pending_exit():
                        nonlocal exit_task
                        if exit_task is not None and not exit_task.done():
                            exit_task.cancel()
                        exit_task = None

                    async for event in session:
                        if stop.is_set():
                            break
                        if isinstance(event, RealtimeAudio):
                            arr = np.frombuffer(event.audio.data, dtype=np.int16)
                            await loop.run_in_executor(None, out_stream.write, arr)
                        elif isinstance(event, RealtimeAudioInterrupted):
                            cancel_pending_exit()
                            await loop.run_in_executor(None, out_stream.abort)
                            await loop.run_in_executor(None, out_stream.start)
                            print("[interrupted]")
                        elif isinstance(event, (RealtimeAgentStartEvent, RealtimeToolStart)):
                            cancel_pending_exit()
                        elif isinstance(event, RealtimeAgentEndEvent):
                            cancel_pending_exit()
                            exit_task = asyncio.create_task(maybe_exit())

                events_task = asyncio.create_task(handle_events())
                await stop.wait()
        except KeyboardInterrupt:
            print("\nInterrupted.")
            stop.set()
        finally:
            pending = [t for t in (timer_task, send_task, events_task) if t is not None]
            for t in pending:
                t.cancel()
            await asyncio.gather(*pending, return_exceptions=True)

    tracer_provider.force_flush()
    print("Session ended. Traces flushed to Arize.")


asyncio.run(run_session())
Try asking:
  • “What’s the weather in London?” — exercises a tool call (get_weather)
  • “What time is it in Tokyo?” — exercises the other tool (get_current_time)
  • “Tell me a fun fact about Python” — no tool call; tests the simpler one-response path
Server-side VAD (the Realtime API’s default turn_detection) detects when you’ve finished speaking and commits your audio buffer automatically.
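The deferred-exit logic inside handle_events is easier to reason about in isolation. The sketch below reimplements the same schedule-then-cancel pattern with plain asyncio and drives it with made-up event names ("agent_end", "agent_start", "tool_start") in place of the SDK event classes. DeferredExit and simulate are hypothetical names used only for this illustration.

```python
import asyncio


class DeferredExit:
    """Set `stop` after a quiet grace period unless cancelled first."""

    def __init__(self, stop: asyncio.Event, grace: float):
        self.stop = stop
        self.grace = grace
        self._task = None

    def cancel(self):
        """Drop any pending exit, as a tool start or new response does."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    def schedule(self):
        """(Re)start the grace window, as an agent-end event does."""
        self.cancel()
        self._task = asyncio.create_task(self._fire())

    async def _fire(self):
        await asyncio.sleep(self.grace)
        self.stop.set()


async def simulate(events, grace=0.2, gap=0.01):
    """Replay event names `gap` seconds apart; return True if the session would exit."""
    stop = asyncio.Event()
    exit_timer = DeferredExit(stop, grace)
    for name in events:
        if name == "agent_end":
            exit_timer.schedule()
        else:  # "agent_start" or "tool_start"
            exit_timer.cancel()
        await asyncio.sleep(gap)
    await asyncio.sleep(grace * 2)  # let any pending grace window elapse
    exit_timer.cancel()
    return stop.is_set()


if __name__ == "__main__":
    # A tool round-trip: the first agent_end is cancelled, the last one exits.
    print(asyncio.run(simulate(["agent_end", "tool_start", "agent_start", "agent_end"])))
```

A sequence that ends with "tool_start" never exits in this simulation, which is why the real loop keeps hard_timer as a safety net.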

See your traces in Arize

Head to your Arize AX project (openai-realtime-voice) to see the traces. Each conversation turn appears as a conversation.turn span (kind: AUDIO) containing user, assistant, and any <tool_name> children.
[Figure: An audio trace in the Arize AX trace view]
Things to look at in the trace view:
  • Audio playback — click the play button on the user and assistant spans. The audio is served from the inline data:audio/wav;base64,... URI on the span attribute.
  • Transcripts — the input.audio.transcript and output.audio.transcript attributes show what the model heard and what it said. Useful for debugging mishearings.
  • Tool calls — child TOOL spans under the assistant span show the function name, arguments JSON, and the value the function returned.
  • Latency: time_to_first_token_ms on the assistant span gives the user-perceived “how fast did the agent start talking” latency.
  • Audio token cost: llm.token_count.prompt_details.audio and llm.token_count.completion_details.audio break out audio tokens separately from text tokens.

Audio redaction

The OpenInference instrumentor recognises three environment variables for controlling what audio data ends up on spans. Set them before calling instrument(...):
  • OPENINFERENCE_HIDE_INPUT_AUDIO=true — drop input.audio.* from USER spans
  • OPENINFERENCE_HIDE_OUTPUT_AUDIO=true — drop output.audio.* from LLM spans
  • OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH=<n> — cap the inline base64 payload length (default 32000)
The general OpenInference TraceConfig(hide_inputs=True) and TraceConfig(hide_outputs=True) settings also cascade to the corresponding audio attributes.
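Since this guide already sets OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH from Python, the redaction switches can be set the same way. A minimal sketch, assuming you want raw audio hidden in both directions and that these lines run before the instrument(...) call:

```python
import os

# Hide raw audio in both directions. These must be set before
# OpenAIAgentsInstrumentor().instrument(...) is called.
os.environ["OPENINFERENCE_HIDE_INPUT_AUDIO"] = "true"
os.environ["OPENINFERENCE_HIDE_OUTPUT_AUDIO"] = "true"

# Print the effective settings so a misconfigured run is easy to spot.
redaction_settings = {
    name: os.environ.get(name, "<unset>")
    for name in (
        "OPENINFERENCE_HIDE_INPUT_AUDIO",
        "OPENINFERENCE_HIDE_OUTPUT_AUDIO",
        "OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH",
    )
}
print(redaction_settings)
```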

Evaluating the voice session

Now that traces are flowing into Arize, let’s run an evaluation against the captured audio. We’ll classify the tone of each user utterance as positive, neutral, or negative — a classic voice-agent quality signal — and ship the results back to Arize so they appear on the same spans in the UI. There’s one architectural wrinkle to know about up front. When the OpenInference instrumentor uploads inline data:audio/wav;base64,… URIs to Arize, the backend stores the audio in its own multimodal blob bucket and the export client returns an internal gs://arize-multimodal-prod/… reference that external code can’t authenticate against. We work around this by keeping a local copy of the mic audio as it’s captured (the mic_recording list populated by the mic callback), and slicing it per USER span using the span’s start and end times from the Arize export.
If you swap the inline-base64 path for external cloud storage (the tip in the tracing section shows how — a custom SpanProcessor that puts an S3 / GCS / CDN URL on input.audio.url instead of the inline data URI), this workaround disappears. The audio URL on each USER span points at an object you control and can fetch directly. You can then drop the recording=mic_recording argument from make_mic_callback, delete the extract_audio_b64 slicing helper, and replace it with a plain HTTP fetch of attributes.input.audio.url. The local-recording dance below exists only because the default inline-base64 path stores audio in Arize-controlled storage.
The flow has four steps:
  1. Export the USER spans from Arize — we only need them for span IDs and time bounds, not the audio bytes
  2. Slice mic_recording between each USER span’s start_time and end_time, wrap the raw PCM in a WAV header
  3. Classify the resulting audio with gpt-audio-mini (OpenAI’s audio-input chat model)
  4. Log the evaluations back to Arize via client.spans.update_evaluations(...)
Phoenix’s phoenix.evals framework was rewritten in late 2025 and its prompt templates currently support text content only — multimodal/audio content blocks aren’t supported yet. So instead of using create_classifier(...), we call OpenAI directly and shape the result into the eval-dataframe format Arize expects.

Export the USER spans from Arize

client.spans.export_to_df(...) returns a flat pandas DataFrame with one row per span. We only need three columns from it: context.span_id (where the eval attaches), start_time, and end_time. The audio bytes themselves come from mic_recording.
import os
from datetime import datetime, timedelta, timezone

import pandas as pd
from arize import ArizeClient

PROJECT_NAME = "openai-realtime-voice"

arize_client = ArizeClient(api_key=os.environ["ARIZE_API_KEY"])

# 1-hour lookback covers the session you just ran plus a generous margin for
# Arize's ingestion latency. Narrow the window if you have a noisy project.
end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(hours=1)

traces_df = arize_client.spans.export_to_df(
    space_id=os.environ["ARIZE_SPACE_ID"],
    project_name=PROJECT_NAME,
    start_time=start_time,
    end_time=end_time,
)
print(f"Exported {len(traces_df)} spans")

# USER spans are the per-utterance children of each AUDIO turn. Their
# start_time / end_time bound the audio we want to evaluate.
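kind_col = "attributes.openinference.span.kind"
if kind_col not in traces_df.columns:
    # Fail with a readable message instead of the opaque `KeyError: False`
    # that `traces_df[traces_df.get(...) == "USER"]` raises on a missing column.
    raise KeyError(f"Export has no {kind_col!r} column; check the project name and time window.")
user_spans = traces_df[traces_df[kind_col] == "USER"].copy()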
print(f"{len(user_spans)} user spans to evaluate")

Slice the user audio out of the local recording

For each USER span, slice mic_recording between the span’s start_time and end_time (both pandas.Timestamp values, convertible to ns-since-epoch via .value). The slice is raw PCM16 mono at 24 kHz; we wrap it in a WAV header using the standard-library wave module, base64-encode the result, and hand it to the classifier. If no mic chunk falls in the span’s window, the helper returns None and we skip that span — that can happen for very short utterances right at the start or end of the recording.
import base64
import io
import wave


def _ts_to_ns(ts) -> int:
    """Convert a pandas/datetime/string timestamp to ns since the Unix epoch."""
    return int(pd.Timestamp(ts).value)


def _wrap_pcm_as_wav_b64(pcm_bytes: bytes) -> str:
    """Wrap raw PCM16 mono 24 kHz bytes in a WAV header and return base64."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(2)  # int16
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm_bytes)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


def extract_audio_b64(span, recording=mic_recording) -> str | None:
    """Slice the recording by the span's [start_time, end_time] and return WAV base64."""
    start_ns = _ts_to_ns(span["start_time"])
    end_ns = _ts_to_ns(span["end_time"])
    pcm = b"".join(chunk for ts, chunk in recording if start_ns <= ts <= end_ns)
    if not pcm:
        return None
    return _wrap_pcm_as_wav_b64(pcm)
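To see what the slice returns, it helps to run the same logic on a synthetic recording with known timestamps. The sketch below is self-contained and uses standard-library modules only. The one-second silent recording, the epoch value, and the helper names are invented for illustration. It shows that both window bounds are inclusive, so a half-second window over 50 ms chunks yields 11 chunks (0.55 s).

```python
import base64
import io
import wave

SAMPLE_RATE = 24_000
CHUNK_FRAMES = 1_200  # 50 ms per chunk at 24 kHz, as in MIC_CHUNK_FRAMES
CHUNK_NS = CHUNK_FRAMES * 1_000_000_000 // SAMPLE_RATE  # 50_000_000 ns


def synthetic_recording(n_chunks, t0_ns):
    """Build (timestamp_ns, pcm_bytes) pairs shaped like mic_recording, filled with silence."""
    silence = b"\x00\x00" * CHUNK_FRAMES
    return [(t0_ns + i * CHUNK_NS, silence) for i in range(n_chunks)]


def slice_pcm(recording, start_ns, end_ns):
    """Concatenate every chunk whose timestamp lies in [start_ns, end_ns]."""
    return b"".join(chunk for ts, chunk in recording if start_ns <= ts <= end_ns)


def pcm_to_wav_b64(pcm):
    """Wrap raw PCM16 mono bytes in a WAV header and base64-encode the result."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


def wav_b64_duration(wav_b64):
    """Decode the base64 WAV and return its duration in seconds."""
    with wave.open(io.BytesIO(base64.b64decode(wav_b64)), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


t0 = 1_700_000_000 * 1_000_000_000          # arbitrary epoch, in ns
recording = synthetic_recording(20, t0)     # 20 chunks = 1 s of audio
window = (t0 + 250_000_000, t0 + 750_000_000)
clip_b64 = pcm_to_wav_b64(slice_pcm(recording, *window))
print(round(wav_b64_duration(clip_b64), 2))  # 0.55
```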

Define the tone classifier

The classifier mirrors what a classic emotion-classification template does:
  1. Task description — a system prompt that tells the model what to listen for and how to reply
  2. Rails — the fixed set of valid labels (positive / neutral / negative); the prompt forbids anything else
  3. Output format — a small JSON object with label and explanation, parsed back into the score columns Arize expects
gpt-audio-mini accepts an input_audio content block alongside text. We send the system prompt as one message and a short text + audio block as the user message. modalities=["text"] tells the model to reply in text only (not audio). The full gpt-audio model would also work — gpt-audio-mini is cheaper and plenty accurate for three-way tone classification.
Audio chat models don’t support response_format={"type": "json_object"} (yet). We pin the output shape with a strong prompt instruction and parse leniently — _extract_json_object pulls the first {...} block out of the response, which survives minor deviations like the model wrapping its reply in markdown fences.
import json
import re

from openai import OpenAI

openai_client = OpenAI()

# Three-way tone classification. The model is constrained to these labels by
# the prompt; the score mapping turns each label into a numeric value Arize
# can sort and chart on (any monotonic mapping works — 0 / 0.5 / 1 is just
# convenient for filtering "tone went negative" in the trace list).
TONE_RAILS = ["positive", "neutral", "negative"]
TONE_SCORES = {"positive": 1.0, "neutral": 0.5, "negative": 0.0}

TONE_SYSTEM_PROMPT = (
    "You are an evaluator that listens to a user's voice clip and classifies the tone of voice. "
    f"Reply with one of exactly these labels: {', '.join(TONE_RAILS)}. "
    "Respond with ONLY a JSON object on a single line, with two keys: "
    '"label" (one of the labels above) and "explanation" (a one-sentence reason). '
    "Do not wrap the JSON in markdown, do not include any preamble, do not add any other text."
)


def _extract_json_object(text: str) -> dict:
    """Pull the first {...} object out of the model's response and parse it.

    Audio models don't support `response_format={"type": "json_object"}`, so we
    rely on prompt instructions and a lenient extractor that survives minor
    deviations like the model wrapping its reply in markdown fences.
    """
    match = re.search(r"\{.*?\}", text, re.DOTALL)
    if not match:
        raise ValueError(f"No JSON object in classifier response: {text!r}")
    return json.loads(match.group(0))


def classify_tone(audio_b64: str) -> dict:
    """Classify the tone of one user utterance. Returns {label, score, explanation}."""
    response = openai_client.chat.completions.create(
        model="gpt-audio-mini",
        # We want text out (the JSON label), not synthesized audio.
        modalities=["text"],
        messages=[
            {"role": "system", "content": TONE_SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": "Classify the tone of this audio clip:"},
                # `input_audio` content blocks accept base64-encoded WAV directly.
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ]},
        ],
    )
    parsed = _extract_json_object(response.choices[0].message.content)
    # Tolerate a missing, null, or padded label before checking it against the rails.
    label = str(parsed.get("label") or "neutral").strip().lower()
    if label not in TONE_RAILS:
        # Model went off-rails — snap to neutral so the eval row still lands.
        label = "neutral"
    return {
        "label": label,
        "score": TONE_SCORES[label],
        "explanation": parsed.get("explanation", ""),
    }
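One limitation of the regex in _extract_json_object: the non-greedy pattern stops at the first closing brace, so an explanation that itself contains a } is cut short and json.loads fails. If that turns out to matter for your data, a possible alternative is to let the standard library’s json.JSONDecoder.raw_decode find the end of the object. The function name below is hypothetical, and this is a sketch, not a drop-in requirement.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Parse the first JSON object found in `text`, ignoring anything around it."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one complete JSON value and ignores trailing
            # text, so braces inside string values and a closing markdown
            # fence after the object do not break parsing.
            obj, _end = decoder.raw_decode(text[start:])
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError(f"No JSON object in classifier response: {text!r}")


reply = '```json\n{"label": "negative", "explanation": "Sounds tense {sighs} throughout."}\n```'
print(extract_first_json_object(reply)["label"])  # negative
```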

Run the evaluation and log results back to Arize

Loop over each user-audio span, classify, and assemble an eval dataframe in the exact shape update_evaluations requires:
  • context.span_id — the span the eval attaches to
  • eval.<name>.label — string label (one of the rails)
  • eval.<name>.score — numeric score
  • eval.<name>.explanation — free-text justification from the model
# Build one row per USER span in the shape Arize's `update_evaluations`
# requires: `context.span_id` plus `eval.<name>.{label,score,explanation}` columns.
rows = []
for _, span in user_spans.iterrows():
    audio_b64 = extract_audio_b64(span)
    if audio_b64 is None:
        print(f"  skipping {span['context.span_id']}: no mic audio in span window")
        continue
    result = classify_tone(audio_b64)
    rows.append({
        "context.span_id": span["context.span_id"],
        "eval.tone.label": result["label"],
        "eval.tone.score": result["score"],
        "eval.tone.explanation": result["explanation"],
    })

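evals_df = pd.DataFrame(rows)

if evals_df.empty:
    # Nothing was classified (no USER spans, or no mic audio in any span window).
    print("No tone evaluations to log.")
else:
    arize_client.spans.update_evaluations(
        space_id=os.environ["ARIZE_SPACE_ID"],
        project_name=PROJECT_NAME,
        dataframe=evals_df,
    )
    print(f"Logged {len(evals_df)} tone evaluations to Arize.")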
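Before building the dataframe, it can be worth checking that every row has the four columns listed above and a label within the rails. The helper below is a hypothetical pre-flight check over plain dicts. It is not part of the Arize client, and the rules it enforces are only the ones this guide describes.

```python
REQUIRED_SUFFIXES = ("label", "score", "explanation")


def check_eval_rows(rows, eval_name="tone", rails=("positive", "neutral", "negative")):
    """Return a list of problems found in eval rows; an empty list means they look well-formed."""
    expected = {"context.span_id"} | {f"eval.{eval_name}.{s}" for s in REQUIRED_SUFFIXES}
    problems = []
    for i, row in enumerate(rows):
        missing = expected - set(row)
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
            continue
        if row[f"eval.{eval_name}.label"] not in rails:
            problems.append(f"row {i}: label not in rails")
        if not isinstance(row[f"eval.{eval_name}.score"], (int, float)):
            problems.append(f"row {i}: score is not numeric")
    return problems


example_rows = [
    {
        "context.span_id": "abc123",  # made-up span ID
        "eval.tone.label": "neutral",
        "eval.tone.score": 0.5,
        "eval.tone.explanation": "Even, matter-of-fact delivery.",
    },
    {"context.span_id": "def456", "eval.tone.label": "angry"},  # deliberately malformed
]
print(check_eval_rows(example_rows))
```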

Review the evaluation in Arize

Refresh your Arize AX project. Each USER span in the trace now carries the tone evaluation alongside the audio playback and transcript. Open a trace and you’ll see the label, score, and the model’s explanation.
[Figure: Audio tone evaluation shown on the USER span in the Arize AX trace view]
The label and score are filterable in the trace list view, so you can quickly find sessions where tone went negative, which is useful for digging into failure modes or sampling for regression review.

Reference

Background details and conventions worth knowing once you have the walkthrough working.

Key Realtime API events the instrumentor listens for

The OpenAI Agents SDK auto-instrumentor consumes the OpenAI Realtime API’s WebSocket events under the hood. The most consequential ones for tracing are:
  1. Session events
    • session.created — the session was opened
    • session.updated — session parameters changed (model, instructions, tools, VAD config)
  2. Audio input events
    • input_audio_buffer.speech_started — server-side VAD detected the user beginning to speak
    • input_audio_buffer.speech_stopped — server-side VAD detected the user finishing
    • input_audio_buffer.committed — audio buffer committed for processing
  3. Conversation events
    • conversation.item.created — a new conversation item (user message, function call, etc.) was added
  4. Response events
    • response.created — the model has started generating a response (becomes RealtimeAgentStartEvent at the SDK layer)
    • response.audio_transcript.delta — incremental transcript of the audio response
    • response.audio_transcript.done — transcript complete
    • response.audio.delta — output audio bytes (become RealtimeAudio events)
    • response.done — response finished (becomes RealtimeAgentEndEvent); may include function_call output items
  5. Error events
    • error — any error encountered during processing
You don’t need to handle these directly when using the Agents SDK + OpenInference instrumentor — they’re all consumed and mapped onto spans for you.

Semantic conventions

The OpenInference instrumentor populates the following span attributes. These are the keys you can filter, query, and write evaluations against in Arize AX.
  1. Session attributes
    • session.id — unique identifier for the session
  2. Audio attributes
    • input.audio.url — URL of the input audio (a data:audio/wav;base64,... URI by default, or any HTTPS URL if you’ve wired a custom storage backend)
    • input.audio.mime_type — MIME type of the input audio (e.g. audio/wav)
    • input.audio.transcript — transcript of the input audio
    • output.audio.url — URL of the output audio
    • output.audio.mime_type — MIME type of the output audio
    • output.audio.transcript — transcript of the output audio
  3. Span kind
    • openinference.span.kind: AUDIO for turn parents, USER for user inputs, LLM for assistant responses, TOOL for tool calls. See Voice Extensions for AUDIO and USER (instrumentor-introduced extensions) and the canonical span-kinds reference for LLM and TOOL.
  4. LLM attributes
    • llm.model_name — Realtime model used (e.g. gpt-4o-realtime-preview)
    • llm.invocation_parameters — session config as JSON
    • llm.token_count.prompt_details.audio — audio input tokens consumed
    • llm.token_count.completion_details.audio — audio output tokens generated
    • time_to_first_token_ms — latency from input commit to first response byte
  5. Tool attributes
    • tool.name — name of the called function
    • tool.arguments — JSON of the call arguments
    • Output value populated on the TOOL span when the function returns
  6. Error attributes
    • error.type — class of error
    • error.message — error detail
See the OpenInference semantic conventions spec for the canonical definitions and how they map to OpenTelemetry’s underlying attribute keys.

Implementation considerations

A few things that come up once you move past the demo:
  • Audio length: raise OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH to whatever your longest expected turn needs. Truncated base64 produces a broken WAV that the trace UI cannot play and the evaluator cannot decode. If your turns are long enough to bump the cap regularly, swap the inline data URI for an HTTPS URL pointing at your storage of choice (S3, GCS, your own CDN) using a custom SpanProcessor or wrapping SpanExporter.
  • Cost of audio tokens: audio input and output tokens are billed at a higher rate than text tokens. Track llm.token_count.prompt_details.audio and llm.token_count.completion_details.audio on the assistant span as you would any cost metric.
  • Redaction: if your traces will leave a controlled environment, set OPENINFERENCE_HIDE_INPUT_AUDIO=true and OPENINFERENCE_HIDE_OUTPUT_AUDIO=true. Transcripts and durations stay; the raw audio attributes are dropped before the span ships.
  • Voice evaluations are still primarily an OpenAI capability: gpt-audio-mini is the audio-capable model used in the eval step. Other providers’ audio models can be substituted with the same dataframe-shaping pattern; the Arize update_evaluations half doesn’t care which LLM produced the labels.
  • Replay-friendly storage: if you swap to external storage, key your audio objects by <timestamp>_<trace_id>_<span_id>_<input|output>.wav (or a similar scheme involving the trace + span IDs). That makes each clip traceable back to the span that produced it — useful when you want to spot-check a particular trace months later.
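The replay-friendly naming scheme from the last bullet can be sketched as a small helper. The timestamp format (compact UTC) and the function name are assumptions; the guide only specifies the order of the parts.

```python
from datetime import datetime, timezone


def audio_object_key(trace_id: str, span_id: str, direction: str, when: datetime) -> str:
    """Build '<timestamp>_<trace_id>_<span_id>_<input|output>.wav' for an audio clip."""
    if direction not in ("input", "output"):
        raise ValueError("direction must be 'input' or 'output'")
    # Normalise to UTC so keys sort chronologically regardless of where they were written.
    # Pass a timezone-aware datetime; naive values are interpreted as local time.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}_{trace_id}_{span_id}_{direction}.wav"


key = audio_object_key(
    trace_id="4bf92f3577b34da6",   # made-up IDs for illustration
    span_id="00f067aa0ba902b7",
    direction="input",
    when=datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(key)  # 20250102T030405Z_4bf92f3577b34da6_00f067aa0ba902b7_input.wav
```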
