This guide must be run locally on a machine with a microphone, because audio capture requires physical hardware that Colab cannot provide. The companion notebook is at Arize-ai/tutorials. Download it and run it in Jupyter, or follow the steps below to assemble the same code in a regular .py file.
The guide is built on openinference-instrumentation-openai-agents. You will learn:
- How a single OpenAIAgentsInstrumentor().instrument(...) call auto-traces an agents.realtime.RealtimeSession, capturing audio, transcripts, token counts, and tool calls without any manual span code
- The span tree the instrumentor produces (one AUDIO parent per conversational turn, with USER, LLM, and TOOL children)
- The Agents-SDK event stream (RealtimeAudio, RealtimeAudioInterrupted, RealtimeAgentEndEvent, RealtimeToolStart) and how to drive a live microphone + speaker loop on top of it
- How to evaluate the captured audio with an audio-aware OpenAI model and log the results back to Arize
Initial setup
You will need an Arize AX account to run this guide. Sign up now for free if you don’t have one. You also need an OpenAI API key with access to the Realtime API.
Create a project directory and virtual environment
Create a new directory for your script and a Python virtual environment inside it.
Install libraries
Install the dependencies you’ll use across the rest of the guide: the OpenAI Agents SDK, the OpenInference instrumentor, the arize-otel helper, the arize client for exporting traces and logging evaluations, and sounddevice + numpy + pandas for audio I/O and dataframe work.
On Linux you may also need to install PortAudio, the system library that sounddevice links against. macOS and Windows wheels bundle PortAudio already.
Set environment variables
The script reads three secrets from environment variables. Find your Arize Space ID and API Key on your Space Settings page:
Export them in the same shell session you will run the code from:
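A quick pre-flight check can catch a missing variable before the session starts. The sketch below assumes the three secrets are named ARIZE_SPACE_ID, ARIZE_API_KEY, and OPENAI_API_KEY; those names are an assumption here, so adjust them to match whatever you exported.

```python
import os

# Assumed variable names; the guide only states that there are three secrets.
REQUIRED_VARS = ("ARIZE_SPACE_ID", "ARIZE_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Report what is missing instead of failing later inside the session.
missing = missing_env_vars()
print("Missing variables:", ", ".join(missing) if missing else "none")
```

Running this in the same shell you exported the variables from should report "none".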
What you’ll see in Arize
The OpenAI Agents instrumentor turns each turn of a Realtime conversation into a complete span tree, automatically. You write no tracing code yourself: the instrumentor patches RealtimeSession and emits spans as audio, tool calls, and responses flow through the WebSocket.
For each conversational turn, the tree looks like this:
- The parent span uses the AUDIO span kind, an instrumentor-introduced extension used for the conversational-turn parent. Arize AX renders it with audio-aware UI: a play button, waveform, and the input/output transcripts inline.
- input.audio.url and output.audio.url carry the captured audio as inline data:audio/wav;base64,… URIs. The bytes ride on the span itself, so no separate audio storage is needed.
- time_to_first_token_ms on the assistant span is the latency from the user’s audio commit to the first byte of the model’s audio response, the metric that matters most for perceived voice-agent responsiveness.
- Token counts include the audio-specific breakdowns llm.token_count.prompt_details.audio and llm.token_count.completion_details.audio, so you can see audio vs text token cost separately.
Defining the voice agent
Three pieces make up the agent we’ll wire to the Realtime API: the tools (Python functions the model can call), the agent definition (instructions and a tool list), and the audio plumbing that pipes the microphone in and the speaker out. None of this code is tracing-specific; it’s the same shape you’d write without observability.
Tools
Two tools: get_weather and get_current_time. They’re trivial dummies, enough to exercise a tool round-trip in the Realtime API.
The @function_tool decorator from the Agents SDK introspects the function’s type hints and docstring to generate the JSON schema the model receives — no manual tool-spec dict needed.
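As a rough sketch of what such dummy tools can look like: the bodies below are hypothetical (a canned weather string and a small fixed UTC-offset table that ignores daylight saving), not the notebook's exact code. In the guide each function would carry the @function_tool decorator; it is left off here so the snippet runs without the SDK installed. The type hints and docstrings are the parts the decorator would read.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets for the demo; real code would use zoneinfo.
_UTC_OFFSETS = {"london": 0, "tokyo": 9, "new york": -5}


def get_weather(city: str) -> str:
    """Return a canned weather report for the given city."""
    return f"The weather in {city} is sunny and 22 degrees Celsius."


def get_current_time(city: str) -> str:
    """Return the current local time for a small set of known cities."""
    offset = _UTC_OFFSETS.get(city.strip().lower())
    if offset is None:
        return f"Sorry, I don't know the time zone for {city}."
    now = datetime.now(timezone(timedelta(hours=offset)))
    return f"The current time in {city} is {now:%H:%M}."
```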
The agent
A RealtimeAgent pairs a system prompt with a tool list for the OpenAI Realtime API. It’s consumed by RealtimeRunner, which negotiates the WebSocket connection and handles the bidirectional audio stream.
Audio plumbing
The OpenAI Realtime API speaks 24 kHz PCM16 mono on the wire. We use sounddevice to bridge the system’s microphone and speaker to that format.
sounddevice runs the input stream’s callback on a background thread; the callback enqueues mic bytes into an asyncio.Queue that the async send loop drains. For playback we use a non-callback OutputStream and write to it synchronously via loop.run_in_executor(...), letting PortAudio own the buffering.
The callback also accepts an optional recording list. When supplied, every chunk is appended to it with a time.time_ns() timestamp. We use this local copy in the evaluation step at the end of the guide — see that section for the reason it’s needed and the conditions under which you can skip it.
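The mic-side hand-off can be sketched as follows. This is a minimal version under stated assumptions: the factory name make_mic_callback and the recording argument come from the text above, while the loop and queue parameters and the (timestamp, bytes) tuple layout are choices made for this sketch. The four-argument callback signature follows sounddevice's raw input-stream callbacks.

```python
import asyncio
import time


def make_mic_callback(loop, queue, recording=None):
    """Build an input-stream callback that forwards mic bytes to an asyncio queue."""

    def callback(indata, frames, time_info, status):
        # Copy the data: the audio backend may reuse the underlying buffer.
        chunk = bytes(indata)
        if recording is not None:
            # Keep a timestamped local copy for the evaluation step.
            recording.append((time.time_ns(), chunk))
        # asyncio.Queue is not thread-safe, and this callback runs on the
        # audio thread, so schedule the put on the event loop's own thread.
        loop.call_soon_threadsafe(queue.put_nowait, chunk)

    return callback
```

With sounddevice installed, the returned callback would typically be passed as the callback argument of an input stream opened at 24 kHz, one channel, int16.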
Setting up tracing
This is the only tracing code you need to write for the whole guide: one register(...) call and one OpenAIAgentsInstrumentor().instrument(...) call. The instrumentor patches agents.realtime.RealtimeSession to emit the span tree shown above; all spans flow through the registered TracerProvider to Arize AX.
By default, captured audio rides inline on the span attributes as data:audio/wav;base64,… URIs — no separate storage needed. The inline payload is capped by OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH (default 32000 characters ≈ 0.5 s of 24 kHz mono PCM16). We raise it to 2000000 (~30 s) below, which is plenty for typical voice turns. Set this env var before calling instrument(...) — the instrumentor reads it at patch time.
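The relationship between the character cap and audio duration is simple arithmetic, sketched below. It is an approximation that ignores the 44-byte WAV header and the data-URI prefix; the helper name is made up for this example.

```python
import os

SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # PCM16, mono


def max_inline_audio_seconds(max_base64_chars: int) -> float:
    """Approximate seconds of audio that fit in a base64 payload of this length."""
    raw_bytes = max_base64_chars * 3 // 4  # base64 encodes 3 bytes as 4 characters
    return raw_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


# Set the cap before instrument(...) is called, as noted above.
os.environ["OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH"] = "2000000"

print(max_inline_audio_seconds(32_000))     # 0.5
print(max_inline_audio_seconds(2_000_000))  # 31.25
```

The same function can be inverted by hand to size the cap for your longest expected turn.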
Running a voice session
This is the live mic/speaker loop. It opens an agents.realtime.RealtimeSession via RealtimeRunner, pumps mic audio into it, plays back the assistant’s audio, and exits cleanly after one full conversational turn, including any tool calls and the assistant’s follow-up response.
Three async tasks run concurrently inside the session:
- send_mic: drains the mic queue and forwards chunks to the session via session.send_audio(...)
- handle_events: consumes the session’s event stream and dispatches on event type
- hard_timer: sets stop after MAX_SESSION_SECONDS as a safety net
RealtimeAgentEndEvent fires after every response.done from the API — including the response where the model decides to call a tool, before the tool runs and the follow-up response. So we schedule the exit after EXIT_GRACE_PERIOD seconds and cancel it if a RealtimeToolStart or a new RealtimeAgentStartEvent follows within the window. For tool-free turns the grace window simply elapses and the session exits.
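The schedule-then-cancel logic can be sketched in plain asyncio, independent of the SDK. The GracefulExit class and the string event names below are hypothetical stand-ins for the real event handler and event types; only EXIT_GRACE_PERIOD and the stop event are named in the guide, and the grace value is shortened here so the demo finishes quickly.

```python
import asyncio

EXIT_GRACE_PERIOD = 0.05  # seconds; shortened for this demo


class GracefulExit:
    """Set `stop` after a grace period unless cancelled first."""

    def __init__(self, stop: asyncio.Event, grace: float = EXIT_GRACE_PERIOD):
        self._stop = stop
        self._grace = grace
        self._task = None

    def schedule(self):
        # Restart the countdown on every agent-end event.
        self.cancel()
        self._task = asyncio.get_running_loop().create_task(self._fire())

    def cancel(self):
        # Called when a tool start or a new agent start follows.
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def _fire(self):
        await asyncio.sleep(self._grace)
        self._stop.set()


async def simulate(events):
    """Feed stand-in event names through the exit logic; return whether stop fired."""
    stop = asyncio.Event()
    exit_timer = GracefulExit(stop)
    for name in events:
        if name == "agent_end":
            exit_timer.schedule()
        elif name in ("tool_start", "agent_start"):
            exit_timer.cancel()
    await asyncio.sleep(EXIT_GRACE_PERIOD * 4)
    exit_timer.cancel()
    return stop.is_set()
```

A tool-free turn is the sequence ["agent_end"], which lets the timer elapse. A tool turn looks like ["agent_end", "tool_start", "agent_start", "agent_end"], where only the final agent end leads to an exit.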
Example prompts to try:
- “What’s the weather in London?” exercises a tool call (get_weather)
- “What time is it in Tokyo?” exercises the other tool (get_current_time)
- “Tell me a fun fact about Python” involves no tool call and tests the simpler one-response path
Server-side voice activity detection (configured through turn_detection) detects when you’ve finished speaking and commits your audio buffer automatically.
See your traces in Arize
Head to your Arize AX project (openai-realtime-voice) to see the traces. Each conversation turn appears as a conversation.turn span (kind: AUDIO) containing user, assistant, and any <tool_name> children.
Things to look at in the trace view:
- Audio playback: click the play button on the user and assistant spans. The audio is served from the inline data:audio/wav;base64,... URI on the span attribute.
- Transcripts: the input.audio.transcript and output.audio.transcript attributes show what the model heard and what it said. Useful for debugging mishearings.
- Tool calls: child TOOL spans under the assistant span show the function name, arguments JSON, and the value the function returned.
- Latency: time_to_first_token_ms on the assistant span gives the user-perceived “how fast did the agent start talking” latency.
- Audio token cost: llm.token_count.prompt_details.audio and llm.token_count.completion_details.audio break out audio tokens separately from text tokens.
Audio redaction
The OpenInference instrumentor recognises three environment variables for controlling what audio data ends up on spans. Set them before calling instrument(...):
- OPENINFERENCE_HIDE_INPUT_AUDIO=true: drop input.audio.* from USER spans
- OPENINFERENCE_HIDE_OUTPUT_AUDIO=true: drop output.audio.* from LLM spans
- OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH=<n>: cap the inline base64 payload length (default 32000)
The TraceConfig(hide_inputs=True) and TraceConfig(hide_outputs=True) settings also cascade to the corresponding audio attributes.
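If you prefer to set these from Python rather than the shell, a small helper keeps the choices in one place. The helper name is hypothetical; the variable names and the "true" value are the ones listed above, and a variable is only set when the corresponding option is switched on.

```python
import os


def audio_redaction_env(hide_input=False, hide_output=False, max_base64_chars=None):
    """Return the environment settings for the chosen audio redaction options."""
    settings = {}
    if hide_input:
        settings["OPENINFERENCE_HIDE_INPUT_AUDIO"] = "true"
    if hide_output:
        settings["OPENINFERENCE_HIDE_OUTPUT_AUDIO"] = "true"
    if max_base64_chars is not None:
        settings["OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH"] = str(max_base64_chars)
    return settings


# Apply before instrument(...) is called.
os.environ.update(audio_redaction_env(hide_input=True))
```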
Evaluating the voice session
Now that traces are flowing into Arize, let’s run an evaluation against the captured audio. We’ll classify the tone of each user utterance as positive, neutral, or negative (a classic voice-agent quality signal) and ship the results back to Arize so they appear on the same spans in the UI.
There’s one architectural wrinkle to know about up front. When the OpenInference instrumentor uploads inline data:audio/wav;base64,… URIs to Arize, the backend stores the audio in its own multimodal blob bucket and the export client returns an internal gs://arize-multimodal-prod/… reference that external code can’t authenticate against. We work around this by keeping a local copy of the mic audio as it’s captured (the mic_recording list populated by the mic callback), and slicing it per USER span using the span’s start and end times from the Arize export.
If you swap the inline-base64 path for external cloud storage (the tip in the tracing section shows how: a custom SpanProcessor that puts an S3 / GCS / CDN URL on input.audio.url instead of the inline data URI), this workaround disappears. The audio URL on each USER span points at an object you control and can fetch directly. You can then drop the recording=mic_recording argument from make_mic_callback, delete the extract_audio_b64 slicing helper, and replace it with a plain HTTP fetch of attributes.input.audio.url. The local-recording dance below exists only because the default inline-base64 path stores audio in Arize-controlled storage.
The evaluation itself has four steps:
- Export the USER spans from Arize. We only need them for span IDs and time bounds, not the audio bytes
- Slice mic_recording between each USER span’s start_time and end_time, then wrap the raw PCM in a WAV header
- Classify the resulting audio with gpt-audio-mini (OpenAI’s audio-input chat model)
- Log the evaluations back to Arize via client.spans.update_evaluations(...)
Phoenix’s phoenix.evals framework was rewritten in late 2025 and its prompt templates currently support text content only; multimodal/audio content blocks aren’t supported yet. So instead of using create_classifier(...), we call OpenAI directly and shape the result into the eval-dataframe format Arize expects.
Export the USER spans from Arize
client.spans.export_to_df(...) returns a flat pandas DataFrame with one row per span. We only need three columns from it: context.span_id (where the eval attaches), start_time, and end_time. The audio bytes themselves come from mic_recording.
Slice the user audio out of the local recording
For each USER span, slice mic_recording between the span’s start_time and end_time (both pandas.Timestamp values, convertible to ns-since-epoch via .value). The slice is raw PCM16 mono at 24 kHz; we wrap it in a WAV header using the standard-library wave module, base64-encode the result, and hand it to the classifier.
If no mic chunk falls in the span’s window, the helper returns None and we skip that span — that can happen for very short utterances right at the start or end of the recording.
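A minimal sketch of the slicing helper, using only the standard library. The name extract_audio_b64 comes from the guide; the signature (a list of (timestamp_ns, bytes) tuples plus integer nanosecond bounds) is an assumption of this sketch, and chunks are selected by the timestamp recorded when they arrived.

```python
import base64
import io
import wave

SAMPLE_RATE = 24_000  # Hz
SAMPLE_WIDTH = 2      # bytes per sample (PCM16)
CHANNELS = 1          # mono


def extract_audio_b64(recording, start_ns, end_ns):
    """Return a base64 WAV of the chunks stamped within [start_ns, end_ns], or None."""
    pcm = b"".join(chunk for ts, chunk in recording if start_ns <= ts <= end_ns)
    if not pcm:
        return None  # no mic chunk fell inside the span's window

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)  # header sizes are fixed up when the writer closes
    return base64.b64encode(buffer.getvalue()).decode("ascii")
```

With the exported dataframe, a call would look like extract_audio_b64(mic_recording, row["start_time"].value, row["end_time"].value).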
Define the tone classifier
The classifier mirrors what a classic emotion-classification template does:
- Task description: a system prompt that tells the model what to listen for and how to reply
- Rails: the fixed set of valid labels (positive / neutral / negative); the prompt forbids anything else
- Output format: a small JSON object with label and explanation, parsed back into the score columns Arize expects
gpt-audio-mini accepts an input_audio content block alongside text. We send the system prompt as one message and a short text + audio block as the user message. modalities=["text"] tells the model to reply in text only (not audio). The full gpt-audio model would also work — gpt-audio-mini is cheaper and plenty accurate for three-way tone classification.
Audio chat models don’t support response_format={"type": "json_object"} (yet). We pin the output shape with a strong prompt instruction and parse leniently: _extract_json_object pulls the first {...} block out of the response, which survives minor deviations like the model wrapping its reply in markdown fences.
Run the evaluation and log results back to Arize
Loop over each user-audio span, classify, and assemble an eval dataframe in the exact shape update_evaluations requires:
- context.span_id: the span the eval attaches to
- eval.<name>.label: string label (one of the rails)
- eval.<name>.score: numeric score
- eval.<name>.explanation: free-text justification from the model
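The column layout above can be assembled as plain rows before they go into a dataframe. In this sketch the label-to-score mapping is an assumption (the guide only says the score is numeric), and build_eval_rows is a hypothetical helper name.

```python
EVAL_NAME = "tone"

# Assumed mapping from label to numeric score; adjust to your own convention.
LABEL_SCORES = {"positive": 1.0, "neutral": 0.5, "negative": 0.0}


def build_eval_rows(results, eval_name=EVAL_NAME):
    """Turn (span_id, label, explanation) tuples into eval rows.

    Results whose label is not one of the rails are skipped.
    """
    rows = []
    for span_id, label, explanation in results:
        if label not in LABEL_SCORES:
            continue
        rows.append(
            {
                "context.span_id": span_id,
                f"eval.{eval_name}.label": label,
                f"eval.{eval_name}.score": LABEL_SCORES[label],
                f"eval.{eval_name}.explanation": explanation,
            }
        )
    return rows
```

Passing the rows to pandas.DataFrame(rows) gives one row per span with the four columns listed above, ready to hand to update_evaluations.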
Review the evaluation in Arize
Refresh your Arize AX project. Each USER span in the trace now carries the tone evaluation alongside the audio playback and transcript. Open a trace and you’ll see the label, score, and the model’s explanation:
The label and score are filterable in the trace list view, so you can quickly find sessions where tone went negative — useful for digging into failure modes or sampling for regression review.
Reference
Background details and conventions worth knowing once you have the walkthrough working.
Key Realtime API events the instrumentor listens for
The OpenAI Agents SDK auto-instrumentor consumes the OpenAI Realtime API’s WebSocket events under the hood. The most consequential ones for tracing are:
- Session events
  - session.created: the session was opened
  - session.updated: session parameters changed (model, instructions, tools, VAD config)
- Audio input events
  - input_audio_buffer.speech_started: server-side VAD detected the user beginning to speak
  - input_audio_buffer.speech_stopped: server-side VAD detected the user finishing
  - input_audio_buffer.committed: audio buffer committed for processing
- Conversation events
  - conversation.item.created: a new conversation item (user message, function call, etc.) was added
- Response events
  - response.created: the model has started generating a response (becomes RealtimeAgentStartEvent at the SDK layer)
  - response.audio_transcript.delta: incremental transcript of the audio response
  - response.audio_transcript.done: transcript complete
  - response.audio.delta: output audio bytes (become RealtimeAudio events)
  - response.done: response finished (becomes RealtimeAgentEndEvent); may include function_call output items
- Error events
  - error: any error encountered during processing
Semantic conventions
The OpenInference instrumentor populates the following span attributes. These are the keys you can filter, query, and write evaluations against in Arize AX.
- Session attributes
  - session.id: unique identifier for the session
- Audio attributes
  - input.audio.url: URL of the input audio (a data:audio/wav;base64,... URI by default, or any HTTPS URL if you’ve wired a custom storage backend)
  - input.audio.mime_type: MIME type of the input audio (e.g. audio/wav)
  - input.audio.transcript: transcript of the input audio
  - output.audio.url: URL of the output audio
  - output.audio.mime_type: MIME type of the output audio
  - output.audio.transcript: transcript of the output audio
- Span kind
  - openinference.span.kind: AUDIO for turn parents, USER for user inputs, LLM for assistant responses, TOOL for tool calls. See Voice Extensions for AUDIO and USER (instrumentor-introduced extensions) and the canonical span-kinds reference for LLM and TOOL.
- LLM attributes
  - llm.model_name: Realtime model used (e.g. gpt-4o-realtime-preview)
  - llm.invocation_parameters: session config as JSON
  - llm.token_count.prompt_details.audio: audio input tokens consumed
  - llm.token_count.completion_details.audio: audio output tokens generated
  - time_to_first_token_ms: latency from input commit to first response byte
- Tool attributes
  - tool.name: name of the called function
  - tool.arguments: JSON of the call arguments
  - Output value populated on the TOOL span when the function returns
- Error attributes
  - error.type: class of error
  - error.message: error detail
Implementation considerations
A few things that come up once you move past the demo:
- Audio length: raise OPENINFERENCE_BASE64_AUDIO_MAX_LENGTH to whatever your longest expected turn needs. Truncated base64 produces a broken WAV that the trace UI cannot play and the evaluator cannot decode. If your turns are long enough to bump the cap regularly, swap the inline data URI for an HTTPS URL pointing at your storage of choice (S3, GCS, your own CDN) using a custom SpanProcessor or wrapping SpanExporter.
- Cost of audio tokens: audio input and output tokens are billed at a higher rate than text tokens. Track llm.token_count.prompt_details.audio and llm.token_count.completion_details.audio on the assistant span as you would any cost metric.
- Redaction: if your traces will leave a controlled environment, set OPENINFERENCE_HIDE_INPUT_AUDIO=true and OPENINFERENCE_HIDE_OUTPUT_AUDIO=true. Transcripts and durations stay; the raw audio attributes are dropped before the span ships.
- Voice evaluations are still primarily an OpenAI capability: gpt-audio-mini is the audio-capable model used in the eval step. Other providers’ audio models can be substituted with the same dataframe-shaping pattern; the Arize update_evaluations half doesn’t care which LLM produced the labels.
- Replay-friendly storage: if you swap to external storage, key your audio objects by <timestamp>_<trace_id>_<span_id>_<input|output>.wav (or a similar scheme involving the trace + span IDs). That makes each clip traceable back to the span that produced it, which is useful when you want to spot-check a particular trace months later.
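The replay-friendly key scheme can be sketched in a few lines. The helper name and the compact UTC timestamp format are assumptions of this example; only the overall <timestamp>_<trace_id>_<span_id>_<input|output>.wav layout comes from the text above.

```python
from datetime import datetime, timezone


def audio_object_key(trace_id, span_id, direction, when=None):
    """Build a storage key of the form <timestamp>_<trace_id>_<span_id>_<direction>.wav."""
    if direction not in ("input", "output"):
        raise ValueError("direction must be 'input' or 'output'")
    # Normalise to UTC so keys sort consistently across machines.
    moment = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    return f"{stamp}_{trace_id}_{span_id}_{direction}.wav"
```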
Read more
- Companion notebook on GitHub: the full runnable version of this guide
- OpenInference OpenAI Agents instrumentor: source, changelog, and the canonical realtime_with_tools.py example
- OpenAI Agents SDK realtime docs: RealtimeAgent, RealtimeRunner, and the full event reference
- OpenAI Realtime API reference: the underlying WebSocket protocol the SDK speaks
- OpenAI audio input guide: the gpt-audio / gpt-audio-mini API used by the evaluator
- OpenInference semantic conventions: the formal definitions for span kinds and audio attributes