Gemini Audio Evals

This notebook is adapted from Google's "Gemini API: Audio Quickstart Notebook" and provides an example of how to prompt Gemini Flash using an audio file.

In this case, you'll use a sound recording of President John F. Kennedy’s 1961 State of the Union address.

This notebook performs the following tasks:

  1. Prompt Gemini to generate a transcript of the audio recording.

  2. Trace the Gemini API calls and send the traces to the Arize platform, with links to the audio file for playback.

  3. Evaluate the sentiment of the transcription output from Gemini using Phoenix Evals, with Gemini as the judge (LLM-as-a-Judge).

Install dependencies

%pip install -q -U google-genai arize-phoenix-evals arize opentelemetry-api opentelemetry-sdk openinference-semantic-conventions arize-otel
from google import genai

Configure your Gemini API key

To run the following cell, enter your API key when prompted, or store it in a Colab Secret named GEMINI_API_KEY and load it from there (see the sketch after the cell). If you don't already have an API key, or you're not sure how to create a Colab Secret, see Authentication for an example.

import getpass

#from google.colab import userdata

GEMINI_API_KEY = getpass.getpass(prompt="Enter your Gemini API Key: ")
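
If you're running in Colab and have already stored the key as a Colab Secret, you can read it from there instead of typing it in. A minimal sketch, assuming the secret is named GEMINI_API_KEY:

# Alternative for Colab: read the key from a Colab Secret named GEMINI_API_KEY
from google.colab import userdata

GEMINI_API_KEY = userdata.get("GEMINI_API_KEY")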

Load an audio file sample and set the URL

# Audio file URL --> lets you play the audio back in the UI
URL = "https://storage.googleapis.com/generativeai-downloads/data/State_of_the_Union_Address_30_January_1961.mp3"
!wget -q $URL -O sample.mp3
gemini_client = genai.Client(api_key=GEMINI_API_KEY)

your_file = gemini_client.files.upload(file='sample.mp3')
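
Larger audio files can take a moment to finish processing after upload. A minimal sketch that uses the File API's files.get to poll until the file is ready before prompting with it:

import time

# Poll the File API until the uploaded file finishes processing
while your_file.state.name == "PROCESSING":
    time.sleep(2)
    your_file = gemini_client.files.get(name=your_file.name)

print(your_file.state.name)  # "ACTIVE" once the file is ready to use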

Tracing setup

You'll need to set your Arize AX credentials (Space ID and API key) below to send traces to the Arize AX platform. Sign up for free here.

from arize.otel import register
from opentelemetry.trace import Status, StatusCode


ARIZE_SPACE_ID = getpass.getpass(prompt="Enter your ARIZE SPACE ID Key: ")
ARIZE_API_KEY = getpass.getpass(prompt="Enter your ARIZE API Key: ")
PROJECT_NAME = "gemini-audio"  # Set this to any name you'd like for your app

# Setup OTel via our convenience function
tracer_provider = register(
    space_id = ARIZE_SPACE_ID, # in app space settings page
    api_key = ARIZE_API_KEY, # in app space settings page
    project_name = PROJECT_NAME,
)

# Get a tracer from the provider returned by register(); it supports the
# openinference_span_kind argument used in the next cell
tracer = tracer_provider.get_tracer(__name__)

Configure prompt

prompt = "Provide a transcript of the speech from 01:00 to 01:30."

Call Gemini

# Call Gemini

with tracer.start_as_current_span(
    "process_audio",
    openinference_span_kind="llm",
) as span:
  span.set_attribute("input.audio.url", URL)
  span.set_attribute("llm.prompts", prompt)
  span.set_attribute("input.value", prompt)
  response = gemini_client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[
      prompt,
      your_file,
    ]
  )
  span.set_attribute("input.audio.transcript", response.text)
  span.set_attribute("output.value", response.text)
  span.set_status(Status(StatusCode.OK))

response.text
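
You can also inspect token usage on the response. A small sketch, relying on the usage_metadata field the google-genai SDK attaches to responses:

# Token accounting for the call
if response.usage_metadata:
    print("prompt tokens:", response.usage_metadata.prompt_token_count)
    print("output tokens:", response.usage_metadata.candidates_token_count)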

Evaluate Gemini's output transcript for sentiment analysis

First, export the spans that contain the transcript output from Arize.

print('#### Installing arize SDK')

! pip install "arize[Tracing]>=7.1.0"

print('#### arize SDK installed!')

import os

os.environ['ARIZE_API_KEY'] = ARIZE_API_KEY

from datetime import datetime, timezone, timedelta

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

client = ArizeExportClient()

print('#### Exporting your primary dataset into a dataframe.')

primary_df = client.export_model_to_df(
    space_id=ARIZE_SPACE_ID,  # collected earlier during tracing setup
    model_id=PROJECT_NAME,
    where="name = 'process_audio'",  # only pull the spans named "process_audio"
    environment=Environments.TRACING,
    start_time=datetime.now(timezone.utc) - timedelta(days=1),  # traces from the last 24 hours
    end_time=datetime.now(timezone.utc),
)

# Copy the transcript into a column whose name matches the {output} variable in the eval template
primary_df["output"] = primary_df["attributes.output.value"]

Evaluation Template

SENTIMENT_EVAL_TEMPLATE = """

You are a helpful AI bot that checks for the sentiment in the output text. Your task is to evaluate the sentiment of the given output and categorize it as positive, neutral, or negative.

Here is the data:
[BEGIN DATA]
============
[Output]: {output}
============
[END DATA]

Determine the sentiment of the output based on the content and context provided. Your LABEL must be ONLY a single word, either "positive", "neutral", or "negative", and must not contain any text or characters aside from that word.

Then write out, in a step-by-step manner, an EXPLANATION that shows how you determined the sentiment of the output. Do not include any text or characters aside from the EXPLANATION.

Your response should follow the format of the example response below. Provide a single LABEL and a single EXPLANATION. Do not include any special characters, such as "#", in your response.

Example response:

EXPLANATION: An explanation of your reasoning for why the label is "positive", "neutral", or "negative"
LABEL: "positive" or "neutral" or "negative"

"""

Evaluate transcriptions with Gemini as the judge (LLM-as-a-Judge)

# Gemini as LLM-as-a-Judge via llm_classify

# Authenticate with Google Cloud to access the Gemini model
!gcloud auth application-default login
!gcloud config set project audioevals # you must have a valid project ID in your Google Cloud account first

import pandas as pd
from phoenix.evals import (GeminiModel, llm_classify)

# We will use Gemini 1.5 Pro to evaluate the text transcription
project_id = "audioevals" # Set this to your Google Cloud project ID
gemini_model = GeminiModel(model="gemini-1.5-pro", project=project_id)

rails = ["positive", "neutral", "negative"]

evals_df = llm_classify(
    data=primary_df,
    template=SENTIMENT_EVAL_TEMPLATE,
    model=gemini_model,
    rails=rails,
    provide_explanation=True
)

# Set the eval label columns expected by Arize and attach the span IDs
evals_df["eval.sentiment.label"] = evals_df["label"]
evals_df["eval.sentiment.explanation"] = evals_df["explanation"]
evals_df["context.span_id"] = primary_df["context.span_id"]

evals_df.head()
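
A quick way to summarize the judge's labels before logging them back to Arize:

# Distribution of sentiment labels assigned by the LLM judge
evals_df["eval.sentiment.label"].value_counts()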

Send evaluations to Arize

from arize.pandas.logger import Client


# Initialize the Arize client with the space ID and API key you used for tracing
arize_client = Client(
    space_id=ARIZE_SPACE_ID,
    api_key=ARIZE_API_KEY,
)

# Send the evaluation results to Arize, attaching them to the same project
arize_client.log_evaluations_sync(evals_df, PROJECT_NAME)

Next Steps

Useful API references:

More details about the Gemini API's audio capabilities are in the documentation.

If you want to know about the File API, check its API reference or the File API quickstart.

Check this example using audio files to get more ideas on what the Gemini API can do with them.

Continue your discovery of the Gemini API

Learn more about prompting with media files in the docs, including the supported formats and maximum length for audio files.
