LlamaIndex Evals
This guide demonstrates how to use Arize for monitoring and debugging your LLM application using traces and spans. We're going to use data from a chatbot built on top of the Arize docs (https://arize.com/docs/ax), with example queries and retrieved text, and evaluate how well our RAG system is working.
In this tutorial we will:
Build a RAG application using Llama-Index
Set up Phoenix as a trace collector for the Llama-Index application
Use Phoenix's evals library to compute LLM-generated evaluations of our RAG app responses
Use the Arize SDK to export the traces and evaluations to Arize
You can read more about LLM tracing in Arize here.
Install Dependencies
Let's get the notebook set up with the dependencies we need.
# Dependencies needed to build the Llama Index RAG application
!pip install -qq gcsfs llama-index-llms-openai llama-index-embeddings-openai llama-index-core
# Dependencies needed to export spans and send them to our collector: Phoenix
!pip install -qq llama-index-callbacks-arize-phoenix
# Install Phoenix to generate evaluations
!pip install -qq "arize-phoenix[evals]>7.0.0"
# Install Arize SDK with `Tracing` extra dependencies to export Phoenix data to Arize
!pip install -qq "arize>7.29.0"
Set up Phoenix as a Trace Collector in our LLM app
To get started, launch the Phoenix app. Make sure to open the app in your browser using the link printed below.
import phoenix as px
session = px.launch_app()
Once you have started a Phoenix server, you can start your LlamaIndex application and configure it to send traces to Phoenix. To do this, you will have to configure Phoenix as the global handler.
from llama_index.core import set_global_handler
set_global_handler("arize_phoenix")
That's it! The Llama-Index application we build next will send traces to Phoenix.
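If you need to reopen the Phoenix UI later, you can print the session URL at any time. A minimal sketch, assuming session is the object returned by px.launch_app() above:
# Print the URL of the running Phoenix UI
print(session.url)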
Build Your Llama Index RAG Application
We start by setting your OpenAI API key if it is not already set as an environment variable.
import os
from getpass import getpass
OPENAI_API_KEY = globals().get("OPENAI_API_KEY") or getpass(
    "🔑 Enter your OpenAI API key: "
)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
This example uses a RetrieverQueryEngine over a pre-built index of the Arize documentation, but you can use whatever LlamaIndex application you like. Download the pre-built index of the Arize docs from cloud storage and instantiate your storage context.
from gcsfs import GCSFileSystem
from llama_index.core import StorageContext
file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
    fs=file_system,
    persist_dir=index_path,
)
We are now ready to instantiate the query engine that will perform retrieval-augmented generation (RAG). A query engine is a generic interface in LlamaIndex that allows you to ask questions over your data: it takes in a natural language query and returns a rich response. Query engines are built on top of retrievers, and you can compose multiple query engines to achieve more advanced capabilities.
from llama_index.llms.openai import OpenAI
from llama_index.core import (
    Settings,
    load_index_from_storage,
)
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
index = load_index_from_storage(
    storage_context,
)
query_engine = index.as_query_engine()
Let's test our app by asking a question about the Arize documentation:
response = query_engine.query(
    "What is Arize and how can it help me as an AI Engineer?"
)
print(response)
Great! Our application works!
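If you want to see which documentation chunks were retrieved to ground that answer, the LlamaIndex response object exposes them via response.source_nodes. A minimal sketch for inspecting them:
# Inspect the retrieved chunks and their similarity scores
for node_with_score in response.source_nodes:
    print(node_with_score.score)
    print(node_with_score.node.get_content()[:200])  # first 200 characters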
Use the instrumented Query Engine
We will download a dataset of questions for our RAG application to answer.
from urllib.request import urlopen
import json
queries_url = "http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/context-retrieval/arize_docs_queries.jsonl"
queries = []
with urlopen(queries_url) as response:
    for line in response:
        line = line.decode("utf-8").strip()
        data = json.loads(line)
        queries.append(data["query"])
queries[:5]
We use the instrumented query engine and get responses from our RAG app.
from tqdm.notebook import tqdm
N = 10 # Sample size
qa_pairs = []
for query in tqdm(queries[:N]):
    resp = query_engine.query(query)
    qa_pairs.append((query, resp))
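As a quick sanity check, you can print one of the collected question/answer pairs before moving on:
# Print the first collected question and its generated answer
question, answer = qa_pairs[0]
print(f"Q: {question}\nA: {answer}")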
To see the questions and answers in Phoenix, use the link printed when we launched the Phoenix server.
Run Evaluations on the data in Phoenix
We will use the Phoenix client to extract data in the format each evaluation expects, and Phoenix's evaluators to run the evaluations on our RAG application.
from phoenix.session.evaluation import get_qa_with_reference
px_client = px.Client() # Define phoenix client
queries_df = get_qa_with_reference(
    px_client
)  # Get question, answer, and reference data from Phoenix
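It can help to inspect the extracted dataframe before evaluating; get_qa_with_reference returns one row per query with the question, the generated answer, and the retrieved reference text (column names may vary slightly by Phoenix version):
# Peek at the question/answer/reference data pulled from Phoenix
queries_df.head()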
Next, we enable concurrent evaluations for better performance.
import nest_asyncio
nest_asyncio.apply() # needed for concurrent evals in notebook environments
Then, we define our evaluators and run the evaluations.
from phoenix.evals import (
HallucinationEvaluator,
OpenAIModel,
QAEvaluator,
run_evals,
)
eval_model = OpenAIModel(
    model="gpt-4o",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
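Before logging the results, you can get a quick read on them by tabulating the labels each evaluator assigned. A sketch, assuming the eval dataframes returned by run_evals carry label, score, and explanation columns:
# Summarize the evaluation labels, e.g. factual vs. hallucinated
print(hallucination_eval_df["label"].value_counts())
print(qa_correctness_eval_df["label"].value_counts())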
Finally, we log the evaluations into Phoenix.
from phoenix.trace import SpanEvaluations
px_client.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(
        eval_name="QA_Correctness", dataframe=qa_correctness_eval_df
    ),
)
Export data to Arize
Get data into dataframes
We extract the spans and evals dataframes from the Phoenix client.
tds = px_client.get_trace_dataset()
spans_df = tds.get_spans_dataframe(include_evaluations=False)
spans_df.head()
evals_df = tds.get_evals_dataframe()
evals_df.head()
Initialize Arize Client
from arize.pandas.logger import Client
Sign up or log in to your Arize account here. Find your Space ID and API key and copy/paste them into the cell below.

SPACE_ID = globals().get("SPACE_ID") or getpass(
    "🔑 Enter your Arize Space ID: "
)
API_KEY = globals().get("API_KEY") or getpass("🔑 Enter your Arize API Key: ")
arize_client = Client(
    space_id=SPACE_ID,
    api_key=API_KEY,
)
model_id = "tutorial-tracing-llama-index-rag-export-from-phoenix"
model_version = "1.0"
Lastly, we use log_spans from the Arize client to log our spans data and, since we have evaluations, we pass the optional evals_dataframe.
response = arize_client.log_spans(
    dataframe=spans_df,
    evals_dataframe=evals_df,
    model_id=model_id,
    model_version=model_version,
)
# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print("✅ You have successfully logged traces to Arize")