# Evaluating a RAG-Powered Chatbot

{% embed url="https://colab.research.google.com/github/Arize-ai/tutorials/blob/main/python/llm/evaluation/llamaindex-evals.ipynb" %}

This guide demonstrates how to use Arize to monitor and debug an LLM application using traces and spans. We use data from a chatbot built on top of the Arize docs ([https://arize.com/docs/ax](https://arize.com/docs/ax)), with example queries and retrieved text, to measure how well our RAG system is working.

In this tutorial we will:

1. Build a RAG application using LlamaIndex
2. Set up [Phoenix](https://docs.arize.com/phoenix) as a [trace collector](https://docs.arize.com/phoenix/tracing/llm-traces) for the LlamaIndex application
3. Use Phoenix's [evals library](https://docs.arize.com/phoenix/evaluation/llm-evals) to compute LLM-generated evaluations of our RAG app's responses
4. Use the Arize SDK to export the traces and evaluations to Arize

You can read more about LLM tracing in Arize [here](https://docs.arize.com/arize/llm-large-language-models/llm-traces).

## Install Dependencies 📚

Let's set up the notebook with the required dependencies.

```python
# Dependencies needed to build the Llama Index RAG application
!pip install -qq gcsfs llama-index-llms-openai llama-index-embeddings-openai llama-index-core

# Dependencies needed to export spans and send them to our collector: Phoenix
!pip install -qq llama-index-callbacks-arize-phoenix

# Install Phoenix to generate evaluations
!pip install -qq "arize-phoenix[evals]>7.0.0"

# Install Arize SDK with `Tracing` extra dependencies to export Phoenix data to Arize
!pip install -qq "arize>7.29.0"
```

## Set up Phoenix as a Trace Collector in our LLM app

To get started, launch the Phoenix app. Make sure to open the app in your browser using the link printed by the cell below.

```python
import phoenix as px

session = px.launch_app()
```

Once you have started a Phoenix server, you can start your LlamaIndex application and configure it to send traces to Phoenix. To do this, configure Phoenix as the global handler:

```python
from llama_index.core import set_global_handler

set_global_handler("arize_phoenix")
```

That's it! The LlamaIndex application we build next will send traces to Phoenix.

## Build Your Llama Index RAG Application 📁

We start by setting your OpenAI API key if it is not already set as an environment variable.

```python
import os
from getpass import getpass

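# Prefer the environment variable, then a notebook global, then prompt for the key
OPENAI_API_KEY = (
    os.environ.get("OPENAI_API_KEY")
    or globals().get("OPENAI_API_KEY")
    or getpass("🔑 Enter your OpenAI API key: ")
)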
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
```

This example uses a `RetrieverQueryEngine` over a pre-built index of the Arize documentation, but you can use whatever LlamaIndex application you like. Download the pre-built index of the Arize docs from cloud storage and instantiate your storage context.

```python
from gcsfs import GCSFileSystem
from llama_index.core import StorageContext

file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
    fs=file_system,
    persist_dir=index_path,
)
```

We are now ready to instantiate the query engine that will perform retrieval-augmented generation (RAG). A query engine is a generic interface in LlamaIndex that lets you ask questions over your data. It takes in a natural language query and returns a rich response. Query engines are built on top of retrievers, and you can compose multiple query engines to achieve more advanced capabilities.

```python
from llama_index.llms.openai import OpenAI
from llama_index.core import (
    Settings,
    load_index_from_storage,
)
from llama_index.embeddings.openai import OpenAIEmbedding


Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
index = load_index_from_storage(
    storage_context,
)
query_engine = index.as_query_engine()
```
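To make the retrieve-then-generate flow behind a query engine more concrete, here is a toy sketch. It is not LlamaIndex code: `tokenize`, `retrieve`, and `build_prompt` are hypothetical helpers, word overlap stands in for embedding similarity, and no LLM is called. It only shows the order of operations: rank documents against the query, then assemble a prompt from the top matches.

```python
# Toy illustration of retrieval-augmented generation.
# All names below are hypothetical stand-ins, not LlamaIndex APIs.


def tokenize(text):
    """Lowercase the text and split on whitespace, stripping basic punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(query, documents, top_k=2):
    """Rank documents by the number of words they share with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only the top_k documents that share at least one word with the query
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query, contexts):
    """Assemble the text an LLM would receive: retrieved context, then the question."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"


documents = [
    "Arize is an observability platform for machine learning models.",
    "Phoenix collects traces from LLM applications.",
    "Bananas are rich in potassium.",
]
query = "What is Arize?"
contexts = retrieve(query, documents)
prompt = build_prompt(query, contexts)
print(prompt)
```

In the real application, the retriever compares embeddings rather than shared words, and the assembled prompt is sent to the LLM configured in `Settings.llm`.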

Let's test our app by asking a question about the Arize documentation:

```python
response = query_engine.query(
    "What is Arize and how can it help me as an AI Engineer?"
)
print(response)
```
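Beyond the answer text, it can help to look at which chunks were retrieved. The sketch below assumes the response object exposes a `source_nodes` list whose items have a `score` and a `node` with a `get_content()` method, as LlamaIndex responses typically do. `summarize_sources` is a hypothetical helper, and the stand-in object exists only so the example runs without calling the LLM.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=60):
    """Return (score, snippet) pairs for each retrieved chunk attached to a response."""
    summary = []
    for item in getattr(response, "source_nodes", []):
        # Collapse newlines and truncate so each chunk fits on one line
        snippet = item.node.get_content().replace("\n", " ")[:max_chars]
        summary.append((item.score, snippet))
    return summary


# Stand-in for a real response (assumption: same attribute names as above)
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.83,
            node=SimpleNamespace(
                get_content=lambda: "Arize is an ML observability platform."
            ),
        )
    ]
)

for score, snippet in summarize_sources(fake_response):
    print(score, snippet)
```

In the notebook, you would pass the `response` returned by `query_engine.query(...)` instead of `fake_response`.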

Great! Our application works!

## Use the instrumented Query Engine

We will download a dataset of questions for our RAG application to answer.

```python
from urllib.request import urlopen
import json

queries_url = "http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/context-retrieval/arize_docs_queries.jsonl"
queries = []
with urlopen(queries_url) as response:
    for line in response:
        line = line.decode("utf-8").strip()
        data = json.loads(line)
        queries.append(data["query"])

queries[:5]
```
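The loop above assumes every line of the file is a valid JSON object with a `query` field. If you adapt it to your own JSONL file, a slightly more defensive parser may be useful. The following is a minimal sketch: `parse_jsonl_queries` and the sample lines are hypothetical and only illustrate skipping blank lines and records that lack the field.

```python
import json


def parse_jsonl_queries(lines, field="query"):
    """Parse JSONL lines (bytes or str), skipping blanks and records without `field`."""
    parsed = []
    for raw in lines:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        text = text.strip()
        if not text:
            continue  # tolerate blank lines, such as a trailing newline
        record = json.loads(text)
        if field in record:
            parsed.append(record[field])
    return parsed


# Made-up sample data in the same shape as the downloaded file
sample_lines = [
    b'{"query": "How do I log spans?"}\n',
    b"\n",
    b'{"query": "What is a trace?"}\n',
]
sample_queries = parse_jsonl_queries(sample_lines)
print(sample_queries)
```

Because an HTTP response object can be iterated line by line, the same function could be called as `parse_jsonl_queries(response)` inside the `with urlopen(...)` block.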

We use the instrumented query engine and get responses from our RAG app.

```python
from tqdm.notebook import tqdm

N = 10  # Sample size
qa_pairs = []
for query in tqdm(queries[:N]):
    resp = query_engine.query(query)
    qa_pairs.append((query, resp))
```
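With larger samples, a single failed request (for example, a rate limit or timeout) would stop the loop above. One possible way to keep going is to collect failures separately, as sketched below. `run_queries` and `flaky_query` are hypothetical names; `flaky_query` is a stand-in for `query_engine.query` so the example runs offline.

```python
def run_queries(query_fn, questions):
    """Call `query_fn` on each question, collecting answers and failures separately."""
    answered, failed = [], []
    for question in questions:
        try:
            answered.append((question, query_fn(question)))
        except Exception as exc:  # broad on purpose: keep the batch going
            failed.append((question, repr(exc)))
    return answered, failed


def flaky_query(question):
    """Hypothetical stand-in for `query_engine.query`."""
    if "fail" in question:
        raise RuntimeError("simulated timeout")
    return f"answer to: {question}"


answered, failed = run_queries(flaky_query, ["what is a span?", "please fail"])
print(len(answered), "answered;", len(failed), "failed")
```

In the notebook, this could be called as `run_queries(query_engine.query, queries[:N])`, after which the failed questions can be inspected or retried.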

To see the questions and answers in Phoenix, open the link that was printed when we launched the Phoenix server.

## Run Evaluations on the data in Phoenix

We will use the Phoenix client to extract the traced data in the format each evaluation expects, and then use Phoenix's built-in evaluators to evaluate our RAG application.

```python
from phoenix.session.evaluation import get_qa_with_reference

px_client = px.Client()  # Define phoenix client
queries_df = get_qa_with_reference(
    px_client
)  # Get question, answer and reference data from phoenix
```

Next, we enable concurrent evaluations for better performance.

```python
import nest_asyncio

nest_asyncio.apply()  # needed for concurrent evals in notebook environments
```

Then, we define our evaluators and run the evaluations.

```python
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

eval_model = OpenAIModel(
    model="gpt-4o",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)

hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
```
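Before logging the results, a quick summary of the labels can show how the app is doing. The sketch below assumes each evaluation dataframe has a `label` column; `label_rates` is a hypothetical helper and the example labels are made up for illustration.

```python
from collections import Counter


def label_rates(labels):
    """Return the fraction of rows carrying each label, ignoring missing values."""
    # Missing labels may appear as None or NaN, so keep strings only
    cleaned = [label for label in labels if isinstance(label, str)]
    counts = Counter(cleaned)
    total = len(cleaned)
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Illustrative values only, not real evaluation output
example_labels = ["factual", "factual", "hallucinated", None, "factual"]
print(label_rates(example_labels))
```

Since iterating over a pandas Series yields its values, the same function could be applied as `label_rates(hallucination_eval_df["label"])`, assuming the column is present.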

Finally, we log the evaluations to Phoenix.

```python
from phoenix.trace import SpanEvaluations

px_client.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(
        eval_name="QA_Correctness", dataframe=qa_correctness_eval_df
    ),
)
```

## Export data to Arize

### Get data into dataframes

We extract the spans and evals dataframes from the Phoenix client.

```python
tds = px_client.get_trace_dataset()
spans_df = tds.get_spans_dataframe(include_evaluations=False)
spans_df.head()
```

```python
evals_df = tds.get_evals_dataframe()
evals_df.head()
```
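As a sanity check before exporting, you may want to confirm which evaluations are present in the evals dataframe. The sketch below assumes the evaluation columns follow an `eval.<name>.<field>` naming pattern (for example `eval.Hallucination.label`). `eval_names_from_columns` is a hypothetical helper, and the example column list is illustrative rather than actual output.

```python
def eval_names_from_columns(columns):
    """Extract evaluation names from columns shaped like `eval.<name>.<field>`."""
    names = []
    for column in columns:
        parts = column.split(".")
        if len(parts) >= 3 and parts[0] == "eval":
            # Everything between the `eval` prefix and the final field is the name
            name = ".".join(parts[1:-1])
            if name not in names:
                names.append(name)
    return names


# Illustrative column names (assumption: not copied from real output)
example_columns = [
    "context.span_id",
    "eval.Hallucination.label",
    "eval.Hallucination.score",
    "eval.QA_Correctness.label",
]
print(eval_names_from_columns(example_columns))
```

In the notebook, this could be called as `eval_names_from_columns(evals_df.columns)` to check that both evaluations were picked up.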

### Initialize Arize Client

```python
from arize.pandas.logger import Client
```

Sign up for or log in to your Arize account [here](https://app.arize.com/auth/login). Find your [space ID and API key](https://docs.arize.com/arize/api-reference/arize.pandas/client), then copy and paste them into the cell below.

<figure><img src="https://storage.googleapis.com/arize-assets/fixtures/copy-id-and-key.png" alt=""><figcaption></figcaption></figure>

```python
SPACE_ID = globals().get("SPACE_ID") or getpass(
    "🔑 Enter your Arize Space ID: "
)
API_KEY = globals().get("API_KEY") or getpass("🔑 Enter your Arize API Key: ")

arize_client = Client(
    space_id=SPACE_ID,
    api_key=API_KEY,
)
model_id = "tutorial-tracing-llama-index-rag-export-from-phoenix"
model_version = "1.0"
```

Lastly, we use `log_spans` from the Arize client to log our spans data. If we have evaluations, we can also pass them through the optional `evals_dataframe` argument.

```python
response = arize_client.log_spans(
    dataframe=spans_df,
    evals_dataframe=evals_df,
    model_id=model_id,
    model_version=model_version,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print("✅ You have successfully logged your traces to Arize")
```