# Evaluations Quickstart

> **Evaluations** are essential to understanding how well your model is performing in real-world scenarios, allowing you to identify strengths, weaknesses, and areas of improvement.

<Card title="Google Colab" href="https://colab.research.google.com/github/Arize-ai/tutorials/blob/main/python/llm/evaluation/quickstart-evals.ipynb" icon="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/cookbooks/gc.png" horizontal />

Offline evaluations are run as code and then sent back to Arize AX using `log_evaluations_sync`.

This guide assumes you have traces in Arize AX and are looking to run an evaluation to measure your application performance.

To add evaluations you can set up online evaluations as a task to run automatically, or you can follow the steps below to generate evaluations and log them to Arize AX:

<Steps>
  <Step title="Install the Arize SDK" />

  <Step title="Import your spans in code" />

  <Step title="Run a custom evaluator using Phoenix" />

  <Step title="Log evaluations back to Arize AX" />
</Steps>

## Install dependencies and set up keys

```bash theme={null}
# In a notebook cell, prefix each command with "!"
pip install -q arize arize-phoenix-evals
pip install -q openai pandas nest_asyncio
```

Copy your API key and Space ID from your Space Settings page (shown below) and enter them when the cell below prompts for them. They are stored in the `API_KEY` and `SPACE_ID` variables used in the rest of this guide.

<Frame>
  ![](https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/cookbooks/image-6.png)
</Frame>

```python theme={null}
import os
from getpass import getpass

SPACE_ID = globals().get("SPACE_ID") or getpass(
    "🔑 Enter your Arize AX Space ID: "
)
API_KEY = globals().get("API_KEY") or getpass("🔑 Enter your Arize AX API Key: ")
OPENAI_API_KEY = globals().get("OPENAI_API_KEY") or getpass(
    "🔑 Enter your OpenAI API key: "
)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
```
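
If you would rather keep credentials in environment variables than type them on each run, the prompt can be made a fallback. The following is a minimal sketch using only the standard library; `read_secret` is a hypothetical helper name, and the environment variable names are assumptions you can change.

```python
import os
from getpass import getpass


def read_secret(env_var, prompt, ask=getpass):
    """Return the value of env_var if it is set and non-empty, otherwise prompt for it."""
    value = os.environ.get(env_var)
    if value:
        return value
    # Fall back to an interactive prompt (getpass hides the typed value)
    return ask(prompt)


# Example usage (prompts only when the variable is not set):
# SPACE_ID = read_secret("SPACE_ID", "Enter your Arize AX Space ID: ")
# API_KEY = read_secret("ARIZE_API_KEY", "Enter your Arize AX API Key: ")
```

The `ask` parameter exists only so the prompt can be swapped out, for example in a non-interactive script.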

## Import your spans in code

Once you have traces in Arize AX, you can visit the LLM Tracing tab to see your traces and export them in code. Clicking the export button gives you boilerplate code that you can copy and paste into your evaluator.

```python theme={null}
# Note: This example uses Python SDK v7
# import statements required for getting your spans
from datetime import datetime, timedelta
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

start_time = datetime.now() - timedelta(days=14)  # 14 days ago
end_time = datetime.now()  # Today

# Exporting your dataset into a dataframe
client = ArizeExportClient(api_key=API_KEY)
primary_df = client.export_model_to_df(
    space_id=SPACE_ID,  # entered in the setup cell above
    model_id="tracing-haiku-tutorial",  # change this to the name of your project
    environment=Environments.TRACING,
    start_time=start_time,
    end_time=end_time,
)
```

## Run a custom evaluator using Phoenix

Create a prompt template for the LLM to judge the quality of your responses. You can use any of the Arize AX Evaluator Templates or create your own. Below is an example that judges the positivity or negativity of the LLM output.

```python theme={null}
import os
from phoenix.evals import OpenAIModel, llm_classify

eval_model = OpenAIModel(
    model="gpt-4o", temperature=0, api_key=os.environ["OPENAI_API_KEY"]
)

MY_CUSTOM_TEMPLATE = """
    You are evaluating the positivity or negativity of the responses to questions.
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Response]: {output}
    [END DATA]


    Please focus on the tone of the response.
    Your answer must be a single word, either "positive" or "negative"
    """
```

Notice the `{input}` and `{output}` variables in curly braces above. Each one must match a column in your dataframe so that the custom template can be filled in for every row. We use OpenInference as a set of conventions (complementary to OpenTelemetry) to trace AI applications. This means the attributes of the trace will differ depending on the provider you are using.

You can use the code below to check which attributes are in the traces in your dataframe.

```python theme={null}
primary_df.columns
```

Use the code below to set the input and output variables needed for the prompt above.

```python theme={null}
primary_df["input"] = primary_df["attributes.input.value"]
primary_df["output"] = primary_df["attributes.output.value"]
```
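
Before running the evaluator, it can help to confirm that every placeholder in the template has a matching dataframe column. The following is a minimal sketch using only the standard library. `template_variables` and `missing_columns` are hypothetical helper names, not part of Phoenix, and the sketch assumes the template uses only simple `{name}` placeholders.

```python
from string import Formatter


def template_variables(template):
    """Return the set of placeholder names used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_columns(template, columns):
    """Return the placeholder names that have no matching column."""
    return template_variables(template) - set(columns)


template = "[Question]: {input}\n[Response]: {output}"
# Only "input" is present among these columns, so "output" is reported as missing
print(sorted(missing_columns(template, ["context.span_id", "input"])))  # ['output']
```

With the dataframe from this guide, the same check could be run as `missing_columns(MY_CUSTOM_TEMPLATE, primary_df.columns)`.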

Use the `llm_classify` function to run the evaluation using your custom template on the dataframe of traces you exported above. We also apply `nest_asyncio`, which lets the evaluations run concurrently inside a notebook, where an event loop is already running.

```python theme={null}
import nest_asyncio

nest_asyncio.apply()

evals_df = llm_classify(
    dataframe=primary_df,
    template=MY_CUSTOM_TEMPLATE,
    model=eval_model,
    rails=["positive", "negative"],
    provide_explanation=True,
)
```

If you'd like more information, see our detailed guide on [custom evaluators](https://arize.com/docs/ax/llm-evaluation-and-annotations/catching-hallucinations/custom-evaluators). You can also use our [pre-tested evaluators](https://arize.com/docs/ax/llm-evaluation-and-annotations/catching-hallucinations/arize-evaluators-llm-as-a-judge) for evaluating hallucination, toxicity, retrieval, etc.

## Log evaluations back to Arize AX

Use the `log_evaluations_sync` function from our Python SDK to attach evaluations you've run to traces. The code below assumes that you have already completed an evaluation run and have the resulting evaluations dataframe (`evals_df` in this guide). It also assumes you have the traces dataframe (`primary_df` in this guide), which supplies the `span_id` needed to attach each eval to its span.

The evaluations dataframe requires four columns, which should be auto-generated for you based on the evaluation you ran using Phoenix. The `<eval_name>` must be alphanumeric and cannot have hyphens or spaces.

* `eval.<eval_name>.label`
* `eval.<eval_name>.score`
* `eval.<eval_name>.explanation`
* `context.span_id`

An example evaluation data dictionary would look like:

```python theme={null}
import pandas as pd

evaluation_data = {
    "context.span_id": ["74bdfb83-a40e-4351-9f41-19349e272ae9"],  # Use your span_id
    "eval.myeval.label": ["accuracy"],  # Example label
    "eval.myeval.score": [0.95],  # Example score
    "eval.myeval.explanation": ["some explanation"],
}
evaluation_df = pd.DataFrame(evaluation_data)
```
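
The naming rules above can be checked locally before logging. The following is a minimal sketch, not the validation Arize AX itself performs: `check_eval_columns` is a hypothetical helper, and the pattern assumes that eval names may contain letters, digits, and underscores (this guide's own `tone_eval` name uses an underscore).

```python
import re

# Assumption: letters, digits, and underscores are accepted in <eval_name>
EVAL_COLUMN = re.compile(r"^eval\.([A-Za-z0-9_]+)\.(label|score|explanation)$")


def check_eval_columns(columns):
    """Group well-formed eval columns by eval name and collect malformed ones."""
    accepted = {}
    rejected = []
    for col in columns:
        match = EVAL_COLUMN.match(col)
        if match:
            accepted.setdefault(match.group(1), set()).add(match.group(2))
        elif col.startswith("eval."):
            # Looks like an eval column but breaks the naming pattern
            rejected.append(col)
    return accepted, rejected


accepted, rejected = check_eval_columns(
    ["context.span_id", "eval.tone_eval.label", "eval.tone-eval.score"]
)
print(accepted)  # {'tone_eval': {'label'}}
print(rejected)  # ['eval.tone-eval.score']
```

Columns that do not start with `eval.`, such as `context.span_id`, are ignored by this check.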

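Here is sample code to log the evaluations back to Arize AX. First, copy the `label` and `explanation` columns returned by `llm_classify` into columns that follow the `eval.<eval_name>.*` naming format. The API reference for `log_evaluations_sync` can be found [here](https://arize-client-python.readthedocs.io/en/latest/llm-api/logger.html#arize.pandas.logger.Client.log_evaluations_sync).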

```python theme={null}
evals_df["eval.tone_eval.label"] = evals_df["label"]
evals_df["eval.tone_eval.explanation"] = evals_df["explanation"]
evals_df.head()
```
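
The column list above also includes `eval.<eval_name>.score`, which the cell above does not set. One way to derive a score from the label is sketched below. The 1.0 and 0.0 values and the `label_to_score` helper are assumptions for illustration, not something the SDK prescribes.

```python
# Hypothetical mapping from the rails used in this guide to numeric scores
LABEL_SCORES = {"positive": 1.0, "negative": 0.0}


def label_to_score(label):
    """Return the score for a known label, or None for anything outside the rails."""
    if not isinstance(label, str):
        return None
    return LABEL_SCORES.get(label.strip().lower())


print(label_to_score("positive"))  # 1.0
print(label_to_score("unknown"))   # None
```

With the dataframe from this guide, this could be applied as `evals_df["eval.tone_eval.score"] = evals_df["label"].map(label_to_score)`.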

```python theme={null}
# Note: This example uses Python SDK v7
from arize.pandas.logger import Client

# Initialize the Arize AX client with the SPACE_ID and API_KEY entered during setup
arize_client = Client(space_id=SPACE_ID, api_key=API_KEY)

# Set the evals_df to have the correct span ID to log it to Arize AX
evals_df["context.span_id"] = primary_df["context.span_id"]

# send the eval to Arize AX
arize_client.log_evaluations_sync(evals_df, "tracing-haiku-tutorial")
```
