Quickstart: Write Your First Eval

Evaluations let you measure the quality and behavior of your LLM or agent — from accuracy and coherence to tone, safety, or task success. With Arize AX, you can create, run, and track evaluations that help you understand how your models perform across real data and experiments.

Choose your path to getting started: follow the step-by-step guide to write and run your first eval in the UI or in code, or dive straight into our end-to-end cookbook walkthroughs.

Evals & Tasks (UI)

1

Upload a CSV as a dataset

There are several ways to start an eval task, either on a project or on a dataset. For this walkthrough, we will run evals on a dataset.

Download this sample CSV and upload it as a dataset in the Prompt Playground.

2

Set up a task in the playground

Load the dataset you created into the Prompt Playground and enter the following prompt: Who invented {attributes.output.value}?

3

Create your first evaluator

Next, we will create an evaluator that will assess the outputs of our LLM.

  • Navigate to Add Evaluator and choose LLM-As-A-Judge

  • From the evaluation templates choose Human vs AI

  • Adjust the variables in the template to match the columns of this dataset

    [BEGIN DATA]
    ************
    [Question]: Who invented {attributes.output.value}? 
    ************
    [Human Ground Truth Answer]: {attributes.input.value}
    ************
    [AI Answer]: {output}
    ************
    [END DATA]
  • Finish by creating the evaluator

4

Run the task and evaluator

Finally, you can run the task in the playground. Navigate to the experiment to see the outputs and evaluation results.

Congratulations! You just made, saved, and ran your first eval.

Evals & Tasks (Code)

1

Install dependencies & set your API Keys + Space ID

pip install arize openai pandas 'arize-phoenix[evals]'

Set up the client: grab your Arize AX Space ID and API key from the Settings page.

from arize.experimental.datasets import ArizeDatasetsClient

arize_client = ArizeDatasetsClient(api_key=ARIZE_API_KEY)
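The snippet above assumes `ARIZE_API_KEY` is already defined in your script. A minimal sketch of pulling both credentials from environment variables before constructing the client; the variable names `ARIZE_SPACE_ID` and `ARIZE_API_KEY` are an assumed convention here, not something the SDK requires:

```python
import os

def load_arize_credentials() -> tuple[str, str]:
    """Read the Space ID and API key from the environment, failing fast if either is missing."""
    space_id = os.environ.get("ARIZE_SPACE_ID", "")
    api_key = os.environ.get("ARIZE_API_KEY", "")
    if not (space_id and api_key):
        raise RuntimeError("Set ARIZE_SPACE_ID and ARIZE_API_KEY before running.")
    return space_id, api_key
```

Failing fast with a clear message beats the opaque authentication error you would otherwise hit on the first API call.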
2

Create your dataset

Datasets are useful groupings of data points you want to use for test cases. You can create datasets in code, generate them automatically with Alyx, or import spans directly through the Arize AX UI.

Here's an example of creating them with code:

import pandas as pd 
from arize.experimental.datasets.utils.constants import GENERATIVE

# Example dataset
inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})

# Create Arize AX dataset
dataset_id = arize_client.create_dataset(
    space_id=ARIZE_SPACE_ID,
    dataset_name="test_dataset",
    dataset_type=GENERATIVE,
    data=inventions_dataset,
)
3

Define a task

Define the LLM task to be tested against your dataset here. This could be structured data extraction, SQL generation, chatbot response generation, search, or any task of your choosing.

The input is the dataset_row so you can access the variables in your dataset, and the output is a string.

Learn more about how to create a task for an experiment.

import openai 

def answer_question(dataset_row) -> str:
    invention = dataset_row.get("attributes.input.value")  # ex: "Telephone"
    openai_client = openai.OpenAI()

    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Who invented {invention}?"}],
        max_tokens=20,
    )
    
    return response.choices[0].message.content
4

Define your first evaluators

An evaluator is any function that (1) takes the task output and (2) returns an assessment. This gives you tremendous flexibility to write your own LLM judge using a custom template, or use code-based evaluators.

Here we will define a code-based evaluator. Phoenix Evals can also turn any function into an evaluator using the create_evaluator decorator.

from arize.experimental.datasets.experiments.types import EvaluationResult

def correctness(experiment_output, dataset_row):
    expected = dataset_row.get("attributes.output.value")
    correct = expected in experiment_output
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="correct" if correct else "incorrect",
    )
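If you would rather judge correctness with an LLM instead of an exact string match, the same EvaluationResult shape works. A hedged sketch, where the judge prompt, model choice, and label parsing are illustrative assumptions, not an official Arize template:

```python
# Judge prompt template; the wording and one-word answer format are assumptions.
JUDGE_TEMPLATE = (
    "You are grading an AI answer against a human ground-truth answer.\n"
    "[Ground Truth]: {expected}\n"
    "[AI Answer]: {answer}\n"
    "Reply with exactly one word: correct or incorrect."
)

def parse_judge_label(raw: str) -> tuple[int, str]:
    """Map the judge's raw reply to a (score, label) pair; anything unexpected counts as incorrect."""
    text = raw.lower()
    label = "correct" if "correct" in text and "incorrect" not in text else "incorrect"
    return (1 if label == "correct" else 0), label

def llm_correctness(experiment_output, dataset_row):
    # Imports live inside the function so parse_judge_label stays dependency-free.
    import openai
    from arize.experimental.datasets.experiments.types import EvaluationResult

    expected = dataset_row.get("attributes.output.value")
    prompt = JUDGE_TEMPLATE.format(expected=expected, answer=experiment_output)
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
    )
    score, label = parse_judge_label(response.choices[0].message.content)
    return EvaluationResult(score=score, label=label, explanation=label)
```

Keeping the label parsing in its own small function makes the judge's behavior easy to unit-test without calling the model.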
5

Run the experiment

To run an experiment, you need to specify the space you are logging it to, the dataset_id, the task you would like to run, and the list of evaluators defined on the output. This also logs the traces to Arize AX so you can debug each run.

You can pass dry_run=True, which skips logging results to Arize. You can also pass exit_on_error=True, which makes it easier to debug when an experiment doesn't run correctly.

See the run_experiment SDK definition for more info.

arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID, 
    dataset_id=dataset_id,
    task=answer_question, 
    evaluators=[correctness],  # include your evaluation functions here
    experiment_name="basic-experiment",
    concurrency=10,
    exit_on_error=False,
    dry_run=False,
)
6

View the evaluator result in Arize AX

Navigate to the dataset in the UI and see the evaluator results in the experiment output table.

Congratulations! You just made, saved, and ran your first eval.

Next steps

Dive deeper into the following topics to keep improving your LLM application!
