Quickstart: Run First Experiment
Experiments in Arize AX help you compare prompts, models, or configurations to understand what performs best. You can run structured tests, log results, and visualize outcomes side-by-side — all within the same workspace.
Choose your path to getting started: follow the step-by-step guide to set up and run your first experiment, or jump straight into example cookbooks and datasets to explore how experiments work in action.
Experiments (UI)
(optional) Compare experiments
With Diff Mode enabled, you can compare experiments side-by-side to easily spot improvements and regressions.
Congratulations! You just made, saved, and ran your first experiment.
Experiments (Code)
Create your dataset
Datasets group the data points you want to use as test cases. You can create them in code, auto-generate them with LLMs, or import spans through the Arize AX UI. Here's an example of creating one in code:
import pandas as pd
from arize.experimental.datasets.utils.constants import GENERATIVE

# Assumes arize_client and ARIZE_SPACE_ID were set up beforehand (not shown in this section)
# Example dataset
inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})

# Create Arize AX dataset
dataset_id = arize_client.create_dataset(
    space_id=ARIZE_SPACE_ID,
    dataset_name="test_dataset",
    dataset_type=GENERATIVE,
    data=inventions_dataset,
)

Define a task
Define the LLM task to be tested against your dataset here. This could be structured data extraction, SQL generation, chatbot response generation, search, or any task of your choosing.
The input is the dataset_row so you can access the variables in your dataset, and the output is a string.
Learn more about how to create a task for an experiment.
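Because a task is an ordinary function from a dataset row to a string, that contract can be checked offline before any model is involved. The sketch below is one possible way to do this; `build_prompt`, `make_task`, and `fake_complete` are hypothetical helper names, not part of the Arize SDK, and only the column key comes from the example dataset.

```python
from typing import Callable, Dict

# Column key taken from the example dataset; everything else here is illustrative.
INPUT_KEY = "attributes.input.value"


def build_prompt(dataset_row: Dict[str, str]) -> str:
    """Format the question asked about one dataset row."""
    invention = dataset_row.get(INPUT_KEY)
    if not invention:
        raise ValueError(f"dataset row is missing {INPUT_KEY!r}")
    return f"Who invented {invention}?"


def make_task(complete: Callable[[str], str]) -> Callable[[Dict[str, str]], str]:
    """Wrap any prompt -> text function as a task: dataset_row in, string out."""
    def task(dataset_row: Dict[str, str]) -> str:
        return complete(build_prompt(dataset_row))
    return task


def fake_complete(prompt: str) -> str:
    """Stand-in for a model call so the task can run without network access."""
    return f"(stub answer to: {prompt})"


stub_task = make_task(fake_complete)
print(stub_task({INPUT_KEY: "Telephone"}))
```

Separating prompt construction from the model call keeps the row-handling logic testable on its own. The task that follows makes the real OpenAI call inline.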
import openai

def answer_question(dataset_row) -> str:
    invention = dataset_row.get("attributes.input.value")  # ex: "Telephone"
    openai_client = openai.OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Who invented {invention}?"}],
        max_tokens=20,
    )
    return response.choices[0].message.content

(optional) Define your evaluators
An evaluator is any function that (1) takes the task output and (2) returns an assessment. This gives you tremendous flexibility to write your own LLM judge using a custom template, or use code-based evaluators.
Read more about creating evaluators for experiments.
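Since an evaluator is just a function, a code-based one can be exercised locally before it is attached to an experiment. The sketch below shows a slightly more forgiving containment check that ignores case and treats a missing output as incorrect. `LocalResult` is a hypothetical stand-in that only mirrors the score, label, and explanation fields; it is not the SDK's `EvaluationResult`.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class LocalResult:
    """Hypothetical stand-in carrying a score, a label, and an explanation."""
    score: int
    label: str
    explanation: str


def contains_expected(output: Optional[str], dataset_row: Mapping[str, str]) -> LocalResult:
    """Mark the output correct if it contains the expected answer, ignoring case."""
    expected = dataset_row.get("attributes.output.value", "")
    # An empty expected value or a missing output counts as incorrect.
    found = bool(expected) and expected.strip().lower() in (output or "").lower()
    return LocalResult(
        score=int(found),
        label="correct" if found else "incorrect",
        explanation=f"expected {expected!r} to appear in the output",
    )


row = {"attributes.output.value": "Thomas Edison"}
print(contains_expected("It was thomas edison.", row).label)
print(contains_expected(None, row).label)
```

The evaluator that follows uses the SDK's own result type and a stricter, case-sensitive match.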
from arize.experimental.datasets.experiments.types import EvaluationResult

def is_correct(output, dataset_row):
    expected = dataset_row.get("attributes.output.value")
    correct = expected in output
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="Evaluator explanation here"
    )

Run the experiment
To run an experiment, you need to specify the space you are logging it to, the dataset_id, the task you would like to run, and the list of evaluators defined on the output. This also logs the traces to Arize AX so you can debug each run.
You can specify dry_run=True, which does not log the result to Arize AX. You can also specify exit_on_error=True, which makes it easier to debug when an experiment doesn't run correctly.
See the run_experiment SDK definition for more info.
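To make the moving parts concrete, the sketch below runs a task and its evaluators over rows entirely in memory. It is not how the SDK implements `run_experiment`; `run_locally` and `echo_task` are hypothetical names, nothing is logged anywhere, and the `exit_on_error` behavior shown (re-raise on the first failure) is an assumption about what that flag means.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def run_locally(
    rows: List[Row],
    task: Callable[[Row], str],
    evaluators: List[Callable[[str, Row], Any]],
    exit_on_error: bool = False,
) -> List[Dict[str, Any]]:
    """Run the task, then every evaluator, for each row; collect results in a list."""
    results = []
    for row in rows:
        try:
            output = task(row)
            evals = {ev.__name__: ev(output, row) for ev in evaluators}
            results.append({"output": output, "evals": evals, "error": None})
        except Exception as exc:
            if exit_on_error:
                raise  # assumed semantics: stop at the first failing row
            results.append({"output": None, "evals": {}, "error": repr(exc)})
    return results


def echo_task(row: Row) -> str:
    """Toy task that needs no model: it restates the row."""
    return f"{row['attributes.input.value']} was invented by {row['attributes.output.value']}"


def is_correct(output: str, row: Row) -> int:
    return int(row["attributes.output.value"] in output)


rows = [
    {"attributes.input.value": "Telephone", "attributes.output.value": "Alexander Graham Bell"},
    {"attributes.output.value": "Thomas Edison"},  # missing input key, so the task fails
]
for result in run_locally(rows, echo_task, [is_correct]):
    print(result)
```

With `exit_on_error=False` the failing row is recorded and the loop continues; passing `exit_on_error=True` surfaces the exception immediately, which is the more convenient mode while debugging a task.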
arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[is_correct],  # include your evaluation functions here
    experiment_name="basic-experiment",
    concurrency=10,
    exit_on_error=False,
    dry_run=False,
)

Congratulations! You just made, saved, and ran your first experiment.
Next steps
Dive deeper into the following topics to keep improving your LLM application!



