Run experiment
Run experiments to test changes to your models, prompts, and agents. Experiments can be run via the UI or via code.
Run experiment via UI
1. Test a prompt in playground
First, create a dataset. Load the dataset you created into the prompt playground and run it to see your results. Once the run has finished, you can save it as an experiment to track your changes.
2. Run an evaluator on your playground experiments
Use evaluators to automatically measure the quality of your experiment results. Once defined, Arize runs them in the background. Evaluators can be either LLM Judges or code-based assessments.
3. Compare experiment results
Each prompt iteration is stored separately, and Arize makes it easy to compare experiment results against each other with Diff Mode.
You can also use Alyx to get automated insights as you compare your experiments, with the ability to both summarize results and highlight key differences across runs.
Run experiment via Code
Check out the API reference for more details.
1. Define your dataset
You can create a new dataset or use an existing dataset.
import pandas as pd

from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# Example dataset
inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})

arize_client = ArizeDatasetsClient(api_key=ARIZE_API_KEY)
dataset_id = arize_client.create_dataset(
    space_id=ARIZE_SPACE_ID,
    dataset_name="test_dataset",
    dataset_type=GENERATIVE,
    data=inventions_dataset,
)
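If you already have a dataset, you can fetch it instead of creating a new one. A minimal sketch, assuming the client exposes a get_dataset method that accepts the space ID and dataset name; verify the exact parameters in the API reference:

# Sketch: retrieve an existing dataset instead of creating a new one.
# Assumes get_dataset(space_id=..., dataset_name=...); check the API reference.
existing_dataset = arize_client.get_dataset(
    space_id=ARIZE_SPACE_ID,
    dataset_name="test_dataset",
)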
2. Define a task
A task is any function that you want to run on a dataset. The simplest version of a task looks like the following:
from typing import Dict

def task(dataset_row: Dict):
    return dataset_row
When you create a dataset, each row is stored as a dictionary with attributes you can retrieve within your task. This can be the user input, the expected output for an evaluation task, or metadata attributes.
import openai

def answer_question(dataset_row) -> str:
    invention = dataset_row.get("attributes.input.value")  # example: "Telephone"
    openai_client = openai.OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Who invented {invention}?"}],
        max_tokens=20,
    )
    return response.choices[0].message.content
Task inputs
The task function can take the following optional arguments for convenience, which automatically pass dataset_row attributes to your task function. The easiest way to access anything you need is the dataset_row argument itself.
dataset_row: the entire row of the data, with every column as a dictionary key (no single source attribute). Example: def task_fn(dataset_row): ...
input: the experiment run input, taken from attributes.input.value. Example: def task_fn(input): ...
expected: the expected output, taken from attributes.output.value. Example: def task_fn(expected): ...
metadata: metadata for the run, taken from attributes.metadata. Example: def task_fn(metadata): ...
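For example, a task can accept the mapped values directly instead of the full row. A minimal sketch that simply echoes the mapped input and expected output:

# Sketch: the experiment runner maps attributes.input.value to `input`
# and attributes.output.value to `expected` for you.
def task_with_convenience_args(input, expected) -> str:
    return f"Invention: {input}. Expected inventor: {expected}"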
3. Define an evaluator (Optional)
You can also optionally define an evaluator to assess your task outputs in experiments. Evaluators can be LLM Judges or Code Evaluators. For example, here's a simple code evaluator that checks whether the expected output appears in the LLM output:
from arize.experimental.datasets.experiments.types import EvaluationResult

def is_correct(output, dataset_row):
    expected = dataset_row.get("attributes.output.value")
    correct = expected in output
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="Evaluator explanation here",
    )
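An evaluator can also be an LLM Judge. A minimal sketch using the same OpenAI client as the task above; the judge prompt, model choice, and verdict parsing are illustrative assumptions, not a prescribed Arize API:

import openai

from arize.experimental.datasets.experiments.types import EvaluationResult

def is_correct_llm_judge(output, dataset_row):
    # Sketch of an LLM-as-judge evaluator: ask a model to compare the task
    # output against the expected answer stored on the dataset row.
    expected = dataset_row.get("attributes.output.value")
    judge_client = openai.OpenAI()
    response = judge_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": (
                f"Expected answer: {expected}\n"
                f"Model answer: {output}\n"
                "Reply with exactly one word: correct or incorrect."
            ),
        }],
        max_tokens=5,
    )
    verdict = response.choices[0].message.content.strip().lower()
    is_match = verdict.startswith("correct")
    return EvaluationResult(
        score=int(is_match),
        label="correct" if is_match else "incorrect",
        explanation=f"LLM judge verdict: {verdict}",
    )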
4. Run the experiment
Then, use the run_experiment function to run the task function against your dataset, run the evaluation function against the outputs, and log the results and traces to Arize.
arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[is_correct],  # include your evaluation functions here
    experiment_name="basic-experiment",
    concurrency=10,
    exit_on_error=False,
    dry_run=False,
)
We offer several convenience arguments:
concurrency: runs dataset rows in parallel, reducing the time to complete the experiment.
dry_run=True: does not log the results to Arize.
exit_on_error=True: stops the run on the first error, making it easier to debug when an experiment doesn't run correctly.
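For example, while iterating you might combine these into a quick local debug run. A sketch; the experiment name and concurrency value are arbitrary:

# Sketch: a fast local check; nothing is logged to Arize while dry_run=True.
arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[is_correct],
    experiment_name="debug-run",  # arbitrary name for this sketch
    concurrency=2,
    exit_on_error=True,  # stop at the first failure while debugging
    dry_run=True,        # keep results local
)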
Once your experiment has finished running, you can see your experiment results in the Arize UI.