Run experiments with code
How to create an LLM task to use in experiments
To run an experiment in code, you need to define the following things:
A dataset to test on
A task function to run on each row of the dataset
An evaluation function to grade the task outputs (optional)
Then use the run_experiment function to run the task function against your data, run the evaluation function against the outputs, and log the results and traces to Arize.
Alternatively, you can log your experiment results via code if you already have your LLM outputs or evaluation outputs.
Define your dataset
You can create a new dataset, or you can export your existing dataset in code.
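For example, here is a minimal sketch of creating a dataset from a pandas dataframe. It assumes the ArizeDatasetsClient from the arize Python SDK; the credentials, space ID, and column values are placeholders, and the constructor arguments can vary by SDK version.

```python
import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# Placeholder credentials -- replace with your own.
arize_client = ArizeDatasetsClient(api_key="YOUR_API_KEY")

# Each row becomes one dataset example; column names become the row's attributes.
df = pd.DataFrame(
    {
        "attributes.input.value": ["What is the capital of France?"],
        "attributes.output.value": ["Paris"],
    }
)

dataset_id = arize_client.create_dataset(
    space_id="YOUR_SPACE_ID",
    dataset_name="questions-v1",
    dataset_type=GENERATIVE,
    data=df,
)
```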
Define your task function
A task is any function that you want to test on a dataset. The simplest version of a task looks like the following:
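```python
def my_task(dataset_row):
    # Receive one dataset row (a dictionary) and return an output.
    # The attribute name below is illustrative.
    return dataset_row["attributes.input.value"]
```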
When you create a dataset, each row is stored as a dictionary with attributes you can retrieve within your task. This can be the user input, the expected output for an evaluation task, or metadata attributes.
Here's an example of how the data passes through the task function from your dataset.
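```python
# A dataset row arrives as a dictionary keyed by column name (values are illustrative).
dataset_row = {
    "attributes.input.value": "What is the capital of France?",
    "attributes.output.value": "Paris",
    "attributes.metadata": {"topic": "geography"},
}

def my_task(dataset_row):
    # Pull out whatever the task needs from the row.
    question = dataset_row["attributes.input.value"]
    return f"You asked: {question}"
```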
Let's create a task that uses an LLM to answer a question.
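Below is a minimal sketch using the openai Python client; the model name is illustrative, and OPENAI_API_KEY is assumed to be set in your environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def answer_question(dataset_row):
    question = dataset_row["attributes.input.value"]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```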
Task inputs
The task function can take the following optional arguments for convenience, which automatically pass the corresponding dataset_row attributes to your task function. The easiest way to access anything you need is dataset_row.
| Parameter | Description | Dataset attribute | Example |
| --- | --- | --- | --- |
| dataset_row | the entire row of the data, including every column as a dictionary key | -- | def task_fn(dataset_row): ... |
| input | experiment run input | attributes.input.value | def task_fn(input): ... |
| expected | the expected output | attributes.output.value | def task_fn(expected): ... |
| metadata | metadata for the function | attributes.metadata | def task_fn(metadata): ... |
Run the experiment
To run the experiment with this task, pass the task into run_experiment as follows.
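(A minimal sketch: the space ID, dataset name, and experiment name are placeholders, my_evaluator stands in for an optional evaluation function, and the exact signature may vary by SDK version.)

```python
result = arize_client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_name="questions-v1",
    task=answer_question,
    evaluators=[my_evaluator],  # optional; hypothetical evaluation function
    experiment_name="qa-experiment-v1",
    # Optional convenience parameters, described below:
    # concurrency=4,       # run rows in parallel
    # dry_run=True,        # don't log results to Arize
    # exit_on_error=True,  # stop on the first failure for easier debugging
)
```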
We offer several convenience parameters, such as concurrency, to reduce the time it takes to complete an experiment. You can specify dry_run=True, which does not log the results to Arize, and exit_on_error=True, which makes it easier to debug when an experiment doesn't run correctly.
Once your experiment has finished running, you can visualize your experiment results in the Arize UI.

Advanced Options
Setup asynchronous experiments
Experiments can be run either synchronously or asynchronously.
We recommend:
Synchronous: Slower but easier to debug. While you are building your tests, start with synchronous runs, then make them asynchronous.
Asynchronous: Faster. When the timing and speed of your tests matter, make the tasks and/or evals asynchronous and you can speed up your runs by 10x.
A synchronous experiment runs each row one after another; an asynchronous experiment runs rows in parallel.


Here are the code differences between the two. You just need to add the async keyword before your function's def, add async_ to the front of its name, and then run nest_asyncio.apply(). Async runs rely on the concurrency parameter in run_experiment, so if you'd like them to run faster, set it to a higher number.
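For example (run_llm and async_run_llm are hypothetical helpers standing in for your LLM call):

```python
import nest_asyncio

nest_asyncio.apply()  # required when running inside a notebook's event loop

# Synchronous version
def task_fn(dataset_row):
    return run_llm(dataset_row["attributes.input.value"])  # hypothetical helper

# Asynchronous version: add `async` before def and async_ to the front of the name
async def async_task_fn(dataset_row):
    return await async_run_llm(dataset_row["attributes.input.value"])  # hypothetical helper
```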
Sampling a dataset for an experiment
Running a test sometimes requires a random or stratified sample of the dataset. Arize supports this by allowing teams to download the dataset as a dataframe, which can be sampled prior to running the experiment.
An experiment is only matched up with the data that was run against it, so you can run experiments with different samples of the same dataset; the platform takes care of the tracking and visualization.
Any sampling method that can be applied to a dataframe, however complex, can be used.
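For example, a simple random sample with pandas (the fraction and seed are illustrative, and get_dataset is assumed to return a dataframe; check your SDK version for the exact signature):

```python
# Download the dataset as a dataframe.
dataset_df = arize_client.get_dataset(
    space_id="YOUR_SPACE_ID",
    dataset_name="questions-v1",
)

# Any pandas sampling strategy works; here, a random 10% sample with a fixed seed.
sampled_df = dataset_df.sample(frac=0.1, random_state=42)
```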
Tracing your experiment
When running experiments, arize_client.run_experiment() produces a task span attached to the experiment. If you want to add more traces to the experiment run, you can instrument any part of the experiment, and those spans will be attached below the task span.

Arize tracers instrumented in your experiment code will automatically trace the experiments into the platform.


Tracing Using Explicit Spans
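A sketch using an OpenTelemetry tracer; the span name and the retrieve helper are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def my_task(dataset_row):
    question = dataset_row["attributes.input.value"]
    # Spans opened inside the task are nested below the experiment's task span.
    with tracer.start_as_current_span("retrieve-context") as span:
        span.set_attribute("input.value", question)
        context = retrieve(question)  # hypothetical retrieval helper
    return context
```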
Tracing Using Auto-Instrumentor
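A sketch using the OpenInference OpenAI auto-instrumentor (assumes the openinference-instrumentation-openai package is installed):

```python
from openinference.instrumentation.openai import OpenAIInstrumentor

# Once instrumented, every OpenAI call made inside your task is traced
# automatically and nested under the experiment's task span.
OpenAIInstrumentor().instrument()
```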