Use code when the experiment lives outside a single prompt call: a pipeline, agent, sandbox, CI job, or any runtime you control. You can either let the Python client run the task against the dataset for you, or upload results you already produced elsewhere.

Two approaches

The rest of the page follows these two paths:
|  | Run an experiment | Log an experiment |
| --- | --- | --- |
| Who executes the task | Your Python environment | Your code, anywhere |
| Upload interface | Python SDK | Python SDK, TypeScript/JS SDK, CLI (ax), REST API |
| Concurrency, tracing, evals | Handled by the client | You handle execution and any scores |
| Best for | Self-contained Python task functions | Pipelines, agents, sandboxes, CI, and remote runtimes |
Both paths land in the same experiments UI, so you can compare runs side by side after they finish.

Run an experiment

Use this when the entire task fits inside a single Python function and you want the Python client to orchestrate it. The client resolves the dataset, runs the task against every row, scores the outputs with the evaluators you pass, and logs the results to Arize.

Define a task

A task function is the unit of work you want to measure: a single LLM call, a retrieval pipeline, an agent workflow, or any application logic. This is where you decide what stays fixed and what changes between runs. If you still need to create or refine the dataset first, use Build a dataset.
Python
import openai

def answer_question(dataset_row) -> str:
    question = dataset_row.get("attributes.input.value", "")
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content or ""
Task functions receive the full dataset_row. Read whatever columns your dataset provides, whether they are OpenInference attributes such as attributes.input.value or custom columns such as input, question, tickers, or focus. To version the task’s prompt in Prompts instead of hard-coding it, fetch it inside your task with client.prompts.get().
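As a rough illustration of that prompt-fetching pattern, the sketch below pulls a versioned prompt inside the task instead of hard-coding it. The prompt name is a placeholder, and the exact arguments and return shape of client.prompts.get() depend on your client version, so adapt the call and the message formatting to what your client actually returns.
Python
import openai
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

def answer_question_with_versioned_prompt(dataset_row) -> str:
    question = dataset_row.get("attributes.input.value", "")
    # Placeholder identifier; the call signature may differ in your client version.
    prompt = client.prompts.get("support-agent-prompt")
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4.1",
        messages=[
            # Assumes the fetched prompt can be rendered to text for the system message.
            {"role": "system", "content": str(prompt)},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content or ""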

Add an evaluator

Use a code evaluator when you have deterministic rules or ground truth. Use an LLM judge when the criterion is more subjective. A clear starting point is a function that returns an EvaluationResult:
Python
from arize.experiments import EvaluationResult

def correctness(output, dataset_row):
    expected = dataset_row.get("attributes.output.value", "")
    generated = output or ""
    correct = bool(expected) and expected.lower() in generated.lower()
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation=f"Expected '{expected}', got '{generated[:80]}...'",
    )
Keep the first evaluator simple. For deeper evaluator workflows, see Evaluate.

Run it

Pass your task and evaluators to the client, pointing dataset= at the dataset by ID or name. Continuing from the task and evaluator above:
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

experiment, experiment_df = client.experiments.run(
    name="support-agent-baseline",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=answer_question,
    evaluators=[correctness],
    concurrency=10,
)
client.experiments.run() resolves the dataset, runs the task and evaluators in your Python environment, and logs the results to Arize. Name runs consistently so the comparison view diffs cleanly; Plan a baseline covers the naming pattern itself. For the full parameter list, see the Python experiments API reference. If you’re migrating older experiment code, see the experiments migration guide.
Use dry_run=True to test the loop locally without logging results. In dry-run mode, client.experiments.run() returns (None, experiment_df).
Once either path has produced runs, the rest of the workflow is the same.

Log an experiment

Choose this path when the task already runs in its own environment: another service, a TypeScript app, a sandbox, CI, or a notebook with its own orchestration. Compute the outputs yourself, key each row by example_id, and upload the finished results.

Assemble your results

Put your task outputs in a table such as a DataFrame, CSV, JSON, JSONL, or Parquet file. Every row must carry:
  • example_id, the dataset row ID the result corresponds to
  • output (or another column you map to it), the task output for that example
If you ran evaluators yourself, include their label and score fields alongside each row. You can map the exact column names when you upload and give that evaluator a name such as correctness, or attach evaluators later if you do not have them yet.
If your results already contain example IDs, keep them. If they do not, fetch the dataset examples first and map id to example_id before you upload the run. When you resolve a dataset by name while fetching examples, pass space= to client.datasets.list_examples(). Your external job should produce one result row for each dataset example it ran; the example below shows the required shape and maps each row to an existing dataset example ID.
Python
import pandas as pd
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

dataset_examples = client.datasets.list_examples(
    dataset="YOUR_DATASET_ID",
    all=True,
).to_df()

experiment_run_df = pd.DataFrame(
    {
        "result": ["The telephone was invented by Alexander Graham Bell."],
        "label": ["correct"],
        "score": [1],
    }
)

# Match each result row to the dataset example it came from.
experiment_run_df["example_id"] = dataset_examples["id"].head(
    len(experiment_run_df)
).to_list()

Upload the run

Use the Arize skills plugin with the arize-experiment skill to upload a results file directly. The file must have example_id and output columns (CSV, JSON, JSONL, or Parquet). See ax experiments create for the full schema. Try asking your agent:
  • “Upload runs.csv as a new experiment on dataset ds_xxx and name it baseline-v1.”
  • “Create an experiment from nightly_runs.jsonl for dataset qa-regression.”
(Figure: a coding agent using the arize-experiment skill to upload experiment runs and create a new experiment.)
Remote experiments fit pipelines you already own. The task stays in your runtime, and Arize records and compares the resulting runs.

Evaluate a remote experiment

You have two options for attaching scores to a remote run:
  1. Score outputs yourself and upload alongside results. Add evaluator columns such as label and score to the results DataFrame, as in the Python example above, and map them to a named evaluator through EvaluationResultFieldNames when you upload. Use this when the evaluator logic lives in the same environment as your task.
  2. Attach evaluators in Arize after upload. Upload the experiment without eval columns, then open the experiment results page and click Add Evaluator, run the arize-evaluator skill, or run an existing evaluator from the experiment workflow. Use this when you want an LLM-as-a-judge scored from Arize itself, especially across remote runs from multiple languages.
Try asking your agent:
  • “Create a correctness LLM-as-a-judge evaluator using my OpenAI integration and run it on experiment exp_xxx.”
  • “Score every run in experiment exp_xxx with a groundedness judge.”
The judge LLM needs stored credentials. Use the arize-ai-provider-integration skill to set up your OpenAI or Anthropic keys, or your Bedrock role. For the broader evaluator workflow, see Evaluate.

Manage your experiments

Compare experiments

Whether you logged the run or had Arize run it, compare the results in the same experiments UI. For the full walkthrough, see Compare experiments.

Export or get results

If you want experiment metadata and runs programmatically, pull them in code:
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

experiment = client.experiments.get(
    experiment="support-agent-baseline",
    dataset="YOUR_DATASET_ID",
)

runs_df = client.experiments.list_runs(
    experiment=experiment.id,
    all=True,
).to_df()

Tag a winner

Once a prompt variant clears the baseline on the evaluators you care about, tag that prompt version as production in Prompts, or use whatever label your application loads in production. For model or pipeline changes, promote the winning value in your app configuration and keep the experiment name or metadata tied to that promoted version.

Classification metrics

If each experiment returns a categorical label instead of free-form text, configure classification metrics from the dataset’s Experiments tab. The full setup for ground-truth mapping, positive-class selection, and metric definitions lives on Experiment in Playground.

Additional code workflows

Once the main loop is in place, use these patterns to work faster or handle edge cases. The sections below are mostly about the Python run() path.

Evaluator patterns

If the minimal evaluator above is enough, stop there. Function evaluators can return an EvaluationResult, a numeric score, or a string label; use EvaluationResult when you want to include score, label, and explanation together. Class-based evaluators can accept mapped inputs such as input, output, dataset_row, and metadata. Use the patterns below when you need more than one evaluator in the same run, shared state, or reusable evaluators. For deeper evaluator references, see Run offline evals on experiments and Create evaluators.
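As a quick illustration of those return shapes, here is a minimal sketch with one function evaluator per shape — a numeric score, a string label, and a full EvaluationResult:
Python
from arize.experiments import EvaluationResult

def output_length(output, dataset_row) -> float:
    # Numeric return: recorded as the evaluator's score.
    return float(len(output or ""))

def verdict(output, dataset_row) -> str:
    # String return: recorded as the evaluator's label.
    return "non-empty" if output else "empty"

def graded(output, dataset_row) -> EvaluationResult:
    # EvaluationResult: score, label, and explanation together.
    present = bool(output)
    return EvaluationResult(
        score=float(present),
        label="present" if present else "missing",
        explanation="Task returned output." if present else "Task returned no output.",
    )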

Multiple evaluators

Pass a list to evaluators= and Arize runs each evaluator against each experiment result. Start with multiple function evaluators, or mix function and class-based evaluators in the same list. Each evaluator shows up as its own column in the comparison view.
Python
from arize import ArizeClient
from arize.experiments import EvaluationResult

client = ArizeClient(api_key="YOUR_API_KEY")

def echo_input(dataset_row) -> str:
    return dataset_row.get("attributes.input.value", "")

def correctness(output, dataset_row) -> EvaluationResult:
    expected = dataset_row.get("attributes.output.value", "")
    generated = output or ""
    correct = bool(expected) and expected.lower() in generated.lower()
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
    )

def has_output(output, dataset_row) -> EvaluationResult:
    present = bool(output)
    return EvaluationResult(
        score=float(present),
        label="present" if present else "missing",
    )

experiment, experiment_df = client.experiments.run(
    name="multi-eval-experiment",
    dataset="YOUR_DATASET_ID",
    task=echo_input,
    evaluators=[
        correctness,
        has_output,
    ],
)

Class-based evaluators

The main API reference focuses on function evaluators. Use a subclass of Evaluator when an evaluator holds shared state, runs async, or is reused across projects. Class-based evaluator methods can request the inputs they need:
| Parameter | Description | Example |
| --- | --- | --- |
| input | Experiment run input | def evaluate(self, input, **kwargs): ... |
| output | Experiment run output | def evaluate(self, output, **kwargs): ... |
| dataset_row | The full dataset row, including every column | def evaluate(self, dataset_row, **kwargs): ... |
| metadata | Experiment metadata | def evaluate(self, metadata, **kwargs): ... |
Python
from arize.experiments import EvaluationResult, Evaluator

class MatchesExpected(Evaluator):
    annotator_kind = "CODE"
    name = "matches_expected"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        expected_output = dataset_row.get("attributes.output.value")
        matches = expected_output == output
        label = "match" if matches else "mismatch"
        score = float(matches)
        return EvaluationResult(score=score, label=label)

    async def async_evaluate(self, *, output, dataset_row, **kwargs) -> EvaluationResult:
        return self.evaluate(output=output, dataset_row=dataset_row)
This example uses Phoenix Evals and an OpenAI-backed judge. Install phoenix-evals and set OPENAI_API_KEY before running it.
Python
import os
import pandas as pd
from arize.experiments import EvaluationResult, Evaluator
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

class HallucinationEvaluator(Evaluator):
    name = "hallucination"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        # The hallucination template expects "input", "reference", and "output" columns.
        df_in = pd.DataFrame(
            {
                "input": [dataset_row.get("attributes.input.value", "")],
                "reference": [dataset_row.get("attributes.output.value", "")],
                "output": [output or ""],
            }
        )
        result = llm_classify(
            dataframe=df_in,
            template=HALLUCINATION_PROMPT_TEMPLATE,
            model=OpenAIModel(
                model="gpt-4o-mini",
                api_key=os.getenv("OPENAI_API_KEY"),
            ),
            rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
            provide_explanation=True,
        )
        label = result["label"].iloc[0]
        return EvaluationResult(
            score=1 if label == "factual" else 0,
            label=label,
            explanation=result["explanation"].iloc[0],
        )

    async def async_evaluate(self, *, output, dataset_row, **kwargs) -> EvaluationResult:
        return self.evaluate(output=output, dataset_row=dataset_row)

Async experiments

When throughput matters on the run() path, declare your task and evaluators with async def and raise concurrency. In Jupyter, install nest_asyncio with pip install nest_asyncio (it is not bundled with arize) and call nest_asyncio.apply() first so the runner can nest its event loop inside the kernel’s.
Diagram showing synchronous sequential execution versus asynchronous parallel execution for experiment tasks and evaluators
Python
from arize import ArizeClient
from arize.experiments import EvaluationResult
import nest_asyncio

client = ArizeClient(api_key="YOUR_API_KEY")
nest_asyncio.apply()

async def async_task(dataset_row):
    return dataset_row.get("attributes.input.value", "")

async def async_has_output(output, dataset_row):
    present = bool(output)
    return EvaluationResult(
        score=float(present),
        label="present" if present else "missing",
    )

experiment, experiment_df = client.experiments.run(
    name="async-experiment",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=async_task,
    evaluators=[async_has_output],
    concurrency=10,
)
Start with synchronous tasks and evaluators while you’re developing the experiment. Sync failures usually break at the line that raised the error, which makes them easier to debug before you switch to async for throughput.

Dataset sampling

For quick spot checks or balanced subsets, sample the dataset before running. The fastest path is dry_run=True with dry_run_count, which runs the task against the first N examples without logging:
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

def quick_task(dataset_row) -> str:
    return dataset_row.get("attributes.input.value", "")

experiment, experiment_df = client.experiments.run(
    name="smoke-test",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=quick_task,
    dry_run=True,
    dry_run_count=10,
)
When you want a specific sample (random, stratified, or systematic), pull the examples, sample in pandas, drop system-managed fields, and create a temporary dataset:
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

def sampled_task(dataset_row) -> str:
    return dataset_row.get("attributes.input.value", "")

examples_df = client.datasets.list_examples(
    dataset="YOUR_DATASET_ID_OR_NAME",
    all=True,
).to_df()

sampled_df = examples_df.sample(frac=0.1, random_state=42)
# Stratify by a label column to preserve class balance:
# sampled_df = examples_df.groupby("class_label", group_keys=False).apply(
#     lambda g: g.sample(frac=0.1, random_state=42)
# )
# Or select every 10th row:
# sampled_df = examples_df.iloc[::10, :]

# Keep only dataset example columns before creating a new dataset.
example_columns = [
    col for col in sampled_df.columns if col not in {"id", "created_at", "updated_at"}
]
sampled_examples = sampled_df[example_columns].copy()

sampled_dataset = client.datasets.create(
    name="support-dataset-sample-10pct",
    space="your-space-name-or-id",
    examples=sampled_examples,
)

experiment, experiment_df = client.experiments.run(
    name="sampled-experiment",
    dataset=sampled_dataset.id,
    task=sampled_task,
)
You can run experiments on different samples of the same source dataset. Arize still tracks each run against the data it actually saw, so you can compare and visualize those sampled runs cleanly in the product.

Experiment tracing

When you want your own spans for retrieval, tool calls, or nested model activity, pass set_global_tracer_provider=True so the experiment run registers a global tracer provider for that execution. Use it when you want manual or auto-instrumented tracing to participate in the same run-time tracing setup. Install the OpenTelemetry and OpenInference packages required by your tracing setup before running these examples. For broader setup guidance, see Set up tracing.
Explicit spans. Create spans manually for the parts of the task you want visible:
Python
from arize import ArizeClient
from opentelemetry import trace

client = ArizeClient(api_key="YOUR_API_KEY")

def task_add_1(dataset_row):
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("test_function") as span:
        num = dataset_row.get("attributes.my_number")
        span.set_attribute("dataset.my_number", num)
        return num + 1

experiment, experiment_df = client.experiments.run(
    name="tracing-demo",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=task_add_1,
    set_global_tracer_provider=True,
)
Auto-instrumentor. For LLM, framework, or vector-store calls, install the matching OpenInference auto-instrumentor so library calls made inside your task can emit spans during the run:
Python
from arize import ArizeClient
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor

client = ArizeClient(api_key="YOUR_API_KEY")
openai_client = OpenAI()

OpenAIInstrumentor().instrument()

def traced_task(dataset_row) -> str:
    question = dataset_row.get("attributes.input.value", "")
    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content or ""

experiment, experiment_df = client.experiments.run(
    name="auto-instrumented",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=traced_task,
    set_global_tracer_provider=True,
)
For broader tracing guidance (providers, exporters, and other instrumentors), see Set up tracing and the Python client’s OpenTelemetry tracing reference.

LangGraph tracing example

Runnable Google Colab notebook showing experiment tracing with LangGraph and nested spans.

Handle row-level failures

If you run with exit_on_error=False, the run continues past failing rows. Inspect the returned DataFrame afterward and work from the schema it actually has before deciding what to retry.
The exact columns in experiment_df can vary by run. Check experiment_df.columns in your environment before you hard-code a retry filter.
A safe retry pattern, sketched in code after this list, is:
  1. Inspect experiment_df.columns and identify how your environment marks failed rows.
  2. Filter experiment_df down to just the rows you want to retry.
  3. Keep only the original dataset columns when you build a temporary retry dataset.
  4. Rerun client.experiments.run() against that temporary dataset.
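A minimal sketch of that pattern, assuming failed rows are marked by an error-style column and that experiment_df carries each row's example_id (both are assumptions — confirm them against the columns from step 1):
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

# 1. Inspect the columns your run actually produced.
print(experiment_df.columns.tolist())

# 2. Filter to failed rows. The "error" column and "example_id" key are
#    assumptions -- swap in whatever step 1 shows for your environment.
failed_ids = set(experiment_df.loc[experiment_df["error"].notna(), "example_id"])

# 3. Rebuild just those examples, keeping only original dataset columns.
examples_df = client.datasets.list_examples(
    dataset="YOUR_DATASET_ID_OR_NAME",
    all=True,
).to_df()
retry_examples = examples_df[examples_df["id"].isin(failed_ids)].drop(
    columns=["id", "created_at", "updated_at"], errors="ignore"
)

retry_dataset = client.datasets.create(
    name="support-dataset-retry",
    space="your-space-name-or-id",
    examples=retry_examples,
)

# 4. Rerun the same task and evaluators against the temporary retry dataset.
experiment, retry_df = client.experiments.run(
    name="support-agent-baseline-retry",
    dataset=retry_dataset.id,
    task=answer_question,
    evaluators=[correctness],
)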

Next step

CI/CD with experiments

Automate experiments as regression gates on every PR or deploy. Remote runs are a natural fit for CI, but the same loop also works with client.experiments.run().
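As a rough sketch of such a gate on the run() path: rerun the task, average an evaluator score, and exit non-zero when it falls below a threshold. The score column name below is an assumption — check experiment_df.columns in your environment and point the gate at your correctness evaluator's actual score column.
Python
import sys

from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

THRESHOLD = 0.8  # minimum acceptable mean correctness score

experiment, experiment_df = client.experiments.run(
    name="ci-regression-gate",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=answer_question,
    evaluators=[correctness],
)

# Assumed column name -- replace with the correctness score column your run produces.
mean_score = experiment_df["eval.correctness.score"].mean()

if mean_score < THRESHOLD:
    print(f"Correctness {mean_score:.2f} is below the {THRESHOLD} threshold; failing the build.")
    sys.exit(1)

print(f"Correctness {mean_score:.2f} meets the {THRESHOLD} threshold.")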

Further reading