Use code when the experiment lives outside a single prompt call: a pipeline, agent, sandbox, CI job, or any runtime you control. You can either let the Python client run the task against the dataset for you, or upload results you already produced elsewhere.

Two approaches

The rest of the page follows these two paths:
|  | Run an experiment | Log an experiment |
| --- | --- | --- |
| Who executes the task | Your Python environment | Your code, anywhere |
| Upload interface | Python SDK | Python SDK, TypeScript/JS SDK, CLI (ax), REST API |
| Concurrency, tracing, evals | Handled by the client | You handle execution and any scores |
| Best for | Self-contained Python task functions | Pipelines, agents, sandboxes, CI, and remote runtimes |
Both paths land in the same experiments UI, so you can compare runs side by side after they finish.

Run an experiment

Use this when the entire task fits inside a single Python function and you want the Python client to orchestrate it. The client resolves the dataset, runs the task against every row, scores the outputs with the evaluators you pass, and logs the results to Arize.

Define a task

A task function is the unit of work you want to measure: a single LLM call, a retrieval pipeline, an agent workflow, or any application logic. This is where you decide what stays fixed and what changes between runs. If you still need to create or refine the dataset first, use Build a dataset.
Python
import openai

def answer_question(dataset_row) -> str:
    question = dataset_row.get("attributes.input.value", "")
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content or ""
Task functions receive the full dataset_row. Read whatever columns your dataset provides, whether they are OpenInference attributes such as attributes.input.value or custom columns such as input, question, tickers, or focus. To version the task’s prompt in Prompts instead of hard-coding it, fetch it inside your task with client.prompts.get().
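As a rough illustration of that prompt-fetching pattern, the sketch below pulls a versioned prompt inside the task instead of hard-coding it. The prompt name is a placeholder, and the exact arguments and return shape of client.prompts.get() depend on your client version, so adapt the call and the message formatting to what your client actually returns.
Python
import openai
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

def answer_question_with_versioned_prompt(dataset_row) -> str:
    question = dataset_row.get("attributes.input.value", "")
    # Placeholder identifier; the call signature may differ in your client version.
    prompt = client.prompts.get("support-agent-prompt")
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4.1",
        messages=[
            # Assumes the fetched prompt can be rendered to text for the system message.
            {"role": "system", "content": str(prompt)},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content or ""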

Add an evaluator

Use a code evaluator when you have deterministic rules or ground truth. Use an LLM judge when the criterion is more subjective. A clear starting point is a function that returns an EvaluationResult:
Python
from arize.experiments import EvaluationResult

def correctness(output, dataset_row):
    expected = dataset_row.get("attributes.output.value", "")
    generated = output or ""
    correct = bool(expected) and expected.lower() in generated.lower()
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation=f"Expected '{expected}', got '{generated[:80]}...'",
    )
Keep the first evaluator simple. For deeper evaluator workflows, see Evaluate.

Run it

Pass your task and evaluators to the client, pointing dataset= at the dataset by ID or name. Continuing from the task and evaluator above:
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

experiment, experiment_df = client.experiments.run(
    name="support-agent-baseline",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=answer_question,
    evaluators=[correctness],
    concurrency=10,
)
client.experiments.run() resolves the dataset, runs the task and evaluators in your Python environment, and logs the results to Arize. Name runs consistently so the comparison view diffs cleanly; Plan a baseline covers the naming pattern itself. For the full parameter list, see the Python experiments API reference. If you’re migrating older experiment code, see the experiments migration guide.
Use dry_run=True to test the loop locally without logging results. In dry-run mode, client.experiments.run() returns (None, experiment_df).
Once either path has produced runs, the rest of the workflow is the same.

Log an experiment

Choose this path when the task already runs in its own environment: another service, a TypeScript app, a sandbox, CI, or a notebook with its own orchestration. Compute the outputs yourself, key each row by example_id, and upload the finished results.

Assemble your results

Put your task outputs in a table such as a DataFrame, CSV, JSON, JSONL, or Parquet file. Every row must carry:
  • example_id, the dataset row ID the result corresponds to
  • output (or another column you map to it), the task output for that example
If you ran evaluators yourself, include their label and score fields alongside each row. You can map the exact column names when you upload and give that evaluator a name such as correctness, or attach evaluators later if you do not have them yet.
If your results already contain example IDs, keep them. If they do not, fetch the dataset examples first and map id to example_id before you upload the run. When you resolve a dataset by name while fetching examples, pass space= to client.datasets.list_examples(). Your external job should produce one result row for each dataset example it ran; the example below shows the required shape and maps each row to an existing dataset example ID.
Python
import pandas as pd
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

dataset_examples = client.datasets.list_examples(
    dataset="YOUR_DATASET_ID",
    all=True,
).to_df()

experiment_run_df = pd.DataFrame(
    {
        "result": ["The telephone was invented by Alexander Graham Bell."],
        "label": ["correct"],
        "score": [1],
    }
)

# Match each result row to the dataset example it came from.
experiment_run_df["example_id"] = dataset_examples["id"].head(
    len(experiment_run_df)
).to_list()

Upload the run

Use the Arize skills plugin with the arize-experiment skill to upload a results file directly. The file must have example_id and output columns (CSV, JSON, JSONL, or Parquet). See ax experiments create for the full schema. Try asking your agent:
  • “Upload runs.csv as a new experiment on dataset ds_xxx and name it baseline-v1.”
  • “Create an experiment from nightly_runs.jsonl for dataset qa-regression.”
(Figure: a coding agent using the arize-experiment skill to upload experiment runs and create a new experiment.)
Remote experiments fit pipelines you already own. The task stays in your runtime, and Arize records and compares the resulting runs.

Evaluate a remote experiment

You have two options for attaching scores to a remote run:
  1. Score outputs yourself and upload alongside results. Add evaluator columns such as label and score to the results DataFrame, as in the Python example above, and map them to a named evaluator through EvaluationResultFieldNames when you upload. Use this when the evaluator logic lives in the same environment as your task.
  2. Attach evaluators in Arize after upload. Upload the experiment without eval columns, then open the experiment results page and click Add Evaluator, run the arize-evaluator skill, or run an existing evaluator from the experiment workflow. Use this when you want an LLM-as-a-judge scored from Arize itself, especially across remote runs from multiple languages.
Try asking your agent:
  • “Create a correctness LLM-as-a-judge evaluator using my OpenAI integration and run it on experiment exp_xxx.”
  • “Score every run in experiment exp_xxx with a groundedness judge.”
The judge LLM needs stored credentials. Use the arize-ai-provider-integration skill to set up your OpenAI or Anthropic keys, or your Bedrock role. For the broader evaluator workflow, see Evaluate.

Manage your experiments

Compare experiments

Whether you logged the run or had Arize run it, compare the results in the same experiments UI. For the full walkthrough, see Compare experiments.

Export or get results

If you want experiment metadata and runs programmatically, pull them in code:
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

experiment = client.experiments.get(
    experiment="support-agent-baseline",
    dataset="YOUR_DATASET_ID",
)

runs_df = client.experiments.list_runs(
    experiment=experiment.id,
    all=True,
).to_df()

Tag a winner

Once a prompt variant clears the baseline on the evaluators you care about, tag that prompt version as production in Prompts, or use whatever label your application loads in production. For model or pipeline changes, promote the winning value in your app configuration and keep the experiment name or metadata tied to that promoted version.

Classification metrics

If each experiment returns a categorical label instead of free-form text, configure classification metrics from the dataset’s Experiments tab. The full setup for ground-truth mapping, positive-class selection, and metric definitions lives on Experiment in Playground.

Additional code workflows

Once the main loop is in place, use these patterns to work faster or handle edge cases. The sections below are mostly about the Python run() path.

Evaluator patterns

If the minimal evaluator above is enough, stop there. Function evaluators can return an EvaluationResult, a numeric score, or a string label; use EvaluationResult when you want to include score, label, and explanation together. Class-based evaluators can accept mapped inputs such as input, output, dataset_row, and metadata. Use the patterns below when you need more than one evaluator in the same run, shared state, or reusable evaluators. For deeper evaluator references, see Run offline evals on experiments and Create evaluators.
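As a quick illustration of those return shapes, here is a minimal sketch with one function evaluator per shape — a numeric score, a string label, and a full EvaluationResult:
Python
from arize.experiments import EvaluationResult

def output_length(output, dataset_row) -> float:
    # Numeric return: recorded as the evaluator's score.
    return float(len(output or ""))

def verdict(output, dataset_row) -> str:
    # String return: recorded as the evaluator's label.
    return "non-empty" if output else "empty"

def graded(output, dataset_row) -> EvaluationResult:
    # EvaluationResult: score, label, and explanation together.
    present = bool(output)
    return EvaluationResult(
        score=float(present),
        label="present" if present else "missing",
        explanation="Task returned output." if present else "Task returned no output.",
    )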

Multiple evaluators

Pass a list to evaluators= and Arize runs each evaluator against each experiment result. Start with multiple function evaluators, or mix function and class-based evaluators in the same list. Each evaluator shows up as its own column in the comparison view.
Python
from arize import ArizeClient
from arize.experiments import EvaluationResult

client = ArizeClient(api_key="YOUR_API_KEY")

def echo_input(dataset_row) -> str:
    return dataset_row.get("attributes.input.value", "")

def correctness(output, dataset_row) -> EvaluationResult:
    expected = dataset_row.get("attributes.output.value", "")
    generated = output or ""
    correct = bool(expected) and expected.lower() in generated.lower()
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
    )

def has_output(output, dataset_row) -> EvaluationResult:
    present = bool(output)
    return EvaluationResult(
        score=float(present),
        label="present" if present else "missing",
    )

experiment, experiment_df = client.experiments.run(
    name="multi-eval-experiment",
    dataset="YOUR_DATASET_ID",
    task=echo_input,
    evaluators=[
        correctness,
        has_output,
    ],
)

Class-based evaluators

The main API reference focuses on function evaluators. Use a subclass of Evaluator when an evaluator holds shared state, runs async, or is reused across projects. Class-based evaluator methods can request the inputs they need:
| Parameter | Description | Example |
| --- | --- | --- |
| input | Experiment run input | def evaluate(self, input, **kwargs): ... |
| output | Experiment run output | def evaluate(self, output, **kwargs): ... |
| dataset_row | The full dataset row, including every column | def evaluate(self, dataset_row, **kwargs): ... |
| metadata | Experiment metadata | def evaluate(self, metadata, **kwargs): ... |
Python
from arize.experiments import EvaluationResult, Evaluator

class MatchesExpected(Evaluator):
    annotator_kind = "CODE"
    name = "matches_expected"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        expected_output = dataset_row.get("attributes.output.value")
        matches = expected_output == output
        label = "match" if matches else "mismatch"
        score = float(matches)
        return EvaluationResult(score=score, label=label)

    async def async_evaluate(self, *, output, dataset_row, **kwargs) -> EvaluationResult:
        return self.evaluate(output=output, dataset_row=dataset_row)
This example uses Phoenix Evals and an OpenAI-backed judge. Install phoenix-evals and set OPENAI_API_KEY before running it.
Python
import os
import pandas as pd
from arize.experiments import EvaluationResult, Evaluator
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

class HallucinationEvaluator(Evaluator):
    name = "hallucination"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        # The hallucination template expects "input", "reference", and "output" columns.
        df_in = pd.DataFrame(
            {
                "input": [dataset_row.get("attributes.input.value", "")],
                "reference": [dataset_row.get("attributes.output.value", "")],
                "output": [output or ""],
            }
        )
        result = llm_classify(
            dataframe=df_in,
            template=HALLUCINATION_PROMPT_TEMPLATE,
            model=OpenAIModel(
                model="gpt-4o-mini",
                api_key=os.getenv("OPENAI_API_KEY"),
            ),
            rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
            provide_explanation=True,
        )
        label = result["label"].iloc[0]
        return EvaluationResult(
            score=1 if label == "factual" else 0,
            label=label,
            explanation=result["explanation"].iloc[0],
        )

    async def async_evaluate(self, *, output, dataset_row, **kwargs) -> EvaluationResult:
        return self.evaluate(output=output, dataset_row=dataset_row)

Async experiments

When throughput matters on the run() path, declare your task and evaluators with async def and raise concurrency. In Jupyter, install nest_asyncio with pip install nest_asyncio (it is not bundled with arize) and call nest_asyncio.apply() first so the runner can nest its event loop inside the kernel’s.
Diagram showing synchronous sequential execution versus asynchronous parallel execution for experiment tasks and evaluators
Python
from arize import ArizeClient
from arize.experiments import EvaluationResult
import nest_asyncio

client = ArizeClient(api_key="YOUR_API_KEY")
nest_asyncio.apply()

async def async_task(dataset_row):
    return dataset_row.get("attributes.input.value", "")

async def async_has_output(output, dataset_row):
    present = bool(output)
    return EvaluationResult(
        score=float(present),
        label="present" if present else "missing",
    )

experiment, experiment_df = client.experiments.run(
    name="async-experiment",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=async_task,
    evaluators=[async_has_output],
    concurrency=10,
)
Start with synchronous tasks and evaluators while you’re developing the experiment. Sync failures usually break at the line that raised the error, which makes them easier to debug before you switch to async for throughput.

Dataset sampling

For quick spot checks or balanced subsets, sample the dataset before running. The fastest path is dry_run=True with dry_run_count, which runs the task against the first N examples without logging:
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

def quick_task(dataset_row) -> str:
    return dataset_row.get("attributes.input.value", "")

experiment, experiment_df = client.experiments.run(
    name="smoke-test",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=quick_task,
    dry_run=True,
    dry_run_count=10,
)
When you want a specific sample (random, stratified, or systematic), pull the examples, sample in pandas, drop system-managed fields, and create a temporary dataset:
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

def sampled_task(dataset_row) -> str:
    return dataset_row.get("attributes.input.value", "")

examples_df = client.datasets.list_examples(
    dataset="YOUR_DATASET_ID_OR_NAME",
    all=True,
).to_df()

sampled_df = examples_df.sample(frac=0.1, random_state=42)
# Stratify by a label column to preserve class balance:
# sampled_df = examples_df.groupby("class_label", group_keys=False).apply(
#     lambda g: g.sample(frac=0.1, random_state=42)
# )
# Or select every 10th row:
# sampled_df = examples_df.iloc[::10, :]

# Keep only dataset example columns before creating a new dataset.
example_columns = [
    col for col in sampled_df.columns if col not in {"id", "created_at", "updated_at"}
]
sampled_examples = sampled_df[example_columns].copy()

sampled_dataset = client.datasets.create(
    name="support-dataset-sample-10pct",
    space="your-space-name-or-id",
    examples=sampled_examples,
)

experiment, experiment_df = client.experiments.run(
    name="sampled-experiment",
    dataset=sampled_dataset.id,
    task=sampled_task,
)
You can run experiments on different samples of the same source dataset. Arize still tracks each run against the data it actually saw, so you can compare and visualize those sampled runs cleanly in the product.

Experiment tracing

When you want your own spans for retrieval, tool calls, or nested model activity, pass set_global_tracer_provider=True so the experiment run registers a global tracer provider for that execution. Use it when you want manual or auto-instrumented tracing to participate in the same run-time tracing setup. Install the OpenTelemetry and OpenInference packages required by your tracing setup before running these examples. For broader setup guidance, see Set up tracing.
Explicit spans. Create spans manually for the parts of the task you want visible:
Python
from arize import ArizeClient
from opentelemetry import trace

client = ArizeClient(api_key="YOUR_API_KEY")

def task_add_1(dataset_row):
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("test_function") as span:
        num = dataset_row.get("attributes.my_number")
        span.set_attribute("dataset.my_number", num)
        return num + 1

experiment, experiment_df = client.experiments.run(
    name="tracing-demo",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=task_add_1,
    set_global_tracer_provider=True,
)
Auto-instrumentor. For LLM, framework, or vector-store calls, install the matching OpenInference auto-instrumentor so library calls made inside your task can emit spans during the run:
Python
from arize import ArizeClient
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor

client = ArizeClient(api_key="YOUR_API_KEY")
openai_client = OpenAI()

OpenAIInstrumentor().instrument()

def traced_task(dataset_row) -> str:
    question = dataset_row.get("attributes.input.value", "")
    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content or ""

experiment, experiment_df = client.experiments.run(
    name="auto-instrumented",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=traced_task,
    set_global_tracer_provider=True,
)
For broader tracing guidance (providers, exporters, and other instrumentors), see Set up tracing and the Python client’s OpenTelemetry tracing reference.

LangGraph tracing example

Runnable Google Colab notebook showing experiment tracing with LangGraph and nested spans.

Handle row-level failures

If you run with exit_on_error=False, the run continues past failing rows. Inspect the returned DataFrame afterward and work from the schema it actually has before deciding what to retry.
The exact columns in experiment_df can vary by run. Check experiment_df.columns in your environment before you hard-code a retry filter.
A safe retry pattern, sketched in code after this list, is:
  1. Inspect experiment_df.columns and identify how your environment marks failed rows.
  2. Filter experiment_df down to just the rows you want to retry.
  3. Keep only the original dataset columns when you build a temporary retry dataset.
  4. Rerun client.experiments.run() against that temporary dataset.
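A minimal sketch of that pattern, assuming failed rows are marked by an error-style column and that experiment_df carries each row's example_id (both are assumptions — confirm them against the columns from step 1):
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

# 1. Inspect the columns your run actually produced.
print(experiment_df.columns.tolist())

# 2. Filter to failed rows. The "error" column and "example_id" key are
#    assumptions -- swap in whatever step 1 shows for your environment.
failed_ids = set(experiment_df.loc[experiment_df["error"].notna(), "example_id"])

# 3. Rebuild just those examples, keeping only original dataset columns.
examples_df = client.datasets.list_examples(
    dataset="YOUR_DATASET_ID_OR_NAME",
    all=True,
).to_df()
retry_examples = examples_df[examples_df["id"].isin(failed_ids)].drop(
    columns=["id", "created_at", "updated_at"], errors="ignore"
)

retry_dataset = client.datasets.create(
    name="support-dataset-retry",
    space="your-space-name-or-id",
    examples=retry_examples,
)

# 4. Rerun the same task and evaluators against the temporary retry dataset.
experiment, retry_df = client.experiments.run(
    name="support-agent-baseline-retry",
    dataset=retry_dataset.id,
    task=answer_question,
    evaluators=[correctness],
)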

Next step

CI/CD with experiments

Automate experiments as regression gates on every PR or deploy. Remote runs are a natural fit for CI, but the same loop also works with client.experiments.run().
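As a rough sketch of such a gate on the run() path: rerun the task, average an evaluator score, and exit non-zero when it falls below a threshold. The score column name below is an assumption — check experiment_df.columns in your environment and point the gate at your correctness evaluator's actual score column.
Python
import sys

from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

THRESHOLD = 0.8  # minimum acceptable mean correctness score

experiment, experiment_df = client.experiments.run(
    name="ci-regression-gate",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=answer_question,
    evaluators=[correctness],
)

# Assumed column name -- replace with the correctness score column your run produces.
mean_score = experiment_df["eval.correctness.score"].mean()

if mean_score < THRESHOLD:
    print(f"Correctness {mean_score:.2f} is below the {THRESHOLD} threshold; failing the build.")
    sys.exit(1)

print(f"Correctness {mean_score:.2f} meets the {THRESHOLD} threshold.")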

Further reading