Two approaches
The rest of the page follows these two paths:

| | Run an experiment | Log an experiment |
|---|---|---|
| Who executes the task | Your Python environment | Your code, anywhere |
| Upload interface | Python SDK | Python SDK, TypeScript/JS SDK, CLI (ax), REST API |
| Concurrency, tracing, evals | Handled by the client | You handle execution and any scores |
| Best for | Self-contained Python task functions | Pipelines, agents, sandboxes, CI, and remote runtimes |
Run an experiment
Use this when the entire task fits inside a single Python function and you want the Python client to orchestrate it for you. The client resolves the dataset, runs the task against every row, scores the outputs with the evaluators you pass, and logs the results.

Define a task
A task function is the unit of work you want to measure: a single LLM call, a retrieval pipeline, an agent workflow, or any application logic. This is where you decide what stays fixed and what changes between runs. If you still need to create or refine the dataset first, use Build a dataset.
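A minimal sketch, assuming a dataset with a hypothetical question column and an OpenAI-backed task; swap in your own client and schema:

```python
from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_question(dataset_row) -> str:
    # "question" is a hypothetical custom column -- read whatever your
    # dataset actually provides (see the column note below).
    question = dataset_row["question"]
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```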
The task receives each dataset example as dataset_row. Read whatever columns your dataset provides, whether they are OpenInference attributes such as attributes.input.value or custom columns such as input, question, tickers, or focus.
To version the task’s prompt in Prompts instead of hard-coding it, fetch it inside your task with client.prompts.get().
Add an evaluator
Use a code evaluator when you have deterministic rules or ground truth. Use an LLM judge when the criterion is more subjective. A clear starting point is a function that returns an EvaluationResult:
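A minimal sketch, assuming a hypothetical expected ground-truth column and that function evaluators receive the same mapped arguments (output, dataset_row) that class-based evaluators can request; the import path varies by SDK version:

```python
# Import path is an assumption -- adjust to your arize SDK version.
from arize.experimental.datasets.experiments.types import EvaluationResult

def correctness(output, dataset_row) -> EvaluationResult:
    # "expected" is a hypothetical ground-truth column in the dataset.
    expected = str(dataset_row["expected"])
    matched = expected.strip().lower() in str(output).strip().lower()
    return EvaluationResult(
        score=float(matched),
        label="correct" if matched else "incorrect",
        explanation=f"Expected {expected!r} to appear in the output.",
    )
```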
Run it
Pass your task and evaluators to the client with the documented dataset= parameter, pointing at the dataset by ID or name.
Continuing from the task and evaluator above:
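A sketch of the call, assuming an already-constructed Arize client; parameter names other than dataset= are assumptions, so check the Python experiments API reference for the exact signature:

```python
experiment, experiment_df = client.experiments.run(
    name="baseline-v1",        # name runs consistently for clean diffs
    dataset="qa-regression",   # dataset ID or name
    task=answer_question,
    evaluators=[correctness],
)
```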
client.experiments.run() resolves the dataset, runs the task and evaluators in your Python environment, and logs the results to Arize. Name runs consistently so the comparison view diffs cleanly; Plan a baseline covers the naming pattern itself.
For the full parameter list, see the Python experiments API reference. If you’re migrating older experiment code, see the experiments migration guide.
Use dry_run=True to test the loop locally without logging results. In dry-run mode, client.experiments.run() returns (None, experiment_df).

Log an experiment
Choose this path when the task already runs in its own environment: another service, a TypeScript app, a sandbox, CI, or a notebook with its own orchestration. Compute the outputs yourself, key each row by example_id, and upload the finished results.
Assemble your results
Put your task outputs in a table such as a DataFrame, CSV, JSON, JSONL, or Parquet file. Every row must carry:
- example_id, the dataset row ID the result corresponds to
- output (or another column you map to it), the task output for that example

Optionally include evaluator columns, such as label and score for a metric like correctness, or attach evaluators later if you do not have them yet.
If your results already contain example IDs, keep them. If they do not, fetch the dataset examples first and map id to example_id before you upload the run. When you resolve a dataset by name while fetching examples, pass space= to client.datasets.list_examples(). Your external job should produce one result row for each dataset example it ran; the example below shows the required shape and maps each row to an existing dataset example ID.
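A sketch of the required shape; the example IDs and outputs below are hypothetical placeholders for your external job's results:

```python
import pandas as pd

# One result row per dataset example the job ran, keyed by example_id.
results_df = pd.DataFrame(
    [
        {"example_id": "ex_001", "output": "Paris"},
        {"example_id": "ex_002", "output": "Berlin"},
    ]
)

# If your results lack IDs, fetch the examples first and map id to
# example_id -- pass space= when resolving the dataset by name.
# (Parameter names other than space= are assumptions.)
# examples = client.datasets.list_examples(dataset="qa-regression", space="my-space")
```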
Upload the run
- By Arize Skills
- By Code
Use the Arize skills plugin with the arize-experiment skill to upload a results file directly. The file must have example_id and output columns (CSV, JSON, JSONL, or Parquet). See ax experiments create for the full schema.

Try asking your agent:
- “Upload runs.csv as a new experiment on dataset ds_xxx and name it baseline-v1.”
- “Create an experiment from nightly_runs.jsonl for dataset qa-regression.”
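By code, upload the same table with create(), the logging counterpart to run(); both appear in the Python experiments API reference. A sketch, with parameter names assumed:

```python
# Sketch only: create() is documented in the API reference, but the
# parameter names below are assumptions. Evaluator columns such as label
# and score can be mapped through EvaluationResultFieldNames (see
# "Evaluate a remote experiment" below).
client.experiments.create(
    name="baseline-v1",
    dataset="ds_xxx",    # dataset ID or name
    data=results_df,     # must carry example_id and output columns
)
```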

Evaluate a remote experiment
You have two options for attaching scores to a remote run:
- Score outputs yourself and upload alongside results. Add evaluator columns such as label and score to the results DataFrame and map them through EvaluationResultFieldNames, as shown in the Python upload example above. Use this when the evaluator logic lives in the same environment as your task.
- Attach evaluators in Arize after upload. Upload the experiment without eval columns, then open the experiment results page and click Add Evaluator, run the arize-evaluator skill, or run an existing evaluator from the experiment workflow. Use this when you want an LLM-as-a-judge scored from Arize itself, especially across remote runs from multiple languages.
Try asking your agent:
- “Create a correctness LLM-as-a-judge evaluator using my OpenAI integration and run it on experiment exp_xxx.”
- “Score every run in experiment exp_xxx with a groundedness judge.”

Use the arize-ai-provider-integration skill to set up your OpenAI or Anthropic keys, or your Bedrock role.
For the broader evaluator workflow:
- Create evaluators: create or manage reusable evaluators.
- Run offline evals on experiments: run evaluators against an existing experiment.
- Human review and Labeling queues: collect labels before you automate.
Manage your experiments
Compare experiments
Whether you logged the run or had Arize run it, compare the results in the same experiments UI. For the full walkthrough, see Compare experiments.

Export or get results
If you want experiment metadata and runs programmatically, pull them in code:
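A sketch only; the accessor name below is hypothetical, so check the Python experiments API reference for the real call:

```python
# Hypothetical accessor -- substitute the documented call from the
# Python experiments API reference.
experiment_df = client.experiments.get_dataframe("exp_xxx")
print(experiment_df.head())
```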
Tag a winner
Once a prompt variant clears the baseline on the evaluators you care about, tag that prompt version as production in Prompts, or use whatever label your application loads in production. For model or pipeline changes, promote the winning value in your app configuration and keep the experiment name or metadata tied to that promoted version.
Classification metrics
If each experiment returns a categorical label instead of free-form text, configure classification metrics from the dataset’s Experiments tab. The full setup for ground-truth mapping, positive-class selection, and metric definitions lives on Experiment in Playground.

Additional code workflows
Once the main loop is in place, use these patterns to work faster or handle edge cases. The sections below are mostly about the Python run() path.
Evaluator patterns
If the minimal evaluator above is enough, stop there. Function evaluators can return an EvaluationResult, a numeric score, or a string label; use EvaluationResult when you want to include score, label, and explanation together. Class-based evaluators can accept mapped inputs such as input, output, dataset_row, and metadata. Use the patterns below when you need more than one evaluator in the same run, shared state, or reusable evaluators.
For deeper evaluator references, see Run offline evals on experiments and Create evaluators.
Multiple evaluators
Pass a list to evaluators= and Arize runs each evaluator against each experiment result. Start with multiple function evaluators, or mix function and class-based evaluators in the same list. Each evaluator shows up as its own column in the comparison view.
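A multiple-evaluator sketch, mixing the EvaluationResult evaluator from above with a numeric-score and a string-label function evaluator; run() parameter names are assumptions:

```python
def length_under_limit(output) -> float:
    # Numeric-score evaluator: 1.0 if the answer stays under 500 characters.
    return float(len(output) <= 500)

def verbosity(output) -> str:
    # String-label evaluator: a trivial labeling rule.
    return "terse" if len(output) < 40 else "verbose"

experiment, experiment_df = client.experiments.run(
    name="baseline-v1-multi-eval",
    dataset="qa-regression",
    task=answer_question,
    evaluators=[correctness, length_under_limit, verbosity],  # one column each
)
```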
Class-based evaluators
The main API reference focuses on function evaluators. Use a subclass of Evaluator when an evaluator holds shared state, runs async, or is reused across projects.
Class-based evaluator methods can request the inputs they need:
| Parameter | Description | Example |
|---|---|---|
| input | Experiment run input | def evaluate(self, input, **kwargs): ... |
| output | Experiment run output | def evaluate(self, output, **kwargs): ... |
| dataset_row | The full dataset row, including every column | def evaluate(self, dataset_row, **kwargs): ... |
| metadata | Experiment metadata | def evaluate(self, metadata, **kwargs): ... |
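A sketch of a class-based code evaluator holding shared state; both import paths are assumptions, so adjust them to your SDK version:

```python
# Import paths are assumptions -- adjust to your arize SDK version.
from arize.experimental.datasets.experiments.evaluators.base import Evaluator
from arize.experimental.datasets.experiments.types import EvaluationResult

class KeywordCoverage(Evaluator):
    """Shared state: a keyword list reused across runs and projects."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        # Requests the mapped inputs it needs, per the table above.
        hits = [k for k in self.keywords if k in str(output).lower()]
        coverage = len(hits) / len(self.keywords)
        return EvaluationResult(
            score=coverage,
            label="covered" if coverage == 1.0 else "partial",
            explanation=f"Matched {hits} of {self.keywords}.",
        )

# Usage: evaluators=[KeywordCoverage(["refund", "policy"])]
```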
This example uses Phoenix Evals and an OpenAI-backed judge. Install phoenix-evals and set OPENAI_API_KEY before running it.
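A sketch, reusing the assumed Evaluator import path from the previous example and a hypothetical question column; OpenAIModel's constructor argument may differ across phoenix-evals versions:

```python
from phoenix.evals import OpenAIModel

# Evaluator/EvaluationResult import paths as assumed above.
from arize.experimental.datasets.experiments.evaluators.base import Evaluator
from arize.experimental.datasets.experiments.types import EvaluationResult

JUDGE_TEMPLATE = """You are grading an answer for correctness.
Question: {question}
Answer: {answer}
Reply with exactly one word: correct or incorrect."""

class LLMCorrectness(Evaluator):
    def __init__(self):
        # Shared state: one judge model reused for every row.
        self.model = OpenAIModel(model="gpt-4o-mini")

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        prompt = JUDGE_TEMPLATE.format(
            question=dataset_row["question"],  # hypothetical column
            answer=output,
        )
        label = self.model(prompt).strip().lower()  # judge models are callable
        return EvaluationResult(
            score=float(label == "correct"),
            label=label,
            explanation=f"Judge replied {label!r}.",
        )
```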
Async experiments
When throughput matters on the run() path, declare your task and evaluators with async def and raise concurrency. In Jupyter, install nest_asyncio with pip install nest_asyncio (it is not bundled with arize) and call nest_asyncio.apply() first so the runner can nest its event loop inside the kernel’s.

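An async sketch; the concurrency parameter name is an assumption, so check the Python experiments API reference for the documented knob:

```python
import asyncio

# In Jupyter only: pip install nest_asyncio, then uncomment these lines.
# import nest_asyncio
# nest_asyncio.apply()

async def answer_question_async(dataset_row) -> str:
    # Stand-in for an async LLM call; swap in your async client.
    await asyncio.sleep(0)
    return f"Answer to: {dataset_row['question']}"

async def non_empty(output) -> float:
    return float(bool(str(output).strip()))

experiment, experiment_df = client.experiments.run(
    name="baseline-v1-async",
    dataset="qa-regression",
    task=answer_question_async,
    evaluators=[non_empty],
    concurrency=10,  # parameter name is an assumption
)
```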
Dataset sampling
For quick spot checks or balanced subsets, sample the dataset before running. The fastest path is dry_run=True with dry_run_count, which runs the task against the first N examples without logging:
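A sketch using the documented dry_run and dry_run_count flags; other parameter names are assumptions:

```python
# Runs the task against the first 10 examples without logging anything;
# dry-run mode returns (None, experiment_df).
_, experiment_df = client.experiments.run(
    dataset="qa-regression",
    task=answer_question,
    evaluators=[correctness],
    dry_run=True,
    dry_run_count=10,
)
print(experiment_df.head())
```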
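For a random or balanced subset rather than the first N rows, a sketch; the non-space parameters of list_examples and the dataset-creation call are assumptions:

```python
import pandas as pd

# Fetch examples, sample a subset, and build a temporary dataset to run on.
examples = client.datasets.list_examples(dataset="qa-regression", space="my-space")
sample_df = pd.DataFrame(examples).sample(n=50, random_state=42)

# Hypothetical dataset-creation call -- check your SDK for the real method.
client.datasets.create(name="qa-regression-sample", data=sample_df)

experiment, experiment_df = client.experiments.run(
    name="baseline-v1-sample",
    dataset="qa-regression-sample",
    task=answer_question,
    evaluators=[correctness],
)
```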
Experiment tracing
When you want your own spans for retrieval, tool calls, or nested model activity, pass set_global_tracer_provider=True so the experiment run registers a global tracer provider for that execution. Use it when you want manual or auto-instrumented tracing to participate in the same run-time tracing setup.
Install the OpenTelemetry and OpenInference packages required by your tracing setup before running these examples. For broader setup guidance, see Set up tracing.
Explicit spans. Create spans manually for the parts of the task you want visible:
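A sketch using the OpenTelemetry API directly; retrieve and generate are hypothetical stand-ins for your own steps:

```python
from opentelemetry import trace

def answer_with_spans(dataset_row) -> str:
    tracer = trace.get_tracer(__name__)
    # Manual spans make each stage of the task visible in the run's trace.
    with tracer.start_as_current_span("retrieval"):
        context = retrieve(dataset_row["question"])  # hypothetical helper
    with tracer.start_as_current_span("generation"):
        return generate(context)                     # hypothetical helper

experiment, experiment_df = client.experiments.run(
    name="baseline-v1-traced",
    dataset="qa-regression",
    task=answer_with_spans,
    evaluators=[correctness],
    set_global_tracer_provider=True,  # registers the run's tracer globally
)
```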
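If you rely on an OpenInference auto-instrumentor instead of manual spans, a sketch with the OpenAI instrumentor (install openinference-instrumentation-openai first):

```python
from openinference.instrumentation.openai import OpenAIInstrumentor

# Instrument once per process; OpenAI calls made inside the task are then
# traced automatically and attached to the experiment run.
OpenAIInstrumentor().instrument()

experiment, experiment_df = client.experiments.run(
    name="baseline-v1-auto-traced",
    dataset="qa-regression",
    task=answer_question,  # calls OpenAI internally
    evaluators=[correctness],
    set_global_tracer_provider=True,
)
```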
LangGraph tracing example
Runnable Google Colab notebook showing experiment tracing with LangGraph and nested spans.
Handle row-level failures
If you leave exit_on_error=False, inspect the returned DataFrame after the run and check its schema in your environment before deciding what to retry.
The exact columns in experiment_df can vary by run. Check experiment_df.columns in your environment before you hard-code a retry filter.

- Inspect experiment_df.columns and identify how your environment marks failed rows.
- Filter experiment_df down to just the rows you want to retry.
- Keep only the original dataset columns when you build a temporary retry dataset.
- Rerun client.experiments.run() against that temporary dataset, as shown in the sketch after this list.
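A sketch of those steps; the failure-marker column ("error") and the dataset-creation call are assumptions for your environment:

```python
experiment, experiment_df = client.experiments.run(
    name="baseline-v1",
    dataset="qa-regression",
    task=answer_question,
    evaluators=[correctness],
    exit_on_error=False,
)

print(experiment_df.columns)  # find how your environment marks failures
failed = experiment_df[experiment_df["error"].notna()]  # assumed column name

# Keep only the original dataset columns for the temporary retry dataset.
retry_df = failed[["question", "expected"]]
client.datasets.create(name="qa-regression-retry", data=retry_df)  # hypothetical

experiment, experiment_df = client.experiments.run(
    name="baseline-v1-retry",
    dataset="qa-regression-retry",
    task=answer_question,
    evaluators=[correctness],
)
```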
Next step
CI/CD with experiments
Automate experiments as regression gates on every PR or deploy. Remote runs are a natural fit for CI, but the same loop also works with client.experiments.run().

Further reading
- Run offline evals on experiments: attach evaluators after the experiment exists, or use the full evaluator inputs and outputs guide.
- View and manage traces: find the next failure mode to turn into dataset rows.
- Python experiments API: full parameter list for run() and create().