pass annotation. Because
the results are recorded through ordinary pytest, the exit code of the test run becomes
your CI gate without any additional configuration.
Installation
The plugin ships with arize-phoenix-client and is activated by the pytest extra.
Once it is installed, pytest discovers it automatically through its plugin entry point,
so no conftest.py configuration is required.
To use evaluators built on phoenix.evals, install the evals extra as well.
Marking Tests
Apply the @pytest.mark.phoenix marker to any test you want to record. Tests without
the marker run normally and are not sent to Phoenix. Combine the marker with
pytest.mark.parametrize to turn each set of parameters into a dataset example, and use the
module-level helpers to record the output and any evaluations.
The ids you supply to parametrize give each case a stable identity. Re-running the
suite maps each case back to the same dataset example, so runs accumulate as
experiments over a fixed set of examples rather than creating duplicates.
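The effect of stable ids can be pictured with a toy model that treats a dataset as a mapping keyed by case id. This is an illustration only, not the plugin's implementation: the ToyDataset class and its method names are hypothetical, and the plugin may derive example identity differently.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class ToyDataset:
    """Hypothetical in-memory stand-in for a dataset plus its experiment runs."""

    def __init__(self) -> None:
        self.examples: Dict[str, Dict[str, Any]] = {}
        self.runs: Dict[str, List[Tuple[str, Any]]] = defaultdict(list)

    def record(self, experiment: str, case_id: str, inputs: Dict[str, Any], output: Any) -> None:
        # A stable case id maps back to the same example on every run...
        self.examples.setdefault(case_id, inputs)
        # ...so each run is appended to that example instead of duplicating it.
        self.runs[case_id].append((experiment, output))


dataset = ToyDataset()
cases = {"capital-fr": {"country": "France"}, "capital-jp": {"country": "Japan"}}

# Simulate running the same suite twice.
for experiment in ("run-1", "run-2"):
    for case_id, inputs in cases.items():
        dataset.record(experiment, case_id, inputs, output=f"answer for {inputs['country']}")

print(len(dataset.examples))            # 2 examples, not 4
print(len(dataset.runs["capital-fr"]))  # 2 runs recorded against one example
```

If the ids changed between runs (for instance, if they were derived from list position and a case was inserted), the second run would create new keys and the history of each case would be split.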
The Marker
@pytest.mark.phoenix accepts three optional keyword arguments:
- dataset sets the name of the dataset and experiment. When omitted, it defaults to the test file’s path relative to your project root (for example, tests/evals/test_sql), so tests in different files become separate datasets and two files that share a basename never collide. The full precedence is PHOENIX_TEST_DATASET (environment) > phoenix_dataset (pytest.ini) > this dataset= kwarg > the file-path default.
- evaluators is a list of evaluators that run automatically against every case. Each evaluator’s score is recorded as an annotation alongside the pass annotation, and a failing evaluator never fails the test itself.
- repetitions runs each case multiple times, which helps reduce noise when LLM outputs vary between runs. Each repetition is expanded into a real pytest item (visible to -k, pytest-xdist, and your IDE) and recorded as a distinct run against the same example. A per-test value takes precedence over the PHOENIX_TEST_REPETITIONS environment variable.
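The dataset-name precedence described above can be sketched as a small function. This is a minimal illustration rather than the plugin's code: resolve_dataset_name is a hypothetical helper, and treating an empty string as "not set" is an assumption.

```python
from typing import Optional


def resolve_dataset_name(
    env_value: Optional[str],
    ini_value: Optional[str],
    marker_value: Optional[str],
    file_path_default: str,
) -> str:
    """Return the first configured name, in the documented precedence order."""
    # Order: PHOENIX_TEST_DATASET > phoenix_dataset (pytest.ini) > dataset= kwarg.
    # Assumption: an empty string counts as "not set".
    for candidate in (env_value, ini_value, marker_value):
        if candidate:
            return candidate
    # Nothing configured: fall back to the test file's relative path.
    return file_path_default


# No overrides: the file-path default is used.
print(resolve_dataset_name(None, None, None, "tests/evals/test_sql"))
# The marker's dataset= beats the file-path default.
print(resolve_dataset_name(None, None, "sql-evals", "tests/evals/test_sql"))
# The environment variable beats everything else.
print(resolve_dataset_name("smoke", "nightly", "sql-evals", "tests/evals/test_sql"))
```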
Logging Outputs
Record the output of the system under test with log_output. Output capture is
explicit because pytest emits a warning when a test returns a non-None value, so the
output is passed to the helper rather than returned from the test.
Logging Evaluations
An evaluator is any callable that returns a dictionary with a name and a score;
evaluators built on phoenix.evals are also accepted. There are three ways to attach
evaluations to a run.
Record a score directly with log_evaluation:
Run an evaluator inline with evaluate. The helper records the evaluator’s score on
the run and returns its result so the test can assert on it. Because a failed assertion
feeds the pass annotation, an inline evaluation can gate the individual test:
Hoist an evaluator into the marker’s evaluators argument to run it across every case in
the suite without an inline call.
Evaluators are called the same way as in run_experiment, so an
evaluator written for one behaves identically under the other. Arguments are bound by
parameter name. Declare any of the standard evaluator fields and each is supplied from the
case:
- output: what you passed to log_output.
- input: the test’s parametrized fields as a mapping (the dataset example’s input).
- expected / reference / metadata: sourced from a parametrized field of that name, if present.
- trace_id: the test run’s trace id, for correlating to its spans.
- example: not provided by the plugin; binds to None.
A standard field that has no value for the case binds to None. A parametrized field
named after a standard field (e.g. input or trace_id) takes precedence over the plugin’s
default for it. Use **kwargs (as in def correctness(output, expected, **_)) to tolerate
fields you do not consume. The evaluator’s declared kind ("CODE"/"LLM") is recorded as the
annotation’s annotator kind.
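Binding by parameter name can be sketched with the standard library's inspect module. The helpers build_fields and call_evaluator are hypothetical names for this illustration, not part of the plugin. Which fields a parametrized value may override is also an assumption: this sketch overrides only input, expected, reference, metadata, and trace_id.

```python
import inspect
from typing import Any, Callable, Dict, Mapping, Optional

# Fields that this sketch lets a parametrized field of the same name supply.
PARAM_SOURCED = ("input", "expected", "reference", "metadata", "trace_id")


def build_fields(output: Any, params: Mapping[str, Any], trace_id: Optional[str]) -> Dict[str, Any]:
    """Assemble the standard evaluator fields for one case."""
    fields: Dict[str, Any] = {
        "output": output,       # what the test logged
        "input": dict(params),  # parametrized fields as a mapping
        "expected": None,
        "reference": None,
        "metadata": None,
        "trace_id": trace_id,
        "example": None,        # not provided; always None
    }
    # A parametrized field named after a standard field wins over the default.
    for name in PARAM_SOURCED:
        if name in params:
            fields[name] = params[name]
    return fields


def call_evaluator(evaluator: Callable[..., Any], fields: Mapping[str, Any]) -> Any:
    """Call the evaluator with only the fields its signature can accept."""
    parameters = inspect.signature(evaluator).parameters
    takes_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values())
    if takes_kwargs:
        kwargs = dict(fields)  # **kwargs absorbs whatever is not named
    else:
        kwargs = {name: fields.get(name) for name in parameters}
    return evaluator(**kwargs)


def correctness(output, expected, **_):
    return {"name": "correctness", "score": float(output == expected)}


def has_trace(trace_id):
    return {"name": "has_trace", "score": float(trace_id is not None)}


fields = build_fields("SELECT 1", {"question": "Return one", "expected": "SELECT 1"}, "trace-123")
print(call_evaluator(correctness, fields))  # {'name': 'correctness', 'score': 1.0}
print(call_evaluator(has_trace, fields))    # {'name': 'has_trace', 'score': 1.0}
```

The evaluator with **_ receives every field and ignores the ones it does not name, while has_trace receives only trace_id because that is the only parameter it declares.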
If an evaluator raises, the plugin records an errored evaluation (the error is stored on
the annotation, with no score) instead of dropping it — matching run_experiment. A hoisted
evaluator failure degrades to a warning and does not fail the test; an inline evaluate()
failure re-raises after recording, so it still gates the test.
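The difference between a hoisted and an inline failure can be modeled in a few lines. This is a sketch of the behavior described above, not the plugin's code: run_evaluator, the inline flag, and the shape of the recorded entries are all hypothetical.

```python
import warnings
from typing import Any, Callable, Dict, List, Optional


def run_evaluator(
    evaluator: Callable[..., Dict[str, Any]],
    kwargs: Dict[str, Any],
    recorded: List[Dict[str, Any]],
    *,
    inline: bool,
) -> Optional[Dict[str, Any]]:
    name = getattr(evaluator, "__name__", "evaluator")
    try:
        result = evaluator(**kwargs)
    except Exception as exc:
        # Record an errored evaluation (error stored, no score) instead of dropping it.
        recorded.append({"name": name, "score": None, "error": repr(exc)})
        if inline:
            raise  # an inline failure still gates the test
        # A hoisted failure degrades to a warning and does not fail the test.
        warnings.warn(f"evaluator {name!r} failed: {exc!r}")
        return None
    recorded.append({"name": result["name"], "score": result["score"], "error": None})
    return result


def broken(output):
    raise ValueError("model unavailable")


recorded: List[Dict[str, Any]] = []

# Hoisted: the error is recorded and surfaced as a warning only.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    hoisted_result = run_evaluator(broken, {"output": "x"}, recorded, inline=False)

# Inline: the error is recorded and then re-raised to the test.
try:
    run_evaluator(broken, {"output": "x"}, recorded, inline=True)
    inline_raised = False
except ValueError:
    inline_raised = True

print(hoisted_result, len(caught), inline_raised, len(recorded))
```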
Annotations recorded by log_evaluation and evaluate do not fail a test on
their own. Only a failed assertion fails the pytest item.
Evaluations are keyed by name on a run, so calling log_evaluation or evaluate more than
once with the same name keeps only the last result. Give each evaluation you want to retain a
distinct name.
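The last-write-wins behavior is the same as assigning to a dictionary key. The snippet below is only an analogy, with a hypothetical record helper standing in for the plugin's helpers.

```python
from typing import Dict


def record(annotations: Dict[str, float], name: str, score: float) -> None:
    # Evaluations are keyed by name, so a repeated name replaces the earlier score.
    annotations[name] = score


annotations: Dict[str, float] = {}
record(annotations, "accuracy", 0.4)
record(annotations, "accuracy", 0.9)         # overwrites the 0.4 above
record(annotations, "accuracy_strict", 0.4)  # a distinct name is kept separately

print(annotations)  # {'accuracy': 0.9, 'accuracy_strict': 0.4}
```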
Configuring with Environment Variables
The plugin is configured entirely through environment variables, so the same suite can behave differently in local development and in CI without any code changes. Boolean variables accept 1, true, yes, or on as truthy and 0, false, no, or off
as falsy. An empty or unset value uses the documented default; any other value raises an
error, so a typo such as PHOENIX_TEST_TRACKING=flase fails the run rather than silently
enabling recording.
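The strict boolean parsing described above might look like the following. It is a minimal sketch under stated assumptions: parse_bool_env is a hypothetical helper, and the case-insensitive, whitespace-tolerant matching is an assumption not confirmed by this page.

```python
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}
_FALSY = {"0", "false", "no", "off"}


def parse_bool_env(env: Mapping[str, str], name: str, default: bool) -> bool:
    raw = env.get(name)
    # Unset or empty: use the documented default.
    if raw is None or raw.strip() == "":
        return default
    value = raw.strip().lower()  # assumption: case and surrounding spaces are ignored
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    # Anything else is rejected, so a typo cannot silently enable recording.
    raise ValueError(f"{name} must be a boolean value, got {raw!r}")


print(parse_bool_env({}, "PHOENIX_TEST_TRACKING", True))                                # True
print(parse_bool_env({"PHOENIX_TEST_TRACKING": "off"}, "PHOENIX_TEST_TRACKING", True))  # False

try:
    parse_bool_env({"PHOENIX_TEST_TRACKING": "flase"}, "PHOENIX_TEST_TRACKING", True)
except ValueError as exc:
    print(exc)
```

Rejecting unknown values is the important design choice: a lenient parser that treated every unrecognized string as truthy would turn the flase typo into an enabled run.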
| Variable | Default | Description |
|---|---|---|
| PHOENIX_TEST_TRACKING | true | Master switch for recording. When set to a falsy value, the suite runs offline: the tests execute normally but nothing is sent to Phoenix. |
| PHOENIX_TEST_REPETITIONS | 1 | Default number of repetitions for each marked test. The value must be an integer greater than or equal to 1; a malformed value raises an error so that CI never runs misconfigured. |
| PHOENIX_TEST_DATASET | (file path) | Names the dataset for every collected test, taking precedence over both phoenix_dataset and the marker’s dataset=. When unset, each test falls back to its marker dataset= or, failing that, its file path. Combine it with pytest’s own selection (-m, -k, or a path) to turn a subset of the suite into a named dataset per run, for example PHOENIX_TEST_DATASET=smoke pytest -m smoke. |
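The validation of PHOENIX_TEST_REPETITIONS, together with the per-test override, can be sketched as follows. resolve_repetitions is a hypothetical helper, and applying the same lower bound to the marker value is an assumption.

```python
from typing import Mapping, Optional


def _check_positive(value: int, source: str) -> int:
    if value < 1:
        raise ValueError(f"{source} must be >= 1, got {value}")
    return value


def resolve_repetitions(env: Mapping[str, str], marker_value: Optional[int] = None) -> int:
    # A per-test value on the marker takes precedence over the environment.
    if marker_value is not None:
        return _check_positive(marker_value, "repetitions")
    raw = env.get("PHOENIX_TEST_REPETITIONS", "").strip()
    if raw == "":
        return 1  # documented default
    try:
        value = int(raw)
    except ValueError:
        # Malformed input fails loudly rather than falling back to the default.
        raise ValueError(f"PHOENIX_TEST_REPETITIONS must be an integer, got {raw!r}") from None
    return _check_positive(value, "PHOENIX_TEST_REPETITIONS")


print(resolve_repetitions({}))                                                # 1
print(resolve_repetitions({"PHOENIX_TEST_REPETITIONS": "3"}))                 # 3
print(resolve_repetitions({"PHOENIX_TEST_REPETITIONS": "3"}, marker_value=5)) # 5
```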
The connection to Phoenix is configured with the standard client variables PHOENIX_COLLECTOR_ENDPOINT, PHOENIX_API_KEY, and PHOENIX_CLIENT_HEADERS.
To iterate locally without recording anything to Phoenix, disable tracking:
How the dataset stays in sync
On a full run (you pass a directory, or no path at all), the plugin updates the dataset to match exactly the collected cases, pruning examples for tests that no longer exist. On a partial run (you filter the collection with -k, -m, a specific
file, or a ::node id), it only appends, leaving examples for the unselected tests in
place. This keeps a filtered run such as pytest tests/evals/test_sql.py from deleting the
rest of the dataset, at the cost that renamed or removed cases are pruned only on a full run.
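The two sync modes reduce to a replace versus a union over the set of example ids. The sketch below models only that rule. sync_examples is a hypothetical function, and real examples carry inputs and versions that are omitted here.

```python
from typing import Set


def sync_examples(existing: Set[str], collected: Set[str], full_run: bool) -> Set[str]:
    if full_run:
        # Full run: the dataset is updated to match exactly the collected cases.
        return set(collected)
    # Partial run: append only, leaving unselected examples in place.
    return existing | collected


existing = {"test_sql[select]", "test_sql[join]", "test_chat[greeting]"}
collected = {"test_sql[select]", "test_sql[group_by]"}

after_partial = sync_examples(existing, collected, full_run=False)
after_full = sync_examples(existing, collected, full_run=True)

print(sorted(after_partial))  # all four ids: nothing is pruned
print(sorted(after_full))     # only the two collected ids remain
```

In the partial case test_sql[join] survives even if that test has been deleted from the suite, which is the cost mentioned above: stale examples are pruned only on the next full run.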
Because a full run updates the dataset to match its collected cases, two runs writing the same
dataset name at the same time — for example, parallel CI jobs that don’t set
PHOENIX_TEST_DATASET — can prune each other’s examples and pin their experiments to different
dataset versions. This is safe for a single runner (pytest -n is one run across many workers,
which the plugin coordinates). For genuinely concurrent runs, give each its own dataset name, for
example PHOENIX_TEST_DATASET=evals-${GIT_BRANCH}.
Running in parallel (pytest-xdist)
The plugin supports pytest -n. The controller process creates the dataset and experiment
once and hands their ids to the workers, which record runs in parallel, so exactly one
experiment is created regardless of the worker count. Enabling recording under xdist costs
one extra collection pass on the controller; set PHOENIX_TEST_TRACKING=false to skip it.
Gating CI
The exit code of the pytest run is your CI gate, and no additional configuration is required to use it. A failed assertion records pass=False for that run and fails the
pytest item, exactly like a normal test, so the job fails. Uploads to Phoenix are
best-effort and never fail a test on their own; a network problem is reported as a
warning rather than failing the build.
The following GitHub Actions workflow installs the plugin, runs an eval suite, and gates
the job on the pytest exit code.

