@arizeai/phoenix-client/vitest ships a Vitest entrypoint plus an optional
reporter that prints a Phoenix-flavored summary at the end of the run.
Setup
Create a separate phoenix.vitest.config.ts so eval files don’t get
swept into your normal unit-test config:
import { defineConfig } from "vitest/config";
export default defineConfig({
test: {
include: ["**/*.eval.?(c|m)[jt]s"],
reporters: ["default", "@arizeai/phoenix-client/vitest/reporter"],
setupFiles: ["dotenv/config"],
testTimeout: 30000,
},
});
include keeps eval suites separate from unit tests by matching the
*.eval.ts convention.
reporters keeps Vitest’s default diagnostics and enables the Phoenix
summary block.
setupFiles: ["dotenv/config"] loads PHOENIX_HOST, PHOENIX_API_KEY,
and any other env vars from .env.
testTimeout is bumped because LLM calls can be slow.
The jsdom test environment is not supported. Either omit environment
or set it to "node".
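To sanity-check which files the include pattern picks up, the glob can be approximated with a regular expression. This is only a sketch: `isEvalFile` is a hypothetical helper, not part of Vitest or the Phoenix client, and it mirrors only the file-name part of `**/*.eval.?(c|m)[jt]s`.

```typescript
// Hypothetical helper (not part of Vitest or @arizeai/phoenix-client).
// Approximates the file-name part of the glob "**/*.eval.?(c|m)[jt]s":
//   ?(c|m)  -> zero or one of "c" or "m"
//   [jt]s   -> "js" or "ts"
const EVAL_FILE = /[^/]+\.eval\.[cm]?[jt]s$/;

function isEvalFile(path: string): boolean {
  return EVAL_FILE.test(path);
}

// Example file names and whether the approximation accepts them.
const candidates = [
  "src/sql.eval.ts", // accepted
  "src/sql.eval.mjs", // accepted
  "src/sql.test.ts", // rejected: no ".eval." segment
  "src/sql.eval.tsx", // rejected: extension is not [cm]?[jt]s
];
const matched = candidates.filter(isEvalFile);
```

Under this approximation, only the first two entries end up in `matched`, so ordinary `*.test.ts` files stay in the unit-test config.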
Add a script to package.json:
{
"scripts": {
"eval": "vitest run --config phoenix.vitest.config.ts"
}
}
The script intentionally uses vitest run rather than watch mode: many
evaluators make long-running LLM calls, and watch mode would repeat them on
every file change.
API
import * as px from "@arizeai/phoenix-client/vitest";
describe(name, fn, config?)
Declares a Phoenix test suite. The suite name is the dataset and
experiment name on the Phoenix server. describe.only and
describe.skip work like Vitest’s variants.
px.describe("my suite", () => { ... }, {
datasetName: "override-dataset-and-experiment-name",
description: "what this suite is for",
metadata: { model: "gpt-4o-mini" },
client: myCustomPhoenixClient, // overrides createClient()
repetitions: 3, // run each test in this suite 3x
acceptanceCriteria: [
{ annotationName: "token_f1", metric: "average", threshold: 0.8 },
{
annotationName: "token_f1",
metric: "passRate",
passFn: (a) => typeof a.score === "number" && a.score >= 0.5,
minPassRate: 0.9,
},
],
dryRun: true, // run this suite locally; sync nothing
});
| Field | Type | Description |
|---|---|---|
| datasetName | string | Override the dataset / experiment name (defaults to the suite name). |
| description | string | Description recorded on the dataset and experiment. |
| metadata | Record<string, unknown> | Suite-level metadata applied to the experiment. |
| client | PhoenixClient | Pre-configured @arizeai/phoenix-client instance. |
| repetitions | number | Run each test in the suite this many times (default 1; PHOENIX_TEST_REPETITIONS overrides the default). Per-test repetitions wins. |
| acceptanceCriteria | AcceptanceCriterion[] | Aggregate annotation thresholds that fail the suite after all tests run. |
| dryRun | boolean | Run the whole suite locally: no dataset, experiment, runs, or annotations are created in Phoenix. Same effect as PHOENIX_TEST_TRACKING=false, scoped to this suite. |
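This document describes three switches that turn syncing off: PHOENIX_TEST_TRACKING=false, the suite-level dryRun, and the per-test dryRun. The sketch below shows one way to read how they combine. `shouldSync` is a hypothetical helper, not the library's implementation, and it assumes that only the literal value "false" disables tracking and that a per-test `dryRun: false` does not re-enable syncing inside a dry-run suite.

```typescript
// Hypothetical model of the dry-run switches; not the library's code.
interface DryRunOption {
  dryRun?: boolean;
}

function shouldSync(
  trackingEnv: string | undefined, // value of PHOENIX_TEST_TRACKING, if set
  suite: DryRunOption,
  test: DryRunOption = {},
): boolean {
  // Assumption: only the literal string "false" turns tracking off globally.
  if (trackingEnv === "false") return false;
  // A dry-run suite syncs nothing, regardless of its tests.
  if (suite.dryRun === true) return false;
  // A dry-run test stays local even when the suite syncs.
  if (test.dryRun === true) return false;
  return true;
}
```

Any one of the three switches is enough to keep a case local; syncing happens only when none of them is set.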
test(name, params, fn, timeout?)
Declares a single Phoenix test case. The params object carries the
Phoenix Example fields. test.only, test.skip, and test.each
mirror Vitest semantics. it is a re-export of test.
px.test(
"a case",
{
input: { userQuery: "..." },
expected: { sql: "..." },
metadata: { hard: true },
id: "stable-example-id",
},
async ({ input, expected, metadata }) => {
const sql = await myApp(input.userQuery);
px.logOutput({ sql });
expect(sql).toEqual(expected?.sql);
},
);
| Param field | Maps to |
|---|---|
| input | Example.input |
| expected | Example.output (reference) |
| metadata | Example.metadata |
| id | Example.id (stable upsert id) |
| repetitions | Number of runs against this example (overrides the suite value). Reported as "<name> [rep i/N]". |
| dryRun | When true, this case runs as an ordinary local test (no dataset example, no run, nothing uploaded), even if the suite syncs. |
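The precedence for repetitions (per-test value, then suite value, then PHOENIX_TEST_REPETITIONS, then 1) and the "<name> [rep i/N]" naming can be sketched as follows. Both functions are hypothetical illustrations rather than the library's code. The sketch assumes that i is 1-based, that a single run keeps its plain name, and that an invalid environment value falls back to 1; none of these details is stated in this document.

```typescript
// Hypothetical model of repetition handling; not the library's code.
interface RepetitionOption {
  repetitions?: number;
}

function resolveRepetitions(
  test: RepetitionOption,
  suite: RepetitionOption,
  envValue?: string, // value of PHOENIX_TEST_REPETITIONS, if set
): number {
  const fromEnv = envValue !== undefined ? Number.parseInt(envValue, 10) : NaN;
  // Assumption: a missing or invalid env value falls back to the default of 1.
  const fallback = Number.isInteger(fromEnv) && fromEnv > 0 ? fromEnv : 1;
  // Per-test wins over suite, which wins over the env-adjusted default.
  return test.repetitions ?? suite.repetitions ?? fallback;
}

function repetitionNames(name: string, n: number): string[] {
  // Assumption: a single run is reported under its plain name.
  if (n <= 1) return [name];
  // Assumption: the repetition index is 1-based.
  return Array.from({ length: n }, (_, i) => `${name} [rep ${i + 1}/${n}]`);
}

const names = repetitionNames("a case", resolveRepetitions({}, { repetitions: 3 }));
// names: ["a case [rep 1/3]", "a case [rep 2/3]", "a case [rep 3/3]"]
```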
test.each(table)(name, fn, timeout?)
Run the same test body across many examples.
const DATASET = [
{ input: { userQuery: "whats up" }, expected: { sql: "n/a" } },
{ input: { userQuery: "how are you?" }, expected: { sql: "n/a" } },
];
px.describe("offtopic inputs", () => {
px.test.each(DATASET)("offtopic input", async ({ input, expected }) => {
const sql = await myApp(input.userQuery);
px.logOutput({ sql });
});
});
The name template supports %i, %s, and %j for parity with Vitest’s
test.each. Without a placeholder the row index is appended.
Logging
px.logOutput(value) records the actual output for the run.
px.logAnnotation({ name, score, ... }) records an annotation.
px.evaluate(evaluator, params?) runs an evaluator object and records its
result as an annotation linked to the evaluator trace. Evaluators can come
from @arizeai/phoenix-evals.createEvaluator(), asExperimentEvaluator(),
or any plain { name, evaluate } object.
See CI Eval Test Annotations for the full annotation shape.
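As an illustration of the plain { name, evaluate } form, here is a sketch of a token-level F1 evaluator. The parameter and result shapes (`EvalParams`, `EvalResult`) are assumptions made for this example, not the library's types; check the evaluator types in @arizeai/phoenix-evals before relying on them. The F1 computation itself is standard: it counts overlapping tokens with multiplicity, then combines precision and recall.

```typescript
// Assumed shapes for illustration only; the real evaluator types may differ.
interface EvalParams {
  output: string;
  expected: string;
}
interface EvalResult {
  score: number;
}

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Token-level F1: harmonic mean of precision and recall over shared tokens.
function tokenF1(predicted: string, reference: string): number {
  const pred = tokenize(predicted);
  const ref = tokenize(reference);
  if (pred.length === 0 || ref.length === 0) {
    // Two empty strings agree perfectly; otherwise there is no overlap.
    return pred.length === ref.length ? 1 : 0;
  }
  // Count reference tokens so repeated tokens are matched at most once each.
  const remaining = new Map<string, number>();
  for (const t of ref) remaining.set(t, (remaining.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of pred) {
    const left = remaining.get(t) ?? 0;
    if (left > 0) {
      overlap += 1;
      remaining.set(t, left - 1);
    }
  }
  if (overlap === 0) return 0;
  const precision = overlap / pred.length;
  const recall = overlap / ref.length;
  return (2 * precision * recall) / (precision + recall);
}

// A plain { name, evaluate } object, the shape px.evaluate is described as accepting.
const tokenF1Evaluator = {
  name: "token_f1",
  evaluate: ({ output, expected }: EvalParams): EvalResult => ({
    score: tokenF1(output, expected),
  }),
};
```

Inside a test body, an object like this would be passed to px.evaluate so its score is recorded as a token_f1 annotation; how the output and expected values reach the evaluator depends on the params the library supplies.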
Acceptance Criteria
Use acceptanceCriteria to gate the suite on aggregate annotation scores in
CI. Criteria run after the suite finishes, so all cases still execute and the
reporter shows the full scorecard before failing. Each criterion aggregates one
annotation (by annotationName) with one metric:
metric: "average" gates on overall quality: the mean score across all runs
must clear threshold, meaning at or above it by default and at or below it
when direction is "minimize".
metric: "passRate" gates on consistency: each run passes when its passFn
predicate returns true, and the suite passes when the fraction of passing
runs is at least minPassRate (e.g. minPassRate: 0.9 means 90% of runs must pass).
passFn receives the run’s annotation and returns a boolean, so it can express
any pass rule — a score bar, a range, a label match, a metadata check, etc.
px.describe("text-to-sql scorecard", () => {
// tests log token_f1 (0–1), valid_sql (boolean), and latency_ms annotations
}, {
acceptanceCriteria: [
// overall quality: the mean token_f1 must be >= 0.8
{ annotationName: "token_f1", metric: "average", threshold: 0.8 },
// consistency: at least 90% of runs must score >= 0.7 on token_f1
{
annotationName: "token_f1",
metric: "passRate",
passFn: (a) => typeof a.score === "number" && a.score >= 0.7,
minPassRate: 0.9,
},
// hard floor: every run must produce valid SQL (boolean must be true)
{
annotationName: "valid_sql",
metric: "passRate",
passFn: (a) => a.score === true,
minPassRate: 1,
},
// budget: lower is better, so the mean latency must stay <= 800ms
{
annotationName: "latency_ms",
metric: "average",
threshold: 800,
direction: "minimize",
},
],
});
| Field | Description |
|---|---|
| annotationName | Annotation to aggregate. If a run logs the same annotation more than once, the last one counts. |
| metric | "average" checks the mean of all numeric/boolean scores against threshold (tolerates weak runs if the mean holds); "passRate" counts runs whose passFn returns true and requires that fraction to reach minPassRate. |
| threshold | average only. The bar the mean must clear, in the direction set by direction. |
| direction | average only. "maximize" (default; higher mean is better, clears threshold when >=) or "minimize" (lower is better, clears when <=). Use "minimize" for cost, latency, or error rates. |
| passFn | passRate only. (annotation) => boolean predicate deciding whether a single run passes, given its last annotation for annotationName (score, label, explanation, metadata, …). |
| minPassRate | passRate only. Minimum fraction of runs (0–1) that must pass for the suite to pass (1 = all). The suite passes when passRate >= minPassRate. |
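Because passFn is an ordinary predicate, common pass rules can be written as small reusable factories. The sketch below shows a score bar, a range, a label match, and a metadata check. The `Annotation` shape is an assumption based on the fields named in this section (score, label, explanation, metadata), and the factory names are hypothetical.

```typescript
// Assumed annotation shape, based on the fields named in this section.
interface Annotation {
  score?: number | boolean | null;
  label?: string | null;
  explanation?: string | null;
  metadata?: Record<string, unknown>;
}
type PassFn = (a: Annotation) => boolean;

// Score bar: numeric score at or above a minimum.
const atLeast = (bar: number): PassFn => (a) =>
  typeof a.score === "number" && a.score >= bar;

// Range: numeric score within [lo, hi], inclusive.
const inRange = (lo: number, hi: number): PassFn => (a) =>
  typeof a.score === "number" && a.score >= lo && a.score <= hi;

// Label match: annotation label equals the expected label.
const hasLabel = (label: string): PassFn => (a) => a.label === label;

// Metadata check: a given metadata key is set to true.
const metadataFlag = (key: string): PassFn => (a) =>
  a.metadata !== undefined && a.metadata[key] === true;

// Example: a criterion object using one of the factories.
const criterion = {
  annotationName: "token_f1",
  metric: "passRate" as const,
  passFn: atLeast(0.7),
  minPassRate: 0.9,
};
```

Guarding on `typeof a.score === "number"` keeps a missing or boolean score from passing a numeric rule by accident.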
Edge cases. An average criterion with no numeric/boolean scores — or a
passRate criterion whose annotation was never logged — fails rather than
passing vacuously. Skipped tests are excluded from the aggregate; dry-run tests
are included because they still execute locally.
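The aggregation rules in this section can be modeled in a few lines. The sketch below is a reading of the documented behavior, not the code in src/testing/acceptance.ts. `checkCriterion` and its types are hypothetical, and two details are assumptions: boolean scores count as 1 and 0 in the mean, and runs that never logged the annotation are left out of the denominator.

```typescript
// Hypothetical model of acceptance-criteria evaluation; not the library's code.
interface Annotation {
  score?: number | boolean | null;
  label?: string | null;
}

// One entry per executed run: every annotation that run logged under the
// criterion's annotationName, in log order. Skipped tests are not listed.
type RunAnnotations = Annotation[];

type Criterion =
  | { metric: "average"; threshold: number; direction?: "maximize" | "minimize" }
  | { metric: "passRate"; passFn: (a: Annotation) => boolean; minPassRate: number };

interface Outcome {
  value: number | null; // mean or pass rate; null when nothing could be aggregated
  passed: boolean;
}

function checkCriterion(criterion: Criterion, runs: RunAnnotations[]): Outcome {
  // If a run logged the annotation more than once, the last one counts.
  // Assumption: runs that never logged it are dropped from the aggregate.
  const last = runs.filter((r) => r.length > 0).map((r) => r[r.length - 1]);

  if (criterion.metric === "average") {
    const scores: number[] = [];
    for (const a of last) {
      if (typeof a.score === "number") scores.push(a.score);
      // Assumption: booleans contribute 1 (true) or 0 (false) to the mean.
      else if (typeof a.score === "boolean") scores.push(a.score ? 1 : 0);
    }
    // No usable scores: fail rather than pass vacuously.
    if (scores.length === 0) return { value: null, passed: false };
    const mean = scores.reduce((sum, x) => sum + x, 0) / scores.length;
    const passed =
      criterion.direction === "minimize"
        ? mean <= criterion.threshold
        : mean >= criterion.threshold;
    return { value: mean, passed };
  }

  // passRate: an annotation that was never logged fails the criterion.
  if (last.length === 0) return { value: null, passed: false };
  const rate = last.filter(criterion.passFn).length / last.length;
  return { value: rate, passed: rate >= criterion.minPassRate };
}
```

For example, three runs whose last scores are 1, 0.5, and nothing logged give a mean of 0.75 under these assumptions, which clears a threshold of 0.75 when maximizing but not a threshold of 0.5 when minimizing.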
In the reporter’s Acceptance Criteria block the reported value is the mean
for average and the fraction of runs that passed for passRate (so a
fully-passing passRate criterion reads 1.000).
Reporter Output
When @arizeai/phoenix-client/vitest/reporter is loaded, the runner prints
a per-suite block at the end of the run with pass/fail counts,
annotation aggregates, acceptance criteria, and links to the Phoenix dataset and experiment.
The default Vitest reporter still runs alongside it.
Source Map
src/vitest/index.ts
src/vitest/reporter.ts
src/testing/runner.ts
src/testing/acceptance.ts