Code evaluators let you author a custom evaluation function in Python or TypeScript and attach it directly to a dataset. Phoenix stores the source, executes it server-side in a sandbox, and records labels and scores as annotations on each experiment run, the same way LLM evaluators do, but with deterministic code instead of a judge model. Reach for a code evaluator when you want full control over how the evaluation is built:
- Call third-party APIs or evaluation services, pull in your own libraries, or mix multiple LLMs inside a single judge.
- Compose custom logic — parsers, validators, scoring formulas, structural diffs — alongside any model calls you need.
- Craft the eval exactly the way you want it, without being constrained to a single built-in judge template.
- Run a deterministic, code-only path next to your LLM judges — repeatable scores with no per-call model cost.
This page covers code evaluators authored in the Phoenix UI and run by Phoenix’s sandbox backends. If you’d rather write evaluators locally and report scores with the arize-phoenix-evals SDK, see the client-side code evaluators guide.
How It Works
From authoring to results, you’ll work through five steps:
- Create the evaluator: From a dataset’s Evaluators tab, choose Add evaluator → Create new code evaluator. Pick a language (Python or TypeScript) and a sandbox configuration.
- Write evaluate(...): The editor opens pre-populated with an evaluate(...) function. Its parameters become the evaluator’s inputs.
- Configure the annotation: Pick an optimization direction (maximize vs. minimize) and, optionally, a threshold for pass/fail coloring.
- Map inputs: Bind each parameter to a path on the evaluation parameters (input, output, reference, metadata) or to a literal value.
- Test, save, run: Dry-run the evaluator against a dataset example in the test panel, save it, then run an experiment. Scores land on every new run automatically.
Authoring an Evaluator
Function signature
The function name must be evaluate. Each parameter becomes a row in the input-mapping panel, where you bind it to a path on the evaluation parameters or to a literal value.
output, reference, input, and metadata mirror the four evaluation parameters, but the names aren’t required. Rename them, drop the ones you don’t use, or add new ones — the signature is the source of truth for what shows up in the input-mapping panel. When a parameter shares a name with one of the four evaluation parameters, Phoenix auto-binds it; otherwise you bind it explicitly in the panel to a path on the evaluation parameters or to a literal value.
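As a minimal sketch of the signature rules above, in Python: the function body and the extra parameter case_sensitive are illustrative assumptions rather than Phoenix's own template, and the sketch assumes a plain dict is an acceptable return object.

```python
def evaluate(output, reference, case_sensitive):
    """Exact-match check between the task output and the reference.

    `output` and `reference` share names with evaluation parameters, so
    they would be auto-bound. `case_sensitive` is a hypothetical extra
    parameter that you would bind to a literal value in the mapping panel.
    """
    left, right = str(output).strip(), str(reference).strip()
    if not case_sensitive:
        left, right = left.lower(), right.lower()
    matched = left == right
    return {
        "label": "match" if matched else "mismatch",
        "score": 1.0 if matched else 0.0,
    }
```

Dropping case_sensitive from the signature would remove its row from the input-mapping panel, since the signature decides which inputs exist.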
Return shape
The function returns an object with three optional fields:
| Field | Type | Description |
|---|---|---|
| label | string | The category (e.g. "correct", "fail"). Required for categorical evaluators. |
| score | number | A numeric score. Required for continuous evaluators. Categorical evaluators can omit it; Phoenix fills it in from the configured label-to-score mapping. |
| explanation | string | Free-form text shown alongside the score. Useful for debugging surprising results. |
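The two return styles in the table can be sketched side by side. The function names below are hypothetical and exist only so that both variants fit in one block; in Phoenix each would be named evaluate in its own evaluator. The label values and the max_chars parameter are likewise assumptions for illustration.

```python
def evaluate_categorical(output):
    # Categorical style: return only a label. Per the table above, the
    # score is then filled in from the configured label-to-score mapping.
    text = "" if output is None else str(output)
    return {"label": "non_empty" if text.strip() else "empty"}


def evaluate_continuous(output, max_chars):
    # Continuous style: return a numeric score, plus an explanation that
    # makes a surprising score easier to debug.
    text = "" if output is None else str(output)
    limit = max(int(max_chars), 1)  # avoid dividing by zero
    score = min(len(text) / limit, 1.0)
    return {
        "score": score,
        "explanation": f"{len(text)} of {limit} characters used",
    }
```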
Annotation configuration
An evaluator’s annotation config is descriptive: it tells Phoenix how to interpret whatever your function returns, not what shape it must produce.
- Optimization direction: maximize or minimize, used to render trends correctly in experiment comparisons.
- Lower / upper bound (optional): The expected numeric range for scores. Used to normalize values for visualization.
- Threshold (optional): A numeric cutoff that splits scores into pass/fail for threshold-pivoted coloring in result views. Leave it unset if pass/fail isn’t meaningful for your metric.
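To make the three settings concrete, here is one way bounds, threshold, and direction could be applied to a score. This is an illustration only: the helper names are hypothetical, and the min-max scaling and the at-or-above / at-or-below comparison are assumptions, not Phoenix's documented behavior.

```python
def normalize(score, lower, upper):
    # Min-max scaling into [0, 1], clamped at both ends.
    # Assumes upper > lower.
    fraction = (score - lower) / (upper - lower)
    return max(0.0, min(1.0, fraction))


def passes(score, threshold, direction):
    # Assumption: under "maximize" a score at or above the threshold
    # passes; under "minimize" a score at or below it passes.
    if direction == "maximize":
        return score >= threshold
    if direction == "minimize":
        return score <= threshold
    raise ValueError(f"unknown direction: {direction!r}")
```

Under these assumptions, a score of 0.8 with a threshold of 0.7 passes for a maximize metric such as accuracy and fails for a minimize metric such as error rate.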
Sandbox Backends
Code evaluators always run inside a sandbox. When you create one, you pick from the sandbox configurations an administrator has provisioned under Settings → Sandboxes; Phoenix filters the list to configurations that match the language you chose. The available backends are:
| Language | Backends |
|---|---|
| Python | WebAssembly (local), E2B, Daytona, Vercel Sandbox, Modal |
| TypeScript | Deno (local), Daytona, Vercel Sandbox |
Versioning
Every time you save an evaluator’s source, Phoenix creates a new evaluator version that executes for subsequent experiment runs. Older versions are retained so you can audit which code produced any historical score. The evaluator’s name, description, annotation configuration, input mapping, and sandbox binding live on the evaluator itself rather than on a version; editing those updates the evaluator in place without creating a new version.
Testing Before You Save
The editor includes a Test panel that runs the current draft against a chosen dataset example. It shows the inputs Phoenix will pass to evaluate(...) after input mapping, the raw return value, and the parsed label/score/explanation. Use it to catch errors before saving: for example, to confirm that a path mapping resolves to a string rather than None, or that your function handles missing fields gracefully.
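The kind of defensive handling worth checking in the Test panel might look like the sketch below. The label names, the zero score on error, and the sample cases are assumptions for illustration; the cases list only mimics mapped inputs and is not a Phoenix API.

```python
def evaluate(output, reference):
    # Guard against a mapping that resolved to None instead of a value.
    if output is None or reference is None:
        missing = "output" if output is None else "reference"
        return {
            "label": "error",
            "score": 0.0,
            "explanation": f"{missing} resolved to None; check the input mapping",
        }
    matched = str(output).strip() == str(reference).strip()
    return {"label": "pass" if matched else "fail", "score": float(matched)}


# Hypothetical mapped inputs, similar to what the Test panel would display.
cases = [
    {"output": "42", "reference": "42"},
    {"output": None, "reference": "42"},
]
results = [evaluate(**case) for case in cases]
```

Returning an explicit error label with an explanation, rather than letting the function raise, keeps a bad mapping visible in the results.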
Examples
Copy-paste starting points for common evaluator patterns. Each page spells out the exact sandbox configuration (backend, dependencies, internet access, environment variables) to provision under Settings → Sandboxes before you save.
- JSON Distance: Count differing fields between output and a golden-dataset reference. Local sandbox.
- Regex Match: Pass when output matches a regular expression. Local sandbox.
- Embedding Distance: Cosine similarity over OpenAI embeddings. Needs openai, network, and OPENAI_API_KEY.
- scikit-learn: Token-overlap similarity via HashingVectorizer and cosine. Offline.
- Pairwise Evaluator: Blind LLM judge comparing output vs. reference head-to-head, with randomized order.
- Composite Evaluator: Blend sub-scores (LLM + code rules) into one weighted average with per-axis breakdown.
- LLM Jury: Poll multiple LLMs (OpenAI, Anthropic, Google) and combine verdicts with a weighted average.
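To give a flavor of the simplest pattern listed above, here is a rough standard-library sketch of a JSON-distance evaluator. It is not the code from the linked example page: the _flatten helper, the dotted-path naming, and the assumption that module-level helpers are allowed alongside evaluate are all illustrative.

```python
import json


def _flatten(value, prefix=""):
    # Map every leaf to a dotted path, e.g. {"a": {"b": 1}} -> {"a.b": 1}.
    # Empty containers are kept as leaves so {} and [] still compare.
    if isinstance(value, dict) and value:
        items = value.items()
    elif isinstance(value, list) and value:
        items = enumerate(value)
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(_flatten(child, path))
    return flat


def evaluate(output, reference):
    # Accept parsed objects or JSON strings; malformed JSON raises here.
    left = json.loads(output) if isinstance(output, str) else output
    right = json.loads(reference) if isinstance(reference, str) else reference
    flat_left, flat_right = _flatten(left), _flatten(right)
    missing = object()  # sentinel: a path present on one side only
    differing = sorted(
        path
        for path in flat_left.keys() | flat_right.keys()
        if flat_left.get(path, missing) != flat_right.get(path, missing)
    )
    return {
        "score": len(differing),
        "explanation": "differing fields: " + (", ".join(differing) or "none"),
    }
```

Because a lower count means a closer match, an evaluator like this would pair with the minimize optimization direction.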

