Built-in Evaluators

Built-in evaluators run server-side and are always available to add to any dataset. Configure them in the Phoenix UI and they label and score each output automatically when an experiment runs.

Available Evaluators

Evaluator	Output Type	What It Checks	Configuration Options
`contains`	Categorical (`true` / `false`)	Whether a text contains one or more words	Case sensitivity, require all words
`exact_match`	Categorical (`true` / `false`)	Whether two strings match exactly	Case sensitivity
`regex`	Categorical (`true` / `false`)	Whether a text matches a regular expression	Full match vs. partial match
`levenshtein_distance`	Continuous (≥ 0)	Edit distance between two strings	Case sensitivity
`json_distance`	Continuous (≥ 0)	Number of structural differences between two JSON values	Parse strings as JSON

Configuring Inputs

Each evaluator parameter can be set to either a path (a JSONPath expression that extracts a value from the evaluation parameters) or a literal (a fixed value typed directly). Use paths to pull from dataset inputs, task outputs, reference data, or metadata. Use literals for static configuration like regex patterns or word lists. See Input Mapping for full details on mapping modes, resolution order, and examples.
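The difference between the two modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `resolve_param` and `resolve_path` helpers and the `{"mode": ..., "value": ...}` mapping shape are hypothetical, and the resolver handles only dotted keys and integer indices rather than full JSONPath. The actual resolution rules are described in Input Mapping.

```python
import re
from typing import Any

# Hypothetical tokenizer: splits "output.tool_calls[0].arguments" into keys and indices.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve_path(context: dict, path: str) -> Any:
    """Walk a dotted path (with optional [n] indices) through nested data."""
    value: Any = context
    for key, index in _TOKEN.findall(path):
        value = value[int(index)] if index else value[key]
    return value


def resolve_param(context: dict, mapping: dict) -> Any:
    """Return the literal as-is, or extract the value a path points to."""
    if mapping["mode"] == "literal":
        return mapping["value"]
    return resolve_path(context, mapping["value"])


# Assumed shape of the evaluation parameters for one dataset example.
context = {
    "output": {"response": "Sorry, I cannot help with that."},
    "reference": {"label": "refusal"},
}

text = resolve_param(context, {"mode": "path", "value": "output.response"})
words = resolve_param(context, {"mode": "literal", "value": "sorry, cannot, decline"})
```

Here `text` is pulled from the task output for each example, while `words` stays the same for every example.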

contains

Checks whether a text contains one or more specified words. Returns true if the match condition is met, false otherwise.

Parameters

Parameter	Type	Required	Default	Description
`words`	string	Yes	—	Comma-separated list of words to search for (e.g., `"yes, no, maybe"`)
`text`	string	Yes	—	The text to search
`case_sensitive`	boolean	No	`false`	Whether the search is case-sensitive
`require_all`	boolean	No	`false`	If `true`, all words must be present; if `false`, any single word is sufficient

Output

Property	Value	Description
`label`	`true` or `false`	Whether the match condition was satisfied
`score`	`1.0` or `0.0`	Numeric score (`1.0` = match, `0.0` = no match)
Optimization	Maximize	Higher scores are better

Usage Examples

Compliance disclaimer enforcement: A support chatbot must include a required legal phrase in every response. Text receives the model’s full output from the experiment, typically `output` or a nested path like `output.response`. Words is the required phrase, set as a literal value. If the required disclaimer varies per dataset example, map Words to a reference field instead.

Required keyword presence: A pipeline that verifies responses include at least one of several acceptable phrases. Text receives the model’s response; Words is a comma-separated list of all acceptable phrases. Enable Require all if every phrase must appear rather than any one of them.

Refusal detection: Testing whether an agent correctly declines out-of-scope requests. Text is the model’s response; Words is a set of terms associated with refusal (e.g., `"sorry, cannot, decline"`). Disable Case sensitive to catch varied capitalizations across responses.

Notes

If words is empty or contains only whitespace after splitting on commas, the evaluator returns false rather than true. This guards against the empty-list edge case where “all of nothing” would otherwise trivially pass.
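The behavior described above can be approximated locally with a short Python sketch. It is not the server-side implementation: the function name `contains_sketch` is hypothetical, and plain substring matching is an assumption, since this page does not say whether matches must fall on word boundaries.

```python
def contains_sketch(
    text: str,
    words: str,
    case_sensitive: bool = False,
    require_all: bool = False,
) -> dict:
    """Approximate the contains evaluator (substring matching is assumed)."""
    # Split on commas and drop entries that are empty after trimming.
    terms = [w.strip() for w in words.split(",") if w.strip()]
    if not terms:
        # Empty word list: return false instead of letting all([]) pass trivially.
        return {"label": "false", "score": 0.0}
    if not case_sensitive:
        text = text.lower()
        terms = [t.lower() for t in terms]
    hits = [t in text for t in terms]
    matched = all(hits) if require_all else any(hits)
    return {"label": "true" if matched else "false", "score": 1.0 if matched else 0.0}


refusal = contains_sketch("I'm sorry, I cannot help with that.", "sorry, cannot, decline")
strict = contains_sketch("yes and no", "yes, maybe", require_all=True)
empty = contains_sketch("anything", " , ", require_all=True)
```

In this sketch `refusal` passes because any one term is enough, `strict` fails because "maybe" is missing, and `empty` fails because no usable words remain after splitting.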

exact_match

Checks whether two strings are identical. Returns true if they match, false otherwise.

Parameters

Parameter	Type	Required	Default	Description
`expected`	string	Yes	—	The reference value to match against
`actual`	string	Yes	—	The value to evaluate
`case_sensitive`	boolean	No	`true`	Whether the comparison is case-sensitive

Output

Property	Value	Description
`label`	`true` or `false`	Whether the strings are identical
`score`	`1.0` or `0.0`	Numeric score (`1.0` = match, `0.0` = no match)
Optimization	Maximize	Higher scores are better

Usage Examples

Classification label validation: A model that must output exactly one of a fixed set of labels (e.g., `"positive"`, `"negative"`, `"neutral"`), where any deviation indicates a problem. Actual is the model’s output: use `output` for a plain string response, or `output.label` if the response is a JSON object with a label field. Expected is the ground-truth label stored per-example in your dataset, typically a path like `reference.label` or `reference.expected`.

Templated response checking: A pipeline that should return a fixed string for certain inputs (a canned reply, a status code, or a pass-through value). Actual is the model’s output; Expected can be typed as a literal value if every example uses the same target string, or mapped to a dataset field if the expected value varies per example.

Notes

The comparison is whitespace-sensitive. Leading/trailing spaces and different line endings will cause a mismatch. If your dataset fields may have inconsistent whitespace, consider using contains or regex instead.
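A minimal Python sketch of this comparison shows why stray whitespace matters. The function name `exact_match_sketch` is hypothetical, and lowercasing both sides for the case-insensitive mode is an assumption about how the server folds case.

```python
def exact_match_sketch(expected: str, actual: str, case_sensitive: bool = True) -> dict:
    """Approximate the exact_match evaluator; no whitespace is trimmed."""
    if not case_sensitive:
        # Assumption: case-insensitive mode lowercases both strings.
        expected, actual = expected.lower(), actual.lower()
    matched = expected == actual
    return {"label": "true" if matched else "false", "score": 1.0 if matched else 0.0}


same = exact_match_sketch("positive", "positive")
trailing_newline = exact_match_sketch("positive", "positive\n")
folded = exact_match_sketch("Positive", "positive", case_sensitive=False)
```

`trailing_newline` fails even though the labels look identical when printed, which is the mismatch the note above warns about.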

regex

Checks whether a text matches a regular expression pattern. By default, the pattern is searched anywhere in the text (partial match). Returns true if the pattern matches, false otherwise.

Parameters

Parameter	Type	Required	Default	Description
`pattern`	string	Yes	—	The regular expression pattern to match
`text`	string	Yes	—	The text to search
`full_match`	boolean	No	`false`	If `true`, the pattern must match the entire string; if `false`, a match anywhere in the text is sufficient

Output

Property	Value	Description
`label`	`true` or `false`	Whether the pattern matched
`score`	`1.0` or `0.0`	Numeric score (`1.0` = match, `0.0` = no match)
Optimization	Maximize	Higher scores are better

Usage Examples

Format compliance: A model that must produce output in a specific structural format (a date, phone number, or identifier). Pattern is the regular expression defining the required format, set as a literal value. Text is the model’s output field: use a direct path for a plain string response, or a nested path like `output.date` if the target value is embedded in a JSON object.

Citation or reference checking: A RAG pipeline that must include a URL, citation marker, or other structured element in every response. Pattern matches the expected element (e.g., a URL regex or citation format); Text is the model’s full response. Partial match mode (the default) passes as long as the pattern appears anywhere in the output.

Output type gating: A code assistant whose output should contain a function definition rather than prose. Pattern is anchored to the expected code structure; Text is the response field. If your model returns structured JSON with a code key, map Text to `output.code` rather than the entire response.

Notes

Complex regex patterns can be slow on long inputs. Avoid patterns with nested quantifiers or excessive backtracking (e.g., `(a+)+` or `.*.*`). Prefer anchored patterns and specific character classes over broad wildcards. Test your pattern against representative inputs before deploying to a large dataset.
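One way to test a pattern before running it on a dataset is to try it locally. The sketch below uses Python's standard `re` module and assumes the server follows the same semantics, with `re.search` standing in for partial match and `re.fullmatch` for full match. The function name `regex_sketch` is hypothetical, and the server's regex engine may differ in syntax details.

```python
import re


def regex_sketch(pattern: str, text: str, full_match: bool = False) -> dict:
    """Approximate the regex evaluator using Python's re module."""
    compiled = re.compile(pattern)
    # full_match requires the whole string to match; otherwise search anywhere.
    match = compiled.fullmatch(text) if full_match else compiled.search(text)
    matched = match is not None
    return {"label": "true" if matched else "false", "score": 1.0 if matched else 0.0}


date_pattern = r"\d{4}-\d{2}-\d{2}"

partial = regex_sketch(date_pattern, "The invoice is due 2024-01-31.")
full = regex_sketch(date_pattern, "The invoice is due 2024-01-31.", full_match=True)
exact = regex_sketch(date_pattern, "2024-01-31", full_match=True)
```

The same pattern passes in partial mode and fails in full-match mode when the date is surrounded by other text, so choose the mode based on whether the field should contain only the formatted value.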

levenshtein_distance

Calculates the Levenshtein (edit) distance between two strings — the minimum number of single-character insertions, deletions, or substitutions needed to transform one string into the other. A score of 0 means the strings are identical; higher scores indicate more differences.

Parameters

Parameter	Type	Required	Default	Description
`expected`	string	Yes	—	The reference string
`actual`	string	Yes	—	The string to evaluate
`case_sensitive`	boolean	No	`true`	Whether the comparison is case-sensitive

Output

Property	Value	Description
`score`	Integer ≥ 0	Number of edits required; `0` = identical strings
Optimization	Minimize	Lower scores are better

Usage Examples

Answer closeness: A QA model where small paraphrasing is acceptable but significant divergence is not. Actual receives the model’s text response; Expected receives the reference answer from your dataset, typically a path like `reference.answer`. Comparing average edit distance across experiment runs shows whether prompt changes are moving outputs closer to the reference.

Entity extraction quality: A pipeline that extracts a specific named value (a product name, location, or identifier). Actual is the extracted value from the model’s output, often a nested path like `output.entity` if the response is structured JSON. Expected is the ground-truth value per example in your dataset. Edit distance reveals whether extraction is improving as you iterate on prompts or model configuration.

Comparative prompt evaluation: Two prompt variants tested against the same dataset. Actual receives the response field from each run; Expected stays fixed, pointing to the same reference column. The variant with the lower average Levenshtein score is closer to the reference outputs.

Notes

The algorithm runs in O(n×m) time, where n and m are the lengths of the two strings. Performance degrades quadratically on very long inputs. Keep inputs under a few thousand characters for predictable evaluation times.
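The quadratic cost comes from the standard dynamic-programming formulation, which fills one cell per pair of character positions. The sketch below is a textbook version for reference, not the server's code. The name `levenshtein_sketch` is hypothetical, and lowercasing for the case-insensitive mode is an assumption.

```python
def levenshtein_sketch(expected: str, actual: str, case_sensitive: bool = True) -> int:
    """Textbook edit distance: O(n*m) time, one row of memory."""
    if not case_sensitive:
        # Assumption: case-insensitive mode lowercases both strings.
        expected, actual = expected.lower(), actual.lower()
    # previous[j] = distance between the processed prefix of expected and actual[:j].
    previous = list(range(len(actual) + 1))
    for i, ch_expected in enumerate(expected, start=1):
        current = [i]  # turning expected[:i] into "" takes i deletions
        for j, ch_actual in enumerate(actual, start=1):
            substitution_cost = 0 if ch_expected == ch_actual else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


classic = levenshtein_sketch("kitten", "sitting")
case_only = levenshtein_sketch("Paris", "paris")
case_folded = levenshtein_sketch("Paris", "paris", case_sensitive=False)
```

The inner loop runs once for every pair of characters, so doubling both input lengths roughly quadruples the work.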

json_distance

Compares two JSON structures and returns the number of value differences between them. A score of 0 means the structures are identical; higher scores indicate more differing fields or elements. By default, inputs are assumed to be strings and are parsed as JSON before comparison, so you can pass raw JSON strings from your dataset fields directly.

Parameters

Parameter	Type	Required	Default	Description
`expected`	any	Yes	—	The reference JSON structure (object, array, or scalar)
`actual`	any	Yes	—	The JSON structure to evaluate
`parse_strings`	boolean	No	`true`	If `true`, string inputs are parsed as JSON before comparison; if `false`, inputs are compared as-is

Output

Property	Value	Description
`score`	Integer ≥ 0, or `null` on error	Number of differing values; `0` = identical structures
Optimization	Minimize	Lower scores are better

Input Mapping Examples

Structured output accuracy: A model that extracts or generates a JSON object (invoice fields, entity records, form data). Actual is the model’s JSON output: if your model returns a plain JSON string, map it to `output` and leave Parse strings as JSON enabled. Expected is the ground-truth JSON structure from your dataset, typically stored as a JSON string in a reference column. Each differing field or value counts as one point of distance.

Tool call argument validation: An agent that produces structured tool call arguments. Actual contains the argument object; if it’s nested inside a larger output (e.g., at `output.tool_calls[0].arguments`), use a nested path to isolate it. Expected contains the correct argument values from your dataset. Each mismatched field is counted separately, giving you field-level precision on where the agent diverges.

Prompt change regression tracking: Running the same dataset against two different prompt versions. Actual receives the JSON output from each run; Expected stays fixed, pointing to the reference JSON in your dataset. Comparing average distance across runs reveals whether a prompt change introduced new structural errors.

Notes

If either input cannot be parsed as JSON (when parse_strings is true), the evaluator returns a null score with an error explanation rather than a numeric result. Ensure your dataset fields contain valid JSON strings when using this evaluator with path mappings.

Type comparison behavior:

true and false are treated as booleans, not integers — {"flag": true} vs. {"flag": 1} counts as 1 difference.
Numeric types (int and float) with the same value are treated as equal — {"n": 1} vs. {"n": 1.0} counts as 0 differences.
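The type rules above can be illustrated with a rough Python sketch. The exact counting algorithm is not specified on this page, so several choices here are assumptions: a key present on only one side counts as one difference, arrays are compared position by position with each extra element counting as one, and values of different kinds count as one. The names `json_distance_sketch` and `_count_differences` are hypothetical.

```python
import json
from typing import Any, Optional


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _count_differences(expected: Any, actual: Any) -> int:
    if isinstance(expected, dict) and isinstance(actual, dict):
        total = 0
        for key in expected.keys() | actual.keys():
            if key in expected and key in actual:
                total += _count_differences(expected[key], actual[key])
            else:
                total += 1  # assumption: a missing key is one difference
        return total
    if isinstance(expected, list) and isinstance(actual, list):
        shared = sum(_count_differences(e, a) for e, a in zip(expected, actual))
        return shared + abs(len(expected) - len(actual))  # assumption: extras count 1 each
    if _is_number(expected) and _is_number(actual):
        return 0 if expected == actual else 1  # 1 and 1.0 are equal
    if type(expected) is not type(actual):
        return 1  # e.g. true vs. 1, or object vs. array
    return 0 if expected == actual else 1


def json_distance_sketch(expected: Any, actual: Any, parse_strings: bool = True) -> Optional[int]:
    """Return a difference count, or None when a string input is not valid JSON."""
    if parse_strings:
        try:
            if isinstance(expected, str):
                expected = json.loads(expected)
            if isinstance(actual, str):
                actual = json.loads(actual)
        except json.JSONDecodeError:
            return None
    return _count_differences(expected, actual)


bool_vs_int = json_distance_sketch('{"flag": true}', '{"flag": 1}')
int_vs_float = json_distance_sketch('{"n": 1}', '{"n": 1.0}')
invalid = json_distance_sketch("not json", "{}")
```

Under these assumptions `bool_vs_int` counts one difference, `int_vs_float` counts none, and `invalid` is `None`, mirroring the null score returned for unparseable input.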
