Not every evaluation needs an LLM judge. When your criteria are deterministic and well-defined — checking whether a response contains a keyword, validating JSON structure, or matching a regex pattern — code evaluators are faster, cheaper, and more consistent than LLM-based alternatives. Code evaluators run Python functions directly against your trace data. They’re ideal for objective checks that don’t require interpretation or subjective judgment. You can use Arize AX’s pre-built code evaluators for common patterns or write your own custom logic. This tutorial walks through setting up code evaluators in the Arize AX UI to run on your travel agent traces.
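To make the distinction concrete, here is a minimal sketch of the kinds of deterministic checks a code evaluator performs. The function names are illustrative and are not part of the Arize AX API:

```python
import json
import re


def contains_any_keyword(text: str, keywords: list[str]) -> bool:
    """True if any keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)


def is_valid_json(text: str) -> bool:
    """True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def matches_regex(text: str, pattern: str) -> bool:
    """True if the regex pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None
```

Each of these returns the same answer every time for the same input, which is what makes code evals cheap to run at scale and easy to trust.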
Prerequisite: Before starting, run the companion notebook to generate traces from the travel agent. You’ll need traces in your Arize AX project to evaluate.

Step 1: Create an Evaluation Task

Evaluation tasks define what data to evaluate and which evaluators to run. To create one:
  1. Navigate to Eval Tasks in the upper right-hand corner and select Add Eval Task
  2. Choose Code Evaluator
  3. Give your task a name (e.g., “Tool Input Validation”)
  4. Set the Cadence to Run on historical data so the task evaluates your existing traces
  5. This tutorial uses a code eval to validate tool call inputs, so the evaluator should run only against tool spans. Add a Task Filter to scope the evaluation by setting attributes.openinference.span.kind = TOOL (a quick offline check of this filter is sketched after these steps).
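If you want to sanity-check that filter offline, you can apply the same condition to spans exported as a pandas DataFrame. This is a rough sketch; the file name is hypothetical, and the column name assumes the OpenInference attribute convention used in the filter above:

```python
import pandas as pd

# spans_df: spans exported from your Arize AX project (file name is hypothetical).
spans_df = pd.read_parquet("travel_agent_spans.parquet")

# Keep only tool spans, mirroring the task filter above.
tool_spans = spans_df[spans_df["attributes.openinference.span.kind"] == "TOOL"]
print(f"{len(tool_spans)} tool spans out of {len(spans_df)} total")
```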

Step 2: Choose a Code Evaluator

To add an evaluator to your task, select Add Evaluator → Create New and browse the available pre-built code evaluators. Arize AX offers several managed code evaluators for common checks:
  • Matches Regex: whether text matches a specified regex pattern (parameters: span attribute, pattern)
  • JSON Parseable: whether the output is valid JSON (parameter: span attribute)
  • Contains Any Keyword: whether any of the specified keywords appear (parameters: span attribute, keywords)
  • Contains All Keywords: whether all of the specified keywords appear (parameters: span attribute, keywords)
You can also choose CustomArizeEvaluator to define your own code evaluator with custom logic.
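To give a sense of the kind of custom logic you might write, the sketch below labels a tool input as passing only if it is valid JSON and contains a set of required argument keys. The function shape and the required keys are assumptions, not the exact interface CustomArizeEvaluator expects:

```python
import json

REQUIRED_KEYS = {"destination", "start_date", "end_date"}  # hypothetical tool arguments


def evaluate_tool_input(input_value: str) -> tuple[str, float]:
    """Return (label, score) for one tool input: 'pass' only if it is
    valid JSON and contains every required argument key."""
    try:
        payload = json.loads(input_value)
    except (json.JSONDecodeError, TypeError):
        return "fail", 0.0
    if not isinstance(payload, dict) or REQUIRED_KEYS - payload.keys():
        return "fail", 0.0
    return "pass", 1.0
```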

Configure the Evaluator

  1. For this tutorial, select JSON Parseable — this evaluator checks whether the input to each tool call is valid JSON, ensuring that the agent passes properly formatted arguments to its tools (the underlying check is sketched after these steps).
  2. Give the evaluator an Eval Column Name (e.g. tool_input_json_valid).
  3. Set the scope to Span because the evaluation targets individual tool spans.
  4. Set the span attribute to attributes.input.value (the input passed to the tool call).
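Conceptually, the evaluator applies a JSON-parse check to attributes.input.value on every matched span and writes the result into the eval column. The sketch below reproduces that logic offline; the label names and the toy DataFrame are assumptions, not Arize AX internals:

```python
import json

import pandas as pd


def is_parseable(value: str) -> bool:
    try:
        json.loads(value)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


# A tiny stand-in for the TOOL spans matched by the task filter.
tool_spans = pd.DataFrame(
    {"attributes.input.value": ['{"destination": "Lisbon"}', "not json"]}
)
tool_spans["tool_input_json_valid"] = tool_spans["attributes.input.value"].map(
    lambda v: "valid" if is_parseable(v) else "invalid"
)
print(tool_spans["tool_input_json_valid"].tolist())  # ['valid', 'invalid']
```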

Step 3: Run and View Results

With your code evaluator configured, save the evaluator and run the task. Because code evals run as plain Python functions without model calls, they execute quickly — even on large datasets.
Code eval results
From the results view:
  • Filter by label to find tool spans that failed the JSON check — which tool calls received malformed inputs?
  • Combine with LLM eval results for a complete quality picture. Code evals catch structural issues while LLM evals assess content quality.
  • Use aggregate metrics to track compliance rates over time (computing a pass rate offline is sketched below).
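If you export the eval results, the compliance rate is simply the share of spans labeled valid. A minimal sketch, reusing the column and label names assumed earlier:

```python
import pandas as pd

# results_df stands in for exported eval results; the column and labels
# follow the names used earlier in this tutorial.
results_df = pd.DataFrame({"tool_input_json_valid": ["valid", "valid", "invalid"]})
pass_rate = (results_df["tool_input_json_valid"] == "valid").mean()
print(f"Tool input JSON compliance: {pass_rate:.1%}")  # 66.7%
```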

Combining Code and LLM Evaluators

The most effective evaluation setups use both code and LLM evaluators together. In a single project, you can attach multiple eval tasks of different types. For the travel agent, a practical setup might include:
  • Code eval: “Does the response mention budget-related terms?” (fast, deterministic)
  • Code eval: “Does the response cover all three required sections?” (structural check)
  • LLM eval: “Is the response actionable and helpful?” (subjective quality)
  • LLM eval: “Is the information factually correct?” (content accuracy)
This layered approach gives you both breadth and depth in your evaluation coverage. A sketch of what the two code checks above might look like follows.
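As a rough sketch of what the two code evals above could look like as plain Python checks (the keyword list and section names are assumptions about the travel agent’s output format):

```python
BUDGET_TERMS = ["budget", "cost", "price", "$"]         # assumed keyword list
REQUIRED_SECTIONS = ["flights", "hotels", "itinerary"]  # assumed section headings


def mentions_budget(response: str) -> bool:
    """Fast, deterministic keyword check."""
    lowered = response.lower()
    return any(term.lower() in lowered for term in BUDGET_TERMS)


def covers_required_sections(response: str) -> bool:
    """Structural check: every required section appears in the response."""
    lowered = response.lower()
    return all(section in lowered for section in REQUIRED_SECTIONS)
```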

What’s Next

You’ve now completed the evaluation tutorial series. You know how to run pre-built evals, create custom LLM-as-a-Judge evaluators, and set up code evals — all from the Arize AX UI. To continue building evaluations: