Skip to main content
Evaluators in Arize AX run in one of two places: on the platform (online) or off it (offline). The split is about who manages the orchestration — Arize, or you. Both paths produce the same shape of result and write to the same place. They differ in flexibility, in the work you have to do, and in the failure modes to plan for. This page is about the architectural split. The two pages that follow it cover online LLM-as-a-judge and online code evaluators; the offline evaluation with Phoenix page covers the off-platform path.

The split

OnlineOffline
Where it runsArize AX platformAnywhere you choose — your laptop, your CI, a batch job, a notebook
Who orchestratesArizeYou
Code requiredNone (LLM-as-a-judge) or minimal (code evaluator)All of it — download, classify, upload
Cadence optionsHistorical batch, continuous on new traces, or bothWhatever schedule you build
FlexibilityBounded by what the UI and platform exposeFull — multi-stage pipelines, parallel evaluators, custom dataframes, arbitrary code
What you write back to AXThe platform handles itYou write the upload step
Most teams start online and graduate to offline only when they hit the flexibility ceiling. Online evaluators get you most of the value with none of the orchestration work; offline evaluators are the escape hatch when you need something the platform doesn’t expose.

Online evaluators

Online evaluators live entirely in Arize. You build them in the UI, the platform runs them, and the results flow back to the originating spans automatically. The platform handles:
  • Triggering. As traces arrive (continuous cadence) or on demand (historical batch), Arize fires the evaluator.
  • Execution. The eval prompt is rendered against the trace data, the model is called, and the output is parsed.
  • Joining. The label, score, and explanation are written back as eval.<name>.* attributes on the originating span.
  • Sampling, rate-limiting, retries. Production-volume concerns are managed by the platform.
Two kinds of online evaluators exist, with the same shape and different runtimes:
  • Online LLM-as-a-judge: an LLM scores each span/trace/session. No code required — the evaluator is a prompt + a model + an output schema, all configured in the UI.
  • Online code evaluator: a Python function (or a built-in template like contains-any-keyword) scores each span/trace/session. Code is required, but minimal — the function shape is fixed and the platform runs it for you.
For both, the configuration is split into a reusable evaluator template (the prompt or function, the model, the output schema) and an evaluator task (which project to run against, what filters to apply, what cadence, what sampling rate). See Filters, scope, and cadence for the task-level controls.

Offline evaluators

Offline evaluators don’t run in Arize. You write the orchestration yourself: download the spans you want to evaluate, score them with whatever pipeline you choose, and upload the results back to Arize so they join their source spans. The shape of an offline evaluator is always the same three steps: 3 steps - Download spans from Arize, Run your custom eval pipeline, Upload eval results back to Arize The recommended library for step 2 is phoenix.evals — the open-source Phoenix evaluations library. It handles the LLM call, rate-limiting, structured-output parsing, and concurrency. Using it doesn’t require running a Phoenix server; it’s just a Python library that knows how to call an LLM for classification. When offline is the right choice:
  • Multi-stage pipelines. You want one evaluator’s output to feed another evaluator’s input.
  • Parallel evaluation. You want to run a dozen evaluators against the same span batch concurrently.
  • Custom data shaping. You need to join multiple span attributes, derive new fields, or compute aggregates before scoring.
  • Non-standard models. Your judge isn’t OpenAI, Anthropic, or one of the platform-supported providers.
  • CI integration. You want eval results to land in your build artifacts, not just the AX UI.
  • One-off analyses. A research question that doesn’t justify creating a long-running evaluator task.
When offline is the wrong choice:
  • You just want to score every new trace as it arrives. That’s exactly what online continuous evaluators do, with no orchestration cost.
  • You want the platform to handle retries and rate limits. That’s a feature of online.

Both can run on the same spans

Online and offline aren’t mutually exclusive. The same spans can be scored by:
  • An online evaluator running continuously on new traffic.
  • An offline evaluator running monthly across a historical window.
  • A human reviewer annotating a sample of edge cases.
All three write to the span as eval.<name>.* attributes with different <name> values. The platform merges everything into the same view. This is what makes evaluators a stack rather than a switch — you can layer cheap automated evaluators on top of expensive periodic human-validated ones without picking one strategy.

Next step

Starting with online, the most common path. The next page walks through the platform-managed LLM-as-a-judge model end-to-end:

Next: Online LLM-as-a-Judge