Evaluate - Arize AX Docs

Hands-on guides for the Evaluate stage of the AX workflow: building evaluators, aligning them with human judgment, and measuring quality.

Evaluations Quickstart

Get started running evaluations to measure how your model performs.

Build a custom LLM-as-a-Judge evaluator with a benchmark dataset tailored to your use case.

Run trace-level evaluations on individual requests to a recommendation agent.

Run multi-dimensional session-level evaluations on multi-turn AI tutor conversations.

Create and evaluate a RAG application to improve retrieval quality and correctness.

Debug RAG retrieval quality with embeddings and LLM-assisted metrics.

Build and evaluate an agentic RAG application on a Couchbase vector store.

Monitor and debug a LlamaIndex RAG-powered chatbot with traces and spans.

Create and evaluate a math problem-solving agent using Ragas and Arize AX.

Evaluate a question-answering task with Pydantic Evals and log results to Arize AX.

Trace OpenAI Realtime voice agents and run tone evaluation on captured audio.

Transcribe and evaluate audio with Gemini Flash, traced in Arize AX.

Span-level evaluator examples for hallucination, relevance, toxicity, SQL, tool calling, and more.
