Hands-on guides for the Evaluate stage of the AX workflow: building evaluators, aligning them with human judgment, and measuring quality.

Evaluations Quickstart

Get started running evaluations to measure how your model performs.

Creating a Custom LLM Evaluator with a Benchmark Dataset

Build a custom LLM-as-a-Judge evaluator with a benchmark dataset tailored to your use case.

Trace-Level Evaluations for a Recommendation Agent

Run trace-level evaluations on individual requests to a recommendation agent.

Session-Level Evaluations for an AI Tutor

Run multi-dimensional session-level evaluations on multi-turn AI tutor conversations.

Evaluating RAG Retrieval Quality and Correctness

Create and evaluate a RAG application to improve retrieval quality and correctness.

Retrieval Evaluation

Debug RAG retrieval quality with embeddings and LLM-assisted metrics.

Evaluating Agentic RAG Using Arize AX and Couchbase

Build and evaluate an agentic RAG application on a Couchbase vector store.

Evaluating a RAG-Powered Chatbot

Monitor and debug a LlamaIndex RAG-powered chatbot with traces and spans.

Evaluate a Math Problem-Solving Agent Using Ragas

Create and evaluate a math problem-solving agent using Ragas and Arize AX.

Pydantic Evals

Evaluate a question-answering task with Pydantic Evals and log results to Arize AX.

Tracing and Evaluating Voice Applications

Trace OpenAI Realtime voice agents and run tone evaluation on captured audio.

Audio Transcription and Evaluation with Gemini Flash

Transcribe and evaluate audio with Gemini Flash, traced in Arize AX.

More Guides

Span-level evaluator examples for hallucination, relevance, toxicity, SQL, tool calling, and more.