Evaluations Quickstart
Get started with evaluations to measure how your model performs.
Creating a Custom LLM Evaluator with a Benchmark Dataset
Build a custom LLM-as-a-Judge evaluator with a benchmark dataset tailored to your use case.
Trace-Level Evaluations for a Recommendation Agent
Run trace-level evaluations on individual requests to a recommendation agent.
Session-Level Evaluations for an AI Tutor
Run multi-dimensional session-level evaluations on multi-turn AI tutor conversations.
Evaluating RAG Retrieval Quality and Correctness
Build a RAG application, then evaluate it to improve retrieval quality and correctness.
Retrieval Evaluation
Debug RAG retrieval quality with embeddings and LLM-assisted metrics.
Evaluating Agentic RAG Using Arize AX and Couchbase
Build and evaluate an agentic RAG application on a Couchbase vector store.
Evaluating a RAG-Powered Chatbot
Monitor and debug a LlamaIndex RAG-powered chatbot with traces and spans.
Evaluate a Math Problem-Solving Agent Using Ragas
Create and evaluate a math problem-solving agent using Ragas and Arize AX.
Pydantic Evals
Evaluate a question-answering task with Pydantic Evals and log results to Arize AX.
Tracing and Evaluating Voice Applications
Trace OpenAI Realtime voice agents and run tone evaluation on captured audio.
Audio Transcription and Evaluation with Gemini Flash
Transcribe and evaluate audio with Gemini Flash, traced in Arize AX.
More Guides
Span-level evaluator examples for hallucination, relevance, toxicity, SQL, tool calling, and more.