Evaluation
Execute code and evaluate LLM performance with precision
More Cookbooks
Evaluations Quickstart
Run Online Evals in the Arize UI
Run Offline Evals in Code
Session-Level Evaluations
Agent Trajectory Evaluations
Evaluating RAG
Span-Level Evaluation
Evaluate code functionality
Evaluate hallucination
Evaluate human ground truth vs. AI
Evaluate Q&A correctness
Evaluate RAG
Evaluate reference links
Evaluate relevance
Evaluate SQL correctness
Evaluate tool calling
Evaluate toxicity
Evaluate user frustration
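
As a minimal sketch of what one of these span-level evaluators can look like in code, the snippet below runs an LLM-as-a-judge hallucination check over a small dataframe of spans using the phoenix.evals package (installed via arize-phoenix-evals). It assumes an OpenAI API key is configured; the dataframe rows and the gpt-4o model choice are illustrative, and exact class or parameter names may differ across package versions.

import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Illustrative spans to evaluate; in practice these would be exported from
# your traces. Column names must match the template variables
# (input, reference, output).
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
        "output": ["The capital of France is Paris."],
    }
)

# Run the hallucination evaluator over each row (span) with an LLM judge.
results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(results[["label", "explanation"]])

Each returned row carries a label (here, factual or hallucinated) and, with provide_explanation=True, the judge's reasoning, which can then be logged back to Arize alongside the corresponding spans. The other evaluators listed above follow the same pattern with their own prompt templates and rails.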