Phoenix Evals uses executors to make running evaluations many times faster.
When you are evaluating at scale, speed matters so you can focus on improving your system. Phoenix Evals executors run evaluations faster and more reliably by automatically handling rate limits, errors, and concurrency:
- Handle Rate Limits: Automatically retry when LLM providers throttle requests
- Manage Errors: Distinguish between temporary failures and permanent errors
- Optimize Speed: Dynamically adjust concurrency based on provider performance
Running thousands of evaluations manually is slow and error-prone. Executors automatically handle the complexity so you can focus on your evaluation logic instead of infrastructure.
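As a rough illustration, here is a minimal sketch of handing a large batch of examples to the library and letting the executor handle the fan-out. It assumes the Python package's classic llm_classify entry point and an OpenAI judge; the concurrency parameter name is an assumption and may differ or be managed automatically in your version.

```python
import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify

# One row per example to evaluate; in practice this might be thousands of
# rows exported from your application logs or traces.
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"] * 1000,
        "output": ["Phoenix is an open-source LLM observability library."] * 1000,
    }
)

# The executor fans the rows out concurrently, retries transient failures,
# and backs off when the provider returns rate-limit errors, so there is no
# need to write batching or retry loops yourself.
results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=(
        "Decide whether the output is relevant to the input.\n"
        "Input: {input}\nOutput: {output}\n"
        "Respond with exactly one word: relevant or irrelevant."
    ),
    rails=["relevant", "irrelevant"],
    concurrency=20,  # assumed parameter; tune to your provider's rate limits
)
print(results["label"].value_counts())
```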
Phoenix Evals automatically traces all evaluation executions, providing complete transparency into how your evaluators make decisions. This visibility is essential for achieving human alignment and building trust in your evaluation results.
LLM evaluations are only as good as their alignment with human judgment. To achieve this alignment, you need to:
- Inspect Evaluator Reasoning: See exactly how the evaluator LLM interpreted your prompt and reached its decision
- Debug Evaluation Logic: Identify when evaluators misunderstand instructions or make inconsistent judgments
- Validate Prompt Engineering: Verify that your evaluation prompts are working as intended across different examples
- Build Confidence: Provide stakeholders with transparent evidence of evaluation quality
Every evaluation execution captures:
- Input Data: The original content being evaluated
- Evaluation Prompts: The exact prompts sent to evaluator LLMs
- Model Responses: Full reasoning and decision-making process
- Final Scores: Structured evaluation results and metadata
- Execution Details: Timing, retries, and performance metrics
Phoenix Evals follows the Transparency pillar: nothing is abstracted away. You can inspect every aspect of the evaluation process, from the raw prompts to the model's step-by-step reasoning. This transparency enables you to:
- Tune evaluation prompts for better human alignment
- Identify systematic biases or errors in evaluation logic
- Provide evidence-based justification for evaluation results
- Continuously improve evaluator performance through data-driven insights
Use Phoenix's trace viewer to explore evaluation traces and ensure your evaluators are making decisions that align with human judgment.
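For instance, here is a hedged sketch of pulling the evaluator's reasoning into your results and opening the trace viewer. It assumes the classic llm_classify entry point with the provide_explanation flag, and that the full arize-phoenix package is installed for the UI; whether evaluator spans appear in the viewer also depends on how tracing is configured in your setup.

```python
import pandas as pd
import phoenix as px

from phoenix.evals import OpenAIModel, llm_classify

df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "output": ["Phoenix is an open-source LLM observability library."],
    }
)

# provide_explanation=True asks the judge to justify each label; the result
# dataframe then carries an "explanation" column alongside "label".
results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=(
        "Decide whether the output is relevant to the input.\n"
        "Input: {input}\nOutput: {output}\n"
        "Respond with exactly one word: relevant or irrelevant."
    ),
    rails=["relevant", "irrelevant"],
    provide_explanation=True,
)
print(results[["label", "explanation"]].to_string())

# Launch the Phoenix UI to browse evaluator traces: prompts, model
# responses, retries, and timing for each execution.
px.launch_app()
```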
The standard for evaluating text is human labeling. However, high-quality LLM outputs are becoming cheaper and faster to produce, and human evaluation cannot scale to match. In this context, evaluating the performance of LLM applications is best tackled by using an LLM. The Phoenix Evals library is designed for simple, fast, and accurate LLM-based evaluations.
Phoenix Evals provides lightweight, composable building blocks for writing and running evaluations on LLM applications. It can be installed completely independently of the arize-phoenix package and is available in both Python and TypeScript versions.
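For the Python version, installation is a single pip command; a quick sanity check follows (which provider SDK you also need depends on the judge model you choose):

```python
# Install the standalone evals package (no arize-phoenix install required):
#   pip install arize-phoenix-evals
# Then install the SDK for your judge model, e.g.:
#   pip install openai
from phoenix.evals import OpenAIModel, llm_classify  # quick import check
```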
- Works with your preferred model SDKs via SDK adapters (OpenAI, LiteLLM, LangChain, AI SDK). Phoenix lets you configure which foundation model to use as a judge, including OpenAI, Anthropic, Gemini, and many more. See Configuring the LLM and the sketch after this list.
- Powerful input mapping and binding for working with complex data structures: easily map nested data and complex inputs to evaluator requirements.
- Several pre-built metrics for common evaluation tasks such as hallucination detection. Phoenix provides pre-tested eval templates for common tasks such as RAG and function calling; learn more about pre-tested templates here. Each eval is pre-tested on a variety of eval models, and you can find the most up-to-date benchmarks on GitHub.
- Evaluators are natively instrumented via OpenTelemetry tracing for observability and dataset curation. See Evaluator Traces for an overview.
- Blazing fast performance: achieve up to 20x speedup with built-in concurrency and batching. Evals run in batches and typically run much faster than calling the APIs directly. See Executors for details on how this works.
- Tons of convenience features to improve the developer experience!
- Run evals on your own data: native dataframe and data transformation utilities make it easy to run evaluations on your own data, whether that's logs, traces, or datasets downloaded for benchmarking.
- Built-in explanations: all Phoenix evaluations include an explanation capability that requires eval models to explain their judgment rationale. This boosts performance and helps you understand and improve your evals.
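To tie a few of these features together, here is a hedged sketch that runs the pre-built hallucination eval over your own dataframe with explanations enabled. It assumes the classic llm_classify API, that the template's variables match the input/reference/output column names, and that provider classes such as AnthropicModel accept a model argument in your installed version.

```python
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    AnthropicModel,
    OpenAIModel,
    llm_classify,
)

# Your own data: logs, traces, or a downloaded benchmark dataset. Column
# names are assumed to match the template's variables; check the template
# text if your columns are named differently.
df = pd.DataFrame(
    {
        "input": ["Who wrote Hamlet?"],
        "reference": ["Hamlet is a tragedy written by William Shakespeare."],
        "output": ["Hamlet was written by Charles Dickens."],
    }
)

# Configure the judge model; swapping providers is a one-line change.
judge = OpenAIModel(model="gpt-4o")
# judge = AnthropicModel(model="claude-3-5-sonnet-latest")

results = llm_classify(
    dataframe=df,
    model=judge,
    template=HALLUCINATION_PROMPT_TEMPLATE,  # pre-tested template
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # built-in explanations
)
print(results[["label", "explanation"]].to_string())
```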

