Evaluation Fundamentals

Choose your evaluation method
There are three types of evaluators you can build: LLM judges, code, and annotations. Each has its uses depending on what you want to measure.

| Evaluator | How it works | Ideal use cases |
|---|---|---|
| LLM judge | One LLM evaluates the outputs of another and provides explanations | Great for qualitative evaluation and direct labeling on mostly objective criteria. Poor for quantitative scoring, subject matter expertise, and pairwise preference. |
| Code | Code assesses the performance, accuracy, or behavior of LLMs | Great for reducing cost and latency, and for evaluations that can be hard-coded (e.g., code generation). Poor for qualitative measures such as summarization quality. |
| Annotations | Humans provide custom labels to LLM traces | Great for evaluating the evaluator, labeling with subject matter expertise, and directional application feedback. The most costly and time-intensive option. |
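To make the code row concrete, here is a minimal sketch of a hard-coded evaluator for a code-generation use case. It is illustrative only: the function name, the pass/fail labels, and the output shape are assumptions, not part of any particular SDK.

```python
import ast

def evaluate_code_output(generated_code: str) -> dict:
    """Code-based evaluator: label a code-generation output as pass/fail
    by checking whether it parses as valid Python. Cheap, fast, and
    deterministic, with no LLM call required."""
    try:
        ast.parse(generated_code)
        return {"label": "pass", "explanation": "Output is syntactically valid Python."}
    except SyntaxError as err:
        return {"label": "fail", "explanation": f"Syntax error: {err}"}

# Example usage
print(evaluate_code_output("def add(a, b):\n    return a + b"))  # pass
print(evaluate_code_output("def add(a, b) return a + b"))        # fail
```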
Choose your metric
Wherever you find the highest frequency of problematic traces is where you should start building evaluations. We have many pre-built templates for common evaluation cases such as user frustration, hallucination, agent planning, and function calling. You can set up these evaluations to create a set of metrics that measure your application's performance. In the example below, we judge the quality of a customer support chatbot on relevance, hallucination, and latency.
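As a rough sketch of how those metrics might be wired up, the snippet below computes latency directly in code from trace timestamps and sends relevance to an LLM judge; hallucination would follow the same judge pattern. The trace fields and the `llm_judge` callable are hypothetical placeholders, not part of any specific SDK.

```python
from typing import Callable

# Hypothetical trace record from a customer support chatbot.
trace = {
    "query": "How do I reset my password?",
    "retrieved_context": "To reset your password, go to Settings > Security...",
    "response": "Go to Settings > Security and click 'Reset password'.",
    "start_time": 1717000000.00,
    "end_time": 1717000001.35,
}

def latency_eval(t: dict) -> dict:
    """Code evaluator: latency is simple arithmetic, no LLM needed."""
    seconds = t["end_time"] - t["start_time"]
    return {"metric": "latency",
            "label": "slow" if seconds > 1.0 else "fast",
            "score": seconds}

def relevance_eval(t: dict, llm_judge: Callable[[str], str]) -> dict:
    """LLM-judge evaluator: qualitative criteria go to a judge prompt."""
    prompt = (
        "Is the retrieved context relevant to the user query? "
        "Answer 'relevant' or 'irrelevant'.\n"
        f"Query: {t['query']}\nContext: {t['retrieved_context']}"
    )
    return {"metric": "relevance", "label": llm_judge(prompt)}

# Example usage with a stubbed judge (replace with a real LLM call).
print(latency_eval(trace))
print(relevance_eval(trace, llm_judge=lambda p: "relevant"))
```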
Components of an LLM evaluation

- The input data: depending on what you are trying to measure or critique, the input data for your evaluation can consist of your application's input, output, and prompt variables.
- The eval prompt template: this is where you specify your criteria, input data, and output labels used to judge the quality of the LLM output.
- The output: the LLM evaluator generates eval labels and explanations that show why it assigned a particular label or score.
- The aggregate metric: when you run thousands of evaluations across a large dataset, aggregate metrics summarize the quality of your responses over time and across different prompts, retrievals, and LLMs (a sketch of this aggregation follows the list).
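To make the last component concrete, here is a small sketch of turning per-trace eval labels into an aggregate metric. The record shape and label names are illustrative assumptions rather than a required schema.

```python
from collections import Counter

# Hypothetical per-trace eval outputs produced by an LLM judge:
# each record carries the label and the judge's explanation.
eval_results = [
    {"label": "relevant",   "explanation": "Context answers the query directly."},
    {"label": "relevant",   "explanation": "Doc covers the password-reset flow."},
    {"label": "irrelevant", "explanation": "Context discusses billing, not login."},
]

def aggregate_relevance(results: list[dict]) -> float:
    """Aggregate metric: fraction of traces labeled 'relevant'.
    Tracked over time, this summarizes response quality across
    prompts, retrievals, and LLMs."""
    counts = Counter(r["label"] for r in results)
    return counts["relevant"] / len(results)

print(f"relevance rate: {aggregate_relevance(eval_results):.0%}")  # 67%
```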
Building good evaluation templates

- What is the input? In our example, it is the documents/context that were retrieved and the query from the user.
- What are we asking? In our example, we're asking the LLM to tell us whether the document was relevant to the query.
- What are the possible output formats? In our example, it is binary (relevant/irrelevant), but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant), as sketched in the template below.
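Putting those three questions together, a relevance template might look like the following sketch. The exact wording and the variable names ({query}, {reference}) are illustrative, not a prescribed template.

```python
# Illustrative relevance eval template: the input variables, the question
# being asked, and the allowed output labels are all made explicit.
RELEVANCE_TEMPLATE = """\
You are comparing a reference document to a user question.

[Question]: {query}
[Reference]: {reference}

Determine whether the reference document contains information that can
answer the question. Respond with exactly one word:
"relevant" or "irrelevant".
"""

# Multi-class variant: swap the output labels, keep the same inputs.
MULTI_CLASS_LABELS = ["fully relevant", "partially relevant", "not relevant"]

prompt = RELEVANCE_TEMPLATE.format(
    query="How do I reset my password?",
    reference="To reset your password, go to Settings > Security...",
)
print(prompt)
```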

Learn more
| Resource | Description | Link |
|---|---|---|
| Dive deeper into agent evaluation | Learn how to evaluate the performance of each component of an agent | |
| Use AI to build your evals | Leverage self-improving evals, which run on examples of your own data | |
| Run evaluations on your traces | Use online evals to run evals on your data without code | |
| Read our definitive guide on LLM App Evaluation | Best practices we've developed across hundreds of customers | https://arize.com/llm-evaluation |
| Watch our paper readings | We cover popular research papers every few weeks on our DeepPapers podcast | https://arize.com/ai-research-papers/ |