Build an Eval
This guide shows you how to build and improve an LLM-as-a-Judge Eval from scratch.
Before you begin:
You'll need two things to build your own LLM Eval:
A dataset to evaluate
A template prompt to use as the evaluation prompt on each row of data.
The dataset can have any columns you like, and the template can be structured however you like. The only requirement is that the dataset has all the columns your template uses.
We have two examples of templates below: CATEGORICAL_TEMPLATE and SCORE_TEMPLATE. The first must be used alongside a dataset with columns query and reference. The second must be used with a dataset that includes a column called context.
Feel free to set up your template however you'd like to match your dataset.
Preparing your data
You will need a dataset of results to evaluate. This dataset should be a pandas dataframe. If you are already collecting traces with Phoenix, you can export these traces and use them as the dataframe to evaluate:
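For example, here is a minimal sketch of exporting spans from a running Phoenix instance into a dataframe; it assumes Phoenix is already running and collecting traces:

```python
import phoenix as px

# Pull the collected spans from the running Phoenix instance into a pandas dataframe.
# You can then filter this dataframe down to the rows you want to evaluate.
df = px.Client().get_spans_dataframe()
```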
If your eval should have categorical outputs, use llm_classify.
If your eval should have numeric outputs, use llm_generate.
Categorical - llm_classify
The llm_classify function is designed for classification and supports both binary and multi-class cases. It ensures that the output is clean and is either one of the "classes" or "UNPARSABLE".
A binary template looks like the following, with only two values, "irrelevant" and "relevant", expected from the LLM output:
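Here is a sketch of what such a CATEGORICAL_TEMPLATE could look like; the exact wording is illustrative, and the {query} and {reference} placeholders correspond to the dataset columns mentioned above:

```python
# A sketch of a binary relevance template; adapt the wording to your own use case.
CATEGORICAL_TEMPLATE = """You are comparing a reference text to a question and trying to
determine if the reference text contains information relevant to answering the question.
[BEGIN DATA]
[Question]: {query}
[Reference text]: {reference}
[END DATA]
Your response must be a single word, either "relevant" or "irrelevant", and must not
contain any other text. "relevant" means the reference text contains an answer to the
question; "irrelevant" means it does not."""
```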
The categorical template defines the expected output of the LLM, and the rails define the classes expected from the LLM:
irrelevant
relevant
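A minimal sketch of defining these rails and running the classification follows; the dataframe name and model choice are assumptions:

```python
from phoenix.evals import OpenAIModel, llm_classify

# The allowed output classes for this eval.
rails = ["relevant", "irrelevant"]

relevance_classifications = llm_classify(
    dataframe=df,                       # the dataframe prepared above
    template=CATEGORICAL_TEMPLATE,      # the binary template shown earlier
    model=OpenAIModel(model="gpt-4o"),  # any supported model wrapper works here
    rails=rails,
)
```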
Snap to Rails Function
llm_classify uses a snap_to_rails function that searches the output string of the LLM for the classes in the classification list. It handles cases where no class is present in the output, where multiple classes are present, and where one class is a substring of another, such as "irrelevant" and "relevant".
A common use case is mapping the class to a 1 or 0 numeric value.
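For example, a sketch of that mapping, assuming the "label" output column produced by llm_classify:

```python
# Map the categorical label to a numeric score for downstream aggregation.
relevance_classifications["score"] = relevance_classifications["label"].map(
    {"relevant": 1, "irrelevant": 0}
)
```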
Numeric - llm_generate
The Phoenix library also supports numeric score Evals if you would like to use them. A template for a score Eval looks like the following:
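Here is a sketch of what a SCORE_TEMPLATE could look like; the wording is illustrative, and the {context} placeholder corresponds to the dataset column mentioned above:

```python
# A sketch of a numeric score template; adapt the scoring rubric to your own use case.
SCORE_TEMPLATE = """You are checking the following document for spelling and grammatical errors.
Return a score from 1 to 10, where 1 means no words contain errors and 10 means
every word contains an error.

#CONTEXT
{context}
#ENDCONTEXT

Return only the score, in the format "the score is: <number>", with no other text."""
```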
We use the more generic llm_generate function, which can be used for almost any complex eval that doesn't fit the categorical type.
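A sketch of running the score Eval with llm_generate; the output parser and model choice below are assumptions:

```python
import re

from phoenix.evals import OpenAIModel, llm_generate


def numeric_score_eval(output, row_index):
    # Parse the first number out of the LLM's free-form output.
    match = re.search(r"\d+", output)
    score = int(match.group()) if match else None
    return {"score": score}


test_results = llm_generate(
    dataframe=df,                       # the dataframe prepared above
    template=SCORE_TEMPLATE,            # the score template shown earlier
    model=OpenAIModel(model="gpt-4o"),
    output_parser=numeric_score_eval,   # adds a "score" column to the results
)
```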
The above is an example of how to run a score-based Evaluation.
Logging Evaluations to Phoenix
In order for the results to show in Phoenix, make sure your test_results dataframe has a column context.span_id with the corresponding span id. This value comes from Phoenix when you export traces from the platform. If you've brought in your own dataframe to evaluate, this section does not apply.
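A sketch of logging the results back to Phoenix, assuming test_results carries the context.span_id information described above; the eval name is a placeholder:

```python
import phoenix as px
from phoenix.trace import SpanEvaluations

# Attach the eval results to their corresponding spans in Phoenix.
px.Client().log_evaluations(
    SpanEvaluations(eval_name="My Custom Eval", dataframe=test_results)
)
```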
Improving your Custom Eval
At this point, you've constructed a custom Eval, but you have no understanding of how accurate that Eval is. To test your eval, you can use the same techniques that you use to iterate and improve on your application.
Start with a labeled ground truth set of data. Each input would be a row of your dataframe of examples, and each labeled output would be the correct judge label.
Test your eval on that labeled set of examples, and compare to the ground truth to calculate F1, precision, and recall scores (a sketch of this comparison appears after this list). For an example of this, see Hallucinations.
Tweak your prompt and retest. See https://github.com/Arize-ai/phoenix/blob/docs/docs/evaluation/how-to-evals/broken-reference/README.md for an example of how to do this in an automated way.
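As an illustration of the comparison step above, here is a sketch of scoring eval labels against ground truth with scikit-learn; the data and column names are hypothetical:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical labeled set: "label" is the eval's output, "ground_truth" is the hand label.
labeled_df = pd.DataFrame(
    {
        "label": ["relevant", "irrelevant", "relevant", "irrelevant"],
        "ground_truth": ["relevant", "relevant", "relevant", "irrelevant"],
    }
)

# Prints precision, recall, and F1 for each class.
print(
    classification_report(
        y_true=labeled_df["ground_truth"],
        y_pred=labeled_df["label"],
        labels=["relevant", "irrelevant"],
        zero_division=0,
    )
)
```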