Customize Your Own Eval Templates
The LLM Evals library is designed to support building custom Eval templates.
Steps to Building Your Own Eval
Follow these steps to build your own Eval with Phoenix.
1. Choose a Metric
First, identify the metric best suited to your use case. Can you use a pre-existing template, or do you need to evaluate something unique to your use case?
2. Build a Golden Dataset
Next, you need a golden dataset. It should be representative of the type of data you expect the LLM eval to see. The golden dataset should have "ground truth" labels so that we can measure the performance of the LLM eval template. Such labels often come from human feedback. Building such a dataset is laborious, but you can often find a standardized one for the most common use cases. Alternatively, you can use a dataset curated for your use case.
3. Decide Which LLM to Use for Evaluation
Then decide which LLM you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. This choice is often influenced by cost and accuracy.
4. Build the Eval Template
Now comes the core component that we are trying to benchmark and improve: the eval template. You can adjust an existing template or build your own from scratch. Be explicit about the following:
- What is the input? In our example, it is the retrieved documents/context and the query from the user.
- What are we asking? In our example, we're asking the LLM to tell us if the document was relevant to the query.
- What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).
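The three questions above (input, question, output format) can be made concrete in a small template sketch. This is a minimal, library-free illustration and not the Phoenix built-in template: the template wording, the placeholder names {query} and {document}, and the helpers render_prompt and parse_label are all hypothetical names chosen for this example.

```python
# Hypothetical relevance template. The placeholder names {query} and
# {document} are assumptions chosen for this sketch.
RELEVANCE_TEMPLATE = """You are comparing a retrieved document to a user query.
[BEGIN DATA]
Query: {query}
Document: {document}
[END DATA]
Decide whether the document contains information relevant to the query.
Answer with exactly one word: relevant or irrelevant."""

# The only labels the eval is allowed to emit (binary output format).
ALLOWED_LABELS = ("relevant", "irrelevant")


def render_prompt(query: str, document: str) -> str:
    """Fill the template with one (query, document) pair."""
    return RELEVANCE_TEMPLATE.format(query=query, document=document)


def parse_label(raw_output: str) -> str:
    """Normalize raw LLM text to an allowed label, else 'unparseable'."""
    # Trim whitespace, then surrounding periods and quotes, then lowercase.
    cleaned = raw_output.strip().strip(".\"'").lower()
    # Exact match only: "irrelevant" contains "relevant" as a substring.
    return cleaned if cleaned in ALLOWED_LABELS else "unparseable"


if __name__ == "__main__":
    print(render_prompt("How do I reset my password?",
                        "Go to Settings and choose Reset Password."))
    print(parse_label(" Relevant.\n"))
```

Restricting the output to a fixed label set, and treating anything else as unparseable, keeps the later benchmark honest: a verbose or off-format answer is counted as a failure rather than silently mapped to a label. A multi-class variant would only need a longer ALLOWED_LABELS tuple and a matching instruction in the template.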
5. Run Eval on your Golden Dataset and Benchmark Performance
This example shows the custom template applied to the df dataframe.
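As a minimal, library-free sketch of the benchmarking step, the code below compares eval output against ground-truth labels and reports precision, recall, and F1 for the "relevant" class. It does not call Phoenix or any LLM: golden_rows is a made-up four-row stand-in for the golden dataset, stub_eval is a hypothetical keyword-matching placeholder for the LLM call, and benchmark is a helper name invented for this example.

```python
from typing import Callable, Dict, List

# Tiny illustrative golden dataset; rows and labels are made up for this sketch.
golden_rows: List[Dict[str, str]] = [
    {"query": "reset password", "document": "How to reset your password.", "label": "relevant"},
    {"query": "reset password", "document": "Quarterly sales figures.", "label": "irrelevant"},
    {"query": "shipping time", "document": "Shipping time is 3 to 5 days.", "label": "relevant"},
    {"query": "shipping time", "document": "Our office dress code.", "label": "irrelevant"},
]


def stub_eval(query: str, document: str) -> str:
    """Stand-in for the LLM call: 'relevant' if any query word appears in the document."""
    doc_text = document.lower()
    hit = any(word in doc_text for word in query.lower().split())
    return "relevant" if hit else "irrelevant"


def benchmark(
    rows: List[Dict[str, str]],
    evaluate: Callable[[str, str], str],
    positive: str = "relevant",
) -> Dict[str, float]:
    """Score predictions against ground-truth labels for the positive class."""
    tp = fp = fn = 0
    for row in rows:
        predicted = evaluate(row["query"], row["document"])
        actual = row["label"]
        if predicted == positive and actual == positive:
            tp += 1  # correctly flagged as positive
        elif predicted == positive:
            fp += 1  # flagged as positive, ground truth disagrees
        elif actual == positive:
            fn += 1  # missed a positive example
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(benchmark(golden_rows, stub_eval))
```

In practice, stub_eval would be replaced by a function that sends the rendered template to the LLM chosen in step 3 and parses its answer, and the rows would come from the golden dataset built in step 2. Reporting precision and recall separately, rather than accuracy alone, shows whether the template errs toward over-flagging or under-flagging relevance.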


