LLM as a Judge
Judge LLM outputs using LLMs
LLM-as-a-Judge is an evaluation approach that uses an LLM to assess the quality of another model’s outputs. It is extremely flexible: you specify the rules and criteria in plain language, much as you would instruct human evaluators to grade your responses.
This lets you run thousands of evaluations across curated datasets without human intervention, creating a scalable form of evaluation that uses only scoring or classification prompts to measure performance.
Arize uses the Phoenix Evals library, which is designed for simple, fast, and accurate LLM-based evaluations.
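For example, a minimal hallucination eval with Phoenix Evals looks roughly like the sketch below. It assumes the arize-phoenix-evals package is installed, an OpenAI API key is configured, and that your dataframe columns match the variables referenced in the built-in template; adapt the judge model and data to your setup.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Example records: column names must match the variables in the template.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["France is a country in Europe. Its capital is Paris."],
        "output": ["The capital of France is Paris."],
    }
)

# The judge model that scores each row (swap in whichever model you use).
judge_model = OpenAIModel(model="gpt-4o")

# Run the pre-built hallucination eval; rails constrain the judge's output labels.
results = llm_classify(
    dataframe=df,
    model=judge_model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # also return the judge's reasoning
)
print(results[["label", "explanation"]])
```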
Components of an LLM evaluation
Choose input data to evaluate
Depending on what you want to measure or critique, your evaluation input data may vary. It can include the application’s inputs, outputs, metadata, and/or prompt variables.
When selecting data, you can pull from existing traces or isolate specific spans within those traces that are most relevant to the behavior you want to evaluate. This flexibility lets you target evaluations at a narrow scope or apply them broadly across the entire workflow.
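As a sketch of that selection step, the snippet below pulls only the LLM spans from a Phoenix instance that is collecting traces and keeps the columns an eval template will reference. The filter expression and attribute column names are illustrative and depend on how your application is instrumented.

```python
import phoenix as px

# Connect to the running Phoenix instance that is collecting traces.
client = px.Client()

# Pull only the spans relevant to the behavior under evaluation
# (here, LLM spans) rather than every span in every trace.
spans_df = client.get_spans_dataframe("span_kind == 'LLM'")

# Keep just the fields the eval template will reference, renamed to
# match the template's variable names.
eval_input = spans_df[["attributes.input.value", "attributes.output.value"]].rename(
    columns={
        "attributes.input.value": "input",
        "attributes.output.value": "output",
    }
)
```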
Define an eval prompt template
The prompt template is where you specify your evaluation criteria, the input data to evaluate, and the output labels you want the LLM Judge to return (e.g., correct/incorrect).
Arize provides many pre-built templates for common evaluation scenarios, including hallucination detection, agent planning quality, and function calling accuracy. You can also define a custom evaluation template or ask Alyx to generate one automatically based on your use case.
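A custom template is just a prompt with placeholders that match your input columns, plus a fixed set of output labels (rails) the judge may return. The sketch below uses hypothetical criteria and column names to show the general shape.

```python
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Custom criteria written in plain language; {input} and {output}
# must correspond to columns in the dataframe being evaluated.
CUSTOM_TEMPLATE = """
You are evaluating whether a response correctly answers a question.

[Question]: {input}
[Response]: {output}

Answer with a single word, either "correct" or "incorrect".
"""

df = pd.DataFrame(
    {
        "input": ["Who wrote Pride and Prejudice?"],
        "output": ["Pride and Prejudice was written by Jane Austen."],
    }
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),  # the judge model
    template=CUSTOM_TEMPLATE,
    rails=["correct", "incorrect"],     # the only labels the judge may return
    provide_explanation=True,
)
print(results["label"].iloc[0])
```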