Overview
The Correctness evaluator assesses whether an LLM’s response is factually accurate, complete, and logically consistent. It evaluates the quality of answers without requiring external context or reference responses.
When to Use
Use the Correctness evaluator when you need to:
- Validate factual accuracy - Ensure responses contain accurate information
- Check answer completeness - Verify responses address all parts of the question
- Detect logical inconsistencies - Identify contradictions within responses
- Evaluate general knowledge responses - Assess answers that don’t rely on retrieved context
- Get a quick gut-check - Capture a wide range of potential problems quickly
For evaluating responses against retrieved documents, use the Faithfulness evaluator instead. Correctness is best suited for evaluating general knowledge.
Supported Levels
The level of an evaluator determines the scope of the evaluation in OpenTelemetry terms. Some evaluations apply to individual spans, some to full traces or sessions, and some apply at multiple levels.
| Level | Supported | Notes |
|---|---|---|
| Span | Yes | Apply to LLM spans where you want to evaluate the response quality. |
| Trace | Yes | Evaluate the final response of the entire trace. |
| Session | Yes | Evaluate responses across a conversation session. |
Input Requirements
The Correctness evaluator requires two inputs:
| Field | Type | Description |
|---|---|---|
| `input` | string | The user’s query or question |
| `output` | string | The LLM’s response to evaluate |
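For example, a minimal eval input for this evaluator might look like the following (illustrative values):

```python
# Both fields are plain, human-readable strings.
eval_input = {
    "input": "Who wrote Pride and Prejudice?",
    "output": "Pride and Prejudice was written by Jane Austen.",
}
```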
Formatting Tips
For best results:
- Use human-readable strings rather than raw JSON for all inputs
- For multi-turn conversations, format input as a readable conversation:
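For example, a multi-turn exchange can be flattened into a single readable string (an illustrative format, not a required schema):

```
User: What is the tallest mountain in the world?
Assistant: Mount Everest, at 8,849 meters above sea level.
User: When was it first summited?
Assistant: The first confirmed summit was by Edmund Hillary and Tenzing Norgay in 1953.
```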
Output Interpretation
The evaluator returns a `Score` object with the following properties:
| Property | Value | Description |
|---|---|---|
| `label` | "correct" or "incorrect" | Classification result |
| `score` | 1.0 or 0.0 | Numeric score (1.0 = correct, 0.0 = incorrect) |
| `explanation` | string | LLM-generated reasoning for the classification |
| `direction` | "maximize" | Higher scores are better |
| `metadata` | object | Additional information such as the model name. When tracing is enabled, includes the `trace_id` for the evaluation. |
- Correct (1.0): The response is factually accurate, complete, and logically consistent
- Incorrect (0.0): The response contains factual errors, is incomplete, or has logical inconsistencies
Usage Examples
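The following is a minimal Python sketch (a TypeScript equivalent exists via createCorrectnessEvaluator). The import paths and LLM constructor follow the phoenix-evals 2.x pattern, but treat them as assumptions and verify them against the API reference below:

```python
# Minimal sketch, assuming the phoenix-evals 2.x API; verify import paths
# against the API reference linked below.
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import CorrectnessEvaluator

# Any supported provider/model works; these values are illustrative.
llm = LLM(provider="openai", model="gpt-4o")

evaluator = CorrectnessEvaluator(llm=llm)

# Pass the two required fields described under Input Requirements.
scores = evaluator.evaluate({
    "input": "What is the boiling point of water at sea level?",
    "output": "Water boils at 100 degrees Celsius (212 degrees Fahrenheit) at sea level.",
})

score = scores[0]         # assuming evaluate() returns a list of Score objects
print(score.label)        # "correct" or "incorrect"
print(score.score)        # 1.0 or 0.0
print(score.explanation)  # LLM-generated reasoning
```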
Using Input Mapping
When your data has different field names or requires transformation, use input mapping.
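A sketch, assuming an input_mapping argument that maps the evaluator’s required fields onto your data’s field names (the parameter name follows the Phoenix evals pattern; verify it against the API reference):

```python
# Hypothetical record whose field names differ from the evaluator's schema.
record = {
    "question": "Who painted the Mona Lisa?",
    "answer": "The Mona Lisa was painted by Leonardo da Vinci.",
}

# Map evaluator fields (keys) to your data's fields (values). The
# input_mapping parameter name is an assumption; check the API reference.
scores = evaluator.evaluate(
    record,
    input_mapping={
        "input": "question",
        "output": "answer",
    },
)
```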
Configuration
For LLM client configuration options, see Configuring the LLM.
Viewing and Modifying the Prompt
You can view the latest versions of our prompt templates on GitHub. The evaluators are designed to work well in a variety of contexts, but we highly recommend adapting the prompt to be more specific to your use case.
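One way to customize is to build your own classifier around a modified template. The sketch below uses create_classifier, following the Phoenix evals 2.x pattern; treat the exact signature as an assumption and check the templates on GitHub for the placeholders the default prompt uses:

```python
# Sketch: build a classifier with a modified prompt. create_classifier and
# its parameters are assumptions based on the Phoenix evals 2.x pattern.
from phoenix.evals import create_classifier

CUSTOM_TEMPLATE = """
You are grading the correctness of an answer in a medical support context.
Consider factual accuracy, completeness, and logical consistency.

Question: {input}
Answer: {output}

Respond with "correct" or "incorrect".
"""

evaluator = create_classifier(
    name="correctness",
    prompt_template=CUSTOM_TEMPLATE,
    llm=llm,  # the LLM client constructed earlier
    choices={"correct": 1.0, "incorrect": 0.0},
)
```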
Using with Phoenix
Evaluating Traces
Run evaluations on traces collected in Phoenix and log results as annotations:
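A sketch, assuming the phoenix-client span APIs and the evaluate_dataframe helper (names may differ across versions; consult the Phoenix docs):

```python
# Sketch: pull spans from Phoenix, evaluate them, and log the results back
# as annotations. Function names follow the Phoenix client pattern but
# should be verified against the Phoenix docs.
from phoenix.client import Client
from phoenix.evals import evaluate_dataframe

client = Client()

# Fetch spans for a project ("default" is illustrative).
spans_df = client.spans.get_spans_dataframe(project_identifier="default")

# Run the Correctness evaluator over each row of the dataframe.
results_df = evaluate_dataframe(dataframe=spans_df, evaluators=[evaluator])

# Log the scores back to Phoenix as span annotations.
client.spans.log_span_annotations_dataframe(dataframe=results_df)
```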
Running Experiments
Use the Correctness evaluator in Phoenix experiments:
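A sketch using run_experiment from phoenix.experiments; the dataset name, task, and adapter function are illustrative assumptions:

```python
# Sketch: wire the evaluator into a Phoenix experiment. The dataset name
# and the my_app helper are hypothetical.
import phoenix as px
from phoenix.experiments import run_experiment

dataset = px.Client().get_dataset(name="qa-dataset")  # hypothetical dataset

def task(input):
    # Call your application here; my_app is a hypothetical helper.
    return my_app(input["question"])

def correctness(input, output):
    # Adapt the evaluator to the experiment-evaluator signature, which
    # binds arguments by parameter name.
    result = evaluator.evaluate({"input": input["question"], "output": output})
    return result[0].score

experiment = run_experiment(dataset, task, evaluators=[correctness])
```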
API Reference
- Python: CorrectnessEvaluator
- TypeScript: createCorrectnessEvaluator
Related
- Faithfulness Evaluator - For evaluating responses against retrieved context
- Tool Selection Evaluator - For evaluating LLM tool selection accuracy

