


Why Tracing Matters for Human Alignment
LLM evaluations are only as good as their alignment with human judgment. To achieve this alignment, you need to:
- Inspect Evaluator Reasoning: See exactly how the evaluator LLM interpreted your prompt and reached its decision
- Debug Evaluation Logic: Identify when evaluators misunderstand instructions or make inconsistent judgments
- Validate Prompt Engineering: Verify that your evaluation prompts are working as intended across different examples
- Build Confidence: Provide stakeholders with transparent evidence of evaluation quality
What Gets Traced
Every evaluation execution captures:
- Input Data: The original content being evaluated
- Evaluation Prompts: The exact prompts sent to evaluator LLMs
- Model Responses: Full reasoning and decision-making process
- Final Scores: Structured evaluation results and metadata
- Execution Details: Timing, retries, and performance metrics
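Conceptually, one traced evaluation can be modeled as a simple record holding these fields. The sketch below is illustrative only; the field names and structure are assumptions, not the library's actual schema:

```python
from dataclasses import dataclass

@dataclass
class EvalTrace:
    """Illustrative record of one evaluation execution (hypothetical schema)."""
    input_data: str    # the original content being evaluated
    prompt: str        # the exact prompt sent to the evaluator LLM
    response: str      # the model's full reasoning and decision
    label: str         # the final structured score/label
    latency_ms: float  # execution timing
    retries: int = 0   # retry attempts during execution

# Example of what a captured trace might contain:
trace = EvalTrace(
    input_data="Paris is the capital of France.",
    prompt="Is the following statement factually correct? ...",
    response="The statement matches known geography. Verdict: correct",
    label="correct",
    latency_ms=412.5,
)
print(trace.label, trace.retries)
```

Because every field is preserved, you can replay exactly what the evaluator saw and how it responded, rather than only seeing the final label.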
Transparency by Design
Phoenix Evals follows the Transparency pillar: nothing is abstracted away. You can inspect every aspect of the evaluation process, from the raw prompts to the model's step-by-step reasoning. This transparency enables you to:
- Tune evaluation prompts for better human alignment
- Identify systematic biases or errors in evaluation logic
- Provide evidence-based justification for evaluation results
- Continuously improve evaluator performance through data-driven insights
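In practice, measuring human alignment can start as simply as comparing evaluator labels against human labels across traced evaluations and surfacing the disagreements for inspection. The records and field names below are hypothetical, standing in for whatever your traced results contain:

```python
# Hypothetical traced results: each record pairs the evaluator's label
# with a human-provided label for the same input.
traces = [
    {"input": "doc A", "evaluator_label": "relevant",   "human_label": "relevant"},
    {"input": "doc B", "evaluator_label": "irrelevant", "human_label": "relevant"},
    {"input": "doc C", "evaluator_label": "relevant",   "human_label": "relevant"},
]

# Overall agreement rate between the evaluator and human judgment.
agreements = sum(t["evaluator_label"] == t["human_label"] for t in traces)
agreement_rate = agreements / len(traces)
print(f"human agreement: {agreement_rate:.0%}")

# Surface disagreements so their prompts and reasoning can be inspected
# and the evaluation prompt tuned accordingly.
disagreements = [t for t in traces if t["evaluator_label"] != t["human_label"]]
for t in disagreements:
    print("disagrees on:", t["input"])
```

Each disagreement points back to a full trace, so you can read the exact prompt and reasoning that produced the divergent judgment before revising the evaluation prompt.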

