Compare experiments
The Compare Experiments feature helps you identify meaningful improvements in performance across experiments so you can decide which experiment to move forward with. This enables:
Faster iteration: Quickly spot where performance diverges without manual guesswork.
Evidence-based decisions: Confirm whether improvements are real and significant, or just noise.
Trade-off awareness: See whether gains come with costs, so you can balance accuracy, speed, and token usage.
1. Select Experiments to Compare
Select the experiments you want to analyze and click Compare Experiments to view runs side by side.
See outputs, evaluator results, and metadata together for direct comparison.
Choose only the columns that matter most and hide the rest.
Use Table View for detailed text results or Charting View to visualize evaluator outputs across runs.
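If you want to prototype a similar side-by-side view outside the product, the sketch below lines up runs from two experiments and prints only a chosen set of columns. The experiment names, run fields, and scores are hypothetical placeholders and not part of the product's API.

```python
# Minimal sketch of a side-by-side comparison, assuming each experiment
# exposes its runs as a list of dicts. All names and values are hypothetical.
experiments = {
    "baseline-prompt": [
        {"input": "Summarize ticket #1", "output": "Short summary A", "faithfulness": 0.82, "latency_s": 1.4},
    ],
    "revised-prompt": [
        {"input": "Summarize ticket #1", "output": "Longer, grounded summary A", "faithfulness": 0.91, "latency_s": 1.9},
    ],
}

# Show only the columns that matter most and hide the rest.
columns = ["faithfulness", "latency_s"]

for row_idx in range(len(next(iter(experiments.values())))):
    print(f"--- Row {row_idx} ---")
    for name, runs in experiments.items():
        run = runs[row_idx]
        selected = {col: run[col] for col in columns}
        print(f"{name:>16}: {selected}")
```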
2. Enable Diff Mode
Turn on Diff Mode to quickly see where runs improved or regressed against a baseline.
Select a baseline experiment to measure other runs against.
Differences in evaluator results are highlighted, making improvements and failures easy to spot.
Aggregated metrics at the top show how each evaluation measure changed relative to the baseline.
This makes it faster to identify meaningful changes without scanning through every result.
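Conceptually, the aggregated view boils down to averaging each evaluator metric per experiment and reporting the change against the baseline. The sketch below shows one way to compute such deltas; the experiment names, metric names, and scores are made up for illustration.

```python
# Minimal sketch of baseline-relative metric deltas, assuming per-run
# evaluator scores are available as lists. All data here is hypothetical.
from statistics import mean

results = {
    "baseline-prompt": {"faithfulness": [0.82, 0.78, 0.85], "relevance": [0.90, 0.88, 0.86]},
    "revised-prompt":  {"faithfulness": [0.91, 0.87, 0.89], "relevance": [0.88, 0.90, 0.87]},
}
baseline = "baseline-prompt"

# Average each metric for the baseline experiment.
baseline_avg = {metric: mean(scores) for metric, scores in results[baseline].items()}

# Report how every other experiment's averages changed relative to the baseline.
for experiment, metrics in results.items():
    if experiment == baseline:
        continue
    for metric, scores in metrics.items():
        delta = mean(scores) - baseline_avg[metric]
        sign = "+" if delta >= 0 else ""
        print(f"{experiment} / {metric}: {sign}{delta:.3f} vs baseline")
```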
3. Enable Diff Output Mode
Enable Diff Output Mode to visually compare the textual outputs of experiments.
Highlights insertions, deletions, and changes compared to a baseline experiment.
Makes it easy to see how model responses evolve across runs.
Useful for spotting subtle but important output differences that metrics alone may miss.
This helps you understand not just whether results changed, but exactly how they changed.
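Under the hood, an output diff is a standard text-diffing problem. As a rough approximation of the idea, the sketch below uses Python's difflib to produce a word-level diff of a candidate response against a baseline response; both strings are hypothetical examples, not real experiment outputs.

```python
# Minimal sketch of diffing one output against a baseline with difflib.
import difflib

baseline_output = "The invoice was paid on March 3 by the customer."
candidate_output = "The invoice was paid on March 5 by the customer via bank transfer."

# Split into words so insertions and deletions show up at the word level.
diff = difflib.unified_diff(
    baseline_output.split(),
    candidate_output.split(),
    fromfile="baseline",
    tofile="candidate",
    lineterm="",
)
print("\n".join(diff))
```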