07.09.2025: Baseline for Experiment Comparisons

Available in Phoenix 11.4+

You can now set a baseline run when comparing multiple experiments. This is especially useful when one run represents a known-good output (e.g. a previous model version or a CI-approved run), and you want to evaluate changes relative to it.

For example, in an evaluation like accuracy, you can easily see where the value flipped from correct → incorrect or incorrect → correct between your baseline and the current comparison, helping you quickly spot regressions or improvements.
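The flip logic described above can be sketched in plain Python. This is an illustrative example, not the Phoenix API: `classify_flips` and the dict-of-booleans representation are assumptions made for the sketch.

```python
# Illustrative sketch (not the Phoenix API): classify per-example evaluation
# flips between a baseline run and a comparison run, the way the comparison
# view highlights them for an accuracy-style eval.

def classify_flips(baseline: dict, comparison: dict) -> dict:
    """Map each example id to 'regression', 'improvement', or 'unchanged'.

    `baseline` and `comparison` map example ids to booleans (True = correct).
    Only ids present in both runs are compared.
    """
    flips = {}
    for example_id in baseline.keys() & comparison.keys():
        was_correct = baseline[example_id]
        is_correct = comparison[example_id]
        if was_correct and not is_correct:
            flips[example_id] = "regression"   # correct -> incorrect
        elif not was_correct and is_correct:
            flips[example_id] = "improvement"  # incorrect -> correct
        else:
            flips[example_id] = "unchanged"
    return flips

baseline_run = {"ex1": True, "ex2": True, "ex3": False}
comparison_run = {"ex1": True, "ex2": False, "ex3": True}
print(classify_flips(baseline_run, comparison_run))
```

Here `ex2` would surface as a regression and `ex3` as an improvement relative to the baseline.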

This feature makes it easier to isolate the impact of changes like a new prompt, model, or dataset.
