Available in Phoenix 11.4+
You can now set a baseline run when comparing multiple experiments. This is especially useful when one run represents a known-good output (e.g. a previous model version or a CI-approved run), and you want to evaluate changes relative to it.
For example, in an evaluation like accuracy, you can easily see where the value flipped from correct → incorrect or incorrect → correct between your baseline and the current comparison, helping you quickly spot regressions or improvements.
This feature makes it easier to isolate the impact of changes like a new prompt, model, or dataset.
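The flip detection described above can be sketched in plain Python. This is an illustrative example, not the Phoenix API: the example IDs and per-example correctness labels below are hypothetical stand-ins for the results of a baseline run and a comparison run.

```python
# Hypothetical per-example accuracy labels from two experiment runs.
baseline = {"ex-1": True, "ex-2": True, "ex-3": False, "ex-4": False}
candidate = {"ex-1": True, "ex-2": False, "ex-3": True, "ex-4": False}

# correct → incorrect: regressions relative to the baseline
regressions = [ex for ex in baseline if baseline[ex] and not candidate[ex]]

# incorrect → correct: improvements relative to the baseline
improvements = [ex for ex in baseline if not baseline[ex] and candidate[ex]]

print("correct → incorrect:", regressions)   # ['ex-2']
print("incorrect → correct:", improvements)  # ['ex-3']
```

Examples absent from either direction of the flip (unchanged results) are ignored, which is exactly what makes a baseline comparison useful: only the deltas need review.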