This video introduces Arize Phoenix Datasets and Experiments 🚀, walking through a text-to-SQL use case.
The velocity of AI application development is often bottlenecked by high-quality evaluations, because engineers face hard tradeoffs: which prompt or LLM best balances performance, latency, and cost? Quality evaluations are critical because they help answer these questions with greater confidence.
🗄 Datasets are collections of Examples. An Example contains the Inputs to an AI Task and, optionally, an expected or reference Output.
👩‍🔬 Experiments are run on Examples to evaluate whether a given Task produces better Outputs.
With arize-phoenix, Datasets are:
🔃 Integrated. Datasets are integrated with the platform, so you can add production spans to datasets, use datasets to run experiments, and use metadata to track different segments and use-cases.
🕰 Versioned. Every insert, update, and delete is versioned, so you can pin experiments and evaluations to a specific version of a dataset and track changes over time.
🧘‍♀️ Flexible. Supports KV, LLM, and Chat formats, as well as OpenAI fine-tuning (ft) and OpenAI Evals.
✏ Tracked. Dataset examples track their source spans, so you always know where your data came from.
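For a sense of what this looks like in code, here is a minimal sketch of uploading a text-to-SQL dataset with the Phoenix client. The dataset name and columns are made up for illustration, and the upload_dataset parameters follow the Phoenix docs at the time of writing; check the docs for the exact signature in your version.

```python
import pandas as pd
import phoenix as px

# Each row is one Example: an input question plus an optional reference SQL output.
df = pd.DataFrame(
    {
        "question": [
            "Which artist has the most tracks?",
            "How many albums were released in 2023?",
        ],
        "reference_sql": [
            "SELECT artist, COUNT(*) FROM tracks GROUP BY artist ORDER BY COUNT(*) DESC LIMIT 1;",
            "SELECT COUNT(*) FROM albums WHERE release_year = 2023;",
        ],
    }
)

client = px.Client()  # assumes a running Phoenix instance
dataset = client.upload_dataset(
    dataset_name="text-to-sql-examples",  # hypothetical dataset name
    dataframe=df,
    input_keys=["question"],        # columns treated as Example Inputs
    output_keys=["reference_sql"],  # columns treated as reference Outputs
)
```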
Experiments build on Datasets. They are:
🕰 Versioned. Every experiment tracks a dataset version.
📊 Analyzed. Tracks latency, error rate, cost, and scores.
🧠 Evaluated. Built-in LLM and code evaluators.
⚡ Blazing fast. Optimized for concurrency ⚡️
🕵️‍♀️ Explainable. All evals are traced, with explanations built in.
⚙ Custom. Custom evals are just functions, and they run alongside the built-in LLM evaluators.
🔭 Traced. Traces the internal steps of your tasks and evaluations.
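Here is a minimal sketch of running an experiment over the dataset uploaded above. The run_experiment import path and its pattern of binding task/evaluator arguments by name follow the Phoenix docs at the time of writing; the task body is a stub, and the exact-match evaluator and experiment name are hypothetical examples you would replace with your own logic.

```python
from phoenix.experiments import run_experiment


def text_to_sql(input) -> str:
    """The Task: generate SQL for an Example's input question.
    The body is a stub -- swap in your own prompt + LLM call."""
    question = input["question"]
    return f"SELECT 1 -- TODO: generate SQL for: {question}"


def matches_reference(output, expected) -> bool:
    """A custom code evaluator: naive exact match against the reference SQL."""
    return output.strip().lower() == expected["reference_sql"].strip().lower()


experiment = run_experiment(
    dataset,                        # the Dataset uploaded above
    text_to_sql,                    # the Task run against every Example
    evaluators=[matches_reference],
    experiment_name="text-to-sql-baseline",  # hypothetical name
)
```

Each run is traced, so you can inspect the task's internal steps and each evaluator's score and explanation in the Phoenix UI.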
As always, Phoenix is fully OSS, 🔐 fully private, and can be self-hosted.
Don't forget to give us a ⭐ to support the project!
Learn more about datasets and experiments with Phoenix: https://docs.arize.com/phoenix/datasets-and-experiments/overview-datasets
Phoenix: https://phoenix.arize.com/