Introducing Phoenix Datasets and Experiments

This video introduces Arize Phoenix Datasets and Experiments πŸš€, walking through a text-to-SQL use case.

The velocity of AI application development is often bottlenecked by the need for high-quality evaluations, because engineers are often faced with hard tradeoffs: which prompt or LLM best balances performance, latency, and cost? Quality evaluations are critical because they help answer these types of questions with greater confidence.

πŸ—„ Datasets are collections of Examples. An Example contains Inputs to an AI Task and, optionally, an expected or reference Output.

πŸ‘©β€πŸ”¬ Experiments are Run on Examples to Evaluate if a given Task produces better Outputs.

With arize-phoenix, Datasets are:
πŸ”ƒ Integrated. Datasets are integrated with the platform, so you can add production spans to datasets, use datasets to run experiments, and use metadata to track different segments and use-cases.
πŸ•° Versioned. Every insert, update, and delete is versioned, so you can pin experiments and evaluations to a specific version of a dataset and track changes over time.
πŸ§˜β€β™€οΈ Flexible. Support for KV, LLM, Chat, OpenAI Ft, OpenAI Evals
✏ Tracked. Dataset examples track their source spans, so you always know where the data came from.

Experiments build on Datasets. They are:
πŸ•° Versioned. Every experiment tracks a dataset version.
πŸ“Š Analyzed. Tracks latency, error rate, cost, and scores.
🧠 Evaluated. Built-in LLM and code evaluators.
⚑ Blazing fast. Optimized for concurrency. ⚑️
πŸ•΅β€β™€οΈ Explainable. All evals are traced with explanations built-in
βš™ Custom. Custom evals are just functions.
πŸ”­ Traced. Traces the internal steps of your tasks and evaluations.
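Putting the pieces together, an experiment run boils down to applying a task to each example and scoring the result with an evaluator. The sketch below is illustrative only (the `task` and `exact_match` names are made up for this example); the real Phoenix implementation additionally records dataset versions, traces, latency, and errors:

```python
def task(inputs: dict) -> str:
    """Stand-in for the AI task under test, e.g. text-to-SQL generation."""
    return "SELECT COUNT(*) FROM albums"

def exact_match(output: str, expected: str) -> float:
    """A code evaluator is just a function that scores an output."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

examples = [
    {"inputs": {"question": "How many albums are there?"},
     "expected": "SELECT COUNT(*) FROM albums"},
]

# An experiment runs the task on every example and evaluates each output.
results = []
for ex in examples:
    output = task(ex["inputs"])
    results.append({"output": output,
                    "score": exact_match(output, ex["expected"])})
```

Because an evaluator is just a function, swapping in an LLM-as-judge evaluator or a custom metric is a one-line change.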

As per usual, Phoenix is fully OSS, πŸ” fully private, and can be self-hosted.

Don't forget to give us a ⭐ to support the project!

Learn more about datasets and experiments with Phoenix in the docs.

Subscribe to our resources and blogs