This video introduces Arize Phoenix Datasets and Experiments, walking through a text-to-SQL use case.
The velocity of AI application development is often bottlenecked by high-quality evaluations, because engineers are often faced with hard tradeoffs: which prompt or LLM best balances performance, latency, and cost? Quality evaluations are critical, as they help answer these types of questions with greater confidence.
Datasets are collections of Examples. An Example contains the Inputs to an AI Task and, optionally, an expected or reference Output.
Experiments are run on Examples to evaluate whether a given Task produces better Outputs.
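For the text-to-SQL walkthrough, an Example might look roughly like this (an illustrative sketch only; the field names below are not Phoenix's exact wire format):

```python
# An illustrative Example for the text-to-SQL use case: the input is a
# natural-language question, and the expected output is a reference SQL
# query that a Task's output can be compared against.
example = {
    "input": {"question": "How many tracks does each album have?"},
    "expected": {"query": "SELECT album_id, COUNT(*) FROM tracks GROUP BY album_id;"},
}
```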
With arize-phoenix, Datasets are (see the code sketch after this list):
Integrated. Datasets are integrated with the platform, so you can add production spans to datasets, use datasets to run experiments, and use metadata to track different segments and use cases.
Versioned. Every insert, update, and delete is versioned, so you can pin experiments and evaluations to a specific version of a dataset and track changes over time.
Flexible. Support for key-value (KV), LLM, chat, OpenAI fine-tuning, and OpenAI Evals formats.
Tracked. Dataset examples track their source spans, so you always know where the data came from.
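Here's roughly what creating a dataset looks like in code (a minimal sketch assuming the px.Client().upload_dataset API; exact parameter names may differ slightly between Phoenix versions):

```python
import pandas as pd
import phoenix as px

# A small dataframe of text-to-SQL examples: questions in, reference SQL out.
df = pd.DataFrame(
    {
        "question": [
            "How many artists are in the database?",
            "List the five longest tracks.",
        ],
        "query": [
            "SELECT COUNT(*) FROM artists;",
            "SELECT name FROM tracks ORDER BY milliseconds DESC LIMIT 5;",
        ],
    }
)

# Upload the dataframe as a versioned dataset; input_keys/output_keys tell
# Phoenix which columns are task inputs and which are expected outputs.
dataset = px.Client().upload_dataset(
    dataset_name="text-to-sql-examples",
    dataframe=df,
    input_keys=["question"],
    output_keys=["query"],
)
```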
Experiments build on Datasets (see the code sketch after this list). They are:
Versioned. Every experiment tracks a dataset version.
Analyzed. Tracks latency, error rate, cost, and scores.
Evaluated. Built-in LLM and code evaluators.
Blazing fast. Optimized for concurrency.
Explainable. All evals are traced, with explanations built in.
Custom. Custom evals are just functions, alongside the built-in LLM evaluators.
Traced. Traces the internal steps of your tasks and evaluations.
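And here's roughly how an experiment over that dataset might be run (a minimal sketch assuming the phoenix.experiments.run_experiment API; generate_sql is a placeholder for your own prompt/LLM call, and task/evaluator signatures may vary by version):

```python
from phoenix.experiments import run_experiment

# The task is just a function of an example's input; generate_sql is a
# placeholder for whatever prompt + LLM combination you want to test.
def text_to_sql(input):
    return generate_sql(input["question"])

# Custom evaluators are plain functions too; Phoenix binds arguments such as
# `output` and `expected` by parameter name and traces each evaluation.
def matches_reference(output, expected):
    return output.strip().lower() == expected["query"].strip().lower()

experiment = run_experiment(
    dataset,                        # the versioned dataset uploaded above
    text_to_sql,                    # the task under test
    evaluators=[matches_reference],
)
```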
As always, Phoenix is fully open source, fully private, and can be self-hosted.
Don't forget to give us a star to support the project!
Learn more about datasets and experiments with Phoenix: https://docs.arize.com/phoenix/datasets-and-experiments/overview-datasets
Phoenix: https://phoenix.arize.com/