Experiments

Test and validate your LLM applications

Experiments help developers systematically test changes to their LLM applications against a curated dataset. Each experiment run is stored independently, so you can compare runs and measure the impact of changes over time.

Quickstart: Experiments

Key features

Components of an Experiment:

Datasets

A dataset is a collection of examples for evaluating your application. It is commonly represented as a pandas DataFrame or as a list of dictionaries. Each example can contain input messages, expected outputs, metadata, or any other tabular data you would like to observe and test.
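As a rough illustration, a small dataset in the list-of-dictionaries form might look like the sketch below. The column names `input`, `expected_output`, and `metadata` are assumptions chosen for this example, not a required schema.

```python
# Hypothetical example dataset: each dictionary is one example (one row).
# The keys "input", "expected_output", and "metadata" are illustrative only.
dataset = [
    {
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "metadata": {"topic": "geography"},
    },
    {
        "input": "What is 2 + 2?",
        "expected_output": "4",
        "metadata": {"topic": "arithmetic"},
    },
    {
        "input": "What is the capital of Japan?",
        "expected_output": "Tokyo",
        "metadata": {"topic": "geography"},
    },
]

# Every example shares the same keys, which is what makes the data tabular.
columns = sorted(dataset[0].keys())
is_tabular = all(sorted(row.keys()) == columns for row in dataset)
```

Because every row has the same keys, a list like this can also be passed to `pandas.DataFrame(dataset)` if you prefer to work with a DataFrame.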

Tasks


A task is any function that you want to test on a dataset. Usually, this task replicates LLM functionality.
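A minimal sketch of a task, assuming each example is a dictionary with an `input` key. The function name `answer_question` is hypothetical, and a canned lookup stands in for the real LLM call so the example runs without any external service.

```python
def answer_question(example: dict) -> str:
    """Hypothetical task: maps one dataset example to an output string.

    In a real application this would call your LLM or pipeline; here a
    canned lookup stands in for the model so the sketch is runnable.
    """
    canned_answers = {
        "What is the capital of France?": "Paris",
        "What is 2 + 2?": "The answer is 4.",
    }
    # Fall back to a fixed string when the stand-in "model" has no answer.
    return canned_answers.get(example["input"], "I don't know")


# Running the task over a dataset produces one output per example.
examples = [
    {"input": "What is the capital of France?"},
    {"input": "What is the capital of Japan?"},
]
outputs = [answer_question(example) for example in examples]
```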

Evaluators


An evaluator is a function that takes the output of a task and provides an assessment.

It serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations.
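To make the idea concrete, here is a sketch of two code-based evaluators applied to the same task outputs. The function names, their signatures, and the sample outputs are assumptions for illustration; an LLM-based judge would have the same overall shape but would call a model to produce the score.

```python
def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected answer, ignoring case and surrounding whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def contains_expected(output: str, expected: str) -> float:
    """Score 1.0 when the expected answer appears anywhere in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0


# Hypothetical task outputs paired with the expected answers from a dataset.
outputs = ["Paris", "The answer is 4.", "I don't know"]
expected = ["Paris", "4", "Tokyo"]

# Average each evaluator's score over all examples.
scores = {
    evaluator.__name__: sum(evaluator(o, e) for o, e in zip(outputs, expected)) / len(outputs)
    for evaluator in (exact_match, contains_expected)
}
```

The two evaluators disagree on the second example: the strict check rejects "The answer is 4." while the lenient one accepts it. Running several evaluators side by side surfaces this kind of difference.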

Learn More

Quickstart

Create your first experiment

Learn about Evals

Understand where to deploy different kinds of evals

Dive into a notebook

Look at end-to-end examples of Agents, RAG, and Voice.

Last updated 15 days ago
