Experiments
Test and validate your LLM applications
Experiments help developers systematically test changes in their LLM applications using a curated dataset. Each experiment run is stored independently to measure the impact of changes over time.
Components of an Experiment:

Datasets
A dataset is a collection of examples for evaluating your application. It is commonly represented as a pandas DataFrame or as a list of dictionaries. Each example can contain input messages, expected outputs, metadata, or any other tabular data you would like to observe and test.
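To make the shape of a dataset concrete, here is a minimal sketch in plain Python. The field names (`input`, `expected`, `metadata`) and the example records are illustrative assumptions, not a schema required by any particular library.

```python
# Hypothetical example records. The keys "input", "expected", and "metadata"
# are assumptions chosen for illustration, not a required schema.
examples = [
    {
        "input": {"question": "What is the capital of France?"},
        "expected": {"answer": "Paris"},
        "metadata": {"topic": "geography"},
    },
    {
        "input": {"question": "What is 2 + 2?"},
        "expected": {"answer": "4"},
        "metadata": {"topic": "arithmetic"},
    },
]


def column(rows, key):
    """Return one field from every example, like selecting a table column."""
    return [row[key] for row in rows]


questions = [item["question"] for item in column(examples, "input")]
topics = [item["topic"] for item in column(examples, "metadata")]
```

If you prefer a tabular view, a list of dictionaries like this can be passed directly to `pandas.DataFrame(examples)`.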
Tasks
A task is any function that you want to test on a dataset. Usually, this task replicates LLM functionality.
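As a rough sketch of what a task can look like, the function below stands in for an LLM call with a canned lookup so the example stays runnable offline. The names `answer_question` and `run_task`, the dataset fields, and the canned answers are all hypothetical; in a real application the task body would call your model or chain.

```python
# Hypothetical two-example dataset; the field names are illustrative assumptions.
dataset = [
    {"input": {"question": "What is the capital of France?"}, "expected": "Paris"},
    {"input": {"question": "What is 2 + 2?"}, "expected": "4"},
]


def answer_question(example_input: dict) -> str:
    """Hypothetical task: a stand-in for the LLM call you want to test."""
    canned_answers = {"What is the capital of France?": "Paris"}
    return canned_answers.get(example_input["question"], "I don't know")


def run_task(task, examples):
    """Apply the task to every example and keep the output next to its input."""
    return [
        {
            "input": example["input"],
            "expected": example["expected"],
            "output": task(example["input"]),
        }
        for example in examples
    ]


results = run_task(answer_question, dataset)
```

Keeping each output alongside its input and expected value makes the results easy to pass to an evaluator afterwards.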
Evaluators
An evaluator is a function that takes the output of a task and provides an assessment.
It serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations.
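The sketch below shows two simple code-based evaluators applied to the same task outputs. The function names, the `runs` records, and the 0.0 to 1.0 scoring convention are assumptions made for illustration; an LLM-based judge would have the same shape but call a model inside the function.

```python
def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected text, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def contains_expected(output: str, expected: str) -> float:
    """Score 1.0 when the expected text appears anywhere in the output."""
    return 1.0 if expected.strip().lower() in output.lower() else 0.0


# Hypothetical task outputs paired with their expected values.
runs = [
    {"output": "Paris", "expected": "Paris"},
    {"output": "The answer is 4.", "expected": "4"},
]

# Several evaluators can score the same runs independently.
evaluators = {"exact_match": exact_match, "contains_expected": contains_expected}

# Average each evaluator's score across all runs.
mean_scores = {
    name: sum(fn(run["output"], run["expected"]) for run in runs) / len(runs)
    for name, fn in evaluators.items()
}
```

Here the stricter evaluator penalizes the verbose second answer while the lenient one accepts it, which is one reason to define more than one measure of success.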