This tutorial sets up an experiment to determine which LlamaIndex query engine is preferred by an evaluation LLM. Using the PairwiseEvaluator
module, we compare responses from different engines and identify which one produces more helpful or relevant outputs.
We will go through key code snippets on this page. To follow the full tutorial, check out the Colab notebook above.
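The snippets below assume roughly the following setup. This is a sketch: exact import paths can differ across phoenix and llama-index versions, and a Phoenix instance must be reachable (for example via px.launch_app()) before the client calls succeed.

from time import time_ns
from typing import Tuple

import pandas as pd
import phoenix as px
from phoenix.experiments import evaluate_experiment, run_experiment
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI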
Here, we will grab 7 creative-writing examples from the databricks-dolly-15k dataset on Hugging Face and upload them to Phoenix as a dataset.
# Sample a small number of creative-writing examples from the Dolly dataset
sample_size = 7
category = "creative_writing"
url = "hf://datasets/databricks/databricks-dolly-15k/databricks-dolly-15k.jsonl"
df = pd.read_json(url, lines=True)
df = df.loc[df.category == category, ["instruction", "response"]]
df = df.sample(sample_size, random_state=42)

# Upload the sampled examples to Phoenix as a dataset
dataset = px.Client().upload_dataset(
    dataset_name=f"{category}_{time_ns()}",
    dataframe=df,
)
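Before relying on the sample, it can be worth a quick sanity check that the dataframe holds the expected instruction/response pairs (optional; just standard pandas):

print(len(df))    # 7 rows, one per sampled example
print(df.head())  # each row has an "instruction" and a "response" column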
The task function can be either sync or async.
# The task invokes the LLM on each dataset example's instruction
async def task(input):
    return (await OpenAI(model="gpt-3.5-turbo").acomplete(input["instruction"])).text
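Because tasks may also be synchronous, an equivalent sync version (a hypothetical task_sync, using the same model) would look like this:

def task_sync(input):
    # same call as above, but blocking instead of awaited
    return OpenAI(model="gpt-3.5-turbo").complete(input["instruction"]).text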
Conduct a dry-run experiment on 3 randomly selected examples.
experiment = run_experiment(dataset, task, dry_run=3)
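Once the dry run looks reasonable, re-running without the dry_run flag executes the task on every example in the dataset (a sketch using only the arguments already shown above):

experiment = run_experiment(dataset, task)  # full run over all uploaded examples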
Evaluators can be sync or async. The function arguments output and expected refer to the attributes of the same name in the ExperimentRun data structure shown above.
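For example, a minimal (hypothetical) evaluator that only needs output and expected can declare just those two arguments and return a bare score:

def exact_match(output, expected) -> bool:
    # trivial illustration: did the task reproduce the reference response verbatim?
    return output == expected["response"]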
The PairwiseEvaluator module in LlamaIndex (the PairwiseComparisonEvaluator class used below) compares two outputs side by side and determines which one is preferred.
This setup allows you to:
Run automated A/B tests on different LlamaIndex query engine configurations
Capture LLM-based preference data to guide iteration
Aggregate pairwise win rates and qualitative feedback
# Compare the task's output against the reference response from the dataset
llm = OpenAI(temperature=0, model="gpt-4o")

async def pairwise(output, input, expected) -> Tuple[float, str]:
    ans = await PairwiseComparisonEvaluator(llm=llm).aevaluate(
        query=input["instruction"],
        response=output,                        # the task's output for this example
        second_response=expected["response"],   # the reference response from the dataset
    )
    return ans.score, ans.feedback
evaluators = [pairwise]

# Attach the evaluator to the existing experiment runs and record the results in Phoenix
experiment = evaluate_experiment(experiment, evaluators)
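The PairwiseComparisonEvaluator scores 1.0 when the first response is preferred, 0.0 when the second is preferred, and 0.5 for a tie, so the mean score across runs is a win rate for the task's outputs (with ties counting as half). The helper below is a hypothetical sketch that assumes you have gathered the pairwise scores into a list, for example by exporting the experiment's evaluation results from Phoenix:

def win_rate(scores):
    # mean pairwise score: 1.0 = task output preferred, 0.0 = reference preferred, 0.5 = tie
    return sum(scores) / len(scores)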