Comparing LlamaIndex Query Engines with a Pairwise Evaluator
This tutorial sets up an experiment to determine which LlamaIndex query engine an evaluation LLM prefers. Using LlamaIndex's PairwiseComparisonEvaluator, we compare responses from different engines and identify which one produces more helpful or relevant outputs.
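To see what the evaluator does in isolation, here is a minimal sketch that compares two candidate answers to the same query outside of the experiment harness. The query and response strings are illustrative placeholders, not taken from the tutorial dataset.
# Minimal sketch: compare two candidate responses to a single query.
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI

judge = OpenAI(temperature=0, model="gpt-4o")
evaluator = PairwiseComparisonEvaluator(llm=judge)

result = evaluator.evaluate(
    query="Write a two-line poem about autumn.",
    response="Leaves drift down in amber light, / whispering summer's soft goodnight.",
    second_response="Autumn is a season. It comes after summer.",
)
print(result.score)     # typically 1.0 if the first response wins, 0.0 if the second, 0.5 for a tie
print(result.feedback)  # the judge LLM's explanation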
Notebook Walkthrough
We will go through key code snippets on this page. To follow the full tutorial, check out the Colab notebook above.
Upload Dataset to Phoenix
Here, we sample 7 creative-writing examples from the databricks-dolly-15k dataset on Hugging Face and upload them to Phoenix.
from time import time_ns

import pandas as pd
import phoenix as px

# Sample a handful of creative-writing rows from databricks-dolly-15k
sample_size = 7
category = "creative_writing"
url = "hf://datasets/databricks/databricks-dolly-15k/databricks-dolly-15k.jsonl"
df = pd.read_json(url, lines=True)
df = df.loc[df.category == category, ["instruction", "response"]]
df = df.sample(sample_size, random_state=42)

# Upload the sampled dataframe to Phoenix as a named dataset
dataset = px.Client().upload_dataset(
    dataset_name=f"{category}_{time_ns()}",
    dataframe=df,
)
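As a quick sanity check (not part of the original notebook), you can inspect the sampled dataframe to confirm it contains the instruction and response columns that the task and evaluator below rely on.
# Optional sanity check: inspect the sampled rows
print(df.shape)  # (7, 2)
print(df.head(3)[["instruction", "response"]])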
Define Task Function
The task function can be either sync or async; its input argument contains the example's input fields, in this case the instruction.
from llama_index.llms.openai import OpenAI

# Async task: answer each instruction with gpt-3.5-turbo
async def task(input):
    return (await OpenAI(model="gpt-3.5-turbo").acomplete(input["instruction"])).text
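Since tasks can also be synchronous, here is an equivalent sync sketch using the blocking complete call (assuming the same gpt-3.5-turbo model); the notebook itself uses the async version.
# Equivalent synchronous task (sketch)
def sync_task(input):
    return OpenAI(model="gpt-3.5-turbo").complete(input["instruction"]).text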
Dry-Run Experiment
Conduct a dry-run experiment on 3 randomly selected examples.
from phoenix.experiments import run_experiment

experiment = run_experiment(dataset, task, dry_run=3)
Define Evaluators For Each Experiment Run
Evaluators can be sync or async. The function arguments output and expected refer to the attributes of the same name in the ExperimentRun data structure shown above.
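To illustrate how these arguments are wired up, here is a minimal hypothetical evaluator that scores the generated output against the reference response by length alone; it is not part of the tutorial, just a sketch of the function signature Phoenix expects.
# Hypothetical evaluator sketch: 1.0 if the output is non-empty and no more than
# twice the length of the reference response, else 0.0.
def rough_length_check(output, expected) -> float:
    reference = expected["response"]
    return float(bool(output) and len(output) <= 2 * len(reference))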
LlamaIndex's PairwiseComparisonEvaluator compares two responses to the same query side by side and asks a judge LLM which one it prefers.
This setup allows you to:
Run automated A/B tests on different LlamaIndex query engine configurations
Capture LLM-based preference data to guide iteration
Aggregate pairwise win rates and qualitative feedback
from typing import Tuple

from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
from phoenix.experiments import evaluate_experiment

# Judge LLM used for the pairwise comparison
llm = OpenAI(temperature=0, model="gpt-4o")

# Compare the task's output against the dataset's reference response
async def pairwise(output, input, expected) -> Tuple[float, str]:
    ans = await PairwiseComparisonEvaluator(llm=llm).aevaluate(
        query=input["instruction"],
        response=output,
        second_response=expected["response"],
    )
    return ans.score, ans.feedback  # score and explanation

evaluators = [pairwise]
experiment = evaluate_experiment(experiment, evaluators)
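PairwiseComparisonEvaluator typically scores 1.0 when the first response is preferred, 0.0 when the second is preferred, and 0.5 for a tie, so a simple win rate can be computed once the per-example scores are collected. The sketch below assumes you have gathered those scores into a list; how you extract them (from the Phoenix UI or the experiment results) is covered in the notebook.
# Hypothetical aggregation sketch: compute a win rate from collected pairwise scores.
# scores is assumed to be a list like [1.0, 0.5, 0.0, ...], one entry per example.
def win_rate(scores: list[float]) -> float:
    wins = sum(1 for s in scores if s == 1.0)
    ties = sum(1 for s in scores if s == 0.5)
    return (wins + 0.5 * ties) / len(scores)

print(win_rate([1.0, 0.5, 0.0, 1.0, 1.0, 0.5, 1.0]))  # ~0.71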
View Results in Phoenix