Comparing LlamaIndex Query Engines with a Pairwise Evaluator

This tutorial sets up an experiment to determine which LlamaIndex query engine an evaluation LLM prefers. Using LlamaIndex's PairwiseComparisonEvaluator, we compare responses from different engines and identify which one produces more helpful or relevant output.

See the Llama-Index notebook for more info.


Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the Colab notebook above.

Upload Dataset to Phoenix

Here, we grab 7 creative-writing examples from the databricks-dolly-15k dataset on Hugging Face.

from time import time_ns

import pandas as pd
import phoenix as px

# Sample 7 creative-writing instruction/response pairs from databricks-dolly-15k
sample_size = 7
category = "creative_writing"
url = "hf://datasets/databricks/databricks-dolly-15k/databricks-dolly-15k.jsonl"
df = pd.read_json(url, lines=True)
df = df.loc[df.category == category, ["instruction", "response"]]
df = df.sample(sample_size, random_state=42)

# Upload the sampled dataframe to Phoenix as a dataset
dataset = px.Client().upload_dataset(
    dataset_name=f"{category}_{time_ns()}",
    dataframe=df,
)

Define Task Function

The task function can be either sync or async.

from llama_index.llms.openai import OpenAI

async def task(input):
    return (await OpenAI(model="gpt-3.5-turbo").acomplete(input["instruction"])).text
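
For reference, a synchronous version of the same task simply uses the blocking complete call. The following is a minimal sketch, assuming the same llama_index OpenAI client; the sync_task name is illustrative.

def sync_task(input):
    # Synchronous variant of the task: same completion call without await
    return OpenAI(model="gpt-3.5-turbo").complete(input["instruction"]).text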

Dry-Run Experiment

Conduct a dry-run experiment on 3 randomly selected examples.

from phoenix.experiments import run_experiment
experiment = run_experiment(dataset, task, dry_run=3)
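
Once the dry run looks good, dropping the dry_run argument runs the task against every example in the uploaded dataset:

experiment = run_experiment(dataset, task)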

Define Evaluators For Each Experiment Run

Evaluators can be sync or async. Function arguments such as output and expected refer to the attributes of the same name on the ExperimentRun data structure.
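
For example, a minimal synchronous evaluator using only those arguments might look like the sketch below; contains_reference is a hypothetical name, and it assumes a plain boolean return value is accepted as a score.

def contains_reference(output, expected) -> bool:
    # True when the reference response text appears verbatim in the model output
    return expected["response"].lower() in output.lower()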

The PairwiseComparisonEvaluator in LlamaIndex is used to compare two outputs side by side and determine which one is preferred.

This setup allows you to:

  • Run automated A/B tests on different LlamaIndex query engine configurations

  • Capture LLM-based preference data to guide iteration

  • Aggregate pairwise win rates and qualitative feedback

from typing import Optional, Tuple

from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
from phoenix.experiments import evaluate_experiment

# Type aliases for the evaluator's return annotation
# (defined here for readability; the full notebook may define them differently)
Score = Optional[float]
Explanation = Optional[str]

llm = OpenAI(temperature=0, model="gpt-4o")


async def pairwise(output, input, expected) -> Tuple[Score, Explanation]:
    # Ask the judge LLM whether the task output or the dataset's reference
    # response better answers the instruction
    ans = await PairwiseComparisonEvaluator(llm=llm).aevaluate(
        query=input["instruction"],
        response=output,
        second_response=expected["response"],
    )
    return ans.score, ans.feedback


evaluators = [pairwise]
experiment = evaluate_experiment(experiment, evaluators)
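
Alternatively, the task and evaluators can be passed together in a single call once the dry run is no longer needed. The following is a sketch; the experiment_name value is only illustrative.

experiment = run_experiment(
    dataset,
    task,
    evaluators=evaluators,
    experiment_name=f"{category}_pairwise",
)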

View Results in Phoenix

Once the evaluation finishes, open the experiment in the Phoenix UI to inspect each run's pairwise score alongside the judge's explanation.
