Comparing LlamaIndex Query Engines with a Pairwise Evaluator

This tutorial sets up an experiment to determine which LlamaIndex query engine an evaluation LLM prefers. Using LlamaIndex's PairwiseComparisonEvaluator, we compare responses from different engines and identify which one produces more helpful or relevant output.

See the Llama-Index notebook for more info.


Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the Colab notebook above.

Upload Dataset to Phoenix

Here, we grab 7 creative-writing examples from the databricks-dolly-15k dataset on Hugging Face.

from time import time_ns

import pandas as pd
import phoenix as px

# Sample 7 creative-writing instruction/response pairs from databricks-dolly-15k
sample_size = 7
category = "creative_writing"
url = "hf://datasets/databricks/databricks-dolly-15k/databricks-dolly-15k.jsonl"
df = pd.read_json(url, lines=True)
df = df.loc[df.category == category, ["instruction", "response"]]
df = df.sample(sample_size, random_state=42)

# Upload the sampled dataframe to Phoenix as a dataset
dataset = px.Client().upload_dataset(
    dataset_name=f"{category}_{time_ns()}",
    dataframe=df,
)

Define Task Function

The task function can be either sync or async.

from llama_index.llms.openai import OpenAI

async def task(input):
    return (await OpenAI(model="gpt-3.5-turbo").acomplete(input["instruction"])).text
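
For reference, a synchronous version of the same task simply uses the blocking complete call. The following is a minimal sketch, assuming the same llama_index OpenAI client; the sync_task name is illustrative.

def sync_task(input):
    # Synchronous variant of the task: same completion call without await
    return OpenAI(model="gpt-3.5-turbo").complete(input["instruction"]).text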

Dry-Run Experiment

Conduct a dry-run experiment on 3 randomly selected examples.

from phoenix.experiments import run_experiment
experiment = run_experiment(dataset, task, dry_run=3)
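
Once the dry run looks good, dropping the dry_run argument runs the task against every example in the uploaded dataset:

experiment = run_experiment(dataset, task)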

Define Evaluators For Each Experiment Run

Evaluators can be sync or async. Function arguments such as output and expected refer to the attributes of the same name on the ExperimentRun data structure.
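
For example, a minimal synchronous evaluator using only those arguments might look like the sketch below; contains_reference is a hypothetical name, and it assumes a plain boolean return value is accepted as a score.

def contains_reference(output, expected) -> bool:
    # True when the reference response text appears verbatim in the model output
    return expected["response"].lower() in output.lower()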

The PairwiseComparisonEvaluator in LlamaIndex is used to compare two outputs side by side and determine which one is preferred.

This setup allows you to:

  • Run automated A/B tests on different LlamaIndex query engine configurations

  • Capture LLM-based preference data to guide iteration

  • Aggregate pairwise win rates and qualitative feedback

from typing import Optional, Tuple

from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
from phoenix.experiments import evaluate_experiment

# Type aliases for the evaluator's return annotation
# (defined here for readability; the full notebook may define them differently)
Score = Optional[float]
Explanation = Optional[str]

llm = OpenAI(temperature=0, model="gpt-4o")


async def pairwise(output, input, expected) -> Tuple[Score, Explanation]:
    # Ask the judge LLM whether the task output or the dataset's reference
    # response better answers the instruction
    ans = await PairwiseComparisonEvaluator(llm=llm).aevaluate(
        query=input["instruction"],
        response=output,
        second_response=expected["response"],
    )
    return ans.score, ans.feedback


evaluators = [pairwise]
experiment = evaluate_experiment(experiment, evaluators)
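
Alternatively, the task and evaluators can be passed together in a single call once the dry run is no longer needed. The following is a sketch; the experiment_name value is only illustrative.

experiment = run_experiment(
    dataset,
    task,
    evaluators=evaluators,
    experiment_name=f"{category}_pairwise",
)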

View Results in Phoenix

Once the evaluation finishes, open the experiment in the Phoenix UI to inspect each run's pairwise score alongside the judge's explanation.
