Summarize emails by testing prompts and models with Jaro-Winkler-based evaluation.
Imagine you're deploying a service that condenses emails into concise summaries. One challenge of using LLMs for summarization is that even the best models can miscategorize key details, or miss those details entirely.
In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that accurately summarizes your emails. You will:
Upload a dataset of examples containing emails to Phoenix
Define an experiment task that extracts and formats the key details from those emails
Devise an evaluator measuring Jaro-Winkler Similarity
Run experiments to iterate on your prompt template and to compare the summaries produced by different LLMs
We will go through key code snippets on this page. To follow the full tutorial, check out the Colab notebook above.
First, we need to set up our instrumentors to capture traces from the agent. Since our task runs through LangChain, which in turn calls the OpenAI API, we'll enable both the LangChain and OpenAI instrumentors.
from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

tracer_provider = register(endpoint="your-endpoint-here")
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
Experiments in Phoenix are made up of 3 elements: a dataset, a task, and an evaluator. The dataset is a collection of the inputs and expected outputs that we'll use to evaluate. The task is an operation that should be performed on each input. Finally, the evaluator compares the result against an expected output.
For this example, here's what each looks like (a sketch of how the pieces fit together follows this list):
Dataset: a dataframe of emails to analyze, and the expected output for our agent
Task: a LangChain agent that extracts key info from our input emails. The result of this task will then be compared against the expected output
Evaluator: a Jaro-Winkler similarity calculation comparing the task's output to the expected output
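Before building each piece, here's a minimal, self-contained sketch of how the three elements plug together in code. Everything in it is illustrative: the toy dataset, the uppercase_task, and the exact_match evaluator are placeholders (not part of this tutorial's email example), and a running Phoenix instance reachable from px.Client() is assumed.

import phoenix as px
from phoenix.experiments import evaluate_experiment, run_experiment

# A toy dataset: each example pairs an input with an expected output
toy_dataset = px.Client().upload_dataset(
    dataset_name="experiment-structure-demo",
    inputs=[{"text": "hello"}, {"text": "world"}],
    outputs=[{"text": "HELLO"}, {"text": "WORLD"}],
)

# Task: the operation to perform on each example's input
def uppercase_task(input) -> str:
    return input["text"].upper()

# Evaluator: scores the task's output against the expected output
def exact_match(output, expected) -> float:
    return float(output == expected["text"])

experiment = run_experiment(toy_dataset, uppercase_task)
evaluate_experiment(experiment, exact_match)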
We've prepared some example emails and expected responses that we can use to evaluate our two models. Let's download those and save them to a temporary file. Then, we will upload the dataset to Phoenix.
import tempfile
from datetime import datetime, timezone

import pandas as pd
import phoenix as px
from langchain_benchmarks import download_public_dataset, registry

dataset_name = "Email Extraction"

# Download the public email extraction dataset and load it into a dataframe
with tempfile.NamedTemporaryFile(suffix=".json") as f:
    download_public_dataset(registry[dataset_name].dataset_id, path=f.name)
    df = pd.read_json(f.name)[["inputs", "outputs"]]
df = df.sample(10, random_state=42)

# Upload the examples to Phoenix as a dataset
dataset = px.Client().upload_dataset(
    dataset_name=f"{dataset_name}{datetime.now(timezone.utc)}",
    inputs=df.inputs,
    outputs=df.outputs.map(lambda obj: obj["output"]),
)
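If you'd like to sanity-check the data before running any experiments, you can peek at a single row of the dataframe. The inputs column holds the raw email payloads and the outputs column holds the reference extractions used above.

# Look at one example: the email we pass in and the extraction we expect back
print(df.inputs.iloc[0])
print(df.outputs.iloc[0]["output"])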
The picture below shows how your dataset examples will appear in the Phoenix UI.
Now we'll set up our LangChain agent. This is a straightforward chain that calls the specified model and formats its response as JSON.
from langchain_core.output_parsers.openai_functions import JsonOutputFunctionsParser
from langchain_openai import ChatOpenAI

model = "gpt-4o"

# Bind the benchmark's extraction schema to the model as an OpenAI function call
llm = ChatOpenAI(model=model).bind_functions(
    functions=[registry[dataset_name].schema],
    function_call=registry[dataset_name].schema.schema()["title"],
)

# Parse the function-call arguments into a JSON-serializable dict
output_parser = JsonOutputFunctionsParser()
extraction_chain = registry[dataset_name].instructions | llm | output_parser
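It can be helpful to try the chain on a single example before running the full experiment. This sketch reuses the first row of the dataframe from above; the exact fields you see depend on the benchmark's extraction schema.

import json

# Invoke the chain on one email and pretty-print the extracted fields
sample_output = extraction_chain.invoke(df.inputs.iloc[0])
print(json.dumps(sample_output, indent=2))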
Next, we need to define a Task for our experiment to use.
from phoenix.experiments import run_experiment

# The task runs the extraction chain on each example's input
def task(input) -> dict:
    return extraction_chain.invoke(input)

experiment = run_experiment(dataset, task)
Finally, we need to define our evaluation function. Here we'll use a Jaro-Winkler similarity function that scores how similar the output and expected text are. Jaro-Winkler similarity is a string-similarity metric based on the number of matching characters and transpositions between two strings, with extra weight given to a shared prefix; it ranges from 0 (completely different) to 1 (identical).
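For intuition about the metric itself, here's a quick check using the same jarowinkler package the evaluator relies on (the example strings are arbitrary):

import jarowinkler

# A single transposition barely lowers the score...
print(jarowinkler.jarowinkler_similarity("MARTHA", "MARHTA"))  # roughly 0.96

# ...while unrelated strings score much lower
print(jarowinkler.jarowinkler_similarity("MARTHA", "JONES"))

Our evaluator applies the same function to the JSON-serialized task output and expected output, with keys sorted so field order doesn't affect the score: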
import json

import jarowinkler
from phoenix.experiments import evaluate_experiment

def jarowinkler_similarity(output, expected) -> float:
    # Serialize both dicts with sorted keys before comparing the strings
    return jarowinkler.jarowinkler_similarity(
        json.dumps(output, sort_keys=True),
        json.dumps(expected, sort_keys=True),
    )

evaluate_experiment(experiment, jarowinkler_similarity)
Now we have scores on how well GPT-4o does at extracting email facts. This is helpful, but doesn't mean much on its own. Let's compare it against another model.
To compare results with another model, we simply need to redefine our task. Our dataset and evaluator can stay the same.
model = "gpt-3.5-turbo"

# Rebuild the chain with the new model; the schema, prompt, and parser are unchanged
llm = ChatOpenAI(model=model).bind_functions(
    functions=[registry[dataset_name].schema],
    function_call=registry[dataset_name].schema.schema()["title"],
)
extraction_chain = registry[dataset_name].instructions | llm | output_parser

def task(input) -> dict:
    return extraction_chain.invoke(input)

experiment = run_experiment(dataset, task)
evaluate_experiment(experiment, jarowinkler_similarity)
Now if you check your Phoenix experiment, you can compare Jaro-Winkler scores on a per-query basis and view aggregate model performance results. The experiment comparison screenshot below shows results from GPT-4o on the left and GPT-3.5-turbo on the far right. The higher the jarowinkler_similarity score, the closer the output is to the expected value.
You should see that GPT-4o outperforms its older cousin.
From here you could try out different models or iterate on your prompt, then run the same experiment with a modified Task to compare results.
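One way to streamline those comparisons is to wrap the chain construction in a small helper and loop over candidate models. This is only a sketch built on the objects defined above; make_task and the model list are illustrative and not part of the original tutorial.

def make_task(model_name: str):
    # Hypothetical helper: build an extraction task for a given model name
    llm = ChatOpenAI(model=model_name).bind_functions(
        functions=[registry[dataset_name].schema],
        function_call=registry[dataset_name].schema.schema()["title"],
    )
    chain = registry[dataset_name].instructions | llm | output_parser

    def task(input) -> dict:
        return chain.invoke(input)

    return task

# Run the same dataset and evaluator against each candidate model
for candidate in ["gpt-4o", "gpt-4o-mini"]:
    experiment = run_experiment(dataset, make_task(candidate))
    evaluate_experiment(experiment, jarowinkler_similarity)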