Legacy Evaluator: This evaluator is from phoenix-evals 1.x and will be removed in a future version. For tool calling evaluation, consider using Tool Selection and Tool Invocation evaluators instead. You can migrate the template to a custom evaluator as shown below.

Function Calling Eval Template

You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

<data>

<question>
{question}
</question>

<tool_called>
{tool_call}
</tool_called>

<tool_definitions>
{tool_definitions}
</tool_definitions>

</data>

Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool definitions above.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated call.
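The LLM judge handles the semantic side of this decision, but some of the "incorrect" conditions are mechanical and can be pre-checked deterministically before spending an LLM call. As a hedged sketch (the call-string parser and the tool-definition format here are illustrative assumptions, not part of the Phoenix API), a check that the called tool exists and uses only declared parameter names:

```python
import re

# Illustrative tool-definition format: tool name -> set of allowed parameter names.
TOOL_DEFINITIONS = {
    "get_weather": {"location"},
    "get_forecast": {"location", "days"},
}

def precheck_tool_call(tool_call: str, definitions: dict) -> str:
    """Return an 'incorrect' reason for mechanical failures, or None to defer to the judge."""
    match = re.match(r"(\w+)\((.*)\)$", tool_call.strip())
    if not match:
        return "not a parseable tool call"
    name, arg_str = match.groups()
    if name not in definitions:
        return f"unknown tool: {name}"
    # Pull out keyword-style parameter names, e.g. location='San Francisco'.
    params = set(re.findall(r"(\w+)\s*=", arg_str))
    extras = params - definitions[name]
    if extras:
        return f"parameters not in definition: {sorted(extras)}"
    return None

precheck_tool_call("get_weather(location='San Francisco')", TOOL_DEFINITIONS)  # None: defer to judge
precheck_tool_call("get_wether(location='SF')", TOOL_DEFINITIONS)  # "unknown tool: get_wether"
```

Cases that fail this pre-check can be scored "incorrect" directly; everything else still goes to the LLM judge, which decides whether the tool actually answers the question.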

Running an Agent Eval using the Function Calling Template

from phoenix.evals import ClassificationEvaluator
from phoenix.evals.llm import LLM

TOOL_CALLING_TEMPLATE = """You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

<data>

<question>
{question}
</question>

<tool_called>
{tool_call}
</tool_called>

<tool_definitions>
{tool_definitions}
</tool_definitions>

</data>

"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool definitions above.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated call."""

tool_calling_evaluator = ClassificationEvaluator(
    name="tool_calling",
    prompt_template=TOOL_CALLING_TEMPLATE,
    model=LLM(provider="openai", model="gpt-4o"),
    choices={"incorrect": 0, "correct": 1},
)

result = tool_calling_evaluator.evaluate({
    "question": "What's the weather in San Francisco?",
    "tool_call": "get_weather(location='San Francisco')",
    "tool_definitions": "get_weather(location: str) - Returns weather for a location"
})
Parameters:
  • df - a dataframe of cases to evaluate. To match the default template, the dataframe must have these columns:
    • question - the query made to the model. If you've exported spans from Phoenix to evaluate, this will be the llm.input_messages column in your exported data.
    • tool_call - information on the tool called and the parameters included. If you've exported spans from Phoenix to evaluate, this will be the llm.function_call column in your exported data.
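The column names above come from exported span data, while the template expects the keys question, tool_call, and tool_definitions. A minimal sketch of that mapping, assuming exported rows as plain dicts (the row format and helper name are illustrative; adapt accordingly if you work with a pandas DataFrame):

```python
# Map exported Phoenix span columns onto the template variables.
# The column names are the ones documented above; the row format
# (a plain dict per span) is an illustrative assumption.
COLUMN_MAP = {
    "llm.input_messages": "question",
    "llm.function_call": "tool_call",
}

def to_eval_case(span_row: dict, tool_definitions: str) -> dict:
    """Build the dict of template variables for one exported span."""
    case = {template_var: span_row[column] for column, template_var in COLUMN_MAP.items()}
    # The tool definitions are not part of the span export, so they are passed in.
    case["tool_definitions"] = tool_definitions
    return case

row = {
    "llm.input_messages": "What's the weather in San Francisco?",
    "llm.function_call": "get_weather(location='San Francisco')",
}
to_eval_case(row, "get_weather(location: str) - Returns weather for a location")
```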

Parameter Extraction Only

This template instead evaluates only the parameter extraction step of a router:
You are comparing a function call response to a question and trying to determine if the
generated call has extracted the exact right parameters from the question.

<data>

<question>
{question}
</question>

<llm_response>
{response}
</llm_response>

<function_definitions>
{function_definitions}
</function_definitions>

</data>

Compare the parameters in the generated function against the function definitions provided above.
The parameters extracted from the question must match the expected schema exactly.
Your response must be a single word, either "correct", "incorrect", or "not-applicable",
and should not contain any text or characters aside from that word.

"correct" means the function call parameters match the expected schema and provides only relevant information.
"incorrect" means that the parameters in the function do not match the expected schema exactly, or the generated function does not correctly answer the user's question. You should also respond with "incorrect" if the response makes up information that is not in the schema.
"not-applicable" means that response was not a function call.