Agent Tool Calling

Evaluate your agent’s tool selection and parameter extraction accuracy

This section covers evaluating how well your agent selects a tool to use, extracts the right parameters from the user query, and generates the tool call code.

Agents are often heavy users of tool calling, which can also serve as a proxy for workflow selection or dialog tree management. Given a set of available tools and a user input, which tool (if any) should the agent choose?

Depending on your use case, you may want to expand your evaluation to cover additional cases (see the dataset sketch after this list), such as:

  • Missing context, short context, and long context

  • No functions should be called, one function should be called, or multiple functions should be called

  • Functions are available, but they are the wrong ones

  • Vague or opaque parameters in the query vs. very specific parameters in the query

  • Single turn vs. multi-turn conversation pathways
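
As a hypothetical illustration (the column names and example rows below are assumptions, not part of the Phoenix API), an expanded evaluation dataset might cover several of these cases explicitly:

import pandas as pd

# Hypothetical test cases spanning several of the situations above:
# one expected tool call, no expected tool call, and a vague query.
eval_cases = pd.DataFrame(
    {
        "question": [
            "What's the weather in Paris tomorrow?",  # one tool call expected
            "Thanks, that's all I needed!",           # no tool call expected
            "Book something for my trip",             # vague parameters
        ],
        "tool_call": [
            'get_weather(location="Paris", date="tomorrow")',
            "",
            'book_flight(destination="", date="")',
        ],
    }
)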

Agent Tool Calling Prompt Template

This prompt template tests the single-turn, no-context, single-function-call case for an agent router, and it evaluates the tool call as a whole.

Smaller, more targeted evaluation tasks are also available for agent tool selection and agent parameter extraction individually.

You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.

"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

[TOOL DEFINITIONS START]
{tool_definitions}
[TOOL DEFINITIONS END]
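
For reference, the {tool_definitions} placeholder is filled with a description of the available tools, typically as JSON. The exact schema depends on your agent framework; the OpenAI-style function spec below is only an illustrative assumption:

import json

# Hypothetical tool definitions; the OpenAI-style schema shown here is an
# assumption -- use whatever format your agent actually exposes.
tool_definitions = [
    {
        "name": "get_weather",
        "description": "Look up the weather forecast for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["location"],
        },
    }
]

# Serialized form that gets substituted into {tool_definitions}
json_tools = json.dumps(tool_definitions, indent=2)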

How to Run:

import pandas as pd

from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

df = pd.DataFrame(
    {"question": ["<INSERT QUESTION>"], "tool_call": ["<INSERT TOOL CALL>"]}
)

# JSON string of your tool definitions (see the example above);
# substituted into the {tool_definitions} placeholder of the template
json_tools = "<INSERT TOOL DEFINITIONS JSON>"

# the rails object will be used to snap responses to "correct" 
# or "incorrect"
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Loop through the specified dataframe and run each row 
# through the specified model and prompt. llm_classify
# will run requests concurrently to improve performance.
tool_call_evaluations = llm_classify(
    dataframe=df,
    template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace("{tool_definitions}", json_tools),
    model=model,
    rails=rails,
    provide_explanation=True
)
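
After the classification finishes, the returned dataframe can be joined back onto the inputs to inspect each verdict. This is a minimal sketch assuming the usual "label" and "explanation" output columns of llm_classify; verify the column names against your Phoenix version.

# Attach the evaluator's verdicts back onto the input rows
results = df.join(tool_call_evaluations[["label", "explanation"]])

# Quick summary: fraction of tool calls judged "correct"
accuracy = (results["label"] == "correct").mean()
print(results.head())
print(f"Judged correct: {accuracy:.1%}")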

Benchmark Results

This benchmark was obtained using the accompanying notebook. It was run using the Berkeley Function Calling Leaderboard (BFCL) dataset as ground truth. Each example in the dataset was evaluated using the TOOL_CALLING_PROMPT_TEMPLATE above, and the resulting labels were compared against the ground truth labels in the BFCL dataset.

Note: Some intentionally incorrect examples were added to the dataset so the evaluator could be scored on both correct and incorrect tool calls. Details on this methodology are included in the notebook.
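
As a rough sketch of that comparison (the dataframe and column names below are assumptions about the benchmark setup, not the notebook's actual code), agreement between the evaluator's labels and the ground truth can be computed as:

import pandas as pd

# Hypothetical scoring of the evaluator against BFCL ground truth;
# "eval_label" and "ground_truth_label" are assumed column names.
bench = pd.DataFrame(
    {
        "eval_label": ["correct", "incorrect", "correct"],
        "ground_truth_label": ["correct", "incorrect", "incorrect"],
    }
)
agreement = (bench["eval_label"] == bench["ground_truth_label"]).mean()
print(f"Agreement with BFCL ground truth: {agreement:.1%}")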

Results from OpenAI and Anthropic Models
