Agent Tool Selection

Evaluating how well your agent selects a tool to use

Agents are often heavy users of tool calling, which can also serve as a proxy for workflow selection or dialog tree management. Given a set of tools and an input, which one should be chosen?

This is a narrower evaluation template than Agent Tool Calling, which evaluates the entire tool call rather than just whether the correct tool was selected.

You can use this in tandem with Agent Parameter Extraction.

Agent Tool Selection Eval Prompt Template

You are an evaluation assistant assessing whether a tool call correctly matches a user's question. 
Your task is to decide if the tool selected is the best choice to answer the question,
using only the list of available tools provided below. You are not responsible for checking the 
parameters or arguments passed to the tool. You are evaluating **only** whether the correct tool 
was selected based on the content of the question. Think like a grading rubric. Be strict. If the
selected tool is not clearly correct based on the question alone, label it "incorrect". Do not 
make assumptions or infer information that is not explicitly stated in the question. 
Only use the information provided.
Your response must be a **single word**: either `"correct"` or `"incorrect"`. 
Do not include any explanation, punctuation, or other characters. The output will be parsed
programmatically.
---
Label the tool call as `"correct"` if **all** of the following are true:
- The selected tool is clearly the best fit to answer the user's question
- The tool is among those available in the tool list
- The question contains enough explicit information to justify selecting this tool
Label the tool call as `"incorrect"` if **any** of the following are true:
- A more appropriate tool exists to answer the question
- The tool is not clearly justified by the question content
- The tool would not produce a relevant or meaningful answer to the question
---
[BEGIN DATA]
************
[Question]: {question}
************
[Tool Called]: {tool_call}
[END DATA]
[Tool Definitions]: {tool_definitions}
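
To make the template variables concrete, here is a hypothetical set of inputs; the question, tool call, and tool definitions below are illustrative placeholders, not part of the template itself.

question = "What will the weather be in Paris tomorrow?"
tool_call = "get_weather_forecast"
tool_definitions = """
[
  {"name": "get_weather_forecast", "description": "Returns the weather forecast for a given city and date."},
  {"name": "get_current_time", "description": "Returns the current time in a given timezone."},
  {"name": "search_web", "description": "Performs a general-purpose web search."}
]
"""

With these inputs, the expected label is "correct", since get_weather_forecast is the only tool that directly answers the question.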

How to Run the Tool Selection Eval

import pandas as pd

from phoenix.evals import (
    TOOL_SELECTION_PROMPT_RAILS_MAP,
    TOOL_SELECTION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

df_in = pd.DataFrame(
    {"question": ["<INSERT QUESTION>"], "tool_call": ["<INSERT TOOL CALL>"]}
)

# JSON string describing the tools available to the agent; it is
# substituted into the prompt template below
json_tools = "<INSERT TOOL DEFINITIONS AS JSON>"

# the rails object will be used to snap responses to "correct"
# or "incorrect"
rails = list(TOOL_SELECTION_PROMPT_RAILS_MAP.values())
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Loop through the specified dataframe and run each row
# through the specified model and prompt. llm_classify
# will run requests concurrently to improve performance.
tool_call_evaluations = llm_classify(
    dataframe=df_in,
    template=TOOL_SELECTION_PROMPT_TEMPLATE.template.replace("{tool_definitions}", json_tools),
    model=model,
    rails=rails,
    provide_explanation=True,
)
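
Once the eval has run, the output can be joined back to the input rows for inspection. A minimal sketch, assuming the dataframe returned by llm_classify carries its standard label and explanation columns:

# Attach eval output to the original rows
results = df_in.join(tool_call_evaluations[["label", "explanation"]])

# Fraction of rows the eval model judged "correct"
selection_accuracy = (results["label"] == "correct").mean()
print(f"Tool selection accuracy: {selection_accuracy:.2%}")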

Benchmark Results

This benchmark was obtained using the notebook below. It was run using the Berkeley Function Calling Leaderboard (BFCL) dataset as the ground truth dataset. Each example in the dataset was evaluated using the TOOL_SELECTION_PROMPT_TEMPLATE above, and the resulting labels were compared against the ground truth labels in the BFCL dataset.

Note: Some incorrect examples were added to the dataset to enhance scoring. Details on this methodology are included in the notebook.
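
As a rough sketch of the label comparison described above (assuming a hypothetical bfcl_df with one row per example and a ground_truth column holding the BFCL label), agreement can be computed as:

# Hypothetical: align eval labels with BFCL ground-truth labels by position
bfcl_df["predicted"] = tool_call_evaluations["label"].values
agreement = (bfcl_df["predicted"] == bfcl_df["ground_truth"]).mean()
print(f"Agreement with BFCL ground truth: {agreement:.2%}")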

Results from OpenAI and Anthropic Models
