Agent Parameter Extraction

Evaluate your agent’s parameter extraction for tool calls

This section covers evaluating how well a model extracts the correct parameters from a user query when making a tool call. Agents can go awry when a tool is called with the wrong parameters and returns irrelevant results.

Use this template when you want to grade parameter extraction on its own and evaluate tool selection separately. Smaller, more focused evaluation tasks are generally more accurate, but in turn they increase the number of evals you have to run and account for. The example after this paragraph shows the distinction this eval draws.
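For illustration, consider a hypothetical get_weather tool with a required city parameter and an optional unit parameter (the tool and parameter names here are made up for this example). The eval grades only whether the parameters in the call are justified by the question:

# Hypothetical example — tool and parameter names are illustrative only.
question = "What's the weather in Paris right now, in Celsius?"

# Would be labeled "correct": both parameter values are stated in the question.
correct_call = {"tool": "get_weather", "parameters": {"city": "Paris", "unit": "celsius"}}

# Would be labeled "incorrect": "fahrenheit" is not supported by the question.
incorrect_call = {"tool": "get_weather", "parameters": {"city": "Paris", "unit": "fahrenheit"}}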

Prompt Template

You are an evaluation assistant assessing whether the parameters provided in a tool call correctly
match the user's question. Your task is to decide if the parameters selected are correct and 
sufficient to answer the question, using only the list of available tools and their parameter
definitions provided below. You are not responsible for checking if the correct tool was selected
— assume the tool is correct. You are evaluating **only** whether the parameters are accurate and
justified based on the content of the question.
Think like a grading rubric. Be strict. If the parameters are not clearly correct based on the
question alone, label them "incorrect". Do not make assumptions or infer values that are not
explicitly stated or directly supported by the question. Only use the information provided.
Your response must be a **single word**: either `"correct"` or `"incorrect"`. 
Do not include any explanation, punctuation, or other characters. The output will be parsed
programmatically.
---
Label the parameter extraction as `"correct"` if **all** of the following are true:
- All required parameters are present and correctly filled based on the question
- The parameter values are explicitly justified by the question
- No extra, irrelevant, or hallucinated parameters are included
Label the parameter extraction as `"incorrect"` if **any** of the following are true:
- Any required parameter is missing, malformed, or incorrectly populated
- Any parameter value is inferred or not clearly supported by the question
- Any extra or irrelevant parameter is included
---
[BEGIN DATA]
************
[Question]: {question}
************
[Tool Called With Parameters]: {tool_call}
************
[END DATA]
[Tool Definitions]: {tool_definitions}

How to Run the Tool Parameter Extraction Eval

import pandas as pd

from phoenix.evals import (
    TOOL_PARAMETER_EXTRACTION_PROMPT_RAILS_MAP,
    TOOL_PARAMETER_EXTRACTION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

df = pd.DataFrame(
    {"question": ["<INSERT QUESTION>"], "tool_call": ["<INSERT TOOL CALL>"]}
)

# JSON string describing the available tools and their parameters;
# it is substituted into the {tool_definitions} placeholder of the template.
json_tools = "<INSERT TOOL DEFINITIONS>"

# The rails object is used to snap the model's responses to
# "correct" or "incorrect".
rails = list(TOOL_PARAMETER_EXTRACTION_PROMPT_RAILS_MAP.values())
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Run each row of the dataframe through the specified model and
# prompt. llm_classify runs requests concurrently to improve
# performance.
tool_call_evaluations = llm_classify(
    dataframe=df,
    template=TOOL_PARAMETER_EXTRACTION_PROMPT_TEMPLATE.template.replace(
        "{tool_definitions}", json_tools
    ),
    model=model,
    rails=rails,
    provide_explanation=True,
)
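llm_classify returns a dataframe aligned row-for-row with the input. A minimal sketch for attaching the results back to the input data, assuming the default label and explanation output columns:

# Attach the eval output back to the input rows for inspection.
# Assumes the output dataframe has "label" and "explanation" columns,
# one row per input row.
df["parameter_extraction_label"] = tool_call_evaluations["label"].values
df["parameter_extraction_explanation"] = tool_call_evaluations["explanation"].values

print(df[["question", "tool_call", "parameter_extraction_label"]])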

Benchmark Results

This benchmark was obtained using the notebook below. It was run using the Berkeley Function Calling Leaderboard (BFCL) dataset as a ground-truth dataset. Each example in the dataset was evaluated using the TOOL_PARAMETER_EXTRACTION_PROMPT_TEMPLATE above, and the resulting labels were compared against the ground-truth labels in the BFCL dataset.

Note: Some incorrect examples were added to the dataset to enhance scoring. Details on this methodology are included in the notebook.
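The comparison step itself is simple once the eval labels and the BFCL ground-truth labels are aligned. A minimal sketch with hypothetical toy data (in the notebook, these columns come from the eval output and the BFCL dataset):

import pandas as pd

# Hypothetical aligned labels — stand-ins for the eval output and BFCL ground truth.
comparison = pd.DataFrame(
    {
        "eval_label": ["correct", "incorrect", "correct"],
        "ground_truth": ["correct", "correct", "correct"],
    }
)

# Fraction of examples where the eval label matches the ground-truth label.
accuracy = (comparison["eval_label"] == comparison["ground_truth"]).mean()
print(f"Agreement with BFCL ground truth: {accuracy:.2%}")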

Results for OpenAI and Anthropic Models
