- Missing context, short context, and long context
- No functions should be called, one function should be called, or multiple functions should be called
- Functions are available, but they are the wrong ones
- Vague or opaque parameters in query, vs. very specific parameters in query
- Single turn vs. multi-turn conversation pathways
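Crossing these axes yields a full matrix of evaluation scenarios. A minimal sketch of that cross-product, using hypothetical dimension names and values (none of these come from a specific library):

```python
from itertools import product

# Hypothetical values for each axis listed above.
context = ["missing", "short", "long"]
expected_calls = ["none", "one", "multiple"]
tools_available = ["correct tools", "wrong tools"]
parameters = ["vague", "specific"]
turns = ["single-turn", "multi-turn"]

# Cross the axes into a matrix of test scenarios.
scenarios = list(product(context, expected_calls, tools_available, parameters, turns))

print(len(scenarios))  # 3 * 3 * 2 * 2 * 2 = 72 scenarios
```

Each tuple in `scenarios` describes one test case to construct, e.g. `("missing", "one", "correct tools", "vague", "single-turn")`.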
Agent Tool Calling Prompt Template
In this prompt template, we test the single-turn, no-context, one-function-call case for an agent router and evaluate the whole tool call. More specific, smaller evaluation tasks are available for agent tool selection and agent parameter extraction.
Benchmark Results
This benchmark was obtained using the notebook below. It was run against the Berkeley Function Calling Leaderboard (BFCL) dataset as ground truth. Each example in the dataset was evaluated using the TOOL_SELECTION_PROMPT_TEMPLATE above, and the resulting labels were compared against the ground-truth labels in the BFCL dataset.
Note: Some deliberately incorrect examples were added to the dataset to make the scoring more discriminative. Details on this methodology are included in the notebook.
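In outline, the benchmark loop formats each BFCL example into the eval template, asks an LLM judge for a label, and scores that label against ground truth. The template text and `judge` below are simplified stand-ins, not the actual TOOL_SELECTION_PROMPT_TEMPLATE or the notebook's LLM call:

```python
from typing import Callable

# Stand-in for the real eval template; placeholder names are illustrative.
EVAL_TEMPLATE = """You are evaluating whether an agent made the right tool call.

[Question]: {question}
[Available tools]: {tool_definitions}
[Tool call]: {tool_call}

Answer "correct" or "incorrect"."""

def benchmark(examples: list[dict], judge: Callable[[str], str]) -> float:
    """Label each example with the judge and return accuracy
    against the ground-truth labels."""
    hits = 0
    for ex in examples:
        prompt = EVAL_TEMPLATE.format(
            question=ex["question"],
            tool_definitions=ex["tool_definitions"],
            tool_call=ex["tool_call"],
        )
        if judge(prompt) == ex["ground_truth"]:
            hits += 1
    return hits / len(examples) if examples else 0.0

# Usage with a trivial judge that always answers "correct":
examples = [
    {"question": "Weather in Paris?", "tool_definitions": "get_weather(city)",
     "tool_call": "get_weather(city='Paris')", "ground_truth": "correct"},
    {"question": "Weather in Paris?", "tool_definitions": "get_weather(city)",
     "tool_call": "book_flight(to='Paris')", "ground_truth": "incorrect"},
]
accuracy = benchmark(examples, judge=lambda prompt: "correct")
print(accuracy)  # 0.5: one of the two labels matches ground truth
```

In the notebook, `judge` is replaced by a call to the model under test, which is how the per-model results below are produced.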
Results from OpenAI and Anthropic Models


