
Overview

The Tool Invocation evaluator determines whether an LLM invoked a tool correctly, with proper arguments, formatting, and safe content. It focuses on the how of tool calling (validating that the invocation itself is well-formed) rather than on whether the right tool was selected. This is an LLM evaluator: Phoenix runs a judge model against a managed prompt template on your behalf.

When to Use

Use the Tool Invocation evaluator when you need to:
  • Validate tool call arguments — Ensure all required parameters are present with correct values
  • Check JSON formatting — Verify tool calls are properly structured
  • Detect hallucinated fields — Identify when the LLM invents parameters not in the schema
  • Audit for unsafe content — Check that arguments don’t contain PII or sensitive data
  • Evaluate multi-tool invocations — Validate when the LLM calls multiple tools at once
This evaluator validates tool invocation correctness, not tool selection. For evaluating whether the right tool was chosen, use the Tool Selection evaluator instead. The two evaluators are complementary — Tool Selection catches wrong-tool errors while Tool Invocation catches malformed-call errors — and are best run together for complete tool-calling coverage.
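To make the categories above concrete, here is a minimal, rule-based sketch of the kinds of malformed-call errors this evaluator catches. The tool schema, tool name, and field names are hypothetical, and the real evaluator uses an LLM judge rather than hand-written checks; this only illustrates what "well-formed" means.

```python
import json

# Hypothetical tool definition in the JSON-Schema style used by most
# tool-calling APIs; the name and parameters are illustrative only.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "unit": {"type": "string"}},
        "required": ["city"],
    },
}


def invocation_errors(tool_call: dict, schema: dict) -> list[str]:
    """Collect malformed-call errors: bad JSON, missing or invented fields."""
    try:
        args = json.loads(tool_call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return ["arguments are not valid JSON"]
    params = schema["parameters"]
    errors = []
    for field in params.get("required", []):
        if field not in args:
            errors.append(f"missing required parameter: {field}")
    for field in args:
        if field not in params["properties"]:
            errors.append(f"hallucinated parameter not in schema: {field}")
    return errors


# A well-formed call passes; a call with an invented field does not.
good = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
bad = {"name": "get_weather", "arguments": '{"city": "Paris", "user_ssn": "000-00-0000"}'}
```

Note that both calls pick the right tool; only the second is a malformed invocation. That is exactly the gap between the two evaluators: Tool Selection would pass both, Tool Invocation flags the second.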
Input Mapping

The template handles output formatting automatically: it pulls from your experiment’s output and formats the tool calls and available tools into a human-readable structure for the judge. You don’t need to configure anything for the output side. The only field you may need to map is input, which should point to the user query from your dataset. For example, if your dataset has input.query:
| Template field | Dataset column |
| --- | --- |
| input | input.query |
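Conceptually, a dotted column reference like input.query walks nested keys in each dataset row. A small sketch of that resolution, with a hypothetical row:

```python
def resolve_path(record: dict, path: str):
    """Follow a dotted path like 'input.query' through nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


# Hypothetical dataset row with a nested input payload.
row = {"input": {"query": "What's the weather in Paris?"}}

# Mapping the template's `input` field to `input.query` hands the judge:
user_query = resolve_path(row, "input.query")  # "What's the weather in Paris?"
```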

Output Labels

| Property | Value | Description |
| --- | --- | --- |
| label | "correct" or "incorrect" | Classification result |
| score | 1.0 or 0.0 | Numeric score (1.0 = correct, 0.0 = incorrect) |
| explanation | string | LLM-generated reasoning for the classification |
| Optimization | Maximize | Higher scores are better |
Criteria for Correct (1.0):
  • All required parameters are present with correct values
  • Tool call is properly structured and formatted
  • No hallucinated fields or parameters invented by the LLM
  • Arguments contain no unsafe content (PII, sensitive data)
Criteria for Incorrect (0.0):
  • Required parameters are missing or have incorrect values
  • Tool call is malformed or improperly structured
  • The LLM invented parameters not in the schema
  • Arguments contain unsafe or sensitive content
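The output properties and criteria above can be sketched as a single evaluation record. The explanation text here is invented for illustration; only the label/score/explanation shape comes from the table above.

```python
# Shape of one evaluation record, per the Output Labels table.
evaluation = {
    "label": "incorrect",
    "score": 0.0,
    "explanation": "The call omits the required 'city' parameter.",
}


def label_to_score(label: str) -> float:
    """Binary scoring: 'correct' maps to 1.0, 'incorrect' to 0.0."""
    return 1.0 if label == "correct" else 0.0


# The score is fully determined by the label, and higher is better,
# so aggregate experiment scores can be maximized directly.
assert evaluation["score"] == label_to_score(evaluation["label"])
```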

Using in Phoenix

  1. Navigate to your dataset and open the Evaluators tab.
  2. Click Add Evaluator and select LLM Evaluator Template, then choose tool_invocation.
  3. In the evaluator slide-over, you’ll see the prompt template and choices are pre-configured. You can use the defaults or edit the prompt to fit your use case.
  4. Set an input mapping for the input field so the template pulls from the correct column in your dataset. Output formatting is already handled by the template — no output mapping needed.
  5. Optionally, configure which LLM to use as the judge model.
  6. Click Create. The evaluator will automatically run on any future experiments for that dataset.

See Also