
Overview

The Tool Invocation evaluator determines whether an LLM invoked a tool correctly, with proper arguments, formatting, and safe content. It focuses on the how of tool calling (validating that the invocation itself is well-formed) rather than on whether the right tool was selected. This is an LLM evaluator: Phoenix runs a judge model against a managed prompt template on your behalf.

When to Use

Use the Tool Invocation evaluator when you need to:
  • Validate tool call arguments — Ensure all required parameters are present with correct values
  • Check JSON formatting — Verify tool calls are properly structured
  • Detect hallucinated fields — Identify when the LLM invents parameters not in the schema
  • Audit for unsafe content — Check that arguments don’t contain PII or sensitive data
  • Evaluate multi-tool invocations — Validate when the LLM calls multiple tools at once
This evaluator validates tool invocation correctness, not tool selection. For evaluating whether the right tool was chosen, use the Tool Selection evaluator instead. The two evaluators are complementary — Tool Selection catches wrong-tool errors while Tool Invocation catches malformed-call errors — and are best run together for complete tool-calling coverage.
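To make the categories above concrete, here is a minimal, rule-based sketch of the kinds of malformed-call errors this evaluator catches. The tool schema, tool name, and field names are hypothetical, and the real evaluator uses an LLM judge rather than hand-written checks; this only illustrates what "well-formed" means.

```python
import json

# Hypothetical tool definition in the JSON-Schema style used by most
# tool-calling APIs; the name and parameters are illustrative only.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "unit": {"type": "string"}},
        "required": ["city"],
    },
}


def invocation_errors(tool_call: dict, schema: dict) -> list[str]:
    """Collect malformed-call errors: bad JSON, missing or invented fields."""
    try:
        args = json.loads(tool_call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return ["arguments are not valid JSON"]
    params = schema["parameters"]
    errors = []
    for field in params.get("required", []):
        if field not in args:
            errors.append(f"missing required parameter: {field}")
    for field in args:
        if field not in params["properties"]:
            errors.append(f"hallucinated parameter not in schema: {field}")
    return errors


# A well-formed call passes; a call with an invented field does not.
good = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
bad = {"name": "get_weather", "arguments": '{"city": "Paris", "user_ssn": "000-00-0000"}'}
```

Note that both calls pick the right tool; only the second is a malformed invocation. That is exactly the gap between the two evaluators: Tool Selection would pass both, Tool Invocation flags the second.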
Input Mapping

The template handles output formatting automatically: it pulls from your experiment’s output and formats the tool calls and available tools into a human-readable structure for the judge. You don’t need to configure anything for the output side. The only field you may need to map is input, which should point to the user query from your dataset. For example, if your dataset has input.query:
| Template field | Dataset column |
| --- | --- |
| input | input.query |
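Conceptually, a dotted column reference like input.query walks nested keys in each dataset row. A small sketch of that resolution, with a hypothetical row:

```python
def resolve_path(record: dict, path: str):
    """Follow a dotted path like 'input.query' through nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


# Hypothetical dataset row with a nested input payload.
row = {"input": {"query": "What's the weather in Paris?"}}

# Mapping the template's `input` field to `input.query` hands the judge:
user_query = resolve_path(row, "input.query")  # "What's the weather in Paris?"
```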

Output Labels

| Property | Value | Description |
| --- | --- | --- |
| label | "correct" or "incorrect" | Classification result |
| score | 1.0 or 0.0 | Numeric score (1.0 = correct, 0.0 = incorrect) |
| explanation | string | LLM-generated reasoning for the classification |
| Optimization | Maximize | Higher scores are better |
Criteria for Correct (1.0):
  • All required parameters are present with correct values
  • Tool call is properly structured and formatted
  • No hallucinated fields or parameters invented by the LLM
  • Arguments contain no unsafe content (PII, sensitive data)
Criteria for Incorrect (0.0):
  • Required parameters are missing or have incorrect values
  • Tool call is malformed or improperly structured
  • The LLM invented parameters not in the schema
  • Arguments contain unsafe or sensitive content
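The output properties and criteria above can be sketched as a single evaluation record. The explanation text here is invented for illustration; only the label/score/explanation shape comes from the table above.

```python
# Shape of one evaluation record, per the Output Labels table.
evaluation = {
    "label": "incorrect",
    "score": 0.0,
    "explanation": "The call omits the required 'city' parameter.",
}


def label_to_score(label: str) -> float:
    """Binary scoring: 'correct' maps to 1.0, 'incorrect' to 0.0."""
    return 1.0 if label == "correct" else 0.0


# The score is fully determined by the label, and higher is better,
# so aggregate experiment scores can be maximized directly.
assert evaluation["score"] == label_to_score(evaluation["label"])
```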

Using in Phoenix

  1. Navigate to your dataset and open the Evaluators tab.
  2. Click Add Evaluator and select LLM Evaluator Template, then choose tool_invocation.
  3. In the evaluator slide-over, you’ll see the prompt template and choices are pre-configured. You can use the defaults or edit the prompt to fit your use case.
  4. Set an input mapping for the input field so the template pulls from the correct column in your dataset. Output formatting is already handled by the template — no output mapping needed.
  5. Optionally, configure which LLM to use as the judge model.
  6. Click Create. The evaluator will automatically run on any future experiments for that dataset.

See Also