Evaluation metrics

Evaluation metrics are the numerical or categorical measures used to judge AI system behavior. Examples include task success rate, correctness, faithfulness, relevance, toxicity, latency, cost, tool-call accuracy, Recall@K, and human preference.

The best metric depends on the job the system is supposed to do. A retrieval metric will not tell you whether an answer is safe. A correctness score will not tell you whether the agent took an expensive or risky path. Because each metric covers only one dimension of behavior, production eval suites usually need several metrics reported side by side.
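
As a rough illustration of reporting several metrics together, the sketch below computes task success rate, Recall@K, mean latency, and mean cost over a small batch of evaluation records. The `EvalRecord` schema, its field names, and the `summarize` helper are hypothetical and exist only for this example. Recall@K is taken here as the fraction of relevant documents that appear in the top K retrieved results, which is one common definition.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated example (hypothetical schema, for illustration only)."""
    retrieved_ids: list[str]  # ranked document IDs returned by retrieval
    relevant_ids: set[str]    # ground-truth relevant document IDs
    task_succeeded: bool      # did the system complete the task?
    latency_s: float          # end-to-end latency in seconds
    cost_usd: float           # cost of the run in US dollars


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k retrieved results."""
    if not relevant:
        return 0.0  # assumption: score examples with no relevant docs as 0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def summarize(records: list[EvalRecord], k: int = 3) -> dict[str, float]:
    """Average each metric across records so they can be read side by side."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    return {
        "task_success_rate": sum(r.task_succeeded for r in records) / n,
        f"recall@{k}": sum(
            recall_at_k(r.retrieved_ids, r.relevant_ids, k) for r in records
        ) / n,
        "mean_latency_s": sum(r.latency_s for r in records) / n,
        "mean_cost_usd": sum(r.cost_usd for r in records) / n,
    }


if __name__ == "__main__":
    records = [
        EvalRecord(["a", "b", "c", "d"], {"a", "d"}, True, 1.0, 0.01),
        EvalRecord(["x", "y", "z"], {"y"}, False, 3.0, 0.03),
    ]
    for name, value in summarize(records).items():
        print(f"{name}: {value:.3f}")
```

In this toy batch the second record has perfect retrieval but a failed task, so a suite that reported only Recall@K would miss the failure. Safety-oriented metrics such as toxicity are left out of the sketch because they typically need a classifier or human review rather than simple arithmetic.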

What Are Evaluation Metrics?