Evaluation metrics are the numerical or categorical measures used to judge AI system behavior. Examples include task success rate, correctness, faithfulness, relevance, toxicity, latency, cost, tool-call accuracy, Recall@K, and human preference.
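To make one of these concrete: Recall@K is the fraction of relevant items that appear among the top K retrieved results. The sketch below is a minimal illustration, not a reference implementation. The function name `recall_at_k`, the document IDs, and the choice to return 0.0 when there are no relevant items are all assumptions made for this example.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items found among the top-k retrieved items."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        # Convention assumed for this sketch: no relevant items gives 0.0.
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Illustrative data only: these IDs are made up.
retrieved_ids = ["d3", "d7", "d1", "d9"]
relevant_ids = {"d1", "d2", "d3"}

# Two of the three relevant documents (d3, d1) appear in the top 3.
score = recall_at_k(retrieved_ids, relevant_ids, k=3)
```

Here `score` is 2/3, and it says nothing about whether the final answer built from those documents was correct or safe.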
The best metric depends on the job the system is supposed to do. A retrieval metric will not tell you whether an answer is safe. A correctness score will not tell you whether the agent took an expensive or risky path. Because each metric covers only one kind of failure, production eval suites usually need several metrics reported side by side.
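One way to picture this is a report that keeps several metrics separate instead of collapsing them into one score. The sketch below is hypothetical: the `RunRecord` fields, the `summarize` function, and the sample values are assumptions chosen for illustration, and a real suite would define its own records and metrics.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RunRecord:
    # Hypothetical fields for one evaluated run; real suites will differ.
    correct: bool
    cost_usd: float
    latency_s: float
    flagged_unsafe: bool


def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Report several metrics side by side instead of a single score."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        "correctness": mean(1.0 if r.correct else 0.0 for r in runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "mean_latency_s": mean(r.latency_s for r in runs),
        "unsafe_rate": mean(1.0 if r.flagged_unsafe else 0.0 for r in runs),
    }


# Made-up runs: the second one is correct but expensive, slow, and flagged.
runs = [
    RunRecord(correct=True, cost_usd=0.02, latency_s=1.0, flagged_unsafe=False),
    RunRecord(correct=True, cost_usd=0.50, latency_s=9.0, flagged_unsafe=True),
    RunRecord(correct=False, cost_usd=0.02, latency_s=1.0, flagged_unsafe=False),
    RunRecord(correct=True, cost_usd=0.02, latency_s=1.0, flagged_unsafe=False),
]
report = summarize(runs)
```

In this sample the correctness figure alone looks acceptable, while the cost, latency, and unsafe-rate figures expose the run that a correctness-only report would hide.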