AI evaluation, or model evaluation, measures the quality, safety, reliability, and performance of an AI system or model. In classic ML, this often means measuring a model against a labeled test set. In LLM and agent systems, evaluation often includes semantic judges, human review, trace analysis, and production monitoring.
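To make the classic ML case concrete, here is a minimal sketch of scoring a model against a labeled test set. The `accuracy` helper, the `toy_sentiment_model` classifier, and the four test examples are all hypothetical, invented for illustration; a real evaluation would use a trained model and a held-out dataset.

```python
from typing import Callable, Sequence


def accuracy(model: Callable[[str], str],
             examples: Sequence[tuple[str, str]]) -> float:
    """Return the fraction of examples whose prediction equals the label."""
    if not examples:
        raise ValueError("test set is empty")
    correct = sum(1 for text, label in examples if model(text) == label)
    return correct / len(examples)


# Hypothetical stand-in for a trained classifier (assumption, not a real model).
def toy_sentiment_model(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"


# Hypothetical labeled test set: (input text, expected label).
test_set = [
    ("Good service", "positive"),
    ("Terrible delay", "negative"),
    ("Not good at all", "negative"),   # the keyword rule gets this one wrong
    ("Really good value", "positive"),
]

score = accuracy(toy_sentiment_model, test_set)
print(f"accuracy = {score:.2f}")  # prints "accuracy = 0.75"
```

Exact-match scoring like this works when each input has a single correct label. Open-ended LLM and agent outputs rarely have one correct string, which is one reason evaluation of those systems leans on semantic judges and human review.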
The term should be scoped carefully. Evaluating a foundation model on a benchmark is different from evaluating a customer support agent in production: the first measures what the model is capable of, while the second measures how the whole system behaves in real use.