Human evaluation uses people to judge AI outputs, traces, or sessions. Reviewers may label correctness, safety, preference, policy adherence, or task success.
Human evaluation is slower and more expensive than automated evaluation, but it is critical for calibration and for high-risk judgment calls. Many evaluation systems use human labels as the reference for aligning LLM judges and for resolving ambiguous cases.
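One common way to check how well an LLM judge tracks human reviewers is to measure chance-corrected agreement between the two sets of labels and then send the disagreements back to people. The sketch below is a minimal illustration under stated assumptions, not a prescribed method: the `pass`/`fail` labels, the sample data, and the helper names `cohen_kappa` and `disagreements` are all hypothetical, and Cohen's kappa is only one of several agreement statistics that could be used here.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Fraction of items where the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(human, judge):
    """Indices where the judge and the human reviewer differ."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]


# Hypothetical labels for eight outputs (illustrative data only).
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail"]

kappa = cohen_kappa(human_labels, judge_labels)
to_review = disagreements(human_labels, judge_labels)

print(f"kappa = {kappa:.2f}")          # kappa = 0.50
print(f"needs human review: {to_review}")  # needs human review: [3, 5]
```

A low kappa suggests the judge prompt or rubric needs revision before its scores are trusted, while the disagreement indices give reviewers a short queue of ambiguous cases to resolve.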