LLM evaluation is the practice of measuring whether a large language model or LLM-powered application behaves as intended. It can score correctness, relevance, faithfulness, safety, tone, tool use, latency, cost, and task success.
For production systems, LLM evaluation should measure the application, not just the base model. Users experience prompts, retrieval, tools, memory, orchestration, and policies together, so the eval should exercise that same system boundary end to end.
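To make the system-boundary point concrete, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in chosen for illustration: `stub_model` plays the base model, `retrieve` is a toy keyword retriever, `application` is the full pipeline, and `run_eval` is a simple substring-based scorer. A real setup would call an actual model API and use richer metrics.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical knowledge base; a real system would use a proper retriever.
DOCS = {"refund": "Refunds are available within 30 days of purchase."}


def stub_model(prompt: str) -> str:
    """Toy 'base model': it can only repeat context found in the prompt."""
    marker = "Context: "
    if marker in prompt:
        context = prompt.split(marker, 1)[1].split("\n", 1)[0]
        if context:
            return context
    return "I don't know."


def retrieve(question: str) -> str:
    """Toy keyword retriever over DOCS; returns '' when nothing matches."""
    for key, text in DOCS.items():
        if key in question.lower():
            return text
    return ""


def application(question: str) -> str:
    """The pipeline users actually interact with: retrieval + prompt + model."""
    prompt = f"Context: {retrieve(question)}\nQuestion: {question}"
    return stub_model(prompt)


@dataclass
class EvalCase:
    question: str
    must_contain: str  # expected substring in a correct answer


def run_eval(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    passed = sum(
        case.must_contain.lower() in system(case.question).lower()
        for case in cases
    )
    return passed / len(cases)


CASES = [EvalCase("What is your refund policy?", "30 days")]

# Same cases, two different system boundaries.
app_score = run_eval(application, CASES)
model_score = run_eval(stub_model, CASES)
print(f"application: {app_score:.2f}, base model alone: {model_score:.2f}")
```

Because `run_eval` accepts any callable from question to answer, the same cases can be pointed at the whole application or at a single component. In this toy setup the two scores differ, which is the reason to evaluate at the boundary users actually see.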