Coverage measures how much of the expected behavior space an evaluation suite exercises. It can be assessed along many dimensions, including user intents, edge cases, languages, tools, policies, failure modes, personas, and production traffic patterns.
Low coverage creates false confidence. An eval suite can be green because it never tested the cases that break. Dataset curation and error analysis are the usual ways to improve coverage.
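One way to make coverage concrete is to tag each eval case with the dimensions it exercises and compare those tags against a list of expected categories. The sketch below is illustrative only: the `EXPECTED` taxonomy, the `EVAL_CASES` records, and the `coverage_report` helper are hypothetical names invented for this example, and a real suite would likely need a richer taxonomy, for instance one weighted by production traffic.

```python
from collections import defaultdict

# Hypothetical taxonomy: dimension -> categories the suite is expected to cover.
EXPECTED = {
    "intent": {"refund", "cancel", "upgrade"},
    "language": {"en", "es", "de"},
}

# Hypothetical eval cases, each tagged with one category per dimension.
EVAL_CASES = [
    {"intent": "refund", "language": "en"},
    {"intent": "cancel", "language": "en"},
    {"intent": "refund", "language": "es"},
]


def coverage_report(expected, cases):
    """Return {dimension: (fraction_covered, sorted_missing_categories)}."""
    # Collect every category actually seen in the eval cases, per dimension.
    seen = defaultdict(set)
    for case in cases:
        for dim, cat in case.items():
            seen[dim].add(cat)

    report = {}
    for dim, cats in expected.items():
        covered = cats & seen[dim]
        # An empty expected set is treated as fully covered (an assumption).
        fraction = len(covered) / len(cats) if cats else 1.0
        report[dim] = (fraction, sorted(cats - covered))
    return report


report = coverage_report(EXPECTED, EVAL_CASES)
for dim, (fraction, missing) in report.items():
    print(f"{dim}: {fraction:.0%} covered, missing: {missing}")
```

In this toy data every case could pass while the `upgrade` intent and the `de` language are never exercised, which is the false-confidence scenario described above. The missing categories are natural targets for dataset curation.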