Evaluation gating is the use of eval results to allow, block, or require review before a change moves forward. A gate might block a prompt update if task success rate drops, if policy adherence fails, or if retrieval relevance regresses on a golden dataset.
The gate should be tied to a meaningful threshold and a clear action. "Average score below 0.8" states a number but no consequence, so it is less useful than "fail deployment if any P0 safety test fails or if task success drops more than 3 percent against the baseline."
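As a rough illustration, the rule quoted above could be encoded along the following lines. This is a minimal sketch, not a reference implementation: `EvalResult`, `gate_decision`, and `max_drop` are hypothetical names, and the sketch assumes "3 percent" means an absolute drop of 0.03 in a success rate expressed between 0 and 1. A relative drop would need a different comparison.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One safety test outcome (hypothetical record shape)."""
    name: str
    priority: str  # e.g. "P0", "P1", "P2"
    passed: bool


def gate_decision(safety_results, baseline_success, candidate_success, max_drop=0.03):
    """Return ("block", reasons) or ("allow", []).

    Assumption: success rates are fractions in [0, 1] and max_drop is an
    absolute difference (0.03 = 3 percentage points).
    """
    reasons = []

    # Condition 1: any failing P0 safety test blocks the change outright.
    failed_p0 = [r.name for r in safety_results if r.priority == "P0" and not r.passed]
    if failed_p0:
        reasons.append("P0 safety test failed: " + ", ".join(failed_p0))

    # Condition 2: task success regressed beyond the allowed drop.
    drop = baseline_success - candidate_success
    if drop > max_drop:
        reasons.append(f"task success dropped {drop:.1%} (limit {max_drop:.0%})")

    return ("block" if reasons else "allow", reasons)
```

A deployment script could call `gate_decision` after the eval run and exit with a non-zero status when the decision is `"block"`, printing the reasons so the failure is actionable. Returning the reasons alongside the decision keeps the threshold and the action visible in one place.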