Training data is used to teach or adapt a model. Evaluation data is used to measure behavior after training, prompting, retrieval, or orchestration changes. The two should be separated to avoid measuring memorization instead of generalization.
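One way to check this separation is to look for evaluation examples that also appear in the training set. The sketch below is a minimal illustration, not a complete decontamination method. It assumes examples are plain strings and that exact matches after lowercasing and whitespace normalization are the only overlap worth catching. The helper names (`normalize`, `fingerprint`, `find_overlap`) and the sample data are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse runs of whitespace so trivial variations still match.
    return " ".join(text.lower().split())


def fingerprint(text: str) -> str:
    # Hash the normalized text so large datasets can be compared by digest.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(train_examples, eval_examples):
    # Return the evaluation examples whose normalized form also occurs in training.
    train_hashes = {fingerprint(t) for t in train_examples}
    return [e for e in eval_examples if fingerprint(e) in train_hashes]


# Hypothetical data for illustration only.
train = ["What is the capital of France?", "Summarize this email."]
evaluation = ["what is the capital of  France?", "Translate 'hello' to Spanish."]

leaked = find_overlap(train, evaluation)
print(leaked)
```

Here the first evaluation example is flagged because it differs from a training example only in case and spacing. Paraphrased or partially overlapping examples would need a fuzzier comparison than this sketch provides.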
In LLM applications, evaluation data often never touches model training at all. Instead, it is used to compare prompts, retrievers, tools, policies, or agent harness versions. It still has to be kept clean: if evaluation examples leak into the components being compared, the score can no longer be trusted.
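A comparison of this kind can be pictured as scoring each variant against the same fixed evaluation set. The sketch below assumes exact-match scoring and uses toy lookup functions in place of real prompt or retriever variants. `EVAL_SET`, `score`, `variant_a`, and `variant_b` are hypothetical names, not part of any library.

```python
from typing import Callable, Sequence, Tuple

# A small, fixed evaluation set of (input, expected output) pairs.
# A tuple is used so the set is not modified between runs.
EVAL_SET: Tuple[Tuple[str, str], ...] = (
    ("2 + 2", "4"),
    ("3 * 3", "9"),
    ("10 - 7", "3"),
)


def score(system: Callable[[str], str], eval_set: Sequence[Tuple[str, str]]) -> float:
    # Fraction of examples where the system's output exactly matches the expected answer.
    correct = sum(1 for question, expected in eval_set if system(question).strip() == expected)
    return correct / len(eval_set)


# Toy stand-ins for two application variants (e.g. two prompt versions).
def variant_a(question: str) -> str:
    answers = {"2 + 2": "4", "3 * 3": "9", "10 - 7": "3"}
    return answers.get(question, "")


def variant_b(question: str) -> str:
    answers = {"2 + 2": "4", "3 * 3": "6", "10 - 7": "3"}
    return answers.get(question, "")


# Both variants are scored on the identical evaluation set.
scores = {
    "a": score(variant_a, EVAL_SET),
    "b": score(variant_b, EVAL_SET),
}
print(scores)
```

Because both variants see the same examples, a difference in score can be attributed to the variant rather than to a change in the data.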