A test set is a collection of test cases used to evaluate a system's behavior. A test case is one specific example: an input paired with the criteria used to judge the result. For agents, a test case may also include the initial state, the available tools, constraints on the expected trajectory, and success criteria.
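As a rough illustration of those parts, an agent test case could be modeled as a small record plus a check function. This is only a sketch under stated assumptions: `AgentTestCase`, `evaluate`, the field names, the tool names, and the refund scenario are all hypothetical and do not come from any particular framework. The trajectory is simplified to a list of tool names, and the only trajectory constraints shown are a step budget and a tool allowlist.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class AgentTestCase:
    """One example: an input plus the criteria used to judge the run (hypothetical schema)."""
    name: str
    user_input: str
    # Success criterion: a predicate over the final environment state.
    success: Callable[[dict[str, Any]], bool]
    # State the environment starts in before the agent acts.
    initial_state: dict[str, Any] = field(default_factory=dict)
    # Tools the agent is allowed to call in this case.
    available_tools: list[str] = field(default_factory=list)
    # Trajectory constraint: upper bound on the number of tool calls.
    max_steps: int = 10


def evaluate(case: AgentTestCase, trajectory: list[str], final_state: dict[str, Any]) -> bool:
    """Return True if the run respects the trajectory constraints and meets the success criterion."""
    within_budget = len(trajectory) <= case.max_steps
    tools_allowed = all(tool in case.available_tools for tool in trajectory)
    return within_budget and tools_allowed and case.success(final_state)


# A test set is just a collection of such cases.
refund_case = AgentTestCase(
    name="refund_small_order",
    user_input="Please refund order 1001.",
    success=lambda state: state["orders"]["1001"]["status"] == "refunded",
    initial_state={"orders": {"1001": {"status": "paid"}}},
    available_tools=["lookup_order", "issue_refund"],
    max_steps=4,
)
test_set = [refund_case]
```

Keeping the success criterion as a predicate over the final state, separate from the trajectory checks, is one possible design choice: it lets a case pass regardless of the exact path the agent took, as long as the path stays within the stated constraints.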
Test cases should be small enough to debug and realistic enough to matter. Toy examples are useful for smoke tests, but production reliability depends on cases that resemble actual user workflows.