Safety evaluation measures whether an AI system avoids harmful, unsafe, unauthorized, or policy-violating behavior. It can include toxicity, self-harm, illegal advice, data leakage, unsafe tool calls, prompt injection resistance, and compliance with domain-specific policies.
For agents, safety evaluation must include actions, not just text. The dangerous behavior may be a tool call, permission escalation, file change, transaction, or data access event.