A/B testing for LLMs compares two or more AI system variants on live traffic. Variants might differ by model, prompt, retrieval strategy, tool policy, or agent workflow. The goal is to measure which version performs better on real user outcomes.
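Comparing variants on live traffic requires a stable way to decide which variant each user sees. The following is a minimal sketch of one common approach, deterministic hash-based bucketing. The function name `assign_variant`, the experiment key, and the variant names are hypothetical and not taken from any specific platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, weights: dict[str, float]) -> str:
    """Deterministically map a user to a variant (hypothetical helper).

    The same (experiment, user_id) pair always lands in the same variant,
    so a user does not flip between prompts or models mid-experiment.
    """
    if not weights:
        raise ValueError("at least one variant is required")

    # Hash the experiment name together with the user id so that different
    # experiments split the same users independently.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()

    # Turn the first 8 bytes into a number between 0 and 1.
    bucket = int.from_bytes(digest[:8], "big") / 2**64

    # Walk the cumulative weight ranges until the bucket falls inside one.
    total = sum(weights.values())
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if bucket < cumulative:
            return name
    return name  # fallback for floating-point rounding at the upper edge


# Illustrative usage: a 50/50 split between two assumed variant names.
variant = assign_variant("user-123", "prompt-v2-test", {"control": 0.5, "candidate": 0.5})
```

Hashing rather than drawing a random number per request keeps assignment sticky without storing state, which matters when a variant changes a multi-turn conversation or an agent workflow.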
LLM A/B tests need more than click or conversion metrics. Teams often need to track task success, correctness, relevance, safety, latency, cost, user satisfaction, and escalation rate. For agents, trace-level evals help explain why one variant wins or loses.
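For a binary metric such as task success, one way to check whether a difference between two variants is more than noise is a two-proportion z-test. The sketch below uses only the Python standard library. The function name `two_proportion_z_test` is hypothetical, and the counts in the usage line are made up for illustration, not measured results.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = success_a / n_a
    p_b = success_b / n_b

    # Pooled success rate under the null hypothesis that both variants are equal.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All successes or all failures in both arms: no detectable difference.
        return 0.0, 1.0

    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative counts only: 400/1000 successful tasks for A, 450/1000 for B.
z, p_value = two_proportion_z_test(400, 1000, 450, 1000)
```

A test like this covers only one metric. A variant that wins on task success can still lose on latency, cost, or safety, so each of those would need its own comparison before declaring a winner.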