Shadow testing runs a new model, prompt, retriever, or agent version alongside production without showing its output to users. The shadow system receives the same or similar inputs, generates outputs, and gets evaluated against the production baseline.
Shadow testing is especially useful for agents because it exposes differences in tool calls, latency, cost, retrieval behavior, and trajectory quality before the new version is user-facing. The hard part is making sure the shadow run is realistic without triggering side effects in external systems.