Cascading failures occur when one failure triggers additional failures across an agent workflow or multi-agent system. A bad retrieval result can lead to a bad plan, which leads to a wrong tool call, which leads to an unsafe or incorrect final answer.
Agents are vulnerable to cascading failures because later steps often trust earlier steps. Checkpoints, validation, policy gates, retries, and trajectory evals help catch errors before they propagate.