Study maps how agent-like AIs fail and why those failures can cross system boundaries
This paper looks at why “agentic” AI systems fail. Agentic AI means systems that use a large language model (LLM) to plan actions, call external tools, and carry out multi-step tasks. The authors show that these systems fail in ways that differ both from conventional software and from conversational AIs that only answer questions.
The team collected 13,602 closed issues and merged pull requests from 40 open-source agentic AI repositories. They selected 385 representative faults for deep manual study. To make sense of those faults they used grounded theory, a method that builds categories from real examples. They also ran Apriori-based association rule mining, a statistical method that finds strong co‑occurrence patterns. Finally, they checked their results with a survey of 145 developers who work on agentic systems.
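To make the mining step concrete, here is a minimal sketch of Apriori-style pair mining: count how often fault labels co-occur on the same issue, keep pairs above a support threshold, and score each surviving pair by lift. The issue labels and data below are hypothetical illustrations, not the paper's dataset or code.

```python
from itertools import combinations
from collections import Counter

def mine_pair_rules(transactions, min_support=0.01):
    """Find label pairs that co-occur more often than chance.

    Each transaction is the set of fault labels attached to one issue.
    Apriori-style: only pairs whose support clears the threshold are kept.
    """
    n = len(transactions)
    single = Counter(label for t in transactions for label in set(t))
    pair = Counter()
    for t in transactions:
        for a, b in combinations(sorted(set(t)), 2):
            pair[(a, b)] += 1

    rules = []
    for (a, b), count in pair.items():
        support = count / n
        if support < min_support:
            continue
        # lift = P(A and B) / (P(A) * P(B)); lift >> 1 means the labels
        # co-occur far more often than independence would predict
        lift = support / ((single[a] / n) * (single[b] / n))
        rules.append(((a, b), support, lift))
    return sorted(rules, key=lambda r: -r[2])

# Hypothetical fault labels for illustration only
issues = [
    {"token-mgmt", "auth-failure"},
    {"token-mgmt", "auth-failure"},
    {"datetime-conv", "wrong-time"},
    {"token-mgmt"},
    {"auth-failure", "datetime-conv"},
]
for labels, support, lift in mine_pair_rules(issues, min_support=0.2):
    print(labels, round(support, 2), round(lift, 2))
```

On a real corpus the same computation runs over thousands of labeled issues; the striking lift values the study reports come out of exactly this kind of co-occurrence ratio.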
The result is a detailed taxonomy. The authors identified 37 specific fault types organized into 13 higher-level categories, 13 classes of observable symptoms, and 12 categories of root causes. Two common root causes were dependency and integration failures (19.5% of faults) and data and type handling failures (17.6%). Many faults arise from a mismatch between the probabilistic outputs of LLMs and the deterministic interfaces that other software expects. The study also found recurring propagation paths across components, measured with lift: how many times more often two fault types co-occur than chance alone would predict. For example, token-management bugs frequently led to authentication failures (a very strong link, lift = 181.5), and incorrect time values often came from bad datetime conversions (lift = 121.0).
This work matters because it moves fault analysis from anecdotes to patterns. The authors show that failures in agentic systems are often structured and cross component boundaries. That means teams can build targeted debugging, monitoring, and reliability steps that focus on common propagation paths — for example, checking token refresh logic or datetime conversions. Practitioners rated the taxonomy as representative (mean = 3.97 out of 5) and 83.8% said it covered faults they had seen.
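As one illustration of such a targeted check, the sketch below guards a datetime conversion boundary, the kind of spot the study links to incorrect time values. A frequent pitfall is treating a naive local datetime as if it were UTC; failing fast at the conversion makes the bug visible where it starts rather than several components downstream. This is a generic defensive pattern, not code from the paper.

```python
from datetime import datetime, timedelta, timezone

def to_utc(dt: datetime) -> datetime:
    """Convert a datetime to UTC, rejecting naive values.

    Naive datetimes carry no timezone, so silently converting them
    guesses at their meaning; raising here surfaces the fault early.
    """
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before converting")
    return dt.astimezone(timezone.utc)

# A timezone-aware value converts cleanly...
aware = datetime(2024, 5, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_utc(aware).isoformat())  # 2024-05-01T10:00:00+00:00
```

The same fail-fast idea applies to the token-refresh example: validate assumptions at the component boundary where the propagation chain begins.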