SpecOps: a fully automated system that tests real-world AI agents through their user interfaces
This paper presents SpecOps, a new testing system for product-level AI agents that interact with real software interfaces. These agents are powered by large language models (LLMs), AI systems that produce text and actions from natural-language prompts. SpecOps aims to find bugs and failures in agents running in real environments, such as command-line tools, web apps, and browser extensions, rather than relying on simulators or extensive human effort.
SpecOps breaks testing into four steps: generating test cases, setting up the environment, running the test, and validating the results. A specialized LLM-based “specialist” agent handles each step. The system uses prompt engineering (carefully designed instructions to the LLMs) and human-like visual monitoring via screen captures to observe what the subject agent does. The paper gives a concrete motivating example: a popular open-source agent called OpenInterpreter mistakenly used the wrong file path (“~/” instead of “./”) and triggered a file-not-found error; detecting such an error requires creating files, prompting the agent, observing its actions, and checking the environment afterward.
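The four-step flow above can be sketched in miniature. This is an illustrative toy, not the paper's implementation: every name (TestCase, run_agent, the ./notes.txt file) is a hypothetical stand-in, and the "agent" is a stub that mimics the file-path bug described in the motivating example.

```python
# Hypothetical sketch of a four-stage test pipeline: generate, set up,
# run, validate. All names and data are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class TestCase:
    prompt: str                      # natural-language task for the agent
    setup_files: dict = field(default_factory=dict)  # path -> contents
    expected: str = ""               # substring expected in the output

def generate_test_case() -> TestCase:
    # Stage 1: in SpecOps an LLM specialist drafts this; hard-coded here.
    return TestCase(
        prompt="Read ./notes.txt and summarize it.",
        setup_files={"./notes.txt": "meeting at 10am"},
        expected="meeting",
    )

def set_up_environment(tc: TestCase) -> dict:
    # Stage 2: materialize the files the test needs (in-memory here).
    return dict(tc.setup_files)

def run_agent(tc: TestCase, fs: dict) -> str:
    # Stage 3: stand-in for the subject agent. A buggy agent might
    # resolve "~/notes.txt" instead of "./notes.txt" and miss the file.
    path = "./notes.txt"
    return fs.get(path, "FileNotFoundError")

def validate(tc: TestCase, output: str) -> bool:
    # Stage 4: check the observed behavior against the expectation.
    return tc.expected in output

tc = generate_test_case()
fs = set_up_environment(tc)
out = run_agent(tc, fs)
print(validate(tc, out))  # True when the agent uses the correct path
```

If `run_agent` used `"~/notes.txt"` as its path, stage 4 would see `FileNotFoundError` and flag the bug, which is exactly the class of failure the OpenInterpreter example describes.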
In experiments on five real-world agents across three domains (Email, File System, and HR question-answering), SpecOps reported strong results. It found 164 true bugs with an F1 score of 0.89 (F1 is a single number that balances precision, the fraction of reported bugs that are real, against recall, the fraction of real bugs that are found). The authors report a prompting success rate of 100% for SpecOps versus 11–49.5% for baseline approaches such as AutoGPT and LLM-crafted automation scripts. They also report perfect execution of planned steps, an average runtime under eight minutes per test, and a cost below $0.73 per test.
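For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the numbers are purely illustrative (the paper does not report its false-positive or false-negative counts), chosen only to show how a 0.89 F1 can arise.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only, not the paper's data:
# 90 real bugs found, 10 false alarms, 12 real bugs missed.
print(round(f1_score(90, 10, 12), 2))  # → 0.89
```

A high F1 means the tool both finds most real bugs (high recall) and raises few false alarms (high precision); either failure mode alone drags the score down.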