SpecOps: a fully automated system that tests real-world AI agents through their user interfaces
This paper presents SpecOps, a new testing system for product-level AI agents that interact with real software interfaces. These agents are powered by large language models (LLMs), AI systems that produce text and actions from natural-language prompts. SpecOps aims to find bugs and failures in agents running in real environments, such as command-line tools, web apps, and browser extensions, rather than relying on simulators or heavy manual effort.
SpecOps breaks testing into four steps, each handled by a specialized LLM-based “specialist” agent: generating test cases, setting up the environment, running the test, and validating the results. The system uses prompt engineering (carefully designed instructions to the LLMs) and human-like visual monitoring via screen captures to watch what the subject agent does. The paper gives a concrete motivating example: OpenInterpreter, a popular open-source agent, used the wrong file path ("~/" instead of "./") and triggered a file-not-found error; detecting such errors requires creating files, prompting the agent, observing its actions, and checking the environment afterward, as sketched below.
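To make the four-stage shape concrete, here is a minimal Python sketch of such a pipeline. Everything in it is an assumption for illustration: the function names, the TestCase structure, and the copy-a-file scenario are invented, and the LLM-backed specialists and real user-interface monitoring of the actual system are replaced with plain stubs. The "agent" stub deliberately writes to the wrong path, mirroring the kind of path mix-up described above, so the validation stage catches it.

from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class TestCase:
    prompt: str                   # natural-language task sent to the subject agent
    setup_files: dict[str, str]   # files the environment must contain beforehand
    expected_files: list[str]     # files that must exist afterward for a pass


def generate_test_case() -> TestCase:
    """Stage 1: in SpecOps, a test-generator specialist would draft this via an LLM."""
    return TestCase(
        prompt="Copy notes.txt to backup/notes.txt",
        setup_files={"notes.txt": "meeting notes"},
        expected_files=["notes.txt", "backup/notes.txt"],
    )


def set_up_environment(case: TestCase) -> Path:
    """Stage 2: materialize the required files in a scratch workspace."""
    workdir = Path(tempfile.mkdtemp(prefix="agenttest_"))
    for name, content in case.setup_files.items():
        target = workdir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
    return workdir


def run_subject_agent(case: TestCase, workdir: Path) -> None:
    """Stage 3: prompt the agent under test and let it act on the workspace.

    A real runner would drive the agent's actual interface (CLI, web app,
    browser extension) and monitor it via screen captures; this stub simulates
    a buggy agent that writes its output to the wrong location.
    """
    (workdir / "notes_copy.txt").write_text("meeting notes")  # wrong path


def validate_result(case: TestCase, workdir: Path) -> bool:
    """Stage 4: check the environment's final state against expectations."""
    return all((workdir / f).exists() for f in case.expected_files)


if __name__ == "__main__":
    case = generate_test_case()
    workdir = set_up_environment(case)
    run_subject_agent(case, workdir)
    print("PASS" if validate_result(case, workdir) else f"FAIL in {workdir}")

Running this prints FAIL, because the expected file backup/notes.txt never appears: the kind of judgment that requires inspecting the real environment after the agent acts, not just reading the agent's own output.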
In experiments on five real-world agents across three domains (Email, File System, and HR question-and-answer), SpecOps achieved strong results. It found 164 true bugs with an F1 score of 0.89 (F1 is a single number that balances how many real bugs were found against how many reported problems were false alarms; see the worked example below). The authors report a prompting success rate of 100% for SpecOps versus 11–49.5% for baseline approaches such as AutoGPT and LLM-crafted automation scripts. They also report perfect execution of planned steps, an average runtime under eight minutes per test, and a cost below $0.73 per test.
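For readers unfamiliar with the metric, F1 is the harmonic mean of precision (the fraction of reported bugs that were real) and recall (the fraction of real bugs that were reported). The paper does not break down the counts behind its 0.89 here, so the false-alarm and missed-bug numbers below are purely hypothetical, chosen only to show how counts in that vicinity yield an F1 near 0.89:

def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only (not from the paper): 164 true bugs found,
# plus hypothetical false alarms and missed bugs.
print(round(f1_score(true_positives=164, false_positives=20, false_negatives=21), 2))
# -> 0.89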
Why this matters: many current benchmarks and testbeds operate in text-only or simulated settings, or they require substantial human effort to craft realistic test scenarios. That can miss bugs that only appear when an agent interacts with real software, or it can introduce errors from the simulator itself. SpecOps aims to close that realism gap by operating end to end in real environments and by automating the whole pipeline with LLM-based specialists.