EurekAgent: shaping the environment so AI agents can discover new science at low cost
Researchers introduce EurekAgent, a system that helps large language model (LLM) agents do metric-driven scientific discovery by changing the environment they work in. Instead of prescribing step-by-step workflows, the team focuses on the resources, constraints, and interfaces around agents. The idea is that a well-designed environment can encourage useful behavior and prevent cheating or contamination of results.
EurekAgent engineers the agent environment along four concrete dimensions. Permissions engineering limits what an agent can change and isolates evaluations. Artifact engineering provides a shared file system and Git history so code, logs, and results are traceable. Budget engineering enforces runtime, compute, and API-cost limits so exploration stays affordable. Human-in-the-loop engineering adds simple supervision tools so a person can monitor and intervene. The system coordinates off-the-shelf command-line (CLI) agents through a prepare stage and repeated propose/implement rounds, with a hidden evaluator and sandboxed runs to protect integrity.
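The propose/implement rounds and the budget limits described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `Budget` and `run_discovery`, the callables `propose`, `implement`, and `evaluate`, and the per-round cost accounting are all hypothetical. The only ideas taken from the description are that rounds repeat until a budget is exhausted and that the agent receives a score from an evaluator it cannot inspect.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Budget:
    """Hypothetical budget tracker: caps the round count and cumulative API cost."""
    max_rounds: int
    max_cost_usd: float
    spent_usd: float = 0.0
    rounds_used: int = 0

    def can_continue(self) -> bool:
        # A new round may start only while both limits still have headroom.
        return self.rounds_used < self.max_rounds and self.spent_usd < self.max_cost_usd

    def charge(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        self.rounds_used += 1


def run_discovery(
    propose: Callable[[List[Tuple[Any, float]]], Any],
    implement: Callable[[Any], Tuple[Any, float]],
    evaluate: Callable[[Any], float],
    budget: Budget,
) -> Tuple[Optional[Tuple[Any, float]], List[Tuple[Any, float]]]:
    """Repeat propose/implement rounds until the budget runs out.

    `evaluate` stands in for the hidden evaluator: the loop passes it a
    candidate and gets back only a number, never the scoring code itself.
    """
    history: List[Tuple[Any, float]] = []   # (idea, score) pairs from earlier rounds
    best: Optional[Tuple[Any, float]] = None
    while budget.can_continue():
        idea = propose(history)              # agent picks its own strategy
        candidate, cost_usd = implement(idea)
        budget.charge(cost_usd)
        score = evaluate(candidate)          # higher is assumed to be better
        history.append((idea, score))
        if best is None or score > best[1]:
            best = (candidate, score)
    return best, history
```

Because the budget is checked before each round rather than during it, a single expensive round can overshoot the cost cap in this sketch; a stricter design would also pass the remaining budget into `implement`.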
In experiments reported by the authors, EurekAgent set new state-of-the-art results on several mathematics and kernel-engineering tasks, and ranked first on a subset of MLE-Bench. A highlighted example is a 26-circle packing problem: EurekAgent reached a score of 2.635999, edging past the prior best results, and the discovery run used under $11 in total API cost. The paper also reports average API cost below $17 for three mathematics tasks when using ClaudeCode as the CLI agent and GLM-5.1 as the base model.
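To make the circle-packing score concrete, here is a minimal Python sketch of how such a result could be checked. The summary above does not define the metric, so this assumes a common formulation of the task: place 26 non-overlapping circles inside the unit square and maximize the sum of their radii. The function name `packing_score`, the `(x, y, r)` tuple layout, and the tolerance are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Circle = Tuple[float, float, float]  # (center x, center y, radius)


def packing_score(circles: Sequence[Circle], tol: float = 1e-9) -> float:
    """Return the sum of radii if the packing is valid, else raise ValueError.

    Assumed rules (not stated in the summary): every circle lies inside the
    unit square [0, 1] x [0, 1], and no two circles overlap.
    """
    for i, (x, y, r) in enumerate(circles):
        if r <= 0:
            raise ValueError(f"circle {i} has a non-positive radius")
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            raise ValueError(f"circle {i} leaves the unit square")
    for i in range(len(circles)):
        xi, yi, ri = circles[i]
        for j in range(i + 1, len(circles)):
            xj, yj, rj = circles[j]
            # Circles may touch, but their centers must be at least ri + rj apart.
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                raise ValueError(f"circles {i} and {j} overlap")
    return sum(r for _, _, r in circles)
```

A checker like this is the kind of logic that would sit behind a hidden evaluator: the agent submits coordinates and radii, and only the returned number comes back.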
This approach matters because it addresses practical reliability problems that arise as agents become more capable. The authors note that agents can exploit weak evaluation procedures, contaminate artifacts, or fail to follow the prescribed process. By making permissions, artifacts, budgets, and human oversight first-class parts of the platform, EurekAgent aims to make autonomous exploration more reproducible and inspectable while still leaving agents free to choose their own strategies.