Study: AI agents trade in prediction markets but struggle to learn from others when tasks get complex
This paper tests whether AI agents built from large language models can pool private information by trading in a simple prediction market. The question is whether watching prices and other traders' actions lets these agents infer what others know and push the market price toward the true outcome. The authors measure market accuracy by the “log error” of the final price, a measure of how far the final market price ended up from the true 0-or-1 outcome, with lower values meaning a more accurate market.
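The paper does not spell out the formula in this summary, but a standard reading of “log error” is the negative log of the probability the final price assigned to the realized outcome. A minimal sketch under that assumption (the function name and clamping constant are illustrative, not the authors'):

```python
import math

def log_error(final_price: float, outcome: int, eps: float = 1e-9) -> float:
    """Negative log probability assigned to the realized outcome:
    -log(p) if the event happened, -log(1 - p) otherwise.
    Lower is better; a fully informed market that prices the outcome
    correctly scores close to 0."""
    p = min(max(final_price, eps), 1.0 - eps)  # clamp away from exact 0/1
    return -math.log(p) if outcome == 1 else -math.log(1.0 - p)

# An uninformative price of 0.5 scores ~0.693 whatever the outcome,
# while a confident and correct price of 0.91 scores ~0.094.
print(round(log_error(0.5, 1), 3))   # 0.693
print(round(log_error(0.91, 1), 3))  # 0.094
```

On this reading, a final price of 0.5 on a resolved event carries the same error as a coin flip, which is why the paper treats it as uninformative.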
The researchers ran a controlled experiment where teams of three AI traders repeatedly bought and sold a security that pays 1 if a stated binary event happens and 0 otherwise. Each trader received private signals about the event. They varied how hard it is to reason about those signals (four information structures from easy to very hard), allowed or blocked public messages (“cheap talk”), changed the market length (3, 6, or 9 rounds) and the starting price, and tested both “myopic” and “strategic” prompting. They used a range of models (including Claude Haiku, Gemini, GPT-4o, GPT-5 mini and others) and ran thousands of markets (1772 in the first wave and another 576 with newer models). The securities were built so that, in theory, fully rational and sufficiently sophisticated traders should always aggregate information.
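The manipulated factors above form a factorial design. A hypothetical enumeration of the grid (factor names are illustrative, not the authors' identifiers, and this omits the varied starting prices and model choices):

```python
from itertools import product

structures = ["easy", "medium", "hard", "very_hard"]  # information structures
cheap_talk = [True, False]                            # public messages allowed?
num_rounds = [3, 6, 9]                                # market length
prompting = ["myopic", "strategic"]                   # prompting style

# Cross the factors to get every experimental cell.
conditions = list(product(structures, cheap_talk, num_rounds, prompting))
print(len(conditions))  # 48 cells before varying model and starting price
```

With thousands of markets run, each cell is sampled many times, which is what makes the per-structure median prices reported below meaningful.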
The main result is mixed. Overall, the markets aggregated information reasonably well: across all four structures, the median final price was 0.91 when the true value was 1, and in the easy and medium structures the median price was almost exactly 1. But performance fell as the inference problem became harder. In the “hard” structure the median price was 0.73, and in the “very hard” structure it dropped to 0.5, no better than a random guess and therefore uninformative. These results held even when the authors repeated the experiments with more recent state-of-the-art models in April 2026; surprisingly, some frontier models did worse on average than earlier top performers.