HorizonMath tests AI on over 100 unsolved math problems with automatic checks
HorizonMath is a new benchmark that asks whether artificial intelligence can make genuine mathematical discoveries. The team collected more than 100 mostly unsolved problems across eight areas of computational and applied mathematics. Each problem is chosen so that finding a solution is hard, but checking a proposed solution can be done quickly by a computer.
The benchmark focuses on three kinds of problems. One kind asks for a closed‑form expression — a neat formula — that can be checked by comparing it to a high‑precision numerical value. A second asks for constructions or optimized objects that improve on a published baseline. A third asks for an example whose existence is unknown but can be validated by confirming it meets all required properties. These choices exploit a “generator–verifier” gap: it is difficult to generate new candidates, but easy to verify them.
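To make the generator–verifier gap concrete, here is a minimal sketch of the first kind of check, verifying a candidate closed form against a high-precision reference value. This is not code from the HorizonMath repository; the function name and tolerance are illustrative assumptions.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # work at 50 significant digits

def verify_closed_form(candidate_value: Decimal, reference: Decimal,
                       tol: Decimal = Decimal("1e-40")) -> bool:
    """Accept the candidate iff it matches the reference to within tol.

    Hypothetical verifier: checking is a single high-precision comparison,
    even though finding the closed form in the first place may be hard.
    """
    return abs(candidate_value - reference) < tol

# Toy example: a candidate value for the series sum of 1/2^k (k >= 1),
# whose exact closed form is 1.
candidate = sum(Decimal(1) / Decimal(2) ** k for k in range(1, 140))
print(verify_closed_form(candidate, Decimal(1)))  # True
```

The asymmetry is the point: the verifier runs in microseconds and is deterministic, while producing a correct candidate may require genuine mathematical insight.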
The authors released the full evaluation code and problem files as open source. Each problem ships with a deterministic verifier: either a numeric comparison at high precision or a constraint checker that tests the candidate object. Because the true solutions are unknown, a correct answer cannot simply be recalled from a model's training data. The repository and instructions are available at https://github.com/ewang26/HorizonMath.
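A constraint checker for the second kind of problem, a construction that must beat a published baseline, can be sketched in the same spirit. The scenario below (points in the unit square whose minimum pairwise distance must exceed a baseline) and the function name are hypothetical, not taken from the benchmark's problem set.

```python
from itertools import combinations
from math import dist

def verify_construction(points, baseline_min_dist):
    """Deterministic check: every point lies in the unit square and the
    minimum pairwise distance strictly beats the published baseline."""
    if not all(0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 for x, y in points):
        return False
    min_d = min(dist(p, q) for p, q in combinations(points, 2))
    return min_d > baseline_min_dist

# Toy candidate: the 4 corners of the unit square (min pairwise distance 1.0)
corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(verify_construction(corners, 0.9))  # True: 1.0 > 0.9
```

Because the checker only confirms properties of the submitted object, any candidate that passes is a verified improvement regardless of how it was generated.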
When the authors ran current systems, most state‑of‑the‑art models scored near zero. Using the platform, they found two problems where GPT‑5.4 Pro proposed solutions that improve on the best‑known published results. These proposals are described in the paper, but the authors note that such model outputs are “potential” novel contributions and still require expert human review before being accepted by the mathematical community.