HorizonMath tests AI on over 100 unsolved math problems with automatic checks
HorizonMath is a new benchmark that asks whether artificial intelligence can make genuine mathematical discoveries. The team collected more than 100 mostly unsolved problems across eight areas of computational and applied mathematics. Each problem is chosen so that finding a solution is hard, but checking a proposed solution can be done quickly by a computer.
The benchmark focuses on three kinds of problems. One kind asks for a closed‑form expression — a neat formula — that can be checked by comparing it to a high‑precision numerical value. A second asks for constructions or optimized objects that improve on a published baseline. A third asks for an example whose existence is unknown but can be validated by confirming it meets all required properties. These choices exploit a “generator–verifier” gap: it is difficult to generate new candidates, but easy to verify them.
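To make the generator–verifier gap concrete, here is a minimal sketch of the first kind of check, verifying a candidate closed form against a high-precision reference value. This is not code from the HorizonMath repository; the function name and tolerance are illustrative assumptions.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # work at 50 significant digits

def verify_closed_form(candidate_value: Decimal, reference: Decimal,
                       tol: Decimal = Decimal("1e-40")) -> bool:
    """Accept the candidate iff it matches the reference to within tol.

    Hypothetical verifier: checking is a single high-precision comparison,
    even though finding the closed form in the first place may be hard.
    """
    return abs(candidate_value - reference) < tol

# Toy example: a candidate value for the series sum of 1/2^k (k >= 1),
# whose exact closed form is 1.
candidate = sum(Decimal(1) / Decimal(2) ** k for k in range(1, 140))
print(verify_closed_form(candidate, Decimal(1)))  # True
```

The asymmetry is the point: the verifier runs in microseconds and is deterministic, while producing a correct candidate may require genuine mathematical insight.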
The authors released the full evaluation code and problem files as open source. Each problem ships with a deterministic verifier: either a numeric comparison at high precision or a constraint checker that tests the candidate object. Because the true solutions are unknown, a correct answer cannot simply be recalled from a model's training data. The repository and instructions are available at https://github.com/ewang26/HorizonMath.
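A constraint checker for the second kind of problem, a construction that must beat a published baseline, can be sketched in the same spirit. The scenario below (points in the unit square whose minimum pairwise distance must exceed a baseline) and the function name are hypothetical, not taken from the benchmark's problem set.

```python
from itertools import combinations
from math import dist

def verify_construction(points, baseline_min_dist):
    """Deterministic check: every point lies in the unit square and the
    minimum pairwise distance strictly beats the published baseline."""
    if not all(0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 for x, y in points):
        return False
    min_d = min(dist(p, q) for p, q in combinations(points, 2))
    return min_d > baseline_min_dist

# Toy candidate: the 4 corners of the unit square (min pairwise distance 1.0)
corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(verify_construction(corners, 0.9))  # True: 1.0 > 0.9
```

Because the checker only confirms properties of the submitted object, any candidate that passes is a verified improvement regardless of how it was generated.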
When the authors ran current systems, most state‑of‑the‑art models scored near zero. Using the platform, they found two problems where GPT‑5.4 Pro proposed solutions that improve on the best‑known published results. These proposals are described in the paper, but the authors note that such model outputs are “potential” novel contributions and still require expert human review before being accepted by the mathematical community.