MathNet: a 30K+ multilingual Olympiad dataset and benchmark for math reasoning and math-aware search
Researchers introduce MathNet, a large collection of competition-level math problems and a benchmark suite designed to test how well AI can both solve and find mathematically related problems. The collection contains 30,676 Olympiad-level problems with expert-written solutions. The problems cover many topics, include both LaTeX and natural-language statements, and span contests from 47 countries, written in 17 languages, over multiple decades.
The team built MathNet to support three tasks. The first is Problem Solving: models must generate solutions to hard math problems. The second is Math-Aware Retrieval: systems must find problems that are mathematically equivalent or closely related, not just textually similar. The third is Retrieval-Augmented Problem Solving (RAG): models fetch related problems and use them to produce better answers. To support retrieval work, the authors provide MathNet-Retrieve (40K synthetic problems made from 10K anchor problems, each paired with one equivalent positive and three hard negatives) and MathNet-RAG (70 expert-curated, structurally similar IMO-level problems).
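The MathNet-Retrieve structure described above (one anchor, one equivalent positive, three hard negatives) can be sketched as a simple record type. The field names below are illustrative assumptions, not the dataset's published schema, and the sample problems are invented for demonstration:

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalExample:
    """One hypothetical MathNet-Retrieve training record."""
    anchor: str                # original Olympiad problem
    positive: str              # one mathematically equivalent rewrite
    hard_negatives: list[str] = field(default_factory=list)  # superficially similar look-alikes

    def __post_init__(self):
        # Per the paper's setup, each anchor comes with exactly three hard negatives.
        assert len(self.hard_negatives) == 3, "expected three hard negatives"

ex = RetrievalExample(
    anchor="Prove x^2 + y^2 = 1 has infinitely many rational points.",
    positive="Show the unit circle a^2 + b^2 = 1 contains infinitely many rational points.",
    hard_negatives=[
        "Prove x^2 + y^2 = 3 has no rational points.",
        "Count the lattice points on x^2 + y^2 = 25.",
        "Prove x^2 - y^2 = 1 has infinitely many integer points.",
    ],
)
```

Records of this shape feed contrastive training directly: the retriever is pushed to score the positive above all three negatives, which is hard precisely because the negatives share surface wording with the anchor.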
A key point of MathNet is its focus on mathematical structure. Two statements can look different but mean the same thing — for example, x^2 + y^2 = 1 and √(a^2 + b^2) = 1 describe the same unit-circle condition (square both sides of the second), even though they use different symbols and surface forms. The authors emphasize that common search systems and embedding-based retrievers (models that turn text into numerical vectors for search) often fail to detect such equivalences, because they rely on surface wording or variable names.
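A toy way to see why variable names defeat surface matching: canonicalize variables by order of first appearance before comparing. This is my own minimal sketch, not anything from the paper — and note that it only handles renaming, so the square-root form above would still not match, which is exactly why real math-aware retrieval needs structural understanding rather than string tricks:

```python
import re

def canonicalize(expr: str) -> str:
    """Rename single-letter variables to v0, v1, ... in order of first
    appearance, so alpha-renamed statements compare equal.
    A toy sketch: it ignores operators, functions, and structure."""
    mapping: dict[str, str] = {}

    def repl(m: re.Match) -> str:
        name = m.group(0)
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]

    # Strip spaces, then rename each standalone lowercase letter.
    return re.sub(r"\b[a-z]\b", repl, expr.replace(" ", ""))

print(canonicalize("x^2 + y^2 = 1"))  # v0^2+v1^2=1
print(canonicalize("a^2 + b^2 = 1"))  # v0^2+v1^2=1 — identical after renaming
```

The two unit-circle statements collapse to the same string, while a genuinely different equation (say x^2 - y^2 = 1) does not.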
The paper evaluates 27 state-of-the-art models on these tasks. Even leading reasoning models struggle: reported scores include 78.4% for Gemini-3.1-Pro and 69.3% for GPT-5 on the problem-solving task, indicating substantial room for improvement. Embedding models likewise struggle to retrieve truly equivalent problems. The authors show that retrieval-augmented generation can help, but its benefit depends strongly on retrieval quality: one retriever, DeepSeek-V3.2-Speciale, produced gains of up to 12% and the highest benchmark scores when it surfaced structurally aligned neighbors.
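The retrieve-then-solve loop the authors evaluate can be sketched as follows. This is a minimal illustration under my own assumptions — a bag-of-words cosine retriever stands in for the paper's embedding retrievers, and the prompt format is invented — but it makes the dependence on retrieval quality concrete: whatever this step surfaces is what the solver conditions on.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k corpus problems most similar to the query.
    Bag-of-words similarity: prone to the surface-matching failures
    the paper documents for embedding retrievers."""
    q = Counter(query.lower().split())
    ranked = sorted(corpus, key=lambda p: cosine(q, Counter(p.lower().split())), reverse=True)
    return ranked[:k]

def build_rag_prompt(problem: str, corpus: list[str], k: int = 3) -> str:
    """Prepend retrieved neighbors as context before the target problem."""
    context = "\n\n".join(f"Related problem: {p}" for p in retrieve(problem, corpus, k))
    return f"{context}\n\nNow solve: {problem}"

corpus = [
    "Prove x^2 + y^2 = 1 has infinitely many rational points.",
    "Count the lattice points inside a given triangle.",
    "Show every planar graph is 4-colorable.",
]
print(build_rag_prompt("Show x^2 + y^2 = 1 describes the unit circle.", corpus, k=1))
```

If the retriever returns structurally aligned neighbors, the solver gets useful worked examples; if it returns surface-level look-alikes, the extra context can be useless or misleading — which is the paper's point about retrieval quality driving the size of the RAG gains.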