New benchmark shows AI still struggles to write PhD‑level 3D computer‑vision code
Researchers introduce GeoCodeBench, a new benchmark designed to test whether large language models can write the kind of precise code used in modern 3D computer vision. The tasks are real “fill‑in‑the‑function” problems taken from recent research papers. Each problem gives a function skeleton and the related paper text. Models must complete the implementation, which is then checked against strict unit tests run in a sandbox.
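To make the task format concrete, here is a minimal sketch of what a "fill‑in‑the‑function" problem of this kind might look like. The function name, docstring, and body are illustrative stand‑ins, not taken from GeoCodeBench itself; the shown body is one possible completion a model would have to produce.

```python
import math

# Hypothetical skeleton of the sort a model would receive, together with
# one valid completion. A real task would also supply the paper text and
# hide the reference unit tests.
def quaternion_to_rotation_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    Returns the matrix as a list of three row lists.
    """
    # --- body below is what the model must fill in ---
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize for robustness
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ]
```

Tasks like this reward exactly the precision the benchmark targets: a single sign error in one entry still produces a plausible‑looking matrix but fails any orthogonality or round‑trip test.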
To build the benchmark the team first used an automated tool to propose candidate functions from official code repositories. They then had domain experts screen those candidates by hand to pick core 3D geometric components. For every target function the authors wrote unit tests covering varied inputs and edge cases. These tests check tricky geometry, degenerate cases, and invariances, so scoring is fully automatic and reproducible.
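A rough sketch of what such degenerate‑case and invariance tests could look like, using a stand‑in pinhole‑projection function (the function, parameter names, and tolerances here are assumptions for illustration, not actual GeoCodeBench tests):

```python
import math

def pinhole_project(point, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Project a 3D camera-frame point (x, y, z) to pixel coordinates (u, v)."""
    x, y, z = point
    if abs(z) < 1e-12:  # degenerate case: point on the camera plane
        raise ValueError("point lies on the camera plane (z == 0)")
    return (fx * x / z + cx, fy * y / z + cy)

# Invariance test: scaling a point along its viewing ray must not move the pixel.
u1, v1 = pinhole_project((0.2, -0.1, 1.0))
u2, v2 = pinhole_project((0.4, -0.2, 2.0))
assert math.isclose(u1, u2) and math.isclose(v1, v2)

# Degenerate-case test: z == 0 must be rejected, not silently divided through.
try:
    pinhole_project((1.0, 1.0, 0.0))
    raised = False
except ValueError:
    raised = True
assert raised
```

Property‑style checks like these are what make scoring automatic: they do not compare against a single reference output, so any mathematically valid implementation can pass.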
GeoCodeBench groups tasks into two levels. General 3D capability covers fundamentals like coordinate transforms, projections, and basic physics or optics formulas. Research capability covers harder items: implementing new algorithms from papers and routing geometric logic—how researchers combine building blocks into new pipelines. The authors ran eight representative open‑ and closed‑source models across the suite.
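The "general 3D capability" fundamentals can be illustrated with a small coordinate‑transform sketch (again, the names and conventions below are illustrative assumptions, not benchmark code):

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z-axis by theta radians (row lists)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rigid(R, t, p):
    """Map a point between frames with a rigid transform: p' = R @ p + t."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

# Rotate (1, 0, 0) by 90 degrees about z, then translate by (0, 0, 1):
# the point lands at approximately (0, 1, 1).
p = apply_rigid(rot_z(math.pi / 2), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
```

The research tier then asks models to compose exactly these kinds of building blocks into the novel pipelines a paper describes, which is where the benchmark reports the largest failures.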
The results show a large gap between current models and dependable research‑grade code. The best model, GPT‑5, reached only a 36.6% overall pass rate across tasks. Some models did solve individual problems perfectly, in some cases passing every unit test with an approach that differed from the reference code but was still mathematically valid. That points to real problem‑solving potential, even when models do not match the exact reference implementation.
The paper also reports two notable patterns. First, research‑level tasks are markedly harder than general‑purpose geometric tasks, though performance on the two axes is positively correlated. Second, giving a model the entire paper is not always better. Truncating input to the Methods section sometimes led to better performance than feeding the full paper, suggesting current models still struggle with long scientific context.