Study finds LLMs generate reasonable but narrower research ideas than humans
This paper studies how research ideas produced by large language models (LLMs) differ from ideas written by human researchers. Instead of judging single ideas for novelty or feasibility, the authors compare the overall distribution of idea types. They ask whether LLMs favor certain ways of spotting problems and certain kinds of solutions more often than the ideas that actually appear in published papers do.
To do this, the team built a large evaluation setup grounded in real papers. For each published paper, they used an LLM-assisted pipeline to extract the paper’s core idea and then reverse-engineered 4–8 closely related prior works that likely inspired that idea. They gave LLMs only the titles and abstracts of those prior works and asked the models to generate a new idea in a structured format with a motivation (why the work is needed) and a method (how to do it). The human endpoint in each case was the real paper’s idea as originally published. The full human corpus contains 11,683 extracted ideas from ICLR, ICML and NeurIPS papers from 2023–2026 and Nature Communications papers from 2023–2025, covering about 71 disciplines.
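The generation setup described above can be sketched as a small data model plus a prompt builder. This is a minimal illustration, not the authors' pipeline: the class names `PriorWork` and `Idea`, the function `build_prompt`, and the prompt wording are all hypothetical, and the article does not say how the models were actually prompted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PriorWork:
    """One related prior work; only title and abstract are shown to the model."""
    title: str
    abstract: str


@dataclass
class Idea:
    """Structured idea format: why the work is needed and how to do it."""
    motivation: str
    method: str


def build_prompt(prior_works: List[PriorWork]) -> str:
    """Assemble a generation prompt from 4-8 prior works (hypothetical wording)."""
    if not 4 <= len(prior_works) <= 8:
        raise ValueError("expected between 4 and 8 prior works")
    # Number each prior work and show its title followed by its abstract.
    entries = [
        f"[{i}] {work.title}\n{work.abstract}"
        for i, work in enumerate(prior_works, start=1)
    ]
    context = "\n\n".join(entries)
    return (
        "Below are the titles and abstracts of related prior works.\n\n"
        f"{context}\n\n"
        "Propose one new research idea with two fields:\n"
        "Motivation: why the work is needed\n"
        "Method: how to do it"
    )
```

The human counterpart for each prompt would be an `Idea` extracted from the published paper, so both sides share the same two-field structure and can be labeled the same way.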
The authors introduce a simple two-axis “research-taste” taxonomy to label each idea. One axis describes the opportunity pattern: the kind of gap or problem the idea targets (examples include puzzles or contradictions, explanation gaps, evidence gaps, or bridge opportunities that link separate literatures). The other axis describes the method paradigm: the style of contribution, such as synthesis/unification (integrating previous approaches), formal derivation, empirical mapping, or building a system or artifact. The authors built the taxonomy from research-guidance documents issued by agencies such as NSF, NIH, AHRQ and DARPA, then refined it on held-out papers. To scale the labeling, they used an LLM annotator validated against human judgments.
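Once every idea carries a taxonomy label, comparing LLM and human ideas at the distribution level reduces to comparing two sets of label frequencies. The sketch below shows one way this could be done for a single axis, using Shannon entropy as a rough measure of how broad a label distribution is and total variation distance as a measure of how far apart two distributions are. Both metrics are assumptions chosen for illustration, not necessarily those used in the paper, and the label lists are made-up toy data, not results from the study.

```python
from collections import Counter
from math import log2


def distribution(labels):
    """Turn a list of category labels into a dict of relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def entropy(dist):
    """Shannon entropy in bits; lower values mean a narrower distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Toy opportunity-pattern labels (illustrative only, not data from the paper).
human_labels = ["puzzle", "explanation_gap", "evidence_gap", "bridge"]
llm_labels = ["bridge", "bridge", "bridge", "evidence_gap"]

human_dist = distribution(human_labels)
llm_dist = distribution(llm_labels)

print(f"human entropy: {entropy(human_dist):.3f} bits")
print(f"LLM entropy:   {entropy(llm_dist):.3f} bits")
print(f"total variation distance: {total_variation(human_dist, llm_dist):.3f}")
```

In this toy case the LLM labels concentrate on one category, so their entropy is lower than that of the evenly spread human labels. The same functions could be applied to the method-paradigm axis or to joint (opportunity, method) pairs.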