Study shows simple human preferences can hide problems when judging long scientific answers
This paper looks at how we check the checkers for long-form question-answering systems. The authors focus on ScholarQA-CS2, a benchmark for systems that pull papers and web documents to write long, research-style answers. They validate the benchmark using human judgments and then probe whether common meta-evaluation methods really measure the things experts care about.
The team first collected human pairwise preference judgments. Pairwise preference means an annotator sees two system answers for the same question and picks which one they prefer. They also ran and analyzed automatic evaluations that use large language models (LLMs) as judges. To dig deeper, the researchers added metric-by-metric human annotations and experiments that varied annotator expertise. This let them study how the way judgments are gathered affects agreement between humans and automatic metrics.
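The pairwise setup described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the judgment tuples and system names are invented, and it simply aggregates per-question winners into per-system win rates, the usual way pairwise preferences become system-level comparisons.

```python
# Hypothetical sketch: turning pairwise preference judgments into
# per-system win rates. The data and names are illustrative only.
from collections import defaultdict

# Each judgment: (question_id, system_a, system_b, winner)
judgments = [
    ("q1", "sysA", "sysB", "sysA"),
    ("q2", "sysA", "sysB", "sysB"),
    ("q3", "sysA", "sysB", "sysA"),
]

wins = defaultdict(int)
comparisons = defaultdict(int)
for _, sys_a, sys_b, winner in judgments:
    comparisons[sys_a] += 1
    comparisons[sys_b] += 1
    wins[winner] += 1

# Win rate: fraction of comparisons each system won
win_rate = {s: wins[s] / comparisons[s] for s in comparisons}
print(win_rate)
```

With the toy data above, sysA wins 2 of its 3 comparisons and sysB wins 1 of 3, which is exactly the kind of system-level average the paper says overall preferences capture well.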
At a high level, the paper compares two ways of judging answers. One is overall preference: which whole answer is better. The other is metric-wise scoring: separate checks for things like whether individual claims are supported by cited sources (claim verification), what fraction of cited claims the sources actually back and what fraction of the answer's claims carry a supporting citation (citation precision and citation recall), and how much of the needed content the answer covers (rubric coverage and answer relevance). The authors explain that LLM judges compute these metric scores, while humans can give either overall preferences or explicit metric-level judgments.
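The citation metrics can be made concrete with a small sketch. This is not the paper's implementation; it assumes each claim in an answer has already been labeled (e.g., by an LLM judge) with whether it cites a source and whether that source supports it, and computes one common reading of precision and recall over those labels.

```python
# Illustrative sketch (assumed labeling, not the paper's code): each claim
# records whether it cites a source and whether a cited source backs it.
claims = [
    {"cited": True,  "supported": True},   # cited and backed by its source
    {"cited": True,  "supported": False},  # cited, but the source doesn't back it
    {"cited": False, "supported": False},  # claim with no citation at all
    {"cited": True,  "supported": True},
]

cited = [c for c in claims if c["cited"]]
supported_cited = [c for c in cited if c["supported"]]

# Citation precision: of the claims that cite a source, how many does it back?
precision = len(supported_cited) / len(cited) if cited else 0.0

# Citation recall: of all claims in the answer, how many are backed by a citation?
recall = len(supported_cited) / len(claims) if claims else 0.0

print(precision, recall)
```

Here precision is 2/3 (one cited claim is unsupported) and recall is 1/2 (two of four claims are backed), showing how the two metrics penalize different failure modes.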
Their main findings are practical. Overall human pairwise preference works well for comparing systems at the system level — for example, saying system A is better than system B on average. But it is often too coarse to judge specific metrics or individual answers. For fine-grained evaluation, explicit metric-wise human annotations are necessary. They also find that annotator expertise matters: more expert annotators assess metrics differently than less expert ones. Across different LLM judges, these patterns held up, but human subjectivity remained a major source of variation.