LiveMedBench: a weekly, contamination‑free medical test set that scores LLMs with automated rubrics
This paper introduces LiveMedBench, a new benchmark designed to test Large Language Models (LLMs) on real clinical problems while avoiding common evaluation flaws. The authors build a system that harvests recent, real-world medical cases from online physician communities every week. They also pair each case with a set of objective, case‑specific scoring rules (a rubric) so model answers can be checked against clinical criteria instead of simple word overlap or another LLM’s judgment.
To create the benchmark, the team gathers threads from four verified medical communities (iCliniq, Student Doctor Network, DXY, and Medlive). They keep only text posts that were published on or after January 1, 2023, are in English or Chinese, include at least one verified physician response, and match clinical code keywords (from ICD, ICF, or ICHI). Posts that need images, video, or audio are excluded. A Multi‑Agent Clinical Curation Framework then turns messy discussion threads into structured clinical cases. One agent converts posts into the SOAP format (Subjective, Objective, Assessment, Plan), and the pipeline uses Retrieval‑Augmented Generation (RAG) — a method that looks up authoritative medical evidence — to cross‑check case details.
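The inclusion criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' actual pipeline: the `Thread` schema, field names, and the substring keyword matcher are all hypothetical stand-ins (the real system matches against ICD/ICF/ICHI-derived terms and verifies physician credentials upstream).

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record for a harvested forum thread; field names are
# illustrative, not the paper's actual schema.
@dataclass
class Thread:
    source: str
    published: date
    language: str
    has_verified_physician_reply: bool
    text: str
    has_media: bool  # images, video, or audio

CUTOFF = date(2023, 1, 1)
SOURCES = {"iCliniq", "Student Doctor Network", "DXY", "Medlive"}
LANGS = {"en", "zh"}  # English or Chinese

def matches_clinical_keywords(text: str, keywords: set[str]) -> bool:
    """Naive substring match standing in for the real matcher, which
    uses keywords derived from ICD, ICF, or ICHI codes."""
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)

def keep(thread: Thread, keywords: set[str]) -> bool:
    """Apply the inclusion criteria described in the paper."""
    return (
        thread.source in SOURCES
        and thread.published >= CUTOFF
        and thread.language in LANGS
        and thread.has_verified_physician_reply
        and not thread.has_media  # text-only posts are kept
        and matches_clinical_keywords(thread.text, keywords)
    )
```

Threads that pass this filter would then go to the Multi-Agent Clinical Curation Framework for SOAP restructuring and RAG-based cross-checking.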
For evaluation, the authors develop an Automated Rubric‑based Evaluation Framework. This system breaks physician responses into many small, verifiable criteria. In total, LiveMedBench pairs 2,756 real cases with 16,702 unique evaluation criteria. The paper reports that these automated rubrics agree with human physician judgments substantially better than the common alternative of using an LLM to score other LLM outputs (the "LLM‑as‑a‑Judge" approach). The process also includes human quality‑assurance steps to keep the cases clinically sound.
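The core idea of rubric-based scoring can be shown in a few lines. This is a minimal sketch under assumed shapes: the `Criterion` type, its `weight` field, and the boolean per-criterion verdicts are hypothetical; in the real framework each verdict would come from an automated checker applied to the model's answer.

```python
from dataclasses import dataclass

# Hypothetical criterion shape; the paper's actual format is richer.
@dataclass
class Criterion:
    description: str
    weight: float = 1.0

def rubric_score(criteria: list[Criterion], satisfied: list[bool]) -> float:
    """Weighted fraction of case-specific criteria a model answer meets.
    `satisfied[i]` is the (assumed) automated verdict for criterion i."""
    assert len(criteria) == len(satisfied)
    total = sum(c.weight for c in criteria)
    met = sum(c.weight for c, ok in zip(criteria, satisfied) if ok)
    return met / total if total else 0.0
```

Because each criterion is small and objectively checkable, the aggregate score reflects clinical coverage rather than surface word overlap or a single judge model's opinion.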
What they found matters for anyone who hopes to use LLMs in medicine. The benchmark covers 38 medical specialties and multiple languages. When the authors tested 38 LLMs, even the best model (reported as GPT‑5.2) scored only 39.2% under the rubric‑based evaluation. They also show that 84% of models drop in performance on cases that post‑date the models' training cutoffs, highlighting the risk that published benchmarks can be contaminated by material the models have already seen. An error analysis points to a key failure mode: 35–48% of mistakes came from poor contextual application, meaning models could state facts but struggled to tailor them to a specific patient. The authors also note that injecting retrieved, up‑to‑date knowledge recovers much of this lost performance, suggesting the shortfall is often missing or stale knowledge rather than a pure reasoning failure.
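The contamination analysis reduces to a simple comparison: for each model, split cases by whether they were published before or after that model's training cutoff, and check whether its score drops on the post-cutoff subset. A minimal sketch under assumptions (mean rubric scores already computed per model and per subset; the function name and input shape are illustrative):

```python
def contamination_signal(scores_pre: dict[str, float],
                         scores_post: dict[str, float]) -> float:
    """Fraction of models whose mean rubric score drops on cases
    published after their training cutoff. A large fraction suggests
    the pre-cutoff results were inflated by contamination.
    Inputs map model name -> mean score; names are illustrative."""
    models = scores_pre.keys() & scores_post.keys()
    drops = sum(1 for m in models if scores_post[m] < scores_pre[m])
    return drops / len(models) if models else 0.0
```

By this kind of tally, the paper's reported figure corresponds to a signal of 0.84 across the 38 evaluated models.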