Top AI chatbots outperform students on a new relativity concept test, but fail on a few image-based items
Researchers tested three leading AI chatbots on a new 21-question concept inventory covering classical (Galilean) relativity and found strong but uneven performance. The inventory, called the Classical Relativity Concept Inventory (CRCI), was not publicly available when the models were tested, so high scores are unlikely to come from memorizing the exact questions. On average, Gemini 3 Flash scored 97% correct, Gemini 3 Pro 89%, and GPT-5.2 73%, compared with 62% for the 267 first-year physics students who took the same test.
To probe how the chatbots reason, the team presented each CRCI item as a high-resolution screenshot and submitted each question to each model 30 times, yielding 1,890 responses in total (3 models × 21 items × 30 trials). The CRCI covers familiar textbook topics: reference frames, Galilean velocity addition, and a basic version of the weak equivalence principle (how motion looks in free fall). The researchers read the chatbots' answers and coded mistakes into three kinds: errors in reading the pictures (visual interpretation), errors in the physics steps (reasoning), and mismatches between the written explanation and the final choice (coordination).
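The paper's own analysis pipeline isn't published; as a rough illustration of the bookkeeping this protocol implies, here is a minimal Python sketch that tallies repeated-trial responses into per-item accuracies. The data layout, the names `responses` and `answer_key`, and the toy numbers are all hypothetical, not taken from the study.

```python
# Hypothetical bookkeeping for the repeated-trial protocol described above:
# each of the 21 CRCI items is submitted 30 times to each model, and every
# response records the answer choice the model selected.

def per_item_accuracy(trials, correct):
    """Fraction of repeated trials that chose the keyed answer."""
    return sum(choice == correct for choice in trials) / len(trials)

# Toy stand-in data: responses[model][item] is the list of choices from
# 30 submissions of that item. None of these values come from the paper.
answer_key = {1: "B"}
responses = {
    "model_a": {1: ["B"] * 28 + ["C"] * 2},  # near-perfect on item 1
    "model_b": {1: ["D"] * 30},              # consistently wrong on item 1
}

for model, items in responses.items():
    for item, trials in items.items():
        acc = per_item_accuracy(trials, answer_key[item])
        print(f"{model} item {item}: {acc:.0%} over {len(trials)} trials")
```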
The headline numbers hide an important pattern. All three models did extremely well on most items, but each failed completely on a small number of questions. The qualitative coding shows that most of those failures came from misreading the drawings or diagrams in the questions, not from a lack of physics knowledge. The paper also notes a structural difference between AI and student mistakes: when a model erred, it tended to pick the same wrong choice repeatedly, while student errors were spread across several distractors.
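That consistency contrast can be made concrete with a simple concentration statistic: among the incorrect responses to one item, the share captured by the single most popular wrong option. The paper does not define this exact measure; the sketch below, with invented error profiles, is just one plausible way to quantify "same wrong choice repeatedly" versus "spread across several distractors".

```python
from collections import Counter

def distractor_concentration(wrong_choices):
    """Among the incorrect responses to one item, the fraction captured by
    the single most popular wrong option (1.0 means every error was the
    same choice; values near 1/k mean errors spread over k distractors)."""
    if not wrong_choices:
        return None  # no errors recorded for this item
    counts = Counter(wrong_choices)
    return max(counts.values()) / len(wrong_choices)

# Invented error profiles for a single item (correct answers excluded):
model_errors = ["D"] * 30                              # same wrong choice every time
student_errors = ["A"] * 40 + ["C"] * 35 + ["D"] * 25  # spread across distractors

print(distractor_concentration(model_errors))    # 1.0
print(distractor_concentration(student_errors))  # 0.4
```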
The study spells out what this means for classroom use. Because failures cluster on particular items, especially those that demand careful visual interpretation, chatbot reliability is item-dependent and hard to predict. That matters for instructors who supervise or grade conceptual assessments, and for researchers who use concept inventories to measure understanding. The authors advise caution when administering tests that include images or require spatial interpretation.