Study finds large language models show systematic human-like biases in preferences but more rational beliefs
What happens when large language models (LLMs) face economic choices? This paper tests whether LLMs behave like humans in decisions involving preferences and beliefs. The authors find a clear pattern: on preference questions, more advanced and larger models give answers that look more human-like, often violating classical principles of rational choice. On belief questions, the most advanced large-scale models tend to give more rational, statistically correct answers.
The team adapted experimental questions from cognitive psychology and experimental economics. These are the same kinds of tests researchers use to document human biases. They sent those prompts through application programming interfaces (APIs) to four LLM families: OpenAI’s ChatGPT (including GPT-4 and variants), Anthropic’s Claude, Google’s Gemini, and Meta’s Llama. For each family they compared older and newer model versions, and for the newer versions they compared large-scale and smaller-scale variants.
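To make the methodology concrete, here is a minimal sketch of how one classic preference question might be packaged as chat-style API requests for several model families. The question wording, model version strings, and the `build_request` helper are illustrative placeholders, not the paper's actual prompts or API calls; real provider APIs differ in payload details.

```python
# Hypothetical prompt adapted from the lottery-choice questions used in
# experimental economics (wording is a placeholder, not the paper's prompt).
PROMPT = (
    "You must choose between two options. "
    "Option A pays $3,000 for sure. "
    "Option B pays $4,000 with 80% probability, otherwise nothing. "
    "Which do you choose?"
)

# Placeholder model identifiers for the four families tested in the paper.
MODELS = {
    "openai": "gpt-4",
    "anthropic": "claude-3-opus",
    "google": "gemini-1.5-pro",
    "meta": "llama-3",
}

def build_request(provider: str) -> dict:
    """Build a provider-agnostic chat payload (real APIs vary in shape)."""
    return {
        "model": MODELS[provider],
        "messages": [{"role": "user", "content": PROMPT}],
        # Deterministic sampling makes answers comparable across repeated runs.
        "temperature": 0.0,
    }

requests = {provider: build_request(provider) for provider in MODELS}
```

Sending the same fixed prompt at temperature 0 to every family is one simple way to compare model versions and scales on identical inputs.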
The results are concrete. In preference-based tasks, Claude 3 Opus (an advanced large-scale Claude model) matched human-style answers on four of six questions; Claude 3 Haiku (a smaller advanced Claude) matched on three of six, and the older Claude 2 matched on one of six. For belief-based tasks, Gemini 1.5 Pro (a large advanced Gemini model) answered all ten belief questions correctly, while Gemini 1.5 Flash (the smaller variant) answered five and Gemini 1.0 Pro (the older version) answered two. The authors also report variation across model families: for example, Gemini’s preference answers were more human-like than ChatGPT’s, and Meta Llama’s belief answers were less rational than GPT’s.
The paper also re-runs two economics experiments. In a forecasting task that follows Afrouzi et al. (2023), advanced small-scale models such as GPT-4o, Claude 3 Haiku, and Gemini 1.5 Flash acted like human forecasters and overestimated how persistent a process was. Their larger-scale counterparts produced forecasts closer to the true persistence. In a stock-investment task following Bose et al. (2022), large-scale models (GPT-4, Claude 3 Opus, Gemini 1.5 Pro) made investment choices that depended more on the visual salience of price paths, a human-like bias. Note that six of the twelve models tested could not process graphical inputs, so those tasks were not run on every model.
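The persistence bias in the forecasting task can be illustrated with a small sketch. In an AR(1) process, the optimal mean forecast k steps ahead is the current value scaled by the persistence parameter raised to the k-th power, so a forecaster who overestimates persistence expects the process to revert to its mean too slowly. The specific parameter values below are illustrative assumptions, not numbers from Afrouzi et al. (2023).

```python
import random

def simulate_ar1(rho, n, sigma=1.0, seed=0):
    """Simulate an AR(1) process: x_t = rho * x_{t-1} + eps_t."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(rho * x[-1] + rng.gauss(0.0, sigma))
    return x

def k_step_forecast(x_t, rho, k):
    """Mean forecast of an AR(1) k steps ahead from current value x_t."""
    return rho ** k * x_t

# Illustrative parameters: true persistence vs. an overestimated one.
true_rho = 0.5
biased_rho = 0.8

x = simulate_ar1(true_rho, 200)
x_t = x[-1]

rational = k_step_forecast(x_t, true_rho, 5)   # reverts quickly to the mean
biased = k_step_forecast(x_t, biased_rho, 5)   # reverts too slowly

# Whenever x_t != 0, the biased forecast stays farther from the long-run
# mean of zero: |biased| > |rational|.
```

Overestimating persistence (0.8 rather than 0.5) keeps forecasts anchored near the last observation, the human-like pattern the small advanced models reproduced.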