First-hand information for everyone
This paper tests whether AI agents built from large language models can pool private information by trading in a simple prediction market.
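A minimal sketch of what such a setup might look like: each agent holds a private noisy signal about a binary event and trades against a simple automated market maker, nudging the price toward its own belief. The paper's actual mechanism is not shown here; all names are hypothetical, and a stub stands in for the LLM call so the loop is runnable.

```python
import random

def private_belief(signal: float, noise: float = 0.1) -> float:
    """Stub for an LLM-derived probability estimate (assumption)."""
    return min(1.0, max(0.0, signal + random.uniform(-noise, noise)))

def simulate_market(true_prob: float, n_agents: int = 20,
                    rounds: int = 50, step: float = 0.05) -> float:
    price = 0.5  # market starts uninformed
    signals = [true_prob] * n_agents  # each agent perceives this noisily
    for _ in range(rounds):
        agent = random.randrange(n_agents)
        belief = private_belief(signals[agent])
        # Each trade moves the price a small step toward the trader's belief,
        # so private information gradually leaks into the public price.
        price += step * (belief - price)
    return price

if __name__ == "__main__":
    random.seed(0)
    print(f"final market price: {simulate_market(true_prob=0.7):.3f}")
```

Even with this toy rule, the price drifts toward the true probability, which is the information-pooling effect the paper sets out to measure in LLM agents.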
Researchers introduce AVISE (AI Vulnerability Identification and Security Evaluation), a modular, open-source framework for finding security problems in AI systems.
Researchers simulated a simple exchange populated entirely by autonomous large language model (LLM) agents to study how AI forms price expectations.
Researchers introduce MathNet, a large collection of competition-level math problems paired with tests designed to push the limits of AI mathematical reasoning.
AI safety problems sometimes hide across many short interactions. A single conversation or log file can look harmless, but a small set of them, taken together, can reveal harmful behavior.
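The summary does not describe the paper's method, but the core idea can be sketched: give each interaction a weak risk score and only the aggregate across a user's sessions crosses an alert threshold. The keyword scorer, names, and threshold below are all illustrative assumptions standing in for a real classifier.

```python
from collections import defaultdict

RISKY_TERMS = {"exploit", "bypass", "payload"}

def message_risk(text: str) -> float:
    """Tiny stand-in scorer: fraction of risky terms present in a message."""
    words = set(text.lower().split())
    return len(words & RISKY_TERMS) / len(RISKY_TERMS)

def flag_users(logs: list[tuple[str, str]], threshold: float = 0.6) -> set[str]:
    """logs: (user_id, message) pairs. Flag users whose summed risk exceeds threshold."""
    totals = defaultdict(float)
    for user, text in logs:
        totals[user] += message_risk(text)
    return {user for user, total in totals.items() if total > threshold}

logs = [
    ("alice", "how do I write a payload parser?"),
    ("alice", "can I bypass the input check?"),
    ("alice", "is there a known exploit for this?"),
    ("bob", "what's for lunch?"),
]
print(flag_users(logs))  # no single message is alarming; together, alice is flagged
```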
This paper tackles a common failure in agentic multimodal models: they call external tools too often, even when the answer is already available in the context they were given.
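One hedged sketch of a mitigation in this spirit: try to answer from the provided context first, and fall back to the external tool only when confidence is low. `llm_answer_with_confidence` is a hypothetical stand-in for a model call, not the paper's method.

```python
def llm_answer_with_confidence(question: str, context: str) -> tuple[str, float]:
    """Stub: return an answer and a self-reported confidence (assumption)."""
    if question.lower() in context.lower():
        return ("answered from context", 0.9)
    return ("unknown", 0.2)

def answer(question: str, context: str, tool, threshold: float = 0.7) -> str:
    # First attempt to answer from what the model was already given ...
    ans, conf = llm_answer_with_confidence(question, context)
    if conf >= threshold:
        return ans
    # ... and only invoke the external tool when the model is unsure.
    return tool(question)

def web_search(q: str) -> str:  # placeholder external tool
    return f"search result for {q!r}"

print(answer("capital of France", "The capital of France is Paris.", web_search))
```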
This paper checks how large, pre-trained medical “foundation” models behave when asked to find traumatic bowel injury on CT scans.
What happens when large language models (LLMs) face economic choices? This paper tests whether LLMs behave like humans when making such decisions.
This paper introduces LiveMedBench, a new benchmark designed to test Large Language Models (LLMs) on real clinical problems while avoiding contamination from data the models may have seen during training.