This paper looks at how we check the checkers for long-form question-answering systems. The authors focus on ScholarQA-CS2, a benchmark for
Large language models can reason in impressive ways, but they also make systematic reasoning mistakes that are hard to fix with broad retraining.
This paper introduces LieCraft, a new evaluation framework and sandbox for measuring deception in large language models (LLMs). In plain ter
This paper introduces SAHOO, a practical framework to watch and control subtle shifts in behavior when machine learning systems update themselves.
Large language model (LLM) agents often claim they called a tool or read a webpage when they did not. This paper introduces NabaOS, a practi