FinTradeBench: a new test that asks language models to combine company filings and market price signals
This paper introduces FinTradeBench, a benchmark that tests whether Large Language Models (LLMs) can reason about both company fundamentals and how a stock actually trades. The authors built 1,400 questions grounded in NASDAQ‑100 companies over a ten‑year window (2015–2025). Questions fall into three types: fundamentals‑focused, trading‑signal‑focused, and hybrid problems that require combining both kinds of information.
To build the benchmark, the team drew on two data sources for each company and quarter: regulatory filings (10‑K and 10‑Q reports filed with the Securities and Exchange Commission, or SEC), from which they extracted accounting indicators, and daily trading data (Open, High, Low, Close, Volume — OHLCV), from which they computed price‑based signals. They selected a compact, interpretable set of fundamentals (for example, return on assets, return on equity, earnings‑to‑price, book‑to‑price, debt‑to‑equity) and trading signals (for example, moving averages, momentum, realized volatility, drawdowns, volume measures). To scale reliably, they used a “calibration‑then‑scaling” pipeline: 150 expert‑written seed questions (50 per category), multi‑model response generation, self‑filtering and numerical audits, and alignment between human and LLM judges.
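To make the price‑based signals concrete, here is a minimal sketch of how such indicators can be computed from a series of daily closing prices. The function names, window choices, and formulas are illustrative assumptions, not the paper's actual implementation:

```python
import math

def moving_average(closes, window):
    """Simple moving average over the last `window` closing prices."""
    return sum(closes[-window:]) / window

def momentum(closes, lookback):
    """Fractional price change over the last `lookback` periods."""
    return closes[-1] / closes[-1 - lookback] - 1.0

def realized_volatility(closes):
    """Sample standard deviation of daily log returns."""
    rets = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return math.sqrt(var)

def max_drawdown(closes):
    """Largest peak-to-trough decline seen so far, as a positive fraction."""
    peak = closes[0]
    worst = 0.0
    for c in closes:
        peak = max(peak, c)
        worst = max(worst, (peak - c) / peak)
    return worst
```

In practice these would be computed over rolling windows for every company and trading day; the sketch only shows the point-in-time definitions.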
The authors evaluated 14 LLMs under two conditions: zero‑shot prompting (the question alone, with no extra documents) and retrieval‑augmented prompting (the model can also consult retrieved documents). They report a clear asymmetry across question types. Retrieval helped substantially on fundamentals questions (about +37% accuracy) and on hybrid questions (about +55%), but it gave little or no benefit on trading‑signal questions, which depend on time‑series price data rather than retrievable text. The paper gives a concrete failure example: on a trading question about a July 2025 pullback in NVIDIA stock, most models missed the pullback component and all failed the “buying opportunity” judgment; only one model (Claude, a proprietary LLM) correctly identified the pullback.
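The difference between the two evaluation conditions comes down to what the model sees in its prompt. A minimal sketch, assuming a simple text‑concatenation prompt format (the function and the exact template are hypothetical, not from the paper):

```python
def build_prompt(question, retrieved_docs=None):
    """Build either a zero-shot or a retrieval-augmented prompt.

    Zero-shot: the question alone. Retrieval-augmented: retrieved
    document excerpts are prepended as context before the question.
    """
    if not retrieved_docs:
        return f"Question: {question}\nAnswer:"
    context = "\n\n".join(retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Under this framing, the reported asymmetry is intuitive: fundamentals questions benefit because the relevant filing text can be retrieved and prepended, while trading‑signal questions gain little because the needed information is a numeric time series, not a passage of text.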