Simple real‑time monitor flags unsafe LLM outputs by thresholding a verifier signal
Large language models (LLMs) can still produce incorrect or harmful text after training. This paper studies a simple online monitor that watches a stream of safety signals and raises an alarm as soon as the signal indicates danger. The monitor turns a verifier model’s score into a yes/no alarm by comparing the score to a single threshold that is set by a calibration procedure called risk control.
The monitor works step by step. At each generation step t it reads a safety signal s_t, for example the probability assigned by an external verifier that the output so far is safe. It raises an alarm at the first step k where s_k falls below a threshold λ. The threshold is chosen on a held‑out calibration set so that a chosen error rate is controlled. The paper studies two calibration styles: conformal risk control, which controls the expected false alarm rate, and a high‑probability method using an upper confidence bound (UCB) based on the Hoeffding‑Bentkus inequality, which gives a stronger guarantee at the cost of a more conservative threshold (a lower λ, since alarms fire only when s_k < λ).
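The stopping rule and the conformal calibration step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the function names `first_alarm` and `calibrate_crc` are hypothetical, and the sketch assumes that the calibration set contains only runs labeled safe, that scores lie in [0, 1], and that the controlled loss is the 0/1 false alarm indicator. Under those assumptions, conformal risk control picks the largest λ for which (false alarms + 1) / (n + 1) stays at or below the target level α.

```python
import math
from typing import Optional, Sequence


def first_alarm(signals: Sequence[float], lam: float) -> Optional[int]:
    """Return the first 0-based step k with signals[k] < lam, or None if no alarm fires."""
    for k, s in enumerate(signals):
        if s < lam:
            return k
    return None


def calibrate_crc(safe_runs: Sequence[Sequence[float]], alpha: float) -> float:
    """Sketch of conformal risk control for the false alarm rate.

    Assumes every run in safe_runs is non-empty and labeled safe, so any alarm
    on it is a false alarm. Returns the largest lam such that
    (number of false alarms at lam + 1) / (n + 1) <= alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(safe_runs)
    # A safe run triggers an alarm at lam exactly when its minimum score is below lam.
    mins = sorted(min(run) for run in safe_runs)
    # Number of false alarms we can afford among the n calibration runs.
    budget = math.floor(alpha * (n + 1)) - 1
    if budget < 0:
        # Too few calibration runs for this alpha: never alarm (scores are >= 0).
        return 0.0
    # With lam equal to the (budget + 1)-th smallest minimum, at most `budget`
    # runs have a minimum strictly below lam.
    return mins[budget]
```

For instance, with nine safe calibration runs and α = 0.2, the budget is one false alarm, so the threshold lands on the second-smallest per-run minimum score. A high-probability variant would replace the (n + 1) correction with an upper confidence bound on the empirical false alarm rate, which can only push the returned threshold down.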
The authors test the approach on two safety tasks: factual correctness in mathematical reasoning and detection of harmful or malicious content in red‑teaming conversations. For math they use the MATH dataset and two generator LLMs: Claude Haiku 4.5, which solves about 90% of problems, and Mistral‑7B‑Instruct, which solves about 26%. They use OpenAI’s o3‑mini to obtain labels for final answers, and the safety signal is the stepwise probability from a Qwen2.5‑Math process reward model (PRM). The paper measures false alarm rate (the fraction of safe outputs that are flagged), power (the fraction of unsafe outputs that are detected), and detection delay (how early the alarm fires). The authors release code on GitHub.
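The three metrics can be computed from labeled runs as in the sketch below. The names `alarm_step` and `monitor_metrics` are hypothetical, and the delay definition used here (the alarm position as a fraction of the run length, averaged over detected unsafe runs) is an assumption for illustration; the paper may define delay differently.

```python
from typing import Dict, Optional, Sequence


def alarm_step(signals: Sequence[float], lam: float) -> Optional[int]:
    """First 0-based step whose score is below lam, or None if the run is never flagged."""
    return next((k for k, s in enumerate(signals) if s < lam), None)


def _rate(flags: Sequence[bool]) -> float:
    """Fraction of True values; NaN when there is nothing to average."""
    return sum(flags) / len(flags) if flags else float("nan")


def monitor_metrics(
    runs: Sequence[Sequence[float]], is_safe: Sequence[bool], lam: float
) -> Dict[str, float]:
    """False alarm rate, power, and mean relative delay for one threshold."""
    safe_flags, unsafe_flags, delays = [], [], []
    for signals, safe in zip(runs, is_safe):
        k = alarm_step(signals, lam)
        if safe:
            safe_flags.append(k is not None)  # any alarm on a safe run is a false alarm
        else:
            unsafe_flags.append(k is not None)  # an alarm on an unsafe run is a detection
            if k is not None:
                # Assumed delay definition: 1-based alarm position over run length.
                delays.append((k + 1) / len(signals))
    return {
        "false_alarm_rate": _rate(safe_flags),
        "power": _rate(unsafe_flags),
        "mean_relative_delay": sum(delays) / len(delays) if delays else float("nan"),
    }
```

Sweeping `lam` over a grid with this function traces the trade-off between false alarms and power that the calibration step resolves by fixing the false alarm level in advance.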
In the experiments the simple threshold monitor, with thresholds chosen by risk control, performed competitively with a more complex baseline that uses sequential hypothesis testing (the “e‑valuator”). The thresholded monitor also tended to detect failures earlier in the generation process. This matters because a lightweight, provably calibrated monitor can stop or escalate risky generations in real time, which could reduce harm or misinformation before an output is completed.