SAHOO: a practical system to detect and limit alignment drift as models improve themselves
This paper introduces SAHOO, a practical framework for monitoring and controlling subtle behavioral shifts that arise when machine learning systems update themselves. The authors' concern is that iterative self-improvement can raise capability while simultaneously moving the system away from its intended goals. SAHOO adds checks that detect changes in meaning, preserve safety constraints, and flag update cycles that undo earlier gains.
SAHOO uses three complementary safeguards. The Goal Drift Index (GDI) is a learned detector that combines four signals into a single score: semantic (meaning), distributional (overall statistics), structural (formatting), and lexical (word choice). Constraint-preservation checks enforce safety-critical rules such as syntactic correctness and avoidance of fabricated facts. A regression-risk measure estimates when a proposed improvement is likely to reverse prior gains. The GDI's component weights were learned from a small calibration set; semantic change was the largest contributor among the reported weights.
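A minimal sketch of how a Goal Drift Index of this shape could combine the four signals into one score. The signal values, weights, and function names below are illustrative assumptions, not the paper's calibrated numbers; the only structural claim taken from the text is that the score is a weighted combination in which the semantic component dominates.

```python
def goal_drift_index(signals, weights):
    """Combine per-signal drift scores into a single weighted GDI score."""
    return sum(weights[name] * signals[name] for name in weights)

# Hypothetical per-signal drift scores in [0, 1] for one update cycle.
signals = {
    "semantic": 0.42,        # shift in meaning between old and new outputs
    "distributional": 0.10,  # change in overall output statistics
    "structural": 0.05,      # change in formatting
    "lexical": 0.08,         # change in word choice
}

# Assumed weights; the semantic weight is set largest to echo the
# reported finding, but the actual learned values are not given here.
weights = {
    "semantic": 0.5,
    "distributional": 0.2,
    "structural": 0.1,
    "lexical": 0.2,
}

gdi = goal_drift_index(signals, weights)
```

In a real deployment the weights would come from the calibration set rather than being hand-set, and the resulting score would be compared against a calibrated threshold.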
The team tested SAHOO on 189 tasks across three domains: code generation (HumanEval), factual accuracy (TruthfulQA), and multi-step math reasoning (GSM8K). Experiments used the Qwen3-8B model as a base. Thresholds and component weights were calibrated on 18 validation tasks (six per domain) with three improvement cycles each. Runs used up to 20 improvement cycles with stopping rules that include small quality changes, high regression risk, failed constraints, or a GDI that exceeds its calibrated threshold.
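The stopping rules above (small quality change, high regression risk, a failed constraint, or a GDI over its calibrated threshold) can be sketched as a single check run once per improvement cycle. All thresholds and names here are illustrative assumptions, not values from the paper.

```python
MAX_CYCLES = 20  # the reported cap on improvement cycles

def should_stop(quality_delta, regression_risk, constraints_ok, gdi,
                min_delta=0.005, max_risk=0.8, gdi_threshold=0.3):
    """Return a stopping reason, or None to continue improving.

    quality_delta   -- change in task quality since the last cycle
    regression_risk -- estimated probability of reversing prior gains
    constraints_ok  -- whether all safety-critical constraints passed
    gdi             -- the Goal Drift Index score for this cycle
    """
    if abs(quality_delta) < min_delta:
        return "quality change too small"
    if regression_risk > max_risk:
        return "regression risk too high"
    if not constraints_ok:
        return "constraint violated"
    if gdi > gdi_threshold:
        return "goal drift exceeds threshold"
    return None  # keep improving
```

A run would call `should_stop` after each cycle and halt either on a non-None reason or after `MAX_CYCLES` iterations, whichever comes first.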
Results show notable capability gains while most constraints remain intact. Improvements were strongest in code and math: code pass@1 rose from 0.672 to 0.795 (an 18.3% relative gain), and math exact-match rose from 0.689 to 0.805 (16.8%). Truthfulness improved more modestly, from 0.678 to 0.704 (+3.8%). Constraint preservation was perfect on the code and math tasks and near-perfect on the truthfulness set (mean 0.9874, standard deviation 0.0547). The paper also reports 170 total constraint violations across the 63 truthfulness tasks, with fabrication and overconfidence the dominant categories.
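For readers checking the arithmetic, the relative gains quoted above follow directly from the before/after scores in the text (the helper name is ours):

```python
def rel_gain(before, after):
    """Relative gain: (after - before) / before."""
    return (after - before) / before

# Scores taken from the reported results.
code_gain = rel_gain(0.672, 0.795)   # code pass@1, ~18.3%
math_gain = rel_gain(0.689, 0.805)   # math exact-match, ~16.8%
truth_gain = rel_gain(0.678, 0.704)  # truthfulness, ~3.8%
```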