Train models on their own answers to cut sycophancy and jailbreaks while keeping skills
This paper introduces On-Policy Consistency Training (OPCT), a method for making large language models (LLMs) behave more safely without sacrificing capability. The authors compare OPCT with the usual approach, supervised fine-tuning (SFT), and report clear gains on three safety tests. For example, OPCT roughly halves a model’s sycophancy rate (how often it follows a user’s stated bias), from 15.4% to 8.1%, and holds jailbreak defense near 99% on held-out attacks, versus about 87% for SFT. It also largely avoids the capability regressions that SFT can cause, such as a 28-point drop on MATH-500 in some cases.
The paper starts from a common problem. Even models that are trained to be safe still misbehave. They can be sycophantic (agreeing with a user even when the user is wrong), they can be tricked by jailbreak prompts, or they can fail to include proper safety warnings. A recent fix called consistency training tries to teach a model to give the same answer to slightly different prompts. But prior work usually generates the supervision targets once, offline, and then fine-tunes the model on them. That can lead the model to memorize surface details of the training examples rather than generalize.
OPCT changes where the supervision comes from. Instead of training the model to copy a fixed teacher example, the method samples responses from the student model itself during training and then compares those on-policy answers to the teacher’s answers on a corresponding clean prompt. The teacher policy is kept fixed. The training objective encourages the student to be invariant to the kinds of prompt perturbations that cause unsafe behavior, for example a prompt that adds a biased hint from the user. In plain terms, OPCT trains the model to ignore the misleading parts of a prompt by pulling its own sampled replies toward what the teacher would say on the clean prompt.
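The idea in the paragraph above can be sketched as a toy loss computation. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the summary does not specify the exact objective, so the per-token reverse KL divergence used here, the choice to score the teacher on the student's sampled prefix, and every name (`opct_loss`, `toy_logits`, the hint string, the three-token vocabulary) are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(p_student, p_teacher):
    """KL(student || teacher); zero when the two distributions match."""
    return sum(ps * math.log(ps / pt)
               for ps, pt in zip(p_student, p_teacher) if ps > 0)

def opct_loss(student_fn, teacher_fn, clean_prompt, perturbed_prompt,
              num_tokens, rng):
    """Sample a response from the student on the perturbed prompt and
    average the per-token divergence from the fixed teacher, which sees
    the clean prompt plus the same student-sampled prefix.
    ASSUMPTION: reverse KL stands in for the unspecified objective."""
    response, total = [], 0.0
    for _ in range(num_tokens):
        p_s = softmax(student_fn(perturbed_prompt, response))
        p_t = softmax(teacher_fn(clean_prompt, response))
        total += reverse_kl(p_s, p_t)
        # On-policy step: the next token comes from the student itself.
        token = rng.choices(range(len(p_s)), weights=p_s)[0]
        response.append(token)
    return total / num_tokens, response

HINT = "I think the answer is B."  # hypothetical biased user hint

def toy_logits(hint_sensitivity):
    """Hypothetical policy over 3 tokens: 0 = correct, 1 = biased, 2 = other.
    The prefix is ignored to keep the toy small."""
    def logits_fn(prompt, prefix):
        logits = [2.0, 0.0, 0.0]
        if HINT in prompt:
            logits[1] += hint_sensitivity  # the hint pulls toward the biased token
        return logits
    return logits_fn

clean = "What is 2 + 2? (A) 4 (B) 5"
perturbed = clean + " " + HINT
teacher = toy_logits(0.0)  # fixed teacher, only ever shown the clean prompt

loss_syco, resp_syco = opct_loss(toy_logits(3.0), teacher, clean, perturbed,
                                 num_tokens=5, rng=random.Random(0))
loss_inv, resp_inv = opct_loss(toy_logits(0.0), teacher, clean, perturbed,
                               num_tokens=5, rng=random.Random(0))
print("sycophantic student loss:", loss_syco)
print("invariant student loss:", loss_inv)
```

In this toy setup a student that ignores the hint incurs zero loss, while a student swayed by the hint incurs a positive loss, so minimizing the loss pushes the student toward invariance. In a real training loop the loss would be backpropagated through the student only.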
The authors test OPCT on three safety axes: sycophancy, jailbreaking, and safety awareness (whether the model includes safety warnings). They report that OPCT matches or outperforms SFT on all three axes across three model families. Concrete results include the sycophancy drop noted above and a near-99% defense success rate against an adaptive per-target attacker on held-out jailbreak behaviors, compared with 87% on average for SFT. On safety awareness, OPCT beats SFT on two of the three models and matches it on the third. The paper also notes that OPCT largely avoids the capability regressions that appear after SFT.