How large language models switch moral frameworks while they reason
This paper studies how large language models (LLMs) organize moral thinking across multiple intermediate steps. The authors introduce the idea of a "moral reasoning trajectory," which is the sequence of ethical frameworks a model invokes as it moves from problem to final answer. They find that models do not stick to a single ethical view: 55.4–57.7% of consecutive reasoning steps involve a switch between frameworks, while only 16.4–17.8% of full trajectories remain consistent with one framework from start to finish.
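The two trajectory statistics can be made concrete with a small sketch. This is not the paper's code; the trajectories below are hypothetical, with each step reduced to a single dominant framework label.

```python
# Sketch of the two trajectory statistics: the fraction of consecutive
# step pairs that switch frameworks, and the fraction of trajectories
# that keep one framework from start to finish.

def switch_rate(trajectories):
    """Fraction of consecutive step pairs whose framework label changes."""
    switches = total_pairs = 0
    for steps in trajectories:
        for a, b in zip(steps, steps[1:]):
            total_pairs += 1
            switches += (a != b)
    return switches / total_pairs

def consistency_rate(trajectories):
    """Fraction of trajectories that use a single framework throughout."""
    consistent = sum(1 for steps in trajectories if len(set(steps)) == 1)
    return consistent / len(trajectories)

# Hypothetical example: the dominant framework at each of four steps.
trajs = [
    ["deontology", "utilitarianism", "utilitarianism", "virtue"],
    ["utilitarianism"] * 4,
    ["care", "deontology", "care", "care"],
]
print(switch_rate(trajs))       # 4 switches out of 9 consecutive pairs
print(consistency_rate(trajs))  # 1 of 3 trajectories is fully consistent
```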
To reach these findings the researchers ran experiments on six LLMs and three established moral benchmarks: MoralStories, ETHICS, and SocialChemistry101. They sampled 400 scenarios from each benchmark (1,200 total) and used a structured prompting method that asks models to give four labeled reasoning steps before a final judgment. A separate, larger model (GPT-OSS-120B) served as a scorer, rating how strongly each of five ethical frameworks (for example, deontology and utilitarianism) was expressed in each step and how coherent the whole trajectory was. The scorer's judgments were checked for agreement with human annotators.
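The pipeline above can be sketched in two small pieces: a structured prompt that requests four labeled steps, and a reduction of the scorer's per-framework scores to one dominant label per step. The step names, prompt wording, and the specific five-framework list are assumptions for illustration; the paper's exact prompt and framework set are not reproduced here.

```python
# Hedged sketch of the structured-prompting and scoring setup.
# FRAMEWORKS and STEP_LABELS are illustrative, not the paper's exact lists.

FRAMEWORKS = ["deontology", "utilitarianism", "virtue ethics",
              "care ethics", "contractualism"]

STEP_LABELS = ["Step 1 (identify the moral issue)",
               "Step 2 (consider relevant principles)",
               "Step 3 (weigh consequences)",
               "Step 4 (integrate and decide)"]

def build_prompt(scenario: str) -> str:
    """Ask the model for four labeled reasoning steps, then a final judgment."""
    steps = "\n".join(f"{label}: ..." for label in STEP_LABELS)
    return (f"Scenario: {scenario}\n\n"
            f"Reason through the scenario in four labeled steps, "
            f"then give a final judgment.\n{steps}\nFinal judgment: ...")

def dominant_framework(step_scores: dict) -> str:
    """Reduce per-framework presence scores for one step to a single label."""
    return max(step_scores, key=step_scores.get)

# Hypothetical scorer output for one reasoning step:
scores = {"deontology": 0.70, "utilitarianism": 0.20, "virtue ethics": 0.05,
          "care ethics": 0.03, "contractualism": 0.02}
print(dominant_framework(scores))  # -> deontology
```

Reducing each step to its dominant framework is what makes the switch-rate and consistency statistics well defined.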
The paper also looks inside model representations. The authors used linear probes (simple classifiers applied to internal activations) to find where framework-specific information appears. They report that this information concentrates at different layers in different models (for example, around layer 63 of 81 for Llama-3.3-70B and around layer 17 of 81 for Qwen2.5-72B). The probes' framework predictions achieved 13.8–22.6% lower Kullback–Leibler divergence than a baseline that simply used the training-set prior. The authors also tested a lightweight activation-steering method to nudge how frameworks are integrated; it reduced framework drift between steps by 6.7–8.9% and strengthened the measured link between trajectory stability and final-answer accuracy.
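The linear-probing idea can be illustrated with a minimal sketch: fit a linear classifier on a layer's activations to predict which framework a reasoning step expresses. Everything here is synthetic and assumed: the activations are random stand-ins, and the probe is a plain softmax regression trained by gradient descent, not the authors' implementation.

```python
# Minimal linear-probe sketch on synthetic "activations".
import numpy as np

rng = np.random.default_rng(0)
n_frameworks, dim, n = 5, 64, 500

# Synthetic stand-in for hidden activations: each framework label is a
# direction in activation space plus Gaussian noise.
directions = rng.normal(size=(n_frameworks, dim))
labels = rng.integers(0, n_frameworks, size=n)
acts = directions[labels] + 0.5 * rng.normal(size=(n, dim))

def train_probe(X, y, lr=0.5, steps=300):
    """Multinomial logistic-regression probe via batch gradient descent."""
    W = np.zeros((X.shape[1], n_frameworks))
    onehot = np.eye(n_frameworks)[y]
    for _ in range(steps):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - onehot) / len(X)
    return W

W = train_probe(acts, labels)
acc = (np.argmax(acts @ W, axis=1) == labels).mean()
print(f"probe accuracy on synthetic activations: {acc:.2f}")
```

Running the same probe at every layer and comparing accuracy (or divergence from a scorer's labels) is what lets one say at which depth framework information concentrates.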