How crowd preferences can teach agents to be safer without explicit safety rules
Researchers show that preference data collected from many people can implicitly encode shared safety principles, and that these principles can be learned and used to make agents act more safely even when no explicit safety signal is given. The work builds on the idea behind Reinforcement Learning from Human Feedback (RLHF), in which preference judgments shape agent behavior, and asks whether safety rules can be recovered from crowd preferences when users pursue different goals but follow similar safety norms.
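To make the idea of learning from preference judgments concrete, here is a minimal sketch that fits a linear reward model to pairwise preferences using the Bradley-Terry model, a common choice in RLHF. The summary above does not say which preference model the authors use, so the model, the three toy features, and the names `bt_prob`, `fit_linear_reward`, and `PAIRS` are all assumptions made for illustration. Two hypothetical users pursue different goals but both prefer behavior that avoids hazard contact.

```python
import math

def bt_prob(r_a, r_b):
    """Bradley-Terry probability that option A is preferred over option B."""
    return 1.0 / (1.0 + math.exp(r_b - r_a))

def fit_linear_reward(pairs, dim, lr=0.5, epochs=200):
    """Fit weights w so that reward(x) = w . x explains the preferences.

    pairs: list of (preferred_features, rejected_features).
    Uses plain gradient ascent on the Bradley-Terry log-likelihood.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for win, lose in pairs:
            r_win = sum(wi * x for wi, x in zip(w, win))
            r_lose = sum(wi * x for wi, x in zip(w, lose))
            p = bt_prob(r_win, r_lose)
            # d/dw log p = (1 - p) * (win - lose)
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (win[i] - lose[i])
    return w

# Hypothetical features: [progress on goal A, progress on goal B, hazard contact]
PAIRS = [
    # User 1 cares about goal A and dislikes hazard contact.
    ([1, 0, 0], [0, 0, 0]),
    ([1, 0, 0], [1, 0, 1]),
    ([0, 0, 0], [0, 0, 1]),
    # User 2 cares about goal B and dislikes hazard contact.
    ([0, 1, 0], [0, 0, 0]),
    ([0, 1, 0], [0, 1, 1]),
    ([0, 0, 0], [0, 0, 1]),
]

weights = fit_linear_reward(PAIRS, dim=3)
print(weights)  # the hazard-contact weight ends up negative
```

In this toy data the users disagree about which goal matters, yet every pair involving hazard contact points the same way, so the fitted weight on that feature is negative. That is the sense in which a shared safety norm can sit inside preferences that otherwise differ.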
The team first examined a simple baseline: learn a reward model from crowd preferences, then add that learned reward to the task reward during training. They found that this direct reward combination has inherent limitations and does not reliably produce safe behavior. To address this, they propose Safe Crowd Preference-based RL, a hierarchical method that separates safety from task goals.
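The direct combination can be sketched as a weighted sum of the two rewards. The example below is only an illustration with made-up numbers: the `weight` parameter, the action names, and the reward values are hypothetical, and the sensitivity it shows is one plausible way such a combination can be fragile, not necessarily the limitation the authors identify.

```python
def combined_reward(task_reward, learned_safety_reward, weight):
    """Direct combination: task reward plus a weighted learned reward."""
    return task_reward + weight * learned_safety_reward

def best_action(actions, weight):
    """Return the name of the action with the highest combined reward."""
    return max(
        actions,
        key=lambda a: combined_reward(a["task"], a["safety"], weight),
    )["name"]

# Hypothetical choices: a fast but unsafe shortcut versus a slower safe detour.
ACTIONS = [
    {"name": "shortcut_through_hazard", "task": 10.0, "safety": -4.0},
    {"name": "safe_detour", "task": 6.0, "safety": 0.0},
]

print(best_action(ACTIONS, weight=0.5))  # shortcut_through_hazard (8.0 vs 6.0)
print(best_action(ACTIONS, weight=2.0))  # safe_detour (2.0 vs 6.0)
```

With a single scalar weight, whether the agent behaves safely depends on how the two reward scales happen to compare in each situation.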
In this hierarchical framework the system extracts safety-aligned skills from the preference data. “Skills” here means reusable behaviors that tend to satisfy the shared safety principles. A higher-level policy then composes those skills with task-focused actions to solve downstream problems while keeping the agent within the learned safety bounds. In plain terms, the system first learns how to be safe and then learns how to complete tasks using those safe building blocks.
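The two-level structure can be illustrated with a toy corridor in which the low-level skills never enter a hazardous cell and a high-level policy only chooses among those skills. Everything here is a hypothetical stand-in: in the paper the skills are extracted from preference data, whereas this sketch hard-codes them, and the names `SAFE_SKILLS`, `high_level_policy`, and `rollout` are invented for the example.

```python
HAZARDS = {3}   # hypothetical unsafe cell in a 1-D corridor
GOAL = 6        # hypothetical goal cell

def _move(pos, step):
    """Move by `step` cells unless that would land on a hazard or overshoot."""
    target = pos + step
    if target in HAZARDS or target > GOAL:
        return pos
    return target

# Low level: a small library of skills that respect the safety constraint.
SAFE_SKILLS = {
    "wait": lambda pos: pos,
    "advance": lambda pos: _move(pos, 1),
    "hop": lambda pos: _move(pos, 2),
}

def high_level_policy(pos):
    """High level: pick the skill whose outcome is closest to the goal."""
    return min(SAFE_SKILLS, key=lambda name: abs(GOAL - SAFE_SKILLS[name](pos)))

def rollout(start=0, max_steps=10):
    """Run the high-level policy and record the visited cells."""
    pos, path = start, [start]
    for _ in range(max_steps):
        if pos == GOAL:
            break
        pos = SAFE_SKILLS[high_level_policy(pos)](pos)
        path.append(pos)
    return path

print(rollout())  # [0, 2, 4, 6]
```

Because the high-level policy can only select from skills that already avoid the hazard, it reaches the goal without needing any safety signal of its own. This mirrors the division of labor described above, in a much simplified form.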
The authors tested the idea in a set of safe reinforcement learning environments and in a preliminary large language model (LLM)-style task in which users had diverse goals but shared safety constraints. In these experiments their approach substantially reduced safety costs, a measure of unsafe behavior, without access to explicit safety rewards. At the same time, task performance was comparable to that of “oracle” methods trained with ground-truth safety signals, that is, methods that did have explicit safety rewards.
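In safe reinforcement learning benchmarks, safety cost is commonly reported alongside task return, with each summed over an episode and then averaged across episodes. The sketch below assumes that convention; the summary does not specify the exact metric, and the function names and sample numbers are hypothetical.

```python
def episode_metrics(steps):
    """Sum task reward and safety cost over one episode.

    steps: list of (task_reward, cost) tuples, one per time step.
    """
    total_return = sum(reward for reward, _ in steps)
    total_cost = sum(cost for _, cost in steps)
    return total_return, total_cost

def mean_metrics(episodes):
    """Average episode return and episode cost over several episodes."""
    results = [episode_metrics(ep) for ep in episodes]
    n = len(results)
    mean_return = sum(ret for ret, _ in results) / n
    mean_cost = sum(cost for _, cost in results) / n
    return mean_return, mean_cost

# Two made-up episodes: the first incurs one unit of cost, the second none.
episodes = [
    [(1.0, 0), (1.0, 1)],
    [(2.0, 0)],
]
print(mean_metrics(episodes))  # (2.0, 0.5)
```

Under this convention, a method that lowers the mean cost while keeping the mean return close to that of an oracle is doing what the experiments above describe.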