A safety 'kernel' that the AI cannot turn off enforces actions at runtime
This paper describes the Unfireable Safety Kernel, a separate safety module that runs outside an AI agent so the agent cannot bypass it. The authors argue that putting controls inside an agent’s own process—system prompts, output filters, and guardrail libraries—leaves those controls reachable and changeable by the agent. Their kernel is meant to be an architectural control: a last line of defense that checks and authorizes consequential actions as they happen.
The researchers first set out four properties an authorization layer must have to be truly out of reach of a controlled system. These are process separation (the safety code runs in a different process), pre-action enforcement on a structurally-only path (all actions must go through the kernel before they happen), fail-closed at both the request and system levels (if something goes wrong, the default is to refuse the action), and externalized signed evidence verifiable outside the controlled system’s trust boundary (audit records are cryptographically signed so others can check them). They call this layer execution-time AI alignment, distinct from training-time alignment methods such as Reinforcement Learning from Human Feedback (RLHF) or Constitutional AI, and from inference-time tools like prompts and filters.
They implemented a reference kernel in Rust. The design includes a four-seam defense-in-depth architecture and a “no-kernel, no-agent” lifecycle rule so the agent will not start without a reachable kernel. The system records decisions in an append-only transparency log signed with Ed25519 (a cryptographic signature scheme) under an operator key the kernel does not hold. The authors also applied formal checks: a mathematical check called an SMT theorem using the Z3 solver, and an exhaustive bounded-model check of the production decision function using the Kani tool (four verification harnesses, all passed). A disciplined migration from Python to Rust was validated by exact byte-equivalence on 1,000 deterministic fixtures and identical reject behavior for 17 adversarial test classes. The reference code is available under an Apache-2.0 license on GitHub.