Arbiter finds internal contradictions in coding agent “system prompts” using multi‑model checks
System prompts are the long instruction files that tell large language model (LLM) coding agents how to behave. The paper presents Arbiter, a framework for testing those prompts. Arbiter combines an exhaustive, rule‑based checking phase with an open exploration phase that asks many different LLMs to "scour" a prompt and report anything that looks problematic. Applied to three public agent prompts, the method surfaced many real issues, including a confirmed design bug.
In the directed phase, Arbiter breaks a system prompt into contiguous blocks and classifies each block by role (for example, security or tool usage), modality (whether it mandates or prohibits something, or merely offers guidance), and scope (which tools or topics it governs). The framework then runs formal rules that look for interference between block pairs. Built‑in rules include mandate‑prohibition conflict, scope overlap, priority ambiguity, implicit dependency, and verbatim duplication. Many structural checks run as simple predicates with no LLM calls, and pre‑filters reduce the number of block pairs that need expensive checks.
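The directed phase can be sketched as pure predicates over labeled blocks. This is a minimal illustration, not the paper's implementation: the `Block` record, its field names, and the function names are assumptions, but it shows how a mandate‑prohibition check with a scope‑overlap pre‑filter needs no LLM calls at all.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    """Hypothetical minimal model of a classified prompt block."""
    text: str
    role: str          # e.g. "security", "tool-usage"
    modality: str      # "mandate", "prohibition", or "guidance"
    scope: frozenset   # tools or topics the block governs

def scopes_overlap(a: Block, b: Block) -> bool:
    """Cheap pre-filter: only pairs sharing scope need further checks."""
    return bool(a.scope & b.scope)

def mandate_prohibition_conflicts(blocks):
    """Flag pairs where one block mandates what another prohibits.
    Runs as a simple predicate over block pairs -- no LLM calls."""
    hits = []
    for i, a in enumerate(blocks):
        for b in blocks[i + 1:]:
            if not scopes_overlap(a, b):
                continue  # pre-filter skips most pairs
            if {a.modality, b.modality} == {"mandate", "prohibition"}:
                hits.append((a, b))
    return hits

blocks = [
    Block("use TodoWrite ALWAYS", "tool-usage", "mandate",
          frozenset({"TodoWrite"})),
    Block("NEVER use TodoWrite", "tool-usage", "prohibition",
          frozenset({"TodoWrite"})),
    Block("prefer concise diffs", "style", "guidance",
          frozenset({"editing"})),
]
print(len(mandate_prohibition_conflicts(blocks)))  # one conflicting pair
```

Because the check is a plain predicate, it is deterministic and cheap, which is consistent with the paper's claim that most directed findings are statically detectable.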
The undirected phase complements the rules. Arbiter sends a deliberately vague "read this and note what you find interesting" prompt to many different LLMs. Each pass receives the prior findings and is encouraged to explore new territory, and passes use different models to gain complementary viewpoints. Each finding is self‑rated on a four‑level epistemic scale (curious, notable, concerning, alarming). The campaign stops when three models in a row decline to continue.
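The scouring campaign amounts to a loop with a consecutive-decline stopping rule. The sketch below is an assumption-laden illustration, not Arbiter's code: `ask_model` stands in for a real LLM call, and the scripted responses in the demo are invented.

```python
import itertools

# Epistemic levels from the paper's four-point self-rating scale.
SCALE = ("curious", "notable", "concerning", "alarming")

def run_campaign(models, ask_model, max_declines=3):
    """Rotate across models, feeding each pass the prior findings.
    Stop once `max_declines` models in a row decline to continue."""
    findings = []
    declines = 0
    for model in itertools.cycle(models):
        new = ask_model(model, findings)   # prior findings passed along
        if not new:                        # empty result = model declined
            declines += 1
            if declines >= max_declines:
                break
        else:
            declines = 0                   # any finding resets the counter
            findings.extend(new)           # each finding carries a rating
    return findings

# Demo with scripted responses in place of real LLM calls.
responses = iter([
    [("duplicate rule", "notable")],
    [],                                    # one decline
    [("stale tool reference", "concerning")],
    [], [], [],                            # three declines in a row -> stop
])
found = run_campaign(["m1", "m2", "m3"], lambda m, prior: next(responses))
print(len(found))  # → 2
```

The decline counter resets on any new finding, so a single stubborn model cannot end the campaign while others are still producing results.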
The authors ran Arbiter on three public coding‑agent prompts: ClaudeCode (Anthropic, 1,490 lines), Codex CLI (OpenAI, 298 lines), and Gemini CLI (Google, 245 lines). Across the undirected scouring passes they report 152 findings, and the directed analysis of ClaudeCode produced 21 labeled interference patterns. For ClaudeCode those patterns included four critical direct contradictions (for example, a "use TodoWrite ALWAYS" mandate that conflicts with a separate "NEVER use TodoWrite" prohibition), thirteen scope overlaps, two priority ambiguities, and two implicit dependencies. About 95% of those directed findings were statically detectable without relying on an LLM. One scourer finding pointed to structural data loss in Gemini CLI's memory system; the issue was filed and the vendor patched the symptom, though Arbiter's analysis suggested a deeper schema‑level cause remained.