New method predicts ordered, executable skill plans to help LLM agents solve complex tasks
This paper introduces SkillComposer, a method that helps large language model (LLM) agents pick not just which skills to use but also how many and in what order. The authors call this structured skill composition. Instead of returning an unordered list of candidate skills or exposing the agent to every option, SkillComposer generates a single, ordered plan of skill identifiers that can be loaded directly into an agent.
To build SkillComposer, the team formalized the problem as task-conditioned skill sequence prediction. They trained a model that decodes skill identifiers one at a time, with each choice influencing the next. Decoding is constrained so that every output corresponds to a real, executable skill from the library, and the sequence ends with a stop symbol. To create supervision for single- and multi-skill compositions, the authors built a training set from a real, human-curated skill library, using a skill dependency graph together with layered synthesis and filtering. The paper gives a concrete example from the library of K = 196 skills: to find stations that flooded, the model might predict (nws-flood-thresholds, usgs-data-download, flood-detection).
At a high level, SkillComposer works by conditioning each prediction on the task description, the available skill library, and the skills already selected. Because the model generates skills autoregressively (one after another), the subset of skills, the number of skills, and their order emerge together in a single pass. The resulting plan is inspectable and executable: each generated token is a skill identifier that an agent can load and run.
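The decoding loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `STOP` symbol, the `compose_skills` function, the greedy selection rule, the no-repeat restriction, and the hand-written `toy_score` function (standing in for the trained model) are all assumptions made for illustration. Only the three flood-related skill names come from the paper's example.

```python
from typing import Callable, List, Sequence

STOP = "<stop>"  # hypothetical stop symbol, not the paper's actual token

# A scorer rates one candidate given the task and the skills chosen so far.
Scorer = Callable[[str, Sequence[str], str], float]


def compose_skills(
    task: str, library: Sequence[str], score: Scorer, max_len: int = 8
) -> List[str]:
    """Greedy constrained decoding over skill identifiers (illustrative only)."""
    plan: List[str] = []
    for _ in range(max_len):
        # Constraint: only identifiers from the library, or STOP, are allowed.
        # Assumption of this sketch: a skill is not selected twice.
        candidates = [s for s in library if s not in plan] + [STOP]
        # Each choice is conditioned on the task and the prefix chosen so far.
        best = max(candidates, key=lambda c: score(task, plan, c))
        if best == STOP:
            break
        plan.append(best)
    return plan


def toy_score(task: str, prefix: Sequence[str], candidate: str) -> float:
    """Hand-written stand-in for a trained model: prefers a fixed target order."""
    target = ["nws-flood-thresholds", "usgs-data-download", "flood-detection"]
    wanted = target[len(prefix)] if len(prefix) < len(target) else STOP
    return 1.0 if candidate == wanted else 0.0


library = [
    "flood-detection",
    "hypothetical-unrelated-skill",  # made-up distractor for the sketch
    "nws-flood-thresholds",
    "usgs-data-download",
]
plan = compose_skills("find stations that flooded", library, toy_score)
print(plan)
# ['nws-flood-thresholds', 'usgs-data-download', 'flood-detection']
```

Because the candidate set is rebuilt from the library at every step, the sketch cannot emit an identifier that does not exist, and the plan's length is decided by when the stop symbol wins rather than by a fixed top-k cutoff.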
The authors evaluate SkillComposer in two ways. First, they measure composition quality on held-out and synthetic evaluation sets to see whether the predicted plan matches the target subset, count, and order. Second, they test downstream task success on a benchmark called SkillsBench using two production-grade coding agents (GPT-5.2-Codex and Gemini-3-Pro-Preview). On those agents, SkillComposer raised the pass rate by 23.1 and 18.2 percentage points over a no-skill baseline. The method also outperformed a top-3 retrieval baseline and matched a gold-skill retrieval upper bound while using fewer prompt tokens.
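To make the three composition-quality criteria concrete, here is a small sketch of how a predicted plan could be compared with a target plan. The function name `composition_metrics` and the metric names are hypothetical, and the paper may define its metrics differently; the sketch only shows that subset, count, and order are separate checks.

```python
from typing import Dict, Sequence


def composition_metrics(
    predicted: Sequence[str], target: Sequence[str]
) -> Dict[str, bool]:
    """Compare a predicted skill plan with a target plan (illustrative only)."""
    return {
        # Same skills chosen, ignoring order.
        "subset_match": set(predicted) == set(target),
        # Same number of skills.
        "count_match": len(predicted) == len(target),
        # Same skills in exactly the same order.
        "order_match": list(predicted) == list(target),
    }


target = ["nws-flood-thresholds", "usgs-data-download", "flood-detection"]
# A prediction with the right skills but the first two swapped.
predicted = ["usgs-data-download", "nws-flood-thresholds", "flood-detection"]

metrics = composition_metrics(predicted, target)
print(metrics)
# {'subset_match': True, 'count_match': True, 'order_match': False}
```

In this example the plan contains the right skills and the right number of them but fails the order check, which is the kind of error an unordered retrieval list cannot even express.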