Researchers teach AI to learn by doing with automatically generated machine‑learning tasks
This paper describes a way to train AI systems by giving them many automatically created machine‑learning problems to solve. The authors build a pipeline that writes task descriptions, picks datasets, makes starter code, and then checks that each task actually runs. They use the resulting problem-solving examples to fine‑tune smaller models so the models get practice at the step‑by‑step work of real research instead of only reading final papers or code.
The pipeline works in stages. First it samples machine‑learning topics and asks a strong model to propose a task and a dataset. It checks that the proposed dataset actually exists by querying the Hugging Face API. Next it generates configuration files and starter code for an execution environment called ML‑Gym. If the task errors at runtime, the system feeds the error messages back into the generator and attempts automatic debugging for a limited number of rounds; tasks that still fail are discarded. The authors run the validated tasks at scale to collect many agent trajectories—sequences of reasoning, code edits, and runs—that record the full iterative process.
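The bounded validate-and-debug loop can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the task representation, and the round limit are all assumptions.

```python
# Hypothetical sketch of the paper's validate-and-debug loop.
# A generated task is run; on failure, the error is fed back to the
# generator for a bounded number of repair rounds, after which the
# task is discarded.

MAX_DEBUG_ROUNDS = 3  # round limit is an assumption, not from the paper

def validate_task(task, generate_fix, run_task, max_rounds=MAX_DEBUG_ROUNDS):
    """Return a runnable task, or None if it cannot be repaired in time."""
    for _ in range(max_rounds + 1):
        error = run_task(task)            # returns None on success
        if error is None:
            return task                   # task validated: keep it
        task = generate_fix(task, error)  # regenerate with error feedback
    return None                           # still failing: discard

# Toy stand-ins: a task that succeeds only after two repair rounds.
def run_task(task):
    return None if task["attempt"] >= 2 else "ImportError: missing dep"

def generate_fix(task, error):
    return {**task, "attempt": task["attempt"] + 1}

print(validate_task({"attempt": 0}, generate_fix, run_task))  # → {'attempt': 2}
```

The key design point is the hard cap on repair rounds: a task that cannot be made to run within the budget is dropped rather than debugged indefinitely, which keeps the pipeline fully automatic.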
To make training data, the team used a powerful model (referred to as GPT‑5) as a teacher to produce trajectories on the synthetic tasks. They then used these trajectories for supervised fine‑tuning of two student models (Qwen3‑4B and Qwen3‑8B). On the ML‑Gym benchmark, which contains 13 machine‑learning challenges, the fine‑tuned student models improved aggregate performance. The paper reports increases in the main aggregate metric (AUP) of about 9% for Qwen3‑4B and 12% for Qwen3‑8B compared with their baselines.