MEMO: using a memory bank to make multi-turn LLM games more stable and stronger
Researchers introduced MEMO, a method that makes long, multi-turn games played by large language model (LLM) agents both stronger and more consistent. These games suffer from big run-to-run swings: a small early mistake can change the whole interaction and make win rates unreliable. MEMO reduces that instability by changing what the models are given to think about at inference time, rather than changing the model weights.
At a high level, MEMO is a self-play framework that couples two ideas: retention and exploration. Retention means keeping a persistent memory bank that stores structured lessons extracted from past self-play games. The memory holds short summaries and actionable insights that can be injected back into future games as priors. Exploration means evolving a pool of prompts and contexts in tournament-style self-play. The system tests many candidate contexts, scores them by performance and uncertainty, and keeps the most reliable ones.
How it works in plain terms: MEMO proposes a set of candidate contexts (prompts plus memory priors), runs them against a baseline agent in many self-play games, and rates each candidate with TrueSkill, a Bayesian rating system that yields both a skill estimate and an uncertainty; MEMO prefers contexts that are strong and have low uncertainty. The framework also uses “prioritized replay” to revisit rare or decisive game states, and basic create/read/update/delete operations to manage memory entries. The authors tested MEMO on five text-based games drawn from TextArena and SPIN-Bench, using GPT-4o-mini and Qwen-2.5-7B-Instruct models with 2,000 self-play games per task.
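The "strong and low uncertainty" preference can be illustrated with a simplified stand-in for TrueSkill's conservative rating (skill estimate minus three standard deviations). This sketch is an assumption, not MEMO's actual scoring code: it tracks only wins and games per candidate context and ranks by a lower confidence bound on the win rate under a uniform Beta(1, 1) prior.

```python
import math

def conservative_score(wins: int, games: int) -> float:
    """Posterior mean of the win rate minus 3 standard deviations,
    under a Beta(1, 1) prior -- a simplified analogue of TrueSkill's
    conservative mu - 3*sigma rating."""
    a, b = wins + 1, games - wins + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean - 3 * math.sqrt(var)

def select_contexts(records: dict[str, tuple[int, int]], keep: int = 2) -> list[str]:
    """Keep the candidates with the highest conservative score, so a
    strong, well-tested context beats a lucky but noisy one."""
    ranked = sorted(records, key=lambda c: conservative_score(*records[c]),
                    reverse=True)
    return ranked[:keep]

# Hypothetical tallies of (wins, games) for three candidate contexts:
records = {
    "ctx_a": (60, 100),  # 60% win rate over many games: strong, low uncertainty
    "ctx_b": (7, 10),    # 70% raw win rate, but only 10 games: high uncertainty
    "ctx_c": (30, 100),  # clearly weak
}
print(select_contexts(records))  # ctx_a ranks first despite ctx_b's higher raw rate
```

The point of the lower bound is visible in the example: `ctx_b` has the best raw win rate, yet `ctx_a` ranks ahead of it because its estimate is backed by ten times as many games.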
The results are concrete. Across those five games, MEMO raised mean win rates from 25.1% to 49.5% for GPT-4o-mini and from 20.9% to 44.3% for Qwen-2.5-7B-Instruct. It also cut run-to-run variance substantially: the reported relative standard error dropped from roughly 43.3% to about 6.4%, close to a sevenfold reduction. MEMO is also more sample-efficient than a reinforcement-learning self-play baseline in some settings: on Kuhn Poker it reached a 60% win rate within 2,000 games, versus the 38,000 games the RL baseline needed.