A simple offline fix for recommender alignment: weight examples by exp(reward/λ)
This paper proposes a simple post-training method for generative recommender systems that uses only observed user rewards. The authors weight each training example by w = exp(r/λ), where r is the observed reward and λ (lambda) is a temperature that practitioners can tune. Because the method never queries a learned reward model and needs no propensity scores, it is fully offline and avoids common failure modes of reinforcement-learning approaches in recommendation settings.
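The weighting scheme is simple enough to sketch directly. The snippet below (a minimal illustration; the function name and sample rewards are hypothetical, not from the paper) shows how λ controls the spread between high- and low-reward examples: a small λ makes the best examples dominate, while a large λ flattens the weights toward uniform.

```python
import math

def example_weight(r: float, lam: float) -> float:
    """Weight a logged interaction with reward r at temperature lam."""
    return math.exp(r / lam)

rewards = [0.1, 0.5, 1.0]  # hypothetical observed rewards

# Sharp temperature: the top example is weighted far above the worst.
sharp = [example_weight(r, lam=0.5) for r in rewards]

# Mild temperature: weights stay close to uniform.
mild = [example_weight(r, lam=2.0) for r in rewards]

print(sharp[-1] / sharp[0] > mild[-1] / mild[0])  # smaller lam => sharper ratio
```

The best-to-worst weight ratio is exp((r_max − r_min)/λ), so halving λ squares the relative emphasis on high-reward items.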
Why is this needed? Modern generative recommenders work like language models: they predict the next item for a user from a large catalog. That catalog is huge, and real users interact with only a tiny fraction of it. Learned reward models (models trained to predict how much a user will like an item) therefore must extrapolate a lot. The paper documents that such reward models often do worse than a naive item-mean baseline and that optimization methods that trust those models—like PPO (Proximal Policy Optimization) used in Reinforcement Learning from Human Feedback (RLHF) or DPO (Direct Preference Optimization)—can “reward-hack” and collapse on real recommendation metrics. Offline constraints and the lack of usable logging policies make common corrections like Inverse Propensity Scoring (IPS) impractical in production.
What the researchers tried and how it works. They apply supervised fine-tuning where each logged interaction is re-weighted by exp(r/λ) during training. This does two things at once: it directly amplifies high-reward examples without ever asking a separate reward model to score new items, and it gives a single, interpretable knob λ that controls how strongly training focuses on high-reward items. Small λ concentrates the policy on the highest-reward items regardless of popularity; large λ drives all weights toward 1 and recovers ordinary supervised fine-tuning. Unlike linear weighting by reward, the exponential weights are always positive, so negative rewards never flip the sign of the loss, and the weights separate item quality from logging frequency.
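The training objective described above can be sketched as a reward-weighted negative log-likelihood. This is a minimal illustration, not the paper's implementation: the function name, the normalization by the sum of weights, and the toy numbers are all assumptions made here for clarity.

```python
import math

def weighted_nll(logprobs, rewards, lam):
    """Reward-weighted SFT loss (sketch): each example's negative
    log-likelihood is scaled by exp(r / lam), then normalized.
    Normalizing by the weight sum is an assumption for this sketch."""
    weights = [math.exp(r / lam) for r in rewards]
    total = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, logprobs)) / total

# Toy batch: model log-probs of the logged items and their rewards.
logprobs = [-2.0, -1.0]
rewards = [0.0, 1.0]

loss = weighted_nll(logprobs, rewards, lam=1.0)
print(loss)  # the high-reward, high-likelihood example dominates

# Note the contrast with linear weighting: a reward of -1 would give a
# negative linear weight, but exp(-1 / lam) stays positive.
```

In a real trainer the same idea amounts to passing per-example weights into the cross-entropy reduction; no reward model, propensity score, or online interaction is involved.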