Tuna-2 drops pretrained vision encoders and learns directly from pixels for image understanding and generation
This paper presents Tuna-2, a multimodal AI model that works directly from raw pixels instead of relying on a separate, pretrained vision encoder. In place of complicated encoder modules, Tuna-2 uses simple patch embedding layers that turn image patches into tokens. A single transformer then handles both visual and text tokens, so the same model can read images, answer questions about them, and generate or edit pictures.
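To make that concrete, here is a minimal sketch of an encoder-free pipeline: a patch embedding layer projects raw pixels into tokens, and one transformer processes those tokens alongside text tokens. The patch size, hidden width, vocabulary size, and the use of `nn.TransformerEncoder` are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Turn raw pixels into a sequence of tokens with a single learned projection."""
    def __init__(self, patch=16, channels=3, dim=768):
        super().__init__()
        # A strided convolution is equivalent to flattening each patch and projecting it.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)

    def forward(self, images):                   # (B, 3, H, W)
        x = self.proj(images)                    # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)      # (B, num_patches, dim)

class UnifiedBackbone(nn.Module):
    """One transformer consumes image tokens and text tokens in the same sequence."""
    def __init__(self, vocab=32000, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch_embed = PixelPatchEmbed(dim=dim)
        self.text_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, images, text_ids):
        img_tokens = self.patch_embed(images)    # visual tokens straight from pixels
        txt_tokens = self.text_embed(text_ids)   # text tokens
        tokens = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.blocks(tokens)               # shared processing for both modalities
```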
The authors arrived at Tuna-2 by simplifying earlier designs in steps. First they built Tuna-R, which keeps a learned representation encoder but removes a Variational Autoencoder (VAE) used for image reconstruction. Tuna-2 goes further and removes the encoder entirely. Because raw pixels are high-dimensional and harder to learn from, the team adds a masking-based feature-learning trick: during training, some image patches are replaced with a learnable mask token. This helps the model avoid simple shortcuts and forces it to learn visual cues useful for both understanding and generation.
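A minimal sketch of that masking step is shown below: a random subset of patch tokens is swapped for a shared, learnable mask embedding before entering the transformer. The mask ratio, the module boundaries, and the point where masking is applied are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PatchMasker(nn.Module):
    """Replace a random fraction of patch tokens with a learnable mask token."""
    def __init__(self, dim=768, mask_ratio=0.3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable mask embedding
        self.mask_ratio = mask_ratio

    def forward(self, patch_tokens):             # (B, N, dim)
        if not self.training:
            return patch_tokens                  # masking is a training-time regularizer only
        b, n, _ = patch_tokens.shape
        keep = torch.rand(b, n, device=patch_tokens.device) > self.mask_ratio
        mask_tok = self.mask_token.expand(b, n, -1)
        # Where a patch is dropped, substitute the shared learnable token.
        return torch.where(keep.unsqueeze(-1), patch_tokens, mask_tok)
```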
For image generation, Tuna-2 works in pixel space rather than in a compact latent space. The model is trained to predict a clean image from a noisy version of it. The training objective follows a flow-matching recipe (x-prediction with a v-loss): the model outputs an estimate of the clean image, and the loss measures the error of the velocity implied by that estimate and the noisy input. At inference time, an Euler solver integrates the model's predictions over a sequence of steps to produce the final image. In short, Tuna-2 directly maps noisy pixel inputs to clean images while sharing the same backbone for language and vision tasks.
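The sketch below shows one common way to set up x-prediction with a v-loss under a linear noise-to-data interpolation, together with Euler sampling. The interpolation path, the time convention (t=0 is noise, t=1 is data), the step count, and the `model(x_t, t)` signature are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch

def v_loss(model, x_data, t_eps=1e-3):
    """Train the model to predict the clean image; score the error as a velocity."""
    b = x_data.shape[0]
    t = torch.rand(b, device=x_data.device).clamp(max=1 - t_eps)   # per-sample time in [0, 1)
    noise = torch.randn_like(x_data)
    tt = t.view(b, 1, 1, 1)
    x_t = (1 - tt) * noise + tt * x_data        # noisy pixels on the linear path
    x_pred = model(x_t, t)                      # x-prediction: estimate of the clean image
    v_pred = (x_pred - x_t) / (1 - tt)          # velocity implied by the clean-image estimate
    v_target = x_data - noise                   # true velocity of the linear path
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def euler_sample(model, shape, steps=50, device="cpu"):
    """Integrate the learned velocity field from noise (t=0) toward an image (t=1)."""
    x = torch.randn(shape, device=device)
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        tb = torch.full((shape[0],), float(t), device=device)
        x_pred = model(x, tb)                   # predict the clean image at this step
        v = (x_pred - x) / (1 - t).clamp(min=1e-3)   # convert the prediction to a velocity
        x = x + (t_next - t) * v                # Euler step toward the data
    return x
```

Because the velocity target under a linear path is just `x_data - noise`, the v-loss here is an error on the clean-image prediction rescaled by `1 / (1 - t)`, which weights late (nearly clean) timesteps more heavily.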
Why this matters: the authors report that Tuna-2 achieves state-of-the-art results on a range of multimodal benchmarks. Their controlled comparisons show that, after enough pretraining, the encoder-free Tuna-2 becomes competitive with encoder-based models for image generation and outperforms them on tasks that need fine-grained visual perception. The result suggests that pretrained vision encoders are not strictly necessary and that end-to-end pixel-space learning can produce strong unified representations for both perception and synthesis.