First-hand information for everyone
This paper presents Tuna-2, a multimodal AI model that works directly from raw pixels instead of relying on a separate, pretrained vision encoder.
This paper describes Lyra 2.0, a system that starts from a single photo and lets a user explore a large, synthetic 3D world. The system first generates long, camera-controlled videos, which can then be lifted into an explorable 3D representation.
This paper tackles a common failure in agentic multimodal models: they call external tools too often, even when the answer could be found in the input itself.
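The over-calling failure above can be illustrated with a minimal sketch. This is not the paper's method; the gating function, names, and threshold below are all hypothetical, showing only the general idea of answering directly when the model is confident and falling back to a tool otherwise.

```python
# Hypothetical sketch: gate external tool calls on the model's own
# confidence, so the agent answers directly when it can.
def answer_with_tool_gate(question, model_answer, confidence, tool, threshold=0.8):
    # Call the external tool only when the model is unsure of its answer.
    if confidence >= threshold:
        return model_answer   # answer found without a tool call
    return tool(question)     # fall back to the external tool

lookup = lambda q: f"tool result for {q!r}"
print(answer_with_tool_gate("capital of France?", "Paris", 0.95, lookup))
print(answer_with_tool_gate("obscure fact?", "not sure", 0.20, lookup))
```

In this toy setup the first call returns the model's own answer and the second triggers the tool; a real system would estimate confidence from the model itself.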
Large vision-language models can name objects that are not actually in an image. This paper studies that problem, which is called object hallucination.
Researchers introduce GeoCodeBench, a new benchmark designed to test whether large language models can write precise, domain-specific code.
This paper examines how large, pre-trained medical “foundation” models behave when asked to find traumatic bowel injury on CT scans.
This paper introduces MuRF, short for Multi-Resolution Fusion. The idea is simple: instead of feeding a single resized image to a pre-trained backbone, the model fuses features extracted at multiple resolutions.
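The general idea behind multi-resolution fusion can be sketched in a few lines. This is a generic illustration, not MuRF's actual architecture: the backbone is replaced by a trivial pooling function, the resize is nearest-neighbour, and the fusion is a plain average, all of which are assumptions for the sketch.

```python
# Illustrative sketch of multi-resolution feature fusion (hypothetical,
# not the paper's architecture).
import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    # Stand-in for a frozen pre-trained backbone: global average pooling.
    return image.mean(axis=(0, 1))

def resize(image: np.ndarray, size: int) -> np.ndarray:
    # Nearest-neighbour resize to a square of side `size`.
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def multi_resolution_fusion(image: np.ndarray, sizes=(64, 128, 256)) -> np.ndarray:
    # Run the same backbone on several resolutions and average the features.
    feats = [extract_features(resize(image, s)) for s in sizes]
    return np.mean(feats, axis=0)

img = np.random.rand(300, 400, 3)
fused = multi_resolution_fusion(img)
print(fused.shape)  # (3,)
```

A real system would use a learned fusion (e.g. attention or concatenation plus a projection) rather than a simple mean, but the structure is the same: one backbone, several input scales, one fused feature vector.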
This paper presents ThinkJEPA, a method that combines two ways of understanding video to predict future states for tasks like hand manipulation.
Ultrasound images look different from ordinary photos: they are formed from echoes of sound and show characteristic gray-scale textures.