
T2Mo: Controlling 3D object motion with simple paths and a text prompt

June 4, 2026 | arXiv: 2606.05162v1

This paper presents T2Mo, a system that generates moving 3D shapes from a short text description combined with explicit 3D paths. Text alone is often vague about exact motion. The authors therefore add 3D trajectories (short sequences of points in space that show where parts of a shape should move), so users can specify precise local motion while the text prompt supplies the overall meaning.

T2Mo is a feed‑forward framework. It takes three inputs: a static 3D mesh, a set of user‑provided 3D trajectories, and a text prompt. The model produces a sequence of deformed meshes by predicting per‑vertex displacements. To handle trajectories that can be sparse or unevenly placed, the team designed a shape‑grounded trajectory embedding. That embedding maps any set of input trajectories into a fixed set of shape‑aware tokens that together cover the whole object.
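To make the output format concrete, here is a minimal sketch of how per‑vertex displacements turn one static mesh into a sequence of deformed meshes. It assumes displacements are offsets from the rest pose and that mesh connectivity (the faces) stays fixed; the function name `apply_displacements` and the toy data are illustrative, not taken from the paper.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def apply_displacements(
    rest_vertices: List[Vec3],
    displacements: List[List[Vec3]],
) -> List[List[Vec3]]:
    """Return one deformed vertex list per frame (rest position + offset).

    Assumption: each frame holds exactly one offset per vertex, measured
    from the rest pose. Faces are unchanged, so they are not needed here.
    """
    frames = []
    for t, frame_offsets in enumerate(displacements):
        if len(frame_offsets) != len(rest_vertices):
            raise ValueError(f"frame {t}: expected one offset per vertex")
        frames.append([
            (vx + dx, vy + dy, vz + dz)
            for (vx, vy, vz), (dx, dy, dz) in zip(rest_vertices, frame_offsets)
        ])
    return frames


# Toy example: a single triangle whose first vertex rises in frame 1.
rest = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
offsets = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],  # frame 0: rest pose
    [(0.0, 0.0, 0.5), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],  # frame 1: vertex 0 moves up
]
frames = apply_displacements(rest, offsets)
```

Because only vertex positions change, every frame can reuse the input mesh's face list when it is rendered or exported.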

At a high level, each input trajectory is attached to its source point on the mesh. The system also samples extra anchor points on the shape so the conditioning covers the full geometry. The resulting trajectory and shape tokens are combined with the text embedding and fed into a Diffusion Transformer (DiT) backbone that outputs the per‑vertex motion over time. In other words, the trajectories tell the model where selected points should move in 3D, and the text tells it what kind of motion to produce overall.
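The anchoring idea can be sketched in a few lines. The summary does not say how T2Mo picks source points or extra anchors, so this example makes two assumptions of its own: a trajectory is attached to the mesh vertex nearest its first point, and the remaining anchors are filled in by farthest‑point sampling. The names `nearest_vertex` and `build_anchor_tokens` are hypothetical, and each "token" here is just an index paired with an optional trajectory rather than a learned embedding.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Token = Tuple[int, Optional[Sequence[Vec3]]]  # (vertex index, trajectory or None)


def nearest_vertex(point: Vec3, vertices: Sequence[Vec3]) -> int:
    """Index of the mesh vertex closest to `point`."""
    return min(range(len(vertices)), key=lambda i: math.dist(point, vertices[i]))


def build_anchor_tokens(
    vertices: Sequence[Vec3],
    trajectories: Sequence[Sequence[Vec3]],
    num_tokens: int,
) -> List[Token]:
    """Return exactly `num_tokens` anchors that cover the shape.

    Assumed scheme (not the paper's): trajectory sources come first, then
    farthest-point sampling adds anchors until the fixed budget is reached.
    """
    if num_tokens > len(vertices):
        raise ValueError("num_tokens cannot exceed the number of vertices")
    conditioned = {nearest_vertex(traj[0], vertices): traj for traj in trajectories}
    anchors = list(conditioned)
    if len(anchors) > num_tokens:
        raise ValueError("more trajectories than tokens")
    if not anchors:
        anchors.append(0)  # arbitrary seed when no trajectory is given
    # Distance from every vertex to its closest already-chosen anchor.
    min_dist = [min(math.dist(v, vertices[a]) for a in anchors) for v in vertices]
    while len(anchors) < num_tokens:
        nxt = max(range(len(vertices)), key=min_dist.__getitem__)
        anchors.append(nxt)
        min_dist = [min(d, math.dist(v, vertices[nxt])) for d, v in zip(min_dist, vertices)]
    return [(idx, conditioned.get(idx)) for idx in anchors]


# Toy shape: five vertices on a line, one user trajectory starting near vertex 0.
line = [(float(i), 0.0, 0.0) for i in range(5)]
traj = [(0.1, 0.0, 0.0), (0.1, 0.0, 1.0)]
tokens = build_anchor_tokens(line, [traj], num_tokens=3)
```

However many trajectories the user draws, the token count stays at `num_tokens`, which is the property the fixed‑size embedding described above relies on. Anchors without a trajectory carry `None`, standing in for a "shape only" token.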

This combination gives interactive and fine‑grained control. The paper shows uses such as editing a source motion, transferring motion to a new mesh, and making targeted local movements like a leg kicking or a blade lowering. The authors compare T2Mo to recent text‑only and cascaded video‑based baselines. Quantitative metrics (including VBench, trajectory alignment, and motion magnitude), qualitative examples, and user studies reported in the paper indicate that T2Mo follows user guidance more faithfully and produces more expressive motions while preserving motion quality.
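The summary names trajectory alignment and motion magnitude as metrics but does not give their formulas. As a rough illustration only, the sketch below assumes alignment is measured as the mean Euclidean distance between a user trajectory and the path its source vertex actually follows, and motion magnitude as the mean per‑vertex displacement from the rest pose. Both definitions and both function names are assumptions, not the paper's.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def trajectory_error(target_path: Sequence[Vec3], generated_path: Sequence[Vec3]) -> float:
    """Mean distance between target and generated positions (lower is better)."""
    if not target_path or len(target_path) != len(generated_path):
        raise ValueError("paths must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(target_path, generated_path)) / len(target_path)


def motion_magnitude(rest_vertices: Sequence[Vec3], frames: Sequence[Sequence[Vec3]]) -> float:
    """Mean displacement from the rest pose over all vertices and frames."""
    total, count = 0.0, 0
    for frame in frames:
        for rest, moved in zip(rest_vertices, frame):
            total += math.dist(rest, moved)
            count += 1
    if count == 0:
        raise ValueError("no vertices or frames to measure")
    return total / count


# Toy check: the generated path reaches only half the requested height.
target = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
generated = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.5)]
err = trajectory_error(target, generated)

rest_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
animation = [rest_pose, [(0.0, 0.0, 0.5), (1.0, 0.0, 0.0)]]
mag = motion_magnitude(rest_pose, animation)
```

Reporting the two numbers together matters: a model that barely moves the mesh can look stable while ignoring the user's path, so a low trajectory error is only meaningful alongside a non‑trivial motion magnitude.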