LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
Abstract
General visual foundation models trained without action supervision consistently outperform specialized embodied latent action models, and latent visual representations are better aligned with the physical action space than pixel-based ones.
While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). The curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) general visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models; (ii) latent visual space is fundamentally better aligned with the physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
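For readers unfamiliar with the term, a latent action model typically encodes a pair of consecutive frames into a discrete, ontology-free token describing the transition between them, which can then stand in for an action label during VLA pretraining. The sketch below is a minimal, hypothetical illustration of that idea (a VQ-VAE-style frame-pair encoder in PyTorch); the module names, sizes, and quantizer are assumptions for exposition and do not reflect LARY's or any evaluated model's implementation.

```python
# Minimal sketch (not the paper's implementation): encode a pair of consecutive
# frames into a discrete "latent action" token, VQ-VAE style. All names and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class FramePairEncoder(nn.Module):
    """Encodes (frame_t, frame_t+1) into a continuous latent describing the change."""

    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1),   # two RGB frames stacked -> 6 channels
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, latent_dim)

    def forward(self, frame_t: torch.Tensor, frame_tp1: torch.Tensor) -> torch.Tensor:
        x = torch.cat([frame_t, frame_tp1], dim=1)       # (B, 6, H, W)
        h = self.net(x).flatten(1)                       # (B, 64)
        return self.proj(h)                              # (B, latent_dim)


class VectorQuantizer(nn.Module):
    """Maps continuous latents to the nearest entry of a learned codebook."""

    def __init__(self, num_codes: int = 16, latent_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, latent_dim)

    def forward(self, z: torch.Tensor):
        # Distance to every code; the nearest code index is the latent action label.
        dists = torch.cdist(z, self.codebook.weight)      # (B, num_codes)
        indices = dists.argmin(dim=-1)                    # (B,)
        z_q = self.codebook(indices)                      # (B, latent_dim)
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, indices


if __name__ == "__main__":
    enc, vq = FramePairEncoder(), VectorQuantizer()
    f_t, f_tp1 = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
    z_q, action_ids = vq(enc(f_t, f_tp1))
    print("latent action ids:", action_ids.tolist())      # ontology-free, embodiment-agnostic
```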
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Universal Pose Pretraining for Generalizable Vision-Language-Action Policies (2026)
- Learning Additively Compositional Latent Actions for Embodied AI (2026)
- UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models (2026)
- DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control (2026)
- Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild (2026)
- DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA (2026)
- World Guidance: World Modeling in Condition Space for Action Generation (2026)
Get this paper in your agent:
hf papers read 2604.11689
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash