arxiv:2605.02881

MolmoAct2: Action Reasoning Models for Real-world Deployment

Published on May 4 · Submitted by Duan on May 5
#1 Paper of the day

Abstract

AI-generated summary

MolmoAct2 presents an open action reasoning model for robotics that improves upon previous systems through a specialized vision-language-model backbone, new datasets, an open-weight action tokenizer, an architectural redesign for continuous-action prediction, and adaptive reasoning for reduced latency.

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low- to medium-cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
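
To make the architectural graft concrete, here is a minimal sketch, assuming a rectified-flow parameterization and toy dimensions, of a flow-matching action expert that attends into a backbone's per-layer KV caches. The class names, layer count, and stand-in caches below are hypothetical illustrations, not the released MolmoAct2 code.

# A minimal sketch (assumptions, not the MolmoAct2 implementation) of grafting
# a flow-matching action expert onto a frozen VLM via per-layer KV-cache
# conditioning: each expert layer cross-attends into the cached hidden states
# of one backbone layer. Dimensions and the flow parameterization are toy.
import torch
import torch.nn as nn


class ExpertLayer(nn.Module):
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, kv):
        # kv: cached representations of one backbone layer, used as keys/values
        x = x + self.cross(x, kv, kv, need_weights=False)[0]
        return x + self.mlp(x)


class FlowActionExpert(nn.Module):
    """Predicts the flow-matching velocity for an action chunk, conditioned
    per layer on the backbone's cached representations (one cache per layer)."""
    def __init__(self, act_dim: int, dim: int = 256, n_layers: int = 4):
        super().__init__()
        self.inp = nn.Linear(act_dim + 1, dim)   # noisy action + flow time t
        self.layers = nn.ModuleList(ExpertLayer(dim) for _ in range(n_layers))
        self.out = nn.Linear(dim, act_dim)

    def forward(self, noisy_act, t, kv_cache):
        # noisy_act: (B, chunk, act_dim); t: (B, 1, 1); kv_cache: list of (B, seq, dim)
        x = self.inp(torch.cat([noisy_act, t.expand(-1, noisy_act.shape[1], 1)], dim=-1))
        for layer, kv in zip(self.layers, kv_cache):
            x = layer(x, kv)
        return self.out(x)


def flow_matching_loss(expert, actions, kv_cache):
    """Rectified-flow style objective: regress the velocity (actions - noise)."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1)
    x_t = (1 - t) * noise + t * actions
    v_pred = expert(x_t, t, kv_cache)
    return ((v_pred - (actions - noise)) ** 2).mean()


if __name__ == "__main__":
    B, chunk, act_dim, dim = 2, 8, 7, 256
    kv_cache = [torch.randn(B, 64, dim) for _ in range(4)]   # stand-in backbone caches
    expert = FlowActionExpert(act_dim, dim)
    loss = flow_matching_loss(expert, torch.randn(B, chunk, act_dim), kv_cache)
    loss.backward()
    print("flow-matching loss:", float(loss))

The point of conditioning on per-layer caches in a sketch like this is that the continuous expert never re-runs the backbone: it only attends into representations the VLM has already computed, which is presumably what keeps the continuous head cheap at inference time while preserving the backbone's grounding.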

Community

Any hope for a proper project page and GitHub project? Right now both are empty.

We are going to release everything by 8 am PST. Stay tuned.

The per-layer KV-cache conditioning that grafts a continuous-action flow-matching expert onto a discrete-token VLM is a clever bit of engineering: it keeps latency in check while preserving the backbone's perception and grounding. By letting the continuous controller see the backbone's tokens through per-layer caches, they decouple discrete planning from continuous actuation in a clean way that also seems to help interpretability. The depth-adaptive MolmoThink is neat too: only re-predicting depth where the scene actually changes feels like a practical redundancy prune for real robots. The ArxivLens breakdown helped me parse the architecture flow and how the specialize-then-rehearse recipe fits in; see details here: https://arxivlens.com/PaperView/Details/molmoact2-action-reasoning-models-for-real-world-deployment-1212-6d5e6054. I would love to see ablations on how much the per-layer KV conditioning contributes versus simply training a separate flow model on the same data.
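
Since the comment above highlights the adaptive-depth mechanism, here is a minimal sketch of the general idea of re-predicting depth tokens only for scene regions that changed between timesteps. The patch size, change threshold, and the depth_tokenizer stand-in are assumptions for illustration, not the paper's implementation.

# A minimal sketch (not the MolmoThink implementation) of adaptive depth
# re-prediction: compute a per-patch change mask between consecutive frames,
# re-predict depth tokens only for changed patches, and reuse cached tokens
# for static ones. Patch size, threshold, and the tokenizer are placeholders.
import torch

PATCH = 16            # hypothetical ViT-style patch size
CHANGE_THRESH = 0.05  # hypothetical per-patch change threshold


def patch_change_mask(prev_rgb: torch.Tensor, cur_rgb: torch.Tensor) -> torch.Tensor:
    """Boolean mask over patches: True where the scene changed enough to re-predict."""
    diff = (cur_rgb - prev_rgb).abs().mean(dim=0)            # (H, W), averaged over channels
    h, w = diff.shape
    patches = diff.reshape(h // PATCH, PATCH, w // PATCH, PATCH).mean(dim=(1, 3))
    return patches > CHANGE_THRESH                            # (H/P, W/P)


def adaptive_depth_tokens(prev_rgb, cur_rgb, cached_tokens, depth_tokenizer):
    """Reuse cached depth tokens for static patches; re-predict only changed ones."""
    mask = patch_change_mask(prev_rgb, cur_rgb).flatten()     # (num_patches,)
    tokens = cached_tokens.clone()
    if mask.any():
        # depth_tokenizer is a stand-in for whatever model maps RGB patches to
        # discrete depth tokens; here it only has to process the changed patches.
        tokens[mask] = depth_tokenizer(cur_rgb, mask)
    return tokens, mask


if __name__ == "__main__":
    H = W = 224
    num_patches = (H // PATCH) * (W // PATCH)
    prev = torch.rand(3, H, W)
    cur = prev.clone()
    cur[:, :32, :32] += 0.5                                   # simulate a moved object
    cache = torch.zeros(num_patches, dtype=torch.long)

    # dummy tokenizer: one discrete depth token per changed patch
    dummy = lambda img, m: torch.randint(0, 256, (int(m.sum()),))
    new_tokens, changed = adaptive_depth_tokens(prev, cur, cache, dummy)
    print(f"re-predicted {int(changed.sum())}/{num_patches} patches")

In a real controller the mask would presumably come from the model rather than a raw pixel difference, but the sketch shows where the latency saving comes from: static patches keep their cached tokens and only changed regions pay for re-prediction.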

Get this paper in your agent:

hf papers read 2605.02881
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 10

Datasets citing this paper 737

Spaces citing this paper 1

Collections including this paper 9