ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver

Abstract

Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions. Instead, their visual attention tends to be dispersed. To ground visual attention on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer reconstructs the gaze region of the image, which corresponds to the target objects to be manipulated. This process prompts the VLA model to learn fine-grained representations and allocate visual attention accurately, so that it makes effective use of task-specific visual information and performs precise manipulation. Moreover, we curate a large-scale pretraining dataset comprising over 100k trajectories and 2 million data samples from open-source robotic datasets, further boosting the model’s generalization in visual reconstruction. Extensive experiments in simulation and the real world demonstrate the superiority of our implicit grounding method, showcasing its capabilities of precise manipulation and generalization.

Introduction

We propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Reconstructing gaze regions pushes the model toward precise visual attention allocation and fine-grained representation learning, which improves its visual grounding and enables precise manipulation.
We construct a large-scale robot pretraining dataset containing more than 100k trajectories and 2 million data samples. Pretraining on this dataset improves the generalization of the model's visual reconstruction capability.
Extensive experiments in simulation and the real world demonstrate the superiority of our implicit grounding method, as well as its precise manipulation and its generalization to unseen targets.

Method
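
As a rough illustration of the idea described in the abstract, the sketch below shows how a gaze-region reconstruction term could be combined with an action loss during training. It assumes a standard noise-prediction diffusion objective and a simple weighted sum of the two losses; neither detail is stated on this page. All names (`crop_gaze_region`, `add_noise`, `reconstruction_loss`, `total_loss`, `weight`) are hypothetical, and the denoiser is a stand-in for the diffusion transformer, not the authors' implementation.

```python
import math
import random


def crop_gaze_region(image, box):
    """Crop the gaze region (top, left, bottom, right) from a 2D image (list of rows)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def flatten(region):
    """Flatten a 2D region into a 1D list so it can be treated as a latent vector."""
    return [pixel for row in region for pixel in row]


def add_noise(z0, alpha_bar_t, rng):
    """Forward diffusion step (assumed): z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    zt = [a * x + b * e for x, e in zip(z0, eps)]
    return zt, eps


def reconstruction_loss(denoiser, z0, cond, alpha_bar_t, rng):
    """Mean squared error between predicted and true noise on the gaze region.

    `cond` stands for the VLA model's visual outputs that condition the denoiser.
    """
    zt, eps = add_noise(z0, alpha_bar_t, rng)
    eps_pred = denoiser(zt, alpha_bar_t, cond)
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


def total_loss(action_loss, recon_loss, weight=1.0):
    """Assumed combination: action loss plus a weighted reconstruction term."""
    return action_loss + weight * recon_loss


# Toy usage: an 8x8 grayscale "image" and a placeholder denoiser that predicts zeros.
rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]
z0 = flatten(crop_gaze_region(image, (2, 2, 6, 6)))
zero_denoiser = lambda zt, alpha_bar_t, cond: [0.0] * len(zt)
recon = reconstruction_loss(zero_denoiser, z0, cond=None, alpha_bar_t=0.5, rng=rng)
loss = total_loss(action_loss=0.5, recon_loss=recon, weight=1.0)
```

In this toy setup the reconstruction term only reaches zero when the denoiser recovers the injected noise exactly, which is the sense in which reconstruction pressure could encourage the conditioning features to carry fine-grained information about the target region.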


Visual Pretraining

BridgeData V2

LIBERO

CALVIN

Experiments

Main experiment in simulation.

Evaluation on CALVIN ABC->D Benchmark

Comparison with other methods

Our implicit grounding method achieves the highest success rates, demonstrating its advantage over the other grounding paradigms.

Ablation Study

We observe that pretraining leads to a significant improvement in success rates. In unseen test environments, grounding the target object and reconstructing it is inherently difficult, because it requires the model’s generative capability to generalize beyond the training scenes. Pretraining on large-scale datasets substantially improves this generalization in visual reconstruction.

Evaluation in the Real World

We evaluate the model’s generalization ability on real-world tasks.

Real-world results compared with other methods.

Reconstruction Visualization

BibTeX


      @article{song2025reconvla,
        title={{ReconVLA}: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver},
        author={Song, Wenxuan and Zhou, Ziyang and Zhao, Han and Chen, Jiayi and Ding, Pengxiang and Yan, Haodong and Huang, Yuxin and Tang, Feilong and Wang, Donglin and Li, Haoang},
        journal={arXiv preprint arXiv:2508.10333},
        year={2025}
      }