ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver

1The Hong Kong University of Science and Technology (Guangzhou)
2Westlake University 3Zhejiang University 4Monash University

Abstract

Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions; instead, their attention is dispersed across the scene. To ground visual attention on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer reconstructs the gaze region of the image, which corresponds to the manipulated target object. This process prompts the VLA model to learn fine-grained representations and allocate visual attention accurately, thus effectively leveraging task-specific visual information and performing precise manipulation. Moreover, we curate a large-scale pretraining dataset comprising over 100k trajectories and 2 million data samples from open-source robotic datasets, further boosting the model's generalization in visual reconstruction. Extensive experiments in simulation and the real world demonstrate the superiority of our implicit grounding method, showcasing its capabilities of precise manipulation and generalization.

Introduction

  1. We propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Reconstructing gaze regions drives the model toward precise visual attention allocation and fine-grained representation learning, thereby enhancing visual grounding and enabling precise manipulation.
  2. We construct a large-scale robot pretraining dataset containing more than 100k trajectories and 2 million data samples. Pretraining on this dataset enhances the generalization of the model's visual reconstruction capability.
  3. Extensive experiments in simulation and the real world show the superiority of our implicit grounding method and its capabilities of precise manipulation and generalization to unseen targets.

Method

ReconVLA architecture
    Our model consists of a reconstructive part and an action part. The input includes multi-view images and a text instruction. For the action part, the model outputs discrete action tokens. For the reconstruction part, ReconVLA is guided to output reconstructive tokens, which condition the denoising process that reconstructs the clean scene tokens z0 from the noisy tokens zt; the scene tokens are the tokenized image of the gaze region. This supervision strengthens ReconVLA's visual grounding and fine-grained comprehension, which in turn contribute to precise manipulation.
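    The training step below is a minimal PyTorch sketch of this two-branch design. It assumes a generic VLM backbone and diffusion transformer, a hypothetical token split, and a simplified noise schedule; it illustrates the idea rather than the released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def add_noise(z0, noise, alpha_bar_t):
        # Forward diffusion q(z_t | z_0); alpha_bar_t must broadcast against z0.
        return alpha_bar_t.sqrt() * z0 + (1.0 - alpha_bar_t).sqrt() * noise

    class ReconVLASketch(nn.Module):
        def __init__(self, vlm_backbone, diffusion_transformer,
                     hidden_dim=4096, action_vocab_size=256, n_action_tokens=7):
            super().__init__()
            self.vlm = vlm_backbone            # multimodal backbone: images + instruction -> hidden states
            self.dit = diffusion_transformer   # denoiser conditioned on the reconstructive tokens
            self.action_head = nn.Linear(hidden_dim, action_vocab_size)
            self.n_action_tokens = n_action_tokens

        def forward(self, images, instruction_ids, gaze_tokens_z0, alpha_bar_t):
            # Backbone hidden states; the last n_action_tokens positions are read
            # out as action tokens and the rest as reconstructive tokens (a split
            # assumed only for this sketch).
            hidden = self.vlm(images, instruction_ids)              # (B, L, hidden_dim)
            action_hidden = hidden[:, -self.n_action_tokens:]
            recon_tokens = hidden[:, :-self.n_action_tokens]

            # Action branch: logits over the discrete action vocabulary.
            action_logits = self.action_head(action_hidden)

            # Reconstruction branch: noise the tokenized gaze region z0 to zt and
            # let the diffusion transformer predict the noise, conditioned on the
            # reconstructive tokens emitted by the VLA.
            noise = torch.randn_like(gaze_tokens_z0)
            z_t = add_noise(gaze_tokens_z0, noise, alpha_bar_t)
            pred_noise = self.dit(z_t, alpha_bar_t, cond=recon_tokens)
            recon_loss = F.mse_loss(pred_noise, noise)
            return action_logits, recon_loss

    In this sketch, the action logits would be trained with a standard cross-entropy loss against discretized ground-truth actions, while recon_loss supplies the implicit grounding signal.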

Visual Pretraining

    To enhance the model's ability to ground and reconstruct specific regions, we design a reconstruction pretraining process on a large-scale robot dataset. The pretraining dataset is built on the large-scale open-source robotic dataset BridgeData V2, together with the high-quality simulation datasets LIBERO and CALVIN.
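    A hedged sketch of how such a mixed reconstruction pretraining set could be assembled is shown below. The per-dataset iterators, the crop_gaze_region helper, the sample fields, and the sampling weights are hypothetical placeholders for illustration, not part of the released pipeline.

    import random
    from dataclasses import dataclass
    from typing import Iterator, List, Tuple

    @dataclass
    class ReconSample:
        images: list        # multi-view RGB frames for one timestep
        instruction: str    # language instruction of the trajectory
        gaze_crop: object   # image patch around the manipulated target object

    def crop_gaze_region(frame, target_bbox, margin=0.1):
        # Hypothetical helper: crop the frame (e.g. a PIL image) around the
        # target object's bounding box with a small margin.
        x0, y0, x1, y1 = target_bbox
        w, h = x1 - x0, y1 - y0
        return frame.crop((x0 - margin * w, y0 - margin * h,
                           x1 + margin * w, y1 + margin * h))

    def build_pretraining_set(loaders: List[Tuple[Iterator, float]],
                              n_samples: int) -> List[ReconSample]:
        # Mix samples from BridgeData V2, LIBERO and CALVIN iterators, e.g.
        # loaders = [(bridge_iter, 0.5), (libero_iter, 0.25), (calvin_iter, 0.25)].
        iters, weights = zip(*loaders)
        samples = []
        for _ in range(n_samples):
            it = random.choices(iters, weights=weights, k=1)[0]
            frame, views, instruction, target_bbox = next(it)
            samples.append(ReconSample(
                images=views,
                instruction=instruction,
                gaze_crop=crop_gaze_region(frame, target_bbox),
            ))
        return samples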

Experiments

    Main experiment in simulation.

Evaluation on CALVIN ABC->D Benchmark

Comparison with Other Methods

    Our implicit grounding method achieves the highest success rates, demonstrating its superiority over the other paradigms.

Ablation Study

    We observe that pretraining leads to a significant improvement in success rates. In unseen test environments, grounding the target object and reconstructing it is inherently difficult and stresses the generalization of the model's generative capability. Pretraining on large-scale data substantially strengthens this generalization during visual reconstruction.

Evaluation in the Real World

    We evaluate the model’s generalization ability on real-world tasks.
    Comparison of real-world results with other methods.

Reconstruction Visualization


BibTeX


        @article{reconvla2025,
          title={ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver},
          author={Wenxuan Song and Ziyang Zhou and Han Zhao and Jiayi Chen and Pengxiang Ding and Haodong Yan and Yuxin Huang and Feilong Tang and Donglin Wang and Haoang Li},
          journal={arXiv preprint arXiv:2508.10333},
          year={2025}
        }