
LightVLA: The Better You Learn, The Smarter You Prune

🚀 Towards Efficient Vision-language-action Models via Differentiable Token Pruning


📝 Abstract

LightVLA is a simple yet effective differentiable token pruning framework designed for Vision-Language-Action (VLA) models. While VLA models have demonstrated impressive capabilities in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven visual token pruning: it generates dynamic queries to evaluate the importance of visual tokens and employs Gumbel softmax for differentiable token selection. Through fine-tuning, LightVLA learns to retain the most informative visual tokens while pruning those that do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results show that LightVLA outperforms a range of VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with significantly reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2%, respectively, while improving the task success rate by 2.6%.

![LightVLA framework](framework_lightVLA.jpg)

🔗 Project Links

TL;DR

The core implementation of LightVLA is here.

💡 Key Features & Approach

At the core of LightVLA is an adaptive pruning mechanism that optimizes both the efficiency and the performance of VLA models:

  • Adaptive, Performance-Driven Pruning: LightVLA generates dynamic queries to assess the importance of visual tokens and uses Gumbel softmax for differentiable token selection, allowing the model to identify and retain the visual information crucial for task execution while discarding redundant tokens (a minimal sketch follows this list).
  • Dual Benefits of Efficiency and Performance: by learning its pruning strategy during fine-tuning, LightVLA not only significantly reduces computational overhead (FLOPs and latency) but also improves task success rates on the LIBERO benchmark.
  • No Additional Parameters: LightVLA relies on no heuristic magic numbers and introduces no additional trainable parameters, so it integrates seamlessly with modern inference frameworks and is easy to deploy.
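
To make the selection mechanism concrete, here is a minimal PyTorch sketch of differentiable token pruning via Gumbel softmax. It is an illustration under simplifying assumptions, not the repository's implementation: the mean-pooled query, the two-way keep/drop logits, and all shapes are placeholders standing in for LightVLA's dynamically generated queries.

```python
import torch
import torch.nn.functional as F

def gumbel_token_select(tokens: torch.Tensor, tau: float = 1.0, hard: bool = True):
    """Differentiable selection of informative visual tokens (sketch).

    tokens: (B, N, D) visual token embeddings.
    Returns the masked tokens and a (B, N, 1) keep mask.

    Assumption: the "query" here is just the mean-pooled token (parameter-free);
    LightVLA's actual queries are generated dynamically -- see the paper/code.
    """
    # Parameter-free query: mean over the token axis, (B, 1, D)
    query = tokens.mean(dim=1, keepdim=True)

    # Importance logits via scaled dot product, (B, N)
    logits = (tokens * query).sum(-1) / tokens.shape[-1] ** 0.5

    # Per-token keep/drop decision as a 2-way Gumbel softmax:
    # column 0 = keep, column 1 = drop. `hard=True` gives a discrete
    # sample in the forward pass with straight-through gradients.
    two_way = torch.stack([logits, -logits], dim=-1)          # (B, N, 2)
    sample = F.gumbel_softmax(two_way, tau=tau, hard=hard)    # (B, N, 2)
    keep_mask = sample[..., :1]                               # (B, N, 1)

    return tokens * keep_mask, keep_mask

if __name__ == "__main__":
    x = torch.randn(2, 16, 8, requires_grad=True)
    pruned, mask = gumbel_token_select(x)
    pruned.sum().backward()        # gradients flow through the selection
    print(mask.squeeze(-1)[0])     # 0/1 keep decisions for sample 0
    print(x.grad.shape)            # torch.Size([2, 16, 8])
```

Because `hard=True` uses a straight-through estimator, the forward pass makes discrete keep/drop decisions while gradients still reach the importance logits, which is what allows the pruning policy to be learned during ordinary fine-tuning without extra parameters.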

📊 Experimental Results

LightVLA demonstrates exceptional performance improvements and efficiency optimizations on the LIBERO benchmark. Here's a comparison of key metrics:

| Metric | Improvement / Reduction |
| --- | --- |
| FLOPs (floating-point operations) | ↓ 59.1% |
| Latency | ↓ 38.2% |
| Task success rate | ↑ 2.6% |

These results highlight LightVLA's powerful ability to enhance the efficiency of VLA models while maintaining or even improving task execution performance.
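
For intuition on where these savings come from: once training is done, the learned soft selection can be collapsed into a hard gather, so every downstream attention layer runs on a shorter token sequence. A hedged sketch (the `hard_prune` helper and the token counts below are hypothetical, not taken from this repository):

```python
import torch

def hard_prune(tokens: torch.Tensor, scores: torch.Tensor, keep: int) -> torch.Tensor:
    """Inference-time pruning sketch: keep the `keep` highest-scoring tokens.

    tokens: (B, N, D) visual tokens; scores: (B, N) importance scores.
    Downstream attention then runs on (B, keep, D); since self-attention
    cost grows roughly quadratically with sequence length, dropping tokens
    is where the FLOPs/latency savings come from.
    """
    idx = scores.topk(keep, dim=1).indices                    # (B, keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])  # (B, keep, D)
    return tokens.gather(1, idx)

# Purely illustrative shapes: prune 256 visual tokens down to 64.
x, s = torch.randn(2, 256, 64), torch.randn(2, 256)
print(hard_prune(x, s, keep=64).shape)  # torch.Size([2, 64, 64])
```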

🛠️ System Requirements

Inference

  • 1 GPU with ~16 GB VRAM for LIBERO sim benchmark tasks.

Training

  • 1-8 GPUs with 27-80 GB of VRAM each, depending on the desired training setup (with the default bfloat16 data type). See the OpenVLA-OFT FAQ for details.

⬇️ Installation

Please refer to the SETUP.md file for detailed instructions on setting up the conda environment.

🚀 Training and Evaluation

Please refer to the LIBERO.md file for detailed instructions on fine-tuning/evaluating on LIBERO simulation benchmark task suites.

🤝 Support

If you encounter any issues, please feel free to open a new GitHub Issue.

📝 Citation

If you use our code or methods in your research or work, please cite our paper:

@misc{jiang2025betterlearnsmarterprune,
      title={The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning}, 
      author={Titong Jiang and Xuefeng Jiang and Yuan Ma and Xin Wen and Bailin Li and Kun Zhan and Peng Jia and Yahui Liu and Sheng Sun and Xianpeng Lang},
      year={2025},
      eprint={2509.12594},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.12594}, 
}

📜 License

This project is licensed under the MIT License. Please see the LICENSE file in the project root for details.

Acknowledgements

This work is built upon the wonderful OpenVLA-OFT project. Special thanks to Moo Jin Kim, Chelsea Finn, and Percy Liang for their contributions.
