
🌈NativeRes-LLaVA

Official code repo for our work "Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models"

📰 News

  • [2025/5/22] 🔥🔥🔥 We released the NativeRes-LLaVA (with Qwen2-ViT) 1B && 2B && 7B checkpoints on Hugging Face!
  • [2025/6/13] 🔥🔥🔥 We released the paper on arXiv!

📌 ToDo Lists

  • Release Inference Code
  • Release ✨ NativeRes-LLaVA 1B && 2B && 7B Checkpoints (with Qwen2-ViT)
  • Release NativeRes-ViT (Niujunbo2002/qwen2vit-665m-patch14-native)
  • Support SGLang 🚀🚀🚀
  • Release NativeRes-ViT (Niujunbo2002/qwen2_5_vit-668m-patch14-native)(Window Attention🌟)
  • Release ✨ NativeRes-LLaVA 1B && 2B && 7B Checkpoints (with Qwen2.5-ViT), much faster!
  • Release Training Code (The code is being organized.)
  • Release RC-Bench (The code is being organized.)
  • Support LLaMA Factory && VeRL
  • Release SOTA NativeRes-LLaVA Checkpoints and Training Recipe (Data is being collected⛽️)

Comparisons of openness and capabilities across different VLMs.

| Models | NativeRes-Training Codebase | Sequence Packing Scripts | Pre-Training Codebase | Base Model Checkpoint | SFT-Training Codebase | Instruct Model Checkpoint | Flexibly Changing Modules | Resolution Strategy |
|---|---|---|---|---|---|---|---|---|
| LLaVA | ⬜️ None | ⬜️ None | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | Fixed |
| Cambrian-1 | ⬜️ None | ⬜️ None | 🟥 Closed | 🟩 Open | 🟥 Closed | 🟩 Open | 🟩 Open | Hybrid |
| LLaVA-OneVision | ⬜️ None | ⬜️ None | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | Crop |
| Seed1.5-VL | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | Native |
| Kimi-VL | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟩 Open | 🟩 Open | 🟥 Closed | Native |
| Qwen2-VL | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟩 Open | 🟩 Open | 🟩 Open | 🟥 Closed | Native |
| NativeRes-LLaVA | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | Native |
  • Emoji: 🟩 = Open-Source, 🟥 = Closed-Source, ⬜️ = None

Architecture of NativeRes-LLaVA.

Install

This repo enables you to train a LLaVA-style model on images at their native resolution.

  1. Clone this repository and navigate to the NativeRes-LLaVA folder
git clone https://github.com/Niujunbo2002/NativeRes-LLaVA.git
cd NativeRes-LLaVA
  2. Install the package
conda create -n nativeres python=3.10 -y
conda activate nativeres
pip install --upgrade pip  # enable PEP 660 support
pip install torch==2.6.0
pip install torchaudio==2.6.0
pip install torchvision==0.21.0
pip install -r requirements.txt
pip install transformers==4.50.3
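
After installation, a quick sanity check can confirm that the environment matches the pins above (a minimal sketch; adjust the expected versions if you chose different ones):

import torch
import torchvision
import transformers

print(torch.__version__)          # expect 2.6.0
print(torchvision.__version__)    # expect 0.21.0
print(transformers.__version__)   # expect 4.50.3
print(torch.cuda.is_available())  # should be True for training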

❗️If you get stuck installing flash-attn, we recommend manually downloading a pre-built wheel of the appropriate version from the official releases page and installing it with pip.

Install the required environment from requirements.txt. The Transformers version must be recent enough to support at least the Qwen2-VL model (hence the 4.50.3 pin above).

Quick Start

First, download the checkpoints from the following folder: NativeRes-LLaVA

We have released NativeRes-ViT (qwen2-vl-665m-patch14-nativeres), a ViT model capable of handling native-resolution inputs.

We have also released the model NativeRes-LLaVA-qwen2-7b-qwen2vl, which integrates NativeRes-ViT and uses Qwen2-7B-Instruct as the language model. You are free to configure the min_image_tokens and max_image_tokens parameters (default: min_image_tokens=4, max_image_tokens=4096); the sketch after the config snippet below shows how this token budget constrains image size.

You need to first download NativeRes-ViT (Niujunbo2002/qwen2-vl-665m-patch14-nativeres) to your local machine. Then, update the "mm_vision_tower" path in Niujunbo2002/NativeRes-LLaVA-qwen2-7b-qwen2vl/config.json to point to the local path of NativeRes-ViT.

{
  "add_faster_video": false,
  "add_time_instruction": false,
  "architectures": [
    "LlavaQwenForCausalLM"
  ],
  ...
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -1,
  "mm_vision_tower": "/Local_Path/NativeRes/qwen2vl-665m-patch14-native",
  "mm_vision_tower_lr": 2e-06,
  ...
}
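
To make the token budget concrete, the sketch below shows how a native-resolution preprocessor can map an arbitrary image size to a patch-token count inside [min_image_tokens, max_image_tokens]. It mirrors the smart-resize logic of the Qwen2-VL image processor (patch size 14 with 2x2 token merging); the exact behavior of this codebase may differ, so treat it as an illustration rather than the implementation:

import math

def resize_for_token_budget(height, width, patch=14, merge=2,
                            min_image_tokens=4, max_image_tokens=4096):
    """Round (height, width) to multiples of patch*merge, then rescale so the
    number of merged patch tokens falls inside the configured budget.
    Illustrative only; follows Qwen2-VL-style smart resizing."""
    factor = patch * merge                       # one merged token covers 28x28 px
    min_pixels = min_image_tokens * factor ** 2
    max_pixels = max_image_tokens * factor ** 2
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:                       # too many tokens: scale down
        beta = math.sqrt(height * width / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:                     # too few tokens: scale up
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w, (h // factor) * (w // factor)   # new size and token count

# e.g. a 1080x1920 image is rounded to 1092x1932 -> 39*69 = 2691 tokens,
# which sits inside the default budget of [4, 4096].
print(resize_for_token_budget(1080, 1920))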

Inference

For inference, we provide a simple example; just run:

python ./infer_demo.py
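
For reference, here is a minimal sketch of what such a demo typically does, assuming the builder and conversation utilities inherited from the upstream LLaVA codebase; the paths, model name, and conversation template below are placeholders, and infer_demo.py remains the authoritative version:

import torch
from PIL import Image
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

model_path = "/Local_Path/NativeRes-LLaVA-qwen2-7b-qwen2vl"    # placeholder path
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path, None, "llava_qwen", device_map="cuda")

image = Image.open("example.jpg").convert("RGB")               # any resolution
# process_images may return a tensor or a list depending on the image-processing
# mode configured in model.config; adapt accordingly.
image_tensor = process_images([image], image_processor, model.config)

conv = conv_templates["qwen_1_5"].copy()                       # assumed template name
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image.")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer,
                                  IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(model.device)

with torch.inference_mode():
    out = model.generate(input_ids, images=image_tensor,
                         image_sizes=[image.size], max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))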

Train

Please note that the following merely mirrors the official LLaVA training strategy as a reference. You are free to choose any training strategy you believe to be correct and efficient on top of our codebase.

Stage1: Pretrain

If you want to train with the SigLIP ViT, which does not support NativeRes, run:

bash scripts/train/pretrain_siglip.sh

Otherwise, you can run in NativeRes mode, which uses the Qwen2-VL ViT to support native resolution:

bash scripts/train/pretrain_qwenvit.sh

Stage2: Finetune

For finetuning with SigLIP, just run:

bash scripts/train/direct_finetune_siglip_a4_v1.5.sh

Otherwise, you can run in NativeRes mode (this currently uses the LLaVA-1.5 finetuning dataset; feel free to swap it out):

bash scripts/train/direct_finetune_qwen_a4_v1.5_4_2048.sh

Notes

  1. ZeRO-3 is not yet supported in NativeRes mode.
  2. Update sys.path.append("/mnt/petrelfs/niujunbo/zhengyuanhong/NativeResLLaVA") to your own path.
  3. Video input is not yet supported.

Contact

Junbo Niu: [email protected], Yuanhong Zheng: [email protected]

Acknowledgements

This codebase is built upon LLaVA and leverages the open-source model Qwen2-VL-2B-Instruct. We extend our gratitude to the contributors and maintainers of these projects.

Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citation 📝.

@misc{niu2025nativevisualunderstandingresolving,
      title={Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models}, 
      author={Junbo Niu and Yuanhong Zheng and Ziyang Miao and Hejun Dong and Chunjiang Ge and Hao Liang and Ma Lu and Bohan Zeng and Qiahao Zheng and Conghui He and Wentao Zhang},
      year={2025},
      eprint={2506.12776},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.12776}, 
}
