
🌈NativeRes-LLaVA

Official code repo for our work "Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models"

📰 News

  • [2025/5/22] 🔥🔥🔥 We released the NativeRes-LLaVA (with Qwen2-ViT) 1B && 2B && 7B checkpoints on Hugging Face!
  • [2025/6/13] 🔥🔥🔥 We released the paper on arXiv!

📌 ToDo Lists

  • Release Inference Code
  • Release ✨ NativeRes-LLaVA 1B && 2B && 7B Checkpoints (with Qwen2-ViT)
  • Release NativeRes-ViT (Niujunbo2002/qwen2vit-665m-patch14-native)
  • Support SGLang 🚀🚀🚀
  • Release NativeRes-ViT (Niujunbo2002/qwen2_5_vit-668m-patch14-native)(Window Attention🌟)
  • Release ✨ NativeRes-LLaVA 1B && 2B && 7B Checkpoints (with Qwen2.5-ViT), much faster!
  • Release Training Code (The code is being organized.)
  • Release RC-Bench (The code is being organized.)
  • Support LLaMA Factory && VeRL
  • Release SOTA NativeRes-LLaVA Checkpoints and Training Recipe (Data is being collected⛽️)

Comparisons of openness and capabilities across different VLMs.

| Models | NativeRes-Training Codebase | Sequence Packing Scripts | Pre-Training Codebase | Base Model Checkpoint | SFT-Training Codebase | Instruct Model Checkpoint | Flexibly Changing Modules | Resolution Strategy |
|---|---|---|---|---|---|---|---|---|
| LLaVA | ⬜️ None | ⬜️ None | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | Fixed |
| Cambrian-1 | ⬜️ None | ⬜️ None | 🟥 Closed | 🟩 Open | 🟥 Closed | 🟩 Open | 🟩 Open | Hybrid |
| LLaVA-OneVision | ⬜️ None | ⬜️ None | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | Crop |
| Seed1.5-VL | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | Native |
| Kimi-VL | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟩 Open | 🟩 Open | 🟥 Closed | Native |
| Qwen2-VL | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟩 Open | 🟩 Open | 🟩 Open | 🟥 Closed | Native |
| NativeRes-LLaVA | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | Native |
  • Emoji: 🟩 = Open-Source, 🟥 = Closed-Source, ⬜️ = None

Architecture of NativeRes-LLaVA.

Install

This repo enables you to train a LLaVA-style model on images at their native resolution.

  1. Clone this repository and navigate to the NativeRes-LLaVA folder
git clone https://github.com/Niujunbo2002/NativeRes-LLaVA.git
cd NativeRes-LLaVA
  2. Install the package
conda create -n nativeres python=3.10 -y
conda activate nativeres
pip install --upgrade pip  # enable PEP 660 support
pip install torch==2.6.0
pip install torchaudio==2.6.0
pip install torchvision==0.21.0
pip install -r requirements.txt
pip install transformers==4.50.3
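
After installation, a quick sanity check can confirm that the environment matches the pins above (a minimal sketch; adjust the expected versions if you chose different ones):

import torch
import torchvision
import transformers

print(torch.__version__)          # expect 2.6.0
print(torchvision.__version__)    # expect 0.21.0
print(transformers.__version__)   # expect 4.50.3
print(torch.cuda.is_available())  # should be True for training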

❗️If you get stuck installing flash-attn, we recommend manually downloading a pre-built wheel of the appropriate version from the official releases page and installing it with pip.

Install the required environment from requirements.txt. The Transformers version must be recent enough to support at least the Qwen2-VL model (hence the 4.50.3 pin above).

Quick Start

First, download the checkpoints from the following folder: NativeRes-LLaVA

We have released NativeRes-ViT (qwen2-vl-665m-patch14-nativeres), a ViT model capable of handling native-resolution inputs.

We have also released the model NativeRes-LLaVA-qwen2-7b-qwen2vl, which integrates NativeRes-ViT and uses Qwen2-7B-Instruct as the language model. You are free to configure the min_image_tokens and max_image_tokens parameters (default: min_image_tokens=4, max_image_tokens=4096); the sketch after the config snippet below shows how this token budget constrains image size.

You need to first download NativeRes-ViT (Niujunbo2002/qwen2-vl-665m-patch14-nativeres) to your local machine. Then, update the "mm_vision_tower" path in Niujunbo2002/NativeRes-LLaVA-qwen2-7b-qwen2vl/config.json to point to the local path of NativeRes-ViT.

{
  "add_faster_video": false,
  "add_time_instruction": false,
  "architectures": [
    "LlavaQwenForCausalLM"
  ],
  ...
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -1,
  "mm_vision_tower": "/Local_Path/NativeRes/qwen2vl-665m-patch14-native",
  "mm_vision_tower_lr": 2e-06,
  ...
}
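
To make the token budget concrete, the sketch below shows how a native-resolution preprocessor can map an arbitrary image size to a patch-token count inside [min_image_tokens, max_image_tokens]. It mirrors the smart-resize logic of the Qwen2-VL image processor (patch size 14 with 2x2 token merging); the exact behavior of this codebase may differ, so treat it as an illustration rather than the implementation:

import math

def resize_for_token_budget(height, width, patch=14, merge=2,
                            min_image_tokens=4, max_image_tokens=4096):
    """Round (height, width) to multiples of patch*merge, then rescale so the
    number of merged patch tokens falls inside the configured budget.
    Illustrative only; follows Qwen2-VL-style smart resizing."""
    factor = patch * merge                       # one merged token covers 28x28 px
    min_pixels = min_image_tokens * factor ** 2
    max_pixels = max_image_tokens * factor ** 2
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:                       # too many tokens: scale down
        beta = math.sqrt(height * width / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:                     # too few tokens: scale up
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w, (h // factor) * (w // factor)   # new size and token count

# e.g. a 1080x1920 image is rounded to 1092x1932 -> 39*69 = 2691 tokens,
# which sits inside the default budget of [4, 4096].
print(resize_for_token_budget(1080, 1920))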

Inference

For inference, we provide a simple example; just run:

python ./infer_demo.py
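
For reference, here is a minimal sketch of what such a demo typically does, assuming the builder and conversation utilities inherited from the upstream LLaVA codebase; the paths, model name, and conversation template below are placeholders, and infer_demo.py remains the authoritative version:

import torch
from PIL import Image
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

model_path = "/Local_Path/NativeRes-LLaVA-qwen2-7b-qwen2vl"    # placeholder path
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path, None, "llava_qwen", device_map="cuda")

image = Image.open("example.jpg").convert("RGB")               # any resolution
# process_images may return a tensor or a list depending on the image-processing
# mode configured in model.config; adapt accordingly.
image_tensor = process_images([image], image_processor, model.config)

conv = conv_templates["qwen_1_5"].copy()                       # assumed template name
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image.")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer,
                                  IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(model.device)

with torch.inference_mode():
    out = model.generate(input_ids, images=image_tensor,
                         image_sizes=[image.size], max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))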

Train

Please note that the following merely mirrors the official LLaVA training strategy as a reference. You are free to choose any training strategy you believe to be correct and efficient on top of our codebase.

Stage1: Pretrain

If you want to train with the SigLIP ViT, which does not support NativeRes, run:

bash scripts/train/pretrain_siglip.sh

Otherwise, you can run in NativeRes mode, which uses the Qwen2-VL ViT to support native resolution:

bash scripts/train/pretrain_qwenvit.sh

Stage2: Finetune

For finetuning with SigLIP, just run:

bash scripts/train/direct_finetune_siglip_a4_v1.5.sh

Otherwise, you can run in NativeRes mode (this currently uses the LLaVA-1.5 finetuning dataset; feel free to swap it out):

bash scripts/train/direct_finetune_qwen_a4_v1.5_4_2048.sh

Notes

  1. ZeRO-3 is not yet supported in NativeRes mode.
  2. Update sys.path.append("/mnt/petrelfs/niujunbo/zhengyuanhong/NativeResLLaVA") to your own path.
  3. Video input is not yet supported.

Contact

Junbo Niu: [email protected], Yuanhong Zheng: [email protected]

Acknowledgements

This codebase is built upon LLaVA and leverages the open-source model Qwen2-VL-2B-Instruct. We extend our gratitude to the contributors and maintainers of these projects.

Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citation 📝.

@misc{niu2025nativevisualunderstandingresolving,
      title={Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models}, 
      author={Junbo Niu and Yuanhong Zheng and Ziyang Miao and Hejun Dong and Chunjiang Ge and Hao Liang and Ma Lu and Bohan Zeng and Qiahao Zheng and Conghui He and Wentao Zhang},
      year={2025},
      eprint={2506.12776},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.12776}, 
}
