Official code repo for our work Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models
- [2025/5/22] 🔥🔥🔥 We released NativeRes-LLaVA (with Qwen2-ViT) 1B && 2B && 7B Checkpoints on Hugging Face
- [2025/6/13] 🔥🔥🔥 We released the paper on arXiv!
- Release Inference Code
- Release ✨ NativeRes-LLaVA 1B && 2B && 7B Checkpoints (with Qwen2-ViT)
- Release NativeRes-ViT (Niujunbo2002/qwen2vit-665m-patch14-native)
- Support SGLang 🚀🚀🚀
- Release NativeRes-ViT (Niujunbo2002/qwen2_5_vit-668m-patch14-native)(Window Attention🌟)
- Release ✨ NativeRes-LLaVA 1B && 2B && 7B Checkpoints (with Qwen2.5-ViT), Even Faster!
- Release Training Code (The code is being organized.)
- Release RC-Bench (The code is being organized.)
- Support LLaMA Factory && VeRL
- Release SOTA NativeRes-LLaVA Checkpoints and Training Recipe (Data is being collected⛽️)
Models | NativeRes-Training Codebase | Sequence Packing Scripts | Pre-Training Codebase | Base Model Checkpoint | SFT-Training Codebase | Instruct Model Checkpoint | Flexibly Changing Modules | Resolution Strategy |
---|---|---|---|---|---|---|---|---|
LLaVA | ⬜️ None | ⬜️ None | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | Fixed |
Cambrian-1 | ⬜️ None | ⬜️ None | 🟥 Closed | 🟩 Open | 🟥 Closed | 🟩 Open | 🟩 Open | Hybrid |
LLaVA-OneVision | ⬜️ None | ⬜️ None | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | Crop |
Seed1.5-VL | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | Native |
Kimi-VL | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟩 Open | 🟩 Open | 🟥 Closed | Native |
Qwen2-VL | 🟥 Closed | 🟥 Closed | 🟥 Closed | 🟩 Open | 🟩 Open | 🟩 Open | 🟥 Closed | Native |
NativeRes-LLaVA | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | 🟩 Open | Native |
- Legend: 🟩 = Open-Source, 🟥 = Closed-Source, ⬜️ = Not Applicable
This repo enables you to train a LLaVA-style model on images at their native resolution.
- Clone this repository and navigate to the NativeRes-LLaVA folder
git clone https://github.com/Niujunbo2002/NativeRes-LLaVA.git
cd NativeRes-LLaVA
- Install packages
conda create -n nativeres python=3.10 -y
conda activate nativeres
pip install --upgrade pip # enable PEP 660 support
pip install torch==2.6.0
pip install torchaudio==2.6.0
pip install torchvision==0.21.0
pip install -r requirements.txt
pip install transformers==4.50.3
❗️ If you get stuck while installing flash-attn, we recommend manually downloading a prebuilt wheel that matches your environment from the official releases. Install the remaining dependencies from requirements.txt; the transformers version must be recent enough to support the Qwen2-VL model (we pin 4.50.3 above).
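As an optional sanity check, the short snippet below (plain Python relying only on public package APIs, nothing repo-specific) verifies that your environment exposes Qwen2-VL support and flash-attn:

```python
# Optional environment check: versions and Qwen2-VL / flash-attn availability.
import torch, torchvision, transformers
print("torch", torch.__version__,
      "| torchvision", torchvision.__version__,
      "| transformers", transformers.__version__)

# This import fails on transformers versions that predate Qwen2-VL support.
from transformers import Qwen2VLForConditionalGeneration  # noqa: F401
print("Qwen2-VL classes available")

try:
    import flash_attn
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional, but recommended)")
```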
First, download the checkpoints from the following folder: NativeRes-LLaVA
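If you prefer scripting the download, here is a minimal sketch using `huggingface_hub` (repo ids are copied from the model names listed in this README; adjust `local_dir` to your setup):

```python
# Fetch the released checkpoints with huggingface_hub instead of a manual download.
from huggingface_hub import snapshot_download

vit_dir = snapshot_download(
    "Niujunbo2002/qwen2-vl-665m-patch14-nativeres",            # NativeRes-ViT
    local_dir="./checkpoints/qwen2-vl-665m-patch14-nativeres")
llava_dir = snapshot_download(
    "Niujunbo2002/NativeRes-LLaVA-qwen2-7b-qwen2vl",            # 7B instruct model
    local_dir="./checkpoints/NativeRes-LLaVA-qwen2-7b-qwen2vl")
print(vit_dir, llava_dir)
```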
We have released NativeRes-ViT (qwen2-vl-665m-patch14-nativeres), a ViT model capable of handling native-resolution inputs. We have also released NativeRes-LLaVA-qwen2-7b-qwen2vl, which integrates NativeRes-ViT and uses Qwen2-7B-Instruct as the language model. You are free to configure the `min_image_tokens` and `max_image_tokens` parameters (default: `min_image_tokens=4`, `max_image_tokens=4096`).
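To see how these bounds interact with image resolution, here is a minimal sketch (not the repo's actual preprocessing code), assuming the Qwen2-VL convention of 14×14 patches merged 2×2 so that each visual token covers a 28×28 pixel tile: an image stays at (roughly) native resolution as long as its token count fits inside `[min_image_tokens, max_image_tokens]`, and is rescaled isotropically otherwise.

```python
import math

def estimate_resize(height, width, min_tokens=4, max_tokens=4096,
                    patch_size=14, merge_size=2):
    # Each visual token covers a (patch_size * merge_size)^2 pixel tile,
    # i.e. 28x28 under the Qwen2-VL convention assumed here.
    factor = patch_size * merge_size
    # Snap the native resolution onto the token grid.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    tokens = (h // factor) * (w // factor)
    if tokens > max_tokens:
        # Too many tokens: shrink both sides by the same ratio.
        scale = math.sqrt(tokens / max_tokens)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif tokens < min_tokens:
        # Too few tokens: enlarge both sides by the same ratio.
        scale = math.sqrt(min_tokens / tokens)
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w, (h // factor) * (w // factor)

# A Full-HD image: 39 x 69 = 2691 tokens, well inside [4, 4096], so it is
# only snapped to the 28-pixel grid rather than rescaled.
print(estimate_resize(1080, 1920))
```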
You need to first download NativeRes-ViT (`Niujunbo2002/qwen2-vl-665m-patch14-nativeres`) to your local machine. Then, update the `"mm_vision_tower"` path in `Niujunbo2002/NativeRes-LLaVA-qwen2-7b-qwen2vl/config.json` to point to the local path of NativeRes-ViT:
{
"add_faster_video": false,
"add_time_instruction": false,
"architectures": [
"LlavaQwenForCausalLM"
],
...
"mm_vision_select_feature": "patch",
"mm_vision_select_layer": -1,
"mm_vision_tower": "/Local_Path/NativeRes/qwen2vl-665m-patch14-native",
"mm_vision_tower_lr": 2e-06,
...
}
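You can edit config.json by hand, or patch it with a few lines of Python; the paths below are placeholders for your local setup:

```python
# Point "mm_vision_tower" in the downloaded LLaVA checkpoint at the local NativeRes-ViT copy.
import json

config_path = "/Local_Path/NativeRes-LLaVA-qwen2-7b-qwen2vl/config.json"      # placeholder path
with open(config_path) as f:
    cfg = json.load(f)

cfg["mm_vision_tower"] = "/Local_Path/NativeRes/qwen2vl-665m-patch14-native"  # placeholder path

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)
```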
For inference, we provide a simple example; just run:
python ./infer_demo.py
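For orientation, here is a rough sketch of what a LLaVA-style inference flow typically looks like (this codebase builds on LLaVA, but the module paths, model name, and conversation template below are assumptions; infer_demo.py is the authoritative reference):

```python
# Hypothetical LLaVA-style inference sketch; see infer_demo.py for the real flow.
import torch
from PIL import Image
from llava.model.builder import load_pretrained_model            # assumed upstream LLaVA API
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

model_path = "/Local_Path/NativeRes-LLaVA-qwen2-7b-qwen2vl"       # placeholder checkpoint path
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "llava_qwen")

image = Image.open("example.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
if isinstance(image_tensor, list):
    image_tensor = [t.to(model.device, dtype=torch.float16) for t in image_tensor]
else:
    image_tensor = image_tensor.to(model.device, dtype=torch.float16)

conv = conv_templates["qwen_1_5"].copy()                          # template name is an assumption
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX,
                                  return_tensors="pt").unsqueeze(0).to(model.device)
with torch.inference_mode():
    out = model.generate(input_ids, images=image_tensor,
                         image_sizes=[image.size], max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```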
Please note that the following merely mirrors the official LLaVA training strategy as a reference. You are free to choose any training strategy you consider correct and efficient on top of our codebase.
If you want to train with the SigLIP ViT, which does not support native resolution, run:
bash scripts/train/pretrain_siglip.sh
Otherwise, you can run in NativeRes mode, which uses the Qwen2-VL ViT to support native resolution:
bash scripts/train/pretrain_qwenvit.sh
For finetuning with SigLIP, just run:
bash scripts/train/direct_finetune_siglip_a4_v1.5.sh
Otherwise, you can run in NativeRes mode (currently using the LLaVA-1.5 finetuning dataset; feel free to change it):
bash scripts/train/direct_finetune_qwen_a4_v1.5_4_2048.sh
- ZeRO-3 is not yet supported in NativeRes mode.
- Update `sys.path.append("/mnt/petrelfs/niujunbo/zhengyuanhong/NativeResLLaVA")` to your own path.
- Video input is not yet supported.
Junbo Niu: [email protected], Yuanhong Zheng: [email protected]
This codebase is built upon LLaVA and leverages the open-source model Qwen2-VL-2B-Instruct. We extend our gratitude to the contributors and maintainers of these projects.
If you find our work helpful for your research, please consider giving a star ⭐ and citation 📝.
@misc{niu2025nativevisualunderstandingresolving,
title={Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models},
author={Junbo Niu and Yuanhong Zheng and Ziyang Miao and Hejun Dong and Chunjiang Ge and Hao Liang and Ma Lu and Bohan Zeng and Qiahao Zheng and Conghui He and Wentao Zhang},
year={2025},
eprint={2506.12776},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.12776},
}