🚀 ReVisual-R1 is a 7B open-source multimodal language model that follows a three-stage curriculum (cold-start pre-training, multimodal reinforcement learning, and text-only reinforcement learning) to achieve faithful, concise, and self-reflective state-of-the-art performance in visual and textual reasoning.


Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Paper · alphaXiv · GitHub · Hugging Face Collection · Twitter · HF Demo

License: MIT

📚 Overview

⚡ News

  • [2025/06/06] 🔥 The ReVisual-R1 models (Coldstart & final) are open-sourced on Hugging Face.
  • [2025/06/05] 🎉 Ranked #2 paper of the day on Hugging Face Daily Papers.
  • [2025/06/05] 🔥 The ReVisual-R1 paper is available on arXiv.

📖 Introduction

This paper introduces ReVisual-R1, a 7B open-source MLLM designed to address prevalent challenges in cultivating sophisticated multimodal reasoning. The model follows a structured three-stage curriculum: (1) a strategic, high-difficulty text-only cold-start phase that builds foundational reasoning; (2) a multimodal RL stage that employs GRPO, stabilized by our novel Prioritized Advantage Distillation (PAD) mechanism and guided by rule-based rewards, including an Efficient-Length Reward; and (3) a final TextRL refinement phase. This design reflects our finding that thoughtful data strategy and targeted algorithmic optimizations are pivotal: ReVisual-R1 achieves SOTA performance among open-source 7B models on a suite of challenging visuo-mathematical and reasoning benchmarks. The work underscores that careful curriculum design and algorithmic enhancements, rather than sheer model scale, can unlock robust, self-reflective multimodal reasoning.
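The Efficient-Length Reward is only named above; as a rough illustration of the idea, a rule-based reward of this kind might combine answer correctness with a bonus for concision. The function below is a minimal sketch under that assumption, not the paper's exact formula.

def efficient_length_reward(correct: bool, n_tokens: int,
                            budget: int = 4096, bonus: float = 0.5) -> float:
    """Illustrative rule-based reward: correctness plus a concision bonus.

    A correct answer earns 1.0, plus up to `bonus` for staying under the
    token budget; incorrect answers earn 0. This sketches the idea of an
    Efficient-Length Reward, not the paper's exact formulation.
    """
    if not correct:
        return 0.0
    return 1.0 + bonus * max(0.0, 1.0 - n_tokens / budget)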

🔑 Key Features

  1. Cold-Start Insights: We reveal that existing multimodal cold-start corpora lack sufficient difficulty and show that a high-complexity, text-centric warm-up is critical for fostering advanced visual reasoning.

  2. Stable RL Optimization: We introduce Prioritized Advantage Distillation (PAD) to overcome gradient stagnation, enabling stable and sample-efficient reinforcement learning for MLLMs (a conceptual sketch follows this list).

  3. Staged Curriculum & Model: We design a three-phase training pipeline (text warm-up, multimodal RL with PAD, and text RL) culminating in ReVisual-R1, the first open-source 7B model with self-critical, multi-hop reasoning that rivals proprietary systems.
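To make the PAD idea concrete: within a GRPO rollout group, responses whose advantage is near zero contribute almost no gradient, so sampling can be re-weighted toward high-|advantage| rollouts. The sketch below is a conceptual illustration; the function name, temperature, and epsilon floor are assumptions, not the released implementation.

import numpy as np

def pad_subsample(rewards: np.ndarray, k: int, temperature: float = 1.0,
                  eps: float = 1e-6, rng=None) -> np.ndarray:
    """Pick k rollout indices from one GRPO group, prioritized by |advantage|."""
    rng = np.random.default_rng() if rng is None else rng
    # Group-normalized advantage, as in GRPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Priority grows with |advantage|; the eps floor keeps every sample reachable.
    prio = np.abs(adv) ** (1.0 / temperature) + eps
    probs = prio / prio.sum()
    return rng.choice(len(rewards), size=min(k, len(rewards)), replace=False, p=probs)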

🍭 Results

Figure: ReVisual-R1 benchmark results.

ReVisual-R1 delivers strong performance on challenging visual-mathematical reasoning tasks while preserving strong general-purpose text skills.

🎯 Models

Model | Hugging Face | Base Model
Revisual-R1-Coldstart | https://huggingface.co/csfufu/Revisual-R1-Coldstart | Qwen2.5-VL-7B-Instruct
Revisual-R1-final | https://huggingface.co/csfufu/Revisual-R1-final | Qwen2.5-VL-7B-Instruct
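Both checkpoints share the Qwen2.5-VL architecture, so they load with standard Hugging Face tooling. A minimal sketch, assuming a recent transformers release with Qwen2.5-VL support (plus accelerate for device_map="auto"):

# Load the final checkpoint; swap in csfufu/Revisual-R1-Coldstart for the cold-start model.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "csfufu/Revisual-R1-final", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("csfufu/Revisual-R1-final")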

🧮 Datasets

Figure: dataset overview.

We will open-source the GRAMMAR dataset, which includes high-quality data for the cold start, within the next two weeks. Stay tuned!

Dataset | Hugging Face | Size
MMRL | https://huggingface.co/datasets/csfufu/mmrl | 30.9K
TextRL | https://huggingface.co/datasets/csfufu/textrl | 32.5K
Coldstart | https://huggingface.co/datasets/csfufu/Grammer_dataset | 47.3K
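The datasets can be pulled with the Hugging Face datasets library; a minimal sketch (split names are whatever the hub repos define, so inspect the loaded object first):

from datasets import load_dataset

# Replace with csfufu/textrl or csfufu/Grammer_dataset for the other datasets.
ds = load_dataset("csfufu/mmrl")
print(ds)  # shows the available splits and columns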

✨ Getting Started

🔧 Installing

You can install Revisual-R1 dependencies by running the following commands:

conda create -n revisual python=3.10 -y && conda activate revisual

cd Revisual-R1
pip3 install -e .

If you encounter issues when installing flash-attn, we recommend installing a prebuilt wheel from the flash-attn releases page. For example, we use this version:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
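To confirm the wheel installed cleanly, a quick import check suffices:

# Sanity check: the module imports and reports the expected version.
import flash_attn
print(flash_attn.__version__)  # 2.7.3 for the wheel above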

🔧 Training

Cold Start Training

bash ./cold_start/run_cold_start.sh

Staged Reinforcement Optimization

bash ./examples/main.sh

If you have trouble connecting to Hugging Face, consider setting export HF_ENDPOINT=https://hf-mirror.com.

🔧 Merge Checkpoint in Hugging Face Format

python3 scripts/model_merger.py --local_dir checkpoints/${ProjectName}/exp_name/global_step_100/actor

🎁 Evaluation

🤖 Usage

usage: main.py [-h] --model-name MODEL_NAME --openai-api-key OPENAI_API_KEY [--openai-base-url OPENAI_BASE_URL] [--cache-dir CACHE_DIR] [--output-dir OUTPUT_DIR] [--max-tokens MAX_TOKENS] [--min-pixels MIN_PIXELS]
               [--max-pixels MAX_PIXELS] [--temperature TEMPERATURE] [--top-p TOP_P] [--system-prompt SYSTEM_PROMPT] [--datasets DATASETS] [--dataset-dir DATASET_DIR] [--eval-threads EVAL_THREADS] [--max-retries MAX_RETRIES]

Unified evaluation for multimodal math datasets

options:
  -h, --help            show this help message and exit
  --model-name MODEL_NAME
                        The name of the model to use
  --openai-api-key OPENAI_API_KEY
                        The API key for the OpenAI API
  --openai-base-url OPENAI_BASE_URL
                        The base URL for the OpenAI API
  --cache-dir CACHE_DIR
                        Directory to cache predictions
  --output-dir OUTPUT_DIR
                        Directory to save results
  --max-tokens MAX_TOKENS
                        Maximum number of tokens to generate
  --min-pixels MIN_PIXELS
  --max-pixels MAX_PIXELS
  --temperature TEMPERATURE
                        Sampling temperature
  --top-p TOP_P         Top-p sampling
  --system-prompt SYSTEM_PROMPT
                        System prompt for the model
  --datasets DATASETS   Comma-separated list of datasets to evaluate: geo3k,wemath,mathvista,mathverse,mathvision or 'all'
  --dataset-dir DATASET_DIR
  --eval-threads EVAL_THREADS
                        Number of threads for evaluation
  --max-retries MAX_RETRIES
                        Maximum number of retries for evaluation

🔓 Examples

(1) Evaluate a model directly via OpenAI API

python ./src/main.py --model-name="gpt-4.1" \
	--openai-api-key="YOUR_API_KEY" \
	--cache-dir="./cache"

(2) Deploy and evaluate a local model using lmdeploy

lmdeploy serve api_server \
	/path/to/local/lmm \
	--model-name lmm_name \
	--server-port 23333 \
	--chat-template qwen2d5-vl

python ./src/main.py --model-name="lmm_name" \
	--openai-base-url="http://localhost:23333/v1" \
	--openai-api-key="YOUR_API_KEY" \
	--cache-dir="./cache"
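Since lmdeploy serves an OpenAI-compatible endpoint, you can also query the served model directly with the OpenAI Python client; a minimal sketch (the model name must match the --model-name passed to the server):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="YOUR_API_KEY")
resp = client.chat.completions.create(
    model="lmm_name",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
print(resp.choices[0].message.content)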

🖥️ Inference

Run the command below.

MODEL_PATH="Revisual-R1"
IMAGE_PATH="xxx"
MAX_TOKENS=16384
DO_SAMPLE=True
TEMPERATURE=1.0
TOP_P=0.95
TOP_K=50
NUM_RETURN_SEQUENCES=1

PROMPT="You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}."
QUESTION="xxx"

python infer.py \
 --model_path "${MODEL_PATH}" \
 --image_path "${IMAGE_PATH}" \
 --question "${QUESTION}" \
 --prompt "${PROMPT}" \
 --max_tokens ${MAX_TOKENS} \
 --do_sample ${DO_SAMPLE} \
 --temperature ${TEMPERATURE} \
 --top_p ${TOP_P} \
 --top_k ${TOP_K} \
 --num_return_sequences ${NUM_RETURN_SEQUENCES}

You can also modify the arguments in inference/inference.sh

bash inference/inference.sh
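Responses follow the prompt contract above: the reasoning trace sits inside <think> </think> tags and the final answer inside \boxed{}. A small regex helper to split the two (illustrative, not part of the repo):

import re

def parse_output(text: str):
    """Split a response into its <think> trace and \\boxed{} answer."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"\\boxed\{([^}]*)\}", text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )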

🏝️ Reasoning Example

A reasoning case from ReVisual-R1, showcasing its reasoning ability: the model generates a long response, continuously hypothesizing, reflecting, verifying, and correcting before arriving at the final answer, and closes with a summary answer.

🚧 TODO

We plan to complete these tasks over the next few weeks; please stay tuned!

  • 🚧 Release the training datasets (Coldstart).
  • 🚧 Train 32B and 3B versions of ReVisual-R1 and release them when finished.

📮 Contact

For questions, feedback, or collaboration opportunities, feel free to reach out: [email protected]

📄 Citation

If you find our work useful for your research, please consider citing:

@article{chen2025advancing,
  title={Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning},
  author={Chen, Shuang and Guo, Yue and Su, Zhaochen and Li, Yafu and Wu, Yulun and Chen, Jiacheng and Chen, Jiayu and Wang, Weijie and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2506.04207},
  year={2025}
}

⭐️ Star History

Star History Chart
