
Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

Paper: [arXiv:2508.00410] "Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement"

[Figure: overview of the Co-rewarding framework]

Co-rewarding is a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views. Specifically, we instantiate Co-rewarding in two ways: (1) Co-rewarding-I is a data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions; and (2) Co-rewarding-II is a model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation. Intuitively, these instantiations introduce different levels of discrepancy, making it harder for training to collapse onto trivial reasoning solutions.
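As a rough, paper-level sketch (not the repository's actual implementation; all function and variable names here are hypothetical, and the momentum value is an assumption), the two instantiations can be summarized as follows:

from collections import Counter

def co_rewarding_i_rewards(answers_orig, answers_para):
    """Co-rewarding-I (data side): reward rollouts on the original question
    by agreement with the majority answer obtained on a semantically
    analogous (rephrased) question, and vice versa."""
    majority_orig = Counter(answers_orig).most_common(1)[0][0]
    majority_para = Counter(answers_para).most_common(1)[0][0]
    rewards_orig = [1.0 if a == majority_para else 0.0 for a in answers_orig]
    rewards_para = [1.0 if a == majority_orig else 0.0 for a in answers_para]
    return rewards_orig, rewards_para

def update_reference_teacher(teacher, student, momentum=0.99):
    """Co-rewarding-II (model side): keep a slowly-updated copy of the policy
    as a reference teacher whose pseudo labels supervise the student
    (self-distillation). Parameters are plain name->tensor dicts here."""
    for name, param in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * param
    return teacher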

Install Environment

# 1. create a new conda environment
conda create -y -n co-rewarding python=3.10
conda activate co-rewarding

# 2. clone the repository
git clone https://github.com/tmlr-group/Co-rewarding.git
cd Co-rewarding

# 3. install the necessary packages
bash scripts/install_env.sh

# 4. install verl in editable mode (adds it to PYTHONPATH)
cd Co-rewarding-II
pip install -e . --no-deps
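To verify the editable install, a quick sanity check (assuming the package is importable as verl):

# verl should now be importable from anywhere in this environment
import verl
print(verl.__file__)  # should point into the Co-rewarding-II checkout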

Training Co-rewarding

Choose a training set, then set training_files and training_aug_files for Co-rewarding-I, or training_files for Co-rewarding-II:

  • MATH: data/math
  • DAPO-14k: data/dapo
  • OpenRS: data/open-rs

Modify WANDB_KEY and LLM_PATH in Co-rewarding-I/run_corewarding-I.sh and Co-rewarding-II/run_corewarding-II.sh to your own WANDB key and the paths of your downloaded LLMs, then run the following commands:

# Co-rewarding-I
cd Co-rewarding-I
bash run_corewarding-I.sh

# Co-rewarding-II
cd Co-rewarding-II
bash run_corewarding-II.sh

Preprocess the training set

The training datasets (e.g., MATH, DAPO-14k, and OpenRS) are provided under Co-rewarding-II/data. If you want to preprocess them from scratch:

cd Co-rewarding-II

# MATH
python examples/data_preprocess/math_dataset.py

# DAPO-14k
python examples/data_preprocess/dapo14k_dataset.py

# OpenRS
python examples/data_preprocess/open-rs.py
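To spot-check a preprocessed split before training, something like the following should work (a minimal sketch, assuming pandas with parquet support; exact column names may differ):

import pandas as pd

df = pd.read_parquet("data/math/train.parquet")
print(df.shape)              # number of training questions
print(df.columns.tolist())   # should include the "prompt" column used below
print(df.iloc[0]["prompt"])  # one example question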

To generate the rephrased data for training Co-rewarding-I:

# 1. Copy preprocessed dataset from Co-rewarding-II to Co-rewarding-I
cp -r Co-rewarding-II/data/* Co-rewarding-I/data/

# 2. Rephrase Data
python rewrite_questions.py \
  --input_path data/math/train.parquet \
  --output_jsonl data/math/train_rewrite_Qwen3-32B.jsonl \
  --output_parquet data/math/train_rewrite_Qwen3-32B.parquet \
  --output_original_parquet data/math/train_original.parquet \
  --model_path $YOUR_QWEN3_32B_MODEL_PATH \
  --tokenizer_path $YOUR_QWEN3_32B_TOKENIZER_PATH \
  --question_column prompt \
  --batch_size 128

python rewrite_questions.py \
  --input_path data/dapo/train.parquet \
  --output_jsonl data/dapo/train_rewrite_Qwen3-32B.jsonl \
  --output_parquet data/dapo/train_rewrite_Qwen3-32B.parquet \
  --output_original_parquet data/dapo/train_original.parquet \
  --model_path $YOUR_QWEN3_32B_MODEL_PATH \
  --tokenizer_path $YOUR_QWEN3_32B_TOKENIZER_PATH \
  --question_column prompt \
  --batch_size 128

python rewrite_questions.py \
  --input_path data/open-rs/train.parquet \
  --output_jsonl data/open-rs/train_rewrite_Qwen3-32B.jsonl \
  --output_parquet data/open-rs/train_rewrite_Qwen3-32B.parquet \
  --output_original_parquet data/open-rs/train_original.parquet \
  --model_path $YOUR_QWEN3_32B_MODEL_PATH \
  --tokenizer_path $YOUR_QWEN3_32B_TOKENIZER_PATH \
  --question_column prompt \
  --batch_size 128
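After rephrasing, a quick way to check that the rewritten split stays aligned row-for-row with the original (a minimal sketch; assumes both parquet files preserve the source row order):

import pandas as pd

orig = pd.read_parquet("data/math/train_original.parquet")
rew = pd.read_parquet("data/math/train_rewrite_Qwen3-32B.parquet")

assert len(orig) == len(rew), "each original question should have one rephrasing"
print(orig["prompt"].iloc[0])  # original question
print(rew["prompt"].iloc[0])   # semantically analogous rephrased question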

Dataset

We release the rephrased datasets used to train Co-rewarding-I.

Checkpoints

We release all checkpoints from our experiments, including Co-rewarding-I, Co-rewarding-II, and all baselines.
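All released checkpoints are standard Hugging Face models, so loading one should look like the following (assuming the transformers library; pick any model ID from the tables below):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TMLR-Group-HF/Co-rewarding-I-Qwen2.5-3B-MATH"  # any ID from the tables below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)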

Trained on MATH

| Model Name | Model Size | Method | Hugging Face Link |
| --- | --- | --- | --- |
| TMLR-Group-HF/Co-rewarding-I-Qwen2.5-3B-MATH | 3B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-I-Qwen2.5-7B-MATH | 7B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-I-Qwen3-1.7B-Base-MATH | 1.7B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-MATH | 4B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-MATH | 8B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-I-Llama-3.2-3B-Instruct-MATH | 3B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-II-Qwen2.5-3B-MATH | 3B | Co-rewarding-II | View Model |
| TMLR-Group-HF/Co-rewarding-II-Qwen2.5-7B-MATH | 7B | Co-rewarding-II | View Model |
| TMLR-Group-HF/Co-rewarding-II-Qwen3-1.7B-Base-MATH | 1.7B | Co-rewarding-II | View Model |
| TMLR-Group-HF/Co-rewarding-II-Qwen3-4B-Base-MATH | 4B | Co-rewarding-II | View Model |
| TMLR-Group-HF/Co-rewarding-II-Qwen3-8B-Base-MATH | 8B | Co-rewarding-II | View Model |
| TMLR-Group-HF/Co-rewarding-II-Llama-3.2-3B-Instruct-MATH | 3B | Co-rewarding-II | View Model |
| TMLR-Group-HF/GT-Qwen2.5-3B-MATH | 3B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Qwen2.5-7B-MATH | 7B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Qwen3-1.7B-Base-MATH | 1.7B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Qwen3-4B-Base-MATH | 4B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Qwen3-8B-Base-MATH | 8B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Llama-3.2-3B-Instruct-MATH | 3B | GT-GRPO | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen2.5-3B-MATH | 3B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen2.5-7B-MATH | 7B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen3-1.7B-Base-MATH | 1.7B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen3-4B-Base-MATH | 4B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-MATH | 8B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Self-Certainty-Llama-3.2-3B-Instruct-MATH | 3B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Entropy-Qwen2.5-3B-MATH | 3B | Entropy Minimization | View Model |
| TMLR-Group-HF/Entropy-Qwen2.5-7B-MATH | 7B | Entropy Minimization | View Model |
| TMLR-Group-HF/Entropy-Qwen3-1.7B-Base-MATH | 1.7B | Entropy Minimization | View Model |
| TMLR-Group-HF/Entropy-Qwen3-4B-Base-MATH | 4B | Entropy Minimization | View Model |
| TMLR-Group-HF/Entropy-Qwen3-8B-Base-MATH | 8B | Entropy Minimization | View Model |
| TMLR-Group-HF/Entropy-Llama-3.2-3B-Instruct-MATH | 3B | Entropy Minimization | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen2.5-3B-MATH | 3B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen2.5-7B-MATH | 7B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen3-1.7B-Base-MATH | 1.7B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen3-4B-Base-MATH | 4B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-MATH | 8B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Llama-3.2-3B-Instruct-MATH | 3B | Majority-Voting | View Model |

Trained on DAPO-14k

| Model Name | Model Size | Method | Hugging Face Link |
| --- | --- | --- | --- |
| TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-DAPO14k | 4B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-DAPO14k | 8B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-I-Llama-3.2-3B-Instruct-DAPO14k | 3B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-II-Qwen3-4B-Base-DAPO14k | 4B | Co-rewarding-II | View Model |
| TMLR-Group-HF/Co-rewarding-II-Qwen3-8B-Base-DAPO14k | 8B | Co-rewarding-II | View Model |
| TMLR-Group-HF/Co-rewarding-II-Llama-3.2-3B-Instruct-DAPO14k | 3B | Co-rewarding-II | View Model |
| TMLR-Group-HF/GT-Qwen3-4B-Base-DAPO14k | 4B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k | 8B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Llama-3.2-3B-Instruct-DAPO14k | 3B | GT-GRPO | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen3-4B-Base-DAPO14k | 4B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-DAPO14k | 8B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Self-Certainty-Llama-3.2-3B-Instruct-DAPO14k | 3B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Entropy-Qwen3-4B-Base-DAPO14k | 4B | Entropy Minimization | View Model |
| TMLR-Group-HF/Entropy-Qwen3-8B-Base-DAPO14k | 8B | Entropy Minimization | View Model |
| TMLR-Group-HF/Entropy-Llama-3.2-3B-Instruct-DAPO14k | 3B | Entropy Minimization | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen3-4B-Base-DAPO14k | 4B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-DAPO14k | 8B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Llama-3.2-3B-Instruct-DAPO14k | 3B | Majority-Voting | View Model |

Trained on OpenRS

| Model Name | Model Size | Method | Hugging Face Link |
| --- | --- | --- | --- |
| TMLR-Group-HF/Co-rewarding-I-Qwen3-4B-Base-OpenRS | 4B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-OpenRS | 8B | Co-rewarding-I | View Model |
| TMLR-Group-HF/Co-rewarding-II-Qwen3-4B-Base-OpenRS | 4B | Co-rewarding-II | View Model |
| TMLR-Group-HF/Co-rewarding-II-Qwen3-8B-Base-OpenRS | 8B | Co-rewarding-II | View Model |
| TMLR-Group-HF/GT-Qwen3-4B-Base-OpenRS | 4B | GT-GRPO | View Model |
| TMLR-Group-HF/GT-Qwen3-8B-Base-OpenRS | 8B | GT-GRPO | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen3-4B-Base-OpenRS | 4B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Self-Certainty-Qwen3-8B-Base-OpenRS | 8B | Self-Certainty Maximization | View Model |
| TMLR-Group-HF/Entropy-Qwen3-4B-Base-OpenRS | 4B | Entropy Minimization | View Model |
| TMLR-Group-HF/Entropy-Qwen3-8B-Base-OpenRS | 8B | Entropy Minimization | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen3-4B-Base-OpenRS | 4B | Majority-Voting | View Model |
| TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-OpenRS | 8B | Majority-Voting | View Model |

Citation

If you use our datasets or models, please cite our paper!
