
🧠 reasoning-minimal

Minimal code for turning a base model into a reasoning model with Group Relative Policy Optimization (GRPO).

The model is trained to output

  • a private chain-of-thought wrapped in <think>...</think>
  • a final answer wrapped in <answer>...</answer>
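
For example, a well-formed completion for a GSM8K problem looks like this (the content is illustrative):

<think>Natalia sold clips to 48 friends in April and half as many, 24, in May: 48 + 24 = 72.</think>
<answer>72</answer>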

Two types of reward functions are applied simultaneously: one enforces format correctness, the other the accuracy of the answer and its chain of thought.
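
For concreteness, here is a minimal sketch of the two reward types under TRL’s reward-function interface (function names are illustrative, not the repository's; completions are assumed to arrive as plain strings, and the repository checks answers with math_verify rather than the plain string comparison used here):

import re

FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completions, **kwargs):
    # 1.0 when the completion follows the <think>/<answer> template, else 0.0
    return [1.0 if FORMAT_RE.match(c.strip()) else 0.0 for c in completions]

def accuracy_reward(completions, answer, **kwargs):
    # 1.0 when the text inside <answer> matches the reference final answer;
    # GSM8K references put the final number after "####"
    rewards = []
    for completion, ref in zip(completions, answer):
        m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        pred = m.group(1).strip() if m else ""
        rewards.append(1.0 if pred == ref.split("####")[-1].strip() else 0.0)
    return rewards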


✨ Highlights

  • Lightweight base: Qwen/Qwen2.5-0.5B-Instruct
  • Training on GSM8K grade-school math dataset
  • Optional LoRA adaptation (uncomment to enable) for low-VRAM training
  • Precise math checking with the math_verify library
  • Plug-and-play multi-reward RL via TRL’s GRPOTrainer (see the sketch after this list)
  • Integrated Weights & Biases logging and automatic inference demo
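
Putting the highlights together, a sketch of the training wiring with TRL’s GRPOTrainer, reusing the reward functions sketched above (argument values are illustrative; the commented peft_config line shows where LoRA plugs in):

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# GSM8K ships "question"/"answer" columns; GRPOTrainer expects a "prompt" column
# and passes the remaining columns (here "answer") to the reward functions.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[format_reward, accuracy_reward],  # multi-reward RL
    args=GRPOConfig(
        output_dir="outputs/Qwen/Qwen2.5-0.5B-Instruct",
        report_to="wandb",  # Weights & Biases logging
    ),
    train_dataset=dataset,
    # peft_config=LoraConfig(r=16, lora_alpha=32),  # uncomment for low-VRAM LoRA training
)
trainer.train()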

🚀 Setup & Usage

pip install torch transformers datasets peft trl math_verify wandb
python train.py --model-path-or-dir Qwen/Qwen2.5-0.5B-Instruct
python evaluation.py --model-path-or-dir outputs/Qwen/Qwen2.5-0.5B-Instruct
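
After training, a quick inference check might look like this (a sketch; the prompt and generation settings are illustrative):

from transformers import pipeline

# Load the fine-tuned checkpoint written by train.py
generator = pipeline("text-generation", model="outputs/Qwen/Qwen2.5-0.5B-Instruct")

prompt = ("Natalia sold clips to 48 of her friends in April, and then she sold "
          "half as many clips in May. How many clips did Natalia sell altogether?")
out = generator(prompt, max_new_tokens=256, return_full_text=False)
print(out[0]["generated_text"])  # expect <think>...</think> followed by <answer>72</answer>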
