feat(trainer): add PAPOTrainer for preference-based optimization #4334
base: main
Conversation
Thanks for your contribution. Can you move this new trainer to trl.experimental instead? Ideally, we would also have a small mention in the paper index section of the documentation.
renamed: trl/trainer/papo_config.py -> trl/experimental/papo/papo_config.py
renamed: trl/trainer/papo_trainer.py -> trl/experimental/papo/papo_trainer.py
modified: trl/trainer/__init__.py
Thank you for your advice. I have moved this new trainer to trl.experimental and also added PAPO info in the paper index.
docs/source/paper_index.md
Outdated
### PAPO (Preference Alignment via Policy Optimization)

* **Paper:** [Link to Paper.](https://arxiv.org/abs/2507.06448)
can you please try to align with other subsections
### Paper title
**📜 Paper**: https://huggingface.co/papers/XXXX.XXXXX
Some brief intro.
```python
from trl import ...
training_args = ...
```
done
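For reference, the code portion of the aligned PAPO entry could look roughly like this (a sketch only: the `trl.experimental.papo` import path follows the move requested above, and the config arguments are illustrative, not the final API):
```python
# Sketch of the snippet under the "### PAPO" heading in docs/source/paper_index.md.
# Import path and output_dir value are assumptions based on this PR.
from trl.experimental.papo import PAPOConfig, PAPOTrainer

training_args = PAPOConfig(output_dir="Qwen2-0.5B-PAPO")
```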
```python
def create_model_card(
    self,
    model_name: Optional[str] = None,
    dataset_name: Optional[str] = None,
    tags: Union[str, list[str], None] = None,
):
    """
    Creates a model card for PAPO trainer.
    """
    if not self.is_world_process_zero():
        return

    # Normalize tags
    if tags is None:
        tags = set()
    elif isinstance(tags, str):
        tags = {tags}
    else:
        tags = set(tags)

    tags.update(self._tag_names)

    # PAPO doesn't have a published paper yet, so we reference the GRPO paper
    # and note that PAPO extends it for multimodal reasoning
    citation = """\
@article{wang2025perception,
```
can you please remove this method and use a class method for the citation instead. See trl/trainer/dpo_trainer.py, lines 253 to 268 in 5e691d1:
```python
_tag_names = ["trl", "dpo"]
_name = "DPO"
_paper = {
    "title": "Direct Preference Optimization: Your Language Model is Secretly a Reward Model",
    "id": "2305.18290",
    # docstyle-ignore
    "citation": textwrap.dedent("""\
        @inproceedings{rafailov2023direct,
            title = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
            author = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
            year = 2023,
            booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
            url = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
            editor = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
        }"""),
}
```
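Applied to this PR, a minimal sketch of the same pattern for `PAPOTrainer` could look like the following (the title and BibTeX body are placeholders; the arXiv id is taken from the paper-index diff above and should be double-checked):
```python
import textwrap

from trl import GRPOTrainer


class PAPOTrainer(GRPOTrainer):
    _tag_names = ["trl", "papo"]
    _name = "PAPO"
    _paper = {
        # Placeholder metadata: fill in the real title and BibTeX fields before merging.
        "title": "PAPO (Preference Alignment via Policy Optimization)",
        "id": "2507.06448",  # arXiv id from the paper-index diff above
        # docstyle-ignore
        "citation": textwrap.dedent("""\
            @article{wang2025perception,
            }"""),  # placeholder BibTeX entry: complete with the real fields
    }
```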
done
…/trl-papo into feat/trainer-papo
What does this PR do?
This PR introduces a new trainer named `PAPOTrainer`, which extends `GRPOTrainer` to support the PAPO (Preference Alignment via Policy Optimization) algorithm.
Motivation
PAPO is a variant of GRPO that incorporates a contrastive preference optimization mechanism to improve stability when positive samples are sparse. However, the official implementation uses verl. To make it convenient for everyone to use, I implemented a TRL version based on the PAPO formulation, and it runs successfully.
Implementation Details
- Added `trl/trainer/papo_trainer.py`
- Added `trl/trainer/papo_config.py`
- Updated `__init__.py` to include `PAPOTrainer`
- Tests: `pytest tests/trainer/test_papo_trainer.py -v`
🧪 Example Usage
https://github.com/SolarWindRider/avr/blob/main/train_papo.py
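For convenience, here is a minimal, self-contained usage sketch under the assumption that `PAPOConfig`/`PAPOTrainer` mirror the `GRPOConfig`/`GRPOTrainer` API; the dataset, model name, and reward function are illustrative, and PAPO-specific options are omitted:
```python
from datasets import load_dataset

# Import path assumes the move to trl.experimental.papo requested above.
from trl.experimental.papo import PAPOConfig, PAPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward function standing in for a real one: favors shorter completions.
def reward_len(completions, **kwargs):
    return [-float(len(c)) for c in completions]

training_args = PAPOConfig(output_dir="Qwen2-0.5B-PAPO")
trainer = PAPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```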
I have tested my trainer (with PEFT and FSDP) on Ascend 910C and H20 (single node with 8 cards).
Checklist
pytest