
Conversation


@SolarWindRider SolarWindRider commented Oct 24, 2025

What does this PR do?

This PR introduces a new trainer named `PAPOTrainer`, which extends `GRPOTrainer` to support the PAPO (Perception-Aware Policy Optimization) algorithm.

Motivation

PAPO is a variant of GRPO designed to improve multimodal reasoning. However, the official implementation is built on verl. To make it convenient for everyone to use, I implemented a TRL version based on the PAPO formulation, and it runs successfully.

Implementation Details

  • Added trl/trainer/papo_trainer.py
  • Added trl/trainer/papo_config.py
  • Updated __init__.py to include PAPOTrainer
  • All tests pass locally with pytest tests/trainer/test_papo_trainer.py -v

🧪 Example Usage

https://github.com/SolarWindRider/avr/blob/main/train_papo.py
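For quick reference, a minimal sketch of what a run could look like, assuming the names introduced in this PR (`PAPOConfig`, `PAPOTrainer`) and a constructor that mirrors `GRPOTrainer`'s; the model, dataset, and reward function are toy placeholders (a real PAPO run would use a vision-language model and a multimodal dataset):

```python
from datasets import load_dataset
from trl import PAPOConfig, PAPOTrainer  # assumed export after the __init__.py update in this PR

# Toy reward: prefer completions close to 20 characters.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder text dataset for illustration

training_args = PAPOConfig(output_dir="Qwen2-0.5B-PAPO")
trainer = PAPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```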

I have tested this trainer (with PEFT and FSDP) on Ascend 910C and H20 (single node with 8 devices). A sketch of the PEFT setup follows.
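Here is a sketch of how a LoRA adapter could be plugged in, assuming `PAPOTrainer` inherits `GRPOTrainer`'s `peft_config` parameter (not confirmed in this thread):

```python
from peft import LoraConfig

# LoRA configuration for a Qwen2-style model; adjust target_modules for other architectures.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Passed to the trainer alongside the arguments shown in the sketch above:
# PAPOTrainer(..., peft_config=peft_config)
```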

Checklist

  • I have tested this code locally
  • I have run all tests with pytest
  • I have followed the code style guidelines
  • I have added docstrings and comments

@qgallouedec
Member

Thanks for your contribution. Can you move this new trainer to `trl.experimental` instead? Ideally, we would also add a small mention in the paper index section of the documentation.

SolarWindRider and others added 4 commits October 26, 2025 00:51
	renamed:    trl/trainer/papo_config.py -> trl/experimental/papo/papo_config.py
	renamed:    trl/trainer/papo_trainer.py -> trl/experimental/papo/papo_trainer.py
	modified:   trl/trainer/__init__.py
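After these renames the trainer lives under the experimental namespace; a sketch of the resulting import, assuming the package `__init__.py` re-exports both classes (that file is not shown in this thread):

```python
# Assumed import path after the move to trl.experimental; actual re-exports
# depend on trl/experimental/papo/__init__.py.
from trl.experimental.papo import PAPOConfig, PAPOTrainer
```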
@SolarWindRider
Author

> Thanks for your contribution. Can you move this new trainer to `trl.experimental` instead? Ideally, we would also add a small mention in the paper index section of the documentation.

Thank you for your advice. I have moved the new trainer to `trl.experimental` and added a PAPO entry to the paper index.


### PAPO (Perception-Aware Policy Optimization)

* **Paper:** [Link to Paper.](https://arxiv.org/abs/2507.06448)
Member

@qgallouedec qgallouedec Oct 28, 2025


Can you please try to align with the other subsections?

### Paper title

**📜 Paper**: https://huggingface.co/papers/XXXX.XXXXX

Some brief intro.

```python
from trl import ...

training_args = ...
```

Author


done

Comment on lines 328 to 353
def create_model_card(
    self,
    model_name: Optional[str] = None,
    dataset_name: Optional[str] = None,
    tags: Union[str, list[str], None] = None,
):
    """
    Creates a model card for PAPO trainer.
    """
    if not self.is_world_process_zero():
        return

    # Normalize tags
    if tags is None:
        tags = set()
    elif isinstance(tags, str):
        tags = {tags}
    else:
        tags = set(tags)

    tags.update(self._tag_names)

    # PAPO doesn't have a published paper yet, so we reference the GRPO paper
    # and note that PAPO extends it for multimodal reasoning
    citation = """\
    @article{wang2025perception,
Member


Can you please remove this method and use class attributes for the citation instead? See

_tag_names = ["trl", "dpo"]
_name = "DPO"
_paper = {
"title": "Direct Preference Optimization: Your Language Model is Secretly a Reward Model",
"id": "2305.18290",
# docstyle-ignore
"citation": textwrap.dedent("""\
@inproceedings{rafailov2023direct,
title = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
author = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
year = 2023,
booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
url = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
editor = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
}"""),
}
for an example

Author


done
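For reference, the class-level definition requested above might look like the following for PAPO. This is only a sketch: the citation key comes from the removed `create_model_card` method, the arXiv id from the paper-index diff earlier in this thread, the title is assumed to match that arXiv entry, and the author field is omitted so it can be copied from the paper.

```python
import textwrap

from trl import GRPOTrainer


class PAPOTrainer(GRPOTrainer):
    _tag_names = ["trl", "papo"]
    _name = "PAPO"
    _paper = {
        "title": "Perception-Aware Policy Optimization for Multimodal Reasoning",
        "id": "2507.06448",
        # docstyle-ignore
        "citation": textwrap.dedent("""\
            @article{wang2025perception,
                title   = {{Perception-Aware Policy Optimization for Multimodal Reasoning}},
                year    = 2025,
                journal = {arXiv preprint arXiv:2507.06448},
            }"""),
    }
```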
