feat(trainer): add PAPOTrainer for preference-based optimization #4334
base: main
Conversation
Thanks for your contribution. Can you move this new trainer to trl.experimental instead? Ideally, we would also have a small mention in the paper index section of the documentation.
renamed: trl/trainer/papo_config.py -> trl/experimental/papo/papo_config.py
renamed: trl/trainer/papo_trainer.py -> trl/experimental/papo/papo_trainer.py
modified: trl/trainer/__init__.py
Thank you for your advice. I have moved this new trainer to trl.experimental and also added PAPO info in the paper index.
docs/source/paper_index.md
Outdated
### PAPO (Preference Alignment via Policy Optimization)

* **Paper:** [Link to Paper.](https://arxiv.org/abs/2507.06448)
can you please try to align with other subsections
### Paper title
**📜 Paper**: https://huggingface.co/papers/XXXX.XXXXX
Some brief intro.
```python
from trl import ...
training_args = ...
```
done
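For reference, the code portion of the aligned PAPO entry could look roughly like this (a sketch only: the `trl.experimental.papo` import path follows the move requested above, and the config arguments are illustrative, not the final API):
```python
# Sketch of the snippet under the "### PAPO" heading in docs/source/paper_index.md.
# Import path and output_dir value are assumptions based on this PR.
from trl.experimental.papo import PAPOConfig, PAPOTrainer

training_args = PAPOConfig(output_dir="Qwen2-0.5B-PAPO")
```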
```python
def create_model_card(
    self,
    model_name: Optional[str] = None,
    dataset_name: Optional[str] = None,
    tags: Union[str, list[str], None] = None,
):
    """
    Creates a model card for PAPO trainer.
    """
    if not self.is_world_process_zero():
        return

    # Normalize tags
    if tags is None:
        tags = set()
    elif isinstance(tags, str):
        tags = {tags}
    else:
        tags = set(tags)

    tags.update(self._tag_names)

    # PAPO doesn't have a published paper yet, so we reference the GRPO paper
    # and note that PAPO extends it for multimodal reasoning
    citation = """\
@article{wang2025perception,
```
can you please remove this method and use a class method for the citation instead. See trl/trainer/dpo_trainer.py, lines 253 to 268 in 5e691d1:
```python
_tag_names = ["trl", "dpo"]
_name = "DPO"
_paper = {
    "title": "Direct Preference Optimization: Your Language Model is Secretly a Reward Model",
    "id": "2305.18290",
    # docstyle-ignore
    "citation": textwrap.dedent("""\
        @inproceedings{rafailov2023direct,
            title = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
            author = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
            year = 2023,
            booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
            url = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
            editor = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
        }"""),
}
```
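Applied to this PR, a minimal sketch of the same pattern for `PAPOTrainer` could look like the following (the title and BibTeX body are placeholders; the arXiv id is taken from the paper-index diff above and should be double-checked):
```python
import textwrap

from trl import GRPOTrainer


class PAPOTrainer(GRPOTrainer):
    _tag_names = ["trl", "papo"]
    _name = "PAPO"
    _paper = {
        # Placeholder metadata: fill in the real title and BibTeX fields before merging.
        "title": "PAPO (Preference Alignment via Policy Optimization)",
        "id": "2507.06448",  # arXiv id from the paper-index diff above
        # docstyle-ignore
        "citation": textwrap.dedent("""\
            @article{wang2025perception,
            }"""),  # placeholder BibTeX entry: complete with the real fields
    }
```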
done
…/trl-papo into feat/trainer-papo
What does this PR do?
This PR introduces a new trainer named `PAPOTrainer`, which extends `GRPOTrainer` to support the PAPO (Preference Alignment via Policy Optimization) algorithm.
Motivation
PAPO is a variant of GRPO that incorporates a contrastive preference optimization mechanism to improve stability when positive samples are sparse. However, the official implementation uses verl. To make it convenient for everyone to use, I implemented a TRL version based on the PAPO formulation, and it runs successfully.
Implementation Details
- Added `trl/trainer/papo_trainer.py`
- Added `trl/trainer/papo_config.py`
- Updated `__init__.py` to include `PAPOTrainer`
- Tests: `pytest tests/trainer/test_papo_trainer.py -v`
🧪 Example Usage
https://github.com/SolarWindRider/avr/blob/main/train_papo.py
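For convenience, here is a minimal, self-contained usage sketch under the assumption that `PAPOConfig`/`PAPOTrainer` mirror the `GRPOConfig`/`GRPOTrainer` API; the dataset, model name, and reward function are illustrative, and PAPO-specific options are omitted:
```python
from datasets import load_dataset

# Import path assumes the move to trl.experimental.papo requested above.
from trl.experimental.papo import PAPOConfig, PAPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward function standing in for a real one: favors shorter completions.
def reward_len(completions, **kwargs):
    return [-float(len(c)) for c in completions]

training_args = PAPOConfig(output_dir="Qwen2-0.5B-PAPO")
trainer = PAPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```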
I have tested my trainer (with PEFT and FSDP) on Ascend 910C and H20 (single node with 8 cards).
Checklist
pytest