
Conversation

@kaixuanliu
Contributor

When we run the test case `pytest -rA tests/test_sft_trainer.py::TestSFTTrainer::test_train_vlm_gemma_3n`, it fails on both CUDA and Intel XPU. Further investigation shows there are two reasons:

  1. The audio tower does not update its weights during fine-tuning.
  2. With the bf16 dtype, small changes to the model weights are rounded off.

This PR fixes this bug.
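
To make the second point concrete, here is a minimal sketch (not from this PR) of how a small weight update can vanish under bf16 rounding, using only standard PyTorch:

```python
import torch

# bf16 keeps only 8 significand bits, so its machine epsilon is 2**-7.
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125

# A tiny update (e.g. lr * grad ~ 1e-4) applied to a weight of magnitude 1.0
# rounds back to the original value, so a before/after comparison sees no change.
w = torch.tensor(1.0, dtype=torch.bfloat16)
print(w + 1e-4)  # tensor(1., dtype=torch.bfloat16) -- the update is lost
```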

@kaixuanliu kaixuanliu marked this pull request as draft October 15, 2025 06:56
@kaixuanliu kaixuanliu marked this pull request as ready for review October 15, 2025 07:23
@yao-matrix
Contributor

@kashif, please help review, thanks very much.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@albertvillanova albertvillanova left a comment

Thanks for the catch and the fix.

I confirm that this PR fixes the test:

PASSED tests/test_sft_trainer.py::TestSFTTrainer::test_train_vlm_gemma_3n

Maybe we could add the reason why the vision/audio towers do not update, e.g. a note that they are frozen during training?
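
As a quick way to verify which towers are actually trainable, something like the following generic PyTorch sketch works (not code from this PR; module prefixes such as `audio_tower` vary by model):

```python
from collections import defaultdict

import torch

def trainable_params_by_module(model: torch.nn.Module) -> dict:
    """Map each top-level sub-module to (trainable, total) parameter counts."""
    stats = defaultdict(lambda: [0, 0])
    for name, param in model.named_parameters():
        prefix = name.split(".")[0]  # e.g. "vision_tower", "audio_tower"
        stats[prefix][1] += param.numel()
        if param.requires_grad:
            stats[prefix][0] += param.numel()
    return {prefix: tuple(counts) for prefix, counts in stats.items()}

# Usage (model loading elided):
# for module, (trainable, total) in trainable_params_by_module(model).items():
#     print(f"{module}: {trainable:,} / {total:,} trainable")
```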

Member

@qgallouedec qgallouedec left a comment

I'd prefer keeping bf16, as it's usually the precision used in practice. Instead, we can simply increase the learning rate so the weight updates are large enough to survive bf16 rounding.

```diff
     per_device_train_batch_size=1,
     gradient_checkpointing=True,
-    model_init_kwargs={"dtype": "bfloat16"},
+    model_init_kwargs={"dtype": "float16"},
```
Member

Suggested change:

```diff
-    model_init_kwargs={"dtype": "float16"},
+    model_init_kwargs={"dtype": "bfloat16"},
```

```python
# Initialize the trainer
training_args = SFTConfig(
    output_dir=self.tmp_dir,
    max_length=None,
```
Member

Suggested change:

```diff
-    max_length=None,
+    learning_rate=0.1,  # increase lr to ensure updates are not lost due to bf16 rounding
+    max_length=None,
```
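
For context on why the higher learning rate helps, here is a sketch of the before/after check such a test typically performs (an assumed pattern, not the exact TRL test code; the `audio_tower` prefix is illustrative):

```python
import torch

def assert_weights_updated(model, previous, skip_prefixes=("model.audio_tower",)):
    """Assert parameters changed after training, skipping known-frozen/unused towers.

    `previous` is a snapshot taken before trainer.train():
        previous = {n: p.clone() for n, p in model.named_parameters()}
    With bf16 weights, updates smaller than the bf16 spacing round away and the
    assertion fails spuriously; a larger lr keeps each update representable.
    """
    for name, param in model.named_parameters():
        if name.startswith(skip_prefixes):
            continue  # e.g. a tower that receives no gradient on this data
        assert not torch.equal(param, previous[name]), f"{name} did not update"
```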

@qgallouedec qgallouedec changed the title fix CI issue for vlm_gemma_3n model Fix CI issue for vlm_gemma_3n model Oct 28, 2025
@qgallouedec qgallouedec merged commit a9d33d0 into huggingface:main Oct 28, 2025
8 of 10 checks passed