Why not 1-sqrt for WSD decay? #382

@slashedstar

In Section 4 of the paper referenced for the Warmup-Stable-Decay (WSD) learning rate scheduler (https://arxiv.org/abs/2405.18392), the authors report that 1-sqrt is the best cooldown function, and on page 20 they further demonstrate how it outperforms other functions. So why was cosine chosen for the WSD cooldown? https://github.com/kozistr/pytorch_optimizer/blob/main/pytorch_optimizer/lr_scheduler/wsd.py#L33
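
To make the comparison concrete, here is a minimal sketch of a WSD schedule with a selectable cooldown shape. It is not the code in `wsd.py`: the function name `wsd_lr`, its parameters, the linear warmup, and the interpolation toward `min_lr` are all assumptions made for illustration. The 1-sqrt factor follows my reading of the paper, where the learning rate is multiplied by `1 - sqrt(progress)` and `progress` is the fraction of the cooldown phase completed.

```python
import math


def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps,
           min_lr=0.0, cooldown="1-sqrt"):
    """Hypothetical WSD schedule; returns the learning rate at `step` (0-indexed)."""
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")

    # Warmup phase: linear ramp up to max_lr (assumed shape).
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps

    # Stable phase: constant learning rate.
    if step < warmup_steps + stable_steps:
        return max_lr

    # Cooldown phase: progress runs from 0 to 1 and is clamped afterwards.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)

    if cooldown == "1-sqrt":
        factor = 1.0 - math.sqrt(progress)
    elif cooldown == "cosine":
        factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    elif cooldown == "linear":
        factor = 1.0 - progress
    else:
        raise ValueError(f"unknown cooldown: {cooldown!r}")

    # Interpolate between min_lr and max_lr (an assumption, not taken from the paper).
    return min_lr + (max_lr - min_lr) * factor


if __name__ == "__main__":
    # Compare the shapes halfway through a 10-step cooldown.
    for name in ("1-sqrt", "cosine", "linear"):
        print(name, wsd_lr(25, 1.0, 10, 10, 10, cooldown=name))
```

Under these assumptions, 1-sqrt drops the learning rate much faster early in the cooldown than cosine does, which is the difference in shape the question is about.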
