Closed
Labels
feature request
Description
In Section 4 of the paper referenced for the Warmup-Stable-Decay (WSD) learning rate scheduler (https://arxiv.org/abs/2405.18392), the authors mention that 1-sqrt is the best cooldown function, and on page 20 they further demonstrate how it outperforms the other functions. So why was cosine chosen for the WSD cooldown? https://github.com/kozistr/pytorch_optimizer/blob/main/pytorch_optimizer/lr_scheduler/wsd.py#L33
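To make the request concrete, here is a minimal sketch of a WSD schedule with a selectable cooldown shape. It is not the library's implementation: the names `cooldown_factor`, `wsd_lr`, and the `kind` argument are hypothetical, and the 1-sqrt form (`1 - sqrt(progress)`, where `progress` runs from 0 to 1 over the cooldown) is my reading of the paper. The linear warmup is likewise an assumption.

```python
import math


def cooldown_factor(progress: float, kind: str = "1-sqrt") -> float:
    """Multiplier in [0, 1] for the cooldown phase (hypothetical helper).

    `progress` is the fraction of the cooldown completed, clamped to [0, 1].
    """
    progress = min(max(progress, 0.0), 1.0)
    if kind == "1-sqrt":
        # Assumed form of the paper's 1-sqrt cooldown: drops fast, then flattens.
        return 1.0 - math.sqrt(progress)
    if kind == "cosine":
        # Half-cosine decay from 1 to 0.
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    if kind == "linear":
        return 1.0 - progress
    raise ValueError(f"unknown cooldown kind: {kind!r}")


def wsd_lr(
    step: int,
    max_lr: float,
    min_lr: float,
    warmup_steps: int,
    stable_steps: int,
    cooldown_steps: int,
    kind: str = "1-sqrt",
) -> float:
    """Learning rate at `step` for a warmup-stable-decay schedule (sketch)."""
    if step < warmup_steps:
        # Linear warmup up to max_lr (an assumption, not taken from wsd.py).
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return max_lr
    progress = (step - warmup_steps - stable_steps) / max(1, cooldown_steps)
    return min_lr + (max_lr - min_lr) * cooldown_factor(progress, kind)
```

For example, `wsd_lr(step, 1e-3, 1e-5, 100, 800, 100, kind="cosine")` would keep a cosine cooldown, while `kind="1-sqrt"` switches to the shape from the paper. Exposing the shape as an option would leave existing configurations unchanged if cosine stays the default.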
kozistr