Optimization Nuggets: Stochastic Polyak Step-size, Part 2
Published:
Fabian Pedregosa invited me to write a joint blog post on a convergence proof for the stochastic Polyak step size (SPS).
Published:
I wrote a blog post that was published in the ICLR 2023 blog post track. The post, titled Decay No More, explains the details of AdamW and its weight decay mechanism. Check it out here.
Published:
TL;DR: AdamW is often considered a method that decouples weight decay and learning rate. In this blog post, we show that this is not true for the specific way AdamW is implemented in PyTorch. We also show how to adapt the tuning strategy to fix this: when doubling the learning rate, the weight decay should be halved.
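To make the halving rule concrete, here is a minimal sketch that isolates only the weight decay part of the update. It assumes the decay is applied as a multiplication by `(1 - lr * weight_decay)` per step, which is how the PyTorch documentation describes AdamW, and it ignores the gradient-based Adam step entirely. The function name `decay_factor` and the chosen numbers are purely illustrative.

```python
def decay_factor(lr: float, weight_decay: float, steps: int) -> float:
    """Shrinkage of a weight after `steps` updates, gradient step ignored.

    Assumption: each step multiplies the weight by (1 - lr * weight_decay).
    """
    return (1.0 - lr * weight_decay) ** steps


# Baseline setting (illustrative values, not taken from the post).
base = decay_factor(lr=1e-3, weight_decay=1e-2, steps=1000)

# Doubling the learning rate alone also doubles the effective decay.
doubled_lr = decay_factor(lr=2e-3, weight_decay=1e-2, steps=1000)

# Halving the weight decay at the same time restores the baseline shrinkage.
compensated = decay_factor(lr=2e-3, weight_decay=5e-3, steps=1000)

print(base, doubled_lr, compensated)
```

In this simplified picture only the product `lr * weight_decay` matters for the shrinkage, which is why the two hyperparameters cannot be tuned independently.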
Published:
Making your research code open-source, tested, and documented is quite simple nowadays. This post gives an overview of the most important steps and collects useful resources, e.g. tutorials for Read the Docs, Sphinx (Gallery), and unit testing in Python.
Published:
When implementing optimization algorithms, we typically have to balance the following goals: