Fixing Weight Decay Regularization in Adam. (arXiv:1711.05101v2 [cs.LG] UPDATED)
Source: arXiv
L$_2$ regularization and weight decay regularization are equivalent for
standard stochastic gradient descent (when rescaled by the learning rate), but
as we demonstrate this is \emph{not} the case for adaptive gradient algorithms,
such as Adam. While common deep learning frameworks implement L$_2$ regularization
for these algorithms (often calling it "weight decay", which may be
misleading due to the inequivalence we expose), we propose a simple
modification to recover the original formulation of weight decay regularization
by decoupling the weight decay from the optimization steps taken w.r.t. the
loss function. We provide empirical evidence that our proposed modification (i)
decouples the optimal choice of weight decay factor from the setting of the
learning rate for both standard SGD and Adam, and (ii) substantially improves
Adam's generalization performance, allowing it to compete with SGD with
momentum on image classification datasets (on which it was previously typically
outperformed).
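
The decoupling described above can be illustrated with a short sketch. The NumPy code below is only an assumed rendering of the idea, not the authors' reference implementation; the function name `adam_step` and the hyperparameter defaults are chosen for the example.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=1e-2, decoupled=True):
    """One Adam update on parameters w given loss gradient g (sketch).

    decoupled=False: classic L2 regularization; weight_decay * w is folded
    into the gradient and thus rescaled by the adaptive denominator.
    decoupled=True : weight decay is applied directly to w, outside the
    gradient-based step, as the abstract proposes.
    """
    if not decoupled:
        g = g + weight_decay * w              # L2 term enters the moment estimates

    m = beta1 * m + (1 - beta1) * g           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)

    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * weight_decay * w         # decay decoupled from the loss gradient

    return w, m, v

# Toy usage: minimize 0.5 * ||w - target||^2 with decoupled weight decay.
target = np.array([1.0, -2.0, 3.0])
w = np.zeros_like(target)
m = np.zeros_like(target)
v = np.zeros_like(target)
for t in range(1, 101):
    g = w - target                            # gradient of the unregularized loss
    w, m, v = adam_step(w, g, m, v, t, decoupled=True)
```

In the non-decoupled branch the regularization term is divided by the same adaptive factor as the loss gradient, so parameters with large historical gradients are decayed less; in the decoupled branch every parameter is shrunk by the same factor regardless of its gradient history, which is the behavior of weight decay under plain SGD.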