Doc fix for a few optimizers (#12034)
* Update optimizer.py

* Update optimizer.py
eric-haibin-lin authored Aug 6, 2018
1 parent 7261d8c commit 70efd32
Showing 1 changed file with 32 additions and 2 deletions.
python/mxnet/optimizer.py
@@ -449,7 +449,7 @@ class SGD(Optimizer):
**lazy updates** are applied by::
for row in grad.indices:
-   rescaled_grad[row] = lr * rescale_grad * clip(grad[row], clip_gradient) + wd * weight[row]
+   rescaled_grad[row] = lr * (rescale_grad * clip(grad[row], clip_gradient) + wd * weight[row])
    state[row] = momentum[row] * state[row] + rescaled_grad[row]
    weight[row] = weight[row] - state[row]
@@ -462,7 +462,7 @@ class SGD(Optimizer):
Otherwise, **standard updates** are applied by::
-rescaled_grad = lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
+rescaled_grad = lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
state = momentum * state + rescaled_grad
weight = weight - state
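A minimal NumPy sketch of the standard update above, for illustration only (the function name and defaults are ours, not MXNet's); the lazy variant applies the same formula only to the rows in grad.indices:

    import numpy as np

    def sgd_momentum_step(weight, grad, state, lr, momentum=0.9,
                          rescale_grad=1.0, wd=0.0, clip_gradient=None):
        # Optional gradient clipping, as in clip(grad, clip_gradient) above.
        if clip_gradient is not None:
            grad = np.clip(grad, -clip_gradient, clip_gradient)
        # rescaled_grad = lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
        rescaled_grad = lr * (rescale_grad * grad + wd * weight)
        state = momentum * state + rescaled_grad
        weight = weight - state
        return weight, state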
@@ -616,6 +616,14 @@ class FTML(Optimizer):
*FTML - Follow the Moving Leader in Deep Learning*,
available at http://proceedings.mlr.press/v70/zheng17a/zheng17a.pdf.
+Denote time step by t. The optimizer updates the weight by::
+
+    rescaled_grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
+    v = beta2 * v + (1 - beta2) * square(rescaled_grad)
+    d_t = (1 - power(beta1, t)) / lr * (square_root(v / (1 - power(beta2, t))) + epsilon)
+    z = beta1 * z + (1 - beta1) * rescaled_grad - (d_t - beta1 * d_(t-1)) * weight
+    weight = - z / d_t
This optimizer accepts the following parameters in addition to those accepted
by :class:`.Optimizer`.
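For illustration, a NumPy sketch of the FTML update above (function name, argument defaults, and state handling are ours, not the MXNet API):

    import numpy as np

    def ftml_step(weight, grad, v, z, d_prev, t, lr, beta1=0.6, beta2=0.999,
                  epsilon=1e-8, rescale_grad=1.0, wd=0.0, clip_gradient=None):
        rescaled_grad = grad * rescale_grad + wd * weight
        if clip_gradient is not None:
            rescaled_grad = np.clip(rescaled_grad, -clip_gradient, clip_gradient)
        v = beta2 * v + (1 - beta2) * np.square(rescaled_grad)
        d_t = (1 - beta1 ** t) / lr * (np.sqrt(v / (1 - beta2 ** t)) + epsilon)
        z = beta1 * z + (1 - beta1) * rescaled_grad - (d_t - beta1 * d_prev) * weight
        weight = -z / d_t
        return weight, v, z, d_t  # d_t becomes d_(t-1) on the next step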
@@ -1080,6 +1088,13 @@ class AdaGrad(Optimizer):
Methods for Online Learning and Stochastic Optimization*, and available at
http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.
+This optimizer updates each weight by::
+
+    grad = clip(grad * rescale_grad, clip_gradient)
+    history += square(grad)
+    div = grad / sqrt(history + float_stable_eps)
+    weight += (div + weight * wd) * -lr
This optimizer accepts the following parameters in addition to those accepted
by :class:`.Optimizer`.
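A small NumPy sketch of the AdaGrad update above, for illustration only (names and defaults are ours, not MXNet's):

    import numpy as np

    def adagrad_step(weight, grad, history, lr, float_stable_eps=1e-7,
                     rescale_grad=1.0, wd=0.0, clip_gradient=None):
        grad = grad * rescale_grad
        if clip_gradient is not None:
            grad = np.clip(grad, -clip_gradient, clip_gradient)
        history = history + np.square(grad)               # accumulate squared gradients
        div = grad / np.sqrt(history + float_stable_eps)
        weight = weight + (div + weight * wd) * -lr
        return weight, history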
@@ -1207,6 +1222,14 @@ class AdaDelta(Optimizer):
This class implements AdaDelta, an optimizer described in *ADADELTA: An adaptive
learning rate method*, available at https://arxiv.org/abs/1212.5701.
+This optimizer updates each weight by::
+
+    grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
+    acc_grad = rho * acc_grad + (1. - rho) * grad * grad
+    delta = sqrt(acc_delta + epsilon) / sqrt(acc_grad + epsilon) * grad
+    acc_delta = rho * acc_delta + (1. - rho) * delta * delta
+    weight -= (delta + wd * weight)
This optimizer accepts the following parameters in addition to those accepted
by :class:`.Optimizer`.
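For illustration, a NumPy sketch of the AdaDelta update above (names and defaults are ours, not the MXNet API):

    import numpy as np

    def adadelta_step(weight, grad, acc_grad, acc_delta, rho=0.9, epsilon=1e-5,
                      rescale_grad=1.0, wd=0.0, clip_gradient=None):
        grad = grad * rescale_grad + wd * weight
        if clip_gradient is not None:
            grad = np.clip(grad, -clip_gradient, clip_gradient)
        acc_grad = rho * acc_grad + (1. - rho) * grad * grad            # running E[g^2]
        delta = np.sqrt(acc_delta + epsilon) / np.sqrt(acc_grad + epsilon) * grad
        acc_delta = rho * acc_delta + (1. - rho) * delta * delta        # running E[dx^2]
        weight = weight - (delta + wd * weight)
        return weight, acc_grad, acc_delta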
@@ -1333,6 +1356,13 @@ class Adamax(Optimizer):
It is a variant of Adam based on the infinity norm
available at http://arxiv.org/abs/1412.6980 Section 7.
+The optimizer updates the weight by::
+
+    grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
+    m = beta1 * m_t + (1 - beta1) * grad
+    u = maximum(beta2 * u, abs(grad))
+    weight -= lr / (1 - beta1**t) * m / u
This optimizer accepts the following parameters in addition to those accepted
by :class:`.Optimizer`.
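And a NumPy sketch of the Adamax update above, again for illustration only (names and defaults are ours, not the MXNet API):

    import numpy as np

    def adamax_step(weight, grad, m, u, t, lr, beta1=0.9, beta2=0.999,
                    rescale_grad=1.0, wd=0.0, clip_gradient=None):
        grad = grad * rescale_grad + wd * weight
        if clip_gradient is not None:
            grad = np.clip(grad, -clip_gradient, clip_gradient)
        m = beta1 * m + (1 - beta1) * grad              # first moment estimate
        u = np.maximum(beta2 * u, np.abs(grad))         # infinity-norm estimate
        weight = weight - lr / (1 - beta1 ** t) * m / u
        return weight, m, u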