AI::MXNet::Optimizer - Common optimization algorithms with regularization.


Create an optimizer with the specified name.

name : str
    Name of the required optimizer. Should be the name
    of a subclass of Optimizer. Case-insensitive.

rescale_grad : float
    Rescaling factor on gradient. Normally should be 1/batch_size.

kwargs: dict
    Parameters for optimizer

opt : Optimizer
    The resulting optimizer.
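
The lookup described above can be pictured as a small registry keyed by lowercased class name. The following is a minimal sketch in plain Python, not the AI::MXNet Perl API; `register`, `create_optimizer`, and the toy `Optimizer` and `SGD` classes are hypothetical names used only for illustration.

```python
# Hypothetical registry sketch; none of these names come from AI::MXNet.
_REGISTRY = {}


def register(cls):
    """Record an optimizer class under its lowercased class name."""
    _REGISTRY[cls.__name__.lower()] = cls
    return cls


class Optimizer:
    def __init__(self, rescale_grad=1.0, **kwargs):
        self.rescale_grad = rescale_grad
        self.kwargs = kwargs  # remaining optimizer-specific parameters


@register
class SGD(Optimizer):
    pass


def create_optimizer(name, rescale_grad=1.0, **kwargs):
    """Look up `name` case-insensitively and instantiate the matching class."""
    try:
        cls = _REGISTRY[name.lower()]
    except KeyError:
        raise ValueError(f"Cannot find optimizer {name!r}") from None
    return cls(rescale_grad=rescale_grad, **kwargs)


# rescale_grad is typically 1/batch_size; a batch size of 32 is assumed here.
opt = create_optimizer("sgd", rescale_grad=1.0 / 32, momentum=0.9)
```

Because the key is lowercased on both registration and lookup, `"sgd"`, `"SGD"`, and `"SgD"` all resolve to the same class.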


Set an individual learning rate multiplier for parameters.

args_lr_mult : dict of string/int to float
    Sets the lr multiplier for each name/index to the given float.
    Setting the multiplier by index is supported for backward compatibility,
    but we recommend using name and symbol.


Set an individual weight decay multiplier for parameters.
By default, if param_idx2name is provided, the wd multiplier is 0
for every parameter whose name does not end with _weight.

args_wd_mult : dict of string/int to float
    Sets the wd multiplier for each name/index to the given float.
    Setting the multiplier by index is supported for backward compatibility,
    but we recommend using name and symbol.
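
The default rule above can be sketched as follows. This is an illustration in plain Python, not the library's code; `default_wd_mult` and `effective_wd` are hypothetical helper names, and the assumption that unlisted names keep a multiplier of 1 is mine.

```python
def default_wd_mult(param_idx2name):
    """Build default multipliers: 0.0 for names not ending in '_weight'.

    Names missing from the result are assumed to keep a multiplier of 1.0.
    """
    wd_mult = {}
    for name in param_idx2name.values():
        if not name.endswith("_weight"):
            wd_mult[name] = 0.0
    return wd_mult


def effective_wd(name, wd, wd_mult):
    """Weight decay actually applied to the parameter called `name`."""
    return wd * wd_mult.get(name, 1.0)


idx2name = {0: "fc1_weight", 1: "fc1_bias", 2: "bn_gamma"}
mults = default_wd_mult(idx2name)
# A user-supplied override, playing the role of args_wd_mult.
mults.update({"fc1_weight": 0.5})
```

With `wd = 0.01`, the bias and gamma parameters receive no decay, while `fc1_weight` is decayed with half the base coefficient.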


AI::MXNet::SGD - A very simple SGD optimizer with momentum and weight regularization.


learning_rate : float, optional
    learning_rate of SGD

momentum : float, optional
    momentum value

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

param_idx2name : hash ref of int to string, optional
    Maps parameter indices to names; used to give special weight decay
    treatment to parameters whose names end with bias, gamma, or beta.
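
How these parameters interact in one update can be sketched for a single scalar weight. This is plain Python for illustration, not the AI::MXNet implementation; the function name `sgd_step` and the order of operations (rescale, then clip, then add weight decay) are assumptions.

```python
def sgd_step(weight, grad, mom, lr=0.01, momentum=0.0, wd=0.0,
             rescale_grad=1.0, clip_gradient=None):
    """One SGD-with-momentum update for a scalar weight (illustrative)."""
    g = grad * rescale_grad                      # e.g. 1/batch_size
    if clip_gradient is not None:                # clamp to [-clip, clip]
        g = max(-clip_gradient, min(clip_gradient, g))
    g += wd * weight                             # L2 regularization term
    mom = momentum * mom - lr * g                # velocity update
    return weight + mom, mom


# Two consecutive steps with the same raw gradient.
w1, m1 = sgd_step(1.0, 4.0, 0.0, lr=0.1, momentum=0.9,
                  rescale_grad=0.5, clip_gradient=1.0)
w2, m2 = sgd_step(w1, 4.0, m1, lr=0.1, momentum=0.9,
                  rescale_grad=0.5, clip_gradient=1.0)
```

In the first step the rescaled gradient 2.0 is clipped to 1.0, so the weight moves by `-lr * 1.0`; in the second step the momentum term adds 0.9 of the previous velocity.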


AI::MXNet::DCASGD - DCASGD optimizer with momentum and weight regularization.


Implements the paper "Asynchronous Stochastic Gradient Descent with
Delay Compensation for Distributed Deep Learning".

learning_rate : float, optional
    Learning rate.

momentum : float, optional
    momentum value

lamda : float, optional
    Scale of the delay compensation (DC) term.

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

param_idx2name : hash ref of int to string, optional
    Maps parameter indices to names; used to give special weight decay
    treatment to parameters whose names end with bias, gamma, or beta.
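
The role of `lamda` can be seen in a scalar sketch of the delay-compensated update from the paper named above. This is plain Python for illustration, not the AI::MXNet code; `dcasgd_step`, the `prev_weight` argument (the weight value the stale gradient was computed at), and the placeholder default values are assumptions.

```python
def dcasgd_step(weight, grad, mom, prev_weight, lr=0.01, momentum=0.0,
                lamda=0.04, wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """One delay-compensated SGD update for a scalar weight (illustrative).

    Default values are placeholders, not the library's defaults.
    """
    g = grad * rescale_grad
    if clip_gradient is not None:
        g = max(-clip_gradient, min(clip_gradient, g))
    # Delay compensation: correct the stale gradient by
    # lamda * g * g * (current weight - weight the gradient was computed at).
    compensated = g + wd * weight + lamda * g * g * (weight - prev_weight)
    mom = momentum * mom - lr * compensated
    return weight + mom, mom


# Gradient computed at weight 0.5, applied when the weight has moved to 1.0.
w_dc, _ = dcasgd_step(1.0, 2.0, 0.0, 0.5, lr=0.1, lamda=0.04)
# No delay: prev_weight equals the current weight, so this is plain SGD.
w_plain, _ = dcasgd_step(1.0, 2.0, 0.0, 1.0, lr=0.1, lamda=0.04)
```

When the gradient is not stale the compensation term vanishes and the step matches plain SGD.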


AI::MXNet::NAG - SGD with Nesterov weight handling.


It is implemented according to


AI::MXNet::SLGD - Stochastic Langevin Dynamics Updater to sample from a distribution.


learning_rate : float, optional
    Learning rate.

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

param_idx2name : hash ref of int to string, optional
    Maps parameter indices to names; used to give special weight decay
    treatment to parameters whose names end with bias, gamma, or beta.
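
A stochastic gradient Langevin dynamics step is a half-size gradient step plus Gaussian noise whose variance equals the learning rate. The scalar sketch below is plain Python for illustration, not the AI::MXNet code; `sgld_step` and its `noise` argument (which lets a caller inject a fixed noise value) are hypothetical.

```python
import math
import random


def sgld_step(weight, grad, lr=0.01, wd=0.0, rescale_grad=1.0,
              clip_gradient=None, noise=None):
    """One Langevin dynamics update for a scalar weight (illustrative)."""
    g = grad * rescale_grad
    if clip_gradient is not None:
        g = max(-clip_gradient, min(clip_gradient, g))
    if noise is None:
        # Gaussian noise with standard deviation sqrt(lr).
        noise = random.gauss(0.0, math.sqrt(lr))
    return weight - 0.5 * lr * (g + wd * weight) + noise


# With the noise fixed at zero only the deterministic drift remains.
w_drift = sgld_step(1.0, 2.0, lr=0.1, noise=0.0)
# Default behaviour: a fresh random sample on each call.
w_sample = sgld_step(1.0, 2.0, lr=0.1)
```

Repeated calls without `noise` produce samples rather than a converging sequence, which is the point of using the updater for sampling.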


AI::MXNet::Adam - Adam optimizer as described in [King2014]_.


.. [King2014] Diederik Kingma, Jimmy Ba,
   *Adam: A Method for Stochastic Optimization*,

the code in this class was adapted from

learning_rate : float, optional
    Step size.
    Default value is set to 0.001.
beta1 : float, optional
    Exponential decay rate for the first moment estimates.
    Default value is set to 0.9.
beta2 : float, optional
    Exponential decay rate for the second moment estimates.
    Default value is set to 0.999.
epsilon : float, optional
    Small constant added for numerical stability.
    Default value is set to 1e-8.
decay_factor : float, optional
    Default value is set to 1 - 1e-8.

wd : float, optional
    L2 regularization coefficient added to all the weights
rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]
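
The roles of `beta1`, `beta2`, and `epsilon` are easiest to see in the update from the Adam paper itself. The scalar sketch below is plain Python for illustration, not the AI::MXNet code; `adam_step` is a hypothetical name, `decay_factor` is deliberately left out, and the placement of rescaling, clipping, and weight decay is an assumption.

```python
import math


def adam_step(weight, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              epsilon=1e-8, wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """One Adam update for a scalar weight at step t >= 1 (illustrative)."""
    g = grad * rescale_grad
    if clip_gradient is not None:
        g = max(-clip_gradient, min(clip_gradient, g))
    g += wd * weight
    m = beta1 * m + (1.0 - beta1) * g          # first moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g      # second moment estimate
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    weight -= lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return weight, m, v


w_adam, m_adam, v_adam = adam_step(1.0, 2.0, 0.0, 0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly `lr` regardless of the gradient's magnitude.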


AI::MXNet::AdaGrad - AdaGrad optimizer of Duchi et al., 2011


This code follows the version in Eq(5)
by Matthew D. Zeiler, 2012. AdaGrad can help the network converge faster
in some cases.

learning_rate : float, optional
    Step size.
    Default value is set to 0.05.

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

eps : float, optional
    A small constant that keeps the update numerically stable.
    Default value is set to 1e-7.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]
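
The AdaGrad rule divides each step by the root of the accumulated squared gradients. The scalar sketch below is plain Python for illustration, not the AI::MXNet code; `adagrad_step` is a hypothetical name, and placing `eps` inside the square root and weight decay outside the scaled gradient are assumptions.

```python
import math


def adagrad_step(weight, grad, history, lr=0.05, wd=0.0, eps=1e-7,
                 rescale_grad=1.0, clip_gradient=None):
    """One AdaGrad update for a scalar weight (illustrative)."""
    g = grad * rescale_grad
    if clip_gradient is not None:
        g = max(-clip_gradient, min(clip_gradient, g))
    history += g * g                              # running sum of g^2
    weight -= lr * (g / math.sqrt(history + eps) + wd * weight)
    return weight, history


w_a1, h_a1 = adagrad_step(1.0, 3.0, 0.0)
w_a2, h_a2 = adagrad_step(w_a1, 3.0, h_a1)
```

The second step is smaller than the first even though the gradient is unchanged, because the accumulated history has grown from 9 to 18.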


AI::MXNet::RMSProp - RMSProp optimizer of Tieleman & Hinton, 2012.


For centered=False, the code follows the version by
Tieleman & Hinton, 2012.

For centered=True, the code follows the version in Eq(38) - Eq(45) by Alex Graves, 2013.

learning_rate : float, optional
    Step size.
    Default value is set to 0.001.
gamma1 : float, optional
    decay factor of moving average for gradient^2.
    Default value is set to 0.9.
gamma2 : float, optional
    "momentum" factor.
    Default value is set to 0.9.
    Only used if centered=True
epsilon : float, optional
    Default value is set to 1e-8.
centered : bool, optional
    If true, use Graves's version of RMSProp;
    otherwise use Tieleman & Hinton's version.
wd : float, optional
    L2 regularization coefficient added to all the weights
rescale_grad : float, optional
    rescaling factor of gradient.
clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]
clip_weights : float, optional
    clip weights in range [-clip_weights, clip_weights]
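
The difference between the two variants can be sketched for a scalar weight. This is plain Python for illustration, not the AI::MXNet code; `rmsprop_step`, the three-element state tuple, and the placement of `epsilon` inside the square root are assumptions.

```python
import math


def rmsprop_step(weight, grad, state, lr=0.001, gamma1=0.9, gamma2=0.9,
                 epsilon=1e-8, centered=False, wd=0.0, rescale_grad=1.0,
                 clip_gradient=None, clip_weights=None):
    """One RMSProp update for a scalar weight (illustrative).

    state is (n, g_avg, delta): moving averages of g^2 and g, and the
    previous update. g_avg and delta are only used when centered is true.
    """
    n, g_avg, delta = state
    g = grad * rescale_grad
    if clip_gradient is not None:
        g = max(-clip_gradient, min(clip_gradient, g))
    g += wd * weight
    n = gamma1 * n + (1.0 - gamma1) * g * g
    if centered:
        g_avg = gamma1 * g_avg + (1.0 - gamma1) * g
        # Normalize by an estimate of the gradient variance, with momentum.
        delta = gamma2 * delta - lr * g / math.sqrt(n - g_avg * g_avg + epsilon)
        weight += delta
    else:
        weight -= lr * g / math.sqrt(n + epsilon)
    if clip_weights is not None:
        weight = max(-clip_weights, min(clip_weights, weight))
    return weight, (n, g_avg, delta)


zero = (0.0, 0.0, 0.0)
w_plain, s_plain = rmsprop_step(1.0, 2.0, zero)
w_cent, s_cent = rmsprop_step(1.0, 2.0, zero, centered=True)
w_clip, _ = rmsprop_step(1.0, -2.0, zero, clip_weights=1.0)
```

The centered variant subtracts the squared mean gradient before taking the root, so for the same input its first step is larger than the uncentered one.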


AI::MXNet::AdaDelta - AdaDelta optimizer.


rho : float
    Decay rate for both squared gradients and delta x
epsilon : float
    The constant as described in the thesis
wd : float
    L2 regularization coefficient added to all the weights
rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.
clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]
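
How `rho` and `epsilon` enter the AdaDelta update can be sketched for a scalar weight. This is plain Python for illustration, not the AI::MXNet code; `adadelta_step`, the placeholder default values, and the way weight decay is applied are assumptions.

```python
import math


def adadelta_step(weight, grad, acc_g, acc_delta, rho=0.9, epsilon=1e-5,
                  wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """One AdaDelta update for a scalar weight (illustrative).

    Default values are placeholders, not the library's defaults.
    """
    g = grad * rescale_grad
    if clip_gradient is not None:
        g = max(-clip_gradient, min(clip_gradient, g))
    acc_g = rho * acc_g + (1.0 - rho) * g * g            # E[g^2]
    # Step size is the ratio of RMS(previous deltas) to RMS(gradients).
    delta = math.sqrt(acc_delta + epsilon) / math.sqrt(acc_g + epsilon) * g
    acc_delta = rho * acc_delta + (1.0 - rho) * delta * delta  # E[dx^2]
    weight -= delta + wd * weight
    return weight, acc_g, acc_delta


w_ad, g_ad, d_ad = adadelta_step(1.0, 1.0, 0.0, 0.0)
```

No learning rate appears in the rule: the step size is derived entirely from the two running averages, with `epsilon` bootstrapping the first step.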




Reference: Ad Click Prediction: a View from the Trenches

lamda1 : float, optional
    L1 regularization coefficient.

learning_rate : float, optional
    The initial learning rate.

beta : float, optional
    Per-coordinate learning rate correlation parameter.
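
The per-coordinate FTRL-Proximal update from the referenced paper can be sketched for a scalar weight. This is plain Python for illustration, not the AI::MXNet code; `ftrl_step`, the use of `wd` as the L2 term, and the placeholder default values are assumptions.

```python
import math


def ftrl_step(weight, grad, z, n, lr=0.1, lamda1=0.01, beta=1.0, wd=0.0,
              rescale_grad=1.0, clip_gradient=None):
    """One FTRL-Proximal update for a scalar weight (illustrative).

    Default values are placeholders, not the library's defaults.
    """
    g = grad * rescale_grad
    if clip_gradient is not None:
        g = max(-clip_gradient, min(clip_gradient, g))
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / lr
    z += g - sigma * weight          # accumulated adjusted gradient
    n += g * g                       # accumulated squared gradient
    if abs(z) <= lamda1:
        weight = 0.0                 # L1 term zeroes small coordinates
    else:
        sign = 1.0 if z > 0 else -1.0
        weight = (sign * lamda1 - z) / ((beta + math.sqrt(n)) / lr + wd)
    return weight, z, n


w_f, z_f, n_f = ftrl_step(0.0, 1.0, 0.0, 0.0)
w_sparse, _, _ = ftrl_step(0.0, 1.0, 0.0, 0.0, lamda1=2.0)
```

A large `lamda1` keeps the weight at exactly zero until the accumulated gradient exceeds it, which is how the method produces sparse models.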