NAME

AI::MXNet::Optimizer - Common Optimization algorithms with regularizations.

DESCRIPTION

Common Optimization algorithms with regularizations.

create_optimizer

Create an optimizer with the specified name.

Parameters
----------
name: str
    Name of required optimizer. Should be the name
    of a subclass of Optimizer. Case insensitive.

rescale_grad : float
    Rescaling factor on gradient. Normally should be 1/batch_size.

kwargs: dict
    Parameters for optimizer

Returns
-------
opt : Optimizer
    The resulting optimizer.
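
The case-insensitive lookup described above can be pictured with the following minimal sketch. It is written in Python for brevity and is not the AI::MXNet implementation: the registry, the `register` decorator, and the error message are all hypothetical and only illustrate how a name might be resolved to an Optimizer subclass.

```python
# Hypothetical sketch of a case-insensitive optimizer registry.
# None of this is AI::MXNet code; it only illustrates the lookup by name.

class Optimizer:
    registry = {}  # lowercased class name -> class

    def __init__(self, rescale_grad=1.0, **kwargs):
        self.rescale_grad = rescale_grad
        self.kwargs = kwargs

    @classmethod
    def register(cls, klass):
        # Store the subclass under its lowercased name.
        cls.registry[klass.__name__.lower()] = klass
        return klass

    @classmethod
    def create_optimizer(cls, name, rescale_grad=1.0, **kwargs):
        key = name.lower()
        if key not in cls.registry:
            raise ValueError(f"Cannot find optimizer {name}")
        return cls.registry[key](rescale_grad=rescale_grad, **kwargs)


@Optimizer.register
class SGD(Optimizer):
    pass


batch_size = 32
opt = Optimizer.create_optimizer("sgd", rescale_grad=1.0 / batch_size, momentum=0.9)
```

Here rescale_grad is set to 1/batch_size, as the parameter description recommends, and the remaining keyword arguments are passed through to the optimizer.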

set_lr_mult

Set individual learning rate multipliers for parameters.

Parameters
----------
args_lr_mult : dict of string/int to float
    Set the lr multiplier for name/index to float.
    Setting the multiplier by index is supported for backward compatibility,
    but we recommend using name and symbol.

set_wd_mult

Set individual weight decay multipliers for parameters.
By default the wd multiplier is 0 for all params whose name doesn't
end with _weight, if param_idx2name is provided.

Parameters
----------
args_wd_mult : dict of string/int to float
    Set the wd multiplier for name/index to float.
    Setting the multiplier by index is supported for backward compatibility,
    but we recommend using name and symbol.
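
The default rule stated above (a multiplier of 0 for names that do not end with _weight) can be sketched as follows. This is an illustration in Python, not the AI::MXNet code; the helper name `default_wd_mult` is hypothetical, and treating names missing from the result as having multiplier 1 is an assumption.

```python
def default_wd_mult(param_idx2name, args_wd_mult=None):
    """Return a name -> weight decay multiplier mapping (illustrative only)."""
    wd_mult = {}
    for name in param_idx2name.values():
        # Parameters whose names do not end with "_weight" get no weight decay.
        if not name.endswith("_weight"):
            wd_mult[name] = 0.0
    # Explicit user settings override the defaults.
    wd_mult.update(args_wd_mult or {})
    return wd_mult


idx2name = {0: "fc1_weight", 1: "fc1_bias", 2: "bn1_gamma"}
mults = default_wd_mult(idx2name, {"bn1_gamma": 0.5})
```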

NAME

AI::MXNet::SGD - A very simple SGD optimizer with momentum and weight regularization.

DESCRIPTION

A very simple SGD optimizer with momentum and weight regularization.

Parameters
----------
learning_rate : float, optional
    learning_rate of SGD

momentum : float, optional
    momentum value

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

param_idx2name : hash ref of int to string, optional
    Maps parameter index to name. Used to give special weight decay
    treatment to parameters whose names end with bias, gamma, or beta.

multi_precision: bool, optional
    Flag to control the internal precision of the optimizer.
    False results in using the same precision as the weights (default),
    True makes internal 32-bit copy of the weights and applies gradients
    in 32-bit precision even if actual weights used in the model have lower precision.
    Turning this on can improve convergence and accuracy when training with float16.
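
A minimal scalar sketch of one SGD step with momentum, showing where rescale_grad, clip_gradient and wd enter the update. It is written in Python for illustration and is not the AI::MXNet implementation; the function name `sgd_step` and the default values in its signature are assumptions, and multi_precision is not modelled.

```python
def sgd_step(weight, grad, mom, lr=0.01, momentum=0.0, wd=0.0,
             rescale_grad=1.0, clip_gradient=None):
    """One SGD update on a single scalar weight; returns (weight, mom)."""
    # Rescale first (typically by 1/batch_size), then clip.
    grad = grad * rescale_grad
    if clip_gradient is not None:
        grad = max(-clip_gradient, min(clip_gradient, grad))
    # Weight decay acts as an L2 penalty added to the gradient.
    mom = momentum * mom - lr * (grad + wd * weight)
    return weight + mom, mom


w, m = sgd_step(1.0, 4.0, 0.0, lr=0.1, momentum=0.9,
                rescale_grad=0.5, clip_gradient=1.0)
w2, m2 = sgd_step(w, 4.0, m, lr=0.1, momentum=0.9,
                  rescale_grad=0.5, clip_gradient=1.0)
```

In the first call the gradient 4.0 is rescaled to 2.0 and then clipped to 1.0, so the weight moves by -0.1; the second call adds the decayed momentum from the first.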

NAME

AI::MXNet::DCASGD - DCASGD optimizer with momentum and weight regularization.

DESCRIPTION

DCASGD optimizer with momentum and weight regularization.

Implements the paper "Asynchronous Stochastic Gradient Descent with
Delay Compensation for Distributed Deep Learning".

Parameters
----------
learning_rate : float, optional
    learning_rate of SGD

momentum : float, optional
    momentum value

lamda : float, optional
    scale of the delay compensation (DC) term

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

param_idx2name : hash ref of int to string, optional
    Maps parameter index to name. Used to give special weight decay
    treatment to parameters whose names end with bias, gamma, or beta.
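
The following scalar sketch shows one plausible form of the delay-compensated update: the gradient is corrected by lamda * grad^2 * (weight - previous_weight), where previous_weight is the weight seen at the last update. It is an illustration in Python under that assumption, not a copy of the AI::MXNet code; the function name `dcasgd_step` is hypothetical.

```python
def dcasgd_step(weight, grad, mom, prev_weight, lr, momentum, lamda,
                wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """One delay-compensated SGD step; returns (weight, mom, prev_weight)."""
    grad = grad * rescale_grad
    if clip_gradient is not None:
        grad = max(-clip_gradient, min(clip_gradient, grad))
    # Delay compensation: scale grad^2 by how far the weight has drifted.
    compensation = lamda * grad * grad * (weight - prev_weight)
    mom = momentum * mom - lr * (grad + wd * weight + compensation)
    # Remember the weight this update was computed against.
    prev_weight = weight
    return weight + mom, mom, prev_weight


w, m, prev = dcasgd_step(1.0, 2.0, 0.0, 0.5, lr=0.1, momentum=0.0, lamda=0.5)
```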

NAME

AI::MXNet::NAG - SGD with Nesterov weight handling.

DESCRIPTION

It is implemented according to
https://github.com/torch/optim/blob/master/sgd.lua
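
A scalar sketch of a Nesterov-style step in the spirit of the referenced sgd.lua (with zero dampening): the momentum buffer is updated first, then the step uses the gradient plus the look-ahead term momentum * mom. This is a Python illustration, not the AI::MXNet code; the function name `nag_step` is hypothetical.

```python
def nag_step(weight, grad, mom, lr, momentum, wd=0.0):
    """One Nesterov momentum step on a scalar weight; returns (weight, mom)."""
    grad = grad + wd * weight          # L2 penalty folded into the gradient
    mom = momentum * mom + grad        # update the momentum buffer
    step = grad + momentum * mom       # Nesterov look-ahead direction
    return weight - lr * step, mom


w, m = nag_step(1.0, 1.0, 0.0, lr=0.1, momentum=0.9)
```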

NAME

AI::MXNet::SLGD - Stochastic Langevin Dynamics Updater to sample from a distribution.

DESCRIPTION

Stochastic Langevin Dynamics Updater to sample from a distribution.

Parameters
----------
learning_rate : float, optional
    learning_rate of SGD

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]

param_idx2name : hash ref of int to string, optional
    Maps parameter index to name. Used to give special weight decay
    treatment to parameters whose names end with bias, gamma, or beta.
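
A scalar sketch of a Langevin dynamics step: half a gradient step plus Gaussian noise whose variance equals the learning rate, so repeated updates sample from a distribution rather than converge to a point. This is a Python illustration under that textbook form, not the AI::MXNet code; the function name `sgld_step` and the `rng` argument are hypothetical.

```python
import math
import random


def sgld_step(weight, grad, lr, wd=0.0, rng=random):
    """One Langevin dynamics step on a scalar weight."""
    # Injected noise has standard deviation sqrt(lr).
    noise = rng.gauss(0.0, math.sqrt(lr))
    return weight - 0.5 * lr * (grad + wd * weight) + noise


# Seeding the generator makes the sampled trajectory reproducible.
a = sgld_step(1.0, 2.0, 0.1, rng=random.Random(0))
b = sgld_step(1.0, 2.0, 0.1, rng=random.Random(0))
```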

NAME

AI::MXNet::Adam - Adam optimizer as described in [King2014].

DESCRIPTION

Adam optimizer as described in [King2014].

[King2014] Diederik Kingma, Jimmy Ba,
   *Adam: A Method for Stochastic Optimization*,
   http://arxiv.org/abs/1412.6980

The code in this class was adapted from
https://github.com/mila-udem/blocks/blob/master/blocks/algorithms/__init__.py#L765

Parameters
----------
learning_rate : float, optional
    Step size.
    Default value is set to 0.001.
beta1 : float, optional
    Exponential decay rate for the first moment estimates.
    Default value is set to 0.9.
beta2 : float, optional
    Exponential decay rate for the second moment estimates.
    Default value is set to 0.999.
epsilon : float, optional
    Default value is set to 1e-8.
decay_factor : float, optional
    Default value is set to 1 - 1e-8.

wd : float, optional
    L2 regularization coefficient added to all the weights
rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]
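
A scalar sketch of the Adam update from the cited paper, using the default values listed above. It is a Python illustration, not the AI::MXNet code: the function name `adam_step` is hypothetical, and decay_factor, wd, rescale_grad and clip_gradient are left out for brevity.

```python
import math


def adam_step(weight, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              epsilon=1e-8):
    """One Adam step at (1-based) iteration t; returns (weight, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    weight = weight - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return weight, m, v


w, m, v = adam_step(1.0, 2.0, 0.0, 0.0, t=1)
```

On the first step the bias-corrected moments equal the gradient and its square, so the weight moves by almost exactly learning_rate regardless of the gradient's magnitude.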

NAME

AI::MXNet::AdaGrad - AdaGrad optimizer of Duchi et al., 2011

DESCRIPTION

AdaGrad optimizer of Duchi et al., 2011.

This code follows the version in http://arxiv.org/pdf/1212.5701v1.pdf Eq(5)
by Matthew D. Zeiler, 2012. AdaGrad can help the network converge faster
in some cases.

Parameters
----------
learning_rate : float, optional
    Step size.
    Default value is set to 0.05.

wd : float, optional
    L2 regularization coefficient added to all the weights

rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.

eps: float, optional
    A small float that keeps the update numerically stable.
    Default value is set to 1e-7.

clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]
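
A scalar sketch of an AdaGrad step: squared gradients are accumulated, and each step is divided by the square root of that history. It is a Python illustration using the defaults listed above, not the AI::MXNet code; the function name `adagrad_step` is hypothetical, and the exact placement of eps and wd is an assumption.

```python
import math


def adagrad_step(weight, grad, history, lr=0.05, eps=1e-7, wd=0.0,
                 rescale_grad=1.0):
    """One AdaGrad step on a scalar weight; returns (weight, history)."""
    grad = grad * rescale_grad
    history = history + grad * grad             # running sum of squared grads
    step = grad / math.sqrt(history + eps) + wd * weight
    return weight - lr * step, history


w1, h1 = adagrad_step(1.0, 3.0, 0.0)
w2, h2 = adagrad_step(w1, 4.0, h1)
```

Because the history only grows, the effective step size for a parameter shrinks as it accumulates gradient.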

NAME

AI::MXNet::RMSProp - RMSProp optimizer of Tieleman & Hinton, 2012.

DESCRIPTION

RMSProp optimizer of Tieleman & Hinton, 2012.

For centered=False, the code follows the version in
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf by
Tieleman & Hinton, 2012

For centered=True, the code follows the version in
http://arxiv.org/pdf/1308.0850v5.pdf Eq(38) - Eq(45) by Alex Graves, 2013.

Parameters
----------
learning_rate : float, optional
    Step size.
    Default value is set to 0.001.
gamma1: float, optional
    decay factor of moving average for gradient^2.
    Default value is set to 0.9.
gamma2: float, optional
    "momentum" factor.
    Default value is set to 0.9.
    Only used if centered=True.
epsilon : float, optional
    Default value is set to 1e-8.
centered : bool, optional
    Use Graves's (True) or Tieleman & Hinton's (False) version of RMSProp.
wd : float, optional
    L2 regularization coefficient added to all the weights
rescale_grad : float, optional
    rescaling factor of gradient.
clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]
clip_weights : float, optional
    clip weights in range [-clip_weights, clip_weights]
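
The two variants can be contrasted with the scalar sketch below, using the defaults listed above. It is a Python illustration, not the AI::MXNet code: the function name `rmsprop_step` and the state dictionary are hypothetical, the placement of epsilon inside the square root is an assumption, and wd, rescale_grad and clip_gradient are left out for brevity.

```python
import math


def rmsprop_step(weight, grad, state, lr=0.001, gamma1=0.9, gamma2=0.9,
                 epsilon=1e-8, centered=False, clip_weights=None):
    """One RMSProp step on a scalar weight; `state` is updated in place."""
    # Moving average of gradient^2 (used by both variants).
    n = (1 - gamma1) * grad * grad + gamma1 * state.get("n", 0.0)
    state["n"] = n
    if not centered:
        # Tieleman & Hinton: divide by the RMS of recent gradients.
        weight = weight - lr * grad / math.sqrt(n + epsilon)
    else:
        # Graves: subtract the squared mean gradient and add a momentum term.
        g = (1 - gamma1) * grad + gamma1 * state.get("g", 0.0)
        delta = (gamma2 * state.get("delta", 0.0)
                 - lr * grad / math.sqrt(n - g * g + epsilon))
        state["g"], state["delta"] = g, delta
        weight = weight + delta
    if clip_weights is not None:
        weight = max(-clip_weights, min(clip_weights, weight))
    return weight


plain = rmsprop_step(1.0, 2.0, {})
graves = rmsprop_step(1.0, 2.0, {}, centered=True)
```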

NAME

AI::MXNet::AdaDelta - AdaDelta optimizer.

DESCRIPTION

AdaDelta optimizer as described in
Zeiler, M. D. (2012).
*ADADELTA: An adaptive learning rate method.*

http://arxiv.org/abs/1212.5701

Parameters
----------
rho: float
    Decay rate for both squared gradients and delta x
epsilon : float
    The constant as described in the paper
wd : float
    L2 regularization coefficient added to all the weights
rescale_grad : float, optional
    rescaling factor of gradient. Normally should be 1/batch_size.
clip_gradient : float, optional
    clip gradient in range [-clip_gradient, clip_gradient]
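
A scalar sketch of the AdaDelta update from the cited paper: rho decays both the squared-gradient and squared-update accumulators, and their ratio sets the step size, so no learning rate is needed. It is a Python illustration, not the AI::MXNet code; the function name `adadelta_step` is hypothetical, and the way wd is applied is an assumption.

```python
import math


def adadelta_step(weight, grad, acc_g, acc_delta, rho, epsilon, wd=0.0):
    """One AdaDelta step; returns (weight, acc_g, acc_delta)."""
    # Decayed average of squared gradients.
    acc_g = rho * acc_g + (1 - rho) * grad * grad
    # Step size is RMS of past updates over RMS of gradients.
    delta = math.sqrt(acc_delta + epsilon) / math.sqrt(acc_g + epsilon) * grad
    # Decayed average of squared updates (delta x).
    acc_delta = rho * acc_delta + (1 - rho) * delta * delta
    weight = weight - delta - wd * weight
    return weight, acc_g, acc_delta


w, acc_g, acc_delta = adadelta_step(1.0, 1.0, 0.0, 0.0, rho=0.9, epsilon=1e-5)
```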

NAME

AI::MXNet::Ftrl

DESCRIPTION

Reference: Ad Click Prediction: a View from the Trenches

Parameters
----------
lamda1 : float, optional
    L1 regularization coefficient.

learning_rate : float, optional
    The initial learning rate.

beta : float, optional
    Per-coordinate learning rate correlation parameter:
    \eta_{t,i} = \frac{learning\_rate}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^{2}}}
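
The per-coordinate learning rate for coordinate i at step t divides learning_rate by beta plus the square root of the sum of that coordinate's squared gradients so far. A small Python sketch of this computation follows; it is an illustration only, and the function name `ftrl_learning_rate` is hypothetical.

```python
import math


def ftrl_learning_rate(grads_i, learning_rate, beta):
    """Per-coordinate rate given the gradient history of one coordinate."""
    sum_sq = sum(g * g for g in grads_i)   # sum over s = 1..t of g_{s,i}^2
    return learning_rate / (beta + math.sqrt(sum_sq))


eta = ftrl_learning_rate([3.0, 4.0], learning_rate=0.1, beta=1.0)
```

With gradients 3 and 4 the square root is 5, so the rate is 0.1 / (1 + 5). A coordinate with no gradient history keeps the rate learning_rate / beta.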

NAME

AI::MXNet::Adamax

DESCRIPTION

A variant of Adam based on the infinity norm, described in
Section 7 of http://arxiv.org/abs/1412.6980.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
beta1 : float, optional
    Exponential decay rate for the first moment estimates.
beta2 : float, optional
    Exponential decay rate for the second moment estimates.
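
A scalar sketch of the Adamax update from Section 7 of the Adam paper: the second moment is replaced by an exponentially weighted infinity norm, that is, a running maximum of decayed gradient magnitudes. It is a Python illustration, not the AI::MXNet code; the function name `adamax_step` is hypothetical.

```python
def adamax_step(weight, grad, m, u, t, lr, beta1, beta2):
    """One Adamax step at (1-based) iteration t; returns (weight, m, u)."""
    m = beta1 * m + (1 - beta1) * grad     # first moment estimate
    u = max(beta2 * u, abs(grad))          # infinity-norm based scale
    lr_t = lr / (1 - beta1 ** t)           # bias correction for m
    # Note: u is zero only if every gradient so far was zero.
    return weight - lr_t * m / u, m, u


w, m, u = adamax_step(1.0, 2.0, 0.0, 0.0, t=1, lr=0.002, beta1=0.9, beta2=0.999)
```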

NAME

AI::MXNet::Nadam

DESCRIPTION

The Nesterov Adam optimizer.

Much like Adam is essentially RMSprop with momentum,
Nadam is essentially RMSprop with Nesterov momentum. See
http://cs229.stanford.edu/proj2015/054_report.pdf.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
beta1 : float, optional
    Exponential decay rate for the first moment estimates.
beta2 : float, optional
    Exponential decay rate for the second moment estimates.
epsilon : float, optional
    Small value to avoid division by 0.
schedule_decay : float, optional
    Exponential decay rate for the momentum schedule.

set_states

Sets updater states.

get_states

Gets updater states.

Parameters
----------
dump_optimizer : bool, default False
    Whether to also save the optimizer itself. This would also save optimizer
    information such as learning rate and weight decay schedules.