NAME

Lugh::Optimizer::AdamW - Adam optimizer with decoupled weight decay

SYNOPSIS

use Lugh;
use Lugh::Autograd;

# Create context and parameter tensors
my $ctx = Lugh::Context->new(mem_size => 64 * 1024 * 1024);
my $weights = Lugh::Autograd::Tensor->new($ctx, 'f32', 768, 768, {
    requires_grad => 1,
});

# Create AdamW optimizer (recommended for transformers)
my $optimizer = Lugh::Optimizer::AdamW->new(
    lr           => 1e-4,
    weight_decay => 0.01,
);

# Register parameters
$optimizer->add_param($weights);

# Training loop
for my $step (1..10000) {
    $optimizer->zero_grad();

    my $loss = compute_loss($weights, $batch);  # your own forward pass and data
    $loss->backward();

    $optimizer->step();
}

DESCRIPTION

Lugh::Optimizer::AdamW implements the Adam optimizer with decoupled weight decay regularization, as described in "Decoupled Weight Decay Regularization" (Loshchilov & Hutter, 2017).

AdamW is the recommended optimizer for training transformer models including LLaMA, GPT, and BERT variants. It maintains per-parameter adaptive learning rates using first and second moment estimates.

The update rule is:

m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2

m_hat = m_t / (1 - beta1^t)  # Bias correction
v_hat = v_t / (1 - beta2^t)

param = param - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * param)
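
For illustration, here is the same update written as plain Perl over a few scalar values (a minimal sketch of the math only; the variable names are not Lugh internals, but conceptually step() applies this update to every element of every registered tensor):

use strict;
use warnings;

my ($lr, $beta1, $beta2, $eps, $weight_decay) = (1e-4, 0.9, 0.999, 1e-8, 0.01);

my @param = (0.5, -0.3, 1.2);
my @grad  = (0.1,  0.2, -0.05);
my @m     = (0) x @param;   # first moment estimates
my @v     = (0) x @param;   # second moment estimates
my $t     = 1;              # step count, starts at 1 for bias correction

for my $i (0 .. $#param) {
    $m[$i] = $beta1 * $m[$i] + (1 - $beta1) * $grad[$i];
    $v[$i] = $beta2 * $v[$i] + (1 - $beta2) * $grad[$i] ** 2;

    my $m_hat = $m[$i] / (1 - $beta1 ** $t);   # bias correction
    my $v_hat = $v[$i] / (1 - $beta2 ** $t);

    $param[$i] -= $lr * ($m_hat / (sqrt($v_hat) + $eps)
                         + $weight_decay * $param[$i]);
}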

CONSTRUCTOR

new

my $optimizer = Lugh::Optimizer::AdamW->new(%options);

Creates a new AdamW optimizer.

Options:

lr (default: 0.001)

Learning rate. Typical values for transformers range from 5e-5 to 1e-4.

beta1 (default: 0.9)

Exponential decay rate for the first moment estimates.

beta2 (default: 0.999)

Exponential decay rate for the second moment estimates.

eps (default: 1e-8)

Small constant for numerical stability. Prevents division by zero.

weight_decay (default: 0.01)

Decoupled weight decay coefficient. Applied directly to parameters, not to gradients.

Examples:

# Default AdamW for transformers
my $adamw = Lugh::Optimizer::AdamW->new(
    lr           => 1e-4,
    weight_decay => 0.01,
);

# Fine-tuning with lower learning rate
my $adamw = Lugh::Optimizer::AdamW->new(
    lr           => 2e-5,
    weight_decay => 0.01,
    beta1        => 0.9,
    beta2        => 0.999,
);

# For stable training with larger batches
my $adamw = Lugh::Optimizer::AdamW->new(
    lr           => 5e-4,
    weight_decay => 0.1,
    beta2        => 0.95,  # Lower for larger batches
);

METHODS

add_param

$optimizer->add_param($tensor);

Registers a tensor as a parameter to be optimized. The optimizer will maintain first and second moment estimates for this parameter.

Parameters:

$tensor

A Lugh::Autograd::Tensor object with requires_grad enabled.

Example:

# Register all model parameters
for my $layer (@layers) {
    $optimizer->add_param($layer->{weights});
    $optimizer->add_param($layer->{bias}) if $layer->{bias};
}

zero_grad

$optimizer->zero_grad();

Zeros the gradients of all registered parameters. Must be called at the start of each training iteration.

step

$optimizer->step();

Performs a single optimization step, updating all registered parameters based on their gradients and the optimizer state (moment estimates).

get_lr

my $current_lr = $optimizer->get_lr();

Returns the current learning rate.

set_lr

$optimizer->set_lr($new_lr);

Sets a new learning rate. Useful when implementing learning rate schedules.
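
For example, a hand-rolled cosine decay can be driven entirely through set_lr inside the training loop (a minimal sketch; compute_loss(), $weights, and $batch are the same placeholders used in the SYNOPSIS, and Lugh::Optimizer::LRScheduler already provides the common schedules):

my $base_lr     = 1e-4;
my $total_steps = 100_000;
my $pi          = 4 * atan2(1, 1);

for my $step (1 .. $total_steps) {
    $optimizer->zero_grad();

    my $loss = compute_loss($weights, $batch);
    $loss->backward();

    # Cosine decay from $base_lr down to 0 over the full run
    $optimizer->set_lr($base_lr * 0.5 * (1 + cos($pi * $step / $total_steps)));

    $optimizer->step();
}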

get_step_count

my $steps = $optimizer->get_step_count();

Returns the number of optimization steps taken. Used internally for bias correction.

TRAINING TIPS

Learning Rate

  • Start with 1e-4 for pre-training

  • Use 2e-5 to 5e-5 for fine-tuning

  • Scale with batch size: lr_scaled = lr_base * sqrt(batch_size / 32)
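
As a worked instance of the square-root scaling rule above, moving from the reference batch size of 32 to a batch size of 128 with a base learning rate of 1e-4:

my $lr_base    = 1e-4;
my $batch_size = 128;
my $lr_scaled  = $lr_base * sqrt($batch_size / 32);   # 2e-4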

Weight Decay

  • Use 0.01 for most cases

  • Increase to 0.1 for regularization

  • Ideally set to 0 for bias terms and layer-norm parameters (per-parameter weight decay is not currently supported)

Warmup

AdamW benefits from learning rate warmup, especially for larger models:

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'linear',
    warmup_steps => 1000,
    total_steps  => 100000,
);

ADAMW VS ADAM

AdamW differs from the original Adam in how weight decay is applied:

  • Adam (with L2 regularization): the weight decay term is added to the gradient before the moment estimates are computed, which couples it to the adaptive learning rate.

  • AdamW: Weight decay is applied directly to parameters after the gradient update, decoupling it from the adaptive learning rate.

AdamW generally provides better generalization, especially for transformer models.
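
The sketch below contrasts the two placements of the decay term for a single scalar parameter at step t = 1 with zero initial moments (plain Perl for illustration, not the module's internals):

use strict;
use warnings;

my ($lr, $beta1, $beta2, $eps, $wd) = (1e-4, 0.9, 0.999, 1e-8, 0.01);
my ($param, $grad) = (0.5, 0.1);

# Bias-corrected update direction at t = 1, given the gradient the
# moment estimates actually see.
my $adaptive = sub {
    my ($g) = @_;
    my $m_hat = ((1 - $beta1) * $g) / (1 - $beta1);
    my $v_hat = ((1 - $beta2) * $g ** 2) / (1 - $beta2);
    return $m_hat / (sqrt($v_hat) + $eps);
};

# Adam + L2 regularization: the decay is folded into the gradient, so it
# is rescaled by 1/(sqrt(v_hat) + eps) along with everything else.
my $adam_param  = $param - $lr * $adaptive->($grad + $wd * $param);

# AdamW: the moments see only the raw gradient; the decay is applied to
# the parameter directly, scaled by the learning rate alone.
my $adamw_param = $param - $lr * ($adaptive->($grad) + $wd * $param);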

MEMORY USAGE

AdamW maintains two additional tensors per parameter (the first and second moment estimates), so parameters plus optimizer state occupy roughly three times the parameter memory alone; gradient storage comes on top of that. For large models this is significant (figures assume 32-bit floats):

Model Params    Parameter Memory    Parameters + Optimizer State
1B              4 GB                12 GB
7B              28 GB               84 GB
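
As a back-of-the-envelope check of the figures above (a sketch of the arithmetic only, not module API):

my $n_params        = 7e9;   # 7B-parameter model
my $bytes_per_value = 4;     # 32-bit floats

my $param_gb = $n_params * $bytes_per_value / 1e9;   # 28 GB of parameters
my $state_gb = 2 * $param_gb;                        # first + second moments
my $total_gb = $param_gb + $state_gb;                # 84 GB, as in the table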

SEE ALSO

Lugh::Optimizer::SGD - Basic SGD optimizer

Lugh::Optimizer::LRScheduler - Learning rate scheduling

Lugh::Optimizer - Gradient clipping utilities

Lugh::Autograd - Automatic differentiation

REFERENCES

Loshchilov, I., & Hutter, F. (2017). "Decoupled Weight Decay Regularization" https://arxiv.org/abs/1711.05101

AUTHOR

LNATION <email@lnation.org>

LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.