NAME

Lugh::Optimizer::LRScheduler - Learning rate scheduling for optimizers

SYNOPSIS

use Lugh;
use Lugh::Autograd;

# Create optimizer
my $optimizer = Lugh::Optimizer::AdamW->new(lr => 1e-4);

# Create cosine annealing scheduler with warmup
my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'cosine',
    warmup_steps => 1000,
    total_steps  => 100000,
    min_lr       => 1e-6,
);

# Training loop
for my $step (1..100000) {
    $optimizer->zero_grad();

    my $loss = compute_loss();
    $loss->backward();

    $optimizer->step();
    $scheduler->step();  # Update learning rate

    if ($step % 1000 == 0) {
        printf "Step %d, LR: %.2e\n", $step, $scheduler->get_lr();
    }
}

DESCRIPTION

Lugh::Optimizer::LRScheduler provides learning rate scheduling for optimizer objects. A learning rate schedule adjusts the optimizer's learning rate over the course of training, which is often crucial for good convergence, especially in transformer models.

The scheduler wraps an optimizer and adjusts its learning rate based on the current step count and the chosen schedule.
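
Conceptually, each call to step() advances an internal counter, computes the scheduled rate for that step, and writes it back through the optimizer's set_lr() method. As a rough illustration only (not the module's internals), the flow looks something like this, here with a plain linear decay:

    # Illustrative sketch, not the module's implementation.
    my $step       = 0;
    my $initial_lr = $optimizer->get_lr();
    my $total      = 100_000;

    sub sketch_step {
        $step++;
        my $lr = $initial_lr * (1 - $step / $total);   # plain linear decay
        $optimizer->set_lr($lr);
    }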

CONSTRUCTOR

new

my $scheduler = Lugh::Optimizer::LRScheduler->new($optimizer, %options);

Creates a new learning rate scheduler.

Parameters:

$optimizer

The optimizer to schedule. Must have get_lr() and set_lr() methods. Typically a Lugh::Optimizer::SGD or Lugh::Optimizer::AdamW.
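
Any object providing these two methods can be scheduled. As a minimal, hypothetical illustration of the interface the scheduler relies on:

    # Hypothetical minimal optimizer-like class, shown only to
    # illustrate the get_lr()/set_lr() interface the scheduler expects.
    package My::TrivialOptimizer;

    sub new    { my ($class, %args) = @_; return bless { lr => $args{lr} }, $class }
    sub get_lr { return $_[0]{lr} }
    sub set_lr { $_[0]{lr} = $_[1]; return }

    package main;

    my $opt = My::TrivialOptimizer->new(lr => 1e-4);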

Options:

schedule or type (default: 'constant')

The schedule type. Available schedules:

  • 'constant' - No change to learning rate

  • 'linear' - Linear decay from initial to min_lr

  • 'cosine' - Cosine annealing

  • 'exponential' - Exponential decay

  • 'step' - Step decay at milestones

  • 'warmup' - Linear warmup only, then constant

warmup_steps (default: 0)

Number of warmup steps. During warmup, the learning rate increases linearly from 0 to the initial learning rate.

total_steps (default: 1000)

Total number of training steps. Used for calculating decay rates.

min_lr (default: 0)

Minimum learning rate. The scheduler will not reduce the learning rate below this value.

decay_rate or gamma (default: 0.1)

Decay factor for exponential and step schedules.

milestones (for step schedule)

Array reference of step numbers at which to apply decay.

SCHEDULE TYPES

constant

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule => 'constant',
);

Learning rate remains unchanged throughout training.

linear

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'linear',
    warmup_steps => 1000,
    total_steps  => 100000,
    min_lr       => 1e-6,
);

Linear decay from initial learning rate to min_lr:

if step <= warmup_steps:
    lr = initial_lr * (step / warmup_steps)
else:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = initial_lr + (min_lr - initial_lr) * progress
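
For example, with initial_lr = 1e-4, min_lr = 1e-6, warmup_steps = 1000 and total_steps = 100000, the rate halfway through the decay phase (step 50500) works out to:

progress = (50500 - 1000) / (100000 - 1000) = 0.5
lr       = 1e-4 + (1e-6 - 1e-4) * 0.5 = 5.05e-5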

cosine

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'cosine',
    warmup_steps => 1000,
    total_steps  => 100000,
    min_lr       => 1e-6,
);

Cosine annealing, which provides smooth decay:

if step <= warmup_steps:
    lr = initial_lr * (step / warmup_steps)
else:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = min_lr + 0.5 * (initial_lr - min_lr) * (1 + cos(pi * progress))

Cosine annealing is one of the most commonly used schedules for transformer training.
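
The same computation can be written directly in Perl. The sketch below is illustrative only (the module performs this internally when step() is called) and uses the core Math::Trig module for pi:

    # Standalone sketch of the cosine-with-warmup formula above.
    use Math::Trig qw(pi);

    sub cosine_lr {
        my ($step, $initial_lr, $min_lr, $warmup_steps, $total_steps) = @_;
        return $initial_lr * ($step / $warmup_steps) if $step <= $warmup_steps;
        my $progress = ($step - $warmup_steps) / ($total_steps - $warmup_steps);
        return $min_lr + 0.5 * ($initial_lr - $min_lr) * (1 + cos(pi * $progress));
    }

    # Halfway through the decay phase the rate sits midway between the two:
    # cosine_lr(50500, 1e-4, 1e-6, 1000, 100000) returns ~5.05e-5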

exponential

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule   => 'exponential',
    decay_rate => 0.99,
    min_lr     => 1e-6,
);

Exponential decay:

lr = initial_lr * decay_rate^step
lr = max(lr, min_lr)
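
For example, with decay_rate = 0.99 and an initial rate of 1e-4, the rate after 100 steps is:

lr = 1e-4 * 0.99^100 ~ 1e-4 * 0.366 = 3.66e-5

and min_lr acts as a floor once the decay would fall below it.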

step

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule   => 'step',
    milestones => [30000, 60000, 90000],
    decay_rate => 0.1,
);

Step decay at specified milestones:

lr = initial_lr * decay_rate^(number of milestones passed)

Common in computer vision (e.g., ImageNet training).
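
With the settings above, a step such as 70000 has passed two milestones (30000 and 60000), so:

lr = initial_lr * 0.1^2 = initial_lr / 100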

warmup

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'warmup',
    warmup_steps => 1000,
);

Linear warmup followed by constant learning rate:

if step <= warmup_steps:
    lr = initial_lr * (step / warmup_steps)
else:
    lr = initial_lr
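
For example, with warmup_steps = 1000 and an initial rate of 1e-4:

step 250:           lr = 1e-4 * (250 / 1000) = 2.5e-5
step 1000 onward:   lr = 1e-4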

METHODS

step

$scheduler->step();

Advances the scheduler by one step and updates the optimizer's learning rate. Should be called once per training iteration, typically after $optimizer->step().

get_lr

my $current_lr = $scheduler->get_lr();

Returns the current learning rate from the underlying optimizer.

get_step

my $current_step = $scheduler->get_step();

Returns the current step count of the scheduler.

USAGE PATTERNS

Pre-training

For pre-training large language models:

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'cosine',
    warmup_steps => 2000,
    total_steps  => 300000,
    min_lr       => 1e-5,
);

Fine-tuning

For fine-tuning on downstream tasks:

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'linear',
    warmup_steps => 100,
    total_steps  => 3000,
    min_lr       => 0,
);

Few-shot Learning

For quick adaptation:

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'warmup',
    warmup_steps => 10,
);

EXAMPLE: COMPLETE TRAINING LOOP

use Lugh;
use Lugh::Autograd;

# Setup
my $ctx = Lugh::Context->new(mem_size => 64 * 1024 * 1024);
my $params = create_model_params($ctx);

my $optimizer = Lugh::Optimizer::AdamW->new(
    lr           => 1e-4,
    weight_decay => 0.01,
);

for my $p (@$params) {
    $optimizer->add_param($p);
}

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'cosine',
    warmup_steps => 1000,
    total_steps  => 50000,
    min_lr       => 1e-6,
);

# Training
for my $step (1..50000) {
    my ($inputs, $targets) = get_batch();

    $optimizer->zero_grad();

    my $loss = forward_pass($params, $inputs, $targets);
    $loss->backward();

    # Gradient clipping (optional)
    Lugh::Optimizer->clip_grad_norm(1.0, @$params);

    $optimizer->step();
    $scheduler->step();

    if ($step % 100 == 0) {
        printf "Step %d: loss=%.4f, lr=%.2e\n",
               $step, $loss->get_data()->[0], $scheduler->get_lr();
    }
}

SEE ALSO

Lugh::Optimizer::AdamW - Recommended optimizer for transformers

Lugh::Optimizer::SGD - Basic SGD optimizer

Lugh::Optimizer - Gradient clipping utilities

AUTHOR

LNATION <email@lnation.org>

LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.