NAME

Lugh::Optimizer - Optimization algorithms for Lugh training

SYNOPSIS

use Lugh;
use Lugh::Optimizer;
use Lugh::Autograd;

# Create a context and tensor
my $ctx = Lugh::Context->new(mem_size => 64 * 1024 * 1024);
my $weights = Lugh::Autograd::Tensor->new($ctx, 'f32', 2, { requires_grad => 1 });
$weights->set_data(0.5, 0.5);

# SGD optimizer
my $sgd = Lugh::Optimizer::SGD->new(
    lr       => 0.01,
    momentum => 0.9,
);
$sgd->add_param($weights);

# Training loop
for my $i (1..100) {
    $sgd->zero_grad();
    
    # ... compute loss and backward ...
    
    $sgd->step();
}

# AdamW optimizer (recommended for transformers)
my $adamw = Lugh::Optimizer::AdamW->new(
    lr           => 1e-4,
    weight_decay => 0.01,
);

# Learning rate scheduler
my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $adamw,
    schedule     => 'cosine',
    warmup_steps => 100,
    total_steps  => 1000,
);

DESCRIPTION

Lugh::Optimizer provides optimization algorithms for training neural networks. All implementations are in XS for maximum performance.

OPTIMIZERS

Lugh::Optimizer::SGD

Stochastic Gradient Descent with momentum and optional Nesterov acceleration.

my $sgd = Lugh::Optimizer::SGD->new(
    lr           => 0.01,     # Learning rate (default: 0.001)
    momentum     => 0.9,      # Momentum factor (default: 0)
    weight_decay => 0.0001,   # L2 regularization (default: 0)
    nesterov     => 1,        # Use Nesterov momentum (default: 0)
);
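
For intuition, the momentum update sketched on a single scalar (illustrative only; the XS implementation operates on whole tensors, and momentum conventions vary slightly between libraries):

my ($param, $v)     = (0.5, 0);
my ($lr, $momentum) = (0.01, 0.9);
for my $grad (0.20, 0.15, 0.10) {      # stand-in gradients
    $v = $momentum * $v + $grad;       # velocity accumulates past gradients
    $param -= $lr * $v;                # standard momentum step
    # Nesterov variant (one common convention):
    # $param -= $lr * ($grad + $momentum * $v);
}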

Methods

add_param($tensor)

Add a tensor to be optimized.

step()

Perform a single optimization step, updating all parameters.

zero_grad()

Zero the gradients of all registered parameters. Call this once per training step, before the backward pass.

get_lr()

Get the current learning rate.

set_lr($new_lr)

Set a new learning rate.
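
For example, inside a training loop these two methods support a hand-rolled schedule as an alternative to Lugh::Optimizer::LRScheduler:

# Halve the learning rate every 200 steps
if ($step % 200 == 0) {
    $sgd->set_lr($sgd->get_lr() * 0.5);
}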

Lugh::Optimizer::AdamW

Adam optimizer with decoupled weight decay (AdamW). Recommended for transformer models.

my $adamw = Lugh::Optimizer::AdamW->new(
    lr           => 1e-4,    # Learning rate (default: 0.001)
    beta1        => 0.9,     # First moment decay (default: 0.9)
    beta2        => 0.999,   # Second moment decay (default: 0.999)
    eps          => 1e-8,    # Numerical stability (default: 1e-8)
    weight_decay => 0.01,    # Decoupled weight decay (default: 0.01)
);

Methods

Same as SGD, plus:

get_step_count()

Get the current optimization step count (for bias correction).

AdamW vs Adam

AdamW applies weight decay directly to the weights (decoupled), rather than folding an L2 penalty into the gradient as classic Adam does. This typically yields better generalization for large language models.
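
The distinction, sketched for a single scalar parameter (illustrative only; Lugh's XS implementation operates on whole tensors):

my ($param, $grad) = (0.5, 0.2);
my ($m, $v, $t)    = (0, 0, 1);
my ($lr, $b1, $b2, $eps, $wd) = (1e-4, 0.9, 0.999, 1e-8, 0.01);

$m = $b1 * $m + (1 - $b1) * $grad;               # first moment
$v = $b2 * $v + (1 - $b2) * $grad ** 2;          # second moment
my $m_hat = $m / (1 - $b1 ** $t);                # bias correction
my $v_hat = $v / (1 - $b2 ** $t);
$param -= $lr * $m_hat / (sqrt($v_hat) + $eps);  # Adam step
$param -= $lr * $wd * $param;                    # decoupled decay (the "W")
# Classic Adam with L2 would instead fold $wd * $param into $grad
# before the moment updates, coupling decay to the adaptive scaling.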

LEARNING RATE SCHEDULERS

Lugh::Optimizer::LRScheduler

Adjusts learning rate during training according to a schedule.

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'cosine',   # Schedule type
    warmup_steps => 100,        # Warmup period
    total_steps  => 1000,       # Total training steps
    min_lr       => 0,          # Minimum learning rate
);

Schedule Types

constant

Learning rate stays constant throughout training.

linear

Linear warmup then linear decay to min_lr.

cosine

Linear warmup then cosine annealing to min_lr. Popular for LLM training; a formula sketch follows this list.

exponential

Exponential decay: lr = initial_lr * decay_rate^step

step

Decay at specific milestones:

my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule   => 'step',
    milestones => [30, 60, 90],  # Decay at these steps
    decay_rate => 0.1,           # Multiply by this at each milestone
);

warmup

Linear warmup then constant.
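
For intuition, the commonly used warmup-plus-cosine formula looks like this (an assumed formulation; the XS scheduler may differ in edge-case handling):

my $pi = 4 * atan2(1, 1);

sub cosine_lr {
    my ($step, $base_lr, $min_lr, $warmup, $total) = @_;
    return $base_lr * $step / $warmup if $step < $warmup;
    my $p = ($step - $warmup) / ($total - $warmup);    # progress in [0, 1]
    return $min_lr + 0.5 * ($base_lr - $min_lr) * (1 + cos($pi * $p));
}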

Methods

step()

Update the learning rate. Call once per training step.

get_lr()

Get the current learning rate.

get_step()

Get the current step count.

GRADIENT CLIPPING

clip_grad_norm

Clip gradients by total norm.

my $total_norm = Lugh::Optimizer->clip_grad_norm(1.0, @tensors);

Scales all gradients so the total L2 norm does not exceed max_norm. Returns the original total norm before clipping.
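
Conceptually, on plain Perl arrays rather than Lugh tensors (a sketch of the math, not the XS implementation; clip_by_norm_sketch is a hypothetical helper):

sub clip_by_norm_sketch {
    my ($max_norm, @grads) = @_;           # each gradient: arrayref of numbers
    my $sq = 0;
    $sq += $_ ** 2 for map { @$_ } @grads;
    my $total_norm = sqrt($sq);
    if ($total_norm > $max_norm) {
        my $scale = $max_norm / $total_norm;
        for my $g (@grads) {
            @$g = map { $_ * $scale } @$g;
        }
    }
    return $total_norm;                    # norm before clipping
}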

clip_grad_value

Clip gradients by value.

Lugh::Optimizer->clip_grad_value(1.0, @tensors);

Clamps all gradient values to [-max_value, max_value].
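
The equivalent element-wise clamp, sketched on a plain array:

my @grad = (1.7, -0.3, -2.4);
@grad = map { $_ > 1.0 ? 1.0 : $_ < -1.0 ? -1.0 : $_ } @grad;
# @grad is now (1.0, -0.3, -1.0)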

EXAMPLE: COMPLETE TRAINING LOOP

use Lugh;
use Lugh::Train;
use Lugh::Optimizer;

# Load model and create inference engine
my $model = Lugh::Model->new(model => 'model.gguf');
my $inference = Lugh::Inference->new(model => $model);

# Create trainable LoRA adapter
my $lora = Lugh::LoRA->create(
    model   => $model,
    rank    => 8,
    alpha   => 16.0,
    targets => [qw(attn_q attn_v)],
);

# Get LoRA weight tensors
my @params;
for my $name ($lora->weight_names) {
    push @params, $lora->get_weight_tensor($name, 'a');
    push @params, $lora->get_weight_tensor($name, 'b');
}

# Create optimizer
my $optimizer = Lugh::Optimizer::AdamW->new(
    lr           => 1e-4,
    weight_decay => 0.01,
);
$optimizer->add_param($_) for @params;

# Create scheduler
my $scheduler = Lugh::Optimizer::LRScheduler->new(
    $optimizer,
    schedule     => 'cosine',
    warmup_steps => 100,
    total_steps  => 1000,
    min_lr       => 1e-6,
);

# Training data: input token IDs and next-token targets
# (tokenization of your corpus is elided)
my (@input_tokens, @targets);

# Training loop
for my $step (1..1000) {
    my $ctx = Lugh::Context->new(mem_size => 64 * 1024 * 1024);
    $optimizer->zero_grad();

    # Forward pass with LoRA
    my $logits = Lugh::Train->forward(
        inference  => $inference,
        context    => $ctx,
        tokens     => \@input_tokens,
        lora       => $lora,
        train_lora => 1,
    );

    my $loss = Lugh::Train->cross_entropy_loss($ctx, $logits, \@targets);

    # Backward pass
    $loss->backward();

    # Gradient clipping (prevent exploding gradients)
    my $grad_norm = Lugh::Optimizer->clip_grad_norm(1.0, @params);

    # Optimizer step
    $optimizer->step();

    # Learning rate schedule
    $scheduler->step();

    # Log progress
    if ($step % 100 == 0) {
        my $lr = $scheduler->get_lr();
        my ($loss_val) = $loss->get_data();
        printf "Step %d: loss=%.4f, lr=%.2e, grad_norm=%.2f\n",
               $step, $loss_val, $lr, $grad_norm;
    }
}

# Save trained adapter
$lora->save('trained_adapter.gguf');

SEE ALSO

Lugh, Lugh::Autograd, Lugh::Train, Lugh::LoRA, Lugh::Inference

AUTHOR

LNATION <email@lnation.org>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.