NAME

Lugh::Inference - Transformer Forward Pass and Token Generation

VERSION

Version 0.01

SYNOPSIS

use Lugh;

# Load model and create inference engine
my $model = Lugh::Model->new(model => '/path/to/model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);

# Encode a prompt
my @tokens = $tokenizer->encode("The capital of France is");

# Run forward pass to get logits
my @logits = $inference->forward(\@tokens);

# Find the most likely next token (greedy)
my $max_idx = 0;
my $max_val = $logits[0];
for my $i (1..$#logits) {
    if ($logits[$i] > $max_val) {
        $max_val = $logits[$i];
        $max_idx = $i;
    }
}

# Decode the predicted token
my $next_token = $tokenizer->decode([$max_idx]);
print "Next token: $next_token\n";  # " Paris"

# Or use top-p sampling
my $sampled = $inference->sample_top_p(
    \@logits,
    temperature => 0.8,
    top_p => 0.95
);

DESCRIPTION

Lugh::Inference implements the transformer forward pass for autoregressive language model inference. Given a sequence of input tokens, it computes the probability distribution (logits) over the vocabulary for the next token.

Transformer Architecture

The forward pass implements the standard transformer decoder architecture used by LLaMA, Mistral, and similar models:

Input Tokens
     │
     ▼
┌─────────────┐
│  Token      │
│  Embeddings │
└─────────────┘
     │
     ▼
┌─────────────────────────────────┐
│  Transformer Layer (× N)        │
│  ┌─────────────────────────────┐│
│  │ RMSNorm                     ││
│  │      ↓                      ││
│  │ Multi-Head Attention (GQA)  ││
│  │      ↓                      ││
│  │ Residual Add                ││
│  │      ↓                      ││
│  │ RMSNorm                     ││
│  │      ↓                      ││
│  │ FFN (SwiGLU)                ││
│  │      ↓                      ││
│  │ Residual Add                ││
│  └─────────────────────────────┘│
└─────────────────────────────────┘
     │
     ▼
┌─────────────┐
│  Final      │
│  RMSNorm    │
└─────────────┘
     │
     ▼
┌─────────────┐
│  Output     │
│  Projection │ → Logits [vocab_size]
└─────────────┘

Key Components

  • RMSNorm - Root Mean Square Layer Normalization

  • RoPE - Rotary Position Embeddings for position encoding

  • GQA - Grouped Query Attention (multiple Q heads per KV head)

  • SwiGLU - Gated activation in feed-forward network
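
For reference, here is a minimal pure-Perl sketch of the RMSNorm step (illustrative only; inside the module this runs as part of the ggml compute graph, and the weight vector shown here is a placeholder):

use List::Util qw(sum);

# y_i = x_i / sqrt(mean(x^2) + eps) * w_i
sub rms_norm {
    my ($x, $w, $eps) = @_;          # $x, $w: array refs of equal length
    $eps //= 1e-5;
    my $mean_sq = sum(map { $_ * $_ } @$x) / @$x;
    my $scale   = 1 / sqrt($mean_sq + $eps);
    return [ map { $x->[$_] * $scale * $w->[$_] } 0 .. $#$x ];
}

my $normed = rms_norm([0.5, -1.2, 3.0], [1.0, 1.0, 1.0]);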

CONSTRUCTOR

new

my $inference = Lugh::Inference->new(
    model      => $model,
    backend    => 'auto',  # optional, compute backend
    n_threads  => 4,       # optional, number of CPU threads
    flash_attn => 0,       # optional, use flash attention
);

Creates a new Inference engine from a loaded model.

Parameters:

  • model (required) - A Lugh::Model object

  • backend (optional) - Compute backend to use. Defaults to 'auto'.

    Available backends depend on your system:

    • 'auto' - Automatically select the best available (GPU preferred)

    • 'Metal' - Apple Metal GPU (macOS only)

    • 'CUDA' - NVIDIA GPU (if ggml built with CUDA)

    • 'Vulkan' - Cross-platform GPU

    • 'CPU' - CPU with SIMD (always available)

    Use Lugh::available_backends() to see what's available on your system.

  • n_threads (optional) - Number of CPU threads for computation. Defaults to 4. Only affects CPU backend.

  • flash_attn (optional) - Use flash attention if set to 1 (default: 0)

Returns: A Lugh::Inference object.

Throws: Dies if no model is provided or if requested backend is unavailable.

Example:

my $model = Lugh::Model->new(model => 'model.gguf');

# Auto-select best backend (recommended)
my $inference = Lugh::Inference->new(model => $model);

# Force Metal GPU on macOS
my $gpu_inference = Lugh::Inference->new(
    model   => $model,
    backend => 'Metal',
);

# Force CPU with 8 threads
my $cpu_inference = Lugh::Inference->new(
    model     => $model,
    backend   => 'CPU',
    n_threads => 8,
);

METHODS

forward

my @logits = $inference->forward(\@tokens);

Runs the transformer forward pass on input tokens.

Parameters:

  • \@tokens - Array reference of token IDs (integers)

Returns: A list of logits (one per vocabulary token).

Details:

The forward pass:

1. Looks up token embeddings
2. Applies N transformer layers with attention and FFN
3. Applies final normalization
4. Projects to vocabulary size
5. Returns logits for the last token position

Performance Notes:

  • Each call creates a new computation graph

  • Memory is allocated and freed for each call

  • For multi-token generation, consider batching

Example:

my @tokens = (1, 450, 7483, 310, 3444, 338);  # "The capital of France is"
my @logits = $inference->forward(\@tokens);

# logits has 32000 elements (vocab size)
print "Vocab size: ", scalar(@logits), "\n";

sample_top_p

my $token_id = $inference->sample_top_p(
    \@logits,
    temperature => 0.8,
    top_p => 0.95
);

Samples a token from logits using nucleus (top-p) sampling.

Parameters:

  • \@logits - Array reference of logits from forward()

  • temperature - Sampling temperature (default: 0.8)

  • top_p - Cumulative probability threshold (default: 0.95)

Returns: A single token ID.

Algorithm:

1. Apply temperature scaling: logit / temperature
2. Convert to probabilities via softmax
3. Sort tokens by probability (descending)
4. Keep tokens until cumulative probability >= top_p
5. Randomly sample from this "nucleus"
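
The steps above can be sketched in plain Perl. This is an illustration of the algorithm, not the module's internal implementation:

use List::Util qw(max sum);

# Nucleus (top-p) sampling over raw logits, step by step
sub nucleus_sample {
    my ($logits, $temperature, $top_p) = @_;

    # 1. Temperature scaling  2. Softmax (shift by max for stability)
    my @scaled = map { $_ / $temperature } @$logits;
    my $m      = max(@scaled);
    my @exps   = map { exp($_ - $m) } @scaled;
    my $z      = sum(@exps);
    my @probs  = map { $_ / $z } @exps;

    # 3. Sort token ids by probability, descending
    my @order = sort { $probs[$b] <=> $probs[$a] } 0 .. $#probs;

    # 4. Keep tokens until the cumulative probability reaches top_p
    my @nucleus;
    my $cum = 0;
    for my $id (@order) {
        push @nucleus, $id;
        $cum += $probs[$id];
        last if $cum >= $top_p;
    }

    # 5. Sample from the nucleus (renormalised by $cum)
    my $r   = rand($cum);
    my $acc = 0;
    for my $id (@nucleus) {
        $acc += $probs[$id];
        return $id if $r <= $acc;
    }
    return $nucleus[-1];
}

my @toy_logits = (2.0, 1.0, 0.5, -1.0);
my $token_id   = nucleus_sample(\@toy_logits, 0.8, 0.95);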

Temperature Effects:

  • temperature < 1 - More deterministic (sharper distribution)

  • temperature = 1 - Use raw probabilities

  • temperature > 1 - More random (flatter distribution)

  • temperature → 0 - Approaches greedy (argmax)

Top-p Effects:

  • top_p = 0.1 - Very focused, only most likely tokens

  • top_p = 0.9 - Typical setting, good balance

  • top_p = 1.0 - Consider all tokens (no truncation; sample from the full distribution)

Example:

my @logits = $inference->forward(\@tokens);

# Creative generation
my $token = $inference->sample_top_p(\@logits,
    temperature => 1.0,
    top_p => 0.95
);

# More focused generation
$token = $inference->sample_top_p(\@logits,
    temperature => 0.3,
    top_p => 0.5
);

sample_top_k

my $token_id = $inference->sample_top_k(
    \@logits,
    temperature => 0.8,
    top_k => 40
);

Samples a token from logits using top-k sampling.

Parameters:

  • \@logits - Array reference of logits from forward()

  • temperature - Sampling temperature (default: 0.8)

  • top_k - Number of top tokens to consider (default: 40)

Returns: A single token ID.

Algorithm:

1. Apply temperature scaling: logit / temperature
2. Convert to probabilities via softmax
3. Select top-k tokens by probability
4. Renormalize probabilities
5. Randomly sample from top-k tokens

Example:

my @logits = $inference->forward(\@tokens);

# Sample from top 50 tokens
my $token = $inference->sample_top_k(\@logits,
    temperature => 0.9,
    top_k => 50
);

generate

my @tokens = $inference->generate(
    \@prompt_tokens,
    max_tokens  => 100,
    temperature => 0.8,
    top_p       => 0.95,
    top_k       => 40,
    greedy      => 0,
    eos_token   => 2,
    callback    => sub { ... },
);

Generates multiple tokens autoregressively from a prompt.

Parameters:

  • \@prompt_tokens (required) - Array reference of prompt token IDs

  • max_tokens - Maximum tokens to generate (default: 128)

  • temperature - Sampling temperature (default: 0.8)

  • top_p - Top-p (nucleus) sampling threshold (default: 0.95)

  • top_k - Top-k sampling limit (default: 40). If top_k is less than 1000, top-k sampling is used; otherwise top-p (nucleus) sampling is used

  • greedy - If true, use greedy decoding (argmax) (default: 0)

  • eos_token - Token ID to stop generation (default: from model, typically 2)

  • callback - Optional subroutine called for each generated token

Returns: A list of generated token IDs (not including the prompt).

Callback:

The callback receives (token_id, count) and should return true to stop generation:

callback => sub {
    my ($token, $count) = @_;
    print $tokenizer->decode([$token]);
    return 0;  # Continue (return 1 to stop)
}

Stopping Conditions:

Generation stops when:

  • max_tokens is reached

  • EOS token is generated

  • Callback returns true

Example:

use Lugh;

my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);

my @prompt = $tokenizer->encode("Once upon a time");

# Greedy generation
my @tokens = $inference->generate(\@prompt,
    max_tokens => 50,
    greedy     => 1,
);
print $tokenizer->decode(\@tokens);

# Creative generation with streaming
@tokens = $inference->generate(\@prompt,
    max_tokens  => 100,
    temperature => 1.0,
    top_p       => 0.95,
    callback    => sub {
        my ($tok, $n) = @_;
        print $tokenizer->decode([$tok]);
        STDOUT->flush();
        return 0;
    },
);

ATTENTION MECHANISM

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Where:

  • Q - Query vectors [head_dim, n_tokens, n_heads]

  • K - Key vectors [head_dim, n_tokens, n_kv_heads]

  • V - Value vectors [head_dim, n_tokens, n_kv_heads]

  • d_k - Head dimension (typically 64-128)
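
As a toy illustration of the formula, the following pure-Perl snippet computes attention for a single query vector against three key/value vectors (made-up numbers; the module performs this on ggml tensors across all heads and positions):

use List::Util qw(max sum);

sub dot { my ($u, $v) = @_; sum(map { $u->[$_] * $v->[$_] } 0 .. $#$u) }

# One query against three key/value vectors, head_dim = 2
my $q = [1.0, 0.0];
my @K = ([1.0, 0.0], [0.0, 1.0], [0.7, 0.7]);
my @V = ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]);

# Scaled scores: q . k / sqrt(d_k)
my $d_k    = scalar @$q;
my @scores = map { dot($q, $_) / sqrt($d_k) } @K;

# Softmax over the scores
my $m = max(@scores);
my @e = map { exp($_ - $m) } @scores;
my $z = sum(@e);
my @w = map { $_ / $z } @e;

# Output is the probability-weighted sum of the value vectors
my @out = (0, 0);
for my $i (0 .. $#V) {
    $out[$_] += $w[$i] * $V[$i][$_] for 0 .. $#out;
}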

Grouped Query Attention (GQA)

GQA uses fewer KV heads than query heads to reduce memory:

Model       n_head  n_kv_head  Ratio
LLaMA 7B    32      32         1:1 (MHA)
LLaMA 2 70B 64      8          8:1 (GQA)
TinyLlama   32      4          8:1 (GQA)
Mistral 7B  32      8          4:1 (GQA)

The implementation broadcasts KV heads to match query heads using ggml's native broadcasting.
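
Conceptually, query head i reads the keys and values of KV head int(i / (n_head / n_kv_head)). A small sketch of that mapping (illustrative only):

my $n_head    = 32;
my $n_kv_head = 8;                       # Mistral-7B-style 4:1 grouping
my $group     = $n_head / $n_kv_head;    # query heads per KV head

for my $q_head (0 .. $n_head - 1) {
    my $kv_head = int($q_head / $group);
    # query head $q_head attends over keys/values of KV head $kv_head
}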

Causal Masking

The attention uses causal (autoregressive) masking so each position can only attend to itself and previous positions:

Position:  0  1  2  3
0          ✓  ✗  ✗  ✗
1          ✓  ✓  ✗  ✗
2          ✓  ✓  ✓  ✗
3          ✓  ✓  ✓  ✓

This is implemented using ggml_diag_mask_inf which sets the upper triangle to -infinity before softmax.
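
The equivalent mask, written out naively in Perl (the module relies on ggml_diag_mask_inf rather than building an explicit matrix):

my $n_tokens = 4;
my $neg_inf  = -9**9**9;    # overflows to -Inf

my @mask;
for my $i (0 .. $n_tokens - 1) {
    for my $j (0 .. $n_tokens - 1) {
        # position $i may attend to position $j only when $j <= $i
        $mask[$i][$j] = $j <= $i ? 0 : $neg_inf;
    }
}
# The mask is added to the attention scores before softmax.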

RoPE (Rotary Position Embeddings)

Position information is encoded by rotating Q and K vectors:

RoPE(x, pos) = x × cos(pos × θ) + rotate(x) × sin(pos × θ)

Where θ depends on the dimension and base frequency (typically 10000).

Parameters are read from model metadata:

  • llama.rope.dimension_count - Dimensions to rotate

  • llama.rope.freq_base - Base frequency

  • llama.context_length - Original context length
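
A pure-Perl sketch of the rotation for one query or key vector, pairing dimension 2i with 2i+1 (classic RoPE pairing; the freq_base default of 10000 here mirrors the metadata described above, and the real values come from the model file):

sub rope_rotate {
    my ($x, $pos, $freq_base) = @_;       # $x: array ref of even length
    $freq_base //= 10000;
    my $d = scalar @$x;
    for (my $i = 0; $i < $d; $i += 2) {
        my $theta = $pos * $freq_base ** (-$i / $d);
        my ($x0, $x1) = ($x->[$i], $x->[$i + 1]);
        $x->[$i]     = $x0 * cos($theta) - $x1 * sin($theta);
        $x->[$i + 1] = $x0 * sin($theta) + $x1 * cos($theta);
    }
    return $x;
}

my $rotated = rope_rotate([0.1, 0.2, 0.3, 0.4], 5);   # position 5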

FEED-FORWARD NETWORK

The FFN uses SwiGLU activation:

FFN(x) = down(SiLU(gate(x)) × up(x))

Where:

  • gate, up - Linear projections to intermediate dimension

  • SiLU - Sigmoid Linear Unit: x × sigmoid(x)

  • down - Linear projection back to model dimension
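
A minimal sketch of this computation for one token, with the projections written as plain Perl matrices (illustrative only; in the module the same graph runs over quantized ggml tensors):

use List::Util qw(sum);

sub silu { my ($v) = @_; $v / (1 + exp(-$v)) }    # x * sigmoid(x)

sub matvec {                                      # rows of $m dotted with $x
    my ($m, $x) = @_;
    return [ map { my $row = $_;
                   sum(map { $row->[$_] * $x->[$_] } 0 .. $#$x) } @$m ];
}

sub ffn_swiglu {
    my ($x, $w_gate, $w_up, $w_down) = @_;
    my $g = matvec($w_gate, $x);                  # [ffn_dim]
    my $u = matvec($w_up,   $x);                  # [ffn_dim]
    my $h = [ map { silu($g->[$_]) * $u->[$_] } 0 .. $#$g ];
    return matvec($w_down, $h);                   # back to [n_embd]
}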

Typical dimensions:

Model       n_embd  FFN_dim   Ratio
TinyLlama   2048    5632      2.75×
LLaMA 7B    4096    11008     2.69×
LLaMA 13B   5120    13824     2.70×

GENERATION LOOP

The generate() method handles the complete generation loop internally. For simple use cases:

use Lugh;

my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);

my @prompt = $tokenizer->encode("Once upon a time");
my @generated = $inference->generate(\@prompt,
    max_tokens  => 100,
    temperature => 0.8,
    top_p       => 0.95,
);
print $tokenizer->decode(\@generated);

For streaming output:

my @generated = $inference->generate(\@prompt,
    max_tokens  => 100,
    temperature => 0.8,
    callback    => sub {
        my ($token, $count) = @_;
        print $tokenizer->decode([$token]);
        STDOUT->flush();
        return 0;  # Continue
    },
);

For manual control (building your own loop):

my @tokens = $tokenizer->encode($prompt);
my @generated;

for (1..$max_tokens) {
    my @logits = $inference->forward(\@tokens);
    my $next = $inference->sample_top_p(\@logits,
        temperature => 0.8,
        top_p => 0.9
    );
    
    last if $next == $tokenizer->eos_id;
    
    push @tokens, $next;
    push @generated, $next;
    
    print $tokenizer->decode([$next]);
    STDOUT->flush();
}

PERFORMANCE

Computation

A single forward pass performs approximately:

FLOPs ≈ 2 × n_params × n_tokens

For TinyLlama (1.1B params) with 6 tokens:

2 × 1.1e9 × 6 ≈ 13 GFLOPs
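
The same back-of-the-envelope arithmetic in Perl:

my $n_params = 1.1e9;    # TinyLlama
my $n_tokens = 6;
my $gflops   = 2 * $n_params * $n_tokens / 1e9;
printf "~%.0f GFLOPs per forward pass\n", $gflops;    # ~13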

Memory

During inference, memory is needed for:

  • Model weights (quantized) - Depends on model size and quantization

  • Activations - O(n_tokens × n_embd × n_layers)

  • Attention scores - O(n_tokens² × n_heads × n_layers)

Optimizations

Current implementation:

  • Uses ggml's Metal GPU backend on macOS

  • Uses Accelerate BLAS for matrix operations

  • Quantized weights stay quantized during computation

  • KV cache for incremental decoding (see create_kv_cache)

  • Memory pools for efficient repeated inference

  • Batch processing for multiple sequences

ADVANCED METHODS

create_memory_pool

my $pool = $inference->create_memory_pool();

Creates a reusable memory pool for efficient repeated forward passes. The pool caches backend and allocator resources, avoiding per-call allocation overhead.

Returns: A Lugh::MemoryPool object.

Example:

my $pool = $inference->create_memory_pool();

# Efficient repeated forward passes
for my $text (@texts) {
    my @tokens = $tokenizer->encode($text);
    my @logits = $inference->forward_with_pool($pool, \@tokens);
    # Process logits...
}

# Pool automatically cleaned up on destruction
# Or manually reset for next batch:
$pool->reset();

forward_with_pool

my @logits = $inference->forward_with_pool($pool, \@tokens);

Runs forward pass using a pre-allocated memory pool. More efficient than forward() for repeated inference on different inputs.

Parameters:

  • $pool - A Lugh::MemoryPool from create_memory_pool()

  • \@tokens - Array reference of token IDs

Returns: A list of logits (one per vocabulary token).

Example:

my $pool = $inference->create_memory_pool();

# Much more efficient than calling forward() repeatedly
for my $prompt (@prompts) {
    my @tokens = $tokenizer->encode($prompt);
    my @logits = $inference->forward_with_pool($pool, \@tokens);
    my $next_token = $inference->sample_top_p(\@logits, top_p => 0.9);
    print $tokenizer->decode([$next_token]), "\n";
}

forward_batch

my $results = $inference->forward_batch(\@sequences);

Processes multiple token sequences, returning logits for each. Each sequence is processed independently with shared backend resources.

Parameters:

  • \@sequences - Array reference of array references of token IDs

Returns: Array reference of array references of logits.

Example:

my @seq1 = $tokenizer->encode("Hello");
my @seq2 = $tokenizer->encode("World");
my @seq3 = $tokenizer->encode("Test");

my $results = $inference->forward_batch([\@seq1, \@seq2, \@seq3]);

# $results->[0] is logits for seq1
# $results->[1] is logits for seq2
# $results->[2] is logits for seq3

for my $i (0 .. $#$results) {
    my @logits = @{$results->[$i]};
    my $next = $inference->sample_top_p(\@logits, top_p => 0.9);
    print "Sequence $i next token: ", $tokenizer->decode([$next]), "\n";
}

MEMORY POOL

The Lugh::MemoryPool class provides reusable compute resources:

reset

$pool->reset();

Resets the memory pool for reuse. Frees and reallocates the compute context. Called automatically by forward_with_pool().

Returns: True on success.

backend

my $backend_name = $pool->backend();

Returns the name of the backend used by this pool (e.g., "Metal", "CPU").

THREAD SAFETY

Lugh::Inference objects are NOT thread-safe. Each Perl thread must create its own Inference object.
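
For example, with ithreads each worker should construct its own objects inside the thread body (a sketch; it assumes each thread can afford to load the model itself and that the underlying library tolerates concurrent, non-shared use):

use threads;
use Lugh;

my @workers = map {
    threads->create(sub {
        # Per-thread objects: never share an Inference object across threads
        my $model     = Lugh::Model->new(model => 'model.gguf');
        my $tokenizer = Lugh::Tokenizer->new(model => $model);
        my $inference = Lugh::Inference->new(model => $model);

        my @tokens = $tokenizer->encode("Hello from thread " . threads->tid());
        my @logits = $inference->forward(\@tokens);
        return scalar @logits;
    });
} 1 .. 2;

$_->join for @workers;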

SEE ALSO

Lugh, Lugh::Model, Lugh::Tokenizer

https://arxiv.org/abs/1706.03762 - "Attention Is All You Need"

https://arxiv.org/abs/2104.09864 - RoPE paper

https://arxiv.org/abs/2002.05202 - SwiGLU activation

https://arxiv.org/abs/2305.13245 - GQA paper

AUTHOR

lnation <email@lnation.org>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
