NAME

Lugh::Inference - Transformer Forward Pass and Token Generation

VERSION

Version 0.01

SYNOPSIS

use Lugh;

# Load model and create inference engine
my $model = Lugh::Model->new(model => '/path/to/model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);

# Encode a prompt
my @tokens = $tokenizer->encode("The capital of France is");

# Run forward pass to get logits
my @logits = $inference->forward(\@tokens);

# Find the most likely next token (greedy)
my $max_idx = 0;
my $max_val = $logits[0];
for my $i (1..$#logits) {
    if ($logits[$i] > $max_val) {
        $max_val = $logits[$i];
        $max_idx = $i;
    }
}

# Decode the predicted token
my $next_token = $tokenizer->decode([$max_idx]);
print "Next token: $next_token\n";  # " Paris"

# Or use top-p sampling
my $sampled = $inference->sample_top_p(
    \@logits,
    temperature => 0.8,
    top_p => 0.95
);

DESCRIPTION

Lugh::Inference implements the transformer forward pass for autoregressive language model inference. Given a sequence of input tokens, it computes the probability distribution (logits) over the vocabulary for the next token.

Transformer Architecture

The forward pass implements the standard transformer decoder architecture used by LLaMA, Mistral, and similar models:

Input Tokens
     │
     ▼
┌─────────────┐
│  Token      │
│  Embeddings │
└─────────────┘
     │
     ▼
┌─────────────────────────────────┐
│  Transformer Layer (× N)        │
│  ┌─────────────────────────────┐│
│  │ RMSNorm                     ││
│  │      ↓                      ││
│  │ Multi-Head Attention (GQA)  ││
│  │      ↓                      ││
│  │ Residual Add                ││
│  │      ↓                      ││
│  │ RMSNorm                     ││
│  │      ↓                      ││
│  │ FFN (SwiGLU)                ││
│  │      ↓                      ││
│  │ Residual Add                ││
│  └─────────────────────────────┘│
└─────────────────────────────────┘
     │
     ▼
┌─────────────┐
│  Final      │
│  RMSNorm    │
└─────────────┘
     │
     ▼
┌─────────────┐
│  Output     │
│  Projection │ → Logits [vocab_size]
└─────────────┘

Key Components

  • RMSNorm - Root Mean Square Layer Normalization

  • RoPE - Rotary Position Embeddings for position encoding

  • GQA - Grouped Query Attention (multiple Q heads per KV head)

  • SwiGLU - Gated activation in feed-forward network
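
For reference, here is a minimal pure-Perl sketch of the RMSNorm step (illustrative only; inside the module this runs as part of the ggml compute graph, and the weight vector shown here is a placeholder):

use List::Util qw(sum);

# y_i = x_i / sqrt(mean(x^2) + eps) * w_i
sub rms_norm {
    my ($x, $w, $eps) = @_;          # $x, $w: array refs of equal length
    $eps //= 1e-5;
    my $mean_sq = sum(map { $_ * $_ } @$x) / @$x;
    my $scale   = 1 / sqrt($mean_sq + $eps);
    return [ map { $x->[$_] * $scale * $w->[$_] } 0 .. $#$x ];
}

my $normed = rms_norm([0.5, -1.2, 3.0], [1.0, 1.0, 1.0]);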

CONSTRUCTOR

new

my $inference = Lugh::Inference->new(
    model      => $model,
    backend    => 'auto',  # optional, compute backend
    n_threads  => 4,       # optional, number of CPU threads
    flash_attn => 0,       # optional, use flash attention
);

Creates a new Inference engine from a loaded model.

Parameters:

  • model (required) - A Lugh::Model object

  • backend (optional) - Compute backend to use. Defaults to 'auto'.

    Available backends depend on your system:

    • 'auto' - Automatically select the best available (GPU preferred)

    • 'Metal' - Apple Metal GPU (macOS only)

    • 'CUDA' - NVIDIA GPU (if ggml built with CUDA)

    • 'Vulkan' - Cross-platform GPU

    • 'CPU' - CPU with SIMD (always available)

    Use Lugh::available_backends() to see what's available on your system.

  • n_threads (optional) - Number of CPU threads for computation. Defaults to 4. Only affects CPU backend.

  • flash_attn (optional) - Use flash attention if set to 1 (default: 0)

Returns: A Lugh::Inference object.

Throws: Dies if no model is provided or if requested backend is unavailable.

Example:

my $model = Lugh::Model->new(model => 'model.gguf');

# Auto-select best backend (recommended)
my $inference = Lugh::Inference->new(model => $model);

# Force Metal GPU on macOS
my $gpu_inference = Lugh::Inference->new(
    model   => $model,
    backend => 'Metal',
);

# Force CPU with 8 threads
my $cpu_inference = Lugh::Inference->new(
    model     => $model,
    backend   => 'CPU',
    n_threads => 8,
);

METHODS

forward

my @logits = $inference->forward(\@tokens);

Runs the transformer forward pass on input tokens.

Parameters:

  • \@tokens - Array reference of token IDs (integers)

Returns: A list of logits (one per vocabulary token).

Details:

The forward pass:

1. Looks up token embeddings
2. Applies N transformer layers with attention and FFN
3. Applies final normalization
4. Projects to vocabulary size
5. Returns logits for the last token position

Performance Notes:

  • Each call creates a new computation graph

  • Memory is allocated and freed for each call

  • For multi-token generation, consider batching

Example:

my @tokens = (1, 450, 7483, 310, 3444, 338);  # "The capital of France is"
my @logits = $inference->forward(\@tokens);

# logits has 32000 elements (vocab size)
print "Vocab size: ", scalar(@logits), "\n";

sample_top_p

my $token_id = $inference->sample_top_p(
    \@logits,
    temperature => 0.8,
    top_p => 0.95
);

Samples a token from logits using nucleus (top-p) sampling.

Parameters:

  • \@logits - Array reference of logits from forward()

  • temperature - Sampling temperature (default: 0.8)

  • top_p - Cumulative probability threshold (default: 0.95)

Returns: A single token ID.

Algorithm:

1. Apply temperature scaling: logit / temperature
2. Convert to probabilities via softmax
3. Sort tokens by probability (descending)
4. Keep tokens until cumulative probability >= top_p
5. Randomly sample from this "nucleus"
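
The steps above can be sketched in plain Perl. This is an illustration of the algorithm, not the module's internal implementation:

use List::Util qw(max sum);

# Nucleus (top-p) sampling over raw logits, step by step
sub nucleus_sample {
    my ($logits, $temperature, $top_p) = @_;

    # 1. Temperature scaling  2. Softmax (shift by max for stability)
    my @scaled = map { $_ / $temperature } @$logits;
    my $m      = max(@scaled);
    my @exps   = map { exp($_ - $m) } @scaled;
    my $z      = sum(@exps);
    my @probs  = map { $_ / $z } @exps;

    # 3. Sort token ids by probability, descending
    my @order = sort { $probs[$b] <=> $probs[$a] } 0 .. $#probs;

    # 4. Keep tokens until the cumulative probability reaches top_p
    my @nucleus;
    my $cum = 0;
    for my $id (@order) {
        push @nucleus, $id;
        $cum += $probs[$id];
        last if $cum >= $top_p;
    }

    # 5. Sample from the nucleus (renormalised by $cum)
    my $r   = rand($cum);
    my $acc = 0;
    for my $id (@nucleus) {
        $acc += $probs[$id];
        return $id if $r <= $acc;
    }
    return $nucleus[-1];
}

my @toy_logits = (2.0, 1.0, 0.5, -1.0);
my $token_id   = nucleus_sample(\@toy_logits, 0.8, 0.95);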

Temperature Effects:

  • temperature < 1 - More deterministic (sharper distribution)

  • temperature = 1 - Use raw probabilities

  • temperature > 1 - More random (flatter distribution)

  • temperature → 0 - Approaches greedy (argmax)

Top-p Effects:

  • top_p = 0.1 - Very focused, only most likely tokens

  • top_p = 0.9 - Typical setting, good balance

  • top_p = 1.0 - Consider all tokens (no truncation; sample from the full distribution)

Example:

my @logits = $inference->forward(\@tokens);

# Creative generation
my $token = $inference->sample_top_p(\@logits,
    temperature => 1.0,
    top_p => 0.95
);

# More focused generation
$token = $inference->sample_top_p(\@logits,
    temperature => 0.3,
    top_p => 0.5
);

sample_top_k

my $token_id = $inference->sample_top_k(
    \@logits,
    temperature => 0.8,
    top_k => 40
);

Samples a token from logits using top-k sampling.

Parameters:

  • \@logits - Array reference of logits from forward()

  • temperature - Sampling temperature (default: 0.8)

  • top_k - Number of top tokens to consider (default: 40)

Returns: A single token ID.

Algorithm:

1. Apply temperature scaling: logit / temperature
2. Convert to probabilities via softmax
3. Select top-k tokens by probability
4. Renormalize probabilities
5. Randomly sample from top-k tokens

Example:

my @logits = $inference->forward(\@tokens);

# Sample from top 50 tokens
my $token = $inference->sample_top_k(\@logits,
    temperature => 0.9,
    top_k => 50
);

generate

my @tokens = $inference->generate(
    \@prompt_tokens,
    max_tokens  => 100,
    temperature => 0.8,
    top_p       => 0.95,
    top_k       => 40,
    greedy      => 0,
    eos_token   => 2,
    callback    => sub { ... },
);

Generates multiple tokens autoregressively from a prompt.

Parameters:

  • \@prompt_tokens (required) - Array reference of prompt token IDs

  • max_tokens - Maximum tokens to generate (default: 128)

  • temperature - Sampling temperature (default: 0.8)

  • top_p - Top-p (nucleus) sampling threshold (default: 0.95)

  • top_k - Top-k sampling limit (default: 40). If top_k is less than 1000, top-k sampling is used; otherwise top-p (nucleus) sampling is used

  • greedy - If true, use greedy decoding (argmax) (default: 0)

  • eos_token - Token ID to stop generation (default: from model, typically 2)

  • callback - Optional subroutine called for each generated token

Returns: A list of generated token IDs (not including the prompt).

Callback:

The callback receives (token_id, count) and should return true to stop generation:

callback => sub {
    my ($token, $count) = @_;
    print $tokenizer->decode([$token]);
    return 0;  # Continue (return 1 to stop)
}

Stopping Conditions:

Generation stops when:

  • max_tokens is reached

  • EOS token is generated

  • Callback returns true

Example:

use Lugh;

my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);

my @prompt = $tokenizer->encode("Once upon a time");

# Greedy generation
my @tokens = $inference->generate(\@prompt,
    max_tokens => 50,
    greedy     => 1,
);
print $tokenizer->decode(\@tokens);

# Creative generation with streaming
@tokens = $inference->generate(\@prompt,
    max_tokens  => 100,
    temperature => 1.0,
    top_p       => 0.95,
    callback    => sub {
        my ($tok, $n) = @_;
        print $tokenizer->decode([$tok]);
        STDOUT->flush();
        return 0;
    },
);

ATTENTION MECHANISM

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Where:

  • Q - Query vectors [head_dim, n_tokens, n_heads]

  • K - Key vectors [head_dim, n_tokens, n_kv_heads]

  • V - Value vectors [head_dim, n_tokens, n_kv_heads]

  • d_k - Head dimension (typically 64-128)
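
As a toy illustration of the formula, the following pure-Perl snippet computes attention for a single query vector against three key/value vectors (made-up numbers; the module performs this on ggml tensors across all heads and positions):

use List::Util qw(max sum);

sub dot { my ($u, $v) = @_; sum(map { $u->[$_] * $v->[$_] } 0 .. $#$u) }

# One query against three key/value vectors, head_dim = 2
my $q = [1.0, 0.0];
my @K = ([1.0, 0.0], [0.0, 1.0], [0.7, 0.7]);
my @V = ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]);

# Scaled scores: q . k / sqrt(d_k)
my $d_k    = scalar @$q;
my @scores = map { dot($q, $_) / sqrt($d_k) } @K;

# Softmax over the scores
my $m = max(@scores);
my @e = map { exp($_ - $m) } @scores;
my $z = sum(@e);
my @w = map { $_ / $z } @e;

# Output is the probability-weighted sum of the value vectors
my @out = (0, 0);
for my $i (0 .. $#V) {
    $out[$_] += $w[$i] * $V[$i][$_] for 0 .. $#out;
}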

Grouped Query Attention (GQA)

GQA uses fewer KV heads than query heads to reduce memory:

Model       n_head  n_kv_head  Ratio
LLaMA 7B    32      32         1:1 (MHA)
LLaMA 2 70B 64      8          8:1 (GQA)
TinyLlama   32      4          8:1 (GQA)
Mistral 7B  32      8          4:1 (GQA)

The implementation broadcasts KV heads to match query heads using ggml's native broadcasting.
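
Conceptually, query head i reads the keys and values of KV head int(i / (n_head / n_kv_head)). A small sketch of that mapping (illustrative only):

my $n_head    = 32;
my $n_kv_head = 8;                       # Mistral-7B-style 4:1 grouping
my $group     = $n_head / $n_kv_head;    # query heads per KV head

for my $q_head (0 .. $n_head - 1) {
    my $kv_head = int($q_head / $group);
    # query head $q_head attends over keys/values of KV head $kv_head
}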

Causal Masking

The attention uses causal (autoregressive) masking so each position can only attend to itself and previous positions:

Position:  0  1  2  3
0          ✓  ✗  ✗  ✗
1          ✓  ✓  ✗  ✗
2          ✓  ✓  ✓  ✗
3          ✓  ✓  ✓  ✓

This is implemented using ggml_diag_mask_inf which sets the upper triangle to -infinity before softmax.
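
The equivalent mask, written out naively in Perl (the module relies on ggml_diag_mask_inf rather than building an explicit matrix):

my $n_tokens = 4;
my $neg_inf  = -9**9**9;    # overflows to -Inf

my @mask;
for my $i (0 .. $n_tokens - 1) {
    for my $j (0 .. $n_tokens - 1) {
        # position $i may attend to position $j only when $j <= $i
        $mask[$i][$j] = $j <= $i ? 0 : $neg_inf;
    }
}
# The mask is added to the attention scores before softmax.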

RoPE (Rotary Position Embeddings)

Position information is encoded by rotating Q and K vectors:

RoPE(x, pos) = x × cos(pos × θ) + rotate(x) × sin(pos × θ)

Where θ depends on the dimension and base frequency (typically 10000).

Parameters are read from model metadata:

  • llama.rope.dimension_count - Dimensions to rotate

  • llama.rope.freq_base - Base frequency

  • llama.context_length - Original context length
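
A pure-Perl sketch of the rotation for one query or key vector, pairing dimension 2i with 2i+1 (classic RoPE pairing; the freq_base default of 10000 here mirrors the metadata described above, and the real values come from the model file):

sub rope_rotate {
    my ($x, $pos, $freq_base) = @_;       # $x: array ref of even length
    $freq_base //= 10000;
    my $d = scalar @$x;
    for (my $i = 0; $i < $d; $i += 2) {
        my $theta = $pos * $freq_base ** (-$i / $d);
        my ($x0, $x1) = ($x->[$i], $x->[$i + 1]);
        $x->[$i]     = $x0 * cos($theta) - $x1 * sin($theta);
        $x->[$i + 1] = $x0 * sin($theta) + $x1 * cos($theta);
    }
    return $x;
}

my $rotated = rope_rotate([0.1, 0.2, 0.3, 0.4], 5);   # position 5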

FEED-FORWARD NETWORK

The FFN uses SwiGLU activation:

FFN(x) = down(SiLU(gate(x)) × up(x))

Where:

  • gate, up - Linear projections to intermediate dimension

  • SiLU - Sigmoid Linear Unit: x × sigmoid(x)

  • down - Linear projection back to model dimension
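
A minimal sketch of this computation for one token, with the projections written as plain Perl matrices (illustrative only; in the module the same graph runs over quantized ggml tensors):

use List::Util qw(sum);

sub silu { my ($v) = @_; $v / (1 + exp(-$v)) }    # x * sigmoid(x)

sub matvec {                                      # rows of $m dotted with $x
    my ($m, $x) = @_;
    return [ map { my $row = $_;
                   sum(map { $row->[$_] * $x->[$_] } 0 .. $#$x) } @$m ];
}

sub ffn_swiglu {
    my ($x, $w_gate, $w_up, $w_down) = @_;
    my $g = matvec($w_gate, $x);                  # [ffn_dim]
    my $u = matvec($w_up,   $x);                  # [ffn_dim]
    my $h = [ map { silu($g->[$_]) * $u->[$_] } 0 .. $#$g ];
    return matvec($w_down, $h);                   # back to [n_embd]
}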

Typical dimensions:

Model       n_embd  FFN_dim   Ratio
TinyLlama   2048    5632      2.75×
LLaMA 7B    4096    11008     2.69×
LLaMA 13B   5120    13824     2.70×

GENERATION LOOP

The generate() method handles the complete generation loop internally. For simple use cases:

use Lugh;

my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);

my @prompt = $tokenizer->encode("Once upon a time");
my @generated = $inference->generate(\@prompt,
    max_tokens  => 100,
    temperature => 0.8,
    top_p       => 0.95,
);
print $tokenizer->decode(\@generated);

For streaming output:

my @generated = $inference->generate(\@prompt,
    max_tokens  => 100,
    temperature => 0.8,
    callback    => sub {
        my ($token, $count) = @_;
        print $tokenizer->decode([$token]);
        STDOUT->flush();
        return 0;  # Continue
    },
);

For manual control (building your own loop):

my @tokens = $tokenizer->encode($prompt);
my @generated;

for (1..$max_tokens) {
    my @logits = $inference->forward(\@tokens);
    my $next = $inference->sample_top_p(\@logits,
        temperature => 0.8,
        top_p => 0.9
    );
    
    last if $next == $tokenizer->eos_id;
    
    push @tokens, $next;
    push @generated, $next;
    
    print $tokenizer->decode([$next]);
    STDOUT->flush();
}

PERFORMANCE

Computation

A single forward pass performs approximately:

FLOPs ≈ 2 × n_params × n_tokens

For TinyLlama (1.1B params) with 6 tokens:

2 × 1.1e9 × 6 ≈ 13 GFLOPs
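
The same back-of-the-envelope arithmetic in Perl:

my $n_params = 1.1e9;    # TinyLlama
my $n_tokens = 6;
my $gflops   = 2 * $n_params * $n_tokens / 1e9;
printf "~%.0f GFLOPs per forward pass\n", $gflops;    # ~13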

Memory

During inference, memory is needed for:

  • Model weights (quantized) - Depends on model size and quantization

  • Activations - O(n_tokens × n_embd × n_layers)

  • Attention scores - O(n_tokens² × n_heads × n_layers)

Optimizations

Current implementation:

  • Uses ggml's Metal GPU backend on macOS

  • Uses Accelerate BLAS for matrix operations

  • Quantized weights stay quantized during computation

  • KV cache for incremental decoding (see create_kv_cache)

  • Memory pools for efficient repeated inference

  • Batch processing for multiple sequences

ADVANCED METHODS

create_memory_pool

my $pool = $inference->create_memory_pool();

Creates a reusable memory pool for efficient repeated forward passes. The pool caches backend and allocator resources, avoiding per-call allocation overhead.

Returns: A Lugh::MemoryPool object.

Example:

my $pool = $inference->create_memory_pool();

# Efficient repeated forward passes
for my $text (@texts) {
    my @tokens = $tokenizer->encode($text);
    my @logits = $inference->forward_with_pool($pool, \@tokens);
    # Process logits...
}

# Pool automatically cleaned up on destruction
# Or manually reset for next batch:
$pool->reset();

forward_with_pool

my @logits = $inference->forward_with_pool($pool, \@tokens);

Runs forward pass using a pre-allocated memory pool. More efficient than forward() for repeated inference on different inputs.

Parameters:

  • $pool - A Lugh::MemoryPool from create_memory_pool()

  • \@tokens - Array reference of token IDs

Returns: A list of logits (one per vocabulary token).

Example:

my $pool = $inference->create_memory_pool();

# Much more efficient than calling forward() repeatedly
for my $prompt (@prompts) {
    my @tokens = $tokenizer->encode($prompt);
    my @logits = $inference->forward_with_pool($pool, \@tokens);
    my $next_token = $inference->sample_top_p(\@logits, top_p => 0.9);
    print $tokenizer->decode([$next_token]), "\n";
}

forward_batch

my $results = $inference->forward_batch(\@sequences);

Processes multiple token sequences, returning logits for each. Each sequence is processed independently with shared backend resources.

Parameters:

  • \@sequences - Array reference of array references of token IDs

Returns: Array reference of array references of logits.

Example:

my @seq1 = $tokenizer->encode("Hello");
my @seq2 = $tokenizer->encode("World");
my @seq3 = $tokenizer->encode("Test");

my $results = $inference->forward_batch([\@seq1, \@seq2, \@seq3]);

# $results->[0] is logits for seq1
# $results->[1] is logits for seq2
# $results->[2] is logits for seq3

for my $i (0 .. $#$results) {
    my @logits = @{$results->[$i]};
    my $next = $inference->sample_top_p(\@logits, top_p => 0.9);
    print "Sequence $i next token: ", $tokenizer->decode([$next]), "\n";
}

MEMORY POOL

The Lugh::MemoryPool class provides reusable compute resources:

reset

$pool->reset();

Resets the memory pool for reuse. Frees and reallocates the compute context. Called automatically by forward_with_pool().

Returns: True on success.

backend

my $backend_name = $pool->backend();

Returns the name of the backend used by this pool (e.g., "Metal", "CPU").

THREAD SAFETY

Lugh::Inference objects are NOT thread-safe. Each Perl thread must create its own Inference object.
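
For example, with ithreads each worker should construct its own objects inside the thread body (a sketch; it assumes each thread can afford to load the model itself and that the underlying library tolerates concurrent, non-shared use):

use threads;
use Lugh;

my @workers = map {
    threads->create(sub {
        # Per-thread objects: never share an Inference object across threads
        my $model     = Lugh::Model->new(model => 'model.gguf');
        my $tokenizer = Lugh::Tokenizer->new(model => $model);
        my $inference = Lugh::Inference->new(model => $model);

        my @tokens = $tokenizer->encode("Hello from thread " . threads->tid());
        my @logits = $inference->forward(\@tokens);
        return scalar @logits;
    });
} 1 .. 2;

$_->join for @workers;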

SEE ALSO

Lugh, Lugh::Model, Lugh::Tokenizer

https://arxiv.org/abs/1706.03762 - "Attention Is All You Need"

https://arxiv.org/abs/2104.09864 - RoPE paper

https://arxiv.org/abs/2002.05202 - SwiGLU activation

https://arxiv.org/abs/2305.13245 - GQA paper

AUTHOR

lnation <email@lnation.org>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
