NAME
Lugh::Inference - Transformer Forward Pass and Token Generation
VERSION
Version 0.01
SYNOPSIS
use Lugh;
# Load model and create inference engine
my $model = Lugh::Model->new(model => '/path/to/model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);
# Encode a prompt
my @tokens = $tokenizer->encode("The capital of France is");
# Run forward pass to get logits
my @logits = $inference->forward(\@tokens);
# Find the most likely next token (greedy)
my $max_idx = 0;
my $max_val = $logits[0];
for my $i (1..$#logits) {
    if ($logits[$i] > $max_val) {
        $max_val = $logits[$i];
        $max_idx = $i;
    }
}
# Decode the predicted token
my $next_token = $tokenizer->decode([$max_idx]);
print "Next token: $next_token\n"; # " Paris"
# Or use top-p sampling
my $sampled = $inference->sample_top_p(
    \@logits,
    temperature => 0.8,
    top_p => 0.95
);
DESCRIPTION
Lugh::Inference implements the transformer forward pass for autoregressive language model inference. Given a sequence of input tokens, it computes the probability distribution (logits) over the vocabulary for the next token.
Transformer Architecture
The forward pass implements the standard transformer decoder architecture used by LLaMA, Mistral, and similar models:
Input Tokens
│
▼
┌─────────────┐
│ Token │
│ Embeddings │
└─────────────┘
│
▼
┌─────────────────────────────────┐
│ Transformer Layer (× N) │
│ ┌─────────────────────────────┐│
│ │ RMSNorm ││
│ │ ↓ ││
│ │ Multi-Head Attention (GQA) ││
│ │ ↓ ││
│ │ Residual Add ││
│ │ ↓ ││
│ │ RMSNorm ││
│ │ ↓ ││
│ │ FFN (SwiGLU) ││
│ │ ↓ ││
│ │ Residual Add ││
│ └─────────────────────────────┘│
└─────────────────────────────────┘
│
▼
┌─────────────┐
│ Final │
│ RMSNorm │
└─────────────┘
│
▼
┌─────────────┐
│ Output │
│ Projection │ → Logits [vocab_size]
└─────────────┘
Key Components
RMSNorm - Root Mean Square Layer Normalization
RoPE - Rotary Position Embeddings for position encoding
GQA - Grouped Query Attention (multiple Q heads per KV head)
SwiGLU - Gated activation in feed-forward network
CONSTRUCTOR
new
my $inference = Lugh::Inference->new(
    model => $model,
    backend => 'auto', # optional, compute backend
    n_threads => 4, # optional, number of CPU threads
    flash_attn => 0, # optional, use flash attention
);
Creates a new Inference engine from a loaded model.
Parameters:
model (required) - A Lugh::Model object
backend (optional) - Compute backend to use. Defaults to 'auto'. Available backends depend on your system:
'auto' - Automatically select the best available (GPU preferred)
'Metal' - Apple Metal GPU (macOS only)
'CUDA' - NVIDIA GPU (if ggml built with CUDA)
'Vulkan' - Cross-platform GPU
'CPU' - CPU with SIMD (always available)
Use Lugh::available_backends() to see what's available on your system.
n_threads (optional) - Number of CPU threads for computation. Defaults to 4. Only affects the CPU backend.
flash_attn (optional) - Use flash attention if set to 1 (default: 0)
Returns: A Lugh::Inference object.
Throws: Dies if no model is provided or if requested backend is unavailable.
Example:
my $model = Lugh::Model->new(model => 'model.gguf');
# Auto-select best backend (recommended)
my $inference = Lugh::Inference->new(model => $model);
# Force Metal GPU on macOS
my $gpu_inference = Lugh::Inference->new(
    model => $model,
    backend => 'Metal',
);
# Force CPU with 8 threads
my $cpu_inference = Lugh::Inference->new(
    model => $model,
    backend => 'CPU',
    n_threads => 8,
);
METHODS
forward
my @logits = $inference->forward(\@tokens);
Runs the transformer forward pass on input tokens.
Parameters:
\@tokens - Array reference of token IDs (integers)
Returns: A list of logits (one per vocabulary token).
Details:
The forward pass:
1. Looks up token embeddings
2. Applies N transformer layers with attention and FFN
3. Applies final normalization
4. Projects to vocabulary size
5. Returns logits for the last token position
Performance Notes:
Each call creates a new computation graph
Memory is allocated and freed for each call
For multi-token generation, consider batching
Example:
my @tokens = (1, 450, 7483, 310, 3444, 338); # "The capital of France is"
my @logits = $inference->forward(\@tokens);
# logits has 32000 elements (vocab size)
print "Vocab size: ", scalar(@logits), "\n";
sample_top_p
my $token_id = $inference->sample_top_p(
    \@logits,
    temperature => 0.8,
    top_p => 0.95
);
Samples a token from logits using nucleus (top-p) sampling.
Parameters:
\@logits - Array reference of logits from forward()
temperature - Sampling temperature (default: 0.8)
top_p - Cumulative probability threshold (default: 0.95)
Returns: A single token ID.
Algorithm:
1. Apply temperature scaling: logit / temperature
2. Convert to probabilities via softmax
3. Sort tokens by probability (descending)
4. Keep tokens until cumulative probability >= top_p
5. Randomly sample from this "nucleus"
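For illustration, a minimal pure-Perl sketch of the same procedure (this is not the module's implementation, which runs natively):

use List::Util qw(sum);

sub nucleus_sample {
    my ($logits, %opt) = @_;
    my $temp  = $opt{temperature} // 0.8;
    my $top_p = $opt{top_p}       // 0.95;

    # Steps 1-2: temperature scaling, then a numerically stable softmax
    my @scaled = map { $_ / $temp } @$logits;
    my $max    = (sort { $b <=> $a } @scaled)[0];
    my @exp    = map { exp($_ - $max) } @scaled;
    my $z      = sum @exp;
    my @probs  = map { $_ / $z } @exp;

    # Step 3: token IDs ordered by probability, descending
    my @order = sort { $probs[$b] <=> $probs[$a] } 0 .. $#probs;

    # Step 4: keep tokens until the cumulative probability reaches top_p
    my @nucleus;
    my $cum = 0;
    for my $id (@order) {
        push @nucleus, $id;
        $cum += $probs[$id];
        last if $cum >= $top_p;
    }

    # Step 5: weighted random draw from the nucleus
    my $r = rand($cum);
    for my $id (@nucleus) {
        return $id if ($r -= $probs[$id]) <= 0;
    }
    return $nucleus[-1];    # guard against floating-point drift
}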
Temperature Effects:
temperature < 1 - More deterministic (sharper distribution)
temperature = 1 - Use raw probabilities
temperature > 1 - More random (flatter distribution)
temperature → 0 - Approaches greedy (argmax)
Top-p Effects:
top_p = 0.1 - Very focused, only most likely tokens
top_p = 0.9 - Typical setting, good balance
top_p = 1.0 - Consider all tokens (sample from the full distribution)
Example:
my @logits = $inference->forward(\@tokens);
# Creative generation
my $token = $inference->sample_top_p(\@logits,
    temperature => 1.0,
    top_p => 0.95
);
# More focused generation
$token = $inference->sample_top_p(\@logits,
    temperature => 0.3,
    top_p => 0.5
);
sample_top_k
my $token_id = $inference->sample_top_k(
    \@logits,
    temperature => 0.8,
    top_k => 40
);
Samples a token from logits using top-k sampling.
Parameters:
\@logits - Array reference of logits from forward()
temperature - Sampling temperature (default: 0.8)
top_k - Number of top tokens to consider (default: 40)
Returns: A single token ID.
Algorithm:
1. Apply temperature scaling: logit / temperature
2. Convert to probabilities via softmax
3. Select top-k tokens by probability
4. Renormalize probabilities
5. Randomly sample from top-k tokens
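The only difference from top-p is the truncation rule. A sketch of steps 3-5, assuming @probs was computed as in the top-p sketch above:

use List::Util qw(sum);

# Keep the $top_k most probable token IDs, then draw from their
# renormalized mass (@probs as in the top-p sketch above).
my @order = sort { $probs[$b] <=> $probs[$a] } 0 .. $#probs;
my @top   = @order[0 .. $top_k - 1];
my $mass  = sum @probs[@top];
my $token = $top[-1];              # fallback for floating-point drift
my $r     = rand($mass);
for my $id (@top) {
    if (($r -= $probs[$id]) <= 0) { $token = $id; last }
}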
Example:
my @logits = $inference->forward(\@tokens);
# Sample from top 50 tokens
my $token = $inference->sample_top_k(\@logits,
    temperature => 0.9,
    top_k => 50
);
generate
my @tokens = $inference->generate(
    \@prompt_tokens,
    max_tokens => 100,
    temperature => 0.8,
    top_p => 0.95,
    top_k => 40,
    greedy => 0,
    eos_token => 2,
    callback => sub { ... },
);
Generates multiple tokens autoregressively from a prompt.
Parameters:
\@prompt_tokens (required) - Array reference of prompt token IDs
max_tokens - Maximum tokens to generate (default: 128)
temperature - Sampling temperature (default: 0.8)
top_p - Top-p (nucleus) sampling threshold (default: 0.95)
top_k - Top-k sampling limit (default: 40). If < 1000, uses top_k; otherwise uses top_p
greedy - If true, use greedy decoding (argmax) (default: 0)
eos_token - Token ID to stop generation (default: from model, typically 2)
callback - Optional subroutine called for each generated token
Returns: A list of generated token IDs (not including the prompt).
Callback:
The callback receives (token_id, count) and should return true to stop generation:
callback => sub {
    my ($token, $count) = @_;
    print $tokenizer->decode([$token]);
    return 0; # Continue (return 1 to stop)
}
Stopping Conditions:
Generation stops when:
max_tokens is reached
EOS token is generated
Callback returns true
Example:
use Lugh;
my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);
my @prompt = $tokenizer->encode("Once upon a time");
# Greedy generation
my @tokens = $inference->generate(\@prompt,
    max_tokens => 50,
    greedy => 1,
);
print $tokenizer->decode(\@tokens);
# Creative generation with streaming
@tokens = $inference->generate(\@prompt,
    max_tokens => 100,
    temperature => 1.0,
    top_p => 0.95,
    callback => sub {
        my ($tok, $n) = @_;
        print $tokenizer->decode([$tok]);
        STDOUT->flush();
        return 0;
    },
);
ATTENTION MECHANISM
Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
Where:
Q - Query vectors [head_dim, n_tokens, n_heads]
K - Key vectors [head_dim, n_tokens, n_kv_heads]
V - Value vectors [head_dim, n_tokens, n_kv_heads]
d_k - Head dimension (typically 64-128)
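A minimal pure-Perl sketch of single-head causal attention over plain arrays (illustrative only; the module evaluates this inside ggml):

use List::Util qw(sum);

# $q, $k, $v are arrays of vectors: $q->[$i] is position $i's query.
sub attention {
    my ($q, $k, $v) = @_;
    my $d_k   = scalar @{ $q->[0] };
    my $scale = 1 / sqrt($d_k);
    my @out;
    for my $i (0 .. $#$q) {
        # scaled dot products against keys at positions 0..$i (causal)
        my @scores;
        for my $j (0 .. $i) {
            push @scores,
                $scale * sum map { $q->[$i][$_] * $k->[$j][$_] } 0 .. $d_k - 1;
        }
        # softmax over the unmasked scores
        my $max = (sort { $b <=> $a } @scores)[0];
        my @p   = map { exp($_ - $max) } @scores;
        my $z   = sum @p;
        @p = map { $_ / $z } @p;
        # attention-weighted sum of the value vectors
        my @row = (0) x $d_k;
        for my $j (0 .. $i) {
            $row[$_] += $p[$j] * $v->[$j][$_] for 0 .. $d_k - 1;
        }
        push @out, \@row;
    }
    return \@out;    # one output vector per position
}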
Grouped Query Attention (GQA)
GQA uses fewer KV heads than query heads to reduce memory:
Model         n_head   n_kv_head   Ratio
LLaMA 7B      32       32          1:1 (MHA)
LLaMA 2 70B   64       8           8:1 (GQA)
TinyLlama     32       4           8:1 (GQA)
Mistral 7B    32       8           4:1 (GQA)
The implementation broadcasts KV heads to match query heads using ggml's native broadcasting.
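Concretely, each KV head serves a contiguous block of n_head / n_kv_head query heads. A small sketch of the mapping:

# Illustrative query-head to KV-head mapping under GQA.
my ($n_head, $n_kv_head) = (32, 8);      # e.g. Mistral 7B
my $group = $n_head / $n_kv_head;        # query heads per KV head (4)
for my $q_head (0 .. $n_head - 1) {
    my $kv_head = int($q_head / $group);
    # query head $q_head attends with K/V from head $kv_head
}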
Causal Masking
The attention uses causal (autoregressive) masking so each position can only attend to itself and previous positions:
Position: 0 1 2 3
0 ✓ ✗ ✗ ✗
1 ✓ ✓ ✗ ✗
2 ✓ ✓ ✓ ✗
3 ✓ ✓ ✓ ✓
This is implemented using ggml_diag_mask_inf which sets the upper triangle to -infinity before softmax.
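As a plain-Perl sketch of the same effect (the module applies the mask in-graph, not in Perl):

# Set scores above the diagonal to -inf so softmax gives them zero
# weight; $scores->[$i][$j] is query position $i against key $j.
my $NEG_INF = -9**9**9;    # evaluates to -inf
for my $i (0 .. $#$scores) {
    for my $j ($i + 1 .. $#{ $scores->[$i] }) {
        $scores->[$i][$j] = $NEG_INF;
    }
}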
RoPE (Rotary Position Embeddings)
Position information is encoded by rotating Q and K vectors:
RoPE(x, pos) = x × cos(pos × θ) + rotate(x) × sin(pos × θ)
Where θ depends on the dimension and base frequency (typically 10000).
Parameters are read from model metadata:
llama.rope.dimension_count - Dimensions to rotate
llama.rope.freq_base - Base frequency
llama.context_length - Original context length
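As an illustration (not the module's code; implementations differ in how dimensions are paired), a sketch that rotates consecutive pairs of a head vector:

# Rotate pairs ($x[0],$x[1]), ($x[2],$x[3]), ... by a position- and
# dimension-dependent angle; $base is llama.rope.freq_base.
sub rope {
    my ($x, $pos, $base) = @_;
    $base //= 10000;
    my $d = scalar @$x;
    my @out;
    for (my $i = 0; $i < $d; $i += 2) {
        my $theta = $pos * $base ** (-$i / $d);
        my ($c, $s) = (cos $theta, sin $theta);
        push @out, $x->[$i] * $c - $x->[$i + 1] * $s,
                   $x->[$i] * $s + $x->[$i + 1] * $c;
    }
    return \@out;
}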
FEED-FORWARD NETWORK
The FFN uses SwiGLU activation:
FFN(x) = down(SiLU(gate(x)) × up(x))
Where:
gate, up - Linear projections to intermediate dimension
SiLU - Sigmoid Linear Unit: x × sigmoid(x)
down - Linear projection back to model dimension
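A minimal pure-Perl sketch of this computation on plain arrays (illustrative only; the real work happens on quantized tensors inside ggml):

# FFN(x) = down( SiLU(gate(x)) * up(x) ), element-wise product.
sub silu { my $v = shift; $v / (1 + exp(-$v)) }

sub matvec {    # naive matrix-vector product; $W is array-of-rows
    my ($W, $x) = @_;
    [ map { my $row = $_;
            my $s = 0; $s += $row->[$_] * $x->[$_] for 0 .. $#$x; $s }
      @$W ];
}

sub ffn {
    my ($x, $W_gate, $W_up, $W_down) = @_;
    my $g = matvec($W_gate, $x);    # [ffn_dim]
    my $u = matvec($W_up,   $x);    # [ffn_dim]
    my @h = map { silu($g->[$_]) * $u->[$_] } 0 .. $#$g;
    return matvec($W_down, \@h);    # back to [n_embd]
}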
Typical dimensions:
Model       n_embd   FFN_dim   Ratio
TinyLlama   2048     5632      2.75×
LLaMA 7B    4096     11008     2.69×
LLaMA 13B   5120     13824     2.70×
GENERATION LOOP
The generate() method handles the complete generation loop internally. For simple use cases:
use Lugh;
my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);
my @prompt = $tokenizer->encode("Once upon a time");
my @generated = $inference->generate(\@prompt,
    max_tokens => 100,
    temperature => 0.8,
    top_p => 0.95,
);
print $tokenizer->decode(\@generated);
For streaming output:
my @generated = $inference->generate(\@prompt,
    max_tokens => 100,
    temperature => 0.8,
    callback => sub {
        my ($token, $count) = @_;
        print $tokenizer->decode([$token]);
        STDOUT->flush();
        return 0; # Continue
    },
);
For manual control (building your own loop):
my @tokens = $tokenizer->encode($prompt);
my @generated;
for (1..$max_tokens) {
    my @logits = $inference->forward(\@tokens);
    my $next = $inference->sample_top_p(\@logits,
        temperature => 0.8,
        top_p => 0.9
    );
    last if $next == $tokenizer->eos_id;
    push @tokens, $next;
    push @generated, $next;
    print $tokenizer->decode([$next]);
    STDOUT->flush();
}
PERFORMANCE
Computation
A single forward pass performs approximately:
FLOPs ≈ 2 × n_params × n_tokens
For TinyLlama (1.1B params) with 6 tokens:
2 × 1.1e9 × 6 ≈ 13 GFLOPs
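The same estimate as a quick calculation:

# Back-of-envelope FLOP estimate for one forward pass.
my ($n_params, $n_tokens) = (1.1e9, 6);    # TinyLlama, 6-token prompt
my $gflops = 2 * $n_params * $n_tokens / 1e9;
printf "~%.0f GFLOPs per forward pass\n", $gflops;    # ~13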
Memory
During inference, memory is needed for:
Model weights (quantized) - Depends on model size and quantization
Activations - O(n_tokens × n_embd × n_layers)
Attention scores - O(n_tokens² × n_heads × n_layers)
Optimizations
Current implementation:
Uses ggml's Metal GPU backend on macOS
Uses Accelerate BLAS for matrix operations
Quantized weights stay quantized during computation
KV cache for incremental decoding (see create_kv_cache)
Memory pools for efficient repeated inference
Batch processing for multiple sequences
ADVANCED METHODS
create_memory_pool
my $pool = $inference->create_memory_pool();
Creates a reusable memory pool for efficient repeated forward passes. The pool caches backend and allocator resources, avoiding per-call allocation overhead.
Returns: A Lugh::MemoryPool object.
Example:
my $pool = $inference->create_memory_pool();
# Efficient repeated forward passes
for my $text (@texts) {
    my @tokens = $tokenizer->encode($text);
    my @logits = $inference->forward_with_pool($pool, \@tokens);
    # Process logits...
}
# Pool automatically cleaned up on destruction
# Or manually reset for next batch:
$pool->reset();
forward_with_pool
my @logits = $inference->forward_with_pool($pool, \@tokens);
Runs forward pass using a pre-allocated memory pool. More efficient than forward() for repeated inference on different inputs.
Parameters:
$pool - A Lugh::MemoryPool from create_memory_pool()
\@tokens - Array reference of token IDs
Returns: A list of logits (one per vocabulary token).
Example:
my $pool = $inference->create_memory_pool();
# Much more efficient than calling forward() repeatedly
for my $prompt (@prompts) {
    my @tokens = $tokenizer->encode($prompt);
    my @logits = $inference->forward_with_pool($pool, \@tokens);
    my $next_token = $inference->sample_top_p(\@logits, top_p => 0.9);
    print $tokenizer->decode([$next_token]), "\n";
}
forward_batch
my $results = $inference->forward_batch(\@sequences);
Processes multiple token sequences, returning logits for each. Each sequence is processed independently with shared backend resources.
Parameters:
\@sequences- Array reference of array references of token IDs
Returns: Array reference of array references of logits.
Example:
my @seq1 = $tokenizer->encode("Hello");
my @seq2 = $tokenizer->encode("World");
my @seq3 = $tokenizer->encode("Test");
my $results = $inference->forward_batch([\@seq1, \@seq2, \@seq3]);
# $results->[0] is logits for seq1
# $results->[1] is logits for seq2
# $results->[2] is logits for seq3
for my $i (0 .. $#$results) {
    my @logits = @{$results->[$i]};
    my $next = $inference->sample_top_p(\@logits, top_p => 0.9);
    print "Sequence $i next token: ", $tokenizer->decode([$next]), "\n";
}
MEMORY POOL
The Lugh::MemoryPool class provides reusable compute resources:
reset
$pool->reset();
Resets the memory pool for reuse. Frees and reallocates the compute context. Called automatically by forward_with_pool().
Returns: True on success.
backend
my $backend_name = $pool->backend();
Returns the name of the backend used by this pool (e.g., "Metal", "CPU").
THREAD SAFETY
Lugh::Inference objects are NOT thread-safe. Each Perl thread must create its own Inference object.
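A minimal sketch of the per-thread pattern, assuming Perl ithreads (illustrative only):

use threads;

# Each thread builds its own Model and Inference; never share them.
my @workers = map {
    threads->create(sub {
        my $model     = Lugh::Model->new(model => 'model.gguf');
        my $inference = Lugh::Inference->new(model => $model);
        # ... run inference with this thread's private objects ...
    });
} 1 .. 4;
$_->join for @workers;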
SEE ALSO
Lugh, Lugh::Model, Lugh::Tokenizer
https://arxiv.org/abs/1706.03762 - "Attention Is All You Need"
https://arxiv.org/abs/2104.09864 - RoPE paper
https://arxiv.org/abs/2002.05202 - SwiGLU activation
https://arxiv.org/abs/2305.13245 - GQA paper
AUTHOR
lnation <email@lnation.org>
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.