NAME
Lugh - Pure C LLM Inference Engine for Perl (built on ggml)
VERSION
Version 0.06
SYNOPSIS
use Lugh;
# === High-Level API: LLM Inference ===
# Load a GGUF model
my $model = Lugh::Model->new(
model => '/path/to/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf'
);
# Create tokenizer and inference engine
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);
# Encode a prompt
my @tokens = $tokenizer->encode("The capital of France is");
# Run forward pass to get logits
my @logits = $inference->forward(\@tokens);
# Find most likely next token (greedy decoding)
my $max_idx = 0;
my $max_val = $logits[0];
for my $i (1..$#logits) {
if ($logits[$i] > $max_val) {
$max_val = $logits[$i];
$max_idx = $i;
}
}
# Decode the predicted token
my $next_token = $tokenizer->decode([$max_idx]);
print "Next token: $next_token\n"; # " Paris"
# === Low-Level API: Tensor Operations ===
# Create a ggml context
my $ctx = Lugh::Context->new(mem_size => 16 * 1024 * 1024);
# Create tensors
my $a = Lugh::Tensor->new_f32($ctx, 4);
my $b = Lugh::Tensor->new_f32($ctx, 4);
# Set values
$a->set_f32(1.0, 2.0, 3.0, 4.0);
$b->set_f32(5.0, 6.0, 7.0, 8.0);
# Create computation graph
my $c = Lugh::Ops::add($ctx, $a, $b);
my $graph = Lugh::Graph->new($ctx);
$graph->build_forward($c);
$graph->compute($ctx, 4); # 4 threads
# Get results
my @result = $c->get_f32();
print "Result: @result\n"; # 6.0 8.0 10.0 12.0
DESCRIPTION
Lugh is a pure C LLM inference engine for Perl, built on the ggml tensor library. It provides both high-level APIs for running large language model inference and low-level tensor operations for building custom neural network computations.
Named after the Celtic god of skill and craftsmanship, Lugh aims to provide complete understanding and control over LLM inference - letting you see exactly how transformers work under the hood.
Features
GGUF Model Loading - Load quantized models in the standard GGUF format
BPE Tokenization - Encode text to tokens and decode tokens to text
Transformer Inference - Full forward pass with attention, RoPE, FFN
Grouped Query Attention - Support for GQA models (LLaMA 2, Mistral, etc.)
Quantization Support - Run Q4, Q5, Q8 quantized models efficiently
Metal GPU Acceleration - Uses Apple Metal on macOS
BLAS Acceleration - Uses Accelerate/OpenBLAS for matrix operations
Supported Model Architectures
Lugh automatically detects model architecture from GGUF metadata and adapts inference accordingly. Supported architectures include:
LLaMA Family - LLaMA, LLaMA 2, LLaMA 3, TinyLlama, OpenLLaMA, Mistral, Mixtral
Qwen Family - Qwen, Qwen2 (combined QKV projections)
Phi Family - Phi-2, Phi-3 (combined QKV, GELU activation)
Gemma Family - Gemma, Gemma2 (post-normalization support)
GPT Family - GPT-2, GPT-J, GPT-NeoX
Other Architectures - Falcon, BLOOM, MPT, StarCoder, StableLM, InternLM, DeepSeek, Command-R, BERT, T5
The architecture is detected from the general.architecture key in the GGUF file, and appropriate handling is applied for:
Combined QKV - Phi, Qwen, BLOOM, GPT-2, GPT-J use a single QKV tensor
FFN Activation - SiLU/GELU based on architecture (no gate for GPT-2, Phi)
Post-Normalization - Gemma2 applies additional layer norms
Recurrent Models - MAMBA and RWKV are detected; inference support is a work in progress
Query architecture information at runtime:
my $model = Lugh::Model->new(model => 'model.gguf');
print "Architecture: ", $model->architecture, "\n";
print "Arch type: ", $model->arch_type, "\n";
print "Has combined QKV: ", $model->arch_has_combined_qkv, "\n";
print "Has FFN gate: ", $model->arch_has_ffn_gate, "\n";
print "Has post-norm: ", $model->arch_has_post_norm, "\n";
PACKAGES
Lugh provides two levels of API:
High-Level: LLM Inference
Lugh::Model - Load GGUF model files and access tensors/metadata
Lugh::Tokenizer - BPE tokenization (encode text, decode tokens)
Lugh::Inference - Transformer forward pass and sampling
Low-Level: Tensor Operations
Lugh::Context - Memory context for tensor allocation
Lugh::Tensor - N-dimensional tensors with various data types
Lugh::Ops - Tensor operations (add, mul, matmul, softmax, etc.)
Lugh::Graph - Computation graph for lazy evaluation
QUICK START
Running Inference
use Lugh;
# Load model
my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);
# Generate text
my @tokens = $tokenizer->encode("Once upon a time");
for (1..50) { # Generate 50 tokens
my @logits = $inference->forward(\@tokens);
# Sample next token (greedy)
my $next = 0;
my $max = $logits[0];
for my $i (1..$#logits) {
if ($logits[$i] > $max) {
$max = $logits[$i];
$next = $i;
}
}
last if $next == $tokenizer->eos_id;
push @tokens, $next;
print $tokenizer->decode([$next]);
}
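Greedy decoding always picks the single most likely token. For more varied output you can sample from the softmax distribution instead. The sketch below does this in plain Perl over the @logits array returned by forward(); it does not rely on any Lugh sampling API, and the temperature value is just an illustrative choice:
use List::Util qw(max);
sub sample_with_temperature {
    my ($logits, $temperature) = @_;
    $temperature ||= 1.0;
    # Softmax (max subtracted for numerical stability)
    my $max = max(@$logits);
    my @exp = map { exp(($_ - $max) / $temperature) } @$logits;
    my $sum = 0;
    $sum += $_ for @exp;
    # Draw a token id with probability proportional to exp(logit / T)
    my $r = rand() * $sum;
    for my $i (0 .. $#exp) {
        $r -= $exp[$i];
        return $i if $r <= 0;
    }
    return $#exp;
}
# my $next = sample_with_temperature(\@logits, 0.8);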
Inspecting a Model
use Lugh;
my $model = Lugh::Model->new(model => 'model.gguf');
print "Architecture: ", $model->architecture, "\n";
print "Layers: ", $model->get_kv('llama.block_count'), "\n";
print "Hidden dim: ", $model->get_kv('llama.embedding_length'), "\n";
print "Heads: ", $model->get_kv('llama.attention.head_count'), "\n";
print "Vocab: ", $model->get_kv('llama.vocab_size'), "\n";
# List tensors
for my $name ($model->tensor_names) {
my ($type, $dims, @shape) = $model->tensor_info($name);
print "$name: [@shape]\n";
}
ARCHITECTURE
┌─────────────────────────────────────────────────────┐
│ Perl Layer │
│ │
│ Lugh::Model Lugh::Tokenizer Lugh::Inference │
│ │ │ │ │
└───────┼───────────────┼──────────────────┼──────────┘
│ │ │
┌───────┼───────────────┼──────────────────┼──────────┐
│ │ XS Bindings │ │
│ ▼ ▼ ▼ │
│ GGUF Parser BPE Tokenizer Forward Pass │
│ Tensor Access Vocab Lookup Attention+FFN │
│ │
└─────────────────────────────────────────────────────┘
│ │ │
┌───────┼───────────────┼──────────────────┼──────────┐
│ ▼ ▼ ▼ │
│ ggml Library │
│ │
│ Tensor Ops RoPE Attention Quantization │
│ │
│ Metal GPU Accelerate BLAS │
└─────────────────────────────────────────────────────┘
TRANSFORMER COMPONENTS
Lugh implements the standard transformer decoder:
Token Embeddings
Converts token IDs to dense vectors by looking up rows in the embedding matrix.
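Conceptually this is plain row indexing; a minimal sketch in ordinary Perl (illustrative values, not Lugh's internal representation):
# The embedding matrix has vocab_size rows of n_embd floats each
my @embedding = (
    [ 0.01, -0.20,  0.50 ],   # token 0
    [ 0.30,  0.10, -0.40 ],   # token 1
);
my $token_id = 1;
my @vector   = @{ $embedding[$token_id] };   # dense vector for the token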
RMSNorm
Root Mean Square Layer Normalization - normalizes without centering:
RMSNorm(x) = x / sqrt(mean(x²) + ε)
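A minimal plain-Perl sketch of the same formula (illustrative only; the learned per-dimension weight that usually follows the normalization is omitted, and Lugh's real implementation runs through ggml):
sub rmsnorm {
    my ($x, $eps) = @_;
    $eps //= 1e-5;
    my $mean_sq = 0;
    $mean_sq += $_ * $_ for @$x;
    $mean_sq /= scalar @$x;
    my $scale = 1.0 / sqrt($mean_sq + $eps);
    return map { $_ * $scale } @$x;    # normalized vector
}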
Rotary Position Embeddings (RoPE)
Encodes position by rotating Q and K vectors. Allows the model to understand relative positions of tokens.
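Each pair of dimensions (x[2i], x[2i+1]) in Q and K is rotated by an angle that depends on the token position and the pair index. A plain-Perl sketch of rotating one pair (illustrative; Lugh uses ggml's RoPE operator, and a base of 10000 is only the common default):
sub rope_pair {
    my ($x0, $x1, $pos, $pair_idx, $dim, $base) = @_;
    $base //= 10000;
    my $theta = $pos * $base ** (-2 * $pair_idx / $dim);   # rotation angle
    return (
        $x0 * cos($theta) - $x1 * sin($theta),
        $x0 * sin($theta) + $x1 * cos($theta),
    );
}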
Grouped Query Attention (GQA)
Multi-head attention in which groups of query heads share a smaller number of key/value heads, reducing memory use and computation:
# TinyLlama: 32 query heads, 4 KV heads (8:1 ratio)
Attention(Q, K, V) = softmax(QK^T / √d) × V
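Each query head attends using the key/value head of its group. A sketch of the head mapping, using TinyLlama's 32 query heads and 4 KV heads as the example (plain Perl, not the Lugh API):
my $n_head    = 32;                       # query heads
my $n_kv_head = 4;                        # key/value heads
my $group     = $n_head / $n_kv_head;     # 8 query heads per KV head
for my $q_head (0 .. $n_head - 1) {
    my $kv_head = int($q_head / $group);
    # query head $q_head uses K and V from KV head $kv_head
}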
SwiGLU FFN
Feed-forward network with gated activation:
FFN(x) = down(SiLU(gate(x)) × up(x))
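where SiLU(x) = x × sigmoid(x). An element-wise plain-Perl sketch of the gated activation (illustrative values; in the real FFN each of gate, up and down is a matrix projection):
sub silu { my $x = shift; return $x / (1 + exp(-$x)); }
# One hidden element: SiLU(gate_proj(x)) * up_proj(x), fed to down_proj
my ($gate_x, $up_x) = (0.7, -1.2);   # values after the gate/up projections
my $hidden = silu($gate_x) * $up_x;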
PERFORMANCE
Backend Selection
Lugh supports multiple compute backends through the ggml library:
# List available backends on your system
my @backends = Lugh::available_backends();
# Returns: ('Metal', 'BLAS', 'CPU', 'auto')
# Check which backend is best for GPU acceleration
my $best = Lugh::best_backend();
# Returns: 'Metal' on macOS with GPU, 'CPU' otherwise
# Get detailed info about a backend
my $info = Lugh::backend_info('Metal');
# Returns: { name => 'Metal', type => 'GPU', is_gpu => 1,
# description => 'Apple M1', device_count => 1 }
# Check if a specific backend is available
if (Lugh::backend_available('Metal')) {
print "Metal GPU acceleration available!\n";
}
# Select backend for inference
my $inference = Lugh::Inference->new(
model => $model,
backend => 'auto', # auto-select best (default)
# backend => 'Metal', # force Metal GPU
# backend => 'CPU', # force CPU
);
Available backends depend on your system and ggml installation:
Metal - Apple GPU (macOS only)
CUDA - NVIDIA GPU (requires ggml with CUDA)
Vulkan - Cross-platform GPU
BLAS - Accelerate/OpenBLAS for fast matrix ops
CPU - Always available, uses SIMD
auto - Automatically select the best available
Backend API Functions
available_backends() - List all available backend names
backend_count() - Number of registered backends
backend_device_count() - Total number of available devices
backend_info($name) - Get detailed info about a backend
backend_available($name) - Check if a backend is available
best_backend() - Get the name of the best available backend
has_metal() - Check if Metal support is compiled in (macOS)
metal_available() - Check if a Metal GPU is available at runtime
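For example, to report what is available on the current system (this sketch uses only the functions above; the is_gpu and description keys are the ones shown in the backend_info example earlier):
use Lugh;
for my $name (Lugh::available_backends()) {
    my $info = Lugh::backend_info($name);
    printf "%-8s gpu=%s %s\n",
        $name,
        $info->{is_gpu} ? 'yes' : 'no',
        $info->{description} // '';
}
print "Best backend: ", Lugh::best_backend(), "\n";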
Memory Usage
Memory depends on model size and quantization:
TinyLlama 1.1B Q4_K_M: ~650 MB
LLaMA 7B Q4_K_M: ~4 GB
LLaMA 13B Q4_K_M: ~7.5 GB
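As a rough rule of thumb, the weights alone take about parameters × bits-per-weight / 8 bytes; Q4_K_M averages roughly 4.8 bits per weight, so 1.1B × 4.8 / 8 ≈ 660 MB, in line with the figure above. The KV cache and scratch buffers come on top of that and grow with context length.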
Speed
On Apple Silicon (M1/M2/M3), expect:
TinyLlama: ~20-50 tokens/second
LLaMA 7B: ~10-20 tokens/second
Speed depends on context length and batch size.
SEE ALSO
Lugh::Model - Model loading
Lugh::Tokenizer - Text tokenization
Lugh::Inference - Forward pass
Lugh::Context - Memory management
Lugh::Tensor - Tensor operations
Lugh::Ops - Math operations
Lugh::Graph - Computation graphs
External resources:
https://github.com/ggerganov/ggml - The ggml tensor library
AUTHOR
lnation <email@lnation.org>
BUGS
Please report any bugs or feature requests to bug-lugh at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lugh.
SUPPORT
You can find documentation for this module with the perldoc command:
perldoc Lugh
perldoc Lugh::Model
perldoc Lugh::Tokenizer
perldoc Lugh::Inference
LICENSE AND COPYRIGHT
This software is Copyright (c) 2026 by lnation <email@lnation.org>.
This is free software, licensed under:
The Artistic License 2.0 (GPL Compatible)