NAME
Lugh - Pure C LLM Inference Engine for Perl (built on ggml)
VERSION
Version 0.01
SYNOPSIS
use Lugh;
# === High-Level API: LLM Inference ===
# Load a GGUF model
my $model = Lugh::Model->new(
    model => '/path/to/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf'
);
# Create tokenizer and inference engine
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);
# Encode a prompt
my @tokens = $tokenizer->encode("The capital of France is");
# Run forward pass to get logits
my @logits = $inference->forward(\@tokens);
# Find most likely next token (greedy decoding)
my $max_idx = 0;
my $max_val = $logits[0];
for my $i (1 .. $#logits) {
    if ($logits[$i] > $max_val) {
        $max_val = $logits[$i];
        $max_idx = $i;
    }
}
# Decode the predicted token
my $next_token = $tokenizer->decode([$max_idx]);
print "Next token: $next_token\n"; # " Paris"
# === Low-Level API: Tensor Operations ===
# Create a ggml context
my $ctx = Lugh::Context->new(mem_size => 16 * 1024 * 1024);
# Create tensors
my $a = Lugh::Tensor->new_f32($ctx, 4);
my $b = Lugh::Tensor->new_f32($ctx, 4);
# Set values
$a->set_f32(1.0, 2.0, 3.0, 4.0);
$b->set_f32(5.0, 6.0, 7.0, 8.0);
# Create computation graph
my $c = Lugh::Ops::add($ctx, $a, $b);
my $graph = Lugh::Graph->new($ctx);
$graph->build_forward($c);
$graph->compute($ctx, 4); # 4 threads
# Get results
my @result = $c->get_f32();
print "Result: @result\n"; # 6.0 8.0 10.0 12.0
DESCRIPTION
Lugh is a pure C LLM inference engine for Perl, built on the ggml tensor library. It provides both high-level APIs for running large language model inference and low-level tensor operations for building custom neural network computations.
Named after the Celtic god of skill and craftsmanship, Lugh aims to provide complete understanding and control over LLM inference - letting you see exactly how transformers work under the hood.
Features
GGUF Model Loading - Load quantized models in the standard GGUF format
BPE Tokenization - Encode text to tokens and decode tokens to text
Transformer Inference - Full forward pass with attention, RoPE, FFN
Grouped Query Attention - Support for GQA models (LLaMA 2, Mistral, etc.)
Quantization Support - Run Q4, Q5, Q8 quantized models efficiently
Metal GPU Acceleration - Uses Apple Metal on macOS
BLAS Acceleration - Uses Accelerate/OpenBLAS for matrix operations
Supported Models
Lugh can run any GGUF model with the LLaMA architecture:
TinyLlama (1.1B)
LLaMA / LLaMA 2 (7B, 13B, 70B)
Mistral / Mixtral
OpenLLaMA
Phi-2 (with minor adjustments)
PACKAGES
Lugh provides two levels of API:
High-Level: LLM Inference
Lugh::Model - Load GGUF model files and access tensors/metadata
Lugh::Tokenizer - BPE tokenization (encode text, decode tokens)
Lugh::Inference - Transformer forward pass and sampling
Low-Level: Tensor Operations
Lugh::Context - Memory context for tensor allocation
Lugh::Tensor - N-dimensional tensors with various data types
Lugh::Ops - Tensor operations (add, mul, matmul, softmax, etc.)
Lugh::Graph - Computation graph for lazy evaluation
QUICK START
Running Inference
use Lugh;
# Load model
my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);
# Generate text
my @tokens = $tokenizer->encode("Once upon a time");
for (1 .. 50) {                      # Generate up to 50 tokens
    my @logits = $inference->forward(\@tokens);
    # Sample the next token (greedy)
    my $next = 0;
    my $max  = $logits[0];
    for my $i (1 .. $#logits) {
        if ($logits[$i] > $max) {
            $max  = $logits[$i];
            $next = $i;
        }
    }
    last if $next == $tokenizer->eos_id;
    push @tokens, $next;
    print $tokenizer->decode([$next]);
}
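The greedy step above can be factored into a plain-Perl helper (no Lugh-specific calls involved), which keeps the generation loop short:
# Greedy decoding is just an argmax over the logits.
sub argmax {
    my ($logits) = @_;
    my $best = 0;
    for my $i (1 .. $#$logits) {
        $best = $i if $logits->[$i] > $logits->[$best];
    }
    return $best;
}
# Inside the loop: my $next = argmax(\@logits);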
Inspecting a Model
use Lugh;
my $model = Lugh::Model->new(model => 'model.gguf');
print "Architecture: ", $model->architecture, "\n";
print "Layers: ", $model->get_kv('llama.block_count'), "\n";
print "Hidden dim: ", $model->get_kv('llama.embedding_length'), "\n";
print "Heads: ", $model->get_kv('llama.attention.head_count'), "\n";
print "Vocab: ", $model->get_kv('llama.vocab_size'), "\n";
# List tensors
for my $name ($model->tensor_names) {
my ($type, $dims, @shape) = $model->tensor_info($name);
print "$name: [@shape]\n";
}
ARCHITECTURE
┌─────────────────────────────────────────────────────┐
│ Perl Layer │
│ │
│ Lugh::Model Lugh::Tokenizer Lugh::Inference │
│ │ │ │ │
└───────┼───────────────┼──────────────────┼──────────┘
│ │ │
┌───────┼───────────────┼──────────────────┼──────────┐
│ │ XS Bindings │ │
│ ▼ ▼ ▼ │
│ GGUF Parser BPE Tokenizer Forward Pass │
│ Tensor Access Vocab Lookup Attention+FFN │
│ │
└─────────────────────────────────────────────────────┘
│ │ │
┌───────┼───────────────┼──────────────────┼──────────┐
│ ▼ ▼ ▼ │
│ ggml Library │
│ │
│ Tensor Ops RoPE Attention Quantization │
│ │
│ Metal GPU Accelerate BLAS │
└─────────────────────────────────────────────────────┘
TRANSFORMER COMPONENTS
Lugh implements the standard transformer decoder:
Token Embeddings
Converts token IDs to dense vectors by looking up rows in the embedding matrix.
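A sketch of the lookup with a plain Perl array-of-arrays standing in for the (vocab × dim) embedding tensor:
# Toy embedding matrix: vocab size 3, embedding dim 2.
my @embedding = ([0.1, 0.2], [0.3, 0.4], [0.5, 0.6]);
my @token_ids = (2, 0);
# Lookup is row selection: one embedding row (arrayref) per token id.
my @hidden = map { $embedding[$_] } @token_ids;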
RMSNorm
Root Mean Square Layer Normalization - normalizes without centering:
RMSNorm(x) = x / sqrt(mean(x²) + ε)
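A pure-Perl sketch of the same computation, for illustration only; inside Lugh the normalization runs in ggml, and LLaMA-style models follow it with a learned per-channel scale:
# Illustration only: RMSNorm over one activation vector, then the
# learned per-channel weight.
sub rmsnorm {
    my ($x, $weight, $eps) = @_;    # arrayrefs of equal length
    $eps //= 1e-5;
    my $mean_sq = 0;
    $mean_sq += $_ ** 2 for @$x;
    $mean_sq /= scalar @$x;
    my $scale = 1 / sqrt($mean_sq + $eps);
    return [ map { $x->[$_] * $scale * $weight->[$_] } 0 .. $#$x ];
}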
Rotary Position Embeddings (RoPE)
Encodes position by rotating the query and key vectors by position-dependent angles, which lets the model capture the relative positions of tokens.
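A pure-Perl sketch of the rotation for one head, assuming the conventional base of 10000 (the actual rotation is performed by ggml's RoPE op):
# Illustration only: rotate consecutive pairs (x0, x1) of one head's
# query or key vector by a position-dependent angle.
sub apply_rope {
    my ($vec, $pos, $base) = @_;    # $vec: arrayref with an even length
    $base //= 10000;
    my $d = scalar @$vec;
    my @out;
    for (my $i = 0; $i < $d; $i += 2) {
        my $theta = $pos * $base ** (-$i / $d);
        my ($x0, $x1) = @$vec[ $i, $i + 1 ];
        push @out, $x0 * cos($theta) - $x1 * sin($theta),
                   $x0 * sin($theta) + $x1 * cos($theta);
    }
    return \@out;
}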
Grouped Query Attention (GQA)
Multi-head attention in which several query heads share a smaller set of key/value heads, reducing memory use and computation:
# TinyLlama: 32 query heads, 4 KV heads (8:1 ratio)
Attention(Q, K, V) = softmax(QK^T / √d) × V
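The mapping from query heads to shared KV heads is simple integer division; a sketch using TinyLlama's head counts:
# Each group of n_head / n_head_kv consecutive query heads reads the
# same KV head (TinyLlama: 32 / 4 = 8 query heads per KV head).
my $n_head    = 32;
my $n_head_kv = 4;
my $group     = $n_head / $n_head_kv;                    # 8
my @kv_for_q  = map { int($_ / $group) } 0 .. $n_head - 1;
# Query heads 0..7 use KV head 0, heads 8..15 use KV head 1, and so on.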
SwiGLU FFN
Feed-forward network with gated activation:
FFN(x) = down(SiLU(gate(x)) × up(x))
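A minimal pure-Perl sketch of the element-wise gate; the projections themselves are matmuls done by ggml, and the gate/up/down names are illustrative:
sub silu { my ($x) = @_; return $x / (1 + exp(-$x)) }
# Element-wise SwiGLU gate: $gate and $up are arrayrefs holding the
# gate and up projections of x for one token.
sub swiglu {
    my ($gate, $up) = @_;
    return [ map { silu($gate->[$_]) * $up->[$_] } 0 .. $#$gate ];
}
# The down projection then maps the gated vector back to the model dimension.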
PERFORMANCE
Memory Usage
Memory depends on model size and quantization:
TinyLlama 1.1B Q4_K_M: ~650 MB
LLaMA 7B Q4_K_M: ~4 GB
LLaMA 13B Q4_K_M: ~7.5 GB
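As a rough back-of-envelope check (an approximation, not a measurement): Q4_K_M stores weights at roughly 4.5-5 bits each, and the KV cache plus compute buffers add to that:
# Rough estimate only: ~4.85 bits per weight for Q4_K_M, ignoring the
# KV cache and scratch buffers.
my $params = 1.1e9;                                   # TinyLlama parameter count
printf "~%.0f MB\n", $params * 4.85 / 8 / 1024**2;    # roughly 640 MB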
Speed
On Apple Silicon (M1/M2/M3), expect:
TinyLlama: ~20-50 tokens/second
LLaMA 7B: ~10-20 tokens/second
Speed depends on context length and batch size.
SEE ALSO
Lugh::Model - Model loading
Lugh::Tokenizer - Text tokenization
Lugh::Inference - Forward pass
Lugh::Context - Memory management
Lugh::Tensor - Tensor operations
Lugh::Ops - Math operations
Lugh::Graph - Computation graphs
External resources:
https://github.com/ggerganov/ggml - The ggml tensor library
https://github.com/ggerganov/llama.cpp - Reference implementation
https://arxiv.org/abs/1706.03762 - "Attention Is All You Need"
AUTHOR
lnation <email@lnation.org>
BUGS
Please report any bugs or feature requests to bug-lugh at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lugh.
SUPPORT
You can find documentation for this module with the perldoc command:
perldoc Lugh
perldoc Lugh::Model
perldoc Lugh::Tokenizer
perldoc Lugh::Inference
LICENSE AND COPYRIGHT
This software is Copyright (c) 2026 by lnation <email@lnation.org>.
This is free software, licensed under:
The Artistic License 2.0 (GPL Compatible)