NAME

Lugh - Pure C LLM Inference Engine for Perl (built on ggml)

VERSION

Version 0.02

SYNOPSIS

use Lugh;

# === High-Level API: LLM Inference ===

# Load a GGUF model
my $model = Lugh::Model->new(
    model => '/path/to/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf'
);

# Create tokenizer and inference engine
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);

# Encode a prompt
my @tokens = $tokenizer->encode("The capital of France is");

# Run forward pass to get logits
my @logits = $inference->forward(\@tokens);

# Find most likely next token (greedy decoding)
my $max_idx = 0;
my $max_val = $logits[0];
for my $i (1..$#logits) {
    if ($logits[$i] > $max_val) {
        $max_val = $logits[$i];
        $max_idx = $i;
    }
}

# Decode the predicted token
my $next_token = $tokenizer->decode([$max_idx]);
print "Next token: $next_token\n";  # " Paris"

# === Low-Level API: Tensor Operations ===

# Create a ggml context
my $ctx = Lugh::Context->new(mem_size => 16 * 1024 * 1024);

# Create tensors
my $a = Lugh::Tensor->new_f32($ctx, 4);
my $b = Lugh::Tensor->new_f32($ctx, 4);

# Set values
$a->set_f32(1.0, 2.0, 3.0, 4.0);
$b->set_f32(5.0, 6.0, 7.0, 8.0);

# Create computation graph
my $c = Lugh::Ops::add($ctx, $a, $b);

my $graph = Lugh::Graph->new($ctx);
$graph->build_forward($c);
$graph->compute($ctx, 4);  # 4 threads

# Get results
my @result = $c->get_f32();
print "Result: @result\n";  # 6.0 8.0 10.0 12.0

DESCRIPTION

Lugh is a pure C LLM inference engine for Perl, built on the ggml tensor library. It provides both high-level APIs for running large language model inference and low-level tensor operations for building custom neural network computations.

Named after the Celtic god of skill and craftsmanship, Lugh aims to provide complete understanding and control over LLM inference - letting you see exactly how transformers work under the hood.

Features

  • GGUF Model Loading - Load quantized models in the standard GGUF format

  • BPE Tokenization - Encode text to tokens and decode tokens to text

  • Transformer Inference - Full forward pass with attention, RoPE, FFN

  • Grouped Query Attention - Support for GQA models (LLaMA 2, Mistral, etc.)

  • Quantization Support - Run Q4, Q5, Q8 quantized models efficiently

  • Metal GPU Acceleration - Uses Apple Metal on macOS

  • BLAS Acceleration - Uses Accelerate/OpenBLAS for matrix operations

Supported Models

Lugh can run any GGUF model with the LLaMA architecture:

  • TinyLlama (1.1B)

  • LLaMA / LLaMA 2 (7B, 13B, 70B)

  • Mistral / Mixtral

  • OpenLLaMA

  • Phi-2 (with minor adjustments)

PACKAGES

Lugh provides two levels of API:

High-Level: LLM Inference

  • Lugh::Model - Load GGUF models and access their metadata and tensors

  • Lugh::Tokenizer - BPE tokenizer for encoding text to tokens and decoding tokens back to text

  • Lugh::Inference - Transformer forward pass that produces next-token logits

Low-Level: Tensor Operations

  • Lugh::Context - Memory context for tensor allocation

  • Lugh::Tensor - N-dimensional tensors with various data types

  • Lugh::Ops - Tensor operations (add, mul, matmul, softmax, etc.)

  • Lugh::Graph - Computation graph for lazy evaluation

QUICK START

Running Inference

use Lugh;

# Load model
my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);

# Generate text
my @tokens = $tokenizer->encode("Once upon a time");

for (1..50) {  # Generate 50 tokens
    my @logits = $inference->forward(\@tokens);
    
    # Pick next token (greedy decoding)
    my $next = 0;
    my $max = $logits[0];
    for my $i (1..$#logits) {
        if ($logits[$i] > $max) {
            $max = $logits[$i];
            $next = $i;
        }
    }
    
    last if $next == $tokenizer->eos_id;
    
    push @tokens, $next;
    print $tokenizer->decode([$next]);
}
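
The loop above always picks the single most likely token. Below is a minimal sketch of temperature sampling instead, operating only on the @logits array that forward already returns; the helper is illustrative and not part of Lugh's API.

use List::Util qw(max sum);

sub sample_with_temperature {
    my ($logits, $temperature) = @_;

    # Scale logits by temperature, then softmax (subtract max for stability)
    my @scaled = map { $_ / $temperature } @$logits;
    my $max    = max(@scaled);
    my $sum    = sum(map { exp($_ - $max) } @scaled);
    my @probs  = map { exp($_ - $max) / $sum } @scaled;

    # Draw one token id from the resulting distribution
    my $r   = rand();
    my $acc = 0;
    for my $i (0 .. $#probs) {
        $acc += $probs[$i];
        return $i if $r <= $acc;
    }
    return $#probs;    # guard against floating-point rounding
}

# Inside the generation loop, replace the greedy argmax with:
# my $next = sample_with_temperature(\@logits, 0.8);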

Inspecting a Model

use Lugh;

my $model = Lugh::Model->new(model => 'model.gguf');

print "Architecture: ", $model->architecture, "\n";
print "Layers: ", $model->get_kv('llama.block_count'), "\n";
print "Hidden dim: ", $model->get_kv('llama.embedding_length'), "\n";
print "Heads: ", $model->get_kv('llama.attention.head_count'), "\n";
print "Vocab: ", $model->get_kv('llama.vocab_size'), "\n";

# List tensors
for my $name ($model->tensor_names) {
    my ($type, $dims, @shape) = $model->tensor_info($name);
    print "$name: [@shape]\n";
}

ARCHITECTURE

┌─────────────────────────────────────────────────────┐
│                    Perl Layer                        │
│                                                      │
│  Lugh::Model    Lugh::Tokenizer    Lugh::Inference  │
│       │               │                  │          │
└───────┼───────────────┼──────────────────┼──────────┘
        │               │                  │
┌───────┼───────────────┼──────────────────┼──────────┐
│       │         XS Bindings              │          │
│       ▼               ▼                  ▼          │
│  GGUF Parser    BPE Tokenizer      Forward Pass     │
│  Tensor Access  Vocab Lookup       Attention+FFN    │
│                                                      │
└─────────────────────────────────────────────────────┘
        │               │                  │
┌───────┼───────────────┼──────────────────┼──────────┐
│       ▼               ▼                  ▼          │
│                    ggml Library                      │
│                                                      │
│  Tensor Ops    RoPE    Attention    Quantization    │
│                                                      │
│              Metal GPU    Accelerate BLAS           │
└─────────────────────────────────────────────────────┘

TRANSFORMER COMPONENTS

Lugh implements the standard transformer decoder:

Token Embeddings

Converts token IDs to dense vectors by looking up rows in the embedding matrix.
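
A toy illustration in plain Perl (not Lugh's internal representation): the lookup is just selecting rows by token id.

# Toy embedding matrix: vocab_size x hidden_dim (here 4 x 3)
my @embedding = (
    [  0.10, -0.20,  0.30 ],   # token 0
    [  0.40,  0.50, -0.60 ],   # token 1
    [ -0.70,  0.80,  0.90 ],   # token 2
    [  0.11,  0.22,  0.33 ],   # token 3
);

my @token_ids = (2, 0, 3);
my @hidden    = map { $embedding[$_] } @token_ids;   # one row per token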

RMSNorm

Root Mean Square Layer Normalization - normalizes without centering:

RMSNorm(x) = x / sqrt(mean(x²) + ε)
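
A plain-Perl sketch of the same computation, including the learned per-dimension weight that LLaMA-style models apply after normalizing (omitted from the formula above):

use List::Util qw(sum);

sub rmsnorm {
    my ($x, $weight, $eps) = @_;
    $eps //= 1e-5;
    my $mean_sq = sum(map { $_ * $_ } @$x) / @$x;
    my $scale   = 1 / sqrt($mean_sq + $eps);
    # Normalize each element, then apply the learned weight
    return [ map { $x->[$_] * $scale * $weight->[$_] } 0 .. $#$x ];
}

my $normed = rmsnorm([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]);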

Rotary Position Embeddings (RoPE)

Encodes position by rotating Q and K vectors. Allows the model to understand relative positions of tokens.
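
A plain-Perl sketch of the rotation applied to one query or key vector. Pairing conventions differ between implementations; this shows the adjacent-pair form.

sub rope_rotate {
    my ($vec, $pos, $base) = @_;
    $base //= 10000;
    my $dim = @$vec;
    my @out = @$vec;

    # Rotate each consecutive pair (x_i, x_i+1) by an angle that depends
    # on the token position and on the pair's frequency
    for (my $i = 0; $i < $dim - 1; $i += 2) {
        my $theta = $pos * $base ** (-$i / $dim);
        my ($x0, $x1) = @$vec[$i, $i + 1];
        $out[$i]     = $x0 * cos($theta) - $x1 * sin($theta);
        $out[$i + 1] = $x0 * sin($theta) + $x1 * cos($theta);
    }
    return \@out;
}

my $q_rotated = rope_rotate([0.1, 0.2, 0.3, 0.4], 5);   # position 5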

Grouped Query Attention (GQA)

Multi-head attention where multiple query heads share fewer key/value heads, reducing memory and computation:

# TinyLlama: 32 query heads, 4 KV heads (8:1 ratio)
Attention(Q, K, V) = softmax(QK^T / √d) × V
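
A sketch of how query heads map onto the shared KV heads under GQA, using TinyLlama's head counts from the comment above:

my $n_head    = 32;   # query heads
my $n_kv_head = 4;    # key/value heads
my $group     = $n_head / $n_kv_head;   # 8 query heads per KV head

for my $q_head (0 .. $n_head - 1) {
    my $kv_head = int($q_head / $group);
    # Query head $q_head attends using the K and V of head $kv_head,
    # so only 4 K/V caches are stored instead of 32
}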

SwiGLU FFN

Feed-forward network with gated activation:

FFN(x) = down(SiLU(gate(x)) × up(x))
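
A plain-Perl sketch of the element-wise gating step; the surrounding gate, up, and down matrix multiplies are left out.

sub silu { my ($v) = @_; return $v / (1 + exp(-$v)) }   # SiLU(v) = v * sigmoid(v)

sub swiglu_gate {
    my ($gate_proj, $up_proj) = @_;   # outputs of gate(x) and up(x), same length
    # SiLU is applied to the gate projection, then multiplied element-wise
    # by the up projection; the result is then fed to the down projection
    return [ map { silu($gate_proj->[$_]) * $up_proj->[$_] } 0 .. $#$gate_proj ];
}

my $gated = swiglu_gate([0.5, -1.0, 2.0], [1.0, 0.3, -0.7]);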

PERFORMANCE

Memory Usage

Memory depends on model size and quantization:

TinyLlama 1.1B Q4_K_M:  ~650 MB
LLaMA 7B Q4_K_M:        ~4 GB
LLaMA 13B Q4_K_M:       ~7.5 GB
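
These figures follow from simple bits-per-weight arithmetic: Q4_K_M averages roughly 4.8 to 5 bits (about 0.6 bytes) per weight, so resident size is close to parameter count times 0.6 bytes, plus buffers for the KV cache. A rough weight-only estimate:

# Rough Q4_K_M weight size: parameter count x ~0.6 bytes
for my $m ([ 'TinyLlama 1.1B', 1.1e9 ], [ 'LLaMA 7B', 7e9 ], [ 'LLaMA 13B', 13e9 ]) {
    my ($name, $params) = @$m;
    printf "%-15s ~%.1f GB\n", $name, $params * 0.6 / 1e9;
}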

Speed

On Apple Silicon (M1/M2/M3), expect:

TinyLlama:  ~20-50 tokens/second
LLaMA 7B:   ~10-20 tokens/second

Speed depends on context length and batch size.

SEE ALSO

External resources:

AUTHOR

lnation <email@lnation.org>

BUGS

Please report any bugs or feature requests to bug-lugh at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lugh.

SUPPORT

You can find documentation for this module with the perldoc command:

perldoc Lugh
perldoc Lugh::Model
perldoc Lugh::Tokenizer
perldoc Lugh::Inference

LICENSE AND COPYRIGHT

This software is Copyright (c) 2026 by lnation <email@lnation.org>.

This is free software, licensed under:

The Artistic License 2.0 (GPL Compatible)
