NAME
Lugh::Model - GGUF Model Loading and Tensor Access
VERSION
Version 0.01
SYNOPSIS
use Lugh;
# Load a GGUF model file
my $model = Lugh::Model->new(
    model => '/path/to/model.gguf'
);
# Get model information
print "Architecture: ", $model->architecture, "\n";
print "Tensors: ", $model->n_tensors, "\n";
print "Metadata keys: ", $model->n_kv, "\n";
# Access model metadata
my $n_layers = $model->get_kv('llama.block_count');
my $n_embd = $model->get_kv('llama.embedding_length');
my $vocab_size = $model->get_kv('llama.vocab_size');
# List all tensors
my @names = $model->tensor_names;
# Get tensor information
my ($type, $n_dims, @shape) = $model->tensor_info('token_embd.weight');
# List all metadata keys
my @keys = $model->kv_keys;
DESCRIPTION
Lugh::Model provides an interface for loading and inspecting GGUF model files. GGUF (GPT-Generated Unified Format) is the standard format for storing large language models, used by llama.cpp and related projects.
The model object loads the entire model into memory, including all tensors with their weights. This allows direct access to model parameters for inference.
GGUF Format
GGUF files contain:
Header - Magic number, version, tensor count, metadata count
Metadata - Key-value pairs describing the model architecture, hyperparameters, tokenizer vocabulary, and other configuration
Tensor Info - Name, dimensions, type, and offset for each tensor
Tensor Data - The actual weight data, potentially quantized
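For illustration, the fixed-size portion of the header can be read with core Perl alone. This sketch is independent of Lugh and assumes a little-endian GGUF file and a 64-bit Perl (the Q< unpack template requires 64-bit integer support):

use strict;
use warnings;

open my $fh, '<:raw', '/path/to/model.gguf' or die "open: $!";
read($fh, my $header, 24) == 24 or die "short read";

# magic (4 bytes), uint32 version, uint64 tensor count, uint64 KV count
my ($magic, $version, $n_tensors, $n_kv) = unpack 'a4 V Q< Q<', $header;
die "not a GGUF file" unless $magic eq 'GGUF';
printf "GGUF v%u: %u tensors, %u metadata keys\n", $version, $n_tensors, $n_kv;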
Supported Quantization Types
The model loader supports all ggml quantization types, including:
F32, F16, BF16 - Full/half precision floats
Q4_0, Q4_1, Q4_K - 4-bit quantization
Q5_0, Q5_1, Q5_K - 5-bit quantization
Q8_0, Q8_1, Q8_K - 8-bit quantization
Q2_K, Q3_K - 2-3 bit quantization
Q6_K - 6-bit quantization
IQ1_S, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS - i-quants
Note that scheme names such as Q4_K_M or Q3_K_L describe per-file mixes of these per-tensor types, so files quantized with them load as well.
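To see which type codes a loaded model actually contains, tally the first element of tensor_info (described below) over every tensor:

my %by_type;
for my $name ($model->tensor_names) {
    my ($type) = $model->tensor_info($name);   # first element is the ggml type code
    $by_type{$type}++;
}
printf "type %2d: %d tensors\n", $_, $by_type{$_}
    for sort { $a <=> $b } keys %by_type;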
CONSTRUCTOR
new
my $model = Lugh::Model->new(
    model => '/path/to/model.gguf'
);
Creates a new Model object by loading a GGUF file.
Parameters:
model (required) - Path to the GGUF model file. Also accepts file or path as aliases.
Returns: A Lugh::Model object.
Throws: Dies if the file cannot be loaded or is not a valid GGUF file.
Example:
my $model = Lugh::Model->new(
    model => '/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf'
);
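Because new() dies on failure, wrap construction in eval (or Try::Tiny) when the path is untrusted; here $path stands for a user-supplied filename:

my $model = eval { Lugh::Model->new(model => $path) };
unless ($model) {
    warn "could not load model: $@";
    # handle the failure (fall back, prompt again, exit, ...)
}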
METHODS
filename
my $path = $model->filename;
Returns the path to the loaded GGUF file.
architecture
my $arch = $model->architecture;
Returns the model architecture string (e.g., "llama", "qwen2", "phi3", "gemma2"). Returns "unknown" if the architecture is not specified in the model.
arch_type
my $type = $model->arch_type;
Returns the numeric architecture type code for optimized dispatch. This is used internally to determine which inference path to use.
Architecture type codes include:
 0 - UNKNOWN     11 - MPT
 1 - LLAMA       12 - STARCODER
 2 - QWEN        13 - STABLELM
 3 - QWEN2       14 - INTERNLM
 4 - PHI         15 - DEEPSEEK
 5 - GEMMA       16 - COMMAND_R
 6 - GEMMA2      17 - MAMBA
 7 - GPT2        18 - RWKV
 8 - GPTJ        19 - BERT
 9 - GPTNEOX     20 - T5
10 - FALCON      21 - BLOOM
Example:
if ($model->arch_type == 4) {
    print "This is a Phi model\n";
}
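Comparing against bare integers is brittle. A lookup table built from the codes above gives readable names (illustrative; Lugh itself may or may not export constants for these codes):

my @ARCH_NAME = qw(
    UNKNOWN LLAMA QWEN QWEN2 PHI GEMMA GEMMA2 GPT2 GPTJ GPTNEOX FALCON
    MPT STARCODER STABLELM INTERNLM DEEPSEEK COMMAND_R MAMBA RWKV BERT T5 BLOOM
);
my $name = $ARCH_NAME[$model->arch_type] // 'UNKNOWN';
print "Architecture: $name\n";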
arch_has_combined_qkv
my $has_combined = $model->arch_has_combined_qkv;
Returns true (1) if the model architecture uses combined Q/K/V projection weights in a single tensor, false (0) otherwise.
Models with combined QKV: Phi, Qwen, Qwen2, BLOOM, GPT-2, GPT-J
Example:
if ($model->arch_has_combined_qkv) {
    print "Model uses combined QKV projections\n";
}
arch_has_ffn_gate
my $has_gate = $model->arch_has_ffn_gate;
Returns true (1) if the model architecture uses a gated FFN (SwiGLU), false (0) if it uses a standard 2-layer FFN with GELU activation.
Models without FFN gate (use GELU): GPT-2, GPT-J, GPT-NeoX, BLOOM, Falcon, MPT, Phi
Example:
if (!$model->arch_has_ffn_gate) {
    print "Model uses GELU FFN (no gate)\n";
}
arch_has_post_norm
my $has_post = $model->arch_has_post_norm;
Returns true (1) if the model architecture applies post-normalization after attention and FFN blocks, false (0) otherwise.
Currently only Gemma2 uses post-normalization.
Example:
if ($model->arch_has_post_norm) {
    print "Model uses post-normalization (Gemma2-style)\n";
}
arch_is_recurrent
my $is_recurrent = $model->arch_is_recurrent;
Returns true (1) if the model is a recurrent architecture (MAMBA, RWKV), false (0) for standard transformer architectures.
Note: Recurrent architectures are detected, but inference for them is not yet fully implemented.
Example:
if ($model->arch_is_recurrent) {
    warn "Recurrent models not yet fully supported\n";
}
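Taken together, the four arch_* predicates describe the shape of the compute graph. A dispatch sketch, with prints standing in for real graph construction:

die "recurrent architectures need a different graph\n"
    if $model->arch_is_recurrent;

my $attn = $model->arch_has_combined_qkv ? 'fused QKV'    : 'split Q/K/V';
my $ffn  = $model->arch_has_ffn_gate     ? 'gated SwiGLU' : 'GELU';
print "attention: $attn, FFN: $ffn",
    ($model->arch_has_post_norm ? ', plus post-normalization' : ''), "\n";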
n_tensors
my $count = $model->n_tensors;
Returns the number of tensors in the model.
n_kv
my $count = $model->n_kv;
Returns the number of metadata key-value pairs in the model.
tensor_names
my @names = $model->tensor_names;
Returns a list of all tensor names in the model.
Example:
my @names = $model->tensor_names;
# Returns: ('token_embd.weight', 'blk.0.attn_norm.weight', ...)
tensor_info
my ($type, $n_dims, $ne0, $ne1, $ne2, $ne3) = $model->tensor_info($name);
Returns information about a specific tensor.
Parameters:
$name- The tensor name
Returns: A list containing:
$type - The ggml type code (0=F32, 1=F16, etc.)
$n_dims - Number of dimensions (1-4)
$ne0, $ne1, $ne2, $ne3 - Size of each dimension
Returns an empty list if the tensor is not found.
Example:
my ($type, $dims, @shape) = $model->tensor_info('token_embd.weight');
# For TinyLlama Q4_K_M: (12, 2, 2048, 32000, 1, 1)
# Type 12 = Q4_K, 2D tensor, shape [2048, 32000]
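Combined with tensor_names, this gives a quick structural summary of the whole model; $n_dims is used to trim the trailing size-1 dimensions:

for my $name ($model->tensor_names) {
    my ($type, $n_dims, @ne) = $model->tensor_info($name);
    printf "%-40s type=%-2d [%s]\n",
        $name, $type, join ', ', @ne[0 .. $n_dims - 1];
}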
kv_keys
my @keys = $model->kv_keys;
Returns a list of all metadata keys in the model.
Example:
my @keys = $model->kv_keys;
# Returns: ('general.architecture', 'llama.block_count', ...)
get_kv
my $value = $model->get_kv($key);
Returns the value of a metadata key.
Parameters:
$key- The metadata key name
Returns: The value as a scalar (string, number, or boolean), or an array reference for array values. Returns undef if the key is not found.
Example:
my $n_layers = $model->get_kv('llama.block_count'); # 22 for TinyLlama
my $n_embd = $model->get_kv('llama.embedding_length'); # 2048
my $vocab = $model->get_kv('tokenizer.ggml.tokens'); # ['<unk>', '<s>', ...]
COMMON METADATA KEYS
General
general.architecture - Model architecture (e.g., "llama", "qwen2", "phi3")
general.name - Model name
general.quantization_version - Quantization format version
Architecture-specific Keys
Metadata keys are prefixed with the architecture name. The architecture is auto-detected from general.architecture and used to look up parameters:
LLaMA-style (llama, mistral, etc.):
{arch}.block_count - Number of transformer layers
{arch}.embedding_length - Hidden dimension (n_embd)
{arch}.attention.head_count - Number of attention heads
{arch}.attention.head_count_kv - Number of KV heads (for GQA)
{arch}.attention.layer_norm_rms_epsilon - RMSNorm epsilon
{arch}.context_length - Maximum context length
{arch}.feed_forward_length - FFN intermediate dimension
{arch}.vocab_size - Vocabulary size
{arch}.rope.dimension_count - RoPE rotation dimensions
{arch}.rope.freq_base - RoPE frequency base (10000 for llama)
Where {arch} is the architecture name (e.g., "llama", "qwen2", "phi3", "gemma2").
Example for different architectures:
# LLaMA model
my $layers = $model->get_kv('llama.block_count');
# Qwen2 model
my $layers = $model->get_kv('qwen2.block_count');
# Phi-3 model
my $layers = $model->get_kv('phi3.block_count');
# Or use architecture() to build the key dynamically
my $arch = $model->architecture;
my $layers = $model->get_kv("$arch.block_count");
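The same pattern extends to a whole hyperparameter set. A sketch that gathers the common keys for whatever architecture was loaded, assuming the keys listed above are present:

my $arch = $model->architecture;
my %hp = map { $_ => $model->get_kv("$arch.$_") } qw(
    block_count embedding_length context_length
    feed_forward_length attention.head_count attention.head_count_kv
);
my $head_dim = $hp{embedding_length} / $hp{'attention.head_count'};
print "layers=$hp{block_count}, head_dim=$head_dim\n";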
Tokenizer
tokenizer.ggml.model - Tokenizer type (e.g., "llama", "gpt2")
tokenizer.ggml.tokens - Vocabulary tokens (array)
tokenizer.ggml.scores - Token scores (array)
tokenizer.ggml.token_type - Token types (array)
tokenizer.ggml.bos_token_id - Beginning of sequence token ID
tokenizer.ggml.eos_token_id - End of sequence token ID
tokenizer.ggml.unknown_token_id - Unknown token ID
tokenizer.ggml.padding_token_id - Padding token ID
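The special-token IDs index into the tokenizer.ggml.tokens array reference, so the two can be combined:

my $tokens = $model->get_kv('tokenizer.ggml.tokens');
my $bos    = $model->get_kv('tokenizer.ggml.bos_token_id');
my $eos    = $model->get_kv('tokenizer.ggml.eos_token_id');
printf "BOS token %d = %s, EOS token %d = %s\n",
    $bos, $tokens->[$bos], $eos, $tokens->[$eos];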
TENSOR NAMING CONVENTION
Tensor names follow a standard convention:
Embedding and Output
token_embd.weight - Token embedding matrix [n_embd, n_vocab]
output.weight - Output projection [n_vocab, n_embd]
output_norm.weight - Final layer norm
Attention Tensors (per layer N)
Separate Q/K/V (LLaMA, Mistral, Gemma, etc.):
blk.N.attn_norm.weight - Attention layer norm
blk.N.attn_q.weight - Query projection
blk.N.attn_k.weight - Key projection
blk.N.attn_v.weight - Value projection
blk.N.attn_output.weight - Attention output projection
Combined QKV (Phi, Qwen, BLOOM, GPT-2, GPT-J):
blk.N.attn_qkv.weight - Combined Q/K/V projection [3*n_embd, n_embd]
Post-normalization (Gemma2):
blk.N.attn_post_norm.weight - Post-attention layer norm
blk.N.ffn_post_norm.weight - Post-FFN layer norm
FFN Tensors (per layer N)
Gated FFN / SwiGLU (LLaMA, Mistral, Qwen, Gemma):
blk.N.ffn_norm.weight - FFN layer norm
blk.N.ffn_gate.weight - FFN gate projection (SwiGLU)
blk.N.ffn_up.weight - FFN up projection
blk.N.ffn_down.weight - FFN down projection
Standard FFN / GELU (GPT-2, Falcon, BLOOM, Phi):
blk.N.ffn_up.weight - FFN up projection (no gate)
blk.N.ffn_down.weight - FFN down projection
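The arch_* predicates described earlier tell you which of these names to expect. A sketch that checks layer 0 against the expected layout:

my @expect = qw(attn_norm attn_output ffn_up ffn_down);
if ($model->arch_has_combined_qkv) {
    push @expect, 'attn_qkv';
} else {
    push @expect, qw(attn_q attn_k attn_v);
}
push @expect, 'ffn_gate' if $model->arch_has_ffn_gate;

for my $part (@expect) {
    my @info = $model->tensor_info("blk.0.$part.weight");
    warn "blk.0.$part.weight not found\n" unless @info;
}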
THREAD SAFETY
Lugh::Model objects are NOT thread-safe. Each Perl thread must create its own Model object. The XS code uses a registry pattern with mutex locks for the global registry, but individual model contexts should not be shared across threads.
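That means loading the model inside each thread rather than passing one object around. A minimal sketch with the core threads module (each thread pays the full load cost):

use threads;

my @workers = map {
    threads->create(sub {
        # every thread owns its own instance
        my $model = Lugh::Model->new(model => '/path/to/model.gguf');
        return $model->n_tensors;
    });
} 1 .. 2;
print $_->join, " tensors\n" for @workers;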
MEMORY USAGE
Loading a model allocates memory for all tensors. Memory usage depends on the quantization:
Model Size    Q4_K_M    Q8_0      F16
1.1B params   0.6 GB    1.1 GB    2.2 GB
7B params     4.0 GB    7.0 GB    14 GB
13B params    7.4 GB    13 GB     26 GB
The memory is freed when the Model object goes out of scope.
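The figures above follow from a back-of-the-envelope rule of parameters × bits-per-weight / 8. The bpw values here are approximate averages (K-quant files mix tensor types, and the table rounds):

my %bpw = (Q4_K_M => 4.6, Q8_0 => 8.5, F16 => 16);
my $n_params = 7e9;   # a 7B-parameter model
printf "%-7s ~%.1f GB\n", $_, $n_params * $bpw{$_} / 8 / 1e9
    for sort keys %bpw;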
SEE ALSO
Lugh, Lugh::Tokenizer, Lugh::Inference
https://github.com/ggerganov/ggml/blob/master/docs/gguf.md - GGUF specification
AUTHOR
lnation <email@lnation.org>
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.