NAME

Lugh::Model - GGUF Model Loading and Tensor Access

VERSION

Version 0.01

SYNOPSIS

use Lugh;

# Load a GGUF model file
my $model = Lugh::Model->new(
    model => '/path/to/model.gguf'
);

# Get model information
print "Architecture: ", $model->architecture, "\n";
print "Tensors: ", $model->n_tensors, "\n";
print "Metadata keys: ", $model->n_kv, "\n";

# Access model metadata
my $n_layers = $model->get_kv('llama.block_count');
my $n_embd = $model->get_kv('llama.embedding_length');
my $vocab_size = $model->get_kv('llama.vocab_size');

# List all tensors
my @names = $model->tensor_names;

# Get tensor information
my ($type, $n_dims, @shape) = $model->tensor_info('token_embd.weight');

# List all metadata keys
my @keys = $model->kv_keys;

DESCRIPTION

Lugh::Model provides an interface for loading and inspecting GGUF model files. GGUF (GPT-Generated Unified Format) is the standard format for storing large language models, used by llama.cpp and related projects.

The model object loads the entire model into memory, including all tensors with their weights. This allows direct access to model parameters for inference.

GGUF Format

GGUF files contain:

  • Header - Magic number, version, tensor count, metadata count

  • Metadata - Key-value pairs describing the model architecture, hyperparameters, tokenizer vocabulary, and other configuration

  • Tensor Info - Name, dimensions, type, and offset for each tensor

  • Tensor Data - The actual weight data, potentially quantized
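
All of this structure is reachable through the accessors documented below. A minimal sketch that dumps both the metadata table and the tensor table of a loaded file:

use Lugh;

my $model = Lugh::Model->new(model => '/path/to/model.gguf');

# Header-level counts
printf "architecture: %s\n", $model->architecture;
printf "metadata keys: %d, tensors: %d\n", $model->n_kv, $model->n_tensors;

# Metadata key-value pairs
for my $key ($model->kv_keys) {
    my $value = $model->get_kv($key);
    next if ref $value;    # skip array values such as the vocabulary
    print "  $key = $value\n";
}

# Tensor table: name, type code, shape
for my $name ($model->tensor_names) {
    my ($type, $n_dims, @shape) = $model->tensor_info($name);
    printf "  %-40s type=%d shape=[%s]\n",
        $name, $type, join(',', @shape[0 .. $n_dims - 1]);
}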

Supported Quantization Types

The model loader supports all ggml quantization types, including:

  • F32, F16, BF16 - Full/half precision floats

  • Q4_0, Q4_1, Q4_K - 4-bit quantization

  • Q5_0, Q5_1, Q5_K - 5-bit quantization

  • Q8_0, Q8_1, Q8_K - 8-bit quantization

  • Q2_K, Q3_K - 2-3 bit quantization

  • Q6_K - 6-bit quantization

  • IQ1_S, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS - i-quants

Note that names such as Q4_K_M and Q3_K_S are llama.cpp file-level quantization mixes, not tensor types; the individual tensors inside such files use the ggml types listed above.
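
The numeric type codes returned by tensor_info follow ggml's type enum. A lookup table along these lines makes listings readable (codes taken from ggml's enum at the time of writing; verify them against the ggml version this module was built with):

# Assumed mapping from ggml type codes to names
my %GGML_TYPE_NAME = (
    0  => 'F32',     1  => 'F16',    2  => 'Q4_0',    3  => 'Q4_1',
    6  => 'Q5_0',    7  => 'Q5_1',   8  => 'Q8_0',    9  => 'Q8_1',
    10 => 'Q2_K',    11 => 'Q3_K',   12 => 'Q4_K',    13 => 'Q5_K',
    14 => 'Q6_K',    15 => 'Q8_K',   16 => 'IQ2_XXS', 17 => 'IQ2_XS',
    18 => 'IQ3_XXS', 19 => 'IQ1_S',  20 => 'IQ4_NL',  21 => 'IQ3_S',
    22 => 'IQ2_S',   23 => 'IQ4_XS', 30 => 'BF16',
);

my ($type) = $model->tensor_info('token_embd.weight');
print "token_embd.weight is ", $GGML_TYPE_NAME{$type} // "type $type", "\n";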

CONSTRUCTOR

new

my $model = Lugh::Model->new(
    model => '/path/to/model.gguf'
);

Creates a new Model object by loading a GGUF file.

Parameters:

  • model (required) - Path to the GGUF model file. Also accepts file or path as aliases.

Returns: A Lugh::Model object.

Throws: Dies if the file cannot be loaded or is not a valid GGUF file.

Example:

my $model = Lugh::Model->new(
    model => '/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf'
);

METHODS

filename

my $path = $model->filename;

Returns the path to the loaded GGUF file.

architecture

my $arch = $model->architecture;

Returns the model architecture string (e.g., "llama", "qwen2", "phi3", "gemma2"). Returns "unknown" if the architecture is not specified in the model.

arch_type

my $type = $model->arch_type;

Returns the numeric architecture type code for optimized dispatch. This is used internally to determine which inference path to use.

Architecture type codes include:

0  - UNKNOWN      11 - MPT
1  - LLAMA        12 - STARCODER  
2  - QWEN         13 - STABLELM
3  - QWEN2        14 - INTERNLM
4  - PHI          15 - DEEPSEEK
5  - GEMMA        16 - COMMAND_R
6  - GEMMA2       17 - MAMBA
7  - GPT2         18 - RWKV
8  - GPTJ         19 - BERT
9  - GPTNEOX      20 - T5
10 - FALCON       21 - BLOOM

Example:

if ($model->arch_type == 4) {
    print "This is a Phi model\n";
}
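
The module does not export constants for these codes, so to avoid magic numbers you can mirror the table yourself; the ARCH_* names below are illustrative, not part of the API:

use constant {
    ARCH_UNKNOWN => 0,  ARCH_LLAMA => 1,  ARCH_QWEN  => 2,
    ARCH_QWEN2   => 3,  ARCH_PHI   => 4,  ARCH_GEMMA => 5,
    ARCH_GEMMA2  => 6,
};

if ($model->arch_type == ARCH_PHI) {
    print "This is a Phi model\n";
}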

arch_has_combined_qkv

my $has_combined = $model->arch_has_combined_qkv;

Returns true (1) if the model architecture uses combined Q/K/V projection weights in a single tensor, false (0) otherwise.

Models with combined QKV: Phi, Qwen, Qwen2, BLOOM, GPT-2, GPT-J

Example:

if ($model->arch_has_combined_qkv) {
    print "Model uses combined QKV projections\n";
}

arch_has_ffn_gate

my $has_gate = $model->arch_has_ffn_gate;

Returns true (1) if the model architecture uses a gated FFN (SwiGLU), false (0) if it uses a standard 2-layer FFN with GELU activation.

Models without FFN gate (use GELU): GPT-2, GPT-J, GPT-NeoX, BLOOM, Falcon, MPT, Phi

Example:

if (!$model->arch_has_ffn_gate) {
    print "Model uses GELU FFN (no gate)\n";
}

arch_has_post_norm

my $has_post = $model->arch_has_post_norm;

Returns true (1) if the model architecture applies post-normalization after attention and FFN blocks, false (0) otherwise.

Currently only Gemma2 uses post-normalization.

Example:

if ($model->arch_has_post_norm) {
    print "Model uses post-normalization (Gemma2-style)\n";
}

arch_is_recurrent

my $is_recurrent = $model->arch_is_recurrent;

Returns true (1) if the model is a recurrent architecture (MAMBA, RWKV), false (0) for standard transformer architectures.

Note: Recurrent model inference is identified but not yet fully implemented.

Example:

if ($model->arch_is_recurrent) {
    warn "Recurrent models not yet fully supported\n";
}
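
Taken together, the four arch_* predicates describe how a model's transformer blocks are wired. A small sketch that summarizes a loaded model:

printf "architecture: %s\n", $model->architecture;
printf "  QKV projections: %s\n",
    $model->arch_has_combined_qkv ? 'combined (attn_qkv)'
                                  : 'separate (attn_q/k/v)';
printf "  FFN: %s\n",
    $model->arch_has_ffn_gate ? 'gated SwiGLU' : 'standard GELU';
printf "  post-norm: %s\n",
    $model->arch_has_post_norm ? 'yes (Gemma2-style)' : 'no';
warn "recurrent architecture - not yet fully supported\n"
    if $model->arch_is_recurrent;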

n_tensors

my $count = $model->n_tensors;

Returns the number of tensors in the model.

n_kv

my $count = $model->n_kv;

Returns the number of metadata key-value pairs in the model.

tensor_names

my @names = $model->tensor_names;

Returns a list of all tensor names in the model.

Example:

my @names = $model->tensor_names;
# Returns: ('token_embd.weight', 'blk.0.attn_norm.weight', ...)

tensor_info

my ($type, $n_dims, $ne0, $ne1, $ne2, $ne3) = $model->tensor_info($name);

Returns information about a specific tensor.

Parameters:

  • $name - The tensor name

Returns: A list containing:

  • $type - The ggml type code (0=F32, 1=F16, etc.)

  • $n_dims - Number of dimensions (1-4)

  • $ne0, $ne1, $ne2, $ne3 - Size of each dimension

Returns an empty list if the tensor is not found.

Example:

my ($type, $dims, @shape) = $model->tensor_info('token_embd.weight');
# For TinyLlama Q4_K_M: (12, 2, 2048, 32000, 1, 1)
# Type 12 = Q4_K, 2D tensor, shape [2048, 32000]
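
Because tensor_info reports every dimension, the total parameter count falls out of a loop over all tensors. A sketch using only the documented accessors:

my $total = 0;
for my $name ($model->tensor_names) {
    my ($type, $n_dims, @shape) = $model->tensor_info($name);
    my $count = 1;
    $count *= $_ for @shape[0 .. $n_dims - 1];   # product of the used dims
    $total += $count;
}
printf "%.2fB parameters\n", $total / 1e9;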

kv_keys

my @keys = $model->kv_keys;

Returns a list of all metadata keys in the model.

Example:

my @keys = $model->kv_keys;
# Returns: ('general.architecture', 'llama.block_count', ...)

get_kv

my $value = $model->get_kv($key);

Returns the value of a metadata key.

Parameters:

  • $key - The metadata key name

Returns: The value as a scalar (string, number, or boolean), or an array reference for array values. Returns undef if the key is not found.

Example:

my $n_layers = $model->get_kv('llama.block_count');  # 22 for TinyLlama
my $n_embd = $model->get_kv('llama.embedding_length');  # 2048
my $vocab = $model->get_kv('tokenizer.ggml.tokens');  # ['<unk>', '<s>', ...]
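
Since array-valued keys come back as array references, callers should check ref() before using a value:

my $value = $model->get_kv('tokenizer.ggml.tokens');
if (!defined $value) {
    warn "key not present in this model\n";
}
elsif (ref $value eq 'ARRAY') {
    printf "array value with %d entries\n", scalar @$value;
}
else {
    print "scalar value: $value\n";
}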

COMMON METADATA KEYS

General

  • general.architecture - Model architecture (e.g., "llama", "qwen2", "phi3")

  • general.name - Model name

  • general.quantization_version - Quantization format version

Architecture-specific Keys

Metadata keys are prefixed with the architecture name. The architecture is auto-detected from general.architecture and used to look up parameters:

LLaMA-style (llama, mistral, etc.):

  • {arch}.block_count - Number of transformer layers

  • {arch}.embedding_length - Hidden dimension (n_embd)

  • {arch}.attention.head_count - Number of attention heads

  • {arch}.attention.head_count_kv - Number of KV heads (for GQA)

  • {arch}.attention.layer_norm_rms_epsilon - RMSNorm epsilon

  • {arch}.context_length - Maximum context length

  • {arch}.feed_forward_length - FFN intermediate dimension

  • {arch}.vocab_size - Vocabulary size

  • {arch}.rope.dimension_count - RoPE rotation dimensions

  • {arch}.rope.freq_base - RoPE frequency base (10000 for llama)

Where {arch} is the architecture name (e.g., "llama", "qwen2", "phi3", "gemma2").

Example for different architectures:

# LLaMA model
my $layers = $model->get_kv('llama.block_count');

# Qwen2 model  
my $layers = $model->get_kv('qwen2.block_count');

# Phi-3 model
my $layers = $model->get_kv('phi3.block_count');

# Or use architecture() to build the key dynamically
my $arch = $model->architecture;
my $layers = $model->get_kv("$arch.block_count");

Tokenizer

  • tokenizer.ggml.model - Tokenizer type (e.g., "llama", "gpt2")

  • tokenizer.ggml.tokens - Vocabulary tokens (array)

  • tokenizer.ggml.scores - Token scores (array)

  • tokenizer.ggml.token_type - Token types (array)

  • tokenizer.ggml.bos_token_id - Beginning of sequence token ID

  • tokenizer.ggml.eos_token_id - End of sequence token ID

  • tokenizer.ggml.unknown_token_id - Unknown token ID

  • tokenizer.ggml.padding_token_id - Padding token ID
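
These keys are enough to resolve the special tokens to their string forms, assuming (as is conventional for GGUF) that the token IDs index into the tokens array:

my $tokens = $model->get_kv('tokenizer.ggml.tokens');   # arrayref
my $bos_id = $model->get_kv('tokenizer.ggml.bos_token_id');
my $eos_id = $model->get_kv('tokenizer.ggml.eos_token_id');

if (ref $tokens eq 'ARRAY' && defined $bos_id && defined $eos_id) {
    print "BOS: id=$bos_id token=$tokens->[$bos_id]\n";
    print "EOS: id=$eos_id token=$tokens->[$eos_id]\n";
}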

TENSOR NAMING CONVENTION

Tensor names follow a standard convention:

Embedding and Output

  • token_embd.weight - Token embedding matrix [n_embd, n_vocab]

  • output.weight - Output projection [n_vocab, n_embd]

  • output_norm.weight - Final layer norm

Attention Tensors (per layer N)

Separate Q/K/V (LLaMA, Mistral, Gemma, etc.):

  • blk.N.attn_norm.weight - Attention layer norm

  • blk.N.attn_q.weight - Query projection

  • blk.N.attn_k.weight - Key projection

  • blk.N.attn_v.weight - Value projection

  • blk.N.attn_output.weight - Attention output projection

Combined QKV (Phi, Qwen, Qwen2, BLOOM, GPT-2, GPT-J):

  • blk.N.attn_qkv.weight - Combined Q/K/V projection [3*n_embd, n_embd]

Post-normalization (Gemma2):

  • blk.N.attn_post_norm.weight - Post-attention layer norm

  • blk.N.ffn_post_norm.weight - Post-FFN layer norm

FFN Tensors (per layer N)

Gated FFN / SwiGLU (LLaMA, Mistral, Qwen, Gemma):

  • blk.N.ffn_norm.weight - FFN layer norm

  • blk.N.ffn_gate.weight - FFN gate projection (SwiGLU)

  • blk.N.ffn_up.weight - FFN up projection

  • blk.N.ffn_down.weight - FFN down projection

Standard FFN / GELU (GPT-2, Falcon, BLOOM, Phi):

  • blk.N.ffn_up.weight - FFN up projection (no gate)

  • blk.N.ffn_down.weight - FFN down projection
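
Combining this naming convention with the arch_* predicates gives a cheap structural sanity check. A sketch that verifies the expected per-layer projection tensors resolve via tensor_info:

my $arch   = $model->architecture;
my $layers = $model->get_kv("$arch.block_count");

for my $n (0 .. $layers - 1) {
    # attention projections depend on whether QKV is combined
    my @expect = $model->arch_has_combined_qkv
        ? ("blk.$n.attn_qkv.weight")
        : ("blk.$n.attn_q.weight", "blk.$n.attn_k.weight",
           "blk.$n.attn_v.weight");

    # gated FFNs carry an extra gate tensor
    push @expect, "blk.$n.ffn_gate.weight" if $model->arch_has_ffn_gate;
    push @expect, "blk.$n.ffn_up.weight", "blk.$n.ffn_down.weight";

    for my $name (@expect) {
        my @info = $model->tensor_info($name);
        warn "missing tensor: $name\n" unless @info;
    }
}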

THREAD SAFETY

Lugh::Model objects are NOT thread-safe. Each Perl thread must create its own Model object. The XS layer guards its global model registry with a mutex, but individual model contexts must not be shared across threads.
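
Under Perl ithreads this means constructing the model inside each thread body rather than closing over an existing object. A sketch (at the cost of loading the weights once per thread):

use threads;

my @threads = map {
    threads->create(sub {
        # each thread loads its own copy - never share a $model
        my $model = Lugh::Model->new(model => '/path/to/model.gguf');
        return $model->n_tensors;
    });
} 1 .. 2;

$_->join for @threads;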

MEMORY USAGE

Loading a model allocates memory for all tensors. Memory usage depends on the quantization:

Model Size    Q4_K_M     Q8_0       F16
1.1B params   0.6 GB     1.1 GB     2.2 GB
7B params     4.0 GB     7.0 GB     14 GB
13B params    7.4 GB     13 GB      26 GB

The memory is freed when the Model object goes out of scope.

SEE ALSO

Lugh, Lugh::Tokenizer, Lugh::Inference

https://github.com/ggerganov/ggml/blob/master/docs/gguf.md - GGUF specification

AUTHOR

lnation <email@lnation.org>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.