NAME
Lugh::Model - GGUF Model Loading and Tensor Access
VERSION
Version 0.01
SYNOPSIS
use Lugh;
# Load a GGUF model file
my $model = Lugh::Model->new(
    model => '/path/to/model.gguf'
);
# Get model information
print "Architecture: ", $model->architecture, "\n";
print "Tensors: ", $model->n_tensors, "\n";
print "Metadata keys: ", $model->n_kv, "\n";
# Access model metadata
my $n_layers = $model->get_kv('llama.block_count');
my $n_embd = $model->get_kv('llama.embedding_length');
my $vocab_size = $model->get_kv('llama.vocab_size');
# List all tensors
my @names = $model->tensor_names;
# Get tensor information
my ($type, $n_dims, @shape) = $model->tensor_info('token_embd.weight');
# List all metadata keys
my @keys = $model->kv_keys;
DESCRIPTION
Lugh::Model provides an interface for loading and inspecting GGUF model files. GGUF (GPT-Generated Unified Format) is the standard format for storing large language models, used by llama.cpp and related projects.
The model object loads the entire model into memory, including all tensors with their weights. This allows direct access to model parameters for inference.
GGUF Format
GGUF files contain:
Header - Magic number, version, tensor count, metadata count
Metadata - Key-value pairs describing the model architecture, hyperparameters, tokenizer vocabulary, and other configuration
Tensor Info - Name, dimensions, type, and offset for each tensor
Tensor Data - The actual weight data, potentially quantized
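For illustration, the fixed-size portion of the header can be read with core Perl alone. This sketch is independent of Lugh and assumes a little-endian GGUF file and a 64-bit Perl (the Q< unpack template requires 64-bit integer support):

use strict;
use warnings;

open my $fh, '<:raw', '/path/to/model.gguf' or die "open: $!";
read($fh, my $header, 24) == 24 or die "short read";

# magic (4 bytes), uint32 version, uint64 tensor count, uint64 KV count
my ($magic, $version, $n_tensors, $n_kv) = unpack 'a4 V Q< Q<', $header;
die "not a GGUF file" unless $magic eq 'GGUF';
printf "GGUF v%u: %u tensors, %u metadata keys\n", $version, $n_tensors, $n_kv;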
Supported Quantization Types
The model loader supports all ggml quantization types, including:
F32, F16, BF16 - Full/half precision floats
Q4_0, Q4_1, Q4_K - 4-bit quantization
Q5_0, Q5_1, Q5_K - 5-bit quantization
Q8_0, Q8_1, Q8_K - 8-bit quantization
Q2_K, Q3_K - 2-3 bit quantization
Q6_K - 6-bit quantization
IQ1_S, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS - i-quants
Note that scheme names such as Q4_K_M or Q3_K_L describe per-file mixes of these per-tensor types, so files quantized with them load as well.
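To see which type codes a loaded model actually contains, tally the first element of tensor_info (described below) over every tensor:

my %by_type;
for my $name ($model->tensor_names) {
    my ($type) = $model->tensor_info($name);   # first element is the ggml type code
    $by_type{$type}++;
}
printf "type %2d: %d tensors\n", $_, $by_type{$_}
    for sort { $a <=> $b } keys %by_type;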
CONSTRUCTOR
new
my $model = Lugh::Model->new(
    model => '/path/to/model.gguf'
);
Creates a new Model object by loading a GGUF file.
Parameters:
model (required) - Path to the GGUF model file. Also accepts file or path as aliases.
Returns: A Lugh::Model object.
Throws: Dies if the file cannot be loaded or is not a valid GGUF file.
Example:
my $model = Lugh::Model->new(
    model => '/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf'
);
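Because new() dies on failure, wrap construction in eval (or Try::Tiny) when the path is untrusted; here $path stands for a user-supplied filename:

my $model = eval { Lugh::Model->new(model => $path) };
unless ($model) {
    warn "could not load model: $@";
    # handle the failure (fall back, prompt again, exit, ...)
}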
METHODS
filename
my $path = $model->filename;
Returns the path to the loaded GGUF file.
architecture
my $arch = $model->architecture;
Returns the model architecture string (e.g., "llama", "qwen2", "phi3", "gemma2"). Returns "unknown" if the architecture is not specified in the model.
arch_type
my $type = $model->arch_type;
Returns the numeric architecture type code for optimized dispatch. This is used internally to determine which inference path to use.
Architecture type codes include:
 0 - UNKNOWN     11 - MPT
 1 - LLAMA       12 - STARCODER
 2 - QWEN        13 - STABLELM
 3 - QWEN2       14 - INTERNLM
 4 - PHI         15 - DEEPSEEK
 5 - GEMMA       16 - COMMAND_R
 6 - GEMMA2      17 - MAMBA
 7 - GPT2        18 - RWKV
 8 - GPTJ        19 - BERT
 9 - GPTNEOX     20 - T5
10 - FALCON      21 - BLOOM
Example:
if ($model->arch_type == 4) {
    print "This is a Phi model\n";
}
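Comparing against bare integers is brittle. A lookup table built from the codes above gives readable names (illustrative; Lugh itself may or may not export constants for these codes):

my @ARCH_NAME = qw(
    UNKNOWN LLAMA QWEN QWEN2 PHI GEMMA GEMMA2 GPT2 GPTJ GPTNEOX FALCON
    MPT STARCODER STABLELM INTERNLM DEEPSEEK COMMAND_R MAMBA RWKV BERT T5 BLOOM
);
my $name = $ARCH_NAME[$model->arch_type] // 'UNKNOWN';
print "Architecture: $name\n";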
arch_has_combined_qkv
my $has_combined = $model->arch_has_combined_qkv;
Returns true (1) if the model architecture uses combined Q/K/V projection weights in a single tensor, false (0) otherwise.
Models with combined QKV: Phi, Qwen, Qwen2, BLOOM, GPT-2, GPT-J
Example:
if ($model->arch_has_combined_qkv) {
    print "Model uses combined QKV projections\n";
}
arch_has_ffn_gate
my $has_gate = $model->arch_has_ffn_gate;
Returns true (1) if the model architecture uses a gated FFN (SwiGLU), false (0) if it uses a standard 2-layer FFN with GELU activation.
Models without FFN gate (use GELU): GPT-2, GPT-J, GPT-NeoX, BLOOM, Falcon, MPT, Phi
Example:
if (!$model->arch_has_ffn_gate) {
    print "Model uses GELU FFN (no gate)\n";
}
arch_has_post_norm
my $has_post = $model->arch_has_post_norm;
Returns true (1) if the model architecture applies post-normalization after attention and FFN blocks, false (0) otherwise.
Currently only Gemma2 uses post-normalization.
Example:
if ($model->arch_has_post_norm) {
    print "Model uses post-normalization (Gemma2-style)\n";
}
arch_is_recurrent
my $is_recurrent = $model->arch_is_recurrent;
Returns true (1) if the model is a recurrent architecture (MAMBA, RWKV), false (0) for standard transformer architectures.
Note: Recurrent architectures are detected, but inference for them is not yet fully implemented.
Example:
if ($model->arch_is_recurrent) {
    warn "Recurrent models not yet fully supported\n";
}
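Taken together, the four arch_* predicates describe the shape of the compute graph. A dispatch sketch, with prints standing in for real graph construction:

die "recurrent architectures need a different graph\n"
    if $model->arch_is_recurrent;

my $attn = $model->arch_has_combined_qkv ? 'fused QKV'    : 'split Q/K/V';
my $ffn  = $model->arch_has_ffn_gate     ? 'gated SwiGLU' : 'GELU';
print "attention: $attn, FFN: $ffn",
    ($model->arch_has_post_norm ? ', plus post-normalization' : ''), "\n";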
n_tensors
my $count = $model->n_tensors;
Returns the number of tensors in the model.
n_kv
my $count = $model->n_kv;
Returns the number of metadata key-value pairs in the model.
tensor_names
my @names = $model->tensor_names;
Returns a list of all tensor names in the model.
Example:
my @names = $model->tensor_names;
# Returns: ('token_embd.weight', 'blk.0.attn_norm.weight', ...)
tensor_info
my ($type, $n_dims, $ne0, $ne1, $ne2, $ne3) = $model->tensor_info($name);
Returns information about a specific tensor.
Parameters:
$name- The tensor name
Returns: A list containing:
$type - The ggml type code (0=F32, 1=F16, etc.)
$n_dims - Number of dimensions (1-4)
$ne0, $ne1, $ne2, $ne3 - Size of each dimension
Returns an empty list if the tensor is not found.
Example:
my ($type, $dims, @shape) = $model->tensor_info('token_embd.weight');
# For TinyLlama Q4_K_M: (12, 2, 2048, 32000, 1, 1)
# Type 12 = Q4_K, 2D tensor, shape [2048, 32000]
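Combined with tensor_names, this gives a quick structural summary of the whole model; $n_dims is used to trim the trailing size-1 dimensions:

for my $name ($model->tensor_names) {
    my ($type, $n_dims, @ne) = $model->tensor_info($name);
    printf "%-40s type=%-2d [%s]\n",
        $name, $type, join ', ', @ne[0 .. $n_dims - 1];
}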
kv_keys
my @keys = $model->kv_keys;
Returns a list of all metadata keys in the model.
Example:
my @keys = $model->kv_keys;
# Returns: ('general.architecture', 'llama.block_count', ...)
get_kv
my $value = $model->get_kv($key);
Returns the value of a metadata key.
Parameters:
$key- The metadata key name
Returns: The value as a scalar (string, number, or boolean), or an array reference for array values. Returns undef if the key is not found.
Example:
my $n_layers = $model->get_kv('llama.block_count'); # 22 for TinyLlama
my $n_embd = $model->get_kv('llama.embedding_length'); # 2048
my $vocab = $model->get_kv('tokenizer.ggml.tokens'); # ['<unk>', '<s>', ...]
COMMON METADATA KEYS
General
general.architecture - Model architecture (e.g., "llama", "qwen2", "phi3")
general.name - Model name
general.quantization_version - Quantization format version
Architecture-specific Keys
Metadata keys are prefixed with the architecture name. The architecture is auto-detected from general.architecture and used to look up parameters:
LLaMA-style (llama, mistral, etc.):
{arch}.block_count - Number of transformer layers
{arch}.embedding_length - Hidden dimension (n_embd)
{arch}.attention.head_count - Number of attention heads
{arch}.attention.head_count_kv - Number of KV heads (for GQA)
{arch}.attention.layer_norm_rms_epsilon - RMSNorm epsilon
{arch}.context_length - Maximum context length
{arch}.feed_forward_length - FFN intermediate dimension
{arch}.vocab_size - Vocabulary size
{arch}.rope.dimension_count - RoPE rotation dimensions
{arch}.rope.freq_base - RoPE frequency base (10000 for llama)
Where {arch} is the architecture name (e.g., "llama", "qwen2", "phi3", "gemma2").
Example for different architectures:
# LLaMA model
my $layers = $model->get_kv('llama.block_count');
# Qwen2 model
my $layers = $model->get_kv('qwen2.block_count');
# Phi-3 model
my $layers = $model->get_kv('phi3.block_count');
# Or use architecture() to build the key dynamically
my $arch = $model->architecture;
my $layers = $model->get_kv("$arch.block_count");
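The same pattern extends to a whole hyperparameter set. A sketch that gathers the common keys for whatever architecture was loaded, assuming the keys listed above are present:

my $arch = $model->architecture;
my %hp = map { $_ => $model->get_kv("$arch.$_") } qw(
    block_count embedding_length context_length
    feed_forward_length attention.head_count attention.head_count_kv
);
my $head_dim = $hp{embedding_length} / $hp{'attention.head_count'};
print "layers=$hp{block_count}, head_dim=$head_dim\n";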
Tokenizer
tokenizer.ggml.model - Tokenizer type (e.g., "llama", "gpt2")
tokenizer.ggml.tokens - Vocabulary tokens (array)
tokenizer.ggml.scores - Token scores (array)
tokenizer.ggml.token_type - Token types (array)
tokenizer.ggml.bos_token_id - Beginning of sequence token ID
tokenizer.ggml.eos_token_id - End of sequence token ID
tokenizer.ggml.unknown_token_id - Unknown token ID
tokenizer.ggml.padding_token_id - Padding token ID
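The special-token IDs index into the tokenizer.ggml.tokens array reference, so the two can be combined:

my $tokens = $model->get_kv('tokenizer.ggml.tokens');
my $bos    = $model->get_kv('tokenizer.ggml.bos_token_id');
my $eos    = $model->get_kv('tokenizer.ggml.eos_token_id');
printf "BOS token %d = %s, EOS token %d = %s\n",
    $bos, $tokens->[$bos], $eos, $tokens->[$eos];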
TENSOR NAMING CONVENTION
Tensor names follow a standard convention:
Embedding and Output
token_embd.weight - Token embedding matrix [n_embd, n_vocab]
output.weight - Output projection [n_vocab, n_embd]
output_norm.weight - Final layer norm
Attention Tensors (per layer N)
Separate Q/K/V (LLaMA, Mistral, Gemma, etc.):
blk.N.attn_norm.weight - Attention layer norm
blk.N.attn_q.weight - Query projection
blk.N.attn_k.weight - Key projection
blk.N.attn_v.weight - Value projection
blk.N.attn_output.weight - Attention output projection
Combined QKV (Phi, Qwen, BLOOM, GPT-2, GPT-J):
blk.N.attn_qkv.weight - Combined Q/K/V projection [3*n_embd, n_embd]
Post-normalization (Gemma2):
blk.N.attn_post_norm.weight - Post-attention layer norm
blk.N.ffn_post_norm.weight - Post-FFN layer norm
FFN Tensors (per layer N)
Gated FFN / SwiGLU (LLaMA, Mistral, Qwen, Gemma):
blk.N.ffn_norm.weight - FFN layer norm
blk.N.ffn_gate.weight - FFN gate projection (SwiGLU)
blk.N.ffn_up.weight - FFN up projection
blk.N.ffn_down.weight - FFN down projection
Standard FFN / GELU (GPT-2, Falcon, BLOOM, Phi):
blk.N.ffn_up.weight - FFN up projection (no gate)
blk.N.ffn_down.weight - FFN down projection
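The arch_* predicates described earlier tell you which of these names to expect. A sketch that checks layer 0 against the expected layout:

my @expect = qw(attn_norm attn_output ffn_up ffn_down);
if ($model->arch_has_combined_qkv) {
    push @expect, 'attn_qkv';
} else {
    push @expect, qw(attn_q attn_k attn_v);
}
push @expect, 'ffn_gate' if $model->arch_has_ffn_gate;

for my $part (@expect) {
    my @info = $model->tensor_info("blk.0.$part.weight");
    warn "blk.0.$part.weight not found\n" unless @info;
}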
THREAD SAFETY
Lugh::Model objects are NOT thread-safe. Each Perl thread must create its own Model object. The XS code uses a registry pattern with mutex locks for the global registry, but individual model contexts should not be shared across threads.
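That means loading the model inside each thread rather than passing one object around. A minimal sketch with the core threads module (each thread pays the full load cost):

use threads;

my @workers = map {
    threads->create(sub {
        # every thread owns its own instance
        my $model = Lugh::Model->new(model => '/path/to/model.gguf');
        return $model->n_tensors;
    });
} 1 .. 2;
print $_->join, " tensors\n" for @workers;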
MEMORY USAGE
Loading a model allocates memory for all tensors. Memory usage depends on the quantization:
Model Size    Q4_K_M    Q8_0      F16
1.1B params   0.6 GB    1.1 GB    2.2 GB
7B params     4.0 GB    7.0 GB    14 GB
13B params    7.4 GB    13 GB     26 GB
The memory is freed when the Model object goes out of scope.
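The figures above follow from a back-of-the-envelope rule of parameters × bits-per-weight / 8. The bpw values here are approximate averages (K-quant files mix tensor types, and the table rounds):

my %bpw = (Q4_K_M => 4.6, Q8_0 => 8.5, F16 => 16);
my $n_params = 7e9;   # a 7B-parameter model
printf "%-7s ~%.1f GB\n", $_, $n_params * $bpw{$_} / 8 / 1e9
    for sort keys %bpw;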
SEE ALSO
Lugh, Lugh::Tokenizer, Lugh::Inference
https://github.com/ggerganov/ggml/blob/master/docs/gguf.md - GGUF specification
AUTHOR
lnation <email@lnation.org>
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.