NAME
Lugh::Tokenizer - BPE Tokenizer for Text Encoding and Decoding
VERSION
Version 0.01
SYNOPSIS
use Lugh;
# Create tokenizer from a model
my $model = Lugh::Model->new(model => '/path/to/model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
# Encode text to tokens
my @tokens = $tokenizer->encode("Hello, world!");
# Returns: (1, 15043, 29892, 3186, 29991)
# BOS, Hello, ",", world, "!"
# Encode without BOS token
my @tokens = $tokenizer->encode("Hello", add_bos => 0);
# Decode tokens back to text
my $text = $tokenizer->decode(@tokens);
my $text = $tokenizer->decode(\@tokens); # Array ref also works
# Get vocabulary information
my $vocab_size = $tokenizer->n_vocab; # 32000 for LLaMA
my $bos = $tokenizer->bos_id; # 1
my $eos = $tokenizer->eos_id; # 2
DESCRIPTION
Lugh::Tokenizer provides text tokenization using BPE (Byte Pair Encoding) with vocabulary loaded from a GGUF model file. It supports encoding text to token IDs and decoding token IDs back to text.
The tokenizer uses a greedy longest-match algorithm for encoding, which is efficient but may not produce optimal tokenization in all cases. For most use cases with LLaMA-style models, this produces correct results.
SentencePiece Compatibility
The tokenizer handles SentencePiece's special underscore prefix (▁), which represents a word boundary (the space before a word). When encoding:
"Hello world" → tokens for "▁Hello" and "▁world"
When decoding, the ▁ prefix is converted back to a space:
token "▁Paris" → " Paris"
Special Tokens
Most models include special tokens:
BOS (Beginning of Sequence) - Added at start of input
EOS (End of Sequence) - Indicates generation should stop
UNK (Unknown) - Used for characters not in vocabulary
These are typically:
<s> - BOS (ID 1)
</s> - EOS (ID 2)
<unk> - UNK (ID 0)
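Because these IDs can vary between models, query them from the tokenizer rather than hard-coding them. For example (assuming @generated_tokens holds token IDs produced by the model):

    my $eos = $tokenizer->eos_id;
    for my $token (@generated_tokens) {
        last if $token == $eos;   # stop at end-of-sequence
        print $tokenizer->decode([$token]);
    }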
CONSTRUCTOR
new
my $tokenizer = Lugh::Tokenizer->new(
model => $model
);
Creates a new Tokenizer from a loaded model.
Parameters:
model (required) - A Lugh::Model object
Returns: A Lugh::Tokenizer object.
Throws: Dies if no model is provided or if the model has no vocabulary.
Example:
my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
METHODS
encode
my @tokens = $tokenizer->encode($text);
my @tokens = $tokenizer->encode($text, add_bos => 0);
Encodes text into a sequence of token IDs.
Parameters:
$text - The text string to encode
add_bos - Whether to prepend the BOS token (default: 1)
Returns: A list of token IDs.
Algorithm:
The encoder uses greedy longest-match tokenization:
1. Start at the beginning of the text
2. Try to match the longest possible token
3. For word boundaries (after a space or at the start), try with the ▁ prefix first
4. If no match is found, emit UNK and skip one character
5. Repeat until the end of the text
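The following is a minimal sketch of that loop in plain Perl. It is an illustration only, not the module's actual implementation: %vocab (a hash mapping token text to IDs) and $unk_id stand in for the model's real vocabulary.

    # Illustrative sketch of greedy longest-match encoding
    sub greedy_encode {
        my ($text, $vocab, $unk_id) = @_;

        # SentencePiece-style: mark each word boundary with ▁ (U+2581)
        $text =~ s/ /\x{2581}/g;
        $text = "\x{2581}" . $text;

        my @ids;
        my $pos = 0;
        while ($pos < length $text) {
            my $matched = 0;
            # Try the longest remaining piece first, shrinking on failure
            for (my $len = length($text) - $pos; $len > 0; $len--) {
                my $piece = substr($text, $pos, $len);
                if (exists $vocab->{$piece}) {
                    push @ids, $vocab->{$piece};
                    $pos += $len;
                    $matched = 1;
                    last;
                }
            }
            unless ($matched) {
                push @ids, $unk_id;   # no vocabulary match: emit UNK
                $pos++;               # and skip one character
            }
        }
        return @ids;
    }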
Example:
my @tokens = $tokenizer->encode("The capital of France is");
# Returns: (1, 450, 7483, 310, 3444, 338)
# BOS, The, capital, of, France, is
# Without BOS:
my @tokens = $tokenizer->encode("Paris", add_bos => 0);
# Returns: (3681) # Just "▁Paris"
decode
my $text = $tokenizer->decode(@token_ids);
my $text = $tokenizer->decode(\@token_ids);
Decodes a sequence of token IDs back to text.
Parameters:
@token_ids - List of token IDs, or an array reference
Returns: The decoded text string.
Notes:
Special tokens (<s>, </s>, etc.) are skipped
The SentencePiece ▁ prefix is converted to a space
Unknown token IDs produce an empty string at that position
Example:
my $text = $tokenizer->decode(3681);
# Returns: " Paris"
my $text = $tokenizer->decode(1, 15043, 29892, 3186);
# Returns: "Hello, world" (BOS token skipped)
# Array reference syntax:
my $text = $tokenizer->decode([3681, 338]);
# Returns: " Paris is"
n_vocab
my $size = $tokenizer->n_vocab;
Returns the vocabulary size.
Example:
print "Vocabulary: ", $tokenizer->n_vocab, " tokens\n";
# Vocabulary: 32000 tokens
bos_id
my $id = $tokenizer->bos_id;
Returns the BOS (Beginning of Sequence) token ID.
eos_id
my $id = $tokenizer->eos_id;
Returns the EOS (End of Sequence) token ID.
TOKEN TYPES
Different types of tokens in the vocabulary:
Regular Tokens
Normal subword units:
"hello" → Single token
"▁world" → Word with space prefix
"ing" → Common suffix
Special Tokens
Control tokens with special meaning:
<s> → BOS (beginning of sequence)
</s> → EOS (end of sequence)
<unk> → Unknown token
<pad> → Padding token
Byte Fallback Tokens
For characters not in vocabulary (LLaMA models):
<0x00> through <0xFF> → Raw byte tokens
This allows encoding any UTF-8 text, even with unseen characters.
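As an illustration, a byte-fallback token's text maps back to a raw byte as below (byte_token_to_byte is a hypothetical helper, not part of this module):

    sub byte_token_to_byte {
        my ($token_text) = @_;
        # "<0xE2>" -> the single byte 0xE2
        return chr hex $1 if $token_text =~ /^<0x([0-9A-Fa-f]{2})>$/;
        return undef;   # not a byte-fallback token
    }
    # Consecutive byte tokens concatenate into a UTF-8 sequence:
    # <0xE2> <0x82> <0xAC> is the UTF-8 encoding of the Euro sign.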
COMMON PATTERNS
Basic Tokenization
my $model = Lugh::Model->new(model => $path);
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my @tokens = $tokenizer->encode("Hello, world!");
print "Tokens: @tokens\n";
my $decoded = $tokenizer->decode(@tokens);
print "Decoded: $decoded\n";
Token Inspection
# See what each token represents
my @tokens = $tokenizer->encode("The quick brown fox");
for my $id (@tokens) {
my $text = $tokenizer->decode([$id]);
printf "Token %5d: '%s'\n", $id, $text;
}
Chat Template
# Build a chat prompt (LLaMA 2 format)
my $prompt = "<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is the capital of France? [/INST]";
my @tokens = $tokenizer->encode($prompt, add_bos => 0);
Streaming Decode
# Decode one token at a time (for streaming output)
for my $token (@generated_tokens) {
my $text = $tokenizer->decode([$token]);
print $text;
STDOUT->flush();
}
LIMITATIONS
Greedy Algorithm - May not produce optimal BPE tokenization
No Merge Rules - Does not use BPE merge rules, just vocabulary lookup
UTF-8 Only - Input text must be valid UTF-8
No Normalization - Does not perform Unicode normalization
For most LLM inference use cases, these limitations do not significantly impact results.
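Regarding the UTF-8 requirement: if your input arrives as raw bytes, decode it to a Perl character string before tokenizing. A sketch, assuming UTF-8 input in $raw_bytes:

    use Encode qw(decode);
    my $text   = decode('UTF-8', $raw_bytes);
    my @tokens = $tokenizer->encode($text);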
THREAD SAFETY
Lugh::Tokenizer objects are NOT thread-safe. Each Perl thread must create its own Tokenizer object. Threads may load the same underlying model file, but each thread should construct its own Model and Tokenizer objects rather than sharing them across threads.
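For example, with Perl ithreads (a sketch; each thread constructs its own objects from the same model file):

    use threads;

    my @workers = map {
        threads->create(sub {
            # Per-thread objects; do not pass these between threads
            my $model     = Lugh::Model->new(model => 'model.gguf');
            my $tokenizer = Lugh::Tokenizer->new(model => $model);
            return [ $tokenizer->encode("thread-local text") ];
        });
    } 1 .. 4;

    my @results = map { $_->join } @workers;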
SEE ALSO
Lugh, Lugh::Model, Lugh::Inference
https://github.com/google/sentencepiece - SentencePiece tokenizer
https://arxiv.org/abs/1508.07909 - BPE paper
AUTHOR
lnation <email@lnation.org>
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.