NAME

Lugh::Tokenizer - BPE Tokenizer for Text Encoding and Decoding

VERSION

Version 0.01

SYNOPSIS

use Lugh;

# Create tokenizer from a model
my $model = Lugh::Model->new(model => '/path/to/model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);

# Encode text to tokens
my @tokens = $tokenizer->encode("Hello, world!");
# Returns: (1, 15043, 29892, 3186, 29991)
#          BOS, Hello, ",", world, "!"

# Encode without BOS token
my @tokens = $tokenizer->encode("Hello", add_bos => 0);

# Decode tokens back to text
my $text = $tokenizer->decode(@tokens);
my $text = $tokenizer->decode(\@tokens);  # Array ref also works

# Get vocabulary information
my $vocab_size = $tokenizer->n_vocab;  # 32000 for LLaMA
my $bos = $tokenizer->bos_id;          # 1
my $eos = $tokenizer->eos_id;          # 2

DESCRIPTION

Lugh::Tokenizer provides text tokenization using BPE (Byte Pair Encoding) with vocabulary loaded from a GGUF model file. It supports encoding text to token IDs and decoding token IDs back to text.

The tokenizer uses a greedy longest-match algorithm for encoding, which is efficient but may not produce optimal tokenization in all cases. For most use cases with LLaMA-style models, this produces correct results.

SentencePiece Compatibility

The tokenizer handles SentencePiece's word-boundary prefix ▁ (U+2581, which looks like an underscore) representing the space before a word. When encoding:

"Hello world" → tokens for "▁Hello" and "▁world"

When decoding, the ▁ prefix is converted back to a space:

token "▁Paris" → " Paris"

Special Tokens

Most models include special tokens:

  • BOS (Beginning of Sequence) - Added at start of input

  • EOS (End of Sequence) - Indicates generation should stop

  • UNK (Unknown) - Used for characters not in the vocabulary

These are typically:

<s>   - BOS (ID 1)
</s>  - EOS (ID 2)
<unk> - UNK (ID 0)
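
The IDs can be read from the tokenizer at runtime rather than hard-coded:

printf "BOS=%d EOS=%d\n", $tokenizer->bos_id, $tokenizer->eos_id;
# BOS=1 EOS=2 (typical for LLaMA models)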

CONSTRUCTOR

new

my $tokenizer = Lugh::Tokenizer->new(
    model => $model
);

Creates a new Tokenizer from a loaded model.

Parameters:

  • model (required) - A Lugh::Model object

Returns: A Lugh::Tokenizer object.

Throws: Dies if no model is provided or if the model has no vocabulary.

Example:

my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);

METHODS

encode

my @tokens = $tokenizer->encode($text);
my @tokens = $tokenizer->encode($text, add_bos => 0);

Encodes text into a sequence of token IDs.

Parameters:

  • $text - The text string to encode

  • add_bos - Whether to prepend BOS token (default: 1)

Returns: A list of token IDs.

Algorithm:

The encoder uses greedy longest-match tokenization:

1. Start at the beginning of the text
2. Try to match the longest possible token
3. For word boundaries (after space/start), try with ▁ prefix first
4. If no match found, emit UNK and skip one character
5. Repeat until end of text
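
In pure-Perl terms the loop looks roughly like this (a sketch only, not the actual implementation; %$vocab and $unk_id are illustrative stand-ins for the model's vocabulary and UNK ID):

sub greedy_encode_sketch {
    my ($text, $vocab, $unk_id) = @_;
    my @ids;
    my $pos = 0;
    while ($pos < length $text) {
        my ($id, $matched);
        # Try the longest remaining piece first, shrinking until a match is found
        for (my $len = length($text) - $pos; $len >= 1; $len--) {
            my $piece = substr $text, $pos, $len;
            $piece =~ s/^ /\x{2581}/;                 # a leading space becomes ▁
            $piece = "\x{2581}$piece" if $pos == 0;   # start of text is a boundary too
            if (exists $vocab->{$piece}) {
                ($id, $matched) = ($vocab->{$piece}, $len);
                last;
            }
        }
        if (defined $id) { push @ids, $id;      $pos += $matched; }
        else             { push @ids, $unk_id;  $pos++;           }  # emit UNK, skip one char
    }
    return @ids;
}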

Example:

my @tokens = $tokenizer->encode("The capital of France is");
# Returns: (1, 450, 7483, 310, 3444, 338)
#          BOS, The, capital, of, France, is

# Without BOS:
my @tokens = $tokenizer->encode("Paris", add_bos => 0);
# Returns: (3681)  # Just "▁Paris"
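
A common practical use of encode is token counting, for example to check that a prompt fits within the model's context window (a sketch; $prompt holds your input text and the 2048-token limit is illustrative):

my @tokens = $tokenizer->encode($prompt);
printf "Prompt uses %d tokens\n", scalar @tokens;
die "Prompt too long\n" if @tokens > 2048;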

decode

my $text = $tokenizer->decode(@token_ids);
my $text = $tokenizer->decode(\@token_ids);

Decodes a sequence of token IDs back to text.

Parameters:

  • @token_ids - List of token IDs, or an array reference

Returns: The decoded text string.

Notes:

  • Special tokens (<s>, </s>, etc.) are skipped

  • SentencePiece ▁ prefix is converted to space

  • Unknown token IDs produce an empty string for that position

Example:

my $text = $tokenizer->decode(3681);
# Returns: " Paris"

my $text = $tokenizer->decode(1, 15043, 29892, 3186);
# Returns: "Hello, world"  (BOS token skipped)

# Array reference syntax:
my $text = $tokenizer->decode([3681, 338]);
# Returns: " Paris is"

n_vocab

my $size = $tokenizer->n_vocab;

Returns the vocabulary size.

Example:

print "Vocabulary: ", $tokenizer->n_vocab, " tokens\n";
# Vocabulary: 32000 tokens

bos_id

my $id = $tokenizer->bos_id;

Returns the BOS (Beginning of Sequence) token ID.

eos_id

my $id = $tokenizer->eos_id;

Returns the EOS (End of Sequence) token ID.
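
The EOS ID is most often used to detect when generation should stop (a sketch; @generated_tokens stands in for tokens produced by your generation loop):

my $eos = $tokenizer->eos_id;
for my $token (@generated_tokens) {
    last if $token == $eos;              # stop at end-of-sequence
    print $tokenizer->decode([$token]);
}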

TOKEN TYPES

The vocabulary contains several different types of tokens:

Regular Tokens

Normal subword units:

"hello"  → Single token
"▁world" → Word with space prefix
"ing"    → Common suffix

Special Tokens

Control tokens with special meaning:

<s>     → BOS (beginning of sequence)
</s>    → EOS (end of sequence)
<unk>   → Unknown token
<pad>   → Padding token

Byte Fallback Tokens

For characters not in the vocabulary (LLaMA models):

<0x00> through <0xFF>  → Raw byte tokens

This allows encoding any UTF-8 text, even with unseen characters.
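
For example, text containing a character with no dedicated vocabulary entry still encodes cleanly via the byte tokens (the exact tokens produced depend on the model's vocabulary):

# The emoji is unlikely to have its own token, so it is encoded as
# a sequence of <0x..> byte fallback tokens
my @tokens = $tokenizer->encode("Hello \x{2615}", add_bos => 0);
printf "Encoded into %d tokens\n", scalar @tokens;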

COMMON PATTERNS

Basic Tokenization

my $model = Lugh::Model->new(model => $path);
my $tokenizer = Lugh::Tokenizer->new(model => $model);

my @tokens = $tokenizer->encode("Hello, world!");
print "Tokens: @tokens\n";

my $decoded = $tokenizer->decode(@tokens);
print "Decoded: $decoded\n";

Token Inspection

# See what each token represents
my @tokens = $tokenizer->encode("The quick brown fox");
for my $id (@tokens) {
    my $text = $tokenizer->decode([$id]);
    printf "Token %5d: '%s'\n", $id, $text;
}

Chat Template

# Build a chat prompt (LLaMA 2 format)
my $prompt = "<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

What is the capital of France? [/INST]";

my @tokens = $tokenizer->encode($prompt, add_bos => 0);

Streaming Decode

# Decode one token at a time (for streaming output)
for my $token (@generated_tokens) {
    my $text = $tokenizer->decode([$token]);
    print $text;
    STDOUT->flush();
}

LIMITATIONS

  • Greedy Algorithm - May not produce optimal BPE tokenization

  • No Merge Rules - Does not use BPE merge rules, just vocabulary lookup

  • UTF-8 Only - Input text must be valid UTF-8

  • No Normalization - Does not perform Unicode normalization

For most LLM inference use cases, these limitations do not significantly impact results.

THREAD SAFETY

Lugh::Tokenizer objects are NOT thread-safe. Each Perl thread must create its own Tokenizer object (the Tokenizers may share the same Model, provided each Tokenizer is created separately in its own thread).

SEE ALSO

Lugh, Lugh::Model, Lugh::Inference

https://github.com/google/sentencepiece - SentencePiece tokenizer

https://arxiv.org/abs/1508.07909 - BPE paper

AUTHOR

lnation <email@lnation.org>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
