NAME

Lugh::Speculative - Speculative decoding for faster LLM inference

VERSION

Version 0.10

SYNOPSIS

use Lugh;
use Lugh::Model;
use Lugh::Tokenizer;
use Lugh::Inference;
use Lugh::Speculative;

# Load main (target) and draft models
my $main_model  = Lugh::Model->new(model => 'llama-7b.gguf');
my $draft_model = Lugh::Model->new(model => 'llama-68m.gguf');

# Create tokenizer (use the main model's tokenizer - the two models should be compatible)
my $tokenizer = Lugh::Tokenizer->new(model => $main_model);

# Create inference engines
my $main_inf  = Lugh::Inference->new(model => $main_model, n_ctx => 512, n_threads => 4);
my $draft_inf = Lugh::Inference->new(model => $draft_model, n_ctx => 512, n_threads => 4);

# Create speculative decoder
my $spec = Lugh::Speculative->new(
    inference   => $main_inf,    # Main/target model
    draft       => $draft_inf,   # Draft model (smaller, faster)
    k           => 4,            # Number of draft tokens per step
    temperature => 0.8,
    top_p       => 0.95,
);

# Tokenize prompt
my @prompt_tokens = $tokenizer->encode("The future of AI is");

# Seed C RNG for reproducible draft sampling
Lugh::srand(42);

# Generate tokens speculatively
my $output = $spec->generate(\@prompt_tokens, 100);  # Generate up to 100 tokens

# Decode output
my $text = $tokenizer->decode($output);

# Check acceptance rate
printf "Acceptance rate: %.2f%%\n", $spec->acceptance_rate * 100;
printf "Tokens drafted: %d\n", $spec->tokens_drafted;
printf "Tokens accepted: %d\n", $spec->tokens_accepted;

DESCRIPTION

Lugh::Speculative implements speculative decoding, a technique for accelerating LLM inference by using a smaller "draft" model to generate candidate tokens that are then verified in parallel by the larger "main" model.

The key insight is that verifying multiple tokens in a single forward pass of the main model is much faster than generating them one at a time, provided the draft model's predictions are usually correct. When the draft model guesses wrong, only the tokens from the first mismatch onward are discarded and regenerated.

How It Works

1. Draft Phase

The draft model generates K candidate tokens autoregressively.

2. Verify Phase

The main model processes all draft tokens in a single forward pass and computes probabilities for each position.

3. Accept/Reject

Tokens are accepted if the main model assigns them sufficient probability; the first rejected token and all subsequent tokens are discarded (see the sketch after this list).

4. Bonus Token

When a draft token is rejected, the main model samples a corrected token from its own probability distribution, so every step produces at least one accepted token.
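The accept/reject rule (steps 3 and 4) can be illustrated with a small standalone sketch. This is not the module's internal code; the token IDs and probabilities below are made-up values, and the standard rule of accepting a draft token with probability min(1, p_main / p_draft) is assumed.

use strict;
use warnings;

# @draft holds K candidate token IDs from the draft model; @p_draft and
# @p_main are the probabilities each model assigned to those tokens
# (toy numbers, purely for illustration).
my @draft   = (101, 57, 8000, 12);
my @p_draft = (0.40, 0.35, 0.20, 0.30);
my @p_main  = (0.38, 0.10, 0.25, 0.05);

my @accepted;
for my $i (0 .. $#draft) {
    # Accept with probability min(1, p_main / p_draft)
    if (rand() < $p_main[$i] / $p_draft[$i]) {
        push @accepted, $draft[$i];
    }
    else {
        # First rejection: this token and everything after it is discarded.
        # The main model would then sample a corrected token in its place.
        last;
    }
}

printf "accepted %d of %d draft tokens\n", scalar(@accepted), scalar(@draft);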

Performance Benefits

Speculative decoding can provide a 2-3x speedup (a rough model of the expected gain is sketched after this list), depending on:

  • How well the draft model predicts the main model's outputs

  • The relative sizes of the draft and main models

  • The speculation depth K
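As a rough back-of-the-envelope model from the speculative decoding literature (and assuming, unrealistically, that each draft token is accepted independently with the same probability alpha), the expected number of tokens produced per main-model forward pass is (1 - alpha^(K+1)) / (1 - alpha):

use strict;
use warnings;

# Expected tokens per main-model pass under an i.i.d. acceptance model
# (an approximation only - real acceptance rates vary token by token).
sub expected_tokens_per_pass {
    my ($alpha, $k) = @_;
    return $k + 1 if $alpha >= 1;
    return (1 - $alpha ** ($k + 1)) / (1 - $alpha);
}

printf "alpha=0.8, k=4: %.2f tokens per pass\n", expected_tokens_per_pass(0.8, 4);
# prints: alpha=0.8, k=4: 3.36 tokens per pass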

METHODS

new

my $spec = Lugh::Speculative->new(%options);

Create a new speculative decoder.

Options:

inference (required)

The main/target Lugh::Inference object (larger model). Alias: main

draft (required)

The draft Lugh::Inference object (smaller, faster model). Alias: draft_inference

k (default: 4)

Speculation depth - number of draft tokens to generate per step. Valid range: 1-16. Alias: depth

temperature (default: 0.8)

Sampling temperature for both models.

top_p (default: 0.95)

Top-p (nucleus) sampling threshold.
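For example, an equivalent constructor call using the documented aliases:

my $spec = Lugh::Speculative->new(
    main            => $main_inf,    # alias for inference
    draft_inference => $draft_inf,   # alias for draft
    depth           => 8,            # alias for k (valid range 1-16)
    temperature     => 0.8,
    top_p           => 0.95,
);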

generate

my $tokens = $spec->generate(\@input_tokens, $max_tokens);

Generate tokens speculatively.

Arguments:

input_tokens

Array reference of input token IDs (the prompt).

max_tokens

Maximum number of tokens to generate (default: 256).

Returns an array reference of generated token IDs.

step

my $accepted = $spec->step(\@current_tokens);

Perform one speculation step: draft K tokens, verify, return accepted.

Returns an array reference of accepted token IDs.
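A minimal sketch of driving generation manually with step(). It assumes init_caches() should be called before the first step when generate() is not used, and it stops after a fixed number of new tokens; both are assumptions about usage rather than behaviour stated in this document.

$spec->init_caches();

my @tokens    = @prompt_tokens;   # prompt token IDs, as in the SYNOPSIS
my $max_new   = 100;
my $generated = 0;

while ($generated < $max_new) {
    my $accepted = $spec->step(\@tokens);   # draft K tokens, verify, accept
    last unless @$accepted;                 # defensive stop
    push @tokens, @$accepted;
    $generated += scalar @$accepted;
}

# Decode only the newly generated tokens
my $new_text = $tokenizer->decode([ @tokens[ scalar(@prompt_tokens) .. $#tokens ] ]);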

draft_tokens

my $drafted = $spec->draft_tokens(\@input_tokens, $n_draft);

Generate N draft tokens using the draft model.

Returns an array reference of drafted token IDs.

verify_tokens

my $accepted = $spec->verify_tokens(\@input_tokens, \@draft_tokens);

Verify draft tokens using the main model.

Returns an array reference of accepted token IDs.
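The two phases can also be invoked separately. A small sketch of one hand-rolled speculation step (equivalent in spirit to step(); whether step() is implemented exactly this way is not specified here):

my $drafted  = $spec->draft_tokens(\@tokens, $spec->k);   # draft phase
my $accepted = $spec->verify_tokens(\@tokens, $drafted);  # verify phase

printf "drafted %d, accepted %d\n", scalar(@$drafted), scalar(@$accepted);
push @tokens, @$accepted;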

init_caches

my $ok = $spec->init_caches();

Initialize KV caches for both models. Called automatically by generate().

Returns 1 on success, croaks on failure.

Accessors

k

Returns the speculation depth.

temperature

Returns the sampling temperature.

top_p

Returns the top-p threshold.

n_vocab

Returns the vocabulary size (shared between models).

Statistics

acceptance_rate

Returns the ratio of accepted to drafted tokens (0.0 - 1.0).

tokens_drafted

Returns the total number of tokens drafted.

tokens_accepted

Returns the total number of tokens accepted.

total_steps

Returns the total number of speculation steps.

reset_stats

Reset all statistics counters to zero.
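For example, to report the counters after a run and clear them before the next one:

printf "steps:      %d\n", $spec->total_steps;
printf "drafted:    %d\n", $spec->tokens_drafted;
printf "accepted:   %d\n", $spec->tokens_accepted;
printf "acceptance: %.2f%%\n", $spec->acceptance_rate * 100;

$spec->reset_stats;   # counters back to zero for the next run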

REQUIREMENTS

  • Both models must have the same vocabulary size

  • The draft model should be significantly smaller and faster than the main model

  • Models should be compatible (e.g., same tokenizer, similar training)

SEE ALSO

Lugh, Lugh::Inference, Lugh::Model, Lugh::KVCache

REFERENCES

  • "Fast Inference from Transformers via Speculative Decoding" (Leviathan et al., 2022)

  • "Accelerating Large Language Model Decoding with Speculative Sampling" (Chen et al., 2023)

AUTHOR

lnation <email at example.com>

LICENSE AND COPYRIGHT

This software is copyright (c) 2026 by lnation.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.