NAME

Lugh::Speculative - Speculative decoding for faster LLM inference

VERSION

Version 0.10

SYNOPSIS

use Lugh;
use Lugh::Model;
use Lugh::Tokenizer;
use Lugh::Inference;
use Lugh::Speculative;

# Load main (target) and draft models
my $main_model  = Lugh::Model->new(model => 'llama-7b.gguf');
my $draft_model = Lugh::Model->new(model => 'llama-68m.gguf');

# Create tokenizer (use the main model's tokenizer - the two models should be compatible)
my $tokenizer = Lugh::Tokenizer->new(model => $main_model);

# Create inference engines
my $main_inf  = Lugh::Inference->new(model => $main_model, n_ctx => 512, n_threads => 4);
my $draft_inf = Lugh::Inference->new(model => $draft_model, n_ctx => 512, n_threads => 4);

# Create speculative decoder
my $spec = Lugh::Speculative->new(
    inference   => $main_inf,    # Main/target model
    draft       => $draft_inf,   # Draft model (smaller, faster)
    k           => 4,            # Number of draft tokens per step
    temperature => 0.8,
    top_p       => 0.95,
);

# Tokenize prompt
my @prompt_tokens = $tokenizer->encode("The future of AI is");

# Seed C RNG for reproducible draft sampling
Lugh::srand(42);

# Generate tokens speculatively
my $output = $spec->generate(\@prompt_tokens, 100);  # Generate up to 100 tokens

# Decode output
my $text = $tokenizer->decode($output);

# Check acceptance rate
printf "Acceptance rate: %.2f%%\n", $spec->acceptance_rate * 100;
printf "Tokens drafted: %d\n", $spec->tokens_drafted;
printf "Tokens accepted: %d\n", $spec->tokens_accepted;

DESCRIPTION

Lugh::Speculative implements speculative decoding, a technique for accelerating LLM inference by using a smaller "draft" model to generate candidate tokens that are then verified in parallel by the larger "main" model.

The key insight is that verifying multiple tokens in a single forward pass of the main model is much faster than generating them one at a time, provided the draft model's predictions are usually correct. When the draft model guesses wrong, only the tokens from the first mismatch onward are discarded and regenerated.

How It Works

1. Draft Phase

The draft model generates K candidate tokens autoregressively.

2. Verify Phase

The main model processes all draft tokens in a single forward pass and computes probabilities for each position.

3. Accept/Reject

Tokens are accepted if the main model assigns them sufficient probability; the first rejected token and all subsequent tokens are discarded (see the sketch after this list).

4. Bonus Token

When a draft token is rejected, the main model samples a corrected token from its own probability distribution, so every step produces at least one accepted token.
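The accept/reject rule (steps 3 and 4) can be illustrated with a small standalone sketch. This is not the module's internal code; the token IDs and probabilities below are made-up values, and the standard rule of accepting a draft token with probability min(1, p_main / p_draft) is assumed.

use strict;
use warnings;

# @draft holds K candidate token IDs from the draft model; @p_draft and
# @p_main are the probabilities each model assigned to those tokens
# (toy numbers, purely for illustration).
my @draft   = (101, 57, 8000, 12);
my @p_draft = (0.40, 0.35, 0.20, 0.30);
my @p_main  = (0.38, 0.10, 0.25, 0.05);

my @accepted;
for my $i (0 .. $#draft) {
    # Accept with probability min(1, p_main / p_draft)
    if (rand() < $p_main[$i] / $p_draft[$i]) {
        push @accepted, $draft[$i];
    }
    else {
        # First rejection: this token and everything after it is discarded.
        # The main model would then sample a corrected token in its place.
        last;
    }
}

printf "accepted %d of %d draft tokens\n", scalar(@accepted), scalar(@draft);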

Performance Benefits

Speculative decoding can provide a 2-3x speedup (a rough model of the expected gain is sketched after this list), depending on:

  • How well the draft model predicts the main model's outputs

  • The relative sizes of the draft and main models

  • The speculation depth K
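As a rough back-of-the-envelope model from the speculative decoding literature (and assuming, unrealistically, that each draft token is accepted independently with the same probability alpha), the expected number of tokens produced per main-model forward pass is (1 - alpha^(K+1)) / (1 - alpha):

use strict;
use warnings;

# Expected tokens per main-model pass under an i.i.d. acceptance model
# (an approximation only - real acceptance rates vary token by token).
sub expected_tokens_per_pass {
    my ($alpha, $k) = @_;
    return $k + 1 if $alpha >= 1;
    return (1 - $alpha ** ($k + 1)) / (1 - $alpha);
}

printf "alpha=0.8, k=4: %.2f tokens per pass\n", expected_tokens_per_pass(0.8, 4);
# prints: alpha=0.8, k=4: 3.36 tokens per pass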

METHODS

new

my $spec = Lugh::Speculative->new(%options);

Create a new speculative decoder.

Options:

inference (required)

The main/target Lugh::Inference object (larger model). Alias: main

draft (required)

The draft Lugh::Inference object (smaller, faster model). Alias: draft_inference

k (default: 4)

Speculation depth - number of draft tokens to generate per step. Valid range: 1-16. Alias: depth

temperature (default: 0.8)

Sampling temperature for both models.

top_p (default: 0.95)

Top-p (nucleus) sampling threshold.
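For example, an equivalent constructor call using the documented aliases:

my $spec = Lugh::Speculative->new(
    main            => $main_inf,    # alias for inference
    draft_inference => $draft_inf,   # alias for draft
    depth           => 8,            # alias for k (valid range 1-16)
    temperature     => 0.8,
    top_p           => 0.95,
);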

generate

my $tokens = $spec->generate(\@input_tokens, $max_tokens);

Generate tokens speculatively.

Arguments:

input_tokens

Array reference of input token IDs (the prompt).

max_tokens

Maximum number of tokens to generate (default: 256).

Returns an array reference of generated token IDs.

step

my $accepted = $spec->step(\@current_tokens);

Perform one speculation step: draft K tokens, verify, return accepted.

Returns an array reference of accepted token IDs.
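A minimal sketch of driving generation manually with step(). It assumes init_caches() should be called before the first step when generate() is not used, and it stops after a fixed number of new tokens; both are assumptions about usage rather than behaviour stated in this document.

$spec->init_caches();

my @tokens    = @prompt_tokens;   # prompt token IDs, as in the SYNOPSIS
my $max_new   = 100;
my $generated = 0;

while ($generated < $max_new) {
    my $accepted = $spec->step(\@tokens);   # draft K tokens, verify, accept
    last unless @$accepted;                 # defensive stop
    push @tokens, @$accepted;
    $generated += scalar @$accepted;
}

# Decode only the newly generated tokens
my $new_text = $tokenizer->decode([ @tokens[ scalar(@prompt_tokens) .. $#tokens ] ]);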

draft_tokens

my $drafted = $spec->draft_tokens(\@input_tokens, $n_draft);

Generate N draft tokens using the draft model.

Returns an array reference of drafted token IDs.

verify_tokens

my $accepted = $spec->verify_tokens(\@input_tokens, \@draft_tokens);

Verify draft tokens using the main model.

Returns an array reference of accepted token IDs.
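The two phases can also be invoked separately. A small sketch of one hand-rolled speculation step (equivalent in spirit to step(); whether step() is implemented exactly this way is not specified here):

my $drafted  = $spec->draft_tokens(\@tokens, $spec->k);   # draft phase
my $accepted = $spec->verify_tokens(\@tokens, $drafted);  # verify phase

printf "drafted %d, accepted %d\n", scalar(@$drafted), scalar(@$accepted);
push @tokens, @$accepted;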

init_caches

my $ok = $spec->init_caches();

Initialize KV caches for both models. Called automatically by generate().

Returns 1 on success, croaks on failure.

Accessors

k

Returns the speculation depth.

temperature

Returns the sampling temperature.

top_p

Returns the top-p threshold.

n_vocab

Returns the vocabulary size (shared between models).

Statistics

acceptance_rate

Returns the ratio of accepted to drafted tokens (0.0 - 1.0).

tokens_drafted

Returns the total number of tokens drafted.

tokens_accepted

Returns the total number of tokens accepted.

total_steps

Returns the total number of speculation steps.

reset_stats

Reset all statistics counters to zero.
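For example, to report the counters after a run and clear them before the next one:

printf "steps:      %d\n", $spec->total_steps;
printf "drafted:    %d\n", $spec->tokens_drafted;
printf "accepted:   %d\n", $spec->tokens_accepted;
printf "acceptance: %.2f%%\n", $spec->acceptance_rate * 100;

$spec->reset_stats;   # counters back to zero for the next run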

REQUIREMENTS

  • Both models must have the same vocabulary size

  • The draft model should be significantly smaller and faster than the main model

  • Models should be compatible (e.g., same tokenizer, similar training)

SEE ALSO

Lugh, Lugh::Inference, Lugh::Model, Lugh::KVCache

REFERENCES

  • "Fast Inference from Transformers via Speculative Decoding" (Leviathan et al., 2022)

  • "Accelerating Large Language Model Decoding with Speculative Sampling" (Chen et al., 2023)

AUTHOR

lnation <email at example.com>

LICENSE AND COPYRIGHT

This software is copyright (c) 2026 by lnation.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.