NAME

Lugh::MemoryPool - Reusable compute resources for efficient inference

SYNOPSIS

use Lugh;

# Load model and create inference engine
my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);

# Create a memory pool for reusable resources
my $pool = $inference->create_memory_pool();

# Use the pool for multiple inference calls
my @tokens = $tokenizer->encode("Hello, world!");

my @logits = $inference->forward_pool(
    tokens => \@tokens,
    pool   => $pool,
);

# Reset the pool for the next request
$pool->reset();

# Use again with different input
my @tokens2 = $tokenizer->encode("How are you?");
my @logits2 = $inference->forward_pool(
    tokens => \@tokens2,
    pool   => $pool,
);

DESCRIPTION

Lugh::MemoryPool provides pre-allocated compute resources that can be reused across multiple inference calls. This eliminates the overhead of allocating and freeing memory for each forward pass, significantly improving throughput for applications that process many requests.

Memory pools are created from a Lugh::Inference object using the create_memory_pool() method. Each pool contains:

  • A compute context for building graphs

  • A backend instance for execution

  • A graph allocator for tensor memory

METHODS

reset

$pool->reset();

Resets the memory pool to its initial state, ready for the next inference call. This must be called between inference requests to clear the previous computation graph.

Returns: True (1) on success, false (0) on failure.

Example:

# Process multiple requests efficiently
for my $text (@requests) {
    my @tokens = $tokenizer->encode($text);
    my @logits = $inference->forward_pool(
        tokens => \@tokens,
        pool   => $pool,
    );

    # Process logits...

    $pool->reset() or die "pool reset failed";  # Prepare for next request
}

DESTROY

Called automatically when the pool goes out of scope. Frees all allocated resources including the backend, allocator, and compute context.
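In practice this means a pool created in a lexical scope is released when that scope exits; you can also release it early with undef. A sketch using only methods documented above:

{
    # Pool lives only inside this block
    my $pool = $inference->create_memory_pool();

    my @logits = $inference->forward_pool(
        tokens => \@tokens,
        pool   => $pool,
    );
}   # DESTROY runs here; backend, allocator, and compute context are freed

# Or release explicitly before the end of scope:
my $pool = $inference->create_memory_pool();
# ... use $pool ...
undef $pool;  # triggers DESTROY immediately, assuming no other references remain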

CREATING A MEMORY POOL

Memory pools are created via Lugh::Inference:

my $pool = $inference->create_memory_pool();

The pool inherits configuration from the inference object, including:

  • Backend selection (Metal, CPU, etc.)

  • Thread count

  • Memory allocation size
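Because the pool inherits these settings, you configure them once on the inference object. The constructor option below is illustrative only (the actual option names are not documented here; consult Lugh::Inference):

# Hypothetical sketch: 'threads' is an assumed option name,
# shown only to illustrate that pool settings come from the
# inference object -- see Lugh::Inference for the real options.
my $inference = Lugh::Inference->new(
    model   => $model,
    threads => 8,
);

# The pool picks up the same backend and thread configuration
my $pool = $inference->create_memory_pool();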

USING WITH FORWARD METHODS

Use forward_pool() or forward_cache_pool() to leverage the pool:

# Without KV cache
my @logits = $inference->forward_pool(
    tokens => \@tokens,
    pool   => $pool,
);

# With KV cache
my $cache = $inference->create_kv_cache();
my @logits = $inference->forward_cache_pool(
    tokens => \@tokens,
    cache  => $cache,
    pool   => $pool,
);

PERFORMANCE CONSIDERATIONS

When to Use Memory Pools

Memory pools provide the most benefit when:

  • Processing many short requests (chatbots, APIs)

  • Low latency is critical

  • Memory allocation overhead is noticeable in profiling

When Not to Use Memory Pools

Pools may not be necessary when:

  • Processing few, long sequences

  • Memory is severely constrained

  • Using batch processing (which has its own optimizations)

Memory Usage

Each pool allocates a fixed amount of memory (typically 512MB for the compute context). This memory is reused but not freed until the pool is destroyed.

THREAD SAFETY

Memory pools are not thread-safe. Each thread should have its own pool. The pool can be safely reused sequentially within a single thread.
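A sketch of the one-pool-per-thread pattern using Perl ithreads. Whether the model, tokenizer, and inference objects may be shared across threads (or must also be created per thread) depends on Lugh itself, so treat this as illustrative of pool ownership only:

use threads;

# Each worker creates its own pool; pools are never shared
my @workers = map {
    threads->create(sub {
        my $pool = $inference->create_memory_pool();
        for my $text (@requests) {
            my @tokens = $tokenizer->encode($text);
            my @logits = $inference->forward_pool(
                tokens => \@tokens,
                pool   => $pool,
            );
            # Process logits...
            $pool->reset();
        }
    });
} 1 .. 4;

$_->join for @workers;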

EXAMPLE: HIGH-THROUGHPUT INFERENCE

use Lugh;

my $model = Lugh::Model->new(model => 'model.gguf');
my $tokenizer = Lugh::Tokenizer->new(model => $model);
my $inference = Lugh::Inference->new(model => $model);

# Pre-allocate resources
my $pool = $inference->create_memory_pool();
my $cache = $inference->create_kv_cache();

# Process requests efficiently
sub generate_response {
    my ($prompt) = @_;

    # Reset resources
    $pool->reset();
    $cache->clear();

    my @tokens = $tokenizer->encode($prompt);
    my @generated;

    for (1..100) {  # Generate up to 100 tokens
        my @logits = $inference->forward_cache_pool(
            tokens => \@tokens,
            cache  => $cache,
            pool   => $pool,
        );

        my $next = $inference->sample_top_p(\@logits, temperature => 0.8);
        last if $next == $tokenizer->eos_id;

        push @generated, $next;
        push @tokens, $next;

        $pool->reset();  # Reset for next iteration
    }

    return $tokenizer->decode(\@generated);
}

SEE ALSO

Lugh::Inference - Main inference class with create_memory_pool()

Lugh::KVCache - Key-value cache for efficient generation

Lugh - Main module documentation

AUTHOR

LNATION <email@lnation.org>

LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.