NAME
Lugh::MemoryPool - Reusable compute resources for efficient inference
SYNOPSIS
    use Lugh;

    # Load model and create inference engine
    my $model     = Lugh::Model->new(model => 'model.gguf');
    my $tokenizer = Lugh::Tokenizer->new(model => $model);
    my $inference = Lugh::Inference->new(model => $model);

    # Create a memory pool for reusable resources
    my $pool = $inference->create_memory_pool();

    # Use the pool for multiple inference calls
    my @tokens = $tokenizer->encode("Hello, world!");
    my @logits = $inference->forward_pool(
        tokens => \@tokens,
        pool   => $pool,
    );

    # Reset the pool for the next request
    $pool->reset();

    # Use again with different input
    my @tokens2 = $tokenizer->encode("How are you?");
    my @logits2 = $inference->forward_pool(
        tokens => \@tokens2,
        pool   => $pool,
    );
DESCRIPTION
Lugh::MemoryPool provides pre-allocated compute resources that can be reused across multiple inference calls. Reusing a pool eliminates the overhead of allocating and freeing memory on every forward pass, which significantly improves throughput for applications that process many requests.
Memory pools are created from a Lugh::Inference object using the create_memory_pool() method. Each pool contains:
    * A compute context for building graphs
    * A backend instance for execution
    * A graph allocator for tensor memory
METHODS
reset
    $pool->reset();
Resets the memory pool to its initial state, ready for the next inference call. This must be called between inference requests to clear the previous computation graph.
Returns: True (1) on success, false (0) on failure.
Example:
    # Process multiple requests efficiently
    for my $text (@requests) {
        my @tokens = $tokenizer->encode($text);
        my @logits = $inference->forward_pool(
            tokens => \@tokens,
            pool   => $pool,
        );
        # Process logits...
        $pool->reset(); # Prepare for next request
    }
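Because reset() returns false on failure, callers that cannot tolerate a stale computation graph may want to check its return value. The recovery strategy below (replacing the pool) is an illustrative assumption, not a documented behavior:

    unless ($pool->reset()) {
        # Hypothetical recovery: discard the pool and allocate a fresh one
        $pool = $inference->create_memory_pool();
    }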
DESTROY
Called automatically when the pool goes out of scope. Frees all allocated resources including the backend, allocator, and compute context.
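Since cleanup happens in DESTROY, confining a pool to a lexical scope is enough to release its resources; no Lugh-specific cleanup call is needed. A minimal sketch using ordinary Perl scoping (the subroutine name is illustrative):

    sub answer_once {
        my ($inference, $tokens) = @_;
        my $pool   = $inference->create_memory_pool();
        my @logits = $inference->forward_pool(
            tokens => $tokens,
            pool   => $pool,
        );
        return @logits;
    }   # $pool goes out of scope here; DESTROY frees the
        # backend, allocator, and compute context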
CREATING A MEMORY POOL
Memory pools are created via Lugh::Inference:
    my $pool = $inference->create_memory_pool();
The pool inherits configuration from the inference object, including:
    * Backend selection (Metal, CPU, etc.)
    * Thread count
    * Memory allocation size
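For example, if the inference object was constructed with explicit settings, a pool created from it picks those up automatically. The constructor options shown below (backend, threads) are assumed names for illustration only; consult Lugh::Inference for the actual parameters:

    # Hypothetical constructor options -- check Lugh::Inference
    # for the real option names
    my $inference = Lugh::Inference->new(
        model   => $model,
        backend => 'metal',   # assumed option name
        threads => 8,         # assumed option name
    );
    my $pool = $inference->create_memory_pool();  # inherits these settings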
USING WITH FORWARD METHODS
Use forward_pool() or forward_cache_pool() to leverage the pool:
    # Without KV cache
    my @logits = $inference->forward_pool(
        tokens => \@tokens,
        pool   => $pool,
    );

    # With KV cache
    my $cache  = $inference->create_kv_cache();
    my @logits = $inference->forward_cache_pool(
        tokens => \@tokens,
        cache  => $cache,
        pool   => $pool,
    );
PERFORMANCE CONSIDERATIONS
When to Use Memory Pools
Memory pools provide the most benefit when:
    * Processing many short requests (chatbots, APIs)
    * Low latency is critical
    * Memory allocation overhead is noticeable in profiling
When Not to Use Memory Pools
Pools may not be necessary when:
    * Processing few, long sequences
    * Memory is severely constrained
    * Using batch processing (which has its own optimizations)
Memory Usage
Each pool allocates a fixed amount of memory (typically 512MB for the compute context). This memory is reused across inference calls but is not freed until the pool itself is destroyed.
THREAD SAFETY
Memory pools are not thread-safe. Each thread should have its own pool. The pool can be safely reused sequentially within a single thread.
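One pool per thread can be sketched as follows. This assumes the inference and tokenizer objects are themselves safe to use from multiple threads, which this page does not guarantee and which should be verified separately (note also that Perl ithreads clone data into each thread rather than sharing it):

    use threads;

    # Each worker thread creates and owns its pool; pools are never shared.
    my @workers = map {
        threads->create(sub {
            my $pool = $inference->create_memory_pool();  # thread-local pool
            for my $text (@requests) {
                my @tokens = $tokenizer->encode($text);
                my @logits = $inference->forward_pool(
                    tokens => \@tokens,
                    pool   => $pool,
                );
                # Process logits...
                $pool->reset();
            }
        });
    } 1 .. 4;

    $_->join for @workers;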
EXAMPLE: HIGH-THROUGHPUT INFERENCE
    use Lugh;

    my $model     = Lugh::Model->new(model => 'model.gguf');
    my $tokenizer = Lugh::Tokenizer->new(model => $model);
    my $inference = Lugh::Inference->new(model => $model);

    # Pre-allocate resources
    my $pool  = $inference->create_memory_pool();
    my $cache = $inference->create_kv_cache();

    # Process requests efficiently
    sub generate_response {
        my ($prompt) = @_;

        # Reset resources
        $pool->reset();
        $cache->clear();

        my @tokens = $tokenizer->encode($prompt);
        my @generated;

        for (1..100) { # Generate up to 100 tokens
            my @logits = $inference->forward_cache_pool(
                tokens => \@tokens,
                cache  => $cache,
                pool   => $pool,
            );
            my $next = $inference->sample_top_p(\@logits, temperature => 0.8);
            last if $next == $tokenizer->eos_id;
            push @generated, $next;
            push @tokens, $next;
            $pool->reset(); # Reset for next iteration
        }

        return $tokenizer->decode(\@generated);
    }
SEE ALSO
    Lugh::Inference - Main inference class with create_memory_pool()
    Lugh::KVCache   - Key-value cache for efficient generation
    Lugh            - Main module documentation
AUTHOR
LNATION <email@lnation.org>
LICENSE
This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.