Changes for version 0.10 - 2026-01-20

  • Forward Batch Mode with Per-Sequence KV Caches
    • forward_batch() now accepts caches => [$cache1, $cache2, ...] for per-sequence KV caching in batch mode
    • Enables parallel incremental decoding of multiple conversations
    • Each sequence uses its own independent cache
    • The number of caches must match the number of sequences
    • Also supported via forward(sequences => [...], caches => [...]); see the first usage sketch below
    • forward_batch_pool() also supports caches parameter
    • New tests in t/28-unified-forward.t
  • Speculative Decoding (Phase 16)
    • New Lugh::Speculative module for faster inference (see the second usage sketch below)
    • Uses a smaller draft model to generate candidate tokens
    • The main model verifies the drafted tokens in parallel for a 2-3x speedup
    • Dual KV cache management for both models
    • Statistics: acceptance_rate(), tokens_drafted/accepted, total_steps
    • Vocab compatibility validation between models
    • New tests in t/29-speculative.t
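
The first sketch below shows how per-sequence caches might be wired into a batched
forward pass. Only the sequences => [...] and caches => [...] parameters, the
forward_batch()/forward() method names, and the rule that the number of caches must
match the number of sequences come from this changelog; the constructor names
(Lugh::Model, Lugh::Inference, Lugh::KVCache), their parameters, and the token ids
are assumptions used purely for illustration.

    use strict;
    use warnings;
    use Lugh;

    # Hypothetical setup: load a GGUF model and build an inference object.
    # These constructor names and parameters are assumed, not documented here.
    my $model = Lugh::Model->new(file => 'model.gguf');
    my $infer = Lugh::Inference->new(model => $model);

    # One independent KV cache per conversation (class name assumed).
    my @caches = map { Lugh::KVCache->new(model => $model) } 1 .. 2;

    # Two conversations decoded incrementally in a single batch;
    # the token ids are placeholders.
    my @sequences = (
        [ 1, 15043 ],    # conversation 1
        [ 1, 29871 ],    # conversation 2
    );

    # caches => [...] is the per-sequence cache parameter: the number of
    # caches must match the number of sequences, and each sequence keeps
    # reusing its own cache on later calls.
    my $out = $infer->forward_batch(
        sequences => \@sequences,
        caches    => \@caches,
    );

    # The unified entry point accepts the same parameters:
    # $infer->forward(sequences => \@sequences, caches => \@caches);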
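
The second sketch outlines how Lugh::Speculative might be driven. The module name,
the draft-then-verify scheme, and the statistics calls are taken from the changelog;
the constructor parameters (target =>, draft =>), the generate() call, and the exact
names of the drafted/accepted counters (shortened above as "tokens_drafted/accepted")
are assumptions.

    use strict;
    use warnings;
    use Lugh;
    use Lugh::Speculative;

    # Hypothetical model loading (constructor name and parameters assumed).
    my $main_model  = Lugh::Model->new(file => 'big-model.gguf');
    my $draft_model = Lugh::Model->new(file => 'draft-model.gguf');

    # Smaller draft model proposes tokens, the main model verifies them;
    # vocab compatibility between the two models is validated up front.
    my $spec = Lugh::Speculative->new(
        target => $main_model,     # assumed parameter name
        draft  => $draft_model,    # assumed parameter name
    );

    my $text = $spec->generate(   # assumed method name
        prompt     => 'The capital of France is',
        max_tokens => 64,
    );

    # Statistics listed in the changelog:
    printf "acceptance rate: %.2f\n", $spec->acceptance_rate;
    printf "drafted: %d  accepted: %d  steps: %d\n",
        $spec->tokens_drafted, $spec->tokens_accepted, $spec->total_steps;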

Modules

Pure C LLM Inference Engine for Perl (built on ggml)
Memory Context for Tensor Allocation
Computation Graph for Tensor Operations
Transformer Forward Pass and Token Generation
KV Cache for efficient incremental decoding
Low-Rank Adaptation (LoRA) adapter support for Lugh
GGUF Model Loading and Tensor Access
Tensor Operations for Neural Network Computation
Chat Template Formatting for LLM Conversations
Quantization utilities for Lugh tensors
RoPE (Rotary Position Embedding) Scaling Configuration
Speculative decoding for faster LLM inference
N-Dimensional Tensor with ggml Backend
BPE Tokenizer for Text Encoding and Decoding