NAME
Rex::GPU - GPU detection and driver management for Rex
VERSION
version 0.001
SYNOPSIS
use Rex::GPU;
# Detect GPUs only — returns a hashref
my $gpus = gpu_detect();
if (@{ $gpus->{nvidia} }) {
    say "NVIDIA GPU: ", $gpus->{nvidia}[0]{name};
}
# Full GPU setup for an RKE2 cluster (detect + drivers + toolkit + containerd)
gpu_setup(
    containerd_config => 'rke2', # 'rke2', 'k3s', 'containerd', or 'none'
    reboot            => 1,      # reboot after driver install (first deploy)
);
# For a K3s cluster
gpu_setup(containerd_config => 'k3s');
# Just drivers + toolkit, no containerd config
gpu_setup(containerd_config => 'none');
DESCRIPTION
Rex::GPU provides GPU detection and driver management for Rex. It automates installation of the complete software stack needed to make NVIDIA GPUs available to workloads running in a Kubernetes cluster.
The full pipeline, as executed by "gpu_setup":
1. GPU detection: PCI class code scan via lspci -nn to identify NVIDIA and AMD hardware, filtering out virtual GPUs (virtio, QEMU, VMware). Only CUDA-capable NVIDIA GPUs (RTX, Quadro, Tesla, PCI class 0302) trigger driver installation.
2. NVIDIA driver installation: Distribution-appropriate packages via DKMS for kernel-version independence. Nouveau is blacklisted and the initramfs is regenerated.
3. NVIDIA Container Toolkit: Installs nvidia-container-toolkit from the official NVIDIA repository for all supported distributions.
4. CDI spec generation: Writes /etc/cdi/nvidia.yaml so the Kubernetes device plugin can enumerate GPU resources without privileged container access.
5. Containerd runtime configuration: Injects the NVIDIA runtime into the containerd config for the target Kubernetes distribution.
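To make the detection step more concrete, here is a minimal sketch of classifying lines of lspci -nn output. It is not the module's actual implementation: the helper name classify_lspci_line, the regular expression, and the choice to filter by PCI vendor ID (10de for NVIDIA, 1002 for AMD) are assumptions made for illustration, and the sample lines are illustrative rather than captured output.

```perl
use strict;
use warnings;

# PCI vendor IDs for the two vendors the module reports on.
my %VENDOR = ( '10de' => 'nvidia', '1002' => 'amd' );

# Hypothetical helper: classify one line of `lspci -nn` output.
# Returns a hashref for an NVIDIA/AMD display-class device (PCI class 03xx),
# or undef for anything else, including virtual GPUs from other vendors.
sub classify_lspci_line {
    my ($line) = @_;

    # Expected shape: "<slot> <class name> [03xx]: <device name> [vvvv:dddd] ..."
    my ( $class, $name, $vendor_id ) =
      $line =~ /^\S+\s+[^\[]*\[(03[0-9a-f]{2})\]:\s+(.*?)\s+\[([0-9a-f]{4}):[0-9a-f]{4}\]/i
      or return;

    # Unknown vendor IDs (virtio, QEMU, VMware, ...) are skipped here.
    my $vendor = $VENDOR{ lc $vendor_id } or return;

    return {
        name      => $name,
        vendor    => $vendor,
        pci_class => $class,
    };
}

# Illustrative input lines (assumed, not real captured output).
my @lines = (
    '01:00.0 3D controller [0302]: NVIDIA Corporation Example Device [10de:27b0] (rev a1)',
    '00:02.0 VGA compatible controller [0300]: Red Hat, Inc. Virtio GPU [1af4:1050] (rev 01)',
);

for my $line (@lines) {
    my $gpu = classify_lspci_line($line) or next;
    print "$gpu->{vendor}: $gpu->{name} (class $gpu->{pci_class})\n";
}
```

The sketch deliberately leaves out the compute flag, since the exact rule for CUDA capability lives in Rex::GPU::Detect.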
Tested on Hetzner dedicated servers (bare metal) running:
Debian 11 (bullseye), 12 (bookworm), 13 (trixie)
Ubuntu 22.04 (jammy), 24.04 (noble)
RHEL / Rocky Linux / AlmaLinux 8, 9, 10 — CentOS Stream 9, 10
openSUSE Leap 15.6, 16.0
GPUs tested include the NVIDIA RTX 4000 SFF Ada Generation (PCI class 0302, datacenter compute profile).
This module requires the Rex::LibSSH connection backend, or one that is SFTP-capable. Hetzner servers do not enable the SFTP subsystem by default; use set connection => "LibSSH" in your Rexfile.
FUNCTIONS
gpu_detect
Detect GPU hardware on the remote host by scanning PCI devices. Installs pciutils if not already present, then parses lspci -nn output.
Returns a hashref with detected GPUs grouped by vendor:
my $gpus = gpu_detect();
# {
#   nvidia => [
#     {
#       name      => "NVIDIA RTX 4000 SFF Ada Generation",
#       vendor    => "nvidia",
#       pci_class => "0302",   # 0300 = VGA, 0302 = 3D/compute
#       compute   => 1,        # 1 if CUDA-capable
#     }
#   ],
#   amd => [
#     {
#       name      => "Radeon RX 7900 XTX",
#       vendor    => "amd",
#       pci_class => "0300",
#       compute   => 0,        # always 0 (AMD not yet supported)
#     }
#   ],
# }
Virtual GPUs (virtio, QEMU, VMware, VirtualBox) are detected and silently skipped, so on a host that has only virtual GPUs both arrays are empty. See Rex::GPU::Detect for details on the classification logic.
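As an illustration of working with the returned structure, the following sketch collects the names of all GPUs flagged compute => 1. The helper compute_gpu_names is hypothetical and not part of Rex::GPU, and the hashref is hand-written sample data in the documented shape rather than real gpu_detect() output.

```perl
use strict;
use warnings;

# Hypothetical helper: names of all GPUs with a true `compute` flag in a
# gpu_detect()-shaped hashref. Missing vendor keys are treated as empty lists.
sub compute_gpu_names {
    my ($gpus) = @_;
    return map  { $_->{name} }
           grep { $_->{compute} }
           map  { @{ $gpus->{$_} // [] } } qw(nvidia amd);
}

# Sample data mirroring the documented structure (not real output).
my $gpus = {
    nvidia => [
        {
            name      => 'NVIDIA RTX 4000 SFF Ada Generation',
            vendor    => 'nvidia',
            pci_class => '0302',
            compute   => 1,
        },
    ],
    amd => [
        {
            name      => 'Radeon RX 7900 XTX',
            vendor    => 'amd',
            pci_class => '0300',
            compute   => 0,
        },
    ],
};

print "compute-capable: $_\n" for compute_gpu_names($gpus);
```

Because both vendor arrays may be empty, an empty result from such a helper is a normal outcome on virtual machines, not an error.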
gpu_setup
Detect GPUs and run the full installation pipeline: NVIDIA driver, Container Toolkit, CDI spec generation, and containerd runtime configuration. This is the single call needed to make a node GPU-ready for Kubernetes.
AMD GPUs are detected and logged but not yet supported (a warning is emitted).
gpu_setup(
    containerd_config => 'rke2', # containerd integration target
    reboot            => 1,      # reboot after driver install
);
Options:
containerd_config
Which containerd configuration variant to write. Controls where the NVIDIA runtime snippet is placed:
- rke2 (default): writes to /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl and drops a snippet in /etc/containerd/conf.d/99-nvidia.toml
- containerd: runs nvidia-ctk runtime configure --runtime=containerd for a standalone containerd installation
- none: skip containerd configuration entirely (driver and toolkit are still installed)
reboot
If true, the host is rebooted after driver installation and the function waits (up to 5 minutes, polling every 5 seconds) for it to come back before continuing with toolkit installation and containerd configuration. Default: 0.
Rebooting is required on the first deployment if the nouveau open-source driver was previously loaded, because nouveau must be unloaded before the NVIDIA driver can bind to the GPU.
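The wait-after-reboot behaviour described above is a bounded polling loop. The sketch below shows that pattern in isolation; it is not the module's code. The helper wait_until and its check, timeout, interval, and sleep parameters are hypothetical, with defaults chosen to match the documented 5 minutes and 5 seconds.

```perl
use strict;
use warnings;

# Hypothetical helper: call $check every $interval seconds until it returns
# true or $timeout seconds of waiting have accumulated.
# Returns 1 on success, 0 on timeout. The sleep routine is injectable so the
# loop can be exercised without real delays.
sub wait_until {
    my (%args) = @_;
    my $check    = $args{check};
    my $timeout  = $args{timeout}  // 300;    # 5 minutes, as documented
    my $interval = $args{interval} // 5;      # 5 seconds, as documented
    my $sleep    = $args{sleep}    // sub { sleep $_[0] };

    my $waited = 0;
    while (1) {
        return 1 if $check->();
        return 0 if $waited >= $timeout;
        $sleep->($interval);
        $waited += $interval;
    }
}

# Demo with a stand-in check that succeeds on the third attempt and a no-op
# sleep; a real check would try to reach the rebooted host.
my $attempts = 0;
my $ok = wait_until(
    check => sub { ++$attempts >= 3 },
    sleep => sub { },
);
print $ok ? "host is back after $attempts checks\n" : "timed out\n";
```

Checking before sleeping means a host that is already reachable costs no delay, and the final check happens exactly at the timeout boundary.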
Returns the result of "detect" in Rex::GPU::Detect — a hashref with nvidia and amd array keys.
Dies if the connection backend is neither LibSSH nor SFTP-capable.
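Since gpu_setup performs long-running remote work, a caller may want to check option values before invoking it. The following is a minimal sketch of such a pre-check, assuming the four containerd_config values shown in the SYNOPSIS and the documented defaults; the helper normalize_gpu_opts is hypothetical and says nothing about how Rex::GPU validates its arguments internally.

```perl
use strict;
use warnings;

# Values listed in the SYNOPSIS for containerd_config.
my %VALID = map { $_ => 1 } qw(rke2 k3s containerd none);

# Hypothetical helper: apply the documented defaults and reject unknown
# containerd_config values before any remote work starts.
sub normalize_gpu_opts {
    my (%opts) = @_;
    my $cfg = $opts{containerd_config} // 'rke2';    # documented default
    die "unknown containerd_config '$cfg'\n" unless $VALID{$cfg};
    return (
        containerd_config => $cfg,
        reboot            => $opts{reboot} ? 1 : 0,  # documented default: 0
    );
}

my %opts = normalize_gpu_opts( containerd_config => 'k3s' );
print "containerd_config=$opts{containerd_config} reboot=$opts{reboot}\n";
```

The normalized hash could then be passed straight through, for example gpu_setup(%opts).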
SEE ALSO
Rex, Rex::LibSSH, Rex::GPU::Detect, Rex::GPU::NVIDIA, Rex::Rancher
SUPPORT
Issues
Please report bugs and feature requests on GitHub at https://github.com/Getty/rex-gpu/issues.
CONTRIBUTING
Contributions are welcome! Please fork the repository and submit a pull request.
AUTHOR
Torsten Raudssus <getty@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2026 by Torsten Raudssus <torsten@raudssus.de> https://raudssus.de/.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.