NAME

Rex::GPU - GPU detection and driver management for Rex

VERSION

version 0.001

SYNOPSIS

use Rex::GPU;

# Detect GPUs only — returns a hashref
my $gpus = gpu_detect();
if (@{ $gpus->{nvidia} }) {
  say "NVIDIA GPU: ", $gpus->{nvidia}[0]{name};
}

# Full GPU setup for an RKE2 cluster (detect + drivers + toolkit + containerd)
gpu_setup(
  containerd_config => 'rke2',   # 'rke2', 'k3s', 'containerd', or 'none'
  reboot            => 1,        # reboot after driver install (first deploy)
);

# For a K3s cluster
gpu_setup(containerd_config => 'k3s');

# Just drivers + toolkit, no containerd config
gpu_setup(containerd_config => 'none');

DESCRIPTION

Rex::GPU provides GPU detection and driver management for Rex. It automates the complete software stack needed to make NVIDIA GPUs available to workloads running in a Kubernetes cluster.

The full pipeline, as executed by "gpu_setup":

1. GPU detection — PCI class code scan via lspci -nn to identify NVIDIA and AMD hardware, filtering out virtual GPUs (virtio, QEMU, VMware). Only CUDA-capable NVIDIA GPUs (RTX, Quadro, Tesla, PCI class 0302) trigger driver installation.
2. NVIDIA driver installation — Distribution-appropriate packages via DKMS for kernel-version independence. Nouveau is blacklisted and the initramfs is regenerated.
3. NVIDIA Container Toolkit — Installs nvidia-container-toolkit from the official NVIDIA repository for all supported distributions.
4. CDI spec generation — Writes /etc/cdi/nvidia.yaml so the Kubernetes device plugin can enumerate GPU resources without privileged container access.
5. Containerd runtime configuration — Injects the NVIDIA runtime into the containerd config for the target Kubernetes distribution.
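To illustrate step 1, here is a minimal sketch of the kind of classification that lspci -nn output permits. This is not the module's actual parser (that lives in Rex::GPU::Detect); the sample line, regexes, and the classify_pci_line helper are illustrative assumptions:

```perl
use strict;
use warnings;

# Classify a single `lspci -nn` line: returns (vendor, pci_class) for
# display devices, or an empty list otherwise. Illustrative only.
sub classify_pci_line {
    my ($line) = @_;
    # Display class codes print as [0300] (VGA) or [0302] (3D/compute)
    return unless $line =~ /\[03(0[02])\]/;
    my $pci_class = "03$1";
    my $vendor = $line =~ /NVIDIA/i                       ? 'nvidia'
               : $line =~ /\bAMD\b|\bATI\b|Radeon/i       ? 'amd'
               : $line =~ /virtio|QEMU|VMware|VirtualBox/i ? 'virtual'
               :                                            'other';
    return ($vendor, $pci_class);
}

# Hypothetical lspci -nn line for an RTX 4000 SFF Ada:
my $line = '01:00.0 3D controller [0302]: NVIDIA Corporation AD104GL '
         . '[RTX 4000 SFF Ada Generation] [10de:27b0]';
my ($vendor, $class) = classify_pci_line($line);
# $vendor is 'nvidia', $class is '0302' (3D/compute, CUDA-capable)
```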

Tested on Hetzner dedicated servers (bare metal) running:

  • Debian 11 (bullseye), 12 (bookworm), 13 (trixie)

  • Ubuntu 22.04 (jammy), 24.04 (noble)

  • RHEL / Rocky Linux / AlmaLinux 8, 9, 10 — CentOS Stream 9, 10

  • openSUSE Leap 15.6, 16.0

GPUs tested include the NVIDIA RTX 4000 SFF Ada Generation (PCI class 0302, datacenter compute profile).

This module requires a connection backend that is either Rex::LibSSH or SFTP-capable. Hetzner servers do not enable the SFTP subsystem by default; use set connection => "LibSSH" in your Rexfile.
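A minimal Rexfile for such a host might start like this (a sketch; the group name, host name, and auth setup are placeholders to adapt to your environment):

```perl
use Rex -feature => ['1.4'];
use Rex::GPU;

set connection => "LibSSH";   # Hetzner: SFTP subsystem is off by default

user "root";
group gpu_nodes => "your-server.example.com";

desc "Make the node GPU-ready for RKE2";
task "gpu", group => "gpu_nodes", sub {
    gpu_setup(containerd_config => 'rke2', reboot => 1);
};
```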

FUNCTIONS

gpu_detect

Detect GPU hardware on the remote host by scanning PCI devices. Installs pciutils if not already present, then parses lspci -nn output.

Returns a hashref with detected GPUs grouped by vendor:

my $gpus = gpu_detect();
# {
#   nvidia => [
#     {
#       name      => "NVIDIA RTX 4000 SFF Ada Generation",
#       vendor    => "nvidia",
#       pci_class => "0302",   # 0300 = VGA, 0302 = 3D/compute
#       compute   => 1,        # 1 if CUDA-capable
#     }
#   ],
#   amd => [
#     {
#       name      => "Radeon RX 7900 XTX",
#       vendor    => "amd",
#       pci_class => "0300",
#       compute   => 0,        # always 0 (AMD not yet supported)
#     }
#   ],
# }

Virtual GPUs (virtio, QEMU, VMware, VirtualBox) are detected and silently skipped — both arrays will be empty. See Rex::GPU::Detect for details on the classification logic.
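Since the return value is a plain hashref, acting only on CUDA-capable devices is a simple grep. A sketch, using a hard-coded result shaped like the example above in place of a live gpu_detect() call:

```perl
use strict;
use warnings;
use feature 'say';

# Hard-coded stand-in for a gpu_detect() result (illustration only)
my $gpus = {
    nvidia => [
        { name      => "NVIDIA RTX 4000 SFF Ada Generation",
          vendor    => "nvidia",
          pci_class => "0302",
          compute   => 1 },
    ],
    amd => [],
};

# Keep only CUDA-capable NVIDIA devices
my @cuda = grep { $_->{compute} } @{ $gpus->{nvidia} };
if (@cuda) {
    say "CUDA-capable: $_->{name}" for @cuda;
}
else {
    say "No CUDA-capable NVIDIA GPUs found";
}
```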

gpu_setup

Detect GPUs and run the full installation pipeline: NVIDIA driver, Container Toolkit, CDI spec generation, and containerd runtime configuration. This is the single call needed to make a node GPU-ready for Kubernetes.

AMD GPUs are detected and logged but not yet supported (a warning is emitted).

gpu_setup(
  containerd_config => 'rke2',  # containerd integration target
  reboot            => 1,       # reboot after driver install
);

Options:

containerd_config

Which containerd configuration variant to write. Controls where the NVIDIA runtime snippet is placed:

  • rke2 (default) — writes to /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl and drops a snippet in /etc/containerd/conf.d/99-nvidia.toml

  • k3s — same as rke2; K3s and RKE2 share the same containerd config include mechanism

  • containerd — runs nvidia-ctk runtime configure --runtime=containerd for a standalone containerd installation

  • none — skip containerd configuration entirely (driver and toolkit are still installed)

reboot

If true, the host is rebooted after driver installation and the function waits (up to 5 minutes, polling every 5 seconds) for it to come back before continuing with toolkit installation and containerd configuration. Default: 0.

Rebooting is required on the first deployment if the nouveau open-source driver was previously loaded, because nouveau must be unloaded before the NVIDIA driver can bind to the GPU.
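The wait-after-reboot behavior can be approximated by a generic polling loop like the one below. This is a sketch, not the module's actual implementation; wait_for_host and its check callback are hypothetical names, and in gpu_setup the check is effectively "can I reconnect over SSH":

```perl
use strict;
use warnings;

# Poll $check->() until it returns true or the timeout elapses.
# Returns 1 on success, 0 on timeout.
sub wait_for_host {
    my ($check, %opt) = @_;
    my $timeout  = $opt{timeout}  // 300;  # up to 5 minutes
    my $interval = $opt{interval} // 5;    # polling every 5 seconds
    my $deadline = time + $timeout;
    while (time < $deadline) {
        return 1 if $check->();
        sleep $interval;
    }
    return 0;
}
```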

Returns the result of "detect" in Rex::GPU::Detect — a hashref with nvidia and amd array keys.

Dies if the connection backend is neither LibSSH nor SFTP-capable.

SEE ALSO

Rex, Rex::LibSSH, Rex::GPU::Detect, Rex::GPU::NVIDIA, Rex::Rancher

SUPPORT

Issues

Please report bugs and feature requests on GitHub at https://github.com/Getty/rex-gpu/issues.

CONTRIBUTING

Contributions are welcome! Please fork the repository and submit a pull request.

AUTHOR

Torsten Raudssus <getty@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2026 by Torsten Raudssus <torsten@raudssus.de> https://raudssus.de/.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.