NAME
Rex::GPU::NVIDIA - NVIDIA GPU driver and container toolkit management
VERSION
version 0.001
SYNOPSIS
use Rex::GPU::NVIDIA;
# Step 1: Install driver (with reboot on first deploy)
install_driver(reboot => 1);
# Step 2: Install NVIDIA Container Toolkit
install_container_toolkit();
# Step 3: Generate CDI specs for the device plugin
generate_cdi_specs();
# Step 4: Configure containerd for Kubernetes
configure_containerd('rke2'); # 'rke2', 'k3s', or 'containerd'
# Verify the current installation status
my $ok = verify_nvidia();
DESCRIPTION
Rex::GPU::NVIDIA manages the full NVIDIA software stack needed to run GPU-accelerated workloads in Kubernetes: driver installation, the Container Toolkit, CDI spec generation, and containerd runtime configuration.
Each step is OS-aware and handles Debian/Ubuntu, RHEL/Rocky/CentOS, and openSUSE Leap without further configuration.
Driver installation
Drivers are installed via DKMS, so they survive kernel upgrades without needing reinstallation. The nouveau open-source driver is blacklisted and the initramfs is regenerated to prevent it from loading at boot.
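The nouveau blacklist is conventionally a small modprobe drop-in; a typical example (the exact filename below is an illustration, not something this module guarantees) looks like:

```
# /etc/modprobe.d/blacklist-nouveau.conf (illustrative path)
blacklist nouveau
options nouveau modeset=0
```

After writing such a file, the initramfs must be regenerated (e.g. update-initramfs -u on Debian/Ubuntu, dracut -f on the RHEL family) so the blacklist is honored during early boot, which is exactly the regeneration step this module performs.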
On Debian, contrib, non-free, and non-free-firmware components are added to /etc/apt/sources.list automatically if not already present.
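On Debian 12, for example, the resulting entry would carry the extra components; the mirror URL shown here is the default Debian mirror, purely as an illustration:

```
deb http://deb.debian.org/debian bookworm main contrib non-free non-free-firmware
```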
On RHEL/Rocky/AlmaLinux/CentOS Stream, the NVIDIA CUDA repository is added and the open-kernel DKMS variant is used. For RHEL 10+ the module streams approach is not available; kmod-nvidia-open-dkms is installed directly.
On openSUSE Leap, the signed kmp-meta package (nvidia-open-driver-G06-signed-kmp-meta for Leap 15.x, nvidia-open-driver-G07-signed-kmp-meta for Leap 16.x) is used to ensure the kernel module and userspace libraries are always at the same version. Stale OSS non-free packages are removed before installation and locked afterwards to prevent nvidia-smi from reporting a Driver/library version mismatch.
Container Toolkit
nvidia-container-toolkit is installed from the official NVIDIA GitHub package repository (https://nvidia.github.io/libnvidia-container/).
CDI specs
Container Device Interface specifications are written to /etc/cdi/nvidia.yaml by nvidia-ctk cdi generate. CDI lets the Kubernetes device plugin enumerate GPU resources without requiring privileged container access.
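The generated spec is ordinary YAML. An abbreviated sketch of its general shape follows; field values and the exact set of edits vary by driver version and hardware, so treat this as an orientation aid rather than the tool's verbatim output:

```yaml
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
  - name: "0"                    # one entry per GPU
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
containerEdits:                  # edits applied for every device
  mounts:
    - hostPath: /usr/bin/nvidia-smi
      containerPath: /usr/bin/nvidia-smi
```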
Containerd configuration
For RKE2 and K3s, the NVIDIA runtime is registered via a drop-in snippet at /etc/containerd/conf.d/99-nvidia.toml, imported via the distribution's config.toml.tmpl mechanism. For standalone containerd, nvidia-ctk runtime configure is used.
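Based on that description, the drop-in snippet registers a runtime block along these lines; this is a sketch of the common containerd CRI runtime stanza, not the module's verbatim output:

```toml
# /etc/containerd/conf.d/99-nvidia.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```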
Supported distributions:
Debian 11 (bullseye), 12 (bookworm), 13 (trixie)
Ubuntu 22.04 (jammy), 24.04 (noble)
RHEL / Rocky Linux / AlmaLinux 8, 9, 10 — CentOS Stream 9, 10
openSUSE Leap 15.6, 16.0
Tested on Hetzner dedicated servers with NVIDIA RTX 4000 SFF Ada Generation.
install_driver
Install NVIDIA GPU drivers appropriate for the detected OS using DKMS. Blacklists the nouveau driver and rebuilds the initramfs so the blacklist takes effect on next boot.
After installation (and after reboot, if reboot => 1), calls "verify_nvidia" to confirm the kernel module loaded correctly.
Dies if the detected OS is not supported.
Options:
reboot
If true, the host is rebooted immediately after driver installation. The function waits up to 5 minutes for the host to come back (polling every 5 seconds via SSH reconnect), then continues with verification. Default: 0.
Rebooting is required on the first deployment when the nouveau open-source driver was previously loaded, because nouveau must be unloaded before the NVIDIA kernel module can bind to the device.
install_driver(); # install only, load module without reboot
install_driver(reboot => 1); # install, reboot, verify
install_container_toolkit
Install the NVIDIA Container Toolkit (nvidia-container-toolkit package) from the official NVIDIA package repository at https://nvidia.github.io/libnvidia-container/.
The repository GPG key is imported and the package repository is registered before installation. On Debian/Ubuntu a signed APT source list is written; on RHEL the .repo file is fetched via curl; on openSUSE Leap the base repository URL is added directly, since zypper cannot consume the RPM .repo file.
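On Debian/Ubuntu the signed source entry typically resembles the one NVIDIA documents for this repository; the keyring path shown is the conventional one and is an assumption here:

```
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
```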
Dies if the OS is not supported or if installation fails.
configure_containerd($runtime)
Configure the containerd runtime to use the NVIDIA container runtime. The nvidia-container-runtime binary must already be installed ("install_container_toolkit" provides it); if it is not present this function returns immediately without error.
$runtime selects how containerd is configured:
rke2 or k3s (default: rke2)
Creates /var/lib/rancher/rke2/agent/etc/containerd/ and writes a config.toml.tmpl that imports snippets from /etc/containerd/conf.d/. Then writes /etc/containerd/conf.d/99-nvidia.toml, which registers the NVIDIA runtime as io.containerd.runc.v2 with BinaryName = /usr/bin/nvidia-container-runtime. This approach is used by both RKE2 and K3s because they share the same containerd include mechanism.
containerd
Calls nvidia-ctk runtime configure --runtime=containerd and restarts the containerd systemd service. Suitable for standalone (non-Rancher) containerd installations.
verify_nvidia
Verify the current NVIDIA installation by checking three things:
1. The nvidia kernel module is loaded (lsmod | grep nvidia)
2. nvidia-smi -L reports at least one GPU
3. The nvidia-ctk binary is available (Container Toolkit present)
Returns 1 if all checks pass, 0 if any check fails. A warning is logged for each failure; the function does not die. A partial installation (e.g. driver installed but host not yet rebooted) emits a summary warning noting that features may not work until reboot.
generate_cdi_specs
Generate CDI (Container Device Interface) specifications for all detected NVIDIA GPUs by running nvidia-ctk cdi generate. CDI allows the Kubernetes NVIDIA device plugin to enumerate GPU resources without requiring a privileged container.
Writes output to /etc/cdi/nvidia.yaml. The /etc/cdi/ directory is created if it does not exist.
This step must be run after "install_container_toolkit" (which provides nvidia-ctk) and, on first deploy, after the reboot that activates the NVIDIA kernel module (so the tool can enumerate physical devices).
SEE ALSO
Rex::GPU, Rex::GPU::Detect, https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/
SUPPORT
Issues
Please report bugs and feature requests on GitHub at https://github.com/Getty/rex-gpu/issues.
CONTRIBUTING
Contributions are welcome! Please fork the repository and submit a pull request.
AUTHOR
Torsten Raudssus <getty@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2026 by Torsten Raudssus <torsten@raudssus.de> https://raudssus.de/.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.