NAME

Rex::GPU::NVIDIA - NVIDIA GPU driver and container toolkit management

VERSION

version 0.001

SYNOPSIS

use Rex::GPU::NVIDIA;

# Step 1: Install driver (with reboot on first deploy)
install_driver(reboot => 1);

# Step 2: Install NVIDIA Container Toolkit
install_container_toolkit();

# Step 3: Generate CDI specs for the device plugin
generate_cdi_specs();

# Step 4: Configure containerd for Kubernetes
configure_containerd('rke2');   # 'rke2', 'k3s', or 'containerd'

# Verify the current installation status
my $ok = verify_nvidia();

DESCRIPTION

Rex::GPU::NVIDIA manages the full NVIDIA software stack needed to run GPU-accelerated workloads in Kubernetes: driver installation, the Container Toolkit, CDI spec generation, and containerd runtime configuration.

Each step is OS-aware and handles Debian/Ubuntu, RHEL/Rocky/CentOS, and openSUSE Leap without further configuration.

Driver installation

Drivers are installed via DKMS, so they survive kernel upgrades without needing reinstallation. The nouveau open-source driver is blacklisted and the initramfs is regenerated to prevent it from loading at boot.

On Debian, contrib, non-free, and non-free-firmware components are added to /etc/apt/sources.list automatically if not already present.
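
For illustration, the resulting bookworm entry looks like the stock line with the extra components appended (the mirror URL follows whatever the host already uses):

  deb http://deb.debian.org/debian bookworm main contrib non-free non-free-firmware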

On RHEL/Rocky/AlmaLinux/CentOS Stream, the NVIDIA CUDA repository is added and the open-kernel DKMS variant is used. For RHEL 10+ the module streams approach is not available; kmod-nvidia-open-dkms is installed directly.

On openSUSE Leap, the signed kmp-meta package (nvidia-open-driver-G06-signed-kmp-meta for Leap 15.x, nvidia-open-driver-G07-signed-kmp-meta for Leap 16.x) is used to ensure the kernel module and userspace libraries always stay at the same version. Stale driver packages from the OSS/non-free repositories are removed before installation, and the freshly installed packages are locked afterwards so that a partial update cannot make nvidia-smi report a "Driver/library version mismatch".
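
The manual equivalent of the Leap 15.x flow is roughly the following (a sketch; the module derives the actual package name from the detected Leap version):

  zypper --non-interactive install nvidia-open-driver-G06-signed-kmp-meta
  zypper addlock 'nvidia-*'   # keep module and libraries from drifting apart on updates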

Container Toolkit

nvidia-container-toolkit is installed from the official NVIDIA GitHub package repository (https://nvidia.github.io/libnvidia-container/).

CDI specs

Container Device Interface specifications are written to /etc/cdi/nvidia.yaml by nvidia-ctk cdi generate. CDI lets the Kubernetes device plugin enumerate GPU resources without requiring privileged container access.
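
An abridged nvidia.yaml, for orientation only (device names, driver version, and the mount list all depend on the host):

  cdiVersion: 0.6.0
  kind: nvidia.com/gpu
  devices:
    - name: "0"
      containerEdits:
        deviceNodes:
          - path: /dev/nvidia0
  containerEdits:
    mounts:
      - hostPath: /usr/bin/nvidia-smi
        containerPath: /usr/bin/nvidia-smi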

Containerd configuration

For RKE2 and K3s, the NVIDIA runtime is registered via a drop-in snippet at /etc/containerd/conf.d/99-nvidia.toml, imported via the distribution's config.toml.tmpl mechanism. For standalone containerd, nvidia-ctk runtime configure is used.

Supported distributions:

  • Debian 11 (bullseye), 12 (bookworm), 13 (trixie)

  • Ubuntu 22.04 (jammy), 24.04 (noble)

  • RHEL / Rocky Linux / AlmaLinux 8, 9, 10 — CentOS Stream 9, 10

  • openSUSE Leap 15.6, 16.0

Tested on Hetzner dedicated servers with NVIDIA RTX 4000 SFF Ada Generation.

install_driver

Install NVIDIA GPU drivers appropriate for the detected OS using DKMS. Blacklists the nouveau driver and rebuilds the initramfs so the blacklist takes effect on next boot.

After installation (and after reboot, if reboot => 1), calls "verify_nvidia" to confirm the kernel module loaded correctly.

Dies if the detected OS is not supported.

Options:

reboot

If true, the host is rebooted immediately after driver installation. The function waits up to 5 minutes for the host to come back (polling every 5 seconds via SSH reconnect), then continues with verification. Default: 0.

Rebooting is required on the first deployment when the nouveau open-source driver was previously loaded, because nouveau must be unloaded before the NVIDIA kernel module can bind to the device.

install_driver();              # install only, load module without reboot
install_driver(reboot => 1);   # install, reboot, verify

install_container_toolkit

Install the NVIDIA Container Toolkit (nvidia-container-toolkit package) from the official NVIDIA package repository at https://nvidia.github.io/libnvidia-container/.

The repository GPG key is imported and the package repository is registered before installing. On Debian/Ubuntu the signed APT source list is written; on RHEL the .repo file is fetched via curl; on openSUSE Leap the base repository URL is added directly, since zypper cannot consume the yum-style .repo file as-is.
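
On Debian/Ubuntu this corresponds roughly to the manual steps from NVIDIA's installation guide (a sketch, run as root):

  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  curl -sL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' \
    > /etc/apt/sources.list.d/nvidia-container-toolkit.list
  apt-get update && apt-get install -y nvidia-container-toolkit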

Dies if the OS is not supported or if installation fails.

configure_containerd($runtime)

Configure the containerd runtime to use the NVIDIA container runtime. The nvidia-container-runtime binary must already be installed ("install_container_toolkit" provides it); if it is not present this function returns immediately without error.

$runtime selects how containerd is configured:

rke2 or k3s (default: rke2)

Creates /var/lib/rancher/rke2/agent/etc/containerd/ and writes a config.toml.tmpl that imports snippets from /etc/containerd/conf.d/. Then writes /etc/containerd/conf.d/99-nvidia.toml which registers the NVIDIA runtime as io.containerd.runc.v2 with BinaryName=/usr/bin/nvidia-container-runtime.
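
The essence of the generated 99-nvidia.toml is a runtime entry of this shape (a sketch; the exact table paths depend on the containerd config version in use):

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"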

This approach is used by both RKE2 and K3s because they share the same containerd include mechanism.

containerd

Calls nvidia-ctk runtime configure --runtime=containerd and restarts the containerd systemd service. Suitable for standalone (non-Rancher) containerd installations.
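
This is equivalent to running, as root:

  nvidia-ctk runtime configure --runtime=containerd
  systemctl restart containerd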

verify_nvidia

Verify the current NVIDIA installation by checking three things:

1. nvidia kernel module is loaded (lsmod | grep nvidia)
2. nvidia-smi -L reports at least one GPU
3. nvidia-ctk binary is available (Container Toolkit present)
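
The same spot-check can be performed manually from a shell:

  lsmod | grep nvidia       # 1. kernel module loaded
  nvidia-smi -L             # 2. lists at least one GPU
  command -v nvidia-ctk     # 3. Container Toolkit present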

Returns 1 if all checks pass, 0 if any check fails. A warning is logged for each failure; the function does not die. A partial installation (e.g. driver installed but host not yet rebooted) emits a summary warning noting that features may not work until reboot.

generate_cdi_specs

Generate CDI (Container Device Interface) specifications for all detected NVIDIA GPUs by running nvidia-ctk cdi generate. CDI allows the Kubernetes NVIDIA device plugin to enumerate GPU resources without requiring a privileged container.

Writes output to /etc/cdi/nvidia.yaml. The /etc/cdi/ directory is created if it does not exist.

This step must be run after "install_container_toolkit" (which provides nvidia-ctk) and, on first deploy, after the reboot that activates the NVIDIA kernel module (so the tool can enumerate physical devices).

SEE ALSO

Rex::GPU, Rex::GPU::Detect, https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/

SUPPORT

Issues

Please report bugs and feature requests on GitHub at https://github.com/Getty/rex-gpu/issues.

CONTRIBUTING

Contributions are welcome! Please fork the repository and submit a pull request.

AUTHOR

Torsten Raudssus <getty@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2026 by Torsten Raudssus <torsten@raudssus.de> https://raudssus.de/.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.