Nvidia drivers on AWS/Clear Linux: key was rejected by service

TLDR: the nvidia drivers build fine on Clear Linux on an AWS EC2 instance, but modprobe fails with “Key was rejected by service”.

Hi, I’m having trouble installing the nvidia drivers on an AWS instance running Clear Linux. I’ve tried following the instructions on this docs page as well as using the bash scripts provided in this forum post

Steps to repro using the bash scripts:

  1. Launch a g4dn.xlarge EC2 instance using the Clear AWS marketplace image, version 35000, ami-0e8bf6a75bdee4a3c

  2. SSH into the machine

  3. sudo swupd bundle-add wget curl c-extras-gcc11 c-basic

  4. wget https://raw.githubusercontent.com/lebensterben/awesome-clear-linux/master/NVIDIA-Driver/pre_install.bash

  5. Set gcc11 as the primary (I’m sure there’s a better way to do this, sorry about this):

    • sudo mv /usr/bin/gcc /usr/bin/gcc-12
    • sudo mv /usr/bin/gcc-11 /usr/bin/gcc
  6. bash pre_install.bash

    • This fails with: The GCC used for compiling the kernel, 11.2.1, is different from the current GCC version, 11.3.1. (Is there a way to get a specific minor version of GCC?)
    • Based on this thread it seems like it is okay to have a different minor version of gcc so we’ll press on
  7. Manually remove lines 6-10 of pre_install.bash to remove the GCC version check, and try step 6 again, which succeeds

  8. Reboot as per instructions

  9. wget https://raw.githubusercontent.com/lebensterben/awesome-clear-linux/master/NVIDIA-Driver/install.bash

  10. bash install.bash

    • This downloads the latest installer, version 515.57
    • This fails, last line in the log file (/var/log/nvidia-installer.log) is ERROR: Failed to run '/usr/bin/dkms add -m nvidia -v 515.57 -k 5.15.43-335.aws': Error! No write access to DKMS tree at /var/lib/dkms
  11. I figured I’d try running sudo mkdir /var/lib/dkms and then rerunning install.bash

    • This fails for a new reason, last line in the log file is an unhelpful ERROR: Unable to load the 'nvidia-drm' kernel module
    • So rerun, replacing --silent with --expert on line 85 of install.bash. Just hit enter for all of the prompts. Eventually we get a more helpful error message: ERROR: Unable to load the 'nvidia-drm' kernel module: 'modprobe: ERROR: could not insert 'nvidia_drm': Key was rejected by service'. lsmod confirms that the nvidia module is not loaded.
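
On step 5: a less invasive alternative to renaming /usr/bin/gcc might be a PATH shim, i.e. a private directory whose gcc symlink points at gcc-11 and is prepended to PATH only for the build. This is a minimal sketch, not something tested against these scripts. The function name make_gcc_shim and the shim directory are hypothetical, and it assumes pre_install.bash and the NVIDIA installer look up gcc through PATH rather than hard-coding /usr/bin/gcc.

```shell
# make_gcc_shim TARGET SHIM_DIR
# Creates SHIM_DIR/gcc and SHIM_DIR/cc as symlinks to TARGET,
# leaving the system compiler in /usr/bin untouched.
# (Hypothetical helper; not part of the scripts referenced above.)
make_gcc_shim() {
    local target="$1" shim_dir="$2"
    if [ ! -x "$target" ]; then
        echo "compiler not found or not executable: $target" >&2
        return 1
    fi
    mkdir -p "$shim_dir" || return 1
    # -f replaces an existing link, -n avoids descending into a linked directory
    ln -sfn "$target" "$shim_dir/gcc" || return 1
    ln -sfn "$target" "$shim_dir/cc"
}
```

Usage would look something like: make_gcc_shim /usr/bin/gcc-11 "$HOME/gcc11-shim", followed by PATH="$HOME/gcc11-shim:$PATH" bash pre_install.bash. Note that sudo may reset PATH, so anything the scripts run through sudo could still pick up the default gcc.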

Further investigation:

  1. It seems like the modules are being built correctly, e.g. /lib/modules/5.15.43-335.aws/kernel/drivers/video/nvidia.ko exists. Calling sudo modprobe nvidia or sudo insmod /lib/modules/5.15.43-335.aws/kernel/drivers/video/nvidia.ko gives the same error message “Key was rejected by service”.

  2. Guessing from the error message, this looks like a module-signing issue. I found this page: Add kernel modules manually — Documentation for Clear Linux* project, which has instructions for disabling signature checking, but following them doesn’t help (even after rebooting multiple times after running the commands in that section, just in case). This page: Add kernel modules with DKMS — Documentation for Clear Linux* project mentions that adding the kernel-native-dkms bundle disables kernel module signature verification by writing to a different file, and I checked that the contents of that file are as described by the documentation. The only thing I’m not 100% sure of is whether Secure Boot is enabled. I assume it is not, as this page says that /sys/firmware/efi will be present if it is enabled, and that path does not exist.
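
One more check that may be useful here: whether the signature-override parameter actually reached the running kernel, as opposed to just being present in a config file. A minimal sketch, assuming the parameter is spelled module.sig_unenforce (as I recall it from the Clear Linux docs); the function name has_sig_unenforce is hypothetical.

```shell
# has_sig_unenforce [CMDLINE_FILE]
# Succeeds (exit 0) if the kernel command line contains the bare
# parameter module.sig_unenforce; defaults to the running kernel's cmdline.
# (Assumption: this is the parameter name the Clear Linux docs use.)
has_sig_unenforce() {
    local cmdline_file="${1:-/proc/cmdline}"
    # Put each parameter on its own line, then require an exact whole-line match.
    tr -s ' \t' '\n\n' < "$cmdline_file" | grep -qx 'module\.sig_unenforce'
}
```

Running has_sig_unenforce && echo present || echo absent after a reboot shows whether the setting took effect at boot. Even when it prints "present", the kernel still has to honor the parameter, which is a separate question. For the module itself, modinfo -F signer on the .ko file should print the signer, or nothing if the module is unsigned.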

Other attempts:

  1. Removing the --dkms flag on the nvidia installer doesn’t help. It builds the same nvidia.ko module, but calling modprobe on it gives the same error

  2. I also tried creating my own Clear AWS image with the latest version of Clear (36600) using the instructions on this page: Import Clear Linux Image and Launch Instance on AWS — Documentation for Clear Linux* project. With this image I get exactly the same behavior, except that it doesn’t complain about a GCC version mismatch, so I can omit step 7 above.

Almost. That path just indicates that the system was booted with EFI vs. legacy BIOS. The next item on the referenced page shows how to tell whether Secure Boot was used.
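
For reference, one generic way to check this from a shell (independent of what the referenced page describes) is to read the UEFI SecureBoot variable through efivarfs. A minimal sketch, assuming efivarfs is mounted at the usual /sys/firmware/efi/efivars location; the function name secure_boot_state is hypothetical.

```shell
# secure_boot_state [EFIVAR_FILE]
# Prints "enabled", "disabled", or "unknown" based on the UEFI SecureBoot variable.
# "unknown" covers legacy BIOS boots, where the variable does not exist at all.
secure_boot_state() {
    local var="${1:-/sys/firmware/efi/efivars/SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c}"
    local flag
    if [ ! -r "$var" ]; then
        echo unknown
        return 0
    fi
    # efivarfs prefixes the value with 4 attribute bytes; the 5th byte is the flag.
    flag=$(od -An -tu1 -j4 -N1 "$var" | tr -d ' \n')
    case "$flag" in
        1) echo enabled ;;
        0) echo disabled ;;
        *) echo unknown ;;
    esac
}
```

Calling secure_boot_state with no argument inspects the running system. If it prints "unknown" and /sys/firmware/efi is absent, the machine was booted in legacy BIOS mode and Secure Boot cannot be in effect.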

I think the underlying problem you were hitting was the one tracked in “`sig_unenforce` kernel patch for the AWS kernel” (Issue #2768 in clearlinux/distribution on GitHub), which I just fixed last week.
