TLDR: the NVIDIA drivers build on Clear Linux on an AWS EC2 instance, but modprobe fails with “Key was rejected by service”
Hi, I’m having trouble installing the NVIDIA drivers on an AWS instance running Clear Linux. I’ve tried following the instructions on this docs page as well as using the bash scripts provided in this forum post.
Steps to repro using the bash scripts:
- Launch a g4dn.xlarge EC2 instance using the Clear AWS marketplace image, version 35000, ami-0e8bf6a75bdee4a3c
- SSH into the machine
- sudo swupd bundle-add wget curl c-extras-gcc11 c-basic
- wget https://raw.githubusercontent.com/lebensterben/awesome-clear-linux/master/NVIDIA-Driver/pre_install.bash
- Set gcc11 as the primary (I’m sure there’s a better way to do this, sorry about this):
  sudo mv /usr/bin/gcc /usr/bin/gcc-12
  sudo mv /usr/bin/gcc-11 /usr/bin/gcc
- bash pre_install.bash
  - This fails with: The GCC used for compiling the kernel, 11.2.1, is different from the current GCC version, 11.3.1. (Is there a way to get a specific minor version of GCC?)
  - Based on this thread, it seems to be okay to have a different minor version of GCC, so we’ll press on
- Manually remove lines 6-10 of pre_install.bash to remove the GCC version check, then run bash pre_install.bash again, which succeeds
- Reboot as per instructions
- wget https://raw.githubusercontent.com/lebensterben/awesome-clear-linux/master/NVIDIA-Driver/install.bash
- bash install.bash
  - This downloads the latest installer, version 515.57
  - This fails; the last line in the log file (/var/log/nvidia-installer.log) is:
    ERROR: Failed to run '/usr/bin/dkms add -m nvidia -v 515.57 -k 5.15.43-335.aws': Error! No write access to DKMS tree at /var/lib/dkms
- I figured I’d try running
  sudo mkdir /var/lib/dkms
  and then rerunning install.bash
  - This fails for a new reason; the last line in the log file is an unhelpful
    ERROR: Unable to load the 'nvidia-drm' kernel module
  - So rerun, replacing --silent with --expert on line 85 of install.bash. Just hit enter for all of the prompts. Eventually we get a more helpful error message:
    ERROR: Unable to load the 'nvidia-drm' kernel module: 'modprobe: ERROR: could not insert 'nvidia_drm': Key was rejected by service'
    lsmod confirms that the nvidia module is not loaded.
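On the compiler swap in the steps above: one possibly gentler alternative to renaming binaries under /usr/bin is a per-session PATH shim. The sketch below is only an illustration; the helper name use_compiler_as_gcc is made up here and is not part of the pre_install.bash or install.bash scripts, and it assumes those scripts find the compiler by looking up gcc on PATH.

```shell
# Hypothetical helper (not part of the scripts in this post): make `gcc`
# resolve to a chosen compiler for the current shell session only.
use_compiler_as_gcc() {
    target="$1"                              # e.g. /usr/bin/gcc-11
    [ -x "$target" ] || { echo "not executable: $target" >&2; return 1; }
    shim_dir="$(mktemp -d)" || return 1      # private directory holding the shim
    ln -s "$target" "$shim_dir/gcc" || return 1
    PATH="$shim_dir:$PATH"                   # shim is found before /usr/bin/gcc
    export PATH
}

# Example (path assumed from the steps above):
#   use_compiler_as_gcc /usr/bin/gcc-11 && gcc --version
```

Nothing under /usr/bin is modified, so closing the shell undoes the change. One caveat: if a script invokes the compiler through sudo, sudo may reset PATH and bypass the shim, in which case this approach would not help.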
Further investigation:
- It seems like the modules are being built correctly, e.g. /lib/modules/5.15.43-335.aws/kernel/drivers/video/nvidia.ko exists. Calling sudo modprobe nvidia or sudo insmod /lib/modules/5.15.43-335.aws/kernel/drivers/video/nvidia.ko gives the same error message “Key was rejected by service”.
- Totally guessing from the error message, it looks like some sort of signing-related issue? So I found this page: https://docs.01.org/clearlinux/latest/guides/kernel/kernel-modules.html#load-kernel-module which seems to have instructions on how to disable signature checking, but that doesn’t seem to help (even after I tried rebooting multiple times after running the commands in that section, just in case). This page: https://docs.01.org/clearlinux/latest/guides/kernel/kernel-modules-dkms.html#install-dkms mentions that adding the bundle kernel-native-dkms disables kernel module signature verification by writing to a different file, and I checked that the contents of that file are as described by the documentation. The only thing I’m not 100% sure of is whether secure boot is enabled. I assume it is not, as this page says that /sys/firmware/efi will be present if it is enabled, and that path does not exist.
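One thing that might be worth checking here is whether the setting written to the config file actually reached the running kernel's command line after the reboot. The sketch below assumes the parameter named in the linked documentation section is module.sig_unenforce; the function name cmdline_has_param is made up for illustration.

```shell
# Returns 0 if the space-separated kernel command line in $2 contains
# the parameter $1, either bare or in the form $1=value.
cmdline_has_param() {
    case " $2 " in
        *" $1 "*|*" $1="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the running kernel (assumption: module.sig_unenforce is the
# parameter the Clear Linux docs ask you to add).
if cmdline_has_param module.sig_unenforce "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "module.sig_unenforce is on the running kernel's command line"
else
    echo "module.sig_unenforce is NOT on the running kernel's command line"
fi
```

If the parameter is missing even though the config file has the documented contents, the boot configuration may not have been regenerated or may not be the one this instance actually boots from. The kernel log (dmesg) may also state a reason when it rejects a module.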
Other attempts:
- Removing the --dkms flag on the nvidia installer doesn’t help. It builds the same nvidia.ko module, but calling modprobe on it gives the same error
- I also tried creating my own Clear AWS image with the latest version of Clear (36600) using the instructions on this page: https://docs.01.org/clearlinux/latest/get-started/cloud-install/import-clr-aws.html With this image, I get the exact same behavior, except that it doesn’t complain about a GCC version mismatch, so I can skip removing the GCC version check from pre_install.bash.
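Since both the --dkms and non-DKMS builds produce a module that is rejected, it could help to see whether the built nvidia.ko carries a signature at all. Signed kernel modules end with a fixed 28-byte marker string, so a rough check is possible without extra tools. This is a sketch only: the function name module_is_signed is made up, and it applies to uncompressed .ko files.

```shell
# Returns 0 if the file ends with the kernel's appended-signature marker
# ("~Module signature appended~" followed by a newline, 28 bytes in total).
module_is_signed() {
    tail -c 28 "$1" 2>/dev/null | grep -q '~Module signature appended~'
}

# Path taken from the post; the check is skipped if the file is absent.
ko=/lib/modules/5.15.43-335.aws/kernel/drivers/video/nvidia.ko
if [ -e "$ko" ]; then
    if module_is_signed "$ko"; then
        echo "$ko has an appended signature"
    else
        echo "$ko has no appended signature"
    fi
fi
```

An unsigned module would suggest that signature enforcement is still active on the running kernel. A signed one would suggest the kernel does not trust the signing key.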