NVIDIA drivers with CUDA 10.1 on AWS EC2?

Good morning folks!

We’re having some trouble installing the latest NVIDIA drivers plus the CUDA Toolkit 10.1 for TensorFlow on the AWS EC2 service. We followed the NVIDIA Drivers guide but replaced the native kernel/dkms bundles with the AWS variants, and we also removed the --no-nvidia-modprobe parameter for CUDA support (it also fails with the same error when the parameter is specified).

sh NVIDIA-Linux-x86_64-418.87.01.run \
--utility-prefix=/opt/nvidia \
--opengl-prefix=/opt/nvidia \
--compat32-prefix=/opt/nvidia \
--compat32-libdir=lib32 \
--x-prefix=/opt/nvidia \
--x-module-path=/opt/nvidia/lib64/xorg/modules \
--x-library-path=/opt/nvidia/lib64 \
--x-sysconfig-path=/etc/X11/xorg.conf.d \
--documentation-prefix=/opt/nvidia \
--application-profile-path=/etc/nvidia \
--no-precompiled-interface \
--no-distro-scripts \
--force-libglx-indirect \
--glvnd-egl-config-path=/etc/glvnd/egl_vendor.d \
--egl-external-platform-config-path=/etc/egl/egl_external_platform.d  \
--dkms \
--silent

Console: output.log
Build: make.log
Log: nvidia-installer.log
Dump: nvidia-bug-report.log
Driver version: 418.87.01
Clear Linux version: 31380

NVIDIA driver v418.87.01 was acquired from NVIDIA Driver Downloads for our system’s GPU (Tesla M60):

Product Type: Tesla
Product Series: M-Class
Product: M60
Operating System: Linux 64-bit
CUDA toolkit: 10.1
Language: English (US)

We also tried different driver versions such as 440.26 and 435.21, as well as different parameter combinations (removing --dkms, removing --no-nvidia-modprobe, using the installer flags from GitHub issue #464), all to no avail.

Furthermore, this issue originates from the AWS DKMS bundle request for supporting NVIDIA drivers and CUDA in TensorFlow, but that issue was closed once the requested bundle was added.

It seems like it’s not functioning correctly, or something critical for an AWS system is missing (such as libGLdispatch.so.0). We tried this on a local computer outside of AWS and it works. What could be missing here?

We highly appreciate any help with this issue.
Thanks.

In this case, it is an incompatibility between the NVIDIA driver and this specific version of the Linux kernel.

Unfortunately, this happens pretty often between kernel updates and it either gets resolved in future versions or someone creates a version-specific patch to the drivers to maintain compatibility.

Hi, please refer to the compatibility matrix of tensorflow-gpu for details.

Currently, both TensorFlow v1 and v2 require CUDA 10.0 with GCC 7 on Linux kernels newer than 5.0.

So first get CUDA 10.0 instead of 10.1, and then install GCC 7 with sudo swupd bundle-add c-extras-gcc7.

If you don’t need CUDA for development, you don’t have to install GCC 7, but you still need to remove the --no-nvidia-modprobe parameter.

Additionally, CUDA ships with an old version of the NVIDIA driver, so you can safely extract the CUDA toolkit from the *.run file and install only CUDA itself, after you have installed the up-to-date NVIDIA driver yourself.
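
A rough sketch of that order of operations (the CUDA 10.0 runfile name and its --silent/--toolkit/--toolkitpath flags are from memory of the 10.0 installer, so double-check them against the file you actually download):

sudo swupd bundle-add c-extras-gcc7    # only needed if you compile CUDA code

# Install the up-to-date NVIDIA driver first (same runfile/flags as in the original post).
sudo sh NVIDIA-Linux-x86_64-435.21.run --dkms --silent

# Then install only the CUDA 10.0 toolkit, skipping the older driver bundled with it.
sudo sh cuda_10.0.130_410.48_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-10.0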

Hey there!

We tried different driver versions, such as the v440.26 (pre-release) NVIDIA driver from 5 days ago; check my post here for the error log. The same goes for the recommended v435.21 driver (here’s the error log for that version).

Furthermore, we are able to successfully install the NVIDIA drivers outside of Amazon AWS with the same kernel version, so in my opinion the issue could be related to the kernel-aws and kernel-aws-dkms bundle variants.

Hey, and thanks for your observation as well as the instructions! We actually have the full TensorFlow environment running on a local machine, but we need to get this to work in our cloud cluster that’s running on AWS for production purposes.

I strongly believe something is missing in the AWS variants of the kernel+dkms bundles, as I’ve described above, since we can get this to run just fine on a local computer but not in AWS EC2.

The error from your 435.21 install is not the same as the error from your 418.87.01 install.

You’re correct; I compared v418.87.01, v435.21 and v440.26 to show that it breaks with a variety of driver versions.

If we were to focus on the latest pre-release driver, v440.26, how can we approach resolving this issue? Are there any steps we can try?

You’d better not use a pre-release driver, because technical support for it is limited, even from NVIDIA.

That’s a very good point and a miss on our side in approaching this correctly. In that case, if we focus on the latest stable driver, v435.21, is there anything we can attempt, such as applying different parameters or similar?

Thank you.

I wrote a post regarding CUDA and tensorflow-gpu.
In short, you have to install CUDA 10.0 because this is the latest version supported by tensorflow-gpu.
Then, depending on whether you use TensorFlow only as a run-time dependency (for example, for machine learning models offered by PyTorch) or you need to develop TensorFlow programs, you may need to install GCC 7.

Yeah, focusing on the problem here makes sense, because the errors in that build log don’t indicate a compilation error but rather some other installation problem. It’s complaining about module symbols not being defined/available:

...
ERROR: "down_interruptible" [/var/lib/dkms/nvidia/435.21/build/nvidia-modeset.ko] undefined!
ERROR: "nvidia_unregister_module" [/var/lib/dkms/nvidia/435.21/build/nvidia-modeset.ko] undefined!
ERROR: "nvidia_get_rm_ops" [/var/lib/dkms/nvidia/435.21/build/nvidia-modeset.ko] undefined!
ERROR: "proc_mkdir_mode" [/var/lib/dkms/nvidia/435.21/build/nvidia-modeset.ko] undefined!
...

My understanding is that this is generated by depmod as a mash of the kernel-provided /lib/kernel/System.map* and /lib/modules/<version>/modules.symbols. For example, cat /lib/modules/5.3.7-853.native/modules.symbols | grep nvUvmInterfaceSessionCreate returns a match, but cat /lib/modules/5.3.7-171.aws/modules.symbols | grep nvUvmInterfaceSessionCreate does not.
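
If anyone wants to reproduce that comparison, something like this should do it (the kernel version strings are the ones from this thread, so substitute your own):

# List the nvUvmInterface* symbols each kernel's modules.symbols knows about and diff them.
grep nvUvmInterface /lib/modules/5.3.7-853.native/modules.symbols | sort > native.syms
grep nvUvmInterface /lib/modules/5.3.7-171.aws/modules.symbols | sort > aws.syms
diff native.syms aws.syms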

I talked with one of our kernel gurus, @btwarden, about what could cause this, and we found this difference between the native and aws kernel configs: CONFIG_TRIM_UNUSED_KSYMS=y. More info here: Linux 4.7 Adds Option To Remove Exported Kernel Symbols That Go Unused - Phoronix. Disabling this in the aws kernel allowed the NVIDIA build to complete successfully.
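
If you want to check what your running kernel was built with before the fix lands, something along these lines should work; /usr/lib/kernel/config-$(uname -r) is where I would expect Clear Linux to ship the config (an assumption on my part), and /proc/config.gz only exists if the kernel was built with CONFIG_IKCONFIG_PROC:

# Look for CONFIG_TRIM_UNUSED_KSYMS in the shipped kernel config, falling back to /proc/config.gz.
grep CONFIG_TRIM_UNUSED_KSYMS /usr/lib/kernel/config-"$(uname -r)" 2>/dev/null \
  || zgrep CONFIG_TRIM_UNUSED_KSYMS /proc/config.gz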

The change is in the build pipeline and should be coming soon to a release near you.


Thanks a lot for the details, and if anyone is looking for the post, it’s available here.

Awesome, great job!

Would there happen to be a GitHub project to track the release of this fix?

I believe it should show up in line 777 of linux-aws/config at 2e8f4127d2edae9ebd96bdb8174fc165b5c26fa2 · clearlinux-pkgs/linux-aws · GitHub

Here’s the commit 🙂
