We’re having some trouble installing the latest NVIDIA drivers and CUDA Toolkit 10.1 for TensorFlow on AWS EC2. We followed the NVIDIA* Drivers guide, but replaced the native kernel/dkms bundles with the AWS variants, and we also removed the no-nvidia-modprobe parameter for CUDA support (installation fails with the same error whether or not the parameter is specified).
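For reference, this is roughly the installer invocation we used (a sketch; the exact flags follow the runfile installer’s documented options, and the filename matches the driver version listed below):

```shell
# Run the NVIDIA driver runfile installer non-interactively, registering
# the module with DKMS so it rebuilds on kernel updates:
sudo sh NVIDIA-Linux-x86_64-418.87.01.run --dkms --silent
# We omitted --no-nvidia-modprobe so CUDA's user-space libraries can load
# the kernel modules on demand; the install fails either way.
```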
NVIDIA driver v418.87.01 was acquired from NVIDIA Driver Downloads for our system’s GPU (Tesla M60):
Product Type: Tesla
Product Series: M-Class
Product: M60
Operating System: Linux 64-bit
CUDA toolkit: 10.1
Language: English (US)
We also tried different driver versions such as 440.26 / 435.21, as well as different parameter combinations (removing dkms, removing nvidia-modprobe, using the installer flags from GitHub issue #464), all to no avail.
It seems like something critical for an AWS system is missing or not functioning correctly (such as libGLdispatch.so.0). The same procedure works on a local computer outside of AWS. What could be missing here?
We highly appreciate any help with this issue.
Thanks.
Unfortunately, this happens pretty often between kernel updates, and it either gets resolved in a future release or someone creates a version-specific patch to the drivers to maintain compatibility.
Hi, please refer to the compatibility matrix of tensorflow-gpu for details.
Currently, both tensorflow v1 and v2 require CUDA 10.0 with GCC 7 on Linux kernels >5.0.
So first get CUDA 10.0 instead of 10.1, and then install GCC 7 with sudo swupd bundle-add c-extras-gcc7.
If you don’t need CUDA for development, you don’t have to install GCC 7, but you still need to remove the no-nvidia-modprobe parameter.
Additionally, the CUDA runfile ships with an old version of the NVIDIA driver. After you have installed the up-to-date NVIDIA driver yourself, you can safely extract the CUDA toolkit from the *.run file and install only CUDA itself.
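The steps above can be sketched like this (assuming the CUDA 10.0 runfile installer; the filename may differ for your download, and the --toolkit flag tells the installer to skip its bundled, older driver):

```shell
# Install only the CUDA 10.0 toolkit from the runfile, skipping the
# bundled driver (the up-to-date driver should already be installed):
sudo sh cuda_10.0.130_410.48_linux --silent --toolkit --override

# If you also need to compile CUDA programs, add GCC 7:
sudo swupd bundle-add c-extras-gcc7
```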
We tried different driver versions, such as the v440.26 (pre-release) NVIDIA driver released 5 days ago; check my post here for the error log. The same goes for the recommended v435.21 driver (here’s the error log for that version).
Furthermore, we are able to successfully install the NVIDIA drivers outside of Amazon AWS with the same kernel version, so in my opinion the issue could be related to the kernel-aws and kernel-aws-dkms bundle variants.
Hey, and thanks for your observations as well as the instructions! We actually have the full Tensorflow environment running on a local machine, but we need to get this working in our cloud cluster running on AWS for production purposes.
I strongly suspect something is missing in the AWS variants of the kernel and dkms bundles, as described above, since we can get this to run just fine on a local computer but not on AWS EC2.
That’s a very good point and a miss on our side in approaching this correctly. In that case, if we focus on the latest stable driver, v435.21, is there anything we can attempt, such as applying installer parameters or similar?
I wrote a post regarding CUDA and tensorflow-gpu.
In short, you have to install CUDA 10.0 because this is the latest version supported by tensorflow-gpu.
Then, depending on whether you use tensorflow as a run-time dependency (for example, for machine learning models offered by PyTorch) or you need to develop tensorflow programs, you may need to install GCC 7.
Yeah, focusing on the problem here makes sense, because the errors in that build log don’t indicate a compilation error but rather some other installation problem: it’s complaining about module symbols not being defined/available.
My understanding is that this file is generated by depmod as a mash of the kernel-provided /lib/kernel/System.map* and /lib/modules/<version>/modules.symbols. For example, grep nvUvmInterfaceSessionCreate /lib/modules/5.3.7-853.native/modules.symbols finds a match, but grep nvUvmInterfaceSessionCreate /lib/modules/5.3.7-171.aws/modules.symbols does not.
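That comparison can be wrapped in a small helper for checking several symbols at once (a sketch; has_symbol is a name I made up, and the kernel paths in the usage comments are the ones reported in this thread):

```shell
# Hypothetical helper: report whether a modules.symbols file exports a
# symbol. Lines in that file look like "alias symbol:<name> <module>".
has_symbol() {
    # $1 = path to a modules.symbols file, $2 = symbol name
    if grep -q "symbol:$2" "$1"; then
        echo "present"
    else
        echo "missing"
    fi
}

# Usage against the two kernels from this thread:
# has_symbol /lib/modules/5.3.7-853.native/modules.symbols nvUvmInterfaceSessionCreate
# has_symbol /lib/modules/5.3.7-171.aws/modules.symbols nvUvmInterfaceSessionCreate
```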