Good morning folks!
We’re having trouble installing the latest NVIDIA driver plus CUDA Toolkit 10.1 for TensorFlow on AWS EC2. We followed the NVIDIA driver installation guide, but replaced the native kernel/DKMS packages with the AWS variants, and we also removed the no-nvidia-modprobe parameter for CUDA support (the install fails with the same error whether or not that parameter is specified).
sh NVIDIA-Linux-x86_64-418.87.01.run \
  --utility-prefix=/opt/nvidia \
  --opengl-prefix=/opt/nvidia \
  --compat32-prefix=/opt/nvidia \
  --compat32-libdir=lib32 \
  --x-prefix=/opt/nvidia \
  --x-module-path=/opt/nvidia/lib64/xorg/modules \
  --x-library-path=/opt/nvidia/lib64 \
  --x-sysconfig-path=/etc/X11/xorg.conf.d \
  --documentation-prefix=/opt/nvidia \
  --application-profile-path=/etc/nvidia \
  --no-precompiled-interface \
  --no-distro-scripts \
  --force-libglx-indirect \
  --glvnd-egl-config-path=/etc/glvnd/egl_vendor.d \
  --egl-external-platform-config-path=/etc/egl/egl_external_platform.d \
  --dkms \
  --silent
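For context, these are the diagnostics we run after each failed attempt (a sketch, assuming the installer's default log location; adjust if a custom --log-file-name was passed):

```shell
# Inspect the tail of the installer log for the failing step
# (/var/log/nvidia-installer.log is the installer's default log path)
[ -f /var/log/nvidia-installer.log ] && tail -n 40 /var/log/nvidia-installer.log || true

# Check whether the kernel module was actually registered with DKMS and loaded
dkms status 2>/dev/null | grep -i nvidia || echo "nvidia module not registered with DKMS"
lsmod | grep -i nvidia || echo "nvidia kernel module not loaded"
```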
NVIDIA driver v418.87.01 was downloaded from the NVIDIA Driver Downloads page for our instance’s GPU (Tesla M60):
Product Type: Tesla
Product Series: M-Class
Product: M60
Operating System: Linux 64-bit
CUDA Toolkit: 10.1
Language: English (US)
We also tried different driver versions (440.26, 435.21) and different parameter combinations — removing dkms, removing nvidia-modprobe, and using the installer flags from GitHub issue #464 — all to no avail.
Furthermore, this issue originates from an earlier request for an AWS DKMS bundle to support NVIDIA drivers and CUDA with TensorFlow; that request was closed once the bundle was added.
It seems the bundle is not functioning correctly, or something critical for an AWS system is missing (such as libGLdispatch.so.0). The same procedure works on a local computer outside of AWS. What could be missing here?
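To compare the two machines, this is the check we use to see whether the suspect libraries are resolvable through the dynamic linker cache (a sketch; the second library name is included only for comparison and may differ on your system):

```shell
# Check whether the libraries the GLVnd stack expects are visible to the
# dynamic linker; libGLdispatch.so.0 is the one we suspect is missing on EC2.
for lib in libGLdispatch.so.0 libnvidia-glcore.so; do
  if ldconfig -p | grep -q "$lib"; then
    echo "$lib: found"
  else
    echo "$lib: MISSING"
  fi
done
```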
We highly appreciate any help with this issue.