Docker with NVIDIA driver

I am having an issue setting up a Docker container with the NVIDIA driver. I followed the instructions to set up the NVIDIA driver and it appears to be working fine. I’ve also installed the CUDA toolkit, though this is unnecessary with the newest versions of Docker. I run the following:

docker run --gpus all nvidia/cuda -base nvidia-smi

But get the following error:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

Has anyone run into or resolved this issue?

Here are the outputs that I believe show the driver is installed correctly:

ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Dec 8 18:49 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Dec 8 18:49 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Dec 8 18:49 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Dec 8 18:49 /dev/nvidia-modeset

lsmod | grep ^nvidia
nvidia_drm 45056 9
nvidia_modeset 1114112 12 nvidia_drm
nvidia 19931136 617 nvidia_modeset
nvidiafb 53248 0

Did you disable modprobe when you installed the NVIDIA driver?

I did. I thought that was necessary to make it work?

I followed the official guide on the clearlinux site exactly.

Try not disabling modprobe. It causes problems with CUDA modules.

Cool! I didn’t know this so thanks for mentioning it.

This issue on their GitHub suggests that the error means one of the prerequisites is missing (installing the alternative nvidia-container-toolkit). It looks like they only provide apt/yum repos, so you’ll have to compile it from source or dig into their packaging tools.
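A quick way to see whether that prerequisite is present is to look for its binaries on PATH. This is only a sketch: the names nvidia-container-runtime-hook and nvidia-container-cli are my assumption about what the toolkit packages install (they are not mentioned in this thread), and find_missing_tools is a made-up helper name.

```shell
# Print the names of any tools from the argument list that are not on PATH.
# Prints an empty line when everything is found.
find_missing_tools() {
  missing=""
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  # Strip the leading space left by the concatenation above
  printf '%s\n' "${missing# }"
}

# Assumed binary names; adjust if your build installs different ones.
find_missing_tools nvidia-container-runtime-hook nvidia-container-cli
```

If either name is printed, the toolkit is probably not installed where the Docker daemon can find it, which would be consistent with the "could not select device driver" error.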

Another thing to check is the default Docker runtime. Some bundles in Clear Linux currently add helpers that make Kata Containers the default, which could be getting in the way:

$ sudo docker info | grep "Default Runtime"
Default Runtime: runc
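If you want to script that check, something like the following should work. It is a rough sketch that parses the text output of docker info from stdin; default_runtime and check_runtime are hypothetical helper names, and treating anything other than runc as a warning is my own assumption.

```shell
# Extract the value of the "Default Runtime" line from `docker info` text on stdin.
default_runtime() {
  sed -n 's/^[[:space:]]*Default Runtime:[[:space:]]*//p'
}

# Classify the default runtime read from stdin.
check_runtime() {
  rt=$(default_runtime)
  case "$rt" in
    runc) echo "ok: default runtime is runc" ;;
    "")   echo "unknown: no Default Runtime line found" ;;
    *)    echo "warning: default runtime is $rt" ;;
  esac
}

# Usage (needs a running Docker daemon):
#   sudo docker info 2>/dev/null | check_runtime
```

A warning here does not prove the runtime is the cause, it only tells you the default has been changed from runc.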

The job of nvidia-modprobe is to load the nvidia driver and create the /dev/nvidia* character devices if they don’t already exist, as tends to be the case on headless systems. Based on @cmack5644’s output, it looks like those character devices are already there, so I don’t think it is the problem in this case. (I could be wrong!)
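For anyone who wants to check this on their own machine, here is a small sketch that lists which of the expected device nodes are absent. The default node names (nvidiactl and nvidia0) are taken from the ls output earlier in the thread; missing_nvidia_nodes is a hypothetical helper, and it only tests that the path exists, not that it is a character device.

```shell
# List expected NVIDIA device nodes that are absent under a device directory.
# $1: device directory (defaults to /dev); remaining args: node names to look for.
missing_nvidia_nodes() {
  dev_dir="${1:-/dev}"
  if [ "$#" -gt 0 ]; then
    shift
  fi
  # Fall back to the two nodes a single-GPU system would normally have
  if [ "$#" -eq 0 ]; then
    set -- nvidiactl nvidia0
  fi
  for node in "$@"; do
    # Existence check only; use -c instead of -e to insist on a character device
    [ -e "$dev_dir/$node" ] || echo "$dev_dir/$node"
  done
}

missing_nvidia_nodes /dev
```

No output means the nodes are already there, in which case nvidia-modprobe has nothing left to create.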

nvidia-modprobe is a relatively simple helper whose job could be done in a variety of other ways. If I remember right, the CUDA toolkit just happens to rely on nvidia-modprobe specifically. But since CUDA isn’t required on the host anymore, it should be a non-issue.

Okay. This makes sense.