Help building PyTorch from source for Anaconda

This is not strictly about Clear Linux (CL) the OS; it is more a question about building a package on CL.

I understand there is a pytorch bundle, but AFAIK it has no CUDA support, and it is not packaged standalone or alongside Anaconda.

I have tried building PyTorch from source, backed by Anaconda.
The specific steps I followed:

  1. Installed the NVIDIA drivers using the provided guide, added /opt/nvidia/bin to my $PATH in ~/.profile, and verified that nvidia-smi worked.
  2. Installed the c-extras-gcc8 bundle and modified my Anaconda activate.d and deactivate.d scripts so that the appropriate gcc/g++ would be used, following the CUDA guide from Clear Linux.
  3. Installed CUDA from the NVIDIA website, then added /opt/cuda/lib64 to my LD_LIBRARY_PATH in ~/.profile and /opt/cuda/bin to my $PATH.
  4. Ran conda install -c pytorch magma-cuda102 to enable linalg support on the GPU.
  5. Successfully built PyTorch from tag v1.4.0 after applying the patch from FS#65202 : [python-pytorch-opt-cuda] incompatible nccl.
  6. Creating a GPU tensor then fails with THCudaCheck FAIL file=../aten/src/THC/THCGeneral.cpp line=50 error=999 : unknown error
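
For anyone trying to reproduce step 6, a minimal sketch of the kind of check that triggers the failure looks roughly like this. The function name gpu_tensor_smoke_test is made up for illustration; the torch calls themselves (torch.cuda.is_available, torch.ones with device="cuda", torch.cuda.get_device_name) are standard PyTorch API. It only reports what happened and does not try to diagnose the cause.

```python
def gpu_tensor_smoke_test():
    """Try to allocate a small tensor on the GPU and describe the outcome.

    Hypothetical helper for illustration; it returns a status string
    instead of raising, so it can be run on any machine.
    """
    try:
        import torch
    except (ImportError, OSError) as exc:
        # ImportError: torch is not installed in this environment.
        # OSError: a shared library that torch links against could not be loaded.
        return f"import failed: {exc}"

    try:
        if not torch.cuda.is_available():
            return "torch imported, but CUDA is not available"
        # Allocating on the device forces CUDA initialisation.
        x = torch.ones(2, 2, device="cuda")
        total = x.sum().item()
        name = torch.cuda.get_device_name(0)
    except RuntimeError as exc:
        # CUDA initialisation errors surface as RuntimeError.
        return f"CUDA error: {exc}"
    return f"ok: sum={total} on {name}"


if __name__ == "__main__":
    print(gpu_tensor_smoke_test())
```

Run it inside the activated conda environment, so that the freshly built torch is the one being imported.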

This was puzzling, as I had managed to build and import PyTorch v1.4.0 on CentOS 7 just last week. So my guess is that something in Clear Linux is causing this breakage.

Full details are in the issue I filed with PyTorch: Linking error in torch_shm_manager near end of compilation · Issue #34431 · pytorch/pytorch · GitHub

Not sure if this will help, since you seem more experienced than me, but I had some trouble with Meshroom until I added ~/.local/bin to my PATH to direct it to pyside2. You could check whether there are Python-specific dependencies installed in a different path?
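
A quick way to do that kind of check is a small Python sketch like the one below. It only looks at the directories mentioned in this thread (/opt/nvidia/bin, /opt/cuda/bin, ~/.local/bin and /opt/cuda/lib64); the EXPECTED table and the function name missing_path_entries are made up for illustration, so adjust them to your own install.

```python
import os
import sys

# Directories mentioned in this thread; an assumption, not a complete list.
EXPECTED = {
    "PATH": ["/opt/nvidia/bin", "/opt/cuda/bin", "~/.local/bin"],
    "LD_LIBRARY_PATH": ["/opt/cuda/lib64"],
}


def _norm(path):
    # Expand "~" and drop trailing slashes so entries compare equal.
    return os.path.normpath(os.path.expanduser(path))


def missing_path_entries(env=None, expected=EXPECTED):
    """Return {variable: [directories not found in that variable]}."""
    env = os.environ if env is None else env
    missing = {}
    for var, wanted in expected.items():
        present = {_norm(p) for p in env.get(var, "").split(os.pathsep) if p}
        absent = [d for d in wanted if _norm(d) not in present]
        if absent:
            missing[var] = absent
    return missing


if __name__ == "__main__":
    # Shows which interpreter is running, e.g. the conda one or the system one.
    print("python:", sys.executable)
    print("missing:", missing_path_entries())
```

An empty result only means the directories are listed in the variables, not that the right files are inside them.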

Figured it out: I had to remove the --no-nvidia-modprobe flag when installing the NVIDIA drivers. This was actually noted in "Bash scripts to automate installation of NVIDIA proprietary driver", which was a post on this very forum. I think the guide should also be changed; this frustrated me for several hours…
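
For anyone who hits the same error, here is a rough sketch of a check that might save some time. As far as I understand it, the nvidia-modprobe helper is what lets non-root programs load the kernel module and create the NVIDIA device nodes, so installing with --no-nvidia-modprobe can leave nodes such as /dev/nvidia-uvm missing. That this was the exact cause in my case is an assumption; I only confirmed that removing the flag fixed it. The function name missing_nvidia_device_nodes is made up for illustration.

```python
import os

# Device nodes the NVIDIA driver normally exposes on Linux for a single GPU.
# CUDA needs nvidia-uvm in addition to the control and per-GPU nodes.
DEVICE_NODES = ("nvidiactl", "nvidia0", "nvidia-uvm")


def missing_nvidia_device_nodes(dev_dir="/dev", nodes=DEVICE_NODES):
    """Return the names from `nodes` that do not exist under `dev_dir`."""
    return [n for n in nodes if not os.path.exists(os.path.join(dev_dir, n))]


if __name__ == "__main__":
    missing = missing_nvidia_device_nodes()
    if missing:
        print("missing device nodes:", ", ".join(missing))
    else:
        print("all expected NVIDIA device nodes are present")
```

Note that nvidia-smi can work even when /dev/nvidia-uvm is absent, so a passing nvidia-smi check does not rule this out.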

You know, I actually thought about that and those very bash scripts and nearly mentioned it, but I assumed you had installed nvidia-modprobe as part of the CUDA install guide. I can’t even tell you how many times I’ve rerun those bash scripts when something goes wrong, lol…