Some important info about the CUDA software stack, and how it could change (subject to testing):
As most of you know, we have 3 layers:
- the kernel modules (`/lib/modules/$(uname -r)/extra/nvidia.ko.xz` and related)
- the user-mode driver component used to run CUDA applications (`/usr/lib64/nvidia/libcuda.so`)
- the CUDA toolkit (from `module load cuda`)
Up to now we assumed that the first two layers (the kernel modules and libcuda) are tightly coupled. But an NVIDIA employee in the EasyBuild Slack clarified that they are not: libcuda.so.1 is forward compatible, and the newest libcuda (465.x) is compatible with kernel drivers going all the way back to 418.40.04.
Note that there are in fact four maintained driver families: the long-term support ones (R418, EOL Mar 2022; R450, EOL Jul 2023) and the short-term ones (R460, EOL Jan 2022; and R465). Béluga and Graham are running an R460 version; Cedar is at R455, which is no longer supported.
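To make the support status easier to check at a glance, the branch list above can be written down as a small lookup table. This is only a sketch: the names `BRANCH_EOL`, `CLUSTER_BRANCH` and `is_maintained` are made up for illustration, the text gives no EOL for R465 (so it is stored as `None`), and a branch is assumed to stay supported through the end of the month listed.

```python
from datetime import date

# EOL months as given in the text; None means no EOL was stated there.
BRANCH_EOL = {
    "R418": (2022, 3),   # long-term support
    "R450": (2023, 7),   # long-term support
    "R460": (2022, 1),   # short-term
    "R465": None,        # short-term, EOL not given in the text
}

# Driver branch per cluster, as reported in the text.
CLUSTER_BRANCH = {"Béluga": "R460", "Graham": "R460", "Cedar": "R455"}


def is_maintained(branch, today):
    """Return True if `branch` is one of the listed families and not past its EOL month."""
    if branch not in BRANCH_EOL:
        return False  # e.g. R455 is not one of the maintained families
    eol = BRANCH_EOL[branch]
    if eol is None:
        return True  # assumption: no stated EOL is treated as still maintained
    return (today.year, today.month) <= eol


def unmaintained_clusters(today):
    """List the clusters whose driver branch is not maintained on `today`."""
    return sorted(c for c, b in CLUSTER_BRANCH.items() if not is_maintained(b, today))
```

For example, `unmaintained_clusters(date(2021, 6, 1))` reports only Cedar, matching the statement above that R455 is no longer supported.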
So this means that we could put the newest libcuda in cvmfs and the sysadmins only need to worry about the kernel modules. This will need to be tested of course (which we can do via LD_LIBRARY_PATH and/or the cvmfs-dev repo).
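A test along those lines could look roughly like the sketch below, which loads whatever `libcuda.so.1` the dynamic loader finds and asks it to initialise against the running kernel module. It uses the CUDA driver API calls `cuInit` and `cuDriverGetVersion` through `ctypes`; the helper names `probe_libcuda` and `decode_cuda_version` are made up for illustration, and nothing here has been tried on the clusters.

```python
import ctypes


def decode_cuda_version(raw):
    """Decode the integer from cuDriverGetVersion (1000*major + 10*minor), e.g. 11030."""
    return raw // 1000, (raw % 1000) // 10


def probe_libcuda(libname="libcuda.so.1"):
    """Try to load libcuda and initialise it against the running kernel module.

    The library is resolved by the dynamic loader, so LD_LIBRARY_PATH
    decides which copy of libcuda is actually tested.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError as exc:
        return {"loaded": False, "error": str(exc)}

    # cuInit returns a CUresult; 0 is CUDA_SUCCESS, anything else is a failure.
    init_rc = lib.cuInit(0)

    raw = ctypes.c_int(0)
    version_rc = lib.cuDriverGetVersion(ctypes.byref(raw))
    version = decode_cuda_version(raw.value) if version_rc == 0 else None

    return {
        "loaded": True,
        "init_ok": init_rc == 0,
        "init_rc": init_rc,
        "cuda_version": version,  # highest CUDA version this libcuda supports
    }
```

Running `probe_libcuda()` once with the system libcuda and once with `LD_LIBRARY_PATH` pointing at the directory holding the candidate libcuda would show whether the newer library still initialises against the installed kernel module, and which CUDA version it reports.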
Once libcuda is in place, all CUDA toolkit modules, including 11.3, can then be used on all clusters, irrespective of the kernel driver (as long as it's >= R418.40.04), and the present Lmod check could become obsolete.
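The only remaining condition is then the minimum kernel driver version. A minimal sketch of that check follows; it assumes the loaded module reports its version in `/sys/module/nvidia/version` (worth verifying on each cluster), and the function names are hypothetical, not part of the existing Lmod check.

```python
from pathlib import Path

# Minimum kernel driver from the text: R418.40.04
MIN_KERNEL_DRIVER = (418, 40, 4)


def parse_driver_version(text):
    """Turn a version string such as '460.32.03' into a tuple (460, 32, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def kernel_driver_ok(version, minimum=MIN_KERNEL_DRIVER):
    """Return True if the kernel driver version meets the minimum."""
    return parse_driver_version(version) >= minimum


def loaded_kernel_driver_version(path="/sys/module/nvidia/version"):
    """Read the version of the loaded nvidia kernel module, or None if unavailable.

    The sysfs path is an assumption about where the module exposes its version.
    """
    version_file = Path(path)
    if not version_file.is_file():
        return None
    return version_file.read_text().strip()
```

With this, `kernel_driver_ok("460.32.03")` is true and `kernel_driver_ok("418.39")` is false, so a single comparison would stand in for a per-toolkit check.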
As for kernel modules, clusters could consider staying with an R450 version, since with libcuda in cvmfs they no longer need to be upgraded to R460 to stay compatible with newer CUDA toolkit versions.
See the NVIDIA driver lifecycle documentation:
https://docs.nvidia.com/datacenter/tesla/drivers/#lifecycle
and the CUDA compatibility documentation:
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cuda-compatibility-platform