Accelerator support
In 2021.03 we added the machinery to enable GPU support in EESSI by adding a new trusted directory (where the GPU runtime libraries are expected to be found) to the loader of the compatibility layer. This approach doesn't do any sanity checking though, and leaves the door open to things not working (for example, because the host driver is too old for the CUDA version shipped with EESSI). I think we can do better than this:
- Add a single variant symlink to the CVMFS config that points (by default) to `/opt/eessi` (see the configuration sketch after this list)
  - Since the pilot version number occurs right at the root, this will have to be something like `/cvmfs/pilot.eessi-hpc.org/host_injections`
- Create a script that controls the structure underneath this folder (a sketch is included after this list)
  - Driver libraries are related to the compatibility layer, so I would suggest, e.g., `/cvmfs/pilot.eessi-hpc.org/host_injections/2021.03/compat/linux/x86_64/lib` (and this will be added as a trusted glibc dir for the compatibility layer)
  - The script will check that the driver libraries are adequate to support the requested stack (and tell you what to do if they are not)
  - It will search for all the libraries required to use the CUDA-enabled EESSI stack (see the recent OpenSSL easyblock PR for the machinery to do this)
  - It will place symlinks to the libraries in the designated subfolder
  - It will perform sanity checks to verify that the setup works, e.g., `LD_LIBRARY_PATH=/cvmfs/pilot.eessi-hpc.org/host_injections/2021.03/compat/linux/x86_64/lib my_cuda_verification_exec`
  - The script can tell the user how to add all the CUDA-enabled modules to their environment
- I think CUDA-enabled modules should be installed in a separate path; for a flat naming scheme, making the modules available then just means an additional `module use ...`. For a hierarchical scheme, it should mean setting an additional environment variable.
  - EasyBuild has all the machinery needed to support this, we just need to build this configuration into our hooks.
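To make the variant symlink idea concrete, here is a rough sketch of what the client-side configuration could look like. CVMFS variant symlinks resolve `$(VARIABLE)` targets from the client configuration; the variable name (`EESSI_HOST_INJECTIONS`), the default target and the config file location below are illustrative assumptions, not settled choices.

```bash
# Illustrative sketch only; variable name and paths are assumptions.
# In the repository, host_injections would be a variant symlink with a default,
# created during a cvmfs_server transaction, e.g.:
#   ln -s '$(EESSI_HOST_INJECTIONS:-/opt/eessi)' /cvmfs/pilot.eessi-hpc.org/host_injections
#
# A site that wants the injections on a shared filesystem would then override
# the default in its CVMFS client configuration and reload:
echo 'EESSI_HOST_INJECTIONS=/shared/eessi/host_injections' \
  | sudo tee -a /etc/cvmfs/domain.d/eessi-hpc.org.local
sudo cvmfs_config reload pilot.eessi-hpc.org
```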
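And here is a minimal sketch of what the setup script could do, assuming `nvidia-smi` is available on the host and that a simple minimum-driver-version check is good enough. The version threshold, the library selection, the module path hint and `my_cuda_verification_exec` are placeholders for illustration, not a definitive implementation.

```bash
#!/bin/bash
# Sketch of a host_injections setup script; thresholds and paths are assumptions.
set -euo pipefail

EESSI_VERSION="2021.03"
HOST_INJECTIONS="/cvmfs/pilot.eessi-hpc.org/host_injections"
DRIVER_LIB_DIR="${HOST_INJECTIONS}/${EESSI_VERSION}/compat/linux/x86_64/lib"
MIN_DRIVER_VERSION="450.80.02"  # assumed minimum for the CUDA shipped with EESSI

# 1. Check that the host driver is new enough for the CUDA in the stack
driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
if [ "$(printf '%s\n' "${MIN_DRIVER_VERSION}" "${driver_version}" | sort -V | head -n1)" != "${MIN_DRIVER_VERSION}" ]; then
  echo "Driver ${driver_version} is too old (need >= ${MIN_DRIVER_VERSION}); please upgrade the host driver." >&2
  exit 1
fi

# 2. Symlink the driver user-space libraries into the trusted directory
mkdir -p "${DRIVER_LIB_DIR}"
for lib in $(ldconfig -p | awk '/libcuda\.so|libnvidia-ml\.so/ {print $NF}' | sort -u); do
  ln -sf "${lib}" "${DRIVER_LIB_DIR}/$(basename "${lib}")"
done

# 3. Sanity check: run a small CUDA executable against the injected libraries
LD_LIBRARY_PATH="${DRIVER_LIB_DIR}" my_cuda_verification_exec \
  && echo "GPU support for EESSI ${EESSI_VERSION} looks functional."

# 4. Tell the user how to pick up the CUDA-enabled modules (flat naming scheme assumed)
echo "To use the CUDA-enabled installations, run:"
echo "  module use ${HOST_INJECTIONS}/${EESSI_VERSION}/software/.../modules/all"
```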
RPATH overrides
We can also use this approach to allow for controlled ABI-compatible overrides of the software EESSI provides:
- For RPATH injections, this is related to the software installations themselves, so it should follow the existing subdirectory structure of EESSI underneath the symlink, e.g., `/cvmfs/pilot.eessi-hpc.org/host_injections/2021.03/software/linux/x86_64/intel/skylake_avx512/rpath_overrides/{software}/{version?}/lib`
- We then follow a similar approach as for the accelerator support: find the libraries required for the override, symlink them, and sanity check that executables use the override libraries and not the EESSI installations (again, see the recent OpenSSL easyblock); a sketch is included after this list.
- For security, we should limit the ability to override the installation and support overrides on a case-by-case basis. A definite use case is MPI; IO is another likely candidate.
- We can control the use of overrides via our EESSI EasyBuild hook. For example, for OpenMPI we only use the override if OpenMPI is part of the toolchain being used or is listed as a dependency.
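As a rough illustration of the override workflow for MPI, the sketch below symlinks a host MPI into the override location and checks that a binary from the stack resolves `libmpi` from there. The host MPI path, the checked binary (`osu_latency`, assumed to be on the `PATH` from an EESSI module) and the exact override subdirectory layout (the `{version?}` question is left open above) are assumptions for illustration.

```bash
#!/bin/bash
# Sketch of populating an RPATH override for the host MPI; paths are assumptions.
set -euo pipefail

EESSI_PREFIX="/cvmfs/pilot.eessi-hpc.org/host_injections/2021.03/software/linux/x86_64/intel/skylake_avx512"
OVERRIDE_DIR="${EESSI_PREFIX}/rpath_overrides/OpenMPI/lib"   # {version?} level omitted here
HOST_MPI_LIB="/opt/host-mpi/lib"                             # wherever the site's ABI-compatible MPI lives

# Symlink the host MPI libraries into the override location
mkdir -p "${OVERRIDE_DIR}"
for lib in "${HOST_MPI_LIB}"/libmpi*.so*; do
  ln -sf "${lib}" "${OVERRIDE_DIR}/$(basename "${lib}")"
done

# Sanity check: an EESSI binary should resolve libmpi from the override dir,
# not from the EESSI software installations
CHECK_BINARY="$(command -v osu_latency)"
if ldd "${CHECK_BINARY}" | grep 'libmpi\.so' | grep -q "${OVERRIDE_DIR}"; then
  echo "MPI override is picked up."
else
  echo "MPI override is NOT being used; check ABI compatibility and the RPATH injection." >&2
fi
```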