Add CUDA support #172
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| # How to add GPU support | ||
| The collection of scripts in this directory enables you to add GPU support to your setup. | ||
| Currently this means CUDA support for NVIDIA GPUs only. AMD GPUs are not yet supported (feel free to contribute that, though!). | ||
| To enable the usage of the CUDA runtime in your setup, simply run the following script: | ||
| ``` | ||
| ./add_nvidia_gpu_support.sh | ||
| ``` | ||
| This script will install the compatibility libraries (and only those by default!) you need to use the shipped runtime environment of CUDA. | ||
|
|
||
| If you plan on using the full CUDA suite, i.e. if you want to load the CUDA module, you will have to modify the script execution as follows: | ||
| ``` | ||
| export INSTALL_CUDA=true && ./add_nvidia_gpu_support.sh | ||
| ``` | ||
| This will again install the needed compatibility libraries as well as the whole CUDA suite. | ||
|
|
||
| If you need a different CUDA version than what is shipped with EESSI, you can also specify that particular version for the script: | ||
| ``` | ||
| export INSTALL_CUDA_VERSION=xx.y.z && export INSTALL_CUDA=true && ./add_nvidia_gpu_support.sh | ||
| ``` | ||
| Please note, however, that versions for which the runtime is not shipped with EESSI are not installed in the default modules path. | ||
| Thus, you will have to add the following to your modules path to get access to your custom CUDA version: | ||
| ``` | ||
| module use ${EESSI_SOFTWARE_PATH/versions/host_injections}/modules/all/ | ||
| ``` | ||
| ## Prerequisites and tips | ||
| * You need write permissions to `/cvmfs/pilot.eessi-hpc.org/host_injections` (which by default is a symlink to `/opt/eessi` but can be configured in your CVMFS config file to point somewhere else). If you would like to make a system-wide installation you should change this in your configuration to point somewhere on a shared filesystem. | ||
| * If you want to install CUDA on a node without GPUs (e.g. on a login node where you want to be able to compile your CUDA-enabled code), you should `export INSTALL_WO_GPU=true` in order to skip checks and tests that can only succeed if you have access to a GPU. This approach is not recommended as there is a chance the CUDA compatibility library installed is not compatible with the existing CUDA driver on GPU nodes (and this will not be detected). |
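The `module use` line in this README relies on Bash pattern substitution (`${var/pattern/replacement}`) to swap the `versions` component of `EESSI_SOFTWARE_PATH` for `host_injections`. A minimal sketch of that expansion follows. The helper name `to_host_injections` and the example path are illustrative assumptions, not part of this PR.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): replace the first occurrence of
# "versions" in a path with "host_injections", mirroring the README's
# ${EESSI_SOFTWARE_PATH/versions/host_injections} expansion.
to_host_injections() {
    local path="$1"
    echo "${path/versions/host_injections}"
}

# Example with an assumed (illustrative) software path:
example_path="/cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/generic"
echo "$(to_host_injections "${example_path}")/modules/all"
```

Only the first match is replaced, so a path that contains `versions` more than once would need a closer look.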
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| #!/bin/bash | ||
|
|
||
| cat << EOF | ||
|
||
| This is not implemented yet :( | ||
|
|
||
| If you would like to contribute this support there are a few things you will | ||
| need to consider: | ||
| - We will need to change the Lmod property added to GPU software so we can | ||
| distinguish AMD and Nvidia GPUs | ||
| - Support should be implemented in user space, if this is not possible (e.g., | ||
| requires a driver update) you need to tell the user what to do | ||
| - Support needs to be _verified_ and a trigger put in place (like the existence | ||
| of a particular path) so we can tell Lmod to display the associated modules | ||
| EOF | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,75 @@ | ||||||
| #!/bin/bash | ||||||
|
|
||||||
| # Drop into the prefix shell or pipe this script into a Prefix shell with | ||||||
|
||||||
| # $EPREFIX/startprefix <<< /path/to/this_script.sh | ||||||
|
|
||||||
| install_cuda="${INSTALL_CUDA:=false}" | ||||||
| install_cuda_version="${INSTALL_CUDA_VERSION:=11.3.1}" | ||||||
| install_p7zip_version="${INSTALL_P7ZIP_VERSION:=17.04-GCCcore-10.3.0}" | ||||||
|
|
||||||
| # If you want to install CUDA support on login nodes (typically without GPUs), | ||||||
| # set this variable to true. This will skip all GPU-dependent checks | ||||||
| install_wo_gpu=false | ||||||
|
||||||
| [ "$INSTALL_WO_GPU" = true ] && install_wo_gpu=true | ||||||
|
|
||||||
| # verify existence of nvidia-smi or this is a waste of time | ||||||
| # Check if nvidia-smi exists and can be executed without error | ||||||
| if [[ "${install_wo_gpu}" != "true" ]]; then | ||||||
| if command -v nvidia-smi > /dev/null 2>&1; then | ||||||
| nvidia-smi > /dev/null 2>&1 | ||||||
| if [ $? -ne 0 ]; then | ||||||
| echo "nvidia-smi was found but returned error code, exiting now..." >&2 | ||||||
| echo "If you do not have a GPU on this device but wish to force the installation," | ||||||
| echo "please set the environment variable INSTALL_WO_GPU=true" | ||||||
| exit 1 | ||||||
| fi | ||||||
| echo "nvidia-smi found, continue setup." | ||||||
| else | ||||||
| echo "nvidia-smi not found, exiting now..." >&2 | ||||||
| echo "If you do not have a GPU on this device but wish to force the installation," | ||||||
| echo "please set the environment variable INSTALL_WO_GPU=true" | ||||||
| exit 1 | ||||||
|
||||||
| fi | ||||||
| else | ||||||
| echo "You requested to install CUDA without GPUs present." | ||||||
| echo "This means that all GPU-dependent tests/checks will be skipped!" | ||||||
| fi | ||||||
|
|
||||||
| EESSI_SILENT=1 source /cvmfs/pilot.eessi-hpc.org/latest/init/bash | ||||||
|
|
||||||
| ############################################################################################## | ||||||
| # Check that the CUDA driver version is adequate | ||||||
| # ( | ||||||
| # needs to be r450 or r470 which are LTS, other production branches are acceptable but not | ||||||
| # recommended, below r450 is not compatible [with an exception we will not explore, see | ||||||
| # https://docs.nvidia.com/datacenter/tesla/drivers/#cuda-drivers] | ||||||
| # ) | ||||||
| # only check first number in case of multiple GPUs | ||||||
| if [[ "${install_wo_gpu}" != "true" ]]; then | ||||||
| driver_major_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | tail -n1) | ||||||
| driver_major_version="${driver_major_version%%.*}" | ||||||
| # Now check driver_version for compatibility | ||||||
| # Check driver is at least LTS driver R450, see https://docs.nvidia.com/datacenter/tesla/drivers/#cuda-drivers | ||||||
| if (( $driver_major_version < 450 )); then | ||||||
| echo "Your NVIDIA driver version is too old, please update it first." | ||||||
| exit 1 | ||||||
| fi | ||||||
| fi | ||||||
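The driver check above can be isolated into a small function, which makes it testable without a GPU. The sketch below is only an illustration: the name `driver_is_supported` is hypothetical, and it takes the first line of the version list (as the script's comment describes), whereas the script itself uses `tail -n1` and therefore the last line.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): succeed if the major version of the
# first driver version string is at least the given minimum.
driver_is_supported() {
    local versions="$1"          # e.g. output of the nvidia-smi query, one line per GPU
    local minimum="${2:-450}"    # R450 is the oldest LTS branch named in the script
    local first major
    first="$(printf '%s\n' "${versions}" | head -n1)"
    major="${first%%.*}"
    # Reject anything that is not a plain number before doing arithmetic
    [[ "${major}" =~ ^[0-9]+$ ]] || return 2
    (( major >= minimum ))
}

# Usage with an assumed version string instead of a live nvidia-smi call:
if driver_is_supported "470.57.02"; then
    echo "driver is new enough"
fi
```

Returning a distinct status for unparsable input keeps "driver too old" separate from "could not read the driver version".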
|
|
||||||
| ############################################################################################### | ||||||
| # Install CUDA | ||||||
| cuda_install_dir="${EESSI_SOFTWARE_PATH/versions/host_injections}" | ||||||
| mkdir -p ${cuda_install_dir} | ||||||
| if [ "${install_cuda}" != false ]; then | ||||||
| bash $(dirname "$BASH_SOURCE")/cuda_utils/install_cuda.sh ${install_cuda_version} ${cuda_install_dir} | ||||||
| fi | ||||||
|
Comment on lines +59 to +65
Member
Let's break this into a separate script (and PR), since it will be needed by #212. You also need to check the exit code on the creation of |
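The exit-code check requested in this comment could look roughly like the following. This is a sketch under assumptions: `ensure_dir` is a hypothetical name, and the demonstration uses a scratch directory rather than the real `host_injections` location.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): create a directory and report
# failure instead of silently continuing.
ensure_dir() {
    local dir="$1"
    if ! mkdir -p "${dir}"; then
        echo "Could not create directory ${dir}" >&2
        return 1
    fi
}

# Usage, demonstrated on a scratch location rather than host_injections:
scratch="$(mktemp -d)"
ensure_dir "${scratch}/software" && echo "created ${scratch}/software"
rm -rf "${scratch}"
```

Using `return` rather than `exit` leaves the decision to abort with the calling script.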
||||||
| ############################################################################################### | ||||||
| # Prepare installation of CUDA compat libraries, i.e. install p7zip if it is missing | ||||||
|
Member
You can drop stuff related to p7zip because of #212 (and that also means we can drop |
||||||
| $(dirname "$BASH_SOURCE")/cuda_utils/prepare_cuda_compatlibs.sh ${install_p7zip_version} ${cuda_install_dir} | ||||||
| ############################################################################################### | ||||||
| # Try installing five different versions of CUDA compat libraries until the test works. | ||||||
| # Otherwise, give up | ||||||
| bash $(dirname "$BASH_SOURCE")/cuda_utils/install_cuda_compatlibs_loop.sh ${cuda_install_dir} ${install_cuda_version} | ||||||
|
|
||||||
| cuda_version_file="/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/version.txt" | ||||||
|
Member
Suggested change
I also think that this creation should be part of |
||||||
| echo ${install_cuda_version} > ${cuda_version_file} | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| #!/bin/bash | ||
|
|
||
| # Get arch type from EESSI environment | ||
| if [[ -z "${EESSI_CPU_FAMILY}" ]]; then | ||
| # set up basic environment variables, EasyBuild and Lmod | ||
| EESSI_SILENT=1 source /cvmfs/pilot.eessi-hpc.org/latest/init/bash | ||
| fi | ||
| eessi_cpu_family="${EESSI_CPU_FAMILY:-x86_64}" | ||
|
|
||
| # build URL for CUDA libraries | ||
| # take rpm file for compat libs from rhel8 folder, deb and rpm files contain the same libraries | ||
| cuda_url="https://developer.download.nvidia.com/compute/cuda/repos/rhel8/"${eessi_cpu_family}"/" | ||
| # get all versions in descending order | ||
| files=$(curl -s "${cuda_url}" | grep 'cuda-compat' | sed 's/<\/\?[^>]\+>//g' | xargs -n1 | /cvmfs/pilot.eessi-hpc.org/latest/compat/linux/${eessi_cpu_family}/bin/sort -r --version-sort ) | ||
| if [[ -z "${files// }" ]]; then | ||
| echo "Could not find any compat lib files under" ${cuda_url} | ||
| exit 1 | ||
| fi | ||
| for file in $files; do echo "${cuda_url}$file"; done |
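The filtering and sorting step of this script can be tried offline on a saved directory listing. The sketch below is an approximation rather than the PR's exact pipeline: `list_compat_files` is a hypothetical name, the sample file names are illustrative, and it assumes a `sort` that supports `-V` (the short form of `--version-sort`), which is why the script itself calls the compat layer's `sort`.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): read an HTML directory listing on
# stdin and print the cuda-compat file names it mentions, newest version first.
list_compat_files() {
    grep 'cuda-compat' \
        | sed -E 's/<[^>]+>/ /g' \
        | tr -s ' \t' '\n' \
        | grep 'cuda-compat' \
        | sort -r -V
}

# Usage on an illustrative listing instead of a live download:
list_compat_files <<'EOF'
<a href='cuda-compat-11-2-460.32.03-1.x86_64.rpm'>cuda-compat-11-2-460.32.03-1.x86_64.rpm</a>
<a href='cuda-compat-11-4-470.57.02-1.x86_64.rpm'>cuda-compat-11-4-470.57.02-1.x86_64.rpm</a>
EOF
```

Version sort matters once minor versions reach two digits, where a plain reverse lexical sort would rank `11-2` above `11-10`.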
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| #!/bin/bash | ||
|
|
||
| install_cuda_version=$1 | ||
| cuda_install_dir=$2 | ||
|
Member
General CUDA installation is done via #212 now so I don't think you need this argument. This script is only about installing the CUDA package under |
||
|
|
||
| # TODO: Can we do a trimmed install? | ||
|
Member
This is done now via your hook |
||
| # Only install CUDA if specified version is not found. | ||
| # This is only relevant for users, the shipped CUDA installation will | ||
| # always be in versions instead of host_injections and have symlinks pointing | ||
| # to host_injections for everything we're not allowed to ship | ||
| if [ -f ${cuda_install_dir}/software/CUDA/${install_cuda_version}/EULA.txt ]; then | ||
|
Member
The if/else is still good, except we should be checking under the
Member
You should construct
Member
Also, we should allow for a forced installation to override this check
Member
I prefer that we ship the EULA text, so I think we should check for an expected broken symlink here: |
||
| echo "CUDA software found! No need to install CUDA again, proceeding with tests" | ||
| else | ||
| # - as an installation location just use $EESSI_SOFTWARE_PATH but replacing `versions` with `host_injections` | ||
| # (CUDA is a binary installation so no need to worry too much about this) | ||
| # The install is pretty fat, you need lots of space for download/unpack/install (~3*5GB), need to do a space check before we proceed | ||
| avail_space=$(df --output=avail ${cuda_install_dir}/ | tail -n 1 | awk '{print $1}') | ||
| if (( ${avail_space} < 16000000 )); then | ||
| echo "Need more disk space to install CUDA, exiting now..." | ||
| exit 1 | ||
|
Comment on lines +17 to +20
Member
This is a tricky one: we need space for sources, space for the build, and space for the install, but people can choose where to put all these. I guess we leave it as is for now, but allow people to set an envvar to override this check (and tell them that envvar in the error message) |
||
| fi | ||
| # install cuda in host_injections | ||
| module load EasyBuild | ||
| # we need the --rebuild option and a random dir for the module if the module file is shipped with EESSI | ||
| if [ -f ${EESSI_SOFTWARE_PATH}/modules/all/CUDA/${install_cuda_version}.lua ]; then | ||
|
Member
If this script is standalone, we'll need to guarantee |
||
| tmpdir=$(mktemp -d) | ||
| extra_args="--rebuild --installpath-modules=${tmpdir}" | ||
| fi | ||
| eb ${extra_args} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb | ||
| ret=$? | ||
|
Member
Let's import the bash functions defined in |
||
| if [ $ret -ne 0 ]; then | ||
| echo "CUDA installation failed, please check EasyBuild logs..." | ||
| exit 1 | ||
| fi | ||
| # clean up tmpdir if it exists | ||
| if [ -f ${EESSI_SOFTWARE_PATH}/modules/all/CUDA/${install_cuda_version}.lua ]; then | ||
| rm -rf ${tmpdir} | ||
| fi | ||
| fi | ||
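The disk-space check in this script, combined with the reviewer's suggestion of an override that is named in the error message, might be sketched as follows. The function name `has_enough_space` and the variable `SKIP_SPACE_CHECK` are assumptions made for illustration; the PR does not define them. The sketch uses `df -Pk` instead of `df --output=avail` so that it also runs where GNU `df` is unavailable.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): succeed if DIR has at least
# REQUIRED_KB kilobytes available, or if the assumed override variable
# SKIP_SPACE_CHECK is set to "true".
has_enough_space() {
    local dir="$1" required_kb="$2" avail_kb
    if [ "${SKIP_SPACE_CHECK:-false}" = "true" ]; then
        return 0
    fi
    # -P prints one POSIX-format line per filesystem, -k reports 1024-byte
    # blocks; column 4 of the second line is the available space.
    avail_kb="$(df -Pk "${dir}" | awk 'NR==2 {print $4}')"
    # Bail out with a distinct status if df gave us nothing numeric
    [[ "${avail_kb}" =~ ^[0-9]+$ ]] || return 2
    if (( avail_kb < required_kb )); then
        echo "Need at least ${required_kb} KB in ${dir} (found ${avail_kb} KB)." >&2
        echo "Set SKIP_SPACE_CHECK=true to skip this check." >&2
        return 1
    fi
}

# Usage with the threshold from the script (roughly 16 GB, expressed in KB):
has_enough_space "${TMPDIR:-/tmp}" 16000000 || echo "not enough space for a CUDA install"
```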
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| #!/bin/bash | ||
|
Member
Shouldn't this be a compat layer bash?
Member
It's fine as is, as long as the first thing we do is source the EESSI environment |
||
|
|
||
| libs_url=$1 | ||
| cuda_install_dir=$2 | ||
|
|
||
| current_dir=$(dirname $(realpath $0)) | ||
| host_injections_dir="/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia" | ||
| host_injection_linker_dir=${EESSI_EPREFIX/versions/host_injections} | ||
|
Member
The assumption here is that the EESSI environment has already been sourced |
||
|
|
||
| # Create a general space for our NVIDIA compat drivers | ||
| if [ -w /cvmfs/pilot.eessi-hpc.org/host_injections ]; then | ||
| mkdir -p ${host_injections_dir} | ||
| else | ||
| echo "Cannot write to eessi host_injections space, exiting now..." >&2 | ||
|
Member
Let's start using |
||
| exit 1 | ||
| fi | ||
| cd ${host_injections_dir} | ||
|
|
||
| # Check if our target CUDA is satisfied by what is installed already | ||
|
Member
Do we know what our target CUDA version is at this point? And if the
Member
If the supported CUDA version is new enough and comes from an EESSI installation of the CUDA compat libs, we can already exit gracefully.
Member
We should leverage the contents of |
||
| # TODO: Find required CUDA version and see if we need an update | ||
| driver_cuda_version=$(nvidia-smi -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}' | sed s/\\.//) | ||
| eessi_cuda_version=$(LD_LIBRARY_PATH=${host_injections_dir}/latest/compat/:$LD_LIBRARY_PATH nvidia-smi -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}' | sed s/\\.//) | ||
| if [[ $driver_cuda_version =~ ^[0-9]+$ ]]; then | ||
| if [ "$driver_cuda_version" -gt "$eessi_cuda_version" ]; then echo "You need to update your CUDA compatibility libraries"; fi | ||
| fi | ||
|
|
||
| # If not, grab the latest compat library RPM or deb | ||
| # Download and unpack in temporary directory, easier cleanup after installation | ||
| tmpdir=$(mktemp -d) | ||
| cd $tmpdir | ||
| compat_file=${libs_url##*/} | ||
| wget ${libs_url} | ||
|
||
| echo $compat_file | ||
|
|
||
| # Unpack it | ||
| # rpm files are the default for all OSes | ||
| # Keep support for deb files in case it is needed in the future | ||
| file_extension=${compat_file##*.} | ||
| if [[ ${file_extension} == "rpm" ]]; then | ||
| # p7zip is installed under host_injections for now, make that known to the environment | ||
| if [ -d ${cuda_install_dir}/modules/all ]; then | ||
| module use ${cuda_install_dir}/modules/all/ | ||
| fi | ||
|
Comment on lines +40 to +43
Member
You can drop this |
||
| # Load p7zip to extract files from rpm file | ||
| module load p7zip | ||
|
Member
The assumption here is that the EESSI environment has already been sourced |
||
| # Extract .cpio | ||
| 7z x ${compat_file} | ||
| # Extract lib* | ||
| 7z x ${compat_file/rpm/cpio} | ||
|
Comment on lines +47 to +49
Member
We should pipe the output of these to |
||
| # Restore symlinks | ||
| cd usr/local/cuda-*/compat | ||
| ls *.so *.so.? | xargs -i -I % sh -c '{ echo -n ln -sf" "; cat %; echo " "%; }'| xargs -i sh -c "{}" | ||
| cd - | ||
| elif [[ ${file_extension} == "deb" ]]; then | ||
| ar x ${compat_file} | ||
| tar xf data.tar.* | ||
| else | ||
| echo "File extension of cuda compat lib not supported, exiting now..." >&2 | ||
| exit 1 | ||
| fi | ||
| cd $host_injections_dir | ||
| cuda_dir=$(basename ${tmpdir}/usr/local/cuda-*) | ||
| # TODO: This would prevent error messages if folder already exists, but could be problematic if only some files are missing in destination dir | ||
| rm -rf ${cuda_dir} | ||
| mv -n ${tmpdir}/usr/local/cuda-* . | ||
| rm -r ${tmpdir} | ||
|
|
||
| # Add a symlink that points the latest version to the version we just installed | ||
| ln -sfn ${cuda_dir} latest | ||
|
|
||
| if [ ! -e latest ] ; then | ||
| echo "Symlink to latest cuda compat lib version is broken, exiting now..." | ||
| exit 1 | ||
| fi | ||
|
|
||
| # Create the space to host the libraries | ||
| mkdir -p ${host_injection_linker_dir} | ||
|
Member
We should always check exit codes on our commands; it seems a function that does that for us is needed |
||
| # Symlink in the path to the latest libraries | ||
| if [ ! -d "${host_injection_linker_dir}/lib" ]; then | ||
| ln -s ${host_injections_dir}/latest/compat ${host_injection_linker_dir}/lib | ||
| elif [ ! "${host_injection_linker_dir}/lib" -ef "${host_injections_dir}/latest/compat" ]; then | ||
| echo "CUDA compat libs symlink exists but points to the wrong location, please fix this..." | ||
| echo "${host_injection_linker_dir}/lib should point to ${host_injections_dir}/latest/compat" | ||
| exit 1 | ||
| fi | ||
|
|
||
| # return to initial dir | ||
| cd $current_dir | ||
|
|
||
| echo | ||
| echo CUDA driver compatibility libraries installed for CUDA version: | ||
| echo ${cuda_dir/cuda-/} | ||
|
Member
I'd drop the CUDA version supported into If we verify the installation in
Member
We could also use the |
||
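The `latest` symlink handling in this script (`ln -sfn` followed by an `-e` test) can be exercised on its own. The sketch below is an illustration only: `update_latest_link` is a hypothetical name and the directory layout in the usage example is made up.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): point BASE_DIR/latest at
# VERSION_DIR (relative to BASE_DIR) and fail if the link ends up dangling.
update_latest_link() {
    local base_dir="$1" version_dir="$2"
    # -n stops ln from descending into an existing "latest" link that points
    # to a directory; -f replaces the old link.
    ( cd "${base_dir}" && ln -sfn "${version_dir}" latest ) || return 1
    # -e follows the link, so it fails when the target does not exist
    [ -e "${base_dir}/latest" ]
}

# Usage on a scratch directory with an illustrative version folder:
scratch="$(mktemp -d)"
mkdir "${scratch}/cuda-11.4"
update_latest_link "${scratch}" "cuda-11.4" && readlink "${scratch}/latest"
rm -rf "${scratch}"
```

Running the `ln` inside a subshell keeps the caller's working directory unchanged, which avoids the `cd` back to the initial directory that the script performs at the end.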
I think `gpu` is a recognised property in Lmod, so it is a good choice for now. Once we add AMD support it will get more complicated.

We can add a new property by extending the property table `propT`. To do so, we could add a file `init/lmodrc.lua` with a new property. This file can be loaded using the env var `$LMOD_RC`. Unfortunately, we do not seem to be able to add entries to `arch` but rather have to add a new property (or find a way to extend `arch` that I'm missing).