@casparvl
Contributor

@casparvl casparvl commented Aug 6, 2025

Try to change the subdir in which the CUDA toolkit is installed so that it also doesn't include the CPU microarchitecture

…at it also doesnt include the CPU microarchitecture
@casparvl
Contributor Author

casparvl commented Aug 6, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf

eessi-bot-surf bot commented Aug 6, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_59/13622631

date job status comment
Aug 06 20:32:59 UTC 2025 submitted job id 13622631 will be eligible to start in about 20 seconds
Aug 06 20:33:08 UTC 2025 received job awaits launch by Slurm scheduler
Aug 06 20:33:21 UTC 2025 running job 13622631 is running
Aug 06 20:42:47 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-13622631.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17545126210.tar.gz
size: 2067 MiB (2167763827 bytes)
entries: 5559
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
CUDA/12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
CUDA/12.1.1
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
Aug 06 20:42:47 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13622631.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Contributor Author

casparvl commented Aug 6, 2025

Hmmm, success, but not what I planned. The install directory used by install_cuda_and_libraries.sh was:

installpath               (E) = /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4

I wanted it to be /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64.

I guess the sed command isn't correct:

sed: -e expression #1, char 20: unknown option to `s'
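
For context: sed typically reports `unknown option to `s'` when the pattern or replacement itself contains the delimiter character, which is easy to hit when substituting `/`-separated paths. The intended transformation can also be sketched without sed. This is only an illustration, not the code in install_cuda_and_libraries.sh; the function name `cpu_family_prefix` and the assumption that the CPU family is the path component directly after `linux` are hypothetical.

```python
from pathlib import PurePosixPath


def cpu_family_prefix(installpath: str) -> str:
    """Cut a microarchitecture-specific path back to its CPU-family directory.

    Assumes (hypothetically) that the CPU family is the component that
    directly follows 'linux' in the path.
    """
    parts = PurePosixPath(installpath).parts
    if "linux" not in parts:
        raise ValueError(f"no 'linux' component in {installpath!r}")
    idx = parts.index("linux")
    if idx + 1 >= len(parts):
        raise ValueError(f"no CPU family after 'linux' in {installpath!r}")
    # Keep everything up to and including the CPU family, e.g. 'x86_64'.
    return str(PurePosixPath(*parts[: idx + 2]))


old = "/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4"
print(cpu_family_prefix(old))
```

Working on path components rather than on the raw string avoids any delimiter or escaping question.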

The odd thing is that this should have broken the sanity check for installing CUDA in the software-layer, because that should have created symlinks that point to this directory.

@casparvl
Contributor Author

casparvl commented Aug 6, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf

eessi-bot-surf bot commented Aug 6, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_59/13629738

date job status comment
Aug 06 20:56:43 UTC 2025 submitted job id 13629738 will be eligible to start in about 20 seconds
Aug 06 20:56:53 UTC 2025 received job awaits launch by Slurm scheduler
Aug 06 20:57:07 UTC 2025 running job 13629738 is running
Aug 06 21:06:41 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-13629738.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17545140440.tar.gz
size: 2067 MiB (2167757923 bytes)
entries: 5559
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
CUDA/12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
CUDA/12.1.1
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
Aug 06 21:06:41 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13629738.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Contributor Author

casparvl commented Aug 6, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf

eessi-bot-surf bot commented Aug 6, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_59/13632056

date job status comment
Aug 06 20:59:47 UTC 2025 submitted job id 13632056 will be eligible to start in about 20 seconds
Aug 06 21:00:01 UTC 2025 received job awaits launch by Slurm scheduler
Aug 06 21:00:16 UTC 2025 running job 13632056 is running
Aug 06 21:15:34 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-13632056.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17545145870.tar.gz
size: 2067 MiB (2167731041 bytes)
entries: 5559
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
CUDA/12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
CUDA/12.1.1
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
Aug 06 21:15:34 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13632056.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Contributor Author

casparvl commented Aug 6, 2025

That's more like it!

installpath               (E) = /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64

Now I still need to carefully check the symlinks for the installations, to make sure they also refer here (because the old location also still contains CUDA, so it wouldn't lead to a broken install - making any mistakes harder to spot).
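
A check like this could be scripted. The sketch below is a hypothetical helper (`find_stale_symlinks` is not part of the EESSI scripts) that walks an unpacked tarball and lists every symlink whose target still lies under the old, microarchitecture-specific prefix. Because the new prefix is a parent of the old one, it matches on the longer old prefix rather than on the new one.

```python
import os


def find_stale_symlinks(root: str, old_prefix: str) -> list[tuple[str, str]]:
    """Return (link, target) pairs under root whose target lies under old_prefix."""
    old_prefix = old_prefix.rstrip("/")
    stale = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk does not descend into symlinked directories by default,
        # but it still lists them in dirnames, so check both lists.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.readlink(path)
            if target == old_prefix or target.startswith(old_prefix + "/"):
                stale.append((path, target))
    return sorted(stale)
```

For this PR it would be called with the unpacked `2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1` directory as `root` and `/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4` as `old_prefix`; an empty result would mean no link still refers to the old location.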

@casparvl
Contributor Author

casparvl commented Aug 6, 2025

Yep, symlinks are still 'wrong', pointing to the old location:

lrwxrwxrwx   1 eessibot prjs1395  110 Aug  6 23:09 ptxas -> /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4/software/CUDA/12.1.1/bin/ptxas

I'll check further tomorrow. The EB build log will probably show some output from the eb_hooks.

@casparvl
Contributor Author

casparvl commented Aug 6, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf

eessi-bot-surf bot commented Aug 6, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_59/13643484

date job status comment
Aug 06 21:17:44 UTC 2025 submitted job id 13643484 will be eligible to start in about 20 seconds
Aug 06 21:17:48 UTC 2025 received job awaits launch by Slurm scheduler
Aug 06 21:18:21 UTC 2025 running job 13643484 is running
Aug 06 21:27:44 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-13643484.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17545153110.tar.gz
size: 2067 MiB (2167725181 bytes)
entries: 5559
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
CUDA/12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
CUDA/12.1.1
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
Aug 06 21:27:44 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13643484.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Contributor Author

casparvl commented Aug 6, 2025

That looks better:

 ls -al 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc
lrwxrwxrwx 1 eessibot prjs1395 100 Aug  6 23:21 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc -> /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/software/CUDA/12.1.1/bin/nvcc

…to e.g. /cvmfs/software.eessi.io/host_injections/x86_64, i.e. only include the CPU family in the prefix, not microarchitecture or accelerator architecture. Since these are binary installs, we don't need multiple copies, and requiring site admins to run the install scripts once per micro-architecture is just annoying (and requires more storage)
@casparvl
Contributor Author

casparvl commented Aug 7, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf

eessi-bot-surf bot commented Aug 7, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_59/13737255

date job status comment
Aug 07 11:18:09 UTC 2025 submitted job id 13737255 will be eligible to start in about 20 seconds
Aug 07 11:18:18 UTC 2025 received job awaits launch by Slurm scheduler
Aug 07 11:18:31 UTC 2025 running job 13737255 is running
Aug 07 11:27:56 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13737255.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17545660100.tar.gz
size: 0 MiB (23442 bytes)
entries: 2
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
Aug 07 11:27:56 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13737255.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl casparvl changed the title from "Adapt subdir for CUDA toolkig in host injections" to "Adapt subdir for CUDA toolkit in host injections" Aug 7, 2025
…DNN package was found in the old host-injections location (with micro-arch specific subdir). Also, adapt the path to search for the regular LmodError
@casparvl
Contributor Author

casparvl commented Aug 7, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf

eessi-bot-surf bot commented Aug 7, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_59/13739465

date job status comment
Aug 07 11:55:25 UTC 2025 submitted job id 13739465 will be eligible to start in about 20 seconds
Aug 07 11:55:32 UTC 2025 received job awaits launch by Slurm scheduler
Aug 07 11:55:56 UTC 2025 running job 13739465 is running
Aug 07 12:00:33 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13739465.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17545679620.tar.gz
size: 0 MiB (23442 bytes)
entries: 2
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
Aug 07 12:00:33 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13739465.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Contributor Author

casparvl commented Aug 7, 2025

== FAILED: Installation ended unsuccessfully (build directory: /tmp/tmp.4EJug6QIRZ/temp_install_storage/cuda_n_co.Ho7/build/CUDA/12.1.1/system-system): build failed (first 300 chars): Failed to create directory /cvmfs/software.eessi.io/host_injections/x86_64/software/CUDA/12.1.1: [Errno 13] Permission denied: '/cvmfs/software.eessi.io/host_injections/x86_64' (took 7 mins 30 secs)

Hmmm, that's strange. This directory is writeable:

$ ls -ald /path/to/bot/host-injections
drwxrwsr-x+ 5 ABC XYZ 4096 Aug  7 13:19 /path/to/bot/host-injections
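
The listing above shows the mode bits of the host path, while the build writes through the `/cvmfs/...` path as mounted inside the container, and the trailing `+` indicates ACLs that the mode bits do not show. A more direct test is to attempt a write from inside the build environment. The following is a rough sketch under those assumptions; the helper names `nearest_existing_dir` and `can_create_under` are hypothetical and not part of the bot or EasyBuild.

```python
import os
import tempfile


def nearest_existing_dir(path: str) -> str:
    """Walk upwards from path until a directory that exists is found."""
    current = os.path.abspath(path)
    while not os.path.isdir(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def can_create_under(path: str) -> bool:
    """Try to create (and immediately remove) a temporary file near path."""
    base = nearest_existing_dir(path)
    try:
        # The temporary file is deleted again when the context exits.
        with tempfile.NamedTemporaryFile(dir=base):
            pass
    except OSError:
        return False
    return True
```

Calling `can_create_under("/cvmfs/software.eessi.io/host_injections/x86_64/software/CUDA/12.1.1")` inside the container would exercise the same location that the failed directory creation reported.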

@casparvl
Contributor Author

casparvl commented Aug 7, 2025

Also:

grep: /cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/driver_version.txt: Permission denied
The host GPU driver libraries version have changed. Now its: (v575.57.08)
Cleaning out outdated symlinks.
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/cuda_version.txt': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/driver_version.txt': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/host': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/latest': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libEGL.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libEGL.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libEGL_nvidia.so.0': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGL.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGL.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLESv1_CM.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLESv1_CM.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLESv1_CM_nvidia.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLESv2.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLESv2.so.2': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLESv2_nvidia.so.2': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLX.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLX.so.0': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLX_nvidia.so.0': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLdispatch.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libGLdispatch.so.0': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libOpenCL.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libOpenGL.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libOpenGL.so.0': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libcuda.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libcuda.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libcudadebugger.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvcuvid.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvcuvid.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-cfg.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-cfg.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-egl-wayland.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-eglcore.so.555.42.06': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-encode.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-encode.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-fbc.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-fbc.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-glcore.so.555.42.06': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-glsi.so.555.42.06': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-glvkspirv.so.555.42.06': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-gpucomp.so.555.42.06': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-gtk3.so.555.42.06': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-ml.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-ml.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-nvvm.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-nvvm.so.4': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-opencl.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-opticalflow.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-ptxjitcompiler.so': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-ptxjitcompiler.so.1': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-rtcore.so.555.42.06': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvidia-tls.so.555.42.06': Permission denied
rm: cannot remove '/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/libnvoptix.so.1': Permission denied

That's really strange, it looks like the issue I had before when the bind-mounting became the default, except: the repo is really fuse-mounted here:

add fusemount options for CVMFS repo 'eessi.io-2023.06-software'
Using a fuse mount for /cvmfs/eessi.io-2023.06-software
...
singularity  run --nv --contain --fusemount container:cvmfs2 software.eessi.io /cvmfs_ro/software.eessi.io --fusemount container:unionfs -o cow /tmp/software.eessi.io/overlay-upper=RW:/cvmfs_ro/software.eessi.io=RO /cvmfs/software.eessi.io /tmp/eessibot/EESSI/eessi_job.mp0KqBw1YK/eessi.iTYZSjkO6k/ghcr.io_eessi_build_node_debian12.sif /gpfs/work1/1/eessibot/eessi-bot-surf/jobs/2025.08/pr_59/event_31042a10-7380-11f0-8928-4b97b9f16a29/run_000/linux_x86_64_amd_zen4/eessi.io-2023.06-software/install_software_layer.sh --build-logs-dir /projects/eessibot/eessi-bot-surf/buildlogs --shared-fs-path /projects/eessibot/eessi-bot-surf/SHARED
...

@casparvl
Contributor Author

casparvl commented Aug 7, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf

eessi-bot-surf bot commented Aug 7, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_59/13743306

date job status comment
Aug 07 12:25:45 UTC 2025 submitted job id 13743306 will be eligible to start in about 20 seconds
Aug 07 12:25:50 UTC 2025 received job awaits launch by Slurm scheduler
Aug 07 12:26:13 UTC 2025 running job 13743306 is running
Aug 07 12:30:50 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13743306.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17545697770.tar.gz
size: 0 MiB (23441 bytes)
entries: 2
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
Aug 07 12:30:50 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13743306.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Contributor Author

casparvl commented Aug 7, 2025

Hm, issue might have been two bot jobs trying at the same time. I cleaned out the host_injections/x86_64 dir for the bot, so that we can start fresh.

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
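
If two bot jobs writing to the same host_injections directory were indeed the cause (the thread does not confirm this), one way to rule it out would be to serialize the installation step with an advisory file lock. The sketch below is purely illustrative and not something the bot or the install scripts are known to do; `exclusive_lock` and the lock file path are hypothetical, and `flock` behaviour on shared or network filesystems depends on the filesystem.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path: str, blocking: bool = True):
    """Hold an advisory exclusive lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o664)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # In non-blocking mode this raises BlockingIOError if the lock is held elsewhere.
        fcntl.flock(fd, flags)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A job would wrap the part that cleans and repopulates the shared directory in `with exclusive_lock(...)`, so a second job waits (or fails fast with `blocking=False`) instead of removing files the first job is still using.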

@eessi-bot-aws

eessi-bot-aws bot commented Aug 15, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-intel-icelake and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.08/pr_59/83112

date job status comment
Aug 15 18:40:27 UTC 2025 submitted job id 83112 awaits release by job manager
Aug 15 18:41:15 UTC 2025 released job awaits launch by Slurm scheduler
Aug 15 18:45:26 UTC 2025 running job 83112 is running
Aug 15 19:10:25 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-83112.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-intel-icelake-accel-nvidia-cc80-17552842750.tar.gz
size: 5072 MiB (5318411346 bytes)
entries: 11913
modules under 2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all
CUDA/12.1.1.lua
CUDA/12.4.0.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/software
CUDA/12.1.1
CUDA/12.4.0
cuDNN/8.9.2.26-CUDA-12.1.1
reprod directories under 2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
2023.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
2023.06/software/linux/x86_64/intel/icelake/.lmod/SitePackage.lua
Aug 15 19:10:25 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-83112.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Aug 18 10:38:26 UTC 2025 uploaded transfer of eessi-2023.06-software-linux-x86_64-intel-icelake-accel-nvidia-cc80-17552842750.tar.gz to S3 bucket succeeded

@eessi-bot-aws

eessi-bot-aws bot commented Aug 15, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-generic and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.08/pr_59/83113

date job status comment
Aug 15 18:40:31 UTC 2025 submitted job id 83113 awaits release by job manager
Aug 15 18:41:10 UTC 2025 released job awaits launch by Slurm scheduler
Aug 15 18:45:21 UTC 2025 running job 83113 is running
Aug 15 19:12:32 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-83113.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-generic-accel-nvidia-cc80-17552843150.tar.gz
size: 5072 MiB (5318415126 bytes)
entries: 11913
modules under 2023.06/software/linux/x86_64/generic/accel/nvidia/cc80/modules/all
CUDA/12.1.1.lua
CUDA/12.4.0.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/generic/accel/nvidia/cc80/software
CUDA/12.1.1
CUDA/12.4.0
cuDNN/8.9.2.26-CUDA-12.1.1
reprod directories under 2023.06/software/linux/x86_64/generic/accel/nvidia/cc80/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/generic/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
2023.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
2023.06/software/linux/x86_64/generic/.lmod/SitePackage.lua
Aug 15 19:12:32 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-83113.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Aug 18 10:49:37 UTC 2025 uploaded transfer of eessi-2023.06-software-linux-x86_64-generic-accel-nvidia-cc80-17552843150.tar.gz to S3 bucket succeeded
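
The tarball names and directory prefixes in the two bot reports above follow a visible pattern: the accelerator target sits in an `accel/` subdirectory below the CPU target. A minimal sketch of that pattern, assuming the layout is exactly what these listings show; the helper names `install_prefix` and `tarball_stem` are hypothetical and are not taken from the bot or from the software layer scripts:

```python
from typing import Optional


def install_prefix(version: str, cpu: str, accel: Optional[str] = None) -> str:
    """Directory prefix as it appears in the bot's artefact listing."""
    prefix = f"{version}/software/linux/{cpu}"
    if accel:
        # Accelerator builds live in an 'accel' subdirectory of the CPU target.
        prefix += f"/accel/{accel}"
    return prefix


def tarball_stem(version: str, cpu: str, accel: Optional[str] = None) -> str:
    """Tarball name without the trailing timestamp and '.tar.gz' suffix."""
    parts = ["eessi", version, "software", "linux", cpu.replace("/", "-")]
    if accel:
        parts += ["accel", accel.replace("/", "-")]
    return "-".join(parts)


print(install_prefix("2023.06", "x86_64/generic", "nvidia/cc80"))
print(tarball_stem("2023.06", "x86_64/intel/icelake", "nvidia/cc80"))
```

The timestamp suffix is left out on purpose, since how the bot generates it is not shown in this thread.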

@Neves-P
Member

Neves-P commented Aug 18, 2025

The ingestion is complete and took about 120 minutes. The staging, i.e., creating the PR EESSI/staging_bundles#7, took about 120 minutes.

real	123m15.124s
user	63m51.337s
sys	7m33.614s

Edit: fixed, thanks Caspar! 😄

@casparvl
Contributor Author

Just to clarify: this is actually the staging (i.e. fetching of tarballs by Stratum 0 from the S3 bucket, and opening the staging PR).

@Neves-P just informed me that the timing for the actual ingestion (i.e. the ingestion into CVMFS of the tarballs that were already staged on the Stratum 0), was:

real	110m8.823s
user	358m0.438s
sys	13m13.256s
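
To compare the two timings, it can help to convert them to seconds and look at CPU time relative to wall-clock time. A small sketch, assuming the `NmS.SSSs` format printed by the shell's `time` keyword as shown above; the function names are made up for illustration:

```python
import re


def to_seconds(text: str) -> float:
    """Convert a duration such as '110m8.823s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_to_wall_ratio(real: str, user: str, sys_: str) -> float:
    """(user + sys) / real: above 1 means more than one core was busy on average."""
    return (to_seconds(user) + to_seconds(sys_)) / to_seconds(real)


# Timings copied from the two comments above.
staging = cpu_to_wall_ratio("123m15.124s", "63m51.337s", "7m33.614s")
ingestion = cpu_to_wall_ratio("110m8.823s", "358m0.438s", "13m13.256s")
print(f"staging:   {staging:.2f}")
print(f"ingestion: {ingestion:.2f}")
```

This gives roughly 0.58 for the staging and 3.37 for the ingestion, so the staging spent a good part of its wall-clock time waiting rather than computing, while the ingestion kept several cores busy on average.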

@casparvl
Contributor Author

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@casparvl
Contributor Author

Hmmm, maybe the SURF bot is not configured for 2025.06 yet.

@casparvl
Contributor Author

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc architecture:aarch64/nvidia/grace accelerator:nvidia/cc90

@eessi-bot-jsc

eessi-bot-jsc bot commented Aug 19, 2025

New job on instance eessi-bot-jsc for CPU micro-architecture aarch64-nvidia-grace and accelerator nvidia/cc90 for repository eessi.io-2025.06-software in job dir /p/project1/ceasybuilders/eessibot/jobs/2025.08/pr_59/14003074

date job status comment
Aug 19 09:14:18 UTC 2025 submitted job id 14003074 awaits release by job manager
Aug 19 09:15:27 UTC 2025 released job awaits launch by Slurm scheduler
Aug 19 09:16:36 UTC 2025 running job 14003074 is running
Aug 19 09:17:38 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14003074.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-accel-nvidia-cc90-17555949770.tar.gz
size: 0 MiB (23541 bytes)
entries: 3
modules under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90
2025.06/init/easybuild/eb_hooks.py
2025.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
2025.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
Aug 19 09:17:38 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-14003074.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
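
The build verdict in these reports is a list of pattern checks against the job output file: four messages that must not appear and two that must. A rough sketch of that logic, assuming plain substring matching (the bot may well use regular expressions or other rules); `check_build_log` and `build_succeeded` are hypothetical names, and the sample log is invented:

```python
# Patterns copied from the "Details" lists in the bot reports.
MUST_BE_ABSENT = ("FATAL:", "ERROR:", "FAILED:", "required modules missing:")
MUST_BE_PRESENT = ("No missing installations", ".tar.gz created!")


def check_build_log(log_text: str):
    """Return one (passed, description) pair per check, in report order."""
    results = []
    for pattern in MUST_BE_ABSENT:
        found = pattern in log_text
        label = "found" if found else "no"
        results.append((not found, f"{label} message matching {pattern}"))
    for pattern in MUST_BE_PRESENT:
        found = pattern in log_text
        label = "found" if found else "no"
        results.append((found, f"{label} message matching {pattern}"))
    return results


def build_succeeded(log_text: str) -> bool:
    """The build counts as successful only if every check passes."""
    return all(passed for passed, _ in check_build_log(log_text))


# Invented log: an error line, but the tarball was still created.
sample_log = "ERROR: something went wrong\n/tmp/example.tar.gz created!\n"
for passed, description in check_build_log(sample_log):
    print("OK  " if passed else "FAIL", description)
```

With this sample the `ERROR:` and `No missing installations` checks fail while the tarball check passes, which is the same combination reported for job 14003074.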

@boegel
Contributor

boegel commented Aug 19, 2025

@casparvl For now, we only need the hooks file in the CPU-only directories for 2025.06, like versions/2025.06/software/linux/x86_64/amd/zen3

@boegel
Contributor

boegel commented Aug 19, 2025

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2
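
The build commands in this thread all have the shape `bot: <action> key:value ...`. A minimal sketch of splitting such a line into its parts, purely as an illustration; `parse_bot_command` is a hypothetical helper, not the bot's actual parser, and it keeps keys exactly as written (the thread uses both `architecture:` and `arch:`):

```python
def parse_bot_command(comment: str):
    """Split 'bot: <action> key:value ...' into (action, {key: value})."""
    prefix, sep, rest = comment.strip().partition(":")
    if not sep or prefix.strip() != "bot":
        raise ValueError("not a bot command")
    tokens = rest.split()
    if not tokens:
        raise ValueError("missing action")
    action, filters = tokens[0], {}
    for token in tokens[1:]:
        # Split on the first ':' only; values such as x86_64/amd/zen2 stay intact.
        key, sep, value = token.partition(":")
        if not sep or not value:
            raise ValueError(f"expected key:value, got {token!r}")
        filters[key] = value
    return action, filters


action, filters = parse_bot_command(
    "bot: build repo:eessi.io-2025.06-software "
    "instance:eessi-bot-mc-aws arch:x86_64/amd/zen2"
)
print(action, filters)
```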

@eessi-bot-aws

eessi-bot-aws bot commented Aug 19, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 for repository eessi.io-2025.06-software in job dir /project/def-users/SHARED/jobs/2025.08/pr_59/83958

date job status comment
Aug 19 09:47:54 UTC 2025 submitted job id 83958 awaits release by job manager
Aug 19 09:48:55 UTC 2025 released job awaits launch by Slurm scheduler
Aug 19 09:54:01 UTC 2025 running job 83958 is running
Aug 19 09:55:02 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-83958.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen2-1755597234-0.tar.gz
size: 0 MiB (23532 bytes)
entries: 3
modules under 2025.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/amd/zen2/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/amd/zen2
2025.06/init/easybuild/eb_hooks.py
2025.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
2025.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh
Aug 19 09:55:02 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-83958.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@boegel
Contributor

boegel commented Aug 19, 2025

Creating tarball works, but build is marked as failed, perhaps because there's nothing in software layer yet for EESSI 2025.06?

In that case, we can unbreak the chicken-egg situation by building & deploying recent GCC versions via EESSI/software-layer#1146, and ignore failing CI for 2025.06 for now?

@boegel
Contributor

boegel commented Aug 19, 2025

Error during build triggered for 2025.06:

ERROR: Easystack file easystacks/software.eessi.io/2023.06/accel/nvidia/rebuilds/20250807-eb-5.1.1-CUDA-cuDNN-new-host-injections-dir.yml is not intended for EESSI version 2025.06, giving up!

So it's clear that deploying the updated hooks for 2025.06 for this PR isn't going to work out.

We can handle that via EESSI/software-layer#1146 imho, so this is ready to merge?

@casparvl Do you agree?

@casparvl
Contributor Author

EESSI/software-layer#1146 is not going to solve that error, is it? The reason the error is thrown is because this PR contains an EasyStack file that is not targeted at that repo, I guess. Solution would still be my original plan: create a PR identical to this, but strip the EasyStack files :) I'll do it now
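
For illustration, the mismatch behind that error can be sketched as a comparison between the EESSI version embedded in the easystack path and the version being built for. This assumes the version is read from a `YYYY.MM` path component, which is only a guess at how the real check works; both function names are hypothetical:

```python
import re
from pathlib import PurePosixPath
from typing import Optional


def easystack_version(path: str) -> Optional[str]:
    """Return the first path component that looks like an EESSI version (YYYY.MM)."""
    for part in PurePosixPath(path).parts:
        if re.fullmatch(r"\d{4}\.\d{2}", part):
            return part
    return None


def is_intended_for(path: str, eessi_version: str) -> bool:
    """True if the easystack path targets the given EESSI version."""
    return easystack_version(path) == eessi_version


# Path copied from the error message above.
easystack = (
    "easystacks/software.eessi.io/2023.06/accel/nvidia/rebuilds/"
    "20250807-eb-5.1.1-CUDA-cuDNN-new-host-injections-dir.yml"
)
print(is_intended_for(easystack, "2023.06"))
print(is_intended_for(easystack, "2025.06"))
```

Under that assumption the file matches 2023.06 but not 2025.06, so a PR without the easystack files avoids the check altogether.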

casparvl pushed a commit to casparvl/software-layer-scripts that referenced this pull request Aug 19, 2025
@boegel
Contributor

boegel commented Aug 19, 2025

EESSI/software-layer#1146 is not going to solve that error, is it? The reason the error is thrown is because this PR contains an EasyStack file that is not targeted at that repo, I guess. Solution would still be my original plan: create a PR identical to this, but strip the EasyStack files :) I'll do it now

EESSI/software-layer#1146 would result in picking up on the updated hooks, but yes, a separate PR with only the updated hook in here would be a cleaner approach

@casparvl
Contributor Author

EESSI/software-layer#1146 would result in picking up on the updated hooks, but yes, a separate PR with only the updated hook in here would be a cleaner approach

Ah, that's what you mean. That's true. Anyway, I created #64
I'll start builds for that :)

@boegel
Contributor

boegel commented Aug 20, 2025

New PR that replaces #64:

@boegel
Contributor

boegel commented Aug 20, 2025

These changes were deployed for EESSI 2025.06 via #68, so CI should go green soon now (after we re-trigger the check)...

@boegel boegel merged commit 99c82b5 into EESSI:main Aug 20, 2025
74 of 78 checks passed