Adapt subdir for CUDA toolkit in host injections #59
…at it also doesn't include the CPU microarchitecture
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 |
|
New job on instance
|
|
Hmmm, success, but not what I planned. Installdir for the I wanted it to be I guess the The odd thing is that this should have broken the sanity check for installing CUDA in the |
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 |
|
New job on instance
|
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 |
|
New job on instance
|
|
That's more like it! Now I still need to carefully check the symlinks for the installations, to make sure they also refer here (because the old location also still contains CUDA, so it wouldn't lead to a broken install - making any mistakes harder to spot). |
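A minimal sketch of how such a symlink check could be scripted, assuming Python is available on the build host. The function name `symlinks_pointing_into` and both path arguments are hypothetical; this is not part of the install scripts or the hooks in this PR.

```python
import os


def symlinks_pointing_into(install_dir, old_prefix):
    """Return (link, target) pairs for symlinks under install_dir
    whose target lies inside old_prefix (hypothetical helper)."""
    old_prefix = os.path.abspath(old_prefix)
    hits = []
    # os.walk does not descend into symlinked directories by default,
    # but it still lists them in `dirs`, so check both lists.
    for root, dirs, files in os.walk(install_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            if not os.path.islink(path):
                continue
            target = os.readlink(path)
            if not os.path.isabs(target):
                # Relative targets are resolved against the link's directory.
                target = os.path.join(root, target)
            target = os.path.normpath(target)
            if target == old_prefix or target.startswith(old_prefix + os.sep):
                hits.append((path, target))
    return hits
```

An empty result for the new installation directory would suggest no link still refers to the old location; it only compares path strings and does not verify that the targets exist.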
|
Yep, symlinks are still 'wrong', pointing to the old location: I'll check further tomorrow. The EB build log will probably show some output from the eb_hooks. |
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 |
|
New job on instance
|
|
That looks better: |
…to e.g. /cvmfs/software.eessi.io/host_injections/x86_64, i.e. only include the CPU family in the prefix, not microarchitecture or accelerator architecture. Since these are binary installs, we don't need multiple copies, and requiring site admins to run the install scripts once per micro-architecture is just annoying (and requires more storage)
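As an illustration of the prefix change described in this commit message, here is a rough sketch that reduces a software subdir such as `x86_64/amd/zen4` to its CPU family. The function name `host_injections_prefix` and its signature are assumptions made for this example; the actual hook code in the PR may look different.

```python
from pathlib import PurePosixPath


def host_injections_prefix(base, software_subdir):
    """Return base/<cpu family>, dropping the microarchitecture parts
    (hypothetical helper, not the hook from this PR)."""
    parts = PurePosixPath(software_subdir.strip("/")).parts
    if not parts:
        raise ValueError("empty software subdir")
    cpu_family = parts[0]  # e.g. 'x86_64' out of 'x86_64/amd/zen4'
    return str(PurePosixPath(base) / cpu_family)


print(host_injections_prefix(
    "/cvmfs/software.eessi.io/host_injections", "x86_64/amd/zen4"))
# prints /cvmfs/software.eessi.io/host_injections/x86_64
```

With this mapping, zen2 and zen4 builds resolve to the same prefix, which is the point of installing the binary CUDA toolkit only once per CPU family.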
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 |
|
New job on instance
|
…DNN package was found in the old host-injections location (with micro-arch specific subdir). Also, adapt the path to search for the regular LmodError
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 |
|
New job on instance
|
Hmmm, that's strange. This directory is writeable: |
|
Also: That's really strange, it looks like the issue I had before when the bind-mounting became the default, except: the repo is really fuse-mounted here: |
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 |
|
New job on instance
|
|
Hm, issue might have been two bot jobs trying at the same time. I cleaned out the |
|
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 |
|
New job on instance
|
|
New job on instance
|
|
Edit: fixed, thanks Caspar! 😄 |
|
Just to clarify: this is actually the staging (i.e. fetching of tarballs by Stratum 0 from the S3 bucket, and opening the staging PR). @Neves-P just informed me that the timing for the actual ingestion (i.e. the ingestion into CVMFS of the tarballs that were already staged on the Stratum 0) was: |
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 |
|
Hmmm, maybe the SURF bot is not configured for 2025.06 yet. |
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 |
|
New job on instance
|
|
@casparvl For now, we only need the hooks file in the CPU-only directories for |
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 |
|
New job on instance
|
|
Creating tarball works, but build is marked as failed, perhaps because there's nothing in software layer yet for EESSI 2025.06? In that case, we can unbreak the chicken-egg situation by building & deploying recent GCC versions via EESSI/software-layer#1146, and ignore failing CI for 2025.06 for now? |
|
Error during build triggered for So it's clear that deploying the updated hooks for 2025.06 for this PR isn't going to work out. We can handle that via EESSI/software-layer#1146 imho, so this is ready to merge? @casparvl Do you agree? |
|
EESSI/software-layer#1146 is not going to solve that error, is it? The reason the error is thrown is that this PR contains an EasyStack file that is not targeted at that repo, I guess. Solution would still be my original plan: create a PR identical to this, but strip the EasyStack files :) I'll do it now |
EESSI/software-layer#1146 would result in picking up on the updated hooks, but yes, a separate PR with only the updated hook in here would be a cleaner approach |
Ah, that's what you mean. That's true. Anyway, I created #64 |
|
New PR that replaces #64:
|
|
These changes were deployed for EESSI 2025.06 via #68, so CI should go green soon now (after we re-trigger the check)... |
Try to change the subdir in which the CUDA toolkit is installed so that it also doesn't include the CPU microarchitecture