Conversation


@tetron tetron commented Dec 22, 2021

Express that a CommandLineTool requires NVIDIA CUDA GPU support. Includes support for Docker and Singularity. Supports requesting multiple GPUs, but does not currently attempt to allocate GPUs among multiple concurrent processes.

Example command line tool:

cwlVersion: v1.2
class: CommandLineTool
baseCommand: nvidia-smi
inputs: []
outputs: []
$namespaces:
  cwltool: http://commonwl.org/cwltool#
hints:
  cwltool:CUDARequirement:
    cudaVersionMin: "11.4"
    cudaComputeCapabilityMin: "3.0"
    deviceCountMin: 1
    deviceCountMax: 8
  DockerRequirement:
    dockerPull: nvidia/cuda:11.4.0-base-ubuntu20.04


codecov bot commented Dec 22, 2021

Codecov Report

Merging #1581 (6e4f428) into main (62fe629) will decrease coverage by 0.22%.
The diff coverage is 39.21%.


@@            Coverage Diff             @@
##             main    #1581      +/-   ##
==========================================
- Coverage   66.12%   65.90%   -0.22%     
==========================================
  Files          91       93       +2     
  Lines       16345    16447     +102     
  Branches     4344     4358      +14     
==========================================
+ Hits        10808    10840      +32     
- Misses       4391     4455      +64     
- Partials     1146     1152       +6     
Impacted Files Coverage Δ
cwltool/process.py 86.08% <ø> (ø)
cwltool/job.py 76.37% <33.33%> (-0.92%) ⬇️
cwltool/cuda.py 35.29% <35.29%> (ø)
cwltool/docker.py 68.37% <40.00%> (-0.62%) ⬇️
cwltool/singularity.py 58.60% <50.00%> (-0.17%) ⬇️
cwltool/main.py 66.97% <100.00%> (+0.08%) ⬆️
job.py 61.53% <0.00%> (-0.34%) ⬇️
docker.py 49.57% <0.00%> (-0.21%) ⬇️
singularity.py 52.55% <0.00%> (-0.05%) ⬇️
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tetron tetron marked this pull request as ready for review December 22, 2021 23:20

@mr-c mr-c left a comment


Huzzah! Can we have some tests? I'll try to find a CUDA box someplace to run this.


@mr-c mr-c left a comment


The singularity version of the tests passed for me on the BAZIS cluster: https://libguides.vu.nl/rdm/high-performance-computing


tetron commented Jan 3, 2022

> Maybe add a comment that nvidia-smi returns non-zero if there is no GPU detected?

I'm not entirely sure what it does, but it seems reasonable to assume that either it will exit non-zero or it will return XML with <attached_gpus>0</attached_gpus>, and in either case it will be detected as CUDA not available.
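A detection routine along those lines could parse the XML from `nvidia-smi -q -x` and treat a non-zero exit, a missing binary, or zero attached GPUs as "CUDA not available". This is a sketch under those assumptions, not the actual cwltool implementation; the function names are hypothetical:

```python
import subprocess
import xml.etree.ElementTree as ET


def cuda_device_count(xml_text: str) -> int:
    """Parse `nvidia-smi -q -x` output and return the attached GPU count.

    Returns 0 (CUDA unavailable) on malformed XML or a missing
    <attached_gpus> element.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return 0
    node = root.find("attached_gpus")
    if node is None or node.text is None:
        return 0
    return int(node.text)


def cuda_available() -> bool:
    """Run nvidia-smi; non-zero exit or zero attached GPUs means no CUDA."""
    try:
        proc = subprocess.run(
            ["nvidia-smi", "-q", "-x"], capture_output=True, text=True
        )
    except FileNotFoundError:
        # nvidia-smi not on PATH at all
        return False
    if proc.returncode != 0:
        return False
    return cuda_device_count(proc.stdout) > 0
```

Either failure mode (non-zero exit or `<attached_gpus>0</attached_gpus>`) funnels into the same "unavailable" result, matching the assumption above.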


lgtm-com bot commented Jan 3, 2022

This pull request introduces 1 alert when merging 06992d9 into 62fe629 - view on LGTM.com

new alerts:

  • 1 for Unused import


mr-c commented Jan 3, 2022

> Maybe add a comment that nvidia-smi returns non-zero if there is no GPU detected?
>
> I'm not entirely sure what it does, but it seems reasonable to assume that either it will exit non-zero or it will return XML with <attached_gpus>0</attached_gpus>, and in either case it will be detected as CUDA not available.

Yes. I was referring to the nvidia-smi invocation within the tests themselves: https://github.com/common-workflow-language/cwltool/pull/1581/files#diff-3dac8debea16928fd49923af94fa8e1af78788a1c44d86eb3b6112bbc7516993R11


tetron commented Jan 3, 2022

Oh, I see what you mean. Yes, I'm trying to confirm what it does if it has the libraries but doesn't find any devices.


tetron commented Jan 3, 2022

It's annoying because it turns out nvidia-smi is one of the things that gets injected from the host system into the container, so the standard CUDA runtime container doesn't actually have it.

@tetron tetron enabled auto-merge (squash) January 3, 2022 18:55

lgtm-com bot commented Jan 3, 2022

This pull request introduces 1 alert when merging 6e4f428 into 62fe629 - view on LGTM.com

new alerts:

  • 1 for Unused import


mr-c commented Aug 29, 2022

FYI, the proposal here was updated in #1629

The differences are:

  1. cudaComputeCapabilityMin is now cudaComputeCapability: a single value is a minimum capability, or if a list of values is specified then only GPUs with compute capabilities in the list should be considered.
  2. deviceCountMin was renamed cudaDeviceCountMin and can now be an expression as well as an int; likewise deviceCountMax became cudaDeviceCountMax.
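Under the renamed fields from #1629, the hint from the example tool above might be written like this (illustrative only; the list form and the `$(inputs.gpus)` expression assume a hypothetical `gpus` input parameter):

```yaml
hints:
  cwltool:CUDARequirement:
    cudaVersionMin: "11.4"
    cudaComputeCapability: ["3.0", "3.5"]  # list form: only these capabilities qualify
    cudaDeviceCountMin: $(inputs.gpus)     # expressions are now allowed
    cudaDeviceCountMax: 8
```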
