{devel}[foss/2021a] PyTorch v1.11.0 w/ Python 3.9.5 w/ CUDA 11.3.1 #15137
Conversation
Test report by @casparvl

Test report by @casparvl
Ok, not sure what's going on here. The checksum is different every time:
I see that the sources are obtained like this: […] Is creating that tarball somehow not deterministic...?
Since the tarball is the result of a git clone followed by tarring the checkout, the checksum is not reproducible.
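To make that concrete, here is a minimal Python sketch (not EasyBuild framework code; the helper name and behaviour are assumptions for illustration) of why a naive archive of a fresh checkout yields a different checksum every time, and what a deterministic variant would have to normalize:

```python
import hashlib
import os
import tarfile


def tree_tarball_sha256(src_dir, out_tar, deterministic=False):
    """Archive src_dir into out_tar and return the archive's SHA256 (illustration only)."""

    def scrub(info):
        # Drop the per-clone metadata that leaks into the archive bytes.
        info.mtime = 0
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        return info

    # Uncompressed tar on purpose: gzip would add its own timestamp header,
    # which is yet another source of nondeterminism.
    with tarfile.open(out_tar, "w") as tar:
        if deterministic:
            for root, dirs, files in os.walk(src_dir):
                dirs.sort()                      # fixed traversal order
                for fname in sorted(files):      # fixed file order
                    path = os.path.join(root, fname)
                    tar.add(path, arcname=os.path.relpath(path, src_dir), filter=scrub)
        else:
            # Effectively what "git clone && tar" does: checkout mtimes,
            # uid/gid and directory iteration order all end up in the bytes.
            tar.add(src_dir, arcname=os.path.basename(src_dir))

    with open(out_tar, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()
```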
@boegelbot please test @ generoso
@casparvl: Request for testing this PR well received on login1
PR test command '[…]'
Test results coming soon (I hope)...
- notification for comment with ID 1072637802 processed
Message to humans: this is just bookkeeping information for me, […]
Test report by @boegelbot

Test report by @casparvl

Test report by @casparvl
The following tests fail:

- `distributed/test_c10d_gloo`: Something times out here, but I can't reproduce using […]. I can, however, reproduce when running with […]. Potential cause: too short timeout, which is set by […]. (test_c10d_gloo error)
- [patched] `distributed/_shard/sharded_optim/test_sharded_optim` (test_shared_optim error)
- [patched] `distributed/_shard/sharded_tensor/ops/test_linear`: failed! (test_linear error)
- [patched] `distributed/_shard/sharded_tensor/test_megatron_prototype` (test_megatron_prototype error)
- `distributed/rpc/test_tensorpipe_agent`: Update 07-04-2022: I can't reproduce this failure when running just […]. Maybe it was just a hiccup...? (test_tensorpipe_agent error)
- `test_jit_cuda_fuser` (test_jit_cuda_fuser error)
- [patched] `test_jit_fuser_te`: Update 07-04-2022: I can't reproduce this now... The test passes now. The error is only slightly out of range; maybe it's because of the state of the random generator? Still... strange. (test_jit_fuser_te error)
- [excluded] `test_model_dump` (test_model_dump error)
- [patched] `test_stateless` (test_stateless error)
For […]
Thanks @branfosj. I found the same issue independently... should have checked the comments here first, clearly :D I created an EB patch out of it that I'll upload. 7 more failing tests to go on my end...
…s that support TensorFloat32 datatypes
Test report by @Micket
lgtm
@boegel Do you want me to wait with merging this until that easyblock update is in?
I think some people have issues, but this passes for me; I think it's as good as all the existing PyTorches we have.
@Micket Hmm, no, just using the updated easyblocks from easybuilders/easybuild-easyblocks#2742 should work. Somehow the enhanced […]
Yes, I think it makes sense to first merge easybuilders/easybuild-easyblocks#2742
@Flamefire Somehow […]
Hm, well, I somehow got […] (sorry, the formatting got messed up in the build report)
Test report by @Flamefire

Test report by @Flamefire
@Flamefire Quite a few failing tests on AMD Rome for you, and then build trouble on POWER... Any idea whether this is worth blocking this PR over?
@Micket The problem is probably caused by a custom […]
The build issue on POWER is due to CUDA 11 while the driver only supports CUDA 10. Working on that, so it can be ignored for now.
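As a quick sanity check for such a driver mismatch, one can ask PyTorch itself which CUDA toolkit it was built against and whether the driver can actually serve it (illustrative snippet, not part of this PR):

```python
import torch

# CUDA toolkit version PyTorch was compiled against (e.g. '11.3')
print("built with CUDA:", torch.version.cuda)

# False when no GPU is visible *or* when the installed driver is too old
# for the toolkit above -- the situation described for the POWER node.
print("CUDA available:", torch.cuda.is_available())
```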
easybuild/easyconfigs/p/PyTorch/PyTorch-1.11.0-foss-2021a-CUDA-11.3.1.eb
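For readers unfamiliar with the file under review, a minimal sketch of the general shape of such an easyconfig. Everything below is illustrative (the real easyconfig carries many more patches, dependencies and test-related settings); the `git_config` block shows the clone-based source fetch discussed above:

```python
# Illustrative easyconfig sketch (easyconfigs use Python syntax);
# name/version mirror the PR title, the rest is assumed for illustration.
name = 'PyTorch'
version = '1.11.0'
versionsuffix = '-CUDA-11.3.1'

homepage = 'https://pytorch.org/'
description = "Tensors and dynamic neural networks in Python with strong GPU acceleration."

toolchain = {'name': 'foss', 'version': '2021a'}

# Sources come from a recursive git clone that is tarred up afterwards,
# which is why the resulting tarball has no stable checksum.
sources = [{
    'filename': '%(name)s-%(version)s.tar.gz',
    'git_config': {
        'url': 'https://github.com/pytorch',
        'repo_name': 'pytorch',
        'tag': 'v%(version)s',
        'recursive': True,
    },
}]

patches = [
    # e.g. the c10d_gloo timeout bump and the sharded-tensor test fixes
    # discussed in this thread would be listed here
]

dependencies = [
    ('CUDA', '11.3.1', '', SYSTEM),
    ('Python', '3.9.5'),
]

moduleclass = 'devel'
```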
Test report by @casparvl

Test report by @casparvl
Ok, ignore those two build failures; that's just me being stupid and trying to build twice in the same prefix...
Test report by @casparvl
Time to merge this; if any additional problems arise with the tests, we can deal with those in a follow-up PR...
Going in, thanks @casparvl!
Test report by @casparvl

Test report by @boegel
We see spurious failures due to timeouts in `test_allreduce_coalesced_basics`, but only when running the whole test suite with `python run_test.py --verbose -i distributed/test_c10d_gloo`. Increasing the timeout to 50s should provide enough leeway to avoid this. Note that the default for the `_timeout` is 30 minutes. Originally reported in EasyBuild at easybuilders/easybuild-easyconfigs#15137 (comment) and patch proposed by @casparvl. Pull Request resolved: #85474 Approved by: https://github.com/rohan-varma
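For context, a hedged sketch of the effect of that change, expressed with PyTorch's public API (the actual patch edits the private options helper inside `distributed/test_c10d_gloo`; the rendezvous address below is made up for illustration):

```python
from datetime import timedelta

import torch.distributed as dist


def init_gloo_with_generous_timeout(rank, world_size):
    # torch.distributed's default timeout is 30 minutes; the gloo tests
    # override it with a much shorter per-test value, which is what made
    # test_allreduce_coalesced_basics flaky when the whole suite runs.
    dist.init_process_group(
        backend="gloo",
        rank=rank,
        world_size=world_size,
        init_method="tcp://127.0.0.1:29500",   # hypothetical rendezvous address
        timeout=timedelta(seconds=50),         # the 50s leeway the patch introduces
    )
```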
(created using `eb --new-pr`)