
Conversation

@casparvl
Contributor

(created using eb --new-pr)

@casparvl
Contributor Author

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
software2.lisa.surfsara.nl - Linux debian 10.11, x86_64, Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 4 x NVIDIA NVIDIA TITAN V, 470.103.01, Python 2.7.16
See https://gist.github.com/3dfc19313bc784b062df7692d0553189 for a full test report.

@casparvl
Contributor Author

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
software2.lisa.surfsara.nl - Linux debian 10.11, x86_64, Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 4 x NVIDIA NVIDIA TITAN V, 470.103.01, Python 2.7.16
See https://gist.github.com/8c74762dbfb9845ba571337ea497e4e4 for a full test report.

@casparvl
Contributor Author

Ok, not sure what's going on here. The checksum is different every time:

casparl@software1:~/easyconfigs-surfsara/p/PyTorch$ eblocalinstall PyTorch-1.11.0-foss-2021a-CUDA-11.3.1.eb --fetch --force-download
...
casparl@software1:~/easyconfigs-surfsara/p/PyTorch$ sha256sum /home/casparl/.local/easybuild/sources/p/PyTorch/PyTorch-1.11.0.tar.gz
8937693a1c9ce14284aa815699733a9ef11ce2a2da23e64d3a5420ed7ac30bc4  /home/casparl/.local/easybuild/sources/p/PyTorch/PyTorch-1.11.0.tar.gz
casparl@software1:~/easyconfigs-surfsara/p/PyTorch$ eblocalinstall PyTorch-1.11.0-foss-2021a-CUDA-11.3.1.eb --fetch --force-download
...
casparl@software1:~/easyconfigs-surfsara/p/PyTorch$ sha256sum /home/casparl/.local/easybuild/sources/p/PyTorch/PyTorch-1.11.0.tar.gz
16fe189e4aa59882ee1c929ab66bf86541050fb2cbd1d277c28a62b651f59d89  /home/casparl/.local/easybuild/sources/p/PyTorch/PyTorch-1.11.0.tar.gz

@casparvl
Contributor Author

I see that the sources are obtained like this:

== 2022-03-18 18:04:52,612 run.py:623 INFO cmd "git clone --depth 1 --branch v1.11.0 --recursive https://github.com/pytorch/pytorch.git" exited with exit code 0 and output:
...
== 2022-03-18 18:04:52,613 run.py:233 INFO running cmd: git describe --exact-match --tags HEAD
== 2022-03-18 18:04:52,636 run.py:233 INFO running cmd: tar cfvz /home/casparl/.local/easybuild/sources/p/PyTorch/PyTorch-1.11.0.tar.gz --exclude .git pytorch
== 2022-03-18 18:05:08,410 run.py:623 INFO cmd "tar cfvz /home/casparl/.local/easybuild/sources/p/PyTorch/PyTorch-1.11.0.tar.gz --exclude .git pytorch"

Is creating that tarball somehow not deterministic...?
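Most likely yes: the tarball bakes in the mtimes from the fresh clone (and the gzip layer adds its own timestamp on top), so every clone-and-tar run yields different bytes and hence a different checksum. A minimal Python sketch of what a reproducible archive of the cloned pytorch directory could look like, just to illustrate the point (not what EasyBuild actually does):

import tarfile

# Normalize the metadata that varies between clones; recent Python 3 versions
# also add directory entries in sorted order, so the resulting archive is
# byte-for-byte reproducible.
def reset_metadata(info):
    info.mtime = 0                      # fixed timestamp instead of clone time
    info.uid = info.gid = 0             # fixed ownership
    info.uname = info.gname = 'root'
    return info

# 'w' (uncompressed) on purpose: a gzip header would embed a timestamp again
with tarfile.open('PyTorch-1.11.0.tar', 'w') as tar:
    tar.add('pytorch', arcname='pytorch', filter=reset_metadata)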

… Since it is the result of a git clone, and then tarring the result, the checksum is not reproducible.
@casparvl casparvl added this to the next release (4.5.4?) milestone Mar 18, 2022
@casparvl
Contributor Author

@boegelbot please test @ generoso

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=15137 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_15137 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8282

Test results coming soon (I hope)...

- notification for comment with ID 1072637802 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/b926975d2d8cbdb883e10c84d1b18dd9 for a full test report.

@casparvl
Contributor Author

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gcn1 - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 3.6.8
See https://gist.github.com/01e420025972aa5a2f1f27eb773b2469 for a full test report.

@casparvl
Contributor Author

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
software2.lisa.surfsara.nl - Linux debian 10.11, x86_64, Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 4 x NVIDIA NVIDIA TITAN V, 470.103.01, Python 2.7.16
See https://gist.github.com/881db4ec46e9d66379f93172ff7e0f76 for a full test report.

@casparvl
Contributor Author

casparvl commented Mar 21, 2022

The following tests fail:

distributed/test_c10d_gloo failed!
distributed/_shard/sharded_optim/test_sharded_optim failed!
distributed/_shard/sharded_tensor/ops/test_linear failed!
distributed/_shard/sharded_tensor/test_megatron_prototype failed!
distributed/rpc/test_tensorpipe_agent failed!
test_jit_cuda_fuser failed!
test_jit_fuser_te failed!
test_model_dump failed!
test_stateless failed!

distributed/test_c10d_gloo

Something times out here, but I can't reproduce using

python distributed/test_c10d_gloo.py -v ProcessGroupGlooTest.test_allreduce_coalesced_basics

I can, however, reproduce it when running

python run_test.py --verbose -i distributed/test_c10d_gloo

Potential cause: a too-short timeout, which is set in test/distributed/test_c10d_gloo.py at line 219. After increasing that timeout to 500 (which is probably way too high), I can now successfully complete the test suite. I'll increase it to 50, which should be enough.
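For context, a hedged sketch of the same knob via the public API, since the in-test value is what actually gets patched here (the file-store path, ranks and world size below are made up for illustration):

import datetime
import torch.distributed as dist

# A too-short rendezvous timeout makes ProcessGroupGloo construction fail with
# "Wait timeout" before all peers have checked in; with the public API the
# equivalent knob is the 'timeout' argument (default is 30 minutes).
dist.init_process_group(
    backend='gloo',
    init_method='file:///tmp/gloo_rendezvous',   # hypothetical shared file store
    rank=0,
    world_size=2,                                # a second process with rank=1 must join
    timeout=datetime.timedelta(seconds=50),      # generous, like the bumped test value
)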

test_c10d_gloo error

======================================================================
ERROR: test_allreduce_coalesced_basics (__main__.ProcessGroupGlooTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-44s6beao/tmpxz_jta8y/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 484, in wrapper
    self._join_processes(fn)
  File "/scratch-shared/casparl/eb-44s6beao/tmpxz_jta8y/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 703, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-shared/casparl/eb-44s6beao/tmpxz_jta8y/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 748, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-44s6beao/tmpxz_jta8y/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 601, in run_test
    getattr(self, test_name)()
  File "/scratch-shared/casparl/eb-44s6beao/tmpxz_jta8y/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 486, in wrapper
    fn()
  File "/scratch-shared/casparl/eb-44s6beao/tmpxz_jta8y/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 3098, in wrapper
    return func(*args, **kwargs)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_gloo.py", line 608, in test_allreduce_coalesced_basics
    self._test_allreduce_coalesced_basics(lambda t: t.clone())
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_gloo.py", line 590, in _test_allreduce_coalesced_basics
    pg = self._create_process_group_gloo(
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_gloo.py", line 209, in _create_process_group_gloo
    pg = c10d.ProcessGroupGloo(store, self.rank, self.world_size, opts)
RuntimeError: Wait timeout
Exception raised from wait at ../torch/csrc/distributed/c10d/FileStore.cpp:452 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x5f (0x1519cdacd98f in /gpfs/scratch1/shared/casparl/PyTorch
/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xd9 (0x1519cdaabdc6 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/
torch/lib/libc10.so)
frame #2: <unknown function> + 0x3380512 (0x1519d8b22512 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_cpu.so)
frame #3: gloo::rendezvous::PrefixStore::wait(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::c
har_traits<char>, std::allocator<char> > > > const&, std::chrono::duration<long, std::ratio<1l, 1000l> > const&) + 0x111 (0x1519da22f611 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-202
1a-CUDA-11.3.1/pytorch/torch/lib/libtorch_cpu.so)
frame #4: gloo::rendezvous::Context::connectFullMesh(gloo::rendezvous::Store&, std::shared_ptr<gloo::transport::Device>&) + 0x149c (0x1519da22d24c in /gpfs/scratch1/shared/casparl/PyTorch/1.11.
0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupGloo::ProcessGroupGloo(c10::intrusive_ptr<c10d::Store, c10::detail::intrusive_target_default_null_type<c10d::Store> > const&, int, int, c10::intrusive_ptr<c10d::Proc
essGroupGloo::Options, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupGloo::Options> >) + 0x469 (0x1519d8b3c349 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUD
A-11.3.1/pytorch/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x9747b9 (0x1519de58f7b9 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x350c61 (0x1519ddf6bc61 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #13: <unknown function> + 0x34e409 (0x1519ddf69409 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_python.so)
frame #56: __libc_start_main + 0xf3 (0x151a175d0493 in /lib64/libc.so.6)
frame #57: _start + 0x2e (0x400d2e in /sw/arch/Centos8/EB_production/2021/software/Python/3.9.5-GCCcore-10.3.0/bin/python)

[patched] distributed/_shard/sharded_optim/test_sharded_optim
Fixed by https://github.com/pytorch/pytorch/pull/73309/files

test_sharded_optim error
======================================================================
ERROR: test_named_params_with_sharded_tensor (__main__.TestShardedOptimizer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 484, in wrapper
    self._join_processes(fn)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 703, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 748, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 601, in run_test
    getattr(self, test_name)()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 486, in wrapper
    fn()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 92, in wrapper
    func(self)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 131, in wrapper
    return func(*args, **kwargs)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 3098, in wrapper
    return func(*args, **kwargs)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/_shard/sharded_optim/test_sharded_optim.py", line 162, in test_named_params_with_sharded_tensor
    sharded_model = MyShardedModel(spec=rowwise_spec).cuda(self.rank)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/_shard/sharded_optim/test_sharded_optim.py", line 35, in __init__
    self.sharded_param = sharded_tensor.rand(spec, 20, 10, requires_grad=True, process_group=group)
NameError: name 'sharded_tensor' is not defined
...
======================================================================
ERROR: test_sharded_optim (__main__.TestShardedOptimizer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 484, in wrapper
    self._join_processes(fn)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 703, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 748, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 601, in run_test
    getattr(self, test_name)()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 486, in wrapper
    fn()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 92, in wrapper
    func(self)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 131, in wrapper
    return func(*args, **kwargs)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 3098, in wrapper
    return func(*args, **kwargs)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/_shard/sharded_optim/test_sharded_optim.py", line 103, in test_sharded_optim
    sharded_model = MyShardedModel(spec=rowwise_spec).cuda(self.rank)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/_shard/sharded_optim/test_sharded_optim.py", line 35, in __init__
    self.sharded_param = sharded_tensor.rand(spec, 20, 10, requires_grad=True, process_group=group)
NameError: name 'sharded_tensor' is not defined



----------------------------------------------------------------------
Ran 2 tests in 5.851s

FAILED (errors=2)

[patched] distributed/_shard/sharded_tensor/ops/test_linear
This seems to be a TensorFloat32 issue; the difference criterion of the test is probably too tight.
Update 07-04-2022: confirmed to be a TF32 issue, since it disappears when run with export NVIDIA_TF32_OVERRIDE=1.
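A hedged way to check the TF32 hypothesis from Python rather than via the environment variable, using the documented torch.backends switches (tensor sizes are arbitrary):

import torch

# On Ampere GPUs, PyTorch 1.11 runs float32 matmuls in TF32 by default, which
# trades roughly 1e-3 relative precision for speed; that easily exceeds the
# test's 1e-05 / 1.3e-06 tolerances when comparing sharded vs. local results.
a = torch.randn(1024, 1024, device='cuda')
b = torch.randn(1024, 1024, device='cuda')

torch.backends.cuda.matmul.allow_tf32 = False   # force full FP32 matmuls
full = a @ b

torch.backends.cuda.matmul.allow_tf32 = True    # back to the 1.11 default
tf32 = a @ b

print((full - tf32).abs().max())                # typically around 1e-3, well above 1e-5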

test_linear error
======================================================================
ERROR: test_sharded_linear_colwise (__main__.TestShardedTensorOpsLinear)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 484, in wrapper
    self._join_processes(fn)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 703, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 748, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 601, in run_test
    getattr(self, test_name)()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 486, in wrapper
    fn()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 92, in wrapper
    func(self)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 131, in wrapper
    return func(*args, **kwargs)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 3098, in wrapper
    return func(*args, **kwargs)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/_shard/sharded_tensor/ops/test_linear.py", line 147, in test_sharded_linear_colwise
    self._run_sharded_linear(spec, [2, 17], [17, 12], 0)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/_shard/sharded_tensor/ops/test_linear.py", line 80, in _run_sharded_linear
    self.assertEqual(local_output, sharded_output)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2121, in assertEqual
    assert_equal(
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1080, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 19 / 24 (79.2%)
Greatest absolute difference: 0.00015527009963989258 at index (1, 2) (up to 1e-05 allowed)
Greatest relative difference: 0.003451630071531454 at index (1, 8) (up to 1.3e-06 allowed)

...

======================================================================
ERROR: test_sharded_linear_rowwise (__main__.TestShardedTensorOpsLinear)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 484, in wrapper
    self._join_processes(fn)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 703, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 748, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 601, in run_test
    getattr(self, test_name)()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 486, in wrapper
    fn()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 92, in wrapper
    func(self)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 131, in wrapper
    return func(*args, **kwargs)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 3098, in wrapper
    return func(*args, **kwargs)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/_shard/sharded_tensor/ops/test_linear.py", line 173, in test_sharded_linear_rowwise
    self._run_sharded_linear(spec, [5, 19], [19, 11], 1)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/_shard/sharded_tensor/ops/test_linear.py", line 80, in _run_sharded_linear
    self.assertEqual(local_output, sharded_output)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2121, in assertEqual
    assert_equal(
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1080, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 49 / 55 (89.1%)
Greatest absolute difference: 0.00022366642951965332 at index (2, 10) (up to 1e-05 allowed)
Greatest relative difference: 0.007865325821188743 at index (0, 2) (up to 1.3e-06 allowed)

[patched] distributed/_shard/sharded_tensor/test_megatron_prototype
This seems to be a TensorFloat32 issue; the difference criterion of the test is probably too tight.
Update 07-04-2022: confirmed to be a TF32 issue, since it disappears when run with export NVIDIA_TF32_OVERRIDE=1.

test_megatron_prototype error
======================================================================
ERROR: test_megatron_two_layer_prototype (__main__.TestShardedTensorMegatronLinear)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 484, in wrapper
    self._join_processes(fn)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 703, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 748, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 601, in run_test
    getattr(self, test_name)()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 486, in wrapper
    fn()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 92, in wrapper
    func(self)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 131, in wrapper
    return func(*args, **kwargs)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 3098, in wrapper
    return func(*args, **kwargs)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/_shard/sharded_tensor/test_megatron_prototype.py", line 217, in test_megatron_two_layer_prototype
    self._run_megatron_linear(spec, [22, 17], [[17, 12], [12, 29]])
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/_shard/sharded_tensor/test_megatron_prototype.py", line 116, in _run_megatron_linear
    self.assertEqual(local_output, sharded_output)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2121, in assertEqual
    assert_equal(
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1080, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 419 / 638 (65.7%)
Greatest absolute difference: 0.00010281801223754883 at index (1, 0) (up to 1e-05 allowed)
Greatest relative difference: 0.05141183701507647 at index (2, 20) (up to 1.3e-06 allowed)

distributed/rpc/test_tensorpipe_agent
I think the problem in this test is that it tries to set the timeout to a rather short value (500 ms), and then rpc.init_rpc doesn't complete in that timeframe. That's not what this test is supposed to check; it is simply supposed to check whether the timeout can be set to something else. Maybe we should try timeout = 5 on line 4863 of pytorch/torch/testing/_internal/distributed/rpc/rpc_test.py.
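For reference, a hedged sketch of the public timeout knob involved (the worker name, ranks and file-store path are made up for illustration; the real change would be in the test file mentioned above):

import torch.distributed.rpc as rpc

# rpc_timeout is in seconds; per the description above the test uses 0.5 s,
# and if the rendezvous inside init_rpc doesn't finish within that window the
# test fails before its actual assertion is reached. Something like 5 s still
# checks that the default timeout can be changed, with far less flakiness.
options = rpc.TensorPipeRpcBackendOptions(
    init_method='file:///tmp/rpc_rendezvous',   # hypothetical shared file store
    rpc_timeout=5,
)
rpc.init_rpc('worker0', rank=0, world_size=2, rpc_backend_options=options)  # peers for the other ranks must join
rpc.shutdown()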

Update 07-04-2022: I can't reproduce this failure when running just

python -m unittest distributed.rpc.test_tensorpipe_agent.TensorPipeTensorPipeAgentRpcTest.test_tensorpipe_set_default_timeout -v

Maybe it was just a hiccup...?

test_tensorpipe_agent error
======================================================================
ERROR: test_tensorpipe_set_default_timeout (__main__.TensorPipeTensorPipeAgentRpcTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 484, in wrapper
    self._join_processes(fn)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 703, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 748, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 601, in run_test
    getattr(self, test_name)()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 486, in wrapper
    fn()
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/dist_utils.py", line 79, in new_test_method
    return_value = old_test_method(self, *arg, **kwargs)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 4869, in test_tensorpipe_set_default_timeout
    rpc.init_rpc(
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 190, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 224, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 97, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 267, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 105, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Timeout waiting for key: rpc_prefix_0/0/rank_0 after 500 ms
Exception raised from get at ../torch/csrc/distributed/c10d/FileStore.cpp:362 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x5f (0x149ec899e98f in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0x149ec897cc89 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libc10.so)
frame #2: c10d::FileStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x9bb (0x149ed39f6b7b in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_cpu.so)
frame #3: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x2f (0x149ed39f921f in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x33a9138 (0x149ed3a1c138 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_cpu.so)
frame #5: gloo::rendezvous::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x2e3 (0x149ed51003c3 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_cpu.so)
frame #6: gloo::rendezvous::Context::connectFullMesh(gloo::rendezvous::Store&, std::shared_ptr<gloo::transport::Device>&) + 0x55a (0x149ed50fd30a in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupGloo::ProcessGroupGloo(c10::intrusive_ptr<c10d::Store, c10::detail::intrusive_target_default_null_type<c10d::Store> > const&, int, int, c10::intrusive_ptr<c10d::ProcessGroupGloo::Options, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupGloo::Options> >) + 0x469 (0x149ed3a0d349 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x9721ce (0x149ed945e1ce in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x350c61 (0x149ed8e3cc61 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #15: <unknown function> + 0x34e409 (0x149ed8e3a409 in /gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/torch/lib/libtorch_python.so)

test_jit_cuda_fuser

test_jit_cuda_fuser error
======================================================================
ERROR: test_native_layer_norm_bfloat (__main__.TestCudaFuser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 1754, in wrapper
    method(*args, **kwargs)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_jit_cuda_fuser.py", line 1323, in test_native_layer_norm_bfloat
    self._native_layer_norm_helper(input_shape, norm_shape, torch.bfloat16, "cuda", 1e-1)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_jit_cuda_fuser.py", line 1272, in _native_layer_norm_helper
    jit_o, jit_mean, jit_rstd = t_jit(x)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 430, in prof_meth_call
    return prof_callable(meth_call, *args, **kwargs)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 424, in prof_callable
    return callable(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: input_extent.value() % split_factor.value() == 0INTERNAL ASSERT FAILED at "../torch/csrc/jit/codegen/cuda/executor_utils.cpp":428, please report a bug to PyTorch. Non-divisible split with vectorization is detected. Extent: 975. Factor: 2
======================================================================
FAIL: test_batch_norm_half (__main__.TestCudaFuser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 1754, in wrapper
    method(*args, **kwargs)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_jit_cuda_fuser.py", line 2840, in test_batch_norm_half
    self._test_batch_norm_impl_index_helper(4, 8, 5, affine, track_running_stats, training, torch.half)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_jit_cuda_fuser.py", line 2792, in _test_batch_norm_impl_index_helper
    self.assertGraphContainsExactly(t_jit.graph_for(x), FUSION_GUARD, 1, consider_subgraphs=True)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/jit_utils.py", line 324, in assertGraphContainsExactly
    perform_assert(graph, kind, count, num_kind_nodes,
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/jit_utils.py", line 317, in perform_assert
    raise AssertionError(
AssertionError: graph(%self : __torch__.MyModule,
      %x.1 : Tensor):
  %2 : float = prim::Constant[value=1.0000000000000001e-05]() # /scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py:179:12
  %3 : float = prim::Constant[value=0.10000000000000001]()
...
  %o.6 : Tensor = aten::batch_norm(%o.10, %weight, %bias, %running_mean, %running_var, %training.1, %3, %2, %11) # /scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/nn/functional.py:2421:11
  return (%o.6)

Error: graph contains 0 prim::CudaFusionGuard nodes (including subgraphs) but expected 1
======================================================================
FAIL: test_batch_norm_impl_index_correctness (__main__.TestCudaFuser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 1754, in wrapper
    method(*args, **kwargs)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_jit_cuda_fuser.py", line 2865, in test_batch_norm_impl_index_correctness
    self._test_batch_norm_impl_index_helper(b, c, hw, affine, track_running_stats, training)
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_jit_cuda_fuser.py", line 2792, in _test_batch_norm_impl_index_helper
    self.assertGraphContainsExactly(t_jit.graph_for(x), FUSION_GUARD, 1, consider_subgraphs=True)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/jit_utils.py", line 324, in assertGraphContainsExactly
    perform_assert(graph, kind, count, num_kind_nodes,
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/jit_utils.py", line 317, in perform_assert
    raise AssertionError(
AssertionError: graph(%self : __torch__.MyModule,
      %x.1 : Tensor):
  %2 : float = prim::Constant[value=1.0000000000000001e-05]() # /scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py:179:12
  %3 : float = prim::Constant[value=0.10000000000000001]()
...
    block1():
      -> ()
  %o.6 : Tensor = aten::batch_norm(%o.10, %weight, %bias, %running_mean, %running_var, %training.1, %3, %2, %11) # /scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/nn/functional.py:2421:11
  return (%o.6)

Error: graph contains 0 prim::CudaFusionGuard nodes (including subgraphs) but expected 1

[patched] test_jit_fuser_te
This seems to be a TensorFloat32 issue; the difference criterion of the test is probably too tight.

Update 07-04-2022: I can't reproduce this now... The test passes now.
Update 08-04-2022: it seems this issue is only reproducible when the test suite is driven by run_test.py:

# Results in test error for test_lstm_traced
python run_test.py --verbose -i test_jit_fuser_te
# Does not result in test error:
python test_jit_fuser_te.py -v -k test_lstm_traced

The error is only slightly out of range; maybe it's because of the state of the random generator? Still... strange.

test_jit_fuser_te error
======================================================================
FAIL: test_lstm_traced (__main__.TestTEFuserDynamic)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_jit_fuser_te.py", line 959, in test_lstm_traced
    ge = self.checkTrace(LSTMCellF, inputs)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/jit_utils.py", line 621, in checkTrace
    self.assertEqual(grads, grads_ge)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2121, in assertEqual
    assert_equal(
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1080, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 6 / 30 (20.0%)
Greatest absolute difference: 2.6792287826538086e-05 at index (2, 7) (up to 1e-05 allowed)
Greatest relative difference: 0.00027116975587184925 at index (2, 2) (up to 1.3e-06 allowed)


The failure occurred for item [0]

----------------------------------------------------------------------
Ran 720 tests in 646.532s

FAILED (failures=1, skipped=549)
test_jit_fuser_te failed!

[excluded] test_model_dump
Update 07-04-2022: I can't reproduce this now... The test passes now.
Update 14-04-2022: The test seems to fail only when run as part of the full test suite by EasyBuild. I cannot reproduce the failure when running e.g. python test_model_dump.py -v in the test folder of the PyTorch build path.

test_model_dump error
======================================================================
ERROR: test_main (__main__.TestModelDump)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_model_dump.py", line 135, in test_main
    torch.utils.model_dump.main(
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/utils/model_dump/__init__.py", line 384, in main
    info = get_model_info(args.model, title=args.title)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/utils/model_dump/__init__.py", line 205, in get_model_info
    with zipfile.ZipFile(path_or_file) as zf:
  File "/sw/arch/Centos8/EB_production/2021/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "/sw/arch/Centos8/EB_production/2021/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

----------------------------------------------------------------------
Ran 9 tests in 1.575s

FAILED (errors=1, skipped=1)
test_model_dump failed!

[patched] test_stateless
This seems to be a TensorFloat32 issue; the difference criterion of the test is probably too tight.
Update 07-04-2022: confirmed to be a TF32 issue, since it disappears when run with export NVIDIA_TF32_OVERRIDE=1.

test_stateless error
======================================================================
FAIL: test_functional_call_with_data_parallel (__main__.TestStatelessFunctionalAPI)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_stateless.py", line 77, in test_functional_call_with_data_parallel
    self._run_call_with_mock_module(dp_module, device='cuda', prefix='module')
  File "/gpfs/scratch1/shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_stateless.py", line 45, in _run_call_with_mock_module
    self.assertEqual(x, res)
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2121, in assertEqual
    assert_equal(
  File "/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1080, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 1 (100.0%)
Greatest absolute difference: 7.855892181396484e-05 at index (0, 0) (up to 1e-05 allowed)
Greatest relative difference: 0.0003024223155545113 at index (0, 0) (up to 1.3e-06 allowed)


----------------------------------------------------------------------
Ran 7 tests in 1.033s

FAILED (failures=1)
test_stateless failed!

@branfosj
Member

@casparvl
Contributor Author

Thanks @branfosj. I found the same issue independently... I should have checked for comments here first, clearly :D I created an EB patch out of it that I'll upload. 7 more failing tests to go on my end...
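A hedged sketch of how such a fix is typically carried in the easyconfig, with a hypothetical patch file name (the real one is whatever gets uploaded):

patches = [
    'PyTorch-1.11.0_fix-failing-test.patch',  # hypothetical file name for the fix mentioned above
]
checksums = [
    '<sha256 of the patch>',  # one entry per source/patch, in the same order
]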

@Micket
Contributor

Micket commented Jun 25, 2022

Test report by @Micket
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis3-14 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 515.43.04, Python 3.6.8
See https://gist.github.com/14a0bd8a910519ddb466cc69fbcfd914 for a full test report.

Micket
Micket previously approved these changes Jun 25, 2022
Contributor

@Micket Micket left a comment


lgtm

@boegel want me to wait with merging this until that easyblock update is in?
I think some people have issues, but this passes for me; I think it's as good as all the existing PyTorch easyconfigs we have.

@boegel
Member

boegel commented Jun 27, 2022

Apparently I need an updated framework to test this updated easyblock?

@Micket Hmm, no, just using the updated easyblocks from easybuilders/easybuild-easyblocks#2742 should work. Somehow the enhanced PythonPackage easyblock was not used in your test?

want me to wait with merging this until that easyblock update is in?

Yes, I think it makes sense to first merge easybuilders/easybuild-easyblocks#2742

@boegel
Member

boegel commented Jun 27, 2022

@Flamefire Somehow magma is still missing?

Missing modules for dependencies (use --robot?): magma/2.6.1-foss-2021a-CUDA-11.3.1

@Micket
Contributor

Micket commented Jun 27, 2022

Hm, well, I somehow got:

FAIL (unhandled exception: test_step() got an unexpected keyword argument 'return_output_ec')
Traceback (most recent call last):
  File "/apps/Common/software/EasyBuild/4.5.5/lib/python3.6/site-packages/easybuild/main.py", line 128, in build_and_install_software
    (ec_res['success'], app_log, err) = build_and_install_one(ec, init_env)
  File "/apps/Common/software/EasyBuild/4.5.5/lib/python3.6/site-packages/easybuild/framework/easyblock.py", line 4058, in build_and_install_one
    result = app.run_all_steps(run_test_cases=run_test_cases)
  File "/apps/Common/software/EasyBuild/4.5.5/lib/python3.6/site-packages/easybuild/framework/easyblock.py", line 3941, in run_all_steps
    self.run_step(step_name, step_methods)
  File "/apps/Common/software/EasyBuild/4.5.5/lib/python3.6/site-packages/easybuild/framework/easyblock.py", line 3776, in run_step
    step_method(self)()
  File "/apps/Common/software/EasyBuild/4.5.5/lib/python3.6/site-packages/easybuild/framework/easyblock.py", line 2588, in _test_step
    self.test_step()
  File "/local/tmp.438439/included-easyblocks-vl6f137u/easybuild/easyblocks/pytorch.py", line 259, in test_step
    (out, ec) = super(EB_PyTorch, self).test_step(return_output_ec=True)
TypeError: test_step() got an unexpected keyword argument 'return_output_ec'

(sorry formatting got messed up in the build report)

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8006 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/c6263127ca3b20277e8202880631861f for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 24 out of 25 (1 easyconfigs in total)
taurusml2 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/1742fadcae7a8e398494f53e75dcd4ed for a full test report.

@boegel
Member

boegel commented Jul 6, 2022

@Flamefire Quite a few failing tests on AMD Rome for you, and then build trouble on POWER...

Any idea if this is worth blocking this PR over?

@boegel
Member

boegel commented Jul 6, 2022

Edit: apparently I need an updated framework to test this updated easyblock?

@Micket The problem is probably caused by a custom pythonpackage.py easyblock that you have, which is picked up due to --include-easyblocks='/apps/c3se-easyblocks/*.py' in your EasyBuild configuration (and takes precedence over the modified pythonpackage.py from easybuilders/easybuild-easyblocks#2742).

@Flamefire
Contributor

@Flamefire Quite a bit of failing tests on AMD Rome for you, and then build trouble on POWER...

Any idea if this is worth blocking this PR over?

The build issue on POWER is due to CUDA 11, while the driver only supports CUDA 10. I'm working on that, so it can be ignored for now.
On Rome I'm not sure yet; I haven't caught up with everything yet.

@casparvl
Contributor Author

casparvl commented Jul 7, 2022

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2742
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.4, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/e822864141953c055a7df9ddd635466d for a full test report.

@casparvl
Contributor Author

casparvl commented Jul 7, 2022

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2742
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
software2.lisa.surfsara.nl - Linux debian 10.12, x86_64, Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 4 x NVIDIA NVIDIA TITAN V, 470.103.01, Python 3.7.3
See https://gist.github.com/34333e1da179537c29c3a8bff3287e2e for a full test report.

@casparvl
Contributor Author

casparvl commented Jul 7, 2022

Ok, ignore those two build failures; that's just me being stupid and trying to build twice in the same prefix...
Two more tests are coming up, one from a node with 4x A100, another from a CPU-only node (different system, different OS).

@easybuilders easybuilders deleted a comment from boegelbot Jul 7, 2022
@casparvl
Contributor Author

casparvl commented Jul 7, 2022

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2742
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
software1.lisa.surfsara.nl - Linux debian 10.12, x86_64, Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, Python 3.7.3
See https://gist.github.com/356fa2a33f0dbc92e23a922b8c967a8f for a full test report.

Member

@boegel boegel left a comment


Time to merge this; if any additional problems arise with the tests, we can deal with those in a follow-up PR...

@boegel
Member

boegel commented Jul 7, 2022

Going in, thanks @casparvl!

@boegel boegel merged commit cbbe3ff into easybuilders:develop Jul 7, 2022
@casparvl
Contributor Author

casparvl commented Jul 7, 2022

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2742
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gcn1 - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 515.43.04, Python 3.6.8
See https://gist.github.com/3dc4e7fcc40406b3750f2ef484db74ef for a full test report.

@boegel
Member

boegel commented Jul 8, 2022

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2742
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3303.joltik.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 510.73.08, Python 3.6.8
See https://gist.github.com/6ca0527bdff9c37eed8dca14aa6f9eff for a full test report.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Sep 22, 2022
We see spurious failures due to timeouts in `test_allreduce_coalesced_basics`, but only when running the whole test suite with
`python run_test.py --verbose -i distributed/test_c10d_gloo`. Increasing the timeout to 50s should provide enough leeway to avoid this. Note that the default for the `_timeout` is 30 minutes.

Originally reported in EasyBuild at easybuilders/easybuild-easyconfigs#15137 (comment) and patch proposed by @casparvl
Pull Request resolved: #85474
Approved by: https://github.com/rohan-varma
mehtanirav pushed a commit to pytorch/pytorch that referenced this pull request Oct 4, 2022
We see spurious failures due to timeouts in `test_allreduce_coalesced_basics`, but only when running the whole test suite with
`python run_test.py --verbose -i distributed/test_c10d_gloo`. Increasing the timeout to 50s should provide enough leeway to avoid this. Note that the default for the `_timeout` is 30 minutes.

Originally reported in EasyBuild at easybuilders/easybuild-easyconfigs#15137 (comment) and patch proposed by @casparvl
Pull Request resolved: #85474
Approved by: https://github.com/rohan-varma