@Flamefire
Contributor

@Flamefire Flamefire commented Sep 24, 2021

(created using eb --new-pr)

fixes #2577, fixes easybuilders/easybuild-easyconfigs#14120

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.4.1-foss-2020b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8028 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/7695dcc6671f0e18cba427cd975be612 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8029 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/116468f218bd966c105b57c10100b2b5 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8012 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/d1f6d771007592bad70ffb7f04174cf7 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.4.1-foss-2020b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/7cb754a88b326bdc7bf9ca9fa278efd2 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.6.0-foss-2021a-CUDA-11.3.1.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/57cb680dc34cf746f429ed91ee3185a7 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS h5py-3.1.0-fosscuda-2020b.eb
  • SUCCESS TensorFlow-2.5.0-fosscuda-2020b.eb

Build succeeded for 2 out of 2 (1 easyconfigs in total)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/b0791e3a9b5d45149e3e50472c3f9e68 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS UnZip-6.0-GCCcore-8.3.0.eb
  • SUCCESS TensorFlow-2.4.1-fosscuda-2019b-Python-3.7.4.eb

Build succeeded for 2 out of 2 (1 easyconfigs in total)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/161ab61dee8f5bf2204c7bedce4551b3 for a full test report.

@boegel boegel added this to the next release (4.5.0?) milestone Sep 29, 2021
@boegel boegel changed the title from "Don't use --config=mkl for TF 2.4+" to "don't use --config=mkl for TensorFlow 2.4+" Sep 29, 2021
@Flamefire
Contributor Author

More or less blocked by tensorflow/tensorflow#52151

@boegel
Member

boegel commented Oct 13, 2021

I consider this change important enough to not let it be blocked by the failing mkl_fused_batch_norm_op_test; we can (temporarily) add that as an ignored test so we can go ahead and merge this PR.

Especially since this apparently fixes two performance issues: the threading oversubscription for CPU-only TensorFlow reported in #2577, and also the issue of tf.matmul preferring MKL on the CPU over the GPU, reported in easybuilders/easybuild-easyconfigs#14120...
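
For readers unfamiliar with the change, the version gate named in the PR title can be sketched roughly as follows. This is a minimal illustration, not the actual easyblock code: the function names `parse_version` and `bazel_config_opts`, the `want_mkl` parameter, and the assumption that `--config=mkl` is still passed for versions older than 2.4 are all made up for this example. Only the 2.4 threshold and the flag itself come from the PR title.

```python
def parse_version(version):
    """Turn a version string like '2.4.1' into a tuple of ints: (2, 4, 1)."""
    return tuple(int(part) for part in version.split('.'))


def bazel_config_opts(tf_version, want_mkl=True):
    """Return extra Bazel options for a hypothetical TensorFlow build.

    Assumption for this sketch: --config=mkl is only passed for
    TensorFlow versions older than 2.4.
    """
    opts = []
    if want_mkl and parse_version(tf_version) < (2, 4):
        opts.append('--config=mkl')
    return opts


# Versions taken from the test reports in this thread.
for version in ('2.2.3', '2.4.1', '2.6.0'):
    print(version, bazel_config_opts(version))
```

Comparing integer tuples avoids the pitfalls of comparing version strings lexically (for example, '2.10' sorting before '2.4').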

@boegel
Member

boegel commented Oct 14, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.5.0-foss-2020b.eb
  • SUCCESS TensorFlow-2.6.0-foss-2021a.eb
  • SUCCESS TensorFlow-2.4.1-foss-2020b.eb

Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3139.skitty.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/294c57503da360ddded7dfc4d42c75d4 for a full test report.

@boegel
Member

boegel commented Oct 14, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

Build succeeded for 1 out of 3 (3 easyconfigs in total)
node3520.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/3469ae5c80fe0b2c05f376841e47dee0 for a full test report.

@boegel
Member

boegel commented Oct 14, 2021

Tests are failing on systems with AMD CPUs due to tensorflow/tensorflow#52151; I'll open a PR to filter out that broken test so we can proceed here...

@akesandgren
Contributor

akesandgren commented Oct 14, 2021

I get these two failing for TensorFlow-2.6.0-foss-2021a-CUDA-11.3.1.eb --include-easyblocks-from-pr 2583
on our broadwell with K80, twice in a row now.
//tensorflow/core/kernels:fused_batch_norm_ex_op_test_gpu
//tensorflow/core/kernels/mlir_generated:gpu_unary_ops_test_gpu

Trying without this PR...

And my original production build of it on that system combo did not fail and that was without this PR.

Weird, I get that problem even without this PR now.

@boegel
Member

boegel commented Oct 14, 2021

@akesandgren I'd say let's open an issue on that?

@akesandgren
Contributor

akesandgren commented Oct 14, 2021

I will once I've figured out a bit more on what's happening... I think my container env wasn't fully correct for TF-CUDA when I initially built it, but now it is. It was previously lacking nvidia-smi and thus decided to skip the GPU tests. Something I only discovered today...

And at least one of the failing tests may be due to too little memory on the K80, or that multiple tests are running at the same time and stealing memory from each other...

@boegel
Member

boegel commented Oct 15, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.4.1-foss-2020b.eb
  • SUCCESS TensorFlow-2.4.1-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.4.1-fosscuda-2020b.eb

Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3309.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA NVIDIA Tesla V100-SXM2-32GB, 465.19.01, Python 3.6.8
See https://gist.github.com/cabb68c4f78eda268517b8f6709c5fb2 for a full test report.

@boegel
Member

boegel commented Oct 15, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.4.1-foss-2020b.eb
  • SUCCESS TensorFlow-2.4.1-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.4.1-fosscuda-2020b.eb

Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3309.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA NVIDIA Tesla V100-SXM2-32GB, 465.19.01, Python 3.6.8
See https://gist.github.com/02ba89107f021cc2ce39e6c8a56ffe43 for a full test report.

@boegel
Member

boegel commented Oct 15, 2021

@akesandgren Any additional updates?

I would really like to move forward with this (but we also need easybuilders/easybuild-easyconfigs#14151 and easybuilders/easybuild-easyconfigs#14153 merged first, ideally).

@akesandgren
Contributor

Yeah, most likely caused by the K80s having too little memory. It works fine on our V100s, so I'll just drop those two tests in our hooks when on K80s.
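
The idea of dropping the two failing tests only on K80 systems could look something like the sketch below. It is plain Python and deliberately does not use the EasyBuild hooks API: the function `filter_test_targets`, the `gpu_name` argument, and the extra example target are hypothetical. The two skipped target names are the ones reported earlier in this thread.

```python
# Test targets reported as failing on K80 GPUs in this thread.
K80_SKIPPED_TESTS = {
    '//tensorflow/core/kernels:fused_batch_norm_ex_op_test_gpu',
    '//tensorflow/core/kernels/mlir_generated:gpu_unary_ops_test_gpu',
}


def filter_test_targets(targets, gpu_name):
    """Drop the known-failing targets when the GPU name mentions K80.

    Both the function and the way the GPU name is obtained are
    assumptions made for this illustration.
    """
    if 'K80' not in gpu_name:
        return list(targets)
    return [target for target in targets if target not in K80_SKIPPED_TESTS]


targets = [
    '//tensorflow/core/kernels:fused_batch_norm_ex_op_test_gpu',
    '//tensorflow/core/kernels:some_other_test_gpu',  # made-up target for illustration
]
print(filter_test_targets(targets, 'Tesla K80'))
print(filter_test_targets(targets, 'Tesla V100-SXM2-32GB'))
```

In a real setup the filtered list would still have to be handed to whatever mechanism the site uses to configure the TensorFlow test step.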

@akesandgren
Contributor

So in case I was unclear: the problem only appears on K80 and appears regardless of this PR's existence.
Move forward...

@boegel
Member

boegel commented Oct 26, 2021

With both easybuilders/easybuild-easyconfigs#14151 and easybuilders/easybuild-easyconfigs#14153 merged, I don't see any reason to hold this back any further...

@boegel boegel force-pushed the 20210924164325_new_pr_tensorflow branch from 11264eb to 517129b Compare October 27, 2021 18:57
@boegel
Member

boegel commented Oct 27, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.2.3-foss-2020b.eb
  • SUCCESS TensorFlow-2.6.0-foss-2021a.eb

Build succeeded for 2 out of 2 (2 easyconfigs in total)
node2635.swalot.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/45f2ef7fc6dacfd098afaea2543f1cbe for a full test report.

@boegel boegel merged commit 2b5134b into easybuilders:develop Oct 27, 2021
@Micket
Contributor

Micket commented Oct 27, 2021

Is the subject still accurate here? Because we should do this for all 2.x versions in the CUDA builds.

@boegel
Member

boegel commented Oct 27, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.4.1-foss-2020b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3151.skitty.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/96cde227f8a5f06b42f6c1097bb0c98f for a full test report.

@Flamefire Flamefire deleted the 20210924164325_new_pr_tensorflow branch October 29, 2021 09:02