Conversation

@boegel
Member

@boegel boegel commented Sep 23, 2020

(created using eb --new-pr)
requires easybuilders/easybuild-easyblocks#2184 + #11320 (UCX) + #11332 (hwloc)

@boegel boegel added the update label Sep 23, 2020
@boegel boegel added this to the next release (4.3.1) milestone Sep 23, 2020
@boegel
Member Author

boegel commented Sep 23, 2020

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11333 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11333 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7856

Test results coming soon (I hope)...

- notification for comment with ID 697295413 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
generoso-x-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/b62bfc33c94b6a4fd0d17ee00f8f1e75 for a full test report.

@boegel
Member Author

boegel commented Sep 23, 2020

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
node3406.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/ba213ae6dbf1b2f33bf35108c19d1e84 for a full test report.

@boegel
Member Author

boegel commented Sep 23, 2020

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
node3149.skitty.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/b1a7fec7331b23b529b5f01a34bb735b for a full test report.

@boegel
Member Author

boegel commented Sep 23, 2020

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
node2609.swalot.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 2.7.5
See https://gist.github.com/54e0c16b41bdeb8e18cf5527a0ea4afd for a full test report.

@boegel boegel added the 2020b issues & PRs related to 2020b label Sep 24, 2020
@lexming
Contributor

lexming commented Sep 25, 2020

Test report by @lexming
SUCCESS
Build succeeded for 5 out of 5 (4 easyconfigs in this PR)
node127.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/e6d314163c5aedf8d32de7f0f9056d9f for a full test report.

@lexming
Contributor

lexming commented Sep 25, 2020

Test report by @lexming
SUCCESS
Build succeeded for 5 out of 5 (4 easyconfigs in this PR)
node376.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/07f7fa869c7e57849c243ebc26b6175d for a full test report.

@easybuilders easybuilders deleted a comment from boegelbot Sep 25, 2020
Contributor

@lexming lexming left a comment

This OpenMPI is not working well on my side. A simple MPI hello world program fails to initialise OpenFabrics:

$ mpirun ./test
[node379.hydra.os:24944] [[51950,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 501
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node378
  Local device: mlx5_0
--------------------------------------------------------------------------
Hello world from processor node379.hydra.os, rank 0 out of 2 processors
Hello world from processor node378.hydra.os, rank 1 out of 2 processors

OSU-Micro-benchmarks has the same issue:

# OSU MPI Latency Test v5.6.3
# Size          Latency (us)
1024                    2.08
2048                    2.83
4096                    3.72
8192                    5.46
16384                   7.56
32768                   9.83
65536                  14.34
131072                 22.34
262144                 32.28
524288                 54.46
1048576                97.91
2097152               181.33
4194304               354.37
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node378
  Local device: mlx5_0
--------------------------------------------------------------------------
[node379.hydra.os:15539] [[38701,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501

The execution completes in both cases, but those errors are not good.
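
Because both runs still complete normally, these messages are easy to miss in an automated test. The following is a minimal sketch of how captured `mpirun` output could be scanned for them. The function name `find_mpi_startup_problems` and the regular expression are illustrative assumptions, not part of EasyBuild or Open MPI; the sample text is copied from the output above.

```python
import re

# Warning text exactly as it appears in the mpirun output quoted above.
OFA_WARNING = "WARNING: There was an error initializing an OpenFabrics device."

# Matches lines such as:
#   ... ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 501
ORTE_ERROR = re.compile(
    r"ORTE_ERROR_LOG: (?P<message>.+?) in file (?P<file>\S+) at line (?P<line>\d+)"
)


def find_mpi_startup_problems(output):
    """Return a list of human-readable problems found in captured MPI output.

    Hypothetical helper: it only looks for the two message types shown
    in this thread and does not try to cover other Open MPI diagnostics.
    """
    problems = []
    for line in output.splitlines():
        if OFA_WARNING in line:
            problems.append("OpenFabrics device initialization warning")
        match = ORTE_ERROR.search(line)
        if match:
            problems.append("ORTE error: " + match.group("message"))
    return problems


sample = """\
[node379.hydra.os:24944] [[51950,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 501
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node378
  Local device: mlx5_0
--------------------------------------------------------------------------
Hello world from processor node379.hydra.os, rank 0 out of 2 processors
Hello world from processor node378.hydra.os, rank 1 out of 2 processors
"""

for problem in find_mpi_startup_problems(sample):
    print(problem)
```

A check like this could be applied to the combined stdout and stderr of a test job, failing the test when the returned list is non-empty even though the exit code is 0.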

@terjekv
Collaborator

terjekv commented Sep 25, 2020

Started a test build on a "clean" arm box. It'll take a bit. It started building M4... The box has no toolchains. :)

@terjekv
Collaborator

terjekv commented Sep 25, 2020

Test report by @terjekv
SUCCESS
Build succeeded for 37 out of 37 (4 easyconfigs in this PR)
arm2 - Linux ubuntu 18.04, AArch64, UNKNOWN, Python 3.6.9
See https://gist.github.com/3f3bd7ad22e365787aba06bd717bc751 for a full test report.

@boegel
Member Author

boegel commented Sep 26, 2020

> This OpenMPI is not working well on my side. A simple MPI hello world program fails to initialise OpenFabrics

The problem here is that we should be configuring OpenMPI with --without-verbs when we're using UCX.
That certainly fixes the problem for me, and it's a known issue (see https://www.open-mpi.org/faq/?category=all#ofa-device-error).

Please try again with the updated OpenMPI easyblock from easybuilders/easybuild-easyblocks#2188 .
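
The rule described above (add `--without-verbs` when UCX is used) can be sketched roughly as follows. This is not the code from easybuilders/easybuild-easyblocks#2188; the function name `openmpi_configure_options`, its parameters, and the `--enable-shared` example option are assumptions made purely for illustration.

```python
def openmpi_configure_options(dependency_names, base_options=()):
    """Return configure options, adding --without-verbs when UCX is a dependency.

    Hypothetical helper: it only illustrates the decision discussed in
    this thread and is not the actual OpenMPI easyblock logic.
    """
    options = list(base_options)

    # UCX handles InfiniBand itself, so the verbs support is disabled
    # whenever UCX is among the dependencies.
    has_ucx = any(name.lower() == "ucx" for name in dependency_names)

    # Respect an explicit choice that was already made for verbs.
    verbs_already_set = any(
        opt.startswith(("--with-verbs", "--without-verbs")) for opt in options
    )

    if has_ucx and not verbs_already_set:
        options.append("--without-verbs")
    return options


print(" ".join(openmpi_configure_options(["hwloc", "UCX"], ["--enable-shared"])))
```

Leaving an explicit `--with-verbs` or `--without-verbs` untouched is a design choice in this sketch, so that a deliberate setting in an easyconfig would not be overridden silently.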

@lexming
Contributor

lexming commented Sep 26, 2020

Test report by @lexming
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
node128.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/96567ab27531f1376f4b0764aa6edc36 for a full test report.

@lexming
Contributor

lexming commented Sep 26, 2020

Test report by @lexming
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
node378.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/6314f78e9edcc6d1e3dfd75fe5257922 for a full test report.

@lexming
Contributor

lexming commented Sep 26, 2020

@boegel thanks a lot, that was indeed the issue. We have already been disabling verbs on our production system, but I was totally misled by the ORTE_ERROR_LOG: Data unpack would read past... error.

Contributor

@lexming lexming left a comment

LGTM

@boegel
Member Author

boegel commented Sep 30, 2020

@lexming So let's merge? Or do you want to see more tests?

@boegel
Member Author

boegel commented Sep 30, 2020

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
node3502.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7302P 16-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/bb1df99da5e1cdd18ba1f4d58c1b7d14 for a full test report.

@lexming
Contributor

lexming commented Sep 30, 2020

Going in, thanks @boegel !

@lexming lexming merged commit ef4e18c into easybuilders:develop Sep 30, 2020
@boegel boegel deleted the 20200923113619_new_pr_OpenMPI405 branch October 2, 2020 20:37
Labels

2020b issues & PRs related to 2020b update

4 participants