Conversation

@jfgrimm (Member) commented May 4, 2022

Draft OpenMPI w/ GCC 11.3.0, for latest release candidate. Will update once it is released.

depends on:

(created using eb --new-pr)

….0-GCCcore-11.3.0.eb, PMIx-4.1.2-GCCcore-11.3.0.eb, UCX-1.12.1-GCCcore-11.3.0.eb, OpenMPI-4.1.4rc1-GCC-11.3.0.eb
@jfgrimm jfgrimm added the update label May 4, 2022
@jfgrimm jfgrimm marked this pull request as draft May 4, 2022 12:06
@jfgrimm jfgrimm added this to the next release (4.5.5?) milestone May 4, 2022
@SebastianAchilles (Member)

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
zen2-ubuntu-eb - Linux Ubuntu 22.04, x86_64, AMD EPYC 7452 32-Core Processor (zen2), Python 3.10.4
See https://gist.github.com/a87f21bff4de4c8c5836f142252813a4 for a full test report.

@boegel boegel changed the title {lib}[GCCcore/11.3.0] libevent v2.1.12, libfabric v1.15.0, PMIx v4.1.2, ... {lib}[GCCcore/11.3.0] OpenMPI v4.1.4, UCX v1.12.1, libfabric v1.15.0, libevent v2.1.12, , PMIx v4.1.2 May 6, 2022
@jfgrimm jfgrimm changed the title {lib}[GCCcore/11.3.0] OpenMPI v4.1.4, UCX v1.12.1, libfabric v1.15.0, libevent v2.1.12, , PMIx v4.1.2 {lib}[GCCcore/11.3.0] OpenMPI v4.1.4 May 6, 2022
@boegelbot (comment marked as outdated)

@ocaisa (Member) commented May 19, 2022

Can we consider adding the patch in #14919 (comment) and

configopts = '--with-cuda=internal'

?
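
For context, the two suggestions combined would look roughly like this in the easyconfig. This is only a sketch: the patch filename below is a placeholder (not a real file in this PR), and the usual source/dependency parameters are omitted.

```python
# Hypothetical fragment of OpenMPI-4.1.4-GCC-11.3.0.eb illustrating the
# suggestion above; the patch filename is a placeholder, not an actual file.
name = 'OpenMPI'
version = '4.1.4'

toolchain = {'name': 'GCC', 'version': '11.3.0'}

patches = [
    'OpenMPI-4.1.x_reduce-cuda-memcpy-overhead.patch',  # placeholder name
]

# Build against Open MPI's bundled CUDA headers, so the resulting build is
# CUDA-aware at runtime without requiring CUDA as a build dependency.
configopts = '--with-cuda=internal'
```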

@ocaisa (Member) commented May 19, 2022

@bartoldeman may have another patch to reduce the CUDA overhead that this might introduce (see comment from @Micket)

@boegelbot (comment marked as outdated)

@Micket (Contributor) commented May 19, 2022

I think Bart was looking to get that performance patch merged upstream, so they would likely have some opinions on how to do it and how they like to have their defines.

But of course, we don't have to care about that when patching all the versions that have already been released. I stole this from the thread (I hope it wasn't secret, Bart!):

diff -ur openmpi-4.1.1.orig/opal/datatype/opal_convertor.c openmpi-4.1.1/opal/datatype/opal_convertor.c
--- openmpi-4.1.1.orig/opal/datatype/opal_convertor.c	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_convertor.c	2022-05-06 11:40:08.698454429 -0400
@@ -41,7 +41,7 @@
 #if OPAL_CUDA_SUPPORT
 #include "opal/datatype/opal_datatype_cuda.h"
 #define MEMCPY_CUDA( DST, SRC, BLENGTH, CONVERTOR ) \
-    CONVERTOR->cbmemcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
+    opal_cuda_memcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
 #endif
 
 static void opal_convertor_construct( opal_convertor_t* convertor )
@@ -51,9 +51,6 @@
     convertor->partial_length = 0;
     convertor->remoteArch     = opal_local_arch;
     convertor->flags          = OPAL_DATATYPE_FLAG_NO_GAPS | CONVERTOR_COMPLETED;
-#if OPAL_CUDA_SUPPORT
-    convertor->cbmemcpy       = &opal_cuda_memcpy;
-#endif
 }
 
 
@@ -694,9 +691,6 @@
         destination->bConverted = source->bConverted;
         destination->stack_pos  = source->stack_pos;
     }
-#if OPAL_CUDA_SUPPORT
-    destination->cbmemcpy   = source->cbmemcpy;
-#endif
     return OPAL_SUCCESS;
 }
 
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_convertor.h openmpi-4.1.1/opal/datatype/opal_convertor.h
--- openmpi-4.1.1.orig/opal/datatype/opal_convertor.h	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_convertor.h	2022-05-06 09:59:06.242836736 -0400
@@ -118,7 +118,6 @@
     dt_stack_t                    static_stack[DT_STATIC_STACK_SIZE];  /**< local stack for small datatypes */
 
 #if OPAL_CUDA_SUPPORT
-    memcpy_fct_t                  cbmemcpy;       /**< memcpy or cuMemcpy */
     void *                        stream;         /**< CUstream for async copy */
 #endif
 };
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_datatype_cuda.c openmpi-4.1.1/opal/datatype/opal_datatype_cuda.c
--- openmpi-4.1.1.orig/opal/datatype/opal_datatype_cuda.c	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_datatype_cuda.c	2022-05-06 10:26:38.659033919 -0400
@@ -48,10 +48,6 @@
         opal_cuda_support_init();
     }
 
-    /* This is needed to handle case where convertor is not fully initialized
-     * like when trying to do a sendi with convertor on the statck */
-    convertor->cbmemcpy = (memcpy_fct_t)&opal_cuda_memcpy;
-
     /* If not enabled, then nothing else to do */
     if (!opal_cuda_enabled) {
         return;
@@ -112,20 +108,17 @@
 }
 
 /*
- * With CUDA enabled, all contiguous copies will pass through this function.
- * Therefore, the first check is to see if the convertor is a GPU buffer.
+ * With CUDA enabled, all contiguous copies will pass through opal_cuda_memcpy which
+ * calls this function. Therefore, that function checks inline to see if the convertor
+ * is a GPU buffer.
  * Note that if there is an error with any of the CUDA calls, the program
  * aborts as there is no recovering.
  */
 
-void *opal_cuda_memcpy(void *dest, const void *src, size_t size, opal_convertor_t* convertor)
+void *opal_cuda_memcpy_gpu(void *dest, const void *src, size_t size, opal_convertor_t* convertor)
 {
     int res;
 
-    if (!(convertor->flags & CONVERTOR_CUDA)) {
-        return memcpy(dest, src, size);
-    }
-
     if (convertor->flags & CONVERTOR_CUDA_ASYNC) {
         res = ftable.gpu_cu_memcpy_async(dest, (void *)src, size, convertor);
     } else {
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_datatype_cuda.h openmpi-4.1.1/opal/datatype/opal_datatype_cuda.h
--- openmpi-4.1.1.orig/opal/datatype/opal_datatype_cuda.h	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_datatype_cuda.h	2022-05-06 10:34:26.120375882 -0400
@@ -10,6 +10,8 @@
 #ifndef _OPAL_DATATYPE_CUDA_H
 #define _OPAL_DATATYPE_CUDA_H
 
+#include <string.h>
+
 /* Structure to hold CUDA support functions that gets filled in when the
  * common cuda code is initialized.  This removes any dependency on <cuda.h>
  * in the opal cuda datatype code. */
@@ -24,10 +26,24 @@
 void mca_cuda_convertor_init(opal_convertor_t* convertor, const void *pUserBuf);
 bool opal_cuda_check_bufs(char *dest, char *src);
 bool opal_cuda_check_one_buf(char *buf, opal_convertor_t *convertor );
-void* opal_cuda_memcpy(void * dest, const void * src, size_t size, opal_convertor_t* convertor);
+void* opal_cuda_memcpy_gpu(void * dest, const void * src, size_t size, opal_convertor_t* convertor);
 void* opal_cuda_memcpy_sync(void * dest, const void * src, size_t size);
 void* opal_cuda_memmove(void * dest, void * src, size_t size);
 void opal_cuda_add_initialization_function(int (*fptr)(opal_common_cuda_function_table_t *));
 void opal_cuda_set_copy_function_async(opal_convertor_t* convertor, void *stream);
 
+/*
+ * With CUDA enabled, all contiguous copies will pass through this function.
+ * Therefore, the first check is to see if the convertor is a GPU buffer.
+ */
+
+static inline void *opal_cuda_memcpy(void *dest, const void *src, size_t size, opal_convertor_t* convertor)
+{
+    if (OPAL_LIKELY(!(convertor->flags & CONVERTOR_CUDA))) {
+        return memcpy(dest, src, size);
+    }
+
+    return opal_cuda_memcpy_gpu(dest, src, size, convertor);
+}
+
 #endif
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_datatype_pack.h openmpi-4.1.1/opal/datatype/opal_datatype_pack.h
--- openmpi-4.1.1.orig/opal/datatype/opal_datatype_pack.h	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_datatype_pack.h	2022-05-06 10:29:01.375528587 -0400
@@ -21,9 +21,10 @@
 
 #if !defined(CHECKSUM) && OPAL_CUDA_SUPPORT
 /* Make use of existing macro to do CUDA style memcpy */
+#include "opal/datatype/opal_datatype_cuda.h"
 #undef MEMCPY_CSUM
 #define MEMCPY_CSUM( DST, SRC, BLENGTH, CONVERTOR ) \
-    CONVERTOR->cbmemcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
+    opal_cuda_memcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
 #endif
 
 /**
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_datatype_unpack.c openmpi-4.1.1/opal/datatype/opal_datatype_unpack.c
--- openmpi-4.1.1.orig/opal/datatype/opal_datatype_unpack.c	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_datatype_unpack.c	2022-05-06 10:16:50.547111046 -0400
@@ -41,6 +41,9 @@
 #include "opal/datatype/opal_datatype_checksum.h"
 #include "opal/datatype/opal_datatype_unpack.h"
 #include "opal/datatype/opal_datatype_prototypes.h"
+#if OPAL_CUDA_SUPPORT
+#include "opal/datatype/opal_datatype_cuda.h"
+#endif
 
 #if defined(CHECKSUM)
 #define opal_unpack_general_function            opal_unpack_general_checksum
@@ -207,7 +210,7 @@
 #if OPAL_CUDA_SUPPORT
     /* In the case where the data is being unpacked from device memory, need to
      * use the special host to device memory copy. */
-    pConvertor->cbmemcpy(saved_data, user_data, data_length, pConvertor );
+    opal_cuda_memcpy(saved_data, user_data, data_length, pConvertor );
 #else
     MEMCPY( saved_data, user_data, data_length );
 #endif
@@ -227,10 +230,10 @@
      * bytes need to be converted back to their original values. */
     {
         char resaved_data[16];
-        pConvertor->cbmemcpy(resaved_data, user_data, data_length, pConvertor );
+        opal_cuda_memcpy(resaved_data, user_data, data_length, pConvertor );
         for(size_t i = 0; i < data_length; i++ ) {
             if( unused_byte == resaved_data[i] )
-                pConvertor->cbmemcpy(&user_data[i], &saved_data[i], 1, pConvertor);
+                opal_cuda_memcpy(&user_data[i], &saved_data[i], 1, pConvertor);
         }
     }
 #else
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_datatype_unpack.h openmpi-4.1.1/opal/datatype/opal_datatype_unpack.h
--- openmpi-4.1.1.orig/opal/datatype/opal_datatype_unpack.h	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_datatype_unpack.h	2022-05-06 10:29:13.791484625 -0400
@@ -21,9 +21,10 @@
 
 #if !defined(CHECKSUM) && OPAL_CUDA_SUPPORT
 /* Make use of existing macro to do CUDA style memcpy */
+#include "opal/datatype/opal_datatype_cuda.h"
 #undef MEMCPY_CSUM
 #define MEMCPY_CSUM( DST, SRC, BLENGTH, CONVERTOR ) \
-    CONVERTOR->cbmemcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
+    opal_cuda_memcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
 #endif
 
 /**

@bartoldeman (Contributor)

Yes, I have a patch open here, but I need to follow through: open-mpi/ompi#10364
It takes a different approach, conditional compilation, which has even less runtime overhead.

The performance difference isn't enormous, up to 10% on very small messages with MPI_Alltoall (which I think are rare in the first place).

@Micket previously requested changes May 20, 2022
@boegelbot (comment marked as outdated)

jfgrimm added a commit to jfgrimm/easybuild-easyconfigs that referenced this pull request May 23, 2022
@jfgrimm jfgrimm marked this pull request as ready for review May 27, 2022 09:14
@jfgrimm jfgrimm requested review from Micket and boegel May 27, 2022 09:18
@jfgrimm (Member, Author) commented May 27, 2022

Test report by @jfgrimm
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
himem01.pri.viking.alces.network - Linux CentOS Linux 7.9.2009, x86_64, Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/fbf16ca89962b4fd38b9dbe3c726bf28 for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
bear-pg0105u36b.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/e36284283a9d74b61dc4d340c59bcedb for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
bear-pg0211u08b.bear.cluster - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.8.5
See https://gist.github.com/c35ee6acca6dec90ca1b6c21291755f9 for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
bear-pg0211u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/fd61eee1994f288173804cdd8f31fad8 for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
bear-pg0306u03a.bear.cluster - Linux RHEL 8.5, POWER, 8335-GTX (power9le), 4 x NVIDIA Tesla V100-SXM2-16GB, 470.57.02, Python 3.6.8
See https://gist.github.com/b3c3401be9550918d34d826d10709fdb for a full test report.

@bartoldeman (Contributor)

@boegelbot please test @ generoso

@boegelbot (Collaborator)

@bartoldeman: Request for testing this PR well received on login1

PR test command 'EB_PR=15426 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_15426 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8580

Test results coming soon (I hope)...

- notification for comment with ID 1139532313 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@bartoldeman (Contributor) left a comment

LGTM

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/3f258624db6ab74c1b60cfc1b05d6806 for a full test report.

@jfgrimm jfgrimm changed the title {lib}[GCCcore/11.3.0] OpenMPI v4.1.4 {lib}[GCC/11.3.0] OpenMPI v4.1.4 May 27, 2022
@jfgrimm jfgrimm changed the title {lib}[GCC/11.3.0] OpenMPI v4.1.4 {mpi}[GCC/11.3.0] OpenMPI v4.1.4 May 27, 2022
@boegel (Member) commented May 27, 2022

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
fair-mastodon-c6g-2xlarge-0001 - Linux rocky linux 8.5, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/4822d53bb88f5611c9b83b439a56a825 for a full test report.

@boegel (Member) commented May 27, 2022

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
node3141.skitty.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/ee0b383173a98b1935e8597367a18663 for a full test report.

@SebastianAchilles (Member)

@boegelbot please test @ jsc-zen2

@boegelbot (Collaborator)

@SebastianAchilles: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=15426 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_15426 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 1239

Test results coming soon (I hope)...

- notification for comment with ID 1139659333 processed


@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/ace17d2f17c1a6c854af660c6adc56b9 for a full test report.

@boegel boegel dismissed Micket’s stale review May 28, 2022 10:49

Autotools build dep was added

@boegel (Member) commented May 28, 2022

Going in, thanks @jfgrimm!

@boegel boegel merged commit f098026 into easybuilders:develop May 28, 2022