Conversation

@jfgrimm (Member) commented May 4, 2022

Draft OpenMPI w/ GCC 11.3.0, for latest release candidate. Will update once it is released.

depends on:

(created using eb --new-pr)

….0-GCCcore-11.3.0.eb, PMIx-4.1.2-GCCcore-11.3.0.eb, UCX-1.12.1-GCCcore-11.3.0.eb, OpenMPI-4.1.4rc1-GCC-11.3.0.eb
@jfgrimm jfgrimm added the update label May 4, 2022
@jfgrimm jfgrimm marked this pull request as draft May 4, 2022 12:06
@jfgrimm jfgrimm added this to the next release (4.5.5?) milestone May 4, 2022
@SebastianAchilles (Member)

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
zen2-ubuntu-eb - Linux Ubuntu 22.04, x86_64, AMD EPYC 7452 32-Core Processor (zen2), Python 3.10.4
See https://gist.github.com/a87f21bff4de4c8c5836f142252813a4 for a full test report.

@boegel boegel changed the title {lib}[GCCcore/11.3.0] libevent v2.1.12, libfabric v1.15.0, PMIx v4.1.2, ... {lib}[GCCcore/11.3.0] OpenMPI v4.1.4, UCX v1.12.1, libfabric v1.15.0, libevent v2.1.12, , PMIx v4.1.2 May 6, 2022
@jfgrimm jfgrimm changed the title {lib}[GCCcore/11.3.0] OpenMPI v4.1.4, UCX v1.12.1, libfabric v1.15.0, libevent v2.1.12, , PMIx v4.1.2 {lib}[GCCcore/11.3.0] OpenMPI v4.1.4 May 6, 2022
@boegelbot (comment marked as outdated)

@ocaisa (Member) commented May 19, 2022

Can we consider adding the patch in #14919 (comment) and

configopts = '--with-cuda=internal'

?
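
For context, the two suggestions combined would look roughly like this in the easyconfig. This is only a sketch: the patch filename below is a placeholder (not a real file in this PR), and the usual source/dependency parameters are omitted.

```python
# Hypothetical fragment of OpenMPI-4.1.4-GCC-11.3.0.eb illustrating the
# suggestion above; the patch filename is a placeholder, not an actual file.
name = 'OpenMPI'
version = '4.1.4'

toolchain = {'name': 'GCC', 'version': '11.3.0'}

patches = [
    'OpenMPI-4.1.x_reduce-cuda-memcpy-overhead.patch',  # placeholder name
]

# Build against Open MPI's bundled CUDA headers, so the resulting build is
# CUDA-aware at runtime without requiring CUDA as a build dependency.
configopts = '--with-cuda=internal'
```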

@ocaisa (Member) commented May 19, 2022

@bartoldeman may have another patch to reduce the CUDA overhead that this might introduce (see comment from @Micket)

@boegelbot (comment marked as outdated)

@Micket (Contributor) commented May 19, 2022

I think Bart was looking to get that performance patch merged upstream, so they would likely have some opinions on how to do it and how they like to have their defines.

But of course, we don't have to care about that when patching all the versions that have already been released. I stole this from the thread (I hope it wasn't secret, Bart!):

diff -ur openmpi-4.1.1.orig/opal/datatype/opal_convertor.c openmpi-4.1.1/opal/datatype/opal_convertor.c
--- openmpi-4.1.1.orig/opal/datatype/opal_convertor.c	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_convertor.c	2022-05-06 11:40:08.698454429 -0400
@@ -41,7 +41,7 @@
 #if OPAL_CUDA_SUPPORT
 #include "opal/datatype/opal_datatype_cuda.h"
 #define MEMCPY_CUDA( DST, SRC, BLENGTH, CONVERTOR ) \
-    CONVERTOR->cbmemcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
+    opal_cuda_memcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
 #endif
 
 static void opal_convertor_construct( opal_convertor_t* convertor )
@@ -51,9 +51,6 @@
     convertor->partial_length = 0;
     convertor->remoteArch     = opal_local_arch;
     convertor->flags          = OPAL_DATATYPE_FLAG_NO_GAPS | CONVERTOR_COMPLETED;
-#if OPAL_CUDA_SUPPORT
-    convertor->cbmemcpy       = &opal_cuda_memcpy;
-#endif
 }
 
 
@@ -694,9 +691,6 @@
         destination->bConverted = source->bConverted;
         destination->stack_pos  = source->stack_pos;
     }
-#if OPAL_CUDA_SUPPORT
-    destination->cbmemcpy   = source->cbmemcpy;
-#endif
     return OPAL_SUCCESS;
 }
 
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_convertor.h openmpi-4.1.1/opal/datatype/opal_convertor.h
--- openmpi-4.1.1.orig/opal/datatype/opal_convertor.h	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_convertor.h	2022-05-06 09:59:06.242836736 -0400
@@ -118,7 +118,6 @@
     dt_stack_t                    static_stack[DT_STATIC_STACK_SIZE];  /**< local stack for small datatypes */
 
 #if OPAL_CUDA_SUPPORT
-    memcpy_fct_t                  cbmemcpy;       /**< memcpy or cuMemcpy */
     void *                        stream;         /**< CUstream for async copy */
 #endif
 };
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_datatype_cuda.c openmpi-4.1.1/opal/datatype/opal_datatype_cuda.c
--- openmpi-4.1.1.orig/opal/datatype/opal_datatype_cuda.c	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_datatype_cuda.c	2022-05-06 10:26:38.659033919 -0400
@@ -48,10 +48,6 @@
         opal_cuda_support_init();
     }
 
-    /* This is needed to handle case where convertor is not fully initialized
-     * like when trying to do a sendi with convertor on the statck */
-    convertor->cbmemcpy = (memcpy_fct_t)&opal_cuda_memcpy;
-
     /* If not enabled, then nothing else to do */
     if (!opal_cuda_enabled) {
         return;
@@ -112,20 +108,17 @@
 }
 
 /*
- * With CUDA enabled, all contiguous copies will pass through this function.
- * Therefore, the first check is to see if the convertor is a GPU buffer.
+ * With CUDA enabled, all contiguous copies will pass through opal_cuda_memcpy which
+ * calls this function. Therefore, that function checks inline to see if the convertor
+ * is a GPU buffer.
  * Note that if there is an error with any of the CUDA calls, the program
  * aborts as there is no recovering.
  */
 
-void *opal_cuda_memcpy(void *dest, const void *src, size_t size, opal_convertor_t* convertor)
+void *opal_cuda_memcpy_gpu(void *dest, const void *src, size_t size, opal_convertor_t* convertor)
 {
     int res;
 
-    if (!(convertor->flags & CONVERTOR_CUDA)) {
-        return memcpy(dest, src, size);
-    }
-
     if (convertor->flags & CONVERTOR_CUDA_ASYNC) {
         res = ftable.gpu_cu_memcpy_async(dest, (void *)src, size, convertor);
     } else {
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_datatype_cuda.h openmpi-4.1.1/opal/datatype/opal_datatype_cuda.h
--- openmpi-4.1.1.orig/opal/datatype/opal_datatype_cuda.h	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_datatype_cuda.h	2022-05-06 10:34:26.120375882 -0400
@@ -10,6 +10,8 @@
 #ifndef _OPAL_DATATYPE_CUDA_H
 #define _OPAL_DATATYPE_CUDA_H
 
+#include <string.h>
+
 /* Structure to hold CUDA support functions that gets filled in when the
  * common cuda code is initialized.  This removes any dependency on <cuda.h>
  * in the opal cuda datatype code. */
@@ -24,10 +26,24 @@
 void mca_cuda_convertor_init(opal_convertor_t* convertor, const void *pUserBuf);
 bool opal_cuda_check_bufs(char *dest, char *src);
 bool opal_cuda_check_one_buf(char *buf, opal_convertor_t *convertor );
-void* opal_cuda_memcpy(void * dest, const void * src, size_t size, opal_convertor_t* convertor);
+void* opal_cuda_memcpy_gpu(void * dest, const void * src, size_t size, opal_convertor_t* convertor);
 void* opal_cuda_memcpy_sync(void * dest, const void * src, size_t size);
 void* opal_cuda_memmove(void * dest, void * src, size_t size);
 void opal_cuda_add_initialization_function(int (*fptr)(opal_common_cuda_function_table_t *));
 void opal_cuda_set_copy_function_async(opal_convertor_t* convertor, void *stream);
 
+/*
+ * With CUDA enabled, all contiguous copies will pass through this function.
+ * Therefore, the first check is to see if the convertor is a GPU buffer.
+ */
+
+static inline void *opal_cuda_memcpy(void *dest, const void *src, size_t size, opal_convertor_t* convertor)
+{
+    if (OPAL_LIKELY(!(convertor->flags & CONVERTOR_CUDA))) {
+        return memcpy(dest, src, size);
+    }
+
+    return opal_cuda_memcpy_gpu(dest, src, size, convertor);
+}
+
 #endif
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_datatype_pack.h openmpi-4.1.1/opal/datatype/opal_datatype_pack.h
--- openmpi-4.1.1.orig/opal/datatype/opal_datatype_pack.h	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_datatype_pack.h	2022-05-06 10:29:01.375528587 -0400
@@ -21,9 +21,10 @@
 
 #if !defined(CHECKSUM) && OPAL_CUDA_SUPPORT
 /* Make use of existing macro to do CUDA style memcpy */
+#include "opal/datatype/opal_datatype_cuda.h"
 #undef MEMCPY_CSUM
 #define MEMCPY_CSUM( DST, SRC, BLENGTH, CONVERTOR ) \
-    CONVERTOR->cbmemcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
+    opal_cuda_memcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
 #endif
 
 /**
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_datatype_unpack.c openmpi-4.1.1/opal/datatype/opal_datatype_unpack.c
--- openmpi-4.1.1.orig/opal/datatype/opal_datatype_unpack.c	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_datatype_unpack.c	2022-05-06 10:16:50.547111046 -0400
@@ -41,6 +41,9 @@
 #include "opal/datatype/opal_datatype_checksum.h"
 #include "opal/datatype/opal_datatype_unpack.h"
 #include "opal/datatype/opal_datatype_prototypes.h"
+#if OPAL_CUDA_SUPPORT
+#include "opal/datatype/opal_datatype_cuda.h"
+#endif
 
 #if defined(CHECKSUM)
 #define opal_unpack_general_function            opal_unpack_general_checksum
@@ -207,7 +210,7 @@
 #if OPAL_CUDA_SUPPORT
     /* In the case where the data is being unpacked from device memory, need to
      * use the special host to device memory copy. */
-    pConvertor->cbmemcpy(saved_data, user_data, data_length, pConvertor );
+    opal_cuda_memcpy(saved_data, user_data, data_length, pConvertor );
 #else
     MEMCPY( saved_data, user_data, data_length );
 #endif
@@ -227,10 +230,10 @@
      * bytes need to be converted back to their original values. */
     {
         char resaved_data[16];
-        pConvertor->cbmemcpy(resaved_data, user_data, data_length, pConvertor );
+        opal_cuda_memcpy(resaved_data, user_data, data_length, pConvertor );
         for(size_t i = 0; i < data_length; i++ ) {
             if( unused_byte == resaved_data[i] )
-                pConvertor->cbmemcpy(&user_data[i], &saved_data[i], 1, pConvertor);
+                opal_cuda_memcpy(&user_data[i], &saved_data[i], 1, pConvertor);
         }
     }
 #else
diff -ur openmpi-4.1.1.orig/opal/datatype/opal_datatype_unpack.h openmpi-4.1.1/opal/datatype/opal_datatype_unpack.h
--- openmpi-4.1.1.orig/opal/datatype/opal_datatype_unpack.h	2021-04-24 13:28:07.000000000 -0400
+++ openmpi-4.1.1/opal/datatype/opal_datatype_unpack.h	2022-05-06 10:29:13.791484625 -0400
@@ -21,9 +21,10 @@
 
 #if !defined(CHECKSUM) && OPAL_CUDA_SUPPORT
 /* Make use of existing macro to do CUDA style memcpy */
+#include "opal/datatype/opal_datatype_cuda.h"
 #undef MEMCPY_CSUM
 #define MEMCPY_CSUM( DST, SRC, BLENGTH, CONVERTOR ) \
-    CONVERTOR->cbmemcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
+    opal_cuda_memcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
 #endif
 
 /**

@bartoldeman (Contributor)

Yes, I have a patch open here, but I need to follow through: open-mpi/ompi#10364
It takes a different approach, conditional compilation, which has even less runtime overhead.

The performance difference isn't enormous, up to 10% on very small messages with MPI_Alltoall (which I think are rare in the first place).

@Micket previously requested changes May 20, 2022
@boegelbot (comment marked as outdated)

jfgrimm added a commit to jfgrimm/easybuild-easyconfigs that referenced this pull request May 23, 2022
@jfgrimm jfgrimm marked this pull request as ready for review May 27, 2022 09:14
@jfgrimm jfgrimm requested review from Micket and boegel May 27, 2022 09:18
@jfgrimm (Member, Author) commented May 27, 2022

Test report by @jfgrimm
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
himem01.pri.viking.alces.network - Linux CentOS Linux 7.9.2009, x86_64, Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/fbf16ca89962b4fd38b9dbe3c726bf28 for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
bear-pg0105u36b.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/e36284283a9d74b61dc4d340c59bcedb for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
bear-pg0211u08b.bear.cluster - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.8.5
See https://gist.github.com/c35ee6acca6dec90ca1b6c21291755f9 for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
bear-pg0211u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/fd61eee1994f288173804cdd8f31fad8 for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
bear-pg0306u03a.bear.cluster - Linux RHEL 8.5, POWER, 8335-GTX (power9le), 4 x NVIDIA Tesla V100-SXM2-16GB, 470.57.02, Python 3.6.8
See https://gist.github.com/b3c3401be9550918d34d826d10709fdb for a full test report.

@bartoldeman (Contributor)

@boegelbot please test @ generoso

@boegelbot (Collaborator)

@bartoldeman: Request for testing this PR well received on login1

PR test command 'EB_PR=15426 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_15426 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8580

Test results coming soon (I hope)...

- notification for comment with ID 1139532313 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@bartoldeman (Contributor) left a comment

LGTM

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/3f258624db6ab74c1b60cfc1b05d6806 for a full test report.

@jfgrimm jfgrimm changed the title {lib}[GCCcore/11.3.0] OpenMPI v4.1.4 {lib}[GCC/11.3.0] OpenMPI v4.1.4 May 27, 2022
@jfgrimm jfgrimm changed the title {lib}[GCC/11.3.0] OpenMPI v4.1.4 {mpi}[GCC/11.3.0] OpenMPI v4.1.4 May 27, 2022
@boegel (Member) commented May 27, 2022

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
fair-mastodon-c6g-2xlarge-0001 - Linux rocky linux 8.5, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/4822d53bb88f5611c9b83b439a56a825 for a full test report.

@boegel (Member) commented May 27, 2022

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
node3141.skitty.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/ee0b383173a98b1935e8597367a18663 for a full test report.

@SebastianAchilles (Member)

@boegelbot please test @ jsc-zen2

@boegelbot (Collaborator)

@SebastianAchilles: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=15426 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_15426 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 1239

Test results coming soon (I hope)...

- notification for comment with ID 1139659333 processed


@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/ace17d2f17c1a6c854af660c6adc56b9 for a full test report.

@boegel boegel dismissed Micket’s stale review May 28, 2022 10:49

Autotools build dep was added

@boegel (Member) commented May 28, 2022

Going in, thanks @jfgrimm!

@boegel boegel merged commit f098026 into easybuilders:develop May 28, 2022