Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.0rc7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a source tarball.
Configured with:
../configure CFLAGS="-g" --prefix=${PREFIX} --with-ucx=${UCX_PREFIX} --disable-man-pages --with-pmix=internal --with-hwloc=internal --with-libevent=internal --without-hcoll
Please describe the system on which you are running
- Operating system/version: Linux 3.10.0-514.26.2.el7.x86_64
- Computer hardware: Intel Xeon Gold 6154 x 2 (36 cores in total)
- Network type: InfiniBand EDR 4x (100Gbps)
Details of the problem
MPI_Get causes an internal error under a specific set of conditions.
The error I got:
[1660029730.372314] [sca1282:136709:0] ib_md.c:379 UCX ERROR ibv_exp_reg_mr(address=0x2afde7ec4000, length=4096, access=0xf) failed: Resource temporarily unavailable
[1660029730.372354] [sca1282:136709:0] ucp_mm.c:143 UCX ERROR failed to register address 0x2afde7ec4000 mem_type bit 0x1 length 4096 on md[4]=mlx5_0: Input/output error (md reg_mem_types 0x1)
[1660029730.372365] [sca1282:136709:0] ucp_request.c:356 UCX ERROR failed to register user buffer datatype 0x8 address 0x2afde7ec4000 len 4096: Input/output error
[sca1282:136709] ../../../../../opal/mca/common/ucx/common_ucx_wpool.h:376 Error: ucp_get_nbi failed: -3
[sca1282:00000] *** An error occurred in MPI_Get
[sca1282:00000] *** reported by process [2902982657,0]
[sca1282:00000] *** on win ucx window 3
[sca1282:00000] *** MPI_ERR_OTHER: known error not in list
[sca1282:00000] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[sca1282:00000] *** and MPI will try to terminate your MPI job as well)
Here is a minimal program that reproduces the error:
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* size_t array_size = (size_t)128 * 1024 * 1024; // OK */
    size_t array_size = (size_t)1024 * 1024 * 1024; // error

    /* size_t block_size = 1024; // OK */
    /* size_t block_size = 2048; // OK */
    size_t block_size = 4096; // error
    /* size_t block_size = 8192; // error */
    /* size_t block_size = 16384; // error */
    /* size_t block_size = 32768; // OK */
    /* size_t block_size = 65536; // OK */

    size_t local_size = array_size / nranks;

    /* char* buf = (char*)aligned_alloc(2048, array_size); // OK */
    char* buf = (char*)aligned_alloc(4096, array_size); // error

    void* baseptr;
    MPI_Win win;
    MPI_Win_allocate(local_size,
                     1,
                     MPI_INFO_NULL,
                     MPI_COMM_WORLD,
                     &baseptr,
                     &win);
    MPI_Win_lock_all(0, win);

    /* int interleave = 0; // OK */
    int interleave = 1; // error

    if (rank == 0) {
        /* rank 0 fetches every remote block into its private buffer */
        for (size_t i = 0; i < array_size / block_size; i++) {
            int target_rank;
            size_t target_disp;
            if (interleave) {
                /* round-robin: block i lives on rank i % nranks */
                target_rank = i % nranks;
                target_disp = i / nranks * block_size;
            } else {
                /* contiguous: each rank owns one consecutive chunk */
                target_rank = i * block_size / local_size;
                target_disp = i * block_size - target_rank * local_size;
            }
            if (target_rank != rank) {
                MPI_Get(buf + i * block_size,
                        block_size,
                        MPI_BYTE,
                        target_rank,
                        target_disp,
                        block_size,
                        MPI_BYTE,
                        win);
            }
        }
        MPI_Win_flush_all(win);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    free(buf);
    MPI_Finalize();
    return 0;
}
In this code, rank 0 gathers data from all other ranks into a single local array; MPI_Get is issued at the granularity of block_size.
The above error happens only when all of the following conditions hold:
- the array size (array_size) is large enough (in this case 1 GB)
- the block size (block_size) of each MPI_Get is 4096, 8192, or 16384 bytes
- the local array (buf) is aligned to 4096 bytes
- the interleave policy is used (interleave=1 means that rank 0 chooses the target rank of each MPI_Get in a round-robin fashion)
- two or more processes are spawned on different nodes (not intra-node)
Otherwise, the error did not happen in my environment.
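For scale, here is a back-of-the-envelope count (my own arithmetic from the reproducer's parameters, not output from the original run) of what the failing interleaved configuration issues:

#include <stdio.h>

/* Sketch: counts the one-sided operations the reproducer issues in the
 * failing configuration. The values (array_size = 1 GB, block_size = 4096,
 * nranks = 2, interleave = 1) are assumptions copied from the reproducer
 * above, not from the original report's output. */
int main(void) {
    size_t array_size = (size_t)1024 * 1024 * 1024;
    size_t block_size = 4096;
    size_t nranks     = 2;
    size_t nblocks    = array_size / block_size;         /* 262144 blocks */
    size_t ngets      = nblocks / nranks * (nranks - 1); /* 131072 MPI_Get calls */
    size_t stride     = nranks * block_size;             /* 8192-byte stride in buf */
    printf("blocks=%zu gets=%zu local stride=%zu bytes\n",
           nblocks, ngets, stride);
    return 0;
}

In other words, with interleave=1 rank 0 touches 131072 distinct 4 KiB regions of buf spaced 8 KiB apart, whereas with interleave=0 the same traffic reads one contiguous half of buf; this difference in the local access pattern is presumably what matters for the user-buffer registration that fails in the trace above.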
Compile the code (test.c) and run it on 2 nodes (1 process per node):
mpicc test.c
mpirun -n 2 -N 1 ./a.out
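To sweep the OK/error block sizes without recompiling, one small variation (my addition, not part of the original reproducer) is to read block_size from the command line; stdlib.h is already included, so atoll is available. Replace the hard-coded block_size line in test.c with:

    size_t block_size = (argc > 1) ? (size_t)atoll(argv[1]) : 4096;

and then, per the results noted in the source comments:

    mpirun -n 2 -N 1 ./a.out 2048   # OK
    mpirun -n 2 -N 1 ./a.out 4096   # error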