Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.0rc7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a source tarball.
Configured with:
../configure CFLAGS="-g" --prefix=${PREFIX} --with-ucx=${UCX_PREFIX} --disable-man-pages --with-pmix=internal --with-hwloc=internal --with-libevent=internal --without-hcoll
Please describe the system on which you are running
- Operating system/version: Linux 3.10.0-514.26.2.el7.x86_64
- Computer hardware: Intel Xeon Gold 6154 x 2 (36 cores in total)
- Network type: InfiniBand EDR 4x (100Gbps)
Details of the problem
MPI_Get causes an internal error under a specific set of conditions.
The error I got:
[1660029730.372314] [sca1282:136709:0] ib_md.c:379 UCX ERROR ibv_exp_reg_mr(address=0x2afde7ec4000, length=4096, access=0xf) failed: Resource temporarily unavailable
[1660029730.372354] [sca1282:136709:0] ucp_mm.c:143 UCX ERROR failed to register address 0x2afde7ec4000 mem_type bit 0x1 length 4096 on md[4]=mlx5_0: Input/output error (md reg_mem_types 0x1)
[1660029730.372365] [sca1282:136709:0] ucp_request.c:356 UCX ERROR failed to register user buffer datatype 0x8 address 0x2afde7ec4000 len 4096: Input/output error
[sca1282:136709] ../../../../../opal/mca/common/ucx/common_ucx_wpool.h:376 Error: ucp_get_nbi failed: -3
[sca1282:00000] *** An error occurred in MPI_Get
[sca1282:00000] *** reported by process [2902982657,0]
[sca1282:00000] *** on win ucx window 3
[sca1282:00000] *** MPI_ERR_OTHER: known error not in list
[sca1282:00000] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[sca1282:00000] *** and MPI will try to terminate your MPI job as well)
Here is a minimal program that reproduces the error:
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* size_t array_size = (size_t)128 * 1024 * 1024; // OK */
    size_t array_size = (size_t)1024 * 1024 * 1024; // error

    /* size_t block_size = 1024; // OK */
    /* size_t block_size = 2048; // OK */
    size_t block_size = 4096; // error
    /* size_t block_size = 8192; // error */
    /* size_t block_size = 16384; // error */
    /* size_t block_size = 32768; // OK */
    /* size_t block_size = 65536; // OK */

    size_t local_size = array_size / nranks;

    /* char* buf = (char*)aligned_alloc(2048, array_size); // OK */
    char* buf = (char*)aligned_alloc(4096, array_size); // error

    void* baseptr;
    MPI_Win win;
    MPI_Win_allocate(local_size,
                     1,
                     MPI_INFO_NULL,
                     MPI_COMM_WORLD,
                     &baseptr,
                     &win);
    MPI_Win_lock_all(0, win);

    /* int interleave = 0; // OK */
    int interleave = 1; // error

    if (rank == 0) {
        /* rank 0 fetches every remote block into its private buffer */
        for (size_t i = 0; i < array_size / block_size; i++) {
            int target_rank;
            size_t target_disp;
            if (interleave) {
                /* round-robin: block i lives on rank i % nranks */
                target_rank = i % nranks;
                target_disp = i / nranks * block_size;
            } else {
                /* contiguous: each rank owns one consecutive chunk */
                target_rank = i * block_size / local_size;
                target_disp = i * block_size - target_rank * local_size;
            }
            if (target_rank != rank) {
                MPI_Get(buf + i * block_size,
                        block_size,
                        MPI_BYTE,
                        target_rank,
                        target_disp,
                        block_size,
                        MPI_BYTE,
                        win);
            }
        }
        MPI_Win_flush_all(win);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    free(buf);
    MPI_Finalize();
    return 0;
}
In this code, rank 0 gathers data from all other ranks into a single local array; MPI_Get is issued at the granularity of block_size.
The above error happens only when all of the following conditions hold:
- the array size (array_size) is large enough (in this case 1 GB)
- the block size (block_size) of each MPI_Get is 4096, 8192, or 16384 bytes
- the local array (buf) is aligned to 4096 bytes
- the interleave policy is used (interleave=1 means that rank 0 chooses the target rank of each MPI_Get in a round-robin fashion)
- two or more processes are spawned on different nodes (not intra-node)
Otherwise, the error did not happen in my environment.
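For scale, here is a back-of-the-envelope count (my own arithmetic from the reproducer's parameters, not output from the original run) of what the failing interleaved configuration issues:

#include <stdio.h>

/* Sketch: counts the one-sided operations the reproducer issues in the
 * failing configuration. The values (array_size = 1 GB, block_size = 4096,
 * nranks = 2, interleave = 1) are assumptions copied from the reproducer
 * above, not from the original report's output. */
int main(void) {
    size_t array_size = (size_t)1024 * 1024 * 1024;
    size_t block_size = 4096;
    size_t nranks     = 2;
    size_t nblocks    = array_size / block_size;         /* 262144 blocks */
    size_t ngets      = nblocks / nranks * (nranks - 1); /* 131072 MPI_Get calls */
    size_t stride     = nranks * block_size;             /* 8192-byte stride in buf */
    printf("blocks=%zu gets=%zu local stride=%zu bytes\n",
           nblocks, ngets, stride);
    return 0;
}

In other words, with interleave=1 rank 0 touches 131072 distinct 4 KiB regions of buf spaced 8 KiB apart, whereas with interleave=0 the same traffic reads one contiguous half of buf; this difference in the local access pattern is presumably what matters for the user-buffer registration that fails in the trace above.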
Compile the code (test.c) and run it on 2 nodes (1 process per node):
mpicc test.c
mpirun -n 2 -N 1 ./a.out
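To sweep the OK/error block sizes without recompiling, one small variation (my addition, not part of the original reproducer) is to read block_size from the command line; stdlib.h is already included, so atoll is available. Replace the hard-coded block_size line in test.c with:

    size_t block_size = (argc > 1) ? (size_t)atoll(argv[1]) : 4096;

and then, per the results noted in the source comments:

    mpirun -n 2 -N 1 ./a.out 2048   # OK
    mpirun -n 2 -N 1 ./a.out 4096   # error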