Closed
Describe the bug
When I try to run an MPI application (ior), it fails with the following error messages:
[1581060992.288391] [daishan:9817 :0] select.c:438 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
Abort(1091471) on node 11 (rank 11 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
[1581060992.288448] [daishan:9814 :0] select.c:438 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288452] [daishan:9815 :0] select.c:438 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288445] [daishan:9816 :0] select.c:438 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288448] [daishan:9818 :0] select.c:438 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288446] [daishan:9820 :0] select.c:438 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
Abort(1091471) on node 13 (rank 13 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
[1581060992.288452] [daishan:9821 :0] select.c:438 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288449] [daishan:9822 :0] select.c:438 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
Abort(1091471) on node 15 (rank 15 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
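To make the repeated UCX error easier to read, here is a minimal Python sketch that groups the rejected transports by the reason UCX reports. The function name `group_rejections` and the trimmed `SAMPLE` string are illustrative and not part of UCX; the parsing assumes the message keeps the `transport/device - reason` layout seen in the log above.

```python
import re
from collections import defaultdict

# One error line from the log above, with the timestamp and PID prefix trimmed.
SAMPLE = (
    "select.c:438 UCX ERROR no active messages transport to <no debug data>: "
    "posix/memory - Destination is unreachable, "
    "sysv/memory - Destination is unreachable, "
    "self/memory - Destination is unreachable, "
    "sockcm/sockaddr - no am bcopy, "
    "rdmacm/sockaddr - no am bcopy, "
    "cma/memory - no am bcopy"
)


def group_rejections(line):
    """Return {reason: [transport/device, ...]} for one UCX error line.

    Hypothetical helper: assumes the rejection list follows the first ': '
    after 'transport to' and is separated by ', '.
    """
    match = re.search(r"transport to .*?: (.*)$", line)
    if match is None:
        return {}
    grouped = defaultdict(list)
    for item in match.group(1).split(", "):
        transport, _, reason = item.partition(" - ")
        grouped[reason.strip()].append(transport.strip())
    return dict(grouped)


if __name__ == "__main__":
    for reason, transports in group_rejections(SAMPLE).items():
        print(f"{reason}: {', '.join(transports)}")
```

Grouped this way, the message lists only shared-memory, loopback, and connection-manager entries. No network transport appears among the candidates at all, which may be a useful hint when comparing against the ucx_info output.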
Steps to Reproduce
- Command line
mpirun -n 16 -ppn 8 -f ./hostfile /home/user/Repository/io-500-dev/build/ior/src/ior -w -s 50000 -a MVFS --mvfs.sock=io500.sock -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /opt/datafiles/io500.2020.02.07-15.41.54/ior_hard/IOR_file -O stoneWallingStatusFile=/opt/datafiles/io500.2020.02.07-15.41.54/ior_hard/stonewall -O stoneWallingWearOut=1 -D 20
- UCX version used
UCX 1.4 and UCX 1.7 (I found a similar question in this repo, so I switched to UCX 1.7, but got the same errors)
- Any UCX environment variables used
No
Setup and versions
- OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
- CentOS 7.6, kernel 4.20
- For RDMA/IB/RoCE related issues:
- Driver version:
- MLNX_OFED_LINUX-4.7-1.0.0.1
- HW information
CA 'mlx5_0'
CA type: MT4117
Number of ports: 1
Firmware version: 14.26.1040
Hardware version: 0
Node GUID: 0x506b4b0300494a2a
System image GUID: 0x506b4b0300494a2a
Port 1:
State: Active
Physical state: LinkUp
Rate: 25
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x526b4bfffe494a2a
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT4119
Number of ports: 1
Firmware version: 16.26.1040
Hardware version: 0
Node GUID: 0x98039b0300855d92
System image GUID: 0x98039b0300855d92
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x9a039bfffe855d92
Link layer: Ethernet
CA 'mlx5_2'
CA type: MT4119
Number of ports: 1
Firmware version: 16.26.1040
Hardware version: 0
Node GUID: 0x98039b0300855d93
System image GUID: 0x98039b0300855d92
Port 1:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x9a039bfffe855d93
Link layer: Ethernet
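For a quick overview of the adapter listing above, the following Python sketch extracts the state, rate, and link layer of each device. The helper `summarize_ports` and the abbreviated `LISTING` string are illustrative only; the parser assumes one port per device and the `Key: value` layout shown above.

```python
# Abbreviated copy of the adapter listing above (only the fields used here).
LISTING = """\
CA 'mlx5_0'
CA type: MT4117
State: Active
Physical state: LinkUp
Rate: 25
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT4119
State: Active
Physical state: LinkUp
Rate: 100
Link layer: Ethernet
CA 'mlx5_2'
CA type: MT4119
State: Down
Physical state: Disabled
Rate: 40
Link layer: Ethernet
"""

WANTED = ("State", "Rate", "Link layer")


def summarize_ports(text):
    """Return one dict per device with its name and the WANTED fields.

    Hypothetical helper: assumes a single port per device, as in the listing.
    """
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            # A new device section starts, e.g. CA 'mlx5_0'
            current = {"name": line.split("'")[1]}
            devices.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            if key in WANTED:
                current[key] = value.strip()
    return devices


if __name__ == "__main__":
    for dev in summarize_ports(LISTING):
        print(dev["name"], dev["State"], dev["Rate"], dev["Link layer"])
```

In this listing, mlx5_0 and mlx5_1 are Active and mlx5_2 is Down, and all three ports report an Ethernet link layer.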
Additional information (depending on the issue)
- OpenMPI version
Intel MPI (Intel(R) MPI Library for Linux* OS, Version 2019 Update 6 Build 20191024)
- Output of ucx_info -d to show transports and devices recognized by UCX
ucx_info.txt.txt
At the very beginning, UCX 1.4 worked fine with Intel MPI. I ran into this error after I uninstalled OpenMPI and switched to Intel MPI. I don't know whether I removed some necessary components during this procedure. In any case, the Intel MPI installation did not warn me about anything.
Tell me if you have any ideas.
Thanks in advance.