UCX ERROR no active messages transport #4742

@yimin-zhao

Describe the bug

When I run an MPI application (ior), it fails with the following error messages:

[1581060992.288391] [daishan:9817 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
Abort(1091471) on node 11 (rank 11 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
[1581060992.288448] [daishan:9814 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288452] [daishan:9815 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288445] [daishan:9816 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288448] [daishan:9818 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288446] [daishan:9820 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
Abort(1091471) on node 13 (rank 13 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
[1581060992.288452] [daishan:9821 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288449] [daishan:9822 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
Abort(1091471) on node 15 (rank 15 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
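
The UCX error lines above are long and repetitive, so it can help to break one into its per-transport reasons. The following is a minimal sketch in Python, not part of the original report; the function names `parse_ucx_error` and `group_by_reason` are hypothetical helpers, and the regular expression assumes the line layout seen in this log only.

```python
import re

# One line copied from the log above (runs of spaces collapsed).
LINE = (
    "[1581060992.288391] [daishan:9817 :0] select.c:438 UCX ERROR "
    "no active messages transport to <no debug data>: "
    "posix/memory - Destination is unreachable, "
    "sysv/memory - Destination is unreachable, "
    "self/memory - Destination is unreachable, "
    "sockcm/sockaddr - no am bcopy, "
    "rdmacm/sockaddr - no am bcopy, "
    "cma/memory - no am bcopy"
)

# Assumed layout: [timestamp] [host:pid :tid] file:line UCX ERROR message
HEADER = re.compile(
    r"\[(?P<ts>[\d.]+)\]\s+\[(?P<host>[^:\]]+):(?P<pid>\d+)\s*:\d+\]\s+"
    r"(?P<src>\S+:\d+)\s+UCX\s+ERROR\s+(?P<msg>.*)"
)


def parse_ucx_error(line):
    """Return host, pid, summary and a {transport: reason} dict, or None."""
    match = HEADER.search(line)
    if match is None:
        return None  # not a UCX ERROR line (e.g. the MPI abort lines)
    summary, _, details = match.group("msg").partition(": ")
    reasons = {}
    for item in details.split(", "):
        name, sep, reason = item.partition(" - ")
        if sep:
            reasons[name.strip()] = reason.strip()
    return {
        "host": match.group("host"),
        "pid": int(match.group("pid")),
        "summary": summary,
        "reasons": reasons,
    }


def group_by_reason(reasons):
    """Invert {transport: reason} into {reason: [transports]}."""
    grouped = {}
    for name, reason in reasons.items():
        grouped.setdefault(reason, []).append(name)
    return grouped


if __name__ == "__main__":
    parsed = parse_ucx_error(LINE)
    for reason, names in group_by_reason(parsed["reasons"]).items():
        print(f"{reason}: {', '.join(names)}")
```

Grouped this way, the line names only six entries: `posix`, `sysv` and `self` are rejected as "Destination is unreachable", while `sockcm`, `rdmacm` and `cma` are rejected as "no am bcopy". As far as I can tell, none of the six is a network data transport for the mlx5 devices, which may be a useful hint when comparing against the `ucx_info -d` output.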

Steps to Reproduce

  • Command line
    mpirun -n 16 -ppn 8 -f ./hostfile /home/user/Repository/io-500-dev/build/ior/src/ior -w -s 50000 -a MVFS --mvfs.sock=io500.sock -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /opt/datafiles/io500.2020.02.07-15.41.54/ior_hard/IOR_file -O stoneWallingStatusFile=/opt/datafiles/io500.2020.02.07-15.41.54/ior_hard/stonewall -O stoneWallingWearOut=1 -D 20
  • UCX version used
    UCX 1.4 and UCX 1.7 (I found a similar issue in this repo, so I switched to UCX 1.7, but got the same errors)
  • Any UCX environment variables used
    No

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • CentOS 7.6, kernel 4.20
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • MLNX_OFED_LINUX-4.7-1.0.0.1
    • HW information
CA 'mlx5_0'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.26.1040
	Hardware version: 0
	Node GUID: 0x506b4b0300494a2a
	System image GUID: 0x506b4b0300494a2a
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 25
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x526b4bfffe494a2a
		Link layer: Ethernet
CA 'mlx5_1'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.26.1040
	Hardware version: 0
	Node GUID: 0x98039b0300855d92
	System image GUID: 0x98039b0300855d92
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x9a039bfffe855d92
		Link layer: Ethernet
CA 'mlx5_2'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.26.1040
	Hardware version: 0
	Node GUID: 0x98039b0300855d93
	System image GUID: 0x98039b0300855d92
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x9a039bfffe855d93
		Link layer: Ethernet
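
To see at a glance which ports in a listing like the one above are usable, the text can be reduced to a list of active devices. This is a rough sketch that assumes the indentation-based layout shown here (it looks like `ibstat` output); `parse_ports` and `active_devices` are hypothetical helper names, and `SAMPLE` is a trimmed copy of the listing, not new data. The `device:port` form is the one UCX uses for device names, for example in `UCX_NET_DEVICES`.

```python
# Trimmed copy of the HW listing above; only the fields used below are kept.
SAMPLE = """\
CA 'mlx5_0'
\tCA type: MT4117
\tPort 1:
\t\tState: Active
\t\tPhysical state: LinkUp
\t\tRate: 25
\t\tLink layer: Ethernet
CA 'mlx5_1'
\tCA type: MT4119
\tPort 1:
\t\tState: Active
\t\tPhysical state: LinkUp
\t\tRate: 100
\t\tLink layer: Ethernet
CA 'mlx5_2'
\tCA type: MT4119
\tPort 1:
\t\tState: Down
\t\tPhysical state: Disabled
\t\tRate: 40
\t\tLink layer: Ethernet
"""


def parse_ports(text):
    """Return one dict per port, holding its CA name, port number and fields."""
    ports = []
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            # Start of a new adapter section, e.g. CA 'mlx5_0'
            ca = line.split("'")[1]
            port = None
        elif line.startswith("Port ") and line.endswith(":"):
            # Port header such as "Port 1:" ("Port GUID: ..." does not end in ':')
            port = {"ca": ca, "port": int(line[5:-1])}
            ports.append(port)
        elif port is not None and ":" in line:
            key, _, value = line.partition(":")
            port[key.strip()] = value.strip()
    return ports


def active_devices(ports):
    """List ports whose State is Active, formatted as device:port."""
    return [f"{p['ca']}:{p['port']}" for p in ports if p.get("State") == "Active"]


if __name__ == "__main__":
    for p in parse_ports(SAMPLE):
        print(p["ca"], p["port"], p["State"], p["Rate"], p["Link layer"])
    print("active:", ", ".join(active_devices(parse_ports(SAMPLE))))
```

For the listing in this report, the sketch reports `mlx5_0:1` and `mlx5_1:1` as active, both with an Ethernet link layer, and `mlx5_2:1` as down.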

Additional information (depending on the issue)

  • OpenMPI version
    Intel MPI (Intel(R) MPI Library for Linux* OS, Version 2019 Update 6 Build 20191024)

  • Output of ucx_info -d to show transports and devices recognized by UCX
    ucx_info.txt.txt

At the very beginning, UCX 1.4 worked fine with Intel MPI. I hit this error after I uninstalled OpenMPI and switched to Intel MPI. I don't know whether I removed some necessary components during that procedure. In any case, the Intel MPI installation did not print any warning about it.

Tell me if you have any ideas.
Thanks in advance.
