Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions mooncake-transfer-engine/include/config.h
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,8 @@ struct GlobalConfig {
size_t fragment_limit = 16384;
bool enable_dest_device_affinity = false;
EndpointStoreType endpoint_store_type = EndpointStoreType::SIEVE;
int traffic_class = -1; // -1 means not set, will be read from MC_IB_TC
int ib_pci_relaxed_ordering_mode = 2; // 0: off, 1: on if supported, 2: auto
};

void loadGlobalConfig(GlobalConfig &config);
Expand Down
26 changes: 26 additions & 0 deletions mooncake-transfer-engine/src/config.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -293,6 +293,32 @@ void loadGlobalConfig(GlobalConfig &config) {
"MC_ENDPOINT_STORE_TYPE, it should be FIFO|SIEVE";
}
}

const char *traffic_class_env = std::getenv("MC_IB_TC");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the recommended setting for this value?

if (traffic_class_env) {
int val = atoi(traffic_class_env);
if (val >= 0 && val <= 255)
config.traffic_class = val;
else
LOG(WARNING) << "Ignore value from environment variable "
"MC_IB_TC, it should be 0-255";
}

const char *ib_relaxed_ordering_env =
std::getenv("MC_IB_PCI_RELAXED_ORDERING");
if (ib_relaxed_ordering_env) {
int val = atoi(ib_relaxed_ordering_env);
if (val >= 0 && val <= 2)
config.ib_pci_relaxed_ordering_mode = val;
else
LOG(WARNING)
<< "Ignore value from environment variable "
"MC_IB_PCI_RELAXED_ORDERING, it should be 0|1|2";
}
if (traffic_class_env) {
LOG(INFO) << "[Config] traffic_class set to " << config.traffic_class
<< " via MC_IB_TC";
}
}

std::string mtuLengthToString(ibv_mtu mtu) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -405,6 +405,9 @@ int RdmaEndPoint::doSetupConnection(int qp_index, const std::string &peer_gid,
// TODO gidIndex and portNum must fetch from REMOTE
attr.ah_attr.grh.sgid_index = context_.gidIndex();
attr.ah_attr.grh.hop_limit = MAX_HOP_LIMIT;
if (globalConfig().traffic_class >= 0) {
attr.ah_attr.grh.traffic_class = globalConfig().traffic_class;
}
attr.ah_attr.dlid = peer_lid;
attr.ah_attr.sl = 0;
attr.ah_attr.src_path_bits = 0;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,8 @@
#include <set>
#include <thread>

#include <dlfcn.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure these two files are common-used in all types RDMA lib

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dlfcn.h is a POSIX standard header that provides dynamic loading functions (dlopen, dlsym, etc.).
It is part of glibc on Linux and is not tied to any specific RDMA implementation.

Also, the previous #include <infiniband/verbs.h> has been removed in the latest revision.


#include "common.h"
#include "config.h"
#include "memory_location.h"
Expand All @@ -32,7 +34,56 @@
#include "transport/rdma_transport/rdma_endpoint.h"

namespace mooncake {
RdmaTransport::RdmaTransport() {}

static bool MCIbRelaxedOrderingEnabled = false;
static int MCIbRelaxedOrderingMode = 2;

// Mode definition for MC_IB_PCI_RELAXED_ORDERING env.
// 0 - disabled, 1 - enabled if supported, 2 - auto (default, same as 1 today).
static int getIbRelaxedOrderingMode() {
int val = globalConfig().ib_pci_relaxed_ordering_mode;
if (val < 0 || val > 2) {
return 2;
}
return val;
}

// Determine whether RELAXED_ORDERING is enabled and possible
// This function checks for ibv_reg_mr_iova2 symbol which is available
// in IBVERBS_1.8 and above. The feature is only supported in IBVERBS_1.8+.
bool has_ibv_reg_mr_iova2(void) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This helper function has_ibv_reg_mr_iova2 is only used within this file. To limit its scope to this translation unit and prevent potential linkage conflicts, it's good practice to declare it as static. This aligns with the declaration of MCIbRelaxedOrderingEnabled.

Suggested change
bool has_ibv_reg_mr_iova2(void) {
static bool has_ibv_reg_mr_iova2(void) {

void *handle = dlopen("libibverbs.so", RTLD_NOW);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is too tricky for detecting the status.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces Relaxed Ordering (RO) support with two mechanisms:

1. Capability detection
RO is detected by checking for ibv_reg_mr_iova2 in the RDMA dynamic library, following NCCL’s approach. NCCL notes that static-lin can check the function directly, but MoonCake relies on dynamic linking, so the dynamic-link path is sufficient.
2. Explicit control (New)
An environment variable is added to allow users to explicitly enable or disable RO.

This approach is lightweight, reliable for our dynamic-linking model, and consistent with industry practice.

if (!handle) {
handle = dlopen("libibverbs.so.1", RTLD_NOW);
if (!handle) return false;
}

void *sym = dlsym(handle, "ibv_reg_mr_iova2");
dlclose(handle);
return sym != NULL;
}

RdmaTransport::RdmaTransport() {
MCIbRelaxedOrderingMode = getIbRelaxedOrderingMode();
if (MCIbRelaxedOrderingMode == 0) {
LOG(INFO) << "[RDMA] Relaxed ordering disabled via "
<< "MC_IB_PCI_RELAXED_ORDERING=0. "
<< "Falling back to strict ordering.";
MCIbRelaxedOrderingEnabled = false;
return;
}

MCIbRelaxedOrderingEnabled = has_ibv_reg_mr_iova2();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use this implementation as a reference?

if (MCIbRelaxedOrderingEnabled) {
LOG(INFO) << "[RDMA] Relaxed ordering is supported on this host; "
"IBV_ACCESS_RELAXED_ORDERING will be requested for "
"registered memory regions.";
} else {
LOG(INFO) << "[RDMA] Relaxed ordering is NOT supported ("
<< "ibv_reg_mr_iova2 missing or unavailable). "
<< "Falling back to strict ordering.";
}
}

RdmaTransport::~RdmaTransport() {
#ifdef CONFIG_USE_BATCH_DESC_SET
Expand Down Expand Up @@ -132,10 +183,17 @@ int RdmaTransport::registerLocalMemory(void *addr, size_t length,
bool update_metadata) {
(void)remote_accessible;
BufferDesc buffer_desc;
const static int access_rights = IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ;

const int kBaseAccessRights = IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_READ;

static int access_rights = 0;
if (access_rights == 0) {
access_rights = kBaseAccessRights;
if (MCIbRelaxedOrderingEnabled) {
access_rights |= IBV_ACCESS_RELAXED_ORDERING;
}
}
Comment on lines +191 to +196
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The one-time initialization of access_rights is not thread-safe. registerLocalMemory can be called concurrently (e.g., from registerLocalMemoryBatch), creating a data race when checking if (access_rights == 0) and then modifying it. This is undefined behavior. Please use std::call_once to ensure this initialization is performed atomically and exactly once across all threads.

Suggested change
if (access_rights == 0) {
access_rights = kBaseAccessRights;
if (MCIbRelaxedOrderingEnabled) {
access_rights |= IBV_ACCESS_RELAXED_ORDERING;
}
}
static std::once_flag access_rights_flag;
std::call_once(access_rights_flag, [&]() {
access_rights = kBaseAccessRights;
if (MCIbRelaxedOrderingEnabled) {
access_rights |= IBV_ACCESS_RELAXED_ORDERING;
}
});

bool do_pre_touch = context_list_.size() > 0 &&
std::thread::hardware_concurrency() >= 4 &&
length >= (size_t)4 * 1024 * 1024 * 1024;
Expand Down
Loading